avoid per-frame cleanup closures on read to remove 2 allocs per frame read#565
Merged
Merged
Conversation
Every frame header and payload read called prepareRead, which returned a cleanup function to clear the timeout and translate close or cancellation errors. That function captured the context, connection, and the address of the caller's named error result. Because prepareRead returned the function, the closure outlived its stack frame and Go allocated its captured state on the heap for every frame read. Move the cleanup logic to a normal finishRead method and defer a direct method call instead. This preserves timeout cleanup and error translation without returning a closure. Compiler escape analysis with -gcflags=-m=2 confirms that the old function literal escaped and forced the named error result in readFrameHeader and readFramePayload onto the heap; neither escape remains after this change. The results below compare parent d099e16 with this commit on an Apple M4 Max using GOMAXPROCS=1 and benchstat over 10 samples. A temporary in-package harness, not included in this commit, repeatedly called one internal frame read. Header reads parse a minimal frame header; payload reads copy 512 bytes from a buffered repeating reader. The background cases use context.Background, while the cancelable cases use an uncanceled context.WithCancel. Removing the escaping closure eliminates 2 allocations and 64 bytes from every frame read: one allocation for the closure environment and one for the named error result retained by that closure. Header reads with a background context improve from 170.5 to 118.5 ns/op, 216 to 152 B/op, and 6 to 4 allocs/op. Header reads with a cancelable context improve from 211.1 to 158.7 ns/op with the same allocation reduction. Payload reads remove the same fixed overhead; they remain slower because the benchmark also copies 512 bytes. Both context types benefit because this commit does not change timeout registration. It only removes cleanup allocations made after every prepareRead call. An interleaved 12-sample BenchmarkConn/disabledCompress run, which exercises complete message reads, improves from 3.875 to 3.657 us/op (-5.63%), 42 to 32 allocs/op (-23.81%), and 1536 to 1216 B/op (-20.83%).
kylecarbs
approved these changes
Jun 15, 2026
mitchellh
added a commit
to mitchellh/websocket-fork
that referenced
this pull request
Jun 15, 2026
coder#565 Every frame read registered a context.AfterFunc callback, even when the context could not be canceled, and returned a cleanup closure that forced the caller's error and captured state onto the heap. Skip timeout setup for contexts with a nil Done channel and move read cleanup into a direct method call, while tracking whether a callback was installed so cancelable operations retain the same close-on-cancellation behavior. Writes use the same background-context fast path, and BenchmarkConn now joins its writer goroutine so repeated runs finish cleanly. On an Apple M4 Max with GOMAXPROCS=1, a 10-sample in-package benchmark against d099e16 reduced background header reads from 168.36 to 22.85 ns/op and 512-byte payload reads from 175.04 to 26.65 ns/op. Both fell from 216 B/op and 6 allocs/op to zero. Cancelable header reads improved from 203.40 to 160.50 ns/op and payload reads from 215.18 to 162.52 ns/op, with both falling to 152 B/op and 4 allocs/op. Comparing against the closure-only change confirms that skipping the dead timeout registration accounts for exactly 152 bytes and 4 allocations on background reads without changing cancelable allocations. BenchmarkConn/disabledCompress averaged 3.968 us/op, 1539 B/op, and 42 allocs/op before the change and 3.817 us/op, 1219 B/op, and 32 allocs/op after it. This is a 3.8 percent time reduction, 20.8 percent fewer bytes, and 23.8 percent fewer allocations per operation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replace the per-frame cleanup closure returned by
prepareReadwith a normalfinishReadmethod.The closure captured the connection, context, and caller’s named error result. Escape analysis showed that both the closure and error result moved to the heap on every frame header and payload read. The new method preserves timeout cleanup and error mapping without those allocations.
This is admittedly a micro optimization but I decided to hunt out any allocation wins I can get in the direct request path for a project I'm working on, and assessed that the fixes were maintainable and easy to understand.
AI disclosure: I used AI to help find and fix this, but fully understand the problem myself and reviewed the code. The final shape contains manual adaptations, too.
Benchmarks
Each frame read removes 2 allocations and 64 bytes:
BenchmarkConn/disabledCompressThe focused frame benchmarks used a temporary in-package harness and are not included in this change. I didn't think you'd want them in the repo cause they're such a micro optimization.