avoid per-frame cleanup closures on read to remove 2 allocs per frame read by mitchellh · Pull Request #565 · coder/websocket

mitchellh · 2026-06-15T02:47:13Z

Replace the per-frame cleanup closure returned by prepareRead with a normal finishRead method.

The closure captured the connection, context, and caller’s named error result. Escape analysis showed that both the closure and error result moved to the heap on every frame header and payload read. The new method preserves timeout cleanup and error mapping without those allocations.

This is admittedly a micro optimization but I decided to hunt out any allocation wins I can get in the direct request path for a project I'm working on, and assessed that the fixes were maintainable and easy to understand.

AI disclosure: I used AI to help find and fix this, but fully understand the problem myself and reviewed the code. The final shape contains manual adaptations, too.

Benchmarks

Each frame read removes 2 allocations and 64 bytes:

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| Header, background context | 170.5 ns, 216 B, 6 allocs | 118.5 ns, 152 B, 4 allocs | -30.5% time |
| Header, cancelable context | 211.1 ns, 216 B, 6 allocs | 158.7 ns, 152 B, 4 allocs | -24.8% time |
| `BenchmarkConn/disabledCompress` | 3.875 µs, 1536 B, 42 allocs | 3.657 µs, 1216 B, 32 allocs | -5.63% time, -20.83% bytes, -23.81% allocs |

The focused frame benchmarks used a temporary in-package harness and are not included in this change. I didn't think you'd want them in the repo because they're such a micro optimization.

Every frame header and payload read called prepareRead, which returned a cleanup function to clear the timeout and translate close or cancellation errors. That function captured the context, connection, and the address of the caller's named error result. Because prepareRead returned the function, the closure outlived its stack frame and Go allocated its captured state on the heap for every frame read.

Move the cleanup logic to a normal finishRead method and defer a direct method call instead. This preserves timeout cleanup and error translation without returning a closure. Compiler escape analysis with -gcflags=-m=2 confirms that the old function literal escaped and forced the named error result in readFrameHeader and readFramePayload onto the heap; neither escape remains after this change.

The results below compare parent d099e16 with this commit on an Apple M4 Max using GOMAXPROCS=1 and benchstat over 10 samples. A temporary in-package harness, not included in this commit, repeatedly called one internal frame read. Header reads parse a minimal frame header; payload reads copy 512 bytes from a buffered repeating reader. The background cases use context.Background, while the cancelable cases use an uncanceled context.WithCancel.

Removing the escaping closure eliminates 2 allocations and 64 bytes from every frame read: one allocation for the closure environment and one for the named error result retained by that closure.

Header reads with a background context improve from 170.5 to 118.5 ns/op, 216 to 152 B/op, and 6 to 4 allocs/op. Header reads with a cancelable context improve from 211.1 to 158.7 ns/op with the same allocation reduction. Payload reads remove the same fixed overhead; they remain slower because the benchmark also copies 512 bytes. Both context types benefit because this commit does not change timeout registration. It only removes cleanup allocations made after every prepareRead call.

An interleaved 12-sample BenchmarkConn/disabledCompress run, which exercises complete message reads, improves from 3.875 to 3.657 us/op (-5.63%), 42 to 32 allocs/op (-23.81%), and 1536 to 1216 B/op (-20.83%).

coder#565 Every frame read registered a context.AfterFunc callback, even when the context could not be canceled, and returned a cleanup closure that forced the caller's error and captured state onto the heap.

Skip timeout setup for contexts with a nil Done channel and move read cleanup into a direct method call, while tracking whether a callback was installed so cancelable operations retain the same close-on-cancellation behavior. Writes use the same background-context fast path, and BenchmarkConn now joins its writer goroutine so repeated runs finish cleanly.

On an Apple M4 Max with GOMAXPROCS=1, a 10-sample in-package benchmark against d099e16 reduced background header reads from 168.36 to 22.85 ns/op and 512-byte payload reads from 175.04 to 26.65 ns/op. Both fell from 216 B/op and 6 allocs/op to zero. Cancelable header reads improved from 203.40 to 160.50 ns/op and payload reads from 215.18 to 162.52 ns/op, with both falling to 152 B/op and 4 allocs/op. Comparing against the closure-only change confirms that skipping the dead timeout registration accounts for exactly 152 bytes and 4 allocations on background reads without changing cancelable allocations.

BenchmarkConn/disabledCompress averaged 3.968 us/op, 1539 B/op, and 42 allocs/op before the change and 3.817 us/op, 1219 B/op, and 32 allocs/op after it. This is a 3.8 percent time reduction, 20.8 percent fewer bytes, and 23.8 percent fewer allocations per operation.

mitchellh mentioned this pull request Jun 15, 2026

conn: skip timeout callbacks for background contexts #566

Merged

kylecarbs approved these changes Jun 15, 2026

kylecarbs merged commit 7039364 into coder:master Jun 15, 2026
4 checks passed

kylecarbs merged 1 commit into coder:master from mitchellh:read-cleanup-closures