
avoid per-frame cleanup closures on read to remove 2 allocs per frame read #565

Merged
kylecarbs merged 1 commit into coder:master from mitchellh:read-cleanup-closures
Jun 15, 2026

Conversation

@mitchellh

Contributor

Replace the per-frame cleanup closure returned by prepareRead with a normal finishRead method.

The closure captured the connection, context, and caller’s named error result. Escape analysis showed that both the closure and error result moved to the heap on every frame header and payload read. The new method preserves timeout cleanup and error mapping without those allocations.

This is admittedly a micro-optimization, but I am hunting down every allocation win I can get in the direct request path for a project I'm working on, and I judged this fix to be maintainable and easy to understand.

AI disclosure: I used AI to help find and fix this, but fully understand the problem myself and reviewed the code. The final shape contains manual adaptations, too.

Benchmarks

This change removes 2 allocations and 64 bytes from each frame read:

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| Header, background context | 170.5 ns, 216 B, 6 allocs | 118.5 ns, 152 B, 4 allocs | -30.5% time |
| Header, cancelable context | 211.1 ns, 216 B, 6 allocs | 158.7 ns, 152 B, 4 allocs | -24.8% time |
| BenchmarkConn/disabledCompress | 3.875 µs, 1536 B, 42 allocs | 3.657 µs, 1216 B, 32 allocs | -5.63% time, -20.83% bytes, -23.81% allocs |

The focused frame benchmarks used a temporary in-package harness and are not included in this change. I didn't think you'd want them in the repo because they measure such a micro-optimization.

Every frame header and payload read called prepareRead, which returned a
cleanup function to clear the timeout and translate close or cancellation
errors. That function captured the context, connection, and the address of
the caller's named error result. Because prepareRead returned the function,
the closure outlived its stack frame and Go allocated its captured state on
the heap for every frame read.

Move the cleanup logic to a normal finishRead method and defer a direct
method call instead. This preserves timeout cleanup and error translation
without returning a closure. Compiler escape analysis with -gcflags=-m=2
confirms that the old function literal escaped and forced the named error
result in readFrameHeader and readFramePayload onto the heap; neither escape
remains after this change.

The results below compare parent d099e16 with this commit on an Apple
M4 Max using GOMAXPROCS=1 and benchstat over 10 samples. A temporary
in-package harness, not included in this commit, repeatedly called one
internal frame read. Header reads parse a minimal frame header; payload reads
copy 512 bytes from a buffered repeating reader. The background cases use
context.Background, while the cancelable cases use an uncanceled
context.WithCancel.

Removing the escaping closure eliminates 2 allocations and 64 bytes from
every frame read: one allocation for the closure environment and one for the
named error result retained by that closure. Header reads with a background
context improve from 170.5 to 118.5 ns/op, 216 to 152 B/op, and 6 to
4 allocs/op. Header reads with a cancelable context improve from 211.1 to
158.7 ns/op with the same allocation reduction. Payload reads remove the
same fixed overhead; they remain slower because the benchmark also copies
512 bytes.
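
The harness used for these numbers is not part of the commit. As a rough illustration of how a per-call allocation difference like this can be observed, the sketch below uses `testing.AllocsPerRun` on two hypothetical cleanup shapes. All names here (`cleanupClosure`, `cleanupDirect`, `readWithClosure`, `readDirect`, `measure`) are invented for the example, and the exact counts depend on the compiler version, so no specific output is claimed.

```go
package main

import (
	"errors"
	"fmt"
	"testing"
)

var errMapped = errors.New("mapped")

// cleanupClosure returns a closure that captures errp. The noinline
// directive prevents the call from being inlined, so the returned
// closure has to be allocated on the heap.
//
//go:noinline
func cleanupClosure(errp *error) func() {
	return func() {
		if *errp != nil {
			*errp = errMapped
		}
	}
}

// cleanupDirect does the same work without returning a function value.
//
//go:noinline
func cleanupDirect(errp *error) {
	if *errp != nil {
		*errp = errMapped
	}
}

func readWithClosure() (err error) {
	defer cleanupClosure(&err)()
	return nil
}

func readDirect() (err error) {
	defer cleanupDirect(&err)
	return nil
}

// measure reports the average allocations per call for both variants.
func measure() (closure, direct float64) {
	closure = testing.AllocsPerRun(1000, func() { _ = readWithClosure() })
	direct = testing.AllocsPerRun(1000, func() { _ = readDirect() })
	return closure, direct
}

func main() {
	closure, direct := measure()
	fmt.Printf("closure: %.0f allocs/call, direct: %.0f allocs/call\n", closure, direct)
}
```

Running the same functions under `go build -gcflags=-m=2` shows which values the compiler moves to the heap, which is the check the commit message relies on.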

Both context types benefit because this commit does not change timeout
registration. It only removes cleanup allocations made after every
prepareRead call. An interleaved 12-sample
BenchmarkConn/disabledCompress run, which exercises complete message reads,
improves from 3.875 to 3.657 us/op (-5.63%), 42 to 32 allocs/op (-23.81%),
and 1536 to 1216 B/op (-20.83%).

kylecarbs merged commit 7039364 into coder:master on Jun 15, 2026
4 checks passed
mitchellh added a commit to mitchellh/websocket-fork that referenced this pull request Jun 15, 2026
coder#565

Every frame read registered a context.AfterFunc callback, even when the
context could not be canceled, and returned a cleanup closure that
forced the caller's error and captured state onto the heap. Skip
timeout setup for contexts with a nil Done channel and move read
cleanup into a direct method call, while tracking whether a callback
was installed so cancelable operations retain the same
close-on-cancellation behavior. Writes use the same background-context
fast path, and BenchmarkConn now joins its writer goroutine so repeated
runs finish cleanly.

On an Apple M4 Max with GOMAXPROCS=1, a 10-sample in-package
benchmark against d099e16 reduced background header reads from
168.36 to 22.85 ns/op and 512-byte payload reads from 175.04 to
26.65 ns/op. Both fell from 216 B/op and 6 allocs/op to zero.
Cancelable header reads improved from 203.40 to 160.50 ns/op and
payload reads from 215.18 to 162.52 ns/op, with both falling to
152 B/op and 4 allocs/op. Comparing against the closure-only change
confirms that skipping the dead timeout registration accounts for
exactly 152 bytes and 4 allocations on background reads without
changing cancelable allocations.

BenchmarkConn/disabledCompress averaged 3.968 us/op, 1539 B/op, and 42
allocs/op before the change and 3.817 us/op, 1219 B/op, and 32 allocs/op
after it. This is a 3.8 percent time reduction, 20.8 percent fewer
bytes, and 23.8 percent fewer allocations per operation.

2 participants