What happened:
Running the conformance suite (v1.5.1, GATEWAY-HTTP profile) against cloudflare-tunnel-gateway-controller (a Gateway API implementation for Cloudflare Tunnel, where each request and its mirror copy both traverse the Cloudflare edge), the HTTPRouteRequestPercentageMirror test intermittently logs a failed distribution check before passing within its retry budget. The failing subcase is consistently /percent-mirror (the 20% mirror); I have observed mirrored counts as low as 78 and as high as 116 against the [85, 115] band. The suite ultimately passes because the check is attempted up to numDistributionChecks = 5 times and returns on the first success, so this is log noise rather than a hard failure. It is nevertheless caused by the test's tolerance being statistically too tight at the lowest mirror percentage, independent of the implementation under test.
What you expected to happen:
An implementation that mirrors at exactly the configured percentage should pass the distribution check reliably on every subcase (a per-attempt flake rate of about 0.1% or lower), the way the 50% subcase already does.
How to reproduce it (as minimally and precisely as possible):
This reproduces by analysis, independent of any implementation, because the check is a binomial sampling test.
The test sends totalRequests = 500 requests and compares the mirrored count to a relative ±tolerancePercentage = 15% band around the expected count (conformance/tests/httproute-request-percentage-mirror.go). Mirroring is Bernoulli, so the count is Binomial(500, p) with σ = √(n·p·(1−p)). The relative tolerance scales linearly with p, but σ scales with √(p(1−p)), so the band narrows in σ-units as p drops:
| Subcase | p | expected μ | σ | band (±15%) | half-width in σ | P(outside) per check |
| --- | --- | --- | --- | --- | --- | --- |
| /percent-mirror | 20% | 100 | 8.94 | [85, 115] | ±1.68σ | ~8–9% |
| /percent-mirror-and-modify-headers | 35% | 175 | 10.67 | [148.75, 201.25] | ±2.46σ | ~1.4% |
| /percent-mirror-fraction | 50% | 250 | 11.18 | [212.5, 287.5] | ±3.35σ | ~0.08% |
So even a perfect implementation flakes ~8–9% per distribution check on the 20% subcase. The 5-attempt budget hides it at the suite level (P(all 5 fail) ≈ 0.085⁵ ≈ 4·10⁻⁶), but roughly every 12th run logs a failed attempt, and a run can send up to 2500 requests in total for that subcase.
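The per-check probabilities can be reproduced without any cluster. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the band is inclusive at both ends and that each request is mirrored independently with probability p.

```python
from math import ceil, exp, floor, lgamma, log, sqrt


def binom_pmf(n: int, k: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space to avoid overflow."""
    log_pmf = (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )
    return exp(log_pmf)


def p_outside_band(n: int, mirror_pct: int, tolerance_pct: int) -> float:
    """Probability that an ideal mirror lands outside expected +/- tolerance_pct %."""
    expected = n * mirror_pct / 100
    tol = expected * tolerance_pct / 100
    # Assumption: a count exactly on the band edge passes (inclusive band).
    lo, hi = ceil(expected - tol), floor(expected + tol)
    inside = sum(binom_pmf(n, k, mirror_pct / 100) for k in range(lo, hi + 1))
    return 1.0 - inside


def band_half_width_in_sigma(n: int, mirror_pct: int, tolerance_pct: int) -> float:
    """Half-width of the relative band expressed in binomial standard deviations."""
    p = mirror_pct / 100
    return (n * p * tolerance_pct / 100) / sqrt(n * p * (1 - p))


# Constants as quoted in this issue: totalRequests = 500, tolerancePercentage = 15.
for pct in (20, 35, 50):
    width = band_half_width_in_sigma(500, pct, 15)
    flake = p_outside_band(500, pct, 15)
    print(f"p={pct}%  half-width={width:.2f} sigma  P(outside)={flake:.4%}")
```

The exact binomial sum is used rather than a normal approximation because the band edges fall between integers for the 35% and 50% subcases, which shifts the tail probabilities slightly.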
Anything else we need to know?:
Ruling out a lossy mirror leg in cloudflare-tunnel-gateway-controller (the mirror copy traverses the Cloudflare edge, so dropped mirrors were a plausible worry): the observed overshoot (116 > 115) cannot come from dropped mirrors — a drop can only ever undercount. The deviations are symmetric (below 85 and above 115), which is the signature of sampling variance, not systematic loss.
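To illustrate the symmetry argument, the sketch below compares the two tail probabilities for an ideal 20% mirror against a mirror that also drops copies. The 5% drop rate is a purely hypothetical value chosen for illustration, not something measured; drops are assumed independent of the mirror decision, so the lossy count is Binomial(500, 0.2 · (1 − drop rate)).

```python
from math import exp, lgamma, log


def binom_pmf(n: int, k: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space."""
    log_pmf = (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )
    return exp(log_pmf)


def tail_probs(n: int, p: float, lo: int, hi: int) -> tuple[float, float]:
    """Return (P(X < lo), P(X > hi)) for X ~ Binomial(n, p)."""
    below = sum(binom_pmf(n, k, p) for k in range(0, lo))
    above = sum(binom_pmf(n, k, p) for k in range(hi + 1, n + 1))
    return below, above


N, P, LO, HI = 500, 0.20, 85, 115
DROP_RATE = 0.05  # hypothetical loss on the mirror leg, for illustration only

ideal_below, ideal_above = tail_probs(N, P, LO, HI)
lossy_below, lossy_above = tail_probs(N, P * (1 - DROP_RATE), LO, HI)

print(f"ideal: P(<{LO})={ideal_below:.4f}  P(>{HI})={ideal_above:.4f}")
print(f"lossy: P(<{LO})={lossy_below:.4f}  P(>{HI})={lossy_above:.4f}")
```

Under these assumptions the ideal mirror misses the band about equally often on each side, while the lossy mirror misses almost exclusively on the low side. Overshoots occurring about as often as undershoots therefore point at sampling variance.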
Suggested fix — decouple the tolerance from a flat relative percentage. Options, roughly in order of preference:
- Derive the band from the binomial σ, e.g. μ ± k·√(n·p(1−p)) with k ≈ 3.5 (≈ 4·10⁻⁴ per-check flake across all three subcases).
- Or raise totalRequests so the 20% subcase clears ~3σ at ±15% (needs roughly n ≥ 1600).
- Or relax tolerancePercentage for the low-percentage subcase specifically.
Each keeps the test sensitive to a genuinely wrong distribution while removing the sampling flake. The test is already Provisional; I'm happy to send a PR once a direction is agreed.
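As a rough sketch of the arithmetic behind the first two options (in Python for brevity, not the Go the conformance test is written in; both function names are hypothetical): the σ-derived band is μ ± k·σ, and the request count needed for a relative tolerance t to span k standard deviations follows from t·n·p ≥ k·√(n·p·(1−p)), i.e. n ≥ k²·(1−p) / (t²·p).

```python
from math import sqrt


def sigma_band(n: int, p: float, k: float = 3.5) -> tuple[float, float]:
    """Acceptance band mu +/- k*sigma for a Binomial(n, p) mirrored count."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return mu - k * sigma, mu + k * sigma


def min_requests(p: float, tolerance: float, k: float = 3.0) -> float:
    """Smallest n (as a float) for which tolerance * n * p >= k * sigma."""
    return k * k * (1 - p) / (tolerance * tolerance * p)


for p in (0.20, 0.35, 0.50):
    lo, hi = sigma_band(500, p)
    print(f"p={p:.2f}: band=[{lo:.1f}, {hi:.1f}]  "
          f"n for 3 sigma at 15%: {min_requests(p, 0.15):.0f}")
```

For the 20% subcase at n = 500 and k = 3.5 this gives a band of roughly [68.7, 131.3], and `min_requests(0.2, 0.15)` evaluates to 1600, matching the estimate in the second option. The first option widens the band at low p, while the second keeps the ±15% band and pays in test duration.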
Constants are identical on main at time of writing (tolerancePercentage = 15.0, totalRequests = 500.0, numDistributionChecks = 5).