Java route harvesting can freeze target JVM while dumping symbol table #2304

@MrAlias

Description

Summary

Java route harvesting currently uses dynamic attach to run jcmd <PID> VM.symboltable -verbose against the target JVM:

  • pkg/internal/transform/route/harvest/java.go

For JVMs with very large symbol tables, this can trigger a long stop-the-world safepoint in the target application. In #2301, the reporter observed application freezes lasting 20-100 seconds while jcmd <PID> VM.symboltable -verbose printed hundreds of thousands of lines.

PR #2303 disables Java route harvesting by default as an immediate mitigation, but we should track the underlying production-safety issue separately.

Impact

This affects the instrumented Java application, not just the OBI agent.

A route discovery attempt can make the target JVM unresponsive for tens of seconds. The current route_harvester_timeout only limits how long OBI waits for the extraction result. It does not prevent or cancel JVM-side work once the diagnostic command has triggered a safepoint.

Related Work

#2034 is related but narrower. It tracks that Java route extraction is not truly cancellable from OBI and can retain blocked workers after timeout. This issue tracks the target-JVM safety problem: the diagnostic command itself can pause the application.

Evidence

The reported safepoint log includes DumpHashtable and a long time at safepoint while dumping the symbol table. The Java route harvester currently extracts routes from the output of:

jcmd <PID> VM.symboltable -verbose

That command can scale poorly with symbol table size and can pause the JVM during collection.

Suggested Direction

Short term:

  • Keep Java route harvesting disabled by default.
  • Document that enabling Java route harvesting can pause the target JVM on large applications.
  • Make opt-in behavior explicit in both v1 and v2 configuration docs and schemas.

Long term:

  • Investigate a route discovery approach that does not require dumping the full JVM symbol table.
  • If dynamic attach remains necessary, add stronger safety controls before enabling by default again.
  • Coordinate with #2034 (Timeout does not cancel blocked Java extraction) so timeout/cancellation behavior is improved, while recognizing that cancellation alone may not prevent JVM safepoints already in progress.

Acceptance Criteria

  • Java route harvesting is safe by default for production users.
  • Documentation clearly describes the risk and opt-in behavior.
  • The v1 and v2 config defaults, examples, and schemas agree.
  • There is a tested strategy for avoiding, bounding, or isolating target-JVM pauses before Java route harvesting is enabled by default again.
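
The "defaults agree" criterion could be enforced mechanically. The Go sketch below assumes two hypothetical config types, ConfigV1 and ConfigV2, with invented field names. The real v1 and v2 structures are not shown in this issue, so treat the shapes as placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// ConfigV1 and ConfigV2 are hypothetical stand-ins for the two config formats.
type ConfigV1 struct {
	JavaRouteHarvesting bool
}

type JavaRoutesV2 struct {
	Enabled bool
}

type ConfigV2 struct {
	JavaRoutes JavaRoutesV2
}

// DefaultV1 and DefaultV2 return the assumed defaults: harvesting disabled.
func DefaultV1() ConfigV1 { return ConfigV1{} }
func DefaultV2() ConfigV2 { return ConfigV2{} }

// checkDefaults fails if the two defaults disagree or if either enables
// Java route harvesting without explicit opt-in.
func checkDefaults(v1 ConfigV1, v2 ConfigV2) error {
	if v1.JavaRouteHarvesting != v2.JavaRoutes.Enabled {
		return fmt.Errorf("defaults disagree: v1=%t v2=%t",
			v1.JavaRouteHarvesting, v2.JavaRoutes.Enabled)
	}
	if v1.JavaRouteHarvesting {
		return errors.New("Java route harvesting is enabled by default")
	}
	return nil
}

func main() {
	fmt.Println(checkDefaults(DefaultV1(), DefaultV2()))
	fmt.Println(checkDefaults(ConfigV1{JavaRouteHarvesting: true}, DefaultV2()))
}
```

A check like this would sit naturally in a unit test next to the config packages, so that a change to one default without the other fails the build.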

Metadata

Labels

  • area: route-harvesting (Route extraction and route harvester behavior)
  • bug (Something isn't working)
  • java (Java agent related)

Projects

Status
In Progress
