Add gen_ai.agent.invocation.duration and gen_ai.tool.execution.duration metrics#201
1fc6c02 to
8cfb4ed
Compare
7245c62 to
5bfa1dc
Compare
MikeGoldsmith
left a comment
Looks good, thanks @pvlsirotkin.
There are a few things I think we need to fix up, but this is heading in the right direction.
| stability: development | ||
| attributes: | ||
| - ref_group: attributes.gen_ai.error | ||
| - ref: gen_ai.agent.name |
Should this be required rather than conditionally_required: when available?
Without gen_ai.agent.name the metric becomes a single global histogram and operators won't be able to break latency down by agent.
gen_ai.tool.name on the tool metric below is required as the primary dimension. The invoke_agent span already carries gen_ai.agent.name so I don't see why it would be missing at metric-recording time?
The existing invoke_agent.client / invoke_agent.internal spans use conditionally_required: when available for gen_ai.agent.name (see spans.yaml#L111-L113) – that's what I was mirroring here.
Would recommended work as a compromise? Stronger than the current level, but leaves room for frameworks where the agent has no stable name. Happy to go all the way to required if you'd rather.
#270 proposes to have an entity for agents so that hosted agents would record their info there instead of recording all agent info on every measurement.
So I think whatever we come up with will change in the future once we add entities/resources into the picture, and the attribute on the metric itself can't be required.
For now, I think it makes sense to keep it consistent with the internal agent span, which (for some reason unknown to me) has the conditionally_required: When available level.
Sounds good — kept as conditionally_required: When available. (matching the internal invoke_agent span). Also fixed the capitalization per #245.
* Add `gen_ai.tool.type` to `gen_ai.tool.execution.duration` metric (recommended level, matching the `execute_tool` span). Addresses MikeGoldsmith feedback.
* Explain in the doc why `gen_ai.tool.execution.duration` is recommended while `gen_ai.agent.invocation.duration` is required (tool executions may happen via paths the agent framework does not observe, such as external MCP servers or app-managed dispatch). Addresses MikeGoldsmith feedback.
787f331 to
2b38207
Compare
lmolkova
left a comment
Thanks for the PR!
Left some mostly cosmetic suggestions. I'm going to bring metric naming to the GenAI SIG call tomorrow and will post the outcome on this PR.
| - ref: gen_ai.agent.version | ||
| requirement_level: | ||
| conditionally_required: when available | ||
| - name: gen_ai.tool.execution.duration |
We've been discussing the general approach to metric naming in #249.
I think we should align the operation name with the metric name, and the proposal for this one looks like gen_ai.execute_tool.duration.
I'm planning to bring it up on tomorrow's GenAI call to confirm the naming pattern and will post an update here afterwards.
Let's go ahead with gen_ai.execute_tool.duration since it aligns with the operation name on spans.
The discussion on the call for #249 didn't raise any objections.
Done — renamed to gen_ai.execute_tool.duration (and gen_ai.invoke_agent.duration for the agent metric).
| - ref: gen_ai.workflow.name | ||
| requirement_level: | ||
| conditionally_required: If available. | ||
| - name: gen_ai.agent.invocation.duration |
similar to execute_tool, following #249 the proposal is to align metric name with operation name and call it gen_ai.invoke_agent.duration on internal / local agents and gen_ai.client.invoke_agent.duration when invoking remote agents.
For the scope of this PR, I think it's fine to start with gen_ai.invoke_agent.duration only
Done — renamed to gen_ai.invoke_agent.duration. Skipped the .client.invoke_agent.duration variant for this PR per your guidance.
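The naming pattern discussed for #249 can be sketched as a small helper. This is only an illustration: the function name `duration_metric_name` and its `remote` flag are hypothetical, and the `gen_ai.client.` variant is the proposal quoted in this thread, not something this PR adds.

```python
def duration_metric_name(operation: str, remote: bool = False) -> str:
    """Build a duration metric name from a span operation name.

    Hypothetical helper illustrating the pattern proposed in this thread:
    gen_ai.<operation>.duration for internal/local operations and
    gen_ai.client.<operation>.duration when invoking a remote agent.
    """
    prefix = "gen_ai.client" if remote else "gen_ai"
    return f"{prefix}.{operation}.duration"


# The two names adopted in this PR, plus the deferred client variant.
print(duration_metric_name("invoke_agent"))
print(duration_metric_name("execute_tool"))
print(duration_metric_name("invoke_agent", remote=True))
```

Deriving the metric name from the operation name keeps spans and metrics for the same operation easy to correlate by name alone.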
| metric value SHOULD be the same as the span duration. | ||
|
|
||
| This metric SHOULD be specified with [ExplicitBucketBoundaries] of | ||
| [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92]. |
Should we scale up the histogram boundaries? I would expect an agent call to easily take longer than 90 seconds, so I think a reasonable range could be minutes. Perhaps we can do something like
[0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, 204.8, 409.6, 819.2]?
Agreed on scaling up. Bumped to [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, 204.8, 409.6] — starts at 0.1s for sub-second resolution on fast agent calls, upper bound at ~6.8 min which should cover the typical synchronous range without going as wide as 14 min.
Kept the tool metric (gen_ai.tool.execution.duration) at [0.01..81.92] since single tool calls are closer to LLM-call durations. Happy to scale either further if you have data suggesting longer tails.
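All three boundary lists in this thread are doubling sequences, so the trade-off between them can be checked with a few lines of Python. This is a sketch for comparison only; `doubling_boundaries` is a hypothetical helper, not part of any OpenTelemetry SDK.

```python
def doubling_boundaries(start: float, count: int) -> list[float]:
    """Return `count` bucket boundaries, each twice the previous one."""
    # round() only tidies the printed values to two decimals.
    return [round(start * 2**i, 2) for i in range(count)]


agent_bounds = doubling_boundaries(0.1, 13)   # chosen for the agent metric
tool_bounds = doubling_boundaries(0.01, 14)   # kept for the tool metric
wide_bounds = doubling_boundaries(0.4, 12)    # wider alternative from the review

for label, bounds in [
    ("agent", agent_bounds),
    ("tool", tool_bounds),
    ("wide", wide_bounds),
]:
    upper_minutes = bounds[-1] / 60
    print(f"{label}: {bounds[0]}s .. {bounds[-1]}s (~{upper_minutes:.1f} min)")
```

Starting at 0.1 s instead of 0.4 s spends two extra buckets on sub-second resolution and gives up the top bucket above 409.6 s.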
2b38207 to
023135d
Compare
* Drop `gen_ai.tool.version` entirely (from the registry and the tool metric attribute list) and drop `gen_ai.agent.version` from the new metrics' attribute lists. Tool/agent versions are reasonably covered by `service.version` (resource attribute) and instrumentation scope version. The ADK Python implementation also declares `gen_ai.tool.version` as a constant but doesn't actually record it.
* Apply lmolkova's suggested rewrites to move the per-metric description from `note` into `brief`.
* Move the section-level prose (intent paragraph, guidance about when to use `gen_ai.client.operation.duration` vs `gen_ai.workflow.duration`, tool/agent asymmetry reasoning) into the YAML metric `note` and remove the now-duplicated MD prose. Keeps the MD as a thin wrapper while letting the autogen block render the full description.
* Scale up the bucket boundaries for `gen_ai.agent.invocation.duration` to `[0.1..409.6]` (~6.8 min upper bound) to cover longer multi-step agent calls while keeping sub-second resolution for fast ones. Tool metric kept at `[0.01..81.92]` since single tool calls are closer to LLM-call durations.
* Rename metrics: gen_ai.agent.request.size -> gen_ai.agent.input.content.size and gen_ai.agent.response.size -> gen_ai.agent.output.content.size. The new names don't imply a physical HTTP/gRPC request and are explicit that the metric is about content bytes (lmolkova's feedback).
* Drop the per-invocation-increment framing. Metric semantics now: byte size of content the agent receives/produces at its entrypoint, whatever the framework sees natively. Addresses lmolkova's point that 'what's new' is framework-dependent and ambiguous, and trask's question about defining in terms of gen_ai.input.messages (which would force frameworks to serialize full chat history).
* Spell out the byte-counting algorithm concretely: UTF-8 byte length for text parts, raw byte length for binary parts, framing bytes (JSON keys, role/metadata) not counted. Matches what the ADK reference implementation does. Addresses both Mike's and lmolkova's precision requests.
* Bump gen_ai.agent.name from 'conditionally_required: when available' to 'recommended'. Same compromise as PR open-telemetry#201 - stronger than current but doesn't break unnamed-agent frameworks.
* Add error.type via attributes.gen_ai.error ref_group (Mike's suggestion); held off on metric_attributes.gen_ai since address/port/provider/model don't add much for an in-process content-size metric.
* Drop gen_ai.agent.version from attribute lists (same reasoning as PR open-telemetry#201 - service.version covers it).
* Remove cross-reference to gen_ai.agent.invocation.duration since open-telemetry#201 has not landed yet. Will re-add later.
* Restructure the docs/gen-ai/gen-ai-metrics.md section to follow the thin MD wrapper + rich YAML note pattern (same as PR open-telemetry#201 revision).
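The byte-counting rule in the commit message above (UTF-8 length for text parts, raw length for binary parts, framing not counted) could look roughly like the following. The `TextPart` and `BinaryPart` types are hypothetical stand-ins for whatever content representation a framework uses; this is not the ADK implementation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TextPart:
    """Hypothetical text content part."""
    text: str


@dataclass
class BinaryPart:
    """Hypothetical binary content part (image, audio, file bytes)."""
    data: bytes


Part = Union[TextPart, BinaryPart]


def content_size_bytes(parts: list[Part]) -> int:
    """Sum content bytes only; roles, JSON keys and metadata are not counted."""
    total = 0
    for part in parts:
        if isinstance(part, TextPart):
            # Text is measured as its UTF-8 encoded length, not its character count.
            total += len(part.text.encode("utf-8"))
        else:
            # Binary content is measured as its raw byte length.
            total += len(part.data)
    return total


parts = [TextPart("caf\u00e9"), BinaryPart(b"\x00\x01\x02")]
print(content_size_bytes(parts))  # 5 UTF-8 bytes of text + 3 raw bytes = 8
```

Counting encoded bytes rather than characters matters for non-ASCII text: the four-character string in the example occupies five bytes.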
The original RFC and the production ADK implementation always intended this metric to count steps a *specific agent* takes during a single invocation - not steps in some larger workflow. A previous revision of this PR moved the metric into the gen_ai.workflow.* namespace, which caused most of the confusion in the latest review round (multiple reviewers asked whether the metric is a per-invocation count or a cumulative workflow-level aggregation; whether nested sub-agent steps should be counted; whether workflow.name should be required given the name implies workflow scope).
This commit refocuses the metric back on its intended scope:
* Rename gen_ai.workflow.steps -> gen_ai.agent.steps. Lives in the gen_ai.agent.* namespace alongside gen_ai.agent.invocation.duration (open-telemetry#201).
* Drop gen_ai.workflow.name and gen_ai.agent.version from the attribute list. The metric is now dimensioned only by gen_ai.agent.name (at 'recommended' level, matching open-telemetry#201/open-telemetry#202).
* Add attributes.gen_ai.error ref_group (Mike's feedback).
* Rewrite the metric note to spell out per-agent semantics:
  - Counts only events the agent itself authored (e.g. in ADK, events whose event.author equals the agent name).
  - Tool results are NOT counted (authored by tool, not agent).
  - Sub-agent steps are NOT counted (each step recorded once across the call tree - addresses aabmass' 'recorded exactly once' invariant).
  - Failed and partial steps still count (they consumed execution time - addresses Mike's question).
  - Step counts MUST NOT be compared across frameworks (addresses Mike's stronger-wording ask).
  - Concrete framework-specific definitions for ADK, LangGraph, LangChain agents, CrewAI, with a link to the adk.dev event-loop documentation (aabmass' suggestion).
  - Histogram bucket recommendation included in the YAML note (Mike's #M3 ask).
* Restructure the docs/gen-ai/gen-ai-metrics.md page: remove gen_ai.agent.steps from the 'Generative AI workflow metrics' section and create a new 'Generative AI agent metrics' section. MD body kept thin; the metric note in YAML carries the full semantics and is rendered via the autogen block.
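The per-agent counting rules listed in the commit message above can be sketched as a filter over an event log. The `Event` type and the agent and tool names are hypothetical and only loosely modeled on the "event.author equals the agent name" wording; this is not the ADK API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event record: who authored it and whether it failed."""
    author: str
    failed: bool = False


def count_agent_steps(events: list[Event], agent_name: str) -> int:
    """Count only events authored by `agent_name`, including failed ones."""
    return sum(1 for event in events if event.author == agent_name)


events = [
    Event(author="planner"),
    Event(author="search_tool"),           # tool result: not an agent step
    Event(author="planner", failed=True),  # failed step: still counted
    Event(author="researcher"),            # sub-agent step: counted for the sub-agent only
    Event(author="planner"),
]

print(count_agent_steps(events, "planner"))
print(count_agent_steps(events, "researcher"))
```

Because each event has exactly one author, every step is attributed to one agent and is recorded once across the call tree.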
Hi! As part of #275, this repository switched to Towncrier changelog fragments to reduce merge conflicts in Please move this PR's changelog entry out of Create Add `gen_ai.agent.invocation.duration` metric to track the end-to-end duration of a single agent invocation, and `gen_ai.tool.execution.duration` metric to track the duration of a single tool execution. After adding the fragment, please remove this PR's direct edit to Thanks!
lmolkova
left a comment
This looks great, just a few comments on naming and suggestions to add agent info attributes to metrics
| The duration of a single tool execution performed by or on behalf of a | ||
| GenAI agent. |
Nit: a tool execution can also be performed by a generic application using an LLM (of course, you can call that an agent too).
| The duration of a single tool execution performed by or on behalf of a | |
| GenAI agent. | |
| The duration of a single tool execution. |
| Intended for instrumentations of agent frameworks (or of application | ||
| code that executes tools on behalf of an agent) that can reliably | ||
| bound a single tool call. | ||
|
|
||
| Unlike `gen_ai.agent.invocation.duration` (which is required), this | ||
| metric is only recommended because tools may be executed through | ||
| paths that the agent framework does not observe — for example, | ||
| external MCP servers or application-managed dispatch. | ||
| Instrumentations SHOULD record this metric for every tool execution | ||
| they observe but are not required to capture all tool calls across | ||
| the agentic system. |
| Intended for instrumentations of agent frameworks (or of application | |
| code that executes tools on behalf of an agent) that can reliably | |
| bound a single tool call. | |
| Unlike `gen_ai.agent.invocation.duration` (which is required), this | |
| metric is only recommended because tools may be executed through | |
| paths that the agent framework does not observe — for example, | |
| external MCP servers or application-managed dispatch. | |
| Instrumentations SHOULD record this metric for every tool execution | |
| they observe but are not required to capture all tool calls across | |
| the agentic system. | |
| Instrumentation that can reliably bound a single tool call SHOULD | |
| record this metric for every tool execution they can observe. |
Suggesting a simplification. I believe we're moving away from having required and recommended levels for metrics to (on by default and off by default) - open-telemetry/semantic-conventions#3278
Applied. Good context on the required/recommended direction change — thanks for the heads-up.
| requirement_level: recommended | ||
| - ref: gen_ai.agent.name | ||
| requirement_level: | ||
| conditionally_required: when available |
nit
| conditionally_required: when available | |
| conditionally_required: When available. |
Applied — and standardized the same capitalization on every conditionally_required note on these metrics.
| The end-to-end duration of a single agent invocation, | ||
| from the moment the agent is invoked to the moment it produces its final | ||
| response (or terminates with an error). |
nit
| The end-to-end duration of a single agent invocation, | |
| from the moment the agent is invoked to the moment it produces its final | |
| response (or terminates with an error). | |
| The end-to-end duration of a single agent invocation, | |
| from the moment the invocation starts until the agent emits | |
| the last chunk of its final response or terminates with an error. |
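As an illustration of the boundary described in this brief, the sketch below times an invocation from its start until it returns or raises, and attaches `gen_ai.agent.name` when available and `error.type` on failure. The `recorded` list is a hypothetical in-memory stand-in for a histogram instrument named `gen_ai.invoke_agent.duration` (unit: seconds); a real instrumentation would record to its metrics SDK instead.

```python
import time
from contextlib import contextmanager

# Hypothetical stand-in for a histogram: (duration_seconds, attributes) pairs.
recorded = []


@contextmanager
def record_invoke_agent_duration(agent_name=None, clock=time.monotonic):
    """Measure one agent invocation, whether it succeeds or raises."""
    attributes = {}
    if agent_name is not None:
        # Only set when the framework has a name for the agent.
        attributes["gen_ai.agent.name"] = agent_name
    start = clock()
    try:
        yield
    except Exception as exc:
        # Failed invocations are still measured and tagged with the error type.
        attributes["error.type"] = type(exc).__qualname__
        raise
    finally:
        recorded.append((clock() - start, attributes))


with record_invoke_agent_duration("planner"):
    time.sleep(0.01)  # placeholder for the agent producing its final response

try:
    with record_invoke_agent_duration("planner"):
        raise TimeoutError("agent timed out")
except TimeoutError:
    pass

for duration, attributes in recorded:
    print(f"{duration:.3f}s {attributes}")
```

Recording in the `finally` clause keeps the measurement aligned with the span boundary, so the value can match the span duration for both successful and failed invocations.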
| stability: development | ||
| attributes: | ||
| - ref_group: attributes.gen_ai.error | ||
| - ref: gen_ai.agent.name |
The invoke_agent span has a few more attributes that could be useful on metrics. In particular:
- agent.id, version, and description
- request model
I think we should reference them. The agent info will also come through the entity in #270.
Added gen_ai.agent.id, gen_ai.agent.version, and gen_ai.request.model (all conditionally_required: When/If available.). Skipped gen_ai.agent.description since it tends to be free-form text and risks high cardinality on a metric.
Quick follow-up on this: ended up dropping gen_ai.agent.version and gen_ai.request.model again on a second look. Kept gen_ai.agent.id.
- gen_ai.agent.version: the intent in this metric set is to use the service.version resource attribute (which is stable, applied process-wide via the resource) for agent/tool/app versioning, rather than carrying a per-signal gen_ai.agent.version dimension on every metric. The registry attribute itself stays (it was added by @trask in a separate PR), it just isn't a dimension on this metric.
- gen_ai.request.model: not always available at the time this metric is recorded. The framework knows the agent it's invoking, but the model selection can happen later inside the agent (and may differ across the multiple LLM calls within one invocation). The model breakdown is already on the child gen_ai.inference / gen_ai.client.operation.duration signals; adding it here would conflate two different boundaries.
Happy to revisit either if there's a concrete use case I'm missing.
I believe agent version != service version for remote agents and even when you're hosting an agent the version of agent is NOT the version user sees - you can update your service without incrementing agent version because nothing has changed in user-related agent config or behavior.
For the request model, we can document that it's the model provided when the agent was created. Given that pretty much all frameworks take it as a constructor parameter, it seems to be a common and important use case.
They can both be recommended: When available and meaningful / providing information not captured by service.version, etc.
…on metrics
Adds two new GenAI semantic convention metrics for agent and tool latency,
modeled on the recently-added gen_ai.workflow.duration metric:
* gen_ai.agent.invocation.duration (histogram, seconds): end-to-end
duration of a single agent invocation, aligned with the existing
gen_ai.invoke_agent.{client,internal} spans.
* gen_ai.tool.execution.duration (histogram, seconds): duration of a
single tool execution, aligned with the existing
gen_ai.execute_tool.internal span.
Also adds the gen_ai.tool.version attribute, used as a dimension on
gen_ai.tool.execution.duration (mirrors the existing gen_ai.agent.version).
NOTE: docs/registry/ and schema-snapshot/ regeneration via 'make
generate-all' has NOT been run in this commit (no Docker available in
the authoring environment). Run it locally before pushing for review.
Following lmolkova's review and trask's towncrier reminder:
* Rename per open-telemetry#249 (lmolkova confirmed on PR open-telemetry#201 SIG call discussion):
  - gen_ai.agent.invocation.duration -> gen_ai.invoke_agent.duration
  - gen_ai.tool.execution.duration -> gen_ai.execute_tool.duration
  Metric names now align with the operation name on spans (gen_ai.invoke_agent, gen_ai.execute_tool).
* Move CHANGELOG entry to a Towncrier fragment per trask's reminder about open-telemetry#275: changelog.d/201.enhancement.md.
* Bump gen_ai.agent.name from 'recommended' back to 'conditionally_required: When available.' (lmolkova: keep consistent with the internal invoke_agent span; entity work in open-telemetry#270 will reshape this later anyway).
* Capitalize 'When available' / 'If available' per open-telemetry#245 sentence-case convention on every requirement_level note.
* Apply lmolkova's suggested rewrites on metric briefs/notes:
  - Agent: more concise brief about the invocation start/end.
  - Tool: drop 'performed by or on behalf of a GenAI agent' since generic apps (not just agents) can execute tools.
  - Tool note: simplify the requirement statement (drops the explicit 'required vs recommended' framing; semconv is moving away from those labels for metrics per open-telemetry/semantic-conventions#3278).
* Add a few more low-cardinality attributes on invoke_agent.duration per lmolkova: gen_ai.agent.id, gen_ai.agent.version, gen_ai.request.model (all conditionally_required When/If available). They mirror what the invoke_agent span carries and will be reshaped once open-telemetry#270 introduces agent entities.
023135d to
f6a2978
Compare
|
Done — added |
…duration
* gen_ai.agent.version: version information should come from the service.version resource attribute, not from a per-signal metric dimension. The intent in this metric set is to use service.version consistently for agent/tool/app versioning, so dropping the per-metric gen_ai.agent.version reference. (The registry attribute itself stays since it was added by trask in a separate PR; it just isn't a dimension on this metric.)
* gen_ai.request.model: pushing back on adding this. At the invoke_agent boundary the framework doesn't always know which model will be used yet - the model can be selected later by the agent or vary across the multiple LLM calls within an invocation. The request.model breakdown is already captured on the child gen_ai.inference / gen_ai.client.operation.duration signals; adding it here would conflate two different boundaries.
| requirement_level: required | ||
| - ref: gen_ai.tool.type | ||
| requirement_level: recommended | ||
| - ref: gen_ai.agent.name |
if #270 is merged before this PR, we'll need to add entity associations and adjust requirement level accordingly, just leaving a comment so we don't forget
| @@ -0,0 +1 @@ | |||
| Add `gen_ai.invoke_agent.duration` metric to track the end-to-end duration of a single agent invocation, and `gen_ai.execute_tool.duration` metric to track the duration of a single tool execution. | |||
could you please update PR description and title to reflect new naming? thanks!
Description
Adds two new GenAI semantic convention metrics for agent and tool latency, modeled on the recently-merged gen_ai.workflow.duration metric (#126):
- gen_ai.agent.invocation.duration (Histogram, s): end-to-end duration of a single agent invocation. Aligns with the existing gen_ai.invoke_agent client and internal spans.
- gen_ai.tool.execution.duration (Histogram, s): duration of a single tool execution. Aligns with the existing gen_ai.execute_tool span.
Also adds a new registry attribute used as a dimension on the tool metric:
- gen_ai.tool.version: low-cardinality version string for the tool, mirroring the existing gen_ai.agent.version.
Motivation
The conventions already define invoke_agent and execute_tool spans (with gen_ai.agent.name, gen_ai.agent.version, gen_ai.tool.name, error.type) and now gen_ai.workflow.duration for the workflow boundary, but there is no standard metric for individual agent invocations or individual tool executions. Operators today cannot build agent- or tool-level latency / error-rate dashboards without inventing custom metric names.
User journey: latency SLOs, error-rate dashboards, capacity planning, and regression detection across agent / tool versions.
Checklist
Unreleased in CHANGELOG.md
See CONTRIBUTING.md.