Add gen_ai.agent.invocation.duration and gen_ai.tool.execution.duration metrics#201

Open
pvlsirotkin wants to merge 5 commits into open-telemetry:main from pvlsirotkin:agent-invocation-tool-duration-metrics

@pvlsirotkin

Description

Adds two new GenAI semantic convention metrics for agent and tool latency,
modeled on the recently-merged gen_ai.workflow.duration metric
(#126):

  • gen_ai.agent.invocation.duration (Histogram, s): end-to-end duration
    of a single agent invocation. Aligns with the existing
    gen_ai.invoke_agent client and internal spans.
  • gen_ai.tool.execution.duration (Histogram, s): duration of a single
    tool execution. Aligns with the existing gen_ai.execute_tool span.

Also adds a new registry attribute used as a dimension on the tool metric:

  • gen_ai.tool.version: low-cardinality version string for the tool,
    mirroring the existing gen_ai.agent.version.
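
As a rough illustration of what recording such a duration histogram could look like, here is a minimal Python sketch. It is not taken from this PR: `InMemoryHistogram` and `timed_invocation` are hypothetical names, the recorder only mimics a `record(amount, attributes)` call shape, the agent name "planner" is made up, and the metric name is the one proposed in this description.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Attributes = Dict[str, str]


class InMemoryHistogram:
    """Hypothetical stand-in for a histogram instrument (illustration only)."""

    def __init__(self, name: str, unit: str) -> None:
        self.name = name
        self.unit = unit
        self.points: List[Tuple[float, Attributes]] = []

    def record(self, amount: float, attributes: Optional[Attributes] = None) -> None:
        # Store the measurement together with a copy of its attributes.
        self.points.append((amount, dict(attributes or {})))


def timed_invocation(histogram, attributes: Attributes, fn: Callable[[], object]) -> object:
    """Run fn and record its wall-clock duration in seconds, even on failure."""
    attrs = dict(attributes)
    start = time.monotonic()
    try:
        return fn()
    except Exception as exc:
        # Assumption: error.type carries the exception class name on failure.
        attrs["error.type"] = type(exc).__qualname__
        raise
    finally:
        histogram.record(time.monotonic() - start, attributes=attrs)


agent_duration = InMemoryHistogram("gen_ai.agent.invocation.duration", unit="s")
result = timed_invocation(agent_duration, {"gen_ai.agent.name": "planner"}, lambda: "done")
```

Recording in `finally` means failed invocations are measured too, which is what makes error-rate breakdowns by `error.type` possible.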

Motivation

The conventions already define invoke_agent and execute_tool spans (with
gen_ai.agent.name, gen_ai.agent.version, gen_ai.tool.name, error.type)
and now gen_ai.workflow.duration for the workflow boundary, but there is no
standard metric for individual agent invocations or individual tool
executions. Operators today cannot build agent- or tool-level latency /
error-rate dashboards without inventing custom metric names.

User journey: latency SLOs, error-rate dashboards, capacity planning, and
regression detection across agent / tool versions.

Checklist

  • Motivation section filled in above
  • Reference instrumentation and scenarios updated for affected libraries
  • Changelog entry added under Unreleased in CHANGELOG.md

See CONTRIBUTING.md.

@linux-foundation-easycla

linux-foundation-easycla Bot commented May 27, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: pvlsirotkin / name: Pavel Sirotkin (7245c62)

@pvlsirotkin pvlsirotkin force-pushed the agent-invocation-tool-duration-metrics branch from 1fc6c02 to 8cfb4ed Compare May 27, 2026 10:56
This was referenced May 27, 2026
@pvlsirotkin pvlsirotkin force-pushed the agent-invocation-tool-duration-metrics branch 2 times, most recently from 7245c62 to 5bfa1dc Compare May 27, 2026 15:27
@pvlsirotkin pvlsirotkin marked this pull request as ready for review May 27, 2026 15:31
@pvlsirotkin pvlsirotkin requested a review from a team as a code owner May 27, 2026 15:31
Copilot AI review requested due to automatic review settings May 27, 2026 15:31

@MikeGoldsmith MikeGoldsmith left a comment

Member

Looks good, thanks @pvlsirotkin.

There are a few things I think we need to fix up, but this is heading in the right direction.

Comment thread docs/gen-ai/gen-ai-metrics.md
Comment thread model/gen-ai/metrics.yaml
stability: development
attributes:
  - ref_group: attributes.gen_ai.error
  - ref: gen_ai.agent.name

Member

Should this be required rather than conditionally_required: when available?
Without gen_ai.agent.name the metric becomes a single global histogram and operators won't be able to break latency down by agent.

gen_ai.tool.name on the tool metric below is required as the primary dimension. The invoke_agent span already carries gen_ai.agent.name so I don't see why it would be missing at metric-recording time?

@pvlsirotkin pvlsirotkin May 31, 2026

Author

The existing invoke_agent.client / invoke_agent.internal spans use conditionally_required: when available for gen_ai.agent.name (see spans.yaml#L111-L113) – that's what I was mirroring here.

Would recommended work as a compromise? Stronger than the current level, but leaves room for frameworks where the agent has no stable name. Happy to go all the way to required if you'd rather.

Same change would apply to #202 and #203.

@lmolkova lmolkova Jun 10, 2026

Member

#270 proposes to have an entity for agents so that hosted agents would record their info there instead of recording all agent info on every measurement.

So I think whatever we come up with will change once we add entities/resources into the picture, and the attribute on the metric itself can't be required.

For now, I think it makes sense to keep it consistent with the internal agent span, which (for some reason unknown to me) has the conditionally_required: When available level.

@pvlsirotkin pvlsirotkin Jun 11, 2026

Author

Sounds good — kept as conditionally_required: When available. (matching the internal invoke_agent span). Also fixed the capitalization per #245.
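
For illustration, `conditionally_required: When available.` roughly translates into instrumentation that sets the attribute only when it has a value. The helper below is a hypothetical Python sketch, not code from this PR or from any instrumentation library; the function name and parameters are assumptions.

```python
from typing import Dict, Optional


def agent_metric_attributes(
    agent_name: Optional[str], error_type: Optional[str] = None
) -> Dict[str, str]:
    """Build metric attributes, skipping values the framework does not know."""
    attrs: Dict[str, str] = {}
    if agent_name:
        # Set only when the framework exposes a stable agent name.
        attrs["gen_ai.agent.name"] = agent_name
    if error_type:
        # Set only when the invocation ended with an error.
        attrs["error.type"] = error_type
    return attrs


named = agent_metric_attributes("planner")
unnamed = agent_metric_attributes(None, error_type="TimeoutError")
```

With no agent name, all measurements land in a single unnamed series, which is the loss of per-agent breakdown raised at the start of this thread.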

Comment thread CHANGELOG.md Outdated
Comment thread model/gen-ai/metrics.yaml
@pvlsirotkin pvlsirotkin force-pushed the agent-invocation-tool-duration-metrics branch from 787f331 to 2b38207 Compare May 31, 2026 13:13
pvlsirotkin added a commit to pvlsirotkin/semantic-conventions-genai that referenced this pull request May 31, 2026
* Add `gen_ai.tool.type` to `gen_ai.tool.execution.duration` metric
  (recommended level, matching the `execute_tool` span). Addresses
  MikeGoldsmith feedback.
* Explain in the doc why `gen_ai.tool.execution.duration` is recommended
  while `gen_ai.agent.invocation.duration` is required (tool executions
  may happen via paths the agent framework does not observe — external
  MCP servers, app-managed dispatch, etc.). Addresses MikeGoldsmith
  feedback.
@pvlsirotkin pvlsirotkin requested a review from MikeGoldsmith June 2, 2026 12:55

@lmolkova lmolkova left a comment

Member

Thanks for the PR!

Left some mostly cosmetic suggestions. I'm going to bring metric naming to the GenAI SIG call tomorrow and will post the outcome on this PR.

Comment thread model/gen-ai/registry.yaml Outdated
Comment thread model/gen-ai/metrics.yaml Outdated
      - ref: gen_ai.agent.version
        requirement_level:
          conditionally_required: when available
  - name: gen_ai.tool.execution.duration

@lmolkova lmolkova Jun 8, 2026

Member

We've been discussing a general approach to metric naming in #249.

I think we should align the metric name with the operation name; the proposal for this one looks like gen_ai.execute_tool.duration.

I'm planning to bring it up on tomorrow's GenAI call to confirm the naming pattern and will post an update here afterwards.

Member

Let's go ahead with gen_ai.execute_tool.duration since it aligns with the operation name on spans.

The discussion on the call for #249 didn't raise any objections.

Author

Done — renamed to gen_ai.execute_tool.duration (and gen_ai.invoke_agent.duration for the agent metric).

Comment thread model/gen-ai/metrics.yaml Outdated
Comment thread model/gen-ai/metrics.yaml Outdated
      - ref: gen_ai.workflow.name
        requirement_level:
          conditionally_required: If available.
  - name: gen_ai.agent.invocation.duration

Member

Similar to execute_tool: following #249, the proposal is to align the metric name with the operation name and call it gen_ai.invoke_agent.duration for internal / local agents and gen_ai.client.invoke_agent.duration when invoking remote agents.

For the scope of this PR, I think it's fine to start with gen_ai.invoke_agent.duration only.

Author

Done — renamed to gen_ai.invoke_agent.duration. Skipped the .client.invoke_agent.duration variant for this PR per your guidance.
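
The naming pattern agreed in this thread can be written down as a small helper. This is a sketch of the pattern as described here, not an official API; `duration_metric_name` and its `client` flag are hypothetical.

```python
def duration_metric_name(operation: str, client: bool = False) -> str:
    """Derive a duration metric name from a span operation name.

    Assumption: the 'client' variant is used when invoking a remote agent,
    as proposed in the review comment; only the non-client form is in scope
    for this PR.
    """
    prefix = "gen_ai.client" if client else "gen_ai"
    return f"{prefix}.{operation}.duration"


tool_metric = duration_metric_name("execute_tool")
agent_metric = duration_metric_name("invoke_agent")
remote_agent_metric = duration_metric_name("invoke_agent", client=True)
```

Deriving the name from the operation keeps the metric and the matching span discoverable under the same `gen_ai.<operation>` prefix.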

Comment thread model/gen-ai/metrics.yaml Outdated
Comment thread docs/gen-ai/gen-ai-metrics.md Outdated
Comment thread docs/gen-ai/gen-ai-metrics.md Outdated
Comment thread docs/gen-ai/gen-ai-metrics.md Outdated
Comment thread docs/gen-ai/gen-ai-metrics.md Outdated
Comment thread docs/gen-ai/gen-ai-metrics.md Outdated
metric value SHOULD be the same as the span duration.

This metric SHOULD be specified with [ExplicitBucketBoundaries] of
[0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92].

Member

Should we scale up the histogram boundaries? I would expect agent call duration to easily exceed 90 sec. I think a reasonable range could be minutes; perhaps we can do something like

[0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, 204.8, 409.6, 819.2] ?

@pvlsirotkin pvlsirotkin Jun 10, 2026

Author

Agreed on scaling up. Bumped to [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, 204.8, 409.6] — starts at 0.1s for sub-second resolution on fast agent calls, upper bound at ~6.8 min which should cover the typical synchronous range without going as wide as 14 min.

Kept the tool metric (gen_ai.tool.execution.duration) at [0.01..81.92] since single tool calls are closer to LLM-call durations. Happy to scale either further if you have data suggesting longer tails.
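
Both boundary sets in this thread are plain doubling series, so they can be reproduced and compared with a few lines of Python. This is only a sketch: `doubling_boundaries` and `bucket_index` are hypothetical helpers, and the bucket lookup assumes upper-inclusive explicit buckets.

```python
from bisect import bisect_left
from typing import List


def doubling_boundaries(start: float, count: int) -> List[float]:
    """Return `count` boundaries, each twice the previous one."""
    return [round(start * 2**i, 2) for i in range(count)]


def bucket_index(boundaries: List[float], value: float) -> int:
    """Index of the bucket holding `value`, assuming upper-inclusive bounds.

    Index i covers (boundaries[i-1], boundaries[i]]; an index equal to
    len(boundaries) is the overflow bucket above the last boundary.
    """
    return bisect_left(boundaries, value)


TOOL_BOUNDARIES = doubling_boundaries(0.01, 14)   # 0.01 .. 81.92 seconds
AGENT_BOUNDARIES = doubling_boundaries(0.1, 13)   # 0.1 .. 409.6 seconds

# A two-minute agent call overflows the tool-sized buckets but still lands
# in a finite bucket of the scaled-up agent boundaries.
two_minutes = 120.0
tool_bucket = bucket_index(TOOL_BOUNDARIES, two_minutes)
agent_bucket = bucket_index(AGENT_BOUNDARIES, two_minutes)
agent_upper_bound_minutes = AGENT_BOUNDARIES[-1] / 60
```

Any measurement in the overflow bucket loses resolution, which is the practical argument for the wider agent boundaries.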

@pvlsirotkin pvlsirotkin force-pushed the agent-invocation-tool-duration-metrics branch from 2b38207 to 023135d Compare June 10, 2026 14:02
pvlsirotkin added a commit to pvlsirotkin/semantic-conventions-genai that referenced this pull request Jun 10, 2026
* Drop `gen_ai.tool.version` entirely (from the registry and the tool
  metric attribute list) and drop `gen_ai.agent.version` from the new
  metrics' attribute lists. Tool/agent versions are reasonably covered
  by `service.version` (resource attribute) and instrumentation scope
  version. The ADK Python implementation also declares
  `gen_ai.tool.version` as a constant but doesn't actually record it.
* Apply lmolkova's suggested rewrites to move the per-metric description
  from `note` into `brief`.
* Move the section-level prose (intent paragraph, guidance about when
  to use `gen_ai.client.operation.duration` vs `gen_ai.workflow.duration`,
  tool/agent asymmetry reasoning) into the YAML metric `note` and remove
  the now-duplicated MD prose. Keeps the MD as a thin wrapper while
  letting the autogen block render the full description.
* Scale up the bucket boundaries for `gen_ai.agent.invocation.duration`
  to `[0.1..409.6]` (~6.8 min upper bound) to cover longer multi-step
  agent calls while keeping sub-second resolution for fast ones. Tool
  metric kept at `[0.01..81.92]` since single tool calls are closer to
  LLM-call durations.
pvlsirotkin added a commit to pvlsirotkin/semantic-conventions-genai that referenced this pull request Jun 10, 2026
* Rename metrics: gen_ai.agent.request.size -> gen_ai.agent.input.content.size
  and gen_ai.agent.response.size -> gen_ai.agent.output.content.size. The new
  names don't imply a physical HTTP/gRPC request and are explicit that the
  metric is about content bytes (lmolkova's feedback).
* Drop the per-invocation-increment framing. Metric semantics now: byte size
  of content the agent receives/produces at its entrypoint, whatever the
  framework sees natively. Addresses lmolkova's point that 'what's new' is
  framework-dependent and ambiguous, and trask's question about defining in
  terms of gen_ai.input.messages (which would force frameworks to serialize
  full chat history).
* Spell out the byte-counting algorithm concretely: UTF-8 byte length for
  text parts, raw byte length for binary parts, framing bytes (JSON keys,
  role/metadata) not counted. Matches what the ADK reference implementation
  does. Addresses both Mike's and lmolkova's precision requests.
* Bump gen_ai.agent.name from 'conditionally_required: when available' to
  'recommended'. Same compromise as PR open-telemetry#201 - stronger than current but
  doesn't break unnamed-agent frameworks.
* Add error.type via attributes.gen_ai.error ref_group (Mike's suggestion);
  held off on metric_attributes.gen_ai since address/port/provider/model
  don't add much for an in-process content-size metric.
* Drop gen_ai.agent.version from attribute lists (same reasoning as PR open-telemetry#201
  - service.version covers it).
* Remove cross-reference to gen_ai.agent.invocation.duration since open-telemetry#201 has
  not landed yet. Will re-add later.
* Restructure the docs/gen-ai/gen-ai-metrics.md section to follow the thin
  MD wrapper + rich YAML note pattern (same as PR open-telemetry#201 revision).
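
The byte-counting rule in the third bullet can be sketched in Python as follows. The part model (a list of `str` and `bytes` values) and the `content_size_bytes` name are assumptions made for this illustration; real frameworks have their own content types.

```python
from typing import Iterable, Union

# Hypothetical content part: text is a str, binary data is bytes.
Part = Union[str, bytes]


def content_size_bytes(parts: Iterable[Part]) -> int:
    """Sum content bytes only; framing such as JSON keys or roles is ignored."""
    total = 0
    for part in parts:
        if isinstance(part, str):
            # Text parts count by their UTF-8 encoded length.
            total += len(part.encode("utf-8"))
        else:
            # Binary parts count by their raw length.
            total += len(part)
    return total


size = content_size_bytes(["héllo", b"\x00\x01"])
```

Here "héllo" contributes 6 bytes because "é" takes two bytes in UTF-8, and the binary part contributes 2.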
pvlsirotkin added a commit to pvlsirotkin/semantic-conventions-genai that referenced this pull request Jun 10, 2026
The original RFC and the production ADK implementation always intended
this metric to count steps a *specific agent* takes during a single
invocation - not steps in some larger workflow. A previous revision of
this PR moved the metric into the gen_ai.workflow.* namespace, which
caused most of the confusion in the latest review round (multiple
reviewers asked whether the metric is a per-invocation count or a
cumulative workflow-level aggregation; whether nested sub-agent steps
should be counted; whether workflow.name should be required given the
name implies workflow scope).

This commit refocuses the metric back on its intended scope:

* Rename gen_ai.workflow.steps -> gen_ai.agent.steps. Lives in the
  gen_ai.agent.* namespace alongside gen_ai.agent.invocation.duration
  (open-telemetry#201).
* Drop gen_ai.workflow.name and gen_ai.agent.version from the
  attribute list. The metric is now dimensioned only by
  gen_ai.agent.name (at 'recommended' level, matching open-telemetry#201/open-telemetry#202).
* Add attributes.gen_ai.error ref_group (Mike's feedback).
* Rewrite the metric note to spell out per-agent semantics:
  - Counts only events the agent itself authored (e.g. in ADK,
    events whose event.author equals the agent name).
  - Tool results are NOT counted (authored by tool, not agent).
  - Sub-agent steps are NOT counted (each step recorded once
    across the call tree - addresses aabmass' 'recorded exactly
    once' invariant).
  - Failed and partial steps still count (they consumed execution
    time - addresses Mike's question).
  - Step counts MUST NOT be compared across frameworks (addresses
    Mike's stronger-wording ask).
  - Concrete framework-specific definitions for ADK, LangGraph,
    LangChain agents, CrewAI, with a link to the adk.dev
    event-loop documentation (aabmass' suggestion).
  - Histogram bucket recommendation included in the YAML note
    (Mike's #M3 ask).
* Restructure the docs/gen-ai/gen-ai-metrics.md page: remove
  gen_ai.agent.steps from the 'Generative AI workflow metrics'
  section and create a new 'Generative AI agent metrics' section.
  MD body kept thin; the metric note in YAML carries the full
  semantics and is rendered via the autogen block.
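
A minimal Python sketch of the per-agent counting rule described in this commit message, assuming a simplified event model. `Event`, its fields, and `count_agent_steps` are hypothetical; they only loosely mirror the ADK `event.author` idea mentioned above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Event:
    """Hypothetical event record: who authored it and whether it failed."""
    author: str
    failed: bool = False


def count_agent_steps(events: Iterable[Event], agent_name: str) -> int:
    """Count events authored by the agent itself.

    Tool results and sub-agent events have a different author and are
    skipped; failed steps are deliberately not filtered out.
    """
    return sum(1 for event in events if event.author == agent_name)


events = [
    Event(author="planner"),
    Event(author="search_tool"),           # tool result, not counted
    Event(author="planner", failed=True),  # failed step, still counted
    Event(author="sub_agent"),             # sub-agent step, not counted
]
steps = count_agent_steps(events, "planner")
```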
@trask

trask commented Jun 10, 2026

Member

Hi! As part of #275, this repository switched to Towncrier changelog fragments to reduce merge conflicts in CHANGELOG.md.

Please move this PR's changelog entry out of CHANGELOG.md and into this Towncrier fragment:

Create changelog.d/201.enhancement.md containing the change log entry, e.g.:

Add `gen_ai.agent.invocation.duration` metric to track the end-to-end duration of a single agent invocation, and `gen_ai.tool.execution.duration` metric to track the duration of a single tool execution.

After adding the fragment, please remove this PR's direct edit to CHANGELOG.md. Towncrier will add the PR link from the fragment filename when release notes are generated.

Thanks!

@lmolkova lmolkova left a comment

Member

This looks great, just a few comments on naming and suggestions to add agent info attributes to metrics

Comment thread model/gen-ai/metrics.yaml Outdated
Comment on lines +156 to +157
The duration of a single tool execution performed by or on behalf of a
GenAI agent.

Member

nit: tool execution can also be performed by a generic application using an LLM (of course, you can call that an agent too)

Suggested change
- The duration of a single tool execution performed by or on behalf of a
- GenAI agent.
+ The duration of a single tool execution.

Author

Applied.

Comment thread model/gen-ai/metrics.yaml Outdated
Comment on lines +159 to +169
Intended for instrumentations of agent frameworks (or of application
code that executes tools on behalf of an agent) that can reliably
bound a single tool call.

Unlike `gen_ai.agent.invocation.duration` (which is required), this
metric is only recommended because tools may be executed through
paths that the agent framework does not observe — for example,
external MCP servers or application-managed dispatch.
Instrumentations SHOULD record this metric for every tool execution
they observe but are not required to capture all tool calls across
the agentic system.

Member

Suggested change
- Intended for instrumentations of agent frameworks (or of application
- code that executes tools on behalf of an agent) that can reliably
- bound a single tool call.
- Unlike `gen_ai.agent.invocation.duration` (which is required), this
- metric is only recommended because tools may be executed through
- paths that the agent framework does not observe — for example,
- external MCP servers or application-managed dispatch.
- Instrumentations SHOULD record this metric for every tool execution
- they observe but are not required to capture all tool calls across
- the agentic system.
+ Instrumentation that can reliably bound a single tool call SHOULD
+ record this metric for every tool execution they can observe.

Suggesting a simplification. I believe we're moving away from required and recommended levels for metrics toward on-by-default and off-by-default (open-telemetry/semantic-conventions#3278).

Author

Applied. Good context on the required/recommended direction change — thanks for the heads-up.

Comment thread model/gen-ai/metrics.yaml Outdated
    requirement_level: recommended
  - ref: gen_ai.agent.name
    requirement_level:
      conditionally_required: when available

Member

nit

Suggested change
- conditionally_required: when available
+ conditionally_required: When available.

Author

Applied — and standardized the same capitalization on every conditionally_required note on these metrics.

Comment thread model/gen-ai/metrics.yaml Outdated
Comment on lines +124 to +126
The end-to-end duration of a single agent invocation,
from the moment the agent is invoked to the moment it produces its final
response (or terminates with an error).

Member

nit

Suggested change
- The end-to-end duration of a single agent invocation,
- from the moment the agent is invoked to the moment it produces its final
- response (or terminates with an error).
+ The end-to-end duration of a single agent invocation,
+ from the moment the invocation starts until the agent emits
+ the last chunk of its final response or terminates with an error.

Author

Applied.

Comment thread model/gen-ai/metrics.yaml
stability: development
attributes:
  - ref_group: attributes.gen_ai.error
  - ref: gen_ai.agent.name

Member

invoke agent span has a few more attributes that could be useful on metrics. In particular:

  • agent.id, version, and description
  • request model

I think we should reference them. The agent info will also come through entity in #270

Author

Added gen_ai.agent.id, gen_ai.agent.version, and gen_ai.request.model (all conditionally_required: When/If available.). Skipped gen_ai.agent.description since it tends to be free-form text and risks high cardinality on a metric.

Author

Quick follow-up on this: ended up dropping gen_ai.agent.version and gen_ai.request.model again on a second look. Kept gen_ai.agent.id.

  • gen_ai.agent.version: the intent in this metric set is to use the service.version resource attribute (which is stable, applied process-wide via the resource) for agent/tool/app versioning, rather than carrying a per-signal gen_ai.agent.version dimension on every metric. The registry attribute itself stays (it was added by @trask in a separate PR), it just isn't a dimension on this metric.
  • gen_ai.request.model: not always available at the time this metric is recorded. The framework knows the agent it's invoking, but the model selection can happen later inside the agent (and may differ across the multiple LLM calls within one invocation). The model breakdown is already on the child gen_ai.inference / gen_ai.client.operation.duration signals; adding it here would conflate two different boundaries.

Happy to revisit either if there's a concrete use case I'm missing.

@lmolkova lmolkova Jun 11, 2026

Member

I believe agent version != service version for remote agents, and even when you're hosting an agent, the agent version is NOT the version the user sees: you can update your service without incrementing the agent version because nothing has changed in user-related agent config or behavior.

For the request model, we can document that it's the model provided when the agent was created. Given that pretty much all frameworks take it as a constructor parameter, it seems to be a common and important use case.

They can both be recommended: When available and meaningful / provide information not captured by service.version / etc.

…on metrics

Adds two new GenAI semantic convention metrics for agent and tool latency,
modeled on the recently-added gen_ai.workflow.duration metric:

  * gen_ai.agent.invocation.duration (histogram, seconds): end-to-end
    duration of a single agent invocation, aligned with the existing
    gen_ai.invoke_agent.{client,internal} spans.
  * gen_ai.tool.execution.duration (histogram, seconds): duration of a
    single tool execution, aligned with the existing
    gen_ai.execute_tool.internal span.

Also adds the gen_ai.tool.version attribute, used as a dimension on
gen_ai.tool.execution.duration (mirrors the existing gen_ai.agent.version).

NOTE: docs/registry/ and schema-snapshot/ regeneration via 'make
generate-all' has NOT been run in this commit (no Docker available in
the authoring environment). Run it locally before pushing for review.
* Add `gen_ai.tool.type` to `gen_ai.tool.execution.duration` metric
  (recommended level, matching the `execute_tool` span). Addresses
  MikeGoldsmith feedback.
* Explain in the doc why `gen_ai.tool.execution.duration` is recommended
  while `gen_ai.agent.invocation.duration` is required (tool executions
  may happen via paths the agent framework does not observe — external
  MCP servers, app-managed dispatch, etc.). Addresses MikeGoldsmith
  feedback.
* Drop `gen_ai.tool.version` entirely (from the registry and the tool
  metric attribute list) and drop `gen_ai.agent.version` from the new
  metrics' attribute lists. Tool/agent versions are reasonably covered
  by `service.version` (resource attribute) and instrumentation scope
  version. The ADK Python implementation also declares
  `gen_ai.tool.version` as a constant but doesn't actually record it.
* Apply lmolkova's suggested rewrites to move the per-metric description
  from `note` into `brief`.
* Move the section-level prose (intent paragraph, guidance about when
  to use `gen_ai.client.operation.duration` vs `gen_ai.workflow.duration`,
  tool/agent asymmetry reasoning) into the YAML metric `note` and remove
  the now-duplicated MD prose. Keeps the MD as a thin wrapper while
  letting the autogen block render the full description.
* Scale up the bucket boundaries for `gen_ai.agent.invocation.duration`
  to `[0.1..409.6]` (~6.8 min upper bound) to cover longer multi-step
  agent calls while keeping sub-second resolution for fast ones. Tool
  metric kept at `[0.01..81.92]` since single tool calls are closer to
  LLM-call durations.
Following lmolkova's review and trask's towncrier reminder:

* Rename per open-telemetry#249 (lmolkova confirmed on PR open-telemetry#201 SIG call discussion):
  - gen_ai.agent.invocation.duration -> gen_ai.invoke_agent.duration
  - gen_ai.tool.execution.duration -> gen_ai.execute_tool.duration
  Metric names now align with the operation name on spans
  (gen_ai.invoke_agent, gen_ai.execute_tool).
* Move CHANGELOG entry to a Towncrier fragment per trask's reminder
  about open-telemetry#275: changelog.d/201.enhancement.md.
* Bump gen_ai.agent.name from 'recommended' back to 'conditionally_required:
  When available.' (lmolkova: keep consistent with the internal invoke_agent
  span; entity work in open-telemetry#270 will reshape this later anyway).
* Capitalize 'When available' / 'If available' per open-telemetry#245 sentence-case
  convention on every requirement_level note.
* Apply lmolkova's suggested rewrites on metric briefs/notes:
  - Agent: more concise brief about the invocation start/end.
  - Tool: drop 'performed by or on behalf of a GenAI agent' since
    generic apps (not just agents) can execute tools.
  - Tool note: simplify the requirement statement (drops the
    explicit 'required vs recommended' framing; semconv is moving
    away from those labels for metrics per
    open-telemetry/semantic-conventions#3278).
* Add a few more low-cardinality attributes on invoke_agent.duration
  per lmolkova: gen_ai.agent.id, gen_ai.agent.version, gen_ai.request.model
  (all conditionally_required When/If available). They mirror what the
  invoke_agent span carries and will be reshaped once open-telemetry#270 introduces
  agent entities.
@pvlsirotkin pvlsirotkin force-pushed the agent-invocation-tool-duration-metrics branch from 023135d to f6a2978 Compare June 11, 2026 14:41
@pvlsirotkin

Author

Done — added changelog.d/201.enhancement.md and removed the direct CHANGELOG.md edit. Thanks for the heads-up about #275.

…duration

* gen_ai.agent.version: version information should come from the
  service.version resource attribute, not from a per-signal metric
  dimension. The intent in this metric set is to use service.version
  consistently for agent/tool/app versioning, so dropping the
  per-metric gen_ai.agent.version reference. (The registry attribute
  itself stays since it was added by trask in a separate PR; it just
  isn't a dimension on this metric.)
* gen_ai.request.model: pushing back on adding this. At the
  invoke_agent boundary the framework doesn't always know which model
  will be used yet - the model can be selected later by the agent or
  vary across the multiple LLM calls within an invocation. The
  request.model breakdown is already captured on the child
  gen_ai.inference / gen_ai.client.operation.duration signals; adding
  it here would conflate two different boundaries.

@lmolkova lmolkova left a comment

Member

LGTM, thanks!

Comment thread model/gen-ai/metrics.yaml
    requirement_level: required
  - ref: gen_ai.tool.type
    requirement_level: recommended
  - ref: gen_ai.agent.name

Member

If #270 is merged before this PR, we'll need to add entity associations and adjust the requirement level accordingly. Just leaving a comment so we don't forget.

@@ -0,0 +1 @@
Add `gen_ai.invoke_agent.duration` metric to track the end-to-end duration of a single agent invocation, and `gen_ai.execute_tool.duration` metric to track the duration of a single tool execution.

Member

could you please update PR description and title to reflect new naming? thanks!

4 participants