gen-ai: add evaluation operation name and gen_ai.evaluate.internal span #185

Open
hippoley wants to merge 2 commits into open-telemetry:main from hippoley:gen-ai/evaluation-operation-name

@hippoley

Contributor

Closes / relates to: https://github.com/open-telemetry/semantic-conventions/issues/3398

Problem

The gen_ai.evaluation.result event was merged in #2563 and is already used by multiple reference scenarios (deepeval, azure-ai-evaluation, dspy). However, there is currently no standard span type to carry it and no well-known gen_ai.operation.name value for evaluation operations.

This causes two concrete problems:

  1. Invisible in dashboards — evaluation spans cannot be queried or filtered at the operation level alongside invoke_agent, execute_tool, and chat spans.
  2. Ecosystem fragmentation — OpenSearch's genai-observability-sdk-py already uses gen_ai.operation.name: "evaluation" as a custom value. Without a standard, every instrumentation library invents its own string.
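
To make the two problems concrete, here is a minimal Python sketch of an operation-level query. The span dictionaries and the `count_by_operation` helper are hypothetical stand-ins for whatever a real backend stores, and the `"evaluate"` string is an invented example of a second library's custom value; only the attribute key `gen_ai.operation.name` and the values `chat`, `invoke_agent`, `execute_tool`, and `evaluation` come from the description above.

```python
from collections import Counter

# Hypothetical exported spans, reduced to their attribute maps.
SPANS = [
    {"gen_ai.operation.name": "chat"},
    {"gen_ai.operation.name": "invoke_agent"},
    {"gen_ai.operation.name": "execute_tool"},
    {"gen_ai.operation.name": "evaluation"},  # custom value already used by one SDK
    {"gen_ai.operation.name": "evaluate"},    # hypothetical: another library's own string
    {},                                        # evaluation span emitted without an operation name
]


def count_by_operation(spans):
    """Group spans by gen_ai.operation.name, as an operation-level dashboard would."""
    return Counter(s.get("gen_ai.operation.name", "<unset>") for s in spans)


counts = count_by_operation(SPANS)
for operation, total in sorted(counts.items()):
    print(f"{operation}: {total}")
```

In this sketch the three evaluation spans land in three different buckets, which is the fragmentation a single well-known value is meant to avoid.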

Changes

model/gen-ai/registry.yaml — add evaluation enum member to gen_ai.operation.name:

- id: evaluation
  value: "evaluation"
  brief: 'Evaluation of GenAI output quality, accuracy, or other characteristics'
  note: >
    Used when the operation being instrumented is an evaluation step,
    such as running an LLM-as-judge, a scoring metric, or a benchmark suite.
    The span SHOULD carry one or more gen_ai.evaluation.result events.
  stability: development

model/gen-ai/spans.yaml — add gen_ai.evaluate.internal span:

  • gen_ai.operation.name = evaluation (required)
  • gen_ai.evaluation.name (conditionally required: single named metric)
  • gen_ai.provider.name (conditionally required: when using a judge model)
  • gen_ai.request.model (conditionally required: when using a judge model)
  • gen_ai.conversation.id (conditionally required: conversation-scoped eval)
  • gen_ai.agent.name (conditionally required: agent-scoped eval)

Span name SHOULD be evaluate {gen_ai.evaluation.name} for a single metric, or evaluate when multiple metrics share one span.
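
The naming rule and the attribute list above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the proposed convention: both helper functions, the `prefix` parameter, the metric name `relevance`, and the provider and model values are hypothetical. Only the attribute keys and the `evaluate` / `evaluate {gen_ai.evaluation.name}` pattern are taken from the description.

```python
from typing import Dict, Optional, Sequence


def evaluation_span_name(metric_names: Sequence[str], prefix: str = "evaluate") -> str:
    """Return '<prefix> <metric>' for a single metric, or '<prefix>' when several share one span."""
    if len(metric_names) == 1:
        return f"{prefix} {metric_names[0]}"
    return prefix


def evaluation_span_attributes(
    metric_names: Sequence[str],
    judge_provider: Optional[str] = None,
    judge_model: Optional[str] = None,
    conversation_id: Optional[str] = None,
    agent_name: Optional[str] = None,
) -> Dict[str, str]:
    """Build the attribute map: one required key plus whichever conditional keys apply."""
    attrs = {"gen_ai.operation.name": "evaluation"}  # required
    if len(metric_names) == 1:
        attrs["gen_ai.evaluation.name"] = metric_names[0]  # single named metric
    conditional = {
        "gen_ai.provider.name": judge_provider,      # when using a judge model
        "gen_ai.request.model": judge_model,         # when using a judge model
        "gen_ai.conversation.id": conversation_id,   # conversation-scoped eval
        "gen_ai.agent.name": agent_name,             # agent-scoped eval
    }
    attrs.update({k: v for k, v in conditional.items() if v is not None})
    return attrs


# Hypothetical single-metric evaluation that uses a judge model.
name = evaluation_span_name(["relevance"])
attrs = evaluation_span_attributes(
    ["relevance"],
    judge_provider="example-provider",
    judge_model="example-judge-model",
)
```

Keeping the prefix as a parameter is a design choice of the sketch, so that the helper still works if the span name guidance is adjusted during review.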

Design notes

  • This is a purely additive change; no existing attributes or spans are modified.
  • The span is INTERNAL kind, consistent with gen_ai.execute_tool.internal and gen_ai.plan.internal.
  • A follow-up PR will add a reference scenario demonstrating the span in a benchmark context (deepeval / azure-ai-evaluation).

Verification

# Weaver resolves the updated registry without new errors
cd reference/
uv run run-scenario deepeval   # exit 0, no new violations

@hippoley hippoley requested a review from a team as a code owner May 21, 2026 08:14
Copilot AI review requested due to automatic review settings May 21, 2026 08:14
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 21, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

@github-actions github-actions Bot mentioned this pull request May 21, 2026

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new internal span definition for GenAI evaluation workflows and registers a corresponding gen_ai.operation.name enum value to standardize evaluation instrumentation.

Changes:

  • Introduces gen_ai.evaluate.internal span with guidance on naming, parenting gen_ai.evaluation.result events, and relevant attributes.
  • Adds evaluation as a predefined value for gen_ai.operation.name in the registry.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| model/gen-ai/spans.yaml | Defines a new internal evaluation span and its required/conditional attributes. |
| model/gen-ai/registry.yaml | Registers evaluation as a gen_ai.operation.name value with usage guidance. |

Comment thread model/gen-ai/spans.yaml Outdated
requirement_level:
  conditionally_required: when available

- type: gen_ai.evaluate.internal
Comment thread model/gen-ai/spans.yaml
  Represents an evaluation of GenAI output quality, accuracy, or other
  characteristics. Carries one or more `gen_ai.evaluation.result` events.
note: |
  The `gen_ai.operation.name` SHOULD be `evaluation`.
Comment thread model/gen-ai/registry.yaml Outdated
note: >
  Used when the operation being instrumented is an evaluation step,
  such as running an LLM-as-judge, a scoring metric, or a benchmark suite.
  The span SHOULD carry one or more gen_ai.evaluation.result events.
hippoley added a commit to hippoley/semantic-conventions-genai that referenced this pull request May 21, 2026
Align span type name with operation name value ('evaluation') and
gen_ai.evaluation.result event name. Also update span name guidance
from 'evaluate ...' to 'evaluation ...' for consistency.

Fix backtick formatting for gen_ai.evaluation.result in registry.yaml note.

Addresses Copilot review feedback on PR open-telemetry#185.
@hippoley

Contributor Author

Thanks for the review feedback!

Addressed in the latest commit (9122ae0):

  1. Naming inconsistency: renamed gen_ai.evaluate.internal to gen_ai.evaluation.internal and updated span name guidance from evaluate ... to evaluation ... to align with the operation name value (evaluation) and the existing gen_ai.evaluation.result event name.
  2. Backtick formatting: added backticks around gen_ai.evaluation.result in the registry.yaml note.

@hippoley

Contributor Author

Note on CI: the required-status-check failure is caused by the claude-agent-sdk scenario failing on this branch, which is a pre-existing issue on main unrelated to this PR's changes (this PR only touches model/gen-ai/registry.yaml and model/gen-ai/spans.yaml).

The root cause has been identified and fixed in PR #186: claude-agent-sdk 0.2.x changed the AssistantMessage API (fields moved from message.message.* to direct attributes), causing scenario.py to silently skip setting the response attributes. PR #186 now includes the fix.

hippoley added 2 commits May 26, 2026 10:52
Add 'evaluation' as a well-known value for gen_ai.operation.name and
introduce the gen_ai.evaluate.internal span type to carry
gen_ai.evaluation.result events.

Motivation (issue #3398):
- The gen_ai.evaluation.result event exists but there is no standard
  span type to carry it. Without a well-known operation name, evaluation
  spans are invisible to operation-level queries and dashboards.
- OpenSearch genai-observability-sdk-py already uses
  gen_ai.operation.name='evaluation' as a custom value; standardising
  it prevents ecosystem fragmentation.

Changes:
- registry.yaml: add 'evaluation' enum member to gen_ai.operation.name
- spans.yaml: add gen_ai.evaluate.internal span with required
  gen_ai.operation.name, conditionally required gen_ai.evaluation.name,
  gen_ai.provider.name, gen_ai.request.model, gen_ai.conversation.id,
  and gen_ai.agent.name attributes

Relates to: https://github.com/open-telemetry/semantic-conventions/issues/3398

Align span type name with operation name value ('evaluation') and
gen_ai.evaluation.result event name. Also update span name guidance
from 'evaluate ...' to 'evaluation ...' for consistency.

Fix backtick formatting for gen_ai.evaluation.result in registry.yaml note.

Addresses Copilot review feedback on PR open-telemetry#185.

  value: "delete_memory_store"
  brief: 'Delete or deprovision a memory store'
  stability: development
- id: evaluation

Member

I think we need a verb here.

  value: "evaluation"
  brief: 'Evaluation of GenAI output quality, accuracy, or other characteristics'
  note: >
    Used when the operation being instrumented is an evaluation step,

Member

I'm wondering which instrumentation should generate this kind of span? A kind of evaluation framework? Could you show a prototype for this?

@singankit

Contributor

@hippoley The idea behind modeling the evaluation result as an event was that it can be linked, via traceId and spanId, to any span, be it an invoke_agent span, a chat span, etc.
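
A rough Python sketch of that linking model, for readers following the thread: an evaluation result event carries the trace and span identifiers of the span it evaluates, so it can attach to any span type. The `SpanRef` and `EvaluationResultEvent` classes and the `results_for_span` helper are hypothetical and do not correspond to any SDK API; the identifiers and the metric name `relevance` are example values. Only the event name `gen_ai.evaluation.result` and the attribute key `gen_ai.evaluation.name` appear in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SpanRef:
    """Identifies the span being evaluated (hypothetical helper type)."""
    trace_id: str
    span_id: str


@dataclass
class EvaluationResultEvent:
    """An evaluation result that points at the evaluated span instead of living on its own span."""
    target: SpanRef
    attributes: Dict[str, str] = field(default_factory=dict)
    name: str = "gen_ai.evaluation.result"


def results_for_span(events: List[EvaluationResultEvent], span: SpanRef) -> List[EvaluationResultEvent]:
    """Return every evaluation result linked to the given span via trace_id and span_id."""
    return [e for e in events if e.target == span]


# Example identifiers only; a chat span and an invoke_agent span in the same trace.
chat_span = SpanRef("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
agent_span = SpanRef("4bf92f3577b34da6a3ce929d0e0e4736", "53995c3f42cd8ad8")

events = [
    EvaluationResultEvent(chat_span, {"gen_ai.evaluation.name": "relevance"}),
    EvaluationResultEvent(agent_span, {"gen_ai.evaluation.name": "relevance"}),
]
chat_results = results_for_span(events, chat_span)
```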

Comment thread model/gen-ai/spans.yaml
- ref: gen_ai.provider.name
  requirement_level:
    conditionally_required: when the evaluation uses a GenAI judge model
- ref: gen_ai.request.model

Contributor

This can differ for each metric being evaluated, so having this attribute on the evaluation span does not seem relevant.

Comment thread model/gen-ai/spans.yaml
- ref: gen_ai.request.model
  requirement_level:
    conditionally_required: when the evaluation uses a GenAI judge model
- ref: gen_ai.conversation.id

Contributor

The same feedback as for the request.model attribute applies to this attribute too.

@singankit

singankit commented Jun 4, 2026

Contributor

Could you please clarify the following questions:

  • How will the evaluation span be associated with the operation being evaluated?
  • What is the evaluation span supposed to capture: the process of evaluation, the evaluation results, or both?
  • Currently, evaluation results can be associated with the span being evaluated. After this change, will an evaluation result be associated with both the evaluation span and the span being evaluated?

@otelbot

otelbot Bot commented Jun 5, 2026

Contributor

Copilot has reviewed this PR. Copilot's suggestions aren't always correct or applicable, so please evaluate each comment on its merits and then handle it in one of these ways:

  • click GitHub's "Apply suggestion" button (auto-resolves the thread);
  • reply that it was applied (ideally linking to the commit); or
  • reply that it was not applied, with the reason; or
  • reply with a question for reviewers.

Automation flags a PR for human review once every Copilot comment has a reply or is marked as resolved, so keeping these threads up to date helps reviewers know when the PR is ready.

Status across open PRs is visible on the pull request dashboard.
