gen-ai: add evaluation operation name and gen_ai.evaluate.internal span by hippoley · Pull Request #185 · open-telemetry/semantic-conventions-genai

hippoley · 2026-05-21T08:14:48Z

Closes / relates to: https://github.com/open-telemetry/semantic-conventions/issues/3398

Problem

The gen_ai.evaluation.result event was merged in #2563 and is already used by multiple reference scenarios (deepeval, azure-ai-evaluation, dspy). However, there is currently no standard span type to carry it, and no well-known gen_ai.operation.name value for evaluation operations.

This causes two concrete problems:

Invisible in dashboards — evaluation spans cannot be queried or filtered at the operation level alongside invoke_agent, execute_tool, and chat spans.
Ecosystem fragmentation — OpenSearch's genai-observability-sdk-py already uses gen_ai.operation.name: "evaluation" as a custom value. Without a standard, every instrumentation library invents its own string.
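
The effect of both problems on an operation-level query can be sketched in plain Python, without any OpenTelemetry dependency. The span records below are made up for illustration, and the non-standard value `eval` is a hypothetical example of a string another library might pick; only `chat`, `invoke_agent`, and `evaluation` come from this description.

```python
from collections import Counter

# Hypothetical exported spans, reduced to the attribute that matters here.
SPANS = [
    {"name": "chat", "attributes": {"gen_ai.operation.name": "chat"}},
    {"name": "invoke_agent", "attributes": {"gen_ai.operation.name": "invoke_agent"}},
    # Three evaluation spans emitted by three imaginary libraries:
    {"name": "library A eval", "attributes": {"gen_ai.operation.name": "evaluation"}},
    {"name": "library B eval", "attributes": {"gen_ai.operation.name": "eval"}},
    {"name": "library C eval", "attributes": {}},  # no operation name at all
]


def spans_for_operation(spans, operation):
    """Return the spans whose gen_ai.operation.name equals `operation`."""
    return [
        span
        for span in spans
        if span["attributes"].get("gen_ai.operation.name") == operation
    ]


def operation_counts(spans):
    """Count spans per operation name, bucketing missing values as '<unset>'."""
    return Counter(
        span["attributes"].get("gen_ai.operation.name", "<unset>") for span in spans
    )


if __name__ == "__main__":
    # Only one of the three evaluation spans matches the query.
    print(len(spans_for_operation(SPANS, "evaluation")))
    print(operation_counts(SPANS))
```

With a single well-known value, the filter for `evaluation` would match all three evaluation spans instead of one.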

Changes

model/gen-ai/registry.yaml — add evaluation enum member to gen_ai.operation.name:

- id: evaluation
  value: "evaluation"
  brief: 'Evaluation of GenAI output quality, accuracy, or other characteristics'
  note: >
    Used when the operation being instrumented is an evaluation step,
    such as running an LLM-as-judge, a scoring metric, or a benchmark suite.
    The span SHOULD carry one or more gen_ai.evaluation.result events.
  stability: development

model/gen-ai/spans.yaml — add gen_ai.evaluate.internal span:

gen_ai.operation.name = evaluation (required)
gen_ai.evaluation.name (conditionally required: single named metric)
gen_ai.provider.name (conditionally required: when using a judge model)
gen_ai.request.model (conditionally required: when using a judge model)
gen_ai.conversation.id (conditionally required: conversation-scoped eval)
gen_ai.agent.name (conditionally required: agent-scoped eval)
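
As a rough illustration of the requirement levels listed above, the following stdlib-only sketch assembles the attribute set for one evaluation span. The function name and its parameters are hypothetical, as are the sample values; the attribute keys are the ones from the list.

```python
from typing import Dict, Optional


def evaluation_span_attributes(
    metric_name: Optional[str] = None,
    judge_provider: Optional[str] = None,
    judge_model: Optional[str] = None,
    conversation_id: Optional[str] = None,
    agent_name: Optional[str] = None,
) -> Dict[str, str]:
    """Build the attributes for an evaluation span (illustrative helper)."""
    # Required on every evaluation span.
    attributes = {"gen_ai.operation.name": "evaluation"}

    # Conditionally required attributes: set only when the condition applies,
    # which this sketch models as "the caller passed a value".
    conditional = {
        "gen_ai.evaluation.name": metric_name,          # single named metric
        "gen_ai.provider.name": judge_provider,         # judge model in use
        "gen_ai.request.model": judge_model,            # judge model in use
        "gen_ai.conversation.id": conversation_id,      # conversation-scoped eval
        "gen_ai.agent.name": agent_name,                # agent-scoped eval
    }
    attributes.update(
        {key: value for key, value in conditional.items() if value is not None}
    )
    return attributes


if __name__ == "__main__":
    # A rule-based metric with no judge model: only two attributes are set.
    print(evaluation_span_attributes(metric_name="relevance"))
```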

Span name SHOULD be evaluate {gen_ai.evaluation.name} for a single metric, or evaluate when multiple metrics share one span.
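
The naming rule can be expressed as a small helper. This is a sketch only: the function name and the `prefix` parameter are hypothetical, and the metric names are sample values. The default prefix follows the wording above; a later commit in this PR changes the guidance from `evaluate ...` to `evaluation ...`, which the parameter accommodates.

```python
from typing import Sequence


def evaluation_span_name(metric_names: Sequence[str], prefix: str = "evaluate") -> str:
    """Return the span name for a span covering the given metrics."""
    if len(metric_names) == 1:
        # Single metric: include gen_ai.evaluation.name in the span name.
        return f"{prefix} {metric_names[0]}"
    # Several metrics share one span (or none is named): use the bare prefix.
    return prefix


if __name__ == "__main__":
    print(evaluation_span_name(["relevance"]))
    print(evaluation_span_name(["relevance", "toxicity"]))
```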

Design notes

This is a purely additive change; no existing attributes or spans are modified.
The span is INTERNAL kind, consistent with gen_ai.execute_tool.internal and gen_ai.plan.internal.
A follow-up PR will add a reference scenario demonstrating the span in a benchmark context (deepeval / azure-ai-evaluation).

Verification

# Weaver resolves the updated registry without new errors
cd reference/
uv run run-scenario deepeval   # exit 0, no new violations

linux-foundation-easycla · 2026-05-21T08:14:57Z

The committers listed above are authorized under a signed CLA.

✅ login: hippoley / name: hippoley (042b913, 9122ae0)

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new internal span definition for GenAI evaluation workflows and registers a corresponding gen_ai.operation.name enum value to standardize evaluation instrumentation.

Changes:

Introduces gen_ai.evaluate.internal span with guidance on naming, parenting gen_ai.evaluation.result events, and relevant attributes.
Adds evaluation as a predefined value for gen_ai.operation.name in the registry.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
model/gen-ai/spans.yaml	Defines a new internal evaluation span and its required/conditional attributes.
model/gen-ai/registry.yaml	Registers `evaluation` as a `gen_ai.operation.name` value with usage guidance.

        requirement_level:
          conditionally_required: when available

+  - type: gen_ai.evaluate.internal


+      Represents an evaluation of GenAI output quality, accuracy, or other
+      characteristics. Carries one or more `gen_ai.evaluation.result` events.
+    note: |
+      The `gen_ai.operation.name` SHOULD be `evaluation`.


+          note: >
+            Used when the operation being instrumented is an evaluation step,
+            such as running an LLM-as-judge, a scoring metric, or a benchmark suite.
+            The span SHOULD carry one or more gen_ai.evaluation.result events.


Align span type name with operation name value ('evaluation') and gen_ai.evaluation.result event name. Also update span name guidance from 'evaluate ...' to 'evaluation ...' for consistency. Fix backtick formatting for gen_ai.evaluation.result in registry.yaml note. Addresses Copilot review feedback on PR open-telemetry#185.

hippoley · 2026-05-21T10:43:32Z

Thanks for the review feedback!

Addressed in the latest commit (9122ae0):

Naming inconsistency — renamed gen_ai.evaluate.internal → gen_ai.evaluation.internal and updated span name guidance from evaluate ... → evaluation ... to align with the operation name value (evaluation) and the existing gen_ai.evaluation.result event name.
Backtick formatting — added backticks around gen_ai.evaluation.result in the registry.yaml note.

hippoley · 2026-05-21T16:33:49Z

Note on CI: the required-status-check failure is caused by the claude-agent-sdk scenario failing on this branch, which is a pre-existing issue on main unrelated to this PR's changes (this PR only touches model/gen-ai/registry.yaml and model/gen-ai/spans.yaml).

The root cause has been identified and fixed in PR #186: claude-agent-sdk 0.2.x changed the AssistantMessage API (fields moved from message.message.* to direct attributes), causing scenario.py to silently skip setting the response attributes. PR #186 now includes the fix.
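
The kind of silent skip described above can be reproduced without the SDK. In the sketch below the objects are stand-ins built with `SimpleNamespace`, not the real claude-agent-sdk types, and `model` is a hypothetical example of a field that moved; it is not necessarily how PR #186 implements the fix.

```python
from types import SimpleNamespace

_MISSING = object()


def read_field(message, name):
    """Read `name` from the message itself, falling back to message.message.

    Raises AttributeError instead of returning None, so a changed message
    shape fails loudly rather than silently skipping an attribute.
    """
    value = getattr(message, name, _MISSING)
    if value is _MISSING:
        nested = getattr(message, "message", None)
        value = getattr(nested, name, _MISSING)
    if value is _MISSING:
        raise AttributeError(f"{name!r} not found on message or message.message")
    return value


# Stand-ins for the two shapes: nested fields versus direct attributes.
old_shape = SimpleNamespace(message=SimpleNamespace(model="judge-model"))
new_shape = SimpleNamespace(model="judge-model")

if __name__ == "__main__":
    print(read_field(old_shape, "model"))
    print(read_field(new_shape, "model"))
```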

Add 'evaluation' as a well-known value for gen_ai.operation.name and introduce the gen_ai.evaluate.internal span type to carry gen_ai.evaluation.result events.

Motivation (issue #3398):
- The gen_ai.evaluation.result event exists but there is no standard span type to carry it. Without a well-known operation name, evaluation spans are invisible to operation-level queries and dashboards.
- OpenSearch genai-observability-sdk-py already uses gen_ai.operation.name='evaluation' as a custom value; standardising it prevents ecosystem fragmentation.

Changes:
- registry.yaml: add 'evaluation' enum member to gen_ai.operation.name
- spans.yaml: add gen_ai.evaluate.internal span with required gen_ai.operation.name, conditionally required gen_ai.evaluation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.conversation.id, and gen_ai.agent.name attributes

Relates to: https://github.com/open-telemetry/semantic-conventions/issues/3398

Cirilla-zmh · 2026-05-27T03:19:43Z

          value: "delete_memory_store"
          brief: 'Delete or deprovision a memory store'
          stability: development
+        - id: evaluation


I think we need a verb here.

Cirilla-zmh · 2026-05-27T03:22:49Z

+          value: "evaluation"
+          brief: 'Evaluation of GenAI output quality, accuracy, or other characteristics'
+          note: >
+            Used when the operation being instrumented is an evaluation step,


Which instrumentation is expected to generate this kind of span? An evaluation framework of some kind? Could you show a prototype for this?

singankit · 2026-05-27T03:27:27Z

@hippoley The idea behind modeling the evaluation result as an event was that it can be linked, via traceId and spanId, to any span, whether that is an invoke_agent span, a chat span, or another kind.
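
The linkage described in this comment can be sketched with plain data classes. Everything here is illustrative: the class names, field names, IDs, and scores are hypothetical and do not mirror the real event schema; the point is only that a result event carrying a trace ID and span ID can be joined back to whichever span was evaluated.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SpanRecord:
    trace_id: str
    span_id: str
    operation_name: str  # e.g. "chat" or "invoke_agent"


@dataclass(frozen=True)
class EvaluationResultEvent:
    trace_id: str   # trace of the span being evaluated
    span_id: str    # the span being evaluated
    evaluation_name: str
    score: float


def results_by_span(
    events: List[EvaluationResultEvent],
) -> Dict[Tuple[str, str], List[EvaluationResultEvent]]:
    """Group evaluation result events by the (trace_id, span_id) they refer to."""
    grouped: Dict[Tuple[str, str], List[EvaluationResultEvent]] = {}
    for event in events:
        grouped.setdefault((event.trace_id, event.span_id), []).append(event)
    return grouped


chat_span = SpanRecord(trace_id="t1", span_id="s1", operation_name="chat")
agent_span = SpanRecord(trace_id="t1", span_id="s2", operation_name="invoke_agent")
events = [
    EvaluationResultEvent("t1", "s1", "relevance", 0.9),
    EvaluationResultEvent("t1", "s1", "toxicity", 0.0),
    EvaluationResultEvent("t1", "s2", "task_completion", 1.0),
]

if __name__ == "__main__":
    grouped = results_by_span(events)
    for span in (chat_span, agent_span):
        names = [e.evaluation_name for e in grouped[(span.trace_id, span.span_id)]]
        print(span.operation_name, names)
```

In this model no dedicated evaluation span is needed for the join; whether one should exist in addition is the open question of this review thread.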

singankit · 2026-06-04T23:08:00Z

+      - ref: gen_ai.provider.name
+        requirement_level:
+          conditionally_required: when the evaluation uses a GenAI judge model
+      - ref: gen_ai.request.model


This can differ for each metric being evaluated, so having this attribute on the evaluation span does not seem relevant.

singankit · 2026-06-04T23:08:47Z

+      - ref: gen_ai.request.model
+        requirement_level:
+          conditionally_required: when the evaluation uses a GenAI judge model
+      - ref: gen_ai.conversation.id


The same feedback as for the request.model attribute applies to this attribute too.

singankit · 2026-06-04T23:20:48Z

Could you please clarify the following questions:

How will the evaluation span be associated with the operation being evaluated?
What is the evaluation span supposed to capture: the process of evaluation, the evaluation results, or both?
Currently, evaluation results can be associated with the span being evaluated. After this change, will an evaluation result be associated with both the evaluation span and the span being evaluated?

otelbot · 2026-06-05T22:25:15Z

Copilot has reviewed this PR. Copilot's suggestions aren't always correct or applicable, so please evaluate each comment on its merits and then handle it in one of these ways:

click GitHub's "Apply suggestion" button (auto-resolves the thread);
reply that it was applied (ideally linking to the commit); or
reply that it was not applied, with the reason; or
reply with a question for reviewers.

Automation flags a PR for human review once every Copilot comment has a reply or is marked as resolved, so keeping these threads up to date helps reviewers know when the PR is ready.

Status across open PRs is visible on the pull request dashboard.

hippoley requested a review from a team as a code owner May 21, 2026 08:14

Copilot AI review requested due to automatic review settings May 21, 2026 08:14

github-actions Bot mentioned this pull request May 21, 2026

Pull Request Dashboard #102

Closed

Copilot AI reviewed May 21, 2026

hippoley mentioned this pull request May 21, 2026

# Semantic Conventions for GenAI Evaluation organization into Experiments and Test Cases #79

Open

hippoley added 2 commits May 26, 2026 10:52

hippoley force-pushed the gen-ai/evaluation-operation-name branch from 9122ae0 to ea91a2c Compare May 26, 2026 02:52

hippoley mentioned this pull request May 26, 2026

Add workflow node convention #188

Open

3 tasks

lmolkova added this to GenAI Semantic Conventions and Instrumentation libraries May 26, 2026

lmolkova moved this to In Progress in GenAI Semantic Conventions and Instrumentation libraries May 26, 2026

github-actions Bot mentioned this pull request May 26, 2026

Pull Request Dashboard #196

Closed

Cirilla-zmh reviewed May 27, 2026

github-actions Bot mentioned this pull request May 27, 2026

Pull Request Dashboard #204

Open

singankit reviewed Jun 4, 2026

trask mentioned this pull request Jun 5, 2026

Add Reviewers column to PR dashboard #251

Merged

hippoley wants to merge 2 commits into open-telemetry:main from hippoley:gen-ai/evaluation-operation-name

5 participants
