gen-ai: add evaluation operation name and gen_ai.evaluate.internal span #185
hippoley wants to merge 2 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new internal span definition for GenAI evaluation workflows and registers a corresponding gen_ai.operation.name enum value to standardize evaluation instrumentation.
Changes:
- Introduces the `gen_ai.evaluate.internal` span with guidance on naming, parenting `gen_ai.evaluation.result` events, and relevant attributes.
- Adds `evaluation` as a predefined value for `gen_ai.operation.name` in the registry.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| model/gen-ai/spans.yaml | Defines a new internal evaluation span and its required/conditional attributes. |
| model/gen-ai/registry.yaml | Registers evaluation as a gen_ai.operation.name value with usage guidance. |
| requirement_level: | ||
| conditionally_required: when available | ||
|
|
||
| - type: gen_ai.evaluate.internal |
| Represents an evaluation of GenAI output quality, accuracy, or other | ||
| characteristics. Carries one or more `gen_ai.evaluation.result` events. | ||
| note: | | ||
| The `gen_ai.operation.name` SHOULD be `evaluation`. |
| note: > | ||
| Used when the operation being instrumented is an evaluation step, | ||
| such as running an LLM-as-judge, a scoring metric, or a benchmark suite. | ||
| The span SHOULD carry one or more gen_ai.evaluation.result events. |
Align span type name with operation name value ('evaluation') and
gen_ai.evaluation.result event name. Also update span name guidance
from 'evaluate ...' to 'evaluation ...' for consistency.
Fix backtick formatting for gen_ai.evaluation.result in registry.yaml note.
Addresses Copilot review feedback on PR open-telemetry#185.
|
Thanks for the review feedback! Addressed in the latest commit (9122ae0):
|
|
Note on CI: the The root cause has been identified and fixed in PR #186: |
Add 'evaluation' as a well-known value for gen_ai.operation.name and introduce the gen_ai.evaluate.internal span type to carry gen_ai.evaluation.result events.
Motivation (issue #3398):
- The gen_ai.evaluation.result event exists but there is no standard span type to carry it. Without a well-known operation name, evaluation spans are invisible to operation-level queries and dashboards.
- OpenSearch genai-observability-sdk-py already uses gen_ai.operation.name='evaluation' as a custom value; standardising it prevents ecosystem fragmentation.
Changes:
- registry.yaml: add 'evaluation' enum member to gen_ai.operation.name
- spans.yaml: add gen_ai.evaluate.internal span with required gen_ai.operation.name, conditionally required gen_ai.evaluation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.conversation.id, and gen_ai.agent.name attributes
Relates to: https://github.com/open-telemetry/semantic-conventions/issues/3398
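The attribute set named in this commit message can be sketched as a small helper. This is only an illustration, assuming plain dictionaries rather than any particular SDK; the function name `evaluation_span_attributes` and its parameter names are hypothetical, and the conditions for the optional attributes follow the requirement levels quoted elsewhere in this thread.

```python
from typing import Dict, Optional


def evaluation_span_attributes(
    evaluation_name: Optional[str] = None,
    judge_provider: Optional[str] = None,
    judge_model: Optional[str] = None,
    conversation_id: Optional[str] = None,
    agent_name: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: build the attributes for one evaluation span."""
    # Required on every evaluation span.
    attrs = {"gen_ai.operation.name": "evaluation"}
    # Conditionally required: set only when the value is known.
    conditional = {
        "gen_ai.evaluation.name": evaluation_name,   # single named metric
        "gen_ai.provider.name": judge_provider,      # judge model in use
        "gen_ai.request.model": judge_model,         # judge model in use
        "gen_ai.conversation.id": conversation_id,   # conversation-scoped eval
        "gen_ai.agent.name": agent_name,             # agent-scoped eval
    }
    attrs.update({k: v for k, v in conditional.items() if v is not None})
    return attrs


# Example values below are made up for illustration.
print(evaluation_span_attributes(evaluation_name="relevance", agent_name="support-bot"))
```

Leaving out unknown values, rather than emitting empty strings, keeps the required `gen_ai.operation.name` as the only attribute guaranteed to be present.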
9122ae0 to ea91a2c
| value: "delete_memory_store" | ||
| brief: 'Delete or deprovision a memory store' | ||
| stability: development | ||
| - id: evaluation |
I think we need a verb here.
| value: "evaluation" | ||
| brief: 'Evaluation of GenAI output quality, accuracy, or other characteristics' | ||
| note: > | ||
| Used when the operation being instrumented is an evaluation step, |
I'm wondering which instrumentation should generate this kind of span. Some kind of evaluation framework? Could you show a prototype for this?
|
@hippoley The idea behind modeling the evaluation result as an event was that it can be linked via traceId and spanId to any span, be it an `invoke_agent` span, a `chat` span, etc.
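The linking model described in this comment can be illustrated with a minimal sketch. It assumes plain dataclasses instead of a real tracing SDK; `SpanRef`, `EvaluationResultEvent`, the helper functions, and the metric names are all hypothetical and exist only to show that an event needs nothing more than a trace id and a span id to point at the span it evaluates.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SpanRef:
    """Hypothetical stand-in for a span context (not an OpenTelemetry type)."""
    name: str
    trace_id: str
    span_id: str


@dataclass(frozen=True)
class EvaluationResultEvent:
    """Hypothetical record of one gen_ai.evaluation.result event."""
    trace_id: str
    span_id: str
    attributes: Dict[str, object] = field(default_factory=dict)


def new_span(name: str, trace_id: Optional[str] = None) -> SpanRef:
    # 16-byte trace id and 8-byte span id, hex encoded; pass trace_id to stay in one trace.
    return SpanRef(name, trace_id or secrets.token_hex(16), secrets.token_hex(8))


def evaluation_event_for(span: SpanRef, metric: str) -> EvaluationResultEvent:
    # The event is tied to the evaluated span only through its identifiers.
    return EvaluationResultEvent(span.trace_id, span.span_id, {"gen_ai.evaluation.name": metric})


def events_for_span(events: List[EvaluationResultEvent], span: SpanRef) -> List[EvaluationResultEvent]:
    # Look up every evaluation result recorded against one span.
    return [e for e in events if (e.trace_id, e.span_id) == (span.trace_id, span.span_id)]


chat = new_span("chat")
agent = new_span("invoke_agent", trace_id=chat.trace_id)
events = [
    evaluation_event_for(chat, "relevance"),
    evaluation_event_for(agent, "task_completion"),
    evaluation_event_for(agent, "tool_accuracy"),
]
print([e.attributes["gen_ai.evaluation.name"] for e in events_for_span(events, agent)])
```

In this model no dedicated evaluation span is needed to find the results for a given `chat` or `invoke_agent` span, which is the point the comment raises against the proposal.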
| - ref: gen_ai.provider.name | ||
| requirement_level: | ||
| conditionally_required: when the evaluation uses a GenAI judge model | ||
| - ref: gen_ai.request.model |
This can differ for each metric being evaluated, so having this attribute on the evaluation span does not seem relevant.
| - ref: gen_ai.request.model | ||
| requirement_level: | ||
| conditionally_required: when the evaluation uses a GenAI judge model | ||
| - ref: gen_ai.conversation.id |
The same feedback as for the `gen_ai.request.model` attribute applies to this attribute too.
|
Can you please help provide clarity on the following questions:
|
|
Closes / relates to: https://github.com/open-telemetry/semantic-conventions/issues/3398
Problem
The `gen_ai.evaluation.result` event was merged in #2563 and is already used by multiple reference scenarios (deepeval, azure-ai-evaluation, dspy). However, there is currently no standard span type to carry it, and no well-known `gen_ai.operation.name` value for evaluation operations.
This causes two concrete problems:
- […] `invoke_agent`, `execute_tool`, and `chat` spans.
- OpenSearch genai-observability-sdk-py already uses `gen_ai.operation.name: "evaluation"` as a custom value. Without a standard, every instrumentation library invents its own string.
Changes
`model/gen-ai/registry.yaml`: add the `evaluation` enum member to `gen_ai.operation.name`.
`model/gen-ai/spans.yaml`: add the `gen_ai.evaluate.internal` span:
- `gen_ai.operation.name` = `evaluation` (required)
- `gen_ai.evaluation.name` (conditionally required: single named metric)
- `gen_ai.provider.name` (conditionally required: when using a judge model)
- `gen_ai.request.model` (conditionally required: when using a judge model)
- `gen_ai.conversation.id` (conditionally required: conversation-scoped eval)
- `gen_ai.agent.name` (conditionally required: agent-scoped eval)
Span name SHOULD be `evaluate {gen_ai.evaluation.name}` for a single metric, or `evaluate` when multiple metrics share one span.
Design notes
- `INTERNAL` kind, consistent with `gen_ai.execute_tool.internal` and `gen_ai.plan.internal`.
Verification
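As an illustration separate from the PR's own verification steps, the span-naming guidance in this description can be sketched as a small helper. The function name `evaluation_span_name` and the `prefix` parameter are hypothetical. The default prefix follows the description's `evaluate`; a later commit message in this thread mentions changing the guidance to `evaluation ...`, which the parameter can accommodate.

```python
from typing import Sequence


def evaluation_span_name(metric_names: Sequence[str], prefix: str = "evaluate") -> str:
    """Hypothetical helper: derive the span name from the metrics sharing one span."""
    if len(metric_names) == 1:
        # Single metric: "<prefix> {gen_ai.evaluation.name}".
        return f"{prefix} {metric_names[0]}"
    # Several metrics (or none named) share the span: use the bare prefix.
    return prefix


# Metric names below are made up for illustration.
print(evaluation_span_name(["relevance"]))
print(evaluation_span_name(["relevance", "toxicity"]))
print(evaluation_span_name(["relevance"], prefix="evaluation"))
```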