Add experimental GenAI context selection event #190
caioribeiroclw-pixel wants to merge 10 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new GenAI semantic convention event to report privacy-preserving context selection telemetry (candidate/selected/suppressed and delivered-hash counts), along with the supporting attributes and documentation.
Changes:
- Add `gen_ai.context.selection.evaluated` event to the GenAI events registry.
- Introduce new `gen_ai.context.selection.*` attributes for context selection counts and reasoning.
- Regenerate schema snapshot + update registry docs and changelog.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| schema-snapshot/registry.yaml | Updates the generated snapshot to include the new event/attributes. |
| model/gen-ai/registry.yaml | Adds new gen_ai.context.selection.* attribute definitions. |
| model/gen-ai/events.yaml | Defines the new gen_ai.context.selection.evaluated event and its attribute refs/requirements. |
| docs/registry/attributes/gen-ai.md | Documents the new attributes and renumbers footnotes. |
| docs/gen-ai/gen-ai-events.md | Documents the new event in the events reference page. |
| CHANGELOG.md | Notes the new event in the Unreleased section. |
Comments suppressed due to low confidence (1)
docs/registry/attributes/gen-ai.md:1
- In the description column, the footnote marker is appended without punctuation (`agent [24]`). For consistency with other rows (which generally use `. [n]`), consider updating this to `agent. [24]`.
<!-- NOTE: THIS FILE IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
brief: The implementation-specific reason or policy that selected and suppressed context inputs.
note: >
  The value SHOULD have low cardinality. Examples include `budget`, `relevance`, `dedupe`,
  `target_agent`, `policy`, and `unknown`.
examples: ["budget", "relevance"]
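The low-cardinality guidance in the note above could be enforced before the attribute is recorded. A minimal sketch, assuming the six example values are treated as a fixed allow-list (the note lists them only as examples, not as a closed enum) and using a hypothetical helper named `normalize_reason`:

```python
# Assumption: the example values from the attribute note are used as an allow-list.
ALLOWED_REASONS = frozenset(
    {"budget", "relevance", "dedupe", "target_agent", "policy", "unknown"}
)


def normalize_reason(raw):
    """Map a free-form reason onto the low-cardinality set.

    Anything that is not a known value (including non-strings) becomes
    "unknown", so free text or per-item detail never reaches telemetry.
    """
    if not isinstance(raw, str):
        return "unknown"
    value = raw.strip().lower()
    return value if value in ALLOWED_REASONS else "unknown"
```

Collapsing unrecognized values to `unknown` keeps the attribute usable as a metric dimension, at the cost of losing implementation-specific detail.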
## Unreleased

- Add experimental `gen_ai.context.selection.evaluated` event for privacy-preserving context selection counts.
This is exactly the kind of observability primitive we need for RAG-heavy agent benchmarks. In our EnterpriseAgentBench setup we run agents on RAG tasks (document QA, knowledge retrieval) and one of the hardest questions to answer post-hoc is: "did the agent retrieve too many candidates and then silently drop the relevant ones, or did it never retrieve them at all?"

A few thoughts from the benchmark instrumentation angle:

Overall strongly supportive of this direction. Happy to add a reference scenario for a RAG-style retrieval pipeline if that would help validate the event shape.
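The post-hoc question above can be phrased as a three-way classification. A rough sketch, assuming a benchmark harness that knows the ground-truth relevant document ID and can see candidate and selected IDs internally (the event itself carries counts only); `RetrievalOutcome` and `classify` are hypothetical names, not part of the proposal:

```python
from enum import Enum


class RetrievalOutcome(Enum):
    DELIVERED = "delivered"
    DROPPED_AFTER_RETRIEVAL = "dropped_after_retrieval"
    NEVER_RETRIEVED = "never_retrieved"


def classify(relevant_id, candidate_ids, selected_ids):
    """Say where a known-relevant item was lost in the selection pipeline."""
    if relevant_id in selected_ids:
        # The relevant item reached the model context.
        return RetrievalOutcome.DELIVERED
    if relevant_id in candidate_ids:
        # Retrieved as a candidate, then suppressed by the selection step.
        return RetrievalOutcome.DROPPED_AFTER_RETRIEVAL
    # The retriever never surfaced the item at all.
    return RetrievalOutcome.NEVER_RETRIEVED
```

For example, `classify("d3", {"d1", "d2", "d3"}, {"d1"})` returns `RetrievalOutcome.DROPPED_AFTER_RETRIEVAL`, which is the "retrieved too many and dropped the relevant one" case.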
4f8429c to e34a9d2
Thanks — this is a very useful validation point, especially the EnterpriseAgentBench RAG case. I pushed a small update in
On the event namespace: I kept

A reference scenario for a RAG-style pipeline would be great. The concrete benchmark question you gave ("retrieved too many and dropped the relevant one vs never retrieved it") is exactly the acceptance case I think this event should cover.

Local validation I could run here:

I could not run
Follow-up to make the EnterpriseAgentBench/RAG case executable rather than only described in prose:
CI is green again on
Hi @caioribeiroclw-pixel, can you please fill out the PR template (https://raw.githubusercontent.com/open-telemetry/semantic-conventions-genai/refs/heads/main/.github/PULL_REQUEST_TEMPLATE.md) and sign the CLA?
Copilot has reviewed this PR. Copilot's suggestions aren't always correct or applicable, so please evaluate each comment on its merits and then handle it in one of these ways:
Automation flags a PR for human review once every Copilot comment has a reply or is marked as resolved, so keeping these threads up to date helps reviewers know when the PR is ready. Status across open PRs is visible on the pull request dashboard.
Hi! As part of #275, this repository switched to Towncrier changelog fragments to reduce merge conflicts in

Please move this PR's changelog entry out of

Create

Add experimental `gen_ai.context.selection.evaluated` event for privacy-preserving context selection counts.

After adding the fragment, please remove this PR's direct edit to

Thanks!
Updated for the Towncrier migration request:
Local checks:

git diff --check
python3 - <<'PY'
from pathlib import Path
frag = Path('changelog.d/190.enhancement.md')
assert frag.exists()
assert 'gen_ai.context.selection.evaluated' in frag.read_text()
assert '- Add experimental `gen_ai.context.selection.evaluated` event' not in Path('CHANGELOG.md').read_text().split('### 🛑 Breaking changes 🛑')[0]
print('fragment-ok')
PY
Summary
Adds an experimental `gen_ai.context.selection.evaluated` event for privacy-preserving context selection counts.

This is the smallest shape from the discussion in #181 that can answer the operator question raised there: "did this agent run load too much context before we know which context was decision-relevant?"
The event captures counts only:
It intentionally does not define a full relevance/evaluator layer and should not capture raw prompt text, raw context text, tool outputs, memory bodies, or repository excerpts.
Why
Agent harnesses increasingly do retrieval, memory lookup, skills/rules loading, tool search, compaction, and other context selection steps before the model call. Token/cost on the model span can show the final bill, but not whether the harness over-selected context upstream.
A cheap count-only event gives operators an early waste signal without requiring raw content capture or a decision-relevance evaluator.
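To make the count-only idea concrete, here is a small sketch of how a harness might assemble the event payload. The event name comes from this PR; the attribute keys under `gen_ai.context.selection.*` and the helper `build_selection_event` are hypothetical placeholders, and a plain dict is used instead of any particular telemetry SDK call:

```python
def build_selection_event(candidate_ids, selected_ids, reason="unknown"):
    """Build a count-only payload; no IDs or content leave this function."""
    selected = set(selected_ids)
    candidate_count = len(candidate_ids)
    # Count only selections that were actually among the candidates.
    selected_count = sum(1 for c in candidate_ids if c in selected)
    return {
        "name": "gen_ai.context.selection.evaluated",
        "attributes": {
            # Hypothetical attribute keys, for illustration only.
            "gen_ai.context.selection.candidate.count": candidate_count,
            "gen_ai.context.selection.selected.count": selected_count,
            "gen_ai.context.selection.suppressed.count": candidate_count - selected_count,
            "gen_ai.context.selection.reason": reason,
        },
    }
```

Calling `build_selection_event(["a", "b", "c", "d"], ["a", "c"], "budget")` yields candidate, selected, and suppressed counts of 4, 2, and 2. A large gap between candidate and selected counts is the kind of early waste signal described above.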
Validation
make generate-all WEAVER=/home/azureuser/.openclaw/workspace/bin/weaver
make check-policies WEAVER=/home/azureuser/.openclaw/workspace/bin/weaver

Both passed locally with the existing `definition/2` stability warnings.

Related discussion: #181