fix: coerce judge score drift #756

Open
schultzjack wants to merge 1 commit into NVIDIA-NeMo:main from schultzjack:codex/569-coerce-judge-scores

Conversation

@schultzjack

Summary

  • normalize LLM-judge score values before enum validation in generated judge response models
  • accept numeric/string drift and simple case/whitespace drift when it maps unambiguously to a configured score option
  • keep unmatched or malformed scores on the existing Pydantic validation path
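
To make the drift concrete, here is a minimal sketch using only the standard library. The `Score` enum and the `lookup` helper are hypothetical and not taken from this PR. The sketch shows plain enum lookup by value, which is an assumption about where the mismatch surfaces; Pydantic's own enum handling may differ in detail.

```python
from enum import Enum


class Score(Enum):
    """Hypothetical score options, for illustration only."""

    LOW = 1
    HIGH = 5


def lookup(raw):
    """Return the matching member, or None when lookup by value rejects the input."""
    try:
        return Score(raw)
    except ValueError:
        return None


exact = lookup(5)       # Score.HIGH: the value matches exactly
drifted = lookup("5")   # None: the string "5" is not the int 5
padded = lookup(" 5 ")  # None: surrounding whitespace also breaks the match
```

The second and third inputs are the kind of drift that has one obvious intended option, which is the case the normalization step targets.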

Scope

This addresses the LLM-judge validation path discussed in #569. It intentionally leaves the broader LLM-structured schema coercion path unchanged.

Testing

  • uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py -q
  • uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py packages/data-designer-engine/tests/engine/column_generators/utils/test_prompt_renderer.py packages/data-designer-engine/tests/engine/column_generators/generators/test_llm_completion_generators.py packages/data-designer-engine/tests/engine/models/recipes/test_response_recipes.py -q
  • make check-engine
  • make test-engine

Refs #569

Signed-off-by: schultzjack <schultzjack@users.noreply.github.com>

schultzjack requested a review from a team as a code owner on June 17, 2026, 00:31
@github-actions

github-actions Bot commented Jun 17, 2026

Contributor

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@github-actions

Contributor

Linked Issue Check

This PR does not reference an issue. External contributions must link to
a triaged issue before the PR can be merged.

Add one of the following to your PR description:

  • Fixes #<issue-number>
  • Closes #<issue-number>
  • Resolves #<issue-number>

If no issue exists yet, open one and a maintainer will triage it.

See CONTRIBUTING.md for details.

@schultzjack

Author

I have read the DCO document and I hereby sign the DCO.

@schultzjack

Author

recheck

@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Contributor

Greptile Summary

This PR fixes LLM-judge score drift by adding a mode="before" Pydantic model validator on BaseJudgeResponse that normalises incoming score values (numeric↔string, case, whitespace) before they reach enum validation, falling back to the standard Pydantic path when the coercion is ambiguous or the value is malformed.

  • _normalize_score_value converts values to a stripped, casefolded string (integer floats become their integer string), enabling reliable equality comparison across types.
  • _coerce_score_value performs an exact-match pass first (with a bool-guard to prevent True/False from silently matching 1/0), then falls back to normalised matching only when exactly one enum member maps to the same normalised form.
  • Four tests covering the primary drift scenarios, nested-model coercion, and unhashable-value fallthrough are added.
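
The two helpers described above can be sketched roughly as follows. This is an illustration written from the summary, not the code in the PR: the function names drop the leading underscore, the `Quality`, `Verdict`, and `Grade` enums are hypothetical, and the exact form of the bool guard is an assumption.

```python
from enum import Enum
from typing import Any


def normalize_score_value(value: Any) -> str:
    """Stripped, casefolded string form; integer-valued floats become "4", not "4.0"."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().casefold()


def coerce_score_value(value: Any, enum_type: type[Enum]) -> Any:
    """Return a configured enum value when the match is unambiguous, else the input."""
    # Pass 1: exact match. The bool guard stops True/False from matching 1/0,
    # because in Python True == 1 and False == 0.
    for member in enum_type:
        same_boolness = isinstance(value, bool) == isinstance(member.value, bool)
        if same_boolness and value == member.value:
            return value

    # Pass 2: normalized match, accepted only when exactly one member matches.
    target = normalize_score_value(value)
    matches = [m for m in enum_type if normalize_score_value(m.value) == target]
    if len(matches) == 1:
        return matches[0].value

    # Ambiguous or unrecognised: hand the original value to normal validation.
    return value


class Quality(Enum):  # hypothetical numeric options
    POOR = 1
    GOOD = 4


class Verdict(Enum):  # hypothetical string options
    YES = "Yes"
    NO = "No"


class Grade(Enum):  # hypothetical options that collide after casefolding
    UPPER = "A"
    LOWER = "a"
```

Under these assumptions, `coerce_score_value("4", Quality)` yields `4` and `coerce_score_value(" yes ", Verdict)` yields `"Yes"`, while `True`, a list such as `["4"]`, and the ambiguous `" a "` against `Grade` are all returned unchanged so that the usual validation error can still be raised downstream.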

Confidence Score: 5/5

The change is self-contained to the judge score coercion path and introduces no mutations to unrelated model behaviour.

The coercion logic is narrowly scoped: it only fires when the field annotation is a concrete Enum subclass, the input is a dict, and exactly one enum member matches after normalisation. The bool-guard in the exact-match phase correctly prevents True/False from silently collapsing onto integer members. Unrecognised or ambiguous values pass through to Pydantic unchanged, preserving existing validation behaviour. The new tests cover the three key drift categories plus the unhashable-value fallthrough path.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/column_generators/utils/judge_score_factory.py | Adds _normalize_score_value, _coerce_score_value, and a model_validator(mode='before') on BaseJudgeResponse to coerce LLM-returned score drift (numeric/string/case/whitespace) before Pydantic enum validation; unmatched values fall through to Pydantic unchanged. |
| packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py | Adds four new tests covering int→string, string→int, and case/whitespace coercion via parametrize; nested structured-output coercion; and unhashable-score fallthrough to Pydantic ValidationError. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["LLM returns score value"] --> B["coerce_score model_validator mode=before"]
    B --> C{"data is dict with score?"}
    C -- No --> Z["Pass through unchanged"]
    C -- Yes --> D{"score field is Enum type?"}
    D -- No --> Z
    D -- Yes --> E["_coerce_score_value(value, enum_type)"]
    E --> F{"Exact match with bool guard?"}
    F -- Yes --> G["Return original value"]
    F -- No --> H["_normalize_score_value\nstrip + casefold, float-to-int"]
    H --> I{"Exactly 1 member matches normalized value?"}
    I -- Yes --> J["Return matched member.value"]
    I -- No --> K["Return original value\nambiguous or unrecognised"]
    G --> L["Pydantic validates against enum"]
    J --> L
    K --> L
    L --> M{"Valid enum value?"}
    M -- Yes --> N["Model instance stored value"]
    M -- No --> O["ValidationError raised"]

Reviews (1): Last reviewed commit: "fix: coerce judge score drift"
