fix: coerce judge score drift by schultzjack · Pull Request #756 · NVIDIA-NeMo/DataDesigner

schultzjack · 2026-06-17T00:30:59Z

Summary

Normalize LLM-judge score values before enum validation in generated judge response models.
Accept numeric/string drift and simple case/whitespace drift when the value maps unambiguously to a configured score option.
Keep unmatched or malformed scores on the existing Pydantic validation path.
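
To make the drift categories above concrete, here is a minimal sketch of the kind of normalization involved. The helper name `normalize_score` and its exact rules are assumptions for illustration; they are not copied from this PR's diff.

```python
from typing import Any


def normalize_score(value: Any) -> str:
    """Hypothetical helper: reduce a raw score to a comparable string key."""
    # Integer-valued floats such as 3.0 are treated like the integer 3.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    # Whitespace and case differences are ignored for comparison purposes.
    return str(value).strip().casefold()


# Numeric/string drift and whitespace drift collapse to one key.
keys = {normalize_score(v) for v in (3, "3", 3.0, " 3 ")}
```

Under these assumptions, `3`, `"3"`, `3.0`, and `" 3 "` all share a single key, while a non-integer float such as `3.5` keeps its own key and would not match an integer option.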

Scope

This addresses the LLM-judge validation path discussed in #569. It intentionally leaves the broader LLM-structured schema coercion path unchanged.

Testing

uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py -q
uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py packages/data-designer-engine/tests/engine/column_generators/utils/test_prompt_renderer.py packages/data-designer-engine/tests/engine/column_generators/generators/test_llm_completion_generators.py packages/data-designer-engine/tests/engine/models/recipes/test_response_recipes.py -q
make check-engine
make test-engine

Refs #569

Signed-off-by: schultzjack <schultzjack@users.noreply.github.com>

github-actions · 2026-06-17T00:31:12Z

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

github-actions · 2026-06-17T00:31:14Z

Linked Issue Check

This PR does not reference an issue. External contributions must link to a triaged issue before the PR can be merged.

Add one of the following to your PR description:

Fixes #<issue-number>
Closes #<issue-number>
Resolves #<issue-number>

If no issue exists yet, open one and a maintainer will triage it.

See CONTRIBUTING.md for details.

schultzjack · 2026-06-17T00:33:44Z

I have read the DCO document and I hereby sign the DCO.

schultzjack · 2026-06-17T00:34:03Z

recheck

greptile-apps · 2026-06-17T00:34:52Z

Greptile Summary

This PR fixes LLM-judge score drift by adding a mode="before" Pydantic model validator on BaseJudgeResponse that normalises incoming score values (numeric↔string, case, whitespace) before they reach enum validation, falling back to the standard Pydantic path when the coercion is ambiguous or the value is malformed.

_normalize_score_value converts values to a stripped, casefolded string (integer floats become their integer string), enabling reliable equality comparison across types.
_coerce_score_value performs an exact-match pass first (with a bool-guard to prevent True/False from silently matching 1/0), then falls back to normalised matching only when exactly one enum member maps to the same normalised form.
Four tests covering the primary drift scenarios, nested-model coercion, and unhashable-value fallthrough are added.
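
The two-pass behaviour described in this summary (exact match with a bool guard, then a normalized match that must be unique) could be sketched roughly as follows. The names `_key` and `coerce_score`, and the decision to leave unmatched booleans untouched, are assumptions made for this sketch rather than the PR's actual implementation.

```python
from enum import Enum
from typing import Any


def _key(value: Any) -> str:
    # Hypothetical normalization: integer floats become ints, then strip + casefold.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().casefold()


def coerce_score(value: Any, enum_type: type[Enum]) -> Any:
    # Pass 1: exact match. The bool guard stops True/False from matching 1/0,
    # which plain equality would allow because bool is a subclass of int.
    for member in enum_type:
        same_boolness = isinstance(value, bool) == isinstance(member.value, bool)
        if same_boolness and value == member.value:
            return value
    # Assumption for this sketch: booleans without an exact match are never coerced.
    if isinstance(value, bool):
        return value
    # Pass 2: normalized match, accepted only when exactly one member matches.
    matches = [
        m.value
        for m in enum_type
        if not isinstance(m.value, bool) and _key(m.value) == _key(value)
    ]
    return matches[0] if len(matches) == 1 else value


class NumericScore(Enum):
    ONE = 1
    TWO = 2
    THREE = 3


class LabelScore(Enum):
    HIGH = "High"
    LOW = "Low"


class AmbiguousScore(Enum):
    UPPER = "High"
    LOWER = "high"
```

With these example enums, `coerce_score("3", NumericScore)` yields `3` and `coerce_score(" high ", LabelScore)` yields `"High"`, while `coerce_score("HIGH", AmbiguousScore)` returns the input unchanged because two members share the same normalized form. Values that match nothing are returned as-is so that the downstream validator can reject them.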

Confidence Score: 5/5

The change is self-contained to the judge score coercion path and introduces no mutations to unrelated model behaviour.

The coercion logic is narrowly scoped: it only fires when the field annotation is a concrete Enum subclass, the input is a dict, and exactly one enum member matches after normalisation. The bool-guard in the exact-match phase correctly prevents True/False from silently collapsing onto integer members. Unrecognised or ambiguous values pass through to Pydantic unchanged, preserving existing validation behaviour. The new tests cover the three key drift categories plus the unhashable-value fallthrough path.
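
As a rough illustration of the scoping conditions listed above, a `mode="before"` validator in Pydantic v2 might be wired up like this. `Score` and `JudgeResponseSketch` are hypothetical stand-ins (the PR attaches its validator to `BaseJudgeResponse`), and the matching logic is deliberately simplified to a single normalized pass.

```python
from enum import Enum
from typing import Any

from pydantic import BaseModel, ValidationError, model_validator


class Score(Enum):
    LOW = 1
    HIGH = 2


class JudgeResponseSketch(BaseModel):
    """Hypothetical stand-in for a generated judge response model."""

    reasoning: str
    score: Score

    @model_validator(mode="before")
    @classmethod
    def coerce_score(cls, data: Any) -> Any:
        # Only act on dict input that actually carries a score.
        if not isinstance(data, dict) or "score" not in data:
            return data
        # Only act when the field annotation is a concrete Enum subclass.
        enum_type = cls.model_fields["score"].annotation
        if not (isinstance(enum_type, type) and issubclass(enum_type, Enum)):
            return data
        raw = data["score"]
        if isinstance(raw, bool):
            return data
        key = str(raw).strip().casefold()
        matches = [m.value for m in enum_type if str(m.value).strip().casefold() == key]
        # Rewrite the score only on a unique match; otherwise leave it for Pydantic.
        if len(matches) == 1:
            return {**data, "score": matches[0]}
        return data


coerced = JudgeResponseSketch.model_validate({"reasoning": "fine", "score": " 2 "})

try:
    JudgeResponseSketch.model_validate({"reasoning": "fine", "score": "9"})
    rejected = False
except ValidationError:
    rejected = True
```

Here the drifted value `" 2 "` is rewritten to `2` before enum validation runs, while `"9"` matches no member and is handed to Pydantic unchanged, which rejects it with a `ValidationError`.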

No files require special attention.

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/column_generators/utils/judge_score_factory.py	Adds _normalize_score_value, _coerce_score_value, and a model_validator(mode='before') on BaseJudgeResponse to coerce LLM-returned score drift (numeric/string/case/whitespace) before Pydantic enum validation; unmatched values fall through to Pydantic unchanged.
packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py	Adds four new tests covering int→string, string→int, and case/whitespace coercion via parametrize; nested structured-output coercion; and unhashable-score fallthrough to Pydantic ValidationError.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["LLM returns score value"] --> B["coerce_score model_validator mode=before"]
    B --> C{"data is dict with score?"}
    C -- No --> Z["Pass through unchanged"]
    C -- Yes --> D{"score field is Enum type?"}
    D -- No --> Z
    D -- Yes --> E["_coerce_score_value(value, enum_type)"]
    E --> F{"Exact match with bool guard?"}
    F -- Yes --> G["Return original value"]
    F -- No --> H["_normalize_score_value\nstrip + casefold, float-to-int"]
    H --> I{"Exactly 1 member matches normalized value?"}
    I -- Yes --> J["Return matched member.value"]
    I -- No --> K["Return original value\nambiguous or unrecognised"]
    G --> L["Pydantic validates against enum"]
    J --> L
    K --> L
    L --> M{"Valid enum value?"}
    M -- Yes --> N["Model instance stored value"]
    M -- No --> O["ValidationError raised"]

Reviews (1): Last reviewed commit: "fix: coerce judge score drift"

fix: coerce judge score drift

2b3c629

Signed-off-by: schultzjack <schultzjack@users.noreply.github.com>

schultzjack requested a review from a team as a code owner June 17, 2026 00:31

schultzjack wants to merge 1 commit into NVIDIA-NeMo:main from schultzjack:codex/569-coerce-judge-scores
