serve: Gemma 4 non-thinking responses returned as reasoning_content with empty content #46561

## System Info

- `transformers` main (`acc2cda7d9`), `transformers serve`
- Model: `google/gemma-4-31B-it`
- Linux, Python 3.x, CUDA

## Who can help?

Serving / CLI maintainers

## Reproduction

1. Start the server and send a non-streaming chat completion with a thinking-disabled prompt:

```bash
# Terminal 1: start the server
transformers serve

# Terminal 2: send a non-streaming chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-4-31B-it", "messages": [{"role": "user", "content": "Write a short joke about saving RAM."}]}'
```

2. The response has an empty `content` even though `usage` reports 28 completion tokens; the generated text was classified as `reasoning_content`:

```json
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"","role":"assistant"}}], "usage":{"completion_tokens":28,"prompt_tokens":33,"total_tokens":61}}
```

(`reasoning_content` is additionally stripped from the serialized JSON, so the client receives no text at all. Inspecting the underlying `ChatCompletionMessage` shows the text was classified as reasoning.)
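For anyone checking whether they are hitting the same symptom, here is a minimal client-side sketch. The helper name `looks_misclassified` is hypothetical and not part of `transformers`; it only flags responses where tokens were generated but `content` came back empty. Note that an empty `content` can be legitimate in other situations (for example tool calls), so treat a hit as a hint rather than proof.

```python
import json


def looks_misclassified(response: dict) -> bool:
    """Return True if tokens were generated but a choice has empty content.

    Hypothetical helper for illustration; it inspects only the fields that
    appear in the response shown above.
    """
    generated = response.get("usage", {}).get("completion_tokens", 0)
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        if generated > 0 and not message.get("content"):
            return True
    return False


# The response body from the reproduction above.
raw = (
    '{"choices":[{"finish_reason":"stop","index":0,'
    '"message":{"content":"","role":"assistant"}}],'
    '"usage":{"completion_tokens":28,"prompt_tokens":33,"total_tokens":61}}'
)
print(looks_misclassified(json.loads(raw)))
```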

## Root cause

Gemma 4's chat template prefills an **empty, already-closed** thinking block at the end of the prompt when thinking is disabled. The last prompt tokens are:

```
<|turn>model\n<|channel>thought\n<channel|>   →  [..., 105, 4368, 107, 100, 45518, 107, 101]
```

`_starts_in_thinking()` in `src/transformers/cli/serving/utils.py` checks whether the prompt ends with the thinking opener (`["<|channel>", "thought", "\n"]` = `[100, 45518, 107]`). It tolerates one trailing token after the opener, which is intended for templates like DeepSeek-R1 that emit `<think>\n`. Here the tolerated trailing token is `101` = `<channel|>`, the **closing** tag, so the heuristic wrongly reports `start_in_thinking=True`.

Since the model's output contains no thinking markers, `parse_reasoning()` falls into the "prefilled opener truncated before close" branch and reclassifies the entire completion as reasoning, returning `content=""`.
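The false positive can be reproduced in isolation with the token IDs listed above. The function below is a simplified model of the tail match as described in this issue (opener at the end of the prompt, with up to one trailing token allowed). It is a sketch for illustration, not the actual code in `utils.py`, and the name `starts_in_thinking_sketch` is made up.

```python
def starts_in_thinking_sketch(input_ids: list[int], start_ids: list[int]) -> bool:
    """Simplified tail match: opener at the end, allowing one trailing token."""
    n = len(start_ids)
    for trailing in (0, 1):
        end = len(input_ids) - trailing
        # Compare the window of length n that ends `trailing` tokens before the tail.
        if end >= n and input_ids[end - n:end] == start_ids:
            return True
    return False


# Tail of the Gemma 4 prompt with thinking disabled:
# <|turn> model \n <|channel> thought \n <channel|>
prompt_tail = [105, 4368, 107, 100, 45518, 107, 101]
start_ids = [100, 45518, 107]

# The opener matches once the closing tag (101) is skipped as the
# tolerated trailing token, so the closed block is reported as open.
print(starts_in_thinking_sketch(prompt_tail, start_ids))  # True
```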

## Proposed fix

Reject the match when the prompt's final token is the thinking end token:

```diff
diff --git a/src/transformers/cli/serving/utils.py b/src/transformers/cli/serving/utils.py
index 8aedfb45c0..04457af56a 100644
--- a/src/transformers/cli/serving/utils.py
+++ b/src/transformers/cli/serving/utils.py
@@ -263,7 +263,7 @@ def get_reasoning_config(processor, model: "PreTrainedModel", input_ids=None) ->
         schema = _DEFAULT_THINKING_TOKENS["schema"]
     config: dict = {"start_ids": start_ids, "end_id": end_id, "schema": schema}
     if input_ids is not None:
-        config["start_in_thinking"] = _starts_in_thinking(input_ids, start_ids)
+        config["start_in_thinking"] = _starts_in_thinking(input_ids, start_ids, end_id)
     return config
 
 
@@ -287,7 +287,7 @@ def parse_reasoning(processor, generated_ids, content: str, reasoning_config: di
     return content, None
 
 
-def _starts_in_thinking(input_ids, start_ids: list[int]) -> bool:
+def _starts_in_thinking(input_ids, start_ids: list[int], end_id: int) -> bool:
     """True if the rendered prompt ends with an unclosed thinking block.
 
     Some reasoning-model chat templates prefill the thinking opener as the final
@@ -305,6 +305,11 @@ def _starts_in_thinking(input_ids, start_ids: list[int]) -> bool:
         if len(input_ids) != 1:
             return False
         input_ids = input_ids[0]
+    # A prompt ending with the end token has no unclosed block: some templates
+    # prefill an empty, already-closed block when thinking is disabled (e.g.
+    # Gemma 4 renders ``<|channel>thought\n<channel|>``).
+    if len(input_ids) > 0 and input_ids[-1] == end_id:
+        return False
     n = len(start_ids)
     # Match start_ids at the tail, allowing up to one trailing token (e.g. "\n").
     for trailing in (0, 1):
```

Verified locally: the Gemma 4 closed-block tail now returns `False`, while a genuinely open block and the DeepSeek-style `<think>\n` prefill still return `True`. With the patch applied, the curl request above returns the text in `message.content`.
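The three cases can be expressed as a small standalone check. This is a sketch that mirrors the logic of the patch on plain lists rather than importing the real function; `starts_in_thinking_fixed` is a hypothetical name, and the DeepSeek-style token IDs (`9001`, `9002`, `9003`) are placeholders chosen for illustration, not real vocabulary IDs.

```python
def starts_in_thinking_fixed(input_ids: list[int], start_ids: list[int], end_id: int) -> bool:
    """Tail match with the proposed guard: a prompt ending in end_id is closed."""
    if len(input_ids) > 0 and input_ids[-1] == end_id:
        return False
    n = len(start_ids)
    for trailing in (0, 1):
        end = len(input_ids) - trailing
        if end >= n and input_ids[end - n:end] == start_ids:
            return True
    return False


GEMMA_START = [100, 45518, 107]  # <|channel> thought \n (IDs from this issue)
GEMMA_END = 101                  # <channel|>

# 1. Empty, already-closed block (thinking disabled): not in thinking.
closed = [105, 4368, 107, 100, 45518, 107, 101]
assert starts_in_thinking_fixed(closed, GEMMA_START, GEMMA_END) is False

# 2. Genuinely open block: prompt ends right after the opener.
open_block = [105, 4368, 107, 100, 45518, 107]
assert starts_in_thinking_fixed(open_block, GEMMA_START, GEMMA_END) is True

# 3. DeepSeek-style "<think>\n" prefill, using placeholder IDs:
#    9001 = <think>, 9002 = "\n", 9003 = </think>.
deepseek_like = [1, 2, 9001, 9002]
assert starts_in_thinking_fixed(deepseek_like, [9001], 9003) is True

print("all three cases behave as described")
```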

I'm happy to open a PR with this fix if maintainers confirm the approach.

## Expected behavior

Generated text should be returned in `message.content` when the model did not produce a thinking block.

---

*This issue was investigated and drafted with AI assistance (Claude Code); the patch was reviewed and tested by me locally.*

