serve: Gemma 4 non-thinking responses returned as reasoning_content with empty content #46561

@b11015006

System Info

  • transformers main (acc2cda7d9), transformers serve
  • Model: google/gemma-4-31B-it
  • Linux, Python 3.x, CUDA

Who can help?

Serving / CLI maintainers

Reproduction

  1. Start the server and send a non-streaming chat completion with a thinking-disabled prompt:
transformers serve
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-4-31B-it", "messages": [{"role": "user", "content": "Write a short joke about saving RAM."}]}'
  2. The response has empty content; the full generated text lands in reasoning_content:
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"","role":"assistant"}}], "usage":{"completion_tokens":28,"prompt_tokens":33,"total_tokens":61}}

(reasoning_content does not appear in the JSON above because it is stripped during serialization, but the underlying ChatCompletionMessage shows the text was classified as reasoning.)
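
For anyone who prefers to script the reproduction, here is a minimal Python sketch of the same request using only the standard library. The helper names (`build_request`, `send`, `shows_bug`) are mine and purely illustrative; the URL, model name, and payload are copied from the curl command above, and `shows_bug` only encodes the symptom seen in the response above (empty content despite a non-zero completion token count).

```python
import json
import urllib.request

# Endpoint taken from the curl command in this issue.
URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt: str) -> urllib.request.Request:
    """Build the same non-streaming chat completion request as the curl call."""
    payload = {
        "model": "google/gemma-4-31B-it",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(prompt: str) -> dict:
    """Send the request to a running server and decode the JSON response."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)


def shows_bug(response: dict) -> bool:
    """Hypothetical check: tokens were generated, yet message.content is empty."""
    message = response["choices"][0]["message"]
    usage = response.get("usage", {})
    return message.get("content", "") == "" and usage.get("completion_tokens", 0) > 0
```

With the server running, `shows_bug(send("Write a short joke about saving RAM."))` should be truthy before the fix and falsy after it, assuming the model produces a non-empty answer.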

Root cause

Gemma 4's chat template prefills an empty, already-closed thinking block at the end of the prompt when thinking is disabled. The last prompt tokens are:

<|turn>model\n<|channel>thought\n<channel|>   →  [..., 105, 4368, 107, 100, 45518, 107, 101]

_starts_in_thinking() in src/transformers/cli/serving/utils.py checks whether the prompt ends with the thinking opener (["<|channel>", "thought", "\n"] = [100, 45518, 107]). It tolerates one trailing token after the opener, which is intended for templates like DeepSeek-R1 that emit <think>\n. Here the tolerated trailing token is 101 = <channel|>, the closing tag, so the heuristic wrongly reports start_in_thinking=True.
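
The false positive can be illustrated without loading the model. The sketch below is a simplified stand-in for the tail match described above, not the actual source of `_starts_in_thinking()`; the function name `tail_matches_opener` is hypothetical, and the token ids are the ones quoted in this issue.

```python
def tail_matches_opener(input_ids: list[int], start_ids: list[int]) -> bool:
    """Simplified stand-in for the pre-fix heuristic (an assumption, not the real code).

    Returns True if start_ids appears at the end of input_ids, allowing up to
    one trailing token after it.
    """
    n = len(start_ids)
    for trailing in (0, 1):
        end = len(input_ids) - trailing
        if end >= n and input_ids[end - n:end] == start_ids:
            return True
    return False


# Token ids quoted in this issue.
GEMMA_TAIL = [105, 4368, 107, 100, 45518, 107, 101]  # ...<|channel>thought\n<channel|>
START_IDS = [100, 45518, 107]                        # <|channel>, thought, \n

# The single tolerated trailing token is 101 (<channel|>), so the opener still matches.
print(tail_matches_opener(GEMMA_TAIL, START_IDS))  # True
```

The match succeeds on the second pass (`trailing == 1`), which is exactly the pass meant to absorb a harmless trailing newline.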

Since the model's output contains no thinking markers, parse_reasoning() falls into the "prefilled opener truncated before close" branch and reclassifies the entire completion as reasoning, returning content="".
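
To make the consequence concrete, here is a toy model of that branch. It is only a sketch of the behavior described in the paragraph above, covering the single case where the generated ids contain no thinking markers; the function name `classify_without_markers` is hypothetical and the real `parse_reasoning()` handles more cases than this.

```python
from typing import Optional


def classify_without_markers(
    text: str, start_in_thinking: bool
) -> tuple[str, Optional[str]]:
    """Toy version of the marker-free branch; returns (content, reasoning_content).

    Assumption: when the prompt is believed to end inside an open thinking
    block and the output never closes it, the whole completion is treated
    as reasoning and content is left empty.
    """
    if start_in_thinking:
        return "", text
    return text, None


joke = "some generated joke"  # placeholder for the model output

# Wrong flag from the heuristic: everything is reclassified as reasoning.
print(classify_without_markers(joke, start_in_thinking=True))   # ('', 'some generated joke')

# Correct flag: the text stays in content.
print(classify_without_markers(joke, start_in_thinking=False))  # ('some generated joke', None)
```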

Proposed fix

Reject the match when the prompt's final token is the thinking end token:

diff --git a/src/transformers/cli/serving/utils.py b/src/transformers/cli/serving/utils.py
index 8aedfb45c0..04457af56a 100644
--- a/src/transformers/cli/serving/utils.py
+++ b/src/transformers/cli/serving/utils.py
@@ -263,7 +263,7 @@ def get_reasoning_config(processor, model: "PreTrainedModel", input_ids=None) ->
         schema = _DEFAULT_THINKING_TOKENS["schema"]
     config: dict = {"start_ids": start_ids, "end_id": end_id, "schema": schema}
     if input_ids is not None:
-        config["start_in_thinking"] = _starts_in_thinking(input_ids, start_ids)
+        config["start_in_thinking"] = _starts_in_thinking(input_ids, start_ids, end_id)
     return config
 
 
@@ -287,7 +287,7 @@ def parse_reasoning(processor, generated_ids, content: str, reasoning_config: di
     return content, None
 
 
-def _starts_in_thinking(input_ids, start_ids: list[int]) -> bool:
+def _starts_in_thinking(input_ids, start_ids: list[int], end_id: int) -> bool:
     """True if the rendered prompt ends with an unclosed thinking block.
 
     Some reasoning-model chat templates prefill the thinking opener as the final
@@ -305,6 +305,11 @@ def _starts_in_thinking(input_ids, start_ids: list[int]) -> bool:
         if len(input_ids) != 1:
             return False
         input_ids = input_ids[0]
+    # A prompt ending with the end token has no unclosed block: some templates
+    # prefill an empty, already-closed block when thinking is disabled (e.g.
+    # Gemma 4 renders ``<|channel>thought\n<channel|>``).
+    if input_ids and input_ids[-1] == end_id:
+        return False
     n = len(start_ids)
     # Match start_ids at the tail, allowing up to one trailing token (e.g. "\n").
     for trailing in (0, 1):

Verified locally: the Gemma 4 closed-block tail now returns False, while a genuinely open block and the DeepSeek-style <think>\n prefill still return True, and the curl above returns the text in message.content.
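
The three cases above can be sketched as standalone assertions. This is a simplified re-implementation for illustration, not the patched function itself: `starts_in_thinking_sketch` is a hypothetical name, the Gemma ids are the ones quoted in this issue, and the DeepSeek-style ids (`THINK`, `NEWLINE`, `THINK_END`) are made-up placeholders rather than real vocabulary ids.

```python
def starts_in_thinking_sketch(input_ids: list[int], start_ids: list[int], end_id: int) -> bool:
    """Simplified version of the heuristic with the proposed end-token guard."""
    # Proposed guard: a prompt ending with the end token has no unclosed block.
    if input_ids and input_ids[-1] == end_id:
        return False
    n = len(start_ids)
    # Match start_ids at the tail, allowing up to one trailing token.
    for trailing in (0, 1):
        end = len(input_ids) - trailing
        if end >= n and input_ids[end - n:end] == start_ids:
            return True
    return False


# Gemma 4 ids quoted in this issue.
GEMMA_START, GEMMA_END = [100, 45518, 107], 101
closed_block = [105, 4368, 107, 100, 45518, 107, 101]
open_block = [105, 4368, 107, 100, 45518, 107]

# Placeholder ids for a DeepSeek-style "<think>\n" prefill (not real token ids).
THINK, NEWLINE, THINK_END = 9001, 9002, 9003
deepseek_style = [1, 2, THINK, NEWLINE]

assert starts_in_thinking_sketch(closed_block, GEMMA_START, GEMMA_END) is False
assert starts_in_thinking_sketch(open_block, GEMMA_START, GEMMA_END) is True
assert starts_in_thinking_sketch(deepseek_style, [THINK], THINK_END) is True
```

The guard runs before the tail match, so the one-token tolerance stays available for trailing newlines while a trailing close tag is rejected outright.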

I'm happy to open a PR with this fix if maintainers confirm the approach.

Expected behavior

Generated text should be returned in message.content when the model did not produce a thinking block.


This issue was investigated and drafted with AI assistance (Claude Code); the patch was reviewed and tested by me locally.
