System Info
transformers main (acc2cda7d9), transformers serve
- Model: google/gemma-4-31B-it
- Linux, Python 3.x, CUDA
Who can help?
Serving / CLI maintainers
Reproduction
- Start the server and send a non-streaming chat completion with a thinking-disabled prompt:
transformers serve
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-4-31B-it", "messages": [{"role": "user", "content": "Write a short joke about saving RAM."}]}'
- The response has empty content; the full generated text lands in reasoning_content:
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"","role":"assistant"}}], "usage":{"completion_tokens":28,"prompt_tokens":33,"total_tokens":61}}
(reasoning_content is also stripped from the serialized JSON, but the underlying ChatCompletionMessage shows the text was classified as reasoning.)
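For anyone who prefers a scripted check over curl, the reproduction above can be sketched in Python with only the standard library. The URL and payload are copied from the curl command; `post_chat_completion` and `has_empty_content_symptom` are hypothetical helper names introduced here for illustration, not part of transformers.

```python
import json
import urllib.request

# URL and payload copied from the curl command in this report.
URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Write a short joke about saving RAM."}],
}


def post_chat_completion(url=URL, payload=PAYLOAD):
    """Send the non-streaming request (needs a running `transformers serve`)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def has_empty_content_symptom(response):
    """Hypothetical helper: tokens were generated, yet message.content is empty."""
    message = response["choices"][0]["message"]
    generated = response["usage"]["completion_tokens"]
    return generated > 0 and not message.get("content")


# The response body reported in this issue, checked offline.
reported = json.loads(
    '{"choices":[{"finish_reason":"stop","index":0,'
    '"message":{"content":"","role":"assistant"}}],'
    ' "usage":{"completion_tokens":28,"prompt_tokens":33,"total_tokens":61}}'
)
print(has_empty_content_symptom(reported))
```

Against a live server, `has_empty_content_symptom(post_chat_completion())` should flag the bug before the fix and stop flagging it afterwards.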
Root cause
Gemma 4's chat template prefills an empty, already-closed thinking block at the end of the prompt when thinking is disabled. The last prompt tokens are:
<|turn>model\n<|channel>thought\n<channel|> → [..., 105, 4368, 107, 100, 45518, 107, 101]
_starts_in_thinking() in src/transformers/cli/serving/utils.py checks whether the prompt ends with the thinking opener (["<|channel>", "thought", "\n"] = [100, 45518, 107]), tolerating one trailing token (intended for templates like DeepSeek-R1 that emit <think>\n). Here the tolerated trailing token is 101 = <channel|> — the closing tag — so the heuristic wrongly reports start_in_thinking=True.
Since the model's output contains no thinking markers, parse_reasoning() falls into the "prefilled opener truncated before close" branch and reclassifies the entire completion as reasoning, returning content="".
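To make the false positive concrete, here is a minimal sketch of the tail match as described above. It is a reconstruction from this description, not the actual code in utils.py: the function name `starts_in_thinking_sketch` is hypothetical, and it takes a flat list of ids rather than the batched input the real function receives. The token ids are the ones quoted above.

```python
def starts_in_thinking_sketch(input_ids, start_ids):
    """Reconstructed heuristic: does the prompt end with the thinking opener,
    allowing at most one extra trailing token after it?"""
    n = len(start_ids)
    for trailing in (0, 1):
        stop = len(input_ids) - trailing
        if stop >= n and input_ids[stop - n:stop] == start_ids:
            return True
    return False


# ["<|channel>", "thought", "\n"] as quoted in this report.
START_IDS = [100, 45518, 107]

# Tail of the Gemma 4 prompt with thinking disabled; 101 is <channel|>.
gemma_closed_tail = [105, 4368, 107, 100, 45518, 107, 101]

# The opener matches once the single "tolerated" token (the closing tag) is skipped.
print(starts_in_thinking_sketch(gemma_closed_tail, START_IDS))
```

With trailing=1 the slice compared is [100, 45518, 107], so the sketch reports an open thinking block even though the block was already closed by 101.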
Proposed fix
Reject the match when the prompt's final token is the thinking end token:
diff --git a/src/transformers/cli/serving/utils.py b/src/transformers/cli/serving/utils.py
index 8aedfb45c0..04457af56a 100644
--- a/src/transformers/cli/serving/utils.py
+++ b/src/transformers/cli/serving/utils.py
@@ -263,7 +263,7 @@ def get_reasoning_config(processor, model: "PreTrainedModel", input_ids=None) ->
schema = _DEFAULT_THINKING_TOKENS["schema"]
config: dict = {"start_ids": start_ids, "end_id": end_id, "schema": schema}
if input_ids is not None:
- config["start_in_thinking"] = _starts_in_thinking(input_ids, start_ids)
+ config["start_in_thinking"] = _starts_in_thinking(input_ids, start_ids, end_id)
return config
@@ -287,7 +287,7 @@ def parse_reasoning(processor, generated_ids, content: str, reasoning_config: di
return content, None
-def _starts_in_thinking(input_ids, start_ids: list[int]) -> bool:
+def _starts_in_thinking(input_ids, start_ids: list[int], end_id: int) -> bool:
"""True if the rendered prompt ends with an unclosed thinking block.
Some reasoning-model chat templates prefill the thinking opener as the final
@@ -305,6 +305,11 @@ def _starts_in_thinking(input_ids, start_ids: list[int]) -> bool:
if len(input_ids) != 1:
return False
input_ids = input_ids[0]
+ # A prompt ending with the end token has no unclosed block: some templates
+ # prefill an empty, already-closed block when thinking is disabled (e.g.
+ # Gemma 4 renders ``<|channel>thought\n<channel|>``).
+ if input_ids and input_ids[-1] == end_id:
+ return False
n = len(start_ids)
# Match start_ids at the tail, allowing up to one trailing token (e.g. "\n").
for trailing in (0, 1):
Verified locally: the Gemma 4 closed-block tail now returns False, while a genuinely open block and the DeepSeek-style <think>\n prefill still return True, and the curl above returns the text in message.content.
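The three cases from the local verification can be sketched as a standalone check. This is an illustration under stated assumptions, not the test that was run: `starts_in_thinking_fixed` is a hypothetical flat-list rewrite of the patched function, the Gemma ids are the ones quoted in this report, and the DeepSeek-style ids (9001, 9002, 9003) are placeholders because the real ids are not given here.

```python
def starts_in_thinking_fixed(input_ids, start_ids, end_id):
    """Sketch of the patched check on a flat list of token ids."""
    # New guard: a prompt ending with the end token has no unclosed block.
    if input_ids and input_ids[-1] == end_id:
        return False
    n = len(start_ids)
    # Match start_ids at the tail, allowing up to one trailing token.
    for trailing in (0, 1):
        stop = len(input_ids) - trailing
        if stop >= n and input_ids[stop - n:stop] == start_ids:
            return True
    return False


# Gemma 4 ids as quoted in this report.
GEMMA_START, GEMMA_END = [100, 45518, 107], 101
closed_tail = [105, 4368, 107, 100, 45518, 107, 101]  # empty, closed block
open_tail = [105, 4368, 107, 100, 45518, 107]         # genuinely open block

# Placeholder ids (assumption) for a DeepSeek-style "<think>\n" prefill.
THINK_OPEN, THINK_CLOSE, NEWLINE = 9001, 9002, 9003
deepseek_tail = [1, 2, THINK_OPEN, NEWLINE]

print(starts_in_thinking_fixed(closed_tail, GEMMA_START, GEMMA_END))
print(starts_in_thinking_fixed(open_tail, GEMMA_START, GEMMA_END))
print(starts_in_thinking_fixed(deepseek_tail, [THINK_OPEN], THINK_CLOSE))
```

The guard runs before the tail match, so the one-token tolerance is preserved for every trailing token except the end token itself.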
I'm happy to open a PR with this fix if maintainers confirm the approach.
Expected behavior
Generated text should be returned in message.content when the model did not produce a thinking block.
This issue was investigated and drafted with AI assistance (Claude Code); the patch was reviewed and tested by me locally.