LLM Thinking Mode Eval

Extended thinking gives models a private scratchpad before they answer — but does it actually improve results? This benchmark runs 8 questions across math, coding, factual, and creative categories with thinking ON and OFF, then uses the same model to self-judge each answer on a 1–10 scale. The heatmap shows exactly where thinking helps, where it doesn't, and where it causes regressions.

8 questions

Two per category: Math (prime counting, Fibonacci), Coding (merge sort, binary search), Factual (quantum entanglement, photosynthesis), Creative (haiku, metaphor). Each question is hard enough that extended reasoning could plausibly help.

ON vs OFF

For Claude models, extended thinking uses the native thinking: {type: "enabled"} API param. For all other providers, a "Think carefully step by step" system suffix simulates the effect without proprietary APIs.

Self-judging

The same model and API key used to answer is reused as judge. A strict rubric (1 = wrong, 5 = partial, 10 = excellent) is enforced. Self-judging keeps the eval provider-agnostic and avoids cross-model bias.

What it reveals

Real runs show o4-mini regressed (9.30 → 7.25) when thinking was enabled — token budget consumed by reasoning left too few tokens for the answer. Claude Sonnet improved modestly (8.63 → 9.00). DeepSeek-R1 improved consistently.

Judge scoring rubric — thinking mode eval
ScoreLabelMeaning
10ExcellentCorrect, complete, well-explained answer
5PartialMostly correct but incomplete or poorly explained
1WrongIncorrect answer or refused to answer
Math
Count primes <100 Fibonacci: 20th term
Coding
Implement merge sort Binary search tree
Factual
Quantum entanglement Photosynthesis stages
Creative
Write a haiku Metaphor for the internet

Each category has 2 questions. All 8 are asked twice — once with thinking OFF, once with thinking ON — and scored 1–10 by the model itself.

Sample results

Real runs from 2026-04-26. The heatmap shows average score per model × mode combination. Green delta = thinking helped. Red = regression.

Loading…

Select a run on the left to view its heatmap.

Run your own eval

Bring your own API key. Results stream live via WebSocket.

Your API key is never stored or logged.
OpenRouter keys start with sk-or-. Get one free at openrouter.ai
Loading models…
Estimated cost
API calls
Breakdown
Price estimate
Total

Recent runs

Runs are stored in memory for this session only.

No runs yet. Run an eval above to see history here.

About this benchmark

The "extended thinking" debate in the LLM community is largely anecdotal — practitioners report conflicting results depending on the task and model. This eval provides a structured, reproducible comparison using 8 questions across 4 difficulty profiles, with self-judging to avoid cross-provider bias.

The key insight from real runs is that thinking mode is not universally beneficial. Token budget is the hidden constraint: models that use reasoning tokens heavily may run out of output budget before completing their answer. The o4-mini regression (9.30 → 7.25) in our real runs was caused by exactly this: thinking tokens exhausted before the answer was fully written, resulting in truncated responses that the judge penalized heavily.

Frequently asked questions

How does the system suffix approximation work for non-Claude models?

For models without native extended thinking APIs, we append "Think through this problem carefully and systematically before giving your final answer." to the system prompt. This is an approximation — it encourages chain-of-thought but does not enforce a separate reasoning phase or token budget. The heatmap shows both modes so you can compare the effect for each provider independently.

Can I run only math questions to test a specific capability?

Yes. Use the category checkboxes in the run panel to select only the categories you want. You can also run only thinking OFF or only ON to test baseline performance before adding the thinking comparison.

Why does creative category sometimes show the largest delta?

Creative tasks (haiku, metaphor) have subjective scoring criteria. When the model thinks step by step, it often produces more deliberate, polished output — which a self-judge rewards more highly. However, this can also mean the judge is more lenient when the model shows its reasoning process, introducing a positive bias for thinking-ON responses in creative tasks.

Sunny Pal Singh
Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere. LinkedIn →