What is thinking mode?

Extended thinking mode gives a model a hidden scratchpad to reason through a problem before producing its final answer. Anthropic Claude supports native extended thinking via the API. For all other models, a 'Think step by step carefully' system suffix is used as an approximation.

How are answers scored?

Each answer is self-judged by the same model and API key used to generate it. The judge receives the question and answer and scores 1–10 on a strict rubric. This keeps the eval provider-agnostic and avoids bias from a separate judge model.

Does thinking mode always help?

No. In our real runs, o4-mini showed a regression — scores dropped from 9.30 to 7.25 when thinking was enabled. This was caused by truncation: the model spent its token budget on reasoning, leaving insufficient tokens for the actual answer. Claude Sonnet improved from 8.63 to 9.00, a modest but consistent gain.

Is my API key stored?

No. Your key is transmitted over HTTPS, used only for the API call, and never logged or stored anywhere.

LLM Thinking Mode Eval

Extended thinking gives models a private scratchpad before they answer — but does it actually improve results? This benchmark runs 8 questions across math, coding, factual, and creative categories with thinking ON and OFF, then uses the same model to self-judge each answer on a 1–10 scale. The heatmap shows exactly where thinking helps, where it doesn't, and where it causes regressions.

8 questions

Two per category: Math (prime counting, Fibonacci), Coding (merge sort, binary search), Factual (quantum entanglement, photosynthesis), Creative (haiku, metaphor). Each question is hard enough that extended reasoning could plausibly help.

ON vs OFF

For Claude models, extended thinking uses the native thinking: {type: "enabled"} API param. For all other providers, a "Think carefully step by step" system suffix simulates the effect without proprietary APIs.

Self-judging

The same model and API key used to answer is reused as judge. A strict rubric (1 = wrong, 5 = partial, 10 = excellent) is enforced. Self-judging keeps the eval provider-agnostic and avoids cross-model bias.

What it reveals

Real runs show o4-mini regressed (9.30 → 7.25) when thinking was enabled — token budget consumed by reasoning left too few tokens for the answer. Claude Sonnet improved modestly (8.63 → 9.00). DeepSeek-R1 improved consistently.

Judge scoring rubric — thinking mode eval
Score	Label	Meaning
10	Excellent	Correct, complete, well-explained answer
5	Partial	Mostly correct but incomplete or poorly explained
1	Wrong	Incorrect answer or refused to answer

Math

Count primes <100 Fibonacci: 20th term

Coding

Implement merge sort Binary search tree

Factual

Quantum entanglement Photosynthesis stages

Creative

Write a haiku Metaphor for the internet

Each category has 2 questions. All 8 are asked twice — once with thinking OFF, once with thinking ON — and scored 1–10 by the model itself.

Pre-computed runs

Sample results

Real runs from 2026-04-26. The heatmap shows average score per model × mode combination. Green delta = thinking helped. Red = regression.

Loading…

Select a run on the left to view its heatmap.

Live benchmark

Run your own eval

Bring your own API key. Results stream live via WebSocket.

Your API key is never stored or logged.

API provider

OpenRouter keys start with sk-or-. Get one free at openrouter.ai

API key

Select models (max 4)

Loading models…

Question categories

Math Coding Factual Creative

Thinking modes

Thinking OFF Thinking ON

Estimated cost

API calls—

Breakdown—

Price estimate—

Total—

Your history

Recent runs

Runs are stored in memory for this session only.

No runs yet. Run an eval above to see history here.

About this benchmark

The "extended thinking" debate in the LLM community is largely anecdotal — practitioners report conflicting results depending on the task and model. This eval provides a structured, reproducible comparison using 8 questions across 4 difficulty profiles, with self-judging to avoid cross-provider bias.

The key insight from real runs is that thinking mode is not universally beneficial. Token budget is the hidden constraint: models that use reasoning tokens heavily may run out of output budget before completing their answer. The o4-mini regression (9.30 → 7.25) in our real runs was caused by exactly this: thinking tokens exhausted before the answer was fully written, resulting in truncated responses that the judge penalized heavily.

Frequently asked questions

How does the system suffix approximation work for non-Claude models?

For models without native extended thinking APIs, we append "Think through this problem carefully and systematically before giving your final answer." to the system prompt. This is an approximation — it encourages chain-of-thought but does not enforce a separate reasoning phase or token budget. The heatmap shows both modes so you can compare the effect for each provider independently.

Can I run only math questions to test a specific capability?

Yes. Use the category checkboxes in the run panel to select only the categories you want. You can also run only thinking OFF or only ON to test baseline performance before adding the thinking comparison.

Why does creative category sometimes show the largest delta?

Creative tasks (haiku, metaphor) have subjective scoring criteria. When the model thinks step by step, it often produces more deliberate, polished output — which a self-judge rewards more highly. However, this can also mean the judge is more lenient when the model shows its reasoning process, introducing a positive bias for thinking-ON responses in creative tasks.

Key finding from real runs

o4-mini regressed significantly: 9.30 → 7.25 (−2.05). 6 of 40 answers were truncated when thinking tokens were enabled. Claude Sonnet 4.6 improved modestly: 8.63 → 9.00 (+0.37). DeepSeek-R1 showed the largest consistent gain across all categories.

Real run: 2026-04-26. Results may differ on newer model versions.

REST API

Every eval is available as a JSON endpoint:

POST /api/v1/evals/thinking-mode/run
{ apiKey, keyType, models,
  categories, modes }

GET /api/v1/evals/thinking-mode/runs
GET /api/v1/evals/thinking-mode/run/:id
GET /api/v1/evals/thinking-mode/run/:id/export

Affiliate disclosure

ByteWaveNetwork may earn a referral fee if you sign up for API access through links on this page. Benchmark results are independently produced and not influenced by commercial relationships.

Sunny Pal Singh

Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere. LinkedIn →