About this benchmark
The "extended thinking" debate in the LLM community is largely anecdotal: practitioners report conflicting results depending on the task and model. This eval provides a structured, reproducible comparison using 8 questions across 4 difficulty profiles, with self-judging to avoid cross-provider bias.
The key insight from real runs is that thinking mode is not universally beneficial. Token budget is the hidden constraint: models that spend heavily on reasoning tokens can exhaust their output budget before completing the answer. The o4-mini regression in our real runs (9.30 → 7.25) was caused by exactly this: reasoning tokens used up the output budget before the answer was fully written, and the judge heavily penalized the resulting truncated responses.
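The budget arithmetic behind that failure mode can be sketched as follows. This is a minimal illustration, assuming the provider counts reasoning tokens against the same output cap as the visible answer. The function names, parameters, and the token counts in the usage note are hypothetical and are not taken from the eval's code.

```python
def answer_budget(max_output_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer once reasoning has been paid for.

    Assumption: reasoning tokens and answer tokens share one output cap.
    """
    return max(0, max_output_tokens - reasoning_tokens)


def likely_truncated(max_output_tokens: int, reasoning_tokens: int,
                     answer_tokens: int) -> bool:
    """Heuristic: the response consumed the whole cap, so it probably stopped early."""
    return reasoning_tokens + answer_tokens >= max_output_tokens


def score_delta(score_off: float, score_on: float) -> float:
    """Thinking-ON score minus thinking-OFF score, rounded to two decimals."""
    return round(score_on - score_off, 2)
```

For example, with a hypothetical cap of 4096 tokens and 3900 tokens spent on reasoning, `answer_budget(4096, 3900)` leaves only 196 tokens for the answer. Applied to the o4-mini scores reported above, `score_delta(9.30, 7.25)` gives a delta of -2.05.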
Frequently asked questions
How does the system suffix approximation work for non-Claude models?
For models without native extended thinking APIs, we append "Think through this problem carefully and systematically before giving your final answer." to the system prompt. This is an approximation: it encourages chain-of-thought but does not enforce a separate reasoning phase or token budget. The heatmap shows both modes so you can compare the effect for each provider independently.
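The suffix approach can be sketched in a few lines. The suffix text is the one quoted above. The function name `build_system_prompt`, the `thinking` flag, and the blank-line separator are assumptions made for illustration, not the eval's actual implementation.

```python
THINKING_SUFFIX = (
    "Think through this problem carefully and systematically "
    "before giving your final answer."
)


def build_system_prompt(base_prompt: str, thinking: bool) -> str:
    """Return the system prompt for one run.

    With thinking ON, the suffix is appended after a blank line (the
    separator is an assumption). With thinking OFF, the base prompt is
    returned unchanged.
    """
    if not thinking:
        return base_prompt
    if not base_prompt:
        return THINKING_SUFFIX
    return f"{base_prompt}\n\n{THINKING_SUFFIX}"
```

Because the only difference between the two modes is this appended instruction, any score change for these providers reflects prompt-induced chain-of-thought rather than a dedicated reasoning phase.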
Can I run only math questions to test a specific capability?
Yes. Use the category checkboxes in the run panel to select only the categories you want. You can also run a single mode, thinking OFF or thinking ON, for example to establish baseline performance before adding the thinking comparison.
Why does the creative category sometimes show the largest delta?
Creative tasks (haiku, metaphor) have subjective scoring criteria. When the model thinks step by step, it often produces more deliberate, polished output, which a self-judge scores higher. However, the judge may also be more lenient when the model shows its reasoning process, which introduces a positive bias toward thinking-ON responses in creative tasks.
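A per-category delta of this kind can be computed as the mean thinking-ON score minus the mean thinking-OFF score. The sketch below assumes a hypothetical record shape of `(category, thinking_on, score)` tuples. The sample scores are placeholders for illustration only and are not results from the eval.

```python
from collections import defaultdict
from statistics import mean


def category_deltas(results):
    """Mean thinking-ON score minus mean thinking-OFF score per category.

    `results` is an iterable of (category, thinking_on, score) tuples;
    this record shape is hypothetical.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for category, thinking_on, score in results:
        buckets[category][thinking_on].append(score)
    return {
        category: round(mean(modes[True]) - mean(modes[False]), 2)
        for category, modes in buckets.items()
        if modes[True] and modes[False]  # both modes are needed to compare
    }


# Placeholder scores for illustration only, not results from the eval.
sample = [
    ("creative", False, 7.0), ("creative", True, 8.5),
    ("math", False, 9.0), ("math", True, 9.0),
]
deltas = category_deltas(sample)
```

A positive delta computed this way does not separate genuine quality gains from judge leniency, so a large creative-category delta should be read with the bias described above in mind.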