About this benchmark
Agentic loops are the hardest reliability challenge in LLM deployment. A single-turn task that a model completes perfectly can fail unpredictably when it requires multi-step planning, information retrieval, and synthesis. This benchmark isolates the failure modes: which step breaks first, and how consistently does the model fail there?
The 5-step chain uses DeepSeek as the research target because it provides a chain of extractable facts — founding year, headquarters city, country — that must be discovered in sequence. Fake tool results are injected at each step so the eval works identically across all providers, regardless of whether the model has internet access.
Frequently asked questions
Why use text-based JSON tool calling instead of native function calling?
Native function-calling APIs vary significantly between providers and are not available on all models. Text-based JSON calling is provider-agnostic and tests whether the model understands the tool-calling paradigm — a more fundamental capability than provider-specific API compliance.
What happens at step 5 (Synthesis)?
Step 5 asks the model to write a free-form research report using all previously gathered information. There is no tool call — the model just writes text. The step passes if the response is at least 40 words and contains at least one of the key facts: DeepSeek, Hangzhou, China, or 2023.
Can I run multiple trials to test consistency?
Yes. Agentic behaviors are non-deterministic — a model that passes all 5 steps 3/5 times is more reliable than one that always fails at step 1, but less reliable than one that always completes the chain. Multiple trials expose this variance directly.