LLM Agentic Loop Reliability Eval

Multi-step agents break in unpredictable ways — a model that follows tool-calling instructions perfectly in isolation often fails when steps are chained. This benchmark runs each model through a 5-step research chain: Wikipedia search, weather lookup, country data retrieval, a second Wikipedia search, and final synthesis. Every step must call the right tool with correct arguments, passing information forward.

The chain

Five steps: search Wikipedia for DeepSeek → get weather for Hangzhou → get country info for China → search Wikipedia for 2023 AI events → synthesize a report. Each step's output feeds the next.

Tool calling

Models respond with JSON tool calls: {"tool":"wikipedia_search","args":{"query":"DeepSeek"}}. No provider-specific function calling needed — works with any model via any API.
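As a rough illustration of what parsing such a response might look like, the sketch below accepts a model's raw text and returns the tool call only if it is a JSON object with a string "tool" and an object "args". The function name parse_tool_call is hypothetical, and strict parsing (no surrounding prose or code fences tolerated) is an assumption; the benchmark's actual parser may be more lenient.

```python
import json
from typing import Optional


def parse_tool_call(response_text: str) -> Optional[dict]:
    """Return {"tool": ..., "args": ...} for a valid JSON tool call, else None.

    Assumption: the whole response must be a single JSON object.
    """
    try:
        payload = json.loads(response_text.strip())
    except json.JSONDecodeError:
        # Plain text (or malformed JSON) is not a tool call.
        return None
    if not isinstance(payload, dict):
        return None
    tool = payload.get("tool")
    args = payload.get("args")
    if not isinstance(tool, str) or not isinstance(args, dict):
        return None
    return {"tool": tool, "args": args}
```

Under this sketch, a response that answers in prose instead of emitting JSON returns None, which corresponds to the plain-text failure case.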

Scoring

Pass/fail per step. Steps 1 to 4 are validated for correct tool name and arguments; step 5 has no tool call and is validated on the report text. A failed step stops the chain, and subsequent steps are not attempted. Completion rate = % of trials where all 5 steps pass.
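The stop-on-first-failure rule and the completion rate can be sketched as follows. This is a minimal illustration, not the benchmark's own code: run_chain and completion_rate are hypothetical names, and each step is represented by a zero-argument callable that returns True on pass.

```python
from typing import Callable, List


def run_chain(step_checks: List[Callable[[], bool]]) -> List[bool]:
    """Run steps in order and stop at the first failure.

    The returned list holds one entry per attempted step, so a chain
    that fails at step 2 yields [True, False] and later steps are skipped.
    """
    results: List[bool] = []
    for check in step_checks:
        passed = check()
        results.append(passed)
        if not passed:
            break
    return results


def completion_rate(trials: List[List[bool]], total_steps: int = 5) -> float:
    """Percentage of trials in which all total_steps steps passed."""
    if not trials:
        return 0.0
    completed = sum(1 for t in trials if len(t) == total_steps and all(t))
    return 100.0 * completed / len(trials)
```

A trial counts toward the completion rate only if every step was attempted and passed; a chain that stops early contributes nothing, however many steps it cleared first.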

What it reveals

In the sample runs, Claude models completed all 5 steps in every trial. GPT-4.1 failed at step 1: it generated plain text instead of a JSON tool call, so the chain stopped there. DeepSeek showed mixed results across trials.

Scoring thresholds: agentic loop eval

Step result | Criteria | Effect on chain
PASS | Correct tool name and all required arguments | Chain continues to next step
FAIL | Wrong tool, missing argument, or plain text instead of JSON call | Chain stops; subsequent steps skipped

Step 1: wikipedia_search (query: DeepSeek)
Step 2: get_weather (city: Hangzhou)
Step 3: get_country_info (country: China)
Step 4: wikipedia_search (query: 2023 AI)
Step 5: synthesis (free-form report)

Each step uses the result from the previous one. Fake tool results are injected so any model can participate regardless of actual internet access.
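One way to picture per-step validation and fake-result injection is sketched below. The expected tools and arguments come from the step list above; everything else is an assumption made for illustration: the names EXPECTED_STEPS, FAKE_RESULTS, validate_step, and fake_tool_result are hypothetical, the case-insensitive substring match is a guess at the matching policy, and the canned strings are placeholders because the benchmark's real fixture text is not shown here.

```python
# Expected tool and arguments for steps 1-4, taken from the step list.
EXPECTED_STEPS = [
    {"tool": "wikipedia_search", "args": {"query": "DeepSeek"}},
    {"tool": "get_weather", "args": {"city": "Hangzhou"}},
    {"tool": "get_country_info", "args": {"country": "China"}},
    {"tool": "wikipedia_search", "args": {"query": "2023 AI"}},
]

# Placeholder canned results (hypothetical); real fixtures are not shown.
FAKE_RESULTS = {
    "wikipedia_search": "<canned Wikipedia summary>",
    "get_weather": "<canned weather report>",
    "get_country_info": "<canned country data>",
}


def validate_step(step_index: int, call: dict) -> bool:
    """Check a parsed tool call against the expectation for a 0-based step."""
    expected = EXPECTED_STEPS[step_index]
    if call.get("tool") != expected["tool"]:
        return False  # wrong tool
    args = call.get("args")
    if not isinstance(args, dict):
        return False
    for key, want in expected["args"].items():
        got = args.get(key)
        # Assumed policy: case-insensitive substring match on each argument.
        if not isinstance(got, str) or want.lower() not in got.lower():
            return False  # missing or mismatched argument
    return True


def fake_tool_result(call: dict) -> str:
    """Return the canned result for a validated call instead of hitting a live API."""
    return FAKE_RESULTS[call["tool"]]
```

Because the result is looked up rather than fetched, every model sees the same tool output at each step, which keeps trials comparable across providers.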

Sample results

Real runs showing step-by-step pass rates. Claude completed all 5 steps in every trial. GPT-4.1 stopped at step 1 in all 6 trials.

Run your own eval

Bring your own API key. Results stream live via WebSocket.

Your API key is never stored or logged.
OpenRouter keys start with sk-or-. Get one free at openrouter.ai

Recent runs

Runs are stored in memory for this session only.

About this benchmark

Agentic loops are among the hardest reliability challenges in LLM deployment. A single-turn task that a model completes perfectly can fail unpredictably when it requires multi-step planning, information retrieval, and synthesis. This benchmark isolates the failure modes: which step breaks first, and how consistently does the model fail there?

The 5-step chain uses DeepSeek as the research target because it provides a chain of extractable facts — founding year, headquarters city, country — that must be discovered in sequence. Fake tool results are injected at each step so the eval works identically across all providers, regardless of whether the model has internet access.

Frequently asked questions

Why use text-based JSON tool calling instead of native function calling?

Native function-calling APIs vary significantly between providers and are not available on all models. Text-based JSON calling is provider-agnostic and tests whether the model understands the tool-calling paradigm — a more fundamental capability than provider-specific API compliance.

What happens at step 5 (Synthesis)?

Step 5 asks the model to write a free-form research report using all previously gathered information. There is no tool call — the model just writes text. The step passes if the response is at least 40 words and contains at least one of the key facts: DeepSeek, Hangzhou, China, or 2023.
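The step 5 rule can be written as a short check. This is a sketch under stated assumptions: words are counted by splitting on whitespace and key facts are matched as case-insensitive substrings, neither of which is confirmed by the description above. The names KEY_FACTS and synthesis_passes are hypothetical.

```python
KEY_FACTS = ("DeepSeek", "Hangzhou", "China", "2023")


def synthesis_passes(report: str, min_words: int = 40) -> bool:
    """Pass if the report has at least min_words words and mentions a key fact."""
    # Assumption: a "word" is any whitespace-separated token.
    word_count = len(report.split())
    lowered = report.lower()
    # Assumption: key facts match as case-insensitive substrings.
    has_fact = any(fact.lower() in lowered for fact in KEY_FACTS)
    return word_count >= min_words and has_fact
```

Both conditions must hold: a long report that never mentions DeepSeek, Hangzhou, China, or 2023 fails, and so does a short one that mentions all four.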

Can I run multiple trials to test consistency?

Yes. Agentic behaviors are non-deterministic. A model that passes all 5 steps in 3 of 5 trials is more reliable than one that always fails at step 1, but less reliable than one that always completes the chain. Multiple trials expose this variance directly.
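To see where a model tends to break across trials, the per-trial results can be reduced to a tally of the first failing step. The sketch below assumes each trial is a list of booleans, one per attempted step; first_failed_step and failure_histogram are hypothetical names, and the example data is illustrative rather than measured.

```python
from collections import Counter
from typing import List, Optional


def first_failed_step(trial: List[bool]) -> Optional[int]:
    """Return the 1-based index of the first failed step, or None if none failed."""
    for step_number, passed in enumerate(trial, start=1):
        if not passed:
            return step_number
    return None


def failure_histogram(trials: List[List[bool]]) -> Counter:
    """Count trials by first failing step; the None key counts trials with no failure."""
    return Counter(first_failed_step(t) for t in trials)


# Illustrative data only: 3 trials with no failed step, 2 trials failing at step 1.
example_trials = [[True] * 5, [True] * 5, [True] * 5, [False], [False]]
histogram = failure_histogram(example_trials)
```

A tally concentrated on one step suggests a consistent failure mode, while a spread across steps points to run-to-run variance.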

Sunny Pal Singh
Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere.