What is an agentic loop eval?

An agentic loop eval tests whether a model can execute a multi-step task that requires calling tools in the correct order and passing information from one step to the next. Our eval uses a 5-step research chain: Wikipedia search, weather lookup, country data, another Wikipedia search, and final synthesis.

How does the tool call detection work?

We prompt the model to respond with a JSON tool call on each step. The eval parses the JSON, validates the tool name and arguments, then injects a fake result to continue the chain. This works across all providers without requiring native function-calling support.

Why does the chain use DeepSeek as the research target?

DeepSeek is a well-known AI company with easily verifiable facts: headquartered in Hangzhou, China, founded in 2023. This creates a chain of dependent lookups — each step's output informs the next — making it a realistic test of multi-hop reasoning and information passing.

What does a failed step mean?

A step fails when the model calls the wrong tool, formats the tool call incorrectly, passes incorrect arguments, or generates text instead of a JSON tool call. Once a step fails, the chain stops — subsequent steps are not attempted.

Is my API key stored?

No. Your key is transmitted over HTTPS, used only for the API call, and never logged or stored anywhere.

LLM Agentic Loop Reliability Eval

Multi-step agents break in unpredictable ways — a model that follows tool-calling instructions perfectly in isolation often fails when steps are chained. This benchmark runs each model through a 5-step research chain: Wikipedia search, weather lookup, country data retrieval, a second Wikipedia search, and final synthesis. Every step must call the right tool with correct arguments, passing information forward.

The chain

Five steps: search Wikipedia for DeepSeek → get weather for Hangzhou → get country info for China → search Wikipedia for 2023 AI events → synthesize a report. Each step's output feeds the next.

Tool calling

Models respond with JSON tool calls: {"tool":"wikipedia_search","args":{"query":"DeepSeek"}}. No provider-specific function calling needed — works with any model via any API.
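A minimal sketch of how a reply in this format might be parsed and validated. The tool names come from this page; the required-argument table, the function name parse_tool_call, and the choice to check only for the presence of each argument are assumptions for illustration, not the eval's actual code.

```python
import json

# Assumed registry: tool names are from the page, and the required argument
# names are inferred from the step list (query, city, country).
REQUIRED_ARGS = {
    "wikipedia_search": {"query"},
    "get_weather": {"city"},
    "get_country_info": {"country"},
}


def parse_tool_call(reply: str):
    """Return (tool, args) if reply is a valid JSON tool call, else None."""
    try:
        call = json.loads(reply.strip())
    except json.JSONDecodeError:
        return None  # plain text instead of a JSON call
    if not isinstance(call, dict):
        return None
    tool, args = call.get("tool"), call.get("args")
    if not isinstance(tool, str) or tool not in REQUIRED_ARGS:
        return None  # unknown or missing tool name
    if not isinstance(args, dict) or not REQUIRED_ARGS[tool].issubset(args):
        return None  # missing required argument
    return tool, args
```

A reply that is ordinary prose fails at the json.loads step, which matches the failure mode this page reports for GPT-4.1.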

Scoring

Pass/fail per step. Each step is validated for correct tool name and arguments. A failed step stops the chain — subsequent steps are not attempted. Completion rate = % of trials where all 5 steps pass.
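The completion-rate formula can be sketched as follows. The data layout is an assumption: each trial is a list of per-step pass/fail flags, which is shorter than 5 entries when the chain stopped early.

```python
def completion_rate(trials: list[list[bool]], steps: int = 5) -> float:
    """Percentage of trials in which every one of the `steps` steps passed."""
    if not trials:
        return 0.0
    # A trial counts only if all steps were attempted and all of them passed.
    complete = sum(1 for t in trials if len(t) == steps and all(t))
    return 100.0 * complete / len(trials)
```

Under this layout, six trials that each stop at step 1 give 0.0 and six full passes give 100.0.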

What it reveals

In the sample runs, Claude Sonnet 4.6 completed all 5 steps in 6 of 6 trials. GPT-4.1 failed at step 1 in all 6 trials: it generated text instead of a JSON tool call. DeepSeek showed mixed results across trials.

Scoring thresholds — agentic loop eval
Step result	Criteria	Effect on chain
PASS	Correct tool name and all required arguments	Chain continues to next step
FAIL	Wrong tool, missing argument, or plain text instead of JSON call	Chain stops; subsequent steps skipped

Step 1: wikipedia_search (query: DeepSeek)

Step 2: get_weather (city: Hangzhou)

Step 3: get_country_info (country: China)

Step 4: wikipedia_search (query: 2023 AI)

Step 5: synthesis (free-form report)

Each step uses the result from the previous one. Fake tool results are injected so any model can participate regardless of actual internet access.
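The loop described here might look roughly like the sketch below, covering steps 1 to 4 (step 5 has no tool call). The ask_model callable, the transcript format, and the bracketed placeholder results are hypothetical; only the tool names, argument names, and the stop-on-first-failure rule come from this page.

```python
import json
from typing import Callable

# (expected tool, required argument name, canned result to inject).
# The bracketed results are placeholders, not the eval's real fixtures.
STEPS = [
    ("wikipedia_search", "query",
     "DeepSeek: AI company headquartered in Hangzhou, China, founded in 2023."),
    ("get_weather", "city", "[placeholder weather report for Hangzhou]"),
    ("get_country_info", "country", "[placeholder country data for China]"),
    ("wikipedia_search", "query", "[placeholder summary of 2023 AI events]"),
]


def run_chain(ask_model: Callable[[list[str]], str]) -> list[bool]:
    """Run steps 1-4 and return pass/fail per attempted step."""
    transcript: list[str] = []
    results: list[bool] = []
    for tool, arg_name, fake_result in STEPS:
        reply = ask_model(transcript)
        try:
            call = json.loads(reply)
            ok = (isinstance(call, dict)
                  and call.get("tool") == tool
                  and isinstance(call.get("args"), dict)
                  and arg_name in call["args"])
        except json.JSONDecodeError:
            ok = False  # the model answered in prose
        results.append(ok)
        if not ok:
            break  # a failed step stops the chain
        # Inject the canned result so the next step can build on it.
        transcript.extend([reply, fake_result])
    return results
```

Because the results are canned, the model never needs network access, and every provider sees the same inputs at each step.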

Pre-computed runs

Sample results

Real runs showing step-by-step pass rates. Claude Sonnet 4.6 completed all 5 steps in all 6 trials. GPT-4.1 failed at step 1 in all 6 trials.

Live benchmark

Run your own eval

Bring your own API key. Results stream live via WebSocket.

Your API key is never stored or logged.

API provider

OpenRouter keys start with sk-or-. Get one free at openrouter.ai

API key

Select models (max 4)

Trials per model: 3

Your history

Recent runs

Runs are stored in memory for this session only.

About this benchmark

Agentic loops are among the hardest reliability challenges in LLM deployment. A model that completes a single-turn task perfectly can fail unpredictably when the task requires multi-step planning, information retrieval, and synthesis. This benchmark isolates the failure modes: which step breaks first, and how consistently does the model fail there?
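One way to answer "which step breaks first" is to tally the first failing step across trials. This is a sketch under an assumed data layout (a list of per-step pass/fail flags for each trial); the function name is hypothetical.

```python
from collections import Counter


def first_failure_counts(trials: list[list[bool]]) -> Counter:
    """Map step number (1-based) to how many trials first failed there."""
    counts: Counter = Counter()
    for steps in trials:
        for number, passed in enumerate(steps, start=1):
            if not passed:
                counts[number] += 1
                break  # only the first failure matters; the chain stops there
    return counts
```

A model that fails at the same step in every trial produces a single large count, while scattered counts indicate inconsistent behaviour.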

The 5-step chain uses DeepSeek as the research target because it provides a chain of extractable facts — founding year, headquarters city, country — that must be discovered in sequence. Fake tool results are injected at each step so the eval works identically across all providers, regardless of whether the model has internet access.

Frequently asked questions

Why use text-based JSON tool calling instead of native function calling?

Native function-calling APIs vary significantly between providers and are not available on all models. Text-based JSON calling is provider-agnostic and tests whether the model understands the tool-calling paradigm — a more fundamental capability than provider-specific API compliance.

What happens at step 5 (Synthesis)?

Step 5 asks the model to write a free-form research report using all previously gathered information. There is no tool call — the model just writes text. The step passes if the response is at least 40 words and contains at least one of the key facts: DeepSeek, Hangzhou, China, or 2023.
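The step 5 rule is simple enough to sketch directly. The 40-word minimum and the four key terms come from the text above; whitespace-based word counting and case-insensitive matching are assumptions.

```python
KEY_FACTS = ("DeepSeek", "Hangzhou", "China", "2023")


def synthesis_passes(report: str, min_words: int = 40) -> bool:
    """True if the report is long enough and mentions at least one key fact."""
    long_enough = len(report.split()) >= min_words
    lowered = report.lower()
    mentions_fact = any(fact.lower() in lowered for fact in KEY_FACTS)
    return long_enough and mentions_fact
```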

Can I run multiple trials to test consistency?

Yes. Agentic behavior is non-deterministic. A model that passes all 5 steps in 3 of 5 trials is more reliable than one that always fails at step 1, but less reliable than one that always completes the chain. Multiple trials expose this variance directly.

Key finding from real runs

Claude Sonnet 4.6 completed all 5 steps in 6/6 trials. GPT-4.1 failed at step 1 in all 6 trials — it generated a text response instead of a JSON tool call, breaking the chain immediately.

This benchmark was run on 2026-04-25. Results may differ on newer model versions.

REST API

Every eval is available as a JSON endpoint:

POST /api/v1/evals/agentic-loop/run
{ apiKey, keyType, models, trials }

GET /api/v1/evals/agentic-loop/runs
GET /api/v1/evals/agentic-loop/run/:id
GET /api/v1/evals/agentic-loop/run/:id/export
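
A hedged sketch of building the POST request with Python's standard library. The path and the four body fields are from the listing above; the base URL, the example keyType value, the model identifier format, and whether the API enforces the page's 4-model limit are all assumptions.

```python
import json
import urllib.request

BASE_URL = "https://example.com"  # hypothetical; substitute the real host


def build_run_request(api_key: str, key_type: str, models: list[str],
                      trials: int = 3, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an eval run."""
    if len(models) > 4:
        # The page's form allows at most 4 models; assumed to apply here too.
        raise ValueError("select at most 4 models")
    body = json.dumps({
        "apiKey": api_key,
        "keyType": key_type,  # e.g. "openrouter" (assumed value)
        "models": models,
        "trials": trials,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v1/evals/agentic-loop/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(req). The response shape is not documented on this page, so it is not parsed here.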

Affiliate disclosure

ByteWaveNetwork may earn a referral fee if you sign up for API access through links on this page. Benchmark results are independently produced and not influenced by commercial relationships.

Sunny Pal Singh

Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere.