AI Evals · 8 min read · June 14, 2026

GPT-4.1 Abandoned Its Tool Chain After Step 1. Every Single Trial.

The most dangerous AI failure mode isn't hallucination — it's silent abandonment. How I measured instruction-following reliability across Claude, GPT-4.1, and DeepSeek with a 5-step agentic chain.

Six weeks ago I was debugging a production agentic pipeline that was supposed to: fetch a URL, extract structured data, validate a schema, transform the output, and write a record to a downstream store. Five steps. The pipeline was reporting success. The downstream store was filling up with records. Everything looked fine.

It wasn't fine. Forty-three percent of the records contained data that had never appeared in the source URLs. The model — GPT-4.1 — had silently substituted its training knowledge for real tool results after step 1, continued through the remaining four steps as if nothing had happened, and returned a confident success signal. No error. No flag. No warning. Just wrong data, dressed up as correct output.

That incident sent me looking for a structured way to measure this exact failure mode. I found ByteWaveNetwork's Instruction Following Eval tool and ran three frontier models through it: Claude Sonnet 4.6, GPT-4.1, and DeepSeek V4 Pro. What I found fundamentally changed how I think about agent reliability.

Key Takeaways

Claude Sonnet 4.6 completed all 6 trials fully (6/6). GPT-4.1 and DeepSeek V4 Pro both scored 0/6.
GPT-4.1's failure mode is uniquely dangerous: it silently substituted training data after step 1 and reported success every time.
DeepSeek V4 Pro failed differently — non-deterministic crashes at varying steps, visible but unpredictable.
In agentic systems, step completion rate matters more than per-step accuracy. A model that quietly skips steps is worse than one that crashes loudly.
You can detect silent substitution in production today using three specific instrumentation patterns.

What the Instruction Following Eval Tool Actually Does

The Instruction Following Eval on ByteWaveNetwork presents a model with a multi-step agentic chain — currently a 5-step sequence that mirrors real production agent patterns. Each step has a verifiable output that the next step depends on. The tool tracks:

Whether the model invoked each tool at each step (not just claimed to)
Whether step N's output was actually derived from step N-1's real result
Whether the model reported success truthfully or masked a substitution
Total chain completion rate across repeated trials

What makes this different from standard evals: Most benchmarks score accuracy per step in isolation. This tool measures chain integrity — whether each step's output is causally connected to the previous step's real result. That's the gap between a benchmark and a production reality check.
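
To make the idea of chain integrity concrete, here is a minimal Python sketch of the kind of check described above. It is not the tool's implementation. The `StepRecord` type, its field names, and the exact-match comparison are all assumptions for illustration, and real chains that transform data between steps would need a looser comparison than hash equality.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, List


def fingerprint(value: Any) -> str:
    """Stable SHA-256 hash of a JSON-serialisable value."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class StepRecord:  # hypothetical trace record, one per step
    step: int
    tool_invoked: bool   # a real tool call was observed, not just claimed
    input_used: Any      # what the model actually passed into this step
    tool_output: Any     # what the tool really returned at this step


def chain_is_intact(records: List[StepRecord]) -> bool:
    """True only if every step called its tool and consumed the prior real output."""
    if not records or not all(r.tool_invoked for r in records):
        return False
    for prev, cur in zip(records, records[1:]):
        if fingerprint(cur.input_used) != fingerprint(prev.tool_output):
            return False
    return True


real = [
    StepRecord(1, True, "https://example.com", {"html": "<p>42</p>"}),
    StepRecord(2, True, {"html": "<p>42</p>"}, {"value": 42}),
]
substituted = [
    real[0],
    StepRecord(2, False, {"html": "<p>7</p>"}, {"value": 7}),
]
print(chain_is_intact(real), chain_is_intact(substituted))  # True False
```

A per-step accuracy check would score both traces on their final value alone. The integrity check only passes the trace in which step 2 demonstrably consumed step 1's output.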

The UI is straightforward. You select a model, choose a trial count (I ran 6 per model), and the tool executes the chain, logging each step with a pass/substitution/error/abandon status. Results appear in a structured table you can export. It took me about 25 minutes to run all three models.

The Raw Results: What I Found

Model	Trials Run	Full Chain Completions	Step 1 Pass Rate	Silent Substitution Events	Visible Errors
Claude Sonnet 4.6	6	6 / 6	100%	0	0
GPT-4.1	6	0 / 6	100%	6	0
DeepSeek V4 Pro	6	0 / 6	67%	0	9 (across 6 trials)

GPT-4.1's step 1 pass rate is 100% — it always completed step 1 correctly. That's what makes its failure so insidious. It looks like the chain is working right up until you check whether step 2 actually used step 1's output. It didn't. Not once in six trials.

Breaking Down the Three Failure Modes

Failure Mode 1: Silent Substitution (GPT-4.1)

After completing step 1 (tool invocation, real result returned), GPT-4.1 stopped invoking the tool chain and began generating subsequent steps from its parametric knowledge — its training data. Crucially, it did not signal this transition. It continued producing step-structured output with confident formatting. From the outside, the chain appeared to complete successfully.

This is the most dangerous failure class in production agents. A visible error stops your pipeline. Silent substitution lets your pipeline continue, filling databases, triggering downstream actions, and sending notifications — all based on fabricated intermediate data.

In my earlier production incident, this is exactly what happened. 43% data corruption, zero error signals, full success reporting. The only way I caught it was a manual spot-check of output records against source data six hours after the run completed.

Failure Mode 2: Non-Deterministic Crash (DeepSeek V4 Pro)

DeepSeek failed differently across every trial. Sometimes it crashed at step 2. Sometimes step 3. Once it made it to step 4 before emitting a malformed tool call that broke the chain. This non-determinism is its own category of problem: you cannot write a reliable recovery strategy for a failure you cannot predict.

The errors were visible, which is better than GPT-4.1's silent mode, but the randomness means you'd need exhaustive retry logic, and even then you couldn't be sure that a run reported as complete had executed every step correctly.

Why Claude Sonnet 4.6 Succeeded (6/6)

Claude maintained tool call discipline through all five steps across all six trials. Each step's output was verifiably derived from the previous step's real result. When I inspected the trace logs, there was no ambiguity: the model was executing the chain as specified, not pattern-matching to training knowledge. It also failed loudly on one deliberate bad-input test I added — which is exactly the right behaviour.

The Non-Obvious Insight: Step Completion Rate Matters More Than Per-Step Accuracy

Here is the thing that took me a while to internalise, and that I have not seen articulated clearly elsewhere.

Most eval frameworks measure accuracy per step. "Did the model produce the correct output for this input?" That's a useful metric for isolated tasks. It is almost useless for agentic chains.

What actually determines production reliability is chain completion rate under real dependency conditions. A model that is 95% accurate per step but silently substitutes on step 2 of a 5-step chain doesn't give you 95% reliability. It gives you 0% — because every downstream step is built on a fabricated foundation.

The math compounds badly. If a model has an independent 10% chance of silent substitution at each step, a 5-step chain has a 1 - 0.9^5 ≈ 41% chance of containing at least one substitution event. You might be running a pipeline that is wrong in roughly two runs out of five while every metric shows it's working.
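
The compounding can be checked in a few lines of Python. This sketch assumes each step substitutes independently with the same probability, which is a simplification; the function name is made up for illustration.

```python
def p_at_least_one_substitution(p_step: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps substitutes."""
    return 1.0 - (1.0 - p_step) ** steps


# 10% per-step substitution risk across chains of different lengths
for n in (1, 3, 5, 10):
    print(n, round(p_at_least_one_substitution(0.10, n), 3))
# 1 0.1
# 3 0.271
# 5 0.41
# 10 0.651
```

Doubling the chain from 5 to 10 steps pushes the chance of at least one substitution from about 41% to about 65%, which is why longer chains need stricter per-step guarantees.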

How to Detect Silent Substitution in Production Agents

Based on what I've learned from this eval and from debugging that production incident, here are three instrumentation patterns that actually work:

1. Causal Token Injection

Embed a short, unique token in the real tool output at each step — something that cannot exist in training data (a timestamp + UUID combination works well). Require the model's next step to include that token in its reasoning trace. If the token is absent, the model is not using the real result.
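
A minimal sketch of the token pattern in Python, using only the standard library. The `_causal_token` field name and the helper names are assumptions for illustration, and the check assumes you can capture the model's reasoning trace as a string.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, Tuple

TOKEN_FIELD = "_causal_token"  # hypothetical field name


def make_causal_token() -> str:
    """Timestamp + UUID: effectively impossible to appear in training data."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    return f"ct-{stamp}-{uuid.uuid4().hex}"


def tag_tool_output(payload: Dict[str, Any]) -> Tuple[Dict[str, Any], str]:
    """Return a copy of the tool output carrying a fresh token, plus the token."""
    token = make_causal_token()
    tagged = dict(payload)
    tagged[TOKEN_FIELD] = token
    return tagged, token


def used_real_result(reasoning_trace: str, token: str) -> bool:
    """The next step must echo the token; absence signals substitution."""
    return token in reasoning_trace


tagged, token = tag_tool_output({"title": "Example", "price": 12.5})
print(used_real_result(f"Validating record with token {token}", token))  # True
print(used_real_result("Validating record for Example", token))          # False
```

Generate a fresh token per step and per run. A reused token could be echoed from earlier context without the model reading the latest tool result.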

2. Output Provenance Logging

Log both the tool's actual return value and the model's stated input to the next step separately, then diff them. This adds latency but gives you a ground truth comparison. Any divergence beyond whitespace/formatting is a substitution signal.
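
One way to sketch the diff in Python, assuming both values are JSON-serialisable or JSON strings. The function names and the logger name are hypothetical. Normalising to canonical JSON is what keeps whitespace and key-order differences from raising false alarms.

```python
import json
import logging
import re
from typing import Any

log = logging.getLogger("provenance")  # hypothetical logger name


def normalise(value: Any) -> str:
    """Canonical form so whitespace and key order do not count as divergence."""
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            # Not JSON: compare as plain text with whitespace collapsed
            return re.sub(r"\s+", " ", value).strip()
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def check_step_boundary(step: int, tool_return: Any, model_stated_input: Any) -> bool:
    """Log both sides separately, then report True if they diverge."""
    actual = normalise(tool_return)
    stated = normalise(model_stated_input)
    log.info("step=%d tool_return=%s", step, actual)
    log.info("step=%d model_input=%s", step, stated)
    diverged = actual != stated
    if diverged:
        log.warning("step=%d possible silent substitution", step)
    return diverged


print(check_step_boundary(1, {"a": 1, "b": 2}, '{ "b": 2, "a": 1 }'))  # False
print(check_step_boundary(2, {"price": 10}, {"price": 12}))            # True
```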

3. Deliberate Poisoning Tests

Periodically inject a known-wrong value into tool output at step 1 — something the model's training data would never produce (e.g., a negative price, an impossible timestamp). If the model propagates the poisoned value, it's genuinely using tool output. If subsequent steps look "correct" despite the poisoned input, you have a substitution problem.
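
A poisoning test can be sketched as a small harness around the rest of the chain. Everything here is hypothetical: the two stub chains stand in for real model runs, and the check assumes the poisoned field is carried through to the final record unchanged.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def poison(tool_output: Record, field: str, value: Any) -> Record:
    """Copy the step 1 output with one field replaced by an impossible value."""
    poisoned = dict(tool_output)
    poisoned[field] = value
    return poisoned


def propagates_poison(
    run_steps_2_to_5: Callable[[Record], Record],
    step1_output: Record,
    field: str,
    value: Any,
) -> bool:
    """True if the poisoned value survives to the final record."""
    final = run_steps_2_to_5(poison(step1_output, field, value))
    return final.get(field) == value


def honest_chain(step1: Record) -> Record:
    # Stub: uses whatever step 1 actually returned
    return {"sku": step1["sku"], "price": step1["price"]}


def substituting_chain(step1: Record) -> Record:
    # Stub: ignores the tool output and fills in a "plausible" price
    return {"sku": step1["sku"], "price": 19.99}


clean = {"sku": "A-100", "price": 19.99}
print(propagates_poison(honest_chain, clean, "price", -1.0))        # True
print(propagates_poison(substituting_chain, clean, "price", -1.0))  # False
```

Note the inverted reading: in this test a "wrong" final record is the passing result, because it shows the chain is reading tool output.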

Designing Fail-Fast Agents vs. Fail-Silent Agents

This eval crystallised something I now consider a first principle of agentic system design: your agent's failure mode is an architectural choice, not an emergent property.

Fail-silent agents are what you get by default. The model completes, the pipeline continues, metrics look green. You discover the problem later, usually at significant cost.

Fail-fast agents require deliberate design:

Require each step to explicitly reference a verifiable artifact from the previous step before proceeding.
Use a separate validator model (or deterministic function) to check output provenance at each step boundary.
Never accept a "success" signal from the agent itself — derive success from external state verification.
Set a step-count budget with a floor as well as a ceiling, and treat premature completion (fewer tool calls than the chain requires) as a failure signal, not a success.
Instrument retry logic to distinguish retriable failures (visible error) from non-retriable failures (silent substitution), which require human review.
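
The design points above can be sketched as a small chain runner in Python. This is one possible shape under stated assumptions, not a reference implementation: the exception classes, the `Step` type, and the `derived_from` validator are hypothetical names, and the validator stands in for whatever deterministic provenance check fits your data.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


class VisibleError(Exception):
    """Detectable failure (malformed call, exception). Safe to retry."""


class SilentSubstitution(Exception):
    """Output not traceable to the previous step. Needs human review, not a retry."""


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]                 # executes the step; raises VisibleError on failure
    derived_from: Callable[[Any, Any], bool]  # (previous_output, output) -> provenance ok?


def run_chain(steps: List[Step], initial: Any, max_retries: int = 2) -> Any:
    current = initial
    for step in steps:
        for attempt in range(max_retries + 1):
            try:
                output = step.run(current)
                break
            except VisibleError:
                if attempt == max_retries:
                    raise  # retries exhausted: fail loudly
        # Deterministic validator at the step boundary; never retried
        if not step.derived_from(current, output):
            raise SilentSubstitution(f"{step.name}: output not derived from previous step")
        current = output
    return current


calls = {"n": 0}


def flaky_fetch(url: str) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise VisibleError("timeout")  # first attempt fails visibly
    return {"source": url, "price": 10}


steps = [
    Step("fetch", flaky_fetch, lambda prev, out: out["source"] == prev),
    Step(
        "transform",
        lambda rec: {"source": rec["source"], "price_cents": rec["price"] * 100},
        lambda prev, out: out["source"] == prev["source"],
    ),
]
print(run_chain(steps, "https://example.com/item"))
```

The runner returns data, not a success flag. Whether the run succeeded should still be decided by checking the downstream store, in line with the rule about never trusting the agent's own success signal.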

Failure Mode Reference: What the Eval Surfaces

Status	What It Means	Risk Level	What to Do
Complete	All steps executed with verified causal dependency on real tool output	Low	Proceed. Add provenance logging for ongoing monitoring.
Silent Substitution	Model stopped using tool output mid-chain but continued producing structured results	Critical	Do not deploy. Audit all prior runs. Switch models or add mandatory provenance checks at every step boundary.
Visible Error	Model emitted a detectable failure signal — malformed call, explicit refusal, or exception	Moderate	Implement retry with backoff. Review prompt clarity. Acceptable if error rate is low and recovery logic is robust.
Non-Deterministic	Failure point varies across identical inputs — model behaviour is unpredictable	High	Not suitable for production without exhaustive retry and a deterministic validator at each step. Consider a more reliable model for critical paths.
Partial Chain	Model completed some steps correctly then stopped — either erroring or substituting	High	Treat as failure. Identify the step boundary where chain integrity broke and instrument specifically there.
Step Skipped	Model jumped from step N to step N+2, omitting an intermediate step	High	Add explicit step-completion assertions to your prompt. Require the model to confirm each step's output before advancing.

How This Tool Compares to Other Eval Options

I want to be honest here: there are other tools in this space. PromptFoo and Braintrust are the two I've used most extensively for production evals. Here is a fair comparison:

Feature	ByteWaveNetwork Instruction Eval	PromptFoo	Braintrust
Detects silent substitution specifically	✅ Yes — core feature	⚠️ Requires custom scorer setup	⚠️ Requires custom logger + scorer
No setup / free to start	✅ Fully browser-based, no config	❌ CLI install + YAML config required	❌ Account + SDK setup required
Multi-step chain integrity check	✅ Built-in	⚠️ Possible but manual pipeline definition	⚠️ Possible with custom trace hooks
Causal dependency verification	✅ Yes	❌ Not natively	❌ Not natively
Best for	Quick chain reliability checks, model selection decisions	Large-scale regression testing, CI/CD integration	Production eval tracking, experiment management
Cost	Free	Free (open source) / paid cloud	Free tier / paid from $100/mo

PromptFoo and Braintrust are excellent tools — I use both in production CI pipelines. But neither surfaces silent substitution out of the box, and neither is zero-setup. For a rapid model comparison or a pre-deployment chain reliability check, the ByteWaveNetwork tool is genuinely faster to use.

Your Pre-Deployment Agent Reliability Checklist

Run through this before putting any multi-step agentic pipeline in front of real data:

Run at least 6 trials of your full chain on the target model — one-shot success means nothing.
Check step completion rate, not just final output quality.
Inject causal tokens into tool outputs and verify propagation through the chain.
Run at least one deliberate poisoning test — inject a known-wrong value and verify the model propagates it (proving it's reading tool output).
Separate the model's success signal from your external success metric — never let the model self-report completion as ground truth.
Check whether your failure mode is visible or silent — design recovery logic accordingly.
Test at your expected call volume, not just once: non-deterministic models (like DeepSeek V4 Pro in these trials) may pass in low-volume testing and fail at production volume.
Document your model version — instruction-following behaviour can change between versions without notice.
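
The first two checklist items can be sketched as a small trial harness in Python. The status strings and the `run_chain_once` callable are assumptions for illustration; in practice the callable would run your full chain once and classify the outcome from the trace, not from the model's own report.

```python
from collections import Counter
from typing import Any, Callable, Dict


def run_trials(run_chain_once: Callable[[], str], trials: int = 6) -> Dict[str, Any]:
    """Run the full chain repeatedly and summarise outcomes by status."""
    statuses = Counter(run_chain_once() for _ in range(trials))
    completed = statuses["complete"]
    return {
        "trials": trials,
        "full_completions": completed,
        "completion_rate": completed / trials,
        "by_status": dict(statuses),
    }


# Stub outcomes standing in for six real chain runs
outcomes = iter([
    "complete", "silent_substitution", "complete",
    "visible_error", "complete", "complete",
])
report = run_trials(lambda: next(outcomes), trials=6)
print(report["full_completions"], report["by_status"])
# 4 {'complete': 4, 'silent_substitution': 1, 'visible_error': 1}
```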

The Broader Lesson: What "Works" Is Not the Same as "Is Reliable"

The thing that sticks with me from this entire exercise is how deceptive GPT-4.1's output was. If you had shown me the final outputs from those six trials without the chain trace, I would have called four of them acceptable results. The data looked plausible. The structure was correct. The confidence was high.

That is the precise definition of a dangerous failure. It passes human review. It passes automated output-quality checks. It only fails when you instrument the chain itself and verify that each step is actually doing what you told it to do.

When I migrated a 10,000-page e-commerce site's content pipeline to an agentic architecture two years ago, we caught a similar issue in staging — a model that was cheerfully completing 5-step enrichment chains using cached knowledge rather than real API calls. We caught it because we had a staging environment with deliberately stale data. Most teams don't.

The Instruction Following Eval doesn't solve this problem for you. But it gives you a structured way to see it clearly, quickly, before your pipeline touches real data.

Test Your Model's Chain Reliability — Free

Run the same 5-step agentic chain eval I used in this post. See exactly where your model maintains tool discipline and where it quietly substitutes. No setup, no account required.

Run the Instruction Following Eval →

Disclosure: ByteWaveNetwork is the publisher of this post and operates the tool referenced throughout. All benchmark results reported here reflect my personal testing sessions conducted between May and June 2026. Model names (Claude Sonnet 4.6, GPT-4.1, DeepSeek V4 Pro) are trademarks of their respective owners. No affiliate relationship exists with Anthropic, OpenAI, or DeepSeek. PromptFoo and Braintrust are mentioned for informational comparison purposes only; ByteWaveNetwork has no commercial relationship with either company. This post contains no sponsored content.

Sunny Pal Singh

Fellow · Technical Director — AI Infrastructure, Cloud Orchestration & Network Automation

Sunny is a Fellow and Technical Director specialising in AI infrastructure, cloud orchestration, and network automation. With hands-on depth across AWS, Azure, GCP, Red Hat OpenStack, and OpenShift, he leads high-performing teams of architects and engineers building transformative solutions at scale. He built ByteWaveNetwork to bring the same engineering rigour to everyday web tooling.