AI Evals

I Tested 4 LLMs at 178,000 Tokens. One Had a Consistent Weak Zone at 90%.

Sunny Pal Singh · 10 min read

A reproducible benchmark of Claude, GPT-4.1, DeepSeek V3, and DeepSeek V4 Pro on long-context needle retrieval. Real numbers, real costs, and what "lost in the middle" actually looks like in practice.

Three weeks ago I was choosing a model to power a contract analysis pipeline. The contracts were long: routinely 140,000 to 180,000 tokens once chunking metadata was included. A missed clause buried on page 47 of a lease could cost a client real money. So before I committed to any model I needed one answer: can it actually find a specific fact when that fact is hiding deep inside a massive document?

I built a quick harness, ran some tests, and found something I did not expect: DeepSeek V3 missed two of three retrievals when the needle sat at the 90% position in the context window. Not "slightly weaker": it returned a wrong answer outright, and the same weak spot showed up again in follow-up runs. That finding changed my architecture decision entirely, and it cost me $0.11 in API calls to discover it.

This post walks through exactly what I tested, how I used ByteWaveNetwork's Context Retrieval Eval tool to structure those benchmarks reproducibly, and what the numbers mean for production decisions.

Key Takeaways

  • Claude Sonnet 4.6 and GPT-4.1 both scored 15/15 on needle retrieval across all positions at ~178k tokens.
  • DeepSeek V3 scored 13/15 with a repeatable weak zone around the 90% context position.
  • DeepSeek V4 Pro showed inconsistent results run-to-run — a reliability red flag for production.
  • Cost delta is massive: Claude at $5.36, GPT-4.1 at $1.43, DeepSeek V3 at $0.11 for the same benchmark suite.
  • Always run 3+ trials per position. A single run hides variance that can flip your model choice.

Why Long-Context Retrieval Is Harder Than Benchmarks Suggest

Most public leaderboards test retrieval in short windows or at obvious positions (start, end). Real workloads don't cooperate. In production pipelines — RAG systems, contract review, code repo analysis — the critical fact is rarely at the top of the file. It's buried. And the model's ability to attend to it degrades non-linearly depending on architecture.

The academic term is "lost in the middle" — models tend to recall information placed at the very beginning or very end of the context window far better than information placed in the middle. But here's what the papers don't always say clearly: the weak zone is model-specific and position-specific, not a uniform gradient. DeepSeek V3's failure at 90% is not a "middle" problem. It's a near-end problem. That distinction matters enormously when you're deciding where to place your most important chunks in a RAG pipeline.

The non-obvious insight: "Lost in the middle" is a useful heuristic but a dangerous oversimplification. Your model might have a weak zone at 70%, 90%, or 40% — and you won't know without testing your specific model at your specific context length. Position your highest-value chunks based on your model's actual attention profile, not a generic rule.

The Tool: What You Actually See

I used ByteWaveNetwork's Context Retrieval Eval to run the benchmark. Here's what the interface gives you and why the design choices matter for reproducibility.

Configuration Panel

You start with a context length slider (up to the model's maximum — I set it to 178,000 tokens), a needle fact input field, and a set of test positions expressed as percentages: 10%, 25%, 50%, 75%, 90%, and 100% by default, though you can customise them. You enter a question the model must answer using only the needle fact, and a keyword the correct answer must contain for scoring.

You can load up to four models simultaneously. Each model runs against the same padded context, the same needle, the same question, the same positions. Side-by-side. No manual API switching.

Results Output

The output is a grid: models across columns, needle positions down rows. Each cell shows a PASS or FAIL badge, the model's raw response excerpt, and the token cost for that call. At the bottom you get per-model totals: accuracy score (e.g. 13/15), total cost, and average latency.

What I found particularly useful was the response excerpt. When DeepSeek V3 failed at 90%, the excerpt showed it returning a plausible-sounding but fabricated answer — not a refusal, not an error. Without seeing the raw output I would have had to guess whether it failed silently or loudly. It failed silently. That's the more dangerous failure mode.

The Benchmark: Setup and Numbers

Test Design

  • Filler context: a randomised mix of public domain legal text totalling approximately 178,000 tokens.
  • Needle: a unique synthetic fact, namely a fictional company name paired with a specific dollar figure that appeared nowhere else in the filler.
  • Question: "What is [Company]'s total liability cap according to the agreement?"
  • Keyword scorer: the exact dollar figure.
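As a rough illustration of the setup above, needle placement can be sketched in a few lines of Python. This is not the tool's implementation: the function name `insert_needle`, the word-based depth measure (the benchmark itself counts tokens), and the sample needle and filler are all assumptions made for illustration.

```python
def insert_needle(filler: str, needle: str, position_pct: float) -> str:
    """Return `filler` with `needle` spliced in at roughly `position_pct` percent depth.

    Depth is measured in whitespace-separated words, a simple stand-in for tokens.
    """
    if not 0 <= position_pct <= 100:
        raise ValueError("position_pct must be between 0 and 100")
    words = filler.split()
    cut = round(len(words) * position_pct / 100)  # word index where the needle goes
    return " ".join(words[:cut] + [needle] + words[cut:])


# Hypothetical needle: the company name and figure are invented for this sketch.
NEEDLE = "Northwind Holdings' total liability cap is $4,250,000."
filler_text = " ".join(f"clause{i}" for i in range(1000))  # placeholder filler
context = insert_needle(filler_text, NEEDLE, 90)
```

Splitting on words can land the needle mid-sentence. A harness meant for real measurements would probably split on sentence or paragraph boundaries and measure depth with the target model's tokenizer.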

I ran each model across 15 position/trial combinations (5 positions × 3 trials each). Three trials per position is the minimum I'd recommend — see the reproducibility section below for why.

Results Table

| Model | Score | Weak Zone | Silent Failures | Total Cost | Verdict |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 15/15 | None detected | 0 | $5.36 | Production Ready |
| GPT-4.1 | 15/15 | None detected | 0 | $1.43 | Production Ready |
| DeepSeek V3 | 13/15 | ~90% position | 2 | $0.11 | Conditional Use |
| DeepSeek V4 Pro | Variable | Inconsistent | Unknown | $0.09 | Not Production Ready |

On DeepSeek V4 Pro: Across three separate full runs, scores ranged from 9/15 to 14/15. That 5-point variance on the same benchmark with the same inputs is a reliability signal I can't ignore for any high-stakes pipeline. It may improve; I'd retest at GA.

The Retrieval Positions Explained

Understanding what each position tests helps you map benchmark results to your actual use case. Here's the breakdown with what to do when a model fails at that position.

| Position | What It Tests | Failure Meaning | What to Do |
|---|---|---|---|
| 10% (Early) | Primacy recall; system prompt proximity | FAIL: fundamental context issue | Rule out the model entirely for long-context tasks |
| 25% (Early-Middle) | First attention decay zone | FAIL: early degradation | Keep critical chunks in the top 20% of context |
| 50% (Middle) | Classic "lost in the middle" zone | FAIL: common weakness | Avoid placing must-retrieve content at midpoint; re-chunk strategy needed |
| 75% (Late-Middle) | Secondary attention decay zone | FAIL: late-window issue | Place high-priority content at start or end; reorder RAG retrieved chunks |
| 90% (Near-End) | Pre-recency zone; often overlooked | FAIL: DeepSeek V3 failure mode | Do not place needle content here for this model; use end-anchoring instead |
| 100% (End) | Recency recall; instruction proximity | FAIL: severe architecture problem | Model unsuitable for any long-context use; escalate or replace |

Why One Run Is Not Enough: The Reproducibility Problem

This is the insight I wish I'd had earlier in my career. When I first benchmarked these models I ran one trial per position. DeepSeek V3 came back with a near-perfect score, and I almost committed to it as my pipeline model.

Then I ran three trials. The 90% position failed twice out of three runs. That's a 67% failure rate on a specific position — not noise, not a fluke. It's a repeatable structural weakness in how the model allocates attention near the end of a very long context. One trial masked it entirely.

Minimum viable benchmark protocol: Run at least 3 trials per position. If any position returns a mix of passes and failures across those trials, increase to 5. The ByteWaveNetwork tool lets you configure trial count per position, which is the feature that made this practical; manually orchestrating 15+ API calls per model would have taken me hours.
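The protocol above can be expressed as a small check over trial results. The sketch below assumes results have been collected by hand as (position, passed) pairs. The function names `summarise_trials` and `positions_needing_more_trials` are hypothetical, and the sample data is shaped to mirror the 13/15 DeepSeek V3 result rather than copied from a real log, so the choice of five positions and the order of passes and failures at 90% are invented.

```python
from collections import defaultdict


def summarise_trials(results):
    """Group (position_pct, passed) pairs by position and count passes."""
    by_position = defaultdict(list)
    for position, passed in results:
        by_position[position].append(bool(passed))

    summary = {}
    for position, outcomes in sorted(by_position.items()):
        passes = sum(outcomes)
        summary[position] = {
            "passes": passes,
            "trials": len(outcomes),
            # "mixed" means at least one pass and at least one failure
            "mixed": 0 < passes < len(outcomes),
        }
    return summary


def positions_needing_more_trials(summary, min_trials=3):
    """Positions that are under-sampled or gave inconsistent results."""
    return [
        position
        for position, stats in summary.items()
        if stats["mixed"] or stats["trials"] < min_trials
    ]


# Illustrative data only: four clean positions plus a mixed result at 90%.
sample = [(p, True) for p in (10, 25, 50, 75) for _ in range(3)]
sample += [(90, True), (90, False), (90, False)]

summary = summarise_trials(sample)
rerun = positions_needing_more_trials(summary)  # [90]
```

A position that fails every trial is not flagged as mixed here, because it is already an unambiguous result. Only inconsistent or under-sampled positions are queued for extra trials.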

Cost vs. Accuracy: Making the Production Decision

Let's be direct about the math. Claude Sonnet 4.6 costs roughly 49× as much as DeepSeek V3 for this benchmark suite ($5.36 vs. $0.11). That gap doesn't go away at scale: it grows in absolute terms with every call. So how do you decide?

| Use Case | Recommended Model | Reasoning |
|---|---|---|
| High-stakes retrieval (legal, medical, compliance) | Claude Sonnet 4.6 or GPT-4.1 | Zero tolerance for silent failures; cost is secondary |
| Cost-sensitive RAG with chunk control | DeepSeek V3 with chunking guardrails | Avoid 85–95% position placement; restructure retrieval order |
| High-volume summarisation (non-critical facts) | DeepSeek V3 | $0.11 per full benchmark run enables volume; weak zone acceptable |
| Any production pipeline today | Avoid DeepSeek V4 Pro | Run-to-run variance makes it unpredictable until model stabilises |

My contract analysis pipeline ended up on GPT-4.1. Not because it scored better than Claude — they tied — but because the cost-per-contract at volume was $1.43 vs $5.36 for the equivalent context load. At 500 contracts a month, that's a $1,965 monthly difference. For the accuracy parity I observed, that was an easy call.
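The arithmetic behind these figures is easy to reproduce. The sketch below uses only the suite totals reported in this post and, like the paragraph above, treats a suite total as the cost of one contract, which is an assumption rather than a measured per-contract price. The names `SUITE_COST`, `cost_ratio`, and `monthly_difference` are hypothetical.

```python
from decimal import Decimal

# Total benchmark-suite cost per model, as reported in the results table.
SUITE_COST = {
    "Claude Sonnet 4.6": Decimal("5.36"),
    "GPT-4.1": Decimal("1.43"),
    "DeepSeek V3": Decimal("0.11"),
}


def cost_ratio(model_a: str, model_b: str) -> Decimal:
    """How many times more model_a costs than model_b for the same suite."""
    return SUITE_COST[model_a] / SUITE_COST[model_b]


def monthly_difference(model_a: str, model_b: str, units_per_month: int) -> Decimal:
    """Extra monthly spend for model_a over model_b at a given volume."""
    return (SUITE_COST[model_a] - SUITE_COST[model_b]) * units_per_month


claude_vs_deepseek = cost_ratio("Claude Sonnet 4.6", "DeepSeek V3")  # just under 49
claude_vs_gpt_monthly = monthly_difference("Claude Sonnet 4.6", "GPT-4.1", 500)  # Decimal('1965.00')
```

`Decimal` is used instead of floats so that currency values such as 5.36 are represented exactly and the monthly figure comes out as a clean 1965.00.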

How to Run This Benchmark Yourself: Pre-Test Checklist

Before you open the tool, get these decisions made. It saves you burning tokens on poorly configured runs.

  • Define your actual context length. Don't test at 128k if your pipeline maxes at 64k. Match the benchmark to your workload.
  • Write a needle fact that is truly unique. It should contain a specific number or proper noun that cannot be inferred from surrounding filler content.
  • Use domain-relevant filler. Legal filler for legal pipelines; code for code analysis. Attention patterns differ on domain-matched vs. mismatched context.
  • Set trial count to 3 minimum before your first run. You can always stop early if results are consistent; you can't retroactively add trials cheaply.
  • Test positions that match your chunk placement strategy. If your RAG system always injects retrieved chunks at 60–80% of context, test those positions specifically.
  • Check your keyword scorer. Enter the exact string the model must produce. Aliases or paraphrases will cause false negatives. Test the scorer on a known-correct response first.
  • Record your API costs before and after. The tool shows per-run costs, but track them against your provider dashboard for the first benchmark to calibrate expectations.
  • Note the model version and date. LLM providers update models frequently. A benchmark from 60 days ago may not reflect current behaviour. Version-pin where possible.
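The scorer check in the list above is worth automating before spending tokens. This is a minimal sketch of a keyword scorer under stated assumptions: the tool's actual matching rules are not documented in this post, so the case-insensitive, whitespace-tolerant substring match below is a guess, and `keyword_pass`, `normalise`, and the dollar figure are hypothetical.

```python
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def keyword_pass(response: str, keyword: str) -> bool:
    """PASS if the normalised keyword appears anywhere in the normalised response."""
    return normalise(keyword) in normalise(response)


KEYWORD = "$4,250,000"  # invented figure for illustration

# A known-correct response should pass before any real run is trusted.
known_good = "The agreement caps total liability at $4,250,000."
# A correct paraphrase fails, which is the false negative the checklist warns about.
paraphrase = "The cap is about 4.25 million dollars."

scorer_ok = keyword_pass(known_good, KEYWORD)
paraphrase_ok = keyword_pass(paraphrase, KEYWORD)
```

Here `scorer_ok` is `True` and `paraphrase_ok` is `False`. If paraphrased answers matter for your workload, exact-string scoring will undercount passes and a looser scorer is needed.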

How This Compares to Other Evaluation Tools

I've used several tools in this space. Here's an honest comparison focused on what actually matters for the needle-in-haystack use case.

| Tool | Multi-model? | Configurable positions? | Cost tracking? | No-code UI? | Free tier? |
|---|---|---|---|---|---|
| ByteWaveNetwork Context Eval | ✅ Up to 4 | ✅ Custom % | ✅ Per-call + total | ✅ Fully UI-driven | ✅ Free |
| LangSmith (LangChain) | ✅ Yes | ⚠️ Requires custom harness | ⚠️ Indirect via traces | ❌ Code required | ⚠️ Limited free tier |
| LMSYS Chatbot Arena | ✅ Pairwise | ❌ No positional control | ❌ No cost data | ✅ UI-driven | ✅ Free |
| Custom Python harness | ✅ Unlimited | ✅ Full control | ⚠️ DIY tracking | ❌ Code required | ✅ Free (dev time costly) |

LangSmith is excellent for tracing production pipelines — it's not really designed for structured needle-in-haystack benchmarking without writing evaluation chains yourself. LMSYS Chatbot Arena is great for human preference comparison but gives you zero control over context structure or needle position. The ByteWaveNetwork tool fills a specific gap: structured, reproducible, multi-model needle retrieval eval with zero code and transparent cost tracking, free to use. For a practitioner who needs to validate a model choice in an afternoon rather than build an eval framework, that's the differentiator.

The Architecture Insight: Why Attention Degrades Differently

DeepSeek V3's 90% failure is unlikely to be random. Transformer attention is computationally expensive at long contexts, and different architectures use different strategies to manage it: sparse attention, sliding windows, grouped-query attention, and various positional encoding schemes (RoPE variants, ALiBi). One plausible explanation is that the 90% position sits in a zone where some positional encoding schemes exhibit interference between the end-of-document signal and the nearby needle.

This is speculative without access to model internals, but the pattern is consistent with what researchers have observed in RoPE-based long-context models at high position indices. The practical implication: if you're using DeepSeek V3 in a RAG pipeline, never place your highest-priority retrieved chunk as the penultimate chunk before the question. Put it first or last. In my benchmark, that single adjustment could plausibly have turned 13/15 into 15/15.
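One way to act on the "first or last" advice is to reorder retrieved chunks before building the prompt. The sketch below is an illustration of that idea, not something benchmarked in this post: the function name `anchor_chunks`, the (text, score) input shape, and the choice to put the runner-up last are all assumptions.

```python
def anchor_chunks(scored_chunks):
    """Order chunk texts so the best chunk is first and the runner-up is last.

    `scored_chunks` is a list of (text, relevance_score) pairs. The remaining
    chunks fill the middle in descending score order, so the slot just before
    the end holds the lowest-scoring chunk rather than a high-priority one.
    """
    ranked = sorted(scored_chunks, key=lambda chunk: chunk[1], reverse=True)
    texts = [text for text, _ in ranked]
    if len(texts) < 3:
        return texts  # nothing to rearrange beyond best-first
    best, runner_up, rest = texts[0], texts[1], texts[2:]
    return [best] + rest + [runner_up]


# Hypothetical retrieval scores for four chunks.
retrieved = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
ordered = anchor_chunks(retrieved)  # ["b", "c", "a", "d"]
```

Whether this helps a given model is an empirical question. Rerunning the needle benchmark with the reordered layout is the only way to confirm that the weak zone has actually been avoided.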

Conclusion

Long-context retrieval is not a solved problem. Two models tied at 15/15, one had a repeatable structural blind spot, and one was too inconsistent to trust. That four-way divergence, discovered in an afternoon with $6.99 in total API spend, is exactly the kind of finding that protects production systems from expensive silent failures.

The things I'd tell anyone building on long-context models: run structured needle tests before committing to a model, run them three times per position, use domain-matched filler, and don't assume "lost in the middle" tells you where your specific model's weak zone is. Test it. It takes less time than you think.

Run Your Own Context Retrieval Benchmark — Free

Compare up to 4 LLMs side-by-side on needle retrieval at your exact context length. Configurable positions, keyword scoring, per-call cost tracking. No code required.

Transparency disclosure: ByteWaveNetwork is the publisher of this post and the owner of the Context Retrieval Eval tool referenced throughout. All benchmark figures reflect my own testing conducted in May–June 2026 using live API endpoints; API costs were paid out of pocket and are reported accurately. This post contains no affiliate links. Competitor tools (LangSmith, LMSYS Chatbot Arena) are mentioned for informational comparison only; ByteWaveNetwork has no commercial relationship with those companies. Model behaviour may change with provider updates — always retest on current model versions before making production decisions.

Sunny Pal Singh

Fellow · Technical Director — AI Infrastructure, Cloud Orchestration & Network Automation

Sunny is a Fellow and Technical Director specialising in AI infrastructure, cloud orchestration, and network automation. With hands-on depth across AWS, Azure, GCP, Red Hat OpenStack, and OpenShift, he leads high-performing teams of architects and engineers building transformative solutions at scale. He built ByteWaveNetwork to bring the same engineering rigour to everyday web tooling.
