I Tested 4 LLMs at 178,000 Tokens. One Had a Consistent Weak Zone at 90%.
A reproducible benchmark of Claude, GPT-4.1, DeepSeek V3, and DeepSeek V4 Pro on long-context needle retrieval. Real numbers, real costs, and what "lost in the middle" actually looks like in practice.
Three weeks ago I was choosing a model to power a contract analysis pipeline. The contracts were long: routinely 140,000 to 180,000 tokens once chunk metadata was added. A missed clause buried on page 47 of a lease could cost a client real money. So before I committed to any model I needed one answer: can it actually find a specific fact when that fact is hiding deep inside a massive document?
I built a quick harness, ran some tests, and found something I did not expect: DeepSeek V3 failed two of its three retrievals when the needle sat at the 90% position in the context window, and the weakness reappeared in follow-up runs. Not "slightly weaker": at that position it returned the wrong answer more often than the right one. That finding changed my architecture decision entirely, and it cost me $0.11 in API calls to discover it.
This post walks through exactly what I tested, how I used ByteWaveNetwork's Context Retrieval Eval tool to structure those benchmarks reproducibly, and what the numbers mean for production decisions.
Key Takeaways
- Claude Sonnet 4.6 and GPT-4.1 both scored 15/15 on needle retrieval across all positions at ~178k tokens.
- DeepSeek V3 scored 13/15 with a repeatable weak zone around the 90% context position.
- DeepSeek V4 Pro showed inconsistent results run-to-run — a reliability red flag for production.
- Cost delta is massive: Claude at $5.36, GPT-4.1 at $1.43, DeepSeek V3 at $0.11 for the same benchmark suite.
- Always run 3+ trials per position. A single run hides variance that can flip your model choice.
Why Long-Context Retrieval Is Harder Than Benchmarks Suggest
Most public leaderboards test retrieval in short windows or at obvious positions (start, end). Real workloads don't cooperate. In production pipelines — RAG systems, contract review, code repo analysis — the critical fact is rarely at the top of the file. It's buried. And the model's ability to attend to it degrades non-linearly depending on architecture.
The academic term is "lost in the middle" — models tend to recall information placed at the very beginning or very end of the context window far better than information placed in the middle. But here's what the papers don't always say clearly: the weak zone is model-specific and position-specific, not a uniform gradient. DeepSeek V3's failure at 90% is not a "middle" problem. It's a near-end problem. That distinction matters enormously when you're deciding where to place your most important chunks in a RAG pipeline.
The Tool: What You Actually See
I used ByteWaveNetwork's Context Retrieval Eval to run the benchmark. Here's what the interface gives you and why the design choices matter for reproducibility.
Configuration Panel
You start with a context length slider (up to the model's maximum — I set it to 178,000 tokens), a needle fact input field, and a set of test positions expressed as percentages: 10%, 25%, 50%, 75%, 90%, and 100% by default, though you can customise them. You enter a question the model must answer using only the needle fact, and a keyword the correct answer must contain for scoring.
You can load up to four models simultaneously. Each model runs against the same padded context, the same needle, the same question, the same positions. Side-by-side. No manual API switching.
Results Output
The output is a grid: models across columns, needle positions down rows. Each cell shows a PASS or FAIL badge, the model's raw response excerpt, and the token cost for that call. At the bottom you get per-model totals: accuracy score (e.g. 13/15), total cost, and average latency.
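To make the shape of that totals row concrete, here is a minimal sketch of how per-call cells could roll up into per-model figures. It is not the tool's implementation: the `Call` record, its field names, and `summarise` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    """One grid cell: a single model call at a single needle position."""
    model: str
    position_pct: int
    passed: bool
    cost_usd: float
    latency_s: float


def summarise(calls: list[Call]) -> dict[str, dict[str, object]]:
    """Roll per-call cells up into one totals row per model."""
    summary: dict[str, dict[str, object]] = {}
    for model in sorted({c.model for c in calls}):
        own = [c for c in calls if c.model == model]
        summary[model] = {
            "score": f"{sum(c.passed for c in own)}/{len(own)}",
            "total_cost_usd": round(sum(c.cost_usd for c in own), 4),
            "avg_latency_s": round(mean(c.latency_s for c in own), 2),
        }
    return summary
```

Keeping the raw per-call records around, rather than only the totals, is what lets you go back and inspect an individual failure later.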
What I found particularly useful was the response excerpt. When DeepSeek V3 failed at 90%, the excerpt showed it returning a plausible-sounding but fabricated answer — not a refusal, not an error. Without seeing the raw output I would have had to guess whether it failed silently or loudly. It failed silently. That's the more dangerous failure mode.
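If you log raw responses yourself, the silent/loud distinction can be approximated in a few lines. This is a rough sketch under stated assumptions: the refusal phrases are a hand-picked, hypothetical list you would need to tune per model, and `classify` is an illustrative name, not part of any tool.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    LOUD_FAIL = "loud_fail"      # model refused or said it could not find the fact
    SILENT_FAIL = "silent_fail"  # model answered confidently, but wrongly


# Assumption: a small, hand-picked list of refusal phrases. Tune it per model.
REFUSAL_MARKERS = ("i don't know", "i cannot find", "not mentioned", "no information")


def classify(response: str, keyword: str) -> Outcome:
    """Label a response as a pass, a loud failure, or a silent failure."""
    text = response.lower()
    if keyword.lower() in text:
        return Outcome.PASS
    if not text.strip() or any(marker in text for marker in REFUSAL_MARKERS):
        return Outcome.LOUD_FAIL
    # An answer was given, but it does not contain the expected keyword.
    return Outcome.SILENT_FAIL
```

A fabricated dollar figure lands in `SILENT_FAIL`, which is the bucket worth alerting on in a production pipeline.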
The Benchmark: Setup and Numbers
Test Design
- Filler context: a randomised mix of public domain legal text totalling approximately 178,000 tokens.
- Needle: a unique synthetic fact, namely a fictional company name paired with a specific dollar figure that appeared nowhere else in the filler.
- Question: "What is [Company]'s total liability cap according to the agreement?"
- Keyword scorer: the exact dollar figure.
I ran each model across 15 position/trial combinations (5 positions × 3 trials each). Three trials per position is the minimum I'd recommend — see the reproducibility section below for why.
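For readers who want to see what "needle at position X%" means mechanically, here is a minimal sketch. It is an illustration rather than the tool's actual method: `insert_needle` is a hypothetical helper, and it treats whitespace-separated words as a crude stand-in for tokens.

```python
def insert_needle(filler: str, needle: str, position_pct: float) -> str:
    """Splice `needle` into `filler` at roughly `position_pct` percent depth.

    Assumption: whitespace-separated words approximate tokens. A real
    harness would count tokens with the target model's own tokenizer.
    """
    if not 0 <= position_pct <= 100:
        raise ValueError("position_pct must be between 0 and 100")
    words = filler.split()
    cut = round(len(words) * position_pct / 100)
    return " ".join(words[:cut] + [needle] + words[cut:])


def build_contexts(filler: str, needle: str, positions: list[float]) -> dict[float, str]:
    """One padded context per test position, all sharing the same filler."""
    return {pct: insert_needle(filler, needle, pct) for pct in positions}
```

Because every context is built from the same filler and the same needle, the only variable between runs is the position, which is the point of the test.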
Results Table
| Model | Score | Weak Zone | Silent Failures | Total Cost | Verdict |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 15/15 | None detected | 0 | $5.36 | Production Ready |
| GPT-4.1 | 15/15 | None detected | 0 | $1.43 | Production Ready |
| DeepSeek V3 | 13/15 | ~90% position | 2 | $0.11 | Conditional Use |
| DeepSeek V4 Pro | Variable | Inconsistent | Unknown | $0.09 | Not Production Ready |
The 6 Default Retrieval Positions Explained
Understanding what each position tests helps you map benchmark results to your actual use case. Here's the breakdown with what to do when a model fails at that position.
| Position | What It Tests | Failure Meaning | What to Do |
|---|---|---|---|
| 10% — Early | Primacy recall; system prompt proximity | FAIL — fundamental context issue | Rule out the model entirely for long-context tasks |
| 25% — Early-Middle | First attention decay zone | FAIL — early degradation | Keep critical chunks in the top 20% of context |
| 50% — Middle | Classic "lost in the middle" zone | FAIL — common weakness | Avoid placing must-retrieve content at midpoint; re-chunk strategy needed |
| 75% — Late-Middle | Secondary attention decay zone | FAIL — late-window issue | Place high-priority content at start or end; reorder RAG retrieved chunks |
| 90% — Near-End | Pre-recency zone; often overlooked | FAIL — DeepSeek V3 failure mode | Do not place needle content here for this model; use end-anchoring instead |
| 100% — End | Recency recall; instruction proximity | FAIL — severe architecture problem | Model unsuitable for any long-context use; escalate or replace |
Why One Run Is Not Enough: The Reproducibility Problem
This is the insight I wish I'd had earlier in my career. When I first benchmarked these models I ran one trial per position. DeepSeek V3 came back near-perfect. I almost committed to it as my pipeline model.
Then I ran three trials. The 90% position failed twice out of three runs. That's a 67% failure rate on a specific position, and it showed up again in follow-up runs, so I don't treat it as noise or a fluke. It points to a repeatable weakness near the end of a very long context. One trial per position had masked it.
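Grouping trials by position is what surfaces this kind of weakness. The sketch below shows the aggregation on illustrative data shaped like the DeepSeek V3 result (13 passes, two failures at 90%). The choice of the other four positions and the order of trials are assumptions made for the example, and `failure_rate_by_position` is a hypothetical helper.

```python
from collections import defaultdict


def failure_rate_by_position(results):
    """results: iterable of (position_pct, passed) tuples, one per trial."""
    trials = defaultdict(list)
    for position, passed in results:
        trials[position].append(passed)
    return {
        position: sum(1 for ok in outcomes if not ok) / len(outcomes)
        for position, outcomes in sorted(trials.items())
    }


# Illustrative data: three passing trials at four positions (assumed set),
# plus one pass and two failures at the 90% position.
results = [(p, True) for p in (10, 25, 50, 75) for _ in range(3)]
results += [(90, True), (90, False), (90, False)]

rates = failure_rate_by_position(results)
for position, rate in rates.items():
    print(f"{position:>3}%: {rate:.0%} failure rate")
```

The overall score of 13/15 looks healthy; only the per-position breakdown shows that every miss is concentrated in one place.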
Cost vs. Accuracy: Making the Production Decision
Let's be direct about the math. Claude Sonnet 4.6 cost roughly 49× as much as DeepSeek V3 for this benchmark suite ($5.36 vs $0.11). That ratio holds at scale, so the absolute gap grows with every call. So how do you decide?
| Use Case | Recommended Model | Reasoning |
|---|---|---|
| High-stakes retrieval (legal, medical, compliance) | Claude Sonnet 4.6 or GPT-4.1 | Zero tolerance for silent failures; cost is secondary |
| Cost-sensitive RAG with chunk control | DeepSeek V3 with chunking guardrails | Avoid 85–95% position placement; restructure retrieval order |
| High-volume summarisation (non-critical facts) | DeepSeek V3 | $0.11 per full benchmark run enables volume; weak zone acceptable |
| Any production pipeline today | Avoid DeepSeek V4 Pro | Run-to-run variance makes it unpredictable until model stabilises |
My contract analysis pipeline ended up on GPT-4.1. Not because it scored better than Claude — they tied — but because the cost-per-contract at volume was $1.43 vs $5.36 for the equivalent context load. At 500 contracts a month, that's a $1,965 monthly difference. For the accuracy parity I observed, that was an easy call.
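The monthly figure is simple arithmetic, sketched here so you can plug in your own volume. It follows the post in treating each model's benchmark total as the per-contract cost, which is an approximation; `monthly_cost` is a hypothetical helper.

```python
# Per-contract cost, taken from the benchmark totals above (USD).
# Assumption: one contract costs about what one benchmark suite did.
cost_per_contract = {
    "Claude Sonnet 4.6": 5.36,
    "GPT-4.1": 1.43,
}


def monthly_cost(per_contract_usd: float, contracts_per_month: int) -> float:
    """Projected monthly spend, rounded to cents."""
    return round(per_contract_usd * contracts_per_month, 2)


contracts = 500
claude = monthly_cost(cost_per_contract["Claude Sonnet 4.6"], contracts)
gpt = monthly_cost(cost_per_contract["GPT-4.1"], contracts)
monthly_difference = round(claude - gpt, 2)

print(f"Claude: ${claude:,.2f}  GPT-4.1: ${gpt:,.2f}  difference: ${monthly_difference:,.2f}")
```

Your real per-contract cost depends on your own token counts and current provider pricing, so treat these inputs as placeholders.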
How to Run This Benchmark Yourself: Pre-Test Checklist
Before you open the tool, get these decisions made. It saves you burning tokens on poorly configured runs.
- Define your actual context length. Don't test at 128k if your pipeline maxes at 64k. Match the benchmark to your workload.
- Write a needle fact that is truly unique. It should contain a specific number or proper noun that cannot be inferred from surrounding filler content.
- Use domain-relevant filler. Legal filler for legal pipelines; code for code analysis. Attention patterns differ on domain-matched vs. mismatched context.
- Set trial count to 3 minimum before your first run. You can always stop early if results are consistent; you can't retroactively add trials cheaply.
- Test positions that match your chunk placement strategy. If your RAG system always injects retrieved chunks at 60–80% of context, test those positions specifically.
- Check your keyword scorer. Enter the exact string the model must produce. Aliases or paraphrases will cause false negatives. Test the scorer on a known-correct response first.
- Record your API costs before and after. The tool shows per-run costs, but track them against your provider dashboard for the first benchmark to calibrate expectations.
- Note the model version and date. LLM providers update models frequently. A benchmark from 60 days ago may not reflect current behaviour. Version-pin where possible.
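The scorer check in particular is worth automating before you spend tokens. Below is a minimal sketch assuming a strict, case-insensitive substring scorer; the dollar figure and both function names are made up for illustration.

```python
def keyword_score(response: str, keyword: str) -> bool:
    """Strict scorer: case-insensitive substring match on the exact keyword."""
    return keyword.lower() in response.lower()


def check_scorer(keyword: str, known_correct: str, known_wrong: str) -> None:
    """Fail fast if the scorer misjudges responses whose label we already know."""
    if not keyword_score(known_correct, keyword):
        raise ValueError("scorer rejects a known-correct response; revisit the keyword")
    if keyword_score(known_wrong, keyword):
        raise ValueError("scorer accepts a known-wrong response; tighten the keyword")


# Hypothetical needle value, used only for this example.
KEYWORD = "$4,250,000"
check_scorer(
    KEYWORD,
    known_correct="The total liability cap is $4,250,000.",
    known_wrong="The total liability cap is $1,000,000.",
)

# A correct paraphrase still fails a strict scorer: this is the false negative to watch for.
paraphrase_passes = keyword_score("The cap is 4.25 million dollars.", KEYWORD)
```

If paraphrased answers are acceptable in your pipeline, you would need a looser scorer than this one, at the cost of more false positives.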
How This Compares to Other Evaluation Tools
I've used several tools in this space. Here's an honest comparison focused on what actually matters for the needle-in-haystack use case.
| Tool | Multi-model? | Configurable positions? | Cost tracking? | No-code UI? | Free tier? |
|---|---|---|---|---|---|
| ByteWaveNetwork Context Eval | ✅ Up to 4 | ✅ Custom % | ✅ Per-call + total | ✅ Fully UI-driven | ✅ Free |
| LangSmith (LangChain) | ✅ Yes | ⚠️ Requires custom harness | ⚠️ Indirect via traces | ❌ Code required | ⚠️ Limited free tier |
| LMSYS Chatbot Arena | ✅ Pairwise | ❌ No positional control | ❌ No cost data | ✅ UI-driven | ✅ Free |
| Custom Python harness | ✅ Unlimited | ✅ Full control | ⚠️ DIY tracking | ❌ Code required | ✅ Free (dev time costly) |
LangSmith is excellent for tracing production pipelines — it's not really designed for structured needle-in-haystack benchmarking without writing evaluation chains yourself. LMSYS Chatbot Arena is great for human preference comparison but gives you zero control over context structure or needle position. The ByteWaveNetwork tool fills a specific gap: structured, reproducible, multi-model needle retrieval eval with zero code and transparent cost tracking, free to use. For a practitioner who needs to validate a model choice in an afternoon rather than build an eval framework, that's the differentiator.
The Architecture Insight: Why Attention Degrades Differently
DeepSeek V3's 90% failure doesn't look random. Transformer attention is computationally expensive at long contexts, and different architectures use different strategies to manage it: sparse attention, sliding windows, grouped-query attention, and various positional encoding schemes (RoPE variants, ALiBi). My working hypothesis is that the 90% position sits in a zone where some positional encoding schemes show interference between the end-of-document signal and the nearby needle.
This is speculative without access to model internals, but the pattern is consistent with what researchers have observed in RoPE-based long-context models at high position indices. The practical implication: if you're using DeepSeek V3 in a RAG pipeline, never place your highest-priority retrieved chunk as the penultimate chunk before the question. Put it first or last. Both of DeepSeek V3's misses in my benchmark came from that zone, so this one adjustment targets exactly where the 13/15 fell short.
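One way to apply the "first or last" rule in a RAG pipeline is to deal chunks alternately to the front and back of the context. This is a sketch of one possible ordering, not a tested mitigation for any particular model, and `anchor_to_edges` is a hypothetical name.

```python
def anchor_to_edges(chunks_by_priority: list[str]) -> list[str]:
    """Arrange chunks so the highest-priority ones sit at the edges of the context.

    Input is sorted most-important first. Chunks are dealt alternately to the
    front and the back, so the least important ones end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_priority):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-most-important chunk comes last.
    return front + back[::-1]


ordered = anchor_to_edges(["A", "B", "C", "D", "E"])  # A is most important
print(ordered)
```

With five chunks the top-ranked chunk goes first and the runner-up goes last, which keeps both out of the penultimate slot. Whether this actually helps should be confirmed by rerunning the needle test on the reordered context.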
Conclusion
Long-context retrieval is not a solved problem. Two models tied at 15/15, one had a repeatable structural blind spot, and one was too inconsistent to trust. That split across four models, discovered in an afternoon with $6.99 in total API spend, is exactly the kind of finding that protects production systems from expensive silent failures.
The things I'd tell anyone building on long-context models: run structured needle tests before committing to a model, run them three times per position, use domain-matched filler, and don't assume "lost in the middle" tells you where your specific model's weak zone is. Test it. It takes less time than you think.
Run Your Own Context Retrieval Benchmark — Free
Compare up to 4 LLMs side-by-side on needle retrieval at your exact context length. Configurable positions, keyword scoring, per-call cost tracking. No code required.
Open the Context Retrieval Eval →
Transparency disclosure: ByteWaveNetwork is the publisher of this post and the owner of the Context Retrieval Eval tool referenced throughout. All benchmark figures reflect my own testing conducted in May–June 2026 using live API endpoints; API costs were paid out of pocket and are reported accurately. This post contains no affiliate links. Competitor tools (LangSmith, LMSYS Chatbot Arena) are mentioned for informational comparison only; ByteWaveNetwork has no commercial relationship with those companies. Model behaviour may change with provider updates, so always retest on current model versions before making production decisions.