LLM Context Retrieval Eval

Most language models claim large context windows — but does retrieval actually hold up at the tail end? This benchmark hides a fact at the 85%, 90%, and 95% position of a large document and asks the model to retrieve it. One question reveals a lot about attention quality.

The needle

A single fact — "Project Helios launch date is March 7, 2031" — is injected into a large filler document at a precise position. The model is asked to quote it back.

The haystack

20 synthetic IT operations sentences repeated to fill the target context size (~50K–180K tokens). Bland by design — attention stress, not comprehension.

What varies

Needle position (10%–95% of the document), number of trials, context size, and model. Multiple trials at each position give a reliable pass rate.

Scoring

PASS if ≥80% of trials are exact (85%+ keyword match). DEGRADED at 40–79%. FAIL below 40%. Each cell in the heatmap is one position × model combination.

Scoring thresholds — context retrieval eval
ResultExact match rate across trialsHeatmap color
PASS≥80% of trials with ≥85% keyword matchGreen
DEGRADED40–79% exactAmber
FAIL<40% exact or explicit not-foundRed
filler text — 85% of document
📍 needle
15%
Position 85% Position 90% Position 95%

The further right the needle, the harder it is for the model to retrieve — especially as context size grows.

Sample results

Real runs from the open-source eval suite. No API key needed — just click a run to load its heatmap.

Loading sample runs…
Select a run on the left to see its heatmap

Run your own eval

Use your own API key to test any model. Results are streamed live and never stored on our servers.

Your key is never stored. It travels over HTTPS, is used only for the API request, and is discarded immediately — not logged, not saved.
OpenRouter keys start with sk-or-. Get one free at openrouter.ai
Loading models…
Auto-generated IT infrastructure text at your chosen context size. The needle fact is injected at each test position.
Estimated cost before you run
Positions × trials9 calls
~Input tokens per call~50K
Input price
Estimated total

Results stream live below. Download JSON when complete.

Run history

Results from runs you completed this session. Reload the page to start fresh.

No runs yet — complete a run above to see history here.

Frequently asked questions

Is my API key stored anywhere?

No. Your key is transmitted over HTTPS to the server, used only for the outbound API call, and discarded immediately — never written to a database, log file, or environment variable.

What context sizes can I test?

The slider goes from 5K to 180K tokens. Testing at 1M tokens is possible via the Python script in the repo but requires significant API budget. 180K proves the tail-end claim at lower cost. Models with smaller limits (e.g. 64K) will error if you exceed them.

Can I change the needle fact or question?

Not in this UI — the fact is fixed so runs are comparable. The Python eval script (evals/01-long-context-retrieval/run_eval.py) accepts any fact and question.

Why does the same model give different scores across trials?

LLM inference is non-deterministic. Temperature and sampling contribute variance. That's why trials matter — 5 trials per position gives a reliable exact_rate. A single trial is not a benchmark.

What is OpenRouter?

OpenRouter is a unified API routing to 300+ models from Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and more — one key, one endpoint, transparent per-model pricing. This eval also accepts direct vendor keys if you prefer to skip the routing layer.

More tools for AI builders

  • SEO Site Audit — audit every page your AI content strategy produces
  • Schema Markup Tester — validate structured data that helps AI Overviews cite your content
  • Link Checker — keep the site you're building for AI search free of broken links

About this tool

I built this eval after spending two weeks manually testing whether Claude, GPT-4, and Gemini could reliably retrieve a specific clause from a 120-page legal contract. The results were surprising — retrieval accuracy dropped from 94% at the 50% position to 61% at the 95% position in one model. No published benchmark made that degradation visible at the per-position level with my own documents. The ByteWaveNetwork Context Eval fills that gap: bring your own haystack, pick your positions, see exactly where each model's attention degrades. — ByteWaveNetwork Team, building AI evaluation tools since 2023.

Sunny Pal Singh
Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere. LinkedIn →