LLM Context Retrieval Eval
Most language models advertise large context windows, but does retrieval actually hold up at the tail end? This benchmark hides a single fact deep in a large document, including at the 85%, 90%, and 95% positions, and asks the model to retrieve it. One question reveals a lot about attention quality.
The needle
A single fact — "Project Helios launch date is March 7, 2031" — is injected into a large filler document at a precise position. The model is asked to quote it back.
The haystack
Twenty synthetic IT operations sentences are repeated to fill the target context size (~50K–180K tokens). The text is bland by design: the test stresses attention, not comprehension.
What varies
Needle position (10%–95% of the document), number of trials, context size, and model. Multiple trials at each position give a reliable pass rate.
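The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the eval suite's actual code: the three filler sentences, the `build_haystack` helper, and the 4-characters-per-token estimate are all assumptions made for the example.

```python
import itertools

NEEDLE = "Project Helios launch date is March 7, 2031."

# Hypothetical filler; the real suite uses 20 synthetic IT operations sentences.
FILLER = [
    "The backup job completed on the secondary storage cluster.",
    "Network latency stayed within the expected range overnight.",
    "The patch window for the staging servers was rescheduled.",
]

CHARS_PER_TOKEN = 4  # rough assumption, not a real tokenizer


def build_haystack(target_tokens: int, position: float) -> str:
    """Repeat filler up to roughly target_tokens, then insert the needle.

    position is a fraction of the document: 0.0 is the start, 1.0 is the end.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")

    target_chars = target_tokens * CHARS_PER_TOKEN
    sentences = []
    total_chars = 0
    for sentence in itertools.cycle(FILLER):
        if total_chars >= target_chars:
            break
        sentences.append(sentence)
        total_chars += len(sentence) + 1  # +1 for the joining space

    # Insert on a sentence boundary so the needle is never split.
    index = round(position * len(sentences))
    sentences.insert(index, NEEDLE)
    return " ".join(sentences)


document = build_haystack(target_tokens=1000, position=0.9)
```

Inserting on a sentence boundary keeps the needle intact, at the cost of the position being accurate only to within one filler sentence.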
Scoring
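A trial counts as exact when the response matches at least 85% of the needle's keywords. A cell is PASS if ≥80% of its trials are exact, DEGRADED at 40–79%, and FAIL below 40% or on an explicit not-found response. Each cell in the heatmap is one position × model combination.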
| Result | Exact match rate across trials | Heatmap color |
|---|---|---|
| PASS | ≥80% of trials with ≥85% keyword match | Green |
| DEGRADED | 40–79% exact | Amber |
| FAIL | <40% exact or explicit not-found | Red |
The deeper into the document the needle sits, the harder it is for the model to retrieve, especially as context size grows.
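The scoring thresholds can be expressed as a short Python sketch. The thresholds come from the table; everything else is an assumption for illustration, in particular the way keywords are extracted from the needle (lowercased alphanumeric tokens minus a small stopword set) and the `keyword_match`, `is_exact`, and `classify` names. In this sketch an explicit not-found reply simply counts as a non-exact trial.

```python
import re

NEEDLE = "Project Helios launch date is March 7, 2031"
STOPWORDS = {"is"}  # assumption: ignore words that carry no signal


def _tokens(text: str) -> list[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_match(response: str, needle: str = NEEDLE) -> float:
    """Fraction of needle keywords that appear in the response."""
    keywords = [word for word in _tokens(needle) if word not in STOPWORDS]
    found = set(_tokens(response))
    return sum(word in found for word in keywords) / len(keywords)


def is_exact(response: str, threshold: float = 0.85) -> bool:
    """A trial is exact at 85% keyword match or better."""
    return keyword_match(response) >= threshold


def classify(responses: list[str]) -> str:
    """Map the exact rate across trials to PASS, DEGRADED, or FAIL."""
    if not responses:
        raise ValueError("at least one trial is required")
    exact_rate = sum(is_exact(r) for r in responses) / len(responses)
    if exact_rate >= 0.80:
        return "PASS"
    if exact_rate >= 0.40:
        return "DEGRADED"
    return "FAIL"


good = "The Project Helios launch date is March 7, 2031."
bad = "I could not find that in the document."
label = classify([good, good, good, good, bad])  # 4 of 5 exact
```

With seven keywords in this needle, the 85% threshold tolerates at most one missing keyword per response.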
Sample results
Real runs from the open-source eval suite. No API key needed — just click a run to load its heatmap.
Run your own eval
Use your own API key to test any model. Results are streamed live and never stored on our servers.
OpenRouter API keys start with sk-or-. Get one free at openrouter.ai.
Results stream live below. Download JSON when complete.
Run history
Results from runs you completed this session. Reload the page to start fresh.
Frequently asked questions
Is my API key stored anywhere?
No. Your key is transmitted over HTTPS to the server, used only for the outbound API call, and discarded immediately — never written to a database, log file, or environment variable.
What context sizes can I test?
The slider goes from 5K to 180K tokens. Testing at 1M tokens is possible via the Python script in the repo but requires a significant API budget. 180K is enough to test the tail-end claim at a much lower cost. Models with smaller limits (e.g. 64K) will return an error if you exceed them.
Can I change the needle fact or question?
Not in this UI — the fact is fixed so runs are comparable. The Python eval script (evals/01-long-context-retrieval/run_eval.py) accepts any fact and question.
Why does the same model give different scores across trials?
LLM inference is non-deterministic: temperature and sampling both contribute variance, so the same prompt can succeed on one call and fail on the next. That's why trials matter. Five trials per position give a far more stable exact_rate than one. A single trial is not a benchmark.
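To make the role of trials concrete, here is a small Python sketch that aggregates per-trial outcomes into one exact_rate per heatmap cell. The `exact_rates` helper, the tuple layout, and the model name `model-a` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def exact_rates(trials):
    """Aggregate (model, position, exact) records into a rate per cell.

    Returns a dict keyed by (model, position), one entry per heatmap cell.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [exact hits, total trials]
    for model, position, exact in trials:
        cell = counts[(model, position)]
        cell[0] += int(exact)
        cell[1] += 1
    return {cell: hits / total for cell, (hits, total) in counts.items()}


# Five trials for one cell: four exact, one miss.
trials = [
    ("model-a", 0.95, True),
    ("model-a", 0.95, True),
    ("model-a", 0.95, False),
    ("model-a", 0.95, True),
    ("model-a", 0.95, True),
]
rates = exact_rates(trials)
```

With five trials the rate moves in 20% steps, so a single miss still leaves the cell at 80%. With one trial the rate can only be 0% or 100%, which is why a lone result says little about the model.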
What is OpenRouter?
OpenRouter is a unified API that routes to 300+ models from Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and more: one key, one endpoint, and transparent per-model pricing. This eval also accepts direct vendor keys if you prefer to skip the routing layer.