What is a needle-in-a-haystack eval?

A specific fact is hidden inside a large document at a controlled position. The model must retrieve it. This tests whether attention degrades at the tail end of long context windows.

What is the scoring criteria?

Exact = 85%+ of key fact terms in response. Partial = 40–84%. Miss = below 40% or explicit not-found. A position passes if 80%+ of trials are exact.

OpenRouter is a single API that routes to 300+ models from Anthropic, OpenAI, Google, Meta, DeepSeek and more. One key, all vendors.

LLM Context Retrieval Eval

Most language models claim large context windows — but does retrieval actually hold up at the tail end? This benchmark hides a fact at the 85%, 90%, and 95% position of a large document and asks the model to retrieve it. One question reveals a lot about attention quality.

The needle

A single fact — "Project Helios launch date is March 7, 2031" — is injected into a large filler document at a precise position. The model is asked to quote it back.

The haystack

20 synthetic IT operations sentences repeated to fill the target context size (~50K–180K tokens). Bland by design — attention stress, not comprehension.

What varies

Needle position (10%–95% of the document), number of trials, context size, and model. Multiple trials at each position give a reliable pass rate.

Scoring

PASS if ≥80% of trials are exact (85%+ keyword match). DEGRADED at 40–79%. FAIL below 40%. Each cell in the heatmap is one position × model combination.

Scoring thresholds — context retrieval eval
Result	Exact match rate across trials	Heatmap color
PASS	≥80% of trials with ≥85% keyword match	Green
DEGRADED	40–79% exact	Amber
FAIL	<40% exact or explicit not-found	Red

filler text — 85% of document

📍 needle

15%

Position 85% Position 90% Position 95%

The further right the needle, the harder it is for the model to retrieve — especially as context size grows.

Pre-computed runs

Sample results

Real runs from the open-source eval suite. No API key needed — just click a run to load its heatmap.

Loading sample runs…

Select a run on the left to see its heatmap

Live benchmark

Run your own eval

Use your own API key to test any model. Results are streamed live and never stored on our servers.

Your key is never stored. It travels over HTTPS, is used only for the API request, and is discarded immediately — not logged, not saved.

API key provider

API Key OpenRouter keys start with sk-or-. Get one free at openrouter.ai

Models (select up to 4)

Loading models…

Document haystack

Auto-generated IT infrastructure text at your chosen context size. The needle fact is injected at each test position.

Needle positions to test

Trials per position 3

Context size

Estimated cost before you run

Positions × trials9 calls

~Input tokens per call~50K

Input price—

Estimated total—

Results stream live below. Download JSON when complete.

Your runs

Run history

Results from runs you completed this session. Reload the page to start fresh.

No runs yet — complete a run above to see history here.

Frequently asked questions

Is my API key stored anywhere?

No. Your key is transmitted over HTTPS to the server, used only for the outbound API call, and discarded immediately — never written to a database, log file, or environment variable.

What context sizes can I test?

The slider goes from 5K to 180K tokens. Testing at 1M tokens is possible via the Python script in the repo but requires significant API budget. 180K proves the tail-end claim at lower cost. Models with smaller limits (e.g. 64K) will error if you exceed them.

Can I change the needle fact or question?

Not in this UI — the fact is fixed so runs are comparable. The Python eval script (evals/01-long-context-retrieval/run_eval.py) accepts any fact and question.

Why does the same model give different scores across trials?

LLM inference is non-deterministic. Temperature and sampling contribute variance. That's why trials matter — 5 trials per position gives a reliable exact_rate. A single trial is not a benchmark.

What is OpenRouter?

OpenRouter is a unified API routing to 300+ models from Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and more — one key, one endpoint, transparent per-model pricing. This eval also accepts direct vendor keys if you prefer to skip the routing layer.

More tools for AI builders

→ SEO Site Audit — audit every page your AI content strategy produces
→ Schema Markup Tester — validate structured data that helps AI Overviews cite your content
→ Link Checker — keep the site you're building for AI search free of broken links

About this tool

I built this eval after spending two weeks manually testing whether Claude, GPT-4, and Gemini could reliably retrieve a specific clause from a 120-page legal contract. The results were surprising — retrieval accuracy dropped from 94% at the 50% position to 61% at the 95% position in one model. No published benchmark made that degradation visible at the per-position level with my own documents. The ByteWaveNetwork Context Eval fills that gap: bring your own haystack, pick your positions, see exactly where each model's attention degrades. — ByteWaveNetwork Team, building AI evaluation tools since 2023.

Sunny Pal Singh

Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere. LinkedIn →

LLM Context Retrieval Eval

The needle

The haystack

What varies

Scoring

Sample results

Run your own eval

Your results

Run history

Frequently asked questions

More tools for AI builders

About this tool