LLM Prompt Sensitivity Eval

The same question phrased 10 different ways — does your model score consistently or does it collapse on certain wordings? This benchmark sends 10 variants of each task (coding, factual, reasoning) to the model, scores each answer 1–10 via self-judging, and measures standard deviation as a stability signal. Green means rock-solid. Red means a rephrasing could break your pipeline.

3 tasks × 10 variants

Coding: palindrome checker in Python. Factual: what causes Earth's seasons? Reasoning: apple math puzzle. Each rephrased 10 times — from one-line terse to multi-sentence verbose, academic to casual.

Standard deviation = stability

All 10 scores are averaged (mean) and their spread measured (σ). σ ≤ 1.0 = stable green. σ ≤ 2.0 = moderate amber. σ > 2.0 = volatile red. A σ of 0 means every phrasing received exactly the same score.
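
A minimal sketch of the arithmetic above. It assumes the population standard deviation (the page does not say whether σ divides by N or N − 1), and the function name and example scores are hypothetical:

```python
from statistics import mean, pstdev

def stability(scores):
    """Return (mean, sigma, rating) for one task's variant scores."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population sigma (divide by N)
    if sigma <= 1.0:
        rating = "stable"    # green
    elif sigma <= 2.0:
        rating = "moderate"  # amber
    else:
        rating = "volatile"  # red
    return mu, sigma, rating

# Illustrative scores only, not taken from a real run.
print(stability([9, 9, 9, 9, 9, 9, 9, 9, 9, 9]))
print(stability([9, 9, 1, 1, 9, 9, 9, 9, 9, 9]))
```

Swapping `pstdev` for `statistics.stdev` would give the sample standard deviation instead, which is slightly larger for the same 10 scores.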

Self-judging

The same model and API key used to answer each variant also scores the response 1–10 via a strict judging prompt. This keeps the eval provider-agnostic and avoids the bias a separate judge model would introduce, while still exposing meaningful variance across phrasings.
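
The answer-then-judge loop can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the judging prompt wording, the `ask` callable (standing in for whatever client sends a prompt to the model), and both function names are hypothetical.

```python
import re
from typing import Callable

# Hypothetical judging prompt; the real wording is not shown on this page.
JUDGE_TEMPLATE = (
    "You are a strict grader. Rate the answer to the task on a 1-10 scale.\n"
    "Task: {task}\nAnswer: {answer}\n"
    "Reply with the integer score only."
)

def parse_score(judge_reply: str) -> int:
    """Extract the first standalone integer in 1..10 from the judge's reply."""
    match = re.search(r"\b(10|[1-9])\b", judge_reply)
    if match is None:
        raise ValueError(f"no 1-10 score found in {judge_reply!r}")
    return int(match.group(1))

def answer_and_judge(ask: Callable[[str], str], variant: str) -> tuple[str, int]:
    """Call the same model twice: once to answer, once to grade that answer."""
    answer = ask(variant)
    reply = ask(JUDGE_TEMPLATE.format(task=variant, answer=answer))
    return answer, parse_score(reply)
```

Passing the model call in as `ask` keeps the sketch independent of any particular provider SDK, which mirrors the provider-agnostic goal described above.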

Real run findings

Claude Sonnet scored coding σ = 0.66 — near-perfect consistency. o4-mini scored coding σ = 3.35 — variants 3 and 4 triggered terse, unhelpful responses (scores of 1 out of 10). gpt-4.1 scored reasoning σ = 0.0 — perfectly consistent on that task.

Stability thresholds — prompt sensitivity eval
Rating | Std deviation (σ) across 10 variants | Interpretation
Stable | σ ≤ 1.0 | Consistent quality regardless of phrasing
Moderate | σ ≤ 2.0 | Noticeable variation; some phrasings underperform
Volatile | σ > 2.0 | Output quality highly sensitive to wording

V1–V2: Terse, direct ("Write a palindrome checker")
V3–V4: Academic, formal (long sentences)
V5–V6: Casual phrasing (colloquial style)
V7–V8: Structured specs (bullet requirements)
V9–V10: Verbose, elaborate (multi-sentence context)

All 10 variants ask for the same thing — only the phrasing style changes. Models that score consistently across all 10 are production-ready. Models that break on specific styles need prompt engineering guardrails.

Sample results

Real run from 2026-05-02. Cells show mean score / σ. Green = stable. Red = volatile. Click any cell to see all 10 variant scores.

Run your own eval

Bring your own API key. Results stream live via WebSocket.

Your API key is never stored or logged.
OpenRouter keys start with sk-or-. Get one free at openrouter.ai
Each task runs 10 phrasing variants against every selected model. With 3 tasks, that is 30 answer calls plus 30 judge calls per model, or 60 API calls per model in total.
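
As a rough sketch of that call count, assuming one judge call per answer call as the FAQ below states (the function name is hypothetical):

```python
def api_calls(num_models: int, num_tasks: int = 3, variants_per_task: int = 10) -> dict:
    """Count answer and judge calls for one eval run."""
    answer_calls = num_models * num_tasks * variants_per_task
    judge_calls = answer_calls  # one judge call per answer
    return {
        "answer": answer_calls,
        "judge": judge_calls,
        "total": answer_calls + judge_calls,
    }

print(api_calls(1))  # {'answer': 30, 'judge': 30, 'total': 60}
print(api_calls(3))  # {'answer': 90, 'judge': 90, 'total': 180}
```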

Recent runs

Runs are stored in memory for this session only.

About this benchmark

Prompt sensitivity is one of the most underappreciated risks in LLM deployment. A model that scores 9.5 on your carefully crafted benchmark prompt may score 4.0 when a user phrases the same question slightly differently. This eval quantifies that risk using standard deviation as a stability metric.

The 10-variant design spans a deliberate style gradient: variants 1–2 are terse and direct, variants 3–4 are formally academic, variants 5–6 use casual colloquial language, variants 7–8 use structured bullet-format specifications, and variants 9–10 add verbose contextual framing. Real users write all of these, so a production-ready model must handle them all equally well.

The key insight from our runs: models that perform best on benchmark-style prompts (variants 3–4, academic formal) don't always win on real-user-style prompts (variants 5–6, casual). A high mean with low σ is the goal — not just a high peak score.
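
One way to check which style a model is weakest on is to average scores within each two-variant band. A minimal sketch, assuming scores arrive ordered V1 to V10; the names and example scores are hypothetical:

```python
from statistics import mean

# Style bands as described above; variant numbers are 1-based and inclusive.
STYLE_BANDS = {
    "terse": (1, 2),
    "academic": (3, 4),
    "casual": (5, 6),
    "structured": (7, 8),
    "verbose": (9, 10),
}

def mean_by_style(scores):
    """Map each style to the mean score of its two variants.

    `scores` is a list of 10 scores ordered V1..V10.
    """
    if len(scores) != 10:
        raise ValueError("expected exactly 10 variant scores")
    return {
        style: mean(scores[first - 1:last])
        for style, (first, last) in STYLE_BANDS.items()
    }

# Illustrative scores only, not taken from a real run.
by_style = mean_by_style([9, 9, 10, 10, 6, 7, 9, 8, 9, 9])
weakest = min(by_style, key=by_style.get)
print(by_style, weakest)
```

With only two variants per style, these per-style means are coarse; they indicate where to look, not a statistically firm ranking.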

Frequently asked questions

Can I add my own custom task variants?

Not via the UI yet — the 3 built-in tasks (coding, factual, reasoning) are hardcoded. If you need custom tasks, the REST API accepts a tasks parameter, and the backend TASK_VARIANTS object can be extended in src/routes/api/v1/promptSensitivityEval.js.

Why 10 variants specifically?

10 is a practical minimum for a meaningful standard deviation. With fewer variants the estimate of σ is noisier and the style gradient is covered more thinly, so a phrasing that triggers a failure is more likely to go untested. 10 also keeps costs manageable: each task run requires 10 answer calls + 10 judge calls = 20 API calls per model per task.

What does a stdDev of 3.35 mean in practice?

On a 1–10 scale, σ = 3.35 means scores typically sit about 3.35 points away from the mean, and individual variants can land further out. If the mean is 7.9, the one-σ band runs from about 4.5 up to the scale cap of 10. In production, this means a user who phrases the question in the "wrong" style could get a response 5+ points worse than a user who happens to phrase it optimally, which is an unacceptable spread in user experience.
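
The band arithmetic in this answer, clamped to the 1–10 scoring scale, can be sketched as follows (the function name is hypothetical):

```python
def one_sigma_band(mean_score, sigma, lo=1.0, hi=10.0):
    """Return (mean - sigma, mean + sigma), clamped to the scoring scale."""
    return max(lo, mean_score - sigma), min(hi, mean_score + sigma)

low, high = one_sigma_band(7.9, 3.35)
print(round(low, 2), high)  # 4.55 10.0
```

The upper bound is clamped because 7.9 + 3.35 exceeds the maximum score of 10.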

Sunny Pal Singh
Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere.