What is prompt sensitivity?

Prompt sensitivity measures how much a model's answer quality changes when the same underlying question is phrased differently. A low-sensitivity (stable) model produces similar quality regardless of phrasing. A high-sensitivity model can give excellent answers to one phrasing and fail completely on another.

How is stability measured?

Each of the 10 phrasings is sent to the model, and each response is scored 1–10 by the same model (self-judging). The standard deviation (σ) of these 10 scores is the stability metric: σ ≤ 1.0 is green (stable), 1.0 < σ ≤ 2.0 is amber (moderate), and σ > 2.0 is red (volatile).
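As a rough illustration of the thresholds above, the classification could be sketched as follows. The function name and the sample score lists are hypothetical, and the use of the population standard deviation is an assumption, since this page does not say which variant the tool computes.

```python
from statistics import mean, pstdev

def classify_stability(scores):
    """Return (mean, sigma, label) for a list of 1-10 judge scores."""
    sigma = pstdev(scores)  # assumption: population standard deviation
    if sigma <= 1.0:
        label = "stable"    # green
    elif sigma <= 2.0:
        label = "moderate"  # amber
    else:
        label = "volatile"  # red
    return mean(scores), sigma, label

# Hypothetical score lists, not taken from any real run.
steady = [9, 9, 9, 9, 9, 7, 7, 7, 7, 7]
uneven = [10, 10, 10, 10, 10, 4, 4, 4, 4, 4]

print(classify_stability(steady))
print(classify_stability(uneven))
```

The first list sits exactly on the green boundary (σ = 1.0) and the second is well into red (σ = 3.0), which shows that the mean alone does not separate the two cases.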

What are the 3 tasks and their variants?

Coding: 10 phrasings of 'write a Python palindrome checker' (terse to verbose). Factual: 10 phrasings of 'what causes Earth's seasons?'. Reasoning: 10 phrasings of an apple math puzzle (3 apples, give 2 away, receive 5 more). Each variant covers a different phrasing style: direct, socratic, academic, casual, structured, and so on.

Why does o4-mini show high variance on coding tasks?

In our real runs, o4-mini scored [9,10,1,1,10,10,9,10,9,10] on coding variants, a stdDev of 3.35. Two phrasings (variants 3 and 4, the academic formal style) triggered terse, unhelpful responses. One plausible explanation is that models relying heavily on instruction-following signals can be derailed by ambiguous or unusually formal phrasings.

Is my API key stored?

No. Your key is transmitted over HTTPS, used only for the API call, and never logged or stored anywhere.

LLM Prompt Sensitivity Eval

The same question phrased 10 different ways — does your model score consistently or does it collapse on certain wordings? This benchmark sends 10 variants of each task (coding, factual, reasoning) to the model, scores each answer 1–10 via self-judging, and measures standard deviation as a stability signal. Green means rock-solid. Red means a rephrasing could break your pipeline.

3 tasks × 10 variants

Coding: palindrome checker in Python. Factual: what causes Earth's seasons? Reasoning: apple math puzzle. Each rephrased 10 times — from one-line terse to multi-sentence verbose, academic to casual.

Standard deviation = stability

All 10 scores are averaged (mean) and their spread is measured (σ). σ ≤ 1.0 = stable green. 1.0 < σ ≤ 2.0 = moderate amber. σ > 2.0 = volatile red. σ = 0 means every variant received the same score, whatever the phrasing.

Self-judging

The same model and API key used to answer each variant also score the response 1–10 via a strict judging prompt. This keeps the eval provider-agnostic and avoids bias from a separate judge model, while still exposing meaningful variance.
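A minimal sketch of the self-judging step might look like the following. The prompt wording, `build_judge_prompt`, and `parse_score` are all hypothetical: the tool's real judging prompt is not shown on this page, so this only illustrates the general shape of asking for a 1–10 score and reading it back.

```python
import re

# Hypothetical judge prompt; the tool's real wording is not published here.
JUDGE_TEMPLATE = (
    "You are a strict grader. Rate the answer to the question below "
    "on a scale of 1 to 10. Reply with the number only.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def build_judge_prompt(question, answer):
    """Fill the judge template with the original question and the model's answer."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_score(reply):
    """Return the first standalone integer from 1 to 10 in the reply, else None."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else None

print(parse_score("Score: 10/10"))
print(parse_score("I could not grade this."))
```

Returning `None` for an unparseable reply lets the caller decide whether to retry or skip the variant instead of silently recording a wrong score.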

Real run findings

Claude Sonnet 4.6 scored coding σ = 0.66, near-perfect consistency. o4-mini scored coding σ = 3.35: variants 3 and 4 triggered terse, unhelpful responses (scores of 1 out of 10). GPT-4.1 scored reasoning σ = 0.0, perfectly consistent on that task.

Stability thresholds — prompt sensitivity eval
Rating	Std deviation (σ) across 10 variants	Interpretation
Stable	σ ≤ 1.0	Consistent quality regardless of phrasing
Moderate	1.0 < σ ≤ 2.0	Noticeable variation; some phrasings underperform
Volatile	σ > 2.0	Output quality highly sensitive to wording

Variants	Style	Example or trait
V1–V2	Terse direct	"Write a palindrome checker"
V3–V4	Academic formal	Long sentences
V5–V6	Casual phrasing	Colloquial style
V7–V8	Structured specs	Bullet requirements
V9–V10	Verbose elaborate	Multi-sentence context

All 10 variants ask for the same thing; only the phrasing style changes. Models that score consistently across all 10 are better candidates for production use. Models that break on specific styles need prompt engineering guardrails.

Pre-computed runs

Sample results

Real run from 2026-05-02. Cells show mean score / σ. Green = stable. Red = volatile. Click any cell to see all 10 variant scores.

Live benchmark

Run your own eval

Bring your own API key. Results stream live via WebSocket.

Your API key is never stored or logged.

API provider

OpenRouter keys start with sk-or-. Get one free at openrouter.ai

API key

Select models (max 4)

Tasks to test

Coding Factual Reasoning

Each task runs 10 phrasing variants per selected model. With all 3 tasks selected, that is 30 answer calls plus 30 judge calls (60 API calls) per model.
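The call count scales with both the number of tasks and the number of models. A small sketch of that arithmetic follows, assuming one judge call per answer as described elsewhere on this page; `estimate_calls` is a hypothetical helper, not part of the tool.

```python
VARIANTS_PER_TASK = 10  # from this page: 10 phrasings per task

def estimate_calls(num_tasks, num_models):
    """Count answer and judge calls for a run, assuming one judge call per answer."""
    answer_calls = num_tasks * VARIANTS_PER_TASK * num_models
    judge_calls = answer_calls
    return {
        "answer": answer_calls,
        "judge": judge_calls,
        "total": answer_calls + judge_calls,
    }

# All 3 tasks with the maximum of 4 models selected.
print(estimate_calls(3, 4))  # {'answer': 120, 'judge': 120, 'total': 240}
```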

Your history

Recent runs

Runs are stored in memory for this session only.

About this benchmark

Prompt sensitivity is one of the most underappreciated risks in LLM deployment. A model that scores 9.5 on your carefully crafted benchmark prompt may score 4.0 when a user phrases the same question slightly differently. This eval quantifies that risk using standard deviation as a stability metric.

The 10-variant design spans a deliberate style gradient: variants 1–2 are terse and direct, variants 3–4 are formally academic, variants 5–6 use casual colloquial language, variants 7–8 use structured bullet-format specifications, and variants 9–10 add verbose contextual framing. Real users write all of these, so a production-ready model must handle them all equally well.
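Because variants come in style pairs, grouping scores by pair shows which phrasing style is responsible for a high σ. The sketch below uses the o4-mini coding scores reported on this page; the style labels and the `mean_by_style` helper are illustrative names, not the tool's own.

```python
# Style labels follow the variant pairs described on this page (V1-V2, V3-V4, ...).
STYLES = [
    "terse direct",
    "academic formal",
    "casual colloquial",
    "structured specs",
    "verbose elaborate",
]

def mean_by_style(scores):
    """Average each consecutive pair of the 10 variant scores under its style label."""
    if len(scores) != 10:
        raise ValueError("expected exactly 10 variant scores")
    return {
        STYLES[i // 2]: (scores[i] + scores[i + 1]) / 2
        for i in range(0, 10, 2)
    }

# o4-mini coding scores as reported on this page.
o4_mini_coding = [9, 10, 1, 1, 10, 10, 9, 10, 9, 10]
by_style = mean_by_style(o4_mini_coding)
print(by_style)
print(min(by_style, key=by_style.get))  # academic formal
```

A per-style view like this points to where a prompt guardrail is needed, which a single σ value cannot do.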

The key insight from our runs: models that perform best on benchmark-style prompts (variants 3–4, academic formal) don't always win on real-user-style prompts (variants 5–6, casual). A high mean with low σ is the goal — not just a high peak score.

Frequently asked questions

Can I add my own custom task variants?

Not via the UI yet — the 3 built-in tasks (coding, factual, reasoning) are hardcoded. If you need custom tasks, the REST API accepts a tasks parameter, and the backend TASK_VARIANTS object can be extended in src/routes/api/v1/promptSensitivityEval.js.

Why 10 variants specifically?

10 is a practical floor for a meaningful standard deviation. With fewer variants the σ estimate is noisy, and a phrasing style that triggers a failure may not be sampled at all, so the model looks more reliable than it is. 10 also keeps costs manageable: each task run requires 10 answer calls + 10 judge calls = 20 API calls per model per task.

What does a stdDev of 3.35 mean in practice?

On a 1–10 scale, σ = 3.35 means a typical score sits about 3.35 points away from the mean. It is not a hard range. With a mean of 7.9, the one-σ band runs from roughly 4.5 up to the scale maximum of 10, and individual variants can land well outside it: in the o4-mini coding run, two variants scored 1. In production, a user who phrases the question in the "wrong" style could get a response 5 or more points worse than a user who happens to phrase it well, which is an unacceptable spread in user experience.
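The arithmetic behind that interpretation can be checked with a few lines. This is a sketch using the mean and σ quoted on this page; the helper names are hypothetical, and clamping the band to the 1–10 scale is a presentation choice rather than something the tool is known to do.

```python
def one_sigma_band(mean, sigma, lo=1.0, hi=10.0):
    """Return the mean +/- sigma band, clamped to the scoring scale."""
    return max(lo, mean - sigma), min(hi, mean + sigma)

def sigmas_below_mean(score, mean, sigma):
    """How many standard deviations a score sits below the mean."""
    return (mean - score) / sigma

low, high = one_sigma_band(7.9, 3.35)
print(round(low, 2), round(high, 2))              # 4.55 10.0
print(round(sigmas_below_mean(1, 7.9, 3.35), 2))  # 2.06
```

A score of 1 sits about two standard deviations below the mean, which is why the one-σ band understates how bad the worst phrasing can be.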

Key finding from real runs

Claude Sonnet 4.6 coding σ = 0.66 — nearly perfect stability. o4-mini coding σ = 3.35 — two specific phrasings (academic formal) scored 1/10. GPT-4.1 reasoning σ = 0.0 — every variant scored 10/10.

Real run: 2026-05-02. Results may differ on newer model versions.

REST API

Every eval is available as a JSON endpoint:

POST /api/v1/evals/prompt-sensitivity/run
{ apiKey, keyType, models, tasks }

GET /api/v1/evals/prompt-sensitivity/runs
GET /api/v1/evals/prompt-sensitivity/run/:id
GET /api/v1/evals/prompt-sensitivity/run/:id/export
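A request to the run endpoint could be assembled roughly as follows. Only the path and the four field names come from this page. The base URL, the `keyType` value, the model identifier, the task name, and the JSON content type are assumptions for illustration, and the sketch builds the request without sending it.

```python
import json
from urllib.request import Request

BASE_URL = "https://example.com"  # placeholder host, not the real service

def build_run_request(api_key, key_type, models, tasks):
    """Build (but do not send) a POST request for the run endpoint."""
    body = {"apiKey": api_key, "keyType": key_type, "models": models, "tasks": tasks}
    return Request(
        BASE_URL + "/api/v1/evals/prompt-sensitivity/run",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )

# All argument values below are hypothetical examples.
req = build_run_request("sk-or-...", "openrouter", ["example/model-id"], ["coding"])
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would send it. The response shape is not documented on this page, so parsing is left out.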

Affiliate disclosure

ByteWaveNetwork may earn a referral fee if you sign up for API access through links on this page. Benchmark results are independently produced and not influenced by commercial relationships.

Sunny Pal Singh

Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere.