About this benchmark
Prompt sensitivity is one of the most underappreciated risks in LLM deployment. A model that scores 9.5 on your carefully crafted benchmark prompt may score 4.0 when a user phrases the same question slightly differently. This eval quantifies that risk using standard deviation as a stability metric.
The 10-variant design spans a deliberate style gradient: variants 1–2 are terse and direct, variants 3–4 are formally academic, variants 5–6 use casual colloquial language, variants 7–8 use structured bullet-format specifications, and variants 9–10 add verbose contextual framing. Real users write all of these, so a production-ready model must handle them all equally well.
The key insight from our runs: models that perform best on benchmark-style prompts (variants 3–4, academic formal) don't always win on real-user-style prompts (variants 5–6, casual). A high mean with low σ is the goal — not just a high peak score.
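To make the "high mean with low σ" goal concrete, here is a minimal sketch of how the two numbers could be computed from ten per-variant scores. The scores, the `summarize` helper, and the choice of population standard deviation (dividing by n) are all assumptions for illustration; the eval itself may use the sample form (n - 1).

```javascript
// Hypothetical scores for two models across the 10 variants (1-10 scale).
// These numbers are invented for illustration, not measured results.
const modelA = [9, 9, 10, 10, 5, 4, 8, 8, 9, 9]; // strong on formal, weak on casual (variants 5-6)
const modelB = [8, 8, 8, 9, 8, 7, 8, 8, 8, 8];   // flatter across styles

// Mean, population standard deviation (divides by n), and peak score.
// Assumption: the real eval may divide by n - 1 instead.
function summarize(scores) {
  const n = scores.length;
  const mean = scores.reduce((sum, s) => sum + s, 0) / n;
  const variance = scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / n;
  return { mean, stdDev: Math.sqrt(variance), peak: Math.max(...scores) };
}

const a = summarize(modelA);
const b = summarize(modelB);

console.log("A:", a.mean.toFixed(2), a.stdDev.toFixed(2), a.peak);
console.log("B:", b.mean.toFixed(2), b.stdDev.toFixed(2), b.peak);
```

In this made-up case model A has the higher peak and a marginally higher mean, but its σ is several times larger than model B's, which is the pattern the stability metric is meant to expose.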
Frequently asked questions
Can I add my own custom task variants?
Not via the UI yet — the 3 built-in tasks (coding, factual, reasoning) are hardcoded. If you need custom tasks, the REST API accepts a tasks parameter, and the backend TASK_VARIANTS object can be extended in src/routes/api/v1/promptSensitivityEval.js.
Why 10 variants specifically?
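There is no hard statistical cutoff, but 10 is a practical minimum for a meaningful standard deviation. With fewer variants the estimate of σ is noisier, and a smaller set can miss the phrasing style that triggers a failure: a 5-variant set that happens to skip the casual prompts makes a model look more reliable than the full 10 would. 10 also keeps costs manageable: each task run requires 10 answer calls + 10 judge calls = 20 API calls per model per task.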
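The cost figure above scales linearly with models and tasks. A rough sketch of that arithmetic, assuming one answer call and one judge call per variant (the `apiCalls` helper and the 4-model example are hypothetical):

```javascript
// Total API calls for an eval run: one answer call plus one judge call
// per variant, per task, per model.
function apiCalls({ models, tasks, variants = 10 }) {
  const callsPerVariant = 2; // answer + judge
  return models * tasks * variants * callsPerVariant;
}

const single = apiCalls({ models: 1, tasks: 1 }); // one model, one task
const full = apiCalls({ models: 4, tasks: 3 });   // 4 models, all 3 built-in tasks

console.log(single, full);
```

Doubling the variant count doubles both numbers, which is the trade-off against a tighter σ estimate.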
What does a stdDev of 3.35 mean in practice?
On a 1–10 scale, σ = 3.35 means a typical score sits about 3.35 points away from the mean. It is not the full range: individual scores can fall further out. If the mean is 7.9, the band one standard deviation either side of it runs from about 4.5 up to 10 (the upper value, 11.25, is capped by the scale). In production, this means a user who phrases the question in the "wrong" style could get a response 5+ points worse than a user who happens to phrase it optimally, which is an unacceptable variance in user experience.
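The arithmetic in this answer can be checked with a short sketch that computes the band one standard deviation either side of the mean and clamps it to the 1–10 scale. The `oneSigmaBand` helper is a hypothetical name, not part of the eval's code.

```javascript
// Band of one standard deviation around the mean, clamped to the score scale.
function oneSigmaBand(mean, stdDev, min = 1, max = 10) {
  return {
    low: Math.max(min, mean - stdDev),
    high: Math.min(max, mean + stdDev), // 7.9 + 3.35 exceeds 10, so it is capped
  };
}

const band = oneSigmaBand(7.9, 3.35);
const spread = band.high - band.low; // gap between the two ends of the band

console.log(band.low.toFixed(2), band.high.toFixed(2), spread.toFixed(2));
```

The gap between the two ends of the band is where the "5+ points" figure comes from.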