LLM Instruction Following Eval

Every LLM claims to follow instructions — but does it when constraints get complex? This benchmark gives each model a simple writing task paired with an escalating set of precise formatting rules: word limits, bullet counts, banned words, case requirements. Watch exactly which constraints each model drops as difficulty increases.

The task

Fixed prompt — "Explain the main benefits of getting regular exercise." Simple enough that any model can answer it. The constraint set determines the difficulty.

The constraints

Four difficulty levels stack 1 to 5 formatting rules on the same task. Easy: one word-count rule. Extreme: five simultaneous constraints including case, punctuation, and exclusion.
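As a rough illustration of how one fixed task can be paired with a growing rule set, here is a minimal sketch. The rule wording comes from the FAQ on this page. The `CONSTRAINTS` mapping, the `build_prompt` helper, and the prompt layout (task, then a numbered rule list) are assumptions made for the example, not the tool's actual prompt format.

```python
# Fixed task and per-level rules, using the wording published on this page.
TASK = "Explain the main benefits of getting regular exercise."

CONSTRAINTS = {
    "easy": ["Use fewer than 50 words total."],
    "medium": [
        "Use fewer than 60 words total.",
        'Begin your response with the words "Regular exercise".',
    ],
    "hard": [
        "Use exactly 3 bullet points.",
        "Use fewer than 80 words total.",
        "Begin each bullet point with the \u2022 character.",
    ],
    "extreme": [
        "Use exactly 3 bullet points.",
        "Use fewer than 60 words total.",
        'Do not use the word "health" anywhere.',
        "Write everything in lowercase letters only.",
        "Use no punctuation characters at all (no periods, commas, "
        "exclamation marks, question marks, semicolons, colons, quotes, "
        "or brackets).",
    ],
}


def build_prompt(level: str) -> str:
    """Join the fixed task with the numbered rules for one level.

    The layout used here is a guess for illustration only.
    """
    rules = CONSTRAINTS[level]
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return f"{TASK}\n\nFollow every rule below:\n{numbered}"


if __name__ == "__main__":
    print(build_prompt("extreme"))
```

Keeping the task constant and swapping only the rule list means any change in output quality can be attributed to the constraints rather than to the question.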

What varies

Difficulty level (Easy to Extreme), number of trials per level, and model selection. Multiple trials expose inconsistency — a model that sometimes follows all rules and sometimes misses one.

Scoring

A model and difficulty combination scores PASS if ≥80% of its trials satisfy ≥90% of constraints, PARTIAL if 40–79% of trials do, and FAIL below 40%. Compliance rate is tracked per constraint, so you see exactly which rule each model breaks.
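The thresholds above can be sketched in a few lines. This is one reading of the rule, not the tool's source code: the function names `trial_passes` and `level_verdict` are hypothetical, and pass rates between 79% and 80% are assumed to fall into PARTIAL.

```python
def trial_passes(constraints_met: int, constraints_total: int) -> bool:
    """A trial counts as passing when at least 90% of its constraints are met.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return constraints_met * 10 >= constraints_total * 9


def level_verdict(trial_results: list[bool]) -> str:
    """Map the share of passing trials to PASS / PARTIAL / FAIL."""
    if not trial_results:
        raise ValueError("need at least one trial")
    passed, total = sum(trial_results), len(trial_results)
    if passed * 5 >= total * 4:   # pass rate >= 80%
        return "PASS"
    if passed * 5 >= total * 2:   # pass rate >= 40%
        return "PARTIAL"
    return "FAIL"


if __name__ == "__main__":
    # Three Extreme trials meeting 5/5, 5/5 and 4/5 constraints.
    trials = [trial_passes(5, 5), trial_passes(5, 5), trial_passes(4, 5)]
    print(trials, level_verdict(trials))
```

With five constraints or fewer, the 90% bar is only cleared when every constraint is met, so a single dropped rule fails the trial.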

Scoring thresholds: instruction following eval

Result    Trial pass rate          Compliance rate per trial
PASS      ≥80% of trials pass      ≥90% of constraints met
PARTIAL   40–79% of trials pass    ≥90% of constraints met
FAIL      <40% of trials pass      One or more constraints broken

  • Easy: word limit
  • Medium: word limit, start phrase
  • Hard: 3 bullets, word limit, bullet char
  • Extreme: 3 bullets, word limit, exclude word, lowercase, no punctuation

Each difficulty level adds more constraints. Models that score 100% on Easy often drop to 40% or below on Extreme.

Sample results

Real runs showing compliance rates across all four difficulty levels. No API key needed — just click a run to load its heatmap.


Run your own eval

Use your own API key to test any model. Results are streamed live and never stored on our servers.

Your key is never stored. It travels over HTTPS, is used only for the API request, and is discarded immediately — not logged, not saved.
OpenRouter keys start with sk-or-. Get one free at openrouter.ai
Fixed task: "Explain the main benefits of getting regular exercise." The same task is used for all runs so results are comparable.
Estimated cost before you run
Difficulties × trials: 4 levels × 3 trials
Total calls: 12
Tokens per call: ~300 input + ~100 output
Estimated total: depends on the selected model's pricing
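
The arithmetic behind this estimate can be sketched as follows. The call and token counts come from the figures above. The per-million-token prices are placeholder values chosen for the example, and `estimate_run` is a hypothetical helper, so substitute the real rates for whichever model you select.

```python
def estimate_run(
    levels: int = 4,
    trials: int = 3,
    input_tokens_per_call: int = 300,
    output_tokens_per_call: int = 100,
    usd_per_m_input: float = 1.0,    # placeholder price, not a real quote
    usd_per_m_output: float = 2.0,   # placeholder price, not a real quote
) -> dict:
    """Estimate calls, tokens, and cost for one full run of the eval."""
    calls = levels * trials
    total_in = calls * input_tokens_per_call
    total_out = calls * output_tokens_per_call
    usd = (total_in * usd_per_m_input + total_out * usd_per_m_output) / 1_000_000
    return {
        "calls": calls,
        "input_tokens": total_in,
        "output_tokens": total_out,
        "usd": usd,
    }


if __name__ == "__main__":
    print(estimate_run())
```

At the default settings this works out to 12 calls, about 3,600 input tokens and 1,200 output tokens per model.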

Results stream live below. Download JSON when complete.

Run history

Results from runs you completed this session. Reload the page to start fresh.


Frequently asked questions

What does "instruction following" test exactly?

It tests whether a model can simultaneously satisfy multiple precise formatting rules on the same response. Easy difficulty has one rule (word limit under 50 words). Extreme has five simultaneous rules: exactly 3 bullet points, under 60 words, no use of the word "health", everything lowercase, and no punctuation of any kind. Most models handle one constraint reliably but begin dropping constraints when juggling four or five simultaneously.

Is my API key stored anywhere?

No. Your key is transmitted over HTTPS to the server, used only for the outbound API call, and discarded immediately — never written to a database, log file, or environment variable.

Why does the same model fail on Extreme but pass on Easy?

As constraints compound, models face a multi-objective trade-off: satisfying one rule can interfere with attention to another. For example, maintaining all-lowercase while preserving bullet format and staying under a word count requires consistent, simultaneous tracking of three separate signals. Each additional constraint is one more requirement the model has to hold across the entire response. Most frontier models score near 100% on Easy and often drop to 40% or below on Extreme.

What are the exact constraints for each difficulty level?

Easy (1 constraint): Use fewer than 50 words total.

Medium (2 constraints): Use fewer than 60 words total. Begin your response with the words "Regular exercise".

Hard (3 constraints): Use exactly 3 bullet points. Use fewer than 80 words total. Begin each bullet point with the • character.

Extreme (5 constraints): Use exactly 3 bullet points. Use fewer than 60 words total. Do not use the word "health" anywhere. Write everything in lowercase letters only. Use no punctuation characters at all (no periods, commas, exclamation marks, question marks, semicolons, colons, quotes, or brackets).
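
To make the Extreme rules concrete, here is a minimal sketch of how each one could be checked programmatically. It is not the tool's actual checker. Several details are assumptions: which line prefixes count as a bullet, that "health" is matched as a whole word (so "healthy" is not flagged), that curly quotes and braces count as banned punctuation, and that bullet markers are excluded from the word count.

```python
import re

# Characters treated as punctuation: the list from the rule text, plus
# curly quotes and braces (an assumption).
BANNED_PUNCTUATION = set(".,!?;:\"'()[]{}\u201c\u201d\u2018\u2019")

# Assumption: a bullet is any line starting with one of these markers.
BULLET_PREFIXES = ("\u2022", "-", "*")


def bullet_lines(text: str) -> list[str]:
    """Return the lines that start with a bullet marker."""
    return [ln for ln in text.splitlines() if ln.strip().startswith(BULLET_PREFIXES)]


def word_count(text: str) -> int:
    """Count whitespace-separated tokens, skipping standalone bullet markers."""
    return len([tok for tok in text.split() if tok not in BULLET_PREFIXES])


def check_extreme(text: str) -> dict[str, bool]:
    """Evaluate each of the five Extreme constraints independently."""
    return {
        "exactly 3 bullets": len(bullet_lines(text)) == 3,
        "under 60 words": word_count(text) < 60,
        "no 'health'": re.search(r"\bhealth\b", text, re.IGNORECASE) is None,
        "all lowercase": text == text.lower(),
        "no punctuation": not any(ch in BANNED_PUNCTUATION for ch in text),
    }


if __name__ == "__main__":
    reply = (
        "\u2022 exercise strengthens the heart\n"
        "\u2022 it improves mood and sleep\n"
        "\u2022 it helps control weight"
    )
    for rule, ok in check_extreme(reply).items():
        print(f"{rule}: {'ok' if ok else 'BROKEN'}")
```

Returning one boolean per rule, rather than a single pass/fail flag, is what makes it possible to report which specific constraint a model dropped.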

How is "compliance rate" different from a pass/fail score?

Compliance rate is per-constraint within a single trial. If a model satisfies 4 out of 5 constraints, its compliance rate is 80% for that trial. A trial earns a PASS verdict if compliance rate is 90%+, PARTIAL at 50–89%, FAIL below 50%. The heatmap cell shows the pass rate across all trials for a model+difficulty combination — for example, "67%" means 2 out of 3 trials were PASS.
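
The relationship between compliance rate, trial verdict, and heatmap cell can be sketched as below. The thresholds are the ones stated in this answer. The function names and the per-constraint result dictionaries are hypothetical, chosen only to illustrate the calculation.

```python
def compliance_rate(results: dict[str, bool]) -> float:
    """Fraction of constraints satisfied within a single trial."""
    return sum(results.values()) / len(results)


def trial_verdict(rate: float) -> str:
    """PASS at 90%+, PARTIAL at 50-89%, FAIL below 50%."""
    if rate >= 0.90:
        return "PASS"
    if rate >= 0.50:
        return "PARTIAL"
    return "FAIL"


def heatmap_cell(verdicts: list[str]) -> str:
    """Share of trials with a PASS verdict, shown as a whole percentage."""
    passes = verdicts.count("PASS")
    return f"{round(100 * passes / len(verdicts))}%"


if __name__ == "__main__":
    # Three hypothetical Extreme trials; the third one adds a period.
    trials = [
        {"bullets": True, "words": True, "exclude": True, "lower": True, "punct": True},
        {"bullets": True, "words": True, "exclude": True, "lower": True, "punct": True},
        {"bullets": True, "words": True, "exclude": True, "lower": True, "punct": False},
    ]
    verdicts = [trial_verdict(compliance_rate(t)) for t in trials]
    print(verdicts, heatmap_cell(verdicts))
```

In this example the third trial has an 80% compliance rate and a PARTIAL verdict, so the cell reads 67% even though only one constraint was broken once.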


About this tool

I built this eval after noticing that LLMs deployed in production pipelines would randomly drop formatting constraints when under load or when the constraint set grew beyond two rules. A workflow that required "respond in exactly three bullet points, all lowercase, no punctuation" would work 9 times out of 10 — and on the 10th call, the model would silently add a period or capitalize a letter, breaking the downstream parser. No public benchmark made per-constraint failure visible in a way that let you compare models side by side. The ByteWaveNetwork Instruction Following Eval fills that gap: pick your models, run across all four difficulty levels, and see exactly which constraint each model drops under pressure. — ByteWaveNetwork Team, building AI evaluation tools since 2023.

Sunny Pal Singh
Fellow · Technical Director

Building developer tools at ByteWaveNetwork since 2012. Every utility here was built because we needed it ourselves and couldn’t find one done right elsewhere.