Local Model Benchmarks: What to Measure Instead of Leaderboards

Tomas

Leaderboard scores are nearly useless for picking a local model for your specific tasks. Here is how to benchmark what actually matters for agent and coding workflows.

The leaderboard problem

MMLU, HumanEval, MATH - these benchmarks test things that are well-defined, have clear correct answers, and are easy to evaluate at scale. They do not test:

Whether the model follows formatting instructions reliably
Whether it produces stable JSON that parses correctly
Whether it handles your specific domain’s terminology
Whether it performs consistently at your required context length
Whether it can use tool call syntax in a way your framework accepts

A model that scores 5% higher on MMLU may perform worse on every task you actually care about.

The 4 metrics worth measuring for practical work

—

1. Time to first token (TTFT)

The time from sending the request to receiving the first output token. This dominates perceived responsiveness for interactive use. A model with slightly lower quality but half the TTFT is often the better choice for interactive agents.

Measure: with streaming enabled, send 10 test requests, record the time from request to first token, and take the median.
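A minimal sketch of that measurement, assuming your client can stream tokens as they arrive. Here start_stream is a hypothetical callable you supply: it sends one request and returns an iterable of tokens. Nothing in it is tied to a particular runtime or API.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_ttft(start_stream: Callable[[], Iterable[str]], runs: int = 10) -> float:
    """Median seconds from sending a request to receiving the first token."""
    samples = []
    for _ in range(runs):
        started = time.perf_counter()
        stream = iter(start_stream())   # hypothetical: sends the request
        if next(stream, None) is None:  # blocks until the first token arrives
            raise ValueError("stream ended before producing a token")
        samples.append(time.perf_counter() - started)
    return statistics.median(samples)
```

Use the same prompt for every model you compare, since prompt length affects how long the model takes to start answering.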

—

2. Sustained throughput (tokens/second)

How fast the model generates once started. Matters most for long outputs (reports, code files, detailed summaries).

Measure: generate a 500-token output 5 times, calculate tokens/second for each run (timed from the first token, so TTFT is not mixed in), and take the median.
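One way to sketch this, again assuming a hypothetical start_stream callable that returns an iterable of streamed tokens. It treats each streamed chunk as one token, which is only an approximation: some runtimes batch several tokens per chunk, so use the token count your runtime reports if it gives you one.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_throughput(start_stream: Callable[[], Iterable[str]], runs: int = 5) -> float:
    """Median tokens/second, timed from the first token to the last."""
    rates = []
    for _ in range(runs):
        stream = iter(start_stream())
        if next(stream, None) is None:  # wait for the first token
            raise ValueError("stream ended before producing a token")
        first_token_at = time.perf_counter()
        count = sum(1 for _ in stream)  # tokens generated after the first
        elapsed = time.perf_counter() - first_token_at
        if count and elapsed > 0:
            rates.append(count / elapsed)
    if not rates:
        raise ValueError("no run produced more than one token")
    return statistics.median(rates)
```

Starting the clock at the first token keeps prompt processing out of the number, so this metric and TTFT stay independent.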

—

3. Instruction following accuracy

Does the model do what you actually asked? Not on generic tasks - on your specific prompts.

Measure: take your 20 most common real-world prompts. Run each 3 times. Score outputs: fully correct / partially correct / wrong. Compare models on this score. This is the benchmark that matters most and you will not find it on any leaderboard.
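To turn the three grades into one comparable number, something like the following could work. The weights (1.0, 0.5, 0.0) and the grade labels are assumptions for illustration, so pick whatever reflects how much a partially correct answer is worth to you.

```python
from collections import Counter
from typing import Iterable

# Assumed weighting: a partial answer counts as half a correct one.
WEIGHTS = {"full": 1.0, "partial": 0.5, "wrong": 0.0}


def instruction_score(grades: Iterable[str]) -> float:
    """Average weighted grade across all graded outputs, from 0.0 to 1.0."""
    counts = Counter(grades)
    unknown = set(counts) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown grades: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades given")
    return sum(WEIGHTS[grade] * n for grade, n in counts.items()) / total
```

With 20 prompts run 3 times each, you pass in 60 grades per model and compare the resulting scores.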

—

4. Tool call reliability

If your agent uses function calling or structured tool use, does the model produce valid call syntax consistently?

Measure: send 50 prompts that should trigger tool calls. Count: correct call syntax / wrong tool selected / no call made when one was needed / malformed JSON. Smaller models fail this test more often than benchmarks suggest.
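The four buckets can be counted with a small classifier. This sketch assumes a tool call arrives as a JSON object with a "name" string and an "arguments" object, and that your framework hands you None when the model answered in plain text. Both are assumptions, so adjust the checks to the call format your framework actually accepts.

```python
import json
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_tool_call(raw: Optional[str], expected_tool: str) -> str:
    """Place one model response into one of the four outcome buckets."""
    if raw is None or not raw.strip():
        return "no_call"
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed_json"
    # Assumed call shape: {"name": "<tool>", "arguments": {...}}
    if (not isinstance(call, dict)
            or not isinstance(call.get("name"), str)
            or not isinstance(call.get("arguments"), dict)):
        return "malformed_json"
    return "correct" if call["name"] == expected_tool else "wrong_tool"


def tally(cases: Iterable[Tuple[Optional[str], str]]) -> Counter:
    """Count outcomes over (raw_output, expected_tool) pairs."""
    return Counter(classify_tool_call(raw, expected) for raw, expected in cases)
```

Note that "correct" here only means valid syntax and the right tool. Whether the arguments are sensible still needs a per-tool check.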

—

How to run a personal benchmark in under an hour

Pick 3 models you are considering.
Create a test set: 10 prompts from each of your main use cases (total 30 prompts).
Run each prompt through each model. Use a simple script to time TTFT and throughput.
Score instruction following manually (5 minutes per model per test set).
For tool use: run the 50-prompt structured test.
Total time: 45-60 minutes. Result: a comparison that actually predicts which model works best for you.
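The "simple script" for the run-and-score steps could look roughly like this. The generate argument is a hypothetical function you write for your own runtime (model name and prompt in, output text out). The script saves every output to CSV with an empty grade column so the manual scoring pass can happen in a spreadsheet.

```python
import csv
import io
import time
from typing import Callable, Dict, List, Sequence

FIELDS = ["model", "prompt", "output", "seconds", "grade"]


def collect_outputs(models: Sequence[str], prompts: Sequence[str],
                    generate: Callable[[str, str], str]) -> List[Dict[str, object]]:
    """Run every prompt through every model and record output and wall time."""
    rows = []
    for model in models:
        for prompt in prompts:
            started = time.perf_counter()
            output = generate(model, prompt)  # hypothetical: calls your runtime
            seconds = round(time.perf_counter() - started, 3)
            rows.append({"model": model, "prompt": prompt, "output": output,
                         "seconds": seconds, "grade": ""})  # grade filled in by hand
    return rows


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize the rows as CSV text, ready to write to a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With 3 models and 30 prompts this produces 90 rows to grade.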

Quantization tradeoffs: when it matters

Q4 (4-bit quantization) vs Q8 (8-bit):

For casual chat and simple tasks: Q4 quality is nearly indistinguishable from Q8
For complex reasoning, long-form writing, code generation: Q8 is meaningfully better for models above 13B parameters
For 7B models: the quality gap between Q4 and Q8 is small enough that Q4's speed gain usually wins
For 70B models: the quality difference between Q4 and Q8 is more pronounced

The crossover point varies by task and model family. Test with your prompts rather than trusting the general rule.

One thing to test that most people miss

Context length stability. Run your model at 50%, 75%, and 90% of its rated context length with tasks that require reading all of the context. Many models degrade significantly in the latter portion of their context window even when that length is advertised as supported.
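One possible way to build such tasks is to bury a single fact in filler text and ask for it back. The sketch below is an illustration, not a standard test: the filler sentence, the depths, and the needle are all made up, and it approximates tokens by whitespace-separated words, so use your model's tokenizer if you need exact lengths.

```python
from typing import Dict, Sequence, Tuple

FILLER = "The committee reviewed the minutes and adjourned."  # 7 words


def build_probe(rated_context: int, fill_fraction: float,
                needle: str, depth: float) -> str:
    """Filler text of roughly rated_context * fill_fraction words,
    with the needle sentence inserted at the given relative depth."""
    target_words = int(rated_context * fill_fraction)
    n_sentences = max(1, target_words // len(FILLER.split()))
    sentences = [FILLER] * n_sentences
    sentences.insert(int(n_sentences * depth), needle)
    return " ".join(sentences)


def probe_plan(rated_context: int, needle: str,
               fractions: Sequence[float] = (0.5, 0.75, 0.9),
               depths: Sequence[float] = (0.1, 0.5, 0.9)) -> Dict[Tuple[float, float], str]:
    """One probe per (fill fraction, needle depth) combination."""
    return {(f, d): build_probe(rated_context, f, needle, d)
            for f in fractions for d in depths}
```

Append a question about the needle to each probe and check whether the answer comes back. Failures that cluster at the larger fill fractions are the degradation described above.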

What benchmarks or tests do you run before committing to a local model?

Curated by Selendia AI 📦