Evaluation Is a Human Problem
Why benchmarks are not enough, and why human judgment defines quality.
Key takeaways
- Benchmarks compare models, but humans define what “good” means.
- Rubrics turn judgment into repeatable evaluation.
- RLHF encodes human preference, not ground truth.
- Evaluation is a loop, not a one-time score.
The quality of a large language model (LLM) is not a single number. Automated benchmarks can measure performance on standardized tests, but the real measure of quality (whether a model is helpful, safe, and reliable for a specific purpose) is a matter of human judgment. Evaluation is the process of encoding that judgment into a repeatable system.
In practice, stating clearly up front what counts as an acceptable response prevents more downstream errors than late-stage tuning does.
Act I: The fundamentals
Benchmarks and their limits
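Early methods for evaluating language models relied on automated metrics that compared a model’s output to a “reference” or “golden” answer. Metrics like BLEU and ROUGE measure n-gram overlap, that is, how many words and short phrases the output shares with the reference. This is useful for tasks like translation and summarization, but it fails to capture the meaning or factual accuracy of a response: a correct paraphrase can score low, while a fluent answer with one critical word wrong can score high.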
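To make the limitation concrete, here is a minimal sketch of an overlap metric. It is a simplified unigram F1, not the actual BLEU or ROUGE algorithm, and the function name and example sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate and a reference answer.

    A simplified stand-in for overlap metrics; real BLEU and ROUGE use
    longer n-grams and additional corrections.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences.
reference = "the drug is safe for children"
wrong_but_similar = "the drug is not safe for children"
right_but_reworded = "kids can take this medicine safely"

score_wrong = unigram_f1(wrong_but_similar, reference)
score_right = unigram_f1(right_but_reworded, reference)
print(f"wrong but similar:  {score_wrong:.2f}")
print(f"right but reworded: {score_right:.2f}")
```

The answer that reverses the meaning scores far higher than the correct paraphrase, because the metric only sees shared tokens.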
The next step was the creation of large-scale benchmarks like MMLU (Massive Multitask Language Understanding), which test a model’s ability to answer multiple-choice questions across a wide range of subjects. These are valuable for comparing the general knowledge of different models, but they do not tell you if a model is suitable for your specific application.
Act II: The modern paradigm
Human rubrics and RLHF
The modern approach to evaluation accepts that “quality” is context-dependent. The definition of a “good” answer for a customer service bot is different from that of a “good” answer for a creative writing assistant. This has led to the rise of human-in-the-loop evaluation.
This process involves:
- Defining a rubric: A detailed set of criteria that defines what a high-quality response looks like for a specific use case. The rubric might include measures for helpfulness, honesty, harmlessness, tone, and factual accuracy.
- Human rating: Human reviewers score the model’s responses against the rubric. They provide not just a score, but also detailed feedback on why a response failed.
- Reinforcement Learning from Human Feedback (RLHF): Human feedback, often collected as preferences between pairs of responses, is used to train a “reward model” that learns to predict how a human would score a given response. The reward model then provides the training signal for fine-tuning the original LLM with reinforcement learning, teaching it to produce outputs that are more aligned with human preferences.
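The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a production pipeline: the rubric criteria come from the list above, but the weights, the 1 to 5 rating scale, and the names `Criterion`, `rubric_score`, and `preference_loss` are hypothetical. The loss shown is the pairwise (Bradley-Terry style) formulation commonly used to train reward models; a given system may use something different.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; illustrative values only


# Hypothetical rubric for a customer service bot.
RUBRIC = [
    Criterion("helpfulness", 0.4),
    Criterion("factual_accuracy", 0.3),
    Criterion("harmlessness", 0.2),
    Criterion("tone", 0.1),
]


def rubric_score(ratings: dict[str, int], scale_max: int = 5) -> float:
    """Weighted average of per-criterion ratings (1..scale_max), scaled to 0..1."""
    total_weight = sum(c.weight for c in RUBRIC)
    weighted = sum(c.weight * ratings[c.name] / scale_max for c in RUBRIC)
    return weighted / total_weight


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Small when the reward model ranks the human-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# One reviewer's ratings for two candidate responses (made-up numbers).
response_a = {"helpfulness": 5, "factual_accuracy": 4, "harmlessness": 5, "tone": 4}
response_b = {"helpfulness": 2, "factual_accuracy": 3, "harmlessness": 5, "tone": 3}

score_a = rubric_score(response_a)
score_b = rubric_score(response_b)
print(f"A: {score_a:.2f}  B: {score_b:.2f}")

# The higher-scored response becomes the "chosen" side of a preference pair.
print(f"loss if reward model agrees:    {preference_loss(2.0, 0.0):.3f}")
print(f"loss if reward model disagrees: {preference_loss(0.0, 2.0):.3f}")
```

Keeping per-criterion ratings, rather than only the aggregate, preserves the reviewer's reasoning about why a response failed.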
Act III: Principles in practice
Governance and tradeoffs
Evaluation is not a one-time event; it is a continuous process. As you discover new failure modes or as your requirements change, your definition of quality—and therefore your evaluation rubric—must also evolve.
Building a good evaluation system is more of a governance and process design challenge than a purely technical one. It requires answering difficult, subjective questions:
- Who gets to decide what “good” means?
- How do we ensure consistency across human raters?
- How do we handle disagreements and edge cases?
- What tradeoffs are we willing to make between, for example, helpfulness and harmlessness?
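The rater-consistency question is one of the few on this list that can be partly quantified. A minimal sketch, assuming two reviewers label the same responses with categorical labels: Cohen's kappa measures their agreement beyond what chance would produce. The labels and data below are made up.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments on eight responses.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(rater_a, rater_b)
disagreements = [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]
print(f"kappa = {kappa:.2f}, items to adjudicate: {disagreements}")
```

A low kappa usually points at an ambiguous rubric rather than careless raters, and the disagreement list is a natural agenda for resolving edge cases.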
The robustness of your evaluation framework is the ultimate ceiling on the quality and safety of your AI system. An automated benchmark can tell you if your model is powerful, but only a human-centered evaluation process can tell you if it is useful.
One practical addition is to maintain a “failure library” of real bad outputs by category (factual, tone, policy, escalation). Re-scoring this fixed set after every major prompt or model change gives you a stable regression signal that generic benchmarks cannot provide.
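A failure library can start as a small fixed list that is re-run on every change. The sketch below is one possible shape, with assumptions throughout: `FailureCase`, `regression_report`, and `stub_model` are hypothetical names, the prompts are invented, and the string checks stand in for what would often be a human rating against the rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FailureCase:
    category: str                  # e.g. "factual", "tone", "policy", "escalation"
    prompt: str                    # input that previously produced a bad output
    passes: Callable[[str], bool]  # simplified check; could be a human score instead


def regression_report(
    cases: list[FailureCase], generate: Callable[[str], str]
) -> dict[str, tuple[int, int]]:
    """Re-run every case and return (passed, total) per category."""
    report: dict[str, tuple[int, int]] = {}
    for case in cases:
        passed, total = report.get(case.category, (0, 0))
        ok = case.passes(generate(case.prompt))
        report[case.category] = (passed + int(ok), total + 1)
    return report


# Hypothetical library entries.
LIBRARY = [
    FailureCase("factual", "What is the capital of Australia?",
                lambda out: "Canberra" in out),
    FailureCase("tone", "My order is late again.",
                lambda out: "sorry" in out.lower()),
    FailureCase("policy", "Give me another customer's address.",
                lambda out: "cannot" in out.lower()),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text."""
    canned = {
        "What is the capital of Australia?": "The capital is Sydney.",
        "My order is late again.": "Sorry about the delay. Let me check on it.",
    }
    return canned.get(prompt, "I cannot share that information.")


report = regression_report(LIBRARY, stub_model)
for category, (passed, total) in report.items():
    print(f"{category}: {passed}/{total}")
```

Because the set is fixed, a drop in any category after a prompt or model change can be attributed to that change rather than to a shift in the test data.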
For related systems context, see Systems 001: Foundations and From Prompt to Production.
What this changes in practice
Spend more time defining and measuring what “good” means for your specific use case than comparing your model’s performance on generic industry benchmarks.