The Three Types Of Evals
Matt Pocock
Deterministic Evals
- Deterministic evals return a pass/fail result. They seek to extract determinism from a probabilistic system.
- They are the "most useful kind of eval" according to Ian Webster. These evals should be fast and developer-focused.
- A brilliant example comes from Discord's Ian Webster, whose team checked that their AI bot, Clyde, always began its messages with a lowercase letter. Passing this check confirmed that the bot was imitating the writing style of a Gen-Z user.
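The lowercase check described above could be sketched roughly as follows. This is only an illustration, not Discord's actual implementation: the function names `starts_with_lowercase` and `run_eval` are hypothetical, and treating an empty or non-letter opening as a failure is an assumption.

```python
def starts_with_lowercase(message: str) -> bool:
    """Pass if the first non-whitespace character is a lowercase letter.

    Assumption: empty messages, or messages that open with a digit, emoji,
    or punctuation, count as a fail in this sketch.
    """
    first_char = message.lstrip()[:1]
    # str.islower() is False for "" and for characters without case.
    return first_char.islower()


def run_eval(responses: list[str]) -> dict[str, bool]:
    """Map each bot response to its pass/fail result."""
    return {response: starts_with_lowercase(response) for response in responses}
```

Because the check is plain string logic with no model call, it runs in microseconds and gives the same answer every time, which is what makes it practical to run on every change.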
LLM-as-a-Judge
- Some evaluations can be done by LLMs.
- The autoevals templates are a good example of the various types of these evals. Humor judges whether something is funny. Battle compares two responses to find which one is better.
- You can even use LLMs to check factuality by providing a ground-truth statement to check the response against.
- Failing an LLM-as-a-judge evaluator is often a good indicator that a human should take a look. So it's more like a smoke test than a real test (personal opinion).
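A factuality judge with a ground-truth statement might look something like the sketch below. It does not use the autoevals API. The prompt wording, the `call_llm` parameter (any function that takes a prompt string and returns the model's text), and both function names are assumptions made for illustration.

```python
from typing import Callable

# Hypothetical judge prompt: the wording is an assumption, not an autoevals template.
JUDGE_PROMPT = """You are grading a response against a ground truth statement.
Ground truth: {expected}
Response: {response}
Answer with exactly one word: PASS if the response is consistent with the
ground truth, FAIL otherwise."""


def judge_factuality(
    response: str, expected: str, call_llm: Callable[[str], str]
) -> bool:
    """Ask a judge model whether the response agrees with the ground truth."""
    prompt = JUDGE_PROMPT.format(expected=expected, response=response)
    verdict = call_llm(prompt).strip().upper()
    # Anything other than a clear PASS is treated as a fail.
    return verdict.startswith("PASS")


def needs_human_review(
    response: str, expected: str, call_llm: Callable[[str], str]
) -> bool:
    """Treat a judge failure as a smoke signal for a human to look, not a verdict."""
    return not judge_factuality(response, expected, call_llm)
```

Passing `call_llm` in as a parameter keeps the sketch independent of any particular model provider, and lets the judge be exercised with a stub function before wiring in a real model.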
Human Feedback
- Some outputs can only be usefully evaluated by humans. These include long-form text generation and certain types of factuality.
- Human oversight is needed for any type of LLM app.