The Three Types Of Evals

Deterministic Evals

Deterministic evals return a pass/fail result from a rule that can be checked in code. They seek to extract determinism from a probabilistic system.
According to Ian Webster, they are the "most useful kind of eval". These evals should be fast and developer-focused.
A good example comes from Discord's Ian Webster: the team checked that their AI bot, Clyde, always began its messages with a lowercase letter. The check confirmed that the bot was imitating the writing style of a Gen-Z user.
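
A check like the lowercase rule can be written in a few lines. The following is a minimal sketch, not Discord's actual implementation; the function names `starts_lowercase` and `run_eval` are made up for illustration, and treating an empty message as a failure is an assumption.

```python
from typing import List, Tuple


def starts_lowercase(message: str) -> bool:
    """Pass if the first non-whitespace character is a lowercase letter."""
    stripped = message.lstrip()
    # Assumption: an empty message counts as a failure.
    return bool(stripped) and stripped[0].islower()


def run_eval(responses: List[str]) -> Tuple[bool, List[str]]:
    """Run the check over a batch and return (all_passed, failing_responses)."""
    failures = [r for r in responses if not starts_lowercase(r)]
    return len(failures) == 0, failures


if __name__ == "__main__":
    passed, failures = run_eval(["hey what's up", "Sure, I can help with that."])
    print("PASS" if passed else f"FAIL: {failures}")
```

Because the check is a pure function with no model call, it runs in microseconds and fits naturally into a unit test suite or CI job.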

LLM-as-a-Judge Evals

Some evaluations can be done by LLMs.
The autoevals templates are a good example of the range of these evals: Humor judges whether something is funny, and Battle compares two responses to find which one is better.
You can even use LLMs to check factuality by providing a ground-truth statement to check the response against.
When a response fails an LLM-as-a-judge evaluator, that is often a good indicator that a human should take a look. So, in my opinion, it is more like a smoke test than a real test.
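
The ground-truth factuality check and the smoke-test idea can be sketched as follows. This is a hedged illustration, not the autoevals API: the prompt text, the `Verdict` type, `check_factuality`, and `stub_judge` are all hypothetical names, and the judge is passed in as a plain function so that any real model call can be substituted for the stub.

```python
from dataclasses import dataclass
from typing import Callable

# A judge is any function that takes a prompt and returns the model's text reply.
# In practice it would wrap an LLM API call; it is left abstract here.
Judge = Callable[[str], str]

# Hypothetical grading prompt, written for this sketch.
FACTUALITY_PROMPT = """You are grading a response against a ground-truth statement.
Ground truth: {expected}
Response: {output}
Answer with exactly one word: CONSISTENT or INCONSISTENT."""


@dataclass
class Verdict:
    passed: bool
    needs_human_review: bool
    raw: str  # the judge's unmodified reply, kept for debugging


def check_factuality(output: str, expected: str, judge: Judge) -> Verdict:
    """Ask the judge whether the output agrees with the ground truth."""
    raw = judge(FACTUALITY_PROMPT.format(expected=expected, output=output))
    # Anything other than a clean CONSISTENT (including malformed replies) fails.
    passed = raw.strip().upper() == "CONSISTENT"
    # Smoke-test behavior: a failure flags the item for a human, it is not a final verdict.
    return Verdict(passed=passed, needs_human_review=not passed, raw=raw)


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, for demonstration only."""
    return "CONSISTENT" if "Response: Paris" in prompt else "INCONSISTENT"


if __name__ == "__main__":
    truth = "Paris is the capital of France."
    for answer in ["Paris is the capital of France.", "Lyon is the capital of France."]:
        print(answer, "->", check_factuality(answer, truth, stub_judge))
```

Treating an unparseable judge reply as a failure is a deliberate choice here: it errs on the side of sending more items to a human rather than silently passing them.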

Human Evals

Some outputs can only be usefully evaluated by humans. These include long-form text generation and certain types of factuality.
Human oversight is needed for any type of LLM app.