
    The Three Types Of Evals

    Matt Pocock

    Deterministic Evals

    • Deterministic evals return a pass/fail result. They seek to extract determinism from a probabilistic system by checking a property of the output that is either true or false.
    • They are the "most useful kind of eval" according to Ian Webster. These evals should be fast and developer-focused.
    • A brilliant example comes from Discord's Ian Webster: the team checked that their AI bot, Clyde, always began its messages with a lowercase letter. Passing that check meant the bot was imitating the behavior of a Gen-Z user.
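
A lowercase-first-letter check like the one described for Clyde can be sketched as a plain function from output to pass/fail. This is a minimal sketch, not Discord's actual implementation: the `EvalResult` type, the `startsWithLowercase` name, and the sample messages are all hypothetical, and it assumes only ASCII letters matter.

```typescript
// Hypothetical result shape for a deterministic eval: pass/fail plus a reason.
type EvalResult = { name: string; pass: boolean; reason: string };

// Assumption: only ASCII letters count. Leading whitespace, digits,
// and punctuation are skipped until the first letter is found.
function startsWithLowercase(output: string): EvalResult {
  const name = "starts-with-lowercase";
  const match = output.match(/[A-Za-z]/);
  if (!match) {
    return { name, pass: false, reason: "no letter found in output" };
  }
  const first = match[0];
  const pass = /[a-z]/.test(first);
  return {
    name,
    pass,
    reason: pass
      ? `first letter "${first}" is lowercase`
      : `first letter "${first}" is uppercase`,
  };
}

// Made-up sample outputs standing in for real model responses.
const samples: string[] = ["hey, what's up?", "Hello there!", "  lol ok", "123"];
const results: EvalResult[] = samples.map(startsWithLowercase);
const passRate: number = results.filter((r) => r.pass).length / results.length;

for (const r of results) {
  console.log(`${r.pass ? "PASS" : "FAIL"} ${r.name}: ${r.reason}`);
}
console.log(`pass rate: ${passRate}`);
```

Because the check is a pure function with no model call, it runs in microseconds and can sit in an ordinary test suite alongside the rest of the codebase.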

    LLM-as-a-Judge

    • Some evaluations can be done by LLMs.
    • The autoevals templates are a good example of the range of these evals: Humor judges whether something is funny, and Battle compares two responses to find which one is better.
    • You can even use LLMs to check factuality, by providing a ground truth statement to check the response against.
    • Failing an LLM-as-a-judge evaluator is often a good indicator that a human should take a look. So, it's more like a smoke test than a real test (personal opinion).
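
The factuality and smoke-test ideas above can be sketched together: build a prompt that contains the ground truth, ask a judge model for a score, and flag low scores for human review. This is a sketch under stated assumptions, not the autoevals API. The `Judge` type stands in for whatever model call you use, and `buildFactualityPrompt`, `triage`, `judgeFactuality`, the 0.5 threshold, and the stubbed judge are all hypothetical.

```typescript
// Hypothetical shapes; a real judge would wrap a call to an LLM provider.
type JudgeVerdict = { score: number; rationale: string };
type Judge = (prompt: string) => Promise<JudgeVerdict>;
type FactualityCase = { question: string; groundTruth: string; response: string };
type Triage = { needsHumanReview: boolean; score: number; rationale: string };

// Put the ground truth next to the response so the judge can compare them.
function buildFactualityPrompt(c: FactualityCase): string {
  return [
    "You are grading a response for factual consistency.",
    `Question: ${c.question}`,
    `Ground truth: ${c.groundTruth}`,
    `Response: ${c.response}`,
    "Reply with a score from 0 (contradicts the ground truth) to 1 (fully consistent) and a one-sentence rationale.",
  ].join("\n");
}

// Smoke-test semantics: a low score does not fail the build, it asks for a human.
// Assumption: 0.5 is an arbitrary default threshold, to be tuned per use case.
function triage(verdict: JudgeVerdict, threshold: number = 0.5): Triage {
  return {
    needsHumanReview: verdict.score < threshold,
    score: verdict.score,
    rationale: verdict.rationale,
  };
}

async function judgeFactuality(
  judge: Judge,
  c: FactualityCase,
  threshold: number = 0.5
): Promise<Triage> {
  const verdict = await judge(buildFactualityPrompt(c));
  return triage(verdict, threshold);
}

// Stub judge so the sketch runs without a network call; it always returns a low score.
const stubJudge: Judge = async () => ({
  score: 0.2,
  rationale: "stubbed verdict for demonstration",
});

judgeFactuality(stubJudge, {
  question: "What is the capital of France?",
  groundTruth: "Paris is the capital of France.",
  response: "The capital of France is Lyon.",
}).then((t) => {
  if (t.needsHumanReview) {
    console.log(`flag for human review (score ${t.score}): ${t.rationale}`);
  }
});
```

Injecting the judge as a function keeps the triage logic deterministic and testable, even though the judge itself is not.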

    Human Feedback

    • Some outputs can only be usefully evaluated by humans. These include long-form text generation and certain types of factuality.
    • Human oversight is needed for any type of LLM app.