
A Guide to LLM Evals


Hemit Patel

Next Voters Engineering

Overview

You built an AI system, but how do you know it works properly?

Measuring the quality of outputs from systems that employ LLMs is particularly difficult. LLM evaluations help solve this by using evaluators to judge output quality over a test dataset (each case pairing a prompt with the LLM's response). These evaluators either check the output against a rubric and generate a score with reasoning, or pick the preferred response between outputs produced by two different system prompts.
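To make the setup concrete, here is a minimal sketch of a test-dataset entry and a rubric evaluator. The class names and the keyword-matching stand-in are my own illustrations; a real evaluator would be a human or an LLM judge, not a keyword count.

```python
from dataclasses import dataclass

# A test dataset pairs each prompt with the system's response.
@dataclass
class EvalCase:
    prompt: str      # the input sent to the system
    response: str    # the LLM's output for that prompt

@dataclass
class EvalResult:
    score: float     # rubric score on a 0-5 scale
    reasoning: str   # why the evaluator gave that score

# Stand-in rubric evaluator (illustrative only): counts how many
# rubric terms the response covers and scales that to 0-5.
def evaluate_against_rubric(case: EvalCase, rubric_terms: list[str]) -> EvalResult:
    hits = sum(term.lower() in case.response.lower() for term in rubric_terms)
    score = 5.0 * hits / max(len(rubric_terms), 1)
    return EvalResult(score=score,
                      reasoning=f"matched {hits}/{len(rubric_terms)} rubric terms")

case = EvalCase(prompt="What is the capital of France?",
                response="The capital of France is Paris.")
result = evaluate_against_rubric(case, ["Paris"])
```

The same dataclasses work for the pairwise setup too: an evaluator would simply take two `EvalCase`s (one per system prompt) and return which it prefers.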

The evaluation's insights are then used to improve the system prompt and elicit better responses!

The latter approach has proven more effective, so I will focus on that method. There are two types of evaluators: human or LLM-as-a-judge.

Pros and Cons of Each Evaluator

A human evaluator provides high-quality analysis, but is less scalable due to wage costs and the time each eval takes. An LLM judge, if designed poorly, gives low-quality evals because of LLM limitations like authority bias (favouring responses that are factually incorrect but confidently presented). However, LLM judges are cheaper and more scalable, since they process information far faster.

Thus, an LLM-judge evaluator must strike a balance between human-like judgement (to bypass LLM limitations) and scalability. This is the problem that G-Evals solve.

So… What Are G-Evals?

G-Evals are a research-backed framework that uses LLM-as-a-judge to determine the quality of an LLM response, and they have been shown to closely match human judgement.

G-Evals use custom criteria (defined by the engineer) together with a chain-of-thought (CoT) approach.

What is CoT?

Chain of thought is a prompt-engineering technique that improves performance on complex tasks by generating intermediary steps that must be completed. Essentially, it mimics human problem solving, since we break large problems into smaller, manageable chunks.
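As a rough illustration, a CoT-style evaluation prompt asks the judge to spell out its steps before committing to a score. The wording below is my own sketch, not the prompt from the G-Eval paper:

```python
# Hypothetical prompt builder: asks the judge model to reason step by step
# before scoring. The phrasing is illustrative.
def build_cot_eval_prompt(criteria: str, prompt: str, response: str) -> str:
    return (
        f"Evaluation criteria: {criteria}\n\n"
        "First, break the criteria into concrete evaluation steps.\n"
        "Then work through each step on the response below.\n"
        "Finally, give a score from 0 to 5.\n\n"
        f"User prompt: {prompt}\n"
        f"Model response: {response}\n"
    )

eval_prompt = build_cot_eval_prompt(
    criteria="The answer must be factually correct and concise.",
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)
```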

G-Eval works through the following steps:

  1. The LLM transforms your criteria into a list of evaluation steps. This is where CoT is implemented, as it breaks the criteria into smaller parts.
  2. These steps are then used to produce a judgement on the response.
  3. Once the judgement is generated, a score is given based on the log probabilities of the candidate scores.
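The first two steps can be sketched as a small pipeline. Here `call_llm` is a placeholder for a real chat-completion API call, and the prompt wording is my own:

```python
# Placeholder for a real LLM API call (e.g. a chat-completion endpoint).
def call_llm(prompt: str) -> str:
    return "stubbed model output"

# Step 1: CoT - turn the engineer's criteria into explicit evaluation steps.
def generate_eval_steps(criteria: str) -> str:
    return call_llm(f"Break these evaluation criteria into numbered steps:\n{criteria}")

# Step 2: apply those steps to produce a judgement on the response.
def judge_response(steps: str, response: str) -> str:
    return call_llm(f"Follow these steps:\n{steps}\n\nEvaluate this response:\n{response}")

steps = generate_eval_steps("The answer must be factually correct and concise.")
judgement = judge_response(steps, "The capital of France is Paris.")
```

Step 3, the log-probability scoring, is covered next.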

What Are Log Probabilities?

Simply put, the judge considers a range of candidate scores (like 0–5), treats each score's probability (recovered from its log probability) as a weight, and blends the candidates into a weighted average for the final score.
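As a worked example of that weighting (the log-probability values below are made up for illustration, not taken from any real model):

```python
import math

# Hypothetical log probabilities the judge assigns to each candidate score token.
logprobs = {0: -5.0, 1: -4.0, 2: -3.0, 3: -1.5, 4: -0.7, 5: -1.2}

# Convert log probabilities back to probabilities and normalise into weights.
probs = {s: math.exp(lp) for s, lp in logprobs.items()}
total = sum(probs.values())
weights = {s: p / total for s, p in probs.items()}

# The final score is the probability-weighted average of the candidates,
# not just the single most likely score token.
final_score = sum(s * w for s, w in weights.items())
```

Note that even though 4 is the single most likely score here, the weighted average lands between 3 and 4, reflecting the judge's residual uncertainty.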

Both the judgement and the score are then used to decide how to improve the system design/prompt :)


P.S. You can check out the research paper here.