On LLM Evals
LLMs hallucinate. Your job is to ensure they don’t embarrass you, your company, or your brand.
Table of contents
- Introduction
- Choosing the right LLM
- Human Evaluation – HHH Framework
- Scaling Evaluation – LLM as Judges
- Hands-On exercise to run evals
Introduction
LLMs are powerful but unpredictable. Unlike traditional software, they don’t always give the same answer to the same input. Without structured evaluation, you’re flying blind: you can’t guarantee quality, safety, or trust. Evals are how you make sure your AI is reliable enough to put in front of customers. Enter LLM evals.
- AI evals are like unit tests for agents.
- Differences between software testing and AI evals:
- Software testing & unit tests are deterministic. LLM agents are non-deterministic, with multiple possible paths.
- Integration tests rely on code/docs, but improving agents relies on data.
Part 1 - Choosing the right LLM
- Start with requirements along these dimensions:
- Accuracy: e.g., >90% is ideal in legal contexts.
- Latency: medium-to-low sensitivity; offline jobs can tolerate slower responses.
- Cost: Less sensitive initially since manual legal review is expensive.
- Context length: must handle long documents (e.g., ≈20K–1M tokens).
- Throughput (QPM): Size for expected query volume.
- Grounding: Model should cite source contract text.
- Some benchmarks you could use, typically published alongside model releases:
- Language understanding
- Q&A
- Document classification
- Reasoning (planning, chain-of-thought)
- Tool usage (email, APIs, CRM)
- Model selection: use published evals/model cards (e.g., a “model matrix”) to compare options. Example: o3-mini (high) may offer a strong balance of accuracy, latency, and cost. A simple weighted-scorecard sketch follows.
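One lightweight way to turn these requirements into a decision is a weighted scorecard. The sketch below is illustrative only: the model names, per-dimension scores, and weights are hypothetical placeholders, not published benchmark numbers.

```python
# Illustrative weighted scorecard for comparing candidate models.
# Model names, per-dimension scores (0-1, higher is better), and weights
# are hypothetical placeholders -- substitute your own benchmark results.

WEIGHTS = {
    "accuracy": 0.40,
    "latency": 0.15,
    "cost": 0.15,
    "context_length": 0.20,
    "grounding": 0.10,
}

CANDIDATES = {
    "model_a": {"accuracy": 0.92, "latency": 0.70, "cost": 0.50,
                "context_length": 0.90, "grounding": 0.80},
    "model_b": {"accuracy": 0.88, "latency": 0.90, "cost": 0.85,
                "context_length": 0.60, "grounding": 0.75},
}

def weighted_score(scores: dict) -> float:
    """Collapse per-dimension scores into a single number using the weights."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

if __name__ == "__main__":
    ranked = sorted(CANDIDATES.items(), key=lambda kv: weighted_score(kv[1]), reverse=True)
    for name, scores in ranked:
        print(f"{name}: {weighted_score(scores):.3f}")
```

The weights encode your requirements: for a legal workload you would push most of the weight onto accuracy, context length, and grounding, and accept a lower weight on latency and cost.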
Part 2 Human Evaluation – HHH Framework
- Helpful → Solves problem, complete, succinct.
- Honest → Factual, validated, clickable links.
- Harmless → Ethical, legal, policy-aligned, guardrails applied.
- How to use:
- Define north star metrics: Job completion, customer satisfaction.
- Create yes/no evaluation questions.
- Human evaluators score against ground truth.
- Define launch thresholds (e.g., 60% helpfulness).
- Update criteria with customer feedback.
Start with this simple framework; a minimal scoring sketch follows.
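Once evaluators answer the yes/no questions, the tally and launch-threshold check is only a few lines. This is a minimal sketch: the sample ratings are placeholders, and only the 60% helpfulness bar comes from the list above; the honest/harmless bars are assumptions for illustration.

```python
# Tally yes/no human ratings per HHH dimension and gate on launch thresholds.
# Ratings are placeholders; only the 60% helpfulness bar comes from the text,
# the honest/harmless bars are assumed for illustration.

RATINGS = [  # one dict per (response, evaluator) pair
    {"helpful": True,  "honest": True,  "harmless": True},
    {"helpful": False, "honest": True,  "harmless": True},
    {"helpful": True,  "honest": False, "harmless": True},
]

THRESHOLDS = {"helpful": 0.60, "honest": 0.90, "harmless": 0.99}

def pass_rates(ratings: list) -> dict:
    """Fraction of 'yes' answers for each dimension."""
    return {dim: sum(r[dim] for r in ratings) / len(ratings) for dim in THRESHOLDS}

def ready_to_launch(ratings: list) -> tuple:
    """Return (all thresholds met?, per-dimension pass rates)."""
    rates = pass_rates(ratings)
    return all(rates[dim] >= bar for dim, bar in THRESHOLDS.items()), rates

if __name__ == "__main__":
    ok, rates = ready_to_launch(RATINGS)
    print(rates)
    print("launch" if ok else "keep iterating")
```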
Part 3 Scaling Evaluation – LLM as Judges
- Challenge: Human eval doesn’t scale.
- Solutions:
- Cloud APIs for moderation (toxicity, hate speech, frustration).
- Train an LLM judge using good conversation examples.
- Use SDKs or pre-built judges to measure: correctness, relevance, guideline adherence, context sufficiency, RAG chunk relevance (a minimal judge sketch follows this list).
- Process:
- Collect requests + responses + ground truth → apply judges → score → dashboard metrics.
- Continuous improvement: judges improve over time, but this is still an evolving field.
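As a concrete starting point, here is a minimal judge sketch using the OpenAI Python SDK. The model name, rubric wording, and 1–5 scale are assumptions to adapt, not a prescribed setup; it expects OPENAI_API_KEY to be set (as in Part 4).

```python
# Minimal LLM-as-judge sketch using the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set; model name, rubric, and scale are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Ground truth: {ground_truth}

Rate correctness from 1 (wrong) to 5 (fully correct). Reply with the number only."""

def judge(question: str, answer: str, ground_truth: str) -> int:
    """Ask a judge model to score one (request, response, ground truth) triple."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever judge model you trust
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, ground_truth=ground_truth)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    score = judge(
        question="What is the notice period in the contract?",
        answer="30 days.",
        ground_truth="The notice period is 30 days.",
    )
    print("correctness:", score)
```

Run this over a batch of collected requests, responses, and ground truth, then roll the scores up into dashboard metrics as described above.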
Part 4 Hands-On exercise to run a few evals
- PromptFoo is a test-driven development framework for LLMs
- Install Node.js & npm (nodejs.org)
- Set up your OPENAI_API_KEY environment variable
- Configure promptfooconfig.yaml (mine here); a minimal example is sketched after this list.
- Install promptfoo: npm install -g promptfoo
- Run the evaluations: npx -y promptfoo eval -c «config».yaml
- View the results in the browser: npx -y promptfoo view -p 8080
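To make the config step concrete, the snippet below writes out a minimal promptfooconfig.yaml you can adapt. The provider string, prompt, and assertions are illustrative assumptions for a contract-summary use case; check the promptfoo docs (or the config linked above) for the full schema.

```python
# Write a minimal promptfooconfig.yaml to adapt for your own tests.
# Provider, prompt, and assertions are illustrative placeholders.
from pathlib import Path

CONFIG = """\
prompts:
  - "Summarize the key obligation in this clause: {{clause}}"

providers:
  - openai:gpt-4o-mini   # placeholder model

tests:
  - vars:
      clause: "The Supplier shall deliver the goods within 30 days of the order date."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The summary is accurate and adds no obligations not present in the clause."
"""

if __name__ == "__main__":
    Path("promptfooconfig.yaml").write_text(CONFIG)
    print("wrote promptfooconfig.yaml -- now run: npx -y promptfoo eval -c promptfooconfig.yaml")
```

Note how the llm-rubric assertion is exactly the LLM-as-judge idea from Part 3, applied automatically to every test case.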
Written on August 23, 2025