On LLM Evals

LLMs hallucinate. Your job is to ensure they don’t embarrass you, your company, or your brand.

Table of contents

  1. Introduction
  2. Choosing the right LLM
  3. Human Evaluation – HHH Framework
  4. Scaling Evaluation – LLM as Judges
  5. Hands-On exercise to run evals

Introduction

LLMs are powerful but unpredictable. Unlike traditional software, they don’t always give the same answer to the same input. Without structured evaluation, you’re flying blind — you can’t guarantee quality, safety, or trust. Evals are how you make sure your AI is reliable enough to put in front of customers. Enter LLM evals.

  • AI evals are like unit tests for agents.
  • How software testing differs from AI evals (see the sketch after this list):
    • Software testing & unit tests are deterministic. LLM agents are non-deterministic, with multiple possible paths.
    • Integration tests rely on code/docs, but improving agents relies on data.
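
To make the difference concrete, here is a minimal sketch in Python. The `add` and `call_llm` names are hypothetical stand-ins, not from any particular library: the unit test asserts an exact answer, while the eval samples the model repeatedly and scores it against a loose criterion.

```python
# Deterministic unit test: the same input always yields the same output.
def add(a: int, b: int) -> int:
    return a + b

def test_add() -> None:
    assert add(2, 3) == 5

# Non-deterministic eval: the model may phrase its answer differently each run,
# so we sample several times and compute a pass rate instead of asserting equality.
def eval_capital_question(call_llm, runs: int = 10) -> float:
    question = "What is the capital of France? Answer in one sentence."
    samples = [call_llm(question) for _ in range(runs)]
    passed = sum("paris" in s.lower() for s in samples)
    return passed / runs  # e.g. require >= 0.9 before shipping
```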

Part 1 - Choosing the right LLM

  • Start with requirements along these dimensions:
    • Accuracy: For example, >90% may be the bar in legal contexts.
    • Latency: Medium-to-low latency is acceptable for offline jobs.
    • Cost: Less sensitive initially, since manual legal review is expensive.
    • Context length: Must handle long docs (≈20K–1M tokens).
    • Throughput (QPM): Size for expected query volume.
    • Grounding: Model should cite source contract text.
  • Some benchmarks you could use, typically published alongside model releases:
    • Language understanding
    • Q&A
    • Document classification
    • Reasoning (planning, chain-of-thought)
    • Tool usage (email, APIs, CRM)
  • Model selection: Use published evals and model cards (e.g., a “model matrix”) to compare options; a scoring sketch follows this list. For example, o3-mini (high) may offer a strong balance of accuracy, latency, and cost.
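
A “model matrix” can be as simple as a weighted scorecard over the requirement dimensions above. The sketch below is purely illustrative: the candidate names, 1–5 scores, and weights are made-up assumptions, not published benchmark results.

```python
# Hypothetical model-selection matrix: scores (1-5 per dimension) and weights
# are illustrative assumptions, not published benchmark numbers.
WEIGHTS = {
    "accuracy": 0.35, "latency": 0.10, "cost": 0.15,
    "context_length": 0.15, "throughput": 0.10, "grounding": 0.15,
}

CANDIDATES = {
    "model-a": {"accuracy": 5, "latency": 3, "cost": 2,
                "context_length": 5, "throughput": 3, "grounding": 4},
    "model-b": {"accuracy": 4, "latency": 4, "cost": 4,
                "context_length": 3, "throughput": 4, "grounding": 3},
}

def weighted_score(scores: dict[str, int]) -> float:
    # Weighted sum over the requirement dimensions listed above.
    return sum(WEIGHTS[dim] * value for dim, value in scores.items())

for name, scores in sorted(CANDIDATES.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(scores):.2f}")
```

Tune the weights to your use case: a legal workflow would weight accuracy and grounding far more heavily than latency.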

Part 2 - Human Evaluation – HHH Framework

  • Helpful → Solves problem, complete, succinct.
  • Honest → Factual, validated, with working (clickable) links.
  • Harmless → Ethical, legal, policy-aligned, guardrails applied.
  • How to use (a scoring sketch follows this list):
    • Define north star metrics: Job completion, customer satisfaction.
    • Create yes/no evaluation questions.
    • Human evaluators score against ground truth.
    • Define launch thresholds (e.g., 60% helpfulness).
    • Update criteria with customer feedback.
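
Here is a minimal sketch of that loop in Python. The yes/no questions simply restate the HHH bullets above; the sample scores and the honest/harmless thresholds are made up (the framework above only specifies a 60% helpfulness bar).

```python
# Turn HHH into yes/no questions, aggregate human scores, and check launch thresholds.
RUBRIC = {
    "helpful": "Did the answer solve the problem completely and succinctly?",
    "honest": "Is every factual claim correct and every link valid?",
    "harmless": "Does the answer follow policy, legal, and safety guardrails?",
}

# One row per conversation: a human evaluator's yes/no answers vs. ground truth.
human_scores = [
    {"helpful": True,  "honest": True,  "harmless": True},
    {"helpful": False, "honest": True,  "harmless": True},
    {"helpful": True,  "honest": False, "harmless": True},
]

# Launch thresholds: 60% helpfulness comes from the framework above;
# the other two bars are assumed for illustration.
THRESHOLDS = {"helpful": 0.60, "honest": 0.90, "harmless": 1.00}

for dimension, question in RUBRIC.items():
    rate = sum(row[dimension] for row in human_scores) / len(human_scores)
    status = "PASS" if rate >= THRESHOLDS[dimension] else "FAIL"
    print(f"{dimension}: {rate:.0%} {status} - {question}")
```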

Start with this simple framework

Part 3 - Scaling Evaluation – LLM as Judges

  • Challenge: Human eval doesn’t scale.
  • Solutions:
    • Cloud APIs for moderation (toxicity, hate speech, frustration).
    • Train an LLM judge using good conversation examples.
    • Use SDKs or pre-built judges to measure correctness, relevance, guideline adherence, context sufficiency, and RAG chunk relevance (a minimal judge sketch follows this list).
  • Process:
    • Collect requests + responses + ground truth → apply judges → score → dashboard metrics.
    • Continuous improvement: Judges get better over time, but this is still an evolving field.
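
Below is a minimal LLM-as-judge sketch using the OpenAI Python SDK. The judge prompt, the gpt-4o-mini model choice, and the 1–5 scales are my assumptions; in practice you would swap in whichever SDK or pre-built judge you actually use.

```python
# A sketch of an LLM judge scoring correctness and relevance.
# The judge prompt, the gpt-4o-mini model, and the 1-5 scales are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an evaluation judge. Given a user request, the agent's response, and "
    "the ground-truth answer, return JSON: "
    '{"correctness": 1-5, "relevance": 1-5, "reason": "<one sentence>"}'
)

def judge(request: str, response: str, ground_truth: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",          # assumed judge model
        temperature=0,                # keep the judge as deterministic as possible
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Request: {request}\nResponse: {response}\nGround truth: {ground_truth}"
            )},
        ],
    )
    return json.loads(completion.choices[0].message.content)

print(judge(
    "When does my contract auto-renew?",
    "It auto-renews on March 1 unless you cancel 30 days in advance.",
    "Auto-renewal date: March 1; 30-day cancellation notice required.",
))
```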

Part 4 - Hands-On exercise to run a few evals

  • PromptFoo is a test-driven development framework for LLMs.
  • Install Node.js & npm (nodejs.org).
  • Set the OPENAI_API_KEY environment variable.
  • Configure promptfooconfig.yaml (mine here; a starter sketch follows this list).
  • Install promptfoo: npm install -g promptfoo
  • Run the evaluations: npx -y promptfoo eval -c «config».yaml
  • View the results in the browser (screenshot below): npx -y promptfoo view -p 8080
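
If you want something to start from before adapting my config, here is a minimal, hypothetical promptfooconfig.yaml written out from Python; the prompt, provider, and assertions are illustrative assumptions, not a recommended setup.

```python
# Writes a minimal starter promptfooconfig.yaml; the prompt, provider, and
# assertions below are illustrative assumptions, not a recommended setup.
from pathlib import Path

STARTER_CONFIG = """\
description: Minimal smoke-test eval
prompts:
  - "Answer briefly: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: What is the capital of France?
    assert:
      - type: contains
        value: Paris
      - type: llm-rubric
        value: The answer is helpful, factual, and free of unsafe content.
"""

Path("promptfooconfig.yaml").write_text(STARTER_CONFIG)
print("Wrote promptfooconfig.yaml - now run: npx -y promptfoo eval")
```

Once this runs end to end, replace the toy test with real requests, ground truth, and HHH-style assertions from Part 2.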

[Screenshot: promptfoo results UI]

Written on August 23, 2025