DeepEval: Unit Testing and Evaluation for LLM Applications
Building applications powered by Large Language Models (LLMs), such as RAG chatbots, AI assistants, or generative pipelines, isn't just about making the model work. The real challenge is keeping output quality consistent. How do you know your chatbot isn't hallucinating? How do you measure how relevant an answer is to the user's question? How do you detect bias or toxic content in responses?
DeepEval is an open-source framework designed specifically to answer these questions. It lets you write unit tests for LLM applications just as you write unit tests for traditional software, complete with ready-made evaluation metrics and pytest integration.
What Is DeepEval?
DeepEval is a Python framework for testing and evaluating LLM applications. The framework provides:
- Unit Testing for LLMs: Write test cases that validate LLM output with measurable metrics
- 14+ Built-in Metrics: Hallucination, Answer Relevancy, Faithfulness, Bias, Toxicity, and more
- Pytest Integration: Run LLM evaluations like regular test suites
- Custom Metrics: Create evaluation metrics tailored to your business needs
- Conversational Metrics: Evaluate multi-turn conversation quality
- Benchmarks: Test your model against standard benchmarks like MMLU and HellaSwag
- Confident AI Dashboard: Monitoring and analytics platform for evaluation results
- CI/CD Ready: Integrate evaluations into your deployment pipeline
Installation
Basic Installation
pip install deepeval
Installation with Additional Dependencies
# For all features, including benchmarks
pip install "deepeval[benchmark]"
# For Confident AI integration
pip install "deepeval[dashboard]"
Initial Setup
After installation, log in to Confident AI (optional, but recommended for tracking results):
deepeval login
Or set the API key manually:
export DEEPEVAL_API_KEY="your-api-key-here"
Make sure you also have an API key for the LLM used as the evaluator:
export OPENAI_API_KEY="your-openai-api-key"
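A missing evaluator key usually surfaces only when the first metric runs, so it can help to check for it up front. The following is a minimal sketch in plain Python, not part of DeepEval; the helper name `missing_keys` and the `REQUIRED_KEYS` tuple are hypothetical, and the variable name assumes the evaluator is an OpenAI model configured through `OPENAI_API_KEY`.

```python
from typing import Mapping, Sequence

# Hypothetical list of variables to check; extend it for other providers.
REQUIRED_KEYS = ("OPENAI_API_KEY",)


def missing_keys(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_KEYS
) -> list[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_keys(os.environ)` at the start of a test session and aborting when the returned list is non-empty gives a clear error message before any evaluation starts.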
Core Concept: LLMTestCase
LLMTestCase is the fundamental unit in DeepEval. Each test case represents a single interaction with an LLM that you want to evaluate.
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of Indonesia?",
    actual_output="The capital of Indonesia is Jakarta. However, the Indonesian government is building a new capital called Nusantara (IKN) in East Kalimantan.",
    expected_output="Jakarta (with note about IKN Nusantara as the new capital)",
    retrieval_context=[
        "Jakarta has been the capital of Indonesia since independence in 1945.",
        "Nusantara (IKN) is Indonesia's capital relocation project to East Kalimantan, inaugurated in 2024."
    ]
)
Key LLMTestCase parameters:
| Parameter | Description | Required |
|-----------|-------------|----------|
| input | The question or prompt given to the LLM | Yes |
| actual_output | The actual response from the LLM | Yes |
| expected_output | The expected response (ground truth) | Depends on metric |
| retrieval_context | Context retrieved from knowledge base | Depends on metric |
| context | Ground truth context (ideal) | Depends on metric |
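To make "Depends on metric" concrete, the sketch below checks which fields a test case still needs before a given metric can run. It is plain Python for illustration only: `SketchTestCase` is a hypothetical stand-in for `LLMTestCase`, and the `REQUIRED_FIELDS` mapping is an assumption about three of the metrics that should be verified against the DeepEval documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTestCase:
    """Hypothetical stand-in for LLMTestCase, for illustration only."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    retrieval_context: Optional[list[str]] = None
    context: Optional[list[str]] = None


# Assumed per-metric requirements; check the DeepEval docs before relying on them.
REQUIRED_FIELDS = {
    "answer_relevancy": ("input", "actual_output"),
    "faithfulness": ("input", "actual_output", "retrieval_context"),
    "hallucination": ("input", "actual_output", "context"),
}


def missing_fields(case: SketchTestCase, metric: str) -> list[str]:
    """Return the fields the named metric needs that are empty on `case`."""
    return [name for name in REQUIRED_FIELDS[metric] if not getattr(case, name)]
```

A test case that carries only `context` would be ready for the hallucination metric in this sketch but would report `retrieval_context` as missing for faithfulness.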
Built-in Evaluation Metrics
DeepEval provides various ready-to-use evaluation metrics. Here are the key metrics explained:
1. Hallucination Metric
Measures whether the LLM output contains information not supported by the given context.
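The role of the threshold is easier to see in a toy version. The sketch below is not DeepEval's implementation: it assumes the score is the fraction of context entries that the output contradicts, that a lower score is better, and that a test passes when the score is at or below the threshold. The `contradicts` callable is a hypothetical stand-in for the LLM judge.

```python
from typing import Callable, Sequence


def hallucination_score(
    actual_output: str,
    contexts: Sequence[str],
    contradicts: Callable[[str, str], bool],
) -> float:
    """Fraction of context entries that the judge says the output contradicts."""
    if not contexts:
        raise ValueError("at least one context entry is required")
    contradicted = sum(1 for ctx in contexts if contradicts(actual_output, ctx))
    return contradicted / len(contexts)


def passes(score: float, threshold: float = 0.5) -> bool:
    """Assumed pass rule: lower is better, so pass at or below the threshold."""
    return score <= threshold
```

In a real run the judge is an LLM call rather than a Python function, so scores can vary slightly between runs.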
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

metric = HallucinationMetric(threshold=0.5)
test_case = LLMTestCase(
    input="Tell me about PT ABC company",
    actual_output="PT ABC was founded in 1990 in Jakarta. The company has 5000 employees and operates in the technology sector.",
    context=[
        "PT ABC was founded in 1990 in Jakarta.",