DeepEval: Unit Testing and Evaluation for LLM Applications

Building applications powered by Large Language Models (LLMs), such as RAG chatbots, AI assistants, or generative pipelines, isn't just about making the model work. The real challenge is keeping output quality consistent. How do you know your chatbot isn't hallucinating? How do you measure how relevant an answer is to the user's question? How do you detect bias or toxic content in responses?

DeepEval is an open-source framework designed to answer these questions. It lets you write unit tests for LLM applications much as you would for traditional software, with built-in evaluation metrics and pytest integration.

What Is DeepEval?

DeepEval is a Python framework for testing and evaluating LLM applications. The framework provides:

Unit Testing for LLMs: Write test cases that validate LLM output with measurable metrics
14+ Built-in Metrics: Hallucination, Answer Relevancy, Faithfulness, Bias, Toxicity, and more
Pytest Integration: Run LLM evaluations like regular test suites
Custom Metrics: Create evaluation metrics tailored to your business needs
Conversational Metrics: Evaluate multi-turn conversation quality
Benchmarks: Test your model against standard benchmarks like MMLU and HellaSwag
Confident AI Dashboard: Monitoring and analytics platform for evaluation results
CI/CD Ready: Integrate evaluations into your deployment pipeline
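
To make the "unit tests for LLM output" idea concrete, here is a minimal, framework-free sketch of the pattern: compute a score for an output, then assert that it clears a threshold inside a pytest-style test function. Everything in it is hypothetical. `keyword_coverage` is a toy stand-in for a real metric, `fake_chatbot` stands in for your application, and neither is part of DeepEval.

```python
# Toy stand-in for an evaluation metric. NOT part of DeepEval.
def keyword_coverage(actual_output: str, required_keywords: list[str]) -> float:
    """Return the fraction of required keywords found in the output (case-insensitive)."""
    if not required_keywords:
        return 1.0
    text = actual_output.lower()
    hits = sum(1 for keyword in required_keywords if keyword.lower() in text)
    return hits / len(required_keywords)


def fake_chatbot(question: str) -> str:
    """Hypothetical application under test; a real test would call your LLM app here."""
    return "The capital of Indonesia is Jakarta."


def test_capital_question():
    # pytest collects functions named test_*; a failing assert fails the test.
    answer = fake_chatbot("What is the capital of Indonesia?")
    score = keyword_coverage(answer, ["Jakarta", "Indonesia"])
    assert score >= 0.5, f"score {score:.2f} is below the threshold"
```

A real metric replaces the keyword check with an LLM-based judgment, but the shape of the test (input, output, score, threshold) stays the same.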

Installation

Basic Installation

pip install deepeval

Installation with Additional Dependencies

# For all features including benchmarks
pip install deepeval[benchmark]

# For Confident AI integration
pip install deepeval[dashboard]

Initial Setup

After installation, log in to Confident AI (optional but recommended for tracking):

deepeval login

Or set the API key manually:

export DEEPEVAL_API_KEY="your-api-key-here"

Make sure you also have an API key for the LLM used as the evaluator:

export OPENAI_API_KEY="your-openai-api-key"
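
Evaluations that depend on an LLM judge fail late and noisily when a key is missing, so it can help to check the environment before a test run. The sketch below is one way to do that using only the standard library. The helper name `missing_env_vars` is hypothetical, and it assumes the evaluator is an OpenAI model, so `OPENAI_API_KEY` is the only variable it checks.

```python
import os

# Assumption: the evaluator is an OpenAI model, so only OPENAI_API_KEY is
# checked here. Add other variable names if your setup needs them.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY",)


def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Usage with an explicit mapping, so the result does not depend on your shell:
print(missing_env_vars({"OPENAI_API_KEY": ""}))  # ['OPENAI_API_KEY']
```

Calling `missing_env_vars()` with no arguments checks the real process environment instead.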

Core Concept: LLMTestCase

LLMTestCase is the fundamental unit in DeepEval. Each test case represents a single interaction with an LLM that you want to evaluate.

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    # The question sent to the application
    input="What is the capital of Indonesia?",
    # What the application actually answered
    actual_output="The capital of Indonesia is Jakarta. However, the Indonesian government is building a new capital called Nusantara (IKN) in East Kalimantan.",
    # The ground-truth answer
    expected_output="Jakarta (with note about IKN Nusantara as the new capital)",
    # The passages retrieved from the knowledge base for this question
    retrieval_context=[
        "Jakarta has been the capital of Indonesia since independence in 1945.",
        "Nusantara (IKN) is Indonesia's capital relocation project to East Kalimantan, inaugurated in 2024."
    ]
)

Key LLMTestCase parameters:

| Parameter | Description | Required |
|-----------|-------------|----------|
| input | The question or prompt given to the LLM | Yes |
| actual_output | The actual response from the LLM | Yes |
| expected_output | The expected response (ground truth) | Depends on metric |
| retrieval_context | Context retrieved from the knowledge base | Depends on metric |
| context | Ground truth context (ideal) | Depends on metric |
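
"Depends on metric" means a test case can be valid for one metric and incomplete for another. The sketch below illustrates that idea with a stand-in dataclass rather than DeepEval itself. `SketchTestCase`, `REQUIRED_FIELDS`, and `missing_fields` are hypothetical names, and the per-metric requirements listed are assumptions for illustration, so check each metric's documentation before relying on them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTestCase:
    """Hypothetical stand-in that mirrors the LLMTestCase fields in the table."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    retrieval_context: Optional[list] = None
    context: Optional[list] = None


# Assumed requirements, for illustration only.
REQUIRED_FIELDS = {
    "answer_relevancy": ("input", "actual_output"),
    "faithfulness": ("input", "actual_output", "retrieval_context"),
    "hallucination": ("input", "actual_output", "context"),
}


def missing_fields(case: SketchTestCase, metric_name: str) -> list:
    """Return the fields the named metric needs that are unset or empty on the case."""
    return [name for name in REQUIRED_FIELDS[metric_name] if not getattr(case, name)]


case = SketchTestCase(
    input="What is the capital of Indonesia?",
    actual_output="The capital of Indonesia is Jakarta.",
)
print(missing_fields(case, "answer_relevancy"))  # []
print(missing_fields(case, "faithfulness"))      # ['retrieval_context']
```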

Built-in Evaluation Metrics

DeepEval ships with ready-to-use evaluation metrics. The key ones are explained below.

1. Hallucination Metric

Measures whether the LLM output contains information not supported by the given context.
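
For intuition about how such a score relates to a pass/fail threshold, here is a toy calculation that does not call DeepEval or any LLM. It assumes the score is the fraction of context statements that the output contradicts and that lower is better, so the threshold acts as an upper bound. Verify both assumptions against the DeepEval version you use. The function names and the hand-written verdicts are hypothetical; in practice an LLM judge produces the verdicts.

```python
def hallucination_score(contradicted: list[bool]) -> float:
    """Fraction of context statements that the output contradicts (0.0 is best)."""
    if not contradicted:
        raise ValueError("at least one context statement is required")
    return sum(contradicted) / len(contradicted)


def passes(score: float, threshold: float = 0.5) -> bool:
    # Assumption: lower is better, so the threshold is a maximum allowed score.
    return score <= threshold


# Hand-written verdicts: the output contradicts one of four context statements.
verdicts = [False, True, False, False]
score = hallucination_score(verdicts)
print(score, passes(score))  # 0.25 True
```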

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

metric = HallucinationMetric(threshold=0.5)

test_case = LLMTestCase(
    input="Tell me about PT ABC company",
    actual_output="PT ABC was founded in 1990 in Jakarta. The company has 5000 employees and operates in the technology sector.",
    # Ground-truth statements the output is checked against
    context=[
        "PT ABC was founded in 1990 in Jakarta.",
