Instructor: Getting Structured Output from LLMs with Python
One of the biggest challenges when working with Large Language Models (LLMs) is getting structured, consistent output. By default, LLMs produce free-form text, which is hard to parse and integrate into applications. The Instructor library addresses this by using Pydantic to define the expected structure and to validate what the model returns.
In this tutorial, we will learn how to use Instructor to get reliable JSON/Pydantic outputs from various LLM providers such as OpenAI, Anthropic, and others.
What Is Instructor?
Instructor is a Python library that patches LLM clients (like OpenAI) to return validated Pydantic objects instead of plain strings. It uses the provider's function calling or JSON mode to request output in the shape of your schema, then validates the result with Pydantic.
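To make the mechanism concrete, the sketch below shows the two Pydantic pieces this relies on, with no LLM call involved: deriving a JSON Schema from a model (the kind of schema a function-calling or JSON-mode request can carry) and validating a JSON string against that model. It assumes Pydantic v2. The `City` model and the sample reply are hypothetical, and how Instructor actually formats the request depends on the provider and mode.

```python
from pydantic import BaseModel


class City(BaseModel):  # hypothetical model, for illustration only
    name: str
    population: int


# The JSON Schema derived from the model describes the expected fields.
schema = City.model_json_schema()
print(schema["properties"]["population"]["type"])  # integer
print(schema["required"])  # ['name', 'population']

# A JSON string (standing in for a model reply) becomes a typed object.
raw_reply = '{"name": "Lisbon", "population": 545000}'
city = City.model_validate_json(raw_reply)
print(type(city.population).__name__)  # int
```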
Key advantages of Instructor:
- Type-safe: Output is validated against the Pydantic schema you define, so your code receives typed fields or an error, never an unchecked string
- Automatic retry: If validation fails, Instructor automatically retries with error feedback
- Streaming support: Supports partial streaming for complex objects
- Multi-provider: Supports OpenAI, Anthropic, Google, Mistral, and more
- Custom validation: You can add Pydantic validators for business logic
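The retry and custom-validation points can be illustrated without any network call. The sketch below is not Instructor's implementation; it is a simplified loop, assuming Pydantic v2, that validates a reply and feeds the validation error back on failure. `Product`, `extract_with_retries`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real LLM.

```python
from typing import Callable, List

from pydantic import BaseModel, ValidationError, field_validator


class Product(BaseModel):  # hypothetical schema
    name: str
    price: float

    @field_validator("price")
    @classmethod
    def price_must_be_positive(cls, value: float) -> float:
        # Custom business rule layered on top of the type check.
        if value <= 0:
            raise ValueError("price must be greater than zero")
        return value


def extract_with_retries(
    ask_model: Callable[[List[str]], str], prompt: str, max_retries: int = 2
) -> Product:
    """Call ask_model, validate the reply, and retry with the error appended."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    messages = [prompt]
    for attempt in range(max_retries + 1):
        reply = ask_model(messages)
        try:
            return Product.model_validate_json(reply)
        except ValidationError as exc:
            if attempt == max_retries:
                raise  # out of attempts: surface the validation error
            # Feed the validation error back so the next attempt can fix it.
            messages.append(f"Previous reply was invalid: {exc}")
    raise AssertionError("loop always returns or raises")


def fake_model(messages: List[str]) -> str:
    # Stand-in for an LLM: wrong on the first call, corrected after feedback.
    if len(messages) == 1:
        return '{"name": "Keyboard", "price": -5}'
    return '{"name": "Keyboard", "price": 49.9}'


product = extract_with_retries(fake_model, "Extract the product.")
print(product)  # name='Keyboard' price=49.9
```

The first reply fails the custom `price` rule, the error text is appended to the conversation, and the second reply passes validation.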
Installation
First, install Instructor along with the required dependencies:
pip install instructor openai pydantic
For other providers, install additional dependencies:
# For Anthropic
pip install instructor anthropic
# For Google Gemini
pip install instructor google-generativeai
# For Mistral
pip install instructor mistralai
Make sure you have an API key from the provider you will be using:
export OPENAI_API_KEY="sk-your-api-key-here"
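The OpenAI Python client reads `OPENAI_API_KEY` from the environment when no key is passed explicitly, so a missing variable only shows up once the client is created or called. A small check like the one below can fail earlier with a clearer message. This is a minimal sketch; `has_api_key` is a hypothetical helper, not part of Instructor or the OpenAI SDK.

```python
import os
from typing import Mapping


def has_api_key(
    env: Mapping[str, str] = os.environ, name: str = "OPENAI_API_KEY"
) -> bool:
    """Return True if the variable is set to a non-blank value."""
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; export it before running the examples.")
```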
Basic Usage with OpenAI
Let's start with a simple example: extracting user information from text.
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Patch the OpenAI client with Instructor
client = instructor.from_openai(OpenAI())

# Define the output schema
class UserInfo(BaseModel):
    name: str
    age: int
    email: str

# Extract structured data from text
user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=UserInfo,
    messages=[
        {
            "role": "user",
            "content": "My name is John Smith, I'm 28 years old. "
            "My email is john.smith@email.com",
        }
    ],
)

print(user)
# name='John Smith' age=28 email='john.smith@email.com'
print(user.name)   # John Smith
print(user.age)    # 28
print(user.email)  # john.smith@email.com
Notice that response_model=UserInfo is the key parameter that tells Instructor what schema to expect. The result is not a dictionary or string, but a validated Pydantic object.
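What "validated" means in practice can be seen by running the same schema against plain dictionaries, with no API call. The sketch below assumes Pydantic v2 and redefines `UserInfo` so it runs on its own. The input dictionaries are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class UserInfo(BaseModel):
    name: str
    age: int
    email: str


# Valid input: the numeric string "28" is coerced to the integer 28.
user = UserInfo.model_validate(
    {"name": "John Smith", "age": "28", "email": "john.smith@email.com"}
)
print(user.age + 1)  # 29

# Convert back to a plain dict when another API needs one.
print(user.model_dump())

# Invalid input: a non-numeric age and a missing email are both reported.
try:
    UserInfo.model_validate({"name": "John Smith", "age": "twenty-eight"})
except ValidationError as exc:
    failed = sorted(str(err["loc"][0]) for err in exc.errors())
    print(failed)  # ['age', 'email']
```

Because the fields are typed attributes, editors and type checkers can catch mistakes such as a misspelled field name, which a plain dictionary would only reveal at runtime.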
Complex Pydantic Models
Instructor supports complex Pydantic models including nested models, optional fields, enums, and lists.
from pydantic import BaseModel, Field
from typing import Optional, List
from enum import Enum

class JobLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    LEAD = "lead"

class Skill(BaseModel):
    name: str = Field(description="Name of the skill or technology")
    years_experience: int = Field(
        description="Years of experience", ge=0, le=50
    )
    proficiency: str = Field(
        description="Proficiency level: beginner, intermediate, advanced"
    )

class WorkExperience(BaseModel):
    company: str
    role: str
    duration_months: int = Field(ge=1)
    description: str

class CandidateProfile(BaseModel):
    name: str
    current_role: str
    level: JobLevel
    total_years_experience: int = Field(ge=0)
    skills: List[Skill]
    work_history: List[WorkExperience]
    education: str
    summary: str = Field(
        description="Profile summary of the candidate in 2-3 sentences"
    )

resume_text = """
I'm Sarah Chen, currently working as a Senior Data Engineer at Spotify