Haystack - NLP Framework for Production
Introduction
Haystack is an open-source NLP framework designed for building production-ready search and question-answering systems. Developed by deepset, Haystack provides a modular, pipeline-based architecture that makes it straightforward to combine retrievers, readers, generators, and other NLP components into powerful applications.
With Haystack 2.x, the framework has been redesigned around a flexible pipeline system that supports both extractive and generative AI workflows. Whether you are building a simple document search system or a complex Retrieval-Augmented Generation (RAG) pipeline, Haystack provides the building blocks and the glue to connect them.
This tutorial covers the core concepts of Haystack, walks you through building practical pipelines, shows you how to create custom components, and demonstrates how to deploy your system as a REST API ready for production traffic.
Prerequisites
- Python 3.9 or higher
- Basic understanding of NLP concepts (embeddings, retrieval, generation)
- An OpenAI API key (for generator components)
pip install haystack-ai "sentence-transformers>=2.2.0" "opensearch-haystack" fastapi uvicorn
export OPENAI_API_KEY="sk-your-key-here"
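A missing key often surfaces only when the first generator call fails, so it can help to check for it at startup. The helper below is a minimal sketch: `require_env` is a hypothetical name, not part of Haystack, and it assumes the key is exposed through the `OPENAI_API_KEY` environment variable.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the application")
    return value
```

Calling `require_env("OPENAI_API_KEY")` once when the application starts turns a late, confusing failure into an immediate one.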
Understanding Haystack Architecture
Haystack 2.x is built around three core concepts:
Components
Components are the building blocks of Haystack. Each component performs a specific task such as retrieving documents, generating text, or converting file formats. Components declare their inputs and outputs explicitly, enabling the framework to validate connections at pipeline construction time.
Pipelines
Pipelines connect components into directed graphs. Data flows from one component to the next based on declared input/output connections. Pipelines can be linear or branching, and Haystack 2.x also allows loops.
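To make the idea of declared inputs and outputs concrete, here is a toy model in plain Python. It is not Haystack's implementation: `ToyComponent`, `ToyPipeline`, and their methods are hypothetical names invented for this sketch. It only shows how a pipeline can reject a mismatched connection while the graph is being built, before any data flows.

```python
from dataclasses import dataclass, field


@dataclass
class ToyComponent:
    """Hypothetical stand-in for a component with declared sockets."""
    name: str
    inputs: dict[str, type]   # input socket name -> expected type
    outputs: dict[str, type]  # output socket name -> produced type


@dataclass
class ToyPipeline:
    """Hypothetical graph builder that validates every edge in connect()."""
    components: dict[str, ToyComponent] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)

    def add_component(self, component: ToyComponent) -> None:
        self.components[component.name] = component

    def connect(self, sender: str, receiver: str) -> None:
        # Addresses use the "component.socket" form.
        s_name, s_socket = sender.split(".")
        r_name, r_socket = receiver.split(".")
        out_type = self.components[s_name].outputs[s_socket]
        in_type = self.components[r_name].inputs[r_socket]
        if out_type is not in_type:
            raise TypeError(
                f"cannot connect {sender} ({out_type.__name__}) "
                f"to {receiver} ({in_type.__name__})"
            )
        self.edges.append((sender, receiver))


pipe = ToyPipeline()
pipe.add_component(ToyComponent("cleaner", {"documents": list}, {"documents": list}))
pipe.add_component(ToyComponent("splitter", {"documents": list}, {"documents": list}))
pipe.add_component(ToyComponent("counter", {"text": str}, {"count": int}))

# Accepted: both sockets carry a list.
pipe.connect("cleaner.documents", "splitter.documents")
```

In this sketch, connecting `cleaner.documents` to `counter.text` raises a `TypeError` immediately, which is the kind of early feedback that explicit input and output declarations make possible.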
Document Stores
Document stores are the persistence layer. They store documents and their associated metadata and embeddings, providing retrieval interfaces for the pipeline components.
from haystack import Pipeline, Document
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
# A simple pipeline that cleans and splits documents
preprocessing = Pipeline()
preprocessing.add_component("cleaner", DocumentCleaner())
preprocessing.add_component("splitter", DocumentSplitter(
    split_by="sentence",   # split on sentence boundaries
    split_length=3,        # up to three sentences per chunk
    split_overlap=1,       # repeat one sentence between neighboring chunks
))
preprocessing.connect("cleaner", "splitter")

# Run the pipeline
docs = [
    Document(content="""Haystack is an open-source NLP framework. It provides modular
components for building search systems. The framework supports both extractive and
generative approaches. You can combine multiple components into pipelines.
Pipelines can be deployed as REST APIs.""")
]
result = preprocessing.run({"cleaner": {"documents": docs}})

# Print the first 80 characters of each chunk produced by the splitter
for doc in result["splitter"]["documents"]:
    print(f"  Chunk: {doc.content[:80]}...")
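The effect of a split length of 3 with an overlap of 1 can be illustrated with a sliding window over a list of sentences. The function below is a simplified sketch, not Haystack's implementation: `window_sentences` is a hypothetical helper, and it starts from sentences that are already separated, whereas real sentence detection is more involved. Each chunk holds up to three sentences and begins with the last sentence of the previous chunk.

```python
def window_sentences(sentences: list[str], length: int, overlap: int) -> list[list[str]]:
    """Group sentences into windows of `length` that share `overlap` sentences."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + length])
        if start + length >= len(sentences):
            break  # this window already reached the last sentence
    return chunks


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
chunks = window_sentences(sentences, length=3, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With five sentences this yields two chunks, and the third sentence appears in both. Overlap of this kind keeps context that would otherwise be cut at a chunk boundary.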
Document Stores
Document stores are the backbone of any Haystack application. They persist documents and enable efficient retrieval.
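As a rough mental model of what a document store with embedding retrieval does, the sketch below keeps documents in a dictionary and ranks them by cosine similarity to a query vector. `ToyStore` and its methods are hypothetical, and the three-dimensional vectors are made up for illustration; a real store works with model-generated embeddings and supports metadata filtering.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyStore:
    """Hypothetical in-memory store: id -> (content, embedding)."""

    def __init__(self) -> None:
        self._docs: dict[str, tuple[str, list[float]]] = {}

    def write(self, doc_id: str, content: str, embedding: list[float]) -> None:
        self._docs[doc_id] = (content, embedding)

    def count(self) -> int:
        return len(self._docs)

    def retrieve(self, query_embedding: list[float], top_k: int = 1) -> list[str]:
        # Score every stored document, then keep the best top_k.
        scored = [
            (cosine(query_embedding, emb), content)
            for content, emb in self._docs.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [content for _, content in scored[:top_k]]


toy_store = ToyStore()
toy_store.write("a", "Pipelines connect components.", [1.0, 0.0, 0.0])
toy_store.write("b", "Document stores persist documents.", [0.0, 1.0, 0.0])
top = toy_store.retrieve([0.1, 0.9, 0.0], top_k=1)
print(top)
```

The query vector points mostly along the second axis, so the second document ranks first. Scanning every document like this is fine for small collections; larger ones usually rely on an index.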
InMemoryDocumentStore
Best for development, testing, and small datasets.
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create an in-memory document store that scores embeddings by cosine similarity
store = InMemoryDocumentStore(embedding_similarity_function="cosine")

# Write documents
documents = [
Document(