Haystack Tutorial: NLP Framework for Production


By Ruby Abdullah · tutorial
Tags: Haystack, NLP, RAG, Document Store, Retrieval, Production

Haystack - NLP Framework for Production

Table of Contents

  • Introduction
  • Prerequisites
  • Understanding Haystack Architecture
  • Document Stores
  • Retrievers
  • Readers and Generators
  • Building a RAG Pipeline
  • Custom Components
  • Pipeline Evaluation
  • REST API Deployment
  • Best Practices
  • Conclusion

    Introduction

    Haystack is an open-source NLP framework designed for building production-ready search and question-answering systems. Developed by deepset, Haystack provides a modular, pipeline-based architecture that makes it straightforward to combine retrievers, readers, generators, and other NLP components into powerful applications.

    With Haystack 2.x, the framework has been redesigned around a flexible pipeline system that supports both extractive and generative AI workflows. Whether you are building a simple document search system or a complex Retrieval-Augmented Generation (RAG) pipeline, Haystack provides the building blocks and the glue to connect them.

    This tutorial covers the core concepts of Haystack, walks you through building practical pipelines, shows you how to create custom components, and demonstrates how to deploy your system as a REST API ready for production traffic.


    Prerequisites

    • Python 3.9 or higher
    • Basic understanding of NLP concepts (embeddings, retrieval, generation)
    • An OpenAI API key (for generator components)

    pip install haystack-ai "sentence-transformers>=2.2.0" "opensearch-haystack" fastapi uvicorn
    

    export OPENAI_API_KEY="sk-your-key-here"
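
A missing key tends to surface only when the generator makes its first request, so it can help to check for it at startup. The sketch below is a minimal illustration using only the standard library; the helper name `require_api_key` is hypothetical and is not part of Haystack.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    `require_api_key` is a hypothetical helper for this tutorial, not a Haystack function.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline.")
    return key
```

Calling `require_api_key()` once before building the pipeline turns a late authentication error into an immediate, readable one.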
    


    Understanding Haystack Architecture

    Haystack 2.x is built around three core concepts:

    Components

    Components are the building blocks of Haystack. Each component performs a specific task such as retrieving documents, generating text, or converting file formats. Components declare their inputs and outputs explicitly, enabling the framework to validate connections at pipeline construction time.
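
To make "validate connections at pipeline construction time" concrete, here is a conceptual sketch in plain Python. It is not Haystack's implementation: `ComponentSpec` and `check_connection` are hypothetical names, and the type check is deliberately simplistic (exact type match only).

```python
from dataclasses import dataclass, field


@dataclass
class ComponentSpec:
    """Hypothetical description of a component's declared sockets."""
    name: str
    inputs: dict = field(default_factory=dict)    # input socket name -> expected type
    outputs: dict = field(default_factory=dict)   # output socket name -> produced type


def check_connection(sender, out_socket, receiver, in_socket):
    """Reject a connection before anything runs if the sockets do not line up."""
    if out_socket not in sender.outputs:
        raise ValueError(f"{sender.name} has no output '{out_socket}'")
    if in_socket not in receiver.inputs:
        raise ValueError(f"{receiver.name} has no input '{in_socket}'")
    if sender.outputs[out_socket] is not receiver.inputs[in_socket]:
        raise TypeError(
            f"{sender.name}.{out_socket} does not match {receiver.name}.{in_socket}"
        )


cleaner = ComponentSpec("cleaner", inputs={"documents": list}, outputs={"documents": list})
splitter = ComponentSpec("splitter", inputs={"documents": list}, outputs={"documents": list})
generator = ComponentSpec("generator", inputs={"prompt": str}, outputs={"replies": list})

# A valid wiring passes silently.
check_connection(cleaner, "documents", splitter, "documents")
```

Wiring `cleaner.documents` into `generator.prompt` would raise a `TypeError` here, which is the kind of mistake that explicit input and output declarations let a framework catch before any data is processed.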

    Pipelines

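    Pipelines connect components into a directed graph. Data flows from one component to the next along the declared input/output connections. Pipelines can be linear or branching, and Haystack 2.x pipelines may also contain loops, so they are not strictly limited to directed acyclic graphs (DAGs).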
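
For a pipeline without loops, one way to picture the data flow is as a topological ordering of the component graph: a component can run once everything that feeds it has run. The sketch below uses the standard-library `graphlib` module (Python 3.9+) to show the idea. It is an illustration only and does not describe how Haystack schedules components; the component names `embedder` and `keyword_index` are made up for the example.

```python
from graphlib import TopologicalSorter


def execution_order(connections):
    """Return one valid run order for an acyclic graph of components.

    `connections` is a list of (sender, receiver) name pairs.
    """
    sorter = TopologicalSorter()
    for sender, receiver in connections:
        # The receiver depends on the sender, so the sender must come first.
        sorter.add(receiver, sender)
    # static_order() raises graphlib.CycleError if the graph contains a loop.
    return list(sorter.static_order())


connections = [
    ("cleaner", "splitter"),
    ("splitter", "embedder"),        # first branch (hypothetical component)
    ("splitter", "keyword_index"),   # second branch (hypothetical component)
]
order = execution_order(connections)
print(order)
```

The two branches after `splitter` have no dependency on each other, so either may appear first in the printed order.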

    Document Stores

    Document stores are the persistence layer. They store documents and their associated metadata and embeddings, providing retrieval interfaces for the pipeline components.

    from haystack import Pipeline, Document
    from haystack.components.writers import DocumentWriter
    from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

    # A simple pipeline that cleans and splits documents
    preprocessing = Pipeline()
    preprocessing.add_component("cleaner", DocumentCleaner())
    preprocessing.add_component("splitter", DocumentSplitter(
        split_by="sentence",
        split_length=3,
        split_overlap=1,
    ))

    # Feed the cleaner's output documents into the splitter
    preprocessing.connect("cleaner", "splitter")

    # Run the pipeline
    docs = [
        Document(content="""Haystack is an open-source NLP framework. It provides modular
    components for building search systems. The framework supports both extractive and
    generative approaches. You can combine multiple components into pipelines.
    Pipelines can be deployed as REST APIs.""")
    ]

    result = preprocessing.run({"cleaner": {"documents": docs}})

    for doc in result["splitter"]["documents"]:
        print(f"  Chunk: {doc.content[:80]}...")
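
To see what a split length of 3 with an overlap of 1 means for the five-sentence document above, the windowing can be sketched in plain Python. This is a simplified illustration, not Haystack's splitter: the regex-based `split_sentences` and the `window_chunks` helper are hypothetical, and real sentence detection handles many more edge cases.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def window_chunks(units, length, overlap):
    """Group units into windows of `length`, each sharing `overlap` units with the previous one."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    step = length - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + length])
        if start + length >= len(units):
            break  # the last window already reaches the end of the text
    return chunks


text = (
    "Haystack is an open-source NLP framework. It provides modular "
    "components for building search systems. The framework supports both "
    "extractive and generative approaches. You can combine multiple "
    "components into pipelines. Pipelines can be deployed as REST APIs."
)
sentences = split_sentences(text)
chunks = window_chunks(sentences, length=3, overlap=1)
for chunk in chunks:
    print(" ".join(chunk)[:80])
```

With five sentences, the windows cover sentences 1 to 3 and 3 to 5, so the third sentence appears in both chunks. That shared sentence is what the overlap buys: context that straddles a chunk boundary stays retrievable from either side.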


    Document Stores

    Document stores are the backbone of any Haystack application. They persist documents and enable efficient retrieval.
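
As a rough picture of what embedding-based retrieval from a store involves, the sketch below keeps documents in a dictionary and ranks them by cosine similarity to a query vector. `ToyStore` is a hypothetical class written for this illustration, with hand-made two-dimensional "embeddings"; it is not Haystack's document store API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyStore:
    """Hypothetical in-memory store: document id -> (content, embedding)."""

    def __init__(self):
        self._docs = {}

    def write(self, doc_id, content, embedding):
        self._docs[doc_id] = (content, embedding)

    def search(self, query_embedding, top_k=2):
        """Return the top_k (doc_id, score) pairs, best match first."""
        scored = [
            (doc_id, cosine_similarity(query_embedding, embedding))
            for doc_id, (_, embedding) in self._docs.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


toy = ToyStore()
toy.write("a", "about search", [1.0, 0.0])
toy.write("b", "about cooking", [0.0, 1.0])
toy.write("c", "about both", [1.0, 1.0])

hits = toy.search([1.0, 0.1], top_k=2)
print([doc_id for doc_id, _ in hits])
```

A linear scan like this is fine for a handful of documents. Production stores exist largely because that scan stops being practical as the collection grows.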

    InMemoryDocumentStore

    Best for development, testing, and small datasets.

    from haystack import Document
    from haystack.document_stores.in_memory import InMemoryDocumentStore

    # Create an in-memory document store
    store = InMemoryDocumentStore(embedding_similarity_function="cosine")

    # Write documents
    documents = [
        Document(
