GraphRAG Tutorial: Graph-Based Retrieval Augmented Generation

Introduction

Retrieval Augmented Generation (RAG) has become the standard approach for connecting Large Language Models (LLMs) with external data. However, traditional RAG has a significant limitation: it retrieves text chunks by vector similarity alone and has no model of the relationships between the entities those chunks mention. This is where GraphRAG comes in.

GraphRAG, developed by Microsoft Research, combines the power of knowledge graphs with RAG to produce more comprehensive and contextual answers. Instead of simply searching for text chunks similar to a query, GraphRAG builds a knowledge graph from documents, identifies entities and their relationships, then uses the graph structure to answer questions requiring holistic understanding.

In this tutorial, we will learn how to use Microsoft's graphrag library to build a graph-based RAG system from scratch to production-ready deployment.

Why GraphRAG?

Traditional RAG works well for questions whose answers are contained in one or a few text chunks. However, it struggles with questions that require synthesizing information from many sources, such as:

"What are the main themes discussed across all these documents?"
"How does department A relate to department B in the organization?"
"List all projects involving technology X and person Y"

GraphRAG overcomes these limitations with two query approaches:

Local Search: Finds relevant entities in the graph along with their community context to answer specific questions

Global Search: Uses community summaries from the entire graph to answer questions requiring comprehensive understanding
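The difference between the two approaches can be sketched with a toy graph. This is an illustration only: it uses hand-written data and plain Python, makes no calls to the graphrag library, and the entity names, community assignments, and summaries are all hypothetical. It only shows what kind of context each approach would gather before an LLM is asked to answer.

```python
# Toy illustration only: hand-written graph data, no graphrag calls.
from collections import defaultdict

# Hypothetical relationships as (subject, relation, object) triples.
EDGES = [
    ("Alice", "leads", "Project Atlas"),
    ("Project Atlas", "uses", "Technology X"),
    ("Bob", "works_on", "Project Atlas"),
    ("Carol", "leads", "Project Borealis"),
]

# Hypothetical community assignments and pre-written community summaries.
COMMUNITY_OF = {
    "Alice": 0, "Bob": 0, "Project Atlas": 0, "Technology X": 0,
    "Carol": 1, "Project Borealis": 1,
}
COMMUNITY_SUMMARY = {
    0: "Alice and Bob build Project Atlas on Technology X.",
    1: "Carol leads Project Borealis.",
}


def build_adjacency(edges):
    """Index each edge under both endpoints so either entity can find it."""
    adjacency = defaultdict(list)
    for subject, relation, obj in edges:
        adjacency[subject].append((subject, relation, obj))
        adjacency[obj].append((subject, relation, obj))
    return adjacency


def local_context(entity, adjacency):
    """Local-style context: the entity's own edges plus its community summary."""
    return {
        "edges": adjacency.get(entity, []),
        "community_summary": COMMUNITY_SUMMARY[COMMUNITY_OF[entity]],
    }


def global_context():
    """Global-style context: every community summary, in community order."""
    return [COMMUNITY_SUMMARY[c] for c in sorted(COMMUNITY_SUMMARY)]
```

Calling local_context("Technology X", build_adjacency(EDGES)) returns only the edges touching that entity and the summary of its community, while global_context() returns the summaries of all communities regardless of the question.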

Installation

Prerequisites

Python 3.10 or later
API key from OpenAI or Azure OpenAI (for LLM and embeddings)
Minimum 8GB RAM (indexing process requires significant memory)
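The Python requirement can be checked up front. The following is a minimal sketch using only the standard library; the helper name python_is_supported is hypothetical and is not part of graphrag.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version taken from the prerequisites list


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum.

    Defaults to the running interpreter when no version is passed in.
    """
    major_minor = tuple((version_info or sys.version_info)[:2])
    return major_minor >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit("Python 3.10 or later is required")
```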

Installing the Library

pip install graphrag

For the latest version from the repository:

pip install git+https://github.com/microsoft/graphrag.git

Verify the installation:

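import graphrag  # raises ImportError if the installation did not succeed
from importlib.metadata import version
print(version("graphrag"))  # version of the installed package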

Setting Up API Keys

GraphRAG requires an LLM for entity extraction and summary generation. Set up your OpenAI API key:

export GRAPHRAG_API_KEY="sk-your-openai-api-key"

Or for Azure OpenAI:

export GRAPHRAG_API_KEY="your-azure-api-key"
export GRAPHRAG_API_BASE="https://your-resource.openai.azure.com"
export GRAPHRAG_API_VERSION="2024-06-01"
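Before running anything that calls the LLM, it can help to confirm that the variables are actually visible to Python. The sketch below assumes the variable names are GRAPHRAG_API_KEY, GRAPHRAG_API_BASE, and GRAPHRAG_API_VERSION; the helper missing_settings is hypothetical and not part of graphrag.

```python
import os

# Assumed variable names; adjust if your configuration uses different ones.
OPENAI_VARS = ["GRAPHRAG_API_KEY"]
AZURE_VARS = ["GRAPHRAG_API_KEY", "GRAPHRAG_API_BASE", "GRAPHRAG_API_VERSION"]


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(OPENAI_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```

For an Azure OpenAI setup, pass AZURE_VARS instead of OPENAI_VARS.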

Project Initialization

Creating Project Structure

mkdir graphrag-demo
cd graphrag-demo
python -m graphrag init --root .

This command generates the directory structure:

graphrag-demo/
├── settings.yaml    # Main configuration
├── .env             # Environment variables
└── input/           # Folder for source documents
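To confirm that initialization produced the layout shown above, a short check with pathlib is enough. This is a sketch: the entry names come from the listing above, and the helper missing_project_entries is hypothetical.

```python
from pathlib import Path

# Entries expected in the project root, per the directory listing above.
EXPECTED_ENTRIES = ("settings.yaml", ".env", "input")


def missing_project_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected files or folders that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_project_entries(".")
    print("Project layout OK" if not missing else f"Missing: {missing}")
```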

Preparing Input Documents

Place the text documents you want to index in the input/ folder. GraphRAG supports .txt and .csv files.

mkdir -p input
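Since only .txt and .csv files are supported, it can be useful to see which files in input/ would be skipped. The sketch below is a standalone helper, not part of graphrag; the name split_inputs is hypothetical, and it only inspects the top level of the folder.

```python
from pathlib import Path

# File types the tutorial lists as supported.
SUPPORTED_SUFFIXES = {".txt", ".csv"}


def split_inputs(input_dir, supported=SUPPORTED_SUFFIXES):
    """Return (supported, skipped) file names found directly in `input_dir`."""
    supported_files, skipped_files = [], []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        if path.suffix.lower() in supported:
            supported_files.append(path.name)
        else:
            skipped_files.append(path.name)
    return supported_files, skipped_files


if __name__ == "__main__":
    if Path("input").is_dir():
        ok, skipped = split_inputs("input")
        print("Will be indexed:", ok)
        print("Unsupported type:", skipped)
```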

As an example, we will create sample documents about a fictional technology company:

# create_sample_data.py
import os

documents = {
    "company_overview.txt": """

TechNova Inc. is a technology company founded in 2020 in San Francisco. 
The company focuses on developing artificial intelligence solutions for 
the banking and healthcare sectors. The CEO is James Chen, who previously 
worked at Google for 10 years.

The company has three main divisions: the AI Research Division led by 
Dr. Sarah Williams, the Product Development Division led by Michael Park, 
and the Business Development Division led by Elena Rodriguez.

TechNova Inc. secured Series B funding of $50 million from Sequoia Capital 
and Andreessen Horowitz in 2023. The company currently has 200 employees 
spread across San Francisco, New York, and Austin.
""",
    "projects.txt": """
TechNova Inc.'s main projects include:
