GraphRAG Tutorial: Graph-Based Retrieval Augmented Generation
Introduction
Retrieval Augmented Generation (RAG) has become the standard approach for connecting Large Language Models (LLMs) with external data. However, traditional RAG has a significant limitation: it retrieves text by vector similarity alone and has no model of the relationships between entities in the documents. This is where GraphRAG comes in.
GraphRAG, developed by Microsoft Research, combines the power of knowledge graphs with RAG to produce more comprehensive and contextual answers. Instead of simply searching for text chunks similar to a query, GraphRAG builds a knowledge graph from documents, identifies entities and their relationships, then uses the graph structure to answer questions requiring holistic understanding.
In this tutorial, we will use Microsoft's graphrag library to build a graph-based RAG system, starting from an empty project and working toward a production-ready deployment.
Why GraphRAG?
Traditional RAG works well for questions whose answers are contained in one or a few text chunks. It struggles, however, with questions that require synthesizing information from many sources, such as:
- "What are the main themes discussed across all these documents?"
- "How does department A relate to department B in the organization?"
- "List all projects involving technology X and person Y"
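To see why the third question is relational rather than textual, here is a toy sketch in plain Python. It is not how GraphRAG is implemented; the triples, the entity names, and the helper functions `build_index` and `projects_linked_to` are all hypothetical and exist only to show that the answer comes from intersecting relationships, not from finding one similar chunk.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, invented for illustration only.
TRIPLES = [
    ("Project Alpha", "uses", "Technology X"),
    ("Project Alpha", "involves", "Person Y"),
    ("Project Beta", "uses", "Technology X"),
    ("Project Beta", "involves", "Person Z"),
    ("Project Gamma", "involves", "Person Y"),
]


def build_index(triples):
    """Map each object entity to the set of subjects that point at it."""
    index = defaultdict(set)
    for subject, _relation, obj in triples:
        index[obj].add(subject)
    return index


def projects_linked_to(index, *entities):
    """Return the subjects connected to every one of the given entities."""
    linked = [index.get(entity, set()) for entity in entities]
    return sorted(set.intersection(*linked)) if linked else []


index = build_index(TRIPLES)
print(projects_linked_to(index, "Technology X", "Person Y"))  # ['Project Alpha']
```

In a real corpus the facts about Technology X and Person Y may sit in different documents, so no single chunk contains the answer. A graph makes the join explicit.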
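GraphRAG overcomes these limitations with two query approaches:
- Global search: answers holistic questions about the whole corpus by drawing on summaries of the entity communities found in the graph.
- Local search: answers questions about specific entities by exploring their neighbors and the related text in the graph.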
Installation
Prerequisites
- Python 3.10 or later
- API key from OpenAI or Azure OpenAI (for LLM and embeddings)
- At least 8 GB of RAM (indexing is memory-intensive)
Installing the Library
pip install graphrag
For the latest version from the repository:
pip install git+https://github.com/microsoft/graphrag.git
Verify the installation:
from importlib.metadata import version
# Read the installed package version from its metadata
print(version("graphrag"))
Setting Up API Keys
GraphRAG requires an LLM for entity extraction and summary generation. Set up your OpenAI API key:
export GRAPHRAG_API_KEY="sk-your-openai-api-key"
Or for Azure OpenAI:
export GRAPHRAG_API_KEY="your-azure-api-key"
export GRAPHRAG_API_BASE="https://your-resource.openai.azure.com"
export GRAPHRAG_API_VERSION="2024-06-01"
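A missing or empty variable usually surfaces only partway through indexing, so a quick preflight check can save time. The sketch below assumes the underscore-separated variable names (`GRAPHRAG_API_KEY`, `GRAPHRAG_API_BASE`, `GRAPHRAG_API_VERSION`). The helper `missing_vars` and the two name lists are hypothetical and not part of the graphrag library.

```python
import os

# Assumed variable names, matching the export commands in this section.
OPENAI_VARS = ["GRAPHRAG_API_KEY"]
AZURE_VARS = ["GRAPHRAG_API_KEY", "GRAPHRAG_API_BASE", "GRAPHRAG_API_VERSION"]


def missing_vars(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Demonstration with an explicit dictionary instead of the real environment.
example_env = {"GRAPHRAG_API_KEY": "sk-placeholder"}
print(missing_vars(OPENAI_VARS, example_env))  # []
print(missing_vars(AZURE_VARS, example_env))   # ['GRAPHRAG_API_BASE', 'GRAPHRAG_API_VERSION']
```

Calling `missing_vars(OPENAI_VARS)` with no second argument checks the real process environment.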
Project Initialization
Creating Project Structure
mkdir graphrag-demo
cd graphrag-demo
python -m graphrag init --root .
This command generates the directory structure:
graphrag-demo/
├── settings.yaml # Main configuration
├── .env # Environment variables
└── input/ # Folder for source documents
Preparing Input Documents
Place the text documents you want to index in the input/ folder. GraphRAG supports .txt and .csv files.
mkdir -p input
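Before indexing, it can help to confirm which files in the input folder have one of the extensions mentioned above. The following is a minimal sketch using only the standard library. The helper name `split_input_files` is hypothetical, and it checks file extensions only, not file contents or the input settings in settings.yaml.

```python
from pathlib import Path

# Extensions this tutorial lists as supported input formats.
SUPPORTED_SUFFIXES = {".txt", ".csv"}


def split_input_files(input_dir):
    """Return (supported, other) file names found directly in input_dir."""
    supported, other = [], []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue  # subdirectories are ignored by this check
        target = supported if path.suffix.lower() in SUPPORTED_SUFFIXES else other
        target.append(path.name)
    return supported, other


# Only report when the folder exists, so the script is safe to run anywhere.
if Path("input").is_dir():
    supported, other = split_input_files("input")
    print("supported extensions:", supported)
    print("other files:", other)
```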
As an example, we will create sample documents about a fictional technology company:
# create_sample_data.py
import os
documents = {
"company_overview.txt": """
TechNova Inc. is a technology company founded in 2020 in San Francisco.
The company focuses on developing artificial intelligence solutions for
the banking and healthcare sectors. The CEO is James Chen, who previously
worked at Google for 10 years.
The company has three main divisions: the AI Research Division led by
Dr. Sarah Williams, the Product Development Division led by Michael Park,
and the Business Development Division led by Elena Rodriguez.
TechNova Inc. secured Series B funding of $50 million from Sequoia Capital
and Andreessen Horowitz in 2023. The company currently has 200 employees
spread across San Francisco, New York, and Austin.
""",
"projects.txt": """
TechNova Inc.'s main projects include: