Docling: Smart Document Parsing for AI and RAG Pipelines
In the era of generative AI, the ability to automatically extract information from documents has become crucial. Whether it's annual report PDFs, PowerPoint presentations, Word documents, or HTML web pages, we need tools capable of transforming these formats into structured data ready for AI model consumption. This is where Docling comes in.
Docling is an open-source library from IBM Research designed for intelligent document parsing. It supports various input formats and produces clean Markdown or JSON output, ready for use in AI pipelines, particularly Retrieval-Augmented Generation (RAG).
In this tutorial, we will walk through Docling step by step: installation, basic document conversion, and finally building a complete document ingestion pipeline for RAG.
Prerequisites
Before starting, make sure you have:
- Python 3.10 or later
- pip package manager
- Basic understanding of Python
- (Optional) GPU for faster OCR processing
Installation
Installing Docling is straightforward using pip. There are several installation options depending on your needs.
Basic Installation
pip install docling
Installation with OCR Support
pip install "docling[ocr]"
Full Installation (All Features)
pip install "docling[all]"
Verify Installation
from importlib.metadata import version
import docling  # fails here if the package is not installed
print(f"Docling version: {version('docling')}")
Supported Formats
Docling supports various document formats as input:
| Format | Extension | Description |
|--------|-----------|-------------|
| PDF | .pdf | Including scanned/image PDFs |
| Microsoft Word | .docx | Open XML format |
| Microsoft PowerPoint | .pptx | Open XML format |
| HTML | .html, .htm | Web pages |
| Images | .png, .jpg, .tiff | Via OCR |
| AsciiDoc | .adoc | Documentation format |
| Markdown | .md | Markdown format |
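Before handing a batch of files to the converter, it can help to filter out anything that is not in the table above. The following is a minimal sketch using only the standard library; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical helper names, not part of Docling, and the library performs its own format detection regardless.

```python
from pathlib import Path

# Extensions copied from the table above (illustrative, not read from Docling).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".html", ".htm",
    ".png", ".jpg", ".tiff", ".adoc", ".md",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the table above."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


# Keep only the files whose extension is listed as supported.
candidates = ["report.PDF", "slides.pptx", "notes.txt"]
accepted = [name for name in candidates if is_supported(name)]
print(accepted)
```

The comparison is case-insensitive, so `report.PDF` is accepted while `notes.txt` is skipped.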
Basic Document Conversion
Let's start with simple document conversion. Docling provides an intuitive API through the DocumentConverter class.
Converting PDF to Markdown
from docling.document_converter import DocumentConverter
# Initialize converter
converter = DocumentConverter()
# Convert PDF document
result = converter.convert("annual_report.pdf")
# Get output in Markdown format
markdown_output = result.document.export_to_markdown()
print(markdown_output)
Converting from URL
Docling also supports direct conversion from URLs:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
# Convert from URL
result = converter.convert("https://example.com/document.pdf")
markdown_output = result.document.export_to_markdown()
print(markdown_output)
Converting DOCX
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("projectproposal.docx")
markdown_output = result.document.export_to_markdown()
# Save to file
with open("projectproposal.md", "w", encoding="utf-8") as f:
    f.write(markdown_output)
print("Conversion complete!")
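When converting more than one file, the convert-then-save pattern above can be wrapped in a small helper that derives the output name from the input name. This is a sketch under stated assumptions: `markdown_path_for`, `convert_to_markdown_file`, and the `converted` output directory are hypothetical names chosen for illustration. Docling is imported inside the function so the path logic can be tried without the library installed.

```python
from pathlib import Path


def markdown_path_for(source: str, out_dir: str = "converted") -> Path:
    """Map an input file name to a .md path inside out_dir."""
    return Path(out_dir) / (Path(source).stem + ".md")


def convert_to_markdown_file(source: str, out_dir: str = "converted") -> Path:
    """Convert one document with Docling and save the Markdown next to its peers."""
    # Imported lazily so the path helper works even without Docling installed.
    from docling.document_converter import DocumentConverter

    target = markdown_path_for(source, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    result = DocumentConverter().convert(source)
    target.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return target


# Only the path mapping runs here; no conversion is triggered.
print(markdown_path_for("projectproposal.docx"))
```

Calling `convert_to_markdown_file("projectproposal.docx")` would then perform the same conversion as the snippet above and return the path of the written file.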
Output Formats
Docling supports several output formats that you can choose based on your needs.
Export to Markdown
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to Markdown
md_content = result.document.export_to_markdown()
# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(md_content)
Export to JSON (Document Format)
import json
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to dictionary/JSON
doc_dict = result.document.export_to_dict()
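Once the document is available as a dictionary, it can be written to disk with the standard `json` module. The sketch below assumes the exported dictionary contains only JSON-serializable values; `save_dict_as_json` is a hypothetical helper, and `sample` is placeholder data that does not reflect the real Docling document schema.

```python
import json
import tempfile
from pathlib import Path


def save_dict_as_json(doc_dict: dict, path: Path) -> None:
    """Write a dictionary to disk as pretty-printed UTF-8 JSON."""
    text = json.dumps(doc_dict, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


# Placeholder data standing in for the exported dictionary (NOT the Docling schema).
sample = {"title": "Résumé", "pages": 2}

# Round-trip through a temporary file to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "document.json"
    save_dict_as_json(sample, target)
    restored = json.loads(target.read_text(encoding="utf-8"))

print(restored == sample)
```

Passing `ensure_ascii=False` keeps non-ASCII characters readable in the output file instead of escaping them.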