Docling: Smart Document Parsing for AI and RAG Pipelines

In the era of generative AI, the ability to automatically extract information from documents has become crucial. Whether it's annual report PDFs, PowerPoint presentations, Word documents, or HTML web pages, we need tools capable of transforming these formats into structured data ready for AI model consumption. This is where Docling comes in.

Docling is an open-source library from IBM Research designed for intelligent document parsing. It supports various input formats and produces clean Markdown or JSON output, ready for use in AI pipelines, particularly Retrieval-Augmented Generation (RAG).

This tutorial walks through Docling step by step: installation, basic document conversion, and building a complete document ingestion pipeline for RAG.

Prerequisites

Before starting, make sure you have:

Python 3.10 or later
pip package manager
Basic understanding of Python
(Optional) GPU for faster OCR processing

Installation

Installing Docling is straightforward using pip. There are several installation options depending on your needs.

Basic Installation

pip install docling

Installation with OCR Support

pip install "docling[ocr]"

Full Installation (All Features)

pip install "docling[all]"

Verify Installation

from importlib.metadata import version

import docling  # raises ImportError if the package is not installed

print(f"Docling version: {version('docling')}")

Supported Formats

Docling supports various document formats as input:

| Format | Extension | Description |
|--------|-----------|-------------|
| PDF | .pdf | Including scanned/image PDFs |
| Microsoft Word | .docx | Open XML format |
| Microsoft PowerPoint | .pptx | Open XML format |
| HTML | .html, .htm | Web pages |
| Images | .png, .jpg, .tiff | Via OCR |
| AsciiDoc | .adoc | Documentation format |
| Markdown | .md | Markdown format |
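
Before handing files to a converter, it can help to filter out anything whose extension is not in the table above. The following is a minimal sketch using only the standard library. The `SUPPORTED_EXTENSIONS` set and the `is_supported` helper are hypothetical names that simply mirror the table; they are not part of Docling, and the library's real format detection may differ.

```python
from pathlib import Path

# Extensions copied from the table above; not an official Docling list.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".html", ".htm",
    ".png", ".jpg", ".tiff", ".adoc", ".md",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the table above."""
    # Lowercase the suffix so "slides.PPTX" is treated like "slides.pptx".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ["annualreport.pdf", "slides.PPTX", "notes.txt"]:
        print(name, is_supported(name))
```

Checking the extension is only a cheap first filter: a file with a supported extension can still fail to parse.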

Basic Document Conversion

Let's start with simple document conversion. Docling provides an intuitive API through the DocumentConverter class.

Converting PDF to Markdown

from docling.document_converter import DocumentConverter

# Initialize converter
converter = DocumentConverter()

# Convert PDF document
result = converter.convert("annualreport.pdf")

# Get output in Markdown format
markdown_output = result.document.export_to_markdown()

print(markdown_output)

Converting from URL

Docling also supports direct conversion from URLs:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert from URL
result = converter.convert("https://example.com/document.pdf")

markdown_output = result.document.export_to_markdown()
print(markdown_output)
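
Because the same `convert` call accepts both local paths and URLs, an ingestion script may want to tell the two apart, for example to check that a local file exists before converting it. The sketch below is one way to do that with the standard library. The `is_url` helper is a hypothetical name, not a Docling function, and it only recognizes `http` and `https` sources.

```python
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True if `source` looks like an http(s) URL rather than a local path."""
    parsed = urlparse(source)
    # Require both a web scheme and a host, so "http://" alone is rejected.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for source in ["https://example.com/document.pdf", "annualreport.pdf"]:
        kind = "URL" if is_url(source) else "local path"
        print(f"{source}: {kind}")
```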

Converting DOCX

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

result = converter.convert("projectproposal.docx")

markdown_output = result.document.export_to_markdown()

# Save to file
with open("projectproposal.md", "w", encoding="utf-8") as f:
    f.write(markdown_output)

print("Conversion complete!")
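
The convert-then-save pattern above extends naturally to several files. The following is a rough sketch of a batch helper, not an official Docling utility. `convert_all` is a hypothetical name, and it assumes the converter object has a `convert(source)` method whose result exposes `document.export_to_markdown()`, as in the examples in this section. Passing the converter in as an argument also makes the helper easy to exercise with a stand-in object.

```python
from pathlib import Path
from typing import Iterable, List


def convert_all(converter, sources: Iterable[str], out_dir: str) -> List[Path]:
    """Convert each source to Markdown and write <stem>.md into out_dir.

    `converter` is assumed to behave like the DocumentConverter used in
    this section: convert(source) returns a result whose document has an
    export_to_markdown() method.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for source in sources:
        result = converter.convert(source)
        # "reports/q1.pdf" becomes "<out_dir>/q1.md"
        out_path = target / (Path(source).stem + ".md")
        out_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(out_path)
    return written
```

With Docling installed, a call might look like `convert_all(DocumentConverter(), ["projectproposal.docx"], "markdown_out")`. Note that two sources sharing the same file stem would overwrite each other in this sketch.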

Output Formats

Docling supports several output formats that you can choose based on your needs.

Export to Markdown

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to Markdown
md_content = result.document.export_to_markdown()

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(md_content)

Export to JSON (Document Format)

import json
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to dictionary/JSON
doc_dict = result.document.export_to_dict()
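
Once the document is available as a dictionary, writing it to disk is plain standard-library work. The sketch below shows one way to do it, assuming the dictionary contains only JSON-serializable values. The `save_as_json` helper and the `output.json` file name are hypothetical and not part of Docling.

```python
import json
from pathlib import Path


def save_as_json(data: dict, path: str) -> None:
    """Write a dictionary to `path` as pretty-printed UTF-8 JSON."""
    # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes.
    text = json.dumps(data, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
```

For the dictionary exported above, the call would be along the lines of `save_as_json(doc_dict, "output.json")`.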
