Docling: Smart Document Parsing for AI and RAG Pipelines

By Ruby Abdullah · tutorial
Tags: Docling, Document Parsing, PDF, RAG, Python

In the era of generative AI, the ability to automatically extract information from documents has become crucial. Whether it's annual report PDFs, PowerPoint presentations, Word documents, or HTML web pages, we need tools capable of transforming these formats into structured data ready for AI model consumption. This is where Docling comes in.

Docling is an open-source library from IBM Research designed for intelligent document parsing. It supports various input formats and produces clean Markdown or JSON output, ready for use in AI pipelines, particularly Retrieval-Augmented Generation (RAG).

This tutorial walks through Docling from installation and basic document conversion to building a complete document ingestion pipeline for RAG.

Prerequisites

Before starting, make sure you have:

  • Python 3.10 or later
  • pip package manager
  • Basic understanding of Python
  • (Optional) GPU for faster OCR processing

Installation

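Docling installs with pip. Pick the variant below that matches your needs. The set of optional extras can differ between Docling releases, so check the installation docs for your version if pip warns about an unknown extra.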

Basic Installation

pip install docling

Installation with OCR Support

pip install "docling[ocr]"

Full Installation (All Features)

pip install "docling[all]"

Verify Installation

from importlib.metadata import version

# Read the installed version from the package metadata
print(f"Docling version: {version('docling')}")
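If you prefer a check that does not raise when the package is missing, for example at the start of a setup script, a minimal sketch using only the standard library could look like this. The helper name `installed_version` is hypothetical and not part of Docling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("docling")
    print(f"Docling version: {found}" if found else "Docling is not installed")
```

Returning `None` instead of raising lets the caller decide whether a missing installation is fatal.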

Supported Formats

Docling supports various document formats as input:

| Format | Extension | Description |
|--------|-----------|-------------|
| PDF | .pdf | Including scanned/image PDFs |
| Microsoft Word | .docx | Open XML format |
| Microsoft PowerPoint | .pptx | Open XML format |
| HTML | .html, .htm | Web pages |
| Images | .png, .jpg, .tiff | Via OCR |
| AsciiDoc | .adoc | Documentation format |
| Markdown | .md | Markdown format |
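Before handing a batch of files to the converter, it can help to filter out anything the table does not list. The sketch below is plain Python and does not call Docling. The set of extensions is copied from the table above, and `is_supported` and `split_by_support` are hypothetical helper names.

```python
from pathlib import Path

# Extensions copied from the table above. This is an assumption about what to
# accept: the Docling version you have installed may support more or fewer.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".html", ".htm",
    ".png", ".jpg", ".tiff", ".adoc", ".md",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(paths):
    """Partition paths into (supported, unsupported) lists, preserving order."""
    supported, unsupported = [], []
    for p in paths:
        (supported if is_supported(p) else unsupported).append(p)
    return supported, unsupported
```

Checking the extension is only a cheap first filter. It says nothing about whether the file content is valid.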

Basic Document Conversion

Let's start with simple document conversion. Docling provides an intuitive API through the DocumentConverter class.

Converting PDF to Markdown

from docling.document_converter import DocumentConverter

# Initialize the converter
converter = DocumentConverter()

# Convert the PDF document
result = converter.convert("annualreport.pdf")

# Get the output in Markdown format
markdown_output = result.document.export_to_markdown()
print(markdown_output)

Converting from URL

Docling also supports direct conversion from URLs:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert from a URL
result = converter.convert("https://example.com/document.pdf")

markdown_output = result.document.export_to_markdown()
print(markdown_output)

Converting DOCX

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("projectproposal.docx")
markdown_output = result.document.export_to_markdown()

# Save to file
with open("projectproposal.md", "w", encoding="utf-8") as f:
    f.write(markdown_output)

print("Conversion complete!")
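The same convert-then-save pattern extends naturally to a whole folder. The following is a sketch under stated assumptions, not a Docling feature. `convert_folder` and `make_docling_converter` are hypothetical names, and the conversion step is passed in as a function so the loop can be tried without Docling installed. Only `DocumentConverter`, `convert`, and `export_to_markdown` come from the examples above.

```python
from pathlib import Path
from typing import Callable, List, Sequence


def make_docling_converter() -> Callable[[Path], str]:
    """Build a path -> Markdown function backed by one shared DocumentConverter."""
    from docling.document_converter import DocumentConverter  # imported lazily

    converter = DocumentConverter()
    return lambda path: converter.convert(path).document.export_to_markdown()


def convert_folder(
    input_dir,
    output_dir,
    to_markdown: Callable[[Path], str],
    patterns: Sequence[str] = ("*.pdf", "*.docx"),
) -> List[Path]:
    """Convert matching files in input_dir and write one .md file per input.

    Returns the written paths. A file that fails to convert is reported and skipped.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pattern in patterns:
        for source in sorted(in_dir.glob(pattern)):
            try:
                markdown = to_markdown(source)
            except Exception as exc:  # keep going if one document fails
                print(f"Skipped {source.name}: {exc}")
                continue
            target = out_dir / f"{source.stem}.md"
            target.write_text(markdown, encoding="utf-8")
            written.append(target)
    return written
```

With Docling installed, a call such as `convert_folder("docs", "markdown", make_docling_converter())` would reuse a single converter for every file. Output names are derived from the file stem, so two inputs that differ only by extension would overwrite each other.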

Output Formats

Docling supports several output formats that you can choose based on your needs.

Export to Markdown

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to Markdown
md_content = result.document.export_to_markdown()

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(md_content)
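For RAG, the exported Markdown is usually split into smaller pieces before embedding. One simple approach, sketched here in plain Python, is to cut the text at each heading. This is an illustrative stand-in and not Docling's own chunking API. The function name `split_markdown_by_heading` is hypothetical.

```python
import re

# An ATX heading: 1 to 6 '#' characters, whitespace, then the heading text
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(markdown: str):
    """Split Markdown into sections, one per heading.

    Returns a list of {"heading": str, "text": str} dicts. Text that appears
    before the first heading is kept under an empty heading.
    """
    sections = []
    heading, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the section collected so far, then start a new one
            body = "\n".join(lines).strip()
            if heading or body:
                sections.append({"heading": heading, "text": body})
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    body = "\n".join(lines).strip()
    if heading or body:
        sections.append({"heading": heading, "text": body})
    return sections
```

Each section keeps its heading, which can be stored as metadata next to the embedding. The sketch does not track fenced code blocks, so a `#` comment at the start of a line inside a code block would be treated as a heading.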

Export to JSON (Document Format)

import json

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to dictionary/JSON
doc_dict = result.document.export_to_dict()
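Once you have the dictionary, writing it to disk is a job for the standard `json` module. The sketch below assumes the exported dictionary contains only JSON-serializable values. `save_json` and `load_json` are hypothetical helper names.

```python
import json
from pathlib import Path


def save_json(doc_dict: dict, path: str) -> None:
    """Write a dictionary to a UTF-8 JSON file, keeping non-ASCII text readable."""
    Path(path).write_text(
        json.dumps(doc_dict, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def load_json(path: str) -> dict:
    """Read a JSON file written by save_json back into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

In the example above, the call would be `save_json(doc_dict, "output.json")`, where the output file name is only an illustration. Setting `ensure_ascii=False` keeps text such as Indonesian or other non-English content readable in the file instead of escaping it.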
