Firecrawl: Web Scraping API for AI and LLM Applications

By Ruby Abdullah · tutorial
Tags: Firecrawl, Web Scraping, API, RAG, Python

Firecrawl is a managed web scraping API designed specifically for AI and LLM needs. Unlike traditional web scrapers, Firecrawl produces output in clean, structured markdown format, ready for use in RAG (Retrieval-Augmented Generation) pipelines, fine-tuning datasets, or knowledge base construction.

This article covers how to use Firecrawl for various web scraping scenarios in the context of AI applications, from single page scraping to building a complete knowledge base ingestion pipeline.

Why Firecrawl?

Web scraping for AI purposes poses its own challenges. Traditional tools such as BeautifulSoup or Scrapy hand you raw HTML that must be cleaned manually, and many modern pages render their content with JavaScript, which requires browser automation. Firecrawl addresses these problems:

  • Clean markdown output: Content arrives in a format that LLMs handle well
  • JavaScript rendering: Handles SPAs and dynamic content automatically
  • Structured data extraction: Extracts structured data using a built-in LLM
  • Website crawling: Automatically explore entire websites by following links
  • Managed infrastructure: No need to manage proxies, browsers, or servers
  • Automatic rate limiting: Respects robots.txt and rate limits
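
To make the first point concrete, here is a rough sketch of the manual cleanup a traditional pipeline needs before page text can go to an LLM. It uses only Python's standard-library `html.parser`. The `TextExtractor` class and `html_to_text` helper are illustrative names, not part of Firecrawl or any scraping library, and a real page would need far more handling (navigation, ads, tables, JavaScript-rendered content).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><script>track()</script>"
    "<p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```

Even this small version loses the heading level and the bold emphasis, which is the kind of structure a markdown conversion keeps.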

Installation and Setup

SDK Installation

pip install firecrawl-py

API Key Setup

Register at firecrawl.dev to get an API key:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY_HERE")

Or set it as an environment variable:

export FIRECRAWL_API_KEY="fc-YOUR_API_KEY_HERE"

import os

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY"))
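
`os.getenv` returns `None` when the variable is missing, so a forgotten `export` only surfaces later as an authentication error. A small guard can fail earlier with a clearer message. This is a minimal sketch: `load_api_key` is a hypothetical helper name, not part of the Firecrawl SDK.

```python
import os


def load_api_key(env_var: str = "FIRECRAWL_API_KEY") -> str:
    """Read the API key from the environment and fail fast if it is missing."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before creating the client"
        )
    return key
```

The client can then be created with `FirecrawlApp(api_key=load_api_key())`.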

Scraping a Single Page

Markdown Output

Firecrawl converts web pages into clean markdown:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a single page
result = app.scrape_url(
    "https://docs.python.org/3/tutorial/classes.html",
    params={"formats": ["markdown"]},
)

# Access markdown content
markdown_content = result.get("markdown", "")
print(f"Content length: {len(markdown_content)} characters")
print(markdown_content[:500])

# Page metadata
metadata = result.get("metadata", {})
print(f"Title: {metadata.get('title')}")
print(f"Description: {metadata.get('description')}")
print(f"Language: {metadata.get('language')}")
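
Since the markdown is meant to feed a RAG pipeline, a common next step is to split it into sections before embedding. The sketch below is one possible approach, not something the Firecrawl SDK provides: `split_markdown_by_heading` is a hypothetical helper that cuts at ATX headings (`#` to `######`) and ignores `#` lines inside fenced code blocks, which matters for pages full of Python comments.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the listing readable


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split markdown into chunks, one per heading, keeping the heading text."""
    chunks = []
    buffer = []
    heading = ""  # text before the first heading gets an empty heading
    in_fence = False

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "\n".join([
    "# Classes",
    "Intro text.",
    "",
    "## Scopes",
    FENCE + "python",
    "# a comment, not a heading",
    "x = 1",
    FENCE,
    "More text.",
])

for chunk in split_markdown_by_heading(sample):
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

Passing `markdown_content` from the scrape above instead of `sample` would yield one chunk per section of the page. Long sections may still need a size-based split before embedding.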

HTML Output

If you need raw HTML:

result = app.scrape_url(
    "https://example.com/article",
    params={"formats": ["html", "markdown"]},
)

html_content = result.get("html", "")
markdown_content = result.get("markdown", "")

print(f"HTML length: {len(html_content)}")
print(f"Markdown length: {len(markdown_content)}")

Screenshot

Firecrawl can also capture page screenshots:

import requests

result = app.scrape_url(
    "https://example.com",
    params={"formats": ["screenshot"]},
)

screenshot_url = result.get("screenshot")
print(f"Screenshot URL: {screenshot_url}")

# Download screenshot
if screenshot_url:
    response = requests.get(screenshot_url, timeout=30)
    response.raise_for_status()  # stop here on an HTTP error
    with open("screenshot.png", "wb") as f:
        f.write(response.content)
    print("Screenshot saved!")

Crawling an Entire Website

Firecrawl can explore an entire website by automatically following links:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Crawl website with configuration
crawl_result = app.crawl_url(
    "https://docs.python.org/3/tutorial/",
    params={
        "limit": 50,  # Maximum 50 pages
        "maxDepth": 3,  # Maximum link depth
        "includePaths": ["/3/tutorial/"],  # Only specific paths
        "excludePaths": ["/3/tutorial/index.html"],
        "scrapeOptions": {
            "formats": ["markdown"]
        }
    },
)
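
Once a crawl finishes, the pages usually need to be flattened into simple records before they go into a knowledge base. The sketch below assumes the crawl result is a dict whose `data` key holds one entry per page, each with `markdown` and a `metadata` dict containing `sourceURL` and `title`. That shape is an assumption based on the dict-style responses used in this article and may differ between SDK versions, so check an actual response first. `collect_documents` is a hypothetical helper, and the sample data is made up for illustration.

```python
def collect_documents(crawl_result: dict) -> list[dict]:
    """Flatten crawled pages into url/title/markdown records, skipping empty pages."""
    documents = []
    for page in crawl_result.get("data", []):
        markdown = (page.get("markdown") or "").strip()
        if not markdown:
            continue  # nothing useful to index
        metadata = page.get("metadata") or {}
        documents.append(
            {
                "url": metadata.get("sourceURL", ""),  # assumed metadata key
                "title": metadata.get("title", ""),
                "markdown": markdown,
            }
        )
    return documents


# Made-up stand-in for a crawl response, used only to exercise the helper.
fake_result = {
    "data": [
        {
            "markdown": "# Classes\nSome text.",
            "metadata": {"sourceURL": "https://example.com/a", "title": "Classes"},
        },
        {"markdown": "", "metadata": {"sourceURL": "https://example.com/empty"}},
        {"markdown": "# Modules\nMore text."},
    ]
}

for doc in collect_documents(fake_result):
    print(doc["url"] or "(no url)", "-", len(doc["markdown"]), "characters")
```

Keeping the source URL next to each document lets a RAG answer cite the page it came from.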
