Firecrawl: Web Scraping API for AI and LLM Applications
Firecrawl is a managed web scraping API built for AI and LLM workloads. Unlike traditional web scrapers, Firecrawl returns clean, structured markdown that is ready for use in RAG (Retrieval-Augmented Generation) pipelines, fine-tuning datasets, or knowledge base construction.
This article covers how to use Firecrawl for various web scraping scenarios in the context of AI applications, from single page scraping to building a complete knowledge base ingestion pipeline.
Why Firecrawl?
Web scraping for AI purposes has unique challenges. Traditional scraping tools like BeautifulSoup or Scrapy produce raw HTML that needs manual cleaning. Modern pages often use JavaScript rendering that requires browser automation. Firecrawl solves these problems:
- Clean markdown output: Content directly in a format that LLMs understand
- JavaScript rendering: Handles SPAs and dynamic content automatically
- Structured data extraction: Extracts structured data using a built-in LLM
- Website crawling: Automatically explore entire websites by following links
- Managed infrastructure: No need to manage proxies, browsers, or servers
- Automatic rate limiting: Respects robots.txt and rate limits
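To make the "manual cleaning" point concrete, here is a minimal sketch of what stripping raw HTML by hand involves, using only Python's standard library. The `TextExtractor` class, the `html_to_text` helper, and the sample HTML are illustrative names invented for this example; they are not part of Firecrawl or of any scraping library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = """
<html><head><style>body { color: red; }</style></head>
<body><h1>Title</h1><script>console.log("x")</script><p>Hello <b>world</b></p></body></html>
"""
print(html_to_text(sample_html))
```

Even in this small case the heading level and the inline emphasis are lost, and the sentence is split across lines. Preserving that structure is the part a markdown-producing scraper handles for you.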
Installation and Setup
SDK Installation
pip install firecrawl-py
API Key Setup
Register at firecrawl.dev to get an API key:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY_HERE")
Or set it as an environment variable:
export FIRECRAWL_API_KEY="fc-YOUR_API_KEY_HERE"
import os
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY"))
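Because `os.getenv` returns `None` when the variable is unset, a missing key would otherwise only surface later as an authentication error. The following is a small sketch of a fail-fast loader; `load_api_key` is a hypothetical helper written for this article, not an SDK function, and the variable name is passed in as a parameter so it can be adapted.

```python
import os


def load_api_key(env_var="FIRECRAWL_API_KEY"):
    """Read the API key from the environment, failing early if it is missing."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (assumes the variable has been exported in the shell):
# app = FirecrawlApp(api_key=load_api_key())
```

Failing at startup keeps the error message close to its cause, which is easier to debug than a rejected request deep inside a pipeline.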
Scraping a Single Page
Markdown Output
Firecrawl converts web pages into clean markdown:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Scrape a single page
result = app.scrape_url(
    "https://docs.python.org/3/tutorial/classes.html",
    params={"formats": ["markdown"]}
)
# Access markdown content
markdown_content = result.get("markdown", "")
print(f"Content length: {len(markdown_content)} characters")
print(markdown_content[:500])
# Page metadata
metadata = result.get("metadata", {})
print(f"Title: {metadata.get('title')}")
print(f"Description: {metadata.get('description')}")
print(f"Language: {metadata.get('language')}")
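Since the markdown is meant to feed a RAG pipeline, a common next step is to split it into chunks before embedding. The sketch below shows one simple approach, splitting at headings and then capping chunk size. The function name `split_markdown_by_heading`, the `max_chars` parameter, and the chunk dictionary shape are assumptions made for this example and are not part of Firecrawl.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown, max_chars=1000):
    """Split markdown at headings, then slice long sections into pieces.

    Limitation: lines starting with "#" inside fenced code blocks are
    also treated as headings in this simplified version.
    """
    sections = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((title, lines))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    sections.append((title, lines))

    chunks = []
    for section_title, section_lines in sections:
        body = "\n".join(section_lines).strip()
        # Empty sections produce no chunks because the range is empty.
        for start in range(0, len(body), max_chars):
            chunks.append({"title": section_title, "text": body[start:start + max_chars]})
    return chunks


sample = "# Intro\nHello there.\n\n## Details\nabcdefghij\n"
for chunk in split_markdown_by_heading(sample, max_chars=100):
    print(chunk)
```

Keeping the heading alongside each chunk gives the retriever some context about where the text came from. Fixed-size slicing is the crudest possible cap; sentence-aware or token-aware splitting is usually a better fit in production.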
HTML Output
If you need raw HTML:
result = app.scrape_url(
    "https://example.com/article",
    params={"formats": ["html", "markdown"]}
)
html_content = result.get("html", "")
markdown_content = result.get("markdown", "")
print(f"HTML length: {len(html_content)}")
print(f"Markdown length: {len(markdown_content)}")
Screenshot
Firecrawl can also capture page screenshots:
import requests
result = app.scrape_url(
    "https://example.com",
    params={"formats": ["screenshot"]}
)
screenshot_url = result.get("screenshot")
print(f"Screenshot URL: {screenshot_url}")
# Download the screenshot
if screenshot_url:
    response = requests.get(screenshot_url, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors
    with open("screenshot.png", "wb") as f:
        f.write(response.content)
    print("Screenshot saved!")
Crawling an Entire Website
Firecrawl can explore an entire website by automatically following links:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Crawl the website with configuration
crawl_result = app.crawl_url(
    "https://docs.python.org/3/tutorial/",
    params={
        "limit": 50,  # Maximum 50 pages
        "maxDepth": 3,  # Maximum link depth
        "includePaths": ["/3/tutorial/"],  # Only specific paths
        "excludePaths": ["/3/tutorial/index.html"],
        "scrapeOptions": {
            "formats": ["markdown"]
        }
    },