Tutorial 9: Scrapy + BeautifulSoup - Data Collection for ML

Introduction

Prerequisites

BeautifulSoup Basics

Scrapy Fundamentals

Building Scrapy Spiders

Items and Pipelines

Middleware Configuration

Handling Pagination and JavaScript

Data Cleaning Pipeline

Storing Data to CSV, JSON, and Database

Ethical Scraping and Rate Limiting

Building an ML Dataset from Web Data

Best Practices

Conclusion

Introduction

Machine learning models are only as good as the data they are trained on. While many curated datasets exist, real-world ML projects often require collecting custom data from the web. Whether you need product reviews for sentiment analysis, news articles for text classification, or job listings for market analysis, web scraping is an essential skill for ML practitioners.

This tutorial covers two of the most powerful Python tools for web scraping: BeautifulSoup for parsing HTML and Scrapy for building scalable, production-grade web crawlers. You will learn how to extract structured data from websites, clean it, store it in various formats, and transform it into ML-ready datasets, all while respecting ethical scraping guidelines.

Prerequisites

Python 3.9 or higher
Basic understanding of HTML structure and CSS selectors
Familiarity with Python data structures

Install the required packages:

pip install scrapy beautifulsoup4 lxml requests pandas sqlalchemy
pip install scrapy-playwright  # For JavaScript-rendered pages

BeautifulSoup Basics

Parsing HTML Documents

from bs4 import BeautifulSoup
import requests

# Fetch a web page
url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.text, "lxml")

# Print the page title
print(f"Page title: {soup.title.string}")

Navigating the DOM Tree

# Find elements by tag
all_divs = soup.find_all("div")
print(f"Found {len(all_divs)} div elements")

# Find by class name (class_ avoids the reserved word "class")
quotes = soup.find_all("div", class_="quote")
print(f"Found {len(quotes)} quotes on the page")

# Find by ID (returns None if no element has this id)
specific = soup.find(id="main-content")

# Find by CSS selector
authors = soup.select("small.author")
for author in authors:
    print(f"Author: {author.get_text()}")
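The lookups above assume that every element exists. `find` and `select_one` return `None` when nothing matches, so calling `.get_text()` on the result raises `AttributeError`. Below is a minimal sketch of a defensive helper. The inline HTML and the name `text_or_default` are made up for illustration, and the built-in `html.parser` is used so the snippet runs without lxml or network access.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Jane Doe</small>
</div>
<div class="quote">
  <span class="text">Second quote</span>
</div>
"""


def text_or_default(parent, selector, default=""):
    """Return the stripped text of the first CSS match, or a default value."""
    node = parent.select_one(selector)  # None when nothing matches
    return node.get_text(strip=True) if node is not None else default


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_authors = [
    text_or_default(quote, "small.author", default="Unknown")
    for quote in sample_soup.find_all("div", class_="quote")
]
print(sample_authors)
```

The second quote has no author element, so the helper falls back to the default instead of raising an exception.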

Extracting Data

# Extract structured data from each quote
data = []
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").get_text()
    author = quote.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote.find_all("a", class_="tag")]

    data.append({
        "text": text,
        "author": author,
        "tags": tags,
    })

# Preview the first three records
for item in data[:3]:
    print(f'"{item["text"][:60]}..." - {item["author"]}')
    print(f"  Tags: {', '.join(item['tags'])}")
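Records shaped like the `data` list above contain a nested list of tags, which does not map directly onto a flat table. One possible way to flatten them, sketched here with the standard library only, is to join the tags into a single delimited field before writing CSV. The sample records, the function name `records_to_csv`, and the `|` separator are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical records in the same shape as the `data` list built above.
records = [
    {"text": "First quote", "author": "Jane Doe", "tags": ["life", "humor"]},
    {"text": "Second quote", "author": "John Roe", "tags": []},
]


def records_to_csv(rows, tag_separator="|"):
    """Serialize quote records to CSV text, joining each tag list into one field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["text", "author", "tags"], lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "text": row["text"],
            "author": row["author"],
            "tags": tag_separator.join(row["tags"]),
        })
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of file paths; the same writer works on a file opened with `newline=""`.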

Working with Attributes and Links

# Extract all links from a page
links = []
for a_tag in soup.find_all("a", href=True):
    href = a_tag["href"]
    link_text = a_tag.get_text(strip=True)
    links.append({"url": href, "text": link_text})

# Extract image sources
images = []
for img in soup.find_all("img"):
    src = img.get("src", "")
    alt = img.get("alt", "No alt text")
    images.append({"src": src, "alt": alt})

# Extract data attributes
elements = soup.find_all(attrs={"data-category": True})
for el in elements:
    print(f"Category: {el['data-category']}")
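The `href` values collected above are often relative paths, which cannot be requested on their own. A short sketch of resolving them against the page URL with the standard library's `urljoin` follows. The sample hrefs and the helper names `absolute_links` and `same_site` are hypothetical, and skipping fragment-only anchors is a choice made for this example rather than a requirement.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://quotes.toscrape.com/"  # page the hrefs were scraped from

# Hypothetical href values, as they might appear in the `links` list above.
raw_hrefs = [
    "/author/Albert-Einstein",
    "tag/life/",
    "https://www.goodreads.com/quotes",
    "#top",
]


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping fragment-only anchors."""
    resolved = []
    for href in hrefs:
        if href.startswith("#"):
            continue  # same-page anchor, nothing new to fetch
        resolved.append(urljoin(base_url, href))
    return resolved


def same_site(base_url, url):
    """Return True when url points at the same host as base_url."""
    return urlparse(url).netloc == urlparse(base_url).netloc


full_links = absolute_links(BASE_URL, raw_hrefs)
internal_links = [u for u in full_links if same_site(BASE_URL, u)]
print(internal_links)
```

Separating internal from external links is useful when a crawl should stay on one site.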

