Tutorial 9: Scrapy + BeautifulSoup - Data Collection for ML
Introduction
Machine learning models are only as good as the data they are trained on. While many curated datasets exist, real-world ML projects often require collecting custom data from the web. Whether you need product reviews for sentiment analysis, news articles for text classification, or job listings for market analysis, web scraping is an essential skill for ML practitioners.
This tutorial covers two of the most powerful Python tools for web scraping: BeautifulSoup for parsing HTML and Scrapy for building scalable, production-grade web crawlers. You will learn how to extract structured data from websites, clean it, store it in various formats, and transform it into ML-ready datasets, all while respecting ethical scraping guidelines.
Prerequisites
- Python 3.9 or higher
- Basic understanding of HTML structure and CSS selectors
- Familiarity with Python data structures
Install the required packages:
pip install scrapy beautifulsoup4 lxml requests pandas sqlalchemy
pip install scrapy-playwright # For JavaScript-rendered pages
BeautifulSoup Basics
Parsing HTML Documents
from bs4 import BeautifulSoup
import requests
# Fetch a web page
url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on HTTP errors (4xx/5xx)
soup = BeautifulSoup(response.text, "lxml")
# Print the page title
print(f"Page title: {soup.title.string}")
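One detail worth guarding against: `soup.title` is `None` when a document has no `<title>` element, so `soup.title.string` raises an `AttributeError` on such pages. The following is a minimal sketch of a defensive lookup. The helper name `page_title` and its `default` parameter are illustrative assumptions, not part of BeautifulSoup, and the built-in `"html.parser"` is used here only so the example runs without `lxml` installed.

```python
from bs4 import BeautifulSoup


def page_title(html: str, default: str = "") -> str:
    """Return the stripped <title> text, or `default` if it is missing or empty.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when there is no <title>; .string is None when it is empty
    if soup.title is None or soup.title.string is None:
        return default
    return soup.title.string.strip()


with_title = page_title("<html><head><title> Quotes </title></head></html>")
without_title = page_title("<p>No title here</p>", default="untitled")
```

The same pattern (check for `None` before dereferencing) applies to any `find` call whose target may be absent.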
Navigating the DOM Tree
# Find elements by tag
all_divs = soup.find_all("div")
print(f"Found {len(all_divs)} div elements")
# Find by class name (class_ avoids Python's reserved word "class")
quotes = soup.find_all("div", class_="quote")
print(f"Found {len(quotes)} quotes on the page")
# Find by ID (find returns None when nothing matches)
specific = soup.find(id="main-content")
# Find by CSS selector
authors = soup.select("small.author")
for author in authors:
    print(f"Author: {author.get_text()}")
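To see how these lookup methods differ without depending on a live site, here is a small offline sketch. The inline HTML is invented for illustration, and `"html.parser"` is used so the snippet runs without `lxml`; the calls themselves (`find_all`, `find`, `select`, `select_one`) are standard BeautifulSoup methods.

```python
from bs4 import BeautifulSoup

# Invented sample markup, loosely modeled on a quotes page
HTML = """
<div id="main">
  <div class="quote"><span class="text">First</span><small class="author">Ann</small></div>
  <div class="quote"><span class="text">Second</span><small class="author">Bob</small></div>
  <div class="footer">Footer</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all returns every match in document order (the outer div counts too)
all_divs = soup.find_all("div")

# class_ filters by CSS class
quotes = soup.find_all("div", class_="quote")

# select takes a CSS selector and returns a list
authors = [el.get_text() for el in soup.select("div.quote > small.author")]

# find returns None when nothing matches, so check before using the result
missing = soup.find(id="does-not-exist")

# select_one returns only the first match
first_text = soup.select_one("span.text").get_text()
```

As a rule of thumb, `find` and `select_one` suit single expected elements, while `find_all` and `select` suit repeated structures such as list items or cards.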
Extracting Data
# Extract structured data from each quote
data = []
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").get_text()
    author = quote.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote.find_all("a", class_="tag")]
    data.append({
        "text": text,
        "author": author,
        "tags": tags,
    })
# Preview the first three records
for item in data[:3]:
    print(f'"{item["text"][:60]}..." - {item["author"]}')
    print(f"  Tags: {', '.join(item['tags'])}")
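The loop above assumes every quote block contains both a text and an author element; on messier pages, a missing element makes `find` return `None` and the chained `.get_text()` call fail. One possible way to harden the extraction is sketched below. The function name `parse_quotes`, the sample HTML, and the choice to skip incomplete records are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def parse_quotes(html: str) -> list[dict]:
    """Extract quote records, skipping blocks that lack text or author.

    Hypothetical helper; skipping (rather than filling defaults) is a design choice.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for quote in soup.find_all("div", class_="quote"):
        text_el = quote.find("span", class_="text")
        author_el = quote.find("small", class_="author")
        if text_el is None or author_el is None:
            continue  # incomplete record: skip instead of raising AttributeError
        rows.append({
            "text": text_el.get_text(strip=True),
            "author": author_el.get_text(strip=True),
            "tags": [t.get_text(strip=True) for t in quote.find_all("a", class_="tag")],
        })
    return rows


# Invented sample: the second block has no author and is skipped
SAMPLE = """
<div class="quote">
  <span class="text"> Be kind. </span>
  <small class="author">Ann</small>
  <a class="tag">life</a> <a class="tag">kindness</a>
</div>
<div class="quote"><span class="text">Orphaned text</span></div>
"""

rows = parse_quotes(SAMPLE)
```

Whether to skip, log, or fill in placeholder values for incomplete records depends on how the downstream dataset treats missing fields.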
Working with Attributes and Links
# Extract all links from a page
links = []
for a_tag in soup.find_all("a", href=True):
    href = a_tag["href"]
    link_text = a_tag.get_text(strip=True)
    links.append({"url": href, "text": link_text})
# Extract image sources
images = []
for img in soup.find_all("img"):
    src = img.get("src", "")
    alt = img.get("alt", "No alt text")
    images.append({"src": src, "alt": alt})
# Extract data attributes
elements = soup.find_all(attrs={"data-category": True})
for el in elements:
    print(f"Category: {el['data-category']}")
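Extracted `href` values are often relative (for example `/page/2/`), so they usually need to be resolved against the page URL before they can be fetched. A minimal sketch using the standard library's `urllib.parse.urljoin` follows. The helper name `absolute_links`, the sample HTML, and the de-duplication step are illustrative assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html: str, base_url: str) -> list[str]:
    """Return unique absolute URLs for every <a href> in `html`, in document order.

    Hypothetical helper for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    result = []
    for a_tag in soup.find_all("a", href=True):  # href=True skips anchors without href
        url = urljoin(base_url, a_tag["href"])   # resolves relative paths against base_url
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Invented sample markup with relative, absolute, duplicate, and href-less anchors
SAMPLE = """
<a href="/page/2/">Next</a>
<a href="author/x">Author</a>
<a href="https://example.com/">External</a>
<a href="/page/2/">Next again</a>
<a>No href</a>
"""

urls = absolute_links(SAMPLE, "https://quotes.toscrape.com/page/1/")
```

Note that `urljoin` treats a path without a leading slash as relative to the current directory of `base_url`, which is why the trailing slash on the base URL matters.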