Scrapy + BeautifulSoup Tutorial: Data Collection for ML


By Ruby Abdullah · tutorial
Tags: Scrapy, BeautifulSoup, Web Scraping, Data Collection, ETL, Python

Tutorial 9: Scrapy + BeautifulSoup - Data Collection for ML

Table of Contents

  • Introduction
  • Prerequisites
  • BeautifulSoup Basics
  • Scrapy Fundamentals
  • Building Scrapy Spiders
  • Items and Pipelines
  • Middleware Configuration
  • Handling Pagination and JavaScript
  • Data Cleaning Pipeline
  • Storing Data to CSV, JSON, and Database
  • Ethical Scraping and Rate Limiting
  • Building an ML Dataset from Web Data
  • Best Practices
  • Conclusion

    Introduction

    Machine learning models are only as good as the data they are trained on. While many curated datasets exist, real-world ML projects often require collecting custom data from the web. Whether you need product reviews for sentiment analysis, news articles for text classification, or job listings for market analysis, web scraping is an essential skill for ML practitioners.

    This tutorial covers two of the most powerful Python tools for web scraping: BeautifulSoup for parsing HTML and Scrapy for building scalable, production-grade web crawlers. You will learn how to extract structured data from websites, clean it, store it in various formats, and transform it into ML-ready datasets, all while respecting ethical scraping guidelines.


    Prerequisites

    • Python 3.9 or higher
    • Basic understanding of HTML structure and CSS selectors
    • Familiarity with Python data structures

    Install the required packages:

    pip install scrapy beautifulsoup4 lxml requests pandas sqlalchemy
    pip install scrapy-playwright  # For JavaScript-rendered pages


    BeautifulSoup Basics

    Parsing HTML Documents

    from bs4 import BeautifulSoup
    import requests

    # Fetch a web page
    url = "https://quotes.toscrape.com/"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.text, "lxml")

    # Print the page title
    print(f"Page title: {soup.title.string}")

    # Find elements by tag
    all_divs = soup.find_all("div")
    print(f"Found {len(all_divs)} div elements")

    # Find by class name (class_ avoids the reserved word "class")
    quotes = soup.find_all("div", class_="quote")
    print(f"Found {len(quotes)} quotes on the page")

    # Find by ID (find() returns None when nothing matches)
    specific = soup.find(id="main-content")

    # Find by CSS selector
    authors = soup.select("small.author")
    for author in authors:
        print(f"Author: {author.get_text()}")
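    The lookups above can also be tried without a network connection. The following is a minimal sketch that runs the same calls against an inline HTML string. The markup in SAMPLE_HTML is made up for illustration and only loosely mirrors a quotes page, and the built-in "html.parser" is used here instead of "lxml" so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
SAMPLE_HTML = """
<html><head><title>Sample Quotes</title></head>
<body>
  <div id="main-content">
    <div class="quote">
      <span class="text">First quote.</span>
      <small class="author">Author One</small>
    </div>
    <div class="quote">
      <span class="text">Second quote.</span>
      <small class="author">Author Two</small>
    </div>
  </div>
</body></html>
"""

# "html.parser" ships with Python; the tutorial itself uses "lxml".
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string                       # text inside <title>
quote_divs = soup.find_all("div", class_="quote")    # every matching element
main = soup.find(id="main-content")                  # first match
missing = soup.find(id="does-not-exist")             # None when nothing matches
author_names = [el.get_text(strip=True) for el in soup.select("small.author")]

print(page_title)
print(len(quote_divs))
print(missing is None)
print(author_names)
```

    Because find() returns None when no element matches, it is worth checking the result before calling methods on it.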

    Extracting Data

    # Extract structured data from each quote
    data = []
    for quote in soup.find_all("div", class_="quote"):
        text = quote.find("span", class_="text").get_text()
        author = quote.find("small", class_="author").get_text()
        tags = [tag.get_text() for tag in quote.find_all("a", class_="tag")]
        data.append({
            "text": text,
            "author": author,
            "tags": tags,
        })

    # Preview the first three records
    for item in data[:3]:
        print(f'"{item["text"][:60]}..." - {item["author"]}')
        print(f"  Tags: {', '.join(item['tags'])}")
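    The extraction loop above assumes every quote contains a text span and an author element. On less regular pages a missing element makes find() return None, and the chained call then raises AttributeError. One possible way to guard against that is sketched below. The helper names text_or_none and parse_quotes and the SAMPLE_HTML markup are hypothetical, introduced only for this example.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(name, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


def parse_quotes(html):
    """Turn quote markup into a list of dicts, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for quote in soup.find_all("div", class_="quote"):
        records.append({
            "text": text_or_none(quote, "span", "text"),
            "author": text_or_none(quote, "small", "author"),
            "tags": [t.get_text(strip=True)
                     for t in quote.find_all("a", class_="tag")],
        })
    return records


# Hypothetical markup: the second quote has no author and no tags.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote.</span>
  <small class="author">Author One</small>
  <a class="tag" href="/tag/life/">life</a>
  <a class="tag" href="/tag/humor/">humor</a>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
</div>
"""

records = parse_quotes(SAMPLE_HTML)
for record in records:
    print(record)
```

    Keeping missing values as None, rather than dropping the record, leaves the decision about incomplete rows to a later cleaning step.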

    # Extract all links from a page
    links = []
    for a_tag in soup.find_all("a", href=True):
        href = a_tag["href"]
        link_text = a_tag.get_text(strip=True)
        links.append({"url": href, "text": link_text})

    # Extract image sources
    images = []
    for img in soup.find_all("img"):
        src = img.get("src", "")
        alt = img.get("alt", "No alt text")
        images.append({"src": src, "alt": alt})

    # Extract data attributes (matches any element that has data-category)
    elements = soup.find_all(attrs={"data-category": True})
    for el in elements:
        print(f"Category: {el['data-category']}")
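    The link loop above stores each href exactly as it appears in the page, so relative paths such as "/page/2/" are not yet usable as request URLs. A minimal sketch of one way to resolve them with the standard library follows. The function name absolutize and the sample href values are assumptions made for illustration, not taken from the tutorial.

```python
from urllib.parse import urljoin, urlparse


def absolutize(base_url, hrefs):
    """Resolve hrefs against base_url, keep http(s) links, drop duplicates in order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Skip non-web schemes such as mailto: or javascript:
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


base = "https://quotes.toscrape.com/"
# Hypothetical href values of the kind a page might contain.
hrefs = [
    "/page/2/",
    "/author/Albert-Einstein",
    "https://www.goodreads.com/quotes",
    "mailto:someone@example.com",
    "/page/2/",
]

resolved = absolutize(base, hrefs)
for link in resolved:
    print(link)
```

    Deduplicating while preserving order keeps the crawl frontier small without changing the order in which pages were discovered.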
