PySpark for Machine Learning
Table of Contents
Introduction
Apache Spark is the de facto standard for large-scale distributed data processing. Its machine learning library, MLlib, provides scalable implementations of common ML algorithms that run seamlessly on clusters of hundreds of machines.
PySpark, the Python API for Spark, allows data scientists to use familiar Python syntax while leveraging the full power of Spark's distributed computing engine. This tutorial covers the complete PySpark ML workflow: from creating a SparkSession and manipulating DataFrames, through building classification, regression, and clustering models, to deploying models with Spark Structured Streaming.
Whether you are processing gigabytes or terabytes of data, the patterns in this tutorial apply equally.
Prerequisites
- Python 3.9+
- Java 11 or 17 (required by Spark)
- Basic understanding of machine learning concepts
Install PySpark:
pip install pyspark numpy pandas
For a cluster deployment, ensure Spark is installed on all nodes. For local development, PySpark includes a built-in standalone Spark instance.
Spark Basics and SparkSession
Every PySpark application begins with a SparkSession, which is the unified entry point for all Spark functionality.
Creating a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("PySpark ML Tutorial") \
.master("local[]") \
.config("spark.driver.memory", "4g") \
.config("spark.sql.shuffle.partitions", "8") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.getOrCreate()
Verify the session
print(f"Spark version: {spark.version}")
print(f"App name: {spark.sparkContext.appName}")
print(f"Master: {spark.sparkContext.master}")
print(f"Default parallelism: {spark.sparkContext.defaultParallelism}")
Understanding Spark Architecture
Spark follows a master-worker architecture:
- Driver: The process that runs your main program and creates the SparkContext.
- Executors: Worker processes that execute tasks and store data partitions.
- Cluster Manager: Allocates resources (YARN, Mesos, Kubernetes, or standalone).
Key concepts:
- RDD (Resilient Distributed Dataset): The foundational data structure (low-level API).
- DataFrame: A distributed collection of rows organized into named columns (high-level API, preferred for ML).
- Lazy Evaluation: Transformations are not executed until an action is called.
DataFrame Operations
Spark DataFrames are the primary data structure for PySpark ML workflows.
Creating and Loading DataFrames
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType, StringType
import numpy as np
Create from Python data
data = [
(1, "Alice", 28, 55000.0, "Engineering"),
(2, "Bob", 35, 72000.0, "Marketing"),
(3, "Carol", 42, 88000.0, "Engineering"),
(4, "Dave", 31, 61000.0, "Sales"),
(5, "Eve", 26, 48000.0, "Marketing"),
]
schema = StructType([
StructField("id", IntegerType(), False),
StructField("name", StringType(), False),
StructField("age", IntegerType(), False),