PySpark for Machine Learning Tutorial: Big Data ML Pipeline

# PySpark untuk Machine Learning ## Daftar Isi 1. [Pendahuluan](#pendahuluan) 2. [Prasyarat](#prasyarat) 3. [Dasar Spark dan SparkSession](#dasar-spark-dan-sparksession) 4. [Operasi DataFrame](#oper...

By Ruby Abdullah · · tutorial
PySparkSpark MLlibBig DataDistributed MLPipelinePython

PySpark for Machine Learning

Table of Contents

  • Introduction
  • Prerequisites
  • Spark Basics and SparkSession
  • DataFrame Operations
  • Feature Engineering at Scale
  • Classification with Spark MLlib
  • Regression with Spark MLlib
  • Clustering with Spark MLlib
  • Pipeline API
  • Cross-Validation and Hyperparameter Tuning
  • Model Persistence
  • Deployment with Spark Structured Streaming
  • Best Practices
  • Conclusion

  • Introduction

    Apache Spark is the de facto standard for large-scale distributed data processing. Its machine learning library, MLlib, provides scalable implementations of common ML algorithms that run seamlessly on clusters of hundreds of machines.

    PySpark, the Python API for Spark, allows data scientists to use familiar Python syntax while leveraging the full power of Spark's distributed computing engine. This tutorial covers the complete PySpark ML workflow: from creating a SparkSession and manipulating DataFrames, through building classification, regression, and clustering models, to deploying models with Spark Structured Streaming.

    Whether you are processing gigabytes or terabytes of data, the patterns in this tutorial apply equally.


    Prerequisites

    • Python 3.9+
    • Java 11 or 17 (required by Spark)
    • Basic understanding of machine learning concepts

    Install PySpark:

    pip install pyspark numpy pandas
    

    For a cluster deployment, ensure Spark is installed on all nodes. For local development, PySpark includes a built-in standalone Spark instance.


    Spark Basics and SparkSession

    Every PySpark application begins with a SparkSession, which is the unified entry point for all Spark functionality.

    Creating a SparkSession

    from pyspark.sql import SparkSession
    
    

    spark = SparkSession.builder \

    .appName("PySpark ML Tutorial") \

    .master("local[]") \

    .config("spark.driver.memory", "4g") \

    .config("spark.sql.shuffle.partitions", "8") \

    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \

    .getOrCreate()

    Verify the session

    print(f"Spark version: {spark.version}")

    print(f"App name: {spark.sparkContext.appName}")

    print(f"Master: {spark.sparkContext.master}")

    print(f"Default parallelism: {spark.sparkContext.defaultParallelism}")

    Understanding Spark Architecture

    Spark follows a master-worker architecture:

    • Driver: The process that runs your main program and creates the SparkContext.
    • Executors: Worker processes that execute tasks and store data partitions.
    • Cluster Manager: Allocates resources (YARN, Mesos, Kubernetes, or standalone).

    Key concepts:

    • RDD (Resilient Distributed Dataset): The foundational data structure (low-level API).
    • DataFrame: A distributed collection of rows organized into named columns (high-level API, preferred for ML).
    • Lazy Evaluation: Transformations are not executed until an action is called.


    DataFrame Operations

    Spark DataFrames are the primary data structure for PySpark ML workflows.

    Creating and Loading DataFrames

    from pyspark.sql.types import StructType, StructField, FloatType, IntegerType, StringType
    

    import numpy as np

    Create from Python data

    data = [

    (1, "Alice", 28, 55000.0, "Engineering"),

    (2, "Bob", 35, 72000.0, "Marketing"),

    (3, "Carol", 42, 88000.0, "Engineering"),

    (4, "Dave", 31, 61000.0, "Sales"),

    (5, "Eve", 26, 48000.0, "Marketing"),

    ]

    schema = StructType([

    StructField("id", IntegerType(), False),

    StructField("name", StringType(), False),

    StructField("age", IntegerType(), False),

    Related Articles

    Kedro Tutorial: Reproducible and Maintainable Data Science Pipelines

    Kedro: Pipeline Data Science yang Reproducible dan Mudah Dirawat Sebagian besar proyek data science dimulai dari satu no...

    ZenML: Modular and Cloud-Agnostic MLOps Pipeline Framework

    ZenML: Framework Pipeline MLOps yang Modular dan Cloud-Agnostic Pendahuluan Membangun model machine learning yang akurat...

    Reflex Tutorial: Building Full-Stack Web Apps in Pure Python

    Reflex: Membangun Aplikasi Web Full-Stack dengan Python Murni Reflex memungkinkan Anda membangun aplikasi web lengkap — ...

    ColBERT & RAGatouille Tutorial: Late-Interaction Retrieval for RAG

    ColBERT & RAGatouille: Retrieval Late-Interaction untuk RAG yang Lebih Baik Sebagian besar sistem RAG mengandalkan dense...