ZenML: Modular and Cloud-Agnostic MLOps Pipeline Framework

By Ruby Abdullah · tutorial
Tags: ZenML, MLOps, Pipeline, Orchestration, Python

Introduction

Building an accurate machine learning model is only a small part of the challenge in ML production. The real challenge lies in creating reproducible, scalable, and maintainable pipelines. ZenML is an open-source MLOps framework designed to tackle these problems with a modular and cloud-agnostic approach.

ZenML allows data scientists and ML engineers to define ML pipelines using simple Python decorators while providing full flexibility to integrate various tools and cloud platforms. With its unique "stacks" concept, you can switch from local development to cloud production without changing your pipeline code.

In this tutorial, we will learn ZenML from the basics to building an end-to-end pipeline covering data ingestion, preprocessing, training, evaluation, and deployment.

Prerequisites

Before getting started, make sure you have:

  • Python 3.8 or later
  • pip package manager
  • Basic understanding of machine learning
  • Familiarity with scikit-learn or other ML frameworks

Installing ZenML

Installing ZenML is straightforward using pip:

pip install zenml

To also install the dependencies needed to run the ZenML server and dashboard locally, use the server extra:

pip install "zenml[server]"

After installation, initialize a ZenML repository in your project:

zenml init

Launch the ZenML dashboard locally:

zenml login --local

The dashboard will be available at http://127.0.0.1:8237, providing complete visualization of your pipelines, artifacts, and stacks.

Core Concepts: Steps and Pipelines

Defining Steps with @step

A step is the smallest unit of work in ZenML. Each step is a Python function decorated with @step:

import pandas as pd
from zenml import step


@step
def load_data() -> pd.DataFrame:
    """Load the dataset from the data source."""
    df = pd.read_csv("data/trainingdata.csv")
    return df


@step
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the data by dropping missing values and duplicate rows."""
    df = df.dropna()
    df = df.drop_duplicates()
    return df

ZenML automatically performs type checking and serialization on each step's input/output based on the type hints you provide.
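To build intuition for what checking against type hints involves, here is a minimal sketch using only the standard library. It is not ZenML's implementation: `checked_step` and `drop_missing` are hypothetical names invented for illustration, and the sketch assumes every hint is a plain class.

```python
import functools
import inspect
from typing import get_type_hints


def checked_step(func):
    """Toy decorator: validate arguments and the return value against type hints."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    def check(label, value, expected):
        # This sketch assumes hints are plain classes (no generics such as list[int]).
        if not isinstance(value, expected):
            raise TypeError(
                f"{func.__name__}: {label} expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            if name in hints:
                check(f"argument '{name}'", value, hints[name])
        result = func(*args, **kwargs)
        if "return" in hints:
            check("return value", result, hints["return"])
        return result

    return wrapper


@checked_step
def drop_missing(rows: list) -> list:
    """Keep only the rows (dicts) that contain no None values."""
    return [row for row in rows if None not in row.values()]


clean_rows = drop_missing([{"x": 1}, {"x": None}])  # passes both checks
```

Calling `drop_missing("oops")` raises a `TypeError` before the function body runs, which is the kind of early failure that type hints make possible. A real framework additionally has to decide how to store each output, which this sketch leaves out.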

Defining Pipelines with @pipeline

A pipeline connects multiple steps into a structured workflow:

from zenml import pipeline


@pipeline
def training_pipeline():
    """Pipeline for ML model training."""
    data = load_data()
    processed_data = preprocess_data(data)
    model = train_model(processed_data)
    metrics = evaluate_model(model, processed_data)
    return metrics

Running a pipeline is as simple as calling the function:

if __name__ == "__main__":
    training_pipeline()

Parameterizing Steps

You can create steps that accept configuration parameters:

import pandas as pd
from pydantic import BaseModel
from zenml import step


class TrainingConfig(BaseModel):
    """Hyperparameters for the training step."""
    learning_rate: float = 0.01
    n_estimators: int = 100
    max_depth: int = 5


@step
def train_model(
    data: pd.DataFrame,
    config: TrainingConfig = TrainingConfig(),
) -> object:
    """Train model with customizable configuration."""
    from sklearn.ensemble import RandomForestClassifier

    # Split the features from the "target" label column.
    X = data.drop("target", axis=1)
    y = data["target"]

    # RandomForestClassifier takes no learning rate, so learning_rate is unused here.
    model = RandomForestClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
    )
    model.fit(X, y)
    return model
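The pattern of a config object with defaults, where one value is overridden per run, can be sketched without any dependencies. This is only an illustration: a frozen dataclass stands in for the pydantic model, and `SketchConfig` and `summarize` are hypothetical names that are not part of ZenML.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Plain-dataclass stand-in for a pydantic config model (hypothetical name)."""
    learning_rate: float = 0.01
    n_estimators: int = 100
    max_depth: int = 5

    def __post_init__(self):
        # pydantic would validate field types automatically; one check is done by hand here.
        if self.n_estimators <= 0:
            raise ValueError("n_estimators must be positive")


def summarize(config: SketchConfig = SketchConfig()) -> dict:
    """Stand-in for a step body: report which hyperparameters would be used."""
    return {"n_estimators": config.n_estimators, "max_depth": config.max_depth}


default_run = summarize()  # uses every default
deeper_run = summarize(replace(SketchConfig(), max_depth=10))  # overrides one field
```

Keeping the hyperparameters in one typed object means a run can be described by the values that differ from the defaults, rather than by edits to the step body.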

Artifacts and Materializers

Understanding Artifacts

Every output from a step is automatically stored as an artifact. Artifacts are data produced and consumed by steps in a pipeline. ZenML tracks each artifact with comprehensive metadata including version, type, and lineage.

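import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from zenml import step, log_artifact_metadata


@step
def train_model(data: pd.DataFrame) -> object:
    X_train = data.drop("target", axis=1)
    y_train = data["target"]
    model = RandomForestClassifier()
    model.fit(X_train, y_train)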
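To make versioned artifacts with type and lineage metadata more concrete, here is a toy in-memory store built on the standard library. It is a conceptual sketch only, not how ZenML stores artifacts: `ToyArtifactStore`, `ArtifactRecord`, and the artifact names are all hypothetical.

```python
import hashlib
import pickle
from dataclasses import dataclass, field


@dataclass
class ArtifactRecord:
    """Metadata kept for one saved version of an artifact."""
    name: str
    version: int
    type_name: str
    producer_step: str
    content_hash: str
    parents: list = field(default_factory=list)  # (name, version) of direct inputs


class ToyArtifactStore:
    """In-memory store that versions every saved value and remembers its inputs."""

    def __init__(self):
        self._records = {}  # artifact name -> list of ArtifactRecord, oldest first
        self._blobs = {}    # (name, version) -> pickled bytes

    def save(self, name, value, producer_step, parents=()):
        blob = pickle.dumps(value)
        versions = self._records.setdefault(name, [])
        record = ArtifactRecord(
            name=name,
            version=len(versions) + 1,
            type_name=type(value).__name__,
            producer_step=producer_step,
            content_hash=hashlib.sha256(blob).hexdigest(),
            parents=[(p.name, p.version) for p in parents],
        )
        versions.append(record)
        self._blobs[(name, record.version)] = blob
        return record

    def load(self, name, version):
        return pickle.loads(self._blobs[(name, version)])

    def lineage(self, name, version):
        """Return all upstream (name, version) pairs, nearest ancestors first."""
        record = self._records[name][version - 1]
        chain = []
        for parent_name, parent_version in record.parents:
            chain.append((parent_name, parent_version))
            chain.extend(self.lineage(parent_name, parent_version))
        return chain


store = ToyArtifactStore()
raw = store.save("raw_data", [3, None, 1], producer_step="load_data")
clean = store.save("clean_data", [3, 1], producer_step="preprocess_data", parents=[raw])
```

Saving `raw_data` again would create version 2 rather than overwrite version 1, and `store.lineage("clean_data", 1)` walks back to the exact input version that produced it. That combination is what makes a past run reproducible.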
