ZenML: Modular and Cloud-Agnostic MLOps Pipeline Framework
Introduction
Building an accurate machine learning model is only a small part of putting ML into production. The harder part is creating pipelines that are reproducible, scalable, and maintainable. ZenML is an open-source MLOps framework designed to tackle these problems with a modular, cloud-agnostic approach.
ZenML allows data scientists and ML engineers to define ML pipelines using simple Python decorators while providing full flexibility to integrate various tools and cloud platforms. With its unique "stacks" concept, you can switch from local development to cloud production without changing your pipeline code.
This tutorial covers ZenML from the basics through to an end-to-end pipeline that handles data ingestion, preprocessing, training, evaluation, and deployment.
Prerequisites
Before getting started, make sure you have:
- Python 3.8 or later
- pip package manager
- Basic understanding of machine learning
- Familiarity with scikit-learn or other ML frameworks
Installing ZenML
Installing ZenML is straightforward using pip:
pip install zenml
To also install the dependencies for running the ZenML server and dashboard locally, use the server extra:
pip install "zenml[server]"
After installation, initialize a ZenML repository in your project:
zenml init
Launch the ZenML dashboard locally:
zenml login --local
The dashboard will be available at http://127.0.0.1:8237, providing complete visualization of your pipelines, artifacts, and stacks.
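To confirm from Python that the installation succeeded, a small check like the one below can help. This is a sketch that uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of ZenML.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("zenml")
    if version is None:
        print("zenml is not installed in this environment")
    else:
        print(f"zenml {version} is installed")
```

Running the check inside the same virtual environment used for the project avoids confusion when several Python environments exist on one machine.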
Core Concepts: Steps and Pipelines
Defining Steps with @step
A step is the smallest unit of work in ZenML. Each step is a Python function decorated with @step:
from zenml import step
import pandas as pd

@step
def load_data() -> pd.DataFrame:
    """Load dataset from data source."""
    df = pd.read_csv("data/trainingdata.csv")
    return df

@step
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and process data."""
    df = df.dropna()           # drop rows with missing values
    df = df.drop_duplicates()  # drop exact duplicate rows
    return df
ZenML automatically performs type checking and serialization on each step's input/output based on the type hints you provide.
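To make the role of type hints more concrete, the sketch below shows how any Python framework can read a function's annotations and check values against them. It is an illustration of the general idea using the standard typing module, not ZenML's actual implementation; check_call and count_rows are hypothetical names.

```python
from typing import Any, Callable, get_type_hints


def check_call(func: Callable[..., Any], **kwargs: Any) -> Any:
    """Call `func` after checking keyword arguments and the result
    against the function's type hints (plain classes only)."""
    hints = get_type_hints(func)

    # Validate each argument whose annotation is a plain class.
    for name, value in kwargs.items():
        expected = hints.get(name)
        if isinstance(expected, type) and not isinstance(value, expected):
            raise TypeError(
                f"{name} should be {expected.__name__}, "
                f"got {type(value).__name__}"
            )

    result = func(**kwargs)

    # Validate the return value in the same way.
    expected_return = hints.get("return")
    if isinstance(expected_return, type) and not isinstance(result, expected_return):
        raise TypeError(
            f"return value should be {expected_return.__name__}, "
            f"got {type(result).__name__}"
        )
    return result


def count_rows(rows: list) -> int:
    """Toy function with annotated input and output."""
    return len(rows)


if __name__ == "__main__":
    print(check_call(count_rows, rows=[1, 2, 3]))
```

The same annotations can also tell a framework which serializer to pick for an output, which is why accurate return types on steps matter.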
Defining Pipelines with @pipeline
A pipeline connects multiple steps into a structured workflow:
from zenml import pipeline

@pipeline
def training_pipeline():
    """Pipeline for ML model training."""
    # Each call wires one step's output into the next step's input.
    data = load_data()
    processed_data = preprocess_data(data)
    model = train_model(processed_data)
    metrics = evaluate_model(model, processed_data)
    return metrics
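The pipeline above calls an evaluation step that is not defined in this section. The sketch below shows what the body of such a step might look like, written as a plain function so it can be run on its own. It assumes the label column is named "target" and that the model exposes a scikit-learn style predict method; evaluate_model_logic and AlwaysOne are hypothetical names used only for illustration.

```python
from typing import Any, Dict

import pandas as pd


def evaluate_model_logic(
    model: Any, data: pd.DataFrame, target: str = "target"
) -> Dict[str, float]:
    """Return the accuracy of `model` on `data` as a metrics dictionary."""
    X = data.drop(columns=[target])
    y = data[target]
    # Align predictions with the label index before comparing.
    predictions = pd.Series(model.predict(X), index=y.index)
    accuracy = float((predictions == y).mean())
    return {"accuracy": accuracy}


class AlwaysOne:
    """Stand-in model that predicts 1 for every row (demo only)."""

    def predict(self, X: pd.DataFrame) -> list:
        return [1] * len(X)


demo = pd.DataFrame(
    {"feature": [0.1, 0.4, 0.7, 0.9], "target": [1, 0, 1, 1]}
)

if __name__ == "__main__":
    print(evaluate_model_logic(AlwaysOne(), demo))
```

Decorating a function like this with @step would turn it into a pipeline step. Note that the pipeline above evaluates on the same data it trains on, so the resulting number is a training accuracy; a held-out split would give a more realistic estimate.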
Running a pipeline is as simple as calling the function:
if __name__ == "__main__":
    training_pipeline()
Parameterizing Steps
You can create steps that accept configuration parameters:
import pandas as pd
from pydantic import BaseModel
from zenml import step

class TrainingConfig(BaseModel):
    learning_rate: float = 0.01  # not used by RandomForestClassifier below
    n_estimators: int = 100
    max_depth: int = 5

@step
def train_model(
    data: pd.DataFrame,
    config: TrainingConfig = TrainingConfig()
) -> object:
    """Train model with customizable configuration."""
    from sklearn.ensemble import RandomForestClassifier

    # Separate the features from the label column.
    X = data.drop("target", axis=1)
    y = data["target"]
    model = RandomForestClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth
    )
    model.fit(X, y)
    return model
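A configuration class like the one above can be exercised on its own before it is handed to a step. The sketch below, which assumes pydantic is installed, shows defaults, overrides, and the validation error raised for a badly typed value. The is_valid helper is a hypothetical name added for illustration.

```python
from pydantic import BaseModel, ValidationError


class TrainingConfig(BaseModel):
    learning_rate: float = 0.01
    n_estimators: int = 100
    max_depth: int = 5


# Defaults apply when no values are given.
default_config = TrainingConfig()

# Individual fields can be overridden by keyword.
custom_config = TrainingConfig(n_estimators=200, max_depth=8)


def is_valid(**overrides) -> bool:
    """Return True if the overrides form a valid TrainingConfig."""
    try:
        TrainingConfig(**overrides)
        return True
    except ValidationError:
        return False


if __name__ == "__main__":
    print(default_config)
    print(custom_config)
    print(is_valid(max_depth="deep"))  # a non-numeric string is rejected
```

Inside a pipeline, an object such as custom_config would be passed to the training step through its config argument, so one step definition can serve several experiments.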
Artifacts and Materializers
Understanding Artifacts
Every output from a step is automatically stored as an artifact. Artifacts are data produced and consumed by steps in a pipeline. ZenML tracks each artifact with comprehensive metadata including version, type, and lineage.
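The idea of tracking version, type, and lineage can be pictured with a toy in-memory store. The sketch below is purely illustrative and is not ZenML's storage layer or API; ArtifactRecord and ToyArtifactStore are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence, Tuple


@dataclass
class ArtifactRecord:
    name: str
    version: int
    data_type: str
    produced_by: str
    # Lineage: (name, version) pairs of the artifacts this one was built from.
    inputs: List[Tuple[str, int]] = field(default_factory=list)


class ToyArtifactStore:
    """Keeps a version history per artifact name, in memory only."""

    def __init__(self) -> None:
        self._records: Dict[str, List[ArtifactRecord]] = {}

    def save(
        self,
        name: str,
        value: Any,
        produced_by: str,
        inputs: Sequence[ArtifactRecord] = (),
    ) -> ArtifactRecord:
        versions = self._records.setdefault(name, [])
        record = ArtifactRecord(
            name=name,
            version=len(versions) + 1,  # versions start at 1 and increase
            data_type=type(value).__name__,
            produced_by=produced_by,
            inputs=[(r.name, r.version) for r in inputs],
        )
        versions.append(record)
        return record

    def latest(self, name: str) -> ArtifactRecord:
        return self._records[name][-1]


if __name__ == "__main__":
    store = ToyArtifactStore()
    raw = store.save("raw_data", [1, 2, 3], produced_by="load_data")
    clean = store.save(
        "clean_data", [1, 2], produced_by="preprocess_data", inputs=[raw]
    )
    print(clean)
```

Each record carries enough information to answer which step produced an artifact and from which input versions, which is the essence of lineage tracking.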
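from sklearn.ensemble import RandomForestClassifier
from zenml import step, log_artifact_metadata

@step
def train_model(data: pd.DataFrame) -> object:
    # Split features and label, as in the earlier training step.
    X_train = data.drop("target", axis=1)
    y_train = data["target"]
    model = RandomForestClassifier()
    model.fit(X_train, y_train)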