Metaflow Tutorial: Netflix's MLOps Framework for Data Science

Metaflow is an open-source framework originally developed at Netflix for building and managing real-life data science projects. It allows data scientists to focus on modeling without worrying about infrastructure, orchestration, and deployment. In this tutorial, we'll learn how to use Metaflow from basics to advanced features.

Why Metaflow?

In data science and machine learning, one of the biggest challenges isn't building models but managing the entire lifecycle from experimentation to production. Metaflow addresses this with a pragmatic approach:

Human-centric API: Designed to be easy for data scientists, not just ML engineers
Automatic versioning: Every experiment is automatically tracked and reproducible
Transparent scalability: Move from laptop to cloud without changing code
Dependency management: Declares per-step Python dependencies with decorators such as @conda and @pypi
Integration: Works with AWS, Azure, GCP, and Kubernetes

Installation

Basic Installation

pip install metaflow

Installation with AWS Support

pip install "metaflow[aws]"

Verify Installation

import metaflow
print(metaflow.__version__)
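
If Metaflow is missing, the check above fails at the import line before anything is printed. As a minimal sketch, the standard-library importlib.metadata module can report the installed version without importing the package at all. The helper name installed_version is an illustrative choice, not part of Metaflow.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("metaflow")
    if found is None:
        print("metaflow is not installed in this environment")
    else:
        print(f"metaflow {found} is installed")
```

This reads package metadata only, so it also works in environments where importing the package would fail for unrelated reasons.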

Initial Configuration

After installation, inspect the current configuration:

metaflow configure show

To configure AWS S3 as datastore:

metaflow configure aws

Core Concepts: Flow and Step

Metaflow uses Flow and Step concepts to organize data science pipelines. A Flow is a DAG (Directed Acyclic Graph) composed of multiple Steps.
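
To make the DAG idea concrete, here is a plain-Python sketch that does not use Metaflow at all. It models a small flow as a mapping from each step to the steps it depends on, then asks the standard library's graphlib module (Python 3.9 or later) for a valid execution order. The step names are hypothetical, and this only illustrates the concept; it is not a description of how Metaflow schedules steps internally.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each key is a step, each value is the set of steps
# that must finish before it can run.
dependencies = {
    "start": set(),
    "train_a": {"start"},
    "train_b": {"start"},
    "join": {"train_a", "train_b"},
    "end": {"join"},
}


def execution_order(deps):
    """Return one valid order in which the steps could run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two training steps depend only on start, so either may come first in the returned order. That freedom is what allows independent steps to run in parallel.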

Your First Flow

from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        print("Starting the first flow!")
        self.message = "Hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        print(f"Message: {self.message}")
        print("Flow completed!")

if __name__ == '__main__':
    HelloFlow()

Save the code above as helloflow.py, then run the flow:

python helloflow.py run

Understanding Artifacts

Every variable saved as self.x in a step becomes an artifact that is automatically versioned and can be accessed later.

from metaflow import FlowSpec, step

class ArtifactFlow(FlowSpec):

    @step
    def start(self):
        self.data = [1, 2, 3, 4, 5]
        self.modelname = "randomforest"

        self.next(self.process)

    @step
    def process(self):
        self.result = sum(self.data) * 2

        print(f"Model: {self.modelname}")
        print(f"Result: {self.result}")
        self.next(self.end)

    @step
    def end(self):
        print(f"Final result: {self.result}")

if __name__ == '__main__':
    ArtifactFlow()
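
Artifacts stay available after the run has finished. The sketch below reads them back through Metaflow's Client API, assuming ArtifactFlow has already completed at least one successful run in the current environment. The helper name read_latest_artifacts is illustrative and not part of Metaflow.

```python
def read_latest_artifacts(flow_name, artifact_names):
    """Return {name: value} for artifacts of the latest successful run."""
    # Imported lazily so the helper can be defined even where Metaflow
    # is not installed; calling it still requires Metaflow.
    from metaflow import Flow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError(f"No successful run found for {flow_name}")
    # run.data exposes the artifacts as they exist at the end step.
    return {name: getattr(run.data, name) for name in artifact_names}
```

For example, calling read_latest_artifacts("ArtifactFlow", ["modelname", "result"]) from a notebook or a separate script would return a dictionary holding those two artifacts from the most recent successful run.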

Branching and Join

Metaflow supports parallel execution through branching. This is useful for comparing multiple models simultaneously.

Branching Example

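from metaflow import FlowSpec, step

class BranchFlow(FlowSpec):

    @step
    def start(self):
        self.rawdata = list(range(100))
        # Passing two steps to self.next() fans out into parallel branches.
        self.next(self.trainrf, self.trainxgb)

    @step
    def trainrf(self):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.datasets import make_classification

        X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)
        # Scored on the training data itself, so this is not a held-out estimate.
        self.accuracy = model.score(X, y)
        self.modeltype = "RandomForest"
        print(f"RF Accuracy: {self.accuracy:.4f}")
        self.next(self.join)

    @step
    def trainxgb(self):
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.datasets import make_classification

        # The source listing is cut off after the imports above. Everything
        # from here on is an assumed completion that mirrors trainrf.
        X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
        model = GradientBoostingClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)
        self.accuracy = model.score(X, y)
        self.modeltype = "GradientBoosting"
        self.next(self.join)

    @step
    def join(self, inputs):
        # A join step receives one input per branch; keep the better model.
        best = max(inputs, key=lambda branch: branch.accuracy)
        self.best_model = best.modeltype
        self.best_accuracy = best.accuracy
        self.next(self.end)

    @step
    def end(self):
        print(f"Best model: {self.best_model} ({self.best_accuracy:.4f})")

if __name__ == '__main__':
    BranchFlow()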
