Metaflow Tutorial: Netflix's MLOps Framework for Data Science

By Ruby Abdullah · tutorial
Metaflow · MLOps · Netflix · Pipeline · Python

Metaflow is an open-source framework originally developed at Netflix for building and managing real-life data science projects. It allows data scientists to focus on modeling without worrying about infrastructure, orchestration, and deployment. In this tutorial, we'll learn how to use Metaflow from basics to advanced features.

Why Metaflow?

In data science and machine learning, one of the biggest challenges isn't building models but managing the entire lifecycle from experimentation to production. Metaflow addresses this with a pragmatic approach:

  • Human-centric API: Designed to be easy for data scientists, not just ML engineers
  • Automatic versioning: Every experiment is automatically tracked and reproducible
  • Transparent scalability: Move from laptop to cloud without changing code
  • Dependency management: Manages Python environments automatically
  • Integration: Works with AWS, Azure, GCP, and Kubernetes

Installation

Basic Installation

pip install metaflow

Installation with AWS Support

pip install metaflow[aws]

Verify Installation

import metaflow

print(metaflow.__version__)
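
As an alternative check that does not import the package, the installed version can be read from the package metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Metaflow.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Reports the Metaflow version if the pip install succeeded.
print(installed_version("metaflow") or "metaflow is not installed")
```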

Initial Configuration

After installation, display the current configuration:

metaflow configure show

To configure AWS S3 as the datastore:

metaflow configure aws
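
To see what the configuration step saved, the settings file can be inspected from Python. The sketch below rests on assumptions: that settings are stored as JSON under `~/.metaflowconfig` (or the directory named by `METAFLOW_HOME`), and that `METAFLOW_DEFAULT_DATASTORE` is the key naming the datastore. The helper `load_metaflow_config` is hypothetical.

```python
import json
import os
from pathlib import Path


def load_metaflow_config(home=None, profile=None):
    """Return the saved Metaflow configuration as a dict ({} if no file exists).

    Assumption: settings live in <home>/config.json, or in
    <home>/config_<profile>.json when a named profile is used.
    """
    base = home or os.environ.get("METAFLOW_HOME", "~/.metaflowconfig")
    filename = f"config_{profile}.json" if profile else "config.json"
    path = Path(base).expanduser() / filename
    if not path.is_file():
        return {}
    with path.open() as fh:
        return json.load(fh)


config = load_metaflow_config()
# Assumed key name; falls back to a message when nothing has been saved.
print(config.get("METAFLOW_DEFAULT_DATASTORE", "local (no saved configuration)"))
```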

Core Concepts: Flow and Step

Metaflow uses Flow and Step concepts to organize data science pipelines. A Flow is a DAG (Directed Acyclic Graph) composed of multiple Steps.
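
The DAG idea can be illustrated without Metaflow. The sketch below models a hypothetical four-step flow as a dependency graph and derives a valid execution order with the standard library (Python 3.9+). It is a toy model of the concept, not Metaflow's own scheduler, and the step names are made up.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each step maps to the set of steps that must finish first.
predecessors = {
    "start": set(),
    "load_data": {"start"},
    "train": {"load_data"},
    "end": {"train"},
}

# static_order() yields a step only after all of its predecessors.
execution_order = list(TopologicalSorter(predecessors).static_order())
print(execution_order)  # ['start', 'load_data', 'train', 'end']
```

Because the graph is acyclic, every step has a well-defined point at which all of its inputs are ready; a cycle would make such an ordering impossible.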

Your First Flow

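from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Every flow begins at a step named "start".
        print("Starting the first flow!")
        # Assigning to self stores the value as an artifact.
        self.message = "Hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        # Every flow finishes at a step named "end".
        print(f"Message: {self.message}")
        print("Flow completed!")


if __name__ == '__main__':
    HelloFlow()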

Save the code as helloflow.py and run the flow:

python helloflow.py run

Understanding Artifacts

Every variable saved as self.x in a step becomes an artifact that is automatically versioned and can be accessed later.

from metaflow import FlowSpec, step


class ArtifactFlow(FlowSpec):

    @step
    def start(self):
        # Values assigned to self are persisted as artifacts.
        self.data = [1, 2, 3, 4, 5]
        self.model_name = "random_forest"
        self.next(self.process)

    @step
    def process(self):
        # Artifacts from earlier steps are available on self.
        # Assumes the intended operation is doubling the sum.
        self.result = sum(self.data) * 2
        print(f"Model: {self.model_name}")
        print(f"Result: {self.result}")
        self.next(self.end)

    @step
    def end(self):
        print(f"Final result: {self.result}")


if __name__ == '__main__':
    ArtifactFlow()
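
Because artifacts are persisted, they can be read back after a run has finished. The following is a minimal sketch using Metaflow's Client API, assuming ArtifactFlow has completed successfully at least once in the current environment. The helper name `read_latest_result` is hypothetical.

```python
def read_latest_result(flow_name: str):
    """Return the `result` artifact of the latest successful run of `flow_name`."""
    # Imported lazily so the helper can be defined before Metaflow is set up.
    from metaflow import Flow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError(f"{flow_name} has no successful run yet")
    # run.data exposes the artifacts as they were at the end of the run.
    return run.data.result
```

Calling `read_latest_result("ArtifactFlow")` from a notebook or script should return the same value that the end step printed.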

Branching and Join

Metaflow supports parallel execution through branching. This is useful for comparing multiple models simultaneously.
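
The fan-out and join pattern can be sketched in plain Python first. In this toy model two stand-in "training" functions run concurrently on the same input, and a join picks the better result. The function names and scores are made up for illustration. Metaflow itself runs each branch as a separate task; threads are used here only to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor


def train_model_a(data):
    # Stand-in for a real training routine; returns a made-up score.
    return {"model_type": "model_a", "score": sum(data) / (len(data) * 10)}


def train_model_b(data):
    # Second stand-in branch with a different made-up score.
    return {"model_type": "model_b", "score": max(data) / 10}


def run_branches(data, branches):
    """Run every branch on the same input concurrently, then join the results."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, data) for branch in branches]
        results = [future.result() for future in futures]
    # Join: all branches have finished, so pick the best-scoring result.
    return max(results, key=lambda r: r["score"])


best = run_branches([1, 2, 3, 4, 5], [train_model_a, train_model_b])
print(best["model_type"], best["score"])  # model_b 0.5
```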

Branching Example

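from metaflow import FlowSpec, step


class BranchFlow(FlowSpec):

    @step
    def start(self):
        self.raw_data = list(range(100))
        # Passing two steps to self.next() opens two parallel branches.
        self.next(self.train_rf, self.train_xgb)

    @step
    def train_rf(self):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.datasets import make_classification

        X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)
        self.accuracy = model.score(X, y)
        self.model_type = "RandomForest"
        print(f"RF Accuracy: {self.accuracy:.4f}")
        self.next(self.join)

    @step
    def train_xgb(self):
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.datasets import make_classification

        # Assumed completion (source is cut off here): mirrors train_rf.
        X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
        model = GradientBoostingClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)
        self.accuracy = model.score(X, y)
        self.model_type = "GradientBoosting"
        self.next(self.join)

    @step
    def join(self, inputs):
        # Assumed: a join step receives one input per branch; keep the best.
        best = max(inputs, key=lambda branch: branch.accuracy)
        self.accuracy = best.accuracy
        self.model_type = best.model_type
        self.next(self.end)

    @step
    def end(self):
        print(f"Best model: {self.model_type} ({self.accuracy:.4f})")


if __name__ == '__main__':
    BranchFlow()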
