Metaflow Tutorial: Netflix's MLOps Framework for Data Science
Metaflow is an open-source framework originally developed at Netflix for building and managing real-life data science projects. It allows data scientists to focus on modeling without worrying about infrastructure, orchestration, and deployment. In this tutorial, we'll learn how to use Metaflow, from the basics to advanced features.
Why Metaflow?
In data science and machine learning, one of the biggest challenges isn't building models but managing the entire lifecycle from experimentation to production. Metaflow addresses this with a pragmatic approach:
- Human-centric API: Designed to be easy for data scientists, not just ML engineers
- Automatic versioning: Every experiment is automatically tracked and reproducible
- Transparent scalability: Move from laptop to cloud without changing code
- Dependency management: Manages Python environments automatically
- Integration: Works with AWS, Azure, GCP, and Kubernetes
Installation
Basic Installation
pip install metaflow
Installation with AWS Support
pip install metaflow[aws]
Verify Installation
import metaflow
print(metaflow.__version__)
Initial Configuration
After installation, display the current configuration:
metaflow configure show
To configure AWS S3 as datastore:
metaflow configure aws
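To check which datastore a configuration points at from Python, the settings file can be read directly. The sketch below assumes Metaflow's default layout, where the default profile is stored as config.json under ~/.metaflowconfig (or under the directory named by the METAFLOW_HOME environment variable) and named profiles as config_<profile>.json. The helper name load_metaflow_config is hypothetical and not part of Metaflow.

```python
import json
import os
from pathlib import Path


def load_metaflow_config(profile=None):
    """Return the Metaflow settings file as a dict ({} if it does not exist)."""
    # Assumption: the default lookup is METAFLOW_HOME, then ~/.metaflowconfig.
    home = Path(os.environ.get("METAFLOW_HOME", "~/.metaflowconfig")).expanduser()
    name = f"config_{profile}.json" if profile else "config.json"
    path = home / name
    if not path.exists():
        return {}
    with path.open() as f:
        return json.load(f)


if __name__ == "__main__":
    config = load_metaflow_config()
    # Falls back to "local" when no datastore has been configured.
    print(config.get("METAFLOW_DEFAULT_DATASTORE", "local"))
```

With an S3 setup, the METAFLOW_DEFAULT_DATASTORE key is expected to read "s3"; an empty result means Metaflow is still using its local defaults.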
Core Concepts: Flow and Step
Metaflow uses Flow and Step concepts to organize data science pipelines. A Flow is a DAG (Directed Acyclic Graph) composed of multiple Steps.
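To make the DAG idea concrete, here is a small plain-Python illustration that does not use Metaflow at all. It describes a hypothetical step graph (the step names are invented for this example) and uses the standard library's graphlib to produce one valid execution order. It only sketches what "directed acyclic graph" means for step ordering and is not how Metaflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step graph: each step maps to the steps that must finish first.
dependencies = {
    "start": set(),
    "train_a": {"start"},
    "train_b": {"start"},
    "join": {"train_a", "train_b"},
    "end": {"join"},
}


def execution_order(graph):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dependencies)
print(order)
```

The two training steps have no dependency on each other, so they may appear in either order; that independence is what lets a flow run them in parallel. If the graph contained a cycle, graphlib would raise CycleError, which is why a flow has to be acyclic.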
Your First Flow
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Every flow begins at a step named "start".
        print("Starting the first flow!")
        self.message = "Hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        # Every flow finishes at a step named "end".
        print(f"Message: {self.message}")
        print("Flow completed!")


if __name__ == '__main__':
    HelloFlow()
Run the flow:
python helloflow.py run
Understanding Artifacts
Every variable saved as self.x in a step becomes an artifact that is automatically versioned and can be accessed later.
from metaflow import FlowSpec, step


class ArtifactFlow(FlowSpec):

    @step
    def start(self):
        # Attributes assigned to self are stored as artifacts.
        self.data = [1, 2, 3, 4, 5]
        self.modelname = "randomforest"
        self.next(self.process)

    @step
    def process(self):
        # Artifacts from earlier steps are available on self.
        self.result = sum(self.data) * 2
        print(f"Model: {self.modelname}")
        print(f"Result: {self.result}")
        self.next(self.end)

    @step
    def end(self):
        print(f"Final result: {self.result}")


if __name__ == '__main__':
    ArtifactFlow()
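Once a run has finished, its artifacts can be read back from a notebook or a separate script through Metaflow's Client API. The following is a minimal sketch, assuming ArtifactFlow has completed at least one successful run in the datastore you are currently configured for. The helper name latest_artifacts is hypothetical.

```python
def latest_artifacts(flow_name="ArtifactFlow"):
    """Return (run id, result, modelname) from the latest successful run."""
    # Imported inside the function so the helper can be defined
    # even in an environment where Metaflow is not installed.
    from metaflow import Flow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError(f"No successful run found for {flow_name}")
    # run.data exposes the artifacts as they were when the end step finished.
    return run.id, run.data.result, run.data.modelname
```

Calling latest_artifacts() after python artifactflow.py run (the file name is an assumption) returns the run id together with the stored result and modelname values, without re-executing any step.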
Branching and Join
Metaflow supports parallel execution through branching. This is useful for comparing multiple models simultaneously.
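The fan-out and join pattern can be sketched in plain Python before looking at the Metaflow version. This is only an analogy: it uses threads from the standard library, whereas Metaflow schedules each branch as its own task, and the two branch functions and their scoring rules are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def branch_a(data):
    """Hypothetical branch: the score is the mean of the data."""
    return {"name": "branch_a", "score": sum(data) / len(data)}


def branch_b(data):
    """Hypothetical branch: the score is the maximum of the data."""
    return {"name": "branch_b", "score": max(data)}


def run_branches(data):
    # Fan out: both branches receive the same input and run concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(branch, data) for branch in (branch_a, branch_b)]
        results = [f.result() for f in futures]
    # Join: all branch results are available together, so they can be compared.
    return max(results, key=lambda r: r["score"])


best = run_branches([1, 2, 3, 4])
print(best["name"], best["score"])  # prints: branch_b 4
```

The join is the only place where results from different branches meet, which is why model comparison logic belongs there.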
Branching Example
from metaflow import FlowSpec, step


class BranchFlow(FlowSpec):

    @step
    def start(self):
        self.rawdata = list(range(100))
        # Passing two steps to self.next runs them as parallel branches.
        self.next(self.trainrf, self.trainxgb)

    @step
    def trainrf(self):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.datasets import make_classification

        # Synthetic classification data with a fixed seed for reproducibility.
        X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)
        # Accuracy on the training data (this example has no held-out split).
        self.accuracy = model.score(X, y)
        self.modeltype = "RandomForest"
        print(f"RF Accuracy: {self.accuracy:.4f}")
        self.next(self.join)

    @step
    def trainxgb(self):
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.datasets import make_classification