Introduction
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Developed by Databricks, MLflow helps data scientists and ML engineers track experiments, package code, manage models, and deploy to production.
Why MLflow?
- Reproducibility: Track all experiments in detail
- Collaboration: Share results with your team
- Model versioning: Manage different model versions
- Deployment ready: Easily deploy models to various platforms
- Framework agnostic: Works with TensorFlow, PyTorch, Scikit-learn, etc.
MLflow Main Components
MLflow consists of four main components: MLflow Tracking, MLflow Projects, MLflow Models, and the MLflow Model Registry.
Installation and Setup
Basic Installation
# Install MLflow
pip install mlflow
# Install with extras for various backends (quoted so the shell does not expand the brackets)
pip install "mlflow[extras]"
# Verify installation
mlflow --version
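The same check can be done from Python, which helps when the `mlflow` command on the PATH belongs to a different environment than the interpreter in use. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("mlflow")
    print(f"mlflow: {found or 'not installed'}")
```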
Set Up Database Backend (PostgreSQL)
For production, use a database backend:
# Install dependencies
pip install psycopg2-binary
# Set up PostgreSQL (example using Docker)
docker run -d \
  --name mlflow-db \
  -e POSTGRES_USER=mlflow \
  -e POSTGRES_PASSWORD=mlflow \
  -e POSTGRES_DB=mlflow \
  -p 5432:5432 \
  postgres:13
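MLflow reaches this database through a connection URI of the form `postgresql://user:password@host:port/database`. With the demo credentials the URI can be typed by hand, but a real password containing `@` or `/` has to be percent-encoded first. The helper below is a sketch of that step; `postgres_uri` is a hypothetical name, not part of MLflow.

```python
from urllib.parse import quote


def postgres_uri(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a PostgreSQL connection URI, percent-encoding the credentials."""
    # safe="" also encodes "/" and "@", which would otherwise break the URI.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # Values match the Docker container started above.
    print(postgres_uri("mlflow", "mlflow", "localhost", 5432, "mlflow"))
```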
Set Up Artifact Store (MinIO/S3)
# Install boto3 for S3 compatibility
pip install boto3
# Set up MinIO (S3-compatible storage)
docker run -d \
  --name mlflow-minio \
  -p 9000:9000 \
  -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"
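A fresh MinIO container has no buckets, so the `mlflow-artifacts` bucket used as the artifact root in this guide has to be created before anything is uploaded to it. The following is a sketch of one way to do that with boto3, assuming the MinIO defaults above and the standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `MLFLOW_S3_ENDPOINT_URL` variables. The helper names `s3_client_kwargs` and `ensure_bucket` are illustrative only.

```python
import os


def s3_client_kwargs(env=os.environ) -> dict:
    """Collect S3 client settings, falling back to the MinIO defaults above."""
    return {
        "endpoint_url": env.get("MLFLOW_S3_ENDPOINT_URL", "http://localhost:9000"),
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID", "minioadmin"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY", "minioadmin"),
    }


def ensure_bucket(name: str = "mlflow-artifacts") -> None:
    """Create the bucket if it is missing (needs boto3 and a running MinIO)."""
    # Imported lazily so s3_client_kwargs() stays usable without boto3.
    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client("s3", **s3_client_kwargs())
    try:
        client.head_bucket(Bucket=name)  # raises ClientError if absent
    except ClientError:
        client.create_bucket(Bucket=name)
```

Calling `ensure_bucket()` once after the container starts should be enough; the bucket can also be created by hand in the MinIO console on port 9001.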
Run MLflow Server
# Development mode (local file store)
mlflow server --host 0.0.0.0 --port 5000
# Production mode (with database and S3)
mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow@localhost:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000
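Once the server is running, it is worth confirming that it is reachable before pointing any training code at it. The sketch below polls the server's `/health` endpoint with the standard library only. It assumes the server listens on the host and port used above, and the function names are hypothetical.

```python
from urllib.request import urlopen


def health_url(tracking_uri: str) -> str:
    """Append the /health path to a tracking server base URL."""
    return tracking_uri.rstrip("/") + "/health"


def server_is_healthy(tracking_uri: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urlopen(health_url(tracking_uri), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False
```

For example, `server_is_healthy("http://localhost:5000")` can gate a training script so that it fails fast instead of retrying log calls against a server that is down.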
Set Up Environment Variables
Create a .env file:
# MLflow Tracking
MLFLOW_TRACKING_URI=http://localhost:5000
# S3/MinIO Configuration
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
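A .env file only takes effect once something exports its contents into the process environment. Packages such as python-dotenv are a common choice for this; the sketch below shows the idea with the standard library alone. It assumes simple `KEY=VALUE` lines like the ones above, and `parse_env` and `load_env` are illustrative names.

```python
import os


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path: str = ".env") -> None:
    """Export the variables from `path` without overriding existing ones."""
    with open(path, encoding="utf-8") as handle:
        for key, value in parse_env(handle.read()).items():
            os.environ.setdefault(key, value)
```

Using `setdefault` means a variable already set in the shell takes precedence over the file, which keeps per-session overrides possible.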
MLflow Tracking: Experiment Tracking
Basic Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
# Set tracking URI
mlflow.set_tracking_uri("http://localhost:5000")
# Set experiment
mlflow.set_experiment("iris-classification")
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
# Start MLflow run
with mlflow.start_run(run_name="random-forest-v1") as run:
    # Log parameters
    params = {
        "n_estimators": 100,
        "max_depth": 5,
        "random_state": 42
    }
    mlflow.log_params(params)
    # Train model (unpack the dict into keyword arguments)
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    # Make predictions
    y_pred = model.predict(X_test)
    # Log metrics
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred, average="weighted")
    }
    mlflow.log_metrics(metrics)
    # Log model and register it under a name in the Model Registry
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="iris-classifier"
    )
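The F1 metric logged above is computed with `average="weighted"`, which means each class's F1 score is weighted by how often that class occurs in the true labels. The pure-Python sketch below spells out that calculation so the logged number is easier to interpret. It is meant as an illustration of the formula, not a replacement for scikit-learn, and `weighted_f1` is a made-up name.

```python
from collections import Counter


def weighted_f1(y_true, y_pred) -> float:
    """Support-weighted mean of per-class F1 scores (y_true must be non-empty)."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        # The denominator is positive because the label occurs in y_true.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += count * f1
    return total / len(y_true)


if __name__ == "__main__":
    print(weighted_f1([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))
```

In the small example, the per-class scores are 1, 2/3, and 4/5, and all three classes have equal support, so the result is their plain average, 37/45.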