Tutorial 20: MLOps End-to-End Project
Introduction
MLOps is the discipline of deploying and maintaining machine learning models in production reliably and efficiently. While building an accurate model is important, the real challenge lies in everything around it: versioning data and code together, tracking experiments reproducibly, automating training and deployment pipelines, monitoring model performance in production, and responding to data drift.
This tutorial walks through a complete MLOps project from raw data to production monitoring. We will build a customer churn prediction system using industry-standard tools: DVC for data versioning, MLflow for experiment tracking and model registry, GitHub Actions for CI/CD, Docker for containerization, and Evidently for production monitoring.
Prerequisites
- Python 3.9+
- Git and GitHub account
- Docker and Docker Compose
- AWS CLI or equivalent cloud CLI (for deployment)
- Basic understanding of ML model training
# Install all required packages
# (the extra is quoted so shells such as zsh do not expand the brackets)
pip install "dvc[s3]" mlflow scikit-learn pandas evidently
pip install fastapi uvicorn docker boto3
import os
print("MLOps E2E Tutorial - Environment Setup")
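Before moving on, it can help to confirm that the packages from the install commands are actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `REQUIRED` list are illustrative and not part of the tutorial's project code.

```python
from importlib import metadata

# Distribution names taken from the pip commands above (illustrative list).
REQUIRED = ["dvc", "mlflow", "scikit-learn", "pandas", "evidently",
            "fastapi", "uvicorn", "docker", "boto3"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Any package reported as missing can be installed again with `pip` before continuing.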
Project Overview
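The project keeps each concern in its own top-level directory: CI/CD workflows, DVC-tracked data, source code (data preparation, feature engineering, modeling, monitoring), the serving application, tests, and configuration.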
churn-prediction/
├── .github/
│   └── workflows/
│       ├── train.yml
│       ├── test.yml
│       └── deploy.yml
├── data/
│   ├── raw/
│   │   └── customers.csv.dvc
│   └── processed/
│       └── features.csv.dvc
├── src/
│   ├── data/
│   │   ├── __init__.py
│   │   ├── prepare.py
│   │   └── validate.py
│   ├── features/
│   │   ├── __init__.py
│   │   └── buildfeatures.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── train.py
│   │   └── predict.py
│   └── monitoring/
│       ├── __init__.py
│       └── driftdetection.py
├── serving/
│   ├── app.py
│   ├── Dockerfile
│   └── requirements.txt
├── tests/
│   ├── testdata.py
│   ├── testmodel.py
│   └── testapi.py
├── configs/
│   └── config.yaml
├── dvc.yaml
├── dvc.lock
├── params.yaml
├── docker-compose.yml
└── requirements.txt
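If you are starting from an empty repository, the directory skeleton can be created with a few lines of Python. This is a sketch under assumptions: `scaffold`, `PACKAGE_DIRS`, and `PLAIN_DIRS` are hypothetical names, only directories and the Python package marker files (`__init__.py`) are created, and the individual scripts and configs are left for the later sections.

```python
from pathlib import Path

# Source directories that are Python packages and get an empty __init__.py.
PACKAGE_DIRS = ["src/data", "src/features", "src/models", "src/monitoring"]
# Directories that only hold data, configs, workflows, or non-package code.
PLAIN_DIRS = [".github/workflows", "data/raw", "data/processed",
              "serving", "tests", "configs"]


def scaffold(root):
    """Create the directory skeleton under `root` and return the created paths."""
    root = Path(root)
    created = []
    for rel in PLAIN_DIRS + PACKAGE_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    for rel in PACKAGE_DIRS:
        (root / rel / "__init__.py").touch()
    return created
```

Calling `scaffold("churn-prediction")` is idempotent, so it can be re-run without overwriting existing files.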
Data Versioning with DVC
DVC (Data Version Control) tracks large datasets and model files alongside your Git repository without storing them in Git itself.
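To make the idea concrete: when a file is tracked with DVC, Git holds only a small `.dvc` pointer file that records a content hash, size, and path for the data, while the data itself lives in the DVC cache and remote storage. The sketch below computes those fields for a file. It illustrates the concept and is not DVC's implementation; `file_fingerprint` is a hypothetical helper, and the hash DVC actually records can differ by DVC version and file type.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the MD5 hex digest, size in bytes, and file name for `path`."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large datasets never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"md5": digest.hexdigest(), "size": size, "path": Path(path).name}
```

Because the fingerprint depends only on file contents, any change to the data produces a different hash, which is what lets a Git commit pin an exact dataset version.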
Setting Up DVC
# Initialize DVC in your Git repository
cd churn-prediction
dvc init

# Configure remote storage (S3 in this example)
dvc remote add -d myremote s3://my-ml-bucket/dvc-store
dvc remote modify myremote region us-east-1

# Track data files
dvc add data/raw/customers.csv
git add data/raw/customers.csv.dvc data/raw/.gitignore
git commit -m "Track raw customer data with DVC"

# Push data to remote storage
dvc push
DVC Pipeline Definition
# dvc.yaml - Defines the reproducible ML pipeline
stages:
  prepare:
    cmd: python src/data/prepare.py
    deps:
      - src/data/prepare.py
      - data/raw/customers.csv
    params:
      - prepare.testsize
      - prepare.randomseed
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  featurize:
    cmd: python src/features/buildfeatures.py
    deps: