Tutorial 15: Docker for Data Science and Machine Learning
Introduction
Reproducibility is one of the greatest challenges in data science and machine learning. The notorious "it works on my machine" problem is amplified when projects depend on specific versions of Python, CUDA, cuDNN, system libraries, and dozens of Python packages. A model that trains perfectly on your laptop may fail spectacularly on a colleague's workstation or in a cloud environment.
Docker addresses this by packaging your code, dependencies, and runtime environment into a portable, self-contained unit called a container. Because the same image runs in local development, CI/CD, and production serving, the software environment of your ML pipeline stays the same at every stage. Host-level factors such as GPU drivers and hardware remain outside the container and still need to match.
This tutorial covers everything you need to containerize ML workflows: writing efficient Dockerfiles, leveraging multi-stage builds, enabling GPU acceleration, orchestrating multi-service stacks with Docker Compose, and integrating containers into CI/CD pipelines.
Prerequisites
- Docker Engine 24.0 or later installed
- Docker Compose v2 installed
- Basic familiarity with Linux command line
- Python 3.9+ and pip
- (Optional) NVIDIA GPU with drivers installed for GPU sections
# Verify Docker installation
docker --version
docker compose version
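The version requirements above can also be checked from a script, for example at the start of a setup routine. The following is a minimal sketch: the helper names `parse_version`, `meets_minimum`, and `check_local_docker` are made up for illustration, and the parsing assumes the usual output shapes `Docker version 24.0.7, build afdd53b` and `Docker Compose version v2.23.0`.

```python
import re
import subprocess


def parse_version(output: str) -> tuple[int, int]:
    """Extract (major, minor) from the first dotted number in the text."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(output: str, minimum: tuple[int, int]) -> bool:
    """Return True if the version in `output` is at least `minimum`."""
    return parse_version(output) >= minimum


def check_local_docker() -> bool:
    """Run the two commands above and compare against this tutorial's minimums."""
    engine = subprocess.run(
        ["docker", "--version"], capture_output=True, text=True, check=True
    ).stdout
    compose = subprocess.run(
        ["docker", "compose", "version"], capture_output=True, text=True, check=True
    ).stdout
    return meets_minimum(engine, (24, 0)) and meets_minimum(compose, (2, 0))
```

Calling `check_local_docker()` requires Docker on the `PATH`; the two parsing helpers work on plain strings, so they can be tested without Docker installed.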
Docker Fundamentals for ML Engineers
Key Concepts
- Images are read-only templates that define the environment. They are built from Dockerfiles and stored in registries (Docker Hub, AWS ECR, GCP Artifact Registry).
- Containers are running instances of images. They are isolated, ephemeral processes with their own filesystem, network, and process space.
- Layers are the building blocks of images. Instructions that change the filesystem (RUN, COPY, ADD) each add a layer, while other instructions only record metadata. Docker caches the result of each instruction, so unchanged steps are reused during rebuilds, which is critical for fast iteration.
- Registries store and distribute Docker images. Docker Hub is the best-known public registry; private registries are used for proprietary ML models and code.
Why Docker Matters for ML
| Challenge | Docker Solution |
|-----------|----------------|
| Dependency conflicts | Isolated environments per project |
| Python version mismatches | Specify exact Python version in image |
| CUDA/cuDNN compatibility | Use NVIDIA base images with matching versions |
| Reproducible training | Same image provides the same software environment on any host (random seeds and hardware still need to be controlled) |
| Deployment consistency | Dev, staging, and production use the same image |
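Layer caching, introduced under Key Concepts, is also why the order of Dockerfile instructions matters for rebuild speed. The sketch below is a deliberately simplified model, not Docker's actual implementation: it assumes each step's cache key is a hash of the previous key, the instruction text, and (for COPY) the copied file contents. The function names and the sample files are hypothetical.

```python
import hashlib


def layer_keys(instructions, file_contents):
    """Return one cache key per instruction under the simplified model."""
    keys = []
    parent = ""
    for instruction in instructions:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(instruction.encode())
        if instruction.startswith("COPY "):
            source = instruction.split()[1]
            # "COPY . ." covers every file; otherwise only the named file.
            for name in sorted(file_contents):
                if source == "." or name == source:
                    digest.update(name.encode())
                    digest.update(file_contents[name].encode())
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old_keys, new_keys):
    """Count leading steps whose keys are unchanged (cache hits)."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count


deps_first = [
    "FROM python:3.11-slim",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY . .",
]
code_first = [
    "FROM python:3.11-slim",
    "COPY . .",
    "RUN pip install -r requirements.txt",
]

# Only the training script changes between the two builds.
before = {"requirements.txt": "numpy==1.26.4\n", "train.py": "print('v1')\n"}
after = {"requirements.txt": "numpy==1.26.4\n", "train.py": "print('v2')\n"}

print(reused_layers(layer_keys(deps_first, before), layer_keys(deps_first, after)))  # 3
print(reused_layers(layer_keys(code_first, before), layer_keys(code_first, after)))  # 1
```

In this model, copying `requirements.txt` before the rest of the code keeps the dependency-install step cached when only source files change, whereas copying everything first invalidates it on every code edit.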
Writing Dockerfiles for ML Projects
Basic ML Dockerfile
# Use an official Python runtime as the base image
FROM python:3.11-slim
# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1
# Set the working directory
WORKDIR /app
# Install system dependencies (common for ML libraries)
# libgl1 is used here because libgl1-mesa-glx is no longer packaged in current Debian releases
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libgomp1 \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgl1 \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Copy and install Python dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application code
COPY . .
# Document the port the API listens on (publish it with -p at run time)
EXPOSE 8000