Tutorial 15: Docker for Data Science and Machine Learning

Table of Contents

Introduction

Prerequisites

Docker Fundamentals for ML Engineers

Writing Dockerfiles for ML Projects

Multi-Stage Builds for Optimized Images

GPU Support with NVIDIA Docker

Docker Compose for ML Stacks

Volume Mounts for Data and Models

Environment and Dependency Management

CI/CD Integration

Best Practices

Conclusion

Introduction

Reproducibility is one of the greatest challenges in data science and machine learning. The notorious "it works on my machine" problem is amplified when projects depend on specific versions of Python, CUDA, cuDNN, system libraries, and dozens of Python packages. A model that trains perfectly on your laptop may fail spectacularly on a colleague's workstation or in a cloud environment.

Docker addresses this by packaging your code, dependencies, and runtime environment into an image, which runs as a portable, self-contained unit called a container. The same software environment then runs everywhere, from local development to CI/CD to production serving. Only the host kernel and hardware-specific pieces such as GPU drivers remain outside the image.

This tutorial covers everything you need to containerize ML workflows: writing efficient Dockerfiles, leveraging multi-stage builds, enabling GPU acceleration, orchestrating multi-service stacks with Docker Compose, and integrating containers into CI/CD pipelines.

Prerequisites

Docker Engine 24.0 or later installed
Docker Compose v2 installed
Basic familiarity with Linux command line
Python 3.9+ and pip
(Optional) NVIDIA GPU with drivers installed for GPU sections

# Verify Docker installation
docker --version
docker compose version
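The same check can be scripted, for example in a project setup script. The following is a minimal sketch, assuming `docker --version` prints a line of the form `Docker version 24.0.7, build afdd53b`. The helper names `parse_docker_version`, `docker_meets_minimum`, and `installed_docker_ok` are hypothetical and not part of any Docker tooling.

```python
import re
import shutil
import subprocess

MINIMUM_VERSION = (24, 0)  # taken from the prerequisites list above


def parse_docker_version(output: str) -> tuple[int, int]:
    """Extract (major, minor) from the text printed by `docker --version`."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return int(match.group(1)), int(match.group(2))


def docker_meets_minimum(output: str, minimum: tuple[int, int] = MINIMUM_VERSION) -> bool:
    """Return True when the reported version is at least `minimum`."""
    return parse_docker_version(output) >= minimum


def installed_docker_ok() -> bool:
    """Run `docker --version` locally; False if docker is missing or too old."""
    if shutil.which("docker") is None:
        return False
    result = subprocess.run(["docker", "--version"], capture_output=True, text=True)
    return result.returncode == 0 and docker_meets_minimum(result.stdout)
```

Calling `installed_docker_ok()` from a setup script gives an early, readable failure instead of a confusing error later in the build.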

Docker Fundamentals for ML Engineers

Key Concepts

Images are read-only templates that define the environment. They are built from Dockerfiles and stored in registries (Docker Hub, AWS ECR, GCP Artifact Registry).

Containers are running instances of images. They are isolated, ephemeral processes with their own filesystem, network, and process space.

Layers are the building blocks of images. Each instruction in a Dockerfile creates a new layer. Docker caches layers, so unchanged layers are reused during rebuilds, which is critical for fast iteration.

Registries store and distribute Docker images. Public registries include Docker Hub, while private registries are used for proprietary ML models and code.
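The layer caching described above can be illustrated with a toy model. This is a simplified sketch, not Docker's actual implementation: it assumes a layer's cache key depends only on its parent's key, the instruction text, and (for COPY) the bytes of the copied files. The function names and the file contents are hypothetical.

```python
import hashlib


def layer_keys(instructions, file_contents):
    """Return one cache key per instruction.

    Toy model: a key hashes the parent key, the instruction text, and,
    for COPY, the bytes of the copied source files.
    """
    keys = []
    parent = ""
    for instruction in instructions:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(instruction.encode())
        if instruction.startswith("COPY "):
            source = instruction.split()[1]
            # "." stands for every file in the build context in this model
            names = sorted(file_contents) if source == "." else [source]
            for name in names:
                digest.update(name.encode())
                digest.update(file_contents[name])
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old_keys, new_keys):
    """Count leading layers whose keys are unchanged (cache hits)."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count


DOCKERFILE = [
    "FROM python:3.11-slim",
    "WORKDIR /app",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY . .",
]

before = {"requirements.txt": b"numpy==1.26.4\n", "train.py": b"print('v1')\n"}
after_code_edit = {**before, "train.py": b"print('v2')\n"}
after_dep_edit = {**before, "requirements.txt": b"numpy==2.0.0\n"}

base = layer_keys(DOCKERFILE, before)
print(reused_layers(base, layer_keys(DOCKERFILE, after_code_edit)))  # 4
print(reused_layers(base, layer_keys(DOCKERFILE, after_dep_edit)))   # 2
```

In this model, editing only `train.py` keeps the first four layers cached, including the slow `pip install` step, while editing `requirements.txt` invalidates everything from the third layer onward.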

Why Docker Matters for ML

| Challenge | Docker Solution |
|-----------|----------------|
| Dependency conflicts | Isolated environments per project |
| Python version mismatches | Specify exact Python version in image |
| CUDA/cuDNN compatibility | Use NVIDIA base images with matching versions |
| Reproducible training | Same image provides the same software stack on any host |
| Deployment consistency | Dev, staging, and production use the same image |
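One way to confirm that two environments really match, for instance a laptop and a container, is to print a short report in each and compare the output. The sketch below uses only the Python standard library. The function names and the package names passed in are placeholders to swap for your own.

```python
import json
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Collect interpreter, OS, and package versions for the given names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": sys.implementation.name,
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": versions,
    }


def differences(report_a, report_b):
    """Return the top-level keys whose values differ between two reports."""
    return sorted(key for key in report_a if report_a[key] != report_b.get(key))


# Placeholder package names; replace with the ones your project depends on.
print(json.dumps(environment_report(["numpy", "scikit-learn"]), indent=2))
```

Running the script on the host and again inside the container, then passing both reports to `differences`, shows which parts of the environment diverge.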

Writing Dockerfiles for ML Projects

Basic ML Dockerfile

# Use an official Python runtime as the base image
FROM python:3.11-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

# Set the working directory
WORKDIR /app

# Install system dependencies (common for ML libraries)
# libgl1 replaces the transitional libgl1-mesa-glx package, which newer Debian releases no longer provide
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libgomp1 \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgl1 \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Expose the port for the API
EXPOSE 8000
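Because the `pip install` layer above is cached based on the contents of `requirements.txt`, a loose specifier such as `pandas>=2.0` can resolve to different versions depending on when that layer was last rebuilt. The following sketch flags requirement lines that are not pinned with `==`. It is a simplified check under stated assumptions: it ignores pip options, URLs, and hashes, and the function name and example file contents are hypothetical.

```python
import re

# Matches "name==version" with optional extras, e.g. "uvicorn[standard]==0.30.1".
# Wildcards such as "==1.*" are deliberately not treated as pinned.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;=*]+\s*(;.*)?$")


def unpinned_requirements(text):
    """Return requirement lines that lack an exact `==` version pin."""
    flagged = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            flagged.append(line)
    return flagged


example = """\
# hypothetical requirements.txt
numpy==1.26.4
pandas>=2.0
scikit-learn
uvicorn[standard]==0.30.1
"""
print(unpinned_requirements(example))  # ['pandas>=2.0', 'scikit-learn']
```

A check like this could run before `docker build`, so that a rebuild of the dependency layer installs the same package versions each time.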
