Complete DVC Tutorial: Git for Data and Machine Learning Models

By Ruby Abdullah · tutorial
Tags: DVC, MLOps, Data Version Control, Machine Learning, Git, Pipeline

DVC (Data Version Control) is an open-source tool that extends Git to handle data, models, and ML experiments. With DVC, you can version control large datasets and models without storing them directly in your Git repository.

Why DVC?

Common problems in ML projects:

  • Large files: Git is inefficient with large files such as datasets and trained models
  • Reproducibility: Experiments are hard to reproduce when the data behind them is not versioned
  • Collaboration: Sharing large datasets across teams is challenging
  • Storage: Large datasets bloat the Git repository and slow down clones

DVC Solutions:
  • Version control for data and models
  • Remote storage (S3, GCS, Azure, SSH)
  • Pipeline reproducibility
  • Experiment tracking
  • Seamless integration with Git

Installation

# Install DVC
pip install dvc

# Install with remote storage support
pip install "dvc[s3]"     # AWS S3
pip install "dvc[gs]"     # Google Cloud Storage
pip install "dvc[azure]"  # Azure Blob Storage
pip install "dvc[ssh]"    # SSH/SFTP
pip install "dvc[all]"    # All remotes

# Verify installation
dvc version

Quick Start

1. Initialize DVC

# Make sure you're in a Git repository
git init
dvc init

# DVC will create:
# - .dvc/ directory
# - .dvcignore file

# New files reported by git status:
git status
# .dvc/.gitignore
# .dvc/config
# .dvcignore

git add .dvc .dvcignore
git commit -m "Initialize DVC"
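
When these steps run inside a script or CI job, it can help to confirm first that the working directory holds both a Git repository and an initialized DVC project. The following is a minimal sketch, not part of DVC itself: `is_dvc_project` is a hypothetical helper that only checks for the `.git` and `.dvc` entries at a given path.

```python
from pathlib import Path


def is_dvc_project(root="."):
    """Return True if `root` holds both a .git entry and a .dvc directory.

    Hypothetical helper for illustration; it only inspects the filesystem
    and does not validate the contents of either entry.
    """
    root = Path(root)
    # .git is a directory in a normal clone but a file in worktrees/submodules.
    has_git = (root / ".git").exists()
    has_dvc = (root / ".dvc").is_dir()
    return has_git and has_dvc


if __name__ == "__main__":
    if is_dvc_project():
        print("DVC project detected")
    else:
        print("Run `git init` and `dvc init` first")
```

The check is deliberately shallow; running `dvc status` would be a stricter test, at the cost of requiring DVC on the PATH.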

2. Track Data Files

# Add data file to DVC
dvc add data/dataset.csv

# DVC will:
# - Create data/dataset.csv.dvc (metadata file)
# - Update .gitignore

# Add .dvc file to Git
git add data/dataset.csv.dvc data/.gitignore
git commit -m "Add dataset"

# For folders
dvc add data/images/
git add data/images.dvc data/.gitignore
git commit -m "Add images folder"
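
The metadata file that `dvc add` writes stays small because it describes the data rather than containing it. As a rough illustration, the sketch below computes the kind of values such a file typically records for a single tracked file: a content hash, a size in bytes, and the file name. `describe_file` is a hypothetical helper, and treating the hash as a plain MD5 of the file's bytes is an assumption made here; DVC's exact hashing can differ between versions.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path, chunk_size=1024 * 1024):
    """Return md5, size and name for `path`, read in chunks to limit memory use.

    Hypothetical helper: it mirrors the idea of a .dvc metadata entry,
    not DVC's actual implementation.
    """
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"md5": digest.hexdigest(), "size": size, "path": Path(path).name}


if __name__ == "__main__":
    # Demo on a throwaway file so the example runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "dataset.csv"
        sample.write_bytes(b"id,value\n1,10\n")
        print(describe_file(sample))
```

Because the hash depends only on content, two identical files yield the same entry, which is what lets Git track a few lines of text while the data itself lives elsewhere.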

3. Setup Remote Storage

# Configure one default remote (-d); pick the backend you use

# AWS S3
dvc remote add -d myremote s3://mybucket/dvc-storage

# Google Cloud Storage
dvc remote add -d myremote gs://mybucket/dvc-storage

# Azure Blob Storage
dvc remote add -d myremote azure://mycontainer/dvc-storage

# Local/Network storage
dvc remote add -d myremote /path/to/storage

# SSH
dvc remote add -d myremote ssh://user@server/path/to/storage

# Commit remote config
git add .dvc/config
git commit -m "Configure remote storage"

4. Push and Pull Data

# Push data to remote
dvc push

# Pull data from remote
dvc pull

# Push/pull a specific file
dvc push data/dataset.csv.dvc
dvc pull data/dataset.csv.dvc

Working with Data Versions

1. Update Data

# After the data changes, re-add it to record the new version
dvc add data/dataset.csv
git add data/dataset.csv.dvc
git commit -m "Update dataset v2"
dvc push

2. Switch Data Versions

# Check out the previous version
git checkout HEAD~1 data/dataset.csv.dvc
dvc checkout

# Return to the latest version
git checkout HEAD data/dataset.csv.dvc
dvc checkout
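
Switching versions is always the same two-step pattern: restore the metadata file from a Git revision, then let DVC sync the data to match it. A minimal sketch of wrapping that pattern in Python follows; `switch_commands` and `switch_version` are hypothetical helper names, and the script assumes `git` and `dvc` are on the PATH when it is not run in dry-run mode.

```python
import subprocess


def switch_commands(rev, dvc_file):
    """Build the two commands that restore `dvc_file` as of Git revision `rev`."""
    return [
        ["git", "checkout", rev, "--", dvc_file],  # restore the metadata file
        ["dvc", "checkout", dvc_file],             # sync the data to match it
    ]


def switch_version(rev, dvc_file, dry_run=False):
    """Run the commands from switch_commands, or only print them if dry_run."""
    for cmd in switch_commands(rev, dvc_file):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    switch_version("HEAD~1", "data/dataset.csv.dvc", dry_run=True)
```

Passing the `.dvc` file to `dvc checkout` limits the sync to that one target instead of the whole workspace.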

3. View Data History

# View history
git log --oneline data/dataset.csv.dvc

# Diff between versions
dvc diff HEAD~1

# Metrics diff
dvc metrics diff HEAD~1
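
For scripted checks, `dvc diff` can also print JSON via its `--json` flag. The sketch below counts entries per change type in such output. `count_changes` is a hypothetical helper, and the payload shape (an object mapping change types to lists of entries) is an assumption to verify against your DVC version; the sample string is hand-written for illustration and is not real DVC output.

```python
import json


def count_changes(diff_json):
    """Count entries per change type in JSON text from `dvc diff --json`.

    Assumes the payload is an object mapping change types (for example
    "added" or "modified") to lists of entries; other values are ignored.
    """
    payload = json.loads(diff_json) if diff_json.strip() else {}
    return {
        change: len(entries)
        for change, entries in payload.items()
        if isinstance(entries, list)
    }


if __name__ == "__main__":
    # Hypothetical sample, hand-written for illustration (not real DVC output).
    sample = (
        '{"added": [{"path": "data/new.csv"}], '
        '"modified": [{"path": "data/dataset.csv"}], '
        '"deleted": []}'
    )
    print(count_changes(sample))
```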

DVC Pipelines

1. Creating a Pipeline

A pipeline is defined in dvc.yaml:

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    outs:
      - data/processed/

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
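
A stage that lists a JSON file under `metrics` is expected to produce that file when its command runs. The tutorial does not show the training script itself, so the following is only a minimal sketch of what its final step might look like: `write_metrics`, the example file name, and the metric names are hypothetical, and the values are placeholders rather than results.

```python
import json
from pathlib import Path


def write_metrics(path, metrics):
    """Write a dict of metrics as JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(metrics, fh, indent=2, sort_keys=True)
        fh.write("\n")
    return path


if __name__ == "__main__":
    # Placeholder values for illustration only, not results from a real run.
    out = write_metrics("metrics/example_metrics.json", {"accuracy": 0.0, "f1": 0.0})
    print(f"wrote {out}")
```

Keeping the file small and flat makes it easy to compare between commits with `dvc metrics diff`.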
