Complete DVC Tutorial: Git for Data and Machine Learning Models
DVC (Data Version Control) is an open-source tool that extends Git to handle data, models, and ML experiments. With DVC, you can version control large datasets and models without storing them directly in your Git repository.
Why DVC?
Common problems in ML projects:
- Large files: Git is not efficient for large files (datasets, models)
- Reproducibility: Difficult to reproduce experiments without data version tracking
- Collaboration: Sharing large datasets across teams is challenging
- Storage: Large datasets bloat the Git repository
DVC addresses these problems with:
- Version control for data and models
- Remote storage (S3, GCS, Azure, SSH)
- Pipeline reproducibility
- Experiment tracking
- Seamless integration with Git
Installation
# Install DVC
pip install dvc
# Install with remote storage support
pip install "dvc[s3]" # AWS S3
pip install "dvc[gs]" # Google Cloud Storage
pip install "dvc[azure]" # Azure Blob Storage
pip install "dvc[ssh]" # SSH/SFTP
pip install "dvc[all]" # All remotes
# Verify installation
dvc version
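A script or CI job that depends on DVC can check for the executable before doing any work. The following is a minimal Python sketch, assuming `dvc` is expected on the PATH; the helper name `dvc_version` is made up for this example.

```python
import shutil
import subprocess
from typing import Optional


def dvc_version() -> Optional[str]:
    """Return the output of `dvc version`, or None if dvc is not on PATH."""
    exe = shutil.which("dvc")
    if exe is None:
        return None
    # check=False: return whatever dvc prints instead of raising on a non-zero exit
    result = subprocess.run(
        [exe, "version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


if __name__ == "__main__":
    info = dvc_version()
    print(info if info is not None else "dvc not found; run: pip install dvc")
```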
Quick Start
1. Initialize DVC
# Make sure you're in a Git repository
git init
dvc init
# DVC will create:
# - .dvc/ directory
# - .dvcignore file
git status
# .dvc/.gitignore
# .dvc/config
# .dvcignore
git add .dvc .dvcignore
git commit -m "Initialize DVC"
2. Track Data Files
# Add data file to DVC
dvc add data/dataset.csv
# DVC will:
# - Create data/dataset.csv.dvc (metadata file)
# - Update data/.gitignore
# Add the .dvc file to Git
git add data/dataset.csv.dvc data/.gitignore
git commit -m "Add dataset"
# For folders
dvc add data/images/
git add data/images.dvc data/.gitignore
git commit -m "Add images folder"
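Conceptually, `dvc add` replaces a large file in Git with a small pointer: the file's MD5 content hash is recorded in the .dvc file, and the data itself goes into a cache keyed by that hash. The sketch below mimics the idea in plain Python. It is an illustration, not DVC's implementation: `toy_add`, the `.pointer` suffix, and the cache layout are assumptions made for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large dataset never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a hash-addressed cache and write a pointer file.

    Hypothetical layout: <cache_dir>/<first 2 hex chars>/<remaining chars>.
    """
    md5 = file_md5(data_file)
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(
        f"md5: {md5}\nsize: {data_file.stat().st_size}\npath: {data_file.name}\n"
    )
    return pointer


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "dataset.csv"
        data.write_text("id,label\n1,cat\n2,dog\n")
        print(toy_add(data, root / "cache").read_text())
```

Only the small pointer file would be committed to Git. Changing a single byte of the data changes the hash, which is what makes each data version distinguishable in Git history.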
3. Setup Remote Storage
# AWS S3
dvc remote add -d myremote s3://mybucket/dvc-storage
# Google Cloud Storage
dvc remote add -d myremote gs://mybucket/dvc-storage
# Azure Blob Storage
dvc remote add -d myremote azure://mycontainer/dvc-storage
# Local/Network storage
dvc remote add -d myremote /path/to/storage
# SSH
dvc remote add -d myremote ssh://user@server/path/to/storage
# Commit remote config
git add .dvc/config
git commit -m "Configure remote storage"
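Each remote URL scheme above lines up with one of the optional extras from the Installation section. The sketch below makes that mapping explicit so a setup script can tell which extra a given remote needs. The function name `required_extra` is hypothetical, and the table only covers the schemes shown in this tutorial.

```python
from typing import Optional
from urllib.parse import urlparse

# Scheme -> pip extra, taken from the install commands in this tutorial.
SCHEME_TO_EXTRA = {"s3": "s3", "gs": "gs", "azure": "azure", "ssh": "ssh"}


def required_extra(remote_url: str) -> Optional[str]:
    """Return the dvc extra a remote URL needs, or None for a plain path."""
    scheme = urlparse(remote_url).scheme
    if scheme == "":
        return None  # local or network path such as /path/to/storage
    if scheme not in SCHEME_TO_EXTRA:
        raise ValueError(f"scheme {scheme!r} is not covered by this sketch")
    return SCHEME_TO_EXTRA[scheme]


if __name__ == "__main__":
    for url in ("s3://mybucket/dvc-storage", "/path/to/storage"):
        extra = required_extra(url)
        package = f"dvc[{extra}]" if extra else "dvc"
        print(f"{url}: pip install \"{package}\"")
```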
4. Push and Pull Data
# Push data to remote
dvc push
# Pull data from remote
dvc pull
# Push/pull a specific file
dvc push data/dataset.csv.dvc
dvc pull data/dataset.csv.dvc
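Because cached objects are identified by content hash, push and pull only need to move what the other side lacks. The sketch below expresses that as a set difference. It is a simplified model rather than DVC's transfer logic, and the function names and short hashes are made up for illustration.

```python
from typing import Iterable, List


def plan_push(local_hashes: Iterable[str], remote_hashes: Iterable[str]) -> List[str]:
    """Return object hashes present in the local cache but not on the remote."""
    return sorted(set(local_hashes) - set(remote_hashes))


def plan_pull(needed_hashes: Iterable[str], local_hashes: Iterable[str]) -> List[str]:
    """Return hashes referenced by .dvc files that the local cache is missing."""
    return sorted(set(needed_hashes) - set(local_hashes))


if __name__ == "__main__":
    local = {"a1", "b2", "c3"}   # hypothetical hashes in the local cache
    remote = {"a1"}              # hypothetical hashes already on the remote
    print("upload:", plan_push(local, remote))
    print("download:", plan_pull({"a1", "d4"}, local))
```

Under this model, pushing twice in a row transfers nothing the second time, since the difference is empty.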
Working with Data Versions
1. Update Data
# After data changes
dvc add data/dataset.csv
git add data/dataset.csv.dvc
git commit -m "Update dataset v2"
dvc push
2. Switch Data Versions
# Check out the previous version
git checkout HEAD~1 data/dataset.csv.dvc
dvc checkout
# Return to latest
git checkout HEAD data/dataset.csv.dvc
dvc checkout
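Switching versions always takes two steps: Git restores the pointer file, then DVC syncs the data to match it. A small wrapper keeps the two from drifting apart. This is a sketch with hypothetical helper names; it defaults to a dry run that only returns the commands, and running it for real assumes both `git` and `dvc` are on the PATH.

```python
import subprocess
from typing import List


def switch_commands(rev: str, dvc_file: str) -> List[List[str]]:
    """Build the Git and DVC commands that switch one tracked file to `rev`."""
    return [
        ["git", "checkout", rev, "--", dvc_file],  # restore the pointer file
        ["dvc", "checkout", dvc_file],             # sync the data to the pointer
    ]


def switch_data_version(rev: str, dvc_file: str, dry_run: bool = True) -> List[List[str]]:
    """Return the commands; execute them in order when dry_run is False."""
    commands = switch_commands(rev, dvc_file)
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing command
    return commands


if __name__ == "__main__":
    for cmd in switch_data_version("HEAD~1", "data/dataset.csv.dvc"):
        print(" ".join(cmd))
```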
3. View Data History
# View history
git log --oneline data/dataset.csv.dvc
# Diff between versions
dvc diff HEAD~1
# Metrics diff
dvc metrics diff HEAD~1
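A metrics diff comes down to comparing the values stored in a metrics file at two revisions. The sketch below shows that comparison on two in-memory dictionaries. The metric names and values are invented, and `metrics_delta` is a hypothetical helper, not part of DVC.

```python
import json


def is_number(value) -> bool:
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def metrics_delta(old: dict, new: dict) -> dict:
    """Map each metric name to (old value, new value, change or None)."""
    rows = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        # A change is only defined when the metric is numeric on both sides.
        change = after - before if is_number(before) and is_number(after) else None
        rows[name] = (before, after, change)
    return rows


if __name__ == "__main__":
    # Hypothetical metrics from two revisions of the same metrics file.
    old = json.loads('{"accuracy": 0.88, "f1": 0.81}')
    new = json.loads('{"accuracy": 0.91, "f1": 0.84, "auc": 0.95}')
    for name, (before, after, change) in metrics_delta(old, new).items():
        shown = "n/a" if change is None else f"{change:+.4f}"
        print(f"{name:10} {before!s:>6} {after!s:>6} {shown:>8}")
```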
DVC Pipelines
1. Creating a Pipeline
A pipeline is defined in dvc.yaml:
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.nestimators
      - train.maxdepth
    outs:
      - models/model.pkl
    metrics:
      - metrics/trainmetrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps: