# Data Versioning with DVC
DVC (Data Version Control) brings the power of Git to your data and ML pipelines. While Git tracks code, DVC tracks datasets, model files, and pipeline artifacts — without storing them in Git. It's the missing piece that makes ML experiments truly reproducible.
## Why Version Data?
Consider a typical scenario: your model's accuracy dropped from 94% to 89% after a data refresh. What changed?
- Was it the new rows added to the training set?
- Was it a change in the preprocessing script?
- Was it a hyperparameter tweak you forgot to record?
Without data versioning, you're flying blind. DVC solves this by:
- Tracking exact dataset versions alongside code commits
- Storing large files in cloud storage (not Git)
- Reproducing any experiment from its exact inputs
- Comparing results across different data/code versions
## DVC Init and Track
### Installation and Setup

```bash
# Install DVC with GCS support
pip install "dvc[gs]"

# Initialize DVC in your project (creates .dvc/ directory)
cd my-ml-project
dvc init

# Commit the DVC configuration to Git
git add .dvc/.gitignore .dvc/config
git commit -m "Initialize DVC"
```
### Tracking Files and Directories

```bash
# Track a dataset file
dvc add data/raw/training_data.csv

# This creates training_data.csv.dvc (metadata) and updates .gitignore.
# The .dvc file contains the hash and size — NOT the actual data.
git add data/raw/training_data.csv.dvc data/raw/.gitignore
git commit -m "Track training_data.csv with DVC"

# Track a directory (creates data/processed.dvc)
dvc add data/processed/

# Track a model file
dvc add models/random_forest_v1.joblib

# Commit the tracking files
git add data/processed.dvc models/random_forest_v1.joblib.dvc
git commit -m "Track processed data and model v1"
```
### What DVC Actually Stores

The .dvc file is a small YAML metadata file:

```yaml
# training_data.csv.dvc
outs:
- md5: a1b2c3d4e5f67890a1b2c3d4e5f67890
  size: 52428800
  hash: md5
  path: training_data.csv
```
The actual data is stored in DVC cache (.dvc/cache/) and pushed to remote storage. Git only tracks the .dvc metadata file — keeping your repository lightweight.
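Content addressing is what makes this scheme work: a file's hash determines its location in the cache, so identical files are stored only once. A toy sketch of the idea (the `cache_path` helper is hypothetical, for illustration only, not DVC's actual code):

```python
import hashlib
from pathlib import Path

def cache_path(data: bytes, cache_dir: str = ".dvc/cache/files/md5") -> str:
    """Toy illustration of content addressing: the file's hash decides
    where it lives in the cache. (DVC's real layout also handles
    directories, and older DVC versions used a different cache prefix.)"""
    digest = hashlib.md5(data).hexdigest()
    # Shard the cache by the first two hex characters of the hash
    return str(Path(cache_dir) / digest[:2] / digest[2:])

print(cache_path(b"feature1,feature2,label\n"))
```

Because the path is derived from the content, re-adding an unchanged file is a no-op, and two datasets sharing files share cache entries.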
## GCS Remote Storage
DVC needs a remote storage location to share data with your team. Google Cloud Storage (GCS) is a natural choice for GCP projects.
### Setting Up a GCS Remote

```bash
# Create a GCS bucket for DVC storage
gsutil mb -l us-central1 gs://my-dvc-store/

# Add the GCS bucket as the default DVC remote
dvc remote add -d myremote gs://my-dvc-store/

# Optional: point the remote at a subdirectory instead
dvc remote modify myremote url gs://my-dvc-store/dvc-cache

# Verify the remote configuration
dvc remote list

# Commit the remote config to Git
git add .dvc/config
git commit -m "Add GCS remote for DVC"
```
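After these commands, the Git-tracked `.dvc/config` should look roughly like this (the exact contents depend on your DVC version):

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = gs://my-dvc-store/dvc-cache
```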
### Pushing and Pulling Data

```bash
# Push tracked data to the GCS remote
dvc push

# Pull data from the remote (e.g., on a new machine or in CI)
dvc pull

# Push a specific file
dvc push data/raw/training_data.csv.dvc

# Pull a specific file
dvc pull models/random_forest_v1.joblib.dvc
```
### Workflow: Team Collaboration

```bash
# Developer A: adds new data and pushes
dvc add data/raw/v2_data.csv
git add data/raw/v2_data.csv.dvc
git commit -m "Add v2 training data"
git push    # code + .dvc pointers to the Git remote
dvc push    # data to GCS

# Developer B: pulls the updated data
git pull
dvc pull    # fetches v2_data.csv from GCS
```
## DVC Pipelines (Stages and Metrics)
DVC Pipelines define your ML workflow as a series of stages with explicit dependencies. This makes the entire workflow reproducible with a single command.
### Defining a Pipeline with dvc.yaml

```yaml
# dvc.yaml
stages:
  data_prep:
    cmd: python src/prepare_data.py --input data/raw/training_data.csv --output data/processed/
    deps:
      - src/prepare_data.py
      - data/raw/training_data.csv
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
    params:
      - prepare.test_size
      - prepare.random_state

  train:
    cmd: python src/train.py --train data/processed/train.csv --model models/model.joblib
    deps:
      - src/train.py
      - data/processed/train.csv
    outs:
      - models/model.joblib
    params:
      - train.n_estimators
      - train.max_depth

  evaluate:
    cmd: python src/evaluate.py --model models/model.joblib --test data/processed/test.csv --metrics metrics.json
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/processed/test.csv
    metrics:
      - metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.png:
          cache: false
```
### Parameters File

```yaml
# params.yaml
prepare:
  test_size: 0.2
  random_state: 42
train:
  n_estimators: 100
  max_depth: 5
  learning_rate: 0.1
evaluate:
  threshold: 0.5
```
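Stage scripts read these values from params.yaml themselves. A minimal sketch, assuming PyYAML is installed (the inline string stands in for the real file):

```python
import yaml  # PyYAML

# In a real stage script you would use open("params.yaml") instead
PARAMS_YAML = """\
prepare:
  test_size: 0.2
  random_state: 42
train:
  n_estimators: 100
  max_depth: 5
"""

params = yaml.safe_load(PARAMS_YAML)
test_size = params["prepare"]["test_size"]
n_estimators = params["train"]["n_estimators"]
print(test_size, n_estimators)  # -> 0.2 100
```

Because the stage declares `prepare.test_size` etc. under `params:` in dvc.yaml, DVC also watches exactly these keys for changes.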
### Running the Pipeline

```bash
# Run the entire pipeline
dvc repro

# Reproduce a specific stage (plus any stale upstream stages)
dvc repro train

# Force a re-run (ignore the cache)
dvc repro --force

# Dry run (show what would execute)
dvc repro --dry
```
### How DVC Determines What to Run

DVC uses dependency tracking to decide which stages need to re-run:

- Each stage's dependencies (code, data, params) are hashed and recorded in dvc.lock
- If any dependency's hash changed since the last run → re-run the stage
- If a re-run stage produces different outputs → downstream stages re-run too

This means that if you only change train.max_depth in params.yaml, DVC skips data_prep and re-runs only train and evaluate.
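The skip logic boils down to comparing current hashes against the hashes recorded in dvc.lock. A toy sketch of the idea (in-memory bytes stand in for real files; this is not DVC's actual implementation):

```python
import hashlib

def digest(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

def stage_is_stale(deps: dict, lock: dict) -> bool:
    """A stage re-runs iff any dependency's current hash differs
    from the hash recorded after the last successful run."""
    return any(digest(content) != lock.get(path)
               for path, content in deps.items())

# Hashes recorded at the last run (think dvc.lock)
lock = {
    "params.yaml": digest(b"train:\n  max_depth: 5\n"),
    "src/train.py": digest(b"print('training')\n"),
}

# Only the parameter file changed -> the stage is stale
changed = {"params.yaml": b"train:\n  max_depth: 8\n",
           "src/train.py": b"print('training')\n"}
print(stage_is_stale(changed, lock))  # -> True

# Nothing changed -> the stage is skipped
unchanged = {"params.yaml": b"train:\n  max_depth: 5\n",
             "src/train.py": b"print('training')\n"}
print(stage_is_stale(unchanged, lock))  # -> False
```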
## Metrics Tracking
DVC treats metrics as first-class citizens. You can compare metrics across experiments using Git commits.
### Generating Metrics

```python
# src/evaluate.py writes metrics.json
import json

metrics = {
    "accuracy": 0.9523,
    "f1_score": 0.9487,
    "precision": 0.9512,
    "recall": 0.9463,
    "roc_auc": 0.9834,
}

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```
### Comparing Experiments

```bash
# Show metrics for the current commit
dvc metrics show

# Compare metrics against the previous commit
dvc metrics diff HEAD~1

# Compare two arbitrary revisions (branches, tags, commits)
dvc metrics diff experiment-1 experiment-2

# Example output:
# Path          Metric    HEAD    HEAD~1  Change
# metrics.json  accuracy  0.9523  0.9187  +0.0336
# metrics.json  f1_score  0.9487  0.9102  +0.0385
```
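Conceptually, `dvc metrics diff` just compares two JSON snapshots of metrics.json. A rough plain-Python equivalent of that comparison:

```python
import json

# Two snapshots of metrics.json (HEAD~1 and HEAD)
old = json.loads('{"accuracy": 0.9187, "f1_score": 0.9102}')
new = json.loads('{"accuracy": 0.9523, "f1_score": 0.9487}')

def metrics_diff(old: dict, new: dict) -> dict:
    """Per-metric change, mirroring the dvc metrics diff table."""
    return {k: round(new[k] - old[k], 4) for k in new if k in old}

print(metrics_diff(old, new))  # -> {'accuracy': 0.0336, 'f1_score': 0.0385}
```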
## DVC Experiments

DVC experiments provide a lightweight way to run hyperparameter sweeps without creating Git branches:

```bash
# Run an experiment with a modified parameter
dvc exp run --set-param train.n_estimators=200

# Queue a parameter grid (comma-separated values expand to all combinations),
# then run everything in the queue
dvc exp run --queue --set-param train.n_estimators=50,100,200 --set-param train.max_depth=3,5,8
dvc exp run --run-all

# List all experiments
dvc exp show

# Apply the best experiment to the workspace
dvc exp apply <experiment-name>

# Push an experiment to the Git remote
dvc exp push origin <experiment-name>
```
## Reproducing Experiments

The ultimate goal of DVC is exact reproducibility. Given a Git commit, you can reproduce the same results:

```bash
# Check out a specific version
git checkout v1.2.3

# Restore the exact data for this version
dvc checkout  # updates the working directory to match the .dvc files
dvc pull      # downloads data from the remote if it is not in the local cache

# Reproduce the pipeline
dvc repro

# The results will be identical to the original v1.2.3 run
```
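Identical results also require your own code to be deterministic: every source of randomness needs a fixed seed, which is why `random_state` lives in params.yaml. A minimal standard-library illustration (the `sample_test_ids` helper is hypothetical):

```python
import random

def sample_test_ids(n: int, test_size: float, seed: int) -> list:
    """Deterministic test split: a fixed seed yields the same
    shuffle, hence the same split, on every run and machine."""
    rng = random.Random(seed)  # isolated, explicitly seeded RNG
    ids = list(range(n))
    rng.shuffle(ids)
    return ids[: int(n * test_size)]

# Same seed -> identical split; different seed -> different split
print(sample_test_ids(100, 0.2, seed=42) == sample_test_ids(100, 0.2, seed=42))  # -> True
print(sample_test_ids(100, 0.2, seed=7) == sample_test_ids(100, 0.2, seed=42))   # -> False
```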
### Full Reproduction Workflow

```bash
# On a fresh machine:
git clone https://github.com/org/my-ml-project.git
cd my-ml-project

# Install dependencies
pip install -r requirements.txt

# Pull data from the remote
dvc pull

# Reproduce the full pipeline
dvc repro

# Verify that the metrics match
dvc metrics show
```
## DVC + MLflow

DVC and MLflow complement each other:

- DVC tracks data and pipeline versions (what data, what code)
- MLflow tracks experiment metadata (parameters, metrics, artifacts)

Use them together: DVC ensures your data and code are reproducible; MLflow ensures your experiment results are comparable.

```bash
# From the command line, use both
dvc repro                         # reproduces exact data + code
mlflow run . -P n_estimators=200  # logs the experiment
```