Data Versioning with DVC

DVC (Data Version Control) brings the power of Git to your data and ML pipelines. While Git tracks code, DVC tracks datasets, model files, and pipeline artifacts — without storing them in Git. It's the missing piece that makes ML experiments truly reproducible.

Why Version Data?

Consider a typical scenario: your model's accuracy dropped from 94% to 89% after a data refresh. What changed?

  • Was it the new rows added to the training set?
  • Was it a change in the preprocessing script?
  • Was it a hyperparameter tweak you forgot to record?

Without data versioning, you're flying blind. DVC solves this by:

  • Tracking exact dataset versions alongside code commits
  • Storing large files in cloud storage (not Git)
  • Reproducing any experiment from its exact inputs
  • Comparing results across different data/code versions

DVC Init and Track

Installation and Setup

```bash
# Install DVC with GCS support
pip install "dvc[gs]"

# Initialize DVC in your project (creates a .dvc/ directory)
cd my-ml-project
dvc init

# Commit the DVC configuration to Git
git add .dvc/.gitignore .dvc/config
git commit -m "Initialize DVC"
```

Tracking Files and Directories

```bash
# Track a dataset file
dvc add data/raw/training_data.csv

# This creates training_data.csv.dvc (metadata) and updates .gitignore
# The .dvc file contains the hash and size, NOT the actual data
git add data/raw/training_data.csv.dvc data/raw/.gitignore
git commit -m "Track training_data.csv with DVC"

# Track a directory
dvc add data/processed/

# Track a model file
dvc add models/random_forest_v1.joblib

# Commit the tracking files (dvc add also updated the .gitignore files)
git add data/processed.dvc models/random_forest_v1.joblib.dvc data/.gitignore models/.gitignore
git commit -m "Track processed data and model v1"
```

What DVC Actually Stores

The .dvc file is a small YAML metadata file:

```yaml
# training_data.csv.dvc
outs:
- md5: 1a79a4d60de6718e8e5b326e338ae533
  size: 52428800
  hash: md5
  path: training_data.csv
```

The actual data is stored in DVC cache (.dvc/cache/) and pushed to remote storage. Git only tracks the .dvc metadata file — keeping your repository lightweight.
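The cache is content-addressed: a file's hash determines where it lives. A minimal Python sketch of the idea (illustrative only, not DVC's actual implementation; recent DVC versions use a `.dvc/cache/files/md5/` layout sharded by the first two hex characters of the hash):

```python
# Sketch of DVC-style content-addressed caching (illustrative).
import hashlib
import os
import shutil

def cache_file(path, cache_dir=".dvc/cache/files/md5"):
    """Hash a file and copy it into an md5-addressed cache directory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Hash in chunks so large datasets don't need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    # Shard by the first two hex characters, like DVC's cache layout
    dest = os.path.join(cache_dir, digest[:2], digest[2:])
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(path, dest)
    return digest, dest
```

Because identical content always hashes to the same path, re-adding an unchanged file costs nothing, and two commits that share a dataset share a single cached copy.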

GCS Remote Storage

DVC needs a remote storage location to share data with your team. Google Cloud Storage (GCS) is a natural choice for GCP projects.

Setting Up a GCS Remote

```bash
# Create a GCS bucket for DVC storage
gsutil mb -l us-central1 gs://my-dvc-store/

# Add the GCS bucket as the default DVC remote
dvc remote add -d myremote gs://my-dvc-store/

# Optional: point the remote at a subdirectory of the bucket
dvc remote modify myremote url gs://my-dvc-store/dvc-cache

# Verify the remote configuration
dvc remote list

# Commit the remote config to Git
git add .dvc/config
git commit -m "Add GCS remote for DVC"
```
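After these commands, the committed `.dvc/config` looks roughly like this (DVC stores its configuration in an INI-style format):

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = gs://my-dvc-store/dvc-cache
```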

Pushing and Pulling Data

```bash
# Push tracked data to the GCS remote
dvc push

# Pull data from the remote (e.g., on a new machine or in CI)
dvc pull

# Push a specific file
dvc push data/raw/training_data.csv.dvc

# Pull a specific file
dvc pull models/random_forest_v1.joblib.dvc
```

Workflow: Team Collaboration

```bash
# Developer A: adds new data and pushes
dvc add data/raw/v2_data.csv
git add data/raw/v2_data.csv.dvc data/raw/.gitignore
git commit -m "Add v2 training data"
git push
dvc push

# Developer B: pulls the updated data
git pull
dvc pull  # Fetches v2_data.csv from GCS
```

DVC Pipeline (Stages and Metrics)

DVC Pipelines define your ML workflow as a series of stages with explicit dependencies. This makes the entire workflow reproducible with a single command.

Defining a Pipeline with dvc.yaml

```yaml
# dvc.yaml
stages:
  data_prep:
    cmd: python src/prepare_data.py --input data/raw/training_data.csv --output data/processed/
    deps:
      - src/prepare_data.py
      - data/raw/training_data.csv
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
    params:
      - prepare.test_size
      - prepare.random_state

  train:
    cmd: python src/train.py --train data/processed/train.csv --model models/model.joblib
    deps:
      - src/train.py
      - data/processed/train.csv
    outs:
      - models/model.joblib
    params:
      - train.n_estimators
      - train.max_depth

  evaluate:
    cmd: python src/evaluate.py --model models/model.joblib --test data/processed/test.csv --metrics metrics.json
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/processed/test.csv
    metrics:
      - metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.png:
          cache: false
```
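For concreteness, here is a hypothetical sketch of the `data_prep` stage script, `src/prepare_data.py`. The argument handling and split logic are illustrative; a real stage would read `prepare.test_size` and `prepare.random_state` from params.yaml rather than hardcoding defaults:

```python
# Illustrative sketch of the data_prep stage script (src/prepare_data.py).
import argparse
import csv
import os
import random

def split_rows(rows, test_size=0.2, random_state=42):
    """Shuffle rows deterministically and split into (train, test) lists."""
    rng = random.Random(random_state)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)

    with open(args.input, newline="") as f:
        reader = csv.reader(f)
        header, rows = next(reader), list(reader)

    train, test = split_rows(rows)
    os.makedirs(args.output, exist_ok=True)
    for name, subset in [("train.csv", train), ("test.csv", test)]:
        with open(os.path.join(args.output, name), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)

# Invoked by the data_prep stage as:
#   python src/prepare_data.py --input data/raw/training_data.csv --output data/processed/
```

Because the stage lists `src/prepare_data.py` under `deps`, editing this script invalidates the stage and `dvc repro` will rerun it.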

Parameters File

```yaml
# params.yaml
prepare:
  test_size: 0.2
  random_state: 42

train:
  n_estimators: 100
  max_depth: 5
  learning_rate: 0.1

evaluate:
  threshold: 0.5
```

Running the Pipeline

```bash
# Run the entire pipeline
dvc repro

# Run from a specific stage
dvc repro train

# Force re-run (ignore cache)
dvc repro --force

# Dry run (show what would execute)
dvc repro --dry
```

How DVC Determines What to Run

DVC uses dependency tracking to determine which stages need to re-run:

  1. Check if any dependency (code, data, params) changed since last run
  2. If a dependency changed → re-run the stage
  3. If the stage output changed → re-run downstream stages

This means if you only change train.max_depth in params.yaml, DVC skips data_prep and only re-runs train and evaluate.
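The rule above can be modeled in a few lines. This is a toy sketch of the staleness check, not DVC's actual algorithm (DVC compares hashes recorded in dvc.lock against the current workspace):

```python
# Toy model of DVC's staleness check (illustrative, not DVC's code):
# a stage reruns when any dependency's hash differs from the hash
# recorded after the last run, or when an upstream stage is itself stale.

def stages_to_rerun(stages, current_hashes, locked_hashes):
    """stages maps a stage name to its dependencies (file paths or
    upstream stage names), listed in pipeline order."""
    stale = set()
    for stage, deps in stages.items():
        dep_changed = any(
            current_hashes.get(d) != locked_hashes.get(d) for d in deps
        )
        upstream_stale = any(d in stale for d in deps)
        if dep_changed or upstream_stale:
            stale.add(stage)
    return stale

pipeline = {
    "data_prep": ["src/prepare_data.py", "data/raw/training_data.csv"],
    "train": ["src/train.py", "data_prep"],
    "evaluate": ["src/evaluate.py", "train"],
}
locked = {"src/prepare_data.py": "h1", "data/raw/training_data.csv": "h2",
          "src/train.py": "h3", "src/evaluate.py": "h4"}
current = dict(locked, **{"src/train.py": "h5"})  # only train.py was edited

print(stages_to_rerun(pipeline, current, locked))  # train and evaluate rerun
```

Editing only `src/train.py` marks `train` stale, which in turn marks the downstream `evaluate` stage stale, while `data_prep` is skipped entirely.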

Metrics Tracking

DVC treats metrics as first-class citizens. You can compare metrics across experiments using Git commits.

Generating Metrics

```python
# src/evaluate.py writes metrics.json
import json

metrics = {
    "accuracy": 0.9523,
    "f1_score": 0.9487,
    "precision": 0.9512,
    "recall": 0.9463,
    "roc_auc": 0.9834,
}

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```
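For reference, the metric values themselves can be computed directly from predictions. A stdlib-only sketch for the binary case (in practice you would likely use `sklearn.metrics`):

```python
# Compute binary-classification metrics from true and predicted labels.
def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}

scores = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
# scores["accuracy"] == 0.75, scores["recall"] == 0.5
```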

Comparing Experiments

```bash
# Show metrics for the current commit
dvc metrics show

# Compare metrics against the previous commit
dvc metrics diff HEAD~1

# Compare two specific revisions (branches, tags, or commits)
dvc metrics diff experiment-1 experiment-2

# Example output:
# Path          Metric    HEAD    HEAD~1  Change
# metrics.json  accuracy  0.9523  0.9187  +0.0336
# metrics.json  f1_score  0.9487  0.9102  +0.0385
```
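Conceptually, the diff is just a per-metric comparison of two metrics.json snapshots. A toy sketch:

```python
# Toy version of what `dvc metrics diff` computes: per-metric deltas
# between the metrics.json of two revisions (illustrative only).
def metrics_diff(old, new):
    return {
        key: {
            "old": old.get(key),
            "new": value,
            # A metric absent from the old revision has no delta
            "change": round(value - old[key], 4) if key in old else None,
        }
        for key, value in new.items()
    }

diff = metrics_diff({"accuracy": 0.9187}, {"accuracy": 0.9523})
print(diff["accuracy"]["change"])  # 0.0336
```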

DVC Experiments

DVC Experiments provide a lightweight way to run hyperparameter sweeps without creating Git branches:

```bash
# Run an experiment with a modified parameter
dvc exp run --set-param train.n_estimators=200

# Queue a parameter grid, then run every queued experiment
dvc exp run --queue --set-param train.n_estimators=50,100,200 --set-param train.max_depth=3,5,8
dvc exp run --run-all

# List all experiments
dvc exp show

# Apply the best experiment to the workspace
dvc exp apply <experiment-name>

# Push an experiment to the Git remote
dvc exp push origin <experiment-name>
```

Reproducing Experiments

The ultimate goal of DVC is exact reproducibility. Given a Git commit, you can reproduce the exact same results:

```bash
# Check out a specific tagged version
git checkout v1.2.3

# Restore the exact data for this version
dvc checkout  # Updates the working directory to match the .dvc files
dvc pull      # Downloads any data missing from the local cache

# Reproduce the pipeline
dvc repro

# With deterministic code (fixed random seeds), the results match the
# original v1.2.3 run
```

Full Reproduction Workflow

```bash
# On a fresh machine:
git clone https://github.com/org/my-ml-project.git
cd my-ml-project

# Install dependencies
pip install -r requirements.txt

# Pull data from remote
dvc pull

# Reproduce the full pipeline
dvc repro

# Verify metrics match
dvc metrics show
```

DVC + MLflow Together

DVC and MLflow complement each other beautifully:

  • DVC tracks data and pipeline versions (what data, what code)
  • MLflow tracks experiment metadata (parameters, metrics, artifacts)

Use them together: DVC ensures your data/code is reproducible, MLflow ensures your experiment results are comparable.

```bash
# Reproduce the exact data and code with DVC, then log the run with MLflow
dvc repro
mlflow run . -P n_estimators=200  # Logs parameters, metrics, and artifacts
```