# Data Versioning with DVC
DVC (Data Version Control) brings the power of Git to your data and ML pipelines. While Git tracks code, DVC tracks datasets, model files, and pipeline artifacts — without storing them in Git. It's the missing piece that makes ML experiments truly reproducible.
## Why Version Data?
Consider a typical scenario: your model's accuracy dropped from 94% to 89% after a data refresh. What changed?
- Was it the new rows added to the training set?
- Was it a change in the preprocessing script?
- Was it a hyperparameter tweak you forgot to record?
Without data versioning, you're flying blind. DVC solves this by:
- Tracking exact dataset versions alongside code commits
- Storing large files in cloud storage (not Git)
- Reproducing any experiment from its exact inputs
- Comparing results across different data/code versions
## DVC Init and Track
### Installation and Setup

```bash
# Install DVC with GCS support
pip install "dvc[gs]"

# Initialize DVC in your project (creates .dvc/ directory)
cd my-ml-project
dvc init

# Commit the DVC configuration to Git
git add .dvc/.gitignore .dvc/config
git commit -m "Initialize DVC"
```
### Tracking Files and Directories

```bash
# Track a dataset file
dvc add data/raw/training_data.csv

# This creates training_data.csv.dvc (metadata) and updates .gitignore.
# The .dvc file contains the hash and size — NOT the actual data.
git add data/raw/training_data.csv.dvc data/raw/.gitignore
git commit -m "Track training_data.csv with DVC"

# Track a directory (creates data/processed.dvc)
dvc add data/processed/

# Track a model file
dvc add models/random_forest_v1.joblib

# Commit the tracking files
git add data/processed.dvc models/random_forest_v1.joblib.dvc
git commit -m "Track processed data and model v1"
```
### What DVC Actually Stores

The .dvc file is a small YAML metadata file:

```yaml
# training_data.csv.dvc
outs:
- md5: a1b2c3d4e5f67890a1b2c3d4e5f67890
  size: 52428800
  hash: md5
  path: training_data.csv
```
The actual data is stored in DVC cache (.dvc/cache/) and pushed to remote storage. Git only tracks the .dvc metadata file — keeping your repository lightweight.
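Content addressing is what makes this scheme work: a file's hash determines its location in the cache, so identical files are stored only once. A toy sketch of the idea (the `cache_path` helper is hypothetical, for illustration only, not DVC's actual code):

```python
import hashlib
from pathlib import Path

def cache_path(data: bytes, cache_dir: str = ".dvc/cache/files/md5") -> str:
    """Toy illustration of content addressing: the file's hash decides
    where it lives in the cache. (DVC's real layout also handles
    directories, and older DVC versions used a different cache prefix.)"""
    digest = hashlib.md5(data).hexdigest()
    # Shard the cache by the first two hex characters of the hash
    return str(Path(cache_dir) / digest[:2] / digest[2:])

print(cache_path(b"feature1,feature2,label\n"))
```

Because the path is derived from the content, re-adding an unchanged file is a no-op, and two datasets sharing files share cache entries.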
## GCS Remote Storage
DVC needs a remote storage location to share data with your team. Google Cloud Storage (GCS) is a natural choice for GCP projects.
### Setting Up a GCS Remote

```bash
# Create a GCS bucket for DVC storage
gsutil mb -l us-central1 gs://my-dvc-store/

# Add the GCS bucket as the default DVC remote
dvc remote add -d myremote gs://my-dvc-store/

# Optional: point the remote at a subdirectory instead
dvc remote modify myremote url gs://my-dvc-store/dvc-cache

# Verify the remote configuration
dvc remote list

# Commit the remote config to Git
git add .dvc/config
git commit -m "Add GCS remote for DVC"
```
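After these commands, the Git-tracked `.dvc/config` should look roughly like this (the exact contents depend on your DVC version):

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = gs://my-dvc-store/dvc-cache
```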
### Pushing and Pulling Data

```bash
# Push tracked data to the GCS remote
dvc push

# Pull data from the remote (e.g., on a new machine or in CI)
dvc pull

# Push a specific file
dvc push data/raw/training_data.csv.dvc

# Pull a specific file
dvc pull models/random_forest_v1.joblib.dvc
```
### Workflow: Team Collaboration

```bash
# Developer A: adds new data and pushes
dvc add data/raw/v2_data.csv
git add data/raw/v2_data.csv.dvc
git commit -m "Add v2 training data"
git push    # code + .dvc pointers to the Git remote
dvc push    # data to GCS

# Developer B: pulls the updated data
git pull
dvc pull    # fetches v2_data.csv from GCS
```
## DVC Pipelines (Stages and Metrics)
DVC Pipelines define your ML workflow as a series of stages with explicit dependencies. This makes the entire workflow reproducible with a single command.
### Defining a Pipeline with dvc.yaml

```yaml
# dvc.yaml
stages:
  data_prep:
    cmd: python src/prepare_data.py --input data/raw/training_data.csv --output data/processed/
    deps:
      - src/prepare_data.py
      - data/raw/training_data.csv
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
    params:
      - prepare.test_size
      - prepare.random_state

  train:
    cmd: python src/train.py --train data/processed/train.csv --model models/model.joblib
    deps:
      - src/train.py
      - data/processed/train.csv
    outs:
      - models/model.joblib
    params:
      - train.n_estimators
      - train.max_depth

  evaluate:
    cmd: python src/evaluate.py --model models/model.joblib --test data/processed/test.csv --metrics metrics.json
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/processed/test.csv
    metrics:
      - metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.png:
          cache: false
```
### Parameters File

```yaml
# params.yaml
prepare:
  test_size: 0.2
  random_state: 42
train:
  n_estimators: 100
  max_depth: 5
  learning_rate: 0.1
evaluate:
  threshold: 0.5
```
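Stage scripts read these values from params.yaml themselves. A minimal sketch, assuming PyYAML is installed (the inline string stands in for the real file):

```python
import yaml  # PyYAML

# In a real stage script you would use open("params.yaml") instead
PARAMS_YAML = """\
prepare:
  test_size: 0.2
  random_state: 42
train:
  n_estimators: 100
  max_depth: 5
"""

params = yaml.safe_load(PARAMS_YAML)
test_size = params["prepare"]["test_size"]
n_estimators = params["train"]["n_estimators"]
print(test_size, n_estimators)  # -> 0.2 100
```

Because the stage declares `prepare.test_size` etc. under `params:` in dvc.yaml, DVC also watches exactly these keys for changes.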
### Running the Pipeline

```bash
# Run the entire pipeline
dvc repro

# Reproduce a specific stage (plus any stale upstream stages)
dvc repro train

# Force a re-run (ignore the cache)
dvc repro --force

# Dry run (show what would execute)
dvc repro --dry
```
### How DVC Determines What to Run

DVC uses dependency tracking to decide which stages need to re-run:

- Each stage's dependencies (code, data, params) are hashed and recorded in dvc.lock
- If any dependency's hash changed since the last run → re-run the stage
- If a re-run stage produces different outputs → downstream stages re-run too

This means that if you only change train.max_depth in params.yaml, DVC skips data_prep and re-runs only train and evaluate.
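The skip logic boils down to comparing current hashes against the hashes recorded in dvc.lock. A toy sketch of the idea (in-memory bytes stand in for real files; this is not DVC's actual implementation):

```python
import hashlib

def digest(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

def stage_is_stale(deps: dict, lock: dict) -> bool:
    """A stage re-runs iff any dependency's current hash differs
    from the hash recorded after the last successful run."""
    return any(digest(content) != lock.get(path)
               for path, content in deps.items())

# Hashes recorded at the last run (think dvc.lock)
lock = {
    "params.yaml": digest(b"train:\n  max_depth: 5\n"),
    "src/train.py": digest(b"print('training')\n"),
}

# Only the parameter file changed -> the stage is stale
changed = {"params.yaml": b"train:\n  max_depth: 8\n",
           "src/train.py": b"print('training')\n"}
print(stage_is_stale(changed, lock))  # -> True

# Nothing changed -> the stage is skipped
unchanged = {"params.yaml": b"train:\n  max_depth: 5\n",
             "src/train.py": b"print('training')\n"}
print(stage_is_stale(unchanged, lock))  # -> False
```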
## Metrics Tracking
DVC treats metrics as first-class citizens. You can compare metrics across experiments using Git commits.
### Generating Metrics

```python
# src/evaluate.py writes metrics.json
import json

metrics = {
    "accuracy": 0.9523,
    "f1_score": 0.9487,
    "precision": 0.9512,
    "recall": 0.9463,
    "roc_auc": 0.9834,
}

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```
### Comparing Experiments

```bash
# Show metrics for the current commit
dvc metrics show

# Compare metrics against the previous commit
dvc metrics diff HEAD~1

# Compare two arbitrary revisions (branches, tags, commits)
dvc metrics diff experiment-1 experiment-2

# Example output:
# Path          Metric    HEAD    HEAD~1  Change
# metrics.json  accuracy  0.9523  0.9187  +0.0336
# metrics.json  f1_score  0.9487  0.9102  +0.0385
```
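Conceptually, `dvc metrics diff` just compares two JSON snapshots of metrics.json. A rough plain-Python equivalent of that comparison:

```python
import json

# Two snapshots of metrics.json (HEAD~1 and HEAD)
old = json.loads('{"accuracy": 0.9187, "f1_score": 0.9102}')
new = json.loads('{"accuracy": 0.9523, "f1_score": 0.9487}')

def metrics_diff(old: dict, new: dict) -> dict:
    """Per-metric change, mirroring the dvc metrics diff table."""
    return {k: round(new[k] - old[k], 4) for k in new if k in old}

print(metrics_diff(old, new))  # -> {'accuracy': 0.0336, 'f1_score': 0.0385}
```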
## DVC Experiments

DVC experiments provide a lightweight way to run hyperparameter sweeps without creating Git branches:

```bash
# Run an experiment with a modified parameter
dvc exp run --set-param train.n_estimators=200

# Queue a parameter grid (comma-separated values expand to all combinations),
# then run everything in the queue
dvc exp run --queue --set-param train.n_estimators=50,100,200 --set-param train.max_depth=3,5,8
dvc exp run --run-all

# List all experiments
dvc exp show

# Apply the best experiment to the workspace
dvc exp apply <experiment-name>

# Push an experiment to the Git remote
dvc exp push origin <experiment-name>
```
## Reproducing Experiments

The ultimate goal of DVC is exact reproducibility. Given a Git commit, you can reproduce the same results:

```bash
# Check out a specific version
git checkout v1.2.3

# Restore the exact data for this version
dvc checkout  # updates the working directory to match the .dvc files
dvc pull      # downloads data from the remote if it is not in the local cache

# Reproduce the pipeline
dvc repro

# The results will be identical to the original v1.2.3 run
```
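Identical results also require your own code to be deterministic: every source of randomness needs a fixed seed, which is why `random_state` lives in params.yaml. A minimal standard-library illustration (the `sample_test_ids` helper is hypothetical):

```python
import random

def sample_test_ids(n: int, test_size: float, seed: int) -> list:
    """Deterministic test split: a fixed seed yields the same
    shuffle, hence the same split, on every run and machine."""
    rng = random.Random(seed)  # isolated, explicitly seeded RNG
    ids = list(range(n))
    rng.shuffle(ids)
    return ids[: int(n * test_size)]

# Same seed -> identical split; different seed -> different split
print(sample_test_ids(100, 0.2, seed=42) == sample_test_ids(100, 0.2, seed=42))  # -> True
print(sample_test_ids(100, 0.2, seed=7) == sample_test_ids(100, 0.2, seed=42))   # -> False
```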
### Full Reproduction Workflow

```bash
# On a fresh machine:
git clone https://github.com/org/my-ml-project.git
cd my-ml-project

# Install dependencies
pip install -r requirements.txt

# Pull data from the remote
dvc pull

# Reproduce the full pipeline
dvc repro

# Verify that the metrics match
dvc metrics show
```
## DVC + MLflow

DVC and MLflow complement each other:

- DVC tracks data and pipeline versions (what data, what code)
- MLflow tracks experiment metadata (parameters, metrics, artifacts)

Use them together: DVC ensures your data and code are reproducible; MLflow ensures your experiment results are comparable.

```bash
# From the command line, use both
dvc repro                         # reproduces exact data + code
mlflow run . -P n_estimators=200  # logs the experiment
```