Cost Optimization
GCP bills can spiral quickly — a single GPU training job left running overnight can cost hundreds of dollars. This page covers practical strategies to keep your ML workload costs under control while still getting the compute power you need.
The Cost Problem in ML
Typical ML Monthly Spend (Unoptimized):
```
Vertex AI Training (GPU)           $800    40%
Cloud Run (always-on instances)    $400    20%
BigQuery (full table scans)        $300    15%
Cloud Storage (unused data)        $200    10%
Artifact Registry (old images)     $150     8%
Pub/Sub + Functions                 $50     2%
Other                              $100     5%
──────────────────────────────────────────────
Total                            $2,000
After optimization                 $600   ← 70% savings
```
1. GCP Free Tier Services and Limits
GCP offers an Always Free tier that doesn't expire after the trial period:
| Service | Free Tier Limit | ML Use Case |
|---|---|---|
| Cloud Run | 2M requests/month, 360K vCPU-seconds | Dev/test model servers |
| Cloud Functions | 2M invocations/month | Event triggers, webhooks |
| Pub/Sub | 10 GB/month | ML pipeline messaging |
| Cloud Storage | 5 GB (US) | Small datasets, model artifacts |
| BigQuery | 1 TB queries/month, 10 GB storage | Data analysis, monitoring |
| Artifact Registry | 0.5 GB storage | Small container images |
| Cloud Scheduler | 3 jobs/month | Periodic retraining triggers |
| Vertex AI | Limited free credits (new accounts) | Experimentation |
```bash
# Check your current quota usage for a service
gcloud beta services quota list \
    --service=run.googleapis.com \
    --consumer=projects/your-project-id

# View billing account details
gcloud billing accounts list
gcloud billing accounts describe YOUR_BILLING_ACCOUNT_ID
```
For course projects and prototyping, you can stay entirely within the free tier. Use Cloud Run with `--min-instances=0`, keep BigQuery datasets small, and clean up unused resources weekly.
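To sanity-check whether a prototype fits these limits before deploying, a quick back-of-envelope calculation helps. The sketch below checks a workload against the two Cloud Run limits from the table; `fits_free_tier` and the traffic numbers are illustrative assumptions, not values fetched from GCP, so verify current limits against the pricing docs.

```python
# Rough check of whether a dev workload stays inside the Always Free limits
# listed above. Limits are hardcoded assumptions, not live quota lookups.

FREE_RUN_REQUESTS = 2_000_000      # Cloud Run requests/month
FREE_RUN_VCPU_SECONDS = 360_000    # Cloud Run vCPU-seconds/month

def fits_free_tier(requests_per_day: int, vcpu_seconds_per_request: float) -> bool:
    """Return True if a month of traffic stays inside both Cloud Run limits."""
    monthly_requests = requests_per_day * 30
    monthly_vcpu_seconds = monthly_requests * vcpu_seconds_per_request
    return (monthly_requests <= FREE_RUN_REQUESTS
            and monthly_vcpu_seconds <= FREE_RUN_VCPU_SECONDS)

# 1,000 requests/day at 0.2 vCPU-seconds each: 30K requests, 6K vCPU-seconds
print(fits_free_tier(1_000, 0.2))   # True
# 50,000 requests/day at 0.5 vCPU-seconds each blows the vCPU-second budget
print(fits_free_tier(50_000, 0.5))  # False
```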
2. Committed Use Discounts
If you have predictable, steady-state workloads, committed use discounts (CUDs) offer up to 57% savings in exchange for a 1-year or 3-year commitment:
| Commitment | Discount | Best For |
|---|---|---|
| 1-year CUD | ~28% off | Production model servers |
| 3-year CUD | ~46% off | Long-running inference clusters |
| Flexible CUD | ~15% off | Mixed workloads across regions |
```bash
# Purchase a committed use discount
gcloud compute commitments create ml-inference-cud \
    --region=us-central1 \
    --plan=12-month \
    --resources=vcpu=4,memory=16384MB \
    --project=your-project-id

# View existing commitments
gcloud compute commitments list
```
CUDs charge you regardless of usage. Only commit for resources you'll use 24/7 (e.g., a production inference server with `--min-instances=1`). Never commit for training jobs.
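The break-even point follows directly from the discount: since a CUD bills 24/7, it only wins if you would otherwise run the same resources more than (1 - discount) of the time. A minimal sketch, using the approximate discounts from the table above:

```python
# Break-even utilization for a committed use discount: committed capacity is
# billed around the clock, so a CUD beats on-demand only above this fraction
# of the month. Discounts are the approximate figures from the table above.

def breakeven_utilization(discount: float) -> float:
    """Fraction of the month the resource must run on-demand to match CUD cost."""
    return 1.0 - discount

for name, discount in [("1-year CUD", 0.28), ("3-year CUD", 0.46), ("Flexible CUD", 0.15)]:
    hours = breakeven_utilization(discount) * 730  # ~730 hours in a month
    print(f"{name}: worth it above ~{hours:.0f} on-demand hours/month")
```

This is why the note above says never to commit for training jobs: a job that runs a few hours a day sits far below the break-even utilization.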
3. Spot Instances for Training
Spot VMs use spare compute capacity at 60-91% discount. They can be preempted at any time, but for ML training with checkpointing, this is an excellent trade-off:
```bash
# Submit a Vertex AI training job on Spot VMs.
# Note: gcloud ai custom-jobs create has no --spot flag; spot scheduling is
# requested via the job spec's scheduling strategy, passed in a config file
# (e.g. a YAML containing `scheduling: {strategy: SPOT}`).
gcloud ai custom-jobs create \
    --display-name="ml-training-spot" \
    --region=us-central1 \
    --worker-pool-spec=machine-type=n1-standard-8,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,replica-count=1,container-image-uri=gcr.io/your-project/trainer:latest \
    --config=spot_config.yaml \
    --project=your-project-id
```
Python: Training with spot instances and checkpointing
```python
from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import custom_job as custom_job_types

aiplatform.init(project="your-project-id", location="us-central1")

# Define a worker pool; spot pricing is requested at run time below
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "gcr.io/your-project/trainer:latest",
            "command": ["python", "train.py"],
            "args": [
                "--checkpoint-dir=gs://your-project-ml/checkpoints",
                "--resume-from-checkpoint=true",
            ],
        },
    }
]

job = aiplatform.CustomJob(
    display_name="spot-training-with-checkpoints",
    worker_pool_specs=worker_pool_specs,
)

# Run on Spot VMs. Recent google-cloud-aiplatform releases expose this via
# the scheduling_strategy parameter (there is no spot=True keyword).
job.run(scheduling_strategy=custom_job_types.Scheduling.Strategy.SPOT)

# If preempted, the job is restarted and the training script should
# auto-resume from the latest checkpoint in the checkpoint dir
```
Checkpoint strategy for spot instances
```python
# train.py — resilient training with frequent checkpoints
import os

import torch

CHECKPOINT_DIR = os.environ.get("CHECKPOINT_DIR", "./checkpoints")
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, epoch, loss):
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint_epoch_{epoch}.pt")
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, path)
    print(f"Checkpoint saved: {path}")

def load_latest_checkpoint(model, optimizer):
    # Sort numerically by epoch — a plain string sort would rank
    # "epoch_9" after "epoch_10"
    checkpoints = sorted(
        (f for f in os.listdir(CHECKPOINT_DIR) if f.startswith("checkpoint_")),
        key=lambda f: int(f.split("_")[-1].split(".")[0]),
    )
    if not checkpoints:
        return 0  # Start from scratch
    latest = os.path.join(CHECKPOINT_DIR, checkpoints[-1])
    checkpoint = torch.load(latest)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    print(f"Resumed from epoch {checkpoint['epoch']}")
    return checkpoint["epoch"] + 1

# Training loop with checkpointing every 5 epochs
start_epoch = load_latest_checkpoint(model, optimizer)
for epoch in range(start_epoch, num_epochs):
    loss = train_one_epoch(model, optimizer, dataloader)
    if epoch % 5 == 0:
        save_checkpoint(model, optimizer, epoch, loss)
```
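A fixed "every 5 epochs" cadence is a reasonable default, but the interval can be tuned to how often spot preemptions actually occur. Young's approximation suggests checkpointing roughly every sqrt(2 × checkpoint cost × mean time between preemptions). A sketch of that rule of thumb, where the 30-second checkpoint cost and 6-hour preemption interval are assumptions you should measure for your own jobs:

```python
import math

# Young's approximation for the checkpoint interval that minimizes lost work:
# interval ~ sqrt(2 * checkpoint_cost * mean_time_between_preemptions).
# Both inputs are assumed values; measure them for your region and VM type.

def optimal_checkpoint_interval(checkpoint_seconds: float,
                                mean_seconds_between_preemptions: float) -> float:
    return math.sqrt(2 * checkpoint_seconds * mean_seconds_between_preemptions)

# 30 s to write a checkpoint, preemption roughly every 6 hours on average
interval = optimal_checkpoint_interval(30, 6 * 3600)
print(f"Checkpoint every ~{interval / 60:.0f} minutes")  # ~19 minutes
```

Checkpointing more often than this wastes time writing checkpoints; less often wastes recomputed work after each preemption.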
4. Budget Alerts and Caps
Set up budget alerts to catch runaway costs before they become painful:
```bash
# Create a budget alert (threshold percents are decimal fractions, not 0-100)
gcloud billing budgets create \
    --billing-account=YOUR_BILLING_ACCOUNT_ID \
    --display-name="ML Project Budget" \
    --budget-amount=500USD \
    --threshold-rule=percent=0.5,basis=current-spend \
    --threshold-rule=percent=0.8,basis=current-spend \
    --threshold-rule=percent=1.0,basis=current-spend \
    --notifications-rule-pubsub-topic=projects/your-project-id/topics/budget-alerts \
    --filter-projects=projects/your-project-id
```
Budget notification handler
```python
# functions/budget_alert/main.py
import base64
import json

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK"

def budget_alert(event, context):
    """Triggered by a Pub/Sub budget alert; the payload is base64-encoded."""
    pubsub_message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    budget_amount = pubsub_message["budgetAmount"]
    current_spend = pubsub_message["costAmount"]
    # alertThresholdExceeded is only present once a threshold has been crossed
    threshold = pubsub_message.get("alertThresholdExceeded", 0)

    percent = (current_spend / budget_amount) * 100
    emoji = ":warning:" if percent < 100 else ":rotating_light:"
    message = (
        f"{emoji} *GCP Budget Alert*\n"
        f"Budget: ${budget_amount:.2f}\n"
        f"Spend: ${current_spend:.2f} ({percent:.1f}%)\n"
        f"Threshold: {threshold * 100:.0f}%"
    )
    requests.post(SLACK_WEBHOOK, json={"text": message})

    # At 100%, optionally shut down non-essential services
    # (scale_down_resources is a placeholder for your own teardown logic)
    if percent >= 100:
        scale_down_resources()
    return {"status": "alert_sent"}
```
5. Cost Dashboard in BigQuery + Looker Studio
Export billing data to BigQuery for detailed analysis:
```bash
# Link the project to a billing account
gcloud billing projects link your-project-id \
    --billing-account=YOUR_BILLING_ACCOUNT_ID

# Billing export to BigQuery cannot be enabled from gcloud — turn it on in
# the Console: Billing → Billing export → BigQuery export, pointing at a
# dataset named billing_export
```
Cost analysis queries
```sql
-- Top 10 services by cost (current invoice month)
SELECT
  service.description AS service,
  SUM(cost) AS total_cost,
  ROUND(SUM(cost) / SUM(SUM(cost)) OVER () * 100, 1) AS pct_of_total
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
WHERE
  invoice.month = FORMAT_DATE("%Y%m", CURRENT_DATE())
GROUP BY
  service
ORDER BY
  total_cost DESC
LIMIT 10;
```

```sql
-- Daily cost trend for ML-related services (last 30 days)
SELECT
  DATE(usage_start_time) AS date,
  service.description AS service,
  SUM(cost) AS daily_cost
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
WHERE
  service.description IN (
    'Cloud Run', 'Vertex AI', 'BigQuery',
    'Cloud Storage', 'Artifact Registry'
  )
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  date, service
ORDER BY
  date DESC, daily_cost DESC;
```

```sql
-- Cost by resource label (requires labeling; labels is an ARRAY<STRUCT>)
SELECT
  label.value AS team,
  SUM(cost) AS total_cost
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
CROSS JOIN UNNEST(labels) AS label
WHERE
  label.key = 'team'
GROUP BY
  team
ORDER BY
  total_cost DESC;
```
Connect to Looker Studio
- Open Looker Studio
- Create a new report → Connect to BigQuery
- Select your billing export dataset
- Build charts: daily cost trend, cost by service, cost by team
6. Resource Labeling and Organization
Labels are key-value pairs attached to GCP resources. They flow into billing exports, enabling cost attribution:
```bash
# Label a Cloud Run service
gcloud run services update ml-predictor \
    --region=us-central1 \
    --update-labels=team=ml-platform,env=production,cost-center=data-science

# Label a Vertex AI endpoint
gcloud ai endpoints update 1234567890123456789 \
    --region=us-central1 \
    --update-labels=team=ml-platform,env=production,model=credit-scoring

# Label a Cloud Storage bucket
gcloud storage buckets update gs://your-project-ml \
    --update-labels=team=ml-platform,env=production,data-type=training
```
Recommended label schema
| Label Key | Values | Purpose |
|---|---|---|
| team | ml-platform, data-eng, research | Cost attribution |
| env | dev, staging, production | Environment filtering |
| cost-center | data-science, engineering | Budget allocation |
| model | credit-scoring, fraud-detection | Model-level tracking |
| lifecycle | temporary, permanent | Cleanup decisions |
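To keep the schema from drifting, labels can be checked before deployment. A minimal sketch of a validator for the table above; `validate_labels` is a hypothetical helper you might run in CI before `gcloud ... --update-labels`, not part of any GCP SDK:

```python
# Validate a resource's labels against the recommended schema above.
# The required keys and allowed values mirror the table; `model` and
# `lifecycle` are treated as optional here.

REQUIRED_LABELS = {
    "team": {"ml-platform", "data-eng", "research"},
    "env": {"dev", "staging", "production"},
    "cost-center": {"data-science", "engineering"},
}

def validate_labels(labels: dict) -> list:
    """Return a list of problems; an empty list means the labels pass."""
    problems = []
    for key, allowed in REQUIRED_LABELS.items():
        if key not in labels:
            problems.append(f"missing required label: {key}")
        elif labels[key] not in allowed:
            problems.append(f"invalid value for {key}: {labels[key]}")
    return problems

print(validate_labels({"team": "ml-platform", "env": "production",
                       "cost-center": "data-science"}))  # []
print(validate_labels({"team": "ml-platform", "env": "qa"}))
# ['invalid value for env: qa', 'missing required label: cost-center']
```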
Bulk cleanup of unlabeled temporary resources
```bash
# Find Cloud Run services without an env label (likely temporary)
gcloud run services list \
    --format="table(name,region,status.url)" \
    --filter="NOT labels.env:*"

# Delete services explicitly labeled as temporary
for service in $(gcloud run services list --filter="labels.lifecycle=temporary" --format="value(name)"); do
  gcloud run services delete "${service}" --region=us-central1 --quiet
done
```
Cost Optimization Checklist
- Set budget alerts at 50%, 80%, and 100% thresholds
- Use spot instances for all training jobs (with checkpointing)
- Set `--min-instances=0` on Cloud Run dev/staging services
- Enable BigQuery billing export and build a cost dashboard
- Label all resources with team, env, and cost-center
- Clean up unused Artifact Registry images monthly
- Use lifecycle policies on Cloud Storage to auto-delete old data
- Review the free tier before provisioning new resources
- Consider CUDs for production inference workloads running 24/7
- Audit costs weekly — look for forgotten instances, old endpoints, unused datasets