Cost Optimization

GCP bills can spiral quickly — a single GPU training job left running overnight can cost hundreds of dollars. This page covers practical strategies to keep your ML workload costs under control while still getting the compute power you need.

The Cost Problem in ML

code
Typical ML Monthly Spend (Unoptimized):
┌──────────────────────────────────────────────┐
│ Vertex AI Training (GPU)           $800  40% │
│ Cloud Run (always-on instances)    $400  20% │
│ BigQuery (full table scans)        $300  15% │
│ Cloud Storage (unused data)        $200  10% │
│ Artifact Registry (old images)     $150   8% │
│ Pub/Sub + Functions                 $50   2% │
│ Other                              $100   5% │
│ ──────────────────────────────────────────── │
│ Total:                           $2,000      │
│ After optimization (70% savings)   $600      │
└──────────────────────────────────────────────┘

1. GCP Free Tier Services and Limits

GCP offers an Always Free tier that doesn't expire after the trial period:

| Service | Free Tier Limit | ML Use Case |
| --- | --- | --- |
| Cloud Run | 2M requests/month, 180K vCPU-seconds | Dev/test model servers |
| Cloud Functions | 2M invocations/month | Event triggers, webhooks |
| Pub/Sub | 10 GB/month | ML pipeline messaging |
| Cloud Storage | 5 GB (US regions) | Small datasets, model artifacts |
| BigQuery | 1 TB queries/month, 10 GB storage | Data analysis, monitoring |
| Artifact Registry | 0.5 GB storage | Small container images |
| Cloud Scheduler | 3 jobs/month | Periodic retraining triggers |
| Vertex AI | Limited free credits (new accounts) | Experimentation |

bash
# Check your current free tier usage
gcloud beta services quota list \
    --service=run.googleapis.com \
    --filter="usage:free"

# View billing account details
gcloud billing accounts list
gcloud billing accounts describe YOUR_BILLING_ACCOUNT_ID
Stay in Free Tier

For course projects and prototyping, you can stay entirely within the free tier. Use Cloud Run with --min-instances=0, keep BigQuery datasets small, and clean up unused resources weekly.
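A quick way to sanity-check a project plan against these limits. This is a minimal sketch; the limit values are copied from the table above (with Cloud Run's 180K vCPU-second allotment), so verify them against the current GCP pricing page before relying on them.

```python
# Hypothetical helper: flag planned usage that exceeds the Always Free limits.
# Limit values mirror the table on this page; check current GCP pricing docs,
# since free tier allotments change over time.

FREE_TIER_LIMITS = {
    "cloud_run_requests": 2_000_000,      # requests/month
    "cloud_run_vcpu_seconds": 180_000,    # vCPU-seconds/month
    "pubsub_gb": 10,                      # GB/month
    "gcs_gb": 5,                          # GB stored (US regions)
    "bigquery_query_tb": 1,               # TB queried/month
}

def over_free_tier(usage: dict) -> list[str]:
    """Return the metrics where planned usage exceeds the free tier."""
    return [
        metric for metric, used in usage.items()
        if used > FREE_TIER_LIMITS.get(metric, float("inf"))
    ]

planned = {"cloud_run_requests": 1_500_000, "cloud_run_vcpu_seconds": 500_000}
print(over_free_tier(planned))  # only the vCPU-seconds figure is over
```

Running this during planning (or in CI against projected load) is cheaper than discovering the overage on the invoice.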

2. Committed Use Discounts

If you have predictable, steady-state workloads, committed use discounts (CUDs) offer up to 57% savings in exchange for a 1-year or 3-year commitment:

| Commitment | Discount | Best For |
| --- | --- | --- |
| 1-year CUD | ~28% off | Production model servers |
| 3-year CUD | ~46% off | Long-running inference clusters |
| Flexible CUD | ~15% off | Mixed workloads across regions |

bash
# Purchase a committed use discount
gcloud compute commitments create ml-inference-cud \
    --region=us-central1 \
    --plan=12-month \
    --resources=vcpu=4,memory=16384MB \
    --project=your-project-id

# View existing commitments
gcloud compute commitments list
Commitment Caveats

CUDs charge you regardless of usage. Only commit for resources you'll use 24/7 (e.g., a production inference server with --min-instances=1). Never commit for training jobs.
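A quick sanity check before committing: with discount d you pay (1 − d) of the 24/7 on-demand price no matter what, so a CUD only wins if expected utilization exceeds 1 − d. A sketch using the approximate discounts from the table above:

```python
# Break-even utilization for a committed use discount.
# Committed cost = (1 - discount) x the always-on on-demand price, so the
# commitment pays off only above that utilization fraction. Discounts below
# are the approximate figures from the table on this page.

def breakeven_utilization(discount: float) -> float:
    """Fraction of the month a resource must run for a CUD to break even."""
    return 1.0 - discount

for name, discount in [("1-year", 0.28), ("3-year", 0.46)]:
    print(f"{name} CUD pays off above {breakeven_utilization(discount):.0%} utilization")
```

For a 1-year CUD at ~28% off, the break-even is ~72% utilization, which is why committing only makes sense for always-on inference, never for bursty training.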

3. Spot Instances for Training

Spot VMs use spare compute capacity at 60-91% discount. They can be preempted at any time, but for ML training with checkpointing, this is an excellent trade-off:

bash
# Submit a Vertex AI training job with spot instances
gcloud ai custom-jobs create \
    --display-name="ml-training-spot" \
    --region=us-central1 \
    --worker-pool-spec=machine-type=n1-standard-8,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,replica-count=1,container-image-uri=gcr.io/your-project/trainer:latest \
    --spot \
    --project=your-project-id

Python: Training with spot instances and checkpointing

python
import google.cloud.aiplatform as aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Define a worker pool with spot pricing
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "gcr.io/your-project/trainer:latest",
            "command": ["python", "train.py"],
            "args": [
                "--checkpoint-dir=gs://your-project-ml/checkpoints",
                "--resume-from-checkpoint=true",
            ],
        },
    }
]

job = aiplatform.CustomJob(
    display_name="spot-training-with-checkpoints",
    worker_pool_specs=worker_pool_specs,
)

# Run on spot instances
job.run(spot=True)

# If preempted, resume from the latest checkpoint:
# the training script should auto-resume from the checkpoint dir
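Whether spot is worth it depends on how much work preemptions waste. A back-of-envelope cost model (all inputs are illustrative assumptions, not official pricing):

```python
# Rough cost comparison: on-demand vs spot for a training job, charging extra
# wall-clock time for each preemption restart. Replace the example numbers
# with your actual machine rate, expected restart count, and checkpoint gap.

def training_cost(hours: float, hourly_rate: float, *, spot_discount: float = 0.0,
                  expected_restarts: int = 0, lost_hours_per_restart: float = 0.0) -> float:
    """Total dollar cost including time re-done after preemptions."""
    effective_hours = hours + expected_restarts * lost_hours_per_restart
    return effective_hours * hourly_rate * (1.0 - spot_discount)

on_demand = training_cost(10, 3.0)
spot = training_cost(10, 3.0, spot_discount=0.7,
                     expected_restarts=2, lost_hours_per_restart=0.5)
print(f"on-demand ${on_demand:.2f} vs spot ${spot:.2f}")
```

Even with two preemptions each costing half an hour of redone work, a 70% discount leaves spot far cheaper, which is why checkpointed training jobs are the canonical spot workload.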

Checkpoint strategy for spot instances

python
# train.py — resilient training with frequent checkpoints
import os

import torch

CHECKPOINT_DIR = os.environ.get("CHECKPOINT_DIR", "./checkpoints")
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, epoch, loss):
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint_epoch_{epoch}.pt")
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, path)
    print(f"Checkpoint saved: {path}")

def load_latest_checkpoint(model, optimizer):
    # Sort numerically by epoch: a plain string sort would rank
    # checkpoint_epoch_9.pt above checkpoint_epoch_10.pt
    checkpoints = sorted(
        (f for f in os.listdir(CHECKPOINT_DIR) if f.startswith("checkpoint_")),
        key=lambda f: int(f.split("_")[-1].split(".")[0]),
        reverse=True,
    )

    if not checkpoints:
        return 0  # Start from scratch

    latest = os.path.join(CHECKPOINT_DIR, checkpoints[0])
    checkpoint = torch.load(latest)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    print(f"Resumed from epoch {checkpoint['epoch']}")
    return checkpoint["epoch"] + 1

# Training loop with a checkpoint every 5 epochs
start_epoch = load_latest_checkpoint(model, optimizer)
for epoch in range(start_epoch, num_epochs):
    loss = train_one_epoch(model, optimizer, dataloader)
    if epoch % 5 == 0:
        save_checkpoint(model, optimizer, epoch, loss)
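How often should you checkpoint? Every 5 epochs is a reasonable default, but a common rule of thumb (Young's approximation) balances checkpoint-write cost against expected preemption frequency. A sketch with illustrative numbers:

```python
import math

# Young's approximation for checkpoint interval: sqrt(2 * C * M), where C is
# the time to write one checkpoint and M is the mean time between failures
# (here, preemptions). The input values below are illustrative assumptions.

def optimal_checkpoint_interval(checkpoint_seconds: float, mtbf_seconds: float) -> float:
    """Seconds of training between checkpoints that roughly minimizes overhead."""
    return math.sqrt(2 * checkpoint_seconds * mtbf_seconds)

# e.g. 30 s to write a checkpoint, preempted about once every 6 hours:
interval = optimal_checkpoint_interval(30, 6 * 3600)
print(f"checkpoint roughly every {interval / 60:.0f} minutes")
```

If an epoch takes longer than that interval, checkpoint every epoch; if epochs are very short, checkpointing every few epochs (as above) keeps the write overhead negligible.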

4. Budget Alerts and Caps

Set up budget alerts to catch runaway costs before they become painful:

bash
# Create a budget alert
gcloud billing budgets create \
    --billing-account=YOUR_BILLING_ACCOUNT_ID \
    --display-name="ML Project Budget" \
    --budget-amount=500USD \
    --threshold-rule=percent=0.5,basis=current-spend \
    --threshold-rule=percent=0.8,basis=current-spend \
    --threshold-rule=percent=1.0,basis=current-spend \
    --notifications-rule-pubsub-topic=projects/your-project-id/topics/budget-alerts \
    --filter-projects=projects/your-project-id

Budget notification handler

python
# functions/budget_alert/main.py
import base64
import json

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK"

def budget_alert(event, context):
    """Triggered by a Pub/Sub budget alert."""
    # Pub/Sub message data arrives base64-encoded
    pubsub_message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    budget_amount = pubsub_message["budgetAmount"]
    current_spend = pubsub_message["costAmount"]
    threshold = pubsub_message["alertThresholdExceeded"]

    percent = (current_spend / budget_amount) * 100

    emoji = ":warning:" if percent < 100 else ":rotating_light:"
    message = (
        f"{emoji} *GCP Budget Alert*\n"
        f"Budget: ${budget_amount:.2f}\n"
        f"Spend: ${current_spend:.2f} ({percent:.1f}%)\n"
        f"Threshold: {threshold * 100:.0f}%"
    )

    requests.post(SLACK_WEBHOOK, json={"text": message})

    # At 100%, optionally shut down non-essential services
    # (scale_down_resources is a placeholder you would implement)
    if percent >= 100:
        scale_down_resources()

    return {"status": "alert_sent"}
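Because Pub/Sub delivers the payload base64-encoded inside the event envelope, the decoding step is easy to get wrong. A sketch for exercising the parsing logic locally with a fabricated payload (field names follow the budget notification format; the amounts are made up):

```python
import base64
import json

# Simulate the Pub/Sub envelope a budget notification arrives in, so the
# handler's decode/parse path can be tested without deploying anything.

def make_event(budget_amount: float, cost_amount: float, threshold: float) -> dict:
    """Build a fake Pub/Sub event carrying a budget notification payload."""
    payload = {
        "budgetAmount": budget_amount,
        "costAmount": cost_amount,
        "alertThresholdExceeded": threshold,
    }
    return {"data": base64.b64encode(json.dumps(payload).encode("utf-8"))}

def parse_event(event: dict) -> dict:
    """Decode the base64 envelope back into the notification dict."""
    return json.loads(base64.b64decode(event["data"]))

msg = parse_event(make_event(500.0, 410.0, 0.8))
print(f"{msg['costAmount'] / msg['budgetAmount']:.0%} of budget spent")  # 82%
```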

5. Cost Dashboard in BigQuery + Looker Studio

Export billing data to BigQuery for detailed analysis:

bash
# Link the project to a billing account
gcloud billing projects link your-project-id \
    --billing-account=YOUR_BILLING_ACCOUNT_ID

# The BigQuery billing export itself is enabled in the Cloud Console:
# Billing → Billing export → BigQuery export → Edit settings
# (there is no gcloud command for this step)

Cost analysis queries

sql
-- Top 10 services by cost (current invoice month)
SELECT
  service.description AS service,
  SUM(cost) AS total_cost,
  ROUND(SUM(cost) / SUM(SUM(cost)) OVER () * 100, 1) AS pct_of_total
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
WHERE
  invoice.month = FORMAT_DATE("%Y%m", CURRENT_DATE())
GROUP BY
  service
ORDER BY
  total_cost DESC
LIMIT 10;
sql
-- Daily cost trend for ML-related services (last 30 days)
SELECT
  DATE(usage_start_time) AS date,
  service.description AS service,
  SUM(cost) AS daily_cost
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
WHERE
  service.description IN (
    'Cloud Run', 'Vertex AI', 'BigQuery',
    'Cloud Storage', 'Artifact Registry'
  )
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  date, service
ORDER BY
  date DESC, daily_cost DESC;
sql
-- Cost by resource label (requires labeling)
SELECT
  labels.value AS team,
  SUM(cost) AS total_cost
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
CROSS JOIN UNNEST(labels) AS labels
WHERE
  labels.key = 'team'
GROUP BY
  team
ORDER BY
  total_cost DESC;
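To pull these numbers from a script rather than the console, you can run the same SQL with the BigQuery client library. A sketch; the project ID, dataset name, and table pattern mirror the queries above and are assumptions about your export configuration:

```python
# Build the top-services query for a given project/dataset, then (optionally)
# run it with google-cloud-bigquery. The query builder is pure so it can be
# tested offline; the client call requires GCP credentials.

def top_services_query(project: str, dataset: str = "billing_export") -> str:
    """Return the current-month top-services SQL for a billing export table."""
    return f"""
    SELECT service.description AS service, SUM(cost) AS total_cost
    FROM `{project}.{dataset}.gcp_billing_export_v1*`
    WHERE invoice.month = FORMAT_DATE("%Y%m", CURRENT_DATE())
    GROUP BY service
    ORDER BY total_cost DESC
    LIMIT 10
    """

if __name__ == "__main__":
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client(project="your-project")
    for row in client.query(top_services_query("your-project")).result():
        print(f"{row.service}: ${row.total_cost:.2f}")
```

Wiring this into a scheduled job (Cloud Scheduler + a function) turns the dashboard into a weekly cost report.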

Connect to Looker Studio

  1. Open Looker Studio
  2. Create a new report → Connect to BigQuery
  3. Select your billing export dataset
  4. Build charts: daily cost trend, cost by service, cost by team

6. Resource Labeling and Organization

Labels are key-value pairs attached to GCP resources. They flow into billing exports, enabling cost attribution:

bash
# Label a Cloud Run service
gcloud run services update ml-predictor \
--region=us-central1 \
--update-labels=team=ml-platform,env=production,cost-center=data-science

# Label a Vertex AI endpoint
gcloud ai endpoints update 1234567890123456789 \
--region=us-central1 \
--update-labels=team=ml-platform,env=production,model=credit-scoring

# Label a Cloud Storage bucket
gcloud storage buckets update gs://your-project-ml \
--update-labels=team=ml-platform,env=production,data-type=training

| Label Key | Values | Purpose |
| --- | --- | --- |
| team | ml-platform, data-eng, research | Cost attribution |
| env | dev, staging, production | Environment filtering |
| cost-center | data-science, engineering | Budget allocation |
| model | credit-scoring, fraud-detection | Model-level tracking |
| lifecycle | temporary, permanent | Cleanup decisions |
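Labels only pay off if they are applied consistently. A small pre-flight check you could run before creating resources, using this page's example values as the allowed set (extend both dictionaries for your org):

```python
# Validate a resource's labels against the conventions in the table above.
# The allowed values are this page's examples, not an official schema.

ALLOWED = {
    "team": {"ml-platform", "data-eng", "research"},
    "env": {"dev", "staging", "production"},
    "lifecycle": {"temporary", "permanent"},
}
REQUIRED = {"team", "env"}

def label_errors(labels: dict) -> list[str]:
    """Return a list of problems; empty means the labels pass the convention."""
    errors = [f"missing required label: {k}" for k in REQUIRED - labels.keys()]
    errors += [
        f"invalid value for {k}: {v}"
        for k, v in labels.items()
        if k in ALLOWED and v not in ALLOWED[k]
    ]
    return errors

print(label_errors({"team": "ml-platform", "env": "prod"}))  # 'prod' not allowed
```

Catching `env=prod` vs `env=production` at creation time keeps the billing-export queries above from silently splitting one environment's cost across two rows.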

Bulk cleanup of unlabeled temporary resources

bash
# Find Cloud Run services without an env label (likely temporary)
gcloud run services list \
    --format="table(name,region,status.url)" \
    --filter="NOT labels.env:*"

# Delete services explicitly labeled lifecycle=temporary
for service in $(gcloud run services list --filter="labels.lifecycle=temporary" --format="value(name)"); do
    gcloud run services delete ${service} --region=us-central1 --quiet
done

Cost Optimization Checklist

  • Set budget alerts at 50%, 80%, and 100% thresholds
  • Use spot instances for all training jobs (with checkpointing)
  • Set --min-instances=0 on Cloud Run dev/staging services
  • Enable BigQuery billing export and build a cost dashboard
  • Label all resources with team, env, and cost-center
  • Clean up unused Artifact Registry images monthly
  • Use lifecycle policies on Cloud Storage to auto-delete old data
  • Review the free tier before provisioning new resources
  • Consider CUDs for production inference workloads running 24/7
  • Audit costs weekly — look for forgotten instances, old endpoints, unused datasets