Cost Optimization
GCP bills can spiral quickly — a single GPU training job left running overnight can cost hundreds of dollars. This page covers practical strategies to keep your ML workload costs under control while still getting the compute power you need.
The Cost Problem in ML
Typical ML Monthly Spend (Unoptimized):
```
Vertex AI Training (GPU)           $800    40%
Cloud Run (always-on instances)    $400    20%
BigQuery (full table scans)        $300    15%
Cloud Storage (unused data)        $200    10%
Artifact Registry (old images)     $150     8%
Pub/Sub + Functions                 $50     2%
Other                              $100     5%
──────────────────────────────────────────────
Total                            $2,000
After optimization                 $600   ← 70% savings
```
1. GCP Free Tier Services and Limits
GCP offers an Always Free tier that doesn't expire after the trial period:
| Service | Free Tier Limit | ML Use Case |
|---|---|---|
| Cloud Run | 2M requests/month, 360K vCPU-seconds | Dev/test model servers |
| Cloud Functions | 2M invocations/month | Event triggers, webhooks |
| Pub/Sub | 10 GB/month | ML pipeline messaging |
| Cloud Storage | 5 GB (US) | Small datasets, model artifacts |
| BigQuery | 1 TB queries/month, 10 GB storage | Data analysis, monitoring |
| Artifact Registry | 0.5 GB storage | Small container images |
| Cloud Scheduler | 3 jobs/month | Periodic retraining triggers |
| Vertex AI | Limited free credits (new accounts) | Experimentation |
```bash
# Check your current quota usage for a service
gcloud beta services quota list \
    --service=run.googleapis.com \
    --consumer=projects/your-project-id

# View billing account details
gcloud billing accounts list
gcloud billing accounts describe YOUR_BILLING_ACCOUNT_ID
```
For course projects and prototyping, you can stay entirely within the free tier. Use Cloud Run with `--min-instances=0`, keep BigQuery datasets small, and clean up unused resources weekly.
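To sanity-check whether a prototype fits these limits before deploying, a quick back-of-envelope calculation helps. The sketch below checks a workload against the two Cloud Run limits from the table; `fits_free_tier` and the traffic numbers are illustrative assumptions, not values fetched from GCP, so verify current limits against the pricing docs.

```python
# Rough check of whether a dev workload stays inside the Always Free limits
# listed above. Limits are hardcoded assumptions, not live quota lookups.

FREE_RUN_REQUESTS = 2_000_000      # Cloud Run requests/month
FREE_RUN_VCPU_SECONDS = 360_000    # Cloud Run vCPU-seconds/month

def fits_free_tier(requests_per_day: int, vcpu_seconds_per_request: float) -> bool:
    """Return True if a month of traffic stays inside both Cloud Run limits."""
    monthly_requests = requests_per_day * 30
    monthly_vcpu_seconds = monthly_requests * vcpu_seconds_per_request
    return (monthly_requests <= FREE_RUN_REQUESTS
            and monthly_vcpu_seconds <= FREE_RUN_VCPU_SECONDS)

# 1,000 requests/day at 0.2 vCPU-seconds each: 30K requests, 6K vCPU-seconds
print(fits_free_tier(1_000, 0.2))   # True
# 50,000 requests/day at 0.5 vCPU-seconds each blows the vCPU-second budget
print(fits_free_tier(50_000, 0.5))  # False
```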
2. Committed Use Discounts
If you have predictable, steady-state workloads, committed use discounts (CUDs) offer up to 57% savings in exchange for a 1-year or 3-year commitment:
| Commitment | Discount | Best For |
|---|---|---|
| 1-year CUD | ~28% off | Production model servers |
| 3-year CUD | ~46% off | Long-running inference clusters |
| Flexible CUD | ~15% off | Mixed workloads across regions |
```bash
# Purchase a committed use discount
gcloud compute commitments create ml-inference-cud \
    --region=us-central1 \
    --plan=12-month \
    --resources=vcpu=4,memory=16384MB \
    --project=your-project-id

# View existing commitments
gcloud compute commitments list
```
CUDs charge you regardless of usage. Only commit for resources you'll use 24/7 (e.g., a production inference server with `--min-instances=1`). Never commit for training jobs.
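The break-even point follows directly from the discount: since a CUD bills 24/7, it only wins if you would otherwise run the same resources more than (1 - discount) of the time. A minimal sketch, using the approximate discounts from the table above:

```python
# Break-even utilization for a committed use discount: committed capacity is
# billed around the clock, so a CUD beats on-demand only above this fraction
# of the month. Discounts are the approximate figures from the table above.

def breakeven_utilization(discount: float) -> float:
    """Fraction of the month the resource must run on-demand to match CUD cost."""
    return 1.0 - discount

for name, discount in [("1-year CUD", 0.28), ("3-year CUD", 0.46), ("Flexible CUD", 0.15)]:
    hours = breakeven_utilization(discount) * 730  # ~730 hours in a month
    print(f"{name}: worth it above ~{hours:.0f} on-demand hours/month")
```

This is why the note above says never to commit for training jobs: a job that runs a few hours a day sits far below the break-even utilization.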
3. Spot Instances for Training
Spot VMs use spare compute capacity at 60-91% discount. They can be preempted at any time, but for ML training with checkpointing, this is an excellent trade-off:
```bash
# Submit a Vertex AI training job on Spot VMs.
# Note: gcloud ai custom-jobs create has no --spot flag; spot scheduling is
# requested via the job spec's scheduling strategy, passed in a config file
# (e.g. a YAML containing `scheduling: {strategy: SPOT}`).
gcloud ai custom-jobs create \
    --display-name="ml-training-spot" \
    --region=us-central1 \
    --worker-pool-spec=machine-type=n1-standard-8,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,replica-count=1,container-image-uri=gcr.io/your-project/trainer:latest \
    --config=spot_config.yaml \
    --project=your-project-id
```
Python: Training with spot instances and checkpointing
```python
from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import custom_job as custom_job_types

aiplatform.init(project="your-project-id", location="us-central1")

# Define a worker pool; spot pricing is requested at run time below
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "gcr.io/your-project/trainer:latest",
            "command": ["python", "train.py"],
            "args": [
                "--checkpoint-dir=gs://your-project-ml/checkpoints",
                "--resume-from-checkpoint=true",
            ],
        },
    }
]

job = aiplatform.CustomJob(
    display_name="spot-training-with-checkpoints",
    worker_pool_specs=worker_pool_specs,
)

# Run on Spot VMs. Recent google-cloud-aiplatform releases expose this via
# the scheduling_strategy parameter (there is no spot=True keyword).
job.run(scheduling_strategy=custom_job_types.Scheduling.Strategy.SPOT)

# If preempted, the job is restarted and the training script should
# auto-resume from the latest checkpoint in the checkpoint dir
```
Checkpoint strategy for spot instances
```python
# train.py — resilient training with frequent checkpoints
import os

import torch

CHECKPOINT_DIR = os.environ.get("CHECKPOINT_DIR", "./checkpoints")
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, epoch, loss):
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint_epoch_{epoch}.pt")
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, path)
    print(f"Checkpoint saved: {path}")

def load_latest_checkpoint(model, optimizer):
    # Sort numerically by epoch — a plain string sort would rank
    # "epoch_9" after "epoch_10"
    checkpoints = sorted(
        (f for f in os.listdir(CHECKPOINT_DIR) if f.startswith("checkpoint_")),
        key=lambda f: int(f.split("_")[-1].split(".")[0]),
    )
    if not checkpoints:
        return 0  # Start from scratch
    latest = os.path.join(CHECKPOINT_DIR, checkpoints[-1])
    checkpoint = torch.load(latest)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    print(f"Resumed from epoch {checkpoint['epoch']}")
    return checkpoint["epoch"] + 1

# Training loop with checkpointing every 5 epochs
start_epoch = load_latest_checkpoint(model, optimizer)
for epoch in range(start_epoch, num_epochs):
    loss = train_one_epoch(model, optimizer, dataloader)
    if epoch % 5 == 0:
        save_checkpoint(model, optimizer, epoch, loss)
```
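A fixed "every 5 epochs" cadence is a reasonable default, but the interval can be tuned to how often spot preemptions actually occur. Young's approximation suggests checkpointing roughly every sqrt(2 × checkpoint cost × mean time between preemptions). A sketch of that rule of thumb, where the 30-second checkpoint cost and 6-hour preemption interval are assumptions you should measure for your own jobs:

```python
import math

# Young's approximation for the checkpoint interval that minimizes lost work:
# interval ~ sqrt(2 * checkpoint_cost * mean_time_between_preemptions).
# Both inputs are assumed values; measure them for your region and VM type.

def optimal_checkpoint_interval(checkpoint_seconds: float,
                                mean_seconds_between_preemptions: float) -> float:
    return math.sqrt(2 * checkpoint_seconds * mean_seconds_between_preemptions)

# 30 s to write a checkpoint, preemption roughly every 6 hours on average
interval = optimal_checkpoint_interval(30, 6 * 3600)
print(f"Checkpoint every ~{interval / 60:.0f} minutes")  # ~19 minutes
```

Checkpointing more often than this wastes time writing checkpoints; less often wastes recomputed work after each preemption.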
4. Budget Alerts and Caps
Set up budget alerts to catch runaway costs before they become painful:
```bash
# Create a budget alert (threshold percents are decimal fractions, not 0-100)
gcloud billing budgets create \
    --billing-account=YOUR_BILLING_ACCOUNT_ID \
    --display-name="ML Project Budget" \
    --budget-amount=500USD \
    --threshold-rule=percent=0.5,basis=current-spend \
    --threshold-rule=percent=0.8,basis=current-spend \
    --threshold-rule=percent=1.0,basis=current-spend \
    --notifications-rule-pubsub-topic=projects/your-project-id/topics/budget-alerts \
    --filter-projects=projects/your-project-id
```
Budget notification handler
```python
# functions/budget_alert/main.py
import base64
import json

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK"

def budget_alert(event, context):
    """Triggered by a Pub/Sub budget alert; the payload is base64-encoded."""
    pubsub_message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    budget_amount = pubsub_message["budgetAmount"]
    current_spend = pubsub_message["costAmount"]
    # alertThresholdExceeded is only present once a threshold has been crossed
    threshold = pubsub_message.get("alertThresholdExceeded", 0)

    percent = (current_spend / budget_amount) * 100
    emoji = ":warning:" if percent < 100 else ":rotating_light:"
    message = (
        f"{emoji} *GCP Budget Alert*\n"
        f"Budget: ${budget_amount:.2f}\n"
        f"Spend: ${current_spend:.2f} ({percent:.1f}%)\n"
        f"Threshold: {threshold * 100:.0f}%"
    )
    requests.post(SLACK_WEBHOOK, json={"text": message})

    # At 100%, optionally shut down non-essential services
    # (scale_down_resources is a placeholder for your own teardown logic)
    if percent >= 100:
        scale_down_resources()
    return {"status": "alert_sent"}
```
5. Cost Dashboard in BigQuery + Looker Studio
Export billing data to BigQuery for detailed analysis:
```bash
# Link the project to a billing account
gcloud billing projects link your-project-id \
    --billing-account=YOUR_BILLING_ACCOUNT_ID

# Billing export to BigQuery cannot be enabled from gcloud — turn it on in
# the Console: Billing → Billing export → BigQuery export, pointing at a
# dataset named billing_export
```
Cost analysis queries
```sql
-- Top 10 services by cost (current invoice month)
SELECT
  service.description AS service,
  SUM(cost) AS total_cost,
  ROUND(SUM(cost) / SUM(SUM(cost)) OVER () * 100, 1) AS pct_of_total
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
WHERE
  invoice.month = FORMAT_DATE("%Y%m", CURRENT_DATE())
GROUP BY
  service
ORDER BY
  total_cost DESC
LIMIT 10;
```

```sql
-- Daily cost trend for ML-related services (last 30 days)
SELECT
  DATE(usage_start_time) AS date,
  service.description AS service,
  SUM(cost) AS daily_cost
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
WHERE
  service.description IN (
    'Cloud Run', 'Vertex AI', 'BigQuery',
    'Cloud Storage', 'Artifact Registry'
  )
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  date, service
ORDER BY
  date DESC, daily_cost DESC;
```

```sql
-- Cost by resource label (requires labeling; labels is an ARRAY<STRUCT>)
SELECT
  label.value AS team,
  SUM(cost) AS total_cost
FROM
  `your-project.billing_export.gcp_billing_export_v1*`
CROSS JOIN UNNEST(labels) AS label
WHERE
  label.key = 'team'
GROUP BY
  team
ORDER BY
  total_cost DESC;
```
Connect to Looker Studio
- Open Looker Studio
- Create a new report → Connect to BigQuery
- Select your billing export dataset
- Build charts: daily cost trend, cost by service, cost by team
6. Resource Labeling and Organization
Labels are key-value pairs attached to GCP resources. They flow into billing exports, enabling cost attribution:
```bash
# Label a Cloud Run service
gcloud run services update ml-predictor \
    --region=us-central1 \
    --update-labels=team=ml-platform,env=production,cost-center=data-science

# Label a Vertex AI endpoint
gcloud ai endpoints update 1234567890123456789 \
    --region=us-central1 \
    --update-labels=team=ml-platform,env=production,model=credit-scoring

# Label a Cloud Storage bucket
gcloud storage buckets update gs://your-project-ml \
    --update-labels=team=ml-platform,env=production,data-type=training
```
Recommended label schema
| Label Key | Values | Purpose |
|---|---|---|
| team | ml-platform, data-eng, research | Cost attribution |
| env | dev, staging, production | Environment filtering |
| cost-center | data-science, engineering | Budget allocation |
| model | credit-scoring, fraud-detection | Model-level tracking |
| lifecycle | temporary, permanent | Cleanup decisions |
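To keep the schema from drifting, labels can be checked before deployment. A minimal sketch of a validator for the table above; `validate_labels` is a hypothetical helper you might run in CI before `gcloud ... --update-labels`, not part of any GCP SDK:

```python
# Validate a resource's labels against the recommended schema above.
# The required keys and allowed values mirror the table; `model` and
# `lifecycle` are treated as optional here.

REQUIRED_LABELS = {
    "team": {"ml-platform", "data-eng", "research"},
    "env": {"dev", "staging", "production"},
    "cost-center": {"data-science", "engineering"},
}

def validate_labels(labels: dict) -> list:
    """Return a list of problems; an empty list means the labels pass."""
    problems = []
    for key, allowed in REQUIRED_LABELS.items():
        if key not in labels:
            problems.append(f"missing required label: {key}")
        elif labels[key] not in allowed:
            problems.append(f"invalid value for {key}: {labels[key]}")
    return problems

print(validate_labels({"team": "ml-platform", "env": "production",
                       "cost-center": "data-science"}))  # []
print(validate_labels({"team": "ml-platform", "env": "qa"}))
# ['invalid value for env: qa', 'missing required label: cost-center']
```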
Bulk cleanup of unlabeled temporary resources
```bash
# Find Cloud Run services without an env label (likely temporary)
gcloud run services list \
    --format="table(name,region,status.url)" \
    --filter="NOT labels.env:*"

# Delete services explicitly labeled as temporary
for service in $(gcloud run services list --filter="labels.lifecycle=temporary" --format="value(name)"); do
  gcloud run services delete "${service}" --region=us-central1 --quiet
done
```
Cost Optimization Checklist
- Set budget alerts at 50%, 80%, and 100% thresholds
- Use spot instances for all training jobs (with checkpointing)
- Set `--min-instances=0` on Cloud Run dev/staging services
- Enable BigQuery billing export and build a cost dashboard
- Label all resources with team, env, and cost-center
- Clean up unused Artifact Registry images monthly
- Use lifecycle policies on Cloud Storage to auto-delete old data
- Review the free tier before provisioning new resources
- Consider CUDs for production inference workloads running 24/7
- Audit costs weekly — look for forgotten instances, old endpoints, unused datasets