# Cloud Run for ML
Cloud Run is a fully managed serverless platform that runs containers. It's ideal for deploying ML model servers because it scales to zero when idle, supports traffic splitting for canary deployments, and handles HTTPS automatically. No Kubernetes, no cluster management — just push a container and go.
## Why Cloud Run for ML?
| Feature | Cloud Run | Cloud Functions (1st gen) | GKE |
|---|---|---|---|
| Max timeout | 60 min | 9 min | Unlimited |
| GPU support | Preview | No | Yes |
| Concurrency | Up to 1000/container | 1/call | Configurable |
| Min instances | 0 (scale to zero) | 0 | Manual |
| Traffic splitting | Built-in | No | Istio/Gateway |
| Cold start | ~2-5s | ~1-2s | N/A |
## 1. FastAPI Model Server
Create a minimal FastAPI app that loads a model and serves predictions:
```python
# main.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="ML Prediction Server", version="1.0.0")

# Load model at startup (once per container instance)
model = joblib.load("model.joblib")


class PredictionRequest(BaseModel):
    features: list[float]


class PredictionResponse(BaseModel):
    prediction: float
    confidence: float


@app.get("/health")
def health():
    return {"status": "healthy", "model_loaded": model is not None}


@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    prediction = float(model.predict(features)[0])
    # Assumes a classifier that implements predict_proba
    confidence = float(np.max(model.predict_proba(features)))
    return PredictionResponse(prediction=prediction, confidence=confidence)
```
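Once the server is running (locally via `uvicorn main:app`, or after deployment), the endpoint can be exercised from any HTTP client. A minimal stdlib sketch of building the request — the localhost URL is an assumption for local testing:

```python
import json
import urllib.request


def build_prediction_request(url: str, features: list[float]) -> urllib.request.Request:
    """Build a POST request matching the PredictionRequest schema."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_prediction_request("http://localhost:8080/predict", [5.1, 3.5, 1.4, 0.2])
# response = urllib.request.urlopen(req)  # uncomment against a live server
# print(json.load(response))              # {"prediction": ..., "confidence": ...}
```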
## 2. Dockerfile for Cloud Run
Cloud Run requires the container to listen on the port specified by the PORT environment variable (default 8080):
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and model artifact
COPY main.py .
COPY model.joblib .

# Cloud Run sets PORT; default to 8080 for local runs
ENV PORT=8080

# Shell form so ${PORT} is expanded at runtime
CMD exec uvicorn main:app --host 0.0.0.0 --port ${PORT}
```
```text
# requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.0
scikit-learn==1.5.1
joblib==1.4.2
pydantic==2.9.0
```
## 3. Deploy to Cloud Run
Build the container with Cloud Build and deploy:
```bash
# Set variables
export PROJECT_ID="your-project-id"
export REGION="us-central1"
export SERVICE_NAME="ml-predictor"
export IMAGE="gcr.io/${PROJECT_ID}/${SERVICE_NAME}"

# Build and push using Cloud Build
gcloud builds submit --tag ${IMAGE}

# Deploy to Cloud Run
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --platform managed \
  --allow-unauthenticated \
  --cpu 2 \
  --memory 4Gi \
  --min-instances 0 \
  --max-instances 10 \
  --concurrency 80 \
  --timeout 300
```
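After a deploy it's worth confirming the new revision actually serves before sending traffic at it. A small polling helper against the `/health` endpoint defined earlier, sketched with an injectable probe so it can be tested without a network (the example URL is a placeholder):

```python
import json
import time
import urllib.request


def wait_for_healthy(url: str, probe=None, attempts: int = 10, delay: float = 3.0) -> bool:
    """Poll a /health endpoint until it reports healthy, or give up."""
    if probe is None:
        probe = lambda u: json.load(urllib.request.urlopen(u, timeout=5))
    for _ in range(attempts):
        try:
            if probe(url).get("status") == "healthy":
                return True
        except Exception:
            pass  # connection refused / timeout: retry after a pause
        time.sleep(delay)
    return False


# Example with a fake probe (no network needed):
print(wait_for_healthy("https://example.run.app/health",
                       probe=lambda u: {"status": "healthy"}, delay=0))  # → True
```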
### Key deployment flags
| Flag | Purpose | Example |
|---|---|---|
| `--min-instances` | Keep warm instances to avoid cold starts | 1 for production |
| `--max-instances` | Cap scaling to control costs | 10 or 100 |
| `--concurrency` | Requests per container instance | 80 (default), up to 1000 |
| `--cpu` | vCPU allocation (1, 2, 4) | 2 for ML inference |
| `--memory` | RAM per instance | 4Gi, 8Gi, 16Gi |
| `--timeout` | Max request duration | 300 (seconds) |
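Concurrency and instance limits interact: the number of instances Cloud Run needs is roughly request rate × request latency ÷ per-instance concurrency (Little's law). A back-of-envelope sketch — the traffic numbers are made up for illustration:

```python
import math


def estimated_instances(requests_per_second: float,
                        latency_seconds: float,
                        concurrency: int) -> int:
    """Little's-law estimate of how many instances a steady load needs."""
    in_flight = requests_per_second * latency_seconds  # avg concurrent requests
    return math.ceil(in_flight / concurrency)


# 200 req/s at 400 ms latency with --concurrency 40:
print(estimated_instances(200, 0.4, 40))  # → 2
```

If the estimate exceeds `--max-instances`, requests queue and latency climbs; size the cap with this math rather than guessing.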
## 4. Autoscaling Configuration
Cloud Run autoscales based on concurrency and CPU utilization. Fine-tune for ML workloads:
```bash
# Production ML service — keep 1 warm instance, scale up to 20
gcloud run services update ${SERVICE_NAME} \
  --region ${REGION} \
  --min-instances 1 \
  --max-instances 20 \
  --concurrency 40 \
  --cpu 4 \
  --memory 8Gi
```
ML models are large, and loading one on a cold start can add 5-15 seconds of latency. Setting `--min-instances 1` keeps at least one container warm. The trade-off is a standing cost for the idle instance — on the order of tens of dollars per month for a 2-CPU / 4Gi configuration, depending on region and idle-instance pricing.
## 5. Traffic Splitting for A/B Testing
Gradually shift traffic between model versions:
```bash
# Deploy a new revision without routing any traffic to it
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE}:v2 \
  --no-traffic \
  --tag canary \
  --region ${REGION}

# The tag gives the revision a dedicated URL (https://canary---<service-hash>.run.app).
# Pull it from the service's traffic status and test it directly; adjust the
# index if the tagged entry is not first in the list:
export CANARY_URL=$(gcloud run services describe ${SERVICE_NAME} \
  --region ${REGION} \
  --format="value(status.traffic[0].url)")
curl ${CANARY_URL}/health

# Shift 10% of traffic to the canary. Revision names look like
# ml-predictor-00002-abc; find yours with `gcloud run revisions list`.
# (--to-tags canary=10 is an equivalent shortcut.)
gcloud run services update-traffic ${SERVICE_NAME} \
  --to-revisions ${SERVICE_NAME}-00001=90,${SERVICE_NAME}-00002=10 \
  --region ${REGION}

# Promote v2 to 100%
gcloud run services update-traffic ${SERVICE_NAME} \
  --to-revisions ${SERVICE_NAME}-00002=100 \
  --region ${REGION}
```
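Note that Cloud Run's split is weighted per request, not sticky per caller: the same user may hit both revisions. If the A/B test needs stable assignment, the client or a fronting layer can bucket deterministically. A hypothetical sketch of hash-based bucketing (not a Cloud Run feature — this runs in your own code):

```python
import hashlib


def assign_revision(user_id: str, canary_percent: int = 10) -> str:
    """Deterministically bucket a caller into 'canary' or 'stable'."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 100 // 256  # 0-99, stable for a given user_id
    return "canary" if bucket < canary_percent else "stable"


# The same user always lands in the same bucket:
assert assign_revision("user-42") == assign_revision("user-42")

counts = {"canary": 0, "stable": 0}
for i in range(1000):
    counts[assign_revision(f"user-{i}")] += 1
print(counts)  # roughly 10% canary
```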
## 6. GPU-Enabled Cloud Run (Preview)
For models that need GPU acceleration (large LLMs, image generation):
```bash
# Deploy with NVIDIA L4 GPU (preview feature)
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --cpu 4 \
  --memory 16Gi \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling
```
GPU support is in preview. Availability is limited to specific regions (e.g., us-central1, europe-west4). GPU instances are significantly more expensive (~$0.75/hr for an L4). Use `--min-instances 0` to avoid charges when idle.
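Because scale-to-zero means the bill tracks actual usage, GPU cost is dominated by how many hours instances stay up. A quick arithmetic sketch — the hourly rate is the approximate figure above, not a quote; check current pricing:

```python
L4_HOURLY = 0.75  # approximate L4 rate; verify against current Cloud Run pricing


def monthly_gpu_cost(active_hours_per_day: float, rate: float = L4_HOURLY) -> float:
    """Estimated monthly GPU spend, assuming a 30-day month."""
    return round(active_hours_per_day * 30 * rate, 2)


print(monthly_gpu_cost(2))   # 2 busy hours/day → 45.0
print(monthly_gpu_cost(24))  # always on       → 540.0
```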
## 7. Request-Based Authentication
For internal services, require IAM authentication:
```bash
# Deploy without allowing unauthenticated access
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --no-allow-unauthenticated

# Generate an identity token for testing
export TOKEN=$(gcloud auth print-identity-token)

# Call the service (substitute your service URL's actual suffix for ${HASH})
curl -X POST \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}' \
  https://${SERVICE_NAME}-${HASH}.run.app/predict
```
## Summary
| Aspect | Recommendation |
|---|---|
| Min instances | 1 for production, 0 for dev |
| Concurrency | 40-80 for CPU-bound ML inference |
| Memory | 4Gi minimum for scikit-learn; 8-16Gi for deep learning |
| Timeout | 300s for most models; 600s for large-batch inference |
| Traffic splitting | Always use canary deployments for new model versions |