# Cloud Run for ML
Cloud Run is a fully managed serverless platform that runs containers. It's ideal for deploying ML model servers because it scales to zero when idle, supports traffic splitting for canary deployments, and handles HTTPS automatically. No Kubernetes, no cluster management — just push a container and go.
## Why Cloud Run for ML?
| Feature | Cloud Run | Cloud Functions (1st gen) | GKE |
|---|---|---|---|
| Max timeout | 60 min | 9 min | Unlimited |
| GPU support | Preview | No | Yes |
| Concurrency | Up to 1000/container | 1/call | Configurable |
| Min instances | 0 (scale to zero) | 0 | Manual |
| Traffic splitting | Built-in | No | Istio/Gateway |
| Cold start | ~2-5s | ~1-2s | N/A |
## 1. FastAPI Model Server
Create a minimal FastAPI app that loads a model and serves predictions:
```python
# main.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="ML Prediction Server", version="1.0.0")

# Load model at startup (once per container instance)
model = joblib.load("model.joblib")


class PredictionRequest(BaseModel):
    features: list[float]


class PredictionResponse(BaseModel):
    prediction: float
    confidence: float


@app.get("/health")
def health():
    return {"status": "healthy", "model_loaded": model is not None}


@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    prediction = float(model.predict(features)[0])
    # Assumes a classifier that implements predict_proba
    confidence = float(np.max(model.predict_proba(features)))
    return PredictionResponse(prediction=prediction, confidence=confidence)
```
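Once the server is running (locally via `uvicorn main:app`, or after deployment), the endpoint can be exercised from any HTTP client. A minimal stdlib sketch of building the request — the localhost URL is an assumption for local testing:

```python
import json
import urllib.request


def build_prediction_request(url: str, features: list[float]) -> urllib.request.Request:
    """Build a POST request matching the PredictionRequest schema."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_prediction_request("http://localhost:8080/predict", [5.1, 3.5, 1.4, 0.2])
# response = urllib.request.urlopen(req)  # uncomment against a live server
# print(json.load(response))              # {"prediction": ..., "confidence": ...}
```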
## 2. Dockerfile for Cloud Run
Cloud Run requires the container to listen on the port specified by the PORT environment variable (default 8080):
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and model artifact
COPY main.py .
COPY model.joblib .

# Cloud Run sets PORT; default to 8080 for local runs
ENV PORT=8080

# Shell form so ${PORT} is expanded at runtime
CMD exec uvicorn main:app --host 0.0.0.0 --port ${PORT}
```
```text
# requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.0
scikit-learn==1.5.1
joblib==1.4.2
pydantic==2.9.0
```
## 3. Deploy to Cloud Run
Build the container with Cloud Build and deploy:
```bash
# Set variables
export PROJECT_ID="your-project-id"
export REGION="us-central1"
export SERVICE_NAME="ml-predictor"
export IMAGE="gcr.io/${PROJECT_ID}/${SERVICE_NAME}"

# Build and push using Cloud Build
gcloud builds submit --tag ${IMAGE}

# Deploy to Cloud Run
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --platform managed \
  --allow-unauthenticated \
  --cpu 2 \
  --memory 4Gi \
  --min-instances 0 \
  --max-instances 10 \
  --concurrency 80 \
  --timeout 300
```
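After a deploy it's worth confirming the new revision actually serves before sending traffic at it. A small polling helper against the `/health` endpoint defined earlier, sketched with an injectable probe so it can be tested without a network (the example URL is a placeholder):

```python
import json
import time
import urllib.request


def wait_for_healthy(url: str, probe=None, attempts: int = 10, delay: float = 3.0) -> bool:
    """Poll a /health endpoint until it reports healthy, or give up."""
    if probe is None:
        probe = lambda u: json.load(urllib.request.urlopen(u, timeout=5))
    for _ in range(attempts):
        try:
            if probe(url).get("status") == "healthy":
                return True
        except Exception:
            pass  # connection refused / timeout: retry after a pause
        time.sleep(delay)
    return False


# Example with a fake probe (no network needed):
print(wait_for_healthy("https://example.run.app/health",
                       probe=lambda u: {"status": "healthy"}, delay=0))  # → True
```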
### Key deployment flags
| Flag | Purpose | Example |
|---|---|---|
| `--min-instances` | Keep warm instances to avoid cold starts | 1 for production |
| `--max-instances` | Cap scaling to control costs | 10 or 100 |
| `--concurrency` | Requests per container instance | 80 (default), up to 1000 |
| `--cpu` | vCPU allocation (1, 2, 4) | 2 for ML inference |
| `--memory` | RAM per instance | 4Gi, 8Gi, 16Gi |
| `--timeout` | Max request duration | 300 (seconds) |
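Concurrency and instance limits interact: the number of instances Cloud Run needs is roughly request rate × request latency ÷ per-instance concurrency (Little's law). A back-of-envelope sketch — the traffic numbers are made up for illustration:

```python
import math


def estimated_instances(requests_per_second: float,
                        latency_seconds: float,
                        concurrency: int) -> int:
    """Little's-law estimate of how many instances a steady load needs."""
    in_flight = requests_per_second * latency_seconds  # avg concurrent requests
    return math.ceil(in_flight / concurrency)


# 200 req/s at 400 ms latency with --concurrency 40:
print(estimated_instances(200, 0.4, 40))  # → 2
```

If the estimate exceeds `--max-instances`, requests queue and latency climbs; size the cap with this math rather than guessing.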
## 4. Autoscaling Configuration
Cloud Run autoscales based on concurrency and CPU utilization. Fine-tune for ML workloads:
```bash
# Production ML service — keep 1 warm instance, scale up to 20
gcloud run services update ${SERVICE_NAME} \
  --region ${REGION} \
  --min-instances 1 \
  --max-instances 20 \
  --concurrency 40 \
  --cpu 4 \
  --memory 8Gi
```
ML models are large, and loading one on a cold start can add 5-15 seconds of latency. Setting `--min-instances 1` keeps at least one container warm. The trade-off is a standing cost for the idle instance — on the order of tens of dollars per month for a 2-CPU / 4Gi configuration, depending on region and idle-instance pricing.
## 5. Traffic Splitting for A/B Testing
Gradually shift traffic between model versions:
```bash
# Deploy a new revision without routing any traffic to it
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE}:v2 \
  --no-traffic \
  --tag canary \
  --region ${REGION}

# The tag gives the revision a dedicated URL (https://canary---<service-hash>.run.app).
# Pull it from the service's traffic status and test it directly; adjust the
# index if the tagged entry is not first in the list:
export CANARY_URL=$(gcloud run services describe ${SERVICE_NAME} \
  --region ${REGION} \
  --format="value(status.traffic[0].url)")
curl ${CANARY_URL}/health

# Shift 10% of traffic to the canary. Revision names look like
# ml-predictor-00002-abc; find yours with `gcloud run revisions list`.
# (--to-tags canary=10 is an equivalent shortcut.)
gcloud run services update-traffic ${SERVICE_NAME} \
  --to-revisions ${SERVICE_NAME}-00001=90,${SERVICE_NAME}-00002=10 \
  --region ${REGION}

# Promote v2 to 100%
gcloud run services update-traffic ${SERVICE_NAME} \
  --to-revisions ${SERVICE_NAME}-00002=100 \
  --region ${REGION}
```
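Note that Cloud Run's split is weighted per request, not sticky per caller: the same user may hit both revisions. If the A/B test needs stable assignment, the client or a fronting layer can bucket deterministically. A hypothetical sketch of hash-based bucketing (not a Cloud Run feature — this runs in your own code):

```python
import hashlib


def assign_revision(user_id: str, canary_percent: int = 10) -> str:
    """Deterministically bucket a caller into 'canary' or 'stable'."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 100 // 256  # 0-99, stable for a given user_id
    return "canary" if bucket < canary_percent else "stable"


# The same user always lands in the same bucket:
assert assign_revision("user-42") == assign_revision("user-42")

counts = {"canary": 0, "stable": 0}
for i in range(1000):
    counts[assign_revision(f"user-{i}")] += 1
print(counts)  # roughly 10% canary
```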
## 6. GPU-Enabled Cloud Run (Preview)
For models that need GPU acceleration (large LLMs, image generation):
```bash
# Deploy with NVIDIA L4 GPU (preview feature)
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --cpu 4 \
  --memory 16Gi \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling
```
GPU support is in preview. Availability is limited to specific regions (e.g., us-central1, europe-west4). GPU instances are significantly more expensive (~$0.75/hr for an L4). Use `--min-instances 0` to avoid charges when idle.
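Because scale-to-zero means the bill tracks actual usage, GPU cost is dominated by how many hours instances stay up. A quick arithmetic sketch — the hourly rate is the approximate figure above, not a quote; check current pricing:

```python
L4_HOURLY = 0.75  # approximate L4 rate; verify against current Cloud Run pricing


def monthly_gpu_cost(active_hours_per_day: float, rate: float = L4_HOURLY) -> float:
    """Estimated monthly GPU spend, assuming a 30-day month."""
    return round(active_hours_per_day * 30 * rate, 2)


print(monthly_gpu_cost(2))   # 2 busy hours/day → 45.0
print(monthly_gpu_cost(24))  # always on       → 540.0
```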
## 7. Request-Based Authentication
For internal services, require IAM authentication:
```bash
# Deploy without allowing unauthenticated access
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --no-allow-unauthenticated

# Generate an identity token for testing
export TOKEN=$(gcloud auth print-identity-token)

# Call the service (substitute your service URL's actual suffix for ${HASH})
curl -X POST \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}' \
  https://${SERVICE_NAME}-${HASH}.run.app/predict
```
## Summary
| Aspect | Recommendation |
|---|---|
| Min instances | 1 for production, 0 for dev |
| Concurrency | 40-80 for CPU-bound ML inference |
| Memory | 4Gi minimum for scikit-learn; 8-16Gi for deep learning |
| Timeout | 300s for most models; 600s for large-batch inference |
| Traffic splitting | Always use canary deployments for new model versions |