Cloud Run for ML

Cloud Run is a fully managed serverless platform that runs containers. It's ideal for deploying ML model servers because it scales to zero when idle, supports traffic splitting for canary deployments, and handles HTTPS automatically. No Kubernetes, no cluster management — just push a container and go.

Why Cloud Run for ML?

| Feature | Cloud Run | Cloud Functions | GKE |
|---|---|---|---|
| Max timeout | 60 min | 60 min | Unlimited |
| GPU support | Preview | No | Yes |
| Concurrency | Up to 1000/container | 1/call | Configurable |
| Min instances | 0 (scale to zero) | 0 | Manual |
| Traffic splitting | Built-in | No | Istio/Gateway |
| Cold start | ~2-5 s | ~1-2 s | N/A |

1. FastAPI Model Server

Create a minimal FastAPI app that loads a model and serves predictions:

```python
# main.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="ML Prediction Server", version="1.0.0")

# Load model at startup
model = joblib.load("model.joblib")

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: float
    confidence: float

@app.get("/health")
def health():
    return {"status": "healthy", "model_loaded": model is not None}

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    prediction = float(model.predict(features)[0])
    confidence = float(np.max(model.predict_proba(features)))
    return PredictionResponse(prediction=prediction, confidence=confidence)
```
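
The server expects a model.joblib artifact sitting next to main.py. A minimal sketch of producing one with scikit-learn, using the Iris dataset as a stand-in for real training data:

```python
# train.py — create the model.joblib artifact that main.py loads at startup.
# Iris is a placeholder; substitute your own features and labels.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import joblib

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)
joblib.dump(model, "model.joblib")

# Sanity check: the estimator must expose both predict and predict_proba,
# since the /predict handler calls both.
print(model.predict([[5.1, 3.5, 1.4, 0.2]])[0])
```

Any estimator works as long as it implements `predict` and `predict_proba`; regressors without `predict_proba` would need the confidence field removed or computed differently.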

2. Dockerfile for Cloud Run

Cloud Run requires the container to listen on the port specified by the PORT environment variable (default 8080):

```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and model artifact
COPY main.py .
COPY model.joblib .

# Cloud Run sets PORT; default to 8080 for local runs
ENV PORT=8080

# Shell form so ${PORT} is expanded at runtime
CMD exec uvicorn main:app --host 0.0.0.0 --port ${PORT}
```

```txt
# requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.0
scikit-learn==1.5.1
joblib==1.4.2
pydantic==2.9.0
```
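
A .dockerignore next to the Dockerfile keeps training data, virtual environments, and git history out of the build context, which speeds up `gcloud builds submit`. A starting point — adjust the entries to your repo layout:

```txt
# .dockerignore
.git
__pycache__/
*.pyc
.venv/
data/
notebooks/
```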

3. Deploy to Cloud Run

Build the container with Cloud Build and deploy:

```bash
# Set variables
export PROJECT_ID="your-project-id"
export REGION="us-central1"
export SERVICE_NAME="ml-predictor"
export IMAGE="gcr.io/${PROJECT_ID}/${SERVICE_NAME}"

# Build and push using Cloud Build
gcloud builds submit --tag ${IMAGE}

# Deploy to Cloud Run
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --platform managed \
  --allow-unauthenticated \
  --cpu 2 \
  --memory 4Gi \
  --min-instances 0 \
  --max-instances 10 \
  --concurrency 80 \
  --timeout 300
```

Key deployment flags

| Flag | Purpose | Example |
|---|---|---|
| `--min-instances` | Keep warm instances to avoid cold starts | `1` for production |
| `--max-instances` | Cap scaling to control costs | `10` or `100` |
| `--concurrency` | Requests per container instance | `80` (default), up to `1000` |
| `--cpu` | vCPU allocation (1, 2, 4) | `2` for ML inference |
| `--memory` | RAM per instance | `4Gi`, `8Gi`, `16Gi` |
| `--timeout` | Max request duration | `300` (seconds) |
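
The concurrency and max-instances flags interact: capacity is roughly instances × concurrency ÷ latency. A back-of-envelope sizing helper — this is just Little's law applied to the flags above, with illustrative numbers, not a Cloud Run API:

```python
import math

def instances_needed(peak_rps: float, avg_latency_s: float, concurrency: int) -> int:
    """Estimate --max-instances from expected peak load (Little's law: L = lambda * W)."""
    in_flight = peak_rps * avg_latency_s      # average concurrent requests at peak
    return math.ceil(in_flight / concurrency)

# 400 req/s at 200 ms per prediction, 80 requests per instance:
print(instances_needed(peak_rps=400, avg_latency_s=0.2, concurrency=80))  # → 1
```

Leave headroom above the estimate for traffic spikes; lowering `--concurrency` trades more instances for steadier per-request latency on CPU-bound inference.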

4. Autoscaling Configuration

Cloud Run autoscales based on concurrency and CPU utilization. Fine-tune for ML workloads:

```bash
# Production ML service — keep 1 warm instance, scale up to 20
gcloud run services update ${SERVICE_NAME} \
  --region ${REGION} \
  --min-instances 1 \
  --max-instances 20 \
  --concurrency 40 \
  --cpu 4 \
  --memory 8Gi
```

Cold Start Mitigation

ML models are large — loading them on cold start can add 5-15 seconds. Setting --min-instances 1 keeps at least one container warm. The cost is ~$10-20/month for a 2-CPU / 4Gi instance.
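
To sanity-check that monthly figure: idle min instances are billed at reduced per-vCPU-second and per-GiB-second rates. The rates below are assumptions for illustration only — check the current Cloud Run pricing page for your region and tier:

```python
# Rough monthly cost of one always-warm 2-CPU / 4Gi instance.
# Rates are ASSUMED idle-tier prices, not authoritative.
IDLE_USD_PER_VCPU_S = 0.0000025
IDLE_USD_PER_GIB_S = 0.00000025
SECONDS_PER_MONTH = 30 * 24 * 3600

monthly = (2 * IDLE_USD_PER_VCPU_S + 4 * IDLE_USD_PER_GIB_S) * SECONDS_PER_MONTH
print(f"~${monthly:.2f}/month")  # → ~$15.55/month
```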

5. Traffic Splitting for A/B Testing

Gradually shift traffic between model versions:

```bash
# Deploy a new revision without routing traffic
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE}:v2 \
  --no-traffic \
  --tag canary \
  --region ${REGION}

# Test the canary revision directly via its tagged URL
# (assumes the canary entry is first in status.traffic; otherwise inspect
# the full list with --format="yaml(status.traffic)")
export CANARY_URL=$(gcloud run services describe ${SERVICE_NAME} \
  --region ${REGION} \
  --format="value(status.traffic[0].url)")
curl ${CANARY_URL}/health

# Shift 10% of traffic to v2 (canary)
gcloud run services update-traffic ${SERVICE_NAME} \
  --to-revisions ${SERVICE_NAME}-00001=90,${SERVICE_NAME}-00002=10 \
  --region ${REGION}

# Promote v2 to 100%
gcloud run services update-traffic ${SERVICE_NAME} \
  --to-revisions ${SERVICE_NAME}-00002=100 \
  --region ${REGION}
```
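
Before promoting, compare error rates between the two revisions. A simple promotion gate, sketched as a plain function — the 1% tolerance is an assumption to tune against your SLO, and low-traffic canaries deserve a proper statistical test rather than a raw comparison:

```python
def should_promote(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_delta: float = 0.01) -> bool:
    """Promote only if the canary error rate is within max_delta of the baseline's."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate + max_delta

# 0.56% baseline vs 0.70% canary: within the 1% tolerance, safe to promote.
print(should_promote(50, 9000, 7, 1000))  # → True
```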

6. GPU-Enabled Cloud Run (Preview)

For models that need GPU acceleration (large LLMs, image generation):

```bash
# Deploy with NVIDIA L4 GPU (preview feature)
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --cpu 4 \
  --memory 16Gi \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling
```

GPU on Cloud Run

GPU support is in preview. Availability is limited to specific regions (us-central1, europe-west4). GPU instances are significantly more expensive (~$0.75/hr for L4). Use --min-instances 0 to avoid charges when idle.

7. Request-Based Authentication

For internal services, require IAM authentication:

```bash
# Deploy without allowing unauthenticated access
gcloud run deploy ${SERVICE_NAME} \
  --image ${IMAGE} \
  --region ${REGION} \
  --no-allow-unauthenticated

# Generate an identity token for testing
export TOKEN=$(gcloud auth print-identity-token)

# Call the service (curl sends a POST when -d is given)
curl -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  https://${SERVICE_NAME}-${HASH}.run.app/predict \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
```
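
The same call from Python, split into a testable request builder. The URL below is illustrative, and token acquisition is left to the caller — in production, `fetch_id_token` from the `google-auth` package is one way to obtain it:

```python
import json

def build_predict_request(base_url: str, token: str, features: list[float]) -> dict:
    """Assemble the authenticated POST that the /predict endpoint expects."""
    return {
        "method": "POST",
        "url": f"{base_url}/predict",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"features": features}),
    }

# Illustrative service URL; use the URL printed by `gcloud run deploy`.
req = build_predict_request("https://ml-predictor-abc123.run.app", "TOKEN",
                            [5.1, 3.5, 1.4, 0.2])
print(req["url"])  # → https://ml-predictor-abc123.run.app/predict
```

Keeping header and body assembly in a pure function makes the auth wiring easy to unit-test without hitting the live service.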

Summary

| Aspect | Recommendation |
|---|---|
| Min instances | 1 for production, 0 for dev |
| Concurrency | 40-80 for CPU-bound ML inference |
| Memory | 4Gi minimum for scikit-learn; 8-16Gi for deep learning |
| Timeout | 300 s for most models; 600 s for large-batch inference |
| Traffic splitting | Always use canary deployments for new model versions |