# BigQuery ML

BigQuery ML lets you create and execute machine learning models directly inside BigQuery using standard SQL. No need to export data, no Python training scripts, no separate infrastructure. If you can write SQL, you can build ML models where your data already lives.

## Why BigQuery ML?

Traditional ML workflow: Export data → Move to training environment → Train model → Export predictions → Load back to database. This is slow, expensive, and error-prone.

BigQuery ML workflow: Write CREATE MODEL. Done.

| Advantage | Description |
|---|---|
| No data movement | Train where the data lives; no ETL, no export |
| SQL interface | Analysts can build models without Python |
| Automatic scaling | BigQuery handles distributed compute |
| Integrated evaluation | ML.EVALUATE returns metrics instantly |
| Batch + real-time | ML.PREDICT for batch, Vertex AI for online serving |

## CREATE MODEL Syntax

The CREATE MODEL statement trains a model and stores it in your BigQuery dataset.

### Logistic Regression

```sql
-- Train a logistic regression model for churn prediction
CREATE OR REPLACE MODEL `project.dataset.churn_logreg`
OPTIONS(
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  max_iterations = 50,
  learn_rate = 0.01,
  l1_reg = 0.1,
  l2_reg = 0.1,
  data_split_method = 'AUTO_SPLIT',
  input_label_cols = ['churned']
) AS
SELECT
  tenure,
  monthly_charges,
  total_charges,
  num_support_tickets,
  contract_type,
  payment_method,
  churned
FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
```

### Boosted Tree (XGBoost)

```sql
-- Train a boosted tree classifier
CREATE OR REPLACE MODEL `project.dataset.churn_xgboost`
OPTIONS(
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  max_iterations = 100,
  max_tree_depth = 6,
  learn_rate = 0.1,
  subsample = 0.8,
  auto_class_weights = TRUE,
  input_label_cols = ['churned']
) AS
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
```

## Supported Model Types

| Model Type | model_type Value | Use Case |
|---|---|---|
| Linear Regression | LINEAR_REG | Predicting a continuous number |
| Logistic Regression | LOGISTIC_REG | Binary/multiclass classification |
| Boosted Tree (Classifier) | BOOSTED_TREE_CLASSIFIER | Tabular classification |
| Boosted Tree (Regressor) | BOOSTED_TREE_REGRESSOR | Tabular regression |
| Random Forest | RANDOM_FOREST_CLASSIFIER / RANDOM_FOREST_REGRESSOR | Robust tabular models |
| Deep Neural Network | DNN_CLASSIFIER / DNN_REGRESSOR | Complex patterns in tabular data |
| Wide & Deep | DNN_LINEAR_COMBINED_CLASSIFIER | Memorization + generalization |
| AutoML Tables | AUTOML_CLASSIFIER / AUTOML_REGRESSOR | Automated model selection |
| K-Means | KMEANS | Unsupervised clustering |
| TensorFlow | TENSORFLOW | Import a pre-trained TF model |
| ONNX | ONNX | Import ONNX models |
| ARIMA | ARIMA_PLUS | Time series forecasting |
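The examples in this page focus on classification, but the other model types follow the same CREATE MODEL pattern. As a sketch, an ARIMA_PLUS time series model looks like this (the `daily_sales` table and its columns are hypothetical):

```sql
-- Sketch: train a time series forecasting model
CREATE OR REPLACE MODEL `project.dataset.sales_arima`
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'order_date',
  time_series_data_col = 'daily_revenue'
) AS
SELECT order_date, daily_revenue
FROM `project.dataset.daily_sales`;

-- Forecast the next 30 periods with 90% prediction intervals
SELECT *
FROM ML.FORECAST(
  MODEL `project.dataset.sales_arima`,
  STRUCT(30 AS horizon, 0.9 AS confidence_level)
);
```

ARIMA_PLUS handles seasonality and holiday effects automatically, which is why it replaced the plain ARIMA model type.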

## Evaluation with ML.EVALUATE

After training, use ML.EVALUATE to get model performance metrics:

```sql
-- Evaluate the logistic regression model
SELECT * FROM ML.EVALUATE(MODEL `project.dataset.churn_logreg`);

-- Evaluate against a specific dataset (not the auto-split test set)
SELECT * FROM ML.EVALUATE(
  MODEL `project.dataset.churn_xgboost`,
  (
    SELECT * FROM `project.dataset.customer_features`
    WHERE training_data = FALSE
  )
);
```

### Evaluation Metrics Output

For classification models, ML.EVALUATE returns:

| Metric | Description |
|---|---|
| precision | True positives / (true positives + false positives) |
| recall | True positives / (true positives + false negatives) |
| accuracy | Correct predictions / total predictions |
| f1_score | Harmonic mean of precision and recall |
| log_loss | Cross-entropy loss |
| roc_auc | Area under the ROC curve |

```sql
-- Compare models side by side
SELECT
  'logreg' AS model,
  precision, recall, accuracy, f1_score, roc_auc
FROM ML.EVALUATE(MODEL `project.dataset.churn_logreg`)
UNION ALL
SELECT
  'xgboost' AS model,
  precision, recall, accuracy, f1_score, roc_auc
FROM ML.EVALUATE(MODEL `project.dataset.churn_xgboost`);
```
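For binary classifiers, ML.EVALUATE also accepts an optional threshold argument. The metrics above are computed at the default 0.5 cutoff; a sketch recomputing them at 0.6, assuming the churn model above:

```sql
-- Re-evaluate at a custom classification threshold
SELECT * FROM ML.EVALUATE(
  MODEL `project.dataset.churn_logreg`,
  (
    SELECT * FROM `project.dataset.customer_features`
    WHERE training_data = FALSE
  ),
  STRUCT(0.6 AS threshold)
);
```

Raising the threshold typically trades recall for precision, which matters when acting on a churn flag is expensive.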

### Confusion Matrix

```sql
-- Generate a confusion matrix
SELECT * FROM ML.CONFUSION_MATRIX(
  MODEL `project.dataset.churn_xgboost`,
  (
    SELECT * FROM `project.dataset.customer_features`
    WHERE training_data = FALSE
  )
);
```

### ROC Curve

```sql
-- Generate ROC curve data points
SELECT * FROM ML.ROC_CURVE(
  MODEL `project.dataset.churn_xgboost`,
  (
    SELECT * FROM `project.dataset.customer_features`
    WHERE training_data = FALSE
  )
);
```
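ML.ROC_CURVE also takes an optional array of thresholds, which is convenient for plotting evenly spaced points; a sketch using the same model:

```sql
-- Evaluate the curve at fixed thresholds (0.00, 0.05, ..., 1.00)
SELECT threshold, recall, false_positive_rate
FROM ML.ROC_CURVE(
  MODEL `project.dataset.churn_xgboost`,
  (
    SELECT * FROM `project.dataset.customer_features`
    WHERE training_data = FALSE
  ),
  GENERATE_ARRAY(0, 1, 0.05)
);
```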

## Prediction with ML.PREDICT

Use ML.PREDICT to generate predictions for new data:

```sql
-- Predict churn for current customers
SELECT *
FROM ML.PREDICT(
  MODEL `project.dataset.churn_xgboost`,
  (
    SELECT
      tenure,
      monthly_charges,
      total_charges,
      num_support_tickets,
      contract_type,
      payment_method,
      customer_id
    FROM `project.dataset.current_customers`
  )
);

-- Get only high-risk customers
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs
FROM ML.PREDICT(
  MODEL `project.dataset.churn_xgboost`,
  (SELECT * FROM `project.dataset.current_customers`)
)
WHERE predicted_churned = 1
  AND (
    SELECT prob FROM UNNEST(predicted_churned_probs)
    WHERE label = 1
  ) > 0.8;
```

### Saving Predictions to a Table

```sql
-- Write predictions to a new table for downstream use
CREATE OR REPLACE TABLE `project.dataset.churn_predictions` AS
SELECT
  customer_id,
  predicted_churned,
  (
    SELECT prob FROM UNNEST(predicted_churned_probs) WHERE label = 1
  ) AS churn_probability
FROM ML.PREDICT(
  MODEL `project.dataset.churn_xgboost`,
  (SELECT * FROM `project.dataset.current_customers`)
);
```
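To see why the model flagged a given customer, ML.EXPLAIN_PREDICT returns per-row feature attributions alongside each prediction; a sketch using the same model and input table:

```sql
-- Predictions plus the top 3 contributing features per row
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL `project.dataset.churn_xgboost`,
  (SELECT * FROM `project.dataset.current_customers`),
  STRUCT(3 AS top_k_features)
);
```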

## Feature Engineering with the TRANSFORM Clause

BigQuery ML can handle feature engineering automatically with the TRANSFORM clause. You define transformations once and they are applied consistently during training and prediction.

```sql
-- Train with explicit feature transformations
CREATE OR REPLACE MODEL `project.dataset.churn_with_transform`
TRANSFORM(
  -- Bucketize tenure
  ML.BUCKETIZE(tenure, [0, 12, 24, 48, 72]) AS tenure_bucket,
  -- Scale numeric features (scalers are analytic functions, hence OVER())
  ML.MIN_MAX_SCALER(monthly_charges) OVER() AS monthly_charges_scaled,
  ML.MIN_MAX_SCALER(total_charges) OVER() AS total_charges_scaled,
  -- Encode categoricals
  ML.ONE_HOT_ENCODER(contract_type) OVER() AS contract_encoded,
  ML.ONE_HOT_ENCODER(payment_method) OVER() AS payment_encoded,
  -- Pass through
  num_support_tickets,
  churned AS label
)
OPTIONS(
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  max_iterations = 100,
  auto_class_weights = TRUE,
  input_label_cols = ['label']
) AS
SELECT * FROM `project.dataset.customer_features`;
```

### Available TRANSFORM Functions

| Function | Description |
|---|---|
| ML.MIN_MAX_SCALER(col) OVER() | Scale to the [0, 1] range |
| ML.STANDARD_SCALER(col) OVER() | Scale to mean 0, standard deviation 1 |
| ML.MAX_ABS_SCALER(col) OVER() | Scale by the maximum absolute value |
| ML.BUCKETIZE(col, boundaries) | Discretize into bins |
| ML.ONE_HOT_ENCODER(col) OVER() | One-hot encode a categorical column |
| ML.LABEL_ENCODER(col) OVER() | Integer-encode a categorical column |
| ML.HASH_BUCKETIZE(col, num_buckets) | Hash-based bucketing |
| ML.NGRAMS(col, ngram_range) | Extract n-grams from text |
| ML.FEATURE_CROSS(STRUCT(col1, col2)) | Cross two (or more) features |

### Using the Transform in Prediction

When you use ML.PREDICT with a model that has a TRANSFORM clause, the same transformations are automatically applied to the input data:

```sql
-- The TRANSFORM steps are automatically applied
SELECT * FROM ML.PREDICT(
  MODEL `project.dataset.churn_with_transform`,
  (SELECT * FROM `project.dataset.new_customers`)
);
```

## Model Management

```sql
-- List all models in a dataset
SELECT * FROM `project.dataset.INFORMATION_SCHEMA.MODELS`;

-- Get model training info
SELECT * FROM ML.TRAINING_INFO(MODEL `project.dataset.churn_xgboost`);

-- Get model weights/feature importance
SELECT * FROM ML.WEIGHTS(MODEL `project.dataset.churn_logreg`);

-- Get feature statistics
SELECT * FROM ML.FEATURE_INFO(MODEL `project.dataset.churn_xgboost`);

-- Delete a model
DROP MODEL IF EXISTS `project.dataset.old_model`;
```
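Trained models can also be exported to Cloud Storage for serving outside BigQuery; a sketch (the bucket path is hypothetical):

```sql
-- Export the model artifacts to a Cloud Storage bucket
EXPORT MODEL `project.dataset.churn_xgboost`
OPTIONS(URI = 'gs://your-bucket/models/churn_xgboost/');
```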

## BigQuery ML + Vertex AI

You can register a BigQuery ML model in the Vertex AI Model Registry and then deploy it to an endpoint for online serving. Registration happens at training time via the model_registry option (the vertex_ai_model_id value below is a placeholder):

```sql
-- Register the trained model in the Vertex AI Model Registry
CREATE OR REPLACE MODEL `project.dataset.churn_vertex`
OPTIONS(
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  model_registry = 'vertex_ai',
  vertex_ai_model_id = 'churn_model',
  input_label_cols = ['churned']
) AS
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
```

This gives you the best of both worlds: train in SQL, serve with low latency.