BigQuery ML
BigQuery ML lets you create and execute machine learning models directly inside BigQuery using standard SQL. No need to export data, no Python training scripts, no separate infrastructure. If you can write SQL, you can build ML models where your data already lives.
Why BigQuery ML?
Traditional ML workflow: Export data → Move to training environment → Train model → Export predictions → Load back to database. This is slow, expensive, and error-prone.
BigQuery ML workflow: Write CREATE MODEL. Done.
| Advantage | Description |
|---|---|
| No data movement | Train where data lives — no ETL, no export |
| SQL interface | Analysts can build models without Python |
| Automatic scaling | BigQuery handles distributed compute |
| Integrated evaluation | ML.EVALUATE gives metrics instantly |
| Batch + real-time | ML.PREDICT for batch, Vertex AI for online |
CREATE MODEL Syntax
The CREATE MODEL statement trains a model and stores it in your BigQuery dataset.
Logistic Regression
-- Train a logistic regression model for churn prediction
CREATE OR REPLACE MODEL `project.dataset.churn_logreg`
OPTIONS(
model_type = 'LOGISTIC_REG',
auto_class_weights = TRUE,
max_iterations = 50,
learn_rate = 0.01,
l1_reg = 0.1,
l2_reg = 0.1,
data_split_method = 'AUTO_SPLIT',
input_label_cols = ['churned']
) AS
SELECT
tenure,
monthly_charges,
total_charges,
num_support_tickets,
contract_type,
payment_method,
churned
FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
Boosted Tree (XGBoost)
-- Train a boosted tree classifier
CREATE OR REPLACE MODEL `project.dataset.churn_xgboost`
OPTIONS(
model_type = 'BOOSTED_TREE_CLASSIFIER',
max_iterations = 100,
max_tree_depth = 6,
learn_rate = 0.1,
subsample = 0.8,
auto_class_weights = TRUE,
input_label_cols = ['churned']
) AS
SELECT * EXCEPT(training_data) FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
Supported Model Types
| Model Type | model_type Value | Use Case |
|---|---|---|
| Linear Regression | LINEAR_REG | Predicting a number |
| Logistic Regression | LOGISTIC_REG | Binary/multiclass classification |
| Boosted Tree (Classifier) | BOOSTED_TREE_CLASSIFIER | Tabular classification |
| Boosted Tree (Regressor) | BOOSTED_TREE_REGRESSOR | Tabular regression |
| Random Forest | RANDOM_FOREST_CLASSIFIER / RANDOM_FOREST_REGRESSOR | Robust tabular models |
| Deep Neural Network | DNN_CLASSIFIER / DNN_REGRESSOR | Complex patterns in tabular data |
| Wide & Deep | DNN_LINEAR_COMBINED_CLASSIFIER | Memorization + generalization |
| AutoML Tables | AUTOML_CLASSIFIER / AUTOML_REGRESSOR | Automated model selection |
| K-Means | KMEANS | Unsupervised clustering |
| TensorFlow | TENSORFLOW | Import a pre-trained TF model |
| ONNX | ONNX | Import ONNX models |
| ARIMA | ARIMA_PLUS | Time series forecasting |
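The examples in this section are classifiers, but the unsupervised types follow the same pattern. A minimal K-Means sketch, reusing the hypothetical customer_features table (cluster count and column choices are illustrative):

```sql
-- Unsupervised: no label column and no train/test split needed
CREATE OR REPLACE MODEL `project.dataset.customer_segments`
OPTIONS(
  model_type = 'KMEANS',
  num_clusters = 4,
  standardize_features = TRUE
) AS
SELECT tenure, monthly_charges, total_charges, num_support_tickets
FROM `project.dataset.customer_features`;

-- ML.PREDICT assigns each row to its nearest centroid
SELECT customer_id, centroid_id
FROM ML.PREDICT(
  MODEL `project.dataset.customer_segments`,
  (SELECT * FROM `project.dataset.customer_features`)
);
```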
Evaluation with ML.EVALUATE
After training, use ML.EVALUATE to get model performance metrics:
-- Evaluate the logistic regression model
SELECT * FROM ML.EVALUATE(MODEL `project.dataset.churn_logreg`);
-- Evaluate against a specific dataset (not the auto-split test set)
SELECT * FROM ML.EVALUATE(
MODEL `project.dataset.churn_xgboost`,
(
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = FALSE
)
);
Evaluation Metrics Output
For classification models, ML.EVALUATE returns:
| Metric | Description |
|---|---|
| precision | True positives / (true positives + false positives) |
| recall | True positives / (true positives + false negatives) |
| accuracy | Correct predictions / total predictions |
| f1_score | Harmonic mean of precision and recall |
| log_loss | Cross-entropy loss |
| roc_auc | Area under the ROC curve |
-- Compare models side by side
SELECT
'logreg' AS model,
precision, recall, accuracy, f1_score, roc_auc
FROM ML.EVALUATE(MODEL `project.dataset.churn_logreg`)
UNION ALL
SELECT
'xgboost' AS model,
precision, recall, accuracy, f1_score, roc_auc
FROM ML.EVALUATE(MODEL `project.dataset.churn_xgboost`);
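For binary classifiers, ML.EVALUATE also accepts an optional decision threshold (default 0.5), useful when false negatives and false positives carry different business costs:

```sql
-- Recompute precision/recall at a stricter decision threshold
SELECT precision, recall, f1_score
FROM ML.EVALUATE(
  MODEL `project.dataset.churn_xgboost`,
  (
    SELECT * FROM `project.dataset.customer_features`
    WHERE training_data = FALSE
  ),
  STRUCT(0.6 AS threshold)
);
```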
Confusion Matrix
-- Generate a confusion matrix
SELECT * FROM ML.CONFUSION_MATRIX(
MODEL `project.dataset.churn_xgboost`,
(
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = FALSE
)
);
ROC Curve
-- Generate ROC curve data points
SELECT * FROM ML.ROC_CURVE(
MODEL `project.dataset.churn_xgboost`,
(
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = FALSE
)
);
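Because ML.ROC_CURVE returns a plain table (threshold, recall, false_positive_rate, plus raw true/false positive and negative counts), threshold selection can itself be a query. For example, picking the most sensitive threshold that keeps the false positive rate under 10%:

```sql
-- Highest-recall threshold with FPR <= 0.10
SELECT threshold, recall, false_positive_rate
FROM ML.ROC_CURVE(MODEL `project.dataset.churn_xgboost`)
WHERE false_positive_rate <= 0.10
ORDER BY recall DESC
LIMIT 1;
```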
Prediction with ML.PREDICT
Use ML.PREDICT to generate predictions for new data:
-- Predict churn for current customers
SELECT *
FROM ML.PREDICT(
MODEL `project.dataset.churn_xgboost`,
(
SELECT
tenure,
monthly_charges,
total_charges,
num_support_tickets,
contract_type,
payment_method,
customer_id
FROM `project.dataset.current_customers`
)
);
-- Get only high-risk customers
SELECT
customer_id,
predicted_churned,
predicted_churned_probs
FROM ML.PREDICT(
MODEL `project.dataset.churn_xgboost`,
(SELECT * FROM `project.dataset.current_customers`)
)
WHERE predicted_churned = 1
AND (
SELECT prob FROM UNNEST(predicted_churned_probs)
WHERE label = 1
) > 0.8;
Saving Predictions to a Table
-- Write predictions to a new table for downstream use
CREATE OR REPLACE TABLE `project.dataset.churn_predictions` AS
SELECT
customer_id,
predicted_churned,
(
SELECT prob FROM UNNEST(predicted_churned_probs) WHERE label = 1
) AS churn_probability
FROM ML.PREDICT(
MODEL `project.dataset.churn_xgboost`,
(SELECT * FROM `project.dataset.current_customers`)
);
Feature Engineering with the TRANSFORM Clause
BigQuery ML can handle feature engineering automatically with the TRANSFORM clause. You define transformations once and they are applied consistently during training and prediction.
-- Train with explicit feature transformations
CREATE OR REPLACE MODEL `project.dataset.churn_with_transform`
TRANSFORM(
-- Bucketize tenure
ML.BUCKETIZE(tenure, [0, 12, 24, 48, 72]) AS tenure_bucket,
-- Scale numeric features
ML.MIN_MAX_SCALER(monthly_charges) AS monthly_charges_scaled,
ML.MIN_MAX_SCALER(total_charges) AS total_charges_scaled,
-- Encode categoricals
ML.ONE_HOT_ENCODER(contract_type) AS contract_encoded,
ML.ONE_HOT_ENCODER(payment_method) AS payment_encoded,
-- Pass through
num_support_tickets,
churned AS label
)
OPTIONS(
model_type = 'BOOSTED_TREE_CLASSIFIER',
max_iterations = 100,
auto_class_weights = TRUE,
input_label_cols = ['label']
) AS
SELECT * EXCEPT(training_data) FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
Available TRANSFORM Functions
| Function | Description |
|---|---|
| ML.MIN_MAX_SCALER(col) | Scale to the [0, 1] range |
| ML.STANDARD_SCALER(col) | Scale to mean = 0, std = 1 |
| ML.MAX_ABS_SCALER(col) | Scale by the maximum absolute value |
| ML.BUCKETIZE(col, boundaries) | Discretize into bins |
| ML.ONE_HOT_ENCODER(col) | One-hot encode a categorical column |
| ML.LABEL_ENCODER(col) | Integer-encode a categorical column |
| ML.HASH_BUCKETIZE(col, hash_bucket_size) | Hash-based bucketing |
| ML.NGRAMS(token_array, range) | Extract n-grams from a token array |
| ML.FEATURE_CROSS(STRUCT(col1, col2)) | Cross two or more features |
Using the Transform in Prediction
When you use ML.PREDICT with a model that has a TRANSFORM clause, the same transformations are automatically applied to the input data:
-- The TRANSFORM steps are automatically applied
SELECT * FROM ML.PREDICT(
MODEL `project.dataset.churn_with_transform`,
(SELECT * FROM `project.dataset.new_customers`)
);
Model Management
-- List all models in a dataset
SELECT * FROM `project.dataset.INFORMATION_SCHEMA.MODELS`;
-- Get model training info
SELECT * FROM ML.TRAINING_INFO(MODEL `project.dataset.churn_xgboost`);
-- Get model weights (linear and logistic models)
SELECT * FROM ML.WEIGHTS(MODEL `project.dataset.churn_logreg`);
-- Get feature importance (tree models)
SELECT * FROM ML.FEATURE_IMPORTANCE(MODEL `project.dataset.churn_xgboost`);
-- Get feature statistics
SELECT * FROM ML.FEATURE_INFO(MODEL `project.dataset.churn_xgboost`);
-- Delete a model
DROP MODEL IF EXISTS `project.dataset.old_model`;
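Trained models can also be exported to Cloud Storage for serving outside BigQuery. A sketch, assuming a bucket and path you control:

```sql
-- Export the trained model artifacts to a Cloud Storage path
EXPORT MODEL `project.dataset.churn_xgboost`
OPTIONS(URI = 'gs://your-bucket/models/churn_xgboost');
```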
Vertex AI Integration
You can register a BigQuery ML model in the Vertex AI Model Registry, then deploy it to a Vertex AI endpoint for online serving. Registration is an option on CREATE MODEL:
-- Register the model in the Vertex AI Model Registry at creation time
CREATE OR REPLACE MODEL `project.dataset.churn_xgboost`
OPTIONS(
model_type = 'BOOSTED_TREE_CLASSIFIER',
model_registry = 'vertex_ai',
vertex_ai_model_id = 'churn_xgboost',
input_label_cols = ['churned']
) AS
SELECT * EXCEPT(training_data) FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
This gives you the best of both worlds: train in SQL, serve with low latency.