BigQuery ML
BigQuery ML lets you create and execute machine learning models directly inside BigQuery using standard SQL. No need to export data, no Python training scripts, no separate infrastructure. If you can write SQL, you can build ML models where your data already lives.
Why BigQuery ML?
Traditional ML workflow: Export data → Move to training environment → Train model → Export predictions → Load back to database. This is slow, expensive, and error-prone.
BigQuery ML workflow: Write CREATE MODEL. Done.
| Advantage | Description |
|---|---|
| No data movement | Train where data lives — no ETL, no export |
| SQL interface | Analysts can build models without Python |
| Automatic scaling | BigQuery handles distributed compute |
| Integrated evaluation | ML.EVALUATE gives metrics instantly |
| Batch + real-time | ML.PREDICT for batch, Vertex AI for online |
CREATE MODEL Syntax
The CREATE MODEL statement trains a model and stores it in your BigQuery dataset.
Logistic Regression
-- Train a logistic regression model for churn prediction
CREATE OR REPLACE MODEL `project.dataset.churn_logreg`
OPTIONS(
model_type = 'LOGISTIC_REG',
auto_class_weights = TRUE,
max_iterations = 50,
learn_rate = 0.01,
l1_reg = 0.1,
l2_reg = 0.1,
data_split_method = 'AUTO_SPLIT',
input_label_cols = ['churned']
) AS
SELECT
tenure,
monthly_charges,
total_charges,
num_support_tickets,
contract_type,
payment_method,
churned
FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
Boosted Tree (XGBoost)
-- Train a boosted tree classifier
CREATE OR REPLACE MODEL `project.dataset.churn_xgboost`
OPTIONS(
model_type = 'BOOSTED_TREE_CLASSIFIER',
max_iterations = 100,
max_tree_depth = 6,
learn_rate = 0.1,
subsample = 0.8,
auto_class_weights = TRUE,
input_label_cols = ['churned']
) AS
SELECT * EXCEPT(training_data) FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
Supported Model Types
| Model Type | model_type Value | Use Case |
|---|---|---|
| Linear Regression | LINEAR_REG | Predicting a number |
| Logistic Regression | LOGISTIC_REG | Binary/multiclass classification |
| Boosted Tree (Classifier) | BOOSTED_TREE_CLASSIFIER | Tabular classification |
| Boosted Tree (Regressor) | BOOSTED_TREE_REGRESSOR | Tabular regression |
| Random Forest | RANDOM_FOREST_CLASSIFIER / RANDOM_FOREST_REGRESSOR | Robust tabular models |
| Deep Neural Network | DNN_CLASSIFIER / DNN_REGRESSOR | Complex patterns in tabular data |
| Wide & Deep | DNN_LINEAR_COMBINED_CLASSIFIER | Memorization + generalization |
| AutoML Tables | AUTOML_CLASSIFIER / AUTOML_REGRESSOR | Automated model selection |
| K-Means | KMEANS | Unsupervised clustering |
| TensorFlow | TENSORFLOW | Import a pre-trained TF model |
| ONNX | ONNX | Import ONNX models |
| ARIMA | ARIMA_PLUS | Time series forecasting |
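The examples in this section are classifiers, but the unsupervised types follow the same pattern. A minimal K-Means sketch, reusing the hypothetical customer_features table (cluster count and column choices are illustrative):

```sql
-- Unsupervised: no label column and no train/test split needed
CREATE OR REPLACE MODEL `project.dataset.customer_segments`
OPTIONS(
  model_type = 'KMEANS',
  num_clusters = 4,
  standardize_features = TRUE
) AS
SELECT tenure, monthly_charges, total_charges, num_support_tickets
FROM `project.dataset.customer_features`;

-- ML.PREDICT assigns each row to its nearest centroid
SELECT customer_id, centroid_id
FROM ML.PREDICT(
  MODEL `project.dataset.customer_segments`,
  (SELECT * FROM `project.dataset.customer_features`)
);
```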
Evaluation with ML.EVALUATE
After training, use ML.EVALUATE to get model performance metrics:
-- Evaluate the logistic regression model
SELECT * FROM ML.EVALUATE(MODEL `project.dataset.churn_logreg`);
-- Evaluate against a specific dataset (not the auto-split test set)
SELECT * FROM ML.EVALUATE(
MODEL `project.dataset.churn_xgboost`,
(
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = FALSE
)
);
Evaluation Metrics Output
For classification models, ML.EVALUATE returns:
| Metric | Description |
|---|---|
| precision | True positives / (true positives + false positives) |
| recall | True positives / (true positives + false negatives) |
| accuracy | Correct predictions / total predictions |
| f1_score | Harmonic mean of precision and recall |
| log_loss | Cross-entropy loss |
| roc_auc | Area under the ROC curve |
-- Compare models side by side
SELECT
'logreg' AS model,
precision, recall, accuracy, f1_score, roc_auc
FROM ML.EVALUATE(MODEL `project.dataset.churn_logreg`)
UNION ALL
SELECT
'xgboost' AS model,
precision, recall, accuracy, f1_score, roc_auc
FROM ML.EVALUATE(MODEL `project.dataset.churn_xgboost`);
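For binary classifiers, ML.EVALUATE also accepts an optional decision threshold (default 0.5), useful when false negatives and false positives carry different business costs:

```sql
-- Recompute precision/recall at a stricter decision threshold
SELECT precision, recall, f1_score
FROM ML.EVALUATE(
  MODEL `project.dataset.churn_xgboost`,
  (
    SELECT * FROM `project.dataset.customer_features`
    WHERE training_data = FALSE
  ),
  STRUCT(0.6 AS threshold)
);
```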
Confusion Matrix
-- Generate a confusion matrix
SELECT * FROM ML.CONFUSION_MATRIX(
MODEL `project.dataset.churn_xgboost`,
(
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = FALSE
)
);
ROC Curve
-- Generate ROC curve data points
SELECT * FROM ML.ROC_CURVE(
MODEL `project.dataset.churn_xgboost`,
(
SELECT * FROM `project.dataset.customer_features`
WHERE training_data = FALSE
)
);
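Because ML.ROC_CURVE returns a plain table (threshold, recall, false_positive_rate, plus raw true/false positive and negative counts), threshold selection can itself be a query. For example, picking the most sensitive threshold that keeps the false positive rate under 10%:

```sql
-- Highest-recall threshold with FPR <= 0.10
SELECT threshold, recall, false_positive_rate
FROM ML.ROC_CURVE(MODEL `project.dataset.churn_xgboost`)
WHERE false_positive_rate <= 0.10
ORDER BY recall DESC
LIMIT 1;
```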
Prediction with ML.PREDICT
Use ML.PREDICT to generate predictions for new data:
-- Predict churn for current customers
SELECT *
FROM ML.PREDICT(
MODEL `project.dataset.churn_xgboost`,
(
SELECT
tenure,
monthly_charges,
total_charges,
num_support_tickets,
contract_type,
payment_method,
customer_id
FROM `project.dataset.current_customers`
)
);
-- Get only high-risk customers
SELECT
customer_id,
predicted_churned,
predicted_churned_probs
FROM ML.PREDICT(
MODEL `project.dataset.churn_xgboost`,
(SELECT * FROM `project.dataset.current_customers`)
)
WHERE predicted_churned = 1
AND (
SELECT prob FROM UNNEST(predicted_churned_probs)
WHERE label = 1
) > 0.8;
Saving Predictions to a Table
-- Write predictions to a new table for downstream use
CREATE OR REPLACE TABLE `project.dataset.churn_predictions` AS
SELECT
customer_id,
predicted_churned,
(
SELECT prob FROM UNNEST(predicted_churned_probs) WHERE label = 1
) AS churn_probability
FROM ML.PREDICT(
MODEL `project.dataset.churn_xgboost`,
(SELECT * FROM `project.dataset.current_customers`)
);
Feature Engineering with the TRANSFORM Clause
BigQuery ML can handle feature engineering automatically with the TRANSFORM clause. You define transformations once and they are applied consistently during training and prediction.
-- Train with explicit feature transformations
CREATE OR REPLACE MODEL `project.dataset.churn_with_transform`
TRANSFORM(
-- Bucketize tenure
ML.BUCKETIZE(tenure, [0, 12, 24, 48, 72]) AS tenure_bucket,
-- Scale numeric features
ML.MIN_MAX_SCALER(monthly_charges) AS monthly_charges_scaled,
ML.MIN_MAX_SCALER(total_charges) AS total_charges_scaled,
-- Encode categoricals
ML.ONE_HOT_ENCODER(contract_type) AS contract_encoded,
ML.ONE_HOT_ENCODER(payment_method) AS payment_encoded,
-- Pass through
num_support_tickets,
churned AS label
)
OPTIONS(
model_type = 'BOOSTED_TREE_CLASSIFIER',
max_iterations = 100,
auto_class_weights = TRUE,
input_label_cols = ['label']
) AS
SELECT * EXCEPT(training_data) FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
Available TRANSFORM Functions
| Function | Description |
|---|---|
| ML.MIN_MAX_SCALER(col) | Scale to the [0, 1] range |
| ML.STANDARD_SCALER(col) | Scale to mean = 0, std = 1 |
| ML.MAX_ABS_SCALER(col) | Scale by the maximum absolute value |
| ML.BUCKETIZE(col, boundaries) | Discretize into bins |
| ML.ONE_HOT_ENCODER(col) | One-hot encode a categorical column |
| ML.LABEL_ENCODER(col) | Integer-encode a categorical column |
| ML.HASH_BUCKETIZE(col, hash_bucket_size) | Hash-based bucketing |
| ML.NGRAMS(token_array, range) | Extract n-grams from a token array |
| ML.FEATURE_CROSS(STRUCT(col1, col2)) | Cross two or more features |
Using the Transform in Prediction
When you use ML.PREDICT with a model that has a TRANSFORM clause, the same transformations are automatically applied to the input data:
-- The TRANSFORM steps are automatically applied
SELECT * FROM ML.PREDICT(
MODEL `project.dataset.churn_with_transform`,
(SELECT * FROM `project.dataset.new_customers`)
);
Model Management
-- List all models in a dataset
SELECT * FROM `project.dataset.INFORMATION_SCHEMA.MODELS`;
-- Get model training info
SELECT * FROM ML.TRAINING_INFO(MODEL `project.dataset.churn_xgboost`);
-- Get model weights (linear and logistic models)
SELECT * FROM ML.WEIGHTS(MODEL `project.dataset.churn_logreg`);
-- Get feature importance (tree models)
SELECT * FROM ML.FEATURE_IMPORTANCE(MODEL `project.dataset.churn_xgboost`);
-- Get feature statistics
SELECT * FROM ML.FEATURE_INFO(MODEL `project.dataset.churn_xgboost`);
-- Delete a model
DROP MODEL IF EXISTS `project.dataset.old_model`;
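Trained models can also be exported to Cloud Storage for serving outside BigQuery. A sketch, assuming a bucket and path you control:

```sql
-- Export the trained model artifacts to a Cloud Storage path
EXPORT MODEL `project.dataset.churn_xgboost`
OPTIONS(URI = 'gs://your-bucket/models/churn_xgboost');
```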
Vertex AI Integration
You can register a BigQuery ML model in the Vertex AI Model Registry, then deploy it to a Vertex AI endpoint for online serving. Registration is an option on CREATE MODEL:
-- Register the model in the Vertex AI Model Registry at creation time
CREATE OR REPLACE MODEL `project.dataset.churn_xgboost`
OPTIONS(
model_type = 'BOOSTED_TREE_CLASSIFIER',
model_registry = 'vertex_ai',
vertex_ai_model_id = 'churn_xgboost',
input_label_cols = ['churned']
) AS
SELECT * EXCEPT(training_data) FROM `project.dataset.customer_features`
WHERE training_data = TRUE;
This gives you the best of both worlds: train in SQL, serve with low latency.