Tutorial Lengkap BigQuery ML: Machine Learning dengan SQL
BigQuery ML memungkinkan Anda membuat dan mengeksekusi model machine learning menggunakan query SQL. Tidak perlu memindahkan data - training dan prediksi langsung di data warehouse Anda.
Mengapa BigQuery ML?
Manfaat Utama:- Berbasis SQL: Gunakan sintaks SQL yang familiar
- Tanpa pemindahan data: Training pada data di tempat
- Scalable: Manfaatkan infrastruktur BigQuery
- Terintegrasi: Bekerja dengan ekosistem BigQuery
- Multiple models: Classification, regression, clustering
- Linear/Logistic Regression
- XGBoost, Random Forest
- Deep Neural Networks
- K-means Clustering
- Time Series (ARIMA)
- Matrix Factorization
Prerequisites
pip install google-cloud-bigquery
gcloud auth login
Quick Start
1. Buat Classification Model
-- Buat logistic regression model
CREATE OR REPLACE MODEL project.dataset.churnmodel
OPTIONS(
modeltype='LOGISTICREG',
inputlabelcols=['churn'],
autoclassweights=TRUE
) AS
SELECT
age,
tenuremonths,
monthlycharges,
totalcharges,
contracttype,
churn
FROM project.dataset.customers
WHERE PARTITIONTIME BETWEEN '2024-01-01' AND '2024-06-01';
2. Evaluasi Model
-- Evaluasi performa model
SELECT
FROM
ML.EVALUATE(MODEL project.dataset.churnmodel,
(SELECT FROM project.dataset.customers
WHERE PARTITIONTIME > '2024-06-01'));
3. Buat Prediksi
-- Prediksi probabilitas churn
SELECT
customerid,
predictedchurn,
predictedchurnprobs
FROM
ML.PREDICT(MODEL project.dataset.churnmodel,
(SELECT FROM project.dataset.newcustomers
));
Tipe Model
1. XGBoost Classifier
CREATE OR REPLACE MODEL project.dataset.xgbclassifier
OPTIONS(
modeltype='BOOSTEDTREECLASSIFIER',
numparalleltree=5,
maxiterations=50,
learnrate=0.1,
inputlabelcols=['target']
) AS
SELECT FROM project.dataset.trainingdata;
2. Random Forest
CREATE OR REPLACE MODEL project.dataset.rfmodel
OPTIONS(
modeltype='RANDOMFORESTCLASSIFIER',
numparalleltree=100,
maxtreedepth=10,
inputlabelcols=['label']
) AS
SELECT FROM project.dataset.trainingdata
;
3. Deep Neural Network
CREATE OR REPLACE MODEL project.dataset.dnnmodel
OPTIONS(
modeltype='DNNCLASSIFIER',
hiddenunits=[128, 64, 32],
dropout=0.2,
batchsize=256,
inputlabelcols=['label']
) AS
SELECT FROM project.dataset.trainingdata;
4. Linear Regression
CREATE OR REPLACE MODEL project.dataset.pricepredictor
OPTIONS(
modeltype='LINEARREG',
inputlabelcols=['price'],
l2reg=0.1
) AS
SELECT
bedrooms,
bathrooms,
sqft,
location,
price
FROM project.dataset.houses;
5. K-Means Clustering
CREATE OR REPLACE MODEL project.dataset.customersegments
OPTIONS(
modeltype='KMEANS',
numclusters=5,
standardizefeatures=TRUE
) AS
SELECT
age,
income,
spendingscore,
purchasefrequency
FROM project.dataset.customers;
-- Dapatkan cluster assignments
SELECT
customerid,
CENTROIDID as segment
FROM
ML.PREDICT(MODEL project.dataset.customersegments,