A Complete Azure Databricks Tutorial for ML: A Unified Analytics Platform
Azure Databricks provides a collaborative analytics platform built on Apache Spark and optimized for machine learning. It brings data engineering, data science, and machine learning together in a single environment.
Why Azure Databricks for ML?
Key benefits:
- Unified platform: data engineering and ML in one place
- Collaborative: notebooks with real-time collaboration
- Scalable: auto-scaling Spark clusters
- MLflow integration: built-in experiment tracking
- Delta Lake: reliable data lake storage
Key components:
- Databricks Workspace
- Spark Clusters
- Notebooks
- MLflow
- Feature Store
- Model Serving
Prerequisites
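# Install the Databricks SDK, MLflow, and the Azure packages used in the setup step
pip install databricks-sdk mlflow azure-mgmt-databricks azure-identity
# Sign in with the Azure CLI
az login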
Setup
1. Create a Databricks Workspace
from azure.mgmt.databricks import AzureDatabricksManagementClient
from azure.identity import DefaultAzureCredential

# DefaultAzureCredential picks up the session created by `az login`
credential = DefaultAzureCredential()
client = AzureDatabricksManagementClient(
    credential=credential,
    subscription_id="your-subscription-id"
)

# Create the workspace and wait for the long-running operation to finish
# Depending on the SDK version, the service may also expect a managed resource group ID here
workspace = client.workspaces.begin_create_or_update(
    resource_group_name="my-resource-group",
    workspace_name="my-databricks-workspace",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"}
    }
).result()
print(f"Workspace created: {workspace.name}")
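Azure Resource Manager identifies the workspace by a resource ID composed of the subscription, resource group, and workspace name used above. The helper below is a minimal sketch of how that ID is put together; the function name `workspace_resource_id` is hypothetical and is not part of the Azure SDK.

```python
def workspace_resource_id(subscription_id: str, resource_group: str, workspace_name: str) -> str:
    """Compose the ARM resource ID of a Databricks workspace (illustrative helper)."""
    parts = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    # Reject empty values early so a malformed ID never reaches an API call
    for label, value in parts.items():
        if not value:
            raise ValueError(f"{label} must not be empty")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Databricks/workspaces/{workspace_name}"
    )


# Placeholder values taken from the tutorial
resource_id = workspace_resource_id(
    "your-subscription-id", "my-resource-group", "my-databricks-workspace"
)
print(resource_id)
```

An ID in this form is handy for role assignments or for looking the workspace up later without listing the whole resource group.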
2. Connect with the Databricks SDK
from databricks.sdk import WorkspaceClient

# Initialize the client with the workspace URL and a personal access token
w = WorkspaceClient(
    host="https://adb-xxxxx.azuredatabricks.net",
    token="dapi-xxxxx"
)

# List clusters
clusters = w.clusters.list()
for cluster in clusters:
    print(f"{cluster.cluster_name}: {cluster.state}")
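Hard-coding the token is fine for a first test, but it is easy to leak into version control. As a sketch of one alternative, the helper below reads the same two values from the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables. The function name `load_databricks_settings` is hypothetical, and the example passes an explicit mapping so it runs without touching the real environment.

```python
import os
from typing import Mapping, Optional


def load_databricks_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the workspace URL and token from environment variables (illustrative helper)."""
    env = os.environ if env is None else env
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        # Strip a trailing slash so the URL has one canonical form
        "host": env["DATABRICKS_HOST"].rstrip("/"),
        "token": env["DATABRICKS_TOKEN"],
    }


# Example with an explicit mapping instead of the real environment
settings = load_databricks_settings(
    {
        "DATABRICKS_HOST": "https://adb-xxxxx.azuredatabricks.net/",
        "DATABRICKS_TOKEN": "dapi-xxxxx",
    }
)
print(settings["host"])
```

The returned dictionary has the same keys as the constructor arguments shown above, so it could be passed on as `WorkspaceClient(**settings)`.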
Cluster Management
1. Create an ML Cluster
from databricks.sdk.service.compute import (
    AutoScale,
    AzureAttributes,
    AzureAvailability
)

# Create an autoscaling cluster on the ML runtime and wait until it is running
cluster = w.clusters.create(
    cluster_name="ml-cluster",
    spark_version="13.3.x-ml-scala2.12",
    node_type_id="Standard_DS3_v2",
    autoscale=AutoScale(min_workers=1, max_workers=8),
    azure_attributes=AzureAttributes(
        availability=AzureAvailability.ON_DEMAND_AZURE,
        first_on_demand=1
    ),
    spark_conf={
        "spark.databricks.delta.preview.enabled": "true"
    },
    custom_tags={
        "project": "ml-training",
        "team": "data-science"
    }
).result()
print(f"Cluster ID: {cluster.cluster_id}")
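The same cluster settings can also be kept as plain data, for example in a config file shared between environments. The sketch below builds such a specification as a dictionary using the field names of the Clusters API JSON payload. The helper `build_cluster_spec` and its sanity check on the worker range are assumptions of this example, not part of the Databricks SDK.

```python
from typing import Mapping, Optional


def build_cluster_spec(
    name: str,
    spark_version: str,
    node_type_id: str,
    min_workers: int,
    max_workers: int,
    tags: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build an autoscaling cluster specification as a plain dict (illustrative helper)."""
    # Catch an inverted or negative worker range before it reaches the API
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("expected 0 <= min_workers <= max_workers")
    spec = {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }
    if tags:
        spec["custom_tags"] = dict(tags)
    return spec


# Same values as the ML cluster in the tutorial
spec = build_cluster_spec(
    "ml-cluster",
    "13.3.x-ml-scala2.12",
    "Standard_DS3_v2",
    min_workers=1,
    max_workers=8,
    tags={"project": "ml-training", "team": "data-science"},
)
print(spec["autoscale"])
```

Keeping the specification as data makes it easy to review in a pull request and to reuse for a staging and a production cluster with different worker limits.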
2. GPU Cluster for Deep Learning
# Fixed-size cluster on the GPU ML runtime, one GPU per Spark task
gpu_cluster = w.clusters.create(
    cluster_name="gpu-ml-cluster",
    spark_version="13.3.x-gpu-ml-scala2.12",
    node_type_id="Standard_NC6s_v3",
    num_workers=2,
    spark_conf={
        "spark.task.resource.gpu.amount": "1"
    }
).result()
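Notebooks and Data
1. Create a Notebook
import base64
from databricks.sdk.service.workspace import ImportFormat, Language

# Create the project folder (mkdirs does not return a value)
w.workspace.mkdirs("/Users/user@company.com/ml-projects")

# Notebook source to upload (placeholder content)
notebook_content = "print('hello from Databricks')"

# Import the notebook; the SDK method is import_ because import is a Python keyword
w.workspace.import_(
    path="/Users/user@company.com/ml-projects/training",
    format=ImportFormat.SOURCE,
    language=Language.PYTHON,
    content=base64.b64encode(notebook_content.encode()).decode()
)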
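The workspace import call above takes the notebook body as a base64-encoded string rather than raw text. The following sketch isolates that encoding step and checks that it round-trips; the helper names `encode_notebook_source` and `decode_notebook_source` are hypothetical and exist only for illustration.

```python
import base64


def encode_notebook_source(source: str) -> str:
    """Encode notebook source text as base64 (illustrative helper)."""
    return base64.b64encode(source.encode("utf-8")).decode("ascii")


def decode_notebook_source(payload: str) -> str:
    """Reverse encode_notebook_source, e.g. to inspect a payload before upload."""
    return base64.b64decode(payload).decode("utf-8")


# Placeholder notebook body, including a non-ASCII character to exercise UTF-8
source = "print('training started')  # début\n"
payload = encode_notebook_source(source)

# The payload is plain ASCII and decodes back to the original text
assert payload.isascii()
assert decode_notebook_source(payload) == source
```

Encoding to UTF-8 bytes first matters when the notebook contains non-ASCII text, since base64 operates on bytes, not on Python strings.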
2. Working with Delta Lake
# In a Databricks notebook
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the data
df = spark.read.format("csv").option("header", "true").load("dbfs:/data/train.csv")

# Write to Delta Lake
df.write.format("delta").mode("overwrite").save("/delta/trainingdata")

# Read from Delta Lake
deltadf = spark.read.format("delta").load("/delta/trainingdata")

# Create a Delta table
spark.sql("""