Complete Azure Databricks for ML Tutorial: Unified Analytics Platform
Azure Databricks provides a collaborative Apache Spark-based analytics platform optimized for machine learning. It combines data engineering, data science, and machine learning on a unified platform.
Why Azure Databricks for ML?
Key Benefits:
- Unified platform: Data engineering and ML in one place
- Collaborative: Notebooks with real-time collaboration
- Scalable: Auto-scaling Spark clusters
- MLflow integration: Built-in experiment tracking
- Delta Lake: Reliable data lake storage
Core Components:
- Databricks Workspace
- Spark Clusters
- Notebooks
- MLflow
- Feature Store
- Model Serving
Prerequisites
pip install databricks-sdk mlflow azure-mgmt-databricks azure-identity
# Azure CLI
az login
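Before continuing, it can help to confirm that the libraries are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names are the import paths used by the code in this tutorial.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            if find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            # A missing parent package (e.g. "databricks") raises instead of returning None
            missing.append(name)
    return missing


# Import paths used by the snippets in this tutorial
required = ["databricks.sdk", "mlflow", "azure.mgmt.databricks", "azure.identity"]
print("Missing modules:", missing_modules(required) or "none")
```

Any name reported as missing points to a package that still needs to be installed with pip.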
Setup
1. Create Databricks Workspace
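from azure.mgmt.databricks import AzureDatabricksManagementClient
from azure.identity import DefaultAzureCredential

# Authenticate with whatever credential is available (e.g. the az login session)
credential = DefaultAzureCredential()
client = AzureDatabricksManagementClient(
    credential=credential,
    subscription_id="your-subscription-id"
)

# Create workspace (begin_create_or_update returns a poller; .result() waits for completion)
# Note: the service may also require a managed resource group ID in the parameters;
# check the azure-mgmt-databricks reference for the exact field.
workspace = client.workspaces.begin_create_or_update(
    resource_group_name="my-resource-group",
    workspace_name="my-databricks-workspace",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"}
    }
).result()
print(f"Workspace created: {workspace.name}")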
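The SDK connection in the next step needs the workspace host. As a hedged sketch: the workspace object returned by the management client is assumed here to expose a `workspace_url` attribute holding a bare hostname without a scheme, and the helper `host_from_workspace_url` is hypothetical.

```python
def host_from_workspace_url(workspace_url: str) -> str:
    """Turn a bare workspace hostname into an https:// host string."""
    url = workspace_url.strip().rstrip("/")
    if url.startswith("https://"):
        return url
    if url.startswith("http://"):
        raise ValueError("Expected an HTTPS workspace URL")
    return f"https://{url}"


# Placeholder hostname in the same style as this tutorial; not a real workspace
print(host_from_workspace_url("adb-xxxxx.azuredatabricks.net"))
```

With a real workspace object, the call would look like `host_from_workspace_url(workspace.workspace_url)`, assuming that attribute is populated.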
2. Connect with Databricks SDK
from databricks.sdk import WorkspaceClient

# Initialize client (replace the placeholder host and token with your own)
w = WorkspaceClient(
    host="https://adb-xxxxx.azuredatabricks.net",
    token="dapi-xxxxx"
)

# List clusters
clusters = w.clusters.list()
for cluster in clusters:
    print(f"{cluster.cluster_name}: {cluster.state}")
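Hard-coding a token in source is easy to leak. One possible alternative, sketched below with the standard library only, reads the host and token from environment variables. The variable names `DATABRICKS_HOST` and `DATABRICKS_TOKEN` and the helper `databricks_config_from_env` are assumptions for illustration.

```python
import os


def databricks_config_from_env(env=None):
    """Collect host and token from environment variables, failing early if absent."""
    env = os.environ if env is None else env
    config = {}
    for key, var in (("host", "DATABRICKS_HOST"), ("token", "DATABRICKS_TOKEN")):
        value = env.get(var, "").strip()
        if not value:
            raise KeyError(f"Environment variable {var} is not set")
        config[key] = value
    return config


# Demonstration with an in-memory mapping instead of the real environment
example_env = {
    "DATABRICKS_HOST": "https://adb-xxxxx.azuredatabricks.net",
    "DATABRICKS_TOKEN": "dapi-xxxxx",
}
config = databricks_config_from_env(example_env)
print(config["host"])

# With the SDK installed, the client could then be built as:
#   w = WorkspaceClient(**config)
```

Passing the mapping explicitly keeps the function easy to test; calling it with no argument falls back to the real process environment.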
Cluster Management
1. Create ML Cluster
from databricks.sdk.service.compute import (
    AutoScale,
    AzureAttributes,
    AzureAvailability
)

# Create cluster (create returns a waiter; .result() blocks until the cluster is running)
cluster = w.clusters.create(
    cluster_name="ml-cluster",
    spark_version="13.3.x-ml-scala2.12",
    node_type_id="Standard_DS3_v2",
    autoscale=AutoScale(min_workers=1, max_workers=8),
    azure_attributes=AzureAttributes(
        availability=AzureAvailability.ON_DEMAND_AZURE,
        first_on_demand=1
    ),
    spark_conf={
        "spark.databricks.delta.preview.enabled": "true"
    },
    custom_tags={
        "project": "ml-training",
        "team": "data-science"
    }
).result()
print(f"Cluster ID: {cluster.cluster_id}")
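To make the cluster settings easier to review or keep under version control, the same values can be expressed as a plain dictionary. The sketch below assumes the field names mirror the SDK keyword arguments (`cluster_name`, `autoscale`, and so on); the helper `build_cluster_payload` is hypothetical and only validates and assembles the values, without calling any Databricks API.

```python
import json


def build_cluster_payload(name, spark_version, node_type_id,
                          min_workers, max_workers, tags=None):
    """Assemble autoscaling cluster settings and reject an invalid worker range."""
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("Require 0 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "custom_tags": dict(tags or {}),
    }


payload = build_cluster_payload(
    "ml-cluster",
    "13.3.x-ml-scala2.12",
    "Standard_DS3_v2",
    min_workers=1,
    max_workers=8,
    tags={"project": "ml-training", "team": "data-science"},
)
print(json.dumps(payload, indent=2))
```

Validating the autoscale range before submitting the request surfaces a swapped minimum and maximum locally instead of as an API error.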
2. GPU Cluster for Deep Learning
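# Fixed-size GPU cluster using the GPU-enabled ML runtime
gpucluster = w.clusters.create(
    cluster_name="gpu-ml-cluster",
    spark_version="13.3.x-gpu-ml-scala2.12",
    node_type_id="Standard_NC6s_v3",
    num_workers=2,
    spark_conf={
        # Each Spark task requests one whole GPU
        "spark.task.resource.gpu.amount": "1"
    }
).result()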
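The `spark.task.resource.gpu.amount` setting controls how much of a GPU each Spark task requests, which in turn limits how many tasks can run at once on an executor. The sketch below is a simplified model of that limit, not Spark's actual scheduler; the helper name and the figures of 6 cores and 1 GPU per worker are assumptions for illustration.

```python
from fractions import Fraction


def concurrent_tasks_per_executor(executor_cores, task_cpus,
                                  executor_gpus, task_gpu_amount):
    """Estimate concurrent tasks as the tighter of the CPU and GPU limits."""
    gpu_share = Fraction(task_gpu_amount)
    if gpu_share <= 0 or task_cpus <= 0:
        raise ValueError("Per-task resource amounts must be positive")
    by_cpu = executor_cores // task_cpus
    by_gpu = int(Fraction(executor_gpus) / gpu_share)
    return min(by_cpu, by_gpu)


# One whole GPU per task: the single GPU becomes the bottleneck
print(concurrent_tasks_per_executor(6, 1, 1, "1"))
# A quarter of a GPU per task: up to four tasks share the GPU
print(concurrent_tasks_per_executor(6, 1, 1, "0.25"))
```

Under this model, a value of "1" serializes GPU work to one task per executor, which is often what deep learning training wants, while a fractional value lets several lighter tasks share the device.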
Notebooks and Data
1. Create Notebook
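import base64
from databricks.sdk.service.workspace import ImportFormat, Language

# Create the project folder (mkdirs returns nothing)
w.workspace.mkdirs("/Users/user@company.com/ml-projects")

# Import notebook; notebook_content is the notebook source as a string, defined elsewhere.
# The SDK method is named import_ because import is a reserved word in Python.
w.workspace.import_(
    path="/Users/user@company.com/ml-projects/training",
    format=ImportFormat.SOURCE,
    language=Language.PYTHON,
    content=base64.b64encode(notebook_content.encode()).decode()
)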
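The import call expects the notebook source as a base64-encoded string. The sketch below shows one way to prepare it with the standard library. The sample notebook text, including the `# Databricks notebook source` header and the `# COMMAND ----------` cell separator, is an illustrative assumption about the source format, and `encode_notebook_source` is a hypothetical helper.

```python
import base64


def encode_notebook_source(source: str) -> str:
    """Base64-encode notebook source text for use as an import payload."""
    return base64.b64encode(source.encode("utf-8")).decode("ascii")


# Hypothetical two-cell Python notebook
notebook_content = "\n".join([
    "# Databricks notebook source",
    "print('hello from the training notebook')",
    "",
    "# COMMAND ----------",
    "",
    "print('second cell')",
])

encoded = encode_notebook_source(notebook_content)

# Round-trip check: decoding must give back the original source
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == notebook_content)
```

The round-trip check is a cheap way to confirm that no encoding step mangled the source before it is uploaded.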
2. Working with Delta Lake
# In Databricks notebook
from pyspark.sql import SparkSession

# Databricks notebooks already provide a session; getOrCreate() reuses it
spark = SparkSession.builder.getOrCreate()

# Read data
df = spark.read.format("csv").option("header", "true").load("dbfs:/data/train.csv")

# Write to Delta Lake
df.write.format("delta").mode("overwrite").save("/delta/trainingdata")

# Read from Delta Lake
deltadf = spark.read.format("delta").load("/delta/trainingdata")

# Create Delta table
spark.sql("""