Complete AWS SageMaker Pipelines Tutorial: Automating ML Workflows
SageMaker Pipelines is a CI/CD service built specifically for machine learning that helps you automate and manage ML workflows. It lets you build reproducible, production-ready ML pipelines with minimal code.
Why SageMaker Pipelines?
Key benefits:
- Automation: Automate end-to-end ML workflows
- Reproducibility: Track and reproduce experiments
- Integration: Native integration with SageMaker services
- Visualization: View the pipeline DAG in Studio
- Version control: Pipeline versioning and lineage
Pipeline step types:
- Processing Steps
- Training Steps
- Transform Steps
- Model Steps
- Condition Steps
- Callback Steps
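A pipeline is a directed acyclic graph (DAG) of steps like those listed above: a step runs once the steps it depends on have finished. The sketch below uses only the Python standard library, not the SageMaker SDK, to show how an execution order follows from declared dependencies. The step names are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
dependencies = {
    "PreprocessData": set(),
    "TrainModel": {"PreprocessData"},
    "EvaluateModel": {"TrainModel", "PreprocessData"},
    "CheckAccuracy": {"EvaluateModel"},
    "RegisterModel": {"CheckAccuracy"},
}


def execution_order(deps):
    """Return one valid order in which the steps could run."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because every step here depends on the one before it, only one order is valid. With independent branches, several orders would be valid and the branches could run in parallel.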
Prerequisites
pip install sagemaker boto3 pandas scikit-learn
# Make sure the SageMaker SDK is >= 2.0
python -c "import sagemaker; print(sagemaker.__version__)"
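The same version check can be done programmatically, for example at the top of a notebook. The helper below is a minimal sketch that compares a dotted version string against the 2.0 minimum. `meets_minimum` is a hypothetical name, and the parsing deliberately ignores pre-release suffixes.

```python
import re


def meets_minimum(version, minimum=(2, 0)):
    """Return True if a dotted version string is at least `minimum`."""
    parts = []
    for piece in version.split("."):
        # Keep only the leading digits of each component, e.g. "0rc1" -> 0.
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as "2" so that they compare as (2, 0).
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


print(meets_minimum("2.0.0"))
print(meets_minimum("1.72.0"))
# With the SDK installed, the check could be applied as:
# import sagemaker
# assert meets_minimum(sagemaker.__version__)
```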
Quick Start
1. Setup
import boto3
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.parameters import ParameterString, ParameterInteger
session = sagemaker.Session()
bucket = session.default_bucket()
role = sagemaker.get_execution_role()
region = session.boto_region_name
pipeline_name = "iris-ml-pipeline"
2. Define Parameters
from sagemaker.workflow.parameters import (
    ParameterString,
    ParameterInteger,
    ParameterFloat,
)

# Pipeline parameters
input_data = ParameterString(
    name="InputData",
    default_value=f"s3://{bucket}/iris/raw/data.csv",
)
training_instance_type = ParameterString(
    name="TrainingInstanceType",
    default_value="ml.m5.xlarge",
)
training_instance_count = ParameterInteger(
    name="TrainingInstanceCount",
    default_value=1,
)
model_approval_status = ParameterString(
    name="ModelApprovalStatus",
    default_value="PendingManualApproval",
)
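The default values above apply whenever an execution starts without overrides. The helper below is a small sketch, not part of the SageMaker SDK, that validates override names and types before they are used. `build_overrides` is a hypothetical name and the S3 URI is a placeholder. The dictionary it returns has the shape accepted by the `parameters` argument of `Pipeline.start`.

```python
# Names must match the `name=` arguments of the pipeline parameters.
KNOWN_PARAMETERS = {
    "InputData": str,
    "TrainingInstanceType": str,
    "TrainingInstanceCount": int,
    "ModelApprovalStatus": str,
}


def build_overrides(**overrides):
    """Validate override names and types, then return them as a plain dict."""
    for name, value in overrides.items():
        if name not in KNOWN_PARAMETERS:
            raise KeyError(f"Unknown pipeline parameter: {name}")
        expected = KNOWN_PARAMETERS[name]
        if not isinstance(value, expected):
            raise TypeError(
                f"{name} expects {expected.__name__}, got {type(value).__name__}"
            )
    return dict(overrides)


overrides = build_overrides(
    InputData="s3://example-bucket/iris/raw/data.csv",  # placeholder URI
    TrainingInstanceCount=2,
)
print(overrides)
# With a Pipeline object in hand, this could be passed as:
# execution = pipeline.start(parameters=overrides)
```

Catching a misspelled parameter name locally is cheaper than waiting for the service to reject the execution request.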
Processing Steps
1. Preprocessing Data
# preprocess.py
import argparse
import os

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str)
    parser.add_argument("--test-size", type=float, default=0.2)
    args = parser.parse_args()

    # Read the data from the processing container's input directory
    input_path = os.path.join("/opt/ml/processing/input", "data.csv")
    df = pd.read_csv(input_path)

    # Separate features and target
    X = df.drop("target", axis=1)
    y = df["target"]

    # Scale the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=args.test_size, random_state=42
    )

    # Save the outputs (features first, target as the last column, no header)
    train_df = pd.DataFrame(X_train)
    train_df["target"] = y_train.values
    train_df.to_csv("/opt/ml/processing/train/train.csv", index=False, header=False)
    test_df = pd.DataFrame(X_test)
    test_df["target"] = y_test.values
    test_df.to_csv("/opt/ml/processing/test/test.csv", index=False, header=False)
    print(f"Train size: {len(train_df)}, Test size: {len(test_df)}")
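`StandardScaler` in the script rescales each feature column to zero mean and unit variance, using the population standard deviation. The sketch below reproduces that arithmetic for a single column with the standard library only, to make concrete what kind of values end up in the output CSV files. The sample values are made up, and `standardize` is a hypothetical helper name.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Scale values to zero mean and unit variance using the population std."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


sepal_lengths = [4.0, 5.0, 6.0, 7.0]  # made-up sample values
scaled = standardize(sepal_lengths)
print(scaled)
```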
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Create the processor
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
    sagemaker_session=session,
)

# Define the processing step
step_process = ProcessingStep(
    name="PreprocessData",
    processor=sklearn_processor,
    inputs=[