Complete AWS SageMaker Tutorial: End-to-End ML Pipeline
Amazon SageMaker is a fully managed machine learning service that enables data scientists and developers to build, train, and deploy ML models at scale. This tutorial covers the complete ML lifecycle on AWS.
Why AWS SageMaker?
SageMaker Advantages:
- Fully managed: No infrastructure to manage
- End-to-end: Complete ML lifecycle support
- Scalable: Train on any scale with managed infrastructure
- Integrated: Native AWS service integration
- Cost-effective: Pay only for what you use
Key Components:
- SageMaker Studio (IDE)
- SageMaker Training
- SageMaker Inference
- SageMaker Pipelines
- SageMaker Feature Store
- SageMaker Model Monitor
Prerequisites
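# Install the AWS CLI, the SDKs, and the libraries used in this tutorial
# (s3fs lets pandas write directly to s3:// paths)
pip install awscli boto3 sagemaker pandas scikit-learn s3fs
# Configure AWS credentials
aws configure
# Enter: AWS Access Key ID, Secret Access Key, Region (e.g., us-east-1)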
Quick Start
1. Set Up SageMaker Session
import boto3
import sagemaker
from sagemaker import get_execution_role
# Create a session, then look up the default bucket and the execution role
session = sagemaker.Session()
bucket = session.default_bucket()
role = get_execution_role()  # Or specify an IAM role ARN when running outside SageMaker
print(f"Bucket: {bucket}")
print(f"Role: {role}")
print(f"Region: {session.boto_region_name}")
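get_execution_role() only resolves a role when the code runs inside a SageMaker environment such as a notebook or Studio. The sketch below shows one possible fallback for local runs: read the ARN from an environment variable first, and call get_execution_role() only when it is not set. The variable name SAGEMAKER_ROLE_ARN and the helper resolve_role are assumptions made for illustration, not part of the SageMaker SDK.

```python
import os
import re

# Loose shape check for an IAM role ARN: arn:<partition>:iam::<12-digit account>:role/<name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def resolve_role(env_var="SAGEMAKER_ROLE_ARN"):
    """Return the role ARN from env_var (a hypothetical name) or fall back to the SDK."""
    arn = os.environ.get(env_var)
    if arn:
        if not ROLE_ARN_PATTERN.match(arn):
            raise ValueError(f"{env_var} does not look like an IAM role ARN: {arn!r}")
        return arn
    # Imported lazily so the helper can be used where the SDK is not installed
    from sagemaker import get_execution_role
    return get_execution_role()
```

With this in place, role = resolve_role() could replace the direct get_execution_role() call, and the same script would run both in Studio and on a laptop where the variable is exported.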
2. Prepare Training Data
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load sample data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
# Split data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Built-in XGBoost expects CSV with the label in the first column and no header row
columns = ['target'] + list(iris.feature_names)
# Save to S3 (writing to s3:// paths with pandas requires the s3fs package)
train_path = f"s3://{bucket}/iris/train/train.csv"
test_path = f"s3://{bucket}/iris/test/test.csv"
train_df[columns].to_csv(train_path, index=False, header=False)
test_df[columns].to_csv(test_path, index=False, header=False)
print(f"Training data: {train_path}")
print(f"Test data: {test_path}")
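The built-in XGBoost algorithm used in the next section reads CSV training data with the label in the first column and without a header row. The sketch below shows that layout on two illustrative rows using only the standard library, so the file format can be checked locally before anything is uploaded. The helper name to_xgboost_csv and the sample values are assumptions made for illustration.

```python
import csv
import io


def to_xgboost_csv(features, labels):
    """Serialize rows as label-first CSV text with no header row."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row, label in zip(features, labels):
        # The label goes first, followed by the feature values
        writer.writerow([label, *row])
    return buffer.getvalue()


# Two illustrative iris-like rows (sepal length, sepal width, petal length, petal width)
sample_features = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
sample_labels = [0, 2]
csv_text = to_xgboost_csv(sample_features, sample_labels)
print(csv_text)
```

Each printed line starts with the class label, and there is no line of column names at the top.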
Built-in Algorithms
1. XGBoost Training
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
# Get the XGBoost container image
container = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=session.boto_region_name,
    version="1.5-1"
)
# Create estimator
xgb_estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/iris/output",
    sagemaker_session=session,
    hyperparameters={
        "objective": "multi:softmax",
        "num_class": 3,
        "num_round": 100,
        "max_depth": 5,
        "eta": 0.2
    }
)
# Define training input
train_input = TrainingInput(
    s3_data=train_path,
    content_type="csv"
)
# Train model
xgb_estimator.fit({"train": train_input})
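The "objective" hyperparameter controls what the trained model returns. With "multi:softmax" each prediction is a single class index, while "multi:softprob" returns one probability per class. The sketch below illustrates the relationship between the two on hypothetical per-class scores; it is a plain-Python illustration of softmax followed by argmax, not XGBoost's internal code.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the most probable class (the softmax-style output)."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical raw scores for the three iris classes
raw_scores = [0.2, 2.1, -0.5]
probabilities = softmax(raw_scores)      # what a softprob-style output looks like
predicted = predict_class(raw_scores)    # what a softmax-style output looks like
print(probabilities, predicted)
```

If class probabilities are needed downstream, for example to apply a confidence threshold, "multi:softprob" is the objective to consider instead.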
2. Linear Learner
from sagemaker import LinearLearner
# Create Linear Learner estimator
linear = LinearLearner(
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    predictor_type="multiclass_classifier",
    num_classes=3,
    output_path=f"s3://{bucket}/linear/output"
)
# Prepare data in RecordIO format
train_records = linear.record_set(
    train_df.drop('target', axis=1).values.astype('float32'),
    train_df['target'].values.astype('float32'),
    channel='train'
)
# Train
linear.fit(train_records)
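For a multiclass Linear Learner job, label values are expected to be integers from 0 to num_classes - 1. A quick local check before calling record_set can catch mislabeled rows early. The helper below is a hypothetical sketch, not an SDK function.

```python
def invalid_labels(labels, num_classes):
    """Return the sorted distinct labels that fall outside 0..num_classes-1."""
    valid = set(range(num_classes))
    return sorted({label for label in labels if label not in valid})


# Hypothetical label columns: the first is clean, the second contains two bad values
clean = invalid_labels([0, 1, 2, 1, 0], num_classes=3)
dirty = invalid_labels([0, 3, 1, -1], num_classes=3)
print(clean, dirty)
```

In the tutorial's setting this could be called as invalid_labels(train_df['target'].tolist(), 3), and training would proceed only if the result is empty.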
Custom Training Scripts
1. Scikit-learn Training
# train_sklearn.py
import argparse
import joblib
import os
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score