XGBoost & LightGBM Tutorial: Gradient Boosting Masterclass


By Ruby Abdullah · · tutorial
Tags: XGBoost, LightGBM, Gradient Boosting, SHAP, Machine Learning, Python

XGBoost & LightGBM - Gradient Boosting Masterclass

Table of Contents

  • Introduction
  • Prerequisites
  • Understanding Gradient Boosting
  • XGBoost Training and Tuning
  • LightGBM Training and Tuning
  • XGBoost vs LightGBM Comparison
  • Hyperparameter Optimization
  • SHAP Explainability
  • Feature Importance Analysis
  • Early Stopping and Cross-Validation
  • Model Deployment
  • Best Practices
  • Conclusion

    Introduction

    Gradient boosting is one of the most effective machine learning techniques for structured (tabular) data. XGBoost (eXtreme Gradient Boosting) and LightGBM (Light Gradient Boosting Machine) are two leading implementations that appear frequently in winning Kaggle solutions and power production ML systems across industries. This masterclass covers both libraries, from fundamentals to production deployment.

    XGBoost, developed by Tianqi Chen, added explicit regularization to the gradient boosting objective and popularized the technique. LightGBM, developed by Microsoft, brought innovations such as histogram-based splitting and leaf-wise tree growth for faster training. Understanding both lets you choose the right tool for each problem.

    Prerequisites

    • Python 3.8 or higher
    • Basic understanding of machine learning concepts (classification, regression, overfitting)
    • Familiarity with scikit-learn API conventions

    pip install xgboost lightgbm
    pip install scikit-learn pandas numpy
    pip install shap matplotlib seaborn
    pip install optuna  # For hyperparameter optimization
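After running the install commands, it can help to confirm that each package is actually visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names introduced here, not part of any of the libraries.

```python
import sys
from importlib import metadata

# Distribution names as used in the pip commands above.
REQUIRED = [
    "xgboost", "lightgbm", "scikit-learn", "pandas", "numpy",
    "shap", "matplotlib", "seaborn", "optuna",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The tutorial lists Python 3.8 or higher as a prerequisite.
python_ok = sys.version_info >= (3, 8)
print(f"Python {sys.version.split()[0]}: {'OK' if python_ok else 'too old (3.8+ needed)'}")

report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` was most likely installed into a different environment than the one running the script.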

    Understanding Gradient Boosting

    Gradient boosting builds an ensemble of weak learners (typically decision trees) sequentially, where each new tree corrects the errors of the previous ensemble:

    import numpy as np
    import matplotlib.pyplot as plt

    # Conceptual illustration of gradient boosting
    # Step 1: Start with an initial prediction (e.g., mean of target)
    # Step 2: Compute residuals (actual - predicted)
    # Step 3: Fit a tree to the residuals
    # Step 4: Update predictions: new_pred = old_pred + learning_rate * tree_pred
    # Step 5: Repeat steps 2-4

    # Key differences between XGBoost and LightGBM:
    # XGBoost: level-wise tree growth by default (breadth-first)
    # LightGBM: leaf-wise tree growth (best-first) - faster, but at risk of overfitting
    # XGBoost splits nodes level by level
    # LightGBM splits the leaf with the highest loss reduction
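Steps 1 to 5 can be made concrete with a small from-scratch loop. This is a minimal sketch for squared-error regression on one feature, using depth-1 trees (stumps) written in NumPy. The helpers `fit_stump` and `predict_stump` and the synthetic sine data are assumptions made for illustration. XGBoost and LightGBM follow the same additive idea but add second-order gradient information, regularization, and much faster split finding, so this is not how either library is implemented.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a depth-1 regression tree to the residuals by minimizing squared error."""
    best = None
    # Candidate thresholds: every unique value except the largest,
    # so the right-hand side of the split is never empty.
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        sse = ((residuals[left] - left_value) ** 2).sum() \
            + ((residuals[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


# Synthetic data, made up for this illustration.
rng = np.random.default_rng(42)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 100

# Step 1: start from the mean of the target.
pred = np.full_like(y, y.mean())
mse_history = [float(np.mean((y - pred) ** 2))]
stumps = []

for _ in range(n_rounds):                 # Step 5: repeat
    residuals = y - pred                  # Step 2: residuals (actual - predicted)
    stump = fit_stump(x, residuals)       # Step 3: fit a tree to the residuals
    pred = pred + learning_rate * predict_stump(stump, x)  # Step 4: update
    stumps.append(stump)
    mse_history.append(float(np.mean((y - pred) ** 2)))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

For squared error the residuals are the negative gradient of the loss, which is where the name "gradient boosting" comes from. A smaller `learning_rate` makes each tree contribute less, so more rounds are needed but the fit is usually less prone to overfitting.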

    XGBoost Training and Tuning

    Basic Classification

    import xgboost as xgb
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

    # Generate sample data
    X, y = make_classification(
        n_samples=10000, n_features=20, n_informative=15,
        n_redundant=3, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Using scikit-learn API
    clf = xgb.XGBClassifier(
        n_estimators=500,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        min_child_weight=5,
        gamma=0.1,
        reg_alpha=0.1,  # L1 regularization
        reg_lambda=1.0,  # L2 regularization
        objective="binary:logistic",
        eval_metric="logloss",
        tree_method="hist",  # Histogram-based method (faster)
        device="cuda",  # Needs XGBoost 2.0+ with CUDA support; use "cpu" otherwise
        random_state=42,
        n_jobs=-1
    )

    clf.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=50  # Print the evaluation metric every 50 boosting rounds
    )

    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]

    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
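The snippet above produces both hard labels (`y_pred`) and positive-class probabilities (`y_proba`) but reports only accuracy. As a rough sketch of how the two relate, and of what the `logloss` values logged during training measure, the following computes accuracy, log loss, and ROC AUC by hand in NumPy. The labels and probabilities are made up for illustration rather than taken from the model, and the 0.5 cut-off is the conventional default for binary classification. In practice `sklearn.metrics` provides `accuracy_score`, `log_loss`, and `roc_auc_score` for the same quantities.

```python
import numpy as np

# Hypothetical labels and positive-class probabilities (illustrative only;
# in the tutorial these would correspond to y_test and y_proba).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_proba = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.60])

# Hard labels: threshold the probabilities at 0.5.
y_pred = (y_proba > 0.5).astype(int)

# Accuracy: fraction of hard labels that match the truth.
accuracy = float(np.mean(y_pred == y_true))

# Binary log loss: unlike accuracy, it penalizes confident wrong probabilities.
eps = 1e-15
p = np.clip(y_proba, eps, 1 - eps)
logloss = float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# ROC AUC: probability that a random positive is scored above a random
# negative, counting ties as one half.
pos = y_proba[y_true == 1]
neg = y_proba[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc = float(np.mean((diff > 0) + 0.5 * (diff == 0)))

print(f"Accuracy: {accuracy:.4f}")
print(f"Log loss: {logloss:.4f}")
print(f"ROC AUC:  {auc:.4f}")
```

Accuracy depends on the chosen threshold, while log loss and ROC AUC are computed from the probabilities directly, which is why ROC AUC should be given `y_proba` rather than `y_pred`.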
