TensorFlow Lite Tutorial: Deploy ML on Mobile and Edge Devices

By Ruby Abdullah · tutorial
TensorFlow Lite · Mobile ML · Edge AI · Quantization · Android · iOS

Tutorial 18: TensorFlow Lite - Deploy ML on Mobile

Table of Contents

  • Introduction
  • Prerequisites
  • Understanding TensorFlow Lite
  • Model Conversion from TensorFlow
  • Model Conversion from PyTorch
  • Post-Training Quantization
  • Quantization-Aware Training (QAT)
  • Model Optimization Techniques
  • Android Deployment
  • iOS Deployment with CoreML Bridge
  • Edge TPU Deployment
  • Benchmarking and Profiling
  • Best Practices
  • Conclusion

    Introduction

    Deploying machine learning models on mobile and edge devices opens up possibilities that cloud-only inference cannot match: offline capability, reduced latency, improved privacy, and lower operational costs. TensorFlow Lite (TFLite) is Google's framework for running ML models on mobile phones, embedded devices, and edge hardware.

    This tutorial covers the complete workflow: converting trained models to the TFLite format, applying quantization to reduce model size and improve speed, deploying on Android and iOS, running on Edge TPUs, and benchmarking performance. Whether you are building a real-time image classifier, an on-device NLP model, or a sensor data pipeline, these techniques apply directly.

    Prerequisites

    • Python 3.9+ with TensorFlow 2.15+
    • Android Studio (for Android deployment)
    • Xcode 15+ (for iOS deployment)
    • Basic understanding of neural network architectures
    • A trained model (we will create one for demonstration)

    # Install required packages (run in a shell)
    pip install tensorflow tflite-support onnx onnx-tf torch

    # Verify the installation (run in Python)
    import tensorflow as tf
    import numpy as np

    print(f"TensorFlow version: {tf.__version__}")

    Understanding TensorFlow Lite

    TFLite uses a different model format (.tflite) from standard TensorFlow (.pb, SavedModel). The TFLite format is a FlatBuffer-based schema optimized for:

    • Small binary size - FlatBuffers are compact and zero-copy
    • Fast initialization - no parsing overhead, memory-mapped
    • Reduced memory footprint - designed for constrained devices
    • Hardware acceleration - supports GPU delegates, NNAPI, Edge TPU
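Because a .tflite file is a FlatBuffer, its header can be inspected without parsing the whole model. The sketch below is an illustration only, assuming the standard FlatBuffer layout (a 4-byte little-endian root offset followed by a 4-byte file identifier) and the `TFL3` identifier declared in the TFLite schema. The helper name `looks_like_tflite` and the synthetic header are hypothetical and not part of any TensorFlow API.

```python
import struct

# File identifier declared in the TFLite FlatBuffer schema.
TFLITE_FILE_IDENTIFIER = b"TFL3"

def looks_like_tflite(data: bytes) -> bool:
    """Return True if the bytes carry the TFLite FlatBuffer file identifier."""
    # A FlatBuffer starts with a 4-byte little-endian offset to the root table;
    # the optional file identifier occupies bytes 4..7.
    if len(data) < 8:
        return False
    root_offset = struct.unpack_from("<I", data, 0)[0]
    return data[4:8] == TFLITE_FILE_IDENTIFIER and root_offset < len(data)

# Synthetic 16-byte header for illustration only (not a valid model).
fake_header = struct.pack("<I", 8) + TFLITE_FILE_IDENTIFIER + bytes(8)

print(looks_like_tflite(fake_header))     # True
print(looks_like_tflite(b"not a model"))  # False
```

A check like this can serve as a cheap sanity test before handing a downloaded file to the interpreter. It does not validate the model contents.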

    The conversion pipeline looks like this:

    TensorFlow Model / PyTorch Model
                |
                v
    TFLite Converter (with optional optimizations)
                |
                v
          .tflite file
                |
                v
    TFLite Interpreter (on device)
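The last two stages of the pipeline can be tried on a desktop before touching a device. The following is a minimal sketch, assuming the `tf.lite.Interpreter` Python API that ships with TensorFlow. The tiny 4-feature, 3-class model and the helper `run_tflite` are hypothetical stand-ins chosen to keep the example fast, not the model built in this tutorial.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in model: 4 input features, 3 output classes.
tiny_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Converter stage: Keras model -> .tflite bytes (a FlatBuffer).
tiny_tflite = tf.lite.TFLiteConverter.from_keras_model(tiny_model).convert()

def run_tflite(model_bytes, batch):
    """Run one inference with the TFLite interpreter and return the output."""
    interpreter = tf.lite.Interpreter(model_content=model_bytes)
    interpreter.allocate_tensors()
    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]
    # Cast the input to the dtype the converted model expects.
    interpreter.set_tensor(input_info["index"],
                           batch.astype(input_info["dtype"]))
    interpreter.invoke()
    return interpreter.get_tensor(output_info["index"])

sample = np.random.rand(1, 4).astype(np.float32)
probs = run_tflite(tiny_tflite, sample)
print("Output shape:", probs.shape)
```

On a phone the same steps (load, allocate tensors, set input, invoke, read output) are performed through the platform's interpreter bindings rather than Python.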

    Model Conversion from TensorFlow

    Converting a Keras Model

    import tensorflow as tf
    from tensorflow import keras

    # Create a sample image classification model (training is omitted here)
    def build_model():
        model = keras.Sequential([
            keras.layers.Conv2D(32, (3, 3), activation='relu',
                                input_shape=(224, 224, 3)),
            keras.layers.MaxPooling2D((2, 2)),
            keras.layers.Conv2D(64, (3, 3), activation='relu'),
            keras.layers.MaxPooling2D((2, 2)),
            keras.layers.Conv2D(128, (3, 3), activation='relu'),
            keras.layers.GlobalAveragePooling2D(),
            keras.layers.Dense(256, activation='relu'),
            keras.layers.Dropout(0.5),
            keras.layers.Dense(10, activation='softmax')
        ])
        model.compile(
            optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy']
        )
        return model

    model = build_model()

    # Method 1: Convert from Keras model directly
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()

    # Save the model
    with open('model.tflite', 'wb') as f:
        f.write(tflite_model)

    print(f"Model size: {len(tflite_model) / 1024:.1f} KB")
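As a rough cross-check on the printed model size, the float32 weight payload can be estimated by hand from the layer shapes above. This sketch uses plain arithmetic. The helper names `conv2d_params` and `dense_params` are hypothetical, and the actual .tflite file should come out somewhat larger because it also stores graph structure and metadata.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a Conv2D layer."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a Dense layer."""
    return in_units * out_units + out_units

# Pooling, global average pooling, and dropout layers hold no parameters.
layer_params = {
    "conv2d_32": conv2d_params(3, 3, 3, 32),
    "conv2d_64": conv2d_params(3, 3, 32, 64),
    "conv2d_128": conv2d_params(3, 3, 64, 128),
    "dense_256": dense_params(128, 256),  # 128 channels after global pooling
    "dense_10": dense_params(256, 10),
}

total_params = sum(layer_params.values())
float32_bytes = total_params * 4  # 4 bytes per float32 weight

print(f"Parameters: {total_params}")                     # 128842
print(f"Float32 weights: {float32_bytes / 1024:.1f} KB")  # 503.3 KB
```

If the reported size is far from this estimate, that usually points to an unexpected layer configuration or to an optimization having been applied during conversion.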
