Tutorial 18: TensorFlow Lite - Deploy ML on Mobile

Introduction

Prerequisites

Understanding TensorFlow Lite

Model Conversion from TensorFlow

Model Conversion from PyTorch

Post-Training Quantization

Quantization-Aware Training (QAT)

Model Optimization Techniques

Android Deployment

iOS Deployment with CoreML Bridge

Edge TPU Deployment

Benchmarking and Profiling

Best Practices

Conclusion

Introduction

Deploying machine learning models on mobile and edge devices opens up possibilities that cloud-only inference cannot match: offline capability, reduced latency, improved privacy, and lower operational costs. TensorFlow Lite (TFLite) is Google's framework for running ML models on mobile phones, embedded devices, and edge hardware.

This tutorial covers the complete workflow: converting trained models to the TFLite format, applying quantization to reduce model size and improve speed, deploying on Android and iOS, running on Edge TPUs, and benchmarking performance. Whether you are building a real-time image classifier, an on-device NLP model, or a sensor data pipeline, these techniques apply directly.

Prerequisites

Python 3.9+ with TensorFlow 2.15+
Android Studio (for Android deployment)
Xcode 15+ (for iOS deployment)
Basic understanding of neural network architectures
A trained model (we will create one for demonstration)

# Install required packages
pip install tensorflow tflite-support onnx onnx-tf torch

import tensorflow as tf
import numpy as np
print(f"TensorFlow version: {tf.__version__}")

Understanding TensorFlow Lite

TFLite uses a different model format (.tflite) from standard TensorFlow (.pb, SavedModel). The TFLite format is a FlatBuffer-based schema optimized for:

Small binary size - FlatBuffers are compact and zero-copy
Fast initialization - no parsing overhead, memory-mapped
Reduced memory footprint - designed for constrained devices
Hardware acceleration - supports GPU delegates, NNAPI, Edge TPU
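As a small illustration of the FlatBuffer and memory-mapping points, the sketch below memory-maps a file and checks for the 4-byte FlatBuffers file identifier, which the TFLite schema declares as "TFL3" and which sits at byte offset 4. The helper name `looks_like_tflite` is hypothetical and not part of any TensorFlow API; it is only a quick sanity check, not a full validation of the model.

```python
import mmap
import os

# File identifier declared by the TFLite FlatBuffer schema; FlatBuffers
# stores it in bytes 4..7, right after the 4-byte root table offset.
TFLITE_FILE_IDENTIFIER = b"TFL3"


def looks_like_tflite(path):
    """Return True if the file at `path` carries the TFLite file identifier.

    Hypothetical helper for illustration only. The file is memory-mapped
    read-only, so only the pages actually touched are loaded.
    """
    if os.path.getsize(path) < 8:
        # Too small to hold the root offset plus the identifier
        # (and mmap cannot map an empty file).
        return False
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return buf[4:8] == TFLITE_FILE_IDENTIFIER
```

After converting a model, calling `looks_like_tflite('model.tflite')` is a cheap way to confirm the file was written in the expected format before shipping it to a device.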

The conversion pipeline looks like this:

TensorFlow Model / PyTorch Model
        |
        v
    TFLite Converter (with optional optimizations)
        |
        v
    .tflite file
        |
        v
    TFLite Interpreter (on device)
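The pipeline above can be sketched end to end in a few lines. This is a minimal sketch, assuming a desktop Python environment with TensorFlow installed; the tiny two-layer model, the `run_tflite` helper, and the random input are hypothetical stand-ins rather than the tutorial's model. On a phone, the interpreter step would run through the Android or iOS TFLite runtime instead of Python.

```python
import numpy as np
import tensorflow as tf


def run_tflite(model_bytes, input_array):
    """Run one inference with the TFLite interpreter (hypothetical helper)."""
    interpreter = tf.lite.Interpreter(model_content=model_bytes)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()[0]
    output_details = interpreter.get_output_details()[0]
    # Cast the input to the dtype the converted model expects.
    interpreter.set_tensor(input_details["index"],
                           input_array.astype(input_details["dtype"]))
    interpreter.invoke()
    return interpreter.get_tensor(output_details["index"])


# Step 1: a TensorFlow model (an untrained stand-in, for illustration only).
tiny = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Step 2: the TFLite converter produces the serialized .tflite bytes.
tiny_bytes = tf.lite.TFLiteConverter.from_keras_model(tiny).convert()

# Step 3: the TFLite interpreter runs the converted model.
sample = np.random.default_rng(0).random((1, 4)).astype(np.float32)
tflite_out = run_tflite(tiny_bytes, sample)
keras_out = tiny(sample).numpy()

print("TFLite output shape:", tflite_out.shape)
print("Max abs difference vs. Keras:", np.abs(tflite_out - keras_out).max())
```

Comparing the interpreter's output with the original model's output on the same input is a simple check that conversion preserved the model's behavior; for an unquantized float model the two should agree to within floating-point tolerance.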

Model Conversion from TensorFlow

Converting a Keras Model

import tensorflow as tf
from tensorflow import keras

# Create a sample image classification model (not trained in this snippet)
def build_model():
    """Build and compile a small CNN for 10-class classification of 224x224 RGB images."""
    model = keras.Sequential([
        keras.layers.Conv2D(32, (3, 3), activation='relu',
                            input_shape=(224, 224, 3)),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Conv2D(64, (3, 3), activation='relu'),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Conv2D(128, (3, 3), activation='relu'),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(256, activation='relu'),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

model = build_model()


# Method 1: Convert from Keras model directly
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the model
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# Report the size of the serialized FlatBuffer
print(f"Model size: {len(tflite_model) / 1024:.1f} KB")
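To judge whether the reported model size is plausible, it helps to estimate the weight storage by hand. The sketch below counts the trainable parameters of the CNN defined above from its layer shapes and converts them to kilobytes at 4 bytes per float32 weight. The helper names are hypothetical, and the figure is only a lower bound: the .tflite file also stores the graph structure and metadata, so the printed size should come out somewhat larger.

```python
# (kernel_h, kernel_w, in_channels, out_channels) for each Conv2D layer
conv_layers = [(3, 3, 3, 32), (3, 3, 32, 64), (3, 3, 64, 128)]
# (in_features, out_features) for each Dense layer; GlobalAveragePooling2D
# reduces the last conv output to 128 features.
dense_layers = [(128, 256), (256, 10)]


def conv_params(kh, kw, cin, cout):
    """Weights plus one bias per output channel."""
    return kh * kw * cin * cout + cout


def dense_params(fin, fout):
    """Weights plus one bias per output unit."""
    return fin * fout + fout


# Pooling and dropout layers have no trainable parameters.
total_params = (sum(conv_params(*c) for c in conv_layers)
                + sum(dense_params(*d) for d in dense_layers))
float32_kb = total_params * 4 / 1024  # 4 bytes per float32 weight

print(f"Parameters: {total_params}")                               # 128842
print(f"Approximate float32 weight size: {float32_kb:.1f} KB")     # 503.3 KB
```

The same arithmetic gives a quick expectation for quantized models: storing each weight in 1 byte instead of 4 would bring the weight storage down to roughly a quarter of this figure.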
