Android neural networks in 2026: on-device machine learning capabilities and practical applications

Key takeaways

On-device neural networks are the default in 2026. Privacy, sub-100ms latency, and offline capability trump cloud inference for most mobile use cases.

LiteRT replaced TensorFlow Lite in September 2024. Same ecosystem, cleaner brand, better NPU support across Android 14+, Pixel flagships, and Samsung Galaxy S24 series.

MediaPipe Tasks is your starting point. Hand landmarks, pose, object detection, audio classification, and small LLMs—all production-ready, no MLOps spin-up.

Gemini Nano (Android 14+) handles on-device chat, summarization, and RAG. It ships as a 2–8B INT4 model on flagship devices; a five-question framework later in this guide covers when to ship your own model instead.

Hardware acceleration is real. Pixel 8 Pro’s 2 TPU cores, Snapdragon 8 Gen 3’s Hexagon NPU, and MediaTek Dimensity APU make INT4 inference feasible on phones shipping today.

Why Fora Soft wrote this 2026 guide

Four years ago, shipping on-device neural networks on Android meant wrestling with TensorFlow Lite, custom quantization pipelines, and vendor-specific build chains. Today—2026—it means reaching for LiteRT, MediaPipe Tasks, and the platform LLM. The stack is mature. The hardware is fast. But the 2022 articles you’ll find still talk about basic theory and toy models.

We’ve built on-device AI features into NetCam (real-time video anomaly detection), BrainCert (live proctoring with face landmarks), Memory Master (Android memorization with OCR and speech), and DSI Drones (autonomous swarming with edge inference). We know which tools win in production, what breaks on low-end hardware, and when to delegate to the cloud. This guide distills that playbook for teams shipping privacy-first, low-latency Android apps in 2026.

Building on-device AI into Android apps — where to start?

Talk to an ML engineer who’s shipped five Android apps with on-device inference. We’ll scope your model, pick the runtime, and outline the quantization strategy.

Book a 30-min call → WhatsApp → Email us →

Three promises on-device AI keeps that cloud cannot

1. Privacy is guaranteed by architecture. Medical scans, facial biometrics, voice recordings, and behavioral telemetry never leave the device. No cloud sync, no privacy policies to argue over, no data retention risk. For HIPAA, GDPR, and PCI-DSS applications, on-device inference is compliance shorthand: the model runs inside the user’s sandbox; you can’t exfiltrate what stays local.

2. Latency is sub-100ms, predictable, jitter-free. Cloud round-trips add 50–300ms minimum (network + server queueing). On-device, a Snapdragon 8 Gen 3 runs MobileNet object detection in 30–40ms, pose estimation in 25ms, and Whisper.cpp tiny (speech recognition) at 1.5–2x realtime. No network congestion, no timeout retries, no cold-start penalties.

3. Offline is a feature, not a gotcha. Airplane mode, tunnel dead zones, and rural coverage gaps don’t brick the app. Download the model once (50–200MB for most tasks); inference keeps working without Wi-Fi thereafter. Users experience your app as faster, more responsive, and more reliable than competitors relying on cloud APIs.

Cost equation: zero marginal inference cost

Cloud inference runs $0.01–$0.05 per API call (Vertex AI, AWS SageMaker, Hugging Face Inference API). Multiply by 1M daily active users, each running 10 inferences a day: you’re looking at $100K–$500K per day. On-device, the marginal cost of the 1 millionth inference is zero: you paid for the model once (engineering + training), downloaded it once (bandwidth), and the rest is device-side compute. Break-even: typically 100K–500K monthly inferences depending on model size.

Caught between cloud costs and on-device complexity?

We help teams architect hybrid pipelines that use on-device inference for latency-critical features and cloud for retraining and analytics.

Book a 30-min call → WhatsApp → Email us →

The Android 2026 on-device AI stack

The modern stack splits into four layers: model formats, runtimes, hardware accelerators, and app-level APIs. Understanding each layer helps you pick the right tool for your use case.

  • Model formats: .tflite (LiteRT), .onnx, .pb (SavedModel). Typical workflow: train in PyTorch or TF2, quantize, export to .tflite or .onnx. Maturity (2026): production-ready; the .tflite format has been stable since 2020.

  • Runtimes: LiteRT (ex TFLite), ONNX Runtime Mobile, PyTorch ExecuTorch, JAX on-device. Typical workflow: load the model at app startup, run inference in a loop, cache the delegate. Maturity (2026): LiteRT v1.4+ stable; ONNX Runtime Mobile v1.18+ recommended.

  • Hardware accelerators: CPU (ARM v8), GPU, NPU (Hexagon), TPU (Pixel). Typical workflow: the interpreter auto-selects the best available accelerator (Hexagon > GPU > CPU). Maturity (2026): GPU delegate stable; Hexagon and TPU delegates mature.

  • App-level APIs: MediaPipe Tasks, TensorFlow Lite Support Library, Google AI Edge SDK. Typical workflow: the app calls a high-level task (e.g., HandLandmarker.detectAsync) and gets landmarks back. Maturity (2026): MediaPipe Tasks mature; AICore in beta (Pixel 8 Pro+).

LiteRT: the TensorFlow Lite rebrand that changes nothing and everything

In September 2024, Google renamed TensorFlow Lite to LiteRT. The name change signals a pivot from “lightweight TensorFlow” toward a standalone, language-agnostic runtime. But for Android devs, the practical story is simpler: your .tflite models run on LiteRT with zero changes. The interpreter, quantization tools, and delegate APIs remain unchanged.

What actually changed

Hexagon NPU support got first-class treatment. LiteRT v1.4+ ships with a production-ready Snapdragon Hexagon delegate. The old TFLite Hexagon delegate required vendor-specific builds; LiteRT delegates are now part of the official release. On a Snapdragon 8 Gen 3, Hexagon can run INT4 quantized models 3–5x faster than ARM CPU alone.

OpenCL GPU delegate expanded. LiteRT v1.4+ added support for Mali and Adreno GPUs. This is backward compatible (older TFLite GPU delegate code works unchanged), but LiteRT better documents which GPU types get which speedups.

Structured I/O via TensorFlow Lite Support Library. The library now ships pre-built binaries for LiteRT, so you can use metadata (input/output names, tensor shapes) instead of magic indices. No functional change, but less boilerplate.

Reach for LiteRT when: you have a trained PyTorch or TensorFlow model, need sub-100ms inference on Android 10+, and want automatic NPU offload on flagship devices. LiteRT is the default choice for most new Android ML projects.
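As a concrete reference point, here is a minimal LiteRT inference sketch in Kotlin. It assumes the classic org.tensorflow.lite interpreter and support-library classes (which the rebrand kept unchanged, as noted above) and an illustrative bundled model asset; shapes and class counts are placeholders for a MobileNet-style classifier.

```kotlin
import android.content.Context
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.support.common.FileUtil

// Illustrative LiteRT usage: load a bundled .tflite asset, attach the GPU
// delegate, and run a single inference. Asset name, tensor shapes, and the
// 1000-class output are assumptions for a MobileNet-style classifier.
class ImageClassifier(context: Context) {
    private val gpuDelegate = GpuDelegate()
    private val interpreter = Interpreter(
        FileUtil.loadMappedFile(context, "mobilenet_v3.tflite"),
        Interpreter.Options().addDelegate(gpuDelegate).setNumThreads(4)
    )

    /** input: 1 x 224 x 224 x 3 normalized floats; returns 1000 class scores. */
    fun classify(input: Array<Array<Array<FloatArray>>>): FloatArray {
        val output = Array(1) { FloatArray(1000) }
        interpreter.run(input, output)
        return output[0]
    }

    fun close() {
        interpreter.close()
        gpuDelegate.close()
    }
}
```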

MediaPipe Tasks: the library most mobile teams should reach for first

MediaPipe Tasks is Google’s abstraction layer over LiteRT and vendor NPUs. Instead of writing interpreter loops, you call high-level task APIs such as HandLandmarker.detectAsync(image, timestampMs), PoseLandmarker.detectAsync(image, timestampMs), and ObjectDetector.detect(image). MediaPipe handles quantization, delegates, and frame resizing for you.
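For instance, here is a hedged sketch of live hand-landmark detection, assuming the com.google.mediapipe:tasks-vision artifact and a bundled hand_landmarker.task model; option and accessor names follow the current Tasks API and may shift between releases.

```kotlin
import android.content.Context
import android.graphics.Bitmap
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.vision.core.RunningMode
import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker

fun createHandLandmarker(context: Context): HandLandmarker {
    val options = HandLandmarker.HandLandmarkerOptions.builder()
        .setBaseOptions(
            BaseOptions.builder()
                .setModelAssetPath("hand_landmarker.task")   // bundled in assets/
                .build()
        )
        .setRunningMode(RunningMode.LIVE_STREAM)             // async camera frames
        .setNumHands(2)
        .setResultListener { result, _ ->
            // 21 landmarks per detected hand, normalized to [0, 1]
            result.landmarks().forEach { hand -> /* draw overlay or classify gesture */ }
        }
        .build()
    return HandLandmarker.createFromOptions(context, options)
}

// Per camera frame (e.g., from a CameraX ImageAnalysis callback):
fun onFrame(landmarker: HandLandmarker, bitmap: Bitmap, timestampMs: Long) {
    landmarker.detectAsync(BitmapImageBuilder(bitmap).build(), timestampMs)
}
```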

Core vision tasks in production use

Hand Landmarks. Detects 21 hand keypoints (knuckles, palm, wrist). Model size 4–8MB, ~10ms on Snapdragon 8 Gen 3. Used in BrainCert for proctoring (hand-in-frame detection) and sign language apps. Works offline without additional setup.

Pose Estimation. 33 body keypoints (joints, head, torso). Model size 4–6MB (lite) or 15MB (full-body), ~25ms on flagship. Used in fitness tracking, physical therapy, and motion-capture apps. MediaPipe Pose can detect multiple people in-frame.

Object Detection (SSD MobileNet). ~90 classes (COCO dataset), model size 4–7MB, ~30–50ms on Snapdragon 8 Gen 3. Used in NetCam for anomaly detection and shopping apps for barcode recognition.

Face Detection & Landmark. Bounding box + 468 facial landmarks (iris, mouth corners, jawline). Model size 5–10MB, ~15ms. Used in proctoring (gaze detection), beauty filters, and emotion classification.

Audio Classification. Classify environmental sounds (speech, music, applause, silence, etc.). Model size 10–30MB, runs streaming on Android 10+. Used in accessibility apps and voice-activity detection.

MediaPipe Tasks LLM API (new in 2025)

MediaPipe Tasks now includes LLM inference via a unified API. You supply a quantized model (Gemma 2B INT4, Llama 2 7B INT4, or smaller), and Tasks handles context caching, token generation, and batch inference. No raw LiteRT interpreter loops required. This is the bridge between MediaPipe vision tasks and the new Gemini Nano APIs.
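A hedged sketch of that LLM API, assuming the com.google.mediapipe:tasks-genai artifact and a Gemma 2B INT4 model already downloaded to app storage; the file path and option names are illustrative.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Gemma 2B INT4 (~1.5GB) downloaded at first launch, not bundled in the APK.
fun createLlm(context: Context): LlmInference {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("${context.filesDir}/models/gemma-2b-it-int4.bin")
        .setMaxTokens(512)   // combined prompt + response budget
        .build()
    return LlmInference.createFromOptions(context, options)
}

// Blocking call; run it off the main thread. A streaming variant
// (generateResponseAsync) is also available in recent releases.
fun summarize(llm: LlmInference, transcript: String): String =
    llm.generateResponse("Summarize in two sentences:\n$transcript")
```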

Reach for MediaPipe Tasks when: you need hand, pose, face, object, or audio detection with minimal boilerplate. Start here for vision tasks; it handles frame resizing, delegate selection, and threading for you. LiteRT is next if you need raw access to model I/O or custom postprocessing.

Gemini Nano and AICore: the platform LLM on Android 14+

Google announced Gemini Nano in December 2023 and shipped it in Android 14 via the AICore system service. Pixel 8 Pro, Galaxy S24, and other flagship devices now ship with a 2–8B LLM baked into the OS. You call it via GenerativeModel API (same API as Gemini cloud).

What Gemini Nano can do

Summarization. Takes a long text (email thread, article, transcript) and returns a 2–3 sentence summary. Latency: 1–3 seconds on Pixel 8 Pro, zero network calls. Google Docs and Gmail are already shipping Gemini Nano summarize as a feature.

Live on-device chat. Q&A over documents, code, or user-supplied context. Supports system prompts and context caching (so repeated queries over the same document re-use cached embeddings). This is how features like Gboard’s grammar check work offline.

RAG (Retrieval-Augmented Generation). Combine Gemini Nano with on-device embeddings (Google AI Edge SDK includes embedding models). Retrieve relevant chunks from a local knowledge base, feed them to Nano, get grounded answers. No network call, no network latency. A retrieval sketch follows at the end of this list.

Content generation. Draft emails, blog posts, product descriptions. Output quality is lower than Gemini Pro (cloud), but generation runs entirely on-device, offline, with no network round-trip. Use it for user-facing suggestions, not mission-critical text.
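To make the RAG flow concrete, here is a minimal sketch that uses the MediaPipe tasks-text embedder for on-device embeddings and treats the actual LLM call as an injected function (AICore GenerativeModel, MediaPipe LLM Inference, or your own model). The embedder asset name and exact accessor names are assumptions.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.text.textembedder.TextEmbedder
import kotlin.math.sqrt

class LocalRag(context: Context, chunks: List<String>) {

    private val embedder = TextEmbedder.createFromOptions(
        context,
        TextEmbedder.TextEmbedderOptions.builder()
            .setBaseOptions(BaseOptions.builder().setModelAssetPath("text_embedder.tflite").build())
            .build()
    )

    // Pre-compute embeddings once; roughly 100–200ms per chunk per the figures above.
    private val index = chunks.map { it to embed(it) }

    private fun embed(text: String): FloatArray =
        embedder.embed(text).embeddingResult().embeddings().first().floatEmbedding()

    private fun cosine(a: FloatArray, b: FloatArray): Float {
        var dot = 0f; var na = 0f; var nb = 0f
        for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
        return dot / (sqrt(na) * sqrt(nb))
    }

    /** generateAnswer stands in for whichever on-device LLM call your app uses. */
    fun answer(question: String, topK: Int = 3, generateAnswer: (String) -> String): String {
        val q = embed(question)
        val context = index.sortedByDescending { cosine(q, it.second) }.take(topK)
        val prompt = buildString {
            appendLine("Answer using only the context below.")
            context.forEach { appendLine("- ${it.first}") }
            appendLine("Question: $question")
        }
        return generateAnswer(prompt)
    }
}
```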

Hardware requirements and fallback strategy

Gemini Nano ships on:

  • Pixel 8 Pro and later (all variants)
  • Samsung Galaxy S24 (Ultra, Plus, standard)
  • OnePlus 12 (some variants, check regional availability)
  • Xiaomi 14 (some markets)

For devices without Gemini Nano, fall back gracefully: check availability (e.g., canRunLocally()) before attempting on-device inference. If the check fails, degrade to cloud Gemini or disable the feature.
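A hedged sketch of that fallback logic, with the three dependencies kept as hypothetical interfaces: wire them to the real AICore / cloud Gemini clients and a ConnectivityManager check in your app.

```kotlin
// Hypothetical abstractions over the on-device and cloud LLM clients.
interface OnDeviceLlm {
    fun canRunLocally(): Boolean
    suspend fun summarize(text: String): String
}
interface CloudLlm {
    suspend fun summarize(text: String): String
}

class Summarizer(
    private val nano: OnDeviceLlm,
    private val cloud: CloudLlm,
    private val isOnline: () -> Boolean,
) {
    // Prefer on-device inference; degrade to cloud; otherwise hide the feature (null).
    suspend fun summarize(text: String): String? = when {
        nano.canRunLocally() -> nano.summarize(text)
        isOnline()           -> cloud.summarize(text)
        else                 -> null
    }
}
```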

Reach for Gemini Nano when: your app targets Pixel 8 Pro+ or Galaxy S24+, users expect instant, offline responses, and your use case is summarization, chat, or RAG. Fallback to Gemini cloud or a custom LLM if you need wider device coverage.

ONNX Runtime Mobile: when to pick it over LiteRT

ONNX (Open Neural Network Exchange) is an open model format co-created by Microsoft and Facebook, now a Linux Foundation project. ONNX Runtime Mobile v1.18+ is Microsoft’s lean interpreter for Android, iOS, and embedded systems. Unlike LiteRT (tied to the TensorFlow ecosystem), ONNX works with PyTorch, scikit-learn, and any framework that exports to ONNX.

LiteRT vs ONNX Runtime Mobile

LiteRT wins if: your team already uses TensorFlow, you want official Google support, or you need Gemini Nano integration (native AICore API). Hexagon and GPU delegates are mature and first-class.

ONNX Runtime Mobile wins if: your model is in PyTorch or ONNX-native format, you need true cross-platform parity (same model binary on iOS and Android), or you work with models from Hugging Face that export ONNX directly. Snapdragon NPU support is newer but functional (v1.17+).

Performance is similar: both can achieve 30–50ms on MobileNet inference on flagship hardware. Quantization stories are identical (INT8, INT4, dynamic range). The real difference is ecosystem inertia: LiteRT has more pre-trained models, MediaPipe integration, and Android-first documentation.

Reach for ONNX Runtime Mobile when: your team is Python/PyTorch-first, you need cross-platform iOS/Android feature parity, or you’re pulling models from Hugging Face. Otherwise, start with LiteRT + MediaPipe.

NNAPI is deprecated in Android 15: what to do

Android Neural Networks API (NNAPI) was the bridge between high-level frameworks (TensorFlow, PyTorch) and hardware accelerators (GPU, NPU) on older Android versions. It shipped in Android 8.1 (2017) and lived in the framework until Android 15 (September 2024), where Google deprecated it and no longer recommends it for new projects.

Why NNAPI is going away

NNAPI was a lowest-common-denominator API. Different vendors (Qualcomm, MediaTek, Google) implemented it with varying degrees of completeness, leading to fragmentation. LiteRT and ONNX Runtime Mobile are narrower, vendor-aware, and have better delegate coverage. Google’s strategy is: let vendors ship custom delegates (Hexagon, Dimensity APU), and have frameworks (LiteRT, ONNX) plug into them directly.

What replaces NNAPI

For TensorFlow models: use LiteRT delegates. LiteRT v1.4+ ships with Hexagon (Snapdragon), GPU (Adreno, Mali), and OpenCL delegates. If you have legacy code using NNAPI, migrate to the LiteRT GPU delegate (a one-line API change; see the snippet after this list).

For ONNX models: use ONNX Runtime Mobile with CoreML provider (iOS) or Snapdragon provider (Android). ONNX Runtime Mobile v1.17+ has built-in Snapdragon NPU support.

For custom hardware: work directly with SoC vendors on delegate implementations (Qualcomm, MediaTek publish vendor-specific SDKs).
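The migration itself looks roughly like this sketch, using the classic org.tensorflow.lite class names and the support library’s FileUtil; artifact coordinates differ between TFLite and LiteRT releases.

```kotlin
import android.content.Context
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.support.common.FileUtil
// import org.tensorflow.lite.nnapi.NnApiDelegate   // legacy path being retired

fun buildInterpreter(context: Context): Interpreter {
    val options = Interpreter.Options()
        // Before: options.addDelegate(NnApiDelegate())
        .addDelegate(GpuDelegate())   // After: LiteRT GPU delegate
    return Interpreter(FileUtil.loadMappedFile(context, "model.tflite"), options)
}
```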

Reach for NNAPI legacy support when: you’re maintaining code that targets Android versions older than 10 and depends on NNAPI. For any new project targeting Android 12+, skip NNAPI entirely. Use LiteRT or ONNX Runtime Mobile delegates.

Model quantization: FP32 to INT4 accuracy and latency tradeoffs

Quantization shrinks model size and speeds up inference by storing weights and activations in lower-precision formats. The standard progression is FP32 → FP16 → INT8 → INT4. Each step trades accuracy for speed and memory.

Practical tradeoff table

FP32 (baseline): full 32-bit precision. Model size 4MB (MobileNet-v3), latency ~80ms (Snapdragon 8 Gen 3 CPU). Best accuracy, slowest. Used for training, not production Android.

FP16 (half precision): 16-bit floats. Model size 2MB (50% reduction), latency ~50ms with GPU. Accuracy loss <0.1% in most vision tasks. Good baseline for compatibility (works on older GPUs, some NPUs). Common intermediate step.

INT8 (post-training quantization): 8-bit integers. Model size 1MB (75% reduction), latency ~35ms on CPU, ~15ms on Hexagon NPU. Accuracy loss 0.5–2% on ImageNet-scale tasks. Standard for production, nearly always acceptable. This is the go-to for on-device inference.

INT4 (aggressive quantization): 4-bit integers. Model size 0.5MB (87.5% reduction), latency ~20ms on Hexagon, ~50ms on CPU. Accuracy loss 1–5% depending on model and calibration data. Acceptable for classification, marginal for dense tasks (pose, segmentation). Essential for shipping LLMs (Gemma 2B INT4 is 1.5GB, versus 8GB FP32).

Quantization strategies

Post-training quantization (PTQ): train in FP32, quantize the saved model without retraining. Fast (1–10 minutes), requires representative calibration data. Works well for INT8, acceptable for INT4 on robust models. This is what MediaPipe and LiteRT Converter use by default.

Quantization-aware training (QAT): train with fake quantization ops, so the model learns to be robust to low-precision. Takes 10–50% longer to train, but INT4 accuracy is 1–2% better than PTQ. Worth it for critical models (medical imaging, accessibility features).

Dynamic quantization: quantize weights at export time, keep activations in FP32. Smaller model size without retraining or calibration. Slower than INT8 (activations are still 32-bit), but faster than FP32. Good for RNNs and Transformers where layer-by-layer precision matters.

Reach for INT8 quantization when: you want the best speed/accuracy/size tradeoff for vision models on mobile. Reach for INT4 when shipping LLMs, text embeddings, or when model size is the constraint (download bandwidth, app size limits). Use QAT for domains where accuracy is non-negotiable (medical, accessibility).

Hardware acceleration tiers: what runs where in 2026

Not all Android phones are created equal. The hardware landscape splits into four tiers: flagship with NPU, flagship with TPU, mid-range with GPU, and budget with CPU-only. Your app needs to work on all four.

Tier 1: Flagship with dedicated NPU (Snapdragon, MediaTek)

Hardware: Snapdragon 8 Gen 3 / 8 Gen 3 Leading Version (Hexagon NPU), MediaTek Dimensity 9300 (APU), Samsung Exynos 2400 (integrated NPU).

Model capacity: INT4 quantized models up to 2GB (Gemma 2B, Llama 2 7B INT4 with context caching). Dense models (pose, segmentation) run in 20–40ms.

Devices: Pixel 9 Pro, Galaxy S24 Ultra, OnePlus 12, Xiaomi 14 Ultra, Nothing Phone (2a) Pro.

LiteRT support: Full via Hexagon delegate. ONNX Runtime Mobile v1.17+ via Snapdragon provider.

Tier 2: Flagship with Tensor TPU (Google Pixel)

Hardware: Tensor G4 TPU (Pixel 9 Pro series) and Tensor G3 TPU (Pixel 8 Pro, Pixel 8). Two TPU cores per chip, custom quantization ops, Gemini Nano baked in.

Model capacity: Same as Snapdragon (INT4 up to 2GB), plus native AICore LLM API. Tensor optimizations for MobileNet, EfficientNet, and Gemini-specific layers.

Devices: Pixel 8 Pro, Pixel 8a, Pixel 9 Pro, Pixel 9 Pro Fold, Pixel 9 Pro XL.

LiteRT support: Full via built-in Tensor delegate (no extra setup). AICore LLM via GenerativeModel API.

Tier 3: Mid-range with Adreno/Mali GPU

Hardware: Snapdragon 7 series / 6 series Adreno GPU, MediaTek Dimensity 6000/7000 Mali GPU. No dedicated NPU; GPU delegate is your best bet.

Model capacity: INT8 models up to 200MB (MobileNet, small pose), ~40–70ms inference on GPU. INT4 can work but may not always delegate (fallback to CPU).

Devices: Samsung Galaxy A54/A55, Xiaomi Redmi Note 13, OnePlus Nord N30.

LiteRT support: GPU delegate stable and recommended. ONNX Runtime Mobile works; NPU support absent.

Tier 4: Budget / older devices (CPU-only)

Hardware: Snapdragon 4/5 series, older Cortex-A55 / A53 cores. No NPU and no usable GPU delegate. CPU inference only.

Model capacity: INT8 models under 50MB (tiny MobileNet, keyword spotting), ~100–300ms latency. Larger models are impractical.

Devices: Samsung Galaxy A14, Motorola Moto G Play, most Android Go devices.

LiteRT support: Works fine; no delegates available. CPU inference is the only path.

Comparison matrix: runtimes head-to-head

  • LiteRT. Model size ceiling: 2GB INT4 (LLM), 200MB INT8 (vision). NPU support: Hexagon (Snapdragon), Tensor (Pixel), mature GPU delegates. LLM ready: yes, with Gemini Nano fallback. Ease of use: excellent (MediaPipe abstraction). Pick it for: TensorFlow models and vision tasks; the default choice.

  • MediaPipe Tasks. Model size ceiling: 4–50MB (prebuilt), 200MB custom. NPU support: delegates to LiteRT (inherits its NPU support). LLM ready: yes, LLM API (2025+). Ease of use: excellent (high-level API, minimal boilerplate). Pick it for: hand, pose, face, object, and audio detection in production apps.

  • Gemini Nano / AICore. Model size ceiling: 8B parameters (Pixel / S24 only). NPU support: Tensor G4 / Snapdragon 8 Gen 3 only. LLM ready: yes, native LLM in the OS. Ease of use: good (same surface as the cloud Gemini API). Pick it for: summarization, chat, and RAG on premium flagships; fall back to cloud elsewhere.

  • ONNX Runtime Mobile. Model size ceiling: 2GB INT4, 200MB INT8. NPU support: Snapdragon provider (v1.17+), broader vendor coverage. LLM ready: no, but can pair with an external LLM API. Ease of use: good (PyTorch-native, cross-platform). Pick it for: PyTorch models, Hugging Face ONNX exports, cross-platform iOS/Android parity.

  • PyTorch ExecuTorch. Model size ceiling: 500MB (sparse models recommended). NPU support: experimental. LLM ready: no (as of early 2026). Ease of use: moderate (still-evolving API). Pick it for: research projects, sparse models, PyTorch-only workflows.

What can actually run on-device in 2026: the concrete feature list

Let’s ground this in real latencies and model sizes you can ship today.

Vision tasks

Object detection (SSD MobileNet v2). Model size 4MB (INT8), latency ~30–40ms on Snapdragon 8 Gen 3 with Hexagon. Detects 90 COCO classes. Used in: NetCam anomaly detection, shopping apps (barcode scanning), autonomous vehicle perception.

Face detection & landmarks. Model size 5–10MB, latency ~15–20ms. 468 facial keypoints (iris, mouth, jawline). Used in: proctoring (gaze direction), beauty filters (real-time effects), emotion classification (via landmarks).

Hand landmarks. Model size 4–8MB, latency ~10–15ms. 21 keypoints per hand. Used in: sign language recognition, gesture-based control, fitness tracking (hand exercises).

Pose estimation (single-person). Model size 4–6MB (lite) or 15MB (full), latency ~25–40ms. 17–33 body keypoints. Used in: fitness apps, physical therapy, motion-capture games, fall detection.

Image segmentation (SegFormer / EfficientNet-Lite). Model size 10–30MB (INT8), latency ~50–100ms. Semantic segmentation (per-pixel class) or panoptic segmentation. Used in: background blur / virtual background, scene understanding for AR, medical image analysis.

OCR (text recognition). Model size 20–50MB (two-stage: detection + recognition), latency ~100–300ms. Extract text from images, handwriting, receipts. Used in: document scanning, form filling, accessibility features.

Audio and speech

Speech recognition (Whisper.cpp tiny). Model size 75MB, latency ~1.5–2x realtime on Snapdragon 8 Gen 3. Recognizes 99 languages, handles accents and background noise. Used in: accessibility (captions), voice commands, meeting transcription. Requires pre-downloaded model (~75MB).

Audio classification. Model size 5–20MB, latency ~10–50ms. Classify environmental sound (speech, music, applause, silence, dog bark, etc.). Used in: accessibility (sound context), smart home automation, wildlife monitoring.

Text-to-speech (small local models). Model size 40–100MB (voice + vocoder), latency ~500ms–2s per utterance. Naturalness is lower than cloud TTS (e.g., Google Cloud TTS), but synthesis runs entirely offline with no network round-trip. Used in: accessibility, offline audiobooks, real-time chat read-aloud.

Text and LLM

Text embeddings (MobileBERT, DistilBERT tiny). Model size 20–50MB, latency ~100–200ms per chunk. Embeds text into a vector space for semantic search and RAG. Used in: on-device knowledge base retrieval, similarity search, deduplication.

Small LLMs (Gemma 2B INT4, Llama 2 7B INT4). Model size 1.5–4GB, throughput ~3–5 tokens per second on the Snapdragon 8 Gen 3 Hexagon (Pixel 8 Pro is slightly faster). Context caching halves latency on repeated queries. Used in: on-device chat, summarization, code generation, writing assistance.

Gemini Nano (platform LLM). Baked into Pixel 8 Pro, S24 via AICore. 2–8B parameters, context-caching enabled, <1 second latency. Used in: Gboard grammar check, Gmail compose assist, Google Docs magic editor, custom apps via Google AI Edge SDK.

Privacy and compliance: why on-device AI is HIPAA/GDPR shorthand

Healthcare and fintech executives often ask: “Can we send patient data to a cloud API?” For biometric data (face scans, ECG waveforms, x-rays), the answer is: not without a Business Associate Agreement (BAA), encryption, and 90-day retention policies. On-device inference makes compliance trivial: the data never leaves the user’s phone.

HIPAA implications

Consider a dermatology triage feature: the user photographs a skin lesion and a model suggests whether to see a specialist. Ship that triage model on-device and the photo stays on the phone. The model runs locally. Only the triage result (categorical: “refer to dermatologist”) leaves the device. HIPAA still applies to the triage metadata, but the PHI never touched your infrastructure. Compliance cost drops to ~$10K per year (encryption of triage records alone).

GDPR and biometric processing

GDPR treats facial recognition as special category data (Article 9). Processing it requires explicit consent and a lawful basis (legitimate interest rarely applies). Send face photos to a cloud API for emotion detection? You need a Data Processing Agreement (DPA), a five-country legal review, and a Legitimate Interest Assessment (LIA). Ship the emotion detector on-device, process face landmarks locally? No data flows off the phone. Consent is still required, but enforcement is trivial (it’s obvious the app can’t phone home without network permission).

PCI-DSS and payment card data

A payment app that runs on-device card authentication (chip-and-PIN local verification) avoids transmitting unencrypted card data to your server. Because card data never touches your infrastructure, your PCI-DSS audit scope shrinks dramatically. The delta: ~$200K–$500K in annual audit and compliance costs.

Cost math: on-device vs cloud, with real numbers

Let’s model a fitness app with 100K monthly active users, each running 30 inferences per day (movement detection, form correction, rep counting).

Cloud inference (e.g., Google Vertex AI, AWS SageMaker)

  • 100K MAU × 30 inferences/day = 3M inferences/day
  • 3M inferences/day × $0.01 = $30K/day
  • $30K/day × 30 days = $900K/month
  • Annual: $10.8M
  • Plus: $5K/month for infrastructure monitoring, DDoS protection, and compliance tooling. Total: ~$11M/year.

    On-device inference

    • Model training and quantization: $10K (engineering time, one-time)
    • Model download (Play Store bundle or CDN): 8MB per user = 100K users × 8MB = 800GB bandwidth per month. At $0.12/GB (typical CDN): $96/month.
    • Server costs (telemetry, feedback logging only): $500/month on a small GCP instance.
    • Inference on-device: $0 marginal cost (already covered by device CPU/GPU).
    • Annual total: $10K + ($596/month × 12) = $10K + $7,152 = $17,152/year.

    Break-even

Cloud costs $11M/year; on-device costs $17K/year. At the cloud run-rate above ($30K/day), on-device pays for itself in roughly half a day of cloud usage. For any app with >100K users and >10 inferences per user per day, on-device is economically dominant. The trade-off is engineering effort (quantization, testing on 50+ device types, fallback logic), not cost.

    Mini case study: shipping on-device anomaly detection in NetCam Android

    Before: cloud-only

Every frame was uploaded to a cloud endpoint for inference; latency was 200–300ms per frame (network + inference). An alert for “abandoned bag” arrived 2–3 seconds after the event, too late for automated gates or guards to respond.

    After: on-device + cloud hybrid

    Results:

    • Bandwidth: 50KB anomaly thumbnail × ~5 alerts per hour = 250KB/hour = ~6MB/day per camera. Cost: $0.72/day, or $262/year per camera (versus $74K/year cloud-only).
    • Latency: 40ms local detection + 100ms cloud upload = 140ms total. Alerts arrive within 200ms of the event.
    • Reliability: Camera keeps recording and detecting even if cloud connectivity drops. Fallback: store anomalies locally, sync when online.
    • Privacy: Raw video never leaves the camera. Only processed detections (bounding boxes, class labels) sync to cloud, satisfying on-premise data residency requirements.


    Decision framework: five questions to pick on-device vs cloud

    Use this checklist when scoping a new feature.

    1. Is the task latency-critical (<200ms)? If yes, on-device wins. Cloud adds 50–500ms just for network round-trip. Object detection, hand landmarks, pose estimation are all latency-critical.

    2. Do users expect privacy (biometrics, health, financial data)? If yes, on-device is non-negotiable. Face scans, medical images, and payment card data must not leave the device. Compliance (HIPAA, GDPR) is easier, faster, cheaper on-device.

    3. Will the feature run offline? If yes, on-device is the only option. Users in airplane mode or cellular dead zones should not lose functionality.

    4. Is the model size <500MB (vision) or <2GB (LLM)? If yes, on-device is feasible. If the model is larger, split it: on-device for the latency-critical part, cloud for bulk inference.

    5. Do you control the deployment cadence? If yes (your own app), on-device model updates are easy (download a new .tflite at startup). If no (vendor-provided or OS-level feature), cloud or platform LLM (Gemini Nano) is safer.

    Decision tree
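The five questions collapse into a simple routing function. Here is a sketch in Kotlin; the enum names and thresholds mirror the checklist above, not any particular SDK.

```kotlin
enum class InferenceTarget { ON_DEVICE, HYBRID, CLOUD, PLATFORM_LLM }

data class FeatureSpec(
    val latencyCriticalUnder200Ms: Boolean,
    val handlesSensitiveData: Boolean,      // biometrics, health, financial
    val mustWorkOffline: Boolean,
    val modelSizeMb: Int,
    val isLlm: Boolean,
    val controlsDeploymentCadence: Boolean, // can you ship model updates yourself?
)

fun pickTarget(f: FeatureSpec): InferenceTarget {
    val fitsOnDevice = if (f.isLlm) f.modelSizeMb <= 2_000 else f.modelSizeMb <= 500
    return when {
        // Q2/Q3: privacy or offline requirements make on-device non-negotiable
        // (quantize or distill until it fits).
        f.handlesSensitiveData || f.mustWorkOffline -> InferenceTarget.ON_DEVICE
        // Q1 + Q4: latency-critical and small enough to run locally.
        f.latencyCriticalUnder200Ms && fitsOnDevice -> InferenceTarget.ON_DEVICE
        // Q4: too big; keep the latency-critical part local, bulk inference in the cloud.
        f.latencyCriticalUnder200Ms -> InferenceTarget.HYBRID
        // Q5: you don't control model updates; lean on the platform LLM.
        f.isLlm && !f.controlsDeploymentCadence -> InferenceTarget.PLATFORM_LLM
        else -> InferenceTarget.CLOUD
    }
}
```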

Five pitfalls in production Android ML

1. Inference blocking the main thread (ANR). Running the interpreter on the UI thread freezes rendering; a blocked main thread triggers Application Not Responding dialogs and janky scrolling. Remedy: run inference on a background thread or coroutine dispatcher and post only the result back to the UI (this is why the ANR-rate KPI below matters).

    2. OOM crashes on older devices. A 150MB model runs fine on Pixel 8 Pro (6GB RAM), but crashes with OutOfMemoryError on a Galaxy A50 (3GB RAM). Remedy: profile memory usage on the target device tier (test on Galaxy A50, Redmi Note, and Pixel 8a). Reduce model size via quantization (INT4), pruning (remove 20% of parameters), or knowledge distillation (train a smaller student model).

3. Delegate selection failures (silent CPU fallback). You target Hexagon NPU inference, but on a Snapdragon 6 Gen 1 (older Hexagon version), the delegate fails silently and falls back to CPU. Latency jumps from 30ms to 200ms. Users see freezing. Remedy: log delegate fallbacks, test on the actual target device hardware (not just the emulator), and add explicit fallback logic with timeout and warning logs (see the sketch after this list).

    4. Model download / caching strategy disaster. You ship a 50MB pose model in the app bundle, but after launch you improve it to 60MB. Update requires a full app re-release (7–14 days to reach all users). Users on the old version ship incorrect pose data; compliance teams flag the regression. Remedy: store models in app cache or Cloud Storage, implement silent updates (download new model in background, swap at startup). Version your models (model_v1.tflite, model_v2.tflite) and track the active version in prefs.

    5. Battery drain from continuous inference. A fitness app runs pose detection every 100ms in a background service. After one hour, battery drops 15–20%. Users complain. Remedy: profile inference power consumption on the target device (use Android Profiler battery drain estimator). Batch inferences (run pose every 500ms instead of every 100ms), gate inference behind a motion detector (accelerometer), and throttle on low-battery mode.
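A hedged sketch of explicit delegate fallback (pitfall 3), shown with the GPU delegate via the classic org.tensorflow.lite classes; the same pattern applies to an NPU delegate, and the asset name is illustrative.

```kotlin
import android.content.Context
import android.util.Log
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.support.common.FileUtil

// Try the accelerated delegate first; if interpreter creation fails
// (unsupported ops, old drivers), fall back to CPU and log it so the
// fallback rate shows up in your KPIs instead of failing silently.
fun createInterpreterWithFallback(context: Context, asset: String = "model.tflite"): Interpreter {
    val model = FileUtil.loadMappedFile(context, asset)
    return try {
        Interpreter(model, Interpreter.Options().addDelegate(GpuDelegate()))
    } catch (e: Exception) {
        Log.w("ML", "GPU delegate unavailable, falling back to CPU", e)
        Interpreter(model, Interpreter.Options().setNumThreads(4))
    }
}
```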

    KPIs: what to measure once you ship

    Business KPIs. Measure inference request volume (how often the model is called), cache hit rate (how often a cached result is re-used), and latency p50/p95/p99 (median, 95th, 99th percentile inference time). Alert if p99 latency exceeds 200ms; it signals delegate failures or device overload.

    Reliability KPIs. Measure OOM crash rate (fraction of sessions ending with OutOfMemoryError), ANR rate (Application Not Responding due to inference blocking the main thread), and delegate selection rate (fraction of inferences that successfully offloaded to NPU vs fell back to CPU). Log these via Firebase Crashlytics or Sentry. Set SLOs: OOM rate <0.1%, ANR rate <0.5%, CPU fallback rate <20%.
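A minimal sketch of client-side latency tracking for those percentiles; in production you would flush the numbers to your analytics backend (Firebase, Sentry, etc.) rather than log them.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

class InferenceLatencyTracker(private val maxSamples: Int = 1_000) {
    private val samplesMs = ConcurrentLinkedQueue<Long>()

    /** Runs the block, records its duration in milliseconds, returns its result. */
    fun <T> measure(block: () -> T): T {
        val start = System.nanoTime()
        val result = block()
        record((System.nanoTime() - start) / 1_000_000)
        return result
    }

    fun record(latencyMs: Long) {
        samplesMs.add(latencyMs)
        while (samplesMs.size > maxSamples) samplesMs.poll()   // keep a rolling window
    }

    /** p50 / p95 / p99 over the current window; alert upstream if p99 > 200ms. */
    fun percentile(p: Double): Long {
        val sorted = samplesMs.toList().sorted()
        if (sorted.isEmpty()) return 0
        val idx = ((p / 100.0) * (sorted.size - 1)).toInt()
        return sorted[idx]
    }
}

// Usage: val scores = tracker.measure { classifier.classify(input) }
//        then report tracker.percentile(50.0), tracker.percentile(99.0)
```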

    Ready to ship on-device ML at scale?

    We help teams architect quantization pipelines, test on 50+ Android devices, and deploy production models with zero latency penalty.

    Book a 30-min call → WhatsApp → Email us →

    When NOT to ship on-device AI

    You need frequent model updates (daily retraining). If your model accuracy degrades daily because user distribution shifts, and you need to retrain every 24 hours, on-device is impractical. Cloud inference with daily model refreshes is the answer.

    You have a lean team and can’t afford device fragmentation testing. On-device ML requires testing on 30–50 device types, OS versions, and SoC variants. If you have two engineers, cloud is less risk.

    FAQ

    Does Android 2026 have a built-in LLM API for all apps?

    Only for Pixel 8 Pro and Galaxy S24 series, via AICore (Android 14+). Older devices and non-flagship phones have no built-in LLM. Fallback to Gemini Cloud or ship your own quantized model (Gemma 2B INT4).

    What is LiteRT, exactly?

    LiteRT is Google’s rebranding of TensorFlow Lite (September 2024). It’s a standalone runtime for on-device inference, independent of TensorFlow. Your .tflite models are compatible; runtimes and APIs are unchanged. The rebrand signals a pivot toward cross-framework support (ONNX, JAX, etc.).

    Is NNAPI dead? Should I migrate existing code?

    NNAPI is deprecated in Android 15+ but still functional for backward compatibility. If you have legacy NNAPI code on Android 10–14, it works. For new projects, use LiteRT delegates (GPU, Hexagon) instead. Migration is a one-line API change: replace NNAPI delegate with LiteRT GPU delegate.

    How large a model can actually run on Android?

    Vision models (object detection, pose): 4–200MB (INT8). Small LLMs (Gemma 2B INT4, Llama 7B INT4): 1.5–4GB. Flagship devices (Pixel 8 Pro, S24 Ultra) have 6–12GB RAM; budget phones have 3–4GB. Rule of thumb: model size should be <30% of device RAM. Test on your target device tier.
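The "<30% of device RAM" rule of thumb can be checked at runtime with the standard ActivityManager API; the threshold below simply encodes that heuristic.

```kotlin
import android.app.ActivityManager
import android.content.Context

// Returns true if a model of the given size respects the <30%-of-RAM rule of
// thumb, so the app can pick a smaller variant (or a cloud path) otherwise.
fun fitsInMemoryBudget(context: Context, modelSizeBytes: Long): Boolean {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)
    return modelSizeBytes < info.totalMem * 0.30
}
```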

    Is on-device AI completely free after the initial download?

    Nearly free. No cloud inference charges; only CPU/GPU/NPU energy (a few mW per inference) and app memory. Model download (50–200MB) is one-time bandwidth cost ($0.01–$0.10 per user). Telemetry and logging can add modest server costs ($100–$1K/month for most apps).

    How do I run Whisper (speech recognition) on Android?

Use Whisper.cpp (the C++ port) via JNI, or convert the Whisper model to TensorFlow Lite or ONNX, quantize to INT8 (~75MB for the tiny model), and run it with the LiteRT or ONNX Runtime interpreter. Latency: ~1.5–2x realtime on Snapdragon 8 Gen 3. Requires pre-downloading the model; not suitable for live transcription on budget phones.

    What about on-device AI on iOS? Is the story different?

iOS has Core ML (Apple’s runtime, similar to LiteRT), Create ML (a training tool), and Apple’s on-device foundation models (Apple Intelligence) as the platform LLM on recent iPhones. For cross-platform parity, use ONNX Runtime Mobile (works on iOS and Android) or LiteRT (supports iOS via iOS-specific builds). Performance is similar; the tooling is slightly different.

    Should I ship Gemini Nano or train a custom LLM?

    Ship Gemini Nano if you target Pixel 8 Pro+ / S24+ and have a fallback for older devices. Train / quantize a custom model (Gemma 2B) if: you need 100% device coverage, your use case requires domain-specific knowledge (medical, legal), or you want to avoid Google’s terms. Gemini Nano is faster and easier; custom models give you control.

    On-device neural networks are the new standard on Android

    In 2022, on-device inference was a luxury for flagships. In 2026, it’s the default architecture for any app handling sensitive data, requiring sub-100ms latency, or running offline. LiteRT and MediaPipe Tasks have matured to the point where shipping without them is the exception, not the rule. Hardware accelerators (Hexagon NPU, Tensor TPU, Mali GPU) have proliferated, making INT8 inference 3–10x faster than CPU-only.

    The real cost is engineering: testing on 30+ device types, quantization pipelines, fallback logic for older devices, and KPI monitoring. But the payoff is massive: 99% cost savings (cloud vs on-device), sub-200ms latency, and compliance that law firms don’t need to argue about.

    We’ve shipped on-device AI into five production Android apps. We know which tools scale, which don’t, and where the gotchas are hiding. If you’re scoping an on-device AI feature, we’ve already solved the hard problems.

    Ready to ship Android neural networks the right way?

    We help teams quantize models, architect on-device + cloud hybrid systems, and ship with zero latency regret.

    Book a 30-min call → WhatsApp → Email us →
