Blog: Custom Agora.io Development: What AI Mobile App Development Companies Need to Know

Key takeaways

  • Agora’s out-of-the-box SDK covers the last mile of RTC. It does not cover the AI you actually want to ship. Emotion detection, real-time translation, custom background replacement, conversational agents, AR filters — all of these live in the custom-development layer on top.
  • The battle is fought at the frame observer. registerVideoFrameObserver and registerAudioFrameObserver are the two hooks where 80% of AI custom work happens. Get them right and the rest follows.
  • Expect $40–300K+ for serious custom AI on Agora, depending on tier, plus inference cost, battery engineering, and QA across low-end Android. Budget honestly.
  • Agora Conversational AI Engine, OpenAI Realtime, LiveKit Agents, Deepgram and Cartesia now compose cleanly. You don’t have to pick one AI partner — you pick a pipeline.
  • Agent Engineering trims 30–40% off the scaffolding work — SDK wiring, token service, channel management, event plumbing — without touching the ML judgment calls.

Short answer: in 2026, building an AI-powered mobile app on Agora.io means owning three layers the default SDK doesn’t ship: a custom video frame pipeline, a custom audio frame pipeline, and an AI orchestration layer that plugs third-party models (OpenAI Realtime, Deepgram, Cartesia, Hume, MediaPipe, your own) into calls without destroying latency, battery or call quality.

This guide is written for CTOs and product leads of AI mobile app companies who’ve already picked Agora (or are about to) and need to know what the custom dev layer actually looks like, what it costs, where it breaks, and when a different RTC vendor (LiveKit, Daily.co, 100ms) would be the better choice.

Building an AI feature on top of Agora?

We’ve shipped custom frame processors, conversational agents, translation pipelines and AR filters on Agora since 2018. In 30 minutes we’ll tell you what’s reasonable in your timeline and what isn’t.

Book a 30-min call →

What Agora ships out of the box in 2026

The default Agora stack is good. Worth stating what you already get:

  • Video SDK with codec auto-negotiation (H.264, VP8, VP9, AV1 where available), bitrate adaptation, simulcast.
  • Voice SDK with AI Noise Suppression, echo cancellation and dynamic network adaptation.
  • Real-Time Messaging and Signaling for chat, presence and call signaling.
  • Cloud Recording for composite and individual recordings, plus Web Recording with a headless Chromium.
  • Virtual Background (segmentation + blur/image).
  • Voice AI Agent and the 2025 Conversational AI Engine for injecting AI agents into calls.
  • Analytics (Agora Analytics, Vocal Analytics) for call-quality debugging.

That covers maybe 60–70% of an AI mobile app’s RTC needs. The rest lives in the custom-dev zone, and that’s what this guide is about.

Where the default SDK stops and custom work starts

The moment you want to do anything AI-flavored that Agora doesn’t ship, you cross into custom dev. The honest list:

  • Custom video frame processing. Register a VideoFrameObserver, grab the CVPixelBuffer (iOS) or byte[] YUV (Android), run it through Core ML / TFLite / MediaPipe / your own model, then push the processed frame back.
  • Custom audio frame processing. Register an AudioFrameObserver, route raw PCM to a transcription / emotion / noise-suppression model, feed audio back into the call.
  • Third-party AI integrations. Wiring up Deepgram, AssemblyAI, Hume, OpenAI Realtime, Cartesia, TwelveLabs to live streams without blowing latency.
  • Server-side pipelines. Agora Cloud Recording / Media Push out, FFmpeg, an inference service (Groq, Cerebras, Triton, SageMaker), and a response path back into the channel via Media Pull.
  • AR filters. Banuba / DeepAR / ModiFace / Snap Camera Kit, each with its own texture pipeline.
  • Live moderation. Hive, Sightengine, custom classifiers running on recording output.
  • Spatial audio. Custom HRTF processing, object-based audio positioning.
  • Conversational AI agents. Agora’s engine works; integrating your own agent logic is still custom.

The AI integrations we see most on Agora mobile apps

Feature | Typical stack | Typical cost to ship | Watch out for
Real-time transcription + captions | AudioFrameObserver → Deepgram streaming / AssemblyAI / Azure Speech → caption overlay | $12–30K | WER on non-English, speaker diarization, latency under 600 ms
Live translation | Deepgram/AssemblyAI → Translation API → Cartesia Sonic TTS → back into the call via Media Pull | $30–70K | End-to-end latency stacking; budget <1.5 s total
Emotion / sentiment detection | Hume AI (voice) + MediaPipe FaceLandmarker (video), fused in the client | $25–60K | Bias testing, consent flows, battery drain
Custom background blur / replacement | VideoFrameObserver → MediaPipe Selfie Segmentation or Core ML DepthPro → Metal/OpenGL compositor | $15–40K | Edge halos on hair, GPU thermals on long calls
AR filters | Banuba / DeepAR SDK → VideoFrameObserver bridge | $20–60K + SDK license | Licensing tiers, asset pipeline, low-end device perf
Conversational AI agent in-call | Agora Conversational AI Engine, or OpenAI Realtime via Media Push out and Media Pull back in; orchestration with Pipecat / LiveKit Agents | $40–120K | Barge-in, turn-taking, cost per minute of LLM inference
Live AI moderation | Cloud Recording → Hive / Sightengine → moderation dashboard | $15–50K | GARM alignment, false-positive review workflow

Three architecture patterns that actually work

Pattern 1 — Full on-device pipeline. Frame observer → on-device model (MediaPipe, Core ML, TFLite) → push processed frame back. Lowest latency (10–30 ms). Best for background blur, AR filters, lightweight face-tracking, on-device noise suppression. Battery and thermal drain is the risk — Metal / Vulkan optimizations are non-optional.

Pattern 2 — Sidecar server pipeline. Agora Media Push streams call audio/video to your server, where FFmpeg + an inference service (Groq, Cerebras, a custom Triton cluster) processes it, and results either go to a companion channel or back into the call via Media Pull. Best for heavy models (Whisper Large, Llama 3.3 70B). Latency sits at 300–1,200 ms depending on network and model.

Pattern 3 — Hybrid. Cheap models on-device (VAD, segmentation, wake-word, light emotion), heavy models server-side (large LLMs, diarization, translation). This is our default recommendation for any product shipping to consumer iOS + Android.
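
To make the hybrid split concrete, here is a minimal Python sketch of a placement rule. The thresholds, the AiTask type and the choose_placement function are our own illustrative assumptions, not part of any Agora SDK; the 300 ms figure is the lower end of the sidecar latency range quoted above.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; tune them against your own profiling.
ON_DEVICE_MAX_MODEL_MB = 50      # assumed ceiling for a model a low-end phone can host
SERVER_MIN_ROUND_TRIP_MS = 300   # lower bound of the sidecar latency range in this guide


@dataclass(frozen=True)
class AiTask:
    name: str
    model_size_mb: float     # size of the quantized model
    latency_budget_ms: int   # how long the feature can wait for a result


def choose_placement(task: AiTask) -> str:
    """Return "on_device", "server" or "not_feasible" for one AI task."""
    if task.model_size_mb <= ON_DEVICE_MAX_MODEL_MB:
        return "on_device"   # cheap models stay on the phone
    if task.latency_budget_ms >= SERVER_MIN_ROUND_TRIP_MS:
        return "server"      # heavy model, and the feature tolerates a round trip
    return "not_feasible"    # too big for the phone, too urgent for the server


print(choose_placement(AiTask("selfie-segmentation", 5, 30)))
print(choose_placement(AiTask("live-translation", 4000, 1500)))
```

A task that lands in "not_feasible" is the signal to shrink the model or relax the feature's latency requirement before any pipeline code is written.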

The latency budget for “it still feels human”

End-to-end, a conversation with an AI agent starts to feel natural under about 700 ms round-trip and excellent under 400 ms. Typical breakdown:

  • Agora audio ingest and media push: 80–150 ms.
  • STT (Deepgram Nova, AssemblyAI): 150–300 ms partial / 600 ms final.
  • LLM first token (Groq Llama 3.3 70B or OpenAI Realtime): 150–350 ms.
  • TTS first audio chunk (Cartesia Sonic): 80–150 ms.
  • Agora media pull back to the call: 80–150 ms.

If any one link sits above 400 ms, the call doesn’t feel real-time. The engineering work is stacking optimizations, not any single “fast” service.
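
The budget is easy to keep honest in code. The sketch below sums the per-stage ranges from the breakdown above and flags any measured link over the 400 ms limit. The stage names and helper functions are our own labels for illustration.

```python
# Per-stage latency ranges in milliseconds, copied from the breakdown above.
BUDGET_MS = {
    "agora_ingest_media_push": (80, 150),
    "stt_partial": (150, 300),
    "llm_first_token": (150, 350),
    "tts_first_chunk": (80, 150),
    "agora_media_pull": (80, 150),
}
MAX_LINK_MS = 400  # a single link above this breaks the real-time feel


def round_trip(budget):
    """Return (best_case_ms, worst_case_ms) for the whole chain."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def slow_links(measured_ms, limit_ms=MAX_LINK_MS):
    """Return the names of the stages whose measured latency exceeds the limit."""
    return sorted(name for name, ms in measured_ms.items() if ms > limit_ms)


best, worst = round_trip(BUDGET_MS)
print(best, worst)  # 540 1100
```

With these ranges the best case lands at 540 ms, inside the 700 ms mark but above 400 ms, and the worst case at 1,100 ms. That gap is why every stage needs tuning, not just the slowest one.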

Latency rule of thumb. Budget your round-trip from the far end of the WAN, not from a developer laptop on office Wi-Fi. A demo that feels instant on a 20 ms LAN routinely degrades by 250–400 ms on mobile LTE — which is exactly where most of your paying users will sit.

Platform specifics that matter

  • iOS: CVPixelBuffer, CIImage, AVAudioEngine. Run models through Core ML (with ANE when possible) and composite frames with Metal Performance Shaders. Watch out for CIKernel-induced latency spikes.
  • Android: SurfaceTexture, MediaCodec, TFLite with GPU/NNAPI delegates, Vulkan for composition. Low-end device testing is essential — Agora’s own SDK is polished but custom frame work exposes hardware variance brutally.
  • React Native Agora: the official plugin supports basic calls and some frame observer work via native modules. Advanced AI pipelines almost always require writing your own TurboModule.
  • Flutter Agora: the agora_rtc_engine plugin is mature for basic calls; custom frame processing needs platform channels bridging to native Swift/Kotlin.
  • Agora Web SDK: MediaStreamTrackProcessor + WebCodecs + ONNX Runtime Web / TensorFlow.js run on modern Chromium-based browsers. Safari support lags; plan fallbacks.

Realistic 2026 cost model

Project tier | Build cost | What you get | Run-cost notes
Light AI | $40–80K | Agora calls + live captions + custom background blur + light moderation | Agora $0.99–3.99/1K min + STT $0.003–0.01/min
Moderate AI | $90–150K | + real-time translation, emotion detection, AR filters, in-call conversational agent | + LLM inference $0.20–2.00 per hour of conversation
Heavy AI | $160–300K+ | Multi-party translated calls, voice cloning, empathetic voice (Hume), custom LLM fine-tunes, spatial audio, studio-grade moderation | Server inference usually dominates run-cost; budget per-minute, not per-seat

Negotiating tip. Agora’s list pricing is a starting point, not a quote. Any team doing >100M minutes a year can negotiate 30–60% off. Get the quote before you lock your unit economics.

Agora vs the alternatives for AI mobile apps

  • LiveKit. Open source, modular, a first-class Agent framework. Best if you want to self-host, own your infra, or compose AI heavily server-side.
  • Daily.co. Purpose-built SDK with excellent AI-native docs (Daily Bots). Cleanest for OpenAI Realtime and Pipecat integrations.
  • 100ms. Strong in South Asia, great pricing, emerging AI feature set.
  • Amazon Chime SDK. Works, especially if you’re deep in AWS. AI integration is DIY.
  • Zoom SDK. Good for reusing Zoom brand trust. AI customization is limited to what Zoom allows.
  • Vonage (Nexmo / Video API). Enterprise stable, feature-weaker than Agora on AI custom.
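  • Twilio Video. Twilio announced an end-of-life for Programmable Video in December 2023, then reversed the decision in 2024. The product is still available, but weigh that history before you start here.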

Agora wins when: you need excellent low-bandwidth handling, strong APAC and China coverage, predictable pricing, and a broad SDK across iOS / Android / Web / RN / Flutter / Unity.

LiveKit or Daily win when: your team prefers open source, the AI orchestration is the core of the product, or you’re building on OpenAI Realtime from day one.

The 2026 real-time AI stack, composed

Nobody builds the whole thing in-house anymore. The composition we ship most often on Agora:

  • RTC: Agora Video + Voice SDK.
  • STT: Deepgram Nova-2 or AssemblyAI Universal-Streaming. Whisper Large V3 server-side when accuracy matters more than latency.
  • LLM: OpenAI Realtime for turn-taking-native, Groq + Llama 3.3 70B for cost, Claude via Bedrock / Vertex for quality.
  • TTS: Cartesia Sonic for sub-100 ms, ElevenLabs for voice cloning, Azure Neural TTS for broad language coverage.
  • Emotion / empathy: Hume AI for voice, MediaPipe FaceLandmarker for facial expression.
  • Moderation: Hive for live, Sightengine for image/video.
  • Orchestration: Pipecat, LiveKit Agents, or our own middleware; whichever fits the integration surface.

Team composition

  • Senior iOS (Core ML, Metal, AVFoundation). Non-negotiable.
  • Senior Android (NDK, TFLite, Vulkan, MediaCodec). Non-negotiable.
  • ML engineer focused on mobile model optimization — quantization, pruning, delegate tuning.
  • Backend engineer for token service, session management, Media Pull/Push pipelines, agent orchestration.
  • DevOps / SRE for GPU inference infra and observability.
  • QA with device-farm chops — BrowserStack / Sauce Labs / AWS Device Farm for low-end Android coverage.

Mini case: what V.A.L.T. taught us about custom RTC pipelines

Situation. V.A.L.T. — our video management platform — is not Agora-based, but the custom-pipeline discipline transfers directly. 700+ orgs, 25K daily users, 2,500+ cameras, custom frame pipelines, evidentiary storage.

Lessons that apply to Agora-based AI apps. (1) Frame observers must not block the SDK’s own render loop — everything runs on its own Metal/Vulkan queue. (2) Reconnect logic is 20% of the code and 80% of the bugs; test it on flaky LTE, not on Wi-Fi. (3) Feature flags around AI features save you from app-store rollbacks when a new model underperforms.
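
Lesson (1) usually comes down to a hand-off structure between the SDK callback and the inference thread. Below is a minimal, language-neutral sketch in Python of a single-slot mailbox that keeps only the newest frame. On a real build the same idea is written in Swift or Kotlin; LatestFrameSlot and inference_worker are hypothetical names, not Agora APIs.

```python
import threading


class LatestFrameSlot:
    """Single-slot mailbox: the producer overwrites, the consumer takes the newest frame."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None
        self._closed = False
        self.dropped = 0  # stale frames replaced before inference picked them up

    def offer(self, frame):
        """Called from the SDK callback thread; it never waits on inference."""
        with self._cond:
            if self._frame is not None:
                self.dropped += 1  # replace the stale frame instead of queueing
            self._frame = frame
            self._cond.notify()

    def take(self):
        """Called from the inference thread; blocks until a frame arrives or close()."""
        with self._cond:
            while self._frame is None and not self._closed:
                self._cond.wait()
            frame, self._frame = self._frame, None
            return frame  # None means the slot was closed and is empty

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()


def inference_worker(slot, infer, results):
    """Run on a dedicated thread: pull the newest frame, run the model, store the result."""
    while (frame := slot.take()) is not None:
        results.append(infer(frame))
```

Dropping stale frames is a deliberate choice here: for live video, a result for the newest frame is worth more than a backlog of results for old ones.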

Outcome. We’ve re-used the V.A.L.T. pipeline mental model on every Agora AI build since — and it’s the reason those builds don’t hit week-12 surprises. If you want our architect to walk your team through the same discipline applied to your Agora pipeline, grab a 30-minute slot and bring your topology.

Field rule. If your AI feature needs to run on a $150 Android device and a flagship iPhone, design for the $150 phone first — then let the flagship simply run faster. Every team we’ve rescued from a thermal-throttling spiral did it in the opposite order.

Shipping an AI feature on Agora this quarter?

We’ll look at the specific pipeline you’re planning and tell you where it’ll break on low-end Android before you write a line of code.

Book a 30-min review →

A 16-week plan for an AI-powered Agora mobile app

  • Weeks 1–2. Agora App ID + token service, signaling, baseline 1–1 call on iOS and Android.
  • Weeks 3–4. Custom VideoFrameObserver + AudioFrameObserver plumbing. Core ML / TFLite inference harness.
  • Weeks 5–6. First AI feature (usually captions or background blur). Full end-to-end on both platforms.
  • Weeks 7–8. Second AI feature (emotion detection, live translation, or AR filters). Thermal and battery profiling.
  • Weeks 9–10. Conversational agent via Agora Conversational AI Engine or OpenAI Realtime. Media Pull/Push pipeline.
  • Weeks 11–12. Cloud Recording + moderation pipeline. Feature flags, analytics.
  • Weeks 13–14. Low-end Android QA. Weak-network testing. Consent and privacy flows.
  • Weeks 15–16. Staged rollout, on-call runbooks, SRE alarms on latency and inference cost.

KPIs that tell you the AI pipeline is healthy

  • End-to-end audio latency. Agent-to-user < 700 ms P50, < 1,200 ms P95.
  • AI inference latency. STT < 300 ms partial. LLM first token < 350 ms. TTS first chunk < 150 ms.
  • Frame drop rate on custom video processing. < 2% P95.
  • Audio jitter. < 30 ms.
  • Battery drain. < 10% per 30-minute call on mid-range iPhone / Pixel.
  • Transcription WER. < 10% English, < 18% major non-English languages.
  • Moderation false-positive rate. < 5% with a human review queue.
  • Call join success rate. > 99.3%.
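
A small sketch of how the latency KPI can be checked from raw samples, assuming you already collect per-turn agent-to-user latencies in milliseconds. The nearest-rank percentile method and the function names are our choices for illustration; your metrics stack may compute percentiles differently.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


# Targets from the KPI list above: P50 under 700 ms, P95 under 1,200 ms.
LATENCY_TARGETS_MS = {50: 700, 95: 1200}


def latency_report(samples_ms, targets=LATENCY_TARGETS_MS):
    """Return value, limit and pass/fail for every percentile target."""
    report = {}
    for pct, limit in targets.items():
        value = percentile(samples_ms, pct)
        report[f"p{pct}"] = {"value_ms": value, "limit_ms": limit, "ok": value < limit}
    return report


samples = [500] * 18 + [1300, 1500]  # a healthy median with a slow tail
print(latency_report(samples))
```

In this sample the median passes and the P95 fails, which is the typical shape of a pipeline that demos well and struggles on weak networks.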

Seven pitfalls we clean up on new Agora AI projects

  1. Token security. Embedding the App Certificate or long-lived tokens client-side. Always mint tokens server-side with the right RTC role and a short expiry.
  2. Frame observer threading. Doing inference on Agora’s own thread will tank frame rate. Always hand off to a dedicated GPU/CPU queue.
  3. Battery drain from on-device ML. Quantize aggressively, cap FPS to 24 for face-ML tasks, pause model execution when the call is backgrounded.
  4. Background / foreground transitions. iOS suspends an app, and its audio session with it, shortly after it is backgrounded unless the audio background mode is enabled. Implement proper background modes and re-init paths.
  5. Regional routing. Pick the right Agora region per user; default routing surprises users in EU and APAC.
  6. Underestimating QoS. Test on weak LTE and on 10%-packet-loss Wi-Fi. Agora handles it well; custom pipelines often don’t.
  7. Skipping low-end Android QA. A Pixel 3a is the cheapest insurance policy you can buy.
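
For pitfall 1, the shape of a server-side token service looks roughly like the sketch below. It is a generic HMAC scheme written for illustration and is not Agora's token format: in production you would call Agora's official token builder on your server. The function names and claim fields are assumptions. The point is the pattern: the secret stays on the server, and every token carries a role and a short expiry.

```python
import base64
import hashlib
import hmac
import json
import time


def mint_token(secret, channel, uid, role, ttl_s, now=None):
    """Sign channel, uid, role and expiry with a server-only secret (illustrative format)."""
    now = int(time.time()) if now is None else now
    claims = {"channel": channel, "uid": uid, "role": role, "exp": now + ttl_s}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(secret, token, now=None):
    """Return the claims if the signature is valid and the token has not expired, else None."""
    now = int(time.time()) if now is None else now
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims if claims["exp"] > now else None


token = mint_token(b"server-only-secret", "room-42", 7, "publisher", ttl_s=60)
print(verify_token(b"server-only-secret", token)["role"])
```

The client only ever sees the token string, so a leaked token is useful for at most ttl_s seconds and only for one channel and role.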

Single biggest cost to watch. Per-minute LLM inference on a chatty agent can easily exceed Agora’s minute pricing. Model the cost per minute of conversation before you ship, and always have a “cheap model” fallback for free-tier users.
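
A quick way to model this is to normalize every line item to dollars per minute. The sketch below uses the top of each price range quoted in this guide; the function names and the two model labels are hypothetical placeholders.

```python
def per_minute_costs(agora_per_1k_min, stt_per_min, llm_per_hour):
    """Normalize every line item to USD per minute of conversation."""
    return {
        "agora": agora_per_1k_min / 1000,
        "stt": stt_per_min,
        "llm": llm_per_hour / 60,
    }


def dominant_cost(costs):
    """Name of the most expensive line item per minute."""
    return max(costs, key=costs.get)


# Hypothetical model labels, used only to show the fallback rule.
MODEL_BY_TIER = {"free": "cheap-small-model", "paid": "premium-large-model"}


def model_for(tier):
    """Unknown tiers fall back to the cheap model, never to the expensive one."""
    return MODEL_BY_TIER.get(tier, MODEL_BY_TIER["free"])


high = per_minute_costs(agora_per_1k_min=3.99, stt_per_min=0.01, llm_per_hour=2.00)
print(dominant_cost(high), round(high["llm"] / high["agora"], 1))
```

At the top of each range the LLM costs about $0.033 per minute against roughly $0.004 for Agora, around eight times more. Swap in your negotiated rates and real token usage before trusting the ratio.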

How Agent Engineering changes the Agora custom build

Our Agent Engineering model puts senior engineers in charge of architecture and the video/audio frame path; LLM agents own scaffolding: token services, React Native / Flutter wrappers, channel management, UI screens, analytics plumbing, test scaffolds.

On Agora AI builds specifically, we see a 30–40% reduction in time and cost on scaffolding and glue, and roughly zero on the hard parts: Metal shaders, custom NNAPI delegates, debugging a 50 ms jitter that only appears on Snapdragon 7 Gen 1. That split is the point: agents accelerate what doesn’t need senior judgment, and stay out of what does.

When NOT to use Agora for your AI mobile app

  • Your entire product is an AI agent that joins calls. LiveKit Agents or Daily Bots + Pipecat is a cleaner orchestration story.
  • You must self-host for regulatory reasons (HIPAA-tight, government, defence). LiveKit self-hosted or Jitsi Videobridge + a custom stack wins.
  • You need tight OpenAI Realtime orchestration from day one — Daily’s SDK and docs are designed around that.
  • Your market is US-only and you already live in AWS — Chime SDK may be cheaper and simpler.

Trends to plan for

  • Conversational AI agents become a default feature for any video product with a customer-support or coaching angle.
  • Small on-device LLMs in the 1–4B-parameter range (Gemma 3 1B, Llama 3.2 1B, Phi-4-mini) ship in flagship apps. Battery becomes the dominant constraint.
  • Empathetic voice AI (Hume, Cartesia) moves from novelty to product expectation in health and education.
  • Real-time AI dubbing closes the gap with “natural simultaneous translation.”
  • Edge inference via Cloudflare Workers AI, AWS Local Zones and Groq regional endpoints cuts round-trip latency 100–200 ms.
  • Spatial audio in group calls lands on iOS and higher-end Android. Expect customers to ask.

FAQ

Can I add live transcription to an Agora call without custom dev?

Partially. Agora Real-Time Transcription (RTT) covers English and a few other languages out of the box. If you need custom vocabulary, speaker diarization, or non-supported languages, you’ll register an AudioFrameObserver and route to Deepgram or AssemblyAI.
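
One detail that comes up as soon as you route observer audio to a streaming STT service: the observer hands you short PCM frames, while the STT socket is usually happier with larger chunks. A minimal re-chunking sketch follows, assuming 16 kHz mono 16-bit PCM delivered in 10 ms frames and a 100 ms chunk size. These values are assumptions; check what your SDK configuration and STT provider actually use.

```python
class PcmChunker:
    """Accumulate small PCM frames and emit fixed-size chunks for a streaming STT client."""

    def __init__(self, sample_rate_hz=16_000, channels=1, bytes_per_sample=2, chunk_ms=100):
        self.chunk_bytes = sample_rate_hz * channels * bytes_per_sample * chunk_ms // 1000
        self._buf = bytearray()

    def push(self, pcm):
        """Add one frame of raw PCM; return every complete chunk that is now available."""
        self._buf.extend(pcm)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[:self.chunk_bytes]))
            del self._buf[:self.chunk_bytes]
        return chunks

    def flush(self):
        """Return whatever is left (possibly empty) when the call ends."""
        tail = bytes(self._buf)
        self._buf.clear()
        return tail


chunker = PcmChunker()
frame_10ms = bytes(320)  # 10 ms of silence at 16 kHz, mono, 16-bit
sent = []
for _ in range(25):
    sent.extend(chunker.push(frame_10ms))
print(len(sent), len(chunker.flush()))  # 2 1600
```

Larger chunks reduce per-message overhead but add buffering delay, so the chunk size is one more entry in the latency budget.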

How do I add my own AI agent to an Agora call?

Two patterns. (1) Agora Conversational AI Engine — set up the agent definition, route LLM output via Agora. (2) Media Push to your own backend — run Pipecat or LiveKit Agents, generate audio with Cartesia / ElevenLabs, push back into the channel via Media Pull. Pick (1) for speed, (2) for flexibility.
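
Whichever pattern you pick, barge-in is the first behaviour users notice. The sketch below shows the core idea with asyncio: the agent's speech runs as a task that is cancelled as soon as the user starts talking. speak is a stand-in for streaming TTS audio into the channel, and the event is a stand-in for your VAD signal; both are hypothetical.

```python
import asyncio


async def speak(chunks, played, chunk_delay_s=0.01):
    """Stand-in for streaming TTS audio into the call, one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(chunk_delay_s)  # pretend to send one audio chunk
        played.append(chunk)


async def run_turn(chunks, user_spoke):
    """Play the agent's reply, but stop as soon as the user starts talking."""
    played = []
    speaking = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(user_spoke.wait())
    done, _ = await asyncio.wait({speaking, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = barge_in in done and not speaking.done()
    for task in (speaking, barge_in):
        if not task.done():
            task.cancel()  # stop the TTS stream or the idle listener
    await asyncio.gather(speaking, barge_in, return_exceptions=True)
    return played, interrupted


async def demo():
    user_spoke = asyncio.Event()  # set by your voice-activity detector
    return await run_turn(["Hello,", "how can", "I help?"], user_spoke)


print(asyncio.run(demo()))
```

In a real pipeline the cancellation also has to flush audio already queued toward the channel, otherwise the agent keeps talking for a moment after the user interrupts.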

How do I do custom background blur on Agora?

Register a VideoFrameObserver, grab the frame on each tick, run MediaPipe Selfie Segmentation (Core ML or TFLite), composite with Metal/Vulkan and push the processed frame back. Cap processing FPS at 24 to protect battery on mid-tier devices.
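
The FPS cap itself is a few lines of logic. Here is a language-neutral sketch in Python, assuming each frame arrives with a millisecond timestamp. FpsCap is a hypothetical helper, and on device the same check would sit at the top of the frame callback in Swift or Kotlin.

```python
class FpsCap:
    """Decide per frame whether to run the model, so processing stays at or below max_fps."""

    def __init__(self, max_fps=24):
        self.min_interval_ms = 1000.0 / max_fps
        self._next_due_ms = None

    def should_process(self, timestamp_ms):
        if self._next_due_ms is None:
            self._next_due_ms = timestamp_ms  # first frame is always processed
        if timestamp_ms < self._next_due_ms:
            return False  # too early: pass the frame through unprocessed
        if timestamp_ms - self._next_due_ms > self.min_interval_ms:
            # Fell far behind (stall or backgrounding): restart the schedule here
            # instead of bursting to catch up.
            self._next_due_ms = timestamp_ms
        self._next_due_ms += self.min_interval_ms
        return True


cap = FpsCap(max_fps=10)
processed = [t for t in range(0, 1000, 50) if cap.should_process(t)]
print(processed)  # [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
```

Scheduling from the due time instead of from the last processed frame keeps the average rate close to the cap; with a 30 fps camera and the default cap of 24, roughly four of every five frames get processed.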

Is Agora still the right choice for consumer AI apps?

Yes, especially if your audience includes APAC / China or if low-bandwidth reliability is a feature. LiveKit and Daily.co are strong alternatives when the product is AI-orchestration-first and open-source-leaning.

What does a real custom-Agora project cost in 2026?

Light AI features on top of Agora: $40–80K. Moderate (translation, emotion detection, AR filters, an in-call conversational agent): $90–150K. Heavy (multi-party translated calls, voice cloning, empathetic voice, custom LLM fine-tunes): $160–300K+. Plus Agora minutes and inference cost per minute of conversation.

Can I do all this in React Native or Flutter?

You can get 80% there. The last 20% — frame observer, ML delegate tuning, background modes — always needs native Swift / Kotlin and a senior iOS / Android engineer. Budget for that.

What latency target should I aim for?

Under 700 ms round-trip for a conversational agent to feel natural; under 400 ms for it to feel excellent. Compose the stack so each link (STT partials, LLM first token, TTS first chunk, media path) stays at or below roughly 350 ms.

Ship a real AI feature on Agora — not a demo

Agora plus the 2026 AI stack lets you ship conversational agents, live translation, emotion detection and custom visual effects — but only if the custom pipeline is built with frame-observer discipline, battery budget and weak-network hardening from day one.

If you want a shortcut to the specific architecture your product needs, that’s exactly the conversation we have most weeks.

Get a 30-minute pipeline review for your Agora AI build.

Frame observer path, AI orchestration, latency budget, cost per minute — all on one call.

Book your 30-min call →
