Published 2026-05-23 · 28 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

If you build a video product in 2026, your AI bill is not a line item; it is the line item. Inference is already roughly two-thirds of total AI compute and is on a path to dominate data-center demand by 2030 (McKinsey, Deloitte). The same Gartner forecast that says inference will cost 90% less by 2030 is the reason your CFO is asking why your bill went up 320% in two years even though per-token prices fell 280×: scale outruns price decline. Get the cost model right and you ship features that pay for themselves; get it wrong and your runway evaporates between the seed round and Series A. This lesson is for the product manager who needs a defensible spreadsheet before the next board meeting, the founder who has to choose between Sora and Runway in week one, and the engineer who has to explain why the OpenAI Realtime bill is bigger than the AWS bill. It assumes you have already read lesson 1.2 on latency and deployment topology and lesson 1.3 on model artifact formats and the open-vs-closed procurement decision, because the cost question only makes sense in the context of where the model runs and what file it ships as.

How Video Turns Into Dollars — The One Mental Model That Carries The Article

Every AI cost in a video product traces back to one of five units: tokens (text and multimodal), audio minutes (specialised audio-token rates), generated seconds (for diffusion-based video models), GPU hours (self-host), and bytes (vector store, egress, cold storage). The thing that surprises engineers coming from text-only chat products is how quickly video converts time into all five of those units at once.

A one-hour video, fed to Gemini at default media resolution, is roughly 1,080,000 tokens of input. (Google tokenises video at about 300 tokens per second of footage at default resolution; one hour is 3,600 seconds; 300 × 3,600 = 1,080,000.) That single fact reframes every conversation about API cost: a chat product where a typical user sends a few thousand tokens per session is in a different universe from a video product where a single user upload is a million tokens before anyone reads a word of the answer. Multiply that by the audio track (an hour of audio at the OpenAI Realtime API tokenisation rate of 100 ms per token is 36,000 audio input tokens, which at $32 per million is $1.15 just for the audio), and you can see how video products generate their AI bills before they generate their first answer.
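The conversion above is simple enough to keep as a helper. The following is a minimal sketch that uses only the tokenisation rates and the $32 per million price quoted in the text; the function and constant names are illustrative, not part of any vendor SDK.

```python
# Tokenisation rates quoted in the text; treat them as inputs, not constants.
VIDEO_TOKENS_PER_SECOND = 300    # Gemini, default media resolution
AUDIO_MS_PER_INPUT_TOKEN = 100   # OpenAI Realtime, user audio


def video_input_tokens(duration_s: float, tokens_per_s: int = VIDEO_TOKENS_PER_SECOND) -> int:
    """Input tokens produced by a video of the given length."""
    return round(duration_s * tokens_per_s)


def audio_input_tokens(duration_s: float, ms_per_token: int = AUDIO_MS_PER_INPUT_TOKEN) -> int:
    """Audio input tokens produced by an audio track of the given length."""
    return round(duration_s * 1000 / ms_per_token)


def token_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Price of a token count at a per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million


one_hour_s = 3600
print(video_input_tokens(one_hour_s))                                   # 1080000
print(audio_input_tokens(one_hour_s))                                   # 36000
print(round(token_cost_usd(audio_input_tokens(one_hour_s), 32.0), 2))   # 1.15
```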

The corollary is the simplest, most overlooked cost lever in the field: most of those tokens never had to be sent. Caching, batching, downsampling, sparse frame sampling, and switching from a frontier model to a Flash-tier model on the easy 90% of queries are the difference between an AI feature that breaks even and one that bankrupts the product line. Every section of this article comes back to that lever.

Figure 1. The five cost units that every AI feature in a video product eventually bills against (tokens, audio minutes, generated seconds, GPU hours, and bytes), with representative 2026 unit prices.

The Closed-API Layer — Gemini, OpenAI, Anthropic In 2026 Numbers

The closed-API layer is the easiest cost line to model because the vendors publish a price per million tokens. The hard part is matching the model tier to the workload and remembering that flagship reasoning models are not 2× the cost of Flash tiers — they are 10× to 30× the cost. Mis-route a stream of routine queries to the wrong tier and the bill is the consequence.

Gemini API Pricing — The 2026 Lineup

Google's Gemini 2.5 family is the workhorse for most video workloads in 2026 because its tokeniser handles native video input. The standard developer tier sits at $1.25 per million input tokens and $10.00 per million output tokens for Gemini 2.5 Pro on prompts up to 200,000 tokens; above 200,000 tokens the price rises to $2.50 and $15.00 respectively. Gemini 2.5 Flash, the model most video products will actually default to, costs $0.30 input and $2.50 output per million tokens. Gemini 2.5 Flash-Lite, the smallest production-grade tier, drops to $0.10 input and $0.40 output per million.

Three discount mechanisms stack on top of those base prices and are routinely missed. Batch processing halves the price on jobs you can submit asynchronously and pick up later, which describes almost every offline video-understanding workload. Explicit context caching cuts input cost by 90% on the cached portion of long prompts — you pay $0.125 per million tokens for cache reads against the Pro tier, plus a storage fee of $4.50 per million tokens per hour. And the Live API for real-time audio and video, run on top of the Gemini 2.5 family, prices audio input at roughly $3.00 per million tokens (about $0.005 per minute) and audio output at $12.00 per million tokens (about $0.018 per minute).

The worked example a finance team needs to see. A one-hour user-uploaded video at default media resolution is 1,080,000 input tokens. Feeding it to Gemini 2.5 Flash for a summary that returns 2,000 output tokens costs 1.08 × $0.30 + 0.002 × $2.50 ≈ $0.33. The same task on Gemini 2.5 Pro is billed at the above-200,000-token rate, because the prompt is well over that threshold: 1.08 × $2.50 + 0.002 × $15.00 ≈ $2.73. Switch to low-resolution media (100 tokens per second of video instead of 300) and Flash falls to about $0.11 per hour processed. Switch to batch processing and halve it again. That is the engineering decision that turns a Gemini bill from a hard problem into a rounding error: low-resolution video plus Flash plus batch, applied to the 90% of workloads that do not need frontier reasoning.
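The Flash arithmetic in this example can be sketched as a small function. It assumes the Gemini 2.5 Flash prices and tokens-per-second figures quoted above and treats batch as a flat 50% discount; the function name and parameters are hypothetical.

```python
def gemini_video_cost_usd(
    duration_s: float,
    output_tokens: int,
    input_usd_per_m: float,
    output_usd_per_m: float,
    tokens_per_s: int = 300,   # 300 = default media resolution, 100 = low resolution
    batch: bool = False,       # assumption: batch halves the whole bill
) -> float:
    """Cost of one video-understanding call, in dollars."""
    input_tokens = duration_s * tokens_per_s
    cost = (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000
    return cost / 2 if batch else cost


FLASH_IN, FLASH_OUT = 0.30, 2.50   # Gemini 2.5 Flash, $ per million tokens

default_res = gemini_video_cost_usd(3600, 2_000, FLASH_IN, FLASH_OUT)
low_res = gemini_video_cost_usd(3600, 2_000, FLASH_IN, FLASH_OUT, tokens_per_s=100)
low_res_batch = gemini_video_cost_usd(3600, 2_000, FLASH_IN, FLASH_OUT, tokens_per_s=100, batch=True)

print(f"default: ${default_res:.3f}  low-res: ${low_res:.3f}  low-res + batch: ${low_res_batch:.4f}")
```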

OpenAI API Pricing 2026 — The GPT-5 Family And The Realtime Layer

OpenAI re-shuffled its lineup hard in late 2025 and through 2026. The current production set is GPT-5 (the workhorse) at $0.625 input and $5.00 output per million tokens; GPT-5 Pro at $15.00 input and $120.00 output per million; GPT-5.5, launched in April 2026 as the new flagship reasoning model, at $5.00 input and $30.00 output; and GPT-5.5 Pro at $30.00 input and $180.00 output. The smaller tiers — GPT-5.4 Mini at $0.75 / $4.50 and GPT-5.4 Nano at $0.20 / $1.25 — cover most of the volume in production today. GPT-4o and GPT-4o-mini remain price-listed at $2.50 / $10.00 and $0.15 / $0.60 respectively while OpenAI winds down the line; new builds should target the 5.x family.

Batch processing and "flex" execution halve the GPT-5.5 cost from $5/$30 to $2.50/$15. Data residency in regulated regions adds a 10% surcharge on the 5.x family.

The line on OpenAI's price list that surprises every team is the Realtime API. Real-time voice and video — the layer that powers conversational AI agents inside calls — is priced not in text tokens but in audio tokens. The current generation, gpt-realtime-2, sits at $32 per million audio input tokens and $64 per million audio output tokens; cached input falls to $0.40 per million. A token represents about 100 ms of user audio or 50 ms of assistant audio. In practical per-minute terms, an uncached real-time agent costs about $0.18 to $0.46 per minute, and a well-cached agent (most prompt content held in cache) drops to about $0.05 to $0.10 per minute. The translate and transcribe variants — gpt-realtime-translate at $0.034 per minute and gpt-realtime-whisper at $0.017 per minute — are specialised, cheaper paths for the workloads that do not need a full conversational layer.

The pitfall: teams build a prototype on the standard Realtime API at $0.30 per minute, scale to ten thousand concurrent users, and discover their voice AI bill is six figures a month. The fix is almost always prompt caching, which holds the static system prompt in cache and bills it at the read rate, plus moving the easy half of voice queries off the Realtime API and onto a cheaper STT-LLM-TTS pipeline.

Claude API Pricing — Opus, Sonnet, Haiku

Anthropic's lineup is the cleanest of the three. As of May 2026 the production tiers are Claude Opus 4.7 at $5.00 input and $25.00 output per million tokens, Claude Sonnet 4.6 at $3.00 / $15.00, and Claude Haiku 4.5 at $1.00 / $5.00. Older trackers still cite legacy Claude 4 Opus at $15 / $75 — that is a previous generation; check the version letter, not just the model family name.

Prompt caching on Claude follows a different shape from OpenAI's: 5-minute cache writes cost 1.25× the base input rate, 1-hour cache writes cost 2× the base input rate, and cache reads cost 0.1× the base input rate (a 90% discount). Batch processing halves both input and output cost, and stacks with caching. The combination matters because Anthropic's models tend to be picked for high-reasoning workloads where outputs are long, which are exactly the workloads where batch plus cache savings are largest.
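To see how quickly a cache write pays for itself, here is a minimal sketch using only the multipliers quoted above. It assumes a simplified model in which the first request pays the write price for the cached prefix, every later request pays the read price, and all requests arrive before the cache entry expires; the function names are illustrative.

```python
def cached_prefix_cost(n_requests: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Cost of the cached prefix over n requests, in units of the base input price.

    Assumption: one cache write on the first request, then (n - 1) cache reads.
    """
    return write_multiplier + (n_requests - 1) * read_multiplier


def break_even_requests(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of requests at which caching is cheaper than not caching."""
    n = 1
    # Without caching, n requests cost n units of the base input price.
    while cached_prefix_cost(n, write_multiplier, read_multiplier) >= n:
        n += 1
    return n


print(break_even_requests(1.25))  # 5-minute cache write: 2
print(break_even_requests(2.0))   # 1-hour cache write: 3
```

Under these assumptions the 5-minute cache is already cheaper on the second request and the 1-hour cache on the third.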

The One Table That Lives On Your Wall

The table below is the cheat sheet for the closed-API layer in May 2026. Treat it as a snapshot — refresh quarterly, because every line on it has moved more than once in the past year.

| Model | Input $/1M | Output $/1M | Batch $/1M (input / output) | Cache read $/1M | Use case |
|---|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | 0.10 | 0.40 | 0.05 / 0.20 | 0.01 | High-volume classification, routing |
| Gemini 2.5 Flash | 0.30 | 2.50 | 0.15 / 1.25 | 0.03 | Default for most video workloads |
| Gemini 2.5 Pro | 1.25 | 10.00 | 0.625 / 5.00 | 0.125 | Long-context video reasoning |
| GPT-5.4 Nano | 0.20 | 1.25 | n/a | varies | Low-cost routing, draft generation |
| GPT-5.4 Mini | 0.75 | 4.50 | n/a | varies | Mid-tier chat |
| GPT-5 | 0.625 | 5.00 | n/a | varies | Workhorse reasoning |
| GPT-5.5 | 5.00 | 30.00 | 2.50 / 15.00 | varies | Flagship reasoning |
| GPT-5.5 Pro | 30.00 | 180.00 | varies | varies | Edge cases only |
| Claude Haiku 4.5 | 1.00 | 5.00 | 0.50 / 2.50 | 0.10 | Fast structured output |
| Claude Sonnet 4.6 | 3.00 | 15.00 | 1.50 / 7.50 | 0.30 | Long-form reasoning, agents |
| Claude Opus 4.7 | 5.00 | 25.00 | 2.50 / 12.50 | 0.50 | Hardest reasoning tasks |
| OpenAI Realtime gpt-realtime-2 (audio) | 32 | 64 | n/a | 0.40 | Real-time voice in WebRTC |
| Gemini Live (audio) | 3.00 | 12.00 | n/a | varies | Real-time voice / video, Google stack |

The Gemini Live and Realtime API rows are the only ones that bill against audio tokens rather than text tokens; do not compare them line-for-line with the text rows above without converting through the per-minute equivalence.

The Audio AI Layer — ASR, TTS, Noise Suppression

A video product without an audio AI layer is a partially built product. The good news is that the audio layer in 2026 is the cheapest of the AI lines on your bill: dedicated ASR providers price transcription in tenths of a cent per minute, and the only way to make audio AI expensive is to route it through the wrong API.

The dominant pricing patterns. Deepgram Nova-3 is the cheapest credible ASR at $0.0043 per minute for batch transcription in English and $0.0077 per minute for streaming: a one-hour video transcribed in batch costs about $0.26, and the same hour streamed live costs about $0.46. OpenAI's gpt-4o-mini-transcribe sits at about $0.003 per minute (roughly $0.18 per hour) and gpt-4o-transcribe at $0.006 per minute ($0.36 per hour). AssemblyAI's Universal Streaming, billed against session-open time at $0.15 per hour, is the cheapest streaming option per hour if your sessions are dense. Azure Speech standard real-time is $1.00 per hour and Google Cloud Speech-to-Text V2 is $0.96 per hour at the base tier, both roughly three to six times more expensive than Deepgram or OpenAI for the same task.

TTS is where the unit changes from per-minute to per-character, which is the source of every TTS budget overrun. ElevenLabs Flash and Turbo charge 0.5 credits per character against a credit pack whose pricing bottoms out at the $99-per-month tier; effective cost is roughly $35 to $50 per million characters depending on plan. Cartesia Sonic 3 lands in the same range. The open-source Kokoro TTS, when self-hosted on a single A100, can deliver about 3.6 billion characters per month for roughly $750 in compute. That is about $0.20 per million characters, more than 100× cheaper than the closed-API services, at the cost of operating the GPU yourself.

A worked example. A telemedicine product that produces a one-hour spoken summary at 150 words per minute, with about 5 characters per word plus spaces, generates roughly 60,000 characters of TTS per visit. At ElevenLabs Flash this is about $3 per visit; on Kokoro self-hosted it is about $0.012 per visit. Multiply by a thousand visits a day and the gap between the two options is about $2,990 per day, or roughly $1.1 million a year. The trade-off is voice quality (ElevenLabs is currently the best in class for naturalness) and the cost of running your own GPU fleet. For high-volume products with internal voices the math is obvious; for low-volume products with strict brand voice requirements ElevenLabs wins.
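The per-visit comparison can be reproduced with a few lines. This sketch assumes the top of the $35 to $50 ElevenLabs range and derives the Kokoro rate from the $750 per 3.6 billion characters figure quoted earlier; the names are illustrative.

```python
def tts_cost_usd(characters: int, usd_per_million_chars: float) -> float:
    """Price of synthesising a number of characters at a per-million-character rate."""
    return characters / 1_000_000 * usd_per_million_chars


CHARS_PER_VISIT = 60_000
ELEVENLABS_USD_PER_M = 50.0        # assumption: top of the $35 to $50 range
KOKORO_USD_PER_M = 750 / 3_600     # $750 of compute per 3,600 million characters

elevenlabs_per_visit = tts_cost_usd(CHARS_PER_VISIT, ELEVENLABS_USD_PER_M)
kokoro_per_visit = tts_cost_usd(CHARS_PER_VISIT, KOKORO_USD_PER_M)
daily_gap = (elevenlabs_per_visit - kokoro_per_visit) * 1_000   # a thousand visits a day

print(f"ElevenLabs: ${elevenlabs_per_visit:.2f}/visit, Kokoro: ${kokoro_per_visit:.4f}/visit")
print(f"Gap at 1,000 visits/day: ${daily_gap:,.0f}/day, ${daily_gap * 365:,.0f}/year")
```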

Noise suppression is the surprise category. Krisp introduced metered SDK pricing in May 2026 and the per-minute rate for BVC v3 is now gated under a private SDK contract. The free alternatives — RNNoise and DeepFilterNet for CPU-side cancellation, NVIDIA's Maxine Audio Effects SDK for GPU-side — close most of the quality gap but require engineering effort. We cover the build-vs-buy decision in detail in the noise-suppression deep-dive in Phase 3.

The cheat sheet for an audio AI line item, costed against one hour of video, looks like this:

| Service | Per minute | Per hour |
|---|---|---|
| Deepgram Nova-3 batch | $0.0043 | $0.26 |
| Deepgram Nova-3 streaming | $0.0077 | $0.46 |
| OpenAI gpt-4o-mini-transcribe | $0.003 | $0.18 |
| OpenAI gpt-4o-transcribe | $0.006 | $0.36 |
| AssemblyAI Universal Streaming | n/a | $0.15 (session-based) |
| Azure Speech standard real-time | $0.0167 | $1.00 |
| Google STT V2 base | $0.016 | $0.96 |
| Whisper self-hosted (GPU compute only) | n/a | ~$0.03 (plus ops overhead) |

The "plus ops overhead" caveat on self-hosted Whisper is the line that gets ignored. A working Whisper deployment at production reliability with monitoring, autoscaling, retries, and on-call engineering ends up at roughly $1 per hour effective, against the $0.03 per hour pure compute number. The compute is cheap; the wrapping is not. The decision to self-host ASR rarely makes financial sense below ten thousand hours of audio processed per month.

The Generative Video Layer — Sora, Runway, Veo, Kling, Pika, Luma, Hailuo

Generative video is the line item that turns AI budgets into AI nightmares because the unit is generated seconds and the price per second is several orders of magnitude above any text or audio model. Sora 2 Standard at $0.10 per second is the floor for serious frontier-quality video; Sora 2 Pro at $0.50 per second for 1024p output is the top of the Sora range, and Veo 3 with native audio goes higher still at $0.75 per second. Translate the Sora prices into "one hour of generated video" and you get $360 for Standard and $1,800 for Pro per hour, the kind of number that ends a product roadmap conversation if nobody has done the math up front.

The 2026 lineup, normalised to per-second pricing, sorts roughly as follows. The cheap tier covers Hailuo 02 Standard at 768p ($0.045 per second), Runway Gen-4 Turbo ($0.05 per second), and Veo 3.1 Lite ($0.05 per second), all useful for fast iteration, draft generation, and most consumer-facing b-roll workloads. The middle tier covers Luma Dream Machine Ray 3 ($0.08 per second), Hailuo 02 Pro 1080p ($0.08 per second), Kling 2.0/3.0 ($0.075 to $0.112 per second depending on tier and resolution), and Runway Gen-4 standard ($0.10 to $0.15 per second). The frontier tier is Sora 2 Standard ($0.10 per second), Sora 2 Pro at 720p ($0.30 per second), Sora 2 Pro at 1024p ($0.50 per second), and Veo 3 with native audio ($0.75 per second).

The shape of the cost curve is more interesting than the absolute numbers. Quality, audio, resolution, and frame rate each multiply the bill in roughly linear fashion. Adding audio to Veo 3 takes it from $0.50 to $0.75 per second — a 50% premium. Going from 720p to 1024p on Sora 2 Pro takes it from $0.30 to $0.50 per second — a 67% premium. Doubling the duration from 5 seconds to 10 seconds is exactly 2× the price. The decision to add an audio track, raise the resolution, or extend the clip should be made knowing each one is a price multiplier, not a free toggle.

The pitfall is the prototype-to-production gap. A founder generates ten Sora 2 Pro 1024p clips during a Friday afternoon demo, the bill comes to $25, and the gut conclusion is "Sora is cheap". Three months later the product is generating ten thousand clips per day at the same tier and the bill is $25,000 per day, about $9 million per year, and the founder is calling OpenAI to ask for a volume discount. The path out of that trap is exactly the same as on the LLM side: route by use case. Draft generation goes to Runway Gen-4 Turbo or Hailuo 02 Standard at about $0.05 per second; final hero content goes to Sora 2 Pro. The product never needs frontier quality on every clip.
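A minimal sketch of that routing effect follows. The 5-second clip length is inferred from the $25 bill for ten clips; the 10% share of "hero" clips is purely an illustrative assumption, as are the function names.

```python
def daily_video_cost_usd(clips_per_day: float, seconds_per_clip: float, usd_per_second: float) -> float:
    """Daily bill for generating clips at a single per-second rate."""
    return clips_per_day * seconds_per_clip * usd_per_second


def routed_daily_cost_usd(
    clips_per_day: int,
    seconds_per_clip: float,
    hero_share: float,    # fraction of clips that need frontier quality (assumption)
    hero_rate: float,     # $ per second on the frontier tier
    draft_rate: float,    # $ per second on the cheap tier
) -> float:
    """Daily bill when only a share of clips goes to the expensive tier."""
    hero_clips = clips_per_day * hero_share
    draft_clips = clips_per_day - hero_clips
    return (daily_video_cost_usd(hero_clips, seconds_per_clip, hero_rate)
            + daily_video_cost_usd(draft_clips, seconds_per_clip, draft_rate))


naive = daily_video_cost_usd(10_000, 5, 0.50)               # everything on Sora 2 Pro 1024p
routed = routed_daily_cost_usd(10_000, 5, 0.10, 0.50, 0.05)  # 10% hero, 90% draft

print(f"naive: ${naive:,.0f}/day  routed: ${routed:,.0f}/day")
```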

| Generator | Per second | Per 10s clip | Per hour generated |
|---|---|---|---|
| Hailuo 02 Standard 768p | $0.045 | $0.45 | $162 |
| Runway Gen-4 Turbo | $0.05 | $0.50 | $180 |
| Veo 3.1 Lite 720p no audio | $0.05 | $0.50 | $180 |
| Kling 3.0 720p | $0.075 | $0.75 | $270 |
| Luma Dream Machine Ray 3 | $0.08 | $0.80 | $288 |
| Hailuo 02 Pro 1080p | $0.08 | $0.80 | $288 |
| Sora 2 Standard 720p | $0.10 | $1.00 | $360 |
| Runway Gen-4 standard | $0.10 to $0.15 | $1.00 to $1.50 | $360 to $540 |
| Sora 2 Pro 720p | $0.30 | $3.00 | $1,080 |
| Sora 2 Pro 1024p | $0.50 | $5.00 | $1,800 |
| Veo 3 with native audio | $0.75 | $7.50 | $2,700 |

The other lever the price comparison hides: not every product needs to generate the video at all. AI-assisted editing of stock footage, automated trailer cuts from existing OTT libraries, and per-title thumbnail generation are often the right answers — the cost per delivered second of finished content is one to two orders of magnitude lower than diffusion-from-scratch. The Phase 5 article on generative video editing tools — Opus Clip, Descript, Submagic, Captions AI, DaVinci AI covers that route.

Figure 2. Per-second cost of seven major generative video models in May 2026, sorted cheapest to most expensive, ranging from Hailuo 02 Standard at $0.045/sec to Sora 2 Pro 1024p at $0.50/sec.

The Self-Host Layer — GPU Economics In 2026

The closed-API layer wins on developer velocity and frontier capability; the self-host layer wins on per-unit cost above a certain volume. The crossover point is the most important number in your cost model.

GPU hourly pricing in May 2026 has fallen sharply from 2024 highs but is bifurcated across providers. The headline numbers, sorted from cheapest to most expensive, look like this. NVIDIA H100 SXM5 is now available on Spheron spot for $1.03 per hour, RunPod community for $1.99 per hour, Lambda Labs on-demand at $2.99 per hour, GCP A3-high at about $3.00 per hour, CoreWeave HGX 8-GPU nodes at $6.15 per GPU-hour, AWS p5 at about $6.88 per hour, and Azure at about $12.29 per hour. The 12× gap from cheapest to most expensive is real and is the single biggest lever in the self-host cost equation. The trade-off is not just price — it is also reliability, geographic coverage, regulatory compliance, and how quickly you get instances back when they pre-empt. For production workloads on cheap providers, plan for 70-80% utilization, not 100%.

NVIDIA B200 (Blackwell) pricing has stabilised through 2026 at $2.12 to $2.25 per hour on spot, $2.65 per hour on CoreWeave one-year reserved, $3.49 per hour on Lambda on-demand, $5.98 per hour on RunPod on-demand, and up to $14.24 per hour on premium high-availability tiers. The B200 delivers roughly 2.5× the H100's throughput on transformer workloads, so the rough rule of thumb is that B200 at $3 per hour is competitive with H100 at $1.20 per hour on a throughput-normalised basis.
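The throughput-normalised comparison can be sketched as follows. It takes the 2.5× B200 speed-up quoted above at face value (the text applies it to transformer workloads only) and uses a few of the listed hourly prices; the offer labels and function name are illustrative.

```python
def h100_equivalent_usd_per_hour(usd_per_hour: float, speedup_vs_h100: float) -> float:
    """Hourly price divided by throughput relative to one H100."""
    return usd_per_hour / speedup_vs_h100


# (label, $ per GPU-hour, throughput relative to H100); prices taken from the text.
offers = [
    ("H100 spot", 1.03, 1.0),
    ("H100 on-demand (Lambda)", 2.99, 1.0),
    ("B200 spot", 2.12, 2.5),
    ("B200 on-demand (Lambda)", 3.49, 2.5),
]

ranked = sorted(offers, key=lambda o: h100_equivalent_usd_per_hour(o[1], o[2]))
for label, price, speedup in ranked:
    print(f"{label:26s} ${h100_equivalent_usd_per_hour(price, speedup):.3f} per H100-equivalent hour")
```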

Older GPUs are the right answer for many video workloads. NVIDIA L40S sits at a $0.50 per hour spot floor and a median on-demand of about $1.73 per hour. A100 80GB sits at a $0.44 per hour spot floor and a median on-demand of about $2.06 per hour. Vast.ai has A100 listings as low as $0.67 per hour. For computer vision workloads — YOLO, SAM 2, optical flow, segmentation — these older GPUs are over-provisioned and the per-frame cost is essentially the cheapest you can get without going to a CPU. AMD MI300X has filled the gap on the LLM serving side, ranging from $0.95 per hour spot to $3.02 per hour median on-demand on RunPod and DigitalOcean.

The edge of the cost curve, both in price and in latency. NVIDIA Jetson Orin Nano Super Developer Kit at $249 (67 TOPS) and Jetson AGX Orin 64GB Developer Kit at $1,999 (275 TOPS) are the canonical edge-AI hardware for video applications. Amortised over a three-year deployment, a Jetson AGX Orin runs at roughly $0.08 per hour of compute — cheaper than any cloud GPU, with zero network latency and full data sovereignty. The trade-off is throughput: an AGX Orin handles tens of camera streams in parallel, not thousands.
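The amortisation behind the $0.08 per hour figure is sketched below. It covers the hardware price only and assumes the device runs around the clock; the utilisation parameter is an added assumption for deployments that do not.

```python
HOURS_PER_YEAR = 365 * 24


def amortised_usd_per_hour(hardware_usd: float, years: float, utilisation: float = 1.0) -> float:
    """Hardware price spread over the hours the device is actually in use."""
    return hardware_usd / (years * HOURS_PER_YEAR * utilisation)


# Jetson AGX Orin 64GB Developer Kit over a three-year deployment.
print(round(amortised_usd_per_hour(1_999, 3), 3))                   # 0.076
print(round(amortised_usd_per_hour(1_999, 3, utilisation=0.5), 3))  # 0.152
```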

Apple Silicon belongs on the cost map for any product with on-device inference. A Mac Studio M4 Max with 128 GB of unified memory starts at $1,999 and runs 70B-parameter models at 15 to 18 tokens per second using MLX; the M3 Ultra with 192 GB and 800 GB/s memory bandwidth starts at $3,999 and runs the same models at 25 to 30 tokens per second. For internal-use video tools where the company owns the hardware once, the cost-per-inference is essentially zero after the upfront capital outlay, with the additional benefit that no user data leaves the device.

The Crossover — When Does Self-Hosting Win?

The crossover from API to self-host happens at workload-specific volume thresholds and the math is straightforward enough to do on a napkin.

Take Whisper as the canonical example. At $0.006 per minute, OpenAI's Whisper API charges about $9 for the audio that a single fully saturated GPU processes in one hour (one GPU handles about 1,500 minutes of audio per hour with a quality model). A self-hosted Whisper Large on an A100 at $1.07 per hour (Spheron on-demand) handles the same throughput for roughly $1.07 of GPU time. The crossover is at roughly 7,200 minutes of audio per month (about 120 hours, or four hours per day), assuming you can keep the GPU busy. Below that, the API is cheaper and infinitely simpler. Above it, self-host wins by a factor of 4× to 8× in raw compute, before the ops overhead is added.
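The napkin math generalises to a two-line model: self-hosting has a lower variable cost per audio minute but carries some fixed monthly cost. The sketch below uses the API rate, GPU price, and throughput quoted above; the fixed monthly cost is deliberately left as a parameter because the text does not itemise it, and the function names are hypothetical.

```python
def self_host_usd_per_audio_minute(gpu_usd_per_hour: float, audio_minutes_per_gpu_hour: float) -> float:
    """Raw GPU cost of transcribing one minute of audio on a saturated GPU."""
    return gpu_usd_per_hour / audio_minutes_per_gpu_hour


def break_even_minutes_per_month(
    api_usd_per_minute: float,
    gpu_usd_per_hour: float,
    audio_minutes_per_gpu_hour: float,
    fixed_monthly_usd: float,   # assumption: idle GPU time, monitoring, engineering
) -> float:
    """Monthly audio volume above which self-hosting is cheaper than the API."""
    variable = self_host_usd_per_audio_minute(gpu_usd_per_hour, audio_minutes_per_gpu_hour)
    saving_per_minute = api_usd_per_minute - variable
    if saving_per_minute <= 0:
        return float("inf")     # self-hosting never catches up
    return fixed_monthly_usd / saving_per_minute


raw_advantage = 0.006 / self_host_usd_per_audio_minute(1.07, 1_500)
print(f"raw compute advantage at full saturation: {raw_advantage:.1f}x")
```

The fixed monthly cost is the input that moves the break-even most, so it is worth estimating it explicitly before trusting any single crossover number.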

The Lenovo Press 2026 TCO study finds an 8× cost advantage for self-hosting on Lenovo hardware vs cloud IaaS for LLM inference, and up to 18× vs frontier model-as-a-service APIs. Their five-year savings per server can exceed $5 million. The catch — and there is always a catch — is that the comparison assumes the workload fits comfortably on the server, the team has the operational competence to run inference platforms, and the alternative API is a frontier-tier model where you are paying for both compute and brand. If your workload is mostly cheap-tier API calls (Flash, Haiku, Gemini Flash-Lite), the crossover moves much further out and self-hosting may never pay off.

Figure 3. The crossover point where self-hosted Whisper Large on a $1/hr A100 becomes cheaper than the OpenAI Whisper API, plotted against monthly audio volume, with the crossover marked at roughly 120 hours per month.

Fine-Tuning, RAG, And The Stuff Nobody Talks About

The base inference cost is only half the bill. Fine-tuning, retrieval, vector storage, and the operational tax of running anything in production are line items that show up only after the prototype is shipped and the CFO starts asking pointed questions.

OpenAI fine-tuning is priced at $25 per million tokens of training for GPT-4o, with fine-tuned inference at $3.75 per million input and $15.00 per million output. As of May 2026, OpenAI is winding down its fine-tuning platform for new users; existing jobs continue, and fine-tuned models remain available for inference until base-model deprecation. Migrate to prompt engineering plus retrieval for new builds, or to open-weight fine-tuning on Llama / Qwen / Mistral families running on your own GPUs.

Vertex AI prices Gemini fine-tuning per node-hour for custom training and per training token for supervised fine-tuning. There is no flat per-1M-token public rate — pricing is quote-based and varies by region and capacity. This opacity is one of the genuine downsides of the Vertex stack against the OpenAI alternative, and worth flagging during procurement.

Vector storage is the cost line that gets buried in the AWS bill until somebody runs the report. Pinecone Serverless at 10 million vectors costs about $70 per month; at 100 million vectors it crosses $700 per month. Pgvector self-hosted on Postgres handles the same scale for about $45 per month on a small instance and under $100 per month on a beefier one. The gap is modest at 10 million vectors (about 1.5×) but reaches roughly 7× or more at 100 million and keeps widening with scale, so the right design pattern for any video product expecting more than a few million video segments in its RAG index is pgvector first, Pinecone only if the operational requirements make Postgres untenable. The Phase 4 article on video RAG and multimodal RAG over a video archive covers the architecture in depth.

The hidden operational tax is everything else: monitoring (Datadog, Grafana Cloud, Honeycomb — typically 5% to 10% of compute spend), error tracking, observability tooling for agent workflows (AgentOps, LangSmith, Helicone), data egress on cloud APIs (often 10% of compute on cross-region pipelines), and on-call engineering. The rule of thumb for a properly-instrumented production AI stack is to multiply the raw compute bill by 1.4× to 1.6× to get the all-in cost.

EU AI Act — The Cost Line That Hits In August 2026

The EU AI Act's general-purpose AI obligations have been in force since August 2, 2025, but the European Commission's enforcement powers and the corresponding fines — up to 3% of global turnover or €15 million — start applying on August 2, 2026. For any video product placed on the EU market that uses general-purpose AI models, the compliance cost is real and quantifiable.

Latham & Watkins and other legal trackers estimate the first-year compliance cost for foundation model providers and integrators at $12 to $25 million. For most video product companies the cost is smaller because they are integrators, not model providers — but the obligations are still non-trivial: technical documentation, instructions for use, copyright-directive compliance, training-data summaries, model evaluations, adversarial testing, incident reporting, and cybersecurity controls.

The cost showing up in the AI bill is not usually the lawyer-hours, although those add up. It is the engineering effort to instrument every model call with the metadata the regulator wants, retain the logs in compliant storage, and provide the audit trail on demand. Budget 10 to 15 percentage points on top of your raw API spend for compliance overhead if you ship into the EU market. The Phase 8 article on regulatory engineering for the EU AI Act walks through the specific articles and the engineering response.

Worked Example — A 100,000-User-Per-Month Video Product

The point of the cost model is to predict the bill before the bill arrives. Here is the full worked example for a typical 2026 video conferencing product with AI features: 100,000 monthly active users, each averaging 8 hours of video calls per month, with live captions, optional real-time translation, an AI meeting note-taker, and weekly AI-generated summary digests.

The numbers, line by line.

Live captions on every minute of every call. 100,000 users × 8 hours = 800,000 hours of audio. On Deepgram Nova-3 streaming at $0.46 per hour, the live-caption bill is $368,000 per month. On AssemblyAI Universal Streaming at $0.15 per hour session-based (assuming sessions are roughly continuous), the bill drops to $120,000 per month. The choice is real-time accuracy vs cost; for most products, Deepgram wins on quality, AssemblyAI on cost.

Real-time translation in 20% of calls. 20% of 800,000 hours = 160,000 hours, at the gpt-realtime-translate rate of $0.034 per minute = $326,400 per month. This is the single biggest line item in the example and is the one that justifies most aggressively the move to a Gemini Live stack or a self-hosted translation pipeline.

AI meeting note-taker on every call. The note-taker ingests the full call transcript (an hour of audio at 150 words per minute is about 9,000 words, or roughly 13,000 input tokens) and produces a structured summary (about 1,000 output tokens per call). 800,000 calls (assuming one call per hour on average) × ($0.30 / 1M × 13K input + $2.50 / 1M × 1K output) = 800,000 × ($0.0039 + $0.0025) = $5,120 per month if routed to Gemini 2.5 Flash. The Phase 6 articles on LiveKit real-time AI meeting assistant and LiveKit meeting note-taker (full repo) walk through the production implementation.

Weekly digest summarising the week's calls per user. 100,000 users × 4 weeks = 400,000 digest summaries per month. Each ingests the week's 8 hours of call audio reduced to roughly 50,000 transcribed words (about 65,000 input tokens) and produces a 500-word digest (about 650 output tokens). At Gemini 2.5 Flash: 400,000 × ($0.30/1M × 65K + $2.50/1M × 650) = $8,450 per month.
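Both the note-taker and digest lines follow the same per-call formula, which can be sketched as one helper. It uses the Gemini 2.5 Flash prices and the token counts stated above; the function name is illustrative.

```python
def llm_monthly_cost_usd(
    calls: int,
    input_tokens: int,
    output_tokens: int,
    input_usd_per_m: float,
    output_usd_per_m: float,
) -> float:
    """Monthly bill for a workload with a fixed token profile per call."""
    per_call = (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000
    return calls * per_call


FLASH_IN, FLASH_OUT = 0.30, 2.50   # Gemini 2.5 Flash, $ per million tokens

note_taker = llm_monthly_cost_usd(800_000, 13_000, 1_000, FLASH_IN, FLASH_OUT)
digest = llm_monthly_cost_usd(400_000, 65_000, 650, FLASH_IN, FLASH_OUT)

print(f"note-taker: ${note_taker:,.0f}/month  digest: ${digest:,.0f}/month")
```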

Vector storage for the meeting archive (RAG over the company's last 90 days of meetings). 100,000 users × ~50 hours of audio per quarter × 1 embedding per 10 seconds = 1.8 billion embeddings active. On Pinecone Serverless at scale, this is roughly $12,600 per month; on pgvector self-hosted on a beefy Postgres cluster, roughly $2,000 per month including the database operator's time amortised.

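Compliance overhead at 12% of the AI line: roughly $85,000 per month on the Deepgram/translation/Flash combination ($368,000 + $326,400 + $5,120 + $8,450 ≈ $708,000), or about $55,000 per month if live captions run on AssemblyAI instead.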

Operational tax: a 1.5× multiplier on the raw compute bill, the midpoint of the 1.4× to 1.6× rule of thumb given earlier.

Adding the cheap-path total: $120,000 (AssemblyAI) + $326,400 (translation) + $5,120 (note-taker) + $8,450 (digest) + $2,000 (pgvector) ≈ $462,000 per month before overhead; with overhead and compliance, roughly $700,000 to $750,000 per month all-in. That is about $7 per user per month for the AI line, which a consumer or B2B SaaS product earning $10 to $30 per user per month can absorb only toward the upper end of that range.
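The roll-up can be checked with a short script. It uses the cheap-path line items listed above and the 1.5× operational multiplier; adding compliance as a separate share of the raw subtotal is an assumption about how the two overheads combine, and the names are illustrative.

```python
MONTHLY_USERS = 100_000

# Cheap-path line items from the worked example, in dollars per month.
line_items = {
    "live captions (AssemblyAI)": 120_000,
    "real-time translation": 326_400,
    "note-taker": 5_120,
    "weekly digest": 8_450,
    "vector storage (pgvector)": 2_000,
}


def all_in_monthly_usd(subtotal: float, ops_multiplier: float = 1.5, compliance_share: float = 0.0) -> float:
    """Raw subtotal scaled by the operational tax, plus compliance as a share of the raw subtotal."""
    return subtotal * ops_multiplier + subtotal * compliance_share


subtotal = sum(line_items.values())
ops_only = all_in_monthly_usd(subtotal)
with_compliance = all_in_monthly_usd(subtotal, compliance_share=0.12)

print(f"subtotal: ${subtotal:,}")
print(f"ops only: ${ops_only:,.0f} (${ops_only / MONTHLY_USERS:.2f} per user)")
print(f"ops + 12% compliance: ${with_compliance:,.0f} (${with_compliance / MONTHLY_USERS:.2f} per user)")
```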

Now apply the levers. Switch the translation workload to Gemini Live at $0.018 per minute output = 160,000 × 60 × $0.018 = $172,800 per month instead of $326,400 — a $153,000 monthly saving. Cache the system prompts on the note-taker and digest workloads — call it a 70% reduction on input tokens, saving roughly $4,000 per month. Drop the note-taker to Gemini 2.5 Flash-Lite for the routing decision and reserve Flash for the actual summary, saving another $1,500 per month. Self-host Whisper Large for the live-caption workload, pulling that $120,000 monthly line down to roughly $8,000 per month in GPU compute plus engineering overhead — call it $25,000 per month all-in. Total: $200,000 to $250,000 per month all-in for the same feature set — $2.20 per user per month for AI, a 3× improvement on the naive build.
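The largest single lever, moving translation to a cheaper per-minute rate, is a one-line calculation. The sketch below uses the 160,000 translated hours and the two per-minute rates quoted in the text; the function name is illustrative.

```python
def per_minute_bill_usd(hours: float, usd_per_minute: float) -> float:
    """Monthly bill for a workload priced per minute of audio."""
    return hours * 60 * usd_per_minute


TRANSLATED_HOURS = 800_000 * 20 // 100   # 20% of 800,000 call hours

realtime_translate = per_minute_bill_usd(TRANSLATED_HOURS, 0.034)   # gpt-realtime-translate
gemini_live = per_minute_bill_usd(TRANSLATED_HOURS, 0.018)          # Gemini Live audio output
saving = realtime_translate - gemini_live

print(f"${realtime_translate:,.0f} -> ${gemini_live:,.0f}, saving ${saving:,.0f}/month")
```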

The cost-model XLSX accompanying this article walks the same example with editable cells so you can plug in your own user counts, workload mix, and tier choices. Download it below.

Figure 4. The routing decision tree that turns an AI bill from "unsustainable" into "obvious line item": start with Flash-Lite for classification, escalate to Flash or Sonnet on hard queries, and reserve Opus or GPT-5.5 Pro for edge cases.

Common Mistake — Treating The Prototype Bill As The Production Bill

The single most expensive mistake teams make is extrapolating the prototype's monthly bill linearly to production scale. Three forces conspire to make production cost growth super-linear, not linear.

First, prototype workloads are usually concentrated on the cheapest tier — the developer is paying their own credit card and is making careful choices. Production workloads accumulate exceptions over time: a customer hits an edge case, the engineer adds a fallback to a more expensive model; a new feature ships, the PM asks for "the smart version", the routing logic adds another GPT-5 Pro path. Each addition is a 5× to 30× cost multiplier on the affected slice of traffic.

Second, retries, cache invalidations, and reasoning loops compound. A single retry on a failed Realtime API call doubles the cost of that minute. A cache invalidation forced by a system-prompt change wipes a week of savings. An agent that loops three times to refine a tool call instead of one-shotting it triples the cost of that interaction.
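A rough way to see how these effects compound is to multiply them. The sketch below is a simplification: it assumes each failed call is retried at most once and is billed again in full, and that every agent pass costs about the same. The function name and both parameters are illustrative.

```python
def cost_multiplier(retry_rate: float, agent_passes: float) -> float:
    """Expected cost versus a one-shot call that never fails.

    retry_rate:   fraction of calls that are retried once (each retry re-bills the call)
    agent_passes: average number of model calls per interaction (1.0 = one-shot)
    """
    if retry_rate < 0 or agent_passes < 1:
        raise ValueError("retry_rate must be >= 0 and agent_passes >= 1")
    return (1 + retry_rate) * agent_passes


if __name__ == "__main__":
    # Every call retried once: the cost doubles, as in the text.
    print(cost_multiplier(retry_rate=1.0, agent_passes=1))
    # An agent that loops three times with no retries: the cost triples.
    print(cost_multiplier(retry_rate=0.0, agent_passes=3))
    # Modest values of both still compound (hypothetical inputs).
    print(round(cost_multiplier(retry_rate=0.05, agent_passes=1.4), 2))
```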

Third, the operational tax is multiplicative, not additive. The 1.5× multiplier on raw compute is itself a multiplier on a growing base, and the operational tooling (Datadog, error tracking, observability) charges per ingested event, so its bill grows in step with the API bill.

The defensive practice is to build the cost model first and the product second. Pick a target unit cost per user per month, decompose it into per-feature unit costs, instrument every API call against those targets in production, and alert when any feature crosses 120% of its target. The Phase 8 article on video AI cost optimization (25 levers) catalogues the full set of optimisations, but the discipline of measuring before optimising matters more than any single lever.
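The 120% alert described above can be expressed in a few lines. This is a minimal sketch: the feature names, targets, and spend figures in the example are hypothetical, and in practice the spend dictionary would come from whatever billing export or request instrumentation you already have.

```python
ALERT_THRESHOLD = 1.20  # alert when a feature exceeds 120% of its target


def features_over_budget(targets: dict, spend: dict, active_users: int) -> dict:
    """Return {feature: actual/target ratio} for features above the threshold.

    targets: target AI cost per user per month, in USD, keyed by feature
    spend:   actual monthly AI spend, in USD, keyed by feature
    """
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    over = {}
    for feature, target in targets.items():
        unit_cost = spend.get(feature, 0.0) / active_users
        ratio = unit_cost / target
        if ratio > ALERT_THRESHOLD:
            over[feature] = round(ratio, 2)
    return over


if __name__ == "__main__":
    # Hypothetical targets and spend for a product with 100,000 users.
    targets = {"captions": 0.50, "notes": 0.30}
    spend = {"captions": 52_000, "notes": 45_000}
    print(features_over_budget(targets, spend, active_users=100_000))
```

In this example the captions feature is 4% over target and stays quiet, while the notes feature is 50% over and is reported.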

The Twelve Cost Levers — Stack Them All

The same cost-shape repeats across every category we have covered. Below is the master list. Most teams pull two or three of these; high-performing teams pull all twelve.

The first six are model-level levers. Route by use case to the cheapest tier that meets the quality bar: use Flash, Haiku, Mini, and Nano tiers as the default and reserve flagship tiers for the queries that genuinely need them. Cache static prompts aggressively; on Gemini, OpenAI, and Anthropic the cache read rate is up to 90% off the base input rate, and the cache write cost amortises after a handful of reads. Batch-process every workload that does not need a real-time response; batch saves 50% on most providers. Use low media resolution for video and audio inputs when the task does not need full resolution; on Gemini this is a 3× cost reduction on video tokens (100 tokens per second instead of 300). Truncate context aggressively, because most reasoning tasks need a fraction of what the prompt actually contains. Compress system prompts after every feature change; system prompts grow over time and nobody ever shrinks them.
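The claim that a cache write amortises after a handful of reads is easy to check with a break-even calculation. The sketch below is hedged: the 0.10 read multiplier reflects the 90% discount quoted above, but the 1.25 write multiplier is an assumption for illustration (write pricing differs by provider), and cache expiry and storage fees are ignored.

```python
def cached_vs_uncached(n_requests: int, write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> tuple[float, float]:
    """Relative input cost of a static prompt over n requests.

    Costs are in units of "one uncached read of the prompt". The first
    request pays the cache write; every later request pays the read rate.
    """
    uncached = float(n_requests)
    cached = write_multiplier + read_multiplier * (n_requests - 1)
    return cached, uncached


def break_even_requests(write_multiplier: float = 1.25,
                        read_multiplier: float = 0.10) -> int:
    """Smallest request count at which caching is strictly cheaper."""
    if read_multiplier >= 1:
        raise ValueError("caching never pays off if reads are not discounted")
    n = 1
    while True:
        cached, uncached = cached_vs_uncached(n, write_multiplier, read_multiplier)
        if cached < uncached:
            return n
        n += 1


if __name__ == "__main__":
    print(break_even_requests())           # assumed 1.25x write premium
    print(break_even_requests(2.0, 0.10))  # a deliberately pessimistic write cost
```

Under these assumptions caching pays for itself on the second request, and even a write that costs twice the base rate breaks even by the third.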

The next three are architecture-level levers. Move from frontier APIs to self-host on Llama / Qwen / Mistral families for the 80% of workloads that do not need frontier capability — the Lenovo TCO study finds 8× to 18× savings at scale, and the open-weight models are within 5% of frontier on most production benchmarks (see the open-frontier VLM comparison in Phase 4). Push inference to the edge — Jetson Orin or Apple Silicon — for any workload where the user is the bottleneck on the data, not the compute. And replace expensive synchronous APIs with cheaper asynchronous pipelines wherever the UX permits.

The last three are operational levers. Negotiate volume contracts above $50,000 per month of API spend — every major provider has discounted pricing for committed-use customers, and the discount is typically 20% to 40%. Monitor cost per request, not just request count; cost per request flags routing regressions before the bill does. Use spot or interruptible GPU instances for batch workloads — the savings are 50% to 80% over on-demand and the engineering complexity to handle pre-emption is modest with the right framework.
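Monitoring cost per request rather than request count can be as simple as comparing each day against a trailing baseline. The following is a sketch only: the seven-day window and the 15% tolerance are arbitrary assumptions, and the input format (a list of daily request counts and costs) is hypothetical.

```python
def cost_per_request_regressions(daily, baseline_days: int = 7,
                                 tolerance: float = 0.15) -> list:
    """Flag days whose cost per request exceeds the trailing baseline.

    daily: list of (requests, cost_usd) tuples, oldest first
    Returns the indices of days that are more than `tolerance` above the
    average cost per request of the preceding `baseline_days` days.
    """
    flagged = []
    for i in range(baseline_days, len(daily)):
        window = daily[i - baseline_days:i]
        window_requests = sum(r for r, _ in window)
        requests, cost = daily[i]
        if window_requests == 0 or requests == 0:
            continue  # nothing to compare against
        baseline = sum(c for _, c in window) / window_requests
        if cost / requests > baseline * (1 + tolerance):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Request count is flat at 10,000 per day, so a count-based dashboard
    # shows nothing, but the last day's cost per request jumps by 30%.
    history = [(10_000, 100.0)] * 7 + [(10_000, 104.0), (10_000, 130.0)]
    print(cost_per_request_regressions(history))
```

The example illustrates the point made above: traffic is unchanged, so only the per-request view reveals that some slice of requests has started hitting a more expensive path.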

| Lever | Typical saving | Where it applies |
| --- | --- | --- |
| Route to cheaper model tier | 5× to 30× on routed traffic | Every closed-API workload |
| Prompt caching | 70% to 90% on cached portion | Stable system prompts |
| Batch processing | 50% | Non-realtime workloads |
| Low-resolution media input | 3× on video tokens | Most VLM tasks |
| Context truncation | 20% to 60% | Long-context reasoning |
| System-prompt compression | 10% to 30% | Mature products |
| Self-host on open weights | 5× to 20× | High-volume workloads |
| Edge inference | up to 100× | Latency-sensitive features |
| Async pipelines | 50% to 80% | Batch-amenable UX |
| Volume contracts | 20% to 40% | Above $50K/mo spend |
| Cost-per-request monitoring | 10% to 30% (regressions caught) | Production-scale products |
| Spot / interruptible GPUs | 50% to 80% | Batch GPU work |
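When stacking levers, note that percentage savings multiply rather than add. The sketch below makes the simplifying assumption that every lever applies to the same slice of spend, which is rarely exactly true since the levers in the table target different workloads; the function name is illustrative and the inputs are the low ends of three ranges from the table.

```python
from math import prod


def stacked_saving(savings: list) -> float:
    """Combined fractional saving when each lever acts on what the others left.

    savings: fractional savings per lever, e.g. 0.50 for 50%
    """
    return 1 - prod(1 - s for s in savings)


if __name__ == "__main__":
    # Batch processing (50%), context truncation (20%), prompt compression (10%).
    combined = stacked_saving([0.50, 0.20, 0.10])
    print(f"{combined:.0%}")
```

The three levers combine to a 64% saving, not the 80% that simple addition would suggest, because each one reduces a bill that the previous lever has already shrunk.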

Where Fora Soft Fits In

Fora Soft has shipped video products with closed-API AI, self-hosted inference, and hybrid topologies across video conferencing, OTT, video surveillance, telemedicine, and e-learning since 2005. We have rebuilt the cost model in production for clients moving from a $200,000-per-month closed-API bill to a $40,000-per-month hybrid stack without losing quality on the user-visible features. The work is not glamorous — it is routing logic, caching policies, batch-vs-realtime decisions, and a quarterly refresh of the unit-economics spreadsheet — but it is the difference between AI features that pay back their development cost and AI features that get killed at the next budget review. If you are running into a wall on AI cost in a video product, we are the team you call after you have read every blog post on the topic and the numbers still do not work.

Talk To Us / See Our Work / Download

  • Talk to a video engineer — book a 30-minute scoping call to walk through the cost model for your specific product. Contact us.
  • See our case studies — read how we have shipped AI in video products across conferencing, OTT, surveillance, telemedicine, and e-learning. View case studies.
  • Download the AI Cost Model Worksheet — the same spreadsheet logic walked above, ready to plug your own numbers into. Download (PDF).

References

  1. Google. Gemini Developer API pricing. Read 2026-05-23. https://ai.google.dev/gemini-api/docs/pricing — Primary source for Gemini 2.5 Pro, Flash, Flash-Lite pricing and context-caching discounts.
  2. Google. Gemini video understanding documentation. Read 2026-05-23. https://ai.google.dev/gemini-api/docs/video-understanding — Primary source for video tokenisation rates (300 tokens/sec default, 100 tokens/sec low).
  3. OpenAI. API pricing. Read 2026-05-23. https://openai.com/api/pricing/ — Primary source for the GPT-5 family pricing, GPT-4o legacy pricing, Whisper API, and audio transcription tiers.
  4. OpenAI. Introducing gpt-realtime. Read 2026-05-23. https://openai.com/index/introducing-gpt-realtime/ — Primary source for OpenAI Realtime API pricing in audio tokens ($32 input, $64 output per 1M).
  5. BenchLM. Claude API Pricing (2026). Read 2026-05-23. https://benchlm.ai/blog/posts/claude-api-pricing — Third-party snapshot of Claude Opus 4.7, Sonnet 4.6, Haiku 4.5 pricing including cache discount mechanics; cross-checked against Anthropic primary docs.
  6. Deepgram. Pricing — Speech-to-Text. Read 2026-05-23. https://deepgram.com/pricing — Primary source for Nova-3 streaming and batch pricing.
  7. ElevenLabs. API Pricing. Read 2026-05-23. https://elevenlabs.io/pricing/api — Primary source for ElevenLabs TTS, Conversational AI, and Scribe (STT) pricing.
  8. AssemblyAI. Universal-Streaming session-based pricing FAQ. Read 2026-05-23. https://www.assemblyai.com/docs/faq/how-does-universal-streaming-session-based-pricing-work — Primary source for the $0.15/hr session-based Universal Streaming rate.
  9. Azure. Speech Services pricing. Read 2026-05-23. https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/ — Primary source for Azure Speech real-time, batch, and custom-model pricing.
  10. Google Cloud. Speech-to-Text pricing. Read 2026-05-23. https://cloud.google.com/speech-to-text/pricing — Primary source for Google Cloud STT V2 base, standard, and enhanced rates.
  11. aifreeapi.com. Sora 2 API Pricing & Quotas. Read 2026-05-23. https://www.aifreeapi.com/en/posts/sora-2-api-pricing-quotas — Snapshot of Sora 2 Standard / Pro per-second pricing across resolutions.
  12. Runway. API Pricing documentation. Read 2026-05-23. https://docs.dev.runwayml.com/guides/pricing/ — Primary source for Runway Gen-4 credit pricing.
  13. Luma. API pricing. Read 2026-05-23. https://lumaai-help.freshdesk.com/support/solutions/articles/151000210176-what-are-your-prices-for-api- — Primary source for Luma Dream Machine Ray 3 per-second API pricing.
  14. Google Cloud. Vertex AI generative AI pricing. Read 2026-05-23. https://cloud.google.com/vertex-ai/generative-ai/pricing — Primary source for Veo 3 / Veo 3.1 per-second pricing across resolutions and audio tiers.
  15. Spheron. GPU Cloud Pricing 2026. Read 2026-05-23. https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/ — Comparative source for H100, B200, A100, L40S, MI300X cross-provider pricing in May 2026.
  16. CoreWeave / IntuitionLabs. H100 Rental Prices Cloud Comparison. Read 2026-05-23. https://intuitionlabs.ai/articles/h100-rental-prices-cloud-comparison — Cross-provider H100 pricing snapshot.
  17. NVIDIA Developer. Buy the Latest Jetson Products. Read 2026-05-23. https://developer.nvidia.com/buy-jetson — Primary source for Jetson Orin Nano Super and AGX Orin 64GB Developer Kit pricing.
  18. Apple. Mac Studio. Read 2026-05-23. https://www.apple.com/mac-studio/ — Primary source for M4 Max and M3 Ultra Mac Studio configurations and pricing.
  19. Lenovo Press. On-Premise vs Cloud Generative AI TCO 2026 Edition. Read 2026-05-23. https://lenovopress.lenovo.com/lp2368-on-premise-vs-cloud-generative-ai-total-cost-of-ownership-2026-edition — Source for the 8× / 18× self-host-vs-cloud cost-advantage finding.
  20. Gartner. Press release — 90% inference cost decline by 2030. Read 2026-05-23. https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025 — Forward-looking forecast for inference cost trajectory.
  21. McKinsey. The future of AI workloads. Read 2026-05-23. https://www.mckinsey.com/featured-insights/week-in-charts/the-future-of-ai-workloads — Source for inference-vs-training workload mix projections.
  22. Epoch AI. LLM inference price trends. Read 2026-05-23. https://epoch.ai/data-insights/llm-inference-price-trends — Source for the unequal price-decline finding across task complexity tiers.
  23. Latham & Watkins. EU AI Act — GPAI obligations in force. Read 2026-05-23. https://www.lw.com/en/insights/eu-ai-act-gpai-model-obligations-in-force-and-final-gpai-code-of-practice-in-place — Primary legal source for the August 2026 enforcement deadline and fine structure.
  24. Pinecone. Pricing. Read 2026-05-23. https://www.pinecone.io/pricing/ — Primary source for Pinecone Serverless tier pricing at vector counts referenced.