Published 2026-06-03 · 26 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
In a video product, inference — the cost of running an AI model, as opposed to training it — is not a line item on the budget; it is usually the line item, because video converts time into billable work at a rate text products never see. A one-hour video sent to a frontier model at full resolution is roughly a million tokens of input before anyone reads a word of the answer, and a detector watching a camera runs thirty times a second forever. Get the cost structure wrong and a feature that demos beautifully becomes the reason the unit economics never close; get it right and the same feature pays for itself and funds the next one. This lesson is for the product manager who has to defend an AI budget, the founder choosing where to spend a limited runway, and the engineer who has to actually turn the bill down without breaking the product. It is the operational companion to the cost model lesson, which shows you how to measure what AI in video costs; this one shows you the full menu of ways to cut it.
The One Idea: Levers Multiply, They Don't Add
Before the list, hold onto the idea that makes the list worth reading. Cost levers compound. If one lever cuts your bill to a third, and a second independent lever also cuts it to a third, and a third does the same, the combined effect is not "saved a third three times" — it is a third of a third of a third, which is one twenty-seventh of where you started. Three modest, unglamorous changes, each a "3× win," turn a $135,000 bill into a $5,000 one.
That is why the right mental model is a stack, not a ranking. You are not looking for the one silver-bullet optimization; you are looking for every independent lever that applies to your workload and pulling all of them. The word independent matters, and we will return to it: two levers compound cleanly only when they act on different parts of the cost. Cutting the number of tokens you send (a quantity lever) and cutting the price you pay per token (a price lever) are independent and multiply. Two different ways of cutting the same token count overlap and do not. The skill is recognising which is which.
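The compounding arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a costing tool: the function name `stacked_bill` and the idea of describing each lever as "the fraction of the bill it leaves behind" are assumptions made for the example, and the multiplication is only valid when the levers are independent.

```python
from math import prod

def stacked_bill(start: float, remaining_fractions: list[float]) -> float:
    """Bill after applying independent levers.

    Each lever is expressed as the fraction of the cost it leaves in place,
    so a "3x win" is 1/3. Overlapping levers must not be multiplied this way.
    """
    return start * prod(remaining_fractions)

# Three independent levers, each cutting the bill to a third.
bill = stacked_bill(135_000, [1 / 3, 1 / 3, 1 / 3])
print(round(bill))  # 5000
```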
Figure 1. The twenty-five levers in five families. Read it as a menu: pull every lever that applies to your workload, because levers that act on different parts of the cost multiply together.
Where A Video AI Bill Actually Comes From
You cannot optimise what you have not located, so start by knowing which units your money turns into. Every AI cost in a video product traces to one of a few units — tokens (for text and multimodal models), audio minutes, generated seconds (for models that create video), GPU hours (when you host the model yourself), and stored bytes — and the cost model lesson walks each one in detail with 2026 prices. The one fact to carry into this article is how fast video manufactures those units: Google's Gemini tokeniser turns one hour of video at its default resolution into roughly 1.08 million input tokens (about 300 tokens for each second of footage, times 3,600 seconds), so a single user upload can cost more than an entire day of a text chatbot's traffic. That number is the reason the first family of levers — sending less of the video to the model in the first place — is, for video specifically, the family that moves the bill the most.
A note on order. The families below run roughly from "biggest video-specific win and cheapest to try" to "most structural." If you do nothing else, work the first two families and the half-price batch and cached-input levers in family four. Those are the changes that need no new hardware, no model retraining, and usually a week of engineering — and they are where most of the savings hide.
Family 1 — Process Less Video (Levers 1–5)
The cheapest token is the one you never send. For video this is not a minor optimisation; it is the whole game, because the naive pipeline sends grotesque amounts of redundant data. A thirty-frames-per-second feed is mostly thirty near-identical copies of the same scene, and a model that has to attend to every one of them pays a cost that grows faster than linearly — the attention mechanism inside a transformer model scales with roughly the square of the input length, so doubling the frames you feed it can more than double the work.
Lever 1 — Sample frames; do not process all of them. Instead of sending every frame, send a sparse selection. The simplest version takes one frame a second (a 30× reduction on a 30 fps feed); the smarter version, adaptive keyframe sampling, picks the frames that actually changed or that the user's question is about, which research shows is both cheaper and more accurate than uniform sampling because it spends the model's attention where the information is. Published "efficient video sampling" methods that prune temporally redundant frames cut a vision-language model's time-to-first-token by up to 4× with minimal accuracy loss. The technique is covered in depth in the video VLM frame-sampling lesson.
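As a rough sketch of the two sampling styles, the Python below picks frame indices uniformly and by a simple frame-difference heuristic. It is not the method from the published work mentioned above: frames are represented as flat lists of grayscale values, and the helper names and the `threshold` value are hypothetical choices for illustration.

```python
def uniform_sample(num_frames: int, fps: int, every_seconds: float = 1.0) -> list[int]:
    """Indices of one frame per `every_seconds` of footage."""
    step = max(1, round(fps * every_seconds))
    return list(range(0, num_frames, step))

def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def change_based_sample(frames: list[list[int]], threshold: float) -> list[int]:
    """Keep frame 0, then every frame that differs enough from the last kept frame."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Five identical frames, then the scene changes and stays changed.
static = [[10, 10, 10, 10]] * 5
moved = [[200, 200, 10, 10]] * 3
print(uniform_sample(num_frames=90, fps=30))                # [0, 30, 60]
print(change_based_sample(static + moved, threshold=20.0))  # [0, 5]
```

The change-based version sends two frames where uniform sampling of the same eight frames would send more or fewer depending only on the clock, not on what happened in the scene.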
Lever 2 — Downscale resolution before the model sees it. A model billed per visual token charges more for more pixels. Most frontier multimodal APIs expose a "low" media-resolution mode that tokenises each frame more coarsely, and most video understanding tasks — is there a person, what are they doing, is this safe — do not need 4K to answer. Sending a 480p frame instead of a 1080p one can cut the per-frame token count several times over for no meaningful loss on the task you actually run.
Lever 3 — Crop to the region of interest. If only one corner of the frame matters — a doorway, a whiteboard, a single speaker — send only that crop. A cheap, fast detector can find the region first, and the expensive model then sees a fraction of the pixels. This is the per-stage thinking from the serving lesson applied to cost: a small model upstream shrinks the input to the big model downstream.
Lever 4 — Gate the expensive model behind a cheap trigger. The most wasteful pattern in video AI is running a large model on a schedule — every second, every clip — when almost nothing is happening. Replace the schedule with an event. A near-free motion detector, a sound trigger, or a tiny on-device classifier decides when something is worth a closer look, and only then does the expensive vision-language model or cloud API run. On a surveillance feed that is interesting two percent of the time, gating cuts the expensive model's invocations roughly fiftyfold. This is the single biggest structural lever for always-on video.
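The gating pattern is small enough to sketch directly. In the Python below, `cheap_trigger` and `expensive_model` are hypothetical stand-ins (a motion detector and a vision-language model call in a real system); the point is only that the expensive callable runs when the trigger fires and never otherwise.

```python
from typing import Callable, Iterable

def gated_run(
    clips: Iterable[bytes],
    cheap_trigger: Callable[[bytes], bool],
    expensive_model: Callable[[bytes], str],
) -> tuple[list[str], int]:
    """Run the expensive model only on clips the cheap trigger flags.

    Returns the model outputs and the number of expensive invocations.
    """
    results: list[str] = []
    calls = 0
    for clip in clips:
        if cheap_trigger(clip):
            results.append(expensive_model(clip))
            calls += 1
    return results, calls

# 100 clips, 2 of which contain motion (2% "interesting").
clips = [b"motion" if i in (10, 60) else b"still" for i in range(100)]
results, calls = gated_run(clips, lambda c: c == b"motion", lambda c: "described")
print(calls, len(clips) / calls)  # 2 50.0
```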
Lever 5 — Trim the prompt and cap the output. Text tokens cost money too, on both sides. A bloated system prompt repeated on every call, or an unbounded answer, leaks money at scale. Trim instructions to what the model needs, and set a hard maximum on output length — output tokens are typically the most expensive tokens of all (often four to eight times the input price), so an answer that rambles to 2,000 tokens when 200 would do is a 10× overspend on the costliest line.
Family 2 — Run A Smaller, Cheaper Model (Levers 6–10)
Once you are sending only what you must, the next question is which model receives it. Most teams default to the most capable model for everything, which is like sending every letter by overnight courier. The fix is to match the model to the difficulty of the job.
Lever 6 — Route easy work to a cheaper tier. Frontier models are not 2× the price of their small siblings; they are commonly 8× to 30× the price. A "Flash" or "mini" tier at $0.30 per million input tokens does the routine 90% of video tasks — basic description, simple moderation, transcription clean-up — that a flagship at $2.50 or more was overqualified for. A router is a small, fast classifier that reads each request and sends it to the cheapest model that can handle it. The open-source RouteLLM framework, published at ICLR 2025, reports matching 95% of GPT-4's quality while cutting cost by about 85% on a standard benchmark, by sending only the genuinely hard queries to the expensive model.
Lever 7 — Cascade: try the cheap model first, escalate only on doubt. A cascade is a router's more honest cousin. Instead of predicting which model a request needs, it runs the cheap model first, checks the result's confidence, and escalates to the expensive model only when the cheap one is unsure. Published cascades reach roughly 97% of a frontier model's accuracy at about a quarter of its cost. The trade-off is latency on the escalated minority, which makes cascades a great fit for offline and batch work and a careful choice for real-time. Setting the confidence threshold well requires the measurement discipline from the eval-rig lesson.
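A cascade reduces to a few lines of control flow. The sketch below assumes the cheap model can return a confidence score alongside its answer, which not every API exposes directly; `cascade`, both model callables, and the 0.8 threshold are hypothetical and would need tuning against your own evaluation data.

```python
from typing import Callable

def cascade(
    request: str,
    cheap_model: Callable[[str], tuple[str, float]],
    expensive_model: Callable[[str], str],
    min_confidence: float = 0.8,
) -> tuple[str, bool]:
    """Return (answer, escalated). Escalate only when the cheap model is unsure."""
    answer, confidence = cheap_model(request)
    if confidence >= min_confidence:
        return answer, False
    return expensive_model(request), True

# Stub models: the cheap one is confident about routine clips only.
def cheap(request: str) -> tuple[str, float]:
    return ("safe", 0.95) if "cat" in request else ("unsure", 0.40)

def flagship(request: str) -> str:
    return "flagged for review"

print(cascade("cat video", cheap, flagship))  # ('safe', False)
print(cascade("odd clip", cheap, flagship))   # ('flagged for review', True)
```

A router differs only in that the decision is made before any model runs, by a separate classifier, rather than after the cheap model has answered.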
Lever 8 — Swap a frontier API for a small open model where quality allows. For high-volume, well-scoped tasks, a 1-to-7-billion-parameter open model you host yourself can be dramatically cheaper per call than a closed API, once volume is high enough to keep the hardware busy. The break-even and the open-versus-closed procurement question are the subject of the model-artifact-formats lesson; the cost point here is that beyond a certain steady volume, owning the model beats renting it per token.
Lever 9 — Distil a smaller student model. Distillation trains a new, smaller "student" model to copy a larger "teacher," producing a genuinely smaller and faster model rather than a compressed copy — Distil-Whisper, the distilled speech model, is about half the size and roughly 6× faster while staying within one percent of the original's accuracy. The mechanism, costs, and when to reach for it are in the distillation and quantization lesson; as a cost lever, it shrinks the model you have to serve.
Lever 10 — Quantize the model. Quantization stores the same model's numbers in fewer bits. Dropping from sixteen-bit floating point to eight-bit integers roughly halves the model's memory (a fourfold cut if you start from thirty-two-bit weights), usually for under one percent of accuracy, which lets it run on a smaller, cheaper chip and run faster on that chip. Distillation and quantization stack with each other (distil first, then quantize the student), and both are detailed in the compression lesson.
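The memory side of quantization is simple arithmetic: parameters times bits per weight. The sketch below covers the weights only and ignores activations, the KV cache, and any per-format overhead, so treat it as a lower bound; the function name is an assumption for the example.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

# A 7-billion-parameter model at common precisions.
for bits in (32, 16, 8, 4):
    print(bits, weight_memory_gb(7, bits))
# 32 28.0
# 16 14.0
# 8 7.0
# 4 3.5
```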
Figure 2. The order-of-magnitude each family can move the bill. These are ranges, not guarantees — but because the families act on different parts of the cost, their wins multiply.
Family 3 — Serve The Model More Efficiently (Levers 11–15)
This family applies when you host the model yourself, and it is the subject of the inference-serving lesson in full. The headline: the same GPU can do many times more work with the right serving software, and since you pay for the GPU by the hour either way, more work per hour is directly less cost per result.
Lever 11 — Continuous batching. A GPU is most economical when it works on many requests at once, but naive batching makes fast requests wait for slow ones. Continuous batching slots new requests into the batch the moment a slot frees up, keeping the GPU full. The vLLM serving engine, using continuous batching, reports up to 23× the throughput of a naive setup — meaning one GPU does the work of many, at one GPU's cost.
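To see why continuous batching helps, a toy simulation is enough. The Python below is not vLLM's scheduler: it counts decode steps only, ignores prefill and memory limits, and treats each request as a fixed number of output tokens. Both function names are hypothetical.

```python
import heapq

def static_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Fixed batches: every batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )

def continuous_batch_steps(lengths: list[int], batch_size: int) -> int:
    """A waiting request takes over a slot as soon as that slot frees up."""
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)      # earliest available slot
        heapq.heappush(slots, free_at + n)  # occupied for n more steps
    return max(slots)

# Two long generations mixed with six short ones, four slots on the GPU.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batch_steps(lengths, 4))      # 200
print(continuous_batch_steps(lengths, 4))  # 110
```

In the static case the short requests in each batch sit idle behind the long one; in the continuous case the second long request starts as soon as the first short one finishes.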
Lever 12 — Reuse the KV cache (PagedAttention). When a model generates text, it keeps a running memory of the conversation called the KV cache; managing that memory poorly wastes most of the GPU's capacity. vLLM's PagedAttention stores it efficiently and lets requests share it, which is much of where the throughput gain above comes from. The mechanism is explained in the serving lesson.
Lever 13 — Speculative decoding. A small "draft" model guesses the next several tokens, and the large model verifies them all in one parallel pass instead of generating them one at a time. The output is mathematically identical to running the big model alone, but it arrives 2–3× faster — now a standard feature in vLLM, SGLang, and TensorRT-LLM. Faster generation on the same hardware is lower cost per result.
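The draft-then-verify loop can be illustrated with a toy greedy version. This sketch makes several simplifying assumptions: the "models" are plain functions from a token sequence to the next token, verification is shown as a loop where a real engine scores all positions in one parallel pass, and the bonus token a real implementation gains when every guess is accepted is omitted. All names are hypothetical.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]

def greedy_decode(model: NextToken, prompt: list[int], new_tokens: int) -> list[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(new_tokens):
        out.append(model(out))
    return out

def speculative_decode(
    target: NextToken, draft: NextToken, prompt: list[int], new_tokens: int, k: int = 4
) -> tuple[list[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    out = list(prompt)
    goal = len(prompt) + new_tokens
    target_passes = 0
    while len(out) < goal:
        # The cheap draft model guesses up to k tokens ahead.
        proposal: list[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # One verification pass: keep guesses until the first disagreement,
        # then take the target's own token at that position.
        target_passes += 1
        accepted: list[int] = []
        for guess in proposal:
            correct = target(out + accepted)
            accepted.append(correct)
            if guess != correct:
                break
        out.extend(accepted)
    return out, target_passes

def target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10            # counts 0..9 in a loop

def draft(seq: Sequence[int]) -> int:
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10  # wrong after a 5

tokens, passes = speculative_decode(target, draft, [0], new_tokens=12)
print(tokens == greedy_decode(target, [0], 12), passes)  # True 4
```

Every emitted token is one the target model itself would have chosen, which is why the output matches the baseline while the target runs 4 passes instead of 12.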
Lever 14 — Use a purpose-built serving runtime. Running a model through a research framework wastes the hardware. A production runtime — vLLM, NVIDIA's TensorRT-LLM, or SGLang — applies the optimisations above and several more automatically. Choosing among them is the core of the serving lesson.
Lever 15 — Right-size the GPU to the model. Running a small model on a top-end datacenter card is paying for capacity you cannot use. A quantized 7-billion-parameter model that fits comfortably on a mid-range card costing a fraction of a flagship's hourly rate should run there. Match the chip to the model's memory footprint, not to the most powerful thing available.
Family 4 — Pay Less Per Unit Of Compute (Levers 16–20)
The first three families reduce how much work you do. This family reduces the price of each unit of that work, which is why it stacks so cleanly on top of the others — you are cutting a different part of the cost. These levers are mostly procurement and configuration, not engineering, and several ship in an afternoon.
Lever 16 — Use the batch API for anything that is not real-time. Both OpenAI and Google offer a batch mode that runs your jobs within a 24-hour window for a flat 50% discount on input and output. Any video task that does not need an instant answer — overnight moderation of the day's uploads, bulk transcription of an archive, generating descriptions for a back catalogue — should run on batch. It is a literal halving of the bill for a scheduling change.
Lever 17 — Cache the prompt. When many calls share the same long prefix — a fixed system prompt, a rulebook, a reference document — prompt caching stores that prefix so you are not billed full price to re-process it every time. On Anthropic's API a cache hit costs 10% of the normal input price, a 90% discount on the repeated portion; the other major APIs offer the same idea. For a video moderation prompt that ships the same multi-page policy on every call, this is close to free money.
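How much prompt caching saves depends on how much of each call is the shared prefix. The sketch below uses the 10% cache-read multiplier quoted above as its default and ignores the surcharge some providers apply when the cache is first written; the function name and the $3.00 per million price in the usage lines are hypothetical.

```python
def input_cost_with_cache(
    prefix_tokens: int,
    unique_tokens: int,
    price_per_million: float,
    cache_read_multiplier: float = 0.10,
) -> float:
    """Input cost in USD of one call whose shared prefix is a cache hit."""
    cached = prefix_tokens * price_per_million * cache_read_multiplier
    fresh = unique_tokens * price_per_million
    return (cached + fresh) / 1_000_000

# An 8,000-token policy prefix plus 2,000 tokens unique to the call.
without_cache = input_cost_with_cache(0, 10_000, price_per_million=3.00)
with_cache = input_cost_with_cache(8_000, 2_000, price_per_million=3.00)
print(f"saving: {1 - with_cache / without_cache:.0%}")  # saving: 72%
```

If the unique part of the call is a million video tokens rather than 2,000 text tokens, the same prefix discount barely moves the total, which is the proportionality to check before counting on this lever.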
Lever 18 — Cache the answer (semantic caching). Prompt caching reuses the input; semantic caching reuses the output. If the same or a near-identical request has been answered before, return the stored answer instead of calling the model at all. For video this catches re-uploaded clips, duplicate frames, and repeated questions about the same content — and a cache hit costs essentially nothing. Deduplicating identical uploads before they ever reach a model is the same lever in its simplest form.
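The simplest form of this lever, exact-match deduplication, can be sketched with a content hash. The class below is illustrative only: `AnswerCache` and its in-memory dictionary are assumptions, and a true semantic cache would compare embeddings to catch near-identical requests rather than hashing bytes.

```python
import hashlib
from typing import Callable

class AnswerCache:
    """Exact-match cache keyed by a SHA-256 of the clip bytes plus the question."""

    def __init__(self, model: Callable[[bytes, str], str]) -> None:
        self._model = model
        self._store: dict[str, str] = {}
        self.model_calls = 0

    def ask(self, clip: bytes, question: str) -> str:
        key = hashlib.sha256(clip + b"\x00" + question.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: pay for the model once and remember the answer.
            self._store[key] = self._model(clip, question)
            self.model_calls += 1
        return self._store[key]

cache = AnswerCache(lambda clip, question: f"{len(clip)} bytes described")
cache.ask(b"clip-1", "what happens?")
cache.ask(b"clip-1", "what happens?")  # re-upload: served from the cache
cache.ask(b"clip-2", "what happens?")
print(cache.model_calls)  # 2
```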
Lever 19 — Run interruptible work on spot GPUs. Cloud providers sell spare capacity — spot or preemptible instances — at 60% to 90% below on-demand prices, with the catch that they can be reclaimed with little warning. That makes them perfect for work that can be paused and resumed: batch transcription, archive processing, offline analysis. Build the job to checkpoint and retry, and you run it for a fraction of the on-demand cost. (Spot is the wrong tool for a live video call that cannot tolerate an interruption.)
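"Checkpoint and retry" can be sketched as follows. The `Preempted` exception is a hypothetical stand-in for a spot instance being reclaimed, and the JSON checkpoint file is an assumption for the example; a production job would typically write the checkpoint atomically and to durable storage that survives the instance.

```python
import json
import os
import tempfile

class Preempted(Exception):
    """Stand-in for a spot instance being reclaimed mid-job."""

def process_archive(items, handle, checkpoint_path):
    """Process items in order, persisting progress after each one."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]
    for i in range(done, len(items)):
        handle(items[i])
        with open(checkpoint_path, "w") as f:
            json.dump({"done": i + 1}, f)

def run_with_retries(items, handle, checkpoint_path, max_attempts=5):
    """Restart from the last checkpoint after each preemption."""
    for _ in range(max_attempts):
        try:
            process_archive(items, handle, checkpoint_path)
            return True
        except Preempted:
            continue
    return False

# Usage: the handler is interrupted once on clip "c", then the retry resumes there.
processed, interrupted = [], []

def handle(item):
    if item == "c" and not interrupted:
        interrupted.append(item)
        raise Preempted()
    processed.append(item)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
ok = run_with_retries(["a", "b", "c", "d"], handle, path)
print(ok, processed)  # True ['a', 'b', 'c', 'd']
```

Because progress is saved after every item, the retry repeats none of the finished work, which is what keeps the spot discount from being eaten by rework.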
Lever 20 — Reserve capacity for your steady baseline. The opposite case from spot: the floor of demand that runs 24/7 anyway. For that predictable baseline, a 1-year or 3-year committed-use or reserved contract buys the same GPUs at 20% to 70% below on-demand, depending on provider and term. The pattern most teams land on is a blend — reserved capacity for the always-on floor, on-demand for the normal peaks, and spot for the interruptible batch work — so no single pricing mode pays for the whole, lumpy load.
Figure 3. The same lever can be a perfect fit or a footgun depending on the workload. Batch and spot are for work that can wait; reserved is for the steady floor; serverless is for bursty, unpredictable demand.
Family 5 — Architect And Govern For Cost (Levers 21–25)
The last family is structural. These levers do not have a single tidy multiplier; they change the shape of the system so that the other twenty levers have somewhere to land, and they keep the savings from leaking back out over time.
Lever 21 — Scale to zero with serverless GPUs. An idle GPU you rent by the hour bills whether or not a request arrives. Serverless GPU platforms bill per second of actual work and scale to zero between requests, so a feature with bursty or unpredictable traffic costs nothing while idle. The trade-off is cold start — the seconds it takes to load the model when traffic returns — which ranges from sub-second for small models to tens of seconds for large ones, so serverless fits spiky workloads better than steady high-traffic ones.
Lever 22 — Pick the right provider and kill the waste around the compute. The same H100 GPU rents for roughly $2–3 an hour on a specialist "neocloud" and around $7 an hour on a hyperscaler, which makes the neocloud roughly 57% to 71% cheaper (a 2.3× to 3.5× price gap) for identical silicon. Beyond the headline rate, the hidden costs bite: data egress fees for moving video out of the cloud, idle storage of footage no one will watch again, and over-provisioned instances. Auditing those is often a larger win than the GPU rate itself, because video is heavy and egress is billed by the gigabyte.
Lever 23 — Use a hybrid edge-plus-cloud topology. Run the constant, cheap, latency-sensitive work on or near the device that captured the video (the camera, the phone, the on-premises box) and reserve the cloud for the rare, hard, heavy work. On-device inference has near-zero marginal cost after the hardware is bought, keeps sensitive footage local, and removes the upload bandwidth cost for footage that never leaves the device. Where the edge-versus-cloud line falls is the subject of the latency and deployment lesson.
Lever 24 — Precompute derived artifacts once. A video is watched many times but only needs to be analysed once. Compute the transcript, the embeddings, the scene index, the thumbnail set a single time on ingest, store the cheap results, and serve them forever — never re-run the expensive model to answer a question the stored artifact already covers. Storage is pennies; recomputation is the model bill all over again.
Lever 25 — Instrument per-feature cost budgets and alert on drift. This is the lever that makes the other twenty-four durable. Tag every model call with the feature it serves, set a target unit cost per user or per hour of video for each feature, watch the actual cost against that target in production, and alert when any feature crosses it. Costs creep — a model gets swapped, a prompt grows, a retry loop misfires — and without instrumentation the first sign is the invoice. As the cost model lesson puts it, the discipline of measuring before optimising matters more than any single technique.
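A per-feature budget check needs very little machinery to start with. The class below is a minimal in-memory sketch: `FeatureCostMeter`, the feature names, and the dollar targets are all hypothetical, and a real deployment would feed the same numbers into whatever metrics and alerting system is already in place.

```python
from collections import defaultdict

class FeatureCostMeter:
    """Track spend per feature and compare unit cost against a target."""

    def __init__(self, budgets_per_video_hour: dict[str, float]) -> None:
        self.budgets = budgets_per_video_hour
        self.spend: dict[str, float] = defaultdict(float)
        self.video_hours: dict[str, float] = defaultdict(float)

    def record(self, feature: str, cost_usd: float, video_hours: float) -> None:
        """Call once per model invocation, tagged with the feature it serves."""
        self.spend[feature] += cost_usd
        self.video_hours[feature] += video_hours

    def over_budget(self) -> dict[str, float]:
        """Features whose actual cost per video hour exceeds the target."""
        alerts = {}
        for feature, budget in self.budgets.items():
            hours = self.video_hours[feature]
            if hours > 0 and self.spend[feature] / hours > budget:
                alerts[feature] = self.spend[feature] / hours
        return alerts

meter = FeatureCostMeter({"moderation": 0.10, "captions": 0.05})
meter.record("moderation", cost_usd=30.0, video_hours=200)  # $0.15 per hour
meter.record("captions", cost_usd=4.0, video_hours=100)     # $0.04 per hour
print(sorted(meter.over_budget()))  # ['moderation']
```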
Common mistake. Optimising before measuring, and trusting that multipliers always compound. Teams reach for the clever lever — speculative decoding, a custom router — before they have instrumented where the money actually goes, and then optimise the 5% while the 80% sits untouched. The other half of the trap is arithmetic: levers multiply only when they are independent. Routing routine work to a cheaper model and then quantizing that model both reduce the model-side cost and partly overlap — stacking their headline multipliers double-counts the saving. Cutting the token count (a quantity lever) and moving to the batch API (a price lever) are independent and do multiply. Before you celebrate a "50× plan," check that each lever acts on a different part of the bill, and verify the total against a real measurement, not a spreadsheet of optimistic factors.
A Worked Example — Stacking Three Levers
Make the compounding concrete with a workload we see often: a user-generated-content platform that runs "describe and moderate every uploaded video," covered as an engineering problem in the real-time moderation lesson. Say the platform ingests 50,000 hours of video a month.
The naive bill. Send every hour to a frontier model at full resolution. One hour is about 1.08 million input tokens (the Gemini video-tokenisation fact from above), and at the frontier model's large-context input price of $2.50 per million tokens, that is:
per hour: 1,080,000 tokens × $2.50 / 1,000,000 = $2.70
per month: $2.70 × 50,000 hours = $135,000
(Output is a short description, a few cents an hour, negligible against the input.) So the starting point is $135,000 a month.
Lever 1 — process less video. Switch to low media resolution and adaptive frame sampling. Conservatively, that cuts the input token count to about a third — roughly 360,000 tokens an hour instead of 1,080,000:
per hour: 360,000 × $2.50 / 1,000,000 = $0.90
per month: $0.90 × 50,000 = $45,000
A 3× cut, with no change to which model or how it is billed. $45,000 a month.
Lever 2 — route to a cheaper tier. Most uploads are routine; only a fraction need the flagship. Route 90% of clips to a Flash-tier model at $0.30 per million input tokens and keep 10% on the flagship at $2.50. The blended input price is:
blended: (0.90 × $0.30) + (0.10 × $2.50) = $0.27 + $0.25 = $0.52 per million
per hour: 360,000 × $0.52 / 1,000,000 = $0.187
per month: $0.187 × 50,000 = $9,360
That is a further ~4.8× cut — and it is independent of the first lever, because lever 1 changed the token count and lever 2 changed the price per token. About $9,400 a month.
Lever 3 — run it on the batch API. Moderation of uploads does not need an answer in the same second; a few hours is fine. The batch API runs the same jobs at a flat 50% discount:
per month: $9,360 × 0.50 = $4,680
Under $4,700 a month — and again independent, because the batch discount is a price lever that stacks on top of both the token-count cut and the model-tier choice.
The three levers together took the bill from $135,000 to about $4,680 — a 29× reduction — and none of them touched the product the user sees. A fourth lever, prompt caching of the shared moderation policy, would help only in proportion to how much of each call is the repeated policy versus the video tokens; here the video dominates, so it adds little. That last point is the discipline: stack the levers that act on your dominant cost, and do not credit yourself for levers that act on a part of the bill that is already small.
Figure 4. Three independent levers stacked on one workload: token count, then price per token, then billing mode. $135,000 a month becomes about $4,680 — a 29× cut, with no change to what the user sees.
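The whole worked example can be reproduced in a short script, which is a convenient way to rerun it with your own volumes and prices. The constants are the figures quoted in this article; the function and variable names are assumptions for the sketch, and output cost is left out because the text treats it as negligible here.

```python
TOKENS_PER_SECOND = 300        # video tokenisation figure quoted in the article
HOURS_PER_MONTH = 50_000       # platform ingest volume
FLAGSHIP, FLASH = 2.50, 0.30   # USD per million input tokens

def monthly_bill(tokens_per_hour: float, price_per_million: float,
                 hours: int = HOURS_PER_MONTH, discount: float = 0.0) -> float:
    """Monthly input cost in USD for a given token rate, price, and discount."""
    return tokens_per_hour * price_per_million / 1_000_000 * hours * (1 - discount)

full_tokens = TOKENS_PER_SECOND * 3600            # 1,080,000 tokens per hour
sampled_tokens = full_tokens / 3                  # lever 1: process less video
blended_price = 0.90 * FLASH + 0.10 * FLAGSHIP    # lever 2: route 90% to Flash

naive = monthly_bill(full_tokens, FLAGSHIP)
less_video = monthly_bill(sampled_tokens, FLAGSHIP)
routed = monthly_bill(sampled_tokens, blended_price)
batched = monthly_bill(sampled_tokens, blended_price, discount=0.50)  # lever 3

for label, value in [("naive", naive), ("less video", less_video),
                     ("routed", routed), ("batched", batched)]:
    print(label, round(value))
print(f"reduction {naive / batched:.1f}x")
# naive 135000
# less video 45000
# routed 9360
# batched 4680
# reduction 28.8x
```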
Which Levers For Which Workload
The levers are not universal; the right set depends on the shape of the work. A simple way to choose is to ask four questions in order. Does it have to be real-time? If yes, the batch API and spot GPUs are off the table, but caching, routing, and a fast serving stack are in. Does the input repeat? If yes, prompt and semantic caching are among your biggest wins; if every request is unique, they do almost nothing. Is the load steady around the clock? If yes, reserve capacity for the baseline. Is it spiky and unpredictable? If yes, serverless scale-to-zero and spot capacity stop you paying for idle hardware. Run any new feature through those four questions before you build it, and you will have picked the right family of levers before writing a line of serving code.
Figure 5. Four questions route you to the right family of levers. Two levers apply to every workload regardless of the answers: process less video, and route to the cheapest model that can do the job.
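The four questions can be written down as a small decision function. This is one possible reading of the guidance above rather than a definitive rule set: the function name, the lever labels, and the choice to offer spot capacity only when the work is not real-time are assumptions of the sketch.

```python
def pick_levers(real_time: bool, input_repeats: bool,
                steady_load: bool, spiky_load: bool) -> list[str]:
    """Map the four workload questions to candidate levers."""
    # These two apply to every workload regardless of the answers.
    levers = ["process less video", "route to the cheapest capable model"]
    if real_time:
        levers.append("fast serving stack")
    else:
        levers += ["batch API", "spot GPUs"]
    if input_repeats:
        levers += ["prompt caching", "semantic caching"]
    if steady_load:
        levers.append("reserved capacity")
    if spiky_load:
        levers.append("serverless scale-to-zero")
    return levers

# Overnight moderation with a shared policy prompt and bursty uploads.
print(pick_levers(real_time=False, input_repeats=True,
                  steady_load=False, spiky_load=True))
```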
Where Fora Soft Fits In
We build video products across conferencing, streaming and OTT, e-learning, telemedicine, and surveillance, and in every one of them the AI bill is decided by these levers, not by the model choice alone. Our practice is to instrument cost per feature before optimising anything, then work the cheap, high-leverage levers first — process less video through frame sampling and resolution control, gate expensive models behind cheap triggers on always-on feeds, route routine work to a smaller tier, and move every non-real-time job to the half-price batch API with cached shared prompts. For self-hosted workloads we serve on an efficient runtime with continuous batching and right-sized GPUs, blend reserved capacity for the steady floor with spot for interruptible batch, and push the constant, latency-sensitive work to the edge where the marginal cost falls toward zero. The verticals change which levers dominate — a surveillance feed lives or dies on event-gating, an OTT archive on precomputing artifacts once, a live conferencing feature on caching and a fast serving stack — but the method is the same: measure, stack the independent levers, and re-measure.
What To Read Next
- The cost model — what AI in video actually costs at scale
- vLLM + Triton + TensorRT — inference serving for video AI
- Distillation + quantization for edge video AI
Talk To Us · See Our Work · Download
- Talk to a video engineer — bring a feature and its current bill, and we will map which of the 25 levers apply and what each is worth on your workload: /services/ai-software-development
- See our case studies — conferencing, streaming, surveillance, telemedicine, and AI work: /portfolio
- Download the 25-lever cost-optimization checklist: all five families on one page, with the workload-matching table, the order to apply them, and the questions to ask before you optimize.
References
- Ong, Almahairi, Wu, Zhang, Willmott, Stoica, et al. (LMSYS / UC Berkeley, Anyscale, Canva) — "RouteLLM: Learning to Route LLMs with Preference Data" (arXiv:2406.18665; ICLR 2025; lm-sys/RouteLLM repository, accessed June 2026) — https://github.com/lm-sys/RouteLLM — tier 3 (peer-reviewed paper + reference implementation). Source for model routing: matching ~95% of GPT-4 quality at ~85% lower cost on MT-Bench by routing only hard queries to the expensive model, and the cascade/confidence-threshold framing.
- Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica — "Efficient Memory Management for Large Language Model Serving with PagedAttention" (arXiv:2309.06180; SOSP 2023; vLLM project, accessed June 2026) — https://arxiv.org/abs/2309.06180 — tier 3 (peer-reviewed paper). Source for PagedAttention and KV-cache management, the basis of vLLM's throughput advantage.
- Anyscale — "Achieve 23x LLM Inference Throughput & Reduce p50 Latency" (Anyscale engineering blog, accessed June 2026) — https://www.anyscale.com/blog/continuous-batching-llm-inference — tier 4 (vendor engineering blog by the framework authors). Source for the continuous-batching throughput figure (up to ~23× over a naive baseline) and the static-vs-continuous batching comparison.
- Leviathan, Kalman, Matias — "Fast Inference from Transformers via Speculative Decoding" (arXiv:2211.17192; ICML 2023, accessed June 2026) — https://arxiv.org/abs/2211.17192 — tier 3 (peer-reviewed paper). Source for speculative decoding: a small draft model proposes tokens verified in parallel by the target model, yielding 2–3× faster generation with mathematically identical output.
- OpenAI — "Batch API" guide and API pricing (OpenAI developer documentation, accessed June 2026) — https://developers.openai.com/api/docs/guides/batch — tier 3 (vendor primary). Source for the batch API: a flat 50% discount on input and output for jobs completed within a 24-hour window, and that prompt caching stacks with batch.
- Anthropic — "Prompt caching" and "Pricing" (Claude API documentation, accessed June 2026) — https://platform.claude.com/docs/en/build-with-claude/prompt-caching — tier 3 (vendor primary). Source for prompt caching economics: a cache read costs 10% of the base input price (a 90% discount on the repeated prefix); 5-minute cache writes cost 1.25× and 1-hour writes 2× the base input price.
- Zhang, Liu, et al. — "Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference" (arXiv:2510.14624, 2025, accessed June 2026) — https://arxiv.org/abs/2510.14624 — tier 3 (peer-reviewed/preprint). Source for adaptive vs uniform frame sampling and pruning temporally redundant tokens: up to 4× reduction in vision-language model time-to-first-token with minimal accuracy loss, and the quadratic cost of sequence length in transformers.
- Amazon Web Services — "Spot Instances" and "Savings from purchasing Spot Instances" (EC2 User Guide, accessed June 2026) — https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html — tier 4 (vendor documentation). Source for spot/preemptible pricing at 60–90% below on-demand, the reclamation-with-notice model, and the suitability of spot for interruptible work only.
- Google Cloud — "Committed use discounts (CUDs) for Compute Engine" (Compute Engine documentation, accessed June 2026) — https://docs.cloud.google.com/compute/docs/instances/committed-use-discounts-overview — tier 4 (vendor documentation). Source for committed-use / reserved pricing: 1-year and 3-year terms at 20–70% below on-demand (less for some GPU SKUs), with longer terms giving deeper discounts.
- Runpod — "Serverless GPU" product and pricing; Modal serverless pricing and cold-start documentation (vendor documentation, accessed June 2026) — https://www.runpod.io/product/serverless — tier 4 (vendor documentation). Source for serverless scale-to-zero (per-second billing that stops at idle) and cold-start ranges (sub-second to tens of seconds depending on model size), and the neocloud-vs-hyperscaler H100 hourly gap (~$2–3 vs ~$7).
- Fora Soft Learn — "The Cost Model — What AI in Video Actually Costs at Scale" (companion lesson 1.4, accessed June 2026) — /learn/ai-for-video-engineering/articles-ai/real-cost-of-ai-in-video-products-gemini-openai-pricing — tier 4 (internal companion, primary for the pricing/unit-economics layer). Source for the per-model 2026 prices used in the worked example (Gemini 2.5 Pro/Flash tiers), the five cost units, and the Gemini video-tokenisation fact (~300 tokens/second of footage → ~1.08M tokens per hour).


