Published 2026-06-03 · 27 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

Every AI feature in a video product — the auto-summary of a recording, the live captions, the "is this safe to publish" check, the search box that answers questions about an archive — runs a model, and that model has to be served: wrapped in software that loads it onto a graphics card and answers requests quickly, for many users at once, without falling over. Get the serving layer wrong and you either burn money on idle hardware or watch latency spike the moment a second user shows up. This lesson is for the product manager, founder, or engineering lead who has to approve an infrastructure plan, read a cloud bill, or ask a vendor why inference costs what it does — and who needs to know what vLLM, SGLang, TensorRT-LLM, Triton, and serverless GPUs actually do, where each one wins, and which questions separate a serious serving plan from hand-waving. It is the production-operations companion to the eval-rig lesson: once you can prove a video AI feature is good, this is how you run it at scale without the bill running you.

What "Inference Serving" Actually Means

Start with the words, because three of them get used loosely. A model is the trained file — a few gigabytes of numbers — that turns an input into an output: frames into detected objects, audio into text, a video-plus-question into an answer. Inference is the act of running that model on one input to get one output, as opposed to training, which is the far more expensive act of creating the model in the first place. You train a model once; you run inference on it millions of times. Inference is the part you pay for forever.

Serving is the software that sits between your product and the model and makes inference happen on demand. Without it, a model is an inert file on disk. The serving layer loads the model onto a GPU (a graphics processing unit — the specialized chip that runs AI math hundreds of times faster than an ordinary processor), opens a network endpoint your app can send requests to, and manages the awkward reality that a GPU is expensive, has limited memory, and can only do so much at once. Think of the model as a chef and the GPU as a single very fast stove: serving is the kitchen management — the system that decides which orders go on the stove, in what order, and how to cook several at once so the stove is never sitting idle.

That kitchen-management problem is the whole game. A GPU rents for real money by the hour whether it is working or not, so the entire economics of a video AI feature comes down to one number: how much useful work you get out of each GPU-hour. A good serving layer keeps the stove busy and cooks many orders in parallel; a bad one leaves it idle between orders. The same model, the same GPU, the same requests can cost you twice as much under a careless server as under a careful one — which is exactly why the choice of serving software is an engineering decision worth making on purpose, not a default to inherit.

Why Video AI Is The Hard Case: It Is A Pipeline

Most writing about inference serving quietly assumes you are running one model — a chatbot. Video almost never works that way, and that is the first thing a video team has to internalize. A single video feature is usually a pipeline: a chain of different models, each doing one job and handing its output to the next.

Take "summarize this recorded webinar with searchable highlights." Behind that one button is a sequence. First the video is decoded into frames and an audio track. The audio goes to a speech-recognition model (the kind covered in the streaming-ASR lesson) that turns it into a transcript. Selected frames go to a vision-language model that describes what is on screen — the same class of model behind the video VLMs lesson. A large language model then reads the transcript and the descriptions and writes the summary. If the feature also detects faces or objects, a computer-vision model like YOLO (from the YOLO lesson) runs on the frames too. Five model types, one feature.

Here is why that breaks the "just pick a server" instinct: these models have genuinely different shapes, and the thing that serves one well serves another badly. A language model generates text one token at a time, so its bottleneck is memory for the running context — which is exactly what vLLM is built to manage. A detection model takes a fixed-size image and returns a fixed-size answer in one shot, so its bottleneck is raw matrix speed — which is what TensorRT is built for. A speech model streams audio through a specialized engine of its own. Trying to run all five on one server tuned for one of them is how teams end up with a feature that is fast in the demo and ruinous at scale. The job is to match each stage to the server that fits it, and to know where the handoffs are.

A left-to-right pipeline diagram of a video AI feature broken into five stages — decode, computer-vision detection, speech recognition, vision-language description, and large-language-model summary — with each stage labeled by the kind of model it runs and the inference server best suited to it, showing that one video feature spans several model types and several serving choices Figure 1. One video feature, many models. A summarize-and-search feature chains decode, detection, transcription, description, and summarization — and each stage wants a different inference server.

The Two Ideas That Make A Server Fast

Before comparing products, understand the two ideas that separate a fast LLM server from a slow one, because every product below is really competing on how well it does these two things. Both are about not wasting the GPU.

The first idea solves a memory problem. When a language model generates text, it keeps a running scratchpad of everything it has read and written so far, called the KV cache (you can ignore the acronym; think of it as the model's short-term memory for the current conversation). This scratchpad grows with every token and has to live in the GPU's limited memory. Older servers reserved a big fixed block of memory per request up front — enough for the longest possible answer — even though most answers are short. The result was waste: the published research found that older systems threw away 60–80% of their KV-cache memory to this over-reservation and fragmentation, and wasted memory means fewer requests fit on the card at once, which means lower throughput.

PagedAttention is the fix, and its idea is borrowed straight from how an operating system manages a computer's memory. Instead of one big reserved block per request, it chops the scratchpad into small fixed-size pages and hands them out only as each request actually needs them, like a library lending shelves one at a time instead of reserving a whole wing for every reader. The published result is dramatic: waste drops from 60–80% to under 4%, and that recovered memory lets far more requests share the card. This is the technique that launched vLLM, and it is named for the paper that introduced it.

The second idea solves a scheduling problem. Naive servers do static batching: they group a fixed set of requests, run them together, and wait for the slowest one to finish before starting the next group — so a single long answer holds up everyone behind it, and the GPU sits half-idle. Continuous batching fixes this by treating the batch as a moving queue: the instant any request in the batch finishes, a new one slots into its place, so the GPU never waits. In practice this lifts GPU utilization from a typical 30–40% under static batching to 75–90% under continuous batching — the difference between a stove that is half-cold and one that is always cooking.

These two ideas matter more for video than for chat, because video inputs are token-heavy. When a vision-language model looks at a frame, it does not see a small amount of text — it converts the image into a large block of tokens. A single sampled frame can become a few hundred tokens, so a judge or summarizer that samples sixteen frames is carrying thousands of image tokens in its KV cache before it reads a single word of the prompt:

16 frames  ×  256 tokens per frame  =  4,096 image tokens of context

Four thousand tokens of scratchpad, per request, just for the pictures — and that is the memory PagedAttention is recovering. The more your video feature leans on frames, the more the serving layer's memory management decides whether you fit eight requests on a card or eighty.

A two-panel before-and-after diagram contrasting naive static serving with vLLM-style serving — the left panel shows static batching reserving large fixed memory blocks per request with 60 to 80 percent of the KV cache wasted and the GPU at 30 to 40 percent utilization, the right panel shows PagedAttention handing out small memory pages on demand with under 4 percent waste and continuous batching keeping the GPU at 75 to 90 percent utilization for two-to-four-times the throughput Figure 2. The same GPU, two serving strategies. Static batching reserves memory it never uses and lets the card idle; PagedAttention plus continuous batching recovers the memory and keeps the card busy — 2–4× the throughput.

vLLM — The Open Default

vLLM is the open-source inference server that those two ideas built, and in 2026 it is the sensible starting point for serving almost any language or vision-language model. It came out of UC Berkeley as the system attached to the PagedAttention paper, and it has since become the most widely used open server, with the broadest model support and the largest contributor community of any option here.

Three things make it the default. First, breadth: it runs the long tail of open models — the LLaMA family, Qwen, Mistral, and the vision-language models a video team actually needs, including Qwen2.5-VL, LLaVA, InternVL, and MiniCPM-V — and it runs on more than just NVIDIA hardware, with support for AMD, Google TPUs, AWS Trainium, and Intel accelerators. Second, a familiar front door: vLLM exposes an OpenAI-compatible API, meaning the same request format that talks to OpenAI's cloud talks to your self-hosted vLLM by changing one setting, the server address. Code written against the OpenAI SDK switches to your own model with a one-line change, which is why migration off a paid API to self-hosting is usually a small job. Third, it ships ready to run: there are official Docker images, Kubernetes Helm charts, and a re-architected internal engine (the V1 engine, the default since 2025) that made the scheduler and multimodal support faster and cleaner.

For a video product the practical pattern is simple. You pull the vLLM Docker image, point it at a model, and you get a running endpoint that serves your VLM judge, your captioner, or your summarizer with PagedAttention and continuous batching switched on by default. The vllm docker route — a container you can run identically on a laptop, a rented GPU, or your own datacenter — is the most common way teams deploy it, precisely because the same image behaves the same everywhere.

vLLM vs Ollama — The Production-vs-Laptop Split

The question that comes up first, because both tools are everywhere, is vLLM vs Ollama — and the honest answer is that they are built for different jobs and the comparison is really about where the model runs, not which tool is "better."

Ollama is a tool for running a model on one machine for one user — your laptop, a workstation, an Apple Silicon Mac. It is wonderful for exactly that: install it, pull a model, and you are talking to it in a minute, with no GPU cluster and no configuration. It wraps an efficient single-machine engine and optimizes for the case of one person, one model, talking to it themselves. For prototyping, local development, and demos, it is the fastest path there is.

The split appears the moment you have more than one user. Under real concurrency, vLLM's batching and memory management pull far ahead: in 2026 benchmarks, vLLM served roughly 2.3× more tokens per second than Ollama with eight concurrent users, and the gap widened with load — at fifty concurrent users one published test measured vLLM holding around 920 tokens per second against Ollama's roughly 155, with vLLM's tail latency a fraction of Ollama's. The reason is structural, not a tuning accident: Ollama is not built to pack many simultaneous requests onto a card, and vLLM is built for almost nothing else.

Common mistake. Shipping Ollama to production because it was easy to set up in development. It will work in the demo with one tester and then fall over the first time real traffic arrives — latency climbs, throughput plateaus, and the GPU bill makes no sense because the card is mostly idle between requests. Ollama is the right tool for the desk and the wrong tool for the fleet. The mature pattern is to prototype on Ollama and graduate the workload to vLLM (or one of the servers below) behind a load balancer once more than a handful of users share a GPU. Use each where it belongs; do not let the development tool become the production architecture by inertia.

A grouped bar chart comparing throughput in tokens per second for vLLM versus Ollama at two concurrency levels — at eight concurrent users vLLM leads by roughly 2.3 times, and at fifty concurrent users vLLM reaches about 920 tokens per second while Ollama plateaus near 155, with a caption noting Ollama is built for one user on one machine and vLLM for many users sharing a GPU Figure 3. vLLM vs Ollama under concurrency. The two tools are built for different jobs — Ollama for one user on one machine, vLLM for many users sharing a GPU — and the throughput gap widens as load rises.

SGLang — The Prefix-Caching Challenger

vLLM is the default, not the only serious open server. SGLang is the closest competitor in 2026, and it is worth knowing because of one trick that is especially valuable for video work. SGLang's signature feature is aggressive prefix caching: when many requests share the same beginning — the same long system prompt, the same rubric, the same retrieved context — SGLang reuses the already-computed scratchpad for that shared prefix instead of recomputing it for every request.

That pattern is everywhere in video AI. A VLM judge grades every output against the same long rubric. A meeting copilot answers many questions against the same retrieved transcript. A moderation pass runs the same policy prompt over thousands of clips. Whenever the expensive front of the prompt is shared, SGLang's prefix caching can pull far ahead: published 2026 benchmarks put SGLang at roughly a 29% throughput edge over vLLM on an H100 for general workloads, rising to as much as 6.4× on prefix-heavy workloads like retrieval-augmented question answering and multi-turn chat. The trade-off is ecosystem maturity: vLLM still has broader model and hardware coverage and a larger community, so the common advice holds — start on vLLM, and reach for SGLang when your workload is genuinely prefix-heavy and you have measured that it pays.

TensorRT-LLM — Maximum NVIDIA Speed, And The 2026 News

When the goal is the absolute lowest latency and highest throughput on NVIDIA hardware and nowhere else, the answer is TensorRT-LLM, NVIDIA's own inference library. It wins on raw speed — published 2026 benchmarks put it 15–30% above vLLM in throughput on H100 cards, and it supports speculative decoding (a technique that drafts several tokens ahead and verifies them in a batch) for up to several times faster generation on the right models. If you are committed to NVIDIA and the last 20% of performance is worth real engineering effort, this is the ceiling.

The historical catch was setup cost. TensorRT-LLM traditionally required compiling the model into a hardware-specific "engine" — a build step that could take tens of minutes and had to be redone whenever the model, the GPU type, or the settings changed. For a team that swaps models monthly that compile tax was painful, and it is the reason TensorRT-LLM was long considered the expert option.

That is the part the tensorrt-llm news of 2025–2026 changes, and it is worth knowing because most older comparison articles still describe the painful version. NVIDIA rebuilt TensorRT-LLM around a PyTorch-native backend that does not require building an engine at all — you can run a standard PyTorch model directly through a high-level Python interface (the LLM API), which collapses the old multi-step build into something close to vLLM's "point it at a model and go." A companion feature, AutoDeploy, takes an off-the-shelf PyTorch model and applies the inference optimizations automatically. The practical effect: TensorRT-LLM in 2026 is far less of an expert-only tool than its reputation suggests, and the gap in setup effort against vLLM has narrowed sharply. The vendor lock-in to NVIDIA remains the real trade-off — TensorRT-LLM does not run on AMD, TPUs, or anything else — so the decision is less "can we tolerate the compile step" and more "are we NVIDIA-only by choice."

Triton And NVIDIA Dynamo — Running Many Models On One Fleet

Everything so far serves one model well. The video pipeline problem from earlier — five model types behind one feature — needs a layer above the single-model server, and that layer is Triton Inference Server.

Triton's job is to run many different models, of many different kinds, on one shared GPU fleet, behind one endpoint. It speaks multiple backends at once — it can host a TensorRT engine, a PyTorch model, an ONNX model, an OpenVINO model, and even a vLLM instance side by side in the same server process — and it executes them concurrently on the same cards. Its most video-relevant feature is the ensemble: a pipeline defined inside Triton so that one client request flows through several models in sequence — decode, then detect, then describe — without a round-trip back to your application between each step. For the multi-model video feature, that is the native shape: a Triton ensemble is the pipeline in Figure 1, expressed as server configuration. Triton is actively maintained in 2026 (the December 2025 release added, among other things, improvements to its vLLM backend and ensemble configuration), and it remains the production standard when a fleet has to serve an LLM, an embedding model, and a vision encoder at the same time.

The bigger 2026 development sits above even Triton. At its GTC 2025 conference NVIDIA announced Dynamo, an open-source, datacenter-scale distributed inference framework it describes as the successor to Triton. Dynamo is built for the largest deployments — thousands of GPUs — and its headline ideas are disaggregated serving (splitting a model's two phases, the initial "read the prompt" prefill and the token-by-token "write the answer" decode, onto different GPUs so each can be tuned and scaled independently) and KV-cache-aware routing (sending each request to the GPU that already holds the relevant scratchpad). NVIDIA reported up to 30× more requests served on a reasoning model on its newest Blackwell hardware, and — importantly for everyone else — Dynamo is built to orchestrate vLLM, SGLang, and TensorRT-LLM underneath rather than replace them. For most video teams Dynamo is over the horizon — it is a fleet-of-thousands tool — but it is the direction the serving world is moving, and it tells you where the single-node servers above are heading.

Serving The Models That Aren't LLMs

The LLM-and-VLM servers get the attention, but two stages of the video pipeline run on entirely different engines, and a serving plan that ignores them is incomplete.

Speech recognition is the first. The transcription stage usually does not run on vLLM at all — it runs on a specialized engine. The most common production choice is faster-whisper, a re-implementation of OpenAI's Whisper model on top of CTranslate2, a high-performance inference engine with custom GPU code and built-in 8-bit compression. The speedups are large and very relevant to anyone transcribing archives of video: faster-whisper is about 4× faster than the original Whisper at the same accuracy, and the batched version reaches roughly 12.5× faster, processing long files at hundreds of times real-time. One published 2026 figure has a single mid-range L40S card handling 100+ concurrent transcription streams; a bank of consumer cards can transcribe a thousand hours of audio in about half an hour. For a product that has to caption or transcribe at volume, the ASR engine choice moves the bill as much as the LLM server does.

Computer-vision models are the second. Detection and segmentation models — YOLO, SAM, pose estimators — take a fixed-size image and return a fixed-size result in a single pass, with no growing scratchpad. Their bottleneck is pure matrix throughput, and the standard way to serve them fast is to convert them to a TensorRT engine (the underlying optimizer beneath TensorRT-LLM) and host them, often through Triton, alongside the rest of the pipeline. This is where the model-format decisions from the model-artifact-formats lesson come home: exporting a detector to ONNX and compiling it to TensorRT is the routine path to real-time CV serving. The lesson for the serving plan is that "what server do we use" has at least three answers in a single video feature — a text-generation server for the language models, an ASR engine for speech, and a CV runtime for vision — and the architecture is how they connect.

Serverless GPUs — When You Shouldn't Run A Server At All

There is one more option, and for many video products it is the right first move: do not manage a GPU server yourself. Serverless GPU platforms — Modal, Replicate, RunPod, and others — rent you inference by the second and scale to zero, meaning a GPU spins up when a request arrives, runs it, and shuts down, so with no traffic you pay nothing. For the spiky, unpredictable load that describes most early video features — a summary feature used in bursts, an occasional moderation check — this is far cheaper and simpler than keeping a GPU rented and idle around the clock.

The cost of scale-to-zero is the cold start: the wait while a GPU powers up and loads the model. Published 2026 figures put this at a few seconds for small models and 15–30 seconds for 7-billion-parameter-plus models — fine for a background job, fatal for a live interaction. The platforms differ on how they fight it. Modal's GPU-memory snapshotting (capturing the loaded model's state so it can be restored fast) has cut some cold starts dramatically; RunPod competes hardest on raw price (an H100 around $7 an hour at the time of writing); Replicate keeps popular models warm so standard ones have effectively no cold start. The decision rule is about traffic shape, covered more fully in the latency and deployment lesson: steady, high-volume traffic justifies a dedicated vLLM cluster you keep warm; bursty, low-volume, or unpredictable traffic usually belongs on serverless until the numbers say otherwise.

The Decision — Which Server For Which Job

Put the options side by side. There is no single winner; there is a right tool per stage and per traffic shape. The table below is the decision in one place.

Serving choice Best for Hardware Setup effort Watch out for
vLLM The default for LLMs and VLMs — multi-user production serving NVIDIA, AMD, TPU, Trainium, Intel Low — Docker, OpenAI-compatible None major; the safe starting point
Ollama One user, one machine — prototyping and local dev Any, incl. Apple Silicon Lowest Not built for concurrency; do not ship to a fleet
SGLang Prefix-heavy workloads — shared rubric, RAG, multi-turn Mainly NVIDIA Low–medium Smaller ecosystem than vLLM
TensorRT-LLM Maximum speed on NVIDIA; latency-critical NVIDIA only Medium (lower since the PyTorch backend) Vendor lock-in to NVIDIA
Triton (→ Dynamo) Many models on one fleet; pipeline ensembles NVIDIA-centric Medium–high Overkill for a single model
Serverless (Modal/RunPod/Replicate) Spiky or low-volume traffic; scale to zero Provider's GPUs Lowest ops Cold starts on large models

The shorthand most teams converge on: start on vLLM for anything that generates text or reads frames; serve speech on faster-whisper and detection on TensorRT; reach for SGLang when the workload is prefix-heavy, for TensorRT-LLM when you are NVIDIA-only and chasing the last of the latency, and for Triton when one fleet must serve several models as a pipeline. Run it all on serverless while traffic is small and spiky, and move to a kept-warm dedicated cluster once volume makes the math flip.

A decision diagram for choosing an inference serving stack for a video AI feature — a top question splits on what the stage does (generate text or read frames, recognize speech, or detect or segment), routing language and vision-language work to vLLM with SGLang and TensorRT-LLM as branches, speech to faster-whisper, and computer vision to TensorRT, with a second axis on traffic shape routing spiky low-volume traffic to serverless and steady high-volume traffic to a dedicated cluster, and a Triton or Dynamo layer shown above when many models share one fleet Figure 4. Choosing the serving stack. The first question is what the stage does; the second is the shape of the traffic. Most video features end up using more than one of these.

A Worked Example — What Continuous Batching Is Worth

Make the money concrete, because the whole argument for caring about the serving layer is the bill. Suppose you serve a summarization model on a single rented GPU that costs, to keep the arithmetic clean, $3.00 an hour. Under naive static batching, the card sits idle much of the time waiting for slow requests, and runs at about 35% utilization. Switch to a server with PagedAttention and continuous batching and the same card runs at about 85% utilization.

Output scales roughly with how busy the card is, so the throughput multiplier is the ratio of utilizations:

85% busy  ÷  35% busy  =  2.4×  more tokens from the same GPU

Now turn that into a cost per unit of work. Say the slow setup produces 1.0 million tokens in an hour. The cost per million tokens is just the hourly rent divided by the tokens produced:

slow:  $3.00 / hr  ÷  1.0M tokens/hr  =  $3.00 per million tokens
fast:  $3.00 / hr  ÷  2.4M tokens/hr  =  $1.25 per million tokens

The serving software alone — same model, same GPU, same hardware bill — cut the cost per token by 58%. That is the number to hold onto. You did not buy a faster card or a smaller model; you changed the kitchen management. This is also why the cost discussions in the cost-of-AI lesson and the cost-optimization lesson treat the serving layer as a first-class lever, and why a model made smaller through the techniques in the distillation and quantization lesson compounds with a good server rather than substituting for it.

Where Fora Soft Fits In

We build video products across conferencing, streaming and OTT, e-learning, telemedicine, and surveillance, and the AI features in them are pipelines, not single models — a recording becomes a transcript, a set of frame descriptions, a summary, and a moderation decision, each from a different model. Our serving practice follows the shape of that pipeline: language and vision-language stages on vLLM behind an OpenAI-compatible endpoint so the product code does not care whether the model is ours or a vendor's; speech on a CTranslate2-based Whisper engine sized to the transcription volume; detection and segmentation exported to TensorRT for real-time work; and the whole chain wired together so a single request flows through it without wasteful round-trips. We start features on serverless GPUs while traffic is small and spiky and move them to kept-warm dedicated capacity when the volume makes that cheaper, and we treat the cost per token, not just the latency, as a number worth engineering down. The verticals change the models and the latency targets; the method — match each stage to the server that fits it, and keep the GPU busy — does not.

What To Read Next

Talk To Us · See Our Work · Download

  • Talk to a video engineer — design the inference-serving stack for the AI features in your video product, from a single model to a multi-model pipeline: /services/ai-software-development
  • See our case studies — conferencing, streaming, surveillance, telemedicine, and AI work: /portfolio
  • Download the inference-serving decision checklist — the five serving choices, the two ideas that make a server fast, the per-stage decision for a video pipeline, and the questions to ask before you commit, on one page: Download the checklist

References

  1. Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica — "Efficient Memory Management for Large Language Model Serving with PagedAttention" (arXiv:2309.06180; SOSP 2023, accessed June 2026) — https://arxiv.org/abs/2309.06180 — tier 3 (peer-reviewed paper by the system's authors). Source for PagedAttention, the 60–80% KV-cache waste in prior systems reduced to under 4%, and the measured 2–4× throughput gain over the previous state-of-the-art servers (FasterTransformer, Orca) at the same latency. Per §4.3.2, where popular articles quote "up to 24×," that figure is from the vLLM launch blog against a naive HuggingFace Transformers baseline; the peer-reviewed paper's own measurement against production servers is 2–4×, and this article uses the paper's number as primary.
  2. vLLM Project — "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" (vLLM blog, 2023–2026, accessed June 2026) — https://blog.vllm.ai/2023/06/20/vllm.html — tier 3 (first-party engineering blog). Source for the "up to 24× over HuggingFace Transformers" launch figure (correctly attributed as against a naive baseline), the continuous-batching design, and vLLM's positioning. Used alongside, and subordinate to, the SOSP paper for the throughput framing.
  3. vLLM Project — "Supported Models," "Multimodal Inputs," and deployment documentation (docs.vllm.ai, accessed June 2026) — https://docs.vllm.ai/en/latest/models/supported_models/ — tier 4 (official project documentation). Source for vLLM's vision-language model support (Qwen2.5-VL, LLaVA, InternVL, MiniCPM-V), the OpenAI-compatible server endpoints, the official Docker images and Helm charts, the V1 engine default, and multi-hardware support (NVIDIA, AMD, TPU, Trainium, Intel).
  4. NVIDIA — "TensorRT-LLM Documentation," "LLM API Introduction," "PyTorch Backend," and "Automating Inference Optimizations with TensorRT-LLM AutoDeploy" (NVIDIA developer docs and technical blog, 2025–2026, accessed June 2026) — https://nvidia.github.io/TensorRT-LLM/ — tier 3 (vendor's own technical documentation). Source for the 2026 shift to a PyTorch-native backend that does not require building an engine (available v0.17+), the high-level LLM API, AutoDeploy, speculative decoding support, and NVIDIA-only hardware scope.
  5. NVIDIA — "Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework" and "NVIDIA Dynamo Open-Source Library Accelerates and Scales AI Reasoning Models" (NVIDIA Technical Blog + Newsroom, GTC 2025, accessed June 2026) — https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/ — tier 3 (vendor primary announcement). Source for Dynamo as the announced successor to Triton, disaggregated prefill/decode serving, KV-aware request routing, the up-to-30× requests-served figure on DeepSeek-R1 on Blackwell, and Dynamo's compatibility with vLLM, SGLang, and TensorRT-LLM.
  6. NVIDIA — "Triton Inference Server documentation," "Ensemble Models," and "Release Notes rel-25.12" (NVIDIA Triton docs, December 2025, accessed June 2026) — https://docs.nvidia.com/deeplearning/triton-inference-server/ — tier 4 (official project documentation). Source for Triton's multi-backend concurrent model execution (TensorRT, PyTorch, ONNX, OpenVINO, vLLM backends), the ensemble pipeline mechanism, and confirmation of active 2026 maintenance including vLLM-backend and ensemble-configuration improvements.
  7. SYSTRAN / MobiusML / Modal — "faster-whisper" (CTranslate2) repository and batched-Whisper engineering writeups (2024–2026, accessed June 2026) — https://github.com/SYSTRAN/faster-whisper — tier 4 (reference implementation + first-party engineering). Source for faster-whisper as a CTranslate2 wrapper roughly 4× faster than the original Whisper at equal accuracy, the batched-Whisper ~12.5× speedup and hundreds-of-times-real-time figures, and the L40S 100+-concurrent-stream production number for the ASR stage.
  8. Red Hat — "vLLM vs. Ollama: When to use each framework" (Red Hat AI topic page, 2026, accessed June 2026) — https://www.redhat.com/en/topics/ai/vllm-vs-ollama — tier 4 (vendor engineering reference). Source for the architectural distinction between Ollama (single-machine, single-user, llama.cpp-class engine) and vLLM (multi-tenant production serving), and the guidance to prototype on Ollama and graduate to vLLM under concurrency.
  9. Spheron / Particula / Northflank — "vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)," "SGLang vs vLLM," and "vLLM vs TensorRT-LLM and how to run them" (vendor engineering benchmarks, 2026, accessed June 2026) — https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/ — tier 6 (engineering benchmark references). Source for the 2026 comparative throughput figures: SGLang ~29% over vLLM on H100 and up to 6.4× on prefix-heavy workloads; TensorRT-LLM 15–30% over vLLM on H100; the vLLM-vs-Ollama 2.3× (8 users) and ~920 vs ~155 tok/s (50 users) measurements. Treated as competitor/benchmark references, not as primary sources; figures will date and should be re-verified at refresh.
  10. RunPod / Modal / BuildMVPFast / GPU Tracker — "Scale-to-Zero Serverless GPUs: Modal vs RunPod vs Replicate" and serverless GPU comparisons (vendor + engineering references, 2026, accessed June 2026) — https://www.buildmvpfast.com/blog/scale-to-zero-serverless-gpu-modal-runpod-ai-hosting-2026 — tier 6 (engineering references). Source for serverless GPU scale-to-zero economics, cold-start ranges (2–5 s small models, 15–30 s for 7B+), Modal GPU-memory snapshotting, Replicate warm pools, and indicative RunPod H100 pricing (~$7/hr). Pricing and cold-start numbers are time-sensitive.