Published 2026-05-17 · 13 min read · By Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you run a streaming service, a video conferencing app, a surveillance platform, an OTT channel, or a telemedicine product, hardware acceleration is the single biggest lever you have on your AWS bill and on your end-user latency. A software encoding farm that costs as much as three engineers each month just to keep up with peak load can be replaced by a single rack of NETINT Quadra cards that costs ten thousand dollars and idles half the day. A live stream that buffers because the CPU encoder cannot keep up with 4K at 60 frames per second runs comfortably on the video engine of a single 100-dollar GPU. But picking the wrong tier (a gaming GPU for a 24/7 broadcast workload, or a NETINT VPU for a one-off cinema master) costs you either quality or money, and sometimes both.
This article gives founders, product managers, and engineers the same map. We start with the three classes of acceleration chip and what makes each one fast. We then walk through the five vendors that you will actually meet in a product specification document. We end with a decision tree and a download you can hand to your DevOps team. Everywhere we cite a number, we cite the source and the year — encoder benchmarks age fast, and last year's verdict on AV1 hardware quality is already wrong.
What "hardware acceleration" actually means
Before any vendor names, settle the mental model. A video encoder turns a stream of raw pixel frames into a compressed bitstream by running the same five steps every modern codec uses, covered in detail in our article on hybrid codec architecture: split each frame into blocks, predict each block from already-coded data, transform the prediction error to the frequency domain, quantize the result, and entropy-code the surviving numbers. Each step is a pile of arithmetic. The encoder's job is to try many combinations — many predictions, many block sizes, many quantizer settings — and keep the combination that gives the best quality per bit. That search is called rate-distortion optimization, or RDO, and it is what burns most of the encoder's compute.
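The search described above is usually written as a Lagrangian cost. The following is a sketch in common textbook notation; the symbols are not taken from this article. For each candidate combination of coding choices m in the candidate set M, the encoder measures the distortion D(m) and the bit cost R(m), then keeps the candidate with the lowest combined cost, where the multiplier λ sets how many bits one unit of quality is worth.

```latex
J(m) = D(m) + \lambda \, R(m), \qquad m^{*} = \arg\min_{m \in \mathcal{M}} J(m)
```

The larger the candidate set M, the more arithmetic the encoder spends per block, which is why the size of that set is the main dial between encoding speed and quality per bit.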
A general-purpose CPU runs that search in software. It is flexible — you can change the algorithm, swap presets, try a new mode — but it is slow, because every block decision goes through the same arithmetic-logic units that also run your operating system. A hardware-accelerated encoder runs the same search on silicon built for one job. The chip has dedicated motion-estimation engines, fixed-function transforms, hardware entropy coders, and pipelines that move pixels straight from memory to the encode block without crossing the operating-system boundary. That dedicated path is what makes hardware acceleration fast — and what limits the quality.
There are three classes of acceleration chip you will see in 2026:
A GPU video engine is a fixed-function block bolted onto a graphics processor — NVIDIA NVENC, AMD VCN, Intel's Xe Media Engine. The chip's main job is gaming and AI compute; the video engine is a side feature that ships for free with every card. You can buy a single GPU for 300 dollars and get a real-time 1080p encoder; you can rent one in the cloud by the hour.
A CPU-integrated media engine is the same idea but living inside the CPU package. Intel's QuickSync (now part of the Xe architecture on recent chips) and Apple's VideoToolbox are the two examples that matter. You pay nothing extra: the engine ships inside any modern laptop or workstation chip.
A purpose-built VPU or ASIC is a chip that does only video — no graphics, no general compute, no display output. NETINT's Quadra and AMD's Alveo MA35D are the production-grade examples in 2026. A VPU costs more than a GPU per card but encodes ten to forty times more 1080p streams per watt, and it ships in 2.5-inch U.2 form factors that drop into a standard server bay. 1
Figure 1. The three classes of hardware acceleration ranked by density and target workload. A consumer GPU's NVENC engine handles one stream at very high quality; a NETINT Quadra T2A in the same server bay handles thirty-two.
The simplest way to remember the trade-off: a CPU encoder thinks slowly and carefully, a GPU encoder thinks fast and roughly, and a VPU thinks fast and roughly but in parallel across dozens of streams at once.
Why hardware is faster — and what it gives up
The speed gap comes from three places. First, fixed-function silicon: motion estimation, the discrete cosine transform, the deblocking filter, and the entropy coder are all wired into the chip. The CPU runs the equivalent algorithms as software loops; the hardware runs them as dedicated pipeline stages that finish in a handful of clock cycles. Second, memory locality: the encode block sits next to the video memory that holds the decoded reference frames, so the encoder never pays the cost of moving pixels across the PCI Express bus during a search. Third, massive parallelism: a single NETINT Codensity G5 ASIC processes blocks from many streams concurrently, where a CPU thread can only work on one block at a time.
The quality gap comes from one place: the search space is smaller. A software AV1 encoder like SVT-AV1 at preset 6 tries hundreds of block partitions, dozens of prediction modes, and many quantizer choices per block. A hardware AV1 encoder must reach a decision in the few clock cycles it has before the next block arrives, so it trims the search aggressively — fewer block sizes considered, fewer reference frames, fewer entropy-coding refinements. The Moscow State University 2025 codec study evaluated AV1 transcoders from NETINT, NVIDIA, and AMD and found that all three produced output that lagged the best software HEVC encoders on objective quality, even though the hardware was running a "newer, more efficient" codec. 2 A Gianni Rosato comparison published in late 2025 ranked SVT-AV1 preset 8 ahead of every hardware AV1 encoder tested, with preset 6 pulling further ahead. 3
What you actually lose in quality terms is one to four VMAF points at the same bitrate, depending on codec and content. For a per-title master that lives on disk for years, that gap costs you tens of percent more storage and bandwidth over the lifetime of the file, which is why a streaming service still encodes its long-tail catalogue in software. For a live channel that disappears in seconds, the gap is invisible to viewers and the speed-cost gain is the only thing that matters.
NVIDIA NVENC — the default video engine
NVIDIA's NVENC is the engine you will meet most often, because almost every recent NVIDIA GPU ships with it. As of mid-2026 the current generation is 9th-generation NVENC on the Blackwell architecture (RTX 50-series consumer cards and the RTX PRO Blackwell workstation and server cards), introduced with Video Codec SDK 13 in early 2025. 4 Blackwell NVENC adds two big features: 4:2:2 chroma support (broadcast-grade colour, previously available only on professional CPU encoders), and an AV1 Ultra-High-Quality mode that NVIDIA's own measurements put within a few percent of software AV1 quality while running about three times faster than software AV1. 5 An RTX 5090 carries three NVENC engines that together push past 8K at 240 frames per second, and it exports a typical Premiere project sixty percent faster than the previous-generation RTX 4090. 6
In product terms, you reach NVENC from FFmpeg by selecting the right encoder name — h264_nvenc, hevc_nvenc, or av1_nvenc — and optionally adding the -hwaccel cuda and -hwaccel_output_format cuda flags to keep decoded frames in GPU memory across the whole pipeline. 7 A minimal command line looks like this:
# 1080p H.264 encode on an NVIDIA GPU, with hardware decode kept on the GPU.
# -preset p5 is a balanced quality/speed point; p7 is slowest/highest quality.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
-i input.mp4 -c:v h264_nvenc -preset p5 -b:v 6M output.mp4
The numbers to remember for budgeting: a single consumer-tier NVENC encodes roughly eight to twelve concurrent 1080p30 streams; a professional card with multiple NVENC engines pushes that to twenty-plus; a Blackwell card with three NVENCs handles eight 4K streams or thirty 1080p streams comfortably. NVIDIA's driver also enforces a "consumer concurrent session limit". It used to cap consumer cards at three simultaneous encodes and has been raised in later driver releases rather than removed outright, so check NVIDIA's Video Encode and Decode GPU Support Matrix for the limit that applies to your card and driver before you size a fleet around consumer cards. 8
Intel QuickSync and the Xe Media Engine
Intel's QuickSync Video (QSV) is the video engine inside every modern Intel CPU and the Arc discrete GPUs. The 2026 flagship is the Xe Media Engine in the Arc B-series ("Battlemage") cards, in particular the Arc B580 released in December 2024, which delivers roughly twice the AV1 encoding throughput of the previous-generation A770. 9
QuickSync's strongest feature for product teams is its cost-to-quality position. The Arc B580 lands a 12 GB card under 250 dollars and outputs AV1 at quality comparable to entry-level NVENC AV1 while pulling about half the power. 10 On laptop CPUs, the integrated QSV engine handles a real-time 1080p H.264 encode while consuming under five watts — that is why Zoom, Teams, and Google Meet hit QSV first on every Intel laptop and only fall back to software encoding when no hardware encoder is available.
A minimal QSV FFmpeg command, on Linux with a recent Intel driver:
# 1080p HEVC encode using Intel QuickSync. -global_quality is QSV's CRF equivalent.
ffmpeg -hwaccel qsv -hwaccel_output_format qsv \
-i input.mp4 -c:v hevc_qsv -global_quality 23 output.mp4
The cautions: Intel's AV1 hardware encoder works very well for video-on-demand recording but shows higher variance for live streaming — the Arc B580 review benchmarks recorded occasional encoder overloads on long streams that did not appear on the H.264 path. 11 If your workload is "record a forty-minute lecture and upload", AV1 on QSV is the cheapest professional-quality option on the market. If your workload is "stream a live concert for six hours", pin H.264 or HEVC for stability and re-encode to AV1 in a second pass.
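The two-pass advice above can be sketched as a pair of shell functions. This is a minimal illustration, not a tuned pipeline: the function names, file arguments, and both bitrates are placeholders, and the second function assumes an FFmpeg build (6.0 or later) that includes the av1_qsv encoder and hardware that supports it.

```shell
#!/bin/sh
# Pass 1: the stable live path. HEVC on QuickSync; audio is passed through.
# $1 = live input (file or URL), $2 = recording to write.
live_hevc() {
  ffmpeg -hide_banner -hwaccel qsv -hwaccel_output_format qsv \
    -i "$1" -c:v hevc_qsv -b:v 6M -c:a copy "$2"
}

# Pass 2: the offline archive. Re-encode the finished recording to AV1.
# The 4M target is a placeholder, not a figure from the article.
archive_av1() {
  ffmpeg -hide_banner -hwaccel qsv -hwaccel_output_format qsv \
    -i "$1" -c:v av1_qsv -b:v 4M -c:a copy "$2"
}

# Usage:
#   live_hevc "$STREAM_URL" concert_live.mp4
#   archive_av1 concert_live.mp4 concert_av1.mp4
```

Keeping the two passes separate means an AV1 encoder overload can only ever delay the archive copy; it cannot interrupt the stream viewers are watching.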
Apple VideoToolbox
Apple's VideoToolbox is the macOS and iOS framework that exposes the dedicated media engines built into every Apple Silicon chip. The 2026 lineup matters for product teams in two specific ways. First, hardware AV1 decoding is broadly available — M3 Macs and newer, the M4 iPad Pro, and the iPhone 15 Pro and newer all decode AV1 in hardware. 12 Second, hardware AV1 encoding is still narrow — only the highest-end M4 Ultra, M5 Pro, and M5 Max chips include an AV1 hardware encoder. 13 Standard M3 and base M4 Macs still encode AV1 in software, which means a Mac mini build farm doing AV1 will burn CPU cycles rather than media-engine cycles.
In practice, VideoToolbox shines when you are encoding HEVC. An M2 or later chip encodes 4K HEVC at roughly real-time speed while staying under fifteen watts, which is why the Mac mini has quietly become a popular small-scale transcoding node for indie studios and content delivery vendors who do not want to run a Linux server room. The framework also exposes hardware ProRes encoding, which no other vendor offers — relevant for production pipelines that originate footage on Apple cameras and want to keep mastering in ProRes 4444 without paying for cloud compute.
The minimal FFmpeg call:
# 4K HEVC encode on Apple Silicon. -q:v is VideoToolbox's constant-quality slider on a 1-100 scale (higher = better quality, larger file).
ffmpeg -i input.mov -c:v hevc_videotoolbox -q:v 50 -tag:v hvc1 output.mp4
The -tag:v hvc1 is the one easy-to-miss flag: without it, the output HEVC will play in VLC and FFmpeg but not in Safari or QuickTime, because Apple's player requires the hvc1 codec tag rather than the more permissive hev1.
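For files that were already written with the hev1 tag, a re-encode is usually unnecessary. The sketch below reads the tag with ffprobe and rewrites it with a stream copy. The function names are hypothetical, and the approach assumes the muxer can carry the HEVC parameter sets in the sample description, so test the result in Safari or QuickTime before relying on it.

```shell
#!/bin/sh
# Print the codec tag of the first video stream, for example hvc1 or hev1.
video_tag() {
  ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_tag_string \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Rewrite the tag without re-encoding when a file carries hev1.
# $1 = input file, $2 = output file.
retag_hvc1() {
  if [ "$(video_tag "$1")" = "hev1" ]; then
    ffmpeg -hide_banner -i "$1" -c copy -tag:v hvc1 "$2"
  else
    echo "no retag needed: $1"
  fi
}

# Usage: retag_hvc1 lecture.mp4 lecture_apple.mp4
```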
AMD AMF and the RDNA 4 VCN
AMD's Advanced Media Framework (AMF) is the API that sits on top of the Video Core Next (VCN) engines inside Radeon GPUs. The 2026 generation is VCN 5.0 on RDNA 4 cards (RX 9070 and RX 9070 XT), launched in March 2025. 14 VCN 5.0 was the catch-up release: it finally added B-frame support to AMD's AV1 encoder, brought H.264 quality up by approximately 25 percent, H.265 quality up by 11 percent, and pushed AV1 encoding efficiency to within touching distance of NVENC at the same speed. 15 Phoronix and Hot Chips 2025 measurements show RDNA 4 AV1 hitting a 93 VMAF score at CQ 70, in the same range as Blackwell NVENC, while costing notably less per card. 16
AMD's historical weakness was that AMF lagged NVENC on driver maturity, FFmpeg integration, and quality at low bitrates. RDNA 4 closed the quality gap; the driver-maturity gap is still real, especially on Linux, where the AMF encoding stack is younger and less widely deployed than NVIDIA's. For product teams, the practical rule in 2026 is: if your encoding fleet is already Linux + AMD, AMF on RDNA 4 is now competitive with NVENC and worth benchmarking against your specific content. If your fleet is mixed, NVENC is still the easier integration target.
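On a mixed fleet it helps to ask each node which encoders its FFmpeg build actually exposes before scheduling work on it. The sketch below reads an `ffmpeg -encoders` listing on standard input and prints the first HEVC encoder it finds. The function name and the preference order are assumptions made for illustration; reorder the list to match your own benchmarks.

```shell
#!/bin/sh
# Choose the first HEVC encoder present in an `ffmpeg -encoders` listing.
# The listing is read from stdin so the function can be exercised without a GPU.
# Runs in a subshell, so its variables do not leak into the caller.
pick_hevc_encoder() (
  listing=$(cat)
  for name in hevc_nvenc hevc_qsv hevc_videotoolbox hevc_amf libx265; do
    if printf '%s\n' "$listing" | grep -qw -- "$name"; then
      printf '%s\n' "$name"
      exit 0
    fi
  done
  exit 1
)

# Usage: ffmpeg -hide_banner -encoders | pick_hevc_encoder
```

An encoder appearing in the listing only proves it was compiled in. The matching driver and hardware can still be missing at run time, so a short trial encode is the real test.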
NETINT — the data-center VPU
NETINT Technologies ships the Quadra family of VPUs — Video Processing Units that look like NVMe SSDs and slot into a standard server's U.2 bay. The 2026 lineup centres on the Codensity G5 ASIC, the first chip to ship a hardware AV1 encoder for the data centre. 17 The product matrix is straightforward: a Quadra T1A (one G5) encodes sixteen 1080p60 streams in real time across AV1, HEVC, or H.264; a Quadra T2A (two G5s) doubles that to thirty-two streams; both ship with 36 TOPS of on-chip AI inference for object detection, region-of-interest coding, and content-adaptive rate control. 18
The arithmetic that makes VPUs interesting: a Quadra T2A draws roughly 500 watts and replaces a rack-unit of CPU-based transcoding that pulls four to five kilowatts to deliver the same throughput. NETINT's own analysis of the Bitmovin 2026 State of Video Encoding Report notes that VPU and ASIC adoption has reached 32 percent, with another 49 percent of respondents planning to evaluate VPUs in 2026 — the first cycle in which VPU evaluation intent is on par with GPU evaluation intent. 1
What NETINT gives up is flexibility. The G5 implements a fixed feature set decided when the chip was taped out — you cannot ship a firmware update that adds a new AV1 tool. If a new codec comes out before the next silicon generation, you have to wait. For a streaming operator who knows their codec mix for the next three years, that is fine; for a research-oriented team that wants to experiment with VVC or neural codecs, a GPU plus software is the safer bet.
One more — AMD Alveo MA35D
AMD's Alveo MA35D, released in late 2024, is the other production VPU on the market — dual-die ASIC, hardware AV1, similar form factor, similar power envelope. It is worth benchmarking against NETINT if you are at the scale that justifies a multi-vendor procurement, especially as AMD has been pricing aggressively to win streaming workloads from NETINT and from CPU farms.
A worked comparison: 1 000 concurrent 1080p30 streams
Here is the arithmetic that decides the procurement choice. Suppose you need to deliver one thousand simultaneous live 1080p30 streams in HEVC at six megabits per second.
The CPU baseline. A modern dual-socket server with sixty-four cores encodes roughly eight to twelve real-time 1080p30 HEVC streams with libx265 at the medium preset. To hit one thousand streams you need about a hundred such servers. Each pulls 600 watts at full tilt, so the fleet draws sixty kilowatts. A reasonable cloud bill at on-demand prices is around 400 US dollars per server per month, so 40 000 US dollars per month.
The GPU option. A data-center NVIDIA L40S handles roughly thirty real-time 1080p30 HEVC streams with NVENC. You need about thirty-four cards. At a typical cloud rate of around 1.50 US dollars per L40S-hour and about 730 hours in a month, the bill is roughly 37 000 US dollars per month, comparable to the CPU bill, with much lower latency and a tenth of the rack space.
The VPU option. A NETINT Quadra T2A is rated at thirty-two real-time 1080p60 streams; taking that same figure as a conservative floor for 1080p30, you need thirty-two cards. They fit into eight servers at four cards each. The capital cost is roughly 80 000 US dollars one-off plus around eight kilowatts of draw. Operating cost is dominated by power and rack space, both an order of magnitude below the CPU option. Within six months the VPU fleet costs less than the cloud CPU fleet; within a year it costs less than half.
The numbers are illustrative — actual densities depend on rate-control settings, resolution mix, and codec — but the shape is right. VPUs win at scale and on cost. GPUs win when you also need AI inference on the same chip. CPUs win for one-off masters and for any workload where the codec mix changes faster than silicon ships.
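The fleet sizes above are ceiling divisions, and they are easy to rerun with your own densities. The sketch below uses the per-device figures from this section. The function names are hypothetical, and the 730 hours per month is an assumption (8 760 hours in a year divided by twelve); at that figure the L40S fleet works out to about 37 000 US dollars per month.

```shell
#!/bin/sh
# Ceiling division: devices needed to carry a number of streams.
# $1 = total streams, $2 = streams per device.
devices_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Monthly cloud cost in whole US dollars.
# $1 = devices, $2 = hourly rate in cents, $3 = hours per month.
# Cents keep the calculation in integer arithmetic.
monthly_cost_usd() {
  echo $(( $1 * $2 * $3 / 100 ))
}

cpu_servers=$(devices_needed 1000 10)   # midpoint of 8 to 12 streams per server
gpu_cards=$(devices_needed 1000 30)     # L40S estimate from the text
vpu_cards=$(devices_needed 1000 32)     # Quadra T2A estimate from the text

echo "CPU servers: $cpu_servers, draw: $(( cpu_servers * 600 / 1000 )) kW"
echo "GPU cards: $gpu_cards, cloud bill: $(monthly_cost_usd "$gpu_cards" 150 730) USD/month"
echo "VPU cards: $vpu_cards in $(devices_needed "$vpu_cards" 4) servers"
```

Changing a single density figure, for example dropping the L40S to twenty streams for a heavier rate-control setting, shows how quickly the ranking between the options can move.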
Figure 2. Concurrent 1080p30 HEVC streams a single device produces in real time. Numbers are 2026 ballpark figures and depend on rate-control settings; the order-of-magnitude shape is stable.
Quality — the part people get wrong
The trap to avoid is judging hardware encoders by a single VMAF number. Hardware quality varies sharply by bitrate range, by codec, and by content type. NVENC HEVC at six megabits per second on a talking-head scene is indistinguishable from x265 medium; the same encoder on a six-megabit-per-second sports broadcast loses two to three VMAF points because complex motion eats the smaller search space. AV1 hardware is still maturing — the Moscow State University 2025 study explicitly noted that even the best 2025 AV1 hardware loses against the best 2025 HEVC software at the same bitrate. 2 By the time you read this, Blackwell's Ultra-High-Quality AV1 mode may have closed that gap on NVIDIA silicon, but the rule still holds: benchmark against your own content before committing, and always at the bitrates you actually ship at.
A second mistake is over-trusting VMAF. The metric was excellent for its first decade but is now well known to be game-able by a Contrast-Adaptive Sharpening filter that does not improve real fidelity. 19 When you compare encoders, pair VMAF with PSNR and at least one human subjective pass — eyes still catch hardware artefacts that VMAF and SSIM miss.
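Both objective metrics can be collected with FFmpeg itself. The sketch below assumes a build with libvmaf enabled; the function names and file names are placeholders. For both filters the first input is the encode under test and the second is the reference, and the two files need matching resolution and frame timing for the scores to mean anything.

```shell
#!/bin/sh
# Score an encode against its source with VMAF.
# $1 = encode under test, $2 = reference source.
score_vmaf() {
  ffmpeg -hide_banner -i "$1" -i "$2" -lavfi "[0:v][1:v]libvmaf" -f null -
}

# Score the same pair with PSNR as a cross-check.
score_psnr() {
  ffmpeg -hide_banner -i "$1" -i "$2" -lavfi "[0:v][1:v]psnr" -f null -
}

# Usage:
#   score_vmaf encoded.mp4 source.mp4
#   score_psnr encoded.mp4 source.mp4
```

If VMAF rises between two encodes while PSNR falls, treat that as a prompt to look at the frames yourself, since a sharpening step could be inflating the VMAF score.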
Pitfall — the "free" encoder is rarely free
The most common mistake we see on Fora Soft projects is assuming the on-CPU media engine is free because the chip is already in the rack. It is free in licence and capital cost, but a saturated QuickSync engine on a webinar host eats the latency budget for screen sharing and audio mixing, because the same memory controller is shared. On a video conferencing product, profile the whole pipeline under load before choosing hardware; the cheapest path on paper can be the slowest path in production.
Where Fora Soft fits in
We have shipped 239+ video-heavy projects since 2005 across video conferencing, video streaming, OTT, video surveillance, e-learning, telemedicine, and AR/VR. On every one, the hardware-acceleration decision shows up in the first architecture review. For low-latency conferencing products we typically route the encoder through QuickSync or VideoToolbox on the client and through NVENC on the SFU server, because conferences live or die on the few-hundred-millisecond end-to-end budget. For OTT and IPTV streams we lean on NETINT VPUs the moment the channel count crosses about 50 concurrent streams, because the rack-space and power savings start to matter. For one-off video-on-demand masters and for any encoding job where a one-VMAF-point gain pays off across a million viewers, we use software encoders on standard CPUs — usually x265 or SVT-AV1.
What to read next
- FFmpeg: a must-know cheat sheet for developers
- Choose a codec for your service in 2026: a decision tree
- AV1: the new internet standard and where it stands in 2026
Talk to us / See our work / Download
- Talk to a video engineer — book a 30-minute scoping call about your encoding pipeline.
- See our case studies — recent OTT, conferencing, and surveillance projects.
- Download the Hardware Encoding Selection Checklist (PDF) — one-page A4 sheet with the decision tree, density numbers, and FFmpeg command lines from this article.
References
1. NETINT Technologies, "Video Encoding Trends 2026: Industry Shift", citing the Bitmovin 2026 State of Video Encoding Report. https://netint.com/video-encoding-trends-2026/ (accessed 2026-05-17). Supports VPU/ASIC adoption at 32% and 49% of respondents planning to evaluate VPUs in 2026.
2. Moscow State University Graphics & Media Lab, "AV1 Hardware Encoder Comparison 2025", as summarised in StreamingMedia.com, "The State of the Video Codec Market 2025". https://www.streamingmedia.com/Articles/Editorial/Featured-Articles/The-State-of-the-Video-Codec-Market-2025-168628.aspx (accessed 2026-05-17). Supports the claim that 2025 AV1 hardware quality still lags the best HEVC software.
3. Gianni Rosato, "Who Has the Best Hardware AV1 Encoder?". https://giannirosato.com/blog/post/nvenc-v-qsv/ (accessed 2026-05-17). Supports the relative ranking of SVT-AV1 preset 6/8 versus hardware AV1 encoders.
4. NVIDIA Developer Blog, "NVIDIA Video Codec SDK 13.0 Powered by NVIDIA Blackwell". https://developer.nvidia.com/blog/nvidia-video-codec-sdk-13-0-powered-by-nvidia-blackwell/ (accessed 2026-05-17). Supports 9th-gen NVENC generation, SDK version, and feature list.
5. NVIDIA, "NVENC Application Note", Video Codec SDK 13.0. https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvenc-application-note/index.html (accessed 2026-05-17). Supports Blackwell AV1 Ultra-High-Quality mode and 3× throughput claim.
6. PCIsAwesome.com, "NVIDIA 50 Series GPUs Are Here: Worth the Upgrade or Not?". https://pcisawesome.com/gpus/nvidia-50-series-gpu-the-ultimate-guide/ (accessed 2026-05-17). Supports RTX 5090's three NVENCs and 60% export-speed gain over RTX 4090.
7. NVIDIA, "Using FFmpeg with NVIDIA GPU Hardware Acceleration", Video Codec SDK 13.0. https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/ffmpeg-with-nvidia-gpu/index.html (accessed 2026-05-17). Supports FFmpeg flags -hwaccel cuda and encoder names h264_nvenc, hevc_nvenc, av1_nvenc.
8. Wikipedia, "NVENC". https://en.wikipedia.org/wiki/NVENC (accessed 2026-05-17). Supports concurrent-session limit history and driver 522.25 removal.
9. Puget Systems, "Intel Arc B580 Content Creation Review". https://www.pugetsystems.com/labs/articles/intel-arc-b580-content-creation-review/ (accessed 2026-05-17). Supports Arc B580 release timing and AV1 encoding performance roughly 2× of A770.
10. WCCFTech, "Intel Arc B580 Limited Edition 12 GB Battlemage Review". https://wccftech.com/review/intel-arc-b580-battlemage-graphics-card-review/2/ (accessed 2026-05-17). Supports Arc B580 price point and AV1 quality at low power.
11. GIGA CHAD LLC, "Intel Arc B580 Streaming Benchmarks Breakdown". https://gigachadllc.com/intel-arc-b580-streaming-benchmarks-breakdown/ (accessed 2026-05-17). Supports AV1 live-streaming variance versus H.264 stability on B580.
12. Bitmovin, "Apple AV1 Support: M4 chip adds AV1 support for iPad Pro". https://bitmovin.com/blog/apple-av1-support/ (accessed 2026-05-17). Supports AV1 hardware decode support on M3+ Macs, M4 iPad Pro, and iPhone 15 Pro+.
13. VideoConverterFactory, "Apple AV1 Support in 2026: Hardware Encoding and Decoding". https://www.videoconverterfactory.com/multimedia-solution/apple-av1.html (accessed 2026-05-17). Supports AV1 hardware encoding being limited to M4 Ultra, M5 Pro, and M5 Max chips in 2026.
14. Tom's Hardware, "AMD RDNA 4 and Radeon RX 9000-series GPUs start at $549". https://www.tomshardware.com/pc-components/gpus/amd-rdna4-rx-9000-series-gpus-specifications-pricing-release-date (accessed 2026-05-17). Supports RDNA 4 / RX 9070 release timing and VCN generation.
15. KAD8.com, "AMD RDNA 4 Unlocks AV1 Encoding with B-Frame Support". https://www.kad8.com/news/amd-rdna-4-graphics-card-has-av1-encoding-capability/ (accessed 2026-05-17). Supports B-frame addition, 25% H.264 quality gain, 11% HEVC gain.
16. Chips and Cheese, "AMD's RDNA4 GPU Architecture at Hot Chips 2025". https://chipsandcheese.com/p/amds-rdna4-gpu-architecture-at-hot (accessed 2026-05-17). Supports AV1 VMAF 93 result at CQ 70 on RX 9070 XT.
17. NETINT Technologies, "Quadra T1U Video Processing Unit" product page. https://netint.com/products/quadra-t1u-video-processing-unit/ (accessed 2026-05-17). Supports Codensity G5 ASIC, 8K60 capability, AV1/HEVC/H.264 codec coverage.
18. NETINT Technologies, "Quadra T2A Video Processing Unit" product page. https://netint.com/products/quadra-t2a-video-processing-unit/ (accessed 2026-05-17). Supports 32× 1080p60 stream density, dual G5 architecture, 36 TOPS AI inference capacity.
19. Forum discussions and Gianni Rosato analysis on VMAF and Contrast-Adaptive Sharpening. https://giannirosato.com/blog/post/nvenc-v-qsv/ (accessed 2026-05-17). Supports the caution that VMAF can be gamed by CAS filters and should be paired with PSNR plus subjective testing.