Published 2026-05-27 · 26 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
Streaming services live or die on catalogue depth, and the catalogue is mostly old. A 2024 Bitmovin Video Developer Report estimate puts roughly 60% of premium OTT inventory at SD or 720p source resolution — content shot before 2010 that is being delivered to viewers whose default screen now resolves 8.3 million pixels. The gap between source quality and display capability is widening every year as TVs grow and streaming bitrates climb. AI-based video super-resolution turns that gap from a content problem into an engineering problem: you can take a 480p master from 1998 and ship a watchable 1080p or 4K encode in 2026 without going back to the original tape (which may not exist) and without re-shooting (which is impossible). This article is for the product manager, archive lead, or video-platform engineer who needs to decide whether to build, buy, or rent an upscaling pipeline for an OTT catalogue. It is also the engineering background under every later lesson that touches restoration — content-aware encoding ladders, AI b-roll generation for OTT post-production, scene classification on archive footage.
The Mental Model — Upscaling Is Hallucination, Not Resizing
Before AI, "upscaling" meant geometric resampling: a 480-line image becomes a 1080-line image by interpolating between known pixels. The standard recipes (bilinear, bicubic, Lanczos) are mathematical formulas that estimate the value of a new pixel as a weighted average of its neighbours. They are fast, deterministic, and fundamentally limited: they cannot invent detail that was not there to begin with. A bicubic-upscaled VHS clip is a bigger blurry rectangle that shows nothing more than the original, just larger.
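To make the "weighted average of neighbours" point concrete, here is a minimal pure-Python sketch of bilinear upscaling on a tiny grayscale image. The function name and the toy 2×2 input are illustrative assumptions; production resamplers add details (pixel-centre alignment, wider kernels for bicubic and Lanczos) that are left out here.

```python
def bilinear_upscale(img, scale):
    """Upscale a 2-D grayscale image (a list of rows) by an integer factor."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * scale):
        # Map the output row back to a fractional source row, clamped at the edge.
        sy = min(y / scale, h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * scale):
            sx = min(x / scale, w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Blend the four surrounding source pixels by distance.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


src = [[0.0, 100.0],
       [100.0, 200.0]]
big = bilinear_upscale(src, 2)  # 4x4 result
```

Every output value is a convex blend of source pixels, so it can never fall outside the range of the input. That is the limitation in one line: more pixels, no new information.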
AI super-resolution does something different. It learns what high-resolution textures look like from a large training corpus of high-quality images, then hallucinates plausible detail when given a low-resolution input. The fence in the background becomes individual wires; the skin texture becomes pores and hairs; the leaves on a tree become individually rendered. None of that detail was in the source. The model invented it, conditioned on what high-resolution video typically looks like.
This is both the technology's superpower and its trap. The superpower: with enough training data and the right architecture, the hallucinations are visually convincing for most content. The trap: when the hallucination is wrong, it is plausibly wrong — a face gets the wrong texture, text on a sign becomes gibberish that looks like text, a logo gets re-invented. Real-ESRGAN, BasicVSR++, and every commercial upscaler we discuss are all variations on this hallucination strategy. The engineering question is not whether to hallucinate; it is how much, with what guardrails, and for which content.
The brightness number — what engineers call a luma value — for any output pixel is no longer a deterministic average of input pixels. It is a learned function of the entire surrounding patch, plus the model's prior beliefs about what high-resolution video looks like.
Figure 1. Classical resamplers compute the new pixel as a weighted average of neighbours. AI super-resolvers invent plausible detail from a learned prior. The output looks sharper because the model added textures that were not in the source.
How Real-ESRGAN Works — The 2026 Default For Per-Frame Restoration
Real-ESRGAN was published by Xintao Wang and colleagues at Tencent ARC Lab in 2021, with the paper "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data" (arXiv 2107.10833). It is the practical successor to ESRGAN (ECCV 2018 workshops), which itself was a successor to SRGAN (CVPR 2017). The contribution that put Real-ESRGAN into every production pipeline was not the architecture — that was already well-known — but a synthetic data generation process that finally taught the model to handle the kinds of degradations that real video archives actually have.
The architecture has two networks that train together. The generator is the network that turns a low-resolution image into a high-resolution one. The discriminator is a network that tries to tell real high-resolution images apart from generator outputs. They train against each other in a generative adversarial network — GAN — setup: the generator gets better at fooling the discriminator, the discriminator gets better at catching fakes, and after enough rounds the generator produces images that pass for real.
Inside the generator, the heavy lifting is done by Residual-in-Residual Dense Blocks (RRDBs). An RRDB is a stack of dense blocks (each layer feeds every later layer in the block) wrapped in a residual connection (the block's input is added to its output). Real-ESRGAN-x4 has 23 RRDB blocks. The output of the last block goes into an upsampling stage (nearest-neighbour interpolation followed by convolutions in the reference implementation) that increases spatial resolution by 4× (or 2× for the x2 variant), then a final convolutional head that produces the RGB output.
The training data is the trick. ESRGAN was trained on pairs of high-resolution images and their bicubic-downscaled versions. That works for clean inputs and fails spectacularly on a real-world video archive where the input is also blurred, noisy, JPEG-compressed, h.264-compressed, sharpened, ringing, blocky, and degraded by an unknown chain of past processing. Real-ESRGAN's high-order degradation synthesis applies a randomised sequence of blur, downsampling, noise, and JPEG compression twice to the training images, producing inputs that look like genuinely abused archive material. The model learns to undo that combined degradation in one pass.
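The idea of applying a randomised degradation chain twice can be sketched on a single scanline of pixel values. This is only an illustration under loud assumptions: the box blur, the fixed 2× stride, the parameter ranges, and especially the `quantise` step (a crude stand-in for lossy compression, not real JPEG) are all hypothetical simplifications and not the paper's actual recipe.

```python
import random


def blur(signal, radius):
    """Box blur: each sample becomes the mean of its neighbourhood."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def downsample(signal, factor):
    """Keep every `factor`-th sample."""
    return signal[::factor]


def add_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in signal]


def quantise(signal, step):
    """Crude stand-in for lossy compression (NOT real JPEG)."""
    return [round(v / step) * step for v in signal]


def degrade_once(signal, rng):
    """One randomised blur -> downsample -> noise -> compression round."""
    signal = blur(signal, radius=rng.choice([1, 2]))
    signal = downsample(signal, factor=2)
    signal = add_noise(signal, sigma=rng.uniform(1.0, 5.0), rng=rng)
    return quantise(signal, step=rng.choice([4, 8, 16]))


def high_order_degrade(signal, rng, order=2):
    """Apply the randomised chain `order` times (twice, as in the text)."""
    for _ in range(order):
        signal = degrade_once(signal, rng)
    return signal


rng = random.Random(0)
clean = [float((i * 37) % 256) for i in range(64)]
degraded = high_order_degrade(clean, rng)  # 64 samples -> 16 samples
```

Training pairs are then (degraded, clean): because the parameters are re-drawn for every sample, the model sees many different combinations of damage instead of one fixed downscaling kernel.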
The model ships in several variants. RealESRGAN-x4plus (the default) is the full 16.7-million-parameter network for natural photo and video content. RealESRGAN-x4plus-anime-6B is a six-block variant tuned for animation and cartoons. RealESRGAN-x2plus is a 2× variant for less aggressive upscales. RealESR-general-x4v3 is a lighter and faster network released in 2022 with better robustness on noisy real-world inputs. The reference implementation lives at github.com/xinntao/Real-ESRGAN under the BSD 3-Clause license, which permits commercial use.
The Token-Free Math, Worked Out Loud
Real-ESRGAN is not a Transformer. The cost calculus is different from the Vision Transformer arithmetic in the Vision Transformer primer lesson. A 4× upscale of a 480×270 input to 1920×1080 output goes through the generator as follows.
The input tensor is (480, 270, 3) — 388,800 numbers. The first convolution maps it to a 64-channel feature map of shape (480, 270, 64) — about 8.3 million numbers. That feature map then traverses 23 RRDB blocks at the same spatial resolution and channel count. Each RRDB performs roughly 5 dense convolutions on a 64-channel tensor; one such convolution at 480×270 with a 3×3 kernel costs 480 × 270 × 64 × 64 × 9 ≈ 4.8 billion multiply-add operations.
Across the 23 RRDB blocks with 5 convolutions each, that is 23 × 5 × 4.8 billion ≈ 550 billion multiply-adds, or about 550 GFLOPs if each multiply-add is counted as one operation, for the residual core alone. The upsampling stage then turns the 64-channel (480, 270, 64) feature map into a 1920×1080 RGB output through a short sequence of further convolutions. End-to-end, a single forward pass through RealESRGAN-x4plus on this 480×270 frame costs about 700 GFLOPs of compute.
On a single NVIDIA RTX 4090 (roughly 82 TFLOPS of fp16 shader throughput), the theoretical peak frame rate is 82,000 / 700 ≈ 117 FPS. The measured real-world rate from the project's benchmarks at fp16 with no tiling on an RTX 4090 is closer to 60–80 FPS for 480p input, because the convolutions are memory-bound rather than compute-bound. For a 1080p input upscaled to 4K, throughput drops to roughly 6–10 FPS on the same card. For 4K input upscaled to 8K, throughput drops below 1 FPS, and you typically need to tile the input.
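Tiling, mentioned above for inputs that do not fit in GPU memory, comes down to covering the frame with overlapping boxes, upscaling each box, and blending the seams. The following is a minimal sketch of the box computation only. The function name, the 512-pixel tile, and the 32-pixel overlap are illustrative assumptions and are not taken from any particular tool.

```python
def tile_boxes(width, height, tile, overlap):
    """Return (x0, y0, x1, y1) boxes covering the frame.

    Neighbouring tiles share `overlap` pixels so seams can be blended away.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    step = tile - overlap
    boxes = []
    y = 0
    while True:
        y1 = min(y + tile, height)
        x = 0
        while True:
            x1 = min(x + tile, width)
            boxes.append((x, y, x1, y1))
            if x1 == width:  # reached the right edge of the frame
                break
            x += step
        if y1 == height:  # reached the bottom edge of the frame
            break
        y += step
    return boxes


# A 4K input frame cut into 512-pixel tiles with a 32-pixel overlap.
boxes = tile_boxes(3840, 2160, tile=512, overlap=32)
```

Smaller tiles lower peak memory but increase the share of pixels processed twice in the overlap bands, so the tile size is usually chosen as the largest that fits the card.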
For an OTT archive workload of 1,000 episodes at 22 minutes each = 22,000 minutes = 39.6 million 30-FPS frames at 480p, a single 4090 running at 60 FPS would take 39.6M / 60 / 3600 ≈ 183 hours of compute. Add encode time (typically 2–3× the upscale time for a high-quality x265 master) and you have a 600-hour, 25-day campaign on a single GPU — or about 25 hours on a 24-GPU cluster.
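The arithmetic in this section can be reproduced in a few lines, which makes it easy to swap in your own resolution, GPU, or catalogue size. The sketch keeps the text's simplifications: 5 convolutions per RRDB at a constant 64 channels, one multiply-add counted as one operation, and the 700 GFLOPs end-to-end figure taken as given. The variable and function names are assumptions made for illustration.

```python
# Per-convolution cost at the working resolution of the generator.
HEIGHT, WIDTH = 270, 480
CHANNELS = 64
KERNEL_TAPS = 3 * 3

ops_per_conv = HEIGHT * WIDTH * CHANNELS * CHANNELS * KERNEL_TAPS
core_ops = 23 * 5 * ops_per_conv  # 23 RRDBs x 5 convolutions (text's simplification)

# Theoretical peak frame rate from the article's end-to-end estimate.
END_TO_END_GFLOPS = 700
PEAK_TFLOPS = 82
peak_fps = PEAK_TFLOPS * 1000 / END_TO_END_GFLOPS

# Archive workload: 1,000 episodes of 22 minutes at 30 FPS.
frames = 1000 * 22 * 60 * 30
upscale_hours = frames / 60 / 3600  # at a measured 60 upscaled frames per second


def campaign_hours(upscale_hours, encode_multiplier, gpus=1):
    """Upscale plus encode time, assuming perfect scaling across GPUs."""
    return upscale_hours * (1 + encode_multiplier) / gpus


low = campaign_hours(upscale_hours, encode_multiplier=2)
high = campaign_hours(upscale_hours, encode_multiplier=3)
```

The 600-hour figure in the text sits inside the `low` to `high` range that the 2–3× encode multiplier produces. Passing `gpus=24` divides either bound by the cluster size, which only holds if the work splits cleanly per episode.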
Why Real-ESRGAN Alone Flickers On Video
If you take Real-ESRGAN and run it frame-by-frame on a video clip, two things happen. The frames are sharper. They also flicker. The model has no concept of time — each frame is upscaled independently — and the hallucinated details on a frame at t = 0 differ from the hallucinated details on the same content at t = 1. The fence in the background gets one set of inventions in frame 1 and a slightly different set in frame 2. When played back, the result is visibly unstable: edges shimmer, textures crawl, faces twitch.
Temporal flicker is the single most important visual defect in AI video upscaling, and it is the reason a frame-by-frame strategy on its own is not production-ready. Three engineering responses have emerged. The first is temporal smoothing — running a small post-process that blurs each output frame with its neighbours, trading sharpness for stability. The second is video-native architectures like BasicVSR++ that propagate information across frames during the upscaling itself. The third is anchored reference frames that constrain the hallucination to be consistent across a clip.
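The first response, temporal smoothing, is simple enough to sketch directly. Below, each output frame is blended with its previous and next neighbour using fixed weights. The function name, the (0.25, 0.5, 0.25) weights, and the one-pixel toy frames are assumptions for illustration; the sketch also skips motion compensation, so on real footage it would smear anything that moves.

```python
def temporal_smooth(frames, weights=(0.25, 0.5, 0.25)):
    """Blend each frame with its previous and next neighbour.

    `frames` is a list of equal-length flat pixel lists.
    """
    w_prev, w_cur, w_next = weights
    last = len(frames) - 1
    smoothed = []
    for t, cur in enumerate(frames):
        prev = frames[max(t - 1, 0)]    # clamp at the start of the clip
        nxt = frames[min(t + 1, last)]  # clamp at the end of the clip
        smoothed.append([
            w_prev * p + w_cur * c + w_next * n
            for p, c, n in zip(prev, cur, nxt)
        ])
    return smoothed


# One pixel whose hallucinated value alternates between frames.
flickering = [[100.0], [120.0], [100.0], [120.0], [100.0]]
stable = temporal_smooth(flickering)
```

On this toy input the frame-to-frame swing drops from 20 levels to at most 5, which is the stability gain. The cost is that the same averaging also softens genuine detail, which is why the video-native approaches are preferred for moving content.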
In practice, for OTT archive workloads, the answer is almost always: use a video-native model for moving content, and reserve Real-ESRGAN for still images, key art, posters, and per-frame work where temporal consistency does not matter.
How BasicVSR++ Works — Recurrent Propagation For Temporal Consistency
BasicVSR was introduced by Kelvin Chan and colleagues at MMLab, NTU, in 2021 (CVPR 2021). The follow-up, BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment (arXiv 2104.13371, CVPR 2022), is the model that became the open-source reference for temporally-consistent video super-resolution and that swept the NTIRE 2021 challenges (three champions, one runner-up across the Video Super-Resolution and Compressed Video Enhancement tracks).
The architecture is a recurrent network with two ideas that matter. The first is bidirectional grid propagation. The model maintains a running feature representation that propagates information both forward (frame 1 → frame 2 → frame 3) and backward (frame N → frame N-1 → frame N-2) along the video. Information from earlier and later frames is available when the model decides what the high-resolution output for the current frame should look like. The second is flow-guided deformable alignment. Before fusing information from a neighbouring frame, the model warps that neighbour's features to align with the current frame, using a learned optical-flow estimate and deformable convolutions that adjust the alignment locally per pixel.
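A toy version of bidirectional propagation shows why every frame ends up informed by both its past and its future. In this sketch each frame is reduced to a single scalar "feature", and the recurrent update is a fixed blend. Both are deliberate simplifications: the real model carries learned feature maps, aligns them with flow-guided deformable convolutions, and uses a richer propagation grid than the single forward and backward pass shown here. All names are hypothetical.

```python
def propagate(features, decay):
    """One recurrent pass: mix each feature with the state carried so far."""
    state = features[0]
    out = [state]
    for f in features[1:]:
        state = decay * state + (1 - decay) * f
        out.append(state)
    return out


def bidirectional(features, decay=0.5):
    """Run the recurrence forward and backward, then fuse the two results."""
    forward = propagate(features, decay)
    backward = propagate(features[::-1], decay)[::-1]
    # After fusion, every position depends on earlier AND later frames.
    return [(f + b) / 2 for f, b in zip(forward, backward)]


# A detail that is only visible in the middle frame of a five-frame clip.
per_frame = [0.0, 0.0, 8.0, 0.0, 0.0]
fused = bidirectional(per_frame)
```

The information from the middle frame reaches both ends of the clip in `fused`, whereas a forward-only pass would leave the first two frames with nothing to draw on.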
The result is dramatic. On the standard REDS4 benchmark (a four-clip subset of the REDS dataset used in NTIRE), BasicVSR++ achieves 32.39 dB PSNR, a 0.82 dB improvement over BasicVSR with roughly the same parameter count. On Vid4 (an older standard low-resolution video benchmark), BasicVSR++ achieves 27.79 dB. Those numbers are 2022 state of the art for open-weight pure-supervised VSR, and the model is still the workhorse in 2026 production pipelines despite four years of new diffusion-based competitors, because it runs faster, costs less to deploy, and is well understood.
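For readers who want to relate those decibel figures to pixel error, PSNR is 10·log10(MAX² / MSE), so a fixed gain in dB corresponds to a fixed ratio of mean squared error. The sketch below shows only the core formula on flat pixel lists; published benchmarks also fix details such as which colour channel is measured and how borders are handled, which are not modelled here. Function names are illustrative.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10 ** (-db_gain / 10)


ratio = mse_ratio_for_gain(0.82)  # the BasicVSR -> BasicVSR++ gain quoted above
```

A 0.82 dB gain works out to roughly 17% lower mean squared error, which is a large step for a benchmark where published methods are often separated by a tenth of a decibel.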
The reference implementation lives at github.com/ckkelvinchan/BasicVSR_PlusPlus (and is also integrated into MMagic, the OpenMMLab generative model toolkit) under the Apache 2.0 license. BasicVSR++ generalises to video deblurring and denoising — the paper's follow-up technical report (arXiv 2204.05308) shows it winning the NTIRE 2022 Quality Enhancement of Compressed Video challenges as well.
The trade-off: BasicVSR++ assumes the input has been somewhat sanitised. It is excellent at recovering detail from genuinely high-quality but low-resolution source (a 480p DVD master, a 720p broadcast feed) and visibly weaker on heavily degraded inputs (a VHS rip, a tape that has been re-encoded three times). For truly degraded archive content, the production pattern is to run Real-ESRGAN-style restoration first (to remove compression artifacts and basic noise), then run BasicVSR++-style video super-resolution on the cleaned-up intermediate.
Figure 2. BasicVSR++ runs both forward and backward through the clip, warping neighbour frames into alignment with the current frame before fusing them. This is the architectural reason its output is temporally stable.
The Wider Family — Diffusion Upscalers And The 2026 Frontier
Real-ESRGAN and BasicVSR++ are the two open-weight workhorses, but they are not the only options. Four other families are worth knowing because they show up in commercial products and in the engineering choices ahead of you.
VideoGigaGAN (Adobe Research, arXiv 2404.12388, April 2024) is a video-native generative super-resolver that builds on the large-scale image GAN GigaGAN. It produces visibly richer textures than BasicVSR++ at 8× upscales — its headline claim — but ships only as a research preview, with Adobe stating no immediate plans to release it in Premiere Pro. It also has a hard limit: performance drops sharply beyond about 200 frames (8–10 seconds at 24 FPS), which is fine for shot-level processing but not for long-form OTT episodes. Treat it as the research benchmark the commercial 8× tools were judged against, not as a tool you can ship.
SeedVR (CVPR 2025 Highlight) and SeedVR2 (accepted to ICLR 2026) are ByteDance's diffusion-transformer-based video restoration models, released as open weights on Hugging Face under non-commercial licenses for the smaller variants. SeedVR2 is a one-step diffusion model with adaptive window attention, capable of high-quality restoration in a single denoising pass — ten times faster than the older multi-step diffusion approaches. The headline open release is ByteDance-Seed/SeedVR2-3B. It is the strongest 2026 open-weight VSR option on heavily degraded content, but its license restricts commercial use; read it before shipping.
NVIDIA Maxine Video Super Resolution is the proprietary real-time path. It upscales 16:9 video from 480p up to 4K (and now claims up to 8K for some configurations) with user-controllable sharpness, denoising, and what NVIDIA's docs call a "hallucination limit". Maxine is the standard choice when the requirement is sub-frame latency in a live broadcast or video-call pipeline rather than offline archive processing — it is the pipeline that powers RTX Video Super Resolution in NVIDIA's drivers and the upscaling stage in many real-time streaming products.
MambaVSR (2025) and MIA-VSR (CVPR 2024) are recent Transformer-and-state-space variants that incrementally improve REDS4 PSNR over BasicVSR++ by 0.3–0.6 dB. They are research-grade today; expect at least one of them to displace BasicVSR++ as the default open-weight VSR by 2027, but in May 2026 BasicVSR++ remains the most production-ready open option for moving content.
The Commercial Tier — Topaz Video AI And Pixop
Two commercial products dominate the OTT archive upscaling market and you need to know both, because the build-vs-buy decision is rarely about quality alone.
Topaz Video AI is the offline cinematic-grade default. It ships nineteen specialised models in its 2026 release — Proteus, Iris, Iris LQ, Artemis, Theia, Rhea, Hyperion (HDR path), Starlight (diffusion-based deep restoration), and others — each tuned for a different input type. Proteus is the balanced enhancer for medium-quality source. Iris produces the sharpest edges (with occasional warping artifacts on faces). Artemis specialises in interlaced content. Hyperion handles HDR mastering. Topaz transitioned from perpetual licensing to subscription in late 2025; the 2026 plans are $299/year (Personal — 25 cloud credits/month, non-commercial or limited commercial) and $699/year (Pro — 100 cloud credits/month, full commercial use, seat management). On-premise rendering is unlimited within the subscription. Topaz is the right choice when you need maximum visual quality on a curated catalogue, have GPUs available in-house, and can spare an operator to pick the right model per asset.
Pixop is the cloud-API archive-scale default. It is a pure SaaS product: you upload, you pick filters (super-resolution, denoise, deinterlace, deep restoration), it processes on Pixop's cloud GPUs, you download. The pricing is pay-as-you-go from a $10 minimum, charged per minute of source per filter applied; storage and downloads are billed separately. Pixop offers a REST API for bulk integration; archive customers and broadcasters use it precisely because they don't want to operate a local GPU fleet for a 6,000-hour catalogue restoration that runs once a year. Pixop's quality is broadly comparable to Topaz on common content; its edge is the operational simplicity at scale.
The third option — and the one most engineering teams underestimate — is operating Real-ESRGAN + BasicVSR++ in-house on your own GPUs. It is free except for compute, gives you full control of the model and the pipeline, and integrates cleanly into your existing encoding ladder. The cost is operational: you need an engineer who understands the model, the GPU fleet to run it on, the storage and dataflow to handle terabytes of intermediate output, and the patience to tune parameters per content type. For an OTT operator with a 10,000-hour catalogue and an existing video engineering team, the in-house path frequently wins on total cost of ownership. For a 100-hour catalogue or a team without GPU operations experience, the cloud path wins.
| Property | Real-ESRGAN (open) | BasicVSR++ (open) | Topaz Video AI (commercial) | Pixop (cloud) | SeedVR2 (open, NC) |
|---|---|---|---|---|---|
| Type | Per-frame GAN | Recurrent VSR | Multi-model offline | Cloud API | One-step diffusion |
| Best for | Stills, key art, light video | Cleaned-up moving content | Curated archive, in-house | Bulk catalogue, no GPU ops | Heavily degraded source |
| Open-source | Yes (BSD-3) | Yes (Apache 2.0) | No | No | Weights yes, license NC |
| Temporal stability | Poor on video | Strong | Strong | Strong | Strong |
| Reference benchmark | DIV2K, OST300 | REDS4: 32.39 dB | Internal | Internal | Outperforms BasicVSR++ on degraded |
| Cost model | Free + your GPU | Free + your GPU | $299/$699 per year | ~$0.50–$2 per min source | Free + your GPU (NC) |
| Production maturity | Battle-tested | Battle-tested | Battle-tested | Battle-tested | New (ICLR 2026) |
The Three Failure Modes That Wreck First Attempts
We have shipped or evaluated AI archive upscaling pipelines on six OTT projects at Fora Soft over the last three years. The three failure modes below come up in roughly that order every time.
Failure 1: Treating "Upscale" As "Make Sharper". A team picks a strong upscaler, applies it to an SD master, encodes the output at 4K, and ships. The output is sharper. The output is also wrong: faces have been re-textured, hair has been reinvented, text on signs has become gibberish that looks like text, logos have been re-drawn by the model. For documentary content, news archives, and anything where the original visual record matters, that is a content integrity violation, not a quality improvement. The fix is to constrain the hallucination: pick a lighter model (RealESR-general-x4v3 over RealESRGAN-x4plus, or Topaz Proteus with low "Recover Detail"), reduce the sharpness knob, and run a final pass that compares the output against the bicubic-upscaled baseline to flag clips that have drifted too far from the source.
Failure 2: Ignoring The Encoding Pipeline. A team runs Real-ESRGAN, gets a beautiful 4K master, runs it through their existing x264 ladder at the bitrates they used for the SD original, and ships. The viewer sees 4K content at 4 Mbps — bitrate-starved 4K that has been re-blurred by the encoder. The hallucinated detail the upscaler produced is exactly the high-frequency signal that codecs throw away first. The fix is to budget bitrate honestly. A 4K master earned by AI super-resolution needs the same bitrate as a natively-shot 4K master — typically 15–25 Mbps for HEVC, 8–15 Mbps for AV1 — because the signal genuinely has 4K-grade frequency content the encoder must preserve.
Failure 3: One Model For Everything. A team picks RealESRGAN-x4plus and uses it for everything in the catalogue: stand-up comedy, documentaries, animation, archive news. The animation gets the wrong textures (it should have used RealESRGAN-x4plus-anime-6B). The news archive gets faces re-invented (it should have used a content-preserving setting). The documentary gets believable but incorrect details on critical archival footage. The fix is content-aware model selection — classify each asset (live-action / animation / archive news / documentary / sports) and route to the appropriate model and parameter preset. This is where a small classifier on a Vision Transformer backbone earns its keep: one inference at ingest time, one routing decision, one model per content type.
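Content-aware routing can start as nothing more than a lookup table keyed by the classifier's label. The sketch below is one possible shape under stated assumptions: the label strings, the `Preset` fields, the choice of model for sports, and the conservative fallback are all hypothetical, and only the model names themselves come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    model: str
    preserve_content: bool  # True = favour fidelity over added detail


# Hypothetical routing table; tune it against your own catalogue.
ROUTES = {
    "animation": Preset("RealESRGAN-x4plus-anime-6B", preserve_content=False),
    "live-action": Preset("RealESRGAN-x4plus", preserve_content=False),
    "sports": Preset("RealESRGAN-x4plus", preserve_content=False),
    "archive-news": Preset("RealESR-general-x4v3", preserve_content=True),
    "documentary": Preset("RealESR-general-x4v3", preserve_content=True),
}

# Unknown labels fall back to the most conservative option.
FALLBACK = Preset("RealESR-general-x4v3", preserve_content=True)


def route(content_type):
    """Map a classifier label to a model preset."""
    return ROUTES.get(content_type, FALLBACK)


preset = route("animation")
```

Failing towards the content-preserving preset is a deliberate choice in this sketch: a misclassified cartoon that comes out slightly soft is cheaper to fix than a misclassified news clip with invented faces.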
Figure 3. The three failure modes that wreck first attempts at AI archive upscaling. Each has a specific fix; none of them is "pick a better model".
The OTT Archive Pipeline — What Production Actually Looks Like
A working production pipeline for AI archive upscaling has six stages. The shape below is what we use in our own OTT projects and what you see, with variations, inside Topaz workflows, Pixop pipelines, and the bigger services' internal systems.
Stage 1 — Ingest and classify. Each asset enters the pipeline with a manifest: source format, source resolution, source quality assessment (a VMAF score against a high-quality reference if one exists, otherwise a no-reference quality model like MUSIQ), and content type (live-action / animation / news / documentary / sports). The classifier is a small Vision Transformer fine-tuned on a labelled subset of your catalogue — a one-time investment of a few thousand labels.
Stage 2 — Restore. If the source quality is low (heavy compression, noise, interlacing), run a restoration pass first. Real-ESRGAN with conservative settings, Topaz Artemis for interlaced source, or SeedVR2 (under appropriate license) for heavily degraded content. The output is a cleaned-up intermediate at the same resolution as the source.
Stage 3 — Upscale. Apply the video-native super-resolver. BasicVSR++ at 4× on cleaned-up live-action; Topaz Iris or Proteus on the same; commercial choices for animation or archival film. Tile inputs that exceed the GPU's memory.
Stage 4 — Temporal QA. Sample the upscaled output and run a temporal stability metric — typically the difference between consecutive frames after they have been warped into alignment by optical flow. Flicker and crawl show up as anomalously large residuals. Flag clips that fail the threshold for re-processing with stronger temporal regularisation.
Stage 5 — Content-integrity QA. Compute VMAF against the bicubic-upscaled source baseline. A VMAF of, say, 92 against the source bicubic means the upscaler stayed structurally faithful to the source while adding plausible detail. A VMAF below 80 means the model drifted — re-process with a less aggressive setting. For news, documentary, and historically sensitive footage, additionally compare faces and text against the source with a face-identity model and an OCR model; both must match.
Stage 6 — Encode at honest bitrate. The upscaled 4K master goes into your normal encoding ladder at 4K-appropriate bitrates. The lesson from Failure 2 applies: do not under-bitrate a 4K master earned through AI super-resolution.
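Stages 4 and 5 above both reduce to a measurement followed by a threshold, which can be sketched as follows. The assumptions are significant: frames are taken to be already motion-aligned (the optical-flow warp is out of scope here), the VMAF score and the face and text comparison results are assumed to come from external tools, and the flicker threshold is a made-up value. Only the VMAF floor of 80 comes from the text.

```python
def temporal_residual(frame_a, frame_b):
    """Mean absolute difference between two ALREADY motion-aligned frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def flag_flicker(aligned_frames, threshold):
    """Indices t where the residual between frame t and t + 1 is anomalous."""
    return [
        t for t in range(len(aligned_frames) - 1)
        if temporal_residual(aligned_frames[t], aligned_frames[t + 1]) > threshold
    ]


def integrity_gate(vmaf_vs_bicubic, sensitive=False,
                   faces_match=True, text_matches=True, min_vmaf=80.0):
    """Decide whether an upscaled clip passes content-integrity QA."""
    if vmaf_vs_bicubic < min_vmaf:
        return "reprocess"  # the model drifted too far from the source
    if sensitive and not (faces_match and text_matches):
        return "reprocess"  # news or documentary: faces and text must both match
    return "pass"


# Toy clip: the jump between frames 1 and 2 is far larger than its neighbours.
clip = [[10, 10], [10, 11], [30, 30], [30, 30]]
suspect = flag_flicker(clip, threshold=5.0)
```

Keeping the two gates as separate functions lets the sensitive-content rules tighten independently of the temporal threshold, which tends to need per-genre tuning.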
For the budget arithmetic on a 10,000-hour catalogue at 30 FPS: that is 1.08 billion frames. At 60 FPS on a single RTX 4090 (RealESRGAN-x4plus on 480p input), the upscale pass alone takes 1.08B / 60 / 3600 ≈ 5,000 hours of GPU time. On a 40-GPU cluster, that is 125 hours of wall-clock, about five days. Add restoration (1×), QA (~0.2×), and encode (3× for high-quality x265), and the full pipeline runs at about 5–6× the upscale time. Run strictly in sequence on the same hardware, that is 625–750 hours, roughly four weeks of wall-clock on a 40-GPU cluster; overlapping the CPU-bound x265 encode with the GPU stages brings it closer to two weeks. A 4-GPU workstation takes ten times as long in either case, so plan for several months at best. The cost arithmetic at AWS p4d on-demand pricing in 2026 (eight A100s at roughly $30/hour combined) is approximately $0.50–$1.50 per minute of source content, all-in, depending on what fraction of stages run on GPU vs CPU. Pixop's pricing tracks this number closely, which is no coincidence.
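The same budget can be kept as a small script so the estimate updates when the catalogue size, throughput, or stage mix changes. This sketch uses the stage multipliers quoted above and assumes perfect scaling across GPUs; the names are illustrative, and treating every stage as sequential work on the same GPUs is a pessimistic assumption because the encode is typically CPU-bound.

```python
CATALOGUE_HOURS = 10_000
SOURCE_FPS = 30
UPSCALE_FPS = 60  # measured upscaler throughput on one GPU

frames = CATALOGUE_HOURS * 3600 * SOURCE_FPS
upscale_gpu_hours = frames / UPSCALE_FPS / 3600

# Cost of each stage relative to the upscale pass (figures from the text).
STAGE_MULTIPLIERS = {"upscale": 1.0, "restore": 1.0, "qa": 0.2, "encode": 3.0}
pipeline_multiplier = sum(STAGE_MULTIPLIERS.values())


def wall_clock_hours(gpu_hours, gpus):
    """Wall-clock time assuming the work divides evenly across GPUs."""
    return gpu_hours / gpus


upscale_only = wall_clock_hours(upscale_gpu_hours, gpus=40)
full_pipeline = wall_clock_hours(upscale_gpu_hours * pipeline_multiplier, gpus=40)
```

Dropping `"encode"` from the multiplier table models the case where encoding runs on CPU in parallel with the GPU stages, which gives the optimistic bound to set against the sequential one.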
Where Fora Soft Fits In
At Fora Soft we have integrated AI super-resolution into video pipelines across OTT, video surveillance, and telemedicine. In OTT we have run BasicVSR++ at 4× on legacy episodic content, paired with content-aware encoding ladders so the upscaled master genuinely ships with the bitrate it earned. In surveillance we use Real-ESRGAN selectively on still-frame evidence exports — never on live monitoring footage where hallucinated detail would create chain-of-custody problems. In telemedicine, archive consultation footage is upscaled with conservative settings to support remote second-opinion workflows; the medical reviewers always have access to the original alongside the upscaled version. We do not run our own super-resolution research; we integrate the open and commercial state of the art into video pipelines that ship and stay shipped.
What To Read Next
- Vision Transformer primer for video AI engineers — the architecture that powers the content-type classifier in stage 1 of the pipeline.
- Multi-object tracking — DeepSORT, ByteTrack, OC-SORT — the upstream lesson on tracking primitives that often consume upscaled archive footage.
- Optical flow — RAFT vs Lucas-Kanade — the next lesson; optical flow is the alignment primitive inside BasicVSR++.
References
- Wang, X., Xie, L., Dong, C., Shan, Y. "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data." ICCV Workshop 2021. arXiv:2107.10833. Accessed 2026-05-27. The Real-ESRGAN paper; the synthetic high-order degradation recipe; the RRDB generator architecture with 23 blocks; the discriminator design; the natural and animation training splits.
- Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Loy, C. C., Qiao, Y., Tang, X. "ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks." ECCV 2018 Workshops. arXiv:1809.00219. Accessed 2026-05-27. The predecessor architecture; the original Residual-in-Residual Dense Block design; the perceptual loss formulation.
- xinntao/Real-ESRGAN. GitHub repository, reference implementation. github.com/xinntao/Real-ESRGAN. Accessed 2026-05-27. BSD 3-Clause license; PyTorch and NCNN-Vulkan implementations; model checkpoints for RealESRGAN-x4plus, RealESRGAN-x4plus-anime-6B, RealESRGAN-x2plus, RealESR-general-x4v3; fp16 inference at ~1.5–2× speedup on RTX-class GPUs.
- Chan, K. C. K., Zhou, S., Xu, X., Loy, C. C. "BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment." CVPR 2022. arXiv:2104.13371. Accessed 2026-05-27. The BasicVSR++ paper; second-order grid propagation; flow-guided deformable alignment; REDS4 PSNR 32.39 dB (a 0.82 dB improvement over BasicVSR); NTIRE 2021 three-champion result.
- Chan, K. C. K., Wang, X., Yu, K., Dong, C., Loy, C. C. "BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond." CVPR 2021. arXiv:2012.02181. Accessed 2026-05-27. The predecessor model; the four-essential-component framework (propagation, alignment, aggregation, upsampling) that BasicVSR++ extends.
- Chan, K. C. K., Zhou, S., Xu, X., Loy, C. C. "On the Generalization of BasicVSR++ to Video Deblurring and Denoising." Technical Report, April 2022. arXiv:2204.05308. Accessed 2026-05-27. The follow-up showing BasicVSR++ winning the NTIRE 2022 Quality Enhancement of Compressed Video challenges; transfer to deblurring and denoising.
- Xu, Y., Park, T., Zhang, R., Zhou, Y., Shechtman, E., Liu, F., Huang, J., Liu, D. "VideoGigaGAN: Towards Detail-Rich Video Super-Resolution." Adobe Research, April 2024. arXiv:2404.12388. Accessed 2026-05-27. The 8× video super-resolver; the 200-frame limit on long-clip stability; research preview, not shipped in Premiere Pro.
- ByteDance-Seed/SeedVR. GitHub repository for SeedVR (CVPR 2025 Highlight) and SeedVR2 (ICLR 2026). github.com/ByteDance-Seed/SeedVR. Accessed 2026-05-27. One-step diffusion-transformer video restoration; adaptive window attention; non-commercial license on the released weights; read carefully before commercial use.
- NVIDIA Maxine Video Effects (VFX) SDK User Guide, Video Super Resolution. docs.nvidia.com/maxine/vfx/latest/Filters/VideoSuperResolution.html. Accessed 2026-05-27. The Maxine VSR filter; real-time path; 480p to 4K (and now claimed up to 8K) on RTX GPUs; user-controllable sharpness, denoise, and hallucination-limit parameters.
- Topaz Labs. "Topaz Video AI." topazlabs.com/topaz-video. Accessed 2026-05-27. The 19-model commercial offline upscaler; 2026 subscription tiers Personal $299/year and Pro $699/year following the late-2025 transition from perpetual licensing; Proteus, Iris, Artemis, Hyperion, Starlight models.
- Pixop. "Pixop Pricing." help.pixop.com/en/articles/4373624-pixop-pricing. Accessed 2026-05-27. The cloud-API video enhancement service; pay-as-you-go pricing from $10 minimum; REST API for bulk integration; per-minute-per-filter billing.
- Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P. "Image Quality Assessment: From Error Visibility to Structural Similarity." IEEE Transactions on Image Processing, 13(4), 2004. DOI: 10.1109/TIP.2003.819861. Accessed 2026-05-27. The original SSIM paper; structural-similarity image quality metric used in super-resolution evaluation.
- Netflix VMAF. github.com/Netflix/vmaf. Accessed 2026-05-27. The Video Multi-Method Assessment Fusion perceptual video quality metric maintained by Netflix; the standard metric for content-integrity QA on upscaled output.
- MMagic, OpenMMLab. Video super-resolution model zoo, including BasicVSR++ and EDVR reference implementations. mmagic.readthedocs.io. Accessed 2026-05-27. The reference toolkit packaging for BasicVSR++ that most production pipelines use under the hood.
- REDS (Realistic and Dynamic Scenes) dataset. NTIRE 2019 challenge dataset. seungjunnah.github.io/Datasets/reds.html. Accessed 2026-05-27. The 30,000-image training and 4-clip REDS4 evaluation dataset used for VSR benchmarking; the source of the REDS4 PSNR number quoted for BasicVSR++.
- Bitmovin Video Developer Report 2024–2025. bitmovin.com/video-developer-report. Accessed 2026-05-27. The annual survey of streaming engineering practice; source of the catalogue-resolution distribution and the production-pipeline patterns cited in the OTT section.