Opus Clip + Descript + Submagic + Captions AI + DaVinci AI

Why This Matters

If you run a product that handles video — an online course platform, a podcast host, a webinar tool, an OTT library, a social app — your users are about to expect "turn this into clips" as a button, not a service they pay an agency for. Deciding whether to wire in a vendor API, self-host an open pipeline, or build from parts changes your cost, your margins, and how much of the experience you control. This lesson is written for the product manager, founder, or engineering lead who has watched an Opus Clip demo and needs to understand what is running underneath before committing a roadmap to it. It builds on the generative video landscape, which covers the models that create footage from scratch, and on the streaming ASR lesson, because the very first stage of every tool here is turning speech into text.

One Pipeline Hiding Behind Every Tool

Start with the single most useful idea in this lesson: almost every AI video editing tool you can name runs the same assembly line. The product names, the pricing pages, and the marketing make them look like five different inventions. They are not. They are five different front ends bolted onto the same five-stage pipeline, each tuned for a different job.

Picture a factory line. A long raw video goes in one end. It passes five stations, and a finished short video comes out the other end. Here are the five stations, in order, in plain language.

The first station is transcription. A speech-recognition model — the technology that turns spoken words into written text, called automatic speech recognition or ASR — listens to the audio and writes down every word, with a timestamp on each word saying exactly when it was spoken. This word-level timing is the spine the rest of the pipeline hangs on. Tools here mostly use OpenAI's Whisper or commercial services like Deepgram and AssemblyAI, which we cover in the streaming ASR lesson.

The second station is moment detection. Now that the whole video is text, a large language model — the kind of model behind ChatGPT and Claude, called an LLM — reads the transcript and decides which slices are worth keeping. It looks for a strong opening line (a "hook"), an emotional peak, a surprising claim, a clean conclusion. Think of it as a film editor who has read the entire script and circles the ten best scenes. The output is a list of start-and-stop timestamps.

The third station is reframing. Most source video is wide (a 16:9 landscape shape, like a TV). Most short-form video is tall (a 9:16 vertical shape, like a phone held upright). The tool has to crop the wide frame down to a tall one without cutting off the person talking. A computer-vision model — software trained to find faces and bodies in an image — locates the speaker in each frame and slides the crop box to follow them, the way a camera operator pans to keep a subject centered. This is why your clips do not have a headless torso wandering off the edge.

The fourth station is captions. Remember the word-level timestamps from station one. The tool now draws each word on screen at the exact instant it is spoken, often highlighting the current word so the viewer's eye tracks the speech. These are "burned-in" captions: painted directly onto the video pixels, so they show even when a viewer scrolls with the sound off. A large share of social video is watched muted (a figure of 85% is often quoted, originally for Facebook feeds), which is why this station is not optional.

The fifth station is render. The tool stitches the chosen segment, the moving crop, and the captions together and encodes a finished MP4 file. This is the only station that is pure video engineering — cut, composite, encode — with no AI in it.

Figure 1. The five-stage pipeline behind nearly every AI video editing tool. The tools differ in which stations they emphasize, not in the stations themselves.

Once you see the pipeline, the tools stop being mysterious. Opus Clip is a pipeline tuned to be excellent at stations two and three — finding moments and reframing. Submagic is tuned to be excellent at station four — captions — and weaker at station two. Descript replaces the whole idea of a timeline with station one, letting you edit the video by editing its transcript. Captions adds a sixth, optional station — generating a synthetic presenter — in front of the line. DaVinci Resolve drops AI helpers into a few stations of a traditional professional editor. Hold this map in your head; the rest of the lesson fills it in.

The Pipeline In Detail, With The Numbers

Walk each station once more, slowly, because the engineering trade-offs live in the details and they decide what your product can promise.

Station 1 — Transcription and the word timestamp

The transcription step does two jobs at once. It produces readable text, and it produces a timing track: a list that says "the word pricing starts at 4.21 seconds and ends at 4.58 seconds." That second job is the one that matters for everything downstream. If the timing is off by even a fifth of a second, captions drift out of sync and clips cut mid-word.

How accurate is the timing? With clean studio narration, modern word-level alignment lands within 30 to 40 milliseconds of the true word boundary — about one frame of video. With noisy audio (a café, a phone recording, crosstalk), it can drift to 80 to 100 milliseconds, which is two or three frames and starts to look sloppy. This is why every serious tool runs audio cleanup before transcription, and why your product's clip quality is capped by your worst audio.
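To make the frame comparison concrete, here is a minimal sketch that converts a timing error into video frames. The 30 frames-per-second rate is an assumption for illustration (the text does not state one); at other frame rates the same error covers a different number of frames.

```python
def error_in_frames(error_ms, fps=30.0):
    """Convert a timing error in milliseconds to a number of video frames."""
    frame_ms = 1000.0 / fps  # duration of one frame in milliseconds
    return error_ms / frame_ms

# The alignment ranges quoted above, at an assumed 30 fps.
for label, err_ms in [("clean, low", 30), ("clean, high", 40),
                      ("noisy, low", 80), ("noisy, high", 100)]:
    print(f"{label}: {err_ms} ms is {error_in_frames(err_ms):.1f} frames")
```

At 60 fps each figure doubles, so the same audio drift is more visible on high-frame-rate footage.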

Station 2 — Moment detection, and why it is the hard one

This is the station that separates a good clipping tool from a bad one. Anyone can transcribe. The skill is deciding which thirty seconds out of ninety minutes a stranger will watch to the end.

Early tools did this with keyword counting — find the part of the transcript with the most "exciting" words. It produced clips that cut off mid-thought, because a transcript keyword does not know where a sentence ends. Modern tools feed the whole transcript to an LLM and ask it to return complete, self-contained segments with a virality score and a reason. The best of them blend signals beyond the words: face focus, changes in voice tone, pacing, and laughter, so the chosen clip "feels" natural rather than statistically loud. The frontier of this is to skip the transcript and feed the model the actual frames — the multimodal approach we cover in the closed-frontier VLM lesson — but for clipping, a text transcript plus a strong LLM is still the cheaper, faster default in 2026.
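One way to guard against the mid-thought cut is to snap whatever range the model proposes to sentence boundaries in the timestamped transcript. The sketch below is illustrative only: the `Word` type and `snap_to_sentences` function are hypothetical names, and it assumes the transcript carries punctuation and that the proposed range overlaps at least one word.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

TERMINATORS = (".", "?", "!")

def snap_to_sentences(words, start, end):
    """Widen [start, end] so it begins at a sentence start and ends at a sentence end."""
    # First word still being spoken after the proposed start.
    first = next(i for i, w in enumerate(words) if w.end > start)
    # Last word that has begun before the proposed end.
    last = max(i for i, w in enumerate(words) if w.start < end)
    # Walk back until the previous word closes a sentence (or we hit the beginning).
    while first > 0 and not words[first - 1].text.endswith(TERMINATORS):
        first -= 1
    # Walk forward until this word closes a sentence (or we hit the end).
    while last < len(words) - 1 and not words[last].text.endswith(TERMINATORS):
        last += 1
    return words[first].start, words[last].end

words = [
    Word("We", 0.0, 0.2), Word("doubled", 0.2, 0.6), Word("revenue.", 0.6, 1.1),
    Word("Here", 1.3, 1.5), Word("is", 1.5, 1.6), Word("how.", 1.6, 2.0),
    Word("First,", 2.2, 2.6), Word("pricing.", 2.6, 3.2),
]
# A proposed range that starts mid-sentence and ends mid-sentence.
print(snap_to_sentences(words, 1.55, 2.4))
```

Snapping only ever widens the range, so a clip can grow past a target length; a production version would likely need a maximum-duration check as well.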

Here is the cost reality nobody puts on the pricing page. Running a ninety-minute transcript through a frontier LLM is not free. A ninety-minute podcast is roughly 13,500 words, which is about 18,000 tokens (the chunks of text an LLM bills by — roughly ¾ of a word each). Score that against a model, ask for ten candidate clips with reasons, and you spend input plus output tokens on every single video. Multiply by thousands of videos a month and station two becomes a real line item. This is the single biggest reason vendors meter you by "credits" instead of charging a flat fee.
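The arithmetic above can be turned into a small estimator. This is a sketch under stated assumptions: the speaking rate and words-per-token ratio come from the paragraph, while the output size and the per-million-token prices in the example call are placeholders, not any provider's actual price list.

```python
WORDS_PER_MINUTE = 150   # 13,500 words / 90 minutes, the rate used in the text
WORDS_PER_TOKEN = 0.75   # "roughly 3/4 of a word each"

def transcript_tokens(minutes):
    """Estimate input tokens for a transcript of the given length."""
    return minutes * WORDS_PER_MINUTE / WORDS_PER_TOKEN

def monthly_llm_cost(videos, minutes_each, output_tokens_each,
                     usd_per_m_input, usd_per_m_output):
    """Monthly station-two bill in USD for a given volume and price list."""
    input_tokens = videos * transcript_tokens(minutes_each)
    output_tokens = videos * output_tokens_each
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

print(transcript_tokens(90))  # 18000.0

# Placeholder prices and output size; substitute your provider's numbers.
estimate = monthly_llm_cost(videos=1000, minutes_each=90, output_tokens_each=1500,
                            usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"${estimate:,.2f} per month")
```

The estimate covers a single scoring pass per video. Retries, multi-pass prompts, and re-scoring after edits each multiply it.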

Station 3 — Reframing, and the math of following a face

Reframing sounds trivial and is not. The model must find the speaker in every frame, then move a crop box smoothly so the speaker stays centered, without jerking when two people are on screen or when the speaker steps out of frame.

The 2024 research direction here is to treat reframing as a reasoning problem, not just a tracking problem. The "Reframe Anything" work framed it as an agent that decides what the important subject is, then tracks it — so a soccer clip follows the ball, not a random face in the crowd. In production tools this shows up as branded names: Opus Clip calls it ReframeAnything, and it uses object tracking to keep moving subjects centered without manual keyframing.

The arithmetic that matters: the crop box must move smoothly. If you snap it to the exact face position every frame, it jitters, because face detection wobbles by a few pixels frame to frame. Tools apply easing — they move the box a fraction of the way toward the target each frame, so it glides. This is the same math as a camera operator who anticipates rather than chases.
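The easing idea can be sketched in a few lines. This is an illustrative version, not any vendor's implementation: it tracks only the horizontal center of the crop box, and the `alpha` smoothing factor and the 1920x1080 source size are assumptions.

```python
def crop_width(frame_height, aspect_w=9, aspect_h=16):
    """Width of a 9:16 crop that uses the full height of the source frame."""
    return round(frame_height * aspect_w / aspect_h)

def ease_crop_centers(face_centers, frame_width, crop_w, alpha=0.15):
    """Move the crop center a fraction `alpha` of the way to the face each frame."""
    half = crop_w / 2
    current = face_centers[0]
    smoothed = []
    for target in face_centers:
        current += alpha * (target - current)
        # Keep the crop box fully inside the source frame.
        current = min(max(current, half), frame_width - half)
        smoothed.append(current)
    return smoothed

crop_w = crop_width(1080)
# Detected face x-positions: a few pixels of wobble, then a real move to the right.
raw = [960, 964, 957, 962, 1400]
print(crop_w, [round(x, 1) for x in ease_crop_centers(raw, 1920, crop_w)])
```

A small `alpha` gives a smoother glide but lags behind fast movement; a large one follows quickly but lets the detector's wobble through. Tools that anticipate motion would need something beyond this simple filter.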

Station 4 — Captions, and the format that makes the animation possible

Captions are the most visible station and the one with a real engineering decision baked in: which subtitle format.

Plain subtitle formats, such as the SRT files you attach to a movie, can show a line of text on a timer. They cannot highlight one word at a time as it is spoken, the "karaoke" effect that defines modern short-form captions. For that you need a richer format. The Advanced SubStation Alpha format, abbreviated ASS, supports per-word timing through a family of karaoke tags with durations given in hundredths of a second: \kf (also written \K) sweeps a color fill smoothly across each word, and plain \k flips the color instantly. The web format WebVTT supports a lighter version of the same idea for browser playback, using timestamps placed inside the cue text.

The standard open-source recipe is exact and worth knowing even if you buy a tool, because it tells you what the tool is doing:

# Burn an ASS subtitle file (with karaoke word-highlight tags) into a video.
# -c:a copy keeps the original audio untouched; only the video is re-encoded.
ffmpeg -i input.mp4 -vf "ass=karaoke.ass" -c:a copy output.mp4

Two rules fall out of how the format works. Show four to six words at a time at most, or the highlight sweep moves too fast to read. And once captions are burned in, they are part of the pixels — changing a typo means re-encoding the whole clip. That immutability is the price of captions that survive a muted scroll.
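To show how word timestamps become karaoke captions, here is a minimal sketch that builds ASS Dialogue lines with sweep (\kf) tags, grouped five words at a time. The function names are hypothetical, and the output is only the event lines: a usable file also needs the [Script Info], [V4+ Styles], and [Events] headers with a style named "Default".

```python
def ass_time(seconds):
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"

def karaoke_events(words, max_words=5, style="Default"):
    """words: list of (text, start_s, end_s). Returns ASS Dialogue lines."""
    lines = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        parts = []
        for j, (text, start, end) in enumerate(chunk):
            # Each highlight runs until the next word starts, so pauses between
            # words are absorbed and the sweep stays in step with the audio.
            next_start = chunk[j + 1][1] if j + 1 < len(chunk) else end
            dur_cs = max(1, round((next_start - start) * 100))
            parts.append(f"{{\\kf{dur_cs}}}{text}")
        line_start, line_end = chunk[0][1], chunk[-1][2]
        lines.append(
            f"Dialogue: 0,{ass_time(line_start)},{ass_time(line_end)},"
            f"{style},,0,0,0,,{' '.join(parts)}"
        )
    return lines

words = [("pricing", 4.21, 4.58), ("is", 4.60, 4.75), ("the", 4.75, 4.90),
         ("hard", 4.90, 5.20), ("part", 5.20, 5.60), ("of", 5.70, 5.80)]
for line in karaoke_events(words):
    print(line)
```

If the clip is cut from a longer source before burning, the timestamps have to be shifted so they are relative to the start of the clip rather than the start of the original recording.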

Station 5 — Render

The last station is plain video engineering: take the chosen time range, apply the moving crop, composite the captions and any B-roll or effects, and encode an MP4. For a thirty-minute source video, the whole five-station run typically takes 5 to 15 minutes of wall-clock time depending on whether the stations run in parallel, and transcription is usually the slowest single step. None of this is real-time, and it does not need to be: clipping is a batch job, run once after a recording, not a live stream.

Common mistake: trusting the virality score as truth. Every clipping tool hands you a number — a "virality score" or a ranked list — and it is tempting to auto-publish the top three and walk away. Do not. That score is a language model's guess from a transcript; it has not seen your audience, your brand, or whether the clip cuts off an awkward half-second of dead air at the end. In our experience the model picks a strong moment most of the time and a confidently wrong one often enough to matter. The fix is cheap: keep a human approval step between the pipeline and the publish button, and feed the human's accept/reject choices back as a signal. The tools that let you do this — a review queue rather than a fire-and-forget API — are the ones safe to build a product on.

Captions are an accessibility feature, not just a growth hack

One more thing about station four, because it is easy to treat captions purely as a way to win the muted scroll and miss the larger point. Burned-in captions make your video usable by people who are deaf or hard of hearing, and by anyone watching in a noisy place or a quiet one. The same word-level timestamps that drive the karaoke highlight are exactly what a screen reader and a search index want.

This has two practical consequences for a product. First, if you only ever burn captions into the pixels, you have made the video look accessible while throwing away the machine-readable text — so also keep a separate caption track (a WebVTT or SRT file) alongside the burned-in version, so assistive technology and your own search can read it. Second, caption quality is not cosmetic: a wrong word in a medical or legal video is a real failure, not a typo. When you evaluate a tool, test it on your worst audio, not the demo reel, and check the text track, not just how the animation looks. The tool that captions a clean podcast beautifully may mangle a telemedicine consult recorded on a laptop microphone.
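Keeping the machine-readable track is cheap once the timestamps exist. The sketch below writes a basic WebVTT sidecar from caption cues; the function names are hypothetical, and it assumes cue text contains no blank lines and no "-->" sequence, which WebVTT does not allow inside a cue.

```python
def vtt_time(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_webvtt(cues):
    """cues: list of (start_s, end_s, text). Returns the file contents as a string."""
    blocks = ["WEBVTT"]  # required header line
    for start, end, text in cues:
        blocks.append(f"{vtt_time(start)} --> {vtt_time(end)}\n{text}")
    # Cues are separated by blank lines.
    return "\n\n".join(blocks) + "\n"

print(to_webvtt([(4.21, 5.60, "pricing is the hard part"),
                 (5.70, 6.40, "of the whole business")]))
```

Generating the sidecar from the same cue list that drives the burned-in captions keeps the two in agreement when a typo is corrected.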

The Five Tools, One At A Time

Now place each product on the pipeline. For each, the question is the same: which stations does it make excellent, what does it cost in 2026, and who is it for.

Opus Clip — the clip-generation specialist

Opus Clip is the tool most people mean when they say "AI clipping." You give it a long video — a podcast, a webinar, a YouTube recording, a Zoom call — and it returns a set of short vertical clips with captions and reframing already applied. It is a pipeline tuned hard for stations two and three.

Its moment-detection engine is branded ClipAnything (now ClipAnything 2.0), an AI model that analyzes visual, audio, and sentiment cues to find clip-worthy moments across any genre, from interviews to gaming to sports. The 2.0 version added natural-language search: you can type "find every time I mention pricing" and it returns those moments. Its reframing is ReframeAnything, which tracks the subject to keep a stable 9:16 crop. It captions in 25-plus languages and assigns each clip a "virality score."

The 2026 pricing: a free tier (60 credits a month, with a watermark), a Starter plan at $15 a month (150 credits, watermark removed, one brand template), and a Pro plan at $29 a month — about $14.50 a month if billed annually at $174 — where the full feature set lives. Credits map roughly to minutes of source video processed.

The important engineering fact for a product team: Opus Clip has an API. It positions itself as an AI video clipping API for building pipelines at scale — TV networks, news apps, large creators generating thousands of clips a day. The integration pattern is a job queue: you POST a clip project, then either poll a GET endpoint or receive a webhook when clips are ready. As of 2026 the API is in closed beta, gated to large-volume annual plans, with a rate limit around 30 requests per minute, support for videos up to 10 hours and roughly 30 GB, and up to 50 concurrent projects. If your product needs "turn this upload into clips" as a feature, this is the buy-side option you evaluate first.
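The polling half of that job-queue pattern can be sketched generically. Nothing here is the Opus Clip API: `get_status`, the `state` and `clips` fields, and the project ID are hypothetical stand-ins you would replace with the vendor's documented client calls. The one number taken from the text is the budget of about 30 requests per minute, which implies polls at least two seconds apart.

```python
import time

def wait_for_clips(get_status, project_id, max_requests_per_minute=30,
                   timeout_s=3600.0, sleep=time.sleep):
    """Poll a job until it finishes, staying under a per-minute request budget."""
    min_interval = 60.0 / max_requests_per_minute  # 30 per minute: 2 s apart
    interval = max(min_interval, 5.0)
    waited = 0.0
    while waited <= timeout_s:
        status = get_status(project_id)  # hypothetical: {"state": ..., "clips": [...]}
        if status["state"] == "done":
            return status["clips"]
        if status["state"] == "failed":
            raise RuntimeError(f"project {project_id} failed")
        sleep(interval)
        waited += interval
        interval = min(interval * 1.5, 60.0)  # back off; clipping takes minutes
    raise TimeoutError(f"project {project_id} not ready after {timeout_s} s")

# Demo against a fake status source, with sleeping replaced by a recorder.
fake = iter([{"state": "processing"}, {"state": "done", "clips": ["clip-1", "clip-2"]}])
print(wait_for_clips(lambda pid: next(fake), "proj-123", sleep=lambda s: None))
```

If the request budget is shared across many concurrent projects, per-project polling uses it up quickly, which is one reason to prefer the webhook route at volume.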

Descript — edit video by editing text

Descript is a different animal. It is built around station one taken to its logical end: it transcribes your video, and then you edit the video by editing the transcript. Delete a sentence in the text and that sentence vanishes from the video. Remove every "um" with one click. It turns video editing into word processing, which is why podcasters and course creators love it.
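The idea behind text-based editing can be sketched with the same word timestamps from station one: deleting words from the transcript amounts to computing which time ranges of the source to keep. This is an illustration of the concept, not Descript's implementation, and the function name is hypothetical.

```python
def keep_ranges(words, deleted):
    """words: list of (text, start_s, end_s); deleted: set of word indexes
    removed from the transcript. Returns [(start_s, end_s), ...] to keep."""
    ranges = []
    prev_kept = False
    for i, (_text, start, end) in enumerate(words):
        if i in deleted:
            prev_kept = False
            continue
        if prev_kept:
            # Extend the current range through this word (and any pause before it).
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
        prev_kept = True
    return ranges

words = [("So", 0.0, 0.2), ("um", 0.3, 0.6), ("pricing", 0.7, 1.2), ("matters", 1.2, 1.7)]
# Delete the filler word at index 1.
print(keep_ranges(words, deleted={1}))  # [(0.0, 0.2), (0.7, 1.7)]
```

The renderer then concatenates the kept ranges. A real editor would also likely pad or crossfade each cut so the joins do not click.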

Its AI layer is broad. Underlord is an AI co-editor that can tighten cuts, remove silences and filler words, improve audio, and add captions on your direction. Studio Sound cleans up audio. Overdub is voice cloning: you train it on your voice, and to fix a misspoken line you simply type the correction and Descript generates your voice saying it. The platform lists 30-plus AI tools in total.

The 2026 pricing: a free plan for testing, then Hobbyist at $16 a month billed annually (or $24 month-to-month), Creator at $24 a month annually (or $35), and Business at $50 a month annually (or $65). Descript handles full multi-hour episodes without trouble, which matters because some caption-first tools cap clip length.

Where Descript fits in a product strategy: it is the tool for editing the whole thing, not just clipping it. If your users produce long-form content — courses, full podcast episodes, training videos — and want to edit and publish from one place, Descript's text-based model is the reference design. Opus Clip and Descript are not really competitors; many creators use Opus Clip to find the clips and Descript to edit the full episode.

Submagic — captions and B-roll, very good, very fast

Submagic is a pipeline tuned hard for station four. It is built for one job: turn raw footage into polished short-form content for TikTok, Reels, and Shorts, with the best-looking captions in the category. By early 2026 it claimed over four million registered users and 5,000 to 10,000 new signups a day.

In one pass it generates animated captions (it claims 99% accuracy across 48-plus languages), removes silences and filler words, inserts contextual B-roll from a stock library, adds auto-zoom effects and sound effects at transitions, and writes an attention-grabbing hook title. For a typical three-minute video the process takes 2 to 5 minutes. Its Magic Clips feature does station-two moment extraction from longer videos, though reviewers consistently note its moment detection is weaker than Opus Clip's — Submagic shines on captioning, not on choosing the segment.

The 2026 pricing varies by source and plan structure: entry plans land at roughly $12 to $20 a month (billed annually), with higher tiers around $40 to $80 a month for more videos and API access. One hard limit to design around: Submagic caps source video at 30 minutes even on its top plan, which rules it out as the clipping engine for full-length podcasts.

The practical pattern reviewers recommend: use Opus Clip (or your own pipeline) to find and cut the clip, then Submagic to caption it. That division of labor is a direct consequence of the pipeline — Submagic owns station four, Opus Clip owns station two.

Captions (by Mirage) — restyling and synthetic presenters

Captions, the app from the company now branded Mirage, started as a captioning app and grew into something broader: a tool that adds a generation station in front of the pipeline. Its two signature features sit at opposite ends.

AI Edit applies a chosen visual style to an entire video in one action — you pick from a library of 20-plus named looks (Prism Pro, Vinyl, Film, Neon, Cinematic II) and the tool restyles the whole thing. That is station four and station five fused into a one-tap aesthetic. At the front of the line, AI Avatars and Twin generate a synthetic presenter: one selfie creates an AI "twin" of you that can speak any script, or you pick from a library of AI actors. These run on Mirage, the company's own AI video model, which generates voice, expression, and movement together as one performance. A chat-based editor lets you describe edits in plain language. (For the deep mechanics of synthetic presenters, see the lip-sync and avatars lesson.)

The 2026 pricing runs on a credit system: a Pro plan around $9.99 a month, Max at $24.99, Scale at $69.99, a Business tier at $399 a month with 8,000 monthly credits, and custom Enterprise pricing. Credits are consumed by generation, which is the expensive station.

Captions is the tool to study when your product needs to create a presenter, not just edit existing footage — a use case that overlaps the avatar platforms (Tavus, HeyGen, Synthesia) covered in the lip-sync lesson.

DaVinci Resolve 20 — AI inside the professional editor

The four tools above are AI-first products. DaVinci Resolve is the opposite: a full professional editing suite (free and paid) that has dropped AI helpers into a traditional timeline. It matters here because it shows where the pipeline lands when you are not throwing away manual control — and because, in its free tier, it is the most capable AI-assisted editor you can get for nothing.

Resolve 20's AI runs on the DaVinci Neural Engine. The 2025–2026 features map onto our stations cleanly. IntelliScript builds a timeline automatically from a script by matching transcribed audio to the script's lines and assembling the best takes — that is stations one and two, applied to long-form assembly rather than short clips. AI Multicam SmartSwitch auto-selects the best camera angle in real time by recognizing the active speaker from lip movement and audio, trained on thousands of hours of multicam footage. AI Voice Convert applies a voice model to a recording while keeping its inflection and emotion. AI Audio Assistant balances a full mix automatically, and AI Animated Subtitles animate captions to the pace of speech — station four, inside a pro editor.

The point for a product team: if your users are professional editors, the AI they want is assistive — it speeds the timeline, it does not replace it. Resolve is the reference for that philosophy, and a useful counterweight to the fully automated tools when you are deciding how much control to take away from your users.

Figure 2. The five tools mapped to the pipeline. Read the table by the job you need first, then by price and whether an API exists for your build.

A Worked Cost Example: Buy The API Or Build The Pipeline

The decision every product team faces is build versus buy. Walk one concrete example so the trade-off is numbers, not vibes.

Say your product processes 2,000 source videos a month, each about 30 minutes, turning each into five clips. That is 10,000 clips a month, 1,000 hours of source video processed.

The buy path. You call a vendor API (Opus Clip's, or a competitor's). You pay per minute of source video, metered as credits. There is no infrastructure to run, no model to host, no GPU bill. You pay a predictable per-clip price and the vendor absorbs every cost spike. The downside: you are locked to their feature set, their caption styles, their watermark rules, and their price changes, and you ship your users' video to a third party.

The build path. You assemble the five stations yourself: Whisper or a commercial ASR for station one; a frontier LLM for station two; an open reframing model for station three; an ASS-and-FFmpeg renderer for stations four and five. Now price the hidden line item, station two. A 30-minute transcript is roughly 4,500 words, or about 6,000 input tokens at the same speaking rate as the ninety-minute example (18,000 tokens). Add output tokens for ten scored candidates, and you are paying an LLM bill on every video, multiplied by 2,000 videos. Add GPU time for transcription and rendering. The build path wins on control and per-unit cost at high volume, and loses badly at low volume, where the vendor's free tier and entry-tier pricing are cheaper than your engineering time.

The rule of thumb: buy the API until the per-clip vendor cost times your volume exceeds the fully loaded cost of running the pipeline yourself — including the engineers who maintain it. For most products under a few thousand clips a month, that crossover never arrives, and buying is correct. Above it, owning the pipeline pays back. We cover this calculation in depth in the cost-model lesson.
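The rule of thumb reduces to a break-even calculation. The sketch below is a simplified model: it assumes a constant vendor price per clip, a constant variable cost per clip for your own pipeline, and a fixed monthly cost for infrastructure and the engineers who maintain it. All dollar figures in the example call are placeholders, not quotes.

```python
def monthly_buy_cost(clips, vendor_usd_per_clip):
    """Vendor path: pay per clip, nothing fixed."""
    return clips * vendor_usd_per_clip

def monthly_build_cost(clips, variable_usd_per_clip, fixed_usd_per_month):
    """Build path: fixed monthly cost plus compute and LLM spend per clip."""
    return fixed_usd_per_month + clips * variable_usd_per_clip

def breakeven_clips(vendor_usd_per_clip, variable_usd_per_clip, fixed_usd_per_month):
    """Monthly volume above which building is cheaper, or None if it never is."""
    saving_per_clip = vendor_usd_per_clip - variable_usd_per_clip
    if saving_per_clip <= 0:
        return None  # the vendor is cheaper per clip, so fixed costs never pay back
    return fixed_usd_per_month / saving_per_clip

# Placeholder inputs for illustration only.
threshold = breakeven_clips(vendor_usd_per_clip=0.50,
                            variable_usd_per_clip=0.10,
                            fixed_usd_per_month=12_000)
print(f"build pays back above about {threshold:,.0f} clips a month")
```

Real vendor pricing is tiered rather than linear, so it is worth rerunning the comparison at each plan boundary instead of trusting a single crossover number.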

Figure 3. The build-versus-buy decision. Volume and control are the two axes; the LLM moment-detection bill is the hidden cost in the build path.

The One Thing Teams Forget: Disclosure

Here is the pitfall that turns into a legal problem, and it has a hard deadline. When your tool edits or generates video with AI, you may be legally required to say so.

The European Union's AI Act, Article 50, sets transparency obligations that take effect on 2 August 2026. Two parts apply directly to the tools in this lesson. First, providers of generative AI systems must ensure that outputs are marked in a machine-readable format and are detectable as artificially generated or manipulated. Second, and more pointed: anyone deploying an AI system that generates or manipulates image, audio, or video content that constitutes a "deep fake" must disclose that the content has been artificially generated or manipulated.

Read that second clause against this lesson. A Captions AI twin presenting your script is squarely a deepfake under the law. A Descript Overdub clip where the voice was regenerated from typed text is a manipulated audio output. Even an auto-reframed, auto-captioned clip from Opus Clip involves AI manipulation that may trigger disclosure depending on how it is used. The penalty for getting this wrong reaches 15 million euros or 3% of worldwide annual turnover, whichever is higher.

The technical mechanism the European Commission's draft Code of Practice names by example is C2PA Content Credentials — a standard that embeds a signed manifest into the file declaring which AI system generated or modified the content, when, and which organization signed the claim. The Code prescribes layers: C2PA metadata, an imperceptible watermark that survives compression and cropping, and logging as a fallback. If you build the pipeline, you build this in; if you buy an API, you verify the vendor emits it. We go deep on this in the disclosure-engineering lesson.

The engineering takeaway is simple: treat disclosure as a station in your pipeline, not an afterthought. Add a final station after render that signs and marks every output, and design your consent and audit trail before you ship, not after a regulator writes.
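As a starting point for the logging layer only, here is a minimal sketch of an audit record written for every rendered output. It is not a C2PA manifest and not a watermark: those need a C2PA signing toolchain and a watermarking system that this sketch does not attempt. The field names and example values are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

def disclosure_record(video_bytes, ai_operations, system_name, organization):
    """Build an audit-log entry tying one output file to the AI steps that made it."""
    return {
        # Content hash so the record can later be matched to the exact file.
        "sha256": hashlib.sha256(video_bytes).hexdigest(),
        "ai_operations": sorted(ai_operations),
        "system": system_name,
        "organization": organization,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }

record = disclosure_record(
    video_bytes=b"...rendered MP4 bytes would go here...",
    ai_operations={"moment_detection", "auto_reframe", "auto_captions"},
    system_name="example-clipping-pipeline",  # hypothetical
    organization="Example Co",                # hypothetical
)
print(json.dumps(record, indent=2))
```

A plain hash stops matching as soon as a platform re-encodes the file, which is why a log alone is a fallback and the metadata and watermark layers carry the primary signal.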

Where Fora Soft Fits In

We build the video products these tools plug into — video conferencing, OTT and Internet TV platforms, e-learning systems, telemedicine apps, and video surveillance software. When a client wants "turn recordings into clips" inside their own platform rather than sending users off to a third-party site, the work is exactly the build-versus-buy decision above: wire in a clipping API behind a clean interface, or assemble the five-stage pipeline on their own infrastructure when volume and data-control rules demand it. The disclosure and consent layer is part of that work, not a bolt-on. The judgment we bring is knowing which stations to buy, which to build, and where the EU AI Act obligations land for a given vertical.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your Opus Clip plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the AI Video Editing Decision Sheet — One-page reference: the five-stage pipeline (transcribe, detect moments, reframe, caption, render), the five tools mapped to it with 2026 entry prices, the build-vs-buy rule, the word-timestamp accuracy and 4-6-word caption math, and….

References

OpusClip — Pricing and plans (free / Starter $15 / Pro $29; credit model). https://www.opus.pro/pricing — accessed 2026-06-01. Tier 4 (vendor first-party).
OpusClip — API documentation and ClipAnything model (job-queue POST/poll/webhook pattern, rate limits, 10-hour / 30 GB limits, closed beta). https://www.opus.pro/api and https://help.opus.pro/api-reference/endpoints/create-project — accessed 2026-06-01. Tier 4.
Descript — Pricing and plans (Hobbyist $16 / Creator $24 / Business $50 annual). https://www.descript.com/pricing — accessed 2026-06-01. Tier 4.
Descript — Underlord AI co-editor, Overdub voice cloning, text-based editing, Studio Sound. https://www.descript.com/video-editing — accessed 2026-06-01. Tier 4.
Submagic — Features (animated captions 48+ languages, B-roll, Magic Clips, silence removal) and pricing; 30-minute source cap; user-base claims. https://www.submagic.co/ and https://www.submagic.co/pricing — accessed 2026-06-01. Tier 4.
Captions / Mirage — Pricing (credit system) and features (AI Edit styles, AI Avatars / Twin, Mirage video model, chat editor). https://captions.ai/plans and https://mirage.app/captions/pricing — accessed 2026-06-01. Tier 4.
Blackmagic Design — DaVinci Resolve 20 AI features (Neural Engine, IntelliScript, AI Multicam SmartSwitch, AI Voice Convert, AI Audio Assistant, AI Animated Subtitles). Coverage: CineD, Larry Jordan, Imaging Resource (Apr 2025 public beta, 100+ features). https://www.cined.com/davinci-resolve-20-released-with-handful-of-ai-assisted-features/ — accessed 2026-06-01. Tier 4.
Vidyo.ai (now Quso.ai) — AI video repurposing, clip generator, pricing (Lite $15 / Essential $20 / Growth $25, credit model), rebrand to Quso. https://quso.ai/pricing — accessed 2026-06-01. Tier 4.
"Reframe Anything: LLM Agent for Open World Video Reframing" — auto-reframe as a subject-reasoning agent that tracks the salient object, not just a face. https://what-makes-good-video.github.io/assets/27_Reframe_Anything_LLM_Agent_.pdf — accessed 2026-06-01. Tier 5 (academic).
Mux — Extracting and handling subtitles/captions with FFmpeg; subtitle format behavior. https://www.mux.com/articles/extracting-subtitles-and-captions-from-video-files-with-ffmpeg — accessed 2026-06-01. Tier 6 (educational).
ASS / WebVTT karaoke caption rendering — \k tag family, word-level highlight, FFmpeg ass= burn-in recipe, 4–6 words rule, word-timestamp accuracy (30–40 ms clean, 80–100 ms noisy). VidNo / VocalLab / QuickLRC format guides. https://vidno.ai/blog/karaoke-style-word-highlight-captions — accessed 2026-06-01. Tier 6.
EU Artificial Intelligence Act — Article 50, Transparency obligations (effective 2 Aug 2026; machine-readable marking of generative output; deepfake disclosure obligation). https://artificialintelligenceact.eu/article/50/ — accessed 2026-06-01. Tier 1 (primary law).
European Commission — Code of Practice on marking and labelling of AI-generated content (C2PA Content Credentials named by example; metadata + watermark + logging layers). https://digital-strategy.ec.europa.eu/en/policies/code-practice-ai-generated-content — accessed 2026-06-01. Tier 1.
C2PA — Content Credentials specification (signed manifest declaring generator, time, signer). https://c2pa.org/ — accessed 2026-06-01. Tier 1 (standards body).
AI video editing market and short-form statistics 2026 (market size, CAGR, adoption, cost-reduction figures). market.us / autofaceless / ALM Corp aggregations. https://market.us/report/ai-in-video-editing-market/ — accessed 2026-06-01. Tier 7 (analyst aggregation; cited as directional only, with year labels).
