Fine-Tuning A Video VLM On A Domain — Surveillance, Telemedicine, E-Learning

Why This Matters

You have read the open-frontier lesson and picked an open model — Qwen3-VL, InternVL, or Pixtral — to run on your own hardware. It works on general video. Then you point it at your footage and it stumbles: it does not know that "tailgating" means two people through one badged door, it writes consult notes in the wrong shape for your clinic, it cannot tell a finished lab exercise from an abandoned one. Fine-tuning is how you close that gap, and it is the single most misunderstood step in shipping video AI — half of teams reach for it when a better prompt would have done, and the other half avoid it when it was the only thing that would work. This lesson is written for the product manager, founder, or operations lead who has to decide whether to fund a fine-tuning project, and needs to understand the trade-offs well enough to challenge an over-eager or over-cautious engineer. It assumes you have read the open-frontier lesson; if "weights", "open model", and "GPU" are new words, read that first.

What Fine-Tuning Actually Is

Start with the word, because it is a precise metaphor. A radio is already built and mostly working; tuning it is the small final adjustment that locks onto one station clearly. Fine-tuning a model is the same: the model already knows how to see and how to write — that knowledge came from its original training on billions of images and words — and you are making a small, final adjustment so it locks onto your domain.

Mechanically, fine-tuning means continuing the model's training, but on a much smaller, hand-picked set of examples that look like the job you actually need done. The model you downloaded is a giant table of numbers, called weights, that encode everything it learned. Training nudges those numbers; fine-tuning nudges them a little more, in the direction of your examples. If you show it a thousand surveillance clips each paired with the exact alert text your operators expect, it gradually shifts from writing generic descriptions to writing your alerts.

Here is the distinction that trips everyone up. Fine-tuning changes the model's behaviour, vocabulary, and output format — how it responds. It is a poor way to teach the model new facts that change often, such as today's staff roster or this week's policy. For facts, a different tool called retrieval (covered in the video RAG lesson) is the right answer, and we draw that line sharply below. The one-sentence version: fine-tune to change how the model talks; retrieve to change what it knows.

Figure 1. Fine-tuning is a small, final adjustment to a model that already works. You are not building a model — you are pointing an existing one at your domain.

The Decision That Comes First — Prompt, Retrieve, Or Fine-Tune

Before you spend a cent on fine-tuning, you must rule out the two cheaper tools, because they solve most problems and fine-tuning solves only some. The discipline that saves teams the most money is to try them in order: prompt first, then retrieval, and reach for fine-tuning only when both fall short.

The first tool is prompt engineering — simply writing better instructions and showing a few examples inside the request itself, with no training at all. If the model gets it right when you spell out the task clearly and paste in three good examples, you are done. This is free, takes an afternoon, and should always be tried first. Many "we need to fine-tune" problems evaporate when someone writes a proper prompt.
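As a rough illustration of what "spell out the task and paste in three good examples" looks like in practice, here is a minimal sketch of a few-shot prompt builder. The function name, the field labels, and the use of a text summary in place of the attached video are all assumptions made for illustration; a real request would attach the footage itself in whatever format your model expects.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Combine an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction.strip(), ""]
    for number, (clip_summary, expected_output) in enumerate(examples, start=1):
        lines.append(f"Example {number}")
        lines.append(f"Clip: {clip_summary}")
        lines.append(f"Output: {expected_output}")
        lines.append("")
    # The new clip goes last, with the output left open for the model to fill in.
    lines.append("New clip")
    lines.append(f"Clip: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Write a one-line security alert for the clip.",
    [
        ("two people pass one badged door on a single badge swipe",
         "TAILGATING, two persons, single badge event"),
        ("one person badges in and enters alone",
         "NORMAL ENTRY, one person, one badge event"),
    ],
    "three people follow one badge swipe through a door",
)
print(prompt)
```

If the model answers correctly with a prompt like this, no training is needed.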

The second tool is retrieval-augmented generation, or RAG — you keep a searchable store of your documents or past footage, and at question time you fetch the relevant pieces and hand them to the model alongside the question. RAG is the right tool when the problem is missing knowledge that changes: which staff member is on shift, what the current escalation policy says, what happened in last Tuesday's recording. You can update a retrieval store in seconds by adding a document; you cannot update a fine-tuned model without retraining it. This is why RAG owns facts and fine-tuning does not.

You reach for fine-tuning only when the gap is about behaviour the prompt cannot fix: the model needs a specialised vocabulary it was never taught, a consistent output format no amount of instruction reliably produces, a visual judgment your domain makes that general training never covered, or you are calling it so often that baking the behaviour in is cheaper than re-sending a long prompt every time. In practice the strongest systems combine all three — a fine-tuned model that follows a good prompt and pulls live facts from retrieval — and a 2026 industry comparison found that pairing retrieval with fine-tuning improved accuracy by more than eleven percentage points over either one alone.
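The decision order described in this section can be written down as a small rule of thumb. The sketch below is only one possible encoding: the function name and the four yes/no inputs are hypothetical, and a real decision involves judgment that a boolean cannot capture.

```python
def recommend_tools(prompt_alone_works, needs_changing_facts,
                    needs_new_behaviour, very_high_call_volume=False):
    """Return the tools to use, in the order they should be tried."""
    tools = ["prompt"]  # always the first thing to try
    if needs_changing_facts:
        # Facts that change belong in a retrieval store, not in the weights.
        tools.append("retrieve")
    behaviour_gap = needs_new_behaviour and not prompt_alone_works
    if behaviour_gap or very_high_call_volume:
        # Either the prompt cannot fix the behaviour, or re-sending a long
        # prompt on every call costs more than baking the behaviour in.
        tools.append("fine-tune")
    return tools


print(recommend_tools(True, False, False))   # ['prompt']
print(recommend_tools(False, True, True))    # ['prompt', 'retrieve', 'fine-tune']
```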

Figure 2. Try the cheap tools first. Fine-tuning is the right answer for behaviour and format, not for facts — and it is rarely the first thing you should try.

How The Training Actually Happens — Full, LoRA, And QLoRA

Once you have decided to fine-tune, there are three ways to do it, and the choice is mostly about how much GPU memory you are willing to pay for. Understanding the difference is what lets you push back when an engineer quotes you a number that sounds too high.

The first way is full fine-tuning: you let training adjust every one of the model's billions of numbers. This gives the model the most freedom to change, but it is brutally expensive — you need enough GPU memory to hold the entire model plus several working copies of it during training. It also has a dangerous side effect we cover below: because you are moving everything, the model can forget its general skills.

The second way, and the one almost every team uses in 2026, is LoRA — short for Low-Rank Adaptation. Instead of changing the model's numbers, LoRA freezes all of them and adds a small set of brand-new, trainable numbers alongside, like a thin transparent overlay laid on top of a finished map. Training only adjusts the overlay; the original map underneath is untouched. The payoff is enormous: the team that invented LoRA reported it can reduce the number of values you train by up to ten thousand times and cut the GPU memory needed by about three times, compared with full fine-tuning, while reaching comparable quality. Because the original is frozen, LoRA also forgets far less.
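To see why the overlay is so small, it helps to count numbers. In LoRA, a frozen weight matrix with d_out rows and d_in columns gets two thin trainable matrices beside it, one of shape d_out by r and one of shape r by d_in, where r (the rank) is small. The sketch below counts both for a single layer; the layer size 4096 and rank 8 are illustrative assumptions, not figures from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable values for one layer: full fine-tuning vs LoRA."""
    full = d_out * d_in              # every entry of the original matrix
    lora = rank * (d_out + d_in)     # the two thin overlay matrices
    return full, lora, full / lora


full, lora, ratio = lora_param_counts(4096, 4096, 8)
print(full)   # 16777216
print(lora)   # 65536
print(ratio)  # 256.0
```

This single-layer ratio is smaller than the headline figure because that figure was reported for a much larger model with the overlay applied to only some of its matrices.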

The third way is QLoRA, short for quantized LoRA. It does everything LoRA does, but first it compresses the frozen original model to a smaller, lower-precision form (this compression is called quantization) so it takes up less memory while training. The result is the cheapest option of all: the researchers who introduced QLoRA showed it could fine-tune a 65-billion-parameter model on a single 48 GB GPU, a job that would otherwise need a multi-GPU server. The trade-off is a slight quality cost from the compression, usually small enough to ignore.
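Quantization is easier to picture with a toy example. The sketch below stores each weight as one of fifteen whole-number codes (from -7 to 7) plus a single shared scale, then reconstructs approximate weights from them. This is a deliberately simplified scheme chosen for illustration; it is not the 4-bit format QLoRA itself uses, and the function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to integer codes in [-7, 7] plus one shared scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
# Each restored value is within half a step of the original.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, worst_error <= scale / 2 + 1e-9)
```

The small rounding error on each weight is the "slight quality cost" of compression; the memory saving comes from storing a short code instead of a full-precision number.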

The memory difference is not abstract, so put real numbers on it. Using the published hardware table from LLaMA-Factory, a popular open fine-tuning toolkit, for a 7-billion-parameter model:

Full fine-tuning (32-bit mixed precision) : about 120 GB of GPU memory
Full fine-tuning (pure 16-bit)            : about  60 GB
LoRA (16-bit)                             : about  16 GB
QLoRA (4-bit)                             : about   6 GB

Read those four numbers and the whole economics becomes clear. Full fine-tuning of a mid-sized video model needs roughly 120 gigabytes, more than a common 80 GB data-centre GPU holds, so you typically need several chips wired together. LoRA brings the same model down to about 16 gigabytes, which fits comfortably on one mainstream data-centre GPU. QLoRA squeezes it to about 6 gigabytes, which fits on a desktop card or a high-end laptop GPU. That is the difference between a multi-GPU server project and an afternoon on one machine, and it is why "we need to fine-tune" stopped being a scary sentence around 2024.

Method	What it trains	GPU memory (7B model)	Forgetting risk	Reach for it when…
Full fine-tuning	Every weight in the model	~120 GB (mixed precision)	High	You have lots of data and the domain is very far from general video
LoRA	A small trainable overlay; original frozen	~16 GB	Low	The default for almost every domain video project
QLoRA	LoRA on top of a compressed model	~6 GB	Low	Hardware is tight, or the model is large and you have one GPU

Memory figures are estimates from the LLaMA-Factory hardware table for a 7-billion-parameter model and scale roughly with model size. Treat them as orientation, not a quote — your batch size, video length, and sequence length move them.
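For a quick sanity check on an engineer's quote, the table can be turned into a back-of-envelope estimator. The sketch below simply scales the 7-billion-parameter figures linearly with model size, which is the rough assumption stated above; the function names are hypothetical and the result is orientation only.

```python
# GPU memory for a 7-billion-parameter model, taken from the table above.
MEMORY_GB_AT_7B = {"full": 120, "lora": 16, "qlora": 6}


def estimate_gpu_memory_gb(method, billions_of_params):
    """Scale the 7B figure linearly with model size (a rough assumption)."""
    return MEMORY_GB_AT_7B[method] * billions_of_params / 7


def fits_on_gpu(method, billions_of_params, gpu_memory_gb):
    """True if the estimate fits within one GPU of the given size."""
    return estimate_gpu_memory_gb(method, billions_of_params) <= gpu_memory_gb


print(estimate_gpu_memory_gb("lora", 7))    # 16.0
print(estimate_gpu_memory_gb("qlora", 14))  # 12.0
print(fits_on_gpu("full", 7, 80))           # False
```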

One caveat that catches video teams specifically: when you fine-tune a vision-language model, you sometimes want to adjust the part that reads images, not just the part that writes text. Several toolkits warn that you cannot combine the heaviest compression (4-bit) with training that image-reading part — you have to use a lighter setting, which costs more memory. It is a small technical point, but it is exactly the kind of detail that turns a "6 GB" estimate into a "16 GB" reality, so ask your engineer which parts of the model the plan actually trains.

The Part Nobody Budgets For — Building The Dataset

Here is the truth the tutorials bury: the training run is the easy, cheap, fast part. The expensive part is the data. A fine-tuning dataset for video is a collection of examples, each one a piece of footage paired with the exact output you want the model to produce for it — the right alert, the right note, the right label. The model learns by example, so every example has to be correct, because the model will faithfully learn your mistakes too.

How many examples do you need? Fewer than people fear. For teaching a consistent output format or a modest vocabulary, a few hundred well-chosen examples often move the needle; for a genuinely new visual judgment, you are usually in the low thousands. More is not automatically better: a thousand clean, varied, correctly labelled examples beat ten thousand sloppy ones, every time. The work is in the quality: clips that cover the real range of situations, labels that stay consistent across the different annotators who wrote them, and no accidental shortcut the model could cheat on (if every "intruder" clip happens to be filmed at night, the model learns "night", not "intruder").
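The "night, not intruder" trap can be screened for before training if each clip carries a little metadata. The sketch below flags any label whose clips almost all share one value of some attribute; the field names, the 90 percent threshold, and the function name are hypothetical choices for illustration, and a flag is a prompt to look closer, not proof of a problem.

```python
from collections import Counter


def shortcut_candidates(examples, label_key, attribute_key, threshold=0.9):
    """Flag labels where one attribute value dominates that label's clips."""
    counts_by_label = {}
    for example in examples:
        counter = counts_by_label.setdefault(example[label_key], Counter())
        counter[example[attribute_key]] += 1
    flagged = {}
    for label, counts in counts_by_label.items():
        value, count = counts.most_common(1)[0]
        if count / sum(counts.values()) >= threshold:
            flagged[label] = value
    return flagged


clips = (
    [{"label": "intruder", "time_of_day": "night"}] * 10
    + [{"label": "normal", "time_of_day": "day"}] * 5
    + [{"label": "normal", "time_of_day": "night"}] * 5
)
print(shortcut_candidates(clips, "label", "time_of_day"))  # {'intruder': 'night'}
```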

Crucially, you must split your examples into two piles before you train. The larger pile, the training set, is what the model learns from. The smaller pile, the held-out test set — perhaps ten to twenty percent, never shown during training — is how you measure whether fine-tuning actually helped. Without a held-out set you are flying blind: a model can look brilliant on the examples it memorised and fail on everything new, and you would never know. This single discipline — hold data back, measure on it — separates teams that ship working fine-tuned models from teams that ship confident-sounding broken ones.
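Splitting the two piles is a few lines of code. The sketch below shuffles the examples with a fixed seed so the split is repeatable, then holds back a fraction as the test set; the 20 percent default and the function name are assumptions for illustration.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle once, then hold back a fraction that training never sees."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    test_size = max(1, round(len(shuffled) * test_fraction))
    test_set = shuffled[:test_size]
    training_set = shuffled[test_size:]
    return training_set, test_set


training_set, test_set = split_examples(range(1000))
print(len(training_set), len(test_set))  # 800 200
```

With video, clips cut from the same recording can look nearly identical, so it is usually worth keeping all clips from one recording in the same pile; this simple sketch does not do that.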

A worked example of the budget makes the point. Suppose you want a surveillance model to write your alert format, and you decide a thousand labelled clips is the target.

1,000 clips to label
÷ 20 clips labelled per hour by a trained annotator
= 50 hours of annotation work

Plus a held-out test set: 200 more clips = 10 hours
Plus one review pass for label consistency = ~15 hours
≈ 75 person-hours before a single training run

The training run itself, on one GPU with LoRA, might take three to six hours and cost ten to thirty dollars in compute. The seventy-five hours of human labelling is the real project. Any plan that treats data as a footnote and the GPU as the main event has the budget upside down.
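The same arithmetic can be kept as a small calculator so the estimate is easy to redo when the clip count or labelling speed changes. The function name and parameters are hypothetical; the inputs are the ones from the worked example.

```python
def labelling_hours(training_clips, test_clips, clips_per_hour, review_hours):
    """Person-hours of data work before the first training run."""
    training_hours = training_clips / clips_per_hour
    test_hours = test_clips / clips_per_hour
    return training_hours + test_hours + review_hours


total = labelling_hours(training_clips=1000, test_clips=200,
                        clips_per_hour=20, review_hours=15)
print(total)  # 75.0
```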

Catastrophic Forgetting — The Failure Mode To Watch For

There is one failure mode specific to fine-tuning that you should be able to name, because it is both common and invisible if you are not testing for it. It has a dramatic name: catastrophic forgetting. It means that while the model was busy learning your domain, it quietly lost some of its general ability — it now writes perfect surveillance alerts but can no longer answer a simple question about an ordinary scene, because the training pulled its numbers so far toward your examples that the old skills washed out.

The cause is over-aggressive training: too many passes over the same small dataset (each full pass is called an epoch), or full fine-tuning that moves every number at once. Researchers have shown the damage can appear after as little as a single epoch on a small dataset when you train too much of the model. The reason LoRA is the default is partly this: because it freezes the original and only trains a small overlay, the general skills underneath survive, and forgetting is far milder.

The defence is the held-out test set again, used a second way. Before you fine-tune, run the original model on a handful of general tasks and write down the scores. After fine-tuning, run the same general tasks again. If the general scores collapsed while your domain scores rose, you over-trained — dial back the number of epochs, use LoRA instead of full fine-tuning, or mix a few general examples back into your training data. The mistake is never noticing, shipping a model that aces your demo and then disappoints on the long tail of real inputs.
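The before-and-after comparison can be automated once the scores are written down. The sketch below lists every general task whose score fell by more than a chosen tolerance; the task names, the scores, and the 0.05 tolerance are all made-up values for illustration.

```python
def forgetting_report(scores_before, scores_after, max_drop=0.05):
    """Return general tasks whose score dropped by more than max_drop."""
    report = {}
    for task, before in scores_before.items():
        drop = before - scores_after[task]
        if drop > max_drop:
            report[task] = round(drop, 4)
    return report


before = {"captioning": 0.80, "general_qa": 0.70}
after = {"captioning": 0.78, "general_qa": 0.50}
print(forgetting_report(before, after))  # {'general_qa': 0.2}
```

An empty report means the general skills held up within the tolerance you chose.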

Three Domains, Three Different Jobs

The same machinery serves very different goals depending on the vertical, and the differences are instructive because they change what "a good dataset" even means.

In video surveillance, fine-tuning teaches the model your site's vocabulary and your alert format. A general model says "two people walk through a door"; a fine-tuned one says "TAILGATING — two persons, single badge event, Door 14, 02:14" because you showed it a few hundred clips labelled exactly that way. The dataset challenge here is rare events: the interesting clips (an actual intrusion) are a tiny fraction of footage, so you over-sample them deliberately rather than feeding the model a realistic stream that is 99.9 percent empty corridor. A second reason surveillance fine-tunes is residency: this footage usually cannot leave the customer's servers, so renting a closed API is off the table, and an open model fine-tuned on-premises is the only compliant path — which connects directly to the build-versus-buy case in the open-frontier lesson.
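Deliberate over-sampling of rare events can be sketched as follows: rare clips are repeated until they make up a chosen share of the training pile. The 25 percent target, the function name, and the choice to repeat clips rather than collect new ones are assumptions for illustration.

```python
import math
import random


def oversample_rare(examples, is_rare, target_share=0.25, seed=0):
    """Repeat rare examples until they reach target_share of the pile."""
    rng = random.Random(seed)
    rare = [e for e in examples if is_rare(e)]
    common = [e for e in examples if not is_rare(e)]
    if not rare or not common:
        return list(examples)
    # Solve rare_needed / (rare_needed + len(common)) >= target_share.
    rare_needed = math.ceil(target_share * len(common) / (1 - target_share))
    extra = [rng.choice(rare) for _ in range(max(0, rare_needed - len(rare)))]
    mixed = common + rare + extra
    rng.shuffle(mixed)
    return mixed


footage = ["empty corridor"] * 99 + ["intrusion"]
balanced = oversample_rare(footage, lambda clip: clip == "intrusion")
print(balanced.count("intrusion"), len(balanced))  # 33 132
```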

In telemedicine, the goal is usually format and terminology, not new vision. A clinician needs the model's summary of a consultation to land in the exact structure their records demand — complaint, history, observations, plan — using the right clinical terms, every time, without a paragraph of prompt instructions each call. Fine-tuning on a few hundred consult-and-note pairs bakes that structure in. The dataset challenge is privacy and correctness: every training example is sensitive patient data that must be handled under the relevant health-privacy rules, de-identified where required, and reviewed by a clinician, because a confidently wrong medical note is worse than no note. The same data-residency logic as surveillance applies — which is why this work runs on self-hosted open models.
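Because the goal here is a fixed structure, format adherence is something you can measure mechanically on the held-out set. The sketch below checks that a note contains the four sections in order; the exact heading text is a hypothetical choice, and a check like this says nothing about whether the clinical content is correct.

```python
REQUIRED_SECTIONS = ["Complaint", "History", "Observations", "Plan"]


def follows_note_format(note):
    """True if every required heading appears, in the required order."""
    positions = []
    for section in REQUIRED_SECTIONS:
        index = note.find(section + ":")
        if index == -1:
            return False  # a required section is missing
        positions.append(index)
    return positions == sorted(positions)


good_note = "Complaint: cough\nHistory: two weeks\nObservations: clear chest\nPlan: review in 7 days"
bad_note = "Plan: review in 7 days\nComplaint: cough"
print(follows_note_format(good_note), follows_note_format(bad_note))  # True False
```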

In e-learning, fine-tuning teaches a visual judgment a general model lacks: did the student actually complete the lab exercise shown in this screen-recording, and where did they get stuck? A general model describes the screen; a fine-tuned one applies your rubric. The dataset challenge is subjectivity — two graders may disagree on "stuck" — so the review pass that makes labels consistent matters more here than anywhere, and a clear rubric written before labelling starts is half the battle.
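Grader disagreement can be put into a number before any training starts. One standard statistic for this is Cohen's kappa, which measures agreement between two graders after discounting the agreement they would reach by chance; it is offered here as one reasonable choice, not as the method this lesson prescribes. The labels below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two graders, corrected for chance agreement."""
    total = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / total
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (total * total)
    if expected == 1:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1 - expected)


grader_1 = ["stuck", "stuck", "done", "done"]
grader_2 = ["stuck", "done", "stuck", "done"]
print(cohens_kappa(grader_1, grader_1))  # 1.0
print(cohens_kappa(grader_1, grader_2))  # 0.0
```

A value near 1 means the rubric is being applied consistently; a value near 0 means the graders agree no more often than chance, which suggests the rubric needs work before labelling continues.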

The Tools You'll Hear Named

You do not need to operate these, but you will hear them in planning meetings, and recognising them helps you tell a grounded plan from hand-waving. The dominant open toolkits in 2026 are LLaMA-Factory (a broad, beginner-friendly framework with a no-code web interface, Apache-2.0 licensed, supporting Qwen3-VL, Qwen2.5-VL, InternVL, and LLaVA out of the box), ms-swift from the ModelScope community (supporting over three hundred multimodal models including video fine-tuning), and Unsloth (specialised in squeezing training onto the least possible memory and time). For teams already living in the Hugging Face ecosystem, its TRL library is the reference path. All of them support LoRA, QLoRA, and full fine-tuning; the choice between them is about team familiarity and interface preference, not capability. If an engineer cannot tell you which one they will use and why, the plan is not ready.

Where Fora Soft Fits In

We run domain fine-tuning inside the video products we build, and the pattern is the one this article describes. In video surveillance and telemedicine, where footage cannot legally leave the customer's servers, we fine-tune an open model on-premises so the data never moves, and we spend most of the project on the dataset — the labelling, the consistency review, and the held-out test set — not the training run. In e-learning and OTT, where features run at high volume, we fine-tune to bake in a behaviour rather than re-send a long prompt on every call, because at scale that is both faster and cheaper. The judgment we bring is the one this lesson teaches: rule out prompting and retrieval first, default to LoRA, budget the human labelling honestly, and always measure on data the model never saw — including a check that it did not forget its general skills.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your VLM fine-tuning plan.
See our case studies: 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the VLM Fine-Tuning Readiness Checklist: a one-page printable decision sheet covering the prompt-vs-retrieve-vs-fine-tune test, the LoRA/QLoRA GPU-memory cheat-sheet, dataset-size and labelling-budget rules of thumb, and the catastrophic-forgetting check.

References

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685 (2021, accessed 2026-06-01). — Primary source for the LoRA method (freeze pre-trained weights, inject trainable low-rank matrices into each Transformer layer) and the headline figures: up to 10,000× fewer trainable parameters and ~3× less GPU memory versus full fine-tuning of GPT-3 175B, at comparable quality.
Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L. "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314 (2023, accessed 2026-06-01). Primary source for QLoRA (4-bit quantization of the frozen base model plus LoRA adapters) and the result that a 65-billion-parameter model can be fine-tuned on a single 48 GB GPU while preserving full 16-bit fine-tuning performance.
"LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs." hiyouga/LLaMA-Factory GitHub repository, Apache-2.0 licence (ACL 2024 system; accessed 2026-06-01). Source for the per-method GPU-memory hardware table (7B: full 32-bit mixed precision ≈120 GB, full pure 16-bit ≈60 GB, LoRA ≈16 GB, QLoRA 4-bit ≈6 GB), the FSDP+QLoRA 70B-on-2×24GB result, the supported-VLM list (Qwen3-VL, Qwen2.5-VL, InternVL, LLaVA), and the no-code interface.
"ms-swift: Use PEFT or Full-parameter to fine-tune 600+ LLMs and 300+ MLLMs." modelscope/ms-swift GitHub repository (AAAI 2025; accessed 2026-06-01). — Source for ms-swift's multimodal coverage (Qwen3-VL, InternVL3.5, LLaVA, DeepSeek-VL2 and others), its support for SFT/DPO/GRPO, and documented video fine-tuning (VQA/OCR/Grounding/Video; video-chatgpt dataset).
"Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)." Hugging Face cookbook, huggingface.co/learn/cookbook (accessed 2026-06-01). — Reference implementation for LoRA/QLoRA fine-tuning of a video-capable VLM in the TRL library, including the data-formatting and image/video token conventions.
Zhai, Y., et al. "Investigating the Catastrophic Forgetting in Multimodal Large Language Models." Proceedings of Machine Learning Research, vol. 234 (accessed 2026-06-01). — Source for the catastrophic-forgetting phenomenon in fine-tuned multimodal models, the finding that degradation can appear after a single epoch on a small dataset, and the role of held-out general benchmarks in detecting it.
"Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting." arXiv:2509.22195 (2025, accessed 2026-06-01). — Source for the point that full fine-tuning and jointly training the language model plus adapter overfit small datasets and cause forgetting, while LoRA training schemes and data-alignment strategies mitigate it.
"RAG vs Fine-Tuning vs Prompt Engineering." IBM Think (ibm.com/think), and the 2026 K2view and FreeAcademy comparisons (accessed 2026-06-01). — Sources for the prompt-then-retrieve-then-fine-tune decision order, the rule that RAG owns frequently-changing facts while fine-tuning owns behaviour and format, and the reported >11-percentage-point accuracy gain from combining retrieval with fine-tuning over either alone.
"How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?" arXiv:2504.14391 (2025, accessed 2026-06-01); and "Parameter-Efficient Adaptation of Large Vision-Language Models." PMC11944706 (accessed 2026-06-01). — Sources for domain adaptation of general VLMs to medical/specialised video via parameter-efficient adapters (LoRA), and the use of instruction-following data built from domain video.