Key takeaways

67 % of production LLM deployments use RAG (McKinsey 2026 enterprise AI report), up from 31 % in 2024. The pattern is mainstream; video and audio applications are the underserved frontier.

Generic RAG fails on video. Transcript chunking strategy, time-anchored retrieval, multimodal embedding choice, speaker diarisation — all matter more than the choice of vector store.

The 73 % rule. When RAG fails in production, the failure is retrieval 73 % of the time, not generation. Invest in retrieval quality (chunking, reranking, hybrid search) before tuning the LLM prompt.

Cost is dominated by ASR + LLM inference, not the vector store. 1k hours of video runs ~$300 one-time for ASR plus ~$50/mo for new content, ~$15/mo for the vector store, and pennies for embedding. LLM inference scales with query volume.

Time-anchored deep links are the killer feature. A RAG response that links back to the exact moment in the source video is what makes “chat with my recordings” useful, not novelty. Engineer the time anchors from day 1.

Why Fora Soft wrote this playbook

Fora Soft has shipped 200+ video and audio products since 2005 and built RAG implementations on top of several. BrainCert ($10M ARR e-learning) for lecture-search; VALT (650+ legal organisations) for e-discovery search across deposition recordings; TransLinguist for multilingual transcript retrieval; Mangomolo for broadcaster archive search.

Through 2024–2026 we built four production video-RAG systems and audited two more. The patterns in this guide come from those engagements plus public references — the VideoRAG paper (KDD 2026), AWS V-RAG, Morphik’s production strategies, McKinsey’s 67 % adoption number.

If you are building “chat with my recordings”, semantic video search, lecture Q&A, or any RAG application where the source content is video or audio, this guide gives you the architecture, chunking strategies, vector-store choice and cost model.

Need RAG over your video / audio archive?

Send us your video count, total hours, and use case. We will return an architecture and cost forecast within 48 hours, free.

Book a 30-min call → WhatsApp → Email us →

Why generic RAG fails on video

Generic RAG (Pinecone + LangChain + GPT-4) on a documentation set works in 50 lines of Python. The same approach on video archives produces useless answers. Three reasons:

1. Transcript chunking is harder than document chunking. Documents have headings, paragraphs, sentences. Transcripts have speaker turns, pauses, interruptions, mid-sentence topic switches. A 500-word chunk grabs 3 different speakers discussing 4 different topics. Retrieval surfaces the wrong moment because the chunk has no clear semantic boundary.

2. Time anchoring is essential, not optional. “The patient mentioned chest pain” is useless without “at minute 12:34 in the August 14 consult.” The deep link back to source video is what makes the response actionable. Engineer time anchors into every chunk; track them through retrieval and response.

3. Multimodal context matters for some use cases. A surveillance recording where a person enters a restricted area needs visual context, not just audio transcript. A lecture with whiteboard diagrams needs visual content. Pure text-embedding RAG misses the visual signal entirely.

Reference architecture

Video/audio source (MP4, recording, archive) → ASR + diarisation (Whisper / AssemblyAI / Deepgram) → chunking + time anchors (speaker-turn aware, ~30 s overlap) → embedding (BGE / OpenAI) → vector store (Qdrant / Pinecone / Weaviate / pgvector / Chroma) with hybrid search (BM25 + dense), reranking and metadata filters → reranker (Cohere Rerank / BGE Reranker / cross-encoder, top-50 → top-5) → LLM with citation injection (GPT-4 / Claude 3.5 / Gemini; system prompt enforces citations to chunk IDs) → response with time-anchored deep links ("Patient mentioned chest pain at 12:34 in Aug 14 consult").

Figure 1. Video/audio RAG pipeline — ASR, chunking, embedding, vector store, reranker, LLM, time-anchored response.

Chunking strategies for transcripts

Sentence-level chunking. Each sentence becomes a chunk. Tight semantic boundary; many small chunks; retrieval can pull surrounding context if needed. Best for FAQ-style queries against well-spoken content.

Speaker-turn chunking. Each speaker turn (continuous span by one speaker) becomes a chunk. Preserves conversational structure; aligns with diarisation output. Best for conversational content (meetings, depositions, interviews).

Semantic chunking. Detects topic shifts via embedding similarity. Chunk boundaries fall at natural semantic breaks. Higher quality; expensive at ingest time. Best for lectures, podcasts, long-form content.

Fixed-window chunking with overlap. 30-second windows with 5-second overlap. Simple, predictable, works adequately. Best for prototyping or when speaker / topic structure is poor.

Hybrid (recommended). Speaker-turn primary, with secondary semantic-shift detection within long turns. Window of 60–90 seconds typical; overlap of 10–20 seconds for context preservation. Used in most of our production deployments.

Time anchors per chunk. Each chunk stores: chunk ID, source video ID, start_time, end_time, speaker (if known), embedding. Retrieval returns chunk plus time anchor; response cites “at minute X in source Y”.
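
To make the hybrid strategy and schema concrete, here is a minimal sketch of a speaker-turn chunker that carries time anchors through to every chunk. The TranscriptSegment/Chunk classes and the 90-second cap are illustrative assumptions, not any library's API; overlap handling is omitted for brevity.

```python
# A sketch, not a library API: TranscriptSegment/Chunk and the 90 s cap
# are illustrative assumptions. Overlap between chunks is omitted.
from dataclasses import dataclass

MAX_TURN_SECONDS = 90  # upper bound per the 60-90 s guidance above

@dataclass
class TranscriptSegment:
    text: str
    start: float    # seconds from start of recording
    end: float
    speaker: str    # from diarisation

@dataclass
class Chunk:
    chunk_id: str
    video_id: str
    start_time: float
    end_time: float
    speaker: str
    text: str       # embed this; store the rest as payload/metadata

def chunk_by_speaker_turn(video_id: str,
                          segments: list[TranscriptSegment]) -> list[Chunk]:
    """Merge consecutive same-speaker segments into one chunk; split
    turns that run past MAX_TURN_SECONDS."""
    chunks: list[Chunk] = []
    buf: list[TranscriptSegment] = []

    def flush() -> None:
        if buf:
            chunks.append(Chunk(
                chunk_id=f"{video_id}:{len(chunks)}",
                video_id=video_id,
                start_time=buf[0].start,
                end_time=buf[-1].end,
                speaker=buf[0].speaker,
                text=" ".join(s.text for s in buf),
            ))
            buf.clear()

    for seg in segments:
        if buf and (seg.speaker != buf[0].speaker
                    or seg.end - buf[0].start > MAX_TURN_SECONDS):
            flush()
        buf.append(seg)
    flush()
    return chunks
```

Because start_time/end_time travel with the chunk into the vector store payload, the citation step later needs no extra lookups.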

When you need multimodal embedding vs text-only

| Content type | Recommended | Why |
|---|---|---|
| Meetings, podcasts, depositions | Text-only embedding | Speech-heavy; visual content low-signal |
| Lectures with diagrams, demos | Hybrid (text + key-frame embedding) | Visual content carries information |
| Surveillance recordings | Multimodal (CLIP + text) | Visual is primary; little speech |
| Sports broadcasts | Multimodal + commentator track | Visual + audio commentary both matter |
| Music, ambient audio | Audio embedding (CLAP) | No language; sonic features dominate |

Multimodal embedding tooling. CLIP for image-text pairs (mature, fast). VideoCLIP / X-CLIP for video. CLAP for audio. Twelve Labs commercial multimodal API. ImageBind from Meta. The 2026 sweet spot is hybrid: text embedding for transcript + sparse key-frame embeddings for visual context.
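
A minimal sketch of that hybrid sweet spot, assuming the public openai/clip-vit-base-patch32 and BAAI/bge-large-en-v1.5 checkpoints via transformers and sentence-transformers; how you store the two vector sets (named vectors vs separate collections) is a design choice not shown here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

# Text side: transcript chunks through a text embedding model.
text_model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Visual side: sparse key frames (e.g. one frame per 30 s) through CLIP.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_transcript_chunks(texts: list[str]) -> torch.Tensor:
    return torch.tensor(text_model.encode(texts, normalize_embeddings=True))

def embed_key_frames(frame_paths: list[str]) -> torch.Tensor:
    images = [Image.open(p) for p in frame_paths]
    inputs = clip_proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalise
```

Both vector sets carry the same time anchors, so a visual hit deep-links to a moment exactly like a transcript hit.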

Vector store comparison

| Store | Hosting | Strength | Best for |
|---|---|---|---|
| Qdrant | Self-hosted or cloud | Performance, hybrid search, payload filtering | 2026 default for serious deployments |
| Pinecone | Managed cloud only | Operational simplicity, broad ecosystem | Time-to-market priority, no SRE |
| Weaviate | Self-hosted or cloud | Schema, hybrid, modules | Enterprise with structured metadata needs |
| pgvector | Postgres extension | Already runs Postgres; no new infra | <10M vectors, transactional integration |
| Chroma | Self-hosted, embedded | Developer ergonomics, embed-in-app | Prototyping, single-tenant apps |
| Milvus | Self-hosted, Zilliz cloud | Scale to billions of vectors | Massive datasets, R&D-grade workloads |

Cost difference between vector stores is small relative to embedding + LLM costs. Pick the one your team can operate; do not over-optimise the vector-store choice.

The 73 % rule — when RAG fails, it’s retrieval

Industry analysis in 2026 consistently shows: when RAG fails in production, the failure is retrieval 73 % of the time, not generation. The LLM is rarely the bottleneck once you are using GPT-4-class or Claude-3.5-class. The bottleneck is what the retriever surfaces.

Common retrieval failure modes. (1) Chunks are too small (lose context). (2) Chunks are too large (dilute query relevance). (3) Hybrid search not enabled (dense + sparse beats either alone). (4) No reranker (top-50 has the answer; LLM never sees it). (5) Embedding model mismatch (multilingual content with English-only embeddings).

Investment order. Fix retrieval first: chunking, hybrid search, reranking, metadata filters. Measure each iteration's retrieval precision/recall on a labelled validation set (we cover this in our LLM evaluation guide). Only once retrieval is solid should you tune the LLM prompt.

Hybrid search. Combine dense vector search (semantic similarity) with sparse BM25 (keyword match) and rerank the union. Tip: BM25 alone catches queries with proper nouns, technical terms, and IDs that dense embeddings miss. Hybrid wins by 10–20 % retrieval precision in our deployments.
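
A minimal fusion sketch under these assumptions: rank_bm25 for the sparse side, a dense_search callable standing in for your vector-store query, and reciprocal-rank fusion (RRF) to merge the rankings, which is one common combination method, not the only one.

```python
from rank_bm25 import BM25Okapi

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: score(id) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, corpus_tokens: list[list[str]],
                  chunk_ids: list[str], dense_search,
                  top_k: int = 50) -> list[str]:
    bm25 = BM25Okapi(corpus_tokens)              # sparse: exact keywords, IDs
    sparse = bm25.get_scores(query.lower().split())
    sparse_ranking = [cid for _, cid in
                      sorted(zip(sparse, chunk_ids), reverse=True)][:top_k]
    dense_ranking = dense_search(query, top_k)   # stub: your vector store
    return rrf_fuse([sparse_ranking, dense_ranking])[:top_k]
```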

Reranker. Run top-50 from initial retrieval through a cross-encoder reranker (Cohere Rerank, BGE Reranker, ColBERT). Output top-5 to LLM. The reranker is much smaller than the LLM but radically improves precision. Worth the latency hit on any serious deployment.
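
A minimal reranking sketch using the open-source BAAI/bge-reranker-large cross-encoder via sentence-transformers; the candidate dict shape is illustrative, and Cohere Rerank slots into the same place if you prefer a managed API.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")  # far smaller than the LLM

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    """candidates: retrieved chunks as dicts with at least a 'text' key
    (plus chunk_id / start_time payload for the citation step)."""
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:top_n]]
```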

Cost model

Worked example: 1 000 hours of meeting recordings, 100 queries/day.

ASR. Whisper self-hosted on GPU: ~$0.005/audio-minute. 1k hours = 60k minutes = $300 one-time, plus ~$50/mo for new content (≈170 hours of new recordings per month). AssemblyAI / Deepgram managed: $0.01–$0.02/minute. Use self-hosted Whisper for cost efficiency once volume justifies the GPU.

Embedding. OpenAI text-embedding-3-small: $0.00002/1k tokens. 1k hours of speech ≈ 8M tokens, so the one-time embed costs ≈ $0.16 and new content adds pennies per month. Embedding cost is effectively negligible.

Vector store. Qdrant self-hosted: ~$15/mo for 1M vectors. Pinecone managed: $70/mo for similar workload. Tiny relative to LLM cost.

Reranker. Cohere Rerank: $1 per 1k queries. 100 queries/day = $3/mo.

LLM inference. GPT-4 with 5-chunk context: ~$0.02 per query. 100 queries/day = $60/mo. Or Claude 3.5 Sonnet at similar pricing. The variable cost.

Total. ~$130/mo at this scale, dominated by ASR (if you keep adding content) and LLM inference. Scaling is roughly linear: 10k hours + 1 000 queries/day ≈ $1.3k/mo.
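
The whole model fits in a few lines. This sketch hard-codes the unit prices quoted above (they will drift) plus a rough assumption of ~130 tokens per audio-minute.

```python
def monthly_cost(new_audio_minutes: float, queries_per_day: int) -> float:
    asr = new_audio_minutes * 0.005                        # self-hosted Whisper
    embedding = new_audio_minutes * 130 / 1000 * 0.00002   # ~130 tokens/min
    vector_store = 15.0                                    # self-hosted Qdrant
    rerank = queries_per_day * 30 / 1000 * 1.0             # Cohere, $1/1k queries
    llm = queries_per_day * 30 * 0.02                      # GPT-4-class
    return asr + embedding + vector_store + rerank + llm

# 10k new audio-minutes/month, 100 queries/day -> ~$128/mo
print(f"${monthly_cost(10_000, 100):,.0f}/mo")
```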

Want a cost model for YOUR archive?

Send us your video count, average duration, and query volume. We will return a complete cost forecast in 48 hours, free.

Book a 30-min call → WhatsApp → Email us →

Build vs buy — AssemblyAI LeMUR, Twelve Labs, custom

AssemblyAI LeMUR. ASR + LLM-on-transcript pipeline as a managed service. Q&A, summary, custom prompts on transcripts. Lowest time-to-market for transcript-only RAG. Cost premium over DIY of 2–3× at scale.

Twelve Labs. Multimodal video search and Q&A as a service. Video understanding (visual + audio), semantic search, scene-level retrieval. Best when visual content matters and you do not want to build CLIP-based pipeline yourself.

Morphik / V-RAG (AWS). AWS-native video RAG framework. Bedrock + S3 + Kendra integration. Best when AWS lock-in is acceptable and your data already lives there.

LlamaIndex / LangChain. Application frameworks that handle the orchestration. Combine with your choice of vector store, embedding model, LLM. The 2026 default for custom builds.

Custom. When the use case is vertical-specific (legal e-discovery, surveillance footage search, medical transcript analysis), or when compliance demands self-hosted (HIPAA-grade telehealth RAG). Higher engineering cost; full control.

Mini case — meeting platform adds ‘chat with last week’ in 6 weeks

A B2B meeting recording / transcription platform (NDA, ~50k weekly recordings) approached us in late 2025 wanting “chat with my recordings” as a v2 feature. Goal: user types “what did we agree on yesterday with Mark?” and the assistant returns the answer with deep links to the relevant moments.

The 6-week build. Week 1–2: chunking strategy (speaker-turn primary, 60-second windows, 10-second overlap), Whisper-large for ASR (already in their pipeline), text-embedding-3-small for embeddings. Weeks 3–4: Qdrant vector store with hybrid search; Cohere Rerank for top-50→top-5; GPT-4 with citation-injection system prompt. Week 5: time-anchored deep links; multi-recording filtering. Week 6: evaluation with RAGAS on 200 labelled queries; iterate retrieval until precision > 80 %.

Outcome. 84 % retrieval precision on the labelled validation set. Adoption: 35 % of users used the feature within 30 days; 58 % within 90 days. Monthly retention rose by 11 percentage points among users who used the feature 5+ times. Book a 30-min call for a similar build on your archive.

Evaluation — RAGAS, Braintrust, LangSmith

RAGAS. Open-source RAG-specific evaluation framework. Metrics: context relevance, answer relevance, groundedness (does the answer come from the retrieved context). Standard tooling for RAG-specific eval; integrates with most stacks.

Braintrust. Comprehensive eval platform; raised $80M Feb 2026 at $800M valuation. Connects production traces, evaluations, prompt iteration, CI/CD quality gates. Best for teams running multiple LLM features.

LangSmith. LangChain-native eval. Strong tracing, prompt-iteration tooling, framework integration. Best when your stack is LangChain-heavy.

The golden dataset. The hardest step in eval setup. Typical pattern: 100–200 labelled query-answer pairs covering main use cases plus edge cases. Curated by domain experts, not engineers. Without it, you are tuning blindly.
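
A minimal sketch of running RAGAS over a golden dataset. The column names follow the ragas 0.1.x API and may differ in newer releases; faithfulness is RAGAS's term for groundedness.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One row per golden query: the expert question, the pipeline's answer,
# the chunks the retriever surfaced, and the expert reference answer.
golden = Dataset.from_dict({
    "question": ["What did we agree on with Mark about pricing?"],
    "answer": ["You agreed to a 10 % discount for annual billing."],
    "contexts": [["[12:34, Aug 14] Mark: let's do 10 % off annual..."]],
    "ground_truth": ["A 10 % discount on annual billing was agreed."],
})

result = evaluate(golden, metrics=[faithfulness, answer_relevancy,
                                   context_precision])
print(result)  # per-metric scores; gate CI on faithfulness > 0.85
```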

A decision framework — pick stack in five questions

Q1. Speech-heavy or visual-heavy content? Speech: text-only embedding pipeline. Visual: multimodal (CLIP + text). Mixed: hybrid scoring.

Q2. Volume in 12 months? <1k hours: managed services (AssemblyAI LeMUR, Twelve Labs). 1k–100k: hybrid — managed ASR, custom RAG. >100k: self-hosted Whisper + custom RAG.

Q3. Compliance posture? Standard: any path. HIPAA: HIPAA-eligible STT + Azure OpenAI text endpoints + self-hosted vector store. EU AI Act high-risk: documentation requirements add 4–6 weeks.

Q4. Vector store scale? <10M vectors: pgvector if Postgres exists. 10M–1B: Qdrant or Weaviate. >1B: Milvus or distributed Qdrant.

Q5. Latency budget? Sub-second response: skip reranker, use smaller LLM, more aggressive caching. 2–3 second tolerable: full pipeline (rerank + GPT-4) is the default.

Pitfalls to avoid

1. Skipping the reranker. Top-5 from initial retrieval is noise without reranking. Always invest in cross-encoder reranking; it cuts retrieval failures by 30–50 %.

2. No time anchors. A response without deep link to source video is novelty, not utility. Engineer time anchors into chunks from day 1.

3. Tuning LLM before retrieval. The 73 % rule. Fix retrieval first.

4. No evaluation framework. Without RAGAS or equivalent on a golden dataset, every “improvement” is anecdotal. You will regress without noticing.

5. Forgetting multilingual. Default OpenAI embeddings handle 50+ languages but quality varies. For multilingual archives, test embedding quality per language; consider language-specific models for poorly supported languages.

KPIs to measure

Quality KPIs. Retrieval precision @ k=5 (target: >80 %). Retrieval recall @ k=20 (target: >90 %). Groundedness via RAGAS (target: >0.85). Answer relevance (target: >0.85). A sketch of the two retrieval metrics follows this list.

Business KPIs. User adoption rate (target: 30 %+ within 30 days for power users). Query-to-success rate (user finds what they wanted within 3 follow-ups). Retention uplift among users who use the feature.

Reliability KPIs. p95 query latency (target: <3 s). ASR completion success rate (target: 99 %+). Embedding pipeline lag (target: new content searchable within 5 minutes).
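
The two retrieval KPIs above are a few lines each. This sketch assumes each golden query is labelled with the set of chunk IDs that are relevant to it.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top-k retrieved chunk IDs that are labelled relevant."""
    return sum(cid in relevant for cid in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 20) -> float:
    """Share of the labelled-relevant chunks found in the top-k."""
    return sum(cid in relevant for cid in retrieved[:k]) / max(len(relevant), 1)
```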

FAQ

Whisper or AssemblyAI for ASR?

Self-hosted Whisper-large is the most cost-effective at >1k hours/month. AssemblyAI / Deepgram win on time-to-market and managed-service convenience for <1k hours/month. Quality is comparable; for English, Whisper-large-v3 is best-in-class.
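
A minimal self-hosted sketch with the open-source openai-whisper package; segment timestamps feed the time anchors directly. Diarisation needs a separate tool (e.g. pyannote) and is not shown.

```python
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("meeting.mp3", word_timestamps=True)

# Segment start/end times become the chunk time anchors.
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}s - {seg['end']:7.2f}s] {seg['text'].strip()}")
```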

Pinecone vs Qdrant?

Pinecone wins on operational simplicity (managed-only). Qdrant wins on cost (self-hostable), performance (faster on identical hardware), and feature set (better hybrid search, payload filtering). We default to Qdrant for new builds where SRE muscle exists; Pinecone where it does not.

Should I use OpenAI text-embedding-3-large or BGE?

OpenAI text-embedding-3-large for English at managed convenience. BGE-large-v1.5 (open-source) for self-hosted scenarios; matches OpenAI quality at $0 marginal cost once you run the inference. Multilingual: BGE-M3 wins for non-English.

How do I evaluate RAG before launch?

Build a 100–200-question golden dataset with domain experts. Run RAGAS / Braintrust / LangSmith on it. Iterate retrieval (chunking, reranking, hybrid) until precision > 80 % and groundedness > 0.85. See our LLM evaluation guide for full methodology.

Can I do RAG over HIPAA-protected video?

Yes — with HIPAA-eligible STT (Azure Speech / AWS Transcribe Medical), Azure OpenAI text endpoints (BAA-eligible), self-hosted vector store inside your VPC. Logs and prompts are PHI-adjacent; redact before any external service touches them. See our HIPAA + SOC 2 guide.

How long to build production video RAG?

Greenfield with managed services: 4–6 weeks. Custom on top of LangChain + Qdrant + Whisper: 8–12 weeks. Vertical-specific (legal e-discovery, medical) with compliance: 12–16 weeks. With our pattern reuse from BrainCert and VALT we typically deliver toward the lower end.

What about VideoRAG (the KDD 2026 paper)?

Strong academic baseline for full multimodal video RAG. Open-source implementation by HKUDS on GitHub. Useful for research and as a reference; production deployments still mostly transcript-first because of cost and latency.

Twelve Labs vs custom?

Twelve Labs is fastest path to multimodal video search; cost premium over DIY of 3–5× at scale. Use for <1k hours of video and time-sensitive launches; build custom for >5k hours where unit economics flip.

Related guides

OpenAI Realtime Production Guide (Voice AI): RAG via voice agent, where the user speaks and the agent searches.

MCP for Video Apps (AI Infra): expose RAG as MCP tools for any agent.

LLM App Evaluation (Evaluation): RAGAS + Braintrust for production RAG eval.

LiveKit AI Agents (Agents): voice agent that wraps RAG retrieval.

HIPAA + SOC 2 (Compliance): when RAG touches PHI, BAA architecture matters.

Ready to ship ‘chat with my recordings’?

67 % of production LLM apps now use RAG. Generic RAG fails on video; the wins come from chunking strategy, time-anchored retrieval, hybrid search and reranking. The 73 % rule says invest in retrieval first; the LLM is rarely the bottleneck.

Cost is dominated by ASR + LLM inference, not the vector store. Pick Qdrant or Pinecone based on your team’s ops muscle. Evaluation with RAGAS + golden dataset is non-optional. Time-anchored deep links to source video make the response useful, not novelty — engineer them from day 1.

Want a 6-week video RAG build plan?

Send us your archive size, content type, and use case. We will return an architecture, vendor pick, and 6–12-week plan within 48 hours, free.

Book a 30-min call → WhatsApp → Email us →
