One-page planner for the whole article. It covers the three engines (transcript-first, hybrid, and native multimodal), with what each one sees, what it costs, and which content it fits best; the deciding 'words vs. picture' question; the five-stage pipeline (acquire, clean and segment, summarize, shape, check), including the captions-vs-ASR fork; the three summarization strategies (stuff, map-reduce, refine); the worked token-cost arithmetic (a transcript at ~6,000 tokens vs. multimodal input at ~300 tokens per second of video); the rent-vs-self-host split; and the ship gate (engine choice, timestamp grounding, LLM-as-judge faithfulness checks, caching, and sanctioned API access).
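The token-cost comparison can be sketched in a few lines. This is only an illustration: the two rates are the rough figures quoted above, while the 30-minute duration and the function names are assumptions made for this example, not values taken from the article.

```python
# Rough figures quoted in the planner summary.
TRANSCRIPT_TOKENS = 6_000        # approximate size of the whole transcript
MULTIMODAL_TOKENS_PER_SEC = 300  # approximate native multimodal input rate


def multimodal_tokens(duration_sec: int,
                      tokens_per_sec: int = MULTIMODAL_TOKENS_PER_SEC) -> int:
    """Estimate input tokens when the model ingests the video itself."""
    return duration_sec * tokens_per_sec


def cost_ratio(duration_sec: int,
               transcript_tokens: int = TRANSCRIPT_TOKENS) -> float:
    """How many times more input tokens multimodal needs than the transcript."""
    return multimodal_tokens(duration_sec) / transcript_tokens


if __name__ == "__main__":
    duration = 30 * 60  # hypothetical 30-minute video (assumption)
    print(multimodal_tokens(duration))  # 540000
    print(cost_ratio(duration))         # 90.0
```

Under these assumptions the multimodal route consumes about 90 times the input tokens of the transcript route, which is why the 'words vs. picture' question is worth settling before choosing an engine.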