Blog: How We Use Spec-Driven Agents to Speed Up Video Development

Key takeaways

Spec-driven agentic engineering is not vibe coding. It is writing a precise, executable specification first and letting Claude Code, Cursor, Kiro and Copilot agents implement under strict human review.

Four artefacts do the heavy lifting: requirements.md, design.md, tasks.md and AGENTS.md. Skip any one and you’re back to vibe coding.

Cycle-time wins are real. Published studies: GitHub Copilot 75% PR-cycle reduction, AWS Kiro 3 weeks vs 12–16 weeks, DORA 2026 +66% epic throughput. On Fora Soft video projects we measure 20–40% faster feature delivery and estimates within ~6% variance.

Video development is a sweet spot. WebRTC glue, media pipelines, DASH manifests, player UI, mobile SDK wrappers, SFU config and test generation are high-boilerplate, spec-friendly workloads where agents hit 60–95% of the code.

Verification is where the saved time goes, not a step you skip. Rigorous PR gates, CI-enforced spec-drift detection and human review on security and codec paths are non-negotiable. Teams that skip verification ship the DORA 2026 horror story: +441% PR review time, +243% incidents per PR.

Why Fora Soft wrote this playbook

Fora Soft ships real-time video, WebRTC, streaming and multimodal AI products for a living. Over the last 18 months we’ve rebuilt our delivery stack around spec-driven agentic engineering and shipped it on projects across the range — Worldcast Live concerts with 10,000 concurrent viewers, HIPAA-regulated telemedicine on CirrusMED, 1080p production rooms on Speed.Space, and 62-language live translation on Translinguist.

This playbook is the exact method we use. The four artefacts we write before any code. The review gates that catch hallucinations and security drift. The cycle-time and cost deltas we actually observe. And the pitfalls we’ve already stepped in so you don’t have to. Read it end-to-end once, then keep it open when you’re scoping the next WebRTC or streaming build.

If you want the pipeline applied directly to your codebase, our Agentic Engineering team does this as a fixed-scope 3–6-week engagement.

Curious what spec-driven agents can do for your roadmap?

30 minutes with our tech lead, your roadmap, your codebase context. You walk out with a concrete estimate of what the next quarter looks like on SDD and what it would cost traditionally.

Book a 30-min call → WhatsApp → Email us →

What spec-driven agentic engineering actually is

Spec-driven development (SDD) inverts the traditional loop. Instead of “write some code, then maybe document it”, the specification is the source of truth and the code is the derived artefact. When you pair that with modern AI coding agents — Claude Code, Cursor, GitHub Copilot in agent mode, AWS Kiro, OpenAI Codex CLI — the spec becomes the contract the agent implements against, and the human becomes a reviewer and architect rather than a typist.

The practical workflow that GitHub Spec Kit, Kiro and our own delivery all converge on is a four-phase loop: Specify → Plan → Tasks → Implement. Each phase has an explicit gate. Specs don’t get committed without review. Tasks don’t start without specs. PRs don’t merge without tests that assert the spec.

The payoff is not that the agent writes code faster (though it does). It is that the system gets faster: less rework, fewer mid-flight scope changes, estimates that actually hold, and a built-in audit trail from business intent to line of code. For regulated verticals this is the difference between a HIPAA auditable pipeline and a governance nightmare.

Rule of thumb: if your team cannot point at a versioned spec before writing code, you are vibe coding with extra steps. The agent speeds up typing; it does not fix scope drift.

Four modes: vibe coding, spec-first, TDD, spec-driven

These four are often conflated; they are distinct practices with different risk profiles.

| Mode | Spec lifecycle | Agent role | Best fit |
|---|---|---|---|
| Vibe coding | None — intent lives in chat history | Guess & iterate | Prototypes, research spikes, throwaway demos |
| Spec-first | Written once, drifts thereafter | Structured start, ad-hoc later | Short features, single-sprint work |
| TDD | Test suite is the spec, unit-level | Implements against failing tests | Libraries, SDKs, well-bounded modules |
| Spec-driven (SDD) | Persistent, executable, version-controlled | Contract-driven autonomy with gates | Greenfield features, regulated code, video & infrastructure |

TDD and SDD are complementary — SDD operates above the unit level and treats the test suite as one output of the spec, not the spec itself.

The 2026 agent landscape

Claude Code (Anthropic). Our default for sustained work. CLAUDE.md persistent memory, subagents, hooks, excellent multi-file refactors. Pragmatic Engineer survey (Feb 2026): 46% “most loved” agent.

Cursor. IDE-first. Great for tight edit loops, parallel-agent orchestration after the April 2026 redesign. We reach for it for frontend and tight refactor passes.

GitHub Copilot + Spec Kit. 4.7M paid subscribers by Jan 2026; ~46% of code generated at Fortune 100 companies. The Spec Kit CLI adds the explicit Specify → Plan → Tasks → Implement phases inside Copilot agent mode.

AWS Kiro. The only mainstream IDE with first-class spec-driven dev plus event-driven hooks. Shipped 95%+ of business-logic code in the AWS 3-week drug-discovery case study. Strong for AWS-heavy stacks.

OpenAI Codex CLI. Lightweight companion that runs alongside Claude Code; good for quick refactors, code search, and structured edits. The April 2026 Claude Code plugin makes them compose rather than compete.

Cognition Devin, Replit Agent. Heavier-autonomy agents. Useful for narrow repetitive work; still need supervised gates in production.

Our stack right now is Claude Code as the primary agent, Cursor for IDE-heavy work, Spec Kit CLI for gated workflows, and hand-rolled CI jobs for drift detection. No single tool wins every job; the skill is orchestrating them through the same spec.

Anatomy of a useful spec

Specs an agent can act on share a shape. They are markdown, version-controlled, short, and opinionated about what’s out of scope.

Business goal. One sentence, user-facing. Not “build a transcoder” but “let users live-stream 1080p with < 500 ms latency and a graceful audio-only fallback below 1 Mbps”.

Acceptance criteria. Bullet list, each one testable. Given-When-Then style works well for WebRTC state machines and media pipelines. Vague criteria (“handle network failures”) leak assumptions the agent will fill in wrong.

Constraints. Latency budgets, codec preferences, bandwidth caps, platform requirements, library versions. The agent needs to know what is fixed before it improvises.

Out of scope. A one-line “not this time” list. Prevents agent helpfulness from turning into scope creep.

Known risks. The things you already know might bite (H.265 decoder variance on Android, iOS background-audio entitlements). Noting them in the spec keeps them visible at review time.
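
Assembled, a spec with this shape fits on a page. A minimal markdown sketch built around the streaming example above; every number, criterion ID and behavior here is illustrative, not prescriptive:

```markdown
## Business goal
Let users live-stream 1080p with < 500 ms glass-to-glass latency and a
graceful audio-only fallback below 1 Mbps.

## Acceptance criteria
- AC-1: Given a publisher on a ≥ 5 Mbps uplink, when they start a stream,
  then viewers see the first frame within 2 s and p95 glass-to-glass
  latency stays < 500 ms.
- AC-2: Given viewer bandwidth drops below 1 Mbps, when the player detects
  it, then it switches to audio-only within 3 s and shows an indicator.

## Constraints
- VP9 preferred, H.264 fallback; existing LiveKit Cloud project, no new infra.

## Out of scope
- Recording, DVR, watch-later. Not this time.

## Known risks
- H.265 decoder variance on Android; iOS background-audio entitlements.
```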

The four artefacts we write on every project

.project-root/
├ AGENTS.md           # project-wide rules for every agent
├ specs/
│   ├ requirements.md # user stories + acceptance criteria
│   ├ design.md       # architecture + constraints + trade-offs
│   └ tasks.md        # ~2-hour units, test-first, agent-sized
├ src/
└ tests/

AGENTS.md. The one file every agent reads first. Build commands, naming conventions, testing rules, no-gos (“never edit infra/prod.tf without approval”). The emerging standard, now hosted under the Linux Foundation Agentic AI Foundation, with 60,000+ repos adopting it by early 2026.

requirements.md. User stories, acceptance criteria, success metrics. No code, no architecture. Keep it under two pages per feature; longer means the feature should be split.

design.md. Stack, service boundaries, data contracts, trade-offs considered and rejected. Agents can reason about this — they just need the context. Include an explicit list of “agent must not do X without asking” for security-sensitive code.

tasks.md. Granular tasks, each 1–2 hours for a human, ≤ 100 LOC of diff, with acceptance tests defined per task. This is what the agent actually consumes during the Implement phase.
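
For illustration, a single tasks.md entry might look like this (the task, file names and thresholds are hypothetical):

```markdown
### T-07: Reconnect on ICE failure (AC-4)
- Scope: src/webrtc/connection.ts only; ≤ 100 LOC diff.
- Behavior: on `iceConnectionState === "failed"`, run an ICE restart with
  exponential backoff (1 s, 2 s, 4 s); after the third failure, emit
  `connection:fatal` and surface a UI error.
- Acceptance test: tests/webrtc/reconnect.test.ts must show recovery from a
  simulated ICE failure within 10 s.
- Estimate: 1.5 h human-equivalent.
```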

Real productivity numbers (with caveats)

The honest version of the productivity story is mixed. Reported gains vary by study, by task type, by team experience, and by how seriously the team takes the spec discipline.

| Study / source | Metric | Result |
|---|---|---|
| GitHub Copilot RCT | PR cycle time | 9.6 days → 2.4 days (−75%) |
| GitHub Copilot RCT | JS HTTP task speed | +55% |
| DORA 2026 | Epics completed / dev | +66% |
| DORA 2026 (caveat) | PR review time | +441% |
| DORA 2026 (caveat) | Incidents / PR | +243% |
| AWS Kiro drug-discovery case | Time to ship | 3+ months → 3 weeks |
| METR OSS study (early 2025) | Experienced OSS devs | +19% slower with early AI tools |
| arXiv SDD paper | LLM code errors with refined specs | ~−50% |

How to read this. The gains are real when (a) specs are disciplined, (b) the review process is budgeted for, and (c) the team is already familiar with the domain. The negative DORA signals — PR review time up, incidents per PR up — are the predictable failure mode of throwing agents at teams without SDD. Agents produce more code; specs direct that code; reviews still catch defects. All three are needed.

On our own video projects we measure 20–40% faster feature completion and estimates within roughly 6% variance end-to-end. That’s the payoff we commit to when we scope fixed-price work.

Where this wins hardest: video, WebRTC, streaming

Video is an ideal SDD target. High boilerplate, many well-defined protocols, heavy cross-platform glue, clear acceptance criteria (“< 500 ms glass-to-glass”), and strong incentives to test rigorously. Where the agents actually carry their weight:

WebRTC glue and signaling. ICE candidate handling, SDP negotiation, connection state machines, NACK/PLI plumbing. 70–80% of this boilerplate is spec-friendly and agent-written; see the sketch after this list.

Media pipelines and DASH/HLS. FFmpeg glue, manifest generation, chunking, bitrate ladders. Agents excel at orchestrator code; humans keep performance tuning.

Player UI (web and mobile). HTML5 video wrappers, gesture handlers, subtitle rendering, responsive CSS. 60–70% agent-written on a clear spec.

Test generation for streams. Given-When-Then acceptance criteria turn directly into pytest/mocha tests. One sentence in the spec becomes 5–10 test cases. See our WebRTC stream quality playbook for the metrics those tests assert against.

SFU config and deployment. LiveKit, mediasoup and Janus configs; Kubernetes manifests; autoscaling rules. Highly template-able; agents deliver reliably when the spec is tight.

Mobile SDK wrappers. iOS Swift bindings, Android JNI glue, permissions flows, lifecycle handling. The agent does the repetitive work; humans own memory management and thread safety on the C++ core.
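
To make the signaling item concrete: below is the shape of glue an agent reliably produces from a tight spec. A minimal sketch against the standard RTCPeerConnection browser API; `signaling` is a hypothetical stand-in for your own WebSocket channel:

```typescript
// Hypothetical signaling channel; swap in your own transport.
declare const signaling: {
  send(msg: any): void;
  onmessage: (msg: any) => void;
};

const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});

// Trickle local ICE candidates to the remote peer as they are gathered.
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signaling.send({ type: "candidate", candidate });
};

// Spec-mandated behavior: attempt an ICE restart before surfacing an error.
pc.oniceconnectionstatechange = () => {
  if (pc.iceConnectionState === "failed") pc.restartIce();
};

// Apply remote descriptions and candidates arriving over signaling.
signaling.onmessage = async (msg) => {
  if (msg.type === "offer") {
    await pc.setRemoteDescription(msg.description);
    await pc.setLocalDescription(await pc.createAnswer());
    signaling.send({ type: "answer", description: pc.localDescription });
  } else if (msg.type === "answer") {
    await pc.setRemoteDescription(msg.description);
  } else if (msg.type === "candidate") {
    await pc.addIceCandidate(msg.candidate);
  }
};
```

The agent writes this kind of plumbing in minutes; human review focuses on the state machine, glare handling and renegotiation edges the sketch deliberately omits.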

Have a video product roadmap you want agent-accelerated?

We scope SDD engagements on WebRTC, live streaming, video-chat and media-pipeline builds. Fixed price, fixed timeline, the cycle-time math up front.

Book a 30-min call → WhatsApp → Email us →

Applying SDD on a legacy codebase

Pure SDD shines on greenfield. Brownfield work — the codebase your product lives in today — needs a narrower variant. The mistake is trying to retroactively spec the whole monolith before agents are allowed to touch anything. You will never finish and the agents will never help.

Spec one change at a time. Write a requirements and design for each concrete feature or fix, not for the whole system. “Add HEVC to the existing transcoder” is a useful spec. “Rewrite the transcoding system” is a project.

Sidecar inference pattern. New AI logic runs as a separate service behind an adapter; the legacy system stays untouched. It can be specced in isolation, tested in isolation, and rolled back in isolation.

Strangler fig for migrations. Route new traffic to the spec-driven service; legacy handles everything else until it’s gradually replaced (a routing sketch follows this list). Meta used this pattern to migrate 50+ WebRTC use cases off a divergent fork, with agents handling automated build-error fixes and merge-conflict resolution.

Autonomy grows with trust. Start with agents doing narrow, reversible tasks (refactors, test generation, docstring writing). Expand scope as review data proves the agent is reliable in your codebase. Never hand an agent unrestricted access to a codebase where it has no track record.
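
A sketch of the strangler-fig routing mentioned above, assuming an Express edge with http-proxy-middleware; the service names and migrated paths are hypothetical:

```typescript
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const app = express();

// Paths already re-specced and re-implemented in the new service.
const migrated = ["/api/rooms", "/api/transcode/hevc"];

const toNew = createProxyMiddleware({
  target: "http://new-video-service:8080", // spec-driven, agent-built
  changeOrigin: true,
});
const toLegacy = createProxyMiddleware({
  target: "http://legacy-monolith:3000", // untouched until strangled
  changeOrigin: true,
});

// Route migrated prefixes to the new service; everything else stays legacy.
app.use((req, res, next) =>
  migrated.some((p) => req.path.startsWith(p))
    ? toNew(req, res, next)
    : toLegacy(req, res, next)
);

app.listen(80);
```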

Human-in-the-loop: review gates that matter

1. Spec sign-off gate. No code is written until requirements.md and design.md are merged. The tech lead and product owner both sign off. This kills 70% of downstream scope changes.

2. Task breakdown gate. Before the agent implements, a senior engineer reviews tasks.md. Any task > ~2 hours of work is split. Any task without an acceptance test is rejected.

3. PR review gate. Every agent-authored PR gets human review. For crypto, codec, auth and data-handling code, a domain expert reviews. For boilerplate, a regular engineer is enough. Budget 25–40% of delivery time for this — SDD shifts the work, it doesn’t eliminate it.

4. CI spec-drift check. A lightweight CI job parses the spec against the merged PR: does the code satisfy every acceptance criterion? Deviations fail the build. Tools: GitHub Spec Kit, or home-grown scripts on top of JSON schemas; a minimal sketch follows this list.

5. Staged rollout. Agent-authored code lands in staging first. Real WebRTC pair testing, synthetic monitors, SLO checks. Only then promoted to production.
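
Gate 4 needs no heavy tooling. A minimal sketch of a drift check, assuming acceptance criteria carry IDs like AC-1 in requirements.md and tests reference those IDs by name (our convention, not a standard):

```typescript
// spec-drift.ts: fail CI unless every acceptance-criterion ID in the spec
// is referenced somewhere in the test suite.
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Collect IDs like "AC-1", "AC-12" from the requirements spec.
const spec = readFileSync("specs/requirements.md", "utf8");
const criteria = [...new Set(spec.match(/AC-\d+/g) ?? [])];

// Concatenate every file under tests/ into one searchable blob.
function readTree(dir: string): string {
  return readdirSync(dir)
    .map((name) => {
      const path = join(dir, name);
      return statSync(path).isDirectory()
        ? readTree(path)
        : readFileSync(path, "utf8");
    })
    .join("\n");
}

const tests = readTree("tests");
const missing = criteria.filter((id) => !tests.includes(id));
if (missing.length > 0) {
  console.error(`Spec drift: no test references ${missing.join(", ")}`);
  process.exit(1);
}
console.log(`Spec coverage OK: all ${criteria.length} criteria are asserted.`);
```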

Mini case: LiveKit video-chat MVP in 40 engineer-hours

Situation. A client needed a LiveKit-based video-chat MVP on their existing auth and notification stack for an investor demo. Traditional scope: 4–6 weeks with a dedicated frontend and backend engineer. Demo was 10 calendar days out.

10-day plan. Days 1–2 we wrote the four artefacts: user stories (1-on-1 call, room, mute/unmute, screen share, chat), design (LiveKit Cloud + Next.js + our auth), tasks (18 tasks, ~2 hours each), and an AGENTS.md that forbade touching production auth. Days 3–7 Claude Code and Cursor worked through the tasks in parallel; one senior engineer reviewed every PR, and the tech lead owned the spec-drift CI. Days 8–10 we added polish, ran load tests on LiveKit Cloud, and did a production smoke test with a real signed-in user on iOS and Android.

Outcome. 40 human engineer-hours across 10 days. MVP shipped on day 10. Demo closed the round. The post-mortem showed ~4% variance between the task estimates and the final hours, and zero rework on WebRTC signaling. We built it on our LiveKit development practice, which has since shipped similar velocity on subsequent video projects.
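
For flavor, the kind of task the agents burned through on days 3–7: joining a room with mic and camera using the livekit-client SDK. A simplified sketch; minting the access token is your backend’s job and the helper name is ours:

```typescript
import { Room, RoomEvent, Track } from "livekit-client";

// Hypothetical helper: connect a user and publish mic + camera.
export async function joinRoom(wsUrl: string, token: string): Promise<Room> {
  const room = new Room({ adaptiveStream: true, dynacast: true });

  // Attach remote audio/video tracks to the page as they arrive.
  room.on(RoomEvent.TrackSubscribed, (track) => {
    if (track.kind === Track.Kind.Video || track.kind === Track.Kind.Audio) {
      document.body.appendChild(track.attach());
    }
  });

  await room.connect(wsUrl, token);
  await room.localParticipant.setMicrophoneEnabled(true);
  await room.localParticipant.setCameraEnabled(true);
  return room;
}
```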

If you want a similar sprint on your own product, book a 30-minute call. Bring a one-page brief; we’ll walk you through what the spec set would look like.

Cost math: what SDD actually changes on the invoice

| Phase | Traditional | SDD + agents | Shift |
|---|---|---|---|
| Design / spec | 10–20% of project | 15–25% | More detailed upfront |
| Implementation | 60–70% | 10–20% | Agents do most of the typing |
| Review / test | 10–20% | 25–40% | More rigorous verification |
| Rework | 10–20% | 3–8% | Specs catch drift early |
| Total calendar | 10–16 weeks (typical feature) | 5–9 weeks | ~40% reduction in wall time |

Agent subscriptions. Claude Code, Cursor and Copilot together land at $50–$100/engineer/month. Meaningful, but trivial next to engineer salaries. Agents are the cheap input; specs and reviews are the expensive input.

Our pricing posture. With Agentic Engineering we pass most of the time savings to the client and commit to fixed-scope delivery. If you’re being quoted traditional-timeline prices on a video or WebRTC build in 2026, the market has already moved past that quote.

The Fora Soft agent stack in 2026

Orchestration. Claude Code as the default driver. Cursor when the work is tight IDE loops. Kiro for AWS-heavy greenfield.

Spec authoring. GitHub Spec Kit for the Specify → Plan → Tasks → Implement pipeline. Markdown everywhere; no proprietary formats.

Agent configuration. AGENTS.md at the repo root, CLAUDE.md for Claude Code, .cursorrules for Cursor. Same content, different files.

Verification. CI with unit + integration + spec-drift + security scan (Snyk, Trivy). Staging with synthetic WebRTC monitors. Human code review on every agent-authored PR.

Observability. Datadog or Grafana dashboards, plus the agent’s reasoning trace logged per PR for post-mortem use. If a bug ships, we replay the spec, the agent run, and the review notes together to find the gap.

QA layer. AI-assisted QA on top of the agent-authored test suite, and a Minto-pyramid test summary report per release.

A decision framework — five questions before you commit to SDD

1. Can you write the acceptance criteria in one page? If yes, SDD fits. If the team can’t agree on what “done” looks like, the problem is not ready for agents.

2. Is the work greenfield or a well-bounded module? Greenfield and well-bounded brownfield modules are SDD sweet spots. Entangled legacy rewrites are not — use sidecar/strangler-fig patterns instead.

3. Do you have senior reviewers? Agents multiply output. Without senior reviewers to triage it, you ship 3× the code and 3× the incidents. If you don’t have the review capacity today, fix that first.

4. Is the domain regulated? HIPAA, SOC 2, PCI, IEC 62304 — SDD is actually a gift here because the spec + drift-check CI + review trail is exactly what auditors want. But budget extra cycles for security review and immutable logs.

5. Does leadership have the patience for upfront specs? Teams under pressure to “just start coding” tend to skip the spec phase and then blame the agents for the chaos. If leadership won’t protect the first week of spec writing, SDD won’t stick.

Five pitfalls that will kill your SDD rollout

1. Hallucinations with no test harness. Agents happily write confidently wrong code, especially on codec, crypto, and protocol edges. Your test suite is the defense. No acceptance test means no PR merge.

2. Specification drift. The spec says “< 500 ms latency”; the agent ships code that has no latency measurement. Passes tests, fails in the field. Ship a CI job that checks every acceptance criterion shows up as an assertion somewhere in the test suite.

3. Over-delegation. Agents handle the 80% of boilerplate beautifully and fail catastrophically on the 20% that requires domain judgement (thread safety, lip-sync tolerance, regulatory edge cases). Keep the human in the loop on that 20% explicitly.

4. Skipping AGENTS.md. Without project-wide agent rules, each agent run reinvents naming conventions, re-argues style, and occasionally refactors files you didn’t ask it to touch. A 200-line AGENTS.md on day one saves a hundred messy PRs.

5. Pretending review is optional. The DORA 2026 numbers — PR review +441%, incidents +243% — are what happens to teams that adopt agents without protecting review capacity. Budget 25–40% of delivery time for review, or accept the incident volume.

Building a video product in 2026? Let’s benchmark your scope.

Share the feature list; we return traditional vs SDD estimates in cost and calendar time. Typical delta: 40% faster, ~20% cheaper, with tighter variance.

Book a 30-min scoping call → WhatsApp → Email us →

Security and compliance in agentic workflows

The 2026 State of AI Agent Security report flagged 80.9% of teams running agents in production but only 14.4% with full security approval; 88% reported at least suspected incidents. The risks are real but manageable.

Isolate agent permissions. Never give an agent write access to production credentials, secrets, or deploy pipelines. Sandbox the workspace; require explicit approval for network calls.

Scan every agent-authored PR. Snyk, Trivy, GitGuardian, Semgrep. Over 25% of 30,000 analyzed agent skills contained at least one vulnerability — scan before merge, not after.

Immutable audit trail. Log every agent run (prompt, response, files touched, tests run) to an immutable store. For regulated verticals this is the audit artefact.

Threat-model the agents themselves. Agent drift (silent deviation from intent) and prompt injection (malicious input redirecting the agent) are the two novel threat classes. Mitigations: spec-drift CI, sanitized inputs, and restricted tool lists per agent (see the sketch after this list).

Wire SDD into the compliance pipeline. Specs + review notes + CI logs + incident records land in the same immutable repository. For SOC 2 or HIPAA audits this is now the faster path, not the slower one. See how we do it on reliable, crash-proof software in 2026.
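
The restricted tool lists mentioned above are concrete, not aspirational. Claude Code, for example, reads allow/deny permission rules from .claude/settings.json; a sketch of the posture we run, with illustrative paths and commands:

```json
{
  "permissions": {
    "allow": [
      "Edit(src/**)",
      "Edit(tests/**)",
      "Bash(pnpm test:*)"
    ],
    "deny": [
      "Read(.env*)",
      "Edit(infra/**)",
      "Bash(curl:*)"
    ]
  }
}
```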

KPIs to report to the business

Velocity KPIs. Feature lead time (spec merged to production), story points per sprint, estimate variance (%).

Quality KPIs. Escaped-defect rate, incidents per PR, rework ratio (days in rework / days in original delivery), automation flakiness (< 1% target).

Efficiency KPIs. Review time per PR, share of PRs auto-authored by agents, spec coverage (% of acceptance criteria mapped to tests), cost per feature (engineer hours + agent tokens).

When NOT to use spec-driven agents

Research spikes and exploratory prototypes. If you don’t know what you’re building, a spec is premature optimization. Vibe code, learn, then spec the keeper.

Genuinely novel domains. Agents trained on public code struggle with first-of-its-kind problem spaces. Humans lead; agents assist with boilerplate around the edges.

Teams without review capacity. Agent output without review creates worse outcomes than no agents at all. Fix review capacity first.

Ultra-low-trust environments. Air-gapped classified work, pre-revenue regulatory approval stages — the audit surface of an agent pipeline may still be too broad. Use constrained patterns or wait for on-prem agents that meet your threat model.

FAQ

Is spec-driven development the same as waterfall?

No. Waterfall assumes the spec is complete and final before implementation starts. SDD assumes the spec is the current best understanding, lives in version control, and iterates alongside the code. Changes to behavior update the spec first, then the code. Waterfall is a phase-gate model; SDD is a feedback loop with the spec as the anchor.

Which AI agent should my team use?

Claude Code as the default in 2026 — strong multi-file context, good spec handling, 46% preference in the Pragmatic Engineer survey. Cursor for IDE-heavy front-end work. Kiro for AWS-heavy greenfield. GitHub Copilot + Spec Kit when you’re Copilot-native already. Most serious teams run two in parallel rather than betting on one.

How much faster is spec-driven agentic engineering really?

GitHub’s RCT showed PR cycle time dropping 75%. DORA 2026 measured +66% epic throughput. AWS Kiro shipped a 3-month project in 3 weeks. On our own video projects we measure 20–40% faster feature completion. The variance is real — results depend on spec discipline, team experience and domain familiarity.

Can we apply this to our existing legacy codebase?

Yes, but carefully. Spec one change at a time, use sidecar inference to isolate new AI logic, adopt the strangler-fig pattern for gradual migration. Trying to spec the whole legacy monolith before touching anything is a trap. Expand agent autonomy as review data proves the agent is reliable in your codebase.

How do we stop the agent from hallucinating?

Three controls. (1) Precise specs with concrete examples — arXiv research shows refined specs reduce LLM code errors ~50%. (2) A test harness that fails the PR if acceptance criteria aren’t satisfied. (3) Human code review on every agent-authored PR, with extra rigor on crypto, codec, and security code. No single control is enough; all three together are.

What does AGENTS.md actually contain?

Build and test commands, style and naming conventions, files or directories the agent must not edit without approval, security-sensitive paths, how to run the local environment, which external libraries are pre-approved, and a pointer to the spec directory. AGENTS.md is now an open standard hosted under the Linux Foundation Agentic AI Foundation, with 60,000+ repos adopting it.
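
A condensed, illustrative sketch (your commands, paths and rules will differ):

```markdown
# AGENTS.md
## Build & test
- `pnpm install && pnpm test` must pass before any PR is opened.
- WebRTC integration tests: `pnpm test:webrtc` (needs the local TURN container).

## Conventions
- TypeScript strict mode; no `any` under src/media/.
- One React component per file; hooks prefixed with `use`.

## No-gos
- Never edit infra/ or anything under secrets/ without human approval.
- Never change SDP munging in src/media/sdp.ts without a linked spec.

## Specs
- Specs live in specs/; a task without an acceptance test is invalid.
```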

Is this safe for HIPAA / SOC 2 / PCI workloads?

Yes, when you design for it. Specs + review notes + CI logs form the immutable audit trail compliance auditors want. Restrict agent permissions (no production secrets, no deploy access), scan every PR, and route security-sensitive code through domain-expert review. On the HIPAA telemedicine and PCI payment systems we ship, SDD has shortened audit preparation rather than lengthened it.

How much does a spec-driven video MVP cost?

With Agentic Engineering we typically ship a LiveKit or mediasoup-based video-chat MVP in 3–6 weeks of calendar time with 40–120 human engineer hours. The actual quote depends on platforms (web + iOS + Android vs web only), auth integration scope, and the feature list — book a 30-minute call for a specific number.

Related reading

Quality testing. How to test WebRTC stream quality in 2026: the metrics your spec’s acceptance criteria should assert on.

Latency. How to minimize latency to under 1 second at mass scale: the protocol decisions that frame what your specs have to honor.

AI streaming. AI language translation for seamless live streaming: another SDD-friendly vertical, with a three-stage pipeline done right.

QA reporting. How to write an effective test summary report: the release gate your agent-authored PRs flow through.

Reliability. How to build reliable, crash-proof software in 2026: SLOs, DORA metrics, and the patterns your SDD pipeline should enforce.

Ready to ship a video product on SDD?

Spec-driven agentic engineering is not a fad. It is the current best answer to “how do we ship complex video software faster without getting buried in incidents.” The disciplines are simple to describe and hard to shortcut: write the spec first, let the agent implement against it, review hard, test against the acceptance criteria, and let the spec drive the audit trail.

Done properly, it ships features 20–40% faster, cuts rework to single digits, shrinks estimates to ~6% variance, and yields an audit artefact compliance auditors appreciate. Done sloppily, it produces the DORA 2026 horror story. The difference is discipline, and we’ve had two decades of practice getting that discipline into regulated, high-stakes video products.

Want an SDD-ready scope for your next video build?

Bring a brief, leave with the four artefacts skeleton, a task list, and a calendar estimate. Fixed-scope delivery with our Agentic Engineering team on request.

Book a 30-min call → WhatsApp → Email us →
