
Key takeaways
• Vocal Views proves the model. Fora Soft built a research marketplace adopted by Google, McDonald’s, Netflix, and Samsung — video interviews, automatic transcription in 30+ languages, live human interpreters, and a panel of incentivised respondents in one product.
• The market is large and consolidating. Global market research was worth roughly $140B in 2024; UserTesting alone sold for $1.3B; the GenAI insights tier is rewriting the buyer’s spec sheet.
• Three engineering pillars decide the build. WebRTC SFU with observer + interpreter modes; multilingual ASR (Deepgram, Whisper, AWS) with speaker diarization; AI synthesis (clips, sentiment, themes) tied to a panel-payment marketplace.
• Live interpretation is a deliberate choice. Three audio streams, 2–3 second translation lag, and a UI that survives multilingual chaos — or skip it and rely on post-session translation.
• Build vs. buy is about panels and AI moat. If the differentiator is a niche panel or a specific analysis workflow, build. If you need broad recruitment fast, integrate UserTesting / Respondent.io APIs and build only the differentiating layer.
Why Fora Soft wrote this playbook
We did not write this guide as a marketing brief. We wrote it because we shipped Vocal Views — a marketplace for online market research that today serves Google, McDonald’s, Netflix, and Samsung. Every architectural call below — SFU choice, transcription stack, observer mode, interpreter UX, panel payments — was made under pressure of paying enterprise customers running real interviews on the platform.
Vocal Views is not our only video-marketplace product. Our case studies include BrainCert — a virtual classroom and LMS with 100K+ paying customers and four Brandon Hall awards — and ProvideoMeeting, an enterprise video conferencing platform. Roughly 40% of our active engineering capacity sits in video, real-time, and AI — the disciplines a serious research-platform build needs from day one.
If you are a founder, product owner, or research-ops lead scoping a custom platform — or weighing UserTesting, dscout, Respondent, Lookback, or Discuss.io against a build — this article walks the same trade-offs we walked with Vocal Views. We use Agent Engineering — an AI-assisted internal delivery process — to compress the calendar and the cost on every project, which is why our quotes typically beat a US agency for the same scope.
Scoping a research-marketplace build?
Get a 30-minute architecture review with the team behind Vocal Views — SFU choice, ASR stack, panel payments, AI synthesis, realistic budget.
Vocal Views in numbers — what the platform actually does
Vocal Views is a marketplace, not just a video tool. Researchers post studies, qualified respondents are matched and paid an incentive, the interview happens over WebRTC video, the conversation is transcribed in real time, and the data is exportable for analysis — all on one platform.
| Capability | Detail | Why it matters |
|---|---|---|
| Enterprise customers | Google, McDonald’s, Netflix, Samsung | Compliance, scale, and SLAs proven against the strictest procurement teams. |
| Languages supported | 30+ via automatic transcription | Cross-region research without bolt-on tools. |
| Interview formats | 1:1, focus group, observer mode, interpreter mode | Same product handles UXR, brand research, and qualitative academic studies. |
| Live human interpreter | Behind-the-scenes simultaneous translation | Removes the language barrier without forcing the respondent into English. |
| Outputs | Recordings, transcripts, analytics dashboards, exports | Researchers leave the call with everything insight-ready. |
| Stack | React + Node.js + MongoDB + WebRTC/Kurento + Socket.io | Mature stack chosen for fast iteration and a healthy ecosystem. |
The two-minute walkthrough below is the actual operator interface, not a marketing render. Watch the Vocal Views product video on YouTube →
The market in numbers — why research platforms keep getting funded
Market research is a $140B industry and the qualitative-video slice is where the AI tooling is rewriting expectations.
| Indicator | Number | Why it matters |
|---|---|---|
| Global market research industry (2024) | ~$140B | Even a 0.1% share is a healthy SaaS company. |
| Online video platform market CAGR | 17.3% (2021–2027) | Compounding tailwind for any video-first product. |
| UserTesting acquisition | $1.3B (Thoma Bravo / Sunstone, 2022) | Validates that this is a billion-dollar outcome category. |
| Respondent.io panel | 4M+ verified participants, 4.9/5 rating | Bigger panels = bigger network effect. |
| AI insight delivery uplift | ~40% faster vs. manual analysis | AI synthesis is the new differentiator. |
| Conversation analytics adoption | 72% report better CX | The buyer’s ROI story is now well documented. |
The 2026 feature checklist for a serious research platform
Use the four tiers below to scope your MVP, then post-MVP, then differentiation. Skipping a tier-1 feature blocks enterprise procurement; skipping a tier-3 feature limits expansion revenue but does not stop launch.
Tier 1 — the marketplace + interview core
• Panel and incentive management. Screener flow, qualification, scheduling, automated incentive payouts (Stripe, PayPal, Tremendous, Hyperwallet).
• Video chat over WebRTC. 1:1 plus observer-only viewers, with two-way audio when allowed.
• Recording. Cloud storage with retention policies and per-tenant key management.
• RBAC + audit log. Researcher, observer, admin, and panel-ops roles, with immutable logs.
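The RBAC and immutable-log requirement above can be sketched with a role-to-permission map and a hash-chained log. This is a minimal illustration, not the Vocal Views implementation: the role names come from the text, while the action names, the `AuditLog` class, and the hash-chaining approach are assumptions chosen for the example.

```python
import hashlib
import json

# Hypothetical permission map: roles are from the text, actions are assumed.
PERMISSIONS = {
    "researcher": {"join_call", "speak", "export"},
    "observer": {"join_call"},
    "admin": {"join_call", "export", "manage_users"},
    "panel-ops": {"manage_panel", "issue_payout"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role's permission set contains the action."""
    return action in PERMISSIONS.get(role, set())


def _digest(actor: str, action: str, prev_hash: str) -> str:
    """Hash one log entry together with the hash of the entry before it."""
    body = {"actor": actor, "action": action, "prev": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log where each entry stores the hash of the previous one."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, actor: str, action: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry = {
            "actor": actor,
            "action": action,
            "prev": prev_hash,
            "hash": _digest(actor, action, prev_hash),
        }
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; an edited or reordered entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            expected = _digest(entry["actor"], entry["action"], prev_hash)
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True
```

Chaining makes tampering detectable rather than impossible; a production system would typically also write the log to append-only storage.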
Tier 2 — transcript, search, and clip workflow
• Multi-language ASR with speaker diarization. Whisper or Deepgram Nova for cost; AWS or Azure for enterprise compliance.
• Searchable transcripts. Word-level timestamps indexed in OpenSearch / Elastic.
• Clip and quote generation. One-click highlight extraction with auto-generated descriptions.
• Live captions during interview. Sub-2-second latency for accessibility and observer note-taking.
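Word-level timestamps are what make one-click clips possible: once a quote is located in the transcript, its first and last word give the cut points in the recording. The sketch below assumes a simplified transcript shape; the `Word` fields and the `find_clip` helper are hypothetical and do not mirror any specific ASR vendor's JSON.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float
    speaker: str


def find_clip(words: list[Word], phrase: str, padding: float = 0.5):
    """Return (start, end, quote) for the first occurrence of phrase, or None.

    Matching is case-insensitive and ignores trailing punctuation. The clip
    is widened by `padding` seconds on each side so the cut does not feel abrupt.
    """
    target = phrase.lower().split()
    if not target:
        return None
    tokens = [w.text.lower().strip(".,!?") for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i : i + len(target)] == target:
            hit = words[i : i + len(target)]
            start = max(0.0, hit[0].start - padding)
            end = hit[-1].end + padding
            return start, end, " ".join(w.text for w in hit)
    return None
```

The returned start and end can be handed to whatever tool cuts the recording; a search index would replace the linear scan at scale.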
Tier 3 — AI synthesis and integrations
• AI summary per session. 5–7 bullet narrative with key quotes.
• Sentiment per speaker. 0–1 score with timeline visualisation.
• Theme clustering across studies. Unsupervised topic mining; ask-your-data Q&A across the corpus.
• Integrations. Miro, FigJam, Notion, Slack, Figma, plus a Zapier tier for the long tail.
Tier 4 — live human interpreter mode
Three-stream audio routing (participant → interpreter → researcher), separate channels per language, recording of every channel for audit, and a UI that survives multilingual chaos. This is the Vocal Views differentiator and the hardest feature to copy.
Reach for live interpretation when: the buyer’s research must include populations who do not interview comfortably in English. For pure UXR with global enterprise users, post-session translation plus AI summary is good enough and saves 30–40% of build cost.
Reference architecture for a research-interview platform
The pipeline below is the canonical layout we use on Vocal Views and similar projects. Each stage scales horizontally; per-tenant isolation lives in stages 6–8.
| Stage | Component | Output |
|---|---|---|
| 1. Capture | WebRTC client (browser + mobile) | VP9 / H.264 video, Opus audio |
| 2. SFU + observer fan-out | MediaSoup / Janus / LiveKit / Kurento | Forwarded streams + recording egress |
| 3. Recording | FFmpeg or AWS MediaConvert | MP4/HLS in S3 with lifecycle tiering |
| 4. ASR + diarization | Whisper / Deepgram / AWS Transcribe | JSON transcript with speaker labels |
| 5. AI synthesis | GPT-4-class model | Summary, clips, sentiment, themes |
| 6. Search | OpenSearch / Elastic | Per-tenant search index |
| 7. API + auth | Node.js + Socket.io | RBAC, audit log, realtime events |
| 8. Marketplace + payments | Stripe + Tremendous / Hyperwallet | Incentive payouts, take-rate splits |
| 9. Client | React (web) + native iOS/Android | Researcher console, observer view, panellist app |
A deeper dive into the SFU choice lives in P2P vs MCU vs SFU for video conferencing and the broader 2026 architecture map in our WebRTC architecture guide for business.
WebRTC SFU choice — MediaSoup, Janus, LiveKit, or Kurento
For a research interview the SFU has to handle 1:1 reliably, plus 5–15 silent observers, plus an interpreter mode that introduces a third audio channel. Four engines cover the field.
| Engine | License | Strengths | Best for |
|---|---|---|---|
| MediaSoup | ISC | Highest performance, modular, Node.js + Rust | High-volume marketplaces, custom recording |
| Janus | GPLv3 | Mature plugin system (recording, SIP, broadcast), C-based | Carrier-grade reliability, SIP bridging |
| LiveKit | Apache 2 (OSS) + SaaS | Fastest integration path, managed cloud option | Teams without deep WebRTC expertise |
| Kurento | Apache 2 | Pluggable media pipeline, CV/ML hooks | Custom processing, computer-vision overlays |
Vocal Views shipped on Kurento because the team needed pipeline-level hooks for observer fan-out and live ASR routing. Modern projects often default to MediaSoup or LiveKit for simpler ops; Janus remains the carrier-grade choice. We documented the Kurento story in What is Kurento Media Server.
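To get a feel for the observer load, it helps to count forwarded streams. The sketch below is a back-of-envelope model, not a capacity planner: it assumes each publisher sends a single combined audio and video stream, that every other attendee receives a copy, and it ignores simulcast layers and interpreter channels. The function name is illustrative.

```python
def sfu_stream_counts(publishers: int, observers: int) -> dict:
    """Count uplink and downlink streams for one SFU room.

    publishers: attendees who send media (researcher and participant in a 1:1).
    observers:  silent attendees who only receive media.
    """
    if publishers < 1 or observers < 0:
        raise ValueError("need at least one publisher and no negative counts")
    uplinks = publishers
    # Each publisher's stream goes to the other publishers plus all observers.
    downlinks = publishers * ((publishers - 1) + observers)
    return {"uplinks": uplinks, "downlinks": downlinks}
```

Under these assumptions a 1:1 interview with 15 observers means 2 uplinks and 32 forwarded streams, so observer fan-out accounts for most of the server's egress.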
Multi-language ASR — the engine that turns interviews into data
Transcription is no longer a feature; it is the gateway to every downstream insight. For 30+ languages with speaker diarization the field narrows to five engines.
| Engine | Languages | Pricing | Best for |
|---|---|---|---|
| Deepgram Nova-3 | 45+ | ~$4.30 / 1K min | Lowest cost, conversational accuracy |
| AWS Transcribe | 100+ | ~$24 / 1K min (batch) | Enterprise compliance, BAA available |
| Google Speech-to-Text | 100+ | ~$16 / 1K min | Robust noise handling, Google ecosystem |
| Azure Speech | 140+ | Custom enterprise | HIPAA-heavy verticals, Microsoft stack |
| Whisper Large-v3 (self-host) | ~50 | GPU rental only | Cost control at scale, on-prem |
Practical rules of thumb: Deepgram for cost-sensitive English-heavy workloads, AWS for HIPAA verticals, Azure if the buyer is a Microsoft shop, Whisper if you need on-prem or process more than ~10K hours per month.
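The pricing column is easier to compare once it is converted into a monthly bill. The sketch below uses the indicative per-minute prices from the table (the AWS figure is $0.024 per minute, the same as $24 per 1K minutes); the dictionary keys and the function name are illustrative, and real invoices depend on tiering, features, and negotiated discounts.

```python
# Indicative per-minute prices in USD, taken from the table above.
PRICE_PER_MINUTE_USD = {
    "deepgram_nova_3": 4.30 / 1000,
    "aws_transcribe_batch": 0.024,
    "google_speech_to_text": 16.0 / 1000,
}


def monthly_asr_cost(engine: str, interview_hours: float) -> float:
    """Return the estimated monthly ASR bill in USD, rounded to cents."""
    minutes = interview_hours * 60
    return round(minutes * PRICE_PER_MINUTE_USD[engine], 2)
```

At 500 interview hours a month, these list prices work out to about $129 for Deepgram, $480 for Google, and $720 for AWS batch.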
Need an ASR vendor decision in writing?
We have shipped Vocal Views, BrainCert, and dozens of streaming products. We will pick the right ASR engine for your language mix and unit economics in one call.
Live interpreter mode — the engineering challenge that wins enterprise deals
Most research platforms quietly assume the participant speaks the researcher’s language. Vocal Views does not. The interpreter feature is what lets Google interview a Korean teenager through an English-speaking moderator without forcing the participant into a foreign tongue. The trade-offs are concrete.
1. Three audio channels. Participant audio → interpreter; interpreter audio → researcher; researcher audio → interpreter. Recording captures every channel separately for audit.
2. Time multiplier. A 30-minute interview becomes a 60-minute session because of translation lag and cognitive load. Pricing must reflect the doubling.
3. UI clarity. Multiple languages on screen are a UX hazard. Indicate who is speaking, in which language, and which stream the listener is hearing — without overwhelming the moderator.
4. Latency budget. Two to three seconds of translation lag is the floor, and the transport must not add to it: anything more breaks the conversation. Use WebRTC sub-stream prioritisation, not HLS, for the interpreter loop.
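The three audio channels in point 1 amount to a routing table: for each listener, which sources does the media server forward? The sketch below is one possible encoding. The role names and the `should_forward` helper are illustrative, and the participant's leg is an assumption, since the text does not spell out what the participant hears.

```python
# Keys are listeners; values are the sources forwarded to their headphones.
INTERPRETER_ROUTES = {
    "interpreter": {"participant", "researcher"},
    "researcher": {"interpreter"},
    # Assumption: the participant hears the interpreter's rendering of the
    # researcher's questions. This leg is not described in the text.
    "participant": {"interpreter"},
}


def should_forward(source: str, listener: str) -> bool:
    """Return True if the source's audio should be forwarded to the listener."""
    return source in INTERPRETER_ROUTES.get(listener, set())


def recording_sources() -> list[str]:
    """Every speaking role is recorded on its own track for audit."""
    sources = set()
    for heard in INTERPRETER_ROUTES.values():
        sources |= heard
    return sorted(sources)
```

Keeping the routes in data rather than in branching code makes it easier to add a second language pair or an observer role later.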
Want to see how AI is starting to compete with human interpreters? Our deep dive in AI simultaneous interpretation covers the trade-offs.
Marketplace economics — take rate, panel cost, incentive payouts
A research-platform business has two cost centres researchers never see: panel acquisition and incentive payment infrastructure. Understanding both is what separates a sustainable take rate from a margin disaster.
Take rate. Industry-standard take rates range from about 10% on session fees to a 35% mark-up on incentive costs. Aggressive marketplaces blend both models: a subscription for sellers plus a take rate on transactions.
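A worked example makes the blend concrete. The sketch below applies both percentages to a single session. It is an illustration only: the function and field names are hypothetical, and it assumes the platform keeps its percentage of the session fee and adds the mark-up on top of the incentive the respondent receives.

```python
def platform_revenue(
    session_fee: float,
    incentive: float,
    fee_take_rate: float = 0.10,
    incentive_markup: float = 0.35,
) -> dict:
    """Split one session's money between buyer, respondent, and platform.

    Assumes the buyer pays the session fee plus the marked-up incentive,
    and the respondent receives the incentive in full.
    """
    take = session_fee * fee_take_rate
    markup = incentive * incentive_markup
    return {
        "buyer_pays": round(session_fee + incentive + markup, 2),
        "respondent_gets": round(incentive, 2),
        "platform_keeps": round(take + markup, 2),
    }
```

For a $200 session fee and a $100 incentive, the platform keeps $55 under these assumptions: $20 from the fee and $35 from the mark-up.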
Panel acquisition. Building a 4M-participant panel like Respondent.io takes 2–3 years and significant CAC. For most new entrants the right move is to integrate a third-party panel API for the first 18 months and grow your own panel where the market underserves (industry verticals, hard-to-reach professionals, niche geographies).
Incentive payouts. Same-day to next-day payouts are now the floor; Stripe Connect, Tremendous, and Hyperwallet all solve the cross-border payment problem at different price points. Plan for FX, tax-form generation, and KYC.
Security & compliance — what enterprise procurement asks first
1. SOC 2 Type II. The enterprise tick-box. $25–$40K readiness assessment plus annual audit. Build the controls in sprint 1.
2. GDPR. Lawful basis, DPIA where high-risk, EU data residency option, right to erasure on individual research participants. The right-to-erasure flow is the part teams underestimate.
3. CCPA / CPRA. US data-subject rights. Light lift if GDPR is already in place.
4. HIPAA. Required for medical research interviews. BAA with the cloud provider, encryption at rest and in transit, audit logging on every replay.
5. IRB-friendly workflows. Academic studies need consent capture, anonymisation tooling, and retention controls exposed in the UI — not buried in admin scripts.
Build vs. buy — the panel-and-moat rule
The honest decision rule: if your differentiator is a niche panel or a specific analysis workflow, build. If you need broad recruitment fast, integrate UserTesting / Respondent.io APIs and build only the differentiating layer on top. The framework below maps the common scenarios.
| Scenario | Recommendation | Why |
|---|---|---|
| Need broad consumer panel fast | Buy + integrate (UserTesting API) | Building a 4M panel takes years; vendor amortises the cost. |
| Niche panel or proprietary access | Build custom (Vocal Views zone) | Vendors do not have your supply; the panel itself is the moat. |
| AI-native synthesis is the moat | Hybrid — UserTesting API + custom AI layer | Pay for recruitment; differentiate on the analysis. |
| Vertical workflow (medical, legal) | Custom build with HIPAA/IRB hardening | Compliance and workflows shape the product end-to-end. |
| Reseller / SaaS go-to-market | Custom build | You cannot resell UserTesting. You can resell what we build for you. |
Cost model — what an MVP and a full-feature platform actually run
The numbers below reflect realistic Fora Soft engagements with Agent Engineering applied. They are intentionally conservative; in practice we beat them more often than we miss them.
| Scope | Included | Indicative range | Calendar |
|---|---|---|---|
| MVP — interview + transcript | WebRTC, ASR, recording, panel basics, Stripe payouts | $70K–$120K | 12–14 weeks |
| AI synthesis layer | Clip extraction, sentiment, themes, AI summary | +$25K–$50K | +3–5 weeks |
| Live interpreter mode | Three-stream routing, interpreter UI, multi-track recording | +$30K–$60K | +4–6 weeks |
| Integrations pack | Miro, Notion, Slack, Figma, Zapier | +$15K–$30K | +2–3 weeks |
| Compliance pack (SOC 2 + GDPR) | Audit logs, encryption review, vendor due-diligence pack | +$25K–$50K | +1–2 months |
Run-rate operations: $500–$3,000/month for cloud infrastructure at MVP scale, plus $200–$2,000/month in API costs (ASR, AI synthesis), plus a dedicated SRE on call. These costs scale linearly with active studies.
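The build ranges in the table add up line by line. The sketch below sums cost and calendar for a chosen set of packs, using the table's figures in thousands of USD and in weeks. It assumes the packs are delivered back to back, which is a simplification, and it leaves out the compliance pack because its calendar is quoted in months. The dictionary keys and the function name are illustrative.

```python
# (low, high) cost in thousands of USD and (low, high) calendar in weeks.
SCOPES = {
    "mvp": ((70, 120), (12, 14)),
    "ai_synthesis": ((25, 50), (3, 5)),
    "interpreter": ((30, 60), (4, 6)),
    "integrations": ((15, 30), (2, 3)),
}


def estimate(selected: list[str]) -> dict:
    """Sum the cost and calendar ranges of the selected packs."""
    cost_low = sum(SCOPES[name][0][0] for name in selected)
    cost_high = sum(SCOPES[name][0][1] for name in selected)
    weeks_low = sum(SCOPES[name][1][0] for name in selected)
    weeks_high = sum(SCOPES[name][1][1] for name in selected)
    return {
        "cost_k_usd": (cost_low, cost_high),
        "weeks": (weeks_low, weeks_high),
    }
```

An MVP with AI synthesis and interpreter mode comes to $125K–$230K and 19–25 weeks on these figures; work done in parallel would shorten the calendar.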
Mini case — what shipping Vocal Views taught us
Situation. The client came to Fora Soft with a hypothesis — that enterprise market research could move online without losing the human-interpreter quality clients like Google and McDonald’s expect. The problem was that no off-the-shelf platform combined a panel marketplace, multi-language ASR, and a live interpreter loop in one product.
Plan. Build the WebRTC core on Kurento for pipeline-level control; layer 30+ language ASR via cloud APIs; design the three-stream interpreter mode from sprint 1 instead of bolting it on later; ship a panel marketplace with screener, scheduling, and Stripe payouts; expose researcher analytics for export. Roughly 9–12 months end-to-end with iterative releases to early enterprise customers.
Outcome. Today Vocal Views serves Google, McDonald’s, Netflix, and Samsung on the same multitenant codebase. The interpreter mode is the feature procurement teams cite when they choose Vocal Views over UserTesting or dscout. Want a similar 12-week MVP roadmap for your research platform?
A decision framework — pick a research-platform path in five questions
1. Where does your panel come from? Niche or proprietary → build. Broad consumer → integrate a third-party panel API.
2. What languages does your buyer interview in? English-only → one ASR vendor. 30+ → multi-vendor abstraction layer with fall-through fallback.
3. Live interpreter or post-session translation? Live = differentiator and 30–40% extra build cost; post-session is good enough for most UXR.
4. Where is the AI moat? Synthesis (clip extraction, sentiment, themes) → build the layer; recruitment → buy a panel API.
5. Compliance floor? SOC 2 + GDPR is the modern baseline. HIPAA only if you sell into clinical / healthcare research.
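The multi-vendor abstraction with fall-through fallback from question 2 can be sketched as an ordered list of adapters. Everything here is hypothetical: the `Transcriber` signature, the `AsrUnavailable` exception, and the vendor tuples stand in for real SDK wrappers, and no actual vendor API is called.

```python
from typing import Callable

# An adapter takes raw audio and a language code and returns transcript text.
Transcriber = Callable[[bytes, str], str]


class AsrUnavailable(Exception):
    """Raised by an adapter when it cannot serve the request."""


def transcribe_with_fallback(
    audio: bytes,
    language: str,
    vendors: list[tuple[str, set[str], Transcriber]],
) -> tuple[str, str]:
    """Try each vendor that supports the language, in priority order.

    Returns (vendor_name, transcript). Raises AsrUnavailable if every
    eligible vendor fails or none supports the language.
    """
    errors = []
    for name, languages, transcribe in vendors:
        if language not in languages:
            continue  # fall through to the next vendor
        try:
            return name, transcribe(audio, language)
        except AsrUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise AsrUnavailable(f"no vendor could transcribe {language!r}: {errors}")
```

Returning the vendor name alongside the transcript lets the platform track word-error rate and cost per engine.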
Pitfalls we have watched research-platform teams fall into
1. Treating ASR as “just an API call.” Speaker diarization, language detection, and noise robustness are real engineering work. Audit transcripts on the team’s own recordings before shipping.
2. Skipping observer mode in v1. Stakeholder observer rooms are how research becomes a team sport. Without them adoption is capped.
3. Underestimating incentive payouts. Cross-border payment, FX, tax-form generation, and chargeback handling are weeks of work, not a sprint.
4. Bolting on compliance. Retrofitting SOC 2 costs 3× what designing it in does. The same goes for HIPAA.
5. Ignoring the post-interview workflow. Researchers need clips, summaries, and exports. Without them the platform is a recording tool, not an insights tool.
KPIs — what to actually measure
Quality KPIs. Stream uptime per session (99.5%+), p95 join time (< 4 s), ASR word-error rate per language (target < 8% on clean audio), AI summary acceptance rate (share of summaries researchers leave unedited for 24 h).
Business KPIs. Take rate, gross margin per session, panellist NPS, repeat-buyer rate, integration-driven expansion revenue (Miro / Notion / Slack-pulled exports).
Reliability KPIs. Recording success rate (over 99.5%), incentive payout SLA (under 24 h), audit-log completeness (100%), data-deletion SLA on GDPR/CCPA requests (under 30 days).
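Two of the quality KPIs have standard definitions that are worth pinning down in code. The sketch below computes word-error rate as word-level edit distance divided by reference length, and p95 using the nearest-rank method. The function names are illustrative, and the tokenisation (lowercase, whitespace split) is a simplifying assumption; production scoring usually normalises punctuation and numerals first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words processed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]
```

One substituted word in a five-word reference gives a WER of 0.2, well above the 8% target, which shows why short test utterances make a poor benchmark.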
When NOT to build a custom research platform
Skip the build when (a) the team needs to ship inside eight weeks, (b) the panel requirement is broad consumer and you don’t have a recruitment moat, (c) total budget is below $80K, or (d) the buyer’s differentiator is brand recognition rather than a unique workflow. In each case the right move is to integrate UserTesting, dscout, or Respondent.io APIs and build only the differentiating layer on top — we still help build that layer.
Want a build-vs-buy verdict in writing?
A 30-minute call gets you a one-page recommendation matched to your panel, language mix, AI moat, and budget. We will tell you honestly when buying instead would be cheaper.
FAQ
Who uses Vocal Views?
Vocal Views serves enterprise market-research teams at Google, McDonald’s, Netflix, and Samsung, plus mid-market consumer-insights teams. The platform handles 1:1 interviews, focus groups, observer rooms, and live human interpretation in 30+ languages.
How long does it take to ship an MVP research platform?
12–14 weeks for an MVP with WebRTC video, multi-language ASR, recording, basic panel management, and Stripe-based incentive payouts. AI synthesis and live interpreter mode add 3–6 weeks each. Agent Engineering at Fora Soft compresses these calendars below the industry baseline.
Should we build on MediaSoup, Janus, LiveKit, or Kurento?
MediaSoup for raw performance, Janus for carrier-grade reliability, LiveKit for fastest path to MVP (managed cloud option), Kurento for pipeline-level control over recording and ASR routing. Vocal Views shipped on Kurento; modern projects often default to MediaSoup or LiveKit.
Which ASR engine handles 30+ languages best?
Deepgram Nova-3 for cost-sensitive workloads (45+ languages, ~$4.30 / 1K minutes). AWS Transcribe for enterprise compliance (BAA available, 100+ languages). Azure Speech for Microsoft-stack buyers (140+ languages, HIPAA-friendly). Self-hosted Whisper for on-prem or large-scale cost control.
Is live human interpretation worth the engineering cost?
If your buyer interviews populations who would not consent to interview in English, yes — it is the feature procurement cites when choosing Vocal Views over generic competitors. If not, post-session translation plus AI summary is good enough and saves 30–40% of build cost.
When does building a custom platform beat using UserTesting or dscout APIs?
Build when your differentiator is a niche panel or specialised analysis workflow that vendor APIs cannot deliver. Buy + integrate when you need broad recruitment fast and your moat sits in synthesis, integrations, or vertical workflows.
Has Fora Soft shipped other video-marketplace products?
Yes. Beyond Vocal Views, we built BrainCert — a virtual classroom and LMS with 100K+ paying customers — and ProvideoMeeting, an enterprise video conferencing platform. Around 40% of our active engineering work is in video and AI.
What does compliance hardening cost?
$25–$50K for the SOC 2 + GDPR build pack on top of a base MVP, plus $15–$25K annual auditor fees. HIPAA adds another $20–$50K depending on cloud provider and recording-encryption scope. Build the controls in sprint 1 instead of retrofitting.
What to Read Next
Architecture
P2P vs MCU vs SFU for video conferencing
A practical comparison for product owners deciding the topology for an interview platform.
AI
AI simultaneous interpretation
Where AI translation is starting to compete with human interpreters — and where it isn’t.
Media server
What is Kurento Media Server
The pipeline-level engine that powers Vocal Views’ observer and interpreter modes.
Build vs buy
Build vs buy a video chat platform
The same framework applied to general-purpose video chat — useful sanity check for marketplaces.
Ready to ship your own research marketplace?
A modern research-marketplace platform has four jobs — recruit cleanly, interview reliably, transcribe accurately, and synthesise instantly. Vocal Views is the proof point at enterprise scale; the architecture, the ASR mix, the SFU choice, and the live-interpreter loop are the levers that decide whether your product wins enterprise deals or stalls at pilot.
If your panel, your AI moat, or your vertical workflow puts you in the build column, the fastest next step is a 30-minute call with the team that already shipped a system serving Google, McDonald’s, Netflix, and Samsung. We will walk the architecture, the cost, and the realistic calendar in one session — and tell you honestly when buying instead would be cheaper.
Talk to the team behind Vocal Views
Book a 30-minute call. We will scope your research-platform build — SFU, ASR, AI synthesis, panel, payouts, compliance, calendar, budget — in one session.


