05 · transcription api
hosted Q4 2026 · BYOK works today
We do not ship hosted transcription today. What we ship today: per-speaker audio tracks you can pipe straight into Whisper, Deepgram, AssemblyAI, Gladia, AWS Transcribe, or ElevenLabs on your own key — zero meetbot fee on that leg. See the action-items-bot sample for a working Whisper integration. Hosted Whisper-large-v3 lands Q4 2026 at $0.10/hr.
overview
Honest scope. We do not generate transcripts today. The recording side ships per-speaker audio (one Opus track per participant, name-tagged), and we surface the meeting platform's native captions verbatim — Meet/Teams/Zoom each have their own captioner and we pass it through as captions.jsonl. We do not run ASR ourselves. If you need a transcript today, the path is BYOK: pipe each per-speaker track into your provider of choice, get a per-speaker transcript back. The action-items-bot sample at github.com/meetbot/samples shows the Whisper integration end-to-end.
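The BYOK path hands you one transcript per speaker, so the last step is stitching them into a single meeting transcript, which is just a sort by start time. A minimal sketch, assuming each utterance carries the speaker name and millisecond offsets (the `Utterance` shape here is illustrative, not a meetbot type):

```typescript
// Illustrative utterance shape for a per-speaker BYOK transcript (assumption).
interface Utterance {
  name: string;
  text: string;
  tStart: number; // ms from meeting start
  tEnd: number;
}

// Merge per-speaker utterance lists into one transcript ordered by start time.
function interleave(tracks: Utterance[][]): Utterance[] {
  return tracks.flat().sort((a, b) => a.tStart - b.tStart);
}

const merged = interleave([
  [{ name: "Ada", text: "Shall we start?", tStart: 0, tEnd: 1200 }],
  [
    { name: "Lin", text: "Yes.", tStart: 1300, tEnd: 1700 },
    { name: "Lin", text: "Agenda first.", tStart: 2000, tEnd: 3100 },
  ],
]);
```

Because each track is already name-tagged from the roster, no diarization step is needed on your side.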
Q4 2026: hosted Whisper. When the hosted path ships it'll run on a Hetzner GPU box (RTX 4090), serve about twenty concurrent realtime streams, and support mid-meeting language switching. Speaker tagging will inherit straight from the bot's existing per-speaker audio mapping — we already know who said what, so the transcript does too. Pricing at GA: $0.10/hr add-on. Default for new accounts will stay "no transcription," because the cheapest API call is the one you don't make.
BYOK today, async or realtime. The shape we'll ship at GA: async is one POST after the meeting ends; realtime opens a WebSocket on wss://api.meetbot.dev/v1/transcripts/:bot_id and streams partial + finalized utterances as they're produced. Today, you build the equivalent on your end — the per-speaker audio is in your S3 bucket the second the meeting ends; route to your provider, then to your downstream consumer. The JSONL shape we'll use at GA matches the captions JSONL we already emit today, so the migration is a one-line consumer change.
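To stay migration-proof, emit your BYOK results in the captions JSONL shape now. A sketch of that conversion, assuming your provider returns Whisper-style segments with start/end seconds and text (the `ProviderSegment` field names follow Whisper's verbose JSON and are an assumption; swap in your provider's fields):

```typescript
// Assumed provider segment shape (Whisper-style verbose JSON fields).
interface ProviderSegment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Row shape matching the captions JSONL fields named on this page.
interface CaptionRow {
  speakerId: string;
  name: string;
  text: string;
  tStart: number; // milliseconds
  tEnd: number;   // milliseconds
}

// Map one per-speaker transcript onto the captions JSONL shape, so your
// downstream consumer is already on the format the GA endpoint will emit.
function toCaptionRows(
  speakerId: string,
  name: string,
  segments: ProviderSegment[],
): CaptionRow[] {
  return segments.map((s) => ({
    speakerId,
    name,
    text: s.text.trim(),
    tStart: Math.round(s.start * 1000),
    tEnd: Math.round(s.end * 1000),
  }));
}

function toJsonl(rows: CaptionRow[]): string {
  return rows.map((r) => JSON.stringify(r)).join("\n");
}

const rows = toCaptionRows("spk_1", "Ada", [
  { start: 0.0, end: 1.8, text: " Good morning. " },
]);
```

If your consumer reads this shape today, switching to hosted transcription at GA really is just pointing it at a different file.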
honest scope
BYOK on per-speaker audio is available right now and free. Hosted Whisper-large-v3 lands Q4 2026 at $0.10/hr add-on. We are not in the model-training business; the GA path will use frontier ASR providers as black boxes.
works today
Per-speaker audio tracks (the input you need for ASR)
Every bot dispatch ships audio.{speaker}.webm — one Opus track per participant, name-tagged from the meeting roster. Pipe each into Whisper / Deepgram / AssemblyAI / Gladia / AWS Transcribe / ElevenLabs on your own key. Zero meetbot fee on that leg.
Native captions JSONL passthrough
captions.jsonl already ships per-meeting, surfaced verbatim from Meet/Teams/Zoom's own captioner. Newline-delimited, one row per finalized utterance with speakerId + start/end ms.
BYOK reference implementation
samples/action-items-bot wires the per-speaker tracks into Whisper-large-v3 end-to-end. MIT-licensed; clone, swap your OPENAI_API_KEY, ship by Friday.
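Consuming the captions.jsonl passthrough above is one JSON.parse per line. A sketch; the `tStart`/`tEnd` names for the ms offsets follow the planned-surface shape and should be treated as assumptions until you check a real file:

```typescript
// Assumed row shape, per the captions JSONL description on this page.
interface CaptionRow {
  speakerId: string;
  name: string;
  text: string;
  tStart: number; // ms
  tEnd: number;   // ms
}

// Parse newline-delimited JSON, skipping blank lines (e.g. a trailing newline).
function parseCaptions(jsonl: string): CaptionRow[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as CaptionRow);
}

const sample =
  '{"speakerId":"spk_1","name":"Ada","text":"Hello.","tStart":0,"tEnd":900}\n' +
  '{"speakerId":"spk_2","name":"Lin","text":"Hi.","tStart":1000,"tEnd":1400}\n';
const captions = parseCaptions(sample);
```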
not yet
Hosted Whisper-large-v3 endpoint
Q4 2026. Will run on a Hetzner GPU box (RTX 4090). $0.10/hr add-on. Speaker tagging inherits from the bot's existing per-speaker mapping.
Realtime WebSocket (wss://api.meetbot.dev/v1/transcripts/:bot_id)
Q4 2026. Streams partial + finalized utterances as they're produced, per speaker.
Async transcript on completed recordings (POST /v1/recordings/:id/transcript)
Q4 2026. For when you decided to enable transcription only after the call.
Mid-meeting language switching + per-utterance lang tags
Q4 2026. Whisper-large-v3 handles this natively; we just have to surface it.
Hosted BYOK key vault (provider keys at rest under per-tenant KMS)
Q4 2026. Today, BYOK means your key in your env; at GA we'll let you store it under our key-vault and rotate from /account/keys.
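The planned realtime socket implies a small amount of client-side state: partial frames overwrite a speaker's in-flight text, final frames append. A sketch of that reducer; only the partial/final distinction comes from this page, the other frame fields are assumptions:

```typescript
// Hypothetical frame shape for the planned realtime socket (assumption:
// everything except the partial/final type field).
interface Frame {
  type: "partial" | "final";
  speakerId: string;
  text: string;
}

interface LiveState {
  finalized: string[];              // finished utterances, in arrival order
  partials: Record<string, string>; // latest in-flight text per speaker
}

// Apply one frame: a partial replaces that speaker's in-flight text,
// a final moves it into the finalized list.
function applyFrame(state: LiveState, f: Frame): LiveState {
  if (f.type === "partial") {
    return { ...state, partials: { ...state.partials, [f.speakerId]: f.text } };
  }
  const { [f.speakerId]: _done, ...rest } = state.partials;
  return { finalized: [...state.finalized, f.text], partials: rest };
}

const frames: Frame[] = [
  { type: "partial", speakerId: "spk_1", text: "Shall we" },
  { type: "partial", speakerId: "spk_1", text: "Shall we start?" },
  { type: "final", speakerId: "spk_1", text: "Shall we start?" },
];
const live = frames.reduce(applyFrame, { finalized: [], partials: {} });
```

The same reducer works today against any realtime BYOK provider that distinguishes interim from finalized results, so your UI code carries over unchanged at GA.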
BYOK works today: dispatch a bot, take the per-speaker WebM tracks from the manifest, run them through Whisper / Deepgram / AssemblyAI yourself. The action-items-bot sample shows the pattern at github.com/meetbot-dev/sample-action-items-bot-ts.
planned surface
Per-bot config on dispatch. mode ∈ {async, realtime}. provider ∈ {hosted-whisper, deepgram, assemblyai, gladia, aws-transcribe, elevenlabs}.
Newline-delimited JSON. One row per finalized utterance, with speakerId, name, text, tStart, tEnd. Same shape as captions.
Realtime WebSocket. Emits {type: partial|final, ...} frames as utterances are produced. Per-speaker.
Async transcript on a previously-completed recording. Useful if you decided to enable transcription only after the call.
Whisper-large-v3 detects mid-meeting language switches. Per-utterance lang tag in the JSONL. No need to declare upfront.
Provider keys stored encrypted with per-tenant KMS-derived keys. Rotation through /account/keys without redeploys.
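Taken together, the planned per-bot config can be typed from the two enums above. A sketch; the mode and provider values come from this page, everything else about the dispatch payload is an assumption:

```typescript
// Planned transcript config, typed from the enums on this page.
type TranscriptMode = "async" | "realtime";
type TranscriptProvider =
  | "hosted-whisper"
  | "deepgram"
  | "assemblyai"
  | "gladia"
  | "aws-transcribe"
  | "elevenlabs";

interface TranscriptConfig {
  mode: TranscriptMode;
  provider: TranscriptProvider;
}

// Example per-bot config as it might appear on dispatch.
const transcript: TranscriptConfig = {
  mode: "async",
  provider: "hosted-whisper",
};
```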