Voice

Jorv Builder ships voice today — Whisper STT in, OpenAI or ElevenLabs TTS out. All BYOK. Voice transcripts flow into the PROV chain like every other artefact, so anything spoken in a Brainstorm is auditable months later.

What ships today

Push-to-talk dictation (⌘⇧M / Ctrl⇧M)

Hold the shortcut, speak, release. Whisper transcribes in chunks via streaming, so you see text fill in as you speak rather than waiting for a final response.

Configured in Settings → AI Models → Voice:

  • STT provider: OpenAI Whisper (default)
  • Language: auto-detect or specify (en-US, en-AU, etc.)
  • Model: whisper-1 (cheap) or gpt-4o-transcribe (faster + cleaner punctuation)
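The chunked dictation flow described above can be pictured with a small sketch. This is not Jorv Builder's actual implementation: the 16 kHz 16-bit mono capture format, the 2-second chunk size, and the `transcribe_chunk` callback (a stand-in for whatever call reaches your STT provider) are all assumptions made for illustration.

```python
from typing import Callable, Iterator

SAMPLE_RATE_HZ = 16_000  # assumed capture rate, not a documented Jorv setting
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM


def chunk_pcm(audio: bytes, chunk_seconds: float = 2.0) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a raw PCM buffer."""
    chunk_bytes = int(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_seconds)
    if chunk_bytes <= 0:
        raise ValueError("chunk_seconds must be positive")
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


def transcribe_incrementally(
    audio: bytes,
    transcribe_chunk: Callable[[bytes], str],  # hypothetical provider call
    chunk_seconds: float = 2.0,
) -> Iterator[str]:
    """Yield the running transcript after each chunk is transcribed."""
    parts: list[str] = []
    for chunk in chunk_pcm(audio, chunk_seconds):
        text = transcribe_chunk(chunk).strip()
        if text:  # skip chunks that come back empty (silence)
            parts.append(text)
        yield " ".join(parts)
```

Each yielded string is the transcript so far, which is what lets a UI fill in text while the speaker is still talking instead of waiting for one final response.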

Brain TTS playback

The Brain can respond in voice. Toggle in Settings → AI Models → Voice → TTS playback.

Providers:

  • OpenAI TTS: six voices, fast, ~$15 per 1M characters
  • ElevenLabs: higher-fidelity voices, voice cloning, ~$30-180 per 1M characters depending on plan

Pick a voice in the same settings panel. Each provider has its own playback preview.

Voice transcripts in PROV

Every voice interaction lands in the PROV chain alongside text chats. The transcript is the canonical artefact; the audio is not stored unless you opt in. This keeps the audit chain compact and grep-able.
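To make the idea of an auditable, grep-able chain concrete, here is a minimal sketch of a hash-chained transcript log. The record fields (`kind`, `text`, `timestamp`, `prev`, `hash`) and the use of SHA-256 are assumptions for illustration only; the actual PROV chain format is not specified here and may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    """Hash a record body in a stable (sorted-key) JSON encoding."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain: list, kind: str, text: str, timestamp: str) -> dict:
    """Append a record that commits to the hash of the previous one."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = {"kind": kind, "text": text, "timestamp": timestamp, "prev": prev_hash}
    record = {**body, "hash": _digest(body)}
    chain.append(record)
    return record


def verify(chain: list) -> bool:
    """Return True only if every record still matches its stored hash and link."""
    prev_hash = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True
```

Because each record stores only text and hashes, a log like this stays small and can be written one JSON object per line, so ordinary text tools can search it. Editing any earlier transcript changes its hash and breaks every link after it, which `verify` detects.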

v1 roadmap (voice-on-the-go)

  • Bidirectional conversational mode: Brain responds in voice by default; cuts the click-to-talk cycle. Q3 2026.
  • Mobile Companion push-to-talk: Native iOS + Android app spawns a Mission on your desktop Jorv Builder and reports back. Q3 2026.
  • Apple SpeechAnalyzer: On-device STT option for macOS Tahoe and later (macOS 26+). Free, no cloud round-trip. Q4 2026.
  • Speaker diarization: Multi-speaker recognition in shared brainstorm sessions; each speaker attributed in the transcript and PROV chain. Q1 2027.
  • iOS 27 Siri Extension: Register as a Siri intent provider so "Hey Siri, spawn a mission to fix the OAuth bug" works system-wide. Targeting Q4 2026, subject to the details announced at WWDC 2026.

BYOK voice economics

You pay OpenAI or ElevenLabs directly. Jorv never proxies voice traffic.

Indicative costs (Solo tier, moderate use):

  • Whisper STT: ~$0.006/minute
  • OpenAI TTS: ~$15 per 1M characters
  • ElevenLabs TTS: ~$30-180 per 1M characters depending on plan

A reasonably chatty dev session with ~30 minutes of voice costs roughly $0.10-0.30 in provider spend.
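The arithmetic behind a session estimate can be sketched from the indicative rates listed above. The function name and the example split between dictation minutes and spoken-back characters are hypothetical; substitute your own usage and your provider's current pricing.

```python
STT_USD_PER_MINUTE = 0.006               # Whisper STT, indicative rate from the list above
OPENAI_TTS_USD_PER_MILLION_CHARS = 15.0  # OpenAI TTS, indicative rate from the list above


def estimate_session_cost(
    stt_minutes: float,
    tts_chars: int,
    tts_usd_per_million_chars: float = OPENAI_TTS_USD_PER_MILLION_CHARS,
) -> float:
    """Return estimated provider spend in USD for one voice session."""
    stt_cost = stt_minutes * STT_USD_PER_MINUTE
    tts_cost = tts_chars / 1_000_000 * tts_usd_per_million_chars
    return round(stt_cost + tts_cost, 4)


# Hypothetical session: 20 minutes of dictation, 8,000 characters read back.
print(estimate_session_cost(stt_minutes=20, tts_chars=8_000))  # 0.24
```

Passing a higher `tts_usd_per_million_chars` (for example, a rate from the ElevenLabs range) shows how quickly TTS can come to dominate the total.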

Privacy posture

  • Voice is captured locally and streamed to your provider under your key.
  • Audio is never stored unless you opt in.
  • The transcript is local-first (PROV chain on your disk); Enterprise tier writes to your corporate WORM bucket.
  • Reduced-motion + screen-reader users can disable voice entirely with no functional loss — keyboard equivalents exist for every voice command.

Troubleshooting