Talk to Your AI: Ditto's Realtime Voice Mode

Ditto now supports live voice conversations with AI — powered by Gemini's realtime API, client-side voice activity detection, and persistent memory. Every voice conversation is remembered.

Typing is fine for writing code or editing a document. But when you’re pacing around your apartment working through an idea, driving home and want to recap your day, or lying on the couch too tired to type — you just want to talk.

Ditto now supports realtime voice conversations with AI. Not speech-to-text-then-wait-for-a-response. Actual back-and-forth, fluid voice interaction where you talk and the AI responds out loud, in real time.

And because this is Ditto, every word is remembered.

How It Works

Ditto’s voice mode is powered by Gemini’s realtime API, which streams audio bidirectionally over a persistent connection. You speak, the model processes your audio directly (the actual waveform, not a transcript of it), and it responds with synthesized speech. The latency is low enough that the conversation feels natural.
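To make "bidirectional over a persistent connection" concrete, here is a minimal sketch of the shape of such a session. It does not call Gemini's API: the two asyncio queues stand in for the two directions of the connection, and `fake_model`, `send_mic`, `play_replies`, and `run_session` are hypothetical names invented for this illustration. The point is that sending and receiving run at the same time rather than taking turns.

```python
import asyncio

END = None  # sentinel marking the end of a stream (an assumption of this sketch)

async def fake_model(uplink, downlink):
    """Stand-in for the realtime model: replies to each audio chunk as it arrives."""
    while (chunk := await uplink.get()) is not END:
        await downlink.put(b"reply-to-" + chunk)
    await downlink.put(END)

async def send_mic(chunks, uplink):
    """Push microphone chunks upstream without waiting for replies."""
    for chunk in chunks:
        await uplink.put(chunk)
        await asyncio.sleep(0)  # yield so the other tasks can interleave
    await uplink.put(END)

async def play_replies(downlink):
    """Collect reply audio as it streams back (a real client would play it)."""
    played = []
    while (chunk := await downlink.get()) is not END:
        played.append(chunk)
    return played

async def run_session(mic_chunks):
    # One queue per direction models a single persistent duplex connection.
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    _, _, played = await asyncio.gather(
        fake_model(uplink, downlink),
        send_mic(mic_chunks, uplink),
        play_replies(downlink),
    )
    return played

replies = asyncio.run(run_session([b"a1", b"a2", b"a3"]))
```

In a real client the queues would be replaced by a network connection and the byte strings by audio frames, but the three concurrent loops would keep the same structure.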

Here’s what a voice session looks like:

  1. Tap the voice icon in the composer bar
  2. Start talking — Ditto listens and responds in real time
  3. Interrupt freely — the AI stops speaking when you start
  4. End the session when you’re done — the full conversation is saved to your memory

No wake words. No “Hey Ditto.” You tap, you talk, you’re in a conversation.
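The four steps above can be pictured as a small state machine. The sketch below is an illustration only, not Ditto's implementation: `VoiceSession`, its method names, and the `saved` list (a stand-in for persistent memory) are all assumptions made for this example. It shows the two behaviors the steps describe: the AI yields the floor when you start speaking, and the whole exchange is saved when the session ends.

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    AI_SPEAKING = "ai_speaking"

@dataclass
class VoiceSession:
    state: State = State.IDLE
    transcript: list = field(default_factory=list)
    saved: list = field(default_factory=list)  # stand-in for persistent memory

    def tap_voice_icon(self):
        # Step 1: one tap starts the session; no wake word involved.
        self.state = State.LISTENING

    def user_speaks(self, text):
        if self.state is State.IDLE:
            return  # no active session, nothing is captured
        if self.state is State.AI_SPEAKING:
            # Step 3: speaking over the AI cuts its reply short.
            self.transcript.append(("system", "ai interrupted"))
        self.state = State.LISTENING
        self.transcript.append(("user", text))

    def ai_responds(self, text):
        # Step 2: the AI only replies while the session is listening.
        if self.state is State.LISTENING:
            self.state = State.AI_SPEAKING
            self.transcript.append(("ai", text))

    def end(self):
        # Step 4: the full conversation goes to memory, then the session resets.
        self.saved.extend(self.transcript)
        self.transcript = []
        self.state = State.IDLE

session = VoiceSession()
session.tap_voice_icon()
session.user_speaks("Help me plan the launch")
session.ai_responds("Sure, let's start with the timeline.")
session.user_speaks("Actually, budget first")  # interrupts the AI
session.end()
```

After `end()`, `session.saved` holds both user turns, the AI turn, and a marker showing where the interruption happened.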

Why Voice Changes Everything

Voice unlocks a category of interactions that text can’t match:

Brainstorming moves faster. When you’re riffing on ideas, the friction of typing breaks the flow. Voice lets you think out loud and get instant pushback, questions, and suggestions from an AI that already knows your context from previous conversations.

Capture thoughts on the go. Walking the dog, commuting, cooking — moments where your hands are busy but your mind is active. With voice mode, those thoughts become memories instead of forgotten ideas.

Complex explanations are easier to speak. Try explaining a system architecture by typing on your phone. Now try saying it out loud. Voice handles nuance, emphasis, and tangents naturally. Ditto captures the whole thing.

Accessibility matters. Not everyone can type comfortably. Voice makes Ditto usable for people who interact better through speech — whether by preference or necessity.

Client-Side Voice Activity Detection

One problem with voice AI: it’s expensive. Keeping a bidirectional audio stream open while you’re pausing to think, getting interrupted, or reading something before responding burns tokens on silence.

Ditto uses client-side voice activity detection (VAD) to solve this. Your device listens locally and only streams audio to the API when you’re actually speaking. Silence stays on your device. Because silent stretches are never sent, realtime API token usage drops significantly, and the local check adds no perceptible latency.

You don’t need to configure this. It just works — your voice sessions cost less and perform better because of it.
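For the curious, here is a rough sketch of the gating idea, assuming a simple energy threshold. Ditto's actual detector is not described here and may work quite differently (production VADs often use trained models rather than raw energy). The names `rms` and `gate_frames`, and the `threshold` and `hangover` values, are illustrative assumptions.

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def gate_frames(frames, threshold=500.0, hangover=2):
    """Yield only the frames worth streaming to the API.

    threshold and hangover are illustrative values, not Ditto's settings.
    hangover keeps a few quiet frames after speech so word endings
    are not clipped.
    """
    quiet_run = hangover  # start in the "silent" state
    for frame in frames:
        if rms(frame) >= threshold:
            quiet_run = 0
            yield frame          # speech: send it
        elif quiet_run < hangover:
            quiet_run += 1
            yield frame          # short tail right after speech: send it
        # otherwise: silence, which never leaves the device

silence = [0] * 160
speech = [1000, -1000] * 80
frames = [silence, speech, speech, silence, silence, silence, silence]
sent = list(gate_frames(frames))
```

Of the seven frames captured here, only four are sent: the two speech frames plus the two-frame tail. The leading silence and the long pause at the end stay local.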

The Inline Voice Recorder

Not every voice interaction needs to be a live conversation. Sometimes you just want to dictate a quick thought.

The inline voice recorder sits in the composer bar alongside text input. Tap it, speak your message, and Ditto transcribes it into text in the composer. Edit it if you want, then send. Or just send it as-is — your voice input becomes a regular message with full memory persistence.

This is the quick-capture workflow: pull out your phone, tap record, say “remind me that the API refactor needs to handle pagination before the v2 launch,” and you’re done. Five seconds. That thought is now a searchable memory connected to your knowledge graph.

Choose Your Voice

Ditto lets you preview and select from multiple voice options in Settings. Different voices suit different contexts — a crisp, professional voice for work sessions, a warmer tone for personal conversations. You can also configure which voice Ditto uses for read-aloud, so when you ask Ditto to read a long response back to you, it sounds the way you prefer.

Voice + Memory: The Combination That Matters

Here’s what makes Ditto’s voice mode different from talking to Siri, Alexa, or even ChatGPT’s Advanced Voice Mode: every voice conversation builds your persistent memory.

Talk to Ditto about your startup idea on a Monday evening walk. On Wednesday, type a follow-up question at your desk. Ditto remembers the voice conversation — the context, the decisions, the details — and continues where you left off. Voice and text are the same memory system. There’s no separate “voice history” you can’t search.

This works because Ditto’s memory is modality-agnostic. Text, images, documents, and voice all flow into the same knowledge graph with the same semantic search, the same learned retrieval weights, and the same transparent memory cards showing exactly what was retrieved.
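A toy model of what "modality-agnostic" means in practice: every memory lands in one store with the modality recorded as a plain field, and search runs over all of them at once. This is a sketch under loud assumptions. `Memory`, `MemoryStore`, and the word-overlap scoring are invented for illustration, and the scoring is a crude stand-in for the semantic search and learned retrieval weights described above.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Memory:
    modality: str  # "text", "voice", "image", or "document"
    content: str   # message text or transcript used for retrieval

def _words(text):
    """Lowercased word set; a crude proxy for a semantic embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class MemoryStore:
    def __init__(self):
        self._items = []

    def add(self, modality, content):
        # Voice and text take the same path into the same list.
        self._items.append(Memory(modality, content))

    def search(self, query):
        """Return memories sharing words with the query, best match first."""
        terms = _words(query)
        scored = [(len(terms & _words(m.content)), m) for m in self._items]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [m for score, m in ranked if score > 0]

store = MemoryStore()
store.add("voice", "Monday walk: startup idea is a marketplace for used lab equipment")
store.add("text", "Grocery list: eggs, rice, coffee")
hits = store.search("startup idea follow-up")
```

The typed query on Wednesday finds the memory that was spoken on Monday, because search never asks which modality a memory came from.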

Other voice assistants give you a conversation that vanishes when you hang up. Ditto gives you a conversation that compounds.

Voice on Native Mobile

Voice mode is particularly powerful on Ditto’s native iOS and Android apps. Native audio access means no browser permission prompts and direct access to the platform’s audio stack. Here’s what that gives you:

  • One tap to start from the composer bar
  • Background audio — switch apps and keep the conversation going
  • Native audio routing — Bluetooth headphones, car speakers, AirPods work seamlessly
  • Low latency — native audio processing adds less overhead than browser-based alternatives

The combination of native performance and Ditto’s persistent memory means you can have a voice conversation while driving, park the car, walk into a meeting, and later pull up exactly what you discussed, with full context and semantic search.

Getting Started with Voice

Voice mode is available now. Open Ditto, tap the voice icon in the composer, and start talking. Preview voices in Settings to find one you like.

If you’re already using Ditto for text conversations, voice is just another input modality. Your existing threads, subjects, and goals carry over. The AI that remembers your text conversations now remembers your voice conversations too.

Talk to your AI. It’s listening — and this time, it’ll remember.


Try Ditto’s realtime voice mode at assistant.heyditto.ai. Free to start, no credit card required.
