Live Two-Way Chat

Real-time conversational AI with natural speech flow - moving beyond forum-style turn-taking.

Vision

Current chatbot conversations are essentially forums with near-instant replies. In real conversation, people don't listen to someone finish speaking, stop to consider the context, and then respond with an entire paragraph. Live Two-Way Chat simulates natural human conversation.

The Problem

Traditional chat interfaces:

  • Wait for complete user input before processing
  • Generate entire responses at once
  • Can't be interrupted or course-corrected mid-thought
  • Feel robotic and turn-based

The Solution

A real-time bidirectional conversation where:

  1. Continuous transcription - Human voice is transcribed in small constant chunks in the background
  2. Predictive response preparation - AI analyzes context and pre-prepares replies, modifying them as new context arrives
  3. Natural interruption - AI decides when to speak:
    • Sometimes interrupting if an important point needs to be made
    • Sometimes waiting for a question to be asked
  4. Bidirectional listening - The chatbot listens even while speaking, taking into account what it was saying when interrupted
  5. Shared context window - A visual workspace for files and artifacts
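
The first two steps above can be sketched as a draft reply that is regenerated every time a new transcript chunk arrives. This is a minimal illustration under assumed names (`DraftReply`, `on_chunk` are hypothetical), and the `_prepare` stub stands in for what would be an incremental LLM call:

```python
from dataclasses import dataclass, field

@dataclass
class DraftReply:
    """A reply the agent keeps revising as transcript chunks arrive."""
    transcript: list[str] = field(default_factory=list)
    draft: str = ""

    def on_chunk(self, chunk: str) -> None:
        # 1. Continuous transcription: append the newly decoded chunk.
        self.transcript.append(chunk)
        # 2. Predictive preparation: regenerate the draft from full context.
        self.draft = self._prepare(" ".join(self.transcript))

    def _prepare(self, context: str) -> str:
        # Hypothetical stand-in for an incremental model call.
        if context.rstrip().endswith("?"):
            return "Answering the question about: " + context
        return "Noted so far: " + context

reply = DraftReply()
reply.on_chunk("I'm trying to deploy the service")
reply.on_chunk("but the container keeps crashing?")
print(reply.draft)  # the draft is revised after every chunk
```

The key property is that the draft is cheap to throw away: each chunk can flip the reply from a neutral acknowledgement to a direct answer.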

Shared Context Window

A drag-and-drop workspace visible to both human and AI:

Content Type   Behavior
Images         Displayed for user, visible to AI for analysis
Code           Displayed and editable by user, AI can view and modify
Documents      Shared context for conversation
Split view     Window can split to show 2+ files simultaneously

The AI can:

  • View what's in the window
  • Edit code or text files
  • Reference images in conversation
  • Suggest changes visually
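
One way to model such a workspace is a small shared-state object with change notifications, so both parties observe each other's edits. This is a sketch with hypothetical names (`SharedContext`, `ContextItem`); a real version would sync over a network and enforce richer permissions:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ContextItem:
    kind: str       # "image" | "code" | "document"
    name: str
    content: str
    editable: bool = True

@dataclass
class SharedContext:
    """Workspace state visible to both human and AI."""
    items: dict[str, ContextItem] = field(default_factory=dict)
    listeners: list[Callable[[str], None]] = field(default_factory=list)

    def add(self, item: ContextItem) -> None:
        self.items[item.name] = item
        self._notify(f"added {item.name}")

    def edit(self, name: str, new_content: str, author: str) -> None:
        item = self.items[name]
        if not item.editable:
            raise PermissionError(f"{name} is read-only")
        item.content = new_content
        self._notify(f"{author} edited {name}")

    def _notify(self, event: str) -> None:
        # Both sides subscribe, so an AI edit is immediately visible
        # in the user's view and vice versa.
        for fn in self.listeners:
            fn(event)

events = []
ctx = SharedContext(listeners=[events.append])
ctx.add(ContextItem("code", "main.py", "print('hi')"))
ctx.edit("main.py", "print('hello')", author="ai")
```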

Technical Challenges

  1. Streaming ASR - Real-time speech-to-text with low latency
  2. Incremental response generation - Partial responses that can be updated
  3. Turn-taking model - When to speak, when to wait, when to interrupt
  4. Context threading - Tracking both what had been said and what was being said when an interruption occurs
  5. Audio ducking - Managing simultaneous speech gracefully
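
The turn-taking challenge (item 3) can be made concrete with a toy policy: interrupt only for urgent points, take the turn when a question plus a pause signals the floor is free, otherwise wait. The thresholds and heuristics below are illustrative assumptions, not a proposed model:

```python
from enum import Enum

class Turn(Enum):
    WAIT = "wait"
    SPEAK = "speak"
    INTERRUPT = "interrupt"

def decide_turn(partial_transcript: str, silence_ms: int, urgency: float) -> Turn:
    """Toy turn-taking policy; all thresholds are made-up placeholders."""
    # Interrupt only when the prepared point is urgent enough.
    if urgency > 0.9:
        return Turn.INTERRUPT
    # A trailing question plus a short pause suggests the floor is yielded.
    if partial_transcript.rstrip().endswith("?") and silence_ms > 300:
        return Turn.SPEAK
    # A long pause even mid-statement is also a cue to take the turn.
    if silence_ms > 1200:
        return Turn.SPEAK
    return Turn.WAIT
```

A production model would likely replace these hand-written rules with a classifier trained on prosody and timing features, but the interface (transcript state in, turn decision out) stays the same.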

Potential Architecture

┌─────────────────┐     ┌──────────────────┐
│   Microphone    │────▶│  Streaming ASR   │
│  (continuous)   │     │  (Whisper/etc)   │
└─────────────────┘     └────────┬─────────┘
                                 │ text chunks
                                 ▼
┌─────────────────┐     ┌──────────────────┐
│    Speaker      │◀────│ Response Engine  │
│     (TTS)       │     │  (predictive)    │
└─────────────────┘     └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │  Context Window  │
                        │  (shared state)  │
                        └──────────────────┘
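
The stages in the diagram could be wired together with queues, for example using Python's asyncio. Transcription and speech synthesis are stubbed out here, and every function name is hypothetical; the point is only the flow audio → text chunks → incrementally built response → speech:

```python
import asyncio

async def streaming_asr(audio_q, text_q):
    """Consume audio frames, emit text chunks (transcription is stubbed)."""
    while (frame := await audio_q.get()) is not None:
        await text_q.put(f"<text of {frame}>")
    await text_q.put(None)  # propagate end-of-stream

async def response_engine(text_q, speech_q):
    """Revise a draft after every chunk; hand it to TTS at end of turn."""
    draft = []
    while (chunk := await text_q.get()) is not None:
        draft.append(chunk)  # incremental response generation
    await speech_q.put(" ".join(draft))
    await speech_q.put(None)

async def run_pipeline(frames):
    audio_q, text_q, speech_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    for frame in [*frames, None]:  # simulated continuous microphone
        await audio_q.put(frame)
    await asyncio.gather(
        streaming_asr(audio_q, text_q),
        response_engine(text_q, speech_q),
    )
    spoken = []
    while (utterance := await speech_q.get()) is not None:
        spoken.append(utterance)  # stands in for the TTS speaker
    return spoken

print(asyncio.run(run_pipeline(["frame1", "frame2"])))
```

A real implementation would run the stages concurrently with live audio and add a barge-in path from ASR to the speaker for audio ducking; the queue topology stays the same.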

Inspiration

  • Natural human conversations (overlapping speech, interruptions, backchanneling)
  • Real-time collaborative editors (Google Docs)
  • Voice assistants that feel less robotic
  • Pair programming conversations
  • Ramble - Voice transcription (could provide ASR component)
  • Artifact Editor - Could power the shared context window