feat: Implement continuous dictation with chunked transcription

- Add ContinuousRecorder class using sounddevice for gapless audio capture
- Record continuously and extract 10-second chunks with 0.5s overlap
- Overlap prevents words from being cut off at chunk boundaries
- Transcription happens in parallel with recording
- Text appears incrementally as chunks complete
- Add sounddevice and numpy as GUI dependencies
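The overlap extraction described above can be sketched as follows. This is a minimal illustration of the technique, not the shipped code; the sample counts are scaled down (10 samples per "second" instead of 16000) purely for readability:

```python
import numpy as np

def extract_chunk(buffer, chunk_samples, overlap_samples):
    """If enough audio has accumulated, return (chunk, remaining_buffer).

    The emitted chunk is chunk_samples + overlap_samples long, and the
    buffer keeps everything from chunk_samples onward, so the next chunk
    re-hears the overlap and no word is lost at the boundary.
    """
    if len(buffer) < chunk_samples + overlap_samples:
        return None, buffer  # not enough audio yet
    chunk = buffer[:chunk_samples + overlap_samples]
    return chunk, buffer[chunk_samples:]

# 12.5 "seconds" of audio, 10 s chunks, 0.5 s overlap (scaled down)
audio = np.arange(125)
chunk, rest = extract_chunk(audio, 100, 5)
assert len(chunk) == 105          # chunk plus overlap
assert chunk[100] == rest[0]      # the overlap is shared by both
```

The same arithmetic appears in `_chunk_loop` below, where `min_samples_needed = chunk_samples + overlap_samples` gates extraction.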

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
rob 2026-01-04 03:03:10 -04:00
parent 55aec4c7f2
commit 57efe6c931
3 changed files with 386 additions and 111 deletions


@ -14,3 +14,93 @@ This is just a general brainstorming discussion.
---
*Discussion begins below.*
---
Name: rob
I'm trying to think of a good way of documenting development projects.
What programs exist? I usually put it all in a markdown file.
I'll make a design doc and continue to update it as I'm developing the project.
Sometimes, if the project gets really complicated, I make different sections; then the design doc gets really big, and sometimes I break it up into smaller documents.
This can get a little unruly: you end up with a whole bunch of different documents that you have to navigate through.
You end up with huge documents that almost become a research project to catch up on and understand.
It's hard to see the entire project all in one place. Sometimes I try to use UML, like PlantUML, to help me see the entire project.
Those end up in different files, and I have to use a different program to view them. Sometimes I use SVG files for UI mockups, and I have to use yet another program to view those.
I find it hard to keep it all together; you end up with documents spread throughout your project,
some for users to see, some for developers, some just for me.
I've tried things like Obsidian, and sometimes I just create entire web pages because you can link different documents, but I
haven't found a really good solution where I can view, navigate, and edit it all in one place. What ideas do you have?
---
Name: AI-Architect
Parse error
VOTE: NONE
---
Name: AI-Backend
Backend feedback: You're describing a doc hub that unifies markdown, diagrams, and media with seamless navigation/editing. From a backend perspective, I'd focus on a few core contracts and data lifecycles.
- Data model: treat “artifact” as a first-class entity with a type (markdown, plantuml, svg, image, link), version history, and metadata (tags, owners, visibility). This lets you index/search and compose views without scattering files.
- Storage strategy: keep sources as text blobs (for diff/merge) and derived renders (SVG/PNG/HTML) as cached artifacts. Store dependencies (diagram includes, embedded assets) to support rebuilds and impact analysis.
- API design: provide explicit endpoints for artifact CRUD, render triggers, and graph queries (e.g., backlinks, referenced-by). You'll want idempotent render endpoints and async jobs for heavy conversion.
- Permissions: user vs. team vs. public docs will need ACLs at the artifact and folder/tag level. Plan for mixed visibility within a single project.
- Indexing/search: full-text search over markdown plus extracted text from diagrams, and a graph index for link navigation. This makes “see the whole project” feasible.
- Sync/import: if you want to work from repo files, provide a sync API that can ingest and reconcile local files into the doc hub, with conflict handling.
- Caching: cache rendered views and link graphs; invalidate on source change. Keep a render queue to avoid spikes.
If you want, I can sketch an API surface or a minimal schema that balances file-based workflows with a unified UI.
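A minimal sketch of the artifact model and the "referenced-by" graph query described above. All names and fields here are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Artifact:
    """A first-class doc-hub entity: source text plus metadata."""
    id: str
    type: str                                  # "markdown" | "plantuml" | "svg" | ...
    source: str
    tags: list = field(default_factory=list)
    links: list = field(default_factory=list)  # artifact ids this one references

class ArtifactIndex:
    """In-memory link graph supporting backlink ("referenced-by") queries."""
    def __init__(self):
        self._artifacts = {}
        self._backlinks = defaultdict(set)

    def add(self, artifact: Artifact):
        self._artifacts[artifact.id] = artifact
        for target in artifact.links:
            self._backlinks[target].add(artifact.id)

    def referenced_by(self, artifact_id: str) -> set:
        return set(self._backlinks[artifact_id])

idx = ArtifactIndex()
idx.add(Artifact("design-doc", "markdown", "# Design", links=["arch-diagram"]))
idx.add(Artifact("arch-diagram", "plantuml", "@startuml ... @enduml"))
assert idx.referenced_by("arch-diagram") == {"design-doc"}
```

A real backend would persist this in a database and maintain the backlink index incrementally on artifact save, but the query contract is the same.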
VOTE: NONE
---
Name: AI-Designer
The core issue you're describing is an **information architecture problem manifesting as a visual/navigation problem**. Let me break down the design challenges:

### Current Pain Points (Design Perspective)

1. **Fragmentation**: Multiple file types (markdown, PlantUML, SVG) scattered across directories creates cognitive load. Each context switch (switching programs) breaks focus.

2. **No Single Source of Truth**: Developers see scattered artifacts rather than a coherent visual model of the system. This creates ambiguity about what's current/canonical.

3. **Scale Problem**: Large documents become unnavigable. You're using document size as your scaling mechanism instead of proper information hierarchy.

4. **No Visual Cross-Referencing**: Diagrams exist separately from prose. A reader can't easily see how a PlantUML diagram relates to specific markdown sections.

### Design Solution Space

**1. Unified Document with Embedded Visuals** (Best for cohesion)
- Single markdown file as source of truth
- Inline diagram code that renders on-demand
- Color-coded sections for different audiences (dev/user/personal notes)
- Table of contents with jump links
- *Trade-off*: Single large file, but structured with clear visual hierarchy

**2. Hierarchical Document Structure** (Best for scalability)
- Index document with clear visual hierarchy
- Child documents explicitly linked (not scattered)
- Shared visual identity (consistent headers, color scheme, typography)
- Breadcrumb navigation to show context
- *Trade-off*: Multiple files but with intentional architecture

**3. Integrated Development Notebook** (Best for discoverability)
- Development log with timestamped entries
- Embedded diagrams and decision records
- Tags/categories for cross-cutting concerns
- Search-first navigation
- *Trade-off*: Requires tooling

### Immediate Design Recommendations

1. **Establish Visual Hierarchy**: Create a template with:
   - Typography scale (consistent H1-H3 sizing)
   - Color system (primary/secondary/accent)
   - Spacing rules and rhythm

2. **Content Zoning**: Visually separate audiences within the same document:
   - User-facing docs in standard style
   - Developer notes in distinct color/style
   - Personal thinking in another

3. **Diagram Integration**: Embed diagram source with prose:
   - Fenced code blocks render inline
   - Single file, no context switching

4. **Navigation Design**: If multiple documents:
   - Always show breadcrumbs
   - Maintain identical navigation UI
   - Use consistent link styling

### Key Insight

You need a **designed system**, not just a different tool. Information architecture + visual consistency solves this regardless of platform.

DIAGRAM: diagrams/general-discussion_designer_1.puml
VOTE: READY
---
Name: AI-Ux
The user's problem statement highlights a critical UX challenge: the fragmentation of project documentation across various tools and file formats. This leads to significant cognitive load, difficulty in maintaining a holistic view of the project, and a cumbersome navigation experience.
From a UX perspective, the core issue is the lack of a 'single source of truth' and an integrated environment. Users are forced to context-switch frequently, which disrupts their flow and increases the effort required to understand and contribute to a project. The current process is neither intuitive nor efficient.
Key UX pain points identified:
- **High Cognitive Load:** Juggling multiple documents, tools, and mental models for different content types (text, diagrams, UI mockups).
- **Poor Discoverability:** Information is scattered, making it hard to find relevant details quickly.
- **Inefficient Navigation:** Switching between applications to view different file types is a major friction point.
- **Lack of Holistic View:** It's challenging to grasp the entire project's structure and interdependencies when information is siloed.
An ideal solution would prioritize:
1. **Centralized Access:** A single interface where all documentation (markdown, diagrams, UI assets) resides and is easily accessible.
2. **Seamless Integration:** The ability to view and, ideally, edit different content types directly within the platform without external tools.
3. **Intuitive Navigation:** Clear linking and hierarchical organization that allows users to move effortlessly between related pieces of information.
4. **Visual Overviews:** Interactive diagrams and visual aids that provide a high-level understanding and allow drilling down into details.
5. **Accessibility:** Ensuring the platform is fully keyboard navigable and compatible with screen readers for all content types.
6. **Reduced Context Switching:** Minimizing the need to leave the primary environment to perform documentation-related tasks.
DIAGRAM: diagrams/general-discussion_ux_1.puml
VOTE: CHANGES
---
Name: AI-Frontend
From a frontend perspective, this problem screams for a unified documentation interface. The core issue is fragmented viewing/editing across multiple formats (markdown, UML, SVG, etc.). A modern single-page application could solve this with:
1. **Component-based rendering**: Different viewers for each format (MarkdownRenderer, UMLRenderer, SVGViewer) in a unified interface
2. **Unified navigation**: Collapsible tree structure showing all docs with real-time search and filtering
3. **Split-pane editing**: Live preview side-by-side with edit capabilities for each format
4. **Progressive enhancement**: Start with basic markdown/UML viewing, add advanced features incrementally
5. **State management**: Centralized store for document structure, active selections, and edit states
The frontend should focus on performance (lazy loading heavy assets like UML diagrams) and accessibility (keyboard navigation, screen reader support for complex diagrams).
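The per-format viewer dispatch in point 1 (plus the lazy loading mentioned above) is language-agnostic; it is sketched here in Python for brevity, and the renderer names are illustrative assumptions:

```python
class RendererRegistry:
    """Maps a document format to a lazily-constructed renderer, so heavy
    viewers (e.g. a UML renderer) are only built when first needed."""
    def __init__(self):
        self._factories = {}   # fmt -> zero-arg constructor
        self._instances = {}   # fmt -> constructed renderer (cache)

    def register(self, fmt, factory):
        self._factories[fmt] = factory

    def get(self, fmt):
        if fmt not in self._instances:  # construct lazily, once
            self._instances[fmt] = self._factories[fmt]()
        return self._instances[fmt]

registry = RendererRegistry()
registry.register("md", lambda: "MarkdownRenderer")
registry.register("puml", lambda: "UMLRenderer")
assert registry.get("md") == "MarkdownRenderer"
assert registry.get("puml") is registry.get("puml")  # cached after first use
```

In a real SPA the factories would construct actual viewer components, but the registry/cache shape is the same.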
DIAGRAM: diagrams/general-discussion_frontend_1.puml
VOTE: CHANGES


@ -38,6 +38,8 @@ tui = [
 gui = [
     "dearpygui>=2.0.0",
     "artifact-editor @ git+https://gitea.brrd.tech/rob/artifact-editor.git",
+    "sounddevice>=0.4.6",
+    "numpy>=1.24.0",
 ]
 dev = [
     "pytest>=7.0",


@ -42,6 +42,203 @@ except ImportError:
    get_format_info = None
class ContinuousRecorder:
    """
    Continuous audio recorder with chunked transcription support.

    Records audio continuously and extracts overlapping chunks for transcription.
    This prevents word cutoff at chunk boundaries.
    """

    def __init__(
        self,
        chunk_duration: float = 10.0,
        overlap_duration: float = 0.5,
        sample_rate: int = 16000,
        on_chunk_ready: Callable[[bytes], None] = None,
        on_error: Callable[[str], None] = None
    ):
        """
        Initialize the continuous recorder.

        Args:
            chunk_duration: Duration of each chunk in seconds
            overlap_duration: Overlap between chunks in seconds
            sample_rate: Audio sample rate in Hz
            on_chunk_ready: Callback when a chunk is ready (receives WAV bytes)
            on_error: Callback when an error occurs
        """
        self.chunk_duration = chunk_duration
        self.overlap_duration = overlap_duration
        self.sample_rate = sample_rate
        self.on_chunk_ready = on_chunk_ready
        self.on_error = on_error
        self._recording = False
        self._stream = None
        self._buffer = []
        self._buffer_lock = threading.Lock()
        self._chunk_thread = None
        self._stop_event = threading.Event()
        # Calculate samples
        self._chunk_samples = int(chunk_duration * sample_rate)
        self._overlap_samples = int(overlap_duration * sample_rate)

    def start(self):
        """Start continuous recording."""
        if self._recording:
            return
        try:
            import sounddevice as sd
            import numpy as np
        except ImportError as e:
            if self.on_error:
                self.on_error(f"sounddevice not installed: {e}")
            return
        self._recording = True
        self._stop_event.clear()
        self._buffer = []

        def audio_callback(indata, frames, time_info, status):
            """Called by sounddevice for each audio block."""
            if status and self.on_error:
                self.on_error(f"Audio status: {status}")
            with self._buffer_lock:
                self._buffer.append(indata.copy())

        try:
            self._stream = sd.InputStream(
                samplerate=self.sample_rate,
                channels=1,
                dtype='int16',
                callback=audio_callback,
                blocksize=1024
            )
            self._stream.start()
            # Start chunk extraction thread
            self._chunk_thread = threading.Thread(target=self._chunk_loop, daemon=True)
            self._chunk_thread.start()
        except Exception as e:
            self._recording = False
            if self.on_error:
                self.on_error(f"Failed to start recording: {e}")

    def stop(self) -> bytes:
        """
        Stop recording and return any remaining audio as WAV bytes.

        Returns:
            WAV bytes of remaining audio, or empty bytes if none
        """
        if not self._recording:
            return b''
        self._recording = False
        self._stop_event.set()
        # Stop the stream
        if self._stream:
            try:
                self._stream.stop()
                self._stream.close()
            except Exception:
                pass
            self._stream = None
        # Wait for chunk thread to finish
        if self._chunk_thread:
            self._chunk_thread.join(timeout=2)
            self._chunk_thread = None
        # Get remaining audio
        remaining_wav = self._extract_remaining()
        return remaining_wav

    def _chunk_loop(self):
        """Background thread that extracts and processes chunks."""
        import numpy as np
        # We need chunk_duration + overlap of audio before we can extract a chunk
        min_samples_needed = self._chunk_samples + self._overlap_samples
        while not self._stop_event.is_set():
            # Wait a bit before checking
            self._stop_event.wait(timeout=1.0)
            if self._stop_event.is_set():
                break
            # Check if we have enough audio for a chunk
            with self._buffer_lock:
                if not self._buffer:
                    continue
                # Concatenate buffer
                try:
                    audio_data = np.concatenate(self._buffer)
                except ValueError:
                    continue
                total_samples = len(audio_data)
                # Need enough samples for a full chunk plus overlap
                if total_samples < min_samples_needed:
                    continue
                # Extract chunk (the first chunk_samples + overlap_samples)
                chunk_data = audio_data[:min_samples_needed]
                # Keep the overlap portion for next chunk (prevents word cutoff)
                # We keep from (chunk_samples) onwards, so next chunk starts with overlap
                keep_from = self._chunk_samples
                remaining = audio_data[keep_from:]
                self._buffer = [remaining] if len(remaining) > 0 else []
            # Convert to WAV and send to callback
            if self.on_chunk_ready:
                wav_bytes = self._to_wav(chunk_data)
                self.on_chunk_ready(wav_bytes)

    def _extract_remaining(self) -> bytes:
        """Extract any remaining audio from the buffer as WAV bytes."""
        import numpy as np
        with self._buffer_lock:
            if not self._buffer:
                return b''
            try:
                audio_data = np.concatenate(self._buffer)
            except ValueError:
                return b''
            self._buffer = []
        if len(audio_data) < self.sample_rate:  # Less than 1 second
            return b''
        return self._to_wav(audio_data)

    def _to_wav(self, audio_data) -> bytes:
        """Convert numpy audio data to WAV bytes."""
        import io
        import wave
        buffer = io.BytesIO()
        with wave.open(buffer, 'wb') as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # 16-bit
            wf.setframerate(self.sample_rate)
            wf.writeframes(audio_data.tobytes())
        return buffer.getvalue()
class DiscussionGUI:
    """
    Main GUI application using Dear PyGui.
@ -214,142 +411,127 @@ class DiscussionGUI:
     def _dictate_clicked(self):
         """Handle dictate button click - toggle recording."""
-        if self._dictation_process is not None:
-            # Currently recording - stop and transcribe
+        if self._continuous_recorder is not None:
+            # Currently recording - stop
             self._stop_dictation()
         else:
             # Start recording
             self._start_dictation()

     def _start_dictation(self):
-        """Start recording audio from microphone."""
-        import tempfile
-        import signal
-        # Create temp file for audio
-        self._dictation_audio_file = tempfile.mktemp(suffix=".wav")
-        # Start arecord without duration limit (records until interrupted)
-        try:
-            self._dictation_process = subprocess.Popen(
-                [
-                    "arecord",
-                    "-D", "default",
-                    "-f", "S16_LE",
-                    "-r", "16000",
-                    "-c", "1",
-                    "-t", "wav",
-                    self._dictation_audio_file
-                ],
-                stdout=subprocess.PIPE,
-                stderr=subprocess.PIPE
-            )
-            # Update button to show recording state
-            if dpg.does_item_exist("dictate_btn"):
-                dpg.set_item_label("dictate_btn", "Stop Recording")
-        except FileNotFoundError:
-            self._show_error("arecord not found. Install alsa-utils: sudo apt install alsa-utils")
-            self._dictation_process = None
-            self._dictation_audio_file = None
-        except Exception as e:
-            self._show_error(f"Failed to start recording: {e}")
-            self._dictation_process = None
-            self._dictation_audio_file = None
+        """Start continuous recording with chunked transcription."""
+        # Create recorder with callbacks
+        self._continuous_recorder = ContinuousRecorder(
+            chunk_duration=10.0,  # 10 second chunks
+            overlap_duration=0.5,  # 0.5 second overlap to prevent word cutoff
+            sample_rate=16000,
+            on_chunk_ready=self._on_audio_chunk_ready,
+            on_error=lambda e: self._show_error(f"Recording error: {e}")
+        )
+        self._continuous_recorder.start()
+        # Update button to show recording state
+        if dpg.does_item_exist("dictate_btn"):
+            dpg.set_item_label("dictate_btn", "● Recording...")

     def _stop_dictation(self):
-        """Stop recording and transcribe the audio."""
-        import signal
-        if self._dictation_process is None:
+        """Stop recording and transcribe any remaining audio."""
+        if self._continuous_recorder is None:
             return
-        # Send SIGINT to stop arecord gracefully
-        try:
-            self._dictation_process.send_signal(signal.SIGINT)
-            self._dictation_process.wait(timeout=2)
-        except Exception:
-            # Force kill if needed
-            try:
-                self._dictation_process.kill()
-                self._dictation_process.wait(timeout=1)
-            except Exception:
-                pass
-        self._dictation_process = None
-        # Update button back to Dictate
+        # Update button
         if dpg.does_item_exist("dictate_btn"):
-            dpg.set_item_label("dictate_btn", "Transcribing...")
+            dpg.set_item_label("dictate_btn", "Finishing...")
+        # Stop recording and get remaining audio
+        remaining_wav = self._continuous_recorder.stop()
+        self._continuous_recorder = None
+        # Transcribe remaining audio if any
+        if remaining_wav:
+            self._transcribe_chunk(remaining_wav, is_final=True)
+        else:
+            # No remaining audio, just reset button
+            if dpg.does_item_exist("dictate_btn"):
+                dpg.set_item_label("dictate_btn", "Dictate")

+    def _on_audio_chunk_ready(self, wav_bytes: bytes):
+        """Called when a chunk of audio is ready for transcription."""
         # Transcribe in background thread
-        audio_file = self._dictation_audio_file
-        self._dictation_audio_file = None
-        def transcribe():
+        self._transcribe_chunk(wav_bytes, is_final=False)

+    def _transcribe_chunk(self, wav_bytes: bytes, is_final: bool = False):
+        """Transcribe a chunk of audio and append to comment text."""
+        def do_transcribe():
             transcript = ""
+            temp_file = None
             try:
-                if not audio_file:
-                    transcript = "[No audio file path]"
-                elif not os.path.exists(audio_file):
-                    transcript = f"[Audio file not found: {audio_file}]"
-                else:
-                    # Check file size to ensure recording happened
-                    file_size = os.path.getsize(audio_file)
-                    if file_size < 1000:  # Less than 1KB likely means no audio
-                        transcript = "[Recording too short or empty]"
-                    else:
-                        transcribe_cmd = os.path.expanduser("~/.local/bin/transcribe")
-                        if not os.path.exists(transcribe_cmd):
-                            transcript = "[transcribe tool not found at ~/.local/bin/transcribe]"
-                        else:
-                            result = subprocess.run(
-                                [transcribe_cmd],
-                                input=audio_file,
-                                capture_output=True,
-                                text=True,
-                                timeout=60
-                            )
-                            transcript = result.stdout.strip()
-                            if result.returncode != 0 and result.stderr:
-                                transcript = f"[Transcribe error: {result.stderr.strip()}]"
-                            elif not transcript:
-                                transcript = "[No speech detected]"
-                # Clean up audio file
+                # Write WAV bytes to temp file
+                temp_file = tempfile.mktemp(suffix=".wav")
+                with open(temp_file, 'wb') as f:
+                    f.write(wav_bytes)
+                # Check file size
+                file_size = len(wav_bytes)
+                if file_size < 1000:
+                    transcript = ""  # Too short, skip silently
+                else:
+                    transcribe_cmd = os.path.expanduser("~/.local/bin/transcribe")
+                    if not os.path.exists(transcribe_cmd):
+                        transcript = "[transcribe tool not found]"
+                    else:
+                        result = subprocess.run(
+                            [transcribe_cmd],
+                            input=temp_file,
+                            capture_output=True,
+                            text=True,
+                            timeout=60
+                        )
+                        transcript = result.stdout.strip()
+                        if result.returncode != 0 and result.stderr:
+                            transcript = f"[Error: {result.stderr.strip()[:100]}]"
+            except subprocess.TimeoutExpired:
+                transcript = "[Timeout]"
+            except Exception as e:
+                transcript = f"[Error: {e}]"
+            finally:
+                # Clean up temp file
+                if temp_file and os.path.exists(temp_file):
                     try:
-                        os.remove(audio_file)
+                        os.remove(temp_file)
                     except Exception:
                         pass
-            except subprocess.TimeoutExpired:
-                transcript = "[Transcription timed out after 60 seconds]"
-            except Exception as e:
-                transcript = f"[Transcription error: {e}]"
-            # Update UI in main thread
-            self._on_transcription_complete(transcript)
-        thread = threading.Thread(target=transcribe, daemon=True)
+            # Append result to UI
+            self._on_transcription_chunk_complete(transcript, is_final)
+        thread = threading.Thread(target=do_transcribe, daemon=True)
         thread.start()

-    def _on_transcription_complete(self, transcript: str):
-        """Called when transcription completes - append to comment text."""
-        # Reset button
-        if dpg.does_item_exist("dictate_btn"):
-            dpg.set_item_label("dictate_btn", "Dictate")
-        if not transcript:
-            self._show_error("Transcription returned empty result")
-        elif transcript.startswith("["):
-            # Error message from transcribe tool
-            self._show_error(f"Transcription failed: {transcript}")
-        else:
-            # Append to existing comment text
-            if dpg.does_item_exist("comment_text"):
-                current = dpg.get_value("comment_text") or ""
-                separator = " " if current.strip() else ""
-                dpg.set_value("comment_text", current + separator + transcript)
+    def _on_transcription_chunk_complete(self, transcript: str, is_final: bool):
+        """Called when a chunk transcription completes - append to comment text."""
+        # Handle final chunk - reset button
+        if is_final:
+            if dpg.does_item_exist("dictate_btn"):
+                dpg.set_item_label("dictate_btn", "Dictate")
+        # Skip empty or error transcripts (for non-final chunks, don't show errors)
+        if not transcript:
+            return
+        if transcript.startswith("[") and not is_final:
+            # Only show errors for final chunk
+            return
+        if transcript.startswith("["):
+            self._show_error(f"Transcription: {transcript}")
+            return
+        # Append to existing comment text
+        if dpg.does_item_exist("comment_text"):
+            current = dpg.get_value("comment_text") or ""
+            separator = " " if current.strip() else ""
+            dpg.set_value("comment_text", current + separator + transcript)
    def __init__(self, discussions_dir: Path = None):
        if discussions_dir is None:
@ -369,9 +551,10 @@ class DiscussionGUI:
         self._reading_button_tag: Optional[str] = None
         self._comment_counter = 0  # For generating unique button tags

-        # Dictation state
-        self._dictation_process: Optional[subprocess.Popen] = None
-        self._dictation_audio_file: Optional[str] = None
+        # Dictation state - continuous recording with chunked transcription
+        self._continuous_recorder: Optional[ContinuousRecorder] = None
+        self._transcription_queue = []  # Queue of transcription results
+        self._transcription_lock = threading.Lock()

         # Initialize Dear PyGui
         dpg.create_context()