orchestrated-discussions/examples/brainstorm_sketch_test.disc...

<!-- DISCUSSION -->
<!-- Title: Notification System Brainstorm -->
<!-- Phase: sketch -->
<!-- Status: OPEN -->
<!-- Template: brainstorm -->
<!-- Created: 2025-12-10T10:00:00Z -->
<!-- Participants: architect, security, pragmatist, moderator -->

# Notification System Brainstorm

## Problem Statement

Our application needs a way to notify users about important events (new messages, status changes, alerts). Currently we have no notification system at all.

## Known Constraints

- Small team (2 developers)
- Need something working within 2 weeks
- Must work on web and mobile
- Budget: minimal (prefer open source)
- Current stack: Python backend, React frontend

## What Success Looks Like

- Users get timely notifications about events they care about
- Users can control what notifications they receive
- System is reliable enough that important notifications aren't lost

## Agreed Scope (from cluster phase)

- MVP: Web notifications only (browser), mobile comes later
- Start with 3 event types: new message, status change, system alert
- "Timely" means within 30 seconds
- Simple on/off preferences per event type
- Target: 1000 users initially

---

*We're now in the sketch phase - create rough diagrams to illustrate the approach.*
---

Name: AI-Architect
Parse error
VOTE: NONE

---

Name: AI-Security
## Security Analysis - Notification System

From a security perspective, a notification system introduces several critical attack surfaces that need early consideration:

### Key Security Components

1. **Authentication & Authorization Layer**
   - Who can send notifications? (internal services only? authenticated users?)
   - Who can receive which notifications? (proper user segmentation)
   - API authentication between backend and notification service

2. **Input Validation & Sanitization**
   - Notification content must be sanitized to prevent XSS in browser notifications
   - Event type validation to prevent unauthorized notification types
   - Rate limiting per user/service to prevent notification spam/DoS

3. **Data Protection**
   - Notification content may contain sensitive data (PII, business data)
   - Storage considerations if implementing delivery guarantees
   - Encryption in transit (WSS for WebSocket, HTTPS for polling)

4. **Session Management**
   - WebSocket connection authentication (token-based preferred)
   - Session hijacking prevention
   - Proper cleanup of stale connections

### Threat Model

DIAGRAM: diagrams/notification-system-brainstorm_security_1.puml

**Primary Threats:**
- **XSS via notifications**: Malicious user injects script into notification content → executes in victim's browser
- **Notification spam/DoS**: Attacker floods user with notifications, exhausting resources
- **Authorization bypass**: User receives notifications intended for another user
- **MitM attacks**: Unencrypted WebSocket connections expose notification content
- **Token theft**: Stolen auth tokens allow impersonation for WebSocket connections

### Critical Security Requirements

1. **Mandatory**: Sanitize ALL notification content server-side before sending
2. **Mandatory**: Authenticate WebSocket connections with short-lived tokens
3. **Mandatory**: Enforce per-user/per-service rate limits (suggest: 100/min per user)
4. **Recommended**: Use WSS (WebSocket Secure) protocol only
5. **Recommended**: Implement notification content ACLs (user can only receive their own data)

CONCERN: The browser Notification API requires user permission, but malicious notifications could still be used for phishing attacks ("Your account will be suspended - click here")

CONCERN: If using WebSockets, we need a strategy for connection state management - stale connections are a DoS vector

QUESTION: Will notifications contain sensitive data, or just "You have 1 new message" style summaries? This affects our encryption and storage requirements.

QUESTION: Will there be a notification history/persistence? This significantly expands the attack surface.

### Recommendations for MVP

- Use established libraries (e.g., Socket.IO with built-in security features) rather than raw WebSockets
- Implement rate limiting from day 1 (much harder to add later)
- Sanitize on server side using a whitelist approach (only allow plain text for MVP)
- Use short-lived JWT tokens for WebSocket auth (5-15 min expiry)
- Log all notification sends for security auditing

VOTE: CHANGES - Need to explicitly address XSS prevention and rate limiting before implementation

**Diagram:** `diagrams/notification-system-brainstorm_security_1.puml`
VOTE: CHANGES

---

Name: AI-Pragmatist
Looking at the agreed scope, this is actually well-scoped already. Let me sketch the simplest thing that could work.

MVP Architecture (Week 1-2):

1. **Event Generation** (Python backend)
   - When event happens → insert row in `notifications` table
   - Fields: user_id, event_type, message, created_at, read

2. **Delivery** (polling - yes, polling)
   - Frontend polls `/api/notifications/unread` every 30s
   - Returns new notifications since last check
   - Browser Notification API shows popup

3. **Preferences** (simple)
   - `user_preferences` table: user_id, event_type, enabled
   - Check before inserting notification

That's it. No WebSockets, no message queue, no separate notification service.
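
A minimal sketch of the three pieces above, assuming SQLite as a stand-in for the real database. The column types, the default-to-enabled rule, and the `notify` and `unread` helper names are illustrative assumptions, not an agreed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notifications (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    event_type TEXT NOT NULL,
    message TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    read INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE user_preferences (
    user_id INTEGER NOT NULL,
    event_type TEXT NOT NULL,
    enabled INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, event_type)
);
""")


def notify(user_id, event_type, message):
    """Insert a notification unless the user switched this event type off."""
    row = conn.execute(
        "SELECT enabled FROM user_preferences WHERE user_id = ? AND event_type = ?",
        (user_id, event_type),
    ).fetchone()
    # Assumption: a missing preference row means the event type is enabled.
    if row is not None and not row[0]:
        return False
    conn.execute(
        "INSERT INTO notifications (user_id, event_type, message) VALUES (?, ?, ?)",
        (user_id, event_type, message),
    )
    return True


def unread(user_id):
    """Return this user's unread rows (as the polling endpoint would), then mark them read."""
    rows = conn.execute(
        "SELECT id, event_type, message FROM notifications "
        "WHERE user_id = ? AND read = 0 ORDER BY id",
        (user_id,),
    ).fetchall()
    conn.executemany(
        "UPDATE notifications SET read = 1 WHERE id = ?", [(r[0],) for r in rows]
    )
    return rows
```

Marking rows as read inside `unread` is one possible reading of "returns new notifications since last check"; a timestamp cursor sent by the client would work as well.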

**Why polling?**
- Works everywhere (no WebSocket infrastructure)
- Handles 1000 users easily (30s interval = ~33 req/s)
- Dead simple to implement and debug
- Can upgrade later if needed
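
The load figure above is the user count divided by the polling interval, assuming every user has one client polling and requests are spread evenly over time. A quick check (the function name is illustrative):

```python
def polls_per_second(users: int, interval_seconds: float) -> float:
    """Average request rate if each user polls once per interval."""
    return users / interval_seconds


print(round(polls_per_second(1000, 30), 1))  # 33.3
```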

**What we're NOT doing (yet):**
- Push notifications (mobile)
- Real-time delivery (<30s)
- Notification history/pagination
- Batching/grouping
- Email/SMS delivery

See diagram for the flow. We can build this in a week, then iterate based on real usage.

READY to sketch - this is simple enough to ship quickly.

**Diagram:** `diagrams/notification-system-brainstorm_mvp_1.puml`
VOTE: READY

---

Name: AI-Moderator
## Discussion Status

We've just entered the **sketch phase** following a successful clustering phase that defined our MVP scope:
- Web notifications only (browser)
- 3 event types: new message, status change, system alert
- 30-second delivery target
- Simple on/off preferences
- 1000 initial users

## Next Steps

We need rough diagrams to illustrate the technical approach. Specifically:
- Component diagrams showing key pieces (3-5 components)
- Sequence diagrams for notification flow
- Mark any unknowns with "?"
- Keep it rough - we're exploring, not finalizing

## Waiting For

@architect - Please start us off with a high-level component diagram and sequence flow
@security - Review the approach for security considerations
@pragmatist - Assess feasibility for 2-week timeline with 2 developers

All participants should focus on creating simple diagrams and identifying the key architectural pieces.

---

Name: AI-Architect
Parse error
VOTE: NONE

---

Name: AI-Security
Parse error
VOTE: NONE

---

Name: AI-Pragmatist
## Pragmatic Reality Check

Looking at what's been proposed:

**Security's approach**: 5 critical requirements, threat modeling, ACLs, JWT rotation, audit logging, rate limiting "from day 1".

**Reality**: We have 2 developers and 2 weeks. That security spec is a month of work.

CONCERN: COMPLEXITY - Security is architecting for enterprise scale when we need to ship for 1000 users

## What We Actually Need

For 1000 users receiving basic notifications:

1. **Input sanitization**: Use Python's `bleach` library (2 lines of code)
2. **Rate limiting**: Flask-Limiter with `@limiter.limit("100/minute")` decorator (1 line per endpoint)
3. **Auth**: Reuse existing session auth (0 new code)

That's it. That's the security MVP.

**What we're NOT doing yet:**
- WebSocket infrastructure (we're polling, remember?)
- Notification persistence (just show & dismiss)
- Separate auth tokens (use existing sessions)
- Audit logging (add when someone asks for it)

## Simplified Flow

DIAGRAM: diagrams/notification-system-brainstorm_mvp_3.puml

This diagram shows polling vs the WebSocket complexity Security is implying. Polling means:
- No connection state management
- No token refresh logic
- No WebSocket security surface
- Standard HTTPS/session auth

## Week 1-2 Implementation

**Week 1**: Core flow
- Day 1-2: DB table + insert on events
- Day 3-4: `/api/notifications/unread` endpoint + preferences
- Day 5: Frontend polling + Browser Notification API

**Week 2**: Polish
- Add `bleach` sanitization
- Add Flask-Limiter (100/min)
- Test with team
- Ship to 50 beta users

Then we iterate based on actual problems, not theoretical ones.

QUESTION: @security - Can we agree on sanitization + rate limiting as the security MVP, and defer WebSocket security since we're using polling?

VOTE: READY - This is shippable in 2 weeks with 2 devs

DIAGRAM: diagrams/notification-system-brainstorm_mvp_3.puml
VOTE: READY

---

Name: AI-Moderator
## Sketch Phase Summary

We have two architectural approaches with diagrams:

**AI-Pragmatist's Polling Approach:**
- Simple polling every 30s to `/api/notifications/unread`
- `notifications` and `user_preferences` tables in existing DB
- Browser Notification API for popups
- No WebSockets, no message queue
- **Vote: READY** - considers this shippable in 1 week

**AI-Security's Analysis:**
- Identified 5 critical attack surfaces (XSS, DoS, auth bypass, MitM, token theft)
- Provided threat model diagram
- Recommends WebSocket approach with Socket.IO + security measures
- **Vote: CHANGES** - requires explicit XSS prevention and rate limiting

**AI-Architect:**
- Parse error on response - **needs to re-respond**

## Key Tension

There's an architectural mismatch:
- Pragmatist proposes simple polling (30s intervals, DB-based)
- Security's recommendations assume WebSockets (tokens, connection management)

Both are valid for different approaches, but we need to align on which path before advancing.

## Open Questions

1. **Data Sensitivity** (from Security): Will notifications contain sensitive data or just summaries?
2. **Persistence** (from Security): Will there be notification history? This affects security surface.
3. **XSS Prevention**: How will we sanitize notification content? (Security requires this to be explicit)
4. **Rate Limiting**: What limits per user/service? (Security requires this from day 1)

## Phase Advancement Assessment

We have diagrams (✓) but:
- Missing architect's main flow diagram (parse error)
- Need to resolve polling vs WebSocket approach
- Security concerns not yet addressed in chosen design

Per phase instructions, we should have 3-5 key pieces identified with rough diagrams. We're close but not ready.

## Recommendations

1. @architect - Please re-submit your response (there was a parse error). We need your component diagram and sequence flow.
2. @security - After architect responds, please review the *specific* approach chosen and provide targeted security requirements for that architecture (polling vs WebSocket)
3. @pragmatist - Once security requirements are clear, confirm the 2-week timeline still holds with security measures included

**Next phase readiness:** Not yet - need architect's diagram and alignment on technical approach before advancing to reality_check phase.

---

Name: AI-Architect
## Architectural Sketch - Component & Flow Analysis

Let me provide the missing architectural perspective to resolve the polling vs WebSocket tension.

### High-Level Component View

ARTIFACT: diagrams/notification-system-brainstorm_architect_6.puml

The diagram shows both approaches side-by-side to clarify the architectural tradeoffs.

### Architectural Analysis

**Key Components (5 pieces):**

1. **Event Source** - Existing backend services that trigger notifications
2. **Notification Store** - Database table (`notifications`) - persistence layer
3. **Notification API** - REST endpoint(s) for delivery
4. **Client Poller** - Frontend polling mechanism (30s interval)
5. **Browser Notification API** - Native browser popup system

### Why Polling is Architecturally Sound for MVP

**Alignment with constraints:**
- Small team (2 devs) → Simpler mental model, fewer moving parts
- 2-week timeline → No new infrastructure (WebSocket servers, connection pools)
- 1000 users → 30s polling = ~33 req/s = trivial load
- Existing stack → Reuses current HTTP/session infrastructure

**Architectural advantages:**
- **Stateless** - No connection state management, no reconnection logic
- **Debuggable** - Standard HTTP requests in browser DevTools
- **Scalable horizontally** - Any app server can handle any poll request
- **Fail-safe** - Client automatically recovers by next poll
- **Cache-friendly** - Can use HTTP caching headers if needed

**When to evolve:**
- User count > 10,000 (polling becomes inefficient)
- Latency requirement < 10s (30s polling insufficient)
- Battery concerns on mobile (constant polling drains battery)
- Rich interactions (typing indicators, presence - need real-time)

### Security Integration

@security's concerns are valid but apply differently to polling:

**Required (Day 1):**
1. **XSS Prevention** - Sanitize notification content server-side using `bleach` (Security's point #1)
2. **Rate Limiting** - Flask-Limiter on `/api/notifications/unread` endpoint (Security's point #3)
3. **Authorization** - Existing session auth ensures users only see their notifications (Security's point #2)

**Not Applicable (Polling Architecture):**
- WebSocket token management (we're not using WebSockets)
- Connection hijacking prevention (stateless HTTP)
- Stale connection cleanup (no persistent connections)

**Deferred (Post-MVP):**
- Notification history/persistence beyond current session
- Audit logging (add when compliance requires it)
- Content encryption at rest (notifications are ephemeral for MVP)

### Answers to Open Questions

**Q: Will notifications contain sensitive data?**
A: For MVP, use summaries only ("You have 1 new message") - keeps security surface minimal. Full content requires user to click through to authenticated page.

**Q: Will there be notification history?**
A: Not for MVP - notifications shown once, then dismissed. This avoids storage security concerns. Add persistence in v2 if users request it.

### Unknown/Risks (Marked with "?")

1. **Browser support** - Browser Notification API permission model varies. Need to test on target browsers.
2. **Notification volume** - If users get >10 notifications/min, UX degrades. May need batching logic (not in scope yet).
3. **Backend event integration** - How do existing services trigger notifications? Need hook points identified (implementation detail).

### Recommendation

**VOTE: READY** to proceed with polling architecture + security essentials (sanitization + rate limiting).

This is a classic example of "boring technology" - proven, simple, maintainable. We can ship in 2 weeks, learn from real usage, and evolve to WebSockets later *only if* metrics show we need it.

**Phase advancement criteria met:**
- ✓ 3-5 key components identified
- ✓ Rough diagrams created (component + sequence)
- ✓ Unknowns marked with "?"
- ✓ Architectural tradeoffs explicit

Ready for reality_check phase to validate implementation details.
VOTE: READY

---

Name: AI-Security
Parse error
VOTE: NONE

---

Name: AI-Pragmatist
## Moderator's Point is Critical

We have a fundamental mismatch: I proposed polling, Security analyzed WebSockets. Let me clarify the polling approach with explicit security measures that address Security's concerns.

## Polling Architecture with Security Built-In

**Core Components (4 pieces):**

1. **Event Publisher** (Python backend)
   - Sanitizes content with `bleach.clean(text, tags=[], strip=True)` - no HTML allowed
   - Inserts to `notifications` table only if user preference enabled
   - Rate limit: max 100 notifications per user per minute (app-level check)

2. **Notifications Table** (PostgreSQL)
   - Columns: id, user_id, event_type, message_text (sanitized), created_at, read_at
   - Index on (user_id, read_at) for fast unread queries
   - No sensitive data - just event summaries ("You have a new message")

3. **REST API Endpoint** (Flask)
   - `GET /api/notifications/unread` - returns JSON array
   - Uses existing session authentication (no new auth)
   - Flask-Limiter: `@limiter.limit("120/minute")` on endpoint
   - SQL with user_id filter prevents authorization bypass

4. **Frontend Poller** (React)
   - Polls every 30s with `setInterval`
   - Browser Notification API (requires user permission)
   - Text-only display (no HTML rendering) - double defense against XSS

## Security Measures Explicitly Addressed

**XSS Prevention:**
- Server-side: `bleach.clean()` strips all HTML/JS
- Client-side: Display as plain text only in Browser Notification API
- **Double defense**: Even if sanitization fails, no rendering attack surface
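
To make the "plain text only" rule concrete, here is a dependency-free sketch. It is not `bleach`: it uses the standard library's `html.parser` as a stand-in so the idea can be tried without installing anything, and the `to_plain_text` name is hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(raw: str) -> str:
    """Strip markup, then escape what is left so it is inert if ever rendered as HTML."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    return escape("".join(parser.parts))


print(to_plain_text("<b>Hi</b> there"))  # Hi there
```

Because the result is HTML-escaped, a text-only surface would show entities such as `&amp;` literally; whether to escape at storage time or at render time is a design choice this sketch leaves open.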

**Rate Limiting:**
- Application layer: 100 notifications/user/minute when creating
- API layer: 120 requests/minute per user on endpoint (Flask-Limiter)
- Database constraint: prevent spam at source
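
The application-layer limit (100 notifications per user per minute) could look roughly like the sliding-window check below. This is an in-process sketch with a hypothetical class name, not Flask-Limiter; it keeps state in memory, so it would only be accurate with a single app process.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window` seconds."""

    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so tests can control time
        self._events = defaultdict(deque)

    def allow(self, key) -> bool:
        now = self.clock()
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


creation_limiter = SlidingWindowLimiter(limit=100, window=60.0)
```

The event publisher would call `creation_limiter.allow(user_id)` before inserting a row and skip (or queue) the notification when it returns `False`.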

**Authorization:**
- SQL query: `WHERE user_id = session.user_id`
- Cannot query other users' notifications (DB-level enforcement)

**Data Protection:**
- No sensitive data in notifications (answer to Security's question)
- No persistence beyond 30 days (auto-delete old notifications)
- HTTPS enforced (existing infrastructure)
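
The 30-day auto-delete could be a small scheduled job. The sketch below assumes `created_at` is stored as an ISO-8601 UTC string (so string comparison matches chronological order) and uses SQLite for illustration; `purge_old` is a hypothetical helper name.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)


def purge_old(conn: sqlite3.Connection, now: datetime) -> int:
    """Delete notifications older than the retention window; return rows removed."""
    cutoff = (now - RETENTION).isoformat()
    cur = conn.execute("DELETE FROM notifications WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Passing `now` in explicitly (for example `datetime.now(timezone.utc)`) keeps the job easy to test.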

**DoS Prevention:**
- Polling interval (30s) self-regulates load
- Rate limits prevent notification spam
- No persistent connections to exhaust

DIAGRAM: diagrams/notification-system-brainstorm_mvp_6.puml

## Implementation Timeline

**Week 1:**
- Day 1: DB migration + sanitization helper
- Day 2-3: Event publisher integration (3 event types)
- Day 4: REST endpoint + Flask-Limiter
- Day 5: Frontend poller + Browser API

**Week 2:**
- Day 1-2: User preferences UI
- Day 3: Security testing (XSS attempts, rate limit verification)
- Day 4: Performance testing (1000 concurrent users)
- Day 5: Ship to 50 beta users

**Total:** 2 weeks with 2 devs (1 week core build + 1 week preferences, testing, and beta rollout)

## What We're Deferring

- WebSocket infrastructure (not needed for 30s latency)
- Notification history UI (just keep in DB for now)
- Mobile push notifications (scoped out already)
- Email/SMS fallback (not in requirements)
- Advanced batching/grouping (not needed at 1000 users)

## Response to Security's Questions

1. **Sensitive data?** No - summaries only ("You have 1 new message")
2. **History/persistence?** DB storage yes, UI access no (defer to v2)
3. **Rate limiting?** 100/min creation + 120/min API calls
4. **Sanitization?** bleach.clean() server-side + text-only display

VOTE: READY - This addresses all mandatory security requirements while staying shippable in 2 weeks

DIAGRAM: diagrams/notification-system-brainstorm_mvp_6.puml
VOTE: READY

---

Name: AI-Moderator
## Sketch Phase Status

We've made significant progress with two concrete architectural proposals and diagrams:

### What's Been Discussed

**AI-Pragmatist's Polling Architecture:**
- Simple 30-second polling to `/api/notifications/unread` endpoint
- Two DB tables: `notifications` and `user_preferences`
- Browser Notification API for display
- Implementation timeline: 1-2 weeks with clear breakdown
- **Diagrams:** `diagrams/notification-system-brainstorm_mvp_1.puml`, `diagrams/notification-system-brainstorm_mvp_3.puml`

**AI-Security's Security Analysis:**
- Comprehensive threat model with 5 attack surfaces identified
- Critical requirements: XSS prevention, rate limiting, authentication, encryption
- **Diagram:** `diagrams/notification-system-brainstorm_security_1.puml`
- Recommends Socket.IO with security hardening

**AI-Architect:**
- Two parse errors - **responses not recorded**

### Architectural Tension

There's a productive tension between approaches:
- **Pragmatist:** Minimal polling-based approach, can ship in 1 week
- **Security:** More robust requirements that assume WebSocket infrastructure
- **Pragmatist's counter:** Simplified security MVP (sanitization + rate limiting) for polling approach

### Open Questions

1. **Data sensitivity:** Will notifications contain sensitive data or just summaries? (affects encryption/storage)
2. **Persistence:** Will there be notification history? (expands attack surface)
3. **Security MVP agreement:** Can we proceed with sanitization + rate limiting for polling, defer WebSocket security? (@security)

### Concerns Raised

1. Browser notification phishing potential (even with permissions)
2. WebSocket connection state management complexity
3. **COMPLEXITY:** Risk of over-engineering for 1000-user MVP

## Phase Advancement Assessment

**Per sketch phase instructions:**
- ✓ Created simple diagrams (3 diagrams submitted)
- ✓ Identified 3-5 key pieces (polling endpoint, DB tables, sanitization, rate limiting, browser API)
- ✓ Marked unknowns (data sensitivity, persistence needs)
- ✗ AI-Architect's diagrams missing due to parse errors

**However:** The core architectural approach is now clear:
- Polling-based delivery (30s intervals)
- DB persistence for unread notifications
- Sanitization via `bleach` library
- Rate limiting via Flask-Limiter (100/min)
- Browser Notification API for display

The discussion has converged on a pragmatic polling approach with security hardening. @security's final question about the security MVP is the last blocker.

## Recommendation

**ADVANCE TO REALITY_CHECK PHASE:** Yes, with a caveat

We have sufficient architectural clarity to move forward:
- Technical approach is defined (polling-based)
- Security requirements are scoped to the approach (not WebSockets)
- Implementation timeline is concrete (1-2 weeks)
- Key unknowns are identified and can be addressed in reality_check

@architect can contribute in the reality_check phase if the parse issues are resolved.

@security - Please confirm you can accept the simplified security MVP (sanitization + rate limiting) for the polling approach, or state specific blockers.