orchestrated-discussions/examples/brainstorm_sketch_test.disc...

<!-- DISCUSSION -->
<!-- Title: Notification System Brainstorm -->
<!-- Phase: sketch -->
<!-- Status: OPEN -->
<!-- Template: brainstorm -->
<!-- Created: 2025-12-10T10:00:00Z -->
<!-- Participants: architect, security, pragmatist, moderator -->
# Notification System Brainstorm
## Problem Statement
Our application needs a way to notify users about important events (new messages, status changes, alerts). Currently we have no notification system at all.
## Known Constraints
- Small team (2 developers)
- Need something working within 2 weeks
- Must work on web and mobile
- Budget: minimal (prefer open source)
- Current stack: Python backend, React frontend
## What Success Looks Like
- Users get timely notifications about events they care about
- Users can control what notifications they receive
- System is reliable enough that important notifications aren't lost
## Agreed Scope (from cluster phase)
- MVP: Web notifications only (browser), mobile comes later
- Start with 3 event types: new message, status change, system alert
- "Timely" means within 30 seconds
- Simple on/off preferences per event type
- Target: 1000 users initially
---
*We're now in the sketch phase - create rough diagrams to illustrate the approach.*
---
---
Name: AI-Architect
Parse error
VOTE: NONE
---
Name: AI-Security
## Security Analysis - Notification System
From a security perspective, a notification system introduces several critical attack surfaces that need early consideration:
### Key Security Components
1. **Authentication & Authorization Layer**
- Who can send notifications? (internal services only? authenticated users?)
- Who can receive which notifications? (proper user segmentation)
- API authentication between backend and notification service
2. **Input Validation & Sanitization**
- Notification content must be sanitized to prevent XSS in browser notifications
- Event type validation to prevent unauthorized notification types
- Rate limiting per user/service to prevent notification spam/DoS
3. **Data Protection**
- Notification content may contain sensitive data (PII, business data)
- Storage considerations if implementing delivery guarantees
- Encryption in transit (WSS for WebSocket, HTTPS for polling)
4. **Session Management**
- WebSocket connection authentication (token-based preferred)
- Session hijacking prevention
- Proper cleanup of stale connections
### Threat Model
DIAGRAM: diagrams/notification-system-brainstorm_security_1.puml
**Primary Threats:**
- **XSS via notifications**: Malicious user injects script into notification content → executes in victim's browser
- **Notification spam/DoS**: Attacker floods user with notifications, exhausting resources
- **Authorization bypass**: User receives notifications intended for another user
- **MitM attacks**: Unencrypted WebSocket connections expose notification content
- **Token theft**: Stolen auth tokens allow impersonation for WebSocket connections
### Critical Security Requirements
1. **Mandatory**: Sanitize ALL notification content server-side before sending
2. **Mandatory**: Authenticate WebSocket connections with short-lived tokens
3. **Mandatory**: Enforce per-user/per-service rate limits (suggest: 100/min per user)
4. **Recommended**: Use WSS (WebSocket Secure) protocol only
5. **Recommended**: Implement notification content ACLs (user can only receive their own data)
CONCERN: Browser notifications API requires user permission, but malicious notifications could still be used for phishing attacks ("Your account will be suspended - click here")
CONCERN: If using WebSockets, need strategy for connection state management - stale connections are a DoS vector
QUESTION: Will notifications contain sensitive data, or just "You have 1 new message" style summaries? This affects our encryption and storage requirements.
QUESTION: Will there be a notification history/persistence? This significantly expands the attack surface.
### Recommendations for MVP
- Use established libraries (e.g., Socket.IO with built-in security features) rather than raw WebSockets
- Implement rate limiting from day 1 (much harder to add later)
- Sanitize on server side using a whitelist approach (only allow plain text for MVP)
- Use short-lived JWT tokens for WebSocket auth (5-15 min expiry)
- Log all notification sends for security auditing
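To illustrate what "short-lived token" means in the recommendation above, here is a minimal sketch of an expiring, HMAC-signed token using only the Python standard library. It is not a JWT and not a replacement for a maintained JWT library; the function names `issue_token` and `verify_token`, the 10-minute default lifetime, and the token layout are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json
import time


def _sign(body: bytes, secret: bytes) -> bytes:
    # HMAC-SHA256 over the encoded payload, base64url-encoded.
    return base64.urlsafe_b64encode(hmac.new(secret, body, hashlib.sha256).digest())


def issue_token(user_id, secret: bytes, ttl_seconds: int = 600, now=None) -> str:
    """Return '<payload>.<signature>' that expires ttl_seconds from now."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": int(now) + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload)
    return (body + b"." + _sign(body, secret)).decode()


def verify_token(token: str, secret: bytes, now=None):
    """Return the user id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body, sig = token.encode().split(b".")
    except ValueError:
        return None  # wrong number of parts
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sig, _sign(body, secret)):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] <= now:
        return None  # expired
    return claims["sub"]
```

A connection handshake would call `verify_token` before accepting the client and reject the connection on `None`.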
VOTE: CHANGES - Need to explicitly address XSS prevention and rate limiting before implementation
**Diagram:** `diagrams/notification-system-brainstorm_security_1.puml`
VOTE: CHANGES
---
Name: AI-Pragmatist
Looking at the agreed scope, this is actually well-scoped already. Let me sketch the simplest thing that could work.
MVP Architecture (Week 1-2):
1. **Event Generation** (Python backend)
- When event happens → insert row in `notifications` table
- Fields: user_id, event_type, message, created_at, read
2. **Delivery** (polling - yes, polling)
- Frontend polls `/api/notifications/unread` every 30s
- Returns new notifications since last check
- Browser Notification API shows popup
3. **Preferences** (simple)
- `user_preferences` table: user_id, event_type, enabled
- Check before inserting notification
That's it. No WebSockets, no message queue, no separate notification service.
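The event-generation and preference steps above could look roughly like this. It is a sketch only: SQLite stands in for whatever database the backend already uses, `create_notification` is a hypothetical helper name, and treating a missing preference row as "enabled" is an assumption the proposal does not spell out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE notifications (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    event_type TEXT NOT NULL,
    message TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    read INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE user_preferences (
    user_id INTEGER NOT NULL,
    event_type TEXT NOT NULL,
    enabled INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, event_type)
);
"""


def create_notification(conn, user_id, event_type, message):
    """Insert a notification unless the user switched this event type off.

    Returns True if a row was inserted.
    Assumption: no preference row means the event type is enabled.
    """
    row = conn.execute(
        "SELECT enabled FROM user_preferences WHERE user_id = ? AND event_type = ?",
        (user_id, event_type),
    ).fetchone()
    if row is not None and not row[0]:
        return False
    conn.execute(
        "INSERT INTO notifications (user_id, event_type, message) VALUES (?, ?, ?)",
        (user_id, event_type, message),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# User 1 has turned off system alerts.
conn.execute("INSERT INTO user_preferences VALUES (1, 'system_alert', 0)")
print(create_notification(conn, 1, "new_message", "You have 1 new message"))
print(create_notification(conn, 1, "system_alert", "Maintenance tonight"))
```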
**Why polling?**
- Works everywhere (no WebSocket infrastructure)
- Handles 1000 users easily (30s interval = ~33 req/s)
- Dead simple to implement and debug
- Can upgrade later if needed
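The load figure above is simple division, shown here as a quick check. It assumes one polling client per user and polls spread evenly over the interval; the helper name is made up for this example.

```python
def poll_requests_per_second(users: int, interval_seconds: float) -> float:
    """Average request rate if every user polls once per interval."""
    return users / interval_seconds


print(round(poll_requests_per_second(1000, 30), 1))    # 33.3
print(round(poll_requests_per_second(10_000, 30), 1))  # 333.3
```

The rate grows linearly with user count, so a tenfold increase in users means a tenfold increase in poll traffic at the same interval.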
**What we're NOT doing (yet):**
- Push notifications (mobile)
- Real-time delivery (<30s)
- Notification history/pagination
- Batching/grouping
- Email/SMS delivery
See diagram for the flow. We can build this in a week, then iterate based on real usage.
READY to sketch - this is simple enough to ship quickly.
**Diagram:** `diagrams/notification-system-brainstorm_mvp_1.puml`
VOTE: READY
---
Name: AI-Moderator
## Discussion Status
We've just entered the **sketch phase** following a successful clustering phase that defined our MVP scope:
- Web notifications only (browser)
- 3 event types: new message, status change, system alert
- 30-second delivery target
- Simple on/off preferences
- 1000 initial users
## Next Steps
We need rough diagrams to illustrate the technical approach. Specifically:
- Component diagrams showing key pieces (3-5 components)
- Sequence diagrams for notification flow
- Mark any unknowns with "?"
- Keep it rough - we're exploring, not finalizing
## Waiting For
@architect - Please start us off with a high-level component diagram and sequence flow
@security - Review the approach for security considerations
@pragmatist - Assess feasibility for 2-week timeline with 2 developers
All participants should focus on creating simple diagrams and identifying the key architectural pieces.
---
Name: AI-Architect
Parse error
VOTE: NONE
---
Name: AI-Security
Parse error
VOTE: NONE
---
Name: AI-Pragmatist
## Pragmatic Reality Check
Looking at what's been proposed:
**Security's approach**: 5 critical requirements, threat modeling, ACLs, JWT rotation, audit logging, rate limiting "from day 1".
**Reality**: We have 2 developers and 2 weeks. That security spec is a month of work.
CONCERN: COMPLEXITY - Security is architecting for enterprise scale when we need to ship for 1000 users
## What We Actually Need
For 1000 users receiving basic notifications:
1. **Input sanitization**: Use Python's `bleach` library (2 lines of code)
2. **Rate limiting**: Flask-Limiter with `@limiter.limit("100/minute")` decorator (1 line per endpoint)
3. **Auth**: Reuse existing session auth (0 new code)
That's it. That's the security MVP.
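For a sense of how small the sanitization step is, here is a standard-library stand-in that reduces a message to plain text. It is not `bleach` and does not behave identically (for example, it drops the contents of `<script>` and `<style>` entirely); `to_plain_text` and the 200-character cap are assumptions for illustration.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes; drops all tags and <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def to_plain_text(raw: str, max_len: int = 200) -> str:
    """Strip markup, collapse whitespace, and cap the length.

    The result is plain text, not HTML-safe markup: it must be displayed
    as text and never inserted into a page as HTML.
    """
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text[:max_len]


print(to_plain_text("<b>Hello</b> <script>alert(1)</script>world"))
```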
**What we're NOT doing yet:**
- WebSocket infrastructure (we're polling, remember?)
- Notification persistence (just show & dismiss)
- Separate auth tokens (use existing sessions)
- Audit logging (add when someone asks for it)
## Simplified Flow
DIAGRAM: diagrams/notification-system-brainstorm_mvp_3.puml
This diagram shows polling vs the WebSocket complexity Security is implying. Polling means:
- No connection state management
- No token refresh logic
- No WebSocket security surface
- Standard HTTPS/session auth
## Week 1-2 Implementation
**Week 1**: Core flow
- Day 1-2: DB table + insert on events
- Day 3-4: `/api/notifications/unread` endpoint + preferences
- Day 5: Frontend polling + Browser Notification API
**Week 2**: Polish
- Add `bleach` sanitization
- Add Flask-Limiter (100/min)
- Test with team
- Ship to 50 beta users
Then we iterate based on actual problems, not theoretical ones.
QUESTION: @security - Can we agree on sanitization + rate limiting as the security MVP, and defer WebSocket security since we're using polling?
VOTE: READY - This is shippable in 2 weeks with 2 devs
DIAGRAM: diagrams/notification-system-brainstorm_mvp_3.puml
VOTE: READY
---
Name: AI-Moderator
## Sketch Phase Summary
We have two architectural approaches with diagrams:
**AI-Pragmatist's Polling Approach:**
- Simple polling every 30s to `/api/notifications/unread`
- `notifications` and `user_preferences` tables in existing DB
- Browser Notification API for popups
- No WebSockets, no message queue
- **Vote: READY** - considers this shippable in 1 week
**AI-Security's Analysis:**
- Identified 5 critical attack surfaces (XSS, DoS, auth bypass, MitM, token theft)
- Provided threat model diagram
- Recommends WebSocket approach with Socket.IO + security measures
- **Vote: CHANGES** - requires explicit XSS prevention and rate limiting
**AI-Architect:**
- Parse error on response - **needs to re-respond**
## Key Tension
There's an architectural mismatch:
- Pragmatist proposes simple polling (30s intervals, DB-based)
- Security's recommendations assume WebSockets (tokens, connection management)
Both are valid for different approaches, but we need to align on which path before advancing.
## Open Questions
1. **Data Sensitivity** (from Security): Will notifications contain sensitive data or just summaries?
2. **Persistence** (from Security): Will there be notification history? This affects security surface.
3. **XSS Prevention**: How will we sanitize notification content? (Security requires this to be explicit)
4. **Rate Limiting**: What limits per user/service? (Security requires from day 1)
## Phase Advancement Assessment
We have diagrams (✓) but:
- Missing architect's main flow diagram (parse error)
- Need to resolve polling vs WebSocket approach
- Security concerns not yet addressed in chosen design
Per phase instructions, we should have 3-5 key pieces identified with rough diagrams. We're close but not ready.
## Recommendations
1. @architect - Please re-submit your response (there was a parse error). We need your component diagram and sequence flow.
2. @security - After architect responds, please review the *specific* approach chosen and provide targeted security requirements for that architecture (polling vs WebSocket)
3. @pragmatist - Once security requirements are clear, confirm the 2-week timeline still holds with security measures included
**Next phase readiness:** Not yet - need architect's diagram and alignment on technical approach before advancing to reality_check phase.
---
Name: AI-Architect
## Architectural Sketch - Component & Flow Analysis
Let me provide the missing architectural perspective to resolve the polling vs WebSocket tension.
### High-Level Component View
ARTIFACT: diagrams/notification-system-brainstorm_architect_6.puml
The diagram shows both approaches side-by-side to clarify the architectural tradeoffs.
### Architectural Analysis
**Key Components (5 pieces):**
1. **Event Source** - Existing backend services that trigger notifications
2. **Notification Store** - Database table (`notifications`) - persistence layer
3. **Notification API** - REST endpoint(s) for delivery
4. **Client Poller** - Frontend polling mechanism (30s interval)
5. **Browser Notification API** - Native browser popup system
### Why Polling is Architecturally Sound for MVP
**Alignment with constraints:**
- Small team (2 devs) → Simpler mental model, fewer moving parts
- 2-week timeline → No new infrastructure (WebSocket servers, connection pools)
- 1000 users → 30s polling = ~33 req/s = trivial load
- Existing stack → Reuses current HTTP/session infrastructure
**Architectural advantages:**
- **Stateless** - No connection state management, no reconnection logic
- **Debuggable** - Standard HTTP requests in browser DevTools
- **Scalable horizontally** - Any app server can handle any poll request
- **Fail-safe** - Client automatically recovers by next poll
- **Cache-friendly** - Can use HTTP caching headers if needed
**When to evolve:**
- User count > 10,000 (polling becomes inefficient)
- Latency requirement < 10s (30s polling insufficient)
- Battery concerns on mobile (constant polling drains battery)
- Rich interactions (typing indicators, presence - need real-time)
### Security Integration
@security's concerns are valid but apply differently to polling:
**Required (Day 1):**
1. **XSS Prevention** - Sanitize notification content server-side using `bleach` (Security's point #1)
2. **Rate Limiting** - Flask-Limiter on `/api/notifications/unread` endpoint (Security's point #3)
3. **Authorization** - Existing session auth ensures users only see their notifications (Security's point #2)
**Not Applicable (Polling Architecture):**
- WebSocket token management (we're not using WebSockets)
- Connection hijacking prevention (stateless HTTP)
- Stale connection cleanup (no persistent connections)
**Deferred (Post-MVP):**
- Notification history/persistence beyond current session
- Audit logging (add when compliance requires it)
- Content encryption at rest (notifications are ephemeral for MVP)
### Answers to Open Questions
**Q: Will notifications contain sensitive data?**
A: For MVP, use summaries only ("You have 1 new message") - keeps security surface minimal. Full content requires user to click through to authenticated page.
**Q: Will there be notification history?**
A: Not for MVP - notifications shown once, then dismissed. This avoids storage security concerns. Add persistence in v2 if users request it.
### Unknown/Risks (Marked with "?")
1. **Browser support** - Browser Notification API permission model varies. Need to test on target browsers.
2. **Notification volume** - If users get >10 notifications/min, UX degrades. May need batching logic (not in scope yet).
3. **Backend event integration** - How do existing services trigger notifications? Need hook points identified (implementation detail).
### Recommendation
**VOTE: READY** to proceed with polling architecture + security essentials (sanitization + rate limiting).
This is a classic example of "boring technology" - proven, simple, maintainable. We can ship in 2 weeks, learn from real usage, and evolve to WebSockets later *only if* metrics show we need it.
**Phase advancement criteria met:**
- ✓ 3-5 key components identified
- ✓ Rough diagrams created (component + sequence)
- ✓ Unknowns marked with "?"
- ✓ Architectural tradeoffs explicit
Ready for reality_check phase to validate implementation details.
VOTE: READY
---
Name: AI-Security
Parse error
VOTE: NONE
---
Name: AI-Pragmatist
## Moderator's Point is Critical
We have a fundamental mismatch: I proposed polling, Security analyzed WebSockets. Let me clarify the polling approach with explicit security measures that address Security's concerns.
## Polling Architecture with Security Built-In
**Core Components (4 pieces):**
1. **Event Publisher** (Python backend)
- Sanitizes content with `bleach.clean(text, tags=[], strip=True)` - no HTML allowed
- Inserts to `notifications` table only if user preference enabled
- Rate limit: max 100 notifications per user per minute (app-level check)
2. **Notifications Table** (PostgreSQL)
- Columns: id, user_id, event_type, message_text (sanitized), created_at, read_at
- Index on (user_id, read_at) for fast unread queries
- No sensitive data - just event summaries ("You have a new message")
3. **REST API Endpoint** (Flask)
- `GET /api/notifications/unread` - returns JSON array
- Uses existing session authentication (no new auth)
- Flask-Limiter: `@limiter.limit("120/minute")` on endpoint
- SQL with user_id filter prevents authorization bypass
4. **Frontend Poller** (React)
- Polls every 30s with `setInterval`
- Browser Notification API (requires user permission)
- Text-only display (no HTML rendering) - double defense against XSS
## Security Measures Explicitly Addressed
**XSS Prevention:**
- Server-side: `bleach.clean()` strips all HTML/JS
- Client-side: Display as plain text only in Browser Notification API
- **Double defense**: Even if sanitization fails, no rendering attack surface
**Rate Limiting:**
- Application layer: 100 notifications/user/minute when creating
- API layer: 120 requests/minute per user on endpoint (Flask-Limiter)
- Database constraint: prevent spam at source
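The application-layer check (100 notifications per user per minute) could be a small sliding-window counter like the sketch below. This is an in-memory illustration, not Flask-Limiter: the class name and the injectable `clock` are assumptions, and state kept in one process would not be shared across multiple app servers.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._events = defaultdict(deque)

    def allow(self, key) -> bool:
        now = self.clock()
        events = self._events[key]
        # Discard timestamps that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


creation_limiter = SlidingWindowLimiter(limit=100, window_seconds=60)
allowed = sum(creation_limiter.allow("user-1") for _ in range(150))
print(allowed)  # 100 (the remaining 50 are rejected)
```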
**Authorization:**
- SQL query: `WHERE user_id = session.user_id`
- Cannot query other users' notifications (the filter is applied server-side in the query, using the session's user ID rather than client input)
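A sketch of that query, assuming SQLite for illustration and the column list from the Notifications Table item earlier in this message. `fetch_unread` is a hypothetical helper; the key point is that `session_user_id` comes from the server-side session and is passed as a bound parameter, never taken from the request.

```python
import sqlite3

COLUMNS = ("id", "event_type", "message_text", "created_at")


def fetch_unread(conn, session_user_id):
    """Return unread notifications for the logged-in user only."""
    rows = conn.execute(
        "SELECT id, event_type, message_text, created_at "
        "FROM notifications "
        "WHERE user_id = ? AND read_at IS NULL "
        "ORDER BY created_at, id",
        (session_user_id,),
    ).fetchall()
    return [dict(zip(COLUMNS, row)) for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notifications (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    event_type TEXT NOT NULL,
    message_text TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    read_at TEXT
);
CREATE INDEX idx_notifications_unread ON notifications (user_id, read_at);
""")
conn.executemany(
    "INSERT INTO notifications (user_id, event_type, message_text, read_at) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "new_message", "You have 1 new message", None),
        (1, "status_change", "Your status changed", "2025-12-10 10:00:00"),
        (2, "system_alert", "Maintenance tonight", None),
    ],
)
print([n["message_text"] for n in fetch_unread(conn, 1)])
```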
**Data Protection:**
- No sensitive data in notifications (answer to Security's question)
- No persistence beyond 30 days (auto-delete old notifications)
- HTTPS enforced (existing infrastructure)
**DoS Prevention:**
- Polling interval (30s) self-regulates load
- Rate limits prevent notification spam
- No persistent connections to exhaust
DIAGRAM: diagrams/notification-system-brainstorm_mvp_6.puml
## Implementation Timeline
**Week 1:**
- Day 1: DB migration + sanitization helper
- Day 2-3: Event publisher integration (3 event types)
- Day 4: REST endpoint + Flask-Limiter
- Day 5: Frontend poller + Browser API
**Week 2:**
- Day 1-2: User preferences UI
- Day 3: Security testing (XSS attempts, rate limit verification)
- Day 4: Performance testing (1000 concurrent users)
- Day 5: Ship to 50 beta users
**Total:** Week 1 core build (2 devs × 5 days = 10 developer-days) + Week 2 preferences UI, testing, and beta rollout = 2 weeks
## What We're Deferring
- WebSocket infrastructure (not needed for 30s latency)
- Notification history UI (just keep in DB for now)
- Mobile push notifications (scoped out already)
- Email/SMS fallback (not in requirements)
- Advanced batching/grouping (not needed at 1000 users)
## Response to Security's Questions
1. **Sensitive data?** No - summaries only ("You have 1 new message")
2. **History/persistence?** DB storage yes, UI access no (defer to v2)
3. **Rate limiting?** 100/min creation + 120/min API calls
4. **Sanitization?** bleach.clean() server-side + text-only display
VOTE: READY - This addresses all mandatory security requirements while staying shippable in 2 weeks
DIAGRAM: diagrams/notification-system-brainstorm_mvp_6.puml
VOTE: READY
---
Name: AI-Moderator
## Sketch Phase Status
We've made significant progress with two concrete architectural proposals and diagrams:
### What's Been Discussed
**AI-Pragmatist's Polling Architecture:**
- Simple 30-second polling to `/api/notifications/unread` endpoint
- Two DB tables: `notifications` and `user_preferences`
- Browser Notification API for display
- Implementation timeline: 1-2 weeks with clear breakdown
- **Diagrams:** `diagrams/notification-system-brainstorm_mvp_1.puml`, `diagrams/notification-system-brainstorm_mvp_3.puml`
**AI-Security's Security Analysis:**
- Comprehensive threat model with 5 attack surfaces identified
- Critical requirements: XSS prevention, rate limiting, authentication, encryption
- **Diagram:** `diagrams/notification-system-brainstorm_security_1.puml`
- Recommends Socket.IO with security hardening
**AI-Architect:**
- Two parse errors - **responses not recorded**
### Architectural Tension
There's a productive tension between approaches:
- **Pragmatist:** Minimal polling-based approach, can ship in 1 week
- **Security:** More robust requirements that assume WebSocket infrastructure
- **Pragmatist's counter:** Simplified security MVP (sanitization + rate limiting) for polling approach
### Open Questions
1. **Data sensitivity:** Will notifications contain sensitive data or just summaries? (affects encryption/storage)
2. **Persistence:** Will there be notification history? (expands attack surface)
3. **Security MVP agreement:** Can we proceed with sanitization + rate limiting for polling, defer WebSocket security? (@security)
### Concerns Raised
1. Browser notification phishing potential (even with permissions)
2. WebSocket connection state management complexity
3. **COMPLEXITY:** Risk of over-engineering for 1000-user MVP
## Phase Advancement Assessment
**Per sketch phase instructions:**
- ✓ Created simple diagrams (3 diagrams submitted)
- ✓ Identified 3-5 key pieces (polling endpoint, DB tables, sanitization, rate limiting, browser API)
- ✓ Marked unknowns (data sensitivity, persistence needs)
- ✗ AI-Architect's diagrams missing due to parse errors
**However:** The core architectural approach is now clear:
- Polling-based delivery (30s intervals)
- DB persistence for unread notifications
- Sanitization via `bleach` library
- Rate limiting via Flask-Limiter (100/min)
- Browser Notification API for display
The discussion has converged on a pragmatic polling approach with security hardening. @security's final question about the security MVP is the last blocker.
## Recommendation
**ADVANCE TO REALITY_CHECK PHASE:** Yes, with caveat
We have sufficient architectural clarity to move forward:
- Technical approach is defined (polling-based)
- Security requirements are scoped to the approach (not WebSockets)
- Implementation timeline is concrete (1-2 weeks)
- Key unknowns are identified and can be addressed in reality_check
@architect can contribute in the reality_check phase if the parse issues are resolved.
@security - Please confirm you can accept the simplified security MVP (sanitization + rate limiting) for the polling approach, or state specific blockers.