Gaps, Unknowns & Recommendations
Comprehensive audit of Noozer architecture — what's missing, what's unclear, and what you haven't thought of yet.
Executive Summary
After cross-referencing flowchart.md, architecture.md, and initial-requirements.md, I've identified:
- 12 Gaps — things mentioned but not fully specified
- 8 Unknown Unknowns — things you probably haven't considered
- 6 Consistency Issues — mismatches between docs
- 5 Architecture Risks — potential problems at scale
1. GAPS — Mentioned But Not Fully Specified
1.1 KEYWORD_SETS Table Missing
Problem: The flowchart and requirements mention "keyword sets" that customers create to define what to crawl, but there's no KEYWORD_SETS table in the data model.
Recommendation: Add:
KEYWORD_SETS
├── id: uuid
├── customer_id: fk → customers
├── name: string # "AI Industry News"
├── keywords: string[] # ["artificial intelligence", "machine learning", "LLM"]
├── boolean_query: string? # Advanced: "AI AND (startup OR funding) NOT crypto"
├── language: string # ISO 639-1
├── region: string # Google News region code (US, GB, etc)
├── google_news_ceid: string # e.g., "US:en"
├── crawl_frequency: enum # every_15min | hourly | daily
├── is_active: boolean
├── last_crawled_at: timestamp?
├── article_count: int # denormalized
├── created_at: timestamp
└── updated_at: timestamp
1.2 PIPELINE_RUNS Table Missing
Problem: Admin endpoints reference /v1/admin/pipeline/runs but there's no table to store pipeline run history.
Recommendation: Add:
PIPELINE_RUNS
├── id: uuid
├── workflow_id: string # Cloudflare workflow instance ID
├── workflow_type: enum # scheduled_crawl | story_recluster | retention_cleanup
├── status: enum # running | completed | failed | cancelled
├── started_at: timestamp
├── completed_at: timestamp?
├── trigger: enum # cron | manual | api
├── triggered_by: string? # admin user ID if manual
├── metrics: json # {articles_found, articles_new, errors, etc}
├── error_message: text?
└── created_at: timestamp
PIPELINE_ERRORS
├── id: uuid
├── pipeline_run_id: fk → pipeline_runs
├── stage: string # crawl | extract | enrich | classify | etc
├── article_id: uuid?
├── error_type: string
├── error_message: text
├── stack_trace: text?
├── retryable: boolean
├── retry_count: int
└── created_at: timestamp
1.3 Rate Limiting Strategy Incomplete
Problem: Requirements say "≤1 req/sec per egress pattern" but there's no specification of:
- How rate limits are structured in KV
- Per-source vs per-domain vs global limits
- Backoff strategy specifics
- How limits reset
Recommendation: Document rate limit schema:
KV Keys:
├── rate:global:minute:{minute} → count
├── rate:source:{source_id}:minute:{m} → count
├── rate:domain:{domain}:minute:{m} → count
├── backoff:{source_id} → {until: timestamp, multiplier: int}
Rate Limit Rules:
├── Global: 100 req/min across all sources
├── Per-source: 10 req/min per source
├── Per-domain: 20 req/min per publisher domain
├── Google News: 60 req/min (RSS), 10 req/min (HTML)
Backoff Strategy:
├── On 429: backoff = min(base * 2^attempt, 1 hour)
├── On 403: mark source as blocked, alert admin
├── On 5xx: retry 3x with exponential backoff
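Below is a minimal TypeScript sketch of the KV-backed scheme above, assuming a KV namespace bound as RATE_KV and the key layout shown. The binding name and limits are illustrative, and KV counters are eventually consistent, so these limits are approximate rather than exact.

interface Env {
  RATE_KV: KVNamespace;
}

const LIMITS = { global: 100, perSource: 10, perDomain: 20 }; // per minute, from the rules above

// Returns true if the request may proceed, incrementing the relevant counters.
async function allowRequest(env: Env, sourceId: string, domain: string): Promise<boolean> {
  const minute = Math.floor(Date.now() / 60_000);

  // Respect any active backoff window for this source first.
  const backoffRaw = await env.RATE_KV.get(`backoff:${sourceId}`);
  if (backoffRaw) {
    const backoff = JSON.parse(backoffRaw) as { until: number };
    if (Date.now() < backoff.until) return false;
  }

  const checks: Array<[string, number]> = [
    [`rate:global:minute:${minute}`, LIMITS.global],
    [`rate:source:${sourceId}:minute:${minute}`, LIMITS.perSource],
    [`rate:domain:${domain}:minute:${minute}`, LIMITS.perDomain],
  ];

  for (const [key, limit] of checks) {
    const count = Number((await env.RATE_KV.get(key)) ?? "0");
    if (count >= limit) return false;
    // Counters expire shortly after their minute window closes.
    await env.RATE_KV.put(key, String(count + 1), { expirationTtl: 120 });
  }
  return true;
}

// On a 429: backoff = min(base * 2^attempt, 1 hour), as specified above.
async function recordRateLimited(env: Env, sourceId: string, attempt: number): Promise<void> {
  const baseMs = 60_000; // illustrative base of 1 minute
  const waitMs = Math.min(baseMs * 2 ** attempt, 60 * 60 * 1000);
  await env.RATE_KV.put(
    `backoff:${sourceId}`,
    JSON.stringify({ until: Date.now() + waitMs, multiplier: 2 ** attempt }),
    { expirationTtl: Math.ceil(waitMs / 1000) + 60 },
  );
}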
1.4 Customer Events/Analytics Missing
Problem: Requirements mention "Customer events (click/save/dismiss)" but there's no table for this. This is critical for:
- Improving relevance scoring based on user behavior
- Understanding what customers actually engage with
- Feedback loop beyond explicit feedback
Recommendation: Add:
CUSTOMER_EVENTS
├── id: uuid
├── customer_id: fk → customers
├── event_type: enum # view | click | save | dismiss | share | dwell
├── article_id: uuid?
├── story_id: uuid?
├── profile_id: uuid?
├── source: enum # web | mobile | api | email
├── metadata: json # {dwell_time_ms, scroll_depth, etc}
├── session_id: string?
├── created_at: timestamp
Index: (customer_id, created_at)
Index: (article_id, event_type)
1.5 API Key Management Missing
Problem: Auth endpoints mentioned but no table for API keys.
Recommendation: Add:
API_KEYS
├── id: uuid
├── customer_id: fk → customers
├── key_hash: string # SHA-256 of the key (never store plaintext)
├── key_prefix: string # First 8 chars for identification "nz_live_abc..."
├── name: string # "Production Key"
├── scopes: string[] # ["read:feed", "write:keywords", etc]
├── rate_limit_tier: enum # standard | elevated | unlimited
├── last_used_at: timestamp?
├── expires_at: timestamp?
├── is_active: boolean
├── created_at: timestamp
└── revoked_at: timestamp?
ADMIN_TOKENS
├── id: uuid
├── user_id: string # Admin user (could be email or internal ID)
├── token_hash: string
├── permissions: string[] # ["admin:read", "admin:write", "admin:delete"]
├── ip_allowlist: string[]?
├── last_used_at: timestamp?
├── expires_at: timestamp
├── created_at: timestamp
└── revoked_at: timestamp?
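A hedged sketch of how key_hash and key_prefix could work together at request time, assuming a D1 binding named DB. Table and column names follow the schema above; the query itself is illustrative.

interface Env {
  DB: D1Database;
}

interface ApiKeyRow {
  id: string;
  customer_id: string;
  key_hash: string;
  scopes: string;          // JSON-encoded string[] in D1
  expires_at: string | null;
  is_active: number;       // SQLite boolean stored as 0/1
}

async function sha256Hex(value: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(value));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

// Look up by key_prefix, then compare the SHA-256 of the presented key to key_hash.
async function verifyApiKey(env: Env, presentedKey: string): Promise<ApiKeyRow | null> {
  const prefix = presentedKey.slice(0, 8); // e.g. "nz_live_"
  const row = await env.DB.prepare(
    "SELECT id, customer_id, key_hash, scopes, expires_at, is_active FROM api_keys WHERE key_prefix = ?",
  ).bind(prefix).first<ApiKeyRow>();

  if (!row || !row.is_active) return null;
  if (row.expires_at && Date.parse(row.expires_at) < Date.now()) return null;

  const hash = await sha256Hex(presentedKey);
  return hash === row.key_hash ? row : null; // a constant-time compare is preferable in production
}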
1.6 Notification Templates/Preferences Missing
Problem: Notifications mentioned but no structure for:
- Email templates
- Slack message formats
- Webhook payload schemas
- Per-channel preferences
Recommendation: Add:
NOTIFICATION_TEMPLATES
├── id: uuid
├── channel: enum # email | slack | webhook
├── event_type: enum # new_article | story_update | daily_digest | alert
├── template_name: string
├── subject_template: string? # For email
├── body_template: text # Mustache/Handlebars template
├── is_default: boolean
├── created_at: timestamp
NOTIFICATION_LOG
├── id: uuid
├── customer_id: fk
├── profile_id: fk?
├── article_id: uuid?
├── story_id: uuid?
├── channel: enum
├── status: enum # pending | sent | failed | bounced
├── sent_at: timestamp?
├── error_message: text?
├── retry_count: int
├── payload_hash: string # For deduplication
└── created_at: timestamp
1.7 Source Fetch Configuration Schema Unclear
Problem: SOURCES.fetch_config is defined as json but no schema specified.
Recommendation: Document expected structure:
{
"rate_limit": {
"requests_per_minute": 10,
"requests_per_hour": 100
},
"fetch_strategy": "direct" | "zenrows" | "data4seo" | "auto",
"selectors": {
"article_body": "article.content, .post-content, main",
"author": ".author-name, [rel=author]",
"date": "time[datetime], .published-date"
},
"anti_bot": {
"requires_js": false,
"requires_zenrows": false,
"custom_headers": {}
},
"extraction": {
"use_readability": true,
"extract_comments": false,
"max_body_length": 50000
}
}
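To make this actionable for the pipeline, here is a possible TypeScript shape mirroring the JSON above. Field names follow the example; the defaults are illustrative, not prescriptive.

type FetchStrategy = "direct" | "zenrows" | "data4seo" | "auto";

interface SourceFetchConfig {
  rate_limit: { requests_per_minute: number; requests_per_hour: number };
  fetch_strategy: FetchStrategy;
  selectors: { article_body: string; author: string; date: string };
  anti_bot: { requires_js: boolean; requires_zenrows: boolean; custom_headers: Record<string, string> };
  extraction: { use_readability: boolean; extract_comments: boolean; max_body_length: number };
}

// Fallback used when a source row has no fetch_config yet.
const DEFAULT_FETCH_CONFIG: SourceFetchConfig = {
  rate_limit: { requests_per_minute: 10, requests_per_hour: 100 },
  fetch_strategy: "auto",
  selectors: { article_body: "article, main", author: "[rel=author]", date: "time[datetime]" },
  anti_bot: { requires_js: false, requires_zenrows: false, custom_headers: {} },
  extraction: { use_readability: true, extract_comments: false, max_body_length: 50_000 },
};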
1.8 Memory/Context for Q&A Not Specified
Problem: Requirements mention "customer memory docs" and /v1/customers/{id}/memory but no structure defined.
Recommendation: Add:
CUSTOMER_MEMORY
├── id: uuid
├── customer_id: fk → customers
├── memory_type: enum # conversation | preferences | context
├── content: text # The actual memory content
├── embedding_id: string? # For semantic retrieval
├── r2_key: string? # For large documents
├── metadata: json
├── expires_at: timestamp?
├── created_at: timestamp
└── updated_at: timestamp
1.9 Taxonomy Version Control Not Specified
Problem: Requirements mention "versioned rulesets" and "rollback" but no versioning structure.
Recommendation: Add:
TAXONOMY_VERSIONS
├── id: uuid
├── version: int # Auto-incrementing
├── snapshot: json # Full taxonomy state at this version
├── changes: json # Diff from previous version
├── created_by: string # Admin user
├── change_reason: text?
├── is_active: boolean
├── created_at: timestamp
RULES_VERSIONS
├── id: uuid
├── version: int
├── rules: json # All rules at this version
├── created_by: string
├── change_reason: text?
├── is_active: boolean
├── created_at: timestamp
1.10 Scheduled Job Configuration Missing
Problem: Cron schedules mentioned but no way to configure them dynamically.
Recommendation: Add:
SCHEDULED_JOBS
├── id: uuid
├── job_type: enum # crawl | recluster | retention | enrichment_backfill
├── cron_expression: string # "*/15 * * * *"
├── is_enabled: boolean
├── last_run_at: timestamp?
├── next_run_at: timestamp?
├── config: json # Job-specific configuration
├── created_at: timestamp
└── updated_at: timestamp
1.11 Search Index Configuration Missing
Problem: Search endpoint exists but no specification of:
- What fields are searchable
- Full-text vs semantic search behavior
- Filter options
- Pagination strategy
Recommendation: Document search behavior:
Search Fields (Full-text):
├── headline (weight: 3.0)
├── subheadline (weight: 2.0)
├── body_text (weight: 1.0)
├── author names (weight: 1.5)
├── entity names (weight: 2.0)
Semantic Search:
├── Uses article embedding
├── Top-K = 100, then re-rank by relevance
├── Can combine with filters
Filters:
├── date_range: {from, to}
├── sources: uuid[]
├── topics: string[]
├── sentiment_range: {min, max}
├── language: string
├── location_id: uuid
├── story_id: uuid
├── has_media: boolean
Pagination:
├── Cursor-based for consistency
├── Max limit: 100 per page
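A small sketch of the cursor approach: the cursor encodes the last (published_at, article_id) pair seen, so pages stay stable as new articles arrive. The base64-JSON encoding and field names are assumptions, not an existing spec.

interface SearchCursor {
  published_at: string; // ISO timestamp of the last item on the previous page
  article_id: string;
}

const MAX_PAGE_SIZE = 100;

function encodeCursor(cursor: SearchCursor): string {
  return btoa(JSON.stringify(cursor));
}

function decodeCursor(raw: string | null): SearchCursor | null {
  if (!raw) return null;
  try {
    return JSON.parse(atob(raw)) as SearchCursor;
  } catch {
    return null; // treat malformed cursors as "start from the beginning"
  }
}

function clampLimit(requested: number | undefined): number {
  return Math.min(Math.max(requested ?? 20, 1), MAX_PAGE_SIZE);
}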
1.12 Export/Reporting Functionality Missing
Problem: No way for customers to export their data or generate reports.
Recommendation: Add:
EXPORTS
├── id: uuid
├── customer_id: fk → customers
├── export_type: enum # articles | feed | analytics | full_backup
├── format: enum # json | csv | xlsx
├── filters: json # Same as search filters
├── status: enum # pending | processing | completed | failed
├── r2_key: string? # Location of export file
├── download_url: string? # Pre-signed URL (expires)
├── expires_at: timestamp?
├── file_size_bytes: int?
├── row_count: int?
├── created_at: timestamp
└── completed_at: timestamp?
API Endpoints:
POST /v1/exports Create export request
GET /v1/exports List exports
GET /v1/exports/:id Get export status
GET /v1/exports/:id/download Get download URL
2. UNKNOWN UNKNOWNS — Things You Haven't Considered
2.1 Article Updates & Corrections
Problem: Articles change after publication. Headlines get edited, corrections issued, content updated. You capture updated_at but don't handle:
- Detecting when an article has changed
- Re-processing updated articles
- Tracking the diff/history
- Notifying users of significant changes
Recommendation: Add:
ARTICLE_REVISIONS
├── id: uuid
├── article_id: fk → articles
├── revision_number: int
├── detected_at: timestamp
├── previous_headline: string?
├── previous_body_hash: string
├── changes: json # {headline: true, body: true, ...}
├── significance: enum # minor | correction | major_update
├── raw_snapshot_r2_key: string
└── created_at: timestamp
Pipeline addition: Re-fetch articles after 24h and 7d to check for updates.
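A rough sketch of that re-fetch check, comparing the stored headline and body hash against freshly fetched content. Field names mirror ARTICLE_REVISIONS; the significance heuristic is a placeholder.

interface StoredArticle {
  id: string;
  headline: string;
  body_hash: string; // SHA-256 of the body captured at first processing
}

async function sha256Hex(value: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(value));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

// Compare a re-fetched article against what was stored; returns a revision record or null.
async function detectRevision(stored: StoredArticle, fetchedHeadline: string, fetchedBody: string) {
  const newBodyHash = await sha256Hex(fetchedBody);
  const headlineChanged = fetchedHeadline.trim() !== stored.headline.trim();
  const bodyChanged = newBodyHash !== stored.body_hash;
  if (!headlineChanged && !bodyChanged) return null;

  return {
    article_id: stored.id,
    previous_headline: headlineChanged ? stored.headline : null,
    previous_body_hash: stored.body_hash,
    changes: { headline: headlineChanged, body: bodyChanged },
    // Distinguishing "correction" from "major_update" likely needs a diff-size or LLM heuristic.
    significance: headlineChanged ? "major_update" : "minor",
  };
}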
2.2 Paywall Detection
Problem: Many articles are behind paywalls. You'll fetch them, get partial content, and process garbage. Need to:
- Detect paywalled content
- Mark articles as paywalled
- Potentially skip enrichment/classification for paywalled articles
- Track which sources have paywalls
Recommendation: Add:
ARTICLES:
├── paywall_status: enum # none | soft | hard | metered | unknown
SOURCES:
├── paywall_type: enum # none | soft | hard | metered
├── paywall_bypass_strategy: enum # none | zenrows | archive | skip
Detection signals:
- Body text < 500 chars with "subscribe" keywords
- Known paywall DOM patterns
- Source metadata
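A heuristic sketch of those signals. The 500-char threshold comes from the list above; the keyword list and DOM patterns are illustrative examples, not a vetted ruleset.

const PAYWALL_KEYWORDS = ["subscribe", "subscription", "sign in to read", "already a subscriber"];

function detectPaywall(bodyText: string, html: string): "none" | "soft" | "hard" | "unknown" {
  const shortBody = bodyText.length < 500;
  const lowerBody = bodyText.toLowerCase();
  const lowerHtml = html.toLowerCase();
  const hasKeyword = PAYWALL_KEYWORDS.some((k) => lowerBody.includes(k) || lowerHtml.includes(k));
  // Known paywall DOM patterns (class names vary by publisher; examples only).
  const domSignal = /paywall|piano-|regwall|meteredContent/i.test(html);

  if (shortBody && hasKeyword) return "hard";
  if (domSignal) return "soft";
  if (shortBody) return "unknown"; // short but no signal: could be a stub or broken extraction
  return "none";
}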
2.3 Duplicate/Near-Duplicate Content
Problem: You handle URL deduplication, but what about:
- Syndicated content (AP/Reuters published by 100 outlets)
- Rewrites (same story, slightly different words)
- Plagiarism detection
This matters because:
- You'll waste money processing the same content multiple times
- Story clustering will be noisy
- Authority scoring should credit the original
Recommendation: Add:
ARTICLE_DUPLICATES
├── article_id: fk → articles # The duplicate
├── original_article_id: fk # The original (or earliest)
├── similarity_score: float # 0-1
├── duplicate_type: enum # exact | near_duplicate | syndicated | rewrite
├── detected_at: timestamp
Detection method:
├── MinHash/LSH for body text
├── Exact match on first 500 chars hash
├── Embedding similarity > 0.95
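A sketch of the two cheaper checks (prefix-hash exact match and embedding cosine similarity); MinHash/LSH is omitted here. The 0.95 threshold is the one proposed above.

async function prefixHash(bodyText: string): Promise<string> {
  const prefix = bodyText.replace(/\s+/g, " ").trim().slice(0, 500);
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(prefix));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function classifyDuplicate(similarity: number): "exact" | "near_duplicate" | null {
  if (similarity >= 0.999) return "exact";
  if (similarity > 0.95) return "near_duplicate"; // threshold from the rule above
  return null;
}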
2.4 Source Discovery & Onboarding
Problem: How do you discover new sources? Currently it seems like sources are manually added. But:
- Google News returns articles from sources you've never seen
- Those sources need configuration
- Some sources will be garbage/spam
Recommendation: Add auto-discovery:
SOURCE_CANDIDATES
├── id: uuid
├── domain: string
├── first_seen_at: timestamp
├── article_count: int # How many times we've seen them
├── sample_urls: string[]
├── auto_detected_type: enum # news | blog | spam | unknown
├── auto_detected_quality: float # 0-1 heuristic
├── status: enum # pending_review | approved | rejected | blocked
├── reviewed_by: string?
├── reviewed_at: timestamp?
└── created_at: timestamp
When a new domain appears:
- Log to SOURCE_CANDIDATES
- If seen 5+ times, flag for review
- Auto-approve if quality score > 0.8
- Auto-reject if spam signals detected
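Those rules translate into a small decision function like the sketch below. Thresholds are the ones listed; spam-signal detection is assumed to exist elsewhere.

interface SourceCandidate {
  domain: string;
  article_count: number;
  auto_detected_quality: number; // 0-1 heuristic
}

type CandidateAction = "ignore" | "flag_for_review" | "auto_approve" | "auto_reject";

function reviewCandidate(candidate: SourceCandidate, hasSpamSignals: boolean): CandidateAction {
  if (hasSpamSignals) return "auto_reject";
  if (candidate.article_count < 5) return "ignore"; // not seen enough times yet
  if (candidate.auto_detected_quality > 0.8) return "auto_approve";
  return "flag_for_review";
}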
2.5 Entity Resolution & Disambiguation
Problem: "Apple" could be the company, the fruit, or Apple Records. "Michael Jordan" could be the basketball player or the professor. You have NER but no disambiguation.
Recommendation: Add:
ENTITIES:
├── disambiguation_type: enum # unique | ambiguous | merged
├── wikidata_id: string? # Q312 for Apple Inc.
├── wikipedia_url: string?
├── related_entities: uuid[] # For disambiguation context
ENTITY_ALIASES:
├── alias: string # "AAPL", "Apple Inc", "Apple Computer"
├── entity_id: fk → entities
├── alias_type: enum # name | ticker | abbreviation | typo
├── confidence: float
Disambiguation strategy:
- Check context (other entities in article)
- Check source (tech news → Apple Inc.)
- Check topic classification
- Use Wikidata for canonical resolution
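A hedged sketch of the context-scoring idea: each candidate sense is scored by co-occurring entities and a source-category prior, with Wikidata as the fallback when no signal exists. Weights and field names are illustrative.

interface EntityCandidate {
  entity_id: string;
  wikidata_id?: string;
  category: string;            // e.g. "technology", "sports"
  related_entities: string[];  // entity IDs that commonly co-occur with this sense
}

// Score one candidate sense of an ambiguous mention.
function scoreCandidate(candidate: EntityCandidate, articleEntityIds: Set<string>, sourceCategory: string): number {
  let score = 0;
  for (const related of candidate.related_entities) {
    if (articleEntityIds.has(related)) score += 1.0; // context from co-occurring entities
  }
  if (candidate.category === sourceCategory) score += 0.5; // source prior: tech outlet -> tech sense
  return score;
}

function disambiguate(candidates: EntityCandidate[], articleEntityIds: Set<string>, sourceCategory: string): EntityCandidate | null {
  let best: EntityCandidate | null = null;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate, articleEntityIds, sourceCategory);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  // No contextual signal: fall back to Wikidata / mark as ambiguous rather than guessing.
  return bestScore > 0 ? best : null;
}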
2.6 Bias & Reliability Scoring
Problem: You have political_lean but bias is more nuanced:
- Factual accuracy vs opinion
- Sensationalism
- Clickbait
- Bias by omission
This matters for credibility scoring and alerting users.
Recommendation: Add:
SOURCE_RELIABILITY
├── source_id: fk → sources
├── factual_accuracy: float # 0-1, from fact-check orgs
├── editorial_bias: float # -1 to +1
├── sensationalism_score: float # 0-1
├── transparency_score: float # Ownership, funding disclosed
├── mbfc_rating: string? # Media Bias Fact Check rating
├── newsguard_score: int? # NewsGuard 0-100
├── last_evaluated_at: timestamp
└── created_at: timestamp
Consider integrating:
- Media Bias Fact Check API
- NewsGuard (if budget allows)
- Ad Fontes Media Bias Chart
- AllSides bias ratings
2.7 Content Moderation
Problem: What if crawled content contains:
- Hate speech
- Graphic violence descriptions
- Illegal content
- NSFW material
You're processing it, storing it, and potentially surfacing it to customers.
Recommendation: Add:
ARTICLE_MODERATION
├── article_id: fk → articles
├── flagged: boolean
├── flags: string[] # ["hate_speech", "violence", "nsfw"]
├── confidence: float
├── reviewed: boolean
├── reviewer_decision: enum # approved | hidden | deleted
├── reviewed_by: string?
├── reviewed_at: timestamp?
└── created_at: timestamp
Options:
- Use Workers AI content moderation
- Add LLM classification step for safety
- Source-level blocklist for known bad actors
- Customer-configurable content filters
2.8 Internationalization (i18n) of the Platform
Problem: You support multi-language articles, but what about:
- Admin console in multiple languages
- API error messages
- Email notification templates
- Taxonomy labels in multiple languages
Recommendation: For V1, document that UI/API is English-only. For V2, plan:
TRANSLATIONS
├── key: string # "error.rate_limit_exceeded"
├── language: string # ISO 639-1
├── value: text
└── created_at: timestamp
3. CONSISTENCY ISSUES — Mismatches Between Docs
3.1 Vectorize Index Count Mismatch
- architecture.md system diagram: Shows 5 vector indexes
- architecture.md deployment section: Shows 7 vector indexes
- flowchart.md: Shows 7 vector indexes
Fix: Update system diagram to show all 7.
3.2 Queue Count Mismatch
- architecture.md: Says "8 topics" in the diagram, lists 9 in deployment
- flowchart.md: Lists 9 queues
Fix: Update diagram to say "9 topics".
3.3 Worker Count
- architecture.md: Lists 8 workers
- flowchart.md: Shows a slightly different worker breakdown
Fix: Align on exact worker names and count.
3.4 Missing Dedup Step in architecture.md Flows
- flowchart.md: Shows an explicit dedup step with URL hash check
- architecture.md sub-flows: Don't show the dedup step
Fix: Update architecture.md pipeline diagrams to include dedup.
3.5 Missing Geo/Author in architecture.md Flows
- flowchart.md: Shows Geo Extraction and Author Resolution as separate steps
- architecture.md sub-flows: Don't show these
Fix: Update architecture.md extraction pipeline to match.
3.6 API Endpoints Incomplete in architecture.md
- flowchart.md: Has comprehensive client + admin API sections
- architecture.md: Has an abbreviated endpoint list
Fix: Either sync them or reference flowchart.md for full API spec.
4. ARCHITECTURE RISKS — Potential Problems at Scale
4.1 D1 Size Limits
Risk: Cloudflare D1 caps total database size (currently 10 GB per database), and article volume will push against it quickly. At scale:
- 10,000 articles/day = 300,000/month = 3.6M/year
- Plus all junction tables, classifications, events, etc.
Mitigation:
- Implement aggressive retention (30-90 days for articles)
- Archive to R2 before deletion
- Consider D1 sharding strategy
- Monitor row counts with alerts
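A sketch of the archive-then-delete step, assuming D1 and R2 bindings named DB and ARCHIVE, a 90-day window, and a batch size of 500; those values and the R2 key layout are assumptions.

interface Env {
  DB: D1Database;
  ARCHIVE: R2Bucket;
}

const RETENTION_DAYS = 90;
const BATCH_SIZE = 500;

// Archive one batch of expired articles to R2, then delete them from D1.
async function archiveExpiredArticles(env: Env): Promise<number> {
  const cutoff = new Date(Date.now() - RETENTION_DAYS * 24 * 60 * 60 * 1000).toISOString();

  const { results } = await env.DB.prepare(
    "SELECT * FROM articles WHERE published_at < ? LIMIT ?",
  ).bind(cutoff, BATCH_SIZE).all();

  if (!results || results.length === 0) return 0;

  // Write the batch to R2 first so the data survives even if the delete fails.
  const key = `archive/articles/${cutoff.slice(0, 10)}/${crypto.randomUUID()}.json`;
  await env.ARCHIVE.put(key, JSON.stringify(results));

  const ids = results.map((row: any) => row.id);
  const placeholders = ids.map(() => "?").join(",");
  await env.DB.prepare(`DELETE FROM articles WHERE id IN (${placeholders})`).bind(...ids).run();

  return results.length;
}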
4.2 Vectorize Index Size
Risk: 225K locations + millions of articles + authors + entities = large vector indexes
Mitigation:
- Understand Vectorize limits (currently 5M vectors per index)
- Plan for index sharding or tiering
- Consider separate indexes for hot (recent) vs cold (archive) data
4.3 Cold Start Latency
Risk: Cloudflare Workers cold starts are typically small (single-digit milliseconds thanks to the isolate model), but worker-to-worker chaining, large bundles, and repeated uncached lookups can still add meaningful latency. For real-time APIs, this matters.
Mitigation:
- Keep critical paths in single worker (avoid chaining)
- Use KV caching aggressively
- Consider always-on workers for critical paths (if available)
4.4 Queue Depth Runaway
Risk: If processing slows down, queues can back up infinitely.
Mitigation:
- Set max queue depths with alerts
- Implement circuit breakers (pause crawling if queue > threshold)
- Dead letter queues for failed messages
- Queue metrics in observability
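One way to wire the circuit breaker is a KV flag that producers check before enqueueing. The sketch below assumes a CONTROL_KV binding and a getQueueBacklog helper; Cloudflare exposes queue backlog through its analytics/observability surface rather than a Worker binding, so how that helper reads the number is an integration detail.

interface Env {
  CONTROL_KV: KVNamespace;
}

const MAX_BACKLOG = 10_000; // pause/alert threshold, illustrative

// Scheduled check: flip the circuit open when the backlog crosses the threshold.
async function updateCircuitBreaker(env: Env, getQueueBacklog: () => Promise<number>): Promise<void> {
  const backlog = await getQueueBacklog(); // placeholder: read from queue analytics
  if (backlog > MAX_BACKLOG) {
    await env.CONTROL_KV.put("circuit:crawl", "open", { expirationTtl: 15 * 60 });
  } else {
    await env.CONTROL_KV.delete("circuit:crawl");
  }
}

// Crawler checks this before enqueueing new fetch jobs.
async function crawlingPaused(env: Env): Promise<boolean> {
  return (await env.CONTROL_KV.get("circuit:crawl")) === "open";
}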
4.5 Cost Explosion
Risk: A bug or misconfiguration could trigger massive external API usage.
Mitigation:
- Hard budget limits (stop processing if daily cost > $X)
- Per-service rate limits
- Anomaly detection on cost velocity
- Require manual approval for > 10x normal spend
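A sketch of the hard daily budget check, using KV counters keyed by day and service. The $50 cap and key names are illustrative, and KV's eventual consistency makes this a soft guard rather than an exact ledger.

interface Env {
  CONTROL_KV: KVNamespace;
}

const DAILY_BUDGET_USD = 50; // hard stop, illustrative

// Record the estimated cost of each paid external call.
async function recordSpend(env: Env, service: string, costUsd: number): Promise<void> {
  const day = new Date().toISOString().slice(0, 10);
  const key = `cost:${day}:${service}`;
  const current = Number((await env.CONTROL_KV.get(key)) ?? "0");
  await env.CONTROL_KV.put(key, String(current + costUsd), { expirationTtl: 3 * 24 * 3600 });
}

// Checked before making new paid calls; processing stops once the cap is hit.
async function budgetExceeded(env: Env, services: string[]): Promise<boolean> {
  const day = new Date().toISOString().slice(0, 10);
  let total = 0;
  for (const service of services) {
    total += Number((await env.CONTROL_KV.get(`cost:${day}:${service}`)) ?? "0");
  }
  return total >= DAILY_BUDGET_USD;
}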
5. RECOMMENDATIONS — Priority Actions
Immediate (Before Coding)
- Add KEYWORD_SETS table — Core to the product
- Add PIPELINE_RUNS/ERRORS tables — Critical for debugging
- Add API_KEYS table — Required for auth
- Document rate limiting strategy — Required for crawling
- Fix consistency issues — Update diagrams
Short-term (During V1)
- Add CUSTOMER_EVENTS — Important for relevance improvement
- Add paywall detection — Or you'll waste money on garbage
- Add near-duplicate detection — Or story clustering will be noisy
- Add SOURCE_CANDIDATES — Or you'll never scale source coverage
- Add NOTIFICATION_LOG — Or you can't debug delivery issues
Medium-term (V1.1)
- Add article update detection — News changes
- Add entity disambiguation — Or NER is half-useful
- Add content moderation — Liability risk
- Add export functionality — Customers will ask
Long-term (V2)
- Source reliability scoring — Differentiation
- Full bias/credibility analysis — Premium feature
- I18n infrastructure — International expansion
6. UPDATED TABLES TO ADD
Here's the complete list of missing tables to add to architecture.md:
KEYWORD_SETS
API_KEYS
ADMIN_TOKENS
PIPELINE_RUNS
PIPELINE_ERRORS
CUSTOMER_EVENTS
CUSTOMER_MEMORY
NOTIFICATION_TEMPLATES
NOTIFICATION_LOG
TAXONOMY_VERSIONS
RULES_VERSIONS
SCHEDULED_JOBS
EXPORTS
ARTICLE_REVISIONS
ARTICLE_DUPLICATES
ARTICLE_MODERATION
SOURCE_CANDIDATES
SOURCE_RELIABILITY
ENTITY_ALIASES
This document should be reviewed before starting implementation and updated as decisions are made.