
Gaps, Unknowns & Recommendations

Comprehensive audit of Noozer architecture — what's missing, what's unclear, and what you haven't thought of yet.


Executive Summary

After cross-referencing flowchart.md, architecture.md, and initial-requirements.md, I've identified:

  • 12 Gaps — things mentioned but not fully specified
  • 8 Unknown Unknowns — things you probably haven't considered
  • 6 Consistency Issues — mismatches between docs
  • 5 Architecture Risks — potential problems at scale

1. GAPS — Mentioned But Not Fully Specified

1.1 KEYWORD_SETS Table Missing

Problem: The flowchart and requirements mention "keyword sets" that customers create to define what to crawl, but there's no KEYWORD_SETS table in the data model.

Recommendation: Add:

KEYWORD_SETS
├── id: uuid
├── customer_id: fk → customers
├── name: string # "AI Industry News"
├── keywords: string[] # ["artificial intelligence", "machine learning", "LLM"]
├── boolean_query: string? # Advanced: "AI AND (startup OR funding) NOT crypto"
├── language: string # ISO 639-1
├── region: string # Google News region code (US, GB, etc)
├── google_news_ceid: string # e.g., "US:en"
├── crawl_frequency: enum # every_15min | hourly | daily
├── is_active: boolean
├── last_crawled_at: timestamp?
├── article_count: int # denormalized
├── created_at: timestamp
└── updated_at: timestamp
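
As a rough illustration of how these fields would drive crawling, here is a minimal TypeScript sketch that builds a Google News RSS search URL from a KEYWORD_SETS row. The row type and helper name are illustrative; the URL format follows the standard Google News RSS search parameters (q, hl, gl, ceid).

// Sketch: turning a KEYWORD_SETS row into a Google News RSS search URL.
// Field names mirror the proposed table; the row type is illustrative.
interface KeywordSetRow {
  keywords: string[];
  boolean_query: string | null;
  language: string;          // ISO 639-1, e.g. "en"
  region: string;            // Google News region code, e.g. "US"
  google_news_ceid: string;  // e.g. "US:en"
}

function googleNewsRssUrl(set: KeywordSetRow): string {
  // Prefer the advanced boolean query when present, otherwise OR the keywords.
  const q = set.boolean_query ?? set.keywords.map((k) => `"${k}"`).join(" OR ");
  const params = new URLSearchParams({
    q,
    hl: `${set.language}-${set.region}`, // e.g. "en-US"
    gl: set.region,                      // e.g. "US"
    ceid: set.google_news_ceid,          // e.g. "US:en"
  });
  return `https://news.google.com/rss/search?${params.toString()}`;
}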

1.2 PIPELINE_RUNS Table Missing

Problem: Admin endpoints reference /v1/admin/pipeline/runs but there's no table to store pipeline run history.

Recommendation: Add:

PIPELINE_RUNS
├── id: uuid
├── workflow_id: string # Cloudflare workflow instance ID
├── workflow_type: enum # scheduled_crawl | story_recluster | retention_cleanup
├── status: enum # running | completed | failed | cancelled
├── started_at: timestamp
├── completed_at: timestamp?
├── trigger: enum # cron | manual | api
├── triggered_by: string? # admin user ID if manual
├── metrics: json # {articles_found, articles_new, errors, etc}
├── error_message: text?
└── created_at: timestamp

PIPELINE_ERRORS
├── id: uuid
├── pipeline_run_id: fk → pipeline_runs
├── stage: string # crawl | extract | enrich | classify | etc
├── article_id: uuid?
├── error_type: string
├── error_message: text
├── stack_trace: text?
├── retryable: boolean
├── retry_count: int
└── created_at: timestamp

1.3 Rate Limiting Strategy Incomplete

Problem: Requirements say "≤1 req/sec per egress pattern" but there's no specification of:

  • How rate limits are structured in KV
  • Per-source vs per-domain vs global limits
  • Backoff strategy specifics
  • How limits reset

Recommendation: Document rate limit schema:

KV Keys:
├── rate:global:minute:{minute} → count
├── rate:source:{source_id}:minute:{m} → count
├── rate:domain:{domain}:minute:{m} → count
├── backoff:{source_id} → {until: timestamp, multiplier: int}

Rate Limit Rules:
├── Global: 100 req/min across all sources
├── Per-source: 10 req/min per source
├── Per-domain: 20 req/min per publisher domain
├── Google News: 60 req/min (RSS), 10 req/min (HTML)

Backoff Strategy:
├── On 429: backoff = min(base * 2^attempt, 1 hour)
├── On 403: mark source as blocked, alert admin
├── On 5xx: retry 3x with exponential backoff
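
A minimal sketch of how the per-source limit and backoff could be enforced with the KV keys above, assuming the crawler is a TypeScript Worker. The binding name (RATE_KV) is illustrative, and KV counters are eventually consistent, so this is a soft limit; a Durable Object would be needed for strict enforcement.

// Sketch of the per-source rate-limit check using the KV keys above.
interface Env {
  RATE_KV: KVNamespace;
}

const PER_SOURCE_LIMIT = 10; // req/min per source, per the rules above

async function canFetchSource(env: Env, sourceId: string): Promise<boolean> {
  // Respect any active backoff first.
  const backoff = await env.RATE_KV.get<{ until: number }>(`backoff:${sourceId}`, "json");
  if (backoff && backoff.until > Date.now()) return false;

  const minute = Math.floor(Date.now() / 60_000);
  const key = `rate:source:${sourceId}:minute:${minute}`;
  const count = Number((await env.RATE_KV.get(key)) ?? "0");
  if (count >= PER_SOURCE_LIMIT) return false;

  // Best-effort increment; expire the counter shortly after the window closes.
  await env.RATE_KV.put(key, String(count + 1), { expirationTtl: 120 });
  return true;
}

async function recordRateLimited(env: Env, sourceId: string, attempt: number): Promise<void> {
  // On 429: backoff = min(base * 2^attempt, 1 hour)
  const baseMs = 60_000;
  const delay = Math.min(baseMs * 2 ** attempt, 60 * 60_000);
  await env.RATE_KV.put(
    `backoff:${sourceId}`,
    JSON.stringify({ until: Date.now() + delay, multiplier: 2 ** attempt }),
    { expirationTtl: Math.ceil(delay / 1000) + 60 }
  );
}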

1.4 Customer Events/Analytics Missing

Problem: Requirements mention "Customer events (click/save/dismiss)" but there's no table for this. This is critical for:

  • Improving relevance scoring based on user behavior
  • Understanding what customers actually engage with
  • Feedback loop beyond explicit feedback

Recommendation: Add:

CUSTOMER_EVENTS
├── id: uuid
├── customer_id: fk → customers
├── event_type: enum # view | click | save | dismiss | share | dwell
├── article_id: uuid?
├── story_id: uuid?
├── profile_id: uuid?
├── source: enum # web | mobile | api | email
├── metadata: json # {dwell_time_ms, scroll_depth, etc}
├── session_id: string?
└── created_at: timestamp

Index: (customer_id, created_at)
Index: (article_id, event_type)

1.5 API Key Management Missing

Problem: Auth endpoints mentioned but no table for API keys.

Recommendation: Add:

API_KEYS
├── id: uuid
├── customer_id: fk → customers
├── key_hash: string # SHA-256 of the key (never store plaintext)
├── key_prefix: string # First 8 chars for identification "nz_live_abc..."
├── name: string # "Production Key"
├── scopes: string[] # ["read:feed", "write:keywords", etc]
├── rate_limit_tier: enum # standard | elevated | unlimited
├── last_used_at: timestamp?
├── expires_at: timestamp?
├── is_active: boolean
├── created_at: timestamp
└── revoked_at: timestamp?

ADMIN_TOKENS
├── id: uuid
├── user_id: string # Admin user (could be email or internal ID)
├── token_hash: string
├── permissions: string[] # ["admin:read", "admin:write", "admin:delete"]
├── ip_allowlist: string[]?
├── last_used_at: timestamp?
├── expires_at: timestamp
├── created_at: timestamp
└── revoked_at: timestamp?
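
To make the key_hash / key_prefix fields concrete, here is a minimal TypeScript sketch of key generation and verification using Web Crypto (available in Workers). The "nz_live_" prefix and key length are illustrative assumptions.

// Sketch: generating and hashing an API key; only the hash and prefix are stored.
async function sha256Hex(value: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(value));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

async function generateApiKey(): Promise<{ plaintext: string; key_hash: string; key_prefix: string }> {
  const bytes = crypto.getRandomValues(new Uint8Array(24));
  const plaintext = "nz_live_" + btoa(String.fromCharCode(...bytes))
    .replace(/[+/=]/g, "")   // keep it URL-safe
    .slice(0, 32);
  return {
    plaintext,                          // shown to the customer once, never stored
    key_hash: await sha256Hex(plaintext),
    key_prefix: plaintext.slice(0, 8),  // first 8 chars for identification
  };
}

// On each request: hash the presented key and compare against API_KEYS.key_hash.
async function verifyApiKey(presented: string, storedHash: string): Promise<boolean> {
  return (await sha256Hex(presented)) === storedHash;
}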

1.6 Notification Templates/Preferences Missing

Problem: Notifications mentioned but no structure for:

  • Email templates
  • Slack message formats
  • Webhook payload schemas
  • Per-channel preferences

Recommendation: Add:

NOTIFICATION_TEMPLATES
├── id: uuid
├── channel: enum # email | slack | webhook
├── event_type: enum # new_article | story_update | daily_digest | alert
├── template_name: string
├── subject_template: string? # For email
├── body_template: text # Mustache/Handlebars template
├── is_default: boolean
└── created_at: timestamp

NOTIFICATION_LOG
├── id: uuid
├── customer_id: fk
├── profile_id: fk?
├── article_id: uuid?
├── story_id: uuid?
├── channel: enum
├── status: enum # pending | sent | failed | bounced
├── sent_at: timestamp?
├── error_message: text?
├── retry_count: int
├── payload_hash: string # For deduplication
└── created_at: timestamp

1.7 Source Fetch Configuration Schema Unclear

Problem: SOURCES.fetch_config is defined as json but no schema specified.

Recommendation: Document expected structure:

{
  "rate_limit": {
    "requests_per_minute": 10,
    "requests_per_hour": 100
  },
  "fetch_strategy": "direct" | "zenrows" | "data4seo" | "auto",
  "selectors": {
    "article_body": "article.content, .post-content, main",
    "author": ".author-name, [rel=author]",
    "date": "time[datetime], .published-date"
  },
  "anti_bot": {
    "requires_js": false,
    "requires_zenrows": false,
    "custom_headers": {}
  },
  "extraction": {
    "use_readability": true,
    "extract_comments": false,
    "max_body_length": 50000
  }
}
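
The same structure expressed as a TypeScript type, which the fetch workers could use for validation. This mirrors the JSON above; nothing beyond those fields is assumed.

// Illustrative TypeScript shape for SOURCES.fetch_config.
interface FetchConfig {
  rate_limit: {
    requests_per_minute: number;
    requests_per_hour: number;
  };
  fetch_strategy: "direct" | "zenrows" | "data4seo" | "auto";
  selectors: {
    article_body: string;  // CSS selectors, comma-separated fallbacks
    author: string;
    date: string;
  };
  anti_bot: {
    requires_js: boolean;
    requires_zenrows: boolean;
    custom_headers: Record<string, string>;
  };
  extraction: {
    use_readability: boolean;
    extract_comments: boolean;
    max_body_length: number;
  };
}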

1.8 Memory/Context for Q&A Not Specified

Problem: Requirements mention "customer memory docs" and /v1/customers/{id}/memory but no structure defined.

Recommendation: Add:

CUSTOMER_MEMORY
├── id: uuid
├── customer_id: fk → customers
├── memory_type: enum # conversation | preferences | context
├── content: text # The actual memory content
├── embedding_id: string? # For semantic retrieval
├── r2_key: string? # For large documents
├── metadata: json
├── expires_at: timestamp?
├── created_at: timestamp
└── updated_at: timestamp

1.9 Taxonomy Version Control Not Specified

Problem: Requirements mention "versioned rulesets" and "rollback" but no versioning structure.

Recommendation: Add:

TAXONOMY_VERSIONS
├── id: uuid
├── version: int # Auto-incrementing
├── snapshot: json # Full taxonomy state at this version
├── changes: json # Diff from previous version
├── created_by: string # Admin user
├── change_reason: text?
├── is_active: boolean
└── created_at: timestamp

RULES_VERSIONS
├── id: uuid
├── version: int
├── rules: json # All rules at this version
├── created_by: string
├── change_reason: text?
├── is_active: boolean
└── created_at: timestamp

1.10 Scheduled Job Configuration Missing

Problem: Cron schedules mentioned but no way to configure them dynamically.

Recommendation: Add:

SCHEDULED_JOBS
├── id: uuid
├── job_type: enum # crawl | recluster | retention | enrichment_backfill
├── cron_expression: string # "*/15 * * * *"
├── is_enabled: boolean
├── last_run_at: timestamp?
├── next_run_at: timestamp?
├── config: json # Job-specific configuration
├── created_at: timestamp
└── updated_at: timestamp
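
A minimal sketch of how a cron-triggered Worker could read this table from D1 and dispatch due jobs onto a queue. Binding names (DB, JOB_QUEUE), the table/column casing, and the dispatch mechanism are assumptions; recomputing next_run_at from cron_expression is omitted.

// Sketch: cron handler that dispatches due SCHEDULED_JOBS onto a queue.
interface Env {
  DB: D1Database;
  JOB_QUEUE: Queue;
}

interface ScheduledJobRow {
  id: string;
  job_type: string;
  config: string; // JSON column
}

export default {
  async scheduled(_event: ScheduledEvent, env: Env, _ctx: ExecutionContext): Promise<void> {
    const now = new Date().toISOString();
    const { results } = await env.DB
      .prepare(
        "SELECT id, job_type, config FROM scheduled_jobs WHERE is_enabled = 1 AND next_run_at <= ?"
      )
      .bind(now)
      .all<ScheduledJobRow>();

    for (const job of results ?? []) {
      await env.JOB_QUEUE.send({ jobId: job.id, jobType: job.job_type, config: JSON.parse(job.config) });
      await env.DB
        .prepare("UPDATE scheduled_jobs SET last_run_at = ? WHERE id = ?")
        .bind(now, job.id)
        .run();
      // next_run_at would be recomputed from cron_expression (parser omitted here).
    }
  },
};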

1.11 Search Index Configuration Missing

Problem: Search endpoint exists but no specification of:

  • What fields are searchable
  • Full-text vs semantic search behavior
  • Filter options
  • Pagination strategy

Recommendation: Document search behavior:

Search Fields (Full-text):
├── headline (weight: 3.0)
├── subheadline (weight: 2.0)
├── body_text (weight: 1.0)
├── author names (weight: 1.5)
├── entity names (weight: 2.0)

Semantic Search:
├── Uses article embedding
├── Top-K = 100, then re-rank by relevance
├── Can combine with filters

Filters:
├── date_range: {from, to}
├── sources: uuid[]
├── topics: string[]
├── sentiment_range: {min, max}
├── language: string
├── location_id: uuid
├── story_id: uuid
├── has_media: boolean

Pagination:
├── Cursor-based for consistency
├── Max limit: 100 per page
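
A rough sketch of the semantic leg: embed the query with Workers AI, pull the top 100 candidates from Vectorize, then merge and re-rank with the full-text results. Binding names (AI, ARTICLE_INDEX) and the embedding model ID are assumptions, and the filter assumes `language` is configured as a metadata index.

// Sketch of semantic search: top-K = 100, then re-rank by relevance (per the spec above).
interface Env {
  AI: Ai;
  ARTICLE_INDEX: VectorizeIndex;
}

async function semanticSearch(env: Env, query: string, language: string) {
  // Embed the query text (model name is an assumption).
  const embedding = (await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [query] })) as {
    data: number[][];
  };
  const vector = embedding.data[0];

  const result = await env.ARTICLE_INDEX.query(vector, {
    topK: 100,
    filter: { language }, // other filters (dates, sources, topics) applied in D1
  });

  return result.matches.map((m) => ({ articleId: m.id, score: m.score }));
}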

1.12 Export/Reporting Functionality Missing

Problem: No way for customers to export their data or generate reports.

Recommendation: Add:

EXPORTS
├── id: uuid
├── customer_id: fk → customers
├── export_type: enum # articles | feed | analytics | full_backup
├── format: enum # json | csv | xlsx
├── filters: json # Same as search filters
├── status: enum # pending | processing | completed | failed
├── r2_key: string? # Location of export file
├── download_url: string? # Pre-signed URL (expires)
├── expires_at: timestamp?
├── file_size_bytes: int?
├── row_count: int?
├── created_at: timestamp
└── completed_at: timestamp?

API Endpoints:

POST /v1/exports                    Create export request
GET  /v1/exports                    List exports
GET  /v1/exports/:id                Get export status
GET  /v1/exports/:id/download       Get download URL

2. UNKNOWN UNKNOWNS — Things You Haven't Considered

2.1 Article Updates & Corrections

Problem: Articles change after publication. Headlines get edited, corrections issued, content updated. You capture updated_at but don't handle:

  • Detecting when an article has changed
  • Re-processing updated articles
  • Tracking the diff/history
  • Notifying users of significant changes

Recommendation: Add:

ARTICLE_REVISIONS
├── id: uuid
├── article_id: fk → articles
├── revision_number: int
├── detected_at: timestamp
├── previous_headline: string?
├── previous_body_hash: string
├── changes: json # {headline: true, body: true, ...}
├── significance: enum # minor | correction | major_update
├── raw_snapshot_r2_key: string
└── created_at: timestamp

Pipeline addition: Re-fetch articles after 24h and 7d to check for updates.
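
A minimal sketch of the re-fetch check, assuming a stored hash of the last processed body: compare the newly fetched headline and body hash against the stored values and emit an ARTICLE_REVISIONS row when they differ. The significance heuristic is illustrative.

// Sketch: revision detection on re-fetch.
interface StoredArticle {
  id: string;
  headline: string;
  body_hash: string;       // hash of body_text at last processing
  revision_count: number;
}

async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

async function detectRevision(stored: StoredArticle, fetchedHeadline: string, fetchedBody: string) {
  const newHash = await sha256Hex(fetchedBody);
  const headlineChanged = fetchedHeadline !== stored.headline;
  const bodyChanged = newHash !== stored.body_hash;
  if (!headlineChanged && !bodyChanged) return null;

  return {
    article_id: stored.id,
    revision_number: stored.revision_count + 1,
    previous_headline: headlineChanged ? stored.headline : null,
    previous_body_hash: stored.body_hash,
    changes: { headline: headlineChanged, body: bodyChanged },
    // Crude heuristic: headline edits treated as more significant than body tweaks.
    significance: headlineChanged ? "major_update" : "minor",
    detected_at: new Date().toISOString(),
  };
}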


2.2 Paywall Detection

Problem: Many articles are behind paywalls. You'll fetch them, get partial content, and process garbage. Need to:

  • Detect paywalled content
  • Mark articles as paywalled
  • Potentially skip enrichment/classification for paywalled articles
  • Track which sources have paywalls

Recommendation: Add:

ARTICLES:
├── paywall_status: enum # none | soft | hard | metered | unknown

SOURCES:
├── paywall_type: enum # none | soft | hard | metered
├── paywall_bypass_strategy: enum # none | zenrows | archive | skip

Detection signals:

  • Body text < 500 chars with "subscribe" keywords
  • Known paywall DOM patterns
  • Source metadata
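
A minimal sketch of these signals combined into a heuristic check; the thresholds, keyword list, and DOM patterns are illustrative assumptions, not a tested classifier.

// Sketch: heuristic paywall detection from extracted body text + raw HTML.
const PAYWALL_KEYWORDS = ["subscribe", "subscription", "sign in to continue", "already a member"];
const PAYWALL_DOM_PATTERNS = [/paywall/i, /piano-offer/i, /meteredContent/i];

function detectPaywall(bodyText: string, rawHtml: string): "none" | "soft" | "hard" | "unknown" {
  const shortBody = bodyText.trim().length < 500;
  const hasKeyword = PAYWALL_KEYWORDS.some((k) => bodyText.toLowerCase().includes(k));
  const hasDomPattern = PAYWALL_DOM_PATTERNS.some((p) => p.test(rawHtml));

  // Body text < 500 chars with "subscribe" keywords → likely a hard paywall.
  if (shortBody && hasKeyword) return "hard";
  if (shortBody && hasDomPattern) return "hard";
  // Full body present but paywall markup on the page → likely soft/metered.
  if (!shortBody && (hasKeyword || hasDomPattern)) return "soft";
  return shortBody ? "unknown" : "none";
}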

2.3 Duplicate/Near-Duplicate Content

Problem: You handle URL deduplication, but what about:

  • Syndicated content (AP/Reuters published by 100 outlets)
  • Rewrites (same story, slightly different words)
  • Plagiarism detection

This matters because:

  • You'll waste money processing the same content multiple times
  • Story clustering will be noisy
  • Authority scoring should credit the original

Recommendation: Add:

ARTICLE_DUPLICATES
├── article_id: fk → articles # The duplicate
├── original_article_id: fk # The original (or earliest)
├── similarity_score: float # 0-1
├── duplicate_type: enum # exact | near_duplicate | syndicated | rewrite
└── detected_at: timestamp

Detection method:
├── MinHash/LSH for body text
├── Exact match on first 500 chars hash
├── Embedding similarity > 0.95
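
A minimal sketch of the two cheaper checks (exact prefix hash and embedding similarity); MinHash/LSH for fuzzy body matching is omitted here. Thresholds follow the spec above.

// Sketch: exact-prefix and embedding-similarity duplicate checks.
async function prefixHash(bodyText: string): Promise<string> {
  const prefix = bodyText.slice(0, 500);
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(prefix));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function classifyDuplicate(samePrefixHash: boolean, embeddingSimilarity: number) {
  if (samePrefixHash) return "exact";
  if (embeddingSimilarity > 0.95) return "near_duplicate"; // per the 0.95 threshold above
  return null; // not a duplicate by these cheap checks
}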

2.4 Source Discovery & Onboarding

Problem: How do you discover new sources? Currently it seems like sources are manually added. But:

  • Google News returns articles from sources you've never seen
  • Those sources need configuration
  • Some sources will be garbage/spam

Recommendation: Add auto-discovery:

SOURCE_CANDIDATES
├── id: uuid
├── domain: string
├── first_seen_at: timestamp
├── article_count: int # How many times we've seen them
├── sample_urls: string[]
├── auto_detected_type: enum # news | blog | spam | unknown
├── auto_detected_quality: float # 0-1 heuristic
├── status: enum # pending_review | approved | rejected | blocked
├── reviewed_by: string?
├── reviewed_at: timestamp?
└── created_at: timestamp

When a new domain appears (see the sketch after this list):

  1. Log to SOURCE_CANDIDATES
  2. If seen 5+ times, flag for review
  3. Auto-approve if quality score > 0.8
  4. Auto-reject if spam signals detected
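
A minimal sketch of this triage applied to a SOURCE_CANDIDATES row; the thresholds come from the list above, and the spam-signal check is an illustrative stub.

// Sketch: triaging a source candidate.
interface SourceCandidate {
  domain: string;
  article_count: number;
  auto_detected_quality: number; // 0-1 heuristic
  status: "pending_review" | "approved" | "rejected" | "blocked";
}

function hasSpamSignals(candidate: SourceCandidate): boolean {
  // Placeholder: e.g. throwaway TLDs, keyword-stuffed domains, etc.
  return /\.(xyz|top|click)$/.test(candidate.domain);
}

function triageCandidate(candidate: SourceCandidate): SourceCandidate["status"] {
  if (hasSpamSignals(candidate)) return "rejected";
  if (candidate.article_count >= 5) {
    // Seen 5+ times: auto-approve on high quality, otherwise flag for review.
    return candidate.auto_detected_quality > 0.8 ? "approved" : "pending_review";
  }
  return candidate.status; // keep accumulating sightings
}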

2.5 Entity Resolution & Disambiguation

Problem: "Apple" could be the company, the fruit, or Apple Records. "Michael Jordan" could be the basketball player or the professor. You have NER but no disambiguation.

Recommendation: Add:

ENTITIES:
├── disambiguation_type: enum # unique | ambiguous | merged
├── wikidata_id: string? # Q312 for Apple Inc.
├── wikipedia_url: string?
├── related_entities: uuid[] # For disambiguation context

ENTITY_ALIASES:
├── alias: string # "AAPL", "Apple Inc", "Apple Computer"
├── entity_id: fk → entities
├── alias_type: enum # name | ticker | abbreviation | typo
├── confidence: float

Disambiguation strategy (see the sketch after this list):

  1. Check context (other entities in article)
  2. Check source (tech news → Apple Inc.)
  3. Check topic classification
  4. Use Wikidata for canonical resolution
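
A minimal sketch of steps 1-3 as a context-scoring pass: each candidate entity is scored by how many of its related entities co-occur in the article and by topic fit, falling back to Wikidata resolution when no candidate stands out. The weights and field names are illustrative.

// Sketch: context-based entity disambiguation.
interface CandidateEntity {
  entity_id: string;
  wikidata_id?: string;          // e.g. Q312 for Apple Inc.
  related_entity_ids: string[];  // disambiguation context
  typical_topics: string[];      // e.g. ["technology", "business"]
}

function scoreCandidate(
  candidate: CandidateEntity,
  articleEntityIds: Set<string>,
  articleTopics: Set<string>
): number {
  const contextHits = candidate.related_entity_ids.filter((id) => articleEntityIds.has(id)).length;
  const topicHits = candidate.typical_topics.filter((t) => articleTopics.has(t)).length;
  return contextHits * 2 + topicHits; // co-occurring entities weighted higher than topics
}

function disambiguate(candidates: CandidateEntity[], entityIds: Set<string>, topics: Set<string>) {
  const ranked = candidates
    .map((c) => ({ c, score: scoreCandidate(c, entityIds, topics) }))
    .sort((a, b) => b.score - a.score);
  // Step 4: fall back to Wikidata canonical resolution when nothing scores.
  return ranked[0] && ranked[0].score > 0 ? ranked[0].c : null;
}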

2.6 Bias & Reliability Scoring

Problem: You have political_lean but bias is more nuanced:

  • Factual accuracy vs opinion
  • Sensationalism
  • Clickbait
  • Bias by omission

This matters for credibility scoring and alerting users.

Recommendation: Add:

SOURCE_RELIABILITY
├── source_id: fk → sources
├── factual_accuracy: float # 0-1, from fact-check orgs
├── editorial_bias: float # -1 to +1
├── sensationalism_score: float # 0-1
├── transparency_score: float # Ownership, funding disclosed
├── mbfc_rating: string? # Media Bias Fact Check rating
├── newsguard_score: int? # NewsGuard 0-100
├── last_evaluated_at: timestamp
└── created_at: timestamp

Consider integrating:

  • Media Bias Fact Check API
  • NewsGuard (if budget allows)
  • Ad Fontes Media Bias Chart
  • AllSides bias ratings

2.7 Content Moderation

Problem: What if crawled content contains:

  • Hate speech
  • Graphic violence descriptions
  • Illegal content
  • NSFW material

You're processing it, storing it, and potentially surfacing it to customers.

Recommendation: Add:

ARTICLE_MODERATION
├── article_id: fk → articles
├── flagged: boolean
├── flags: string[] # ["hate_speech", "violence", "nsfw"]
├── confidence: float
├── reviewed: boolean
├── reviewer_decision: enum # approved | hidden | deleted
├── reviewed_by: string?
├── reviewed_at: timestamp?
└── created_at: timestamp

Options:

  1. Use Workers AI content moderation
  2. Add LLM classification step for safety
  3. Source-level blocklist for known bad actors
  4. Customer-configurable content filters
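
To make option 4 concrete, a minimal sketch of a customer-configurable filter applied against ARTICLE_MODERATION records at feed-assembly time; the shapes and field names are illustrative.

// Sketch: deciding whether a flagged article is visible to a given customer.
interface ModerationRecord {
  article_id: string;
  flagged: boolean;
  flags: string[];               // e.g. ["hate_speech", "violence", "nsfw"]
  reviewer_decision?: "approved" | "hidden" | "deleted";
}

interface CustomerContentFilter {
  blocked_flags: string[];       // flags this customer never wants to see
  include_unreviewed_flagged: boolean;
}

function isVisibleToCustomer(mod: ModerationRecord | undefined, filter: CustomerContentFilter): boolean {
  if (!mod || !mod.flagged) return true;                                        // nothing flagged
  if (mod.reviewer_decision === "hidden" || mod.reviewer_decision === "deleted") return false;
  if (mod.flags.some((f) => filter.blocked_flags.includes(f))) return false;    // customer opt-out
  if (!mod.reviewer_decision && !filter.include_unreviewed_flagged) return false;
  return true;
}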

2.8 Internationalization (i18n) of the Platform

Problem: You support multi-language articles, but what about:

  • Admin console in multiple languages
  • API error messages
  • Email notification templates
  • Taxonomy labels in multiple languages

Recommendation: For V1, document that UI/API is English-only. For V2, plan:

TRANSLATIONS
├── key: string # "error.rate_limit_exceeded"
├── language: string # ISO 639-1
├── value: text
└── created_at: timestamp

3. CONSISTENCY ISSUES — Mismatches Between Docs

3.1 Vectorize Index Count Mismatch

  • architecture.md system diagram: Shows 5 vector indexes
  • architecture.md deployment: Shows 7 vector indexes
  • flowchart.md: Shows 7 vector indexes

Fix: Update system diagram to show all 7.


3.2 Queue Count Mismatch

  • architecture.md: Says "8 topics" in the diagram, lists 9 in deployment
  • flowchart.md: Lists 9 queues

Fix: Update diagram to say "9 topics".


3.3 Worker Count

  • architecture.md: Lists 8 workers
  • flowchart.md: Shows a slightly different worker breakdown

Fix: Align on exact worker names and count.


3.4 Missing Dedup Step in architecture.md Flows

  • flowchart.md: Shows an explicit dedup step with URL hash check
  • architecture.md sub-flows: Don't show a dedup step

Fix: Update architecture.md pipeline diagrams to include dedup.


3.5 Missing Geo/Author in architecture.md Flows

  • flowchart.md: Shows Geo Extraction and Author Resolution as separate steps
  • architecture.md sub-flows: Don't show these

Fix: Update architecture.md extraction pipeline to match.


3.6 API Endpoints Incomplete in architecture.md

  • flowchart.md: Has comprehensive client + admin API sections
  • architecture.md: Has an abbreviated endpoint list

Fix: Either sync them or reference flowchart.md for full API spec.


4. ARCHITECTURE RISKS — Potential Problems at Scale

4.1 D1 Row Limits

Risk: Cloudflare D1 has limits. At scale:

  • 10,000 articles/day = 300,000/month = 3.6M/year
  • Plus all junction tables, classifications, events, etc.

Mitigation:

  • Implement aggressive retention (30-90 days for articles)
  • Archive to R2 before deletion
  • Consider D1 sharding strategy
  • Monitor row counts with alerts
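
A minimal sketch of the archive-then-delete retention step: copy expiring article rows to R2 as JSON, then remove them from D1. Binding names, the batch size, and the 90-day window are illustrative assumptions.

// Sketch: retention cleanup that archives to R2 before deleting from D1.
interface Env {
  DB: D1Database;
  ARCHIVE_BUCKET: R2Bucket;
}

async function archiveExpiredArticles(env: Env): Promise<number> {
  const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000).toISOString();
  const { results } = await env.DB
    .prepare("SELECT * FROM articles WHERE published_at < ? LIMIT 500")
    .bind(cutoff)
    .all();

  if (!results?.length) return 0;

  // One JSON object per archived batch, keyed by date for later retrieval.
  const key = `archive/articles/${new Date().toISOString().slice(0, 10)}-${crypto.randomUUID()}.json`;
  await env.ARCHIVE_BUCKET.put(key, JSON.stringify(results));

  const ids = results.map((r: any) => r.id);
  await env.DB
    .prepare(`DELETE FROM articles WHERE id IN (${ids.map(() => "?").join(",")})`)
    .bind(...ids)
    .run();

  return ids.length;
}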

4.2 Vectorize Index Size

Risk: 225K locations + millions of articles + authors + entities = large vector indexes

Mitigation:

  • Understand Vectorize limits (currently 5M vectors per index)
  • Plan for index sharding or tiering
  • Consider separate indexes for hot (recent) vs cold (archive) data

4.3 Cold Start Latency

Risk: Workers cold starts can add 50-200ms. For real-time APIs, this matters.

Mitigation:

  • Keep critical paths in single worker (avoid chaining)
  • Use KV caching aggressively
  • Consider always-on workers for critical paths (if available)

4.4 Queue Depth Runaway

Risk: If processing slows down, queues can back up infinitely.

Mitigation:

  • Set max queue depths with alerts
  • Implement circuit breakers (pause crawling if queue > threshold)
  • Dead letter queues for failed messages
  • Queue metrics in observability

4.5 Cost Explosion

Risk: A bug or misconfiguration could trigger massive external API usage.

Mitigation:

  • Hard budget limits (stop processing if daily cost > $X)
  • Per-service rate limits
  • Anomaly detection on cost velocity
  • Require manual approval for > 10x normal spend
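
A minimal sketch of the hard budget limit as a KV-backed circuit breaker checked before every paid external call. The binding name and dollar limit are illustrative, and KV counters are approximate, so this is a guardrail rather than an exact ledger.

// Sketch: daily budget circuit breaker for external API spend.
interface Env {
  BUDGET_KV: KVNamespace;
}

const DAILY_BUDGET_USD = 50; // hard stop; would come from config in practice

async function recordSpend(env: Env, costUsd: number): Promise<void> {
  const day = new Date().toISOString().slice(0, 10);
  const key = `spend:daily:${day}`;
  const current = Number((await env.BUDGET_KV.get(key)) ?? "0");
  await env.BUDGET_KV.put(key, String(current + costUsd), { expirationTtl: 3 * 24 * 3600 });
}

async function withinBudget(env: Env): Promise<boolean> {
  const day = new Date().toISOString().slice(0, 10);
  const spent = Number((await env.BUDGET_KV.get(`spend:daily:${day}`)) ?? "0");
  return spent < DAILY_BUDGET_USD; // if false: pause paid enrichment and alert an admin
}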

5. RECOMMENDATIONS — Priority Actions

Immediate (Before Coding)

  1. Add KEYWORD_SETS table — Core to the product
  2. Add PIPELINE_RUNS/ERRORS tables — Critical for debugging
  3. Add API_KEYS table — Required for auth
  4. Document rate limiting strategy — Required for crawling
  5. Fix consistency issues — Update diagrams

Short-term (During V1)

  1. Add CUSTOMER_EVENTS — Important for relevance improvement
  2. Add paywall detection — Or you'll waste money on garbage
  3. Add near-duplicate detection — Or story clustering will be noisy
  4. Add SOURCE_CANDIDATES — Or you'll never scale source coverage
  5. Add NOTIFICATION_LOG — Or you can't debug delivery issues

Medium-term (V1.1)

  1. Add article update detection — News changes
  2. Add entity disambiguation — Or NER is only half-useful
  3. Add content moderation — Liability risk
  4. Add export functionality — Customers will ask

Long-term (V2)

  1. Source reliability scoring — Differentiation
  2. Full bias/credibility analysis — Premium feature
  3. I18n infrastructure — International expansion

6. UPDATED TABLES TO ADD

Here's the complete list of missing tables to add to architecture.md:

KEYWORD_SETS
API_KEYS
ADMIN_TOKENS
PIPELINE_RUNS
PIPELINE_ERRORS
CUSTOMER_EVENTS
CUSTOMER_MEMORY
NOTIFICATION_TEMPLATES
NOTIFICATION_LOG
TAXONOMY_VERSIONS
RULES_VERSIONS
SCHEDULED_JOBS
EXPORTS
ARTICLE_REVISIONS
ARTICLE_DUPLICATES
ARTICLE_MODERATION
SOURCE_CANDIDATES
SOURCE_RELIABILITY
ENTITY_ALIASES

This document should be reviewed before starting implementation and updated as decisions are made.