
Gaps, Unknowns & Recommendations

Comprehensive audit of Noozer architecture — what's missing, what's unclear, and what you haven't thought of yet.


Executive Summary

After cross-referencing flowchart.md, architecture.md, and initial-requirements.md, I've identified:

  • 12 Gaps — things mentioned but not fully specified
  • 8 Unknown Unknowns — things you probably haven't considered
  • 6 Consistency Issues — mismatches between docs
  • 5 Architecture Risks — potential problems at scale

1. GAPS — Mentioned But Not Fully Specified

1.1 KEYWORD_SETS Table Missing

Problem: The flowchart and requirements mention "keyword sets" that customers create to define what to crawl, but there's no KEYWORD_SETS table in the data model.

Recommendation: Add:

KEYWORD_SETS
├── id: uuid
├── customer_id: fk → customers
├── name: string # "AI Industry News"
├── keywords: string[] # ["artificial intelligence", "machine learning", "LLM"]
├── boolean_query: string? # Advanced: "AI AND (startup OR funding) NOT crypto"
├── language: string # ISO 639-1
├── region: string # Google News region code (US, GB, etc)
├── google_news_ceid: string # e.g., "US:en"
├── crawl_frequency: enum # every_15min | hourly | daily
├── is_active: boolean
├── last_crawled_at: timestamp?
├── article_count: int # denormalized
├── created_at: timestamp
└── updated_at: timestamp
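
As a rough illustration of how these fields would drive crawling, here is a minimal TypeScript sketch that builds a Google News RSS search URL from a KEYWORD_SETS row. The row type and helper name are illustrative; the URL format follows the standard Google News RSS search parameters (q, hl, gl, ceid).

// Sketch: turning a KEYWORD_SETS row into a Google News RSS search URL.
// Field names mirror the proposed table; the row type is illustrative.
interface KeywordSetRow {
  keywords: string[];
  boolean_query: string | null;
  language: string;          // ISO 639-1, e.g. "en"
  region: string;            // Google News region code, e.g. "US"
  google_news_ceid: string;  // e.g. "US:en"
}

function googleNewsRssUrl(set: KeywordSetRow): string {
  // Prefer the advanced boolean query when present, otherwise OR the keywords.
  const q = set.boolean_query ?? set.keywords.map((k) => `"${k}"`).join(" OR ");
  const params = new URLSearchParams({
    q,
    hl: `${set.language}-${set.region}`, // e.g. "en-US"
    gl: set.region,                      // e.g. "US"
    ceid: set.google_news_ceid,          // e.g. "US:en"
  });
  return `https://news.google.com/rss/search?${params.toString()}`;
}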

1.2 PIPELINE_RUNS Table Missing

Problem: Admin endpoints reference /v1/admin/pipeline/runs but there's no table to store pipeline run history.

Recommendation: Add:

PIPELINE_RUNS
├── id: uuid
├── workflow_id: string # Cloudflare workflow instance ID
├── workflow_type: enum # scheduled_crawl | story_recluster | retention_cleanup
├── status: enum # running | completed | failed | cancelled
├── started_at: timestamp
├── completed_at: timestamp?
├── trigger: enum # cron | manual | api
├── triggered_by: string? # admin user ID if manual
├── metrics: json # {articles_found, articles_new, errors, etc}
├── error_message: text?
└── created_at: timestamp

PIPELINE_ERRORS
├── id: uuid
├── pipeline_run_id: fk → pipeline_runs
├── stage: string # crawl | extract | enrich | classify | etc
├── article_id: uuid?
├── error_type: string
├── error_message: text
├── stack_trace: text?
├── retryable: boolean
├── retry_count: int
└── created_at: timestamp

1.3 Rate Limiting Strategy Incomplete

Problem: Requirements say "≤1 req/sec per egress pattern" but there's no specification of:

  • How rate limits are structured in KV
  • Per-source vs per-domain vs global limits
  • Backoff strategy specifics
  • How limits reset

Recommendation: Document rate limit schema:

KV Keys:
├── rate:global:minute:{minute} → count
├── rate:source:{source_id}:minute:{m} → count
├── rate:domain:{domain}:minute:{m} → count
├── backoff:{source_id} → {until: timestamp, multiplier: int}

Rate Limit Rules:
├── Global: 100 req/min across all sources
├── Per-source: 10 req/min per source
├── Per-domain: 20 req/min per publisher domain
├── Google News: 60 req/min (RSS), 10 req/min (HTML)

Backoff Strategy:
├── On 429: backoff = min(base * 2^attempt, 1 hour)
├── On 403: mark source as blocked, alert admin
├── On 5xx: retry 3x with exponential backoff
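
A minimal sketch of how the per-source limit and backoff could be enforced with the KV keys above, assuming the crawler is a TypeScript Worker. The binding name (RATE_KV) is illustrative, and KV counters are eventually consistent, so this is a soft limit; a Durable Object would be needed for strict enforcement.

// Sketch of the per-source rate-limit check using the KV keys above.
interface Env {
  RATE_KV: KVNamespace;
}

const PER_SOURCE_LIMIT = 10; // req/min per source, per the rules above

async function canFetchSource(env: Env, sourceId: string): Promise<boolean> {
  // Respect any active backoff first.
  const backoff = await env.RATE_KV.get<{ until: number }>(`backoff:${sourceId}`, "json");
  if (backoff && backoff.until > Date.now()) return false;

  const minute = Math.floor(Date.now() / 60_000);
  const key = `rate:source:${sourceId}:minute:${minute}`;
  const count = Number((await env.RATE_KV.get(key)) ?? "0");
  if (count >= PER_SOURCE_LIMIT) return false;

  // Best-effort increment; expire the counter shortly after the window closes.
  await env.RATE_KV.put(key, String(count + 1), { expirationTtl: 120 });
  return true;
}

async function recordRateLimited(env: Env, sourceId: string, attempt: number): Promise<void> {
  // On 429: backoff = min(base * 2^attempt, 1 hour)
  const baseMs = 60_000;
  const delay = Math.min(baseMs * 2 ** attempt, 60 * 60_000);
  await env.RATE_KV.put(
    `backoff:${sourceId}`,
    JSON.stringify({ until: Date.now() + delay, multiplier: 2 ** attempt }),
    { expirationTtl: Math.ceil(delay / 1000) + 60 }
  );
}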

1.4 Customer Events/Analytics Missing

Problem: Requirements mention "Customer events (click/save/dismiss)" but there's no table for this. This is critical for:

  • Improving relevance scoring based on user behavior
  • Understanding what customers actually engage with
  • Feedback loop beyond explicit feedback

Recommendation: Add:

CUSTOMER_EVENTS
├── id: uuid
├── customer_id: fk → customers
├── event_type: enum # view | click | save | dismiss | share | dwell
├── article_id: uuid?
├── story_id: uuid?
├── profile_id: uuid?
├── source: enum # web | mobile | api | email
├── metadata: json # {dwell_time_ms, scroll_depth, etc}
├── session_id: string?
└── created_at: timestamp

Index: (customer_id, created_at)
Index: (article_id, event_type)

1.5 API Key Management Missing

Problem: Auth endpoints mentioned but no table for API keys.

Recommendation: Add:

API_KEYS
├── id: uuid
├── customer_id: fk → customers
├── key_hash: string # SHA-256 of the key (never store plaintext)
├── key_prefix: string # First 8 chars for identification "nz_live_abc..."
├── name: string # "Production Key"
├── scopes: string[] # ["read:feed", "write:keywords", etc]
├── rate_limit_tier: enum # standard | elevated | unlimited
├── last_used_at: timestamp?
├── expires_at: timestamp?
├── is_active: boolean
├── created_at: timestamp
└── revoked_at: timestamp?

ADMIN_TOKENS
├── id: uuid
├── user_id: string # Admin user (could be email or internal ID)
├── token_hash: string
├── permissions: string[] # ["admin:read", "admin:write", "admin:delete"]
├── ip_allowlist: string[]?
├── last_used_at: timestamp?
├── expires_at: timestamp
├── created_at: timestamp
└── revoked_at: timestamp?
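
To make the key_hash / key_prefix fields concrete, here is a minimal TypeScript sketch of key generation and verification using Web Crypto (available in Workers). The "nz_live_" prefix and key length are illustrative assumptions.

// Sketch: generating and hashing an API key; only the hash and prefix are stored.
async function sha256Hex(value: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(value));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

async function generateApiKey(): Promise<{ plaintext: string; key_hash: string; key_prefix: string }> {
  const bytes = crypto.getRandomValues(new Uint8Array(24));
  const plaintext = "nz_live_" + btoa(String.fromCharCode(...bytes))
    .replace(/[+/=]/g, "")   // keep it URL-safe
    .slice(0, 32);
  return {
    plaintext,                          // shown to the customer once, never stored
    key_hash: await sha256Hex(plaintext),
    key_prefix: plaintext.slice(0, 8),  // first 8 chars for identification
  };
}

// On each request: hash the presented key and compare against API_KEYS.key_hash.
async function verifyApiKey(presented: string, storedHash: string): Promise<boolean> {
  return (await sha256Hex(presented)) === storedHash;
}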

1.6 Notification Templates/Preferences Missing

Problem: Notifications mentioned but no structure for:

  • Email templates
  • Slack message formats
  • Webhook payload schemas
  • Per-channel preferences

Recommendation: Add:

NOTIFICATION_TEMPLATES
├── id: uuid
├── channel: enum # email | slack | webhook
├── event_type: enum # new_article | story_update | daily_digest | alert
├── template_name: string
├── subject_template: string? # For email
├── body_template: text # Mustache/Handlebars template
├── is_default: boolean
└── created_at: timestamp

NOTIFICATION_LOG
├── id: uuid
├── customer_id: fk
├── profile_id: fk?
├── article_id: uuid?
├── story_id: uuid?
├── channel: enum
├── status: enum # pending | sent | failed | bounced
├── sent_at: timestamp?
├── error_message: text?
├── retry_count: int
├── payload_hash: string # For deduplication
└── created_at: timestamp

1.7 Source Fetch Configuration Schema Unclear

Problem: SOURCES.fetch_config is defined as json but no schema specified.

Recommendation: Document expected structure:

{
  "rate_limit": {
    "requests_per_minute": 10,
    "requests_per_hour": 100
  },
  "fetch_strategy": "direct" | "zenrows" | "data4seo" | "auto",
  "selectors": {
    "article_body": "article.content, .post-content, main",
    "author": ".author-name, [rel=author]",
    "date": "time[datetime], .published-date"
  },
  "anti_bot": {
    "requires_js": false,
    "requires_zenrows": false,
    "custom_headers": {}
  },
  "extraction": {
    "use_readability": true,
    "extract_comments": false,
    "max_body_length": 50000
  }
}
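
The same structure expressed as a TypeScript type, which the fetch workers could use for validation. This mirrors the JSON above; nothing beyond those fields is assumed.

// Illustrative TypeScript shape for SOURCES.fetch_config.
interface FetchConfig {
  rate_limit: {
    requests_per_minute: number;
    requests_per_hour: number;
  };
  fetch_strategy: "direct" | "zenrows" | "data4seo" | "auto";
  selectors: {
    article_body: string;  // CSS selectors, comma-separated fallbacks
    author: string;
    date: string;
  };
  anti_bot: {
    requires_js: boolean;
    requires_zenrows: boolean;
    custom_headers: Record<string, string>;
  };
  extraction: {
    use_readability: boolean;
    extract_comments: boolean;
    max_body_length: number;
  };
}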

1.8 Memory/Context for Q&A Not Specified

Problem: Requirements mention "customer memory docs" and /v1/customers/{id}/memory but no structure defined.

Recommendation: Add:

CUSTOMER_MEMORY
├── id: uuid
├── customer_id: fk → customers
├── memory_type: enum # conversation | preferences | context
├── content: text # The actual memory content
├── embedding_id: string? # For semantic retrieval
├── r2_key: string? # For large documents
├── metadata: json
├── expires_at: timestamp?
├── created_at: timestamp
└── updated_at: timestamp

1.9 Taxonomy Version Control Not Specified

Problem: Requirements mention "versioned rulesets" and "rollback" but no versioning structure.

Recommendation: Add:

TAXONOMY_VERSIONS
├── id: uuid
├── version: int # Auto-incrementing
├── snapshot: json # Full taxonomy state at this version
├── changes: json # Diff from previous version
├── created_by: string # Admin user
├── change_reason: text?
├── is_active: boolean
└── created_at: timestamp

RULES_VERSIONS
├── id: uuid
├── version: int
├── rules: json # All rules at this version
├── created_by: string
├── change_reason: text?
├── is_active: boolean
└── created_at: timestamp

1.10 Scheduled Job Configuration Missing

Problem: Cron schedules mentioned but no way to configure them dynamically.

Recommendation: Add:

SCHEDULED_JOBS
├── id: uuid
├── job_type: enum # crawl | recluster | retention | enrichment_backfill
├── cron_expression: string # "*/15 * * * *"
├── is_enabled: boolean
├── last_run_at: timestamp?
├── next_run_at: timestamp?
├── config: json # Job-specific configuration
├── created_at: timestamp
└── updated_at: timestamp
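
A minimal sketch of how a cron-triggered Worker could read this table from D1 and dispatch due jobs onto a queue. Binding names (DB, JOB_QUEUE), the table/column casing, and the dispatch mechanism are assumptions; recomputing next_run_at from cron_expression is omitted.

// Sketch: cron handler that dispatches due SCHEDULED_JOBS onto a queue.
interface Env {
  DB: D1Database;
  JOB_QUEUE: Queue;
}

interface ScheduledJobRow {
  id: string;
  job_type: string;
  config: string; // JSON column
}

export default {
  async scheduled(_event: ScheduledEvent, env: Env, _ctx: ExecutionContext): Promise<void> {
    const now = new Date().toISOString();
    const { results } = await env.DB
      .prepare(
        "SELECT id, job_type, config FROM scheduled_jobs WHERE is_enabled = 1 AND next_run_at <= ?"
      )
      .bind(now)
      .all<ScheduledJobRow>();

    for (const job of results ?? []) {
      await env.JOB_QUEUE.send({ jobId: job.id, jobType: job.job_type, config: JSON.parse(job.config) });
      await env.DB
        .prepare("UPDATE scheduled_jobs SET last_run_at = ? WHERE id = ?")
        .bind(now, job.id)
        .run();
      // next_run_at would be recomputed from cron_expression (parser omitted here).
    }
  },
};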

1.11 Search Index Configuration Missing

Problem: Search endpoint exists but no specification of:

  • What fields are searchable
  • Full-text vs semantic search behavior
  • Filter options
  • Pagination strategy

Recommendation: Document search behavior:

Search Fields (Full-text):
├── headline (weight: 3.0)
├── subheadline (weight: 2.0)
├── body_text (weight: 1.0)
├── author names (weight: 1.5)
├── entity names (weight: 2.0)

Semantic Search:
├── Uses article embedding
├── Top-K = 100, then re-rank by relevance
├── Can combine with filters

Filters:
├── date_range: {from, to}
├── sources: uuid[]
├── topics: string[]
├── sentiment_range: {min, max}
├── language: string
├── location_id: uuid
├── story_id: uuid
├── has_media: boolean

Pagination:
├── Cursor-based for consistency
├── Max limit: 100 per page
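
A rough sketch of the semantic leg: embed the query with Workers AI, pull the top 100 candidates from Vectorize, then merge and re-rank with the full-text results. Binding names (AI, ARTICLE_INDEX) and the embedding model ID are assumptions, and the filter assumes `language` is configured as a metadata index.

// Sketch of semantic search: top-K = 100, then re-rank by relevance (per the spec above).
interface Env {
  AI: Ai;
  ARTICLE_INDEX: VectorizeIndex;
}

async function semanticSearch(env: Env, query: string, language: string) {
  // Embed the query text (model name is an assumption).
  const embedding = (await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [query] })) as {
    data: number[][];
  };
  const vector = embedding.data[0];

  const result = await env.ARTICLE_INDEX.query(vector, {
    topK: 100,
    filter: { language }, // other filters (dates, sources, topics) applied in D1
  });

  return result.matches.map((m) => ({ articleId: m.id, score: m.score }));
}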

1.12 Export/Reporting Functionality Missing

Problem: No way for customers to export their data or generate reports.

Recommendation: Add:

EXPORTS
├── id: uuid
├── customer_id: fk → customers
├── export_type: enum # articles | feed | analytics | full_backup
├── format: enum # json | csv | xlsx
├── filters: json # Same as search filters
├── status: enum # pending | processing | completed | failed
├── r2_key: string? # Location of export file
├── download_url: string? # Pre-signed URL (expires)
├── expires_at: timestamp?
├── file_size_bytes: int?
├── row_count: int?
├── created_at: timestamp
└── completed_at: timestamp?

API Endpoints:

POST /v1/exports                    Create export request
GET  /v1/exports                    List exports
GET  /v1/exports/:id                Get export status
GET  /v1/exports/:id/download       Get download URL

2. UNKNOWN UNKNOWNS — Things You Haven't Considered

2.1 Article Updates & Corrections

Problem: Articles change after publication. Headlines get edited, corrections issued, content updated. You capture updated_at but don't handle:

  • Detecting when an article has changed
  • Re-processing updated articles
  • Tracking the diff/history
  • Notifying users of significant changes

Recommendation: Add:

ARTICLE_REVISIONS
├── id: uuid
├── article_id: fk → articles
├── revision_number: int
├── detected_at: timestamp
├── previous_headline: string?
├── previous_body_hash: string
├── changes: json # {headline: true, body: true, ...}
├── significance: enum # minor | correction | major_update
├── raw_snapshot_r2_key: string
└── created_at: timestamp

Pipeline addition: Re-fetch articles after 24h and 7d to check for updates.
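
A minimal sketch of the re-fetch check, assuming a stored hash of the last processed body: compare the newly fetched headline and body hash against the stored values and emit an ARTICLE_REVISIONS row when they differ. The significance heuristic is illustrative.

// Sketch: revision detection on re-fetch.
interface StoredArticle {
  id: string;
  headline: string;
  body_hash: string;       // hash of body_text at last processing
  revision_count: number;
}

async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

async function detectRevision(stored: StoredArticle, fetchedHeadline: string, fetchedBody: string) {
  const newHash = await sha256Hex(fetchedBody);
  const headlineChanged = fetchedHeadline !== stored.headline;
  const bodyChanged = newHash !== stored.body_hash;
  if (!headlineChanged && !bodyChanged) return null;

  return {
    article_id: stored.id,
    revision_number: stored.revision_count + 1,
    previous_headline: headlineChanged ? stored.headline : null,
    previous_body_hash: stored.body_hash,
    changes: { headline: headlineChanged, body: bodyChanged },
    // Crude heuristic: headline edits treated as more significant than body tweaks.
    significance: headlineChanged ? "major_update" : "minor",
    detected_at: new Date().toISOString(),
  };
}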


2.2 Paywall Detection

Problem: Many articles are behind paywalls. You'll fetch them, get partial content, and process garbage. Need to:

  • Detect paywalled content
  • Mark articles as paywalled
  • Potentially skip enrichment/classification for paywalled articles
  • Track which sources have paywalls

Recommendation: Add:

ARTICLES:
├── paywall_status: enum # none | soft | hard | metered | unknown

SOURCES:
├── paywall_type: enum # none | soft | hard | metered
├── paywall_bypass_strategy: enum # none | zenrows | archive | skip

Detection signals:

  • Body text < 500 chars with "subscribe" keywords
  • Known paywall DOM patterns
  • Source metadata
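
A minimal sketch of these signals combined into a heuristic check; the thresholds, keyword list, and DOM patterns are illustrative assumptions, not a tested classifier.

// Sketch: heuristic paywall detection from extracted body text + raw HTML.
const PAYWALL_KEYWORDS = ["subscribe", "subscription", "sign in to continue", "already a member"];
const PAYWALL_DOM_PATTERNS = [/paywall/i, /piano-offer/i, /meteredContent/i];

function detectPaywall(bodyText: string, rawHtml: string): "none" | "soft" | "hard" | "unknown" {
  const shortBody = bodyText.trim().length < 500;
  const hasKeyword = PAYWALL_KEYWORDS.some((k) => bodyText.toLowerCase().includes(k));
  const hasDomPattern = PAYWALL_DOM_PATTERNS.some((p) => p.test(rawHtml));

  // Body text < 500 chars with "subscribe" keywords → likely a hard paywall.
  if (shortBody && hasKeyword) return "hard";
  if (shortBody && hasDomPattern) return "hard";
  // Full body present but paywall markup on the page → likely soft/metered.
  if (!shortBody && (hasKeyword || hasDomPattern)) return "soft";
  return shortBody ? "unknown" : "none";
}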

2.3 Duplicate/Near-Duplicate Content

Problem: You handle URL deduplication, but what about:

  • Syndicated content (AP/Reuters published by 100 outlets)
  • Rewrites (same story, slightly different words)
  • Plagiarism detection

This matters because:

  • You'll waste money processing the same content multiple times
  • Story clustering will be noisy
  • Authority scoring should credit the original

Recommendation: Add:

ARTICLE_DUPLICATES
├── article_id: fk → articles # The duplicate
├── original_article_id: fk # The original (or earliest)
├── similarity_score: float # 0-1
├── duplicate_type: enum # exact | near_duplicate | syndicated | rewrite
└── detected_at: timestamp

Detection method:
├── MinHash/LSH for body text
├── Exact match on first 500 chars hash
├── Embedding similarity > 0.95
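
A minimal sketch of the two cheaper checks (exact prefix hash and embedding similarity); MinHash/LSH for fuzzy body matching is omitted here. Thresholds follow the spec above.

// Sketch: exact-prefix and embedding-similarity duplicate checks.
async function prefixHash(bodyText: string): Promise<string> {
  const prefix = bodyText.slice(0, 500);
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(prefix));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function classifyDuplicate(samePrefixHash: boolean, embeddingSimilarity: number) {
  if (samePrefixHash) return "exact";
  if (embeddingSimilarity > 0.95) return "near_duplicate"; // per the 0.95 threshold above
  return null; // not a duplicate by these cheap checks
}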

2.4 Source Discovery & Onboarding

Problem: How do you discover new sources? Currently it seems like sources are manually added. But:

  • Google News returns articles from sources you've never seen
  • Those sources need configuration
  • Some sources will be garbage/spam

Recommendation: Add auto-discovery:

SOURCE_CANDIDATES
├── id: uuid
├── domain: string
├── first_seen_at: timestamp
├── article_count: int # How many times we've seen them
├── sample_urls: string[]
├── auto_detected_type: enum # news | blog | spam | unknown
├── auto_detected_quality: float # 0-1 heuristic
├── status: enum # pending_review | approved | rejected | blocked
├── reviewed_by: string?
├── reviewed_at: timestamp?
└── created_at: timestamp

When a new domain appears (see the sketch after this list):

  1. Log to SOURCE_CANDIDATES
  2. If seen 5+ times, flag for review
  3. Auto-approve if quality score > 0.8
  4. Auto-reject if spam signals detected
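
A minimal sketch of this triage applied to a SOURCE_CANDIDATES row; the thresholds come from the list above, and the spam-signal check is an illustrative stub.

// Sketch: triaging a source candidate.
interface SourceCandidate {
  domain: string;
  article_count: number;
  auto_detected_quality: number; // 0-1 heuristic
  status: "pending_review" | "approved" | "rejected" | "blocked";
}

function hasSpamSignals(candidate: SourceCandidate): boolean {
  // Placeholder: e.g. throwaway TLDs, keyword-stuffed domains, etc.
  return /\.(xyz|top|click)$/.test(candidate.domain);
}

function triageCandidate(candidate: SourceCandidate): SourceCandidate["status"] {
  if (hasSpamSignals(candidate)) return "rejected";
  if (candidate.article_count >= 5) {
    // Seen 5+ times: auto-approve on high quality, otherwise flag for review.
    return candidate.auto_detected_quality > 0.8 ? "approved" : "pending_review";
  }
  return candidate.status; // keep accumulating sightings
}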

2.5 Entity Resolution & Disambiguation

Problem: "Apple" could be the company, the fruit, or Apple Records. "Michael Jordan" could be the basketball player or the professor. You have NER but no disambiguation.

Recommendation: Add:

ENTITIES:
├── disambiguation_type: enum # unique | ambiguous | merged
├── wikidata_id: string? # Q312 for Apple Inc.
├── wikipedia_url: string?
├── related_entities: uuid[] # For disambiguation context

ENTITY_ALIASES:
├── alias: string # "AAPL", "Apple Inc", "Apple Computer"
├── entity_id: fk → entities
├── alias_type: enum # name | ticker | abbreviation | typo
├── confidence: float

Disambiguation strategy (see the sketch after this list):

  1. Check context (other entities in article)
  2. Check source (tech news → Apple Inc.)
  3. Check topic classification
  4. Use Wikidata for canonical resolution
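
A minimal sketch of steps 1-3 as a context-scoring pass: each candidate entity is scored by how many of its related entities co-occur in the article and by topic fit, falling back to Wikidata resolution when no candidate stands out. The weights and field names are illustrative.

// Sketch: context-based entity disambiguation.
interface CandidateEntity {
  entity_id: string;
  wikidata_id?: string;          // e.g. Q312 for Apple Inc.
  related_entity_ids: string[];  // disambiguation context
  typical_topics: string[];      // e.g. ["technology", "business"]
}

function scoreCandidate(
  candidate: CandidateEntity,
  articleEntityIds: Set<string>,
  articleTopics: Set<string>
): number {
  const contextHits = candidate.related_entity_ids.filter((id) => articleEntityIds.has(id)).length;
  const topicHits = candidate.typical_topics.filter((t) => articleTopics.has(t)).length;
  return contextHits * 2 + topicHits; // co-occurring entities weighted higher than topics
}

function disambiguate(candidates: CandidateEntity[], entityIds: Set<string>, topics: Set<string>) {
  const ranked = candidates
    .map((c) => ({ c, score: scoreCandidate(c, entityIds, topics) }))
    .sort((a, b) => b.score - a.score);
  // Step 4: fall back to Wikidata canonical resolution when nothing scores.
  return ranked[0] && ranked[0].score > 0 ? ranked[0].c : null;
}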

2.6 Bias & Reliability Scoring

Problem: You have political_lean but bias is more nuanced:

  • Factual accuracy vs opinion
  • Sensationalism
  • Clickbait
  • Bias by omission

This matters for credibility scoring and alerting users.

Recommendation: Add:

SOURCE_RELIABILITY
├── source_id: fk → sources
├── factual_accuracy: float # 0-1, from fact-check orgs
├── editorial_bias: float # -1 to +1
├── sensationalism_score: float # 0-1
├── transparency_score: float # Ownership, funding disclosed
├── mbfc_rating: string? # Media Bias Fact Check rating
├── newsguard_score: int? # NewsGuard 0-100
├── last_evaluated_at: timestamp
└── created_at: timestamp

Consider integrating:

  • Media Bias Fact Check API
  • NewsGuard (if budget allows)
  • Ad Fontes Media Bias Chart
  • AllSides bias ratings

2.7 Content Moderation

Problem: What if crawled content contains:

  • Hate speech
  • Graphic violence descriptions
  • Illegal content
  • NSFW material

You're processing it, storing it, and potentially surfacing it to customers.

Recommendation: Add:

ARTICLE_MODERATION
├── article_id: fk → articles
├── flagged: boolean
├── flags: string[] # ["hate_speech", "violence", "nsfw"]
├── confidence: float
├── reviewed: boolean
├── reviewer_decision: enum # approved | hidden | deleted
├── reviewed_by: string?
├── reviewed_at: timestamp?
└── created_at: timestamp

Options:

  1. Use Workers AI content moderation
  2. Add LLM classification step for safety
  3. Source-level blocklist for known bad actors
  4. Customer-configurable content filters
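
To make option 4 concrete, a minimal sketch of a customer-configurable filter applied against ARTICLE_MODERATION records at feed-assembly time; the shapes and field names are illustrative.

// Sketch: deciding whether a flagged article is visible to a given customer.
interface ModerationRecord {
  article_id: string;
  flagged: boolean;
  flags: string[];               // e.g. ["hate_speech", "violence", "nsfw"]
  reviewer_decision?: "approved" | "hidden" | "deleted";
}

interface CustomerContentFilter {
  blocked_flags: string[];       // flags this customer never wants to see
  include_unreviewed_flagged: boolean;
}

function isVisibleToCustomer(mod: ModerationRecord | undefined, filter: CustomerContentFilter): boolean {
  if (!mod || !mod.flagged) return true;                                        // nothing flagged
  if (mod.reviewer_decision === "hidden" || mod.reviewer_decision === "deleted") return false;
  if (mod.flags.some((f) => filter.blocked_flags.includes(f))) return false;    // customer opt-out
  if (!mod.reviewer_decision && !filter.include_unreviewed_flagged) return false;
  return true;
}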

2.8 Internationalization (i18n) of the Platform

Problem: You support multi-language articles, but what about:

  • Admin console in multiple languages
  • API error messages
  • Email notification templates
  • Taxonomy labels in multiple languages

Recommendation: For V1, document that UI/API is English-only. For V2, plan:

TRANSLATIONS
├── key: string # "error.rate_limit_exceeded"
├── language: string # ISO 639-1
├── value: text
└── created_at: timestamp

3. CONSISTENCY ISSUES — Mismatches Between Docs

3.1 Vectorize Index Count Mismatch

  • architecture.md system diagram: Shows 5 vector indexes
  • architecture.md deployment: Shows 7 vector indexes
  • flowchart.md: Shows 7 vector indexes

Fix: Update system diagram to show all 7.


3.2 Queue Count Mismatch

  • architecture.md: Says "8 topics" in the diagram, lists 9 in deployment
  • flowchart.md: Lists 9 queues

Fix: Update diagram to say "9 topics".


3.3 Worker Count

  • architecture.md: Lists 8 workers
  • flowchart.md: Shows a slightly different worker breakdown

Fix: Align on exact worker names and count.


3.4 Missing Dedup Step in architecture.md Flows

  • flowchart.md: Shows an explicit dedup step with URL hash check
  • architecture.md sub-flows: Don't show a dedup step

Fix: Update architecture.md pipeline diagrams to include dedup.


3.5 Missing Geo/Author in architecture.md Flows

  • flowchart.md: Shows Geo Extraction and Author Resolution as separate steps
  • architecture.md sub-flows: Don't show these

Fix: Update architecture.md extraction pipeline to match.


3.6 API Endpoints Incomplete in architecture.md

  • flowchart.md: Has comprehensive client + admin API sections
  • architecture.md: Has an abbreviated endpoint list

Fix: Either sync them or reference flowchart.md for full API spec.


4. ARCHITECTURE RISKS — Potential Problems at Scale

4.1 D1 Row Limits

Risk: Cloudflare D1 has limits. At scale:

  • 10,000 articles/day = 300,000/month = 3.6M/year
  • Plus all junction tables, classifications, events, etc.

Mitigation:

  • Implement aggressive retention (30-90 days for articles)
  • Archive to R2 before deletion
  • Consider D1 sharding strategy
  • Monitor row counts with alerts
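
A minimal sketch of the archive-then-delete retention step: copy expiring article rows to R2 as JSON, then remove them from D1. Binding names, the batch size, and the 90-day window are illustrative assumptions.

// Sketch: retention cleanup that archives to R2 before deleting from D1.
interface Env {
  DB: D1Database;
  ARCHIVE_BUCKET: R2Bucket;
}

async function archiveExpiredArticles(env: Env): Promise<number> {
  const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000).toISOString();
  const { results } = await env.DB
    .prepare("SELECT * FROM articles WHERE published_at < ? LIMIT 500")
    .bind(cutoff)
    .all();

  if (!results?.length) return 0;

  // One JSON object per archived batch, keyed by date for later retrieval.
  const key = `archive/articles/${new Date().toISOString().slice(0, 10)}-${crypto.randomUUID()}.json`;
  await env.ARCHIVE_BUCKET.put(key, JSON.stringify(results));

  const ids = results.map((r: any) => r.id);
  await env.DB
    .prepare(`DELETE FROM articles WHERE id IN (${ids.map(() => "?").join(",")})`)
    .bind(...ids)
    .run();

  return ids.length;
}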

4.2 Vectorize Index Size

Risk: 225K locations + millions of articles + authors + entities = large vector indexes

Mitigation:

  • Understand Vectorize limits (currently 5M vectors per index)
  • Plan for index sharding or tiering
  • Consider separate indexes for hot (recent) vs cold (archive) data

4.3 Cold Start Latency

Risk: Workers cold starts can add 50-200ms. For real-time APIs, this matters.

Mitigation:

  • Keep critical paths in single worker (avoid chaining)
  • Use KV caching aggressively
  • Consider always-on workers for critical paths (if available)

4.4 Queue Depth Runaway

Risk: If processing slows down, queues can back up infinitely.

Mitigation:

  • Set max queue depths with alerts
  • Implement circuit breakers (pause crawling if queue > threshold)
  • Dead letter queues for failed messages
  • Queue metrics in observability

4.5 Cost Explosion

Risk: A bug or misconfiguration could trigger massive external API usage.

Mitigation:

  • Hard budget limits (stop processing if daily cost > $X)
  • Per-service rate limits
  • Anomaly detection on cost velocity
  • Require manual approval for > 10x normal spend
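
A minimal sketch of the hard budget limit as a KV-backed circuit breaker checked before every paid external call. The binding name and dollar limit are illustrative, and KV counters are approximate, so this is a guardrail rather than an exact ledger.

// Sketch: daily budget circuit breaker for external API spend.
interface Env {
  BUDGET_KV: KVNamespace;
}

const DAILY_BUDGET_USD = 50; // hard stop; would come from config in practice

async function recordSpend(env: Env, costUsd: number): Promise<void> {
  const day = new Date().toISOString().slice(0, 10);
  const key = `spend:daily:${day}`;
  const current = Number((await env.BUDGET_KV.get(key)) ?? "0");
  await env.BUDGET_KV.put(key, String(current + costUsd), { expirationTtl: 3 * 24 * 3600 });
}

async function withinBudget(env: Env): Promise<boolean> {
  const day = new Date().toISOString().slice(0, 10);
  const spent = Number((await env.BUDGET_KV.get(`spend:daily:${day}`)) ?? "0");
  return spent < DAILY_BUDGET_USD; // if false: pause paid enrichment and alert an admin
}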

5. RECOMMENDATIONS — Priority Actions

Immediate (Before Coding)

  1. Add KEYWORD_SETS table — Core to the product
  2. Add PIPELINE_RUNS/ERRORS tables — Critical for debugging
  3. Add API_KEYS table — Required for auth
  4. Document rate limiting strategy — Required for crawling
  5. Fix consistency issues — Update diagrams

Short-term (During V1)

  1. Add CUSTOMER_EVENTS — Important for relevance improvement
  2. Add paywall detection — Or you'll waste money on garbage
  3. Add near-duplicate detection — Or story clustering will be noisy
  4. Add SOURCE_CANDIDATES — Or you'll never scale source coverage
  5. Add NOTIFICATION_LOG — Or you can't debug delivery issues

Medium-term (V1.1)

  1. Add article update detection — News changes
  2. Add entity disambiguation — Or NER is only half-useful
  3. Add content moderation — Liability risk
  4. Add export functionality — Customers will ask

Long-term (V2)

  1. Source reliability scoring — Differentiation
  2. Full bias/credibility analysis — Premium feature
  3. I18n infrastructure — International expansion

6. UPDATED TABLES TO ADD

Here's the complete list of missing tables to add to architecture.md:

KEYWORD_SETS
API_KEYS
ADMIN_TOKENS
PIPELINE_RUNS
PIPELINE_ERRORS
CUSTOMER_EVENTS
CUSTOMER_MEMORY
NOTIFICATION_TEMPLATES
NOTIFICATION_LOG
TAXONOMY_VERSIONS
RULES_VERSIONS
SCHEDULED_JOBS
EXPORTS
ARTICLE_REVISIONS
ARTICLE_DUPLICATES
ARTICLE_MODERATION
SOURCE_CANDIDATES
SOURCE_RELIABILITY
ENTITY_ALIASES

This document should be reviewed before starting implementation and updated as decisions are made.