Phased Build Plan
Implementation roadmap for Topic Intel - from MVP to Enterprise
Project Summary
What is Topic Intel?
Topic Intel is a source-agnostic news monitoring and intelligence platform built entirely on Cloudflare's edge infrastructure. It enables businesses to track keywords, topics, and entities across news sources, receiving real-time alerts and AI-powered insights.
Core Value Proposition
- For PR/Comms teams: Monitor brand mentions, competitor news, industry trends
- For Investors: Track portfolio companies, market signals, regulatory filings
- For Researchers: Follow topics, aggregate sources, export datasets
- For Developers: API-first access to curated news intelligence
Technical Foundation
| Component | Technology | Purpose |
|---|---|---|
| Compute | Cloudflare Workers | Edge-native, serverless |
| Database | Cloudflare D1 | SQLite at the edge |
| Object Storage | Cloudflare R2 | Raw HTML snapshots |
| Cache | Cloudflare KV | Hot data, rate limits |
| Vector Search | Cloudflare Vectorize | Semantic search, clustering |
| Queues | Cloudflare Queues | Async pipeline processing |
| AI | Workers AI + OpenAI | Classification, embeddings, NER |
Key Architectural Decisions
- Source-Agnostic Crawler: Google News is the first adapter, but the architecture supports any content source (Twitter, Reddit, RSS, podcasts, video transcripts, etc.)
- Shared Keyword Pool: 1,000 customers tracking "bitcoin" = 1 crawl, not 1,000. Efficiency at scale through deduplication at the keyword level, fan-out at the match level.
- Dynamic Crawl Frequency: Keywords are tiered (hot/warm/normal/cold/frozen) based on rate of change. Hot keywords crawl every 15 minutes, frozen keywords once daily.
- Three-Tier Classification: Rules (free) → Vector matching (free) → LLM fallback (costly). Minimize AI spend while maximizing accuracy.
- Multiple Taxonomy Support: System taxonomies (DataForSEO categories), industry taxonomies, and customer-defined taxonomies coexist.
- Admin-Configurable Tiers: Subscription limits are not hardcoded; the admin console controls keyword limits, API quotas, and retention periods per tier.
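The dynamic crawl-frequency tiering can be sketched as a simple mapping from a keyword's rate of change to a crawl interval. Only the hot = 15 minutes and frozen = once daily bounds come from this plan; the intermediate thresholds and intervals below are illustrative assumptions, not final values.

```typescript
// Sketch of dynamic crawl-frequency tiering. Intermediate thresholds
// (articles/day) and intervals are assumptions; only hot=15min and
// frozen=daily are specified in the plan.

type CrawlTier = "hot" | "warm" | "normal" | "cold" | "frozen";

// Map a keyword's observed rate of change (new matching articles per day)
// to a crawl tier and its crawl interval in minutes.
function assignTier(newArticlesPerDay: number): { tier: CrawlTier; intervalMinutes: number } {
  if (newArticlesPerDay >= 50) return { tier: "hot", intervalMinutes: 15 };
  if (newArticlesPerDay >= 10) return { tier: "warm", intervalMinutes: 60 };
  if (newArticlesPerDay >= 2) return { tier: "normal", intervalMinutes: 240 };
  if (newArticlesPerDay >= 0.2) return { tier: "cold", intervalMinutes: 720 };
  return { tier: "frozen", intervalMinutes: 1440 }; // once daily
}
```

Re-evaluating the tier on each crawl lets keywords migrate up or down as their news velocity changes.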
Current Status
| Area | Status | Notes |
|---|---|---|
| Architecture Design | ✅ Complete | 16,000+ lines of documentation |
| Data Model | ✅ Complete | 35+ tables, views, triggers |
| API Specification | ✅ Complete | OpenAPI 3.1, needs minor updates |
| Security Model | ✅ Complete | API keys, admin tokens, HMAC webhooks |
| External Integrations | ✅ Complete | Google News, ZenRows, RapidAPI, DataForSEO, SharedCount, OpenAI |
| Phase 1 Planning | ✅ Complete | Ready to begin implementation |
| Actual Code | ❌ Not Started | Documentation-first approach |
What's in Scope (All Phases)
- News article monitoring from multiple sources
- Keyword and topic subscriptions
- Entity extraction and tracking
- Classification and taxonomy management
- Email and webhook notifications
- Search (full-text and semantic)
- Story clustering (articles → narratives)
- AI briefings (daily/weekly summaries)
- RAG-based Q&A against customer's feed
- Video and podcast transcript processing (Phase 5+)
- Multi-tenant enterprise features
What's NOT in Scope
- Consumer mobile apps (API-first, B2B focus)
- Social media posting/engagement (read-only monitoring)
- Full social listening (Twitter/Reddit are future source adapters, not social management)
- Content creation or ghostwriting
Phase Overview
This plan prioritizes de-risking the architecture by deferring complex ML features (story clustering, Q&A) to later phases while delivering core value early.
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE OVERVIEW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: Core Pipeline (MVP) │
│ └── Crawl → Extract → Store → Basic Feed │
│ Risk: Low | Value: High | Complexity: Medium │
│ │
│ Phase 2: Intelligence Layer │
│ └── Classification → Entity Extraction → Notifications │
│ Risk: Medium | Value: High | Complexity: Medium │
│ │
│ Phase 3: Customer Experience │
│ └── Search → Profiles → API Keys → Exports │
│ Risk: Low | Value: High | Complexity: Low │
│ │
│ Phase 4: Advanced Intelligence │
│ └── Story Clustering → Briefings → Q&A (RAG) │
│ Risk: HIGH | Value: Medium | Complexity: HIGH │
│ │
│ Phase 5: Scale & Polish │
│ └── Enterprise Features → Multi-tenant → Advanced Analytics │
│ Risk: Low | Value: Medium | Complexity: Medium │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Phase 1: Core Pipeline (MVP)
Goal
Build the foundational crawl-to-feed pipeline. Articles flow from Google News to customer feeds without ML complexity.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Source Adapter Framework | Pluggable adapter interface for any content source | P0 |
| Google News Adapter | First adapter: RSS fetch with 3-tier fallback | P0 |
| Keyword Pool | Global KEYWORDS table with subscription model | P0 |
| Article Extractor | HTML parsing, text extraction, deduplication | P0 |
| Basic Storage | D1 schema, R2 for raw HTML, URL dedup | P0 |
| Keyword Matching | Simple keyword → article matching | P0 |
| Customer Feed API | GET /v1/feed with pagination | P0 |
| Admin Console (Basic) | Crawl health dashboard, keyword management | P1 |
| Direct RSS Adapter | Second adapter: subscribe to publisher RSS directly | P1 |
Key Principle: The crawler is source-agnostic. Google News is just the first adapter. The architecture supports Twitter, Reddit, HackerNews, PR wires, government feeds, podcasts, etc.
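A minimal sketch of what the pluggable adapter interface could look like. The names here (`SourceAdapter`, `RawItem`, `fetchForKeyword`) are hypothetical, since no code exists yet; the point is that every adapter, whatever its upstream protocol, normalizes results into one shape for the unified content pipeline.

```typescript
// Hypothetical source-adapter interface; names are illustrative.

interface RawItem {
  url: string;
  title: string;
  publishedAt?: string; // ISO 8601, when the source provides it
  sourceId: string;
}

interface SourceAdapter {
  readonly id: string;
  // Each adapter normalizes its upstream format (RSS, JSON API, scrape)
  // into RawItem[] before handing off to the unified pipeline.
  fetchForKeyword(keyword: string): Promise<RawItem[]>;
}

// Stub adapter standing in for the Google News RSS adapter.
class StubNewsAdapter implements SourceAdapter {
  readonly id = "stub-news";
  async fetchForKeyword(keyword: string): Promise<RawItem[]> {
    return [
      {
        url: `https://example.com/${encodeURIComponent(keyword)}`,
        title: `About ${keyword}`,
        sourceId: this.id,
      },
    ];
  }
}
```

New sources (Twitter, Reddit, podcasts) become new implementations of the same interface, which is what lets them slot in without pipeline changes.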
Data Model (Phase 1)
-- Core tables needed
source_adapters, sources, urls, articles, keywords, customer_keyword_subscriptions,
keyword_articles, customers, api_keys, keyword_crawl_history
API Endpoints (Phase 1)
Customer API:
GET /v1/feed # Articles matching subscribed keywords
GET /v1/articles/:id # Single article
GET /v1/keywords # Customer's subscribed keywords
POST /v1/keywords # Subscribe to keyword
DELETE /v1/keywords/:id # Unsubscribe
Admin API:
GET /v1/admin/crawl/health # Fallback rates, success metrics
GET /v1/admin/keywords # Global keyword pool
POST /v1/admin/crawl/trigger # Manual crawl
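Feed pagination on GET /v1/feed could use opaque keyset cursors rather than offsets, which stay fast on D1 as the articles table grows. The cursor format below (base64url of "timestamp|id") is an assumption for illustration.

```typescript
// Sketch of opaque keyset-pagination cursors for GET /v1/feed, assuming
// the feed is ordered by (published_at DESC, id DESC). The cursor format
// is an assumption, not a spec.

function encodeCursor(publishedAt: string, id: string): string {
  return Buffer.from(`${publishedAt}|${id}`).toString("base64url");
}

function decodeCursor(cursor: string): { publishedAt: string; id: string } {
  const [publishedAt, id] = Buffer.from(cursor, "base64url").toString("utf8").split("|");
  return { publishedAt, id };
}

// The corresponding D1 keyset query would look roughly like (illustrative;
// SQLite supports row-value comparisons):
//   SELECT * FROM articles
//   WHERE (published_at, id) < (?1, ?2)
//   ORDER BY published_at DESC, id DESC
//   LIMIT ?3
```

The last row of each page yields the next cursor, so the < 500ms p95 target does not degrade with deep pagination the way OFFSET would.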
Architecture (Phase 1)
┌──────────────┐ ┌──────────────────────────────────────┐
│ Cron │────▶│ CRAWLER WORKER │
│ (15min) │ │ │
└──────────────┘ │ ┌────────────┐ ┌────────────┐ │
│ │ Google │ │ Direct │ │
│ │ News │ │ RSS │ │
│ │ Adapter │ │ Adapter │ │
│ └────────────┘ └────────────┘ │
│ ▼ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Unified Content Pipeline │ │
│ └──────────────────────────────┘ │
└──────────────────┬──────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ R2 Raw │ │ D1 DB │
│ Storage │ │ (articles) │
└──────────────┘ └──────────────┘
│
▼
┌──────────────┐
│ API │
│ Gateway │
└──────────────┘
Source Adapters are pluggable. Phase 1 ships with Google News + Direct RSS. Future adapters (Twitter, Reddit, etc.) slot in without architecture changes.
Queues (Phase 1)
crawl.batch # Keyword batches to crawl
article.extract # URLs to fetch and parse
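The cron trigger fans keywords out onto crawl.batch in chunks. A rough sketch, with the queue binding reduced to a stand-in interface (Cloudflare Queues' `sendBatch()` is real, but check current per-batch limits; the 100-message cap used here is an assumption):

```typescript
// Sketch of fanning keywords onto the crawl.batch queue. QueueLike is a
// simplified stand-in for the Cloudflare Queues binding; the 100-message
// batch cap is an assumption to verify against current Queues limits.

interface QueueLike {
  sendBatch(msgs: { body: unknown }[]): Promise<void>;
}

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function enqueueCrawlBatches(keywords: string[], queue: QueueLike): Promise<number> {
  const batches = chunk(keywords, 100);
  for (const b of batches) {
    await queue.sendBatch(b.map((k) => ({ body: { keyword: k } })));
  }
  return batches.length;
}
```

The consumer Worker then runs each keyword through its adapter and pushes discovered URLs onto article.extract.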
Success Criteria
- Crawl 1000+ articles/day across 50+ keywords
- < 5% fetch failure rate with fallback
- < 500ms p95 feed API latency
- Basic admin visibility into crawl health
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Google News blocking | 3-tier fallback, rate limiting, IP rotation |
| D1 scale limits | Proper indexing, archive old articles |
| Keyword explosion | Subscriber-only crawling, tier limits |
Phase 2: Intelligence Layer
Goal
Add classification, entity extraction, and customer notifications without the risky ML clustering.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Embeddings | Article embeddings via OpenAI or Workers AI | P0 |
| Rule-based Classification | Keyword patterns, source mapping | P0 |
| Vector Classification | Compare to taxonomy centroids | P1 |
| Entity Extraction | NER via Workers AI distilbert | P0 |
| Location Extraction | Geo-tagging articles | P1 |
| Social Metrics | SharedCount integration | P2 |
| Backlink Metrics | DataForSEO integration | P2 |
| Email Notifications | Digest emails for matches | P0 |
| Webhook Notifications | Real-time webhooks | P1 |
Data Model (Phase 2 additions)
-- Add to Phase 1
article_classifications, taxonomy_labels, entities, entity_mentions,
article_social_metrics, article_backlink_metrics, locations,
article_locations, notification_log, webhook_endpoints
Classification Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ CLASSIFICATION FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Article ───▶ Rules Engine (free) │
│ │ │
│ ├── Keyword patterns matched? ──▶ Done │
│ │ │
│ ▼ │
│ Vector Match (free) │
│ │ │
│ ├── Top-K labels > 0.7 confidence? ──▶ Done │
│ │ │
│ ▼ │
│ LLM Fallback (costly) │
│ │ │
│ └── Low confidence? ──▶ Review Queue │
│ │
└─────────────────────────────────────────────────────────────────┘
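The flow above can be sketched as a short cascade. The 0.7 vector-confidence threshold comes from the diagram; the stage functions here are stubs, and applying the same 0.7 cutoff to the LLM result is an assumption.

```typescript
// Sketch of the three-tier classification cascade from the diagram above.
// Stage implementations are injected stubs; the 0.7 threshold is from the
// diagram, its reuse for the LLM stage is an assumption.

type Label = { label: string; confidence: number; method: "rules" | "vector" | "llm" };

async function classify(
  article: string,
  rules: (a: string) => Label | null,
  vectorMatch: (a: string) => Promise<Label | null>,
  llm: (a: string) => Promise<Label>,
): Promise<Label | { review: true; candidate: Label }> {
  const r = rules(article);
  if (r) return r;                         // tier 1: free, deterministic patterns
  const v = await vectorMatch(article);
  if (v && v.confidence > 0.7) return v;   // tier 2: free, embedding similarity
  const l = await llm(article);            // tier 3: costly LLM fallback
  return l.confidence > 0.7 ? l : { review: true, candidate: l }; // low confidence → review queue
}
```

Because the cascade short-circuits, LLM spend only accrues on the residue that rules and vectors cannot handle, which is what the 80%+ no-LLM success criterion measures.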
Queues (Phase 2 additions)
article.enrich # Fetch social/backlinks
article.classify # Run classification pipeline
notify.dispatch # Send notifications
API Endpoints (Phase 2 additions)
Customer API:
GET /v1/entities/:id # Entity profile
GET /v1/feed/entities # Entities in feed
POST /v1/webhooks # Register webhook
GET /v1/webhooks # List webhooks
Admin API:
GET /v1/admin/taxonomy # View taxonomy tree
POST /v1/admin/taxonomy/labels # Add label
GET /v1/admin/review/queue # Low-confidence items
POST /v1/admin/review/:id # Submit review
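The security model calls for HMAC-signed webhooks. A sketch of signing and constant-time verification; the SHA-256/hex choice and header conventions are assumptions, and a deployed Worker would use Web Crypto (`crypto.subtle`) rather than Node's `node:crypto` shown here for brevity.

```typescript
// Sketch of HMAC webhook signing/verification. Algorithm (SHA-256, hex)
// is an assumption; a Worker would use Web Crypto instead of node:crypto.
import { createHmac, timingSafeEqual } from "node:crypto";

function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body));
  const given = Buffer.from(signature);
  // timingSafeEqual requires equal lengths and avoids timing side channels.
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

The dispatcher signs the raw request body and sends the digest in a header; customers recompute it with their endpoint secret to reject spoofed or tampered deliveries.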
Success Criteria
- 80%+ of articles classified confidently by rules or vector matching (no LLM needed)
- Entity extraction on 95%+ articles
- < 1 hour latency from publish to notification
- < $50/day external API costs
Phase 3: Customer Experience
Goal
Polish the customer-facing features: search, profiles, exports, API keys.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Full-text Search | Keyword + semantic search | P0 |
| Customer Profiles | Saved search configurations | P0 |
| Topic Subscriptions | Subscribe to keyword bundles | P0 |
| API Key Management | Self-service key creation | P0 |
| Usage Analytics | Track API usage per customer | P1 |
| Export (CSV/JSON) | Bulk article export | P1 |
| Customer Events | Behavioral tracking | P2 |
Data Model (Phase 3 additions)
-- Add to Phase 2
customer_profiles, customer_article_scores, customer_events,
topics, topic_keywords, customer_topic_subscriptions, exports
Vectorize Indexes
articles # Full article search
profiles # Customer preference matching
taxonomy # Classification centroids
entities # Entity search
API Endpoints (Phase 3 additions)
Customer API:
GET /v1/search # Full-text + semantic search
GET /v1/profiles # List profiles
POST /v1/profiles # Create profile
PUT /v1/profiles/:id # Update profile
GET /v1/topics # Available topics
POST /v1/topics/:id/subscribe # Subscribe to topic
GET /v1/api-keys # List API keys
POST /v1/api-keys # Create API key
GET /v1/usage # Usage stats
POST /v1/exports # Request export
GET /v1/exports/:id # Download export
Success Criteria
- < 200ms p95 search latency
- Self-service API key generation
- CSV/JSON export for all articles
- Profile-based relevance scoring
Phase 4: Advanced Intelligence
Goal
Add the risky ML features: story clustering, AI briefings, RAG Q&A.
WARNING: This phase has the highest technical risk. Story clustering accuracy is difficult to achieve.
Deliverables
| Component | Description | Priority | Risk |
|---|---|---|---|
| Story Clustering | Group articles into stories | P0 | HIGH |
| Story Timeline | Event timeline for stories | P1 | Medium |
| Daily Briefings | AI-generated summaries | P1 | Medium |
| Weekly Briefings | Weekly digest generation | P2 | Low |
| RAG Q&A | Ask questions against feed | P2 | HIGH |
| Trend Detection | Keyword velocity alerts | P2 | Medium |
Story Clustering Approach
┌─────────────────────────────────────────────────────────────────┐
│ CLUSTERING SIGNALS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Embedding Similarity (40% weight) │
│ └── Cosine similarity > 0.85 │
│ │
│ 2. Entity Overlap (30% weight) │
│ └── Jaccard similarity of entities > 0.5 │
│ │
│ 3. Temporal Proximity (20% weight) │
│ └── Published within 48 hours │
│ │
│ 4. Headline Similarity (10% weight) │
│ └── TF-IDF or edit distance │
│ │
│ Composite score > 0.75 ──▶ Same story │
│ │
└─────────────────────────────────────────────────────────────────┘
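The signals above combine into a single score. The weights and thresholds below are taken directly from the diagram; the helper implementations (Jaccard, binarizing each signal) are simplified assumptions about how they would be applied.

```typescript
// Sketch of the composite clustering score from the diagram above.
// Weights/thresholds come from the diagram; binarizing the first three
// signals before weighting is a simplifying assumption.

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = Array.from(a).filter((x) => b.has(x)).length;
  const union = new Set(Array.from(a).concat(Array.from(b))).size;
  return union === 0 ? 0 : inter / union;
}

function compositeScore(
  embeddingSim: number,   // cosine similarity of article embeddings
  entitiesA: Set<string>,
  entitiesB: Set<string>,
  hoursApart: number,
  headlineSim: number,    // TF-IDF or normalized edit-distance similarity, in [0, 1]
): number {
  const embeddingSignal = embeddingSim > 0.85 ? 1 : 0;
  const entitySignal = jaccard(entitiesA, entitiesB) > 0.5 ? 1 : 0;
  const temporalSignal = hoursApart <= 48 ? 1 : 0;
  return 0.4 * embeddingSignal + 0.3 * entitySignal + 0.2 * temporalSignal + 0.1 * headlineSim;
}

// Same story when the composite score exceeds 0.75.
const sameStory = (score: number) => score > 0.75;
```

Note that with these weights, embedding similarity alone (0.4) cannot cross the 0.75 threshold; at least two strong signals must agree, which is the intended guard against over-clustering.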
Data Model (Phase 4 additions)
-- Add to Phase 3
stories, article_stories, story_timeline, trend_signals,
keyword_velocity, customer_story_subscriptions
Queues (Phase 4 additions)
story.cluster # Clustering jobs
briefing.generate # Briefing generation
API Endpoints (Phase 4 additions)
Customer API:
GET /v1/stories # Story feed
GET /v1/stories/:id # Story with timeline
GET /v1/stories/:id/articles # Articles in story
POST /v1/stories/:id/subscribe # Subscribe to story
GET /v1/briefings/daily # Daily briefing
GET /v1/briefings/weekly # Weekly briefing
POST /v1/qa # Ask a question (RAG)
GET /v1/trends # Trending topics
Risk Mitigation
| Risk | Mitigation |
|---|---|
| Clustering accuracy | Start with high threshold (0.85), tune down |
| Over-clustering | Manual review queue, customer feedback |
| Under-clustering | Periodic re-clustering job |
| LLM cost explosion | Cache briefings, rate limit Q&A |
| RAG hallucinations | Strict retrieval, citation required |
Success Criteria
- 70%+ story clustering accuracy (measure via manual review)
- < 5% false positive rate (unrelated articles in same story)
- Daily briefing generation < 30 seconds
- Q&A response < 5 seconds with citations
Phase 5: Scale & Polish
Goal
Enterprise features, multi-tenant isolation, advanced analytics, and operational polish.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Enterprise SSO | SAML/OIDC integration | P1 |
| Team Management | Multi-user accounts | P1 |
| Usage Quotas | Per-tier limits enforcement | P0 |
| Advanced Analytics | Custom dashboards | P2 |
| Audit Logging | Full audit trail | P1 |
| Data Retention | Configurable retention policies | P1 |
| Multi-region | Geographic distribution | P3 |
| Custom Domains | White-label API endpoints | P3 |
Data Model (Phase 5 additions)
-- Add to Phase 4
admin_users, admin_audit_log, cost_budgets, cost_rollups
(Many of these tables already exist from earlier phases; this phase enables their full functionality)
API Endpoints (Phase 5 additions)
Customer API:
GET /v1/account/usage # Detailed usage stats
GET /v1/account/team # Team members
POST /v1/account/team # Invite team member
Admin API:
GET /v1/admin/customers # All customers
GET /v1/admin/customers/:id # Customer details
PUT /v1/admin/customers/:id # Update customer tier
GET /v1/admin/analytics # Platform analytics
GET /v1/admin/audit # Audit log
Success Criteria
- Enterprise SSO working with major IdPs
- Sub-account management
- Per-customer cost tracking
- 99.9% uptime SLA achievable
Phase 5+: Future Capabilities
These are planned but not scheduled. They extend the platform into new content types and use cases.
Video & Podcast Transcripts
| Component | Description | Technology |
|---|---|---|
| Podcast Adapter | Discover podcasts via RSS, iTunes API | RSS parsing |
| Audio Download | Fetch audio files to R2 | HTTP + R2 |
| Transcription | Convert speech to text | Whisper API (OpenAI) or Workers AI |
| Speaker Diarization | Identify who said what | Future Whisper features |
| Video Adapter | YouTube, Vimeo caption extraction | YouTube Data API |
| Video Transcription | Process videos without captions | Whisper on audio track |
Use Cases:
- Monitor industry podcasts for mentions
- Track executive interviews on YouTube
- Extract quotes from earnings calls
- Index conference talks
Additional Source Adapters
| Source | Type | Notes |
|---|---|---|
| Twitter/X | Social | API v2, streaming, thread unrolling |
| Reddit | Social | Subreddit monitoring, comment threads |
| HackerNews | Social | Firebase API, tech-focused |
| | Social | Limited API, may need scraping |
| Substack | Newsletter | RSS + paywall handling |
| Medium | Blog | RSS available |
| arXiv | Research | Academic papers |
| SEC EDGAR | Government | Regulatory filings |
| Patents | Government | USPTO, EPO feeds |
Advanced AI Features
| Feature | Description | Phase |
|---|---|---|
| Claim Verification | Cross-reference claims across sources | 6+ |
| Predictive Trends | ML-based trend forecasting | 6+ |
| Automated Briefing Scheduling | Smart digest timing | 6+ |
| Multi-language Translation | Real-time article translation | 6+ |
| Custom Model Training | Fine-tuned classifiers per customer | 6+ |
Implementation Order Summary
Phase 1 (MVP):
├── Week 1-2: D1 schema, Keywords, Subscriptions
├── Week 3-4: Crawler with 3-tier fallback
├── Week 5-6: Extractor, article storage
├── Week 7-8: Feed API, basic admin console
└── MVP Launch
Phase 2 (Intelligence):
├── Week 9-10: Embeddings, Vectorize setup
├── Week 11-12: Classification pipeline (rules → vector)
├── Week 13-14: Entity extraction, locations
├── Week 15-16: Notifications (email, webhook)
└── Intelligence Launch
Phase 3 (Customer Experience):
├── Week 17-18: Search (full-text + semantic)
├── Week 19-20: Profiles, topics
├── Week 21-22: API keys, usage tracking
├── Week 23-24: Exports, polish
└── Full Customer Launch
Phase 4 (Advanced Intelligence):
├── Week 25-28: Story clustering (high risk, extra time)
├── Week 29-30: Story timeline, UI
├── Week 31-32: Briefings (daily/weekly)
├── Week 33-34: RAG Q&A, trends
└── Advanced Launch
Phase 5 (Scale & Polish):
├── Week 35-38: Enterprise features
├── Week 39-40: Multi-tenant polish
├── Week 41-42: Analytics, audit
├── Week 43-44: Performance, scale testing
└── Enterprise Launch
Technical Debt to Track
| Item | Phase to Address | Notes |
|---|---|---|
| Migrate KEYWORD_SETS to KEYWORDS | Phase 1 | Legacy table migration |
| Vectorize index optimization | Phase 3 | After search usage patterns clear |
| LLM prompt optimization | Phase 4 | After seeing real classification data |
| Cost optimization | Phase 5 | After usage patterns established |
| Test coverage gaps | Each phase | Maintain 80%+ coverage |
Go/No-Go Criteria
Phase 1 → Phase 2
- 1000+ articles/day throughput
- < 5% fetch failure rate
- Admin can see crawl health
- At least 1 paying customer
Phase 2 → Phase 3
- Classification working without excessive LLM costs
- Notifications delivered < 1 hour
- Entity extraction on 90%+ articles
Phase 3 → Phase 4
- Search returning relevant results
- Customer profiles working
- API key self-service functional
- Positive customer feedback
Phase 4 → Phase 5
- Story clustering accuracy > 70%
- Briefings generating successfully
- Q&A returning cited answers
- No major accuracy complaints
Last updated: 2024-01-15