Phased Build Plan
Implementation roadmap for Topic Intel - from MVP to Enterprise
Project Summary
What is Topic Intel?
Topic Intel is a source-agnostic news monitoring and intelligence platform built entirely on Cloudflare's edge infrastructure. It enables businesses to track keywords, topics, and entities across news sources, receiving real-time alerts and AI-powered insights.
Core Value Proposition
- For PR/Comms teams: Monitor brand mentions, competitor news, industry trends
- For Investors: Track portfolio companies, market signals, regulatory filings
- For Researchers: Follow topics, aggregate sources, export datasets
- For Developers: API-first access to curated news intelligence
Technical Foundation
| Component | Technology | Purpose |
|---|---|---|
| Compute | Cloudflare Workers | Edge-native, serverless |
| Database | Cloudflare D1 | SQLite at the edge |
| Object Storage | Cloudflare R2 | Raw HTML snapshots |
| Cache | Cloudflare KV | Hot data, rate limits |
| Vector Search | Cloudflare Vectorize | Semantic search, clustering |
| Queues | Cloudflare Queues | Async pipeline processing |
| AI | Workers AI + OpenAI | Classification, embeddings, NER |
Key Architectural Decisions
- Source-Agnostic Crawler: Google News is the first adapter, but the architecture supports any content source (Twitter, Reddit, RSS, podcasts, video transcripts, etc.)
- Shared Keyword Pool: 1,000 customers tracking "bitcoin" = 1 crawl, not 1,000. Efficiency at scale through deduplication at the keyword level, fan-out at the match level.
- Dynamic Crawl Frequency: Keywords are tiered (hot/warm/normal/cold/frozen) based on rate of change. Hot keywords crawl every 15 minutes, frozen keywords once daily.
- Three-Tier Classification: Rules (free) → Vector matching (free) → LLM fallback (costly). Minimize AI spend while maximizing accuracy.
- Multiple Taxonomy Support: System taxonomies (DataForSEO categories), industry taxonomies, and customer-defined taxonomies coexist.
- Admin-Configurable Tiers: Subscription limits are not hardcoded; the admin console controls keyword limits, API quotas, and retention periods per tier.
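The dynamic crawl-frequency tiering can be sketched as a simple mapping from a keyword's rate of change to a crawl interval. Only the hot = 15 minutes and frozen = once daily bounds come from this plan; the intermediate thresholds and intervals below are illustrative assumptions, not final values.

```typescript
// Sketch of dynamic crawl-frequency tiering. Intermediate thresholds
// (articles/day) and intervals are assumptions; only hot=15min and
// frozen=daily are specified in the plan.

type CrawlTier = "hot" | "warm" | "normal" | "cold" | "frozen";

// Map a keyword's observed rate of change (new matching articles per day)
// to a crawl tier and its crawl interval in minutes.
function assignTier(newArticlesPerDay: number): { tier: CrawlTier; intervalMinutes: number } {
  if (newArticlesPerDay >= 50) return { tier: "hot", intervalMinutes: 15 };
  if (newArticlesPerDay >= 10) return { tier: "warm", intervalMinutes: 60 };
  if (newArticlesPerDay >= 2) return { tier: "normal", intervalMinutes: 240 };
  if (newArticlesPerDay >= 0.2) return { tier: "cold", intervalMinutes: 720 };
  return { tier: "frozen", intervalMinutes: 1440 }; // once daily
}
```

Re-evaluating the tier on each crawl lets keywords migrate up or down as their news velocity changes.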
Current Status
| Area | Status | Notes |
|---|---|---|
| Architecture Design | ✅ Complete | 16,000+ lines of documentation |
| Data Model | ✅ Complete | 35+ tables, views, triggers |
| API Specification | ✅ Complete | OpenAPI 3.1, needs minor updates |
| Security Model | ✅ Complete | API keys, admin tokens, HMAC webhooks |
| External Integrations | ✅ Complete | Google News, ZenRows, RapidAPI, DataForSEO, SharedCount, OpenAI |
| Phase 1 Planning | ✅ Complete | Ready to begin implementation |
| Actual Code | ❌ Not Started | Documentation-first approach |
What's in Scope (All Phases)
- News article monitoring from multiple sources
- Keyword and topic subscriptions
- Entity extraction and tracking
- Classification and taxonomy management
- Email and webhook notifications
- Search (full-text and semantic)
- Story clustering (articles → narratives)
- AI briefings (daily/weekly summaries)
- RAG-based Q&A against customer's feed
- Video and podcast transcript processing (Phase 5+)
- Multi-tenant enterprise features
What's NOT in Scope
- Consumer mobile apps (API-first, B2B focus)
- Social media posting/engagement (read-only monitoring)
- Full social listening (Twitter/Reddit are future source adapters, not social management)
- Content creation or ghostwriting
Phase Overview
This plan prioritizes de-risking the architecture by deferring complex ML features (story clustering, Q&A) to later phases while delivering core value early.
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE OVERVIEW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: Core Pipeline (MVP) │
│ └── Crawl → Extract → Store → Basic Feed │
│ Risk: Low | Value: High | Complexity: Medium │
│ │
│ Phase 2: Intelligence Layer │
│ └── Classification → Entity Extraction → Notifications │
│ Risk: Medium | Value: High | Complexity: Medium │
│ │
│ Phase 3: Customer Experience │
│ └── Search → Profiles → API Keys → Exports │
│ Risk: Low | Value: High | Complexity: Low │
│ │
│ Phase 4: Advanced Intelligence │
│ └── Story Clustering → Briefings → Q&A (RAG) │
│ Risk: HIGH | Value: Medium | Complexity: HIGH │
│ │
│ Phase 5: Scale & Polish │
│ └── Enterprise Features → Multi-tenant → Advanced Analytics │
│ Risk: Low | Value: Medium | Complexity: Medium │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Phase 1: Core Pipeline (MVP)
Goal
Build the foundational crawl-to-feed pipeline. Articles flow from Google News to customer feeds without ML complexity.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Source Adapter Framework | Pluggable adapter interface for any content source | P0 |
| Google News Adapter | First adapter: RSS fetch with 3-tier fallback | P0 |
| Keyword Pool | Global KEYWORDS table with subscription model | P0 |
| Article Extractor | HTML parsing, text extraction, deduplication | P0 |
| Basic Storage | D1 schema, R2 for raw HTML, URL dedup | P0 |
| Keyword Matching | Simple keyword → article matching | P0 |
| Customer Feed API | GET /v1/feed with pagination | P0 |
| Admin Console (Basic) | Crawl health dashboard, keyword management | P1 |
| Direct RSS Adapter | Second adapter: subscribe to publisher RSS directly | P1 |
Key Principle: The crawler is source-agnostic. Google News is just the first adapter. The architecture supports Twitter, Reddit, HackerNews, PR wires, government feeds, podcasts, etc.
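A minimal sketch of what the pluggable adapter interface could look like. The names here (`SourceAdapter`, `RawItem`, `fetchForKeyword`) are hypothetical, since no code exists yet; the point is that every adapter, whatever its upstream protocol, normalizes results into one shape for the unified content pipeline.

```typescript
// Hypothetical source-adapter interface; names are illustrative.

interface RawItem {
  url: string;
  title: string;
  publishedAt?: string; // ISO 8601, when the source provides it
  sourceId: string;
}

interface SourceAdapter {
  readonly id: string;
  // Each adapter normalizes its upstream format (RSS, JSON API, scrape)
  // into RawItem[] before handing off to the unified pipeline.
  fetchForKeyword(keyword: string): Promise<RawItem[]>;
}

// Stub adapter standing in for the Google News RSS adapter.
class StubNewsAdapter implements SourceAdapter {
  readonly id = "stub-news";
  async fetchForKeyword(keyword: string): Promise<RawItem[]> {
    return [
      {
        url: `https://example.com/${encodeURIComponent(keyword)}`,
        title: `About ${keyword}`,
        sourceId: this.id,
      },
    ];
  }
}
```

New sources (Twitter, Reddit, podcasts) become new implementations of the same interface, which is what lets them slot in without pipeline changes.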
Data Model (Phase 1)
-- Core tables needed
source_adapters, sources, urls, articles, keywords, customer_keyword_subscriptions,
keyword_articles, customers, api_keys, keyword_crawl_history
API Endpoints (Phase 1)
Customer API:
GET /v1/feed # Articles matching subscribed keywords
GET /v1/articles/:id # Single article
GET /v1/keywords # Customer's subscribed keywords
POST /v1/keywords # Subscribe to keyword
DELETE /v1/keywords/:id # Unsubscribe
Admin API:
GET /v1/admin/crawl/health # Fallback rates, success metrics
GET /v1/admin/keywords # Global keyword pool
POST /v1/admin/crawl/trigger # Manual crawl
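Feed pagination on GET /v1/feed could use opaque keyset cursors rather than offsets, which stay fast on D1 as the articles table grows. The cursor format below (base64url of "timestamp|id") is an assumption for illustration.

```typescript
// Sketch of opaque keyset-pagination cursors for GET /v1/feed, assuming
// the feed is ordered by (published_at DESC, id DESC). The cursor format
// is an assumption, not a spec.

function encodeCursor(publishedAt: string, id: string): string {
  return Buffer.from(`${publishedAt}|${id}`).toString("base64url");
}

function decodeCursor(cursor: string): { publishedAt: string; id: string } {
  const [publishedAt, id] = Buffer.from(cursor, "base64url").toString("utf8").split("|");
  return { publishedAt, id };
}

// The corresponding D1 keyset query would look roughly like (illustrative;
// SQLite supports row-value comparisons):
//   SELECT * FROM articles
//   WHERE (published_at, id) < (?1, ?2)
//   ORDER BY published_at DESC, id DESC
//   LIMIT ?3
```

The last row of each page yields the next cursor, so the < 500ms p95 target does not degrade with deep pagination the way OFFSET would.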
Architecture (Phase 1)
┌──────────────┐ ┌──────────────────────────────────────┐
│ Cron │────▶│ CRAWLER WORKER │
│ (15min) │ │ │
└──────────────┘ │ ┌────────────┐ ┌────────────┐ │
│ │ Google │ │ Direct │ │
│ │ News │ │ RSS │ │
│ │ Adapter │ │ Adapter │ │
│ └────────────┘ └────────────┘ │
│ ▼ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Unified Content Pipeline │ │
│ └──────────────────────────────┘ │
└──────────────────┬──────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ R2 Raw │ │ D1 DB │
│ Storage │ │ (articles) │
└──────────────┘ └──────────────┘
│
▼
┌──────────────┐
│ API │
│ Gateway │
└──────────────┘
Source Adapters are pluggable. Phase 1 ships with Google News + Direct RSS. Future adapters (Twitter, Reddit, etc.) slot in without architecture changes.
Queues (Phase 1)
crawl.batch # Keyword batches to crawl
article.extract # URLs to fetch and parse
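The cron trigger fans keywords out onto crawl.batch in chunks. A rough sketch, with the queue binding reduced to a stand-in interface (Cloudflare Queues' `sendBatch()` is real, but check current per-batch limits; the 100-message cap used here is an assumption):

```typescript
// Sketch of fanning keywords onto the crawl.batch queue. QueueLike is a
// simplified stand-in for the Cloudflare Queues binding; the 100-message
// batch cap is an assumption to verify against current Queues limits.

interface QueueLike {
  sendBatch(msgs: { body: unknown }[]): Promise<void>;
}

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function enqueueCrawlBatches(keywords: string[], queue: QueueLike): Promise<number> {
  const batches = chunk(keywords, 100);
  for (const b of batches) {
    await queue.sendBatch(b.map((k) => ({ body: { keyword: k } })));
  }
  return batches.length;
}
```

The consumer Worker then runs each keyword through its adapter and pushes discovered URLs onto article.extract.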
Success Criteria
- Crawl 1000+ articles/day across 50+ keywords
- < 5% fetch failure rate with fallback
- < 500ms p95 feed API latency
- Basic admin visibility into crawl health
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Google News blocking | 3-tier fallback, rate limiting, IP rotation |
| D1 scale limits | Proper indexing, archive old articles |
| Keyword explosion | Subscriber-only crawling, tier limits |
Phase 2: Intelligence Layer
Goal
Add classification, entity extraction, and customer notifications without the risky ML clustering.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Embeddings | Article embeddings via OpenAI or Workers AI | P0 |
| Rule-based Classification | Keyword patterns, source mapping | P0 |
| Vector Classification | Compare to taxonomy centroids | P1 |
| Entity Extraction | NER via Workers AI distilbert | P0 |
| Location Extraction | Geo-tagging articles | P1 |
| Social Metrics | SharedCount integration | P2 |
| Backlink Metrics | DataForSEO integration | P2 |
| Email Notifications | Digest emails for matches | P0 |
| Webhook Notifications | Real-time webhooks | P1 |
Data Model (Phase 2 additions)
-- Add to Phase 1
article_classifications, taxonomy_labels, entities, entity_mentions,
article_social_metrics, article_backlink_metrics, locations,
article_locations, notification_log, webhook_endpoints
Classification Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ CLASSIFICATION FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Article ───▶ Rules Engine (free) │
│ │ │
│ ├── Keyword patterns matched? ──▶ Done │
│ │ │
│ ▼ │
│ Vector Match (free) │
│ │ │
│ ├── Top-K labels > 0.7 confidence? ──▶ Done │
│ │ │
│ ▼ │
│ LLM Fallback (costly) │
│ │ │
│ └── Low confidence? ──▶ Review Queue │
│ │
└─────────────────────────────────────────────────────────────────┘
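The flow above can be sketched as a short cascade. The 0.7 vector-confidence threshold comes from the diagram; the stage functions here are stubs, and applying the same 0.7 cutoff to the LLM result is an assumption.

```typescript
// Sketch of the three-tier classification cascade from the diagram above.
// Stage implementations are injected stubs; the 0.7 threshold is from the
// diagram, its reuse for the LLM stage is an assumption.

type Label = { label: string; confidence: number; method: "rules" | "vector" | "llm" };

async function classify(
  article: string,
  rules: (a: string) => Label | null,
  vectorMatch: (a: string) => Promise<Label | null>,
  llm: (a: string) => Promise<Label>,
): Promise<Label | { review: true; candidate: Label }> {
  const r = rules(article);
  if (r) return r;                         // tier 1: free, deterministic patterns
  const v = await vectorMatch(article);
  if (v && v.confidence > 0.7) return v;   // tier 2: free, embedding similarity
  const l = await llm(article);            // tier 3: costly LLM fallback
  return l.confidence > 0.7 ? l : { review: true, candidate: l }; // low confidence → review queue
}
```

Because the cascade short-circuits, LLM spend only accrues on the residue that rules and vectors cannot handle, which is what the 80%+ no-LLM success criterion measures.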
Queues (Phase 2 additions)
article.enrich # Fetch social/backlinks
article.classify # Run classification pipeline
notify.dispatch # Send notifications
API Endpoints (Phase 2 additions)
Customer API:
GET /v1/entities/:id # Entity profile
GET /v1/feed/entities # Entities in feed
POST /v1/webhooks # Register webhook
GET /v1/webhooks # List webhooks
Admin API:
GET /v1/admin/taxonomy # View taxonomy tree
POST /v1/admin/taxonomy/labels # Add label
GET /v1/admin/review/queue # Low-confidence items
POST /v1/admin/review/:id # Submit review
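The security model calls for HMAC-signed webhooks. A sketch of signing and constant-time verification; the SHA-256/hex choice and header conventions are assumptions, and a deployed Worker would use Web Crypto (`crypto.subtle`) rather than Node's `node:crypto` shown here for brevity.

```typescript
// Sketch of HMAC webhook signing/verification. Algorithm (SHA-256, hex)
// is an assumption; a Worker would use Web Crypto instead of node:crypto.
import { createHmac, timingSafeEqual } from "node:crypto";

function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body));
  const given = Buffer.from(signature);
  // timingSafeEqual requires equal lengths and avoids timing side channels.
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

The dispatcher signs the raw request body and sends the digest in a header; customers recompute it with their endpoint secret to reject spoofed or tampered deliveries.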
Success Criteria
- 80%+ of articles classified confidently by rules or vector matching (no LLM needed)
- Entity extraction on 95%+ articles
- < 1 hour latency from publish to notification
- < $50/day external API costs
Phase 3: Customer Experience
Goal
Polish the customer-facing features: search, profiles, exports, API keys.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Full-text Search | Keyword + semantic search | P0 |
| Customer Profiles | Saved search configurations | P0 |
| Topic Subscriptions | Subscribe to keyword bundles | P0 |
| API Key Management | Self-service key creation | P0 |
| Usage Analytics | Track API usage per customer | P1 |
| Export (CSV/JSON) | Bulk article export | P1 |
| Customer Events | Behavioral tracking | P2 |
Data Model (Phase 3 additions)
-- Add to Phase 2
customer_profiles, customer_article_scores, customer_events,
topics, topic_keywords, customer_topic_subscriptions, exports
Vectorize Indexes
articles # Full article search
profiles # Customer preference matching
taxonomy # Classification centroids
entities # Entity search
API Endpoints (Phase 3 additions)
Customer API:
GET /v1/search # Full-text + semantic search
GET /v1/profiles # List profiles
POST /v1/profiles # Create profile
PUT /v1/profiles/:id # Update profile
GET /v1/topics # Available topics
POST /v1/topics/:id/subscribe # Subscribe to topic
GET /v1/api-keys # List API keys
POST /v1/api-keys # Create API key
GET /v1/usage # Usage stats
POST /v1/exports # Request export
GET /v1/exports/:id # Download export
Success Criteria
- < 200ms p95 search latency
- Self-service API key generation
- CSV/JSON export for all articles
- Profile-based relevance scoring
Phase 4: Advanced Intelligence
Goal
Add the risky ML features: story clustering, AI briefings, RAG Q&A.
WARNING: This phase has the highest technical risk. Story clustering accuracy is difficult to achieve.
Deliverables
| Component | Description | Priority | Risk |
|---|---|---|---|
| Story Clustering | Group articles into stories | P0 | HIGH |
| Story Timeline | Event timeline for stories | P1 | Medium |
| Daily Briefings | AI-generated summaries | P1 | Medium |
| Weekly Briefings | Weekly digest generation | P2 | Low |
| RAG Q&A | Ask questions against feed | P2 | HIGH |
| Trend Detection | Keyword velocity alerts | P2 | Medium |
Story Clustering Approach
┌─────────────────────────────────────────────────────────────────┐
│ CLUSTERING SIGNALS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Embedding Similarity (40% weight) │
│ └── Cosine similarity > 0.85 │
│ │
│ 2. Entity Overlap (30% weight) │
│ └── Jaccard similarity of entities > 0.5 │
│ │
│ 3. Temporal Proximity (20% weight) │
│ └── Published within 48 hours │
│ │
│ 4. Headline Similarity (10% weight) │
│ └── TF-IDF or edit distance │
│ │
│ Composite score > 0.75 ──▶ Same story │
│ │
└─────────────────────────────────────────────────────────────────┘
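The signals above combine into a single score. The weights and thresholds below are taken directly from the diagram; the helper implementations (Jaccard, binarizing each signal) are simplified assumptions about how they would be applied.

```typescript
// Sketch of the composite clustering score from the diagram above.
// Weights/thresholds come from the diagram; binarizing the first three
// signals before weighting is a simplifying assumption.

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = Array.from(a).filter((x) => b.has(x)).length;
  const union = new Set(Array.from(a).concat(Array.from(b))).size;
  return union === 0 ? 0 : inter / union;
}

function compositeScore(
  embeddingSim: number,   // cosine similarity of article embeddings
  entitiesA: Set<string>,
  entitiesB: Set<string>,
  hoursApart: number,
  headlineSim: number,    // TF-IDF or normalized edit-distance similarity, in [0, 1]
): number {
  const embeddingSignal = embeddingSim > 0.85 ? 1 : 0;
  const entitySignal = jaccard(entitiesA, entitiesB) > 0.5 ? 1 : 0;
  const temporalSignal = hoursApart <= 48 ? 1 : 0;
  return 0.4 * embeddingSignal + 0.3 * entitySignal + 0.2 * temporalSignal + 0.1 * headlineSim;
}

// Same story when the composite score exceeds 0.75.
const sameStory = (score: number) => score > 0.75;
```

Note that with these weights, embedding similarity alone (0.4) cannot cross the 0.75 threshold; at least two strong signals must agree, which is the intended guard against over-clustering.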
Data Model (Phase 4 additions)
-- Add to Phase 3
stories, article_stories, story_timeline, trend_signals,
keyword_velocity, customer_story_subscriptions
Queues (Phase 4 additions)
story.cluster # Clustering jobs
briefing.generate # Briefing generation
API Endpoints (Phase 4 additions)
Customer API:
GET /v1/stories # Story feed
GET /v1/stories/:id # Story with timeline
GET /v1/stories/:id/articles # Articles in story
POST /v1/stories/:id/subscribe # Subscribe to story
GET /v1/briefings/daily # Daily briefing
GET /v1/briefings/weekly # Weekly briefing
POST /v1/qa # Ask a question (RAG)
GET /v1/trends # Trending topics
Risk Mitigation
| Risk | Mitigation |
|---|---|
| Clustering accuracy | Start with high threshold (0.85), tune down |
| Over-clustering | Manual review queue, customer feedback |
| Under-clustering | Periodic re-clustering job |
| LLM cost explosion | Cache briefings, rate limit Q&A |
| RAG hallucinations | Strict retrieval, citation required |
Success Criteria
- 70%+ story clustering accuracy (measure via manual review)
- < 5% false positive rate (unrelated articles in same story)
- Daily briefing generation < 30 seconds
- Q&A response < 5 seconds with citations
Phase 5: Scale & Polish
Goal
Enterprise features, multi-tenant isolation, advanced analytics, and operational polish.
Deliverables
| Component | Description | Priority |
|---|---|---|
| Enterprise SSO | SAML/OIDC integration | P1 |
| Team Management | Multi-user accounts | P1 |
| Usage Quotas | Per-tier limits enforcement | P0 |
| Advanced Analytics | Custom dashboards | P2 |
| Audit Logging | Full audit trail | P1 |
| Data Retention | Configurable retention policies | P1 |
| Multi-region | Geographic distribution | P3 |
| Custom Domains | White-label API endpoints | P3 |
Data Model (Phase 5 additions)
-- Add to Phase 4
admin_users, admin_audit_log, cost_budgets, cost_rollups
(Many of these tables already exist from earlier phases; this phase enables their full functionality)
API Endpoints (Phase 5 additions)
Customer API:
GET /v1/account/usage # Detailed usage stats
GET /v1/account/team # Team members
POST /v1/account/team # Invite team member
Admin API:
GET /v1/admin/customers # All customers
GET /v1/admin/customers/:id # Customer details
PUT /v1/admin/customers/:id # Update customer tier
GET /v1/admin/analytics # Platform analytics
GET /v1/admin/audit # Audit log
Success Criteria
- Enterprise SSO working with major IdPs
- Sub-account management
- Per-customer cost tracking
- 99.9% uptime SLA achievable
Phase 5+: Future Capabilities
These are planned but not scheduled. They extend the platform into new content types and use cases.
Video & Podcast Transcripts
| Component | Description | Technology |
|---|---|---|
| Podcast Adapter | Discover podcasts via RSS, iTunes API | RSS parsing |
| Audio Download | Fetch audio files to R2 | HTTP + R2 |
| Transcription | Convert speech to text | Whisper API (OpenAI) or Workers AI |
| Speaker Diarization | Identify who said what | Future Whisper features |
| Video Adapter | YouTube, Vimeo caption extraction | YouTube Data API |
| Video Transcription | Process videos without captions | Whisper on audio track |
Use Cases:
- Monitor industry podcasts for mentions
- Track executive interviews on YouTube
- Extract quotes from earnings calls
- Index conference talks
Additional Source Adapters
| Source | Type | Notes |
|---|---|---|
| Twitter/X | Social | API v2, streaming, thread unrolling |
| Reddit | Social | Subreddit monitoring, comment threads |
| HackerNews | Social | Firebase API, tech-focused |
| | Social | Limited API, may need scraping |
| Substack | Newsletter | RSS + paywall handling |
| Medium | Blog | RSS available |
| arXiv | Research | Academic papers |
| SEC EDGAR | Government | Regulatory filings |
| Patents | Government | USPTO, EPO feeds |
Advanced AI Features
| Feature | Description | Phase |
|---|---|---|
| Claim Verification | Cross-reference claims across sources | 6+ |
| Predictive Trends | ML-based trend forecasting | 6+ |
| Automated Briefing Scheduling | Smart digest timing | 6+ |
| Multi-language Translation | Real-time article translation | 6+ |
| Custom Model Training | Fine-tuned classifiers per customer | 6+ |
Implementation Order Summary
Phase 1 (MVP):
├── Week 1-2: D1 schema, Keywords, Subscriptions
├── Week 3-4: Crawler with 3-tier fallback
├── Week 5-6: Extractor, article storage
├── Week 7-8: Feed API, basic admin console
└── MVP Launch
Phase 2 (Intelligence):
├── Week 9-10: Embeddings, Vectorize setup
├── Week 11-12: Classification pipeline (rules → vector)
├── Week 13-14: Entity extraction, locations
├── Week 15-16: Notifications (email, webhook)
└── Intelligence Launch
Phase 3 (Customer Experience):
├── Week 17-18: Search (full-text + semantic)
├── Week 19-20: Profiles, topics
├── Week 21-22: API keys, usage tracking
├── Week 23-24: Exports, polish
└── Full Customer Launch
Phase 4 (Advanced Intelligence):
├── Week 25-28: Story clustering (high risk, extra time)
├── Week 29-30: Story timeline, UI
├── Week 31-32: Briefings (daily/weekly)
├── Week 33-34: RAG Q&A, trends
└── Advanced Launch
Phase 5 (Scale & Polish):
├── Week 35-38: Enterprise features
├── Week 39-40: Multi-tenant polish
├── Week 41-42: Analytics, audit
├── Week 43-44: Performance, scale testing
└── Enterprise Launch
Technical Debt to Track
| Item | Phase to Address | Notes |
|---|---|---|
| Migrate KEYWORD_SETS to KEYWORDS | Phase 1 | Legacy table migration |
| Vectorize index optimization | Phase 3 | After search usage patterns clear |
| LLM prompt optimization | Phase 4 | After seeing real classification data |
| Cost optimization | Phase 5 | After usage patterns established |
| Test coverage gaps | Each phase | Maintain 80%+ coverage |
Go/No-Go Criteria
Phase 1 → Phase 2
- 1000+ articles/day throughput
- < 5% fetch failure rate
- Admin can see crawl health
- At least 1 paying customer
Phase 2 → Phase 3
- Classification working without excessive LLM costs
- Notifications delivered < 1 hour
- Entity extraction on 90%+ articles
Phase 3 → Phase 4
- Search returning relevant results
- Customer profiles working
- API key self-service functional
- Positive customer feedback
Phase 4 → Phase 5
- Story clustering accuracy > 70%
- Briefings generating successfully
- Q&A returning cited answers
- No major accuracy complaints
Last updated: 2024-01-15