
Phased Build Plan

Implementation roadmap for Topic Intel, from MVP to Enterprise


Project Summary

What is Topic Intel?

Topic Intel is a source-agnostic news monitoring and intelligence platform built entirely on Cloudflare's edge infrastructure. It enables businesses to track keywords, topics, and entities across news sources, receiving real-time alerts and AI-powered insights.

Core Value Proposition

  • For PR/Comms teams: Monitor brand mentions, competitor news, industry trends
  • For Investors: Track portfolio companies, market signals, regulatory filings
  • For Researchers: Follow topics, aggregate sources, export datasets
  • For Developers: API-first access to curated news intelligence

Technical Foundation

| Component      | Technology           | Purpose                         |
|----------------|----------------------|---------------------------------|
| Compute        | Cloudflare Workers   | Edge-native, serverless         |
| Database       | Cloudflare D1        | SQLite at the edge              |
| Object Storage | Cloudflare R2        | Raw HTML snapshots              |
| Cache          | Cloudflare KV        | Hot data, rate limits           |
| Vector Search  | Cloudflare Vectorize | Semantic search, clustering     |
| Queues         | Cloudflare Queues    | Async pipeline processing       |
| AI             | Workers AI + OpenAI  | Classification, embeddings, NER |

Key Architectural Decisions

  1. Source-Agnostic Crawler: Google News is the first adapter, but the architecture supports any content source (Twitter, Reddit, RSS, podcasts, video transcripts, etc.)

  2. Shared Keyword Pool: 1,000 customers tracking "bitcoin" = 1 crawl, not 1,000. Efficiency at scale through deduplication at the keyword level, fan-out at the match level.

  3. Dynamic Crawl Frequency: Keywords are tiered (hot/warm/normal/cold/frozen) based on rate of change. Hot keywords crawl every 15 minutes, frozen keywords once daily.

  4. Three-Tier Classification: Rules (free) → Vector matching (free) → LLM fallback (costly). Minimize AI spend while maximizing accuracy.

  5. Multiple Taxonomy Support: System taxonomies (DataForSEO categories), industry taxonomies, and customer-defined taxonomies coexist.

  6. Admin-Configurable Tiers: Subscription limits are not hardcoded; the admin console controls keyword limits, API quotas, and retention periods per tier.
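The dynamic crawl-frequency decision (#3) can be sketched as a tier table plus a promotion/demotion rule. The plan only fixes two intervals (hot = 15 minutes, frozen = daily); the warm/normal/cold intervals and the promotion thresholds below are illustrative assumptions.

```typescript
type CrawlTier = "hot" | "warm" | "normal" | "cold" | "frozen";

const CRAWL_INTERVAL_MINUTES: Record<CrawlTier, number> = {
  hot: 15,      // stated in the plan: every 15 minutes
  warm: 60,     // assumption
  normal: 240,  // assumption
  cold: 720,    // assumption
  frozen: 1440, // stated in the plan: once daily
};

// Promote or demote a keyword based on its observed rate of change,
// measured here as new articles seen in the last crawl window.
function nextTier(current: CrawlTier, newArticlesLastWindow: number): CrawlTier {
  const order: CrawlTier[] = ["frozen", "cold", "normal", "warm", "hot"];
  const i = order.indexOf(current);
  if (newArticlesLastWindow > 5) return order[Math.min(i + 1, order.length - 1)];
  if (newArticlesLastWindow === 0) return order[Math.max(i - 1, 0)];
  return current;
}

// Scheduler check: is this keyword due for a crawl?
function isDue(tier: CrawlTier, minutesSinceLastCrawl: number): boolean {
  return minutesSinceLastCrawl >= CRAWL_INTERVAL_MINUTES[tier];
}
```

Because crawling is deduplicated at the keyword level (decision #2), this schedule runs once per keyword regardless of how many customers subscribe to it.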

Current Status

| Area                  | Status         | Notes                                                           |
|-----------------------|----------------|-----------------------------------------------------------------|
| Architecture Design   | ✅ Complete    | 16,000+ lines of documentation                                  |
| Data Model            | ✅ Complete    | 35+ tables, views, triggers                                     |
| API Specification     | ✅ Complete    | OpenAPI 3.1, needs minor updates                                |
| Security Model        | ✅ Complete    | API keys, admin tokens, HMAC webhooks                           |
| External Integrations | ✅ Complete    | Google News, ZenRows, RapidAPI, DataForSEO, SharedCount, OpenAI |
| Phase 1 Planning      | ✅ Complete    | Ready to begin implementation                                   |
| Actual Code           | ❌ Not Started | Documentation-first approach                                    |

What's in Scope (All Phases)

  • News article monitoring from multiple sources
  • Keyword and topic subscriptions
  • Entity extraction and tracking
  • Classification and taxonomy management
  • Email and webhook notifications
  • Search (full-text and semantic)
  • Story clustering (articles → narratives)
  • AI briefings (daily/weekly summaries)
  • RAG-based Q&A against customer's feed
  • Video and podcast transcript processing (Phase 5+)
  • Multi-tenant enterprise features

What's NOT in Scope

  • Consumer mobile apps (API-first, B2B focus)
  • Social media posting/engagement (read-only monitoring)
  • Full social listening (Twitter/Reddit are future source adapters, not social management)
  • Content creation or ghostwriting

Phase Overview

This plan prioritizes de-risking the architecture by deferring complex ML features (story clustering, Q&A) to later phases while delivering core value early.

Phase 1: Core Pipeline (MVP)
└── Crawl → Extract → Store → Basic Feed
    Risk: Low | Value: High | Complexity: Medium

Phase 2: Intelligence Layer
└── Classification → Entity Extraction → Notifications
    Risk: Medium | Value: High | Complexity: Medium

Phase 3: Customer Experience
└── Search → Profiles → API Keys → Exports
    Risk: Low | Value: High | Complexity: Low

Phase 4: Advanced Intelligence
└── Story Clustering → Briefings → Q&A (RAG)
    Risk: HIGH | Value: Medium | Complexity: HIGH

Phase 5: Scale & Polish
└── Enterprise Features → Multi-tenant → Advanced Analytics
    Risk: Low | Value: Medium | Complexity: Medium

Phase 1: Core Pipeline (MVP)

Goal

Build the foundational crawl-to-feed pipeline. Articles flow from Google News to customer feeds without ML complexity.

Deliverables

| Component                | Description                                         | Priority |
|--------------------------|-----------------------------------------------------|----------|
| Source Adapter Framework | Pluggable adapter interface for any content source  | P0       |
| Google News Adapter      | First adapter: RSS fetch with 3-tier fallback       | P0       |
| Keyword Pool             | Global KEYWORDS table with subscription model       | P0       |
| Article Extractor        | HTML parsing, text extraction, deduplication        | P0       |
| Basic Storage            | D1 schema, R2 for raw HTML, URL dedup               | P0       |
| Keyword Matching         | Simple keyword → article matching                   | P0       |
| Customer Feed API        | GET /v1/feed with pagination                        | P0       |
| Admin Console (Basic)    | Crawl health dashboard, keyword management          | P1       |
| Direct RSS Adapter       | Second adapter: subscribe to publisher RSS directly | P1       |

Key Principle: The crawler is source-agnostic. Google News is just the first adapter. The architecture supports Twitter, Reddit, HackerNews, PR wires, government feeds, podcasts, and more.
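A hypothetical shape for the pluggable adapter interface is sketched below. The names (`SourceAdapter`, `RawItem`, `fetchBatch`) are illustrative, not from the spec; the point is that the unified pipeline only ever sees the interface, never the concrete source.

```typescript
interface RawItem {
  url: string;
  title: string;
  publishedAt: string; // ISO 8601
  sourceId: string;
  raw?: string;        // raw payload, destined for R2
}

interface SourceAdapter {
  /** Stable identifier, e.g. "google-news", "direct-rss". */
  id: string;
  /** Fetch candidate items for one keyword; the pipeline handles dedup. */
  fetchBatch(keyword: string): Promise<RawItem[]>;
}

// A trivial in-memory adapter standing in for Google News / Direct RSS.
class StaticAdapter implements SourceAdapter {
  constructor(public id: string, private items: RawItem[]) {}
  async fetchBatch(keyword: string): Promise<RawItem[]> {
    return this.items.filter((i) =>
      i.title.toLowerCase().includes(keyword.toLowerCase())
    );
  }
}

// Unified pipeline entry point: fan out to all adapters, then apply
// URL-level dedup, as in the Basic Storage deliverable.
async function crawlKeyword(adapters: SourceAdapter[], keyword: string): Promise<RawItem[]> {
  const batches = await Promise.all(adapters.map((a) => a.fetchBatch(keyword)));
  const seen = new Set<string>();
  return batches.flat().filter((i) => {
    if (seen.has(i.url)) return false;
    seen.add(i.url);
    return true;
  });
}
```

A future Twitter or Reddit adapter would only need to implement `fetchBatch`; nothing downstream changes.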

Data Model (Phase 1)

-- Core tables needed
source_adapters, sources, urls, articles, keywords, customer_keyword_subscriptions,
keyword_articles, customers, api_keys, keyword_crawl_history

API Endpoints (Phase 1)

Customer API:
GET    /v1/feed           # Articles matching subscribed keywords
GET    /v1/articles/:id   # Single article
GET    /v1/keywords       # Customer's subscribed keywords
POST   /v1/keywords       # Subscribe to keyword
DELETE /v1/keywords/:id   # Unsubscribe

Admin API:
GET  /v1/admin/crawl/health    # Fallback rates, success metrics
GET  /v1/admin/keywords        # Global keyword pool
POST /v1/admin/crawl/trigger   # Manual crawl
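One way to implement the feed API's pagination is an opaque keyset cursor over `(published_at, id)`, which stays stable as new articles arrive at the head of the feed. The cursor encoding below is an assumption for illustration, not from the API spec.

```typescript
interface FeedRow { id: number; publishedAt: string }

// Encode the last row of a page as an opaque base64url cursor.
function encodeCursor(row: FeedRow): string {
  return Buffer.from(`${row.publishedAt}|${row.id}`).toString("base64url");
}

function decodeCursor(cursor: string): { publishedAt: string; id: number } {
  const [publishedAt, id] = Buffer.from(cursor, "base64url").toString().split("|");
  return { publishedAt, id: Number(id) };
}

// Page through rows already sorted newest-first; in production this
// predicate becomes a WHERE clause against D1.
function feedPage(
  rows: FeedRow[],
  limit: number,
  cursor?: string,
): { items: FeedRow[]; nextCursor?: string } {
  let start = 0;
  if (cursor) {
    const c = decodeCursor(cursor);
    start = rows.findIndex(
      (r) => r.publishedAt < c.publishedAt || (r.publishedAt === c.publishedAt && r.id < c.id)
    );
    if (start === -1) return { items: [] };
  }
  const items = rows.slice(start, start + limit);
  const nextCursor =
    start + limit < rows.length ? encodeCursor(items[items.length - 1]) : undefined;
  return { items, nextCursor };
}
```

Unlike offset pagination, a keyset cursor never skips or repeats articles when new matches land between requests.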

Architecture (Phase 1)

┌──────────────┐     ┌──────────────────────────────────────┐
│    Cron      │────▶│            CRAWLER WORKER            │
│   (15min)    │     │                                      │
└──────────────┘     │  ┌────────────┐    ┌────────────┐    │
                     │  │   Google   │    │   Direct   │    │
                     │  │    News    │    │    RSS     │    │
                     │  │  Adapter   │    │  Adapter   │    │
                     │  └────────────┘    └────────────┘    │
                     │        ▼                 ▼           │
                     │  ┌──────────────────────────────┐    │
                     │  │   Unified Content Pipeline   │    │
                     │  └──────────────────────────────┘    │
                     └──────────────────┬───────────────────┘
                                        │
                            ┌───────────┴───────────┐
                            ▼                       ▼
                    ┌──────────────┐        ┌──────────────┐
                    │   R2 Raw     │        │    D1 DB     │
                    │   Storage    │        │  (articles)  │
                    └──────────────┘        └──────────────┘
                                                    │
                                                    ▼
                                            ┌──────────────┐
                                            │     API      │
                                            │   Gateway    │
                                            └──────────────┘

Source Adapters are pluggable. Phase 1 ships with Google News + Direct RSS. Future adapters (Twitter, Reddit, etc.) slot in without architecture changes.

Queues (Phase 1)

crawl.batch       # Keyword batches to crawl
article.extract   # URLs to fetch and parse

Success Criteria

  • Crawl 1000+ articles/day across 50+ keywords
  • < 5% fetch failure rate with fallback
  • < 500ms p95 feed API latency
  • Basic admin visibility into crawl health

Risks & Mitigations

| Risk                 | Mitigation                                  |
|----------------------|---------------------------------------------|
| Google News blocking | 3-tier fallback, rate limiting, IP rotation |
| D1 scale limits      | Proper indexing, archive old articles       |
| Keyword explosion    | Subscriber-only crawling, tier limits       |

Phase 2: Intelligence Layer

Goal

Add classification, entity extraction, and customer notifications without the risky ML clustering.

Deliverables

| Component                 | Description                                 | Priority |
|---------------------------|---------------------------------------------|----------|
| Embeddings                | Article embeddings via OpenAI or Workers AI | P0       |
| Rule-based Classification | Keyword patterns, source mapping            | P0       |
| Vector Classification     | Compare to taxonomy centroids               | P1       |
| Entity Extraction         | NER via Workers AI distilbert               | P0       |
| Location Extraction       | Geo-tagging articles                        | P1       |
| Social Metrics            | SharedCount integration                     | P2       |
| Backlink Metrics          | DataForSEO integration                      | P2       |
| Email Notifications       | Digest emails for matches                   | P0       |
| Webhook Notifications     | Real-time webhooks                          | P1       |

Data Model (Phase 2 additions)

-- Add to Phase 1
article_classifications, taxonomy_labels, entities, entity_mentions,
article_social_metrics, article_backlink_metrics, locations,
article_locations, notification_log, webhook_endpoints

Classification Pipeline

Article ──▶ Rules Engine (free)
               ├── Keyword patterns matched? ──▶ Done
               ▼
            Vector Match (free)
               ├── Top-K labels > 0.7 confidence? ──▶ Done
               ▼
            LLM Fallback (costly)
               └── Low confidence? ──▶ Review Queue
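The cascade above can be sketched as a single dispatch function. The 0.7 vector-confidence threshold is from the flow; the stage implementations are stubs, and the 0.5 LLM review threshold is an assumption (the plan only says "low confidence").

```typescript
type Stage = "rules" | "vector" | "llm" | "review";

interface ClassifierResult { labels: string[]; confidence: number }

// Stand-ins for the real engines: pattern rules, Vectorize centroid
// matching, and the paid LLM fallback.
interface Classifiers {
  rules(text: string): ClassifierResult | null;  // pattern match or null
  vector(text: string): ClassifierResult;        // centroid similarity
  llm(text: string): Promise<ClassifierResult>;  // costly fallback
}

async function classify(
  text: string,
  c: Classifiers,
  vectorThreshold = 0.7, // from the flow above
  llmThreshold = 0.5,    // assumption: below this, route to human review
): Promise<{ stage: Stage; labels: string[] }> {
  const r = c.rules(text);
  if (r) return { stage: "rules", labels: r.labels };

  const v = c.vector(text);
  if (v.confidence > vectorThreshold) return { stage: "vector", labels: v.labels };

  const l = await c.llm(text);
  if (l.confidence >= llmThreshold) return { stage: "llm", labels: l.labels };
  return { stage: "review", labels: l.labels };
}
```

Only articles that fall through both free stages incur LLM cost, which is what makes the 80%+ "no LLM needed" success criterion below meaningful.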

Queues (Phase 2 additions)

article.enrich     # Fetch social/backlinks
article.classify   # Run classification pipeline
notify.dispatch    # Send notifications

API Endpoints (Phase 2 additions)

Customer API:
GET  /v1/entities/:id    # Entity profile
GET  /v1/feed/entities   # Entities in feed
POST /v1/webhooks        # Register webhook
GET  /v1/webhooks        # List webhooks

Admin API:
GET  /v1/admin/taxonomy          # View taxonomy tree
POST /v1/admin/taxonomy/labels   # Add label
GET  /v1/admin/review/queue      # Low-confidence items
POST /v1/admin/review/:id        # Submit review
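The security model calls for HMAC-signed webhooks. A common scheme, assumed here rather than taken from the spec, signs the raw body together with a timestamp so receivers can verify both authenticity and freshness:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sender side: sign `timestamp.body` with the endpoint's shared secret.
function signWebhook(secret: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

// Receiver side: reject stale deliveries, then compare signatures in
// constant time to avoid timing attacks.
function verifyWebhook(
  secret: string,
  timestamp: number,
  body: string,
  signature: string,
  maxAgeSeconds = 300, // assumption: 5-minute replay window
): boolean {
  if (Math.abs(Date.now() / 1000 - timestamp) > maxAgeSeconds) return false;
  const expected = Buffer.from(signWebhook(secret, timestamp, body), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The signature and timestamp would travel in headers alongside the JSON payload; the exact header names are left to the API spec.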

Success Criteria

  • 80%+ of articles classified confidently without the LLM fallback
  • Entity extraction on 95%+ articles
  • < 1 hour latency from publish to notification
  • < $50/day external API costs

Phase 3: Customer Experience

Goal

Polish the customer-facing features: search, profiles, exports, API keys.

Deliverables

| Component            | Description                  | Priority |
|----------------------|------------------------------|----------|
| Full-text Search     | Keyword + semantic search    | P0       |
| Customer Profiles    | Saved search configurations  | P0       |
| Topic Subscriptions  | Subscribe to keyword bundles | P0       |
| API Key Management   | Self-service key creation    | P0       |
| Usage Analytics      | Track API usage per customer | P1       |
| Export (CSV/JSON)    | Bulk article export          | P1       |
| Customer Events      | Behavioral tracking          | P2       |

Data Model (Phase 3 additions)

-- Add to Phase 2
customer_profiles, customer_article_scores, customer_events,
topics, topic_keywords, customer_topic_subscriptions, exports

Vectorize Indexes

articles   # Full article search
profiles   # Customer preference matching
taxonomy   # Classification centroids
entities   # Entity search

API Endpoints (Phase 3 additions)

Customer API:
GET  /v1/search                 # Full-text + semantic search
GET  /v1/profiles               # List profiles
POST /v1/profiles               # Create profile
PUT  /v1/profiles/:id           # Update profile
GET  /v1/topics                 # Available topics
POST /v1/topics/:id/subscribe   # Subscribe to topic
GET  /v1/api-keys               # List API keys
POST /v1/api-keys               # Create API key
GET  /v1/usage                  # Usage stats
POST /v1/exports                # Request export
GET  /v1/exports/:id            # Download export
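A minimal sketch of the semantic half of GET /v1/search: embed the query, then rank stored article vectors by cosine similarity. In production the lookup would be a Vectorize index query; here it is brute force over an in-memory array purely to show the ranking logic.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

interface IndexedArticle { id: number; vector: number[] }

// Rank every indexed article against the query embedding, newest score first.
function semanticSearch(
  queryVector: number[],
  index: IndexedArticle[],
  topK = 10,
): { id: number; score: number }[] {
  return index
    .map((a) => ({ id: a.id, score: cosine(queryVector, a.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

The full-text half would run in parallel against D1, with results merged or interleaved; how the two result sets are blended is a design decision the plan leaves open.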

Success Criteria

  • < 200ms p95 search latency
  • Self-service API key generation
  • CSV/JSON export for all articles
  • Profile-based relevance scoring

Phase 4: Advanced Intelligence

Goal

Add the risky ML features: story clustering, AI briefings, RAG Q&A.

WARNING: This phase has the highest technical risk. Story clustering accuracy is difficult to achieve.

Deliverables

| Component        | Description                 | Priority | Risk   |
|------------------|-----------------------------|----------|--------|
| Story Clustering | Group articles into stories | P0       | HIGH   |
| Story Timeline   | Event timeline for stories  | P1       | Medium |
| Daily Briefings  | AI-generated summaries      | P1       | Medium |
| Weekly Briefings | Weekly digest generation    | P2       | Low    |
| RAG Q&A          | Ask questions against feed  | P2       | HIGH   |
| Trend Detection  | Keyword velocity alerts     | P2       | Medium |
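Trend detection can be sketched as a velocity ratio: mention volume in the current window versus a trailing baseline. The plan only names the feature; the ratio threshold and smoothing below are illustrative assumptions.

```typescript
// Velocity = (current + 1) / (baseline mean + 1). The +1 smoothing keeps
// brand-new keywords (baseline of zero) from dividing by zero.
function keywordVelocity(currentWindowCount: number, baselineCounts: number[]): number {
  const baseline =
    baselineCounts.reduce((a, b) => a + b, 0) / Math.max(baselineCounts.length, 1);
  return (currentWindowCount + 1) / (baseline + 1);
}

// Assumption: a 3x jump over baseline counts as trending.
function isTrending(currentWindowCount: number, baselineCounts: number[], ratio = 3): boolean {
  return keywordVelocity(currentWindowCount, baselineCounts) >= ratio;
}
```

The inputs map naturally onto the `keyword_velocity` table in the Phase 4 data model: one row per keyword per window.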

Story Clustering Approach

1. Embedding Similarity (40% weight)
   └── Cosine similarity > 0.85
2. Entity Overlap (30% weight)
   └── Jaccard similarity of entities > 0.5
3. Temporal Proximity (20% weight)
   └── Published within 48 hours
4. Headline Similarity (10% weight)
   └── TF-IDF or edit distance

Composite score > 0.75 ──▶ Same story
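The four signals and weights above can be combined as follows. One plausible reading, used here, treats the first three signals as pass/fail against their stated thresholds and the headline similarity as a continuous 0..1 score; the plan does not pin this down, so treat the binarization as an assumption.

```typescript
// Jaccard similarity of two entity sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

interface ArticlePairSignals {
  embeddingSimilarity: number; // cosine similarity, precomputed
  entitiesA: Set<string>;
  entitiesB: Set<string>;
  hoursApart: number;
  headlineSimilarity: number;  // TF-IDF or edit-distance based, 0..1
}

// Weights and per-signal thresholds from the list above.
function compositeScore(s: ArticlePairSignals): number {
  const embedding = s.embeddingSimilarity > 0.85 ? 1 : 0;
  const entities = jaccard(s.entitiesA, s.entitiesB) > 0.5 ? 1 : 0;
  const temporal = s.hoursApart <= 48 ? 1 : 0;
  return 0.4 * embedding + 0.3 * entities + 0.2 * temporal + 0.1 * s.headlineSimilarity;
}

function sameStory(s: ArticlePairSignals): boolean {
  return compositeScore(s) > 0.75;
}
```

Under this scheme an article pair must pass the embedding gate plus at least one other strong signal to clear 0.75, which matches the "start with a high threshold, tune down" mitigation below.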

Data Model (Phase 4 additions)

-- Add to Phase 3
stories, article_stories, story_timeline, trend_signals,
keyword_velocity, customer_story_subscriptions

Queues (Phase 4 additions)

story.cluster       # Clustering jobs
briefing.generate   # Briefing generation

API Endpoints (Phase 4 additions)

Customer API:
GET  /v1/stories                 # Story feed
GET  /v1/stories/:id             # Story with timeline
GET  /v1/stories/:id/articles    # Articles in story
POST /v1/stories/:id/subscribe   # Subscribe to story
GET  /v1/briefings/daily         # Daily briefing
GET  /v1/briefings/weekly        # Weekly briefing
POST /v1/qa                      # Ask a question (RAG)
GET  /v1/trends                  # Trending topics

Risk Mitigation

| Risk                | Mitigation                                  |
|---------------------|---------------------------------------------|
| Clustering accuracy | Start with high threshold (0.85), tune down |
| Over-clustering     | Manual review queue, customer feedback      |
| Under-clustering    | Periodic re-clustering job                  |
| LLM cost explosion  | Cache briefings, rate limit Q&A             |
| RAG hallucinations  | Strict retrieval, citation required         |

Success Criteria

  • 70%+ story clustering accuracy (measured via manual review)
  • < 5% false positive rate (unrelated articles in same story)
  • Daily briefing generation < 30 seconds
  • Q&A response < 5 seconds with citations

Phase 5: Scale & Polish

Goal

Enterprise features, multi-tenant isolation, advanced analytics, and operational polish.

Deliverables

| Component          | Description                     | Priority |
|--------------------|---------------------------------|----------|
| Enterprise SSO     | SAML/OIDC integration           | P1       |
| Team Management    | Multi-user accounts             | P1       |
| Usage Quotas       | Per-tier limits enforcement     | P0       |
| Advanced Analytics | Custom dashboards               | P2       |
| Audit Logging      | Full audit trail                | P1       |
| Data Retention     | Configurable retention policies | P1       |
| Multi-region       | Geographic distribution         | P3       |
| Custom Domains     | White-label API endpoints       | P3       |
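Per-tier quota enforcement follows from architectural decision #6: limits live in data, not code. The tier names and limit fields below are illustrative stand-ins; in production the table would live in D1 or KV and be editable from the admin console.

```typescript
interface TierLimits {
  maxKeywords: number;
  apiCallsPerDay: number;
  retentionDays: number;
}

// In-memory stand-in for the admin-editable tier table. All values are
// hypothetical, not from the plan.
const tierLimits = new Map<string, TierLimits>([
  ["free",       { maxKeywords: 5,     apiCallsPerDay: 1_000,     retentionDays: 30 }],
  ["pro",        { maxKeywords: 100,   apiCallsPerDay: 50_000,    retentionDays: 365 }],
  ["enterprise", { maxKeywords: 1_000, apiCallsPerDay: 1_000_000, retentionDays: 1_825 }],
]);

// Enforcement check used by POST /v1/keywords: fail closed on unknown tiers.
function canAddKeyword(tier: string, currentKeywordCount: number): boolean {
  const limits = tierLimits.get(tier);
  if (!limits) return false;
  return currentKeywordCount < limits.maxKeywords;
}
```

Because the console edits the table rather than the code, raising a customer's limits is a data change with an audit-log entry, not a deploy.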

Data Model (Phase 5 additions)

-- Add to Phase 4
admin_users, admin_audit_log, cost_budgets, cost_rollups
-- (many of these tables already exist; this phase enables their full functionality)

API Endpoints (Phase 5 additions)

Customer API:
GET  /v1/account/usage   # Detailed usage stats
GET  /v1/account/team    # Team members
POST /v1/account/team    # Invite team member

Admin API:
GET /v1/admin/customers       # All customers
GET /v1/admin/customers/:id   # Customer details
PUT /v1/admin/customers/:id   # Update customer tier
GET /v1/admin/analytics       # Platform analytics
GET /v1/admin/audit           # Audit log

Success Criteria

  • Enterprise SSO working with major IdPs
  • Sub-account management
  • Per-customer cost tracking
  • 99.9% uptime SLA achievable

Phase 5+: Future Capabilities

These are planned but not scheduled. They extend the platform into new content types and use cases.

Video & Podcast Transcripts

| Component           | Description                           | Technology                        |
|---------------------|---------------------------------------|-----------------------------------|
| Podcast Adapter     | Discover podcasts via RSS, iTunes API | RSS parsing                       |
| Audio Download      | Fetch audio files to R2               | HTTP + R2                         |
| Transcription       | Convert speech to text                | Whisper API (OpenAI) or Workers AI|
| Speaker Diarization | Identify who said what                | Future Whisper features           |
| Video Adapter       | YouTube, Vimeo caption extraction     | YouTube Data API                  |
| Video Transcription | Process videos without captions       | Whisper on audio track            |

Use Cases:

  • Monitor industry podcasts for mentions
  • Track executive interviews on YouTube
  • Extract quotes from earnings calls
  • Index conference talks

Additional Source Adapters

| Source     | Type       | Notes                                  |
|------------|------------|----------------------------------------|
| Twitter/X  | Social     | API v2, streaming, thread unrolling    |
| Reddit     | Social     | Subreddit monitoring, comment threads  |
| HackerNews | Social     | Firebase API, tech-focused             |
| LinkedIn   | Social     | Limited API, may need scraping         |
| Substack   | Newsletter | RSS + paywall handling                 |
| Medium     | Blog       | RSS available                          |
| arXiv      | Research   | Academic papers                        |
| SEC EDGAR  | Government | Regulatory filings                     |
| Patents    | Government | USPTO, EPO feeds                       |

Advanced AI Features

| Feature                       | Description                            | Phase |
|-------------------------------|----------------------------------------|-------|
| Claim Verification            | Cross-reference claims across sources  | 6+    |
| Predictive Trends             | ML-based trend forecasting             | 6+    |
| Automated Briefing Scheduling | Smart digest timing                    | 6+    |
| Multi-language Translation    | Real-time article translation          | 6+    |
| Custom Model Training         | Fine-tuned classifiers per customer    | 6+    |

Implementation Order Summary

Phase 1 (MVP):
├── Week 1-2: D1 schema, Keywords, Subscriptions
├── Week 3-4: Crawler with 3-tier fallback
├── Week 5-6: Extractor, article storage
├── Week 7-8: Feed API, basic admin console
└── MVP Launch

Phase 2 (Intelligence):
├── Week 9-10: Embeddings, Vectorize setup
├── Week 11-12: Classification pipeline (rules → vector)
├── Week 13-14: Entity extraction, locations
├── Week 15-16: Notifications (email, webhook)
└── Intelligence Launch

Phase 3 (Customer Experience):
├── Week 17-18: Search (full-text + semantic)
├── Week 19-20: Profiles, topics
├── Week 21-22: API keys, usage tracking
├── Week 23-24: Exports, polish
└── Full Customer Launch

Phase 4 (Advanced Intelligence):
├── Week 25-28: Story clustering (high risk, extra time)
├── Week 29-30: Story timeline, UI
├── Week 31-32: Briefings (daily/weekly)
├── Week 33-34: RAG Q&A, trends
└── Advanced Launch

Phase 5 (Scale & Polish):
├── Week 35-38: Enterprise features
├── Week 39-40: Multi-tenant polish
├── Week 41-42: Analytics, audit
├── Week 43-44: Performance, scale testing
└── Enterprise Launch

Technical Debt to Track

| Item                             | Phase to Address | Notes                                 |
|----------------------------------|------------------|---------------------------------------|
| Migrate KEYWORD_SETS to KEYWORDS | Phase 1          | Legacy table migration                |
| Vectorize index optimization     | Phase 3          | After search usage patterns clear     |
| LLM prompt optimization          | Phase 4          | After seeing real classification data |
| Cost optimization                | Phase 5          | After usage patterns established      |
| Test coverage gaps               | Each phase       | Maintain 80%+ coverage                |

Go/No-Go Criteria

Phase 1 → Phase 2

  • 1000+ articles/day throughput
  • < 5% fetch failure rate
  • Admin can see crawl health
  • At least 1 paying customer

Phase 2 → Phase 3

  • Classification working without excessive LLM costs
  • Notifications delivered < 1 hour
  • Entity extraction on 90%+ articles

Phase 3 → Phase 4

  • Search returning relevant results
  • Customer profiles working
  • API key self-service functional
  • Positive customer feedback

Phase 4 → Phase 5

  • Story clustering accuracy > 70%
  • Briefings generating successfully
  • Q&A returning cited answers
  • No major accuracy complaints

Last updated: 2024-01-15