System Flow Diagram
StoryIntel: News Intelligence Platform — 100% Cloudflare-Native
This is the hero diagram — the granular, detailed view of how data flows through the entire system. For a simpler high-level view, see Architecture Overview.
Table of Contents
- Master System Flow
- Pipeline Stages
- Cloudflare Workflows
- Client API Endpoints
- Admin API Endpoints
- Cost Model
Master System Flow
Pipeline Stages
Stage 1: Acquisition
Note: Workflows handle fan-out of keyword batches, with rate-limiting state stored in KV. Each batch respects per-source rate limits (≤1 req/sec sustained).
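A minimal sketch of that per-source check, assuming a KV namespace bound as RATE_KV and a per-second counter key (both the binding name and the key scheme are illustrative, not the actual implementation):

```ts
// Illustrative per-source rate check before a keyword batch is enqueued.
// RATE_KV and the key scheme are assumptions.
interface Env {
  RATE_KV: KVNamespace;
}

async function withinSourceLimit(env: Env, sourceId: string): Promise<boolean> {
  const second = Math.floor(Date.now() / 1000);
  const key = `rate:${sourceId}:${second}`;
  const current = Number((await env.RATE_KV.get(key)) ?? "0");
  if (current >= 1) return false; // enforce ≤1 req/sec sustained per source
  // KV's minimum TTL is 60 seconds, so stale counters expire on their own.
  await env.RATE_KV.put(key, String(current + 1), { expirationTtl: 60 });
  return true;
}
```

Because KV is eventually consistent, this is a best-effort check rather than a strict limiter.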
Stage 2: Extraction & Normalization
Note: Author resolution creates or updates rows in the AUTHORS table, incrementing article counts and recalculating trust scores based on source authority and historical engagement.
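As a rough illustration of that upsert (the authors schema and the trust formula below are assumptions for the sake of example, not the real definitions):

```ts
// Sketch of author resolution against D1. Assumes an AUTHORS table with a unique
// name column plus article_count and trust_score columns; the real schema may differ.
async function resolveAuthor(db: D1Database, name: string, sourceAuthority: number) {
  // Insert the author on first sight, otherwise bump the article count.
  await db
    .prepare(
      `INSERT INTO authors (name, article_count) VALUES (?1, 1)
       ON CONFLICT(name) DO UPDATE SET article_count = article_count + 1`
    )
    .bind(name)
    .run();

  // Placeholder trust recalculation blending source authority with volume;
  // historical engagement would feed into this step as well.
  await db
    .prepare(
      `UPDATE authors
       SET trust_score = MIN(1.0, ?2 * 0.7 + (article_count / 100.0) * 0.3)
       WHERE name = ?1`
    )
    .bind(name, sourceAuthority)
    .run();
}
```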
Stage 3: Enrichment
Note: Enrichment runs asynchronously after extraction. Social metrics inform relevance scoring; backlinks contribute to source/author authority calculations.
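For example, a blended relevance score might weight social engagement against recency and source authority; the field names and weights below are purely illustrative:

```ts
// Hypothetical relevance blend; the platform's actual weights and inputs are not
// specified here, this only shows the shape of the calculation.
interface EnrichmentSignals {
  shares: number;          // social share count
  comments: number;        // social comment count
  ageHours: number;        // hours since publication
  sourceAuthority: number; // 0..1, partly derived from backlink counts
}

function relevanceScore(s: EnrichmentSignals): number {
  const social = Math.log1p(s.shares + 2 * s.comments) / 10; // dampen viral outliers
  const recency = Math.exp(-s.ageHours / 24);                // decays over roughly a day
  return Math.min(1, 0.4 * social + 0.3 * recency + 0.3 * s.sourceAuthority);
}
```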
Stage 4: Processing (AI)
Note: Embeddings enable semantic search and story clustering. The summary is optional but improves briefing generation quality.
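A sketch of the embedding step using Workers AI and Vectorize; the binding names (AI, VECTORS) and the bge-base model choice are assumptions:

```ts
// Illustrative embedding + index upsert. Bindings and model are assumptions.
interface Env {
  AI: Ai;
  VECTORS: VectorizeIndex;
}

async function embedArticle(env: Env, articleId: string, title: string, body: string) {
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [`${title}\n\n${body.slice(0, 2000)}`],
  });
  // Store the vector so semantic search and story clustering can query it later.
  await env.VECTORS.upsert([{ id: articleId, values: data[0], metadata: { articleId } }]);
  return data[0];
}
```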
Stage 5: Classification
Note: The 3-step ladder minimizes cost: ~70% of articles are classified by rules and vectors alone. LLM specialists run only when needed, and each module logs its cost independently.
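A condensed sketch of that ladder; the thresholds, bindings, cost handling, and specialist model below are placeholders:

```ts
// Step-ladder classification: cheap rules first, vector neighbours second, and an
// LLM specialist only for the ambiguous remainder. Everything here is illustrative.
interface Env {
  AI: Ai;
  VECTORS: VectorizeIndex;
}

type Verdict = { label: string; confidence: number; tier: "rules" | "vectors" | "llm" };

async function classify(env: Env, article: { text: string; vector: number[] }): Promise<Verdict> {
  // 1. Rules: keyword heuristics are effectively free and settle the easy cases.
  if (/earnings|quarterly results|guidance/i.test(article.text)) {
    return { label: "finance", confidence: 0.95, tier: "rules" };
  }

  // 2. Vectors: borrow the label of the nearest labelled exemplar in Vectorize
  //    (assumes exemplars carry a `label` field in their metadata).
  const { matches } = await env.VECTORS.query(article.vector, { topK: 5, returnMetadata: "all" });
  const top = matches[0];
  if (top && top.metadata?.label && top.score >= 0.8) {
    return { label: String(top.metadata.label), confidence: top.score, tier: "vectors" };
  }

  // 3. LLM specialist: the expensive path; its spend would be logged as a COST_EVENT.
  const llm = (await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: `One-word topic label:\n${article.text.slice(0, 4000)}` }],
  })) as { response?: string };
  return { label: (llm.response ?? "unknown").trim().toLowerCase(), confidence: 0.7, tier: "llm" };
}
```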
Stage 6: Story Clustering
Note: Story clustering uses the story-recluster workflow for periodic batch re-clustering, while real-time clustering happens per-article via queues.
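Real-time, per-article assignment would look roughly like the queue consumer below; the message shape, bindings, and table name are assumptions, and the 0.75 similarity threshold mirrors the story-recluster workflow further down:

```ts
// Sketch of the per-article clustering consumer fed by the processing pipeline.
interface ClusterMessage {
  articleId: string;
  vector: number[];
}

interface Env {
  STORY_CENTROIDS: VectorizeIndex; // one vector per active story centroid
  DB: D1Database;
}

export default {
  async queue(batch: MessageBatch<ClusterMessage>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const { matches } = await env.STORY_CENTROIDS.query(msg.body.vector, { topK: 1 });
      const best = matches[0];
      if (best && best.score > 0.75) {
        await env.DB
          .prepare("INSERT OR IGNORE INTO article_stories (article_id, story_id) VALUES (?1, ?2)")
          .bind(msg.body.articleId, best.id)
          .run();
      }
      msg.ack(); // articles that match nothing are left for the batch re-cluster
    }
  },
};
```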
Stage 7: Matching & Delivery
Cloudflare Workflows
Workflows provide durable, stateful execution for complex multi-step operations that need retry logic, fan-out, and state persistence.
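In code, each workflow below is a class extending WorkflowEntrypoint whose steps are individually retried and checkpointed. A minimal skeleton, under assumed binding and class names:

```ts
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";

// Minimal Workflow skeleton. Every step.do() result is durably checkpointed, so a
// retried run resumes after the last completed step rather than starting over.
interface Env {
  DB: D1Database;
  CRAWL_QUEUE: Queue<unknown>;
}

type Params = { triggeredBy: "cron" | "admin" };

export class ScheduledCrawlWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(_event: WorkflowEvent<Params>, step: WorkflowStep) {
    const keywordSets = await step.do("load keyword sets", async () => {
      const { results } = await this.env.DB
        .prepare("SELECT * FROM keyword_sets WHERE enabled = 1")
        .all();
      return results;
    });

    await step.do(
      "fan out batches",
      { retries: { limit: 3, delay: "30 seconds", backoff: "exponential" } },
      async () => {
        // Queues accepts at most 100 messages per sendBatch call, matching the
        // "max 100 per batch" note in the diagram below.
        await this.env.CRAWL_QUEUE.sendBatch(
          keywordSets.slice(0, 100).map((ks) => ({ body: { keywordSetId: ks.id } }))
        );
      }
    );
  }
}
```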
Workflow: scheduled-crawl
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: scheduled-crawl │
│ Trigger: Cron (*/15 * * * *) or POST /v1/admin/crawl/trigger │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Load Active Keyword Sets │
│ └─▶ Query D1 for enabled keyword_sets matching crawl_frequency │
│ │
│ Step 2: Check Rate Limits │
│ └─▶ For each keyword_set: │
│ ├─▶ Read KV rate counters (per-source, global) │
│ ├─▶ If within limits: add to batch │
│ └─▶ If rate-limited: skip, log, schedule retry │
│ │
│ Step 3: Fan-Out Batches │
│ └─▶ Enqueue crawl.batch messages (max 100 per batch) │
│ └─▶ Workflow WAITS for batch completion (durable state) │
│ │
│ Step 4: Collect Results │
│ └─▶ Aggregate: new_articles[], failed[], rate_limited[] │
│ │
│ Step 5: Update State │
│ └─▶ Insert pipeline_run record to D1 │
│ └─▶ Update KV rate counters │
│ └─▶ If failure_rate > 20%: trigger alert │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
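Wiring both triggers to the workflow is a matter of creating an instance from the cron handler or the admin route; the SCHEDULED_CRAWL binding name is an assumption:

```ts
// Both triggers create an instance of the same workflow. create() returns a handle
// whose id can later be used to poll instance status.
interface Env {
  SCHEDULED_CRAWL: Workflow;
}

export default {
  // Cron trigger, e.g. "*/15 * * * *" listed under [triggers] crons in wrangler.toml.
  async scheduled(_controller: ScheduledController, env: Env): Promise<void> {
    await env.SCHEDULED_CRAWL.create({ params: { triggeredBy: "cron" } });
  },

  // Admin trigger: POST /v1/admin/crawl/trigger
  async fetch(req: Request, env: Env): Promise<Response> {
    if (req.method === "POST" && new URL(req.url).pathname === "/v1/admin/crawl/trigger") {
      const instance = await env.SCHEDULED_CRAWL.create({ params: { triggeredBy: "admin" } });
      return Response.json({ instanceId: instance.id }, { status: 202 });
    }
    return new Response("Not found", { status: 404 });
  },
};
```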
Workflow: story-recluster
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: story-recluster │
│ Trigger: Cron (0 */4 * * *) or POST /v1/admin/recluster │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Load Unclustered Articles │
│ └─▶ Query articles from last 48h not in any story │
│ │
│ Step 2: Load Active Stories │
│ └─▶ Query stories with status = breaking | developing │
│ └─▶ Fetch story centroids from Vectorize │
│ │
│ Step 3: Compute Similarities │
│ └─▶ For each unclustered article: │
│ ├─▶ Vector similarity to story centroids │
│ ├─▶ Entity overlap with story entities │
│ └─▶ Temporal proximity scoring │
│ │
│ Step 4: Assign or Create │
│ └─▶ If similarity > 0.75: assign to story │
│ └─▶ If 3+ similar unclustered: create new story │
│ └─▶ Update story centroids (rolling average) │
│ │
│ Step 5: Merge Detection │
│ └─▶ Find stories with >50% article overlap │
│ └─▶ Flag for admin review or auto-merge if confident │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
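The rolling-average centroid update in Step 4 is just an incremental mean: with n articles already in the story, folding in one more keeps the centroid equal to the mean of all member vectors. A sketch (binding and field names assumed):

```ts
// Incremental centroid update applied when an article is assigned to a story.
function updateCentroid(centroid: number[], articleVector: number[], n: number): number[] {
  return centroid.map((c, i) => (c * n + articleVector[i]) / (n + 1));
}

// Example: a 3-article story absorbing a fourth article, then writing the new
// centroid back to Vectorize.
// const next = updateCentroid(story.centroid, article.vector, 3);
// await env.STORY_CENTROIDS.upsert([{ id: story.id, values: next }]);
```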
Workflow: retention-cleanup
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: retention-cleanup │
│ Trigger: Cron (0 3 * * *) daily at 3 AM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Archive Mature Stories │
│ └─▶ Stories with no new articles in 30 days → status = archived │
│ │
│ Step 2: Purge Raw Snapshots │
│ └─▶ Delete R2 objects older than retention_days (default: 30) │
│ └─▶ Update D1 raw_snapshot_r2_key = null │
│ │
│ Step 3: Aggregate Cost Rollups │
│ └─▶ Compute daily COST_ROLLUPS from COST_EVENTS │
│ └─▶ Purge COST_EVENTS older than 90 days (keep rollups) │
│ │
│ Step 4: Cleanup Orphans │
│ └─▶ Delete ARTICLE_STORIES where story deleted │
│ └─▶ Delete embeddings for deleted articles │
│ │
│ Step 5: Report │
│ └─▶ Log storage freed, records deleted │
│ └─▶ Update D1 storage_stats │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
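Step 2's snapshot purge maps onto R2's list/delete API roughly as follows; the SNAPSHOTS binding and the raw/ prefix are assumptions:

```ts
// Sketch of the raw-snapshot purge. retentionDays defaults to 30 as in the workflow.
async function purgeOldSnapshots(env: { SNAPSHOTS: R2Bucket }, retentionDays = 30): Promise<number> {
  const cutoff = Date.now() - retentionDays * 24 * 60 * 60 * 1000;
  let cursor: string | undefined;
  let deleted = 0;

  do {
    const page = await env.SNAPSHOTS.list({ prefix: "raw/", cursor, limit: 1000 });
    const expired = page.objects
      .filter((obj) => obj.uploaded.getTime() < cutoff)
      .map((obj) => obj.key);
    if (expired.length > 0) {
      await env.SNAPSHOTS.delete(expired); // accepts a single key or an array of keys
      deleted += expired.length;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor);

  return deleted; // reported in Step 5 alongside the D1 rows removed
}
```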