System Flow Diagram
StoryIntel: News Intelligence Platform — 100% Cloudflare-Native
This is the hero diagram — the granular, detailed view of how data flows through the entire system. For a simpler high-level view, see Architecture Overview.
Table of Contents
- Master System Flow
- Pipeline Stages
- Cloudflare Workflows
- Client API Endpoints
- Admin API Endpoints
- Cost Model
Master System Flow
Pipeline Stages
Stage 1: Acquisition
Note: Workflows handle fan-out of keyword batches, with rate-limiting state stored in KV. Each batch respects per-source rate limits (≤1 req/sec sustained).
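A minimal sketch of that per-source check, assuming a KV namespace bound as RATE_KV and a per-second counter key (both the binding name and the key scheme are illustrative, not the actual implementation):

```ts
// Illustrative per-source rate check before a keyword batch is enqueued.
// RATE_KV and the key scheme are assumptions.
interface Env {
  RATE_KV: KVNamespace;
}

async function withinSourceLimit(env: Env, sourceId: string): Promise<boolean> {
  const second = Math.floor(Date.now() / 1000);
  const key = `rate:${sourceId}:${second}`;
  const current = Number((await env.RATE_KV.get(key)) ?? "0");
  if (current >= 1) return false; // enforce ≤1 req/sec sustained per source
  // KV's minimum TTL is 60 seconds, so stale counters expire on their own.
  await env.RATE_KV.put(key, String(current + 1), { expirationTtl: 60 });
  return true;
}
```

Because KV is eventually consistent, this is a best-effort check rather than a strict limiter.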
Stage 2: Extraction & Normalization
Note: Author resolution creates or updates rows in the AUTHORS table, incrementing article counts and recalculating trust scores based on source authority and historical engagement.
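As a rough illustration of that upsert (the authors schema and the trust formula below are assumptions for the sake of example, not the real definitions):

```ts
// Sketch of author resolution against D1. Assumes an AUTHORS table with a unique
// name column plus article_count and trust_score columns; the real schema may differ.
async function resolveAuthor(db: D1Database, name: string, sourceAuthority: number) {
  // Insert the author on first sight, otherwise bump the article count.
  await db
    .prepare(
      `INSERT INTO authors (name, article_count) VALUES (?1, 1)
       ON CONFLICT(name) DO UPDATE SET article_count = article_count + 1`
    )
    .bind(name)
    .run();

  // Placeholder trust recalculation blending source authority with volume;
  // historical engagement would feed into this step as well.
  await db
    .prepare(
      `UPDATE authors
       SET trust_score = MIN(1.0, ?2 * 0.7 + (article_count / 100.0) * 0.3)
       WHERE name = ?1`
    )
    .bind(name, sourceAuthority)
    .run();
}
```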
Stage 3: Enrichment
Note: Enrichment runs asynchronously after extraction. Social metrics inform relevance scoring; backlinks contribute to source/author authority calculations.
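For example, a blended relevance score might weight social engagement against recency and source authority; the field names and weights below are purely illustrative:

```ts
// Hypothetical relevance blend; the platform's actual weights and inputs are not
// specified here, this only shows the shape of the calculation.
interface EnrichmentSignals {
  shares: number;          // social share count
  comments: number;        // social comment count
  ageHours: number;        // hours since publication
  sourceAuthority: number; // 0..1, partly derived from backlink counts
}

function relevanceScore(s: EnrichmentSignals): number {
  const social = Math.log1p(s.shares + 2 * s.comments) / 10; // dampen viral outliers
  const recency = Math.exp(-s.ageHours / 24);                // decays over roughly a day
  return Math.min(1, 0.4 * social + 0.3 * recency + 0.3 * s.sourceAuthority);
}
```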
Stage 4: Processing (AI)
Note: Embeddings enable semantic search and story clustering. The summary is optional but improves briefing generation quality.
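A sketch of the embedding step using Workers AI and Vectorize; the binding names (AI, VECTORS) and the bge-base model choice are assumptions:

```ts
// Illustrative embedding + index upsert. Bindings and model are assumptions.
interface Env {
  AI: Ai;
  VECTORS: VectorizeIndex;
}

async function embedArticle(env: Env, articleId: string, title: string, body: string) {
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [`${title}\n\n${body.slice(0, 2000)}`],
  });
  // Store the vector so semantic search and story clustering can query it later.
  await env.VECTORS.upsert([{ id: articleId, values: data[0], metadata: { articleId } }]);
  return data[0];
}
```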
Stage 5: Classification
Note: The 3-step ladder minimizes cost: ~70% of articles are classified by rules and vectors alone. LLM specialists run only when needed, and each module logs its cost independently.
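A condensed sketch of that ladder; the thresholds, bindings, cost handling, and specialist model below are placeholders:

```ts
// Step-ladder classification: cheap rules first, vector neighbours second, and an
// LLM specialist only for the ambiguous remainder. Everything here is illustrative.
interface Env {
  AI: Ai;
  VECTORS: VectorizeIndex;
}

type Verdict = { label: string; confidence: number; tier: "rules" | "vectors" | "llm" };

async function classify(env: Env, article: { text: string; vector: number[] }): Promise<Verdict> {
  // 1. Rules: keyword heuristics are effectively free and settle the easy cases.
  if (/earnings|quarterly results|guidance/i.test(article.text)) {
    return { label: "finance", confidence: 0.95, tier: "rules" };
  }

  // 2. Vectors: borrow the label of the nearest labelled exemplar in Vectorize
  //    (assumes exemplars carry a `label` field in their metadata).
  const { matches } = await env.VECTORS.query(article.vector, { topK: 5, returnMetadata: "all" });
  const top = matches[0];
  if (top && top.metadata?.label && top.score >= 0.8) {
    return { label: String(top.metadata.label), confidence: top.score, tier: "vectors" };
  }

  // 3. LLM specialist: the expensive path; its spend would be logged as a COST_EVENT.
  const llm = (await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: `One-word topic label:\n${article.text.slice(0, 4000)}` }],
  })) as { response?: string };
  return { label: (llm.response ?? "unknown").trim().toLowerCase(), confidence: 0.7, tier: "llm" };
}
```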
Stage 6: Story Clustering
Note: Story clustering uses the story-recluster workflow for periodic batch re-clustering, while real-time clustering happens per-article via queues.
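Real-time, per-article assignment would look roughly like the queue consumer below; the message shape, bindings, and table name are assumptions, and the 0.75 similarity threshold mirrors the story-recluster workflow further down:

```ts
// Sketch of the per-article clustering consumer fed by the processing pipeline.
interface ClusterMessage {
  articleId: string;
  vector: number[];
}

interface Env {
  STORY_CENTROIDS: VectorizeIndex; // one vector per active story centroid
  DB: D1Database;
}

export default {
  async queue(batch: MessageBatch<ClusterMessage>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const { matches } = await env.STORY_CENTROIDS.query(msg.body.vector, { topK: 1 });
      const best = matches[0];
      if (best && best.score > 0.75) {
        await env.DB
          .prepare("INSERT OR IGNORE INTO article_stories (article_id, story_id) VALUES (?1, ?2)")
          .bind(msg.body.articleId, best.id)
          .run();
      }
      msg.ack(); // articles that match nothing are left for the batch re-cluster
    }
  },
};
```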
Stage 7: Matching & Delivery
Cloudflare Workflows
Workflows provide durable, stateful execution for complex multi-step operations that need retry logic, fan-out, and state persistence.
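In code, each workflow below is a class extending WorkflowEntrypoint whose steps are individually retried and checkpointed. A minimal skeleton, under assumed binding and class names:

```ts
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";

// Minimal Workflow skeleton. Every step.do() result is durably checkpointed, so a
// retried run resumes after the last completed step rather than starting over.
interface Env {
  DB: D1Database;
  CRAWL_QUEUE: Queue<unknown>;
}

type Params = { triggeredBy: "cron" | "admin" };

export class ScheduledCrawlWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(_event: WorkflowEvent<Params>, step: WorkflowStep) {
    const keywordSets = await step.do("load keyword sets", async () => {
      const { results } = await this.env.DB
        .prepare("SELECT * FROM keyword_sets WHERE enabled = 1")
        .all();
      return results;
    });

    await step.do(
      "fan out batches",
      { retries: { limit: 3, delay: "30 seconds", backoff: "exponential" } },
      async () => {
        // Queues accepts at most 100 messages per sendBatch call, matching the
        // "max 100 per batch" note in the diagram below.
        await this.env.CRAWL_QUEUE.sendBatch(
          keywordSets.slice(0, 100).map((ks) => ({ body: { keywordSetId: ks.id } }))
        );
      }
    );
  }
}
```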
Workflow: scheduled-crawl
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: scheduled-crawl │
│ Trigger: Cron (*/15 * * * *) or POST /v1/admin/crawl/trigger │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Load Active Keyword Sets │
│ └─▶ Query D1 for enabled keyword_sets matching crawl_frequency │
│ │
│ Step 2: Check Rate Limits │
│ └─▶ For each keyword_set: │
│ ├─▶ Read KV rate counters (per-source, global) │
│ ├─▶ If within limits: add to batch │
│ └─▶ If rate-limited: skip, log, schedule retry │
│ │
│ Step 3: Fan-Out Batches │
│ └─▶ Enqueue crawl.batch messages (max 100 per batch) │
│ └─▶ Workflow WAITS for batch completion (durable state) │
│ │
│ Step 4: Collect Results │
│ └─▶ Aggregate: new_articles[], failed[], rate_limited[] │
│ │
│ Step 5: Update State │
│ └─▶ Insert pipeline_run record to D1 │
│ └─▶ Update KV rate counters │
│ └─▶ If failure_rate > 20%: trigger alert │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
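Wiring both triggers to the workflow is a matter of creating an instance from the cron handler or the admin route; the SCHEDULED_CRAWL binding name is an assumption:

```ts
// Both triggers create an instance of the same workflow. create() returns a handle
// whose id can later be used to poll instance status.
interface Env {
  SCHEDULED_CRAWL: Workflow;
}

export default {
  // Cron trigger, e.g. "*/15 * * * *" listed under [triggers] crons in wrangler.toml.
  async scheduled(_controller: ScheduledController, env: Env): Promise<void> {
    await env.SCHEDULED_CRAWL.create({ params: { triggeredBy: "cron" } });
  },

  // Admin trigger: POST /v1/admin/crawl/trigger
  async fetch(req: Request, env: Env): Promise<Response> {
    if (req.method === "POST" && new URL(req.url).pathname === "/v1/admin/crawl/trigger") {
      const instance = await env.SCHEDULED_CRAWL.create({ params: { triggeredBy: "admin" } });
      return Response.json({ instanceId: instance.id }, { status: 202 });
    }
    return new Response("Not found", { status: 404 });
  },
};
```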
Workflow: story-recluster
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: story-recluster │
│ Trigger: Cron (0 */4 * * *) or POST /v1/admin/recluster │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Load Unclustered Articles │
│ └─▶ Query articles from last 48h not in any story │
│ │
│ Step 2: Load Active Stories │
│ └─▶ Query stories with status = breaking | developing │
│ └─▶ Fetch story centroids from Vectorize │
│ │
│ Step 3: Compute Similarities │
│ └─▶ For each unclustered article: │
│ ├─▶ Vector similarity to story centroids │
│ ├─▶ Entity overlap with story entities │
│ └─▶ Temporal proximity scoring │
│ │
│ Step 4: Assign or Create │
│ └─▶ If similarity > 0.75: assign to story │
│ └─▶ If 3+ similar unclustered: create new story │
│ └─▶ Update story centroids (rolling average) │
│ │
│ Step 5: Merge Detection │
│ └─▶ Find stories with >50% article overlap │
│ └─▶ Flag for admin review or auto-merge if confident │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
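The rolling-average centroid update in Step 4 is just an incremental mean: with n articles already in the story, folding in one more keeps the centroid equal to the mean of all member vectors. A sketch (binding and field names assumed):

```ts
// Incremental centroid update applied when an article is assigned to a story.
function updateCentroid(centroid: number[], articleVector: number[], n: number): number[] {
  return centroid.map((c, i) => (c * n + articleVector[i]) / (n + 1));
}

// Example: a 3-article story absorbing a fourth article, then writing the new
// centroid back to Vectorize.
// const next = updateCentroid(story.centroid, article.vector, 3);
// await env.STORY_CENTROIDS.upsert([{ id: story.id, values: next }]);
```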
Workflow: retention-cleanup
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: retention-cleanup │
│ Trigger: Cron (0 3 * * *) daily at 3 AM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Archive Mature Stories │
│ └─▶ Stories with no new articles in 30 days → status = archived │
│ │
│ Step 2: Purge Raw Snapshots │
│ └─▶ Delete R2 objects older than retention_days (default: 30) │
│ └─▶ Update D1 raw_snapshot_r2_key = null │
│ │
│ Step 3: Aggregate Cost Rollups │
│ └─▶ Compute daily COST_ROLLUPS from COST_EVENTS │
│ └─▶ Purge COST_EVENTS older than 90 days (keep rollups) │
│ │
│ Step 4: Cleanup Orphans │
│ └─▶ Delete ARTICLE_STORIES where story deleted │
│ └─▶ Delete embeddings for deleted articles │
│ │
│ Step 5: Report │
│ └─▶ Log storage freed, records deleted │
│ └─▶ Update D1 storage_stats │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
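Step 2's snapshot purge maps onto R2's list/delete API roughly as follows; the SNAPSHOTS binding and the raw/ prefix are assumptions:

```ts
// Sketch of the raw-snapshot purge. retentionDays defaults to 30 as in the workflow.
async function purgeOldSnapshots(env: { SNAPSHOTS: R2Bucket }, retentionDays = 30): Promise<number> {
  const cutoff = Date.now() - retentionDays * 24 * 60 * 60 * 1000;
  let cursor: string | undefined;
  let deleted = 0;

  do {
    const page = await env.SNAPSHOTS.list({ prefix: "raw/", cursor, limit: 1000 });
    const expired = page.objects
      .filter((obj) => obj.uploaded.getTime() < cutoff)
      .map((obj) => obj.key);
    if (expired.length > 0) {
      await env.SNAPSHOTS.delete(expired); // accepts a single key or an array of keys
      deleted += expired.length;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor);

  return deleted; // reported in Step 5 alongside the D1 rows removed
}
```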