
System Flow Diagram

StoryIntel: News Intelligence Platform — 100% Cloudflare-Native

This is the hero diagram — the granular, detailed view of how data flows through the entire system. For a simpler high-level view, see Architecture Overview.


Table of Contents

  1. Master System Flow
  2. Pipeline Stages
  3. Cloudflare Workflows
  4. Client API Endpoints
  5. Admin API Endpoints
  6. Cost Model

Master System Flow


Pipeline Stages

Stage 1: Acquisition

Note: Workflows handle fan-out of keyword batches with rate limiting state stored in KV. Each batch respects per-source rate limits (≤1 req/sec sustained).
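The per-source limit described in this note can be illustrated with a minimal sketch. It assumes a simple "minimum interval between requests" policy and uses a plain dict as a stand-in for Workers KV; the class name, method name, and key layout are hypothetical and do not come from the StoryIntel codebase.

```python
import time


class SourceRateLimiter:
    """Allow at most `rate_per_sec` sustained requests per source."""

    def __init__(self, rate_per_sec=1.0, kv=None):
        self.min_interval = 1.0 / rate_per_sec
        # Plain dict standing in for Workers KV: source -> last request time.
        self.kv = kv if kv is not None else {}

    def try_acquire(self, source, now=None):
        """Return True and record the request if the source is within its limit."""
        now = time.time() if now is None else now
        last = self.kv.get(source)
        if last is not None and now - last < self.min_interval:
            return False  # too soon: caller should skip and retry later
        self.kv[source] = now
        return True
```

Workers KV is eventually consistent, so a production limiter would likely need a tolerance margin or a strongly consistent store; the sketch ignores that.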


Stage 2: Extraction & Normalization

Note: Author resolution creates or updates rows in the AUTHORS table, incrementing article counts and recalculating trust scores from source authority and historical engagement.


Stage 3: Enrichment

Note: Enrichment runs asynchronously after extraction. Social metrics inform relevance scoring; backlinks contribute to source/author authority calculations.


Stage 4: Processing (AI)

Note: Embeddings enable semantic search and story clustering. Summary is optional but improves briefing generation quality.


Stage 5: Classification

Note: The 3-step ladder minimizes cost. ~70% of articles classify via rules+vectors alone. LLM specialists only run when needed, and each module logs its cost independently.
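One way to picture the ladder is as a cascade that stops at the first step confident enough to answer. The sketch below is illustrative only: the function names, the `Classification` type, and the 0.8 vector threshold are assumptions, and the three callables stand in for the real rule engine, Vectorize lookup, and LLM specialists.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    label: str
    confidence: float
    step: str  # "rules", "vector", or "llm"


def classify(text, rule_fn, vector_fn, llm_fn, vector_threshold=0.8):
    """Try the cheap steps first; fall through to the LLM only when needed."""
    # Step 1: deterministic rules (free). Returns a label or None.
    label = rule_fn(text)
    if label is not None:
        return Classification(label, 1.0, "rules")

    # Step 2: nearest-exemplar vector match (free). Returns (label, score).
    label, score = vector_fn(text)
    if score >= vector_threshold:  # threshold is a hypothetical value
        return Classification(label, score, "vector")

    # Step 3: paid LLM specialist. Returns (label, score).
    label, score = llm_fn(text)
    return Classification(label, score, "llm")
```

Recording the `step` on each result makes it easy to check what share of articles actually reaches the paid step.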


Stage 6: Story Clustering

Note: Story clustering uses the story-recluster Workflow for periodic batch re-clustering, while real-time clustering happens per-article via queues.


Stage 7: Matching & Delivery


Cloudflare Workflows

Workflows provide durable, stateful execution for complex multi-step operations that need retry logic, fan-out, and state persistence.

Workflow: scheduled-crawl

┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: scheduled-crawl │
│ Trigger: Cron (*/15 * * * *) or POST /v1/admin/crawl/trigger │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Load Active Keyword Sets │
│ └─▶ Query D1 for enabled keyword_sets matching crawl_frequency │
│ │
│ Step 2: Check Rate Limits │
│ └─▶ For each keyword_set: │
│ ├─▶ Read KV rate counters (per-source, global) │
│ ├─▶ If within limits: add to batch │
│ └─▶ If rate-limited: skip, log, schedule retry │
│ │
│ Step 3: Fan-Out Batches │
│ └─▶ Enqueue crawl.batch messages (max 100 per batch) │
│ └─▶ Workflow WAITS for batch completion (durable state) │
│ │
│ Step 4: Collect Results │
│ └─▶ Aggregate: new_articles[], failed[], rate_limited[] │
│ │
│ Step 5: Update State │
│ └─▶ Insert pipeline_run record to D1 │
│ └─▶ Update KV rate counters │
│ └─▶ If failure_rate > 20%: trigger alert │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
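Steps 3 and 5 of this workflow boil down to two small calculations, sketched below. The helper names are hypothetical, and the failure-rate definition (failed divided by attempted, with rate-limited items excluded because they were skipped rather than tried) is an assumption; the diagram does not say how the rate is computed.

```python
def chunk(items, max_size=100):
    """Split items into consecutive batches of at most `max_size`."""
    return [items[i:i + max_size] for i in range(0, len(items), max_size)]


def should_alert(new_articles, failed, threshold=0.20):
    """Return True when the share of failed fetches exceeds the threshold."""
    attempted = len(new_articles) + len(failed)
    if attempted == 0:
        return False  # nothing was attempted, so there is nothing to alert on
    return len(failed) / attempted > threshold
```

For example, 250 keyword sets would fan out as three batches of 100, 100, and 50.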

Workflow: story-recluster

┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: story-recluster │
│ Trigger: Cron (0 */4 * * *) or POST /v1/admin/recluster │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Load Unclustered Articles │
│ └─▶ Query articles from last 48h not in any story │
│ │
│ Step 2: Load Active Stories │
│ └─▶ Query stories with status = breaking | developing │
│ └─▶ Fetch story centroids from Vectorize │
│ │
│ Step 3: Compute Similarities │
│ └─▶ For each unclustered article: │
│ ├─▶ Vector similarity to story centroids │
│ ├─▶ Entity overlap with story entities │
│ └─▶ Temporal proximity scoring │
│ │
│ Step 4: Assign or Create │
│ └─▶ If similarity > 0.75: assign to story │
│ └─▶ If 3+ similar unclustered: create new story │
│ └─▶ Update story centroids (rolling average) │
│ │
│ Step 5: Merge Detection │
│ └─▶ Find stories with >50% article overlap │
│ └─▶ Flag for admin review or auto-merge if confident │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
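The vector part of Steps 3 and 4 can be sketched as follows. This covers only cosine similarity against story centroids and the centroid update; entity overlap and temporal proximity are left out. Treating "rolling average" as an incremental mean over a story's articles is an assumption, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def best_story(embedding, centroids, threshold=0.75):
    """Return the id of the most similar story above the threshold, else None."""
    best_id, best_sim = None, threshold
    for story_id, centroid in centroids.items():
        sim = cosine(embedding, centroid)
        if sim > best_sim:
            best_id, best_sim = story_id, sim
    return best_id


def update_centroid(centroid, article_count, embedding):
    """Incremental mean: fold one new embedding into a centroid of N articles."""
    return [(c * article_count + e) / (article_count + 1)
            for c, e in zip(centroid, embedding)]
```

An article that matches no story (`None`) stays unclustered until enough similar articles accumulate to create a new story.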

Workflow: retention-cleanup

┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: retention-cleanup │
│ Trigger: Cron (0 3 * * *), daily at 03:00 UTC │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Archive Mature Stories │
│ └─▶ Stories with no new articles in 30 days → status = archived │
│ │
│ Step 2: Purge Raw Snapshots │
│ └─▶ Delete R2 objects older than retention_days (default: 30) │
│ └─▶ Update D1 raw_snapshot_r2_key = null │
│ │
│ Step 3: Aggregate Cost Rollups │
│ └─▶ Compute daily COST_ROLLUPS from COST_EVENTS │
│ └─▶ Purge COST_EVENTS older than 90 days (keep rollups) │
│ │
│ Step 4: Cleanup Orphans │
│ └─▶ Delete ARTICLE_STORIES where story deleted │
│ └─▶ Delete embeddings for deleted articles │
│ │
│ Step 5: Report │
│ └─▶ Log storage freed, records deleted │
│ └─▶ Update D1 storage_stats │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
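Step 3 (rolling up cost events, then purging old ones) might look roughly like this. The event fields `ts`, `service`, and `cost_usd` are assumed for illustration; the actual COST_EVENTS schema is not shown in this document.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def rollup_daily(events):
    """Sum cost_usd per (day, service). Events are dicts with assumed fields."""
    totals = defaultdict(float)
    for event in events:
        day = event["ts"].date().isoformat()
        totals[(day, event["service"])] += event["cost_usd"]
    return dict(totals)


def purgeable(events, now, keep_days=90):
    """Return events older than the retention window (rollups are kept)."""
    cutoff = now - timedelta(days=keep_days)
    return [event for event in events if event["ts"] < cutoff]
```

Computing the rollups before the purge matters: once raw events are deleted, the daily totals are the only remaining record.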

Client API Endpoints

These endpoints power consumer-facing applications — web apps, mobile apps, SDKs.

Authentication

All client endpoints require an X-API-Key header.

┌─────────────────────────────────────────────────────────────────────────────┐
│ 🔐 AUTHENTICATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ POST /v1/auth/register Create customer account │
│ POST /v1/auth/login Get API key │
│ POST /v1/auth/refresh Refresh token │
│ GET /v1/auth/me Current user info │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 📋 KEYWORD MANAGEMENT — What to crawl │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/keywords List customer's keyword sets │
│ POST /v1/keywords Create keyword set │
│ Body: { name, keywords[], frequency, language, region } │
│ GET /v1/keywords/:id Get keyword set details │
│ PUT /v1/keywords/:id Update keyword set │
│ DELETE /v1/keywords/:id Delete keyword set │
│ POST /v1/keywords/:id/pause Pause crawling │
│ POST /v1/keywords/:id/resume Resume crawling │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 📰 FEED & ARTICLES — Read the news │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/feed Personalized article feed │
│ Params: ?limit, ?offset, ?since, ?until, ?topics[], ?sources[], │
│ ?sentiment, ?sort (relevance|date|engagement) │
│ GET /v1/articles/:id Single article with full data │
│ Returns: headline, body, summary, classification, social, source │
│ GET /v1/articles/:id/similar Semantically similar articles │
│ POST /v1/articles/:id/feedback Mark relevant / not relevant │
│ Body: { feedback: "relevant" | "not_relevant", reason? } │
└─────────────────────────────────────────────────────────────────────────────┘
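As an illustration of calling a client endpoint, the sketch below builds (but does not send) a GET /v1/feed request carrying the X-API-Key header. The base URL and key are placeholders, and the helper name is hypothetical; only the path, the header name, and the `limit` and `sort` parameters come from the listing above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_feed_request(base_url, api_key, **params):
    """Build a GET /v1/feed request with query params and the API key header."""
    query = urlencode(params)
    url = f"{base_url}/v1/feed" + (f"?{query}" if query else "")
    return Request(url, headers={"X-API-Key": api_key}, method="GET")


# Placeholder host and key; pass the result to urllib.request.urlopen to send it.
req = build_feed_request("https://api.example.com", "demo-key", limit=20, sort="date")
```

How array parameters such as `topics[]` should be encoded is not specified here, so the sketch sticks to scalar parameters.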

┌─────────────────────────────────────────────────────────────────────────────┐
│ 📚 STORIES — Clustered narratives │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/stories Story feed │
│ Params: ?status (breaking|developing|mature), ?limit, ?offset │
│ GET /v1/stories/:id Story with timeline + top articles │
│ GET /v1/stories/:id/articles Paginated articles in story │
│ GET /v1/stories/:id/timeline Story timeline events │
│ POST /v1/stories/:id/subscribe Subscribe to story updates │
│ DELETE /v1/stories/:id/subscribe Unsubscribe from story │
│ GET /v1/subscriptions List story subscriptions │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 🔍 SEARCH — Find articles │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/search Full-text + semantic search │
│ Params: ?q (query), ?semantic (true|false), ?filters... │
│ GET /v1/search/entities Search by entity │
│ Params: ?entity_id, ?entity_type, ?name │
│ GET /v1/search/locations Search by location │
│ Params: ?location_id, ?lat, ?lng, ?radius_km │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 🤖 INTELLIGENCE — AI-powered features │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/briefings/daily AI-generated daily briefing │
│ Params: ?date (default: today), ?format (summary|detailed) │
│ GET /v1/briefings/weekly Weekly digest │
│ POST /v1/qa Ask questions about your news │
│ Body: { query, scope?, story_id?, date_range? } │
│ GET /v1/entities/:id Entity profile + related articles │
│ GET /v1/entities/:id/timeline Entity's news timeline │
│ GET /v1/trends Trending topics/entities │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 👤 PROFILE & PREFERENCES — Personalization │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/profiles List monitoring profiles │
│ POST /v1/profiles Create profile │
│ Body: { name, keywords[], topics[], sources_include[], │
│ sources_exclude[], regions[], notify_threshold } │
│ GET /v1/profiles/:id Get profile │
│ PUT /v1/profiles/:id Update profile │
│ DELETE /v1/profiles/:id Delete profile │
│ POST /v1/profiles/:id/rebuild Force re-embed profile │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 🔔 NOTIFICATIONS — Alert preferences │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/notifications/settings Get notification settings │
│ PUT /v1/notifications/settings Update settings │
│ Body: { email: bool, slack: { webhook_url }, push: bool } │
│ GET /v1/notifications/history Recent notifications sent │
│ POST /v1/notifications/test Send test notification │
└─────────────────────────────────────────────────────────────────────────────┘

Admin API Endpoints

These endpoints power the Admin Console — internal tools for ops, monitoring, and system management.

Authentication

All admin endpoints require an Authorization: Bearer <token> header.

┌─────────────────────────────────────────────────────────────────────────────┐
│ 📊 PIPELINE MONITORING │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/admin/pipeline/status Real-time pipeline health │
│ Returns: queue depths, worker status, error rates │
│ GET /v1/admin/pipeline/runs Historical pipeline runs │
│ Params: ?since, ?until, ?status │
│ GET /v1/admin/pipeline/runs/:id Single run details │
│ GET /v1/admin/pipeline/errors Recent errors across pipeline │
│ POST /v1/admin/crawl/trigger Manual crawl trigger │
│ Body: { keyword_set_ids[]?, force: bool } │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 🏷️ CLASSIFICATION & TAXONOMY │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/admin/classification/metrics Accuracy, confidence distributions │
│ Returns: avg_confidence, review_queue_size, accuracy_post_review │
│ GET /v1/admin/classification/drift Label distribution over time │
│ GET /v1/admin/taxonomy Full taxonomy tree │
│ POST /v1/admin/taxonomy/labels Add new label │
│ PUT /v1/admin/taxonomy/labels/:id Update label │
│ DELETE /v1/admin/taxonomy/labels/:id Delete label │
│ POST /v1/admin/taxonomy/exemplars Add exemplar article to label │
│ GET /v1/admin/taxonomy/versions Taxonomy version history │
│ POST /v1/admin/taxonomy/rollback Rollback to previous version │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 👁️ REVIEW QUEUE — Human-in-the-loop │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/admin/review/queue Low-confidence items │
│ Params: ?limit, ?sort (oldest|confidence) │
│ GET /v1/admin/review/:id Single review item │
│ POST /v1/admin/review/:id Submit review decision │
│ Body: { corrected_labels, notes?, promote_to_exemplar: bool } │
│ POST /v1/admin/review/:id/skip Skip (return to queue later) │
│ GET /v1/admin/review/stats Review metrics │
│ Returns: total_pending, reviewed_today, avg_review_time │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 🏢 SOURCES & AUTHORITY │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/admin/sources All sources with stats │
│ Returns: domain, authority_score, article_count, last_seen │
│ POST /v1/admin/sources Add new source config │
│ PUT /v1/admin/sources/:id Update source (rate limits, etc) │
│ POST /v1/admin/sources/:id/recalc Recalculate authority score │
│ GET /v1/admin/sources/:id/history Authority score over time │
│ GET /v1/admin/authors Top authors by article count │
│ GET /v1/admin/authors/:id Author details + trust score │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 💰 COST TRACKING & BUDGETS │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/admin/costs Cost summary │
│ Params: ?period (day|week|month), ?since, ?until │
│ GET /v1/admin/costs/by-service Breakdown by service │
│ Returns: { zenrows, data4seo, sharedcount, workers_ai } │
│ GET /v1/admin/costs/by-operation Breakdown by operation type │
│ GET /v1/admin/costs/by-article/:id Cost to process specific article │
│ GET /v1/admin/costs/by-customer/:id Customer-attributed costs │
│ GET /v1/admin/costs/forecast Projected costs based on trends │
│ GET /v1/admin/budgets Current budget configurations │
│ POST /v1/admin/budgets Create/update budget │
│ Body: { scope, scope_id?, period, budget_usd, alert_pct, hard_limit } │
│ GET /v1/admin/budgets/alerts Budget threshold alerts │
└─────────────────────────────────────────────────────────────────────────────┘
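The budget fields in the POST /v1/admin/budgets body suggest a threshold check along these lines. The semantics are assumed, not documented here: `alert_pct` is read as the percentage of `budget_usd` at which an alert fires, and `hard_limit` as a flag that blocks further spend once the budget is reached. The function name and status strings are hypothetical.

```python
def budget_status(spent_usd, budget_usd, alert_pct, hard_limit):
    """Classify current spend against a budget (assumed field semantics)."""
    if spent_usd >= budget_usd:
        # With a hard limit, paid operations would be refused from here on.
        return "blocked" if hard_limit else "over_budget"
    if spent_usd >= budget_usd * alert_pct / 100:
        return "alert"
    return "ok"
```

Under these assumptions, a $100 budget with `alert_pct` of 80 starts alerting at $80 of spend.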

┌─────────────────────────────────────────────────────────────────────────────┐
│ 🔧 REPROCESSING & MAINTENANCE │
├─────────────────────────────────────────────────────────────────────────────┤
│ POST /v1/admin/reprocess/article/:id Reprocess single article │
│ Body: { stages[]: extract|embed|classify|cluster } │
│ POST /v1/admin/reprocess/batch Reprocess batch of articles │
│ Body: { article_ids[], stages[] } │
│ POST /v1/admin/recluster Force story re-clustering │
│ POST /v1/admin/retention/run Execute retention policy now │
│ POST /v1/admin/cache/invalidate Clear KV caches │
│ Body: { patterns[]?: ["rate-limits:*", "hot-cache:sources:*"] } │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 📈 STORAGE & SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/admin/storage/stats D1/R2/KV/Vectorize usage │
│ Returns: row_counts, storage_bytes, index_sizes │
│ GET /v1/admin/rate-limits/status Current rate limit state │
│ GET /v1/admin/health System health check │
│ GET /v1/admin/config Current system configuration │
│ PUT /v1/admin/config Update configuration │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ 👥 CUSTOMER MANAGEMENT │
├─────────────────────────────────────────────────────────────────────────────┤
│ GET /v1/admin/customers All customers │
│ GET /v1/admin/customers/:id Customer details │
│ GET /v1/admin/customers/:id/usage Customer usage stats │
│ PUT /v1/admin/customers/:id/tier Update customer tier │
│ POST /v1/admin/customers/:id/disable Disable customer │
└─────────────────────────────────────────────────────────────────────────────┘

Cost Model

Per-Operation Costs

| Service | Operation | Unit | Cost | Notes |
|---|---|---|---|---|
| Google News | RSS fetch | request | FREE | Rate limited only |
| Google News | HTML fetch | request | FREE | Rate limited only |
| Publisher | Direct fetch | request | FREE | May be blocked |
| ZenRows | Anti-bot fetch | request | ~$0.005 | Tier 2 fallback |
| DataForSEO | Content fetch | request | ~$0.002 | Tier 3 fallback |
| DataForSEO | Backlinks | request | ~$0.004 | Per article |
| SharedCount | Social metrics | request | ~$0.0001 | Per article |
| Workers AI | Embedding | 1K tokens | ~$0.00001 | ~500-2000 tok/article |
| Workers AI | LLM inference | 1K tokens | ~$0.0005 | Classification, summaries |
| Vectorize | Query/Upsert | request | FREE | Included in plan |
| D1 | Read/Write | request | FREE | Included in plan |
| R2 | Storage | GB/month | $0.015 | Raw snapshots |
| KV | Read/Write | request | FREE | Included in plan |

Typical Per-Article Cost

┌─────────────────────────────────────────────────────────────────┐
│ TYPICAL ARTICLE PROCESSING COST │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Acquisition (one of): │
│ ├── Direct fetch .......................... FREE │
│ ├── ZenRows (~20% need it) ............... $0.005 │
│ └── DataForSEO (~5% fallback) ............ $0.002 │
│ │
│ Enrichment: │
│ ├── SharedCount .......................... $0.0001 │
│ └── Backlinks ............................ $0.004 │
│ │
│ Processing: │
│ ├── Embedding ............................ $0.00002 │
│ └── Summary .............................. $0.0005 │
│ │
│ Classification: │
│ ├── Rules + Vector ....................... FREE │
│ └── LLM specialists (~30% need it) ....... $0.001 │
│ │
│ ───────────────────────────────────────────────────────────── │
│ TOTAL (80% of articles) .................. $0.005 │
│ TOTAL (with fallbacks + full LLM) ........ $0.015 │
│ │
│ At 10,000 articles/day: │
│ ├── Typical: ~$50/day (~$1,500/month) │
│ └── Worst case: ~$150/day (~$4,500/month) │
│ │
└─────────────────────────────────────────────────────────────────┘
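The totals in this breakdown can be checked by summing the line items. The sketch below uses only the figures listed above; the function names are hypothetical. Summed exactly, the common path comes to about $0.0046 and the path with both fallbacks plus LLM specialists to about $0.0126, so the $0.005 and $0.015 totals appear to be rounded up, presumably to leave headroom.

```python
# Per-article unit costs in USD, copied from the breakdown above.
COSTS = {
    "zenrows": 0.005,
    "dataforseo_fetch": 0.002,
    "sharedcount": 0.0001,
    "backlinks": 0.004,
    "embedding": 0.00002,
    "summary": 0.0005,
    "llm_specialists": 0.001,
}

# Items every article incurs (direct fetch and rules + vector are free).
BASE = ("sharedcount", "backlinks", "embedding", "summary")


def article_cost(zenrows=False, dataforseo=False, llm=False):
    """Cost of one article for a given combination of paid fallbacks."""
    total = sum(COSTS[k] for k in BASE)
    if zenrows:
        total += COSTS["zenrows"]
    if dataforseo:
        total += COSTS["dataforseo_fetch"]
    if llm:
        total += COSTS["llm_specialists"]
    return total


def expected_cost(p_zenrows=0.20, p_dataforseo=0.05, p_llm=0.30):
    """Average cost per article, weighting each fallback by how often it runs."""
    return (article_cost()
            + p_zenrows * COSTS["zenrows"]
            + p_dataforseo * COSTS["dataforseo_fetch"]
            + p_llm * COSTS["llm_specialists"])


def daily_cost(per_article_usd, articles_per_day=10_000):
    """Scale a per-article cost to a daily volume."""
    return per_article_usd * articles_per_day
```

Weighting the fallbacks by the stated percentages gives a blended average of roughly $0.006 per article, or about $60/day at 10,000 articles, which sits between the typical and worst-case figures.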

Color Legend

| Color | Meaning |
|---|---|
| 🔵 Blue (#dbeafe) | Entry points, inputs |
| 🟡 Yellow (#fef3c7) | Orchestration, queues, workflows |
| 🟢 Green (#dcfce7) | Processing, free operations |
| 🩷 Pink (#fce7f3) | External services, paid operations |
| 🟣 Purple (#f3e8ff) | AI services, storage |
| 🔴 Red (#fee2e2) | Alerts, observability |

Cloudflare Foundation Summary

Every component in this system runs on Cloudflare's global edge network, with one exception: analytics, which uses an external ClickHouse deployment.

| Component | Cloudflare Service | Purpose |
|---|---|---|
| API Gateway | Workers | Request handling, auth, routing |
| Async Processing | Queues | Message passing between pipeline stages |
| Orchestration | Workflows | Durable, stateful multi-step execution |
| Transactional DB | D1 | SQLite for all entity data (35+ tables) |
| Object Storage | R2 | Raw HTML snapshots, exports |
| Caching | KV | Rate limits, dedup cache, hot lookups |
| Vector Search | Vectorize | 7 semantic search indexes |
| AI Inference | Workers AI | Embeddings, classification LLM |
| Analytics | ClickHouse | Time-series, heavy aggregations (external) |

Why 100% Cloudflare? Zero cold starts, global edge deployment, unified billing, and seamless integration between services. No VPCs, no Kubernetes, no container orchestration — just code and configuration.

