
Architecture Overview

StoryIntel: A News Intelligence Platform Built on Cloudflare

StoryIntel is a serverless news intelligence platform that crawls, enriches, classifies, and delivers relevant news articles to customers. The entire infrastructure runs on Cloudflare's edge platform, providing global low-latency access and zero cold-start compute.


High-Level Stack Diagram

From bottom to top: the Cloudflare Foundation (D1, R2, KV, Vectorize, Workers AI), the Processing Pipeline (Workflows and Queues), the API Gateway (Workers at the edge), and the Applications that consume it (client dashboards, admin console, SDK, webhook consumers), with ClickHouse alongside for analytics.


Layer-by-Layer Breakdown

Layer 1: Cloudflare Foundation

The bedrock of StoryIntel. All persistent state, caching, and AI inference runs on Cloudflare's global edge network.

| Component | Purpose | Key Metrics |
| --- | --- | --- |
| D1 (SQLite) | Transactional database | 35+ tables, source of truth |
| R2 (Objects) | Raw HTML snapshots, exports | Unlimited storage |
| KV (Cache) | Rate limiting, dedup cache | Global edge replication |
| Vectorize | Semantic search indexes | 7 indexes, 1536 dimensions |
| Workers AI | Embeddings & classification | @cf/baai/bge-base-en-v1.5 |

Why Cloudflare? Zero cold starts, global edge deployment, unified billing, and tight integration between services. No VPCs, no container orchestration, no infrastructure management.
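
A minimal sketch of what Layer 1 looks like from inside a Worker, assuming illustrative binding names (DB, SNAPSHOTS, CACHE, ARTICLE_INDEX, AI) rather than the real ones:

```typescript
// Illustrative Worker bindings for the Layer 1 services; names are assumptions.
export interface Env {
  DB: D1Database;                // D1: transactional source of truth
  SNAPSHOTS: R2Bucket;           // R2: raw HTML snapshots and exports
  CACHE: KVNamespace;            // KV: rate limiting and dedup cache
  ARTICLE_INDEX: VectorizeIndex; // Vectorize: semantic search index
  AI: Ai;                        // Workers AI: embeddings & classification
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // Embed a query with Workers AI, then search the Vectorize index with it.
    const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: ['cloudflare acquires startup'],
    });
    const matches = await env.ARTICLE_INDEX.query(data[0], { topK: 5 });
    return Response.json(matches);
  },
};
```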

Layer 2: Processing Pipeline

Orchestration happens through two complementary patterns:

Workflows (Control Plane)

  • Long-running, stateful processes
  • Automatic retries and checkpointing
  • Human-in-the-loop capabilities
  • Examples: IngestKeyword, ProcessArticle, StoryCluster
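
A sketch of what one of these Workflows could look like. The binding names, table layout, and runExtractors() helper are assumptions; the point is that each step.do() call is checkpointed and retried independently:

```typescript
import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from 'cloudflare:workers';

// Binding names, the article_extractions layout, and runExtractors() are assumptions.
type Env = { SNAPSHOTS: R2Bucket; DB: D1Database };
type Params = { articleId: string };

declare function runExtractors(html: string): Promise<unknown>; // hypothetical helper

export class ProcessArticle extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Each step is checkpointed; a failed step is retried without
    // re-running the steps that already succeeded.
    const html = await step.do('fetch snapshot', async () => {
      const obj = await this.env.SNAPSHOTS.get(`articles/${event.payload.articleId}.html`);
      return (await obj?.text()) ?? '';
    });

    const extraction = await step.do('extract', () => runExtractors(html));

    await step.do('persist', async () => {
      await this.env.DB
        .prepare('INSERT INTO article_extractions (article_id, data) VALUES (?, ?)')
        .bind(event.payload.articleId, JSON.stringify(extraction))
        .run();
    });
  }
}
```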

Queues (Data Plane)

  • High-throughput async processing
  • Batching for efficiency
  • Dead-letter queues for failures
  • Examples: crawl.batch, article.extract, notify.dispatch
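
A minimal consumer sketch for an article.extract-style queue; the message shape and handleExtraction() helper are assumptions:

```typescript
// Messages arrive in batches (sized via the consumer's max_batch_size /
// max_batch_timeout settings) and are acknowledged or retried individually.
interface ExtractMessage {
  articleId: string;
  snapshotKey: string;
}

interface Env {} // bindings omitted for brevity
declare function handleExtraction(msg: ExtractMessage, env: Env): Promise<void>; // hypothetical

export default {
  async queue(batch: MessageBatch<ExtractMessage>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      try {
        await handleExtraction(msg.body, env);
        msg.ack(); // acknowledge successful messages individually
      } catch {
        msg.retry(); // after max retries, the message lands in the dead-letter queue
      }
    }
  },
};
```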

Layer 3: API Gateway

All external requests hit Cloudflare Workers at the edge:

  • Authentication: API key validation, JWT tokens, admin tokens
  • Rate Limiting: Per-customer limits stored in KV
  • Routing: Path-based routing to appropriate handlers
  • CORS: Proper cross-origin handling for web clients
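
A hedged sketch of those gateway checks in a single Worker fetch handler; the api_keys table, key format, and route() helper are assumptions:

```typescript
// Illustrative edge gateway: API-key auth backed by D1, fixed-window
// rate limiting backed by KV, then hand-off to a path-based router.
interface Env {
  DB: D1Database;
  CACHE: KVNamespace;
}

declare function route(req: Request, env: Env): Promise<Response>; // hypothetical router

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // Authentication: look the API key up by its hash.
    const apiKey = req.headers.get('Authorization')?.replace('Bearer ', '');
    if (!apiKey) return new Response('Unauthorized', { status: 401 });

    const customer = await env.DB
      .prepare('SELECT id, rate_limit FROM api_keys WHERE key_hash = ?')
      .bind(await sha256(apiKey))
      .first<{ id: string; rate_limit: number }>();
    if (!customer) return new Response('Unauthorized', { status: 401 });

    // Rate limiting: one KV counter per customer per minute.
    // KV is eventually consistent, so the limit is approximate by design.
    const windowKey = `rl:${customer.id}:${Math.floor(Date.now() / 60_000)}`;
    const used = Number((await env.CACHE.get(windowKey)) ?? 0);
    if (used >= customer.rate_limit) {
      return new Response('Too Many Requests', { status: 429 });
    }
    await env.CACHE.put(windowKey, String(used + 1), { expirationTtl: 120 });

    return route(req, env);
  },
};

async function sha256(value: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(value));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}
```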

Layer 4: Applications

Multiple interfaces consume the API:

  • Client Apps: Customer-facing dashboards (React/Next.js)
  • Admin Console: Internal management interface
  • TypeScript SDK: Generated from OpenAPI spec
  • Webhook Consumers: Push notifications to customer systems

Analytics Layer: ClickHouse

D1 handles transactional workloads, but heavy analytics queries go to ClickHouse:

| D1 (Transactional) | ClickHouse (Analytics) |
| --- | --- |
| Article CRUD | Article metrics over time |
| Customer data | Customer engagement trends |
| Real-time lookups | Pipeline latency percentiles |
| Dedup checks | Keyword velocity for trends |
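
As a sketch of the split in practice (table and column names here are illustrative): transactional reads stay on D1, while aggregations go to ClickHouse over its HTTP interface.

```typescript
// Transactional lookups use the D1 binding; analytical queries are sent to
// ClickHouse over HTTP. The CLICKHOUSE_* values and table names are assumptions.
interface Env {
  DB: D1Database;
  CLICKHOUSE_URL: string;
  CLICKHOUSE_USER: string;
  CLICKHOUSE_PASSWORD: string;
}

// Transactional: fetch one article from the source of truth.
async function getArticle(env: Env, id: string) {
  return env.DB.prepare('SELECT * FROM articles WHERE id = ?').bind(id).first();
}

// Analytical: daily article counts for a keyword over the last 30 days.
async function keywordVelocity(env: Env, keyword: string) {
  const sql = `
    SELECT toDate(published_at) AS day, count() AS articles
    FROM article_metrics
    WHERE keyword = {keyword:String} AND published_at > now() - INTERVAL 30 DAY
    GROUP BY day ORDER BY day
    FORMAT JSON`;
  const res = await fetch(`${env.CLICKHOUSE_URL}/?param_keyword=${encodeURIComponent(keyword)}`, {
    method: 'POST',
    headers: {
      'X-ClickHouse-User': env.CLICKHOUSE_USER,
      'X-ClickHouse-Key': env.CLICKHOUSE_PASSWORD,
    },
    body: sql,
  });
  return res.json();
}
```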

Key Architectural Decisions

1. Cloudflare-First Infrastructure

Every component runs on Cloudflare's edge. No AWS, no GCP, no self-hosted infrastructure.

2. Workflows + Queues Hybrid

Workflows orchestrate complex multi-step processes. Queues handle high-throughput data movement. They work together.

3. Plugin Architecture

Three extension points for customization:

  • Source Adapters: Where we crawl from (Google News, RSS, future sources)
  • Extraction Plugins: What structured data we extract (entities, events, funding rounds)
  • Output Adapters: Where enriched data goes (email, Slack, webhooks)
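
One plausible shape for these extension points, sketched as TypeScript interfaces (the actual interfaces in the codebase may differ):

```typescript
// Hypothetical plugin contracts for the three extension points.
interface DiscoveredUrl {
  url: string;
  publishedAt?: string;
}

interface SourceAdapter {
  id: string; // e.g. 'google-news', 'rss'
  discover(keyword: string): Promise<DiscoveredUrl[]>;
}

interface ExtractionPlugin<T = unknown> {
  id: string;     // e.g. 'funding-rounds'
  schema: object; // JSON Schema describing the output (see Schema-Driven Extraction)
  extract(article: { html: string; text: string }): Promise<T | null>;
}

interface OutputAdapter {
  id: string; // e.g. 'email', 'slack', 'webhook'
  deliver(payload: { articleId: string; data: unknown }): Promise<void>;
}
```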

4. Schema-Driven Extraction

Every extractor declares its output shape as a JSON Schema. This enables:

  • Validation at extraction time
  • Type-safe client code generation
  • Consistent storage in ARTICLE_EXTRACTIONS
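
For illustration, a hypothetical funding-round extractor schema and the validate-then-store path it enables (real schemas, validator wiring, and ARTICLE_EXTRACTIONS columns may differ):

```typescript
// Illustrative extractor output schema and persistence path.
interface Env {
  DB: D1Database;
}

const fundingRoundSchema = {
  $id: 'funding-round',
  type: 'object',
  required: ['company', 'amountUsd', 'round'],
  properties: {
    company: { type: 'string' },
    amountUsd: { type: 'number' },
    round: { type: 'string', enum: ['seed', 'series-a', 'series-b', 'other'] },
    investors: { type: 'array', items: { type: 'string' } },
  },
  additionalProperties: false,
};

// Backed by a JSON Schema validator (e.g. one compiled ahead of time for Workers).
declare function validateAgainstSchema(
  schema: object,
  data: unknown
): { valid: boolean; errors?: string[] };

async function storeExtraction(env: Env, articleId: string, data: unknown) {
  const result = validateAgainstSchema(fundingRoundSchema, data);
  if (!result.valid) {
    throw new Error(`Extraction rejected: ${result.errors?.join('; ')}`);
  }
  await env.DB
    .prepare('INSERT INTO article_extractions (article_id, extractor_id, data) VALUES (?, ?, ?)')
    .bind(articleId, 'funding-round', JSON.stringify(data))
    .run();
}
```

The same schemas can also feed a code generator such as json-schema-to-typescript to produce the type-safe client types mentioned above.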

5. D1 + ClickHouse Split

  • D1: Source of truth for all entities, fast writes, transactional guarantees
  • ClickHouse: Analytics, time-series, heavy aggregations, trend detection