
Architecture Overview

StoryIntel: A News Intelligence Platform Built on Cloudflare

StoryIntel is a serverless news intelligence platform that crawls, enriches, classifies, and delivers relevant news articles to customers. The entire infrastructure runs on Cloudflare's edge platform, providing global low-latency access and zero cold-start compute.


High-Level Stack Diagram

From bottom to top: the Cloudflare Foundation (D1, R2, KV, Vectorize, Workers AI), the Processing Pipeline (Workflows and Queues), the API Gateway (Workers at the edge), and the Applications that consume it (client dashboards, admin console, SDK, webhook consumers), with ClickHouse alongside for analytics.


Layer-by-Layer Breakdown

Layer 1: Cloudflare Foundation

The bedrock of StoryIntel. All persistent state, caching, and AI inference runs on Cloudflare's global edge network.

| Component | Purpose | Key Metrics |
| --- | --- | --- |
| D1 (SQLite) | Transactional database | 35+ tables, source of truth |
| R2 (Objects) | Raw HTML snapshots, exports | Unlimited storage |
| KV (Cache) | Rate limiting, dedup cache | Global edge replication |
| Vectorize | Semantic search indexes | 7 indexes, 1536 dimensions |
| Workers AI | Embeddings & classification | @cf/baai/bge-base-en-v1.5 |

Why Cloudflare? Zero cold starts, global edge deployment, unified billing, and tight integration between services. No VPCs, no container orchestration, no infrastructure management.
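
A minimal sketch of what Layer 1 looks like from inside a Worker, assuming illustrative binding names (DB, SNAPSHOTS, CACHE, ARTICLE_INDEX, AI) rather than the real ones:

```typescript
// Illustrative Worker bindings for the Layer 1 services; names are assumptions.
export interface Env {
  DB: D1Database;                // D1: transactional source of truth
  SNAPSHOTS: R2Bucket;           // R2: raw HTML snapshots and exports
  CACHE: KVNamespace;            // KV: rate limiting and dedup cache
  ARTICLE_INDEX: VectorizeIndex; // Vectorize: semantic search index
  AI: Ai;                        // Workers AI: embeddings & classification
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // Embed a query with Workers AI, then search the Vectorize index with it.
    const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: ['cloudflare acquires startup'],
    });
    const matches = await env.ARTICLE_INDEX.query(data[0], { topK: 5 });
    return Response.json(matches);
  },
};
```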

Layer 2: Processing Pipeline

Orchestration happens through two complementary patterns:

Workflows (Control Plane)

  • Long-running, stateful processes
  • Automatic retries and checkpointing
  • Human-in-the-loop capabilities
  • Examples: IngestKeyword, ProcessArticle, StoryCluster
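
A sketch of what one of these Workflows could look like. The binding names, table layout, and runExtractors() helper are assumptions; the point is that each step.do() call is checkpointed and retried independently:

```typescript
import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from 'cloudflare:workers';

// Binding names, the article_extractions layout, and runExtractors() are assumptions.
type Env = { SNAPSHOTS: R2Bucket; DB: D1Database };
type Params = { articleId: string };

declare function runExtractors(html: string): Promise<unknown>; // hypothetical helper

export class ProcessArticle extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Each step is checkpointed; a failed step is retried without
    // re-running the steps that already succeeded.
    const html = await step.do('fetch snapshot', async () => {
      const obj = await this.env.SNAPSHOTS.get(`articles/${event.payload.articleId}.html`);
      return (await obj?.text()) ?? '';
    });

    const extraction = await step.do('extract', () => runExtractors(html));

    await step.do('persist', async () => {
      await this.env.DB
        .prepare('INSERT INTO article_extractions (article_id, data) VALUES (?, ?)')
        .bind(event.payload.articleId, JSON.stringify(extraction))
        .run();
    });
  }
}
```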

Queues (Data Plane)

  • High-throughput async processing
  • Batching for efficiency
  • Dead-letter queues for failures
  • Examples: crawl.batch, article.extract, notify.dispatch
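
A minimal consumer sketch for an article.extract-style queue; the message shape and handleExtraction() helper are assumptions:

```typescript
// Messages arrive in batches (sized via the consumer's max_batch_size /
// max_batch_timeout settings) and are acknowledged or retried individually.
interface ExtractMessage {
  articleId: string;
  snapshotKey: string;
}

interface Env {} // bindings omitted for brevity
declare function handleExtraction(msg: ExtractMessage, env: Env): Promise<void>; // hypothetical

export default {
  async queue(batch: MessageBatch<ExtractMessage>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      try {
        await handleExtraction(msg.body, env);
        msg.ack(); // acknowledge successful messages individually
      } catch {
        msg.retry(); // after max retries, the message lands in the dead-letter queue
      }
    }
  },
};
```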

Layer 3: API Gateway

All external requests hit Cloudflare Workers at the edge:

  • Authentication: API key validation, JWT tokens, admin tokens
  • Rate Limiting: Per-customer limits stored in KV
  • Routing: Path-based routing to appropriate handlers
  • CORS: Proper cross-origin handling for web clients
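
A hedged sketch of those gateway checks in a single Worker fetch handler; the api_keys table, key format, and route() helper are assumptions:

```typescript
// Illustrative edge gateway: API-key auth backed by D1, fixed-window
// rate limiting backed by KV, then hand-off to a path-based router.
interface Env {
  DB: D1Database;
  CACHE: KVNamespace;
}

declare function route(req: Request, env: Env): Promise<Response>; // hypothetical router

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // Authentication: look the API key up by its hash.
    const apiKey = req.headers.get('Authorization')?.replace('Bearer ', '');
    if (!apiKey) return new Response('Unauthorized', { status: 401 });

    const customer = await env.DB
      .prepare('SELECT id, rate_limit FROM api_keys WHERE key_hash = ?')
      .bind(await sha256(apiKey))
      .first<{ id: string; rate_limit: number }>();
    if (!customer) return new Response('Unauthorized', { status: 401 });

    // Rate limiting: one KV counter per customer per minute.
    // KV is eventually consistent, so the limit is approximate by design.
    const windowKey = `rl:${customer.id}:${Math.floor(Date.now() / 60_000)}`;
    const used = Number((await env.CACHE.get(windowKey)) ?? 0);
    if (used >= customer.rate_limit) {
      return new Response('Too Many Requests', { status: 429 });
    }
    await env.CACHE.put(windowKey, String(used + 1), { expirationTtl: 120 });

    return route(req, env);
  },
};

async function sha256(value: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(value));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}
```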

Layer 4: Applications

Multiple interfaces consume the API:

  • Client Apps: Customer-facing dashboards (React/Next.js)
  • Admin Console: Internal management interface
  • TypeScript SDK: Generated from OpenAPI spec
  • Webhook Consumers: Push notifications to customer systems

Analytics Layer: ClickHouse

D1 handles transactional workloads, but heavy analytics queries go to ClickHouse:

| D1 (Transactional) | ClickHouse (Analytics) |
| --- | --- |
| Article CRUD | Article metrics over time |
| Customer data | Customer engagement trends |
| Real-time lookups | Pipeline latency percentiles |
| Dedup checks | Keyword velocity for trends |
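
As a sketch of the split in practice (table and column names here are illustrative): transactional reads stay on D1, while aggregations go to ClickHouse over its HTTP interface.

```typescript
// Transactional lookups use the D1 binding; analytical queries are sent to
// ClickHouse over HTTP. The CLICKHOUSE_* values and table names are assumptions.
interface Env {
  DB: D1Database;
  CLICKHOUSE_URL: string;
  CLICKHOUSE_USER: string;
  CLICKHOUSE_PASSWORD: string;
}

// Transactional: fetch one article from the source of truth.
async function getArticle(env: Env, id: string) {
  return env.DB.prepare('SELECT * FROM articles WHERE id = ?').bind(id).first();
}

// Analytical: daily article counts for a keyword over the last 30 days.
async function keywordVelocity(env: Env, keyword: string) {
  const sql = `
    SELECT toDate(published_at) AS day, count() AS articles
    FROM article_metrics
    WHERE keyword = {keyword:String} AND published_at > now() - INTERVAL 30 DAY
    GROUP BY day ORDER BY day
    FORMAT JSON`;
  const res = await fetch(`${env.CLICKHOUSE_URL}/?param_keyword=${encodeURIComponent(keyword)}`, {
    method: 'POST',
    headers: {
      'X-ClickHouse-User': env.CLICKHOUSE_USER,
      'X-ClickHouse-Key': env.CLICKHOUSE_PASSWORD,
    },
    body: sql,
  });
  return res.json();
}
```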

Key Architectural Decisions

1. Cloudflare-First Infrastructure

Every component runs on Cloudflare's edge. No AWS, no GCP, no self-hosted infrastructure.

2. Workflows + Queues Hybrid

Workflows orchestrate complex multi-step processes. Queues handle high-throughput data movement. They work together.

3. Plugin Architecture

Three extension points for customization:

  • Source Adapters: Where we crawl from (Google News, RSS, future sources)
  • Extraction Plugins: What structured data we extract (entities, events, funding rounds)
  • Output Adapters: Where enriched data goes (email, Slack, webhooks)
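
One plausible shape for these extension points, sketched as TypeScript interfaces (the actual interfaces in the codebase may differ):

```typescript
// Hypothetical plugin contracts for the three extension points.
interface DiscoveredUrl {
  url: string;
  publishedAt?: string;
}

interface SourceAdapter {
  id: string; // e.g. 'google-news', 'rss'
  discover(keyword: string): Promise<DiscoveredUrl[]>;
}

interface ExtractionPlugin<T = unknown> {
  id: string;     // e.g. 'funding-rounds'
  schema: object; // JSON Schema describing the output (see Schema-Driven Extraction)
  extract(article: { html: string; text: string }): Promise<T | null>;
}

interface OutputAdapter {
  id: string; // e.g. 'email', 'slack', 'webhook'
  deliver(payload: { articleId: string; data: unknown }): Promise<void>;
}
```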

4. Schema-Driven Extraction

Every extractor declares its output shape as a JSON Schema. This enables:

  • Validation at extraction time
  • Type-safe client code generation
  • Consistent storage in ARTICLE_EXTRACTIONS
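
For illustration, a hypothetical funding-round extractor schema and the validate-then-store path it enables (real schemas, validator wiring, and ARTICLE_EXTRACTIONS columns may differ):

```typescript
// Illustrative extractor output schema and persistence path.
interface Env {
  DB: D1Database;
}

const fundingRoundSchema = {
  $id: 'funding-round',
  type: 'object',
  required: ['company', 'amountUsd', 'round'],
  properties: {
    company: { type: 'string' },
    amountUsd: { type: 'number' },
    round: { type: 'string', enum: ['seed', 'series-a', 'series-b', 'other'] },
    investors: { type: 'array', items: { type: 'string' } },
  },
  additionalProperties: false,
};

// Backed by a JSON Schema validator (e.g. one compiled ahead of time for Workers).
declare function validateAgainstSchema(
  schema: object,
  data: unknown
): { valid: boolean; errors?: string[] };

async function storeExtraction(env: Env, articleId: string, data: unknown) {
  const result = validateAgainstSchema(fundingRoundSchema, data);
  if (!result.valid) {
    throw new Error(`Extraction rejected: ${result.errors?.join('; ')}`);
  }
  await env.DB
    .prepare('INSERT INTO article_extractions (article_id, extractor_id, data) VALUES (?, ?, ?)')
    .bind(articleId, 'funding-round', JSON.stringify(data))
    .run();
}
```

The same schemas can also feed a code generator such as json-schema-to-typescript to produce the type-safe client types mentioned above.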

5. D1 + ClickHouse Split

  • D1: Source of truth for all entities, fast writes, transactional guarantees
  • ClickHouse: Analytics, time-series, heavy aggregations, trend detection