# Architecture Overview
StoryIntel: A News Intelligence Platform Built on Cloudflare
StoryIntel is a serverless news intelligence platform that crawls, enriches, classifies, and delivers relevant news articles to customers. The entire infrastructure runs on Cloudflare's edge platform, providing global low-latency access and zero cold-start compute.
## High-Level Stack Diagram

## Layer-by-Layer Breakdown

### Layer 1: Cloudflare Foundation
The bedrock of StoryIntel. All persistent state, caching, and AI inference run on Cloudflare's global edge network.
| Component | Purpose | Key Details |
|---|---|---|
| D1 (SQLite) | Transactional database | 35+ tables, source of truth |
| R2 (Objects) | Raw HTML snapshots, exports | Unlimited storage |
| KV (Cache) | Rate limiting, dedup cache | Global edge replication |
| Vectorize | Semantic search indexes | 7 indexes, 1536 dimensions |
| Workers AI | Embeddings & classification | @cf/baai/bge-base-en-v1.5 |
Why Cloudflare? Zero cold starts, global edge deployment, unified billing, and tight integration between services. No VPCs, no container orchestration, no infrastructure management.
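In Worker code these services surface as bindings on the environment object. A minimal sketch of that interface, assuming the global types from `@cloudflare/workers-types` and illustrative binding names (`DB`, `SNAPSHOTS`, `CACHE`, `ARTICLE_INDEX`, and `AI` are placeholders, not StoryIntel's actual binding names):

```ts
// Hypothetical Worker bindings for the services in the table above.
// Types are the globals provided by @cloudflare/workers-types.
export interface Env {
  DB: D1Database;                // D1: transactional source of truth (35+ tables)
  SNAPSHOTS: R2Bucket;           // R2: raw HTML snapshots and exports
  CACHE: KVNamespace;            // KV: rate limiting and dedup cache
  ARTICLE_INDEX: VectorizeIndex; // Vectorize: one of the semantic search indexes
  AI: Ai;                        // Workers AI: embeddings and classification
}
```

The sketches later on this page reuse this `Env` shape.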
### Layer 2: Processing Pipeline
Orchestration happens through two complementary patterns:
#### Workflows (Control Plane)
- Long-running, stateful processes
- Automatic retries and checkpointing
- Human-in-the-loop capabilities
- Examples: `IngestKeyword`, `ProcessArticle`, `StoryCluster` (see the sketch below)
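A rough sketch of what one of these Workflows could look like, using the `Env` bindings sketched above. The class, step names, payload shape, and SQL are illustrative, not StoryIntel's actual implementation:

```ts
import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from "cloudflare:workers";

// Hypothetical shape of a ProcessArticle workflow; all identifiers are illustrative.
type ProcessArticleParams = { articleId: string };

export class ProcessArticle extends WorkflowEntrypoint<Env, ProcessArticleParams> {
  async run(event: WorkflowEvent<ProcessArticleParams>, step: WorkflowStep) {
    // Each step.do() result is checkpointed by the Workflows runtime.
    const html = await step.do("load-snapshot", async () => {
      const obj = await this.env.SNAPSHOTS.get(`snapshots/${event.payload.articleId}.html`);
      return obj ? await obj.text() : "";
    });

    await step.do("enrich-and-mark", async () => {
      // Embed the text with Workers AI; persisting vectors to Vectorize is omitted here.
      const embeddings = await this.env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [html] });
      await this.env.DB.prepare("UPDATE articles SET status = ?1 WHERE id = ?2")
        .bind("processed", event.payload.articleId)
        .run();
      return embeddings.data.length; // keep the checkpointed value small
    });
  }
}
```

Because step results are checkpointed, a transient failure resumes from the last completed step instead of re-running the whole pipeline.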
#### Queues (Data Plane)
- High-throughput async processing
- Batching for efficiency
- Dead-letter queues for failures
- Examples: `crawl.batch`, `article.extract`, `notify.dispatch` (see the consumer sketch below)
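On the data-plane side, a consumer is a Worker exporting a `queue()` handler. A minimal sketch with an assumed message shape (the real `article.extract` payload is not documented here):

```ts
// Hypothetical consumer for a queue like article.extract; the message body,
// KV key format, and TTL are assumptions.
interface ExtractMessage {
  articleId: string;
  url: string;
}

export default {
  async queue(batch: MessageBatch<ExtractMessage>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      try {
        // ...extraction work would go here (fetch, parse, persist to D1/R2)...
        await env.CACHE.put(`dedup:${msg.body.url}`, "1", { expirationTtl: 86_400 });
        msg.ack(); // acknowledge per message so one failure does not retry the batch
      } catch {
        msg.retry(); // exhausted retries are routed to the dead-letter queue
      }
    }
  },
};
```

Per-message `ack()` / `retry()` keeps one bad article from forcing redelivery of the whole batch.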
### Layer 3: API Gateway
All external requests hit Cloudflare Workers at the edge:
- Authentication: API key validation, JWT tokens, admin tokens
- Rate Limiting: Per-customer limits stored in KV
- Routing: Path-based routing to appropriate handlers
- CORS: Proper cross-origin handling for web clients
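Put together, the gateway is a single Worker `fetch` handler that checks auth, applies rate limits, and routes before any handler runs. A sketch with assumed header names, limits, and paths (CORS handling omitted for brevity):

```ts
// Hypothetical edge gateway: auth -> rate limit -> route.
// The header name, KV key scheme, limit, and paths are illustrative.
export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const apiKey = req.headers.get("x-api-key");
    if (!apiKey) return new Response("Unauthorized", { status: 401 });

    // Per-customer, per-minute counter in KV (coarse-grained, eventually consistent).
    const windowKey = `rate:${apiKey}:${Math.floor(Date.now() / 60_000)}`;
    const used = Number((await env.CACHE.get(windowKey)) ?? 0);
    if (used >= 600) return new Response("Too Many Requests", { status: 429 });
    await env.CACHE.put(windowKey, String(used + 1), { expirationTtl: 120 });

    // Path-based routing to handlers.
    const { pathname } = new URL(req.url);
    if (pathname.startsWith("/v1/articles")) return articlesHandler(req, env);
    return new Response("Not Found", { status: 404 });
  },
};

// Stub standing in for the real article routes.
async function articlesHandler(_req: Request, _env: Env): Promise<Response> {
  return Response.json({ ok: true });
}
```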
### Layer 4: Applications
Multiple interfaces consume the API:
- Client Apps: Customer-facing dashboards (React/Next.js)
- Admin Console: Internal management interface
- TypeScript SDK: Generated from OpenAPI spec
- Webhook Consumers: Push notifications to customer systems
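As one example, client code built on the generated SDK might read roughly like this; the package name, client class, and methods are purely hypothetical, since only the fact that the SDK is generated from the OpenAPI spec is stated here:

```ts
// Hypothetical usage of the generated TypeScript SDK; every identifier below is assumed.
import { StoryIntelClient } from "@storyintel/sdk";

const client = new StoryIntelClient({ apiKey: process.env.STORYINTEL_API_KEY! });

// List recent articles matching a tracked keyword.
const articles = await client.articles.list({ keyword: "series a funding", limit: 20 });
console.log(articles.map((a) => a.title));
```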
### Analytics Layer: ClickHouse
D1 handles transactional workloads, but heavy analytics queries go to ClickHouse:
| D1 (Transactional) | ClickHouse (Analytics) |
|---|---|
| Article CRUD | Article metrics over time |
| Customer data | Customer engagement trends |
| Real-time lookups | Pipeline latency percentiles |
| Dedup checks | Keyword velocity for trends |
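To illustrate the split, a trend question such as keyword velocity is answered by ClickHouse rather than D1. A sketch using ClickHouse's HTTP interface, with assumed endpoint, credentials, table, and columns:

```ts
// Hypothetical keyword-velocity query against ClickHouse over its HTTP interface.
// The endpoint, credentials, table name, and columns are assumptions.
async function keywordVelocity(env: { CH_URL: string; CH_USER: string; CH_KEY: string }) {
  const sql = `
    SELECT keyword, toStartOfHour(published_at) AS hour, count() AS articles
    FROM article_events
    WHERE published_at >= now() - INTERVAL 24 HOUR
    GROUP BY keyword, hour
    ORDER BY hour, articles DESC
    FORMAT JSONEachRow`;

  const res = await fetch(env.CH_URL, {
    method: "POST",
    headers: { "X-ClickHouse-User": env.CH_USER, "X-ClickHouse-Key": env.CH_KEY },
    body: sql,
  });
  return res.text(); // one JSON object per row
}
```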
## Key Architectural Decisions

### 1. Cloudflare-First Infrastructure
Every component runs on Cloudflare's edge. No AWS, no GCP, no self-hosted infrastructure.
### 2. Workflows + Queues Hybrid
Workflows orchestrate complex multi-step processes. Queues handle high-throughput data movement. They work together.
### 3. Plugin Architecture
Three extension points for customization:
- Source Adapters: Where we crawl from (Google News, RSS, future sources)
- Extraction Plugins: What structured data we extract (entities, events, funding rounds)
- Output Adapters: Where enriched data goes (email, Slack, webhooks)
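One way these extension points could be expressed in TypeScript; the interface and method names below are assumptions for illustration, not StoryIntel's actual plugin API:

```ts
// Hypothetical shapes for the three extension points.
export interface SourceAdapter {
  id: string; // e.g. "google-news", "rss"
  fetchCandidates(keyword: string): Promise<{ url: string; title: string }[]>;
}

export interface ExtractionPlugin<T> {
  id: string;           // e.g. "funding-round"
  outputSchema: object; // JSON Schema for T (see Schema-Driven Extraction below)
  extract(articleText: string): Promise<T | null>;
}

export interface OutputAdapter {
  id: string; // e.g. "email", "slack", "webhook"
  deliver(payload: unknown, destination: string): Promise<void>;
}
```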
### 4. Schema-Driven Extraction
All extractors define their output as JSON Schema. This enables:
- Validation at extraction time
- Type-safe client code generation
- Consistent storage in `ARTICLE_EXTRACTIONS`
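For instance, a funding-round extractor could declare its output roughly like this (field names and constraints are illustrative):

```ts
// Hypothetical JSON Schema for a funding-round extractor's output.
export const fundingRoundSchema = {
  $schema: "https://json-schema.org/draft/2020-12/schema",
  type: "object",
  required: ["company", "round", "amount_usd"],
  properties: {
    company: { type: "string" },
    round: { type: "string", enum: ["pre-seed", "seed", "series-a", "series-b", "later-stage"] },
    amount_usd: { type: "number", minimum: 0 },
    investors: { type: "array", items: { type: "string" } },
  },
  additionalProperties: false,
} as const;
```

The same schema object can drive runtime validation at extraction time and typed client code generation.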
### 5. D1 + ClickHouse Split
- D1: Source of truth for all entities, fast writes, transactional guarantees
- ClickHouse: Analytics, time-series, heavy aggregations, trend detection
## Quick Links
- System Flow Diagram - Granular Mermaid diagram of the full pipeline
- Cloudflare Foundation - Deep dive into each Cloudflare component
- Data Model - Entity relationships and table structure
- Plugin Architecture - How to extend StoryIntel