StoryIntel

100% Cloudflare-Native News Intelligence Platform


Overview

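StoryIntel is a serverless news intelligence platform that crawls, enriches, classifies, and delivers relevant news articles to customers. The processing pipeline and primary storage run on Cloudflare's edge platform; third-party services (see External Services) are used for article discovery, scraping, enrichment, and fallback LLM reasoning.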

Key Features

  • Real-time News Monitoring - Crawl multiple sources every 15 minutes
  • AI Classification - Multi-tier classification (rules, vectors, LLM)
  • Story Clustering - Group articles into evolving narratives
  • Pluggable Extraction - Extract structured data (entities, events, funding rounds)
  • Personalized Feeds - Match articles to customer profiles
  • Intelligent Alerts - Notify on high-relevance matches
  • Daily Briefings - AI-generated news summaries
  • Q&A Interface - Ask questions about your feed
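
The multi-tier classification listed above can be pictured as a cascade: cheap rules run first, vector similarity second, and an LLM only when the earlier tiers are not confident. The sketch below is illustrative only. The Classification and TierFn types, the stand-in tiers, and the 0.8 confidence threshold are all assumptions made for this example, not StoryIntel's actual implementation.

```typescript
// Hypothetical sketch of a rules -> vectors -> LLM cascade.
// Every name and the threshold below are illustrative assumptions.

type Tier = "rules" | "vectors" | "llm";

interface Classification {
  label: string;
  confidence: number; // 0..1
  tier: Tier;
}

// A tier returns a label with a confidence, or null if it has no opinion.
type TierFn = (text: string) => Promise<{ label: string; confidence: number } | null>;

async function classify(
  text: string,
  tiers: Array<[Tier, TierFn]>,
  minConfidence = 0.8,
): Promise<Classification | null> {
  let best: Classification | null = null;
  for (const [tier, run] of tiers) {
    const result = await run(text);
    if (result === null) continue;
    const candidate: Classification = { ...result, tier };
    // Stop at the first tier that is confident enough, so the later
    // (more expensive) tiers are never invoked.
    if (candidate.confidence >= minConfidence) return candidate;
    if (best === null || candidate.confidence > best.confidence) best = candidate;
  }
  return best; // Best low-confidence guess, or null if no tier answered.
}

// Stand-in tiers for demonstration; real tiers would call rule tables,
// a vector index, and a language model respectively.
const rulesTier: TierFn = async (text) =>
  /series [a-d]\b/i.test(text) ? { label: "funding", confidence: 0.95 } : null;
const vectorTier: TierFn = async () => ({ label: "technology", confidence: 0.6 });
const llmTier: TierFn = async () => ({ label: "markets", confidence: 0.85 });

const demoTiers: Array<[Tier, TierFn]> = [
  ["rules", rulesTier],
  ["vectors", vectorTier],
  ["llm", llmTier],
];
```

With these stand-ins, a headline mentioning a Series B round is settled by the rules tier alone, while any other headline falls through to the LLM tier because the vector tier's 0.6 is below the threshold.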

Documentation

Architecture

Document                               Description
architecture/overview.md               High-level stack diagram, why Cloudflare
architecture/system-flow.md            Granular pipeline flow (hero diagram)
architecture/cloudflare-foundation.md  Deep dive into each CF component

Plugin System

Document                        Description
plugins/overview.md             Extension points: sources, extractors, outputs
plugins/extraction-plugins.md   Built-in extractors and custom creation
plugins/google-news-adapter.md  Google News source adapter

API Reference

Document                    Description
api-reference/openapi.yaml  OpenAPI 3.1 specification

Operations

Document                  Description
operations/deployment.md  Wrangler config, CI/CD, environments
operations/runbook.md     Operations procedures, troubleshooting
operations/security.md    Authentication, authorization, encryption
operations/testing.md     Test strategy, fixtures, mocks

Reference

Document                         Description
reference/database-schema.sql    D1 schema (35+ tables)
reference/clickhouse-schema.sql  ClickHouse analytics schema
reference/taxonomy-seed.md       Classification labels
reference/external-apis.md       Google News, ZenRows, DataForSEO
reference/ai-agents.md           AI agents specification
reference/gaps-and-unknowns.md   Known issues, risks
reference/phased-build-plan.md   Implementation phases

Architecture at a Glance

                   CLOUDFLARE EDGE NETWORK
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                     API GATEWAY                      │  │
│  │              Cloudflare Workers (Edge)               │  │
│  │            Auth • Rate Limiting • Routing            │  │
│  └──────────────────────────┬───────────────────────────┘  │
│                             │                              │
│  ┌──────────────────────────┴───────────────────────────┐  │
│  │                 PROCESSING PIPELINE                  │  │
│  │  ┌────────────────────────────────────────────────┐  │  │
│  │  │  WORKFLOWS (Control)     QUEUES (Data)         │  │  │
│  │  │  IngestKeyword           crawl.batch           │  │  │
│  │  │  ProcessArticle          article.extract       │  │  │
│  │  │  StoryCluster            notify.dispatch       │  │  │
│  │  └────────────────────────────────────────────────┘  │  │
│  └──────────────────────────┬───────────────────────────┘  │
│                             │                              │
│  ┌──────────────────────────┴───────────────────────────┐  │
│  │                CLOUDFLARE FOUNDATION                 │  │
│  │       D1 (SQLite) • R2 (Objects) • KV (Cache)        │  │
│  │          Vectorize (7 indexes) • Workers AI          │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
└─────────────────────────────┬──────────────────────────────┘
                              │
┌─────────────────────────────┴──────────────────────────────┐
│                     EXTERNAL SERVICES                      │
│ Google News • ZenRows • DataForSEO • SharedCount • OpenAI  │
└────────────────────────────────────────────────────────────┘

See architecture/overview.md for the full stack diagram.


Quick Start

Prerequisites

  • Node.js 18+
  • pnpm 8+
  • Cloudflare account with Workers Paid plan
  • Wrangler CLI

Installation

# Clone repository
git clone https://github.com/lovelady/storyintel.git
cd storyintel/api

# Install dependencies
pnpm install

# Authenticate with Cloudflare
wrangler login

# Create D1 database
wrangler d1 create storyintel-dev

# Run migrations
wrangler d1 migrations apply storyintel-dev --local

# Start development server
pnpm dev

Configuration

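Copy the example wrangler config, then add the database_id printed by wrangler d1 create to its D1 binding. The wrangler d1 migrations apply command in the installation steps looks up storyintel-dev in wrangler.toml, so complete this step before applying migrations: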

cp wrangler.example.toml wrangler.toml

Set required secrets:

wrangler secret put ZENROWS_API_KEY
wrangler secret put DATA4SEO_API_KEY
wrangler secret put SHAREDCOUNT_API_KEY
wrangler secret put OPENAI_API_KEY
wrangler secret put JWT_SECRET

Project Structure

storyintel/
├── api/
│   ├── src/
│   │   ├── workers/        # Worker entry points
│   │   ├── workflows/      # Cloudflare Workflows
│   │   ├── services/       # Business logic
│   │   ├── db/             # Database layer
│   │   └── shared/         # Utilities
│   ├── migrations/         # D1 migrations
│   ├── docs/               # Documentation (you are here)
│   │   ├── architecture/   # System design
│   │   ├── plugins/        # Extension points
│   │   ├── api-reference/  # OpenAPI spec
│   │   ├── operations/     # Deployment, runbook
│   │   └── reference/      # Schema, external APIs
│   └── tests/              # Test suites
├── console/                # Admin console (React)
└── client-web/             # Customer web app

Cloudflare Resources

Workers (8)

Worker             Purpose
api-gateway        Public API routing
admin-api          Admin endpoints
crawl-consumer     Content fetching
extract-consumer   HTML parsing
enrich-consumer    Social metrics, backlinks
classify-consumer  Classification
cluster-consumer   Story clustering
notify-consumer    Alert dispatch

Queues (9)

Queue             Purpose
crawl.batch       Crawl jobs
article.extract   Extraction jobs
article.enrich    Enrichment jobs
article.embed     Embedding jobs
article.classify  Classification jobs
story.cluster     Clustering jobs
profile.match     Matching jobs
notify.dispatch   Notification jobs
cost.track        Cost tracking
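
Each consumer worker drains one of these queues in batches. The sketch below shows one common pattern for such a consumer: acknowledge each message that succeeds and request redelivery for each one that fails, so a single bad article does not block the rest of the batch. The QueueMessage interface is a minimal local stand-in for the body, ack, and retry members that Cloudflare Queues messages expose; the ExtractJob payload shape and the handleBatch helper are hypothetical.

```typescript
// Minimal local stand-in for the parts of a Cloudflare Queues message
// used here (body, ack, retry). The payload fields are hypothetical.
interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}

// Assumed payload for a queue such as article.extract.
interface ExtractJob {
  articleId: string;
  url: string;
}

// Process one batch: acknowledge messages that succeed and ask the
// queue to redeliver the ones that fail.
async function handleBatch(
  messages: QueueMessage<ExtractJob>[],
  process: (job: ExtractJob) => Promise<void>,
): Promise<{ acked: number; retried: number }> {
  let acked = 0;
  let retried = 0;
  for (const message of messages) {
    try {
      await process(message.body);
      message.ack();
      acked++;
    } catch {
      // Per-message retry keeps one failure from redelivering the batch.
      message.retry();
      retried++;
    }
  }
  return { acked, retried };
}
```

In a real Worker this helper would be called from the queue handler with batch.messages; acknowledging per message rather than per batch is a design choice that trades a little bookkeeping for finer-grained retries.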

Vectorize Indexes (7)

Index      Purpose
articles   Article embeddings (1536 dims)
stories    Story centroids
profiles   Customer preferences
taxonomy   Classification labels
entities   Named entities
locations  Geographic locations (225K+)
authors    Author embeddings
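
To make the relationship between article embeddings and story centroids concrete, here is a small in-memory sketch of centroid-based clustering: an article joins the most similar story if the cosine similarity clears a threshold, otherwise it starts a new story. This is one plausible approach, not a description of StoryIntel's clustering. The Story type, the 0.8 threshold, the running-mean centroid update, and the generated story ids are assumptions, and the 3-dimensional vectors stand in for real embeddings.

```typescript
// Hypothetical in-memory sketch; a real system would query a vector index.
interface Story {
  id: string;
  centroid: number[];
  size: number; // number of articles already in the story
}

// Cosine similarity of two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Attach the embedding to the nearest story, or create a new story when
// nothing is similar enough. The threshold is an assumed value.
function assignToStory(embedding: number[], stories: Story[], threshold = 0.8): Story {
  let best: Story | null = null;
  let bestScore = -1;
  for (const story of stories) {
    const score = cosine(embedding, story.centroid);
    if (score > bestScore) {
      best = story;
      bestScore = score;
    }
  }
  if (best !== null && bestScore >= threshold) {
    const target = best;
    const previousSize = target.size;
    // Running mean: the centroid stays the average of all member embeddings.
    target.centroid = target.centroid.map(
      (value, i) => (value * previousSize + embedding[i]) / (previousSize + 1),
    );
    target.size = previousSize + 1;
    return target;
  }
  const created: Story = {
    id: "story-" + (stories.length + 1),
    centroid: [...embedding],
    size: 1,
  };
  stories.push(created);
  return created;
}
```

The running mean lets a story's centroid drift as coverage evolves, which fits the idea of evolving narratives, at the cost of making assignments depend on arrival order.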

Workflows (3)

Workflow          Purpose
IngestKeyword     Cron-triggered acquisition
StoryCluster      Periodic re-clustering
RetentionCleanup  Data lifecycle

API Overview

Authentication

# Customer API
curl -H "X-API-Key: si_live_abc123..." https://api.storyintel.com/v1/feed

# Admin API
curl -H "Authorization: Bearer eyJ..." https://api.storyintel.com/v1/admin/...
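
The two header styles above suggest that the gateway first decides which kind of credential a request carries. The sketch below shows only that detection step and assumes nothing about how keys or tokens are then verified. The Credential type and readCredential function are hypothetical names introduced for this example.

```typescript
// Hypothetical sketch: classify the credential on an incoming request.
// It does not validate the key or verify the token's signature.
type Credential =
  | { kind: "customer"; apiKey: string }
  | { kind: "admin"; token: string }
  | { kind: "none" };

function readCredential(headers: Record<string, string>): Credential {
  // HTTP header names are case-insensitive, so normalize them first.
  const normalized: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    normalized[name.toLowerCase()] = value.trim();
  }

  const apiKey = normalized["x-api-key"];
  if (apiKey) return { kind: "customer", apiKey };

  const authorization = normalized["authorization"] ?? "";
  const match = /^Bearer\s+(\S+)$/i.exec(authorization);
  if (match) return { kind: "admin", token: match[1] };

  return { kind: "none" };
}
```

Checking the API key header first is an arbitrary choice in this sketch; a gateway that serves customer and admin routes from separate workers might instead accept only one credential type per route.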

Key Endpoints

# Feed and Discovery
GET /v1/feed - Personalized article feed
GET /v1/stories - Story feed
GET /v1/articles/:id - Single article
GET /v1/search - Full-text + semantic search

# Intelligence
GET /v1/briefings/daily - AI daily briefing
POST /v1/qa - Ask questions

# Profile Management
GET /v1/profiles - List profiles
POST /v1/profiles - Create profile

# Admin
POST /v1/admin/crawl/trigger - Manual crawl
GET /v1/admin/pipeline/status - System health
GET /v1/admin/costs - Cost tracking

See api-reference/openapi.yaml for complete specification.
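
As a rough illustration of how paths such as /v1/articles/:id can be resolved, the sketch below matches a method and path against a small route table and extracts named parameters. The route names, the Route type, and the matchRoute function are hypothetical; the OpenAPI specification remains the source of truth for the actual endpoints.

```typescript
// Hypothetical route table covering three of the endpoints listed above.
interface Route {
  method: string;
  pattern: string;
  name: string;
}

const routes: Route[] = [
  { method: "GET", pattern: "/v1/feed", name: "feed" },
  { method: "GET", pattern: "/v1/articles/:id", name: "article" },
  { method: "POST", pattern: "/v1/qa", name: "qa" },
];

// Return the matching route name and any ":param" values, or null.
function matchRoute(
  method: string,
  path: string,
): { name: string; params: Record<string, string> } | null {
  const pathParts = path.split("/").filter(Boolean);
  for (const route of routes) {
    if (route.method !== method.toUpperCase()) continue;
    const patternParts = route.pattern.split("/").filter(Boolean);
    if (patternParts.length !== pathParts.length) continue;

    const params: Record<string, string> = {};
    let matched = true;
    for (let i = 0; i < patternParts.length; i++) {
      const expected = patternParts[i];
      const actual = pathParts[i];
      if (expected.startsWith(":")) {
        params[expected.slice(1)] = actual; // capture the named segment
      } else if (expected !== actual) {
        matched = false;
        break;
      }
    }
    if (matched) return { name: route.name, params };
  }
  return null;
}
```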


Plugin Architecture

StoryIntel is extensible at three points:

Extension Point     Purpose               Examples
Source Adapters     Where we crawl from   Google News, RSS, Twitter
Extraction Plugins  What data we extract  Entities, Events, Funding Rounds
Output Adapters     Where data goes       Email, Slack, Webhook, Airtable

See plugins/overview.md for details.
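
One way to picture the three extension points is as three small interfaces. The shapes below are guesses made for illustration, and plugins/overview.md defines the real contracts. SourceAdapter, ExtractionPlugin, OutputAdapter, RawItem, FundingMention, and the regex-based fundingExtractor are all hypothetical.

```typescript
// Hypothetical interfaces for the three extension points.
interface RawItem {
  url: string;
  title: string;
}

interface SourceAdapter {
  name: string;
  discover(keyword: string): Promise<RawItem[]>;
}

interface ExtractionPlugin<T> {
  name: string;
  extract(text: string): T[];
}

interface OutputAdapter {
  name: string;
  deliver(payload: unknown): Promise<void>;
}

// Example extraction plugin: pull amounts such as "$12M" or "$1.5B".
interface FundingMention {
  amountUsd: number;
  raw: string;
}

const fundingExtractor: ExtractionPlugin<FundingMention> = {
  name: "funding-rounds",
  extract(text: string): FundingMention[] {
    const mentions: FundingMention[] = [];
    // The regex is created per call so its lastIndex state is never shared.
    const pattern = /\$(\d+(?:\.\d+)?)\s?(M|B)\b/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const scale = match[2] === "B" ? 1e9 : 1e6;
      mentions.push({ amountUsd: Number(match[1]) * scale, raw: match[0] });
    }
    return mentions;
  },
};
```

A production extractor would likely rely on entity recognition rather than a regex; the point here is only that a plugin takes article text and returns typed, structured records.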


External Services

Service      Purpose              Cost Model
Google News  Article discovery    Free (rate limited)
ZenRows      Anti-bot bypass      Per request (~$0.005)
DataForSEO   Backlinks, fallback  Per request (~$0.004)
SharedCount  Social metrics       Per request (~$0.0001)
Workers AI   Embeddings, LLM      Included in plan
OpenAI       Complex reasoning    Per token (fallback)

See reference/external-apis.md for integration details.


Cost Management

All paid operations are tracked:

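-- Today's spend per service in USD (cost_micros is in millionths of a dollar)
SELECT service, SUM(cost_micros) / 1000000.0 AS usd
FROM cost_events
WHERE date(timestamp) = date('now')  -- SQLite evaluates date('now') in UTC
GROUP BY service;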

Typical cost: $0.005 - $0.015 per article

At 10,000 articles/day: ~$50-150/day
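
The figures above follow from simple arithmetic on micro-dollar amounts. The sketch below reproduces the daily range and shows how the per-request prices in the External Services table add up for a single article. It assumes exactly one ZenRows, one DataForSEO, and one SharedCount request per article, which is an illustrative assumption and not a documented behavior; the function and constant names are hypothetical.

```typescript
// Costs are kept as integer micro-dollars to avoid floating-point drift.
const MICROS_PER_USD = 1_000_000;

function microsToUsd(micros: number): number {
  return micros / MICROS_PER_USD;
}

// Project a daily spend range from per-article costs in micro-dollars.
function dailyCostRange(
  articlesPerDay: number,
  lowPerArticleMicros: number,
  highPerArticleMicros: number,
): { lowUsd: number; highUsd: number } {
  return {
    lowUsd: microsToUsd(articlesPerDay * lowPerArticleMicros),
    highUsd: microsToUsd(articlesPerDay * highPerArticleMicros),
  };
}

// $0.005 and $0.015 per article are 5,000 and 15,000 micro-dollars.
const projected = dailyCostRange(10_000, 5_000, 15_000);
// projected.lowUsd is 50 and projected.highUsd is 150.

// Assumed per-article mix: one request to each per-request service.
const perArticleMicros = 5_000 /* ZenRows */ + 4_000 /* DataForSEO */ + 100 /* SharedCount */;
const perArticleUsd = microsToUsd(perArticleMicros); // about $0.0091
```

Under that assumption the per-request services alone come to roughly $0.0091 per article, which sits inside the stated $0.005 to $0.015 range; token-based OpenAI fallback calls would add to it.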


Contributing

  1. Create feature branch from main
  2. Write tests for new functionality
  3. Ensure all tests pass: pnpm test
  4. Submit PR for review

License

Proprietary - All rights reserved


Built on Cloudflare Workers, D1, R2, KV, Vectorize, Queues, Workflows, and Workers AI


Last updated: December 19, 2024 at 10:45 PM PST