Plugin Architecture
Three extension points for customizing the pipeline: Sources, Extractors, and Outputs.
StoryIntel is designed as a pluggable platform. While the core pipeline is fixed, you can extend it at three key points to add new data sources, extract custom structured data, and deliver to new destinations.
Extension Points Overview
1. Source Adapters (Input)
Source adapters define where we crawl from. Each adapter knows how to discover URLs and fetch content from a specific source.
Interface Definition
interface SourceAdapter {
  // Unique identifier
  id: string;
  name: string;

  // Discovery: Find URLs matching a keyword
  discover(keyword: Keyword): Promise<DiscoveredURL[]>;

  // Fetch: Get raw content from a URL
  fetch(url: string): Promise<RawContent>;

  // Normalize: Convert to standard Article format
  normalize(raw: RawContent): Promise<Article>;

  // Health: Check if source is accessible
  healthCheck(): Promise<HealthStatus>;

  // Rate limits for this source
  rateLimits: RateLimitConfig;
}

interface DiscoveredURL {
  url: string;
  title?: string;
  publishedAt?: Date;
  source?: string;
}

interface RawContent {
  html: string;
  url: string;
  fetchedAt: Date;
  headers: Record<string, string>;
}

interface RateLimitConfig {
  requestsPerMinute: number;
  requestsPerHour: number;
  requestsPerDay: number;
  burstLimit: number;
}
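As a rough illustration of how a crawler might honor `RateLimitConfig` before calling `fetch`, the sketch below checks a list of past request timestamps against each window. The function names are hypothetical, and the reading of `burstLimit` as a cap on requests within any one-second window is an assumption; the interface above does not define its semantics.

```typescript
interface RateLimitConfig {
  requestsPerMinute: number;
  requestsPerHour: number;
  requestsPerDay: number;
  burstLimit: number;
}

// Count past requests that fall inside the window (nowMs - windowMs, nowMs].
function countInWindow(pastMs: number[], nowMs: number, windowMs: number): number {
  return pastMs.filter((t) => t > nowMs - windowMs && t <= nowMs).length;
}

// Returns true when one more request at nowMs would stay within every limit.
// ASSUMPTION: burstLimit caps requests within any one-second window.
function canRequest(limits: RateLimitConfig, pastMs: number[], nowMs: number): boolean {
  return (
    countInWindow(pastMs, nowMs, 1000) < limits.burstLimit &&
    countInWindow(pastMs, nowMs, 60 * 1000) < limits.requestsPerMinute &&
    countInWindow(pastMs, nowMs, 60 * 60 * 1000) < limits.requestsPerHour &&
    countInWindow(pastMs, nowMs, 24 * 60 * 60 * 1000) < limits.requestsPerDay
  );
}
```

A scheduler could call `canRequest` with the adapter's `rateLimits` and delay the fetch whenever it returns `false`.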
Built-in Adapters
| Adapter | Status | Discovery | Notes |
|---|---|---|---|
| Google News | ✅ Active | RSS + HTML | Primary source |
| RSS Generic | ✅ Active | RSS feeds | Any RSS/Atom feed |
| Direct Publisher | ✅ Active | Sitemap/robots.txt | For known publishers |
Future Adapters (Planned)
| Adapter | Priority | Notes |
|---|---|---|
| Twitter/X | High | Requires API access |
| | High | API or scraping |
| HackerNews | Medium | Public API |
| PR Newswire | Medium | Requires partnership |
| SEC Filings | Medium | Public EDGAR API |
| Podcasts | Low | Transcription needed |
Adding a New Source
See Source Adapters Guide for implementation details.
2. Extraction Plugins (Processing)
Extraction plugins define what structured data we pull out of articles. Each plugin takes article text and returns structured JSON matching a defined schema.
Interface Definition
interface ExtractionPlugin {
  // Unique identifier
  id: string;
  name: string;

  // Type of extraction
  type: 'builtin' | 'ai' | 'rules' | 'hybrid';

  // JSON Schema for output validation
  outputSchema: JSONSchema;

  // Extract structured data from article
  extract(
    article: Article,
    context?: ProjectContext
  ): Promise<ExtractionResult>;

  // Validate extracted data against schema
  validate(data: unknown): boolean;

  // Estimate cost before running (for AI extractors)
  estimateCost(article: Article): number;

  // Run on ingest or on-demand?
  runOnIngest: boolean;

  // Priority (lower = runs first)
  priority: number;
}

interface ExtractionResult {
  extractorId: string;
  data: Record<string, any>; // Matches outputSchema
  confidence: number; // 0-1
  costMicros: number; // Actual cost in microdollars
}

interface ProjectContext {
  projectId: string;
  focusType: string; // e.g., 'funding_rounds'
  focusDescription: string; // LLM-readable context
}
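The `runOnIngest` and `priority` fields imply an execution order at ingest time. The following is a minimal sketch of that ordering; `SchedulablePlugin` and `ingestOrder` are hypothetical names, and breaking priority ties by `id` is an assumption added only to keep the result deterministic.

```typescript
// Only the scheduling-related fields of ExtractionPlugin are needed here.
interface SchedulablePlugin {
  id: string;
  runOnIngest: boolean;
  priority: number; // lower = runs first
}

// Return the ids of plugins that run at ingest time, in execution order.
function ingestOrder(plugins: SchedulablePlugin[]): string[] {
  return plugins
    .filter((p) => p.runOnIngest) // on-demand plugins are skipped at ingest
    .sort((a, b) => {
      if (a.priority !== b.priority) return a.priority - b.priority;
      // ASSUMPTION: ties are broken alphabetically by id.
      return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
    })
    .map((p) => p.id);
}
```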
Built-in Extractors
| Extractor | Type | Output | Notes |
|---|---|---|---|
| entities | AI | People, orgs, products | Named entity recognition |
| locations | Hybrid | Geo-tagged places | Match against 225K locations |
| events | AI | Dates, conferences, earnings | Structured event data |
| quotes | AI | Speaker + quote text | Attribution extraction |
| products | AI | Product mentions | Brand detection |
| jobs | AI | Job postings | Title, company, location |
| funding_rounds | AI | Series, amounts, investors | New |
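To make the `validate(data: unknown): boolean` contract more concrete, here is a hand-written check for one possible `funding_rounds` output. The field names `series`, `amountUsd`, and `investors` are hypothetical, chosen to match the "Series, amounts, investors" summary in the table; a real plugin would more likely validate against its `outputSchema` with a JSON Schema library.

```typescript
// HYPOTHETICAL output shape for the funding_rounds extractor.
interface FundingRound {
  series: string; // e.g. "Series A"
  amountUsd: number; // round size in US dollars
  investors: string[]; // investor names
}

// Type guard: true only when `data` has every field with the expected type.
function isFundingRound(data: unknown): data is FundingRound {
  if (typeof data !== "object" || data === null) return false;
  const d = data as { [key: string]: unknown };
  const series = d["series"];
  const amount = d["amountUsd"];
  const investors = d["investors"];
  return (
    typeof series === "string" &&
    typeof amount === "number" &&
    isFinite(amount) &&
    Array.isArray(investors) &&
    investors.every((name) => typeof name === "string")
  );
}
```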
Database Schema
-- Extractor definitions
CREATE TABLE extractors (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
  output_schema TEXT NOT NULL, -- JSON Schema
  run_on_ingest INTEGER DEFAULT 1,
  priority INTEGER DEFAULT 100,
  is_active INTEGER DEFAULT 1,
  description TEXT,
  created_at TEXT DEFAULT (datetime('now'))
);

-- Extraction results per article
CREATE TABLE article_extractions (
  article_id TEXT NOT NULL REFERENCES articles(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  extracted_data TEXT NOT NULL, -- JSON matching output_schema
  confidence REAL,
  cost_micros INTEGER DEFAULT 0,
  extracted_at TEXT DEFAULT (datetime('now')),
  PRIMARY KEY (article_id, extractor_id)
);

-- Searchable extraction items (denormalized)
CREATE TABLE extraction_items (
  id TEXT PRIMARY KEY,
  article_id TEXT NOT NULL REFERENCES articles(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  item_type TEXT NOT NULL, -- e.g., 'person', 'org', 'event'
  item_value TEXT NOT NULL, -- e.g., 'Elon Musk'
  item_data TEXT, -- Additional JSON
  confidence REAL,
  created_at TEXT DEFAULT (datetime('now'))
);

CREATE INDEX idx_extraction_items_type_value ON extraction_items(item_type, item_value);
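The denormalized `extraction_items` table holds one row per extracted item, so a single extraction result has to be flattened before it is stored. The sketch below shows one way this might look. It assumes the extracted data maps an item type to a list of string values, and the deterministic id scheme is hypothetical; neither is specified by the schema.

```typescript
// Row shape mirroring extraction_items (created_at is left to the database default).
interface ExtractionItemRow {
  id: string;
  article_id: string;
  extractor_id: string;
  item_type: string;
  item_value: string;
  confidence: number;
}

// ASSUMPTION: `data` maps an item type to its values,
// e.g. { person: ["Ada Lovelace"], org: ["Example Corp"] }.
function toExtractionItems(
  articleId: string,
  extractorId: string,
  data: { [itemType: string]: string[] },
  confidence: number
): ExtractionItemRow[] {
  const rows: ExtractionItemRow[] = [];
  for (const itemType of Object.keys(data)) {
    const values = data[itemType] || [];
    values.forEach((value, index) => {
      rows.push({
        // HYPOTHETICAL id scheme: deterministic, so a re-run yields the same ids.
        id: [articleId, extractorId, itemType, String(index)].join(":"),
        article_id: articleId,
        extractor_id: extractorId,
        item_type: itemType,
        item_value: value,
        confidence: confidence,
      });
    });
  }
  return rows;
}
```

Each returned row corresponds to one `INSERT` into `extraction_items`, which the `(item_type, item_value)` index then makes searchable.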
See Also
- Extraction Plugins Guide — Detailed extractor documentation
3. Output Adapters (Delivery)
Output adapters define where enriched data goes. Each adapter knows how to format and send notifications to a specific destination.
Interface Definition
interface OutputAdapter {
  // Unique identifier
  id: string;
  name: string;

  // Configuration schema for this adapter
  configSchema: JSONSchema;

  // Send notification to destination
  send(
    payload: NotificationPayload,
    config: AdapterConfig
  ): Promise<DeliveryResult>;

  // Test connection with provided config
  testConnection(config: AdapterConfig): Promise<boolean>;

  // Retry configuration
  retryPolicy: RetryConfig;
}

interface NotificationPayload {
  articles: Article[];
  stories?: Story[];
  customer: Customer;
  profile: Profile;
  matchScores: Record<string, number>;
  digest?: boolean;
}

interface DeliveryResult {
  success: boolean;
  messageId?: string;
  error?: string;
  retryable?: boolean;
}

interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}
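The fields of `RetryConfig` suggest capped exponential backoff. As a sketch under that assumption (the source does not state the exact formula, and `retryDelayMs` is a hypothetical helper), the delay before each retry could be computed like this:

```typescript
interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

// Delay before retry number `attempt` (1-based), or null once retries are exhausted.
// ASSUMPTION: delay = initialDelayMs * backoffMultiplier^(attempt - 1), capped at maxDelayMs.
function retryDelayMs(policy: RetryConfig, attempt: number): number | null {
  if (attempt < 1 || attempt > policy.maxRetries) return null;
  const uncapped = policy.initialDelayMs * Math.pow(policy.backoffMultiplier, attempt - 1);
  return Math.min(uncapped, policy.maxDelayMs);
}
```

A delivery loop would typically consult this only when `DeliveryResult.retryable` is true, and mark the notification as failed when the function returns `null`.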
Built-in Adapters
| Adapter | Status | Config | Notes |
|---|---|---|---|
| Email | ✅ Active | SMTP settings | HTML templates |
| Slack | ✅ Active | Webhook URL | Rich formatting |
| Webhook | ✅ Active | URL + headers | Generic POST |
Future Adapters (Planned)
| Adapter | Priority | Notes |
|---|---|---|
| Airtable | High | Customer request |
| Notion | High | Customer request |
| Google Sheets | Medium | Easy integration |
| Microsoft Teams | Medium | Enterprise demand |
| Discord | Low | Community use |
| Zapier | Low | Meta-integration |
Database Schema
-- Customer notification preferences
CREATE TABLE customer_notification_settings (
  customer_id TEXT PRIMARY KEY REFERENCES customers(id),
  email_enabled INTEGER DEFAULT 1,
  email_address TEXT,
  slack_enabled INTEGER DEFAULT 0,
  slack_webhook_url TEXT,
  webhook_enabled INTEGER DEFAULT 0,
  webhook_url TEXT,
  webhook_headers TEXT, -- JSON
  digest_frequency TEXT DEFAULT 'realtime', -- realtime, hourly, daily
  quiet_hours_start TEXT, -- HH:MM
  quiet_hours_end TEXT,
  timezone TEXT DEFAULT 'UTC',
  updated_at TEXT DEFAULT (datetime('now'))
);

-- Notification delivery log
CREATE TABLE notification_log (
  id TEXT PRIMARY KEY,
  customer_id TEXT NOT NULL REFERENCES customers(id),
  adapter_id TEXT NOT NULL,
  payload_hash TEXT NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('pending', 'sent', 'failed', 'retrying')),
  attempts INTEGER DEFAULT 0,
  last_error TEXT,
  sent_at TEXT,
  created_at TEXT DEFAULT (datetime('now'))
);
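The `quiet_hours_start` and `quiet_hours_end` columns store `HH:MM` strings, and a quiet window such as 22:00 to 07:00 wraps past midnight. The sketch below shows one way to test whether a delivery should be held back. It assumes zero-padded 24-hour values, that the current time has already been converted to the customer's `timezone`, and that the end time is exclusive; the helper names are hypothetical.

```typescript
// Convert a zero-padded "HH:MM" string to minutes since midnight.
function toMinutes(hhmm: string): number {
  return Number(hhmm.slice(0, 2)) * 60 + Number(hhmm.slice(3, 5));
}

// True when `now` falls inside [start, end). The window may wrap past midnight.
function inQuietHours(start: string | null, end: string | null, now: string): boolean {
  if (start === null || end === null) return false; // quiet hours not configured
  const s = toMinutes(start);
  const e = toMinutes(end);
  const n = toMinutes(now);
  if (s === e) return false; // ASSUMPTION: a zero-length window means "disabled"
  // Same-day window: s <= n < e. Wrapping window: n is after start OR before end.
  return s < e ? n >= s && n < e : n >= s || n < e;
}
```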
See Also
- Output Adapters Guide — Implementation details
Projects: Focus-Based Extraction
Projects allow customers to create focused campaigns that inform extraction. For example, a "Funding Tracker" project would prioritize the funding_rounds extractor.
Concept
Database Schema
-- Projects: Customer campaigns with specific focus
CREATE TABLE projects (
  id TEXT PRIMARY KEY,
  customer_id TEXT NOT NULL REFERENCES customers(id),
  name TEXT NOT NULL,
  description TEXT,
  -- Focus informs extraction
  focus_type TEXT NOT NULL, -- 'funding_rounds', 'product_launches', etc.
  focus_description TEXT, -- LLM-readable description
  -- Stats
  article_count INTEGER DEFAULT 0,
  extraction_count INTEGER DEFAULT 0,
  is_active INTEGER DEFAULT 1,
  created_at TEXT DEFAULT (datetime('now'))
);

-- Keywords associated with a project
CREATE TABLE project_keywords (
  project_id TEXT NOT NULL REFERENCES projects(id),
  keyword_id TEXT NOT NULL REFERENCES keywords(id),
  PRIMARY KEY (project_id, keyword_id)
);

-- Extractors enabled for a project
CREATE TABLE project_extractors (
  project_id TEXT NOT NULL REFERENCES projects(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  is_enabled INTEGER DEFAULT 1,
  config TEXT, -- Project-specific config overrides
  PRIMARY KEY (project_id, extractor_id)
);
Focus Types
| Focus Type | Description | Primary Extractor |
|---|---|---|
| funding_rounds | Track startup funding | funding_rounds |
| product_launches | New product announcements | products |
| executive_moves | C-suite changes | entities |
| acquisitions | M&A activity | entities, custom |
| earnings | Quarterly reports | events |
| regulatory | Policy changes | entities, events |
| custom | User-defined | Configurable |
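The focus types above can be read as a lookup from a project's `focus_type` to the extractors it prioritizes. A minimal sketch of that lookup follows. The map is copied from the table, while the rule that explicit per-project selections (as in `project_extractors`) take precedence is an assumption, and `extractorsForProject` is a hypothetical name.

```typescript
// Primary extractors per focus type, copied from the Focus Types table.
const PRIMARY_EXTRACTORS: { [focusType: string]: string[] } = {
  funding_rounds: ["funding_rounds"],
  product_launches: ["products"],
  executive_moves: ["entities"],
  acquisitions: ["entities", "custom"],
  earnings: ["events"],
  regulatory: ["entities", "events"],
  custom: [], // "Configurable": chosen per project
};

// Resolve which extractors a project prioritizes.
// ASSUMPTION: explicit per-project selections win over the focus type's defaults.
function extractorsForProject(focusType: string, enabledOverrides: string[]): string[] {
  if (enabledOverrides.length > 0) return enabledOverrides.slice();
  const primary = Object.prototype.hasOwnProperty.call(PRIMARY_EXTRACTORS, focusType)
    ? PRIMARY_EXTRACTORS[focusType]
    : undefined;
  return primary ? primary.slice() : []; // unknown focus types get no defaults
}
```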
Plugin Lifecycle
Cost Tracking
All plugin operations are cost-tracked:
-- Every extraction logs its cost
INSERT INTO cost_events (
  id,
  operation_type,
  service,
  operation_id,
  article_id,
  cost_micros,
  metadata
) VALUES (
  'cost_xxx',
  'extraction',
  'workers_ai',
  'funding_rounds',
  'article_123',
  500, -- $0.0005
  '{"tokens_in": 1500, "tokens_out": 200}'
);
See Cost Tracking for full details.
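Costs are stored as integer microdollars, so 500 in `cost_micros` corresponds to $0.0005. The sketch below shows the conversion and a per-operation total. The `CostEvent` shape reuses a subset of the `cost_events` columns, and the helper names are hypothetical.

```typescript
// Subset of the cost_events columns needed for aggregation.
interface CostEvent {
  operation_type: string;
  service: string;
  cost_micros: number;
}

const MICROS_PER_DOLLAR = 1000000;

// Convert integer microdollars to dollars for display.
function microsToDollars(micros: number): number {
  return micros / MICROS_PER_DOLLAR;
}

// Sum cost_micros per operation_type. Totals stay in integer microdollars
// so that repeated additions do not accumulate floating-point error.
function totalMicrosByOperation(events: CostEvent[]): { [operationType: string]: number } {
  const totals: { [operationType: string]: number } = {};
  for (const event of events) {
    totals[event.operation_type] = (totals[event.operation_type] || 0) + event.cost_micros;
  }
  return totals;
}
```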
Quick Links
- Source Adapters — Adding new data sources
- Extraction Plugins — Custom extractors
- Output Adapters — Delivery destinations
- Google News Adapter — Primary source implementation