Plugin Architecture

Three extension points let you customize the pipeline: Sources, Extractors, and Outputs.

StoryIntel is designed as a pluggable platform. The core pipeline is fixed, but you can extend it at three points: add new data sources, extract custom structured data, and deliver enriched data to new destinations.

Extension Points Overview

1. Source Adapters (Input)

Source adapters define where we crawl from. Each adapter knows how to discover URLs and fetch content from a specific source.

Interface Definition

interface SourceAdapter {
  // Unique identifier
  id: string;
  name: string;
  
  // Discovery: Find URLs matching a keyword
  discover(keyword: Keyword): Promise<DiscoveredURL[]>;
  
  // Fetch: Get raw content from a URL
  fetch(url: string): Promise<RawContent>;
  
  // Normalize: Convert to standard Article format
  normalize(raw: RawContent): Promise<Article>;
  
  // Health: Check if source is accessible
  healthCheck(): Promise<HealthStatus>;
  
  // Rate limits for this source
  rateLimits: RateLimitConfig;
}

interface DiscoveredURL {
  url: string;
  title?: string;
  publishedAt?: Date;
  source?: string;
}

interface RawContent {
  html: string;
  url: string;
  fetchedAt: Date;
  headers: Record<string, string>;
}

interface RateLimitConfig {
  requestsPerMinute: number;
  requestsPerHour: number;
  requestsPerDay: number;
  burstLimit: number;
}
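
To illustrate how a crawler might honor rateLimits before calling fetch, here is a minimal sketch. The helpers countWithin and canRequest are hypothetical and not part of StoryIntel; the sketch assumes request times are kept as millisecond timestamps and that each limit applies to a sliding window. It does not enforce burstLimit, because the source does not define its semantics.

```typescript
// Mirrors the RateLimitConfig interface defined above.
interface RateLimitConfig {
  requestsPerMinute: number;
  requestsPerHour: number;
  requestsPerDay: number;
  burstLimit: number;
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

// Hypothetical helper: counts timestamps inside the window (now - windowMs, now].
function countWithin(timestamps: number[], now: number, windowMs: number): number {
  return timestamps.filter((t) => t > now - windowMs && t <= now).length;
}

// Hypothetical helper: true if one more request fits in every sliding window.
// burstLimit is deliberately ignored here (assumption: semantics unspecified).
function canRequest(limits: RateLimitConfig, pastRequests: number[], now: number): boolean {
  return (
    countWithin(pastRequests, now, MINUTE_MS) < limits.requestsPerMinute &&
    countWithin(pastRequests, now, HOUR_MS) < limits.requestsPerHour &&
    countWithin(pastRequests, now, DAY_MS) < limits.requestsPerDay
  );
}
```

A caller would record a timestamp after each successful fetch and skip or delay the next request whenever canRequest returns false.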

Built-in Adapters

Adapter	Status	Discovery	Notes
Google News	✅ Active	RSS + HTML	Primary source
RSS Generic	✅ Active	RSS feeds	Any RSS/Atom feed
Direct Publisher	✅ Active	Sitemap/robots.txt	For known publishers

Future Adapters (Planned)

Adapter	Priority	Notes
Twitter/X	High	Requires API access
Reddit	High	API or scraping
HackerNews	Medium	Public API
PR Newswire	Medium	Requires partnership
SEC Filings	Medium	Public EDGAR API
Podcasts	Low	Transcription needed

Adding a New Source

See Source Adapters Guide for implementation details.

2. Extraction Plugins (Processing)

Extraction plugins define what structured data we pull out of articles. Each plugin takes article text and returns structured JSON matching a defined schema.

Interface Definition

interface ExtractionPlugin {
  // Unique identifier
  id: string;
  name: string;
  
  // Type of extraction
  type: 'builtin' | 'ai' | 'rules' | 'hybrid';
  
  // JSON Schema for output validation
  outputSchema: JSONSchema;
  
  // Extract structured data from article
  extract(
    article: Article, 
    context?: ProjectContext
  ): Promise<ExtractionResult>;
  
  // Validate extracted data against schema
  validate(data: unknown): boolean;
  
  // Estimate cost before running (for AI extractors)
  estimateCost(article: Article): number;
  
  // Run on ingest or on-demand?
  runOnIngest: boolean;
  
  // Priority (lower = runs first)
  priority: number;
}

interface ExtractionResult {
  extractorId: string;
  data: Record<string, any>;  // Matches outputSchema
  confidence: number;         // 0-1
  costMicros: number;         // Actual cost in microdollars
}

interface ProjectContext {
  projectId: string;
  focusType: string;          // e.g., 'funding_rounds'
  focusDescription: string;   // LLM-readable context
}
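
The runOnIngest and priority fields suggest a simple scheduling rule, sketched below. SchedulablePlugin and ingestOrder are hypothetical names introduced for illustration, and breaking priority ties by id is an assumption; the source only states that a lower priority runs first.

```typescript
// Reduced view of ExtractionPlugin: only the scheduling fields are needed here.
interface SchedulablePlugin {
  id: string;
  runOnIngest: boolean;
  priority: number; // lower = runs first
}

// Hypothetical helper: ids of the plugins that run at ingest time, in execution order.
function ingestOrder(plugins: SchedulablePlugin[]): string[] {
  return plugins
    .filter((p) => p.runOnIngest) // on-demand plugins are skipped at ingest
    .sort((a, b) => a.priority - b.priority || a.id.localeCompare(b.id)) // tie-break by id (assumption)
    .map((p) => p.id);
}
```

Because filter returns a new array, the caller's plugin list is left unsorted and unmodified.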

Built-in Extractors

Extractor	Type	Output	Notes
`entities`	AI	People, orgs, products	Named entity recognition
`locations`	Hybrid	Geo-tagged places	Match against 225K locations
`events`	AI	Dates, conferences, earnings	Structured event data
`quotes`	AI	Speaker + quote text	Attribution extraction
`products`	AI	Product mentions	Brand detection
`jobs`	AI	Job postings	Title, company, location
`funding_rounds`	AI	Series, amounts, investors	New

Database Schema

-- Extractor definitions
CREATE TABLE extractors (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
    output_schema TEXT NOT NULL,  -- JSON Schema
    run_on_ingest INTEGER DEFAULT 1,
    priority INTEGER DEFAULT 100,
    is_active INTEGER DEFAULT 1,
    description TEXT,
    created_at TEXT DEFAULT (datetime('now'))
);

-- Extraction results per article
CREATE TABLE article_extractions (
    article_id TEXT NOT NULL REFERENCES articles(id),
    extractor_id TEXT NOT NULL REFERENCES extractors(id),
    extracted_data TEXT NOT NULL,  -- JSON matching output_schema
    confidence REAL,
    cost_micros INTEGER DEFAULT 0,
    extracted_at TEXT DEFAULT (datetime('now')),
    PRIMARY KEY (article_id, extractor_id)
);

-- Searchable extraction items (denormalized)
CREATE TABLE extraction_items (
    id TEXT PRIMARY KEY,
    article_id TEXT NOT NULL REFERENCES articles(id),
    extractor_id TEXT NOT NULL REFERENCES extractors(id),
    item_type TEXT NOT NULL,       -- e.g., 'person', 'org', 'event'
    item_value TEXT NOT NULL,      -- e.g., 'Elon Musk'
    item_data TEXT,                -- Additional JSON
    confidence REAL,
    created_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX idx_extraction_items_type_value ON extraction_items(item_type, item_value);
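
The fields of ExtractionResult line up with the columns of article_extractions. The following sketch shows one possible mapping; ArticleExtractionRow and toArticleExtractionRow are hypothetical names, and rounding cost_micros to an integer is an assumption based on the INTEGER column type.

```typescript
// Mirrors ExtractionResult from the interface definition above.
interface ExtractionResult {
  extractorId: string;
  data: Record<string, unknown>;
  confidence: number; // 0-1
  costMicros: number;
}

// Hypothetical row shape matching the article_extractions columns
// (extracted_at is omitted because the table supplies a default).
interface ArticleExtractionRow {
  article_id: string;
  extractor_id: string;
  extracted_data: string; // JSON text
  confidence: number;
  cost_micros: number;
}

// Hypothetical helper: converts a result into a row ready for insertion.
function toArticleExtractionRow(articleId: string, result: ExtractionResult): ArticleExtractionRow {
  // Reject NaN and out-of-range values before they reach the database.
  if (!(result.confidence >= 0 && result.confidence <= 1)) {
    throw new RangeError("confidence must be within 0-1, got " + result.confidence);
  }
  return {
    article_id: articleId,
    extractor_id: result.extractorId,
    extracted_data: JSON.stringify(result.data),
    confidence: result.confidence,
    cost_micros: Math.round(result.costMicros), // INTEGER column (assumption: round)
  };
}
```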

3. Output Adapters (Delivery)

Output adapters define where enriched data goes. Each adapter knows how to format and send notifications to a specific destination.

Interface Definition

interface OutputAdapter {
  // Unique identifier
  id: string;
  name: string;
  
  // Configuration schema for this adapter
  configSchema: JSONSchema;
  
  // Send notification to destination
  send(
    payload: NotificationPayload, 
    config: AdapterConfig
  ): Promise<DeliveryResult>;
  
  // Test connection with provided config
  testConnection(config: AdapterConfig): Promise<boolean>;
  
  // Retry configuration
  retryPolicy: RetryConfig;
}

interface NotificationPayload {
  articles: Article[];
  stories?: Story[];
  customer: Customer;
  profile: Profile;
  matchScores: Record<string, number>;
  digest?: boolean;
}

interface DeliveryResult {
  success: boolean;
  messageId?: string;
  error?: string;
  retryable?: boolean;
}

interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}
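
The RetryConfig fields are consistent with exponential backoff. The sketch below shows one way a delivery loop might compute the wait before each retry; nextRetryDelayMs is a hypothetical helper, and the formula (no jitter, 1-based attempt counter) is an assumption, since the source lists only the field names.

```typescript
// Mirrors the RetryConfig interface defined above.
interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

// Hypothetical helper: delay before retry number `attempt` (1-based),
// or null once the retry budget is exhausted.
function nextRetryDelayMs(policy: RetryConfig, attempt: number): number | null {
  if (attempt < 1 || attempt > policy.maxRetries) {
    return null;
  }
  // initialDelayMs * backoffMultiplier^(attempt - 1), capped at maxDelayMs.
  const delay = policy.initialDelayMs * Math.pow(policy.backoffMultiplier, attempt - 1);
  return Math.min(delay, policy.maxDelayMs);
}
```

With initialDelayMs = 1000, backoffMultiplier = 2, and maxDelayMs = 5000, the first three retries wait 1000, 2000, and 4000 ms, and any later retry is capped at 5000 ms.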

Built-in Adapters

Adapter	Status	Config	Notes
Email	✅ Active	SMTP settings	HTML templates
Slack	✅ Active	Webhook URL	Rich formatting
Webhook	✅ Active	URL + headers	Generic POST

Future Adapters (Planned)

Adapter	Priority	Notes
Airtable	High	Customer request
Notion	High	Customer request
Google Sheets	Medium	Easy integration
Microsoft Teams	Medium	Enterprise demand
Discord	Low	Community use
Zapier	Low	Meta-integration

Database Schema

-- Customer notification preferences
CREATE TABLE customer_notification_settings (
    customer_id TEXT PRIMARY KEY REFERENCES customers(id),
    email_enabled INTEGER DEFAULT 1,
    email_address TEXT,
    slack_enabled INTEGER DEFAULT 0,
    slack_webhook_url TEXT,
    webhook_enabled INTEGER DEFAULT 0,
    webhook_url TEXT,
    webhook_headers TEXT,  -- JSON
    digest_frequency TEXT DEFAULT 'realtime',  -- realtime, hourly, daily
    quiet_hours_start TEXT,  -- HH:MM
    quiet_hours_end TEXT,
    timezone TEXT DEFAULT 'UTC',
    updated_at TEXT DEFAULT (datetime('now'))
);

-- Notification delivery log
CREATE TABLE notification_log (
    id TEXT PRIMARY KEY,
    customer_id TEXT NOT NULL REFERENCES customers(id),
    adapter_id TEXT NOT NULL,
    payload_hash TEXT NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('pending', 'sent', 'failed', 'retrying')),
    attempts INTEGER DEFAULT 0,
    last_error TEXT,
    sent_at TEXT,
    created_at TEXT DEFAULT (datetime('now'))
);
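
The quiet_hours_start and quiet_hours_end columns store HH:MM strings, so a window such as 22:00 to 07:00 wraps past midnight. The sketch below shows one way to test whether a time falls inside the window. toMinutes and inQuietHours are hypothetical helpers, and the sketch assumes the caller has already converted the current time to the customer's timezone, that the start is inclusive and the end exclusive, and that equal start and end mean no quiet hours.

```typescript
// Hypothetical helper: parses "HH:MM" into minutes since midnight.
function toMinutes(hhmm: string): number {
  const match = /^(\d{2}):(\d{2})$/.exec(hhmm);
  if (!match) {
    throw new Error("expected HH:MM, got " + hhmm);
  }
  const hours = Number(match[1]);
  const minutes = Number(match[2]);
  if (hours > 23 || minutes > 59) {
    throw new Error("time out of range: " + hhmm);
  }
  return hours * 60 + minutes;
}

// Hypothetical helper: true if localTime is inside [start, end).
// Null start or end means quiet hours are not configured.
function inQuietHours(localTime: string, start: string | null, end: string | null): boolean {
  if (start === null || end === null) {
    return false;
  }
  const now = toMinutes(localTime);
  const s = toMinutes(start);
  const e = toMinutes(end);
  if (s === e) {
    return false; // assumption: an empty window disables quiet hours
  }
  // Same-day window, or a window that wraps past midnight.
  return s < e ? now >= s && now < e : now >= s || now < e;
}
```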

Projects: Focus-Based Extraction

Projects allow customers to create focused campaigns that inform extraction. For example, a "Funding Tracker" project would prioritize the `funding_rounds` extractor.

Concept

Database Schema

-- Projects: Customer campaigns with specific focus
CREATE TABLE projects (
    id TEXT PRIMARY KEY,
    customer_id TEXT NOT NULL REFERENCES customers(id),
    name TEXT NOT NULL,
    description TEXT,
    
    -- Focus informs extraction
    focus_type TEXT NOT NULL,           -- 'funding_rounds', 'product_launches', etc.
    focus_description TEXT,             -- LLM-readable description
    
    -- Stats
    article_count INTEGER DEFAULT 0,
    extraction_count INTEGER DEFAULT 0,
    
    is_active INTEGER DEFAULT 1,
    created_at TEXT DEFAULT (datetime('now'))
);

-- Keywords associated with a project
CREATE TABLE project_keywords (
    project_id TEXT NOT NULL REFERENCES projects(id),
    keyword_id TEXT NOT NULL REFERENCES keywords(id),
    PRIMARY KEY (project_id, keyword_id)
);

-- Extractors enabled for a project
CREATE TABLE project_extractors (
    project_id TEXT NOT NULL REFERENCES projects(id),
    extractor_id TEXT NOT NULL REFERENCES extractors(id),
    is_enabled INTEGER DEFAULT 1,
    config TEXT,  -- Project-specific config overrides
    PRIMARY KEY (project_id, extractor_id)
);

Focus Types

Focus Type	Description	Primary Extractor
`funding_rounds`	Track startup funding	`funding_rounds`
`product_launches`	New product announcements	`products`
`executive_moves`	C-suite changes	`entities`
`acquisitions`	M&A activity	`entities`, custom
`earnings`	Quarterly reports	`events`
`regulatory`	Policy changes	`entities`, `events`
`custom`	User-defined	Configurable
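
The table above can be read as a lookup from focus type to extractor ids. A minimal sketch follows; PRIMARY_EXTRACTORS and primaryExtractors are hypothetical names, and modeling the "custom" entries as an optional list supplied by the caller is an assumption.

```typescript
// Built-in extractor ids per focus type, copied from the table above.
// Custom extractors are supplied by the caller (assumption).
const PRIMARY_EXTRACTORS: Record<string, string[]> = {
  funding_rounds: ["funding_rounds"],
  product_launches: ["products"],
  executive_moves: ["entities"],
  acquisitions: ["entities"],
  earnings: ["events"],
  regulatory: ["entities", "events"],
};

// Hypothetical helper: built-in extractors for a focus type followed by any
// caller-supplied custom extractors, with duplicates removed.
function primaryExtractors(focusType: string, customExtractors: string[] = []): string[] {
  const known = Object.prototype.hasOwnProperty.call(PRIMARY_EXTRACTORS, focusType);
  const builtin = known ? (PRIMARY_EXTRACTORS[focusType] ?? []) : [];
  const combined = builtin.concat(customExtractors);
  // Keep only the first occurrence of each id.
  return combined.filter((id, index) => combined.indexOf(id) === index);
}
```

For the `custom` focus type, which has no built-in entry, the result is just the list the caller passes in.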

Plugin Lifecycle

Cost Tracking

All plugin operations are cost-tracked:

-- Every extraction logs its cost
INSERT INTO cost_events (
    id,
    operation_type,
    service,
    operation_id,
    article_id,
    cost_micros,
    metadata
) VALUES (
    'cost_xxx',
    'extraction',
    'workers_ai',
    'funding_rounds',
    'article_123',
    500,  -- $0.0005
    '{"tokens_in": 1500, "tokens_out": 200}'
);

See Cost Tracking for full details.
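
Costs are stored as integer microdollars, so 500 in the example above corresponds to $0.0005. The sketch below shows the conversion and a per-service total; CostEvent here is a reduced, hypothetical view of a cost_events row, and both helper names are illustrative.

```typescript
const MICROS_PER_DOLLAR = 1000000;

// Converts integer microdollars to dollars.
function microsToDollars(costMicros: number): number {
  return costMicros / MICROS_PER_DOLLAR;
}

// Reduced, hypothetical view of a cost_events row.
interface CostEvent {
  operation_type: string;
  service: string;
  cost_micros: number;
}

// Hypothetical helper: sums cost_micros per service.
// Totals stay in microdollars so the arithmetic remains on integers.
function totalMicrosByService(events: CostEvent[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const event of events) {
    totals[event.service] = (totals[event.service] ?? 0) + event.cost_micros;
  }
  return totals;
}
```

Converting to dollars only at display time avoids accumulating floating-point error across many small costs.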

Quick Links

Source Adapters — Adding new data sources
Extraction Plugins — Custom extractors
Output Adapters — Delivery destinations
Google News Adapter — Primary source implementation
