
Plugin Architecture

Three extension points for customizing the pipeline: Sources, Extractors, and Outputs.

StoryIntel is designed as a pluggable platform. While the core pipeline is fixed, you can extend it at three key points to add new data sources, extract custom structured data, and deliver to new destinations.


Extension Points Overview


1. Source Adapters (Input)

Source adapters define where we crawl from. Each adapter knows how to discover URLs and fetch content from a specific source.

Interface Definition

interface SourceAdapter {
  // Unique identifier
  id: string;
  name: string;

  // Discovery: Find URLs matching a keyword
  discover(keyword: Keyword): Promise<DiscoveredURL[]>;

  // Fetch: Get raw content from a URL
  fetch(url: string): Promise<RawContent>;

  // Normalize: Convert to standard Article format
  normalize(raw: RawContent): Promise<Article>;

  // Health: Check if source is accessible
  healthCheck(): Promise<HealthStatus>;

  // Rate limits for this source
  rateLimits: RateLimitConfig;
}

interface DiscoveredURL {
  url: string;
  title?: string;
  publishedAt?: Date;
  source?: string;
}

interface RawContent {
  html: string;
  url: string;
  fetchedAt: Date;
  headers: Record<string, string>;
}

interface RateLimitConfig {
  requestsPerMinute: number;
  requestsPerHour: number;
  requestsPerDay: number;
  burstLimit: number;
}
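
The limits in RateLimitConfig could be enforced with a sliding-window counter. The following is only a sketch under stated assumptions: SlidingWindowLimiter is a hypothetical helper, state is kept in memory, and burstLimit is interpreted as "maximum requests within any one-second window", which the interface above does not actually specify.

```typescript
interface RateLimitConfig {
  requestsPerMinute: number;
  requestsPerHour: number;
  requestsPerDay: number;
  burstLimit: number;
}

// Hypothetical helper (not part of StoryIntel): remembers request timestamps in memory.
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private readonly limits: RateLimitConfig;

  constructor(limits: RateLimitConfig) {
    this.limits = limits;
  }

  // Returns true and records the request only if every window still has room.
  tryAcquire(nowMs: number): boolean {
    const DAY_MS = 86400000;
    // Forget anything older than the largest window.
    this.timestamps = this.timestamps.filter((t) => nowMs - t < DAY_MS);

    const countWithin = (windowMs: number): number =>
      this.timestamps.filter((t) => nowMs - t < windowMs).length;

    const allowed =
      countWithin(1000) < this.limits.burstLimit && // assumption: burst = per second
      countWithin(60000) < this.limits.requestsPerMinute &&
      countWithin(3600000) < this.limits.requestsPerHour &&
      countWithin(DAY_MS) < this.limits.requestsPerDay;

    if (allowed) this.timestamps.push(nowMs);
    return allowed;
  }
}
```

Passing the current time in as an argument keeps the limiter deterministic in tests; a caller would typically pass Date.now().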

Built-in Adapters

| Adapter | Status | Discovery | Notes |
|---|---|---|---|
| Google News | ✅ Active | RSS + HTML | Primary source |
| RSS Generic | ✅ Active | RSS feeds | Any RSS/Atom feed |
| Direct Publisher | ✅ Active | Sitemap/robots.txt | For known publishers |

Future Adapters (Planned)

| Adapter | Priority | Notes |
|---|---|---|
| Twitter/X | High | Requires API access |
| Reddit | High | API or scraping |
| HackerNews | Medium | Public API |
| PR Newswire | Medium | Requires partnership |
| SEC Filings | Medium | Public EDGAR API |
| Podcasts | Low | Transcription needed |

Adding a New Source

See Source Adapters Guide for implementation details.
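
As a rough illustration of what implementing the SourceAdapter interface involves, here is a minimal sketch of an adapter that serves pre-loaded pages instead of crawling. FixtureSourceAdapter and the helper functions are hypothetical, and the shapes of Keyword, Article, and HealthStatus are assumptions, since this page does not define them. The Source Adapters Guide remains the reference for real implementations.

```typescript
// Assumed shapes: this page does not define Keyword, Article, or HealthStatus.
interface Keyword { id: string; text: string }
interface Article { url: string; title: string; body: string; fetchedAt: Date }
interface HealthStatus { healthy: boolean; message?: string }

interface DiscoveredURL { url: string; title?: string; publishedAt?: Date; source?: string }
interface RawContent { html: string; url: string; fetchedAt: Date; headers: Record<string, string> }
interface RateLimitConfig {
  requestsPerMinute: number;
  requestsPerHour: number;
  requestsPerDay: number;
  burstLimit: number;
}
interface SourceAdapter {
  id: string;
  name: string;
  discover(keyword: Keyword): Promise<DiscoveredURL[]>;
  fetch(url: string): Promise<RawContent>;
  normalize(raw: RawContent): Promise<Article>;
  healthCheck(): Promise<HealthStatus>;
  rateLimits: RateLimitConfig;
}

// Pure helpers, kept outside the adapter so they are easy to unit test.
function extractTitle(html: string): string {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match?.[1]?.trim() ?? "";
}

function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Hypothetical adapter backed by an in-memory map of URL -> HTML.
class FixtureSourceAdapter implements SourceAdapter {
  id = "fixture";
  name = "Fixture Source";
  rateLimits: RateLimitConfig = {
    requestsPerMinute: 60,
    requestsPerHour: 1000,
    requestsPerDay: 10000,
    burstLimit: 5,
  };
  private pages: Map<string, string>;

  constructor(pages: Map<string, string>) {
    this.pages = pages;
  }

  async discover(keyword: Keyword): Promise<DiscoveredURL[]> {
    const needle = keyword.text.toLowerCase();
    const found: DiscoveredURL[] = [];
    this.pages.forEach((html, url) => {
      if (html.toLowerCase().includes(needle)) {
        found.push({ url, title: extractTitle(html), source: this.id });
      }
    });
    return found;
  }

  async fetch(url: string): Promise<RawContent> {
    const html = this.pages.get(url);
    if (html === undefined) throw new Error(`Unknown URL: ${url}`);
    return { html, url, fetchedAt: new Date(), headers: {} };
  }

  async normalize(raw: RawContent): Promise<Article> {
    return {
      url: raw.url,
      title: extractTitle(raw.html),
      body: stripTags(raw.html),
      fetchedAt: raw.fetchedAt,
    };
  }

  async healthCheck(): Promise<HealthStatus> {
    return { healthy: this.pages.size > 0 };
  }
}
```

A real adapter would replace the in-memory map with network requests and respect its own rateLimits; the tag stripping here is deliberately naive and is not a substitute for a proper HTML parser.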


2. Extraction Plugins (Processing)

Extraction plugins define what structured data we pull out of articles. Each plugin takes article text and returns structured JSON matching a defined schema.

Interface Definition

interface ExtractionPlugin {
  // Unique identifier
  id: string;
  name: string;

  // Type of extraction
  type: 'builtin' | 'ai' | 'rules' | 'hybrid';

  // JSON Schema for output validation
  outputSchema: JSONSchema;

  // Extract structured data from article
  extract(
    article: Article,
    context?: ProjectContext
  ): Promise<ExtractionResult>;

  // Validate extracted data against schema
  validate(data: unknown): boolean;

  // Estimate cost before running (for AI extractors)
  estimateCost(article: Article): number;

  // Run on ingest or on-demand?
  runOnIngest: boolean;

  // Priority (lower = runs first)
  priority: number;
}

interface ExtractionResult {
  extractorId: string;
  data: Record<string, any>; // Matches outputSchema
  confidence: number;        // 0-1
  costMicros: number;        // Actual cost in microdollars
}

interface ProjectContext {
  projectId: string;
  focusType: string;        // e.g., 'funding_rounds'
  focusDescription: string; // LLM-readable context
}
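
To make the interface more concrete, here is a minimal sketch of a 'rules' plugin that finds dollar amounts such as "$5M" or "$12.5 million" with a regular expression. Everything specific in it is an assumption for illustration: the plugin id, the Article and JSONSchema shapes, the output field amountsUsd, and the fixed confidence value. It is not the built-in funding_rounds extractor, which the table below lists as AI-based.

```typescript
// Assumed shapes: this page does not define Article or JSONSchema.
type JSONSchema = Record<string, unknown>;
interface Article { id: string; text: string }
interface ProjectContext { projectId: string; focusType: string; focusDescription: string }
interface ExtractionResult {
  extractorId: string;
  data: Record<string, any>;
  confidence: number;
  costMicros: number;
}
interface ExtractionPlugin {
  id: string;
  name: string;
  type: "builtin" | "ai" | "rules" | "hybrid";
  outputSchema: JSONSchema;
  extract(article: Article, context?: ProjectContext): Promise<ExtractionResult>;
  validate(data: unknown): boolean;
  estimateCost(article: Article): number;
  runOnIngest: boolean;
  priority: number;
}

// Finds amounts like "$5M", "$12.5 million", "$1B" and returns them in whole dollars.
function findAmountsUsd(text: string): number[] {
  const pattern = /\$(\d+(?:\.\d+)?)\s*(million|billion|[MB])\b/gi;
  const multipliers: Record<string, number> = { m: 1e6, million: 1e6, b: 1e9, billion: 1e9 };
  const amounts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const value = parseFloat(match[1] ?? "");
    const factor = multipliers[(match[2] ?? "").toLowerCase()];
    if (Number.isFinite(value) && factor !== undefined) {
      amounts.push(Math.round(value * factor));
    }
  }
  return amounts;
}

const PLUGIN_ID = "funding_amount_rules"; // hypothetical id

const fundingAmountRules: ExtractionPlugin = {
  id: PLUGIN_ID,
  name: "Funding Amount (rules)",
  type: "rules",
  outputSchema: {
    type: "object",
    required: ["amountsUsd"],
    properties: { amountsUsd: { type: "array", items: { type: "number" } } },
  },
  runOnIngest: true,
  priority: 50,

  async extract(article) {
    const amountsUsd = findAmountsUsd(article.text);
    return {
      extractorId: PLUGIN_ID,
      data: { amountsUsd },
      confidence: amountsUsd.length > 0 ? 0.6 : 0, // arbitrary illustrative value
      costMicros: 0, // regex matching calls no paid service
    };
  },

  // Hand-written check mirroring outputSchema; a real plugin might use a JSON Schema validator.
  validate(data) {
    if (typeof data !== "object" || data === null) return false;
    const amounts = (data as { amountsUsd?: unknown }).amountsUsd;
    return Array.isArray(amounts) && amounts.every((a) => typeof a === "number");
  },

  estimateCost() {
    return 0;
  },
};
```

Because a rules plugin calls no paid service, estimateCost and costMicros are both zero here; an AI extractor would presumably derive them from token counts.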

Built-in Extractors

| Extractor | Type | Output | Notes |
|---|---|---|---|
| entities | AI | People, orgs, products | Named entity recognition |
| locations | Hybrid | Geo-tagged places | Match against 225K locations |
| events | AI | Dates, conferences, earnings | Structured event data |
| quotes | AI | Speaker + quote text | Attribution extraction |
| products | AI | Product mentions | Brand detection |
| jobs | AI | Job postings | Title, company, location |
| funding_rounds | AI | Series, amounts, investors | New |

Database Schema

-- Extractor definitions
CREATE TABLE extractors (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
  output_schema TEXT NOT NULL, -- JSON Schema
  run_on_ingest INTEGER DEFAULT 1,
  priority INTEGER DEFAULT 100,
  is_active INTEGER DEFAULT 1,
  description TEXT,
  created_at TEXT DEFAULT (datetime('now'))
);

-- Extraction results per article
CREATE TABLE article_extractions (
  article_id TEXT NOT NULL REFERENCES articles(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  extracted_data TEXT NOT NULL, -- JSON matching output_schema
  confidence REAL,
  cost_micros INTEGER DEFAULT 0,
  extracted_at TEXT DEFAULT (datetime('now')),
  PRIMARY KEY (article_id, extractor_id)
);

-- Searchable extraction items (denormalized)
CREATE TABLE extraction_items (
  id TEXT PRIMARY KEY,
  article_id TEXT NOT NULL REFERENCES articles(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  item_type TEXT NOT NULL,  -- e.g., 'person', 'org', 'event'
  item_value TEXT NOT NULL, -- e.g., 'Elon Musk'
  item_data TEXT,           -- Additional JSON
  confidence REAL,
  created_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX idx_extraction_items_type_value ON extraction_items(item_type, item_value);
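
The denormalized extraction_items table suggests that each ExtractionResult is flattened into one row per extracted item. A possible sketch of that step follows. The shape of result.data is an assumption (each key is an item type mapped to a list of strings, or of objects with a value field), and toExtractionItems and the injected nextId generator are hypothetical names.

```typescript
interface ExtractionResult {
  extractorId: string;
  data: Record<string, any>;
  confidence: number;
  costMicros: number;
}

// Mirrors the extraction_items columns (created_at is left to its SQL default).
interface ExtractionItemRow {
  id: string;
  article_id: string;
  extractor_id: string;
  item_type: string;
  item_value: string;
  item_data: string | null;
  confidence: number;
}

// Assumed data shape: { person: ["Elon Musk"], org: [{ value: "Tesla", ticker: "TSLA" }] }.
// Keys whose value is not an array, and entries without a string value, are skipped.
function toExtractionItems(
  articleId: string,
  result: ExtractionResult,
  nextId: () => string
): ExtractionItemRow[] {
  const rows: ExtractionItemRow[] = [];
  for (const [itemType, entries] of Object.entries(result.data)) {
    if (!Array.isArray(entries)) continue;
    for (const entry of entries) {
      let itemValue: string;
      let itemData: string | null = null;
      if (typeof entry === "string") {
        itemValue = entry;
      } else if (entry && typeof entry === "object" && typeof entry.value === "string") {
        const { value, ...rest } = entry;
        itemValue = value;
        // Any remaining attributes are stored as the additional JSON column.
        itemData = Object.keys(rest).length > 0 ? JSON.stringify(rest) : null;
      } else {
        continue;
      }
      rows.push({
        id: nextId(),
        article_id: articleId,
        extractor_id: result.extractorId,
        item_type: itemType,
        item_value: itemValue,
        item_data: itemData,
        confidence: result.confidence,
      });
    }
  }
  return rows;
}
```

Rows produced this way line up with the (item_type, item_value) index, so a lookup such as "all articles mentioning the person Elon Musk" does not need to parse the JSON in article_extractions.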



3. Output Adapters (Delivery)

Output adapters define where enriched data goes. Each adapter knows how to format and send notifications to a specific destination.

Interface Definition

interface OutputAdapter {
  // Unique identifier
  id: string;
  name: string;

  // Configuration schema for this adapter
  configSchema: JSONSchema;

  // Send notification to destination
  send(
    payload: NotificationPayload,
    config: AdapterConfig
  ): Promise<DeliveryResult>;

  // Test connection with provided config
  testConnection(config: AdapterConfig): Promise<boolean>;

  // Retry configuration
  retryPolicy: RetryConfig;
}

interface NotificationPayload {
  articles: Article[];
  stories?: Story[];
  customer: Customer;
  profile: Profile;
  matchScores: Record<string, number>;
  digest?: boolean;
}

interface DeliveryResult {
  success: boolean;
  messageId?: string;
  error?: string;
  retryable?: boolean;
}

interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}
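
The fields of RetryConfig read like a capped exponential backoff policy. The sketch below shows one way such a policy could be applied. retryDelayMs and sendWithRetry are hypothetical helpers, and two behaviours are assumptions rather than documented facts: attempts are numbered from 1, and a failed delivery is retried only when retryable is explicitly true.

```typescript
interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

interface DeliveryResult {
  success: boolean;
  messageId?: string;
  error?: string;
  retryable?: boolean;
}

// Delay before retry number `attempt` (1-based), capped at maxDelayMs.
function retryDelayMs(attempt: number, policy: RetryConfig): number {
  const raw = policy.initialDelayMs * Math.pow(policy.backoffMultiplier, attempt - 1);
  return Math.min(raw, policy.maxDelayMs);
}

// Calls `send` once, then retries up to maxRetries times while the failure is retryable.
// `sleep` is injected so callers (and tests) control how waiting is done.
async function sendWithRetry(
  send: () => Promise<DeliveryResult>,
  policy: RetryConfig,
  sleep: (ms: number) => Promise<void>
): Promise<DeliveryResult> {
  let result = await send();
  for (let attempt = 1; attempt <= policy.maxRetries; attempt++) {
    // Assumption: only an explicit retryable === true triggers another attempt.
    if (result.success || result.retryable !== true) break;
    await sleep(retryDelayMs(attempt, policy));
    result = await send();
  }
  return result;
}
```

With initialDelayMs = 1000, backoffMultiplier = 2, and maxDelayMs = 30000, the successive delays are 1000, 2000, 4000, 8000, 16000, and then 30000 ms for every later attempt.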

Built-in Adapters

| Adapter | Status | Config | Notes |
|---|---|---|---|
| Email | ✅ Active | SMTP settings | HTML templates |
| Slack | ✅ Active | Webhook URL | Rich formatting |
| Webhook | ✅ Active | URL + headers | Generic POST |

Future Adapters (Planned)

| Adapter | Priority | Notes |
|---|---|---|
| Airtable | High | Customer request |
| Notion | High | Customer request |
| Google Sheets | Medium | Easy integration |
| Microsoft Teams | Medium | Enterprise demand |
| Discord | Low | Community use |
| Zapier | Low | Meta-integration |

Database Schema

-- Customer notification preferences
CREATE TABLE customer_notification_settings (
  customer_id TEXT PRIMARY KEY REFERENCES customers(id),
  email_enabled INTEGER DEFAULT 1,
  email_address TEXT,
  slack_enabled INTEGER DEFAULT 0,
  slack_webhook_url TEXT,
  webhook_enabled INTEGER DEFAULT 0,
  webhook_url TEXT,
  webhook_headers TEXT,                     -- JSON
  digest_frequency TEXT DEFAULT 'realtime', -- realtime, hourly, daily
  quiet_hours_start TEXT,                   -- HH:MM
  quiet_hours_end TEXT,                     -- HH:MM
  timezone TEXT DEFAULT 'UTC',
  updated_at TEXT DEFAULT (datetime('now'))
);

-- Notification delivery log
CREATE TABLE notification_log (
  id TEXT PRIMARY KEY,
  customer_id TEXT NOT NULL REFERENCES customers(id),
  adapter_id TEXT NOT NULL,
  payload_hash TEXT NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('pending', 'sent', 'failed', 'retrying')),
  attempts INTEGER DEFAULT 0,
  last_error TEXT,
  sent_at TEXT,
  created_at TEXT DEFAULT (datetime('now'))
);
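
The quiet_hours_start, quiet_hours_end, and timezone columns imply a check before each delivery. One way that check might look is sketched below. The function names are hypothetical, and two choices are assumptions: a window whose start is later than its end (for example 22:00 to 07:00) is treated as wrapping past midnight, and identical start and end times are treated as "no quiet hours".

```typescript
// Converts "HH:MM" to minutes since midnight.
function toMinutes(hhmm: string): number {
  const [h, m] = hhmm.split(":").map(Number);
  return ((h ?? 0) % 24) * 60 + (m ?? 0);
}

// Wall-clock "HH:MM" for an instant in the given IANA timezone (e.g. "UTC", "Europe/Berlin").
function localHHMM(instant: Date, timeZone: string): string {
  return new Intl.DateTimeFormat("en-GB", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).format(instant);
}

// True when nowHHMM falls inside [start, end); the end time itself is outside the window.
function isWithinQuietHours(
  nowHHMM: string,
  start: string | null,
  end: string | null
): boolean {
  if (!start || !end) return false; // quiet hours not configured
  const now = toMinutes(nowHHMM);
  const s = toMinutes(start);
  const e = toMinutes(end);
  if (s === e) return false; // assumption: empty window
  return s < e
    ? now >= s && now < e // same-day window, e.g. 09:00 to 17:00
    : now >= s || now < e; // window wraps past midnight, e.g. 22:00 to 07:00
}
```

A delivery path could then call isWithinQuietHours(localHHMM(new Date(), settings.timezone), settings.quiet_hours_start, settings.quiet_hours_end) and defer the notification when it returns true, where settings stands for a row from customer_notification_settings.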



Projects: Focus-Based Extraction

Projects allow customers to create focused campaigns that inform extraction. For example, a "Funding Tracker" project would prioritize the funding_rounds extractor.

Concept

Database Schema

-- Projects: Customer campaigns with specific focus
CREATE TABLE projects (
  id TEXT PRIMARY KEY,
  customer_id TEXT NOT NULL REFERENCES customers(id),
  name TEXT NOT NULL,
  description TEXT,

  -- Focus informs extraction
  focus_type TEXT NOT NULL, -- 'funding_rounds', 'product_launches', etc.
  focus_description TEXT,   -- LLM-readable description

  -- Stats
  article_count INTEGER DEFAULT 0,
  extraction_count INTEGER DEFAULT 0,

  is_active INTEGER DEFAULT 1,
  created_at TEXT DEFAULT (datetime('now'))
);

-- Keywords associated with a project
CREATE TABLE project_keywords (
  project_id TEXT NOT NULL REFERENCES projects(id),
  keyword_id TEXT NOT NULL REFERENCES keywords(id),
  PRIMARY KEY (project_id, keyword_id)
);

-- Extractors enabled for a project
CREATE TABLE project_extractors (
  project_id TEXT NOT NULL REFERENCES projects(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  is_enabled INTEGER DEFAULT 1,
  config TEXT, -- Project-specific config overrides
  PRIMARY KEY (project_id, extractor_id)
);

Focus Types

| Focus Type | Description | Primary Extractor |
|---|---|---|
| funding_rounds | Track startup funding | funding_rounds |
| product_launches | New product announcements | products |
| executive_moves | C-suite changes | entities |
| acquisitions | M&A activity | entities, custom |
| earnings | Quarterly reports | events |
| regulatory | Policy changes | entities, events |
| custom | User-defined | Configurable |
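
The focus types above, combined with the extractor priority field (lower runs first), could drive the order in which extractors run for a project. The sketch below is one interpretation, not the documented behaviour: it assumes "prioritize" means the focus type's primary extractors run before all others, with priority as the tie-breaker. PRIMARY_EXTRACTORS, ExtractorInfo, and planExtractors are hypothetical names.

```typescript
// Primary extractors per focus type, taken from the Focus Types table.
// 'custom' is omitted because it is configured per project.
const PRIMARY_EXTRACTORS: Record<string, string[]> = {
  funding_rounds: ["funding_rounds"],
  product_launches: ["products"],
  executive_moves: ["entities"],
  acquisitions: ["entities"], // the table also lists "custom", which is not a concrete extractor id
  earnings: ["events"],
  regulatory: ["entities", "events"],
};

// Minimal view of a row from the extractors table.
interface ExtractorInfo {
  id: string;
  priority: number;
  isActive: boolean;
}

// Returns active extractor ids: primary ones first, then the rest, each group by ascending priority.
function planExtractors(focusType: string, extractors: ExtractorInfo[]): string[] {
  const primary = new Set(PRIMARY_EXTRACTORS[focusType] ?? []);
  return extractors
    .filter((e) => e.isActive)
    .sort((a, b) => {
      const aRank = primary.has(a.id) ? 0 : 1;
      const bRank = primary.has(b.id) ? 0 : 1;
      return aRank - bRank || a.priority - b.priority;
    })
    .map((e) => e.id);
}
```

For an unknown or custom focus type the function falls back to plain priority order, leaving any per-project choices to project_extractors.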

Plugin Lifecycle


Cost Tracking

All plugin operations are cost-tracked:

-- Every extraction logs its cost
INSERT INTO cost_events (
  id,
  operation_type,
  service,
  operation_id,
  article_id,
  cost_micros,
  metadata
) VALUES (
  'cost_xxx',
  'extraction',
  'workers_ai',
  'funding_rounds',
  'article_123',
  500, -- $0.0005
  '{"tokens_in": 1500, "tokens_out": 200}'
);

See Cost Tracking for full details.
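
Costs are stored as integer microdollars (1 dollar = 1,000,000 micros), which avoids floating-point drift when many small charges are summed. The sketch below shows how such values might be estimated, formatted, and totalled. The helper names and the TokenPricing shape are hypothetical, and the prices used in the usage note are made-up numbers chosen only to reproduce the 500-micro example above, not real rates for any service.

```typescript
interface CostEvent {
  operationType: string;
  service: string;
  costMicros: number;
}

// Hypothetical pricing, expressed in microdollars per 1,000 tokens.
interface TokenPricing {
  inputMicrosPer1k: number;
  outputMicrosPer1k: number;
}

// Estimates a cost in whole microdollars, rounding up so costs are never under-reported.
function estimateCostMicros(tokensIn: number, tokensOut: number, pricing: TokenPricing): number {
  const raw =
    (tokensIn * pricing.inputMicrosPer1k + tokensOut * pricing.outputMicrosPer1k) / 1000;
  return Math.ceil(raw);
}

// Formats microdollars as a dollar string with six decimal places.
function formatMicros(micros: number): string {
  return `$${(micros / 1000000).toFixed(6)}`;
}

// Sums cost_micros per service, staying in integers throughout.
function totalByService(events: CostEvent[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const event of events) {
    totals[event.service] = (totals[event.service] ?? 0) + event.costMicros;
  }
  return totals;
}
```

With illustrative prices of 200 micros per 1,000 input tokens and 1,000 micros per 1,000 output tokens, estimateCostMicros(1500, 200, ...) gives 300 + 200 = 500 micros, which formatMicros renders as $0.000500.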