Extraction Plugins
Transform unstructured article text into structured, queryable data.
Extraction plugins are the heart of StoryIntel's intelligence layer. They take raw article content and produce structured JSON that can be searched, filtered, and analyzed.
How Extraction Works
Built-in Extractors
1. Entities Extractor
ID: entities
Type: ai
Purpose: Extract named entities (people, organizations, products, events, locations, and others)
Output Schema:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"type": {
"type": "string",
"enum": ["person", "organization", "product", "event", "location", "other"]
},
"role": { "type": "string" },
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "neutral"]
},
"salience": { "type": "number", "minimum": 0, "maximum": 1 },
"mentions": { "type": "integer" }
},
"required": ["name", "type"]
}
}
Example Output:
[
{
"name": "Elon Musk",
"type": "person",
"role": "CEO of Tesla",
"sentiment": "neutral",
"salience": 0.85,
"mentions": 5
},
{
"name": "Tesla",
"type": "organization",
"role": "Subject company",
"sentiment": "positive",
"salience": 0.95,
"mentions": 12
}
]
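Consumers may want to sanity-check entity output before relying on it. The following is a minimal hand-rolled sketch that covers only the required fields, the two enums, and the salience range from the schema above; the function name `check_entity` is hypothetical, and a full JSON Schema validator would be the more thorough option.

```python
ENTITY_TYPES = {"person", "organization", "product", "event", "location", "other"}
SENTIMENTS = {"positive", "negative", "neutral"}


def check_entity(entity: dict) -> list[str]:
    """Return a list of problems; an empty list means the entity passes."""
    problems = []
    # "name" and "type" are the only required fields in the schema.
    for field in ("name", "type"):
        if field not in entity:
            problems.append(f"missing required field: {field}")
    if "type" in entity and entity["type"] not in ENTITY_TYPES:
        problems.append(f"unknown type: {entity['type']}")
    if "sentiment" in entity and entity["sentiment"] not in SENTIMENTS:
        problems.append(f"unknown sentiment: {entity['sentiment']}")
    salience = entity.get("salience")
    if salience is not None and not (0 <= salience <= 1):
        problems.append("salience must be between 0 and 1")
    return problems
```

An entity such as `{"name": "Tesla", "type": "organization", "salience": 0.95}` yields an empty list, while an entity without `type` yields one problem.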
2. Locations Extractor
ID: locations
Type: hybrid
Purpose: Extract and geocode mentioned locations
How It Works:
- Pattern matching to find candidate location names
- Fuzzy matching of candidates against a database of 225K locations
- LLM disambiguation for ambiguous cases (is "Paris" in France or in Texas?)
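The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not StoryIntel's implementation: `GAZETTEER` is a tiny hypothetical stand-in for the location database, the capitalized-word regex is a deliberately naive pattern matcher, `difflib` stands in for whatever fuzzy matcher is actually used, and `disambiguate` is a caller-supplied function standing in for the LLM call.

```python
import difflib
import re

# Hypothetical stand-in for the location database: name -> candidate records.
GAZETTEER = {
    "Paris": [
        {"location_id": "loc_paris_fr", "country_code": "FR"},
        {"location_id": "loc_paris_tx_us", "country_code": "US"},
    ],
    "San Francisco": [{"location_id": "loc_sf_ca_us", "country_code": "US"}],
}


def find_candidates(text):
    """Step 1: naive pattern match for runs of capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b", text)


def resolve_locations(text, disambiguate):
    """Steps 2 and 3: fuzzy-match candidates, then disambiguate if needed."""
    results = []
    for candidate in find_candidates(text):
        # Step 2: tolerate misspellings by taking the closest known name.
        match = difflib.get_close_matches(candidate, list(GAZETTEER), n=1, cutoff=0.8)
        if not match:
            continue  # e.g. sentence-initial words that are not places
        records = GAZETTEER[match[0]]
        # Step 3: only ambiguous names need the (LLM-backed) disambiguator.
        record = records[0] if len(records) == 1 else disambiguate(match[0], records, text)
        results.append({"name": match[0], **record})
    return results
```

Keeping the disambiguator as a parameter means the cheap steps run for every article and the model is only consulted when a name maps to more than one record.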
Output Schema:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"location_id": { "type": "string" },
"type": {
"type": "string",
"enum": ["city", "state", "country", "region", "address"]
},
"latitude": { "type": "number" },
"longitude": { "type": "number" },
"country_code": { "type": "string" },
"confidence": { "type": "number" }
},
"required": ["name", "type"]
}
}
Example Output:
[
{
"name": "San Francisco",
"location_id": "loc_sf_ca_us",
"type": "city",
"latitude": 37.7749,
"longitude": -122.4194,
"country_code": "US",
"confidence": 0.98
}
]
3. Events Extractor
ID: events
Type: ai
Purpose: Extract structured event data (conferences, earnings, launches)
Output Schema:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"type": {
"type": "string",
"enum": ["conference", "earnings", "product_launch", "regulatory", "election", "ipo", "acquisition", "layoff", "other"]
},
"start_date": { "type": "string", "format": "date" },
"start_time": { "type": "string", "format": "time" },
"end_date": { "type": "string", "format": "date" },
"end_time": { "type": "string", "format": "time" },
"is_multi_day": { "type": "boolean" },
"timezone": { "type": "string" },
"location": { "type": "string" },
"organizer": { "type": "string" },
"description": { "type": "string" },
"url": { "type": "string", "format": "uri" },
"confidence": { "type": "number" }
},
"required": ["name", "start_date"]
}
}
Example Output:
[
{
"name": "CES 2025",
"type": "conference",
"start_date": "2025-01-07",
"end_date": "2025-01-10",
"is_multi_day": true,
"timezone": "America/Los_Angeles",
"location": "Las Vegas, NV",
"organizer": "Consumer Technology Association",
"description": "Annual consumer electronics trade show",
"url": "https://www.ces.tech/",
"confidence": 0.95
}
]
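The `is_multi_day` flag can be cross-checked against the two date fields. The helper below is a small sketch of that check; the name `spans_multiple_days` is hypothetical, and treating a missing `end_date` as a single-day event is an assumption rather than documented behavior.

```python
from datetime import date


def spans_multiple_days(event: dict) -> bool:
    """Derive the multi-day flag from start_date and the optional end_date."""
    start = date.fromisoformat(event["start_date"])
    end_raw = event.get("end_date")
    if end_raw is None:
        return False  # assumption: no end date means a single-day event
    end = date.fromisoformat(end_raw)
    if end < start:
        raise ValueError("end_date is before start_date")
    return end > start
```

For the CES 2025 example above, the derived value agrees with the extracted `is_multi_day` of `true`.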
4. Quotes Extractor
ID: quotes
Type: ai
Purpose: Extract direct quotes with speaker attribution
Output Schema:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"text": { "type": "string" },
"speaker": { "type": "string" },
"speaker_title": { "type": "string" },
"speaker_org": { "type": "string" },
"context": { "type": "string" },
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "neutral"]
},
"is_direct": { "type": "boolean" }
},
"required": ["text", "speaker"]
}
}
5. Funding Rounds Extractor (NEW)
ID: funding_rounds
Type: ai
Purpose: Extract startup funding announcements with structured data
Output Schema:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"company_name": { "type": "string" },
"company_url": { "type": "string", "format": "uri" },
"round_type": {
"type": "string",
"enum": ["pre_seed", "seed", "series_a", "series_b", "series_c", "series_d", "series_e", "growth", "debt", "bridge", "ipo", "spac", "other"]
},
"amount_usd": { "type": "number" },
"amount_raw": { "type": "string" },
"currency": { "type": "string" },
"valuation_usd": { "type": "number" },
"valuation_raw": { "type": "string" },
"lead_investors": {
"type": "array",
"items": { "type": "string" }
},
"other_investors": {
"type": "array",
"items": { "type": "string" }
},
"announced_date": { "type": "string", "format": "date" },
"use_of_funds": { "type": "string" },
"sector": { "type": "string" },
"stage": { "type": "string" },
"confidence": { "type": "number" }
},
"required": ["company_name", "round_type"]
}
}
Example Output:
[
{
"company_name": "Acme AI",
"company_url": "https://acme.ai",
"round_type": "series_b",
"amount_usd": 50000000,
"amount_raw": "$50M",
"currency": "USD",
"valuation_usd": 250000000,
"valuation_raw": "$250M",
"lead_investors": ["Sequoia Capital"],
"other_investors": ["a16z", "Y Combinator"],
"announced_date": "2024-12-15",
"use_of_funds": "Expand engineering team and launch enterprise product",
"sector": "Artificial Intelligence",
"stage": "Growth",
"confidence": 0.92
}
]
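The relationship between `amount_raw` and `amount_usd` can be illustrated with a small parser. This is only a sketch: it handles dollar-denominated strings with an optional K/M/B suffix, and the function name `parse_usd_amount` is hypothetical. How the real extractor normalizes amounts or converts other currencies is not specified here.

```python
import re

MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_usd_amount(raw: str):
    """Parse strings such as "$50M" or "$1.5B" into whole dollars, else None."""
    match = re.fullmatch(r"\$\s*(\d+(?:\.\d+)?)\s*([KMB])?", raw.strip(), re.IGNORECASE)
    if match is None:
        return None  # non-USD or unrecognized format; leave amount_usd unset
    number, suffix = match.groups()
    multiplier = MULTIPLIERS[suffix.upper()] if suffix else 1
    return round(float(number) * multiplier)
```

Applied to the example above, `"$50M"` maps to `50000000` and `"$250M"` to `250000000`.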
LLM Prompt (abridged):
Extract funding round details from this article. Look for:
- Company name and website
- Round type (seed, series A/B/C, etc.)
- Amount raised (convert to USD if possible)
- Valuation if mentioned
- Lead investor(s) and participating investors
- Announced date
- Use of funds / what they'll do with the money
Return a JSON array of funding rounds. If no funding is mentioned, return [].
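Because the prompt asks for a JSON array (or `[]`), the caller has to turn free-form model output into that shape. The sketch below shows one defensive way to do it, assuming the model may wrap the array in extra prose; the function name `parse_funding_response` is hypothetical, and dropping items that lack the required fields is a design assumption.

```python
import json

REQUIRED_FIELDS = ("company_name", "round_type")


def parse_funding_response(response_text: str) -> list:
    """Return the funding rounds found in a model reply, or [] if none parse."""
    # Take the outermost [...] span so surrounding prose is ignored.
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        parsed = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only objects that carry the schema's required fields.
    return [
        item for item in parsed
        if isinstance(item, dict) and all(field in item for field in REQUIRED_FIELDS)
    ]
```

Falling back to an empty list mirrors the prompt's own convention for articles that mention no funding.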
6. Products Extractor
ID: products
Type: ai
Purpose: Extract product mentions and announcements
Output Schema:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"company": { "type": "string" },
"category": { "type": "string" },
"is_new": { "type": "boolean" },
"price": { "type": "string" },
"availability_date": { "type": "string", "format": "date" },
"description": { "type": "string" },
"sentiment": { "type": "string" },
"confidence": { "type": "number" }
},
"required": ["name", "company"]
}
}
7. Jobs Extractor
ID: jobs
Type: ai
Purpose: Extract job postings from news articles
Output Schema:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"company": { "type": "string" },
"location": { "type": "string" },
"remote": { "type": "boolean" },
"salary_range": { "type": "string" },
"job_type": {
"type": "string",
"enum": ["full_time", "part_time", "contract", "internship"]
},
"seniority": { "type": "string" },
"department": { "type": "string" },
"apply_url": { "type": "string", "format": "uri" },
"confidence": { "type": "number" }
},
"required": ["title", "company"]
}
}
Extractor Configuration
Database Schema
-- Extractor definitions
CREATE TABLE extractors (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
output_schema TEXT NOT NULL, -- JSON Schema for validation
run_on_ingest INTEGER DEFAULT 1, -- Auto-run on new articles?
priority INTEGER DEFAULT 100, -- Lower = runs first
is_active INTEGER DEFAULT 1,
description TEXT,
llm_prompt TEXT, -- Prompt for AI extractors
rules_config TEXT, -- Config for rules extractors
cost_estimate_micros INTEGER, -- Estimated cost per article
created_at TEXT DEFAULT (datetime('now')),
updated_at TEXT DEFAULT (datetime('now'))
);
-- Default extractors
INSERT INTO extractors (id, name, type, output_schema, priority, description) VALUES
('entities', 'Entity Extractor', 'ai', '...schema...', 10, 'Extract named entities'),
('locations', 'Location Extractor', 'hybrid', '...schema...', 20, 'Extract and geocode locations'),
('events', 'Event Extractor', 'ai', '...schema...', 30, 'Extract structured event data'),
('quotes', 'Quote Extractor', 'ai', '...schema...', 40, 'Extract quotes with attribution'),
('funding_rounds', 'Funding Round Extractor', 'ai', '...schema...', 50, 'Extract funding announcements'),
('products', 'Product Extractor', 'ai', '...schema...', 60, 'Extract product mentions'),
('jobs', 'Job Extractor', 'ai', '...schema...', 70, 'Extract job postings');
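To see how `priority`, `is_active`, and `run_on_ingest` interact, the sketch below loads a trimmed copy of the table into an in-memory SQLite database and asks which extractors would run on ingest, in order. The helper name `ingest_order` and the sample `legacy_tags` row are hypothetical, and only the columns needed for the query are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extractors (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
        run_on_ingest INTEGER DEFAULT 1,
        priority INTEGER DEFAULT 100,
        is_active INTEGER DEFAULT 1
    )
""")
conn.executemany(
    "INSERT INTO extractors (id, name, type, priority, is_active) VALUES (?, ?, ?, ?, ?)",
    [
        ("quotes", "Quote Extractor", "ai", 40, 1),
        ("entities", "Entity Extractor", "ai", 10, 1),
        ("locations", "Location Extractor", "hybrid", 20, 1),
        ("legacy_tags", "Legacy Tagger", "rules", 5, 0),  # hypothetical inactive row
    ],
)


def ingest_order(conn):
    """Return IDs of active, auto-run extractors; lower priority runs first."""
    rows = conn.execute(
        "SELECT id FROM extractors "
        "WHERE is_active = 1 AND run_on_ingest = 1 "
        "ORDER BY priority"
    ).fetchall()
    return [row[0] for row in rows]
```

The inactive row is skipped even though its priority value is the lowest.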
Customer Extractor Preferences
-- Which extractors are enabled per customer
CREATE TABLE customer_extractors (
customer_id TEXT NOT NULL REFERENCES customers(id),
extractor_id TEXT NOT NULL REFERENCES extractors(id),
is_enabled INTEGER DEFAULT 1,
config TEXT, -- Customer-specific overrides
PRIMARY KEY (customer_id, extractor_id)
);
Project-Specific Extractors
-- Which extractors run for a specific project
CREATE TABLE project_extractors (
project_id TEXT NOT NULL REFERENCES projects(id),
extractor_id TEXT NOT NULL REFERENCES extractors(id),
is_enabled INTEGER DEFAULT 1,
config TEXT, -- Project-specific config
priority_override INTEGER, -- Override default priority
PRIMARY KEY (project_id, extractor_id)
);
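One plausible reading of `priority_override` is that it replaces the extractor's default priority for that project when set. The sketch below expresses that reading with `COALESCE`; this resolution rule is an assumption, the helper name `project_run_order` and the sample rows are hypothetical, and the interplay with customer-level preferences is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE extractors (
        id TEXT PRIMARY KEY,
        priority INTEGER DEFAULT 100,
        is_active INTEGER DEFAULT 1
    );
    CREATE TABLE project_extractors (
        project_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL REFERENCES extractors(id),
        is_enabled INTEGER DEFAULT 1,
        priority_override INTEGER,
        PRIMARY KEY (project_id, extractor_id)
    );
    INSERT INTO extractors (id, priority) VALUES
        ('entities', 10), ('quotes', 40), ('funding_rounds', 50);
    INSERT INTO project_extractors
        (project_id, extractor_id, is_enabled, priority_override) VALUES
        ('proj_1', 'entities', 1, NULL),
        ('proj_1', 'quotes', 0, NULL),
        ('proj_1', 'funding_rounds', 1, 5);
""")


def project_run_order(conn, project_id):
    """Return (extractor_id, effective_priority) pairs for one project."""
    return conn.execute(
        """
        SELECT e.id, COALESCE(pe.priority_override, e.priority) AS effective_priority
        FROM project_extractors AS pe
        JOIN extractors AS e ON e.id = pe.extractor_id
        WHERE pe.project_id = ? AND pe.is_enabled = 1 AND e.is_active = 1
        ORDER BY effective_priority
        """,
        (project_id,),
    ).fetchall()
```

Here `funding_rounds` moves ahead of `entities` because its override of 5 beats the default of 10, and the disabled `quotes` row is excluded.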
Storage Schema
Extraction Results
-- Full extraction results per article
CREATE TABLE article_extractions (
article_id TEXT NOT NULL REFERENCES articles(id),
extractor_id TEXT NOT NULL REFERENCES extractors(id),
extracted_data TEXT NOT NULL, -- JSON matching output_schema
confidence REAL, -- Overall confidence 0-1
cost_micros INTEGER DEFAULT 0, -- Actual cost in microdollars
tokens_in INTEGER, -- Input tokens (for AI)
tokens_out INTEGER, -- Output tokens (for AI)
latency_ms INTEGER, -- Execution time
extracted_at TEXT DEFAULT (datetime('now')),
PRIMARY KEY (article_id, extractor_id)
);
CREATE INDEX idx_article_extractions_article ON article_extractions(article_id);
CREATE INDEX idx_article_extractions_extractor ON article_extractions(extractor_id);
CREATE INDEX idx_article_extractions_date ON article_extractions(extracted_at);
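The composite primary key allows one row per article and extractor, so re-running an extractor has to replace the earlier row. The sketch below uses SQLite's `INSERT OR REPLACE` for that; whether the pipeline overwrites or rejects re-runs is an assumption here, the helper name `save_extraction` is hypothetical, and the table is trimmed to the columns used.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_extractions (
        article_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL,
        extracted_data TEXT NOT NULL,
        confidence REAL,
        cost_micros INTEGER DEFAULT 0,
        extracted_at TEXT DEFAULT (datetime('now')),
        PRIMARY KEY (article_id, extractor_id)
    )
""")


def save_extraction(conn, article_id, extractor_id, data, confidence, cost_micros):
    """Store one extractor's result, replacing any earlier run for the article."""
    conn.execute(
        """
        INSERT OR REPLACE INTO article_extractions
            (article_id, extractor_id, extracted_data, confidence, cost_micros)
        VALUES (?, ?, ?, ?, ?)
        """,
        (article_id, extractor_id, json.dumps(data), confidence, cost_micros),
    )


# A second run for the same article and extractor overwrites the first.
save_extraction(conn, "art_123", "funding_rounds", [], 0.5, 500)
save_extraction(
    conn, "art_123", "funding_rounds",
    [{"company_name": "Acme AI", "round_type": "series_b"}], 0.88, 500,
)
```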
Denormalized Extraction Items
For fast querying across all articles, each extracted item is also stored as its own row:
-- Individual extracted items (searchable)
CREATE TABLE extraction_items (
id TEXT PRIMARY KEY,
article_id TEXT NOT NULL REFERENCES articles(id),
extractor_id TEXT NOT NULL REFERENCES extractors(id),
item_type TEXT NOT NULL, -- 'person', 'organization', 'event', 'funding_round', etc.
item_value TEXT NOT NULL, -- 'Elon Musk', 'Tesla', 'CES 2025', etc.
item_data TEXT, -- Additional structured data as JSON
confidence REAL,
created_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX idx_extraction_items_type ON extraction_items(item_type);
CREATE INDEX idx_extraction_items_value ON extraction_items(item_value);
CREATE INDEX idx_extraction_items_type_value ON extraction_items(item_type, item_value);
CREATE INDEX idx_extraction_items_article ON extraction_items(article_id);
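The sketch below illustrates how an entities result might be fanned out into `extraction_items` and then queried by type and value. The mapping (entity `type` to `item_type`, `name` to `item_value`, the full object to `item_data`) is an assumption, as are the helper names and the `item_` ID format.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE extraction_items (
        id TEXT PRIMARY KEY,
        article_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL,
        item_type TEXT NOT NULL,
        item_value TEXT NOT NULL,
        item_data TEXT,
        confidence REAL
    );
    CREATE INDEX idx_extraction_items_type_value
        ON extraction_items(item_type, item_value);
""")


def index_entities(conn, article_id, entities):
    """Write one searchable row per extracted entity."""
    for entity in entities:
        conn.execute(
            "INSERT INTO extraction_items "
            "(id, article_id, extractor_id, item_type, item_value, item_data, confidence) "
            "VALUES (?, ?, 'entities', ?, ?, ?, NULL)",
            (f"item_{uuid.uuid4().hex}", article_id,
             entity["type"], entity["name"], json.dumps(entity)),
        )


def articles_mentioning(conn, item_type, item_value):
    """Look up article IDs by type and value, served by the composite index."""
    rows = conn.execute(
        "SELECT DISTINCT article_id FROM extraction_items "
        "WHERE item_type = ? AND item_value = ? ORDER BY article_id",
        (item_type, item_value),
    ).fetchall()
    return [row[0] for row in rows]


index_entities(conn, "art_123", [
    {"name": "Elon Musk", "type": "person", "salience": 0.85},
    {"name": "Tesla", "type": "organization", "salience": 0.95},
])
```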
API Endpoints
List Extractors
GET /v1/extractors
Response:
{
"extractors": [
{
"id": "funding_rounds",
"name": "Funding Round Extractor",
"type": "ai",
"description": "Extract funding announcements",
"is_enabled": true,
"run_on_ingest": true,
"cost_estimate": "$0.0005/article"
}
]
}
Get Extraction Results
GET /v1/articles/:id/extractions
Response:
{
"article_id": "art_123",
"extractions": {
"entities": {
"data": [...],
"confidence": 0.92,
"extracted_at": "2024-12-19T10:00:00Z"
},
"funding_rounds": {
"data": [...],
"confidence": 0.88,
"extracted_at": "2024-12-19T10:00:01Z"
}
}
}
Search Extraction Items
GET /v1/extractions/search?type=funding_round&company=Acme
Response:
{
"items": [
{
"article_id": "art_123",
"item_type": "funding_round",
"item_value": "Acme AI Series B",
"item_data": {
"company_name": "Acme AI",
"round_type": "series_b",
"amount_usd": 50000000
},
"confidence": 0.92
}
]
}
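A client can assemble the search URL from its filters rather than concatenating strings by hand. In this sketch the host `api.example.com` is a placeholder and the helper name `search_url` is hypothetical; only the path and the `type` and `company` parameters come from the example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder host, not the real API


def search_url(item_type: str, **filters: str) -> str:
    """Build a search URL; urlencode escapes values such as spaces."""
    query = urlencode({"type": item_type, **filters})
    return f"{BASE_URL}/v1/extractions/search?{query}"
```

Calling `search_url("funding_round", company="Acme")` reproduces the query string shown in the request above.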
Run Extractor On-Demand
POST /v1/articles/:id/extract
Body: { "extractor_id": "funding_rounds" }
Response:
{
"article_id": "art_123",
"extractor_id": "funding_rounds",
"data": [...],
"confidence": 0.88,
"cost_micros": 500
}
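The on-demand call can be prepared with Python's standard library as sketched below. The host is a placeholder, the helper name `build_extract_request` is hypothetical, and authentication is omitted because it is not described here; the request is only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real API


def build_extract_request(article_id: str, extractor_id: str) -> urllib.request.Request:
    """Prepare the POST /v1/articles/:id/extract request with a JSON body."""
    body = json.dumps({"extractor_id": extractor_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/articles/{article_id}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request("art_123", "funding_rounds")
```

Passing `request` to `urllib.request.urlopen` would send it; any required credentials would need to be added to the headers first.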
Cost Tracking
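Every extraction operation logs its cost to the cost_events table in microdollars (millionths of a US dollar), so a cost_micros of 500 is $0.0005: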
INSERT INTO cost_events (
id,
operation_type,
service,
operation_id,
article_id,
customer_id,
cost_micros,
metadata
) VALUES (
'cost_xxx',
'extraction',
'workers_ai',
'funding_rounds',
'art_123',
'cust_456',
500,
'{"tokens_in": 1500, "tokens_out": 200, "model": "llama-2-7b"}'
);
Cost Estimates by Extractor
| Extractor | Type | Est. Cost/Article |
|---|---|---|
| entities | AI | $0.0005 |
| locations | Hybrid | $0.0001 |
| events | AI | $0.0004 |
| quotes | AI | $0.0003 |
| funding_rounds | AI | $0.0005 |
| products | AI | $0.0003 |
| jobs | AI | $0.0002 |
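The estimates in the table can be combined into a rough budget. The sketch below stores them as integer microdollars to avoid floating-point drift while summing; the function name `estimate_cost_usd` is hypothetical, and the figures are the table's estimates rather than measured costs.

```python
# Estimated cost per article in microdollars, taken from the table above.
EST_COST_MICROS = {
    "entities": 500,
    "locations": 100,
    "events": 400,
    "quotes": 300,
    "funding_rounds": 500,
    "products": 300,
    "jobs": 200,
}


def estimate_cost_usd(article_count: int, extractor_ids) -> float:
    """Estimate total cost in dollars for running the given extractors."""
    per_article_micros = sum(EST_COST_MICROS[e] for e in extractor_ids)
    return article_count * per_article_micros / 1_000_000
```

Running all seven built-in extractors costs an estimated 2,300 microdollars per article, so 10,000 articles come to about $23.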
Creating Custom Extractors
Step 1: Define the Schema
INSERT INTO extractors (
id,
name,
type,
output_schema,
priority,
description,
llm_prompt
) VALUES (
'earnings_calls',
'Earnings Call Extractor',
'ai',
'{
"type": "array",
"items": {
"type": "object",
"properties": {
"company": { "type": "string" },
"quarter": { "type": "string" },
"year": { "type": "integer" },
"revenue": { "type": "number" },
"eps": { "type": "number" },
"guidance": { "type": "string" },
"call_date": { "type": "string", "format": "date" }
},
"required": ["company", "quarter", "year"]
}
}',
55,
'Extract earnings call details',
'Extract earnings call information from this article...'
);
Step 2: Enable for Customers/Projects
-- Enable for a project
INSERT INTO project_extractors (project_id, extractor_id, is_enabled)
VALUES ('proj_earnings_watch', 'earnings_calls', 1);
Step 3: The Pipeline Runs It Automatically
When an article matches the project's keywords, the pipeline runs each extractor enabled for that project and stores the results in article_extractions and extraction_items.
Quick Links
- Plugin Overview — Extension point architecture
- Source Adapters — Adding new data sources
- Output Adapters — Delivery destinations
- Database Schema — Full table definitions