
Extraction Plugins

Transform unstructured article text into structured, queryable data.

Extraction plugins are the heart of StoryIntel's intelligence layer. They take raw article content and produce structured JSON that can be searched, filtered, and analyzed.


How Extraction Works


Built-in Extractors

1. Entities Extractor

ID: entities
Type: ai
Purpose: Extract named entities (people, organizations, products)

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "type": {
        "type": "string",
        "enum": ["person", "organization", "product", "event", "location", "other"]
      },
      "role": { "type": "string" },
      "sentiment": {
        "type": "string",
        "enum": ["positive", "negative", "neutral"]
      },
      "salience": { "type": "number", "minimum": 0, "maximum": 1 },
      "mentions": { "type": "integer" }
    },
    "required": ["name", "type"]
  }
}

Example Output:

[
  {
    "name": "Elon Musk",
    "type": "person",
    "role": "CEO of Tesla",
    "sentiment": "neutral",
    "salience": 0.85,
    "mentions": 5
  },
  {
    "name": "Tesla",
    "type": "organization",
    "role": "Subject company",
    "sentiment": "positive",
    "salience": 0.95,
    "mentions": 12
  }
]
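Before storing entity output, a consumer may want to check it against the schema. The sketch below hand-codes the constraints listed above (required fields, enums, salience range) using only the standard library. `validate_entity` is a hypothetical helper for illustration, not part of StoryIntel, and a full JSON Schema validator would be the more general choice.

```python
ENTITY_TYPES = {"person", "organization", "product", "event", "location", "other"}
SENTIMENTS = {"positive", "negative", "neutral"}


def validate_entity(item: dict) -> list[str]:
    """Return a list of schema violations for one entity (empty if valid)."""
    errors = []
    # "name" and "type" are the only required fields in the schema.
    for field in ("name", "type"):
        if not isinstance(item.get(field), str):
            errors.append(f"missing or non-string field: {field}")
    entity_type = item.get("type")
    if isinstance(entity_type, str) and entity_type not in ENTITY_TYPES:
        errors.append(f"unknown type: {entity_type!r}")
    if "sentiment" in item:
        sentiment = item["sentiment"]
        if not (isinstance(sentiment, str) and sentiment in SENTIMENTS):
            errors.append(f"unknown sentiment: {sentiment!r}")
    if "salience" in item:
        s = item["salience"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(s, bool) or not isinstance(s, (int, float)) or not 0 <= s <= 1:
            errors.append("salience must be a number between 0 and 1")
    if "mentions" in item:
        m = item["mentions"]
        if isinstance(m, bool) or not isinstance(m, int):
            errors.append("mentions must be an integer")
    return errors
```

An empty list means the item satisfies every constraint this sketch checks; anything else can be logged or used to reject the extraction.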

2. Locations Extractor

ID: locations
Type: hybrid
Purpose: Extract and geocode mentioned locations

How It Works:

  1. Pattern matching for location names
  2. Fuzzy matching against a database of 225K locations
  3. LLM disambiguation for ambiguous cases (is "Paris" in France or Texas?)
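The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the two-entry `GAZETTEER` stands in for the location database, the `loc_paris_*` IDs are made up, the capitalized-word regex is a deliberately naive pattern matcher, and `disambiguate` is a placeholder where the LLM call would go.

```python
import difflib
import re

# Toy gazetteer standing in for the location database (hypothetical data).
GAZETTEER = {
    "san francisco": [{"location_id": "loc_sf_ca_us", "country_code": "US"}],
    "paris": [
        {"location_id": "loc_paris_fr", "country_code": "FR"},
        {"location_id": "loc_paris_tx_us", "country_code": "US"},
    ],
}


def find_candidates(text: str) -> list[str]:
    """Step 1: naive pattern match for runs of capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def fuzzy_lookup(name: str, cutoff: float = 0.85) -> list[dict]:
    """Step 2: fuzzy match the candidate against gazetteer keys."""
    hits = difflib.get_close_matches(name.lower(), list(GAZETTEER), n=1, cutoff=cutoff)
    return GAZETTEER[hits[0]] if hits else []


def disambiguate(name: str, options: list[dict], text: str) -> dict:
    """Step 3: placeholder for the LLM call; this stub just takes the first option."""
    return options[0]


def extract_locations(text: str) -> list[dict]:
    results = []
    for name in find_candidates(text):
        options = fuzzy_lookup(name)
        if not options:
            continue  # capitalized word that is not a known location
        match = options[0] if len(options) == 1 else disambiguate(name, options, text)
        results.append({"name": name, **match})
    return results
```

The point of the ordering is cost: only names that survive the cheap steps and still have more than one candidate reach the disambiguation call.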

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "location_id": { "type": "string" },
      "type": {
        "type": "string",
        "enum": ["city", "state", "country", "region", "address"]
      },
      "latitude": { "type": "number" },
      "longitude": { "type": "number" },
      "country_code": { "type": "string" },
      "confidence": { "type": "number" }
    },
    "required": ["name", "type"]
  }
}

Example Output:

[
  {
    "name": "San Francisco",
    "location_id": "loc_sf_ca_us",
    "type": "city",
    "latitude": 37.7749,
    "longitude": -122.4194,
    "country_code": "US",
    "confidence": 0.98
  }
]

3. Events Extractor

ID: events
Type: ai
Purpose: Extract structured event data (conferences, earnings, launches)

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "type": {
        "type": "string",
        "enum": ["conference", "earnings", "product_launch", "regulatory", "election", "ipo", "acquisition", "layoff", "other"]
      },
      "start_date": { "type": "string", "format": "date" },
      "start_time": { "type": "string", "format": "time" },
      "end_date": { "type": "string", "format": "date" },
      "end_time": { "type": "string", "format": "time" },
      "is_multi_day": { "type": "boolean" },
      "timezone": { "type": "string" },
      "location": { "type": "string" },
      "organizer": { "type": "string" },
      "description": { "type": "string" },
      "url": { "type": "string", "format": "uri" },
      "confidence": { "type": "number" }
    },
    "required": ["name", "start_date"]
  }
}

Example Output:

[
  {
    "name": "CES 2025",
    "type": "conference",
    "start_date": "2025-01-07",
    "end_date": "2025-01-10",
    "is_multi_day": true,
    "timezone": "America/Los_Angeles",
    "location": "Las Vegas, NV",
    "organizer": "Consumer Technology Association",
    "description": "Annual consumer electronics trade show",
    "url": "https://www.ces.tech/",
    "confidence": 0.95
  }
]
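Because `is_multi_day` can be derived from the two date fields, a consumer may prefer to recompute it rather than trust the model's value. The helper below is a hypothetical post-processing step, assuming dates arrive in ISO 8601 form as the schema's `date` format implies.

```python
from datetime import date


def normalize_event(event: dict) -> dict:
    """Recompute is_multi_day from start_date/end_date; reject inverted ranges."""
    start = date.fromisoformat(event["start_date"])  # required by the schema
    # A missing end_date is treated as a single-day event.
    end = date.fromisoformat(event.get("end_date", event["start_date"]))
    if end < start:
        raise ValueError("end_date is before start_date")
    return {**event, "is_multi_day": end > start}
```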

4. Quotes Extractor

ID: quotes
Type: ai
Purpose: Extract direct quotes with speaker attribution

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "text": { "type": "string" },
      "speaker": { "type": "string" },
      "speaker_title": { "type": "string" },
      "speaker_org": { "type": "string" },
      "context": { "type": "string" },
      "sentiment": {
        "type": "string",
        "enum": ["positive", "negative", "neutral"]
      },
      "is_direct": { "type": "boolean" }
    },
    "required": ["text", "speaker"]
  }
}

5. Funding Rounds Extractor (NEW)

ID: funding_rounds
Type: ai
Purpose: Extract startup funding announcements with structured data

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "company_name": { "type": "string" },
      "company_url": { "type": "string", "format": "uri" },
      "round_type": {
        "type": "string",
        "enum": ["pre_seed", "seed", "series_a", "series_b", "series_c", "series_d", "series_e", "growth", "debt", "bridge", "ipo", "spac", "other"]
      },
      "amount_usd": { "type": "number" },
      "amount_raw": { "type": "string" },
      "currency": { "type": "string" },
      "valuation_usd": { "type": "number" },
      "valuation_raw": { "type": "string" },
      "lead_investors": {
        "type": "array",
        "items": { "type": "string" }
      },
      "other_investors": {
        "type": "array",
        "items": { "type": "string" }
      },
      "announced_date": { "type": "string", "format": "date" },
      "use_of_funds": { "type": "string" },
      "sector": { "type": "string" },
      "stage": { "type": "string" },
      "confidence": { "type": "number" }
    },
    "required": ["company_name", "round_type"]
  }
}

Example Output:

[
  {
    "company_name": "Acme AI",
    "company_url": "https://acme.ai",
    "round_type": "series_b",
    "amount_usd": 50000000,
    "amount_raw": "$50M",
    "currency": "USD",
    "valuation_usd": 250000000,
    "valuation_raw": "$250M",
    "lead_investors": ["Sequoia Capital"],
    "other_investors": ["a16z", "Y Combinator"],
    "announced_date": "2024-12-15",
    "use_of_funds": "Expand engineering team and launch enterprise product",
    "sector": "Artificial Intelligence",
    "stage": "Growth",
    "confidence": 0.92
  }
]

LLM Prompt (abridged):

Extract funding round details from this article. Look for:
- Company name and website
- Round type (seed, series A/B/C, etc.)
- Amount raised (convert to USD if possible)
- Valuation if mentioned
- Lead investor(s) and participating investors
- Announced date
- Use of funds / what they'll do with the money

Return a JSON array of funding rounds. If no funding is mentioned, return [].
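The schema keeps both `amount_raw` and `amount_usd`, which makes it possible to cross-check the model's conversion. As a minimal sketch, the hypothetical helper below parses dollar strings such as "$50M" into whole dollars. It assumes K/M/B suffixes only and returns `None` for anything else, leaving non-USD amounts to a separate currency-conversion step that is not shown here.

```python
import re
from typing import Optional

MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_usd_amount(raw: str) -> Optional[int]:
    """Parse strings such as "$50M" or "$1.2B" into whole dollars."""
    match = re.fullmatch(r"\$\s*(\d+(?:\.\d+)?)\s*([KMB])?", raw.strip(), re.IGNORECASE)
    if match is None:
        return None  # not a plain dollar amount
    number, suffix = match.groups()
    multiplier = MULTIPLIERS.get((suffix or "").upper(), 1)
    return round(float(number) * multiplier)
```

For the example output above, `parse_usd_amount("$50M")` gives 50000000, matching `amount_usd`. A mismatch between the parsed value and the model's number is a reasonable trigger for lowering `confidence`.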

6. Products Extractor

ID: products
Type: ai
Purpose: Extract product mentions and announcements

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "company": { "type": "string" },
      "category": { "type": "string" },
      "is_new": { "type": "boolean" },
      "price": { "type": "string" },
      "availability_date": { "type": "string", "format": "date" },
      "description": { "type": "string" },
      "sentiment": { "type": "string" },
      "confidence": { "type": "number" }
    },
    "required": ["name", "company"]
  }
}

7. Jobs Extractor

ID: jobs
Type: ai
Purpose: Extract job postings from news articles

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "company": { "type": "string" },
      "location": { "type": "string" },
      "remote": { "type": "boolean" },
      "salary_range": { "type": "string" },
      "job_type": {
        "type": "string",
        "enum": ["full_time", "part_time", "contract", "internship"]
      },
      "seniority": { "type": "string" },
      "department": { "type": "string" },
      "apply_url": { "type": "string", "format": "uri" },
      "confidence": { "type": "number" }
    },
    "required": ["title", "company"]
  }
}

Extractor Configuration

Database Schema

-- Extractor definitions
CREATE TABLE extractors (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
  output_schema TEXT NOT NULL, -- JSON Schema for validation
  run_on_ingest INTEGER DEFAULT 1, -- Auto-run on new articles?
  priority INTEGER DEFAULT 100, -- Lower = runs first
  is_active INTEGER DEFAULT 1,
  description TEXT,
  llm_prompt TEXT, -- Prompt for AI extractors
  rules_config TEXT, -- Config for rules extractors
  cost_estimate_micros INTEGER, -- Estimated cost per article
  created_at TEXT DEFAULT (datetime('now')),
  updated_at TEXT DEFAULT (datetime('now'))
);

-- Default extractors
INSERT INTO extractors (id, name, type, output_schema, priority, description) VALUES
  ('entities', 'Entity Extractor', 'ai', '...schema...', 10, 'Extract named entities'),
  ('locations', 'Location Extractor', 'hybrid', '...schema...', 20, 'Extract and geocode locations'),
  ('events', 'Event Extractor', 'ai', '...schema...', 30, 'Extract structured event data'),
  ('quotes', 'Quote Extractor', 'ai', '...schema...', 40, 'Extract quotes with attribution'),
  ('funding_rounds', 'Funding Round Extractor', 'ai', '...schema...', 50, 'Extract funding announcements'),
  ('products', 'Product Extractor', 'ai', '...schema...', 60, 'Extract product mentions'),
  ('jobs', 'Job Extractor', 'ai', '...schema...', 70, 'Extract job postings');

Customer Extractor Preferences

-- Which extractors are enabled per customer
CREATE TABLE customer_extractors (
  customer_id TEXT NOT NULL REFERENCES customers(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  is_enabled INTEGER DEFAULT 1,
  config TEXT, -- Customer-specific overrides
  PRIMARY KEY (customer_id, extractor_id)
);

Project-Specific Extractors

-- Which extractors run for a specific project
CREATE TABLE project_extractors (
  project_id TEXT NOT NULL REFERENCES projects(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  is_enabled INTEGER DEFAULT 1,
  config TEXT, -- Project-specific config
  priority_override INTEGER, -- Override default priority
  PRIMARY KEY (project_id, extractor_id)
);
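To illustrate how `priority_override` might interact with the default `priority`, the sketch below resolves the run order for one project against an in-memory SQLite database. It assumes the override simply replaces the default when present and that "lower runs first" still applies; the tables are trimmed to the columns used, and `proj_1` and `extractors_for_project` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Trimmed-down versions of the tables above (only the columns used here).
    CREATE TABLE extractors (
        id TEXT PRIMARY KEY,
        priority INTEGER DEFAULT 100,
        is_active INTEGER DEFAULT 1
    );
    CREATE TABLE project_extractors (
        project_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL,
        is_enabled INTEGER DEFAULT 1,
        priority_override INTEGER,
        PRIMARY KEY (project_id, extractor_id)
    );
    INSERT INTO extractors (id, priority) VALUES
        ('entities', 10), ('locations', 20), ('funding_rounds', 50);
    INSERT INTO project_extractors
        (project_id, extractor_id, is_enabled, priority_override) VALUES
        ('proj_1', 'entities', 1, NULL),
        ('proj_1', 'locations', 0, NULL),
        ('proj_1', 'funding_rounds', 1, 5);
""")


def extractors_for_project(project_id: str) -> list[str]:
    """Enabled, active extractors for a project, lowest effective priority first."""
    rows = conn.execute(
        """
        SELECT e.id
        FROM project_extractors AS pe
        JOIN extractors AS e ON e.id = pe.extractor_id
        WHERE pe.project_id = ? AND pe.is_enabled = 1 AND e.is_active = 1
        ORDER BY COALESCE(pe.priority_override, e.priority)
        """,
        (project_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Here `funding_rounds` is overridden to priority 5, so it runs ahead of `entities` (default 10), while the disabled `locations` row is skipped entirely.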

Storage Schema

Extraction Results

-- Full extraction results per article
CREATE TABLE article_extractions (
  article_id TEXT NOT NULL REFERENCES articles(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  extracted_data TEXT NOT NULL, -- JSON matching output_schema
  confidence REAL, -- Overall confidence 0-1
  cost_micros INTEGER DEFAULT 0, -- Actual cost in microdollars
  tokens_in INTEGER, -- Input tokens (for AI)
  tokens_out INTEGER, -- Output tokens (for AI)
  latency_ms INTEGER, -- Execution time
  extracted_at TEXT DEFAULT (datetime('now')),
  PRIMARY KEY (article_id, extractor_id)
);

CREATE INDEX idx_article_extractions_article ON article_extractions(article_id);
CREATE INDEX idx_article_extractions_extractor ON article_extractions(extractor_id);
CREATE INDEX idx_article_extractions_date ON article_extractions(extracted_at);

Denormalized Extraction Items

For fast querying across all articles:

-- Individual extracted items (searchable)
CREATE TABLE extraction_items (
  id TEXT PRIMARY KEY,
  article_id TEXT NOT NULL REFERENCES articles(id),
  extractor_id TEXT NOT NULL REFERENCES extractors(id),
  item_type TEXT NOT NULL, -- 'person', 'org', 'event', 'funding_round', etc.
  item_value TEXT NOT NULL, -- 'Elon Musk', 'Tesla', 'CES 2025', etc.
  item_data TEXT, -- Additional structured data as JSON
  confidence REAL,
  created_at TEXT DEFAULT (datetime('now'))
);

CREATE INDEX idx_extraction_items_type ON extraction_items(item_type);
CREATE INDEX idx_extraction_items_value ON extraction_items(item_value);
CREATE INDEX idx_extraction_items_type_value ON extraction_items(item_type, item_value);
CREATE INDEX idx_extraction_items_article ON extraction_items(article_id);
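As a sketch of how an extraction result could be flattened into this table, the example below writes one row per entity and then looks articles up by type and value. Several details are assumptions rather than documented behavior: the `item_` ID format, the use of the entity's `type` field directly as `item_type`, and copying the extraction-level confidence onto every row. The table is trimmed to the columns used.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extraction_items (
        id TEXT PRIMARY KEY,
        article_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL,
        item_type TEXT NOT NULL,
        item_value TEXT NOT NULL,
        item_data TEXT,
        confidence REAL
    )
""")


def index_entities(article_id: str, entities: list[dict], confidence: float) -> int:
    """Write one extraction_items row per entity; returns the number of rows."""
    rows = [
        (f"item_{uuid.uuid4().hex}", article_id, "entities",
         entity["type"], entity["name"], json.dumps(entity), confidence)
        for entity in entities
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO extraction_items VALUES (?, ?, ?, ?, ?, ?, ?)", rows
        )
    return len(rows)


index_entities("art_123", [
    {"name": "Elon Musk", "type": "person", "salience": 0.85},
    {"name": "Tesla", "type": "organization", "salience": 0.95},
], confidence=0.92)

# Which articles mention Tesla as an organization?
articles = [row[0] for row in conn.execute(
    "SELECT DISTINCT article_id FROM extraction_items"
    " WHERE item_type = ? AND item_value = ?",
    ("organization", "Tesla"),
)]
```

In the full schema, a lookup on `item_type` plus `item_value` like the one above is the access pattern that `idx_extraction_items_type_value` is there to serve.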

API Endpoints

List Extractors

GET /v1/extractors

Response:
{
  "extractors": [
    {
      "id": "funding_rounds",
      "name": "Funding Round Extractor",
      "type": "ai",
      "description": "Extract funding announcements",
      "is_enabled": true,
      "run_on_ingest": true,
      "cost_estimate": "$0.0005/article"
    }
  ]
}

Get Extraction Results

GET /v1/articles/:id/extractions

Response:
{
  "article_id": "art_123",
  "extractions": {
    "entities": {
      "data": [...],
      "confidence": 0.92,
      "extracted_at": "2024-12-19T10:00:00Z"
    },
    "funding_rounds": {
      "data": [...],
      "confidence": 0.88,
      "extracted_at": "2024-12-19T10:00:01Z"
    }
  }
}

Search Extraction Items

GET /v1/extractions/search?type=funding_round&company=Acme

Response:
{
  "items": [
    {
      "article_id": "art_123",
      "item_type": "funding_round",
      "item_value": "Acme AI Series B",
      "item_data": {
        "company_name": "Acme AI",
        "round_type": "series_b",
        "amount_usd": 50000000
      },
      "confidence": 0.92
    }
  ]
}

Run Extractor On-Demand

POST /v1/articles/:id/extract
Body: { "extractor_id": "funding_rounds" }

Response:
{
  "article_id": "art_123",
  "extractor_id": "funding_rounds",
  "data": [...],
  "confidence": 0.88,
  "cost_micros": 500
}

Cost Tracking

Every extraction operation logs its cost:

INSERT INTO cost_events (
  id,
  operation_type,
  service,
  operation_id,
  article_id,
  customer_id,
  cost_micros,
  metadata
) VALUES (
  'cost_xxx',
  'extraction',
  'workers_ai',
  'funding_rounds',
  'art_123',
  'cust_456',
  500,
  '{"tokens_in": 1500, "tokens_out": 200, "model": "llama-2-7b"}'
);

Cost Estimates by Extractor

Extractor        Type     Est. Cost/Article
entities         AI       $0.0005
locations        Hybrid   $0.0001
events           AI       $0.0004
quotes           AI       $0.0003
funding_rounds   AI       $0.0005
products         AI       $0.0003
jobs             AI       $0.0002
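Costs are stored as integer microdollars (`cost_micros`), so 500 micros corresponds to the $0.0005 shown for `funding_rounds`. The sketch below converts between the two and totals the estimates above for a batch of articles. The helper names are hypothetical and the figures are the table's estimates, not measured costs.

```python
# Per-article estimates from the table above, in microdollars
# (1 USD = 1,000,000 micros).
COST_ESTIMATE_MICROS = {
    "entities": 500,
    "locations": 100,
    "events": 400,
    "quotes": 300,
    "funding_rounds": 500,
    "products": 300,
    "jobs": 200,
}


def micros_to_usd(micros: int) -> str:
    """Format an integer microdollar amount as a dollar string."""
    return f"${micros / 1_000_000:.4f}"


def estimate_micros(extractor_ids: list[str], article_count: int) -> int:
    """Estimated total cost of running the given extractors on N articles."""
    per_article = sum(COST_ESTIMATE_MICROS[e] for e in extractor_ids)
    return per_article * article_count
```

Running all seven extractors comes to an estimated 2,300 micros ($0.0023) per article, or about $23 per 10,000 articles. Keeping the stored values as integers avoids floating-point drift when many small costs are summed.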

Creating Custom Extractors

Step 1: Define the Schema

INSERT INTO extractors (
  id,
  name,
  type,
  output_schema,
  priority,
  description,
  llm_prompt
) VALUES (
  'earnings_calls',
  'Earnings Call Extractor',
  'ai',
  '{
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "company": { "type": "string" },
        "quarter": { "type": "string" },
        "year": { "type": "integer" },
        "revenue": { "type": "number" },
        "eps": { "type": "number" },
        "guidance": { "type": "string" },
        "call_date": { "type": "string", "format": "date" }
      },
      "required": ["company", "quarter", "year"]
    }
  }',
  55,
  'Extract earnings call details',
  'Extract earnings call information from this article...'
);

Step 2: Enable for Customers/Projects

-- Enable for a project
INSERT INTO project_extractors (project_id, extractor_id, is_enabled)
  VALUES ('proj_earnings_watch', 'earnings_calls', 1);

Step 3: The Pipeline Runs It Automatically

When articles match the project's keywords, the extractor runs and stores results.
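The run-and-store step might look roughly like the sketch below. Everything in it is illustrative: `fake_earnings_extractor` stands in for the LLM call and returns made-up data, the table is trimmed to four columns, and using `INSERT OR REPLACE` so that a re-run overwrites the previous result is an assumption suggested by the composite primary key, not documented pipeline behavior.

```python
import json
import sqlite3
import time
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_extractions (
        article_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL,
        extracted_data TEXT NOT NULL,
        latency_ms INTEGER,
        PRIMARY KEY (article_id, extractor_id)
    )
""")


def fake_earnings_extractor(text: str) -> list[dict]:
    """Stand-in for the LLM call; returns [] when nothing relevant is found."""
    if "earnings" not in text.lower():
        return []
    return [{"company": "Acme AI", "quarter": "Q4", "year": 2024}]  # made-up data


def run_extractor(article_id: str, text: str, extractor_id: str,
                  extract: Callable[[str], list[dict]]) -> list[dict]:
    """Run one extractor on one article and store the result with its latency."""
    started = time.perf_counter()
    data = extract(text)
    latency_ms = int((time.perf_counter() - started) * 1000)
    with conn:
        # One row per (article, extractor): a re-run replaces the earlier result.
        conn.execute(
            "INSERT OR REPLACE INTO article_extractions VALUES (?, ?, ?, ?)",
            (article_id, extractor_id, json.dumps(data), latency_ms),
        )
    return data
```

A call such as `run_extractor("art_123", article_text, "earnings_calls", fake_earnings_extractor)` leaves exactly one row for that article and extractor, however many times it is repeated.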