Extraction Plugins

Transform unstructured article text into structured, queryable data.

Extraction plugins are the heart of StoryIntel's intelligence layer. They take raw article content and produce structured JSON that can be searched, filtered, and analyzed.

How Extraction Works

Built-in Extractors

1. Entities Extractor

ID: entities
Type: ai
Purpose: Extract named entities (people, organizations, products)

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "type": { 
        "type": "string", 
        "enum": ["person", "organization", "product", "event", "location", "other"] 
      },
      "role": { "type": "string" },
      "sentiment": { 
        "type": "string", 
        "enum": ["positive", "negative", "neutral"] 
      },
      "salience": { "type": "number", "minimum": 0, "maximum": 1 },
      "mentions": { "type": "integer" }
    },
    "required": ["name", "type"]
  }
}

Example Output:

[
  {
    "name": "Elon Musk",
    "type": "person",
    "role": "CEO of Tesla",
    "sentiment": "neutral",
    "salience": 0.85,
    "mentions": 5
  },
  {
    "name": "Tesla",
    "type": "organization",
    "role": "Subject company",
    "sentiment": "positive",
    "salience": 0.95,
    "mentions": 12
  }
]
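
As a minimal sketch of how this output might be consumed, the snippet below keeps entities above a salience threshold and sorts them. The helper name `top_entities` and the 0.5 default are illustrative assumptions, not part of StoryIntel.

```python
import json

# Example output from the entities extractor (copied from above).
raw = """[
  {"name": "Elon Musk", "type": "person", "role": "CEO of Tesla",
   "sentiment": "neutral", "salience": 0.85, "mentions": 5},
  {"name": "Tesla", "type": "organization", "role": "Subject company",
   "sentiment": "positive", "salience": 0.95, "mentions": 12}
]"""


def top_entities(entities, min_salience=0.5):
    """Return entities at or above min_salience, most salient first.

    Hypothetical helper: "salience" is optional in the schema, so a
    missing value is treated as 0.
    """
    kept = [e for e in entities if e.get("salience", 0) >= min_salience]
    return sorted(kept, key=lambda e: e.get("salience", 0), reverse=True)


ranked = top_entities(json.loads(raw))
print([e["name"] for e in ranked])  # ['Tesla', 'Elon Musk']
```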

2. Locations Extractor

ID: locations
Type: hybrid
Purpose: Extract and geocode mentioned locations

How It Works:

1. Pattern matching to find candidate location names
2. Fuzzy matching against a database of 225K locations
3. LLM disambiguation for ambiguous cases ("Paris" - France or Texas?)
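
The three stages above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the regex, the tiny `GAZETTEER` dictionary (standing in for the 225K location database), the Paris location IDs, and the `pick_first` placeholder for the LLM call are all hypothetical.

```python
import difflib
import re

# Tiny stand-in for the location database (hypothetical rows).
GAZETTEER = {
    "San Francisco": [{"location_id": "loc_sf_ca_us", "country_code": "US"}],
    "Paris": [
        {"location_id": "loc_paris_fr", "country_code": "FR"},
        {"location_id": "loc_paris_tx_us", "country_code": "US"},
    ],
}


def find_candidates(text):
    """Stage 1: naive pattern match for runs of capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def fuzzy_lookup(name, cutoff=0.8):
    """Stage 2: fuzzy match a candidate against the gazetteer names."""
    hits = difflib.get_close_matches(name, list(GAZETTEER), n=1, cutoff=cutoff)
    return hits[0] if hits else None


def resolve(text, disambiguate):
    """Stage 3: call `disambiguate` only when a name has several matches."""
    results = []
    for candidate in find_candidates(text):
        key = fuzzy_lookup(candidate)
        if key is None:
            continue
        options = GAZETTEER[key]
        chosen = options[0] if len(options) == 1 else disambiguate(key, options, text)
        results.append({"name": key, **chosen})
    return results


def pick_first(name, options, text):
    """Placeholder for the LLM call: always picks the first option."""
    return options[0]


# The misspelling "San Fransisco" is still resolved by the fuzzy stage.
found = resolve("The startup moved from San Fransisco to Paris.", pick_first)
print([f["location_id"] for f in found])  # ['loc_sf_ca_us', 'loc_paris_fr']
```

Only "Paris" reaches the disambiguation step here, which is the point of a hybrid design: the expensive model call is reserved for names the cheaper stages cannot settle.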

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "location_id": { "type": "string" },
      "type": { 
        "type": "string", 
        "enum": ["city", "state", "country", "region", "address"] 
      },
      "latitude": { "type": "number" },
      "longitude": { "type": "number" },
      "country_code": { "type": "string" },
      "confidence": { "type": "number" }
    },
    "required": ["name", "type"]
  }
}

Example Output:

[
  {
    "name": "San Francisco",
    "location_id": "loc_sf_ca_us",
    "type": "city",
    "latitude": 37.7749,
    "longitude": -122.4194,
    "country_code": "US",
    "confidence": 0.98
  }
]

3. Events Extractor

ID: events
Type: ai
Purpose: Extract structured event data (conferences, earnings, launches)

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "type": { 
        "type": "string", 
        "enum": ["conference", "earnings", "product_launch", "regulatory", "election", "ipo", "acquisition", "layoff", "other"] 
      },
      "start_date": { "type": "string", "format": "date" },
      "start_time": { "type": "string", "format": "time" },
      "end_date": { "type": "string", "format": "date" },
      "end_time": { "type": "string", "format": "time" },
      "is_multi_day": { "type": "boolean" },
      "timezone": { "type": "string" },
      "location": { "type": "string" },
      "organizer": { "type": "string" },
      "description": { "type": "string" },
      "url": { "type": "string", "format": "uri" },
      "confidence": { "type": "number" }
    },
    "required": ["name", "start_date"]
  }
}

Example Output:

[
  {
    "name": "CES 2025",
    "type": "conference",
    "start_date": "2025-01-07",
    "end_date": "2025-01-10",
    "is_multi_day": true,
    "timezone": "America/Los_Angeles",
    "location": "Las Vegas, NV",
    "organizer": "Consumer Technology Association",
    "description": "Annual consumer electronics trade show",
    "url": "https://www.ces.tech/",
    "confidence": 0.95
  }
]
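
A small sketch of working with the date fields, assuming they are ISO 8601 strings as the `"format": "date"` annotation suggests. The helper `is_multi_day` is hypothetical and shows one way the `is_multi_day` flag could be cross-checked against the dates.

```python
from datetime import date


def is_multi_day(event):
    """Return True when end_date falls after start_date.

    end_date is optional in the schema, so a missing value is treated
    as a single-day event.
    """
    start = date.fromisoformat(event["start_date"])
    end_raw = event.get("end_date")
    if end_raw is None:
        return False
    return date.fromisoformat(end_raw) > start


ces = {"name": "CES 2025", "start_date": "2025-01-07", "end_date": "2025-01-10"}
# Inclusive day count: Jan 7, 8, 9, 10.
duration_days = (
    date.fromisoformat(ces["end_date"]) - date.fromisoformat(ces["start_date"])
).days + 1
print(is_multi_day(ces), duration_days)  # True 4
```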

4. Quotes Extractor

ID: quotes
Type: ai
Purpose: Extract direct quotes with speaker attribution

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "text": { "type": "string" },
      "speaker": { "type": "string" },
      "speaker_title": { "type": "string" },
      "speaker_org": { "type": "string" },
      "context": { "type": "string" },
      "sentiment": { 
        "type": "string", 
        "enum": ["positive", "negative", "neutral"] 
      },
      "is_direct": { "type": "boolean" }
    },
    "required": ["text", "speaker"]
  }
}
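
The schema requires only `text` and `speaker`, and restricts `sentiment` to three values. A minimal sketch of checking those two rules by hand is shown below; `quote_errors` is a hypothetical helper and the sample quotes are invented for illustration.

```python
REQUIRED = ("text", "speaker")
SENTIMENTS = {"positive", "negative", "neutral"}


def quote_errors(quote):
    """Return a list of problems with one quote item (empty if valid)."""
    errors = [f"missing required field: {key}" for key in REQUIRED if key not in quote]
    sentiment = quote.get("sentiment")
    if sentiment is not None and sentiment not in SENTIMENTS:
        errors.append(f"invalid sentiment: {sentiment}")
    return errors


ok = {"text": "We expect to ship next quarter.", "speaker": "Jane Doe", "is_direct": True}
bad = {"text": "No comment.", "sentiment": "mixed"}
print(quote_errors(ok))   # []
print(quote_errors(bad))  # ['missing required field: speaker', 'invalid sentiment: mixed']
```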

5. Funding Rounds Extractor (NEW)

ID: funding_rounds
Type: ai
Purpose: Extract startup funding announcements with structured data

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "company_name": { "type": "string" },
      "company_url": { "type": "string", "format": "uri" },
      "round_type": { 
        "type": "string", 
        "enum": ["pre_seed", "seed", "series_a", "series_b", "series_c", "series_d", "series_e", "growth", "debt", "bridge", "ipo", "spac", "other"] 
      },
      "amount_usd": { "type": "number" },
      "amount_raw": { "type": "string" },
      "currency": { "type": "string" },
      "valuation_usd": { "type": "number" },
      "valuation_raw": { "type": "string" },
      "lead_investors": { 
        "type": "array", 
        "items": { "type": "string" } 
      },
      "other_investors": { 
        "type": "array", 
        "items": { "type": "string" } 
      },
      "announced_date": { "type": "string", "format": "date" },
      "use_of_funds": { "type": "string" },
      "sector": { "type": "string" },
      "stage": { "type": "string" },
      "confidence": { "type": "number" }
    },
    "required": ["company_name", "round_type"]
  }
}

Example Output:

[
  {
    "company_name": "Acme AI",
    "company_url": "https://acme.ai",
    "round_type": "series_b",
    "amount_usd": 50000000,
    "amount_raw": "$50M",
    "currency": "USD",
    "valuation_usd": 250000000,
    "valuation_raw": "$250M",
    "lead_investors": ["Sequoia Capital"],
    "other_investors": ["a16z", "Y Combinator"],
    "announced_date": "2024-12-15",
    "use_of_funds": "Expand engineering team and launch enterprise product",
    "sector": "Artificial Intelligence",
    "stage": "Growth",
    "confidence": 0.92
  }
]
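
The example pairs a raw string (`amount_raw`) with a normalized number (`amount_usd`). One way that normalization could work for simple USD strings is sketched below; `parse_usd` and its suffix table are assumptions, and currency conversion is out of scope.

```python
import re

MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_usd(amount_raw):
    """Convert strings like "$50M" or "$1.5B" to a whole-dollar integer.

    Returns None for anything that does not look like a USD amount.
    """
    match = re.fullmatch(
        r"\$\s*(\d+(?:\.\d+)?)\s*([KMB])?", amount_raw.strip(), re.IGNORECASE
    )
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = MULTIPLIERS[suffix.upper()] if suffix else 1
    return round(float(number) * multiplier)


print(parse_usd("$50M"))   # 50000000
print(parse_usd("$250M"))  # 250000000
print(parse_usd("€50M"))   # None
```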

LLM Prompt (abridged):

Extract funding round details from this article. Look for:
- Company name and website
- Round type (seed, series A/B/C, etc.)
- Amount raised (convert to USD if possible)
- Valuation if mentioned
- Lead investor(s) and participating investors
- Announced date
- Use of funds / what they'll do with the money

Return a JSON array of funding rounds. If no funding is mentioned, return [].
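
Because the prompt asks for a JSON array and `[]` when no funding is mentioned, the caller can treat anything else as "no results". A defensive parsing sketch follows; `parse_rounds` is a hypothetical helper, and dropping items that lack the schema's required fields is an assumed policy.

```python
import json


def parse_rounds(llm_response):
    """Parse the model's reply into a list of funding-round dicts.

    Falls back to [] when the reply is not a JSON array. Items missing
    the required fields (company_name, round_type) are dropped.
    """
    try:
        data = json.loads(llm_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        item
        for item in data
        if isinstance(item, dict) and "company_name" in item and "round_type" in item
    ]


print(parse_rounds('[{"company_name": "Acme AI", "round_type": "series_b"}]'))
print(parse_rounds("Sorry, no funding was mentioned."))  # []
```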

6. Products Extractor

ID: products
Type: ai
Purpose: Extract product mentions and announcements

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "company": { "type": "string" },
      "category": { "type": "string" },
      "is_new": { "type": "boolean" },
      "price": { "type": "string" },
      "availability_date": { "type": "string", "format": "date" },
      "description": { "type": "string" },
      "sentiment": { "type": "string" },
      "confidence": { "type": "number" }
    },
    "required": ["name", "company"]
  }
}

7. Jobs Extractor

ID: jobs
Type: ai
Purpose: Extract job postings from news articles

Output Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "company": { "type": "string" },
      "location": { "type": "string" },
      "remote": { "type": "boolean" },
      "salary_range": { "type": "string" },
      "job_type": { 
        "type": "string", 
        "enum": ["full_time", "part_time", "contract", "internship"] 
      },
      "seniority": { "type": "string" },
      "department": { "type": "string" },
      "apply_url": { "type": "string", "format": "uri" },
      "confidence": { "type": "number" }
    },
    "required": ["title", "company"]
  }
}

Extractor Configuration

Database Schema

-- Extractor definitions
CREATE TABLE extractors (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
    output_schema TEXT NOT NULL,     -- JSON Schema for validation
    run_on_ingest INTEGER DEFAULT 1, -- Auto-run on new articles?
    priority INTEGER DEFAULT 100,    -- Lower = runs first
    is_active INTEGER DEFAULT 1,
    description TEXT,
    llm_prompt TEXT,                 -- Prompt for AI extractors
    rules_config TEXT,               -- Config for rules extractors
    cost_estimate_micros INTEGER,    -- Estimated cost per article, in microdollars
    created_at TEXT DEFAULT (datetime('now')),
    updated_at TEXT DEFAULT (datetime('now'))
);

-- Default extractors
INSERT INTO extractors (id, name, type, output_schema, priority, description) VALUES
    ('entities', 'Entity Extractor', 'ai', '...schema...', 10, 'Extract named entities'),
    ('locations', 'Location Extractor', 'hybrid', '...schema...', 20, 'Extract and geocode locations'),
    ('events', 'Event Extractor', 'ai', '...schema...', 30, 'Extract structured event data'),
    ('quotes', 'Quote Extractor', 'ai', '...schema...', 40, 'Extract quotes with attribution'),
    ('funding_rounds', 'Funding Round Extractor', 'ai', '...schema...', 50, 'Extract funding announcements'),
    ('products', 'Product Extractor', 'ai', '...schema...', 60, 'Extract product mentions'),
    ('jobs', 'Job Extractor', 'ai', '...schema...', 70, 'Extract job postings');
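
The DDL above uses SQLite-style syntax (`datetime('now')`, integer booleans), so the following sketch uses Python's built-in `sqlite3` module with a trimmed copy of the table. It shows how the `priority` column ("lower = runs first") could determine run order; the exact query StoryIntel uses is not documented here, so treat this one as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extractors (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        type TEXT NOT NULL CHECK (type IN ('builtin', 'ai', 'rules', 'hybrid')),
        output_schema TEXT NOT NULL,
        run_on_ingest INTEGER DEFAULT 1,
        priority INTEGER DEFAULT 100,
        is_active INTEGER DEFAULT 1
    )
""")
conn.executemany(
    "INSERT INTO extractors (id, name, type, output_schema, priority) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("quotes", "Quote Extractor", "ai", "{}", 40),
        ("entities", "Entity Extractor", "ai", "{}", 10),
        ("locations", "Location Extractor", "hybrid", "{}", 20),
    ],
)

# Lower priority runs first; only active, run-on-ingest extractors qualify.
run_order = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM extractors "
        "WHERE is_active = 1 AND run_on_ingest = 1 ORDER BY priority"
    )
]
print(run_order)  # ['entities', 'locations', 'quotes']
```

The `CHECK` constraint also means an insert with an unknown `type` fails with `sqlite3.IntegrityError` rather than silently creating an extractor the pipeline cannot run.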

Customer Extractor Preferences

-- Which extractors are enabled per customer
CREATE TABLE customer_extractors (
    customer_id TEXT NOT NULL REFERENCES customers(id),
    extractor_id TEXT NOT NULL REFERENCES extractors(id),
    is_enabled INTEGER DEFAULT 1,
    config TEXT,  -- Customer-specific overrides
    PRIMARY KEY (customer_id, extractor_id)
);

Project-Specific Extractors

-- Which extractors run for a specific project
CREATE TABLE project_extractors (
    project_id TEXT NOT NULL REFERENCES projects(id),
    extractor_id TEXT NOT NULL REFERENCES extractors(id),
    is_enabled INTEGER DEFAULT 1,
    config TEXT,  -- Project-specific config
    priority_override INTEGER,  -- Override default priority
    PRIMARY KEY (project_id, extractor_id)
);
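
A sketch of how `priority_override` might be resolved for a project, using `COALESCE` to fall back to the extractor's default priority. The tables are trimmed copies of the ones above and `proj_1` is a made-up project ID. How customer-level and project-level settings interact is not specified here, so this covers the project table only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE extractors (id TEXT PRIMARY KEY, priority INTEGER DEFAULT 100);
    CREATE TABLE project_extractors (
        project_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL REFERENCES extractors(id),
        is_enabled INTEGER DEFAULT 1,
        priority_override INTEGER,
        PRIMARY KEY (project_id, extractor_id)
    );
    INSERT INTO extractors VALUES ('entities', 10), ('events', 30), ('funding_rounds', 50);
    INSERT INTO project_extractors VALUES
        ('proj_1', 'entities', 1, NULL),
        ('proj_1', 'events', 0, NULL),
        ('proj_1', 'funding_rounds', 1, 5);
""")

# Disabled rows are skipped; an override, when present, wins over the default.
rows = conn.execute(
    """
    SELECT e.id, COALESCE(pe.priority_override, e.priority) AS effective_priority
    FROM project_extractors AS pe
    JOIN extractors AS e ON e.id = pe.extractor_id
    WHERE pe.project_id = ? AND pe.is_enabled = 1
    ORDER BY effective_priority
    """,
    ("proj_1",),
).fetchall()
print(rows)  # [('funding_rounds', 5), ('entities', 10)]
```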

Storage Schema

Extraction Results

-- Full extraction results per article
CREATE TABLE article_extractions (
    article_id TEXT NOT NULL REFERENCES articles(id),
    extractor_id TEXT NOT NULL REFERENCES extractors(id),
    extracted_data TEXT NOT NULL,  -- JSON matching output_schema
    confidence REAL,               -- Overall confidence 0-1
    cost_micros INTEGER DEFAULT 0, -- Actual cost in microdollars
    tokens_in INTEGER,             -- Input tokens (for AI)
    tokens_out INTEGER,            -- Output tokens (for AI)
    latency_ms INTEGER,            -- Execution time
    extracted_at TEXT DEFAULT (datetime('now')),
    PRIMARY KEY (article_id, extractor_id)
);

CREATE INDEX idx_article_extractions_article ON article_extractions(article_id);
CREATE INDEX idx_article_extractions_extractor ON article_extractions(extractor_id);
CREATE INDEX idx_article_extractions_date ON article_extractions(extracted_at);
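
Because the primary key is `(article_id, extractor_id)`, each article holds at most one result per extractor. The sketch below assumes a re-run should replace the earlier row (the source does not say whether StoryIntel overwrites or skips), and uses a trimmed copy of the table with Python's `sqlite3`.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_extractions (
        article_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL,
        extracted_data TEXT NOT NULL,
        confidence REAL,
        cost_micros INTEGER DEFAULT 0,
        extracted_at TEXT DEFAULT (datetime('now')),
        PRIMARY KEY (article_id, extractor_id)
    )
""")


def save_extraction(article_id, extractor_id, data, confidence, cost_micros):
    """Store one result; a re-run replaces the earlier row for the same pair."""
    conn.execute(
        "INSERT OR REPLACE INTO article_extractions "
        "(article_id, extractor_id, extracted_data, confidence, cost_micros) "
        "VALUES (?, ?, ?, ?, ?)",
        (article_id, extractor_id, json.dumps(data), confidence, cost_micros),
    )


save_extraction("art_123", "funding_rounds", [], 0.50, 500)
save_extraction(
    "art_123",
    "funding_rounds",
    [{"company_name": "Acme AI", "round_type": "series_b"}],
    0.88,
    500,
)

count, confidence = conn.execute(
    "SELECT COUNT(*), MAX(confidence) FROM article_extractions"
).fetchone()
print(count, confidence)  # 1 0.88
```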

Denormalized Extraction Items

For fast querying across all articles, each extracted item is also stored as its own row:

-- Individual extracted items (searchable)
CREATE TABLE extraction_items (
    id TEXT PRIMARY KEY,
    article_id TEXT NOT NULL REFERENCES articles(id),
    extractor_id TEXT NOT NULL REFERENCES extractors(id),
    item_type TEXT NOT NULL,       -- 'person', 'org', 'event', 'funding_round', etc.
    item_value TEXT NOT NULL,      -- 'Elon Musk', 'Tesla', 'CES 2025', etc.
    item_data TEXT,                -- Additional structured data as JSON
    confidence REAL,
    created_at TEXT DEFAULT (datetime('now'))
);

CREATE INDEX idx_extraction_items_type ON extraction_items(item_type);
CREATE INDEX idx_extraction_items_value ON extraction_items(item_value);
CREATE INDEX idx_extraction_items_type_value ON extraction_items(item_type, item_value);
CREATE INDEX idx_extraction_items_article ON extraction_items(article_id);
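
To illustrate the denormalization, the sketch below writes one `extraction_items` row per extracted entity and then looks up every article mentioning a given organization. The `index_entities` helper, the ID format, the second article ID, and the choice to copy the entity's `type` straight into `item_type` are assumptions made for the example.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extraction_items (
        id TEXT PRIMARY KEY,
        article_id TEXT NOT NULL,
        extractor_id TEXT NOT NULL,
        item_type TEXT NOT NULL,
        item_value TEXT NOT NULL,
        item_data TEXT,
        confidence REAL
    )
""")
conn.execute(
    "CREATE INDEX idx_extraction_items_type_value "
    "ON extraction_items(item_type, item_value)"
)


def index_entities(article_id, entities, confidence):
    """Write one extraction_items row per extracted entity."""
    for entity in entities:
        conn.execute(
            "INSERT INTO extraction_items VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                f"item_{uuid.uuid4().hex}",  # hypothetical ID format
                article_id,
                "entities",
                entity["type"],
                entity["name"],
                json.dumps(entity),
                confidence,
            ),
        )


index_entities(
    "art_123",
    [{"name": "Tesla", "type": "organization"}, {"name": "Elon Musk", "type": "person"}],
    0.92,
)
index_entities("art_456", [{"name": "Tesla", "type": "organization"}], 0.90)

# This lookup can be served by the (item_type, item_value) index.
article_ids = [
    row[0]
    for row in conn.execute(
        "SELECT article_id FROM extraction_items "
        "WHERE item_type = ? AND item_value = ? ORDER BY article_id",
        ("organization", "Tesla"),
    )
]
print(article_ids)  # ['art_123', 'art_456']
```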

API Endpoints

List Extractors

GET /v1/extractors

Response:
{
  "extractors": [
    {
      "id": "funding_rounds",
      "name": "Funding Round Extractor",
      "type": "ai",
      "description": "Extract funding announcements",
      "is_enabled": true,
      "run_on_ingest": true,
      "cost_estimate": "$0.0005/article"
    }
  ]
}

Get Extraction Results

GET /v1/articles/:id/extractions

Response:
{
  "article_id": "art_123",
  "extractions": {
    "entities": {
      "data": [...],
      "confidence": 0.92,
      "extracted_at": "2024-12-19T10:00:00Z"
    },
    "funding_rounds": {
      "data": [...],
      "confidence": 0.88,
      "extracted_at": "2024-12-19T10:00:01Z"
    }
  }
}

Search Extraction Items

GET /v1/extractions/search?type=funding_round&company=Acme

Response:
{
  "items": [
    {
      "article_id": "art_123",
      "item_type": "funding_round",
      "item_value": "Acme AI Series B",
      "item_data": {
        "company_name": "Acme AI",
        "round_type": "series_b",
        "amount_usd": 50000000
      },
      "confidence": 0.92
    }
  ]
}
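
A small sketch of building the search URL with properly encoded query parameters. `BASE_URL` is a placeholder host, and only the `type` and `company` parameters appear in the example above; any other filter name would be an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder, not a real StoryIntel host


def search_url(item_type, **filters):
    """Build a GET /v1/extractions/search URL with encoded query parameters."""
    params = {"type": item_type, **filters}
    return f"{BASE_URL}/v1/extractions/search?{urlencode(params)}"


print(search_url("funding_round", company="Acme AI"))
# https://api.example.com/v1/extractions/search?type=funding_round&company=Acme+AI
```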

Run Extractor On-Demand

POST /v1/articles/:id/extract
Body: { "extractor_id": "funding_rounds" }

Response:
{
  "article_id": "art_123",
  "extractor_id": "funding_rounds",
  "data": [...],
  "confidence": 0.88,
  "cost_micros": 500
}
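
The request above could be prepared from Python roughly as follows. The sketch only builds the request object and does not send it. `BASE_URL` is a placeholder, and authentication is omitted because it is not described here.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host


def build_extract_request(article_id, extractor_id):
    """Prepare (but do not send) a POST /v1/articles/:id/extract request."""
    body = json.dumps({"extractor_id": extractor_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/articles/{article_id}/extract",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_extract_request("art_123", "funding_rounds")
print(req.get_method(), req.full_url)
# POST https://api.example.com/v1/articles/art_123/extract
```

Passing `req` to `urllib.request.urlopen` would send it; the response's `cost_micros` field then reports what the run actually cost.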

Cost Tracking

Every extraction operation logs its cost:

INSERT INTO cost_events (
    id,
    operation_type,
    service,
    operation_id,
    article_id,
    customer_id,
    cost_micros,
    metadata
) VALUES (
    'cost_xxx',
    'extraction',
    'workers_ai',
    'funding_rounds',
    'art_123',
    'cust_456',
    500,
    '{"tokens_in": 1500, "tokens_out": 200, "model": "llama-2-7b"}'
);

Cost Estimates by Extractor

Extractor	Type	Est. Cost/Article
entities	AI	$0.0005
locations	Hybrid	$0.0001
events	AI	$0.0004
quotes	AI	$0.0003
funding_rounds	AI	$0.0005
products	AI	$0.0003
jobs	AI	$0.0002
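
These estimates can be combined to budget a project. The sketch below converts the table's dollar figures to microdollars (1 USD = 1,000,000 micros, matching the `cost_micros` columns) so the arithmetic stays in integers; `estimate_usd` is a hypothetical helper.

```python
# Estimated cost per article in microdollars, from the table above.
COST_MICROS = {
    "entities": 500,
    "locations": 100,
    "events": 400,
    "quotes": 300,
    "funding_rounds": 500,
    "products": 300,
    "jobs": 200,
}


def estimate_usd(extractor_ids, article_count):
    """Estimated spend in dollars for running the given extractors."""
    per_article = sum(COST_MICROS[e] for e in extractor_ids)
    return per_article * article_count / 1_000_000


print(estimate_usd(COST_MICROS, 1))                     # 0.0023
print(estimate_usd(COST_MICROS, 10_000))                # 23.0
print(estimate_usd(["entities", "locations"], 10_000))  # 6.0
```

Running all seven built-in extractors therefore comes to an estimated $0.0023 per article, or about $23 per 10,000 articles.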

Creating Custom Extractors

Step 1: Define the Schema

INSERT INTO extractors (
    id, 
    name, 
    type, 
    output_schema, 
    priority, 
    description,
    llm_prompt
) VALUES (
    'earnings_calls',
    'Earnings Call Extractor',
    'ai',
    '{
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "company": { "type": "string" },
                "quarter": { "type": "string" },
                "year": { "type": "integer" },
                "revenue": { "type": "number" },
                "eps": { "type": "number" },
                "guidance": { "type": "string" },
                "call_date": { "type": "string", "format": "date" }
            },
            "required": ["company", "quarter", "year"]
        }
    }',
    55,
    'Extract earnings call details',
    'Extract earnings call information from this article...'
);

Step 2: Enable for Customers/Projects

-- Enable for a project
INSERT INTO project_extractors (project_id, extractor_id, is_enabled)
VALUES ('proj_earnings_watch', 'earnings_calls', 1);

Step 3: The Pipeline Runs It Automatically

When articles match the project's keywords, the extractor runs and stores results.
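
Conceptually, that step might look like the sketch below. Everything in it is an assumption made for illustration: the case-insensitive substring rule in `matches_project`, the shape of the `article` and `project` dictionaries, and the stub extractor. The real matching and dispatch logic is not described here.

```python
def matches_project(article_text, keywords):
    """Assumed rule: case-insensitive substring match on any keyword."""
    text = article_text.lower()
    return any(keyword.lower() in text for keyword in keywords)


def run_pipeline(article, project, extractors, store):
    """Run each enabled extractor on a matching article and store its output."""
    if not matches_project(article["text"], project["keywords"]):
        return []
    ran = []
    for extractor_id in project["enabled_extractors"]:
        result = extractors[extractor_id](article["text"])
        store(article["id"], extractor_id, result)
        ran.append(extractor_id)
    return ran


# Stub extractor and in-memory store standing in for the real components.
saved = {}
extractors = {"earnings_calls": lambda text: []}
project = {"keywords": ["earnings call"], "enabled_extractors": ["earnings_calls"]}
article = {"id": "art_123", "text": "Highlights from the Q3 earnings call."}

ran = run_pipeline(
    article, project, extractors,
    lambda article_id, extractor_id, data: saved.update({(article_id, extractor_id): data}),
)
print(ran)  # ['earnings_calls']
```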

Quick Links

Plugin Overview — Extension point architecture
Source Adapters — Adding new data sources
Output Adapters — Delivery destinations
Database Schema — Full table definitions
