
Operations Runbook

Operational procedures, troubleshooting guides, and incident response for Noozer


Table of Contents

  1. Daily Operations
  2. Health Checks
  3. Common Issues
  4. Pipeline Operations
  5. Cost Management
  6. Incident Response
  7. Disaster Recovery
  8. Maintenance Windows

Daily Operations

# 1. Check overnight pipeline runs
GET /v1/admin/pipeline/runs?since=yesterday

# 2. Review error counts
GET /v1/admin/pipeline/status

# 3. Check cost spend vs budget
GET /v1/admin/costs?period=day

# 4. Review classification queue
GET /v1/admin/review/queue

Key Metrics to Monitor

| Metric             | Healthy  | Warning           | Critical |
|--------------------|----------|-------------------|----------|
| Crawl success rate | > 95%    | 80-95%            | < 80%    |
| Queue depth (any)  | < 100    | 100-500           | > 500    |
| API latency p95    | < 500ms  | 500-2000ms        | > 2000ms |
| Error rate         | < 1%     | 1-5%              | > 5%     |
| Daily cost         | < budget | 80-100% of budget | > budget |
| Review queue       | < 50     | 50-100            | > 100    |
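For dashboards or cron checks, the thresholds above can be encoded directly. A minimal Python sketch (the threshold values mirror the table; the function name and shape are illustrative, not a Noozer API):

```python
# Classify a metric reading against the runbook thresholds.
# higher_is_better=True for rates like crawl success, where a large
# value is good; False for counts/latencies, where a large value is bad.

def metric_status(value, warn, crit, higher_is_better=False):
    """Return 'healthy', 'warning', or 'critical'."""
    if higher_is_better:
        if value > warn:
            return "healthy"
        if value >= crit:
            return "warning"
        return "critical"
    if value < warn:
        return "healthy"
    if value <= crit:
        return "warning"
    return "critical"

# Examples from the table:
# crawl success rate: healthy > 95%, critical < 80%
print(metric_status(96, 95, 80, higher_is_better=True))  # healthy
# queue depth: healthy < 100, critical > 500
print(metric_status(250, 100, 500))                      # warning
```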

Automated Alerts

Alerts fire to Slack and PagerDuty:

| Alert                | Threshold               | Severity | On-Call Action          |
|----------------------|-------------------------|----------|-------------------------|
| High error rate      | > 5% for 5 min          | P1       | Investigate immediately |
| Pipeline stalled     | No articles for 2 hours | P1       | Check crawler + queues  |
| Cost budget exceeded | > 100% daily            | P2       | Review + adjust limits  |
| Queue backlog        | > 1000 messages         | P2       | Scale consumers         |
| API down             | Health check fails      | P1       | Check Workers + D1      |
| Vectorize capacity   | > 90% vectors           | P3       | Plan index expansion    |

Health Checks

Quick Health Check

# API health
curl https://api.noozer.io/v1/health

# Expected response:
{
  "status": "healthy",
  "checks": [
    {"name": "d1", "status": "ok", "latency_ms": 5},
    {"name": "kv", "status": "ok", "latency_ms": 2},
    {"name": "vectorize", "status": "ok", "latency_ms": 15},
    {"name": "r2", "status": "ok", "latency_ms": 8}
  ],
  "version": "1.2.3",
  "timestamp": "2024-01-15T09:00:00Z"
}
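A monitoring script can reduce this payload to a pass/fail result plus the names of any failing checks. A sketch assuming the response shape shown above:

```python
import json

def summarize_health(payload: str):
    """Return (overall_status, [names of non-ok checks])."""
    doc = json.loads(payload)
    failing = [c["name"] for c in doc.get("checks", []) if c["status"] != "ok"]
    return doc["status"], failing

# Sample payload with one degraded check:
sample = '''{
  "status": "healthy",
  "checks": [
    {"name": "d1", "status": "ok", "latency_ms": 5},
    {"name": "kv", "status": "degraded", "latency_ms": 900}
  ]
}'''
status, failing = summarize_health(sample)
print(status, failing)  # healthy ['kv']
```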

Deep Health Check (Admin Only)

# Full system status
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.noozer.io/v1/admin/pipeline/status

# Response includes:
# - Queue depths
# - Recent pipeline runs
# - Error counts
# - Resource utilization

Component-Specific Checks

D1 Database

# Via Wrangler
wrangler d1 execute noozer-production --command "SELECT COUNT(*) FROM articles"

# List tables and their index counts
wrangler d1 execute noozer-production --command "
SELECT name,
(SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name=m.name) as indexes
FROM sqlite_master m
WHERE type='table'
"

Queues

# Check queue depths
wrangler queues info crawl-batch-prod
wrangler queues info article-extract-prod
wrangler queues info article-classify-prod
# ... etc

Vectorize

# Check index status
wrangler vectorize info noozer-articles-prod
wrangler vectorize info noozer-stories-prod

Common Issues

See also: Google News Crawling Playbook for detailed failure modes, guardrails, and circuit breaker patterns.

Issue: Crawl Failures Spiking

Symptoms:

  • crawl.failure_rate > 20%
  • Many 403/429 errors in logs

Diagnosis:

# Check recent crawl errors
wrangler d1 execute noozer-production --command "
SELECT error_type, COUNT(*) as count
FROM pipeline_errors
WHERE stage = 'crawl'
AND created_at > datetime('now', '-1 hour')
GROUP BY error_type
ORDER BY count DESC
"

Resolution:

  1. If 429 (Rate Limited):

    # Check rate limit state
    wrangler kv:key get --binding RATE_LIMITS "source:nytimes.com"

    # Reduce crawl frequency for affected sources
    # Update source fetch_config in D1
  2. If 403 (Blocked):

    # Check if ZenRows is being used
    # May need to rotate ZenRows credentials or use DataForSEO fallback

    # Force DataForSEO for specific source
    wrangler d1 execute noozer-production --command "
    UPDATE sources
    SET fetch_config = json_set(fetch_config, '$.force_d4seo', true)
    WHERE domain = 'blocked-site.com'
    "
  3. If Timeout:

    # Check if specific sources are slow
    # Increase timeout in fetch_config or reduce batch size
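For the 429 case, "reduce crawl frequency" usually means backing off per source. One common approach is exponential backoff with a cap; the base and cap values here are illustrative, not Noozer's actual fetch_config:

```python
# Exponential backoff with a cap: double the delay for each consecutive
# 429 from a source, but never wait longer than cap_seconds.

def next_crawl_delay(consecutive_429s, base_seconds=60, cap_seconds=3600):
    """Return the delay (seconds) before the next crawl attempt."""
    delay = base_seconds * (2 ** consecutive_429s)
    return min(delay, cap_seconds)

print(next_crawl_delay(0))   # 60
print(next_crawl_delay(3))   # 480
print(next_crawl_delay(10))  # 3600 (capped)
```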

Issue: Queue Backlog Growing

Symptoms:

  • Queue depth > 500 and growing
  • Processing latency increasing

Diagnosis:

# Check which queue is backed up
wrangler queues info crawl-batch-prod
wrangler queues info article-extract-prod
# ... check each queue

# Check consumer status
wrangler tail noozer-extractor --format json | grep "error"

Resolution:

  1. Consumer Errors:

    # Check consumer logs
    wrangler tail noozer-classifier

    # If consumer is crashing, redeploy
    wrangler deploy --config src/workers/classifier/wrangler.toml
  2. Need More Throughput:

    # Increase consumer concurrency (in wrangler.toml)
    # max_batch_size = 50 # default 10
    # max_concurrency = 10 # default 1

    # Redeploy consumer
    wrangler deploy --config src/workers/classifier/wrangler.toml
  3. Upstream Dependency Slow:

    # If LLM/embedding service is slow, check external service status
    # Consider enabling circuit breaker
    wrangler kv:key put --binding FEATURE_FLAGS "circuit_breaker_openai" "true"
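The circuit breaker the flag enables follows the standard pattern: stop calling the upstream after repeated failures, then allow a probe request once a cooldown has passed. A minimal sketch (thresholds and class shape are illustrative; the real flag lives in the FEATURE_FLAGS KV namespace):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, failure_threshold=5, reset_after_s=60.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None => circuit closed

    def allow(self, now=None):
        """False while open; True when closed or cooled down (half-open)."""
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        return now - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic() if now is None else now

breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30.0)
breaker.record_failure(now=0.0)
breaker.record_failure(now=1.0)  # threshold hit: circuit opens
print(breaker.allow(now=10.0))   # False (still open)
print(breaker.allow(now=40.0))   # True  (half-open, allow a probe)
```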

Issue: High API Latency

Symptoms:

  • API p95 latency > 2000ms
  • Customer complaints

Diagnosis:

# Check recent request distribution
wrangler tail noozer-api --format json | jq '.latency_ms' | sort -n | tail -20

# Check D1 query performance
# (requires custom logging in code)

Resolution:

  1. D1 Slow Queries:

    # Check for missing indexes
    wrangler d1 execute noozer-production --command "
    EXPLAIN QUERY PLAN
    SELECT * FROM articles WHERE source_id = 'xxx' ORDER BY published_at DESC LIMIT 20
    "

    # Add missing index if needed
    wrangler d1 execute noozer-production --command "
    CREATE INDEX IF NOT EXISTS idx_articles_source_published
    ON articles(source_id, published_at DESC)
    "
  2. Vectorize Slow:

    # Check index size
    wrangler vectorize info noozer-articles-prod

    # If near capacity, consider:
    # - Pruning old vectors
    # - Creating time-partitioned indexes
  3. Cache Miss Rate High:

    # Check KV hit rates (requires custom logging)
    # Pre-warm hot cache if needed

    # Increase TTL for stable data
    # Decrease TTL for frequently changing data
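The `jq | sort -n | tail` pipeline in the diagnosis step only eyeballs the tail; to compare against the 2000ms p95 threshold directly, compute the percentile from the sampled latencies. A sketch using the nearest-rank method:

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(len * pct / 100) without importing math
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]

# Latencies (ms) sampled from wrangler tail:
latencies_ms = [120, 95, 400, 2500, 310, 180, 90, 220, 3100, 150]
p95 = percentile(latencies_ms, 95)
print(p95, "BREACH" if p95 > 2000 else "ok")  # 3100 BREACH
```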

Issue: Classification Accuracy Dropping

Symptoms:

  • Customer feedback indicating wrong classifications
  • Review queue growing

Diagnosis:

# Check classification confidence distribution
wrangler d1 execute noozer-production --command "
SELECT
CASE
WHEN confidence >= 0.9 THEN '0.9+'
WHEN confidence >= 0.7 THEN '0.7-0.9'
WHEN confidence >= 0.5 THEN '0.5-0.7'
ELSE '<0.5'
END as confidence_bucket,
COUNT(*) as count
FROM article_classifications
WHERE created_at > datetime('now', '-1 day')
GROUP BY confidence_bucket
"

# Check recent feedback
wrangler d1 execute noozer-production --command "
SELECT feedback, COUNT(*) as count
FROM customer_article_scores
WHERE feedback IS NOT NULL
AND created_at > datetime('now', '-7 days')
GROUP BY feedback
"

Resolution:

  1. Taxonomy Drift:

    # Retrain taxonomy embeddings
    POST /v1/admin/taxonomy/retrain

    # This regenerates exemplar embeddings from recent feedback
  2. New Topics Emerging:

    # Add new taxonomy labels
    POST /v1/admin/taxonomy/labels
    {
      "category": "topic",
      "label": "New Topic",
      "description": "...",
      "keyword_patterns": ["pattern1", "pattern2"]
    }
  3. LLM Model Changed:

    # If OpenAI updated models, may need to recalibrate
    # Check LLM response format hasn't changed
    # Consider pinning to specific model version

Issue: Cost Budget Exceeded

Symptoms:

  • Budget alert fired
  • Hard limit blocking operations (if enabled)

Diagnosis:

# Check what's consuming costs
GET /v1/admin/costs?period=day&group_by=operation

# Check for anomalies
wrangler d1 execute noozer-production --command "
SELECT service, operation,
SUM(cost_micros)/1000000.0 as cost_usd,
COUNT(*) as operations
FROM cost_events
WHERE timestamp > datetime('now', '-1 day')
GROUP BY service, operation
ORDER BY cost_usd DESC
"

Resolution:

  1. LLM Overuse:

    # Check why LLM is being called excessively
    # May need to tune classification pipeline to use rules/vector first

    # Temporarily increase vector-match threshold
    wrangler kv:key put --binding FEATURE_FLAGS "llm_threshold" "0.5"
  2. Unusual Crawl Volume:

    # Check if a customer added many keywords
    wrangler d1 execute noozer-production --command "
    SELECT k.customer_id, COUNT(*) as keywords, SUM(k.article_count) as articles
    FROM keyword_sets k
    WHERE k.is_active = 1
    GROUP BY k.customer_id
    ORDER BY articles DESC
    "

    # May need to throttle specific customer
  3. Increase Budget:

    PUT /v1/admin/budgets/{id}
    { "budget_usd": 150 }

Pipeline Operations

Manual Crawl Trigger

# Trigger immediate crawl for all active keyword sets
POST /v1/admin/crawl/trigger
{ "priority": "high" }

# Trigger for specific keyword sets
POST /v1/admin/crawl/trigger
{
  "keyword_set_ids": ["uuid1", "uuid2"],
  "priority": "high"
}

Reprocess Article

# Reprocess single article through full pipeline
POST /v1/admin/reprocess/{articleId}
{
  "stages": ["extract", "enrich", "classify", "cluster"]
}

Force Story Recluster

# Trigger story reclustering for recent articles
POST /v1/admin/recluster
{
  "window_hours": 24,
  "min_confidence": 0.5
}

Clear Queue

If a queue is poisoned with bad messages:

# Pause message delivery to the consumer
wrangler queues pause-delivery article-classify-prod

# Clear queue (careful - this deletes messages!)
# No direct Wrangler command; use dashboard or:
# 1. Create new queue
# 2. Update producer bindings
# 3. Delete old queue

# Resume message delivery
wrangler queues resume-delivery article-classify-prod

Backfill Operations

# Backfill social metrics for recent articles
# (--json gives machine-readable output; jq extracts the bare IDs)
wrangler d1 execute noozer-production --json --command "
SELECT id FROM articles
WHERE processing_status = 'complete'
AND id NOT IN (SELECT article_id FROM article_social_metrics)
AND published_at > datetime('now', '-7 days')
" | jq -r '.[0].results[].id' \
  | xargs -I {} curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
      https://api.noozer.io/v1/admin/enrich/{}/social

# Backfill embeddings
POST /v1/admin/backfill/embeddings
{
  "since": "2024-01-01",
  "batch_size": 100
}
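When driving a backfill client-side instead, chunking the ID list keeps each request body bounded to `batch_size`. A generic helper (not a Noozer API):

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 hypothetical article IDs split into batches of 100:
article_ids = [f"a{i}" for i in range(250)]
batches = list(chunked(article_ids, 100))
print([len(b) for b in batches])  # [100, 100, 50]
```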

Cost Management

View Current Spend

# Today's spend
GET /v1/admin/costs?period=day

# Week to date
GET /v1/admin/costs?period=week

# By service
GET /v1/admin/costs?period=day&group_by=service

Set Budget Alerts

# Create daily budget with 80% alert
POST /v1/admin/budgets
{
  "scope": "global",
  "period": "daily",
  "budget_usd": 100,
  "alert_threshold_pct": 80,
  "hard_limit": false
}

# Create hard limit for specific service
POST /v1/admin/budgets
{
  "scope": "service",
  "scope_id": "openai",
  "period": "daily",
  "budget_usd": 50,
  "hard_limit": true
}
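The semantics of `alert_threshold_pct` and `hard_limit` above are: warn at the threshold, block only when the hard limit is set. A sketch of that evaluation (function name and return values are illustrative):

```python
def budget_state(spend_usd, budget_usd, alert_threshold_pct=80, hard_limit=False):
    """Return 'ok', 'alert', 'exceeded', or 'blocked' for a budget."""
    pct = 100.0 * spend_usd / budget_usd
    if pct > 100:
        # Over budget: hard_limit decides whether operations stop
        return "blocked" if hard_limit else "exceeded"
    if pct >= alert_threshold_pct:
        return "alert"
    return "ok"

print(budget_state(40, 100))                  # ok
print(budget_state(85, 100))                  # alert
print(budget_state(120, 100))                 # exceeded
print(budget_state(60, 50, hard_limit=True))  # blocked
```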

Cost Optimization

  1. Reduce LLM Calls:

    • Tune rule-based classification to catch more cases
    • Increase vector similarity threshold before LLM fallback
    • Cache LLM responses for similar inputs
  2. Optimize Crawling:

    • Deduplicate URLs aggressively
    • Use conditional fetching (If-Modified-Since)
    • Reduce crawl frequency for low-value sources
  3. Batch Operations:

    • Batch embedding requests
    • Batch social metric fetches
    • Use queue batching effectively

Incident Response

Severity Levels

| Level | Description | Response Time | Examples                                   |
|-------|-------------|---------------|--------------------------------------------|
| P1    | System down | 15 min        | API unresponsive, no articles ingested     |
| P2    | Degraded    | 1 hour        | High latency, partial failures             |
| P3    | Minor       | 4 hours       | Single source failing, review queue backup |
| P4    | Low         | 24 hours      | Cost anomaly, minor bugs                   |
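An alerting hook can turn these response-time targets into concrete acknowledgement deadlines. A sketch (the mapping copies the table; the function is illustrative):

```python
from datetime import datetime, timedelta, timezone

# Response-time targets from the severity table above
RESPONSE_TIME = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}

def ack_deadline(severity, detected_at):
    """When the incident must be acknowledged by."""
    return detected_at + RESPONSE_TIME[severity]

detected = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
print(ack_deadline("P1", detected).isoformat())  # 2024-01-15T09:15:00+00:00
```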

Incident Workflow

  1. Acknowledge - Claim the incident
  2. Assess - Determine severity and impact
  3. Communicate - Update status page, notify stakeholders
  4. Mitigate - Stop the bleeding
  5. Resolve - Fix the root cause
  6. Review - Post-incident review within 48 hours

Communication Templates

Status Page Update:

[Investigating] We are investigating issues with [component].
[Identified] The issue has been identified as [brief description].
[Monitoring] A fix has been deployed. We are monitoring.
[Resolved] The incident has been resolved. [Brief summary].

Customer Notification:

Subject: [Noozer] Service Incident - [Brief Title]

We experienced an issue affecting [description of impact].

Timeline:
- [Time] Issue detected
- [Time] Issue resolved

Impact:
- [What customers experienced]

Root Cause:
- [Brief explanation]

Prevention:
- [Steps to prevent recurrence]

Emergency Contacts

| Role              | Contact                | Escalation          |
|-------------------|------------------------|---------------------|
| Primary On-Call   | [Phone]                | PagerDuty           |
| Secondary On-Call | [Phone]                | PagerDuty           |
| Engineering Lead  | [Email/Phone]          | After 30 min        |
| Cloudflare Support| support@cloudflare.com | For platform issues |

Disaster Recovery

Data Backup

D1 Database

# Export full database
wrangler d1 export noozer-production --output backup-$(date +%Y%m%d).sql

# Scheduled backup (via cron trigger)
# Backup worker runs daily at 3 AM, uploads to R2

R2 Objects

# R2 has built-in redundancy
# Wrangler has no recursive object copy; for additional safety, replicate
# to a secondary bucket with S3-compatible tooling (rclone shown here,
# assuming remotes are already configured against the R2 S3 API)
rclone copy r2:raw-snapshots/2024/01/15/ r2backup:raw-snapshots-backup/2024/01/15/

Recovery Procedures

D1 Recovery

# Create new database
wrangler d1 create noozer-recovery

# Import from backup
wrangler d1 execute noozer-recovery --file backup-20240115.sql

# Update wrangler.toml with new database ID
# Deploy workers
wrangler deploy

Vectorize Recovery

# Vectorize doesn't support export/import
# Must rebuild from source data

# 1. Create new index
wrangler vectorize create noozer-articles-recovery --dimensions 1536 --metric cosine

# 2. Run embedding backfill job
POST /v1/admin/backfill/embeddings
{ "full_rebuild": true }

Failover Checklist

  • Verify backup integrity
  • Create recovery resources
  • Update DNS/routes if needed
  • Deploy workers to new resources
  • Verify data consistency
  • Run smoke tests
  • Update monitoring
  • Notify customers

Maintenance Windows

Scheduled Maintenance

Maintenance windows: Sundays 2-4 AM UTC

Pre-maintenance:

  1. Announce 48 hours in advance
  2. Update status page
  3. Notify enterprise customers directly

During maintenance:

  1. Set feature flag to show maintenance message
  2. Pause queue consumers
  3. Perform maintenance
  4. Run verification tests
  5. Resume consumers
  6. Remove maintenance message

Post-maintenance:

  1. Monitor for 30 minutes
  2. Update status page
  3. Send completion notification

Zero-Downtime Deployments

For routine deployments (no maintenance window needed):

# Workers support zero-downtime deployments by default
wrangler deploy

# For database migrations that are backwards compatible:
# 1. Deploy migration
wrangler d1 migrations apply noozer-production

# 2. Deploy code that uses new schema
wrangler deploy

# For breaking changes:
# 1. Deploy code that supports both old and new
# 2. Run migration
# 3. Deploy code that only uses new
# 4. Clean up old code paths

Quick Reference

Essential Commands

# Logs
wrangler tail # Production
wrangler tail --config wrangler.staging.toml # Staging

# Database
wrangler d1 execute noozer-production --command "SELECT ..."
wrangler d1 migrations apply noozer-production

# Queues
wrangler queues info <queue-name>
wrangler queues pause-delivery <queue-name>
wrangler queues resume-delivery <queue-name>

# Secrets
wrangler secret put <NAME>
wrangler secret list

# Rollback
wrangler rollback

API Quick Reference

# Health
GET /v1/health
GET /v1/admin/pipeline/status

# Trigger operations
POST /v1/admin/crawl/trigger
POST /v1/admin/recluster
POST /v1/admin/reprocess/{id}

# Costs
GET /v1/admin/costs
GET /v1/admin/budgets

# Storage
GET /v1/admin/storage/stats
POST /v1/admin/cache/invalidate

Last updated: 2024-01-15
On-call rotation: [Link to rotation schedule]
Incident channel: #noozer-incidents