Operations Runbook
Operational procedures, troubleshooting guides, and incident response for Noozer
Table of Contents
- Daily Operations
- Health Checks
- Common Issues
- Pipeline Operations
- Cost Management
- Incident Response
- Disaster Recovery
- Maintenance Windows
Daily Operations
Morning Health Check (Recommended: 9 AM)
# 1. Check overnight pipeline runs
GET /v1/admin/pipeline/runs?since=yesterday
# 2. Review error counts
GET /v1/admin/pipeline/status
# 3. Check cost spend vs budget
GET /v1/admin/costs?period=day
# 4. Review classification queue
GET /v1/admin/review/queue
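These are admin endpoints, so each call needs an admin bearer token. A minimal sketch that runs all four checks in one pass, assuming $ADMIN_TOKEN is exported and jq is installed:
for path in "pipeline/runs?since=yesterday" "pipeline/status" "costs?period=day" "review/queue"; do
  echo "== $path =="
  curl -s -H "Authorization: Bearer $ADMIN_TOKEN" "https://api.noozer.io/v1/admin/$path" | jq .
done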
Key Metrics to Monitor
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Crawl success rate | > 95% | 80-95% | < 80% |
| Queue depth (any) | < 100 | 100-500 | > 500 |
| API latency p95 | < 500ms | 500-2000ms | > 2000ms |
| Error rate | < 1% | 1-5% | > 5% |
| Daily cost | < budget | 80-100% budget | > budget |
| Review queue | < 50 | 50-100 | > 100 |
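The daily-cost row can be spot-checked from the shell. A sketch, assuming the costs endpoint returns a total_usd field and a $100 daily budget (both the field name and the budget value are illustrative):
BUDGET=100
SPEND=$(curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://api.noozer.io/v1/admin/costs?period=day" | jq -r '.total_usd')
# Map spend onto the healthy / warning / critical bands from the table above
awk -v s="$SPEND" -v b="$BUDGET" 'BEGIN {
  if (s > b)           print "CRITICAL: over budget"
  else if (s >= 0.8*b) print "WARNING: 80-100% of budget"
  else                 print "OK"
}'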
Automated Alerts
Alerts are configured to fire to Slack/PagerDuty:
| Alert | Threshold | Severity | On-Call Action |
|---|---|---|---|
| High error rate | > 5% for 5 min | P1 | Investigate immediately |
| Pipeline stalled | No articles for 2 hours | P1 | Check crawler + queues |
| Cost budget exceeded | > 100% daily | P2 | Review + adjust limits |
| Queue backlog | > 1000 messages | P2 | Scale consumers |
| API down | Health check fails | P1 | Check Workers + D1 |
| Vectorize capacity | > 90% vectors | P3 | Plan index expansion |
Health Checks
Quick Health Check
# API health
curl https://api.noozer.io/v1/health
# Expected response:
{
"status": "healthy",
"checks": [
{"name": "d1", "status": "ok", "latency_ms": 5},
{"name": "kv", "status": "ok", "latency_ms": 2},
{"name": "vectorize", "status": "ok", "latency_ms": 15},
{"name": "r2", "status": "ok", "latency_ms": 8}
],
"version": "1.2.3",
"timestamp": "2024-01-15T09:00:00Z"
}
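For scripting, jq -e maps the health status onto the exit code, which is handy in cron jobs or smoke tests:
curl -s https://api.noozer.io/v1/health | jq -e '.status == "healthy"' > /dev/null \
  && echo "healthy" || echo "UNHEALTHY"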
Deep Health Check (Admin Only)
# Full system status
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.noozer.io/v1/admin/pipeline/status
# Response includes:
# - Queue depths
# - Recent pipeline runs
# - Error counts
# - Resource utilization
Component-Specific Checks
D1 Database
# Via Wrangler
wrangler d1 execute noozer-production --command "SELECT COUNT(*) FROM articles"
# List tables and their index counts
wrangler d1 execute noozer-production --command "
SELECT name,
(SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name=m.name) as indexes
FROM sqlite_master m
WHERE type='table'
"
Queues
# Check queue depths
wrangler queues info crawl-batch-prod
wrangler queues info article-extract-prod
wrangler queues info article-classify-prod
# ... etc
Vectorize
# Check index status
wrangler vectorize info noozer-articles-prod
wrangler vectorize info noozer-stories-prod
Common Issues
See also: Google News Crawling Playbook for detailed failure modes, guardrails, and circuit breaker patterns.
Issue: Crawl Failures Spiking
Symptoms:
- crawl.failure_rate > 20%
- Many 403/429 errors in logs
Diagnosis:
# Check recent crawl errors
wrangler d1 execute noozer-production --command "
SELECT error_type, COUNT(*) as count
FROM pipeline_errors
WHERE stage = 'crawl'
AND created_at > datetime('now', '-1 hour')
GROUP BY error_type
ORDER BY count DESC
"
Resolution:
- If 429 (Rate Limited):
# Check rate limit state
wrangler kv:key get --binding RATE_LIMITS "source:nytimes.com"
# Reduce crawl frequency for affected sources
# Update source fetch_config in D1
- If 403 (Blocked):
# Check if ZenRows is being used
# May need to rotate ZenRows credentials or use DataForSEO fallback
# Force DataForSEO for specific source
wrangler d1 execute noozer-production --command "
UPDATE sources
SET fetch_config = json_set(fetch_config, '$.force_d4seo', true)
WHERE domain = 'blocked-site.com'
"
- If Timeout:
# Check if specific sources are slow
# Increase timeout in fetch_config or reduce batch size
Issue: Queue Backlog Growing
Symptoms:
- Queue depth > 500 and growing
- Processing latency increasing
Diagnosis:
# Check which queue is backed up
wrangler queues info crawl-batch-prod
wrangler queues info article-extract-prod
# ... check each queue
# Check consumer status
wrangler tail noozer-extractor --format json | grep "error"
Resolution:
- Consumer Errors:
# Check consumer logs
wrangler tail noozer-classifier
# If consumer is crashing, redeploy
wrangler deploy --config src/workers/classifier/wrangler.toml
- Need More Throughput (see the wrangler.toml sketch after this list):
# Increase consumer concurrency (in wrangler.toml)
# max_batch_size = 50 # default 10
# max_concurrency = 10 # default 1
# Redeploy consumer
wrangler deploy --config src/workers/classifier/wrangler.toml
- Upstream Dependency Slow:
# If the LLM/embedding service is slow, check the external service's status
# Consider enabling the circuit breaker
wrangler kv:key put --binding FEATURE_FLAGS "circuit_breaker_openai" "true"
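The throughput settings above live in the consumer's wrangler.toml. A sketch of the relevant block, using the queue name and values from this runbook:
[[queues.consumers]]
queue = "article-classify-prod"
max_batch_size = 50   # default 10
max_concurrency = 10  # default 1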
Issue: High API Latency
Symptoms:
- API p95 latency > 2000ms
- Customer complaints
Diagnosis:
# Check recent request distribution
wrangler tail noozer-api --format json | jq '.latency_ms' | sort -n | tail -20
# Check D1 query performance
# (requires custom logging in code)
Resolution:
- D1 Slow Queries:
# Check for missing indexes
wrangler d1 execute noozer-production --command "
EXPLAIN QUERY PLAN
SELECT * FROM articles WHERE source_id = 'xxx' ORDER BY published_at DESC LIMIT 20
"
# Add missing index if needed
wrangler d1 execute noozer-production --command "
CREATE INDEX IF NOT EXISTS idx_articles_source_published
ON articles(source_id, published_at DESC)
"
- Vectorize Slow:
# Check index size
wrangler vectorize info noozer-articles-prod
# If near capacity, consider:
# - Pruning old vectors
# - Creating time-partitioned indexes
- Cache Miss Rate High (see the KV sketch after this list):
# Check KV hit rates (requires custom logging)
# Pre-warm hot cache if needed
# Increase TTL for stable data
# Decrease TTL for frequently changing data
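A cache entry can be pre-warmed from the shell with an explicit TTL. A sketch in which the CACHE binding, the key, and the endpoint are all illustrative:
# Pre-warm a hot key with a 1-hour TTL (binding, key, and URL are illustrative)
wrangler kv:key put --binding CACHE "feed:top-stories" \
  "$(curl -s https://api.noozer.io/v1/stories/top)" --ttl 3600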
Issue: Classification Accuracy Dropping
Symptoms:
- Customer feedback indicating wrong classifications
- Review queue growing
Diagnosis:
# Check classification confidence distribution
wrangler d1 execute noozer-production --command "
SELECT
CASE
WHEN confidence >= 0.9 THEN '0.9+'
WHEN confidence >= 0.7 THEN '0.7-0.9'
WHEN confidence >= 0.5 THEN '0.5-0.7'
ELSE '<0.5'
END as confidence_bucket,
COUNT(*) as count
FROM article_classifications
WHERE created_at > datetime('now', '-1 day')
GROUP BY confidence_bucket
"
# Check recent feedback
wrangler d1 execute noozer-production --command "
SELECT feedback, COUNT(*) as count
FROM customer_article_scores
WHERE feedback IS NOT NULL
AND created_at > datetime('now', '-7 days')
GROUP BY feedback
"
Resolution:
- Taxonomy Drift:
# Retrain taxonomy embeddings
POST /v1/admin/taxonomy/retrain
# This regenerates exemplar embeddings from recent feedback
- New Topics Emerging:
# Add new taxonomy labels
POST /v1/admin/taxonomy/labels
{
"category": "topic",
"label": "New Topic",
"description": "...",
"keyword_patterns": ["pattern1", "pattern2"]
}
- LLM Model Changed:
# If OpenAI updated models, may need to recalibrate
# Check that the LLM response format hasn't changed
# Consider pinning to a specific model version
Issue: Cost Budget Exceeded
Symptoms:
- Budget alert fired
- Hard limit blocking operations (if enabled)
Diagnosis:
# Check what's consuming costs
GET /v1/admin/costs?period=day&group_by=operation
# Check for anomalies
wrangler d1 execute noozer-production --command "
SELECT service, operation,
SUM(cost_micros)/1000000.0 as cost_usd,
COUNT(*) as operations
FROM cost_events
WHERE timestamp > datetime('now', '-1 day')
GROUP BY service, operation
ORDER BY cost_usd DESC
"
Resolution:
-
LLM Overuse:
# Check why LLM is being called excessively
# May need to tune classification pipeline to use rules/vector first
# Temporarily increase vector-match threshold
wrangler kv:key put --binding FEATURE_FLAGS "llm_threshold" "0.5" -
Unusual Crawl Volume:
# Check if a customer added many keywords
wrangler d1 execute noozer-production --command "
SELECT k.customer_id, COUNT(*) as keywords, SUM(k.article_count) as articles
FROM keyword_sets k
WHERE k.is_active = 1
GROUP BY k.customer_id
ORDER BY articles DESC
"
# May need to throttle specific customer -
Increase Budget:
PUT /v1/admin/budgets/{id}
{ "budget_usd": 150 }
Pipeline Operations
Manual Crawl Trigger
# Trigger immediate crawl for all active keyword sets
POST /v1/admin/crawl/trigger
{ "priority": "high" }
# Trigger for specific keyword sets
POST /v1/admin/crawl/trigger
{
"keyword_set_ids": ["uuid1", "uuid2"],
"priority": "high"
}
Reprocess Article
# Reprocess single article through full pipeline
POST /v1/admin/reprocess/{articleId}
{
"stages": ["extract", "enrich", "classify", "cluster"]
}
Force Story Recluster
# Trigger story reclustering for recent articles
POST /v1/admin/recluster
{
"window_hours": 24,
"min_confidence": 0.5
}
Clear Queue
If a queue is poisoned with bad messages:
# Pause consumer
wrangler queues consumer pause article-classify-prod
# Clear queue (careful - this deletes messages!)
# No direct Wrangler command; use dashboard or:
# 1. Create new queue
# 2. Update producer bindings
# 3. Delete old queue
# Resume consumer
wrangler queues consumer resume article-classify-prod
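A sketch of the new-queue route using standard Wrangler commands (the -v2 name is illustrative):
# 1. Create a replacement queue
wrangler queues create article-classify-prod-v2
# 2. Point the producer binding at it in wrangler.toml, then redeploy
wrangler deploy --config src/workers/classifier/wrangler.toml
# 3. Once nothing references the old queue, delete it
wrangler queues delete article-classify-prod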
Backfill Operations
# Backfill social metrics for recent articles
# (pipe article IDs from D1 into the admin enrich endpoint)
wrangler d1 execute noozer-production --json --command "
SELECT id FROM articles
WHERE processing_status = 'complete'
AND id NOT IN (SELECT article_id FROM article_social_metrics)
AND published_at > datetime('now', '-7 days')
" | jq -r '.[0].results[].id' \
  | xargs -I {} curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
      https://api.noozer.io/v1/admin/enrich/{}/social
# Backfill embeddings
POST /v1/admin/backfill/embeddings
{
"since": "2024-01-01",
"batch_size": 100
}
Cost Management
View Current Spend
# Today's spend
GET /v1/admin/costs?period=day
# Week to date
GET /v1/admin/costs?period=week
# By service
GET /v1/admin/costs?period=day&group_by=service
Set Budget Alerts
# Create daily budget with 80% alert
POST /v1/admin/budgets
{
"scope": "global",
"period": "daily",
"budget_usd": 100,
"alert_threshold_pct": 80,
"hard_limit": false
}
# Create hard limit for specific service
POST /v1/admin/budgets
{
"scope": "service",
"scope_id": "openai",
"period": "daily",
"budget_usd": 50,
"hard_limit": true
}
Cost Optimization
- Reduce LLM Calls:
  - Tune rule-based classification to catch more cases
  - Increase vector similarity threshold before LLM fallback
  - Cache LLM responses for similar inputs
- Optimize Crawling:
  - Deduplicate URLs aggressively
  - Use conditional fetching (If-Modified-Since; see the curl sketch after this list)
  - Reduce crawl frequency for low-value sources
- Batch Operations:
  - Batch embedding requests
  - Batch social metric fetches
  - Use queue batching effectively
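Conditional fetching in practice: send If-Modified-Since (or If-None-Match with a stored ETag) and skip extraction when the server answers 304. A curl sketch with an illustrative URL:
# A 304 response means the page is unchanged: no body transferred, no processing cost
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "If-Modified-Since: Mon, 15 Jan 2024 09:00:00 GMT" \
  https://example.com/news/article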
Incident Response
Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | System down | 15 min | API unresponsive, no articles ingested |
| P2 | Degraded | 1 hour | High latency, partial failures |
| P3 | Minor | 4 hours | Single source failing, review queue backup |
| P4 | Low | 24 hours | Cost anomaly, minor bugs |
Incident Workflow
- Acknowledge - Claim the incident
- Assess - Determine severity and impact
- Communicate - Update status page, notify stakeholders
- Mitigate - Stop the bleeding
- Resolve - Fix the root cause
- Review - Post-incident review within 48 hours
Communication Templates
Status Page Update:
[Investigating] We are investigating issues with [component].
[Identified] The issue has been identified as [brief description].
[Monitoring] A fix has been deployed. We are monitoring.
[Resolved] The incident has been resolved. [Brief summary].
Customer Notification:
Subject: [Noozer] Service Incident - [Brief Title]
We experienced an issue affecting [description of impact].
Timeline:
- [Time] Issue detected
- [Time] Issue resolved
Impact:
- [What customers experienced]
Root Cause:
- [Brief explanation]
Prevention:
- [Steps to prevent recurrence]
Emergency Contacts
| Role | Contact | Escalation |
|---|---|---|
| Primary On-Call | [Phone] | PagerDuty |
| Secondary On-Call | [Phone] | PagerDuty |
| Engineering Lead | [Email/Phone] | After 30 min |
| Cloudflare Support | support@cloudflare.com | For platform issues |
Disaster Recovery
Data Backup
D1 Database
# Export full database
wrangler d1 export noozer-production --output backup-$(date +%Y%m%d).sql
# Scheduled backup (via cron trigger)
# Backup worker runs daily at 3 AM, uploads to R2
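For reference, the daily 3 AM schedule corresponds to a cron trigger in the backup worker's wrangler.toml:
[triggers]
crons = ["0 3 * * *"]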
R2 Objects
# R2 has built-in redundancy
# For additional safety, replicate to a secondary bucket.
# Wrangler has no recursive object copy; one option is rclone via R2's
# S3-compatible API (assumes an "r2" remote is already configured):
rclone sync r2:raw-snapshots/2024/01/15/ r2:raw-snapshots-backup/2024/01/15/
Recovery Procedures
D1 Recovery
# Create new database
wrangler d1 create noozer-recovery
# Import from backup
wrangler d1 execute noozer-recovery --file backup-20240115.sql
# Update wrangler.toml with new database ID
# Deploy workers
wrangler deploy
Vectorize Recovery
# Vectorize doesn't support export/import
# Must rebuild from source data
# 1. Create new index
wrangler vectorize create noozer-articles-recovery --dimensions 1536 --metric cosine
# 2. Run embedding backfill job
POST /v1/admin/backfill/embeddings
{ "full_rebuild": true }
Failover Checklist
- Verify backup integrity
- Create recovery resources
- Update DNS/routes if needed
- Deploy workers to new resources
- Verify data consistency
- Run smoke tests
- Update monitoring
- Notify customers
Maintenance Windows
Scheduled Maintenance
Maintenance windows: Sundays 2-4 AM UTC
Pre-maintenance:
- Announce 48 hours in advance
- Update status page
- Notify enterprise customers directly
During maintenance:
- Set feature flag to show maintenance message (see the sketch after this list)
- Pause queue consumers
- Perform maintenance
- Run verification tests
- Resume consumers
- Remove maintenance message
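Toggling the maintenance message uses the same FEATURE_FLAGS binding as elsewhere in this runbook; the flag name itself is illustrative:
# Flag name is illustrative; match whatever key the API worker reads
wrangler kv:key put --binding FEATURE_FLAGS "maintenance_mode" "true"
# Remove it once verification passes
wrangler kv:key delete --binding FEATURE_FLAGS "maintenance_mode"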
Post-maintenance:
- Monitor for 30 minutes
- Update status page
- Send completion notification
Zero-Downtime Deployments
For routine deployments (no maintenance window needed):
# Workers support zero-downtime deployments by default
wrangler deploy
# For database migrations that are backwards compatible:
# 1. Deploy migration
wrangler d1 migrations apply noozer-production
# 2. Deploy code that uses new schema
wrangler deploy
# For breaking changes:
# 1. Deploy code that supports both old and new
# 2. Run migration
# 3. Deploy code that only uses new
# 4. Clean up old code paths
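A concrete sketch of the backwards-compatible flow using D1 migrations (the migration name and column are illustrative):
# 1. Create a migration file and keep the change additive
wrangler d1 migrations create noozer-production add_articles_language_column
#    e.g. the generated migrations/NNNN_add_articles_language_column.sql contains:
#    ALTER TABLE articles ADD COLUMN language TEXT;
# 2. Apply it, then ship code that can read both old and new shapes
wrangler d1 migrations apply noozer-production
wrangler deploy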
Quick Reference
Essential Commands
# Logs
wrangler tail # Production
wrangler tail --config wrangler.staging.toml # Staging
# Database
wrangler d1 execute noozer-production --command "SELECT ..."
wrangler d1 migrations apply noozer-production
# Queues
wrangler queues info <queue-name>
wrangler queues consumer pause <queue-name>
wrangler queues consumer resume <queue-name>
# Secrets
wrangler secret put <NAME>
wrangler secret list
# Rollback
wrangler rollback
API Quick Reference
# Health
GET /v1/health
GET /v1/admin/pipeline/status
# Trigger operations
POST /v1/admin/crawl/trigger
POST /v1/admin/recluster
POST /v1/admin/reprocess/{id}
# Costs
GET /v1/admin/costs
GET /v1/admin/budgets
# Storage
GET /v1/admin/storage/stats
POST /v1/admin/cache/invalidate
Last updated: 2024-01-15
On-call rotation: [Link to rotation schedule]
Incident channel: #noozer-incidents