Operations Runbook
Operational procedures, troubleshooting guides, and incident response for Noozer
Table of Contents
- Daily Operations
- Health Checks
- Common Issues
- Pipeline Operations
- Cost Management
- Incident Response
- Disaster Recovery
- Maintenance Windows
Daily Operations
Morning Health Check (Recommended: 9 AM)
# 1. Check overnight pipeline runs
GET /v1/admin/pipeline/runs?since=yesterday
# 2. Review error counts
GET /v1/admin/pipeline/status
# 3. Check cost spend vs budget
GET /v1/admin/costs?period=day
# 4. Review classification queue
GET /v1/admin/review/queue
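These are admin endpoints, so each call needs an admin bearer token. A minimal sketch that runs all four checks in one pass, assuming $ADMIN_TOKEN is exported and jq is installed:
for path in "pipeline/runs?since=yesterday" "pipeline/status" "costs?period=day" "review/queue"; do
  echo "== $path =="
  curl -s -H "Authorization: Bearer $ADMIN_TOKEN" "https://api.noozer.io/v1/admin/$path" | jq .
done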
Key Metrics to Monitor
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Crawl success rate | > 95% | 80-95% | < 80% |
| Queue depth (any) | < 100 | 100-500 | > 500 |
| API latency p95 | < 500ms | 500-2000ms | > 2000ms |
| Error rate | < 1% | 1-5% | > 5% |
| Daily cost | < budget | 80-100% budget | > budget |
| Review queue | < 50 | 50-100 | > 100 |
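The daily-cost row can be spot-checked from the shell. A sketch, assuming the costs endpoint returns a total_usd field and a $100 daily budget (both the field name and the budget value are illustrative):
BUDGET=100
SPEND=$(curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://api.noozer.io/v1/admin/costs?period=day" | jq -r '.total_usd')
# Map spend onto the healthy / warning / critical bands from the table above
awk -v s="$SPEND" -v b="$BUDGET" 'BEGIN {
  if (s > b)           print "CRITICAL: over budget"
  else if (s >= 0.8*b) print "WARNING: 80-100% of budget"
  else                 print "OK"
}'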
Automated Alerts
Alerts are configured to fire to Slack/PagerDuty:
| Alert | Threshold | Severity | On-Call Action |
|---|---|---|---|
| High error rate | > 5% for 5 min | P1 | Investigate immediately |
| Pipeline stalled | No articles for 2 hours | P1 | Check crawler + queues |
| Cost budget exceeded | > 100% daily | P2 | Review + adjust limits |
| Queue backlog | > 1000 messages | P2 | Scale consumers |
| API down | Health check fails | P1 | Check Workers + D1 |
| Vectorize capacity | > 90% vectors | P3 | Plan index expansion |
Health Checks
Quick Health Check
# API health
curl https://api.noozer.io/v1/health
# Expected response:
{
"status": "healthy",
"checks": [
{"name": "d1", "status": "ok", "latency_ms": 5},
{"name": "kv", "status": "ok", "latency_ms": 2},
{"name": "vectorize", "status": "ok", "latency_ms": 15},
{"name": "r2", "status": "ok", "latency_ms": 8}
],
"version": "1.2.3",
"timestamp": "2024-01-15T09:00:00Z"
}
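For scripting, jq -e maps the health status onto the exit code, which is handy in cron jobs or smoke tests:
curl -s https://api.noozer.io/v1/health | jq -e '.status == "healthy"' > /dev/null \
  && echo "healthy" || echo "UNHEALTHY"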
Deep Health Check (Admin Only)
# Full system status
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.noozer.io/v1/admin/pipeline/status
# Response includes:
# - Queue depths
# - Recent pipeline runs
# - Error counts
# - Resource utilization
Component-Specific Checks
D1 Database
# Via Wrangler
wrangler d1 execute noozer-production --command "SELECT COUNT(*) FROM articles"
# List tables and their index counts
wrangler d1 execute noozer-production --command "
SELECT name,
(SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name=m.name) as indexes
FROM sqlite_master m
WHERE type='table'
"
Queues
# Check queue depths
wrangler queues info crawl-batch-prod
wrangler queues info article-extract-prod
wrangler queues info article-classify-prod
# ... etc
Vectorize
# Check index status
wrangler vectorize info noozer-articles-prod
wrangler vectorize info noozer-stories-prod
Common Issues
See also: Google News Crawling Playbook for detailed failure modes, guardrails, and circuit breaker patterns.
Issue: Crawl Failures Spiking
Symptoms:
- crawl.failure_rate > 20%
- Many 403/429 errors in logs
Diagnosis:
# Check recent crawl errors
wrangler d1 execute noozer-production --command "
SELECT error_type, COUNT(*) as count
FROM pipeline_errors
WHERE stage = 'crawl'
AND created_at > datetime('now', '-1 hour')
GROUP BY error_type
ORDER BY count DESC
"
Resolution:
- If 429 (Rate Limited):
# Check rate limit state
wrangler kv:key get --binding RATE_LIMITS "source:nytimes.com"
# Reduce crawl frequency for affected sources
# Update source fetch_config in D1
- If 403 (Blocked):
# Check if ZenRows is being used
# May need to rotate ZenRows credentials or use DataForSEO fallback
# Force DataForSEO for specific source
wrangler d1 execute noozer-production --command "
UPDATE sources
SET fetch_config = json_set(fetch_config, '$.force_d4seo', true)
WHERE domain = 'blocked-site.com'
"
- If Timeout:
# Check if specific sources are slow
# Increase timeout in fetch_config or reduce batch size
Issue: Queue Backlog Growing
Symptoms:
- Queue depth > 500 and growing
- Processing latency increasing
Diagnosis:
# Check which queue is backed up
wrangler queues info crawl-batch-prod
wrangler queues info article-extract-prod
# ... check each queue
# Check consumer status
wrangler tail noozer-extractor --format json | grep "error"
Resolution:
- Consumer Errors:
# Check consumer logs
wrangler tail noozer-classifier
# If consumer is crashing, redeploy
wrangler deploy --config src/workers/classifier/wrangler.toml
- Need More Throughput (see the wrangler.toml sketch after this list):
# Increase consumer concurrency (in wrangler.toml)
# max_batch_size = 50 # default 10
# max_concurrency = 10 # default 1
# Redeploy consumer
wrangler deploy --config src/workers/classifier/wrangler.toml
- Upstream Dependency Slow:
# If the LLM/embedding service is slow, check the external service's status
# Consider enabling the circuit breaker
wrangler kv:key put --binding FEATURE_FLAGS "circuit_breaker_openai" "true"
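The throughput settings above live in the consumer's wrangler.toml. A sketch of the relevant block, using the queue name and values from this runbook:
[[queues.consumers]]
queue = "article-classify-prod"
max_batch_size = 50   # default 10
max_concurrency = 10  # default 1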
Issue: High API Latency
Symptoms:
- API p95 latency > 2000ms
- Customer complaints
Diagnosis:
# Check recent request distribution
wrangler tail noozer-api --format json | jq '.latency_ms' | sort -n | tail -20
# Check D1 query performance
# (requires custom logging in code)
Resolution:
- D1 Slow Queries:
# Check for missing indexes
wrangler d1 execute noozer-production --command "
EXPLAIN QUERY PLAN
SELECT * FROM articles WHERE source_id = 'xxx' ORDER BY published_at DESC LIMIT 20
"
# Add missing index if needed
wrangler d1 execute noozer-production --command "
CREATE INDEX IF NOT EXISTS idx_articles_source_published
ON articles(source_id, published_at DESC)
"
- Vectorize Slow:
# Check index size
wrangler vectorize info noozer-articles-prod
# If near capacity, consider:
# - Pruning old vectors
# - Creating time-partitioned indexes
- Cache Miss Rate High (see the KV sketch after this list):
# Check KV hit rates (requires custom logging)
# Pre-warm hot cache if needed
# Increase TTL for stable data
# Decrease TTL for frequently changing data
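A cache entry can be pre-warmed from the shell with an explicit TTL. A sketch in which the CACHE binding, the key, and the endpoint are all illustrative:
# Pre-warm a hot key with a 1-hour TTL (binding, key, and URL are illustrative)
wrangler kv:key put --binding CACHE "feed:top-stories" \
  "$(curl -s https://api.noozer.io/v1/stories/top)" --ttl 3600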
Issue: Classification Accuracy Dropping
Symptoms:
- Customer feedback indicating wrong classifications
- Review queue growing
Diagnosis:
# Check classification confidence distribution
wrangler d1 execute noozer-production --command "
SELECT
CASE
WHEN confidence >= 0.9 THEN '0.9+'
WHEN confidence >= 0.7 THEN '0.7-0.9'
WHEN confidence >= 0.5 THEN '0.5-0.7'
ELSE '<0.5'
END as confidence_bucket,
COUNT(*) as count
FROM article_classifications
WHERE created_at > datetime('now', '-1 day')
GROUP BY confidence_bucket
"
# Check recent feedback
wrangler d1 execute noozer-production --command "
SELECT feedback, COUNT(*) as count
FROM customer_article_scores
WHERE feedback IS NOT NULL
AND created_at > datetime('now', '-7 days')
GROUP BY feedback
"
Resolution:
- Taxonomy Drift:
# Retrain taxonomy embeddings
POST /v1/admin/taxonomy/retrain
# This regenerates exemplar embeddings from recent feedback
- New Topics Emerging:
# Add new taxonomy labels
POST /v1/admin/taxonomy/labels
{
"category": "topic",
"label": "New Topic",
"description": "...",
"keyword_patterns": ["pattern1", "pattern2"]
}
- LLM Model Changed:
# If OpenAI updated models, may need to recalibrate
# Check that the LLM response format hasn't changed
# Consider pinning to a specific model version
Issue: Cost Budget Exceeded
Symptoms:
- Budget alert fired
- Hard limit blocking operations (if enabled)
Diagnosis:
# Check what's consuming costs
GET /v1/admin/costs?period=day&group_by=operation
# Check for anomalies
wrangler d1 execute noozer-production --command "
SELECT service, operation,
SUM(cost_micros)/1000000.0 as cost_usd,
COUNT(*) as operations
FROM cost_events
WHERE timestamp > datetime('now', '-1 day')
GROUP BY service, operation
ORDER BY cost_usd DESC
"
Resolution:
-
LLM Overuse:
# Check why LLM is being called excessively
# May need to tune classification pipeline to use rules/vector first
# Temporarily increase vector-match threshold
wrangler kv:key put --binding FEATURE_FLAGS "llm_threshold" "0.5" -
Unusual Crawl Volume:
# Check if a customer added many keywords
wrangler d1 execute noozer-production --command "
SELECT k.customer_id, COUNT(*) as keywords, SUM(k.article_count) as articles
FROM keyword_sets k
WHERE k.is_active = 1
GROUP BY k.customer_id
ORDER BY articles DESC
"
# May need to throttle specific customer -
Increase Budget:
PUT /v1/admin/budgets/{id}
{ "budget_usd": 150 }
Pipeline Operations
Manual Crawl Trigger
# Trigger immediate crawl for all active keyword sets
POST /v1/admin/crawl/trigger
{ "priority": "high" }
# Trigger for specific keyword sets
POST /v1/admin/crawl/trigger
{
"keyword_set_ids": ["uuid1", "uuid2"],
"priority": "high"
}
Reprocess Article
# Reprocess single article through full pipeline
POST /v1/admin/reprocess/{articleId}
{
"stages": ["extract", "enrich", "classify", "cluster"]
}
Force Story Recluster
# Trigger story reclustering for recent articles
POST /v1/admin/recluster
{
"window_hours": 24,
"min_confidence": 0.5
}
Clear Queue
If a queue is poisoned with bad messages:
# Pause consumer
wrangler queues consumer pause article-classify-prod
# Clear queue (careful - this deletes messages!)
# No direct Wrangler command; use dashboard or:
# 1. Create new queue
# 2. Update producer bindings
# 3. Delete old queue
# Resume consumer
wrangler queues consumer resume article-classify-prod
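A sketch of the new-queue route using standard Wrangler commands (the -v2 name is illustrative):
# 1. Create a replacement queue
wrangler queues create article-classify-prod-v2
# 2. Point the producer binding at it in wrangler.toml, then redeploy
wrangler deploy --config src/workers/classifier/wrangler.toml
# 3. Once nothing references the old queue, delete it
wrangler queues delete article-classify-prod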
Backfill Operations
# Backfill social metrics for recent articles
# (pipe article IDs from D1 into the admin enrich endpoint)
wrangler d1 execute noozer-production --json --command "
SELECT id FROM articles
WHERE processing_status = 'complete'
AND id NOT IN (SELECT article_id FROM article_social_metrics)
AND published_at > datetime('now', '-7 days')
" | jq -r '.[0].results[].id' \
  | xargs -I {} curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
      https://api.noozer.io/v1/admin/enrich/{}/social
# Backfill embeddings
POST /v1/admin/backfill/embeddings
{
"since": "2024-01-01",
"batch_size": 100
}
Cost Management
View Current Spend
# Today's spend
GET /v1/admin/costs?period=day
# Week to date
GET /v1/admin/costs?period=week
# By service
GET /v1/admin/costs?period=day&group_by=service
Set Budget Alerts
# Create daily budget with 80% alert
POST /v1/admin/budgets
{
"scope": "global",
"period": "daily",
"budget_usd": 100,
"alert_threshold_pct": 80,
"hard_limit": false
}
# Create hard limit for specific service
POST /v1/admin/budgets
{
"scope": "service",
"scope_id": "openai",
"period": "daily",
"budget_usd": 50,
"hard_limit": true
}
Cost Optimization
- Reduce LLM Calls:
  - Tune rule-based classification to catch more cases
  - Increase vector similarity threshold before LLM fallback
  - Cache LLM responses for similar inputs
- Optimize Crawling:
  - Deduplicate URLs aggressively
  - Use conditional fetching (If-Modified-Since; see the curl sketch after this list)
  - Reduce crawl frequency for low-value sources
- Batch Operations:
  - Batch embedding requests
  - Batch social metric fetches
  - Use queue batching effectively
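Conditional fetching in practice: send If-Modified-Since (or If-None-Match with a stored ETag) and skip extraction when the server answers 304. A curl sketch with an illustrative URL:
# A 304 response means the page is unchanged: no body transferred, no processing cost
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "If-Modified-Since: Mon, 15 Jan 2024 09:00:00 GMT" \
  https://example.com/news/article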
Incident Response
Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | System down | 15 min | API unresponsive, no articles ingested |
| P2 | Degraded | 1 hour | High latency, partial failures |
| P3 | Minor | 4 hours | Single source failing, review queue backup |
| P4 | Low | 24 hours | Cost anomaly, minor bugs |
Incident Workflow
- Acknowledge - Claim the incident
- Assess - Determine severity and impact
- Communicate - Update status page, notify stakeholders
- Mitigate - Stop the bleeding
- Resolve - Fix the root cause
- Review - Post-incident review within 48 hours
Communication Templates
Status Page Update:
[Investigating] We are investigating issues with [component].
[Identified] The issue has been identified as [brief description].
[Monitoring] A fix has been deployed. We are monitoring.
[Resolved] The incident has been resolved. [Brief summary].
Customer Notification:
Subject: [Noozer] Service Incident - [Brief Title]
We experienced an issue affecting [description of impact].
Timeline:
- [Time] Issue detected
- [Time] Issue resolved
Impact:
- [What customers experienced]
Root Cause:
- [Brief explanation]
Prevention:
- [Steps to prevent recurrence]
Emergency Contacts
| Role | Contact | Escalation |
|---|---|---|
| Primary On-Call | [Phone] | PagerDuty |
| Secondary On-Call | [Phone] | PagerDuty |
| Engineering Lead | [Email/Phone] | After 30 min |
| Cloudflare Support | support@cloudflare.com | For platform issues |
Disaster Recovery
Data Backup
D1 Database
# Export full database
wrangler d1 export noozer-production --output backup-$(date +%Y%m%d).sql
# Scheduled backup (via cron trigger)
# Backup worker runs daily at 3 AM, uploads to R2
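For reference, the daily 3 AM schedule corresponds to a cron trigger in the backup worker's wrangler.toml:
[triggers]
crons = ["0 3 * * *"]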
R2 Objects
# R2 has built-in redundancy
# For additional safety, replicate to a secondary bucket.
# Wrangler has no recursive object copy; one option is rclone via R2's
# S3-compatible API (assumes an "r2" remote is already configured):
rclone sync r2:raw-snapshots/2024/01/15/ r2:raw-snapshots-backup/2024/01/15/
Recovery Procedures
D1 Recovery
# Create new database
wrangler d1 create noozer-recovery
# Import from backup
wrangler d1 execute noozer-recovery --file backup-20240115.sql
# Update wrangler.toml with new database ID
# Deploy workers
wrangler deploy
Vectorize Recovery
# Vectorize doesn't support export/import
# Must rebuild from source data
# 1. Create new index
wrangler vectorize create noozer-articles-recovery --dimensions 1536 --metric cosine
# 2. Run embedding backfill job
POST /v1/admin/backfill/embeddings
{ "full_rebuild": true }
Failover Checklist
- Verify backup integrity
- Create recovery resources
- Update DNS/routes if needed
- Deploy workers to new resources
- Verify data consistency
- Run smoke tests
- Update monitoring
- Notify customers
Maintenance Windows
Scheduled Maintenance
Maintenance windows: Sundays 2-4 AM UTC
Pre-maintenance:
- Announce 48 hours in advance
- Update status page
- Notify enterprise customers directly
During maintenance:
- Set feature flag to show maintenance message (see the sketch after this list)
- Pause queue consumers
- Perform maintenance
- Run verification tests
- Resume consumers
- Remove maintenance message
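Toggling the maintenance message uses the same FEATURE_FLAGS binding as elsewhere in this runbook; the flag name itself is illustrative:
# Flag name is illustrative; match whatever key the API worker reads
wrangler kv:key put --binding FEATURE_FLAGS "maintenance_mode" "true"
# Remove it once verification passes
wrangler kv:key delete --binding FEATURE_FLAGS "maintenance_mode"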
Post-maintenance:
- Monitor for 30 minutes
- Update status page
- Send completion notification
Zero-Downtime Deployments
For routine deployments (no maintenance window needed):
# Workers support zero-downtime deployments by default
wrangler deploy
# For database migrations that are backwards compatible:
# 1. Deploy migration
wrangler d1 migrations apply noozer-production
# 2. Deploy code that uses new schema
wrangler deploy
# For breaking changes:
# 1. Deploy code that supports both old and new
# 2. Run migration
# 3. Deploy code that only uses new
# 4. Clean up old code paths
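A concrete sketch of the backwards-compatible flow using D1 migrations (the migration name and column are illustrative):
# 1. Create a migration file and keep the change additive
wrangler d1 migrations create noozer-production add_articles_language_column
#    e.g. the generated migrations/NNNN_add_articles_language_column.sql contains:
#    ALTER TABLE articles ADD COLUMN language TEXT;
# 2. Apply it, then ship code that can read both old and new shapes
wrangler d1 migrations apply noozer-production
wrangler deploy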
Quick Reference
Essential Commands
# Logs
wrangler tail # Production
wrangler tail --config wrangler.staging.toml # Staging
# Database
wrangler d1 execute noozer-production --command "SELECT ..."
wrangler d1 migrations apply noozer-production
# Queues
wrangler queues info <queue-name>
wrangler queues consumer pause <queue-name>
wrangler queues consumer resume <queue-name>
# Secrets
wrangler secret put <NAME>
wrangler secret list
# Rollback
wrangler rollback
API Quick Reference
# Health
GET /v1/health
GET /v1/admin/pipeline/status
# Trigger operations
POST /v1/admin/crawl/trigger
POST /v1/admin/recluster
POST /v1/admin/reprocess/{id}
# Costs
GET /v1/admin/costs
GET /v1/admin/budgets
# Storage
GET /v1/admin/storage/stats
POST /v1/admin/cache/invalidate
Last updated: 2024-01-15
On-call rotation: [Link to rotation schedule]
Incident channel: #noozer-incidents