Google News Crawling Playbook
Reliably ingest Google News at scale (and from multiple locations) without getting rate-limited, blocked, or poisoning your dataset.
Table of Contents
- What You're Crawling
- Location Strategy
- Rate Limiting Rules
- Headers & Fingerprints
- Caching & Dedupe
- Keyword Management
- Redirects & Canonicalization
- HTML Enrichment
- 3-Tier Content Retrieval
- Multi-Location Crawling
- Failure Modes & Guardrails
- Logging Requirements
- Rules of Engagement Checklist
1. What You're Crawling
Google News has three "surfaces." Treat them differently.
A. RSS (Primary, Stable Backbone)
Use RSS for most coverage. It's the least brittle and most crawl-friendly.
Pattern:
https://news.google.com/rss/search?q={QUERY}&hl={HL}&gl={GL}&ceid={CEID}
What you get:
- Clean items: title, link, pubDate, source, description
- Stable structure
- Less likely to trigger bot detection
What you don't get:
- Rich cluster context ("more coverage")
- Full article snippets
- Some metadata that's only exposed in the HTML UI
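A minimal fetch sketch for this surface. It assumes the USER_AGENT constant defined under "Headers & Fingerprints" below; the regex-based item extraction is illustrative only, and a real XML parser is safer against CDATA and entity edge cases:

```typescript
// Sketch: fetch one RSS query for one locale and pull out the core item fields.
const RSS_BASE = 'https://news.google.com/rss/search';

async function fetchRssResults(query: string, hl: string, gl: string, ceid: string) {
  const params = new URLSearchParams({ q: query, hl, gl, ceid });
  const response = await fetch(`${RSS_BASE}?${params}`, {
    headers: {
      'User-Agent': USER_AGENT, // see "Headers & Fingerprints" below
      'Accept': 'application/rss+xml, text/xml, application/xml;q=0.9',
    },
  });
  if (!response.ok) throw new Error(`RSS fetch failed: ${response.status}`);
  const xml = await response.text();
  // Naive <item> extraction; swap in a proper XML parser for production.
  return [...xml.matchAll(/<item>([\s\S]*?)<\/item>/g)].map(([, item]) => ({
    title: item.match(/<title>(?:<!\[CDATA\[)?([\s\S]*?)(?:\]\]>)?<\/title>/)?.[1] ?? '',
    link: item.match(/<link>([\s\S]*?)<\/link>/)?.[1] ?? '',
    pubDate: item.match(/<pubDate>(.*?)<\/pubDate>/)?.[1] ?? '',
  }));
}
```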
B. HTML Search UI (Secondary, Enrichment Only)
Use HTML to enrich with clusters, "more coverage" links, extra metadata, sometimes ld+json.
Pattern:
https://news.google.com/search?q={QUERY}&hl={HL}&gl={GL}&ceid={CEID}
Risks:
- Markup changes frequently
- More likely to trip bot heuristics if hammered
- Higher processing cost (parsing)
C. Undocumented JSON Endpoints (Avoid)
Do not build production dependencies on internal JSON endpoints.
They change without notice, may require authentication, and have stricter rate limits.
2. Location Strategy
Results differ by geography and language. Use these parameters intentionally.
Parameters
| Param | Purpose | Example |
|---|---|---|
| hl | Interface language | en-US, en-GB, es-ES |
| gl | Geographic region | US, GB, CA, DE |
| ceid | Edition identifier | US:en, GB:en, DE:de |
Best Practices
Start with a canonical set of locales (3-5):
US:en (United States, English)
GB:en (United Kingdom, English)
CA:en (Canada, English)
AU:en (Australia, English)
IN:en (India, English)
Crawl each locale as a separate stream:
- Enables comparison and cross-locale deduplication
- Reveals regional coverage differences
- Allows locale-specific rate limit management
Don't assume results are identical:
- Headlines can differ
- Ranking order varies
- Publisher inclusion differs by locale
- Some stories only appear in certain editions
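One way to encode that canonical set and expand it into per-locale RSS streams. The Locale shape is our own, not a Google schema:

```typescript
interface Locale {
  hl: string;   // interface language, e.g. 'en-US'
  gl: string;   // geographic region, e.g. 'US'
  ceid: string; // edition identifier, e.g. 'US:en'
}

const CANONICAL_LOCALES: Locale[] = [
  { hl: 'en-US', gl: 'US', ceid: 'US:en' },
  { hl: 'en-GB', gl: 'GB', ceid: 'GB:en' },
  { hl: 'en-CA', gl: 'CA', ceid: 'CA:en' },
  { hl: 'en-AU', gl: 'AU', ceid: 'AU:en' },
  { hl: 'en-IN', gl: 'IN', ceid: 'IN:en' },
];

// One stream per locale: each URL is crawled, rate-limited, and deduped independently.
function rssUrlsForKeyword(query: string): { locale: Locale; url: string }[] {
  return CANONICAL_LOCALES.map((locale) => ({
    locale,
    url: `https://news.google.com/rss/search?${new URLSearchParams({
      q: query, hl: locale.hl, gl: locale.gl, ceid: locale.ceid,
    })}`,
  }));
}
```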
Locale-Aware Storage
Store with each article:
locale_hl TEXT, -- 'en-US'
locale_gl TEXT, -- 'US'
locale_ceid TEXT, -- 'US:en'
rank_position INTEGER, -- Position in results for this locale
3. Rate Limiting Rules
Google News is more tolerant than Google Search, but discipline keeps you alive.
Safe Patterns
| Pattern | Limit |
|---|---|
| Sustained rate | ~1 request/second per egress pattern |
| Burst rate | Up to 5 requests/second briefly |
| Jitter | Random delay between requests (100-500ms) |
| Cache-first | Never refetch identical URLs repeatedly |
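A minimal pacing sketch under those limits: sequential requests at roughly one per second with 100-500ms of jitter. The fetchOne callback stands in for your actual fetch logic:

```typescript
// Sketch: sequential crawl at ~1 req/sec per egress pattern, with random jitter.
async function pacedCrawl(urls: string[], fetchOne: (url: string) => Promise<void>) {
  for (const url of urls) {
    await fetchOne(url);
    const jitter = 100 + Math.random() * 400; // 100-500ms, avoids perfectly timed requests
    await new Promise((resolve) => setTimeout(resolve, 1000 + jitter));
  }
}
```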
Danger Patterns (Avoid)
- ❌ Sustained 10+ requests/second
- ❌ Aggressive parallel fetch for same keyword set
- ❌ Tight loops re-querying same terms
- ❌ Missing or obviously bot User-Agent
- ❌ Repeated hits to identical URLs (especially HTML)
Response Handling
| Status | Meaning | Action |
|---|---|---|
| 200 | Success | Process normally |
| 200 (empty/garbage) | Soft block | Quality check, may need fallback |
| 429 | Rate limited | Exponential backoff, retry later |
| 403 | Bot detected / blocked | Switch strategy, use fallback tier |
| 503 | Service unavailable | Retry with backoff |
Backoff Strategy
const backoff = {
initial: 1000, // 1 second
multiplier: 2,
maxDelay: 300000, // 5 minutes
maxRetries: 5
};
function getBackoffDelay(attempt: number): number {
const delay = backoff.initial * Math.pow(backoff.multiplier, attempt);
const jitter = Math.random() * 1000;
return Math.min(delay + jitter, backoff.maxDelay);
}
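A retry wrapper that ties the response table to getBackoffDelay. Only 429/503 are retried here; 403 and soft blocks go to the fallback tiers instead. The sleep pattern is illustrative:

```typescript
const RETRYABLE = new Set([429, 503]);

async function fetchWithRetry(url: string, headers: Record<string, string>): Promise<Response> {
  for (let attempt = 0; attempt < backoff.maxRetries; attempt++) {
    const response = await fetch(url, { headers });
    // 200, 403, etc. are handled upstream (process, or switch tier).
    if (!RETRYABLE.has(response.status)) return response;
    await new Promise((resolve) => setTimeout(resolve, getBackoffDelay(attempt)));
  }
  throw new Error(`Retries exhausted for ${url}`);
}
```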
4. Headers & Fingerprints
User-Agent
Always send a normal browser UA, consistently:
const USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';
Rules:
- Use a modern Chrome UA string
- Don't rotate UA every request
- Rotate per batch or worker instance at most
- Keep consistent within a session
Accept Headers
For RSS:
Accept: application/rss+xml, text/xml, application/xml;q=0.9
For HTML:
Accept: text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8
Standard Headers
const headers = {
'User-Agent': USER_AGENT,
'Accept': 'application/rss+xml, text/xml',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Cache-Control': 'no-cache',
};
Avoid Weird Behavior
- ❌ Don't spam HEAD requests
- ❌ Don't fetch same URL 50 times in a minute "to test"
- ❌ Don't send obviously programmatic patterns (perfectly timed requests)
5. Caching & Dedupe
Prevents self-DDoS and dirty data.
Response Caching
Cache RSS/HTML responses for 30-300 seconds depending on freshness needs.
Cache key:
{url}:{locale}:{query_hash}
Implementation (KV):
const CACHE_TTL = 120; // seconds
async function fetchWithCache(url: string, locale: string, query: string): Promise<Response> {
const cacheKey = `gnews:${url}:${locale}:${hashQuery(query)}`; // {url}:{locale}:{query_hash}
const cached = await KV.get(cacheKey, 'text');
if (cached) {
return new Response(cached, { headers: { 'X-Cache': 'HIT' } });
}
const response = await fetch(url, { headers });
const body = await response.text();
if (response.ok) {
  // Only cache successful responses; never cache blocks or error pages
  await KV.put(cacheKey, body, { expirationTtl: CACHE_TTL });
}
return new Response(body, { status: response.status, headers: { 'X-Cache': 'MISS' } });
}
Deduplication
The same story appears multiple times:
- Across keywords
- Across locales
- Across RSS and HTML
Dedupe by (in order):
1. Resolved canonical URL (primary)
   - Follow redirects
   - Normalize URL
   - SHA-256 hash
2. Publisher + title similarity (fallback)
   - Same publisher domain
   - Title similarity > 0.85
3. Article text fingerprint (final fallback)
   - Hash of cleaned article text
   - Catches republished content
See URL Deduplication for full implementation.
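A compact sketch of the first two stages. normalizeUrl is a hypothetical helper (the real rules live in URL Deduplication), and token-set Jaccard is a simple stand-in for whatever title-similarity metric you use:

```typescript
// Stage 1: canonical-URL dedupe key.
async function dedupeKey(canonicalUrl: string): Promise<string> {
  const normalized = normalizeUrl(canonicalUrl); // hypothetical: strip tracking params, lowercase host, etc.
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(normalized));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}

// Stage 2 fallback: same publisher domain + title similarity > 0.85.
function titleSimilarity(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tokens(a), tb = tokens(b);
  const intersection = [...ta].filter((t) => tb.has(t)).length;
  return intersection / new Set([...ta, ...tb]).size;
}
```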
6. Keyword Management
How you scale without dying.
Batch Keywords
Group keywords for processing:
| Batch Size | Use Case |
|---|---|
| 10 | High-frequency, hot keywords |
| 20 | Medium frequency |
| 50 | Long-tail, daily keywords |
Use Queues/Workflows to enforce pacing between batches.
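A sketch of chunking keywords into queue sends with the Cloudflare Queues producer API; the message body shape and the batching helper are ours:

```typescript
// Chunk keywords and enqueue one batch at a time; the queue consumer enforces pacing.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

async function enqueueKeywordBatches(keywords: string[], batchSize: number, queue: Queue) {
  for (const batch of chunk(keywords, batchSize)) {
    await queue.sendBatch(batch.map((keyword) => ({ body: { keyword } })));
  }
}
```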
Scheduling Tiers
Map keywords to our dynamic crawl-frequency tiers:
| Tier | Interval | Keywords |
|---|---|---|
| Hot | 5-10 min | Breaking news, trending topics |
| Warm | 15-30 min | Active topics |
| Normal | 60 min | Standard monitoring |
| Cold | 4 hours | Low activity |
| Frozen | 24 hours | Archival, long-tail |
Implementation
// In IngestKeyword Workflow
const schedule = {
hot: '*/5 * * * *', // Every 5 minutes
warm: '*/15 * * * *', // Every 15 minutes
normal: '0 * * * *', // Every hour
cold: '0 */4 * * *', // Every 4 hours
frozen: '0 0 * * *', // Daily
};
Anti-Patterns
- ❌ Never "loop forever" on a keyword list
- ❌ Never process all keywords in one batch
- ❌ Never ignore crawl tier when scheduling
7. Redirects & Canonicalization
Critical. Google News links often redirect or wrap URLs.
The Problem
Google News URL:
https://news.google.com/rss/articles/CBMiXmh0dHBzOi8vd3d3...
Redirects to:
https://www.nytimes.com/2024/01/15/technology/ai-announcement.html
Your Pipeline Must
1. Follow all redirects:

   const response = await fetch(url, { redirect: 'follow' });
   const finalUrl = response.url;

2. Extract canonical URL via:
   - Response chain (final URL after redirects)
   - <link rel="canonical"> in HTML
   - Structured data (ld+json)

3. Store both:

   google_news_url TEXT, -- Original GN link
   canonical_url TEXT,   -- Resolved final URL
Canonicalization Flow
Google News RSS Link
│
▼
Follow Redirects
│
▼
Get Final URL
│
▼
Fetch Article HTML
│
▼
Check <link rel="canonical">
│
├── Found? Use it
│
└── Not found? Use final redirect URL
│
▼
Normalize URL
│
▼
SHA-256 Hash
│
▼
Dedupe Check
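The same flow as code, in rough sketch form. STANDARD_HEADERS and dedupeKey come from earlier sections; the canonical-link regex assumes rel precedes href, and an HTML parser (or HTMLRewriter) is more robust:

```typescript
// Sketch: GN link -> final URL -> canonical -> normalized hash -> dedupe check.
async function canonicalize(googleNewsUrl: string): Promise<{ canonicalUrl: string; dedupeHash: string }> {
  // Follow all redirects to the publisher.
  const response = await fetch(googleNewsUrl, { redirect: 'follow', headers: STANDARD_HEADERS });
  let canonicalUrl = response.url;

  // Prefer <link rel="canonical"> when the article declares one.
  const html = await response.text();
  const match = html.match(/<link[^>]*rel=["']canonical["'][^>]*href=["']([^"']+)["']/i);
  if (match) canonicalUrl = match[1];

  // Normalize + hash, reusing the dedupeKey sketch from "Caching & Dedupe".
  const dedupeHash = await dedupeKey(canonicalUrl);
  return { canonicalUrl, dedupeHash };
}
```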
8. HTML Enrichment
Use HTML sparingly. Only when it's worth it.
When to Use HTML
- Cluster detection ("more coverage" links)
- ld+json extraction when present
- Richer timestamp/publisher context
- Missing fields in RSS
Throttling Rules
Only enrich:
- Top-N results per keyword (e.g., top 5)
- Items that pass a relevance threshold
- Items where RSS lacks required fields
// Only enrich top 5 per keyword
const rssItems = await parseRss(response);
const toEnrich = rssItems.slice(0, 5);
for (const item of toEnrich) {
if (needsEnrichment(item)) {
await enrichFromHtml(item);
}
}
function needsEnrichment(item: RssItem): boolean {
return !item.author || !item.fullDescription;
}
HTML Parsing Tips
// Look for ld+json first
const ldJson = doc.querySelector('script[type="application/ld+json"]');
if (ldJson?.textContent) {
  const data = JSON.parse(ldJson.textContent);
  // Extract structured data (headline, datePublished, author, ...)
}
// Fall back to Open Graph meta tags
const ogTitle = doc.querySelector('meta[property="og:title"]')?.getAttribute('content');
const ogDescription = doc.querySelector('meta[property="og:description"]')?.getAttribute('content');
9. 3-Tier Content Retrieval
This applies to both Google News and publisher sites.
Tier 1: Direct Fetch (Cloudflare Worker)
- Cost: Cheapest
- Speed: Fastest
- Success rate: ~70-80% of publishers
async function fetchTier1(url: string): Promise<FetchResult> {
const response = await fetch(url, {
headers: STANDARD_HEADERS,
cf: { cacheTtl: 300 }
});
if (response.ok) {
return { tier: 1, content: await response.text() };
}
throw new Error(`Tier 1 failed: ${response.status}`);
}
Tier 2: ZenRows (Anti-Bot Bypass)
- Use when: 403, bot walls, JS challenges, missing content
- Cost: Per-request pricing
- Track: Usage rate for cost monitoring
async function fetchTier2(url: string): Promise<FetchResult> {
const zenrowsUrl = `https://api.zenrows.com/v1/?apikey=${ZENROWS_KEY}&url=${encodeURIComponent(url)}&js_render=true`;
const response = await fetch(zenrowsUrl);
if (response.ok) {
return { tier: 2, content: await response.text() };
}
throw new Error(`Tier 2 failed: ${response.status}`);
}
Tier 3: Third-Party API (Last Resort)
- Use when: ZenRows fails, critical coverage needed
- Options: RapidAPI Google News, DataForSEO
- Cost: Highest
async function fetchTier3(url: string): Promise<FetchResult> {
// RapidAPI or DataForSEO
const response = await fetch(RAPIDAPI_ENDPOINT, {
  method: 'POST', // fetch defaults to GET, which can't carry a body
  headers: { 'X-RapidAPI-Key': RAPIDAPI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ url })
});
if (response.ok) {
return { tier: 3, content: await response.json() };
}
throw new Error(`Tier 3 failed: ${response.status}`);
}
Unified Fetch Function
async function fetchWithFallback(url: string): Promise<FetchResult> {
// Tier 1: Direct
try {
return await fetchTier1(url);
} catch (e) {
log.warn('Tier 1 failed', { url, error: e.message });
}
// Tier 2: ZenRows
try {
return await fetchTier2(url);
} catch (e) {
log.warn('Tier 2 failed', { url, error: e.message });
}
// Tier 3: API
try {
return await fetchTier3(url);
} catch (e) {
log.error('All tiers failed', { url, error: e.message });
throw new Error('Fetch failed on all tiers');
}
}
Golden Rule
All tiers must produce the same normalized fields.
Downstream AI/classification doesn't care which tier fetched the content.
interface NormalizedArticle {
canonical_url: string;
headline: string;
body_text: string;
author?: string;
published_at?: string;
source_domain: string;
fetch_tier: 1 | 2 | 3;
}
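One way to enforce this is a single normalization choke point that every tier's result passes through. The extract* helpers below are hypothetical placeholders for your actual extraction logic:

```typescript
// Sketch: all tiers funnel through one normalizer, so downstream
// classification never sees tier-specific shapes.
function normalizeArticle(result: FetchResult, canonicalUrl: string): NormalizedArticle {
  // Tier 3 may return JSON rather than HTML; flatten to a string for extraction.
  const raw = typeof result.content === 'string' ? result.content : JSON.stringify(result.content);
  return {
    canonical_url: canonicalUrl,
    headline: extractHeadline(raw),        // hypothetical: ld+json / og:title / <title>
    body_text: extractBodyText(raw),       // hypothetical: readability-style extraction
    author: extractAuthor(raw) ?? undefined,
    published_at: extractPublishedAt(raw) ?? undefined,
    source_domain: new URL(canonicalUrl).hostname,
    fetch_tier: result.tier,
  };
}
```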
10. Multi-Location Crawling
When crawling from different locales/regions.
A. Locale-Aware Storage
Store with each discovery:
CREATE TABLE keyword_crawl_results (
id TEXT PRIMARY KEY,
keyword_id TEXT,
locale_hl TEXT, -- 'en-US'
locale_gl TEXT, -- 'US'
locale_ceid TEXT, -- 'US:en'
rank_position INTEGER, -- Position in this locale's results
google_news_url TEXT,
discovered_at TEXT
);
B. Cross-Locale Story Merging
A story might appear in the UK edition but not the US one.
Keep locale-specific data:
- Rank position per locale
- Discovery timestamp per locale
- Source variations per locale
Merge story identity by:
- Canonical URL (primary)
- Entity + time similarity (fallback)
async function mergeAcrossLocales(articles: Article[]): Promise<Story[]> {
const byCanonical = groupBy(articles, 'canonical_url');
return Object.entries(byCanonical).map(([url, variants]) => ({
canonical_url: url,
locales: variants.map(v => ({
locale: v.locale_ceid,
rank: v.rank_position,
discovered_at: v.discovered_at
})),
// Use highest-ranked variant for display
primary: variants.sort((a, b) => a.rank_position - b.rank_position)[0]
}));
}
C. Coverage Gaps
- Some publishers appear more in certain regions
- Don't treat absence as "not happening"
- Track which locales found which stories (see the query sketch below)
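A query sketch for surfacing those gaps from keyword_crawl_results via the D1 binding. It assumes a canonical_url column is populated after URL resolution, and CANONICAL_LOCALES comes from the Location Strategy sketch:

```typescript
// Stories that at least one canonical locale did not surface.
const gaps = await env.DB.prepare(`
  SELECT canonical_url,
         GROUP_CONCAT(DISTINCT locale_ceid) AS found_in_locales
  FROM keyword_crawl_results
  GROUP BY canonical_url
  HAVING COUNT(DISTINCT locale_ceid) < ?
`).bind(CANONICAL_LOCALES.length).all();
```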
11. Failure Modes & Guardrails
Real-world "don't get wrecked" list.
Guardrails
| Guardrail | Implementation |
|---|---|
| Per-keyword min refresh interval | KV with TTL |
| Per-domain concurrency caps | Semaphore in Durable Object |
| Centralized backoff state | KV or D1 |
| Circuit breaker | Trip when 429/403 rate > 20% |
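A sketch of the per-domain concurrency cap as a Durable Object, one instance per publisher domain. The cap value is illustrative, and a production version would add timeouts so crashed workers can't leak slots:

```typescript
// Sketch: in-memory semaphore; state resets if the object is evicted,
// so persist via this.state.storage if that matters for your workload.
export class DomainSemaphore {
  private inFlight = 0;
  private readonly maxConcurrent = 3; // illustrative cap per domain

  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const action = new URL(request.url).pathname;
    if (action === '/acquire') {
      if (this.inFlight >= this.maxConcurrent) {
        return new Response('busy', { status: 429 }); // caller should back off
      }
      this.inFlight++;
      return new Response('ok');
    }
    if (action === '/release') {
      this.inFlight = Math.max(0, this.inFlight - 1);
      return new Response('ok');
    }
    return new Response('unknown action', { status: 400 });
  }
}
```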
Circuit Breaker Example:
const CIRCUIT_BREAKER_THRESHOLD = 0.2; // 20% failure rate
const CIRCUIT_BREAKER_WINDOW = 60000; // 1 minute
async function checkCircuitBreaker(domain: string): Promise<boolean> {
const stats = await KV.get(`circuit:${domain}`, 'json');
if (!stats) return true; // Circuit closed, proceed
const failureRate = stats.failures / stats.total;
if (failureRate > CIRCUIT_BREAKER_THRESHOLD) {
if (Date.now() - stats.lastTrip < CIRCUIT_BREAKER_WINDOW) {
return false; // Circuit open, don't proceed
}
}
return true; // Circuit closed, proceed
}
Failure Modes
| Failure | Symptom | Response |
|---|---|---|
| Markup changes | HTML parsing breaks | Fall back to RSS |
| Partial fetch failures | Some URLs fail | Retry later, don't drop silently |
| Duplicate storms | Same URL via many keywords | Dedupe at ingest |
| Soft blocks | 200 OK but empty/garbage | Content quality checks |
| Rate limit cascade | Multiple keywords hit limits | Global backoff |
Quality Checks
function validateArticle(article: NormalizedArticle): ValidationResult {
const issues: string[] = [];
// Empty title
if (!article.headline?.trim()) {
issues.push('empty_title');
}
// Very short text (likely blocked or wrong extraction)
if (article.body_text && article.body_text.length < 200) {
issues.push('short_text');
}
// No text at all
if (!article.body_text) {
issues.push('no_text');
}
// Non-news page indicators
if (isNonNewsPage(article)) {
issues.push('non_news_page');
}
return {
valid: issues.length === 0,
issues,
quarantine: issues.includes('no_text') || issues.includes('non_news_page')
};
}
12. Logging Requirements
So you can debug in 10 minutes, not 10 hours.
Per-Request Logging
interface RequestLog {
// Identity
request_id: string;
timestamp: string;
// Target
url: string;
locale: string;
keyword_id: string;
source_type: 'rss' | 'html';
// Result
status_code: number;
latency_ms: number;
bytes_downloaded: number;
// Cache
cache_hit: boolean;
// Retry
retry_count: number;
backoff_ms: number;
// Tier
fetch_tier: 1 | 2 | 3;
}
Per-Article Logging
interface ArticleLog {
// Identity
article_id: string;
request_id: string;
// URL resolution
google_news_url: string;
redirect_chain: string[];
canonical_url: string;
// Extraction
extraction_success: boolean;
has_author: boolean;
has_date: boolean;
text_length: number;
// Dedupe
dedupe_key: string;
was_merged: boolean;
merged_with?: string;
// Pipeline
status: 'pending' | 'processing' | 'complete' | 'failed';
embedded: boolean;
classified: boolean;
scored: boolean;
}
Log Aggregation
Send to ClickHouse for time-series analysis:
-- Crawl success rate over time
SELECT
toStartOfHour(timestamp) as hour,
countIf(status_code = 200) / count() as success_rate
FROM crawl_logs
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;
-- Tier usage distribution
SELECT
fetch_tier,
count() as requests,
avg(latency_ms) as avg_latency
FROM crawl_logs
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY fetch_tier;
13. Rules of Engagement Checklist
Minimal "rules of engagement" for Google News crawling:
- RSS first, HTML second, never depend on internal JSON
- Slow and steady beats fast and blocked (~1 req/sec sustained)
- Cache + dedupe or you'll hammer Google accidentally
- Treat locale as a first-class dimension (hl/gl/ceid)
- Follow redirects; canonicalize everything
- Use the 3-tier fetch plan consistently (Worker → ZenRows → API)
- Store raw snapshots in R2 with retention policy
- Instrument everything; add circuit breakers
- Quality check all content (empty/short = quarantine)
- Log per-request and per-article for debugging
Integration with Topic Intel
Workflow Mapping
This playbook maps to our IngestKeyword Workflow:
| Playbook Section | Workflow Step |
|---|---|
| RSS fetching | Step 3: Enqueue discovery jobs |
| Rate limiting | Step 3: Queue pacing |
| Redirects & canonicalization | Step 5: Dedupe URLs |
| 3-tier fetch | Step 6: Enqueue fetch jobs |
| Quality checks | Step 8: Parse results |
| Logging | Step 14: Emit to ClickHouse |
Queue Mapping
| Playbook Concern | Queue |
|---|---|
| RSS discovery | discovery.google |
| HTML enrichment | discovery.google (with flag) |
| Tier 1 fetch | fetch.direct |
| Tier 2 fetch | fetch.zenrows |
| Tier 3 fetch | fetch.rapidapi |
Storage Mapping
| Data | Storage |
|---|---|
| Response cache | KV (gnews:*) |
| Backoff state | KV (backoff:*) |
| Circuit breaker | KV (circuit:*) |
| Raw HTML | R2 |
| Article metadata | D1 |
| Crawl logs | ClickHouse |
Related Documentation
- Architecture Overview - System design
- URL Deduplication - Full dedupe implementation
- External APIs - ZenRows, DataForSEO, RapidAPI integration
- Runbook - Operational procedures for failures
This playbook is a living document. Update as Google News behavior changes.