Google News Crawling Playbook

Reliably ingest Google News at scale (and from multiple locations) without getting rate-limited, blocked, or poisoning your dataset.


Table of Contents

  1. What You're Crawling
  2. Location Strategy
  3. Rate Limiting Rules
  4. Headers & Fingerprints
  5. Caching & Dedupe
  6. Keyword Management
  7. Redirects & Canonicalization
  8. HTML Enrichment
  9. 3-Tier Content Retrieval
  10. Multi-Location Crawling
  11. Failure Modes & Guardrails
  12. Logging Requirements
  13. Rules of Engagement Checklist

1. What You're Crawling

Google News has three "surfaces." Treat them differently.

A. RSS (Primary, Stable Backbone)

Use RSS for most coverage. It's the least brittle and most crawl-friendly.

Pattern:

https://news.google.com/rss/search?q={QUERY}&hl={HL}&gl={GL}&ceid={CEID}
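
A minimal sketch of building that URL (buildRssUrl is an illustrative helper, not part of any library):

function buildRssUrl(query: string, hl: string, gl: string, ceid: string): string {
  // URLSearchParams handles encoding of multi-word queries and operators
  const params = new URLSearchParams({ q: query, hl, gl, ceid });
  return `https://news.google.com/rss/search?${params}`;
}

// Example: US English edition
const feedUrl = buildRssUrl('openai funding', 'en-US', 'US', 'US:en');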

What you get:

  • Clean items: title, link, pubDate, source, description
  • Stable structure
  • Less likely to trigger bot detection

What you don't get:

  • Rich cluster context ("more coverage")
  • Full article snippets
  • Some metadata that only the HTML surface exposes

B. HTML Search UI (Secondary, Enrichment Only)

Use HTML to enrich with clusters, "more coverage" links, extra metadata, sometimes ld+json.

Pattern:

https://news.google.com/search?q={QUERY}&hl={HL}&gl={GL}&ceid={CEID}

Risks:

  • Markup changes frequently
  • More likely to trip bot heuristics if hammered
  • Higher processing cost (parsing)

C. Undocumented JSON Endpoints (Avoid)

Do not build production dependencies on internal JSON endpoints.

They change without notice, may require authentication, and have stricter rate limits.


2. Location Strategy

Results differ by geography and language. Use these parameters intentionally.

Parameters

Param  Purpose             Example
-----  ------------------  -------------------
hl     Interface language  en-US, en-GB, es-ES
gl     Geographic region   US, GB, CA, DE
ceid   Edition identifier  US:en, GB:en, DE:de

Best Practices

Start with a canonical set of locales (3-5):

US:en (United States, English)
GB:en (United Kingdom, English)
CA:en (Canada, English)
AU:en (Australia, English)
IN:en (India, English)
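
A sketch of that set as config (field names are illustrative, not a fixed schema):

interface Locale {
  hl: string;   // interface language
  gl: string;   // geographic region
  ceid: string; // edition identifier
}

const LOCALES: Locale[] = [
  { hl: 'en-US', gl: 'US', ceid: 'US:en' },
  { hl: 'en-GB', gl: 'GB', ceid: 'GB:en' },
  { hl: 'en-CA', gl: 'CA', ceid: 'CA:en' },
  { hl: 'en-AU', gl: 'AU', ceid: 'AU:en' },
  { hl: 'en-IN', gl: 'IN', ceid: 'IN:en' },
];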

Crawl each locale as a separate stream:

  • Enables comparison and cross-locale deduplication
  • Reveals regional coverage differences
  • Allows locale-specific rate limit management

Don't assume results are identical:

  • Headlines can differ
  • Ranking order varies
  • Publisher inclusion differs by locale
  • Some stories only appear in certain editions

Locale-Aware Storage

Store with each article:

locale_hl TEXT,         -- 'en-US'
locale_gl TEXT,         -- 'US'
locale_ceid TEXT,       -- 'US:en'
rank_position INTEGER,  -- Position in results for this locale

3. Rate Limiting Rules

Google News is more tolerant than Google Search, but discipline keeps you alive.

Safe Patterns

Pattern         Limit
--------------  -----------------------------------------
Sustained rate  ~1 request/second per egress pattern
Burst rate      Up to 5 requests/second briefly
Jitter          Random delay between requests (100-500ms)
Cache-first     Never refetch identical URLs repeatedly
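
A minimal pacing sketch for the sustained rate plus jitter (pacedFetch is illustrative):

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function pacedFetch(urls: string[], headers: Record<string, string>): Promise<Response[]> {
  const results: Response[] = [];
  for (const url of urls) {
    results.push(await fetch(url, { headers }));
    // ~1 req/sec sustained, with 100-500ms of jitter on top
    await sleep(1000 + 100 + Math.random() * 400);
  }
  return results;
}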

Danger Patterns (Avoid)

  • ❌ Sustained 10+ requests/second
  • ❌ Aggressive parallel fetch for same keyword set
  • ❌ Tight loops re-querying same terms
  • ❌ Missing or obviously bot User-Agent
  • ❌ Repeated hits to identical URLs (especially HTML)

Response Handling

Status               Meaning                 Action
-------------------  ----------------------  ----------------------------------
200                  Success                 Process normally
200 (empty/garbage)  Soft block              Quality check, may need fallback
429                  Rate limited            Exponential backoff, retry later
403                  Bot detected / blocked  Switch strategy, use fallback tier
503                  Service unavailable     Retry with backoff
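
That table as a sketch (the action names and the 500-byte soft-block threshold are assumptions):

type CrawlAction = 'process' | 'quality_check' | 'backoff_retry' | 'fallback_tier';

function classifyResponse(status: number, body: string): CrawlAction {
  if (status === 429 || status === 503) return 'backoff_retry';
  if (status === 403) return 'fallback_tier';
  if (status === 200) {
    // Soft block: 200 OK but suspiciously small or empty body
    return body.trim().length < 500 ? 'quality_check' : 'process';
  }
  return 'backoff_retry'; // treat anything else as retryable
}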

Backoff Strategy

const backoff = {
  initial: 1000,     // 1 second
  multiplier: 2,
  maxDelay: 300000,  // 5 minutes
  maxRetries: 5
};

function getBackoffDelay(attempt: number): number {
  const delay = backoff.initial * Math.pow(backoff.multiplier, attempt);
  const jitter = Math.random() * 1000;
  return Math.min(delay + jitter, backoff.maxDelay);
}
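
And a sketch of wiring it into a retry loop (fetchWithRetry is illustrative):

async function fetchWithRetry(url: string, headers: Record<string, string>): Promise<Response> {
  for (let attempt = 0; attempt < backoff.maxRetries; attempt++) {
    const response = await fetch(url, { headers });
    // Only 429/503 are worth retrying with backoff; everything else returns
    if (response.status !== 429 && response.status !== 503) return response;
    await new Promise(resolve => setTimeout(resolve, getBackoffDelay(attempt)));
  }
  throw new Error(`Gave up after ${backoff.maxRetries} retries: ${url}`);
}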

4. Headers & Fingerprints

User-Agent

Always send a normal browser UA, consistently:

const USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

Rules:

  • Use a modern Chrome UA string
  • Don't rotate UA every request
  • Rotate per batch or worker instance at most
  • Keep consistent within a session

Accept Headers

For RSS:

Accept: application/rss+xml, text/xml, application/xml;q=0.9

For HTML:

Accept: text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8

Standard Headers

const headers = {
  'User-Agent': USER_AGENT,
  'Accept': 'application/rss+xml, text/xml',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Connection': 'keep-alive',
  'Cache-Control': 'no-cache',
};

Avoid Weird Behavior

  • ❌ Don't spam HEAD requests
  • ❌ Don't fetch same URL 50 times in a minute "to test"
  • ❌ Don't send obviously programmatic patterns (perfectly timed requests)

5. Caching & Dedupe

Caching and deduplication prevent self-DDoS and dirty data.

Response Caching

Cache RSS/HTML responses for 30-300 seconds depending on freshness needs.

Cache key:

gnews:{locale}:{hash(url + query)}

Implementation (KV):

const CACHE_TTL = 120; // seconds

async function fetchWithCache(url: string, locale: string, query: string): Promise<Response> {
  // Key covers url + locale + query, matching the cache-key pattern above
  // (hashQuery is an assumed SHA-256 helper; see the Dedupe section)
  const cacheKey = `gnews:${locale}:${hashQuery(`${url}:${query}`)}`;

  const cached = await KV.get(cacheKey, 'text');
  if (cached) {
    return new Response(cached, { headers: { 'X-Cache': 'HIT' } });
  }

  const response = await fetch(url, { headers });
  const body = await response.text();

  await KV.put(cacheKey, body, { expirationTtl: CACHE_TTL });
  return new Response(body, { headers: { 'X-Cache': 'MISS' } });
}

Deduplication

The same story appears multiple times:

  • Across keywords
  • Across locales
  • Across RSS and HTML

Dedupe by (in order):

  1. Resolved canonical URL (primary)
     • Follow redirects
     • Normalize URL
     • SHA-256 hash (sketch below)
  2. Publisher + title similarity (fallback)
     • Same publisher domain
     • Title similarity > 0.85
  3. Article text fingerprint (final fallback)
     • Hash of cleaned article text
     • Catches republished content
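
A minimal sketch of step 1's normalize-and-hash using the Web Crypto API (the stripped tracking params are assumptions; tune against your data):

function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = '';
  u.hostname = u.hostname.toLowerCase();
  // Illustrative tracking params; extend as needed
  for (const p of ['utm_source', 'utm_medium', 'utm_campaign', 'fbclid', 'gclid']) {
    u.searchParams.delete(p);
  }
  return u.toString();
}

async function dedupeKey(rawUrl: string): Promise<string> {
  const bytes = new TextEncoder().encode(normalizeUrl(rawUrl));
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('');
}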

See URL Deduplication for full implementation.


6. Keyword Management

How you scale without dying.

Batch Keywords

Group keywords for processing:

Batch Size  Use Case
----------  ----------------------------
10          High-frequency, hot keywords
20          Medium frequency
50          Long-tail, daily keywords

Use Queues/Workflows to enforce pacing between batches.
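
A sketch of the batching side (DISCOVERY_QUEUE is an assumed Queues binding; the 30-second stagger is illustrative):

// Env is assumed to expose DISCOVERY_QUEUE: Queue (workers-types)
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function enqueueKeywordBatches(keywords: string[], batchSize: number, env: Env): Promise<void> {
  const batches = chunk(keywords, batchSize);
  for (let i = 0; i < batches.length; i++) {
    // Stagger batches so workers don't all fire at once
    await env.DISCOVERY_QUEUE.send({ keywords: batches[i] }, { delaySeconds: i * 30 });
  }
}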

Scheduling Tiers

Map each keyword to one of our dynamic crawl-frequency tiers:

Tier    Interval   Keywords
------  ---------  ------------------------------
Hot     5-10 min   Breaking news, trending topics
Warm    15-30 min  Active topics
Normal  60 min     Standard monitoring
Cold    4 hours    Low activity
Frozen  24 hours   Archival, long-tail

Implementation

// In IngestKeyword Workflow
const schedule = {
  hot: '*/5 * * * *',    // Every 5 minutes
  warm: '*/15 * * * *',  // Every 15 minutes
  normal: '0 * * * *',   // Every hour
  cold: '0 */4 * * *',   // Every 4 hours
  frozen: '0 0 * * *',   // Daily
};

Anti-Patterns

  • ❌ Never "loop forever" on a keyword list
  • ❌ Never process all keywords in one batch
  • ❌ Never ignore crawl tier when scheduling

7. Redirects & Canonicalization

Critical. Google News links often redirect or wrap URLs.

The Problem

Google News URL:

https://news.google.com/rss/articles/CBMiXmh0dHBzOi8vd3d3...

Redirects to:

https://www.nytimes.com/2024/01/15/technology/ai-announcement.html

Your Pipeline Must

  1. Follow all redirects

     const response = await fetch(url, { redirect: 'follow' });
     const finalUrl = response.url;

  2. Extract canonical URL via:
     • Response chain (final URL after redirects)
     • <link rel="canonical"> in HTML
     • Structured data (ld+json)

  3. Store both:

     google_news_url TEXT,  -- Original GN link
     canonical_url TEXT,    -- Resolved final URL

Canonicalization Flow

Google News RSS Link
        ↓
Follow Redirects
        ↓
Get Final URL
        ↓
Fetch Article HTML
        ↓
Check <link rel="canonical">
  ├── Found? Use it
  └── Not found? Use final redirect URL
        ↓
Normalize URL
        ↓
SHA-256 Hash
        ↓
Dedupe Check
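
A sketch of the top half of that flow; the regex canonical-tag check is a simplification of real HTML parsing:

async function resolveCanonical(gnewsUrl: string): Promise<string> {
  const response = await fetch(gnewsUrl, { redirect: 'follow', headers });
  const finalUrl = response.url; // final URL after all redirects

  const html = await response.text();
  // Simplified extraction; use a real HTML parser in production
  const match = html.match(/<link[^>]*rel=["']canonical["'][^>]*href=["']([^"']+)["']/i);
  return match ? match[1] : finalUrl;
}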

8. HTML Enrichment

Use HTML sparingly. Only when it's worth it.

When to Use HTML

  • Cluster detection ("more coverage" links)
  • ld+json extraction when present
  • Richer timestamp/publisher context
  • Missing fields in RSS

Throttling Rules

Only enrich:

  • Top-N results per keyword (e.g., top 5)
  • Items that pass a relevance threshold
  • Items where RSS lacks required fields

// Only enrich top 5 per keyword
const rssItems = await parseRss(response);
const toEnrich = rssItems.slice(0, 5);

for (const item of toEnrich) {
  if (needsEnrichment(item)) {
    await enrichFromHtml(item);
  }
}

function needsEnrichment(item: RssItem): boolean {
  return !item.author || !item.fullDescription;
}

HTML Parsing Tips

// Look for ld+json first
const ldJson = doc.querySelector('script[type="application/ld+json"]');
if (ldJson?.textContent) {
  const data = JSON.parse(ldJson.textContent);
  // Extract structured data (headline, datePublished, author, ...)
}

// Fall back to meta tags
const ogTitle = doc.querySelector('meta[property="og:title"]')?.getAttribute('content');
const ogDescription = doc.querySelector('meta[property="og:description"]')?.getAttribute('content');

9. 3-Tier Content Retrieval

This applies to both Google News and publisher sites.

Tier 1: Direct Fetch (Cloudflare Worker)

  • Cost: Cheapest
  • Speed: Fastest
  • Success rate: ~70-80% of publishers

// FetchResult is assumed to be { tier: 1 | 2 | 3; content: string }
async function fetchTier1(url: string): Promise<FetchResult> {
  const response = await fetch(url, {
    headers: STANDARD_HEADERS,
    cf: { cacheTtl: 300 }
  });

  if (response.ok) {
    return { tier: 1, content: await response.text() };
  }

  throw new Error(`Tier 1 failed: ${response.status}`);
}

Tier 2: ZenRows (Anti-Bot Bypass)

  • Use when: 403, bot walls, JS challenges, missing content
  • Cost: Per-request pricing
  • Track: Usage rate for cost monitoring

async function fetchTier2(url: string): Promise<FetchResult> {
  const zenrowsUrl = `https://api.zenrows.com/v1/?apikey=${ZENROWS_KEY}&url=${encodeURIComponent(url)}&js_render=true`;

  const response = await fetch(zenrowsUrl);

  if (response.ok) {
    return { tier: 2, content: await response.text() };
  }

  throw new Error(`Tier 2 failed: ${response.status}`);
}

Tier 3: Third-Party API (Last Resort)

  • Use when: ZenRows fails, critical coverage needed
  • Options: RapidAPI Google News, DataForSEO
  • Cost: Highest

async function fetchTier3(url: string): Promise<FetchResult> {
  // RapidAPI or DataForSEO; a POST with a JSON body (fetch ignores bodies on GET)
  const response = await fetch(RAPIDAPI_ENDPOINT, {
    method: 'POST',
    headers: {
      'X-RapidAPI-Key': RAPIDAPI_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });

  if (response.ok) {
    // Provider response shapes vary; normalize downstream
    return { tier: 3, content: await response.text() };
  }

  throw new Error(`Tier 3 failed: ${response.status}`);
}

Unified Fetch Function

async function fetchWithFallback(url: string): Promise<FetchResult> {
  // Tier 1: Direct
  try {
    return await fetchTier1(url);
  } catch (e) {
    log.warn('Tier 1 failed', { url, error: e.message });
  }

  // Tier 2: ZenRows
  try {
    return await fetchTier2(url);
  } catch (e) {
    log.warn('Tier 2 failed', { url, error: e.message });
  }

  // Tier 3: API
  try {
    return await fetchTier3(url);
  } catch (e) {
    log.error('All tiers failed', { url, error: e.message });
    throw new Error('Fetch failed on all tiers');
  }
}

Golden Rule

All tiers must produce the same normalized fields.

Downstream AI/classification doesn't care which tier fetched the content.

interface NormalizedArticle {
  canonical_url: string;
  headline: string;
  body_text: string;
  author?: string;
  published_at?: string;
  source_domain: string;
  fetch_tier: 1 | 2 | 3;
}
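
A sketch of enforcing that rule at the tier boundary (extractArticle is an assumed parsing helper):

function toNormalized(result: FetchResult, canonicalUrl: string): NormalizedArticle {
  // extractArticle is assumed to pull headline/body/author/date from raw content
  const parsed = extractArticle(result.content);
  return {
    canonical_url: canonicalUrl,
    headline: parsed.headline,
    body_text: parsed.bodyText,
    author: parsed.author,
    published_at: parsed.publishedAt,
    source_domain: new URL(canonicalUrl).hostname,
    fetch_tier: result.tier,
  };
}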

10. Multi-Location Crawling

When crawling from different locales/regions.

A. Locale-Aware Storage

Store with each discovery:

CREATE TABLE keyword_crawl_results (
  id TEXT PRIMARY KEY,
  keyword_id TEXT,
  locale_hl TEXT,         -- 'en-US'
  locale_gl TEXT,         -- 'US'
  locale_ceid TEXT,       -- 'US:en'
  rank_position INTEGER,  -- Position in this locale's results
  google_news_url TEXT,
  discovered_at TEXT
);

B. Cross-Locale Story Merging

A story might appear in the UK edition but not the US one.

Keep locale-specific data:

  • Rank position per locale
  • Discovery timestamp per locale
  • Source variations per locale

Merge story identity by:

  • Canonical URL (primary)
  • Entity + time similarity (fallback)

// groupBy as in lodash: groups array items by a property value
async function mergeAcrossLocales(articles: Article[]): Promise<Story[]> {
  const byCanonical = groupBy(articles, 'canonical_url');

  return Object.entries(byCanonical).map(([url, variants]) => ({
    canonical_url: url,
    locales: variants.map(v => ({
      locale: v.locale_ceid,
      rank: v.rank_position,
      discovered_at: v.discovered_at
    })),
    // Use the highest-ranked variant for display
    primary: variants.sort((a, b) => a.rank_position - b.rank_position)[0]
  }));
}

C. Coverage Gaps

  • Some publishers appear more in certain regions
  • Don't treat absence as "not happening"
  • Track which locales found which stories

11. Failure Modes & Guardrails

Real-world "don't get wrecked" list.

Guardrails

Guardrail                         Implementation
--------------------------------  ----------------------------
Per-keyword min refresh interval  KV with TTL
Per-domain concurrency caps       Semaphore in Durable Object
Centralized backoff state         KV or D1
Circuit breaker                   Trip when 429/403 rate > 20%
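
For the first guardrail, a sketch using a KV key with TTL as the refresh lock (the refresh: key prefix is illustrative):

async function shouldCrawl(keywordId: string, minIntervalSec: number): Promise<boolean> {
  const lockKey = `refresh:${keywordId}`;
  if (await KV.get(lockKey)) return false; // crawled too recently, skip

  await KV.put(lockKey, '1', { expirationTtl: minIntervalSec });
  return true;
}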

Circuit Breaker Example:

const CIRCUIT_BREAKER_THRESHOLD = 0.2; // 20% failure rate
const CIRCUIT_BREAKER_WINDOW = 60000;  // 1 minute

async function checkCircuitBreaker(domain: string): Promise<boolean> {
  const stats = await KV.get(`circuit:${domain}`, 'json');

  if (!stats) return true; // Circuit closed, proceed

  const failureRate = stats.failures / stats.total;
  if (failureRate > CIRCUIT_BREAKER_THRESHOLD) {
    if (Date.now() - stats.lastTrip < CIRCUIT_BREAKER_WINDOW) {
      return false; // Circuit open, don't proceed
    }
  }

  return true; // Circuit closed, proceed
}

Failure Modes

Failure                 Symptom                       Response
----------------------  ----------------------------  --------------------------------
Markup changes          HTML parsing breaks           Fall back to RSS
Partial fetch failures  Some URLs fail                Retry later, don't drop silently
Duplicate storms        Same URL via many keywords    Dedupe at ingest
Soft blocks             200 OK but empty/garbage      Content quality checks
Rate limit cascade      Multiple keywords hit limits  Global backoff

Quality Checks

function validateArticle(article: NormalizedArticle): ValidationResult {
  const issues: string[] = [];

  // Empty title
  if (!article.headline?.trim()) {
    issues.push('empty_title');
  }

  // Very short text (likely blocked or wrong extraction)
  if (article.body_text && article.body_text.length < 200) {
    issues.push('short_text');
  }

  // No text at all
  if (!article.body_text) {
    issues.push('no_text');
  }

  // Non-news page indicators
  if (isNonNewsPage(article)) {
    issues.push('non_news_page');
  }

  return {
    valid: issues.length === 0,
    issues,
    quarantine: issues.includes('no_text') || issues.includes('non_news_page')
  };
}
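
isNonNewsPage is referenced above but not defined here; a minimal heuristic sketch (the indicator phrases are assumptions, tune them against your own failure samples):

function isNonNewsPage(article: NormalizedArticle): boolean {
  // Catches consent walls, login pages, and error pages posing as articles
  const text = `${article.headline} ${article.body_text}`.toLowerCase();
  const indicators = [
    'enable javascript',
    'accept cookies',
    'subscribe to continue',
    'page not found',
    'access denied',
  ];
  return indicators.some(phrase => text.includes(phrase));
}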

12. Logging Requirements

So you can debug in 10 minutes, not 10 hours.

Per-Request Logging

interface RequestLog {
  // Identity
  request_id: string;
  timestamp: string;

  // Target
  url: string;
  locale: string;
  keyword_id: string;
  source_type: 'rss' | 'html';

  // Result
  status_code: number;
  latency_ms: number;
  bytes_downloaded: number;

  // Cache
  cache_hit: boolean;

  // Retry
  retry_count: number;
  backoff_ms: number;

  // Tier
  fetch_tier: 1 | 2 | 3;
}

Per-Article Logging

interface ArticleLog {
  // Identity
  article_id: string;
  request_id: string;

  // URL resolution
  google_news_url: string;
  redirect_chain: string[];
  canonical_url: string;

  // Extraction
  extraction_success: boolean;
  has_author: boolean;
  has_date: boolean;
  text_length: number;

  // Dedupe
  dedupe_key: string;
  was_merged: boolean;
  merged_with?: string;

  // Pipeline
  status: 'pending' | 'processing' | 'complete' | 'failed';
  embedded: boolean;
  classified: boolean;
  scored: boolean;
}

Log Aggregation

Send to ClickHouse for time-series analysis:

-- Crawl success rate over time
SELECT
  toStartOfHour(timestamp) as hour,
  countIf(status_code = 200) / count() as success_rate
FROM crawl_logs
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;

-- Tier usage distribution
SELECT
  fetch_tier,
  count() as requests,
  avg(latency_ms) as avg_latency
FROM crawl_logs
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY fetch_tier;

13. Rules of Engagement Checklist

Minimal "rules of engagement" for Google News crawling:

  • RSS first, HTML second, never depend on internal JSON
  • Slow and steady beats fast and blocked (~1 req/sec sustained)
  • Cache + dedupe or you'll hammer Google accidentally
  • Treat locale as a first-class dimension (hl/gl/ceid)
  • Follow redirects; canonicalize everything
  • Use the 3-tier fetch plan consistently (Worker → ZenRows → API)
  • Store raw snapshots in R2 with retention policy
  • Instrument everything; add circuit breakers
  • Quality check all content (empty/short = quarantine)
  • Log per-request and per-article for debugging

Integration with Topic Intel

Workflow Mapping

This playbook maps to our IngestKeyword Workflow:

Playbook Section              Workflow Step
----------------------------  ------------------------------
RSS fetching                  Step 3: Enqueue discovery jobs
Rate limiting                 Step 3: Queue pacing
Redirects & canonicalization  Step 5: Dedupe URLs
3-tier fetch                  Step 6: Enqueue fetch jobs
Quality checks                Step 8: Parse results
Logging                       Step 14: Emit to ClickHouse

Queue Mapping

Playbook Concern  Queue
----------------  ----------------------------
RSS discovery     discovery.google
HTML enrichment   discovery.google (with flag)
Tier 1 fetch      fetch.direct
Tier 2 fetch      fetch.zenrows
Tier 3 fetch      fetch.rapidapi

Storage Mapping

Data              Storage
----------------  ---------------
Response cache    KV (gnews:*)
Backoff state     KV (backoff:*)
Circuit breaker   KV (circuit:*)
Raw HTML          R2
Article metadata  D1
Crawl logs        ClickHouse


This playbook is a living document. Update as Google News behavior changes.