Google News Crawling Playbook
Reliably ingest Google News at scale (and from multiple locations) without getting rate-limited, blocked, or poisoning your dataset.
Table of Contents
- What You're Crawling
- Location Strategy
- Rate Limiting Rules
- Headers & Fingerprints
- Caching & Dedupe
- Keyword Management
- Redirects & Canonicalization
- HTML Enrichment
- 3-Tier Content Retrieval
- Multi-Location Crawling
- Failure Modes & Guardrails
- Logging Requirements
- Rules of Engagement Checklist
1. What You're Crawling
Google News has three "surfaces." Treat them differently.
A. RSS (Primary, Stable Backbone)
Use RSS for most coverage. It's the least brittle and most crawl-friendly.
Pattern:
https://news.google.com/rss/search?q={QUERY}&hl={HL}&gl={GL}&ceid={CEID}
What you get:
- Clean items: title, link, pubDate, source, description
- Stable structure
- Less likely to trigger bot detection
What you don't get:
- Rich cluster context ("more coverage")
- Full article snippets
- Some metadata that's only exposed in the HTML UI
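A minimal fetch sketch for this surface. It assumes the USER_AGENT constant defined under "Headers & Fingerprints" below; the regex-based item extraction is illustrative only, and a real XML parser is safer against CDATA and entity edge cases:

```typescript
// Sketch: fetch one RSS query for one locale and pull out the core item fields.
const RSS_BASE = 'https://news.google.com/rss/search';

async function fetchRssResults(query: string, hl: string, gl: string, ceid: string) {
  const params = new URLSearchParams({ q: query, hl, gl, ceid });
  const response = await fetch(`${RSS_BASE}?${params}`, {
    headers: {
      'User-Agent': USER_AGENT, // see "Headers & Fingerprints" below
      'Accept': 'application/rss+xml, text/xml, application/xml;q=0.9',
    },
  });
  if (!response.ok) throw new Error(`RSS fetch failed: ${response.status}`);
  const xml = await response.text();
  // Naive <item> extraction; swap in a proper XML parser for production.
  return [...xml.matchAll(/<item>([\s\S]*?)<\/item>/g)].map(([, item]) => ({
    title: item.match(/<title>(?:<!\[CDATA\[)?([\s\S]*?)(?:\]\]>)?<\/title>/)?.[1] ?? '',
    link: item.match(/<link>([\s\S]*?)<\/link>/)?.[1] ?? '',
    pubDate: item.match(/<pubDate>(.*?)<\/pubDate>/)?.[1] ?? '',
  }));
}
```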
B. HTML Search UI (Secondary, Enrichment Only)
Use HTML to enrich with clusters, "more coverage" links, extra metadata, sometimes ld+json.
Pattern:
https://news.google.com/search?q={QUERY}&hl={HL}&gl={GL}&ceid={CEID}
Risks:
- Markup changes frequently
- More likely to trip bot heuristics if hammered
- Higher processing cost (parsing)
C. Undocumented JSON Endpoints (Avoid)
Do not build production dependencies on internal JSON endpoints.
They change without notice, may require authentication, and have stricter rate limits.
2. Location Strategy
Results differ by geography and language. Use these parameters intentionally.
Parameters
| Param | Purpose | Example |
|---|---|---|
| hl | Interface language | en-US, en-GB, es-ES |
| gl | Geographic region | US, GB, CA, DE |
| ceid | Edition identifier | US:en, GB:en, DE:de |
Best Practices
Start with a canonical set of locales (3-5):
US:en (United States, English)
GB:en (United Kingdom, English)
CA:en (Canada, English)
AU:en (Australia, English)
IN:en (India, English)
Crawl each locale as a separate stream:
- Enables comparison and cross-locale deduplication
- Reveals regional coverage differences
- Allows locale-specific rate limit management
Don't assume results are identical:
- Headlines can differ
- Ranking order varies
- Publisher inclusion differs by locale
- Some stories only appear in certain editions
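One way to encode that canonical set and expand it into per-locale RSS streams. The Locale shape is our own, not a Google schema:

```typescript
interface Locale {
  hl: string;   // interface language, e.g. 'en-US'
  gl: string;   // geographic region, e.g. 'US'
  ceid: string; // edition identifier, e.g. 'US:en'
}

const CANONICAL_LOCALES: Locale[] = [
  { hl: 'en-US', gl: 'US', ceid: 'US:en' },
  { hl: 'en-GB', gl: 'GB', ceid: 'GB:en' },
  { hl: 'en-CA', gl: 'CA', ceid: 'CA:en' },
  { hl: 'en-AU', gl: 'AU', ceid: 'AU:en' },
  { hl: 'en-IN', gl: 'IN', ceid: 'IN:en' },
];

// One stream per locale: each URL is crawled, rate-limited, and deduped independently.
function rssUrlsForKeyword(query: string): { locale: Locale; url: string }[] {
  return CANONICAL_LOCALES.map((locale) => ({
    locale,
    url: `https://news.google.com/rss/search?${new URLSearchParams({
      q: query, hl: locale.hl, gl: locale.gl, ceid: locale.ceid,
    })}`,
  }));
}
```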
Locale-Aware Storage
Store with each article:
locale_hl TEXT, -- 'en-US'
locale_gl TEXT, -- 'US'
locale_ceid TEXT, -- 'US:en'
rank_position INTEGER, -- Position in results for this locale
3. Rate Limiting Rules
Google News is more tolerant than Google Search, but discipline keeps you alive.
Safe Patterns
| Pattern | Limit |
|---|---|
| Sustained rate | ~1 request/second per egress pattern |
| Burst rate | Up to 5 requests/second briefly |
| Jitter | Random delay between requests (100-500ms) |
| Cache-first | Never refetch identical URLs repeatedly |
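A minimal pacing sketch under those limits: sequential requests at roughly one per second with 100-500ms of jitter. The fetchOne callback stands in for your actual fetch logic:

```typescript
// Sketch: sequential crawl at ~1 req/sec per egress pattern, with random jitter.
async function pacedCrawl(urls: string[], fetchOne: (url: string) => Promise<void>) {
  for (const url of urls) {
    await fetchOne(url);
    const jitter = 100 + Math.random() * 400; // 100-500ms, avoids perfectly timed requests
    await new Promise((resolve) => setTimeout(resolve, 1000 + jitter));
  }
}
```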
Danger Patterns (Avoid)
- ❌ Sustained 10+ requests/second
- ❌ Aggressive parallel fetch for same keyword set
- ❌ Tight loops re-querying same terms
- ❌ Missing or obviously bot User-Agent
- ❌ Repeated hits to identical URLs (especially HTML)
Response Handling
| Status | Meaning | Action |
|---|---|---|
| 200 | Success | Process normally |
| 200 (empty/garbage) | Soft block | Quality check, may need fallback |
| 429 | Rate limited | Exponential backoff, retry later |
| 403 | Bot detected / blocked | Switch strategy, use fallback tier |
| 503 | Service unavailable | Retry with backoff |
Backoff Strategy
const backoff = {
initial: 1000, // 1 second
multiplier: 2,
maxDelay: 300000, // 5 minutes
maxRetries: 5
};
function getBackoffDelay(attempt: number): number {
const delay = backoff.initial * Math.pow(backoff.multiplier, attempt);
const jitter = Math.random() * 1000;
return Math.min(delay + jitter, backoff.maxDelay);
}
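A retry wrapper that ties the response table to getBackoffDelay. Only 429/503 are retried here; 403 and soft blocks go to the fallback tiers instead. The sleep pattern is illustrative:

```typescript
const RETRYABLE = new Set([429, 503]);

async function fetchWithRetry(url: string, headers: Record<string, string>): Promise<Response> {
  for (let attempt = 0; attempt < backoff.maxRetries; attempt++) {
    const response = await fetch(url, { headers });
    // 200, 403, etc. are handled upstream (process, or switch tier).
    if (!RETRYABLE.has(response.status)) return response;
    await new Promise((resolve) => setTimeout(resolve, getBackoffDelay(attempt)));
  }
  throw new Error(`Retries exhausted for ${url}`);
}
```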
4. Headers & Fingerprints
User-Agent
Always send a normal browser UA, consistently:
const USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';
Rules:
- Use a modern Chrome UA string
- Don't rotate UA every request
- Rotate per batch or worker instance at most
- Keep consistent within a session
Accept Headers
For RSS:
Accept: application/rss+xml, text/xml, application/xml;q=0.9
For HTML:
Accept: text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8
Standard Headers
const headers = {
'User-Agent': USER_AGENT,
'Accept': 'application/rss+xml, text/xml',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Cache-Control': 'no-cache',
};
Avoid Weird Behavior
- ❌ Don't spam HEAD requests
- ❌ Don't fetch same URL 50 times in a minute "to test"
- ❌ Don't send obviously programmatic patterns (perfectly timed requests)
5. Caching & Dedupe
Prevents self-DDoS and dirty data.
Response Caching
Cache RSS/HTML responses for 30-300 seconds depending on freshness needs.
Cache key:
{url}:{locale}:{query_hash}
Implementation (KV):
const CACHE_TTL = 120; // seconds
async function fetchWithCache(url: string, locale: string, query: string): Promise<Response> {
const cacheKey = `gnews:${url}:${locale}:${hashQuery(query)}`; // {url}:{locale}:{query_hash}
const cached = await KV.get(cacheKey, 'text');
if (cached) {
return new Response(cached, { headers: { 'X-Cache': 'HIT' } });
}
const response = await fetch(url, { headers });
const body = await response.text();
if (response.ok) {
  // Only cache successful responses; never cache blocks or error pages
  await KV.put(cacheKey, body, { expirationTtl: CACHE_TTL });
}
return new Response(body, { status: response.status, headers: { 'X-Cache': 'MISS' } });
}
Deduplication
The same story appears multiple times:
- Across keywords
- Across locales
- Across RSS and HTML
Dedupe by (in order):
1. Resolved canonical URL (primary)
   - Follow redirects
   - Normalize URL
   - SHA-256 hash
2. Publisher + title similarity (fallback)
   - Same publisher domain
   - Title similarity > 0.85
3. Article text fingerprint (final fallback)
   - Hash of cleaned article text
   - Catches republished content
See URL Deduplication for full implementation.
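A compact sketch of the first two stages. normalizeUrl is a hypothetical helper (the real rules live in URL Deduplication), and token-set Jaccard is a simple stand-in for whatever title-similarity metric you use:

```typescript
// Stage 1: canonical-URL dedupe key.
async function dedupeKey(canonicalUrl: string): Promise<string> {
  const normalized = normalizeUrl(canonicalUrl); // hypothetical: strip tracking params, lowercase host, etc.
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(normalized));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}

// Stage 2 fallback: same publisher domain + title similarity > 0.85.
function titleSimilarity(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tokens(a), tb = tokens(b);
  const intersection = [...ta].filter((t) => tb.has(t)).length;
  return intersection / new Set([...ta, ...tb]).size;
}
```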
6. Keyword Management
How you scale without dying.
Batch Keywords
Group keywords for processing:
| Batch Size | Use Case |
|---|---|
| 10 | High-frequency, hot keywords |
| 20 | Medium frequency |
| 50 | Long-tail, daily keywords |
Use Queues/Workflows to enforce pacing between batches.
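A sketch of chunking keywords into queue sends with the Cloudflare Queues producer API; the message body shape and the batching helper are ours:

```typescript
// Chunk keywords and enqueue one batch at a time; the queue consumer enforces pacing.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

async function enqueueKeywordBatches(keywords: string[], batchSize: number, queue: Queue) {
  for (const batch of chunk(keywords, batchSize)) {
    await queue.sendBatch(batch.map((keyword) => ({ body: { keyword } })));
  }
}
```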
Scheduling Tiers
Map keywords to our dynamic crawl-frequency tiers:
| Tier | Interval | Keywords |
|---|---|---|
| Hot | 5-10 min | Breaking news, trending topics |
| Warm | 15-30 min | Active topics |
| Normal | 60 min | Standard monitoring |
| Cold | 4 hours | Low activity |
| Frozen | 24 hours | Archival, long-tail |
Implementation
// In IngestKeyword Workflow
const schedule = {
hot: '*/5 * * * *', // Every 5 minutes
warm: '*/15 * * * *', // Every 15 minutes
normal: '0 * * * *', // Every hour
cold: '0 */4 * * *', // Every 4 hours
frozen: '0 0 * * *', // Daily
};
Anti-Patterns
- ❌ Never "loop forever" on a keyword list
- ❌ Never process all keywords in one batch
- ❌ Never ignore crawl tier when scheduling
7. Redirects & Canonicalization
Critical. Google News links often redirect or wrap URLs.
The Problem
Google News URL:
https://news.google.com/rss/articles/CBMiXmh0dHBzOi8vd3d3...
Redirects to:
https://www.nytimes.com/2024/01/15/technology/ai-announcement.html
Your Pipeline Must
1. Follow all redirects:

   const response = await fetch(url, { redirect: 'follow' });
   const finalUrl = response.url;

2. Extract canonical URL via:
   - Response chain (final URL after redirects)
   - <link rel="canonical"> in HTML
   - Structured data (ld+json)

3. Store both:

   google_news_url TEXT, -- Original GN link
   canonical_url TEXT,   -- Resolved final URL
Canonicalization Flow
Google News RSS Link
│
▼
Follow Redirects
│
▼
Get Final URL
│
▼
Fetch Article HTML
│
▼
Check <link rel="canonical">
│
├── Found? Use it
│
└── Not found? Use final redirect URL
│
▼
Normalize URL
│
▼
SHA-256 Hash
│
▼
Dedupe Check
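The same flow as code, in rough sketch form. STANDARD_HEADERS and dedupeKey come from earlier sections; the canonical-link regex assumes rel precedes href, and an HTML parser (or HTMLRewriter) is more robust:

```typescript
// Sketch: GN link -> final URL -> canonical -> normalized hash -> dedupe check.
async function canonicalize(googleNewsUrl: string): Promise<{ canonicalUrl: string; dedupeHash: string }> {
  // Follow all redirects to the publisher.
  const response = await fetch(googleNewsUrl, { redirect: 'follow', headers: STANDARD_HEADERS });
  let canonicalUrl = response.url;

  // Prefer <link rel="canonical"> when the article declares one.
  const html = await response.text();
  const match = html.match(/<link[^>]*rel=["']canonical["'][^>]*href=["']([^"']+)["']/i);
  if (match) canonicalUrl = match[1];

  // Normalize + hash, reusing the dedupeKey sketch from "Caching & Dedupe".
  const dedupeHash = await dedupeKey(canonicalUrl);
  return { canonicalUrl, dedupeHash };
}
```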
8. HTML Enrichment
Use HTML sparingly. Only when it's worth it.
When to Use HTML
- Cluster detection ("more coverage" links)
- ld+json extraction when present
- Richer timestamp/publisher context
- Missing fields in RSS
Throttling Rules
Only enrich:
- Top-N results per keyword (e.g., top 5)
- Items that pass a relevance threshold
- Items where RSS lacks required fields
// Only enrich top 5 per keyword
const rssItems = await parseRss(response);
const toEnrich = rssItems.slice(0, 5);
for (const item of toEnrich) {
if (needsEnrichment(item)) {
await enrichFromHtml(item);
}
}
function needsEnrichment(item: RssItem): boolean {
return !item.author || !item.fullDescription;
}
HTML Parsing Tips
// Look for ld+json first
const ldJson = doc.querySelector('script[type="application/ld+json"]');
if (ldJson?.textContent) {
  const data = JSON.parse(ldJson.textContent);
  // Extract structured data (headline, datePublished, author, ...)
}
// Fall back to Open Graph meta tags
const ogTitle = doc.querySelector('meta[property="og:title"]')?.getAttribute('content');
const ogDescription = doc.querySelector('meta[property="og:description"]')?.getAttribute('content');
9. 3-Tier Content Retrieval
This applies to both Google News and publisher sites.
Tier 1: Direct Fetch (Cloudflare Worker)
- Cost: Cheapest
- Speed: Fastest
- Success rate: ~70-80% of publishers
async function fetchTier1(url: string): Promise<FetchResult> {
const response = await fetch(url, {
headers: STANDARD_HEADERS,
cf: { cacheTtl: 300 }
});
if (response.ok) {
return { tier: 1, content: await response.text() };
}
throw new Error(`Tier 1 failed: ${response.status}`);
}
Tier 2: ZenRows (Anti-Bot Bypass)
- Use when: 403, bot walls, JS challenges, missing content
- Cost: Per-request pricing
- Track: Usage rate for cost monitoring
async function fetchTier2(url: string): Promise<FetchResult> {
const zenrowsUrl = `https://api.zenrows.com/v1/?apikey=${ZENROWS_KEY}&url=${encodeURIComponent(url)}&js_render=true`;
const response = await fetch(zenrowsUrl);
if (response.ok) {
return { tier: 2, content: await response.text() };
}
throw new Error(`Tier 2 failed: ${response.status}`);
}
Tier 3: Third-Party API (Last Resort)
- Use when: ZenRows fails, critical coverage needed
- Options: RapidAPI Google News, DataForSEO
- Cost: Highest
async function fetchTier3(url: string): Promise<FetchResult> {
// RapidAPI or DataForSEO
const response = await fetch(RAPIDAPI_ENDPOINT, {
  method: 'POST', // fetch defaults to GET, which can't carry a body
  headers: { 'X-RapidAPI-Key': RAPIDAPI_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ url })
});
if (response.ok) {
return { tier: 3, content: await response.json() };
}
throw new Error(`Tier 3 failed: ${response.status}`);
}
Unified Fetch Function
async function fetchWithFallback(url: string): Promise<FetchResult> {
// Tier 1: Direct
try {
return await fetchTier1(url);
} catch (e) {
log.warn('Tier 1 failed', { url, error: e.message });
}
// Tier 2: ZenRows
try {
return await fetchTier2(url);
} catch (e) {
log.warn('Tier 2 failed', { url, error: e.message });
}
// Tier 3: API
try {
return await fetchTier3(url);
} catch (e) {
log.error('All tiers failed', { url, error: e.message });
throw new Error('Fetch failed on all tiers');
}
}
Golden Rule
All tiers must produce the same normalized fields.
Downstream AI/classification doesn't care which tier fetched the content.
interface NormalizedArticle {
canonical_url: string;
headline: string;
body_text: string;
author?: string;
published_at?: string;
source_domain: string;
fetch_tier: 1 | 2 | 3;
}
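One way to enforce this is a single normalization choke point that every tier's result passes through. The extract* helpers below are hypothetical placeholders for your actual extraction logic:

```typescript
// Sketch: all tiers funnel through one normalizer, so downstream
// classification never sees tier-specific shapes.
function normalizeArticle(result: FetchResult, canonicalUrl: string): NormalizedArticle {
  // Tier 3 may return JSON rather than HTML; flatten to a string for extraction.
  const raw = typeof result.content === 'string' ? result.content : JSON.stringify(result.content);
  return {
    canonical_url: canonicalUrl,
    headline: extractHeadline(raw),        // hypothetical: ld+json / og:title / <title>
    body_text: extractBodyText(raw),       // hypothetical: readability-style extraction
    author: extractAuthor(raw) ?? undefined,
    published_at: extractPublishedAt(raw) ?? undefined,
    source_domain: new URL(canonicalUrl).hostname,
    fetch_tier: result.tier,
  };
}
```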
10. Multi-Location Crawling
When crawling from different locales/regions.
A. Locale-Aware Storage
Store with each discovery:
CREATE TABLE keyword_crawl_results (
id TEXT PRIMARY KEY,
keyword_id TEXT,
locale_hl TEXT, -- 'en-US'
locale_gl TEXT, -- 'US'
locale_ceid TEXT, -- 'US:en'
rank_position INTEGER, -- Position in this locale's results
google_news_url TEXT,
discovered_at TEXT
);
B. Cross-Locale Story Merging
A story might appear in the UK edition but not the US one.
Keep locale-specific data:
- Rank position per locale
- Discovery timestamp per locale
- Source variations per locale
Merge story identity by:
- Canonical URL (primary)
- Entity + time similarity (fallback)
async function mergeAcrossLocales(articles: Article[]): Promise<Story[]> {
const byCanonical = groupBy(articles, 'canonical_url');
return Object.entries(byCanonical).map(([url, variants]) => ({
canonical_url: url,
locales: variants.map(v => ({
locale: v.locale_ceid,
rank: v.rank_position,
discovered_at: v.discovered_at
})),
// Use highest-ranked variant for display
primary: variants.sort((a, b) => a.rank_position - b.rank_position)[0]
}));
}
C. Coverage Gaps
- Some publishers appear more in certain regions
- Don't treat absence as "not happening"
- Track which locales found which stories (see the query sketch below)
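A query sketch for surfacing those gaps from keyword_crawl_results via the D1 binding. It assumes a canonical_url column is populated after URL resolution, and CANONICAL_LOCALES comes from the Location Strategy sketch:

```typescript
// Stories that at least one canonical locale did not surface.
const gaps = await env.DB.prepare(`
  SELECT canonical_url,
         GROUP_CONCAT(DISTINCT locale_ceid) AS found_in_locales
  FROM keyword_crawl_results
  GROUP BY canonical_url
  HAVING COUNT(DISTINCT locale_ceid) < ?
`).bind(CANONICAL_LOCALES.length).all();
```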
11. Failure Modes & Guardrails
Real-world "don't get wrecked" list.
Guardrails
| Guardrail | Implementation |
|---|---|
| Per-keyword min refresh interval | KV with TTL |
| Per-domain concurrency caps | Semaphore in Durable Object |
| Centralized backoff state | KV or D1 |
| Circuit breaker | Trip when 429/403 rate > 20% |
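A sketch of the per-domain concurrency cap as a Durable Object, one instance per publisher domain. The cap value is illustrative, and a production version would add timeouts so crashed workers can't leak slots:

```typescript
// Sketch: in-memory semaphore; state resets if the object is evicted,
// so persist via this.state.storage if that matters for your workload.
export class DomainSemaphore {
  private inFlight = 0;
  private readonly maxConcurrent = 3; // illustrative cap per domain

  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const action = new URL(request.url).pathname;
    if (action === '/acquire') {
      if (this.inFlight >= this.maxConcurrent) {
        return new Response('busy', { status: 429 }); // caller should back off
      }
      this.inFlight++;
      return new Response('ok');
    }
    if (action === '/release') {
      this.inFlight = Math.max(0, this.inFlight - 1);
      return new Response('ok');
    }
    return new Response('unknown action', { status: 400 });
  }
}
```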
Circuit Breaker Example:
const CIRCUIT_BREAKER_THRESHOLD = 0.2; // 20% failure rate
const CIRCUIT_BREAKER_WINDOW = 60000; // 1 minute
async function checkCircuitBreaker(domain: string): Promise<boolean> {
const stats = await KV.get(`circuit:${domain}`, 'json');
if (!stats) return true; // Circuit closed, proceed
const failureRate = stats.failures / stats.total;
if (failureRate > CIRCUIT_BREAKER_THRESHOLD) {
if (Date.now() - stats.lastTrip < CIRCUIT_BREAKER_WINDOW) {
return false; // Circuit open, don't proceed
}
}
return true; // Circuit closed, proceed
}
Failure Modes
| Failure | Symptom | Response |
|---|---|---|
| Markup changes | HTML parsing breaks | Fall back to RSS |
| Partial fetch failures | Some URLs fail | Retry later, don't drop silently |
| Duplicate storms | Same URL via many keywords | Dedupe at ingest |
| Soft blocks | 200 OK but empty/garbage | Content quality checks |
| Rate limit cascade | Multiple keywords hit limits | Global backoff |
Quality Checks
function validateArticle(article: NormalizedArticle): ValidationResult {
const issues: string[] = [];
// Empty title
if (!article.headline?.trim()) {
issues.push('empty_title');
}
// Very short text (likely blocked or wrong extraction)
if (article.body_text && article.body_text.length < 200) {
issues.push('short_text');
}
// No text at all
if (!article.body_text) {
issues.push('no_text');
}
// Non-news page indicators
if (isNonNewsPage(article)) {
issues.push('non_news_page');
}
return {
valid: issues.length === 0,
issues,
quarantine: issues.includes('no_text') || issues.includes('non_news_page')
};
}
12. Logging Requirements
So you can debug in 10 minutes, not 10 hours.
Per-Request Logging
interface RequestLog {
// Identity
request_id: string;
timestamp: string;
// Target
url: string;
locale: string;
keyword_id: string;
source_type: 'rss' | 'html';
// Result
status_code: number;
latency_ms: number;
bytes_downloaded: number;
// Cache
cache_hit: boolean;
// Retry
retry_count: number;
backoff_ms: number;
// Tier
fetch_tier: 1 | 2 | 3;
}
Per-Article Logging
interface ArticleLog {
// Identity
article_id: string;
request_id: string;
// URL resolution
google_news_url: string;
redirect_chain: string[];
canonical_url: string;
// Extraction
extraction_success: boolean;
has_author: boolean;
has_date: boolean;
text_length: number;
// Dedupe
dedupe_key: string;
was_merged: boolean;
merged_with?: string;
// Pipeline
status: 'pending' | 'processing' | 'complete' | 'failed';
embedded: boolean;
classified: boolean;
scored: boolean;
}
Log Aggregation
Send to ClickHouse for time-series analysis:
-- Crawl success rate over time
SELECT
toStartOfHour(timestamp) as hour,
countIf(status_code = 200) / count() as success_rate
FROM crawl_logs
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;
-- Tier usage distribution
SELECT
fetch_tier,
count() as requests,
avg(latency_ms) as avg_latency
FROM crawl_logs
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY fetch_tier;
13. Rules of Engagement Checklist
Minimal "rules of engagement" for Google News crawling:
- RSS first, HTML second, never depend on internal JSON
- Slow and steady beats fast and blocked (~1 req/sec sustained)
- Cache + dedupe or you'll hammer Google accidentally
- Treat locale as a first-class dimension (hl/gl/ceid)
- Follow redirects; canonicalize everything
- Use the 3-tier fetch plan consistently (Worker → ZenRows → API)
- Store raw snapshots in R2 with retention policy
- Instrument everything; add circuit breakers
- Quality check all content (empty/short = quarantine)
- Log per-request and per-article for debugging
Integration with Topic Intel
Workflow Mapping
This playbook maps to our IngestKeyword Workflow:
| Playbook Section | Workflow Step |
|---|---|
| RSS fetching | Step 3: Enqueue discovery jobs |
| Rate limiting | Step 3: Queue pacing |
| Redirects & canonicalization | Step 5: Dedupe URLs |
| 3-tier fetch | Step 6: Enqueue fetch jobs |
| Quality checks | Step 8: Parse results |
| Logging | Step 14: Emit to ClickHouse |
Queue Mapping
| Playbook Concern | Queue |
|---|---|
| RSS discovery | discovery.google |
| HTML enrichment | discovery.google (with flag) |
| Tier 1 fetch | fetch.direct |
| Tier 2 fetch | fetch.zenrows |
| Tier 3 fetch | fetch.rapidapi |
Storage Mapping
| Data | Storage |
|---|---|
| Response cache | KV (gnews:*) |
| Backoff state | KV (backoff:*) |
| Circuit breaker | KV (circuit:*) |
| Raw HTML | R2 |
| Article metadata | D1 |
| Crawl logs | ClickHouse |
Related Documentation
- Architecture Overview - System design
- URL Deduplication - Full dedupe implementation
- External APIs - ZenRows, DataForSEO, RapidAPI integration
- Runbook - Operational procedures for failures
This playbook is a living document. Update as Google News behavior changes.