Dawn Scraper
Problem
Dawn News, the largest English-language newspaper in Pakistan, has no programmatic API.
Anyone building NLP pipelines on Pakistani news data (sentiment models, topic classifiers, summarizers) has no clean data source. They either scrape badly or give up. This tool closes that gap by scraping Dawn section pages into structured, NLP-ready JSON.
Architecture
Key Decision
Used Python dataclasses for the output schema, not dicts. This enforces field presence at parse time, not at analysis time: bugs appear immediately rather than silently corrupting downstream models.

dawn.com/{section}/
  └─→ index parse                  # article link discovery
        └─→ fetch(url)             # per-article request
              └─→ parse            # BeautifulSoup4 selectors
                    └─→ @dataclass Article {
                          title, body, author,
                          date, category, url, word_count
                        }
                          └─→ json.dump() → NLP-ready file
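The schema decision can be illustrated with a minimal sketch. The field names come from the pipeline diagram; the field types, the build_article helper, and the sample values (including the placeholder URL) are assumptions made for illustration, not the project's actual code. The sketch shows how a dataclass rejects an incomplete record at construction time, where a plain dict would pass it through silently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Article:
    # Field names follow the diagram; the types are assumptions.
    title: str
    body: str
    author: str
    date: str
    category: str
    url: str
    word_count: int


def build_article(fields: dict) -> Article:
    """Hypothetical helper: turn parsed fields into an Article.

    word_count is derived from the body. Any missing required field
    raises TypeError here, before anything is written to disk.
    """
    word_count = len(fields.get("body", "").split())
    return Article(word_count=word_count, **fields)


# Placeholder values, not real scraped data.
parsed = {
    "title": "Example headline",
    "body": "Placeholder body text for illustration only.",
    "author": "Staff Reporter",
    "date": "2024-01-01",
    "category": "pakistan",
    "url": "https://www.dawn.com/news/example",
}
article = build_article(parsed)
serialized = json.dumps(asdict(article), ensure_ascii=False, indent=2)

# Dropping a field fails immediately instead of producing a partial record.
incomplete = {k: v for k, v in parsed.items() if k != "author"}
try:
    build_article(incomplete)
    missing_field_rejected = False
except TypeError:
    missing_field_rejected = True
```

With a dict, the missing author would only surface later as a KeyError (or a silent None) somewhere inside the analysis code.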
Impact / Learnings
The project demonstrated that clean data engineering upstream simplifies downstream tasks. The word_count field alone saves a re-read of the body for any length-filtering step.
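As a rough illustration of the length-filtering point, the sketch below filters records on word_count without touching the body text. The records and both thresholds are made up for the example; only the word_count field name comes from the schema.

```python
import json

# Hypothetical records in the shape the scraper writes; values are made up
# and the other schema fields are omitted for brevity.
raw = json.dumps([
    {"title": "Short brief", "word_count": 80},
    {"title": "Long feature", "word_count": 1200},
    {"title": "Standard report", "word_count": 450},
])

# Assumed thresholds; tune them per task.
MIN_WORDS = 100
MAX_WORDS = 1000

articles = json.loads(raw)
# The filter reads one integer per record instead of re-tokenizing the body.
kept = [a for a in articles if MIN_WORDS <= a["word_count"] <= MAX_WORDS]
kept_titles = [a["title"] for a in kept]
```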
The next version would add async fetching with aiohttp. Sequential requests are the bottleneck at scale, and concurrent fetching could plausibly give a speedup on the order of 10x.
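One possible shape for the async version is sketched below. It is not part of the current tool: the function names fetch_all and scrape and the concurrency limit of 10 are assumptions. The semaphore caps the number of in-flight requests so the speedup does not come at the cost of hammering the site. aiohttp is a third-party package and is imported only inside scrape.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[str]],
    max_concurrent: int = 10,  # assumed limit, chosen to stay polite
) -> List[str]:
    """Run fetch_one over urls concurrently, at most max_concurrent at a time.

    Results come back in the same order as urls.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch_one(url)

    return await asyncio.gather(*(bounded(url) for url in urls))


async def scrape(urls: List[str]) -> List[str]:
    """Hypothetical entry point: download each article page as HTML text."""
    import aiohttp  # third-party; install separately

    async with aiohttp.ClientSession() as session:

        async def fetch_one(url: str) -> str:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()

        return await fetch_all(urls, fetch_one)
```

Calling asyncio.run(scrape(article_urls)) would return the raw HTML for each URL, ready for the existing parse step. The actual speedup depends on network latency and on how many concurrent requests the site tolerates.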