Dawn Scraper

LIVE

Problem

There is no programmatic API for Dawn News — the largest English-language newspaper in Pakistan.

Anyone building NLP pipelines on Pakistani news data (sentiment models, topic classifiers, summarizers) has no clean data source. They either scrape badly or give up. This tool removes that friction by turning Dawn articles into structured, NLP-ready JSON.

Architecture

Key Decision

Used Python dataclasses for the output schema, not dicts. A dataclass without default values refuses to construct an object when a field is missing, so field presence is enforced at parse time rather than at analysis time. Bugs appear immediately instead of silently corrupting downstream models.
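A minimal sketch of what such a schema could look like. The field names come from the pipeline diagram in this document; the field types, the make_article helper, and the whitespace-based word count are assumptions made for illustration, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Field names follow the pipeline diagram; the types are assumptions.
    title: str
    body: str
    author: str
    date: str        # assumed to be stored as a string, e.g. ISO 8601
    category: str
    url: str
    word_count: int


def make_article(title: str, body: str, author: str,
                 date: str, category: str, url: str) -> Article:
    """Hypothetical helper: derive word_count once, at parse time."""
    return Article(
        title=title,
        body=body,
        author=author,
        date=date,
        category=category,
        url=url,
        word_count=len(body.split()),  # assumption: whitespace tokenization
    )


def try_incomplete() -> str:
    """Show that a missing field fails at construction, not later."""
    try:
        Article(title="Only a title", body="No other fields")  # type: ignore[call-arg]
    except TypeError as exc:
        return f"rejected: {exc}"
    return "accepted"
```

Note that a dataclass checks that every field is supplied, not that each value has the annotated type; type validation would need an extra step such as a __post_init__ check.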

dawn.com/{section}/
  └─→ index parse                  # article link discovery
      └─→ fetch(url)               # per-article request
          └─→ parse                # BeautifulSoup4 selectors
              └─→ @dataclass Article { title, body, author, date, category, url, word_count }
                  └─→ json.dump() → NLP-ready file
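The last stage of the pipeline (dataclass to JSON) can be sketched with the standard library alone. This is an illustration under assumptions: the Article types, the dump_articles name, and the choice of ensure_ascii=False and indent=2 are not taken from the project.

```python
import json
from dataclasses import asdict, dataclass
from typing import IO, Iterable


@dataclass
class Article:
    # Same fields as in the diagram; types are assumptions.
    title: str
    body: str
    author: str
    date: str
    category: str
    url: str
    word_count: int


def dump_articles(articles: Iterable[Article], fp: IO[str]) -> None:
    """Write articles as one JSON array of objects to an open text file."""
    records = [asdict(article) for article in articles]
    # ensure_ascii=False keeps any non-ASCII characters readable in the file.
    json.dump(records, fp, ensure_ascii=False, indent=2)
```

A caller would open the target with open(path, "w", encoding="utf-8") and pass the handle in, which keeps the function easy to test against an in-memory buffer.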

Impact / Learnings

Demonstrated that clean data engineering upstream simplifies the downstream tasks that consume it. The word_count field alone saves re-tokenizing the body for any length-filtering step.
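As an illustration of the length-filtering case, a consumer of the JSON output could select articles on word_count without touching the body text. The filter_by_length name and its parameters are hypothetical; the only thing assumed about the data is that each record carries a word_count key, as in the schema.

```python
from typing import Optional


def filter_by_length(records: list[dict], min_words: int = 0,
                     max_words: Optional[int] = None) -> list[dict]:
    """Keep records whose precomputed word_count lies in [min_words, max_words]."""
    kept = []
    for record in records:
        count = record["word_count"]  # read the stored value; body is never split
        if count < min_words:
            continue
        if max_words is not None and count > max_words:
            continue
        kept.append(record)
    return kept
```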

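Would add async fetching with aiohttp for the next version. Sequential requests are the bottleneck at scale, since each fetch waits on network latency, so issuing requests concurrently should give a large speedup (on the order of 10x looks achievable, though this has not been measured yet).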
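A rough sketch of how that next version might look, assuming aiohttp is installed. The function names, the concurrency limit of 5, and the 30-second timeout are illustrative choices, not part of the current tool. The semaphore caps the number of requests in flight so the speedup does not come at the cost of hammering the site.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def fetch_all(urls: Sequence[str],
                    fetch_one: Callable[[str], Awaitable[str]],
                    max_concurrency: int = 5) -> list[str]:
    """Run fetch_one over urls concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch_one(url)

    # gather returns results in the same order as the input urls.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fetch_with_aiohttp(urls: Sequence[str],
                             max_concurrency: int = 5) -> list[str]:
    """Hypothetical entry point: fetch page HTML for each url via aiohttp."""
    import aiohttp  # third-party dependency, imported lazily

    timeout = aiohttp.ClientTimeout(total=30)  # assumed value
    async with aiohttp.ClientSession(timeout=timeout) as session:

        async def fetch_one(url: str) -> str:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()

        return await fetch_all(urls, fetch_one, max_concurrency)
```

Calling asyncio.run(fetch_with_aiohttp(article_urls)) would replace the sequential fetch loop. Keeping fetch_all independent of aiohttp means the concurrency logic can be exercised with a fake fetcher, without network access.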