Ask HN: Scaling a targeted web crawler beyond 500M pages/day

11•honungsburk•10h ago
I've been reading up on crawler architecture. The two most useful sources I've found are the blog post "Crawling a billion web pages in just over 24 hours, in 2025" and the Mercator paper ("Mercator: A Scalable, Extensible Web Crawler").

Both of these, and most other material I've come across, focus on crawling the broad open web rather than a targeted set of domains. Our use case, product prices, is the latter. Mercator calls out DNS resolution as a major bottleneck, for example, but when you're only hitting a few hundred domains that isn't really a concern.
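To illustrate why DNS stops mattering at this size: with only a few hundred hostnames, a small in-process cache can hold every answer, so almost no fetch triggers a real lookup. Below is a rough sketch of that idea, not a description of our system. `DnsCache`, `system_resolve`, and the fixed 300-second TTL are names and values made up for illustration; a real resolver would honor the TTL on each DNS record instead of a fixed one.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def system_resolve(host: str) -> List[str]:
    """Resolve a hostname to its IP addresses using the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


class DnsCache:
    """Hypothetical in-process cache with a fixed TTL (an assumption)."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]] = system_resolve,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        # host -> (expiry time, addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, host: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, no network round trip
        addresses = self._resolve(host)
        self._entries[host] = (now + self._ttl, addresses)
        return addresses
```

With a few hundred domains the cache never grows beyond a few hundred entries, so there is no eviction logic here; a broad crawl would need one.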

The other gap is that both assume static HTML. For our use case we need a headless browser, and we also have to deal with Cloudflare and similar anti-bot systems.

For product prices specifically, a lot of sites publish price feeds, which simplifies things, but plenty don't, and getting good coverage still requires scraping. Our current system does about 500M pages/day and we're looking to improve its performance.
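To put the 500M pages/day figure in per-second terms, here is a quick back-of-the-envelope calculation. The domain count of 300 is only a stand-in for "a few hundred", and the even split across domains is an assumption; real traffic is certainly skewed toward the larger sites.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_second(pages_per_day: int) -> float:
    """Sustained fetch rate needed to hit a daily page target."""
    return pages_per_day / SECONDS_PER_DAY


def per_domain_rate(pages_per_day: int, domain_count: int) -> float:
    """Average per-domain rate, assuming an even split (an assumption)."""
    return pages_per_second(pages_per_day) / domain_count


total = pages_per_second(500_000_000)
print(f"{total:.0f} pages/s overall")  # prints "5787 pages/s overall"

# 300 is a hypothetical stand-in for "a few hundred domains".
print(f"{per_domain_rate(500_000_000, 300):.1f} pages/s per domain")  # prints "19.3 pages/s per domain"
```

So even averaged out, each domain sees on the order of tens of requests per second, which is why per-site rate limits and anti-bot systems matter more here than they do in a broad crawl.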

Does anyone here have experience in this space, or know of articles/blog posts on scaling targeted (rather than broad) crawlers with headless browsers? Any pointers appreciated.

Comments

4lx87•1h ago
I'm curious, how do you deal with Cloudflare and similar anti-bot systems? Just keep shopping the job around to different proxies?
