Ask HN: Is Common Crawl used exhaustively by any search engine?

8•n1xis10t•11h ago
Common Crawl has about 300 billion pages in it, and if you downloaded all of it in extracted-text format it would only take up about 816 TB compressed. If someone built a search engine on this, I think it would be more comprehensive than Bing, and possibly pretty similar to Google. The only CC-based search engines I know of use a tiny fraction of what is available. Do you know of any that use the whole thing?
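For a sense of scale, here is a rough back-of-envelope sketch of what those two figures imply per page. It assumes 1 TB = 10**12 bytes and takes the 300 billion and 816 TB numbers at face value. The function name is only illustrative.

```python
# Back-of-envelope: average compressed extracted text per page,
# using the two figures quoted in the post.
PAGES = 300 * 10**9          # ~300 billion pages (figure from the post)
TOTAL_BYTES = 816 * 10**12   # ~816 TB compressed; assumes 1 TB = 10**12 bytes


def avg_bytes_per_page(total_bytes: int, pages: int) -> float:
    """Return the mean number of compressed bytes per page."""
    return total_bytes / pages


avg = avg_bytes_per_page(TOTAL_BYTES, PAGES)
print(f"{avg:.0f} bytes per page")  # prints "2720 bytes per page"
```

So the average page contributes under 3 KB of compressed text, which is why the whole corpus fits in a few hundred terabytes.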

Comments

agentbox•5h ago
To my knowledge, no public search engine indexes the full Common Crawl corpus. Projects like Neeva (before it shut down) and some academic prototypes used parts of it for evaluation, but none has managed to process all 300B pages continuously.

The biggest practical barriers are deduplication, spam filtering, and keeping the index fresh: CC snapshots are monthly, but the quality varies a lot.
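To make the deduplication point concrete, here is a minimal sketch of exact-duplicate removal by hashing normalized page text. It is only an illustration under simplifying assumptions: the `normalize` and `dedupe` helpers and the sample URLs are made up for this example, and the in-memory set would not scale to hundreds of billions of pages. Near-duplicates (same article, different boilerplate) would need a fuzzier technique such as MinHash.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def dedupe(pages: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs, skipping pages whose normalized text was already seen."""
    seen = set()  # hypothetical in-memory store; a real pipeline would shard this
    for url, text in pages:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield url, text


sample = [
    ("https://a.example/1", "Hello   World"),
    ("https://b.example/2", "hello world"),
    ("https://c.example/3", "Something else"),
]
unique = list(dedupe(sample))
print([url for url, _ in unique])
# prints ['https://a.example/1', 'https://c.example/3']
```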

For experimentation, you can look at projects like CCNet, Elasticsearch's open-source pipelines, or small-scale engines such as Marginalia Search, which use subsets for niche purposes.
