Being "Seen" and Feeling Part Of

https://growingfearless.substack.com/p/on-being-seen
1•josmor•58s ago•0 comments

The Seifert–Van Kampen Theorem in Homotopy Type Theory (2016) [pdf]

https://home.sandiego.edu/~shulman/papers/vankampen.pdf
1•measurablefunc•1m ago•0 comments

ByViewer – Watch Instagram Stories Anonymously, No Login Needed

https://byviewer.com/
1•cui511511•2m ago•1 comments

Powerful 6.3 quake kills at least 20 in Afghanistan, hundreds injured

https://www.reuters.com/business/environment/magnitude-63-earthquake-hits-afghanistans-hindu-kush...
2•teleforce•5m ago•0 comments

FerroElectric RAM

https://en.wikipedia.org/wiki/Ferroelectric_RAM
1•brudgers•5m ago•0 comments

De-escalating Tailscale CGNAT conflict

https://ysun.co/tscgnat/
1•birdculture•6m ago•0 comments

High-Quality Branded Envelopes for Business Mail

1•skyprint•6m ago•0 comments

Beyond Start and End: PostgreSQL Range Types

https://boringsql.com/posts/beyond-start-end-columns/
2•radimm•10m ago•0 comments

Intuitive UIs support user habits

https://blog.julik.nl/2025/10/what-does-intuitive-even-mean
1•julik•13m ago•0 comments

Show HN: Face Fusion is a fun way to blend faces using AI

https://deepfacefusion.com
2•artemisForge77•14m ago•0 comments

Text Fragments Enable Deep Linking on Web Pages

https://tidbits.com/2025/04/23/text-fragments-enable-deep-linking-on-web-pages/
1•hexage1814•15m ago•0 comments

Cave Rescue That Captivated the Nation

https://www.mentalfloss.com/history/1925-cave-rescue-that-captivated-the-united-states-floyd-collins
1•jameslk•16m ago•0 comments

Thinking Clearly

https://lemire.me/blog/2025/10/26/thinking-clearly/
2•kiyanwang•18m ago•0 comments

The Authoritarian Stack: How Tech Billionaires Are Building a Post-Democratic US

https://www.authoritarian-stack.info/
2•negativelambda•20m ago•1 comments

Cartography of Generative AI

https://cartography-of-generative-ai.net/
1•giuliomagnifico•21m ago•0 comments

I Ask AI for Permission Now (and I Hate Myself for It)

https://www.codecabin.dev/post/i-ask-ai-for-permission-now
1•rebelchrisycom•23m ago•0 comments

CEO Andy Jassy says Amazon's 14,000 layoffs weren't about cutting costs or AI

https://fortune.com/2025/11/01/ceo-andy-jassy-amazon-layoffs-about-culture-not-ai/
1•hansmayer•24m ago•0 comments

Show HN: Easy Text Tools – 130 text utilities that run in the browser

https://easytexttool.com/
2•msdg2024•27m ago•0 comments

A New Faster Algorithm for Gregorian Date Conversion

https://www.benjoffe.com/fast-date
4•benjoffe•27m ago•0 comments

FSF40 Hackathon

https://www.fsf.org/events/fsf40-hackathon
2•salutis•27m ago•0 comments

Installing a MW of nuclear on land is 2x as expensive as on an Aircraft Carrier

https://twitter.com/andercot/status/1984847618489651636
1•MrBuddyCasino•31m ago•0 comments

One-prompt ArXiv filter: parse email digest, output three papers

https://quickchat.ai/post/one-prompt-arxiv-filter
1•piotrgrudzien•33m ago•0 comments

Eigen 5.0.0

https://gitlab.com/libeigen/eigen/-/releases
2•anewhnaccount2•33m ago•0 comments

ecoCompute - Green Tech Conference for Sustainability in Software

https://www.eco-compute.io
1•ArneTR•34m ago•0 comments

Sora AI – Turn Your Words into Videos Powered by OpenAI

https://sora-ai.one/en
1•ucollabn•37m ago•1 comments

I Built an AI to Roast My Own About Page. It Was Brutal (and Right)

https://www.startastory.app/blog/about-page-research/
2•blakey_vibes•40m ago•0 comments

Postbase – Self-Hosted Firebase Alternate – Node, Express, BetterAuth, Postgres

https://github.com/umrashrf/postbase
2•umrashrf•41m ago•0 comments

Master your AI work loads

https://github.com/ulixcode-labs/GPU-pro
2•gpupromain•42m ago•0 comments

Steps to turn off AI

https://againstdata.com/blog/stop-ai
25•austinallegro•44m ago•4 comments

Dynamically include files in GitLab-CI

https://www.zufallsheld.de/2025/10/03/dynamic-gitlab-ci-includes/
1•zufallsheld•46m ago•1 comments

Ask HN: Is Common Crawl used exhaustively by any search engine?

8•n1xis10t•6h ago
Common Crawl has about 300 billion pages in it, and if you downloaded all of it in extracted-text format it would take up only about 816 TB compressed. If someone were to build a search engine on this, I think it would be more comprehensive than Bing, and possibly pretty similar to Google. The only CC-based search engines that I know of use a tiny fraction of what is available. Do you know of any that use the whole thing?
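
For a sense of scale, the two figures in the post can be turned into a rough per-page size. This is only a back-of-envelope sketch: it assumes decimal terabytes (1 TB = 1e12 bytes), takes the 300 billion and 816 TB numbers at face value, and the 20 TB drive size is a hypothetical value chosen purely for illustration.

```python
import math

# Approximate figures quoted in the post; neither is an exact measurement.
PAGES = 300e9          # about 300 billion pages
COMPRESSED_TB = 816    # about 816 TB of compressed extracted text


def bytes_per_page(pages: float, total_tb: float) -> float:
    """Average compressed bytes per page, assuming 1 TB = 1e12 bytes."""
    return total_tb * 1e12 / pages


def drives_needed(total_tb: float, drive_tb: float) -> int:
    """Drives needed to hold the data, ignoring redundancy and index overhead."""
    return math.ceil(total_tb / drive_tb)


print(bytes_per_page(PAGES, COMPRESSED_TB))  # 2720.0
print(drives_needed(COMPRESSED_TB, 20))      # 41 (20 TB drive size is an assumption)
```

So the quoted numbers work out to roughly 2.7 KB of compressed text per page. Raw storage is only part of the picture, since a searchable index needs additional space on top of the text itself.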

Comments

agentbox•14m ago
To my knowledge, no public search engine indexes the full Common Crawl corpus. Projects like Neeva (before shutting down) and some academic prototypes used parts of it for evaluation, but none have managed to process all 300B pages continuously.

The biggest practical barriers are deduplication, spam filtering, and keeping the index fresh: CC snapshots are monthly, but the quality varies a lot.
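
As a rough illustration of the deduplication barrier, here is a minimal sketch of exact-duplicate removal by hashing normalized page text. It is not how any particular search engine or Common Crawl pipeline does it. The function names `normalize` and `dedupe` are hypothetical, and it only catches pages that are identical after lowercasing and whitespace collapsing; near-duplicate detection is a harder problem and is not shown.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe(pages: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs, skipping any page whose normalized text was already seen."""
    seen = set()
    for url, text in pages:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier page
        seen.add(digest)
        yield url, text


sample = [
    ("a.example/1", "Hello   World"),
    ("b.example/2", "hello world"),   # same text after normalization
    ("c.example/3", "Something else"),
]
print([url for url, _ in dedupe(sample)])  # ['a.example/1', 'c.example/3']
```

The in-memory set is the part that does not scale: at 300 billion pages, 32-byte digests alone would come to about 9.6 TB, so a real pipeline would need a sharded or disk-backed structure instead.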

For experimentation, you can look at projects like CCNet, Elasticsearch’s open-source pipelines, or small-scale engines such as Marginalia Search, which use subsets for niche purposes.