
U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

https://www.cbp.gov/newsroom/stats/reported-employee-arrests
1•ludicrousdispla•1m ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
1•vladeta•6m ago•1 comment

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

https://github.com/thealidev/VectorVision-SVGV
1•thealidev•8m ago•0 comments

Study of 150 developers shows AI generated code no harder to maintain long term

https://www.youtube.com/watch?v=b9EbCb5A408
1•lifeisstillgood•8m ago•0 comments

Spotify now requires premium accounts for developer mode API access

https://www.neowin.net/news/spotify-now-requires-premium-accounts-for-developer-mode-api-access/
1•bundie•11m ago•0 comments

When Albert Einstein Moved to Princeton

https://twitter.com/Math_files/status/2020017485815456224
1•keepamovin•12m ago•0 comments

Agents.md as a Dark Signal

https://joshmock.com/post/2026-agents-md-as-a-dark-signal/
1•birdculture•14m ago•0 comments

System time, clocks, and their syncing in macOS

https://eclecticlight.co/2025/05/21/system-time-clocks-and-their-syncing-in-macos/
1•fanf2•15m ago•0 comments

McCLIM and 7GUIs – Part 1: The Counter

https://turtleware.eu/posts/McCLIM-and-7GUIs---Part-1-The-Counter.html
1•ramenbytes•18m ago•0 comments

So whats the next word, then? Almost-no-math intro to transformer models

https://matthias-kainer.de/blog/posts/so-whats-the-next-word-then-/
1•oesimania•19m ago•0 comments

Ed Zitron: The Hater's Guide to Microsoft

https://bsky.app/profile/edzitron.com/post/3me7ibeym2c2n
2•vintagedave•22m ago•1 comment

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
1•__natty__•23m ago•0 comments

Show HN: Android-based audio player for seniors – Homer Audio Player

https://homeraudioplayer.app
2•cinusek•23m ago•0 comments

Starter Template for Ory Kratos

https://github.com/Samuelk0nrad/docker-ory
1•samuel_0xK•25m ago•0 comments

LLMs are powerful, but enterprises are deterministic by nature

2•prateekdalal•28m ago•0 comments

Make your iPad 3 a touchscreen for your computer

https://github.com/lemonjesus/ipad-touch-screen
2•0y•33m ago•1 comment

Internationalization and Localization in the Age of Agents

https://myblog.ru/internationalization-and-localization-in-the-age-of-agents
1•xenator•34m ago•0 comments

Building a Custom Clawdbot Workflow to Automate Website Creation

https://seedance2api.org/
1•pekingzcc•36m ago•1 comment

Why the "Taiwan Dome" won't survive a Chinese attack

https://www.lowyinstitute.org/the-interpreter/why-taiwan-dome-won-t-survive-chinese-attack
2•ryan_j_naughton•37m ago•0 comments

Xkcd: Game AIs

https://xkcd.com/1002/
1•ravenical•38m ago•0 comments

Windows 11 is finally killing off legacy printer drivers in 2026

https://www.windowscentral.com/microsoft/windows-11/windows-11-finally-pulls-the-plug-on-legacy-p...
1•ValdikSS•39m ago•0 comments

From Offloading to Engagement (Study on Generative AI)

https://www.mdpi.com/2306-5729/10/11/172
1•boshomi•41m ago•1 comment

AI for People

https://justsitandgrin.im/posts/ai-for-people/
1•dive•42m ago•0 comments

Rome is studded with cannon balls (2022)

https://essenceofrome.com/rome-is-studded-with-cannon-balls
1•thomassmith65•47m ago•0 comments

8-piece tablebase development on Lichess (op1 partial)

https://lichess.org/@/Lichess/blog/op1-partial-8-piece-tablebase-available/1ptPBDpC
2•somethingp•49m ago•0 comments

US to bankroll far-right think tanks in Europe against digital laws

https://www.brusselstimes.com/1957195/us-to-fund-far-right-forces-in-europe-tbtb
4•saubeidl•50m ago•0 comments

Ask HN: Have AI companies replaced their own SaaS usage with agents?

1•tuxpenguine•52m ago•0 comments

pi-nes

https://twitter.com/thomasmustier/status/2018362041506132205
1•tosh•55m ago•0 comments

Show HN: Crew – Multi-agent orchestration tool for AI-assisted development

https://github.com/garnetliu/crew
1•gl2334•55m ago•0 comments

New hire fixed a problem so fast, their boss left to become a yoga instructor

https://www.theregister.com/2026/02/06/on_call/
1•Brajeshwar•56m ago•0 comments

Bots are overwhelming websites with their hunger for AI data

https://www.theregister.com/2025/06/17/bot_overwhelming_websites_report/
31•Bender•7mo ago

Comments

tartoran•7mo ago
RIP internet. It will soon make no sense to share something with the world unless you're in it for profit. But who's gonna pay for it?
superkuh•7mo ago
While catchy, that headline kind of misses the point. It should be "Corporations are overwhelming websites with their hunger for AI data". They're the ones doing it, and corporations are by far the most damaging non-human persons (especially since they are formed nowadays to abstract away liability for the damage they cause).

This is not some new enemy, "bots". These are the same old non-human legal persons that polluted our physical world, now repeating the pattern in the digital one. Bots run by actual human persons are not the problem.

Analemma_•7mo ago
I'm not sure that's true. As hardware gets cheaper, you're going to see more and more people wanting to build+deploy their own personal LLMs to avoid the guardrails/censorship (or just the cost) of the commercial ones, and that means scraping the internet themselves. I suspect the amount of scraping that's coming from individuals or small projects is going to increase dramatically in the months/years to come.
johnea•7mo ago
This is an ever-growing problem.

The model of the web host paying for all bandwidth was somewhat aligned with traditional usage patterns, but the wave of scraping for training data is disrupting this logic.

I remember reading, maybe 10 years ago, that backend website communications (ads and demographic data sharing) had surpassed the bandwidth consumed by actual users. But even in that case, the traffic was still primarily linked to the website hosts.

With the recent scraping frenzy, by contrast, the traffic is purely client side, not initiated by actual website users, and not particularly beneficial to the website host.

One has to wonder what percentage of web traffic now is generated by actual users, versus host backend data sharing and the mammoth new wave of scraping.

CSMastermind•7mo ago
What's the solution here? Metered usage based on network traffic that gets shared with the website owners?

Otherwise everything moves behind a paywall?

Analemma_•7mo ago
For now the solution is proof-of-work systems like Anubis combined with cookie-based rate limiting: you get throttled if your session cookie indicates you scraped here before, and if you throw the cookie out you get the POW challenge again. I don't know how long this will continue to work, but for my site at least it seems to be holding back the deluge, for the moment.
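The proof-of-work idea described above can be sketched in a few lines of Python. This is a hypothetical hashcash-style check, not Anubis's actual scheme; the function names and the difficulty value are illustrative assumptions:

```python
import hashlib
import os

DIFFICULTY_BITS = 12  # required leading zero bits; tune so solving costs real CPU time


def make_challenge() -> str:
    # Random challenge handed to a client that has no trusted session cookie.
    return os.urandom(16).hex()


def check_pow(challenge: str, nonce: int) -> bool:
    # Accept the nonce if sha256(challenge:nonce) starts with enough zero bits.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - DIFFICULTY_BITS) == 0


def solve_pow(challenge: str) -> int:
    # What the client's JavaScript would do: brute-force a valid nonce.
    nonce = 0
    while not check_pow(challenge, nonce):
        nonce += 1
    return nonce
```

The asymmetry is the point: the server does one hash to verify, while each client (or scraper) must do thousands of hashes per challenge, which only becomes expensive at crawler scale.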
the_snooze•7mo ago
>Otherwise everything moves behind a paywall?

Basically. Paywalls and private services. Do things that are anti-scale, because things meant for consumption at scale will inevitably draw parasites.

rglover•7mo ago
> Some of the bots identify themselves, but some don't. Either way, the respondents say that robots.txt directives – voluntary behavior guidelines that web publishers post for web crawlers – are not currently effective at controlling bot swarms.

Is anybody tracking the IP ranges of bots or anything similar that's reliable?

It seems like they're taking the "what are you gonna do about it" approach to this.

Edit: Yes [1]

[1] https://github.com/FabrizioCafolla/openai-crawlers-ip-ranges
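Checking a client address against such a published range list is straightforward with Python's stdlib `ipaddress`; the CIDRs below are illustrative samples, not an authoritative or current list (crawler operators update their ranges over time):

```python
import ipaddress

# Hypothetical sample ranges; the real lists are published by crawler
# operators and collected in repos like the one linked above.
CRAWLER_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "20.42.10.176/28",
    "52.230.152.0/24",
)]


def is_known_crawler(ip: str) -> bool:
    # True if the address falls inside any published crawler range.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CRAWLER_RANGES)
```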

dbmikus•7mo ago
Many bots use residential IP proxy networks, so they come from the same IPs that humans use.
renegat0x0•7mo ago
I think some of the big tech companies have already said that they don't respect robots.txt prohibitions. It is highly probable that ordinary bots don't respect robots.txt either.
zimpenfish•7mo ago
GPTBot certainly doesn't - I added a blanket disallow for it several months ago, and in the last 5 days it's done 22k requests (rate-limited to a max of 5 req/minute, all proxied to iocaine[0]).

[0] https://iocaine.madhouse-project.org

josefritzishere•7mo ago
I think the solution is criminal penalties.
esseph•7mo ago
Good luck tracking them down
darekkay•7mo ago
ai.robots.txt contains a big list of AI crawlers to block, either through robots.txt or via server rules:

https://github.com/ai-robots-txt/ai.robots.tx
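A robots.txt fragment in that style might look like the following (GPTBot and CCBot are real crawler user-agents; this is a small illustrative excerpt, not the full list). Note these directives are advisory only, so server-side rules matching the User-Agent header are still needed for crawlers that ignore them:

```text
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```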

Bender•7mo ago
Your link is missing the t at the end of .txt. You should be able to edit it though.
JimDabell•7mo ago
This actually blocks a lot more than just AI crawlers. You shouldn’t use this without reviewing it in detail so that you understand what you are actually blocking.

For instance, it includes ChatGPT-User. This is not a crawler. This is used when a ChatGPT user pastes a link in and asks ChatGPT about the contents of the page.

One of the entries is facebookexternalhit. When you share a link on Facebook, Threads, WhatsApp, etc., this is the user-agent Meta uses to fetch the OpenGraph metadata to display things like the title and thumbnail.

Skimming through the list, I see a bunch of things like this. Not every non-browser fetch is an AI crawler!

millipede•7mo ago
Information is valuable; we just weren't charging for it. AI is just bringing the market for knowledge back into equilibrium.
dehrmann•7mo ago
It looks more like information is valuable in aggregate.
rickydroll•7mo ago
That's a point that's often overlooked. I suspect many of the "amazing insights" from LLMs only happen because the training sets encompass an extensive range of knowledge, letting them arrive at conclusions previously unseen.

One could reasonably claim that the value of AI systems and very large training sets is not that they are an approach to AGI, but that they make finding previously unseen connections possible.

gnabgib•7mo ago
Original source: https://www.glamelab.org/products/are-ai-bots-knocking-cultu... (https://news.ycombinator.com/item?id=44298771)
pleeb•7mo ago
I run a fairly large forum, and I've been getting emails from Linode that CPU usage has been going over 90% multiple times a day. Users have been complaining that the site has been taking up to five or six seconds to load. I checked the logs and saw I kept getting hit with hundreds of connections a second from specific addresses, so I set up rate limiting with Cloudflare.

I thought everything was going well after that, until suddenly it started getting even worse. I realized that instead of one IP hitting the site a hundred times per second, it was now hundreds of IPs each hitting the site slightly below the throttling threshold I had set up.
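One common response to that pattern is to rate-limit by network block rather than by single address, so hundreds of IPs from the same range share one bucket. A sketch of the key function, with the /24 and /48 aggregation widths as assumed parameters:

```python
import ipaddress


def rate_limit_key(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    # Collapse an address to its containing network so distributed
    # scrapers in the same block land in one rate-limit bucket.
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))
```

The trade-off is collateral damage: aggregating by block also throttles legitimate users who happen to share the range, which is exactly why residential-proxy botnets are hard to filter.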

dehrmann•7mo ago
Can you serve cached data to logged-out users?
dehrmann•7mo ago
Who's doing this at such a high volume? Most of the data is static enough that there isn't value in frequent crawls, crawls are (probably) more expensive than caching, and small shops and hobbyists don't have the resources to move the needle.
chneu•7mo ago
I think it was Wikipedia recently that was pissed because bots were crawling the site instead of using the already-available datasets.

AI bots don't care about caches. That's one of the big issues.

renegat0x0•7mo ago
The additional bad outcome is that all content can go behind logins and paywalls. What then? You will have to hand over data and an email address in every corner of the web just to log in.

There are also good crawlers that index sites, like Google or Marginalia, which give your page visibility. If you lock everything away from the web, well, it disappears from the web.

cryptonector•7mo ago
This is going to drive all blogs to GitHub gists and such.