> The ListenBrainz Labs API endpoints for mbid-mapping, mbid-mapping-release and mbid-mapping-explain have been removed. Those were always intended for debugging purposes and will soon be replaced with new endpoints for our upcoming improved mapper.
> LB Radio will now require users to be logged in to use it (and API endpoint users will need to send the Authorization header). The error message for logged in users is a bit clunky at the moment; we’ll fix this once we’ve finished the work for this year’s Year in Music.
Seems reasonable and no big deal at all. I'm not entirely sure what "nice things" we can't have because of this. Unauthenticated APIs?
(Blocking Chinese IP ranges with the help of some geoip db helps a lot in the short term. Azure as a whole is the second largest source of pure idiocy.)
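For anyone who hasn't set that up before, it only takes a few lines. A minimal sketch using MaxMind's free GeoLite2 database via the geoip2 Python package; the .mmdb path and the blocked-country set are placeholders, and in production you'd usually do this at the proxy layer rather than in application code:

    # Sketch: country-level blocking with GeoLite2 (pip install geoip2).
    # The database path and the blocklist are examples, not recommendations.
    import geoip2.database
    import geoip2.errors

    BLOCKED_COUNTRIES = {"CN"}
    reader = geoip2.database.Reader("/var/lib/GeoIP/GeoLite2-Country.mmdb")

    def should_block(ip: str) -> bool:
        try:
            iso = reader.country(ip).country.iso_code
        except geoip2.errors.AddressNotFoundError:
            return False  # unknown addresses pass through
        return iso in BLOCKED_COUNTRIES

    print(should_block("8.8.8.8"))  # US address -> False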
Scraping page-by-page (inefficient for everyone)
you know what else is "(inefficient for everyone)"? posting the output instead of the prompt
i don't want people's servers to be pegged at 100% because a stupid dfs scraper is exhaustively traversing their search facets, but i also want the web to remain scrapable by ordinary people, or rather go back to how readily scrapable it used to be before the invention of cloudflare
as a middle ground, perhaps we could agree on a new /.well-known/ path meant to contain links to timestamped data dumps?
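To make that concrete, here is one shape such a file could take; the path, field names, and format below are purely hypothetical, not an existing convention:

    # Hypothetical /.well-known/data-dumps.json: a pointer file listing
    # timestamped bulk dumps so crawlers can skip page-by-page scraping.
    # Nothing here is a standard; it's just a sketch of the idea.
    import json
    import pathlib

    manifest = {
        "updated": "2025-11-01T00:00:00Z",
        "dumps": [
            {"description": "full database dump",
             "url": "https://example.org/dumps/full-20251101.tar.gz"},
            {"description": "incremental dump since the last full one",
             "url": "https://example.org/dumps/incr-20251108.tar.gz"},
        ],
    }

    pathlib.Path(".well-known").mkdir(exist_ok=True)
    pathlib.Path(".well-known/data-dumps.json").write_text(json.dumps(manifest, indent=2))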
For this strategy to work, people need to actually use the DB dumps instead of just defaulting to scraping. Unfortunately scraping is trivially easy, particularly now that AI code assistants can write a working scraper in ~5-10 minutes.
I like to create Tampermonkey scripts for these. They're a more lightweight/easier way to build extensions, mostly, imo.
Now, I don't like AI, but I don't know anything about scraping, so I used AI to generate the scraping code, pasted it into Tampermonkey, and let it run.
I recently did this to scrape a website that had a list of VPS servers and their prices, and built myself a list from it to analyze, as an example.
I should also say that I usually try to look for databases first: on a similar website, I contacted them about their DB but got no response; their database of server prices was private and only showed the lowest price.
So I picked the other website and did this. I also scraped every LowEndTalk headline ever posted, with links, partly for archival and partly to feed the headlines to an LLM and extract a list of VPS providers.
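For context on how low the bar is, that kind of throwaway scraper is usually just a loop over a listing page. A rough Python equivalent (the URL, pagination scheme, and CSS selectors are invented placeholders, not the commenter's actual script):

    # Sketch of a throwaway listing scraper with a polite delay between requests.
    # The site, pagination, and selectors are all made up.
    import time
    import requests
    from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

    offers = []
    for page in range(1, 6):
        html = requests.get(f"https://example.com/vps-offers?page={page}", timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for row in soup.select(".offer"):
            offers.append({
                "name": row.select_one(".name").get_text(strip=True),
                "price": row.select_one(".price").get_text(strip=True),
            })
        time.sleep(3)  # a few seconds between requests keeps the load negligible

    print(len(offers), "offers collected")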
1) Cut copyright to 15-20 years by default. You can have 1 extension of an additional 10-15 years if you submit your work to the "National Data Set" within say 2-3 years of the initial publication.
2) Content in the National set is well categorized and cleaned up. It's the cleanest data set anyone could want. The data set is used both to train some public models and also licensed out to people wanting to train their own models. Both the public models and the data sets are licensed for nominal fees.
3) People who use the public models or data sets as part of their AI system are granted immunity from copyright violation claims for content generated by these models, modulo some exceptions for knowing and intentional violations (e.g. generating the contents of a book into an epub). People who choose to scrape their own data are subject to the current state of the law with regards to both scraping and use (so you probably better be buying a lot of books).
4) The license fees generated from licensing the data and the models would be split into royalty payments to people whose works are in the dataset, and are still under copyright protection, proportional to the amount of data submitted and inversely proportional to the age of that data. There would be some absolute caps in place to prevent slamming the national data sets with junk data just to pump the numbers.
Everyone gets something out of this. AI folks get clean data that they didn't have to burn a lot of resources scraping. Copyright holders get paid for their works used by AI and retain most of the protections they have today, just for a shorter time. The public gets usable AI tooling without everyone spending their own resources on building their own data sets, and site owners and the like get reduced bot/scraping traffic. It's not perfect, and I'm sure the devil is in the details, but that's the nature of this sort of thing.
This alone will kill off all chances of that ever passing.
Like, I fully agree with your proposal... but I don't think it's feasible. There are a lot of media IPs/franchises that are very, very old but still generate insane amounts of money to this day, with active development. Star Wars and Star Trek obviously, but also stuff like the MCU (Iron Man 1 was released in 2008), Avatar, which is well on its way to two decades of runtime, or Harry Potter, which is almost 30 years old. That's dozens of billions of dollars in cumulative income, and most of that is owned by Disney.
Look what it took to finally get even the earliest Disney movies to enter the public domain, and that was stuff from before World War 2 that was so bitterly fought over.
In order to reform copyright... we first have to use anti-trust to break up the large media conglomerates. And it's not just Disney either. Warner, Sony, Comcast and Paramount also hold ridiculous amounts of IP, Amazon entered the fray as well with acquiring MGM (mostly famous for James Bond), and Lionsgate holds the rights for a bunch of smaller but still well-known IPs (Twilight, Hunger Games).
And that's just the movie stuff. Music is just as bad, although at least there thanks to radio stations being a thing, there are licensing agreements and established traditions for remixes, covers, tribute bands and other forms of IP re-use by third parties.
It's a monetary one: specifically, large pools of sequestered wealth making extremely bad long-term investments, all in a single dubious technical area.
Any new phenomenon driven by this process will have the same deleterious results on the rest of computing. There is a market value in ruining your website that's too high for the fruit grabbers to ignore.
In time adaptations will arise. The apparently desired technical future is not inevitable.
Just something like /llms.txt which contains a list of .txt or .txt.gz files or something?
Because the problem is that every site is going to have its own data dump format, often in complex XML or SQL or something.
LLMs don't need any of that metadata, and many sites might not want to provide it because e.g. Yelp doesn't want competitors scraping its list of restaurants.
But if it's intentionally limited to only paragraph-style text, and stripped entirely of URLs, IDs, addresses, phone numbers, etc. -- so e.g. a Yelp page would literally just be the cuisine category and reviews of each restaurant, no name, no city, no identifier or anything -- then it gives LLMs what they need much faster, the site doesn't need to be hammered, and it's not in a format for competitors to easily copy your content.
At most, maybe add markup for <item></item> to represent pages, products, restaurants, whatever the "main noun" is, and recursive <subitem></subitem> to represent e.g. reviews on a restaurant, comments on a review, comments one level deeper on a comment, etc. Maybe a couple more like <title> and <author>, but otherwise just pure text. As simple as possible.
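As a sketch of how small that could be in practice (the tag names are the hypothetical markup described above, not any existing llms.txt spec):

    # Emit the stripped-down, text-only format described above: items, optional
    # titles, and nested subitems -- no URLs, IDs, or other identifiers.
    reviews_by_cuisine = {
        "Italian": ["Great pasta, slow service.", "Cozy place, would return."],
        "Thai": ["Best green curry in town."],
    }

    lines = []
    for cuisine, reviews in reviews_by_cuisine.items():
        lines.append("<item>")
        lines.append(f"<title>{cuisine}</title>")
        for review in reviews:
            lines.append(f"<subitem>{review}</subitem>")
        lines.append("</item>")

    with open("llms.txt", "w") as f:
        f.write("\n".join(lines) + "\n")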
The biggest problem is that a lot of sites will create a "dummy" llms.txt without most of the content because they don't care, so the scrapers will scrape anyways...
Mind you, I make an effort not to be burdensome: I download only what I need, wait a couple of seconds between each request, and the total data usage is low.
Ironically, I suppose you could call what I'm using it for "AI", but really it's just data analytics.
There's something important here in that a public good like Metabrainz would be fine with the AI bots picking up their content -- they're just doing it in a frustratingly inefficient way.
It's a co-ordination problem: Metabrainz assumes good intent from bots, and has to lock down when they violate that trust. The bots have a different model -- they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.
What mechanism does a site have for doing that? I don't see anything in robots.txt standard about being able to set priority but I could be missing something.
This also means that right now it could be much easier to push through such a standard than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick.
--
[0] - Plenty of them exist. Scraping wasn't popularized by AI companies; it's standard practice for online businesses in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.
[1] - Not to be confused with having an LLM scrape a specific page for some user because the user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the name of the "User-Agent" header).
I wish there were an established protocol for this. Say a $site/.well-known/machine-readable.json that identifies the site as running one of a handful of established software packages, or points to an appropriate dump. I would gladly provide that for LLMs.
Of course this doesn't solve the use case where the AI companies are trying to train their models on how to navigate real-world sites, so I understand it doesn't solve all problems. But one of the things I'd like in the future is to have my own personal archive of the web as I know it (the Internet Archive is too slow to browse and has very tight rate limits), and I was surprised by how little protocol support there is for robots.
robots.txt is pretty sparse. You can disallow bots and this and that, but what I want to say is "you can get all this data from this git repo" or "here's a dump instead with how to recreate it". Essentially, cooperating with robots is currently under-specified. I understand why: almost all bots have no incentive to cooperate so webmasters do not attempt to. But it would be cool to be able to inform the robots appropriately.
To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.
https://metabrainz.org/datasets
Linked to from the homepage as “datasets”.
I may be too broadly interpreting what you mean by “machine-communicable” in the context of AI scraping though.
But it's not like you're writing a "metabrainz crawler" and a "metafilter crawler" and a "wiki.roshangeorge.dev crawler". You're presumably trying to write a general Internet crawler. You encounter a site that is clearly an HTTP view into some git repo (say). How do you know to just `git clone` the repo in order to have the data archived, as opposed to just browsing the HTTP view?
As you can see, I've got a lot of crawlers on my blog as well, but it's a MediaWiki instance. I'd gladly host a MediaWiki dump for them to take, but then they'd have to know this was a MediaWiki-based site. How do I tell them that? The humans running the program don't know my site exists. Their bot just browses the universe and finds links and does things.
In the Metabrainz case, it's not like the crawler writer knows Metabrainz even exists. It's probably just linked somewhere in the web the crawler is exploring. There's no "if Metabrainz, do this" anywhere in there.
The robots.txt is a bit of a blunt-force instrument, and friendly bot writers should follow it. But assuming they do, there's no way for them to know that "inefficient path A to data" is the same as "efficient path B to data" if both are visible to their bot unless they write a YourSite-specific crawler.
What I want is to have a way to say "the canonical URL for the data on A is at URL B; you can save us both trouble by just fetching B". In practice, none of this is a problem for me. I cache requests at Cloudflare, and I have Mediawiki caching generated pages, so I can easily weather the bot traffic. But I want to enable good bot writers to save their own resources. It's not reasonable for me to expect them to write a me-crawler, but if there is a format to specify the rules I'm happy to be compliant.
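If such a convention existed, the crawler side would be trivial, which is rather the point. A sketch, reusing the hypothetical /.well-known/data-dumps.json pointer from upthread (again, not a real standard):

    # Generic crawler-side check: prefer an advertised bulk dump over crawling.
    # The /.well-known path and JSON keys are hypothetical.
    import requests

    def find_bulk_dump(host: str) -> str | None:
        try:
            resp = requests.get(f"https://{host}/.well-known/data-dumps.json", timeout=10)
            if resp.ok:
                dumps = resp.json().get("dumps", [])
                if dumps:
                    return dumps[0].get("url")
        except (requests.RequestException, ValueError):
            pass
        return None

    dump_url = find_bulk_dump("wiki.example.org")
    if dump_url:
        print("fetch the dump instead of crawling:", dump_url)
    else:
        print("no dump advertised; fall back to slow, polite page-by-page crawling")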
Why does there have to be a "machine-communicable way"? If these developers cared about such things they would spend 20 seconds looking at this page. It's literally one of the first links when you Google "metabrainz"
In fact firefox now allows you to preview the link and get key points without ever going to the link[1]
> [1] https://imgur.com/a/3E17Dts
This is generated on device with llama.cpp compiled to webassembly (aka wllama) and running SmolLM2-360M. [1] How is this different from the user clicking on the link? In the end, your local firefox will fetch the link in order to summarize it, the same way you would have followed the link and read through the document in reader mode.
[1] https://blog.mozilla.org/en/mozilla/ai/ai-tech/ai-link-previ...
Like, can we all take a step back and marvel that freaking wasm can do things that 10 years ago were firmly in the realm of sci-fi?
I hope they’ll extend that sort of thing to help filter out the parts of the dom that represent attention grabbing stuff that isn’t quite an ad, but is still off topic/not useful for what I’m working on at the moment (and still keep the relevant links).
- AI shops scraping the web to update their datasets without respecting netiquette (or sometimes being unable to automate it for every site due to the scale, ironically).
- People extensively using agents (search, summarizers, autonomous agents etc), which are indistinguishable from scraper bots from the website's perspective.
- Agents being both faster and less efficient (more requests per action) than humans.
I'm not saying the API changes are pointless, but still, what's the catch?
Many of them don't even self-identify and end up scraping with shrouded user-agents or via bot-farms. I've had to block entire ASNs just to tone it down. It also hurts good-faith actors who genuinely want to build on top of our APIs because I have to block some cloud providers.
I would guess that I'm getting anywhere from 10-25 AI bot requests (maybe more) per real user request - and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0]. Keep in mind that they're hitting deeply cold links so caching doesn't do a whole lot here.
[0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate
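One generic way to do that kind of routing (a sketch of the approach, not necessarily the parent's actual setup) is to classify requests at the edge and let the proxy or ingress send "bot" traffic to a separate backend pool:

    # Classify requests by User-Agent so a proxy/ingress can route self-identifying
    # AI crawlers to a separate pool. The marker list is a small sample; plenty of
    # bots don't self-identify at all, which is the harder problem.
    KNOWN_BOT_MARKERS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")

    def backend_pool(user_agent: str | None) -> str:
        ua = (user_agent or "").lower()
        if any(marker.lower() in ua for marker in KNOWN_BOT_MARKERS):
            return "bot-pool"
        return "user-pool"

    print(backend_pool("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))  # bot-pool
    print(backend_pool("Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0"))  # user-pool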
We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)
We should add optional `tips` addresses in llms.txt files.
We're also working on enabling and solving this at Grove.city.
Human <-> Agent <-> Human Tips don't account for all the edge cases, but they're a necessary and happy neutral medium.
Moving fast. Would love to share more with the community.
Wrote about it here: https://x.com/olshansky/status/2008282844624216293
I can't see this working.
- Humans tip humans as a lottery ticket for an experience (meet the creator) or sweepstakes (free stuff)
- Agents tip humans because they know they'll need original online content in the long term to keep improving.
For the latter, frontier labs will need to fund their training/inference agents with a tipping jar.
There's no guarantee, but I can see it happening given where things are moving.
Though if LLMs are willingly ignoring robots.txt, often hiding themselves or using third-party scraped data, are they going to pay?
I wonder if a model similar to this (but decentralized/federated or something) could be used to help fight bots?
a) Have a reverse proxy that keeps a "request budget" per IP and per net block, but instead of blocking requests, which just causes the client to rotate its IP, throttle/slow the requests down without dropping them (see the sketch after this list).
b) Write your API servers in more efficient languages. According to their GitHub, their backend runs on Perl and Python. These technologies have been "good enough" for quite some time, but considering current circumstances, and until a better solution is found, that may no longer be the case; performance and CPU cost per request do matter these days.
c) Optimize your database queries, remove as much code as possible from your unauthenticated GET request handlers, require authentication for the expensive ones.
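As a sketch of (a), here's a per-netblock token bucket that slows abusive clients down instead of rejecting them; the rates are arbitrary, and in practice this logic would live in the reverse proxy itself rather than in Python:

    # Token bucket per /24: requests over budget get delayed, not dropped,
    # so the client gets no 429/403 signal telling it to rotate IPs.
    import ipaddress
    import time
    from collections import defaultdict

    RATE = 1.0    # tokens refilled per second, per /24
    BURST = 10.0  # bucket size

    buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def netblock(ip: str) -> str:
        return str(ipaddress.ip_network(f"{ip}/24", strict=False))

    def delay_for(ip: str) -> float:
        """Seconds to wait before serving this request (0.0 = within budget)."""
        b = buckets[netblock(ip)]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return 0.0
        return (1.0 - b["tokens"]) / RATE  # wait for the next token instead of erroring

    time.sleep(delay_for("203.0.113.7"))  # then handle the request normally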
"The malefactor behind this attack could just clone the whole SQLite source repository and search all the content on his own machine, at his leisure. But no: Being evil, the culprit feels compelled to ruin it for everyone else. This is why you don't get to keep nice things...."
Evil? Really? Seems a bit dramatic.
In this particular case, though, I don't think "evil" is a moral claim, more shorthand for cost-externalizing behavior. Hammering expensive dynamic endpoints with millions of unique requests isn't neutral automation; it's degrading a shared public resource. Call it evil, antisocial, or extractive, the outcome is the same.
"Why don't you just clone the repo?"
Is there a standard mechanism for batch-downloading a public site? I'm not too familiar with crawlers these days.
Anyway, all that means there was never a critical mass of sites large enough for a default bulk-data-dump discovery mechanism to become established. This means even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape page-by-page anyway.
Looking forward to the time when everybody suddenly starts to embrace AI indexers and welcome them. History does not repeat itself but it rhymes.
The problem is that they're not doing it.
So maybe something like you can get a token but its trust is very nearly zero until you combine it with other tokens. Combining tokens combines their trust and their consequences. If one token is abused that abuse reflects on the whole token chain. The connection can be revoked for a token but trust takes time to rebuild so it would take a time for their token trust value to go up. Sort of the 'word of mouth' effect but in electronic form. 'I vouch for 2345asdf334t324sda. That's a great user agent!'
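To make the vouching idea slightly more concrete, a toy sketch; none of this exists anywhere, it's just the "combined trust, shared consequences" notion written down:

    # Toy model: tokens start nearly untrusted, vouching boosts them, and abuse
    # propagates back down the chain of vouchers. Entirely hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class Token:
        id: str
        trust: float = 0.01  # fresh tokens are worth almost nothing
        vouched_by: list = field(default_factory=list)

        def effective_trust(self) -> float:
            # combining tokens combines their trust...
            return min(1.0, self.trust + 0.5 * sum(t.effective_trust() for t in self.vouched_by))

        def report_abuse(self) -> None:
            # ...and their consequences: abuse taints the whole chain
            self.trust *= 0.1
            for t in self.vouched_by:
                t.report_abuse()

    veteran = Token("2345asdf334t324sda", trust=0.8)
    newcomer = Token("new-agent", vouched_by=[veteran])
    print(round(newcomer.effective_trust(), 2))  # boosted by the voucher
    newcomer.report_abuse()
    print(round(veteran.trust, 2))               # the voucher pays a price too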
A bit (a lot) elaborate, but maybe there is the beginning of an idea there, maybe. I definitely don't want to lose anonymity (or the perception thereof) for services like MusicBrainz, but at the same time they need some mechanism that gives them trust, and right now I just don't know of a good one that doesn't have identity attached.
    // serve junk to requests that look like scrapers
    if (isSuspiciousScraper(req)) {
      return res.json({
        data: getDadJoke(),
        artist: "Rick Astley", // always
        album: "Never Gonna Give You Up"
      });
    }
> Citation needed
this reply kinda sucks :)
I pay premium (it's a thing) and only get 2TB of usable data.
It flashes some text briefly, then gives me a 418 TEAPOT response. I wonder if it's because I'm on Linux?
EDIT: Begrudgingly checked Chrome, and it loads. I guess it doesn't like Firefox?
Someone shared an alternative. Must everything in AI threads be so negative and condescending?
CloudFlare is making it impossible to browse privately
After all, it's not very far from hosting booters and selling DoS protection.