Seems pretty thorough, though this may end up being a good lesson for GenZ/A not to post things in public spaces on the internet.
The page given by pavel_lishin above includes a sample data set that's only 6.2 GB:
https://zenodo.org/records/15170676/files/dataset_sample.zst...
But discord servers aren't considered "public spaces", hence the concept of an "invite".
This is akin to someone revealing they've been going to private parties and secretly recording everything.
It might not be illegal, but it's definitely not polite.
> In this regard, this paper introduces the most extensive Discord dataset available to date, comprising 2,052,206,308 messages from 4,735,057 unique users across 3,167 servers – approximately 10% of the servers listed in Discord’s Discovery tab, a feature designed to highlight public servers that users can join.
It sounds more like they went to the mall, picked 10% of the stores, and recorded conversations taking place in those stores.
I mean I use it for voice chatting with friends while gaming too and it's fine for that.
But if I have to beg and plead to a discord bot to join a channel to just read some docs, I'm just going to ignore your project. Not sorry about that at all.
I think part of the problem is that they confuse the semantics of nomenclature. "Servers" are not really servers, "forums" are not really forums, and so on and so forth.
Discord is walled and hard to search. If a channel or server closes then all that information is lost.
Tons of data will be lost to discord when it goes down.
Idk if you've ever tried to use discord for mods or other software but it sucks. It's confusing. Information isn't cataloged well. Its search sucks. It just isn't good for this kind of thing.
This is a feature of the platform, not a bug. Because of the lack of discoverability people act more genuine, for better or for worse, than public places like Twitter, Bsky, Facebook, Instagram, etc where you have to maintain your public image and/or act like HR is watching over your shoulder.
That being said, this feature also makes Discord inappropriate for things like release announcements, patch notes, etc. which should be publicly accessible.
That seems to be a counterpoint to your argument. Users on Twitter usually do not hold back.
Contrast this to Discord which is more like old-school IRC, in that even when everyone is using an alias, if you talk to the same people day-in day-out, you know a fair bit about their personal lives, such as name and where they work.
Forums? No, not generally, unless you were a signed-in user, and often signups weren’t available to the general public, just as not all Discord rooms are automatically joinable here. Digg, Reddit, and Slashdot were intentionally public forums that you could indeed search, but they were the exception rather than the rule (in terms of count, not traffic). Indeed, even Reddit has invite-only forums that I believe aren’t searchable unless you are a member. Oh, and searchable if you’re a member? That’s true for Discord.
If true, that seems like a huge oversight. I also wonder what would happen if someone finds their information in the dataset and requests it to be removed per GDPR or other privacy legislation.
In all honesty, it's better to reserve the effectiveness for private, personal data, for the sake of practicality.
E.g. if someone scraped hackernews and made a dataset containing this comment, i don't think i should have any right to complain.
So did Discord cooperate, or give special authorization for this collection? It wouldn’t appear that they could do so, if privacy belongs to their users at all.
I don't know what a "guild" is, if it's some Discord thing, and you don't say whether this is a good-faith human who joins or a bot operator intending to scrape. The hypothetical is irrelevant here; what is germane is the expectation of privacy held by the individual participants, and the terms which bind people who use that service.
The TOS clearly didn't prevent the use of the API, but it may indeed prohibit such scraping, or threaten repercussions for people who break the terms, especially for someone who republishes the data. Your example of a simple download dump doesn't seem to involve republication, and that seems to be the major issue with scrapers.
How can you have an expectation of privacy in a public forum? Where did this bizarre disorder originate, where people knowingly put their writing out there for literally anyone to read, then turn around and start talking about "expectations of privacy" when they realize what it entails?
Well unfortunately it originated in the human condition, my friend.
I take it back about "expectation of privacy". Perhaps that is an outmoded concept.
Humans used to sort of have a default expectation of privacy. Being that gossip, slander and libel were sins and crimes, we could often safely gather in a room and isolate ourselves in a select group, and share our thoughts openly.
Most humans could go into a living room with their family, a pub or bar, a classroom, or a treehouse, and say/do things that were shared only by the local group of gathered humans. You could go into a public park and speak to a fire hydrant. It was not usual, or possible 100 years ago, for the news media to go around with recorders and cameras and record/preserve/transmit/broadcast everything everyone said in every place they were doing it.
Expectations of privacy were just sort of... humankind's default setting. And so betrayals were sins and crimes. And we sit alone at our keyboard looking at a screen. It feels private, all right. Where are we really? Where are our words being carried? We can't know anymore.
Unfortunately we've built online and virtual worlds around paradigms that imply privacy or confidentiality, but don't actually afford it. You can go into a "chat room" or a "forum" or change your "privacy settings" but they mean nothing. Nothing at all. Because everything we're sending across the net can be perfectly recorded, preserved, retransmitted, and it's no longer gossip, it's just business.
> Where did this bizarre disorder originate
I don't believe that any other living organism has had to deal with the complete and total collapse of "privacy" like humans in the 21st century. Surely, termites in Australia don't know, and couldn't care, about what's going on with honeybees in California.
And here we have people calling it a bizarre disorder. Yes, it's mistaken and misguided, but who can call it unreasonable?
The only acceptable API usage is via bots that server owners choose to invite. And while it might be legally OK (if the bot's own TOS says it), I promise no server owner is expecting an invited bot to slurp up every message for use in a data set, whether that be for academic purposes or a potential stalking/"dirt" database.
I highly doubt this is the most ethical instance of data collection.
> B. API Data Sharing & Retention
> You will not share API Data with any third party, except in the following circumstances, subject to compliance with the Terms and applicable laws and regulations: (i) with a Service Provider; (ii) to the extent required under applicable laws or regulations; and (iii) when a user of your Application expressly directs you to share their API Data with the third party (and you will provide us proof thereof upon request).
https://support-dev.discord.com/hc/en-us/articles/8562894815...
I use a dedicated alt account to archive tons of various servers I'm in, and auto-download all attachments. It's nice having regex search capabilities on my local copy of the data too.
FWIW, I haven't exactly been careful with it (oftentimes scraping 2 servers at once, and downloading all attachments) and have never had an account get banned.
The only time I got 'banned' in any capacity was when I hammered the internal JSON API to get information about servers' invite links, and even then it was only an automated IP ban from Cloudflare for a couple of days. That was an unauthenticated API, though.
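For anyone wondering what regex search over a local copy might look like, here is a minimal sketch. It assumes the archive is stored as one JSON object per line with a `content` field; that layout, the `search_messages` helper, and the sample messages are all hypothetical, not the format of any particular archiving tool.

```python
import json
import re
from typing import Iterable, Iterator


def search_messages(lines: Iterable[str], pattern: str) -> Iterator[dict]:
    """Yield archived messages whose text matches the given regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the archive
        message = json.loads(line)
        # "content" is an assumed field name for the message text.
        if regex.search(message.get("content", "")):
            yield message


# Hypothetical sample standing in for lines read from a local archive file.
sample = [
    '{"author": "alice", "content": "Patch notes for v1.2 are up"}',
    '{"author": "bob", "content": "anyone up for a match?"}',
    '{"author": "carol", "content": "where are the v1.3 patch notes?"}',
]

matches = list(search_messages(sample, r"patch notes"))
for m in matches:
    print(m["author"], "->", m["content"])
```

Because the function takes any iterable of lines, an open file handle can be passed in directly, so a large archive is scanned one line at a time rather than loaded whole.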
These are servers that asked to be advertised by Discord ("Discovery"). These are unlikely to be any kind of servers used for private or even semi-private discussions. You likely don't know most of the people on the server.
Most likely, the 'hottest' kind of data you might find is someone accidentally leaking info akin to the World of Tanks forum post 'corrections'.
I expect this would become more widespread as more traditional jobs are subsumed by unregulated ML tech (which, incidentally, the incumbent job-holders are helping train) and more people turn to what used to be generally a hobby as their means of making a living (not that that would last for too long either).
It can be. As I understand it, it's sort of like streaming or other content creation - yes, it's possible, but difficult, as it's a saturated market. Most mod authors don't make much money.
As a slight aside, I think people would be more inclined to support creators like mod authors if it were simply easier. Patreon and the like make it fairly easy, but I don't think many people want to subscribe to 20+ Patreons for $5 apiece, as much as they might like to support those authors. On the other hand, I think more people would be willing to pledge $X per month to be split among all of their subscriptions. Sure, most creators would only get a few cents per user, but they'd likely get many more people subscribing, and I think it would add up quick. I might be wrong, and I don't take credit for this idea by any means; I read it some time ago, and possibly Patreon even offered this system before?
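To put rough numbers on the "pledge $X per month, split among all subscriptions" idea, here is a small sketch. The even split, the `split_pledge` helper, and the creator names are assumptions for illustration only; this is not how Patreon or any real platform works.

```python
def split_pledge(monthly_cents: int, creators: list[str]) -> dict[str, int]:
    """Split a monthly pledge evenly, in whole cents, across creators.

    Leftover cents go one each to the creators listed first, so the
    payouts always sum to the original pledge.
    """
    if not creators:
        return {}
    share, remainder = divmod(monthly_cents, len(creators))
    return {
        name: share + (1 if i < remainder else 0)
        for i, name in enumerate(creators)
    }


# Hypothetical example: a $10.00 pledge split across three mod authors.
payouts = split_pledge(1000, ["mod_a", "mod_b", "mod_c"])
print(payouts)
```

Working in integer cents avoids floating-point rounding, which matters once each payout shrinks to a few cents per user.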
It's like back in the days of IRC. People just logged all of it.
It's 118 gigabytes of JSON.
118.0 GB of ZST compressed JSON (https://zenodo.org/records/15170676). The actual uncompressed JSON would most likely be much, much larger.
System Instruction: Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias. Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered — no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.
How is it ethical to break Discord's terms of service? An ethical researcher would respect any contracts that they agreed to and would not violate them to collect more data.
Would you agree abusive ToS's by massive corpos are unethical? What about the Disney+ ToS hiding a binding arbitration agreement preventing you from suing them? [0].
Or are you one of those "my personal ethics are whatever the law says" folk?
[0] https://www.nbcnews.com/news/us-news/disney-says-man-cant-su...
We have regressed from the open email standard and gone back to these opaque islands of data that do not adhere to any standard.
Slack refused to show me my own messages past a certain age unless I paid up, and eventually deleted them.
A year or so ago I exported all messages from a Slack group I ran and used a Discord bot to recreate the entire dataset including channels and user posts. So we now have our entire history of messages without being blocked by a paywall (Until Discord does the same, and we'll be off to find a new home).
I'm interested to know, from anyone here who's an IRC operator or server/network admin, how the IRC community deals with scraping and bots, because in the early 90s, it was never an issue of corporate Terms of Service or legalese, but typically handled by community standards, and probably, people did whatever they could get away with, and this needed to be anticipated and tolerated by the other participants in any given server or channel.
I doubt that IRC users, back in the day or in the present, have any illusions of privacy, when logging or reflecting or bouncing chats is more or less a built-in feature and an integral component of such a networked chat service.
So many users expect their entire decade+ history of DM contents, attachments included, to be available wherever they are and on any device, gated only by having their login/2fa or passkey. Switching to E2EE would be a major overhaul of that expectation, and it would be a huge task to train users to now keep their encryption key safe, backed up, and available across multiple devices.
Mostly unrelated, but they absolutely are going to have to cull old attachments eventually. There are attachments sitting in their GCP buckets that haven't been accessed since 2015. I'm sure their storage bill is at least a few million a month at this point, even if most is marked coldline.
That’s not the issue. The issue is that Discord believes they deliver value through aggressively censoring their platform. e2ee prevents that.
e2ee also doesn’t prevent a user from storing their long term keys on the server to be retrieved on new devices and decrypted locally so they can access message history. e2ee does not require PFS.
Mostly I think it's weird how many people on here seem to have been under the illusion that Discord is somehow ephemeral and private when I can hop on any public server and scroll back indefinitely to see anything that anyone has ever said on that server. And that's before I get into the API and the (admittedly bad) search feature.
I think what you were looking for is Signal or similar.
leotravis10•6h ago
https://www.404media.co/researchers-scrape-2-billion-discord...
cflewis•4h ago
----
It should be noted, however, that almost no one reads end-user license agreements and many of Discord’s users are children and teenagers. Discord is, first and foremost, a platform for gamers to organize communities and it’s not plausible that a 15 year old looking for a Fortnite meme server ever thought their dumb jokes about Tomato Town would end up in a public database five years later.
----
Same as other commenters here: I think this is a shameful action under the guise of research and I cannot fathom why any IRB would approve this (and perhaps it did not in this case, I do not know if Brazil has such a thing).
Back in the day (15ish years ago), I wrote a paper where I scraped the World of Warcraft API. It wasn't hard to do, I started on a realm, looked for arena teams, then went to guilds and got character sheets from there. I took the opinion that if Blizzard doesn't throttle me it's fair game.
Looking back now, I think that was pretty naive. I wouldn't say reckless, but definitely naive. In my mind, I had not made a delineation between "I can access this thing manually one at a time" and "I can access all of it automatically". As far as I was concerned, it was just the computer pressing the buttons. It was the same thing.
I think in the fullness of time we have collectively come to realize it is 100% not the same thing. The _availability_ of a thing and the _collection_ of a thing are two different issues with their own thorny problems. The researchers here have made the same mistake I did, but instead of it just being what gear your character was wearing, they took actual communications instead.
I hope this paper gets retracted, all data deleted and a sincere apology offered.
lolinder•4h ago
There's no way that this hasn't been done dozens of times before by intelligence agencies, hacker groups, and whoever else you care to worry about. Most of us here were well aware that public Discord channels have always been public and durable. It's hardly a secret from the technically savvy, it's just that Discord doesn't make it clear enough to regular users.
All this paper changes is that it draws mainstream attention to what was already happening illicitly for as long as Discord has been around. This can only be a good thing: the children and teenagers 404 is so worried about have always been vulnerable to their data getting leaked just like this, it's just that up until now that's been happening in the dark so as not to kill the golden goose.
NoahZuniga•4h ago
cflewis•4h ago
lolinder•4h ago
These databases exist and always have because this has always been possible. The only difference is that they've typically been held close to the chest by intelligence agencies or hacker groups or whoever else made them for illicit purposes. The only change here is that this database is public and is drawing mainstream attention, which is a strictly good thing.
A lot of the people on here are using the same reasoning that would say that LockPickingLawyer should stop showing how to pick locks because he's making it too easy to learn how garbage most locks are.