Discord, big data, and pedophiles - The application of data science to Discord degeneracy

Good work OP.

I am curious about the business model. You say you make money, who pays you? Discord or the owners of the groups?

If it's the latter, how do they monetize it themselves? Are companies on Discord?

Idk anything about Discord, but I imagine your bot could prove useful in other areas. LLMs can be useful beyond moderation efforts, which, although necessary, are not profit drivers. Is the proprietary part the Discord integration only? Or could it be tweaked for other purposes where there's a need and the supply isn't already saturated?
 
I am kind of retarded and don't understand graphs as well as numbers. Could you hit us with some plaintext statistics that are easy to digest? Such as, what percentage of those 2.2 million messages were pedophilic, etc.
 
I actually took the time out to make a report on the FBI's tip website

I did this as well back in the day and concluded, after noticing nothing had changed, that either:

a) The FBI doesn't give a shit
b) The FBI has a vested interest in this content existing
c) The whole process just had the effect and purpose of putting me on a list
d) All of the above
 
Discord is a great app for predators.


By pure chance, I snatched a rare username on my only account.

Since then people add me randomly.
At first I asked who the fuck they were, and in most cases they were 8-16 year old Roblox/Fortnite addicts adding me because they thought I was their gaming buddy.

Linked accounts, face as the profile picture, 0 internet hygiene.

I disabled the option to add me if we're not in the same server.

That experience made me realise how easy it would be for some people to get a steady supply of young, retarded children.
 
…Discord can't implement AI or another detection method to pick up these types of messages and flag for review…
I believe Discord bought Sentropy (an AI moderation company) a while back to fight this, and one of the technologies they acquired was PhotoDNA, which scans images for CSAM. A while back this was also used to trick people into posting a CSAM image that looked “innocent” but actually got your account banned (referring to this YT vid).

This post also showed how a lot of this can easily be done on their platform.
…I've collected about 2.2 million messages as of the time of this writing…
Do you think you could collect Discord IDs as well (i.e. the users who send these messages)? I believe they do get banned, but maybe it isn't reported back to you or the user. That was also done to “prevent” people from knowing their actual ban reason.
 
any way you can spare some storage space to hold all these Discord messages? Imagine the mass pants shitting when the tranny groomers realize that the Farms has access to their "private" Discord messages.
Cue the baseless "Kiwi Farms hosts CSAM" accusations, because wetbrains can't tell the difference between hashed data and actual material
 
I think part of the problem is that, at the end of the day, such large amounts of pedophilia-related content are symptoms of the presence of pedophiles, not the cause of the problem itself. Removing them from Discord and Reddit and such doesn't really fix the underlying problem, in the same way that censoring criticism of trannies doesn't make people like them.

That said, I do think we have made some progress as a species on this issue. The pivot from calling pedophile material CP to calling it CSAM has been a very good thing for ensuring maximum pedophile death: it targets all content that pedos get off to which harms children, closes off technical loopholes like "akshually I was just taking artistic pictures of babies", and encourages less time wasted on drawings.
 
This is some scary shit, but I’m not surprised. I hope nothing bad comes of you bringing this information to light. The discord jannies wouldn’t use this information to do anything of substance to fix this problem, but they would form a witch-hunt to dox OP and smear the fuck out of him. With any luck this sparks a greater conversation about how harmful discord is to children. Marvelous job OP. You’re a goddamn hero.
 
The discord jannies wouldn’t use this information to do anything of substance to fix this problem, but they would form a witch-hunt to dox OP and smear the fuck out of him.

Worse: they would take his suggestions and tools and instead pervert them to attack their own enemies, using their lack of morals and institutional power to do so. Instead of having the A.I. hunt pedos they would go after people who are "transphobic" and who post happy merchants.
 
I did this as well back in the day and concluded, after noticing nothing had changed, that either:

a) The FBI doesn't give a shit
b) The FBI has a vested interest in this content existing
c) The whole process just had the effect and purpose of putting me on a list
d) All of the above

Based on my experience with reports, the FBI is by far one of the worst at actioning anything against predators and groomers. Over 15 years of seeing reports sent in and sending them in myself - reports that included evidence, IPs, and direct identification - there has been nothing to show for it.
@Null, any way you can spare some storage space to hold all these Discord messages? Imagine the mass pants shitting when the tranny groomers realize that the Farms has access to their "private" Discord messages.

I wouldn't really suggest this as a good idea if OP wants to remain anonymous and covert. It only takes one server with one user who recognises their own message and the server it was sent in for the bot he develops to be exposed, along with OP himself. OP is taking a significant risk by providing as much as they have.

I do think that if OP is interested, the farms should help crawl through the data to identify the worst offenders (it would also be interesting if any of the suspected pedo cows come up). I also think having it available to the general userbase would be awful, due to the inability of people on here, or people who orbit the farms, to simply not shit things up.



Worse: they would take his suggestions and tools and instead pervert them to attack their own enemies, using their lack of morals and institutional power to do so. Instead of having the A.I. hunt pedos they would go after people who are "transphobic" and who post happy merchants.

Unironically, this is why we haven't had tech pushed out that could detect CSAM and other abuse materials: every tool that has ever been conceived as a positive is then immediately used to fuck up the general population. The technology exists (Apple tried to roll it out at scale but the backlash was immense).

Another issue is that the tech can only detect historic CSAM - hash matching can only flag material that has already been identified and hashed, not newly produced content.

One of the things I've noticed over the years is that the source of material has changed for pedos. Prior to the mass spread of social media, it would normally be what most people imagine as typical CP/CSAM and creepshots, sourced from places like historic 4chan and other boards. Now, however, a lot of pedos feed their disturbing needs through platforms such as TikTok or Instagram. They'll share collections of Instagram accounts or specific hashtags that give them access to material which, while not strictly illegal in most countries (unless child protection legislation includes a voyeurism provision), is still sexualised in nature due to how deranged social media has become.

It isn't a surprise that many of these people - especially the hebe-specific pedos - aren't scared or ashamed of their actions: children are sexualised as the status quo on platforms like TikTok and Instagram on a daily basis. Pedo comments cover every corner of these platforms and nothing ever happens to the accounts.
 
@Null, any way you can spare some storage space to hold all these Discord messages? Imagine the mass pants shitting when the tranny groomers realize that the Farms has access to their "private" Discord messages.
Storage isn't the issue; my approximately 300 million logged messages occupy around 60GB. However, the challenge lies in indexing them. Indexing all these messages and ensuring easy accessibility demands a significant amount of resources.
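To illustrate the scale problem, here's a minimal sketch of a full-text index using SQLite's FTS5 - purely an example, not what the Farms actually runs. At 300 million rows, building and storing an index like this becomes a serious resource commitment in its own right:

```python
# Illustrative sketch only: a minimal full-text index over logged messages
# using SQLite's FTS5 module. Not the actual setup; at ~300M rows the index
# build and its disk/RAM footprint become the real cost, not raw storage.
import sqlite3

conn = sqlite3.connect("messages.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS msg_fts USING fts5(content)")

# Populating: one row per logged message (batched in practice).
conn.execute("INSERT INTO msg_fts(content) VALUES (?)", ("example logged message",))
conn.commit()

# Querying the index is then cheap:
rows = conn.execute(
    "SELECT rowid, content FROM msg_fts WHERE msg_fts MATCH ?", ("example",)
).fetchall()
print(rows)
```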

Most intriguing Discord servers available for logging are typically private and entail extensive verification processes. While I managed to access some of them, there are thousands more awaiting my attention.
 
I did this as well back in the day and concluded, after noticing nothing had changed, that either:

a) The FBI doesn't give a shit
b) The FBI has a vested interest in this content existing
c) The whole process just had the effect and purpose of putting me on a list
d) All of the above
ding-ding-ding

online pedophilia is only interesting when it can be used as a boogeyman to implement totalitarian surveillance laws (which, contrary to their claims, will *not* be used against it). Otherwise these organizations literally do not care. Often they won't even bother to tell hosting providers to take down literal CP that is freely accessible. For years. Many such cases.

I think it's the same with pedos as with all other criminals: law enforcement only gets those (and is only interested in getting those) who are such bumbling idiots that "getting them" is very little work. It's probably more likely you'll end up on their shitlist for exposing this than that you'll encourage any actual change.
 
@grand larsony Good job on quantifying this - I doubt anyone on KF is surprised by the findings, but there are a great number of normies, some of whom may even be parents, who would be rather shocked, or worse, puzzled. (For example, one of my neighbours thought the "internet" was something her child's school had made to give children homework during lockdown. I don't even know how such people can be warned.)

Some questions:

1. How stringent were you when excluding sex/porn oriented servers? Did you also cut shit like e-thot simping and fan-fiction, for example?

2. Do you have a false positive rate for your program, and if so, what was it? (Measuring the false negative rate isn't feasible with this methodology, obviously.)

Again, good show and valuable dataset.
Yeah, my parents were unprepared for the dangers of the ~2006 internet. I can't even imagine how someone similarly uninformed would be able to cope with the modern internet. It's so much more sly now, everything is a cute little app full of smiley faces and shit but the pedos are still there, seemingly in bigger numbers than ever. At least back in 2008 there was still an air of suspicion that a lot of adults had around the internet, but that seems totally gone now.

1. I did it all through server/channel names. I skimmed through the top few hundred servers and didn't see anything with names/channel names that appeared explicitly pornographic. There are some servers that I'm sure have a higher concentration of sexual content - the bot is in one server for meth enthusiasts, for instance - but nothing where the primary focus is sharing porn, as far as I could find.

2. There are false positives and false negatives, but in weird ways that aren't immediately intuitive. For false positives, there are things which technically talk about kids in a sexual way, but that aren't inappropriate in the "I want to fuck a child" kind of way. False positives would be things like "I got molested when I was a kid and I've been fucked up ever since" - it's technically talking about children sexually, but not in the predatory way. In a sense, whether or not stuff like that is even a false positive is up to your interpretation of how the scoring should be done.
There are a number of false negatives, but it's very very hard to get false negatives consistently unless you're extremely diligent about obfuscating every message you send. For example, if you send "I" and "want" and "to" and "fuck" and "a child" as separate messages, that gets past the filter. If you send " 'I want to fuck a child' is a very bad thing to say, and you should never say it" the filter will contextualize that as you scolding someone, even if the real context is different. Normal character swapping doesn't work, but if you go really hard at it, that'll bypass the filter - something like "(eye) ||vv@N+ +0 fn(k @ (H1L|)". But typing in these ways isn't really sustainable.
As for the rate, I feel confident that all the big red dots are serious offenders. On an individual message level there are occasional false positives and false negatives, but the scoring in the network graph is based on the average score over all of a user's messages from the DB. Now, I could be wrong - maybe there's a couple guys in the dataset who just really need to vent about how they got molested as a kid or something - but a cursory check of the database shows me that isn't the case for any significant portion of users.

Also I guess I'll elaborate on the scoring mechanism as well. The bot scores everything with a float from 0 to 1. Even innocuous sentences will get some kind of non-zero score, like "what's up boys" might get a 0.0001 because "boys" also appears in the training data for stuff that talks about kids sexually. Varying severity of the text will give different scores around the middle. Here's an example of what text around the 0.5 mark (the point where OpenAI considers a text "flagged") looks like:
[Attached image: example messages scoring around the 0.5 mark]

These are messages which make me raise my eyebrows a bit, but you can see there are some definite false positives in this range, like the guy who appears to be referring to drugs when he says "4 meo". Personally I consider anything above ~0.8 to be pretty unambiguous, but even then there are exceptions like the ones I've discussed above.
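If anyone wants to poke at this themselves, here's a minimal sketch of the kind of scoring I'm describing - it assumes OpenAI's moderation endpoint and the sexual/minors category score, which is roughly what's happening under the hood, minus all the plumbing:

```python
# Minimal sketch: score messages 0-1 via OpenAI's moderation endpoint, then
# average per user (the network graph coloring is based on these averages).
# Assumes the `openai` package (v1+) and OPENAI_API_KEY in the environment.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()

def score_message(text: str) -> float:
    """Return the sexual/minors category score (0.0-1.0) for one message."""
    result = client.moderations.create(input=text).results[0]
    return result.category_scores.sexual_minors

def average_score_per_user(messages: list[tuple[str, str]]) -> dict[str, float]:
    """messages: (user_id, text) pairs. Returns each user's mean score."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for user_id, text in messages:
        totals[user_id] += score_message(text)
        counts[user_id] += 1
    return {uid: totals[uid] / counts[uid] for uid in totals}

# Even innocuous text gets a tiny non-zero score:
print(round(score_message("what's up boys"), 6))
```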

Here's a box plot which shows the distribution of average scores per user. You can see that the upper fence is about 6x higher than the median, and there are still quite a lot of users above the upper fence. I feel confident that anyone above this range is someone whose browser history should be looked into, at the very least.
[Attached image: box plot of the distribution of average scores per user]
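For reference, the "upper fence" is the standard box-plot whisker limit, Q3 + 1.5 x IQR. Here's roughly how that plot and fence get computed - the lognormal placeholder data just stands in for the real per-user averages from the DB:

```python
# Sketch of the per-user score box plot. `user_scores` is placeholder data
# standing in for the real per-user averages, which aren't published here.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
user_scores = rng.lognormal(mean=-7.0, sigma=1.2, size=50_000)  # placeholder

q1, median, q3 = np.percentile(user_scores, [25, 50, 75])
upper_fence = q3 + 1.5 * (q3 - q1)  # standard whisker limit
print(f"median={median:.6f}  upper_fence={upper_fence:.6f}  "
      f"fence/median={upper_fence / median:.1f}x  "
      f"users_above_fence={(user_scores > upper_fence).sum()}")

plt.boxplot(user_scores, vert=False)
plt.xlabel("average score per user")
plt.show()
```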


A couple people also asked about server breakdowns. I used a text embedding tool, t-SNE, and KMeans to cluster the larger servers into 10 groups (a rough sketch of the pipeline is at the end of this post). Here's an image:
[Attached image: t-SNE scatter of server-name clusters, colored by KMeans group]

The positions on the x and y axes aren't important here. What's important is just that similar servers are clustered together based on their names. Here's a rough breakdown of the category boundaries:
Yellow - Chinese servers, no idea what these names mean
Burgundy - Russian servers, mostly appear to be personal servers but my Russian isn't great
Red - various other Asian languages, can't identify all of them but I see some Japanese and Korean text here
Gray - Spanish servers which mostly appear to be meme-related and personal servers
Purple - mixture of English and Japanese servers related to AI tech
Orange - all meme servers, all English
Green - seems to be social club type servers without a strong common theme in the naming. Lots of mentions of "club" "cafe" "hangout" etc
Blue - personal servers owned by people with English names
Black - mixture of political and tech-related servers, with a couple personal servers
Pink - technology, video games, and (uh oh) servers that seem to be related to school/studying
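For anyone who wants to replicate this kind of breakdown, the pipeline is roughly: embed the server names, lay them out in 2-D with t-SNE, and group them with KMeans. The embedding model below is just one illustrative option (not necessarily the tool I used), and the placeholder names stand in for the real server list:

```python
# Rough sketch: embed server names, project to 2-D with t-SNE, color by KMeans
# cluster. The embedding model is an illustrative choice, and the placeholder
# names stand in for the real server list.
import numpy as np
import matplotlib.pyplot as plt
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

client = OpenAI()
server_names = [f"placeholder server {i}" for i in range(300)]

resp = client.embeddings.create(model="text-embedding-3-small", input=server_names)
embeddings = np.array([item.embedding for item in resp.data])

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)  # layout only
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=12)
plt.title("Server-name clusters (x/y positions carry no intrinsic meaning)")
plt.show()
```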

Good work OP.

I am curious about the business model. You say you make money, who pays you? Discord or the owners of the groups?

If it's the latter, how do they monetize it themselves? Are companies on Discord?

Idk anything about Discord, but I imagine your bot could prove useful in other areas. LLMs can be useful beyond moderation efforts, which, although necessary, are not profit drivers. Is the proprietary part the Discord integration only? Or could it be tweaked for other purposes where there's a need and the supply isn't already saturated?
I won't go into too much detail since I'm doing a lot of stuff that isn't individually unique, but all the features together form a bot that's pretty unique. The bot has a mixture of user-focused services like chatbot stuff, administration-focused services like the AI moderation stuff, and general utilities that aren't really "fun" features or administrative but are useful nonetheless. The money comes entirely from the user-focused stuff. Premium subscriptions, one-off purchases, etc.
My plan is to slowly transition it to being more moderation-focused, which is why I've spent so much time developing analytics for moderators. I recognize that "Discord bot plugged into the OpenAI API" isn't a unique business proposition and a ton of other bot devs are doing similar projects, so my goal is to spread the bot with the user-focused features and eventually transition it into a moderation bot similar to (but better than) big moderation bots like Dyno and Carl-bot. Moderation features are a bit harder to replicate than chatbot shit, and a lot harder to grow as a general rule, since people don't typically want to add a brand-new bot that requires a ton of dangerous permissions.
 
I did this as well back in the day and concluded, after noticing nothing had changed, that either:

a) The FBI doesn't give a shit
b) The FBI has a vested interest in this content existing
c) The whole process just had the effect and purpose of putting me on a list
d) All of the above
e) The FBI are the pedos
 