Discord, big data, and pedophiles - The application of data science to Discord degeneracy

Dude, don't give out this (genuinely fascinating) work for free. Get in touch with your nearest university that has some kind of combined law and tech programme, work this up as a Master's by thesis or Ph.D. proposal, and get this published. You actually have a bundle of data that verifies some very common assumptions about internet grooming. This is valuable and important.
Seconding this @grand larsony. You've built something amazing here.

Here's a box plot which shows the distribution of average scores per user. You can see that the upper fence is about 6x higher than the median, and there are still quite a lot of users above the upper fence. I feel confident that anyone above this range is someone whose browser history should be looked into, at the very least.
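For anyone who wants to check their own numbers against mine, the fences are just the standard Tukey box plot math. A quick sketch of the calculation (the file and column names here are made up, not what the bot actually uses):

```python
import pandas as pd

# per-user average moderation scores, one row per user (hypothetical file/column names)
scores = pd.read_csv("user_avg_scores.csv")["avg_score"]

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # standard Tukey upper fence that most box plots draw

outliers = scores[scores > upper_fence]
print(f"median={scores.median():.4f}, upper fence={upper_fence:.4f}, "
      f"{len(outliers)} users above the fence")
```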
:stress:
 
This is exceptional work, OP.

@grand larsony I wouldn't feel too bad if the bot brings up false flags - it's normal. The fact that you can pull 10 messages and have around 40% of them be related to sexualising children is enough to show that it's capable of exposing the people who are a threat. Language is very nuanced and even the most money-pumped AI still struggles with it.
 
I run a Discord bot that does a lot of moderation-related stuff. I'm not going to name it here both because this isn't some self promotional bullshit, and also because I fear the Discord turbojannies would ban my bot if they found out I was posting on the transphobic neo-Nazi alt-right cyber stalker global headquarters. I do make some money off of it so it's kinda in my best interest not to fuck it up, but still I feel this should be shared.
I don't use Discord, but is it common knowledge that most community-provided bots invited to a server can (and will) harvest data from an otherwise "private" chat?
 
This is nuts. I'm also not surprised that Discord doesn't really care much about these sorts of things, given that a lot of upper-tier Discord staff are sexual deviants themselves. And like you said, it would involve a massive legal circus that nobody really wants to handle. I'm also pretty sure they don't do any scanning of their own, so they can blame PhotoDNA for anything that slips under the radar.
 
I don't use Discord, but is it common knowledge that most community-provided bots invited to a server can (and will) harvest data from an otherwise "private" chat?
Generally, you should operate under the assumption that they do, but Discord is actually decent about how they hand out permissions to bots. If your bot is in <100 servers you can do basically whatever you want with it and Discord's interference in your activities is much smaller. Once you pass 100 servers and have to do ID verification stuff, they also make you apply for different permissions. For example, I had to email back and forth with a Discord staff member explaining why I wanted permission for my bot to see messages that didn't explicitly tag it by username. I had to explain what I'd do with that permission, how I'd store the data, who would have access to it, etc. When I told them I was running it off of my home computer (no longer the case, btw) they actually asked me to explain what it'd take for someone to walk into my house and steal my hard drive. Like, I had to tell them how hard it'd be to rob my house lol.
From what I understand, they really cracked down on bots indiscriminately harvesting user info after they had some issues where bots were built for the explicit purpose of making semi-public community data available to the public. Like there were a couple of bots where, you'd add it to your server, and without telling you, the bot would make all of your server's messages + user info publicly searchable. It seems that caused them a bit of embarrassment.
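For anyone curious what the "see messages without being tagged" permission actually is: it's the privileged message content intent. A minimal sketch of what turning it on looks like, assuming discord.py (the prefix and token here are placeholders, not my bot's):

```python
import discord
from discord.ext import commands

intents = discord.Intents.default()
intents.message_content = True  # privileged intent; verified bots have to justify this to Discord

bot = commands.Bot(command_prefix="!", intents=intents)

@bot.event
async def on_message(message: discord.Message):
    if message.author.bot:
        return
    # without the message_content intent, message.content comes back empty
    # for most messages that don't mention the bot directly
    await bot.process_commands(message)

bot.run("YOUR_BOT_TOKEN")  # placeholder
```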
 
I see the utility, but all of that existed in some form before. Perhaps separated, but there is nothing wrong with that.
Discord did nothing new and only added zoomer tier memes into their app which attracts kids and pedos.
There's nothing wrong with making it more convenient by bringing it all into one app, for free, either.

You're being purposely obtuse with these kinds of claims. The full feature list, convenience and accessibility of discord are what brought it into the big leagues of gaming-oriented chat programs, which naturally brings in children and zoomer type features. Making the claim that the application itself is developed with the intent of providing avenues for preying on children is a leap that needs more evidence than literally adding features that normal people want in a messaging application.
 
There's nothing wrong with making it more convenient by bringing it all into one app, for free, either.

You're being purposely obtuse with these kinds of claims. The full feature list, convenience and accessibility of discord are what brought it into the big leagues of gaming-oriented chat programs, which naturally brings in children and zoomer type features. Making the claim that the application itself is developed with the intent of providing avenues for preying on children is a leap that needs more evidence than literally adding features that normal people want in a messaging application.
I think that a lot of times, people want to believe that someone is in control. Even if someone malicious is in control, the idea of someone malicious pulling the strings from behind the scenes is more comforting than the idea of total chaos stemming from the disinterest of the people powerful enough to make a more orderly, sane, positive world. At least if that were the case, there'd be someone who could be held accountable for the things that are wrong with the world. I can see why it's tempting to believe that this is intentionally abusive behavior from Discord higher ups, even though I know the reality is that they just don't care.

@grand larsony How much overlap is there between the pedo users and users talking about other topics likely to be the subject of moderation, like animal abuse or drug use?
This is harder to answer since the moderation API only covers broader categories. As the bot grows I plan to build purpose-built classifiers for finer categories. For example, I'd like to split the "hate" category into racism, sexism, and transphobia, and the sexual category into normal sexuality, fetish content, extreme fetish content, etc. But to do that I need more training data. 2 million might sound like a lot of messages, but the vast majority are people saying totally normal things, even with as many disgusting freaks as there are in the database, so positive examples of specific forms of inappropriate content are somewhat hard to come by at this stage.
If you're curious, here's the list of categories I currently scan for: hate, hate_threatening, violence, violence_graphic, sexual, sexual_minors, harassment, harassment_threatening, self-harm intention, self-harm instruction, insult, obscenity, threat, identity_attack. There's some overlap between categories but the points where they don't overlap allow moderators to make some nice fine-tuned choices about what they'd like to allow. E.g. setting a strong obscenity filter but a weak insult filter, so that "you're so dumb" wouldn't be blocked but "you're a fucking retard" would.
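If you want a feel for how those per-category thresholds get applied, here's a stripped-down sketch against OpenAI's moderation endpoint. The threshold numbers are purely illustrative, and a few categories in my list (insult, obscenity, threat, identity_attack) aren't part of that endpoint, so they're left out of the sketch:

```python
import os
import requests

# illustrative thresholds only - a real server config would tune these per category
THRESHOLDS = {
    "sexual/minors": 0.05,
    "sexual": 0.80,
    "hate/threatening": 0.30,
    "harassment/threatening": 0.30,
    "violence/graphic": 0.50,
}

def flagged_categories(text: str) -> dict:
    """Return the moderation categories whose scores exceed the configured thresholds."""
    resp = requests.post(
        "https://api.openai.com/v1/moderations",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"input": text},
        timeout=10,
    )
    resp.raise_for_status()
    scores = resp.json()["results"][0]["category_scores"]
    return {
        cat: scores[cat]
        for cat, limit in THRESHOLDS.items()
        if cat in scores and scores[cat] > limit
    }
```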
Good news though - Meta just released a new classification tool that's purpose built for matching against user-defined categories. This is something I plan to add to the bot as soon as I get approved for research access. This will make it much easier to scan and track a much wider variety of content categories. Facebook spyware link, click at your own risk lol - https://ai.meta.com/llama/purple-llama/
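I haven't touched it yet, obviously, but based on the Hugging Face release of the Llama Guard model from that suite, I'd expect the integration to look roughly like this (access is gated, and the prompt format and taxonomy may well change before I get approved):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# gated model from Meta's Purple Llama suite; needs approved access on Hugging Face
model_id = "meta-llama/LlamaGuard-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify(message: str) -> str:
    """Return the model's verdict: 'safe', or 'unsafe' plus the violated category codes."""
    chat = [{"role": "user", "content": message}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```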

(I mentioned the hate category here, but before anyone jumps down my throat about it, I think that yes, free speech is critical to a free society, but there's also plenty of times and places where certain things are obviously unwelcome. Kiwi Farms is a great place to call people niggers and trannies, but the local PTA meeting probably isn't, that's my point.)
 
Holy shit, that's impressive work, man.

Aren't you scared to post sample messages? Call me paranoid, but if anyone recognizes them, or worse, can search for them in their own databases, your bot could be traced.
 
My plan is to slowly transition it to being more moderation focused which is why I've spent so much time developing analytics for moderators. I recognize that "Discord bot plugged into the OpenAI API" isn't a unique business proposition and a ton of other bot devs are doing similar projects, so my goal is that I can spread the bot with the user-focused features and eventually transition it to being a moderation bot similar to (but better than) big moderation bots like Dyno and CarlBot. Moderation features are a bit harder to replicate than chatbot shit, and a lot harder to grow as a general rule, since people don't typically want to add a brand-new bot that requires a ton of dangerous permissions.

I think this touches on my point but does not really answer it in some ways. I understand your strategy when it comes to entering the market and building trust, but I am more curious about the economics of it all on both sides.

I think that you have the right idea, but that there might be more to it than moderation. The way I think of it, it might actually be your foot in the door to greater things.

If I own a Discord server and I pay you to help with moderation or admin, then I am making money one way or the other. And moderation is only a means to an end.

Now, if you know how I make money, your bot can become insanely more valuable.

If you are somehow able to profile people, posts or whatever the fuck they are doing on discord and cross this data with conversion metrics, you can deliver very valuable insights. I am talking an easy sell for like $499 a month.
 
Holy shit, that's impressive work, man.

Aren't you scared to post sample messages? Call me paranoid, but if anyone recognizes them, or worse, can search for them in their own databases, your bot could be traced.
They could, but that would require the senders of these messages to out themselves as the ones doing pedophilic mommy breastfeeding roleplays on Discord. Chances are, nobody who has access to those databases is going to run the reverse lookup you're suggesting, because doing it would turn the spotlight onto them.
 
What's being done about this?
Honestly? As far as I can see, nothing.
It's an estimated 20 billion dollar industry.

Child exploitation is a black market, which means a lot of it is run by the Mob, which means the Glowinthedarkniggers are in on it. The Mob and Glowies blur together. Black budgets don't get directly funded by taxpayers; they are made off the books.

They are not just disinterested in fighting child exploitation, they are opposed to it, especially at the top.
 
Kiwi Hero 3.jpeg
 
I run a Discord bot that does a lot of moderation-related stuff. I'm not going to name it here both because this isn't some self promotional bullshit, and also because I fear the Discord turbojannies would ban my bot if they found out I was posting on the transphobic neo-Nazi alt-right cyber stalker global headquarters.

I always had a feeling the guy who coded Dyno was based

FBI=
Federal
Booty
Inspectors
More like Five-year-old Body Inspectors
 
FBI=
Federal
Booty
Inspectors
Then what is the SEC?
:thinking:

This is harder to answer since the moderation API only covers broader categories. As the bot grows I plan to build purpose-built classifiers for finer categories. For example, I'd like to split the "hate" category into racism, sexism, and transphobia, and the sexual category into normal sexuality, fetish content, extreme fetish content, etc. But to do that I need more training data. 2 million might sound like a lot of messages, but the vast majority are people saying totally normal things, even with as many disgusting freaks as there are in the database, so positive examples of specific forms of inappropriate content are somewhat hard to come by at this stage.
If you're curious, here's the list of categories I currently scan for: hate, hate_threatening, violence, violence_graphic, sexual, sexual_minors, harassment, harassment_threatening, self-harm intention, self-harm instruction, insult, obscenity, threat, identity_attack. There's some overlap between categories but the points where they don't overlap allow moderators to make some nice fine-tuned choices about what they'd like to allow. E.g. setting a strong obscenity filter but a weak insult filter, so that "you're so dumb" wouldn't be blocked but "you're a fucking retard" would.
Good news though - Meta just released a new classification tool that's purpose built for matching against user-defined categories. This is something I plan to add to the bot as soon as I get approved for research access. This will make it much easier to scan and track a much wider variety of content categories. Facebook spyware link, click at your own risk lol - https://ai.meta.com/llama/purple-llama/

(I mentioned the hate category here, but before anyone jumps down my throat about it, I think that yes, free speech is critical to a free society, but there's also plenty of times and places where certain things are obviously unwelcome. Kiwi Farms is a great place to call people niggers and trannies, but the local PTA meeting probably isn't, that's my point.)
I have had a lot of success with using classifiers to train themselves. I start with particularly strong examples (find them using keywords and the like), train the classifier off those, and use that trained iteration to identify more subtle examples, which I can fix or filter manually. It doesn't take long to train text-based classifiers, and you only need a few decent iterations before the accuracy gets pretty good. Then you could always adjust sensitivity with a bit of quick maffs.
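If it helps, the loop I'm describing is basically just this (a sketch with scikit-learn; the review step, the cutoff, and the round count are all things you'd tune, and the seed set needs both positives and ordinary negatives):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bootstrap_classifier(seed_texts, seed_labels, unlabeled_pool,
                         review_fn, rounds=3, cutoff=0.7):
    """Self-training loop: start from strong keyword-found examples, let each trained
    iteration surface subtler candidates from the unlabeled pool, have a human correct
    them (review_fn), and fold the fixed labels back into the training set."""
    clf = make_pipeline(TfidfVectorizer(min_df=2, ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    texts, labels = list(seed_texts), list(seed_labels)
    pool = list(unlabeled_pool)

    for _ in range(rounds):
        if not pool:
            break
        clf.fit(texts, labels)
        probs = clf.predict_proba(pool)[:, 1]

        # pull high-scoring messages out of the pool for manual review
        candidates = [t for t, p in zip(pool, probs) if p > cutoff]
        pool = [t for t, p in zip(pool, probs) if p <= cutoff]

        for text, label in review_fn(candidates):  # human-in-the-loop correction
            texts.append(text)
            labels.append(label)

    return clf
```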

Also, great work. I started maintaining a few Telegram bots years ago, not necessarily for moderation. Once they began getting some use, I quickly realized, much like yourself, that I could use them to get insights into fringe communities. For better or worse, we all know Telegram is very hands-off, so there aren't a lot of people monitoring this sort of thing. I didn't want to collect any images for obvious reasons, but it only took ~30 minutes of logging message text before I was scarred for life. Curiosity killed the yat, I guess.

I was going to shut it down™ until I realized how many researchers would kill for this data. Now I just sell it in bulk and use some of that money on the server costs. Very lucrative, and I love the intense irony of Telegram users who think their messages in public groups are even remotely private having their chat data sold. If they ain't broke nowadays, I don't touch any of the bots with a 100 foot pole.
 