Official Policy on Screen Scraping

Apteryx deliciosa

@Null, our benevolent dictator. I want to screen-scrape the farms. The Internet Archive "excluded" us so I can't download from them.

I fear that the trans(atlantic) Serbian ass-ass-ins will one day come for you, and that will be the death of the farms. I want to make my own backup without killing the frontends.

Please can you issue an edict on this? How many reqs per second is acceptable? Is there an API endpoint other than the RSS feeds? Are there RSS feeds for threads?

Thanks
 
According to Null during the events of August/September, a torrent will be made available in the event of the site's imminent demise. Really, the most important content is the media archives (audio/video), which can't be screen-capped.
I understand your enthusiasm for wanting to protect the site's content, but are you sure you could even do this at a performance/technical level? Some threads have thousands of pages of posts, each with their own images to load. I'm not familiar with scrapers aside from some crawling using Nutch, so I'm interested in your implementation.
 
I'm confident (and competent) that I can archive the whole site continuously and properly. I regularly do this (data hoarding) as a serious hobby. Also, "thousands" is a small scale in computer terms, and most of the site's data is text in posts. Even then, archiving attachments is trivial except for storage, and TBs are cheap.
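To put "small in computer terms" in perspective, here's a back-of-envelope estimate. Every constant in it (posts per page, bytes per post, pages per thread) is a made-up assumption for illustration, not a measurement of the site:

```python
# Back-of-envelope sizing. Every constant here is an assumption for
# illustration, not a measurement of the actual site.
POSTS_PER_PAGE = 20           # assumed page size
HTML_BYTES_PER_POST = 4_000   # assumed average markup + text per post
PAGES_IN_HUGE_THREAD = 5_000  # "thousands of pages"

thread_bytes = PAGES_IN_HUGE_THREAD * POSTS_PER_PAGE * HTML_BYTES_PER_POST
print(f"one huge thread, HTML only: ~{thread_bytes / 1e9:.1f} GB")
# => ~0.4 GB; even the biggest threads are tiny next to a multi-TB drive.
```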

The system VAGUELY works like this (a gross oversimplification):

Given a URL (forum or thread) in the "URL Queue" (see the sketch after this list):
1. download the HTML (I'm being vague b/c of doxing reasons) if it's changed since the last "capture"
2. parse the HTML
3. save the HTML
4. find all links matching a pattern (e.g. kiwifarms.net/threads/*) and add these URLs to the "URL Queue"
5. find all attachments and add these to the "Attachment Queue"
6. if the URL is a thread, add the "filename" (again being VAGUE) to the "Thread HTML Queue"
7. if the URL is a forum, add the "filename" (again being VAGUE) to the "Forum HTML Queue"
8. mark the URL as "captured at $DATE" in the database
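To make those steps concrete without exposing anything real, here's a throwaway sketch of what one capture pass could look like. The db/queue interfaces, the conditional-GET change detection, and the link pattern are all placeholders and assumptions, not my actual implementation:

```python
import re
import requests  # third-party: pip install requests

THREAD_LINK = re.compile(r"https://kiwifarms\.net/threads/[^\"'#?]+")

def capture(url, db, url_queue, attachment_queue, thread_queue, forum_queue):
    """One pass of steps 1-8 for a single URL. db is assumed to expose
    get_validators/save_html/mark_captured; the queues are assumed to have .put()."""
    # 1. download the HTML only if it changed since the last "capture",
    #    approximated here with a conditional GET (an assumption).
    etag, last_modified = db.get_validators(url)
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return  # unchanged since last capture
    resp.raise_for_status()
    html = resp.text

    # 2-3. parse and save the HTML (parsing elided here)
    filename = db.save_html(url, html)

    # 4. enqueue links that match the thread pattern
    for link in set(THREAD_LINK.findall(html)):
        url_queue.put(link)

    # 5. hand the page to the attachment-extraction stage
    attachment_queue.put(filename)

    # 6-7. route the saved page to the right extraction queue
    if "/threads/" in url:
        thread_queue.put(filename)
    else:
        forum_queue.put(filename)

    # 8. record the capture time and validators for the next run
    db.mark_captured(url, etag=resp.headers.get("ETag"),
                     last_modified=resp.headers.get("Last-Modified"))
```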

Every specialized queue is used for one or more operations, but every queue is parallel and rate-limited (see the sketch after this list):
  • The URL Queue is effectively the crawler
  • The Attachment Queue is used to extract attachment URLs from the HTML and feed them into the Attachment Download Queue
  • The Thread Queue is used to extract posts and metadata from the HTML
  • The Forum Queue is used to discover new threads and to extract metadata about forums and their threads
  • The Attachment Download Queue downloads attachments
  • Yada yada yada, you get the idea
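Just to make "parallel and rate-limited" concrete, here's a rough sketch of one such stage: a worker pool draining a queue through a handler while shared pacing caps the request rate. This is not my actual code; the names, worker count, and rate are placeholders:

```python
import queue
import threading
import time

def run_stage(work_queue: "queue.Queue", handler, workers: int = 4,
              max_per_second: float = 2.0) -> None:
    """Drain a pre-filled work_queue through handler with a shared rate cap.
    handler is whatever this stage does (download an attachment, parse a
    thread page, ...); everything here is a placeholder sketch."""
    interval = 1.0 / max_per_second
    lock = threading.Lock()
    next_slot = [time.monotonic()]

    def throttle() -> None:
        # Each job reserves the next time slot, so the whole pool stays
        # under max_per_second no matter how many workers there are.
        with lock:
            slot = max(next_slot[0], time.monotonic())
            next_slot[0] = slot + interval
        time.sleep(max(0.0, slot - time.monotonic()))

    def worker() -> None:
        while True:
            item = work_queue.get()
            if item is None:              # poison pill: stop this worker
                work_queue.task_done()
                return
            try:
                throttle()
                handler(item)
            finally:
                work_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()
    for _ in threads:                      # one poison pill per worker
        work_queue.put(None)
    for t in threads:
        t.join()
```

Each of the queues above would get its own run_stage call with its own handler and its own rate limit.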
Now, THE SYSTEM DOES NOT LOOK LIKE THIS. THIS IS NOT A PLAN OF WHAT I WILL DO, NOR HOW IT'S IMPLEMENTED, BUT INSTEAD A VAGUE OVERVIEW TO SHOW THAT I CAN DO BETTER THAN USING WGET TO CRAWL THE SITE.

I don't want to be doxxed down the line, so I can't go into any more details. I really do want to talk about tools and methods/techniques, but it's risky b/c the troons are also technically inclined, so I can't talk too much.
 
If I ran this site I wouldn't reply to you, at least not in public. If I were you I would just set a reasonable throttle and do it. Bonus points for implementing exponential backoff like Googlebot does: it's a good way to take the hint that load is high and stop contributing to the problem.
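Something like this, where the status codes, delays, and cap are just illustrative assumptions, nothing Googlebot-specific:

```python
import random
import time

import requests  # third-party: pip install requests

def fetch_with_backoff(url: str, max_tries: int = 6,
                       base_delay: float = 1.0, cap: float = 120.0):
    """GET url, doubling the wait whenever the server signals overload.
    Status codes, delays, and the cap are illustrative assumptions."""
    delay = base_delay
    for _ in range(max_tries):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in (429, 503):
            return resp                  # success (or a non-throttle error)
        # Honor Retry-After if the server sent one, otherwise use our delay.
        retry_after = resp.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delay
        time.sleep(min(wait, cap) + random.uniform(0.0, 0.5))  # add jitter
        delay = min(delay * 2, cap)      # exponential growth, capped
    raise RuntimeError(f"still throttled after {max_tries} attempts: {url}")
```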

The hard part is hosting it when you're done.
 