Kiwifarms Nitter/Bibliogram/Invidious Instances - Better privacy on site. Easier to archive.

It doesn't look like the wordfilter is handling instagram user profile links very well. Looking at the bottom of my Beauty Parlor OP here, I have two instagram profile links:
ig.tinf.io is successfully swapped in but it needs "u/" to be inserted in front of the username to work:
You can easily differentiate the two because instagram has two url formats, one for user profiles (see above) and one for direct links to posts that uses "p/":
 
I could not find anything looking through their github/sourcehut pages. I have no idea how difficult it would be to build archiving functionality into the source code, either.
One thing of note is that invidious lets you download videos directly, in the quality of your choice, via a link on the page. Unfortunately, you cannot use this to download an entire channel, but for individual videos it would mean more accessibility, instead of only one or two people in a thread using youtube-dl in a terminal.

Do agree with @The Real SVP though, archiving images would be many times easier.
Do they do any automated archiving?

You could set up a cron job to parse the access logs and look for new links. To save a page with archive.org, just do a GET on archive.org/save/<URL>. It can probably be done with regex.
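Something like this, roughly (the log path and path pattern are just placeholders, adjust to whatever the frontends actually log):

```python
# Rough cron-job sketch: scan a frontend's access log for tweet pages
# people viewed and ask the Wayback Machine to save the originals.
# The log path and path pattern are placeholders, not real config.
import re
import urllib.request

LOG_PATH = "/var/log/nginx/nitter.access.log"   # assumed location

# Nitter request paths look like /<user>/status/<id>; rebuild the
# twitter.com URL from that.
status_re = re.compile(r'"GET (/[^/ ]+/status/\d+)')

seen = set()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        for path in status_re.findall(line):
            url = "https://twitter.com" + path
            if url in seen:
                continue
            seen.add(url)
            try:
                # SavePageNow: a plain GET on /save/<URL> queues a capture.
                urllib.request.urlopen("https://web.archive.org/save/" + url, timeout=30)
            except Exception as exc:
                print("failed to save", url, exc)
```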

However, the best choice is probably to use warcprox. This will make a high-quality record of all outgoing HTTP requests sent by nitter/bibliogram/invidious, in much higher detail than just saving the HTML files. If you do that, you could send your data to the Internet Archive and have them integrate it.

However, if nitter just calls into the API, the archive won't be immediately browsable. So you'd ideally do both of these things.
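For reference, warcprox is just a regular HTTP proxy from the frontend's point of view, so wiring it up is mostly standard proxy plumbing. A rough sketch, assuming warcprox is listening on localhost:8000 and you accept that it re-signs TLS with its own CA:

```python
# Sketch: route outgoing requests through a local warcprox instance so
# they get written to a WARC file. Assumes warcprox is already running
# on localhost:8000 and that we accept its MITM certificate.
import requests

WARCPROX = "http://localhost:8000"

session = requests.Session()
session.proxies = {"http": WARCPROX, "https": WARCPROX}
session.verify = False  # warcprox terminates TLS with its own CA

# Any request made through this session gets recorded by warcprox.
resp = session.get("https://twitter.com/")
print(resp.status_code, len(resp.content), "bytes recorded to the WARC")
```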

inb4 kiwi farms gets banned from all the archiving sites and has to start up its own
 
That, minus the bans, is sort of the plan, if I understand Null correctly.
Jesus. Godspeed. This stuff isn't easy; there's a very good reason there's like two archive sites in the world that anybody uses.

The only way to archive a site is by running it in a headless browser; anything else won't work with modern websites (Discourse, Twitter) that use JavaScript for everything. Headless browsers need lots of resources, so you can't just rent 1 VPS and be done with it.
Also, you'll need clean, fast proxies to run the archival service. Their IPs get banned quickly.

The archive.is guy spends $1600 a month on it, has to moderate CP and handle DMCA requests, and he doesn't even have people trying to fuck with him.

The Farms can look forward to seeing even more abuse, combined with even less filtering (I don't think you'll do the insane CAPTCHA stuff).

You'll notice archive.is has extreme amounts of captchas. This is because displaying archived sites, let alone archiving arbitrary sites upon request, is extremely resource-intensive, and he'd get DoSed into oblivion if he didn't.

The Internet Archive has an annual budget of $10 million and several petabytes' worth of storage.

I'm not saying it's impossible, especially if you limit it to users of the forum, cache things heavily, and accept saving lossy copies of everything. But it will be a much greater engineering problem than running the forum at scale.
 
The only way to archive a site is by running it in a headless browser; anything else won't work with modern websites (Discourse, Twitter) that use JavaScript for everything. Headless browsers need lots of resources, so you can't just rent 1 VPS and be done with it.
Why the headless browser? What makes that specifically so resource intensive?
 
Browsers in general are resource-intensive; look at how much CPU it takes to load a Twitter page. If you're loading many pages every second, you'll need a lot of servers.
I know that. I was just under the impression that a headless browser was somehow more intensive than a typical browser.

The caching is a pretty big problem. I have an idea as to how it could possibly be mitigated, but it isn't an easy one.

You'd take a commonly archived website, like twitter for instance, and look at the global elements: anything static that exists on every page, such as button/logo imagery, CSS, and layout. No matter what account or post you're looking at, these global elements will always be present.

Take the global elements, make a "template" out of them, and have this exist on users' computers.

Then, when you request a webpage, the archive asks your computer whether it has the necessary template; if it does, the archive only sends the dynamic information, which your browser assembles into a full webpage locally.

If a webpage is 70 KB and 50 KB of that is global elements, such as the twitter logo and layout HTML, then with those on your computer the server would only need to supply 20 KB of data for you to view the page. This is fundamentally what browser caching is, just in a form compatible with internet archives.

Of course this would require having a program on your computer to store the templates, communicate with the archive in a unique way, and construct full pages to hand off to the browser for rendering.
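To make the numbers concrete, here's a toy sketch of that template idea (completely hypothetical, nothing works this way today):

```python
# Toy illustration of the template idea: the client holds the shared
# "global" markup and the server only ships the per-page data.
# Entirely hypothetical; no real archive works this way.
import hashlib
import json

TEMPLATE = """<html><body>
<img src="logo.png"><div class="tweet">
  <span class="author">{author}</span>
  <p class="text">{text}</p>
</div></body></html>"""                      # ~50 KB of layout in real life

TEMPLATE_ID = hashlib.sha256(TEMPLATE.encode()).hexdigest()

def server_response(client_template_id, page_data):
    """Send only the dynamic data if the client already has the template."""
    if client_template_id == TEMPLATE_ID:
        return json.dumps({"template": TEMPLATE_ID, "data": page_data})
    return json.dumps({"template_body": TEMPLATE, "data": page_data})

def client_render(response_json, local_templates):
    msg = json.loads(response_json)
    template = msg.get("template_body") or local_templates[msg["template"]]
    return template.format(**msg["data"])

# Client already has the template cached, so only the small dynamic part moves.
local_templates = {TEMPLATE_ID: TEMPLATE}
wire = server_response(TEMPLATE_ID, {"author": "@example", "text": "hello"})
print(client_render(wire, local_templates))
```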
 
The caching is a pretty big problem. I have an idea as to how it could possibly be mitigated, but it isn't an easy one.
The caching isn't a problem; it's the only way to run a site like this at scale. You would either have to have your headless browser dump out the DOM after running all the JavaScript (archive.is method) or try to do weird hacks to get the JavaScript to run semi-properly (archive.org method).
You'd take a commonly archived website, like twitter for instance, and look at the global elements: anything static that exists on every page, such as button/logo imagery, CSS, and layout. No matter what account or post you're looking at, these global elements will always be present.
This is what Twitter has done since the redesign. It's the reason the site is total shit.
Take the global elements, make a "template" out of them, and have this exist on users' computers.

Then, when you request a webpage, the archive asks your computer whether it has the necessary template; if it does, the archive only sends the dynamic information, which your browser assembles into a full webpage locally.

If a webpage is 70 KB and 50 KB of that is global elements, such as the twitter logo and layout HTML, then with those on your computer the server would only need to supply 20 KB of data for you to view the page. This is fundamentally what browser caching is, just in a form compatible with internet archives.

Of course this would require having a program on your computer to store the templates, communicate with the archive in a unique way, and construct full pages to hand off to the browser for rendering.
No, that's fine. What you're describing already exists. If you load Twitter in your browser, it takes a template and fills it in by calling the Twitter API. If you run Nitter with warcprox, you would save those API calls, and you could (theoretically) have Nitter interact with Twitter and fall back to your Twitter archive if it fails. Hacking Twitter's "real" web interface to call into your archive would be much more difficult, but that's not strictly necessary.

Archiving Twitter is a solved problem; it's the arbitrary web pages that are difficult.
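The fallback could be as dumb as this (fetch_live and fetch_from_warc are made-up stand-ins, not real Nitter functions):

```python
# Hypothetical fallback: try the live Twitter API first, and serve the
# archived copy of the same call if the live request fails. fetch_live
# and fetch_from_warc are made-up stand-ins, not real Nitter code.
def get_tweet(tweet_id, fetch_live, fetch_from_warc):
    try:
        return fetch_live(tweet_id)      # normal path: hit Twitter
    except Exception:
        # Deleted tweet, suspended account, rate limit, etc. -- fall back
        # to the response previously recorded through warcprox.
        archived = fetch_from_warc(tweet_id)
        if archived is None:
            raise
        return archived
```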
 
This is what Twitter has done since the redesign. It's the reason the site is total shit.
I wasn't referring to Twitter. I was referring to the interaction between individual computers and the archive itself. Having globally present elements on a user's computer would mean the archive wouldn't need as much bandwidth to achieve the same results. Only give connected users the information they don't have.

The archive could still interact with twitter normally without needing to do any API magic, just discarding elements by comparing them against what already exists in the database and replacing them with instructions for a "webpage compiler" that would exist on users' computers.

Storage internally could also make use of this, by having thousands of twitter pages consist only of their unique elements linked to a common "library" of elements shared by all of them. It's really the same idea as linux packages, where a single package can be used by multiple programs and the individual programs themselves don't need to harbor that data. Don't know if this is how it works already, but if it doesn't, it's an idea to use less space.

An idea to lower bandwidth and storage costs.
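The storage side of that is basically content-addressed storage. A rough sketch of the idea (hypothetical, not how any of the archives actually store things):

```python
# Sketch of the shared-"library" storage idea: every asset is stored
# once, keyed by its content hash, and each archived page just lists
# the hashes it needs. Hypothetical, for illustration only.
import hashlib

asset_store = {}   # hash -> bytes, the shared library
pages = {}         # page URL -> list of asset hashes

def store_asset(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    asset_store.setdefault(digest, data)   # identical assets stored once
    return digest

def store_page(url: str, assets: list) -> None:
    pages[url] = [store_asset(a) for a in assets]

# Two archived tweets that share the same logo only cost one copy of it.
logo = b"<fake twitter logo bytes>"
store_page("twitter.com/a/status/1", [logo, b"tweet text A"])
store_page("twitter.com/b/status/2", [logo, b"tweet text B"])
print(len(asset_store), "unique assets stored for", len(pages), "pages")
```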
The caching isn't a problem
I fucked up there. I was thinking about my little idea when I was writing it. You'd definitely need caching in the server.
(I write my posts in a fragmented, disorderly way. It causes fuck ups like this to happen if I'm not on top of it.)
 
I wasn't referring to Twitter. I was referring to the interaction between individual computers and the archive itself. Having globally present elements on a user's computer would mean the archive wouldn't need as much bandwidth to achieve the same results. Only give connected users the information they don't have.
It sounds a lot more difficult than just doing a static archive, though.
Storage internally could also make use of this, by having thousands of twitter pages consist only of their unique elements linked to a common "library" of elements shared by all of them. It's really the same idea as linux packages, where a single package can be used by multiple programs and the individual programs themselves don't need to harbor that data. Don't know if this is how it works already, but if it doesn't, it's an idea to use less space.
Yeah, this is the more practical approach - only save one copy of the Twitter logo rather than one for each archived Twitter page. It looks like that warcprox tool linked above already handles that to an extent.
 
Nice to see Nitter and these other third-party frontends being adopted by Null. Nitter especially, knowing the bullshit Twitter tries to do to prevent archiving.
 
I wasn't referring to Twitter. I was referring to the interaction between individual computers and the archive itself. Having globally present elements on a user's computer would mean the archive wouldn't need as much bandwidth to achieve the same results. Only give connected users the information they don't have.

The archive could still interact with twitter normally without needing to do any API magic, just discarding elements by comparing them against what already exists in the database and replacing them with instructions for a "webpage compiler" that would exist on users' computers.
There's no point. The bulk of the bandwidth is going to be spent on videos and pictures, not HTML. You can compress the HTML anyway and reap most of the space-saving benefits.
Storage internally could also make use of this, by having thousands of twitter pages consist only of their unique elements linked to a common "library" of elements shared by all of them. It's really the same idea as linux packages, where a single package can be used by multiple programs and the individual programs themselves don't need to harbor that data. Don't know if this is how it works already, but if it doesn't, it's an idea to use less space.
The JS libraries will do something like that, yes. jQuery will only be stored once, for instance.
 
So far the only thing I've noticed that I don't like is that youtube-dl treats tweet URLs displayed this way as "generic" and downloads every video it can find in whatever it considers to be a playlist, whereas on twitter it only targets the video in that specific tweet by default.

There is a tiny little "view in twitter" button at the top right to get around this, so I'm not complaining exactly, just something I noticed.
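If anyone wants to script around it, rewriting the link back to twitter.com before handing it to youtube-dl seems to do the same thing as the button. Rough sketch (the nitter hostname here is just a placeholder):

```python
# Sketch: rewrite a Nitter tweet link back to twitter.com before handing
# it to youtube-dl, so the Twitter extractor (single video) is used
# instead of the generic one. The nitter hostname is an assumption.
from urllib.parse import urlsplit, urlunsplit
import youtube_dl   # pip install youtube_dl

NITTER_HOSTS = {"nitter.net"}   # add whatever instance the forum filter uses

def to_twitter(url: str) -> str:
    parts = urlsplit(url)
    if parts.netloc in NITTER_HOSTS:
        parts = parts._replace(netloc="twitter.com")
    return urlunsplit(parts)

def download_tweet_video(url: str) -> None:
    with youtube_dl.YoutubeDL({"outtmpl": "%(id)s.%(ext)s"}) as ydl:
        ydl.download([to_twitter(url)])

# download_tweet_video("https://nitter.net/SomeUser/status/1234567890")
```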
 
Why the headless browser? What makes that specifically so resource intensive?

It's not just that browsers are fat pieces of shit that gorge themselves on your computer's memory and CPU. It's that you also need to let the browser sit on its fat ass and wait around for the internet to finish getting everything for you. Plus, a ton of websites are also bloated turds that take forever to load 5000 tracking scripts and unnecessary "features" like scrolljacking, etc.
 
It doesn't look like the wordfilter is handling instagram user profile links very well. Looking at the bottom of my Beauty Parlor OP here, I have two instagram profile links:
ig.tinf.io is successfully swapped in but it needs "u/" to be inserted in front of the username to work:
You can easily differentiate the two because instagram has two url formats, one for user profiles (see above) and one for direct links to posts that uses "p/":
Yeah, I just ran into this. An error page comes up but it has a link to the correct URL with the "u/" before the username.
So instagram .com/username should be replaced with ig.tinf.io/u/username
Pic for visualization: [attached screenshot: ig.tinf.io.jpg]
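A filter rule along those lines might look something like this (Python regex, purely illustrative; no idea how the forum's wordfilter is actually implemented):

```python
# Illustrative rewrite of instagram links to the Bibliogram instance:
# post links keep the /p/ form, profile links get the /u/ prefix.
# This is a sketch, not the forum's actual wordfilter.
import re

def rewrite_instagram(text: str) -> str:
    # Post links: instagram.com/p/<id> -> ig.tinf.io/p/<id>
    text = re.sub(r'https?://(?:www\.)?instagram\.com/p/([\w-]+)',
                  r'https://ig.tinf.io/p/\1', text)
    # Profile links: instagram.com/<username> -> ig.tinf.io/u/<username>
    text = re.sub(r'https?://(?:www\.)?instagram\.com/([\w.]+)/?',
                  r'https://ig.tinf.io/u/\1', text)
    return text

print(rewrite_instagram("https://www.instagram.com/someuser"))
# -> https://ig.tinf.io/u/someuser
print(rewrite_instagram("https://www.instagram.com/p/ABC123"))
# -> https://ig.tinf.io/p/ABC123
```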
 