Kiwifarms Nitter/Bibliogram/Invidious Instances - Better privacy on site. Easier to archive.

It doesn't look like the wordfilter is handling instagram user profile links very well. Looking at the bottom of my Beauty Parlor OP here, I have two instagram profile links:
ig.tinf.io is successfully swapped in but it needs "u/" to be inserted in front of the username to work:
You can easily differentiate the two because instagram has two url formats, one for user profiles (see above) and one for direct links to posts that uses "p/":
 
I could not find anything looking through their github/sourcehut pages. I have no idea how difficult it would be to build archiving functionality into the source code, either.
One thing of note is that invidious lets you download videos directly, in the quality of your choice, via a link on the page. Unfortunately, you cannot use this to download an entire channel, but for individual videos it would mean more accessibility, instead of only one or two people in a thread using youtube-dl in a terminal.

Do agree with @The Real SVP though, archiving images would be many times easier.
Do they do any automated archiving?

You could set up a cron job to parse the access logs and look for new links. To save a page with archive.org, just do a GET on archive.org/save/<URL>. It can probably be done with regex.
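Something like this, roughly (the log path and path pattern are just placeholders, adjust to whatever the frontends actually log):

```python
# Rough cron-job sketch: scan a frontend's access log for tweet pages
# people viewed and ask the Wayback Machine to save the originals.
# The log path and path pattern are placeholders, not real config.
import re
import urllib.request

LOG_PATH = "/var/log/nginx/nitter.access.log"   # assumed location

# Nitter request paths look like /<user>/status/<id>; rebuild the
# twitter.com URL from that.
status_re = re.compile(r'"GET (/[^/ ]+/status/\d+)')

seen = set()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        for path in status_re.findall(line):
            url = "https://twitter.com" + path
            if url in seen:
                continue
            seen.add(url)
            try:
                # SavePageNow: a plain GET on /save/<URL> queues a capture.
                urllib.request.urlopen("https://web.archive.org/save/" + url, timeout=30)
            except Exception as exc:
                print("failed to save", url, exc)
```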

However, the best choice is probably to use warcprox. This will make a high-quality record of all outgoing HTTP requests sent by nitter/bibliogram/invidious, in much higher detail than just saving the HTML files. If you do that, you could send your data to the Internet Archive and have them integrate it.

However, if nitter just calls into the API, the archive won't be immediately browsable. So you'd ideally do both of these things.
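For reference, warcprox is just a regular HTTP proxy from the frontend's point of view, so wiring it up is mostly standard proxy plumbing. A rough sketch, assuming warcprox is listening on localhost:8000 and you accept that it re-signs TLS with its own CA:

```python
# Sketch: route outgoing requests through a local warcprox instance so
# they get written to a WARC file. Assumes warcprox is already running
# on localhost:8000 and that we accept its MITM certificate.
import requests

WARCPROX = "http://localhost:8000"

session = requests.Session()
session.proxies = {"http": WARCPROX, "https": WARCPROX}
session.verify = False  # warcprox terminates TLS with its own CA

# Any request made through this session gets recorded by warcprox.
resp = session.get("https://twitter.com/")
print(resp.status_code, len(resp.content), "bytes recorded to the WARC")
```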

inb4 kiwi farms gets banned from all the archiving sites and has to start up its own
 
That, minus the bans, is sort of the plan, if I understand Null correctly.
Jesus. Godspeed. This stuff isn't easy; there's a very good reason there's like two archive sites in the world that anybody uses.

The only way to archive a site is by running it in a headless browser; anything else won't work with modern websites (Discourse, Twitter) that use JavaScript for everything. Headless browsers need lots of resources, so you can't just rent 1 VPS and be done with it.
Also, you'll need clean, fast proxies to run the archival service. Their IPs get banned quickly.

The archive.is guy spends $1600 a month on it, has to moderate CP and handle DMCA requests, and he doesn't even have people trying to fuck with him.

The Farms can look forward to seeing even more abuse, combined with even less filtering (I don't think you'll do the insane CAPTCHA stuff).

You'll notice archive.is has extreme amounts of captchas. This is because displaying archived sites, let alone archiving arbitrary sites upon request, is extremely resource-intensive, and he'd get DoSed into oblivion if he didn't.

The Internet Archive has an annual budget of $10 million and several petabytes' worth of storage.

I'm not saying it's impossible, especially if you limit it to users of the forum, cache things heavily, and accept saving lossy copies of everything. But it will be a much greater engineering problem than running the forum at scale.
 
The only way to archive a site is by running it in a headless browser; anything else won't work with modern websites (Discourse, Twitter) that use JavaScript for everything. Headless browsers need lots of resources, so you can't just rent 1 VPS and be done with it.
Why the headless browser? What makes that specifically so resource intensive?
 
Browsers in general are resource-intensive; look at how much CPU it takes to load a Twitter page. If you're loading many pages every second, you'll need a lot of servers.
I know that. I was just under the impression that a headless browser was somehow more intensive than a typical browser.

The caching is a pretty big problem. I have an idea as to how it could possibly be mitigated, but it isn't an easy one.

You'd take a commonly archived website, like twitter for instance, and look at the global elements: anything static that exists on every page, such as button/logo imagery, CSS, and layout. No matter what account or post you're looking at, these global elements will always be present.

Take the global elements, make a "template" out of them, and have this exist on users' computers.

Then, when you request a webpage, the archive asks your computer whether it has the necessary template; if it does, the archive only sends the dynamic information, which your browser assembles into a full webpage locally.

If a webpage is 70 KB and 50 KB of that is global elements, such as the twitter logo and layout HTML, then with those on your computer the server would only need to supply 20 KB of data for you to view the page. This is fundamentally what browser caching is, just in a form compatible with internet archives.

Of course this would require having a program on your computer to store the templates, communicate with the archive in a unique way, and construct full pages to hand off to the browser for rendering.
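To make the numbers concrete, here's a toy sketch of that template idea (completely hypothetical, nothing works this way today):

```python
# Toy illustration of the template idea: the client holds the shared
# "global" markup and the server only ships the per-page data.
# Entirely hypothetical; no real archive works this way.
import hashlib
import json

TEMPLATE = """<html><body>
<img src="logo.png"><div class="tweet">
  <span class="author">{author}</span>
  <p class="text">{text}</p>
</div></body></html>"""                      # ~50 KB of layout in real life

TEMPLATE_ID = hashlib.sha256(TEMPLATE.encode()).hexdigest()

def server_response(client_template_id, page_data):
    """Send only the dynamic data if the client already has the template."""
    if client_template_id == TEMPLATE_ID:
        return json.dumps({"template": TEMPLATE_ID, "data": page_data})
    return json.dumps({"template_body": TEMPLATE, "data": page_data})

def client_render(response_json, local_templates):
    msg = json.loads(response_json)
    template = msg.get("template_body") or local_templates[msg["template"]]
    return template.format(**msg["data"])

# Client already has the template cached, so only the small dynamic part moves.
local_templates = {TEMPLATE_ID: TEMPLATE}
wire = server_response(TEMPLATE_ID, {"author": "@example", "text": "hello"})
print(client_render(wire, local_templates))
```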
 
The caching is a pretty big problem. I have an idea as to how it could possibly be mitigated, but it isn't an easy one.
The caching isn't a problem; it's the only way to run a site like this at scale. You would either have to have your headless browser dump out the DOM after running all the JavaScript (archive.is method) or try to do weird hacks to get the JavaScript to run semi-properly (archive.org method).
You'd take a commonly archived website, like twitter for instance, and look at the global elements: anything static that exists on every page, such as button/logo imagery, CSS, and layout. No matter what account or post you're looking at, these global elements will always be present.
This is what Twitter has done since the redesign. It's the reason the site is total shit.
Take the global elements, make a "template" out of them, and have this exist on users' computers.

Then, when you request a webpage, the archive asks your computer whether it has the necessary template; if it does, the archive only sends the dynamic information, which your browser assembles into a full webpage locally.

If a webpage is 70 KB and 50 KB of that is global elements, such as the twitter logo and layout HTML, then with those on your computer the server would only need to supply 20 KB of data for you to view the page. This is fundamentally what browser caching is, just in a form compatible with internet archives.

Of course this would require having a program on your computer to store the templates, communicate with the archive in a unique way, and construct full pages to hand off to the browser for rendering.
No, that's fine. What you're describing already exists. If you load Twitter in your browser, it takes a template and fills it in by calling the Twitter API. If you run Nitter with warcprox, you would save those API calls, and you could (theoretically) have Nitter interact with Twitter and fall back to your Twitter archive if it fails. Hacking Twitter's "real" web interface to call into your archive would be much more difficult, but that's not strictly necessary.

Archiving Twitter is a solved problem; it's the arbitrary web pages that are difficult.
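The fallback could be as dumb as this (fetch_live and fetch_from_warc are made-up stand-ins, not real Nitter functions):

```python
# Hypothetical fallback: try the live Twitter API first, and serve the
# archived copy of the same call if the live request fails. fetch_live
# and fetch_from_warc are made-up stand-ins, not real Nitter code.
def get_tweet(tweet_id, fetch_live, fetch_from_warc):
    try:
        return fetch_live(tweet_id)      # normal path: hit Twitter
    except Exception:
        # Deleted tweet, suspended account, rate limit, etc. -- fall back
        # to the response previously recorded through warcprox.
        archived = fetch_from_warc(tweet_id)
        if archived is None:
            raise
        return archived
```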
 
This is what Twitter has done since the redesign. It's the reason the site is total shit.
I wasn't referring to Twitter. I was referring to the interaction between individual computers and the archive itself. Having globally present elements on a user's computer would mean the archive wouldn't need as much bandwidth to achieve the same results. Only give connected users the information they don't have.

The archive could still interact with twitter normally without needing to do any API magic, just discarding elements by comparing them against what already exists in the database and replacing them with instructions for a "webpage compiler" that would exist on users' computers.

Storage internally could also make use of this, by having thousands of twitter pages consist only of their unique elements linked to a common "library" of elements shared by all of them. It's really the same idea as linux packages, where a single package can be used by multiple programs and the individual programs themselves don't need to harbor that data. Don't know if this is how it works already, but if it doesn't, it's an idea to use less space.

An idea to lower bandwidth and storage costs.
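The storage side of that is basically content-addressed storage. A rough sketch of the idea (hypothetical, not how any of the archives actually store things):

```python
# Sketch of the shared-"library" storage idea: every asset is stored
# once, keyed by its content hash, and each archived page just lists
# the hashes it needs. Hypothetical, for illustration only.
import hashlib

asset_store = {}   # hash -> bytes, the shared library
pages = {}         # page URL -> list of asset hashes

def store_asset(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    asset_store.setdefault(digest, data)   # identical assets stored once
    return digest

def store_page(url: str, assets: list) -> None:
    pages[url] = [store_asset(a) for a in assets]

# Two archived tweets that share the same logo only cost one copy of it.
logo = b"<fake twitter logo bytes>"
store_page("twitter.com/a/status/1", [logo, b"tweet text A"])
store_page("twitter.com/b/status/2", [logo, b"tweet text B"])
print(len(asset_store), "unique assets stored for", len(pages), "pages")
```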
The caching isn't a problem
I fucked up there. I was thinking about my little idea when I was writing it. You'd definitely need caching in the server.
(I write my posts in a fragmented, disorderly way. It causes fuck ups like this to happen if I'm not on top of it.)
 
I wasn't referring to Twitter. I was referring to the interaction between individual computers and the archive itself. Having globally present elements on a user's computer would mean the archive wouldn't need as much bandwidth to achieve the same results. Only give connected users the information they don't have.
It sounds a lot more difficult than just doing a static archive, though.
Storage internally could also make use of this, by having thousands of twitter pages consist only of their unique elements linked to a common "library" of elements shared by all of them. It's really the same idea as linux packages, where a single package can be used by multiple programs and the individual programs themselves don't need to harbor that data. Don't know if this is how it works already, but if it doesn't, it's an idea to use less space.
Yeah, this is the more practical approach - only save one copy of the Twitter logo rather than one for each archived Twitter page. It looks like that warcprox tool linked above already handles that to an extent.
 
Nice to see Nitter and these other third-party frontends being adopted by Null. Nitter especially, knowing the bullshit Twitter tries to do to prevent archiving.
 
I wasn't referring to Twitter. I was referring to the interaction between individual computers and the archive itself. Having globally present elements on a user's computer would mean the archive wouldn't need as much bandwidth to achieve the same results. Only give connected users the information they don't have.

The archive could still interact with twitter normally without needing to do any API magic, just discarding elements by comparing them against what already exists in the database and replacing them with instructions for a "webpage compiler" that would exist on users' computers.
There's no point. The bulk of the bandwidth is going to be spent on videos and pictures, not HTML. You can compress the HTML anyway and reap most of the space-saving benefits.
Storage internally could also make use of this, by having thousands of twitter pages consist only of their unique elements linked to a common "library" of elements shared by all of them. It's really the same idea as linux packages, where a single package can be used by multiple programs and the individual programs themselves don't need to harbor that data. Don't know if this is how it works already, but if it doesn't, it's an idea to use less space.
The JS libraries will do something like that, yes. jQuery will only be stored once, for instance.
 
So far the only thing I've noticed that I don't like is that youtube-dl treats tweet URLs displayed this way as "generic" and downloads every video it can find in whatever it considers to be a playlist, whereas on twitter it only targets the video in that specific tweet by default.

There is a tiny little "view in twitter" button at the top right to get around this, so I'm not complaining exactly, just something I noticed.
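If anyone wants to script around it, rewriting the link back to twitter.com before handing it to youtube-dl seems to do the same thing as the button. Rough sketch (the nitter hostname here is just a placeholder):

```python
# Sketch: rewrite a Nitter tweet link back to twitter.com before handing
# it to youtube-dl, so the Twitter extractor (single video) is used
# instead of the generic one. The nitter hostname is an assumption.
from urllib.parse import urlsplit, urlunsplit
import youtube_dl   # pip install youtube_dl

NITTER_HOSTS = {"nitter.net"}   # add whatever instance the forum filter uses

def to_twitter(url: str) -> str:
    parts = urlsplit(url)
    if parts.netloc in NITTER_HOSTS:
        parts = parts._replace(netloc="twitter.com")
    return urlunsplit(parts)

def download_tweet_video(url: str) -> None:
    with youtube_dl.YoutubeDL({"outtmpl": "%(id)s.%(ext)s"}) as ydl:
        ydl.download([to_twitter(url)])

# download_tweet_video("https://nitter.net/SomeUser/status/1234567890")
```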
 
Why the headless browser? What makes that specifically so resource intensive?

It's not just that browsers are fat pieces of shit that gorge themselves on your computer's memory and CPU. It's that you also need to let the browser sit on its fat ass and wait around for the internet to finish getting everything for you. Plus, a ton of websites are also bloated turds that take forever to load 5000 tracking scripts and unnecessary "features" like scrolljacking, etc.
 
It doesn't look like the wordfilter is handling instagram user profile links very well. Looking at the bottom of my Beauty Parlor OP here, I have two instagram profile links:
ig.tinf.io is successfully swapped in but it needs "u/" to be inserted in front of the username to work:
You can easily differentiate the two because instagram has two url formats, one for user profiles (see above) and one for direct links to posts that uses "p/":
Yeah, I just ran into this. An error page comes up but it has a link to the correct URL with the "u/" before the username.
So instagram .com/username should be replaced with ig.tinf.io/u/username
Pic for visualization: [attached screenshot: ig.tinf.io.jpg]
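A filter rule along those lines might look something like this (Python regex, purely illustrative; no idea how the forum's wordfilter is actually implemented):

```python
# Illustrative rewrite of instagram links to the Bibliogram instance:
# post links keep the /p/ form, profile links get the /u/ prefix.
# This is a sketch, not the forum's actual wordfilter.
import re

def rewrite_instagram(text: str) -> str:
    # Post links: instagram.com/p/<id> -> ig.tinf.io/p/<id>
    text = re.sub(r'https?://(?:www\.)?instagram\.com/p/([\w-]+)',
                  r'https://ig.tinf.io/p/\1', text)
    # Profile links: instagram.com/<username> -> ig.tinf.io/u/<username>
    text = re.sub(r'https?://(?:www\.)?instagram\.com/([\w.]+)/?',
                  r'https://ig.tinf.io/u/\1', text)
    return text

print(rewrite_instagram("https://www.instagram.com/someuser"))
# -> https://ig.tinf.io/u/someuser
print(rewrite_instagram("https://www.instagram.com/p/ABC123"))
# -> https://ig.tinf.io/p/ABC123
```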
 