Decentralized Webpage Archival?

Sprate Header · Oct 20, 2022

So the recent DropKiwiFarms kerfuffle has more than illustrated to everyone how archive.org is susceptible to takedown requests of site archives, and I think anyone who's spent any amount of time posting in a lolcow thread can tell you that archive.ph is chronically susceptible to being down or otherwise too slow to use for whatever reason. Additionally, following Wayback bending the knee to the neurotic Twitter troonsquad, the freaks at least attempted to go after archive.ph next, and while we are lucky they were unsuccessful in that regard, a campaign like that being successful has the potential to set a horrible precedent going forward.

All this being said, is there (or could someone theoretically create) an alternative to webpage archival that doesn't rely on a centralized service? Or am I being too

?

Epic Fail Man · Oct 20, 2022

There's always the good old method.

Vexillology · Oct 20, 2022

All this being said, is there (or could someone theoretically create) an alternative to webpage archival that doesn't rely on a centralized service?

I think a torrent is the closest thing to what you're talking about

Lord of the Large Pants · Oct 20, 2022

Torrents don't do very well with large numbers of small files, and they're not suited for data that needs frequent additions.

Maybe something like LBRY? I'm not that familiar with it, but from what I've read it might work.

Japanese Beavis · Oct 20, 2022

I also wonder, you know if uhh like there are anymore sites like https://www.freezepage.com/ or http://ghostarchive.org ?

1-800-GAMBLER · Oct 20, 2022

I think IPFS could do it, but to be completely honest, I don't know much about IPFS, but it seems simple enough to understand.
A lot of crypto projects use IPFS to make sure their front end can't be censored so easily.

Plus, check out these repos that archive webpages to IPFS that are already made.

GitHub - wabarc/rivet: A toolkit makes it easier to archive webpages to IPFS

A toolkit makes it easier to archive webpages to IPFS - GitHub - wabarc/rivet: A toolkit makes it easier to archive webpages to IPFS

github.com

GitHub - oduwsdl/ipwb: InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS - GitHub - oduwsdl/ipwb: InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

github.com

Anybody who knows more about IPFS could tell us more, or I'll go read the docs all weekend.

Fomo Hoire · Oct 20, 2022

What exactly are the goals here? To have archives that can't be taken down, or to have them be searchable? Do they need to be at the level of proof, or just good enough to record your favorite lolcow threads for a rainy day?

It's probably possible to do using a combination of RSS feeds, Archivebox, and some custom search engine (can searx do Archivebox?). Preferably with a few modifications to Archivebox to make it easy to import archives from one instance to another, or to support more external backups than the Wayback machine. And better captcha circumvention.

But, the big problem with self hosted archives for anything beyond personal use is trust. There's a reason using screenshots isn't good evidence anymore; they can be tampered with. It is possible but extremely difficult to tamper with archive.org and archive.today archives if you can convince something to install a bad browser addon that changes what the archive looks like on the affected browser. If you control the server the archive is on however, nothing is stopping you from editing the source code directly, bringing us back to the problem with screenshots.

Sprate Header · Oct 20, 2022

Lord of the Large Pants said:
Torrents don't do very well with large numbers of small files, and they're not suited for data that needs frequent additions.

magickittyz said:
I think IPFS could do it, but to be completely honest, I don't know much about IPFS, but it seems simple enough to understand.
A lot of crypto projects use IPFS to make sure their front end can't be censored so easily.

From the IPFS Wikipedia page:

From IPFS's own site:

There's been some past discussion of IPFS on the forum, and this exchange in the Modern Web Woes thread was particularly interesting:

Kosher Dill said:
What about it sounds good? I'm not really thrilled by a fancier iteration of "What if every link was a torrent instead of a URL".
The resilience sounds nice in theory, but in practice I'm not sure it's better than conventional archiving. What happens if one little block of the archive goes missing? How long would it be before anyone even knows it's gone? What if the last guy who had it didn't even know he had it? What guarantees that any particular content will stay up if no one has any specific commitment to make sure it's still up?
As for resistance against government interference, I'm skeptical of that too. Remember Napster? Governments can and will go after individual users P2P sharing forbidden content if the political will is there. I assume you're not intended to be running this on Tor at all times.
Finally, how does this work if you're running a node? Who decides what's on your node, is it automatically distributed? Could The Glowies easily poison your machine with CP, Tiananmen Square footage, and so on? What's to stop administrators or governments from publishing a list of "bad hashes" and saying anyone mirroring these hashes (or federating with someone who does) will get their social credit score docked?

There may well be answers to all these already, I don't know. Those are just my impressions from reading the website - I didn't dig down into the specs or implementations.

cecograph said:
I think torrents and magnet links are way cooler for distributing static content than what Tim Berners Lee invented. HTTP was never just about distributing static content though, so it's not a good comparison.

IPFS is what happens when you think torrents and magnet links are really cool, and then imagine turning that into a global content-addressed monotonic filesystem, which you can navigate with a browser. The Bittorrent project had floated this idea before IPFS with Maelstrom, and now with BTFS. The idea is pretty obvious and obviously awesome, provided they can make it decently efficient (IPFS has been horrible on this front). I hope it takes off in one form or another.

When running IPFS, you decide what you want to serve by "pinning". This is where it differs from GNUnet, where you automatically serve whatever you've downloaded. There's a big tradeoff there. I'd expect GNUnet to be more resilient if widely adopted, in terms of not losing data, because it's got more seeders by default. GNUnet includes an anonymous protocol so that you're not guilty of distributing some evil that you unknowingly fetched.

As for resilience, something like IPFS doesn't have to replace an archive. You can still have multiple archive services that agree to store the entirety of the filesystem and serve as permanent seeders. Missing pieces can be detected by crawling the DHT, and archives can be automatically mirrored if there's worry that the men with guns are going to come knocking.

That's the theory anyway. I've read suggestions that there are just fundamental inefficiencies with DHTs that might mean you're always better off centralising. I hope that's not the case

Kosher Dill said:
Well, whenever I think of torrents, I immediately think of a 200GB file that's useless because it's missing one 4KB chunk in the middle.
Torrents work fine for things like Linux distros where there are always a billion nerds seeding them, but I don't think they scale down well to the low end where it's only a few people. If I had to chose between a few unreliable seeders and a few unreliable web hosters, I'd obviously choose the web hosters.

Ah, this is the part I was missing. It looks like you do also serve up whatever's in your "most recently requested" cache whether you like it or not, but I imagine that's easy to deal with one way or another.

I guess the question is: why would I invest in the computing resources to pin my data in multiple copies all over the place (or hire a pinning company that does this), over just using a traditional web CDN? It looks like at least a few of these pinning services do in fact use a CDN behind the scenes - why not cut out the middleman?
Or put another way: if the CDN is already going to distribute the 20 HTTP requests generated by loading this page out to its servers in some optimal way, is it really that much better to torrent the data from 200 randoms from IPFS instead?

cecograph said:
Linux distros will always have seeders because the maintainers of those distros will always be seeding. The original content providers want the torrent to stay healthy. The rest of torrent traffic is for pirated material, and the copyright holders of that content want no part in it (they want it shut down).

That sounds like it must be new then. Unless my memory is screwy, when I was trying it out a few years ago, I don't think they ran any policy like that.

The ideal is that you're not investing resources to pin: your downloaders are volunteering to do that for you, and you can trust them to do this because the only authority on the content's validity is the hash that was asked for. It's just like torrents.

If you're a web company, this is all very stupid, and you should pay for traditional infrastructure. You can recoup your costs by mining user data and running the SEO playbook to get paid shoving as much advertising in your users' faces as possible. In other words, Modern Web Woes.

Kosher Dill said:
Right, but I'm just not seeing that as viable for the small-time publisher - to bring it back around, the ones squeezed out by those modern web woes. If you have content that's obscure or unpopular, you're going to have to pay one way or another, whether that's web hosting or pinning.

Perhaps the root of all woes is that nothing scales down well to the individual user anymore.

cecograph said:
That's not my understanding. My understanding is that, even if you're super small fry, you can pin your initial content with the dribble of bandwidth provided from your home ISP to the raspberry pi in your bedroom. If people find your content and like it, some will pin it, and use their bandwidth to distribute it. Ideally, this should be incentivised, or we should pick something other than IPFS where you automatically P2P and we use a protocol which keeps peers anonymous.

So even if your shitty blog is served from a raspberry pi in your bedroom, you don't run the risk of a hug-of-death if, God forbid, you get on the front page of a news aggregator. By the time that happens, most people will be downloading your blog post from peers who've pinned the content, not from you.

This is all in principle, and I'm not sure it's close to proven.

This might not be the model of a small corporate publisher. But, in principle, it's the model for the people who produce almost all the stuff I watch on youtube, on most of the blogs I read, and the model for the unmonetised web of yester-yore.

... and here's a short thread from early 2021 about Brave adding "native IPFS support"

I'm not technologically literate enough to make heads or tails of how to actually use a system like this, but from how it's being described it doesn't sound far off from what I had in mind. Now obviously I'm not suggesting we all immediately dump archive.ph for a crypto project that the forum has by and large barely looked into, but the political landscape of the world doesn't show any sign of cooling down, and I think it would be a good idea to consider less-centralized alternatives before a problem with our main go-to archiver pops up.

Dergint said:
What exactly are the goals here? To have archives that can't be taken down, or to have them be searchable? Do they need to be at the level of proof, or just good enough to record your favorite lolcow threads for a rainy day?

My question is largely about seeing if it's even possible for us to not have to rely on a centralized service while we're archiving stuff, but ideally it would have all the bells and whistles of archive.ph (like relatively easy searching by URL) - or at least the capabilities for someone to add said bells and whistles sometime down the line - plus a bit more insurance against troons with social credit going after the archive service like they did with Wayback.

Japanese Beavis said:
I also wonder, you know if uhh like there are anymore sites like https://www.freezepage.com/ or http://ghostarchive.org ?

jitter · Oct 20, 2022

Aside from IPFS for decentralization the easiest way to create an archive is HTTrack

HTTrack Website Copier - Free Software Offline Browser (GNU GPL)

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack...

www.httrack.com

Fomo Hoire · Oct 20, 2022

My question is largely about seeing if it's even possible for us to not have to rely on a centralized service while we're archiving stuff, but ideally it would have all the bells and whistles of archive.ph (like relatively easy searching by URL) - or at least the capabilities for someone to add said bells and whistles sometime down the line - plus a bit more insurance against troons with social credit going after the archive service like they did with Wayback.

In that case, since it doesn't need to be super fancy yet, I'd say yes. I'll elaborate on the level of autism I think would be required to make it nice later when I have some more time, but on a basic level it is possible to set up a decentralized archive service.

Some obstacles that will make it difficult:

They will come for our hosting providers, it will be a game of wackamole
The aforementioned trust issues
People being willing to put time and money into the project, forever
Bad actors archiving legitimate child pornography, as usual

It would take a squad of determined autists to pull off.

But as for the core reason I think it is at least possible? Get on PikaPods, play with ArchiveBox. There already are OSS options to self-host private archive services, it isn't technomancy. We just need to enhance it depending on the desired features.

(PikaPods itself isn't important, Archivebox is. I just shill pika because it's a fast way to spin something up without figuring the tech out. If we all used Pika that would actually defeat the purpose of decentralization.)

There's actually a lot of software available to preserve data if you include read it later apps, but those tend to strip formatting and only store text content. ArchiveBox is the only one I remember off the top of my head that doesn't try to extract just the important stuff.

E: Link to a functioning archive from the public demo

Japanese Beavis · Oct 20, 2022

Uh, I can't quote you @Sprate Header I apologize for the poor wording. I didn't mean ghostarchive.org or freezepage.com were decentralized I meant there needs to be alternatives to .is ontop of decentralized archives.

Dread First · Oct 20, 2022

Hey, doesn't anyone here use RSS/Atom anymore? Like you do realise that's a viable format in the current year, right? You don't have to use a fucking freemium feed reader like Feedly or Inoreader either.

Most web browsers in the past (i.e. Firefox, Opera, even Chrome) had an RSS reader where you could just manually subscribe to RSS feeds and then get direct updates to your reader app of choice. Even though Firefox gave up their fantastic RSS reader for Pocket of all things (eugh) and Google discontinued Google Reader in 2013, there's still a host of RSS applications you can use to host a local feed.

I personally use Feedbro as a browser extension. FYI - Nitter is fantastic as it's RSS compatible, and Feedbro can easily import tweets from any twitter feed you subscribe to without having to do some autistic shit like setting up a Discord or Fediverse bot. Other sites that are RSS friendly include Reddit, almost any podcast, and even YouTube (albeit it's partially obfuscated).

seri0us · Oct 21, 2022

Dread First said:
Hey, doesn't anyone here use RSS/Atom anymore? Like you do realise that's a viable format in the current year, right? You don't have to use a fucking freemium feed reader like Feedly or Inoreader either.

Most web browsers in the past (i.e. Firefox, Opera, even Chrome) had an RSS reader where you could just manually subscribe to RSS feeds and then get direct updates to your reader app of choice. Even though Firefox gave up their fantastic RSS reader for Pocket of all things (eugh) and Google discontinued Google Reader in 2013, there's still a host of RSS applications you can use to host a local feed.

I personally use Feedbro as a browser extension. FYI - Nitter is fantastic as it's RSS compatible, and Feedbro can easily import tweets from any twitter feed you subscribe to without having to do some autistic shit like setting up a Discord or Fediverse bot. Other sites that are RSS friendly include Reddit, almost any podcast, and even YouTube (albeit it's partially obfuscated).

I do! I've been using Newsboat (a commandline RSS reader) for nearly three years now. Since it saves the articles/notifications/whatever it also serves as a good archive to point out some people's lies.

Dread First · Oct 21, 2022

serious n00b said:
I do! I've been using Newsboat (a commandline RSS reader) for nearly three years now. Since it saves the articles/notifications/whatever it also serves as a good archive to point out some people's lies.

Newsboat is an excellent program, and it's arguably the best kind of RSS reader: local. For use cases where I'm only concerned about the text and have no regard for other forms of media (i.e. images, audio, video), it's the gold standard. My personal gripe is that it's a CLI program that's just not immediately useful when I come across something interesting. Feedbro isn't ideal either because it has one major design flaw (single view feed with no option for tabbed), but it's honestly the only thing that comes close to my needs.

I viscerally hate the fact that fantastic RSS readers (i.e. Google Reader, Opera Reader, Firefox Reader) were more or less discontinued in the early 2010s. What did we get in their stead? Mozilla Pocket, Brave News, Google News, bullshit freemium cloud-based "reader" apps that have the nerve to charge you for what used to be a functionally limitless feed like Feedly or Inoreader, the list goes on and on. There are some interesting options in the self-hosted sphere for RSS (i.e. TinyTinyRSS, FreshRSS, Wallabag, etc), but I don't wanna go through the hassle of setting up Docker, a LAMP server, or whatever the fuck else is necessary just to bypass the NYT World paywalls or listen to a podcast without having to log into Spotify all the damn time.

Seriously, what the actual fuck happened to all the idea of a local RSS client that literally just interprets the XML file (with some degree of input from the user)? Did that just disappear with common sense?

Oct 21, 2022

Dergint said:
What exactly are the goals here? To have archives that can't be taken down, or to have them be searchable? Do they need to be at the level of proof, or just good enough to record your favorite lolcow threads for a rainy day?

It's probably possible to do using a combination of RSS feeds, Archivebox, and some custom search engine (can searx do Archivebox?). Preferably with a few modifications to Archivebox to make it easy to import archives from one instance to another, or to support more external backups than the Wayback machine. And better captcha circumvention.

But, the big problem with self hosted archives for anything beyond personal use is trust. There's a reason using screenshots isn't good evidence anymore; they can be tampered with. It is possible but extremely difficult to tamper with archive.org and archive.md archives if you can convince something to install a bad browser addon that changes what the archive looks like on the affected browser. If you control the server the archive is on however, nothing is stopping you from editing the source code directly, bringing us back to the problem with screenshots.

I think the biggest problem with making self hosted archives useful to anyone but the person running them is discoverability and the difficulty imposed in doing the archiving.

Here are the things I'd like to see in a democratized archiving product:

Single click install on Windows/MacOS/Linux
Local web interface
Built in crawling logic that can run in the background:
1. monitor or crawl back to the beginning of time on RSS feeds, Twitter, Telegram- or whatever other plugin logic you care to add
2. pull down full page archives to store on filesystem and IPFS
3. also store 'per Tweet/per post/per article' text from Twitter/forum/RSS/whatever text, references to images etc associated with the items, to enable easy searches
Ability to search to discover others crawling, say, Talia Lavin's tweets*
Ability to archive random items
Built in ability to use Tor or randomly chosen proxies from a proxy list for crawling/archiving
Ability to auto-push new items into Archive.today/Internet Archive/etc

Ability to search just a completely random assortment of archives is something that I wouldn't even try to implement in anything close to a MVP.

* how to do this in a decentralized way is one of the most difficult things for me... I wouldn't even try and implement a technological solution to this unless it's very obvious, just rely on word of mouth... let's suppose you go to Talia Lavin's thread, and half the threads there include references to an IPNS address for each individual tweet, with the full tweet url included at the end. Well, if things are set up right, you should be able to go 'up' from that IPNS address and see all the https://twitter.com/chickinkiev/status archives your man's made, and go 'up' again and see everything he's publically archiving.

Dread First · Nov 6, 2022

Dread First said:
I viscerally hate the fact that fantastic RSS readers (i.e. Google Reader, Opera Reader, Firefox Reader) were more or less discontinued in the early 2010s. What did we get in their stead? Mozilla Pocket

Hi guys. Would like to come out here and retract what I said about Mozilla Pocket somewhat. Pocket is not an adequate replacement for an RSS reader, and I still despise the irksome clickbait that Mozilla shoves in your face. However, I was conflating Mozilla's garish defaults with the actual functionality of Pocket itself and that was horrendously unfair of me. Pocket was originally started in 2007 as a third-party XUL extension called "Read it Later" and it eventually got integrated into Firefox itself by around 2012/2013ish, and purchased by Mozilla not too long after the fact (like 2016/2017ish). That was when it got rebranded into "Pocket."

I tried to get a self-hosted Wallabag instance up and running and though I was successful in doing so, I just couldn't get it to play nicely with my Raspberry Pi that already has an active Pi-Hole install up and running on it long-term. I need another Raspberry Pi to run Wallabag off of, and disposable income is unfortunately in short supply at the moment. Considering how Pocket is already available to everyone who has a Firefox account for free, I relented and gave it a try. It's actually surprisingly robust, and it works alongside my RSS readers of choice quite nicely.

Stuff I save on desktop is easily visible on mobile and vice-versa, plus there's text-to-speech narration available on the Pocket App or Firefox's reader mode on desktop browsers. Anything that my readers of choice typically don't play nice with (i.e. SCMP / NYT / Deutsche Welles articles) is rendered almost perfectly on Pocket! Also, Pocket's able to play certain types of embedded videos (i.e. YouTube, Invidious, Nitter though sadly not Rumble or Reddit from my personal testing) which is another bonus for not having to directly interact with those sites while staying entirely within your RSS client.

On the subject of pricing and premiums and all that stuff? This is how I'd say it:

Pocket's free plan is more than enough for the average person. The only substantial benefit you get are more fonts, premium TTS voices, and the ability to highlight more than 3 times in a single article. If you're gonna pay for something, just go over to the blokes at Wallabag and ask them politely if they'll host an instance for you. You got two weeks to decide if you like it or not. Instapaper is another option worth considering, especially for annotating and speedreading, but I personally don't like Instapaper's UI anywhere near as much as Wallabag or Pocket's.

Decentralized Webpage Archival?

Sprate Header

"May you live in interesting times."

Epic Fail Man

Vexillology

I'd like to take a minute to talk about Nord VPN

Lord of the Large Pants

Chicks dig giant robots.

Japanese Beavis

fire is cool.

1-800-GAMBLER

GitHub - wabarc/rivet: A toolkit makes it easier to archive webpages to IPFS

GitHub - oduwsdl/ipwb: InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Fomo Hoire

Sprate Header

"May you live in interesting times."

jitter

HTTrack Website Copier - Free Software Offline Browser (GNU GPL)

Fomo Hoire

Japanese Beavis

fire is cool.

Dread First

Inshallah, we shall liberate al-Aqsa

seri0us

Nothing too serious.

Dread First

Inshallah, we shall liberate al-Aqsa

⠠⠠⠅⠑⠋⠋⠁⠇⠎ ⠠⠠⠊⠎ ⠠⠠⠁ ⠠⠠⠋⠁⠛

WHO DARES BATTLE THE SARACEN

Dread First

Inshallah, we shall liberate al-Aqsa