Archival Tools - How to archive anything.

  • 🐕 I am attempting to get the site running as fast as possible. If you are experiencing slow page load times, please report it.
Btw after the recent outage I'd like to encourage everyone to save threads they like to read using Internet Archive.

To archive on-demand, use "Save Page Now" at https://web.archive.org/. IA also has a command-line interface that could be used to archive entire threads; I'll look into it.
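If you'd rather not click through the web UI every time, hitting the Save Page Now endpoint directly with curl should also work (a rough sketch, assuming the endpoint still accepts plain GET requests):
Bash:
# Ask the Wayback Machine to capture one URL on demand.
curl -s "https://web.archive.org/save/https://kiwifarms.net/threads/archival-tools.6561/" > /dev/null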

I tried digging thru here but wasn't able to find this. Hopefully I didn't just miss it.

A top level site, with potentially thousands of .txt files that are displayed as webpages. Is there a way to use Archive.ph or wayback to grab the links from the page and archive them as well?
I'm not sure if you resolved this yet but if this is just a one-off thing and the files are really links, you can go through the HTML DOM and capture all links. Probably doable with https://github.com/ericchiang/pup
but the IA script is probably easier
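A rough sketch of the pup approach, assuming the .txt files are plain <a> links on an index page (the URL here is made up):
Bash:
# Dump every href on the index page into a file, one per line.
curl -s "https://example.com/files/" | pup 'a attr{href}' > links.txt
Relative links would still need the site root prepended before you feed the list to an archiver.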
 
  • Agree
Reactions: notafederalagent
Btw after the recent outage I'd like to encourage everyone to save threads they like to read using Internet Archive.
I cobbled this together a while back. It also works with kf tor addresses:
Bash:
#!/bin/bash
# Turn a thread URL into a list of every page URL, one per line.
# Usage: kf.sh https://kiwifarms.net/threads/archival-tools.6561/page-21

if [[ -z "$1" ]]; then
  echo "URL required"
  echo "Example: kf.sh https://kiwifarms.net/threads/archival-tools.6561/page-21"
  exit 1
fi

hostname="$(echo "$1" | awk -F'/' '{print $3}')"     # kept from the URL so tor addresses work too
threadname="$(echo "$1" | awk -F'/' '{print $5}')"
page="$(echo "$1" | awk -F'page-' '{print $2}')"
page="${page:-1}"                                     # no page- suffix means a single-page thread
i=1

echo https://"$hostname"/threads/"$threadname"/ > kf-"$threadname".txt

while [ "$i" -lt "$page" ]; do
  i=$((i + 1))
  echo https://"$hostname"/threads/"$threadname"/page-"$i" >> kf-"$threadname".txt
done
Feed the resulting .txt file to something like the spn script and it's a done deal.
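For example, a minimal sketch that just feeds each line straight to the Save Page Now endpoint (swap in the spn script's own invocation if you're using that; the filename assumes the example thread above):
Bash:
# Submit every page in the list to the Wayback Machine, pausing between requests.
while read -r url; do
  curl -s "https://web.archive.org/save/$url" > /dev/null
  sleep 10
done < kf-archival-tools.6561.txt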
 
  • Like
Reactions: awoo
I cobbled this together a while back. It also works with kf tor addresses:
I appreciate the effort, but it seems a tad overengineered for my taste. It even has helper messages for what could be a one-liner!

I'm no expert in bash, but...
Bash:
for i in {1..21}; do echo https://kiwifarms.net/threads/archival-tools.6561/page-${i}; done

It's probably even possible without a for loop, using xargs:
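Something like this, perhaps (untested, but seq piped into xargs should behave the same way):
Bash:
seq 1 21 | xargs -I{} echo https://kiwifarms.net/threads/archival-tools.6561/page-{}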
 
  • Like
Reactions: notafederalagent
To archive on-demand, use "Save Page Now" at https://web.archive.org/. IA also has a command line interface which could be used to archive entire threads which I'll look into.
archive.org is pure cuckold and deletes anything they're asked to these days. They'd probably just delete all KF content if asked to in a scary threat with legal letterhead, even from that beaner troon lolyer.
 
archive.org is pure cuckold and deletes anything they're asked to these days. They'd probably just delete all KF content if asked to in a scary threat with legal letterhead, even from that beaner troon lolyer.
That's probably true. I used it to read the CWC threads that were archived. I'm not aware of similar CLI functionality from archive.md, though it probably exists.
 
archive.org is pure cuckold and deletes anything they're asked to these days. They'd probably just delete all KF content if asked to in a scary threat with legal letterhead, even from that beaner troon lolyer.
Have they done it before? I looked at textfiles' Twitter to see if he's hopped on the anti-Cloudflare or anti-KF bandwagon and it seems he hasn't (yet?)
Even the Christchurch shooting thread from 8ch.net /pol/ is still archived.
Also, archive.ph works fine for me now, same VPN server too.
 
  • Like
Reactions: awoo
Have they done it before?
Yes. And like most places they'll comply with any DMCA request or claims that you're the owner of a website. For instance, byuu.org is gone from there despite the fact that he's "dead" and obviously couldn't request it.

Also, archive.ph works fine for me now, same VPN server too.
This is probably better to use.
 
Yes. And like most places they'll comply with any DMCA request or claims that you're the owner of a website. For instance, byuu.org is gone from there despite the fact that he's "dead" and obviously couldn't request it.


This is probably better to use.
was byuu.org there before or was it always excluded due to robots.txt?
 
Yes. And like most places they'll comply with any DMCA request or claims that you're the owner of a website. For instance, byuu.org is gone from there despite the fact that he's "dead" and obviously couldn't request it.


This is probably better to use.
Pretty sure the entire internet (and NZ government) was going after sites hosting any Christchurch shooting-related content, not just the video and manifesto.
was byuu.org there before or was it always excluded due to robots.txt?
Seems like it's just robots.txt. Also seems like they remove old archives if they're disallowed via robots.txt.

Both links are from way before his "death".
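For reference, the rule that keeps the Wayback crawler out is tiny; something like this in robots.txt (assuming ia_archiver is still the user-agent they honor):
Code:
User-agent: ia_archiver
Disallow: /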
 
Last edited:
  • Informative
Reactions: awoo
was byuu.org there before or was it always excluded due to robots.txt?
I think I remember looking at it on archive.org around the time of the byuuicide hoax, but I wouldn't swear to it. It might have been another archive site.
 
Of course I don't agree with their robots.txt policy, but it's been their longstanding one, from an ancient time when the internet was less mainstream and less full of malicious people.
 
Actually, I'm retarded. From that same forum thread I linked above, byuu wrote this:
> Due to the robots.txt file on byuu.org, IA's scrapes of it aren't public.
No, I specifically asked archive.org to exclude my domain and Twitter accounts. I also blackholed archive.org and archive.is from accessing my domain.

Sure enough, he's right:

So it's not really a robots.txt issue.
 
  • Lunacy
Reactions: awoo
Actually, I'm retarded. From that same forum thread I linked above, byuu wrote this:
> Due to the robots.txt file on byuu.org, IA's scrapes of it aren't public.
No, I specifically asked archive.org to exclude my domain and Twitter accounts. I also blackholed archive.org and archive.md from accessing my domain.

Sure enough, he's right:

So it's not really a robots.txt issue.
Is there a way to cryptographically preserve information (non-repudiation) without a third party? This might be completely retarded, but I was actually thinking smart contracts could take that role: commit a hash of the content to the chain, and as time passes it gets harder and harder to change, since that would require rewriting the whole blockchain consensus. In that sense it's even more permanent than IA or archive.md, which could delete content or go down on a whim.
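A rough sketch of the hash-anchoring flavor of this idea, using OpenTimestamps rather than a smart contract proper (assuming the opentimestamps-client is installed, and with a hypothetical filename; it proves the file existed at a point in time, not that its contents are genuine):
Bash:
# Hash the saved page and anchor that hash in Bitcoin via OpenTimestamps.
sha256sum thread-page-1.html
ots stamp thread-page-1.html        # writes thread-page-1.html.ots
ots verify thread-page-1.html.ots   # later: proves the file existed when stamped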

Edit: turns out I was hardly the first person to think of this:
 
Seems like it's just robots.txt. Also seems like they remove old archives if they're disallowed via robots.txt.
Well that by itself would explain that, but it also means anyone who obtains a domain after the fact can just edit robots.txt themselves to delete stuff from it, without even using the DMCA or their other takedown procedures.
 
Well that by itself would explain that, but it also means anyone who obtains a domain after the fact can just edit robots.txt themselves to delete stuff from it, without even using the DMCA or their other takedown procedures.
Yup. I've seen this happen seemingly by accident when parking pages/spammers take over domains formerly of interest. Incredibly annoying.
 
  • Feels
Reactions: AnOminous and awoo
What you're asking about, making it verifiable/genuine, is a bigger task, because all you're doing in the end is creating HTML code that can be altered; nothing prevents you from doing so. And that's true whether it's images or video: unless the source is verified directly (like you inviting others to the server to see it), it can all be manipulated.
The exporter records the unique chat ID of every message, something that is possible to grab manually but is disabled by default, and tedious to do en masse. While it's true the exporter doesn't assist in casual discussions, having this ID protects you from any defamation claims a given Discord poster might make, because the associated message can be verified by the Discord company. IDs also incidentally contain the exact UTC timestamp a statement was processed. The algorithm (invented by Twitter) is called Snowflake.
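For what it's worth, pulling the timestamp out of a snowflake needs nothing but shell arithmetic (the ID below is made up; Discord's epoch is 2015-01-01 UTC):
Bash:
id=1050000000000000000                   # hypothetical message ID
ms=$(( (id >> 22) + 1420070400000 ))     # top bits are ms since the Discord epoch
date -u -d "@$(( ms / 1000 ))"           # GNU date: print it as a UTC timestamp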
 
Jigsaw me if you wish, but I'm really curious. After archive.org bending the knee for tranny cock amhole, can we really trust alternatives? Or should we look for other archive options?
 
  • Like
Reactions: $quid
Jigsaw me if you wish, but I'm really curious. After archive.org bending the knee for tranny cock amhole, can we really trust alternatives? Or should we look for other archive options?
They blocked access to something they had up for a decade, with nothing illegal on it, at the request of no legal authority, while having literal murdered corpse picture fan sites.

You tell me
 
  • Thunk-Provoking
Reactions: Dig20