Archival Tools - How to archive anything.

  • 🐕 I am attempting to get the site running as fast as possible. If you are experiencing slow page load times, please report it.
Btw after the recent outage I'd like to encourage everyone to save threads they like to read using Internet Archive.

To archive on-demand, use "Save Page Now" at https://web.archive.org/. IA also has a command-line interface that could be used to archive entire threads; I'll look into it.
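If you'd rather not click through the web UI every time, hitting the Save Page Now endpoint directly with curl should also work (a rough sketch, assuming the endpoint still accepts plain GET requests):
Bash:
# Ask the Wayback Machine to capture one URL on demand.
curl -s "https://web.archive.org/save/https://kiwifarms.net/threads/archival-tools.6561/" > /dev/null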

I tried digging thru here but wasn't able to find this. Hopefully I didn't just miss it.

A top level site, with potentially thousands of .txt files that are displayed as webpages. Is there a way to use Archive.ph or wayback to grab the links from the page and archive them as well?
I'm not sure if you resolved this yet but if this is just a one-off thing and the files are really links, you can go through the HTML DOM and capture all links. Probably doable with https://github.com/ericchiang/pup
but the IA script is probably easier
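A rough sketch of the pup approach, assuming the .txt files are plain <a> links on an index page (the URL here is made up):
Bash:
# Dump every href on the index page into a file, one per line.
curl -s "https://example.com/files/" | pup 'a attr{href}' > links.txt
Relative links would still need the site root prepended before you feed the list to an archiver.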
 
  • Agree
Reactions: notafederalagent
Btw after the recent outage I'd like to encourage everyone to save threads they like to read using Internet Archive.
I cobbled this together a while back. It also works with kf tor addresses:
Bash:
#!/bin/bash
# Turn a thread URL into a list of every page URL, one per line.
# Usage: kf.sh https://kiwifarms.net/threads/archival-tools.6561/page-21

if [[ -z "$1" ]]; then
  echo "URL required"
  echo "Example: kf.sh https://kiwifarms.net/threads/archival-tools.6561/page-21"
  exit 1
fi

hostname="$(echo "$1" | awk -F'/' '{print $3}')"     # kept from the URL so tor addresses work too
threadname="$(echo "$1" | awk -F'/' '{print $5}')"
page="$(echo "$1" | awk -F'page-' '{print $2}')"
page="${page:-1}"                                     # no page- suffix means a single-page thread
i=1

echo https://"$hostname"/threads/"$threadname"/ > kf-"$threadname".txt

while [ "$i" -lt "$page" ]; do
  i=$((i + 1))
  echo https://"$hostname"/threads/"$threadname"/page-"$i" >> kf-"$threadname".txt
done
Feed the resulting .txt file to something like the spn script and it's a done deal.
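For example, a minimal sketch that just feeds each line straight to the Save Page Now endpoint (swap in the spn script's own invocation if you're using that; the filename assumes the example thread above):
Bash:
# Submit every page in the list to the Wayback Machine, pausing between requests.
while read -r url; do
  curl -s "https://web.archive.org/save/$url" > /dev/null
  sleep 10
done < kf-archival-tools.6561.txt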
 
  • Like
Reactions: awoo
I cobbled this together a while back. It also works with kf tor addresses:
I appreciate the effort, but it seems a tad overengineered for my taste. It even has helper messages for what could be a one-liner!

I'm no expert in bash, but...
Bash:
for i in {1..21}; do echo https://kiwifarms.net/threads/archival-tools.6561/page-${i}; done

It's probably even possible without a for loop, using xargs:
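Something like this, perhaps (untested, but seq piped into xargs should behave the same way):
Bash:
seq 1 21 | xargs -I{} echo https://kiwifarms.net/threads/archival-tools.6561/page-{}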
 
  • Like
Reactions: notafederalagent
To archive on-demand, use "Save Page Now" at https://web.archive.org/. IA also has a command line interface which could be used to archive entire threads which I'll look into.
archive.org is pure cuckold and deletes anything they're asked to these days. They'd probably just delete all KF content if asked to in a scary threat with legal letterhead, even from that beaner troon lolyer.
 
archive.org is pure cuckold and deletes anything they're asked to these days. They'd probably just delete all KF content if asked to in a scary threat with legal letterhead, even from that beaner troon lolyer.
That's probably true. I used it to read the CWC threads that were archived. I'm not aware of similar CLI functionality from archive.md, though it probably exists.
 
archive.org is pure cuckold and deletes anything they're asked to these days. They'd probably just delete all KF content if asked to in a scary threat with legal letterhead, even from that beaner troon lolyer.
Have they done it before? I looked at textfiles' Twitter to see if he's hopped on the anti-Cloudflare or anti-KF bandwagon and it seems he hasn't (yet?)
Even the Christchurch shooting thread from 8ch.net /pol/ is still archived.
Also, archive.ph works fine for me now, same VPN server too.
 
  • Like
Reactions: awoo
Have they done it before?
Yes. And like most places they'll comply with any DMCA request or claims that you're the owner of a website. For instance, byuu.org is gone from there despite the fact that he's "dead" and obviously couldn't request it.

Also, archive.ph works fine for me now, same VPN server too.
This is probably better to use.
 
Yes. And like most places they'll comply with any DMCA request or claims that you're the owner of a website. For instance, byuu.org is gone from there despite the fact that he's "dead" and obviously couldn't request it.


This is probably better to use.
was byuu.org there before or was it always excluded due to robots.txt?
 
Yes. And like most places they'll comply with any DMCA request or claims that you're the owner of a website. For instance, byuu.org is gone from there despite the fact that he's "dead" and obviously couldn't request it.


This is probably better to use.
Pretty sure the entire internet (and NZ government) was going after sites hosting any Christchurch shooting-related content, not just the video and manifesto.
was byuu.org there before or was it always excluded due to robots.txt?
Seems like it's just robots.txt. Also seems like they remove old archives if they're disallowed via robots.txt.

Both links are from way before his "death".
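For reference, the rule that keeps the Wayback crawler out is tiny; something like this in robots.txt (assuming ia_archiver is still the user-agent they honor):
Code:
User-agent: ia_archiver
Disallow: /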
 
Last edited:
  • Informative
Reactions: awoo
was byuu.org there before or was it always excluded due to robots.txt?
I think I remember looking at it on archive.org around the time of the byuuicide hoax, but I wouldn't swear to it. It might have been another archive site.
 
Of course I don't agree with their robots.txt policy, but it's been their longstanding one, from an ancient time when the internet was less mainstream and less full of malicious people.
 
Actually, I'm retarded. From that same forum thread I linked above, byuu wrote this:
> Due to the robots.txt file on byuu.org, IA's scrapes of it aren't public.
No, I specifically asked archive.org to exclude my domain and Twitter accounts. I also blackholed archive.org and archive.is from accessing my domain.

Sure enough, he's right:

So it's not really a robots.txt issue.
 
  • Lunacy
Reactions: awoo
Actually, I'm retarded. From that same forum thread I linked above, byuu wrote this:
> Due to the robots.txt file on byuu.org, IA's scrapes of it aren't public.
No, I specifically asked archive.org to exclude my domain and Twitter accounts. I also blackholed archive.org and archive.md from accessing my domain.

Sure enough, he's right:

So it's not really a robots.txt issue.
Is there a way to cryptographically preserve information (non-repudiation) without a third party? This might be completely retarded, but I was actually thinking smart contracts could take that role: commit a hash of the content to the chain, and as time passes it gets harder and harder to change, since that would require rewriting the whole blockchain consensus. In that sense it's even more permanent than IA or archive.md, which could delete content or go down on a whim.
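A rough sketch of the hash-anchoring flavor of this idea, using OpenTimestamps rather than a smart contract proper (assuming the opentimestamps-client is installed, and with a hypothetical filename; it proves the file existed at a point in time, not that its contents are genuine):
Bash:
# Hash the saved page and anchor that hash in Bitcoin via OpenTimestamps.
sha256sum thread-page-1.html
ots stamp thread-page-1.html        # writes thread-page-1.html.ots
ots verify thread-page-1.html.ots   # later: proves the file existed when stamped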

Edit: turns out I was hardly the first person to think of this:
 
Seems like it's just robots.txt. Also seems like they remove old archives if they're disallowed via robots.txt.
Well that by itself would explain that, but it also means anyone who obtains a domain after the fact can just edit robots.txt themselves to delete stuff from it, without even using the DMCA or their other takedown procedures.
 
Well that by itself would explain that, but it also means anyone who obtains a domain after the fact can just edit robots.txt themselves to delete stuff from it, without even using the DMCA or their other takedown procedures.
Yup. I've seen this happen seemingly by accident when parking pages/spammers take over domains formerly of interest. Incredibly annoying.
 
  • Feels
Reactions: AnOminous and awoo
What you're asking about, making it verifiable/genuine, is a bigger task, because all you're doing in the end is creating HTML code that can be altered; nothing prevents you from doing so. And that's true whether it's images or video: unless the source is verified directly (like you inviting others to the server to see it), it can all be manipulated.
The exporter records the unique chat ID of every message, something that is possible to grab manually but is disabled by default, and tedious to do en masse. While it's true the exporter doesn't assist in casual discussions, having this ID protects you from any defamation claims a given Discord poster might make, because the associated message can be verified by the Discord company. IDs also incidentally contain the exact UTC timestamp a statement was processed. The algorithm (invented by Twitter) is called Snowflake.
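For what it's worth, pulling the timestamp out of a snowflake needs nothing but shell arithmetic (the ID below is made up; Discord's epoch is 2015-01-01 UTC):
Bash:
id=1050000000000000000                   # hypothetical message ID
ms=$(( (id >> 22) + 1420070400000 ))     # top bits are ms since the Discord epoch
date -u -d "@$(( ms / 1000 ))"           # GNU date: print it as a UTC timestamp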
 
Jigsaw me if you wish, but I'm really curious. After archive.org bending the knee for tranny cock amhole, can we really trust alternatives? Or should we look for other archive options?
 
  • Like
Reactions: $quid
Jigsaw me if you wish, but I'm really curious. After archive.org bending the knee for tranny cock amhole, can we really trust alternatives? Or should we look for other archive options?
They blocked access to something they had up for a decade, with nothing illegal on it, at the request of no legal authority, while having literal murdered corpse picture fan sites.

You tell me
 
  • Thunk-Provoking
Reactions: Dig20