Archival Tools - How to archive anything.

Is there a way to fully download an archive.org snapshot before they delete it?
 
Of a particular page? I just tend to submit it to archive.today, which makes it available under the original URL.

Now I'm going to feel like a dunce if I ever transcribe something manually again.
I just spotted a couple of bugs (resulting from me tidying up the code before posting) so if you try it as posted it’ll crash. I’ll fix later today.
 
Sorry, I should have specified more.
I have a hypothetical scenario where a post is deleted and is only archived on archive.org. Linking there is obviously a bad idea. Is there any way to copy their archived version in case they delete the archived post?
 

I know next to nothing about archiving, but in theory, assuming the page isn't also archived somewhere like archive.md, you could save the archive.org page to your hard disk and upload the resulting .MHTML or .HTML file to a file hosting service.
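If you do save a Wayback Machine page to disk, it may help to grab the capture without the Wayback toolbar and rewritten links. A minimal sketch, assuming the snapshot URL has the usual `web.archive.org/web/<timestamp>/<original URL>` shape and relying on the `id_` suffix after the timestamp; the function names and the example URL are made up for illustration:

```javascript
// Split a Wayback Machine snapshot URL into its timestamp and original URL.
// Accepts URLs that already carry a two-letter modifier such as "id_".
function parseSnapshot(snapshotUrl) {
  const pattern =
    /^https?:\/\/web\.archive\.org\/web\/(\d{4,14})(?:[a-z]{2}_)?\/(.+)$/;
  const match = pattern.exec(snapshotUrl);
  if (!match) {
    throw new Error(`Not a Wayback Machine snapshot URL: ${snapshotUrl}`);
  }
  return { timestamp: match[1], original: match[2] };
}

// Build the "id_" variant of a snapshot URL, which should serve the
// capture as it was stored rather than the toolbar-wrapped version.
function rawSnapshotUrl(snapshotUrl) {
  const { timestamp, original } = parseSnapshot(snapshotUrl);
  return `https://web.archive.org/web/${timestamp}id_/${original}`;
}

// Hypothetical snapshot URL, used only to show the transformation.
const exampleSnapshot =
  "https://web.archive.org/web/20230824120000/https://example.com/post/1";
console.log(rawSnapshotUrl(exampleSnapshot));
```

Opening the printed URL and saving that page gives you a copy that doesn't depend on the Wayback Machine's own scripts.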
 
Archive the archive.org URL.
 
Fixed now (I've edited the code in the original post).

Here's an example of it in use, with the output it generates. The audio is a clip from Null's appearance on the Heterodorx podcast.


Code:
$ python3 atranscribe.py -f heterodox-null-short.mp3 -n 3 -l en_us
18:52 :: Uploading audio file...
18:52 :: Audio uploaded to https://cdn.assemblyai.com/upload/[redacted]
18:52 :: Transcription expected to take between 0m39s and 1m19s.
18:52 :: Transcript ID: [redacted]
18:52 :: Polling for completion...
18:53 :: Completed transcription.
18:53 :: Writing response JSON to heterodox-null-short.json
18:53 :: Writing Markdown transcript to heterodox-null-short.md
18:53 :: Writing BBCode transcript to heterodox-null-short.bbcode

Un-edited AssemblyAI transcript:
[00:00:00]
A: It's like it's shocking because what is the Kiwi Farms? It's a drama and gossip site. And it's like the amount of things that have broken that people in the industry have always assumed could never break because they have never broken in 20 plus years. It's kind of shocking. It feels like maybe the Kiwi Farms isn't the biggest thing. Biggest violator ever, because obviously, Cloudflare, just in case you don't know, cloudflare in these ISPs cloudflare in particular protects ISIS. The ISIS recruiting website is hosted by Cloudflare. There's a website I know of that we talked about on the forum that is a hoster of animal torture pornography, and that's federally criminal. In the United States, it's called animal crush. You can't host that, but they do. Cloudflare does just fine. There is lots and lots of softcore, like child nude modeling images or photography sites that Cloudflare hosts that he provides transit to. And that's fine. There's that sanctioned suicide site, which is basically how to teach kids how to kill themselves. That's fine. That's on cloudflare? I'm pretty sure.

[00:01:08]
B: But not Kiwi Farms.

[00:01:10]
A: It's a gossip website that makes fun of shrooms the straw that broke the camel's back.

[00:01:16]
B: In the United States, there is a nonprofit that has been around for well over 30 years called the Electronic Frontier Foundation. EFF.

[00:01:27]
A: I'm very familiar with EFF. It's very significant charity.

[00:01:33]
B: Absolutely. And they've been active in the United States. Well, even their website says they defend digital privacy, free speech, and innovation. Have the EFF been in contact with you or vice versa?

[00:01:49]
A: They do not return my emails. And what's interesting about the EFF is actually when these providers started dropping us, the EFF wrote two different articles. It's In Defense of the Kiwi Farms, vaguely in the sense that they are against providers dropping us. Whenever they write about this, they dedicate three paragraphs to why the Kiwi Farms is pure evil and everyone on it is sick and disgusting and depraved. And then they say, but no, really, this is fucked up. It's a big deal. And then what's funny is that the last articles were actually written by two women, and I think they anonymized their authorship of the article because, of course, on Twitter, they're getting blown up the nanoseconds of this drop that is vaguely in any way in Defense of Kiwi Farms.

[00:02:31]
C: These are cultural circles that I used to be very much a part of because I'm a big anti copyright person, free culture person. So 1215 years ago was very in this world. This is before I was canceled. I have been shocked by how quick the free culture, free software cultures have done a 180. I mean, they canceled Richard Stallman, they've canceled me. There's all of the whole Fediverse, which is based on free software. There's a whole culture there of blocking everybody. So it's free culture, but it's not free culture. There's no free culture principles there. And yeah, I'm like my name is a dirty word among these people, or it's a dirty word among some of them, and everybody else is silent. And one of the heartbreaking things for me when Kiwi Farms lost Cloudflare was that the Internet Archive got rid of archives of Kiwi Farms.

[00:03:38]
A: Yeah, that's true. I think there was some deeper story to that, too. I think that there's a video of a guy, of the CEO of Cloudflare sitting on stage, like, in 2014. It was someone called Michael yonka. Michael Yonka is a former CIA agent who is the father of an Isabel Loretta Yanka.
 
To elaborate, use Archive.today to save the archive.org URL. It detects when the submitted URL comes from a source like Google Cache or the Internet Archive, so searching for either the original URL or the archive/cache URL should find that copy. It also saves most redirects.

/threads/non-binary-queer-artist-arrested-after-arranging-to-rape-a-9-year-old-boy-distributing-child-abuse-material.140898/post-13945545

If it's really important, I would check all of your available options to see which one looks the best. Sometimes they don't save all the images properly, or you need to navigate to an older copy for better completeness.

You can do a check for Google Cache with this format:

 
I'm happy that anything works at all. Also if archive.today ever dies without backups I will be sneeding hard.
Yeah, same. I have previously spent some time thinking about building a self-hosted archive as a backup, but the main archive form (warc) and its supporting tools are very clearly written with only web tranny jannies in mind. "Awesome web archiving" is decidedly not awesome.
 
Here are some archival helpers that I've been using for a while, formatted so you can use them as bookmarklets (just add "javascript:" to the front).

Search for the current page on Archive.today, with a special case for looking up the Twitter URL if you're using Nitter:
JavaScript:
(function () {
  let nitterUrls = ["nitter.net"];

  let url = window.location.href;
  let host = window.location.host;
  if (nitterUrls.includes(host)) {
    url = url.replace(host, "twitter.com");
  }
  window.open(`https://archive.today/${url}`);
})();

Same as above but for Ghostarchive:
JavaScript:
(function () {
  let nitterUrls = ["nitter.net"];

  let url = window.location.href;
  let host = window.location.host;
  if (nitterUrls.includes(host)) {
    url = url.replace(host, "twitter.com");
  }
  /* Encode the URL so its own query string survives as a single search term. */
  window.open(`https://ghostarchive.org/search?term=${encodeURIComponent(url)}`);
})();

Copy the original and archive link from Archive.today & Ghostarchive pages, eg: Trump’s Mug Shot Is Released After Booking at Fulton County Jail - The New York Times (archive)
JavaScript:
(function () {
  function originalURL(document) {
    switch (document.location.host) {
      case "ghostarchive.org":
        return document.querySelector("input[name=term]").defaultValue;
      default:
        return document.querySelector("input[name=q]").defaultValue;
    }
  }

  function archiveUrl() {
    /*  Prefer the canonical archive link if one is available. */
    return (
      document.querySelector("link[rel=canonical]")?.href ??
      document.location.href
    );
  }

  function tweetDescription(match) {
    let username = match[1];
    let tweetID = match[2];
    return `@${username}, tweet ${tweetID}`;
  }

  function description(document, sourceURL) {
    /* Handle tweets specially */
    var tweetRegex = /twitter\.com\/(\w+)\/status\/(\d+)\/?$/;
    if (tweetRegex.test(sourceURL)) {
      return tweetDescription(tweetRegex.exec(sourceURL));
    }

    let title = document.title;

    /* Tidy up Ghostarchive titles */
    let ghostarchiveSuffix = " | Ghostarchive";
    if (title.endsWith(ghostarchiveSuffix)) {
      return title.slice(0, -ghostarchiveSuffix.length);
    }

    /* Return the page title unchanged */
    return title;
  }

  let original = originalURL(document);
  let title = description(document, original);
  let archive = archiveUrl();
  let formatted = `[URL="${original}"]${title}[/URL] ([URL="${archive}"]archive[/URL])`;
  navigator.clipboard
    .writeText(formatted)
    .catch((e) => window.alert(`failed: ${e}`));
})();

Copy a Twitter thread (when viewed on Nitter) as a forum quote block, eg (from here):
Leor Sapir (@LeorSapir) · Aug 24, 2023 · 2:53 PM UTC
Timmy Broderick's (https://nitter.net/broderick_timmy) new piece in https://nitter.net/sciam: "evidence undermines ROGD claims."

How about we debate this question, Timmy? Feel free to propose a neutral venue and moderator.


Leor Sapir (@LeorSapir) · Aug 24, 2023 · 5:49 PM UTC
You can't spell https://nitter.net/sciam without scam.
JavaScript:
(function () {
  function nodeToText(node) {
    if (node.nodeName == "A") {
      return node.href;
    } else {
      return node.textContent;
    }
  }

  function tweetContent(tweet) {
    let nodes = tweet.querySelector(".tweet-content").childNodes;
    return Array.from(nodes).map(nodeToText).join("");
  }

  function tweetDate(tweet) {
    return tweet.querySelector(".tweet-date a[title]").title;
  }

  function tweetAuthor(tweet) {
    let fullname = tweet.querySelector(".fullname").textContent;
    let username = tweet.querySelector(".username").textContent;
    return { fullname, username };
  }

  function tweetToText(tweet) {
    let content = tweetContent(tweet);
    let date = tweetDate(tweet);
    let { fullname, username } = tweetAuthor(tweet);
    return `[B]${fullname} (${username}) · ${date}[/B]\n${content}`;
  }

  function threadText(document) {
    let tweets = Array.from(
      document.querySelectorAll(".main-thread .tweet-body"),
    );
    if (tweets.length === 0) {
      throw new Error("Not a Nitter thread.");
    }
    return tweets.map(tweetToText).join("\n\n");
  }

  function bbquoteThread(document) {
    let text = threadText(document);
    return `[QUOTE]\n${text}\n[/QUOTE]`;
  }

  let quoteBlock = bbquoteThread(document);
  navigator.clipboard.writeText(quoteBlock);
})();
Doesn't handle quote tweets properly but I'll look at fixing that at some point.
 
Has yt-dlp been a little fucky for anyone else lately? I've been trying to download videos and it keeps saying the format is not available, even though -F (--list-formats) lists it as available. I download with -f -download# [link]
 
For some reason, archive.today isn't archiving replies to tweets. Switching over to GhostArchive.
 
Any particular site giving you trouble? Perhaps a quoting problem with -f in the URL?

Currently I just use a custom format sort that prioritises high quality and h265 video (smaller size), but you can also specify things like -f best or widths and heights; perhaps that will be less error-prone than particular format numbers?
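As a rough illustration of selecting by height instead of by format number, here is a sketch that builds a yt-dlp argument list. It assumes yt-dlp's documented `bv*[height<=N]+ba/b[height<=N]` selector syntax; the function name, the 720 default, and the placeholder URL are mine, not anything from this thread:

```javascript
// Build yt-dlp arguments that pick the best video at or below maxHeight
// plus the best audio, falling back to the best combined format.
function ytDlpArgs(videoUrl, maxHeight = 720) {
  if (!Number.isInteger(maxHeight) || maxHeight <= 0) {
    throw new RangeError(`maxHeight must be a positive integer: ${maxHeight}`);
  }
  const selector = `bv*[height<=${maxHeight}]+ba/b[height<=${maxHeight}]`;
  // "--" marks the end of options, so a URL or ID that begins with "-"
  // is not mistaken for a flag.
  return ["-f", selector, "--", videoUrl];
}

// Placeholder URL, for illustration only.
const args = ytDlpArgs("https://www.youtube.com/watch?v=VIDEO_ID");
console.log(["yt-dlp", ...args].join(" "));
```

Passing the array to something like Node's `child_process.execFile("yt-dlp", args)` skips the shell entirely, which would rule out quoting as the cause. If you paste the printed line into a shell instead, the selector needs quoting because of the `*`, `[`, and `<` characters.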

I'm having real trouble with Archive.today at the moment; most of the domains just show the Nginx starter page, and when I find a domain that works for the archive search, it then directs me to a failing domain for the archive page itself.
 
Sorry for not replying sooner; every time I've popped onto the site this week I've had really shitty loading issues. I've been having an intermittent issue where yt-dlp will only return "format not found" messages for formats that it lists as available. It's not doing it at the moment, but it will randomly do it. I wonder what that's about.
edit: I should probably specify that I am talking about using it for youtube downloading purposes.
 
In this thread no mention of how to preserve very large video streams, only how to lossily compress short ones. Example: YouTube stream VOD of 3:06:40.12 and 5.4G. Compression to site size limit renders content unwatchable. Splitting into tens of chunks of acceptable quality too wasteful of site storage. MEGA not viable because of DMCA attack surface. @Null: posting IPFS/BitTorrent archives permissible? Solves storage problem at cost of user IP exposure. VPN or TOR as suggested in forum user safety guide mitigates.
 
Compress to 720p and split. That is the official instruction. I don't have the storage space for 1080p60fps of random bullshit streams with no significance.

Anything important should be saved directly to the site, i.e. clip important moments out. But however you want to archive entire streams idk. Nothing really works. They all take shit down now.
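For anyone unsure what "compress to 720p and split" looks like in practice, here is a rough sketch that builds an ffmpeg argument list and estimates the number of chunks. The H.264/AAC codecs, CRF 23, the 10-minute chunk length, and the file names are all my assumptions, not site rules; the duration is the 3:06:40.12 example from the post above:

```javascript
// Convert "H:MM:SS.ss", "MM:SS" or "SS" into seconds.
function parseDuration(text) {
  const parts = text.split(":").map(Number);
  if (parts.length > 3 || parts.some(Number.isNaN)) {
    throw new Error(`Unrecognised duration: ${text}`);
  }
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// Build ffmpeg arguments that scale to 720p and cut the output into
// chunks of roughly chunkSeconds each using the segment muxer.
function ffmpegSplitArgs(input, outputPattern, chunkSeconds) {
  return [
    "-i", input,
    "-vf", "scale=-2:720",               // 720 px tall, width kept even
    "-c:v", "libx264", "-crf", "23",     // assumed codec and quality
    "-c:a", "aac", "-b:a", "128k",       // assumed audio settings
    // Force a keyframe at every chunk boundary so the cuts land on time.
    "-force_key_frames", `expr:gte(t,n_forced*${chunkSeconds})`,
    "-f", "segment",
    "-segment_time", String(chunkSeconds),
    "-reset_timestamps", "1",
    outputPattern,
  ];
}

const chunkSeconds = 600; // assumed 10-minute chunks
const totalSeconds = parseDuration("3:06:40.12");
const chunkCount = Math.ceil(totalSeconds / chunkSeconds);
console.log(`Expecting about ${chunkCount} chunks`);
// Hypothetical file names, for illustration only.
console.log(
  ["ffmpeg", ...ffmpegSplitArgs("stream.mp4", "stream_%03d.mp4", chunkSeconds)].join(" "),
);
```

Tune the CRF and chunk length against the site's size limit; a lower CRF means better quality and bigger files.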
 
If you have a super big video that you think is of supreme importance, do above to 720p and split, and dare to tag Null that you have the OG source if he wants it. Then he can arrange to obtain it if he really needs that 4k60fps video of a low jerking it.
 
Shame WebM doesn't work on all devices; it's really good at compressing video. Do MP4 videos need to be encoded with the H.264 codec, or does H.265 also work?
 