Archival Tools - How to archive anything.

This site archives stuff faster than I had expected, thank you!
It's the only one that currently works for LinkedIn and Facebook, too (afaik). YouTube videos under a certain size and length also archive, which is nice.

A lot of the time with archive.ph, the queue numbers can be misleading, because it really depends on what site you're trying to archive. I've noticed that Twitter might take longer due to rate limiting, but something like a news site can go from a queue position of 2,000 or so to archiving instantly.

You might want to go the dirty route: find the sitemap for the wiki and run wget on it. Then you have a local version of the site. There are some wget arguments you might need in order to keep the stylesheets (I don't know them off the top of my head).
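If I had to guess, the flags in question are -p/--page-requisites (grab CSS, images, and scripts) and -k/--convert-links (rewrite links for local browsing); a rough sketch:
Code:
wget --recursive --no-parent --page-requisites --convert-links --adjust-extension <wiki-url>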
HTTrack works well; I used it to back up the hrtwiki site locally a while back.
 
I can appreciate an alternative, but is HTTrack's GUI the sole reason why someone would use it over something as simple as
Code:
wget -r -np <url>
?
Maybe I'm an idiot, but I've never been able to get a fully browseable local version when using wget and I've tried it multiple times with multiple options.
 
Did you run the wget command with the -k flag? That'll convert the links in whatever pages you download so they point at your local copies.
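In other words, adding one flag to the invocation above:
Code:
wget -r -np -k <url>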
I just gave it a try with this command:
Code:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://www.friendsofcheese.com/
Walked away for 20 minutes and came back to endless scrolls of things like this:
Code:
The destination name is too long (1996), reducing to 236
--2022-10-16 00:57:26-- https://www.friendsofcheese.com/sho...d=6?category_id=6?category_id=6?category_id=6
Reusing existing connection to www.friendsofcheese.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.friendsofcheese.com/shop.php?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?category_id=6?cat.html’
Ran httrack and it mirrored it fine.
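For what it's worth, newer wget builds have a --reject-regex option that can break query-string loops like the one above; a sketch, with the regex just targeting the repeated category_id parameter from that log (adjust for your site):
Code:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --reject-regex '\?category_id=.*category_id=' https://www.friendsofcheese.com/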
 
All the tools for archiving Instagram seem to be gone or no longer supported. Is it possible my boomer brain is missing something? Any help would be appreciated. I have a pet incel alphacuck whom I'd like to submit to Proving Grounds, but I'm internet retarded. Please take pity on me and advise.
 
save-insta, storysaver, inflact, etc. Lots of tools still exist. Some even support private Instagram accounts.
 
May have missed it in my 15-page skim, but does anyone know if there's a script around to rip Fandom wikis? There are quite a few games where the only comprehensive guide is a legacy wiki migrated to Fandom, and I don't trust them not to eventually nuke everything unprofitable or unmaintained.
Update to this in case anyone else needs it: I did some research on how ArchiveTeam rips Fandom wikis. If you make an account on Fandom (gross) and edit a couple of pages so you lose the 'new account' status, the Special:Statistics page for any given wiki will have an additional 'export' heading, which will let you dump the entire wiki plus media to an archive in MediaWiki format (!). This kinda owns, because it means they can trivially be converted to ZIM files for offline reading without any of the Fandom bullshit.
 
I keep an eye on the TikTok accounts of a few people, but it's become a bit more involved since yt-dlp stopped working for TikTok user profiles.

But yt-dlp still works for individual TikTok videos, so if you have a list of all of a user's videos you can run yt-dlp with the --batch-file option. Here's some JavaScript that will get you that list.

First, go to the profile you're interested in, open up the JavaScript console in your browser's developer tools, and wait for the page to finish loading. (If you're curious, here's the profile of my favourite lolcow.) Paste in the code below and press Enter.
JavaScript:
// Collect the <a> elements for the videos currently loaded on the profile,
// keeping only links that belong to this profile's path.
function getVideoElements() {
    const selector = "[data-e2e=user-post-item-list] a";
    const allElements = Array.from(document.querySelectorAll(selector));
    return allElements.filter(a => a.href.match(location.pathname));
}

// Scroll the last loaded video into view to trigger lazy loading, then check
// again after wait_ms. Once no new videos have appeared between checks,
// assume we've reached the end and resolve with the list of video URLs.
function loadVideosByScrolling(nPrevious, wait_ms, resolve) {
    const videoElements = getVideoElements();
    const nVideos = videoElements.length;
    if (nVideos == nPrevious) {
        resolve(videoElements.map(a => a.href));
        return; // stop here, or we'd keep scrolling forever after resolving
    }
    videoElements[nVideos - 1].scrollIntoView();
    setTimeout(() => loadVideosByScrolling(nVideos, wait_ms, resolve), wait_ms);
}

async function getAllVideoURLs(wait_ms) {
    return new Promise((resolve) => {
        loadVideosByScrolling(0, wait_ms, resolve)
    });
}

(await getAllVideoURLs(5000)).join(" ");
The argument to getAllVideoURLs at the bottom is the wait between scrolls (in milliseconds). You may need to make it longer if you have a slower connection.

Wait for a little while and you'll see a string of space-separated video URLs. Paste those into your favourite text editor, replace each space with a newline, save, and then you're ready to hand it off to yt-dlp.
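Or do that last step from a shell; a sketch, assuming you've pasted the space-separated string into a file called urls-raw.txt (both filenames are just placeholders):
Code:
tr ' ' '\n' < urls-raw.txt > urls.txt
yt-dlp --batch-file urls.txt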

I have a scraper that uses headless Chrome via Puppeteer, so this can be automated, but it's a little too messy to share at the moment. I'll post again when I've got it sorted out. Thanks to @Pisek for prompting me about this.
 
My TikTok scraper helper is now up on GitHub at GeraniumKF/scraper-helper. Feel free to ping me if it's giving you trouble.

I tried various things to get scroll-to-load to work in headless Chrome via Puppeteer, but no luck. So if you want to download someone's whole TikTok catalog then use the devtools approach from my post above, and then you can keep on top of it using the scraper-helper periodically.
 
Is there any way of archiving Discord posts?
There’s DiscordChatExporter for saving whole servers. I’m not sure if it can do more targeted archiving.
With the increase of cows having Discord servers, I thought it would be useful to bring up DiscordChatExporter.
I’ve used it twice and it works OK: it generates large self-contained HTML files. For reasonably large servers these can be a few GB.
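There's also a CLI build if you'd rather script it; from memory, the invocations look roughly like this (the token, channel ID, and guild ID are placeholders):
Code:
# export a single channel
DiscordChatExporter.Cli export -t YOUR_TOKEN -c CHANNEL_ID -f HtmlDark
# export every channel in a server
DiscordChatExporter.Cli exportguild -t YOUR_TOKEN -g GUILD_ID -f HtmlDark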
 
Another way is using selfbots on Discord, but you risk getting banned for using them. Those selfbots have an option to save 1,000 lines of whatever server you want by executing a command.
 
For anyone who wants to attempt archiving TikTok videos, ProxiTok is an open-source front-end in a vein similar to Invidious, Nitter, LibReddit, Quetre, and such, but for TikTok. ProxiTok is one of the available options you can toggle for redirecting via LibRedirect. It's not a perfect solution because public instances can be rate limited occasionally. But it works when my friends send TikTok memes via group chat, and I don't feel like disabling NoScript to allow TikTok's desktop site to function properly.

***

EDIT 02 January 2023 @ 10:49 GMT-5 – Suggestion to add the below to OP; tagging @Null per his instruction

Local Archiving 101 – Browser Developer Tools

Disclaimer: The below was done on a fresh profile created on Firefox 102.6 ESR. I am not entirely familiar with Chromium developer tools, but the concept should still largely apply to both Mozilla and Chromium browsers.

Obligatory Introduction

Before the advent of HTML5 DRM, blob URLs, obfuscation via div tags, and other such unpleasantries, the tried-and-true method of “Right-click, Inspect Element” was the de facto standard for archiving media from other websites. It would seem that a decade-plus of conveniences being spoon-fed to us via streaming services and social media has left us ignorant of the recent past. Make no mistake: your web browser still has tons of tools to assist with archival without relying on external services. It's not perfect, but it's robust enough that more people should be aware of it.

From my personal testing, here's what definitively does not work (to my knowledge) with Inspect Element:

a) YouTube and Instagram videos (images can be archived just fine)
b) Blobs, so anything that starts with blob:https://www.xyz.com/mediawhatever_r@nd0m_5tr1ng
c) Any type of “dynamic” URL where a media file (typically video) is broken up into smaller pieces and then loaded as you're watching (e.g. a4dsgsomethingorother.googlevideo.com)

For most practical purposes, however, any shortcomings in your browser's developer toolkit can be supplemented with other tools discussed earlier in this thread (e.g. yt-dlp, LibRedirect, archive.md, etc.).

Tutorial Starts Here

Scenario A: You are logged in to Instagram via your desktop, and you wish to archive a cute animal post.

Test site: https://instagram.com/juniperfoxx

Go to your browser's hamburger menu, then work out where the developer toolkit is.

[Screenshots: where the developer tools live in Firefox's menu (two images) and in Brave's]

You should now be in your browser's developer console. Firefox is above, while Brave's is below; I believe you can change the orientation of the console in Chromium, but I can't be fucked to do that right now.

[Screenshots: the developer console in Firefox and in Brave]

Now, Instagram saves videos as blobs, so we can't really make good use of that at this time. I currently lack the knowledge necessary to inform people how to circumvent blob protection; someone else will have to fill that gap for me. Instead, let's turn our attention to the nice photo of the raccoon by the Christmas tree.

[Screenshot: the element picker button in Firefox's developer console]

Click on the element picker, then click on the image itself.

[Screenshot: the selected element in Firefox's inspector]

Per the attached, you can see that it's not under an img tag, but rather a div tag.
This is some pretty surface-level obfuscation to work past. Look at your toolkit, and do the following:

[Screenshots: the div's image URL in the inspector; opening that link in a new tab; the raccoon photo on its own]

Open the link in a new tab, and you should have your image!

DISCLAIMER: I don't know how true this is for Instagram, but Facebook URLs always have strings of metadata tied to your account. This extends to the files you download as well. Remember to scrub the metadata before posting them anywhere!
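For the embedded part of that scrub, exiftool does the job; a minimal example (the filename is just an example):
Code:
exiftool -all= raccoon.jpg
That strips the metadata embedded in the file itself (exiftool keeps a backup as raccoon.jpg_original); tracking strings baked into the URL or filename still need to be trimmed by hand.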

***

This concept works for any type of media, be it image or video, but video has a lot more strings attached; I couldn't be fucked to find a test site that wasn't autistic as hell to use. Allegedly this also works with OnlyFans content (both photo and video), but I'm not brash enough to tempt fate. Enjoy!

***

I will come back later to talk about Firefox's Save Page As; there's a lot of stuff I need to look up before I can say anything authoritative about it.
 
Are you curious about the history of a GitHub repository? GitHub has a streaming API (“firehose”) of events as they happen, and ClickHouse has a searchable database of these events (the github_events table) that you can query with SQL.

As a practical example, there was a repository that we had some reason to believe was public at one time (it appeared on a search engine results page) but gave a 404 on GitHub. Searching for the repo in the ClickHouse database revealed the events (create, push, add member, etc.) issued before the repo was made private.

It seems that private repos, and public repos belonging to accounts with private profiles, do not emit events and so do not appear in the database.
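If you want to try it yourself, here's a sketch using clickhouse-client against the public playground (the host and user are from memory and may have changed; the repo name is a placeholder):
Code:
clickhouse-client --host play.clickhouse.com --secure --user explorer --query "SELECT created_at, event_type, actor_login FROM github_events WHERE repo_name = 'someuser/somerepo' ORDER BY created_at"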
 