Recently I've been having trouble archiving pages on Archive.is. It was working just fine a couple of days ago when I was archiving the Loretta Janke Bella thread, but the site went down again by the time I reached page 30, and when it came back online it simply refused to archive at all, while the Wayback Machine archives thread pages just fine. Did Archive.is block KiwiFarms from being archived? Anything else I archive works fine.
Anyways, I figured I'd make this thread for general archival site discussion rather than just my opening question. If you have suggestions for alternative archival sites or information on known ones, please feel free to discuss it here.
archive.md is running very slowly for me in general. Archive.org works great for most sites, except for Reddit and some sites with interactive features. I think I posted in the general archival discussion how to use archive.org from the command line just by sending a curl GET request.
All you have to do is access https://web.archive.org/save/$url, where $url is the page you want to save. This is what using the Save Page Now feature on archive.org does (well, I'm guessing it does a POST, but if the archive doesn't exist, simply visiting the link will try to create one).
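For example, from a shell it's just one request (example.com here is only a placeholder):
Code:
# fire-and-forget: hitting /save/ queues a capture of the URL that follows it
curl -s "https://web.archive.org/save/https://example.com/some-page" > /dev/null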
In my experience archive.org will only remove things that copyright holders are very anal about. There's actually a lot of stuff stored there that people don't realize is there.
Archive.org will remove pretty much anything if someone pushes hard enough, in my experience. The push isn't necessarily that hard, either; it's just that you have to give them full URLs to the things you're demanding be removed, which is obviously a pain in the ass for an entire website with multiple snapshots.
As always, anything you really care about should be stored on your own stuff as well as archive websites.
Since this is in tech: archive.org has APIs that can be handy for bulk work and for automation/fine-tuning.
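For example, the availability and CDX endpoints are useful for checking what's already archived before you start hammering the save endpoint. A quick sketch (example.com is a placeholder, and jq is just there for readable output):
Code:
# closest existing snapshot for a URL
curl -s "https://archive.org/wayback/available?url=example.com" | jq .
# list captures via the CDX API (handy for bulk checks)
curl -s "https://web.archive.org/cdx/search/cdx?url=example.com&output=json&limit=5"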
Webcache I think I once used to catch a pre-stealth-edit version of a page. Correction: I remember now, it was a tutorial of some sort that was no longer accessible because the owner let their domain expire. It was link rot, not anything nefarious.
With Via Hypothesis, I can't remember exactly what happened, but there was some kind of incompatibility with a specific website. Maybe it had been coded to detect and reject the archive service, or maybe it was funky JavaScript. Either way, going through Via Hypothesis fixed the problem.
I wouldn't use them normally, but knowing they're there is useful for when you get into those weird situations.
Here's some good info on the Wayback Machine API. In the past when I looked for this, I could only find an out-of-date info page from 2013. It has always struck me as odd that the other archive.org APIs have decent documentation, but for some reason there's a blank when it comes to the web archiver. I came across this info while looking through some scripts on GitHub a few nights ago. I've also attached the PDF files to this post for those who want to avoid Google:
I can confirm that the API is working! I added the email option (&email_result=1) and received a summary email a while later. There may be a couple of things that don't work, but I haven't tested everything yet. Here's the link to get your archive.org API key, and the IA S3 API docs for reference.
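For anyone who wants to script it, the authenticated request looks roughly like this going by those docs (MYACCESSKEY:MYSECRET stands in for your own key pair, and example.com is a placeholder URL). If I'm reading the docs right, the JSON response gives you a job ID you can poll for status later.
Code:
# Save Page Now request with the API key; email_result=1 sends the summary email
curl -s -X POST "https://web.archive.org/save" \
  -H "Accept: application/json" \
  -H "Authorization: LOW MYACCESSKEY:MYSECRET" \
  -d "url=https://example.com/some-page" \
  -d "email_result=1"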
I have attempted archiving hundreds of URLs many times in the past, but never had any luck and always walked away disappointed. This has actually made a drastic difference when teamed up with the script on overcast07's GitHub. The script goes through and retries jobs that got rate-limited or timed out. The only thing it won't retry is when the archive is returning some error on their end, like the "chrome-error://chromewebdata/ (HTTP status=0)" one (edit: it actually does loop back through those, I just hadn't paid enough attention).
I didn't get a chance to really put things through their paces, but here are some preliminary results. The speed of archiving depends heavily on the website you're archiving. Twitter has always been slow to archive due to the large number of archives being made at any given moment and the slowdowns from the rate limiting that archive.org imposes. I ran 989 Twitter URLs through the script; it took about 8 hours, and I got the "chrome-error://chromewebdata/ (HTTP status=0)" error for a not-insignificant number of them. Sadly I didn't save the overall file count/stats to compare.
I also ran it on a list of 941 apnews.com URLs. It archived 878 of them in just over four hours (251m27.143s). 44 of the apnews.com links were dead, and 21 errored out with the chromewebdata problem. I can live with those stats compared to the crap speed and results I got in the past when trying to use the curl -I workaround to upload in bulk.
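For reference, that old workaround was basically just a dumb loop along these lines (urls.txt is a placeholder file name), with no retries and no real rate-limit handling, which is why the results were so poor:
Code:
# one HEAD request per line of urls.txt against the /save/ endpoint
while read -r url; do
    curl -sI "https://web.archive.org/save/$url" > /dev/null
    sleep 5   # crude pause so archive.org doesn't start rejecting requests immediately
done < urls.txt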
I was going to end the post here, but I think this will be very useful information to share, and although it's not directly related to the topic in the strictest sense of the word, I'm hoping that can be overlooked. I won't repeat the writeup I made in the tranch thread about archiving their Twitter; you can read it there. I was using twarc to get the list of their posts from the Twitter API. That worked well enough for my purposes, but I'd prefer a tool like twint that requires no account and bypasses some of the API limitations (unfortunately, twint hasn't worked for me in quite a while).
Last night I found a solution. snscrape just fell into my lap while I was looking into Telegram archivers. I knew it was worthy after it gave me the URLs to all 5,500+ of the official tranch account's posts without timing out. I've been doing some light reading on people gathering data sets for academic work, so I knew it was a good idea to scrape all of that into a jsonl file and then work on extracting the raw data into more useful formats. You can look more into that here, here and here. I didn't want to take any chances of missing tweets or of it timing out on me while getting the raw data, so I broke it up into six-month periods dating back to the beginning of the account and had it all append to one file (there's a sketch of the date-range approach after the commands below). Yes, I know I can use >> to append, but I like to see stuff scrolling by if I'm actually sitting there waiting, hence my use of tee -a.
edit (2022-01-20): I still haven't looked into the structure of the jsonl files it produces, so they might not be up to the same gold standard that twarc outputs. It should be a fairly easy workaround to get the full list of tweet URLs with snscrape and then grab the actual tweets with twarc. I've got plenty of testing to do over the next week.
Get url listing for all of user's tweets:
Code:
snscrape twitter-user <username> | tee -a username-url_list.txt
Get all of user's tweets and replies (minus retweets) in jsonl format:
Code:
snscrape --jsonl twitter-user <username> | tee -a username-raw_tweets.jsonl
Get profile info for user in jsonl format:
Code:
snscrape --jsonl --with-entity --max-results 0 twitter-user <username> | tee -a username-user_info.jsonl
If you want to get all of someone's posts with this, you have to use the "twitter-user" function like I did above.
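If you want to break a big account up into date ranges like I described above, one way is the twitter-search scraper with Twitter's since:/until: search operators. This is just a sketch of how the slicing could look (the dates and <username> are placeholders, and search may not catch absolutely everything the twitter-user scraper does):
Code:
# one six-month slice at a time, everything appended to the same jsonl file
snscrape --jsonl twitter-search "from:<username> since:2021-01-01 until:2021-07-01" | tee -a username-raw_tweets.jsonl
snscrape --jsonl twitter-search "from:<username> since:2021-07-01 until:2022-01-01" | tee -a username-raw_tweets.jsonl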
(2022-01-20) Changed the user agent string for lynx; no need to stand out. (2022-01-25) Put the old Twitter info behind a spoiler and pointed towards the Archival Tools thread.