Recently I've been having trouble archiving pages on Archive.is. It was working just fine a couple of days ago when I was archiving the Loretta Janke Bella thread, but the site went down again by the time I reached page 30, and when it came back online it simply refused to archive at all, while the Wayback Machine archives thread pages just fine. Did Archive.is block KiwiFarms from being archived? Anything else I archive works fine.
Anyways, I figured I'd make this thread for general archival site discussion rather than just my opening question. If you have suggestions for alternative archival sites or information on known ones, please feel free to discuss it here.
archive.md is running very slowly for me in general. Archive.org works great for most sites, except for Reddit and some sites with interactive features. I think I posted in the general archival discussion how to use archive.org from the command line just by sending a curl GET request.
All you have to do is access https://web.archive.org/save/$url, where $url is the page you want to save. This is what using the Save Page Now feature on archive.org does (well, I'm guessing it does a POST, but if the archive doesn't exist, simply visiting the link will try to create one).
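For example, from a shell it's just one request (example.com here is only a placeholder):
Code:
# fire-and-forget: hitting /save/ queues a capture of the URL that follows it
curl -s "https://web.archive.org/save/https://example.com/some-page" > /dev/null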
In my experience archive.org will only remove things that copyright holders are very anal about. There's actually a lot of stuff stored there that people don't realize is there.
Archive.org will remove pretty much anything if someone pushes hard enough, in my experience. The push isn't necessarily that hard, either; it's just that you have to give them full URLs to the things you're demanding be removed, which is obviously a pain in the ass for an entire website with multiple snapshots.
As always, anything you really care about should be stored on your own stuff as well as archive websites.
Since this is in tech: archive.org has APIs that can be handy for bulk work and for automation/fine-tuning.
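For example, the availability and CDX endpoints are useful for checking what's already archived before you start hammering the save endpoint. A quick sketch (example.com is a placeholder, and jq is just there for readable output):
Code:
# closest existing snapshot for a URL
curl -s "https://archive.org/wayback/available?url=example.com" | jq .
# list captures via the CDX API (handy for bulk checks)
curl -s "https://web.archive.org/cdx/search/cdx?url=example.com&output=json&limit=5"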
Webcache I think I once used to catch a pre-stealth-edit version of a page. Correction: I remember now, it was a tutorial of some sort that was no longer accessible because the owner let their domain expire. It was link rot, not anything nefarious.
With Via Hypothesis, I can't remember exactly what happened, but there was some kind of incompatibility with a specific website. Maybe it had been coded to detect and reject the archive service, or maybe it was funky JavaScript. Either way, going through Via Hypothesis fixed the problem.
I wouldn't use them normally, but knowing they're there is useful for when you get into those weird situations.
Here's some good info on the Wayback Machine API. In the past when I looked for this, I could only find an out-of-date info page from 2013. It has always struck me as odd that the other archive.org APIs have decent documentation, but for some reason there's a blank when it comes to the web archiver. I came across this info while looking through some scripts on GitHub a few nights ago. I've also attached the PDF files to this post for those who want to avoid Google:
I can confirm that the API is working! I added the email option (&email_result=1) and received a summary email a while later. There may be a couple of things that don't work, but I haven't tested everything yet. Here's the link to get your archive.org API key, and the IA S3 API docs for reference.
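For anyone who wants to script it, the authenticated request looks roughly like this going by those docs (MYACCESSKEY:MYSECRET stands in for your own key pair, and example.com is a placeholder URL). If I'm reading the docs right, the JSON response gives you a job ID you can poll for status later.
Code:
# Save Page Now request with the API key; email_result=1 sends the summary email
curl -s -X POST "https://web.archive.org/save" \
  -H "Accept: application/json" \
  -H "Authorization: LOW MYACCESSKEY:MYSECRET" \
  -d "url=https://example.com/some-page" \
  -d "email_result=1"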
I have attempted archiving hundreds of URLs many times in the past, but never had any luck and always walked away disappointed. This has actually made a drastic difference when teamed up with the script on overcast07's GitHub. The script goes through and retries jobs that got rate-limited or timed out. The only thing it won't retry is when the archive is returning some error on their end, like the "chrome-error://chromewebdata/ (HTTP status=0)" one (edit: it actually does loop back through those, I just hadn't paid enough attention).
I didn't get a chance to really put things through their paces, but here are some preliminary results. The speed of archiving depends heavily on the website you're archiving. Twitter has always been slow to archive due to the large number of archives being made at any given moment and the slowdowns from the rate limiting that archive.org imposes. I ran 989 Twitter URLs through the script; it took about 8 hours, and I got the "chrome-error://chromewebdata/ (HTTP status=0)" error for a not-insignificant number of them. Sadly I didn't save the overall file count/stats to compare.
I also ran it on a list of 941 apnews.com URLs. It archived 878 of them in just over four hours (251m27.143s). 44 of the apnews.com links were dead, and 21 errored out with the chromewebdata problem. I can live with those stats compared to the crap speed and results I got in the past when trying to use the curl -I workaround to upload in bulk.
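For reference, that old workaround was basically just a dumb loop along these lines (urls.txt is a placeholder file name), with no retries and no real rate-limit handling, which is why the results were so poor:
Code:
# one HEAD request per line of urls.txt against the /save/ endpoint
while read -r url; do
    curl -sI "https://web.archive.org/save/$url" > /dev/null
    sleep 5   # crude pause so archive.org doesn't start rejecting requests immediately
done < urls.txt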
I was going to end the post here, but I think this will be very useful information to share, and although it's not directly related to the topic in the strictest sense of the word, I'm hoping that can be overlooked. I won't repeat the writeup I made in the tranch thread about archiving their Twitter; you can read it there. I was using twarc to get the list of their posts from the Twitter API. That worked well enough for my purposes, but I'd prefer a tool like twint that requires no account and bypasses some of the API limitations (unfortunately, twint hasn't worked for me in quite a while).
Last night I found a solution. snscrape just fell into my lap while I was looking into Telegram archivers. I knew it was worthy after it gave me the URLs to all 5,500+ of the official tranch account's posts without timing out. I've been doing some light reading on people gathering data sets for academic work, so I knew it was a good idea to scrape all of that into a jsonl file and then work on extracting the raw data into more useful formats. You can look more into that here, here and here. I didn't want to take any chances of missing tweets or of it timing out on me while getting the raw data, so I broke it up into six-month periods dating back to the beginning of the account and had it all append to one file (there's a sketch of the date-range approach after the commands below). Yes, I know I can use >> to append, but I like to see stuff scrolling by if I'm actually sitting there waiting, hence my use of tee -a.
edit (2022-01-20): I still haven't looked into the structure of the jsonl files it produces, so they might not be up to the same gold standard that twarc outputs. It should be a fairly easy workaround to get the full list of tweet URLs with snscrape and then grab the actual tweets with twarc. I've got plenty of testing to do over the next week.
Get url listing for all of user's tweets:
Code:
snscrape twitter-user <username> | tee -a username-url_list.txt
Get all of user's tweets and replies (minus retweets) in jsonl format:
Code:
snscrape --jsonl twitter-user <username> | tee -a username-raw_tweets.jsonl
Get profile info for user in jsonl format:
Code:
snscrape --jsonl --with-entity --max-results 0 twitter-user <username> | tee -a username-user_info.jsonl
If you want to get all of someone's posts with this, you have to use the "twitter-user" function like I did above.
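If you want to break a big account up into date ranges like I described above, one way is the twitter-search scraper with Twitter's since:/until: search operators. This is just a sketch of how the slicing could look (the dates and <username> are placeholders, and search may not catch absolutely everything the twitter-user scraper does):
Code:
# one six-month slice at a time, everything appended to the same jsonl file
snscrape --jsonl twitter-search "from:<username> since:2021-01-01 until:2021-07-01" | tee -a username-raw_tweets.jsonl
snscrape --jsonl twitter-search "from:<username> since:2021-07-01 until:2022-01-01" | tee -a username-raw_tweets.jsonl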
(2022-01-20) Changed the user agent string for lynx; no need to stand out. (2022-01-25) Put the old Twitter info behind a spoiler and pointed towards the Archival Tools thread.