Archival Tools - How to archive anything.

You can archive a practically limitless number of tweets using sites like archive.org, archive.is, and archive.today; as far as Twitter is concerned, the archivers are nothing more than users viewing linked tweets. The problem is when you want to archive an account that has amassed more than several thousand tweets. You could save every link by hand, but that's far too much effort.
It's not too much effort using a well-suited library (e.g. Python's requests is excellent). My main concern is being throttled for "suspicious activity", though as I said, if you keep a low profile there shouldn't be problems.
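For example, here's a minimal sketch of the requests approach, assuming you already have the tweet links in hand and that the Wayback Machine's save endpoint (https://web.archive.org/save/<url>) keeps behaving the way it does now; the delay is just there to keep a low profile:
Code:
import time
import requests

tweet_urls = [
    "https://twitter.com/someuser/status/123456789",  # placeholder
    # ...the rest of the links you want saved
]

for url in tweet_urls:
    # Ask the Wayback Machine to capture the page
    r = requests.get("https://web.archive.org/save/" + url, timeout=60)
    print(url, r.status_code)
    # Space requests out so the activity doesn't look suspicious
    time.sleep(10)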

This is from Kenneth Reitz, creator of requests: https://github.com/kennethreitz/twitter-scraper
This one looks more flexible: https://github.com/taspinar/twitterscraper

With Twitter's Search API you can only send 180 requests every 15 minutes. With a maximum of 100 tweets per request, this means you can mine 4 x 180 x 100 = 72,000 tweets per hour. By using TwitterScraper you are not limited by this number but by your internet speed/bandwidth and the number of instances of TwitterScraper you are willing to start.

One of the bigger disadvantages of the Search API is that you can only access tweets written in the past 7 days. This is a major bottleneck for anyone looking for older data to build a model from. With TwitterScraper there is no such limitation.
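If the scraper does what its README promises, basic usage is roughly this (a sketch; I haven't double-checked the keyword argument names, and the attributes on the returned tweet objects vary between versions):
Code:
import datetime as dt
from twitterscraper import query_tweets

# Search operators work the same as on twitter.com/search
tweets = query_tweets("from:someuser",
                      begindate=dt.date(2009, 1, 1),
                      enddate=dt.date.today(),
                      poolsize=20)

print(len(tweets), "tweets scraped")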
 
I made a script that grabs a user's tweets and archives them automatically at archive.is and archive.org, using their APIs.

It is located here:
https://github.com/anomiee/tweet_archiver



I did research on this yesterday, here are my findings (also in the README.md of the Github repo project):
I have tried using different techniques to retrieve all of a user's tweets, such as:
  • Using Selenium to scroll through a user's timeline (it stops you after a certain amount)
  • Using Selenium to search in day increments from the user's account creation date to the current date (it simply does not show tweets on some days, even though tweets were posted on those days)
  • Using the Twitter Search API rather than the User API, which claims to have no limit (it cuts you off after ~1k tweets, or gives completely unreliable results)
Additionally, I have tried the solutions you posted already, and they have a bunch of issues, like not getting any/all tweets for certain days, Twitter changes breaking Selenium, etc.

If you have a way around it, please open an issue on my repository or fork it. I've made the script very easy to modify: all the code block that archives the tweets needs is an array of tweet IDs to iterate over.
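For reference, IDs alone are enough to build archivable links (a tiny sketch, not the actual repo code; the /i/web/status/ form redirects to the tweet without needing the username):
Code:
tweet_ids = ["1234567890123456789"]  # placeholder; plus whatever else your collector produces

# /i/web/status/<id> redirects to the full tweet URL, no username required
urls = ["https://twitter.com/i/web/status/" + i for i in tweet_ids]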
 
  • Using Selenium to search in day increments from the user's account creation date to the current date (it simply does not show tweets on some days, even though tweets were posted on those days)
Do you have any examples of this? It's the approach I use and the number of tweets it yields is usually the same as that reported in the user profile page.
 

How many tweets have you been retrieving? Try it on someone with a lot of tweets. I have been using https://twitter.com/NickJFuentes as my test case.

When I tried this earlier with the Selenium method, it only retrieved a few thousand tweets.

Here's another example: check out https://twitter.com/DoomPoathster. It says he has 13.9k tweets, but all methods only retrieve 200-300. Using the Selenium method (with the quality filter off), the results were missing a lot of tweets, so I scrolled through his timeline and found this tweet, made on Oct. 10:

https://twitter.com/DoomPoathster/status/1049961474842456064

It has no flagged words or anything, and I checked his @ on shadowban.eu and it isn't search-banned. Let's try finding it with the date-range method:
https://twitter.com/search?f=tweets...e:2018-10-09 until:2018-10-12&src=typd&qf=off

(screenshot of the search results)


Let's try just searching for his name:
https://twitter.com/search?f=tweets&q=from:DoomPoathster&src=typd&qf=off

(screenshot of the search results)


It stops at Oct. 17.

It's clear that Twitter has put a lot of effort into preventing you from scraping all of a user's tweets with browser-based methods, presumably to push you toward buying their API.
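For anyone who wants to reproduce the date-range checks above, the search URLs can be generated in day increments like this (a rough sketch using Twitter's public since:/until: search operators; it assumes the qf=off quality-filter parameter keeps behaving the way it does now):
Code:
import datetime as dt
from urllib.parse import quote

def day_search_urls(user, start, end):
    """Yield one twitter.com search URL per day in [start, end)."""
    day = start
    while day < end:
        q = "from:{} since:{} until:{}".format(user, day, day + dt.timedelta(days=1))
        yield "https://twitter.com/search?f=tweets&q=" + quote(q) + "&src=typd&qf=off"
        day += dt.timedelta(days=1)

for url in day_search_urls("DoomPoathster", dt.date(2018, 10, 9), dt.date(2018, 10, 12)):
    print(url)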
 
Is there a way to archive Deviantart pages these days? It seems like they've done something to make Archive.is useless. I might go with the screenshot extension instead.
 

A trick that I sometimes use is the "Save As" function, which is basically archive.is, but local. It downloads a local copy of the webpage; the result is an HTML file plus a folder containing the images, style data, and all that junk. It's more of a remnant from the dial-up days, when downloading pages locally was more of a thing, but it's still useful for quickly archiving webpages. It also seems to work for Deviantart pages (at least for me). Ctrl+S (or Cmd+S for the Apple fanboys) should bring it up in both Firefox and Chrome.
 


I just did both for good measure.
 
Hey fellas, just posting a quick CLI tool I made to archive any given URL.

https://github.com/anomiee/instamirror

I found myself wanting something like this whenever I want to make sure something is archived in as many places as possible without the hassle of going to each archival site manually. For those savvy enough, you can also just make a list of URLs you want archived and pipe them into stdin. Might need xargs though, haven't tested that.
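I haven't checked how the tool itself reads its input, but if you wrap it (or any archiver) in something that consumes stdin line by line, no xargs is needed; the pattern is just this (a sketch, not instamirror's actual code, and the command name in the subprocess call is a placeholder):
Code:
import sys
import subprocess

# usage: cat urls.txt | python archive_stdin.py
for line in sys.stdin:
    url = line.strip()
    if url:
        # hand each URL to whatever archiver you use (placeholder command name)
        subprocess.run(["instamirror", url])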


If you are having trouble with archive.is and archive.org, you have a few options:
  • Free tools like HTTrack or Node Website Scraper
  • Some paid alternatives that are probably better
  • Page-by-page screenshots
  • Writing a script with Selenium and Beautiful Soup (bs4) that grabs the elements of interest and archives them (particularly good for AJAX-y/infinite-scroll pages like Reddit and Facebook)
The last option is something we really need, especially as it would let you keep verbatim copies of pages that require authentication (see the sketch below). However, writing Selenium scripts is a lot of work, as you are basically teaching the browser how to browse the website based on its code, and it very quickly becomes a big task.

A concern with the more 'dumb' website scrapers is that you will never get much of a working site out of them with most of the new-age webapps, especially single-page apps.
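Here's a rough sketch of the Selenium + Beautiful Soup pattern for infinite-scroll pages (it assumes Chrome plus the selenium and bs4 packages; the URL, selector, and scroll count are placeholders you'd adapt per site):
Code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/some-infinite-scroll-page")  # placeholder URL

# Scroll a few times so the AJAX-loaded content actually appears in the DOM
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# Parse whatever is loaded and pull out the elements of interest
soup = BeautifulSoup(driver.page_source, "html.parser")
for post in soup.select("div.post"):  # placeholder selector
    print(post.get_text(strip=True))

driver.quit()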

I made a quick Python script to get archived streams from stream.me:
https://pastebin.com/Js7XgvXM
You should keep stuff like that on Github or something.
 
Is there a way to archive Deviantart pages these days? It seems like they've done something to make Archive.is useless. I might go with the screenshot extension instead.

For archive.is, add https://via.hypothes.is/ to the beginning of the dA URL before you archive it: Example

This also currently works on plenty of other websites that get screwed up while being archived (whether it's producing a blank page or a "Network Error :(" message). Alternatively, you could mess around with using archive.is on a Wayback Machine capture of the page, though I don't recall if that still produces a messed-up page for post-2017/2018 dA stuff.

Oh, and using archive.is on a temporary Freezepage archive also seems to work, in case via.hypothes.is stops working in the future.
 
I found a tool that might be useful for saving ongoing live streams. It's called Streamlink, and it works on the Linux command line and in Windows PowerShell (if you have Python installed).

Link to the GitHub project: https://github.com/streamlink/streamlink

To download a live stream from the start while it is in progress, use this command:
Code:
streamlink <URL to Twitch or YouTube stream> best --hls-live-restart -o out.mp4


A script for my Linux lads:
Code:
#!/bin/bash
set -e -o pipefail
# Grab the stream from its start (--hls-live-restart) and save it under a timestamped filename
streamlink --hls-live-edge 99999 --hls-segment-threads 5 --hls-live-restart -o "/home/<user>/Videos/$(date +%Y-%m-%d-%R).mp4" "$1" best
exit
You could just use the main line as an alias, but it's better as a script, because then you can use it with "Open With" in Firefox super easily.
If you have a VPS, it's best to run it there and then export the file, as internet disruptions will cut the recording short.
 
A question for the experts: sometimes while combing through the Wayback Machine trying to find records of deleted stuff, I come across a site that has content behind some kind of login. When trying to access the page, it shows a login screen, and I can't get any further. It doesn't appear that the Wayback Machine is capable of archiving stuff like this. Is there a way around this, or am I out of luck?

To clarify, this isn't me trying to archive things behind a login; it's about trying to find records of deleted things that were behind a login. Also, are there any other big auto-archiving alternatives to archive.org? I know about archive.is, but it doesn't seem to capture large swaths of random internet content automatically, only when people do it manually.
 
A question for the experts: sometimes while combing through the Wayback Machine trying to find records of deleted stuff, I come across a site that has content behind some kind of login. When trying to access the page, it shows a login screen, and I can't get any further. It doesn't appear that the Wayback Machine is capable of archiving stuff like this. Is there a way around this, or am I out of luck?
This particular issue is a pretty significant hindrance in the Something Awful Community Watch thread. Many notable or drama-heavy parts of the site are locked behind that $10 account paywall. The only effective fix content-digging Kiwis with SA accounts can really use is to download the entire webpage as a .PNG screenshot (using the tool mentioned in the OP of this thread) and crop it down to the content, obscuring their account name.

You could also download the site as HTML and try to upload all the content manually, but that has the major downside of revealing whatever login name you're using, making it virtually useless for safely archiving content for Kiwi Farms.
To clarify, this isn't me trying to archive things behind a login; it's about trying to find records of deleted things that were behind a login.
If the archive site couldn't get past the login screen when it crawled the page, you won't find that content there, unless the page didn't require a login at some earlier point and someone archived it back then.
Also, are there any other big auto-archiving alternatives to archive.org? I know about archive.is, but it doesn't seem to capture large swaths of random internet content automatically, only when people do it manually.
You might have some luck mass-archiving large sites with some kind of scraping utility.
 
archive.li seems to capture a blank screen for Instagram
Yeah, it’s always been unable to archive Instagram pages. I usually just end up taking screenshots instead.

In the URL, just replace www.in with web.
Example:
https://www.instagram.com/p/BtG2MIhntOT/
https://web.stagram.com/p/BtG2MIhntOT
and archive that link:
http://archive.li/1ZEK6
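In code terms, the rewrite is just a string replacement before you hand the link to the archiver (a sketch; it assumes web.stagram.com keeps mirroring posts at the same path):
Code:
insta_url = "https://www.instagram.com/p/BtG2MIhntOT/"

# "www.instagram.com" -> "web.stagram.com"
mirror_url = insta_url.replace("www.in", "web.")
print(mirror_url)  # https://web.stagram.com/p/BtG2MIhntOT/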

edit:

Archiving method for YouTube videos without downloading them to your PC

You'll need to set up accounts on the following online services:

Multcloud https://www.multcloud.com/ (free account, 50 GB of traffic per month)
Dropbox https://www.dropbox.com/
Flickr https://www.flickr.com/ (you can archive 1,000 mp4 videos of up to 1 GB each)

Add your Dropbox and Flickr accounts to Multcloud.

Go to http://convert2mp3.net/en/index.php, paste the YouTube video link, and convert it to mp4.

After the conversion is finished, use the option "Save to Dropbox".

When the file is saved, go to Multcloud and move it from Dropbox to Flickr/My Photostream.

*You can add multiple Flickr accounts.
**You can use this method with other services like Mega alongside Flickr to store videos in multiple locations, in case one of the services shuts down (3 different services should be enough).

edit 2:
Method for archiving YouTube channels without downloading them to your PC

You'll need to set up accounts on the following online services:

Multcloud https://www.multcloud.com/ (free account, 50 GB of traffic per month)
pCloud https://www.pcloud.com/ (WARNING - do not use for long-term archiving; free pCloud accounts not connected to a desktop/mobile client expire after a while)
Flickr https://www.flickr.com/ (you can archive 1,000 mp4 videos of up to 1 GB each)

Add your pCloud and Flickr accounts to Multcloud.

Go to https://youtubemultidownloader.net/channel.html, paste the YouTube channel link into the "Channel ID:" field and wait until the links are generated, then Ctrl+A, copy, and paste them into a notepad/pastebin.

Go to https://my.pcloud.com/, click the Upload button and select remote upload. In the field below, copy/paste up to 4-5 links at a time (or more if pCloud can handle it) and upload them.

When the videos are uploaded, go to Multcloud and move them from pCloud to Flickr/My Photostream, Mega, ...
 
As long as your media is legal and not copyrighted, I encourage people to upload their archives to the Internet Archive at archive.org.

From what I've seen, they rarely delete things, and I trust their longevity far more than Mega or Flickr, for example.
Even Nintendo couldn't get their Nintendo Power magazines taken down permanently (https://archive.org/details/Nintendo_Power_Issue001-Issue127), and the Internet Archive has been used as evidence in numerous legal cases.
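If you'd rather script the uploads than use the web uploader, the official internetarchive Python package handles it; here's a minimal sketch (the identifier, filename, and metadata below are made-up placeholders, and you need to run "ia configure" once beforehand to store your archive.org keys):
Code:
from internetarchive import upload

# pip install internetarchive; "ia configure" stores your S3-style keys
r = upload("my-stream-backup-2019",                # placeholder item identifier
           files=["stream-2019-02-01.mp4"],        # placeholder local file
           metadata={"title": "Stream backup 2019-02-01",
                     "mediatype": "movies"})
print(r[0].status_code)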
 
I found an extension that makes archiving shit go a lot faster if you're on Firefox or Chrome:
The Archiver

It adds a folder to your context menu with two options.
Save It, when clicked, shoots the page off to three different archival sites: Wayback, Archive.is, and WebCite (though you need to add a working email for WebCite to work).
Check It lets you see whether any of those sites already have a record of the page. You can also change the settings if you don't want to use a particular site.

Both Wayback and Archive.li auto-archive with it, AFAICT, so you don't even need to interact with the tab until it's done loading.

Supposedly the guy made a Chrome version before the Firefox one, but I can't find it; he put it behind a Mega link on his site. A zip of the .crx is attached in case anything happens to it.

The only real shortcoming is that you can't add any extra archival sites such as hypothes.is.
 
