Archival Tools - How to archive anything.

CatchFire · Sep 16, 2018

I don't know if this will be useful to anybody, but here is another tool I found on GitHub. It pulls all of the messages from a channel in a Discord server into a single file (it also works with PMs). It is available for Windows, Linux and Mac OS X (the Mac OS X and Linux versions only work on the command line, while the Windows version has a GUI). There is also a Docker image if you want to use that.

GitHub: https://github.com/Tyrrrz/DiscordChatExporter
Build and Run Instructions: https://github.com/Tyrrrz/DiscordChatExporter/wiki/Linux-usage-instructions

Here is another archiving tool I just found that might be useful in certain cases. It's called waybackpack, and it downloads copies of websites from the Wayback Machine. What makes this tool powerful is that you can specify a time frame to download archives from; for example, you could download the archives of a specific website between 2003 and 2004. Anyway, it might be useful, it might not be.

GitHub: https://github.com/jsvine/waybackpack
Install with pip: pip install waybackpack
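
For what it's worth, the 2003 to 2004 example could be driven from Python roughly like this. It is only a sketch: it assumes waybackpack is installed and on your PATH, the URL and output folder are placeholders, and the -d, --from-date and --to-date flags are the ones I remember from the project's README, so check `waybackpack --help` before relying on them.

```python
import subprocess


def waybackpack_command(url, out_dir, from_date, to_date):
    """Build the argument list for a date-limited waybackpack download."""
    return [
        "waybackpack", url,
        "-d", out_dir,                  # folder the snapshots are written to
        "--from-date", str(from_date),  # earliest capture to fetch
        "--to-date", str(to_date),      # latest capture to fetch
    ]


if __name__ == "__main__":
    # Placeholder site and folder, not taken from the post above.
    cmd = waybackpack_command("http://example.com/", "example-archive", 2003, 2004)
    print(" ".join(cmd))
    # Uncomment to actually run the download:
    # subprocess.run(cmd, check=True)
```

Building the command as a list (instead of one big string) avoids shell quoting problems if the URL contains characters like & or ?.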

Wärring Ornac · Sep 22, 2018

Another archive alternative is freezepage

http://www.freezepage.com/1537623828PRNPGDUCXX

What makes FreezePage unique is that when archiving a page, you have the option to archive it with or without certain elements.

Here is an example of choosing to archive text only:

http://www.freezepage.com/1537623965EXJHBCOFJZ

However, read its FAQ:

Frequently Asked Questions

What is FreezePage?
FreezePage is a free service for taking online snapshots of web pages.

With FreezePage you can freeze web pages so they can be recalled in their exact form at a later time or date.

Web pages change all the time, but with FreezePage you can be sure they stay the same.

What can I use FreezePage for?
FreezePage can be used for a number of things. If you use FreezePage instead of saving Web pages to your own hard disk, you can access your saved pages from anywhere on the Web. And you can easily show the pages to friends, colleagues, or a greater public exactly as they were when you froze them.

Since you cannot change a frozen page, FreezePage can also be used to prove exactly how a Web page looked at a specific date:

Journalists or researchers can make references to web contents easily and safely.

Content managers can save and recall site sections that change often.

Lawyers and other professionals get a third party authority to back them up in relation to copyright infringement, defamation cases, etc.

Consumers can document special offers, prices, terms, etc. from the web.

What can I compare FreezePage to?
You can compare FreezePage to these other services:

Google Cached Pages saves pages as they are crawled by the search engine's robot. However, Google only makes the latest version available in relation to your searches and it doesn't save images and other page elements like FreezePage does.

The Wayback Machine takes more complete snapshots (including images and some other elements), but you do not control when individual pages should be copied. If you are looking for a historic page, you can be lucky to find it here.

As far as we know (please let us know if you think otherwise), we are the only service that allows the user to freeze any web page at any time and to get full control of the frozen page.

How reliable is FreezePage?
We have been in operation since 2003 and during all this time, we have not experienced data loss of any kind.

We do run full daily backups from a remote location.

Please notice, however, that our Terms of Use do not give any guarantees.

Does FreezePage cost anything to use?
Our basic service is free. From this moment, you have your own, fully-functional personal account ("My Frozen Pages").

If you use FreezePage a lot, you may consider a Premium Account for more storage space, priority access, and advanced features.

Another site had a link to FreezePage. Are they your partners?
FreezePage is privately owned and remains entirely independent of other sites. That is why you can use FreezePage to document the content of a given Web site at a given point of time. Web pages saved on your own hard disk can easily be modified and therefore represent poor evidence in any kind of dispute or demonstration. Frozen pages cannot be changed and FreezePage is an independent third party who guarantees their authenticity.

How does FreezePage work?
In principle, it's very simple. When you enter a Web address, we take a snapshot of the Web page and save it (cache it) on our system. Along with the main Web page, we also download and save all the elements on the page (images, stylesheets, script files, etc.). It all happens within a matter of seconds, depending on how fast the other site is. We then add the page to your list of frozen pages so you can easily recall the page in its original form.

How do I share a frozen page with other people?
When you freeze a Web page, we automatically add it to your list of frozen pages. But we also provide you with a unique shortcut to the page.

Can I freeze any kind of page with FreezePage?
Our system works properly with the vast majority of Web pages, including secure pages (those starting with https). However, two noticeable exceptions exist:

You cannot freeze pages which have been personalized to you. For instance, if you try to freeze your online mailbox, you will not see your personal mailbox, but probably a log-in/sign-up screen. The reason is that the email service provider will not recognize you behind the FreezePage service. Instead, the email service provider will assume that FreezePage is a new user who is logging on for the first time.

Finally, you cannot view contents based on scripts. In most pages, this does not affect the main contents, but only means that banner ads etc. do not appear (for more information, please see the question below). A few pages, however, check if the users have enabled JavaScript and shut them out if they have not. If you try to freeze a page like this, you may get a message like "You must enable JavaScript to access this page".

What page elements are retrieved?
When you freeze a page, you can choose to retrieve and save:

All elements. With this option, everything is saved: the page with all text and formatting (HTML+CSS), images, other embedded elements, and script files.

All elements except script files. As all scripts (embedded or linked) are automatically disabled by FreezePage, script files are usually not important.

Text only. Only the page with all text and formatting (HTML+CSS) are retrieved; images and other embedded elements are skipped and replaced by generic images.
Use this option to speed up the freeze process and use less space, when you are only interested in the text contents of a page.

You can also skip remaining images, script files, etc. at any time during the freeze process by clicking the "Skip Images" button.

Why does this frozen page not look (exactly) as when I view it in my browser?
You can be sure that FreezePage always displays pages exactly as they were retrieved from the Web site. There are, however, three main reasons why a page may appear different when you freeze and view it at FreezePage and when you visit it directly:

The most obvious explanation is of course that the page has changed since you froze it. By nature, the front page of online news providers and similar services can change from one minute to the next. This is also true with advertisement contents, which is typically rotated and replaced constantly.

Another reason is that some Web sites use technology to personalize their pages according to who they are sending it to. As explained in the above answer, you cannot freeze pages, which have been personalized to you via a login. But there are also many Web sites that send different pages to users who have never logged in. For instance, some Web sites look at your IP address and determine where you are located. They can then send you a Web page in your own language, possibly containing local contents. If you are from another country or region than FreezePage, which is located in the U.S., you may get a different page than us from the same address.

A final explanation is that the contents you view is based on scripts. Some HTML pages contain JavaScript and other client-side script languages, which add to or modify the Web page after it has been downloaded to the browser. We automatically disable these scripts to avoid redirects and conflicts with FreezePage's system. For that reason, some frozen pages may be missing dynamic contents or interactive elements such as drop-down menus that appear when viewing the page directly with your Web browser.

Are there limitations to the frozen pages?
In order to save resources, we have set a few rules for the size of Web page you can freeze. The Web page must:

Not be bigger than 3 MB (or 10 MB for premium user accounts),

Include less than 500 embedded elements (images, stylesheets, script files, etc.),

Be retrievable within 120 seconds.

In practice, these limitations rarely enter into effect and, if it happens, we will let you know when you try to freeze the Web page.

How long are frozen pages stored?
From the moment you enter our site, you have your own personal account. When you freeze pages, they are automatically saved to your account as "My Frozen Pages".

To save space on our system, we require that you use your account regularly, i.e. that you log in or visit any page on our site. If you don't, we will delete your account and frozen pages in it.

If you are an unregistered user, you must visit our site every 3 days.

If you are a member (sign up for free), we only require you to log in once a month (every 31 days).

Premium Users are, of course, not subject to this requirement.

FreezePage does not work on my computer. What should I do?
If you have found a bug in FreezePage, we would be very happy to know. Please notice, however, that we only support relatively recent browsers. A Web service always has a trade-off between innovative features and the ability to run on all platforms and browsers.

Our system should work flawlessly with any of the following browsers:

Mozilla Firefox

Google Chrome

Internet Explorer 7.0 or higher on Windows

Safari on Mac OS X

Opera

iPhone + Android browsers (WebKit)

If you are using one of these browsers, please let us know if you find any bugs. If not, please consider upgrading your browser first.

Why is scripting disabled?
JavaScript and other client scripts are always disabled in frozen pages in order to avoid popup windows, automatic redirects, and other unwanted behavior.

As a result, dynamic contents such as rotating ads may not appear and interactive features such as drop-down menus may not function on the page.

A few pages use JavaScript to change the formatting of the page, but most of the time disabled scripts will just result in a blank spot where rotating banners or certain Flash animations or video applets would have been.

I am running a Web site. Can I prevent you from freezing my pages?
No. You cannot prevent users from freezing a page from your server, just like you cannot prevent your visitors from manually copying pages to their own computer. In this context, the FreezePage engine does not obey robot tags (such as <META name="ROBOTS" content="NOARCHIVE">), since we do not consider FreezePage a robot. FreezePage acts as an agent between the user's browser and your Web server, but there is a real person behind each request.

It is, however, our intention to respect copyrights and other legal issues. FreezePage provides a system for saving and recalling Web pages, but we cannot effectively check all frozen pages on our system. It is the responsibility of the users to respect copyrights and other contents-related issues. If you encounter misuse of any kind, please contact us through your local authorities and we will help you identify offenders through our log files.

Why does FreezePage sometimes get another page than I do?
Our servers are located in the United States. Thus, when you freeze a web page, the web site thinks you are American.

Some web sites are programmed to give their visitors localized contents. In this case, you get contents targeting an American audience.

From the moment you enter our site, you have your own personal account. When you freeze pages, they are automatically saved to your account as "My Frozen Pages".

To save space on our system, we require that you use your account regularly, i.e. that you log in or visit any page on our site. If you don't, we will delete your account and frozen pages in it.

If you are an unregistered user, you must visit our site every 3 days.

If you are a member (sign up for free), we only require you to log in once a month (every 31 days).

Premium Users are, of course, not subject to this requirement.

So if I'm not wrong, it appears that they can delete your page. Luckily, FreezePage is fully compatible with archive.is, so a frozen page can itself be archived there:

https://archive.is/iEjn6

Ragged Beef · Sep 22, 2018

Here's a method for archiving Reddit accounts. I know jack shit about coding but I can do it so you can too. I stole this off of Voat, everyone's favorite alt-right Reddit alternative.

Reddit data is available on BigQuery
https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_11

Click on "Compose Query" and paste the following:

SELECT
id
,link_id
,parent_id
,subreddit
,author
,score
,STRFTIME_UTC_USEC(created_utc*1000000,"%Y/%m/%d %H:%M:%S") AS CreatedOnUTC
,"http://www.reddit.com/comments/" + SUBSTR(link_id,4) + "/_/" + id AS URL
FROM
[fh-bigquery:reddit_comments.2007]
,[fh-bigquery:reddit_comments.2008]
,[fh-bigquery:reddit_comments.2009]
,[fh-bigquery:reddit_comments.2010]
,[fh-bigquery:reddit_comments.2011]
,[fh-bigquery:reddit_comments.2012]
,[fh-bigquery:reddit_comments.2013]
,[fh-bigquery:reddit_comments.2014]
,[fh-bigquery:reddit_comments.2015_01]
,[fh-bigquery:reddit_comments.2015_02]
,[fh-bigquery:reddit_comments.2015_03]
,[fh-bigquery:reddit_comments.2015_04]
,[fh-bigquery:reddit_comments.2015_05]
,[fh-bigquery:reddit_comments.2015_06]
,[fh-bigquery:reddit_comments.2015_07]
,[fh-bigquery:reddit_comments.2015_08]
,[fh-bigquery:reddit_comments.2015_09]
,[fh-bigquery:reddit_comments.2015_10]
,[fh-bigquery:reddit_comments.2015_11]
,[fh-bigquery:reddit_comments.2015_12]
,[fh-bigquery:reddit_comments.2016_01]
,[fh-bigquery:reddit_comments.2016_02]
,[fh-bigquery:reddit_comments.2016_03]
,[fh-bigquery:reddit_comments.2016_04]
,[fh-bigquery:reddit_comments.2016_05]
,[fh-bigquery:reddit_comments.2016_06]
,[fh-bigquery:reddit_comments.2016_07]
,[fh-bigquery:reddit_comments.2016_08]
,[fh-bigquery:reddit_comments.2016_09]
,[fh-bigquery:reddit_comments.2016_10]
,[fh-bigquery:reddit_comments.2016_11]
,[fh-bigquery:reddit_comments.2016_12]
,[fh-bigquery:reddit_comments.2017_01]
,[fh-bigquery:reddit_comments.2017_02]
,[fh-bigquery:reddit_comments.2017_03]
,[fh-bigquery:reddit_comments.2017_04]
,[fh-bigquery:reddit_comments.2017_05]
,[fh-bigquery:reddit_comments.2017_06]
,[fh-bigquery:reddit_comments.2017_07]
,[fh-bigquery:reddit_comments.2017_08]
,[fh-bigquery:reddit_comments.2017_09]
,[fh-bigquery:reddit_comments.2017_10]
,[fh-bigquery:reddit_comments.2017_11]
,[fh-bigquery:reddit_comments.2017_12]
,[fh-bigquery:reddit_comments.2018_01]
,[fh-bigquery:reddit_comments.2018_02]
,[fh-bigquery:reddit_comments.2018_03]
,[fh-bigquery:reddit_comments.2018_04]
,[fh-bigquery:reddit_comments.2018_05]
,[fh-bigquery:reddit_comments.2018_06]
,[fh-bigquery:reddit_comments.2018_07]
WHERE author = 'username' ORDER BY CreatedOnUTC

Important:
You will need to add more of those ",[fh-bigquery:reddit_comments.20xx_xx]" lines, depending on the date you do this. Check how far the archive goes, and add lines accordingly.
On the last line, change 'username' to the username of the account you want to archive. The username is case-sensitive, and you must keep the apostrophes around it.
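
If typing out all those table lines by hand gets tedious, something like the following could generate them. It is a rough sketch: the function names are made up for illustration, it assumes the last table you want is from 2015 or later, and it only follows the naming pattern visible in the query (one table per year through 2014, one per month after that). It cannot tell you whether a given month's table actually exists, so you still have to check how far the archive goes.

```python
def reddit_comment_tables(last_year, last_month):
    """Return every table name up to and including last_year/last_month."""
    # 2007 through 2014 are stored as one table per year.
    tables = [f"[fh-bigquery:reddit_comments.{year}]" for year in range(2007, 2015)]
    # From 2015 onward there is one table per month, named YYYY_MM.
    for year in range(2015, last_year + 1):
        final_month = last_month if year == last_year else 12
        for month in range(1, final_month + 1):
            tables.append(f"[fh-bigquery:reddit_comments.{year}_{month:02d}]")
    return tables


def from_clause(last_year, last_month):
    """Join the tables in the same leading-comma style as the query."""
    return "FROM\n" + "\n,".join(reddit_comment_tables(last_year, last_month))


if __name__ == "__main__":
    # Reproduces the FROM section of the query, ending at 2018_07.
    print(from_clause(2018, 7))
```

Paste the printed text over the FROM section of the query and leave the SELECT and WHERE parts as they are.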

When you're done, run the query and wait until it completes. When it finishes, it'll present you with different methods to download the account's history. The easiest method imo is to download the data as an Excel file.

Pros:

You don't have to archive a shit ton of pages on archive.is, this method archives thousands of comments at once.
Reddit hides posts that are older than 1 year or so on profile pages. This method bypasses that.
You can customize the query as you wish, given you know how to use this crap (I don't).

Cons:

The last 2-3 months of posts are missing. You still need to archive the last couple of pages of an account through archive.is. USE THE OLD DOMAIN (old.reddit.com), or the account is archived with the redesign and it looks horrible, sometimes even completely broken.
You need to log in to your Google account to use BigQuery (as it's a Google service), so you cannot access this data anonymously. I don't believe other users can see your activity, but Google certainly can.
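
On the old-domain point: if you have a list of Reddit links to feed into archive.is, a tiny helper can swap the domain for you. This is just a sketch; the function name and the set of hostnames it rewrites are my own assumptions, and the example link is a placeholder.

```python
from urllib.parse import urlsplit

# Hostnames that should be swapped for the old layout (an assumed list).
REDDIT_HOSTS = {"reddit.com", "www.reddit.com", "new.reddit.com"}


def to_old_reddit(url):
    """Rewrite a Reddit URL to old.reddit.com, leaving other URLs untouched."""
    parts = urlsplit(url)
    if parts.netloc.lower() in REDDIT_HOSTS:
        parts = parts._replace(netloc="old.reddit.com")
    return parts.geturl()


if __name__ == "__main__":
    print(to_old_reddit("https://www.reddit.com/user/example/comments/"))
```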

awoo · Sep 30, 2018

Something anomalous I noticed from the Kavanaugh megathread:

The "header" of this wayback archive https://web.archive.org/web/2018091...emyprofessors.com/ShowRatings.jsp?tid=1352705 shows the following reviews in plaintext but not among the 5 student reviews present:

09/25/2014

awful
1.0 Overall Quality
5.0 Level of Difficulty
MSW510 For Credit:Yes Attendance: N/A Textbook Used: Yes Would Take Again: N/A Grade Received: N/A

Christine ford is the worst educator I have ever experienced. Avoid taking her class and avoid any interaction with this person. I feel like she has something wrong with her and I am surprised no one has caught this. Also avoid fullerton's MSW program as long as she is there.

20 people found this useful 2 people did not find this useful
report this rating
04/19/2014

awful
1.0 Overall Quality
5.0 Level of Difficulty
MSW510 For Credit:N/A Attendance: Mandatory Textbook Used: Yes Would Take Again: N/A Grade Received: A

Prof. Ford is unprofessional, lacks appropriate filters, and I am honestly scared of her. She’s made comments both in class and in e-mails, if you cross her, you will be on her bad side. I fear to think of the poor clients that had to deal with her while she got her MSW and her LCSW. Absolutely the worst teacher I ever had.

20 people found this useful 1 person did not find this useful
report this rating

They only appear in the previous day's capture which has 8 student reviews: https://web.archive.org/web/2018091...emyprofessors.com/ShowRatings.jsp?tid=1352705

A Robin · Oct 6, 2018

awoo said:
Something anomalous I noticed from the Kavanaugh megathread:

The "header" of this wayback archive https://web.archive.org/web/2018091...emyprofessors.com/ShowRatings.jsp?tid=1352705 shows the following reviews in plaintext but not among the 5 student reviews present:

I attached an image of this since the Internet Archive people probably plan on fixing this kinda thing someday. I've seen the Wayback banner/header have its links messed up by a forum that had those VigLink ad links all over it (or some similar ad thing, don't quite recall), so it's not always a helpful bug. It's pretty funny when you go back and visit Null's 2017 profile though (pic also attached).

Hope the following isn't too far off topic, but it's relevant to searching archives and I don't know where else to dump it:

Some mad archivist crawled (I assume) all or most of Kiwi Farms during July to August 2018, including external links. They are on archive.org, in the form of bunches of WARC files: 1 2 3 Currently, they are browsable in the Wayback Machine because of the collection the user uploaded them to. However, based on what I've observed, they will likely be moved to the WARCZone collection in the future, meaning they possibly will no longer be easily viewed in Wayback even if the WARCs are still on archive.org. Unfortunately I'm currently not smart enough to know anything about dealing with WARCs, but this article may be helpful for people who are.
Sometimes if a URL is saved in the Wayback Machine numerous times in a day, navigating between the captures is very tedious; the calendar view won't show links to all the captures. But I found there's an API thing that even people like me can figure out: go to https://web.archive.org/cdx/search/cdx?url=[someURL] (Example). This will show all the timestamps for every capture for that one URL, along with page sizes and http status codes.
This may not be news to some, but http://timetravel.mementoweb.org/ allows you to search for a URL in multiple archives at once, including Wayback, archive.today, WebCite, perma.cc, the Library of Congress, and many other web archives from various countries. It's clunky, not perfect with its results, and only lets you search for an exact URL, but I thought it was neat and I had no idea it existed until blog.archive.is posted it.
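
For anyone who wants to script the CDX lookup mentioned above instead of reading it in the browser, here is a minimal sketch. It assumes the plain-text output uses the default space-separated field order (urlkey, timestamp, original, mimetype, statuscode, digest, length); the function names are made up, and I have not tested it against very large result sets.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
# Assumed default field order of the plain-text CDX output.
CDX_FIELDS = ("urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length")


def cdx_query_url(page_url):
    """Build the CDX search URL for one page."""
    return CDX_ENDPOINT + "?" + urlencode({"url": page_url})


def parse_cdx_line(line):
    """Turn one line of CDX output into a dict keyed by field name."""
    return dict(zip(CDX_FIELDS, line.split()))


def capture_url(row):
    """Build the Wayback link for one parsed capture."""
    return f"https://web.archive.org/web/{row['timestamp']}/{row['original']}"


def fetch_captures(page_url):
    """Download and parse every capture listed for page_url (needs network)."""
    with urlopen(cdx_query_url(page_url)) as response:
        text = response.read().decode("utf-8")
    return [parse_cdx_line(line) for line in text.splitlines() if line.strip()]

# Usage (not run here because it hits the network):
#   for row in fetch_captures("example.com"):
#       print(row["statuscode"], capture_url(row))
```

This gets you a direct link to each capture along with its status code, which is handy when the calendar view refuses to show them all.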

Edit weeks later: 1) To update on those warcs, they were moved to the other collection but it seems they are actually still viewable in Wayback (I had seen other warc stuff in there that wasn't, which is what led me to post this originally) 2) I later found a guide for dealing with Wayback's CDX API. Pretty handy imo. It even lets you search for archived subdomains like archive.today does.
Nov 3.: I'm not seeing the warcs on Wayback now. Could change again, but, there you have it.

Diabeetus · Oct 22, 2018

Not sure if anyone's brought this up yet, but this is a really useful tool for saving hour-long videos. No ads, no extra redirects, no shenanigans. Put the link in the box, click the link it gives you, and it's ready for you to download.

awoo · Oct 22, 2018

Diabeetus said:
Not sure if anyone's brought this up yet, but this is a really useful tool for saving hour-long videos. No ads, no extra redirects, no shenanigans. Put the link in the box, click the link it gives you, and it's ready for you to download.

Please stop posting these websites. It is fairly clear to me (I'd bet a pretty penny) from the supported sites list that all these sites are just using youtube-dl as a backend.

Diabeetus · Oct 22, 2018

awoo said:
Please stop posting these websites. It is fairly clear to me (I'd bet a pretty penny) from the supported sites list that all these sites are just using youtube-dl as a backend.

Ah shit, I didn't know that. I'll look into youtube-dl instead, sorry.

EdgyKid69 · Oct 22, 2018

Cuddly Pirate said:
If you want to archive videos, streamable.com may be useful. I'm not sure if it's a reliable site but I seem to have no problem with it so far. You don't need an account to upload and the site uses a cookie to keep track of what you upload from your specific browser.

the problem there is that auto-removing cookies - one of the most basic techniques to hamper tracking - will break that functionality

Uncle Warren · Oct 22, 2018

EdgyKid69 said:
the problem there is that auto-removing cookies - one of the most basic techniques to hamper tracking - will break that functionality

Well in that case just make sure you have the links.

OG 666 · Oct 23, 2018

Are there any good options for archiving tweets, aside from just capturing them one by one?

I really want to be able to archive a user’s entire feed, but it’s not feasible to scroll all the way to the bottom, and archive.is only saves the first page.

Diabeetus · Oct 23, 2018

Gengar said:
Are there any good options for archiving tweets, aside from just capturing them one by one?

I really want to be able to archive a user’s entire feed, but it’s not feasible to scroll all the way to the bottom, and archive.is only saves the first page.

Just use an archive site, bud. Like this one or this one.

OG 666 · Oct 23, 2018

Diabeetus said:
Just use an archive site, bud. Like this one or this one.

lol yes, I’m aware of these sites. Maybe I wasn’t clear.

I want to be able to archive several tweets at once, or ideally, a user’s entire feed. Like I said, archive.is only saves the first page and that isn’t particularly helpful for people who tweet 50+ times in a single day.

Diabeetus · Oct 23, 2018

Gengar said:
lol yes, I’m aware of these sites. Maybe I wasn’t clear.

I want to be able to archive several tweets at once, or ideally, a user’s entire feed. Like I said, archive.is only saves the first page and that isn’t particularly helpful for people who tweet 50+ times in a single day.

Oop, excuse me. I'm a bit handicapped.

Just use one of those add-ons for Chrome or Firefox that'll screenshot the entire webpage. I know that Firefox has something built-in that makes screenshotting individual posts really efficient.

dysentery · Oct 25, 2018

Gengar said:
Are there any good options for archiving tweets, aside from just capturing them one by one?

I really want to be able to archive a user’s entire feed, but it’s not feasible to scroll all the way to the bottom, and archive.is only saves the first page.

Apparently there's a limit built into Twitter's API that you can only save a certain thousand tweets per account per day, iirc.

It makes completely and comprehensively mass-archiving historical Twitter addict lolcows once they become noticed virtually impossible.

Diabeetus said:
Just use one of those add-ons for Chrome or Firefox that'll screenshot the entire webpage. I know that Firefox has something built-in that makes screenshotting individual posts really efficient.

That works well, although i'm assuming you want entire accounts archived, which isn't really possible with the current tool limitations Twitter gives you, apparently.

awoo · Oct 25, 2018

You may be able to get around the API limit by scraping (like archive.org does), but it will be significantly slower. The upside is that there is no viewing limit. As long as you aren't going crazy (limit yourself to something slow, like 5 parallel requests per second), I don't think Twitter will throttle you.
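
To give an idea of what "something slow" looks like in practice, here is a small throttle sketch. It is a simplification: it spaces out sequential calls rather than managing parallel requests, the class name is made up, and the 5 per second figure is only my guess at a safe rate, not a limit documented anywhere.

```python
import time


class Throttle:
    """Space calls evenly so that at most `rate` happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = max(self.clock(), self.next_allowed)
        self.next_allowed = now + self.interval


if __name__ == "__main__":
    throttle = Throttle(rate=5)
    for page in range(3):
        throttle.wait()
        # A real scraper would fetch one page here.
        print("fetching page", page)
```

Call wait() before every request and the loop never goes faster than the rate you picked, no matter how quickly each page comes back.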

Wärring Ornac · Oct 25, 2018

dysentery said:
Apparently there's a limit built into Twitter's API that you can only save a certain thousand tweets per account per day, iirc.

It makes completely and comprehensively mass-archiving historical Twitter addict lolcows once they become noticed virtually impossible.

That works well, although i'm assuming you want entire accounts archived, which isn't really possible with the current tool limitations Twitter gives you, apparently.

Wait, how does this work? If, theoretically, I archive 1000 tweets from one Twitter account with archive.is and I want to archive one more, does Twitter block the archive attempt?

If so, does this mean that if I archive 1000 tweets from one Twitter account with archive.is, then switch to FreezePage or the Wayback Machine to archive another tweet, it gets blocked too?

awoo · Oct 25, 2018

Wärring Ornac said:
Wait, how does this work? If, theoretically, I archive 1000 tweets from one Twitter account with archive.is and I want to archive one more, does Twitter block the archive attempt?

If so, does this mean that if I archive 1000 tweets from one Twitter account with archive.is, then switch to FreezePage or the Wayback Machine, it gets blocked too?

No, the limit is for using the API. Though if Twitter detects suspicious activity, it might throttle you or ask you to confirm you're not a robot.

You can also use multiple API accounts, though I'm pretty sure this is explicitly against the API ToS.

Wärring Ornac · Oct 25, 2018

awoo said:
No, the limit is for using the API. Though if Twitter detects suspicious activity, it might throttle you or ask you to confirm you're not a robot.

Oh ok, that might mean that switching is enough.

dysentery · Oct 25, 2018

Wärring Ornac said:
If so, does this mean that if I archive 1000 tweets from one Twitter account with archive.is, then switch to FreezePage or the Wayback Machine to archive another tweet, it gets blocked too?

If you really wanted to, you could archive a limitless number of tweets using sites like archive.org, archive.is or archive.today. As far as Twitter is concerned, they're nothing more than users viewing linked tweets. The problem is when you want to archive an account that's amassed more than several thousand tweets. You could save every link, but that's far too much effort.
