Archival Tools - How to archive anything.

Baraadmirer · Apr 4, 2022

Al-Abaascaal said:
That domain is down, use https://archive.ph/

Is it just me, or do some of the archive domains not render as hyperlinks (I think .is had this problem)? They disappear when I post a comment or preview them.

spoof · Apr 4, 2022

Spins Of Our Fathers said:
Looks like archive.md is down. Any viable alternatives?

archive.org

Al-Abaascaal said:
That domain is down, use https://archive.ph/

SSL_ERROR_NO_CYPHER_OVERLAP

Remove kebab · Apr 8, 2022

Is there a way to archive live streams with live chat using yt-dlp? I know I could use a screen recorder to get the chat or use a chat overlay and record full screen, but these mean you can't really do anything else while recording.

chiobu · Apr 10, 2022

Seems like archive.today has been down for a bit for everyone, here's a couple of alternatives I've been using:

Archive.org: https://web.archive.org/save

Ghostarchive: https://ghostarchive.org/

Any other suggestions? Sometimes these don't bypass paywalls the way archive.today is able to.
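For anyone scripting the Archive.org option, here is a minimal sketch of building a save link from the https://web.archive.org/save address above. It assumes the target page can simply be appended to the /save/ path; the function name wayback_save_url is made up for illustration, and nothing is sent over the network.

```python
from urllib.parse import quote

# Save endpoint taken from the link in the post above.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def wayback_save_url(page_url: str) -> str:
    """Build a save link for page_url (assumes the /save/<url> form)."""
    # Leave URL-structural characters and existing %-escapes alone,
    # but escape things like spaces that would break the link.
    return SAVE_ENDPOINT + quote(page_url, safe=":/?&=%#")


if __name__ == "__main__":
    print(wayback_save_url("https://example.com/some page"))
```

Opening the printed link in a browser (or fetching it with any HTTP client) is what would actually request the capture.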

Winter · Apr 21, 2022

Not sure if anyone has recommended it yet, but ShareX is a wonderful tool for drawing specific areas to screenshot, then adding blur, circles, and text, among other things, if you choose to after the screenshot is taken. Moreover, you can draw an area to record video of too. It's available on Steam and on their site.
100% would recommend.

Dogeee · Apr 21, 2022

Winter said:
Not sure if anyone has recommended it yet, but ShareX is a wonderful tool for drawing specific areas to screenshot, then adding blur, circles, and text, among other things, if you choose to after the screenshot is taken. Moreover, you can draw an area to record video of too. It's available on Steam and on their site.
100% would recommend.

Completely agree there. The only slightly annoying thing is that its scroll capture feature still doesn't work on Firefox.

REGENDarySumanai · Apr 29, 2022

Here's a Python module made to archive Twitter Spaces. It's made by weeaboos.

https://github.com/HoloArchivists/twspace-dl

5t3n0g0ph3r · May 4, 2022

Anybody know a better tool to use for archiving Facebook posts?
Archive.md (or its variants) can't seem to do so anymore.

chiobu · May 12, 2022

5t3n0g0ph3r said:
Anybody know a better tool to use for archiving Facebook posts?
Archive.md (or its variants) can't seem to do so anymore.

Ghostarchive works well for it since it uses a logged in account to archive

chiobu · May 20, 2022

The Awesome Archival list has a ton of resources

https://github.com/iipc/awesome-web-archiving/blob/master/README.md

AWESOME WEB ARCHIVING

Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ Web crawlers for automated capture due to the massive scale of the Web. Ever-evolving Web standards require continuous evolution of archiving tools to keep up with the changes in Web technologies to ensure reliable and meaningful capture and replay of archived web pages.

CONTENTS

Training/Documentation

Resources for Web Publishers

Tools & Software

Acquisition

Replay

Search & Discovery

Utilities

WARC I/O Libraries

Analysis

Quality Assurance

Curation

Community Resources

Other Awesome Lists

Blogs and Scholarship

Mailing Lists

Slack

Twitter

TRAINING/DOCUMENTATION

Introductions to web archiving concepts:

What is a web archive? - A video from the UK Web Archive YouTube Channel

Wikipedia's List of Web Archiving Initiatives

Glossary of Archive-It and Web Archiving Terms

The Web Archiving Lifecycle Model - The Web Archiving Lifecycle Model is an attempt to incorporate the technological and programmatic arms of web archiving into a framework that will be relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world.

Training materials: module for beginners (8 sessions)

UNT Web Archiving Course 2022

Continuing Education to Advance Web Archiving (CEDWARC)

The WARC Standard:

The warc-specifications community HTML version of the official specification and hub for new proposals.

The official ISO 28500 WARC specification homepage.

For researchers using web archives:

GLAM Workbench: Web Archives - See also this related blog post on 'Asking questions with web archives'.

Archives Unleashed Toolkit documentation

RESOURCES FOR WEB PUBLISHERS
These resources can help when working with individuals or organisations who publish on the web, and who want to make sure their site can be archived.

Stanford Libraries' Archivability pages

The Archive Ready tool, for estimating how likely a web page will be archived successfully.

TOOLS & SOFTWARE
This list of tools and software is intended to briefly describe some of the most important and widely-used tools related to web archiving. For more details, we recommend you refer to (and contribute to!) these excellent resources from other groups:

Comparison of web archiving software

Awesome Website Change Monitoring

ACQUISITION

22120 - A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse making it available for offline replay. (In Development)

ArchiveBox - A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly Bookmark Archiver). (In Development)

archivenow - A Python library to push web resources into on-demand web archives. (Stable)

ArchiveWeb.Page - A plugin for Chrome and other Chromium based browsers that lets you interactively archive web pages, replay them, and export them as WARC data. Also available as an Electron based desktop application.

Browsertrix Crawler - A Chrome based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container.

Brozzler - A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded urls and to extract links. (Stable)

Cairn - A npm package and CLI tool for saving webpages. (Stable)

Chronicler - Web browser with record and replay functionality. (In Development)

Crawl - A simple web crawler in Golang. (Stable)

crocoite - Crawl websites using headless Google Chrome/Chromium and save resources, static DOM snapshot and page screenshots to WARC files. (In Development)

F(b)arc - A commandline tool and Python library for archiving data from Facebook using the Graph API. (Stable)

freeze-dry - JavaScript library to turn page into static, self-contained HTML document; useful for browser extensions. (In Development)

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. (Stable)

Heritrix - An open source, extensible, web-scale, archival quality web crawler. (Stable)

Heritrix Walkthrough (In Development)

html2warc - A simple script to convert offline data into a single WARC file. (Stable)

HTTrack - An open source website copying utility. (Stable)

monolith - CLI tool to save a web page as a single HTML file. (Stable)

Obelisk - Go package and CLI tool for saving web page as single HTML file. (Stable)

SingleFile - Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file. (Stable)

SiteStory - A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. (Stable)

Social Feed Manager - Open source software that enables users to create social media collections from Twitter, Tumblr, Flickr, and Sina Weibo public APIs. (Stable)

Squidwarc - An open source, high-fidelity, page interacting archival crawler that uses Chrome or Chrome Headless directly. (In Development)

StormCrawler - A collection of resources for building low-latency, scalable web crawlers on Apache Storm. (Stable)

twarc - A command line tool and Python library for archiving Twitter JSON data. (Stable)

WAIL - A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages; Python, Electron. (Stable)

Warcprox - WARC-writing MITM HTTP/S proxy. (Stable)

WARCreate - A Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)

Warcworker - An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI. (Stable)

Wayback - A toolkit for snapshot webpage to Internet Archive, archive.today, IPFS and beyond. (Stable)

Waybackpy - Wayback Machine Save, CDX and availability API interface in Python and a command-line tool (Stable)

Web2Warc - An easy-to-use and highly customizable crawler that enables anyone to create their own little Web archives (WARC/CDX). (Stable)

Web Curator Tool - Open-source workflow management for selective web archiving. (Stable)

WebMemex - Browser extension for Firefox and Chrome which lets you archive web pages you visit. (In Development)

Webrecorder - Create high-fidelity, interactive recordings of any web site you browse. (Stable)

Wget - An open source file retrieval utility that, as of version 1.14, supports writing WARCs. (Stable)

Wget-lua - Wget with Lua extension. (Stable)

Wpull - A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler. (Stable)

REPLAY

InterPlanetary Wayback (ipwb) - Web Archive (WARC) indexing and replay using IPFS.

OpenWayback - The open source project aimed to develop Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user's browser. (Stable)

PyWb - A Python (2 and 3) implementation of web archival replay tools, sometimes also known as 'Wayback Machine'. (Stable)

Reconstructive - Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript).

ReplayWeb.Page - A browser-based, fully client-side replay engine for both local and remote WARC files.

warc2html - Converts WARC files to static HTML suitable for browsing offline or rehosting.

SEARCH & DISCOVERY

Mink - A Google Chrome extension for querying Memento aggregators while browsing and integrating live-archived web navigation. (Stable)

playback - A toolkit for searching archived webpages from Internet Archive, archive.today, Memento and beyond. (In Development)

SecurityTrails - Web based archive for WHOIS and DNS records. REST API available free of charge.

Tempas v1 - Temporal web archive search based on Delicious tags. (Stable)

Tempas v2 - Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., Obama@2005-2009 in Tempas). (Stable)

webarchive-discovery - WARC and ARC full-text indexing and discovery tools, with a number of associated tools capable of using the index shown below. (Stable)

Shine - A prototype web archives exploration UI, developed with researchers as part of the Big UK Domain Data for the Arts and Humanities project. (Stable)

SolrWayback - A backend Java and frontend Vue.js project with free-text search and a built-in playback engine. Requires that WARC files have been indexed with the warc-indexer. The web application also has a wide range of data visualization tools and data export tools that can be used on the whole web archive. The SolrWayback 4 Bundle release contains all the software and dependencies in an out-of-the-box solution that is easy to install.

Warclight - A Project Blacklight based Rails engine that supports the discovery of web archives held in the WARC and ARC formats. (In Development)

Wasp - A fully functional prototype of a personal web archive and search system. (In Development)

Other possible options for building a front-end are listed in the webarchive-discovery wiki.

UTILITIES

ArchiveTools - Collection of tools to extract and interact with WARC files (Python).

gowarcserver - BadgerDB-based capture index (CDX) and WARC record server, used to index and serve WARC files (Go).

har2warc - Convert HTTP Archive (HAR) -> Web Archive (WARC) format (Python).

httpreserve.info - Service to return the status of a web page or save it to the Internet Archive. Returns JSON via browser or command line via CURL using GET (Golang Package). (Stable)

HTTPreserve Workbench - Tool and API to describe the status of a web page encoded in a simple JSON output describing current status, and earliest and latest links on wayback.org. Save a web page to the Internet Archive. Audit lists of URIs and output a CSV with the data described above (Golang). (In Development)

httrack2warc - Convert HTTrack archives to WARC format (Java).

MementoMap - A Tool to Summarize Web Archive Holdings (Python). (In Development)

MemGator - A Memento Aggregator CLI and Server (Golang). (Stable)

node-cdxj - CDXJ file parser (Node.js). (Stable)

OutbackCDX - RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as backend for OpenWayback, PyWb and Heritrix. (Stable)

py-wasapi-client - Command line application to download crawls from WASAPI (Python). (Stable)

The Archive Browser - The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app).

The Unarchiver - Program to extract the contents of many archive formats, inclusive of WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app).

tikalinkextract - Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). (In Development)

wasapi-downloader - Java command line application to download crawls from WASAPI. (Stable)

WarcPartitioner - Partition (W)ARC Files by MIME Type and Year. (Stable)

webarchive-indexing - Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

wikiteam - Tools for downloading and preserving wikis. (Stable)

WARC I/O LIBRARIES

FastWARC - A high-performance WARC parsing library (Python).

HadoopConcatGz - A Splitable Hadoop InputFormat for Concatenated GZIP Files (and *.warc.gz). (Stable)

jwarc - Read and write WARC files with a typesafe API (Java).

Jwat - Libraries and tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)

node-warc - Parse WARC files or create WARC files using either Electron or chrome-remote-interface (Node.js). (Stable)

Unwarcit - Command line interface to unzip WARC and WACZ files (Python).

Warcat - Tool and library for handling Web ARChive (WARC) files (Python). (Stable)

warcio - Streaming WARC/ARC library for fast web archive IO (Python).

warctools - Library to work with ARC and WARC files (Python).

webarchive - Golang readers for ARC and WARC webarchive formats (Golang).

ANALYSIS

ArchiveSpark - An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction as well as derivation. (Stable)

Archives Unleashed Notebooks - Notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit. (Stable)

Archives Unleashed Toolkit - Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark. (Stable)

Tweet Archives Unleashed Toolkit - An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark. (In Development)

QUALITY ASSURANCE

Chrome Check My Links - Browser extension: a link checker with more options.

Chrome link checker - Browser extension: basic link checker.

Chrome link gopher - Browser extension: link harvester on a page.

Chrome Open Multiple URLs - Browser extension: opens multiple URLs and also extracts URLs from text.

Chrome Revolver - Browser extension: switches between browser tabs.

FlameShot - Screen capture and annotation on Ubuntu.

PlayOnLinux - For running Xenu and Notepad++ on Ubuntu.

PlayOnMac - For running Xenu and Notepad++ on macOS.

Windows Snipping Tool - Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut for taking partial screen capture).

WineBottler - For running Xenu and Notepad++ on macOS.

xDoTool - Click automation on Ubuntu.

Xenu - Desktop link checker for Windows.

CURATION

Zotero Robust Links Extension - A Zotero extension that submits to and reads from web archives. Source on GitHub. Supersedes leonkt/zotero-memento.

COMMUNITY RESOURCES
OTHER AWESOME LISTS

Web Archiving Community

Awesome Memento

The WARC Ecosystem

The Web Crawl section of COPTR

BLOGS AND SCHOLARSHIP

IIPC Blog

Web Archiving Roundtable - Unofficial blog of the Web Archiving Roundtable of the Society of American Archivists maintained by the members of the Web Archiving Roundtable.

The Web as History - An open-source book that provides a conceptual overview to web archiving research, as well as several case studies.

WS-DL Blog - The Web Science and Digital Libraries Research Group blogs about various Web archiving related topics, scholarly work, and academic trip reports.

DSHR's Blog - David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field.

UK Web Archive Blog

ducktales4gameboy · May 22, 2022

Does anyone have any leads on tools to dump forums with proper author info? The two I'm interested in specifically are phpBB and XenForo based, and every one I've looked at on GitHub has either been horrifically out of date or custom-tailored for a specific instance, with paths and such hard-coded in files.

duckbutter&toejamsandwich · Jun 8, 2022

Archive.ph and archive.org no longer archive Instagram. Does anyone know a workaround?

Edit: @BBJ_4_Ever found a way to archive by using https://dumpor.com/ and then archive.ph. Still looking for a more efficient way.

Frail Snail · Jun 12, 2022

Some quirks to watch for when archiving things from Steam.

Use a service like SteamID.io to transform vanity IDs into permanent Steam IDs and use that link when archiving, i.e. steamcommunity.com/profiles/$numeric_id. Those always stay the same and the redirect is visible on all archiving websites.
Archiving posts in profile comments and Steam group discussions that go back pages can be done by adding ?ctp=, for example steamcommunity.com/id/name/allcomments/?ctp=3 or steamcommunity.com/groups/fags/discussions/1/$id/?ctp=2.
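The two URL patterns above can be generated in bulk. This is only a rough sketch: the helper names and the numeric ID in the example are made up, and it assumes the ?ctp= page parameter works the same under /profiles/ as it does under /id/.

```python
from typing import List

STEAM_BASE = "https://steamcommunity.com"


def permanent_profile_url(numeric_id: str) -> str:
    """Profile link that does not change when the vanity name does."""
    return f"{STEAM_BASE}/profiles/{numeric_id}"


def comment_page_urls(profile_url: str, pages: int) -> List[str]:
    """One allcomments URL per page, using the ?ctp= page parameter."""
    base = profile_url.rstrip("/")
    return [f"{base}/allcomments/?ctp={n}" for n in range(1, pages + 1)]


if __name__ == "__main__":
    # Placeholder ID for illustration only; look up the real one first.
    profile = permanent_profile_url("76561198000000000")
    for url in comment_page_urls(profile, 3):
        print(url)
```

Each printed URL can then be fed to whichever archiving site you use, one page at a time.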

notafederalagent · Jun 18, 2022

garc is a fork of twarc that works with the Gab API (requires login). It's not as fully featured, but it is possible to grab account info, user posts, and user comments as JSON. You can run jq -r .url on the userposts and usercomments results and then pass the output on to archiving scripts.

JSON:

{
  "id": "31",
  "username": "a",
  "acct": "a",
  "display_name": "Andrew Torba ✝️",
  "locked": false,
  "bot": false,
  "created_at": "2016-08-10T06:02:25.000Z",
  "note": "<p>Saved servant soldier of Jesus Christ the King of Kings. Husband, Father, and the CEO of <a data-focusable=\"true\" role=\"link\" href=\"https://gab.com/gab\" class=\"mention\">@gab</a>. </p><p>Now faith is the substance of things hoped for, the evidence of things not seen. Hebrews 11:1</p>",
  "url": "https://gab.com/a",
  "avatar": "https://media.gab.com/system/accounts/avatars/000/000/031/original/86e4974436280f01.png",
  "avatar_static": "https://media.gab.com/system/accounts/avatars/000/000/031/original/86e4974436280f01.png",
  "avatar_small": "https://media.gab.com/cdn-cgi/image/width=92,fit=scale-down/system/accounts/avatars/000/000/031/original/86e4974436280f01.png",
  "avatar_static_small": "https://media.gab.com/cdn-cgi/image/width=92,fit=scale-down/system/accounts/avatars/000/000/031/original/86e4974436280f01.png",
  "header": "https://media.gab.com/system/accounts/headers/000/000/031/original/D2DFEEB6-ECA8-44A4-BE57-419AC9BEBD6C.jpeg",
  "header_static": "https://media.gab.com/system/accounts/headers/000/000/031/original/D2DFEEB6-ECA8-44A4-BE57-419AC9BEBD6C.jpeg",
  "is_spam": false,
  "followers_count": 3623431,
  "following_count": 2505,
  "statuses_count": 64807,
  "is_pro": true,
  "is_verified": true,
  "is_donor": true,
  "is_investor": false,
  "show_pro_life": true,
  "emojis": [],
  "fields": [
    {
      "name": "Who Is Andrew Torba?",
      "value": "<a href=\"http://andrewtorba.com\" rel=\"me nofollow noopener\" target=\"_blank\"><span aria-hidden=\"true\" class=\"invisible\">http://</span>andrewtorba.com<span aria-hidden=\"true\" class=\"invisible\"></span></a>",
      "verified_at": null
    },
    {
      "name": "Upgrade to GabPRO",
      "value": "<a href=\"https://pro.gab.com\" rel=\"me nofollow noopener\" target=\"_blank\"><span aria-hidden=\"true\" class=\"invisible\">https://</span>pro.gab.com<span aria-hidden=\"true\" class=\"invisible\"></span></a>",
      "verified_at": null
    },
    {
      "name": "Shop Gab Merch",
      "value": "<a href=\"https://shop.gab.com\" rel=\"me nofollow noopener\" target=\"_blank\"><span aria-hidden=\"true\" class=\"invisible\">https://</span>shop.gab.com<span aria-hidden=\"true\" class=\"invisible\"></span></a>",
      "verified_at": null
    },
    {
      "name": "Read Gab News",
      "value": "<a href=\"https://news.gab.com\" rel=\"me nofollow noopener\" target=\"_blank\"><span aria-hidden=\"true\" class=\"invisible\">https://</span>news.gab.com<span aria-hidden=\"true\" class=\"invisible\"></span></a>",
      "verified_at": null
    },
    {
      "name": "Watch Gab TV",
      "value": "<a href=\"https://tv.gab.com\" rel=\"me nofollow noopener\" target=\"_blank\"><span aria-hidden=\"true\" class=\"invisible\">https://</span>tv.gab.com<span aria-hidden=\"true\" class=\"invisible\"></span></a>",
      "verified_at": null
    }
  ]
}
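If jq isn't available, the same extraction can be roughly approximated in Python. This sketch assumes the results are stored one JSON object per line (as twarc does) and that each object has a top-level "url" field like the account record above; extract_urls is a made-up name. Unlike jq -r .url, it skips records without a URL instead of printing "null".

```python
import json
import sys
from typing import Iterable, Iterator


def extract_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield the top-level "url" value of each JSON line, if present."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        url = record.get("url")
        if url:
            yield url


if __name__ == "__main__":
    # Reads JSON lines from stdin and prints one URL per line.
    for url in extract_urls(sys.stdin):
        print(url)
```

Saved as a script, it would read the results file on standard input and print one URL per line, ready to pipe into an archiving script.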

the . · Jul 5, 2022

duckbutter&toejamsandwich said:
Archive.ph and archive.org no longer archive Instagram. Does anyone know a workaround?

Edit: @BBJ_4_Ever found a way to archive by using https://dumpor.com/ and then archive.ph. Still looking for a more efficient way.

Similar to your workaround, you can use a Bibliogram instance and archive that. Unfortunately archive.is does not support searching by an alternative instance so there is no easy way to find out if someone has already archived a post you wanted to look up.
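A rough sketch of the URL swap, for post links only. It assumes the Bibliogram instance serves posts under the same /p/<shortcode> path as Instagram; the function name and the bibliogram.example host are placeholders, not a real instance.

```python
from urllib.parse import urlsplit, urlunsplit

INSTAGRAM_HOSTS = {"instagram.com", "www.instagram.com"}


def to_bibliogram_post(instagram_url: str, instance_host: str) -> str:
    """Point an Instagram post URL at a Bibliogram instance instead."""
    parts = urlsplit(instagram_url)
    if parts.netloc.lower() not in INSTAGRAM_HOSTS:
        raise ValueError("not an Instagram URL")
    if not parts.path.startswith("/p/"):
        raise ValueError("only post URLs (/p/<shortcode>) are handled")
    # Drop the query string and fragment so the archived URL is stable.
    return urlunsplit(("https", instance_host, parts.path, "", ""))


if __name__ == "__main__":
    print(to_bibliogram_post(
        "https://www.instagram.com/p/ABC123/?igshid=xyz",
        "bibliogram.example",
    ))
```

Stripping the tracking parameters also makes it a little more likely that two people archiving the same post end up with the same URL.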

duckbutter&toejamsandwich · Jul 5, 2022

the . said:
Similar to your workaround, you can use a Bibliogram instance and archive that. Unfortunately archive.md does not support searching by an alternative instance so there is no easy way to find out if someone has already archived a post you wanted to look up.

This is actually much easier than using dumpor, as I can just put the url in instead of looking through dumpor. Thanks!

Dork Of Ages · Jul 5, 2022

This might have been mentioned before, but for archiving via archive.md of Twitter posts of a cow without having to go through every individual tweet of the moment, make sure you grab the URL for the "with replies" section (not just the Twitter URL for tweets only, so that is "https://twitter.com/twitterhandle/with_replies" instead of just "https://twitter.com/twitterhandle"). That gets all posts, replies and all media of the most recent tweets.

While you are at it, and to get a faster archival, archive through one of the Nitter instances listed here (archive.md) first and only later go to twitter.com. This helps a lot, because the twitter.com archive crawls through more of the page and is, as a result, slower.
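A small sketch for generating both links from a handle. The function name is made up, nitter.example stands in for whichever instance you pick, and it assumes the instance uses the same /with_replies path as twitter.com.

```python
def with_replies_url(handle: str, host: str = "twitter.com") -> str:
    """URL of the "with replies" view for a handle on the given host."""
    # Accept handles written with or without the leading @.
    return f"https://{host}/{handle.lstrip('@')}/with_replies"


if __name__ == "__main__":
    handle = "@twitterhandle"
    # Archive the Nitter copy first, then the slower twitter.com one.
    print(with_replies_url(handle, host="nitter.example"))
    print(with_replies_url(handle))
```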

Account · Jul 11, 2022

For those that ~~glow~~ are paranoid, archive.is has a Tor address. It loads much faster than the clearnet address.

JamusActimus · Jul 17, 2022

Not sure if it was touched on, but yt-dlp can bypass the age restriction when youtube-dl can't.

GitHub - yt-dlp/yt-dlp: A youtube-dl fork with additional features and fixes (github.com)

I was happy to find an easy to use alternative for age restricted videos.

Baraadmirer · Jul 25, 2022

Is YouTube-DLG still working for people? I tried saving a video to my computer but the program encountered an error.
