Offline Long-Term Digital Archival - Archiving data for when the Internet cannot be depended on.


Jotch

Hello,
I am interested in archiving video that I personally enjoy. This includes every MATI stream recording I could find, as well as YouTube channels, movies and TV seasons that I find entertaining or nostalgic.
I am also interested in archiving books, games, music, AI models and other misc data.
While I am only interested in archiving data I have curated myself, I am also interested in the knowledge of "data hoarders".

I ask you for your knowledge, your stories, tips, or actual sharing of said archives!
For example, with video archival, how do you efficiently encode video? I use ffmpeg to encode videos as 720p with libx265 for video and libopus for audio.
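For reference, here is a minimal sketch of that kind of encode as a Python helper. It only builds the ffmpeg argument list and does not run anything. The CRF value, the audio bitrate, the function name and the file names are placeholder assumptions, not settings anyone in the thread recommended, so tune them for your own material.

```python
from pathlib import Path


def build_ffmpeg_cmd(src, dst, height=720, crf=26, audio_bitrate="96k"):
    """Return an ffmpeg argument list for an H.265 video + Opus audio encode.

    crf=26 and audio_bitrate="96k" are placeholder assumptions.
    """
    return [
        "ffmpeg",
        "-i", str(src),
        "-c:v", "libx265",
        "-crf", str(crf),
        # Scale to the target height; -2 keeps the aspect ratio with an even width.
        "-vf", f"scale=-2:{height}",
        "-c:a", "libopus",
        "-b:a", audio_bitrate,
        str(dst),
    ]


cmd = build_ffmpeg_cmd(Path("input.mp4"), Path("output.mkv"))
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The output here is Matroska (.mkv) simply because it accepts both codecs without fuss; that container choice is also an assumption.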
What do you use in terms of physical storage mediums? I have a few 'junk' HDDs, and BD-RWs that I stuff full of redundant copies of the archive, and use sha256 checksums on the archives to make sure nothing is corrupt.
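The checksum step can be sketched roughly like this in Python: build a manifest of SHA-256 hashes once, keep it with the archive, and re-check each copy against it later. The function names and the manifest layout (relative path mapped to hex digest) are my own assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large videos never have to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_corrupt(root, manifest):
    """Return relative paths that are missing or whose hash no longer matches."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

Run build_manifest on the master copy, then find_corrupt on each HDD or disc; any path it returns should be restored from one of the other copies.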

Sorry if the OP is sloppy, it's the first time I've tried to do a real thread. I will add to this/improve it as time goes on!

A clarification: this is generally not intended for getting the data through an apocalypse. Instead it is more for the case where, as the thread subtitle says, the Internet cannot be depended on. This includes a more fractured or censorious internet where whatever stuff you want is no longer accessible online.
 
x265 compresses well. 720p is kind of low. 1080p should be fine too. Opus is great. I would compress at 128k. 96k also works. If you use the opus encoder and not libopus in ffmpeg, you can set the frames to 40ms and it'll shrink your files down even smaller. You'd need higher bitrates for surround audio. Not like that would matter in a SHTF situation.
Opus is most likely the best audio format out there. Works great from spoken word to music.

For backups, the 3-2-1 rule is the gold standard: 3 copies, 2 different media, 1 offsite. That doesn't really apply to the stuff below, but for irreplaceable data it does.
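If you want to sanity-check a backup plan against 3-2-1, a rough sketch could look like the following. The BackupCopy class, its fields and the example labels are all hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    label: str     # hypothetical name for this copy
    medium: str    # e.g. "hdd", "bd-r", "cloud"
    offsite: bool  # stored somewhere other than where the original lives?


def meets_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 kinds of media, with >= 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


plan = [
    BackupCopy("desktop drive", "hdd", offsite=False),
    BackupCopy("disc binder", "bd-r", offsite=False),
    BackupCopy("drive at a relative's house", "hdd", offsite=True),
]
print(meets_3_2_1(plan))  # True
```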

Kiwix copies of Wikipedia and Project Gutenberg.

The downloads come out to 95 GB and 62 GB respectively.

The survivors library. Torrents here

Mostly PDFs; comes out to about 350 GB.

CD3WD.
CD3WD is a project that focuses on assisting in third world development by making technical documents and other relevant information easily available to all people.

Archive of wikispooks.
The Harvard 5ft shelf
Regarded by many as the most comprehensive anthology of all time, ‘The Harvard Classics’ was first published in 1909 under the supervision of the Harvard president Charles W. Eliot. An esteemed academic, Eliot had argued that the elements of a liberal education could be gained by spending 15 minutes a day reading from a collection of books that could fit on a five-foot shelf. The publisher P. F. Collier challenged Eliot to make good on this statement and ‘Dr. Eliot’s Five Foot Shelf’ was the result. Eight years later Eliot added a further 20 volumes as a sub-collection titled ‘The Harvard Classics Shelf of Fiction’, offering some of the greatest novels and short stories of world literature. The exhaustive anthology of ‘The Harvard Classics’ comprises every major literary figure, philosopher, religion, folklore and historical subject up to the twentieth century.

Some starting points, digitally.
 
A friend of mine made torrents out of my MATI archive, covering 2018-2023 and missing only the lost Life is Strange episodes. Let me know if I missed anything.
He only seeds it over I2P for whatever reason and I can't reliably seed anything, so sadly this is the best I can do.
Attached are the torrents in case you want them!
 


I've recently begun pirating and formatting (transcoding, subbing and fixing the metadata) many shows and movies I'd like to preserve so I can talk a little bit about that.
x265 compresses well. 720p is kind of low. 1080p should be fine too. Opus is great. I would compress at 128k. 96k also works. If you use the opus encoder and not libopus in ffmpeg, you can set the frames to 40ms and it'll shrink your files down even smaller. You'd need higher bitrates for surround audio. Not like that would matter in a SHTF situation.
Opus really is the endgame audio codec, beating any other codec over a wide range of bitrates and having very low latency. However, I've chosen to use AAC instead since it's a lot older and enjoys more support. For example, Windows Media Player can't play MP4 files with Opus audio.
1719973706169.png
Regarding video codecs, I've chosen to use AV1 as it's FOSS, enjoys wide support and is the second most efficient video codec behind VVC/h266 which nobody seems to care about. HEVC/h265 requires a paid decoder to be played on Windows and isn't well supported so no thank you.
 
What do you use in terms of physical storage mediums?
For "on-grid" data backups, I currently have two direct-attached storage devices. One is a QNAP TR-004 with four 8 TB HDDs set up in a RAID 5 config, which gives me 24 TB of net storage.
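The 24 TB figure follows from how RAID 5 spends one drive's worth of space on parity. A quick sketch of the arithmetic, assuming the enclosure's four bays are all filled with 8 TB drives (the function name is just for illustration):

```python
def raid5_usable_tb(drive_count, drive_size_tb):
    """Usable capacity of a RAID 5 array: one drive's worth goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_size_tb


print(raid5_usable_tb(4, 8))  # 24
```

The same formula shows the trade-off: the array survives one drive failure, but a second failure during the rebuild loses everything, so it is not a substitute for separate backups.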

The other is the Terramaster D8 Hybrid. I don't have all of the bays filled yet, but I'm running this as just individual drives. I'm not a huge fan of this one. If you want to use it for just individual drives / JBOD it's probably good to go, but it is definitely not meant for a RAID config. It has 4 HDD bays and 4 NVMe slots. The NVMe slots can't be run in RAID mode though and have slow read speeds, at least based on my use.

For "off-grid" backups, CDs (yes) are the best way to go. For anything vital that fits in under 700 MB, Verbatim CD-Rs, which you can buy in a 100-pack spindle for 20 bucks, will last decades if they are taken care of and not physically damaged. Given the low cost of CDs, you can make as many redundant copies as you want. You can even encrypt CDs.

CD drives are everywhere, and even if computers stopped being made tomorrow you would still be able to find something to read them.

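M-Discs can fit a lot more data, 25 GB on a single-layer Blu-ray M-Disc (larger multi-layer versions exist), but they require a Blu-ray writer to burn. DVD-format M-Discs can be read by most DVD drives, while the Blu-ray versions need a Blu-ray drive to read.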
 

Archiving Books

Calibre is a wonderful local digital library management program. You can store tons of books in a small amount of space, and the program lets you quickly convert, categorise, read, edit, and transfer them to an e-reader.

Free Sources

Anna's Archive
Largest selection of them all. Tons of rare books, non-English books, and studies. Very delayed downloads without a membership.
Purports to contain all of LibGen, Sci-Hub, & Z-lib.
Project Gutenberg
Public domain books, primarily old books with expired copyright.
Library Genesis
Lots of non-fiction, scientific publications, and studies. Also some popular fiction.
Sci-Hub
Hasn't had new papers added since 2020 because of legal battles. Use Anna's Archive SciDB instead.
IRCHighway #ebooks
Great selection of fiction books in a variety of formats. If you can't find a book anywhere, sometimes you'll find it here. Needs to be accessed using IRC. Instructions here.
Libby
Used by local libraries to provide digital borrowing for a variety of e-readers and formats. Includes audiobooks. Requires a library card to access specific libraries.
Provides files with DRM. For easiest removal, after checking out a book, go to Manage Loan > Read With > Other Ways To Read > ePub. This will download an acsm file, see the Adobe Digital Edition Libraries under DRM below for instructions on accessing it.
Internet Archive
Mostly PDF format, so not great for e-readers. Recently has added DRM to many check-out only books. Read on below to learn how to remove it.

DRM

Many books purchased from online e-book sellers come with access restrictions called DRM, and recently Internet Archive has started adding DRM to books checked out from its digital library. This means you won't have reliable long-term access to them if you can't log in to whatever site the DRM is tied to, as the contents of the book are often encrypted and the decryption key is tied to either an authorized e-reader or an online account. DRM can be removed easily, though, using the DeDRM_tools plugin for Calibre. An archive of the plugin is attached to this post in case it is ever taken down. Download and unzip the file, then install the DeDRM plugin by going to Preferences > Plugins > Load plugin from file and selecting the DeDRM_plugin.zip file stored within the DeDRM_tools_*.*.*.zip file.
To set up the DeDRM plugin, in Calibre go to Preferences > Plugins > File Type and double-click on DeDRM. Each source of books needs its own information added to the plugin. For example, removing DRM from books bought on an Amazon account requires different information than removing it from books borrowed through an Internet Archive account. How you do this depends heavily on the source of the DRM, so you'll need to research the specific source if it isn't covered below.
For Amazon, I'll try to write a guide in the future but from what I read the method changed recently and since I don't have any recent Amazon e-book purchases I want to wait to make sure I understand how to do it correctly. If anyone else has instructions, let me know and I'll add them here. See https://github.com/noDRM/DeDRM_tools/blob/master/FAQs.md for a starting point, you'll need to add the KFX Input plugin for KFX format downloads.
On Internet Archive, you can either borrow a book for 14 days, which can only be done if no one else is doing that, or borrow it for 1 hour, which can be concurrent to other 1 hour or 14 day borrows. Typically with a 1 hour borrow, you can only access the book in the web viewer, while the 14 day borrow lets you download it (with DRM, of course). Note that Internet Archive only provides PDFs, which aren't the best for e-readers.
Either way, start by adding the DeACSM plugin to Calibre, so you don't have to install the Adobe Digital Editions DRM software. To do this, open Calibre, go to Preferences > Plugins > Get new plugins and search for DeACSM then hit Install.

1 Hour Borrow

First, borrow the book you want for an hour. Now, find the identifier for the book, which is the part of the URL immediately following "https://archive.org/details/". Now, replace INSERT_IDENTIFIER with the identifier in the following URL, and go to that page: "https://archive.org/services/loans/...ifier=INSERT_IDENTIFIER&format=pdf&redirect=1". This will download an acsm file, after which you can proceed to the next section.
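The "find the identifier" step can be sketched in Python like this. It only pulls the identifier out of a details URL; it does not build the loan URL, since that URL is truncated in the post above. The function name and the example identifier are made up for illustration.

```python
from urllib.parse import urlparse


def archive_identifier(details_url):
    """Return the item identifier from an archive.org /details/ URL."""
    parts = [p for p in urlparse(details_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an archive.org details URL: {details_url!r}")
    # The identifier is the path segment right after "details".
    return parts[1]


# "example-item" is a hypothetical identifier, not a real book.
print(archive_identifier("https://archive.org/details/example-item"))
# example-item
```

Taking only the segment right after "details" means trailing viewer parts such as page or mode segments in the URL are ignored.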

14 Day Borrow

Check the book out for 14 days, then simply click the Download PDF link on Internet Archive and proceed once the acsm file is downloaded.

Finally, open the downloaded acsm file directly in the main Calibre program, which will automatically download and remove the DRM from the PDF then add it to your library.
There is also another method using the Adobe Digital Editions software if this does not work, see here.
Many local library branches provide digital access to some books using Adobe Digital Editions. If an online library provides an acsm file, install the DeACSM plugin in Calibre, then simply open the acsm file within the program. For further details, read the Internet Archive section.

Once you have set up DeDRM to remove DRM from a particular source, simply add the DRM'd e-book file to the Calibre library by dragging it onto the window, and the plugin will automatically remove the DRM. To verify that the DRM was successfully removed, convert the file to another format, such as PDF, EPUB, or MOBI. To do this, right click on the added book and select Convert Books > Convert Individually, then select a format different from its current format and convert it. If you can open the book in the output format, the DRM was removed. If Calibre gives you an error stating "Cannot convert Book", it was not removed. You can delete the converted copy of the book after that, as it is only there to check that the DRM was removed. In the future, when importing books from the same source, there isn't really a need to do the conversion check.
 

Most Blu-ray discs today are literally fucking metal inside. They all should last decades, unlike dye-based cdr and dvdr. Granted, Blu-ray drives are less common than either predecessor, but the durability is hopefully worth the inconvenience.
I didn't know that we can expect greater longevity out of Blu-ray. Too bad the entertainment industry is killing optical discs in favor of streaming and all-digital game distribution. The technology is there for consoomers to have a cheap 1 TB or greater disc format, but all signs point to it not happening, even for 8K resolution.

I don't even care about "buy physical media!" per se since it's all data (you can pirate and store a movie for less than a genuine DVD costs in the Walmart bargain bin). The important part is that there's no cheap new consumer format coming for backup purposes. We went from ~700 MB to 4.7 GB to 25-50 GB and now it's all over. There are BDXL blanks up to 100-128 GB but they look absurdly expensive and are a small jump after all this time. Optical drives for PCs will also become more expensive, and Blu-ray drives shipping in pre-built PCs are made harder to find because of Intel dropping SGX.

I think a good option might be to get refurbished enterprise-grade HDDs which are being sold on ebay. As long as you have good backup practices. You also might want to avoid drives with shingled magnetic recording (SMR).
 
Regarding video codecs, I've chosen to use AV1 as it's FOSS, enjoys wide support and is the second most efficient video codec behind VVC/h266 which nobody seems to care about. HEVC/h265 requires a paid decoder to be played on Windows and isn't well supported so no thank you.
I've only seen AV1 used for youtube vids. I don't think there's any piracy groups encoding movies to AV1. Yet. But more power to you.

The most compatible format is x264. Doesn't compress well, but if you do it right, you can play it on the PS3 / Xbox 360. Idk about windows and x265, but mpv and vlc are free and should just play it out of the box on windows.

I think a good option might be to get refurbished enterprise-grade HDDs which are being sold on ebay. As long as you have good backup practices. You also might want to avoid drives with shingled magnetic recording (SMR).
I've done that just fine. I've gotten them on eBay, ~150-200 for a 16 TB HDD. You can run a tool called badblocks, at least on Linux, and it'll write several patterns to the drive and read them back. It takes like a week or two on a 16 TB drive, but it's worth it just to make sure you don't have a dud drive. It's faster on smaller-capacity drives.
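A back-of-the-envelope estimate shows why the test takes that long. In its destructive write mode, badblocks writes four patterns and reads each one back, so the whole drive is traversed eight times. The 200 MB/s average throughput below is purely an assumption; real drives are slower on the inner tracks, which pushes the total up.

```python
def badblocks_days(capacity_tb, throughput_mb_s=200, patterns=4):
    """Rough duration of a destructive write-mode test, in days.

    Each pattern is written across the whole drive and then read back,
    so the drive is traversed 2 * patterns times.
    throughput_mb_s=200 is an assumed average, not a measured figure.
    """
    capacity_bytes = capacity_tb * 1e12
    seconds_per_pass = capacity_bytes / (throughput_mb_s * 1e6)
    return 2 * patterns * seconds_per_pass / 86400


print(round(badblocks_days(16), 1))  # 7.4
```

At a lower assumed average speed the estimate grows proportionally, which lines up with the "week or two" experience.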

The script I've used, recommended by perfectmediaserver.com
 
I've been using x265 for most of the video content I personally archive. Compresses well without a massive drop in quality. If anyone has ffmpeg scripts for transcoding video/audio files to optimized sizes for the formats mentioned ITT, please share.

My current quickfire method is this:
Code:
ffmpeg -i vid.mp4 -c:v libx265 -preset slow -vf scale=-2:480 -crf 30 -c:a copy compressed-vid.mp4
Basic converter for .mp4 inputs to a 480p x265 output. Does nothing to the audio though.
 
I don't even care about "buy physical media!" per se since it's all data (you can pirate and store a movie for less than a genuine DVD costs in the Walmart bargain bin). The important part is that there's no cheap new consumer format coming for backup purposes. We went from ~700 MB to 4.7 GB to 25-50 GB and now it's all over. There are BDXL blanks up to 100-128 GB but they look absurdly expensive and are a small jump after all this time. Optical drives for PCs will also become more expensive, and Blu-ray drives shipping in pre-built PCs are made harder to find because of Intel dropping SGX.
SD cards should be noted, which are very quickly increasing in size, lowering in cost, and increasing in versatility. Their biggest downside is that, like SSDs, they are electrical bits, not magnetic like HDDs or physical like optical media, meaning they are prone to corruption when not powered for extended periods. While redundancy mitigates this, I too would sooner bulk buy HDDs at $15/TB unless I needed that data to be extremely portable and accessible (aka for SneakerNet).
 
SD cards should be noted, which are very quickly increasing in size, lowering in cost, and increasing in versatility. Their biggest downside is that, like SSDs, they are electrical bits, not magnetic like HDDs or physical like optical media, meaning they are prone to corruption when not powered for extended periods. While redundancy mitigates this, I too would sooner bulk buy HDDs at $15/TB unless I needed that data to be extremely portable and accessible (aka for SneakerNet).
You answered yourself there. It's amazing technology now allowing up to 2 TB in the size of a pinky fingernail but it has problems.

I am mad about optical discs, see above.

I am mad about HDDs. Great when they work, but when they fail, they fail hard. Improvements in capacity/dollar have been slow, especially since the Thailand floods. The volume of HDDs shipped has steadily declined, as the industry has pivoted away from consumers and to datacenters. The decline may have finally stopped recently because of the AI bubble and the need to store lots of training data.

I am mad about NAND memory. There's the unpowered data retention issue, but speeds and endurance are getting worse as the industry moves to QLC, PLC, and beyond. Slow speeds are hidden behind pseudo-SLC caching or DRAM, if the drive maker bothers to give you any. A more recent annoyance is that SSD prices spiked by about 2x from their low point. You can expect the same cyclical price volatility seen in the DRAM market.

We could really use the commercialization of a new storage technology or two that can replace all of these. On the bulk/archival side, nanoetched quartz discs are a possibility that could store hundreds of terabytes "forever" but aren't going to be rewritable. To replace NAND and ideally even DRAM, we could use a new dense non-volatile memory ("universal memory" if it can take on the role of DRAM). If that has the properties needed for archival storage too, even better. These have been worked on for decades but have remained vaporware.

More on topic, I keep microSD cards, card readers, and other small electronics in metal tins (that are inside bigger metal tins, all lined with cardboard) to act as Faraday shields to give some protection against an EMP, solar flare, etc. Plus it helps keep things organized better than chucking them in a desk drawer.

Opus really is the endgame audio codec, beating any other codec over a wide range of bitrates and having very low latency. However, I've chosen to use AAC instead since it's a lot older and enjoys more support. For example, Windows Media Player can't play MP4 files with Opus audio.
FaceMeta has beaten Opus with "MLow" for low bitrate audio voice calls. If that's possible, I could see a new codec coming in the future and beating Opus across all bitrates and types of content. It will probably be designed with or utilizing "AI".

There's no question that Opus is very good though.

What's the point of making sure your audio/video plays in Windows Media Player? Is it to ensure you can boot up an old surviving computer in the apocalypse and get it to work with the content you've stored?
 
You answered yourself there.
I wasn’t clear enough, it was meant to add on to the discussion of the current space of consumer storage solutions rather than ask for how it fares. Fully agree with all of your gripes, and I’ll even add on that I’m mad we don’t have caddies for disks anymore. It makes perfect sense, especially as bit density increases, and we simply ignore it.

These have been worked on for decades but have remained vaporware.
I want my Superman memory crystals and DNA storage mediums, dammit!

More on topic, I keep microSD cards, card readers, and other small electronics in metal tins (that are inside bigger metal tins, all lined with cardboard) to act as Faraday shields to give some protection against an EMP, solar flare, etc. Plus it helps keep things organized better than chucking them in a desk drawer.
I do similar with 3D printed containers. Helps me keep organized as I can store the cards by size. I store all MicroSD cards inside of adapters for added compatibility.

What's the point of making sure your audio/video plays in Windows Media Player? Is it to ensure you can boot up an old surviving computer in the apocalypse and get it to work with the content you've stored?
More than just in an apocalypse, personally any media for a bugout bag I would want to be as universal as possible.
 
What's the point of making sure your audio/video plays in Windows Media Player? Is it to ensure you can boot up an old surviving computer in the apocalypse and get it to work with the content you've stored?
It's my rule of thumb for compatibility. If it can be played in WMP, chances are I can play it on any modern-ish device with no tinkering and actually enjoy what I've saved. For purely archival purposes, lossless media or very high quality encodes are much preferred because you can transcode from the source later without suffering iterative transcoding losses. AV1 is NOT suited for this role as it was designed as a lossy codec first and foremost, so a lossless AV1 encode can take up even more space than the source video.
I've only seen AV1 used for youtube vids. I don't think there's any piracy groups encoding movies to AV1. Yet. But more power to you.

The most compatible format is x264. Doesn't compress well, but if you do it right, you can play it on the PS3 / Xbox 360. Idk about windows and x265, but mpv and vlc are free and should just play it out of the box on windows.
Yes, of course VLC plays it, but not everybody has VLC, especially on a cellphone. Moreover, AV1 achieves roughly 20% better compression compared to h265.
 
I am mad about NAND memory. There's the unpowered data retention issue, but speeds and endurance are getting worse as the industry moves to QLC, PLC, and beyond. Slow speeds are hidden behind pseudo-SLC caching or DRAM, if the drive maker bothers to give you any. A more recent annoyance is that SSD prices spiked by about 2x from their low point. You can expect the same cyclical price volatility seen in the DRAM market.

We could really use the commercialization of a new storage technology or two that can replace all of these. On the bulk/archival side, nanoetched quartz discs are a possibility that could store hundreds of terabytes "forever" but aren't going to be rewritable. To replace NAND and ideally even DRAM, we could use a new dense non-volatile memory ("universal memory" if it can take on the role of DRAM). If that has the properties needed for archival storage too, even better. These have been worked on for decades but have remained vaporware.
I am building a machine for AI/DL right now and was astounded how many motherboards and cases featured 2-5 slots for m.2 NAND memory, but maybe 2 slots SATA, 2 SSD. A lot of people will tell you to "just get a NAS already"; I already have a NAS, I want copies of data appearing in multiple places at once. You can buy 8TB of high-quality SATA right now for the same as 2TB of NAND M.2, and it's so weird to me that people just keep thinking NAND = better. One is fast, and good for your OS install drive, but you don't need to archive large files on a fucking m.2 drive. It's nuts.

I have ancient IDE and ATA drives that I can pull out of long-term storage to check on, and they do just fine (though one is almost 20 years old and is starting to get errors). OP, if you can, look into getting an external enclosure for hard drives, that way you can just dump/image the storage you need and then store them in a cool, dark place safe away from any danger. I've done this for years, and it's kept my disks in pretty good shape; if you intend to use the drive regularly, install it like normal. An external enclosure rarely provides the ventilation and cooling a drive needs for long-term stability.

As a lifelong data hoarder, I'd say that having a system to properly sort your collection is key. It's very easy to end up with multiple copies of stuff in your collection, leading to bloat. Create categories (Movies<Horror<80s<A-M and the like, keeps it tidy), put everything in the correct folder, and conduct a file audit every quarter to see what's in there. Don't depend on file checkers and checksums to tell you if you have dupes or not because you might have forgotten that you had something and downloaded it from another source. I've done this myself many times.
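To make the category idea concrete, here is a small Python sketch that computes where a file would go in a Movies/Horror/80s/A-M style tree. The N-Z bucket, the function name and the example title are my assumptions; the post only spells out the A-M half.

```python
from pathlib import PurePosixPath


def library_path(media_type, genre, decade, title):
    """Build a Type/Genre/Decade/bucket/title path for a file.

    Titles starting with A through M go in "A-M"; everything else
    (N through Z, digits, symbols) lands in "N-Z", which is an assumed
    second bucket.
    """
    first = title.strip()[:1].upper()
    bucket = "A-M" if "A" <= first <= "M" else "N-Z"
    return PurePosixPath(media_type, genre, decade, bucket, title)


print(library_path("Movies", "Horror", "80s", "The Thing.mkv"))
# Movies/Horror/80s/N-Z/The Thing.mkv
```

Because the destination depends only on the title and categories, two downloads of the same film from different sources end up side by side in the same folder, which makes the manual audit much quicker.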

If you have stuff in your collection that you just want to have but aren't going to use much (I have a large comic/visual novel collection, for example, but I will not read these much, so I have them saved on one detachable drive and a large HDD that is stored away), you can move these off your computer. Label your physical media clearly! Yes, you can always plug it in to check it, but once you start having 30 different hard drives, especially backups for backups, you're going to start getting tired of "Is it on the Red one? Or the Yellow one?" real damn quick.

Data hoarding is a fun pastime, but it can be a serious one, too. It'll make you ask yourself, is this really worth keeping? Some people go nuts with it, but it can be so much fun rediscovering old gems when you have to go back through (especially when you've got a failing drive and you have to haul ass to save whatever's on it, lol.) If you plan to access all your content regularly (do make backups for your backups and store them properly), get a full sized tower. I really recommend cases that have a front drivebay design like this:

1720063611196.png


So you can mount as many as you like in there, without worrying about airflow. It doesn't need to be THIS big, btw, you can run your NAS/hoard on an ATX midtower, just make sure you get ample drive bays that are accessible!

Also well played on archiving the AI models, we're in a golden age right now and soon all that shit is going to be regulated. Grab it while you can!
 
It doesn't need to be THIS big, btw, you can run your NAS/hoard on an ATX midtower, just make sure you get ample drive bays that are accessible!
What's with the tiny case?
5191_3.jpeg
Although, I have the 20 drive version, lets you slap a pair of 2.5" drives on top.
I have a main array in a 12 drive case, all identical drives, RAID6, etc.
The giant box is the backup server. All random drives with mergerfs and snapraid. As the primary array gets upgraded then drives migrate to the backup server, and smaller drives go away.

Backup-backups are 2 drives in a pelican case that gets stored at my storage unit. And swapped with a second set of 2 drives periodically. That's just my content, not downloaded stuff.
 