Wendell of Level1Techs is too nice to really "tear Linus a new one", but he does have the experience. He did a talk with Allan Jude, a famous FreeBSD developer who works a lot with ZFS, and while neither one says it outright, they still make it abundantly clear how much of an idiot Linus is, how inadequate his setup was, and how much work it was to recover data Linus didn't even care about.
Many thanks for the link, this is great. Hilarious and informative.
Time stamps:
-0:20 Pool was mostly recovered, but it needed some outside help from one of the ZFS contributors to figure out all the weirdness
-2:00 A tool called ZDB, which is basically ZFS compiled into a utility, is a great way to recover data from a pool because it runs outside the kernel and can force the ZFS code to operate in strange ways that optimize recovery.
-3:00 LTT's pool was set up as a bunch of RAID-Z2 arrays. They hadn't run ZFS scrubs in years, basically since startup. Somehow only two drives had mechanically failed.
-9:30 The pool was so fucked, including backplane issues and drive failure, that it was the perfect dataset to test edge case recovery scenarios.
-14:37 Wendell had to unfuck the pool from LTT's recovery attempts by pushing the rollback feature as hard as he could. See pedo stache not getting that he fucked up and learning to fuck off.
-21:32 Wendell went through the data with a hex editor to find identical blocks for recovery on the messed up drives. Two of the drives were mechanical failures. He also asks people to fucking back things up and run scrubs.
-22:20 Linus stuffing every available slot full for the sake of muh petabyte or some such made everyone's job harder. With no hot/cold spare and no open slot, you can't copy a troublesome drive onto a fresh one while it's still in the array; you have to swap the bad drive out and rebuild, and statistically more than one drive will be shit at a time.
-29:05 They recovered 99.9% of Linus's data
-30:10 Backplane failure was random, Wendell couldn't find a pattern. Some discussion of SAS vs SATA
-32:00 Discussion of ZFS and redundancy. They pooh-pooh hardware level solutions a bit since they have far fewer capabilities.
-36:10 Full grey beard moment. Windows 95 and 98 would display a drive as faulty after a single error. Because of this, HDD manufacturers would hide errors until they started piling up.
-38:12 List of things to do when an array has failed: stop poking it, image the array so you can work on it offline, and call for a pro because you will fuck it up more. This was in the context of Linus's problems. lmao
-39:40 The array went down so they just set it to scrub, where it just sat for a week. This caused the array to go read only because they slammed the thing with a pile of small operations. Retards.
-43:40 Future of ZFS: some new block copy features which reduce file copy overhead, plus discussion of ZFS on NVMe
-50:20 "Not a catastrophic situation despite less than ideal inputs" Read: LTT is a bunch of retards with no understanding of how to set up an array. "Don't be like Linus and ignore your pool" lmao
This was a great reminder to check my server and make sure scrubs and email alerts were still working.
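If anyone wants to automate the "has this pool actually been scrubbed lately" check, here's a rough Python sketch. It assumes the `scan:` line that recent OpenZFS prints in `zpool status` after a finished scrub (the wording differs between versions, so check yours), and the function names and the 35 day threshold are just things I made up for the example.

```python
import re
from datetime import datetime, timedelta

# Assumed format of a finished-scrub line in `zpool status` output, e.g.
#   scan: scrub repaired 0B in 02:13:45 with 0 errors on Sun Mar 10 02:37:46 2024
# Other ZFS versions may word this differently.
SCRUB_DONE = re.compile(
    r"scan:\s+scrub repaired .*? on "
    r"(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)

def last_scrub_time(status_text):
    """Return the completion time of the last scrub, or None if none is reported."""
    match = SCRUB_DONE.search(status_text)
    if match is None:
        return None
    # Single-digit days are padded with an extra space; normalize before parsing.
    stamp = " ".join(match.group(1).split())
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")

def scrub_is_stale(status_text, now, max_age_days=35):
    """True if no finished scrub is reported or the last one is older than max_age_days."""
    last = last_scrub_time(status_text)
    return last is None or now - last > timedelta(days=max_age_days)
```

To use it for real you'd feed it the text from `zpool status` (for example via `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`) and send yourself an email when `scrub_is_stale` comes back true. It only looks at the first pool in the output, so split the text per pool if you have more than one.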