Most ridiculous problem I've solved in a while and I feel like puking it out somewhere.
I have a 2 TB NVMe drive in my new workstation (blazingly fast, a SATA SSD is like a USB stick in comparison) that I am quite happy with. Even though I had my doubts about the complexity, I formatted the drive as btrfs, mainly because of checksumming and snapshots and the wish to try something new, being aware that btrfs is quite controversial and has its problems. (That's what backups are for, and yes, I'm aware of zfs, but I just couldn't be arsed with something outside the kernel tree.) I copied my old system from my old SSD over to this drive, configured a new kernel for my new computer, and enjoyed both files and programs opening instantly, no trouble.
Then I noticed a bizarre problem I couldn't reliably reproduce: scrubbing the filesystem would sometimes find a massive number of corruption errors, with tons of files not matching their stored checksums. Comparing the files with my backups would then show me that the files were identical, so no actual corruption. Curiously, rebooting the system would also fix the errors and let a few scrubs come up clean until later ones found a massive number of errors again. Sometimes it was also completely random: two scrubs clean, one with a massive number of errors (>250,000), the next one clean again. Curious. RAM wasn't the problem. (I also have [working] ECC RAM.) A hardware problem with the drive was possible, but the health reports didn't show any problem; even the temperature was fine. It was baffling, and after reading some horror stories online I was inclined to blame btrfs and its tools, since no files were actually damaged, although that theory didn't sit well with me. Such a basic problem would've been noticed earlier, and not by me first. There had to be something else at work here.
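For anyone wanting to repeat the "compare against the backup" step, here is a minimal sketch of one way to do it. It is not necessarily how the comparison was done here. The function names and the two root paths are made up for illustration, and it assumes the backup mirrors the live directory layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(live_root: Path, backup_root: Path) -> dict[str, list[str]]:
    """Compare every regular file under live_root with its backup copy.

    Relative paths are sorted into four buckets:
      identical  - same SHA-256 on both sides
      different  - both exist, contents differ
      missing    - present on the live side, absent in the backup
      unreadable - reading either copy raised an OSError
    """
    report = {"identical": [], "different": [], "missing": [], "unreadable": []}
    for live_file in sorted(live_root.rglob("*")):
        # Skip directories, symlinks, sockets and other non-regular entries.
        if live_file.is_symlink() or not live_file.is_file():
            continue
        relative = live_file.relative_to(live_root)
        backup_file = backup_root / relative
        if not backup_file.is_file():
            report["missing"].append(str(relative))
            continue
        try:
            same = sha256_of(live_file) == sha256_of(backup_file)
        except OSError:
            report["unreadable"].append(str(relative))
            continue
        report["identical" if same else "different"].append(str(relative))
    return report
```

Something like `compare_trees(Path("/home"), Path("/mnt/backup/home"))` (both paths hypothetical) gives a quick answer to "is anything really damaged?". If the flagged files land in the `identical` bucket once the drive has cooled down, that points away from permanently corrupted data, which is what happened here.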
Even though the drive, according to its temperature sensor, topped out at 40-47 °C when operating at full tilt (like during a scrub), and although the drive's controller should slow things down when the drive is overheating, I started wondering and pointed a fan at the drive's heatsink. And indeed, the problems with the random corruption disappeared. I could run dozens of scrubs with no errors. Long story short: the heatsink was not properly attached to all ICs of the drive. Two of the four flash ICs had barely any contact with it. The controller chip (which probably also contains the temperature sensor) did have proper contact, so it could neither register nor act on things getting too hot for some of the flash memory. The fan blowing directly on the drive cooled things down enough that it didn't matter. In normal operation the drive never got hot enough to become unreliable, only during large operations with lots of sustained activity, like scrubbing the filesystem for a prolonged period of time. Rebooting the computer or pausing between scrubs simply allowed things to cool down somewhat, nothing else. With a filesystem like ext4 I would've never noticed this and would've just had an NVMe drive that's unreliable under specific circumstances, and suffered possible, completely random data corruption. It's always fun solving problems.
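If you want to watch the drive's temperature while a scrub is running, a rough sketch follows. It assumes a Linux kernel that exposes NVMe temperatures through the hwmon sysfs interface (devices whose `name` file reads `nvme`, values in millidegrees Celsius). The function name is made up for illustration. Keep the lesson of this post in mind: a sensor sitting in or near the controller can read perfectly fine while the flash chips cook, so a good-looking number rules nothing out.

```python
from pathlib import Path


def read_nvme_temperatures(
    hwmon_root: Path = Path("/sys/class/hwmon"),
) -> dict[str, float]:
    """Return {"hwmonN/label": degrees Celsius} for every NVMe hwmon sensor."""
    readings: dict[str, float] = {}
    if not hwmon_root.is_dir():
        return readings
    for device in sorted(hwmon_root.iterdir()):
        name_file = device / "name"
        # Only look at hwmon devices that identify themselves as NVMe.
        if not name_file.is_file() or name_file.read_text().strip() != "nvme":
            continue
        for input_file in sorted(device.glob("temp*_input")):
            label_file = device / input_file.name.replace("_input", "_label")
            if label_file.is_file():
                label = label_file.read_text().strip()
            else:
                # No label exposed: fall back to the raw sensor name, e.g. "temp2".
                label = input_file.name.removesuffix("_input")
            # hwmon reports millidegrees Celsius.
            readings[f"{device.name}/{label}"] = int(input_file.read_text()) / 1000.0
    return readings
```

Calling `read_nvme_temperatures()` every few seconds during a scrub and logging the result shows how many sensors the drive exposes at all. Many drives only report a composite value, which fits the theory above that the flash temperature simply was not being measured.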