a focus on functionality over reliability causes an existential crisis
This article is going to touch some nerves, but my hope is to generate some dialogue that results in resolution to the issues rather than hiding or otherwise exacerbating them.
The latest and greatest feature in ZFS is native encryption, where ZFS itself encrypts the data; this is in contrast to previous configurations with LUKS or ecryptfs, where encryption is performed outside ZFS.
There are a number of compelling reasons to use native encryption, especially the “raw” send -w parameter, which allows sending encrypted data blocks exactly as they were on-disk. This means the target server has no need to load any key or decrypt/encrypt data locally, allowing “blind” backups. The sender may know the data, but the receiver does not.
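As a rough illustration of what a blind backup looks like in practice, the sketch below only builds the command line for a raw send piped over SSH; it does not run anything. The dataset, snapshot, and host names are hypothetical placeholders, and the helper function is mine, not part of any ZFS tooling.

```python
import shlex


def raw_send_pipeline(snapshot, ssh_target, dest_dataset):
    """Build (but do not execute) a shell pipeline for a raw send.

    "zfs send -w" emits the blocks still encrypted, so the receiving
    host never needs the key. All names passed in are hypothetical.
    """
    send = ["zfs", "send", "-w", snapshot]
    recv = ["ssh", ssh_target, "zfs", "receive", dest_dataset]
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in (send, recv)
    )


# Hypothetical names, printed for inspection only.
print(raw_send_pipeline("tank/home@snap1", "backup-host", "vault/home"))
```

Printing the pipeline rather than executing it keeps the example safe to run on a machine without ZFS; review the output before pasting it into a shell.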
Of course, it is also an invasive code change that touches nearly every piece of the project.
I’d been using this feature on my desktop since around November 2017, and ran into several issues. This was to be expected; I had several copies of my data (I waste so much money on redundancy thanks to previous bug-related zpool failures) and the benefits of native encryption were greater than the drawbacks.
errata per aspera
A corruption bug was discovered in November 2017 that was then officially marked as errata 3 in the kernel module. Anyone importing a pool with affected datasets then had to recreate their dataset, which might not have been easy if you had used -O encryption=on during pool creation, and was a pain in the ass for anyone who had a substantial amount of data stored in an encrypted dataset.
My spacemaps were corrupted, leading to a double free. I blamed the hardware, but it’s a Xeon with ECC on an online UPS, and I hadn’t had any issues with spacemaps until then. I had to recreate the pool and restore. We never identified the actual issue, and the panic was never handled.
the (unofficial) fourth
Another corruption bug caused my blind backups to be unusable. I found this out when I was migrating from one pool to another - upon loading keys, I was unable to mount the dataset. It’s kind of scary to think what might have happened if this had been my only backup and I’d never loaded the key to attempt recovery!
This prevented my system from running, and though I was able to recreate my pool, I asked for it to be added as an official errata so that anyone else who ran into the problem would know what was wrong. The project management disagreed and thought the errata was unnecessary, as the bug had only existed for a short time window and I was probably the only one who had hit it. Time told a different story, with approximately a half dozen users coming along to report they’d hit the same Invalid Exchange error during boot.
It was a more confusing experience than it had to be, but the lesson here is: if you’re using development builds of ZFS, there is no guarantee of any kind.
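One way to avoid finding this out during a migration is to periodically test-restore a blind backup: load the key, mount the dataset (by hand, with zfs load-key and zfs mount), and then read every file so that each block has to be decrypted and authenticated. The sketch below covers only that last step. It is a minimal example, not an official verification tool, and the mountpoint you pass in is whatever path your restored dataset happens to use.

```python
import os


def find_unreadable_files(mountpoint, chunk_size=1 << 20):
    """Read every regular file under mountpoint and collect read failures.

    Returns a list of (path, errno) tuples. An empty list means every
    file could be read end to end.
    """
    failures = []

    def on_walk_error(exc):
        # Directories that cannot be listed are recorded too.
        failures.append((exc.filename, exc.errno))

    for dirpath, _dirnames, filenames in os.walk(mountpoint, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files such as FIFOs and sockets.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    while fh.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, exc.errno))
    return failures
```

A full read like this is slow on a large dataset, but it exercises the same path a real recovery would.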
Speaking of guarantees..
If you’re reading this article, I assume you know the history of ZFS and how it was known to be the safest filesystem around, thanks to its use of checksums to validate data contents.
In comparison to Ceph scrub, ZFS scrub is pretty fantastic; it actually compares data on-disk to the stored checksum, unlike Ceph, which generates a checksum on demand.
With ZFS encryption, half the checksum field is used for the encryption MAC. The MAC is used to authenticate the block, but zpool scrub knows nothing about these MACs and does nothing with them. Even if you load the key, scrub does not find encryption issues! Which brings me to the next problem..
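To make the gap concrete, here is a toy model of a block that carries both a truncated checksum and a truncated MAC. It is an assumption-laden illustration, not ZFS’s on-disk format: HMAC-SHA256 stands in for the real cipher’s MAC, and every function name here is made up for the example.

```python
import hashlib
import hmac


def write_block(ciphertext, mac_key):
    """Toy block: half the field is a checksum, half is a MAC."""
    checksum = hashlib.sha256(ciphertext).digest()[:16]
    mac = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()[:16]
    return {"data": ciphertext, "checksum": checksum, "mac": mac}


def scrub_ok(block):
    """Checksum-only verification; needs no key and ignores the MAC."""
    return hashlib.sha256(block["data"]).digest()[:16] == block["checksum"]


def authenticated_read_ok(block, mac_key):
    """What a keyed read would check: the MAC must match."""
    expected = hmac.new(mac_key, block["data"], hashlib.sha256).digest()[:16]
    return hmac.compare_digest(expected, block["mac"])


real_key = b"k" * 32
# Simulate a software bug that computes the MAC with the wrong key material.
buggy_block = write_block(b"opaque ciphertext bytes", b"x" * 32)

print(scrub_ok(buggy_block))                         # checksum still matches
print(authenticated_read_ok(buggy_block, real_key))  # MAC does not
```

In this model the block passes the checksum-only pass while failing the authenticated read, which is the shape of failure described above: the data is intact as far as the checksum can tell, yet unusable once the key is involved.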
the actual fourth errata
Since encryption was merged in 2017, the project has had three major corruption bugs, including the first errata added since the errata infrastructure landed in ZFS on Linux in 2014. Of course, encryption and all of its bugs haven’t made it to a release yet - only release candidates.
I opened one particular issue on 2 December 2018. I discovered it when I was attempting to rebuild my pool using the newly merged special allocation classes feature (which was nearly merged in a useless form until persistent users complained) to segregate my metadata onto NVMe.
I rebooted into the received rootfs on the new pool and hit a ton of random I/O errors and a kernel panic. Luckily, I still had the original copy of the data, and was able to un-fuck it using zfs send without any arguments so that a newly received pool no longer had I/O errors.
I tried to work with Tom Caputi to help him reproduce the issue, but he could not provoke the bug. I wanted to send him a copy of my pool, but it was a 500G SSD and dding / uploading the file would have taken forever. He was unable to diagnose the issue with remote SSH access to the system. All of the raw block / zdb output in the world was useless in helping him discover what was wrong.
I tried to get it resolved for quite some time. Although I’d worked around the problem locally by recreating the pool (again - noticing a pattern here?), it was only a matter of time before I hit it again. And then I did, after installing logrotate and rotating some logs several dozen GB in size for the first time.
So, I tried to get Tom onboard to help reproduce the issue, but he was MIA - at a conference for a week, in meetings all day.. Datto (his employer) has a tendency to waste most of their productive hours with useless meetings. I digress.
I asked Brian Behlendorf (ZFS on Linux project leader) if he could help resolve the issue - he suggested Tom could fix it, because he is most familiar with the encryption code. This is a response I’d received numerous times over the 3 months the bug hung in limbo - don’t worry, Tom will solve it. He understands the code.
I managed to upload a 127G sparse copy of my corrupt zpool to a server and tried to get someone to look at it.
On the OpenZFS Slack group, I offered to buy pizza for an entire organisation (US$150) if anyone could simply reproduce my issue using the provided image. Out of 200-something developers/users in the group, only one responded with any suggestions for diagnosis. Another suggested that $150 doesn’t really even approach the radar for any company that contributes to ZFS, as if I should be paying them far more to get attention on a corruption issue.
Brian told me they’re trying the best with the resources they have, and that’s fine - totally understandable. But I wondered, where do we draw a line when one person is supposed to be responsible for understanding a particular section of the code? When do we start getting others like Brian himself or Matthew Ahrens to actually look at the encryption code and see what’s going wrong here?
I was told that issues take time to reproduce, and they take time to resolve. Within a couple hours, Brian had already reproduced the issue when Tom could not (in >3 months, no less). That evening we had a fix.
This is fantastic, we’ve resolved the corruption issue, and for the first time since using native encryption, my blind backups are working properly and have zero I/O errors.
As mentioned earlier, zpool scrub didn’t catch any of this because it has no insight into encryption problems.
so it’s all resolved, right?
After this was resolved, I then hit another kernel panic that I’d previously reported and had resolved. Debugging this involved five days of back-and-forth with Tom and Brian, where they repeatedly added printk calls to the source code, getting more debug info until they had an idea how certain flags were being reset or otherwise ignored by the receive thread, causing raw receives to be (incorrectly) encrypted.
We resolved that one, and now I’m hitting another panic in the recv code - this time it is related to an unencrypted dataset.
I would have expected some kind of response by now but it appears that everyone is busy fixing the hastily-implemented TRIM code that was rammed into the source tree.
To the community, it looks like the project is steamrolling ahead to a 0.8 release that will be loaded with new features, and an unfortunate ratio of bugs to accompany them. For example, why on Earth did Brian import the TRIM code after three release candidates were published? The fourth RC is going to be the first one with TRIM, and the last RC before the major release. This is due to some burning desire to push 0.8 out as it’s “taking too long”.
Btrfs is a “bad word” in ZFS circles because of longstanding corruption concerns in parity data. At least we haven’t pushed this brokenness to stable releases - that is unlike this corruption bug which is in a stable release and has not had any attention given to it in a month.
And no, it’s not April Fools’ Day anymore. At least, not here.