Because:
> When dedup is enabled [...] every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.
But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
A better design would have been to split every node that has block pointers into two sections: one containing only logical block pointers, all of whose contents get hashed into the tree, and one containing only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require rewriting blocks that are not part of the Merkle hash tree.
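To make the split concrete, here's a rough sketch of what such a node layout might look like; the struct names and field sizes are invented for illustration and are not actual ZFS on-disk structures:

```c
#include <stdint.h>

/* Invented layout, not real ZFS code: the point is just which half
 * feeds the Merkle tree and which half doesn't. */
typedef struct logical_bp {
    uint64_t length;        /* logical size of the pointed-to block   */
    uint8_t  checksum[32];  /* hash of the data; covered by the tree  */
} logical_bp_t;

typedef struct physical_loc {
    uint64_t vdev;          /* where the data currently lives; this   */
    uint64_t offset;        /* half is NOT hashed, so it can change   */
} physical_loc_t;

typedef struct indirect_node {
    logical_bp_t   logical[128];   /* hashed into the Merkle tree     */
    physical_loc_t physical[128];  /* cache-like, freely rewritable   */
} indirect_node_t;
```

Because only the `logical` half feeds the tree, a background pass could rewrite the `physical` half in place without rippling new checksums all the way up.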
But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.
Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.
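As a toy illustration of that fallback (nothing here is real ZFS code; the "disk", the checksum, and the dedup table are stand-ins invented for the sketch):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 8
static char disk[4][BLOCK];                  /* pretend pool, 4 blocks */

static uint32_t toy_checksum(const char *b)  /* stand-in for SHA-256 */
{
    uint32_t h = 0;
    for (int i = 0; i < BLOCK; i++) h = h * 31 + (unsigned char)b[i];
    return h;
}

typedef struct { uint32_t checksum; int physical; } bp_t;         /* block pointer   */
typedef struct { uint32_t checksum; int physical; } ddt_entry_t;  /* dedup table row */

static ddt_entry_t ddt[4];
static int ddt_len;

/* Read through a block pointer; on checksum mismatch, fall back to the
 * dedup table, keyed by the checksum already stored in the pointer. */
static int read_with_ddt_fallback(bp_t bp, char *out)
{
    memcpy(out, disk[bp.physical], BLOCK);
    if (toy_checksum(out) == bp.checksum)
        return 0;                             /* normal path: one read */

    for (int i = 0; i < ddt_len; i++)         /* mismatch: one wasted read */
        if (ddt[i].checksum == bp.checksum) {
            memcpy(out, disk[ddt[i].physical], BLOCK);
            return 0;                         /* found it via the DDT */
        }
    return -1;                                /* genuinely corrupt */
}

int main(void)
{
    /* Write a block at slot 0 and take a pointer to it. */
    memcpy(disk[0], "hello!!", BLOCK);
    bp_t bp = { toy_checksum(disk[0]), 0 };

    /* A later dedup pass moves the data to slot 2 and records that in the
     * dedup table -- but the stale block pointer still says slot 0. */
    memcpy(disk[2], disk[0], BLOCK);
    memset(disk[0], 0, BLOCK);
    ddt[ddt_len++] = (ddt_entry_t){ bp.checksum, 2 };

    char buf[BLOCK];
    printf("read %s: %.8s\n",
           read_with_ddt_fallback(bp, buf) == 0 ? "ok" : "failed", buf);
    return 0;
}
```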
This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM- and I/O-hungry, but you can schedule and throttle the "groveler".
I have had some data-eating corruption from bugs in the Windows Server 2012 R2 timeframe.
Like jdupes or duperemove.
I sent PRs to both the ZFS folks and the duperemove folks to support the syscalls needed.
I actually have to go follow up on the ZFS one; it took a while to review and I realized I completely forgot to finish it up.
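For context, duperemove does its block sharing through the Linux FIDEDUPERANGE ioctl, where the kernel verifies the ranges really are identical before sharing extents. A minimal sketch of that call (the single whole-file range and the bare-bones error handling are illustrative):

```c
/* Ask the kernel to dedupe <dest> against <src> via FIDEDUPERANGE. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(src, &st) < 0) { perror("fstat"); return 1; }

    /* One destination range: dedupe the whole of src against dest. */
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    r->src_offset = 0;
    r->src_length = st.st_size;
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    /* The kernel compares the ranges byte-for-byte before sharing, so
     * this is safe even if the files have diverged since the scan. */
    if (ioctl(src, FIDEDUPERANGE, r) < 0) { perror("FIDEDUPERANGE"); return 1; }

    printf("status=%d, bytes_deduped=%llu\n",
           r->info[0].status, (unsigned long long)r->info[0].bytes_deduped);
    free(r);
    return 0;
}
```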
I also wonder if it would make sense for ZFS to always automatically dedupe before taking a snapshot. But you'd have to make this behavior configurable, since it would turn snapshotting from a quick operation into an expensive one.
Sometimes it can be a similar issue performance-wise in some edge cases, but usually caching can address those problems.
Efficiency being the enemy of reliability, sometimes.
There are of course edge cases to consider to avoid data loss, but I imagine it might come soon, either officially or as a third-party tool.
You should be able to detect duplicates online. Low-priority sweeping is something else. But you can at least reduce pause times.
Anyway. Offline/lazy dedup (not in the ZFS dedup sense) is something that could be done in userspace, at the file level, on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing it with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got the space savings, and they're two separate files, so if one is written to, the other remains the same.
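A minimal sketch of that replace-the-duplicate-with-a-reflink step, assuming a Linux filesystem where `copy_file_range(2)` can share blocks; the temp-file naming and error handling are just illustrative:

```c
/* Replace a duplicate file with a (hopefully reflinked) copy of its twin. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <original> <duplicate>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    if (src < 0) { perror("open original"); return 1; }

    struct stat st;
    if (fstat(src, &st) < 0) { perror("fstat"); return 1; }

    /* Build the clone next to the duplicate, then rename over it, so the
     * replacement is atomic and the two files stay independent. */
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.reflink-tmp", argv[2]);
    int dst = open(tmp, O_WRONLY | O_CREAT | O_EXCL, st.st_mode & 0777);
    if (dst < 0) { perror("open temp"); return 1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* On reflink-capable filesystems the kernel shares blocks instead
         * of copying; elsewhere this degrades to an ordinary copy. */
        ssize_t n = copy_file_range(src, NULL, dst, NULL, remaining, 0);
        if (n < 0) { perror("copy_file_range"); return 1; }
        if (n == 0) break;  /* unexpected EOF */
        remaining -= n;
    }

    if (rename(tmp, argv[2]) < 0) { perror("rename"); return 1; }
    close(src);
    close(dst);
    return 0;
}
```

Writing the clone to a temp file and renaming it over the duplicate keeps the replacement atomic from the point of view of anyone opening the path.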
zfs set mountpoint=/mnt/foo foopy/foo
zfs set dedup=off foopy/foo
zfs set mountpoint=/mnt/baz foopy/baz
zfs set dedup=on foopy/baz
Save all your stuff in /mnt/foo, then when you want to dedup do `mv /mnt/foo/bar /mnt/baz/`. Since foo and baz are different datasets, the mv is really a copy-then-delete, so the data gets rewritten through the dedup=on dataset and dedup'ed on the way in.
Yeah... this feels like picrel, and it is https://i.pinimg.com/originals/cb/09/16/cb091697350736aae53afe4b548b9d43.jpg
but it's here and now and you can do it now.