Hacker News new | past | comments | ask | show | jobs | submit
Lazy/off-line dedup requires block pointer rewrite, but ZFS _cannot_ and will not ever get true BP rewrite because ZFS is not truly a CAS system. The problem is that physical locations are hashed into the Merkle hash tree, and that makes moving physical locations prohibitively expensive as you have to rewrite all the interior nodes on the way to the nodes you want to rewrite.

A better design would have been to split every node that has block pointers into two sections, one that has only logical block pointers and all of whose contents gets hashed into the tree, and one that has only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require re-writing blocks that are not part of the Merkle hash tree.

But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.

Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.

loading story #42007148