Because:
> When dedup is enabled [...] every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.
But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
A better design would have been to split every node that has block pointers into two sections: one containing only logical block pointers, all of whose contents get hashed into the tree, and one containing only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require rewriting blocks that are not part of the Merkle hash tree.
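To make the split concrete, here's a rough sketch of what such a node layout might look like; the struct names and field sizes are invented for illustration and are not actual ZFS on-disk structures:

```c
#include <stdint.h>

/* Invented layout, not real ZFS code: the point is just which half
 * feeds the Merkle tree and which half doesn't. */
typedef struct logical_bp {
    uint64_t length;        /* logical size of the pointed-to block   */
    uint8_t  checksum[32];  /* hash of the data; covered by the tree  */
} logical_bp_t;

typedef struct physical_loc {
    uint64_t vdev;          /* where the data currently lives; this   */
    uint64_t offset;        /* half is NOT hashed, so it can change   */
} physical_loc_t;

typedef struct indirect_node {
    logical_bp_t   logical[128];   /* hashed into the Merkle tree     */
    physical_loc_t physical[128];  /* cache-like, freely rewritable   */
} indirect_node_t;
```

Because only the `logical` half feeds the tree, a background pass could rewrite the `physical` half in place without rippling new checksums all the way up.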
But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.
Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.
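As a toy illustration of that fallback (nothing here is real ZFS code; the "disk", the checksum, and the dedup table are stand-ins invented for the sketch):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 8
static char disk[4][BLOCK];                  /* pretend pool, 4 blocks */

static uint32_t toy_checksum(const char *b)  /* stand-in for SHA-256 */
{
    uint32_t h = 0;
    for (int i = 0; i < BLOCK; i++) h = h * 31 + (unsigned char)b[i];
    return h;
}

typedef struct { uint32_t checksum; int physical; } bp_t;         /* block pointer   */
typedef struct { uint32_t checksum; int physical; } ddt_entry_t;  /* dedup table row */

static ddt_entry_t ddt[4];
static int ddt_len;

/* Read through a block pointer; on checksum mismatch, fall back to the
 * dedup table, keyed by the checksum already stored in the pointer. */
static int read_with_ddt_fallback(bp_t bp, char *out)
{
    memcpy(out, disk[bp.physical], BLOCK);
    if (toy_checksum(out) == bp.checksum)
        return 0;                             /* normal path: one read */

    for (int i = 0; i < ddt_len; i++)         /* mismatch: one wasted read */
        if (ddt[i].checksum == bp.checksum) {
            memcpy(out, disk[ddt[i].physical], BLOCK);
            return 0;                         /* found it via the DDT */
        }
    return -1;                                /* genuinely corrupt */
}

int main(void)
{
    /* Write a block at slot 0 and take a pointer to it. */
    memcpy(disk[0], "hello!!", BLOCK);
    bp_t bp = { toy_checksum(disk[0]), 0 };

    /* A later dedup pass moves the data to slot 2 and records that in the
     * dedup table -- but the stale block pointer still says slot 0. */
    memcpy(disk[2], disk[0], BLOCK);
    memset(disk[0], 0, BLOCK);
    ddt[ddt_len++] = (ddt_entry_t){ bp.checksum, 2 };

    char buf[BLOCK];
    printf("read %s: %.8s\n",
           read_with_ddt_fallback(bp, buf) == 0 ? "ok" : "failed", buf);
    return 0;
}
```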
This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM- and I/O-hungry, but you can schedule and throttle the "groveler".
I have had some data-eating corruption from bugs in the Windows Server 2012 R2 timeframe.
Like jdupes or duperemove.
I sent PRs to both the ZFS folks and the duperemove folks to support the syscalls needed.
I actually have to go follow up on the ZFS one; it took a while to review and I realized I completely forgot to finish it up.
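For context, duperemove does its block sharing through the Linux FIDEDUPERANGE ioctl, where the kernel verifies the ranges really are identical before sharing extents. A minimal sketch of that call (the single whole-file range and the bare-bones error handling are illustrative):

```c
/* Ask the kernel to dedupe <dest> against <src> via FIDEDUPERANGE. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(src, &st) < 0) { perror("fstat"); return 1; }

    /* One destination range: dedupe the whole of src against dest. */
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    r->src_offset = 0;
    r->src_length = st.st_size;
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    /* The kernel compares the ranges byte-for-byte before sharing, so
     * this is safe even if the files have diverged since the scan. */
    if (ioctl(src, FIDEDUPERANGE, r) < 0) { perror("FIDEDUPERANGE"); return 1; }

    printf("status=%d, bytes_deduped=%llu\n",
           r->info[0].status, (unsigned long long)r->info[0].bytes_deduped);
    free(r);
    return 0;
}
```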
I also wonder if it would make sense for ZFS to always automatically dedupe before taking a snapshot. But you'd have to make this behavior configurable, since it would turn snapshotting from a quick operation into an expensive one.
Sometimes it can be a similar issue performance-wise in some edge cases, but usually caching can address those problems.
Efficiency being the enemy of reliability, sometimes.
There are of course edge cases to consider to avoid data loss, but I imagine it might come soon, either officially or as a third-party tool.
You should be able to detect duplicates online. Low-priority sweeping is something else. But you can at least reduce pause times.
Anyway. Offline/lazy dedup (not in the ZFS dedup sense) is something that could be done in userspace, at the file level, on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing it with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got the space savings, and they're two separate files, so if one is written to, the other remains the same.
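A minimal sketch of that replace-the-duplicate-with-a-reflink step, assuming a Linux filesystem where `copy_file_range(2)` can share blocks; the temp-file naming and error handling are just illustrative:

```c
/* Replace a duplicate file with a (hopefully reflinked) copy of its twin. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <original> <duplicate>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    if (src < 0) { perror("open original"); return 1; }

    struct stat st;
    if (fstat(src, &st) < 0) { perror("fstat"); return 1; }

    /* Build the clone next to the duplicate, then rename over it, so the
     * replacement is atomic and the two files stay independent. */
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.reflink-tmp", argv[2]);
    int dst = open(tmp, O_WRONLY | O_CREAT | O_EXCL, st.st_mode & 0777);
    if (dst < 0) { perror("open temp"); return 1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* On reflink-capable filesystems the kernel shares blocks instead
         * of copying; elsewhere this degrades to an ordinary copy. */
        ssize_t n = copy_file_range(src, NULL, dst, NULL, remaining, 0);
        if (n < 0) { perror("copy_file_range"); return 1; }
        if (n == 0) break;  /* unexpected EOF */
        remaining -= n;
    }

    if (rename(tmp, argv[2]) < 0) { perror("rename"); return 1; }
    close(src);
    close(dst);
    return 0;
}
```

Writing the clone to a temp file and renaming it over the duplicate keeps the replacement atomic from the point of view of anyone opening the path.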
zfs set mountpoint=/mnt/foo foopy/foo
zfs set dedup=off foopy/foo
zfs set mountpoint=/mnt/baz foopy/baz
zfs set dedup=on foopy/baz
Save all your stuff in /mnt/foo, then when you want to dedup do `mv /mnt/foo/bar /mnt/baz/`. Since foo and baz are different datasets, the mv is really a copy-then-delete, so the data gets rewritten through the dedup=on dataset and dedup'ed on the way in.
Yeah... this feels like picrel, and it is https://i.pinimg.com/originals/cb/09/16/cb091697350736aae53afe4b548b9d43.jpg
but it's here and now and you can do it now.