Introduce zfs rewrite subcommand #17246
Conversation
I've tried to find some kernel APIs to wire this to, but found that plenty of Linux file systems each implement their own IOCTLs for similar purposes. I did the same, except that I chose the IOCTL number almost arbitrarily, since ZFS seems quite rough in this area. I am open to any better ideas before this is committed.
This looks amazing! Not having to sift through half a dozen shell scripts every time this comes up, to see which currently handles the most edge cases correctly, is very much appreciated. Especially with RAIDZ expansion, being able to direct users to run a built-in command instead of debating which script to send them would be very nice. Also, being able to reliably rewrite a live dataset while it's in use, without having to worry about skipped files or mtime conflicts, makes the whole process much less of a hassle. With the only things to really worry about being snapshots/space usage, this seems as close to perfect as reasonably possible (without diving deep into internals and messing with snapshot immutability). Bravo!
Thank you. This fixes one of the biggest problems with ZFS. Is there a way to suspend the process? It would be nice to have it run only during off hours.
It does one file at a time, and should be killable in between. Signal handling within one huge file can probably be added, though restarting the process is up to the user. I didn't plan to go that deep into this area within this PR.
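To make the "killable in between files" idea concrete, here is a hypothetical userspace driver in Python. The `zfs rewrite FILE` invocation shape is assumed from this PR, and the progress-log format is entirely invented for illustration; interruption finishes the current file, and a rerun skips files already logged.

```python
# Hypothetical wrapper: drives a per-file rewrite command so the run can be
# interrupted between files (SIGINT) and resumed later from a progress log.
# The `zfs rewrite FILE` CLI shape is assumed; the log format is invented.
import os
import signal
import subprocess

def rewrite_tree(root, progress_path, cmd=("zfs", "rewrite")):
    """Rewrite every file under root, skipping files already logged as done."""
    done = set()
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = {line.rstrip("\n") for line in f}
    stop = {"flag": False}
    def on_sigint(signum, frame):
        stop["flag"] = True          # finish the current file, then stop cleanly
    old = signal.signal(signal.SIGINT, on_sigint)
    try:
        with open(progress_path, "a") as log:
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    path = os.path.join(dirpath, name)
                    if path in done:
                        continue     # already rewritten on an earlier run
                    subprocess.run([*cmd, path], check=True)
                    log.write(path + "\n")
                    log.flush()
                    if stop["flag"]:
                        return False  # interrupted; rerun later to resume
        return True                   # walked everything
    finally:
        signal.signal(signal.SIGINT, old)
```

A second invocation with the same progress log only touches files not yet listed, which is what makes the process restartable.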
I couldn't find documentation in the changed files, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense how long this is going to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file system level, that might be the case here as well.
This will likely be a good use case for GNU Parallel.
It can take a directory as an argument, and there are some recursive functions and iterators in the code, so piping find into it should not be necessary. That avoids some userspace file-handling overhead, but it still has to go through the contents of each directory one file at a time. I also don't see any parallel execution or threading (though I'm not too familiar with ZFS internals; maybe some of the primitives used here run asynchronously?). Whether or not you add parallelism in userspace by calling it for many files/directories at once, it has the required locking to just run in the background, and it is significantly more elegant than cp plus an mtime (or potentially userspace hash) check to make sure files didn't change during the copy, avoiding one of the potential pitfalls of existing solutions.
I haven't benchmarked it deeply yet, but unless the files are tiny, I don't expect a major need for parallelism. The code in kernel should handle up to 16MB at a time, plus it allows ZFS to do read-ahead and write-back on top of that, so there will be quite a lot in the pipeline to saturate the disks and/or the system, especially if there is some compression/checksumming/encryption. And without the need to copy data to/from user-space, the only thread will not be doing too much, I think mostly decompression from ARC. A bunch of small files on a wide HDD pool may indeed suffer from read latency, but that we can optimize/parallelize in user-space all day long.
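If per-file read latency on small files does become the bottleneck, the userspace parallelization mentioned above could be sketched as follows. This is illustrative Python, not an existing tool: the `zfs rewrite` invocation is an assumption, and any per-file command can be substituted.

```python
# Userspace parallelism sketch: fan out one rewrite process per file across
# a small worker pool, useful when many tiny files make per-file latency
# (not bandwidth) the bottleneck. The `zfs rewrite` CLI shape is assumed.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def rewrite_parallel(paths, workers=4, cmd=("zfs", "rewrite")):
    """Run the rewrite command over paths with up to `workers` in flight."""
    def one(path):
        result = subprocess.run([*cmd, path])
        return path, result.returncode
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(one, paths))
    # non-zero exits are collected, not fatal: log and move on, like the tool
    return [p for p, rc in results.items() if rc != 0]
```

Threads are enough here because each worker just blocks in a child process; the actual I/O happens in the kernel.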
I gave this a quick test. It's very fast and does exactly what it says 👍
I can already see people writing scripts that go through every dataset, setting the optimal compression, recordsize, etc., and zfs rewrite-ing them.
Cool! Though the recordsize is one of the things it can't change, since that would require a real byte-level copy, not just marking existing blocks dirty. I am not sure it can be done under load in general. At least it would be much more complicated.
Umm, this is basically the same as doing send | recv, isn't it? I mean, in a way, this is already possible without any changes, isn't it? Recv will even respect a lower recordsize, if I'm not mistaken - at least when receiving into a pool without large-blocks support, it has to do that.

I'm thinking whether we can do better, in the original sense of ZFS "better", meaning "automagic" - what do you think of using snapshots and send | recv in a loop with ever-decreasing delta size, and then, when the delta isn't decreasing anymore, we could swap those datasets and use (perhaps slightly modified)

It'd be even cooler if it could coalesce smaller blocks into larger ones, but that potentially implies performance problems with write amplification; I would say if the app writes in smaller chunks that end up on disk in such smaller chunks, it's probably for the best to leave them that way. For any practical use case I could think of, though, I would definitely appreciate the ability to split the blocks of a dataset using smaller

If there's a way how to make
send | recv has the huge downside of requiring 2x the space; even if you do the delta-size thing, it has to send the entire dataset at least once, and old data can't be deleted until the new dataset is complete.
Isn't this exactly what rewrite does? Change the options, run it, and all the blocks are changed in the background, without an application even seeing a change to the file. And unlike send | recv, it only needs a few MB of extra space. Edit: with the only real exception being record size, but recv also solves that only partially at best, and it doesn't look like there's a reasonable way to work around it in a wholly transparent fashion.
Which release is this game-changing enhancement likely to land in?
@stuartthebruce So far it hasn't landed even in master, so anybody who wants to speed it up is welcome to test and comment. In general, though, when completed, there is no reason why, aside from 2.4.0, it can't be ported back to some 2.3.x of the time.
Good to know there are no obvious blockers to including this in a future 2.3.x. Once this hits master I will help by setting up a test system with 1/2PB of 10^9 small files to see if I can break it. Is there any reason to think the code will be sensitive to Linux vs FreeBSD?
The IOCTL interface of the kernels is obviously slightly different, requiring OS-specific shims, as with most other VFS-related code. But it seems like not a big problem, as Tony confirmed it works on Linux too on the first try.
Since this introduces a new IOCTL API, I'd appreciate some feedback before it hits master, in case some desired functionality might require API changes aside from the
OK, I will see if I can find some time this next week to stress test.
Yes, if you rewrite a block that's in a snapshot, then the snapshot keeps one, and you get a new one, so it will use more space on the pool. This is mentioned in It may not be twice the space; it depends on the properties at the time of rewrite.
I'm not really sure the distinction between "rewrite the data" and "rewrite the block pointer" really means anything here, since the BP describes how to interpret the data, so changing the data layout or transforms necessarily requires the block pointer to change. In any case, I doubt there's going to be much confusion. "Block pointer rewrite" is pretty inside-baseball at this point; most people who just use OpenZFS have likely never heard of BPR, or if they have, not with any particular idea of what it is or should be. Hell, I've been working on OpenZFS for three years now and I only know it in wish form ("gee, it'd be nice to just upgrade all the block pointers"). Which is what
How does zfs rewrite interact with an array that has data errors?
AFAIU the biggest difference between "zfs rewrite" and the mythical block-pointer rewrite is that data rewritten by "zfs rewrite" counts as newly-written data, with some unpleasant consequences:
At VFS layer, |
The kernel should return an error to user-space, same as read, but the user-space tool here is made to log errors and continue with the next file.
@maxximino Look here: #17565. :)
This allows rewriting the content of specified file(s) as-is, without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require copying data to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also, since it is protected by normal range locks, it can be done under any other load. Also it does not affect the file's modification time or other properties. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]>
Does this feature solve the problem of RAIDZ single-disk expansion resulting in less free space than expected?
@devifish There are two sides to the RAIDZ expansion free-space issue. The first is that old data after expansion really has the old parity ratio, taking more space than it could. Rewrite does help with this. The second is that free-space prediction is still based on the original parity ratio, underestimating free space when reporting. Rewrite does not help with this, but it is only a reporting issue, and the more you write, the lower the difference will be, so you'll be able to use all the expected space.
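The first point can be quantified with back-of-envelope arithmetic. The sketch below is deliberately simplified (it ignores allocation granularity, padding sectors, and partial stripes), but it shows why rewriting reclaims space after expansion:

```python
# Simplified parity-ratio arithmetic: data written to a 4-wide RAIDZ1 uses
# 1 parity sector per 3 data sectors; after expanding to 5 disks, old blocks
# keep that ratio until rewritten at 1 parity per 4 data sectors.
def raidz_overhead(width, parity=1):
    """Fraction of allocated space consumed by parity on a full-width stripe."""
    return parity / width

old = raidz_overhead(4)   # parity share of stripes written pre-expansion
new = raidz_overhead(5)   # parity share after rewrite on the expanded vdev
savings = old - new       # fraction of previously written data reclaimed
```

Under these assumptions, rewriting data originally stored on a 4-wide RAIDZ1 after expanding to 5 disks shrinks the parity overhead from 25% to 20% of allocated space.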
If all the data on the entire RAIDZ (or the entire pool) is rewritten, does the pool ever change to using the new parity ratio in its calculation of free space?
It is not currently implemented. But I don't think it is theoretically impossible.
So will this be able to change the recordsize of existing data? I have a few TB of tiny files that I changed the recordsize of. AFAIK send | recv won't do this, and I need to use rsync, which isn't really good with millions of small files.
Record size is basically the only thing that isn't changed by rewrite. It rewrites files record by record, and because a single file can only ever have one record size (which can't change once the file consists of 2 or more records), the only way to change record size is to create a new file (like cp --reflink=never, rsync, etc. do) and rename it back to replace the original.
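The copy-and-rename fallback described above can be sketched in Python. Everything here is illustrative: the temp-file naming convention is invented, and the mtime guard is the same heuristic the shell scripts mentioned earlier use. It assumes the dataset's recordsize property was already changed, so newly written files pick up the new value.

```python
# What rewrite cannot do: change recordsize. Sketch of the copy-and-rename
# fallback: copy to a temp name, confirm the source was not modified
# mid-copy (mtime check), then atomically rename over the original.
import os
import shutil

def recopy_for_recordsize(path):
    tmp = path + ".rewrite-tmp"     # hypothetical temp-name convention
    before = os.stat(path).st_mtime_ns
    shutil.copy2(path, tmp)         # copies data and timestamps
    if os.stat(path).st_mtime_ns != before:
        os.unlink(tmp)              # source changed under us: retry later
        return False
    os.rename(tmp, path)            # atomic replace within the same dataset
    return True
```

Note the race this leaves open: a writer can still modify the file between the final stat and the rename, which is exactly the class of problem the in-kernel rewrite avoids with proper range locking.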
Damn. I read
Since this can rewrite checksums, I wonder how airtight the "chain of custody" is, i.e. is there any gap between reading the data with the old checksums, verifying, computing the new checksum, and writing it out where a bit could end up being flipped without that being noticed?
At a guess, no worse than any normal read or write. Whenever reads or writes come from or are presented to userspace, they're not checksummed at all; the data is sitting in some normal buffer in RAM. Underpinning pretty much all storage is the basic assumption that things like your CPU and RAM work correctly, which excludes things like random DRAM bitflips. There are ways to mitigate this, of course, ECC being the obvious one. Beyond that you probably start getting into clusters and quorums and distributed consensus, which is kind of out of scope for ZFS, which runs on a single host. (Unless you've got some truly bizarre mainframe hardware where every app is run invisibly on multiple CPUs and cross-checked.)
Well yeah, but for long-term storage, once data is written and sitting there it's protected by checksum verification and regular scrubs; it won't be altered. But taking it out of this state and moving it through a read-write cycle introduces a new source of corruption, however small it might be. And if a new checksum is calculated, this may be silent. I think one could do better than a plain read-write. For example, O_DIRECT read, checksum, write, and then a second pass of reads from source and destination (through separate buffers or even NUMA nodes or whatever, if one is extra paranoid) to ensure that what was written is bit/checksum-identical to the original.
How about considering it an error if the new checksum does not match the original and, if so, keeping the original bits by rolling back the transaction for that specific file? Note: if that seems like a good idea, then this extra check should presumably also confirm the checksum difference was not simply due to a file modification during the rewrite transaction.
That "state" is "in ram." You are probably incurring the same risk every time you scrub your array, since corruption in ram then could lead ZFS to conclude your data is bad and needs to be "fixed." If you're this worried about corruption happening in ram, buy ECC. That should completely solve the issue.
You are running in circles chasing your tail, because you are both distrusting your CPU and RAM, while ALSO trusting them COMPLETELY. You are trusting that they're reading the correct data both times. What if the address is corrupted? You are trusting them to compute the checksum correctly. You are trusting them to report the correct result when you COMPARE the checksums. You are trusting them to write the raw data back out correctly after the checksums have been stripped back off. Remember, CoW filesystems like ZFS cannot by design write back out to the same location twice. The answer to this entire line of reasoning is "don't use ZFS." Go use something like CEPH that is distributed across separate nodes and has a real consensus model.
It's a worthwhile discussion and you're free to implement that or file an issue explaining the rationale, but for now this seems like a fairly straightforward implementation mostly based on existing ZFS primitives (basically iterating through a file and marking all records as dirty so they get transparently rewritten by the lower layers). Keeping both on-disk versions around and active (but unreferenced) to do a second unbuffered read pass on both of them to rule out memory corruption, before finally committing the new one while also handling ongoing I/O transparently (what happens if a file gets partially overwritten/changed from userspace during the rewrite?), isn't nearly as trivial. This initial version is basically aimed at replacing the userspace shell and Python scripts floating around that automate a cp + mv and sometimes can't even do proper locking to ensure a file isn't modified during the copy process. It's relatively clean and simple, doesn't require ANY on-disk format changes, and can run without conflicting with other workloads (because the record is just atomically marked as dirty, so any extra incoming write will just be handled by the normal path for multiple overlapping writes). Further improvements are being made (like the -P physical rewrite flag to skip rewritten records in zfs send), so a feature request for a "strict mode" for systems hosting critical data without ECC RAM could be discussed. But for now it has the same (or better) guarantees as any alternative rewrite option and should catch everything from disk errors to signal-path corruption on both the read and write path, except for memory errors on non-ECC systems during the actual checksum computation.
See my other post. There are myriad other ways things can go horribly wrong, and if you distrust your CPU/RAM to such a degree then this problem is completely unsolvable. Stop using a single-node solution like ZFS and go use something with a real distributed consensus model like CEPH.
If the checksum stays the same and this is verified, then this is no different than a send-receive. That's fine. But the feature is described as being able to change the checksum, so in general it seems weaker than send-receive-scrub.
And buses, caches, forwarding buffers, etc. etc.
No? This is covered by copies/parity. Resilver will also have to checksum the reconstruction before even attempting to write it back. So even a spurious scrub error should not result in a write of incorrect data.
To some extent, I agree; even a verification step is not perfect. But the failure modes you describe are less likely to be silent. For example, data first has to be corrupted, and then an additional verification step has to have a separate failure to flip the checksum-comparison from

The data sitting around without redundancy, without integrity, enabling silent corruption is what concerns me.
I see, yes, that's an improvement, but in my mind it does fall short of a send-receive then. Possibly it even falls short of a paranoid userspace implementation.
send | recv has the same issue: the record arrives on the target system, the checksum is verified, then it is potentially recompressed, changing the checksum, and finally it's written out. Here, data is read in, the checksum is verified, then it is potentially recompressed, and finally written out. ZFS checksums are, as far as I know, always computed on and verified against the physical representation of the record. In send | recv it's possible to keep the physical representation the same with -C, telling the sender to send compressed records, as long as the receiver accepts them and doesn't overwrite the compression property to use a different level, algorithm, etc. But that's useless if your goal is to change the compression algorithm, because then the checksum always has to be recomputed at some point. Yes, the process has room for improvement, and technically using flock, copying each file individually, running a userspace sha512 to make sure the VFS-level contents are the same, and then doing an mv is less susceptible to corruption. But it's also not realistic for a production system to have to schedule downtime because you want to flock the postgres data directory.
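A minimal sketch of the "paranoid mode" workflow discussed here, assuming (as this PR states) that rewrite preserves logical file contents: hash the file before the in-place rewrite, then re-hash afterwards. Any digest mismatch signals corruption somewhere in the pipeline (or a concurrent modification). The rewrite command is pluggable so the logic can be exercised without ZFS.

```python
# Userspace verify-after-rewrite sketch: since an in-place rewrite must not
# change what userspace reads, SHA-512 digests taken before and after should
# match. The `zfs rewrite` CLI shape is assumed; any command can be passed.
import hashlib
import subprocess

def sha512_of(path, bufsize=1 << 20):
    """Stream the file through SHA-512 in 1 MiB chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def rewrite_verified(path, cmd=("zfs", "rewrite")):
    before = sha512_of(path)
    subprocess.run([*cmd, path], check=True)
    return sha512_of(path) == before   # True: VFS-level contents unchanged
```

A False return cannot distinguish corruption from a legitimate concurrent write, which is exactly the ambiguity raised earlier in the thread; a strict mode would need to hold the file locked (or rely on in-kernel range locks) to tell them apart.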
Everything you have described doing must happen in RAM, not just pass through it. Comparison of checksums, calculation of addresses, even the basic variables, pointers, function call addresses, and code of ZFS itself. A random corruption of any of the variables or code of ZFS in RAM would instantly and undeniably destroy your entire pool. No matter what verification scheme you propose, it will never be possible to prevent a well-placed bitflip from destroying all of your data in an instant. If you care about corruption along the way (or indeed, code corruption), use ECC. If you're worried about other parts of your computer lying to you, use a multi-node distributed-consensus system like CEPH that explicitly can handle the total failure/misbehaviour of an entire computer.
@HPPinata Thanks for the explanation. My use is cold data/read-only storage. I'll stick to userspace copies with extra src/dest checksumming then. If zfs had something like
ECC RAM is great, but it's not the only thing the data flows through, and I don't think it should be the only thing protecting the data, it's still less resilient than SHA256 when it comes to verification. "all data is end-to-end verified through cryptographic hashes, discrepancies will be detected on read or scrub" is easier to reason about than the handoffs of unverified data between multiple system components with various levels of ECC or parity (if at all... do CPU registers use parity? I don't know). Trust, but verify
I doubt that the probabilities of a random data block getting corrupted vs. the entire array getting killed are the same. And when it comes to data integrity, probability matters, since we can never have perfect integrity; it's always the nines. And I'd also expect this kind of failure to be loud, in which case I could at least restore from backups. Silent corruption gets propagated to backups, and then the data is permanently damaged.
Minor note on:
Corruption in buses, caches, forwarding buffers, etc. is covered for everything on the I/O side. The old checksum is verified after the record is read into memory, and the new one is computed before it's written out to disk. So any early corruption is caught on the read, and any late corruption is caught by the next scrub. The true "unprotected" signal path is therefore DRAM -> memory controller -> CPU bus & cache -> core and back. The CPU caches I've seen datasheets for all have some sort of multi-bit ECC, and the buses / data fabrics generally also have at least functional integrity checks. So for all intents and purposes the CPU-internal data path is generally treated as integrity-protected, and the only real missing link to full hardware-level data-integrity guarantees is ECC memory. Of course, if your use case allows you to go the extra mile of manually verifying the data before and after a rewrite, that's an additional layer of protection (data fabrics definitely aren't using SHA512 for their internal consistency guarantees), but it's debatable whether it's a filesystem's job to make sure there isn't a random bit flip in the code of a compression library causing silent (logic) translation errors when converting from uncompressed to compressed data buffers.
You are right, registers do not. Look up "mercurial cores." At that level of paranoia, though, you don't trust your CPU and the problems get even worse. Your only option is multiple computers with an explicit consensus model, i.e. CEPH.
The probability of the entire array getting trashed is waaaaaaaaaay higher, for a given bitflip in ram. The code of ZFS sits in your unprotected ECC-less ram from the time you boot up your computer until you shut it down. There's plenty of time for a particle from space to strike it, vs a buffer that's only a couple of MB at most and is continually being freshly rewritten.
A small bitflip in ZFS' code that causes an early return during the writeout could result in bad (incomplete) data getting written and never being checked. A bitflip in the checksum comparison code would cause every checksum comparison to return success. A bitflip in code that computes block pointers could result in the continual writing and checking of (correct!) data to the same spot in the array over and over, and you'd never know until you manually tried to access it and found most of it missing. You are coming up with ad-hoc solutions to a very small subset of guessed-at problems, without really basing any of this on a set of explicit and consistent assumptions about the state of the system and what you can actually rely on.
Motivation and Context
For years users have been asking for the ability to re-balance a pool after vdev addition, de-fragment randomly written files, change some properties of already written files, etc. The closest options would be to either copy and rename a file, or send/receive/rename the dataset. Unfortunately, all of those options have downsides.
Description
This change introduces a new zfs rewrite subcommand that allows rewriting the content of specified file(s) as-is, without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require copying data to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also, since it is protected by normal range locks, it can be done under any other load. Also it does not affect the file's modification time or other properties.

How Has This Been Tested?
Manually tested it on FreeBSD. Linux-specific code is not yet tested.