Introduce zfs rewrite subcommand #17246
Conversation
I've tried to find some kernel APIs to wire this to, but found that plenty of Linux file systems each implement their own IOCTLs for similar purposes. I did the same, except that I chose the IOCTL number almost arbitrarily, since ZFS seems quite rough in this area. I am open to any better ideas before this is committed.
This looks amazing! Not having to sift through half a dozen shell scripts every time this comes up, to see which currently handles the most edge cases correctly, is very much appreciated. Especially with RAIDZ expansion, being able to direct users to run a built-in command instead of debating which script to send them would be very nice. Also, being able to reliably rewrite a live dataset while it's in use, without having to worry about skipped files or mtime conflicts, makes the whole process much less of a hassle. With the only things to really worry about being snapshots/space usage, this seems as close to perfect as reasonably possible (without diving deep into internals and messing with snapshot immutability). Bravo!
Thank you. This fixes one of the biggest problems with ZFS. Is there a way to suspend the process? It would be nice to have it run only during off hours.
It does one file at a time, and should be killable in between. Signal handling within one huge file can probably be added, though restarting the process is up to the user. I didn't plan to go that deep into this area within this PR.
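To make the "killable in between files" idea concrete, here is a hypothetical userspace driver in Python. The `zfs rewrite FILE` invocation shape is assumed from this PR, and the progress-log format is entirely invented for illustration; interruption finishes the current file, and a rerun skips files already logged.

```python
# Hypothetical wrapper: drives a per-file rewrite command so the run can be
# interrupted between files (SIGINT) and resumed later from a progress log.
# The `zfs rewrite FILE` CLI shape is assumed; the log format is invented.
import os
import signal
import subprocess

def rewrite_tree(root, progress_path, cmd=("zfs", "rewrite")):
    """Rewrite every file under root, skipping files already logged as done."""
    done = set()
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = {line.rstrip("\n") for line in f}
    stop = {"flag": False}
    def on_sigint(signum, frame):
        stop["flag"] = True          # finish the current file, then stop cleanly
    old = signal.signal(signal.SIGINT, on_sigint)
    try:
        with open(progress_path, "a") as log:
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    path = os.path.join(dirpath, name)
                    if path in done:
                        continue     # already rewritten on an earlier run
                    subprocess.run([*cmd, path], check=True)
                    log.write(path + "\n")
                    log.flush()
                    if stop["flag"]:
                        return False  # interrupted; rerun later to resume
        return True                   # walked everything
    finally:
        signal.signal(signal.SIGINT, old)
```

A second invocation with the same progress log only touches files not yet listed, which is what makes the process restartable.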
I couldn't find documentation in the changed files, so I have to guess how it actually works. Is it a file at a time? I guess you could feed it with a "find" command. For a system with a billion files, do you have a sense how long this is going to take? We can do scrubs in a day or two, but rsync is impractically slow. If this is happening at the file system level, that might be the case here as well.
This will likely be a good use case for GNU Parallel.
It can take a directory as an argument, and there are some recursive functions and iterators in the code, so piping find into it should not be necessary. That avoids some userspace file-handling overhead, but it still has to go through the contents of each directory one file at a time. I also don't see any parallel execution or threading (though I'm not too familiar with ZFS internals; maybe some of the primitives used here run asynchronously?). Whether or not you add parallelism in userspace by calling it for many files/directories at once, it has the required locking to just run in the background, and it is significantly more elegant than cp plus an mtime (or potentially userspace hash) check to make sure files didn't change during the copy, avoiding one of the potential pitfalls of existing solutions.
I haven't benchmarked it deeply yet, but unless the files are tiny, I don't expect a major need for parallelism. The code in kernel should handle up to 16MB at a time, plus it allows ZFS to do read-ahead and write-back on top of that, so there will be quite a lot in the pipeline to saturate the disks and/or the system, especially if there is some compression/checksumming/encryption. And without the need to copy data to/from user-space, the only thread will not be doing too much, I think mostly decompression from ARC. A bunch of small files on a wide HDD pool may indeed suffer from read latency, but that we can optimize/parallelize in user-space all day long.
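If per-file read latency on small files does become the bottleneck, the userspace parallelization mentioned above could be sketched as follows. This is illustrative Python, not an existing tool: the `zfs rewrite` invocation is an assumption, and any per-file command can be substituted.

```python
# Userspace parallelism sketch: fan out one rewrite process per file across
# a small worker pool, useful when many tiny files make per-file latency
# (not bandwidth) the bottleneck. The `zfs rewrite` CLI shape is assumed.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def rewrite_parallel(paths, workers=4, cmd=("zfs", "rewrite")):
    """Run the rewrite command over paths with up to `workers` in flight."""
    def one(path):
        result = subprocess.run([*cmd, path])
        return path, result.returncode
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(one, paths))
    # non-zero exits are collected, not fatal: log and move on, like the tool
    return [p for p, rc in results.items() if rc != 0]
```

Threads are enough here because each worker just blocks in a child process; the actual I/O happens in the kernel.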
I gave this a quick test. It's very fast and does exactly what it says 👍
I can already see people writing scripts that go through every dataset, setting the optimal compression, recordsize, etc., and zfs rewrite-ing them.
Cool! Though the recordsize is one of the things it can't change, since that would require a real byte-level copy, not just marking existing blocks dirty. I am not sure it can be done under load in general. At least it would be much more complicated.
Umm, this is basically the same as doing send | recv, isn't it? I mean, in a way, this is already possible without any changes, isn't it? Recv will even respect a lower recordsize, if I'm not mistaken - at least when receiving into a pool without large-blocks support, it has to do that.

I'm thinking whether we can do better, in the original sense of ZFS "better", meaning "automagic" - what do you think of using snapshots and send | recv in a loop with ever-decreasing delta size, and then, when the delta isn't decreasing anymore, we could swap those datasets and use (perhaps slightly modified)

It'd be even cooler if it could coalesce smaller blocks into larger ones, but that potentially implies performance problems with write amplification; I would say if the app writes in smaller chunks that end up on disk in such smaller chunks, it's probably for the best to leave them that way. For any practical use case I could think of, though, I would definitely appreciate the ability to split the blocks of a dataset using smaller

If there's a way how to make
send | recv has the huge downside of requiring 2x the space; even if you do the delta-size thing, it has to send the entire dataset at least once, and old data can't be deleted until the new dataset is complete.
Isn't this exactly what rewrite does? Change the options, run it, and all the blocks are changed in the background, without an application even seeing a change to the file. And unlike send | recv, it only needs a few MB of extra space. Edit: with the only real exception being record size, but recv also solves that only partially at best, and it doesn't look like there's a reasonable way to work around it in a wholly transparent fashion.
Which release is this game-changing enhancement likely to land in?
@stuartthebruce So far it hasn't landed even in master, so anybody who wants to speed it up is welcome to test and comment. In general, though, when completed, there is no reason why, aside from 2.4.0, it can't be ported back to some 2.3.x of the time.
Good to know there are no obvious blockers to including this in a future 2.3.x. Once this hits master I will help by setting up a test system with 1/2PB of 10^9 small files to see if I can break it. Is there any reason to think the code will be sensitive to Linux vs FreeBSD?
The IOCTL interface of the kernels is obviously slightly different, requiring OS-specific shims, as with most other VFS-related code. But it seems like not a big problem, as Tony confirmed it works on Linux too on the first try.
Since this introduces a new IOCTL API, I'd appreciate some feedback before it hits master, in case some desired functionality might require API changes aside from the
OK, I will see if I can find some time this next week to stress test.
Yes, if you rewrite a block that's in a snapshot, then the snapshot keeps one, and you get a new one, so it will use more space on the pool. This is mentioned in It may not be twice the space; it depends on the properties at the time of rewrite.
I'm not really sure the distinction between "rewrite the data" and "rewrite the block pointer" really means anything here, since the BP describes how to interpret the data, so changing the data layout or transforms necessarily requires the block pointer to change. In any case, I doubt there's going to be much confusion. "Block pointer rewrite" is pretty inside-baseball at this point; most people who just use OpenZFS have likely never heard of BPR, or if they have, not with any particular idea of what it is or should be. Hell, I've been working on OpenZFS for three years now and I only know it in wish form ("gee, it'd be nice to just upgrade all the block pointers"). Which is what
How does zfs rewrite interact with an array that has data errors?
AFAIU the biggest difference between "zfs rewrite" and the mythical block-pointer rewrite is that data rewritten by "zfs rewrite" counts as newly-written data, with some unpleasant consequences:
At VFS layer, |
The kernel should return an error to user-space, same as read, but the user-space tool here is made to log errors and continue with the next file.
@maxximino Look here: #17565. :)
This allows rewriting the content of specified file(s) as-is, without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require copying data to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also, since it is protected by normal range locks, it can be done under any other load. Also it does not affect the file's modification time or other properties. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]>
Does this feature solve the problem of RAIDZ single-disk expansion resulting in less free space than expected?
@devifish There are two sides to the RAIDZ expansion free-space issue. The first is that old data after expansion really has the old parity ratio, taking more space than it could. Rewrite does help with this. The second is that free-space prediction is still based on the original parity ratio, underestimating free space when reporting. Rewrite does not help with this, but it is only a reporting issue, and the more you write, the lower the difference will be, so you'll be able to use all the expected space.
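The first point can be quantified with back-of-envelope arithmetic. The sketch below is deliberately simplified (it ignores allocation granularity, padding sectors, and partial stripes), but it shows why rewriting reclaims space after expansion:

```python
# Simplified parity-ratio arithmetic: data written to a 4-wide RAIDZ1 uses
# 1 parity sector per 3 data sectors; after expanding to 5 disks, old blocks
# keep that ratio until rewritten at 1 parity per 4 data sectors.
def raidz_overhead(width, parity=1):
    """Fraction of allocated space consumed by parity on a full-width stripe."""
    return parity / width

old = raidz_overhead(4)   # parity share of stripes written pre-expansion
new = raidz_overhead(5)   # parity share after rewrite on the expanded vdev
savings = old - new       # fraction of previously written data reclaimed
```

Under these assumptions, rewriting data originally stored on a 4-wide RAIDZ1 after expanding to 5 disks shrinks the parity overhead from 25% to 20% of allocated space.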
If all the data on the entire RAIDZ (or the entire pool) is rewritten, does the pool ever change to using the new parity ratio in its calculation of free space?
It is not currently implemented. But I don't think it is theoretically impossible.
So will this be able to change the recordsize of existing data? I have a few TB of tiny files that I changed the recordsize of. AFAIK send | recv won't do this, and I need to use rsync, which isn't really good with millions of small files.
Record size is basically the only thing that isn't changed by rewrite. It rewrites files record by record, and because a single file can only ever have one record size (which can't change once the file consists of 2 or more records), the only way to change record size is to create a new file (like cp --reflink=never, rsync, etc. do) and rename it back to replace the original.
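The copy-and-rename fallback described above can be sketched in Python. Everything here is illustrative: the temp-file naming convention is invented, and the mtime guard is the same heuristic the shell scripts mentioned earlier use. It assumes the dataset's recordsize property was already changed, so newly written files pick up the new value.

```python
# What rewrite cannot do: change recordsize. Sketch of the copy-and-rename
# fallback: copy to a temp name, confirm the source was not modified
# mid-copy (mtime check), then atomically rename over the original.
import os
import shutil

def recopy_for_recordsize(path):
    tmp = path + ".rewrite-tmp"     # hypothetical temp-name convention
    before = os.stat(path).st_mtime_ns
    shutil.copy2(path, tmp)         # copies data and timestamps
    if os.stat(path).st_mtime_ns != before:
        os.unlink(tmp)              # source changed under us: retry later
        return False
    os.rename(tmp, path)            # atomic replace within the same dataset
    return True
```

Note the race this leaves open: a writer can still modify the file between the final stat and the rename, which is exactly the class of problem the in-kernel rewrite avoids with proper range locking.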
Damn. I read
Since this can rewrite checksums, I wonder how airtight the "chain of custody" is, i.e. is there any gap between reading the data with the old checksums, verifying, computing the new checksum, and writing it out where a bit could end up being flipped without that being noticed?
At a guess, no worse than any normal read or write. Whenever reads or writes come from or are presented to userspace, they're not checksummed at all; the data is sitting in some normal buffer in RAM. Underpinning pretty much all storage is the basic assumption that things like your CPU and RAM work correctly, which excludes things like random DRAM bitflips. There are ways to mitigate this, of course, ECC being the obvious one. Beyond that you probably start getting into clusters and quorums and distributed consensus, which is kind of out of scope for ZFS, which runs on a single host. (Unless you've got some truly bizarre mainframe hardware where every app is run invisibly on multiple CPUs and cross-checked.)
Well yeah, but for long-term storage, once data is written and sitting there it's protected by checksum verification and regular scrubs; it won't be altered. But taking it out of this state and moving it through a read-write cycle introduces a new source of corruption, however small it might be. And if a new checksum is calculated, this may be silent. I think one could do better than a plain read-write. For example, O_DIRECT read, checksum, write, and then a second pass of reads from source and destination (through separate buffers or even NUMA nodes or whatever, if one is extra paranoid) to ensure that what was written is bit/checksum-identical to the original.
How about considering it an error if the new checksum does not match the original and, if so, keeping the original bits by rolling back the transaction for that specific file? Note: if that seems like a good idea, then this extra check should presumably also confirm the checksum difference was not simply due to a file modification during the rewrite transaction.
That "state" is "in ram." You are probably incurring the same risk every time you scrub your array, since corruption in ram then could lead ZFS to conclude your data is bad and needs to be "fixed." If you're this worried about corruption happening in ram, buy ECC. That should completely solve the issue.
You are running in circles chasing your tail, because you are both distrusting your CPU and RAM, while ALSO trusting them COMPLETELY. You are trusting that they're reading the correct data both times. What if the address is corrupted? You are trusting them to compute the checksum correctly. You are trusting them to report the correct result when you COMPARE the checksums. You are trusting them to write the raw data back out correctly after the checksums have been stripped back off. Remember, CoW filesystems like ZFS cannot by design write back out to the same location twice. The answer to this entire line of reasoning is "don't use ZFS." Go use something like CEPH that is distributed across separate nodes and has a real consensus model.
It's a worthwhile discussion and you're free to implement that or file an issue explaining the rationale, but for now this seems like a fairly straightforward implementation mostly based on existing ZFS primitives (basically iterating through a file and marking all records as dirty so they get transparently rewritten by the lower layers). Keeping both on-disk versions around and active (but unreferenced) to do a second unbuffered read pass on both of them to rule out memory corruption, before finally committing the new one while also handling ongoing I/O transparently (what happens if a file gets partially overwritten/changed from userspace during the rewrite?), isn't nearly as trivial. This initial version is basically aimed at replacing the userspace shell and Python scripts floating around that automate a cp + mv and sometimes can't even do proper locking to ensure a file isn't modified during the copy process. It's relatively clean and simple, doesn't require ANY on-disk format changes, and can run without conflicting with other workloads (because the record is just atomically marked as dirty, so any extra incoming write will just be handled by the normal path for multiple overlapping writes). Further improvements are being made (like the -P physical rewrite flag to skip rewritten records in zfs send), so a feature request for a "strict mode" for systems hosting critical data without ECC RAM could be discussed. But for now it has the same (or better) guarantees as any alternative rewrite option and should catch everything from disk errors to signal-path corruption on both the read and write path, except for memory errors on non-ECC systems during the actual checksum computation.
See my other post. There are myriad other ways things can go horribly wrong, and if you distrust your CPU/RAM to such a degree then this problem is completely unsolvable. Stop using a single-node solution like ZFS and go use something with a real distributed consensus model like CEPH.
If the checksum stays the same and this is verified, then this is no different than a send-receive. That's fine. But the feature is described as being able to change the checksum, so in general it seems weaker than send-receive-scrub.
And buses, caches, forwarding buffers, etc. etc.
No? This is covered by copies/parity. Resilver will also have to checksum the reconstruction before even attempting to write it back. So even a spurious scrub error should not result in a write of incorrect data.
To some extent, I agree; even a verification step is not perfect. But the failure modes you describe are less likely to be silent. For example, data first has to be corrupted, and then an additional verification step has to have a separate failure to flip the checksum-comparison from

The data sitting around without redundancy, without integrity, enabling silent corruption is what concerns me.
I see, yes, that's an improvement, but in my mind it does fall short of a send-receive then. Possibly it even falls short of a paranoid userspace implementation.
send | recv has the same issue: the record arrives on the target system, the checksum is verified, then it is potentially recompressed, changing the checksum, and finally it's written out. Here, data is read in, the checksum is verified, then it is potentially recompressed, and finally written out. ZFS checksums are, as far as I know, always computed on and verified against the physical representation of the record. In send | recv it's possible to keep the physical representation the same with -C, telling the sender to send compressed records, as long as the receiver accepts them and doesn't overwrite the compression property to use a different level, algorithm, etc. But that's useless if your goal is to change the compression algorithm, because then the checksum always has to be recomputed at some point. Yes, the process has room for improvement, and technically using flock, copying each file individually, running a userspace sha512 to make sure the VFS-level contents are the same, and then doing an mv is less susceptible to corruption. But it's also not realistic for a production system to have to schedule downtime because you want to flock the postgres data directory.
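A minimal sketch of the "paranoid mode" workflow discussed here, assuming (as this PR states) that rewrite preserves logical file contents: hash the file before the in-place rewrite, then re-hash afterwards. Any digest mismatch signals corruption somewhere in the pipeline (or a concurrent modification). The rewrite command is pluggable so the logic can be exercised without ZFS.

```python
# Userspace verify-after-rewrite sketch: since an in-place rewrite must not
# change what userspace reads, SHA-512 digests taken before and after should
# match. The `zfs rewrite` CLI shape is assumed; any command can be passed.
import hashlib
import subprocess

def sha512_of(path, bufsize=1 << 20):
    """Stream the file through SHA-512 in 1 MiB chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def rewrite_verified(path, cmd=("zfs", "rewrite")):
    before = sha512_of(path)
    subprocess.run([*cmd, path], check=True)
    return sha512_of(path) == before   # True: VFS-level contents unchanged
```

A False return cannot distinguish corruption from a legitimate concurrent write, which is exactly the ambiguity raised earlier in the thread; a strict mode would need to hold the file locked (or rely on in-kernel range locks) to tell them apart.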
Everything you have described doing must happen in RAM, not just pass through it. Comparison of checksums, calculation of addresses, even the basic variables, pointers, function call addresses, and code of ZFS itself. A random corruption of any of the variables or code of ZFS in RAM would instantly and undeniably destroy your entire pool. No matter what verification scheme you propose, it will never be possible to prevent a well-placed bitflip from destroying all of your data in an instant. If you care about corruption along the way (or indeed, code corruption), use ECC. If you're worried about other parts of your computer lying to you, use a multi-node distributed-consensus system like CEPH that explicitly can handle the total failure/misbehaviour of an entire computer.
@HPPinata Thanks for the explanation. My use is cold data/read-only storage. I'll stick to userspace copies with extra src/dest checksumming then. If zfs had something like
ECC RAM is great, but it's not the only thing the data flows through, and I don't think it should be the only thing protecting the data, it's still less resilient than SHA256 when it comes to verification. "all data is end-to-end verified through cryptographic hashes, discrepancies will be detected on read or scrub" is easier to reason about than the handoffs of unverified data between multiple system components with various levels of ECC or parity (if at all... do CPU registers use parity? I don't know). Trust, but verify
I doubt that the probabilities of a random data block getting corrupted vs. the entire array getting killed are the same. And when it comes to data integrity, probability matters, since we can never have perfect integrity; it's always the nines. And I'd also expect this kind of failure to be loud, in which case I could at least restore from backups. Silent corruption gets propagated to backups, and then the data is permanently damaged.
Minor note on:
Corruption in buses, caches, forwarding buffers, etc. is covered for everything on the I/O side. The old checksum is verified after the record is read into memory, and the new one is computed before it's written out to disk. So any early corruption is caught on the read, and any late corruption is caught by the next scrub. The true "unprotected" signal path is therefore DRAM -> memory controller -> CPU bus & cache -> core and back. The CPU caches I've seen datasheets for all have some sort of multi-bit ECC, and the buses / data fabrics generally also have at least functional integrity checks. So for all intents and purposes the CPU-internal data path is generally treated as integrity-protected, and the only real missing link to full hardware-level data-integrity guarantees is ECC memory. Of course, if your use case allows you to go the extra mile of manually verifying the data before and after a rewrite, that's an additional layer of protection (data fabrics definitely aren't using SHA512 for their internal consistency guarantees), but it's debatable whether it's a filesystem's job to make sure there isn't a random bit flip in the code of a compression library causing silent (logic) translation errors when converting from uncompressed to compressed data buffers.
You are right, registers do not. Look up "mercurial cores." At that level of paranoia, though, you don't trust your CPU and the problems get even worse. Your only option is multiple computers with an explicit consensus model, i.e. CEPH.
The probability of the entire array getting trashed is waaaaaaaaaay higher, for a given bitflip in ram. The code of ZFS sits in your unprotected ECC-less ram from the time you boot up your computer until you shut it down. There's plenty of time for a particle from space to strike it, vs a buffer that's only a couple of MB at most and is continually being freshly rewritten.
A small bitflip in ZFS' code that causes an early return during the writeout could result in bad (incomplete) data getting written and never being checked. A bitflip in the checksum comparison code would cause every checksum comparison to return success. A bitflip in code that computes block pointers could result in the continual writing and checking of (correct!) data to the same spot in the array over and over, and you'd never know until you manually tried to access it and found most of it missing. You are coming up with ad-hoc solutions to a very small subset of guessed-at problems, without really basing any of this on a set of explicit and consistent assumptions about the state of the system and what you can actually rely on.
Motivation and Context
For years users have been asking for the ability to re-balance a pool after vdev addition, de-fragment randomly written files, change some properties of already written files, etc. The closest options would be to either copy and rename a file, or send/receive/rename the dataset. Unfortunately, all of those options have downsides.
Description
This change introduces a new zfs rewrite subcommand that allows rewriting the content of specified file(s) as-is, without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require copying data to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also, since it is protected by normal range locks, it can be done under any other load. Also it does not affect the file's modification time or other properties.

How Has This Been Tested?
Manually tested it on FreeBSD. Linux-specific code is not yet tested.