Conversation

@robn robn (Member) commented Aug 6, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

03987f7 (#16069) added a workaround to get the blk-mq hardware context for older kernels that don't cache it in the struct request. However, this workaround appears to be incomplete.

In 4.19, the rq data context is optional. If it's not initialised, the cached rq->cpu will be -1, and using it to index into mq_map causes a crash.
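Roughly, the failing shape looks like this (a simplified sketch using 4.19 block-layer field names; not the exact ZFS compat code):

```c
#include <linux/blk-mq.h>

/*
 * Sketch of the workaround's failure mode on 4.19. The q->mq_map and
 * q->queue_hw_ctx names follow the 4.19 block layer.
 */
static inline struct blk_mq_hw_ctx *
rq_hctx_sketch(struct request *rq)
{
	struct request_queue *q = rq->q;

	/*
	 * If the request's data context was never initialised, rq->cpu
	 * is still -1, so this reads before the start of mq_map.
	 */
	return (q->queue_hw_ctx[q->mq_map[rq->cpu]]);
}
```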

Description

Given that upstream 4.19 is now in extended LTS and rarely seen, RHEL8's 4.18+ has long carried "modern" blk-mq support, and the cached hardware context has been available since 5.1, I'm not going to go to huge lengths to get queue selection correct for the very few people likely to feel it. To that end, we simply call get_cpu() to get a valid CPU id and use that instead.
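A minimal sketch of that fallback, assuming the same helper shape as above (the final revision switched to raw_smp_processor_id(); see the discussion below):

```c
#include <linux/blk-mq.h>
#include <linux/smp.h>

static inline struct blk_mq_hw_ctx *
rq_hctx_sketch(struct request *rq)
{
	struct request_queue *q = rq->q;
	struct blk_mq_hw_ctx *hctx;

	/* get_cpu() returns a valid CPU id and disables preemption. */
	int cpu = get_cpu();
	hctx = q->queue_hw_ctx[q->mq_map[cpu]];
	put_cpu();

	return (hctx);
}
```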

How Has This Been Tested?

Since 03987f7, zvol_stress against 4.19.325 will crash with a bad memory access within a few seconds. With this change, the entire zvol test tag runs to successful completion.

Compile checked against 5.4.294, which is the oldest kernel I have lying around after 4.19. It defines HAVE_BLK_MQ_RQ_HCTX, so it would not hit this codepath.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin (Member) left a comment

This was quite a puzzle for years, but I just wonder: couldn't we check rq->cpu for the mentioned -1 value, if that's what it is when not initialized? Can there be some legal cases where rq->cpu is intentionally not equal to get_cpu()? Though the effects should be limited to performance, so I won't insist for such old versions.
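Something like this, perhaps (an illustrative sketch only, reusing the mq_map indexing shape from above):

```c
#include <linux/blk-mq.h>
#include <linux/smp.h>

/* Illustrative only: fall back to a live CPU id when rq->cpu is unset. */
static inline int
rq_cpu_or_current(struct request *rq)
{
	int cpu = rq->cpu;

	if (cpu < 0) {
		cpu = get_cpu();
		/* Staleness after put_cpu() is fine for a selection hint. */
		put_cpu();
	}
	return (cpu);
}
```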

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Aug 6, 2025
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 6, 2025
@robn robn (Member, Author) commented Aug 7, 2025

This was quite a puzzle for years, but I just wonder: couldn't we check rq->cpu for the mentioned -1 value, if that's what it is when not initialized? Can there be some legal cases where rq->cpu is intentionally not equal to get_cpu()?

@amotin I did think about making sure that rq->cpu was in the range of valid CPU ids, but then I started thinking about things like: what if a hotplugged CPU is offline, or locked out, or crashed, or outside a cpuset, or any of the other things that might mean it's not usable for IO? And then I went "eh", because I really just wanted to fix the compile, not become an expert in ancient blk-mq. And to be sure, your suggestion of -1 is far narrower, but I think the same still applies: I don't know what most of this means, and for a niche kernel range that probably no one is using, it should be fine. Especially since it's been crashing for over a year and no one noticed!

(Reading that back, I kinda sound like an asshole, maybe. I dunno, I feel like there's an indicator of some other problem here, but maybe it's just that Linux is hard to track, so more of the same.)

@amotin amotin (Member) commented Aug 7, 2025

I did think about making sure that rq->cpu was in the range of valid CPU ids

@robn I agree that checking for validity would be overkill. But you mentioned the specific value of -1, which does not sound uninitialized to me, but intentionally left undefined, which is not the same. But if you think it can't really be trusted -- I won't argue.

BTW, should this be smp_processor_id() or raw_smp_processor_id()? Because according to the comments in include/linux/smp.h, the first may warn if used in a context not bound to the CPU.
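For reference, the difference (with CONFIG_DEBUG_PREEMPT enabled):

```c
#include <linux/smp.h>

int cpu;

/*
 * Checked variant: with CONFIG_DEBUG_PREEMPT this warns
 * "BUG: using smp_processor_id() in preemptible code" unless
 * preemption is disabled or the task is pinned to one CPU.
 */
cpu = smp_processor_id();

/*
 * Unchecked variant: no warning, but the value may be stale as soon
 * as it is read. Fine when it is only a load-spreading hint.
 */
cpu = raw_smp_processor_id();
```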

@robn robn (Member, Author) commented Aug 7, 2025

@amotin Fair. I only saw initialisation to -1, but I didn't read in a lot of depth. Can't stress enough how much I'm trying to avoid learning this code!

I'd be fine with raw_smp_processor_id(), mostly because we already use it in a couple of places as an index via CPU_SEQID_UNSTABLE. Though we use both.

Maybe a different/better question: ultimately, we're hunting down queue_num so we can use it as an input to cityhash(zv, offset, qnum), and from there select one of our own taskqs. I've not thought about it hard, but I suppose that's about very roughly keeping IO to the same region of a zvol on the same taskq? Or roughly matching taskqs to request queues? To what extent is getting this right important? Is there a simpler thing we could do if getting a CPU id is fiddly?
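For reference, the shape I mean (an illustrative sketch only: cityhash4 is the OpenZFS hash with the fourth input unused here, and zv_taskqs/tq_count are placeholder names, not the actual zvol_os.c code):

```c
#include <sys/cityhash.h>
#include <sys/taskq.h>

/* Hash (zvol, offset, queue) down to one of our own taskqs. */
uint64_t h = cityhash4((uintptr_t)zv, (uint64_t)offset,
    (uint64_t)queue_num, 0);
taskq_t *tq = zv_taskqs[h % tq_count];
```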

@amotin amotin (Member) commented Aug 7, 2025

@robn The goal of this logic is to keep requests that may be sequential together, to reduce reordering between them and avoid confusing the speculative prefetcher too much, while at the same time getting as even a load distribution across multiple taskqueues as possible. The assumption is that requests coming through different queues may already be reordered, so we don't care much about those. At the same time, even within one queue we may have multiple unrelated sequential streams, which we can try to roughly split by offset. Since my round of work there last year, the speculative prefetcher should be able to handle some depth of reordering, so it is not as bad as it was before, but its tolerance is still not infinite.

@amotin amotin (Member) commented Aug 7, 2025

I'd be fine with raw_smp_processor_id(), mostly because we already use it in a couple of places as an index via CPU_SEQID_UNSTABLE.

As I see it, we use CPU_SEQID only inside kpreempt_disable(). So unless the block layer promises some sort of affinity for zvol callers, or you have any other reason not to, let's switch to raw_smp_processor_id() here, since we don't really care about CPU migrations in zvol code; the ID is not used for any sort of access serialization.
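For context, the distinction as the Linux SPL draws it (a paraphrased sketch; check the current headers before relying on the exact definitions):

```c
/* Stable only while preemption is off; callers must ensure that. */
#define	CPU_SEQID		smp_processor_id()

/* Best-effort hint; may be stale immediately after it is read. */
#define	CPU_SEQID_UNSTABLE	raw_smp_processor_id()
```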

@behlendorf behlendorf (Contributor) commented

Ah yeah, I agree raw_smp_processor_id() is a better fit here for the reasons @amotin mentioned.

@github-actions github-actions bot removed the Status: Accepted Ready to integrate (reviewed, tested) label Aug 8, 2025
@robn robn (Member, Author) commented Aug 8, 2025

Updated to use raw_smp_processor_id() and retested, looks good. Thanks both for the explainer, much appreciated :)

03987f7 (openzfs#16069) added a workaround to get the blk-mq hardware
context for older kernels that don't cache it in the struct request.
However, this workaround appears to be incomplete.

In 4.19, the rq data context is optional. If it's not initialised, then
the cached rq->cpu will be -1, and so using it to index into mq_map
causes a crash.

Given that the upstream 4.19 is now in extended LTS and rarely seen,
RHEL8 4.18+ has long carried "modern" blk-mq support, and the cached
hardware context has been available since 5.1, I'm not going to go to huge
lengths to get queue selection correct for the very few people that are
likely to feel it. To that end, we simply call raw_smp_processor_id() to
get a valid CPU id and use that instead.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label Aug 8, 2025
@behlendorf behlendorf merged commit b270663 into openzfs:master Aug 8, 2025
37 of 41 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 13, 2025
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#17597
ixhamza pushed a commit to truenas/zfs that referenced this pull request Aug 28, 2025
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
bugclerk pushed a commit to truenas/zfs that referenced this pull request Sep 8, 2025
(cherry picked from commit 3aa21bd)