Conversation

@robn robn (Member) commented Aug 6, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

03987f7 (#16069) added a workaround to get the blk-mq hardware context for older kernels that don't cache it in the struct request. However, this workaround appears to be incomplete.

In 4.19, the rq data context is optional. If it's not initialised, the cached rq->cpu will be -1, and using it to index into mq_map causes a crash.
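Roughly, the failing shape looks like this (a simplified sketch using 4.19 block-layer field names; not the exact ZFS compat code):

```c
#include <linux/blk-mq.h>

/*
 * Sketch of the workaround's failure mode on 4.19. The q->mq_map and
 * q->queue_hw_ctx names follow the 4.19 block layer.
 */
static inline struct blk_mq_hw_ctx *
rq_hctx_sketch(struct request *rq)
{
	struct request_queue *q = rq->q;

	/*
	 * If the request's data context was never initialised, rq->cpu
	 * is still -1, so this reads before the start of mq_map.
	 */
	return (q->queue_hw_ctx[q->mq_map[rq->cpu]]);
}
```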

Description

Given that upstream 4.19 is now in extended LTS and rarely seen, RHEL8's 4.18+ has long carried "modern" blk-mq support, and the cached hardware context has been available since 5.1, I'm not going to go to huge lengths to get queue selection correct for the very few people likely to feel it. To that end, we simply call get_cpu() to get a valid CPU id and use that instead.
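A minimal sketch of that fallback, assuming the same helper shape as above (the final revision switched to raw_smp_processor_id(); see the discussion below):

```c
#include <linux/blk-mq.h>
#include <linux/smp.h>

static inline struct blk_mq_hw_ctx *
rq_hctx_sketch(struct request *rq)
{
	struct request_queue *q = rq->q;
	struct blk_mq_hw_ctx *hctx;

	/* get_cpu() returns a valid CPU id and disables preemption. */
	int cpu = get_cpu();
	hctx = q->queue_hw_ctx[q->mq_map[cpu]];
	put_cpu();

	return (hctx);
}
```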

How Has This Been Tested?

Since 03987f7, zvol_stress against 4.19.325 will crash with a bad memory access within a few seconds. With this change, the entire zvol test tag runs to successful completion.

Compile checked against 5.4.294, which is the oldest kernel I have lying around after 4.19. It defines HAVE_BLK_MQ_RQ_HCTX, so it would not hit this codepath.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin (Member) left a comment

This was quite a puzzle for years, but I just wonder: couldn't we check rq->cpu for the mentioned -1 value, if that's what it is when not initialized? Can there be some legal cases where rq->cpu is intentionally not equal to get_cpu()? Though the effects should be limited to performance, so I won't insist for such old versions.
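Something like this, perhaps (an illustrative sketch only, reusing the mq_map indexing shape from above):

```c
#include <linux/blk-mq.h>
#include <linux/smp.h>

/* Illustrative only: fall back to a live CPU id when rq->cpu is unset. */
static inline int
rq_cpu_or_current(struct request *rq)
{
	int cpu = rq->cpu;

	if (cpu < 0) {
		cpu = get_cpu();
		/* Staleness after put_cpu() is fine for a selection hint. */
		put_cpu();
	}
	return (cpu);
}
```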

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Aug 6, 2025
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 6, 2025
@robn robn (Member, Author) commented Aug 7, 2025

This was quite a puzzle for years, but I just wonder: couldn't we check rq->cpu for the mentioned -1 value, if that's what it is when not initialized? Can there be some legal cases where rq->cpu is intentionally not equal to get_cpu()?

@amotin I did think about making sure that rq->cpu was in the range of valid CPU ids, but then I started thinking about things like: what if a hotplugged CPU is offline, or locked out, or crashed, or outside a cpuset, or any of the other things that might mean it's not usable for IO? And then I went "eh", because I really just wanted to fix the compile, not become an expert in ancient blk-mq. And to be sure, your suggestion of -1 is far narrower, but I think the same still applies: I don't know what most of this means, and for a niche kernel range that probably no one is using, it should be fine. Especially since it's been crashing for over a year and no one noticed!

(Reading that back, I kinda sound like an asshole, maybe. I dunno, I feel like there's an indicator of some other problem here, but maybe it's just that Linux is hard to track, so more of the same.)

@amotin amotin (Member) commented Aug 7, 2025

I did think about making sure that rq->cpu was in the range of valid CPU ids

@robn I agree that checking for validity would be overkill. But you mentioned the specific value of -1, which does not sound uninitialized to me, but intentionally left undefined, which is not the same. But if you think it can't really be trusted -- I won't argue.

BTW, should this be smp_processor_id() or raw_smp_processor_id()? Because according to the comments in include/linux/smp.h, the first may warn if used in a context not bound to the CPU.
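For reference, the difference (with CONFIG_DEBUG_PREEMPT enabled):

```c
#include <linux/smp.h>

int cpu;

/*
 * Checked variant: with CONFIG_DEBUG_PREEMPT this warns
 * "BUG: using smp_processor_id() in preemptible code" unless
 * preemption is disabled or the task is pinned to one CPU.
 */
cpu = smp_processor_id();

/*
 * Unchecked variant: no warning, but the value may be stale as soon
 * as it is read. Fine when it is only a load-spreading hint.
 */
cpu = raw_smp_processor_id();
```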

@robn robn (Member, Author) commented Aug 7, 2025

@amotin Fair. I only saw initialisation to -1, but I didn't read in a lot of depth. Can't stress enough how much I'm trying to avoid learning this code!

I'd be fine with raw_smp_processor_id(), mostly because we already use it in a couple of places as an index via CPU_SEQID_UNSTABLE. Though we use both.

Maybe a different/better question: ultimately, we're hunting down queue_num so we can use it as an input to cityhash(zv, offset, qnum), and from there select one of our own taskqs. I've not thought about it hard, but I suppose that's about very roughly keeping IO to the same region of a zvol on the same taskq? Or roughly matching taskqs to request queues? To what extent is getting this right important? Is there a simpler thing we could do if getting a CPU id is fiddly?
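For reference, the shape I mean (an illustrative sketch only: cityhash4 is the OpenZFS hash with the fourth input unused here, and zv_taskqs/tq_count are placeholder names, not the actual zvol_os.c code):

```c
#include <sys/cityhash.h>
#include <sys/taskq.h>

/* Hash (zvol, offset, queue) down to one of our own taskqs. */
uint64_t h = cityhash4((uintptr_t)zv, (uint64_t)offset,
    (uint64_t)queue_num, 0);
taskq_t *tq = zv_taskqs[h % tq_count];
```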

@amotin amotin (Member) commented Aug 7, 2025

@robn The goal of this logic is to keep requests that may be sequential together, to reduce reordering between them and avoid confusing the speculative prefetcher too much, while at the same time getting as even a load distribution across multiple taskqueues as possible. The assumption is that requests coming through different queues may already be reordered, so we don't care much about those. At the same time, even within one queue we may have multiple unrelated sequential streams, which we can try to roughly split by offset. Since my round of work there last year, the speculative prefetcher should be able to handle some depth of reordering, so it is not as bad as it was before, but its tolerance is still not infinite.

@amotin amotin (Member) commented Aug 7, 2025

I'd be fine with raw_smp_processor_id(), mostly because we already use it in a couple of places as an index via CPU_SEQID_UNSTABLE.

As I see it, we use CPU_SEQID only inside kpreempt_disable(). So unless the block layer promises some sort of affinity for zvol callers, or you have any other reason not to, let's switch to raw_smp_processor_id() here, since we don't really care about CPU migrations in zvol code; the ID is not used for any sort of access serialization.
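For context, the distinction as the Linux SPL draws it (a paraphrased sketch; check the current headers before relying on the exact definitions):

```c
/* Stable only while preemption is off; callers must ensure that. */
#define	CPU_SEQID		smp_processor_id()

/* Best-effort hint; may be stale immediately after it is read. */
#define	CPU_SEQID_UNSTABLE	raw_smp_processor_id()
```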

@behlendorf behlendorf (Contributor) commented

Ah yeah, I agree raw_smp_processor_id() is a better fit here for the reasons @amotin mentioned.

@github-actions github-actions bot removed the Status: Accepted Ready to integrate (reviewed, tested) label Aug 8, 2025
@robn robn (Member, Author) commented Aug 8, 2025

Updated to use raw_smp_processor_id() and retested, looks good. Thanks both for the explainer, much appreciated :)

03987f7 (openzfs#16069) added a workaround to get the blk-mq hardware
context for older kernels that don't cache it in the struct request.
However, this workaround appears to be incomplete.

In 4.19, the rq data context is optional. If it's not initialised, then
the cached rq->cpu will be -1, and so using it to index into mq_map
causes a crash.

Given that the upstream 4.19 is now in extended LTS and rarely seen,
RHEL8 4.18+ has long carried "modern" blk-mq support, and the cached
hardware context has been available since 5.1, I'm not going to go to huge
lengths to get queue selection correct for the very few people that are
likely to feel it. To that end, we simply call raw_smp_processor_id() to
get a valid CPU id and use that instead.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label Aug 8, 2025
@behlendorf behlendorf merged commit b270663 into openzfs:master Aug 8, 2025
37 of 41 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 13, 2025
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#17597
ixhamza pushed a commit to truenas/zfs that referenced this pull request Aug 28, 2025
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
bugclerk pushed a commit to truenas/zfs that referenced this pull request Sep 8, 2025
(cherry picked from commit 3aa21bd)