pcd1193182 commented Jul 10, 2025

Sponsored by: [Wasabi Technology, Inc; Klara, Inc]

Motivation and Context

Currently, after a failed allocation, the metaslab code recalculates the weight for a metaslab. However, for space-based metaslabs, it uses the maximum free segment size instead of the normal weighting algorithm. This is presumably because the normal metaslab weight is (roughly) intended to estimate the size of the largest free segment, but it doesn't do that reliably at most fragmentation levels[1]. This means that recalculated metaslabs are forced to a weight that isn't really using the same units as the rest of them, resulting in undesirable behaviors (mostly metaslabs never being selected again due to an artificially low weight).

As far as I can tell this code dates back to 2010, long before we had metaslab_should_allocate, segment-based metaslabs, or any of the modern features of the allocation code.

Description

We switch this to use the normal space-weighting function for this recalculation.
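
To illustrate why mixing the two kinds of weight is a problem, here is a toy model in Python. It is not the ZFS code: the `Metaslab` class, the `space_weight` and `rank` functions, and all of the numbers are hypothetical, and the space weight uses the simplified formula described in footnote 1.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass
class Metaslab:
    name: str
    free_bytes: int
    fragmentation_pct: int
    max_segment: int


def space_weight(ms: Metaslab) -> int:
    # Simplified space weight: free space scaled by (100 - fragmentation) / 100.
    return ms.free_bytes * (100 - ms.fragmentation_pct) // 100


def rank(metaslabs, failed_name, use_space_weight_on_failure):
    # Return metaslab names ordered from highest to lowest weight.
    # The metaslab named failed_name has just failed an allocation and is
    # re-weighted either by its largest free segment (old behavior) or by
    # the same space weight as every other metaslab (new behavior).
    weights = {}
    for ms in metaslabs:
        if ms.name == failed_name and not use_space_weight_on_failure:
            weights[ms.name] = ms.max_segment
        else:
            weights[ms.name] = space_weight(ms)
    return sorted(weights, key=weights.get, reverse=True)


# Hypothetical pool: A has by far the most free space, but only small segments.
metaslabs = [
    Metaslab("A", 400 * MIB, 50, 2 * MIB),
    Metaslab("B", 100 * MIB, 40, 8 * MIB),
    Metaslab("C", 20 * MIB, 30, 4 * MIB),
]

old_order = rank(metaslabs, "A", use_space_weight_on_failure=False)
new_order = rank(metaslabs, "A", use_space_weight_on_failure=True)
print("old:", old_order)
print("new:", new_order)
```

In this toy pool, re-weighting A by its 2MiB largest segment drops it below C, which has only 20MiB free, whereas re-weighting it with the common space weight keeps it comparable with the others.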

How Has This Been Tested?

Tested for correctness with the ZFS test suite. For performance effects, ran extensive performance testing on a highly fragmented pool. The change resulted in a 71% reduction in stddev of TXG sync times, and a 56% reduction in 99th percentile sync times. It also reduced the number of loads of metaslabs that did not result in eventual allocations by approximately 50%.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Footnotes:
1. The weight works by multiplying the total free space by (100 - fragmentation), and then dividing by 100. As one simple example of how this doesn't do a good job of approximating the largest free segment: if you have free chunks of size ~2MiB, this results in a division by roughly 1.5-2 (a 2MiB segment maps to 36 on the frag table, plus presumably many smaller free segments, which increase the fragmentation). However, if you have 50 of these chunks, this results in a weight that is ~25 times higher than the largest free segment.
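
The arithmetic in this footnote can be checked with a small sketch. It follows the formula exactly as stated in the footnote rather than the actual ZFS implementation; the function name `footnote_weight` and the 50% fragmentation value used for the second case are illustrative assumptions.

```python
MIB = 1024 * 1024


def footnote_weight(free_bytes: int, fragmentation_pct: int) -> int:
    # Footnote formula: free space * (100 - fragmentation) / 100.
    return free_bytes * (100 - fragmentation_pct) // 100


largest_segment = 2 * MIB               # every free chunk is ~2MiB
total_free = 50 * largest_segment       # 50 such chunks, 100MiB in total

for frag in (36, 50):
    weight = footnote_weight(total_free, frag)
    ratio = weight / largest_segment
    print(f"frag={frag}%: weight={weight // MIB}MiB, "
          f"{ratio:.0f}x the largest free segment")
```

With a fragmentation of 36 the weight is 64MiB, 32 times the largest free segment; at 50 it is 50MiB, the ~25x figure quoted above.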

Currently, after a failed allocation, the metaslab code recalculates the
weight for a metaslab. However, for space-based metaslabs, it uses the
maximum free segment size instead of the normal weighting
algorithm. This is presumably because the normal metaslab weight is
(roughly) intended to estimate the size of the largest free segment, but
it doesn't do that reliably at most fragmentation levels. This means
that recalculated metaslabs are forced to a weight that isn't really
using the same units as the rest of them, resulting in undesirable
behaviors. We switch this to use the normal space-weighting function.

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
@amotin added the "Status: Code Review Needed" (Ready for review and testing) label Jul 11, 2025
@behlendorf added the "Status: Accepted" (Ready to integrate (reviewed, tested)) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Jul 16, 2025
@behlendorf behlendorf merged commit c1e51c5 into openzfs:master Jul 16, 2025
23 checks passed
amotin pushed a commit that referenced this pull request Aug 5, 2025
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Closes #17531
ixhamza pushed a commit to truenas/zfs that referenced this pull request Aug 28, 2025
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
bugclerk pushed a commit to truenas/zfs that referenced this pull request Sep 8, 2025
(cherry picked from commit 5132f8c)