
Conversation

@razdoburdin (Contributor) commented Apr 7, 2025

This PR speeds up data initialization for large sparse datasets on multi-core CPUs by parallelizing the execution.
For the Bosch dataset, this PR improves fitting time by 1.3x on a 2x56-core system.

To avoid the race condition, I have also switched from using a bitfield as the missing flag to uint8_t.
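To illustrate the race being avoided, here is a minimal sketch (not the PR's actual code) contrasting per-row byte flags with packed bit flags:

#include <cstddef>
#include <cstdint>
#include <vector>

// Each row owns its own byte: concurrent writes to different rows never
// touch the same memory location, so no synchronization is needed.
void SetMissingByte(std::vector<uint8_t>& flags, std::size_t row) {
  flags[row] = 1;
}

// Rows 0..31 share bits[0], rows 32..63 share bits[1], and so on; the
// read-modify-write below races when two threads update rows in the same word.
void SetMissingBit(std::vector<uint32_t>& bits, std::size_t row) {
  bits[row / 32] |= (uint32_t{1} << (row % 32));
}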

@razdoburdin razdoburdin marked this pull request as draft April 7, 2025 13:00
@razdoburdin razdoburdin marked this pull request as ready for review April 11, 2025 09:29
@trivialfis (Member)

Note to myself:

I have also switched from using bitfields as missing flag to uint8_t.

That increases memory usage.

@razdoburdin (Contributor Author)

Hi @trivialfis, what is your opinion about this optimization?

@trivialfis (Member)

Apologies, coming back from a trip. Will look into the optimization.

@trivialfis (Member) left a comment

Could you please provide some data on the effect on memory usage where there are semi-dense columns?

@@ -233,7 +233,7 @@ class GHistIndexMatrix {
void PushAdapterBatchColumns(Context const* ctx, Batch const& batch, float missing,
size_t rbegin);

void ResizeIndex(const size_t n_index, const bool isDense);
void ResizeIndex(const size_t n_index, const bool isDense, int n_threads = 1);
Member

Could you please share in which case nthread=1 is used, and what are the other cases?

Contributor Author

I fixed the code, no default value now.

auto ref = RefResourceView{resource->DataAs<T>(), n_elements, resource};

size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
Member

Is this faster than std::fill_n for primitive data? Seems unlikely..

Contributor Author

It is, if the number of elements is high. There is a significant speed-up when the number of elements is ~1e8-1e9.
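For context, a minimal sketch of the kind of blocked parallel fill being discussed, assuming OpenMP; ParallelFill is an illustrative name, not the PR's function:

#include <algorithm>
#include <cstddef>
#include <omp.h>

// Fill a large primitive buffer in per-thread blocks. For ~1e8-1e9 elements
// the operation is memory-bandwidth bound, and several cores (and their
// memory channels) can outperform a single-threaded std::fill_n.
template <typename T>
void ParallelFill(T* data, std::size_t n_elements, T value, int n_threads) {
  std::size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
  {
    std::size_t tid = static_cast<std::size_t>(omp_get_thread_num());
    std::size_t begin = tid * block_size;
    std::size_t end = std::min(begin + block_size, n_elements);
    if (begin < end) {
      std::fill_n(data + begin, end - begin, value);
    }
  }
}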

Comment on lines 212 to 213
ColumnBinT* begin = &local_index[feature_offsets_[fid]];
begin[rid] = bin_id - index_base_[fid];
Member

These two lines look exactly the same as the following two lines.

Contributor Author

I moved the first line outside the branches. The second one differs.

public:
// get number of features
[[nodiscard]] bst_feature_t GetNumFeature() const {
return static_cast<bst_feature_t>(type_.size());
}

ColumnMatrix() = default;
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold) {
this->InitStorage(gmat, sparse_threshold);
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads = 1) {
Member

In which case is n_threads 1, and what are the other cases?

Contributor Author

Fixed.

@razdoburdin (Contributor Author)

Could you please provide some data on the effect on memory usage where there are semi-dense columns?

I measured peak memory consumption for the Bosch dataset with 224 threads. Master branch: 10.06 GB; this PR: 10.31 GB.

@trivialfis (Member) commented May 12, 2025

Got the following results from synthesized dense data, memory usage measured by cgmemtime.

* master

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.732778310775757 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 27.98164939880371}, 'Qdm': {'Train-DMatrix-Iter (sec)': 91.10861659049988, 'train (sec)': 25.732778310775757}}

user: 1809.226 s
sys:   28.329 s
wall: 119.860 s
child_RSS_high:   37892596 KiB
group_mem_high:   37677792 KiB

* opt pr

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.054997444152832 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 28.075414180755615}, 'Qdm': {'Train-DMatrix-Iter (sec)': 93.2668731212616, 'train (sec)': 25.054997444152832}}

user: 1807.715 s
sys:   31.093 s
wall: 121.895 s
child_RSS_high:   45232596 KiB
group_mem_high:   45032396 KiB

That's a 20 percent increase ((45032396 - 37677792) / 37677792) in memory usage for dense data. Are you sure you want this PR to go in? Asking since memory usage has been a pain point for XGBoost for a very long time. We receive issues mostly about memory usage instead of computation time, so we care about it a lot.

  • n_samples: 16777216 (2 ** 24)
  • n_features: 512
  • density: 1.0
  • dtype: f32

I used my custom benchmark scripts here: https://github.com/trivialfis/dxgb_bench.git (not very polished). I loaded the data using an iterator with arrays stored in .npy files. In addition, QuantileDMatrix is used. Feel free to use your own benchmark scripts.

I can test other sparsity levels if needed.

@razdoburdin (Contributor Author) commented May 19, 2025

I was able to return to the bitfield representation for the missing indicator without losing thread-safe access. It requires quite careful data management, but it combines the benefits of parallelization and low memory consumption. Some additional memory has to be allocated in this case for data alignment, but it is less than 4 bytes per feature in the worst case.
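As a rough illustration of the idea (the helper name is hypothetical, not the PR's code): size the per-thread blocks so their boundaries land on whole bitfield words, and no two threads ever read-modify-write the same word.

#include <cstddef>

// Block size per thread, rounded up to a multiple of 32 rows (one bitfield
// word), so that each thread's block starts and ends on a word boundary.
// The last thread simply gets fewer rows.
std::size_t AlignedBlockSize(std::size_t n_rows, int n_threads) {
  std::size_t block = n_rows / n_threads + (n_rows % n_threads > 0);
  return (block + 31) / 32 * 32;
}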

@trivialfis (Member) left a comment

Apologies for the slow response; I will do some tests myself. Please see the inline comments.

@@ -195,34 +236,42 @@ class ColumnMatrix {
}
};

void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold);
void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads);

template <typename ColumnBinT, typename BinT, typename RIdx>
void SetBinSparse(BinT bin_id, RIdx rid, bst_feature_t fid, ColumnBinT* local_index) {
Member

Is this function still used now that we have a new SetBinSparse?

Contributor Author

The original SetBinSparse is also used.

* If base_rowid > 0 we need to shift the block boundaries.
* Otherwise two threads may operate on the same word of the bitfield.
*/
size_t shift = MissingIndicator::BitFieldT::kValueSize -
Member

I don't quite understand how this shifting works. Could you please help clarify it? For starters, this should represent the number of samples each thread needs to shift. How is it related to the bit field value size? How is it related to the modulo? Why set it to 0 when it equals the value size?

Contributor Author

I don't quite understand how this shifting works. Could you please help clarify it? For starters, this should represent the number of samples each thread needs to shift. How is it related to the bit field value size? How is it related to the modulo? Why set it to 0 when it equals the value size?

If base_rowid > 0, we add a few samples to thread 0 so that each subsequent thread starts from a word boundary.

For instance, if the first batch had 35 rows, base_rowid for the second batch would be 35. The 36th row's missing flag is at bit 3 of the second word in our bitfield (word[1]). If we simply divided the work evenly, Thread 0 might process rows 35-66 and Thread 1 rows 67-98.
This would cause both threads to access and modify word[1] and word[2], leading to a race condition.

base_rowid % MissingIndicator::BitFieldT::kValueSize calculates the starting position of our new data within a 32-bit word.
For base_rowid = 35, the result is 3. This means we are 3 bits into the word. 32 - 3 gives 29. This shift value represents the number of rows (or bits) the first thread must process to get to the next clean word boundary. After processing these 29 rows, the next row to process will be at the start of a new 32-bit word, allowing subsequent blocks to be aligned.

If base_rowid is perfectly divisible by 32, the calculation becomes 32.
A shift of 32 is unnecessary because we are already at a perfect boundary. So we set shift = 0.

I have modified the comment to make this logic clearer.
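A minimal sketch of the shift computation described above, with kWordBits standing in for MissingIndicator::BitFieldT::kValueSize (the function name is illustrative):

#include <cstddef>

constexpr std::size_t kWordBits = 32;  // bits per bitfield word

// Number of extra rows thread 0 absorbs so that every subsequent thread
// starts its block on a word boundary of the bitfield.
std::size_t AlignmentShift(std::size_t base_rowid) {
  std::size_t shift = kWordBits - base_rowid % kWordBits;
  if (shift == kWordBits) {
    shift = 0;  // base_rowid is already word-aligned, nothing to absorb
  }
  return shift;
}

For base_rowid = 35 this yields 32 - 3 = 29, matching the example above.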

Member

You are talking about the external memory in XGBoost. Is the bitfield shared across multiple batches of data? Otherwise, the

the 36th row's missing flag is at bit 3

should not be true since there should be a different column matrix for each batch.

Contributor Author

You are talking about the external memory in XGBoost. Is the bitfield shared across multiple batches of data? Otherwise, the

the 36th row's missing flag is at bit 3

should not be true since there should be a different column matrix for each batch.

In the original code, the bitfield is allocated for the total number of elements, but each batch uses its own part. I didn't touch this part in my PR.

Member

Let me do some digging tomorrow.

Contributor Author

Got it.
I have run some more tests. It looks like the value of base_rowid used in SetIndexMixedColumns can differ from the base_rowid in GHistIndexMatrix.

In the case of external memory, the call chain is the following:
ExtGradientIndexPageSource::Fetch() calls PushAdapterBatchColumns with rbegin = 0

this->page_->PushAdapterBatchColumns(ctx_, value, this->missing_, rbegin);

GHistIndexMatrix::PushAdapterBatchColumns calls PushBatch with the same rbegin = 0

this->columns_->PushBatch(ctx->Threads(), batch, missing, *this, rbegin);

ColumnMatrix::PushBatch calls SetIndexMixedColumns with base_rowid = rbegin = 0

SetIndexMixedColumns(base_rowid, batch, gmat, missing);

So the value of base_rowid inside SetIndexMixedColumns is 0 in the case of external memory even if base_rowid in gmat is non-zero, and the shift for the bitfield is also zero in this case.

Member

Let me do more tests, probably need some cleanup/comments.

Member

Can we postpone this PR a little bit? It's an optimization for data initialization, and I find it quite difficult to understand. I will merge the inference PR.

Contributor Author

We can postpone it, but I think it shouldn't be postponed for long. Otherwise, resolving merge conflicts will be extremely tricky.
