Optimization of data initialization for large sparse datasets #11390
base: master
Conversation
Note to myself:
That is increasing memory usage.
hi @trivialfis ,
Apologies, coming back from a trip. Will look into the optimization.
Could you please provide some data on the effect of memory usage where there are semi-dense columns?
src/data/gradient_index.h (Outdated)
```diff
@@ -233,7 +233,7 @@ class GHistIndexMatrix {
   void PushAdapterBatchColumns(Context const* ctx, Batch const& batch, float missing,
                                size_t rbegin);

-  void ResizeIndex(const size_t n_index, const bool isDense);
+  void ResizeIndex(const size_t n_index, const bool isDense, int n_threads = 1);
```
Could you please share in which case nthread=1, and what are the other cases?
I fixed the code, no default value now.
src/common/ref_resource_view.h (Outdated)
```cpp
auto ref = RefResourceView{resource->DataAs<T>(), n_elements, resource};

size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
```
Is this faster than std::fill_n for primitive data? Seems unlikely.
It is, if the number of elements is high. There is a significant speed-up for ~1e8-1e9 elements.
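For intuition, the blocking scheme from the ref_resource_view.h snippet can be sketched standalone. This is not the PR's code: it reuses the same ceiling-division block size, but uses std::thread instead of OpenMP so it compiles without -fopenmp; ParallelFill is an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of a blocked parallel fill: split n_elements into n_threads
// contiguous blocks and run std::fill_n on each block concurrently.
// (std::thread stands in for the PR's "#pragma omp parallel".)
template <typename T>
void ParallelFill(T* data, std::size_t n_elements, T value, int n_threads) {
  // Ceiling division, same formula as in the snippet above:
  // block_size = n / n_threads + (n % n_threads > 0)
  std::size_t block_size =
      n_elements / n_threads + (n_elements % n_threads > 0);
  std::vector<std::thread> workers;
  for (int tid = 0; tid < n_threads; ++tid) {
    std::size_t begin = block_size * tid;
    std::size_t end = std::min(begin + block_size, n_elements);
    if (begin >= end) break;  // trailing threads may have no work
    workers.emplace_back(
        [=] { std::fill_n(data + begin, end - begin, value); });
  }
  for (auto& w : workers) w.join();
}
```

Whether this beats a single std::fill_n depends on size: for small buffers the thread start-up cost dominates, which is presumably why the gain only shows up at very large element counts.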
src/common/column_matrix.h (Outdated)
```cpp
ColumnBinT* begin = &local_index[feature_offsets_[fid]];
begin[rid] = bin_id - index_base_[fid];
```
These two lines look exactly the same as the following two lines.
I moved the first line outside the branches. The second one differs.
src/common/column_matrix.h (Outdated)
```diff
  public:
   // get number of features
   [[nodiscard]] bst_feature_t GetNumFeature() const {
     return static_cast<bst_feature_t>(type_.size());
   }

   ColumnMatrix() = default;
-  ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold) {
+  ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads = 1) {
     this->InitStorage(gmat, sparse_threshold);
```
In which case is n_threads 1, and what are the other cases?
fixed
I measured the peak memory consumption for …
Got the following results from synthesized dense data, memory usage measured by …
That's a 20 percent increase.
I used my custom benchmark scripts here: https://github.com/trivialfis/dxgb_bench.git (not very polished). I loaded the data using an iterator with arrays stored in … I can test other sparsity levels if needed.
I was able to return to the bitfield representation for the missing indicator without losing thread-safe access. It requires quite careful data management, but allows combining the benefits of parallelization and low memory consumption. Some additional memory has to be allocated in this case for data alignment, but it is less than 4 bytes per feature in the worst case.
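To make the trade-off concrete, here is some back-of-envelope arithmetic (my own illustrative helpers, not the PR's actual layout): a uint8_t flag costs one byte per (row, feature) entry, while a bitfield costs one bit plus, if each feature's bit range is padded out to a 32-bit word boundary as described, at most 4 bytes of padding per feature.

```cpp
#include <cstddef>

constexpr std::size_t kWordBytes = 4;  // one 32-bit bitfield word

// Bitfield size when each feature's bit range is padded up to a whole
// number of 32-bit words (at most kWordBytes of padding per feature).
std::size_t BitfieldBytes(std::size_t n_rows, std::size_t n_features) {
  std::size_t words_per_feature = (n_rows + 31) / 32;  // ceil(n_rows / 32)
  return words_per_feature * kWordBytes * n_features;
}

// One uint8_t flag per (row, feature) entry.
std::size_t ByteFlagBytes(std::size_t n_rows, std::size_t n_features) {
  return n_rows * n_features;
}
```

For 35 rows and 10 features this gives 80 bytes of padded bitfield versus 350 bytes of uint8_t flags, so the bitfield stays much cheaper even with the alignment padding.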
Apologies for the slow response; I will do some tests myself. Please see the inline comments.
```diff
@@ -195,34 +236,42 @@ class ColumnMatrix {
   }
 };

-  void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold);
+  void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads);

   template <typename ColumnBinT, typename BinT, typename RIdx>
   void SetBinSparse(BinT bin_id, RIdx rid, bst_feature_t fid, ColumnBinT* local_index) {
```
Is this function still used now that we have a new SetBinSparse?
The original SetBinSparse is also used.
Co-authored-by: Jiaming Yuan <[email protected]>
Co-authored-by: Jiaming Yuan <[email protected]>
```cpp
 * If base_rowid > 0, we need to shift the block boundaries.
 * Otherwise two threads may operate on a single word of the bitfield.
 */
size_t shift = MissingIndicator::BitFieldT::kValueSize -
```
I don't quite understand how this shifting works. Could you please help clarify it? For starters, this should represent the number of samples each thread needs to shift. How is it related to the bit field value size? How is it related to the modulo? Why set it to 0 when it equals the value size?
> I don't quite understand how this shifting works. Could you please help clarify it? For starters, this should represent the number of samples each thread needs to shift. How is it related to the bit field value size? How is it related to the modulo? Why set it to 0 when it equals the value size?
If base_rowid > 0, we add a few samples to thread 0 so that each subsequent thread starts from a word boundary.
For instance, if the first batch had 35 rows, base_rowid for the second batch would be 35. The 36th row's missing flag is at bit 3 of the second word in our bitfield (word[1]). If we simply divided the work evenly, thread 0 might process rows 35-66 and thread 1 rows 67-98. Both threads would then access and modify word[2], leading to a race condition.
base_rowid % MissingIndicator::BitFieldT::kValueSize calculates the starting position of our new data within a 32-bit word. For base_rowid = 35, the result is 3, meaning we are 3 bits into the word. 32 - 3 gives 29: this shift value is the number of rows (bits) the first thread must process to reach the next clean word boundary. After processing these 29 rows, the next row to process is at the start of a new 32-bit word, so all subsequent blocks are aligned.
If base_rowid is perfectly divisible by 32, the calculation yields 32. A shift of 32 is unnecessary because we are already at a word boundary, so we set shift = 0.
I have modified the comment to make this logic clearer.
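The arithmetic above can be written out as a tiny standalone helper (kValueSize is fixed to 32 here, and BoundaryShift is my illustrative name, not the PR's code):

```cpp
#include <cstddef>

constexpr std::size_t kValueSize = 32;  // bit width of one bitfield word

// Extra rows assigned to thread 0 so that every subsequent thread's
// block starts on a fresh bitfield word.
std::size_t BoundaryShift(std::size_t base_rowid) {
  std::size_t shift = kValueSize - base_rowid % kValueSize;
  // shift == kValueSize means base_rowid is already word-aligned,
  // so no extra rows are needed.
  return shift == kValueSize ? 0 : shift;
}
```

For base_rowid = 35 this yields 29, matching the example above; for any multiple of 32 (including 0) it yields 0.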
You are talking about the external memory in XGBoost. Is the bitfield shared across multiple batches of data? Otherwise, "the 36th row's missing flag is at bit 3" should not be true, since there should be a different column matrix for each batch.
> You are talking about the external memory in XGBoost. Is the bitfield shared across multiple batches of data? Otherwise, "the 36th row's missing flag is at bit 3" should not be true, since there should be a different column matrix for each batch.

In the original code, the bitfield is allocated for the total number of elements, but each batch uses its own part. I didn't touch this part in my PR.
Let me do some digging tomorrow.
For the two different cases, see:
Got it.
I have made some more tests. It looks like the value of base_rowid one operates with in SetIndexMixedColumns can differ from the base_rowid in GHistIndexMatrix.
In the case of external memory, the call chain is the following:
1. ExtGradientIndexPageSource::Fetch() calls PushAdapterBatchColumns with rbegin = 0:
   this->page_->PushAdapterBatchColumns(ctx_, value, this->missing_, rbegin);
2. GHistIndexMatrix::PushAdapterBatchColumns calls PushBatch with the same rbegin = 0 (xgboost/src/data/gradient_index.cc, Line 122 in e4406da):
   this->columns_->PushBatch(ctx->Threads(), batch, missing, *this, rbegin);
3. ColumnMatrix::PushBatch calls SetIndexMixedColumns with base_rowid = rbegin = 0 (xgboost/src/common/column_matrix.h, Line 287 in e4406da):
   SetIndexMixedColumns(base_rowid, batch, gmat, missing);

So the value of base_rowid inside SetIndexMixedColumns is 0 in the case of external memory even if base_rowid in gmat is non-zero, and the shift for the bitfield is also zero in this case.
Let me do more tests, probably need some cleanup/comments.
Can we postpone this PR a little bit? It's an optimization for data initialization, and I find it quite difficult to understand. I will merge the inference PR.
We can postpone it, but I think it shouldn't be postponed for long. Otherwise, resolving merge conflicts will be extremely tricky.
This PR speeds up data initialization for large sparse datasets on multi-core CPUs by parallelizing the execution.
For the bosch dataset, this PR improves fitting time by 1.3x on a 2x56-core system. To avoid the race condition, I have also switched from using bitfields as the missing-value flag to uint8_t.