
Conversation

@razdoburdin (Contributor) commented Apr 7, 2025

This PR speeds up data initialization for large sparse datasets on multi-core CPUs by parallelizing the execution.
For the Bosch dataset, this PR improves fitting time by 1.3x on a 2x56-core system.

To avoid the race condition, I have also switched from using a bitfield as the missing flag to uint8_t.
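To illustrate the race being avoided, here is a minimal sketch (not the PR's actual code) contrasting per-row byte flags with packed bit flags:

#include <cstddef>
#include <cstdint>
#include <vector>

// Each row owns its own byte: concurrent writes to different rows never
// touch the same memory location, so no synchronization is needed.
void SetMissingByte(std::vector<uint8_t>& flags, std::size_t row) {
  flags[row] = 1;
}

// Rows 0..31 share bits[0], rows 32..63 share bits[1], and so on; the
// read-modify-write below races when two threads update rows in the same word.
void SetMissingBit(std::vector<uint32_t>& bits, std::size_t row) {
  bits[row / 32] |= (uint32_t{1} << (row % 32));
}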

@razdoburdin razdoburdin marked this pull request as draft April 7, 2025 13:00
@razdoburdin razdoburdin marked this pull request as ready for review April 11, 2025 09:29
@trivialfis (Member)

Note to myself:

I have also switched from using bitfields as missing flag to uint8_t.

That increases memory usage.

@razdoburdin (Contributor Author)

Hi @trivialfis, what is your opinion about this optimization?

@trivialfis (Member)

Apologies, coming back from a trip. Will look into the optimization.

@trivialfis (Member) left a comment

Could you please provide some data on the effect on memory usage where there are semi-dense columns?

@@ -233,7 +233,7 @@ class GHistIndexMatrix {
void PushAdapterBatchColumns(Context const* ctx, Batch const& batch, float missing,
size_t rbegin);

void ResizeIndex(const size_t n_index, const bool isDense);
void ResizeIndex(const size_t n_index, const bool isDense, int n_threads = 1);
Member

Could you please share in which case nthread=1 is used, and what are the other cases?

Contributor Author

I fixed the code, no default value now.

auto ref = RefResourceView{resource->DataAs<T>(), n_elements, resource};

size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
Member

Is this faster than std::fill_n for primitive data? Seems unlikely..

Contributor Author

It is, if the number of elements is high. There is a significant speed-up when the number of elements is ~1e8-1e9.
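For context, a minimal sketch of the kind of blocked parallel fill being discussed, assuming OpenMP; ParallelFill is an illustrative name, not the PR's function:

#include <algorithm>
#include <cstddef>
#include <omp.h>

// Fill a large primitive buffer in per-thread blocks. For ~1e8-1e9 elements
// the operation is memory-bandwidth bound, and several cores (and their
// memory channels) can outperform a single-threaded std::fill_n.
template <typename T>
void ParallelFill(T* data, std::size_t n_elements, T value, int n_threads) {
  std::size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
  {
    std::size_t tid = static_cast<std::size_t>(omp_get_thread_num());
    std::size_t begin = tid * block_size;
    std::size_t end = std::min(begin + block_size, n_elements);
    if (begin < end) {
      std::fill_n(data + begin, end - begin, value);
    }
  }
}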

Comment on lines 212 to 213
ColumnBinT* begin = &local_index[feature_offsets_[fid]];
begin[rid] = bin_id - index_base_[fid];
Member

These two lines look exactly the same as the following two lines.

Contributor Author

I moved the first line outside the branches. The second one differs.

public:
// get number of features
[[nodiscard]] bst_feature_t GetNumFeature() const {
return static_cast<bst_feature_t>(type_.size());
}

ColumnMatrix() = default;
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold) {
this->InitStorage(gmat, sparse_threshold);
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads = 1) {
Member

In which case is n_threads 1, and what are the other cases?

Contributor Author

Fixed.

@razdoburdin (Contributor Author)

Could you please provide some data on the effect on memory usage where there are semi-dense columns?

I measured peak memory consumption for the Bosch dataset with 224 threads. Master branch: 10.06 GB; this PR: 10.31 GB.

@trivialfis (Member) commented May 12, 2025

Got the following results from synthesized dense data, memory usage measured by cgmemtime.

* master

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.732778310775757 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 27.98164939880371}, 'Qdm': {'Train-DMatrix-Iter (sec)': 91.10861659049988, 'train (sec)': 25.732778310775757}}

user: 1809.226 s
sys:   28.329 s
wall: 119.860 s
child_RSS_high:   37892596 KiB
group_mem_high:   37677792 KiB

* opt pr

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.054997444152832 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 28.075414180755615}, 'Qdm': {'Train-DMatrix-Iter (sec)': 93.2668731212616, 'train (sec)': 25.054997444152832}}

user: 1807.715 s
sys:   31.093 s
wall: 121.895 s
child_RSS_high:   45232596 KiB
group_mem_high:   45032396 KiB

That's a 20 percent increase ((45032396 - 37677792) / 37677792) in memory usage for dense data. Are you sure you want this PR to go in? Asking since memory usage has been a pain point for XGBoost for a very long time. We receive issues mostly about memory usage instead of computation time, so we care about it a lot.

  • n_samples: 16777216 (2 ** 24)
  • n_features: 512
  • density: 1.0
  • dtype: f32

I used my custom benchmark scripts here: https://github.com/trivialfis/dxgb_bench.git (not very polished). I loaded the data using an iterator with arrays stored in .npy files. In addition, QuantileDMatrix is used. Feel free to use your own benchmark scripts.

I can test other sparsity levels if needed.

@razdoburdin (Contributor Author) commented May 19, 2025

I was able to return to the bitfield representation for the missing indicator without losing thread-safe access. It requires quite careful data management, but it combines the benefits of parallelization and low memory consumption. Some additional memory has to be allocated in this case for data alignment, but it is less than 4 bytes per feature in the worst case.
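As a rough illustration of the idea (the helper name is hypothetical, not the PR's code): size the per-thread blocks so their boundaries land on whole bitfield words, and no two threads ever read-modify-write the same word.

#include <cstddef>

// Block size per thread, rounded up to a multiple of 32 rows (one bitfield
// word), so that each thread's block starts and ends on a word boundary.
// The last thread simply gets fewer rows.
std::size_t AlignedBlockSize(std::size_t n_rows, int n_threads) {
  std::size_t block = n_rows / n_threads + (n_rows % n_threads > 0);
  return (block + 31) / 32 * 32;
}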

@trivialfis (Member) left a comment

Apologies for the slow response; I will do some tests myself. Please see the inline comments.

@@ -195,34 +236,42 @@ class ColumnMatrix {
}
};

void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold);
void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads);

template <typename ColumnBinT, typename BinT, typename RIdx>
void SetBinSparse(BinT bin_id, RIdx rid, bst_feature_t fid, ColumnBinT* local_index) {
Member

Is this function still used now that we have a new SetBinSparse?

Contributor Author

The original SetBinSparse is also used.

* If base_rowid > 0 we need to shift the block boundaries.
* Otherwise two threads may operate on the same word of the bitfield.
*/
size_t shift = MissingIndicator::BitFieldT::kValueSize -
Member

I don't quite understand how this shifting works. Could you please help clarify it? For starters, this should represent the number of samples each thread needs to shift. How is it related to the bit field value size? How is it related to the modulo? Why set it to 0 when it equals the value size?

Contributor Author

I don't quite understand how this shifting works. Could you please help clarify it? For starters, this should represent the number of samples each thread needs to shift. How is it related to the bit field value size? How is it related to the modulo? Why set it to 0 when it equals the value size?

If base_rowid > 0, we add a few samples to thread 0 so that each subsequent thread starts from a word boundary.

For instance, if the first batch had 35 rows, base_rowid for the second batch would be 35. The 36th row's missing flag is at bit 3 of the second word in our bitfield (word[1]). If we simply divided the work evenly, Thread 0 might process rows 35-66 and Thread 1 rows 67-98.
This would cause both threads to access and modify word[1] and word[2], leading to a race condition.

base_rowid % MissingIndicator::BitFieldT::kValueSize calculates the starting position of our new data within a 32-bit word.
For base_rowid = 35, the result is 3. This means we are 3 bits into the word. 32 - 3 gives 29. This shift value represents the number of rows (or bits) the first thread must process to get to the next clean word boundary. After processing these 29 rows, the next row to process will be at the start of a new 32-bit word, allowing subsequent blocks to be aligned.

If base_rowid is perfectly divisible by 32, the calculation becomes 32.
A shift of 32 is unnecessary because we are already at a perfect boundary. So we set shift = 0.

I have modified the comment to make this logic clearer.
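A minimal sketch of the shift computation described above, with kWordBits standing in for MissingIndicator::BitFieldT::kValueSize (the function name is illustrative):

#include <cstddef>

constexpr std::size_t kWordBits = 32;  // bits per bitfield word

// Number of extra rows thread 0 absorbs so that every subsequent thread
// starts its block on a word boundary of the bitfield.
std::size_t AlignmentShift(std::size_t base_rowid) {
  std::size_t shift = kWordBits - base_rowid % kWordBits;
  if (shift == kWordBits) {
    shift = 0;  // base_rowid is already word-aligned, nothing to absorb
  }
  return shift;
}

For base_rowid = 35 this yields 32 - 3 = 29, matching the example above.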

Member

You are talking about the external memory in XGBoost. Is the bitfield shared across multiple batches of data? Otherwise, the

the 36th row's missing flag is at bit 3

should not be true since there should be a different column matrix for each batch.

Contributor Author

You are talking about the external memory in XGBoost. Is the bitfield shared across multiple batches of data? Otherwise, the

the 36th row's missing flag is at bit 3

should not be true since there should be a different column matrix for each batch.

In the original code, the bitfield is allocated for the total number of elements, but each batch uses its own part. I didn't touch this part in my PR.

Member

Let me do some digging tomorrow.

Contributor Author

Got it.
I have run some more tests. It looks like the value of base_rowid used in SetIndexMixedColumns can differ from the base_rowid in GHistIndexMatrix.

In the case of external memory, the call chain is the following:
ExtGradientIndexPageSource::Fetch() calls PushAdapterBatchColumns with rbegin = 0

this->page_->PushAdapterBatchColumns(ctx_, value, this->missing_, rbegin);

GHistIndexMatrix::PushAdapterBatchColumns calls PushBatch with the same rbegin = 0

this->columns_->PushBatch(ctx->Threads(), batch, missing, *this, rbegin);

ColumnMatrix::PushBatch calls SetIndexMixedColumns with base_rowid = rbegin = 0

SetIndexMixedColumns(base_rowid, batch, gmat, missing);

So the value of base_rowid inside SetIndexMixedColumns is 0 in the case of external memory even if base_rowid in gmat is non-zero, and the shift for the bitfield is also zero in this case.

Member

Let me do more tests, probably need some cleanup/comments.

Member

Can we postpone this PR a little bit? It's an optimization for data initialization, and I find it quite difficult to understand. I will merge the inference PR.

Contributor Author

We can postpone it, but I think it shouldn't be postponed for long. Otherwise, resolving merge conflicts will be extremely tricky.
