Commit 6e6f81a: "Docs: New results"

1 parent 58ee32e
1 file changed: +26 -38 lines


README.md

There are many other thread-pool implementations that are more feature-rich but have different limitations and design goals.

- Modern C++: [`taskflow/taskflow`](https://github.com/taskflow/taskflow), [`progschj/ThreadPool`](https://github.com/progschj/ThreadPool), [`bshoshany/thread-pool`](https://github.com/bshoshany/thread-pool)
- Traditional C++: [`vit-vit/CTPL`](https://github.com/vit-vit/CTPL), [`mtrebi/thread-pool`](https://github.com/mtrebi/thread-pool)
- Rust: [`tokio-rs/tokio`](https://github.com/tokio-rs/tokio), [`rayon-rs/rayon`](https://github.com/rayon-rs/rayon), [`smol-rs/smol`](https://github.com/smol-rs/smol)

Those are not designed for the same OpenMP-like use-cases as __`fork_union`__.
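
To make the OpenMP-like use-case concrete, here is a minimal sketch of a single fork/join in both styles. The header name, the `fork_union_t` spelling, and the `for_n` call are illustrative assumptions modeled on the `for_*` naming used later in this README, not verbatim library API:

```cpp
#include <cstddef> // std::size_t
#include <thread>  // std::thread::hardware_concurrency
#include <vector>

#include <fork_union.hpp> // assumed header name

int main() {
    std::vector<float> a(4096, 1.f), b(4096, 2.f);

    // OpenMP spelling of the same fork/join:
    //   #pragma omp parallel for
    //   for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];

    fork_union_t pool;                                        // assumed type name
    if (!pool.try_spawn(std::thread::hardware_concurrency())) // `try_spawn` as named below
        return 1;

    // One "fork": each index is a task; returning from the call is the "join".
    pool.for_n(a.size(), [&](std::size_t i) { a[i] += b[i]; }); // assumed `for_*` API
    return 0;
}
```
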
## Performance

One of the most common parallel workloads is the N-body simulation ¹.
An implementation is available in both C++ and Rust, in `scripts/nbody.cpp` and `scripts/nbody.rs` respectively.
Both are extremely lightweight and involve little logic outside of number-crunching, so both can easily be profiled with `time` and introspected with the `perf` Linux tools.
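
For orientation, the timed kernel is an all-pairs force accumulation. The single-threaded toy below sketches its general shape (it is not a copy of `scripts/nbody.cpp`); the outer loop over bodies is the part a thread-pool splits across cores:

```cpp
#include <cmath>   // std::sqrt
#include <cstddef> // std::size_t
#include <vector>

struct body_t { float x, y, z, vx, vy, vz; };

// One simulation step: O(N^2) force accumulation, then integration.
void step(std::vector<body_t> &bodies, float dt) {
    std::size_t const n = bodies.size();
    for (std::size_t i = 0; i != n; ++i) { // the loop a "fork" would split
        float ax = 0.f, ay = 0.f, az = 0.f;
        for (std::size_t j = 0; j != n; ++j) {
            if (i == j) continue;
            float dx = bodies[j].x - bodies[i].x;
            float dy = bodies[j].y - bodies[i].y;
            float dz = bodies[j].z - bodies[i].z;
            float inv = 1.f / std::sqrt(dx * dx + dy * dy + dz * dz + 1e-9f);
            float inv3 = inv * inv * inv; // ~ G * m / r^3 with unit masses
            ax += dx * inv3, ay += dy * inv3, az += dz * inv3;
        }
        bodies[i].vx += ax * dt, bodies[i].vy += ay * dt, bodies[i].vz += az * dt;
    }
    for (auto &b : bodies) b.x += b.vx * dt, b.y += b.vy * dt, b.z += b.vz * dt;
}
```
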

---

C++ benchmarking results for $N=128$ bodies and $I=10^6$ iterations, with dynamic (D) and static (S) scheduling:

| Machine        | OpenMP (D) | OpenMP (S) | Fork Union (D) | Fork Union (S) |
| :------------- | ---------: | ---------: | -------------: | -------------: |
| 16x Intel SPR  | 20.3s      | 16.0s      | 18.1s          | 10.3s          |
| 12x Apple M2   | ?          | 1m:16.7s   | 1m:30.3s ²     | 1m:40.7s ²     |
| 96x Graviton 4 | 32.2s      | 20.8s      | 39.8s          | 26.0s          |

Rust benchmarking results for $N=128$ bodies and $I=10^6$ iterations, with the same scheduling notation:

| Machine        | Rayon (D) | Rayon (S) | Fork Union (D) | Fork Union (S) |
| :------------- | --------: | --------: | -------------: | -------------: |
| 16x Intel SPR  | 51.4s     | 38.1s     | 15.9s          | 9.8s           |
| 12x Apple M2   | 3m:23.5s  | 2m:0.6s   | 4m:8.4s        | 1m:20.8s       |
| 96x Graviton 4 | 2m:13.9s  | 1m:35.6s  | 18.9s          | 10.1s          |

> ¹ Another common workload is "Parallel Reductions", covered in a separate [repository](https://github.com/ashvardanian/ParallelReductionsBenchmark).
> ² When a combination of performance and efficiency cores is used, dynamic stealing may be more efficient than static slicing.

## Safety & Logic

There are only 3 core atomic variables in this thread-pool, and some of them are practically optional.
Let's call every invocation of a `for_*` API a "fork", and every exit from it a "join".

| Variable          | User's Perspective           | Internal Usage                         |
| :---------------- | :--------------------------- | :------------------------------------- |
| `stop`            | Stop the entire thread-pool  | Tells workers when to exit the loop    |
| `fork_generation` | "Forks" called since init    | Tells workers to wake up on new forks  |
| `threads_to_sync` | Threads not joined this fork | Tells main thread when workers finish  |
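
A simplified sketch of how a worker might consume these variables (illustrative only, not the library's actual loop; the `run_my_share` callback stands in for the user's task slice):

```cpp
#include <atomic>
#include <cstddef>

std::atomic<bool> stop {false};               // shuts the whole pool down
std::atomic<std::size_t> fork_generation {0}; // bumped once per "fork"
std::atomic<std::size_t> threads_to_sync {0}; // workers still busy this fork

// Each worker spins on `fork_generation`, runs its slice, then decrements
// `threads_to_sync` so the main thread knows when the "join" is complete.
void worker_loop(void (*run_my_share)()) {
    std::size_t last_seen = 0;
    while (!stop.load(std::memory_order_acquire)) {
        std::size_t current = fork_generation.load(std::memory_order_acquire);
        if (current == last_seen) continue; // no new fork yet, keep spinning
        last_seen = current;
        run_my_share();                     // user-supplied work for this thread
        threads_to_sync.fetch_sub(1, std::memory_order_release);
    }
}
```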

__Why don't we need atomics for "total_threads"?__
The only way to change the number of threads is to `stop_and_reset` the entire thread-pool and then `try_spawn` it again.
Either of those operations can only be called from one thread at a time and never coincides with any running tasks.
That's ensured by the `stop`.
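
Sketched as a helper, that rule makes any resize a full stop-and-respawn from one controlling thread; `stop_and_reset` and `try_spawn` are named in this README, the rest is assumed:

```cpp
#include <cstddef>

// Hypothetical resize helper: both calls must come from a single thread,
// with no forks in flight - `stop` makes every worker exit its loop first.
template <typename pool_t>
bool resize(pool_t &pool, std::size_t new_thread_count) {
    pool.stop_and_reset();                   // workers observe `stop` and join
    return pool.try_spawn(new_thread_count); // relaunch at the new size
}
```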

__Why don't we need atomics for a "job pointer"?__
A new task can only be submitted from one thread, which updates the number of parts for each new fork.
During that update, the workers are asleep, spinning on old values of `fork_generation` and `stop`.
They only wake up and access the new value once `fork_generation` increments, ensuring safety.
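
The submit side of that handshake could look roughly like this (a sketch; only `fork_generation`, `stop`, and `threads_to_sync` are named in this README, and the job-slot shape is assumed):

```cpp
#include <atomic>
#include <cstddef>

std::atomic<std::size_t> fork_generation {0};
std::atomic<std::size_t> threads_to_sync {0};

// Plain, non-atomic job state: safe because workers only read it after
// observing the `fork_generation` increment that follows these writes.
void (*job_pointer)(std::size_t) = nullptr; // assumed shape of the job slot
std::size_t job_parts = 0;

void fork_and_join(void (*job)(std::size_t), std::size_t parts, std::size_t workers) {
    job_pointer = job; // workers still spin on the old generation value,
    job_parts = parts; // so these plain writes are race-free
    threads_to_sync.store(workers, std::memory_order_relaxed);
    fork_generation.fetch_add(1, std::memory_order_release); // publish the fork

    // The "join": wait until every worker decrements the counter to zero.
    while (threads_to_sync.load(std::memory_order_acquire) != 0) /* spin */;
}
```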

__How do we deal with overflows and `SIZE_MAX`-sized tasks?__
The library entirely avoids saturating multiplication and only uses one saturating addition in "release" builds.
To test the consistency of arithmetic, the C++ template class can be instantiated with a custom `index_t`, such as `std::uint8_t` or `std::uint16_t`.
In the former case, no more than 255 threads can operate and no more than 255 tasks can be addressed, allowing us to easily test every weird corner case of [0:255] threads competing for [0:255] tasks.
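
For illustration, a generic saturating addition over an unsigned `index_t` can be written as below (a sketch, not the library's exact helper):

```cpp
#include <cstdint>
#include <limits>

// Saturating addition for unsigned index types: on overflow the wrapped
// result dips below `a`, so we clamp to the maximum instead of wrapping.
template <typename index_t>
index_t add_sat(index_t a, index_t b) {
    index_t sum = static_cast<index_t>(a + b);
    return sum < a ? std::numeric_limits<index_t>::max() : sum;
}

// With `index_t = std::uint8_t`: add_sat(250, 10) == 255, not 4 - exactly
// the [0:255] corner cases that the tiny instantiations exercise.
```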

## Testing and Benchmarking
