Those are not designed for the same OpenMP-like use-cases as __`fork_union`__.
## Performance
208
209
209
-
One of the most common parallel workloads is the N-body simulation.
210
+
One of the most common parallel workloads is the N-body simulation ¹.
210
211
An implementation is available in both C++ and Rust in `scripts/nbody.cpp` and `scripts/nbody.rs` respectively.
Both are extremely lightweight, with little logic outside of the number-crunching, so they can easily be profiled with `time` and introspected with `perf` on Linux.
---
### C++ Benchmarks
C++ benchmarking results for $N=128$ bodies and $I=1e6$ iterations:
| Machine | OpenMP (D) | OpenMP (S) | Fork Union (D) | Fork Union (S) |
| :------ | ---------: | ---------: | -------------: | -------------: |
> ¹ Another common workload is "Parallel Reductions" covered in a separate [repository](https://github.com/ashvardanian/ParallelReductionsBenchmark).
> ² When a combination of performance and efficiency cores is used, dynamic stealing may be more efficient than static slicing.
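The difference between the two scheduling strategies can be sketched in a few lines of C++; both helpers below are illustrative stand-ins, not the library's API:

```cpp
#include <atomic>
#include <cstddef>

// Static slicing: thread `i` of `p` claims one contiguous chunk up front,
// [chunk_begin(n, p, i), chunk_begin(n, p, i + 1)). Cheap to compute, but
// slow (efficiency) cores finish their chunk later than fast ones.
std::size_t chunk_begin(std::size_t n, std::size_t p, std::size_t i) {
    return n / p * i + (i < n % p ? i : n % p);
}

// Dynamic stealing: every thread repeatedly grabs the next unclaimed part,
// so faster (performance) cores naturally end up doing more of the work.
std::size_t steal_next(std::atomic<std::size_t> &next_part) {
    return next_part.fetch_add(1, std::memory_order_relaxed);
}
```

Static slicing costs nothing at runtime, while stealing pays one atomic increment per part — worthwhile only when per-part runtimes differ enough, as on hybrid core designs.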
## Safety & Logic
There are only 4 atomic variables in this thread-pool, and some of them are practically optional.
Let's call every invocation of `for_each_*` a "fork", and every exit from it a "join".
| Atomic | Meaning | Role |
| :----- | :------ | :--- |
| `stop` | Stop the entire thread-pool | Tells workers when to exit the loop |
| `fork_generation` | "Forks" called since init | Tells workers to wake up on new forks |
| `threads_to_sync` | Threads not joined this fork | Tells main thread when workers finish |
__Why don't we need atomics for "total_threads"?__
The only way to change the number of threads is to `stop_and_reset` the entire thread-pool and then `try_spawn` it again.
Either of those operations can only be called from one thread at a time and never coincides with any running tasks.
That's ensured by the `stop` flag.
__Why don't we need atomics for a "job pointer"?__
A new task can only be submitted from one thread, which updates the number of parts for each new fork.
During that update, the workers are asleep, spinning on old values of `fork_generation` and `stop`.
They only wake up and access the new value once `fork_generation` increments, ensuring safety.
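The wake-up protocol can be sketched with standard atomics. This is an illustrative model whose names mirror the description above, not the library's actual code:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative model: `stop`, `fork_generation`, and `threads_to_sync`
// are the atomics described above; `tasks_run` stands in for real work.
struct pool_sketch {
    std::atomic<bool> stop{false};
    std::atomic<std::size_t> fork_generation{0};
    std::atomic<std::size_t> threads_to_sync{0};
    std::atomic<std::size_t> tasks_run{0};

    void worker() {
        std::size_t last_seen = 0;
        for (;;) {
            // Spin on the old values of `stop` and `fork_generation`
            while (!stop.load(std::memory_order_acquire) &&
                   fork_generation.load(std::memory_order_acquire) == last_seen)
                std::this_thread::yield();
            if (stop.load(std::memory_order_acquire)) return;
            last_seen = fork_generation.load(std::memory_order_acquire);
            tasks_run.fetch_add(1, std::memory_order_relaxed); // run the slice
            threads_to_sync.fetch_sub(1, std::memory_order_release); // "join"
        }
    }

    void fork(std::size_t workers) {
        threads_to_sync.store(workers, std::memory_order_relaxed);
        fork_generation.fetch_add(1, std::memory_order_release); // wake workers
        while (threads_to_sync.load(std::memory_order_acquire) != 0)
            std::this_thread::yield(); // block the caller until all "join"
    }
};
```

The release increment of `fork_generation` is what publishes the new job description: everything written before it becomes visible to workers that acquire the new generation, so the job pointer itself needs no atomicity.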
__How do we deal with overflows and `SIZE_MAX`-sized tasks?__
The library entirely avoids saturating multiplication and only uses one saturating addition in "release" builds.
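Saturating addition itself is a one-liner. A standalone sketch (the library's actual helper may differ; C++26 standardizes the same operation as `std::add_sat`):

```cpp
#include <cstddef>
#include <limits>

// Adds two sizes, clamping at SIZE_MAX instead of wrapping around,
// so an index can never silently overflow past the last task.
inline std::size_t add_sat(std::size_t a, std::size_t b) {
    std::size_t sum = a + b; // unsigned overflow wraps, which is well-defined
    return sum < a ? std::numeric_limits<std::size_t>::max() : sum;
}
```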
To test the consistency of arithmetic, the C++ template class can be instantiated with a custom `index_t`, such as `std::uint8_t` or `std::uint16_t`.
In the former case, no more than 255 threads can operate and no more than 255 tasks can be addressed, allowing us to easily test every weird corner case of [0:255] threads competing for [0:255] tasks.
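Such an exhaustive sweep might look as follows; `split_length` is a hypothetical stand-in for the library's splitting logic, not its actual API:

```cpp
#include <cstdint>

// With `index_t` = std::uint8_t: how many of `n` tasks land on worker
// `thread_index` out of `threads` total -- all in 8-bit arithmetic.
using index_t = std::uint8_t;
inline index_t split_length(index_t n, index_t threads, index_t thread_index) {
    index_t quotient = static_cast<index_t>(n / threads);
    index_t remainder = static_cast<index_t>(n % threads);
    return static_cast<index_t>(quotient + (thread_index < remainder ? 1 : 0));
}
```

Summing `split_length` over every worker must reproduce `n` for all combinations of [0:255] tasks and [1:255] threads — small enough to brute-force in milliseconds.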
0 commit comments