-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Use thread_local instead of thread_specific_storage for internals #5834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…gement thread_local is faster.
@swolchok Could you please help reviewing this PR? Looks good to me. (I only spent a couple minutes looking at it. It seem pretty straightforward, but I'm not sure I have the full big picture.) |
My opinion is very similar. I think it might be nice to have a benchmark that exercises these paths so that we can confirm the order of magnitude of the improvement; it seems like one could do that easily by locally patching pybind to initialize get_num_interpreters_seen to 2 and running pybind11_benchmark. |
@b-pass Are you set up to do the benchmarking easily? |
I ran the benchmarking on my M4 Mac. Measuring before and after, with and without the num_interpreters_seen counter hardcoded to initialize to 2. 3 trials of beforebaseline: 48.2, 47.5, 49.4 afterbaseline: 49.8, 47.3, 48.6 There is some noise, but it seems clear enough to me that this change is a small improvement in the multiple subinterpreters case. |
Thanks @swolchok, that's awesome. I'd take this PR just for the simplification aspect. My main worry: Is it correct? My mental model of threading is sketchy. If both of you think this is correct, let's merge this? |
Forgot to add: By "simplification" I mean that we're not using |
Strictly speaking, since the members are static, the instances must also be singletons or this wouldn't work. They already are, but we can make the class enforce it to be more 'self-documenting'.
Yes, I think the only functionality difference between the thread_specific_storage version and the thread_local version is that the TSS version has instance members but the TL version has statics instead. Since there is only ever one instance of each specialization of the manager, the behavior is exactly the same in both cases. To make it more fool proof, I just pushed a change that makes the internals_pp_manager enforce that it is a singleton (with a private constructor) so that future changes won't accidentally break it. |
Looks pretty safe to me by construction. I would merge it if it was up to me. |
Description
Per #5830,
thread_local
is faster thanthread_specific_storage
and all pybind11 platforms now supportthread_local
.The
internals_pp_manager
can be changed to usethread_local
, which should increase performance. While this holds the internals, it isn't part of the ABI so these changes don't required an internals version change.Suggested changelog entry:
thread_local
instead ofthread_specific_storage
for increased performance.