Skip to content

Conversation

b-pass
Copy link
Contributor

@b-pass b-pass commented Sep 12, 2025

Description

Per #5830, thread_local is faster than thread_specific_storage and all pybind11 platforms now support thread_local.

The internals_pp_manager can be changed to use thread_local, which should increase performance. While this holds the internals, it isn't part of the ABI so these changes don't required an internals version change.

Suggested changelog entry:

  • Changed internals to use thread_local instead of thread_specific_storage for increased performance.

@rwgk
Copy link
Collaborator

rwgk commented Sep 13, 2025

@swolchok Could you please help reviewing this PR?

Looks good to me. (I only spent a couple minutes looking at it. It seem pretty straightforward, but I'm not sure I have the full big picture.)

@swolchok
Copy link
Contributor

@swolchok Could you please help reviewing this PR?

Looks good to me. (I only spent a couple minutes looking at it. It seem pretty straightforward, but I'm not sure I have the full big picture.)

My opinion is very similar. I think it might be nice to have a benchmark that exercises these paths so that we can confirm the order of magnitude of the improvement; it seems like one could do that easily by locally patching pybind to initialize get_num_interpreters_seen to 2 and running pybind11_benchmark.

@rwgk
Copy link
Collaborator

rwgk commented Sep 13, 2025

@b-pass Are you set up to do the benchmarking easily?

@swolchok
Copy link
Contributor

I ran the benchmarking on my M4 Mac. Measuring before and after, with and without the num_interpreters_seen counter hardcoded to initialize to 2. 3 trials of python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' each time, numbers in nsec per loop.

before

baseline: 48.2, 47.5, 49.4
hardcode multiple subinterpreters: 52.1, 53, 53.9

after

baseline: 49.8, 47.3, 48.6
hardcode multiple subinterpreters: 49.8, 49.9, 50.1

There is some noise, but it seems clear enough to me that this change is a small improvement in the multiple subinterpreters case.

@rwgk
Copy link
Collaborator

rwgk commented Sep 13, 2025

Thanks @swolchok, that's awesome.

I'd take this PR just for the simplification aspect. My main worry: Is it correct? My mental model of threading is sketchy. If both of you think this is correct, let's merge this?

@rwgk
Copy link
Collaborator

rwgk commented Sep 13, 2025

Forgot to add: By "simplification" I mean that we're not using thread_specific_storage<PyInterpreterState> anymore.

Strictly speaking, since the members are static, the instances must also be singletons or this wouldn't work.  They already are, but we can make the class enforce it to be more 'self-documenting'.
@b-pass
Copy link
Contributor Author

b-pass commented Sep 13, 2025

Is it correct?

Yes, I think the only functionality difference between the thread_specific_storage version and the thread_local version is that the TSS version has instance members but the TL version has statics instead. Since there is only ever one instance of each specialization of the manager, the behavior is exactly the same in both cases.

To make it more fool proof, I just pushed a change that makes the internals_pp_manager enforce that it is a singleton (with a private constructor) so that future changes won't accidentally break it.

@swolchok
Copy link
Contributor

My main worry: Is it correct?

Looks pretty safe to me by construction. I would merge it if it was up to me.

@rwgk
Copy link
Collaborator

rwgk commented Sep 14, 2025

Thanks @b-pass and @swolchok!

@rwgk rwgk merged commit 326b106 into pybind:master Sep 14, 2025
85 checks passed
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Sep 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs changelog Possibly needs a changelog entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants