Use thread_local instead of thread_specific_storage for internals #5834

b-pass · 2025-09-12T22:09:19Z

Description

Per #5830, thread_local is faster than thread_specific_storage and all pybind11 platforms now support thread_local.

The internals_pp_manager can be changed to use thread_local, which should increase performance. While this holds the internals, it isn't part of the ABI so these changes don't required an internals version change.

Suggested changelog entry:

Changed internals to use thread_local instead of thread_specific_storage for increased performance.

…gement thread_local is faster.

rwgk · 2025-09-13T04:07:40Z

@swolchok Could you please help reviewing this PR?

Looks good to me. (I only spent a couple minutes looking at it. It seem pretty straightforward, but I'm not sure I have the full big picture.)

swolchok · 2025-09-13T07:04:33Z

@swolchok Could you please help reviewing this PR?

Looks good to me. (I only spent a couple minutes looking at it. It seem pretty straightforward, but I'm not sure I have the full big picture.)

My opinion is very similar. I think it might be nice to have a benchmark that exercises these paths so that we can confirm the order of magnitude of the improvement; it seems like one could do that easily by locally patching pybind to initialize get_num_interpreters_seen to 2 and running pybind11_benchmark.

rwgk · 2025-09-13T15:17:01Z

@b-pass Are you set up to do the benchmarking easily?

swolchok · 2025-09-13T16:09:42Z

I ran the benchmarking on my M4 Mac. Measuring before and after, with and without the num_interpreters_seen counter hardcoded to initialize to 2. 3 trials of python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' each time, numbers in nsec per loop.

before

baseline: 48.2, 47.5, 49.4
hardcode multiple subinterpreters: 52.1, 53, 53.9

after

baseline: 49.8, 47.3, 48.6
hardcode multiple subinterpreters: 49.8, 49.9, 50.1

There is some noise, but it seems clear enough to me that this change is a small improvement in the multiple subinterpreters case.

rwgk · 2025-09-13T16:22:14Z

Thanks @swolchok, that's awesome.

I'd take this PR just for the simplification aspect. My main worry: Is it correct? My mental model of threading is sketchy. If both of you think this is correct, let's merge this?

rwgk · 2025-09-13T16:23:24Z

Forgot to add: By "simplification" I mean that we're not using thread_specific_storage<PyInterpreterState> anymore.

Strictly speaking, since the members are static, the instances must also be singletons or this wouldn't work. They already are, but we can make the class enforce it to be more 'self-documenting'.

b-pass · 2025-09-13T20:07:06Z

Is it correct?

Yes, I think the only functionality difference between the thread_specific_storage version and the thread_local version is that the TSS version has instance members but the TL version has statics instead. Since there is only ever one instance of each specialization of the manager, the behavior is exactly the same in both cases.

To make it more fool proof, I just pushed a change that makes the internals_pp_manager enforce that it is a singleton (with a private constructor) so that future changes won't accidentally break it.

swolchok · 2025-09-14T15:37:47Z

My main worry: Is it correct?

Looks pretty safe to me by construction. I would merge it if it was up to me.

rwgk · 2025-09-14T16:06:45Z

Thanks @b-pass and @swolchok!

Use thread_local instead of thread_specific_storage for internals man…

f8e90f5

…gement thread_local is faster.

Make the pp manager a singleton.

372d994

Strictly speaking, since the members are static, the instances must also be singletons or this wouldn't work. They already are, but we can make the class enforce it to be more 'self-documenting'.

rwgk approved these changes Sep 14, 2025

View reviewed changes

rwgk merged commit 326b106 into pybind:master Sep 14, 2025
85 checks passed

github-actions bot added the needs changelog Possibly needs a changelog entry label Sep 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use thread_local instead of thread_specific_storage for internals #5834

Use thread_local instead of thread_specific_storage for internals #5834

Uh oh!

b-pass commented Sep 12, 2025

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

swolchok commented Sep 13, 2025

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

swolchok commented Sep 13, 2025

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

b-pass commented Sep 13, 2025 •

edited

Loading

Uh oh!

swolchok commented Sep 14, 2025

Uh oh!

rwgk commented Sep 14, 2025

Uh oh!

Uh oh!

Uh oh!

Use thread_local instead of thread_specific_storage for internals #5834

Use thread_local instead of thread_specific_storage for internals #5834

Uh oh!

Conversation

b-pass commented Sep 12, 2025

Description

Suggested changelog entry:

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

swolchok commented Sep 13, 2025

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

swolchok commented Sep 13, 2025

before

after

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

rwgk commented Sep 13, 2025

Uh oh!

b-pass commented Sep 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

swolchok commented Sep 14, 2025

Uh oh!

rwgk commented Sep 14, 2025

Uh oh!

Uh oh!

Uh oh!

b-pass commented Sep 13, 2025 •

edited

Loading