Skip to content

Conversation

gbaraldi
Copy link
Member

@gbaraldi gbaraldi commented Feb 7, 2025

@gbaraldi gbaraldi added backport 1.10 Change should be backported to the 1.10 release backport 1.11 Change should be backported to release-1.11 backport 1.12 Change should be backported to release-1.12 labels Feb 7, 2025
@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

Still not clear to me why alignment issues could be causing a bug which manifests on multithreaded Julia, but not single-threaded one -- #57309 (comment).

@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

Also, could you test this on the same architecture + OS the user specified on #54560 to check whether it fixes the issue there?

@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

Requesting this because I tested on a bunch of processor models which were not AMD Ryzen 9 (what the user has), but couldn't reproduce in any of them...

@gbaraldi
Copy link
Member Author

gbaraldi commented Feb 7, 2025

The user mentioned this made it work on his machine :\

@LebedevRI
Copy link
Contributor

Requesting this because I tested on a bunch of processor models which were not AMD Ryzen 9 (what the user has), but couldn't reproduce in any of them...

But was that tested on the same distro, and same CPU architecture at least?
(FYI this is the first time i'm reading that it was tested this extensively)

Testing this PR too...

@LebedevRI
Copy link
Contributor

The user mentioned this made it work on his machine :\

For reference, the patch disscussed in Slack did:

diff --git a/src/threading.c b/src/threading.c
index d8b6b784c6..f3d289f310 100644
--- a/src/threading.c
+++ b/src/threading.c
@@ -346,7 +346,8 @@ jl_ptls_t jl_init_threadtls(int16_t tid)
 #endif
     if (jl_get_pgcstack() != NULL)
         abort();
-    jl_ptls_t ptls = (jl_ptls_t)calloc(1, sizeof(jl_tls_states_t));
+    jl_ptls_t ptls = (jl_ptls_t)aligned_alloc(alignof(jl_tls_states_t), sizeof(jl_tls_states_t));
+    memset(ptls, 0, sizeof(jl_tls_states_t));
 #ifndef _OS_WINDOWS_
     pthread_setspecific(jl_task_exit_key, (void*)ptls);
 #endif

@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

But was that tested on the same distro, and same CPU architecture at least?
(FYI this is the first time i'm reading that it was tested this extensively)

No, that's the point. We didn't catch this on CI and I didn't catch this on a few aarch64 and a few Intel x86_64, so that's why I'm requesting to check in this particular platform, which seems to be one of the few in which the bug manifests -- but that seems to be addressed by your comment already.

@LebedevRI
Copy link
Contributor

But was that tested on the same distro, and same CPU architecture at least?
(FYI this is the first time i'm reading that it was tested this extensively)

No, that's the point. We didn't catch this on CI and I didn't catch this on a few aarch64 and a few Intel x86_64, so that's why I'm requesting to check in this particular platform, which seems to be one of the few in which the bug manifests -- but that seems to be addressed by your comment already.

(compile running...)
Well, i would guess that it doesn't need to be an actual same processor,
any Ryzen would do i would think, and honestly even an intel CPU would do.
If libc has different default alignments, then it would be based on CPU vector width or something.
I would think the distro mismatch may be a bigger issue.

@LebedevRI
Copy link
Contributor

Ok, i CAN confirm that this PR fixes segfault i was seeing.
I do NOT know if this only hides the problem, or actually fixes it, or which change in master hid it previously.

@gbaraldi Thank you!

@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

Just to confirm: you cherry-picked this patch to 1.11 and tested it there, right? (Note this PR is against master, and the issue was not present there -- #54560 (comment)).

@LebedevRI
Copy link
Contributor

Just to confirm: you cherry-picked this patch to 1.11 and tested it there, right? (Note this PR is against master, and the issue was not present there -- #54560 (comment)).

Yes, i checkout'ed v1.11.3, built it, checked that the segfault is still there,
then applied patch, built it, and tested to no segfault.
I just successfully re-did that experiment.

@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

Perfect, thanks.

@LebedevRI
Copy link
Contributor

Just to repeat from Slack, it would be really good to know whether or not UBSan catches this issue.
(And if it does not, i would really love to have a reproducer, please)

Copy link
Member

@Keno Keno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should audit other uses of our allocators and replace them with a version that static_asserts the alignment requirement.

@gbaraldi
Copy link
Member Author

gbaraldi commented Feb 8, 2025

We should default to 16 byte alignment everywhere since that's what our GC promises to be consistent

@Keno Keno merged commit 79e98e3 into master Feb 9, 2025
5 of 7 checks passed
@Keno Keno deleted the gb/align-ptls branch February 9, 2025 00:36
KristofferC pushed a commit that referenced this pull request Feb 11, 2025
@KristofferC KristofferC mentioned this pull request Feb 11, 2025
32 tasks
KristofferC added a commit that referenced this pull request Feb 13, 2025
Backported PRs:
- [x] #57142 <!-- Add reference to time_ns in time -->
- [x] #57241 <!-- Handle `waitpid` race condition when `SIGCHLD` is set
to `SIG_IGN` -->
- [x] #57249 <!-- restore non-freebsd-unix fix for profiling -->
- [x] #57211 <!-- Ensure read/readavailable for BufferStream are
threadsafe -->
- [x] #57262 <!-- edit NEWS for v1.12 -->
- [x] #57226 <!-- cfunction: reimplement, as originally planned, for
reliable performance -->
- [x] #57253 <!-- bpart: Fully switch to partitioned semantics -->
- [x] #57273 <!-- fix "Right arrow autocompletes at line end"
implementation -->
- [x] #57280 <!-- dep: Update JuliaSyntax -->
- [x] #57229 <!-- staticdata: Close data race after backedge insertion
-->
- [x] #57298 <!-- Updating binding version to fix MMTk CI -->
- [x] #57248 <!-- improve concurrency safety for `Compiler.finish!` -->
- [x] #57312 <!-- Profile.print: de-focus sleeping frames as gray -->
- [x] #57289 <!-- Make `OncePerX` subtype `Function` -->
- [x] #57310 <!-- Make ptls allocations at least 128 byte aligned -->
- [x] #57311 <!-- Add a warning for auto-import of types -->
- [x] #57338 <!-- fix typo in Float32 random number generation -->
- [x] #57293 <!-- Fix getfield_tfunc when order or boundscheck is Vararg
-->
- [x] #57349 <!-- docs: fix-up world-age handling for META access -->
- [x] #57344 <!-- Add missing type asserts when taking the queue out of
the task struct -->
- [x] #57348 <!-- 🤖 [master] Bump the SparseArrays stdlib from 212981b
to 72c7cac -->
- [x] #55040 <!-- Allow macrocall as function sig -->
- [x] #57299 <!-- Add missing latestworld after parameterized type alias
-->
@KristofferC KristofferC removed the backport 1.12 Change should be backported to release-1.12 label Feb 13, 2025
KristofferC pushed a commit that referenced this pull request Feb 14, 2025
@KristofferC KristofferC mentioned this pull request Feb 14, 2025
33 tasks
KristofferC pushed a commit that referenced this pull request Feb 14, 2025
@KristofferC KristofferC removed backport 1.11 Change should be backported to release-1.11 backport 1.10 Change should be backported to the 1.10 release labels Feb 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

--threads option seems to now result in segfault in jl_init_thread_heap (julia/src/gc.c:3992)
6 participants