Add some helpers to report potential problem on scheduler side #182
Conversation
I really like this module; it's going to be very useful. The code looks great as well, but I have a suggestion. The docstrings for check_runq_wakelist and check_idle_runq have really useful context, and it's a bit of a shame for that context to be hidden away in the code.
Do you think it would be possible to make each explanation a global variable, and then have the run_queue_check() function print each explanation at the end, only if necessary? Something like this:
EXPLAIN_WAKELIST = """
[TASKS ON WAKE LIST]
A runq's wakelist temporarily holds tasks that are about to be woken
up on that CPU. If this list has multiple tasks, it usually means that
this CPU has missed multiple scheduler IPIs. This can imply issues
like IRQs being disabled for too long, IRQ delivery issues between hypervisor
and VM, or some other issue.
"""
EXPLAIN_IDLE_RUNQ = """
[IDLE RUN QUEUE]
...
"""
Then, each function that performs checks could take a parameter, explanations: Set[str], and you could do:
```python
explanations.add(EXPLAIN_WAKELIST)
print(f"cpu: {cpu} has following tasks in its runq wake_list:")
...
print("See TASKS ON WAKE LIST below")
```
And then run_queue_check() could handle printing out the explanations:
```python
explanations = set()
for cpu in for_each_online_cpu(prog):
    check_runq_wakelist(prog, cpu, explanations)
    ...
if not explanations:
    return
print("Note: found some possible run queue issues. Explanations below:")
for explanation in explanations:
    print(explanation)
```
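For reference, here is a minimal, self-contained sketch of how these pieces could fit together. The check function and its input data are stubbed out (the real helper reads run queue state from the vmcore via drgn), so only the explanations plumbing is the point here:

```python
# Minimal sketch of the suggested pattern. The per-CPU task data is
# fabricated; in the real module it would come from drgn helpers.
from typing import Dict, List, Set, Tuple

EXPLAIN_WAKELIST = """
[TASKS ON WAKE LIST]
A runq's wake_list temporarily holds tasks that are about to be woken
up on that CPU. Multiple tasks here usually mean this CPU has missed
multiple scheduler IPIs.
"""


def check_runq_wakelist(
    cpu: int, tasks: List[Tuple[int, str]], explanations: Set[str]
) -> None:
    # Flag the CPU only when more than one task is parked on the list.
    if len(tasks) > 1:
        explanations.add(EXPLAIN_WAKELIST)
        print(f"cpu: {cpu} has following tasks in its runq wake_list:")
        for pid, comm in tasks:
            print(f"task pid: {pid}, comm: {comm}")
        print("See TASKS ON WAKE LIST below")


def run_queue_check(per_cpu_tasks: Dict[int, List[Tuple[int, str]]]) -> None:
    explanations: Set[str] = set()
    for cpu, tasks in per_cpu_tasks.items():
        check_runq_wakelist(cpu, tasks, explanations)
    if not explanations:
        return
    print("Note: found some possible run queue issues. Explanations below:")
    for explanation in explanations:
        print(explanation)


# Example run with made-up data standing in for vmcore contents:
run_queue_check({23: [(150, "migration/23"), (24469, "cvfwd")], 0: []})
```

Because explanations is a set, each explanation prints at most once, no matter how many CPUs trip the same check.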
I think this would be really helpful, even for support and sustaining engineers, because we don't all have the context for why a detected issue may be a problem or what next steps we should look into.
Yes, it's definitely possible, and I agree that these help strings can give better context for this data. I have made the suggested change.
Issues like missed IPIs, or IRQs staying disabled for too long, can create observable anomalies on the scheduler side. For example, a scheduler IPI that is missed or absent due to a hypervisor issue can cause migration/X threads on VMs to get stuck forever, causing a softlockup. Knowing about these anomalies can help in locating the actual problem. For example, the following snippet shows that the migration/23 thread was stuck in a runq's wake_list:

```
cpu: 23 has following tasks in its runq wake_list:
task pid: 273301, comm: ora_ipc0_csadsd
task pid: 281335, comm: ora_lms0_ccawsd
task pid: 391691, comm: ora_ipc0_webprd
task pid: 390722, comm: ora_ipc0_dppd1
task pid: 394144, comm: ora_lmd1_audprd
task pid: 394450, comm: ora_lms1_rbsspr
task pid: 393235, comm: ora_lmd0_etudpr
task pid: 24469, comm: cvfwd
task pid: 357613, comm: tnslsnr
task pid: 351700, comm: ocssd.bin
task pid: 394519, comm: ora_dia0_wspd1
task pid: 394307, comm: ora_lms1_wmsprd
task pid: 394773, comm: ora_lms0_ccadmp
task pid: 351141, comm: ocssd.bin
task pid: 394690, comm: ora_lms0_wspd1
task pid: 351774, comm: ocssd.bin
task pid: 351678, comm: ocssd.bin
task pid: 351692, comm: ocssd.bin
task pid: 351683, comm: ocssd.bin
task pid: 351680, comm: ocssd.bin
task pid: 351686, comm: ocssd.bin
task pid: 351681, comm: ocssd.bin
task pid: 351688, comm: ocssd.bin
task pid: 150, comm: migration/23
```

This will block migration/X threads running on other CPUs. The helpers and corelens module added here can detect such issues.

Signed-off-by: Imran Khan <[email protected]>
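For readers unfamiliar with how such a check is implemented, a rough sketch of walking each CPU's wake list with drgn might look like the following. The field names used here (rq.wake_list as an llist_head, and task_struct linking into it via wake_entry) hold only on some kernel versions and are assumptions for illustration, not the exact code from this PR:

```python
# Hedged sketch: dump tasks parked on each online CPU's runq wake_list.
# Assumes struct rq has an llist_head 'wake_list' and that tasks link
# into it via task_struct.wake_entry; both fields moved/changed in
# newer kernels, so treat this as illustrative only.
from drgn.helpers.linux.cpumask import for_each_online_cpu
from drgn.helpers.linux.llist import llist_for_each_entry
from drgn.helpers.linux.percpu import per_cpu


def dump_runq_wake_lists(prog):
    for cpu in for_each_online_cpu(prog):
        # Fetch this CPU's instance of the per-cpu 'runqueues' variable.
        rq = per_cpu(prog["runqueues"], cpu)
        tasks = list(
            llist_for_each_entry(
                "struct task_struct", rq.wake_list.first, "wake_entry"
            )
        )
        if tasks:
            print(f"cpu: {cpu} has following tasks in its runq wake_list:")
            for task in tasks:
                pid = task.pid.value_()
                comm = task.comm.string_().decode()
                print(f"task pid: {pid}, comm: {comm}")
```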
@brenns10 thanks a lot for taking a look and sharing your feedback. I have addressed the review comments, but at the moment I see some jobs failing. This looks like a setup issue.