trt-2246: lower parallelism based on worker nodes #30226
@neisw: This pull request references trt-2246, which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-agent-ha-dualstack-conformance
@neisw: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/399c85c0-8d73-11f0-83a7-92ac913ff9d9-0
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: neisw
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 4cb1659
New tests seen in this PR at sha: 4cb1659
/retest-required
workerParallelism := 10 * workerNodes
logrus.Infof("Parallelism based on worker node count: %d", workerParallelism)
parallelism = min(parallelism, workerParallelism)
}
time="2025-09-09T13:09:11Z" level=info msg="Suite defined parallelism 30"
time="2025-09-09T13:09:11Z" level=info msg="Found 1 worker nodes"
time="2025-09-09T13:09:11Z" level=info msg="Found 1 nodes"
time="2025-09-09T13:09:11Z" level=info msg="Parallelism based on worker node count: 10"
time="2025-09-09T13:09:11Z" level=info msg="Total nodes: 1, Worker nodes: 1, Parallelism: 10"
Runtime appears to jump from 1h40m to 2h30m.
The single-node upgrade job seems to have failed before it got far enough to run the tests, but I'd be curious to see what this change does to it. Those jobs probably disable fewer tests than microshift, so timeouts could be an issue.
This could help stabilize those jobs, but I wonder how many jobs lurking out there will start to time out, and whether we will notice.
microshift serial logs: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30226/pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial/1965382312942112768
time="2025-09-09T13:06:45Z" level=info msg="Suite defined parallelism 0"
time="2025-09-09T13:06:45Z" level=info msg="Found 1 worker nodes"
time="2025-09-09T13:06:45Z" level=info msg="Found 1 nodes"
time="2025-09-09T13:06:45Z" level=info msg="Parallelism based on worker node count: 10"
time="2025-09-09T13:06:45Z" level=info msg="Total nodes: 1, Worker nodes: 1, Parallelism: 10"
But the intervals indicate they do actually run serially despite the logging.
Microshift probably runs only very CPU-light tests (it doesn't run build tests, for example), so that increase is really unfortunate. Could we special-case it, or is there evidence that we need to reduce parallelism on microshift?
Jobs like e2e-agent-ha-dualstack-conformance only install 2 worker nodes. Running the default 30 tests in parallel overloads the CPU on those clusters. The default of 30 parallel tests assumes a standard 6-node cluster with 3 workers. This change lowers the default parallelism when fewer workers are detected.
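The capping rule described above can be sketched in isolation. This is a minimal illustration, not the PR's actual code: `capParallelism` and `perWorkerParallelism` are hypothetical names, the factor of 10 per worker is taken from the diff fragment in this conversation, and the guard for zero detected workers is an assumption. It also does not try to model how a suite-defined parallelism of 0 is handled.

```go
package main

import "fmt"

// perWorkerParallelism mirrors the factor of 10 seen in the diff fragment.
const perWorkerParallelism = 10

// capParallelism is a hypothetical helper: it returns the suite-defined
// parallelism, lowered to 10 per worker node when that is smaller.
func capParallelism(suiteParallelism, workerNodes int) int {
	if workerNodes <= 0 {
		// Assumption: if no workers are detected, leave the suite value alone.
		return suiteParallelism
	}
	workerParallelism := perWorkerParallelism * workerNodes
	if workerParallelism < suiteParallelism {
		return workerParallelism
	}
	return suiteParallelism
}

func main() {
	// Suite default of 30, as in the PR description.
	for _, workers := range []int{1, 2, 3} {
		fmt.Printf("workers=%d parallelism=%d\n", workers, capParallelism(30, workers))
	}
}
```

Under these assumptions, a 3-worker cluster keeps the full default of 30, a 2-worker cluster drops to 20, and a 1-worker cluster drops to 10, which matches the "Suite defined parallelism 30" log excerpt earlier in the conversation.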