@neisw neisw commented Sep 9, 2025

Jobs like e2e-agent-ha-dualstack-conformance install only 2 worker nodes. Running the default 30 tests in parallel overloads the CPU on those clusters. The default of 30 parallel tests assumes a standard 6-node cluster with 3 workers. This change lowers the default parallelism when fewer workers are detected.
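
To make the arithmetic concrete, here is a minimal sketch of the cap described above. It is not the code in this PR: capParallelism and perWorkerParallelism are hypothetical names, the factor of 10 per worker is taken from the diff hunk quoted in the review thread, and the handling of a zero worker count is an assumption.

```go
package main

import "fmt"

// perWorkerParallelism mirrors the "10 * workerNodes" factor from the
// diff hunk quoted in the review thread.
const perWorkerParallelism = 10

// capParallelism is a hypothetical helper: it limits the suite's requested
// parallelism to what the detected worker count is assumed to sustain.
func capParallelism(suiteParallelism, workerNodes int) int {
	if workerNodes <= 0 {
		// Assumption: if no workers are detected, keep the suite value.
		return suiteParallelism
	}
	limit := perWorkerParallelism * workerNodes
	if limit < suiteParallelism {
		return limit
	}
	return suiteParallelism
}

func main() {
	// With the default of 30: 1 worker gives 10, 2 workers give 20,
	// and 3 workers leave the default of 30 unchanged.
	for _, workers := range []int{1, 2, 3} {
		fmt.Printf("workers=%d parallelism=%d\n", workers, capParallelism(30, workers))
	}
}
```

Under these assumptions a 2-worker job such as e2e-agent-ha-dualstack-conformance would run 20 tests in parallel, while a standard 3-worker cluster keeps the default of 30.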

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 9, 2025

openshift-ci-robot commented Sep 9, 2025

@neisw: This pull request references trt-2246 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


neisw commented Sep 9, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-agent-ha-dualstack-conformance

openshift-ci bot commented Sep 9, 2025

@neisw: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/399c85c0-8d73-11f0-83a7-92ac913ff9d9-0

openshift-ci bot commented Sep 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neisw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2025

openshift-trt bot commented Sep 9, 2025

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 4cb1659

Job Name New Test Risk
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn Medium - "Find the input image origin_scos-4.21_tools and tag it into the pipeline" is a new test, and was only seen in one job.

New tests seen in this PR at sha: 4cb1659

  • "Find the input image origin_scos-4.21_tools and tag it into the pipeline" [Total: 1, Pass: 1, Fail: 0, Flake: 0]


neisw commented Sep 9, 2025

/retest-required

workerParallelism := 10 * workerNodes
logrus.Infof("Parallelism based on worker node count: %d", workerParallelism)
parallelism = min(parallelism, workerParallelism)
}

From https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30226/pull-ci-openshift-origin-main-e2e-aws-ovn-microshift/1965382306705182720

time="2025-09-09T13:09:11Z" level=info msg="Suite defined parallelism 30"
time="2025-09-09T13:09:11Z" level=info msg="Found 1 worker nodes"
time="2025-09-09T13:09:11Z" level=info msg="Found 1 nodes"
time="2025-09-09T13:09:11Z" level=info msg="Parallelism based on worker node count: 10"
time="2025-09-09T13:09:11Z" level=info msg="Total nodes: 1, Worker nodes: 1, Parallelism: 10"

Runtime appears to jump from 1h40m to 2h30m.

The single-node upgrade job seems to have failed before getting far enough to run the tests, but I'd be curious to see what this change does to that one. Those jobs probably disable fewer tests than microshift, so timeouts could be an issue.
This could help stabilize those jobs, but I wonder how many jobs lurking out there will start to time out, and whether we will notice.

microshift serial logs: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30226/pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial/1965382312942112768

time="2025-09-09T13:06:45Z" level=info msg="Suite defined parallelism 0"
time="2025-09-09T13:06:45Z" level=info msg="Found 1 worker nodes"
time="2025-09-09T13:06:45Z" level=info msg="Found 1 nodes"
time="2025-09-09T13:06:45Z" level=info msg="Parallelism based on worker node count: 10"
time="2025-09-09T13:06:45Z" level=info msg="Total nodes: 1, Worker nodes: 1, Parallelism: 10"

But the intervals indicate they do actually run serially despite the logging.
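
One way the log line could reflect that, sketched under assumptions rather than taken from this PR: describeParallelism is a hypothetical helper, and it assumes that a suite-defined parallelism of 0 means the suite runs serially, so the worker-based limit should not be reported as the effective value.

```go
package main

import "fmt"

// describeParallelism is a hypothetical helper, not a function from the PR.
// Assumption: suiteParallelism <= 0 means the suite runs serially.
func describeParallelism(suiteParallelism, workerLimit int) string {
	if suiteParallelism <= 0 {
		// Report the serial run instead of the unused worker-based limit.
		return fmt.Sprintf("Suite runs serially; worker-based limit %d not applied", workerLimit)
	}
	effective := suiteParallelism
	if workerLimit < effective {
		effective = workerLimit
	}
	return fmt.Sprintf("Effective parallelism: %d", effective)
}

func main() {
	fmt.Println(describeParallelism(0, 10))  // serial suite, as in the microshift-serial log
	fmt.Println(describeParallelism(30, 10)) // parallel suite capped by one worker
}
```

With a message like this, the serial job's log would no longer suggest that 10 tests run at once.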


Microshift probably runs only very CPU-light tests (it doesn't run build tests, for example), so that increase is really unfortunate. Could we special-case it, or is there evidence that we need to reduce parallelism on microshift?

openshift-ci bot commented Sep 9, 2025

@neisw: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-openstack-ovn 4cb1659 link false /test e2e-openstack-ovn
ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout 4cb1659 link false /test e2e-metal-ipi-ovn-kube-apiserver-rollout
ci/prow/e2e-aws-ovn-single-node-serial 4cb1659 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn 4cb1659 link false /test e2e-aws-ovn
ci/prow/e2e-aws-ovn-single-node-upgrade 4cb1659 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-metal-ipi-serial-2of2 4cb1659 link false /test e2e-metal-ipi-serial-2of2
ci/prow/e2e-gcp-ovn-techpreview 4cb1659 link false /test e2e-gcp-ovn-techpreview
ci/prow/e2e-aws-ovn-serial-2of2 4cb1659 link true /test e2e-aws-ovn-serial-2of2
ci/prow/e2e-hypershift-conformance 4cb1659 link false /test e2e-hypershift-conformance
ci/prow/e2e-aws-disruptive 4cb1659 link false /test e2e-aws-disruptive
ci/prow/e2e-gcp-ovn-techpreview-serial-2of2 4cb1659 link false /test e2e-gcp-ovn-techpreview-serial-2of2
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 4cb1659 link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/okd-scos-e2e-aws-ovn 4cb1659 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-vsphere-ovn 4cb1659 link true /test e2e-vsphere-ovn
ci/prow/e2e-metal-ipi-virtualmedia 4cb1659 link false /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-metal-ipi-serial-ovn-ipv6-2of2 4cb1659 link false /test e2e-metal-ipi-serial-ovn-ipv6-2of2
ci/prow/e2e-aws-csi 4cb1659 link false /test e2e-aws-csi

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
