eggfoobar
Contributor

Add a two-node fencing exception to the etcd operator state transition check during upgrade. In two-node fencing, the etcd operator goes unavailable while the two pods are updated and the etcd fencing job runs via Pacemaker. This is expected behavior given the limitations of two-node deployments.

@openshift-ci-robot

openshift-ci-robot commented Sep 8, 2025

@eggfoobar: This pull request references OCPEDGE-1916 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Sep 8, 2025
@eggfoobar
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

Contributor

openshift-ci bot commented Sep 8, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/37c2f790-8c72-11f0-889b-f227d4e7d021-0

Contributor

openshift-ci bot commented Sep 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eggfoobar
Once this PR has been reviewed and has the lgtm label, please assign stbenjam for approval. For more information see the Code Review Process.

@eggfoobar
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

Contributor

openshift-ci bot commented Sep 8, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5c464ce0-8cac-11f0-8150-467409da5b51-0

@eggfoobar eggfoobar force-pushed the add-two-node-fencing-exception branch from 81364bb to 758a627 Compare September 8, 2025 15:41
@eggfoobar
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

Contributor

openshift-ci bot commented Sep 8, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5afce150-8cca-11f0-86cb-573ed793600f-0

Add a two-node fencing exception to the etcd operator state transition check during upgrade. In two-node fencing, the etcd operator goes unavailable while the two pods are updated and the etcd fencing job runs via Pacemaker. This is expected behavior given the limitations of two-node deployments.

Signed-off-by: ehila <[email protected]>
@eggfoobar eggfoobar force-pushed the add-two-node-fencing-exception branch from 758a627 to 43e6915 Compare September 8, 2025 19:12
@eggfoobar
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

Contributor

openshift-ci bot commented Sep 8, 2025

@eggfoobar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 758a627 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-vsphere-ovn-upi | 758a627 | link | true | /test e2e-vsphere-ovn-upi |

Contributor

openshift-ci bot commented Sep 8, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c36908a0-8ce7-11f0-943f-b163f5f7e932-0

@eggfoobar
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

Contributor

openshift-ci bot commented Sep 8, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f097ce90-8d0c-11f0-8fcc-7944b029ffea-0

openshift-trt bot commented Sep 9, 2025

Job Failure Risk Analysis for sha: 43e6915

| Job Name | Failure Risk | Details |
| --- | --- | --- |
| pull-ci-openshift-origin-main-e2e-metal-ipi-serial-2of2 | IncompleteTests | Tests for this run (22) are below the historical average (1538): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems |
| pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6-2of2 | IncompleteTests | Tests for this run (22) are below the historical average (1516): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems |
| pull-ci-openshift-origin-main-e2e-metal-ipi-virtualmedia | IncompleteTests | Tests for this run (102) are below the historical average (2462): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems |

@eggfoobar
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

Contributor

openshift-ci bot commented Sep 9, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c83827d0-8d32-11f0-8cac-2c537bd949d8-0

@eggfoobar
Contributor Author

After some issues with the Equinix account, the latest job run succeeded, and we no longer see the etcd events trigger the test. This is good to go.

Contributor

@jaypoulz jaypoulz left a comment

I want us to be more targeted with this exception. This should only happen during initial deployment. IIUC, the setup job shouldn't be re-run after initial deployment unless we're doing a control-plane node replacement.

Currently, we are too aggressive in TNF with regard to "unavailable". Etcd should only be unavailable from the start of the setup job to its completion. Other jobs should only affect the progressing status, with "degraded" being set if we give up.

@eggfoobar
Contributor Author

> I want us to be more targeted with this exception. This should only happen during initial deployment. IIUC, the setup job shouldn't be re-run after initial deployment unless we're doing a control-plane node replacement.
>
> Currently, we are too aggressive in TNF with regard to "unavailable". Etcd should only be unavailable from the start of the setup job to its completion. Other jobs should only affect the progressing status, with "degraded" being set if we give up.

I can tighten up that catch. As it currently stands, this captures four conditions: tnf-auth-job running, tnf-after-setup-job running, tnf-fencing-job running, and the etcd quorum 1 of 2 condition. Do any of those sound undesirable in a fencing upgrade situation?
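
For illustration only, the four conditions listed above could be expressed as a single predicate along these lines. This is a minimal sketch, not the code in this PR: the `OperatorCondition` type, the function name, and the idea of matching on the condition message are all assumptions made for the example. Only the three job names and the "1 of 2" quorum case come from the comment.

```python
from dataclasses import dataclass

# Job names come from the comment above. How they appear in the operator's
# condition message is an assumption made for this sketch.
TNF_JOB_NAMES = ("tnf-auth-job", "tnf-after-setup-job", "tnf-fencing-job")

# Hypothetical marker for the "etcd quorum 1 of 2" condition.
QUORUM_MARKER = "1 of 2"


@dataclass
class OperatorCondition:
    """Hypothetical, simplified view of a cluster operator status condition."""
    operator: str    # e.g. "etcd"
    cond_type: str   # e.g. "Available"
    status: str      # "True" or "False"
    message: str


def is_expected_tnf_unavailable(cond: OperatorCondition, two_node_fencing: bool) -> bool:
    """Return True when an etcd Available=False condition matches one of the four cases."""
    if not two_node_fencing:
        return False
    if cond.operator != "etcd" or cond.cond_type != "Available" or cond.status != "False":
        return False
    # Cases 1-3: one of the TNF jobs is mentioned in the condition message.
    job_running = any(name in cond.message for name in TNF_JOB_NAMES)
    # Case 4: the message reports the reduced-quorum state.
    quorum_reduced = QUORUM_MARKER in cond.message
    return job_running or quorum_reduced
```

Narrowing the exception, as requested in the review, would then amount to dropping entries from `TNF_JOB_NAMES` or adding a check that the cluster is still in its initial deployment.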

@eggfoobar
Contributor Author

Closing in favor of openshift/cluster-etcd-operator#1481

@eggfoobar eggfoobar closed this Sep 11, 2025