OCPEDGE-1916: fix: add exception for two node fencing #30218

eggfoobar · 2025-09-08T05:10:02Z

Add a two-node fencing exception to the etcd operator state-transition check during upgrade. In two-node fencing, the etcd operator goes unavailable while the two pods are updated and the etcd fencing job runs via Pacemaker. This is expected behavior given the limitations of two-node deployments.

openshift-ci-robot · 2025-09-08T05:10:05Z

@eggfoobar: This pull request references OCPEDGE-1916 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set.

eggfoobar · 2025-09-08T05:11:03Z

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

openshift-ci · 2025-09-08T05:11:07Z

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/37c2f790-8c72-11f0-889b-f227d4e7d021-0

openshift-ci · 2025-09-08T05:11:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eggfoobar
Once this PR has been reviewed and has the lgtm label, please assign stbenjam for approval. For more information see the Code Review Process.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

eggfoobar · 2025-09-08T12:07:15Z

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

openshift-ci · 2025-09-08T12:07:31Z

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5c464ce0-8cac-11f0-8150-467409da5b51-0

eggfoobar · 2025-09-08T15:41:57Z

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

openshift-ci · 2025-09-08T15:42:30Z

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5afce150-8cca-11f0-86cb-573ed793600f-0

add two node fencing exception to the etcd operator state transition during upgrade, in two node fencing the etcd operator will go unavailable as the two pods are updated and etcd fencing job is running via pacemaker, this is expected behavior due to the limitations of two node deployments

Signed-off-by: ehila <[email protected]>

eggfoobar · 2025-09-08T19:12:28Z

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

openshift-ci · 2025-09-08T19:12:34Z

@eggfoobar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | `758a627` | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-vsphere-ovn-upi | `758a627` | true | `/test e2e-vsphere-ovn-upi` |

openshift-ci · 2025-09-08T19:12:55Z

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c36908a0-8ce7-11f0-943f-b163f5f7e932-0

eggfoobar · 2025-09-08T23:38:35Z

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

openshift-ci · 2025-09-08T23:38:38Z

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f097ce90-8d0c-11f0-8fcc-7944b029ffea-0

openshift-trt · 2025-09-09T00:09:03Z

Job Failure Risk Analysis for sha: 43e6915

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-metal-ipi-serial-2of2 | IncompleteTests: tests for this run (22) are below the historical average (1538). Not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems. |
| pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6-2of2 | IncompleteTests: tests for this run (22) are below the historical average (1516). Not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems. |
| pull-ci-openshift-origin-main-e2e-metal-ipi-virtualmedia | IncompleteTests: tests for this run (102) are below the historical average (2462). Not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems. |

eggfoobar · 2025-09-09T04:09:29Z

/payload-job periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

openshift-ci · 2025-09-09T04:09:33Z

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ovn-two-node-fencing-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c83827d0-8d32-11f0-8cac-2c537bd949d8-0

eggfoobar · 2025-09-09T13:27:52Z

After some issues with the Equinix account, the latest job run succeeded, and we no longer see the etcd events trigger the test. This is good to go.

jaypoulz

I want us to be more targeted with this exception. This should only happen during initial deployment. IIUC, the setup job shouldn't be re-run after initial deployment unless we're doing a control-plane node replacement.

Currently, we are too aggressive in TNF with regard to "unavailable". Etcd should only be unavailable from the start of the setup job to its completion. Other jobs should only affect the progressing status, with "degraded" being set if we give up.
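One way to read the proposed status rules is sketched below in Python. This is only an illustration of the comment, not the operator's code: the job name `tnf-setup-job`, the state strings, and the `OperatorStatus` type are all hypothetical, and treating a setup job that gave up as still unavailable is an assumption the comment does not state.

```python
from dataclasses import dataclass

# Hypothetical name: the comment only refers to "the setup job".
SETUP_JOB = "tnf-setup-job"


@dataclass(frozen=True)
class OperatorStatus:
    available: bool = True
    progressing: bool = False
    degraded: bool = False


def status_for(job: str, state: str) -> OperatorStatus:
    """Map a TNF job and its state to etcd operator conditions.

    state is one of "running", "succeeded", "gave-up" (illustrative values).
    """
    is_setup = job == SETUP_JOB
    if state == "running":
        # Only the setup job may take etcd unavailable; other jobs just progress.
        return OperatorStatus(available=not is_setup, progressing=True)
    if state == "gave-up":
        # Assumption: a setup job that never completed leaves etcd unavailable.
        return OperatorStatus(available=not is_setup, degraded=True)
    # Completed jobs leave the operator in its default healthy state.
    return OperatorStatus()
```

Under this reading, a running `tnf-fencing-job` would report progressing but stay available, so an upgrade test would not need an exception for it.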

eggfoobar · 2025-09-09T15:28:01Z

> I want us to be more targeted with this exception. This should only happen during initial deployment. IIUC, the setup job shouldn't be re-run after initial deployment unless we're doing a control-plane node replacement.
>
> Currently, we are too aggressive in TNF with regard to "unavailable". Etcd should only be unavailable from the start of the setup job to its completion. Other jobs should only affect the progressing status, with "degraded" being set if we give up.

I can tighten up that catch. As it currently stands, this captures four conditions: tnf-auth-job running, tnf-after-setup-job running, tnf-fencing-job running, and the etcd quorum 1 of 2 condition. Do any of those sound undesirable in a fencing upgrade situation?
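To make the four conditions concrete, here is a minimal Python sketch of a predicate that reports which of them match. It is not the code in this PR: the `Snapshot` type and function names are hypothetical, and reading "etcd quorum 1 of 2" as one healthy member out of two is an assumption.

```python
from dataclasses import dataclass, field

# Job names as listed in the comment above.
EXEMPT_JOBS = frozenset({"tnf-auth-job", "tnf-after-setup-job", "tnf-fencing-job"})


@dataclass
class Snapshot:
    """Hypothetical view of cluster state at the time etcd went unavailable."""
    running_jobs: set = field(default_factory=set)
    healthy_members: int = 2
    total_members: int = 2


def matched_conditions(s: Snapshot) -> list:
    """Return the exception conditions that the snapshot satisfies."""
    matched = sorted(f"{job} running" for job in s.running_jobs & EXEMPT_JOBS)
    # Assumption: "quorum 1 of 2" means one healthy member in a two-member cluster.
    if s.total_members == 2 and s.healthy_members == 1:
        matched.append("etcd quorum 1 of 2")
    return matched


def is_exempt(s: Snapshot) -> bool:
    """True when at least one exception condition applies."""
    return bool(matched_conditions(s))
```

Returning the matched conditions rather than a bare boolean makes it easier to narrow the catch later, for example by dropping one job name from `EXEMPT_JOBS`.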

eggfoobar · 2025-09-11T01:38:00Z

Closing in favor of openshift/cluster-etcd-operator#1481

jaypoulz · 2025-09-11T19:43:50Z

In order to fix this test, we need both fixes:

OCPEDGE-2183: Updating Quorum detection logic for absolve TNF of quorum loss reports. cluster-etcd-operator#1483
OCPEDGE-2088, OCPEDGE-1885: Updated state transitions & tests for TNF setup job cluster-etcd-operator#1481

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 8, 2025

openshift-ci bot requested review from p0lyn0mial and sjenning September 8, 2025 05:11

eggfoobar force-pushed the add-two-node-fencing-exception branch from 81364bb to 758a627 Compare September 8, 2025 15:41

eggfoobar force-pushed the add-two-node-fencing-exception branch from 758a627 to 43e6915 Compare September 8, 2025 19:12

jaypoulz suggested changes Sep 9, 2025

View reviewed changes

eggfoobar closed this Sep 11, 2025
