BZ:1973274 Added a note on maximum cluster downtime #35099

mjpytlak · 2021-08-02T20:34:48Z

CP to 4.6, 4.7, 4.8, and 4.9

https://bugzilla.redhat.com/show_bug.cgi?id=1973274

Updated Shutting down a cluster gracefully:

Added a note that states the maximum amount of time that a cluster can remain down and be expected to restart gracefully is 1 year.
Step 1 specifies how to determine when the cluster certificates will expire.

https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-shutdown?utm_source=github&utm_campaign=bot_dp

QE: In a case where a cluster has been down for 1 year, please also verify that steps 4 and 6 in Restarting the cluster gracefully is all a user would need to do with regard to pending CSRs. I suspect the answer is yes, but wanted to verify.

https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-restart.html

netlify · 2021-08-02T20:36:22Z

✔️ Deploy Preview for osdocs ready!

🔨 Explore the source changes: e82d998

🔍 Inspect the deploy log: https://app.netlify.com/sites/osdocs/deploys/611293a845877a00088e3dd1

😎 Browse the preview: https://deploy-preview-35099--osdocs.netlify.app

kazuhisya · 2021-08-03T00:16:34Z

modules/graceful-shutdown.adoc

for up to 1 year and expect: Can we provide a way to check this date?

For example, it looks like this, but we need to confirm from QE.

$ oc -n openshift-kube-apiserver-operator get secret kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'

If the cluster remains down for 1 year: Can we clarify whether this is "before" or "after" the expiration date confirmed above?

If my understanding is correct, the scenario Eng said is "the day" or "before".

you manually approve: It's a easy procedure, but it's nice to have a sample.

For example, it looks like this.

$oc get csr -oname | xargs oc adm certificate approve

Thank you.
[1] When are you suggesting that this command (pending confirm from QE) be run? I interpreted your comment as prior to shutting the cluster down for an extended period, run the command and note the cert expiration date. Do I understand you correctly?
[3] We presently have the steps documented on how to manually approve pending CSRs. In my original comment, I have asked the QE verify the current steps in https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-restart.html apply to this scenario. Once verified, I will be sure to reference this topic in my updates.

Hello. Thank you for your reply.

[1] This should be checked before the cluster shutdown. For example, "up to 1 year" is correct the day after OCP is installed, but I wondering if we do this after a long time (ex. a month) after installation, we may not be able to keep it cluster shutdown for a whole year. If my understanding is correct.

[3] Yes, I think this makes sense. Thank you for letting me know.

Good evening,

[1] I believe your understanding is correct. In my next round of edits, I will be sure to clarify the point that certificates expire 1 year from the date of installation. Thank you for pointing this out.

xingxingxia · 2021-08-10T10:45:57Z

Very sorry, was busy in work on led subteams so didn't have time on this.
Today I read through the bug of long comments which are useful for me to get the points in place. But later the Bugzilla is down when I only read part of comments. I need read more when it is back.

The PR is LGTM.

But frankly speaking, it can't be tested because shutting down a cluster for 1 year and the restarting can't be tested. Then how to deal with this issue?
In DR tests, we have two ways as described in our case OCP-30788:
One way is: install fresh env, shut down and wait until it has been 24 hours since installed, then restart, this tests control plane certs of initial 24 hours expiry.
Another way is: install baremetal env (could provider env can't use this way), shut down OCP kubelet and crio services on nodes, then move ahead time for 1 year on all nodes, then restart OCP kubelet and crio services, this covers the cert recovery of long expiry. Can't shut down after moving ahead, because after restarting, the time is restored to World time.

(CC @wangke19 @gangwgr FYI about this doc's need)

mjpytlak · 2021-08-10T13:04:37Z

@xingxingxia

Thank you for looking at this. With regard to your comment about being tested, please see the referenced RFE [1]. The stated period of 1 year was based on David Eads comment (2021/June/02) that there is automated testing for certificate expiration and the ability for the cluster to recover from expired certs. Given this information, the team decided to provide the guidance in the doc. Please see Vikram Goyal's comments on 2021-08-02 in the BZ [2] for details. Given this info, do I have QE approval?

[1] https://issues.redhat.com/browse/RFE-450?focusedCommentId=16224784&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16224784.
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1973274

xingxingxia · 2021-08-10T13:18:32Z

OK, Will continue looking at those details tomorrow. Today didn't have time due to other things. Adding:
/lgtm

jeana-redhat

Just a few suggestions

jeana-redhat · 2021-08-10T14:15:06Z

modules/graceful-shutdown.adoc

Suggested change

From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After which time, the cluster certificates expire.

From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After 1 year, the cluster certificates expire.

jeana-redhat · 2021-08-10T14:17:08Z

modules/graceful-shutdown.adoc

Suggested change

. If you are shutting the cluster down for an extended period, note the date on which certificates expire.

. If you are shutting the cluster down for an extended period, determine the date on which certificates expire:

Here, the command is not noting it, but getting the info from the cluster. You could add a followup to this step to say "make a note of this" or similar, but I think the little footnote with the example output actually covers this pretty well already :)

bergerhoffer

Also just a question - did you think it would be worthwhile to also add any of this info onto the restarting a cluster gracefully page?

bergerhoffer · 2021-08-10T14:25:13Z

modules/graceful-shutdown.adoc

The way this is worded now doesn't seem super clear that it is 1 year from the installation date. Suggest a reword. Maybe?

You can shut down a cluster until a year from the installation date and expect it to restart gracefully. After a year from the installation date, the cluster certificates expire.

or maybe:

As long as you restart the cluster before the cluster certificates expire, which is a year from the installation date, you can expect the cluster to restart gracefully.

bergerhoffer · 2021-08-10T14:30:53Z

modules/graceful-shutdown.adoc

I can't remember if we talked about this, but the recovering from expired certs section [1] was recently updated to add additional CSRs that might need to approved (for UPI installs). So it might not only be node-bootstrapper CSRs. Suggest leaving the specifics out and just saying "... to manually approve the pending certificate signing requests (CSRs) to recover kubelet certificates."

[1] https://docs.openshift.com/container-platform/4.8/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html#dr-scenario-3-recovering-expired-certs_dr-recovering-expired-certs

openshift-ci · 2021-08-10T14:40:59Z

New changes are detected. LGTM label has been removed.

…dure

jeana-redhat · 2021-08-11T16:56:24Z

/cherrypick enterprise-4.9

jeana-redhat · 2021-08-11T16:56:51Z

/cherrypick enterprise-4.8

openshift-cherrypick-robot · 2021-08-11T16:57:04Z

@jeana-redhat: new pull request created: #35440

In response to this:

/cherrypick enterprise-4.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jeana-redhat · 2021-08-11T16:57:04Z

/cherrypick enterprise-4.7

jeana-redhat · 2021-08-11T16:57:20Z

/cherrypick enterprise-4.6

openshift-cherrypick-robot · 2021-08-11T16:57:31Z

@jeana-redhat: new pull request created: #35441

In response to this:

/cherrypick enterprise-4.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-cherrypick-robot · 2021-08-11T16:57:43Z

@jeana-redhat: new pull request created: #35442

In response to this:

/cherrypick enterprise-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-cherrypick-robot · 2021-08-11T16:57:59Z

@jeana-redhat: new pull request created: #35443

In response to this:

/cherrypick enterprise-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

xingxingxia · 2021-08-16T13:32:38Z

xingxingxia commented 6 days ago
OK, Will continue looking at those details tomorrow. Today didn't have time due to other things.

(Time debt) sorry for continuing look till now (after merged). Today finished look and got them, they sound good. BTW the test way under my "Another way is" in previous comment seems same as their auto test, because for which we once ever discussed with Dev for learning their auto script in our QE test.

openshift-ci bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 2, 2021

kazuhisya reviewed Aug 3, 2021

View reviewed changes

mjpytlak changed the title ~~BZ:1886814 Added a note on maximum cluster downtime~~ BZ:1973274 Added a note on maximum cluster downtime Aug 3, 2021

vikram-redhat added branch/enterprise-4.6 branch/enterprise-4.7 branch/enterprise-4.8 branch/enterprise-4.9 labels Aug 4, 2021

vikram-redhat added this to the Next Release milestone Aug 4, 2021

mjpytlak force-pushed the bz1973274 branch from 06ef27d to c6b7bb2 Compare August 5, 2021 18:14

openshift-ci bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 5, 2021

openshift-ci bot assigned xingxingxia Aug 10, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2021

jeana-redhat reviewed Aug 10, 2021

View reviewed changes

bergerhoffer reviewed Aug 10, 2021

View reviewed changes

bergerhoffer added the peer-review-done Signifies that the peer review team has reviewed this PR label Aug 10, 2021

mjpytlak force-pushed the bz1973274 branch from c6b7bb2 to 23507b4 Compare August 10, 2021 14:39

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2021

BZ:1973274 Added a note on maximum cluster downtime and updated proce…

e82d998

…dure

mjpytlak force-pushed the bz1973274 branch from 23507b4 to e82d998 Compare August 10, 2021 14:56

jeana-redhat merged commit 960b724 into openshift:main Aug 11, 2021

openshift-cherrypick-robot mentioned this pull request Aug 11, 2021

[enterprise-4.9] BZ:1973274 Added a note on maximum cluster downtime #35440

Merged

openshift-cherrypick-robot mentioned this pull request Aug 11, 2021

[enterprise-4.8] BZ:1973274 Added a note on maximum cluster downtime #35441

Merged

This was referenced Aug 11, 2021

[enterprise-4.7] BZ:1973274 Added a note on maximum cluster downtime #35442

Merged

[enterprise-4.6] BZ:1973274 Added a note on maximum cluster downtime #35443

Merged

	From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After which time, the cluster certificates expire.
	From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After 1 year, the cluster certificates expire.

	. If you are shutting the cluster down for an extended period, note the date on which certificates expire.
	. If you are shutting the cluster down for an extended period, determine the date on which certificates expire:

BZ:1973274 Added a note on maximum cluster downtime #35099

BZ:1973274 Added a note on maximum cluster downtime #35099

Uh oh!

Conversation

mjpytlak commented Aug 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

netlify bot commented Aug 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mjpytlak Aug 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xingxingxia commented Aug 10, 2021

Uh oh!

mjpytlak commented Aug 10, 2021

Uh oh!

xingxingxia commented Aug 10, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeana-redhat left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bergerhoffer left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Aug 10, 2021

Uh oh!

jeana-redhat commented Aug 11, 2021

Uh oh!

jeana-redhat commented Aug 11, 2021

Uh oh!

openshift-cherrypick-robot commented Aug 11, 2021

Uh oh!

jeana-redhat commented Aug 11, 2021

Uh oh!

jeana-redhat commented Aug 11, 2021

Uh oh!

openshift-cherrypick-robot commented Aug 11, 2021

Uh oh!

openshift-cherrypick-robot commented Aug 11, 2021

Uh oh!

openshift-cherrypick-robot commented Aug 11, 2021

Uh oh!

xingxingxia commented Aug 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

mjpytlak commented Aug 2, 2021 •

edited

Loading

netlify bot commented Aug 2, 2021 •

edited

Loading

mjpytlak Aug 3, 2021 •

edited

Loading

xingxingxia commented Aug 10, 2021 •

edited

Loading

xingxingxia commented Aug 16, 2021 •

edited

Loading