-
Notifications
You must be signed in to change notification settings - Fork 1.8k
BZ:1973274 Added a note on maximum cluster downtime #35099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✔️ Deploy Preview for osdocs ready! 🔨 Explore the source changes: e82d998 🔍 Inspect the deploy log: https://app.netlify.com/sites/osdocs/deploys/611293a845877a00088e3dd1 😎 Browse the preview: https://deploy-preview-35099--osdocs.netlify.app |
modules/graceful-shutdown.adoc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
for up to 1 year and expect
: Can we provide a way to check this date?- For example, it looks like this, but we need to confirm from QE.
$ oc -n openshift-kube-apiserver-operator get secret kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
-
If the cluster remains down for 1 year
: Can we clarify whether this is "before" or "after" the expiration date confirmed above?- If my understanding is correct, the scenario Eng said is "the day" or "before".
-
you manually approve:
It's a easy procedure, but it's nice to have a sample.- For example, it looks like this.
- $
oc get csr -oname | xargs oc adm certificate approve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
[1] When are you suggesting that this command (pending confirm from QE) be run? I interpreted your comment as prior to shutting the cluster down for an extended period, run the command and note the cert expiration date. Do I understand you correctly?
[3] We presently have the steps documented on how to manually approve pending CSRs. In my original comment, I have asked the QE verify the current steps in https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-restart.html apply to this scenario. Once verified, I will be sure to reference this topic in my updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello. Thank you for your reply.
[1] This should be checked before the cluster shutdown. For example, "up to 1 year" is correct the day after OCP is installed, but I wondering if we do this after a long time (ex. a month) after installation, we may not be able to keep it cluster shutdown for a whole year. If my understanding is correct.
[3] Yes, I think this makes sense. Thank you for letting me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good evening,
[1] I believe your understanding is correct. In my next round of edits, I will be sure to clarify the point that certificates expire 1 year from the date of installation. Thank you for pointing this out.
Very sorry, was busy in work on led subteams so didn't have time on this. The PR is LGTM. But frankly speaking, it can't be tested because shutting down a cluster for 1 year and the restarting can't be tested. Then how to deal with this issue? |
Thank you for looking at this. With regard to your comment about being tested, please see the referenced RFE [1]. The stated period of 1 year was based on David Eads comment (2021/June/02) that there is automated testing for certificate expiration and the ability for the cluster to recover from expired certs. Given this information, the team decided to provide the guidance in the doc. Please see Vikram Goyal's comments on 2021-08-02 in the BZ [2] for details. Given this info, do I have QE approval? [1] https://issues.redhat.com/browse/RFE-450?focusedCommentId=16224784&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16224784. |
OK, Will continue looking at those details tomorrow. Today didn't have time due to other things. Adding: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few suggestions
modules/graceful-shutdown.adoc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After which time, the cluster certificates expire. | |
From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After 1 year, the cluster certificates expire. |
modules/graceful-shutdown.adoc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. If you are shutting the cluster down for an extended period, note the date on which certificates expire. | |
. If you are shutting the cluster down for an extended period, determine the date on which certificates expire: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the command is not noting it, but getting the info from the cluster. You could add a followup to this step to say "make a note of this" or similar, but I think the little footnote with the example output actually covers this pretty well already :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also just a question - did you think it would be worthwhile to also add any of this info onto the restarting a cluster gracefully page?
modules/graceful-shutdown.adoc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this is worded now doesn't seem super clear that it is 1 year from the installation date. Suggest a reword. Maybe?
You can shut down a cluster until a year from the installation date and expect it to restart gracefully. After a year from the installation date, the cluster certificates expire.
or maybe:
As long as you restart the cluster before the cluster certificates expire, which is a year from the installation date, you can expect the cluster to restart gracefully.
modules/graceful-shutdown.adoc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't remember if we talked about this, but the recovering from expired certs section [1] was recently updated to add additional CSRs that might need to approved (for UPI installs). So it might not only be node-bootstrapper
CSRs. Suggest leaving the specifics out and just saying "... to manually approve the pending certificate signing requests (CSRs) to recover kubelet certificates."
New changes are detected. LGTM label has been removed. |
/cherrypick enterprise-4.9 |
/cherrypick enterprise-4.8 |
@jeana-redhat: new pull request created: #35440 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherrypick enterprise-4.7 |
/cherrypick enterprise-4.6 |
@jeana-redhat: new pull request created: #35441 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@jeana-redhat: new pull request created: #35442 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@jeana-redhat: new pull request created: #35443 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
(Time debt) sorry for continuing look till now (after merged). Today finished look and got them, they sound good. BTW the test way under my "Another way is" in previous comment seems same as their auto test, because for which we once ever discussed with Dev for learning their auto script in our QE test. |
CP to 4.6, 4.7, 4.8, and 4.9
https://bugzilla.redhat.com/show_bug.cgi?id=1973274
Updated Shutting down a cluster gracefully:
https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-shutdown?utm_source=github&utm_campaign=bot_dp
QE: In a case where a cluster has been down for 1 year, please also verify that steps 4 and 6 in Restarting the cluster gracefully is all a user would need to do with regard to pending CSRs. I suspect the answer is yes, but wanted to verify.
https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-restart.html