Skip to content

Conversation

mjpytlak
Copy link
Contributor

@mjpytlak mjpytlak commented Aug 2, 2021

CP to 4.6, 4.7, 4.8, and 4.9

https://bugzilla.redhat.com/show_bug.cgi?id=1973274

Updated Shutting down a cluster gracefully:

  • Added a note that states the maximum amount of time that a cluster can remain down and be expected to restart gracefully is 1 year.
  • Step 1 specifies how to determine when the cluster certificates will expire.

https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-shutdown?utm_source=github&utm_campaign=bot_dp

QE: In a case where a cluster has been down for 1 year, please also verify that steps 4 and 6 in Restarting the cluster gracefully is all a user would need to do with regard to pending CSRs. I suspect the answer is yes, but wanted to verify.

https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-restart.html

@openshift-ci openshift-ci bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 2, 2021
@netlify
Copy link

netlify bot commented Aug 2, 2021

✔️ Deploy Preview for osdocs ready!

🔨 Explore the source changes: e82d998

🔍 Inspect the deploy log: https://app.netlify.com/sites/osdocs/deploys/611293a845877a00088e3dd1

😎 Browse the preview: https://deploy-preview-35099--osdocs.netlify.app

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. for up to 1 year and expect: Can we provide a way to check this date?

    • For example, it looks like this, but we need to confirm from QE.
    • $ oc -n openshift-kube-apiserver-operator get secret kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
  2. If the cluster remains down for 1 year: Can we clarify whether this is "before" or "after" the expiration date confirmed above?

    • If my understanding is correct, the scenario Eng said is "the day" or "before".
  3. you manually approve: It's a easy procedure, but it's nice to have a sample.

    • For example, it looks like this.
    • $oc get csr -oname | xargs oc adm certificate approve

Copy link
Contributor Author

@mjpytlak mjpytlak Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.
[1] When are you suggesting that this command (pending confirm from QE) be run? I interpreted your comment as prior to shutting the cluster down for an extended period, run the command and note the cert expiration date. Do I understand you correctly?
[3] We presently have the steps documented on how to manually approve pending CSRs. In my original comment, I have asked the QE verify the current steps in https://deploy-preview-35099--osdocs.netlify.app/openshift-enterprise/latest/backup_and_restore/graceful-cluster-restart.html apply to this scenario. Once verified, I will be sure to reference this topic in my updates.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello. Thank you for your reply.

[1] This should be checked before the cluster shutdown. For example, "up to 1 year" is correct the day after OCP is installed, but I wondering if we do this after a long time (ex. a month) after installation, we may not be able to keep it cluster shutdown for a whole year. If my understanding is correct.

[3] Yes, I think this makes sense. Thank you for letting me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good evening,

[1] I believe your understanding is correct. In my next round of edits, I will be sure to clarify the point that certificates expire 1 year from the date of installation. Thank you for pointing this out.

@mjpytlak mjpytlak changed the title BZ:1886814 Added a note on maximum cluster downtime BZ:1973274 Added a note on maximum cluster downtime Aug 3, 2021
@vikram-redhat vikram-redhat added this to the Next Release milestone Aug 4, 2021
@openshift-ci openshift-ci bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 5, 2021
@xingxingxia
Copy link
Contributor

Very sorry, was busy in work on led subteams so didn't have time on this.
Today I read through the bug of long comments which are useful for me to get the points in place. But later the Bugzilla is down when I only read part of comments. I need read more when it is back.

The PR is LGTM.

But frankly speaking, it can't be tested because shutting down a cluster for 1 year and the restarting can't be tested. Then how to deal with this issue?
In DR tests, we have two ways as described in our case OCP-30788:
One way is: install fresh env, shut down and wait until it has been 24 hours since installed, then restart, this tests control plane certs of initial 24 hours expiry.
Another way is: install baremetal env (could provider env can't use this way), shut down OCP kubelet and crio services on nodes, then move ahead time for 1 year on all nodes, then restart OCP kubelet and crio services, this covers the cert recovery of long expiry. Can't shut down after moving ahead, because after restarting, the time is restored to World time.

(CC @wangke19 @gangwgr FYI about this doc's need)

@mjpytlak
Copy link
Contributor Author

@xingxingxia

Thank you for looking at this. With regard to your comment about being tested, please see the referenced RFE [1]. The stated period of 1 year was based on David Eads comment (2021/June/02) that there is automated testing for certificate expiration and the ability for the cluster to recover from expired certs. Given this information, the team decided to provide the guidance in the doc. Please see Vikram Goyal's comments on 2021-08-02 in the BZ [2] for details. Given this info, do I have QE approval?

[1] https://issues.redhat.com/browse/RFE-450?focusedCommentId=16224784&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16224784.
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1973274

@xingxingxia
Copy link
Contributor

xingxingxia commented Aug 10, 2021

OK, Will continue looking at those details tomorrow. Today didn't have time due to other things. Adding:
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2021
Copy link
Contributor

@jeana-redhat jeana-redhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few suggestions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After which time, the cluster certificates expire.
From the installation date, you can leave a cluster shut down for up to 1 year and expect it to restart gracefully. After 1 year, the cluster certificates expire.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
. If you are shutting the cluster down for an extended period, note the date on which certificates expire.
. If you are shutting the cluster down for an extended period, determine the date on which certificates expire:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the command is not noting it, but getting the info from the cluster. You could add a followup to this step to say "make a note of this" or similar, but I think the little footnote with the example output actually covers this pretty well already :)

Copy link
Contributor

@bergerhoffer bergerhoffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also just a question - did you think it would be worthwhile to also add any of this info onto the restarting a cluster gracefully page?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is worded now doesn't seem super clear that it is 1 year from the installation date. Suggest a reword. Maybe?

You can shut down a cluster until a year from the installation date and expect it to restart gracefully. After a year from the installation date, the cluster certificates expire.

or maybe:

As long as you restart the cluster before the cluster certificates expire, which is a year from the installation date, you can expect the cluster to restart gracefully.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember if we talked about this, but the recovering from expired certs section [1] was recently updated to add additional CSRs that might need to approved (for UPI installs). So it might not only be node-bootstrapper CSRs. Suggest leaving the specifics out and just saying "... to manually approve the pending certificate signing requests (CSRs) to recover kubelet certificates."

[1] https://docs.openshift.com/container-platform/4.8/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html#dr-scenario-3-recovering-expired-certs_dr-recovering-expired-certs

@bergerhoffer bergerhoffer added the peer-review-done Signifies that the peer review team has reviewed this PR label Aug 10, 2021
@openshift-ci
Copy link

openshift-ci bot commented Aug 10, 2021

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2021
@jeana-redhat jeana-redhat merged commit 960b724 into openshift:main Aug 11, 2021
@jeana-redhat
Copy link
Contributor

/cherrypick enterprise-4.9

@jeana-redhat
Copy link
Contributor

/cherrypick enterprise-4.8

@openshift-cherrypick-robot

@jeana-redhat: new pull request created: #35440

In response to this:

/cherrypick enterprise-4.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jeana-redhat
Copy link
Contributor

/cherrypick enterprise-4.7

@jeana-redhat
Copy link
Contributor

/cherrypick enterprise-4.6

@openshift-cherrypick-robot

@jeana-redhat: new pull request created: #35441

In response to this:

/cherrypick enterprise-4.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@jeana-redhat: new pull request created: #35442

In response to this:

/cherrypick enterprise-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@jeana-redhat: new pull request created: #35443

In response to this:

/cherrypick enterprise-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xingxingxia
Copy link
Contributor

xingxingxia commented Aug 16, 2021

xingxingxia commented 6 days ago
OK, Will continue looking at those details tomorrow. Today didn't have time due to other things.

(Time debt) sorry for continuing look till now (after merged). Today finished look and got them, they sound good. BTW the test way under my "Another way is" in previous comment seems same as their auto test, because for which we once ever discussed with Dev for learning their auto script in our QE test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch/enterprise-4.6 branch/enterprise-4.7 branch/enterprise-4.8 branch/enterprise-4.9 peer-review-done Signifies that the peer review team has reviewed this PR size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants