
Conversation

@Amarthya-v commented Aug 4, 2025

This test validates STS permissions required for ROSA HCP NodePool operations:

  • ec2:RunInstances (create EC2 instances for new nodes)
  • ec2:CreateTags (tag new instances)
  • ec2:DescribeInstances (validate nodes exist in AWS)

The test performs day 2 operations by creating a NodePool, waiting for nodes
to provision, and scheduling workloads. Test failures block releases until
STS permissions are fixed, ensuring proper AWS access for cluster operations.

Fixes: https://issues.redhat.com/browse/OSD-28896
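A rough sketch of the flow described above (the ginkgo structure and helper names such as newTestNodePool are illustrative, not the exact code in this PR):

var _ = ginkgo.Describe("ROSA HCP NodePool STS permissions", ginkgo.Ordered, func() {
    ginkgo.It("creates a NodePool and schedules a workload", func(ctx context.Context) {
        // ec2:RunInstances and ec2:CreateTags are exercised when the
        // NodePool controller provisions and tags the new instances.
        np := newTestNodePool(clusterNamespace, testNodePoolName) // hypothetical helper
        _, err := h.Dynamic().Resource(nodePoolGVR).Namespace(clusterNamespace).
            Create(ctx, np, metav1.CreateOptions{})
        expect.NoError(err, "NodePool creation failed")

        // ec2:DescribeInstances is exercised while validating that the
        // new nodes actually exist in AWS before scheduling a workload.
        ginkgo.By("waiting for NodePool nodes to become Ready")
    })
})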

The openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Aug 4, 2025.

openshift-ci bot commented Aug 4, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Aug 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Amarthya-v
Once this PR has been reviewed and has the lgtm label, please assign ritmun for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


btoll commented Aug 4, 2025

@Amarthya-v Could you remove the trailing whitespace in the file? I use https://github.com/bitc/vim-bad-whitespace for vim; I'm sure there's a similar plugin for VS Code.

@btoll left a comment

I'm wondering if "should create nodepool and provision EC2 instances with proper STS permissions" could be split up into smaller tests. For example, would it make sense to have a test just to create the nodepool and then a second to do the rest?

Could you add more tests? Also, can you test the bad paths as well as the happy paths?

@Amarthya-v force-pushed the nodepool-sts-permissions-test branch from 2fc11fd to 3e9ef47 (August 8, 2025 20:09)
@Amarthya-v force-pushed the nodepool-sts-permissions-test branch from 3e9ef47 to 5e1aef9 (August 13, 2025 04:23)
@christophermancini commented
/test all

@christophermancini commented
@Amarthya-v looks like you have some failing checks at the moment.


openshift-ci bot commented Aug 13, 2025

@Amarthya-v: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                     Commit   Required  Rerun command
ci/prow/code-quality-checks   5e1aef9  true      /test code-quality-checks
ci/prow/hypershift-pr-check   5e1aef9  false     /test hypershift-pr-check

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Amarthya-v force-pushed the nodepool-sts-permissions-test branch 2 times, most recently from ad2b5db to f1274bb (August 13, 2025 15:25)
ginkgo.GinkgoLogr.Error(err, "Failed to cleanup test NodePool", "name", testNodePoolName)
}

ginkgo.By("Cleaning up test pods")
Contributor

Wouldn't deleting the nodepool also delete its pods?

@Amarthya-v force-pushed the nodepool-sts-permissions-test branch 2 times, most recently from d9b4fe5 to 37a587c (August 19, 2025 03:38)
np, err := h.Dynamic().Resource(nodePoolGVR).Namespace(clusterNamespace).
Get(ctx, testNodePoolName, metav1.GetOptions{})
if err != nil { return false, err }
ready, _, _ := unstructured.NestedInt64(np.Object, "status", "readyReplicas")
Contributor

We don't want to check for an error here?
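For reference, a sketch of the same lookup with both return values checked (NestedInt64 reports a malformed field via err and a missing one via found; the final comparison is illustrative):

np, err := h.Dynamic().Resource(nodePoolGVR).Namespace(clusterNamespace).
    Get(ctx, testNodePoolName, metav1.GetOptions{})
if err != nil {
    return false, err
}
ready, found, err := unstructured.NestedInt64(np.Object, "status", "readyReplicas")
if err != nil {
    return false, err // field exists but is not an int64
}
if !found {
    return false, nil // status not populated yet; keep polling
}
return ready > 0, nil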

@Amarthya-v force-pushed the nodepool-sts-permissions-test branch from 37a587c to d0df745 (August 28, 2025 03:15)
@Amarthya-v force-pushed the nodepool-sts-permissions-test branch from d0df745 to bc48a6d (September 4, 2025 13:59)
return false
}

func failWithSTSError(err error, operation, permissions string) {
Contributor

Why not combine this with isSTSError and call it something like failIfSTSError?

Author

Done
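For reference, the merged helper might look roughly like this (the error patterns shown are illustrative, not the exact list in the PR):

// failIfSTSError fails the spec with a release-blocking message when err
// looks like a missing AWS STS permission; otherwise it is a no-op.
func failIfSTSError(err error, operation, permissions string) {
    if err == nil {
        return
    }
    msg := strings.ToLower(err.Error())
    for _, pattern := range []string{"unauthorizedoperation", "accessdenied", "is not authorized to perform"} {
        if strings.Contains(msg, pattern) {
            ginkgo.Fail(fmt.Sprintf(
                "STS_PERMISSION_ERROR: %s failed due to missing AWS permissions (%s): %v",
                operation, permissions, err))
        }
    }
}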


func getClusterNamespace(h *helper.H) (string, error) {
if name := viper.GetString(config.Cluster.Name); name != "" {
return "clusters-" + name, nil
Contributor

You're using fmt.Sprintf in other functions but you're not using it here. It would probably be good to be consistent.

Author

Fixed, now using fmt.Sprintf consistently.
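i.e., presumably:

return fmt.Sprintf("clusters-%s", name), nil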


ginkgo.BeforeAll(func() {
cloudProvider := viper.GetString(config.CloudProvider.CloudProviderID)
if cloudProvider != "aws" {
@btoll commented Sep 5, 2025

How about using a constant here for the "aws" string and anywhere else you have "aws"? (like line 274)
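For example (the constant name is illustrative):

const awsProviderID = "aws"

cloudProvider := viper.GetString(config.CloudProvider.CloudProviderID)
if cloudProvider != awsProviderID {
    ginkgo.Skip("test only runs on AWS")
}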

})

ginkgo.It("rejects operations on non-existent NodePools", func(ctx context.Context) {
if clusterNamespace == "" || !canAccessNodePoolAPI(h, clusterNamespace) {
Contributor

Once here, clusterNamespace should never be an empty string, should it? And, can the canAccessNodePoolAPI test be done before any tests are run?

Author

moved API access check to BeforeAll and removed redundant namespace checks
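The reshuffled setup would look something like this sketch:

ginkgo.BeforeAll(func() {
    var err error
    clusterNamespace, err = getClusterNamespace(h)
    expect.NoError(err, "failed to determine cluster namespace")
    if !canAccessNodePoolAPI(h, clusterNamespace) {
        ginkgo.Skip("NodePool API is not accessible; skipping NodePool STS tests")
    }
})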

})

ginkgo.It("rejects duplicate NodePool creation", func(ctx context.Context) {
if clusterNamespace == "" || !canAccessNodePoolAPI(h, clusterNamespace) {
Contributor

Once here, clusterNamespace should never be an empty string, should it? And, can the canAccessNodePoolAPI test be done before any tests are run?

})

ginkgo.AfterAll(func() {
if createdNodePool && testNodePoolName != "" && clusterNamespace != "" {
Contributor

Is this conditional necessary? Are any of the tests doing any node pool cleanup? And, if they are, should they be?

})

ginkgo.It("creates new NodePool successfully", func(ctx context.Context) {
if clusterNamespace == "" {
Contributor

Once here, clusterNamespace should never be an empty string, should it? And, can the canAccessNodePoolAPI test be done before any tests are run?

Note, this isn't consistent with the others.

}

expect.NoError(err, "NodePool creation failed")
createdNodePool = true
Contributor

How about putting every test that depends on createdNodePool being true in the same block? Otherwise, it's kind of kludgy.

Author

Grouped NodePool creation and provisioning tests in same Describe block

expect.NoError(err, "Failed to find ready NodePool nodes")
Expect(len(nodePoolNodes)).To(BeNumerically(">", 0), "No ready NodePool nodes available")

targetNode := &nodePoolNodes[0]
Contributor

What if there are several nodepool nodes in this slice? Would it matter?

Author

Added logic to prefer ready nodes instead of just taking the first
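A sketch of that selection, using the corev1 NodeReady condition (the exact code in the PR may differ):

// Prefer the first node whose NodeReady condition is True;
// fall back to nodePoolNodes[0] if none report Ready.
targetNode := &nodePoolNodes[0]
search:
for i := range nodePoolNodes {
    for _, cond := range nodePoolNodes[i].Status.Conditions {
        if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
            targetNode = &nodePoolNodes[i]
            break search
        }
    }
}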

expect.NoError(err, "Should create pod with invalid selector")
defer client.Delete(ctx, pod)

time.Sleep(30 * time.Second)
Contributor

Why this arbitrary sleep?

Author

Replaced with proper wait condition using wait.For
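The polling version might look like this sketch; it uses k8s.io/apimachinery/pkg/util/wait directly rather than the repo's own wait.For wrapper, and assumes client is an e2e-framework resources client with Get(ctx, name, namespace, obj):

// Wait until the scheduler marks the pod unschedulable instead of
// sleeping for an arbitrary 30 seconds.
err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
    func(ctx context.Context) (bool, error) {
        var p corev1.Pod
        if err := client.Get(ctx, pod.Name, pod.Namespace, &p); err != nil {
            return false, err
        }
        for _, cond := range p.Status.Conditions {
            if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
                return true, nil
            }
        }
        return false, nil
    })
expect.NoError(err, "pod never reported an unschedulable condition")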

expect.NoError(err, "Failed to find ready NodePool nodes")
Expect(len(nodePoolNodes)).To(BeNumerically(">", 0), "No ready NodePool nodes available")

targetNode := &nodePoolNodes[0]
Contributor

Same comment as above about the possibility of several nodepool nodes.

}
}

func createPodWithSelector(namespace string, nodeSelector map[string]string) *corev1.Pod {
Contributor

It looks like this function is virtually equivalent to createTestPod?

Author

Created single generic createPod function to eliminate duplication
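The consolidated helper presumably centers on a small config struct, along these lines (field and image names are illustrative; the conditionals mirror the hunks reviewed further down):

type podConfig struct {
    name         string
    namespace    string
    nodeName     string            // optional: pin to a specific node
    nodeSelector map[string]string // optional: schedule by label selector
    cpuLimit     string            // optional: resource limits
    memLimit     string
}

func createPod(config podConfig) *corev1.Pod {
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: config.name, Namespace: config.namespace},
        Spec: corev1.PodSpec{
            Containers: []corev1.Container{{
                Name:    "test",
                Image:   "registry.access.redhat.com/ubi9/ubi-minimal:latest",
                Command: []string{"sleep", "300"},
            }},
        },
    }
    if config.nodeName != "" {
        pod.Spec.NodeName = config.nodeName
    }
    if len(config.nodeSelector) > 0 {
        pod.Spec.NodeSelector = config.nodeSelector
    }
    return pod
}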

}
}

func createResourceIntensivePod(namespace, nodeName string) *corev1.Pod {
Contributor

You could probably also roll this into the same generic createPod function.


ginkgo.AfterAll(func() {
if createdNodePool && testNodePoolName != "" && clusterNamespace != "" {
cleanupNodePool(context.Background(), h, clusterNamespace, testNodePoolName)
Contributor

Since this is only done once, do you think you need a separate function to do it? What about inlining it here?

})

ginkgo.AfterAll(func() {
if createdNodePool {
@btoll commented Sep 10, 2025

Wouldn't you want to move this entire block into the one that starts on line 95?

})

ginkgo.Describe("NodePool creation and provisioning", func() {
ginkgo.It("creates new NodePool successfully", func(ctx context.Context) {
Contributor

Wouldn't you want to create the nodepool in a BeforeAll and then you can get rid of the createdNodePool boolean?
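i.e., roughly (newTestNodePool is a hypothetical constructor):

ginkgo.BeforeAll(func(ctx context.Context) {
    _, err := h.Dynamic().Resource(nodePoolGVR).Namespace(clusterNamespace).
        Create(ctx, newTestNodePool(clusterNamespace, testNodePoolName), metav1.CreateOptions{})
    expect.NoError(err, "NodePool creation failed")
    // cleanup can then live in a matching AfterAll, with no createdNodePool flag
})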

Expect(apierrors.IsNotFound(err)).To(BeTrue(), "Should return NotFound error")
})

ginkgo.It("rejects duplicate NodePool creation", func(ctx context.Context) {
Contributor

Why not move this into the Describe block above for the nodepool?

return "", fmt.Errorf("could not determine cluster namespace")
}

func canAccessNodePoolAPI(h *helper.H, namespace string) bool {
@btoll commented Sep 10, 2025

Would you want to fail the entire suite if this cannot be accessed and exit early? You could also then remove all of the canAccessAPI checks.

for _, pattern := range patterns {
if strings.Contains(msg, pattern) {
ginkgo.Fail(fmt.Sprintf(
"STS_PERMISSION_ERROR: %s failed due to missing AWS permissions (%s): %v. "+
Contributor

This should be cleaned up. There's no need to concatenate strings within a fmt.Sprintf call.
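i.e., the concatenated literals can be a single format string (the elided tail of the message in the hunk above is omitted here):

ginkgo.Fail(fmt.Sprintf(
    "STS_PERMISSION_ERROR: %s failed due to missing AWS permissions (%s): %v",
    operation, permissions, err))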

pod.Spec.NodeSelector = config.nodeSelector
}

if config.cpuLimit != "" && config.memLimit != "" {
Contributor

Do you really need resource limits? Aren't these short-lived pods? Are you anticipating resource spikes that would need to be throttled?

},
}

if config.nodeName != "" {
Contributor

Do you need this check and the one below it? Can't it just be defined in the above struct?
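Since the zero values are inert, both fields could indeed be set directly in the struct literal:

Spec: corev1.PodSpec{
    NodeName:     config.nodeName,     // "" lets the scheduler choose a node
    NodeSelector: config.nodeSelector, // nil selector matches all nodes
    // ...
},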
