Automatic GKE node pool update fills controller logs with error #1363

@salasberryfin

Description

/kind bug

After a new GKE cluster becomes ready, GCP may automatically trigger a node pool update. This doesn't always happen, but I've seen it occur repeatedly (never with Autopilot enabled, though). While that update is running, the controller complains about an existing operation. The same happens when deleting a cluster, and it may apply to any other event in which GCP performs managed updates.

The following is the error log from the controller:

reconcile.go:204] "Deleting node pool resources" controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202"
gcpmanagedmachinepool_controller.go:383] "Reconcile error" err=<
	rpc error: code = FailedPrecondition desc = Cluster is running incompatible operation operation-1731137433803-29e34938-9333-469b-b14f-5bdf10b143d2.
	error details: name = ErrorInfo reason = CLUSTER_ALREADY_HAS_OPERATION domain = container.googleapis.com metadata = map[]
	error details: name = RequestInfo id = 0xe7ae075979bd6d29 data =
 > controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202" controller="gcpmanagedmachinepool" action="delete" reconciler="nodepools"

At some point in the reconciliation loop, a node pool update operation is initiated while the automatic update is still running, causing this error. Once the node pool update completes, the controller resumes normal operation, and the cluster either becomes ready or is deleted successfully.

As a user, I would expect the CAPG controller to handle this error gracefully, avoiding node pool updates for clusters that already have an operation in progress. Initial investigation suggests that this issue may be caused by the controller not being able to unwrap an error during reconciliation.

Metadata

Assignees

No one assigned

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
