Automatic GKE node pool update fills controller logs with error #1363

/kind bug

# Description

After a new GKE cluster becomes ready, GCP may automatically trigger a node pool update. This doesn't always happen, but I've seen it repeatedly, though not with Autopilot enabled. While that update is running, the controller complains that the cluster already has an operation in progress. The same happens when deleting a cluster, and it may apply to any other event in which GCP performs managed updates.

The following is the error log from the controller:

```
reconcile.go:204] "Deleting node pool resources" controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202"
gcpmanagedmachinepool_controller.go:383] "Reconcile error" err=<
	rpc error: code = FailedPrecondition desc = Cluster is running incompatible operation operation-1731137433803-29e34938-9333-469b-b14f-5bdf10b143d2.
	error details: name = ErrorInfo reason = CLUSTER_ALREADY_HAS_OPERATION domain = container.googleapis.com metadata = map[]
	error details: name = RequestInfo id = 0xe7ae075979bd6d29 data =
 > controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202" controller="gcpmanagedmachinepool" action="delete" reconciler="nodepools"
```

At some point in the reconciliation loop, a node pool update operation is initiated while the automatic update is still running, causing this error. Once the node pool update completes, the controller resumes normal operation, and the cluster either becomes ready or is deleted successfully.

As a user, I would expect the CAPG controller to handle this error gracefully, avoiding node pool updates for clusters that already have an operation in progress. Initial investigation suggests that this issue may be caused by the controller not being able to unwrap an error during reconciliation.
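
To illustrate the unwrapping hypothesis, here is a minimal, standard-library-only sketch. It is not CAPG code: `apiError`, `wrapWithContext`, `isOperationInProgress`, and `handleReconcileError` are hypothetical names, and `apiError` is only a stand-in for the gRPC status error shown in the log above (a real fix would presumably inspect the gRPC status and its `ErrorInfo` detail instead). It shows that the check only works if the error chain is preserved with `%w`, and that a matching error can be turned into a delayed requeue rather than a logged reconcile error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// apiError is a hypothetical stand-in for the error returned by the GKE API.
// The field values used below mirror the code and reason seen in the log.
type apiError struct {
	Code   string
	Reason string
	Msg    string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("rpc error: code = %s desc = %s", e.Code, e.Msg)
}

// wrapWithContext adds context to err while keeping it in the error chain (%w).
func wrapWithContext(op string, err error) error {
	return fmt.Errorf("%s: %w", op, err)
}

// isOperationInProgress reports whether err, or any error it wraps, says that
// the cluster is already running another operation.
func isOperationInProgress(err error) bool {
	var apiErr *apiError
	if !errors.As(err, &apiErr) {
		return false
	}
	return apiErr.Code == "FailedPrecondition" &&
		apiErr.Reason == "CLUSTER_ALREADY_HAS_OPERATION"
}

// handleReconcileError swallows "operation in progress" errors and returns a
// requeue delay instead; every other error is returned unchanged.
func handleReconcileError(err error, delay time.Duration) (time.Duration, error) {
	if isOperationInProgress(err) {
		return delay, nil
	}
	return 0, err
}

func main() {
	base := &apiError{
		Code:   "FailedPrecondition",
		Reason: "CLUSTER_ALREADY_HAS_OPERATION",
		Msg:    "Cluster is running incompatible operation",
	}

	wrapped := wrapWithContext("deleting node pool", base)
	flattened := fmt.Errorf("deleting node pool: %v", base) // %v drops the chain

	fmt.Println(isOperationInProgress(wrapped))   // true
	fmt.Println(isOperationInProgress(flattened)) // false

	delay, err := handleReconcileError(wrapped, 30*time.Second)
	fmt.Println(delay, err) // 30s <nil>
}
```

In a controller, the returned delay would map to a requeue of the reconcile request, so the node pool operation is retried once the managed update has finished. The 30-second delay here is an arbitrary value chosen for the example.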
