/kind bug
Description
After a new GKE cluster becomes ready, GCP may automatically trigger a node pool update. This does not happen every time, but I have seen it occur repeatedly, though never with Autopilot enabled. The controller then complains that an operation is already running on the cluster. The same happens when deleting a cluster, and it may apply to any other event in which GCP performs managed updates.
The following is the error log from the controller:
reconcile.go:204] "Deleting node pool resources" controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202"
gcpmanagedmachinepool_controller.go:383] "Reconcile error" err=<
rpc error: code = FailedPrecondition desc = Cluster is running incompatible operation operation-1731137433803-29e34938-9333-469b-b14f-5bdf10b143d2.
error details: name = ErrorInfo reason = CLUSTER_ALREADY_HAS_OPERATION domain = container.googleapis.com metadata = map[]
error details: name = RequestInfo id = 0xe7ae075979bd6d29 data =
> controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202" controller="gcpmanagedmachinepool" action="delete" reconciler="nodepools"
At some point in the reconciliation loop, a node pool update operation is initiated while the automatic update is still running, causing this error. Once the node pool update completes, the controller resumes normal operation, and the cluster either becomes ready or is deleted successfully.
As a user, I would expect the CAPG controller to handle this error gracefully, avoiding node pool updates for clusters that already have an operation in progress. Initial investigation suggests that this issue may be caused by the controller not being able to unwrap an error during reconciliation.