Skip to content

[BUG] The controller panics when UptimeRobot's API suddently drop the connexion #635

@polyedre

Description

@polyedre

Describe the bug

The UptimeRobot API sometimes fails with this error:

{"level":"error","ts":1750345098.7199364,"logger":"http-client","msg":"","error":"Post \"https://api.uptimerobot.com/v2/getMonitors\": unexpected EOF","stacktrace":"github.com/stakater/IngressMonitorController/v2/pkg/http.(*HttpClient).RequestWithHeaders\n\t/workspace/pkg/http/httpClient.go:52\ngithub.com/stakater/IngressMonitorController/v2/pkg/http.(*HttpClient).PostUrlEncodedFormBody\n\t/workspace/pkg/http/httpClient.go:91\ngithub.com/stakater/IngressMonitorController/v2/pkg/monitors/uptimerobot.(*UpTimeMonitorService).GetByName\n\t/workspace/pkg/monitors/uptimerobot/uptime-monitor.go:53\ngithub.com/stakater/IngressMonitorController/v2/pkg/monitors.(*MonitorServiceProxy).GetByName\n\t/workspace/pkg/monitors/monitor-proxy.go:105\ngithub.com/stakater/IngressMonitorController/v2/pkg/controllers.findMonitorByName\n\t/workspace/pkg/controllers/endpointmonitor_util.go:10\ngithub.com/stakater/IngressMonitorController/v2/pkg/controllers.(*EndpointMonitorReconciler).Reconcile\n\t/workspace/pkg/controllers/endpointmonitor_controller.go:92\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

I'm not aware of the exact root cause. I was able to reproduce consistently when requesting the API several times per seconds.

Unfortunately, the consequence is that the controller panics a few line later.

The issue is that code at https://github.com/stakater/IngressMonitorController/blob/master/pkg/http/httpClient.go#L52 does not return when it's failing. As a consequence, at https://github.com/stakater/IngressMonitorController/blob/master/pkg/http/httpClient.go#L58, the code tries to dereference response to access response.StatusCode. This panics:

{"level":"info","ts":1750345098.720687,"msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"endpointmonitor","controllerGroup":"endpointmonitor.stakater.com","controllerKind":"EndpointMonitor","endpointMonitor":{"name":"redacted","namespace":"redacted"},"namespace":"redacted","name":"redacted","reconcileID":"redacted"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x181e1df]

goroutine 289 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x1da
panic({0x1a3e740?, 0x2eb02d0?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/stakater/IngressMonitorController/v2/pkg/http.(*HttpClient).RequestWithHeaders(0xc00063b728, {0x1d0ed0d, 0x4}, {0xc0000e0c80, 0x7f, 0x80}, 0xc00063b788)
        /workspace/pkg/http/httpClient.go:58 +0x33f
github.com/stakater/IngressMonitorController/v2/pkg/http.(*HttpClient).PostUrlEncodedFormBody(...)
        /workspace/pkg/http/httpClient.go:91
github.com/stakater/IngressMonitorController/v2/pkg/monitors/uptimerobot.(*UpTimeMonitorService).GetByName(0xc00079e410, {0xc0001621b0, 0x2a})
        /workspace/pkg/monitors/uptimerobot/uptime-monitor.go:53 +0x1de
github.com/stakater/IngressMonitorController/v2/pkg/monitors.(*MonitorServiceProxy).GetByName(...)
        /workspace/pkg/monitors/monitor-proxy.go:105
github.com/stakater/IngressMonitorController/v2/pkg/controllers.findMonitorByName(0x0?, {0xc0001621b0?, 0x1?})
        /workspace/pkg/controllers/endpointmonitor_util.go:10 +0x22
github.com/stakater/IngressMonitorController/v2/pkg/controllers.(*EndpointMonitorReconciler).Reconcile(0xc00079e460, {0x30?, 0xc00007e800?}, {{{0xc000686ae0, 0x10}, {0xc000686aa0, 0x7}}})
        /workspace/pkg/controllers/endpointmonitor_controller.go:92 +0x4c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1fb42e8?, {0x1fb42b0?, 0xc002810780?}, {{{0xc000686ae0?, 0x1c330a0?}, {0xc000686aa0?, 0xc00063bda8?}}})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00048c460, {0x1fb42e8, 0xc00079e2d0}, {0x1ad6120?, 0xc000632280?})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320 +0x318
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00048c460, {0x1fb42e8, 0xc00079e2d0})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273 +0x1af
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 264
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:230 +0x565

To Reproduce

We were able to reproduce by creating and deleting an EndpointMonitor every second.

Expected behavior

I would expect the controller to gracefully handle the error from the API.

Environment

  • Operator Version: Current last version (2.2.3)
  • Kubernetes/OpenShift Version: 1.32.2

Additional context

We implemented a patch locally that force this function to return a StatusCode ServerError:

	if err != nil {
		log.Error(err, "")
+		return HttpResponse{
+			StatusCode: http.StatusInternalServerError,
+			Header:     make(map[string][]string),
+		}
	} else if response == nil {
		log.Error(nil, "got empty response")
+		return HttpResponse{
+			StatusCode: http.StatusInternalServerError,
+			Header:     make(map[string][]string),
+		}
	}

This works perfectly. However, this is not ideal.

I'm not fluent in Golang, but I guess in a perfect world this function should return an Error. Because this is a core library, we would need to patch a lot of calls to handle the errors.

The patch we did does not have this issue, and should break a little less the API. However, returning http.StatusInternalServerError is quite arbitrary.

What do you think? Do you want me to open a PR with this changes?

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions