Skip to content

Incorrect Container Memory Consumption Graph Behavior When Pod is Restarted #2522

@vladmalynych

Description

@vladmalynych

Problem:

The Grafana dashboards defined in grafana-dashboardDefinitions.yaml include graphs for memory consumption per pod. The memory consumption query currently used is:

https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/grafana-dashboardDefinitions.yaml#L8300

                  "targets": [
                      {
                          "datasource": {
                              "type": "prometheus",
                              "uid": "${datasource}"
                          },
                          "expr": "sum(container_memory_working_set_bytes{job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"\", image!=\"\"}) by (container)",
                          "legendFormat": "__auto"
                      },
                      {
                          "datasource": {
                              "type": "prometheus",
                              "uid": "${datasource}"
                          },
                          "expr": "sum(\n    kube_pod_container_resource_requests{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", resource=\"memory\"}\n)\n",
                          "legendFormat": "requests"
                      },
                      {
                          "datasource": {
                              "type": "prometheus",
                              "uid": "${datasource}"
                          },
                          "expr": "sum(\n    kube_pod_container_resource_limits{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", resource=\"memory\"}\n)\n",
                          "legendFormat": "limits"
                      }
                  ],
                  "title": "Memory Usage (WSS)",
                  "type": "timeseries"
              },

When a pod is restarted, the current query adds memory usage data from both the old and new containers simultaneously. This can lead to temporary spikes in the displayed memory consumption. As a result, the dashboard may show memory usage that exceeds the container's memory limit, even though the actual memory consumption is within the limit.

Screenshot 2024-09-17 at 14 20 49
Screenshot 2024-09-17 at 14 22 55 (1)

Steps to Reproduce:

  • Trigger a pod restart (e.g OOM kill, or Evict).
  • Compare graphs with expression grouped by just container field with graph that has expression that groups by container and id:
"expr": "sum(container_memory_working_set_bytes{job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"\", image!=\"\"}) by (container, id)"

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions