intapiuser
Community Team Member
on 03-02-2023 08:52 AM
When checking the pod status with:
kubectl -n $NAMESPACE get pods
you may find that one of the pods is in an unhealthy state:
NAME                             READY   STATUS             RESTARTS   AGE
jobs-cf6b46bcc-r2rkc             1/1     Running            0          27d
management-96449b57b-bdjsk       0/1     CrashLoopBackOff   1467       27d
model-graphql-84b79fb449-xkgcc   1/1     Running            0          27d
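In a busy namespace, a quick way to list only the pods that are not fully up is to filter out the healthy ones (a simple sketch; adjust the pattern to your environment):
kubectl -n sisense get pods | grep -vE 'Running|Completed'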
The CrashLoopBackOff status means that the pod is repeatedly attempting to start but keeps crashing.
To further investigate what is causing the constant crashing, we can check the pod events with this command:
kubectl -n sisense get events --field-selector involvedObject.name=management-96449b57b-bdjsk
This returns the following output:
LAST SEEN   TYPE      REASON                 OBJECT                           MESSAGE
22m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Readiness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
42m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Liveness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
17m         Warning   BackOff                pod/management-96449b57b-bdjsk   Back-off restarting failed container
12m         Warning   MatchNodeSelector      pod/management-96449b57b-bdjsk   Predicate MatchNodeSelector failed
11m         Normal    TaintManagerEviction   pod/management-96449b57b-bdjsk   Cancelling deletion of Pod sisense/management-96449b57b-bdjsk
This output shows that both the readiness and liveness probes are failing because nothing is listening on the expected endpoint and port (http://10.233.81.105:8082/actuator/health).
What are the Readiness/Liveness probes?
The kubelet, the primary node agent that runs on each node, ensures that the pods scheduled to it are in a healthy state. The kubelet can check pod health in three different ways: an HTTP request, a TCP socket check, or an exec command. In Sisense deployments, we use the HTTP endpoint option, which checks whether the endpoint responds (by default every 20 seconds; this can be changed in the deployment under spec.containers[*].livenessProbe.periodSeconds).
The kubelet uses the liveness probe to know when to restart a container/pod.
The kubelet uses the readiness probe to know when a container is ready to accept traffic. A pod is ready when all of its containers are ready.
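As noted above, the probe interval can be adjusted in the deployment spec. One way to do this (a sketch, assuming the probe you want to change is defined on the first container of the management deployment) is a JSON patch:
kubectl -n sisense patch deployment management --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":30}]'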
We can check what is configured for the management deployment:
kubectl -n sisense get deploy management -o yaml
...
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /actuator/health
    port: 8082
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /actuator/health
    port: 8082
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
...
As we can see from the output above, both the readinessProbe and the livenessProbe use the HTTP endpoint /actuator/health on port 8082 to determine whether the pod is healthy.
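To verify the endpoint manually, we can forward the port locally and query it ourselves. This is a minimal sketch, assuming local access to the cluster; on a pod that is crash looping the request will fail, which is itself a useful confirmation:
kubectl -n sisense port-forward deploy/management 8082:8082
# in a second terminal
curl -v http://localhost:8082/actuator/health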
Why were the endpoints unavailable for the readiness/liveness probes?
There are a few ways we can troubleshoot the issue:
1) Review the machine resources in Grafana. In this case it was evident that the node's resources were insufficient (RAM was maxed out), so the application inside the pod could not start and respond on the probe endpoint.
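If Grafana is not available, a rough command-line equivalent (a sketch that assumes the metrics-server add-on is installed in the cluster) is:
kubectl top nodes
kubectl top pods -n sisense --sort-by=memory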
2) Access the pod and wait until it crashes again:
kubectl -n sisense exec -it management-96449b57b-x8swv -- bash
After a few minutes with an active session inside the pod, the session is terminated and we see the following message:
/opt/sisense/management# command terminated with exit code 137
Exit code 137 means the container was killed with SIGKILL (128 + 9). In this context it signals that the container hit an out-of-memory (OOM) condition and was forcefully stopped by Kubernetes or the kernel OOM killer.
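The same exit code can also be read from the pod status without waiting for a crash inside an interactive shell. A minimal sketch, assuming the pod has a single container as in the examples above:
kubectl -n sisense get pod management-96449b57b-bdjsk -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'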
3) We can run the following command to get more information about why the container was terminated:
docker inspect $CONTAINER_ID
To retrieve the Docker container ID, we can describe the pod and review the output:
kubectl -n sisense describe pod management
...
Containers:
  management:
    Container ID:   docker://599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe
...
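With the container ID in hand, we can narrow the inspect output down to the termination-related fields. A minimal sketch, to be run on the node that hosted the container and assuming a Docker-based runtime:
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.FinishedAt}}' 599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe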
4) We can use the systemd journal to check for any out-of-memory messages:
journalctl -r -k
# OR
journalctl -r -k | grep -i -e memory -e oom
You need to run this command on the node that was hosting the container that was killed.
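If journalctl is not available on that node, a rough alternative (assuming root access on the node) is to grep the kernel ring buffer directly:
dmesg -T | grep -i -e memory -e oom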
5) Using the 'describe' command, we can clearly see that the container failed because of an OOM condition, with exit code 137:
kubectl -n sisense describe pod management
...
Containers:
  management:
    ...
    State:
      Reason:       CrashLoopBackOff
    Last State:
      Reason:       Error
      Exit Code:    137
...
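Once an OOM kill is confirmed, it can also be worth checking which memory requests and limits are configured for the container, since those influence when the container is killed. A minimal sketch, assuming the management deployment has a single container:
kubectl -n sisense get deploy management -o jsonpath='{.spec.template.spec.containers[0].resources}'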