Pod in 'CrashLoopBackOff' State - 'Readiness/Liveness probe failed: Get http://{POD_IP}:8082/actuator/health: dial tcp {POD_IP}:8082: connect: connection refused'
When checking the pod status using:
kubectl -n $(NAMESPACE) get pods
You may encounter one of the pods in an unhealthy state:
NAME                             READY   STATUS             RESTARTS   AGE
jobs-cf6b46bcc-r2rkc             1/1     Running            0          27d
management-96449b57b-bdjsk       0/1     CrashLoopBackOff   1467       27d
model-graphql-84b79fb449-xkgcc   1/1     Running            0          27d
This means that the pod is constantly attempting to start but keeps crashing.
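In a busy namespace, a quick way to spot the unhealthy pod is to filter or sort the pod list. This is only a sketch; adjust the namespace to your deployment:
# Hide healthy pods so only non-Running statuses (CrashLoopBackOff, Error, Pending) remain
kubectl -n sisense get pods --no-headers | grep -v Running
# Or sort pods by restart count to spot the one that keeps crashing
kubectl -n sisense get pods --sort-by='.status.containerStatuses[0].restartCount'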
To further investigate what's causing this constant crashing, we can check the pod events with this command:
kubectl -n sisense get events --field-selector involvedObject.name=management-96449b57b-bdjsk
Which returns the following output:
LAST SEEN   TYPE      REASON                 OBJECT                           MESSAGE
22m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Readiness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
42m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Liveness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
17m         Warning   BackOff                pod/management-96449b57b-bdjsk   Back-off restarting failed container
12m         Warning   MatchNodeSelector      pod/management-96449b57b-bdjsk   Predicate MatchNodeSelector failed
11m         Normal    TaintManagerEviction   pod/management-96449b57b-bdjsk   Cancelling deletion of Pod sisense/management-96449b57b-bdjsk
What we can see from this output is that nothing is listening on the expected endpoint/port (http://10.233.81.105:8082/actuator/health), so the readiness and liveness probes fail with 'connection refused'.
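To confirm that nothing is listening on that endpoint, you can try reaching it manually. This is only a sketch: it assumes the pod is up at that moment and that curl is available inside the management image (or on a node that can reach the pod network):
# From a node that can reach the pod network
curl -v http://10.233.81.105:8082/actuator/health
# Or from inside the pod itself, while it is running
kubectl -n sisense exec -it management-96449b57b-bdjsk -- curl -v http://localhost:8082/actuator/health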
What are the Readiness/Liveness probes?
The kubelet, which is the primary node agent and runs on each of the nodes, ensures that the pods that are supposed to be running are in a healthy state. There are three different methods the kubelet can use to check whether a pod is healthy. In Sisense deployments, we use the HTTP endpoint option, which checks whether the endpoint is alive (by default every 20 seconds; this can be changed in the pod/deployment spec under spec.containers[*].livenessProbe.periodSeconds).
The kubelet uses the liveness probe to know when to restart a container/pod.
The kubelet uses the readiness probe to know when a container is ready to accept traffic. A pod is ready when all of its containers are ready.
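Readiness is reflected in the pod's 'Ready' condition, which can be queried directly. The jsonpath filter below is a sketch using the pod name from this example:
# Prints "True" when all containers in the pod pass their readiness probes
kubectl -n sisense get pod management-96449b57b-bdjsk -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'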
We can check what is configured for the management deployment:
kubectl -n sisense get deploy management -o yaml
...
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /actuator/health
    port: 8082
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /actuator/health
    port: 8082
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
...
As we can see from the above output, both the readinessProbe and the livenessProbe use the HTTP endpoint /actuator/health on port 8082 to detect whether the pod is healthy.
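If you only need the probe sections rather than the full deployment manifest, a jsonpath query can pull them out directly. This is a sketch that assumes the management container is the first container in the spec:
# Print just the liveness and readiness probe definitions of the first container
kubectl -n sisense get deploy management -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'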
Why were the endpoints unavailable for the readiness/liveness probes?
A few ways we can troubleshoot the issue:
1) Reviewing the machine resources in Grafana, it was evident that the machine resources were insufficient (RAM maxed out), so the pod was not able to serve the endpoint that the probes check.
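If Grafana is not available, and assuming the metrics-server is installed in the cluster, the same resource pressure can usually be seen from the command line:
# Node-level CPU/RAM usage
kubectl top nodes
# Per-pod usage in the Sisense namespace, sorted by memory consumption
kubectl top pods -n sisense --sort-by=memory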
2) Access the pod and wait until it crashes again:
kubectl -n sisense exec -it management-96449b57b-x8swv -- bash
After a few minutes with an active session inside the pod, we see the following message as the session is terminated:
/opt/sisense/management# command terminated with exit code 137
Exit code 137 means the container was killed with SIGKILL (128 + 9). In this context it signals that the application hit an out-of-memory condition and Kubernetes had Docker stop the container.
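There is no need to keep a session open and wait for the next crash: the exit code of the previous run is also recorded in the pod status. The jsonpath below is a sketch that assumes the management container is the first container in the pod:
# Exit code and reason of the last terminated container instance
kubectl -n sisense get pod management-96449b57b-bdjsk -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'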
3) We can run the command:
docker inspect $(CONTAINER_ID)
To get more information about why the container was terminated.
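For example, once you have the container ID (retrieved as described below), you can limit the output to the fields that indicate an out-of-memory kill. This is a sketch and must be run on the node where the container ran:
# Prints the OOMKilled flag and the exit code of the container
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' $(CONTAINER_ID)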
To retrieve the docker container ID, we can describe the pod and review the output:
kubectl -n sisense describe pod management
...
Containers:
  management:
    Container ID:  docker://599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe
...
4) We can use the systemd journal to check for any out-of-memory (OOM) messages:
journalctl -r -k
# OR
journalctl -r -k | grep -i -e memory -e oom
You'd need to run this command on the node that was hosting the container that was killed.
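To find out which node was hosting the pod (so you know where to run journalctl), the wide output of kubectl includes a NODE column:
# The NODE column shows the node the pod is scheduled on
kubectl -n sisense get pod management-96449b57b-bdjsk -o wide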
5) Using the 'describe' command, we can clearly see that the container failed because of an OOM condition, with exit code 137:
kubectl -n sisense describe pod management
# ...
# Containers:
#   management:
#     ...
#     State:
#       Reason:      CrashLoopBackOff
#     Last State:
#       Reason:      Error
#       Exit Code:   137
# ...