intapiuser
Community Team Member
on 03-02-2023 08:52 AM
When checking the pod status with:
kubectl -n $NAMESPACE get pods
you may find that one of the pods is in an unhealthy state:
NAME                             READY   STATUS             RESTARTS   AGE
jobs-cf6b46bcc-r2rkc             1/1     Running            0          27d
management-96449b57b-bdjsk       0/1     CrashLoopBackOff   1467       27d
model-graphql-84b79fb449-xkgcc   1/1     Running            0          27d
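In a busy namespace, a quick way to list only the pods that are not fully up is to filter out the healthy ones (a simple sketch; adjust the pattern to your environment):
kubectl -n sisense get pods | grep -vE 'Running|Completed'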
The CrashLoopBackOff status means that the pod is repeatedly attempting to start but keeps crashing.
To further investigate what is causing the constant crashing, we can check the pod events with this command:
kubectl -n sisense get events --field-selector involvedObject.name=management-96449b57b-bdjsk
This returns the following output:
LAST SEEN   TYPE      REASON                 OBJECT                           MESSAGE
22m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Readiness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
42m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Liveness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
17m         Warning   BackOff                pod/management-96449b57b-bdjsk   Back-off restarting failed container
12m         Warning   MatchNodeSelector      pod/management-96449b57b-bdjsk   Predicate MatchNodeSelector failed
11m         Normal    TaintManagerEviction   pod/management-96449b57b-bdjsk   Cancelling deletion of Pod sisense/management-96449b57b-bdjsk
This output shows that both the readiness and liveness probes are failing because nothing is listening on the expected endpoint and port (http://10.233.81.105:8082/actuator/health).
What are the Readiness/Liveness probes?
The kubelet, the primary node agent that runs on each node, ensures that the pods scheduled to it are in a healthy state. The kubelet can check pod health in three different ways: an HTTP request, a TCP socket check, or an exec command. In Sisense deployments, we use the HTTP endpoint option, which checks whether the endpoint responds (by default every 20 seconds; this can be changed in the deployment under spec.containers[*].livenessProbe.periodSeconds).
The kubelet uses the liveness probe to know when to restart a container/pod.
The kubelet uses the readiness probe to know when a container is ready to accept traffic. A pod is ready when all of its containers are ready.
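As noted above, the probe interval can be adjusted in the deployment spec. One way to do this (a sketch, assuming the probe you want to change is defined on the first container of the management deployment) is a JSON patch:
kubectl -n sisense patch deployment management --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":30}]'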
We can check what is configured for the management deployment:
kubectl -n sisense get deploy management -o yaml
...
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /actuator/health
    port: 8082
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /actuator/health
    port: 8082
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
...
As we can see from the output above, both the readinessProbe and the livenessProbe use the HTTP endpoint /actuator/health on port 8082 to determine whether the pod is healthy.
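To verify the endpoint manually, we can forward the port locally and query it ourselves. This is a minimal sketch, assuming local access to the cluster; on a pod that is crash looping the request will fail, which is itself a useful confirmation:
kubectl -n sisense port-forward deploy/management 8082:8082
# in a second terminal
curl -v http://localhost:8082/actuator/health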
Why were the endpoints unavailable for the readiness/liveness probes?
There are a few ways we can troubleshoot the issue:
1) Review the machine resources in Grafana. In this case it was evident that the node's resources were insufficient (RAM was maxed out), so the application inside the pod could not start and respond on the probe endpoint.
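If Grafana is not available, a rough command-line equivalent (a sketch that assumes the metrics-server add-on is installed in the cluster) is:
kubectl top nodes
kubectl top pods -n sisense --sort-by=memory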
2) Access the pod and wait until it crashes again:
kubectl -n sisense exec -it management-96449b57b-x8swv -- bash
After a few minutes with an active session inside the pod, the session is terminated and we see the following message:
/opt/sisense/management# command terminated with exit code 137
Exit code 137 means the container was killed with SIGKILL (128 + 9). In this context it signals that the container hit an out-of-memory (OOM) condition and was forcefully stopped by Kubernetes or the kernel OOM killer.
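The same exit code can also be read from the pod status without waiting for a crash inside an interactive shell. A minimal sketch, assuming the pod has a single container as in the examples above:
kubectl -n sisense get pod management-96449b57b-bdjsk -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'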
3) We can run the following command to get more information about why the container was terminated:
docker inspect $CONTAINER_ID
To retrieve the Docker container ID, we can describe the pod and review the output:
kubectl -n sisense describe pod management
...
Containers:
  management:
    Container ID:   docker://599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe
...
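With the container ID in hand, we can narrow the inspect output down to the termination-related fields. A minimal sketch, to be run on the node that hosted the container and assuming a Docker-based runtime:
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.FinishedAt}}' 599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe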
4) We can use the systemd journal to check for any out-of-memory messages:
journalctl -r -k
# OR
journalctl -r -k | grep -i -e memory -e oom
You need to run this command on the node that was hosting the container that was killed.
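If journalctl is not available on that node, a rough alternative (assuming root access on the node) is to grep the kernel ring buffer directly:
dmesg -T | grep -i -e memory -e oom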
5) Using the 'describe' command, we can clearly see that the container failed because of an OOM condition, with exit code 137:
kubectl -n sisense describe pod management
...
Containers:
  management:
    ...
    State:
      Reason:       CrashLoopBackOff
    Last State:
      Reason:       Error
      Exit Code:    137
...
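Once an OOM kill is confirmed, it can also be worth checking which memory requests and limits are configured for the container, since those influence when the container is killed. A minimal sketch, assuming the management deployment has a single container:
kubectl -n sisense get deploy management -o jsonpath='{.spec.template.spec.containers[0].resources}'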