When checking the pod status using:
kubectl -n $(NAMESPACE) get pods

You may encounter one of the pods in an unhealthy state:
jobs-cf6b46bcc-r2rkc                                            1/1     Running            0          27d
management-96449b57b-bdjsk                                      0/1     CrashLoopBackOff   1467       27d
model-graphql-84b79fb449-xkgcc                                  1/1     Running            0          27d

Which means that the pod is constantly attempting to initialize but crashing.
To further investigate what's causing this constant crashing, we can check the pod events with this command:
kubectl -n sisense get events --field-selector involvedObject.name=management-96449b57b-bdjsk

Which returns the following output:
LAST SEEN   TYPE      REASON                 OBJECT                           MESSAGE
22m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Readiness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
42m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Liveness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
17m         Warning   BackOff                pod/management-96449b57b-bdjsk   Back-off restarting failed container
12m         Warning   MatchNodeSelector      pod/management-96449b57b-bdjsk   Predicate MatchNodeSelector failed
11m         Normal    TaintManagerEviction   pod/management-96449b57b-bdjsk   Cancelling deletion of Pod sisense/management-96449b57b-bdjsk

What we can see from this output is that the readiness and liveness probes are failing because nothing is listening on the expected endpoint/port (http://10.233.81.105:8082/actuator/health).
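It can also help to pull the logs from the last crashed instance of the container; the --previous flag returns output from the previously terminated container rather than the current one (pod name here is the one from this example):
kubectl -n sisense logs management-96449b57b-bdjsk --previous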

What are the readiness/liveness probes?

The kubelet, which is the primary node agent and runs on each node, ensures that the pods that are supposed to be running are in a healthy state. There are three different methods the kubelet can use to check whether a pod is healthy. In Sisense deployments, we use the HTTP endpoint option, which checks whether the endpoint is alive (by default every 20 seconds; this can be changed in the pod deployment under spec.containers[*].livenessProbe.periodSeconds).
The kubelet uses the liveness probe to know when to restart a container/pod.
The kubelet uses the readiness probe to know when the container is ready to accept traffic. A pod is ready when all of its containers are ready.
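If the probe timing ever needs to be adjusted, a JSON patch along these lines can be applied (the value 30 below is purely illustrative, and it assumes the management container is the first container in the deployment spec):
kubectl -n sisense patch deploy management --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds", "value": 30}]'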

We can check what is configured for the management deployment:
kubectl -n sisense get deploy management -o yaml
 
...
livenessProbe:
   failureThreshold: 3
   httpGet:
     path: /actuator/health
     port: 8082
     scheme: HTTP
   initialDelaySeconds: 60
   periodSeconds: 20
   successThreshold: 1
   timeoutSeconds: 10
readinessProbe:          
   failureThreshold: 3    
   httpGet:               
     path: /actuator/health
     port: 8082           
     scheme: HTTP         
   initialDelaySeconds: 10
   periodSeconds: 10      
   successThreshold: 1    
   timeoutSeconds: 5      
...

As we can see from the above output, both the readinessProbe and the livenessProbe use the HTTP endpoint /actuator/health on port 8082 to detect whether the pod is healthy.
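As a quick sanity check, the same endpoint the probes use can be queried manually from inside the pod (this assumes curl is available in the container image; a healthy Spring Boot-style actuator endpoint typically returns {"status":"UP"}):
kubectl -n sisense exec management-96449b57b-bdjsk -- curl -s http://localhost:8082/actuator/health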

Why were the endpoints unavailable for the readiness/liveness probes?


A few ways we can troubleshoot the issue:
1) Review the machine resources in Grafana. In this case, it was evident that the machine's resources were insufficient (RAM was maxed out), so the application inside the pod could not respond on the endpoint the probes were checking.
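If Grafana is not available, a rough command-line equivalent (assuming the metrics-server add-on is installed; in recent kubectl versions --sort-by can order pods by memory usage) is:
kubectl top nodes
kubectl -n sisense top pods --sort-by=memory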

2) Access the pod and wait until it crashes again:
kubectl -n sisense exec management-96449b57b-x8swv -it -- bash

After a few minutes of having an active session inside the pod, we see the following message as the session is terminated:
/opt/sisense/management# command terminated with exit code 137

Exit code 137 corresponds to SIGKILL (128 + 9); in this context it signals that the application ran into an out-of-memory condition and the container was killed by Kubernetes via the container runtime (the equivalent of a 'docker stop'/kill).
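The same information can be read directly from the pod status; a sketch using the pod name from this example (and assuming the management container is the first container listed):
kubectl -n sisense get pod management-96449b57b-bdjsk \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"  "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
A reason of OOMKilled together with exit code 137 confirms the container was killed for exceeding the available memory.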

3) We can run the following command to get more information about why the container was terminated (run it on the node hosting the container):
docker inspect $(CONTAINER_ID)
To retrieve the docker container ID, we can describe the pod and review the output:
kubectl -n sisense describe pod management
...
Containers:
   management:
     Container ID: docker://599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe
...
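Putting this together, a sketch that extracts the container ID from the pod status and checks the OOM flag (the kubectl command can run from anywhere kubectl is configured; docker inspect must run on the node hosting the container, and this assumes the Docker runtime is in use):
CONTAINER_ID=$(kubectl -n sisense get pod management-96449b57b-bdjsk \
  -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|docker://||')
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' "$CONTAINER_ID"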

4) We can use the systemd journal to check for any out-of-memory messages:
journalctl -r -k
# OR
journalctl -r -k | grep -i -e memory -e oom
You'd need to run this command on the node that was hosting the container that was killed.

5) Using the 'describe' command, we can see that the container failed due to OOM with exit code 137:
kubectl -n sisense describe pod management
 
# ...
# Containers:
#   management:
#     ...
#     State:
#       Reason:   CrashLoopBackOff
#     Last State:
#       Reason:   Error
#       Exit Code: 137
#     ...
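Once OOM is confirmed, it is also worth checking the memory requests/limits configured for the container alongside the node's available memory; a minimal sketch:
kubectl -n sisense get deploy management \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
From there, the usual remediation is to free up or add memory on the node, or to adjust the container's memory requests/limits.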