Troubleshooting Pods in Kubernetes Clusters

vsolodkyi · 09-05-2024

In Kubernetes, pods can encounter various issues that prevent them from running correctly. Understanding the different pod states and how to troubleshoot them is essential for maintaining a healthy cluster. This guide covers common pod states and provides steps for diagnosing and resolving issues.

Common Pod States:

Init: The pod is initializing. All init containers must complete successfully before the main containers start.
0/1 Running: The pod is running, but not all of its containers are ready.
CrashLoopBackOff: A container in the pod keeps crashing, and Kubernetes waits longer between each restart attempt.
Pending: The pod has been accepted by the cluster but has not yet been scheduled onto a node.
ImagePullBackOff: Kubernetes cannot pull the container image and is backing off before retrying.
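
As a quick first pass over the states listed above, the sketch below filters the output of kubectl get pods down to pods that are not fully ready. The function name unhealthy_pods is hypothetical, and the sketch assumes the default column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the NAME, READY and STATUS of pods whose READY column is x/y with x != y.
# Pods in the Completed state are skipped, since 0/1 is normal for finished jobs.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if (ready[1] != ready[2] && $3 != "Completed") print $1, $2, $3
  }'
}
```

Usage, assuming the sisense namespace used throughout this guide:
kubectl -n sisense get pods | unhealthy_pods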

Troubleshooting Steps

1. Pod in Init State:

When a pod is stuck in the Init state, it indicates that one or more init containers haven't completed successfully.

Check Pod Description:
kubectl -n sisense describe pod <pod_name>
In the description, look for the Init Containers section. Every init container should show State: Terminated with Reason: Completed. If one is still running or has failed, it blocks the pod's main containers from starting.
Example:

Init Containers:
  init-mongodb:
    State:          Running

Here the init-mongodb container is still running, so the pod is waiting on it.

Check Logs of Init Containers: If the init container is still running or has failed, check its logs:
kubectl -n sisense logs <pod_name> -c init-mongodb

After identifying issues in the init container, investigate the related services or dependencies, such as the MongoDB pod itself in this example.
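
While a pod is initializing, the STATUS column of kubectl get pods shows values such as Init:1/3 (one of three init containers has completed) or Init:CrashLoopBackOff (an init container keeps failing). The sketch below turns such a value into a readable message. The function name init_status is hypothetical.

```shell
# Hypothetical helper: interpret an Init:* value taken from the STATUS
# column of `kubectl get pods`.
init_status() {
  case "$1" in
    Init:[0-9]*/[0-9]*)
      progress=${1#Init:}   # e.g. "1/3"
      echo "${progress%/*} of ${progress#*/} init containers completed"
      ;;
    Init:*)
      echo "init container problem: ${1#Init:}"
      ;;
    *)
      echo "not in an init state: $1"
      ;;
  esac
}
```

For example, init_status Init:1/3 reports that 1 of 3 init containers completed, which tells you to inspect the second init container listed in the pod description.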

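2. Pod in 0/1 or 1/2 Running State:

This state indicates that the pod is running, but not all containers are ready. The READY column of kubectl get pods shows ready containers over total containers, so 1/2 means that one of the pod's two containers is ready.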

Describe the Pod:
kubectl -n sisense describe pod <pod_name>

Check the State section for each container. Look for reasons why a container is not in a ready state, such as CrashLoopBackOff, ImagePullBackOff, or other errors.
Check Container Logs: If a container is in an error state or has restarted, its logs can provide more context about the issue.

Current Logs:
kubectl -n sisense logs <pod_name> -c <container_name>

Replace <container_name> with the name of the specific container.

Previous Logs:
kubectl -n sisense logs <pod_name> -c <container_name> -p

This command retrieves logs from the previous instance of the container, which can be particularly useful if the container has restarted.
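
When a pod has several containers, it can help to generate the log commands above for each one. The following is a minimal sketch that only prints the commands so they can be reviewed before running. The function name log_commands is hypothetical.

```shell
# Hypothetical helper: print (not run) the current-log and previous-log
# commands for every container of a pod.
# $1 = namespace, $2 = pod name, remaining arguments = container names.
log_commands() {
  ns=$1
  pod=$2
  shift 2
  for c in "$@"; do
    echo "kubectl -n $ns logs $pod -c $c"
    echo "kubectl -n $ns logs $pod -c $c -p"
  done
}
```

The container names of a pod can be listed with:
kubectl -n sisense get pod <pod_name> -o jsonpath='{.spec.containers[*].name}'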

3. Pod in CrashLoopBackOff State:

A pod enters the CrashLoopBackOff state when it repeatedly fails and is restarted. To diagnose this issue:

Describe the Pod:
kubectl -n sisense describe pod <pod_name>
This command provides detailed information, including the events and container statuses.
Example:

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled

The OOMKilled reason indicates that the container was killed for exceeding its memory limit. Raise the container's memory limit, or investigate why the application uses more memory than expected.
Check Events and Container States: At the bottom of the describe output, you'll find the Events section, which includes messages about why the container failed. For example, a BackOff event with the message "Back-off restarting failed container" confirms the restart loop, and an Unhealthy event points to a failing liveness probe.
Review Logs: Logs can provide valuable insights when a pod is in the CrashLoopBackOff state.

Check Current Logs:
kubectl -n sisense logs <pod_name>

This command retrieves the logs from the current container instance.

Check Previous Logs:
kubectl -n sisense logs <pod_name> -p

This command retrieves logs from the previous container instance, which is useful if the container was restarted. Specify the container name if the pod has multiple containers:
kubectl -n sisense logs <pod_name> -c <container_name> -p
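
The Last State reason shown in the describe example earlier in this section (OOMKilled) is often the fastest clue. The sketch below extracts it from describe output. The function name last_state_reason is hypothetical, and the parsing assumes the State / Last State / Reason layout shown in that example.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and
# prints the Reason that follows each "Last State:" line.
last_state_reason() {
  awk '
    /^[ \t]*Last State:/        { in_last = 1; next }   # start of a Last State block
    /^[ \t]*(State|Ready):/     { in_last = 0 }         # a different field ends the block
    in_last && /^[ \t]*Reason:/ { print $2; in_last = 0 }
  '
}
```

Usage:
kubectl -n sisense describe pod <pod_name> | last_state_reason

The same value is also available in the pod status under .status.containerStatuses[*].lastState.terminated.reason, which can be read with kubectl get pod and -o jsonpath.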

4. Pod in Pending State:

If a pod is Pending, it means it hasn't been scheduled on a node yet.

Check Pod Scheduling Events:
kubectl -n sisense describe pod <pod_name>

Look for events like:

Warning  FailedScheduling  85m   default-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules.

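This message indicates that the scheduler couldn't find a suitable node for the pod. In this example, none of the 3 nodes qualified: 1 did not match the pod's node affinity or node selector, and 2 were ruled out by pod anti-affinity rules. Other causes, such as insufficient CPU or memory, are reported in the same message format.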
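
Scheduling messages pack several reasons into one line. As a rough sketch, the function below splits such a message into one reason per line. The function name scheduling_reasons is hypothetical, and the parsing assumes the message format shown in the example above; other Kubernetes versions may append extra text to the message.

```shell
# Hypothetical helper: reads FailedScheduling event lines on stdin and
# prints each per-node reason on its own line.
scheduling_reasons() {
  awk '{
    marker = "nodes are available: "
    i = index($0, marker)                 # first occurrence of the marker
    if (i == 0) next                      # not a scheduling message
    rest = substr($0, i + length(marker))
    n = split(rest, parts, ", ")
    for (k = 1; k <= n; k++) {
      sub(/\.$/, "", parts[k])            # drop the trailing period
      print parts[k]
    }
  }'
}
```

Usage:
kubectl -n sisense describe pod <pod_name> | grep FailedScheduling | scheduling_reasons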

5. Pod in ImagePullBackOff State:

This state occurs when a pod cannot pull the container image from the registry.

Check Pod Description for Image Issues:

kubectl -n sisense describe pod <pod_name>

Look for messages indicating issues with pulling the image, such as incorrect image names, tag issues, authentication problems, or network errors. For multinode deployments, note which node the message names, because the image may not be available on every node.
Verify Image Name and Tag: Ensure that the image name and tag are correct and that the image is available in the specified registry.
Check Image Pull Secrets: If the image is in a private registry, ensure that the registry is reachable from the node and that the pod references a valid image pull secret with working credentials.

Manually Pull the Image: Sometimes, images are not available or cannot be downloaded within the default timeout. To verify the availability of the image and check for any errors, try pulling the image manually on the node specified in the error message:

docker pull <image_name>:<tag>

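Replace <image_name> and <tag> with the appropriate image and tag names. This can help determine if the issue is with the image itself or the registry configuration. On nodes that use containerd rather than Docker, the docker command may not be installed; crictl pull <image_name>:<tag> is the usual equivalent there.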
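
When verifying the image name and tag, keep in mind that a missing tag defaults to latest and that a colon in the registry part is a port, not a tag. The sketch below splits an image reference accordingly. The function name split_image and the registry name registry.example.com are hypothetical.

```shell
# Hypothetical helper: split an image reference into repository and tag.
# A colon counts as a tag separator only when it comes after the last slash,
# so a registry port (registry.example.com:5000/app) is not mistaken for a tag.
split_image() {
  ref=$1
  last=${ref##*/}                                   # part after the last slash
  case "$last" in
    *@*) echo "${ref%@*} ${ref##*@}" ;;             # digest reference: repository and digest
    *:*) echo "${ref%:*} ${last##*:}" ;;            # explicit tag
    *)   echo "$ref latest" ;;                      # no tag given: defaults to latest
  esac
}
```

For example, split_image registry.example.com:5000/team/app prints the full repository followed by latest, which makes it clear that no tag was specified in the pod definition.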

Conclusion

By understanding these common pod states and following the troubleshooting steps, you can diagnose and resolve many issues in Kubernetes. Regularly monitoring pods and logs is essential for maintaining a stable and reliable Kubernetes environment.

Sisense Community
