Basic Kubernetes troubleshooting guide [Linux]
This guide provides a general approach for troubleshooting Sisense Linux-based deployments running on Kubernetes. It is intended to help you collect initial information and perform basic diagnostics.
Step-by-Step Guide:
Do not restart anything before collecting the information described in Step 1.
1. Collect Overall Cluster Information
Start by collecting a high-level overview of the cluster and the current state of the Sisense deployment.
Commands:
- kubectl get pods -A
- kubectl -n sisense get events
- kubectl -n sisense get pv
- kubectl -n sisense get pvc
- kubectl get nodes
- kubectl get all -A > all.txt
- kubectl describe all -A > desc.txt
Collecting these outputs provides an overall snapshot of the environment, which will help identify potential issues.
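If the event list is long, sorting by creation timestamp makes the most recent events easier to review. The output file names below are only suggestions:
- kubectl -n sisense get events --sort-by=.metadata.creationTimestamp > events.txt
- kubectl get nodes -o wide > nodes.txt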
2. Validate Core Sisense Services
Verify that the following core services are up and running:
- RabbitMQ
- MongoDB
- Zookeeper
- Storage (FSx, GlusterFS, etc.)
- API Gateway
- Galaxy
Check the pod status for each service in the Sisense namespace:
- kubectl -n sisense get pods
All critical pods should be in the Running state. Any pods in CrashLoopBackOff, Error, or Pending states should be investigated.
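One possible way to quickly surface unhealthy pods is to filter the pod list. Note that the field selector matches the pod phase, so it will not catch a crash-looping pod whose phase is still Running; the grep variant covers that case:
- kubectl -n sisense get pods --field-selector=status.phase!=Running
- kubectl -n sisense get pods | grep -vE 'Running|Completed'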
3. Identify Failed Pods and Gather Details
Locate any failed pods that require investigation, then describe each one:
- kubectl -n sisense describe pod <pod_name>
Analyze the pod events and container termination statuses:
- Exit Code 137:
- Indicates the container was terminated by signal 9 (SIGKILL). In most cases, the container exceeded its memory limit (the Out-Of-Memory Killer was triggered), or the node itself was under memory pressure and killed processes to recover. It can also point to other issues caused by resource pressure.
- Exit Code 1:
- Application-level failure inside the container. Focus on application logs to identify the cause.
- Liveness Probe Failure:
- Kubernetes considers the container unhealthy and restarts it. May indicate that the application is unresponsive or failed internal health checks.
- Readiness Probe Failure:
- Indicates the application is not ready to serve traffic but may still be running. Often related to initialization, dependencies, or internal connectivity issues.
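To read a container's last termination state directly, rather than scanning the full describe output, the following commands can help. The pod and node names are placeholders, and kubectl top requires the metrics-server add-on to be installed in the cluster:
- kubectl -n sisense get pod <pod_name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
- kubectl describe node <node_name> (check the Conditions section for MemoryPressure)
- kubectl -n sisense top pods (shows current CPU/memory usage; helps confirm resource pressure)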
4. Collect Container Logs
After identifying the problematic pod, collect its logs:
- kubectl -n sisense logs <pod_name> -p
(The -p flag retrieves logs from the previous, terminated container instance; omit it to view logs from the currently running container.)
Review the logs for specific error messages, stack traces, or indications of configuration or environment issues.
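A few additional kubectl log options are often useful; the container name below is a placeholder:
- kubectl -n sisense logs <pod_name> -c <container_name> (for pods with multiple containers)
- kubectl -n sisense logs <pod_name> --tail=200 (only the most recent lines)
- kubectl -n sisense logs <pod_name> --since=1h (logs from the last hour)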
5. Next Steps
- Collect all information from steps 1-4.
- Provide full output to Sisense Support for further analysis.
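As an example, the collected files can be bundled into a single archive before sending; the file names below assume the output files suggested in Step 1:
- tar czf sisense-diagnostics.tar.gz all.txt desc.txt events.txt nodes.txt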
Conclusion:
This is a basic troubleshooting workflow. Complex environments may require additional network, storage, or cluster-level diagnostics.
Disclaimer: This post outlines a potential custom workaround for a specific use case or provides instructions regarding a specific task. The solution may not work in all scenarios or Sisense versions, so we strongly recommend testing it in your environment before deployment. If you need further assistance with this, please let us know.