Installation Linux error - Timeout when waiting for 127.0.0.1:10250 to stop
If you see the following errors in the installation log:

failed: [node1] (item=10250) => {"changed": false, "elapsed": 2, "item": "10250", "msg": "Timeout when waiting for 127.0.0.1:10250 to stop."}
failed: [node1] (item=10248) => {"changed": false, "elapsed": 2, "item": "10248", "msg": "Timeout when waiting for 127.0.0.1:10248 to stop."}
failed: [node1] (item=10249) => {"changed": false, "elapsed": 2, "item": "10249", "msg": "Timeout when waiting for 127.0.0.1:10249 to stop."}
failed: [node1] (item=10256) => {"changed": false, "elapsed": 2, "item": "10256", "msg": "Timeout when waiting for 127.0.0.1:10256 to stop."}

it means that the ports required by the Kubernetes cluster are closed: TCP 10248 - 10259.

Note: these ports need to be open even if this is not a multi-node deployment.
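A minimal sketch of how you might verify and open these ports, assuming the host firewall is firewalld (adjust accordingly for ufw or cloud security groups):

# Check whether anything is listening on the kubelet / health-check ports
ss -tlnp | grep -E ':1024[89]|:1025[0-9]'

# Quick reachability test for a single port
nc -zv 127.0.0.1 10250

# If firewalld is blocking the range, open it and reload (skip if you manage ports elsewhere)
sudo firewall-cmd --permanent --add-port=10248-10259/tcp
sudo firewall-cmd --reload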
Pod in 'CrashLoopBackOff' State - 'Readiness/Liveness probe failed: Get http://{POD_IP}:8082/actuator/health: dial tcp {POD_IP}:8082: connect: connection refused'

When checking the pod status using:

kubectl -n $(NAMESPACE) get pods

You may encounter one of the pods in an unhealthy state:

jobs-cf6b46bcc-r2rkc             1/1   Running            0      27d
management-96449b57b-bdjsk       0/1   CrashLoopBackOff   1467   27d
model-graphql-84b79fb449-xkgcc   1/1   Running            0      27d

This means that the pod is constantly attempting to initialize but keeps crashing. To investigate what is causing the constant crashing, we can check the pod events with this command:

kubectl -n sisense get events --field-selector involvedObject.name=management-96449b57b-bdjsk

Which returns the following output:

LAST SEEN   TYPE      REASON                 OBJECT                           MESSAGE
22m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Readiness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
42m         Warning   Unhealthy              pod/management-96449b57b-bdjsk   Liveness probe failed: Get http://10.233.81.105:8082/actuator/health: dial tcp 10.233.81.105:8082: connect: connection refused
17m         Warning   BackOff                pod/management-96449b57b-bdjsk   Back-off restarting failed container
12m         Warning   MatchNodeSelector      pod/management-96449b57b-bdjsk   Predicate MatchNodeSelector failed
11m         Normal    TaintManagerEviction   pod/management-96449b57b-bdjsk   Cancelling deletion of Pod sisense/management-96449b57b-bdjsk

What we can see from this output is that the readiness and liveness probes are not getting a response on the expected endpoint/port (http://10.233.81.105:8082/actuator/health).

What are the readiness/liveness probes?

The kubelet, the primary node agent that runs on each node, ensures that the pods which are supposed to be running are in a healthy state. There are three different methods the kubelet can use to check whether a pod is healthy. In Sisense deployments, we use the HTTP endpoint option, which checks whether the endpoint is alive (by default every 20 seconds; this can be changed in the pod deployment under spec.containers[*].livenessProbe.periodSeconds).

The kubelet uses the liveness probe to know when to restart a container/pod.
The kubelet uses the readiness probe to know when the container is ready to accept traffic. A pod is ready when all of its containers are ready.

We can check what is configured for the management deployment:

kubectl -n sisense get deploy management -o yaml
...
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /actuator/health
            port: 8082
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 10
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /actuator/health
            port: 8082
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
...

As we can see from the above output, both the readinessProbe and the livenessProbe use the HTTP endpoint /actuator/health on port 8082 to detect whether the pod is healthy.
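If you want to hit the same endpoint the probes use, a hedged sketch is shown below. It assumes curl is available inside the image and that your kubectl version accepts deploy/<name> as a target; otherwise substitute the pod name:

# Forward the management pod's health port to your workstation
kubectl -n sisense port-forward deploy/management 8082:8082 &

# Query the same endpoint the probes use; a healthy Spring Boot actuator typically returns {"status":"UP"}
curl -s http://localhost:8082/actuator/health

# Or probe from inside the pod itself (assumes curl is present in the image)
kubectl -n sisense exec deploy/management -- curl -s http://localhost:8082/actuator/health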
Why were the endpoints unavailable for the readiness/liveness probes?

There are a few ways we can troubleshoot the issue:

1) Review the machine resources in Grafana. In this case it was evident that the machine resources were insufficient (RAM maxed out), so the pod (and its associated probes) was not able to serve the endpoint.

2) Access the pod and wait until it crashes again:

kubectl -n sisense exec management-96449b57b-x8swv -it -- bash

After a few minutes with an active session inside the pod, we see the following message as the session is terminated:

/opt/sisense/management# command terminated with exit code 137

Exit code 137 means the container was killed with SIGKILL (128 + 9), which in this context typically indicates the container hit an out-of-memory condition and was terminated by Kubernetes (the equivalent of a 'docker stop' sent to the container).

3) We can run the following command to get more information about why the container was terminated:

docker inspect $(CONTAINER_ID)

To retrieve the Docker container ID, we can describe the pod and review the output:

kubectl -n sisense describe pod management
...
Containers:
  management:
    Container ID:   docker://599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe
...

4) We can use the systemd journal to check for any out-of-memory messages:

journalctl -r -k
# OR
journalctl -r -k | grep -i -e memory -e oom

You need to run this command on the node that was hosting the container that was killed.

5) Using the 'describe' command, we can clearly see that the container failed because of OOM, with exit code 137:

kubectl -n sisense describe pod management
# ...
# Containers:
#   management:
#     ...
#     State:
#       Reason: CrashLoopBackOff
#     Last State:
#       Reason: Error
#       Exit Code: 137
# ...
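A hedged sketch that condenses steps 3-5 into a few commands. The pod and container IDs below are just the examples from above; on containerd-based nodes use 'crictl inspect' instead of 'docker inspect':

# Ask Kubernetes directly for the last termination reason of the container (often "OOMKilled")
kubectl -n sisense get pod management-96449b57b-bdjsk \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# On the node that hosted the container, confirm the OOM kill flag and exit code via Docker
docker inspect --format '{{.State.OOMKilled}} (exit code {{.State.ExitCode}})' 599dcc237dc76b25ac60c1ed2b8f1b78a438ff51de7d462d080bfbe2aab76bbe

# Kernel-level confirmation of the OOM kill, on the same node
journalctl -k | grep -i -e 'out of memory' -e 'oom-killer'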
How to check Total usage of RAM in Grafana in Multinode

The Linux monitoring Grafana dashboard General / Kubernetes / Compute Resources / Namespace (Workloads) shows the total memory allocation of a server, but by default you cannot switch between nodes (build/query) and check the total load of the Build or Query servers separately.

Solution:

1. Edit the dashboard.
2. Add a new variable.
3. Use the following settings - query: label_values(kube_node_info, node)
4. Edit the Memory Usage widget.
5. Add a filter by node: node=~"$node"

Now you should be able to switch between nodes.
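For reference, the resulting panel query might look something like the sketch below. The exact metric and label names (for example node vs. instance, pod vs. pod_name) depend on your Prometheus/Grafana version, so mirror whatever the existing widget already uses and only append the node filter:

sum(container_memory_working_set_bytes{namespace="sisense", container!="", node=~"$node"}) by (pod)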
How to fix Safe-Mode on Build, Dashboard

Safe-Mode is triggered when a pod (query/build) OR overall server RAM consumption reaches 85% usage. If Safe-Mode is triggered on a build, the build is cancelled due to OOM. On a dashboard, the application restarts the Query pod of the cube (deletes it and starts it again) in order to release memory. Safe-Mode also has a grace period that cancels all new queries for 30 seconds. If the Data Group settings are correct, refreshing the dashboard page after the 30-second grace period will return results.

Related: Using Grafana to troubleshoot performance issues

Log location (single node; on a multi-node cluster the logs are located on the first node, i.e. the one listed first in the config.yaml used to install Sisense): /var/log/sisense/sisense/
build - ec-<name of cube>-bld-hash.log, for example ec-sample-ecommerce-bld-a63510c3-3672-0.log
query - ec-<name of cube>-qry-hash.log, for example ec-sample-ecommerce-qry-a63510c3-3672-0.log

If Safe-Mode was triggered during a BUILD:

Check the build error message and identify whether it is a pod issue or an overall server issue.

Build pod limits issue (BE#521691): to fix the pod limitation issue, consider increasing the Build Node RAM limits in the Data Groups.

Build server overall OOM issue (BE#636134): in case of overall server OOM, check what is consuming RAM using Grafana. It can be:
- Other builds. If so, consider changing the build schedule accordingly.
- In most cases, the RAM is consumed by Query pods. If so, try to stop the cubes of the query pod to release the RAM.
- Consider a RAM upgrade.
- Increase MAX RAM in the Data Group settings.

Settings of the Elasticube Build Safe-Mode are located in the Configuration Manager, under Elasticube Build Params, where you can enable/disable Safe-Mode and change the percentage of RAM that should be reserved. It is not recommended to disable Safe-Mode, as it is meant to reserve 15% of RAM so that technicians can log in and fix issues. Note that if Safe-Mode is disabled, the server may become unresponsive, and even a server restart might not fix the issue. Do not disable Safe-Mode without an urgent need!

If Safe-Mode was triggered while viewing a Dashboard:

This can also be due to either a pod limitation or server OOM. Dashboards use the RAM of the query pods, where Sisense keeps the results of dashboard query executions for reuse by other dashboard users. Sisense collects results up to 85% of the RAM allocated in the Data Group settings, after which it tries to remove old results and replace them with new ones. If a large query arrives from a dashboard at this point, Safe-Mode is triggered: the query pod is deleted (removing all saved results) and a new pod is started to calculate the results. At that moment the user sees an error on the dashboard; however, once the pod has restarted (usually about 14 seconds, although the more RAM is allocated to the query pod, the longer it takes to release the RAM and start a new pod), the user can refresh the page to get the results.

UPD: starting from Sisense L2021.5, a Soft-Restart of the Elasticubes has been implemented (beta, please test it before going live). The Soft-Restart of the qry pod restarts only MonetDB and not the entire pod, which speeds up the restart when Safe-Mode is hit.
To enable the Soft-Restart:
- Go to the control panel.
- Click 5 times on the Sisense logo in the top left corner.
- Navigate to the Management section in the left-hand side menu.
- Scroll down to the end of the page.
- Enable Soft-Restart.

To fix the issue:
- Consider changing the Data Group settings.
- Consider a RAM upgrade.
- Increase the number of Instances in the Data Group, so that when one of the instances is restarted due to Safe-Mode, the other can handle the requests.

Settings of the Elasticube Query Safe-Mode are located in the Configuration Manager, under Elasticube Query Params:
Cleanup Threshold Percentage - the % of RAM reserved for Safe-Mode (default 15%)
Safe-Mode Grace Period - after Safe-Mode is triggered, Sisense will cancel all queries for 30 sec.
Disable/Enable Safe-Mode - NOT RECOMMENDED!!!
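A couple of hedged checks that are often useful when diagnosing Safe-Mode events. They assume the build/query pods and logs follow the ec-<cube>-bld / ec-<cube>-qry naming shown above; the exact log wording varies between Sisense versions:

# List the build and query pods with their restart counts; a climbing RESTARTS value on a qry pod
# is a common sign that Safe-Mode keeps recycling it
kubectl -n sisense get pods | grep -e '-bld-' -e '-qry-'

# On the first node, search a cube's build or query log for safe-mode / memory related messages
grep -i -e 'safe' -e 'oom' -e 'memory' /var/log/sisense/sisense/ec-sample-ecommerce-qry-a63510c3-3672-0.log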
Using Grafana to troubleshoot performance issues

Overview

Out of the box, Sisense running on Linux comes with the embedded monitoring tool Grafana. Sisense has created a few dashboards that are helpful for troubleshooting performance issues. Grafana can be accessed from the Admin page -> System Settings -> Monitoring (in the top right corner), or at https://your_web_site/app/grafana/
The dashboards list is available by navigating in the left-hand side menu to the 4-square button -> Manage.

Navigation

Grafana has some base controls that make it straightforward to gather information about the system over a period of time. Within a dashboard you will see headers similar to the following:
1: Filter dropdowns - such as namespace (by default, Sisense components run in the sisense namespace), pods, and node(s).
2: Time period - can be configured to be relative or absolute (according to the browser timezone).
3: Refresh - you can manually toggle the refresh or set a refresh rate.

Within a visualization, you can hover over particular lines or click on one or more values. Click and drag over a time period to zoom in on it for more accurate values. For more information, check out the official Grafana docs - Working with Grafana dashboard UI.

Available Dashboards

We will focus our troubleshooting on 4 dashboards:
Nodes
All Pods per namespace
Kubernetes / Compute Resources / Namespace (Workloads)
Import the dashboard 12928

Nodes Dashboard

This dashboard is useful for troubleshooting overall node performance during a period of time. It shows the CPU, RAM, Disk I/O, and networking usage of the machine over that period. Use the 'instance' dropdown to select which node to examine.

All Pods per Namespace Dashboard

The dashboard header allows you to filter pods/nodes, set up the timeframe, and specify refresh/autorefresh. The dashboard has 3 widgets (CPU, RAM and Network) and shows the load of a server pod by pod (separately), overlapping each other. This is useful when checking what the RAM consumption of a particular pod is. Please note that this dashboard does not show overall RAM consumption by default; it shows the RAM of each pod individually.

How to add a Total: in Grafana (hosted on /app/grafana), under the dashboard named "all pods per namespace", the two widgets (Cpu Usage, Memory Usage) need the sum of all pods added. This is mandatory information when troubleshooting resource pressure situations. Add the following metric to the memory widget:

sum (container_memory_working_set_bytes{job="kubelet", pod_name=~"$pod", container_name!=""})

We can then see the state of the machine (RAM in this case) more clearly. Using the Pod filter you can filter all the query or build pods by typing "qry" or "bld" in the filter and selecting the needed pods. Hover over a widget item to see the name of a pod. Drag and drop to select the needed timeframe.

Kubernetes / Compute Resources / Namespace (Workloads)

This dashboard has a lot more widgets and shows you the total usage of RAM, CPU, etc. It is useful when you need to see total usage to identify a Safe-Mode trigger. For multinode, see the manual. This dashboard has additional widgets that can be helpful for monitoring your server performance, network usage, etc.

Sisense Cluster Detail Dashboard (#12928)

This dashboard is included by default in many recent versions of Sisense on Linux. It has many pre-configured widgets that show pod, cluster, disk, RAM, etc. usage. Before you try to import this dashboard, check whether it is already in your Grafana dashboard menu.
If your instance is on an older release, or you would like to import a dashboard shared by other users, follow the steps below:
1. Click the 4-square menu icon, then Manage dashboards.
2. Press Import dashboard in the top right corner.
3. Specify the dashboard number 12928 (or another dashboard number) and press Load.
4. Select the data source "Prometheus".
5. Click Import and enjoy.
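If you prefer to script the import instead of using the UI, a rough sketch using Grafana's HTTP API is shown below. It assumes you have admin credentials (or an API key) that work against the embedded Grafana, and that its Prometheus data source is simply named "Prometheus" - check the downloaded JSON's __inputs section for the real input name before running it:

# Download the dashboard JSON from grafana.com (dashboard ID 12928)
curl -s https://grafana.com/api/dashboards/12928/revisions/latest/download -o cluster-detail.json

# Import it through the Grafana HTTP API (adjust the URL, credentials and data source name to your environment)
curl -s -X POST "https://your_web_site/app/grafana/api/dashboards/import" \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d "{\"dashboard\": $(cat cluster-detail.json), \"overwrite\": true, \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\", \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]}"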
JumpToDashboard - Troubleshooting the most common configuration issues

This article provides possible solutions to the most common configuration issues with the JumpToDashboard plugin. Please review the symptom first (the error/behavior you experience with the JumpToDashboard plugin) and then follow the solution instructions. If this doesn't solve your issue, feel free to contact our Support Team with a detailed description of the problem.

Symptoms: A target dashboard (usually with the "_drill" prefix) disappeared for non-owner users in the left-hand side panel.

Solution: this behavior could be intended and is controlled by the JumpToDashboard parameter called "hideDrilledDashboards". To make the dashboard visible for non-owners, please check the following:
1. Log in as the dashboard owner and find the dashboard in question in the left-hand side panel. Click the 3-dots menu and make sure it's not hidden.
2. If it's not hidden by the owner intentionally, then navigate to the Admin tab > System Management (under Server & Hardware) > File Management > plugins > jumpToDashboard > js > config.js and check whether hideDrilledDashboards is set to true. If so, change it to false and save the changes in the config file.
3. Wait until the JumpToDashboard plugin is rebuilt under the Admin tab > Server & Hardware > Add-ons page, then ask your user to refresh the browser page to check whether the drill dashboard appears in the left-hand side panel.

Symptoms: No "Jump to dashboard" menu appears in widget edit mode when clicking the 3-dots menu.

Solution: there could be different reasons for this behavior, so check the most common cases below:
- Double-check that the JumpToDashboard plugin is enabled under the Admin tab > Server & Hardware > Add-ons page.
- Make sure that both dashboards (parent and target) are based on the same ElastiCube. By default, the JumpToDashboard plugin has sameCubeRestriction: true in the config.js file, which prevents the 'jump to' menu from appearing when a drill dashboard uses a different data source.
- Check that the prefix you used for the drill dashboard creation is correct. It can be changed in the config.js file. By default, it uses "_drill".

Symptoms: when clicking on a widget that should open a drill dashboard, nothing happens.

Solution: in such cases, we recommend opening your browser console (for example, F12 in Chrome > Console tab) to see if there are any errors that could indicate the issue. For example, a 403 error in the console indicates that the target dashboard is not shared with the user who is experiencing the issue. To fix it, log in as the owner of the drill dashboard and share it with the relevant user or group.

Symptoms: when clicking on a widget to open the drill dashboard, you get a 404 error.

Solution: this issue usually happens when the target/drill dashboard has been removed from the system. To fix it, please follow the steps below:
1. Log in to the system as an owner.
2. Find the parent widget and open it in edit mode.
3. Click the 3-dots menu > choose the 'Jump to dashboard' menu and select any other dashboard that exists in the system.
4. Press Apply and publish the changes to other users.

Note: if you just need to remove a drill dashboard that no longer exists from this widget, and not substitute it with another one, try the following: after you choose a new drill dashboard, unselect it again and then save the changes. If the jump to dashboard menu doesn't appear for this widget, try to create a new temporary dashboard with the "_drill" prefix and do the same.
Symptoms: The drill dashboard is not opening for some viewers.

Solution: republish the drill dashboard to make sure the updated version is delivered to all end users.

Additional Resources: JumpToDashboard Plugin - How to Use and Customize

How to check network performance in Linux
You might need to check the network performance between the nodes of a multi-node cluster, or between a Linux server and a remote database on Linux.

1. Install the following tool on each node:
For Ubuntu: sudo apt install iperf3
For Red Hat / CentOS / Amazon Linux: sudo yum install iperf

2. Start the server on the first node:
iperf3 -s (or 'iperf -s')

3. Note its port from the output and then launch the client on another node / another server:
iperf3 -c IP_of_the_server_above -p port_of_the_server

Feel free to check the following combinations: node1 - node2, node1 - node3, node2 - node3 (or Sisense server - database server).

An example output:

-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 172.16.0.98, port 49216
[  5] local 172.16.0.38 port 5201 connected to 172.16.0.98 port 49218
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-1.00   sec   105 MBytes   884 Mbits/sec
[  5]   1.00-2.00   sec   110 MBytes   922 Mbits/sec
[  5]   2.00-3.00   sec   110 MBytes   924 Mbits/sec
[  5]   3.00-4.00   sec   110 MBytes   923 Mbits/sec
[  5]   4.00-5.00   sec   110 MBytes   926 Mbits/sec
[  5]   5.00-6.00   sec   110 MBytes   919 Mbits/sec
[  5]   6.00-7.00   sec   110 MBytes   921 Mbits/sec
[  5]   7.00-8.00   sec   110 MBytes   922 Mbits/sec
[  5]   8.00-9.00   sec   110 MBytes   926 Mbits/sec
[  5]   9.00-10.00  sec   109 MBytes   913 Mbits/sec
[  5]  10.00-10.05  sec  4.99 MBytes   927 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.05  sec  0.00 Bytes    0.00 bits/sec    sender
[  5]   0.00-10.05  sec  1.07 GBytes   918 Mbits/sec    receiver
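A few additional runs that can help narrow things down (the flags shown are for iperf3; the classic iperf binary uses slightly different options):

# Reverse mode: measure throughput in the opposite direction without swapping server and client
iperf3 -c IP_of_the_server_above -p port_of_the_server -R

# Several parallel streams, useful for spotting per-connection throttling
iperf3 -c IP_of_the_server_above -p port_of_the_server -P 4

# UDP test with a target bandwidth; the report also includes jitter and packet loss
iperf3 -c IP_of_the_server_above -p port_of_the_server -u -b 500M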
Adding RabbitMQ Administrator CLI (rabbitmqadmin) to Container
How to enable it?

Run the following script:

for i in 0 1 ; do echo sisense-rabbitmq-ha-$i " "; kubectl -n sisense exec -it sisense-rabbitmq-ha-$i -c rabbitmq-ha -- bash -c "apk update;apk add python;wget http://127.0.0.1:15672/cli/rabbitmqadmin;chmod a+x rabbitmqadmin" ; done

This will install the CLI in all Pods. If you only have one Pod (single-node deployment), you will probably see the following error:

sisense-rabbitmq-ha-1
Error from server (NotFound): pods "sisense-rabbitmq-ha-1" not found

You can ignore it.

(Optional) Enable bash auto-completion for the CLI

After you've successfully installed rabbitmqadmin, you can enable the tab-enabled auto-completion by running:

source $(./rabbitmqadmin --bash-completion)

./rabbitmqadmin [PRESS TAB + TAB]:
--depth           --host            --sort            --vhost           delete            help              publish
--format          --password        --sort-reverse    close             export            import            purge
--help            --port            --username        declare           get               list              show

How to print the number of messages and consumers per queue?

./rabbitmqadmin list queues name consumers messages
+------------------------------------------+-----------+----------+
| name                                     | consumers | messages |
+------------------------------------------+-----------+----------+
| activities/addDirectActivity_v1          | 1         | 0        |
| alerts/updateAlertsForDeletedGroups_v1   | 1         | 0        |
| alerts/updateAlertsForDeletedUsers_v1    | 1         | 0        |
...

More info about the commands can be found here.
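A few more rabbitmqadmin calls that are often handy once the CLI is in place (run them inside the rabbitmq pod, same as above; the queue name below is just the example from the listing):

# Show exchanges and bindings
./rabbitmqadmin list exchanges name type
./rabbitmqadmin list bindings

# Purge a queue that has accumulated stale messages (replace the queue name with the one you need)
./rabbitmqadmin purge queue name=alerts/updateAlertsForDeletedUsers_v1

# Export the broker definitions (users, vhosts, queues, bindings) to a file for inspection or backup
./rabbitmqadmin export definitions.json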