Known Limitation in Logging Infrastructure
Introduction
Sisense uses a Fluent-bit daemonset and a Fluentd deployment in tandem to aggregate and manage Sisense product logs. The Fluent-bit daemonset ensures that a Fluent-bit pod runs on every Kubernetes node. Each Fluent-bit pod watches the log output of the containers running on its node and forwards those logs to the listening Fluentd instance.
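As a quick check, assuming the logging components carry "fluent" in their names and run in the Sisense deployment namespace (substitute your own value for <namespace>; exact object names can vary between Sisense versions), you can confirm that both pieces are present:

    # List the Fluent-bit daemonset pods (one per node) and the Fluentd deployment pod
    kubectl get pods -n <namespace> -o wide | grep -i fluent
    # Confirm the daemonset and deployment objects themselves exist
    kubectl get daemonset,deployment -n <namespace> | grep -i fluent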
The Fluentd service is scheduled to run on the “primary” node of the Sisense cluster and is responsible for writing all logs to a single folder on that node. By default, this folder is /var/log/{$deploymentNamespace}/sisense. In addition, a cronjob named cronjob-logrotate manages log rotation; it must be scheduled to run on this same node so it can archive and clean up stale logs as they age.
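As a minimal sketch, assuming the default log path and with <namespace> standing in for your deployment namespace, you can verify on the primary node that logs are accumulating and that the rotation cronjob is defined:

    # On the primary node: check that log files are being written and updated
    ls -lt /var/log/<namespace>/sisense | head
    # From any machine with kubectl access: confirm the log-rotation cronjob exists
    kubectl get cronjob cronjob-logrotate -n <namespace>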
To pin these workloads to the “primary” node of a Sisense cluster, a nodeSelector parameter is set on the Fluentd deployment and on the cronjob-logrotate cronjob at installation time. The value is set to the hostname of the first node listed in the installation yaml file.
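You can read these values back from the live objects to see which hostname was recorded at installation time. The sketch below assumes the Fluentd deployment is simply named fluentd; adjust the name if it differs in your deployment:

    # Show the nodeSelector pinned to the Fluentd deployment
    kubectl get deployment fluentd -n <namespace> \
      -o jsonpath='{.spec.template.spec.nodeSelector}'
    # Show the nodeSelector pinned to the log-rotation cronjob
    kubectl get cronjob cronjob-logrotate -n <namespace> \
      -o jsonpath='{.spec.jobTemplate.spec.template.spec.nodeSelector}'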
Issues with Log Generation
In some instances, Sisense customers have experienced issues where logs are not written to the logs folder. You may observe the Fluentd pod in a non-running state, with a warning similar to “0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.”
This warning, combined with the Fluentd pod failing to start, usually indicates that the nodeSelector value in the Fluentd deployment is invalid; that is, it no longer matches the hostname of any node in the cluster.
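To confirm this is the symptom, the commands below (again assuming the deployment is named fluentd) show the pod state and surface the scheduler warning from the pod's events:

    # Check whether the Fluentd pod is Pending or otherwise not Running
    kubectl get pods -n <namespace> | grep -i fluentd
    # Inspect the pod's events for the "didn't match Pod's node affinity/selector" warning,
    # using the pod name returned by the previous command
    kubectl describe pod <fluentd-pod-name> -n <namespace>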
Troubleshooting
First, verify that the nodeSelector value in the Fluentd deployment is set to a valid hostname, meaning the hostname of a node that currently exists in the cluster. The hostname can change, particularly in self-managed EKS clusters where autoscaling removes the original primary host and replaces it with a new one. In that case, the Fluentd deployment and the cronjob-logrotate cronjob must be updated manually to reference the new hostname, as sketched below.
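One way to do this, sketched under the assumptions that the deployment is named fluentd and that the nodeSelector uses the standard kubernetes.io/hostname label (verify both against your own objects before patching), is to list the current node hostnames and then point both objects at the new primary node:

    # List the nodes currently in the cluster along with their hostname labels
    kubectl get nodes -L kubernetes.io/hostname
    # Point the Fluentd deployment at the new primary hostname
    kubectl patch deployment fluentd -n <namespace> --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"<new-primary-hostname>"}}}}}'
    # Point the log-rotation cronjob at the same hostname
    kubectl patch cronjob cronjob-logrotate -n <namespace> --type merge \
      -p '{"spec":{"jobTemplate":{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"<new-primary-hostname>"}}}}}}}'

After patching, the Fluentd pod should be rescheduled onto the new primary node, and logs should resume being written to the logs folder.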
Summary
Node selectors can sometimes pose issues for Fluentd deployments, leading to inaccessible logs. When logs are observed to be missing, the nodeSelector configuration is the first thing to investigate.