Solving the "Zombie Connection" Race Condition in Istio/Envoy [Linux]
Introduction: If you are running Sisense on Kubernetes with Istio (ASM), you may encounter intermittent 502 Bad Gateway or 503 Service Unavailable errors during JAQL queries, even when your pods appear healthy.
Step-by-Step Guide:
1. The Problem: Timeout Mismatch
By default, the Istio Ingress Gateway (Envoy) keeps idle TCP connections to backend pods open for a long time (its idle timeout defaults to 1 hour). However, the underlying GCP VPC network or the Sisense application pods often have much shorter idle timeouts (usually 60 seconds).
This discrepancy creates a race condition:
- A connection sits idle between dashboard refreshes.
- The Sisense pod or the GCP network closes the connection because it has been idle for too long.
- At almost the same moment, Envoy attempts to reuse that "established" connection to send a new JAQL request.
- Because the other end is already closed, Envoy receives a TCP Reset (RST) and returns an error to the client instead of the query result.
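The sequence above can be sketched as a small Python model. This is only an illustration of the reasoning, not Envoy's actual connection-pool logic: the function name `reuse_outcome` and its parameters are hypothetical, and the model assumes the proxy never learns that the other side has closed the connection.

```python
def reuse_outcome(idle_seconds: float,
                  proxy_idle_timeout: float,
                  upstream_idle_timeout: float) -> str:
    """Classify what happens when a request arrives after a connection
    has been idle for `idle_seconds`.

    Simplifying assumption: the proxy is not notified when the upstream
    side (pod or network) drops the connection.
    """
    if idle_seconds >= proxy_idle_timeout:
        # The proxy already closed its idle connection, so it opens a new one.
        return "new connection"
    if idle_seconds >= upstream_idle_timeout:
        # The upstream side is gone, but the proxy still reuses the connection.
        return "reset"
    # Both sides still consider the connection open.
    return "reused"


# Dashboard refresh after 90 s of idleness, 1-hour proxy timeout, 60 s upstream limit.
print(reuse_outcome(90, proxy_idle_timeout=3600, upstream_idle_timeout=60))
# Same refresh with the proxy timeout lowered to 45 s.
print(reuse_outcome(90, proxy_idle_timeout=45, upstream_idle_timeout=60))
```

With the 1-hour timeout the model reports a reset, while a proxy timeout below the upstream limit turns the same request into a fresh connection.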
Log Examples
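In your Istio Ingress Gateway access logs, you will see the following response code and response flag combinations:
- 503 UC: upstream connection termination (the connection to the backend pod was closed or reset).
- 101 DC: downstream connection termination (the client side of the connection disconnected).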
2. The Solution: EnvoyFilter Alignment
To fix this, you must force Envoy to be more "aggressive" than the network. By setting Envoy’s idle_timeout to 45 seconds, Envoy will proactively kill its own idle connections before the GCP network or the Sisense pod has a chance to drop them silently.
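As a rough sketch of how the 45-second value relates to the 60-second limits, the helper below picks a proxy idle timeout that sits below the shortest known limit. The function name `pick_proxy_idle_timeout` and the 15-second default margin are assumptions chosen to match the numbers in this article, not an Istio or Envoy recommendation.

```python
def pick_proxy_idle_timeout(upstream_limits_s, margin_s=15.0):
    """Return an idle timeout (in seconds) shorter than every upstream limit.

    `upstream_limits_s` lists the idle timeouts of everything behind the
    proxy (for example the network and the application pods).
    `margin_s` is a hypothetical safety margin.
    """
    if not upstream_limits_s:
        raise ValueError("at least one upstream limit is required")
    timeout = min(upstream_limits_s) - margin_s
    if timeout <= 0:
        raise ValueError("margin is too large for the shortest upstream limit")
    return timeout


# Network and application limits of 60 s each, as described in this article.
print(pick_proxy_idle_timeout([60, 60]))
```

If your own limits differ from 60 seconds, adjust the value used in the filter accordingly and test it in your environment.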
Critical Implementation Detail
Many users attempt to apply the filter in the application namespace (e.g., sisense), but if sidecar injection is not enabled, that filter will do nothing. The filter must be applied where the Envoy proxy actually lives: the istio-system namespace.
The Fix
Apply this configuration to target the Ingress Gateway specifically:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: sisense-idle-timeout
  namespace: istio-system # Must be in the same namespace as the Ingress
spec:
  workloadSelector:
    labels:
      app: istio-ingressgateway # Targets the Gateway proxy
  configPatches:
  - applyTo: CLUSTER
    match:
      context: GATEWAY
    patch:
      operation: MERGE
      value:
        common_http_protocol_options:
          idle_timeout: 45s # Closes connection before the 60s VPC/App limit
Conclusion: This ensures that the Ingress Gateway, the only Envoy proxy in a non-sidecar setup, manages the connection lifecycle properly. By "hanging up" first, it avoids reusing "zombie" connections, eliminating the 503 UC resets.
Disclaimer: This post outlines a potential custom workaround for a specific use case or provides instructions regarding a specific task. The solution may not work in all scenarios or Sisense versions, so we strongly recommend testing it in your environment before deployment. If you need further assistance with this, please let us know.