High Availability (HA) and resilience

Follow these guidelines to ensure high availability of the ThoughtSpot application and resilience to node failure.

Requirements for node resilience

  • The cluster must have at least 3 nodes.

  • The cluster must have spare capacity; if one node fails, the remaining nodes must be able to host and serve all loaded data.
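
The two requirements above can be expressed as a simple check. The following is a minimal sketch, not a ThoughtSpot tool: the function name, the per-node capacity figures, and the use of gigabytes are all assumptions made for illustration. It treats the loss of the largest node as the worst case and ignores any overhead beyond the loaded data itself.

```python
def can_tolerate_single_node_failure(node_capacities_gb, loaded_data_gb):
    """Hypothetical helper: check both node-resilience requirements.

    node_capacities_gb: usable data capacity of each node (assumed values).
    loaded_data_gb: total size of the data currently loaded in the cluster.
    """
    # Requirement 1: the cluster must have at least 3 nodes.
    if len(node_capacities_gb) < 3:
        return False

    # Requirement 2: if one node fails, the remaining nodes must be able
    # to host all loaded data. Losing the largest node is the worst case.
    remaining_capacity = sum(node_capacities_gb) - max(node_capacities_gb)
    return remaining_capacity >= loaded_data_gb


if __name__ == "__main__":
    # Three nodes of 256 GB each, 500 GB loaded: 512 GB remain after a failure.
    print(can_tolerate_single_node_failure([256, 256, 256], 500))
```

In this sketch, a three-node cluster with 600 GB loaded would fail the check, because only 512 GB of capacity remain after one node is lost.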

What happens during node failure

  • When a node loses its connection to the main service manager process, it is considered unhealthy.

  • ThoughtSpot migrates all migratable services that run on the failed node to other (healthy) nodes. For all practical purposes, ThoughtSpot ignores the failed node until it reports itself as healthy.

  • ThoughtSpot rebalances and redistributes the data served from the failed node onto healthy nodes. Healthy nodes read the data from the HDFS storage layer into the in-memory database processes.
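
The redistribution step above can be pictured with a small simulation. This is a sketch under stated assumptions, not ThoughtSpot's actual placement logic: the node names, the shard names, and the "least-loaded node first" policy are hypothetical, chosen only to show data moving off a failed node onto the healthy ones.

```python
def redistribute_shards(assignments, failed_node):
    """Hypothetical model of rebalancing after a node failure.

    assignments: dict mapping node name -> list of data shards it serves.
    Returns a new mapping that excludes the failed node, with its shards
    handed to the healthy nodes that currently serve the fewest shards.
    """
    # Ignore the failed node; copy the lists so the input is not mutated.
    healthy = {
        node: list(shards)
        for node, shards in assignments.items()
        if node != failed_node
    }
    if not healthy:
        raise ValueError("no healthy nodes remain")

    for shard in assignments.get(failed_node, []):
        # Assumed policy: pick the least-loaded healthy node,
        # breaking ties by node name so the result is deterministic.
        target = min(healthy, key=lambda node: (len(healthy[node]), node))
        healthy[target].append(shard)
    return healthy


if __name__ == "__main__":
    cluster = {
        "node1": ["a", "b"],
        "node2": ["c", "d"],
        "node3": ["e", "f"],
    }
    print(redistribute_shards(cluster, "node3"))
```

In the real system, each healthy node would then read its newly assigned data from HDFS into memory, which is why the spare-capacity requirement matters.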

Disruption: impact on users

Redistributing the data in the affected tables and loading it from the HDFS layer onto the remaining healthy nodes is not instantaneous. Users may notice an impact while the failover is in progress.