
This reference lists the messages that can appear under System Health > Overview > Critical Alerts and on the Alerts dashboard.

Informational alerts

TASK_TERMINATED

Msg: Task {{.Service}}.{{.Task}} terminated on machine {{.Machine}}

Type: INFO

This alert is raised when a task terminates.
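The `{{.Field}}` placeholders in each Msg line resemble Go `text/template` syntax. This reference does not say how messages are rendered, so the following is only a minimal sketch under that assumption. The function name `renderAlert` and the sample field values are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderAlert fills an alert message template with the given fields.
// Hypothetical helper: the product's real rendering code is not documented here.
// The "missingkey=error" option makes a missing field fail loudly instead of
// silently rendering a placeholder value.
func renderAlert(msg string, fields map[string]string) (string, error) {
	tmpl, err := template.New("alert").Option("missingkey=error").Parse(msg)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, fields); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Sample values are made up for illustration.
	out, err := renderAlert(
		"Task {{.Service}}.{{.Task}} terminated on machine {{.Machine}}",
		map[string]string{
			"Service": "example_service",
			"Task":    "example_task",
			"Machine": "node-1",
		},
	)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

When matching alert text in logs or tickets, it may help to search for the fixed parts of the message (for example "terminated on machine") rather than the substituted values.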

DISK_ERROR

Msg: Machine {{.Machine}} has disk errors

Type: INFO

Raised when a machine has disk errors.

ZK_AVG_LATENCY

Msg: Average Zookeeper latency is more than {{.Num}} msec

Type: INFO

Raised when average Zookeeper latency is above a threshold.

ZK_MAX_LATENCY

Msg: Max Zookeeper latency is more than {{.Num}} msec

Type: INFO

Raised when max Zookeeper latency is above a threshold.

ZK_MIN_LATENCY

Msg: Min Zookeeper latency is more than {{.Num}} msec

Type: INFO

Raised when min Zookeeper latency is above a threshold.

ZK_OUTSTANDING_REQUESTS

Msg: Number of outstanding Zookeeper requests exceeds {{.Num}}

Type: INFO

Raised when there are too many outstanding Zookeeper requests.

ZK_NUM_WATCHERS

Msg: Number of Zookeeper watchers exceeds {{.Num}}

Type: INFO

Raised when there are too many Zookeeper watchers.

MASTER_ELECTION

Msg: {{.Machine}} elected as Orion Master

Type: INFO

Raised when a new Orion Master is elected.

PERIODIC_BACKUP

Msg: {{.Process}} periodic backup for policy {{.Name}} failed.

Type: INFO

Raised when a periodic backup fails.

PERIODIC_SNAPSHOT

Msg: {{.Process}} periodic snapshot {{.Name}} failed.

Type: INFO

Raised when a periodic snapshot fails.

HDFS_CORRUPTION

Msg: HDFS root directory is in a corrupted state.

Type: INFO

Raised when the HDFS root directory is corrupted.

APPLICATION_INVALID_STATE

Msg: {{.Service}}.{{.Task}} on {{.Machine}} at location {{.Location}}

Type: INFO

Raised when an application raises an invalid-state alert.

UPDATE_START

Msg: Starting update of ThoughtSpot cluster {{.Cluster}}

Type: INFO

Raised when an update starts.

UPDATE_END

Msg: Finished update of ThoughtSpot cluster {{.Cluster}} to release {{.Release}}

Type: INFO

Raised when an update completes.

Errors

TIMELY_JOB_RUN_ERROR

Msg: Job run {{.Message}}

Type: ERROR

Raised when a job run fails.

TIMELY_ERROR

Msg: Job manager {{.Message}}

Type: ERROR

Raised when a job manager runs into an inconsistent state.

Warnings

DISK_SPACE

Msg: Machine {{.Machine}} has less than {{.Perc}}% disk space free

Type: WARNING

Raised when a disk is low on available disk space. Valid only in the 3.2 version of ThoughtSpot.

ROOT_DISK_SPACE

Msg: Machine {{.Machine}} has less than {{.Perc}}% disk space free on root partition

Type: WARNING

Raised when a machine is low on available disk space on the root partition.

BOOT_DISK_SPACE

Msg: Machine {{.Machine}} has less than {{.Perc}}% disk space free on boot partition

Type: WARNING

Raised when a machine is low on available disk space on the boot partition.

UPDATE_DISK_SPACE

Msg: Machine {{.Machine}} has less than {{.Perc}}% disk space free on update partition

Type: WARNING

Raised when a machine is low on available disk space on the update partition.

EXPORT_DISK_SPACE

Msg: Machine {{.Machine}} has less than {{.Perc}}% disk space free on export partition

Type: WARNING

Raised when a machine is low on available disk space on the export partition.

HDFS_NAMENODE_DISK_SPACE

Msg: Machine {{.Machine}} has less than {{.Perc}}% disk space free on HDFS namenode drive

Type: WARNING

Raised when a machine is low on available disk space on the HDFS namenode drive.

MEMORY

Msg: Machine {{.Machine}} has less than {{.Perc}}% memory free

Type: WARNING

Raised when a machine is low on free memory.
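The disk-space and memory warnings above all take the form "less than {{.Perc}}% free". As a rough illustration of that kind of check, here is a minimal sketch. The function name `belowFreeThreshold` and the sample numbers are hypothetical, and the actual thresholds and measurement method are not stated in this reference.

```go
package main

import "fmt"

// belowFreeThreshold reports whether the free share of a resource (disk or
// memory) is strictly below percThreshold, and returns the free percentage.
// Hypothetical helper for illustration; not the product's implementation.
func belowFreeThreshold(freeBytes, totalBytes uint64, percThreshold float64) (bool, float64) {
	if totalBytes == 0 {
		// Avoid dividing by zero; treat an unknown size as "no warning".
		return false, 0
	}
	freePerc := float64(freeBytes) * 100 / float64(totalBytes)
	return freePerc < percThreshold, freePerc
}

func main() {
	// Made-up example: 5 GiB free out of 100 GiB, checked against a 10% threshold.
	low, perc := belowFreeThreshold(5<<30, 100<<30, 10)
	fmt.Printf("free=%.1f%% warn=%v\n", perc, low)
}
```

The comparison is strict, matching the "less than" wording of the messages, so a resource sitting exactly at the threshold would not trigger a warning in this sketch.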

OS_USERS

Msg: Machine {{.Machine}} has more than {{.Num}} logged in users

Type: WARNING

Raised when a machine has too many users logged in.

OS_PROCS

Msg: Machine {{.Machine}} has more than {{.Num}} processes

Type: WARNING

Raised when a machine has too many processes.

SSH

Msg: Machine {{.Machine}} doesn't have an active SSH server

Type: WARNING

Raised when a machine does not have an active SSH server.

DISK_ERROR_EXTERNAL

Msg: Machine {{.Machine}} has disk errors

Type: WARNING

Raised when more than two disk errors occur in a day.

ZK_FD_COUNT

Msg: Zookeeper has more than {{.Num}} open file descriptors

Type: WARNING

Raised when Zookeeper has too many open file descriptors.

ZK_EPHEMERAL_COUNT

Msg: Zookeeper has more than {{.Num}} ephemeral files

Type: WARNING

Raised when there are too many Zookeeper ephemeral files.

HOST_DOWN

Msg: {{.Machine}} is down

Type: WARNING

Raised when a host is down.

TASK_UNREACHABLE

Msg: {{.ServiceDesc}} on {{.Machine}} is unreachable over HTTP

Type: WARNING

Raised when a task is unreachable over HTTP.

TASK_NOT_RUNNING

Msg: {{.ServiceDesc}} is not running

Type: WARNING

Raised when a service task is not running on any machine in the cluster.

Critical alerts

TASK_FLAPPING

Msg: Task {{.Service}}.{{.Task}} terminated {{._actual_num_occurrences}} times in last {{._earliest_duration_str}}

Type: CRITICAL

This alert is raised when a task crashes repeatedly. The service is evaluated across the whole cluster, so if a service crashes 5 times in a day across all nodes in the cluster, this alert is generated.
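The cluster-wide counting described for TASK_FLAPPING can be sketched as a simple windowed count. This is an illustrative sketch only: the `Termination` type, `isFlapping`, and `sampleEvents` are hypothetical names, and the 5-per-day figure is taken from the example in the description rather than from a documented setting.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// Termination records one task termination (hypothetical shape).
type Termination struct {
	Service, Task, Machine string
	At                     time.Time
}

// isFlapping counts terminations of service.task on ANY machine within the
// window ending at now, and reports whether the count reaches the threshold.
func isFlapping(events []Termination, service, task string, now time.Time,
	window time.Duration, threshold int) (bool, int) {
	cutoff := now.Add(-window)
	count := 0
	for _, e := range events {
		if e.Service == service && e.Task == task &&
			e.At.After(cutoff) && !e.At.After(now) {
			count++
		}
	}
	return count >= threshold, count
}

// sampleEvents builds made-up data: five recent crashes of one task spread
// over three machines, one old crash outside the window, and one crash of
// a different task.
func sampleEvents() ([]Termination, time.Time) {
	now := time.Date(2024, 1, 2, 12, 0, 0, 0, time.UTC)
	machines := []string{"node-1", "node-2", "node-3"}
	var events []Termination
	for i := 0; i < 5; i++ {
		events = append(events, Termination{
			Service: "example_service", Task: "example_task",
			Machine: machines[i%len(machines)],
			At:      now.Add(-time.Duration(2*i+1) * time.Hour),
		})
	}
	events = append(events,
		Termination{"example_service", "example_task", "node-1", now.Add(-30 * time.Hour)},
		Termination{"example_service", "other_task", "node-2", now.Add(-2 * time.Hour)},
	)
	return events, now
}

func main() {
	events, now := sampleEvents()
	flapping, n := isFlapping(events, "example_service", "example_task", now, day, 5)
	fmt.Printf("flapping=%v count=%d\n", flapping, n)
}
```

Because the count ignores which machine a crash occurred on, a task that crashes once on each of five nodes is treated the same as one that crashes five times on a single node, which matches the cluster-wide evaluation described above.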

OREO_TERMINATED

Msg: Oreo terminated on machine {{.Machine}}

Type: CRITICAL

This alert is raised when the Oreo daemon on a machine terminates due to an error. This typically happens due to an error accessing Zookeeper, HDFS, or a hardware issue.

HDFS_DISK_SPACE

Msg: HDFS has less than {{.Perc}}% space free

Type: CRITICAL

Raised when an HDFS cluster is low on total available disk space.

ZK_INACCESSIBLE

Msg: Zookeeper is not accessible

Type: CRITICAL

Raised when Zookeeper is inaccessible.

PERIODIC_BACKUP_FLAPPING

Msg: Periodic backup failed {{._actual_num_occurrences}} times in last {{._earliest_duration_str}}

Type: CRITICAL

This alert is raised when a periodic backup fails repeatedly.

PERIODIC_SNAPSHOT_FLAPPING

Msg: Periodic snapshot failed {{._actual_num_occurrences}} times in last {{._earliest_duration_str}}

Type: CRITICAL

This alert is raised when a periodic snapshot fails repeatedly.

APPLICATION_INVALID_STATE_EXTERNAL

Msg: {{.Service}}.{{.Task}} on {{.Machine}} at location {{.Location}}

Type: CRITICAL

Raised when an application raises an invalid-state alert.