Understand backup modes

Learn about types of backups.

A backup is a procedure that stores a snapshot outside of a ThoughtSpot cluster. Backups are stored in a directory on a local or network file system. You can store all of the data associated with a snapshot, a portion of that data, or only metadata. Other advanced administrative operations also use backups.

You can use a backup to restore a cluster to a prior state or to a differently configured appliance. You can also use a backup to move a cluster from an appliance to a virtual cluster, or vice versa.

You can create a manual backup or configure an automated, periodic backup. For manual backups, the system creates a backup using the named snapshot you specify. For periodic backups, the system uses the most recent snapshot to create the backup.

You should never disable the periodic snapshot system, because backups rely on it. For example, if you have disabled periodic snapshots while periodic backups are enabled, the periodic backup may use a very outdated snapshot or fail altogether.

ThoughtSpot usually stores backups on a NAS (network-attached storage) file system, but you can store them on a local disk as well. You can back up an AWS cluster using an S3 bucket, and a GCP cluster using a GCS bucket.

When creating a backup, ThoughtSpot copies a release tarball and several supporting files to the storage location you specify. These supporting files temporarily take about 10 GB of extra space beyond the backup itself, and are removed after the backup completes successfully. Make sure you have enough disk space both to take a backup and to store the result. Use the tscli storage df command to identify the amount of space available.
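As a complement to tscli storage df, a free-space check on a locally mounted backup directory can be sketched in Python. This is an illustrative sketch, not a ThoughtSpot tool: the function name and parameters are hypothetical, the default overhead reflects the roughly 10 GB of temporary supporting files mentioned above, and the check does not apply to S3 or GCS buckets.

```python
import shutil

GIB = 1024 ** 3  # assumption: treat 1 GB as 2**30 bytes for a conservative estimate


def has_enough_space(target_dir: str, backup_gb: float, overhead_gb: float = 10.0) -> bool:
    """Return True if target_dir has room for the backup plus temporary files.

    target_dir must be a local or mounted (for example, NAS) directory.
    backup_gb is your own estimate of the backup size.
    """
    free_bytes = shutil.disk_usage(target_dir).free
    needed_bytes = (backup_gb + overhead_gb) * GIB
    return free_bytes >= needed_bytes
```

For example, has_enough_space("/path/to/backups", backup_gb=20) reports whether about 30 GB is free on the file system that holds the (hypothetical) /path/to/backups directory.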

You can create a backup using one of three modes: full, lightweight, or dataless.

Full backups

Full backups are complete backups of the cluster with all data, whether loaded from the web interface or from tsload. This is the best mode for restoring a cluster and all your data. After a full backup is created, you can move the backup between clusters, even if the cluster configuration is different. Full backups can be as large as 20 GB, in addition to the 5 GB of additional files. Some installations can exceed these limits, which is why it is important to test your backup configuration.

Before creating a manual backup or configuring automated backups, make sure there is enough disk space on the target disk. Consider an example where each backup takes 18 GB. Creating a single backup needs about 18 + 5 = 23 GB of free disk space. If you want to store three backups, plan for about 3 × 18 + 5 = 59 GB, because each stored backup occupies its own space while the additional files are needed only temporarily. Don’t forget that the backup size can grow over time, so you should occasionally check to ensure you are not in danger of running out of disk space to store backups.
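The sizing arithmetic can be sketched as a small helper. This is an illustrative calculation, not part of tscli: the function name and parameters are hypothetical, and it assumes the temporary supporting files are needed only once at a time because they are removed after each backup completes. The overhead is a parameter so that you can substitute the figure that matches your installation.

```python
def required_free_gb(backup_gb: float, copies: int = 1, overhead_gb: float = 5.0) -> float:
    """Estimate the free disk space (in GB) needed on the backup target.

    backup_gb   -- size of one backup
    copies      -- number of backups kept on the target at the same time
    overhead_gb -- temporary space for supporting files (assumed to be needed once)
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return backup_gb * copies + overhead_gb


print(required_free_gb(18))            # 23.0
print(required_free_gb(18, copies=3))  # 59.0
```

Because backup size can grow over time, it is worth re-running an estimate like this with current sizes rather than relying on the first measurement.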

Lightweight backups

Lightweight backups contain everything that is in a dataless backup, as well as everything that makes up a cluster:

  • Cluster configuration (SSH, LDAP, and so on)

  • In-memory data cache

  • All data that is stored unencrypted in HDFS

  • Data uploaded by users

  • Metadata for the data store

  • Users, groups and permissions

  • Objects created by users (pinboards, worksheets, and formulas) with their shares and permissions

  • Data model and row-level security rules

Lightweight backups do not contain data loaded through ThoughtSpot Loader (tsload) or ODBC/JDBC drivers. The expectation is that data loaded by these mechanisms comes from external sources and so can be reloaded after the cluster is restored. The exception is data loaded by these mechanisms into tables that were first created through CSV import (that is, a user first loaded the tables using the UI). In this case, the data is saved, along with the tables it was loaded into.
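The inclusion rule above can be expressed as a short decision function. This is only a sketch of the rule as described in this section; the function name, the parameter names, and the load-method labels are hypothetical and do not correspond to any ThoughtSpot API.

```python
# Hypothetical labels for the loading mechanisms named in the text above.
EXTERNAL_LOADERS = {"tsload", "odbc", "jdbc"}


def in_lightweight_backup(load_method: str, table_created_by_csv_import: bool) -> bool:
    """Return True if the data would be saved in a lightweight backup.

    Data loaded through tsload or ODBC/JDBC is skipped unless the target
    table was first created through CSV import in the UI.
    """
    if load_method.lower() in EXTERNAL_LOADERS:
        return table_created_by_csv_import
    # Data uploaded by users through the UI is always included.
    return True
```

For example, in_lightweight_backup("tsload", False) is False, while in_lightweight_backup("tsload", True) is True because the table originated from a CSV import.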

Dataless backups

A dataless backup saves only the schema (metadata); no customer data or search indices are saved to the backup. Dataless backups allow you to send a copy of your cluster metadata to ThoughtSpot Support for troubleshooting. For clusters connected to an external cloud data warehouse, you can use dataless backups to recover the cluster in the case of irrecoverable issues, such as a cluster crash or a non-restartable cluster.

When restoring from a dataless backup, you must supply the correct release tarball, because this type of backup includes only the release version, not the release tarball itself. The release version that you use to restore the cluster must exactly match the version information in the backup.
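A pre-restore check for this requirement might look like the following sketch. It assumes the release version is available as a plain string for both the backup and the tarball, and it models "exactly match" as string equality; how ThoughtSpot itself compares versions is not specified here, and the function name and sample version strings are hypothetical.

```python
def versions_match(backup_version: str, tarball_version: str) -> bool:
    """Return True only if the two release versions are identical.

    Surrounding whitespace is ignored; nothing else is normalized, so
    "6.3.1" and "6.3.1.1" (hypothetical values) are treated as different.
    """
    return backup_version.strip() == tarball_version.strip()


if not versions_match("6.3.1", "6.3.1.1"):
    print("Release tarball does not match the backup; do not restore.")
```

Comparing for equality rather than compatibility is deliberate: the requirement above is an exact match, so a nearby patch release does not qualify.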

Dataless backup contents for deployments with ThoughtSpot’s in-memory database

Dataless backups contain the following information for deployments with ThoughtSpot’s in-memory database:

  • Metadata:

    • Users, groups, answers, Pinboards, visualizations, worksheets, data modeling settings, row-level security filters

  • Scheduled jobs (for example, scheduled Pinboards)

  • Cluster details: ID, name, version

  • [For clusters with S3 storage]: AWS configuration (label, region, S3 bucket name)

  • Cluster manager configuration (e.g., backup policy), and service configuration (e.g., service enabled or disabled, memory limits for service)

  • Database schema: Definitions of databases, tables, and views. The definition of a table includes its type (dimension/fact), version, internal/external, column details, region, primary key, unique key, and relationships (foreign key)

  • End-user license agreement (EULA) policy and file

  • ThoughtSpot Software artifacts: version, checksum of binaries

  • Hadoop layout

  • Firewall configuration

  • mailname and mailfromname

  • SAML configuration

  • Consumption pricing user activity

Dataless backups for ThoughtSpot’s in-memory database DO NOT include the following information:

  • Data stored in the in-memory database

  • Search index tokens

  • Usage information

  • Traces

Dataless backup contents for deployments with connections to external cloud data warehouses

Dataless backups contain the following information for deployments with connections to external cloud data warehouses:

  • Metadata:

    • Users, groups, answers, Pinboards, visualizations, worksheets, data modeling settings, row-level security filters, tables, columns

    • Configuration of connections to external cloud data warehouses

  • Scheduled jobs (for example, scheduled Pinboards)

  • Cluster details: ID, name, version

  • [For clusters with S3 storage]: AWS configuration (label, region, S3 bucket name)

  • Cluster manager configuration (e.g., backup policy), and service configuration (e.g., service enabled or disabled, memory limits for service)

  • End-user license agreement (EULA) policy and file

  • ThoughtSpot Software artifacts: version, checksum of binaries

  • Hadoop layout

  • Firewall configuration

  • mailname and mailfromname

  • SAML configuration

  • Consumption pricing user activity

Dataless backups for connections to external cloud data warehouses DO NOT include the following information:

  • Search index tokens

  • Usage information

  • Traces