ThoughtSpot caches data as relational tables in memory. You can source tables from different data sources and join them together. ThoughtSpot has several approaches for getting data into the cluster, such as the tsload command line utility. JDBC and ODBC drivers are also available.
If your company stores source data externally in data warehouses, you can use ThoughtSpot Embrace to directly query that data and use ThoughtSpot’s analysis and visualization features, without moving the data into ThoughtSpot. While Embrace caches metadata, it does not cache the data itself within ThoughtSpot.
Embrace supports the following external databases:
- Amazon Redshift
- Google BigQuery
- Microsoft Azure Synapse
- SAP HANA
DataFlow is a ThoughtSpot capability for ingesting data from dozens of the most common databases, data warehouses, file sources, and applications. If your company maintains large sources of data externally, you can use DataFlow to ingest the relevant information, and then use ThoughtSpot’s analysis and visualization features. After you configure a scheduled refresh, your analysis visuals stay up to date.
We recommend DataFlow for large amounts of data and for scheduled refresh. Source data often spans many years, and reloading all of it every day becomes infeasible. With DataFlow, you can specify filter conditions so that each refresh retrieves only the latest data.
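To illustrate the idea of a filter condition that limits a refresh to recent data, here is a minimal Python sketch. It only builds a SQL-style predicate string; it does not call DataFlow, and the function name `incremental_filter`, the column `order_date`, and the one-day default window are all hypothetical choices for illustration. The exact filter syntax that DataFlow accepts depends on the source system.

```python
from datetime import date, timedelta


def incremental_filter(column: str, as_of: date, days_back: int = 1) -> str:
    """Build a predicate that keeps rows from the last `days_back` days.

    `column` is a hypothetical date column in the source table.
    """
    if days_back < 1:
        raise ValueError("days_back must be at least 1")
    cutoff = as_of - timedelta(days=days_back)
    # ISO 8601 date literal; adjust quoting to match your source database.
    return f"{column} >= '{cutoff.isoformat()}'"


# A daily refresh on 2024-03-15 that picks up rows from the previous day onward.
print(incremental_filter("order_date", date(2024, 3, 15)))
# prints: order_date >= '2024-03-14'
```

A wider window, for example `days_back=7`, trades some redundant reloading for tolerance of late-arriving rows.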
JDBC and ODBC Drivers
ThoughtSpot provides JDBC and ODBC drivers that you can use to write data to ThoughtSpot. They are useful for customers who already have an ETL process or tool and want to extend it to populate the ThoughtSpot cache.
JDBC and ODBC drivers are appropriate when you:
- have an ETL tool, such as Informatica or SSIS
- have available resources to create and manage ETL
- have smaller daily loads
You can use the tsload command line tool to bulk load delimited data with very high throughput. Individual users can also upload smaller (< 50 MB) spreadsheets or delimited files.
We recommend the tsload approach in the following cases:
- for the initial data load
- when JDBC or ODBC drivers are not available
- when there are large recurring daily loads
- when you need higher throughput; note that this can add I/O costs
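As a rough illustration of preparing delimited data and checking it against the size threshold mentioned above, the following Python sketch writes rows to a delimited file and suggests a route based on file size. It does not invoke tsload. The helper names, the sample rows, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are assumptions made for this example.

```python
import csv
import os
import tempfile

# Assumption: the "< 50 MB" upload guideline is interpreted as 50 MiB.
USER_UPLOAD_LIMIT_BYTES = 50 * 1024 * 1024


def write_delimited(rows, path, delimiter=","):
    """Write rows to a delimited text file and return its size in bytes."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
    return os.path.getsize(path)


def suggested_route(size_bytes):
    """Suggest a user upload for small files and tsload for larger ones."""
    return "user upload" if size_bytes < USER_UPLOAD_LIMIT_BYTES else "tsload"


# Hypothetical sample data with a header row.
rows = [["id", "region", "amount"], [1, "west", 19.5], [2, "east", 7.25]]
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "sales.csv")
    size = write_delimited(rows, path)
    print(size, suggested_route(size))
```

File size is only one input. The other cases in the list above, such as an initial load or missing drivers, can point to tsload even for small files.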
Choosing a Data Use Strategy
The approach you choose depends on your environment and data needs. There are, of course, tradeoffs between different data caching options.
Many implementations use a variety of approaches. For example, a solution with a large amount of initial data and smaller daily increments might use tsload to load the initial data, and then use the JDBC driver with an ETL tool for incremental loads.
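The guidance on this page can be summarized as a small decision helper. The Python sketch below is one possible encoding of these recommendations, not an official selection algorithm. The `LoadProfile` fields and the `recommend` function are hypothetical names, and real deployments will weigh more factors than these five flags.

```python
from dataclasses import dataclass


@dataclass
class LoadProfile:
    """Hypothetical description of a data loading scenario."""
    keep_data_external: bool = False       # query the warehouse in place
    needs_scheduled_refresh: bool = False  # recurring, filtered ingestion
    has_etl_tool: bool = False             # for example Informatica or SSIS
    initial_load: bool = False             # first bulk load of history
    large_daily_loads: bool = False        # high recurring volume


def recommend(profile: LoadProfile) -> list:
    """Return the approaches that the guidance on this page suggests."""
    if profile.keep_data_external:
        # Embrace queries external data without moving it into ThoughtSpot.
        return ["Embrace"]
    options = []
    if profile.initial_load or profile.large_daily_loads:
        options.append("tsload")
    if profile.has_etl_tool and not profile.large_daily_loads:
        options.append("JDBC/ODBC")
    if profile.needs_scheduled_refresh:
        options.append("DataFlow")
    return options


# Large initial load plus smaller daily increments handled by an ETL tool.
print(recommend(LoadProfile(has_etl_tool=True, initial_load=True)))
# prints: ['tsload', 'JDBC/ODBC']
```

The function returns a list because, as noted above, a single implementation often combines more than one approach.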
- [ThoughtSpot Embrace](/latest/data-integrate/embrace/embrace-intro.html)
- [ThoughtSpot DataFlow](/latest/data-integrate/dataflow/dataflow.html)
- [ThoughtSpot with ODBC](/latest/data-integrate/clients/about-odbc.html)
- [ThoughtSpot with JDBC](/latest/data-integrate/clients/about-jdbc.html)