Optimizing ThoughtSpot Workloads with Databricks

To improve performance and cost-efficiency for BI workloads, combine serverless SQL warehouses, Databricks' Photon engine, and the Delta cache. This document outlines key recommendations for optimizing ThoughtSpot BI workloads on Databricks.


Utilize Serverless SQL Warehouse Clusters

Instant Start

Serverless SQL Warehouses start in seconds, compared to the minutes required for general-purpose, non-serverless clusters.

Elastic Scaling

Automatically scales with the workload, within configurable minimum and maximum cluster counts.

Fully Managed Service

Simplifies operations with no need for manual cluster management or software updates.

Auto Stop

Configure warehouses to auto-stop after a set number of minutes of inactivity to prevent unnecessary costs.

Concurrency Tuning

As a starting point, scale between a minimum of 2 and a maximum of 10 clusters, depending on the workload; monitor query queuing and tune accordingly.

Engage Databricks Team

Collaborate with the Databricks account team to fine-tune SQL Warehouses for optimal performance and cost.

Leverage Photon

High Performance

Uses vectorized, CPU-level optimizations and efficient memory management to speed up query execution.

Optimized Parquet Writing

Photon's native C++ Parquet writer speeds up operations that write Parquet and Delta files.

Serverless Integration

Enabled by default on serverless SQL warehouses, improving performance with no additional configuration.
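One way to confirm Photon is in use is to inspect a query plan: on Photon-enabled compute, the physical plan shows Photon operators. A minimal sketch, with an illustrative table name:

```sql
-- Illustrative: inspect the physical plan of a typical BI query.
-- On Photon-enabled compute, operators appear with a "Photon" prefix
-- (e.g. PhotonScan, PhotonGroupingAgg) in the plan output.
EXPLAIN
SELECT region, SUM(revenue)
FROM sales_gold   -- hypothetical table
GROUP BY region;
```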

Implement Delta Cache

Faster Access

Keeps copies of frequently accessed data on the local SSDs of worker nodes, significantly reducing query times.

Automatic Inclusion

Standard with SQL Serverless warehouses, requiring no extra setup.

Usage Tip

Preload Data

Run CACHE SELECT * FROM table when a SQL warehouse (formerly "SQL endpoint") starts, to preload "hot" tables into the cache and ensure rapid access from the first query.
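For example, warm-up statements like the following (the table and column names are illustrative) can be run when the warehouse comes online:

```sql
-- Illustrative warm-up: pull a frequently queried table into the
-- Delta cache on the warehouse's local SSDs before BI traffic arrives.
CACHE SELECT * FROM sales_gold;

-- Caching only the columns ThoughtSpot actually reads is cheaper still:
CACHE SELECT region, revenue, order_date FROM sales_gold;
```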

Be Cognizant of Other Tunables

Lazy Evaluation

Spark's lazy evaluation matters for data engineering and pipeline writes, although it does not directly affect ThoughtSpot query workloads.

Z-Order Optimize

Run OPTIMIZE with Z-Ordering regularly to co-locate related data in the same files; queries can then skip irrelevant files, which accelerates them and reduces the amount of data read from cloud storage.
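A sketch of a typical maintenance statement, assuming a table that ThoughtSpot queries frequently filter on date and customer (both names are illustrative):

```sql
-- Illustrative: compact small files and rewrite data so rows with
-- similar event_date and customer_id values are co-located,
-- improving data skipping for filters on those columns.
OPTIMIZE sales_gold
ZORDER BY (event_date, customer_id);
```

Z-Order by the columns most often used in filters and joins; ordering by too many columns dilutes the benefit for each.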
