ThoughtSpot tables can be replicated or sharded. Replicated tables exist in their entirety, the complete data set, on each node. Sharded tables consist of a single data set divided into multiple tables or shards. The shards have identical schemas but different sets of data.
When to use sharding
By default, ThoughtSpot tables are replicated, you must explicitly shard tables. Sharding your tables impacts the total amount of memory used by the table as well as its performance.
For example, you might shard a large table of sales data. So, you could divide a single sales table into shards each of which contains only the data falling within a single year. These shards are then distributed across several nodes. Requests for sales data are dispersed both by the year and the location of the shard in the node cluster. No single table or node is overloaded, and so the performance of a query and the system load are both improved.
To optimize ThoughtSpot performance, you should shard very large fact tables whenever possible. If you have a large dimension table, you might choose to shard it along with the fact table it is joined with. Sharding both the fact and dimension table is known as co-sharding.
Table sizes and sharding recommendations
|Number of rows per shard||5-10 million|
|Maximum||10 million rows per shard|
|Maximum number of shards||~ 80% of CPU cores|
|Number of rows in table||1.1 billion|
|CPUS in cluster||256|
|HASH (128)||~50% of total CPUs|
|8.6 million rows per shard|
How to shard
Sharding is a type partitioning and is sometimes called Horizontal
partitioning. The term sharding is particular to situations where data is
distributed not only among tables but across nodes in a system. To create a
sharded table add the add
PARTITION BY HASH ( ) clause to your
TQL> CREATE TABLE ... ...PARTITION BY HASH (96) KEY ("customer_id");
HASH parameter determines the number of shards and the
KEY parameter the
sharding key. The recommended number of shards depends upon the number of nodes
in your cluster:
|Number of Nodes||Number of Shards|
If you omit the
PARTITION BY HASH statement or if the
HASH parameter is 1
(one), the table is unsharded. This also means the table physically exists in
its entirety on each node.
If you want to use the primary key for sharding, specify that the table
is to be partitioned by
HASH on the primary key, as in this example:
TQL> CREATE TABLE "supplier" ( "s_suppkey" BIGINT, "s_name" VARCHAR(255), "s_address" VARCHAR(255), "s_city" VARCHAR(255), "s_phone" VARCHAR(255), CONSTRAINT PRIMARY KEY ("s_suppkey") ) PARTITION BY HASH (96) KEY ("s_suppkey");
The system does not use primary keys as sharding keys by default. If you specify
PARTITION BY HASH statement with a
HASH greater than 1 (one) but omit the
KEY parameter ThoughtSpot shards the table randomly. This is not recommended;
avoid this by always ensuring you specify the
KEY parameter with a HASH
greater than 1 (one).
How to choose a shard key
When you shard a large table, you select a shard key from the table. This key exists in every shard. You can use any data type that is valid for use as the primary key as the shard key. Choosing a shard key plays an important role in the number of shards and the size of any single shard.
A shard key should contain a value that has a good distribution (roughly the number of rows with each value in that column). This value is typically part of the primary key, but it can include other columns. For example:
CREATE TABLE "sales_fact" ("saleid" int, "locationid" int, "vendorid" int, "quantity" int, "sale_amount" double, "fruitid" int, CONSTRAINT PRIMARY KEY("saleid,vendorid")) PARTITION BY HASH(96) KEY ("saleid");
Notice the shard key contains the
saleid value that is also part of the
primary key. When creating a shard key use these guidelines.
Include one or more values of the table’s primary key in the shard key.
This prevents scenarios where the data with the same primary key ends up in different shards and nodes because the shard key changed.
If you expect to join two tables that are both sharded, make sure both tables use the same shard key.
This guideline ensures better join performance. So, for example, if you have two tables and the primary keys are:
PRIMARY KEY("saleid,vendorid")on A
PRIMARY KEY("saleid,custerid")on B
You should use
saleidfor your shared key when you shard both table A and B.
Choose a shard key so that the data is distributed well across the keys.
For example, suppose the table you want to shard has a primary key made up of
locationid. If you have 10K sales but 400 locations, and 2000 customers, you would not want to use the
locationidin your shard key if 5k sales were concentrated in just 2 locations. The result would be data in fewer shards and degrade your performance. Instead, your shard key may be
Choose a shard key that results in a wide variety of keys.
For example, suppose the table you want to shard has a primary key made up of
locationid. Suppose the table has 10K sales, 40 locations, and 200 products. If the sales are evenly distributed across locations you would not want to use the
locationidin your shard key. Instead,
productidwould be the better choice as it results in a wider variety of keys.
As mentioned in the previous section, it is possible to simply use the primary
key as a shard key. It isn’t a good idea to use shard keys outside of the
primary key. The reason is that it, with a non-primary shard key, it is
possible to get two versions of a record if the shard key for a record changes,
but the primary key doesn’t. A second version reults because, in the absence of
a unique shard key, the system create a secondary record rather than doing a SQL
Sharded dimension tables
In a typical schema, you’d have a sharded fact table with foreign keys to small dimension tables. These small dimension tables are replicated in their entirety and distributed on every node. This works best where dimension tables under 50MB in size.
If you have a large dimension table, replicating it and distributing it can impact the performance of your ThoughtSpot System. In this case, you want to shard the dimension tables and distribute it the same way as the fact table it joins to.
When sharding both a fact and its dimension table (known as co-sharding) keep in mind the guidance for creating a shard key. Only shard dimension tables if the dimension table is large (over 50MB) and the join between the fact and dimension tables use the same columns. Specifically, the tables must:
- be related by a primary key and foreign key
- be sharded on the same primary key/foreign key
- have the same number of regions (or shards)
If these requirements are met, ThoughtSpot automatically co-shards the tables for you. Co-sharded tables are always joined on the sharding key. Data skew can develop if a very large proportion of the rows have the same sharding key.
This example shows the
CREATE TABLE statements that meet the criteria for
sharding both a fact table and its dimension table:
TQL> CREATE TABLE products_dim ( "id" int, "prod_name" varchar(30), "prod_desc" varchar(100), PRIMARY KEY ("id") ) PARTITION BY HASH (96) KEY ("id") ; TQL> CREATE TABLE retail_fact ( "trans_id" int, "product_id" int, "amount" double, FOREIGN KEY ("product_id") REFERENCES products_dim ("id") ) PARTITION BY HASH (96) KEY ("product_id") ;
If a dimension table is joined to multiple fact tables, all of the fact tables must be sharded in the same way as the dimension table. Self-joins are not supported.
Joining two sharded fact tables
You can also join two sharded fact tables with different shard keys, this is known as non co-sharded tables. It may take a while to join two tables sharded on different keys since a lot of data redistribution is required. Therefore, ThoughtSpot recommends that you use a common shard key for two fact tables.
You are not limited by the column connection or relationship type.