Flexible aggregation functions
The primary purpose of the group_aggregate function is for level of detail analysis. As this function generates a sub-query as a SQL Common Table Expression (CTE), this can be used for advanced edge cases beyond this primary purpose.
The group_aggregate function provides greater controls than specific group functions like group_sum and group_max. This flexibility is primarily in relation to how grouping columns or filters will be accepted or ignored as part of the final calculation. The group_aggregate function also allows for the definition of an outer aggregate.
This function includes three input variables:
-
The aggregation type and the column to be aggregated.
-
How grouping columns will be accepted or ignored.
-
How filters will be accepted or ignored, or if a filter is always applied.
Instance modes
Since the 9.3.0.cl release, two query generation modes exist. Note that all new clusters have the new pattern defined, and that an exception is required to revert to the legacy behavior. For risk-mitigation purposes, this is slowly being applied across all instances.
To explain this behavior, let’s assume that the user wishes to answer the following business questions:
-
What are my sales by store?
-
What are my sales by region?
-
What is the contribution of store sales to the region?
In order to answer questions 2 and 3, we need level of detail calculations.
For sales by region, we would use the following formula:
sales of region = group_aggregate ( sum ( sales ) , { region } , query_filters ( ) )
For contribution of store sales to region, we would use the following formula:
sales percentage of region = sum ( sales ) / sum(group_aggregate ( sum ( sales ) , { region } , query_filters ( ) ))
The following table is an example of the resulting answer. Note that the granularity of the search is at the region and state level. Therefore, the total sales value is calculated at this level. The sales of region column shows a repeating value where the total is the amount for the entire region.
Respect fixed grouping columns (new and default behavior)
As per the example formula for sales of region, the calculation is fixed at the regional level. Therefore, if region is removed from the search or not visualized in a chart, the result does not change. That is, the result is still calculated at the regional level.
Reaggregate unused grouping columns (legacy behavior)
The legacy behavior would only fix the results when the column is defined in the search or as a visualized column. Therefore, if the region column is removed, the result is reaggregated as total sales. This behavior is incorrect as the analyst has defined a fixed definition. The contribution is the contribution of state sales to total sales.
Highlighted behavior changes
The following are highlights in formula changes between the legacy mode and the new mode.
Ratio definition
It is important to define the outer aggregation when calculating rations. The legacy behavior would include a default aggregation. As the new behavior respects the fixed grouping column, this should be defined in the formula.
sum ( sales ) /group_aggregate ( sum ( sales ) , { region } , query_filters ( ) )
sum ( sales ) / sum(group_aggregate ( sum ( sales ) , { region } , query_filters ( ) ))
Optional grouping columns
A dynamic denominator is useful where the resulting ratio is updated based upon the grouping columns included in the search. That is, if the user adds the region column, then the contribution ratio should be the state’s contribution to region. If the user removes the region column, then the contribution should be the state’s contribution to the total.
As the legacy behavior did not respect the fized column attributes, formulas were written to simulate this behavior. The new query pattern requires that the data analyst explicitly defines this optional grouping.
sum ( sales ) / sum(group_aggregate ( sum ( sales ), query_groups(region),query_filters ( ) ))
Groups and filters
Flexible group aggregate formulas allow for flexibility in both groupings and filters. The formulas give you the ability to specify only groupings or only filters.
Query groups
With query_groups () + {attribute_column} or query_groups () - {attribute_column}, you can aggregate results while including or excluding a column from the original search. The query_groups() function returns all attribute columns defined in the base search when the table view is displayed.
Columns are not included in the query_groups definition when a chart is displayed and the column is not visualized.
|
If, for example, you use the condition query_groups() - {region, sku, name}, this changes the level of detail for the group aggregate formula. In this scenario, region, sku, and name have been removed. If these columns are not included in the base search, then this definition is ignored. Under the new logic, the {region, sku, name} condition fixes the level of detail for the group by columns to the columns defined.
You can combine options to both add and remove columns like in the following example: query_groups() + {region} - {sku, name}.
Regarding dates, when query_groups() is defined, the date period defined in the search will be passed into the group function. Assuming the search is monthly, then the group function will also be at the monthly grain. You can use date functions to change the grain, such as {start_of_year(transaction date)}.
Query_groups optional grouping columns
The query_groups function is used in group_aggregate, first_value, and last_value functions. This returns a list of columns/ attributes that have been included in the search. This allows for flexible inclusion and exclusion of grouping columns using the following syntax:
-
query_groups(): includes all grouping columns. -
{region, department, store}: specifies a list of grouping columns explicitly. These columns can be considered as the fixed level of detail for this formula. -
query_groups() - {customer_name, customer_region}: includes all grouping columns except customer_name and customer_region. -
query_groups() + {order_id}: includes all grouping columns and ensures order_id is always included. -
query_groups() - {customer_name} + {order_id}: combines the inclusion of order_id and exclusion of customer_name.
You can now define an explicit optional list of grouping columns using the following syntax: `query_groups(region, department, store). This ensures that only the specified columns are included, if they are present in the query.
Previously, this capability required manually excluding all other columns from the underlying Model. This approach was difficult to maintain, as new columns added to the Model would also need to be added to the exclusion list for every formula.
Query filters
Query filters support the ability to define default filters, accept filters, or ignore filters.
Use the following syntax:
* query_filters(): accepts all filters applied in the search.
* query_filters() + {filter condition}: accepts all filters and includes a specific filter, such as ship mode = car.
* query_filters() - {column}: ignores filters from a specific column or columns.
* {column}: only accepts filters for a specific column or columns.
With query_filters() + {filter condition} or query_filters() - {filter condition}, you can aggregate the results while including or excluding a filter condition. Consider the filter condition Ship Mode - 'car'.
For a search on Category Customer ID sales by customer id and category Ship Mode='car', you can add a formula to calculate sales by category for each customer. For example, the formula sales by Customer ID and Category= group_aggregate (sum(Sales), {Category, Customer ID}, query_filters()+{Ship Mode='air'}). In this case, the results will be aggregated based on the dimensions
Category and Customer ID and the filters air and car.
With query_filters() - {column}, you will be able to aggregate the results while removing any expression related to a column. For a search on Customer ID sales by customer id and category Ship Mode='car', you can add a formula to calculate sales for each customer while ignoring the filter on a column. For example the formula sales by Customer ID and Category= group_aggregate (sum(Sales), {Customer ID, Category }, query_filters()-{Ship Mode}). In this case, the results will be aggregated based on the dimensions in the search (Customer ID) and any filter related to Ship Mode will not be considered while aggregating the results.
Related information
For more examples of flexible aggregation, see the group_aggregate function in the Formula function reference.
To learn about aggregation formulas in general, see Overview of aggregate formulas and Group aggregation functions.
To understand group aggregate query filters, see Aggregate filters.
To learn about how the
group-aggregatefunction can be used within your business practice, we encourage you to see Reaggregation scenarios in practice.