Improve Query Performance and Calculation Times Using SSAS Aggregations

Aggregations

An aggregation is a data structure that stores precalculated data that Analysis Services uses to enhance query performance. You can define the aggregation design for each partition independently. Each partition can be thought of as being an aggregation at the lowest granularity of the measure group. Aggregations that are defined for a partition are processed out of the leaf level partition data by aggregating it to a higher granularity.

When a query requests data at higher levels, the aggregation structure can deliver the data more quickly because the data is already aggregated in fewer rows. As you design aggregations, you must consider the querying benefits that aggregations provide compared with the time it takes to create and refresh the aggregations. In fact, adding unnecessary aggregations can worsen query performance because the rare hits move the aggregation into the file cache at the cost of moving something else out.

While aggregations are physically designed per measure group partition, the optimization techniques for maximizing aggregation design apply whether you have one or many partitions. In this section, unless otherwise stated, aggregations are discussed in the fundamental context of a cube with a single measure group and single partition. For more information about how you can improve query performance using multiple partitions, see Partition Strategy.

Detecting Aggregation Hits

You can use SQL Server Profiler to view how and when aggregations are used to satisfy queries.

Within SQL Server Profiler, there are several events that describe how a query is fulfilled. The event that specifically pertains to aggregation hits is the Get Data From Aggregation event.

Figure 1: Scenario 1: SQL Server Profiler trace for a cube with an aggregation hit

This figure displays a SQL Server Profiler trace of the query’s resolution against a cube with aggregations. In the SQL Server Profiler trace, the operations that the storage engine performs to produce the result set are revealed.

The storage engine gets data from Aggregation C 0000, 0001, 0000, as indicated by the Get Data From Aggregation event. In addition to the aggregation name, Aggregation C, Figure 1 displays a vector, 0000, 0001, 0000, that describes the content of the aggregation. What this vector actually means is described in the next section, How to Interpret Aggregations. The aggregation data is loaded into the storage engine measure group cache, from which the query processor retrieves it and returns the result set to the client.

What if no aggregation can satisfy the query request? In that case, the Get Data From Aggregation event is missing, as you can see from the following example, which shows the same cube with no aggregations.

Figure 2: Scenario 2: SQL Server Profiler trace for a cube with no aggregation hit

After the query is submitted, rather than retrieving data from an aggregation, the storage engine goes to the detail data in the partition. From this point, the process is the same. The data is loaded into the storage engine measure group cache.

How to Interpret Aggregations

When Analysis Services creates an aggregation, each dimension is represented by a vector that indicates, for each attribute, whether the aggregation includes the attribute level or the All level. The attribute level is represented by 1 and the All level by 0.

For example, consider the following aggregation vectors for the product dimension:

  • Aggregation By ProductKey Attribute = [Product Key]:1 [Color]:0 [Subcategory]:0 [Category]:0 or 1000
  • Aggregation By Category Attribute = [Product Key]:0 [Color]:0 [Subcategory]:0 [Category]:1 or 0001
  • Aggregation By ProductKey.All and Color.All and Subcategory.All and Category.All = [Product Key]:0 [Color]:0 [Subcategory]:0 [Category]:0 or 0000

To identify each aggregation, Analysis Services combines the dimension vectors into one long vector path, also called a subcube, with each dimension vector separated by commas.

The order of the dimensions in the vector is determined by the order of the dimensions in the measure group. To find the order of dimensions in the measure group, use one of the following two techniques:

  1. With the cube opened in SQL Server Business Intelligence Development Studio, review the order of dimensions in a measure group on the Cube Structure tab. The order of dimensions in the cube is displayed in the Dimensions pane.
  2. As an alternative, review the order of dimensions listed in the cube’s XMLA definition (see the parsing sketch after this list).
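
For the second technique, a minimal Python sketch along the following lines can pull the dimension order out of the cube's ASSL/XMLA definition. The file name is hypothetical, and the element names assume the standard ASSL schema used by .cube project files; verify both against your own project.

```python
# Sketch: list each measure group's dimension order from a cube's
# ASSL/XMLA definition (e.g., the .cube file in a BIDS/SSDT project).
# "Adventure Works.cube" is a hypothetical file name.
import xml.etree.ElementTree as ET

ASSL = "{http://schemas.microsoft.com/analysisservices/2003/engine}"

tree = ET.parse("Adventure Works.cube")
for mg in tree.getroot().iter(f"{ASSL}MeasureGroup"):
    name = mg.findtext(f"{ASSL}Name")
    dims_el = mg.find(f"{ASSL}Dimensions")
    # This order determines the order of dimension vectors in the subcube path.
    dims = ([d.findtext(f"{ASSL}CubeDimensionID") for d in dims_el]
            if dims_el is not None else [])
    print(f"{name}: {dims}")
```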

The order of attributes in the vector for each dimension is determined by the order of attributes in the dimension. You can identify the order of attributes in each dimension by reviewing the dimension XML file. For example, the subcube definition (0000, 0001, 0001) describes an aggregation for the following (a decoding sketch follows the list):

  • Product – All, All, All, All
  • Customer – All, All, All, State/Province
  • Order Date – All, All, All, Year
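
A small Python sketch makes this decoding mechanical. The dimension and attribute orderings below are illustrative assumptions; in practice you would read them from the cube and dimension definitions as described above.

```python
# Sketch: decode a subcube vector such as "0000, 0001, 0001" into
# attribute levels. A bit of 1 means the attribute level; 0 means All.
DIMENSION_ORDER = {
    "Product":    ["Product Key", "Color", "Subcategory", "Category"],
    "Customer":   ["Customer Key", "City", "Country", "State/Province"],
    "Order Date": ["Date Key", "Month", "Quarter", "Year"],
}

def decode_subcube(subcube: str) -> None:
    vectors = [v.strip() for v in subcube.split(",")]
    for (dim, attrs), vector in zip(DIMENSION_ORDER.items(), vectors):
        levels = [attr if bit == "1" else "All"
                  for attr, bit in zip(attrs, vector)]
        print(f"{dim} - {', '.join(levels)}")

decode_subcube("0000, 0001, 0001")
# Product - All, All, All, All
# Customer - All, All, All, State/Province
# Order Date - All, All, All, Year
```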

Understanding how to read these vectors is helpful when you review aggregation hits in SQL Server Profiler. In SQL Server Profiler, you can view how the vector maps to specific dimension attributes by enabling the Query Subcube Verbose event. In some cases (such as when attributes are disabled), it may be easier to view the Aggregation Design tab and use the Advanced view of the aggregations.

Aggregation Tradeoffs

Aggregations can improve query response time, but they can increase processing time and disk storage space, consume memory that could otherwise be allocated to cache, and potentially slow the speed of other queries. The latter may occur because there is a direct correlation between the number of aggregations and the time the Analysis Services storage engine needs to parse them. In addition, aggregations may cause thrashing due to their potential impact on the file system cache. A general rule of thumb is that aggregations should be less than one-third of the size of the fact table.

Building Aggregations

Individual aggregations are organized into collections of aggregations called AggregationDesigns. Once created, an AggregationDesign can be applied to many partitions. A single measure group can also have multiple AggregationDesigns, so that you can choose different sets of aggregations for different partitions.

To help Analysis Services successfully apply the aggregation design algorithm, you can use the following optimization techniques to influence and enhance the aggregation design. This section discusses the following:

  • The importance of attribute hierarchies for aggregations
  • Aggregation design and partitions
  • Specifying statistics about cube data
  • Suggesting aggregation candidates
  • Usage-based optimization
  • Large cube aggregations
  • Distinct count partition aggregation considerations

Importance of Attribute Hierarchies

Aggregations work better when the cube is based on a multidimensional data model that includes natural hierarchies. While it is common in relational databases to have attributes independent of each other, multidimensional star schemas have attributes related to each other to create natural hierarchies. This is important because it allows aggregations built at a lower level of a natural hierarchy to be used when querying at a higher level.

Note that attributes that are exposed only in attribute hierarchies are not automatically considered for aggregation by the Aggregation Design Wizard. Therefore, queries involving these attributes are satisfied by summarizing data from the primary key. Without the benefit of aggregations, query performance against these attribute hierarchies can be slow. To enhance performance, you can flag an attribute as an aggregation candidate by using the Aggregation Usage property. For more information about this technique, see Suggesting Aggregation Candidates. However, before you modify the Aggregation Usage property, you should consider whether you can take advantage of user hierarchies instead.

Aggregation Design and Partitions

When you define your partitions, they do not necessarily have to contain uniform datasets or aggregation designs. For example, for a given measure group, you may have 3 yearly partitions, 11 monthly partitions, 3 weekly partitions, and 1–7 daily partitions. Heterogeneous partitions with different levels of detail allow you to more easily manage the loading of new data without disturbing existing, larger, and stale partitions (more on this in the processing section), and you can design aggregations for groups of partitions that share the same access pattern. For each partition, you can use a different aggregation design. By taking advantage of this flexibility, you can identify those data sets that require heavier aggregation design.

Consider the following example. In a cube with multiple monthly partitions, new data may flow into the single partition corresponding to the latest month. Generally that is also the partition most frequently queried. A common aggregation strategy in this case is to perform usage-based optimization to the most recent partition, leaving older, less frequently queried partitions as they are.

If you automate partition creation, it is easy to set the AggregationDesignID for the new partition at creation time and specify the slice for the partition. After that, the partition is ready to be processed. At a later stage, you may choose to update the aggregation design for a partition when its usage pattern changes. Again, you can just update the AggregationDesignID, but you will also need to invoke ProcessIndexes so that the new aggregation design takes effect for the already processed partition.
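
As a rough sketch of that second step, the following Python function builds the XMLA Batch/Process command that rebuilds a partition's indexes and aggregations. The IDs are placeholders; you would submit the resulting command through your usual tooling (for example, an XMLA query window in SQL Server Management Studio).

```python
# Sketch: build the XMLA command that runs ProcessIndexes on a partition
# after its AggregationDesignID has been changed. All IDs are placeholders.
def process_indexes_command(database_id: str, cube_id: str,
                            measure_group_id: str, partition_id: str) -> str:
    return f"""
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Process>
    <Object>
      <DatabaseID>{database_id}</DatabaseID>
      <CubeID>{cube_id}</CubeID>
      <MeasureGroupID>{measure_group_id}</MeasureGroupID>
      <PartitionID>{partition_id}</PartitionID>
    </Object>
    <Type>ProcessIndexes</Type>
  </Process>
</Batch>
""".strip()

print(process_indexes_command("Adventure Works DW", "Adventure Works",
                              "Internet Sales", "Internet_Sales_2017"))
```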

Specifying Statistics About Cube Data

To make intelligent assessments of aggregation costs, the design algorithm analyzes statistics about the cube for each aggregation candidate. Examples of this metadata include member counts and fact table counts. Ensuring that your metadata is up-to-date can improve the effectiveness of your aggregation design.

Whenever you use multiple partitions for a given measure group, ensure that you update the data statistics for each partition. More specifically, it is important to ensure that the partition data and member counts (such as EstimatedRows and EstimatedCount properties) accurately reflect the specific data in the partition and not the data across the entire measure group.
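
As one hedged way to automate this, the sketch below patches per-partition EstimatedRows values in an SSDT .partitions project file. The file name, partition IDs, and row counts are hypothetical, and the element names assume the standard ASSL schema; verify against your own project files.

```python
# Sketch: set partition-level EstimatedRows so the aggregation designer
# sees per-partition counts rather than measure-group-wide counts.
import xml.etree.ElementTree as ET

ASSL_URI = "http://schemas.microsoft.com/analysisservices/2003/engine"
ET.register_namespace("", ASSL_URI)
NS = "{" + ASSL_URI + "}"

actual_rows = {  # hypothetical per-partition row counts
    "Internet_Sales_2016": 1_200_000,
    "Internet_Sales_2017": 950_000,
}

tree = ET.parse("Adventure Works.partitions")  # hypothetical file name
for part in tree.getroot().iter(NS + "Partition"):
    part_id = part.findtext(NS + "ID")
    if part_id in actual_rows:
        est = part.find(NS + "EstimatedRows")
        if est is None:
            est = ET.SubElement(part, NS + "EstimatedRows")
        est.text = str(actual_rows[part_id])

tree.write("Adventure Works.partitions", encoding="utf-8",
           xml_declaration=True)
```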

Suggesting Aggregation Candidates

When Analysis Services designs aggregations, the aggregation design algorithm does not automatically consider every attribute for aggregation. Instead, Analysis Services uses the Aggregation Usage property to determine which attributes it should consider. Consequently, for every measure group, verify the attributes that are automatically considered for aggregation and then determine whether you need to suggest additional aggregation candidates.

Aggregation Usage Rules

An aggregation candidate is an attribute that Analysis Services considers for potential aggregation. To determine whether or not a specific attribute is an aggregation candidate, the storage engine relies on the value of the Aggregation Usage property. The Aggregation Usage property is set per cube attribute, so it globally applies across all measure groups and partitions in the cube. For each attribute in a cube, the Aggregation Usage property can have one of four potential values: Full, None, Unrestricted, and Default.

  • Full—Each aggregation for the cube must include either this attribute or a related attribute that is lower in the attribute chain. For example, suppose you have a product dimension with the following chain of related attributes: [Product], [Product Subcategory], and [Product Category]. If you specify the Aggregation Usage for [Product Category] to be Full, Analysis Services may create an aggregation that includes [Product Subcategory] as opposed to [Product Category], given that [Product Subcategory] is related to [Product Category] and can be used to derive [Product Category] totals.
  • None—No aggregation for the cube can include this attribute.
  • Unrestricted—No restrictions are placed on the aggregation designer; however, the attribute must still be evaluated to determine whether it is a valuable aggregation candidate.
  • Default—The designer applies a default rule based on the type of attribute and dimension. This defines the default value of the Aggregation Usage property.

The default rule is highly conservative about which attributes are considered for aggregation. It is broken down into four constraints (a conceptual sketch follows the list).

  • Default Constraint 1—Unrestricted for Granularity Attributes—For a dimension’s measure group granularity attribute, default means Unrestricted. The granularity attribute is the same as the dimension’s key attribute as long as the measure group joins to the dimension using the primary key attribute.
  • Default Constraint 2—None for Special Dimension Types—For all attributes (except All) in many-to-many dimensions, nonmaterialized reference dimensions, and data mining dimensions, default means None. This means you can sometimes benefit from creating leaf level projections for many-to-many dimensions. Note that these defaults do not apply to parent-child dimensions; for more information, see the Special Considerations > Parent-Child Dimensions section.
  • Default Constraint 3—Unrestricted for Natural Hierarchies—A natural hierarchy is a user hierarchy where all attributes participating in the hierarchy contain attribute relationships to the attribute sourcing the next level. For such attributes, default means Unrestricted, except for nonaggregatable attributes, which are set to Full (even if they are not in a user hierarchy).
  • Default Constraint 4—None for Everything Else—For all other dimension attributes, default means None.
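
The four constraints can be summarized as a decision function. The sketch below is a conceptual model only: the attribute descriptor is a hypothetical dictionary, and the real decision is made inside the aggregation design algorithm.

```python
# Sketch: a conceptual model of the four default-rule constraints for
# the Aggregation Usage property. The input dict is hypothetical.
def default_aggregation_usage(attr: dict) -> str:
    # Constraint 1: the measure group granularity attribute.
    if attr.get("is_granularity_attribute"):
        return "Unrestricted"
    # Constraint 2: special dimension types get None.
    if attr.get("dimension_type") in {"many_to_many",
                                      "nonmaterialized_reference",
                                      "data_mining"}:
        return "None"
    # Constraint 3: nonaggregatable attributes are Full even outside a
    # user hierarchy; natural-hierarchy attributes are Unrestricted.
    if attr.get("is_nonaggregatable"):
        return "Full"
    if attr.get("in_natural_hierarchy"):
        return "Unrestricted"
    # Constraint 4: everything else gets None.
    return "None"

print(default_aggregation_usage({"in_natural_hierarchy": True}))  # Unrestricted
```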

Aggregation Usage Guidelines

Given these behaviors of the Aggregation Usage property, apply the following guidelines when designing or using aggregations:

  • Attributes exposed solely as attribute hierarchies—If a given attribute is exposed only as an attribute hierarchy, such as Color, you may want to change its Aggregation Usage property as follows.
    • First, change the value of the Aggregation Usage property from Default to Unrestricted if the attribute is a commonly used attribute or if there are special considerations for improving the performance in a particular pivot or drilldown. For example, if you have highly summarized scorecard style reports, you want to ensure that the users experience good initial query response time before drilling around into more detail.
    • While setting the Aggregation Usage property of a particular attribute hierarchy to Unrestricted is appropriate in some scenarios, do not set all attribute hierarchies to Unrestricted. Increasing the number of attributes to be considered increases the problem space for the aggregation algorithm. The wizard can take at least an hour to complete the design and considerably more time to process. Set the property to Unrestricted only for the commonly queried attribute hierarchies. The general rule is to allow 5-10 Unrestricted attributes per dimension.
    • Next, change the value of the Aggregation Usage property from Default to Full in the unusual case that it is used in virtually every query you want to optimize. This is a rare case, and this change should be made only for attributes that have a relatively small number of members.
  • Infrequently used attributes—For attributes participating in natural hierarchies, you may want to change the Aggregation Usage property from Default to None if users rarely query them. Using this approach can help you reduce the aggregation space and stay within the desired level of 5-10 Unrestricted attributes per dimension. For example, certain attributes might be used only by a few advanced users, who may be willing to accept slightly slower performance. In this scenario, you are essentially forcing the aggregation design algorithm to spend time building only the aggregations that provide the most benefit to the majority of users.

Usage-Based Optimization

The Usage-Based Optimization Wizard reviews the queries in the query log (which you must set up beforehand, using the QueryLogConnectionString, QueryLogTableName, and QueryLogSampling server properties) and designs aggregations that cover up to the top 100 slowest queries. We recommend that you specify a 100% performance gain when you use the Usage-Based Optimization Wizard, so that the designed aggregations avoid hitting the partition directly.

After the aggregations are designed, you can add them to the existing design or completely replace the design. Be careful when adding them to the existing design: the two designs may contain aggregations that serve almost identical purposes but that are redundant when combined. Always compare the new aggregations with the old ones and ensure there are no near-duplicates (a comparison sketch follows).
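
A quick Python sketch of this inspection, assuming you have already extracted the subcube vectors from the two AggregationDesign definitions (the vectors below are illustrative):

```python
# Sketch: flag exact and near-duplicate aggregations when merging a
# usage-based design into an existing design.
existing = {"0000,0001,0000", "1000,0000,0000", "0000,0000,0001"}
usage_based = {"0000,0001,0000", "0000,0001,0001"}

def differing_bits(a: str, b: str) -> int:
    # Number of positions where the two subcube vectors differ.
    return sum(x != y for x, y in zip(a, b))

for new in usage_based:
    for old in existing:
        if new == old:
            print(f"duplicate: {new}")
        elif len(new) == len(old) and differing_bits(new, old) == 1:
            print(f"near-duplicate: {new} vs existing {old}")
```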

Note that aggregation designs have a costly metadata impact. Do not overdesign; keep the number of aggregation designs per measure group to a minimum.

When you are satisfied with your aggregations, you can copy the aggregation design to other partitions, using either SQL Server Management Studio or the design tools in SQL Server Data Tools.

Large Cube Aggregation Considerations

Small cubes might not even need aggregations, because aggregations are not built for partitions with fewer rows than the IndexBuildThreshold server property (default value 4096). Even if the cube partitions exceed the IndexBuildThreshold, aggregations that are correctly designed for smaller cubes might not be the correct ones for large cubes.

As cubes become larger, it becomes more important to design aggregations, and to design them correctly. As a general rule of thumb, MOLAP scan performance is roughly between 10 and 40 million rows per second per core, plus the I/O required for aggregating data (a back-of-envelope sketch follows).
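
To get a feel for what that rule of thumb implies, the following sketch estimates partition scan time; all numbers are illustrative.

```python
# Sketch: estimate the time to scan a partition at 10-40 million rows
# per second per core, ignoring I/O. All numbers are illustrative.
rows = 2_000_000_000   # fact rows in the partition
cores = 8

for rate in (10_000_000, 40_000_000):
    seconds = rows / (rate * cores)
    print(f"~{seconds:.0f} s at {rate:,} rows/sec/core")
```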

Larger cubes have more constraints, such as small processing windows or insufficient disk space, so it may be difficult to create all of your desired aggregations. You will need to weigh the tradeoffs carefully when designing aggregations.
