【SQLpassion Performance Tuning Training Plan】- Week 14: Inside Statistics

Welcome back to the SQLpassion Performance Tuning Training Plan. Today I want to talk in more detail how Statistics are represented internally by SQL Server. Imagine the following problem: the "Estimated Number of Rows" in an operator in the Execution Plan is 42, but you know that 42 isn't the correct answer to this query. But how can you interpret(vt. 说明;口译) the statistics to understand where the estimation is coming from? Let's talk about the Histogram(n. [统计] 直方图;柱状图) and the Density(n. 密度) Vector.

Histogram

 

Let's have a first look at the histogram. The idea behind the histogram is to store the data distribution of a column in a very efficient(adj. 有效率的;有能力的;生效的), compact(adj. 紧凑的,紧密的;简洁的) way. Every time that you create an index on your table (Clustered/Non-Clustered Index), under the hood(n. 头巾;覆盖;兜帽) SQL Server also creates a statistics object for you. And this statistics object describes the data distribution in that specific column. Imagine you have an order table, and a column named Country. You have some sales in the US, the UK, Austria, Spain, and Switzerland(n. 瑞士). Therefore you can visualize(vt. 形象,形象化;想像,设想) the data distribution in that column like in the following picture.

 

 

What you are seeing in that picture is a histogram - the data distribution is described with some bars: the higher the bar is, the more records you have for that specific column value. And the same concept and format is used by SQL Server. Now let's have a look at a more concrete(adj. 混凝土的;实在的,具体的;有形的) example. In the AdventureWorks2012 database you have the tableSalesOrderDetail and within that table you have the column ProductID. That column stores the ID of the product that was part of that specific sale.

 In addition there is also an index defined on that column - which means that SQL Server has also created a statistics object that describes the data distribution within that column. You can have a look at the statistics object by opening its properties window within SQL Server Management Studio, or you can use the command DBCC SHOW_STATISTICS to return the content of the statistics object in a tabular format:

As you can see from the previous screen shot, the command returns 3 different result sets:

  • General information about the statistics object
  • Density Vector
  • Histogram

If you want to have more information about how these parts of the statistics object are used for the cardinality estimation you can read my follow-up blog posting about it.

Density Vector

Let's have a look at this mysterious density vector. When you have a look at the Non-Clustered Index IX_SalesOrderDetail_ProductID, the index is defined only on the column ProductID. But in every Non-Clustered Index SQL Server also has to store the Clustered Key at least as a logical pointer in the leaf level of the index. When you have defined a Non-Unique Non-Clustered Index, the Clustered Key is also part of the navigation structure of the Non-Clustered Index. The Clustered Key on table SalesOrderDetail is a composite(n. 复合材料;合成物;) one, on the columns SalesOrderID and SalesOrderDetailID.

This means that our Non-Unique Non-Clustered Index consists(v. 由…构成;由…组成) in reality of the columns ProductIDSalesOrderID, and SalesOrderDetailID. Under the hoods you have a composite index key. This also means now that SQL Server has generated a density vector for the other columns, because only the first column (ProductID) is part of the histogram, which we have already seen in the earlier section. When you look at the output of theDBCC SHOW_STATISTICS command, the density vector is part of the second result set that is returned:

 

SQL Server stores here the selectivity, the density of the various column combinations.For example, we have an All Density Value for the column ProductID of 0,003759399. You can also calculate (and verify) this value at your own by dividing 1 by the number of distinct values in the column ProductID:

SELECT 1 / CAST(COUNT(DISTINCT ProductID) AS NUMERIC(18, 2))

FROM Sales.SalesOrderDetail

GO

The All Density Value for the column combinations ProductIDSalesOrderID andProductIDSalesOrderIDSalesOrderDetailID is 8,242867858585359018109580685312e-6. You can again verify this, by dividing 1 by the number of distinct value in the columns ProductIDSalesOrderID, or ProductID,SalesOrderIDSalesOrderDetailID. In our case these values are unique (because they are part of the Clustered Key), so you calculate 1 / 121.317 = 8,242867858585359018109580685312e-6.

More detailed information about how SQL Server uses this information during the cardinality estimation can be found in the same blog posting as mentioned earlier.

 Summary

Today you have seen how SQL Server internally structures a statistics object. The most important things here are the Histogram and the Density Vector, which are used all the time for the cardinality(n. 基数;) estimation. I hope that you have enjoyed this installment of theSQLpassion Performance Tuning Training Plan.

Next week we will continue by talking about limitations with the cardinality estimation in SQL Server, and how to solve them. See you very soon. 

posted @ 2017-08-18 11:09  FH1004322  阅读(154)  评论(0)    收藏  举报