SQL Server data structures(转)

SQL Server data structures

Pages

In SQL Server, data is organized in pages. A page has a fixed size (8 KB) . Each page contains records. The number of records that can be stored in a page depends on the size of the records. The operation of reading data from a page is called a logical IO. The smaller the size of a record is, the more records can be read with the same number of logical IOs.

A page can be identified by a 6 bytes page pointer: 2 bytes for the file ID and 4 bytes for the page number. A row in a page can be identified by an 8 bytes row ID: 2 bytes for the file ID, 4 bytes for the page number and 2 bytes for the row number.

There are two types of non system pages: data pages, composed of data rows, and index pages, composed of index rows.

Structure of a table: heaps and clustered tables

There are only two types of tables in SQL Server: heaps and clustered tables.

Heaps

Heaps are tables that have no clustered index. They are entirely composed of data pages. The data rows are not stored in any particular order, and there is no particular order to the sequence of the data pages. The data pages are not linked in a linked list.


A Heap


Clustered tables

A clustered table is a table whose rows are kept in sorted order. The clustered index is the structure that stores and maintains the rows of a clustered table in sorted order. The rows are ordered using the clustering key, which is defined by one or more columns of the table. Like in a heap, the data rows are stored in data pages, and the clustered index additionally uses index pages to navigate in the data.

SQL Server organizes indexes as trees, with one page at the root level, multiple pages at the leaf level, and zero or more levels in between.

  • Leaf level:
    The leaf level is the data of the table. It is composed of data pages. They store all the columns of the table for every row of the table.
  • Non-leaf levels:
    The non-leaf levels are composed of index pages.


At every non-leaf level, each index row corresponds to a page in the next level. Each index row in every index page contains two things: the index key value, which is the first key value of the corresponding page in the next level of the index, and a 6 bytes page pointer to that corresponding page.

At each level (including leaf level), the pages have a pointer to the next and the previous page in the same level, so that each level forms a doubly-linked list. At each level, the pages and the rows within a page are ordered using the clustering key.

 


A clustered index.
Blue pages are index pages and green pages are data pages.

Nonclustered indexes

Nonclustered indexes are always associated with a table (it can be either a heap of a clustered table), but they have a structure completely separate from the data rows. Nonclustered indexes only contain a partial copy of the data of the associated table, sorted in a different order than the actual data. The rows are ordered using the nonclustered key, which is defined by one or more columns of the associated table. There can be multiple nonclustered indexes per table.

Nonclustered indexes have the same B-tree structure as clustered indexes, however the leaf level is different.

Leaf Level:

  

  • In a nonclustered index, the leaf level is composed of index pages instead of data pages. Each index row in the leaf level of a nonclustered index contains a nonclustered key value along with a row locator that points to the corresponding data row in the actual table structure (heap or clustered table). (如果表上已经有一个集群索引,非集群索引的页页面不是包含直接指向数据的指针,而是指向集群索引键值的指针。  为什么查找单个值非要搜索两个索引呢? 答案是记录增加减少时,当分页操作发生时,由于非集群索引使用的是集群索引的键值,而不是实际数据,非集群索引不需要重建。)
  • If the associated table is a heap, the row locator is an 8 bytes row ID . If the table is a clustered table, the row locator is the clustering key identifying the row.

Non-leaf levels:

The non-leaf levels of a nonclustered index are used for navigation identically as in a clustered index.

At each level of the nonclustered index, the pages and the rows within a page are ordered using the nonclustered key.


 

 

A nonclustered index for the clustered table above.



A nonclustered index for the heap above.

 

 

Covered columns

The columns stored at the leaf of an index are called the columns covered by this index.

A clustered index always covers all the columns.

A nonclustered index covers all the columns of its non clustering key. If it is a nonclustered index on a clustered table, it also covers the columns of the clustering key of the table, since they constitute the row locators.

Clustered index & Non-clustered index Comparison

 

Clustered Index

Non-Clustered Index

 

Only one for a table

Up to 249 for a table

 

Physically sort record according to clustered index.

a nonclustered index contains a nonclustered key value along with a row locator that points to the corresponding data row in the actual table structure (heap or clustered table).

 

To query a range of values.

To query a given value

 

Indexing on 低选择性(重复多)的列

Indexing on 高选择性(重复少)的列

Hash Index

ALTER TABLE Products

ADD cs_Pname AS checksum(ProductName)

CREATE INDEX Pname_index ON Products (cs_Pname)

The checksum index can be used as a hash index, particularly to improve indexing speed when the column to be indexed is a long character column. The checksum index can be used for equality searches.

SELECT *

FROM Products

WHERE checksum(N'Vegie-spread') = cs_Pname

AND ProductName = N'Vegie-spread'

Creating the index on the computed column materializes the checksum column, and any changes to the ProductName value will be propagated to the checksum column. Alternatively, an index could be built directly on the column indexed. However, if the key values are long, a regular index is not likely to perform as well as a checksum index.

 

Reference

Appendix:


Persisted
You may have noticed that we also used the property "Persisted" for our computed column. This property for computed columns has been introduced in SQL Server 2005 and onwards. It is important for any computed column, because many additional features depend on it. To be able to make a computed column as Persisted it has to be deterministic.

Here are a few rules:

  • If Persisted property is off then calculated column will be just a virtual column. No data for this column will be stored on disk and values will be calculated every time when referenced in a script. If this property is set active then data of computed column will be stored on disk.
  • Any update in referenced column will be synchronized automatically in computed column if it is Persisted.
  • Along with some other conditions Persisted is required to create an index on the computed column.

Nullability
Nullibility for a computed column value will be determined by the database engine itself. The result of a non-nullable referenced column may be NULL in certain conditions to avoid possible overflows or underflows. You can provide an alternate value for NULL using the ISNULL(check_expression, constant), if required.

Execution Plan & IO Statistics

What Happens When a Query is Submitted?

When you submit a query to a SQL Server database, a number of processes on the server go to work on that query.

These processes are run for each and every query submitted to the system. While there are lots of different actions occurring simultaneously within SQL Server, we're going to focus on the processes around T-SQL. They break down roughly into two stages:

  • Processes that occur in the relational engine
  • Processes that occur in the storage engine.

In the relational engine the query is parsed and then processed by the Query Optimizer , which generates an execution plan. The plan is sent (in a binary format) to the storage engine, which it then uses to retrieve or update the underlying data. The storage engine is where processes such as locking, index maintenance and transactions occur.

 

Figure: Example MySQL Relational and Storage Engine

What is SqlCommand.Prepare() for ?

It creates a compiled version of the command, and an execution plan, on the database. If the same query is going to be called several times in one session with different parameters, it can increase overall performance.

Index Scan/Table Scan 则是从左到右,把整个 B 树遍历一遍。时间复杂度 O(N)
Index Seek 是查找从 B 树的根节点开始,一级一级找到目标行。时间复杂度 O(log N)

假设唯一的目标行位于索引树最右的叶节点上(假设是非聚集索引,树深度 2,叶节点占用 k页物理存储)。
index seek引起的 IO是 4=2*2 (两次索引查找),而 index scan引起的 IO是 k,性能差别巨大。

Identify Missing Indexes Using SQL Server DMVs

In SQL Server 2005, Microsoft introduced Dynamic Management Views (DMVs). The role of DMVs is to return SQL Server state information; which can be used by database administrators and database developers to monitor the health of an SQL Server Instance and identify potential performance issues. DMVs reflect all the activities on the instance of SQL Server since the last restart.

The Dynamic Management Views (DMV) which can be used to identify missing indexes on a table are:

  • sys.dm_db_missing_index_details
  • sys.dm_db_missing_index_group_stats
  • sys.dm_db_missing_index_groups
  • sys.dm_db_missing_index_columns

Tool:

SELECT statement AS [database.scheme.table],

column_id , column_name, column_usage,

migs. user_seeks, migs. user_scans,

migs. last_user_seek, migs. avg_total_user_cost,

migs. avg_user_impact

FROM sys . dm_db_missing_index_details AS mid

CROSS APPLY sys . dm_db_missing_index_columns ( mid. index_handle)

INNER JOIN sys . dm_db_missing_index_groups AS mig

ON mig. index_handle = mid. index_handle

INNER JOIN sys . dm_db_missing_index_group_stats   AS migs

ON mig. index_group_handle= migs. group_handle

ORDER BY migs. avg_user_impact DESC

--ORDER BY mig.index_group_handle, mig.index_handle, column_id

Disadvantage of having too many indexes on a table

  • Insert, Update and Delete operations will become very slow if there are many indexes created on a table. This occurs when the Insert, Update or a Delete operation against a table results in all the indexes being updated, reducing query performance.
  • Indexes are stored on disk and the amount of space required by the index depends on the size of database table, and the number and type of columns used within the index definition.
  • A greater number of indexes result in more disk space being required to store them.

 

 

Limitations of Missing Index Feature

  • A DMV can store information from a maximum of 500 missing indexes.
  • Unable to provide recommendations for clustered, indexed views and partitioning.
  • Once the SQL Server is restarted all the information related to missing indexes is lost. To keep the information for later use, the DBA needs to backup all the data available within all the missing index DMV prior to the restart of SQL Server.

Conclusion


Database Administrators need to analyse the impact of an indexcreated for Insert, Update and Delete operationsbefore accepting recommendations given by the missing index DMVs.

Suggestions on SQL Query

查询条件中不要包含运算
           

这些运算包括字符串连接,如:

select * from Users where UserName + ‘pig’ = ‘ 张三 pig’

 

通配符在前面的 Like运算,如:

select * from tb1 where col4 like ‘%aa’

 

使用其他用户自定义函数、系统内置函数、标量函数等等(如:

select * from UserLog where datepart(dd, LogTime) = 3



            SQLServer在处理以上语句时,一样没办法估算开销。最终结果当然是 clustered index scan或者 table scan了。

查询条件中不要包含同一张表内不同列之间的运算
           

所 谓的 “运算 ”包括加减乘除或通过一些 function:: 如:

select * from tb where col1 – col2 = 1997

也包括比较运算(如:

select * from tb where col1 > col2

 

这种情况下, SQLServer一样没办法估算开销。不论 col1、 col2上都有索引还是创建了 col1、 col2上的覆盖索引还是创建了 col1 include col2的索引。

            但是这种查询有解决办法,可以在表上多创建一个计算字段,其值设置为你的 “运算 ”结果,再在该字段上创建一个索引。

posted on 2011-03-09 13:37  千羽  阅读(335)  评论(0编辑  收藏  举报

导航