A few weeks ago I was fortunate enough to attend a four day workshop on SQL Server Performance Tuning and Optimization. I really got a lot from the course and wanted to highlight a few of the things I found the most beneficial. In this post I’ll be focusing primarily on tables and indexes.
Data Architecture
I don’t want to spend much time discussing the underlying architecture but in order to fully understand how each decision affects database performance we need to establish a foundation. If you’d like a deeper discussion about SQL Server architectural concepts please refer to the additional resources section at the end of this post.
Pages
SQL Server stores its data in 8 kilobyte chunks called pages. Pages consist of a 96 byte header, a maximum of 8,060 bytes of row data, and a byte offset (slot) array indicating not only where each row starts but how many rows are stored within that page. SQL Server reads (and writes) data by the page so we should strive to maximize the number of rows per page to minimize the number of I/O operations.
Extents
Groups of eight pages are managed together in a 64 kilobyte construct called an extent. There are two types of extents: Mixed, and Uniform. Mixed extents can contain pages owned by multiple objects whereas the pages in uniform extents are all owned by the same object.
Generally speaking, new objects will be allocated with mixed extents and only start using uniform extents when they grow enough to fill an extent (consume 8 pages). This optimization allows SQL Server to make better use of disk space by not allocating an entire extent for objects that may only ever require a fraction of that space.
Indexes
Normally data is stored in an unorganized heap structure. Indexes enable more efficient data access by providing B-tree structures that SQL Server uses to quickly eliminate rows that don’t match the criteria. Each page in an index B-tree is referred to as an index node with the top node being the root node, the bottom level nodes being leaf nodes, and all nodes in between being intermediate nodes.
SQL Server supports two types of indexes:
- Clustered
- Non-clustered
Clustered Indexes
When a table has a clustered index it is no longer a heap and is instead a clustered table. A clustered index not only adds a B-tree structure on top of the table but also stores the data in index key order. The leaf pages of a clustered index are the table’s actual data pages. Microsoft recommends that most tables have a clustered index but there are a few important things to consider when creating one.
There can be only one clustered index per table.
Because a clustered index defines the storage order of the table rows only one clustered index is allowed per table. By default the primary key index will be a clustered index.
Key values should be narrow.
The key value will be used in other indexes to help locate a particular row so having wide key values will require more storage space not only in the clustered index but also in any indexes based upon it. Generally speaking, GUIDs are not good candidates for a clustered index for this reason.
Key values should be unique.
SQL Server requires clustered index keys to be unique but allows creating them on non-unique data. The way it accomplishes this is by automatically adding a four byte “uniqueifier” to the data. This uniqueifier widens the key value and increases the index storage requirements. Creating the clustered index against unique data eliminates the need for this added overhead.
Key values should be sequential.
Remember that the rows in a table with a clustered index are stored in index key order. If a new row is inserted into a table with a clustered index SQL Server must do additional work to reorganize the data to make room for the new value. If the key value is sequential SQL Server can generally just append the new data to the table. This is another reason why GUIDs are bad candidates for a clustered index.
Non-clustered Indexes
Non-clustered indexes are independent B-tree structures that ultimately point to the underlying data records. Instead of storing the data directly at the leaf level the leaf pages of a non-clustered index include either the clustered index key for clustered tables or an 8 byte Row Identifier (RID) value containing the file id, page id, and slot id for heaps.
Unlike with clustered indexes, a table is not restricted to a single non-clustered index but there are still some guidelines for making non-clustered indexes effectively:
Create non-clustered indexes for common queries that only return “a few” rows.
When a query returns a large number of rows the optimizer can choose to scan the underlying heap or B-tree (HoBT) rather than using the non-clustered index, particularly when the index doesn’t contain all of the information needed to satisfy the query.
Consider using covering indexes.
Covering indexes were introduced in SQL Server 2005 with the addition of included columns. The idea behind a covering index is to give the index itself enough information to satisfy a query without going back to the underlying table or bloating the index key. Included columns enable this by copying the data from those columns into the leaf pages of the index.
Consider using filtered indexes.
In some cases it may not be necessary or efficient to create a full table index. In SQL Server 2008 we can create indexes that focus on a subset of the data by including filter criteria on the index definition.
Filtered indexes can boost performance because the index is smaller and also uses filtered statistics. Additionally, filtered indexes only need to be maintained when rows that meet the filter criteria are changed.
Recommendations
In addition to the specific tips above there are some recommendations for optimizing indexes regardless of type.
Column order is important.
A goal of any index should be to eliminate non-matching rows as quickly as possible. One way to achieve that goal is by adding columns to the index in the following order:
- Place equality columns before inequality columns.
- The most selective equality columns should be placed first.
- The most selective inequality column should appear after the least selective equality column.
By placing the most selective equality columns first we restrict the query to a single value and therefore reduce the size of the result set and therefore also reduce number of additional reads needed to satisfy the rest of the query.
Commonly filtered joined columns should be indexed.
SQL Server uses the index key columns to navigate the B-tree so we can generally improve performance by creating indexes against the columns we know we’ll be searching frequently.
Balance index maintenance with table update frequency.
Indexes need to be maintained as the underlying data changes. Generally speaking a relatively static table can benefit from heavy indexing since the data seldom changes but extra care should be taken when indexing tables with frequent data modifications.
Small tables may not benefit from indexes.
For smaller tables such as those used primarily lookups the query optimizer may determine that a table scan is more efficient than using an index. In these cases the index will still require maintenance and storage space while providing little to no benefit.
Avoid adding unnecessary columns.
Similar to keeping key values narrow it is important to avoid adding unnecessary columns to the index key. Every column will increase the size of the index key and require additional index maintenance. Even though included columns for a covering index will only be copied to the leaf pages of a non-clustered index they will still consume additional space and also require maintenance so unnecessary columns should not be included in an index either.
Maximize the number of updates performed with a single query.
SQL Server can optimize index maintenance by when multiple rows are inserted or modified at once. For this reason statements that modify multiple rows should be used whenever possible.
Index Options
There are a few options that can affect index performance. Notable among these are the ON, MAXDOP, and FILLFACTOR.
ON
The ON option allows us to specify which partition the index will be mapped to. When this option is omitted the index will be created in the same partition as the underlying table.
MAXDOP
SQL Server can take advantage of multiple cores on a system when constructing an index but this can become a resource hog. If we want to restrict the number of cores used by index creation we can specify the maximum degree of parallelism by including the MAXDOP option on the CREATE INDEX statement.
FILLFACTOR
Another way we can affect index performance is by specifying a FILLFACTOR. FILLFACTOR is a percentage that indicates how much data will be loaded into each page with the index is built or rebuilt. Like most things in SQL Server, FILLFACTOR is a balancing act. Lower values will delay page splits whereas a higher values will decrease the number of I/O operations needed to get the data.
Data Access
How SQL Server access data depends on which types of indexes are present and it is important to understand the implications of each type.
Heaps
As previously mentioned, a table without a clustered index is a heap. Without indexes SQL Server has to perform a full table scan, that is, read every row in the table to find the requested rows even if there will only be one match because it doesn’t know enough about the data to do anything else.
When performing a table scan SQL Server will read the first Index Allocation Map (IAM) page for the heap then follow the single page allocation pointers (maximum of 8 before reaching the uniform extents) then follow the pointers through each range of extents.
Clustered Indexes
SQL Server can be more efficient with queries over clustered tables because the clustered index provides it with enough information to choose between a clustered index seek or clustered index scan.
Clustered Index Seek
A clustered index seek involves locating the root node and reading the index keys until it finds the correct value. Once the value is found it follows the file/page id pointers to locate the matching rows.
Clustered Index Scan
A clustered index scan is similar to a heap scan except that the clustered index has intermediate pages that must be navigated before reaching the data.
Non-clustered Indexes
Querying a non-clustered index is generally more complicated than querying a heap or clustered index directly because non-clustered indexes merely point to the underlying data through RIDs or clustered index key values depending on the table structure. In either case it may be possible to use a covering index to eliminate the additional I/O required by the lookups.
Non-clustered Indexes with a Heap
When a non-clustered index is built upon a heap its leaf pages contain a RID value (described above) that identifies where the corresponding row is located in the heap. An advantage of the RID is that it doesn’t usually change because the location of the underlying row usually doesn’t change. As a result accessing rows through a non-clustered index on a heap is generally quite efficient because SQL Server can go straight to the row.
Non-clustered Indexes with a Clustered Index
Non-clustered indexes built upon a clustered table cannot use the same RID mechanism as those built upon a heap because the rows in a clustered table can move. As such, the leaf pages in the non-clustered index will point to the row with the clustered index key (and uniqueifier if needed) which means that not only will we traverse the non-clustered index’s B-tree but we’ll also traverse the clustered index’s B-tree.
Index DMVs
SQL Server 2005 introduced several Dynamic Management Views (DMVs) and Functions (DMFs) for monitoring index usage and helping to identify missing indexes.
Monitoring Index Usage
These DMVs provide a glimpse into how indexes have been used since the last time SQL Server was restarted.
- sys.dm_db_index_operational_stats – identifies contention primarily in the areas of I/O, latching (lightweight locking), and locking.
- sys.dm_db_index_usage_stats – is similar to dm_db_index_operational_stats in terms of the data that is tracked but is maintained when index references rather than execution behavior.
Identifying Missing Indexes
When there are no indexes on frequently queried columns we run the risk of increased locking, increased memory pressure, increased table scans due to inefficient query plans. SQL Server also provides some DMVs for helping identify missing indexes but they only track a maximum of 500 index groups. Once this threshold has been reached SQL Server will occasionally replace existing missing index data when a new index would have a greater impact. Like other DMVs, this information is also reset at each server restart so the data it reports should only be used as a guideline.
Some of the more notable DMVs are:
- sys.dm_db_missing_index_group_stats – returns information about groups of missing indexes including number of scans, number of seeks, and percentage of impact if the index were added.
- sys.dm_db_missing_index_details – returns detailed information about missing indexes such as equality, inequality, and included columns.
Microsoft recommends periodically backing up the data from these DMVs so it can be safely reviewed after a server restart.
Additional Resources
Pages and Extents Architecture
http://msdn.microsoft.com/en-us/library/cc280360.aspx
Tables and Index Data Structures Architecture
http://msdn.microsoft.com/en-us/library/ms180978.aspx
Index Related Dynamic Management Views and Functions
http://msdn.microsoft.com/en-us/library/ms187974.aspx