Friday, October 21, 2016

Short notes on Statistics

  • Forcing a recompile is the first lever you can use to make the optimizer compile the code again.
  • Compilation is the most important stage.
  • Cost-based optimizer == needs to evaluate a query reasonably quickly.
  • There are many ways to write a query; do not focus on the form first. Focus on the result.
  • WHERE msal * 12 > literal -> Bad (not sargable)
  • WHERE msal > literal / 12 -> Good (sargable; see the first sketch after this list)
  • All about statistics / estimation -> Query performance
  • In most cases the issue is not out-of-date stats or indexes; it could be parameter sniffing.
  • EXEC <proc name> <parameters> -> uses the existing plan
  • EXEC <proc name> <parameters> WITH RECOMPILE -> generates a new plan
  • Using the above two, you can identify parameter sensitivity (sniffing); see the second sketch after this list.
  • Updating stats invalidates the plan.
  • Estimates vs actuals -> if they differ, it is not always a stats problem; it may be parameter sniffing.
  • Selectivity
    • Optimizer loves highly selective predicates
    • Low no.of rows -> high selectivity
    • High no.of rows -> low selectivity
  • Statistics -> Summarized info about the data distribution of table columns
  • DBCC AUTOPILOT -> Undocumented
  • Hypothetical Indexes -> just the index structure.
    • What if analysis using AUTOPILOT with >= SQL Server 2008
  • sp_helpstats '<table name>', 'all'
  • Histogram -> 200 steps + 1 row for the null if the column allows null
    • SQL Server 7.0 had 300 steps
    • A large histogram increases compilation time, because the histogram is not like an index
    • EQ_ROWS – Equal rows for the index key
    • Rows * All density = Avg no.of rows returned for that column or combination of columns
    • If the table is huge, the data distribution captured in the histogram is not very accurate
    • Step compression -> when building the histogram, if adjacent steps have approximately similar values, the algorithm compresses them into a single step
    • RANGE_ROWS -> values between two HI_KEYs and excluding the HI_KEYs at both ends
  • EXEC sp_autostats '<table name>'
  • sys.stats -> shows all the stats, including index stats
  • sys.dm_db_stats_properties -> gives a lot of detail that you can use to update stats programmatically and more effectively (see the last sketch after this list)
  • Histogram -> direct hit (the literal matches a histogram step exactly)
  • Stats are updated on index rebuild but not on index reorg.
  • Entire stats structure stored in db as a BLOB (Header, Density Vector and Histogram)
  • A partitioned table uses table-level stats: a single histogram of up to 200 steps covering all partitions, not 200 per partition
  • Online partition-level index rebuild arrived in SQL Server 2014, but partitioning was introduced in SQL Server 2005, so it took Microsoft 9 years to deliver online partition-level index rebuild
  • Tuple cardinality -> used to estimate distinct cardinality
  • Optimization | Compilation | Execution
    • Local variable values are not known at optimization time
  • Parameters and literals can be sniffed -> uses histogram
  • Variables cannot be sniffed -> density vector
  • UPDATE STATISTICS with any sample % will not be parallelized, but FULLSCAN can be parallelized.
  • Dynamic auto update threshold
    • For a large table to reach a 20% change, you have to wait a long time before stats get updated.
  • SQL Server does not understand the correlation between columns
  • Calculation direction in query plan is from right to left.
  • The problem is always with monster tables
  • Is 200 steps enough for such tables?
  • Even if the table is partitioned, the histogram is still at table level
  • sp_recompile on a table is an expensive operation; it needs a Sch-M lock on the table
  • Filtered stats – consider creating filtered stats, even on a daily basis, to tackle estimate problems caused by skewed data and get better execution plans (see the last sketch after this list)
  • QUERYTRACEON (2353) -> Additional info.
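
A minimal sketch of the bad-vs-good predicate bullets above, using a hypothetical dbo.emp table with an msal (monthly salary) column; keeping the column bare lets the optimizer use the histogram and an index seek.

-- Hypothetical table and column, for illustration only.
-- Not sargable: the expression wraps the column, so the histogram and any index on msal are hard to use.
SELECT emp_id, msal
FROM dbo.emp
WHERE msal * 12 > 120000;

-- Sargable rewrite: move the arithmetic to the literal side.
SELECT emp_id, msal
FROM dbo.emp
WHERE msal > 120000 / 12;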
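
A sketch of the EXEC vs EXEC ... WITH RECOMPILE comparison for spotting parameter sensitivity; the procedure name and parameter are hypothetical.

-- Run with whatever plan is already cached (possibly sniffed for a different value).
EXEC dbo.GetOrdersByCustomer @CustomerId = 42;

-- Force a fresh plan optimized for this specific value.
EXEC dbo.GetOrdersByCustomer @CustomerId = 42 WITH RECOMPILE;

-- If the plans or durations of the two runs differ significantly,
-- the procedure is parameter sensitive (parameter sniffing).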
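
A sketch of using sys.dm_db_stats_properties and filtered stats, referenced from the bullets above; the table, column and statistic names are hypothetical.

-- How stale is each statistic on the table?
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.emp');

-- Filtered statistics over a skewed slice of the data.
CREATE STATISTICS st_emp_recent
ON dbo.emp (hire_date)
WHERE hire_date >= '20160101'
WITH FULLSCAN;

-- Update a specific statistic with a full scan (the full scan can be parallelized).
UPDATE STATISTICS dbo.emp st_emp_recent WITH FULLSCAN;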

Monday, August 29, 2016

PowerShell supports Linux

Microsoft this week announced that PowerShell is now open sourced and available on Linux, the start of a development process that aims to enable users to manage any platform from anywhere, on any device.

R Integration in SQL Server 2016

R is a popular programming language used in data science to analyse data. R has strong capabilities for data analytics, which needs special data structures such as arrays, vectors and matrices. Traditional RDBMS products do not have capabilities for data analytics because they lack the data structures stated above. 

The latest release of MS SQL Server 2016 has R integration so that data processing and data analytics can be done in the same data management platform. 

This blog post highlights R integration in SQL Server 2016 by using a simple example. 

Let's assume you have a requirement to generate a sequence of numbers. There are various ways to do this in SQL Server using T-SQL. The typical solution is a WHILE loop with a few control variables, as sketched below;
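
A minimal sketch of the WHILE-loop approach (the original post showed this as an image; the variable names here are illustrative only):

-- Generate the numbers 1 to 10 with a loop and a table variable.
DECLARE @i INT = 1, @max INT = 10;
DECLARE @numbers TABLE (n INT);

WHILE @i <= @max
BEGIN
    INSERT INTO @numbers (n) VALUES (@i);
    SET @i += 1;
END;

SELECT n FROM @numbers;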

There is an interesting MSDN forum question that has many alternative ways to achieve this. It is worth reading the discussion. 

The same thing can be accomplished using R in a very simple way. R has a function named seq, and the code snippet below uses seq to generate a sequence of numbers from 1 to 10. You can use RStudio to access the full R functionality. 

> seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10

Now, let's see how we can integrate the above simple R function with SQL Server 2016. First you need to enable R integration in SQL Server 2016. 
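
A sketch of the enablement step (the original post showed this as an image): external scripts are turned on at the instance level and the configuration is applied.

-- Allow external scripts (R) to run on this instance.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;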

I had to restart the SQL Server service after executing the above code, which should not be necessary. Below is the version of SQL Server that I'm using for this demo. 

Microsoft SQL Server 2016 (RTM) - 13.0.1601.5 (X64)

I had to restart the SQL Server Launchpad service too, which is a new service introduced in MS SQL Server 2016. 

Then you can use the sp_execute_external_script programming interface to call an external script written in R or another language recognized by SQL Server.
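
A minimal sketch of the call (the original post showed this as an image); OutputDataSet is the data frame an R script returns to SQL Server.

-- Run the R seq() function and return the result as a SQL Server result set.
EXEC sp_execute_external_script
    @language = N'R',
    @script   = N'OutputDataSet <- data.frame(number = seq(1, 10));'
WITH RESULT SETS ((number INT));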

Below is the output of the above code. 

There is a lot of complex data analytics work you can do in R, and the same can now be done inside SQL Server by combining it with the data stored in SQL Server. 


Thursday, August 25, 2016

SQL Server on Linux

It is certain that many are excited about and welcome Microsoft's official announcement that SQL Server is going to support the Linux platform. Especially the open source community must be thrilled with the news. 

However, there is still not much information available to the public about what SQL Server on Linux will look like. As per Microsoft, the initial release will be available in mid-2017. 

Announcing SQL Server on Linux


If we look at the history of SQL Server, the very first released versions ran on an operating system known as OS/2, which was jointly developed by IBM and Microsoft. As per Wikipedia, the kernel type of OS/2 is listed as hybrid. That was the 1988/89 period. 

The SQL Server versions 1.0, 1.1 and 4.2 were all based on the OS/2 platform. Microsoft separately developed version 4.2 to support its first version of the Windows NT OS, but with the same features as the version that ran on OS/2.

Historically, SQL Server traces its roots to Sybase SQL Server on UNIX, and it was later ported to run only on Windows after the agreement with Sybase ended.  

Some 25 years later, Microsoft has decided to make SQL Server available on the Linux platform, which is a good move. 

source: Microsoft SQL Server - History


In my opinion, the main challenge would be to develop an abstraction layer to support the Linux platform. The architectures of Windows and Linux are distinctly different. Both have preemptive scheduling, but the process models differ, and Linux has a very distinctive implementation of threads. SQL Server's process/thread model on Windows is single process, multi-threaded: you see only a single process for a SQL Server instance, while internally a thread pool services client requests. Consequently, the thread is the unit of work at the OS level for accomplishing SQL Server requests on Windows. 

Conversely, Linux is built around processes. Threads do exist, but they are implemented as lightweight processes (created via the clone() system call) rather than as the distinct kernel objects they are in Windows. A new process is created from its parent with the fork() system call and eventually exits or is killed. In a nutshell, in Linux most work is expressed in terms of processes. 

Bridging this architectural difference is probably the most challenging part for SQL Server. I'm no OS expert, but thinking from a software engineering point of view, this is the image I get of the abstraction layer. 

SQL Server has its own user-mode operating system known as SQLOS, a non-preemptive (cooperative) scheduler that sits between SQL Server and the Windows OS to better serve SQL Server's special needs. SQLOS was introduced in SQL Server 2005, code name Yukon. 

To support the Linux platform, the SQLOS component may need a lot of additions, or there could be a separate SQLOS-like component for Linux. We do not have that information yet. 

SQLOS – unleashed

What features will be in SQL Server on Linux?

It is unclear what features will be available in the SQL Server version that runs on Linux. To provide the full feature set of SQL Server 2016 would be a real challenge. Another question is whether it will include the BI tools. The announcement blog clearly states "We are bringing the core relational database capabilities to preview today". What are the core relational database capabilities? I presume the following capabilities definitely need to be addressed: scheduling, memory management, I/O, exception handling, ACID support, etc.

How about high availability options like clustering and AlwaysOn? I predict there will be some sort of high availability option in the first release on Linux.

How about the page size, is it 8K or different? By default the Linux page size is 4K (4096 bytes), whereas SQL Server on Windows uses a non-configurable page size of 8K (8192 bytes). Consequently, SQL Server on Linux may have to deal with the Linux default page size of 4K. Unlike on Windows, the Linux page size is configurable, but it is not certain how feasible that is or what the consequences of changing it would be. I just checked the page size in Mac OS and it is 4K too. See below figure. 

Will there be an In-Memory OLTP engine in SQL Server on Linux? In-Memory OLTP, code name Hekaton, was first introduced in SQL Server 2014 and was further developed in SQL Server 2016. In-Memory OLTP is an enormously important feature for future RDBMS products. However, it is not yet clear whether the first release of SQL Server on Linux will have this feature or not. 

What does this mean for DBAs?

The skill set required of future SQL Server DBAs is going to expand. They will need to learn Linux OS commands as well as Python and PowerShell for scripting tasks. In the future, companies will most probably have a mix of Linux- and Windows-based SQL Server installations, and licensing cost will decide which platform's version of SQL Server gets the bigger share. In a nutshell, future SQL Server DBAs will be cross-platform DBAs. It's challenging, isn't it?

How about the Certifications?

Most likely, a new certification path will emerge to cater to the new trends and skills around SQL Server. To become a successful DBA, you might need to be qualified in SQL Server on both Linux and Windows, as well as on cloud platforms. 

How will this impact other RDBMS products?

My sense is that this will directly impact MySQL. MS SQL Server is a proven enterprise-level data platform. Once there is a chance to run it on a Linux OS, it is highly unlikely someone will choose MySQL over SQL Server, given that the SQL Server Developer edition is free. 

I'm really excited and looking forward to getting some hands-on time with SQL Server on Linux in the very near future.


Thursday, August 18, 2016

TempDB Configuration

Recently we had some discussion around best practices for tempdb configuration. The entire discussion was based on SQL Server 2012 EE. This blog post highlights some salient points of tempdb configuration and includes links to additional resources as well. 
  • How many files (data / log) do we need to create? Is it based on logical cores or physical CPUs?
Any SQL Server database requires two files (1 data file and 1 log file) at minimum. SQL Server will allow you to add multiple log files, but there is no benefit in doing so because SQL Server writes to the log file in a serial fashion. As a result, we can safely state that you do not need multiple log files for tempdb. 

What about the data files? SQL Server allows you to add multiple data files too, and there are benefits in doing so. Since our discussion is around tempdb, we can add multiple data files to address the very common phenomenon known as tempdb contention. The whole objective of adding multiple data files is to improve the concurrency of tempdb activities, in other words to improve the throughput of tempdb. Beginning with SQL Server 2005, tempdb has tended to become a bottleneck as SQL Server keeps adding new features that utilize tempdb heavily, and as a result contention around tempdb has increased. However, the history of tempdb contention goes back to SQL Server 7.0, since TF 1118 also applies to SQL Server 7.0. TF 1118 is used to reduce tempdb contention on allocation pages (PFS, GAM and SGAM).

When deciding the no.of data files for tempdb, a general rule of thumb is to have the no.of data files equal to the no.of logical cores, up to 8; meaning if the machine has 16 logical cores, you add up to 8 data files. I've seen servers with 16 cores where tempdb has 16 data files too. That configuration is not appropriate according to the general rule of thumb. 

Modern CPUs have a large number of cores. For example, the Intel Xeon E7 processor family has a processor with 24 cores. On an 8-socket server that gives 192 (24*8) cores, and with hyper-threading enabled it becomes 384 (192*2) logical cores. So on such a server, are we going to configure 384 data files for tempdb? That is an extremely bad idea, because with that many data files SQL Server has the additional overhead of managing them. 

Back to the no.of files discussion: you can start with 8 data files (if the no.of cores is 8 or more) and then increase the no.of data files in multiples of 4 (up to the no.of logical processors of the machine), and only if you still see contention around tempdb. 
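
A minimal sketch of adding an extra tempdb data file; the logical name, path and sizes below are illustrative only.

-- Add one more tempdb data file (name, path and sizes are hypothetical).
ALTER DATABASE tempdb
ADD FILE
(
    NAME = tempdev5,
    FILENAME = 'T:\TempDB\tempdev5.ndf',
    SIZE = 4096MB,
    FILEGROWTH = 1023MB
);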

  • How to decide the initial size of the tempdb?
There is no universal formula for this, and therefore it is non-trivial. You need to identify the features you're planning to use that potentially consume tempdb space, e.g. temp table and table variable usage, sorting, version store, etc. Then use your best judgement to decide the initial size, monitor the tempdb usage by executing a workload in a test environment, and decide whether you need to resize it or not. 
  • Do we need to enable autogrowth or not? If we do, is it for all the files or just one file?
The autogrowth setting should be reserved for emergency situations (for both data and log files) and should not be the normal mechanism for growing a data or log file. Always size the data files and the log file manually, and set autogrowth so that in an emergency, if tempdb needs additional space, SQL Server can grow the files without impacting current server activity. If you have multiple data files, enable autogrowth for all of them. 
  • The size of autogrowth?
The Microsoft RAP tool has a rule that detects non-standard database autogrowth sizes. The rule is as below;

"Database files have been identified that have the next Auto Growth increment of 1GB or above, and instant file Initialization is not enabled"

When the tool detects this, it is classified as a high-severity risk. Basically, it does not recommend autogrowth sizes of 1GB or above, so I would keep the autogrowth size at 1023 MB (yes, not 1024 MB). This is especially important for the log file, because the no.of VLFs created varies with the autogrowth chunk size. The VLF logic is as below (this applies to SQL Server 2016 as well);

chunks of 64MB or less = 4 VLFs

chunks larger than 64MB and up to 1GB = 8 VLFs

chunks larger than 1GB = 16 VLFs

Note: it is the growth chunk size, not the total size of the t-log, that determines the no.of VLFs added. 
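
A sketch of setting a 1023 MB growth increment on the tempdb log file; templog is the default logical file name and may differ on your instance.

-- Set a 1023MB growth increment on the tempdb log file.
ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILEGROWTH = 1023MB);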
  • Do we need to keep all data files and the log file on the same drive?
It is better to keep them on separate drives if you have the luxury of multiple drives (to increase the I/O bandwidth). The limitation comes in a clustered environment, where the available drive letters are limited. (Note that you can also place tempdb on SSD.)

For example, in a 4-node cluster with 3 SQL Server instances, we had already utilized 20 drive letters (including the C drive for the OS). Those drive letters were for system databases like master and msdb, data files, log files, local MSDTC, tempdb, backups and the quorum. 
  • Changes in SQL Server 2016
Trace flag (TF) 1117 is used to grow all the data files in a filegroup when one of them reaches the autogrowth threshold. Beginning with SQL Server 2016, this behaviour is controlled by two options added to the ALTER DATABASE statement: 1. AUTOGROW_SINGLE_FILE and 2. AUTOGROW_ALL_FILES. However, this applies to user databases, not tempdb. 
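
A sketch of the new per-filegroup option on a hypothetical user database named SalesDB; AUTOGROW_ALL_FILES corresponds to the TF 1117 behaviour, and MIXED_PAGE_ALLOCATION OFF corresponds to the TF 1118 behaviour.

-- Grow all files in the PRIMARY filegroup together (TF 1117 behaviour).
ALTER DATABASE SalesDB MODIFY FILEGROUP [PRIMARY] AUTOGROW_ALL_FILES;

-- Use uniform extents only, no mixed pages (TF 1118 behaviour).
ALTER DATABASE SalesDB SET MIXED_PAGE_ALLOCATION OFF;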

Starting with SQL Server 2016, the behaviour of TF 1117 and TF 1118 is enabled by default for tempdb, which makes DBAs' lives much easier. 

Just to make it clear, see the below table; 

There is a lot to talk about and learn regarding tempdb. Tempdb is the only database per instance that holds all the temporary work of that instance, so it can become a bottleneck and slow the entire system down. You always need to pay close attention to tempdb behaviour and its activities, and then take remedial action to minimize problems. Best practices are there to give you a good start, but what is important is to identify the optimal configuration and parameters for your system rather than following best practices blindly. 


Monday, August 8, 2016

How to check the isolation level

SQL Server supports all four ANSI standard isolation levels. They are;
  1. Read Uncommitted (the lowest level, high concurrency, less blocking, more concurrency related issues) 
  2. Read Committed (default)
  3. Repeatable Read
  4. Serializable (the highest level, less concurrency, higher blocking, no concurrency related issues) 
In addition to the above four, SQL Server introduced two optimistic concurrency control isolation levels based on the row versioning technique;
  1. Read Committed Snapshot
  2. Snapshot isolation level
The DBCC USEROPTIONS command can be used to see the current isolation level in SQL Server, as sketched below.
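
A quick sketch of checking the current isolation level and changing it for the session:

-- Show the current session settings, including the isolation level.
DBCC USEROPTIONS;

-- Change the isolation level for this session and check again.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
DBCC USEROPTIONS;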

Wednesday, August 3, 2016

Differences between SQL Server and SQL Azure

I captured this screenshot from an MS Azure webinar and thought of sharing it, since it gives a high-level overview of the supported features in SQL Azure. 

For more details:

Thursday, July 28, 2016

Microsoft releases CU #1 for SQL Server 2016

Microsoft released its latest flagship database product, Microsoft SQL Server 2016, on June 1st, 2016. Two months later, Microsoft has announced the release of CU #1, which contains fixes for 146 issues across different categories of the product.


CU1 has 146 hotfixes in 10 different categories, as stated above. The highest no.of hotfixes is for the SQL Service.

You can download the CU1 by using the below link.

Cumulative Update 1 for SQL Server 2016


Friday, July 22, 2016

Denver SSUG Presentation About Latch Behavior

Yesterday I did a presentation about SQL Server latch behavior at the SQL Server User Group in Denver.

I hope everyone learnt something and got some exposure to SQL Server latches, which are internal to SQL Server and sometimes create issues, so knowing what they are and how they behave is important in my opinion.

The link below contains the ppt and the demo scripts I used. The demo scripts are compatible with SQL Server 2014 or later.

Presentation and Demo scripts