Joe Chang

Memory Allocated To Plan Caching

I could complain about Entity Framework, but there is probably nothing that has not already been aired. So instead, I will complain about the lack of tuning options in SQL Server's Memory Allocated To Plan Caching.

As of SQL Server 2005 SP2 and 2008, the formula is 75% of the first 0-4GB, plus 10% of 4-64GB, plus 5% of the remainder over 64GB, all referring to visible target memory. There was a different formula in 2005 RTM and SP1 that allowed even more memory for the plan cache. Does it not seem that both of these, even the reduced newer version, allow for an excessively large plan cache? How in the world was this arrived at?
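For reference, the actual plan cache size can be checked against this formula with a quick query, a minimal sketch assuming SQL Server 2012 or later (on 2005/2008 the pages_kb column is split into single_pages_kb and multi_pages_kb):

-- Approximate plan cache size by cache store.
-- CACHESTORE_SQLCP = ad-hoc and prepared plans, CACHESTORE_OBJCP = procedures,
-- triggers, functions; CACHESTORE_PHDR = bound trees; CACHESTORE_XPROC = extended procs.
SELECT type,
       SUM(pages_kb) / 1024 AS size_mb
FROM sys.dm_os_memory_clerks
WHERE type IN ('CACHESTORE_SQLCP', 'CACHESTORE_OBJCP',
               'CACHESTORE_PHDR', 'CACHESTORE_XPROC')
GROUP BY type
ORDER BY size_mb DESC;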

My suspicion is that the first 2005 formula was derived from testing with the Pace database. Way back in the early days, before AdventureWorks, before Northwind, someone created a database for testing. Instead of scaling up the data in this database to test ever larger environments, it was decided to scale up the number of databases.

What this meant was that the desired plan cache size scaled linearly with data size, which is probably not true in more common situations. At some point, there was the case of a very large environment using ad-hoc queries that resulted in an excessively large amount of memory being allocated to the plan cache. And so the current, reduced formula was adopted.

It could be argued that a responsible database architect would have relied heavily on stored procedures and parameterized SQL, achieving a sufficient degree of plan reuse so that the plan cache never becomes excessively large, perhaps using DBCC FREESYSTEMCACHE ('SQL Plans') on occasion, or even using a script to flush out single-use ad-hoc and prepared plans. But if only responsible people were involved, an RDBMS would probably have infinitesimal market share.
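A sketch of such a single-use plan cleanup script, assuming SQL Server 2008 or later (where DBCC FREEPROCCACHE accepts a specific plan_handle), not production-hardened code:

-- Evict single-use ad-hoc and prepared plans one plan_handle at a time.
DECLARE @plan_handle varbinary(64), @sql nvarchar(200);
DECLARE plan_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT plan_handle
    FROM sys.dm_exec_cached_plans
    WHERE objtype IN ('Adhoc', 'Prepared')
      AND usecounts = 1;
OPEN plan_cur;
FETCH NEXT FROM plan_cur INTO @plan_handle;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build the plan_handle as a 0x... literal and free just that plan.
    SET @sql = N'DBCC FREEPROCCACHE (' + CONVERT(nvarchar(130), @plan_handle, 1) + N');';
    EXEC (@sql);
    FETCH NEXT FROM plan_cur INTO @plan_handle;
END;
CLOSE plan_cur;
DEALLOCATE plan_cur;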

In any case, SQL Server is used in an unimaginably broad range of scenarios for which no single plan cache formula is workable. Any given formula will be significantly problematic for some number of scenarios.

It is time to admit that there is justification for plan cache memory to be a tunable setting. We could keep the current formula for the first 64GB (0.75 x 4 + 0.1 x 60 = 9GB) and make only the percentage of visible target memory over 64GB tunable. Additional fine-grained tuning options are also necessary.
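To put numbers on it: for a 512GB system like the one described later in this post, the current formula works out as follows (illustrative arithmetic only, assuming all of it is visible target memory):

-- Cache pressure limit under the 2005 SP2 / 2008 formula for 512GB visible memory:
--   0.75 * 4GB          =  3.0GB
--   0.10 * 60GB         =  6.0GB
--   0.05 * (512 - 64)GB = 22.4GB
--   total               = 31.4GB
DECLARE @visible_target_gb float = 512;
SELECT 0.75 * 4
     + 0.10 * 60
     + 0.05 * (@visible_target_gb - 64) AS cache_pressure_limit_gb;   -- 31.4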

There should be the ability to limit the percentage of plan cache allowed for ad-hoc and prepared statements. This could be a single limit on both, or individual controls on each. The purpose of this is to limit the damage that one crappy element of an application can inflict on the whole system.
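Today the closest existing knob (see the Kimberly Tripp reference below) is the instance-wide 'optimize for ad hoc workloads' setting, which caches only a small stub for ad-hoc plans until they are reused. It is not the percentage cap proposed here, but it is what is currently available:

-- Existing mitigation: cache only a plan stub on first execution of ad-hoc SQL.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;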

An additional related tuning option concerns memory allocations outside of the new (version 2012) "any-size" page allocator. CLR, thread stacks, and other functions (e.g. network packet sizes larger than 8060 bytes) make direct allocations from Windows. In 32-bit versions of SQL Server, there was the memory_to_reserve setting, or more correctly, virtual address space (VAS) to reserve, and the corresponding -g startup option, also known as MemToLeave.

Edit 2017-08-28
The default network packet size was 8192 bytes up to some point, somewhere around .NET Framework 3.0 or 4.0. In more recent versions, this was changed to 8000 to avoid bad things from happening.

We might think that this is no longer an issue in 64-bit versions of SQL Server. However, the essentially unlimited 64-bit VAS (actually only 44-bit or 48-bit VAS is implemented, depending on the version) is just that: address space. Would it be a good idea to be able to reserve physical memory for CLR and other consumers that use VAS outside of the page allocator?

The specific situation that I encountered was one in which some elements were built on Entity Framework. One of the EF queries produced a de-normalized row set. The parent table had 13 child-descendant combinations, so the query implemented a UNION ALL for each component. The combined number of columns was 449. In the EF style, each column was aliased with the full table name and then re-aliased, and each query had sub-queries with the same extremely verbose text.

This resulted in a query text of 348KB. Even though this was a parameterized query, there were enough variations that SQL Server had very many similar queries in the plan cache. Other elements of the environment involved substantial large-string manipulation. (In a different post, I will suggest implementing a StringBuilder-like variable option in SQL Server.)
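A rough way to spot this pattern in the cache, assuming SQL Server 2008 or later, is to group plans by a text prefix; the 100-character prefix and the >10 threshold below are arbitrary choices:

-- Find families of large, similar statements bloating the plan cache.
SELECT CONVERT(nvarchar(100), st.text)                 AS query_prefix,
       COUNT(*)                                        AS plan_count,
       SUM(CAST(cp.size_in_bytes AS bigint)) / 1048576 AS total_mb,
       MAX(DATALENGTH(st.text))                        AS max_text_bytes
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.objtype IN ('Adhoc', 'Prepared')
GROUP BY CONVERT(nvarchar(100), st.text)
HAVING COUNT(*) > 10
ORDER BY total_mb DESC;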

In this environment, there was a continuing stream of VAS allocations. System memory was 512GB. After a few days of operation, VAS reserved was 1TB, and 3TB after three weeks. VAS committed was, as expected, never more than 500GB, and there was always about 8GB of free physical memory. What appears to have happened is that SQL Server forced out most of the plan cache under VAS pressure.
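These VAS figures can be watched from sys.dm_os_process_memory (SQL Server 2008 and later); a minimal query:

-- VAS reserved/committed and physical memory in use by the SQL Server process.
SELECT physical_memory_in_use_kb / 1024          AS physical_in_use_mb,
       virtual_address_space_reserved_kb / 1024  AS vas_reserved_mb,
       virtual_address_space_committed_kb / 1024 AS vas_committed_mb
FROM sys.dm_os_process_memory;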

Almost all of the compiled plans for stored procedures were forced out and, to a lesser degree, the plans for parameterized SQL. It did not seem as though any buffer cache for data pages was forced out; data read IO activity was generally zero.
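The composition of the plan cache before and after such an event can be tracked with a query like the following:

-- Plan cache composition: counts, size, and reuse by plan type.
SELECT objtype, cacheobjtype,
       COUNT(*)                                     AS plan_count,
       SUM(CAST(size_in_bytes AS bigint)) / 1048576 AS size_mb,
       SUM(usecounts)                               AS total_use_count
FROM sys.dm_exec_cached_plans
GROUP BY objtype, cacheobjtype
ORDER BY size_mb DESC;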

As the storage system was all-flash, the preferred strategy in this situation would have been to reserve about 3-4GB of the plan cache for stored procedure plans, and maybe less than 2GB for parameterized SQL. The SQL Server engine should force out data pages to support direct allocations to Windows. To be more sophisticated, data pages can be forced out until there is an observable increase in data reads, or even until data read IO latency reaches some set point, as appropriate for either AFA/SSD or HDD storage as the case may be.

In other RDBMSs, there are a great many tuning options not exposed in SQL Server. In the hands of a very knowledgeable and talented database expert, these tools can be put to good use. In the hands of someone who read a "best practice" blog without bothering to understand the rationale behind each item, tuning can result in serious degradation. So there needs to be some care in which tuning options to expose and how best to provide safeguards.

Settable Tuning Options Needed

1. percentage of memory above 64GB available to the plan cache
2a. percentage of plan cache that can be used by ad-hoc SQL
2b. same as above for parameterized SQL, or a single combined setting
3. settable physical memory to reserve for allocations outside of the page allocator
4. priority option on what should be paged out on memory pressure: plan cache, data, or tempdb
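To be concrete about what is being asked for, here is a purely hypothetical sp_configure-style sketch; none of these option names exist in any SQL Server release, and the values are only examples:

-- HYPOTHETICAL ONLY: invented option names illustrating the four proposed tunables.
EXEC sp_configure 'plan cache pct of memory above 64GB', 2;             -- item 1
EXEC sp_configure 'plan cache pct limit for ad-hoc SQL', 10;            -- item 2a
EXEC sp_configure 'plan cache pct limit for prepared SQL', 20;          -- item 2b
EXEC sp_configure 'memory reserved outside page allocator (MB)', 8192;  -- item 3
EXEC sp_configure 'memory pressure eviction priority', 1;               -- item 4: 1 = data, 2 = plan cache, 3 = tempdb
RECONFIGURE;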

Addendum
I probably said this elsewhere. The use of NAND flash for storage has become pervasive, particularly in new purchases, even though HDD-based systems may retain a large share of the installed base. We should factor in that MLC NAND has limited write life and rethink database caching strategy accordingly. Consider the HASH MATCH operator, which has a set point at which it spills to tempdb, and tempdb objects in general. Instead of writing short-lived pages to tempdb, it might be better to dump clean data pages while keeping more of tempdb in memory, up to a point of course.

Addendum 2017-08-31
Having a sufficient OS page file seems to alleviate the problem of SQL Server forcing the plan cache down to a dangerously low level, and adding insult to injury by blocking compiles while doing so, when data pages could have been forced out without any adverse effect.
I am seeing the Performance Counter -> Memory -> Pages/sec at 6,000-10,000, but no disk activity. Does paging occur outside of the physical disk perf counters?
dm_exec_connections has a column for network packet size (net_packet_size); I have confirmed that large packets are reported correctly.
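For reference, both checks as quick queries:

-- OS-level memory state as SQL Server sees it, including the low memory signal.
SELECT available_physical_memory_kb / 1024 AS available_mb,
       system_cache_kb / 1024              AS system_cache_mb,
       system_low_memory_signal_state
FROM sys.dm_os_sys_memory;

-- Network packet size actually negotiated per connection.
SELECT session_id, net_packet_size
FROM sys.dm_exec_connections;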

References:
Plan Caching in SQL Server 2008, by Greg Low
Plan cache and optimizing for adhoc workloads, by Kimberly Tripp
Memory configuration and sizing considerations in SQL Server 2012 and later versions, support.microsoft.com
Plan Cache Internals, msdn.microsoft.com
Also:
SqlConnection.PacketSize Property
According to the above link, the default packet size was lowered from 8192 to 8000 in .NET Framework 3.5, but if anyone knows for sure that it was not 4.5, please advise.

Edit 2017-12-02: see Social MSDN

There are 2 nodes in the availability group, identical HW, identical config (512GB = 524,288MB memory), SQL Server max memory 494GB (506,000MB). LPIM is in effect.

The cache flush problem is more pronounced on one node than the other. System 1 shows 8,573MB available memory, but system_low_memory_signal_state is True. System 2 typically shows 9,818MB free, with the low memory signal state False. Given that there is almost no disk IO for data reads, I recommended reducing SQL Server max memory to 490,000MB, and the plan cache immediately recovered (from a situation where the plan cache tried to grow to several GB, then intermittently got flushed to below 1GB).

So apparently several GB of OS free memory is not a reliable indicator; use the sys.dm_os_sys_memory value of system_low_memory_signal_state instead. Roughly, SQL Server max memory at 96% of system memory is too high, but 93.5% seems to be OK. It is suspected that something running on the OS periodically consumes a large amount of memory (possibly even a file transfer that gets cached by the OS, more so on one node than the other; I think this might be it because system_cache is just a hair higher than available physical memory).
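For reference, the change that recovered the plan cache (490,000MB is about 93.5% of the 524,288MB of physical memory):

-- Lower max server memory from 506,000MB to 490,000MB.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 490000;
RECONFIGURE;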

In this particular case, flushing the plan cache in the system low memory state was the wrong thing to do because there is very little IO for data pages, and storage is on all-flash. SQL Server + Windows should have a way of knowing this, hence: feel free to ditch data buffer pages, not the plan cache, unless it is the single-use parameterized SQL (from Entity Framework). Most certainly the OS should know not to use too much free memory for file caching, and never trigger the system low memory signal state.

This is my best guess.
SQL Server backs up locally. This does not use the OS/system cache because SQL Server backups use unbuffered IO.

There is a robocopy job that later copies the backups to a network location. This copy goes through the system cache because the default is buffered IO, and the Windows Server 2016 OS is not smart enough to know that for a multi-GB file transfer, using buffered IO is stupid, more so when SQL Server is running and using 494GB on a system with 512GB memory. So Windows fills the remaining memory with system cache for a file copy that will never be touched afterwards (the only "benefit" of caching), then triggers the system low memory signal state.

Then to compound this, SQL Server sees the low memory signal and decides to flush the plan cache, not just the very many single-use parameterized plans (generated unnecessarily from Entity Framework), but also the critical, frequently used procedure plans. This causes a significant jump in CPU. In some cases, there also appears to be a lock on the plan cache (could this be held during a large dump of the plan cache?) which then causes any new calls that require a compile to be blocked.

What is sad here is that even if SQL Server decided to cooperate with the OS in releasing memory, it could have dumped data buffers instead. Much of the data pages were infrequently used, and the storage is on flash; it could handle 100,000s of IOPS if needed. But no: there may have been some dump of data buffers, with PLE dropping from 90,000 to 8,500 seconds, and even then, IOPS was noise compared to what the storage could have handled. Hey dude, what's up?
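The PLE figures quoted here come from the buffer manager counters, which can be read from within SQL Server:

-- Page life expectancy (seconds), overall and per NUMA node.
SELECT object_name, counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND (object_name LIKE '%Buffer Manager%' OR object_name LIKE '%Buffer Node%');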

The plan is to try robocopy with the /J flag, which is supposed to use unbuffered IO; hopefully this will stop the copy from going through the system cache. Also, I see that there is a call, %windir%\system32\rundll32.exe advapi32.dll,ProcessIdleTasks, that is supposed to flush the system cache. Will try both to see if the OS free memory is then truly free and not system cache (of unnecessary stuff).

Published Saturday, August 12, 2017 8:22 AM by jchang


Comments

 

George Walkey said:

Adaptive Joins using cost model for join selection with thresholding here

https://channel9.msdn.com/Shows/Data-Exposed/SQL-Server-2017-Adaptive-QP

September 8, 2017 5:46 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
