We have a 16 processor SQL Server 2005 cluster. When looking at CPU usage data, we see that most of the time only 4 of the 16 processors are ever utilized. However, in periods of high load, occasionally a 5th and 6th processor will be used, though never anywhere near the utilization of the other 4. I'm concerned that in periods of tremendously high load the remaining processors will not be utilized and we will see performance degradation.
Is what we're seeing standard SQL Server 2005 cluster behavior? I assumed that all 16 processors would be utilized at all times, though this does not appear to be the case. Is this something we can tune, or is it expected behavior? Will SQL Server be able to utilize all 16 processors if it comes to that?
I'll assume you did due diligence and validated that the CPU consumption belongs to the sqlservr.exe process, so we're not chasing a red herring here. If not, please make sure the CPU is consumed by sqlservr.exe by checking the Process\% Processor Time performance counter.
You need to understand the SQL Server CPU scheduling model, as described in Thread and Task Architecture. SQL Server spreads requests (sys.dm_exec_requests) across schedulers (sys.dm_os_schedulers) by assigning each request to a task (sys.dm_os_tasks) that is run by a worker (sys.dm_os_workers). A worker is backed by an OS thread or fiber (sys.dm_os_threads). Most requests (a batch sent to SQL Server) spawn only one task; some requests, though, may spawn multiple tasks (parallel queries being the most notorious).
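To see the distribution for yourself, here is a sketch over the DMVs mentioned above (column names as documented for SQL Server 2005; run it while the system is under load):

```sql
-- How busy is each scheduler? (scheduler_id >= 255 are hidden/internal)
SELECT scheduler_id, cpu_id, current_tasks_count,
       runnable_tasks_count, load_factor
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
ORDER BY scheduler_id;

-- How are the current tasks spread across schedulers?
SELECT scheduler_id, COUNT(*) AS task_count
FROM sys.dm_os_tasks
GROUP BY scheduler_id
ORDER BY scheduler_id;
```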
The normal behavior of SQL Server 2005 scheduling should be to distribute the tasks evenly across all schedulers. Each scheduler corresponds to one CPU core, so the result should be an even load on all cores. But I've seen the problem you describe a few times in the labs, when the physical workload would distribute unevenly across only a few CPUs. You have to understand that SQL Server does not control the thread affinity of its workers, but instead relies on the OS affinity algorithm for thread locality. That means that even if SQL Server spreads the requests across the 16 schedulers, the OS might decide to run the threads on only 4 cores. Two problems are known to correlate with this issue and may cause or aggravate the behavior:
Hyperthreading. If you enabled hyperthreading, turn it off. SQL Server and hyperthreading should never mix.
Bad drivers. Make sure you have the proper system device drivers installed (for things like main board and such).
Also make sure your SQL Server 2005 is at least at the SP2 level, preferably at the latest SP with all cumulative updates applied. The same goes for Windows (do you run Windows 2003 or Windows 2008?).
In theory the behavior could also be explained by a very peculiar workload, i.e. SQL Server sees only a few very long, CPU-demanding requests that have no parallel option. But that would be an extremely skewed load, and I've never seen something like that in real life.
Even accounting for an IO bottleneck, what I would check is whether you have processor affinity set up, what your maxdop setting is, and whether the box is SMP or NUMA, which should also affect what maxdop you may wish to set (see the sketch below).
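Both settings can be inspected with sp_configure; a minimal sketch (for both options, 0 is the default and means "let SQL Server decide"):

```sql
-- Both options are 'advanced', so surface them first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'affinity mask';              -- 0 = no CPU affinity set
EXEC sp_configure 'max degree of parallelism';  -- 0 = use all available CPUs
```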
When you say you have a 16 processor cluster, do you mean 2 SQL Servers in a cluster with 16 processors each, or 2 x 8-way SQL Servers?
Are you sure that you're not bottlenecking elsewhere? On IO perhaps?
Hard to be sure without hard data, but I suspect the problem is that you're more IO-bound or memory-bound than CPU-bound right now, and 4 processors is enough to keep up with your real bottleneck.
My reasoning is that if there were some configuration problem keeping you limited to 4 CPUs, you wouldn't see it spill over to the 5th and 6th processors at all.
I'm using Rebus to communicate with RabbitMQ, and I've configured it to use 4 workers with a max parallelism of 4. I also notice that the prefetch count is 1 (probably to have even balancing between consumers).
My issue happens when I have a spike of, say, 1000 messages: I notice high CPU usage in a Kubernetes environment with 2 containers, where each container is limited to, say, 0.5 CPU.
I've profiled the application with dotTrace, and it shows that the cause is the wait on cancellationToken.WaitHandle inside the DefaultBackOffStrategy class.
My question is whether the initial setup of workers/parallelism is making this happen, or whether I need to tweak something in Rebus. I've also tried to change the prefetch count for each consumer, but in the RabbitMQ management UI this doesn't change the default value (which is one).
Also, with the CPU profiler from Visual Studio and looking at the .NET counters, I notice some lock contention counters (can this be related?).
My question at this point is why the CPU usage is so high, and how to solve this properly.
Thanks for any help given.
The problem is described and answered here: https://groups.google.com/forum/#!topic/redis-db/egyA1xvhGfo
Unfortunately I do not fully understand the answer.
My concern is: if Redis takes up 100% CPU every 5 minutes, and my server has only a single CPU (i.e. staging), would that mean it will freeze my httpd process every 5 minutes?
Would this not be a concern if my server has multiple CPUs?
Depending on the type of persistence you select, this will happen. The reason is that the standard persistence method (a fork with copy-on-write, aka COW) kicks in after x number of object changes (or however you have it configured; in redis.conf this is the save directive, e.g. save 60 10000 means snapshot after 10000 changes within 60 seconds) and will eat up a fair amount of I/O persisting the database to disk. You'll want at least a spare core on your server for the persistence to happen, but it's not so much actual CPU being utilized as it is the wait for the I/O. Faster I/O will mitigate the impact of the db saves.
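For intuition, here is a minimal sketch of that fork-and-COW pattern (not Redis source; the snapshot_to_disk helper is hypothetical):

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical stand-in for serializing the in-memory dataset to disk.
static int snapshot_to_disk()
{
    /* ... write dataset to a temp file, fsync, rename over the old one ... */
    return 0;
}

void maybe_persist()
{
    pid_t pid = fork();  // child shares the parent's pages copy-on-write
    if (pid == 0)
    {
        // Child: walks its frozen view of memory and does the disk I/O.
        _exit(snapshot_to_disk());
    }
    else if (pid > 0)
    {
        // Parent: keeps serving requests immediately; each write it performs
        // now forces a copy of the touched page (the COW cost).
        // Reap the child later, e.g. waitpid(pid, 0, WNOHANG) from a timer.
    }
    else
    {
        std::perror("fork");  // fork can fail under memory pressure
    }
}
```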
I have installed MongoDB 2.4.4 on Amazon EC2 with Ubuntu 64-bit OS and 1.6 GB RAM.
Only MongoDB is running on this server, nothing else.
But sometimes CPU usage reaches 99%, with a load average of 500.01, 400.73, 620.77.
I have also installed MMS on the server to monitor what's going on.
Here is the MMS detail:
As per the MMS details, indexing is working perfectly for each query.
Suspect details are as below:
1) HIGH non-mapped virtual memory
2) HIGH page faults
Can anyone help me understand what exactly is causing the high CPU usage?
EDIT:
After the comments of @Dylan Tong, I have reduced active connections, but there is still high non-mapped virtual memory.
Here's a summary of a few things to look into:
1. Observed a large number of connections and cursors (13k):
- fix: make sure your connection pool is appropriately sized. For reporting and your current request rate, you only need a few connections at most. Also, I'm guessing you have an m1.small instance, which means you only have 1 core.
2. Review queries and indexes:
- run your queries with explain() (e.g. db.collection.find(query).explain() in the mongo shell) to observe how they are executed. The right model normally results in queries pulling only a few documents and utilizing an index.
3. Memory (compact and readahead setting):
- make the best use of memory. 1.6GB is low. Check how much free memory you have, and compare it to what is reported as resident. A common cause of low resident memory is fragmentation: if a lot of documents are moving, changing size and such, you should run the compact command to defragment your data files. A bad readahead setting can also lead to poor use of memory. Check your readahead setting (e.g. with blockdev --getra on the data volume; see http://manpages.ubuntu.com/manpages/lucid/man2/readahead.2.html) and try a few values, starting low (http://docs.mongodb.org/manual/administration/production-notes/); the production notes recommend 32 (for standard 512-byte blocks). Sometimes higher values are optimal if your documents are larger. The hope is that resident memory ends up close to your available memory and your page faults start to drop.
If you're using resources to the fullest after this, and you're still capped out on CPU then it means you need to up your resources.
Can the performance (response time) of a query executed in a DBMS like SQL Server be influenced by whatever is happening on the machine on which the server runs? To be more specific, is the response time expected to increase when running a couple of Windows processes that continuously check and clean the machine, and process data received from the network?
Thanks.
The four key resources for any program are available memory, processing power, disk space, and I/O bandwidth.
Let's investigate each of these in turn. Available memory is well-managed in SQL Server (see here). The default behavior is to start with a bunch of memory and then increase it as necessary. If your query load is not changing, SQL Server should hit a maximum amount of memory and stop growing. It sounds like your query load is consistent over time, so memory would not be a big issue. Also, many configurations of SQL Server fix the memory size to avoid interference with other processes.
Processing power. This can be a big one. SQL Server requires processing power, and the processors may be used by other Windows processes. This would slow down queries, particularly those that are processing-constrained (as opposed to I/O-constrained). However, this might be mitigated on a multi-processor machine: a given SQL Server instance might be assigned a certain number of processors, and the rest could be used for Windows.
Disk space. This has little impact. In general, either disk that is needed is available or it is not (and the query needing it fails). One exception is temporary disk space, the availability of which can influence query execution plans. Often, temporary space is put on its own drive to avoid needless conflict with other processes.
I/O bandwidth. SQL Server needs to communicate through the file system to disk. This can be a real performance drag, and it can occur at multiple different levels. The operating system itself could be saturated with I/O calls, slowing down the database. The network between the CPUs and the disks could be saturated, slowing down reading and writing speeds. The disk system itself can be slow because of multiple concurrent actions -- even from different servers. And this can get all the more complicated in a virtual environment.
The answer is "yes". Windows processes can affect the performance of SQL queries. My best guess is that the effect would be either in terms of eating up processors or using up disk bandwidth.
Yes, the database server is sharing resources with anything else that runs on the machine, so any resource intensive process could affect the database performance noticeably.
One important resource is memory. SQL Server will by default use up all free memory if it has any use for it. If you are running other processes on the server, you should limit SQL Server's memory so that it allocates a bit less, leaving room for the other processes and reducing memory swapping.
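For instance, a minimal sketch of capping the instance with sp_configure (the 4096 MB value is purely illustrative; size it to leave headroom for your other processes):

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap the buffer pool so other processes keep some physical memory
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;
```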
TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one, global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, never entering the kernel from more than one thread at a time. I can kind of understand why: thread safety and object lifetime are extremely precarious when multiple threads can each get notifications for the same file descriptor. When I coded this up myself (using pthreads), it worked, and scaled beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one per CPU core), each run by a single thread, you will not have this problem. See HTTP server example 2 on the Boost.Asio page; a sketch of the pattern follows below.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
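A minimal sketch of that io_service-per-core pattern (modeled loosely on the io_service_pool from the Boost example; Boost 1.40-era API, details illustrative):

```cpp
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <vector>

int main()
{
    const std::size_t num_cores = 8;  // size to your machine

    // One io_service (hence one reactor) per core, so no single epoll
    // dispatch loop becomes the bottleneck.
    std::vector<boost::shared_ptr<boost::asio::io_service> > io_services;
    std::vector<boost::shared_ptr<boost::asio::io_service::work> > work;
    for (std::size_t i = 0; i < num_cores; ++i)
    {
        boost::shared_ptr<boost::asio::io_service> ios(
            new boost::asio::io_service);
        io_services.push_back(ios);
        // The 'work' object keeps run() from returning while idle.
        work.push_back(boost::shared_ptr<boost::asio::io_service::work>(
            new boost::asio::io_service::work(*ios)));
    }

    // Exactly one thread runs each io_service, so handlers on the same
    // io_service never contend on the reactor's internal locks.
    boost::thread_group threads;
    for (std::size_t i = 0; i < num_cores; ++i)
        threads.create_thread(
            boost::bind(&boost::asio::io_service::run, io_services[i]));

    // Each accepted socket would then be created against
    // io_services[n++ % num_cores], spreading async I/O round-robin
    // across the reactors.

    threads.join_all();
    return 0;
}
```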
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you too, depending on your threading situation.
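For reference, the macro has to be visible before any Asio header is pulled in (a sketch; only safe when everything touching Asio is single-threaded):

```cpp
// Either pass -DBOOST_ASIO_DISABLE_THREADS to the compiler, or:
#define BOOST_ASIO_DISABLE_THREADS
#include <boost/asio.hpp>  // Asio now compiles with its internal locking removed
```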