Optimum thread count for NServiceBus

We're trying to figure out the optimum number of threads to use for our NServiceBus service. We're running it on a machine with 2 quad cores. We've been having problems with the queue backing up. We started with 100 threads, then bumped it to 200, and things got worse. We backed it down to 75, then 50, and it seemed even better. Is there some optimal number based on how many CPUs we have, or some rule of thumb we should use to determine the number of threads to run?

Every thread you have running has an overhead attached to it. If you have 2 quad cores then at most 8 threads can actually be executing at any one time; each running thread occupies a core.
If you have more than 8 threads then there is a chance you will start to do LESS useful work, not more. Every time Windows decides to give a thread that isn't currently on a core a turn, it has to save the state of one of the running threads and restore the saved state of the thread that is about to run, and only then let it go. If you have a huge number of threads, you're going to spend a large amount of time just switching between them and doing nothing useful.
If you have a bunch of threads that are blocked waiting for IO (for instance, waiting for a message to finish being written to disk so it can be read), then you might be able to run more threads than you have cores and still get useful work done, because a number of those threads will be sitting waiting for something else to complete. It's a complex subject and there is no single answer to "how many threads should I use". A good rule of thumb is to have a thread for every core, and then experiment a bit if you want more throughput. Testing under real conditions is the only real way to find the sweet spot. You might find that you only need one thread to process the messages, and that half the time that thread is blocked waiting for a message to come in.
Obviously, even what I've described is oversimplified. Windows needs the cores for its own housekeeping, so even with 8 cores your 8 threads won't always all be running, because OS threads take their turn too; then there are IO threads and so on.

Related

Can a BLOCKED Thread cause high CPU Consumption

We saw a high CPU consumption issue in our production environment recently and saw something strange while debugging it. When I did a "top -H" to see the CPU stats per thread ID, I found a thread X consuming high CPU. When I took thread dumps, I saw that this thread X was in the BLOCKED state. What does this mean? Can a thread in the BLOCKED state consume high CPU? This might be a trivial question, but I am a novice at debugging performance issues and the JVM, and I am not sure what I might be missing here.
Entering and exiting a BLOCKED state can be expensive. Being BLOCKED for a while is not a problem in itself, but if you block briefly, over and over, in a busy loop, your thread can appear blocked in a dump while in reality it is burning CPU.
I would look for multiple threads repeatedly competing for a shared resource, each entering the BLOCKED state only very briefly.
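To make that concrete: the question is about the JVM, but the pattern is language-agnostic. Below is a rough sketch (in Objective-C, to match the code used elsewhere on this page; the thread count and iteration count are arbitrary) of several threads hammering a lock that each holds only for an instant. Most of the CPU goes into acquiring, releasing and handing off the contended lock rather than into the increment itself, so overall CPU is high even though a sampler will often catch each thread waiting on the lock.
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSLock *lock = [NSLock new];
        __block long long counter = 0;

        for (int t = 0; t < 4; t++) {
            [NSThread detachNewThreadWithBlock:^{
                for (long i = 0; i < 50000000; i++) {
                    [lock lock];    // contended acquire: this is where the cost is
                    counter++;      // the "useful work" is negligible
                    [lock unlock];
                }
            }];
        }
        [NSThread sleepForTimeInterval:30];  // watch CPU in top/Activity Monitor meanwhile
    }
    return 0;
}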
Peter has already made a good point about busy loops (which could be the JVM's internal adaptive spin-lock optimisation around synchronization, or a busy loop the application itself runs while polling some condition); either can burn CPU. There is also another, indirect way in which CPU can go very high because of blocked threads. Typically in a web server, if lots of threads are blocked (not on a synchronization lock, but, say, waiting for IO from a back-end datastore), it can put a lot of pressure on JVM garbage collection. Those worker threads are supposed to finish their work quickly so that the objects they create on the heap are quickly de-referenced and collected. If lots of threads are stuck in this state, the garbage collection threads have to work overtime and can end up taking a lot of CPU.

Is it useful to create many threads for a specific process to increase the chance of this process being executed?

This is an interview question I encountered today. I have some knowledge of operating systems but am not really proficient. I think maybe there is a limit to how many threads each process can create?
Any ideas will help.
This question can be viewed [at least] in two ways:
Can your process get more CPU time by creating many threads that need to be scheduled?
or
Can your process get more CPU time by creating threads to allow processing to continue when another thread(s) is blocked?
The answer to #1 is largely system dependent. However, any rationally designed system is going to protect against rogue processes trying this, so generally the answer here is NO. In fact, some older systems only schedule processes, not threads; in those cases the answer is always NO.
The answer to #2 is generally YES. One of the reasons to use threads is to allow a process to continue processing while it has to wait on some external event.
The number of threads that can run in parallel depends on the number of CPUs on your machine.
It also depends on the characteristics of the processes you're running. If they are CPU-bound, it won't be efficient to run more threads than the number of CPUs on your machine; on the other hand, if they do a lot of I/O, or any other kind of task that blocks a lot, it can make sense to increase the number of threads.
As for "how many": you'll have to tune your app, take measurements, and decide based on actual data.
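As a rough illustration of that starting point (a sketch in Objective-C, consistent with the code later on this page; the operation count and the empty work block are placeholders): size the pool from the core count for CPU-bound work, and only raise the limit, guided by measurement, when the work spends most of its time blocked.
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSUInteger cores = [[NSProcessInfo processInfo] activeProcessorCount];

        NSOperationQueue *pool = [NSOperationQueue new];
        // CPU-bound work: more threads than cores mostly adds context switching.
        pool.maxConcurrentOperationCount = (NSInteger)cores;
        // For work that blocks on I/O a lot, a higher limit may help -- but find
        // the value by measuring throughput, not by guessing.

        for (int i = 0; i < 1000; i++) {
            [pool addOperationWithBlock:^{
                // placeholder for one unit of work
            }];
        }
        [pool waitUntilAllOperationsAreFinished];
    }
    return 0;
}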
Short answer: Depends on the OS.
I'd say it depends on how the OS scheduler is implemented.
From personal experience with my hobby OS, it can certainly happen.
In my case, the scheduler is implemented with a round robin algorithm, per thread, independent on what process they belong to.
So, if process A has 1 thread, and process B has 2 threads, and they are all busy, Process B would be getting 2/3 of the CPU time.
There are certainly a variety of approaches. Check Scheduling_(computing)
Throw in priority levels per process and per thread, and it really depends on the OS.

How can I speed up a Mac app processing 5000 independent tasks?

I have a long running (5-10 hours) Mac app that processes 5000 items. Each item is processed by performing a number of transforms (using Saxon), running a bunch of scripts (in Python and Racket), collecting data, and serializing it as a set of XML files, a SQLite database, and a CoreData database. Each item is completely independent from every other item.
In summary, it does a lot, takes a long time, and appears to be highly parallelizable.
After loading up all the items that need processing, the app uses GCD to parallelize the work, using dispatch_apply:
dispatch_apply(numberOfItems, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
    @autoreleasepool {
        ...
    }
});
I'm running the app on a Mac Pro with 12 cores (24 virtual). So I would expect to have 24 items being processed at all times. However, I found through logging that the number of items being processed varies between 8 and 24. This is literally adding hours to the run time (assuming it could work on 24 items at a time).
On the one hand, perhaps GCD is really, really smart and it is already giving me the maximum throughput. But I'm worried that, because much of the work happens in scripts that are spawned by this app, maybe GCD is reasoning from incomplete information and isn't making the best decisions.
Any ideas how to improve performance? After correctness, the number one desired attribute is shortening how long it takes this app to run. I don't care about power consumption, hogging the Mac Pro, or anything else.
UPDATE: In fact, this looks alarming in the docs: "The actual number of tasks executed by a concurrent queue at any given moment is variable and can change dynamically as conditions in your application change. Many factors affect the number of tasks executed by the concurrent queues, including the number of available cores, the amount of work being done by other processes, and the number and priority of tasks in other serial dispatch queues." (emphasis added) It looks like having other processes doing work will adversely affect scheduling in the app.
It'd be nice to be able to just say "run these blocks concurrently, one per core, don't try to do anything smarter".
If you are bound and determined, you can explicitly spawn 24 threads using the NSThread API, and have each of those threads pull from a synchronized queue of work items. I would bet money that performance would get noticeably worse.
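For illustration only (the point above is that this will most likely be slower, not faster), here is a minimal sketch of that hand-rolled approach, with processItem() standing in as a placeholder for the real per-item work: a fixed number of NSThreads pull items from a shared, lock-protected queue until it is empty.
#import <Foundation/Foundation.h>

static void processItem(id item) { /* placeholder for the real per-item work */ }

int main(void) {
    @autoreleasepool {
        NSMutableArray *workQueue = [NSMutableArray array];  // shared queue of items
        for (int i = 0; i < 5000; i++) [workQueue addObject:@(i)];

        NSLock *lock = [NSLock new];
        dispatch_group_t group = dispatch_group_create();

        NSUInteger threadCount = 24;   // e.g. one per virtual core on the 12-core Mac Pro
        for (NSUInteger t = 0; t < threadCount; t++) {
            dispatch_group_enter(group);
            [NSThread detachNewThreadWithBlock:^{
                for (;;) {
                    @autoreleasepool {
                        [lock lock];
                        id item = [workQueue lastObject];
                        if (item) [workQueue removeLastObject];
                        [lock unlock];
                        if (!item) break;          // queue drained: thread exits
                        processItem(item);
                    }
                }
                dispatch_group_leave(group);
            }];
        }
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    }
    return 0;
}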
GCD works at its most efficient when the work items submitted to it never block. That said, the workload you're describing is rather complex and rife with opportunities for your threads to block. For starters, you're spawning a bunch of other processes. Right here, this means that you're already relying on the OS to divvy up time/resources between your master task and these slave tasks. Other than setting the OS priority of each subprocess, the OS scheduler has no way to know which processes are more important than others, and by default, your subprocesses are going to have the same priority as their parent. That said, it doesn't sound like you have anything to gain by tweaking process priorities. I'm assuming you're blocking the master task thread that's waiting for the slave tasks to complete. That is effectively parking that thread -- it can do no useful work. But like I said, I don't think there's much to be gained by tweaking the OS priorities of your slave tasks, because this really sounds like it's an I/O bound workflow...
You go on to describe three I/O-heavy operations ("serializing it as a set of XML files, a SQLite database, and a CoreData database.") So now you have all these different threads and processes vying for what is presumably a shared bulk storage device. (i.e. unless you're writing to 24 different databases, on 24 separate hard drives, one for each core, your process is ultimately going to be serialized at the disk accesses.) Even if you had 24 different hard drives, writing to a hard drive (even an SSD) is comparatively slow. Your threads are going to be taken off of the CPU they were running on (so that another thread that's waiting can run) for virtually any blocking disk write.
If you wanted to maximize the performance you're getting out of GCD, you would probably want to rewrite all the stuff you're doing in subtasks in C/C++/Objective-C, bringing them in-process, and then conducting all the associated I/O using dispatch_io primitives. For APIs where you don't control the low-level reads and writes, you would want to carefully manage and tune your workload to optimize it for the hardware you have. For instance, if you have a bunch of stuff to write to a single, shared SQLite database, there's no point in ever having more than one thread trying to write to that database at once. You'd be better off making one thread (or a serial GCD queue) to write to SQLite and submitting tasks to that after pre-processing is done.
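A minimal sketch of that last suggestion, with preprocessItem() and writeToDatabase() as hypothetical stand-ins for the real work: the per-item processing still fans out across cores, but every database write funnels through one serial queue so only a single thread ever touches SQLite.
#import <Foundation/Foundation.h>

static id preprocessItem(size_t i) { return @(i); }          // placeholder: CPU-bound work
static void writeToDatabase(id result) { /* placeholder: the SQLite write */ }

int main(void) {
    @autoreleasepool {
        size_t numberOfItems = 5000;
        dispatch_queue_t dbQueue =
            dispatch_queue_create("com.example.sqlite-writer", DISPATCH_QUEUE_SERIAL);

        dispatch_apply(numberOfItems,
                       dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
                       ^(size_t i) {
            @autoreleasepool {
                id result = preprocessItem(i);        // runs in parallel across cores
                dispatch_async(dbQueue, ^{            // writes run strictly one at a time
                    writeToDatabase(result);
                });
            }
        });

        dispatch_sync(dbQueue, ^{});                  // drain any writes still queued
    }
    return 0;
}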
I could go on for quite a while here, but the bottom line is that you've got a complex, seemingly I/O bound workflow. At the highest level, CPU utilization or "number of running threads" is going to be a particularly poor measure of performance for such a task. By using sub-processes (i.e. scripts), you're putting a lot of control into the hands of the OS, which knows effectively nothing about your workload a priori, and therefore can do nothing except use its general scheduler to divvy up resources. GCD's opaque thread pool management is really the least of your problems.
On a practical level, if you want to speed things up, go buy multiple, faster (i.e. SSD) hard drives, and rework your task/workflow to utilize them separately and in parallel. I suspect that would yield the biggest bang for your buck (for some equivalence relation of time == money == hardware.)

Is it safe to access the hard drive via many different GCD queues?

Is it safe? For instance, if I create a bunch of different GCD queues that each compress (tar cvzf) some files, am I doing something wrong? Will the hard drive be destroyed?
Or does the system properly take care of such things?
Dietrich's answer is correct save for one detail (that is completely non-obvious).
If you were to spin off, say, 100 asynchronous tar executions via GCD, you'd quickly find that you have 100 threads running in your application (which would also be dead slow due to gross abuse of the I/O subsystem).
In a fully asynchronous concurrent system with queues, there is no way to know if a particular unit of work is blocked because it is waiting for a system resource or waiting for some other enqueued unit of work. Therefore, anytime anything blocks, you pretty much have to spin up another thread and consume another unit of work or risk locking up the application.
In such a case, the "obvious" solution is to wait a bit when a unit of work blocks before spinning up another thread to de-queue and process another unit of work with the hope that the first unit of work "unblocks" and continues processing.
Doing so, though, would mean that any asynchronous concurrent system with interaction between units of work -- a common case -- would be so slow as to be useless.
Far more effective is to limit the number of units of work that are enqueued on the global asynchronous queues at any one time. A GCD semaphore makes this quite easy: create it with the maximum number of units you want in flight (say, 4), wait on the semaphore before enqueuing each unit of work, and signal it when a unit of work completes. That way a new unit can only start once a running one has finished.
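Here is a sketch of that pattern with dispatch_semaphore (the limit of 4 and the doWork() function are placeholders): the semaphore starts at the maximum number of in-flight units, each submission waits on it, and each completed unit signals it, so a new unit only starts when a slot frees up.
#import <Foundation/Foundation.h>

static void doWork(int i) { /* placeholder: e.g. spawn and wait for one tar job */ }

int main(void) {
    dispatch_semaphore_t slots = dispatch_semaphore_create(4);   // at most 4 in flight
    dispatch_queue_t global = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_group_t group = dispatch_group_create();

    for (int i = 0; i < 100; i++) {
        dispatch_semaphore_wait(slots, DISPATCH_TIME_FOREVER);   // blocks once 4 are running
        dispatch_group_async(group, global, ^{
            doWork(i);
            dispatch_semaphore_signal(slots);                    // free the slot
        });
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    return 0;
}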
If you take something that is normally IO-limited, such as tar, and run a bunch of copies in GCD:
- It will run more slowly, because you are throwing more CPU at an IO-bound task, meaning the IO will be more scattered and there will be more of it going on at the same time.
- No more than N tasks will run at a time, which is the point of GCD, so "a billion queue entries" and "ten queue entries" give you the same thing if you have fewer than ten threads.
- Your hard drive will be fine.
Even though this question was asked back in May, it's still worth noting that GCD has now provided I/O primitives with the release of 10.7 (OS X Lion). See the man pages for dispatch_read and dispatch_io_create for examples on how to do efficient I/O with the new APIs. They are smart enough to properly schedule I/O against a single disk (or multiple disks) with knowledge of how much concurrency is, or is not, possible in the actual I/O requests.
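For example (a sketch; the path is a placeholder and error handling is minimal), dispatch_read hands you the file's contents on a queue of your choosing and lets GCD schedule the underlying I/O:
#import <Foundation/Foundation.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/example.dat", O_RDONLY);       // placeholder path
    if (fd < 0) return 1;

    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    // Read until EOF; the handler runs on q once the data is available.
    dispatch_read(fd, SIZE_MAX, q, ^(dispatch_data_t data, int error) {
        if (error == 0) {
            NSLog(@"read %zu bytes", dispatch_data_get_size(data));
        }
        close(fd);
        dispatch_semaphore_signal(done);
    });

    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
    return 0;
}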

High CPU utilization - VB.NET

We are facing an issue with VB.NET listeners that use a lot of CPU (50% to 70%) on the server machine where they run. The listeners use threads, and we also use the FileSystemWatcher class to keep monitoring for file renames in one common location. Both are console applications, running all day as scheduled jobs.
How can I control the CPU utilization with this FileSystemWatcher class?
This could all depend on the code you are running.
For instance, if you have a timer with an interval of 10 ms but only do real work every two minutes, and on each timer tick you do a lot of checking, that will use a lot of CPU to do nothing.
If you are using multiple threads and one is looping while waiting for another to release a lock (Monitor.TryEnter()), then again this may be taking up extra CPU. You can avoid it by putting the waiting thread into Monitor.Wait() and having the busy thread call Monitor.Pulse() when it is finished.
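The question is about VB.NET, but the same wait/pulse idea exists in most runtimes. As a sketch of the pattern (in Objective-C with NSCondition, to stay consistent with the other examples on this page, not the .NET Monitor API itself), the waiting thread sleeps inside wait instead of spinning, and the busy thread wakes it with signal once the work is ready:
#import <Foundation/Foundation.h>

int main(void) {
    NSCondition *condition = [NSCondition new];
    __block BOOL workReady = NO;

    // Waiting thread: sleeps inside -wait, consuming no CPU, until pulsed.
    [NSThread detachNewThreadWithBlock:^{
        [condition lock];
        while (!workReady) {           // guard against spurious wake-ups
            [condition wait];
        }
        [condition unlock];
        NSLog(@"woke up and did the work");
    }];

    // "Busy" thread (here, main): finishes its job, then pulses the waiter.
    [NSThread sleepForTimeInterval:1.0];
    [condition lock];
    workReady = YES;
    [condition signal];                // the Monitor.Pulse() equivalent
    [condition unlock];

    [NSThread sleepForTimeInterval:1.0];  // give the waiter time to log
    return 0;
}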
Apart from the very general advice above, if you post the key parts of your code or your profiler results, we may be able to help more.
If you are looking for a profiler, we use Red Gate's ANTS Profiler (paid, but with a free trial) and it gives good results. I haven't used any others to compare (and I am in no way affiliated with Red Gate), so others may be better.