Understanding BLOCKED_TIME in PerfView - asp.net-core

We are suspecting that we're experciencing thread pool starvation on a server that is running a couple of ASP.NET Core APIs and a couple of .NET Core consoles.
I ran perfview one one of our servers were we are suspecting problems with thread pool starvation. However I'm having a bit of trouble analyzing the results.
I ran PerfView /threadTime collect for about 60 seconds. And this is the result I got (I chose one to look at one of our ASP.NET Core APIs):
Looking at "By Name" we can see that there is a lot of time spent in BLOCKED_TIME. If I double click then I'm taken to the following view where I can expand one of the nodes to get the following view (the overwritten part is the name of our API process):
What does that tell me? Shouldn't I be able to see what exactly is blocking? And does it look like the problem is that a lot of threads is blocking each one for a small amount of time?
Are there any other conclusions we can draw from this?

BLOCKED_TIME generally means a period when the thread wasn't doing anything at all. This could be periods of I/O, where network or other types of latency are involved or time spent waiting on locks such as in situations with semaphores. In short, this doesn't necessarily tell you anything, as there's perfectly standard and reasonable reasons for the thread to be idled. However, a goodish amount of time spent blocked can be an indication of an underlying problem. Perhaps you have too much network latency. Perhaps you're trying to do too much file system work on a slow drive. In short, it may or may not indicate a problem, and even if it does indicate a problem, it doesn't really tell you what the problem is.
In general, if you're experiencing thread starvation, the first thing you should look at is thread pool utilization. Are you using async everywhere you can? Are you doing things that are big no-nos in web apps such as using Task.Run, Task.StartNew or worse, Thread.Start? All those created threads are coming out of the same thread pool, and thus proportionally reducing your server throughput.
There's an all too common pattern of attempting to schedule long-running jobs by shuffling them to new threads. That's death to a web application. All threads in the pool are there to service requests, not long-running jobs, and as such, requests should be handled quickly and efficiently so that the thread can be returned to the pool in short order to field other requests. If you need to background work, you need to truly background it, by offloading to another process or even a different machine entirely.
Short of all that, maybe you're just getting more load than the server can handle in general. That's always a possibility. Perhaps you need to vertically scale your system resources (and the thread pool with it). Perhaps you need to horizontally scale by replicating this server with a load balancer in front. Given that you're running multiple different things on the same server, an easy way to horizontally scale is to simply divvy out these things to their own machines. That alone would probably help tremendously. However, scaling, either vertically or horizontally, should be your last resort. Make sure you're using resources efficiently first, before throwing more resources at your inefficient things.

Related

operating system - context switches

I have been confused about the issue of context switches between processes, given round robin scheduler of certain time slice (which is what unix/windows both use in a basic sense).
So, suppose we have 200 processes running on a single core machine. If the scheduler is using even 1ms time slice, each process would get its share every 200ms, which is probably not the case (imagine a Java high-frequency app, I would not assume it gets scheduled every 200ms to serve requests). Having said that, what am I missing in the picture?
Furthermore, java and other languages allows to put the running thread to sleep for e.g. 100ms. Am I correct in saying that this does not cause context switch, and if so, how is this achieved?
So, suppose we have 200 processes running on a single core machine. If
the scheduler is using even 1ms time slice, each process would get its
share every 200ms, which is probably not the case (imagine a Java
high-frequency app, I would not assume it gets scheduled every 200ms
to serve requests). Having said that, what am I missing in the
picture?
No, you aren't missing anything. It's the same case in the case of non-pre-emptive systems. Those having pre-emptive rights(meaning high priority as compared to other processes) can easily swap the less useful process, up to an extent that a high-priority process would run 10 times(say/assume --- actual results are totally depending on the situation and implementation) than the lowest priority process till the former doesn't produce the condition of starvation of the least priority process.
Talking about the processes of similar priority, it totally depends on the Round-Robin Algorithm which you've mentioned, though which process would be picked first is again based on the implementation. And, Windows and Unix have same process scheduling algorithms. Windows and Unix does utilise Round-Robin, but, Linux task scheduler is called Completely Fair Scheduler (CFS).
Furthermore, java and other languages allows to put the running thread
to sleep for e.g. 100ms. Am I correct in saying that this does not
cause context switch, and if so, how is this achieved?
Programming languages and libraries implement "sleep" functionality with the aid of the kernel. Without kernel-level support, they'd have to busy-wait, spinning in a tight loop, until the requested sleep duration elapsed. This would wastefully consume the processor.
Talking about the threads which are caused to sleep(Thread.sleep(long millis)) generally the following is done in most of the systems :
Suspend execution of the process and mark it as not runnable.
Set a timer for the given wait time. Systems provide hardware timers that let the kernel register to receive an interrupt at a given point in the future.
When the timer hits, mark the process as runnable.
I hope you might be aware of threading models like one to one, many to one, and many to many. So, I am not getting into much detail, jut a reference for yourself.
It might appear to you as if it increases the overhead/complexity. But, that's how threads(user-threads created in JVM) are operated upon. And, then the selection is based upon those memory models which I mentioned above. Check this Quora question and answers to that one, and please go through the best answer given by Robert-Love.
For further reading, I'd suggest you to read from Scheduling Algorithms explanation on OSDev.org and Operating System Concepts book by Galvin, Gagne, Silberschatz.

How can I speed up a Mac app processing 5000 independent tasks?

I have a long running (5-10 hours) Mac app that processes 5000 items. Each item is processed by performing a number of transforms (using Saxon), running a bunch of scripts (in Python and Racket), collecting data, and serializing it as a set of XML files, a SQLite database, and a CoreData database. Each item is completely independent from every other item.
In summary, it does a lot, takes a long time, and appears to be highly parallelizable.
After loading up all the items that need processing it, the app uses GCD to parallelize the work, using dispatch_apply:
dispatch_apply(numberOfItems, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
#autoreleasepool {
...
}
});
I'm running the app on a Mac Pro with 12 cores (24 virtual). So I would expect to have 24 items being processed at all times. However, I found through logging that the number of items being processed varies between 8 and 24. This is literally adding hours to the run time (assuming it could work on 24 items at a time).
On the one hand, perhaps GCD is really, really smart and it is already giving me the maximum throughput. But I'm worried that, because much of the work happens in scripts that are spawned by this app, maybe GCD is reasoning from incomplete information and isn't making the best decisions.
Any ideas how to improve performance? After correctness, the number one desired attribute is shortening how long it takes this app to run. I don't care about power consumption, hogging the Mac Pro, or anything else.
UPDATE: In fact, this looks alarming in the docs: "The actual number of tasks executed by a concurrent queue at any given moment is variable and can change dynamically as conditions in your application change. Many factors affect the number of tasks executed by the concurrent queues, including the number of available cores, the amount of work being done by other processes, and the number and priority of tasks in other serial dispatch queues." (emphasis added) It looks like having other processes doing work will adversely affect scheduling in the app.
It'd be nice to be able to just say "run these blocks concurrently, one per core, don't try to do anything smarter".
If you are bound and determined, you can explicitly spawn 24 threads using the NSThread API, and have each of those threads pull from a synchronized queue of work items. I would bet money that performance would get noticeably worse.
GCD works at its most efficient when the work items submitted to it never block. That said, the workload you're describing is rather complex and rife with opportunities for your threads to block. For starters, you're spawning a bunch of other processes. Right here, this means that you're already relying on the OS to divvy up time/resources between your master task and these slave tasks. Other than setting the OS priority of each subprocess, the OS scheduler has no way to know which processes are more important than others, and by default, your subprocesses are going to have the same priority as their parent. That said, it doesn't sound like you have anything to gain by tweaking process priorities. I'm assuming you're blocking the master task thread that's waiting for the slave tasks to complete. That is effectively parking that thread -- it can do no useful work. But like I said, I don't think there's much to be gained by tweaking the OS priorities of your slave tasks, because this really sounds like it's an I/O bound workflow...
You go on to describe three I/O-heavy operations ("serializing it as a set of XML files, a SQLite database, and a CoreData database.") So now you have all these different threads and processes vying for what is presumably a shared bulk storage device. (i.e. unless you're writing to 24 different databases, on 24 separate hard drives, one for each core, your process is ultimately going to be serialized at the disk accesses.) Even if you had 24 different hard drives, writing to a hard drive (even an SSD) is comparatively slow. Your threads are going to be taken off of the CPU they were running on (so that another thread that's waiting can run) for virtually any blocking disk write.
If you wanted to maximize the performance you're getting out of GCD, you would probably want to rewrite all the stuff you're doing in subtasks in C/C++/Objective-C, bringing them in-process, and then conducting all the associated I/O using dispatch_io primitives. For API where you don't control the low-level reads and writes, you would want to carefully manage and tune your workload to optimize it for the hardware you have. For instance, if you have a bunch of stuff to write to a single, shared SQLite database, there's no point in ever having more than one thread trying to write to that database at once. You'd be better off making one thread (or a serial GCD queue) to write to SQLite and submitting tasks to that after pre-processing is done.
I could go on for quite a while here, but the bottom line is that you've got a complex, seemingly I/O bound workflow here. At the highest-level, CPU utilization or "number of running threads" is going to be a particularly poor measure of performance for such a task. By using sub-processes (i.e. scripts), you're putting a lot of control into the hands of the OS, which knows effectively nothing about your workload a priori, and therefore can do nothing except use its general scheduler to divvy up resources. GCD's opaque thread pool management is really the least of your problems.
On a practical level, if you want to speed things up, go buy multiple, faster (i.e. SSD) hard drives, and rework your task/workflow to utilize them separately and in parallel. I suspect that would yield the biggest bang for your buck (for some equivalence relation of time == money == hardware.)

Grand Central Dispatch vs NSThreads?

I searched a variety of sources but don't really understand the difference between using NSThreads and GCD. I'm completely new to the OS X platform so I might be completely misinterpreting this.
From what I read online, GCD seems to do the exact same thing as basic threads (POSIX, NSThreads etc.) while adding much more technical jargon ("blocks"). It seems to just overcomplicate the basic thread creation system (create thread, run function).
What exactly is GCD and why would it ever be preferred over traditional threading? When should traditional threads be used rather than GCD? And finally is there a reason for GCD's strange syntax? ("blocks" instead of simply calling functions).
I am on Mac OS X 10.6.8 Snow Leopard and I am not programming for iOS - I am programming for Macs. I am using Xcode 3.6.8 in Cocoa, creating a GUI application.
Advantages of Dispatch
The advantages of dispatch are mostly outlined here:
Migrating Away from Threads
The idea is that you eliminate work on your part, since the paradigm fits MOST code more easily.
It reduces the memory penalty your application pays for storing thread stacks in the application’s memory space.
It eliminates the code needed to create and configure your threads.
It eliminates the code needed to manage and schedule work on threads.
It simplifies the code you have to write.
Empirically, using GCD-type locking instead of #synchronized is about 80% faster or more, though micro-benchmarks may be deceiving. Read more here, though I think the advice to go async with writes does not apply in many cases, and it's slower (but it's asynchronous).
Advantages of Threads
Why would you continue to use Threads? From the same document:
It is important to remember that queues are not a panacea for
replacing threads. The asynchronous programming model offered by
queues is appropriate in situations where latency is not an issue.
Even though queues offer ways to configure the execution priority of
tasks in the queue, higher execution priorities do not guarantee the
execution of tasks at specific times. Therefore, threads are still a
more appropriate choice in cases where you need minimal latency, such
as in audio and video playback.
Another place where I haven't personally found an ideal solution using queues is daemon processes that need to be constantly rescheduled. Not that you cannot reschedule them, but looping within a NSThread method is simpler (I think). Edit: Now I'm convinced that even in this context, GCD-style locking would be faster, and you could also do a loop within a GCD-dispatched operation.
Blocks in Objective-C?
Blocks are really horrible in Objective-C due to the awful syntax (though Xcode can sometimes help with autocompletion, at least). If you look at blocks in Ruby (or any other language, pretty much) you'll see how simple and elegant they are for dispatching operations. I'd say that you'll get used to the Objective-C syntax, but I really think that you'll get used to copying from your examples a lot :)
You might find my examples from here to be helpful, or just distracting. Not sure.
While the answers so far are about the context of threads vs GCD inside the domain of a single application and the differences it has for programming, the reason you should always prefer GCD is because of multitasking environments (since you are on MacOSX and not iOS). Threads are ok if your application is running alone on your machine. Say, you have a video edition program and want to apply some effect to the video. The render is going to take 10 minutes on a machine with eight cores. Fine.
Now, while the video app is churning in the background, you open an image edition program and play with some high resolution image, decide to apply some special image filter and your image application being clever detects you have eight cores and starts eight threads to process the image. Nice isn't it? Except that's terrible for performance. The image edition app doesn't know anything about the video app (and vice versa) and therefore both will request their respectively optimum number of threads. And there will be pain and blood while the cores try to switch from one thread to another, because to avoid starvation the CPU will eventually let all threads run, even though in this situation it would be more optimal to run only 4 threads for the video app and 4 threads for the image app.
For a more detailed reference, take a look at http://deusty.blogspot.com/2010/11/introducing-gcd-based-cocoahttpserver.html where you can see a benchmark of an HTTP server using GCD versus thread, and see how it scales. Once you understand the problem threads have for multicore machines in multi-app environments, you will always want to use GCD, simply because threads are not always optimal, while GCD potentially can be since the OS can scale thread usage per app depending on load.
Please, remember we won't have more GHz in our machines any time soon. From now on we will only have more cores, so it's your duty to use the best tool for this environment, and that is GCD.
Blocks allow for passing a block of code to execute. Once you get past the "strange syntax", they are quite powerful.
GCD also uses queues which if used properly can help with lock free concurrency if the code executing in the separate queues are isolated. It's a simpler way to offer background and concurrency while minimizing the chance for deadlocks (if used right).
The "strange syntax" is because they chose the caret (^) because it was one of the few symbols that wasn't overloaded as an operator in C++
See:
https://developer.apple.com/library/ios/#documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Although you might think rewriting your code for dispatch queues would
be difficult, it is often easier to write code for dispatch queues
than it is to write code for threads. The key to writing your code is
to design tasks that are self-contained and able to run
asynchronously. (This is actually true for both threads and dispatch
queues.)
...
Although you would be right to point out that two tasks running in a
serial queue do not run concurrently, you have to remember that if two
threads take a lock at the same time, any concurrency offered by the
threads is lost or significantly reduced. More importantly, the
threaded model requires the creation of two threads, which take up
both kernel and user-space memory. Dispatch queues do not pay the same
memory penalty for their threads, and the threads they do use are kept
busy and not blocked.
GCD (Grand Central Dispatch): GCD provides and manages FIFO queues to which your application can submit tasks in the form of block objects. Work submitted to dispatch queues are executed on a pool of threads fully managed by the system. No guarantee is made as to the thread on which a task executes. Why GCD over threads :
How much work your CPU cores are doing
How many CPU cores you have.
How much threads should be spawned.
If GCD needs it can go down into the kernel and communicate about resources, thus better scheduling.
Less load on kernel and better sync with OS
GCD uses existing threads from thread pool instead of creating and then destroying.
Best advantage of the system’s hardware resources, while allowing the operating system to balance the load of all the programs currently running along with considerations like heating and battery life.
I have shared my experience with threads, operating system and GCD AT http://iosdose.com

Rails controllers, safe to fork a block and then return?

Short and simple question:
fork { something_that_takes_a_few_seconds_and_doesnt_concern_the_user }
respond_to ...
Is there any reason not to do this sort of thing in a rails app? In PHP we're currently relying on external queueing systems like beanstalk or Amazon's SQS coupled with an asynchronous task worker that's pulling things off the queue to run in the background. A simple fork would fit better in many cases, depending on the complexity of the task.
Yes, if you get too many requests or this process takes longer than expected you can easily blast your CPU and memory and the Rails app is going to start raise out of memory errors.
There are many dead simple background workers solutions for Rails apps, like Resque, and you can benefit greatly in using such a solution instead of forking, as it would always happen out of your webapp environment and even if something nasty happens it would be confined to the worker machine (and not your application server).
Also, I've seen a lot of people saying that once you move to a virtualized word, memory fragmentation (due to spawning and killing many processes) might take their toll and make you go slower.

Is there an ideal number of network operations for iPhone OS?

I'm using NSOperation and NSOperationQueue to handle all of my networking threads so my interface can remain responsive while handling data transfer over the internet. Currently, I've got my operation queue set to a maximum concurrent operation count of 5, and it seems to work well.
I'm wondering, though, if there is a more ideal number of concurrent network operations that would best maximize the available resources without choking the hardware. Are there any recommendations, or steps I might take to measure and find out for myself?
Given the iPhone (currently) runs a single core, I would guess 5 is around the right number.
But the only way to be sure would be to instrument it and find out what the usage looks like (CPU, Memory and Network). Network usage you could get based on the data transferred - but its hard to know what a reaspnable usage would be. I'm not sure if it is possible to get CPU/Memory statistics from the iPhone.
If you are doing large transfers, then more connections probably wont help much. If you are doing lots of small transfers, then more connections will help work around the back and forth of setting up and tearing down the connection.