Why are I/O-bound processes faster? - process

Typically the CPU runs for a while without stopping, then a system call is made to read from a file or write to a file. When the system call completes, the CPU computes again until it needs more data or has to write more data, and so on.
Some processes spend most of their time computing, while others spend most of their time waiting for I/O. The former are called compute-bound; the latter are called I/O-bound. Compute-bound processes typically have long CPU bursts and thus infrequent I/O waits, whereas I/O-bound processes have short CPU bursts and thus frequent I/O waits.
As CPU gets faster, processes tend to get more I/O-bound.
Why and how?
Edited:
It's not a homework question. I was studying the book (Modern Operating Systems by Tanenbaum) and came across this statement there. I didn't get the concept, which is why I am asking here. Please don't tag this question as homework.

With a faster CPU, the amount of time spent using the CPU will decrease (given the same code), but the amount of time spent doing I/O will stay the same (given the same I/O performance), so the percentage of time spent on I/O will increase, and I/O will become the bottleneck.
That does not mean that "I/O bound processes are faster".

As CPU gets faster, processes tend to get more I/O-bound.
What it's trying to say is:
As CPU gets faster, processes tend to not increase in speed in proportion to CPU speed because they get more I/O-bound.
Which means that I/O bound processes are slower than non-I/O bound processes, not faster.
Why is this the case? Well, when only the CPU speed increases, the rest of your system doesn't get any faster. Your hard disk is still the same speed, your network card is still the same speed, even your RAM is still the same speed*. So as the CPU gets faster, the limiting factor for your program becomes less and less the CPU speed and more and more how slow your I/O is. In other words, programs naturally shift towards being more and more I/O-bound. In other words: ...as CPU gets faster, processes tend to get more I/O-bound.
*note: Historically everything else also improved in speed along with the CPU, just not as much. For example, CPUs went from 4 MHz to 2 GHz, a 500x speed increase, whereas hard disk speed went from around 1 MB/s to 70 MB/s, a lame 70x increase.
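To make the shift concrete, here is a back-of-the-envelope sketch in Python with made-up numbers: the CPU portion of a job shrinks as the CPU speeds up, the I/O portion doesn't, so I/O's share of the total runtime grows.

```python
# Made-up numbers: a job that spends 9 s on CPU and 1 s on I/O today.
cpu_time, io_time = 9.0, 1.0

for speedup in (1, 10, 100):
    total = cpu_time / speedup + io_time     # I/O time is unaffected by CPU speed
    print(f"{speedup:>3}x CPU: total {total:5.2f}s, I/O share {io_time / total:.0%}")

# At 1x CPU, I/O is 10% of the runtime; at 100x CPU it is ~92% - the job is now I/O-bound.
```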

Related

Does low GPU utilization indicate bad fit for GPU acceleration?

I'm running some GPU-accelerated PyTorch code and training it against a custom dataset, but while monitoring the state of my workstation during the process, I see GPU usage along the following lines:
I have never written my own GPU primitives, but I have a long history of doing low-level optimizations for CPU-intensive workloads and my experience there makes me concerned that while pytorch/torchvision are offloading the work to the GPU, it may not be an ideal workload for GPU acceleration.
When optimizing CPU-bound code, the goal is to try to get the CPU to perform as much (meaningful) work as possible in a unit of time: a supposedly CPU-bound task that shows only 20% CPU utilization (of a single core or of all cores, depending on whether the task is parallelizable) is a task that is not being performed efficiently, because the CPU is sitting idle when ideally it would be working towards your goal. Low CPU usage means that something other than number crunching is taking up your wall clock time - inefficient locking, heavy context switching, pipeline flushes, blocking I/O in the main loop, etc. - which prevents the workload from properly saturating the CPU.
When looking at the GPU utilization in the chart above, and again speaking as a complete novice when it comes to GPU utilization, it strikes me that the GPU usage is extremely low and appears to be limited by the rate at which data is being copied into the GPU memory. Is this assumption correct? I would expect to see a spike in copy (to GPU) followed by an extended period of calculations/transforms, followed by a brief copy (back from the GPU), repeated ad infinitum.
I notice that despite the low (albeit constant) copy utilization, the GPU memory is constantly peaking at the 8GB limit. Can I assume the workload is being limited by the low GPU memory available (i.e. not maxing out the copy bandwidth because there's only so much that can be copied)?
Does that mean this is a workload better suited for the CPU (in this particular case with this RTX 2080 and in general with any card)?
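One way to test the copy-bound hypothesis is to time data loading/copying separately from GPU compute per batch; if loading dominates, the GPU is being starved rather than being a bad fit. The sketch below uses hypothetical names (`loader`, `model`) standing in for the objects in your training script and assumes the usual (input, target) batches with a classification loss.

```python
import time
import torch

def profile_epoch(loader, model, device="cuda"):
    """Rough split of wall-clock time between data loading and GPU compute (profiling only, no optimizer step)."""
    load_time = compute_time = 0.0
    t0 = time.perf_counter()
    for batch, target in loader:                     # assumes (input, target) batches
        t1 = time.perf_counter()
        load_time += t1 - t0
        batch = batch.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)
        loss = torch.nn.functional.cross_entropy(model(batch), target)
        loss.backward()
        torch.cuda.synchronize()                     # wait for queued GPU work before timing
        t0 = time.perf_counter()
        compute_time += t0 - t1
    print(f"data loading: {load_time:.1f}s, GPU compute: {compute_time:.1f}s")
```

If the loading side dominates, the usual first knobs are more DataLoader worker processes (num_workers) and pinned host memory (pin_memory=True), rather than moving the workload back to the CPU.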

Scheduling on multiple cores with each list in each processor vs one list that all processes share

I have a question about how scheduling is done. I know that when a system has multiple CPUs, scheduling is usually done on a per-processor basis. Each processor runs its own scheduler, accessing a ready list of only those processes that are running on it.
So what would be the pros and cons when compared to an approach where there is a single ready list that all processors share?
Like, what issues are there when assigning processes to processors, and what issues might be caused if a process always lives on one processor? In terms of mutex locking of the data structures and the time spent waiting for the locks, are there any issues with that?
Generally there is one, giant problem when it comes to multi-core CPU systems - cache coherency.
What does cache coherency mean?
Access to main memory is hard. Depending on the memory, it can take hundreds of CPU cycles to access some data in RAM - that's a whole lot of time the CPU is doing no useful work. It'd be significantly better if we minimized this time as much as possible, but the hardware required to do this is expensive, and typically must be in very close proximity to the CPU itself (we're talking within a few millimeters of the core).
This is where the cache comes in. The cache keeps a small subset of main memory in close proximity to the core, allowing accesses to this memory to be several orders of magnitude faster than main memory. For reading this is a simple process - if the memory is in the cache, read from cache, otherwise read from main memory.
Writing is a bit more tricky. Writing to the cache is fast, but now main memory still holds the original value. We can update that memory, but that takes a while, sometimes even longer than reading depending on the memory type and board layout. How do we minimize this as well?
The most common way to do so is with a write-back cache which, when written to, flushes the data contained in the cache back to main memory at some later point. Depending on the CPU architecture, this could be done when the core is otherwise idle, interleaved with other instructions, or on a timer (this is up to the designer of the CPU).
Why is this a problem?
In a single-core system, there is only one path for reads and writes to take - they must go through the cache on their way to main memory - so the programs running on the CPU only see what they expect: if they read a value, modify it, then read it back, they see the modified value.
In a multi-core system, however, there are multiple paths for data to take when going back to main memory, depending on the CPU that issued the read or write. This presents a problem with write-back caching, since that "later time" opens a window in which one CPU might read memory that hasn't yet been updated.
Imagine a dual core system. A job starts on CPU 0 and reads a memory block. Since the memory block isn't in CPU 0's cache, it's read from main memory. Later, the job writes to that memory. Since the cache is write-back, that write will be made to CPU 0's cache and flushed back to main memory later. If CPU 1 then attempts to read that same memory, CPU 1 will attempt to read from main memory again, since it isn't in the cache of CPU 1. But the modification from CPU 0 hasn't left CPU 0's cache yet, so the data you get back is not valid - your modification hasn't gone through yet. Your program could now break in subtle, unpredictable, and potentially devastating ways.
Because of this, cache synchronization is done to alleviate the problem. Bus snooping, directory-based coherency protocols, and other hardware mechanisms exist to keep the caches of multiple CPUs consistent. All of these methods have one common drawback - they force the CPU to spend time on bookkeeping rather than actual, useful computation.
The best method of avoiding this is actually keeping processes on one processor as much as possible. If the process doesn't migrate between CPUs, you don't need to keep the caches synchronized, as the other CPUs won't be accessing that memory at the same time (unless the memory is shared between multiple processes, but we'll not go into that here).
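On Linux you can even enforce this kind of affinity from user space; a minimal sketch (Linux-only, since os.sched_setaffinity is not available on every platform):

```python
import os

# Pin the current process (pid 0 means "this process") to CPU 0, so its
# working set stays warm in that core's cache instead of migrating around.
os.sched_setaffinity(0, {0})
print("now restricted to CPUs:", os.sched_getaffinity(0))
```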
Now we come to the issue of how to design our scheduler, and the three main problems there - avoiding process migration, maximizing CPU utilization, and scalability.
Single Queue Multiprocessor scheduling (SQMS)
Single Queue Multiprocessor schedulers are what you suggested - one queue containing available processes, and each core accesses the queue to get the next job to run. This is fairly simple to implement, but has a couple of major drawbacks - it can cause a whole lot of process migration, and does not scale well to larger systems with more cores.
Imagine a system with four cores and five jobs, each of which takes about the same amount of time to run, and each of which is rescheduled when completed. On the first run through, CPU 0 takes job A, CPU 1 takes B, CPU 2 takes C, and CPU 3 takes D, while E is left on the queue. Let's then say CPU 0 finishes job A, puts it on the back of the shared queue, and looks for another job to do. E is currently at the front of the queue, so CPU 0 takes E and goes on. Now, CPU 1 finishes job B, puts B on the back of the queue, and looks for the next job. It now sees A, and starts running A. But since A was on CPU 0 before, CPU 1 now needs to sync its cache with CPU 0, resulting in lost time for both CPU 0 and CPU 1.
In addition, if two CPUs both finish their jobs at the same time, they both need to write to the shared list, which has to be done sequentially or the list will get corrupted (just like in multi-threading). This requires that one of the two CPUs wait for the other to finish its writes and sync its cache back to main memory, since the list is in shared memory! This problem gets worse the more CPUs you add, resulting in major problems on large servers (where there can be 16 or even 32 CPU cores), and being completely unusable on supercomputers (some of which have upwards of 1000 cores).
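A toy sketch of the single-queue idea (not a real OS scheduler; the job objects and "CPU ids" are just placeholders) makes the shared lock, and therefore the contention, explicit:

```python
import collections
import threading

class SharedQueueScheduler:
    """SQMS sketch: every core pulls its next job from one shared, locked queue."""

    def __init__(self, jobs):
        self.queue = collections.deque(jobs)
        self.lock = threading.Lock()          # every core contends on this one lock

    def next_job(self, cpu_id):
        with self.lock:                       # serializes all cores at every scheduling point
            return self.queue.popleft() if self.queue else None

    def requeue(self, job):
        with self.lock:
            self.queue.append(job)
```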
Multi-queue Multiprocessor Scheduling (MQMS)
Multi-queue multiprocessor schedulers have a single queue per CPU core, ensuring that all local core scheduling can be done without having to take a shared lock or synchronize the cache. This allows for systems with hundreds of cores to operate without interfering with one another at every scheduling interval, which can happen hundreds of times a second.
The main issues with MQMS are CPU utilization, where one or more CPU cores end up doing the majority of the work, and scheduling fairness, where one of the processes on the computer is scheduled more often than other processes with the same priority.
CPU utilization is the biggest issue - no CPU should ever be idle while a job is waiting to run. However, if all CPUs are busy when a job is queued on one of them and a different CPU later becomes idle, the idle CPU should "steal" the queued job to ensure every CPU is doing real work. Doing so, however, requires locking both CPUs' queues and potentially syncing their caches, which can eat into whatever speedup stealing the job would have given us.
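The per-core-queue idea with occasional stealing can be sketched like this (again a toy model, not a real scheduler; the victim selection is deliberately naive):

```python
import collections
import threading

class PerCpuScheduler:
    """MQMS sketch: one run queue per core, stealing only when the local queue is empty."""

    def __init__(self, num_cpus):
        self.queues = [collections.deque() for _ in range(num_cpus)]
        self.locks = [threading.Lock() for _ in range(num_cpus)]

    def enqueue(self, cpu, job):
        with self.locks[cpu]:                 # local lock only: no global contention
            self.queues[cpu].append(job)

    def next_job(self, cpu):
        with self.locks[cpu]:
            if self.queues[cpu]:
                return self.queues[cpu].popleft()
        # Local queue is empty: steal from the longest other queue. This is the
        # only time we touch another core's lock (and drag its cache lines over).
        victim = max(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        if victim == cpu:
            return None
        with self.locks[victim]:
            return self.queues[victim].pop() if self.queues[victim] else None
```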
In conclusion
Both methods exist in the wild - Linux actually has three different mainstream scheduler algorithms, one of which is an SQMS. The choice of scheduler really depends on the way the scheduler is implemented, the hardware you plan to run it on, and the types of jobs you intend to run. If you know you only have two or four cores to run jobs, SQMS is likely perfectly adequate. If you're running a supercomputer where overhead is a major concern, then an MQMS might be the way to go. For a desktop user - just trust the distro, whether that's a Linux OS, Mac, or Windows. Generally, the programmers for the operating system you've got have done their homework on exactly what scheduler will be the best option for the typical use case of their system.
This whitepaper describes the differences between the two types of scheduling algorithms in place.

maximum safe CPU utilization for embedded system

What is the maximum safe CPU utilization for an embedded system running critical applications? We are measuring performance with top. Is 50-75% safe?
Real-Time embedded systems are designed to meet Real-Time constraints, for example:
Voltage acquisition and processing every 500 us (let's say sensor monitoring).
Audio buffer processing every 5.8 ms (4ms processing).
Serial Command Acknowledgement within 3ms.
Thanks to a Real-Time Operating System (RTOS), which is "preemptive" (the scheduler can suspend a task to execute one with a higher priority), you can meet those constraints even at 100% CPU usage: the CPU will execute the high-priority task and then resume whatever it was doing.
But this does not mean you will meet the constraints no matter what. A few tips:
High-priority task execution must be as short as possible (by calculating Execution Time / Occurrence you can estimate the CPU usage).
If the estimated CPU usage is too high, look for code optimization, a hardware equivalent (hardware CRC, DMA, ...), or a second microprocessor.
Stress test your device and measure if your real-time constraints are met.
For the previous example:
Audio processing should be the lowest priority
Serial Acknowledgement/Voltage Acquisition the highest
A stress test can be done by issuing serial commands and checking for missed audio buffers, missed analog voltage events, and so on. You can also vary the CPU clock frequency: your device might meet its constraints at a much lower clock frequency, reducing power consumption.
To answer your question, 50-75% and even 100% CPU usage is safe as long as you meet your real-time constraints, but bear in mind that if you want to add functionality later on, you will not have much room for it at 98% CPU usage.
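To make the Execution Time / Occurrence tip concrete, here is a small sketch with the example tasks above; the 4 ms of work every 5.8 ms for audio comes from the answer, while the other execution times are invented for illustration:

```python
# Per-task CPU usage = execution time / period (same time unit for both).
tasks = {
    "voltage_acquisition": (0.05, 0.5),   # 50 us every 500 us (assumed execution time)
    "audio_processing":    (4.0,  5.8),   # 4 ms of work every 5.8 ms (from the example)
    "serial_ack":          (0.2,  3.0),   # 0.2 ms within every 3 ms (assumed)
}

utilization = sum(exec_time / period for exec_time, period in tasks.values())
print(f"estimated CPU utilization: {utilization:.0%}")   # about 86% with these numbers
```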
In rate monotonic scheduling, mathematical analysis shows that real-time tasks (that is, tasks with specific real-time deadlines) are schedulable when the utilisation is below about 70% (and priorities are appropriately assigned). If you have accurate statistics for all tasks and they are deterministic, this can be as high as 85% and still guarantee schedulability.
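For reference, the roughly 70% figure is the classic Liu and Layland bound for rate monotonic scheduling: $n$ periodic tasks with worst-case execution times $C_i$ and periods $T_i$ are guaranteed schedulable when

$$U = \sum_{i=1}^{n} \frac{C_i}{T_i} \le n\left(2^{1/n} - 1\right),$$

which tends to $\ln 2 \approx 0.693$ as $n$ grows, hence the rule of thumb.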
Note however that the utilisation applies only to tasks with hard-real-time deadlines. Background tasks may utilise the remaining CPU time all the time without missing deadlines.
Assuming by "CPU utilization" you are referring to time spent executing code other than an idle loop, in a system with a preemptive priority based scheduler, then ... it depends.
First of all there is a measurement problem: if your utilisation sampling period is sufficiently short, the reading will often switch between 100% and zero, whereas if it is very long you get a good average but do not know whether utilisation was high for long enough to starve a lower-priority task to the point that it misses its deadlines. It is kind of the wrong question, because any practical utilisation sampling rate will typically be much longer than the shortest deadline, so it is at best a qualitative rather than a quantitative measure. It does not tell you much that is useful in critical cases.
Secondly there is the issue of what you are measuring. While it is common for an RTOS to have a means to measure CPU utilisation, it is measuring utilisation by all tasks, including those that have no deadlines.
If the utilisation reaches 100% in the lowest-priority task, for example, there is no harm to schedulability - the low-priority task is in this sense no different from the idle loop. It may have consequences for power consumption in systems that normally enter a low-power mode in the idle loop.
If a higher-priority task takes 100% CPU utilisation such that a deadline for a lower-priority task is missed, then your system will fail (in the sense that deadlines will be missed - the consequences are application specific).
Simply measuring CPU utilisation is insufficient, but if your CPU is at 100% utilisation and that utilisation is not simply some background task in the lowest-priority thread, it is probably not schedulable - there will be stuff not getting done. The consequences are of course application dependent.
Whilst having a low-priority background thread consume 100% CPU may do no harm, it does render the ability to measure CPU utilisation entirely useless. If the background task is preempted and for some reason the higher-priority task takes 100%, you may have no way of detecting the issue, and the background task (and any tasks lower than the preempting task) will not get done. It is better therefore to ensure that you have some idle time, so that you can detect abnormal scheduling behaviour (if you have no other means to do so).
One common solution to the non-yielding background task problem is to perform such tasks in the idle loop. Many RTOSes allow you to insert idle-loop hooks to do this. The background task is then preemptable but not included in the utilisation measurement. In that case, of course, you cannot also have a low-priority task that does not yield, because your idle loop is now doing real work.
When assigning priorities, the task with the shortest execution time should have the highest priority. Moreover, the execution time should be more deterministic in higher-priority tasks - that is, if it takes 100 us to run, it should, within some reasonable bounds, always take that long. If some processing is variable, such that it takes say 100 us most of the time but must occasionally do something that requires say 100 ms, then the 100 ms work should be passed off to some lower-priority task (or the priority of the task temporarily lowered, but that pattern may be hard to manage and predict, and may cause subsequent deadlines or events to be missed).
So if you are a bit vague about your task periods and deadlines, a good rule of thumb is to keep it below 70%, but not to include non-real-time background tasks in the measurement.

Why CPU-bound is better with blocking I/O and I/O-bound is better with non-blocking I/O

I have been told that for I/O-bound applications, non-blocking I/O is better, while for CPU-bound applications, blocking I/O is much better. I could not find the reason for such a statement. I tried Google, but the few articles I found just touch on the topic without much detail. Can someone provide an in-depth reason for it?
With this, I also want to understand the shortcomings of non-blocking I/O.
After going through another thread here, one reason I could relate to was that only if the I/O is heavy enough do we see significant performance improvements using non-blocking I/O. It also states that if the number of I/O operations is large (a typical web application scenario, where many requests are waiting on I/O), then we also see significant improvements using non-blocking I/O.
Thus my questions boil down to the following list:
In the case of a CPU-intensive application, is it better to start a thread pool (or an ExecutionContext in Scala) and divide the work between the threads of the pool? (I guess it definitely has an advantage over spawning your own threads and dividing the work manually. Also, using async concepts like futures, even CPU-intensive work can return its result via callbacks, hence avoiding the issues related to blocking in multithreading?) And if there is I/O which is fast enough, should the I/O be done with blocking calls on the threads of the thread pool itself? Am I right?
What are the actual shortcomings or overheads of using non-blocking I/O, technically? Why don't we see much performance gain from non-blocking I/O if the I/O is fast enough or if there are very few I/O operations required? Eventually it is the OS that handles the I/O. Irrespective of whether the number of I/Os is large or small, let the OS handle that pain. What makes the difference here?
From a programmer's perspective, blocking I/O is easier to use than non-blocking I/O. You just call the read/write function, and when it returns you are done. With non-blocking I/O you need to check if you can read/write, then read/write, and then check the return values. If not everything was read or written, you need mechanisms to read or write again, now or later, when the operation can be completed.
Regarding performance: non-blocking I/O in one thread is not faster than blocking I/O in one thread. The speed of the I/O operation is determined by the device (for example the hard disk) that is read from or written to. The speed is not determined by someone waiting for it (blocking) or not waiting for it (non-blocking). Also, if you call a blocking I/O function, the OS can do the blocking quite efficiently. If you need to do the blocking/waiting in the application, you might do it nearly as well as the OS, but you might also do it worse.
So why do programmers make their life harder and implement non-blocking I/O? Because, and that is the key point, their program has more to do than that single I/O operation. When using blocking I/O you have to wait until the I/O is done. When using non-blocking I/O you can do some calculations while the I/O is in progress. Of course during non-blocking I/O you can also trigger other I/O (blocking or non-blocking).
Another approach to avoid blocking is to throw in more threads with blocking I/O, but as said in the SO post that you linked, threads come with a cost. That cost is higher than the cost of (OS-supported) non-blocking I/O.
If you have an application with massive I/O but only low CPU usage, like a web server with lots of clients in parallel, then use a few threads with non-blocking I/O. With blocking I/O you'd end up with a lot of threads -> high cost, so use only a few threads -> which requires non-blocking I/O.
If you have an application that is CPU intensive, like a program that reads a file, does intensive calculations on the complete data, and writes the result to a file, then 99% of the time will be spent in the CPU-intensive part. So create a few threads (for example one per processor) and do as much of the calculation as possible in parallel. Regarding the I/O, you'll probably stick to one main thread with blocking I/O, because it is easier to implement and because the main thread itself has nothing to do in parallel (given that the calculations are done in the other threads).
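A minimal sketch of that layout in Python, with made-up names ("input.bin", heavy_transform) and process workers instead of threads (so the CPython GIL does not serialize the number crunching): plain blocking reads in the main process, heavy math fanned out to one worker per core.

```python
from concurrent.futures import ProcessPoolExecutor

def heavy_transform(chunk):
    # Placeholder for the real CPU-intensive calculation.
    return sum(b * b for b in chunk)

def main(path="input.bin", workers=4):
    with open(path, "rb") as f, ProcessPoolExecutor(max_workers=workers) as pool:
        chunks = iter(lambda: f.read(1 << 20), b"")          # simple blocking 1 MiB reads
        return list(pool.map(heavy_transform, chunks))       # crunch chunks in parallel workers

if __name__ == "__main__":
    main()
```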
If you have an application that is both CPU intensive and I/O intensive, then you'd also use a few threads and non-blocking I/O. You could think of a web server with lots of clients and page requests where you are doing intensive calculations in a CGI script. While waiting for I/O on one connection, the program could calculate the result for another connection. Or think of a program that reads a large file and does intensive calculations on chunks of the file (like calculating an average value or adding 1 to all values). In that case you could use non-blocking reads, and while waiting for the next read to finish you could already calculate on the data that is available. If the result file is only a small condensed value (like an average), you might use a blocking write for the result. If the result file is as large as the input file and is something like "all values +1", then you could write back the results non-blocking, and while the write is being done you are free to do calculations on the next block.
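Here is a minimal sketch of that "compute while the next read is in flight" pattern using Python's asyncio; the host, request, and checksum-style calculation are invented purely for illustration.

```python
import asyncio

async def checksum_stream(host="example.com", port=80):
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    await writer.drain()

    total = 0
    chunk = await reader.read(64 * 1024)
    while chunk:
        next_read = asyncio.create_task(reader.read(64 * 1024))  # start the next read now
        total += sum(chunk)            # crunch the current chunk while that read is in flight
        chunk = await next_read
    writer.close()
    await writer.wait_closed()
    return total

# asyncio.run(checksum_stream())
```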

In general, how expensive is calling an external program?

I know external programs can be called, but I don't know how expensive it is compared to, say, calling a subroutine. By the cost of calling, I mean the overhead of starting the program, rather than the cost of executing the program's code itself. I know the cost probably varies greatly depending on the language and operating system used and other factors, but I would appreciate some ballpark estimates.
I am asking to gauge the plausibility of emulating self-modifying code in languages that don't allow it, by having processes modify other processes.
Like I said in my comment above, perhaps it would be best if you simply tried it and did some benchmarking. I'd expect this to depend primarily on the OS you're using.
That being said, starting a new process generally is many orders of magnitude slower than calling a subroutine (I'm tempted to say something like "at least a million times slower", but I couldn't back up such a claim with any measurements).
Possible reasons why starting a process is much slower:
Disk I/O (the OS has to load the process image file into memory) — this is going to be a big factor because I/O is many orders of magnitude slower than a simple CPU jump/call instruction.
To give you a rough idea of the orders of magnitude involved, let me quote this 2011 blog article (which is about memory access vs HDD access, not CPU jump instruction vs HDD access):
"Disk latency is around 13ms, but it depends on the quality and rotational speed of the hard drive. RAM latency is around 83 nanoseconds. How big is the difference? If RAM was an F-18 Hornet with a max speed of 1,190 mph (more than 1.5x the speed of sound), disk access speed is a banana slug with a top speed of 0.007 mph."
You do the math.
allocations of memory & other kernel data structures
laying out the process image in memory & performing relocations
creation of a new OS thread
context switches
etc.
Apparently, all of the above points mean that your OS is likely to perform lots of internal subroutine calls to start a new process, so doing just one subroutine call yourself instead of having the OS do hundreds of these is bound to be comparatively super-cheap.
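If you want a rough number for your own machine, a quick and dirty benchmark along these lines (Python, using /bin/true as a trivial program; adjust the command for your OS) will show the gap directly:

```python
import subprocess
import timeit

def subroutine():
    return 42

# Average cost of a plain function call vs. spawning a trivial external program.
call_time = timeit.timeit(subroutine, number=100_000) / 100_000
spawn_time = timeit.timeit(lambda: subprocess.run(["/bin/true"]), number=100) / 100

print(f"subroutine call: {call_time * 1e9:.0f} ns")
print(f"process spawn:   {spawn_time * 1e6:.0f} us")
```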