"Processes may be described as physically concurrent and logically
concurrent processes, the distinction between them is analogous to
that between real and virtual processors"
What does that mean?
What is the difference between physically concurrent and logically concurrent processes?
Suppose you have a single-core processor and your code is multithreaded. It will appear as if the threads are running in parallel on multiple processors, but in reality they all share the single processor, with a quantum of time allotted to each thread in a round-robin manner. The processes (or threads) seem to be running concurrently, but in reality the system context-switches between them many times to simulate simultaneous execution. That is logical concurrency.
If, on the other hand, your processor has multiple cores (or you have multiple processors), your multithreaded code executes truly in parallel on the different cores (or processors) at the same time. That is physical concurrency: the processes really are running simultaneously.
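As a minimal sketch of the single-core case (my addition, not part of the original answer), the C program below asks how many hardware processors are online and then creates more threads than that; the extra threads can only run logically concurrently, time-sliced onto the real cores by the OS. The worker body and the count of 8 are arbitrary choices for illustration.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Trivial thread body: just report which logical worker is running. */
static void *worker(void *arg) {
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    /* Physical concurrency available: one thread can truly run per online processor. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld hardware processors online\n", cores);

    /* Anything beyond 'cores' threads is only logically concurrent:
       the OS context-switches them onto the real processors. */
    enum { NTHREADS = 8 };
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; ++i)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}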
I hope it clears your doubt! Feel free to ask in case of further queries.
I am using dynamic scheduling for the loop iterations, but when the work in each iteration is too small, or when there is a huge number of threads, some threads do no work at all. E.g. there are 100 iterations and 90 threads; I want every thread to do at least one iteration, and the remaining 10 iterations can be distributed to the threads that have finished their share. How can I do that?
You cannot force the OpenMP runtime to do this. However, you can give hints to the OpenMP runtime so that it will likely do it, when it decides it is possible, at the cost of a higher overhead.
One way is to specify the granularity of the dynamically scheduled loop.
Here is an example:
#pragma omp parallel for schedule(dynamic,1)
for(int i=0 ; i<100 ; ++i)
    compute(i);
With such code, the runtime is free to share the work evenly between threads (using a work-sharing scheduler) or to let threads steal the work of a master thread that drives the parallel computation (using a work-stealing scheduler). In the second approach, although the granularity is 1 loop iteration, some threads could steal more work than they actually need (e.g. to improve performance in general). If the loop iterations are fast enough, the work will probably not be balanced between threads.
Creating 90 threads is costly, and sending work to 90 threads is also far from free, as it is mostly bounded by the relatively high latency of atomic operations, their scalability, as well as the latency of waking threads.
Moreover, while such operations appear to be synchronous from the user's point of view, this is not the case in practice (especially with 90 threads and on multi-socket NUMA-based architectures).
As a result, some threads may already have finished computing one iteration of the loop while others are not yet aware of the parallel computation, or have not even been created yet.
The overhead of making threads aware of the computation to be done generally grows as the number of threads increases.
In some cases, this overhead can be higher than the actual computation, and it can be more efficient to use fewer threads.
OpenMP runtime developers sometimes have to trade work balancing against smaller communication overheads. Such decisions can perform badly in your case but improve the scalability of other kinds of applications. This is especially true for work-stealing schedulers (e.g. the Clang/ICC OpenMP runtime). Note that improving the scalability of OpenMP runtimes is an ongoing research field.
I advise you to try multiple OpenMP runtimes (including research ones that may or may not be good to use in production code).
You can also play with the OMP_WAIT_POLICY variable to reduce the overhead of awaking threads.
You can also try to use OpenMP tasks to push the runtime a bit harder not to merge iterations (a possible sketch is shown below).
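As a rough sketch of that last suggestion (my addition, not part of the original answer), an OpenMP taskloop with grainsize(1) asks the runtime to create one task per iteration; whether the tasks end up spread across all threads is still the runtime's decision. compute(i) is the same placeholder as in the loop above.

#pragma omp parallel
{
    #pragma omp single
    {
        /* One thread creates one task per iteration; the other threads
           pick tasks up while they wait at the end of the single region. */
        #pragma omp taskloop grainsize(1)
        for (int i = 0; i < 100; ++i)
            compute(i);
    }
}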
I also advise you to profile your code to see what is going on and find potential software/hardware bottlenecks.
Update
If you use more OpenMP threads than there are hardware threads on your machine, the processor cannot execute them simultaneously (it can only execute one OpenMP thread per hardware thread). Consequently, the operating system on your machine schedules the OpenMP threads onto the hardware threads so that they seem to be executed simultaneously from the user's point of view. However, they are not running simultaneously; they are executed in an interleaved way, each for a very small quantum of time (e.g. 100 ms).
For example, if you have a processor with 8 hardware threads and you use 8 OpenMP threads, you can roughly assume that they will run simultaneously.
But if you use 16 OpenMP threads, your operating system can choose to schedule them in the following way:
the first 8 threads are executed for 100 ms;
the last 8 threads are executed for 100 ms;
the first 8 threads are executed again for 100 ms;
the last 8 threads are executed again for 100 ms;
etc.
If your computation lasts less than 100 ms, the OpenMP dynamic/guided schedulers will move the work of the last 8 threads to the first 8 threads so that the overall execution time is shorter. Consequently, the first 8 threads can execute all the work, and the last 8 threads will not have anything to do once they are finally executed. This is the cause of the work imbalance between threads.
Thus, if you want to measure the performance of an OpenMP program, you should NOT use more OpenMP threads than there are hardware threads (unless you know exactly what you are doing and are fully aware of such effects).
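A minimal way to follow that advice (my addition, not part of the original answer) is to cap the number of OpenMP threads at the number of processors the runtime reports; omp_get_num_procs() and omp_set_num_threads() are standard OpenMP routines, and setting the OMP_NUM_THREADS environment variable has the same effect.

#include <omp.h>
#include <stdio.h>

/* Placeholder work, standing in for the real computation. */
static void compute(int i) {
    printf("iteration %d on thread %d\n", i, omp_get_thread_num());
}

int main(void) {
    /* Do not oversubscribe: one OpenMP thread per hardware thread. */
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel for schedule(dynamic,1)
    for (int i = 0; i < 100; ++i)
        compute(i);
    return 0;
}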
According to my understanding, one operation is executed by one CPU, and multiprocessing systems have multiple CPUs. That is, multiprocessing systems can work on multiple tasks at the same time.
But, in multiprocessing systems, only one process is in the running state at any point in time,
and processes are performed alternately by process scheduling.
So multiprocessing systems should be able to work on multiple processes at the same time.
Why do multiprocessing systems that have multiple CPUs use process scheduling and perform only one process at a time on one CPU?
Why don't multiple CPUs perform multiple processes at the same time?
Why do multiprocessing systems that have multiple CPUs use process scheduling and perform only one process at a time on one CPU?
Most systems these days use THREAD scheduling, not process scheduling. There are still some Unix variants that schedule processes, but most have switched to threads.
Why don't multiple CPUs perform multiple processes at the same time?
They do. They also execute multiple threads in the same process at the same time.
This is an interview question I encountered today. I have some knowledge about operating systems but am not really proficient in them. I think maybe there is a limit on how many threads each process can create?
Any ideas will help.
This question can be viewed [at least] in two ways:
Can your process get more CPU time by creating many threads that need to be scheduled?
or
Can your process get more CPU time by creating threads to allow processing to continue while other threads are blocked?
The answer to #1 is largely system dependent. However, any rationally designed system is going to protect against rogue processes trying this. Generally, the answer here is NO. In fact, some older systems only schedule processes, not threads. In those cases, the answer is always NO.
The answer to #2 is generally YES. One of the reasons to use threads is to allow a process to continue processing while it has to wait on some external event.
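A tiny illustration of #2 (my sketch, not part of the original answer): one thread blocks on a slow external event, simulated here with sleep(), while the main thread keeps doing useful work in the meantime.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Simulate a blocking wait on an external event (e.g. I/O). */
static void *wait_for_event(void *arg) {
    sleep(2);                 /* blocked: consumes no CPU while waiting */
    puts("event arrived");
    return NULL;
}

int main(void) {
    pthread_t waiter;
    pthread_create(&waiter, NULL, wait_for_event, NULL);

    /* Meanwhile the process keeps computing on the main thread. */
    long sum = 0;
    for (long i = 0; i < 100000000L; ++i)
        sum += i;
    printf("computed %ld while waiting\n", sum);

    pthread_join(waiter, NULL);
    return 0;
}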
The number of threads that can run in parallel depends on the number of CPUs on your machine.
It also depends on the characteristics of the processes you're running: if they're consuming CPU, it won't be efficient to run more threads than the number of CPUs on your machine; on the other hand, if they do a lot of I/O, or any other kind of task that blocks a lot, it makes sense to increase the number of threads.
As for the question "how many" - you'll have to tune your app, make measurements and decide based on actual data.
Short answer: Depends on the OS.
I'd say it depends on how the OS scheduler is implemented.
From personal experience with my hobby OS, it can certainly happen.
In my case, the scheduler is implemented with a round-robin algorithm, per thread, independent of which process they belong to.
So, if process A has 1 thread, and process B has 2 threads, and they are all busy, Process B would be getting 2/3 of the CPU time.
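A minimal sketch of that kind of per-thread round robin (purely illustrative, not my OS's actual code): the ready queue holds threads, not processes, so a process's share of the CPU is proportional to how many runnable threads it owns.

#include <stdio.h>

/* Hypothetical thread control block: the owning process is just a field. */
struct thread { int tid; int owner; };

int main(void) {
    /* Process A owns thread 0; process B owns threads 1 and 2. */
    struct thread ready_queue[] = { {0, 'A'}, {1, 'B'}, {2, 'B'} };
    int nthreads = 3;

    /* Round robin over threads: each thread gets one quantum per pass,
       so process B receives 2 out of every 3 quanta. */
    for (int quantum = 0; quantum < 9; ++quantum) {
        struct thread *t = &ready_queue[quantum % nthreads];
        printf("quantum %d -> thread %d (process %c)\n", quantum, t->tid, t->owner);
    }
    return 0;
}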
There are certainly a variety of approaches. Check Scheduling_(computing)
Throw in priority levels per process and per thread, and it really depends on the OS.
I have a long running (5-10 hours) Mac app that processes 5000 items. Each item is processed by performing a number of transforms (using Saxon), running a bunch of scripts (in Python and Racket), collecting data, and serializing it as a set of XML files, a SQLite database, and a CoreData database. Each item is completely independent from every other item.
In summary, it does a lot, takes a long time, and appears to be highly parallelizable.
After loading up all the items that need processing, the app uses GCD to parallelize the work, using dispatch_apply:
dispatch_apply(numberOfItems, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
    @autoreleasepool {
        ...
    }
});
I'm running the app on a Mac Pro with 12 cores (24 virtual). So I would expect to have 24 items being processed at all times. However, I found through logging that the number of items being processed varies between 8 and 24. This is literally adding hours to the run time (assuming it could work on 24 items at a time).
On the one hand, perhaps GCD is really, really smart and it is already giving me the maximum throughput. But I'm worried that, because much of the work happens in scripts that are spawned by this app, maybe GCD is reasoning from incomplete information and isn't making the best decisions.
Any ideas how to improve performance? After correctness, the number one desired attribute is shortening how long it takes this app to run. I don't care about power consumption, hogging the Mac Pro, or anything else.
UPDATE: In fact, this looks alarming in the docs: "The actual number of tasks executed by a concurrent queue at any given moment is variable and can change dynamically as conditions in your application change. Many factors affect the number of tasks executed by the concurrent queues, including the number of available cores, the amount of work being done by other processes, and the number and priority of tasks in other serial dispatch queues." (emphasis added) It looks like having other processes doing work will adversely affect scheduling in the app.
It'd be nice to be able to just say "run these blocks concurrently, one per core, don't try to do anything smarter".
If you are bound and determined, you can explicitly spawn 24 threads using the NSThread API, and have each of those threads pull from a synchronized queue of work items. I would bet money that performance would get noticeably worse.
GCD works at its most efficient when the work items submitted to it never block. That said, the workload you're describing is rather complex and rife with opportunities for your threads to block. For starters, you're spawning a bunch of other processes. Right here, this means that you're already relying on the OS to divvy up time/resources between your master task and these slave tasks. Other than setting the OS priority of each subprocess, the OS scheduler has no way to know which processes are more important than others, and by default, your subprocesses are going to have the same priority as their parent. That said, it doesn't sound like you have anything to gain by tweaking process priorities. I'm assuming you're blocking the master task thread that's waiting for the slave tasks to complete. That is effectively parking that thread -- it can do no useful work. But like I said, I don't think there's much to be gained by tweaking the OS priorities of your slave tasks, because this really sounds like it's an I/O bound workflow...
You go on to describe three I/O-heavy operations ("serializing it as a set of XML files, a SQLite database, and a CoreData database.") So now you have all these different threads and processes vying for what is presumably a shared bulk storage device. (i.e. unless you're writing to 24 different databases, on 24 separate hard drives, one for each core, your process is ultimately going to be serialized at the disk accesses.) Even if you had 24 different hard drives, writing to a hard drive (even an SSD) is comparatively slow. Your threads are going to be taken off of the CPU they were running on (so that another thread that's waiting can run) for virtually any blocking disk write.
If you wanted to maximize the performance you're getting out of GCD, you would probably want to rewrite all the stuff you're doing in subtasks in C/C++/Objective-C, bringing them in-process, and then conducting all the associated I/O using dispatch_io primitives. For APIs where you don't control the low-level reads and writes, you would want to carefully manage and tune your workload to optimize it for the hardware you have. For instance, if you have a bunch of stuff to write to a single, shared SQLite database, there's no point in ever having more than one thread trying to write to that database at once. You'd be better off making one thread (or a serial GCD queue) to write to SQLite and submitting tasks to it after pre-processing is done.
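A minimal sketch of that last point (my illustration, not the code from this answer): funnel every SQLite write through one serial queue so the concurrent workers never contend on the database. processItem, writeResultToDatabase, and result_t are hypothetical placeholders for the per-item work and the actual write.

// Hypothetical sketch: one serial queue owns all SQLite writes.
dispatch_queue_t sqliteQueue =
    dispatch_queue_create("sqlite.writer", DISPATCH_QUEUE_SERIAL);

dispatch_apply(numberOfItems, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
    // CPU-bound per-item work still runs concurrently.
    result_t result = processItem(i);

    // The write itself is serialized: no two threads touch the database at once.
    dispatch_async(sqliteQueue, ^{
        writeResultToDatabase(result);
    });
});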
I could go on for quite a while here, but the bottom line is that you've got a complex, seemingly I/O bound workflow here. At the highest-level, CPU utilization or "number of running threads" is going to be a particularly poor measure of performance for such a task. By using sub-processes (i.e. scripts), you're putting a lot of control into the hands of the OS, which knows effectively nothing about your workload a priori, and therefore can do nothing except use its general scheduler to divvy up resources. GCD's opaque thread pool management is really the least of your problems.
On a practical level, if you want to speed things up, go buy multiple, faster (i.e. SSD) hard drives, and rework your task/workflow to utilize them separately and in parallel. I suspect that would yield the biggest bang for your buck (for some equivalence relation of time == money == hardware.)
What is the difference between a thread/process/task?
Process:
A process is an instance of a computer program that is being executed.
It contains the program code and its current activity.
Depending on the operating system (OS), a process may be made up of multiple threads of execution that execute instructions concurrently.
Process-based multitasking enables you to run the Java compiler at the same time that you are using a text editor.
When employing multiple processes with a single CPU, context switching between the various memory contexts is used.
Each process has a complete set of its own variables.
Thread:
A thread is a basic unit of CPU utilization, consisting of a program counter, a stack, and a set of registers.
A thread of execution results from a fork of a computer program into two or more concurrently running tasks.
The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources.
An example of threads in the same process is automatic spell checking and automatic saving of a file while writing.
Threads are basically processes that run in the same memory context.
Threads may share the same data while execution.
(Diagram: a single-threaded process vs. a multi-threaded process)
Task:
A task is a set of program instructions that are loaded in memory.
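A small illustration of the distinction above (my addition, not part of this answer): after fork() the child process gets its own copy of the variable, so the parent never sees the child's change, while a thread created with pthread_create shares the parent's memory and its change is visible.

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;

/* Runs in the same address space as main(), so its write is visible there. */
static void *bump(void *arg) { counter = 2; return NULL; }

int main(void) {
    if (fork() == 0) {        /* child process: gets its own copy of 'counter' */
        counter = 1;
        _exit(0);
    }
    wait(NULL);
    printf("after child process: counter = %d\n", counter);   /* still 0 */

    pthread_t t;
    pthread_create(&t, NULL, bump, NULL);                      /* thread: shared memory */
    pthread_join(t, NULL);
    printf("after thread:        counter = %d\n", counter);    /* now 2 */
    return 0;
}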
Short answer:
A thread is a scheduling concept; it's what the CPU actually 'runs' (you don't run a process). A process needs at least one thread that the CPU/OS executes.
A process is a data-organizational concept. Resources (e.g. memory for holding state, an allowed address space, etc.) are allocated to a process.
To explain in simpler terms:
Process: a process is a set of instructions (code) that operates on related data, and a process has its own state: sleeping, running, stopped, etc. When a program gets loaded into memory it becomes a process. Each process has at least one thread; a program whose process has only a single thread of execution is called a single-threaded program.
Thread: a thread is a portion of a process; more than one thread can exist as part of a process. A thread has its own stack and register context, but the process's memory is shared among its threads, so the program has to handle synchronization of the threads to achieve the desired behaviour.
Task: 'task' is not a precisely used term; when a program's instructions are loaded into memory, people call it either a process or a task. Task and process are synonyms nowadays.
A process invokes or initiates a program. It is an instance of a program, and multiple instances of the same application can be running. A thread is the smallest unit of execution that lies within the process. A process can have multiple threads running, and the execution of a thread carries out a task. Hence, in a multithreaded environment, multiple tasks can be carried out at once.
A program in execution is known as process. A program can have any number of processes. Every process has its own address space.
Threads use the address space of the process. The difference between a thread and a process is that when the CPU switches from one process to another, the current information needs to be saved in the process descriptor and the information of the new process loaded; switching from one thread to another is much simpler.
A task is simply a set of instructions loaded into memory. Threads can split themselves into two or more simultaneously running tasks.
For more understanding, refer to this link: http://www.careerride.com/os-thread-process-and-task.aspx
Wikipedia sums it up quite nicely:
Threads compared with processes
Threads differ from traditional multitasking operating system processes in that:
processes are typically independent, while threads exist as subsets of a process
processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources
processes have separate address spaces, whereas threads share their address space
processes interact only through system-provided inter-process communication mechanisms.
Context switching between threads in the same process is typically faster than context switching between processes.
Systems like Windows NT and OS/2 are said to have "cheap" threads and "expensive" processes; in other operating systems there is not so great a difference except the cost of address space switch which implies a TLB flush.
Task and process are used synonymously.
From the wiki, a clear explanation:
1:1 (Kernel-level threading)
Threads created by the user are in 1-1 correspondence with schedulable entities in the kernel.[3] This is the simplest possible threading implementation. Win32 used this approach from the start. On Linux, the usual C library implements this approach (via the NPTL or older LinuxThreads). The same approach is used by Solaris, NetBSD and FreeBSD.
N:1 (User-level threading)
An N:1 model implies that all application-level threads map to a single kernel-level scheduled entity;[3] the kernel has no knowledge of the application threads. With this approach, context switching can be done very quickly and, in addition, it can be implemented even on simple kernels which do not support threading. One of the major drawbacks however is that it cannot benefit from the hardware acceleration on multi-threaded processors or multi-processor computers: there is never more than one thread being scheduled at the same time.[3] For example: If one of the threads needs to execute an I/O request, the whole process is blocked and the threading advantage cannot be utilized. The GNU Portable Threads uses User-level threading, as does State Threads.
M:N (Hybrid threading)
M:N maps some M number of application threads onto some N number of kernel entities,[3] or "virtual processors." This is a compromise between kernel-level ("1:1") and user-level ("N:1") threading. In general, "M:N" threading systems are more complex to implement than either kernel or user threads, because changes to both kernel and user-space code are required. In the M:N implementation, the threading library is responsible for scheduling user threads on the available schedulable entities; this makes context switching of threads very fast, as it avoids system calls. However, this increases complexity and the likelihood of priority inversion, as well as suboptimal scheduling without extensive (and expensive) coordination between the userland scheduler and the kernel scheduler.