Which takes longer time? Switching between the user & kernel modes or switching between two processes? - process

Which takes longer time?
Switching between the user & kernel modes (or) switching between two processes?
Please explain the reason too.
EDIT : I do know that whenever there is a context switch, it takes some time for the dispatcher to save the status of the previous process in its PCB, and then reload the next process from its corresponding PCB. And for switching between the user and the kernel modes, I know that the mode bit has to be changed. Isn't it all, or is there more to it?

Switching between processes (given you actually switch, not run them in parallel) by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt earlier. Around 2005 (don't remember the kernel version), and after a discussion on the mailing list where someone found that trapping was slower (in absolute measures!) on a high-end xeon processor than on an earlier Pentium II or III (again, my memory), they implemented it with a new cpu instruction sysenter (which had actually existed since Pentium Pro I think). This is done in the Virtual Dynamic Shared Object (vdso) page in each process (cat /proc/pid/maps to find it) IIRC.
So, nowadays, a kernel trap is basically just a couple of cpu instructions, hence rather few cycles, compared to tenths or hundreds of thousands when using an interrupt (which is really slow on modern CPU's).
A context switch between processes is heavy. It means storing all processor state (registers, etc) to RAM (at a magic memory location in the user process space actually, guess where!), in practice dirtying all cached memory in the cpu, and reading back the process state for the new process. It will (likely) have nothing still in the cpu cache from last time it ran, so each memory read will be a cache miss, and needed to be read from RAM. This is rather slow. When I was at the university, I "invented" (well, I did come up with the idea, knowing that there is plenty of dye in a CPU, but not enough cool if it's constantly powered) a cache that was infinite size although unpowered when unused (only used on context switches i.e.) in the CPU, and implemented this in Simics. Implemented support for this magic cache I called CARD (Context-switch Active, Run-time Drowsy) in Linux, and benchmarked rather heavily. I found that it could speed-up a Linux machine with lots of heavy processes sharing the same core with about 5%. This was at relatively short (low-latency) process time slices, though.
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
Answer to at which memory location in user-space, for each process:
At address zero. Yep, the null pointer! You can't read from this entire page from user-space anyway :) This was back in 2005, but it's probably the same now unless the CPU state information has grown larger than a page size, in which case they might have changed the implementation.

Related

Scheduling on multiple cores with each list in each processor vs one list that all processes share

I have a question about how scheduling is done. I know that when a system has multiple CPUs scheduling is usually done on a per processor bases. Each processor runs its own scheduler accessing a ready list of only those processes that are running on it.
So what would be the pros and cons when compared to an approach where there is a single ready list that all processors share?
Like what issues are there when assigning processes to processors and what issues might be caused if a process always lives on one processor? In terms of the mutex locking of data structures and time spent waiting on for the locks are there any issues to that?
Generally there is one, giant problem when it comes to multi-core CPU systems - cache coherency.
What does cache coherency mean?
Access to main memory is hard. Depending on the memory frequency, it can take between a few thousand to a few million cycles to access some data in RAM - that's a whole lot of time the CPU is doing no useful work. It'd be significantly better if we minimized this time as much as possible, but the hardware required to do this is expensive, and typically must be in very close proximity to the CPU itself (we're talking within a few millimeters of the core).
This is where the cache comes in. The cache keeps a small subset of main memory in close proximity to the core, allowing accesses to this memory to be several orders of magnitude faster than main memory. For reading this is a simple process - if the memory is in the cache, read from cache, otherwise read from main memory.
Writing is a bit more tricky. Writing to the cache is fast, but now main memory still holds the original value. We can update that memory, but that takes a while, sometimes even longer than reading depending on the memory type and board layout. How do we minimize this as well?
The most common way to do so is with a write-back cache, which, when written to, will flush the data contained in the cache back to main memory at some later point when the CPU is idle or otherwise not doing something. Depending on the CPU architecture, this could be done during idle conditions, or interleaved with CPU instructions, or on a timer (this is up to the designer/fabricator of the CPU).
Why is this a problem?
In a single core system, there is only one path for reads and writes to take - they must go through the cache on their way to main memory, meaning the programs running on the CPU only see what they expect - if they read a value, modified it, then read it back, it would be changed.
In a multi-core system, however, there are multiple paths for data to take when going back to main memory, depending on the CPU that issued the read or write. this presents a problem with write-back caching, since that "later time" introduces a gap in which one CPU might read memory that hasn't yet been updated.
Imagine a dual core system. A job starts on CPU 0 and reads a memory block. Since the memory block isn't in CPU 0's cache, it's read from main memory. Later, the job writes to that memory. Since the cache is write-back, that write will be made to CPU 0's cache and flushed back to main memory later. If CPU 1 then attempts to read that same memory, CPU 1 will attempt to read from main memory again, since it isn't in the cache of CPU 1. But the modification from CPU 0 hasn't left CPU 0's cache yet, so the data you get back is not valid - your modification hasn't gone through yet. Your program could now break in subtle, unpredictable, and potentially devastating ways.
Because of this, cache synchronization is done to alleviate this. Application IDs, address monitoring, and other hardware mechanisms exist to synchronize the caches between multiple CPUs. All of these methods have one common problem - they all force the CPU to take time doing bookkeeping rather than actual, useful computations.
The best method of avoiding this is actually keeping processes on one processor as much as possible. If the process doesn't migrate between CPUs, you don't need to keep the caches synchronized, as the other CPUs won't be accessing that memory at the same time (unless the memory is shared between multiple processes, but we'll not go into that here).
Now we come to the issue of how to design our scheduler, and the three main problems there - avoiding process migration, maximizing CPU utilization, and scalability.
Single Queue Multiprocessor scheduling (SQMS)
Single Queue Multiprocessor schedulers are what you suggested - one queue containing available processes, and each core accesses the queue to get the next job to run. This is fairly simple to implement, but has a couple of major drawbacks - it can cause a whole lot of process migration, and does not scale well to larger systems with more cores.
Imagine a system with four cores and five jobs, each of which takes about the same amount of time to run, and each of which is rescheduled when completed. On the first run through, CPU 0 takes job A, CPU 1 takes B, CPU 2 takes C, and CPU 3 takes D, while E is left on the queue. Let's then say CPU 0 finishes job A, puts it on the back of the shared queue, and looks for another job to do. E is currently at the front of the queue, to CPU 0 takes E, and goes on. Now, CPU 1 finishes job B, puts B on the back of the queue, and looks for the next job. It now sees A, and starts running A. But since A was on CPU 0 before, CPU 1 now needs to sync its cache with CPU 0, resulting in lost time for both CPU 0 and CPU 1. In addition, if two CPUs both finish their operations at the same time, they both need to write to the shared list, which has to be done sequentially or the list will get corrupted (just like in multi-threading). This requires that one of the two CPUs wait for the other to finish their writes, and sync their cache back to main memory, since the list is in shared memory! This problem gets worse and worse the more CPUs you add, resulting in major problems with large servers (where there can be 16 or even 32 CPU cores), and being completely unusable on supercomputers (some of which have upwards of 1000 cores).
Multi-queue Multiprocessor Scheduling (MQMS)
Multi-queue multiprocessor schedulers have a single queue per CPU core, ensuring that all local core scheduling can be done without having to take a shared lock or synchronize the cache. This allows for systems with hundreds of cores to operate without interfering with one another at every scheduling interval, which can happen hundreds of times a second.
The main issue with MQMS comes from CPU Utilization, where one or more CPU cores is doing the majority of the work, and scheduling fairness, where one of the processes on the computer is being scheduled more often than any other process with the same priority.
CPU Utilization is the biggest issue - no CPU should ever be idle if a job is scheduled. However, if all CPUs are busy, so we schedule a job to a random CPU, and a different CPU ends up becoming idle, it should "steal" the scheduled job from the original CPU to ensure every CPU is doing real work. Doing so, however, requires that we lock both CPU cores and potentially sync the cache, which may degrade any speedup we could get by stealing the scheduled job.
In conclusion
Both methods exist in the wild - Linux actually has three different mainstream scheduler algorithms, one of which is an SQMS. The choice of scheduler really depends on the way the scheduler is implemented, the hardware you plan to run it on, and the types of jobs you intend to run. If you know you only have two or four cores to run jobs, SQMS is likely perfectly adequate. If you're running a supercomputer where overhead is a major concern, then an MQMS might be the way to go. For a desktop user - just trust the distro, whether that's a Linux OS, Mac, or Windows. Generally, the programmers for the operating system you've got have done their homework on exactly what scheduler will be the best option for the typical use case of their system.
This whitepaper describes the differences between the two types of scheduling algorithms in place.

How do I mitigate CUDA's very long initialization delay?

Initializing CUDA in a newly-created process can take quite some time as long as a half-second or more on many server-grade machines of today. As #RobertCrovella explains, CUDA initialization usually includes establishment of a Unified Memory model, which involves harmonizing of device and host memory maps. This can take quite a long time for machines with a lot of memory; and there might be other factors contributing to this long delay.
This effect becomes quite annoying when you want to run a sequence of CUDA-utilizing processes, which do not use complicated virtual memory mappings: They each have to wait their their long wait - despite the fact that "essentially", they could just re-use whether initializations CUDA made the last time (perhaps with a bit of cleanup code).
Now, obviously, if you somehow rewrote the code for all those processes to execute within a single process - that would save you those long initialization costs. But isn't there a simpler approach? What about:
Passing the same state information / CUDA context between processes?
Telling CUDA to ignore most host memory altogether?
Making the Unified Memory harmonization more lazy than it is now, so that it only happens to the extent that it's actually necessary?
Starting CUDA with Unified Memory disabled?
Keeping some daemon process on the side and latching on to it's already-initialized CUDA state?
What you are asking about already exists. It is called MPS (MULTI-PROCESS SERVICE), and it basically keeps a single GPU context alive at all times with a daemon process that emulates the driver API. The initial target application is MPI, but it does bascially what you envisage.
Read more here:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf

How does a CPU idle (or run below 100%)?

I first learned about how computers work in terms of a primitive single stored program machine.
Now I'm learning about multitasking operating systems, scheduling, context switching, etc. I think I have a fairly good grasp of it all, except for one thing. I have always thought of a CPU as something which is just charging forward non-stop. It always knows where to go next (program counter), and it goes to that instruction, etc, ad infinitum.
Clearly this is not the case since my desktop computer CPU is not always running at 100%. So how does the CPU shut itself off or throttle itself down, and what role does the OS play in this? I'm guessing there's an input on the CPU somewhere which allows it to power down... and the OS can set this if it has nothing to schedule, but the next logical question is how does it start back up again? I'm guessing either one of two things:
It never shuts down completely, just runs at a very low frequency waiting for the scheduler to get busy again
It shuts down completely but is woken up by interrupts
I searched all over for info on this and came up fairly empty-handed. Any insight would be much appreciated.
The answer is that is depends on the hardware, the operating system and the way that the operating system has been configured.
And it could involve either or both of the strategies you proposed.
Another possibility for machines based on the x86 architecture, is that x86 has an HLT instruction that causes the core to stop until it receives an external interrupt. So the "Idle" task could simply execute HLT in a tight loop.
Just go to task manager, performance tab, and watch the cpu usage while you're doing absolutely nothing on your computer. it never stops fluctuating. Having an operating system like windows running, the cpu is going to ALWAYS be functioning, it never completely shuts down.
Having your monitor display an image requires your cpu to process a function allowing it to display anything. etc.
Everything runs through the CPU, just like your brain, it controls everything. nothing would function without it.
Some CPUs do have a 'wait for interrupt' instruction which allows the CPU to stop executing instructions when there is nothing to do, and will not re-awake until there is an interrupt event. This is particularly useful in microcontrollers, where they can sit for long periods of time waiting for something to happen.
Intel = HLT (Halt)
ARM = WFI (Wait for interrupt)
Sometimes a 'busy wait' is also used, where the CPU sits in a little 'idle' loop, checking for things to do. In this case, the CPU is still running instructions, but the operating system is in an idle state. It's not as efficient as using a HLT.
Modern CPUs can also adjust their power usage, and are capable of reducing clock rates, or shutting down parts of the CPU that aren't being used. In this way, power usage during an active idle state can be less than during active processing, even though the core CPU is still running and executing instructions.
If speaking about x86 architecture when an operating system has nothing to do it can use HLT instruction.
HLT instruction stops the CPU till next interrupt.
See http://en.m.wikipedia.org/wiki/HLT for details.
Other architectures have similar instruction to give CPU a rest.

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking, why doesn't intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different than the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks of the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional roll of OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy insruction any time, without regard for what other processes have done or will do. The core must have some limit to the number of asynch memcpy operations in flight at any one time, so when the next process comes along, it's memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism and developers would simply fall back to the old fashioned synchronous copy.
3) Cache locality has a first order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient since at least the destination buffer remains local the core's L1. In the case of network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kind of operations are better handled by software than core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth which can be a bottleneck in many systems. As hinted by srking's answer reading the data into the CPU's cache and then copying it around there can be a lot more efficient then memory to memory DMA. Even though the DMA may appear to work in the background there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero copy architecture where the buffer is shared between the application and the driver/hardware. That is incoming network data is read directly into preallocated buffers and doesn't need to be copied and outgiong data is read directly out of the application's buffers to the network hardware. I've seen this done in embedded/real-time network stacks.
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

How To Simulate Lower CPU Processor Machines For Browser Testing

We have some users which are using lower-CPU powered machines and they're encountering slow response times using our web application. Is there any way for me to do testing so that I can simulate lower CPU rates?
For example, I have 2.3 Ghz computing power, can I lower it to 1.6 Ghz or lower so that I may be able to test it?
BTW, our customers are using Windows. I have to simulate low computing power on Internet Explorer as browser.
Most new CPUs multiplier can easily be lowered (Intel: Speedstep, AMD: PowerNow!). This is used to save power. With RMclock you can manually adjust your multiplier and thus lower your frequency and make your pc slower. I use this tool myself so I can tell you that it works.
http://cpu.rightmark.org/products/rmclock.shtml
The virtual machine Bochs(pronounced boxes) allows you to set a instructions per second directive. It's probably the slowest emulator out there as it is though...
Create some virtual machines.
You can use VirtualPC or VirtualBox both are free.
I would recommend to start something on the background which eats up all your processor cycles.
A program which finds primenumbers or something similar.
Another slight option in addition to those above is to boot windows in a lower resource config. Go to the start menu,, select run and type MSCONFIG. You can go to the boot tab, click on advanced options and limit the memory and number of of processsors. It's not as robust as the above, but it does give you another option.
Lowering the CPU clock doesn't always give expected results.
Newer CPUs feature architecture improvements which make them more efficient on an equvialent clock basis than older chips. Incidentally, because of this virtual machines are a bad way of testing performance for "older" tech as well.
Your best bet is to simply buy a couple of older machines. Using similar RAM (types and amounts), processor, motherboard chipsets, hard drives, and video cards. All of which feed into the total performance of the machine itself.
I bring the other components up because changing just one of them can have an impact on even browser performance. A prime example is memory. If your clients are constrained to something like 512MB of RAM, the machines could be performing a lot of hard drive access for VM swaps, even for just running the browser. In this situation downgrading the clock speed on your processor while still retaining your 2GB (assuming) of RAM would still not perform anywhere near the same even if everything else was equal.
Isak Savo'sanswer works, but can be a bit finicky, as the modern tpl is going to try and limit cpu load as much as possible. When I tested it out, It was hard (though possible with some testing) to consistently get the types of cpu usages I wanted.
Then I remembered, http://www.cpukiller.com/, which does this already. Highly recommended. As an aside, I found this util from playing old 90s games on modern machines, back when frame rate was pegged to cpu clock time, making playing them on modern computers way too fast. Great utility.
Another big difference between high-performance and low-performance CPUs is the number of cores available. This can realistically differ by a factor of 4, way more than the difference in clock frequency you're likely to encounter.
You can solve this by setting the thread affinity. Even IE6 will use 13 threads just to show google.com. That means it will benefit from a multi-core CPU. But if you set the thread affinity to one core only, all 13 IE threads will have to share that one core.
I understand that this question is pretty old, but here are some receipts I personally use (not only for Web development):
BES. I'm getting some weird results while using it.
Go to Control Panel\All Control Panel Items\Power Options\Edit Plan Settings\Change Advanced Power Settings, then go to the "Processor" section and set it's maximum state to 5% (or something else). It works only if your processor supports dynamic multiplier change and ACPI driver is installed correctly.
Run Task Manager and set processor affinity to a single core (or whatever number of cores you want) for your browser's (or any other's) process. Not a best practice for browsers, because JavaScript implementations are usually single-threaded, but, as far as I see, modern browsers actually DO use multiple cores.
There are a few different methods to accomplish this.
If you're using VirtualBox, go into the Settings for the VM you want to slow the CPU speed for. Go to System > Processor, then set the Execution Cap. The percentage controls how slow it will go: lower values are slower relative to the regular speed. In practice, I've noticed the results to be choppy, although it does technically work.
It is also possible to set the CPU speed for the whole system. In the Windows 10 Settings app, go to System > Power & Sleep. Then click Additional Power Settings on the right hand side. Go to Change Plan Settings for the currently selected plan, then click Change Advanced Power Plan Settings. Scroll down to Processor Power Management and set the Maximum Processor State. Again, this is a percentage. Although this does work, I find that in practice, it doesn't have a big impact even when the percentage is set very low.
If you're dealing with a videogame that uses DirectX or OpenGL and doesn't have a framerate cap, another common method is to force Vsync on in your graphics driver settings. This will usually slow the rendering to about 60 FPS which may be enough to play at a reasonable rate. However, it will only work for applications using 3D hardware rendering specifically.
Finally: if you'd rather not use a VM, and don't want to change a system global setting, but would rather simulate an old CPU for one specific process only, then I have my own program to do that called Old CPU Simulator.
The main brain of the operation is a command line tool written in C++, but there is also a GUI wrapper written in C#. The GUI requires .NET Framework 4.0. The default settings should be fine in most cases - just select the CPU you'd like to simulate under Target Rate, then hit New and browse for the program you'd like to run.
https://github.com/tomysshadow/OldCPUSimulator (click the Releases tab on the right for binaries.)
The concept is to suspend and resume the process at a precise rate, and because it happens so quickly the process will appear to just be running slowly. For example, by suspending a process for 3 milliseconds, then resuming it for 1 millisecond, it will appear to be running at 25% speed. By controlling the ratio of time suspended vs. time resumed, it is possible to simulate different speeds. This is completely API agnostic (it doesn't hook DirectX, OpenGL, etc. it'll work with a command line program if you want.)
Old CPU Simulator does not ask for a percentage, but rather, the clock speed to simulate (which it calls the Target Rate.) It then automatically determines, based on your CPU's real clock speed, the percentage to use. Although clock speed is not the only factor that has improved computer performance over time (there are also SSDs, faster GPUs, more RAM, multithreaded performance, etc.) it's a good enough approximation to get fairly consistent results across machines given the same Target Rate. It also supports other options that may help with consistency, such as setting the process affinity to one.
It implements three different methods of suspending and resuming a process and will use the best available: NtSuspendProcess, NtQuerySystemInformation, or Toolhelp Snapshots. It also uses timeBeginPeriod and timeEndPeriod to achieve high precision timing without busy looping. Note that this is not an emulator; the binary still runs natively. If you like, you can view the source to see how it's implemented - it's not a large project. On my machine, Old CPU Simulator uses less than 1% CPU and less than 1 MB of memory, so the program itself is quite efficient (unlike running intensive programs to intentionally slow the CPU.)