I need to solve the following problem:
a. Show how to implement acquire() and release() lock operations using the TestandSet instruction.
b. Identify a performance problem that could occur in your solution when it runs on a multiprocessor but does not occur on a uniprocessor. Describe a concrete scenario where the performance problem arises.
c. Describe an alternative lock implementation that reduces the performance problem in b, and explain how it helps in the concrete scenario you presented in b.
I have my acquire() and release() setup like these:
acquire() {
    while (TestandSet(true)) {
        // wait for lock to be released
    }
}
release() {
    TestandSet(false);
}
However, I could not identify any performance issue that would occur on multiple processors but not on a single processor. What is the performance issue? Or is my implementation of acquire() and release() not correct?
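For what it's worth, this pseudocode maps almost directly onto C11's atomic_flag, whose test-and-set and clear operations play the roles of TestandSet(true) and TestandSet(false). A minimal compilable sketch (function names match the pseudocode, not any particular textbook):

```c
#include <stdatomic.h>

/* One flag per lock; clear = free, set = held. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    /* atomic_flag_test_and_set returns the OLD value, so we
       keep spinning while the lock was already held. */
    while (atomic_flag_test_and_set(&lock)) {
        /* wait for lock to be released */
    }
}

void release(void) {
    /* Clearing the flag is the TestandSet(false) of the pseudocode. */
    atomic_flag_clear(&lock);
}
```

Note that the while condition works precisely because test-and-set returns the previous value: seeing false means we were the ones who flipped it to true.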
Found on the testAndSet wiki:
The four major evaluation metrics for locks in general are uncontended lock-acquisition latency, bus traffic, fairness, and storage.
Test-and-set scores low on two of them, namely, high bus traffic and unfairness.
When processor P1 has obtained a lock and processor P2 is also waiting for it, P2 will keep incurring bus transactions in its attempts to acquire the lock. In general, when one processor holds a lock, all other processors that want the same lock keep trying to obtain it by initiating bus transactions repeatedly until they get hold of it. This increases the bus traffic requirement of test-and-set significantly and slows down all other traffic from cache and coherence misses. The system as a whole slows down, since the bus is saturated by failed lock-acquisition attempts. Test-and-test-and-set is an improvement over TSL since it does not initiate lock-acquisition requests continuously.
When we consider fairness, we ask whether a processor gets a fair chance of acquiring the lock when it is set free. In an extreme situation a processor might starve, i.e. it might be unable to acquire the lock for an extended period of time even though the lock has become free during that time.
Storage overhead for TSL is next to nothing, since only one lock variable is required. Uncontended latency is also low, since only one atomic instruction and a branch are needed.
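The test-and-test-and-set improvement mentioned in the excerpt can be sketched in C11. The key idea: waiters spin on a plain load, which is served from their local cache and generates no bus traffic, and only attempt the bus-locking atomic exchange once the lock looks free. Names here are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool ttas_lock = false;

void ttas_acquire(void) {
    for (;;) {
        /* First "test": a plain load that hits the local cache,
           generating no bus transactions while the lock is held. */
        while (atomic_load_explicit(&ttas_lock, memory_order_relaxed)) {
            /* spin locally */
        }
        /* Then "test-and-set": only now issue the bus-locking
           atomic exchange. It may still fail if another waiter won. */
        if (!atomic_exchange_explicit(&ttas_lock, true,
                                      memory_order_acquire))
            return;  /* we observed false, so we now own the lock */
    }
}

void ttas_release(void) {
    atomic_store_explicit(&ttas_lock, false, memory_order_release);
}
```

This directly addresses the concrete multiprocessor scenario above: while P1 holds the lock, P2 spins in its own cache instead of hammering the bus, and only the moment of release triggers coherence traffic.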
Related
I am on my fourth year of Software Engineering and we are covering the topic of Deadlocks.
The generalization goes that a deadlock occurs when two processes A and B each hold one of two resources X and Y, and each waits for the other process to release its resource before releasing its own.
My question would be, given that the CPU is a resource in itself, is there a scenario where there could be a deadlock involving CPU as a resource?
My first thought on this problem is that you would need a system where a process cannot be forced off the CPU by timer interrupts (it could just be an FCFS algorithm). You would also need there to be no waiting queues for resources, because getting into a queue would release the resource. But then I also ask: can there be deadlocks when there are queues?
A CPU scheduler can be implemented in any way you like: you could build one that uses an FCFS algorithm and lets processes decide when to relinquish control of the CPU. But such an implementation would be neither practical nor reliable, since the CPU is the single most important resource an operating system has. Allowing a process to take control of it in such a way that it may never be preempted effectively makes that process the owner of the system, which contradicts the basic idea that the operating system should always be in control of the system.
As far as contemporary operating systems (Linux, Windows etc) are concerned, this will never happen because they don't allow such situations.
Vulkan is intended to be thin and explicit to the user, but queues are a big exception to this rule: queues may be multiplexed by the driver, and it's not always obvious whether using multiple queues from a family will improve performance or not.
After one driver update, I got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them (I will be happy to be proved wrong).
So why not just say "we have N separate hardware queues, and if you want to use some of them in parallel, just mutex it yourself"? As it stands, there seems to be no way to know how independent the queues in a family really are.
GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.
That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.
And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:
I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them
Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that only one thing can be DMA'd from CPU memory to GPU memory at any one time.
Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.
With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
With your way, you'd have to upload each item one at a time at regular intervals, in a process that must be able to be interrupted by a possible download. To do that, you'd basically need a recurring task that wakes up to perform and submit a single upload to the transfer queue.
It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, then it'll probably perform operations round-robin, jumping back and forth between the input transfer queue operations rather than waiting for one queue to run dry before trying another.
But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.
The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware for example offers two transfer queues, but these actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.
If you have a semaphore that is being used to restrict access to a shared resource or limit the number of concurrent actions, what is the locking algorithm to be able to change the maximum value of that semaphore once it's in use?
Example 1
In NSOperationQueue, there is a property named maxConcurrentOperationCount. This value can be changed after the queue has been created. The documentation notes that changing this value doesn't affect any operations already running, but it does affect pending jobs, which presumably are waiting on a lock or semaphore to execute.
Since that semaphore is potentially being waited on by pending operations, you can't just replace it with one that has a new count. So another lock must be involved in the change somewhere, but where?
Example 2:
In most of Apple's Metal sample code, they use a semaphore with an initial count of 3 to manage in-flight buffers. I'd like to experiment changing that number while my application is running, just to see how big of a difference it makes. I could tear down the entire class that uses that semaphore and then rebuild the Metal pipeline, but that's a bit heavy handed. Like above, I'm curious how I can structure a sequence of locks or semaphores to allow me to swap out that semaphore for a different one while everything is running.
My experience is with Grand Central Dispatch, but I'm equally interested in a C++ implementation that might use those locking or atomic constructs.
I should add that I'm aware I can technically just make unbalanced calls to signal and wait, but that doesn't seem right to me. Namely, whatever code is making these changes needs to be able to block itself if the wait takes a while to reduce the count...
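One workable structure (a sketch only; NSOperationQueue's internals are not public, so this is not Apple's actual implementation) is to stop treating the semaphore as an opaque primitive and instead build a counting semaphore with an adjustable limit out of a mutex and a condition variable. The limit then becomes just another field you change under the same mutex, and no semaphore ever needs to be swapped out. In C with pthreads (all names are mine):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int in_use;  /* permits currently held */
    int limit;   /* current maximum; adjustable while in use */
} resizable_sem;

void rs_init(resizable_sem *s, int limit) {
    pthread_mutex_init(&s->mu, NULL);
    pthread_cond_init(&s->cv, NULL);
    s->in_use = 0;
    s->limit = limit;
}

void rs_wait(resizable_sem *s) {
    pthread_mutex_lock(&s->mu);
    while (s->in_use >= s->limit)          /* re-checks limit after wakeup */
        pthread_cond_wait(&s->cv, &s->mu);
    s->in_use++;
    pthread_mutex_unlock(&s->mu);
}

void rs_signal(resizable_sem *s) {
    pthread_mutex_lock(&s->mu);
    s->in_use--;
    pthread_cond_broadcast(&s->cv);        /* wake waiters; limit may have grown too */
    pthread_mutex_unlock(&s->mu);
}

/* Change the limit while the semaphore is in use. Raising it admits
   waiters immediately; lowering it lets in_use drain down naturally --
   running holders are unaffected, matching the documented behaviour of
   maxConcurrentOperationCount. */
void rs_set_limit(resizable_sem *s, int new_limit) {
    pthread_mutex_lock(&s->mu);
    s->limit = new_limit;
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->mu);
}
```

Notably, rs_set_limit never blocks: unlike unbalanced signal/wait calls, shrinking the limit doesn't require the caller to sit in a wait until permits come back; the shrink simply takes effect as holders call rs_signal.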
Is anyone familiar with the ticket lock algorithm which replaces the basic spinlock algorithm in the linux kernel? I am hoping to find an expert on this. I've read from a few online sources that the ticket lock algorithm is supposed to be faster, since the naive algorithm overwhelms the CPU bus with all threads trying to get the lock at the same time. Can anyone confirm/deny this for me?
I did some experiments of my own. The ticket lock is indeed fair, but its performance is just about on par with the pthread spinlock algorithm. In fact, it is just a touch slower.
The way I see it, an unfair algorithm should be a bit faster since the thread that hogs the lock early on finishes more quickly, giving the scheduler less work to do.
I'd like to get some more perspective on this. If it isn't faster, why is the ticket lock implemented in the kernel, and why is it not used in user space? Thanks!
Is anyone familiar with the ticket lock algorithm which replaces the basic spinlock algorithm in the linux kernel? I am hoping to find an expert on this. I've read from a few online sources that the ticket lock algorithm is supposed to be faster, since the naive algorithm overwhelms the CPU bus with all threads trying to get the lock at the same time. Can anyone confirm/deny this for me?
I did some experiments of my own. The ticket lock is indeed fair, but its performance is just about on par with the pthread spinlock algorithm. In fact, it is just a touch slower.
I think the ticket lock was introduced mainly for fairness reasons. The speed and scalability of the ticket lock and the spinlock are almost the same; compared to a scalable lock like MCS, both cause a lot of cache-line invalidations and memory reads, which overwhelm the CPU bus.
The way I see it, an unfair algorithm should be a bit faster since the thread that hogs the lock early on finishes more quickly, giving the scheduler less work to do.
There's no scheduler involved. The ticket lock and the spinlock are busy-waiting locks: waiters are not blocked but keep checking the lock value, and the program moves on once the lock is free. Control flow never goes to the scheduler and comes back. The reason we use spinlocks instead of block-wakeup locks is that blocking and waking up involve context switches, which are expensive; just waiting and burning CPU time turns out to be cheaper. That is why busy-waiting locks should only be used for short critical sections.
I'd like to get some more perspective on this. If it isn't faster, why is the ticket lock implemented in the kernel, and why is it not used in user space? Thanks!
It's in the kernel because kernel code has critical sections too, so you need kernel-space locks to protect kernel data. But of course you can implement a user-space ticket lock and use it in your application.
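For concreteness, the ticket lock under discussion fits in a few lines of C11 atomics: acquisition is one fetch-and-add to take a ticket, then spinning until now_serving reaches that ticket, which is exactly what gives FIFO fairness. A sketch (my own names, not the kernel's):

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* the ticket dispenser */
    atomic_uint now_serving;  /* the "now serving" display */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    /* Take a ticket with a single atomic fetch-and-add, then wait
       our turn; threads enter strictly in ticket order (FIFO). */
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != me) {
        /* spin: every waiter re-reads now_serving, so each release
           still invalidates all waiters' cache lines -- the bus
           behaviour that keeps it from scaling like MCS */
    }
}

void ticket_release(ticket_lock *l) {
    /* Only the holder writes now_serving, so load-then-store is safe. */
    atomic_store_explicit(&l->now_serving,
        atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1,
        memory_order_release);
}
```

This makes the fairness-versus-speed trade-off in the answer visible: the FIFO order comes for free from the ticket counter, but the shared now_serving word is still a single contended cache line.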
I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking: why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from that of the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000 GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of OS time quanta and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of async memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The async copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient, since at least the destination buffer remains local to the core's L1. In the case of a network copy, the copy from a kernel buffer to a user buffer occurs just before handing the user buffer to the application. So the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The async memcpy instruction must return some sort of token that identifies the copy, for use later to ask whether the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction, and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted by srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
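The zero-copy idea above amounts to circulating ownership of preallocated buffers instead of their contents: the driver and the application exchange pointers through two rings, and the payload is never memcpy'd. A minimal single-threaded sketch (all names hypothetical; a real stack adds locking, DMA mapping, and cache management):

```c
#include <stddef.h>

#define NBUF  4
#define BUFSZ 2048

/* A buffer from the preallocated pool shared by driver and app. */
typedef struct {
    unsigned char data[BUFSZ];
    size_t len;
} netbuf;

static netbuf pool[NBUF];
static netbuf *free_ring[NBUF];   /* buffers the driver may fill   */
static netbuf *ready_ring[NBUF];  /* filled buffers for the app    */
static int free_n, ready_n;

void pool_init(void) {
    free_n = ready_n = 0;
    for (int i = 0; i < NBUF; i++)
        free_ring[free_n++] = &pool[i];
}

/* Driver side: in a real stack, the NIC would DMA straight into
   the claimed buffer while the CPU does other work. */
netbuf *driver_claim(void) {
    return free_n ? free_ring[--free_n] : NULL;
}

void driver_complete(netbuf *b, size_t len) {
    b->len = len;
    ready_ring[ready_n++] = b;    /* hand to app: a pointer moves, not the data */
}

/* Application side: reads the payload in place, then recycles the buffer. */
netbuf *app_recv(void) {
    return ready_n ? ready_ring[--ready_n] : NULL;
}

void app_done(netbuf *b) {
    free_ring[free_n++] = b;
}
```

The application reads the packet exactly where the hardware wrote it; the only thing that crosses the driver/application boundary is buffer ownership.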
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?