As described here, vkQueueWaitIdle() is the equivalent of using a VkFence.
So in which situations should you use either of them?
As you say, vkQueueWaitIdle() is just a special case of Fence use.
So you would use it when the alternative is writing ten lines of equivalent Fence code, especially if you do not care to keep track of all the previous queue submissions. It is somewhat of a debug feature (most frequently you would use it temporarily to test your synchronization), and it may be useful during cleanup (e.g. application termination, or rebuilding the swapchain).
In all other cases you should prefer VkFences, which are more general:
You can take advantage of the more advanced vkWaitForFences() usage, i.e. waiting for any vs. all of a set of fences, and a timeout.
You supply it to some command that is supposed to signal it (can't do that with vkQueueWaitIdle()). You can do something like:
vkQueueSubmit( q, 1, &si1, fence1 );
vkQueueSubmit( q, 1, &si2, fence2 );
vkWaitForFences( dev, 1, &fence1, VK_TRUE, UINT64_MAX ); // won't block on the 2nd submit, unlike vkQueueWaitIdle(q)
which can even be potentially faster than:
vkQueueSubmit( q, 1, &si1, VK_NULL_HANDLE );
vkQueueWaitIdle( q );
vkQueueSubmit( q, 1, &si2, VK_NULL_HANDLE );
You can just query the state of the Fence without waiting, via vkGetFenceStatus(). E.g. having some background job and just periodically asking whether it is done while you do other jobs (see the sketch after this list).
VkFence may be faster even in identical situations. vkQueueWaitIdle() might be implemented as
vkQueueSubmit( q, 0, nullptr, fence );
vkWaitForFences( dev, 1, &fence, VK_TRUE, UINT64_MAX );
where you would potentially pay extra for the vkQueueSubmit.
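For the vkGetFenceStatus() point above, here is a minimal sketch, assuming a VkDevice dev, a fence guarding the background submission, and a hypothetical doOtherWork() helper (none of which come from the answer itself):
#include <vulkan/vulkan.h>

void doOtherWork();   // hypothetical: any unrelated CPU-side jobs

void pollBackgroundJob( VkDevice dev, VkFence backgroundFence )
{
    // Keep doing other jobs while the submission guarded by the fence runs.
    while( vkGetFenceStatus( dev, backgroundFence ) == VK_NOT_READY )
        doOtherWork();

    // VK_SUCCESS: the background submission has finished; recycle the fence.
    vkResetFences( dev, 1, &backgroundFence );
}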
In what situations is VkFence better than vkQueueWaitIdle for vkQueueSubmit?
When you aren't shutting down the Vulkan context, i.e. in virtually all situations. vkQueueWaitIdle is a sledgehammer approach to synchronization, roughly analogous to glFinish(). A Vulkan queue is something you want to keep populated, because when it's empty that's a kind of inefficiency. Using vkQueueWaitIdle creates a kind of synchronization point between the client code and parts of the Vulkan driver, which can potentially lead to stalls and bubbles in the GPU pipeline.
A fence is much more fine-grained. Instead of asking for the queue to be empty of all work, you're just asking when it has finished the specific set of work queued prior to or with the fence. Even though it still creates a synchronization point (the client CPU thread has to sync with the driver CPU thread), this leaves the driver free to continue working on the remaining items in the queue.
Semaphores are even better than fences, because they're telling the driver that one piece of work is dependent on another piece of work and letting the driver work out the synchronization entirely internally, but they're not viable for all situations, since sometimes the client needs to know when some piece of work is done.
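As a rough illustration of that last point, here is a sketch (not taken from the answer; cbA, cbB and sem are assumed to be valid handles created elsewhere) of two submissions linked by a semaphore, with no CPU-side wait at all:
#include <vulkan/vulkan.h>

void submitWithDependency( VkQueue queue, VkCommandBuffer cbA, VkCommandBuffer cbB, VkSemaphore sem )
{
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

    VkSubmitInfo a{};
    a.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    a.commandBufferCount   = 1;
    a.pCommandBuffers      = &cbA;
    a.signalSemaphoreCount = 1;
    a.pSignalSemaphores    = &sem;          // the GPU signals sem when A finishes

    VkSubmitInfo b{};
    b.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    b.waitSemaphoreCount   = 1;
    b.pWaitSemaphores      = &sem;          // B waits for A entirely on the GPU
    b.pWaitDstStageMask    = &waitStage;
    b.commandBufferCount   = 1;
    b.pCommandBuffers      = &cbB;

    vkQueueSubmit( queue, 1, &a, VK_NULL_HANDLE );
    vkQueueSubmit( queue, 1, &b, VK_NULL_HANDLE );   // no fence, no CPU-side wait
}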
Quite frankly you should always prefer waiting on a fence because it is much more flexible.
With a fence you can wait on completion of work without having to wait on work submitted after the work you are waiting on. A fence also allows other threads to push command buffers to the queue without interfering with the wait.
Besides that, vkQueueWaitIdle() may be implemented differently (and less efficiently) than waiting on a fence.
Related
If I check the status of a fence with vkGetFenceStatus(), it takes about 0.002 milliseconds. That may not seem like a long time, but in a rendering or game engine it is a very long time, especially since checking fences while doing other scheduled jobs quickly adds up to something approaching a millisecond. If fence statuses are kept host-side, why does it take so long to check and reset them? Do other people get similar timings when calling this function?
Ideally, the time it takes to check for a fence being set shouldn't matter. While taking up 0.02% of a frame at 120FPS isn't ideal, at the end of the day, it should not be all that important. The ideal scenario works like this:
Basically, you should build your fence logic around the idea that you're only going to check the fence if it's almost certainly already set.
If you submit frame 1, you should not check the fence when you're starting to build frame 2. You should only check it when you're starting to build frame 3 (or 4, depending on how much delay you're willing to tolerate).
And most importantly, if it isn't set, that should represent a case where either the CPU isn't doing enough work or the GPU has been given too much work. If the CPU is outrunning the GPU, it's fine for the CPU to wait. That is, the CPU performance no longer matters, since you're GPU-bound.
So the time it takes to check the fence is more or less irrelevant.
If you're in a scenario where you're task dispatching and you want to run the graphics task ASAP, but you have other tasks available if the graphics task isn't ready yet, that's where this may become a problem. But even so, it would only be a problem for that small space of time between the first check to see if the graphics task is ready and the point where you've run out of other tasks to start and the CPU needs to start waiting on the GPU to be ready.
In that scenario, I would suggest testing the fence only twice per frame. Test it at the first opportunity; if it's not set, do all of the other tasks you can. After those tasks are dispatched/done... just wait on the GPU with vkWaitForFences. Either the fence is set and the function will return immediately, or you're waiting for the GPU to be ready for more data.
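A sketch of that twice-per-frame pattern, with hypothetical startGraphicsTask() and runOtherTasks() helpers standing in for whatever task system you actually have:
#include <vulkan/vulkan.h>
#include <cstdint>

void startGraphicsTask();   // hypothetical: builds/submits this frame's graphics work
void runOtherTasks();       // hypothetical: dispatches whatever other tasks are ready

void scheduleGraphics( VkDevice dev, VkFence frameFence )
{
    // First opportunity: is the earlier frame's work already done?
    if( vkGetFenceStatus( dev, frameFence ) == VK_SUCCESS )
    {
        startGraphicsTask();
        return;
    }

    // Not yet: do all of the other tasks you can.
    runOtherTasks();

    // Out of other work: just wait. This returns immediately if the fence is
    // already signaled, otherwise it blocks until the GPU is ready.
    vkWaitForFences( dev, 1, &frameFence, VK_TRUE, UINT64_MAX );
    startGraphicsTask();
}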
There are other scenarios where this could be a problem. If the GPU lacks dedicated transfer queues, you may be testing the same fence for different purposes. But even in those cases, I would suggest only testing the fence once per frame. If the data upload isn't done, you either have to do a hard sync if that data is essential right now, or you delay using it until the next frame.
If this remains a concern, and your Vulkan implementation allows timeline semaphores, consider using them to keep track of queue progress. vkGetSemaphoreCounterValue may be faster than vkGetFenceStatus, since it's just reading a number.
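A sketch of that timeline-semaphore idea (this assumes Vulkan 1.2 or VK_KHR_timeline_semaphore, and a semaphore that each submission signals with an increasing frame number; both are assumptions on top of the answer above):
#include <vulkan/vulkan.h>
#include <cstdint>

bool frameFinished( VkDevice dev, VkSemaphore timeline, uint64_t frameNumber )
{
    uint64_t value = 0;
    vkGetSemaphoreCounterValue( dev, timeline, &value );   // just reads a number
    return value >= frameNumber;   // everything up to frameNumber has completed
}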
I am using sleep() in two ways in my current embedded (real time) software design:
To throttle a processing loop, but this is discussed here, and as pointed out, thread priority will most likely address that well.
Waiting for hardware to "settle". Let's say I am writing an interface to some hardware. Communication with the hardware is all good, but I want to change its mode, and I know it takes only a small number of instruction cycles to do so. I am using a sleep(1); to pause briefly to allow for this. I could set up a loop that keeps pinging it until I receive a valid response, but this would arguably be harder to read (much more code) and, in fact, slower because of data transfer times. In fact, I could probably use a usleep(100) or less in my case.
So my question is, is this a good practice? And if not, is there a better/efficient alternative?
Callback
The most ideal solution to this would be to have the hardware notify you when a particular operation is complete through some form of callback/signal.
When writing production code, I would almost always favor this solution above all others, provided, of course, that the API you are using exposes such a mechanism.
Poll
If there is no way for you to receive such events then the only other option would be for you to check if the operation has completed. The most naive solution would be one that constantly checks (spin-lock).
However, if you know roughly how long an operation should take you could always sleep for that duration, wake-up, check operation status then sleep again or continue.
If you are 100% sure about the timings and can guarantee that your thread is not woken up early then you can rely solely on sleep.
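Here is a sketch of that sleep-then-check approach, written with std::this_thread for illustration (usleep() would play the same role in C). The 100 µs figure mirrors the question; readModeRegister() and MODE_READY are hypothetical stand-ins for whatever your hardware interface actually provides:
#include <chrono>
#include <thread>

int readModeRegister();            // hypothetical: reads the device's status register
constexpr int MODE_READY = 0x01;   // hypothetical: "mode change complete" value

bool waitForModeChange()
{
    using namespace std::chrono_literals;

    // Sleep roughly as long as the mode change is expected to take.
    std::this_thread::sleep_for( 100us );

    // Then poll with a bounded number of retries rather than trusting the
    // timing blindly.
    for( int attempt = 0; attempt < 10; ++attempt )
    {
        if( readModeRegister() == MODE_READY )
            return true;
        std::this_thread::sleep_for( 100us );
    }
    return false;   // the hardware never settled: report the error upward
}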
Poor Design?
I wouldn't necessarily say that using sleep for this task is poor design; sometimes you have no other choice. What I would say is that relying solely on sleep is poor design when you cannot guarantee the timing, because you cannot be 100% sure that the operation you are waiting for has in fact completed.
On Linux I use sigsuspend; it suspends the program until it receives a signal.
Example
My main thread needs some data, but the data isn't ready, so the main thread is suspended.
Another thread reads the data and, when it finishes, fires a signal.
The main thread then continues, and it has the data ready.
If you use sleep instead, the data may or may not be ready when you wake up.
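A minimal sketch of that pattern, assuming a POSIX system, SIGUSR1 as the "data ready" notification, and a second thread standing in for the reader (all of which are assumptions, not part of the answer above):
#include <signal.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

static volatile sig_atomic_t data_ready = 0;

static void on_usr1( int ) { data_ready = 1; }            // handler: mark the data as ready

static void* reader( void* main_thread )
{
    sleep( 1 );                                           // pretend to read/prepare the data
    pthread_kill( *(pthread_t*)main_thread, SIGUSR1 );    // fire the signal
    return nullptr;
}

int main()
{
    struct sigaction sa = {};
    sa.sa_handler = on_usr1;
    sigaction( SIGUSR1, &sa, nullptr );

    // Block SIGUSR1 so it can only be delivered inside sigsuspend().
    sigset_t block, old;
    sigemptyset( &block );
    sigaddset( &block, SIGUSR1 );
    pthread_sigmask( SIG_BLOCK, &block, &old );

    pthread_t self = pthread_self(), worker;
    pthread_create( &worker, nullptr, reader, &self );

    while( !data_ready )
        sigsuspend( &old );                               // suspended until SIGUSR1 arrives

    printf( "data is ready\n" );                          // main thread continues with the data
    pthread_join( worker, nullptr );
}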
I am reading about process management, and I have a few doubts:

What is meant by an I/O request? For example, a process that is executing is in the running state, and it is in the waiting state if it is waiting for the completion of an I/O request. I do not understand what an I/O request is; can you please give an example to elaborate?

Another doubt: let's say a process is executing and suddenly an interrupt occurs, so the process stops its execution and is put in the ready state. Is it possible that some other process begins its execution while the interrupt is being processed?
Regarding the first question:
A simple way to think about it...
Your computer has lots of components: CPU, hard drive, network card, sound card, GPU, etc. They all work in parallel and independently of each other. They are also generally slower than the CPU.
This means that whenever a process makes a call that, down the line (on the OS side), ends up communicating with an external device, there is no point in the OS sitting there waiting for the result, since the time that operation takes to complete is practically an eternity (from the CPU's point of view).
So, the OS fires up whatever communication the process requested (call it IO request), flags the process as waiting for IO, and switches execution to another process so the CPU can do something useful instead of sitting around blocked waiting for the IO request to complete.
When the external device finishes whatever operation was requested, it generates an interrupt, so the OS is informed the work is done, and it can then flag the blocked process as ready again.
This is all a very simplified view of course, but that's the main idea. It allows the CPU to do useful work instead of waiting for IO requests to complete.
Regarding the second question:
It's tricky, even for single CPU machines, and depends on how the OS handles interrupts.
For code simplicity, a simple OS might, for example, process each interrupt in one go whenever it happens, and then resume whichever process it decides is appropriate once the interrupt handling is done. In that case, no other process would run until the interrupt handling is complete.
In practice, things get a bit more complicated for performance and latency reasons.
If you think about an interrupt's lifetime as just another task for the CPU (from when the interrupt starts to the point where the OS considers its handling complete), you can effectively code the interrupt handling to run in parallel with other things.
Just think of the interrupt as a notification for the OS to start another task (the interrupt handling). It grabs whatever context it needs at the point the interrupt started, then keeps processing that task in parallel with other processes.
An I/O request generally just means a request to do input, output, or both. The exact meaning varies depending on your context, such as HTTP, networking, console operations, or perhaps some process in the CPU.
A process is waiting for I/O: say, for example, you were writing a program in C to accept a user's name on the command line and then print 'Hello User' back. Your code will go into the waiting state until the user enters their name and hits Enter. This is a higher-level example, but even a very low-level process executing in your computer's processor works on the same basic principle.
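In code, that example looks roughly like this (written in C++ here rather than C, but the blocking behaviour is the same):
#include <iostream>
#include <string>

int main()
{
    std::string name;
    std::cout << "Enter your name: ";
    std::getline( std::cin, name );          // I/O request: the process sits in the
                                             // waiting state until the user hits Enter
    std::cout << "Hello " << name << '\n';   // resumes once the input has arrived
}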
Can the processor work on other processes when the current one is interrupted and waiting on something? Yes! You'd better hope it does; that's what scheduling algorithms and stacks are for. The real answer, however, depends on what architecture you are on and whether it supports parallel or serial processing, etc.
I have been confused about the issue of context switches between processes, given a round-robin scheduler with a certain time slice (which is what Unix/Windows both use in a basic sense).
So, suppose we have 200 processes running on a single core machine. If the scheduler is using even 1ms time slice, each process would get its share every 200ms, which is probably not the case (imagine a Java high-frequency app, I would not assume it gets scheduled every 200ms to serve requests). Having said that, what am I missing in the picture?
Furthermore, Java and other languages allow putting the running thread to sleep for e.g. 100ms. Am I correct in saying that this does not cause a context switch, and if so, how is this achieved?
So, suppose we have 200 processes running on a single core machine. If the scheduler is using even 1ms time slice, each process would get its share every 200ms, which is probably not the case (imagine a Java high-frequency app, I would not assume it gets scheduled every 200ms to serve requests). Having said that, what am I missing in the picture?
No, you aren't missing anything; the same applies to non-preemptive systems. Processes with preemptive rights (meaning higher priority than other processes) can easily displace a less important process, to the extent that a high-priority process might run, say, ten times as often as the lowest-priority process (actual results depend entirely on the situation and implementation), as long as it does not starve the lowest-priority process.
For processes of similar priority, it comes down to the round-robin algorithm you mentioned, though which process gets picked first is again implementation-dependent. Windows and classic Unix schedulers do use round-robin time slicing, but the Linux task scheduler is the Completely Fair Scheduler (CFS).
Furthermore, Java and other languages allow putting the running thread to sleep for e.g. 100ms. Am I correct in saying that this does not cause a context switch, and if so, how is this achieved?
Programming languages and libraries implement "sleep" functionality with the aid of the kernel. Without kernel-level support, they'd have to busy-wait, spinning in a tight loop, until the requested sleep duration elapsed. This would wastefully consume the processor.
For threads that are put to sleep (e.g. via Thread.sleep(long millis)), most systems generally do the following:
Suspend execution of the process and mark it as not runnable.
Set a timer for the given wait time. Systems provide hardware timers that let the kernel register to receive an interrupt at a given point in the future.
When the timer hits, mark the process as runnable.
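Here is a toy sketch contrasting the two approaches described above: a kernel-assisted sleep, where the thread is marked not runnable and a timer wakes it later, versus the busy-wait a runtime would be forced into without kernel support. It is shown in C++ for illustration (Java's Thread.sleep ultimately relies on the same kind of kernel timer), and the 100 ms figure simply mirrors the question:
#include <chrono>
#include <thread>

void sleepWithKernelSupport()
{
    // The thread is marked not runnable; a timer interrupt makes it runnable
    // again later, and the CPU is free to run other threads in the meantime.
    std::this_thread::sleep_for( std::chrono::milliseconds( 100 ) );
}

void sleepByBusyWaiting()
{
    // Without kernel support: the thread stays runnable and burns CPU until
    // the deadline passes, which is exactly what kernel-level timers let the
    // runtime avoid.
    auto deadline = std::chrono::steady_clock::now() + std::chrono::milliseconds( 100 );
    while( std::chrono::steady_clock::now() < deadline )
        ;   // spin
}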
I hope you are aware of threading models like one-to-one, many-to-one, and many-to-many, so I am not going into much detail; this is just a reference for yourself.
It might appear to you as if this increases the overhead/complexity, but that's how threads (user threads created in the JVM) are operated upon. The selection is then based upon those threading models which I mentioned above. Check this Quora question and the answers to it, and please go through the best answer, given by Robert Love.
For further reading, I'd suggest the Scheduling Algorithms explanation on OSDev.org and the book Operating System Concepts by Silberschatz, Galvin, and Gagne.
Is it safe? For instance, if I create a bunch of different GCD queues that each compress (tar cvzf) some files, am I doing something wrong? Will the hard drive be destroyed?
Or does the system properly take care of such things?
Dietrich's answer is correct save for one detail (that is completely non-obvious).
If you were to spin off, say, 100 asynchronous tar executions via GCD, you'd quickly find that you have 100 threads running in your application (which would also be dead slow due to gross abuse of the I/O subsystem).
In a fully asynchronous concurrent system with queues, there is no way to know if a particular unit of work is blocked because it is waiting for a system resource or waiting for some other enqueued unit of work. Therefore, anytime anything blocks, you pretty much have to spin up another thread and consume another unit of work or risk locking up the application.
In such a case, the "obvious" solution is to wait a bit when a unit of work blocks before spinning up another thread to de-queue and process another unit of work with the hope that the first unit of work "unblocks" and continues processing.
Doing so, though, would mean that any asynchronous concurrent system with interaction between units of work -- a common case -- would be so slow as to be useless.
Far more effective is to limit the # of units of work that are enqueued in the global asynchronous queues at any one time. A GCD semaphore makes this quite easy; you have a single serial queue into which all units of work are enqueued. Every time you dequeue a unit of work, you increment the semaphore. Every time a unit of work is completed, you decrement the semaphore. As long as the semaphore is below some maximum value (say, 4), then you enqueue a new unit of work.
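One way to sketch that throttling idea with the libdispatch C API (this gates submissions with a semaphore rather than using a separate serial queue; the limit of 4 in-flight units mirrors the answer, compressFile() is a hypothetical stand-in for the actual work, and it requires the Clang blocks extension):
#include <dispatch/dispatch.h>

void compressFile( int index );   // hypothetical unit of work (e.g. wraps one tar invocation)

void enqueueAll( int fileCount )
{
    dispatch_queue_t work = dispatch_get_global_queue( DISPATCH_QUEUE_PRIORITY_DEFAULT, 0 );

    // At most 4 units of work may be in flight at any one time.
    dispatch_semaphore_t inFlight = dispatch_semaphore_create( 4 );

    for( int i = 0; i < fileCount; ++i )
    {
        // Block here until one of the 4 "slots" frees up.
        dispatch_semaphore_wait( inFlight, DISPATCH_TIME_FOREVER );
        dispatch_async( work, ^{
            compressFile( i );                       // do the actual work
            dispatch_semaphore_signal( inFlight );   // release the slot
        } );
    }

    // Drain the remaining slots so nothing is still running when we return.
    for( int i = 0; i < 4; ++i )
        dispatch_semaphore_wait( inFlight, DISPATCH_TIME_FOREVER );
}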
If you take something that is normally IO limited, such as tar, and run a bunch of copies in GCD,
It will run more slowly because you are throwing more CPU at an IO-bound task, meaning the IO will be more scattered and there will be more of it at the same time,
No more than N tasks will run at a time, which is the point of GCD, so "a billion queue entries" and "ten queue entries" give you the same thing if you have less than 10 threads,
Your hard drive will be fine.
Even though this question was asked back in May, it's still worth noting that GCD has now provided I/O primitives with the release of 10.7 (OS X Lion). See the man pages for dispatch_read and dispatch_io_create for examples on how to do efficient I/O with the new APIs. They are smart enough to properly schedule I/O against a single disk (or multiple disks) with knowledge of how much concurrency is, or is not, possible in the actual I/O requests.
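For example, a minimal dispatch_read sketch might look like the following; the path handling is illustrative only, and SIZE_MAX is used here to ask for everything the descriptor will provide:
#include <dispatch/dispatch.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

void readWholeFile( const char* path )   // path must remain valid until the handler runs
{
    int fd = open( path, O_RDONLY );
    if( fd < 0 )
        return;

    dispatch_queue_t q = dispatch_get_global_queue( DISPATCH_QUEUE_PRIORITY_DEFAULT, 0 );

    // GCD schedules the actual I/O; the handler runs on q once the data (or an
    // error) is available.
    dispatch_read( fd, SIZE_MAX, q, ^( dispatch_data_t data, int error ) {
        if( error == 0 )
            printf( "read %zu bytes from %s\n", dispatch_data_get_size( data ), path );
        close( fd );
    } );
}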