I want to use the query system to retrieve the execution time of the fragment shader.
I am creating a query pool with two timestamp queries and I am using vkCmdWriteTimestamp.
device.cmd_draw_indexed(draw_command_buffer, 6, 1, 0, 0, 1);
device.cmd_write_timestamp(
    draw_command_buffer,
    vk::PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT,
    query_pool,
    0,
);
device.cmd_write_timestamp(
    draw_command_buffer,
    vk::PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    query_pool,
    1,
);
device.cmd_end_render_pass(draw_command_buffer);
Which pipeline stages do I need to specify to only track the time for the fragment shader?
vkCmdWriteTimestamp latches the value of the timer when all previous commands have completed executing as far as the specified pipeline stage, and writes the timestamp value to memory. When the timestamp value is written, the availability status of the query is set to available.
Does this include the specified pipeline stage? For example if I specify VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, will it latch the timer before or after the fragment shader has finished?
When do I need to call vkCmdWriteTimestamp? I have noticed that if both calls to vkCmdWriteTimestamp are directly on top of each other, the resulting delta will be close to 0. Initially I thought it shouldn't matter when I call them because I specify the pipeline stage.
You seem to misunderstand what these functions are doing. They don't detect the time it takes for those stages to execute. They detect how long it takes for the system to reach that stage at that point in the command stream.
That is, if you use "EARLY_FRAGMENT_TESTS_BIT", the question you're asking is "what is the time when all prior commands have finished executing up to the point when they're done with early fragment tests?" The same goes for "FRAGMENT_SHADER_BIT". Which means the time delta between these will be effectively non-existent; it'll be the time it took for the last few primitives of the last rendering command to execute their fragment shader.
Or worse, the time between the execution of the two query commands. They're not free, after all.
Remember: rendering commands are pipelined. While some primitives are doing early fragment tests, others are running the fragment shader, and others still are running post-FS stuff. And so on.
Timestamps, as the standard says, are for "timing the execution of commands", not the execution of pipeline stages. There is no way to measure the execution time of a pipeline stage.
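If what you actually want is how long a whole draw (or group of commands) takes on the GPU, the usual approach is to bracket it: one timestamp at TOP_OF_PIPE before the commands and one at BOTTOM_OF_PIPE after them, then read the query results back and scale by timestampPeriod. Below is a minimal sketch of that idea using the Vulkan C API (the same functions are available through the ash bindings); device, cmd, queryPool and timestampPeriod stand for whatever your setup already provides, and the result is only an approximation, since earlier work still in flight is included in the interval.

// Recording: assumes queryPool holds at least 2 timestamp queries and was
// reset (vkCmdResetQueryPool) earlier in this command buffer.
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
vkCmdDrawIndexed(cmd, 6, 1, 0, 0, 1);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

// After the command buffer has finished executing on the GPU:
uint64_t ticks[2] = {};
vkGetQueryPoolResults(device, queryPool, 0, 2,
                      sizeof(ticks), ticks, sizeof(uint64_t),
                      VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);

// timestampPeriod (from VkPhysicalDeviceLimits) is nanoseconds per tick.
double elapsedMs = (ticks[1] - ticks[0]) * timestampPeriod * 1e-6;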
In my AnyLogic model I have a hub which can store 5 containers, so it has a capacity parameter with value 5. I have also given it a variable, numberOfContainers, holding the number of containers stored at the hub at that moment. When I run the model I see that the variable works (it changes over time to the number of containers stored at that moment).
Now I want another agent in my model to make a decision (within its statechart) based on whether the capacity of the hub is reached at that moment. I tried to create a branch with the following condition:
main.hub.numberOfContainers > main.hub.capacity
but it doesn't work: the statechart acts as if the capacity is never reached, even when the number of containers is much higher than the capacity. Does anybody know how to make this work?
Typically, condition branches are tricky because the condition may not be evaluated at the time you want it to be. Here is an example.
At time n there are 3 containers in the hub
At time n+1 there are 10 containers in the hub
At time n+2 there are 2 containers in the hub
The model may have missed evaluating the condition at time (n+1) which is why your transition would not be triggered.
To address this issue, I have 3 possible suggestions:
1. Do not use a condition transition. Instead, use a message. For example, if you are storing the containers in a queue, then in the "On Enter" and "On Exit" fields of the queue, add a check like:
if (queue.size >= main.hub.capacity)
    <send msg to the statechart>
2. Use a cyclic event to check whether the condition is met every second, millisecond, or whatever time period makes sense to you. When the condition is met, send a message to trigger the transition. The problem with this method is that it can slow down your model and hurt performance.
3. Use the onChange() function. This function signals to your model that a change happened and that the condition trigger needs to be re-evaluated. So you need to make sure to call onChange() wherever a change happens that might cause the condition to become true. In the example provided under option 1 above, that would be in the "On Enter" and "On Exit" fields of the queue.
In VHDL, you need to use a variable in a process statement if you want it to be updated instantaneously. You can use a signal instead, but it won't be updated immediately. So, repeating my question from above: why won't a signal be updated instantly in a process statement?
The short answer is the VHDL execution model. The VHDL execution model splits a simulation cycle into two separate steps: Update and then Execution. A limited perspective (there are many details that I have abstracted away) of a simulation cycle is:
During the update phase, all signals that are scheduled to be updated in this simulation cycle are updated.
When signals are updated and change, processes that are sensitive to a signal changing are marked for execution.
Execute the statements of the processes that were marked for execution, until they either encounter a wait statement or loop back to the process sensitivity list (if the process has one).
Your question is: why do this? It guarantees that every compliant VHDL simulator executes the same code in exactly the same number of simulation cycles and produces exactly the same result.
To understand why it would be problematic if signals updated instantaneously, consider the following code:
proc1 : process(Clk)
begin
    if rising_edge(Clk) then
        Reg1 <= A ;
    end if ;
end process ;

proc2 : process(Clk)
begin
    if rising_edge(Clk) then
        Reg2 <= Reg1 ;
    end if ;
end process ;
In the above code, if signals updated instantaneously like variables and the processes ran in the order proc1 then proc2, then in the simulator we would see a delay of 1 flip-flop from A to Reg2. On the other hand, if the processes ran in the order proc2 then proc1, then in the simulator we would see a delay of 2 flip-flops from A to Reg2.
This is also why shared variables of an ordinary type were only in VHDL temporarily: introduced in 1993 and removed in 2000, when a more appropriate feature (shared variables of a protected type) could be introduced.
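Purely as an illustration (this is C++, not VHDL, and a drastic simplification of what a real simulator does), the order-independence that the update/execute split buys you can be sketched with a toy double-buffered "signal" whose writes only take effect in a separate update step:

#include <iostream>

// Toy "signal": reads always see the current value; writes only schedule one.
struct Signal {
    int current = 0;
    int scheduled = 0;
    void write(int v) { scheduled = v; }       // like "<=": takes effect later
    void update()     { current = scheduled; } // done in the update phase
};

int main() {
    Signal reg1, reg2;
    int a = 42;

    // One "clock edge", execution phase. The order of these two "processes"
    // does not matter, because both read only .current, which is frozen.
    reg2.write(reg1.current); // proc2: Reg2 <= Reg1;
    reg1.write(a);            // proc1: Reg1 <= A;

    // Update phase: all scheduled values are committed together.
    reg1.update();
    reg2.update();

    std::cout << reg2.current << "\n"; // still the old Reg1: the 2-flip-flop
                                       // behaviour, regardless of process order
}

Swap the two write lines and the output does not change, which is exactly the determinism the deferred signal update exists to provide.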
Because a signal is designed to behave like a physically implemented value in the hardware, it only updates in response to determined stimuli and according to the progression of time.
In VHDL, this is reflected in that a signal assignment statement does not itself update the value of a signal. Rather, it schedules a transaction to occur on that signal that, when the appointed time comes, will trigger an event on the signal to change its value (if the assignment is to a changed value).
The default scheduling for a transaction is after a delta-delay in simulation, i.e. a simulation time instant just after all concurrently executing processes at that time complete. So if I'm operating in a clocked process and I update a signal value in a process triggered by rising_edge(clk), the new value won't be accessible within that current run of the process but will update just after the rising edge of the clock, when the process is complete.
This difference exists because VHDL is a hardware description language, not a programming language. Designs therefore must take into account the realities of hardware operation - the progression of time, the need for causal stimuli, &c. Thus in a good VHDL design, any value meant to persist through time will be defined as a signal so that the design takes into account that it ought to behave like a section of hardware. Within a process, a variable can provide an intermediate value for use in a combinatorial calculation - the synthesizer will determine whatever logic is necessary to accomplish that job, but the variable as a language element is a tool for calculations, not a way to define persistent values. Of course, variable abuse is possible and does exist... :^)
I am wondering whether timestamps can be used to solve the process synchronization problem when a race condition occurs. Below is an algorithm for the entry and exit sections for every process that wants to enter the critical section. The entry section uses the FCFS (First Come First Serve) technique to grant access to the critical section.
interested[N] is a shared array of N integers, where N is the number of processes.
// This section is executed when a process wants to enter the critical section.
entry_section (int processNumber) {
    interested[processNumber] = getCurrentTimeStamp(); // Get the current timestamp.
    index = findOldestProcessNumber(interested);       // Find the process number with the smallest timestamp.
    while (index != processNumber);
}

// This section is executed when a process leaves the critical section.
exit_section (int processNumber) {
    interested[processNumber] = NULL;
}
As far as I can tell, this algorithm satisfies all the conditions for synchronization, i.e., mutual exclusion, progress, bounded waiting and portability. So, am I correct?
Thanks for giving your time.
Short and sweet, here are the two issues with this approach.
All your processes are busy-waiting. This means that even though a process cannot enter the critical section, it still cannot rest, so the OS scheduler needs to constantly keep scheduling all interested processes even though they're not producing meaningful output. This hurts performance and power consumption.
This is the big one. There is no guarantee that two processes will not have the same timestamp. It may be unlikely, but likelihood is not what you're looking for when you want to guarantee mutual exclusion to prevent a race condition.
Your code is just a sketch, but most likely it will not work in all cases.
If there are no locks and all the functions use non-atomic operations, there are no guarantees that the code will execute correctly. It is essentially the same as the first example here, except that you are using an array and assuming you don't need atomicity since each process will only access its own element.
Let me try to come up with a counterexample.
A few minor clarifications.
As far as I understand, the omitted portion of each process runs in a loop:
while (!processExitCondition)
{
    // some non-critical code
    ...

    // your critical section as in the question
    entry_section (int processNumber) {
        interested[processNumber] = getCurrentTimeStamp(); // Get the current timestamp.
        index = findOldestProcessNumber(interested);       // Find the process number with the smallest timestamp.
        while (index != processNumber);
    }

    // This section is executed when a process leaves the critical section.
    exit_section (int processNumber) {
        interested[processNumber] = NULL;
    }

    // more non-critical code
    ...
}
It seems to me that the scheduling portion should be busy-waiting, constantly re-checking which process is the oldest, like this:
while (findOldestProcessNumber (interested) != processNumber);
as otherwise, all your threads can immediately hang in an infinite while loop, except for the first one which will execute once and hang right after that.
Now, your scheduling function findOldestProcessNumber (interested) has some finite execution time, and if my assumption about the outer while(!processExitCondition) loop is correct, this execution time might happen to be longer than the execution of the code inside, before, or after the critical section. As a result, a process that has completed can get back into the interested array before findOldestProcessNumber (interested) has finished iterating over it, and if getCurrentTimeStamp () has low resolution (say, seconds), you can get two processes entering the critical section at once. Imagine adding a long sleep inside findOldestProcessNumber (interested) and it will be easier to see how that might happen.
You can say it is an artificial example, but the point is that there are no guarantees about how the processes will interleave with each other, so your synchronization relies on the assumption that certain portions of the code take a "large" or "small" enough amount of time to execute. This is just an attempt to fake an atomic operation using those assumptions.
You can come up with counter-ideas to make it work. Say you implement getCurrentTimeStamp () so that it returns a unique timestamp for each caller: either a simple atomic counter with the hardware guarantee that only one process can increment it at a time, or, if you want real clock values, by internally using an atomic lock (mutex), its own critical section, and busy-waiting on that lock so that each caller gets a distinct system clock value. But with a separate findOldestProcessNumber (interested) call, I find it hard to think of a way to make it guaranteed. I can't claim it is impossible, but the more complicated it gets, the more likely you are just hiding the absence of the mutual exclusion guarantee.
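To make the "unique number per caller" idea concrete, here is a minimal sketch in C++ (threads rather than separate processes, and all names are mine, not from the question): an atomic counter hands out a distinct, monotonically increasing "timestamp" (a ticket) to each caller, and a second counter serves those tickets strictly in FCFS order. It is essentially a classic ticket lock.

#include <atomic>

std::atomic<unsigned long> next_ticket{0}; // plays the role of getCurrentTimeStamp()
std::atomic<unsigned long> now_serving{0}; // plays the role of findOldestProcessNumber()

void entry_section_ticket() {
    // fetch_add is atomic, so every caller gets a unique, ordered "timestamp".
    unsigned long my_ticket = next_ticket.fetch_add(1);
    while (now_serving.load() != my_ticket) {
        // still busy-waiting; a production version would yield or block here
    }
}

void exit_section_ticket() {
    now_serving.fetch_add(1); // hand the critical section to the next ticket
}

Note that this still busy-waits (the first objection above), but the duplicate-timestamp problem disappears because uniqueness comes from a single atomic read-modify-write rather than from a clock.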
So the simplest solution, with a lock (mutex), is the best: in your code snippet, put a mutex around the critical section, keep your current entry and exit code only for the first-come-first-serve scheduling, and let the mutex give you the mutual exclusion guarantee.
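In C++ terms that suggestion boils down to something like the following (again a sketch with made-up names, using threads; note that a plain std::mutex by itself does not guarantee first-come-first-serve ordering, which is why any FCFS bookkeeping stays separate):

#include <mutex>

std::mutex cs_mutex; // protects the critical section

void run_process() {
    // ... non-critical code, FCFS bookkeeping as in the question ...
    {
        std::lock_guard<std::mutex> guard(cs_mutex); // entry section
        // ... critical section ...
    }                                                // exit section (unlock)
    // ... more non-critical code ...
}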
If you want a lockless solution, you can use a Boost lock-free queue or implement a lock-free ring buffer: push the process number onto the queue in entry_section and then wait for it to get its turn, although, counter-intuitive as it may seem, performance can be worse.
How accurate are the start-/stop-timestamps in batch-history?
I've noticed that a batch runtime is reported as one minute in the history. The code executed by the batch includes a find method, and further code is executed only if this returns false. The find method itself runs nearly instantly.
I've added timestamps in code via info logs and can see those in the history of the batch. One timestamp is at the very first line and another one at the very last line of code; the delta is 0.
So I'm asking: where does this time delta (the history's stop minus start, compared against the timestamps in code) come from?
Is there any "overhead" or something similar that takes time every time a batch is executed?
The timestamps in BatchJobHistory (Batch job history) are off by up to a minute.
The timestamps in BatchHistory (Show tasks) are pretty accurate (one second resolution).
The timestamps in BatchJobHistory represent when the batch was started and was observed to have finished by the batch system. Due to implementation details this may differ by up to 60 seconds from the real execution times recorded in BatchHistory.
This is yet another reason why it is difficult to monitor the AX batch system.
I want to use a function, like clock(), to find the execution time of a particular piece of code. But if there are multiple threads/processes, then would those getting scheduled in between affect the output of the clock() function?
Example code:
process 1
#include <iostream>
#include <ctime>
#include <cstdlib>
using namespace std;

int main()
{
    clock_t t1, t2;
    t1 = clock();
    // Long code
    t2 = clock();
    float diff((float)t2 - (float)t1);
    cout << diff << endl;
    system("pause");
    return 0;
}
My question is: if another process gets scheduled while the long code is running, does the clock function count the cycles used for the other program as well? If yes, what is an alternative for getting the exact execution time of a piece of code?
This is not possible. There are many other processes running alongside your code that will affect its execution speed (due to scheduling, as you point out). In addition, your code itself may use the disk, network, etc., so the CPU usage time for your process may not make sense on its own. What you may want to do is run the timing analysis for the program a large number of times on the same system under the same load and take an average.
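If you go that repeated-measurement route, a sketch of what it can look like (wall-clock time via std::chrono, averaged over many runs; work() is a stand-in for the "long code" being measured):

#include <chrono>
#include <iostream>

void work() { /* the code being measured */ }

int main() {
    const int runs = 100;
    double total_ms = 0.0;

    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }

    // Averaging smooths out scheduling noise from other processes, but each
    // individual run still includes any time the process spent descheduled.
    std::cout << "average: " << total_ms / runs << " ms\n";
}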