On iOS devices with a PowerVR SGX Series 5 GPU based on deferred rendering, when does the hardware execute previously submitted drawing commands? - opengl-es-2.0

We all know that iOS devices from the 3GS onward have a PowerVR SGX Series 5 GPU built in.
The PowerVR GPU performs OpenGL rendering using TBDR (tile-based deferred rendering), which means that the hardware doesn't execute drawing commands immediately but caches them until some point and then executes them all.
This allows early HSR (hidden surface removal): because all drawing data has already been submitted to OpenGL when execution starts, the hardware has global visibility information for every tile it processes. So far, so good.
What confuses me is WHEN the hardware stops caching and executes the previously submitted commands, or WHAT makes it do so, while the application keeps calling glDrawArrays or glDrawElements again and again. I mean triggers other than operations that modify OpenGL objects (textures, shaders, buffer objects...) and calls to glFlush or glFinish.
P.S. I also know that the hardware flushes previously submitted commands when the PB (Parameter Buffer, which helps perform HSR) is full.

Related

How do I know when Vulkan isn't using memory anymore so I can overwrite it / reuse it?

When working with Vulkan it's common, when creating a buffer such as a uniform buffer, to create multiple 'versions' of it, because with double buffering, for example, you don't know whether the graphics API is still drawing the last frame (using the memory you bound and instructed it to use in the previous loop iteration). I've seen this done with uniform buffers but not with vertex buffers, index buffers, or images/textures. Is this because uniform buffers are updated regularly and vertex buffers or images are not?
If you wanted to update an image or a vertex buffer, how would you go about it, given that you don't know whether the graphics API is still using it? Do you simply allocate new memory for that image/buffer and start anew, even if you just want to update a section of it? And if you do allocate a new buffer, how do you know when to release the old one? Would, say, 5 frames into the future be OK? Or 2 seconds? After all, it could still be in use. How is this done?
"given that you don't know whether the graphics API is still using it?"
But you do know.
Vulkan doesn't arbitrarily use resources. It uses them exactly and only how your code tells it to use the resource. You created and submitted the commands that use those resources, so if you need to know when a resource is in use, it is you who must keep track of it and manage this.
You have to use API synchronization functions to follow the GPU's execution of commands.
If an action command uses some set of resources, then those resources are in use while that command is being executed. You have tools like events which can be used to stop subsequent commands from executing until some prior commands have finished. And events can tell when a particular command has finished, so that you'll know when those resources are no longer in use.
Semaphores have similar powers, but at the level of a batch of work. If a semaphore is signaled, then all of the commands in the batch that signaled it have completed and are no longer using the resources they use. Fences can be used for extremely coarse synchronization, at the level of a submit command.
You multi-buffer uniform data because the nature of uniform data is such that it typically needs to change every frame. If you have vertex buffers or images to change every frame, then you'll need to do the same thing with those.
For infrequent changes, you may want to have extra memory available so that you can just create new images or buffers, then delete the old ones when the memory is no longer in use. Or you may have to stall the CPU until the GPU has finished using those resources.
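The "delete the old ones when the memory is no longer in use" idea can be sketched as a deferred-release queue. This is only a CPU-side model in Python, not Vulkan code: the class name, the frame counter, and the string resource names are all hypothetical, and `completed_frame` stands in for whatever the application learns from its own synchronization (for example, a fence for that frame's submission having signaled). It also assumes resources are retired in non-decreasing frame order.

```python
from collections import deque


class DeferredReleaseQueue:
    """Holds retired resources until the GPU work that used them is done.

    Hypothetical model: a frame index stands in for a submission whose
    completion the application would query through the graphics API.
    """

    def __init__(self):
        # Entries are (last_frame_that_used_it, resource), oldest first.
        self._pending = deque()

    def retire(self, resource, last_used_frame):
        # Called once the application stops recording commands that use
        # `resource`. Assumes calls arrive in non-decreasing frame order.
        self._pending.append((last_used_frame, resource))

    def collect(self, completed_frame):
        # Release everything whose last use is at or before the newest
        # frame known to have finished on the GPU.
        released = []
        while self._pending and self._pending[0][0] <= completed_frame:
            released.append(self._pending.popleft()[1])
        return released


queue = DeferredReleaseQueue()
queue.retire("old_vertex_buffer", last_used_frame=3)
queue.retire("old_texture", last_used_frame=5)

print(queue.collect(completed_frame=2))  # []
print(queue.collect(completed_frame=4))  # ['old_vertex_buffer']
print(queue.collect(completed_frame=5))  # ['old_texture']
```

The point of the sketch is that the release time is not a guess such as "5 frames" or "2 seconds": a resource is freed exactly when the last submission that referenced it is known to be complete.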

Rendering Terrain Dynamically with Argument Buffers : Understanding why the particle buffer is not overwritten by the GPU inflight

I am looking through an Apple demo project that is associated with the 2017 WWDC video entitled "Introducing Metal 2" where the developers demonstrate the use of argument buffers. The project is linked here on the page titled "Rendering Terrain Dynamically with Argument Buffers" on the Apple developer website. Here, they synchronize resource writes by the CPU to prevent race conditions with a dispatch_semaphore_t, signaling it when the command buffer finishes executing on the GPU and waiting on it if the CPU is writing data several frames ahead of the GPU. This is consistent with what was shown in a previous 2014 WWDC "Working With Metal: Fundamentals".
I noticed that the AAPLParticleRenderer seems to send data to be written by the GPU in a compute pass before the fragment shader of a previous render pass has finished reading from that same buffer. The resource storage mode of the buffer is MTLResourceStorageModePrivate. My question: does Metal automatically synchronize access to private id<MTLBuffer>s accessible only by the GPU? Do render, compute, and blit passes encoded with a new id<MTLCommandEncoder> get access to the buffer only after other passes have finished writing to and reading from it (exclusive access)? I have seen that there are guaranteed barriers within tile shaders, where tile memory is accessed exclusively by the kernel before subsequent fragment shaders access the memory.
Lastly, in the 2016 WWDC "What's New in Metal, Part 2", the first presenter, Charles Brissart, at 16:44 mentions that fragment and vertex functions reading and writing from the same buffer must be placed into two render command encoders, but for compute kernels one compute command encoder suffices. This is consistent with what is seen within the particle renderer.
See my comment on the original question for a brief version of this answer.
It turns out that Metal tracks dependencies between commands scheduled to the GPU by default for MTLResource types. According to the Metal documentation, the hazardTrackingMode property of a MTLResource defaults to MTLHazardTrackingModeTracked (MTLHazardTrackingMode.tracked in Swift). This means Metal tracks dependencies across commands that modify the resource, as is the case with the particle kernel, and delays execution until prior commands accessing the resource are complete.
Therefore, since the _particleDataPool buffer has a storage mode of MTLResourceStorageModePrivate (storageModePrivate in Swift), it can only be written to by the GPU; hence, no CPU/GPU synchronization is necessary with a semaphore for this buffer and thus no multi-buffer system is necessary for the resource.
Only when a resource can be written to by the CPU while the GPU is still reading from it do we want multiple buffers so the CPU is not idle.
Note that the default hazard tracking mode for a MTLHeap is MTLHazardTrackingModeUntracked (MTLHazardTrackingMode.untracked in Swift), in which case you are responsible for synchronizing resource writes by the GPU.
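For contrast, here is a minimal sketch of the semaphore-gated multi-buffering used for resources the CPU does write to. It is a plain Python model, not Metal code: `FrameRing`, `MAX_FRAMES_IN_FLIGHT`, and `on_gpu_completed` are hypothetical names, and the completion callback stands in for a command buffer's completed handler signaling the dispatch semaphore.

```python
import threading

MAX_FRAMES_IN_FLIGHT = 3  # assumed ring size, chosen for illustration


class FrameRing:
    """Ring of CPU-writable buffers guarded by a counting semaphore."""

    def __init__(self, count=MAX_FRAMES_IN_FLIGHT):
        self.buffers = [bytearray(4) for _ in range(count)]
        self.in_flight = threading.Semaphore(count)
        self.frame = 0

    def begin_frame(self):
        # Blocks the CPU only when every buffer is still owned by the GPU.
        self.in_flight.acquire()
        index = self.frame % len(self.buffers)
        self.frame += 1
        return index

    def on_gpu_completed(self):
        # Stand-in for the completion handler of one submitted frame.
        self.in_flight.release()


ring = FrameRing()
used = []
for frame in range(5):
    slot = ring.begin_frame()
    ring.buffers[slot][0] = frame  # CPU writes this frame's data
    used.append(slot)
    ring.on_gpu_completed()        # the fake GPU finishes immediately here

print(used)  # [0, 1, 2, 0, 1]
```

A GPU-only private buffer needs none of this, because the CPU never writes to it while the GPU is reading.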
EDIT
After reading into resource synchronization in Metal, there are some additional points I would like to make that I think further clarify what's going on. Note that the remaining portion is in Swift. To learn more in detail, I recommend reading the "Synchronization" section in the Metal documentation here.
MTLFence
Firstly, a MTLFence is used to synchronize accesses to untracked resources within the execution of a single command buffer. A fence gives you explicit control over when the GPU accesses resources and is necessary when you are working with an untracked resource. Otherwise, Metal will handle this synchronization for you.
It is important to note that the automatic management I mention in the answer only occurs within a single command buffer, between encoding passes. But this does not mean we need to synchronize across command buffers scheduled in the same command queue, since a command buffer is not immediately scheduled for execution. In fact, according to the documentation on the addScheduledHandler(_:) method of the MTLCommandBuffer protocol found here:
"The device object schedules the command buffer after it identifies any dependencies with work tasks submitted by other command buffers or other APIs in the system."
at which point it would be safe to access these same buffers. Note that within a single render encoding pass, the result is undefined if a vertex shader writes into a buffer that the fragment shader in the same pass reads from. I mentioned this in the original question, the solution being to use two render command encoders. I have yet to determine why this is not necessary for a compute encoder, but I imagine it has to do with how kernels are executed in comparison to vertex and fragment shaders.
MTLEvent
In some cases, however, command buffers in different queues created by the same MTLDevice need access to the same resource or depend on one another in some way. In this case, synchronization is necessary because the separate queues schedule their own command buffers without knowledge of the other, meaning there is potential for the two command buffers to be executing concurrently.
To fix this problem, you use an MTLEvent instance created by the device using makeEvent() and encode event signals at specific points in each buffer.
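The signal/wait ordering across two queues can be modeled in a few lines. This is a hypothetical Python simulation, not the Metal API: `CounterEvent` and `run` are invented for illustration, and the model assumes that a wait for value v is satisfied once the event's value is at least v.

```python
class CounterEvent:
    """Toy event holding a value that only ever increases."""

    def __init__(self):
        self.value = 0

    def signal(self, value):
        self.value = max(self.value, value)

    def is_reached(self, value):
        return self.value >= value


def run(queues, event):
    """Round-robin over the queues, one command at a time.

    A ('wait', v) command blocks its own queue until the event reaches v.
    Returns the order in which ('work', name) commands executed.
    """
    order = []
    cursors = [0] * len(queues)
    progressed = True
    while progressed:
        progressed = False
        for q, commands in enumerate(queues):
            if cursors[q] >= len(commands):
                continue
            kind, arg = commands[cursors[q]]
            if kind == "wait" and not event.is_reached(arg):
                continue  # this queue stays blocked for now
            if kind == "signal":
                event.signal(arg)
            elif kind == "work":
                order.append(arg)
            cursors[q] += 1
            progressed = True
    return order


consumer = [("wait", 1), ("work", "read particles")]
producer = [("work", "write particles"), ("signal", 1)]

print(run([consumer, producer], CounterEvent()))
# ['write particles', 'read particles']
```

Even though the consumer queue is visited first, its work cannot start until the producer has signaled, which is the ordering guarantee the encoded event signals are meant to provide.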
MTLSharedEvent
In the event (no pun intended) that you have multiple processors (different CPU cores, CPU and GPU, or multi-GPU), resource synchronization is needed. Here, you create a MTLSharedEvent in place of a MTLEvent that can be used to synchronize across devices and processes. It is essentially the same API as that of the MTLEvent, but involves command queues on different devices.

How does the GPU process non-graphics data in parallel?

The introduction of programmable shaders in the graphics pipeline enabled the GPGPU concept, which uses the GPU as a general processing engine suited to parallel data.
However, as far as I know, because the GPU is still used far more for graphics processing than for GPGPU, it relies on many fixed graphics pipeline stages that cannot be programmed.
If my understanding is correct, any data processed by the GPU, regardless of its type (graphics or general), has to go through the fixed graphics pipeline, which includes both programmable stages and non-programmable fixed stages.
Does that mean non-graphical processing has to go through the graphics stages even though it doesn't make use of them? Or can it bypass the fixed stages used for graphics? I would appreciate it if someone could explain how the GPU pipeline works for GPGPU.
TL;DR:
GPGPU completely bypasses the rendering pipeline, but the pipeline is still used today.
GPUs consist of two main parts (in relation to your question). The first one is the processing part, which consists of the memory, registers, warp units, dispatchers and streaming processors. The other part is a set of controllers, that are responsible for geometry processing and the graphics pipeline. Those controllers just issue commands for the Streaming Processors on how to process the data for each of the steps of the rendering pipeline, either hardwired or based on user supplied shaders. NVidia calls them "PolyMorph Engine", AMD "Geometric Processor".
Historically, some of those controllers were hardwired to do things a single way, so you could only program the vertex shader and the fragment shader (called the pixel shader in Direct3D). The tessellation controller, for example, was hardwired on the GPU and not user-programmable. As demands grew, more and more of those controllers became user-programmable, and today most of them are completely programmable (Wikipedia).
In the early days of GPGPU, the only way to do computing was to hack the available shaders: you put your input data into a texture, drew it on a full-screen quad to calculate the result, and then read the rendered image back (See slide 26 on this introduction).
With CUDA, NVidia allowed users not only to program the shaders/PolyMorph Engine, but also to interact directly with the Streaming Processors and execute code on them (See slide 31 & 32).
This does not mean that the graphics pipeline became obsolete, but now there is a way to completely bypass it and run code directly on the GPU processors. Nvidia has a nice explanation of how the pipeline works today, where you can also see both the PolyMorph Engine and the Streaming Processors here.
The graphics pipeline still helps the developer by offloading repetitive and more complicated parts of the process, like managing the memory, managing warps, passing data and all that stuff. Theoretically you could probably write your own pipeline directly on the Streaming Processors using CUDA and then render the result, but it would be tedious, just as writing GPGPU code using shaders would be tedious.
Although old GPUs had pipelines hardcoded in the chip, a modern GPU is essentially a large ASIC that can process vectorized data extremely fast. It is the programmer who defines what it does. So the render pipeline is defined in a graphics library like OpenGL, not in the GPU. The GPU does not care what it is computing: as long as the data is vectorized, it can do all the computation needed and give you a result.
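As a rough illustration of the compute-only path the answers describe, the sketch below models a kernel launch in plain Python: the same function runs once per thread index over the data, with no vertex, rasterization, or fragment stage involved. `launch` and `saxpy` are hypothetical names, and the loop runs sequentially on the CPU, whereas a real GPU would execute the indices in parallel.

```python
def launch(kernel, grid_size, *args):
    # CPU stand-in for a compute dispatch: one kernel call per thread index.
    for thread_id in range(grid_size):
        kernel(thread_id, *args)


def saxpy(i, a, x, y, out):
    # Each "thread" handles exactly one element, independently of the others.
    out[i] = a * x[i] + y[i]


x = [1, 2, 3]
y = [10, 20, 30]
out = [0] * len(x)
launch(saxpy, len(x), 2, x, y, out)

print(out)  # [12, 24, 36]
```

Because no element depends on another, the work can be spread across however many processors are available, which is what makes such data a good fit for the GPU.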

Can GPU be used to run programs that run on CPU?

Can the GPU be used to run programs that normally run on the CPU, such as getting input from the keyboard and mouse, playing music, or reading the contents of a text file, using the Direct3D and OpenGL APIs?
The GPU has no direct access to any memory that the OS maps for client code (i.e. code that runs in user mode, with its instructions executed on the CPU).
In addition, the GPU is not meant to perform tasks like these; it is designed to perform floating-point arithmetic at high speed. Finally, you would never use Direct3D or OpenGL for anything unrelated to graphics, unless you only use compute shaders.
General purpose computations are performed with OpenCL or CUDA on the GPU, such as image manipulation or physics simulations.
You can, however, gather any data on the CPU, send it to the GPU for further processing and finally write it back again into memory accessible from the CPU.

Cocoa IOSurfaces and synchronization with background task pulling frames via Quicktime

I've a question regarding IOSurface on Cocoa.
After extensive research into what was required to move my real-time OpenGL application to 64-bit, I took the only path I found for supporting QuickTime playback: spawning a background thread that pulls the frames by installing a frame-ready callback and then calling QTVisualContextCopyImageForTime, and passing the IOSurfaceRef through RPC to the parent process.
Everything works fine, but there's one main issue. In my 32-bit application I was able to serialize every call to the GL subsystem: render a frame, pull the QuickTime frames for the next pass, and then wait for the next V-sync. This produced a very smooth and stable result.
With the IOSurface technique I have no way to synchronize the moment my app draws a frame with the moment the background process pulls the IOSurface from the QuickTime movie. The result is that I experience performance spikes at random. Indeed, OpenGL Driver Monitor shows CPU wait cycles of up to 10% in my 64-bit app, while the CPU wait graph stays at 0% under 32-bit.
Has anyone here used IOSurface in a real-world application and faced issues like this one? I've thought about an interprocess mutex/lock, but considering that I need to lock/unlock about 120 times per second, I was not able to find a workable solution; it doesn't seem that Darwin has anything like the named signals available in Win32...
Any suggestions, or should I take a totally different approach to the problem?
Thanks!