Transpose an image2d_t in OpenCL - optimization

I work on an image processing code base that uses image2d_t objects everywhere. These have their shape (width and height) formally declared, which enables programmers to use built-in boundary checking and so on.
To speed up a separable 2D convolution, I would like to transpose the image temporarily, so that both 1D convolutions access memory along rows. But since all the image2d_t buffers have the same shape, I need to reshape two of them without reallocating them (if I have to reallocate and transpose, the speed-up amounts to almost nothing).
Is there a way to switch the width and height properties of an image2d_t object?

There is no point in transposing image2d_t objects.
image2d_t objects represent texture memory. Texture memory is a special kind of memory that is hardware-optimized for situations where threads of a warp / wavefront access elements in nearby 2D locations (x and y).
By 'nearby 2D locations' I mean not necessarily on the same horizontal line (x) and not necessarily in discrete pixel locations.
The GPU hardware has special support for 'texture sampling', allowing you to 'sample' the texture at non-discrete locations and obtain interpolated pixel values.
The exact manner in which texture memory is implemented is vendor dependent, but the general idea is to have 2D regional tiles reside in the same physical line in memory.
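The exact layout is vendor-specific, as said, but a Morton (Z-order) curve is a simple way to picture the idea: texels that are close to each other in 2D end up close to each other in the linear address space. A tiny illustrative C++ sketch (not any vendor's actual scheme):

    #include <cstdint>

    // Interleave the bits of x and y to form a Morton (Z-order) index.
    // Texels that are near each other in 2D land near each other in 1D,
    // which is the property tiled texture layouts exploit.
    static uint32_t mortonIndex(uint16_t x, uint16_t y) {
        uint32_t index = 0;
        for (int bit = 0; bit < 16; ++bit) {
            index |= (uint32_t)((x >> bit) & 1u) << (2 * bit);
            index |= (uint32_t)((y >> bit) & 1u) << (2 * bit + 1);
        }
        return index;
    }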
Examples where using texture memory makes sense:
Texture mapping in computer graphics. Adjacent pixels in an object sample their color from adjacent 2D locations in an input image.
Image transformation in image processing - scaling, rotating, distorting and undistorting an image. Situations where you 'sample' an input image in an arbitrarily calculated location and write the sample to a target buffer / image.
For most cases in image processing applications, texture memory makes no sense.
Many image processing algorithms access memory in a known pattern, which can be optimized better using linear memory (OpenCL buffers) and comes with less overhead.
As for your specific question:
Is there a way to switch width and height properties in the image2d_t object?
No. The dimensions of an image2d_t are immutable. Its contents can, however, be changed if you allocate it with the appropriate flags and pass it to a kernel as __write_only.
I suggest you switch to using buffer objects. Transposing them is possible to do efficiently and there are some good examples online.
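For reference, the usual trick with buffers is a tiled transpose through local memory, so that both the global read and the global write are coalesced. A minimal sketch for a row-major float buffer (the kernel is embedded as a C++ string; the tile size, the names and the launch parameters are illustrative, not taken from your code base):

    // Sketch only: error checking, context/queue creation and program build omitted.
    static const char* kTransposeKernel = R"CLC(
    #define TILE 16
    __kernel void transpose(__global const float* src,   /* width  x height */
                            __global float*       dst,   /* height x width  */
                            const int width,
                            const int height)
    {
        __local float tile[TILE][TILE + 1];   /* +1 column avoids bank conflicts */

        int x = get_group_id(0) * TILE + get_local_id(0);
        int y = get_group_id(1) * TILE + get_local_id(1);
        if (x < width && y < height)
            tile[get_local_id(1)][get_local_id(0)] = src[y * width + x];

        barrier(CLK_LOCAL_MEM_FENCE);

        /* write the same tile back out, transposed */
        int tx = get_group_id(1) * TILE + get_local_id(0);   /* column in dst */
        int ty = get_group_id(0) * TILE + get_local_id(1);   /* row    in dst */
        if (tx < height && ty < width)
            dst[ty * height + tx] = tile[get_local_id(0)][get_local_id(1)];
    }
    )CLC";

    // Host side, assuming an existing queue and a kernel built from the source
    // above: 16x16 local size, global size rounded up to multiples of 16.
    //   size_t local[2]  = {16, 16};
    //   size_t global[2] = {((width + 15) / 16) * 16, ((height + 15) / 16) * 16};
    //   clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);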

Related

How to join Images by depth values in Vulkan

I want to enable heterogeneous multi-GPU support for my Vulkan application. My goal is to increase the number of draw calls (and thereby visible objects) without increasing frame times. I decided it would be best to divide the geometry across multiple GPUs and let each one render its part to that GPU's local framebuffer. Each local framebuffer (including the depth buffer) would be copied to the logical device containing the presentation engine at the end of the frame.
Is there any way to "join" those framebuffers depending on the depth values? The pixel value from the framebuffer containing the smallest depth value should be copied to the final output framebuffer. A version of vkCmdBlitImage taking a parameter of type VkCompareOp would be nice to have.
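Since vkCmdBlitImage has no compare variant, one way to get the behaviour described here is a small composite pass of your own: after copying each GPU's color and depth images to the presenting device, run a full-screen pass that keeps the sample with the smallest depth. A minimal GLSL sketch of that fragment shader, stored here as a C++ string constant (bindings and names are purely illustrative):

    // Illustrative only: full-screen composite that merges two per-GPU results
    // by taking, per pixel, the sample that is closest to the camera.
    static const char* kDepthCompositeFrag = R"GLSL(
    #version 450
    layout(binding = 0) uniform sampler2D colorA;
    layout(binding = 1) uniform sampler2D depthA;
    layout(binding = 2) uniform sampler2D colorB;
    layout(binding = 3) uniform sampler2D depthB;

    layout(location = 0) in  vec2 uv;
    layout(location = 0) out vec4 outColor;

    void main() {
        float dA = texture(depthA, uv).r;
        float dB = texture(depthB, uv).r;
        outColor = (dA <= dB) ? texture(colorA, uv) : texture(colorB, uv);
        gl_FragDepth = min(dA, dB);   // only needed if the pass has a depth attachment
    }
    )GLSL";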

Differentiate models when using single draw call

I'm trying to draw the geometry for multiple models using a single draw call. All the geometry thus resides within the same vertex/index buffers. The geometry for the different models shares the same vertex format, but the vertex counts for each model can differ.
In the vertex/fragment shaders, what technique can be used to differentiate between the different models, so as to access their appropriate transforms/textures/etc.?
Are these static models? For traditional static batching:
You only need a single transform relative to the batch origin (position the individual models relative to the batch origin as part of the offline data packaging step).
You can batch your textures into a single atlas (either a single 2D image with different coordinates for each object, or a texture array with a different layer for each object).
If you do it this way you don't need to differentiate the component models at all; they are effectively just "one large model", which has nice performance properties ...
For more modern methods, you can try multi-draw indirect with a drawCount greater than one, using the per-draw index to look up the settings you want. This allows variable buffer offsets and triangle counts to be used, but the rest of the state needs to be the same.
As an alternative to texture arrays, with bindless texturing you can programmatically select which texture to use in the shader at runtime, but you generally still want the selection to be at least warp-uniform to avoid a performance hit.
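To make the indirect-draw suggestion concrete, here is a hedged OpenGL sketch (other APIs have equivalents): a single glMultiDrawElementsIndirect call submits every model, and gl_DrawID (GL 4.6 or ARB_shader_draw_parameters) indexes per-model transforms stored in an SSBO. All names are illustrative:

    #include <GL/glew.h>   // any GL loader will do
    #include <vector>

    // Matches the layout glMultiDrawElementsIndirect expects.
    struct DrawElementsIndirectCommand {
        GLuint count;          // index count for this model
        GLuint instanceCount;  // usually 1
        GLuint firstIndex;     // offset into the shared index buffer
        GLint  baseVertex;     // offset into the shared vertex buffer
        GLuint baseInstance;   // can double as a per-draw ID
    };

    // One command per model. In the vertex shader the per-draw data is fetched
    // with gl_DrawID, e.g.:
    //   layout(std430, binding = 0) buffer Transforms { mat4 model[]; };
    //   gl_Position = viewProj * model[gl_DrawID] * vec4(inPos, 1.0);
    void drawBatch(const std::vector<DrawElementsIndirectCommand>& cmds,
                   GLuint indirectBuffer)
    {
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
        glBufferData(GL_DRAW_INDIRECT_BUFFER,
                     cmds.size() * sizeof(DrawElementsIndirectCommand),
                     cmds.data(), GL_DYNAMIC_DRAW);

        // One submission covers every model; each draw keeps its own counts/offsets.
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                    nullptr, (GLsizei)cmds.size(), 0);
    }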

Registration of 3D surface mesh to 3D image volume

I have an accurate surface mesh model of an implant that I'd like to rigidly align, as well as possible, to a computed tomography scan (scalar volume) containing the exact same object. I've tried detecting edges in the image volume with a Canny filter and running an iterative closest point alignment between the edges and the vertices of the mesh, but it's not working. I also tried voxelizing the mesh and using image volume alignment methods (Mattes mutual information), which yields very inconsistent results.
Any other suggestions?
Thank you.
Generally, a mesh and a volume are two different data structures. You have to either convert the mesh to a volume or the volume to a mesh.
I would recommend segmenting the volume data first, to extract the structures you want to register. A Canny filter alone might not be enough to segment the border clearly; I would suggest the level-set method and active contour models, both of which are frequently used in medical image processing. For these two topics I would recommend Professor Chunming Li's work.
After you have segmented the volume, you can reconstruct a mesh model of it with marching cubes. The vertices of the two meshes can then be registered with a simple ICP algorithm.
However, this is more of a workaround than true registration, and the segmentation step tends to take a lot of time.
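For the final alignment step, a bare-bones point-to-point ICP can look like the sketch below (using Eigen, brute-force nearest neighbours, no outlier rejection; for real data you would use a k-d tree or an existing registration library). All names are illustrative:

    #include <Eigen/Dense>
    #include <limits>
    #include <vector>

    // "source" = vertices of the implant mesh, "target" = surface points from the
    // segmented volume (e.g. a marching-cubes mesh). Returns the accumulated
    // rigid transform that maps the original source points onto the target.
    struct RigidTransform {
        Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
        Eigen::Vector3d t = Eigen::Vector3d::Zero();
    };

    RigidTransform icpAlign(std::vector<Eigen::Vector3d> source,
                            const std::vector<Eigen::Vector3d>& target,
                            int iterations = 30)
    {
        RigidTransform total;
        for (int it = 0; it < iterations; ++it) {
            // 1. pair every source point with its nearest target point
            std::vector<Eigen::Vector3d> matched(source.size());
            for (size_t i = 0; i < source.size(); ++i) {
                double best = std::numeric_limits<double>::max();
                for (const Eigen::Vector3d& q : target) {
                    double d = (q - source[i]).squaredNorm();
                    if (d < best) { best = d; matched[i] = q; }
                }
            }

            // 2. closed-form rigid update (Kabsch / SVD)
            Eigen::Vector3d cs = Eigen::Vector3d::Zero(), ct = Eigen::Vector3d::Zero();
            for (size_t i = 0; i < source.size(); ++i) { cs += source[i]; ct += matched[i]; }
            cs /= (double)source.size(); ct /= (double)source.size();

            Eigen::Matrix3d H = Eigen::Matrix3d::Zero();
            for (size_t i = 0; i < source.size(); ++i)
                H += (source[i] - cs) * (matched[i] - ct).transpose();

            Eigen::JacobiSVD<Eigen::Matrix3d> svd(H, Eigen::ComputeFullU | Eigen::ComputeFullV);
            Eigen::Matrix3d R = svd.matrixV() * svd.matrixU().transpose();
            if (R.determinant() < 0) {                 // avoid a reflection
                Eigen::Matrix3d V = svd.matrixV();
                V.col(2) *= -1.0;
                R = V * svd.matrixU().transpose();
            }
            Eigen::Vector3d t = ct - R * cs;

            // 3. apply the update and accumulate it
            for (Eigen::Vector3d& p : source) p = R * p + t;
            total.R = R * total.R;
            total.t = R * total.t + t;
        }
        return total;
    }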

Best practice for simple DirectX overlay rendering

I'm creating a DirectX 11 game that renders complex meshes in 3D space. I'm using vertex/index buffers/shaders and this all works fine. However I now want to perform some basic 'overlay' rendering - more specifically, I want to render wireframe boxes in 3D space to show the bounds of a particular area. There would only ever be one or two boxes in view at any one time, and their vertices would change position each frame.
I've therefore been searching for simpler DX11 rendering methods but most articles I find still prepare a vertex/index buffer for very simple rendering. I know that hardware is well optimised for processing vertex streams, but is the overhead of building and filling a vertex buffer every frame just to process 8 vertices really the most efficient method?
My question is therefore, what is the most efficient method for performing this very simple rendering in DX11? Is there any more primitive method ("DrawLine", "DrawLineList(D3DXVECTOR3[])", ...) that would be a better solution? It could be less efficient per-vertex than the standard method of passing vertex buffers because it's only ever going to be used for a handful of vertices per frame.
Thanks in advance
Rob
You should create a single vertex/index buffer for each primitive shape (box, sphere, ...) and use a transformation matrix to place it correctly in the world.
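As a sketch of that approach in D3D11 (names illustrative, error handling omitted): build one immutable unit-cube line list at start-up, then per box just update the world matrix in your existing constant buffer and issue a 24-index DrawIndexed.

    #include <d3d11.h>
    #include <DirectXMath.h>

    // A unit cube centred on the origin, drawn as 12 lines. Per-box position
    // and size come from the usual world matrix in a constant buffer, so
    // nothing needs rebuilding per frame.
    static const DirectX::XMFLOAT3 kBoxVerts[8] = {
        {-0.5f,-0.5f,-0.5f}, { 0.5f,-0.5f,-0.5f}, { 0.5f, 0.5f,-0.5f}, {-0.5f, 0.5f,-0.5f},
        {-0.5f,-0.5f, 0.5f}, { 0.5f,-0.5f, 0.5f}, { 0.5f, 0.5f, 0.5f}, {-0.5f, 0.5f, 0.5f},
    };
    static const unsigned short kBoxEdges[24] = {
        0,1, 1,2, 2,3, 3,0,   // back face
        4,5, 5,6, 6,7, 7,4,   // front face
        0,4, 1,5, 2,6, 3,7    // connecting edges
    };

    void createBoxBuffers(ID3D11Device* device, ID3D11Buffer** vb, ID3D11Buffer** ib)
    {
        D3D11_BUFFER_DESC vbd = {};
        vbd.Usage = D3D11_USAGE_IMMUTABLE;
        vbd.ByteWidth = sizeof(kBoxVerts);
        vbd.BindFlags = D3D11_BIND_VERTEX_BUFFER;
        D3D11_SUBRESOURCE_DATA vdata = { kBoxVerts, 0, 0 };
        device->CreateBuffer(&vbd, &vdata, vb);

        D3D11_BUFFER_DESC ibd = {};
        ibd.Usage = D3D11_USAGE_IMMUTABLE;
        ibd.ByteWidth = sizeof(kBoxEdges);
        ibd.BindFlags = D3D11_BIND_INDEX_BUFFER;
        D3D11_SUBRESOURCE_DATA idata = { kBoxEdges, 0, 0 };
        device->CreateBuffer(&ibd, &idata, ib);
    }

    void drawBox(ID3D11DeviceContext* ctx, ID3D11Buffer* vb, ID3D11Buffer* ib)
    {
        UINT stride = sizeof(DirectX::XMFLOAT3), offset = 0;
        ctx->IASetVertexBuffers(0, 1, &vb, &stride, &offset);
        ctx->IASetIndexBuffer(ib, DXGI_FORMAT_R16_UINT, 0);
        ctx->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_LINELIST);
        // ... bind your shaders and the constant buffer holding this box's
        //     world * view * projection matrix, then:
        ctx->DrawIndexed(24, 0, 0);
    }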

Optimization using VBO in OpenGL ES 2.0

My system is composed of several objects that represent quads. Each quad is represented by the same vertices, so each object only stores the matrices that describe its transformation through the world and its own object space. During each render pass, after these matrices are updated with their frame transforms, they are multiplied with the current view and projection matrices to form the MVP matrix for that object. The object's vertices are then sent with the MVP matrix to the shader, where the vertices are multiplied by the MVP matrix.
The inefficiency here is that each quad is drawn separately, meaning there is a separate glDrawElements call for each quad. At any given moment there may be 50 or 60 quads in existence; some move out of scope and are destroyed, or their animation completes so they are also destroyed, but more randomly come into existence. Would there be a significant performance gain to storing all the necessary values in a VBO and just calling glDrawElements once during each pass?
Let's first reason about it with some simple mathematics:
At the moment you don't push any vertex data to the GPU each frame, only 12-16 floats of matrix data per quad, and you perform one matrix-matrix multiplication per quad on the CPU.
When putting everything into one VBO, you have to transfer 4 vertices (~12 floats) per quad, but no per-quad matrix data (except the global VP, of course), and you have to do 4 matrix-vector multiplies (roughly one matrix-matrix multiply) per quad on the CPU.
So the amount of work and data transferred doesn't really change much. What changes is that the transferred data shifts from many, many small uniform updates to a single large VBO update, which is very likely to be faster (partly because a single buffer update is likely cheaper on the hardware side than many uniform updates, though don't nail me on that, and partly because of the much reduced driver overhead). On top of that comes the further reduced overhead of a single large draw call instead of many small ones.
So yes, it will certainly be worth a try, though it has to be evaluated if it is really a "significant" improvement in your particular application.
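For illustration, the batched path might look like the sketch below (GLM is used for the math; buffer and attribute names are assumptions): transform each quad's corners on the CPU, upload all positions with one glBufferData, and issue a single glDrawElements against a pre-filled index buffer (6 indices per quad).

    #include <GLES2/gl2.h>
    #include <glm/glm.hpp>
    #include <vector>

    struct Quad { glm::mat4 model; };   // per-quad transform, updated each frame

    void drawQuads(const std::vector<Quad>& quads,
                   GLuint vbo, GLuint ibo, GLuint positionAttrib)
    {
        static const glm::vec4 kCorners[4] = {
            {-0.5f,-0.5f, 0.0f, 1.0f}, { 0.5f,-0.5f, 0.0f, 1.0f},
            { 0.5f, 0.5f, 0.0f, 1.0f}, {-0.5f, 0.5f, 0.0f, 1.0f}
        };

        // CPU-side transform: 4 matrix-vector multiplies per quad, as above.
        std::vector<GLfloat> positions;
        positions.reserve(quads.size() * 4 * 3);
        for (const Quad& q : quads)
            for (const glm::vec4& c : kCorners) {
                glm::vec4 p = q.model * c;          // world-space corner
                positions.push_back(p.x);
                positions.push_back(p.y);
                positions.push_back(p.z);
            }

        // One buffer upload and one draw call for all quads. The shader now only
        // needs the shared view-projection matrix as a uniform.
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, positions.size() * sizeof(GLfloat),
                     positions.data(), GL_DYNAMIC_DRAW);
        glEnableVertexAttribArray(positionAttrib);
        glVertexAttribPointer(positionAttrib, 3, GL_FLOAT, GL_FALSE, 0, nullptr);

        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);   // 6 indices per quad, filled once
        glDrawElements(GL_TRIANGLES, (GLsizei)(quads.size() * 6),
                       GL_UNSIGNED_SHORT, nullptr);
    }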
Would there be a significant performance gain to storing all the necessary values in a VBO and just calling glDrawElements once during each pass?
Yes, it would be much faster. The first reason, as you correctly identified, is the single glDrawElements call. The second is that a VBO keeps the data on the GPU itself.
If quads move out of scope you can reuse their memory for new quads. VBOs can be used to draw sub-regions of the buffer, so you get a lot of flexibility without new memory allocations.
By using VBOs you minimise the interaction between the CPU and the GPU, and that is where the performance benefit comes from.
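As a small illustration of the sub-region / reuse point above (names are hypothetical): overwrite a dead quad's slot in place with glBufferSubData, and draw only a live range by passing a byte offset into the index buffer to glDrawElements.

    #include <GLES2/gl2.h>

    // Overwrite the 4 vertices (12 floats) of one quad slot without reallocating.
    void overwriteQuadSlot(GLuint vbo, int slot, const GLfloat* corners /* 12 floats */)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferSubData(GL_ARRAY_BUFFER,
                        slot * 12 * sizeof(GLfloat),   // byte offset of the slot
                        12 * sizeof(GLfloat), corners);
    }

    // Draw a contiguous range of quads by offsetting into the bound index buffer.
    void drawQuadRange(GLuint ibo, int firstQuad, int quadCount)
    {
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
        glDrawElements(GL_TRIANGLES, quadCount * 6, GL_UNSIGNED_SHORT,
                       (const void*)(firstQuad * 6 * sizeof(GLushort)));
    }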