Best way to load 24-bit BGR image from memory to ID2D1Bitmap1 - optimization

I have a 24-bit BGR image in memory as a contiguous buffer (represented by a cv::Mat, in case that is of any help). I would like to load it into an ID2D1Bitmap1 for 2D rendering. I have the following working code (shown here as pseudo-code):
IWICImagingFactory::CreateBitmapFromMemory(GUID_WICPixelFormat24bppBGR);
IWICFormatConverter::Initialize(GUID_WICPixelFormat32bppRGB);
ID2D1DeviceContext::CreateBitmapFromWicBitmap;
This works fine, the main issue being the time it takes: 20-40 milliseconds, which is too long for my application. I am looking for ways to optimize the process.
I can probably save the creation time of the ID2D1Bitmap1 by creating it once and then loading the converted image from memory using CopyFromMemory, but the conversion itself still takes a large amount of time. One way could be to load the raw BGR buffer into GPU memory and convert it to the native RGBA format on the GPU itself, but I have no idea how to start with that.

Your second idea is exactly the direction you should go. Create the ID2D1Bitmap(s) once, convert the buffer (more on that below), and use CopyFromMemory. I do something very similar in an app which provides a preview of a connected webcam (which may have one of various formats). Some cameras will deliver the images in MJPEG, YUY2, etc.
That app uses MediaFoundation, and an IMFTransform to convert the buffer. The IMFTransform in this case is an instance of CLSID_CColorConvertDMO (which uses SIMD registers/instructions when possible). However, prior to completing that, I tested with my own conversion code (which was CPU bound, and performed so-so), and another solution with HLSL and DirectCompute (which performed well, but handled only one format). In the end I chose CLSID_CColorConvertDMO to handle the various types of input, but if you only have one known type, you may choose the HLSL solution (although you will then have to write the conversion code yourself and set up the 'views' of the data).
However, if you choose the MediaFoundation route, you can use an IMFTransform without the rest of the graph (source, sink, etc.). After creating the CLSID_CColorConvertDMO instance and setting the input and output types (format, frame size, etc.), create an IMFSample (using MFCreateSample) and an IMFMediaBuffer (using MFCreateMemoryBuffer), and add the buffer to the sample (using IMFSample::AddBuffer). Then all that is necessary is to call ProcessInput and ProcessOutput to convert the buffer (create all items up front). This may sound like a lot, but if done correctly, the impact on your CPU utilization will not be noticeable, and you will achieve the performance you are looking for, even for capture cards, which often deliver large frames at 60+ FPS (I have used DataPath and Blackmagic capture cards in the past).
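For comparison, the "own conversion code" baseline mentioned above can be sketched as a plain CPU loop. This is only an illustrative sketch, not the CLSID_CColorConvertDMO or HLSL path: the function name bgr24ToBgra32 is hypothetical, and it assumes the destination bitmap uses a 32-bit BGRA layout (for example DXGI_FORMAT_B8G8R8A8_UNORM).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: expands 24-bit BGR rows into tightly packed 32-bit BGRA rows.
// srcStride is the byte distance between source rows (cv::Mat::step for a cv::Mat).
std::vector<std::uint8_t> bgr24ToBgra32(const std::uint8_t* src,
                                        std::size_t width,
                                        std::size_t height,
                                        std::size_t srcStride)
{
    std::vector<std::uint8_t> dst(width * height * 4);
    for (std::size_t y = 0; y < height; ++y) {
        const std::uint8_t* s = src + y * srcStride;   // start of source row
        std::uint8_t* d = dst.data() + y * width * 4;  // start of destination row
        for (std::size_t x = 0; x < width; ++x) {
            d[0] = s[0];  // blue
            d[1] = s[1];  // green
            d[2] = s[2];  // red
            d[3] = 0xFF;  // opaque alpha
            s += 3;
            d += 4;
        }
    }
    return dst;
}
```

The result could then be uploaded into a bitmap that was created once, with ID2D1Bitmap::CopyFromMemory(nullptr, data, width * 4). In a real frame loop the destination buffer would be allocated once and reused rather than returned by value.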
Good luck. I am certain you will crush it.

Related

FabricJS v3.4.0: Filters & maxTextureSize - performance/size limitations

Intro:
I've been messing with fabricJS image filtering features in an attempt to start using them in my webapp, but I've run into the following.
It seems fabricJS by default only sets the image size cap (textureSize) on filters to be 2048, meaning the largest image is 2048x2048 pixels.
I've attempted to raise the default by calling fabric.isWebGLSupported() and then setting fabric.textureSize = fabric.maxTextureSize, but that still caps it at 4096x4096 pixels, even though my maxTextureSize on my device is in the 16000~ range.
I realize that devices usually report the full value without accounting for current memory actually available, but that still seems like a hard limitation.
So I guess the main issues I'm looking at here to start effectively using this feature:
1- Render blocking applyFilters() method:
The current filter application function seems to be render blocking in the browser. Is there a way to call it without blocking rendering, so I can show an indeterminate loading spinner or something?
Is it as simple as making the apply filter method async and calling it from somewhere else in the app? (I'm using Vue for context, with webpack/babel, which polyfills async/await etc.)
2- Size limits:
Is there a way to bypass the size limit on images? I'm looking to filter images up to 4800x7200 pixels
I can think of at least one way to do this, which is to "break up" the image into smaller images, apply the filters, and then stitch it back together. But I worry it might be a performance hit, as there will be a lot of canvas exports and canvas initializations in this process.
I'm surprised fabricJS doesn't do this "chunking" by default, as it's quite a comprehensive library, and I think they've already gone as far as using WebGL shaders (which are a black box to me) for filtering under the hood for performance. Is there a better way to do this?
My other solution would be to send the image to a service (one I hand-roll, or a pre-existing paid one) that applies the filters somewhere in the cloud and returns the result to the user, but that's not a solution I want to resort to just yet.
For context, I'm mostly using fabric.Canvas and fabric.StaticCanvas to initialize canvases in my app.
Any insights/help with this would be great.
I wrote the filtering backend for fabricJS with Mr. Scott Seaward (credits to him too), and I can give you some answers.
Hard block to 2048
A lot of MacBooks with only an integrated Intel video card report a max texture size of 4096, but then crash the WebGL instance at anything higher than 2280. This was happening widely in 2017, when the WebGL filtering was written. A default of 4096 would have left a LOT of notebooks uncovered. Do not forget mobile phones too.
You know your user base; you can raise the limit to what your video card allows and what canvas allows in your browser. The final image, however big the texture can be, must be copied into a canvas and displayed. (Canvas has a different max size depending on browser and device.)
Render blocking applyFilters() method
WebGL is synchronous, as far as I understand.
Creating a parallel thread of execution for filtering operations that take on the order of 20-30 ms (sometimes just a couple of ms in Chrome) seems excessive.
Also consider that I tried it, but when more than 4 WebGL contexts were open in Firefox, some would be dropped. So I decided on one at a time.
The non-WebGL filtering takes longer, of course, and could probably be done in a separate thread, but fabricJS is a generic library that does vectors, filtering and serialization; it already has a lot on its plate, and filtering performance is not that bad. But I'm open to arguing about it.
Chunking
Shutterstock editor uses fabricJS and is the main reason why a WebGL backend was written. The editor also has chunking and can filter bigger images with tiles of 2048 pixels. We did not release that as open source and I do not plan on asking. That kind of tiling limits the kind of filters you can write, because the code only has knowledge of a limited portion of the image at a time; even just blurring becomes complicated.
Here is a description of the tiling process. It is written for the casual reader and not only for software engineers; it is just a blog post.
https://tech.shutterstock.com/2019/04/30/canvas-webgl-filtering-concepts
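As a rough illustration of the tiling arithmetic only (this is not the unreleased Shutterstock code, and splitIntoTiles is a hypothetical name), the rectangles for a 2048-pixel tile grid can be computed as below. The sketch is written in C++, but the arithmetic ports directly to JavaScript. It ignores the overlap that neighborhood filters such as blur would need at tile edges.

```cpp
#include <algorithm>
#include <vector>

// One rectangular piece of the source image, in pixels.
struct Tile {
    int x;
    int y;
    int width;
    int height;
};

// Hypothetical helper: covers the image with tiles no larger than maxTileSize,
// row by row. Tiles on the right and bottom edges may be smaller.
std::vector<Tile> splitIntoTiles(int imageWidth, int imageHeight, int maxTileSize)
{
    std::vector<Tile> tiles;
    for (int y = 0; y < imageHeight; y += maxTileSize) {
        for (int x = 0; x < imageWidth; x += maxTileSize) {
            tiles.push_back({x, y,
                             std::min(maxTileSize, imageWidth - x),
                             std::min(maxTileSize, imageHeight - y)});
        }
    }
    return tiles;
}
```

For the 4800x7200 image from the question and a tile size of 2048, this yields a 3x4 grid of 12 tiles, the last one being 704x1056 pixels.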
Generic render blocking consideration
So fabricJS has some pre-written filters made with shaders.
The timings I note here are from memory and have not been re-verified.
The time spent filtering an image is made up of:
Uploading the image to the GPU (I do not know how many ms)
Compiling the shader (up to 40 ms, depends)
Running the shader (like 2 ms)
Downloading the result from the GPU (like 0 ms or 13 ms, depending on which method is used)
Now the first time you run a filter on a single image:
The image gets uploaded
Filter compiled
Shader Run
Result downloaded
The second time you do this:
Shader Run
Result downloaded
When a new filter is added or filter is changed:
New filter compiled
The new shader, or both shaders, run
Result downloaded
Most common errors in application building with filtering that i have noticed are:
You forget to remove old filters, leaving them active with a value near 0 that does not produce visual changes but adds time
You connect the filter to a slider change event without throttling, which, depending on the browser/device, brings up to 120 filtering operations per second.
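The throttling mentioned in the second point boils down to a small piece of state. Below is a minimal sketch of that logic, assuming the caller supplies a timestamp in milliseconds; the Throttle class is hypothetical and not part of fabricJS, and the same few lines translate directly to a JavaScript slider handler.

```cpp
#include <cstdint>

// Hypothetical helper: lets an action run at most once per minIntervalMs.
class Throttle {
public:
    explicit Throttle(std::int64_t minIntervalMs) : minIntervalMs_(minIntervalMs) {}

    // Returns true if the action should run for an event arriving at nowMs.
    bool shouldRun(std::int64_t nowMs)
    {
        if (hasRun_ && nowMs - lastRunMs_ < minIntervalMs_) {
            return false;  // too soon after the previous run, drop this event
        }
        hasRun_ = true;
        lastRunMs_ = nowMs;
        return true;
    }

private:
    std::int64_t minIntervalMs_;
    std::int64_t lastRunMs_ = 0;
    bool hasRun_ = false;
};
```

With slider events arriving every 8 ms and a 32 ms interval, only every fourth event triggers a filter pass. A production throttle would usually also schedule a trailing call so that the final slider value is always applied; that part is omitted here.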
Look at the official simple demo:
http://fabricjs.com/image-filters
Use the sliders to filter, apply even more filters, everything seems pretty smooth to me.

How to write to the image directly by CPU when load it in Vulkan?

In Direct3D12, you can use "ID3D12Resource::WriteToSubresource" to enable zero-copy optimizations for UMA adapters.
What is the equivalent of "ID3D12Resource::WriteToSubresource" in Vulkan?
What WriteToSubresource seems to do (in Vulkan-equivalent terms) is write pixel data from CPU memory to an image whose storage is in CPU-writable memory (hence the requirement that it first be mapped), to do so immediately without the need for a command buffer, and to be able to do so regardless of linear/tiling.
Vulkan doesn't have a way to do that. You can write directly to the backing storage for linear images (in the general layout), but not for tiled ones. You have to use a proper transfer command for that, even on UMA architectures. That means building a command buffer and submitting it to a transfer-capable queue, since Vulkan doesn't have any immediate copy commands like that.
A Vulkan way to do this would essentially be a function that writes data to a mapped pointer to device memory storage as appropriate for a tiled VkImage in the pre-initialized layout that you intend to store in a particular region of memory. That way, you could then bind the image to that location of memory, and you'd be able to transition the layout to whatever you want.
But that would require adding such a function and allowing the pre-initialized layout to be used for tiled images (so long as the data is written by this function).
So, from the ID3D12Resource::WriteToSubresource documentation I read that it performs one copy, with marketese sprinkled on top.
Vulkan is an explicit API, which perfectly allows you to do a one-copy upload on UMA (or on anything else). It even allows you to do real zero-copy, if you stick with linear tiling.
UMA may look like this: https://vulkan.gpuinfo.org/displayreport.php?id=4919#memorytypes
I.e. has only one heap, and the memory type is both DEVICE_LOCAL and HOST_VISIBLE.
So, if you create a linearly tiled image (or a buffer) in Vulkan, vkMapMemory its memory, and then produce your data directly into that mapped pointer, there you have a (real) zero-copy.
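Picking a memory type that is both DEVICE_LOCAL and HOST_VISIBLE follows the usual memory-type search. The sketch below is standalone so that it compiles without the Vulkan SDK: the flag constants mirror the values of VkMemoryPropertyFlagBits, the vector stands in for VkPhysicalDeviceMemoryProperties::memoryTypes[i].propertyFlags, and allowedTypeBits stands in for VkMemoryRequirements::memoryTypeBits. The function name findMemoryType is an assumption, not a Vulkan API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins with the same values as the Vulkan flag bits.
constexpr std::uint32_t kDeviceLocal  = 0x1;  // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisible  = 0x2;  // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t kHostCoherent = 0x4;  // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// Hypothetical helper: returns the index of the first memory type that the
// resource allows and that has all required property flags, or -1 if none does.
int findMemoryType(const std::vector<std::uint32_t>& typeFlags,
                   std::uint32_t allowedTypeBits,
                   std::uint32_t requiredFlags)
{
    for (std::size_t i = 0; i < typeFlags.size(); ++i) {
        const bool allowed = (allowedTypeBits & (1u << i)) != 0;
        if (allowed && (typeFlags[i] & requiredFlags) == requiredFlags) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

On a UMA device like the one linked above, asking for kDeviceLocal | kHostVisible should succeed; on a discrete GPU the search may return -1, in which case a staging copy is the fallback.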
Since this is not always practical (i.e. you cannot always choose how things are allocated, e.g. if it is data returned from library function), there is an extension VK_EXT_external_memory_host (assuming your ICD supports it of course), which allows you to import your host data directly, without having to first make a Vulkan memory map.
Now, there are optimally tiled images. Optimal tiling is opaque in Vulkan (so far), and implementation-dependent, so you do not even know the addressing scheme without some reverse engineering. You, generally speaking, want to use optimally tiled images, because supposedly accessing them has better performance characteristics (at least in common situations).
This is where the single copy comes in. You would take your linearly tiled image (or buffer), and vkCmdCopy* it into your optimally tiled image. That copy is performed by the device (GPU) with all its bells and whistles, potentially faster than the CPU could do it, i.e. what I suspect they would call "near zero-copy".

How to return acquired SwapChain image back to the SwapChain?

I can currently acquire a swap chain image, draw to it and then present it. After vkQueuePresentKHR the image is returned to the swap chain. Is there another way to return the image? I do not want to display the rendered data on screen.
You can probably do what you want here by simply not presenting the images to the device. But the number of images you can acquire depends on the VkSurfaceCapabilitiesKHR of your device.
The maximum number of images that the application can simultaneously acquire from this swapchain is derived by subtracting VkSurfaceCapabilitiesKHR::minImageCount from the number of images in the swapchain and adding 1.
On my device, I can have an 8-image swapchain and the minImageCount is 2, letting me acquire 7 images at once.
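The rule quoted above is simple arithmetic. A minimal sketch follows, using a hypothetical helper name; swapchainImageCount would come from vkGetSwapchainImagesKHR and minImageCount from VkSurfaceCapabilitiesKHR, and the swapchain is assumed to have at least minImageCount images.

```cpp
#include <cstdint>

// Hypothetical helper: number of images that can be acquired at the same time,
// per the rule "images in the swapchain - minImageCount + 1".
// Assumes swapchainImageCount >= minImageCount.
constexpr std::uint32_t maxSimultaneouslyAcquirable(std::uint32_t swapchainImageCount,
                                                    std::uint32_t minImageCount)
{
    return swapchainImageCount - minImageCount + 1;
}
```

For the 8-image swapchain with a minImageCount of 2 described above, this gives 7.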
If you really want, for whatever reason, to scrap the frame, just do not present the image, and reuse it next iteration (do not acquire a new image; use the one you already have).
If there's a possibility you are never going to use some Swapchain Image, you still do not need to worry about it. Acquired Images will be reclaimed (unpresented) when a Swapchain is destroyed.
Seeing your usage comment now, I must add that you still need to synchronize, and that acquisition order is not guaranteed to be round-robin. It also sounds very misguided: creating a Swapchain seems like the same amount of programming work as creating an Image and binding memory to it, and the result is not "how it is meant to be used"...
From a practical point of view, you will probably not have a good choice of Swapchain Image formats, types and usage flags, and they can be limited in size and in the number you can use. It will probably not work well across platforms. It may come with a performance hit too.
TL;DR Swapchains are only for interaction with the windowing system (or lack thereof) of the OS. For other uses there are appropriate non-Swapchain commands and objects.
Admittedly Vulkan is sometimes less than terse to write in (a product of it being C-based, reasonably low-level and abstracting a wide range of GPU-like HW), but your proposed technique is not a viable way around it. You need to get used to it and, where appropriate, make your own abstractions (or use a library that does that).

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
1. You are lazy
2. You need to map the memory to host (unless you can use PREINITIALIZED)
3. You use the image as multiple incompatible attachments and you have no choice
4. For storage images
5. (Other cases where you would switch layouts too much, and don't even need barriers, relative to the work done on the images. Measurement is needed to confirm GENERAL is better in that case. Most likely a premature optimization even then.)
PS: You could transition all the mip levels together to TRANSFER_DST with a single command beforehand, and then transition only the one you need to TRANSFER_SRC. With a decent HDD, it may be even better to store the images with their mip-maps already generated, if that's an option (and perhaps even get better quality by using a more sophisticated algorithm).
PS2: Too bad there's no mip-map creation command. The cmdBlit most likely does it anyway under the hood for images smaller than half resolution...
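The per-level bookkeeping for the blit loop described in the question can be sketched as follows. This is a standalone illustration with hypothetical names (MipExtent, mipChain), not Vulkan API: it only computes the extent of every mip level, which is what would fill srcOffsets[1] and dstOffsets[1] of each VkImageBlit when level i is generated from level i - 1.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Size of one mip level in texels.
struct MipExtent {
    std::uint32_t width;
    std::uint32_t height;
};

// Hypothetical helper: lists the extents of the full mip chain, from level 0
// down to 1x1. Each dimension is halved per level and clamped to 1.
std::vector<MipExtent> mipChain(std::uint32_t width, std::uint32_t height)
{
    std::vector<MipExtent> levels;
    for (;;) {
        levels.push_back({width, height});
        if (width == 1 && height == 1) {
            break;
        }
        width = std::max<std::uint32_t>(1, width / 2);
        height = std::max<std::uint32_t>(1, height / 2);
    }
    return levels;
}
```

With this list, blit i (for i >= 1) reads levels[i - 1] as the source and writes levels[i] as the destination, so only those two levels need to be in the transfer source and transfer destination layouts at that moment.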
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer image layouts if you want your code to run on all Vulkan implementations and get the most performance across all of them, as the layouts may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image and not image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.

Working around WebGL readPixels being slow

I'm trying to use WebGL to speed up computations in a simulation of a small quantum circuit, like what the Quantum Computing Playground does. The problem I'm running into is that readPixels takes ~10ms, but I want to call it several times per frame while animating in order to get information out of gpu-land and into javascript-land.
As an example, here's my exact use case. The following circuit animation was created by computing things about the state between each column of gates, in order to show the inline-with-the-wire probability-of-being-on graphing:
The way I'm computing those things now, I'd need to call readPixels eight times for the above circuit (once after each column of gates). This is waaaaay too slow at the moment, easily taking 50ms when I profile it (bleh).
What are some tricks for speeding up readPixels in this kind of use case?
Are there configuration options that significantly affect the speed of readPixels? (e.g. the pixel format, the size, not having a depth buffer)
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Should I try to aggregate all the textures I'm reading into a single megatexture and sort things out after a single big read?
Should I be using a different method to get the information back out of the textures?
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
"Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?"
Yes, yes, yes. readPixels is fundamentally a blocking, pipeline-stalling operation, and it is always going to kill your performance wherever it happens, because it's sending a request for data to the GPU and then waiting for it to respond, which normal draw calls don't have to do.
Do readPixels as few times as you can (use a single combined buffer to read from). Do it as late as you can. Everything else hardly matters.
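One way to picture the "single combined buffer" is to stack the per-column results vertically in one render target, read it back once, and slice the bytes afterwards. The sketch below shows only the slicing step on the CPU side, in C++ for illustration. The name sliceRegion is hypothetical, and it assumes RGBA8 pixels and that every region has the same size and spans the full width of the combined buffer, so each region is contiguous in the bytes that were read back.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: extracts region number regionIndex from the bytes of one
// big read. Regions are assumed to be stacked in row order, 4 bytes per pixel.
std::vector<std::uint8_t> sliceRegion(const std::vector<std::uint8_t>& combinedPixels,
                                      std::size_t regionWidth,
                                      std::size_t regionHeight,
                                      std::size_t regionIndex)
{
    const std::size_t bytesPerRegion = regionWidth * regionHeight * 4;
    const auto first = combinedPixels.begin() +
                       static_cast<std::ptrdiff_t>(regionIndex * bytesPerRegion);
    return std::vector<std::uint8_t>(first,
                                     first + static_cast<std::ptrdiff_t>(bytesPerRegion));
}
```

For the eight-column circuit in the question, this would replace eight blocking reads with one read followed by eight cheap slices.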
"Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?"
This will get you immensely better performance.
If your graphics are all like you show above, you shouldn't need to do any “layout” at all (which is good, because it'd be very awkward to implement) — everything but the text is some kind of color or boundary animation which could easily be done in a shader, and all the layout can be just a static vertex buffer (each vertex has attributes which point at which simulation-state-texel it should be depending on).
The text will be more tedious merely because you need to load all the digits into a texture to use as a spritesheet and do the lookups into that, but that's a standard technique. (Oh, and divide/modulo to get the digits.)
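The divide/modulo step and the spritesheet lookup can be sketched on the CPU as below; a shader would do the same arithmetic per glyph quad. The names decimalDigits and glyphU are hypothetical, and the sheet layout (ten equally wide glyphs, 0 to 9, in a single row) is an assumption.

```cpp
#include <vector>

// Hypothetical helper: splits a non-negative integer into its decimal digits,
// most significant first, using repeated modulo and division by 10.
std::vector<int> decimalDigits(unsigned value)
{
    std::vector<int> digits;
    do {
        digits.insert(digits.begin(), static_cast<int>(value % 10));
        value /= 10;
    } while (value != 0);
    return digits;
}

// Hypothetical helper: horizontal texture coordinate of the left edge of a
// digit's glyph, assuming a sheet with the glyphs 0..9 side by side.
float glyphU(int digit)
{
    return static_cast<float>(digit) / 10.0f;
}
```

Each digit then selects the texture coordinate range [glyphU(d), glyphU(d) + 0.1) for its quad.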
I don't know enough about your use case, but just guessing: why do you need readPixels at all?
First, you don't need to draw text or the static parts of your diagram in WebGL. Put another canvas, svg or img over the WebGL canvas and set the CSS so they overlap. Let the browser composite them. Then you don't have to do it.
Second, let's assume you have a texture that has your computed results in it. Can't you just make some geometry that matches the places in your diagram that need to have colors, and use texture coords to look up the results from the correct places in the results texture? Then you don't need to call readPixels at all. That shader can use a ramp texture lookup or any other technique to convert the results to other colors to shade the animated parts of your diagram.
If you want to draw numbers based on the result, you can use a technique like this: you'd make a shader that references the result texture to look at a result value and then indexes glyphs from another texture based on that.
Am I making any sense?