FabricJS v3.4.0: Filters & maxTextureSize - performance/size limitations - vue.js

Intro:
I've been experimenting with fabricJS's image filtering features, hoping to start using them in my web app, but I've run into the following.
It seems fabricJS by default caps the filter texture size (textureSize) at 2048, meaning the largest filterable image is 2048x2048 pixels.
I've attempted to raise the default by calling fabric.isWebGLSupported() and then setting fabric.textureSize = fabric.maxTextureSize, but that still caps it at 4096x4096 pixels, even though the maxTextureSize reported by my device is around 16000.
I realize that devices usually report the full value without accounting for the memory actually available, but that still seems like a hard limitation.
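Roughly, the attempt described above looks like this (my own sketch, so the exact identifier names/casing may not match the fabric API):
if (fabric.isWebGLSupported()) {
  // raise the filter texture size to what the device reports
  fabric.textureSize = fabric.maxTextureSize; // still behaves as if capped at 4096
}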
So these are the main issues I need to address before I can use this feature effectively:
1- Render blocking applyFilters() method:
The current filter application function seems to block rendering in the browser. Is there a way to call it without blocking rendering, so I can show an indeterminate loading spinner or something similar?
Is it as simple as making the applyFilters method async and calling it from somewhere else in the app? (For context, I'm using Vue, with webpack/babel polyfilling async/await etc.)
2- Size limits:
Is there a way to bypass the size limit on images? I'm looking to filter images up to 4800x7200 pixels.
I can think of at least one way to do this, which is to "break up" the image into smaller images, apply the filters, and then stitch them back together. But I worry it might be a performance hit, as there would be a lot of canvas exports and canvas initializations in the process.
I'm surprised fabricJS doesn't do this "chunking" by default, as it's quite a comprehensive library, and they've already gone as far as using WebGL shaders (which are a black box to me) for filtering under the hood for performance. Is there a better way to do this?
My other option would be to send the image to a service (one I hand-roll, or a pre-existing paid one) that applies the filters somewhere in the cloud and returns the result to the user, but that's not a solution I want to resort to just yet.
For context, I'm mostly using fabric.Canvas and fabric.StaticCanvas to initialize canvases in my app.
Any insights/help with this would be great.

I wrote the filtering backend for fabricJS together with Scott Seaward (credit to him too), so I can give you some answers.
Hard cap at 2048
A lot of MacBooks with only an integrated Intel video card report a max texture size of 4096, but then crash the WebGL instance at anything higher than 2280. This was happening widely in 2017, when the WebGL filtering was written. A default of 4096 would have left a LOT of notebooks uncovered, and don't forget mobile phones either.
You know your user base, so you can raise the limit to whatever your video card allows and whatever the canvas allows in your browser. However big the texture can be, the final image must still be copied into a canvas and displayed (canvas has a different max size depending on browser and device).
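As a sketch (not verified against every fabric version; the check is spelled fabric.isWebglSupported in the 3.x source), raising the limit and re-creating the filtering backend looks roughly like this:
// probe WebGL support; this also fills in fabric.maxTextureSize
if (fabric.isWebglSupported && fabric.isWebglSupported(fabric.textureSize)) {
  // raise the tile size to what the device reports, or to a more conservative cap
  fabric.textureSize = Math.min(fabric.maxTextureSize, 8192);
  // re-create the filter backend so the new tile size is actually picked up
  fabric.filterBackend = fabric.initFilterBackend();
}
The point is to set fabric.textureSize before the WebGL backend is (re)created, since the backend takes its tile size from that value at construction time.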
Render blocking applyFilters() method
WebGL is synchronous, as far as I understand it.
Creating a parallel execution thread for filtering operations that are on the order of 20-30 ms (sometimes just a couple of ms in Chrome) seems excessive.
Also consider that I tried it, but when more than 4 WebGL contexts were open in Firefox some would get dropped, so I settled on one at a time.
The non-WebGL filtering takes longer, of course, and could probably be done in a separate thread, but fabricJS is a generic library that handles vectors, filtering and serialization; it already has a lot on its plate, and filtering performance is not that bad. But I'm open to arguments about it.
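That said, if the goal is only to get a spinner painted before a longer synchronous run, you do not need a worker thread; yielding one frame to the browser before calling applyFilters() is usually enough. A rough sketch (plain browser APIs, nothing fabric-specific assumed beyond applyFilters and requestRenderAll):
async function applyFiltersWithSpinner(img, canvas, showSpinner, hideSpinner) {
  showSpinner();
  // two requestAnimationFrame calls, so the spinner is actually painted
  // before the main thread gets blocked by the synchronous filter pass
  await new Promise(function (resolve) {
    requestAnimationFrame(function () { requestAnimationFrame(resolve); });
  });
  img.applyFilters();
  canvas.requestRenderAll();
  hideSpinner();
}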
Chunking
The Shutterstock editor uses fabricJS and is the main reason the WebGL backend was written. The editor also has chunking and can filter bigger images with tiles of 2048 pixels. We did not release that as open source, and I do not plan to ask for it. That kind of tiling limits the kinds of filters you can write, because the code only sees a limited portion of the image at a time; even just blurring becomes complicated.
Here is a description of the tiling process; it's just a blog post, written for casual readers and not only software engineers:
https://tech.shutterstock.com/2019/04/30/canvas-webgl-filtering-concepts
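For purely per-pixel filters (brightness, saturation and the like), a naive tiling loop on top of fabric's public API is not much code. A rough, untested sketch (it assumes fabric.Image accepts a canvas element and that toCanvasElement() returns the filtered pixels; anything with a kernel, like blur, will show seams at the tile edges):
function filterInTiles(imgElement, filters, tileSize) {
  tileSize = tileSize || 2048;
  var out = document.createElement('canvas');
  out.width = imgElement.width;
  out.height = imgElement.height;
  var ctx = out.getContext('2d');
  for (var y = 0; y < imgElement.height; y += tileSize) {
    for (var x = 0; x < imgElement.width; x += tileSize) {
      var w = Math.min(tileSize, imgElement.width - x);
      var h = Math.min(tileSize, imgElement.height - y);
      // copy one tile of the source image into a scratch canvas
      var scratch = document.createElement('canvas');
      scratch.width = w;
      scratch.height = h;
      scratch.getContext('2d').drawImage(imgElement, x, y, w, h, 0, 0, w, h);
      // wrap it in a fabric.Image, filter it, and draw the result back
      var tile = new fabric.Image(scratch);
      tile.filters = filters;
      tile.applyFilters();
      ctx.drawImage(tile.toCanvasElement(), x, y);
    }
  }
  return out; // stitched, filtered copy of the original image
}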
General render-blocking considerations
So fabricJS has some pre-written filters made with shaders.
The timings I note here are from memory and have not been re-verified.
The time spent filtering an image breaks down as:
Uploading the image to the GPU (I do not know how many ms)
Compiling the shader (up to 40 ms, it depends)
Running the shader (around 2 ms)
Downloading the result from the GPU (around 0 ms or 13 ms, depending on which method is used)
Now the first time you run a filter on a single image:
The image gets uploaded
Filter compiled
Shader Run
Result downloaded
The second time you do this:
Shader Run
Result downloaded
When a new filter is added or an existing filter is changed:
New filter compiled
New shader (or both shaders) run
Result downloaded
The most common mistakes I have noticed when building applications with filtering are:
Forgetting to remove old filters, leaving them active with a value near 0 that produces no visual change but still adds time
Connecting the filter to a slider change event without throttling, which depending on the browser/device can fire up to 120 filtering operations per second
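For the slider case a minimal throttle is enough to keep that in check; a sketch with hypothetical slider, brightnessFilter, img and canvas variables:
var pending = false;
slider.addEventListener('input', function () {
  if (pending) { return; }
  pending = true;
  setTimeout(function () {
    pending = false;
    brightnessFilter.brightness = parseFloat(slider.value);
    img.applyFilters();
    canvas.requestRenderAll();
  }, 66); // at most ~15 filter runs per second
});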
Look at the official simple demo:
http://fabricjs.com/image-filters
Use the sliders to filter and apply even more filters; everything seems pretty smooth to me.

Related

Can we control progressive rendering in the Viewer based on the distance to the camera?

We have to work with very large models and we're hoping to use the first person camera to walk through them, and eventually do this in VR. The progressive rendering does wonders for improving perceived responsiveness, but it can be disorienting to have so many items around you disappear as you move.
Is there any way to turn progressive rendering off but only for objects closest to the camera? Maybe a number of objects up to a maximum number, or objects within a radius from the camera. Everything further away can load in later and flicker during motion, but it would be nice to keep the nearby objects rendered, especially structural objects like stairs. I've often walked towards stairs just to have them disappear in front of me, forcing me to fly to a platform with E and Q instead of walking.
So far I've only found a way to toggle progressive rendering for the entire model on or off with viewer.setProgressiveRendering(bool) but I haven't found a way to customize the rendering behavior.
Per our Engineering team's recommendation, can you try setting rendering targets with
viewer.impl.setFPSTargets(1, 5, 15) //min, target, max
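For example (a sketch; GEOMETRY_LOADED_EVENT is just one convenient place to hook this in):
viewer.addEventListener(Autodesk.Viewing.GEOMETRY_LOADED_EVENT, function () {
  viewer.impl.setFPSTargets(1, 5, 15); // min, target, max
  // the blunter alternative from your question, if the targets are not enough:
  // viewer.setProgressiveRendering(false);
});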
In fact, we've been getting similar reports from other developers requesting the capability to fine-tune rendering with large models, so our Engineering team is considering options to extend existing functions and even build them into extensions.

Custom rendering with GPU, Direct3D or OpenGL

I have a Windows application that currently renders graphics largely using MFC that I'd like to change to make better use of the GPU. Most of the graphics are straightforward and could easily be built up into a scene graph, but some of the graphics could prove very difficult. Specifically, in addition to the normal mesh-type objects, I'm also dealing with point clouds which are liable to contain billions of Cartesian points stored in a very compact manner, and which use quite a lot of custom culling techniques to be displayed in real time (Example). What I'm looking for is a mechanism that does the bulk of the scene rendering to a buffer and then gives me access to that buffer, a z-buffer, and camera parameters, such that I can modify them before putting them out to the display. I'm wondering whether this is possible with Direct3D, OpenGL, or possibly a higher-level framework like OpenSceneGraph, and what would be the best starting point. Given the software is Windows-based, I'd probably prefer to use Direct3D as this is likely to lead to the fewest driver issues, which I'm eager to avoid. OpenSceneGraph seems to provide custom culling via octrees, which are close but not identical to what I'm using.
Edit: To clarify a bit more, currently I have the following:
A display list / scene in memory which will typically contain up to a few million triangles, lines, and pieces of text, which I cull in software and output to a bitmap using low-performing drawing primitives
A point cloud in memory which may contain billions of points in a highly compressed format (~4.5 bytes per 3D point), which I cull and output to the same bitmap
Cursor information that gets added to the bitmap prior to output
A camera, z-buffer and attribute buffers for navigation and picking purposes
The slow bit is the drawing in item 1 (the software culling and output via low-performing primitives), which I'd like to replace with GPU rendering of some kind. The solution I envisage is to build a scene for the GPU, render it to a bitmap (with matching z-buffer) based on my current camera parameters, and then add my point cloud prior to output.
Alternatively, I could move to a scene-based framework that manages the cameras and navigation for me and provides points in view as spheres or splats, based on volume and level of detail, during the rendering loop. In this scenario I'd also need to be able to add cursor information to the view.
In either scenario, the hosting application will be MFC C++ based on VS2017 which would require too much work to change for the purposes of this exercise.
It's hard to say exactly based on your description of a complex problem.
OSG can probably do what you're looking for.
Depending on your timeframe, I'd consider eschewing both OpenGL (OSG) and DirectX in favor of the newer Vulkan 3D API. It's a successor to both D3D and OGL, and is designed by the GPU manufacturers themselves to provide optimal performance exceeding both of its predecessors.
The OSG project is currently developing a Vulkan scenegraph known as VSG, which already demonstrates superior performance to OSG and will have more generalized culling ability.
I've worked a bunch with point clouds and am pretty experienced with them, but I'm not exactly clear on what you're proposing to do.
If you want to actually have a verbal discussion about the matter, I'm pretty easy to find (my company is AlphaPixel -- AlphaPixel.com) and you could call us. I'm in the European time zone right now; it's not clear from your question where you are, but you sound US-based.

Rendering multiple models bug on DirectX 12

I am trying to render multiple models in DirectX 12 using only one graphics context, but the result is very strange and I don't have much idea what the reason is. The screenshots show the rendering of the Sponza model from outside: the one on the right is the correct result and the one on the left has the problem.
The second screenshot shows the rendering of the left (broken) Sponza from inside.
Even though the two loaded meshes are the same, each model has its own vertex buffer, index buffer and SRVs. When building the graphics context there is only one graphics context; it is set with each model's index and vertex buffers, and then I call drawIndexed() to render each model. After the graphics context is created, we execute it once per frame. However, if we create an individual graphics context for each model and execute all of the graphics contexts per frame, the rendering works fine but the frame rate drops a lot.
It would be very helpful if you could provide any hints about the reason for this strange result, or even better, a solution. Thank you very much in advance.
First, I would recommend staying away from DX12 and sticking to DX11, unless you are already a DX11 expert and you are in the top 1% of application cases, like triple-A games or applications with very specific, high demands on control over GPU memory.
Without much detail on your problem, I can only give you a few basic pieces of advice:
Run with the debug layer enabled via D3D12GetDebugInterface and look at the console log (you will need to install the optional Windows feature named Graphics Tools)
Use frame-capture tools, like the graphics debugger (VSGD) in Visual Studio or Nsight from NVIDIA, and inspect your frame step by step
Use DX11, really

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
You are lazy
You need to map the memory to the host (unless you can use PREINITIALIZED)
You use the image as multiple incompatible attachments and have no choice
You use storage images
Other cases where you would otherwise switch layouts too often (and don't even need barriers) relative to the work done on the images; measurement is needed to confirm GENERAL is better in that case, and it is most likely a premature optimization even then
PS: You could transition all the mip levels to TRANSFER_DST with a single command beforehand, and then only transition the one you need to SRC. With a decent HDD it might even be best to store the images with mip-maps already generated, if that's an option (and perhaps even get better quality using a more sophisticated algorithm).
PS2: Too bad there's no dedicated mip-map creation command. vkCmdBlitImage most likely does it under the hood anyway for images smaller than half resolution...
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer layouts if you want your code to run on all Vulkan implementations and get the best performance across all of them, as the layout may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, and not for the image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.

Best way to load 24-bit BGR image from memory to ID2D1Bitmap1

I have a 24-bit BGR image in memory as a contiguous buffer (represented by a cv::Mat, in case that is of any help). I would like to load it into an ID2D1Bitmap1 bitmap for 2D rendering. I have the following working code (shown as pseudo-code here):
IWICImagingFactory::CreateBitmapFromMemory(GUID_WICPixelFormat24bppBGR);
IWICFormatConverter::Initialize(GUID_WICPixelFormat32bppRGB);
ID2D1DeviceContext::CreateBitmapFromWicBitmap;
This works fine, the main issue being the time it takes: 20-40 milliseconds, which is too long for my application. I am looking for ways to optimize the process.
I can probably save the creation time of the ID2D1Bitmap1 by doing that once and then loading the converted image from memory using CopyFromMemory, but the conversion itself still takes a large amount of time. One approach could be to load the raw BGR buffer into GPU memory and convert it to the native RGBA format on the GPU itself, but I have no idea how to start with that.
Your second idea is exactly the direction you should go. Create the ID2D1Bitmap(s) once, convert the buffer (more on that below), and use CopyFromMemory. I do something very similar in an app which provides a preview of a connected webcam (which may have one of various formats). Some cameras will deliver the images in MJPEG, YUY2, etc.
That app uses MediaFoundation, and an IMFTransform to convert the buffer. The IMFTransform in this case is an instance of CLSID_CColorConvertDMO (which uses SIMD registers/instructions when possible). However, prior to completing that, I tested with my own conversion code (which was CPU bound, and performed so-so), and another solution with HLSL and DirectCompute (performed well, but handled only one format). In the end I chose CLSID_CColorConvertDMO to handle the various types of input, but if you only have one known type, you may choose to use the HLSL solution (although it will cause you to have to write the conversion code, and setup the 'views' of the data).
However, if you choose the Media Foundation route, you can use an IMFTransform without all of the rest of the graph (source, sink, etc.). After creating the CLSID_CColorConvertDMO instance and setting the input and output types (format, frame size, etc.), create an IMFSample (using MFCreateSample) and an IMFMediaBuffer (using MFCreateMemoryBuffer), add the buffer to the sample (using IMFSample::AddBuffer), and then all that is necessary is to call ProcessInput and ProcessOutput to convert the buffer (create all items up front). This may sound like a lot, but if done correctly your CPU utilization will not take a noticeable hit, and you will achieve the performance you are looking for, even for capture cards which often deliver large frames at 60+ FPS (I have used DataPath and Blackmagic capture cards in the past).
Good luck. I am certain you will crush it.