How to improve MTKView rendering when using MPSImageScale and MTLBlitCommandEncoder - objective-c

TL;DR: From within my MTKView's delegate drawInMTKView: method, part of my rendering pass involves adding an MPSImageBilinearScale performance shader and zero or more MTLBlitCommandEncoder requests for generateMipmapsForTexture. Is that a smart thing to do from within drawInMTKView:, which happens on the main thread? Do either of them block the main thread while running or are they only being encoded and then executed later and entirely on the GPU?
Longer Version:
I'm playing around with Metal within the context of an imaging application. I use Core Image to load an image and apply filters. The output image is displayed as a 2D plane in a metal view with a single texture. This works, but to improve performance I wanted to experiment with Core Image's ability to render out smaller tiles at a time. Each tile is rendered into its own IOSurface.
On each render pass, I check if there are any tiles that have been recently rendered. For each rendered tile (which is now an IOSurface), I create a Metal texture from a CVMetalTextureCache that is backed by the surface.
I think use a scaling MPS to copy from the tile-texture into the "master" texture. If a tile was copied over, then I issue a blit command to generate the mipmaps on the master texture.
What I'm seeing is that if my master texture is quite large, then generate the mipmaps can take "a bit of time". The same is true if I have a lot of tiles. It appears this is blocking the main thread because my FPS drops significantly. (The MTKView is running at the standard 60fps.)
If I play around with tile sizes, then I can improve performance in some areas but decrease it in others. For example, increasing the tile size that Core Image renders it creates less tiles, and thus less calls to generate mipmaps and blits, but at the cost of Core Image taking longer to render a region.
If I decrease the size of my "master" texture, then mipmap generation goes faster since only the dirty textures are updates, but there appears to be a lower bounds on how small I should make the master texture because if I make it too small, then I need to pass in a large number of textures to the fragment shader. (And it looks like that limit might be 128?)
What's not entirely clear to me is how much of this I can move off the main thread while still using MTKView. If part of the rendering pass is going to block the main thread, then I'd prefer to move it to a background through so that UI elements (like sliders and checkboxes) remain fully responsive.
Or maybe this isn't the right strategy in the first place? Is there a better way to display really large images in Metal other than tiling? (i.e.: Images larger than Metal's texture size limit of 16384?)

Related

Number or hardware draw calls in Core Animation

I’m building a Metal app which renders a few objects to a view, similar to how Core Animation renders its CALayers. Soon I realized that rendering each layer with a separate draw call can be expensive and inefficient if there are many objects, compared to drawing all objects from a single vertex buffer with a single draw call. However then the single buffer size may become large and if only a few of the objects change frequently, the entire buffer must be sent to the GPU which may also be inefficient.
So I’m wondering how Core Animation works in this regard. Does it render each layer with a separate draw call or unifies all layers into a single draw call, or something in between?
I tried to profile a Core Animation app with Instruments but the Metal draw calls don’t seem to be recorded, even though Core Animation is said to use Metal under the hood. Any insights?

Rendering small text with Vulkan?

A font rendering library (like say freetype) provides a function that will take an outline font file (like a .ttf) and a character code and produce a bitmap of the corresponding glyph in host memory.
For small text (like say up to 30x30 pixel glyphs) what's the most efficient way to render those glyphs to a Vulkan framebuffer?
Some options I've though about might be:
Render the glyphs with the font rendering library every time on demand, blit them with host code to a single host-side image holding a whole "text box", transfer the host-side image of the text box to a device local image, and then render a quad (like a normal image) using fragment shader / image sampler from the text box to be drawn.
At program startup cycle through all the glyphs host side, render them to glyph bitmaps. Do the same as 1 but blit from the cached glyph bitmaps (takes about 1 MB host memory).
Cache the glyph bitmaps individually into device local images. Rather than bitting host-side, render a quad for each glyph device-side and set the image sampler to the corresponding glyph each time. (Not sure how the draw calls would work? One draw call per glyph with a different combined image sampler every time?)
Cache all the glyph bitmaps into one large device-side image (layed out in a big grid say). Use a single device-side combined image sampler, and push params to describe the subregion that contains the glyph image. One draw call per glyph, updating push params each time.
Like 4 but use a single instanced draw call, and rather than push params use instance-varying input attributes.
Something else?
I mean like, how do common game engines like Unreal or Unity or Godot etc solve this problem? Is there a typical approach or best practice?
First, some considerations:
Rasterizing a glyph at around 30px with freetype might take on the order of 10μs. This is a very small one-time cost, but rendering e.g. 100 glyphs every frame would seriously eat into your frame budget (if we assume the math is as simple as 100 * 10μs == 1ms).
State changes (like descriptor updates) are relatively expensive. Changing the bound descriptor for each character you render has non-negligible cost. This could be limited by batching character draws (draw all the As, then the Bs, etc), but using push constants is typically the fastest.
Instanced drawing with small meshes (such as quads or single triangles) can be very slow on some GPUs, as they will not schedule multiple instances on a single wavefront/warp. If you're rendering a quad with 6 vertices, and a single execution unit can process 64 vertices, you may end up wasting 58/64 = 90.6% of available vertex shading capacity.
This suggests 4 is your best option (although 5 is likely comparable); you can further optimize that approach by caching the results of the draw calls. Imagine you have some menu text:
The first frame it is needed, render all the text to an intermediate image.
Each frame it is needed, make a single draw call textured with the intermediate image. (You could also blit the text if you don't need transparency.)

Culling off-screen objects in OpenGL ES 2 2D

I'm playing about with OpenGL ES 2.0. If I'm working with a simple 2D projection, if I have a large 2D grid of vertices which are pretty much static (think map tiles), of which only a small proportion are visible at any one time, would it be better to...
Work out in the CPU which vertices are visible, and and create a VBO to draw just those triangles that make up the visible tiles in each frame?
or
Keep a static VBO with the entire tiled grid, and then just rely on the graphics card (RPi, in my case) to clip out the off-screen triangles?
Or perhaps some combination of the two (like sets of overlapping pre-computed grids)? How big does the grid have to be before the latter option becomes unworkable?
Edit
I decided to make several calls to glDrawElements(), drawing sub-ranges of the index buffer that I knew would overlap the viewport. At the scale I'm working at it doesn't seem to make any difference to the speed over drawing the entire element array, even on a Pi Zero.
However, this approach would require more computation to determine which ranges of elements needed to be rendered if there was any rotation of the grid involved - effectively rasterising my own quad. I'm interested to hear if this is a reasonable approach.
There are some other options like a more exotic structure for breaking up the plane into sub areas, I guess. Still not sure if any of this is really necessary, though.
Thanks!
Please note: I don't want to discuss drawing tiles in the fragment shader, I'm more interested in the correct way to work with the vertex shader than actually solving the described problem.
If that's a regular grid, I'd split it in large chunks, so the screen width (larger side) would fit 2-3 such chunks. They don't need to overlap if it's regular grid.
Checking one chunk's visibility is trivial and cheap, as well as finding/selecting those few that must be drawn. And the wasted/clipped area is small enough to not worry about it. You don't have to go crazy and trim every single vertex that's outside of the screen.
Each chunk would have own VBO, and it would be weakly cached when it goes fully outside of screen, so you don't have to rebuild/reload resources needed to draw that chunk if you quickly return to this part of the map.
Splitting in chunks minimizes the memory requirements and speeds up the level loading. You spend time only loading the part of the screen that user will see immediately. This also allows quite huge maps, since you can prefetch the areas that you're going towards to.

About animating frame by frame with sprite files

I used to animate my CCSprites by iterating through 30 image files (rather big ones) and on each file I changed the CCSprite's texture to that image file.
Someone told me that was not efficient and I should use spritesheets instead. But, can I ask why is this not efficient exactly?
There are two parts to this question:
Memory.
OpenGL ES requires textures to have width and height's to the power of 2 eg 64x128, 256x1024, 512x512 etc. If the images don't comply, Cocos2D will automatically resize your image to fit the dimensions by adding in extra transparent space. With successive images being loaded in, you are constantly wasting more and more space. By using a sprite sheet, you already have all the images tightly packed in to reduce wastage.
Speed. Related to above, it takes time to load an image and resize it. By only calling the 'load' once, you speed the entire process up.

SpriteSheet, AtlasSprite, Sprite and optimization

I'm developing an iPhone Cocos2D game and reading about optimization. some say use spritesheet whenever possible. others say use atlassprite whenever possible and others say sprite is fine.
I don't get the "whenever possible", when each one can and can't be used?
Also what is the best case for each type?
My game will typically use 100 sprites in a grid, with about 5 types of sprites and some other single sprites. What is the best setup for that? guidelines for deciding for general cases will help too.
Here's what you need to know about spritesheets vs. sprites, generally.
A spritesheet is just a bunch of images put together onto one big image, and then there will be a separate file for image location data (i.e. image 1 starts at coordinate 0,0 with a size of 100,100, image 2 starts at coordinate 100,0, etc).
The advantage here is that loading textures (sprites) is a pretty I/O and memory-alloc intensive operation. If you're trying to do this continually in your game, you may get lags.
The second advantage is memory optimization. If you're using transparent PNGs for your images, there may be a lot of blank pixels -- and you can remove those and "pack" your texture sizes way down than if you used individual images. Good for both space & memory concerns. (TexturePacker is the tool I use for the latter).
So, generally, I'd say it's always a good idea to use a sprite sheet, unless you have non-transparent sprites.