When should I use multiple MTLRenderCommandEncoders to perform my Metal rendering? (Objective-C)

I'm learning Metal, and there's a conceptual question that I'm trying to wrap my head around: at what level, exactly, should my code handle successive drawing operations that require different pipeline states? As I understand it (from answers like this: https://stackoverflow.com/a/43827775/2752221), I can use a single MTLRenderCommandEncoder and change its pipeline state, the vertex buffer it's using, etc., between calls to drawPrimitives:, and the encoder state that was current at the time of each call to drawPrimitives: will be preserved. So that's great. But it also seems like the design of Metal is such that one can make multiple MTLRenderCommandEncoder instances, and use them to sequentially throw batches of commands into a MTLCommandBuffer. Given that the former works – using one MTLRenderCommandEncoder and changing its state – why would one do the latter? Under what circumstances is it correct to do the former, and under what circumstances is it necessary to do the latter? What is an example of a situation where the latter would be necessary/appropriate?
If it matters, I'm working on a macOS app, using Objective-C. Thanks.

Ignoring multithreaded encoding cases, which are somewhat advanced, the main reason you'd want to create multiple render command encoders during a frame is that you need to change which textures you're rendering to.
You'll notice that you need to provide a render pass descriptor when creating a render command encoder. For this reason, we often say that the sequence of commands belonging to a particular encoder constitutes a render pass. The attachments of that descriptor refer to the textures that will be written to by the commands encoded by the encoder.
Many techniques, including shadow mapping and postprocessing effects like bloom, require multiple passes to produce their results. Since you can't change attachments in the midst of a pass, creating a new encoder is the only way to encode multiple passes in a frame.
Relatedly, you should ordinarily use one command buffer per frame. You can, however, sometimes reduce frame time by splitting your passes across multiple command buffers, but this is highly dependent on the shape of your workload and should only be done in tandem with profiling, as it's not always an optimization.

In addition to Warren's answer, another way to look at the question is by examining the API. A number of Metal objects are created from descriptors. The properties of the descriptor at the time an object is created from it govern that object for its lifetime. Those are aspects of the object that can't be changed after creation.
By contrast, the object will have various setter methods to modify other properties over its lifetime.
For a render command encoder, the properties that are fixed for its lifetime are those specified by the MTLRenderPassDescriptor used to create it. If you want to render with different values for any of those properties, the only way to do so is to create a new encoder from a different descriptor. On the other hand, if you can do everything you need/want to do by using the encoder's setter methods, then you don't need a new encoder.

Related

Concurrent use of VkSamplers?

So a VkSampler is created with a VkSamplerCreateInfo that just has a bunch of configuration settings, which as far as I can see would just define a pure function of some input image.
They are described as:
VkSampler objects represent the state of an image sampler which is used by the implementation to read image data and apply filtering and other transformations for the shader.
One use (possibly only use) of VkSampler is to write them to descriptors (such as VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER) for use in descriptor sets that are bound to pipelines/shaders.
My question is: can you write the same VkSampler to multiple different descriptors? From the same or from multiple different descriptor pools? Even if one of the current descriptors is in use in some currently executing render pass?
Can you use the same VkSampler concurrently from multiple different render passes / subpasses / pipelines?
Put another way, are VkSamplers stateless? Or do they represent some stateful memory on the device, meaning you shouldn't use the same one concurrently?
VkSampler objects definitely have data associated with them, so it would be wrong to call them "stateless". What they are is immutable. Like VkRenderPass, VkPipeline, and similar objects, once they are created, their contents cannot be changed.
Synchronization between accesses is (generally) needed only for cases when one of the accesses is a modification operation. Since VkSamplers are immutable, there are no modification operations. So synchronization is not needed for cases where you're accessing a VkSampler from different threads, commands, or what have you.
The only exception is the obvious one: vkDestroySampler, which requires that submitted commands that use the sampler have completed before calling the function.
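To make that concrete, here is a minimal sketch (C++ using the Vulkan C API; device, imageView, and the two already-allocated sets setA and setB are assumed to exist) of writing one immutable VkSampler into two different descriptor sets:
// One sampler, created once; its configuration is immutable afterwards.
VkSamplerCreateInfo samplerInfo{};
samplerInfo.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
samplerInfo.magFilter = VK_FILTER_LINEAR;
samplerInfo.minFilter = VK_FILTER_LINEAR;
samplerInfo.addressModeU = VK_SAMPLER_ADDRESS_MODE_REPEAT;
samplerInfo.addressModeV = VK_SAMPLER_ADDRESS_MODE_REPEAT;
samplerInfo.addressModeW = VK_SAMPLER_ADDRESS_MODE_REPEAT;
VkSampler sampler;
vkCreateSampler(device, &samplerInfo, nullptr, &sampler);

// The same handle is written into two sets (possibly from different pools).
// No synchronization is needed: reads of an immutable object never conflict.
VkDescriptorImageInfo imageInfo{};
imageInfo.sampler = sampler;
imageInfo.imageView = imageView;
imageInfo.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

VkWriteDescriptorSet writes[2]{};
for (int i = 0; i < 2; ++i) {
    writes[i].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
    writes[i].dstBinding = 0;
    writes[i].descriptorCount = 1;
    writes[i].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
    writes[i].pImageInfo = &imageInfo;
}
writes[0].dstSet = setA;
writes[1].dstSet = setB;
vkUpdateDescriptorSets(device, 2, writes, 0, nullptr);
// Only vkDestroySampler requires waiting until submitted work that uses the sampler has finished.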

How to Instance an object in Godot?

So I basically have some fair knowledge of OpenGL 4.0. In OpenGL you can render the same object in many places; this is a technique called instancing, and as far as I know it saves some CPU calls to the GPU.
I wanted to do this in Godot. So I looked in the docs, and they basically just tell me to duplicate an object. But I think this does not save the CPU calls to the GPU the way instancing does (please let me know if I'm wrong about this).
Plus, I cannot have all the nodes beforehand, because the number of times I need to render the object (at different places) is determined at runtime and can change.
Is there a solution to this?
Any help would be appreciated.
Thank you
Instancing can be thought of as making copies of an object from a blueprint. The reason it saves memory and draw calls is that essentially, only the "blueprint" must be kept in memory. The recommended way that Godot addresses this (as per the documentation) is through (packed) scenes.
To do this, create the object as its own scene. Remember that you can right-click on the root node of a scene (even an empty one) and change its type to whatever you want. Once you have the object set up the way you like, save it as its own scene (e.g. myInstance.tscn).
From there, you can call upon the instance from your main scene (or whatever scene you need it in). To do this you need to do a couple of things:
First, create a variable for your instance in the script you want to call it from by declaring something like onready var instancedObject = preload("res://myInstance.tscn"). (Using whatever path you used for the scene).
From there, you call the variable from whatever function you need by writing something like: var myObject = instancedObject.instance()
You then must add the instance to the current scene with add_child(myObject)
After this, you can (optionally) specify things like transforms and rotations to determine where the instance gets put (Ex: myObject.transform.origin = Vector3(0,10,0) - For 3D, or myObject.position = Vector2(10,0) for 2D)
Alternatively, you can initialize and instance the object at the same time by writing onready var instancedObject = preload("res://myInstance.tscn").instance(), and then adding it in functions with add_child(instancedObject). Although this requires fewer steps, there are limitations to doing it this way, and I personally have had much more success with the first approach.
If, however, you are looking to instance many thousands of objects (or more) in the same scene, I recommend following Calinou's answer and using a MultiMeshInstance. One limitation of MultiMeshInstance, though, is that it takes an all-or-nothing approach to drawing: all instances are either drawn at once or not drawn at all, with no in-between. This could be good or bad depending on what you need it for.
Hope this helps.
Instancing in Godot is handled using the MultiMeshInstance node. It's the instanced counterpart to MeshInstance. See Optimization using MultiMeshes in the documentation for more information.
Keep in mind MultiMeshes aren't suited to objects that need to move in different directions every frame (although you can achieve this by using INSTANCE_ID in a shader shared among all instances). MultiMeshInstance lets you change how many instances are visible by setting its visible_instance_count property.

How should I organise a pile of singly used functions?

I am writing a C++ OpenCV-based computer vision program. The basic idea of the program could be described as follows:
Read an image from a camera.
Do some magic to the image.
Display the transformed image.
The implementation of the core logic of the program (step 2) comes down to a sequence of OpenCV function calls for image processing, roughly 50 calls in total. Some temporary image objects are created to store intermediate results, but apart from that, no additional entities are created. The functions in step 2 are used only once.
I am confused about how to organise this type of code (which feels more like a script). I used to create several classes for each logical step of the image processing. Say, here I could create three classes like ImagePreprocessor, ImageProcessor, and ImagePostprocessor, and split the above-mentioned 50 OpenCV calls and temporary images between them accordingly. But it doesn't feel like a reasonable OOP design. The classes would be nothing more than a way to store the function calls.
The main() function would still just create a single object of each class and call their methods one after another:
image_preprocessor.do_magic(img);
image_processor.do_magic(img);
image_postprocessor.do_magic(img);
Which is, to my impression, essentially the same thing as calling 50 OpenCV functions one by one.
I am starting to question whether this type of code requires an OOP design at all. After all, I can simply provide a function do_magic(), or three functions preprocess(), process(), and postprocess(). But this approach doesn't feel like good practice either: it is still just a pile of function calls, merely split across a few functions.
I wonder, are there some common practices to organise this script-like kind of code? And what would be the way if this code is a part of a large OOP system?
Usually, in image processing, you have a pipeline of various image processing modules. The same applies to video processing, where each image is processed according to its timestamp order in the video.
Constraints to consider before designing such a pipeline:
The order of execution of these modules is not always the same, so the pipeline should be easily configurable.
All modules of the pipeline should be executable in parallel with each other.
Each module of the pipeline may also have a multithreaded operation. (Out of scope of this answer, but is a good idea when a single module becomes the bottleneck for the pipeline).
Each module should fit the overall design easily and be free to change its internal implementation without affecting other modules.
The results of preprocessing a frame in one module should be available to later modules.
Proposed Design
Video Pipeline
A video pipeline is a collection of modules. For now, assume a module is a class whose process method is called with some data. How each module is executed depends on how the modules are stored in the VideoPipeline. To explain further, consider the two categories below.
Here, let's say we have modules A, B, and C which always execute in the same order. We will discuss the solution with a video of frames 1, 2, and 3.
a. Linked list: In a single-threaded application, frame 1 is first processed by A, then by B, and then by C. The process is repeated for the next frame, and so on. So a linked list seems like an excellent choice for a single-threaded application.
For a multi-threaded application, speed is what matters. So, of course, you would want all your modules running on, say, a 128-core machine. This is where the Pipeline class comes into play. If each Pipeline object runs in a separate thread, the whole application, which may have 10 or 20 modules, starts running multithreaded. Note that the single-threaded/multithreaded approach can be made configurable.
b. Directed acyclic graph: The linked-list implementation above can be improved further when you have plenty of processing power and want to reduce the lag between input and the pipeline's response. Such a case is when module C does not depend on B, but on A. In that case, a frame can be processed in parallel by modules B and C using a DAG-based implementation. However, I wouldn't recommend this, as the benefits are not great compared to the increased complexity: the outputs of modules B and C then need to be managed by, say, a module D that depends on B or C or both, and the number of scenarios keeps increasing.
Thus, for simplicity's sake, let's use the linked-list-based design.
Pipeline
Create a linked list of PipelineElements.
Make the pipeline's process method call the process method of the first element.
PipelineElement
First, the PipelineElement processes the information by calling its ImageProcessor (see below). The PipelineElement passes a Packet (of all the data, see below) to the ImageProcessor and receives the updated packet back.
If the next element is not null, call the next PipelineElement's process method and pass along the updated packet.
If a PipelineElement's next element is null, stop. This last element is special: it holds an Observer object. For all other PipelineElements, the Observer field is set to null.
FrameReader (VideoReader/ImageReader)
For the video/image reader, create an abstract class. Whether you process a video, a single image, or several, processing is done one frame at a time, so also create an abstract class (interface) ImageProcessor.
A FrameReader object stores a reference to the pipeline.
For each frame, it pushes the information in by calling the pipeline's process method.
ImageProcessor
There are no separate pre- and post-ImageProcessors. For example, Retinex processing is usually used as postprocessing, but some applications can use it as preprocessing; either way, the Retinex processing class simply implements ImageProcessor. Each element holds its ImageProcessor and the next PipelineElement object.
Observer
A special class which extends PipelineElement and presents meaningful output, e.g. in a GUI or on disk.
Multithreading
1. Make each element run in its own thread.
2. Each thread polls messages from a BlockingQueue (of a small size, say 2-3 frames) that acts as a buffer between two PipelineElements. Note: the queue helps average out the speed of each module, so small jitters (a module taking too long on one frame) do not affect the video output rate and playback stays smooth.
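As a rough illustration of that buffer, here is a minimal bounded blocking queue in C++ (a sketch; the name FrameQueue and the capacity are made up, and any movable frame/packet type works as T):
#include <condition_variable>
#include <mutex>
#include <queue>

// A small bounded queue: the producer stage blocks when the consumer stage
// falls behind, which keeps memory use flat and smooths out jitter.
template <typename T>
class FrameQueue {
public:
    explicit FrameQueue(size_t capacity) : capacity_(capacity) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push(std::move(item));
        notEmpty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        notFull_.notify_one();
        return item;
    }

private:
    size_t capacity_;
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
};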
Packet
A Packet stores all the information, such as the input and a Configuration class object. This way you can carry intermediate calculations along the pipeline, and also observe in real time the effect of changing an algorithm's configuration through a configuration manager.
To conclude, each element can now process in parallel: the first element processes the nth frame while the second element processes the (n-1)th frame, and so on. But with this, more issues pop up, such as pipeline bottlenecks and additional delays because less core power is available to each element.
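To make the structure concrete, here is a rough single-threaded C++ sketch of the design described above (Packet, ImageProcessor, and PipelineElement follow the names used here; the Blur step and the OpenCV call inside it are placeholders):
#include <memory>
#include <opencv2/opencv.hpp>

// Carries the frame plus any intermediate results / configuration.
struct Packet {
    cv::Mat frame;
    // ... configuration, intermediate buffers, timestamps, etc.
};

// One processing step; concrete steps implement process().
class ImageProcessor {
public:
    virtual ~ImageProcessor() = default;
    virtual void process(Packet& packet) = 0;
};

class Blur : public ImageProcessor {
public:
    void process(Packet& packet) override {
        cv::GaussianBlur(packet.frame, packet.frame, cv::Size(5, 5), 0);
    }
};

// A linked list of elements; each element runs its processor,
// then hands the packet to the next element.
class PipelineElement {
public:
    PipelineElement(std::unique_ptr<ImageProcessor> processor,
                    std::unique_ptr<PipelineElement> next = nullptr)
        : processor_(std::move(processor)), next_(std::move(next)) {}

    void process(Packet& packet) {
        processor_->process(packet);
        if (next_) next_->process(packet);
        // else: last element, e.g. an Observer that displays or saves the frame
    }

private:
    std::unique_ptr<ImageProcessor> processor_;
    std::unique_ptr<PipelineElement> next_;
};
A FrameReader would then fill a Packet per frame and call process on the first element; the multithreaded variant replaces the direct call to the next element with a push onto a small blocking queue like the one sketched earlier.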
This structure lends itself to the pipes and filters architecture (see Pattern-Oriented Software Architecture Volume 1: A System of Patterns by Frank Buschmann):
The Pipes and Filters architectural pattern provides a structure for systems that process a stream of data. Each processing step is encapsulated in a filter component. Data is passed through pipes between adjacent filters. Recombining filters allows you to build families of related systems.
See also this short description (with images) from the Enterprise Integration Patterns book.
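For a lighter-weight take on the same pattern, closer to the plain-functions option mentioned in the question, the filters can simply be free functions or lambdas composed in a list (a sketch; runPipeline and the example filter names are made up):
#include <functional>
#include <vector>
#include <opencv2/opencv.hpp>

using Filter = std::function<cv::Mat(const cv::Mat&)>;

// Runs each filter in order, feeding the output of one into the next.
cv::Mat runPipeline(cv::Mat img, const std::vector<Filter>& filters) {
    for (const auto& f : filters) img = f(img);
    return img;
}

// Usage: the ~50 OpenCV calls become a handful of named filters.
// std::vector<Filter> filters = { preprocess, detectEdges, postprocess };
// cv::Mat result = runPipeline(cameraFrame, filters);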

Is this understanding of VkDescriptorPoolCreateInfo.pPoolSizes correct?

In Vulkan, I understand that a descriptor pool is used to allocate descriptor sets of some layout for use in a shader, but in the VkDescriptorPoolCreateInfo passed to vkCreateDescriptorPool, there is a field pPoolSizes that takes a bunch of objects containing a descriptor type and a number.
The documentation seems somewhat vague, but is this saying that a given descriptor pool can only have a certain, predetermined number of each type of descriptor allocated from it in descriptor sets? If so, how do I determine how many I will need beforehand? What happens if it runs out?
Your understanding of descriptor pools is correct.
If so, how do I determine how many I will need beforehand?
That's up to you and your application's needs.
If your application needs to be completely flexible and freeform, then you will need to create descriptor pools dynamically as needed. If your application has greater foreknowledge of what the scene will look like, then it will need fewer of these gymnastics.
Many serious Vulkan applications try to avoid having the number of descriptor sets be based on the number of objects in the scene. Push constants and/or dynamic UBO/SSBO descriptors allow different per-object state to be used without changing the descriptor itself. Textures for lots of objects can be bundled together into array textures, or depending on the hardware, arrays of textures.
In a perfect world, all meshes of a given type (say, skinned meshes) could be rendered with the exact same descriptor set, using some per-object state to fetch the right matrix/texture data for that object.
But that's just how those applications render. Such applications have firm control over the kinds of objects they render, what per-object data looks like, and so forth. Other applications may have different needs.
Vulkan is a tool; how you use it is entirely up to you.
What happens if it runs out?
Then you cannot allocate more descriptors from that pool. If you need to allocate another descriptor set, you will need to create another pool.
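As a sketch of what that looks like in code (C++ using the Vulkan C API; the counts are arbitrary, and device and layout are assumed to exist):
// This pool can hand out at most 64 sets, drawn from a fixed budget of
// 64 uniform-buffer descriptors and 128 combined image samplers in total.
VkDescriptorPoolSize poolSizes[2]{};
poolSizes[0].type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
poolSizes[0].descriptorCount = 64;
poolSizes[1].type = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
poolSizes[1].descriptorCount = 128;

VkDescriptorPoolCreateInfo poolInfo{};
poolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
poolInfo.maxSets = 64;
poolInfo.poolSizeCount = 2;
poolInfo.pPoolSizes = poolSizes;
VkDescriptorPool pool;
vkCreateDescriptorPool(device, &poolInfo, nullptr, &pool);

VkDescriptorSetAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
allocInfo.descriptorPool = pool;
allocInfo.descriptorSetCount = 1;
allocInfo.pSetLayouts = &layout;
VkDescriptorSet set;
VkResult result = vkAllocateDescriptorSets(device, &allocInfo, &set);
if (result == VK_ERROR_OUT_OF_POOL_MEMORY || result == VK_ERROR_FRAGMENTED_POOL) {
    // The budget above is exhausted: create another pool and allocate from it instead.
}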
My approach was to have a class that initially allocates N of the descriptors and, if it runs out, creates another pool with N*2 entries, and keeps doubling in size from there. It uses a simple linked list; when it comes to allocating, it just tries the first pool and then moves on to the next if that one is full.
That's all pretty inefficient, so I also had my code fire an assert if it ever had to create a second pool. That way I can make sure I choose a value of N big enough that the retail version should never have to do it (but if it somehow does, due to some unforeseen set of circumstances, it will still render correctly).
At the time, I remember cursing the spec and wishing descriptor pools would auto-grow the way command pools do. Still, I imagine there's a good reason they are the way they are.
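A rough C++ sketch of that doubling strategy might look like the following (the class name is made up, and createPool is an assumed helper that wraps vkCreateDescriptorPool with pool sizes scaled to the given count):
#include <cassert>
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Assumed helper: builds a pool whose pPoolSizes budget scales with `count`.
VkDescriptorPool createPool(VkDevice device, uint32_t count);

// Tries each existing pool in turn; when all are full, creates a new pool
// twice the size of the previous one.
class GrowingDescriptorAllocator {
public:
    GrowingDescriptorAllocator(VkDevice device, uint32_t initialSize)
        : device_(device), nextSize_(initialSize) {}

    VkDescriptorSet allocate(VkDescriptorSetLayout layout) {
        for (VkDescriptorPool pool : pools_) {
            VkDescriptorSet set;
            if (tryAllocate(pool, layout, &set)) return set;
        }
        // In debug builds, growing past the first pool is treated as a sizing bug.
        assert(pools_.empty() && "initial pool size was too small");
        pools_.push_back(createPool(device_, nextSize_));
        nextSize_ *= 2;
        VkDescriptorSet set;
        bool ok = tryAllocate(pools_.back(), layout, &set);
        assert(ok);
        (void)ok; // silence unused-variable warning in release builds
        return set;
    }

private:
    bool tryAllocate(VkDescriptorPool pool, VkDescriptorSetLayout layout, VkDescriptorSet* out) {
        VkDescriptorSetAllocateInfo info{};
        info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
        info.descriptorPool = pool;
        info.descriptorSetCount = 1;
        info.pSetLayouts = &layout;
        return vkAllocateDescriptorSets(device_, &info, out) == VK_SUCCESS;
    }

    VkDevice device_;
    uint32_t nextSize_;
    std::vector<VkDescriptorPool> pools_;
};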

LabVIEW: Programmatically setting FPGA I/O variables (templates?)

Question
Is there a way to programmatically set what FPGA variables I am reading from or writing to so that I can generalize my main simulation loop for every object that I want to run? The simulation loops for each object are identical except for which FPGA variables they read and write. Details follow.
Background
I have a code that uses LabVIEW OOP to define a bunch of things that I want to simulate. Each thing then has an update method that runs inside of a Timed Loop on an RT controller, takes a cluster of inputs, and returns a cluster of outputs. Some of these inputs come from an FPGA, and some of the outputs are passed back to the FPGA for some processing before being sent out to hardware.
My problem is that I have a separate simulation VI for every thing in my code, since different values are read from and returned to the FPGA for each thing. This is a pain for maintainability and seems to cry out for a better method. The problem is illustrated below. The important parts are the FPGA input and output nodes (change for every thing), and the input and output clusters for the update method (always the same).
Is there some way to define a generic main simulation VI and then programmatically (maybe with properties stored in my things) tell it which specific inputs and outputs to use from the FPGA?
If so then I think the obvious next step would be to make the main simulation loop a public method for my objects and just call that method for each object that I need to simulate.
Thanks!
The short answer is no. Unfortunately, once you get down to the hardware level with LabVIEW FPGA, things become very static and rely on hard-coded IO access. This is typically handled exactly the way you have presented in your current approach. However, you may be able to encapsulate the IO access with a bit of trickery.
Consider this: define the IO nodes on your diagram as interfaces and abstract them away with a function (or VI or method, whichever term you prefer). You can implement this with either a dynamic VI call or an object-oriented approach.
The data types defined by your interface are well known, because you are pushing and pulling them from clusters that do not change.
By abstracting away the hardware IO with a method call you can then maintain a library of function calls that represent unique hardware access for every "thing" in your system. This will encapsulate changes to the hardware IO access within a piece of code dedicated to that job.
Using dynamic VI calls is ugly but you can use the properties of your "things" to dictate the path to the exact function you need to call for that thing's IO.
An object-oriented approach might have you create a small class hierarchy with a root object that represents generic IO access (probably doing nothing), with children overriding a core method call for reading or writing. This call would take your FPGA reference in and spit out the variables every hardware call will return (or vice versa for a write). Under the hood, it takes care of deciding exactly which IO on the FPGA to access. Example below:
Keep in mind that this is nowhere near functional, I just wanted you to see what the diagram might look like. The approach will help you further generalize your main loop and allow you to embed it within a public call as you had suggested.
This looks like an object-mapping problem, which LabVIEW doesn't have great support for, but it can be done.
My code maps one cluster to another, assuming the control types are the same, using a two-column array as a "lookup."