Scene graphs: recursion or iteration?

A colleague and I are talking about how scenes are typically rendered in complex games. He believes the world is rendered recursively in the truest object-oriented fashion, with the world's many actors each overriding a virtual function like Actor.Draw() (e.g. Koopa.Draw(), Goomba.Draw()).
By contrast, I imagined that complex games today would iterate over their scene graph, avoiding virtual-function overhead and allowing more flexibility for specialized iterators (e.g. near-to-far vs. far-to-near, skipping certain objects in the tree, etc.). My experience with OpenGL and DirectX also suggested that they tend to draw objects in batches, and passing a batch context through a recursive call (i.e. the batch into which a class's Draw() function would draw) seemed like extra parameter-passing overhead that iteration could avoid.
Is one method favored over another nowadays? If so, why?

Updating parts of a scene graph might be done recursively. But drawing it is usually driven by a spatial partitioning data structure and a geometry/render-state batcher, in order to reduce overdraw (by drawing front to back) and to minimize state switches in the rendering pipeline (batching up data and draw calls that use similar resources and render states). This drawing part is, in effect, iterative.
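As a minimal illustration of that iterative style (all names here are hypothetical, not from any particular engine): visible objects are gathered into a flat array, sorted by a key that packs render state and depth, and then submitted in order.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t sortKey;  /* state (shader/material) id in the high bits, depth in the low bits */
    void *renderable;  /* opaque handle to whatever gets drawn */
} DrawItem;

extern void SubmitDrawItem(const DrawItem *item); /* hypothetical renderer entry point */

static int CompareDrawItems(const void *a, const void *b) {
    uint64_t ka = ((const DrawItem *)a)->sortKey;
    uint64_t kb = ((const DrawItem *)b)->sortKey;
    return (ka > kb) - (ka < kb);
}

/* Each frame: cull to the visible set, sort once, then submit linearly.
   The renderer binds state only when the key's state bits change. */
void DrawScene(DrawItem *items, size_t visibleCount) {
    qsort(items, visibleCount, sizeof(DrawItem), CompareDrawItems);
    for (size_t i = 0; i < visibleCount; i++)
        SubmitDrawItem(&items[i]);
}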
You have to take into account how complex the scenes are, their composition, and how many objects are being drawn. For projects with simpler scenes, or where you know certain information beforehand, sequencing your rendering (near-)optimally may not be worth the actual cost of computing the drawing sequence or the batching.
In the end the "favored" method would be project-specific.

Related

In a 2d application where you're drawing a lot of individual sprites, will the rasterization stage inevitably become a bottleneck?

I'm in the process of learning Vulkan, and I have just integrated ImGui into my code using the Vulkan-GLFW example in the original ImGui repo, and it works fine.
Now I want to render both the GUI and my 3D model on the screen at the same time, and since the GUI and the model definitely need different shaders, I need to use multiple pipelines and submit multiple commands. The GUI is partly transparent, so I would like it to be rendered after the model. The Vulkan spec states that the execution order of commands is not necessarily the order in which I record them, so I need synchronization of some kind. In this Reddit post several methods for achieving exactly this were proposed, and I once believed that I had to use multiple subpasses (together with subpass dependencies), barriers, or other such synchronization methods to solve this problem.
Then I had a look at SaschaWillems' Vulkan examples; in the ImGui example, though, I see no synchronization between the two draw calls. It just records the commands to draw the model first, and then the commands to draw the GUI.
I am confused. Is synchronization really needed in this case, or did I misunderstand something about command re-ordering or blending? Thanks.
Think about what you're doing for a second. Why do you think there needs to be synchronization between the two sets of commands? Because the second set of commands needs to blend with the data in the first set, right? And therefore, it needs to do a read/modify/write (RMW), which must be able to read data written by the previous set of commands. The data being read has to have been written, and that typically requires synchronization.
But think a bit more about what that means. Blending has to read from the framebuffer to do its job. But... so does the depth test, right? It has to read the existing sample's depth value, compare it with the incoming fragment, and then discard the fragment or not based on the depth test. So basically every draw call that uses a depth test performs a framebuffer read/modify/write.
And yet... your depth tests work. Not only do they work between draw calls without explicit synchronization, they also work within a draw call. If two triangles in a draw call overlap, you don't have any problem with seeing the bottom one through the top one, right? You don't have to do inter-triangle synchronization to make sure that the previous triangles' writes are finished before the reads.
So somehow, the depth test's RMW works without any explicit synchronization. So... why do you think that this is untrue of the blend stage's RMW?
The Vulkan specification states that commands, and stages within commands, will execute in a largely unordered way, with several exceptions. The most obvious being the presence of explicit execution barriers/dependencies. But it also says that the fixed-function per-sample testing and blending stages will always execute (as if) in submission order (within a subpass). Not only that, it requires that the triangles generated within a command also execute these stages (as if) in a specific, well-defined order.
That's why your depth test doesn't need synchronization; Vulkan requires that this is handled. This is also why your blending will not need synchronization (within a subpass).
So you have plenty of options (in order from fastest to slowest):
Render your UI in the same subpass as the non-UI. Just change pipelines as appropriate (see the sketch after this list).
Render your UI in a subpass with an explicit dependency on the framebuffer images of the non-UI subpass. While this is technically slower, it probably won't be slower by much if at all. Also, this is useful for deferred rendering, since your UI needs to happen after your lighting pass, which will undoubtedly be its own subpass.
Render your UI in a different render pass. This would only really be needed for cases where you need to do some full-screen work (SSAO) that would force your non-UI render pass to terminate anyway.
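For the first option, the recorded commands might look like this sketch (plain Vulkan C calls; the command buffer, render pass info, pipelines, and index counts are assumed to have been created elsewhere):

vkCmdBeginRenderPass(cmd, &renderPassInfo, VK_SUBPASS_CONTENTS_INLINE);

/* Draw the model first... */
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, modelPipeline);
/* ...bind the model's descriptor sets and vertex/index buffers here... */
vkCmdDrawIndexed(cmd, modelIndexCount, 1, 0, 0, 0);

/* ...then the UI in the same subpass. Per-sample blending is guaranteed to
   happen (as if) in submission order, so no barrier is needed in between. */
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, uiPipeline);
/* ...bind the UI's descriptor sets and vertex/index buffers here... */
vkCmdDrawIndexed(cmd, uiIndexCount, 1, 0, 0, 0);

vkCmdEndRenderPass(cmd);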

In Unity, what is the difference between combining meshes and instancing objects?

I am in serious need of optimizing some of my Unity projects, and I have many objects imported from 3ds Max, so I am wondering whether combining the meshes would have any effect on memory or performance, or whether I should leave the objects as instances of each other, as that would save me some space.
This raises the question of what the difference is between combined mesh objects and instanced objects; it would save a lot of memory and hassle to really know the difference and which is better.
Looking forward to some brief information about the two.
Thanks
Combining is useful if you have a lot of unique assets that appear only once or twice in a scene, e.g. unique buildings in a 3D FPS, but not cloned houses in a SimCity-style game. If you have a model that appears many times in a scene, it's more performant to have Unity (automatically) batch them; this is Unity's default behaviour. For example, say your scene is an art gallery: if the gallery contains a dozen distinct sculptures, combine them. If it contains a dozen copies of the same sculpture, don't bother; Unity will batch them for you.
However, you should be wary of using different materials; each material adds to the draw call count. So if you had 10 of the same model but using 5 different materials, it's going to be expensive. The way around this is to use a texture atlas with a single material, with different UV mapping for each model. This means you can have a lot of different models but still save on render time thanks to the single material.
Also, be aware that transparent shaders are much more expensive than opaque ones; if you have three semi-transparent objects in front of each other, that's at least 4 render passes.
As you probably know this is a complex subject with a lot of variables (many more than I can describe here) and is best judged by using the profiler.
Here are some general rules of thumb I've learned while creating a game for mobile, which naturally is performance-critical:
Use as few materials as possible
Use as few textures as possible; share textures between materials
Recycle models as often as possible. A model oriented at a different angle or with a different material can often look like a whole new model to the player, particularly if their attention is elsewhere in the game
Use LODs extensively
Ensure your models are clean; remove all unnecessary vertices before importing
After importing, consider whether anything about the model could be represented with fewer vertices
Good use of normal mapping can pay off, depending on the platform. If you can trade 1000 verts for a 256 px normal map and 50 verts, then do it; otherwise don't bother normal mapping just to save a few verts
I created a tutorial that explains draw calls, static batching, lightmapping etc.
https://www.youtube.com/watch?v=x0t2xylbTo8&t=253s

Cocos2d moving nodes is choppy

In my upcoming iPhone game, different scene elements are split up into their own CCNodes.
My Obstacle node contains many child nodes, each representing an obstacle. Inside every obstacle node are the images that make up the obstacle (1-4 images), and there are only ~10 obstacles at a time. Every update, my game calls the update function in the Obstacle node, which moves every obstacle to the left. But this slows down my game quite a bit.
At the same time, I have a particle node that just contains images and moves them all every frame, exactly the way the Obstacle node does, yet it has no noticeable effect on performance, even though it has hundreds of images at a time.
My question is: why do the obstacles slow the game down so much when the particles don't? I have even tried replacing the images used in the obstacles with the ones used in the particles, and it makes no (noticeable) difference. Could the extra level of child nodes be the cause?
You will dramatically increase the app's performance, run speed, frame rate and more if you put all your images in a texture atlas and render them as a single batch using the CCSpriteBatchNode class. If you are moving lots of objects around the screen, this makes the hardware work a lot less.
Using this class is easy. Create it with a texture atlas that contains all your images, and then add it as a child to your layer, just as you would a sprite.
However, when you create sprites, add them as children to this batch node, not as children to the layer.
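A minimal sketch of the pattern (the file name and sprite rect here are made up):

// In your layer: one batch node per texture atlas.
CCSpriteBatchNode *batch =
    [CCSpriteBatchNode batchNodeWithFile:@"obstacles.png"];
[self addChild:batch];

// Sprites cut from the same atlas are added to the batch node, not to
// the layer, so they are all rendered in a single draw call.
CCSprite *obstacle = [CCSprite spriteWithFile:@"obstacles.png"
                                         rect:CGRectMake(0, 0, 64, 64)];
[batch addChild:obstacle];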
It's very easy and will probably help you quite a lot here.
From what I recall of the Cocos2d documentation, particles are intended to be VERY lightweight, so you can have many, many of them on screen at once. Nodes are heavier: they require more processing between frames, since they interact with the physics system and require node-specific rendering. The last time I looked at the render loop code, it was basically O(n) in the number of CCNodes in the scene. Using NSTimers versus Cocos2d's built-in run loop also makes quite a bit of difference in performance.
Could you provide an example of something that slows down a lot? Exactly how do you update these Obstacles?
The cocos2d documentation has some best practices that all, in one way or another, touch on performance. There's a LOT you can do to optimize your frames per second.
In general, when your code is slow, it helps to use Instruments.app to figure out where it is spending its time. Since you're using a framework, this will be less helpful: you'll end up finding out which framework functions your code spends a lot of time in, and then have to figure out how to reduce that via the framework's best practices or other optimizations. There are a few good blog posts on improving performance; I found this one very helpful.

Object Oriented implementation of graph data structures

I have been reading quite a bit about graph data structures lately, as I intend to write my own UML tool. As far as I can see, what I want can be modeled as a simple graph consisting of vertices and edges. Vertices will have a few values, and so are best represented as objects. Edges do not, as far as I can see, need to be either directed or weighted, but I do not want to choose an implementation that makes it impossible to add such properties later on.
Being educated in pure object-oriented programming, the first thing that comes to mind is representing vertices and edges as classes, for example:
@interface Vertex : NSObject
@property (nonatomic, strong) NSMutableArray *edges; // incident edges
@property (nonatomic, copy) NSString *name;
@end

@interface Edge : NSObject
@property (nonatomic, strong) Vertex *from;
@property (nonatomic, strong) Vertex *to;
@end
This gives me the possibility of later introducing weights, direction, and so on. Now, when I read up on implementing graphs, it seems that this is a very uncommon solution. Earlier questions here on Stack Overflow suggest adjacency lists and adjacency matrices, but being completely new to graphs, I have a hard time understanding why those are better than my approach.
The most important aspects of my application are the ability to easily determine which vertex is clicked and moved, and the ability to add and remove vertices and edges between the vertices. Will this be easier to accomplish in one implementation than in another?
My language of choice is Objective-C, but I do not believe that this should be of any significance.
Here are the two basic graph types along with their typical implementations:
Dense Graphs:
Adjacency Matrix
Incidence Matrix
Sparse Graphs:
Adjacency List
Incidence List
In the graph framework (closed source, unfortunately) that I've been writing (>12k loc of graph implementations + >5k loc of unit tests, and still counting) I've been able to implement (Directed/Undirected/Mixed) Hypergraphs, (Directed/Undirected/Mixed) Multigraphs, (Directed/Undirected/Mixed) Ordered Graphs, (Directed/Undirected/Mixed) KPartite Graphs, as well as all kinds of Trees, such as Generic Trees, (A,B)-Trees, KAry-Trees, Full-KAry-Trees (trees to come: VP-Trees, KD-Trees, BKTrees, B-Trees, R-Trees, Octrees, …).
And all without a single vertex or edge class: purely generics, and with little to no redundant implementations.
Oh, and as if this wasn't enough they all exist as mutable, immutable, observable (NSNotification), thread-unsafe and thread-safe versions.
How? Through excessive use of Decorators.
Basically all graphs start out mutable, thread-unsafe and not observable, and I use decorators to add all kinds of flavors to them (currently resulting in no more than 35 classes, vs. 500+ if implemented without decorators).
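To sketch the idea (GraphProtocol is one of the protocols listed further down; the vertexCount method is a simplified stand-in for the real interface): a thread-safe decorator just wraps any graph and forwards every call under a lock.

@interface ThreadSafeGraph : NSObject <GraphProtocol>
- (instancetype)initWithGraph:(id<GraphProtocol>)graph;
@end

@implementation ThreadSafeGraph {
    id<GraphProtocol> _graph;
    NSRecursiveLock *_lock;
}

- (instancetype)initWithGraph:(id<GraphProtocol>)graph {
    if ((self = [super init])) {
        _graph = graph;
        _lock = [NSRecursiveLock new];
    }
    return self;
}

// Every protocol method is forwarded the same way, for example:
- (NSUInteger)vertexCount {
    [_lock lock];
    NSUInteger count = [_graph vertexCount];
    [_lock unlock];
    return count;
}
@end

The observable and immutable flavors are wrappers of the same shape, which is how 35 classes can stand in for 500+.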
While I cannot give any actual code, my graphs are basically implemented via Incidence Lists by use of mainly NSMutableDictionaries and NSMutableSets (and NSMutableArrays for my ordered Trees).
My Undirected Sparse Graph has nothing but these ivars, e.g.:
NSMutableDictionary *vertices;
NSMutableDictionary *edges;
The ivar vertices maps vertices to adjacency maps of vertices to incident edges ({"vertex": {"vertex": "edge"}})
And the ivar edges maps edges to incident vertex pairs ({"edge": {"vertex", "vertex"}}), with Pair being a pair data object holding an edge's head vertex and tail vertex.
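In code, adding an undirected edge then comes down to updating both mappings (a sketch of the shape just described; an NSArray stands in for the Pair object here):

// Adding an undirected edge e between vertices u and v:
if (!vertices[u]) vertices[u] = [NSMutableDictionary dictionary];
if (!vertices[v]) vertices[v] = [NSMutableDictionary dictionary];
vertices[u][v] = e;  // u's adjacency map: neighbor -> incident edge
vertices[v][u] = e;  // v's adjacency map: neighbor -> incident edge
edges[e] = @[u, v];  // incident vertex pair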
Mixed Sparse Graphs have a slightly different mapping of adjacency/incidence lists, and so do Directed Sparse Graphs, but you should get the idea.
A limitation of this implementation is that every vertex and every edge needs to have an object associated with it. And to make things a bit more interesting (sic!), each vertex object needs to be unique, and so does each edge object, as dictionaries don't allow duplicate keys. Also, objects need to implement NSCopying. NSValueTransformers or value encapsulation are a way to sidestep these limitations, though (the same goes for the memory overhead from dictionary key copying).
While the implementation has its downsides, there's a big benefit: immense versatility!
There's hardly any type of graph I can think of that's impossible to achieve with what I already have. Instead of building each type of graph from custom-built parts, you basically go to your box of Lego bricks and assemble the graphs just the way you need them.
Some more insight:
Every major graph type has its own Protocol, here are a few:
HypergraphProtocol
MultigraphProtocol [tagging protocol] (allows parallel edges)
GraphProtocol (allows directed & undirected edges)
UndirectedGraphProtocol [tagging protocol] (allows only undirected edges)
DirectedGraphProtocol [tagging protocol] (allows only directed edges)
ForestProtocol (allows sets of disjoint trees)
TreeProtocol (allows trees)
ABTreeProtocol (allows trees of a-b children per vertex)
FullKAryTreeProtocol [tagging protocol] (allows trees of either 0 or k children per vertex)
The protocol nesting implies inheritance (of both protocols and implementations).
If there's anything else you'd like more insight into, feel free to leave a comment.
PS: To give credit where credit is due: the architecture was highly influenced by the JUNG Java graph framework (55k+ loc).
PPS: Before choosing this type of implementation, I had written a small brother of it with just undirected graphs, which I wanted to expand to also support directed graphs. My implementation was pretty similar to the one you give in your question, and that is what brought my first (rather naïve) project to an abrupt end back then: Subclassing a set of inter-dependent classes in Objective-C and ensuring type-safety. Adding simple directedness to my graph caused my entire code to break apart. (I didn't even use the solution I posted back then, as it would only have postponed the pain.) Now, with the generic implementation, I have more than 20 graph flavors implemented, with no hacks at all. It's worth it.
If all you want is to draw a graph and move its nodes around the screen, though, you'd be fine just implementing a generic graph class that can later be extended with specific directedness, if needed.
An adjacency matrix makes adding and removing vertices (but not edges) a bit harder than your object model does, since it involves adding and removing rows and columns of a matrix. There are tricks you could use, like keeping empty rows and columns, but it will still be a bit complicated.
When moving a vertex around the screen, the edges will also be moved. This also gives your object model a slight advantage, since it will have a list of connected edges and will not have to search through the matrix.
Both models have an inherent directedness to the edges, so if you want to have undirected edges, then you will have to do additional work either way.
I would say that overall there is not a whole lot of difference. If I were implementing this, I would probably do something similar to what you are doing.
If you're using Objective-C, I assume you have access to Core Data, which would probably be a great place to start. I understand you're creating your own graph; the strength of Core Data is that it can do a lot of the checking you're talking about for free, if you set up your schema properly.

Per frame optimization for large datasets

Summary
New to iPhone programming, I'm having trouble picking the right optimization strategy to filter a set of view components in a scrollview with huge content. In what area would my app gain the most performance?
Introduction
My current iPad app-in-progress lets users explore fairly large binary tree structures. The trees contain between 30 and 900 nodes, and when drawn inside a scroll view (with limited zoom) it looks like this.
The nodes' contents are stored in a SQLite backed Core Data model. It's a binary tree and if a node has children, there are always exactly two. The x and y positions are part of the model, as are the dimensions of the node connections, shown as dotted lines.
Optimization
Only about 50 nodes fit the screen at any given time. With the largest trees containing up to 900 nodes, it's not possible to put everything in a zooming UIView managed by the scroll view; that's a recipe for crashes. So I have to filter the nodes per frame.
And that's where my troubles start. I don't have the experience to make a well-founded decision between the possible filtering options, and in addition I probably don't know about that really fast special magic buried deep in Objective-C or Cocoa Touch. Because the backing store is close to 200 MB in size (some 90,000 nodes in hundreds of trees), it's very time-consuming to test every single tree on the iPad device. Which is why I'd like to ask you for advice.
For all my attempts, I'm putting a filter method in scrollViewDidScroll: and scrollViewDidZoom:. I'm also blocking the main thread with the filter, because I can't show the content without the nodes anyway. But maybe someone has an idea in that area?
Because all the positioning is present in the Core Data model, I might use an NSFetchRequest to do the filtering. Is that really fast, though? I have the feeling it's not a very optimized method.
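For concreteness, I mean something like this (entity and attribute names simplified; context is the managed object context and visibleRect the currently visible portion of the content):

NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Node"];
request.predicate = [NSPredicate predicateWithFormat:
    @"x >= %f AND x <= %f AND y >= %f AND y <= %f",
    CGRectGetMinX(visibleRect), CGRectGetMaxX(visibleRect),
    CGRectGetMinY(visibleRect), CGRectGetMaxY(visibleRect)];
NSError *error = nil;
NSArray *onScreenNodes = [context executeFetchRequest:request error:&error];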
From what I've tried, the faulted managed objects seem to fit in memory all at once, but it might get tricky for the larger trees once their contents start firing faults. Is it a good idea to loop over the NSSet of nodes and check which items should be on screen?
Are there other tricks to gain performance? Do you see ways I could use multithreading to compute the display set faster, even though the model's context was created on the main thread?
Thanks for your advice,
EP.
Ironically, your binary trees could themselves be spatially divided using binary space partitioning in 2D, so rendering will be very fast, with the number of visibility checks kept close to the minimum necessary.
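For example (a sketch; TreeNode and its subtreeBounds, frame, leftChild and rightChild properties are hypothetical names, not from your model): if every node caches the bounding box of its entire subtree, the visible set can be collected with a walk that prunes whole subtrees lying outside the visible rect, so only a near-minimal number of nodes is ever checked.

- (void)collectVisibleNodes:(TreeNode *)node
                     inRect:(CGRect)visibleRect
                       into:(NSMutableArray *)result
{
    // The whole subtree lies off screen: prune it in one check.
    if (node == nil || !CGRectIntersectsRect(node.subtreeBounds, visibleRect))
        return;
    if (CGRectIntersectsRect(node.frame, visibleRect))
        [result addObject:node];
    [self collectVisibleNodes:node.leftChild inRect:visibleRect into:result];
    [self collectVisibleNodes:node.rightChild inRect:visibleRect into:result];
}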