Planning a 2D tile engine - Performance concerns

Planning a 2D tile engine - Performance concerns - optimization

As the title says, I'm fleshing out a design for a 2D platformer engine. It's still in the design stage, but I'm worried that I'll be running into issues with the renderer, and I want to avoid them if they will be a concern.
I'm using SDL for my base library, and the game will be set up to use a single large array of Uint16 to hold the tiles. These index into a second array of "tile definitions" that are used by all parts of the engine, from collision handling to the graphics routine, which is my biggest concern.
The graphics engine is designed to run at a 640x480 resolution, with 32x32 tiles. There are 21x16 tiles drawn per layer per frame (to handle the extra tile that shows up when scrolling), and there are up to four layers that can be drawn. Layers are simply separate tile arrays, but the tile definition array is common to all four layers.
What I'm worried about is that I want to be able to take advantage of transparencies and animated tiles with this engine, and as I'm not too familiar with designs I'm worried that my current solution is going to be too inefficient to work well.
My target FPS is a flat 60 frames per second, and with all four layers being drawn, I'm looking at 21x16x4x60 = 80,640 separate 32x32px tiles needing to be drawn every second, plus however many odd-sized blits are needed for sprites, and this seems just a little excessive. So, is there a better way to approach rendering the tilemap setup I have? I'm looking towards possibilities of using hardware acceleration to draw the tilemaps, if it will help to improve performance much. I also want to hopefully be able to run this game well on slightly older computers as well.
If I'm looking for too much, then I don't think that reducing the engine's capabilities is out of the question.

I think the thing that will be an issue is the sheer amount of draw calls, rather than the total "fill rate" of all the pixels you are drawing. Remember - that is over 80000 calls per second that you must make. I think your biggest improvement will be to batch these together somehow.
One strategy to reduce the fill-rate of the tiles and layers would be to composite static areas together. For example, if you know an area doesn't need updating, it can be cached. A lot depends of if the layers are scrolled independently (parallax style).
Also, Have a look on Google for "dirty rectangles" and see if any schemes may fit your needs.
Personally, I would just try it and see. This probably won't affect your overall game design, and if you have good separation between logic and presentation, you can optimise the tile drawing til the cows come home.

Make sure to use alpha transparency only on tiles that actually use alpha, and skip drawing blank tiles. Make sure the tile surface color depth matches the screen color depth when possible (not really an option for tiles with an alpha channel), and store tiles in video memory, so sdl will use hardware acceleration when it can. Color key transparency will be faster than having a full alpha channel, for simple tiles where partial transparency or blending antialiased edges with the background aren't necessary.
On a 500mhz system you'll get about 6.8 cpu cycles per pixel per layer, or 27 per screen pixel, which (I believe) isn't going to be enough if you have full alpha channels on every tile of every layer, but should be fine if you take shortcuts like those mentioned where possible.

I agree with Kombuwa. If this is just a simple tile-based 2D game, you really ought to lower the standards a bit as this is not Crysis. 30FPS is very smooth (research Command & Conquer 3 which is limited to 30FPS). Even still, I had written a remote desktop viewer that ran at 14FPS (1900 x 1200) using GDI+ and it was still pretty smooth. I think that for your 2D game you'll probably be okay, especially using SDL.

Can you just buffer each complete layer into its view plus an additional tile size for all four ends(if you have vertical scrolling), use the buffer again to create a new buffer minus the first column and drawing on a new end column?
This would reduce a lot of needless redrawing.
Additionally, if you want a 60fps, you can look up ways to create frame skip methods for slower systems, skipping every other or every third draw phase.

I think you will be pleasantly surprised by how many of these tiles you can draw a second. Modern graphics hardware can fill a 1600x1200 framebuffer numerous times per frame at 60 fps, so your 640x480 framebuffer will be no problem. Try it and see what you get.
You should definitely take advantage of hardware acceleration. This will give you 1000x performance for very little effort on your part.
If you do find you need to optimise, then the simplest way is to only redraw the areas of the screen that have changed since the last frame. Sounds like you would need to know about any animating tiles, and any tiles that have changed state each frame. Depending on the game, this can be anywhere from no benefit at all, to a massive saving - it really depends on how much of the screen changes each frame.

You might consider merging neighbouring tiles with the same texture into a larger polygon with texture tiling (sort of a build process).

What about decreasing the frame rate to 30fps. I think it will be good enough for a 2D game.

Related

SKPhysicsBody optimization

I have a 2D sidescrolling game. Right now, in order to jump, the player must be touching the ground. Therefor, I have a boolean, isOnGround, that is set to YES when the player collides with a tile object, and no when the player jumps. This generates a LOT of calls to didBeginContact method, slowing down the game.
Firstly, how can I optimise this by using one big physics body for the tiles on the floor (for example clustering multiple adjacent tiles into one single physics body)?
Secondly, is this even efficient? Is there a better way to detect if the play is on the ground? My current method opens up a lot of bugs, for example wall jumping. If a player collides with a wall, isOnGround becomes YES and allows the player to jump.

Having didBeginContact called numerous times should in no way slow down your game. If you are having performance issues, I suspect the problem is probably elsewhere. Are you testing on device or simulator?
If you are using the Tiled app to create your game map, you can use the Objects Layer to create a individual objects in your map which your code can translate into physics bodies later on.
Using physics and collisions is probably the easiest way for you to determine your player's state in relation to ground contact. To solve your wall issue, you simply make a wall contact a different category than your ground. This will prevent the isOnGround to be set to YES.

You could use the physics engine to detect when jumping is enabled, (and this is what I used to do in my game). However I too have noticed significant overhead using the physics engine to detect when a unit was on a surface and that is because contact detection in sprite kit for whatever reason is expensive, even when collisions are already enabled. Even the documentation notes:
For best performance, only set bits in the contacts mask for
interactions you are interested in.
So I found a better solution for my game (which has 25+ simultaneous units that all need surface detection). Instead of going through the physics engine, I just did my own surface calculation and cache the result each game update. Something like this:
final class func getSurfaceID(nodePosition: CGPoint) -> SurfaceID {
//Loop through surface rects and see if position is inside.
}
What I ended up doing was handling my own surface detection by checking if the bottom point of my unit was inside any of the surface frames. And if your frames are axis-aligned (your rectangles are not rotated) you can perform even faster checks to see if the point is inside the frame.
This is more work in terms of level design because you will need to build an array of surface frames either dynamically from your tiles or manually place surface frames in your world (this is what I did).
Making this change reduced the cpu time spent on surface detection from over 20% to 0.1%. It also allows me to check if any arbitrary point lies on a surface rather than needing to create a physics body (which is unnecessary overhead). However this solution obviously won't work for you if you need to use contact detection.
Now regarding your point about creating one large physics body from smaller ones. You could group adjacent floor tiles using a container node and recreate a physics body that fits the nodes that are grouped. Depending on how your nodes are grouped and how you recycle tiles this can get complicated. A better solution would be to create large physics bodies that just overlap your tiles. This would reduce the number of total physics bodies, as well as the number of detections. And if used in combination with the surface frames solution you could really reduce your overhead.
I'm not sure how your game is designed and what its requirements are. I'm just giving you some possible solutions I looked at when developing surface detection in my game. If you haven't already you should definitely profile your game in instruments to see if contact detection is indeed the source of your overhead. If you game doesn't have a lot of contacts I doubt that this is where the overhead is coming from.

Working around WebGL readPixels being slow

I'm trying to use WebGL to speed up computations in a simulation of a small quantum circuit, like what the Quantum Computing Playground does. The problem I'm running into is that readPixels takes ~10ms, but I want to call it several times per frame while animating in order to get information out of gpu-land and into javascript-land.
As an example, here's my exact use case. The following circuit animation was created by computing things about the state between each column of gates, in order to show the inline-with-the-wire probability-of-being-on graphing:
The way I'm computing those things now, I'd need to call readPixels eight times for the above circuit (once after each column of gates). This is waaaaay too slow at the moment, easily taking 50ms when I profile it (bleh).
What are some tricks for speeding up readPixels in this kind of use case?
Are there configuration options that significantly affect the speed of readPixels? (e.g. the pixel format, the size, not having a depth buffer)
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Should I try to aggregate all the textures I'm reading into a single megatexture and sort things out after a single big read?
Should I be using a different method to get the information back out of the textures?
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?

Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Yes, yes, yes. readPixels is fundamentally a blocking, pipeline-stalling operation, and it is always going to kill your performance wherever it happens, because it's sending a request for data to the GPU and then waiting for it to respond, which normal draw calls don't have to do.
Do readPixels as few times as you can (use a single combined buffer to read from). Do it as late as you can. Everything else hardly matters.
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
This will get you immensely better performance.
If your graphics are all like you show above, you shouldn't need to do any “layout” at all (which is good, because it'd be very awkward to implement) — everything but the text is some kind of color or boundary animation which could easily be done in a shader, and all the layout can be just a static vertex buffer (each vertex has attributes which point at which simulation-state-texel it should be depending on).
The text will be more tedious merely because you need to load all the digits into a texture to use as a spritesheet and do the lookups into that, but that's a standard technique. (Oh, and divide/modulo to get the digits.)

I don't know enough about your use case but just guessing, Why do you need to readPixels at all?
First, you don't need to draw text or your the static parts of your diagram in WebGL. Put another canvas or svg or img over the WebGL canvas, set the css so they overlap. Let the browser composite them. Then you don't have to do it.
Second, let's assume you have a texture that has your computed results in it. Can't you just then make some geometry that matches the places in your diagram that needs to have colors and use texture coords to look up the results from the correct places in the results texture? Then you don't need to call readPixels at all. That shader can use a ramp texture lookup or any other technique to convert the results to other colors to shade the animated parts of your diagram.
If you want to draw numbers based on the result you can use a technique like this so you'd make a shader at references the result shader to look at a result value and then indexes glyphs from another texture based on that.
Am I making any sense?

Non power of two textures and memory consumption optimization

I read somewhere that XNA framework upscales a texture to nearest power of two size and then sends that to VRAM, which, provided it's how it really works, might be not efficient when loading many small (in my case 150×150) textures, which essentially waste memory with unused texture data resulting from upscaling.
So is there some automatic optimization, or should I make my own implementation of it, like loading all textures, figuring out where the "upscaled" space is big enough to hold some other texture and place it there, remembering sprite positions, thus using one texture instead of two (or more)?
It isn't always handy to do this manually for each texture (placing many small sprites in a single texture), because it's hard to work with later (essentially it becomes less human-oriented), and not always a sprite will be needed in some level of a game, so it would be better if sprites were in a different composition, so it should be done automatically.

There are tools available to create what are known as "sprite sheets" or "texture atlases". This XNA sample does this for you as part of a content pipeline extension.
Note that the padding of textures only happens on devices that do not support non-power-of-two textures. Windows Phone, for example. Modern GPUs won't waste the RAM. However this is still a useful optimisation to allow you to merge batches of sprites (see this answer for details).

At what phase in rendering does clipping occur?

I've got some OpenGL drawing code that I'm trying to optimize. It's currently testing all drawing objects for visibility client-side before deciding whether or not to send rendering data to OpenGL. (This is easier than it sounds. It's drawing a 2D scene so clipping is trivial: just test against the current coordinates of the viewport rectangle.)
It occurs to me that the entire model could be greatly simplified by passing the entire scene to OpenGL and letting the GPU take care of the clipping. But sometimes the total can be very, very complex, involving up to 100,000 total sprites, most of which never get rendered because they're off-camera, and I'd prefer to not end up killing the framerate in the name of simplicity.
I'm using OpenGL 2.0, and I've got a pretty simple vertex shader and a much more complicated fragment shader. Is there any guarantee that says that if the vertex shader runs and determines coordinates that are completely off-camera for all vertices of a polygon, that a clipping test will be applied somewhere between there and the fragment shader and prevent the fragment shader from ever running for that polygon? And if so, is this automatic or is there something I need to do to enable it? I've looked around online for information on this but I haven't found anything conclusive...

Clipping happens after the vertex transform stage before and after the NDC space; clip planes are applied in clip space, viewport clipping is done in NDC space. That is one step before rasterizing. Clipping means, that a face only partially visible is "cut" by inserting new vertices at the visibility border, or fragments outside the viewport discarded. What you mean is usually called culling. Faces completely outside the viewport are culled, at the same stage like clipping.
From a performance point of view, the best code is code never executed, and the best data is data never accessed. So in your case sending off a single drawing call that makes the GPU process a large batch of vertices clearly takes load off the CPU, but it consumes GPU processing power. Culling those vertices before sending the drawing command consumes CPU power, but takes load off the GPU. The goal is to find the right balance. If the number of vertices is low, a simple brute force approach (just render the whole thing) may easily outperform ever other scheme.
However using a simple, yet effective data management scheme can greatly improve performance on both ends. For example a spatial subdivision structure like a Kd tree is easily built (you don't have to balance it). Sorting the vertices into the Kd tree you can omit (cull) large portions of the tree if one branch near to the root is completely outside the viewport. Preparing drawing a frame you iterate through the visible parts of the tree, building the list of vertices to draw, then you pass this list to the rendering command. Kd trees can be traversed on average in O(n log n) time.

It's important to understand the difference between clipping and culling. You appear to be talking about the latter.
Clipping means taking a triangle and literally cutting it into pieces to fit into the viewport. The OpenGL specification defines this process to happen post-vertex shader, for any triangle that is only partially in view.
Culling means throwing something away entirely. If a triangle is not entirely in view, it can therefore be culled. OpenGL does not say that culling has to happen. Remember: the OpenGL specification defines behavior, not performance.
That being said, hardware makers are not stupid. Obvious efforts like not rasterizing triangles that are outside of the viewport are easily implemented and improve performance. Pretty much any hardware that exists will do this.
Similarly, clipping is typically implemented (where possible) with rasterizer tricks, rather than by creating new triangles. Fragments that would be outside of the viewport simply aren't generated by the rasterizer. This is also legal according to OpenGL, because the spec defines apparent behavior. It doesn't really care if you actually cut the triangle into pieces as long as it looks indistinguishable form if you did.
Your question is essentially one of, "How much work should I do to not render off-screen objects?" That really depends on what your scene is and how you're rendering it. You say you're rendering 100,000 sprites. Are you making 100,000 draw calls, or are these sprites part of larger structures that you render with larger granularity? Do you stream the vertex data to the GPU every frame, or is the vertex data static?

Clipping and culling happen before fragment processing. http://www.opengl.org/wiki/Rendering_Pipeline_Overview
However, you will still be passing 100000 * 4 vertices (assuming you're rendering the sprites with quads and not point sprites) to the card if you don't do culling yourself. Depending on the card's memory performance this can be an issue.

Voxel Engine and Optimization

Recently I've started developing voxel engine. What I need is only colorful voxels without texture, but at very large amount (much smaller than minecraft) - and the question is how to draw the scene very fast? I'm using c#/xna but this is in my opinion not very important in this case, let's talk about general cases. Look at these two games:
http://www.youtube.com/watch?v=EKdRri5jSMs
http://www.youtube.com/watch?v=in0bavLJ8KQ
Especially I think video number 2 represents great optimization methods (my gfx card starts choking just at 192 x 192 x 64) How they achieve this?
What i would to have in the engine:
colorful voxels without texture, but shaded
many, many voxels, say minimum 512 x 512 x 128 to achieve something like video #2
shadows (smooth shadows will be great but this is not necessary)
optional: dynamic lighting (for example from fireballs flying, which light up near voxel structures)
framerate minimum 40 FPS
camera have 3 ways of freedom (move in x-axis, move in y-axis, move in z-axis), no camera rotation is needed
finally optional feature may be Depth of Field (it will be sweet ^^ )
What optimization I have already know:
remove unseen voxels that resides inside voxel structure (covered
from six directions by other voxels)
remove unseen faces of voxels - because camera have no rotation and always look aslant forward like in TPP games, so if we divide screen
by vertical cut, left voxels and right voxels will show only 3 faces
keep voxels in Dictionary instead of 3-dimensional array - jumping through array of size 512 x 512 x 128 takes miliseconds which is
unacceptable - but dictionary int:color where int describes packed
3D position is much much faster
use instancing where applciable
occluding? (how to do this?)
space dividing / octtree (is it good idea?)
I'll be very thankful if someone give me a tip how to improve existing optimizations listed above or can share ideas of new improvements. Thanks

1) Voxatron uses a software renderer rather than the GPU. You can read some details about it if you read the comments in this blog post:
http://www.lexaloffle.com/bbs/?tid=201
I haven't looked in detail myself so can't tell you much more than that.
2) I've never played 3D Dot Game Heroes but I don't have any reason to believe it uses voxels at all. I mean, I don't see any cubes being added or deleted. Most likely it is just a static polygon mesh with a nice texture applied.
As for implementing it yourself, do not try to draw the world by rendering cubes as this is very slow. Instead you should process the volume and generate meshes lying on the intersection of solid voxels and empty ones. Break the volume into suitable sized regions (e.g. 32x32x32) and generate a mesh for each.
I have written a book article about this which you might find useful. It's actually about smooth voxel terain but a lot of the priciples stll apply.
You can read it on Google books here: http://books.google.com/books?id=WNfD2u8nIlIC&lpg=PR1&dq=game%20engine%20gems&pg=PA39#v=onepage&q&f=false
And you can find the associated source code here: http://www.thermite3d.org

Since you are using XNA, you can just use instancing to get the desired effect: http://www.float4x4.net/index.php/2010/06/hardware-instancing-in-xna/
http://roecode.wordpress.com/2008/03/17/xna-framework-gameengine-development-part-19-hardware-instancing-pc-only/
The underlying concept is instancing: this feature lets you specify some amount of repeating data and some amount of varying data in a single DrawIndexedPrimitive call. In your case, the instance stream would be a single solid box, and the other stream would be the transform and color information.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas