What causes fluctuating execution times when presenting the Renderbuffer? (OpenGL) - objective-c

This is what happens:
The drawGL function is called at the exact end of the frame thanks to a usleep, as suggested. This already maintains a steady framerate.
The actual presentation of the renderbuffer takes place in drawGL(). Measuring the time this takes gives me fluctuating execution times, which shows up as stutter in my animation. The timer uses mach_absolute_time, so it's extremely accurate.
At the end of my frame, I measure timeDifference. Yes, it's 1 millisecond on average, but it deviates a lot, ranging from 0.8 milliseconds to 1.2, with peaks of more than 2 milliseconds.
Example:
#import <mach/mach_time.h>

// Every fraction of a second I call tick
- (void)tick
{
    [self drawGL];
}

- (void)drawGL
{
    uint64_t startTime = mach_absolute_time();

    glBindRenderbufferOES(GL_RENDERBUFFER_OES, viewRenderbuffer);
    [context presentRenderbuffer:GL_RENDERBUFFER_OES];

    uint64_t endTime = mach_absolute_time();
    uint64_t timeDifference = endTime - startTime; // still in Mach time units
}
My understanding is that once the framebuffer has been created, presenting the renderbuffer should always take roughly the same effort, regardless of the complexity of the frame. Is this true? And if not, how can I prevent the variation?
By the way, this is an example from an iPhone app, so we're talking OpenGL ES here, though I don't think it's a platform-specific problem. If it is, then what is going on? Shouldn't this not be happening? And again, if so, how can I prevent it?

The deviations you encounter may be caused by a lot of factors, including the OS scheduler kicking in and giving the CPU to another process, or similar issues. In fact, a normal human won't tell the difference between 1 and 2 ms render times. Motion pictures run at 25 fps, which means each frame is shown for roughly 40 ms, and that looks fluid to the human eye.
As for the animation stuttering, you should examine how you maintain a constant animation speed. The most common approach I've seen looks roughly like this:
while(loop)
{
    lastFrameTime;  // time it took for the last frame to render
    timeSinceLastUpdate += lastFrameTime;

    if(timeSinceLastUpdate > (1 second / DESIRED_UPDATES_PER_SECOND))
    {
        updateAnimation(timeSinceLastUpdate);
        timeSinceLastUpdate = 0;
    }

    // do the drawing
    presentScene();
}
Or you could just pass lastFrameTime to updateAnimation every frame and interpolate between animation states. The result will be even more fluid.
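For illustration, here is a minimal sketch of that variable-timestep variant, using mach_absolute_time since that is the timer from the question. updateAnimation, presentScene, and the running flag are hypothetical stand-ins for your own update, draw, and exit logic:

#import <mach/mach_time.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical stand-ins for the app's own update and draw calls.
static void updateAnimation(double dtSeconds) { /* advance the animation by dtSeconds */ }
static void presentScene(void)                { /* bind the renderbuffer and present it */ }

// Convert Mach time units to seconds using the device's timebase.
static double MachToSeconds(uint64_t ticks)
{
    static mach_timebase_info_data_t timebase;
    if (timebase.denom == 0)
        mach_timebase_info(&timebase);
    return (double)ticks * timebase.numer / timebase.denom / 1e9;
}

static void runAnimationLoop(bool running)
{
    uint64_t previous = mach_absolute_time();
    while (running)
    {
        uint64_t now = mach_absolute_time();
        double lastFrameTime = MachToSeconds(now - previous); // seconds spent on the last frame
        previous = now;

        updateAnimation(lastFrameTime); // movement is scaled by the real elapsed time
        presentScene();
    }
}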
If you're already using something like the above, maybe you should look for culprits in other parts of your render loop. In Direct3D the costly things were draw-primitive calls and render-state changes, so you might want to check the OpenGL analogues of those.

My favorite OpenGL expression of all time: "implementation specific". I think it applies here very well.

A quick search for mach_absolute_time results in this article: Link
It looks like the precision of that timer on an iPhone is only 166.67 ns (and maybe worse).
While that may explain the large difference, it doesn't explain that there is a difference at all.
The three main reasons are probably:
Different execution paths during renderbuffer presentation. A lot can happen in 1ms and just because you call the same functions with the same parameters doesn't mean the exact same instructions are executed. This is especially true if other hardware is involved.
Interrupts/other processes, there is always something else going on that distracts the CPU. As far as I know the iPhone OS is not a real-time OS and so there's no guarantee that any operation will complete within a certain time limit (and even a real-time OS will have time variations).
If there are any other OpenGL calls still being processed by the GPU, they might delay presentRenderbuffer. That's the easiest one to test: just call glFinish() before getting the start time.
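For example, inside the drawGL from the question the test is a single extra call at the top; if the fluctuation disappears, previously queued GL work was being charged to the present:

glFinish();                                // drain all previously issued GL commands first
uint64_t startTime = mach_absolute_time(); // only the presentation is measured from here on
glBindRenderbufferOES(GL_RENDERBUFFER_OES, viewRenderbuffer);
[context presentRenderbuffer:GL_RENDERBUFFER_OES];
uint64_t endTime = mach_absolute_time();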

It is best not to rely on a high, constant frame rate, for a number of reasons; the most important is that the OS may do something in the background that slows things down. It is better to sample a timer and work out how much time has passed each frame; this should ensure smooth animation.

Is it possible that the timer is not accurate at the sub-millisecond level, even though it is returning decimal values in the 0.8 to 2.0 range?

Related

Processing get higher max frameRate

So I just recently kind of had a breakthrough with neural nets and made a couple of games with NN AIs. For training, I use frameRate(100000) to jack the frame rate up. However, checking with println(frameRate), I see that the average frame rate is about 270. Removing all drawing (pretty much just shapes) increases it to about 300.
I'd like to make it faster. I noticed the documentation states that frameRate() only goes as high as your processor can handle, but checking Task Manager I see the program is only using about 20% of my CPU and only 90 MB of memory. I've increased the maximum available memory to 4096 MB in the preferences, but that didn't seem to make a difference.
So I guess my question is: how do I allow Processing to use more of my CPU for a faster frameRate? Or is there a better option other than simply "optimizing my code", because it's already fairly optimized IMO (not saying it couldn't be better, though).
Keep in mind that even with a very high framerate, the mechanisms that call draw() have some overhead that you don't need if you aren't drawing anything. Your computer might limit the frame rate depending on your graphics settings. Also note that the println() statement itself is very slow, so you should not use it for continuously printing out your frame rate.
If you aren't drawing anything (or if you're only drawing a single frame), you can probably just use a basic loop instead of the draw() function.
Instead, try something like this:
boolean running = true;
while(running){
    // do your processing
    if(done){
        running = false;
    }
}

SceneKit crashes without verbose crash info

Update: For anyone who stumbles upon this, it seems like SceneKit has a threshold for the maximum number of objects it can render. Using [SCNNode flattenedClone] is a great way to help increase the number of objects it can handle. As @Hal suggested, I'll file a bug report with Apple describing the performance issues discussed below.
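A rough sketch of the idea (not the exact code; graphNodes is the container node described in the question below):

SCNNode *graphNodes = [SCNNode node];
// ... add the ~5,000 sphere nodes and ~50,000 connection nodes as children ...

// Flatten the subtree so SceneKit can batch its geometry into far fewer draw calls.
SCNNode *flattened = [graphNodes flattenedClone];
[self.graphSceneView.scene.rootNode addChildNode:flattened];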
I'm somewhat new to iOS and I'm currently working on my first OS X project for a class. I'm essentially creating random geometric graphs (random points in space connected to one another if the distance between them is ≤ a given radius) and I'm using SceneKit to display the results. I already know I'm pushing SceneKit to its limits, but if the number of objects I'm trying to graph is too large, the whole thing just crashes and I don't know how to interpret the results.
My SceneKit scene consists of the default camera, 2 lighting nodes, approximately 5,000 SCNSpheres each within an SCNNode (the nodes of the graph), and then about 50,000 connections of type SCNGeometryPrimitiveTypeLine, which are also within SCNNodes. All of these nodes are then added to one large node, which is added to my scene.
The code works for smaller numbers of spheres and connections.
When I run my app with these specifications, everything seems to work fine, then 5-10 seconds after executing the following lines:
dispatch_async(dispatch_get_main_queue(), ^{
    [self.graphSceneView.scene.rootNode addChildNode:graphNodes];
});
the app crashes with nothing but a terse crash screen.
Given that I'm sort of new to Xcode and used to more verbose output upon crashing, I'm a bit over my head. What can I do to get more information on this crash?
That's terse output for sure. You can attack it by simplifying until you don't see the crash anymore.
First, do you ever see anything on screen?
Second, your call to
dispatch_async(dispatch_get_main_queue(), ^{
    [self.graphSceneView.scene.rootNode addChildNode:graphNodes];
});
still runs on the main queue, so I would expect it to make no difference in perceived speed or responsiveness. So take addChildNode: out of the GCD block and invoke it directly. Does that make a difference? At the least, you'll see your crash immediately, and might get a better stack trace.
Third, creating your own geometry from primitives like SCNGeometryPrimitiveTypeLine is trickier than using the SCNGeometry subclasses. Memory mismanagement in that step could trigger mysterious crashes. What happens if you remove those connection lines? What happens if you replace them with long, skinny SCNBox instances? You might end up using SCNBox by choice because it's easier to style in SceneKit than a primitive line.
Fourth, take a look at @rickster's answer to this question about optimization: SceneKit on OS X with thousands of objects. It sounds like your project would benefit from node flattening (flattenedClone), and possibly the use of SCNLevelOfDetail. But these suggestions fall into the category of premature optimization, the root of all evil.
It would be interesting to hear what results from toggling between the Metal and OpenGL renderers. That's a setting on the SCNView in IB ("preferred renderer" I think), and an optional entry in Info.plist.

Working around WebGL readPixels being slow

I'm trying to use WebGL to speed up computations in a simulation of a small quantum circuit, like what the Quantum Computing Playground does. The problem I'm running into is that readPixels takes ~10ms, but I want to call it several times per frame while animating in order to get information out of gpu-land and into javascript-land.
As an example, here's my exact use case. The following circuit animation was created by computing things about the state between each column of gates, in order to show the inline-with-the-wire probability-of-being-on graphing.
The way I'm computing those things now, I'd need to call readPixels eight times for the above circuit (once after each column of gates). This is waaaaay too slow at the moment, easily taking 50ms when I profile it (bleh).
What are some tricks for speeding up readPixels in this kind of use case?
Are there configuration options that significantly affect the speed of readPixels? (e.g. the pixel format, the size, not having a depth buffer)
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Should I try to aggregate all the textures I'm reading into a single megatexture and sort things out after a single big read?
Should I be using a different method to get the information back out of the textures?
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Yes, yes, yes. readPixels is fundamentally a blocking, pipeline-stalling operation, and it is always going to kill your performance wherever it happens, because it's sending a request for data to the GPU and then waiting for it to respond, which normal draw calls don't have to do.
Do readPixels as few times as you can (use a single combined buffer to read from). Do it as late as you can. Everything else hardly matters.
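To sketch the single-combined-buffer idea in OpenGL ES terms (WebGL's readPixels mirrors glReadPixels), render every column's result into its own slice of one shared framebuffer and read it back with a single call. The FBO handle and slice sizes below are illustrative assumptions, not values from the question:

#include <OpenGLES/ES2/gl.h>

#define NUM_COLUMNS   8     // one slice per column of gates
#define SLICE_WIDTH   64    // illustrative sizes, not taken from the question
#define SLICE_HEIGHT  64

static GLubyte results[NUM_COLUMNS * SLICE_WIDTH * SLICE_HEIGHT * 4];

// combinedResultsFBO is an assumed framebuffer that every per-column pass rendered into.
void readAllColumnsOnce(GLuint combinedResultsFBO)
{
    glBindFramebuffer(GL_FRAMEBUFFER, combinedResultsFBO);
    // One blocking readback replaces eight separate pipeline stalls.
    glReadPixels(0, 0, NUM_COLUMNS * SLICE_WIDTH, SLICE_HEIGHT,
                 GL_RGBA, GL_UNSIGNED_BYTE, results);
}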
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
This will get you immensely better performance.
If your graphics are all like you show above, you shouldn't need to do any “layout” at all (which is good, because it'd be very awkward to implement) — everything but the text is some kind of color or boundary animation which could easily be done in a shader, and all the layout can be just a static vertex buffer (each vertex has attributes which point at which simulation-state-texel it should be depending on).
The text will be more tedious merely because you need to load all the digits into a texture to use as a spritesheet and do the lookups into that, but that's a standard technique. (Oh, and divide/modulo to get the digits.)
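For example, the divide/modulo step for a value displayed as a two-digit percentage (the names are just placeholders):

int percent   = 73;            // hypothetical value to display
int tensGlyph = percent / 10;  // 7 -> index of the "7" sprite in the digit sheet
int onesGlyph = percent % 10;  // 3 -> index of the "3" sprite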
I don't know enough about your use case, but just guessing: why do you need to call readPixels at all?
First, you don't need to draw text or the static parts of your diagram in WebGL. Put another canvas, svg, or img over the WebGL canvas and set the CSS so they overlap. Let the browser composite them; then you don't have to do it yourself.
Second, let's assume you have a texture that has your computed results in it. Can't you just make some geometry that matches the places in your diagram that need to have colors, and use texture coords to look up the results from the correct places in the results texture? Then you don't need to call readPixels at all. That shader can use a ramp texture lookup or any other technique to convert the results to other colors to shade the animated parts of your diagram.
If you want to draw numbers based on the results, you can use a technique like this: make a shader that references the results texture, looks up a result value, and then indexes glyphs from another texture based on that.
Am I making any sense?

How does Flash Player execute Timer?

I know about runtime code execution fundamentals in Flash Player and Air Debugger. But I want to know how Timer code execution is done.
Would it be better to use a Timer rather than an enterFrame event for similar actions? Which one is better for performance?
It depends on what you want to use it for. Most will vehemently say to use Event.ENTER_FRAME, and in most instances this is what you want: it is only called once each time a frame begins construction. If your app is running at 24 fps (the default), that code will run once every 41.7 ms, assuming no dropped frames. For almost all GUI-related cases you do not want to run code more often than this, because it is entirely unnecessary (you can run it more often, sure, but it won't do any good since the screen is only updated that often).
There are times when you need code to run more often, however, mostly in non-GUI related cases. This can range from a systems check that needs to happen in the background to something that needs to be absolutely precise, such as an object that needs to be displayed/updated on an exact interval (Timer is accurate to the ms, I believe, whereas ENTER_FRAME is only accurate to the 1000/framerate ms).
In the end, it doesn't make much sense to use a Timer that fires less often than ENTER_FRAME, and firing one more often than that risks dropping frames. ENTER_FRAME is ideal for nearly everything graphics-related, other than making something appear/update at a precise time. And even then, you should use ENTER_FRAME as well, since the change would only be rendered in the next frame anyway.
You need to evaluate each situation on a case-by-case basis and determine which is best for that particular situation because there is no best answer for all cases.
EDIT
I threw together a quick test app to see when Timer fires. Framerate is 24fps (42ms) and I set the timer to run every 10ms. Here is a selection of times it ran at.
2121
2137
2154
2171
2188
2203
2221
2237
As you can see, it is running every 15-18ms instead of the 10ms I wanted it to. I also tested 20ms, 100ms, 200ms, and 1000ms. In the end, each one fired within about 10ms of when it should have. So this is not nearly as precise as I had originally thought it was.

OpenGL quad rendering optimization

I'm drawing quads in openGL. My question is, is there any additional performance gain from this:
// Method #1
glBegin(GL_QUADS);
// Define vertices for 10 quads
glEnd();
... over doing this for each of the 10 quads:
// Method #2
glBegin(GL_QUADS);
// Define vertices for first quad
glEnd();
glBegin(GL_QUADS);
// Define vertices for second quad
glEnd();
//etc...
All of the quads use the same texture in this case.
Yes, the first is faster, because each call to glBegin or glEnd changes the OpenGL state.
Even better, however, than one call to glBegin and glEnd (if you have a significant number of vertices), is to pass all of your vertices with glVertexPointer (and friends), and then make one call to glDrawArrays or glDrawElements. This will send all your vertices to the GPU in one fell swoop, instead of incrementally by calling glVertex3f repeatedly.
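A minimal sketch of that vertex-array path for the ten quads from the question (texture coordinates omitted; filling in the vertex data is assumed):

GLfloat quadVerts[10 * 4 * 3];               // 10 quads x 4 corners x (x, y, z)
// ... fill quadVerts with the same corner positions used between glBegin/glEnd ...

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, quadVerts);  // 3 floats per vertex, tightly packed
glDrawArrays(GL_QUADS, 0, 10 * 4);           // one call submits all 10 quads
glDisableClientState(GL_VERTEX_ARRAY);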
From a function-call overhead perspective the second approach is more expensive: if instead of ten quads we used ten thousand, then glBegin/glEnd would be called ten thousand times per frame instead of once.
More importantly glBegin/glEnd have been deprecated as of OpenGL 3.0, and are not supported by OpenGL ES.
Instead, vertices are submitted as vertex arrays using calls such as glDrawArrays. A tutorial and much more in-depth information can be found on the NeHe site.
I decided to go ahead and benchmark it using a loop of 10,000 quads.
The results:
Method 1: 0.0128 seconds
Method 2: 0.0132 seconds
Method #1 does have some improvement, but the improvement is very marginal (3%). It's probably nothing more than the overhead of simply calling more functions. So it's likely that OpenGL itself doesn't get any additional optimization from Method #1.
This is on Windows XP Service Pack 3 using OpenGL 2.0 and Visual Studio 2005.
I believe the answer is yes, but you should try it out yourself. Write something that draws 100k quads and see if one is much faster. Then report your results here :)
schnaader: What is meant in the document you read is that you should not have non-GL-related code between glBegin and glEnd. It does not mean that you should split the drawing into many short glBegin/glEnd blocks rather than one.
I suppose that you get the highest performance gain by reusing the vertices. To achieve that, you would need to maintain some structure for the primitives yourself.
You would certainly get better performance simply in terms of how much code the CPU has to execute.
Whether or not your drawing performance would be better on the GPU, that would completely depend on the implementation of the driver for your 3d graphics card. You could get potentially wildly different results with a different manufacturer's driver and even with a different version of the driver for the same card.