Sorting float array in fragment shader without dynamic branching in OpenGL ES (iPad) - opengl-es-2.0

We have a somewhat complicated fragment shader, which we cannot split into smaller shaders.
One of the thing the shader does is calculate the distance of the fragment/pixel to 7 other points, and find the two nearest ones.
On the iPad, we are having huge performance problems with this part of the code, and Instruments tells us the bottleneck is the shader code which leads to dynamic branches in the GPU.
We sacrificed readability and tried several code changes to avoid these dynamic branches:
Changed the loop so that it's not really a loop, but a repeating series of instructions.
Simplified the code so that it contains only very simple if-like instructions.
This is the 'simplest' version of the code we came up with:
bool isLowest;
float currentDistance, lowestDistance, newLowest;
lowestDistance = distances[1];
int indexOfLowest = 1, newIndexOfLowest;
// 'loop' for index 2
currentDistance = distances[2];
isLowest = currentDistance < lowestDistance;
newLowest = isLowest ? currentDistance : lowestDistance;
lowestDistance = newLowest;
newIndexOfLowest = isLowest ? 2 : indexOfLowest; // BAD LINE
indexOfLowest = newIndexOfLowest;
// 'loop' for index 3
currentDistance = distances[3];
isLowest = currentDistance < lowestDistance;
newLowest = isLowest ? currentDistance : lowestDistance;
lowestDistance = newLowest;
newIndexOfLowest = isLowest ? 3 : indexOfLowest; // BAD LINE
indexOfLowest = newIndexOfLowest;
// etc. until index 8
As you can see, the code finds the index of the lowest distance. In our opinion, this code can be executed without dynamic branches. The last four lines of one 'loop' are just simple calculations, they are not real branches.
The problem is the second-last row in a 'loop, where indexOfLowest gets a new value:
newIndexOfLowest = isLowest ? 2 : indexOfLowest;
If we comment out that line, everything works fine on 60 FPS and Instruments doesn't report dynamic branches. But with this line, it's not more than 8 FPS.
How can we rewrite this code such that it does not cause dynamic branches in the GPU?
EDIT: Just updated the code to an even simpler version, which has the same problems. The old code had a for loop with fixed limits. But the problem is still there.
EDIT 2: We just analyzed the code using PowerVR's PVRShaderEditor (previously PVRUniSCoEditor), where it shows the expected GPU cycles per shader source code line. The line in question (BAD LINE in the code above) is shown with only 1-2 GPU cycles. The fact that this line causes dynamic branches which lead to huge performance problems is not mentioned. Any other ideas how we could debug this code, so we can understand why it's causing these branches?

We did it. Casting the bool to float (1.0 for true, 0.0 for false) was helpful here:
bool isLowest;
float currentDistance, lowestDistance, newLowest;
lowestDistance = distances[1];
int indexOfLowest = 1, newIndexOfLowest;
// 'loop' for index 2
currentDistance = distances[2];
isLowest = currentDistance < lowestDistance;
indexOfLowest = float(isLowest) * float(2 - indexOfLowest);
lowestDistance = isLowest ? currentDistance : lowestDistance;
// 'loop' for index 3
currentDistance = distances[3];
isLowest = currentDistance < lowestDistance;
indexOfLowest = float(isLowest) * float(3 - indexOfLowest);
lowestDistance = isLowest ? currentDistance : lowestDistance;
// etc. until index 8
This calculates the index of the lowest distance, without using dynamic branches.

Related

process a sound with a UGen inside of a loop in supercollider

began using supercollider, read through the tutorials and practiced a bit but i can't seem to make this work. how can i process an input N times in a row ?
basically, i want build a synth to crush a sound source by running it through a distortion N times. i've tried many things, but what i want to do is basically:
// distortion function
dfunc = {
arg inp; // input
AnalogVintageDistortion.ar(
inp, 1, 2.5, 20.dbamp, 10.dbamp, 200, 10
)};
// taking an input and running a 1st distortion
in = In.ar(input, 1);
dout = dfunc.value(in);
// rerunning the distorsion 10 times
10.do({
dout = dfunc.value(dout);
});
basically, it would be a quick equivalent of doing:
dout = AnalogVintageDistorsion.ar(
AnalogVintageDistorsion.ar(
AnalogVintageDistorsion.ar(
dout
)
)
);
i'm coming from python, where this result would be achieved by:
for i in range(10):
dout = dfunc(dout)
thanks in advance !!

Reaching unexpected objc breakpoint with method __Block_byref_object_copy_

I'm using lldb to debug objc based service. several breakpoints (which is set
have been placed in the code, and I see that one of them is reached unexpectedly according to the stack trace.
The method encapsulating this breakpoint shouldn't have called but I still see it in stack trace (file1.mm:97) although it seems like the code isn't being execute there.
I suspect that objc internal method __Block_byref_object_copy_ is responsible for copying the block of code which involves both caller and callee methods (MyClass from the upper frame in the stack and the method in file1.mm:97).
While copying the debugger probably thinks that it reach this line for execution and stop there, where in fact it's only for copying the code block which involves those 2 methods.
Perhaps anybody can support this claim or provide additional explanation of why am I getting this breakpoint where it shouldn't occur ?
* frame #0: 0x0000000107e03ce0 MyLib`::__Block_byref_object_copy_((null)=0x00007fda19a86b30, (null)=0x00007ffeea7f3bd0) at file1.mm:97:27
frame #1: 0x00007fff7de6bb78 libsystem_blocks.dylib`_Block_object_assign + 325
frame #2: 0x0000000107dd960a MyLib`::__copy_helper_block_ea8_32r((null)=0x00007fda19a86540, (null)=0x00007ffeea7f3ba8) at file2.mm:47:55
frame #3: 0x00007fff7de6b9f3 libsystem_blocks.dylib`_Block_copy + 104
frame #4: 0x00007fff7c64e1e8 libobjc.A.dylib`objc_setProperty_atomic_copy + 53
frame #5: 0x00007fff5411d16b Foundation`-[NSXPCConnection _sendInvocation:orArguments:count:methodSignature:selector:withProxy:] + 1885
frame #6: 0x00007fff54168508 Foundation`-[NSXPCConnection _sendSelector:withProxy:arg1:arg2:] + 125
frame #7: 0x00007fff54168485 Foundation`_NSXPCDistantObjectSimpleMessageSend2 + 46
frame #8: 0x0000000107e0520e MyLib`::-[MyClass func:withVar0:Var1:Var2:withError:](self=0x00007fda17c2cb50, _cmd="funcWithVar0:Var1:Var2:Var3:withError:", var0="aaa", var1=0x0000000000000000, var2="bbb", var3=0x00007fda17d41dd0, err=0x00007ffeea7f4258) at MyClass.mm:196:5
UPDATE:
thanks to the comments below, it happen that if I set breakpoint according to file and line, it gives me 3 locations (!?)
breakpoint set --file myfile.mm --line 97
now when I list my breakpoints, it give me 2 breakpoints that aren't related to the actual method which wraps the file, besides the expected breakpoint.
3.2: where = my class`::__Block_byref_object_copy_() + 16 at myfile:97:27, address = 0x0000000107e03ce0, unresolved, hit count = 0
3.3: where = myclass `::__Block_byref_object_dispose_() + 16 at myfile:97:27, address = 0x0000000107e03d40, unresolved, hit count = 0
Not really an answer, more an illustration ...
unsigned fn ( void )
{
return 3; // Set bp here and see what happens
}
int main(int argc, const char * argv[])
{
#autoreleasepool
{
// insert code here...
NSLog(#"Hello, World!");
unsigned int x = 0;
x = 1;
x = 2;
x = 3;
if ( x >= 0 )
{
switch ( 5 )
{
case 1 : x = 4; break;
case 2 : x = 4; break; // Set bp here - it will trigger!!!!
case 3 : x = 4; break;
case 4 : x = 4; break;
case 5 : x = 4; break; // Set bp here and see what happens
case 6 : x = 4; break;
case 7 : x = 4; break;
default : x = 4; break;
}
}
__block unsigned y;
void ( ^ b1 )( void ) = ^ {
y = fn();
};
void ( ^ b2 )( void ) = ^ {
b1();
};
if ( YES )
{
b2();
}
NSLog ( #"x = %u y = %u", x, y );
}
return 0;
}
Without going into too much detail, note that most of the code above will be optimised away. The compiler will optimise for loops and switches aggressively and will optimise superfluous assignments and checks (e.g. x >= 0 for unsigned x) away. Above even the blocks gets inlined and you end up with what appears to be very strange behaviour. I think the blocks here are relevant to your specific problem?
If you optimise then the first breakpoint indicated in the code does not get triggered as the blocks all get inlined (I think as I did not look at the disassembly). Likewise the third one does get triggered, but because of optimisation in the strangest of places. Really, a large part of the code gets reduced into a single assignment and the compiler does not really know where to stick it so when that bp is triggered it looks as if it is in the strangest and most disconnected place possible.
Finally, even the second (!!!) one will trigger. That code should never execute but since it all collapse due to optimisation you can even get it to trigger.
So I would not be too perplexed about what triggers your bp ...
I mean, above I just proved that 2 == 5 if I take the bp seriously. And that is not even variable but constant 2 == 5!!! Quite amusing ...

game maker random cave generation

I want to make a cave explorer game in game maker 8.0.
I've made a block object and an generator But I'm stuck. Here is my code for the generator
var r;
r = random_range(0, 1);
repeat(room_width/16) {
repeat(room_height/16) {
if (r == 1) {
instance_create(x, y, obj_block)
}
y += 16;
}
x += 16;
}
now i always get a blank frame
You need to use irandom(1) so you get an integer. You also should put it inside the loop so it generates a new value each time.
In the second statement, you are generating a random real value and storing it in r. What you actually require is choosing one of the two values. I recommend that you use the function choose(...) for this. Here goes the corrected statement:
r = choose(0,1); //Choose either 0 or 1 and store it in r
Also, move the above statement to the inner loop. (Because you want to decide whether you want to place a block at the said (x,y) location at every spot, right?)
Also, I recommend that you substitute sprite_width and sprite_height instead of using the value 16 directly, so that any changes you make to the sprite will adjust the resulting layout of the blocks accordingly.
Here is the code with corrections:
var r;
repeat(room_width/sprite_width) {
repeat(room_height/sprite_height) {
r = choose(0, 1);
if (r == 1)
instance_create(x, y, obj_block);
y += sprite_height;
}
x += sprite_width;
}
That should work. I hope that helps!
Looks like you are only creating a instance if r==1. Shouldn't you create a instance every time?
Variable assignment r = random_range(0, 1); is outside the loop. Therefore performed only once before starting the loop.
random_range(0, 1) returns a random real number between 0 and 1 (not integer!). But you have if (r == 1) - the probability of getting 1 is a very small.
as example:
repeat(room_width/16) {
repeat(room_height/16) {
if (irandom(1)) {
instance_create(x, y, obj_block)
}
y += 16;
}
x += 16;
}
Here's a possible, maybe even better solution:
length = room_width/16;
height = room_height/16;
for(xx = 0; xx < length; xx+=1)
{
for(yy = 0; yy < height; yy+=1)
{
if choose(0, 1) = 1 {
instance_create(xx*16, yy*16, obj_block); }
}
}
if you want random caves, you should probably delete random sections of those blocks,
not just single ones.
For bonus points, you could use a seed value for the random cave generation. You can also have a pathway random generation that will have a guaranteed path to the finish with random openings and fake paths that generate randomly from that path. Then you can fill in the extra spaces with other random pieces.
But in regards to your code, you must redefine the random number each time you are placing a block, which is why all of them are the same. It should be called inside of the loops, and should be an integer instead of a decimal value.
Problem is on the first line, you need to put r = something in the for cycle

Looping with iterator vs temp object gives different result graphically (Libgdx/Java)

I've got a particle "engine" whom I've implementing a Pool system to and I've tested two different ways of rendering every Particle in a list. Please note that the Pooling really doesn't have anything with the problem to do. I just followed a tutorial and tried to use the second method when I noticed that they behaved differently.
The first way:
for (int i = 0; i < particleList.size(); i++) {
Iterator<Particle> it = particleList.iterator();
while (it.hasNext()) {
Particle p = it.next();
if (p.isDead()){
it.remove();
}
p.render(batch, delta);
}
}
Which works just fine. My particles are sharp and they move with the correct speed.
The second way:
Particle p;
for (int i = 0; i < particleList.size(); i++) {
p = particleList.get(i);
p.render(batch, delta);
if (p.isDead()) {
particleList.remove(i);
bulletPool.free(p);
}
}
Which makes all my particles blurry and moving really slow!
The render method for my particles look like this:
public void render(SpriteBatch batch, float delta) {
sprite.setX(sprite.getX() + (dx * speed) * delta * Assets.FPS);
sprite.setY(sprite.getY() + (dy * speed) * delta * Assets.FPS);
ttl--;
sprite.setScale(sprite.getScaleX() - 0.002f);
if (ttl <= 0 || sprite.getScaleX() <= 0)
isDead = true;
sprite.draw(batch);
}
Why do the different rendering methods provide different results?
Thanks in advance
You are mutating (removing elements from) a list while iterating over it. This is a classic way to make a mess.
The Iterator must have code to handle the delete case correctly. But your index-based for loop does not. Specifically when you call particleList.remove(i) the i is now "out of sync" with the content of the list. Consider what happens when you remove the element at index 3: 'i' will increment to 4, but the old element 4 got shuffled down into index 3, so it will get skipped.
I assume you're avoiding the Iterator to avoid memory allocations. So, one way to side-step this issue is to reverse the loop (go from particleList.size() down to 0). Alternatively, you can only increment i for non-dead particles.

openCV cvContourArea

I'm trying to use cvFindContours, which definitely seems like the way to go. I'm having a problem with getting the largest one. There is a function call cvContourArea, which suppose to get the area of a contour in a sequence. I'm having trouble with it.
int conNum = cvFindContours(outerbox, storage, &contours, sizeof(CvContour),CV_RETR_LIST,CV_CHAIN_APPROX_NONE,cvPoint(0, 0));
CvSeq* current_contour = contours;
double largestArea = 0;
CvSeq* largest_contour = NULL;
while (current_contour != NULL){
double area = fabs(cvContourArea(&storage,CV_WHOLE_SEQ, false));
if(area > largestArea){
largestArea = area;
largest_contour = current_contour;
}
current_contour = current_contour->h_next;
}
I tried replacing storage (in the cvContourArea) with contours, but same error keeps coming up no matter what:
OpenCV Error: Bad argument (Input array is not a valid matrix) in cvPointSeqFromMat, file /Volumes/ramdisk/opencv/OpenCV-2.2.0/modules/imgproc/src/utils.cpp, line 53
I googled and could hardly find example of cvContourArea that takes 3 arguments.. as if it's changed recently.. I want to loop thru the found contours and find the biggest one and after that draw it using the cvDrawContours method.. Thanks!
Try to change &storage to current_contour in the following statement.
Change
double area = fabs(cvContourArea(&storage,CV_WHOLE_SEQ, false));
to
double area = fabs(cvContourArea(current_contour,CV_WHOLE_SEQ, 0));