Variable time step bug with Box2D - optimization

Can anybody spot what is wrong with the code below? It is supposed to average the frame interval (dt) over the previous TIME_STEPS frames.
I'm using Box2D and cocos2d, although I don't think the cocos2d part is very relevant.
-(void) update: (ccTime) dt
{
    float32 timeStep;
    const int32 velocityIterations = 8;
    const int32 positionIterations = 3;

    // Average the previous TIME_STEPS time steps
    for (int i = 0; i < TIME_STEPS; i++)
    {
        timeStep += previous_time_steps[i];
    }
    timeStep = timeStep/TIME_STEPS;

    // step the world
    [GB2Engine sharedInstance].world->Step(timeStep, velocityIterations, positionIterations);

    for (int i = 0; i < TIME_STEPS - 1; i++)
    {
        previous_time_steps[i] = previous_time_steps[i+1];
    }
    previous_time_steps[TIME_STEPS - 1] = dt;
}
The previous_time_steps array is initially filled with whatever the animation interval is set to.
This doesn't do what I would expect it to. On devices with a low frame rate it speeds up the simulation, and on devices with a high frame rate it slows it down. I'm sure it's something stupid I'm overlooking.
I know Box2D likes to work with fixed time steps, but I really don't have a choice. My game runs at a very variable frame rate on the various devices, so a fixed time step just won't work. The game runs at an average of 40 fps, but on some of the crappier devices like the first-gen iPad it barely manages 30 frames per second. The third-gen iPad runs it at 50/60 frames per second.
I'm open to suggestions on other ways of dealing with this problem too. Any advice would be appreciated.
Something else unusual I should note, which somebody might have some insight into, is that the debug optimisation level has a huge effect on the above. The frame rate isn't changed much between -Os and -O0, but with -Os the physics simulation runs much faster than with -O0 when the above code is active. If I just use dt as the interval instead of the above code, the optimisation level makes no difference.
I'm totally confused by that.

On devices with a low frame rate it speeds up the simulation and on devices with a high frame rate it slows it down.
That's what using a variable time step is all about. If you only get 10 fps, the physics engine will iterate the world faster because the delta time is larger.
PS: If you do any kind of performance test like this, run it with the release build. That also ensures that (most) logging is disabled and code optimizations are on. It's possible that you simply see a much greater impact from debugging code on older devices.
Also, what value is TIME_STEPS? It shouldn't be more than 10, maybe 20 at most. The alternative to averaging is to use the delta time directly, but if the delta time is greater than a certain threshold (30 fps), switch to a fixed delta time (cap it). Variable time steps below 30 fps can get really ugly, so in such cases it's probably better to let the physics engine slow down with the framerate, or else the game becomes harder if not unplayable at lower fps.
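A minimal sketch of that clamp-and-average idea (the constants, buffer and helper name here are illustrative, not taken from the question's project):
// Sketch: clamp dt before it enters the running average so one slow frame
// cannot blow up the step. Note the accumulator starts at zero.
#include <Box2D/Box2D.h>

static const float MAX_DT = 1.0f / 30.0f;   // never step further than this
static const int   TIME_STEPS = 8;
static float previousSteps[TIME_STEPS];     // assume pre-filled with 1/60

void stepWorld(b2World *world, float dt)
{
    // shift the history and append the clamped dt
    for (int i = 0; i < TIME_STEPS - 1; i++)
        previousSteps[i] = previousSteps[i + 1];
    previousSteps[TIME_STEPS - 1] = (dt > MAX_DT) ? MAX_DT : dt;

    // average the history
    float timeStep = 0.0f;
    for (int i = 0; i < TIME_STEPS; i++)
        timeStep += previousSteps[i];
    timeStep /= TIME_STEPS;

    world->Step(timeStep, 8, 3);            // velocity/position iterations as above
}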

Related

Speeding up CUDA atomics calculation for many bins/few bins

I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over the corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into, let us say, two buckets.
The easiest way to do this appears to be:
Load the variables into shared memory
Do vectorized loads for unsigned char, etc. if needed.
Do an atomic add in shared memory
Do a coalesced write to global.
Something like this:
__global__ void shmem_atomics_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
    block_reduced[threadIdx.x] = 0;
    __syncthreads();

    atomicAdd(&block_reduced[data[tid]], 1);
    __syncthreads();

    for(int i = threadIdx.x; i < NUM_BINS; i += NUM_BINS)
        atomicAdd(&count[i], block_reduced[i]);
}
The performance of this kernel drops (naturally) when we decrease the number of bins, from around 45 GB/s at 32 bins to around 10 GB/s at 1 bin. Contention and shared memory bank conflicts are given as reasons. I don't know if there is any way to remove either of these for this calculation in any significant way.
I've also been experimenting with another (beautiful) idea from the parallelforall blog involving warp level reductions using __ballot to grab warp results and then using __popc() to do the warp level reduction.
__global__ void ballot_popc_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;
    uint warp_id = threadIdx.x >> 5;

    //need lane_ids since we are going warp level
    uint lane_id = threadIdx.x%32;

    //for ballot
    uint warp_set_bits = 0;

    //to store warp level sum
    __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];
    //shared data
    __shared__ uint s_data[NUM_THREADS_PER_BLOCK];

    //load shared data - could store to registers
    s_data[threadIdx.x] = data[tid];
    __syncthreads();

    //suspicious loop - I think we need more parallelism
    for(int i = 0; i < NUM_BINS; i++){
        warp_set_bits = __ballot(s_data[threadIdx.x] == i);

        if(lane_id == 0){
            warp_reduced_count[warp_id] = __popc(warp_set_bits);
        }
        __syncthreads();

        //do warp level reduce
        //could use shfl, but it does not change the overall picture
        if(warp_id == 0){
            int t = threadIdx.x;
            for(int j = NUM_WARPS_PER_BLOCK/2; j > 0; j >>= 1){
                if(t < j) warp_reduced_count[t] += warp_reduced_count[t+j];
                __syncthreads();
            }
        }
        __syncthreads();

        if(threadIdx.x == 0){
            atomicAdd(&count[i], warp_reduced_count[0]);
        }
    }
}
This gives decent numbers (well, that is moot - peak device memory bandwidth is 133 GB/s, and things seem to depend on launch configuration) for the single-bin case (35-40 GB/s for 1 bin, as against 10-15 GB/s using atomics), but performance drops drastically when we increase the number of bins. When we run with 32 bins, performance drops to about 5 GB/s. The reason is probably the single thread looping through all the bins, which suggests the NUM_BINS loop should be parallelized.
I have tried several ways of parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create some blocks for each bin. This seems to behave the same way, possibly because we would again suffer from contention with multiple blocks attempting to read from global memory. Plus, the programming is clunky. Likewise, parallelizing in the y direction for bins gives similarly uninspiring results.
The other idea I tried just for kicks was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, possibly owing to no real compute work for the child kernels and the launch overhead.
The most promising approach seems to be the one from Nicholas Wilt's article on using so-called privatized histograms, with bins for each thread in shared memory, which would ostensibly be very heavy on shared memory usage (and we only have 48 kB per SM on Maxwell).
Perhaps someone could shed some insight into the problem? I feel that one ought to change the algorithm instead, so as not to use histograms and to use something less frequentist. Otherwise, I suppose we just use the atomics version.
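For reference, a middle ground between one shared histogram per block and fully per-thread privatized bins is one sub-histogram per warp in shared memory, so atomics only contend among the 32 lanes of a warp. This is a rough sketch rather than the article's code; NUM_BINS and the launch parameters are assumptions:
// Sketch: one private sub-histogram per warp in shared memory. Shared memory
// use is (blockDim.x/32) * NUM_BINS counters instead of one set per thread.
__global__ void warp_private_hist(const int *data, int *count, int n)
{
    extern __shared__ int s_hist[];                 // (blockDim.x/32) * NUM_BINS ints
    const int warps_per_block = blockDim.x / 32;
    int *my_hist = s_hist + (threadIdx.x / 32) * NUM_BINS;

    // zero every sub-histogram cooperatively
    for (int i = threadIdx.x; i < warps_per_block * NUM_BINS; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // grid-stride loop over the input; each warp updates only its own bins
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&my_hist[data[i]], 1);
    __syncthreads();

    // reduce the sub-histograms and issue one global atomic per bin per block
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        int total = 0;
        for (int w = 0; w < warps_per_block; ++w)
            total += s_hist[w * NUM_BINS + bin];
        atomicAdd(&count[bin], total);
    }
}
// launch with: warp_private_hist<<<blocks, threads,
//     (threads/32) * NUM_BINS * sizeof(int)>>>(d_data, d_count, n);
In the extreme single-bin case the lanes of a warp still collide with each other, so this mainly helps the mid-range bin counts.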
Edit: The context for my problem is in computing probability density functions to be used for pattern-classification. We can compute approximate histograms (more precisely, pdfs) by using non-parametric methods such as Parzen Windows or Kernel Density Estimation. However, this does not overcome the problem of dimensionality as we need to sum over all data points for every bin, which becomes expensive when the number of bins becomes large. See here: Parzen
I faced similar challenges working with clustering, but in the end the best solution was to use the scan pattern to group the processing, so I don't think that would work for you. Since you asked for some experience in this area, I'll share mine with you.
The issues
In your first code, I guess the low performance as the number of bins shrinks is linked to warp stalls, since you perform very little processing for each piece of data evaluated. When the number of bins is increased, the ratio between processing and global memory loads (data info) for that kernel also increases. You can check that very easily with the "Issue Efficiency" experiments under Performance Analysis in Nsight. Probably you are getting a low rate of cycles with at least one eligible warp (Warp Issue Efficiency).
Since I was not able to raise the number of eligible warps to somewhere close to 95%, I gave up on this approach, since in some cases it gets worse (memory dependency stalls ate 90% of my processing cycles).
The shuffle and vote reduction is very useful if the number of bins is not too large. If it is too large, only a small number of threads will be active for each bin filter, so you may end up with a lot of code divergence, and that is very undesirable for parallel processing. You may try to group the divergence in order to remove branching and have good control flow, so the whole warp/block does similar processing, but that changes a lot across blocks.
A feasible solution
I don't remember where, but I have seen very good solutions to your problem around. Have you tried this one?
You can also use a vectorized load and try something like this, but I'm not sure how much it would improve your performance:
__global__ void hist(int4 *data, int *count, int N, int rem, unsigned int init) {
    __shared__ unsigned int sBins[N_OF_BINS]; // you may want to declare this one dynamically
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x < N_OF_BINS) sBins[threadIdx.x] = 0;
    __syncthreads(); // make sure the bins are zeroed before anyone increments them

    for (int i = 0; i < N; i += warpSize) {
        atomicAdd(&sBins[data[i + init].w], 1);
        atomicAdd(&sBins[data[i + init].x], 1);
        atomicAdd(&sBins[data[i + init].y], 1);
        atomicAdd(&sBins[data[i + init].z], 1);
    }

    // process remaining elements if the data is not a multiple of 4,
    // using a recast and an additional control
    for (int i = 0; i < rem; i++) {
        atomicAdd(&sBins[reinterpret_cast<int*>(data)[N * 4 + init + i]], 1);
    }

    // update your histogram data here, e.g. flush the block's bins to global memory
    __syncthreads();
    if (threadIdx.x < N_OF_BINS) atomicAdd(&count[threadIdx.x], (int)sBins[threadIdx.x]);
}

SuperCollider: automatic phase and frequency alignment of oscillators

Does anyone have an idea for automatic phase and frequency alignment?
To explain: assume, you have an Impulse
in = Impulse.ar(Rand(2, 5), Rand(0, 1));
now I'd like to manipulate the frequency of another Impulse such that it adapts its phase and frequency to match the input.
Any suggestions, even for a google search are highly appreciated.
[question asked on behalf of a colleague]
I don't agree that this, as framed, is a tough problem. Impulses are simple to track - that's why, for example, old rotary dial phones used pulse trains.
Here's some code that generates an impulse at a random frequency, then resynthesises another impulse at the same frequency. It also outputs a pitch estimate.
(
var left, right, master, slave, periodestimatebus, secretfrequency;
s = Server.default;
left = Bus.new(\audio, 0, 1);
right = Bus.new(\audio, 1, 1);
periodestimatebus = Bus.control(s, 1);

//choose our secret frequency here for later comparison:
secretfrequency = rrand(2.0, 5.0);

//generate impulse with secret frequency at some arbitrary phase
master = { Impulse.ar(secretfrequency, Rand(0, 1)); }.play(s, left);

slave = {
    var masterin, clockcount, clockoffset, syncedclock, periodestimate, tracking;
    masterin = In.ar(left);

    //This 1 Hz LFSaw is the "clock" against which we measure stuff
    clockcount = LFSaw.ar(1, 0, 0.5, 0.5);
    clockoffset = Latch.ar(clockcount, Delay1.ar(masterin));
    syncedclock = (clockcount - clockoffset).frac;
    //syncedclock is a version of the clock hard-reset (one sample after) every impulse trigger

    periodestimate = Latch.ar(syncedclock, masterin);
    //sanity-check our f impulse
    Out.kr(periodestimatebus, periodestimate);

    //there is no phase estimate per se - what would we measure it against? -
    //but we can resynthesise a new impulse up to a 1 sample delay from the matched clock.
    tracking = (Slope.ar(syncedclock) > 0);
}.play(master, right, 0, addAction: \addAfter);

//Let's see how we performed
{
    periodestimatebus.get({ |periodestimate|
        ["actual/estimated frequency", secretfrequency, periodestimate.reciprocal].postln;
    });
}.defer(1);
)
Notes to this code:
The periodestimate is generated by tricksy use of Delay1 to make sure that it samples the value of the clock just before it is reset. As such it is off by one sample.
The current implementation will produce a good period estimate with varying frequencies, down to 1Hz at least. Any lower and you'd need to change the clockcount clock to have a different frequency and tweak the arithmetic.
Many improvements are possible. For example, if you wish to track varying frequencies you might want to tweak it a little bit so that the resynthesized signal does not click too often as it underestimates the signal.
This is a tough problem, as you are dealing with a noisy source at a low frequency. If this were a sine wave I'd recommend an FFT, but FFTs don't do very well with noisy sources and low frequencies. It's still worth a shot, and FFTs can match phase too. I believe you can use the Pitch UGen to help find the frequency.
The Chirp-Z algorithm is something you could use instead of the FFT - http://www.embedded.com/design/configurable-systems/4006427/A-DSP-algorithm-for-frequency-analysis
http://en.wikipedia.org/wiki/Bluestein%27s_FFT_algorithm
Another thing you could try is to use a neural net to guess its way to the right information. You could use active training to help it achieve this goal. There is a very general discussion of this on SO:
Pitch detection using neural networks
One method some folks are coming around to is simulating the neurons of the cochlea to detect pitch.
You may want to read up on Phase-locked loops: http://en.wikipedia.org/wiki/Phase-locked_loop
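For a flavour of what a phase-locked loop does here, this is a very rough C-style sketch of the idea applied to an impulse train; the struct, gains and starting period are illustrative and untuned, not a drop-in implementation:
// Very rough software PLL sketch for an impulse train. Call once per audio
// sample; 'impulse_seen' is 1 on samples where the input impulse fires.
typedef struct {
    double period;   // current period estimate, in samples
    double phase;    // local oscillator ramp, 0..1, wraps once per period
} PulsePLL;

// returns 1 when the locked oscillator should emit its own impulse
int pll_tick(PulsePLL *p, int impulse_seen, double kp, double kf)
{
    int fire = 0;
    p->phase += 1.0 / p->period;          // advance the local oscillator
    if (p->phase >= 1.0) { p->phase -= 1.0; fire = 1; }

    if (impulse_seen) {
        // signed phase error: how far the ramp is from a wrap point
        double err = (p->phase < 0.5) ? p->phase : p->phase - 1.0;
        p->phase  -= kp * err;            // pull the phase toward the input
        p->period *= 1.0 + kf * err;      // err > 0 means we ran fast: slow down
    }
    return fire;
}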

What is a better way in iOS to detect sound level? [duplicate]

I am trying to build an iOS application that counts claps. I have been watching the WWDC videos on Core Audio, and the topic seems so vast that I'm not quite sure where to look.
I have found similar problems here on Stack Overflow. Here is one in C# for detecting a door slam:
Given an audio stream, find when a door slams (sound pressure level calculation?)
It seems that I need to do this:
Divide the samples up into sections
Calculate the energy of each section
Take the ratio of the energies between the previous window and the current window
If the ratio exceeds some threshold, determine that there was a sudden loud noise.
I am not sure how to accomplish this in Objective-C.
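For what it's worth, the energy-ratio part of that recipe is only a few lines of C-style code; the buffer layout, window size and threshold below are assumptions for illustration, not values from the linked answer:
// Sketch of steps 1-4: compare the energy of the current window of samples
// against the previous one and flag a sudden jump.
#include <stddef.h>

static double window_energy(const float *samples, size_t n)
{
    double e = 0.0;
    for (size_t i = 0; i < n; i++)
        e += (double)samples[i] * samples[i];   // sum of squares
    return e;
}

// returns 1 if the jump between windows looks like a sudden loud noise
int is_transient(const float *prev, const float *cur, size_t n, double threshold)
{
    double e_prev = window_energy(prev, n) + 1e-12;  // avoid divide-by-zero
    double e_cur  = window_energy(cur, n);
    return (e_cur / e_prev) > threshold;             // e.g. threshold ~ 8-10
}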
I have been able to figure out how to sample the audio with SCListener
Here is my attempt:
- (void)levelTimerCallback:(NSTimer *)timer {
    [recorder updateMeters];

    const double ALPHA = 0.05;
    double peakPowerForChannel = pow(10, (0.05 * [recorder peakPowerForChannel:0]));
    lowPassResults = ALPHA * peakPowerForChannel + (1.0 - ALPHA) * lowPassResults;

    if ([recorder peakPowerForChannel:0] == 0)
        totalClapsLabel.text = [NSString stringWithFormat:@"%d", total++];

    SCListener *listener = [SCListener sharedListener];
    if (![listener isListening])
        return;

    AudioQueueLevelMeterState *levels = [listener levels];
    Float32 peak = levels[0].mPeakPower;
    Float32 average = levels[0].mAveragePower;

    lowPassResultsLabel.text = [NSString stringWithFormat:@"%f", lowPassResults];
    peakInputLabel.text = [NSString stringWithFormat:@"%f", peak];
    averageInputLabel.text = [NSString stringWithFormat:@"%f", average];
}
Though I see the suggested algorithm, I am unclear as to how to implement it in Objective-C.
You didn't mention what sort of detection fidelity you are looking for. Just checking for some kind of sound "pressure" change may be entirely adequate for your needs, honestly.
Keep in mind, however, that bumps to the phone might end up being a very low-frequency and fairly high-powered impulse, such that it will trigger your detector even though it was not an actual clap. Ditto for very high-frequency sound sources that are also not likely to be a clap.
Is this ok for your needs?
If not, and you are hoping for something higher fidelity, I think you'd be better off doing a spectral analysis (FFT) of the input signal and then looking in a much narrower frequency band for a sharp signal spike, similar to the part you already have.
I haven't looked closely at this source, but here's some possible open-source FFT code you could hopefully use as-is for your iPhone app:
Edit:
https://github.com/alexbw/iPhoneFFT
The nice part about graphing the spectral result is that it should make it quite easy to tune which frequency range you actually care about. In my own tests with some laptop software I have, my claps have a very strong spike around 1kHz - 2kHz.
Possibly overkill for your needs, but if you need something higher fidelity, then I suspect you will not be satisfied with simply tracking a signal spike without knowing what frequency range led to that spike in the first place.
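If a full FFT feels like too much machinery just to watch one band (say the 1 kHz - 2 kHz region mentioned above), a single-bin Goertzel filter is a common lighter-weight alternative; this is a rough sketch, and the sample format, block size and target frequency are all assumptions:
// Sketch: Goertzel filter measuring the energy of one block of samples
// around a single target frequency (no full FFT needed).
#include <math.h>

double goertzel_power(const float *samples, int n, double target_hz, double sample_rate)
{
    double w = 2.0 * M_PI * target_hz / sample_rate;
    double coeff = 2.0 * cos(w);
    double s_prev = 0.0, s_prev2 = 0.0;

    for (int i = 0; i < n; i++) {
        double s = samples[i] + coeff * s_prev - s_prev2;
        s_prev2 = s_prev;
        s_prev  = s;
    }
    // squared magnitude at target_hz for this block; compare blocks over time
    // and flag a sharp jump, just like the broadband version
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2;
}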
Cheers
I used an FFT for my app https://itunes.apple.com/us/app/clapmera/id519363613?mt=8 . A clap in the frequency domain looks like a (not perfect) constant.
Regards

analog milliseconds for clock in iphone

I am actually trying to make an analog stopwatch app for iOS. Does anybody know the right approach for an analog clock with a milliseconds hand? My problem is that Core Graphics in the iOS SDK does not support a high enough refresh rate to animate the movement of the milliseconds hand smoothly. Can anybody help with OpenGL ES? I have very little experience with OpenGL, so I just need some tips for a head start.
Assuming you know you won't get the same result as your TAG Heuer watch (because of the refresh rates), you should interpolate the time to your needs.
To make things easier, I'll try to demonstrate a pointer that makes one lap each second.
Step 1: Get the elapsed time (assuming each unit is 1/100 second). Example value: 234 (which is 2.34 seconds in our scale).
Step 2: Reduce it to the elapsed time within your timeframe (if you're measuring 1/100 second, you have already used 200 for 2 full laps; you only need the remainder). In our case: 34. How to obtain it in C: 234 % 100 = 34.
Step 3: Rotate your coordinates accordingly, in pure OpenGL: glRotatef(((float)34/100)*360, 0, 1, 0); (this rotates around the Y axis; OpenGL uses degrees, so a full circle = 360).
Step 4: Draw your pointer
Step 5: Start over (since you're retrieving the time again in step 1, you'll redraw your pointer on the new location).
Remember that this is just the "drawing" phase; Step 5 is simply a consequence of your run loop and is shown only for clarification.
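Putting steps 1-4 together, a minimal sketch might look like the following; elapsed_hundredths() and draw_pointer() are hypothetical placeholders, and the rotation axis depends on how your clock face is oriented:
/* Sketch of steps 1-4 for a hand that makes one lap per second. */
#include <OpenGLES/ES1/gl.h>

extern long elapsed_hundredths(void);   /* hypothetical: elapsed time in 1/100 s units */
extern void draw_pointer(void);         /* hypothetical: draws the hand geometry */

void draw_ms_hand(void)
{
    long within_lap = elapsed_hundredths() % 100;   /* step 2: remainder of one lap */
    float angle = (within_lap / 100.0f) * 360.0f;   /* step 3: lap fraction in degrees */

    glPushMatrix();
    glRotatef(angle, 0.0f, 0.0f, 1.0f);  /* rotate about the axis the clock face spins around */
    draw_pointer();                      /* step 4: draw the hand */
    glPopMatrix();
}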
Hope it helps get you started. If you need more specifics, just comment on the answer and I'll try to help you out!

clock() accuracy

I have seen many posts about using the clock() function to determine the amount of elapsed time in a program with code looking something like:
start_time = clock();
//code to be timed
.
.
.
end_time = clock();
elapsed_time = (end_time - start_time) / CLOCKS_PER_SEC;
The value of CLOCKS_PER_SEC is almost surely not the actual number of clock ticks per second, so I am a bit wary of the result. Without worrying about threading and I/O, is the output of the clock() function being scaled in some way so that this division produces the correct wall clock time?
The answer to your question is yes.
clock() in this case refers to a wallclock rather than a CPU clock so it could be misleading at first glance. For all the machines and compilers I've seen, it returns the time in milliseconds since I've never seen a case where CLOCKS_PER_SEC isn't 1000. So the precision of clock() is limited to milliseconds and the accuracy is usually slightly less.
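If you want to see what that granularity actually is on your machine, a quick illustrative check is to spin until clock() changes and print the step size:
/* Quick check of clock()'s real granularity: spin until the value changes
   and print the smallest observable step. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start = clock(), next;
    while ((next = clock()) == start)
        ;  /* busy-wait for one tick */
    printf("CLOCKS_PER_SEC = %ld\n", (long)CLOCKS_PER_SEC);
    printf("smallest step  = %.6f s\n", (double)(next - start) / CLOCKS_PER_SEC);
    return 0;
}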
If you're interested in the actual cycles, this can be hard to obtain.
The rdtsc instruction will let you access the number of "pseudo"-cycles from when the CPU was booted. On older systems (like Intel Core 2), this counter usually runs at the actual CPU frequency. But on newer systems, it doesn't.
To get a more accurate timer than clock(), you will need to use the hardware performance counters, which are specific to the OS. These are internally implemented using the rdtsc instruction from the previous paragraph.
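For completeness, on x86 with GCC or Clang that counter is exposed directly as an intrinsic; a minimal sketch (the header shown is the GCC/Clang one, MSVC uses intrin.h instead):
/* Sketch: read the time-stamp counter described above (x86 only). */
#include <stdint.h>
#include <x86intrin.h>

uint64_t cycles_now(void)
{
    return __rdtsc();   /* "pseudo"-cycles since boot */
}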