Timing different sections in CUDA kernel

Timing different sections in CUDA kernel - optimization

I have a CUDA kernel that calls out to a series of device functions.
What is the best way to get the execution time for each of the device functions?
What is the best way to get the execution time for a section of code in one of the device functions?

In my own code, I use the clock() function to get precise timings. For convenience, I have the macros
enum {
tid_this = 0,
tid_that,
tid_count
};
__device__ float cuda_timers[ tid_count ];
#ifdef USETIMERS
#define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
#define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
#define TIMER_TIC
#define TIMER_TOC(tid)
#endif
These can then be used to instrument the device code as follows:
__global__ mykernel ( ... ) {
/* Start the timer. */
TIMER_TIC
/* Do stuff. */
...
/* Stop the timer and store the results to the "timer_this" counter. */
TIMER_TOC( tid_this );
}
You can then read the cuda_timers in the host code.
A few notes:
The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
Having said that, the timer assumes that the zeroth thread is active, so make sure you do not call these macros in a possibly divergent part of the code.
The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.
The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) ) is necessary in case the clock counter wraps around.
P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.

Related

Measuring Program Execution Time with Cycle Counters

I have confusion in this particular line-->
result = (double) hi * (1 << 30) * 4 + lo;
of the following code:
void access_counter(unsigned *hi, unsigned *lo)
// Set *hi and *lo to the high and low order bits of the cycle
// counter.
{
asm("rdtscp; movl %%edx,%0; movl %%eax,%1" // Read cycle counter
: "=r" (*hi), "=r" (*lo) // and move results to
: /* No input */ // the two outputs
: "%edx", "%eax");
}
double get_counter()
// Return the number of cycles since the last call to start_counter.
{
unsigned ncyc_hi, ncyc_lo;
unsigned hi, lo, borrow;
double result;
/* Get cycle counter */
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
if (result < 0) {
fprintf(stderr, "Error: counter returns neg value: %.0f\n", result);
}
return result;
}
The thing I cannot understand is that why is hi being multiplied with 2^30 and then 4? and then low added to it? Someone please explain what is happening in this line of code. I do know that what hi and low contain.

The short answer:
That line turns a 64bit integer that is stored as 2 32bit values into a floating point number.
Why doesn't the code just use a 64bit integer? Well, gcc has supported 64bit numbers for a long time, but presumably this code predates that. In that case, the only way to support numbers that big is to put them into a floating point number.
The long answer:
First, you need to understand how rdtscp works. When this assembler instruction is invoked, it does 2 things:
1) Sets ecx to IA32_TSC_AUX MSR. In my experience, this generally just means ecx gets set to zero.
2) Sets edx:eax to the current value of the processor’s time-stamp counter. This means that the lower 64bits of the counter go into eax, and the upper 32bits are in edx.
With that in mind, let's look at the code. When called from get_counter, access_counter is going to put edx in 'ncyc_hi' and eax in 'ncyc_lo.' Then get_counter is going to do:
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
What does this do?
Since the time is stored in 2 different 32bit numbers, if we want to find out how much time has elapsed, we need to do a bit of work to find the difference between the old time and the new. When it is done, the result is stored (again, using 2 32bit numbers) in hi / lo.
Which finally brings us to your question.
result = (double) hi * (1 << 30) * 4 + lo;
If we could use 64bit integers, converting 2 32bit values to a single 64bit value would look like this:
unsigned long long result = hi; // put hi into the 64bit number.
result <<= 32; // shift the 32 bits to the upper part of the number
results |= low; // add in the lower 32bits.
If you aren't used to bit shifting, maybe looking at it like this will help. If lo = 1 and high = 2, then expressed as hex numbers:
result = hi; 0x0000000000000002
result <<= 32; 0x0000000200000000
result |= low; 0x0000000200000001
But if we assume the compiler doesn't support 64bit integers, that won't work. While floating point numbers can hold values that big, they don't support shifting. So we need to figure out a way to shift 'hi' left by 32bits, without using left shift.
Ok then, shifting left by 1 is really the same as multiplying by 2. Shifting left by 2 is the same as multiplying by 4. Shifting left by [omitted...] Shifting left by 32 is the same as multiplying by 4,294,967,296.
By an amazing coincidence, 4,294,967,296 == (1 << 30) * 4.
So why write it in that complicated fashion? Well, 4,294,967,296 is a pretty big number. In fact, it's too big to fit in an 32bit integer. Which means if we put it in our source code, a compiler that doesn't support 64bit integers may have trouble figuring out how to process it. Written like this, the compiler can generate whatever floating point instructions it might need to work on that really big number.
Why the current code is wrong:
It looks like variations of this code have been wandering around the internet for a long time. Originally (I assume) access_counter was written using rdtsc instead of rdtscp. I'm not going to try to describe the difference between the two (google them), other than to point out that rdtsc does not set ecx, and rdtscp does. Whoever changed rdtsc to rdtscp apparently didn't know that, and failed to adjust the inline assembler stuff to reflect it. While your code might work fine despite this, it might do something weird instead. To fix it, you could do:
asm("rdtscp; movl %%edx,%0; movl %%eax,%1" // Read cycle counter
: "=r" (*hi), "=r" (*lo) // and move results to
: /* No input */ // the two outputs
: "%edx", "%eax", "%ecx");
While this will work, it isn't optimal. Registers are a valuable and scarce resource on i386. This tiny fragment uses 5 of them. With a slight modification:
asm("rdtscp" // Read cycle counter
: "=d" (*hi), "=a" (*lo)
: /* No input */
: "%ecx");
Now we have 2 fewer assembly statements, and we only use 3 registers.
But even that isn't the best we can do. In the (presumably long) time since this code was written, gcc has added both support for 64bit integers and a function to read the tsc, so you don't need to use asm at all:
unsigned int a;
unsigned long long result;
result = __builtin_ia32_rdtscp(&a);
'a' is the (useless?) value that was being returned in ecx. The function call requires it, but we can just ignore the returned value.
So, instead of doing something like this (which I assume your existing code does):
unsigned cyc_hi, cyc_lo;
access_counter(&cyc_hi, &cyc_lo);
// do something
double elapsed_time = get_counter(); // Find the difference between cyc_hi, cyc_lo and the current time
We can do:
unsigned int a;
unsigned long long before, after;
before = __builtin_ia32_rdtscp(&a);
// do something
after = __builtin_ia32_rdtscp(&a);
unsigned long long elapsed_time = after - before;
This is shorter, doesn't use hard-to-understand assembler, is easier to read, maintain and produces the best possible code.
But it does require a relatively recent version of gcc.

Canonical way(s) of determining system time in a microcontroller

Every so often I start a bare metal microcontroller project and end up implementing a system time measurement using a random timer unit.
I am working with ARM Cortex-M devices for a (albeit short) while now and typically used the SysTick ("System Tick") interrupt to create a 1ms resolution timer. It recently stumbled over a post that suggested chaining two Programmable Interrupt Timers (on a Kinetis KL25Z device) in order to create an interrupt-less 32bit millisecond timer, however sacrificing two PIT interrupts which may come in handy later on.
So I was wondering if there are some (sort of) canonical ways to determine the system time on a microcontroller - preferrably for Kinetis KL2xZ devices as I currently work with these, but not necessarily so.

The canonical method as you put it is exactly as you have done - using systick. That is the single timer device defined by the Cortex-M architecture; any other timer hardware is external to the core and vendor specific.
Some parts (STM32F2 for example) include 32 bit timer/counter hardware, so you would not need to chain two.
The best approach is to abstract timer services by defining a generic timer API that you implement for all parts you need so that the application layer is identical for all parts. For example in this case you might simply implement the standard library clock() function and define CLOCKS_PER_SEC.
If you are using two free-running cascaded timers, you must ensure high/low word consistency when combining the two counter values:
#include <time.h>
clock_t clock( void )
{
uint16_t low_word = 0 ;
uint16_t hi_word = 0 ;
do
{
hi_word = readTimerH() ;
lo_word = readTimerL() ;
} while( hi_word != readTimerH() ) ;
return (clock_t)(hi_word << 16 | lo_word) ;
}

I just looked into KL25 Sub-Family Reference Manual.
In Chapter 34 Real Time Clock (RTC) section 34.3.2 Time counter (may differ with document version).
I found that there are Two registers for Timer counter in RTC
32-bit seconds counter
16-bit prescaler register that increments once every 32.768 kHz clock cycle
Reference Manual says
Always write to the prescaler register before writing to the seconds register,
because the seconds register increments on the falling edge of bit 14 of the prescaler
register.
Which means to calculate system time, read rtc_sec_counter and add 14 bits of prescalar_reg
you can even create a macro to give you system time in uSec and mSec from combination of rtc_sec_counter and prescalar_reg or Sec(obviously from rtc_sec_counter)
For 16 bit prescalar REG System clock is 32.768 Khz, with this we can create macros to get time in uSec and mSec
#define PRESCALAR_TICK 32768
#define KHZ 1000
#define MHZ 1000000
/// Here first we extract 14bit value of prescalar_reg and than multiply it with MHZ to get better precision
/// but this value will not go more than 14 Bit
#define GET_SYS_US ((((prescalar_reg & 0x03FFF)*MHZ)/PRESCALAR_TICK))
#define GET_SYS_MS (GET_SYS_US)/KHZ)
if you need time in milliseconds up to 32 bit use below macro
#define GET_SYS_US_32bit ((rtc_sec_counter * 0x3FFF) + GET_SYS_US)
#define GET_SYS_MS_32bit ((rtc_sec_counter * 0x3FFF) + GET_SYS_MS)
But to use these information you must initialise RTC of you micro (Obviously)

Dynamic pool of workers with MPI for large array C++

My question is how to create a dynamic pool of workers with MPI.
There is a large (NNN = 10^6-7 elements) 1D array/vector. I should perform some calculations on each cell. This problem is extremely embarrassingly parallel.
The idea is (it works fine): each MPI process (when run in parallel) reads common .dat file, puts values in local (to each rank) large vector of size NNN and performs computation on appropriate part of large array, the lenght of this "part" is NNN/nprocs, where "nprocs" is the number of processes of MPI.
The problem: some "parts" of this array (NNN/nprocs) are finished very quick and thus some of CPUs are unused (they wait for the others to finish the run).
The question1: How to make dynamic schedule. CPU's, that finished their tasks, can pick new task and continue working.
The question2: Is there MPI built-in procedure, that schedules automatically "workers" and tasks?
Here is my code (static schedule)
{
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Offset offset;
MPI_File file;
MPI_Status status;
int Pstart = (NNN / nprocs) * rank + ((NNN % nprocs) < rank ? (NNN % nprocs) : rank);
int Pend = Pstart + (NNN / nprocs) + ((NNN % nprocs) > rank);
offset = sizeof(double)*Pstart;
MPI_File_open(MPI_COMM_WORLD, "shared.dat", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
double * local_array;
local_array = new double [NNN/nprocs];
for (int i=0;i<NNN/nprocs;i++)
{
/* next line calculates integral on each cell element of part NNN/nprocs of large array NNN */
adapt_integrate(1, Integrand, par, 2, a, b, MaxEval, tol, tol, &val, &err);
// putting result of integration to local array NNN/nprocs
local_array[i] = val;
}
// here, all local arrays are written to one shared file "shared.dat"
MPI_File_seek(file, offset, MPI_SEEK_SET);
MPI_File_write(file, local_array, NNN/nprocs, MPI_DOUBLE, &status);
MPI_File_close(&file);
}

This question is about a similar problem, but just to recap: have a designated master process that issues chunks of work to the others. All the workers need to do is blocking receive a work item, perform their calculations, then blocking send the result to the master and repeat. The master can manage work items either by posting a nonblocking receive for each worker and polling if any of them completed, or by posting a blocking receive with MPI_ANY_SOURCE as source.

Monotonic clock on OSX

CLOCK_MONOTONIC does not seem available, so clock_gettime is out.
I've read in some places that mach_absolute_time() might be the right way to go, but after reading that it was a 'cpu dependent value', it instantly made me wonder if it is using rtdsc underneath. Thus, the value could drift over time even if it is monotonic. Also, issues with thread affinity could result in meaningfully different results from calling the function (making it not monotonic across all cores).
Of course, that is just speculation. Does anyone know how mach_absolute_time actually works? I'm actually looking for a replacement to clock_gettime(CLOCK_MONOTONIC... or something like it for OSX. No matter what the clock source is, I expect at least millisecond precision and millisecond accuracy.
I'd just like to understand what clocks are available, which clocks are monotonic, if certain clocks drift, have thread affinity issues, aren't supported on all Mac hardware, or take a 'super high' number of cpu cycles to execute.
Here are the links I was able to find about this topic (some are already dead links and not findable on archive.org):
https://developer.apple.com/library/mac/#qa/qa1398/_index.html
http://www.wand.net.nz/~smr26/wordpress/2009/01/19/monotonic-time-in-mac-os-x/
http://www.meandmark.com/timing.pdf
Thanks!
Brett

The Mach kernel provides access to system clocks, out of which at least one (SYSTEM_CLOCK) is advertised by the documentation as being monotonically incrementing.
#include <mach/clock.h>
#include <mach/mach.h>
clock_serv_t cclock;
mach_timespec_t mts;
host_get_clock_service(mach_host_self(), SYSTEM_CLOCK, &cclock);
clock_get_time(cclock, &mts);
mach_port_deallocate(mach_task_self(), cclock);
mach_timespec_t has nanosecond precision. I'm not sure about the accuracy, though.
Mac OS X supports three clocks:
SYSTEM_CLOCK returns the time since boot time;
CALENDAR_CLOCK returns the UTC time since 1970-01-01;
REALTIME_CLOCK is deprecated and is the same as SYSTEM_CLOCK in its current implementation.
The documentation for clock_get_time says the clocks are monotonically incrementing unless someone calls clock_set_time. Calls to clock_set_time are discouraged as it could break the monotonic property of the clocks, and in fact, the current implementation returns KERN_FAILURE without doing anything.

After looking up a few different answers for this I ended up defining a header which emulates clock_gettime on mach:
#include <sys/types.h>
#include <sys/_types/_timespec.h>
#include <mach/mach.h>
#include <mach/clock.h>
#ifndef mach_time_h
#define mach_time_h
/* The opengroup spec isn't clear on the mapping from REALTIME to CALENDAR
being appropriate or not.
http://pubs.opengroup.org/onlinepubs/009695299/basedefs/time.h.html */
// XXX only supports a single timer
#define TIMER_ABSTIME -1
#define CLOCK_REALTIME CALENDAR_CLOCK
#define CLOCK_MONOTONIC SYSTEM_CLOCK
typedef int clockid_t;
/* the mach kernel uses struct mach_timespec, so struct timespec
is loaded from <sys/_types/_timespec.h> for compatability */
// struct timespec { time_t tv_sec; long tv_nsec; };
int clock_gettime(clockid_t clk_id, struct timespec *tp);
#endif
and in mach_gettime.c
#include "mach_gettime.h"
#include <mach/mach_time.h>
#define MT_NANO (+1.0E-9)
#define MT_GIGA UINT64_C(1000000000)
// TODO create a list of timers,
static double mt_timebase = 0.0;
static uint64_t mt_timestart = 0;
// TODO be more careful in a multithreaded environement
int clock_gettime(clockid_t clk_id, struct timespec *tp)
{
kern_return_t retval = KERN_SUCCESS;
if( clk_id == TIMER_ABSTIME)
{
if (!mt_timestart) { // only one timer, initilized on the first call to the TIMER
mach_timebase_info_data_t tb = { 0 };
mach_timebase_info(&tb);
mt_timebase = tb.numer;
mt_timebase /= tb.denom;
mt_timestart = mach_absolute_time();
}
double diff = (mach_absolute_time() - mt_timestart) * mt_timebase;
tp->tv_sec = diff * MT_NANO;
tp->tv_nsec = diff - (tp->tv_sec * MT_GIGA);
}
else // other clk_ids are mapped to the coresponding mach clock_service
{
clock_serv_t cclock;
mach_timespec_t mts;
host_get_clock_service(mach_host_self(), clk_id, &cclock);
retval = clock_get_time(cclock, &mts);
mach_port_deallocate(mach_task_self(), cclock);
tp->tv_sec = mts.tv_sec;
tp->tv_nsec = mts.tv_nsec;
}
return retval;
}

Just use Mach Time.
It is public API, it works on macOS, iOS, and tvOS and it works from within the sandbox.
Mach Time returns an abstract time unit that I usually call "clock ticks". The length of a clock tick is system specific and depends on the CPU. On current Intel systems a clock tick is in fact exactly one nanosecond but you cannot rely on that (may be different for ARM and it certainly was different for PowerPC CPUs). The system can also tell you the conversion factor to convert clock ticks to nanoseconds and nanoseconds to clock ticks (this factor is static, it won't ever change at runtime). When your system boots, the clock starts at 0 and then monotonically increases with every clock tick thereafter, so you can also use Mach Time to get the uptime of your system (and, of course, uptime is monotonic!).
Here's some code:
#include <stdio.h>
#include <inttypes.h>
#include <mach/mach_time.h>
int main ( ) {
uint64_t clockTicksSinceSystemBoot = mach_absolute_time();
printf("Clock ticks since system boot: %"PRIu64"\n",
clockTicksSinceSystemBoot
);
static mach_timebase_info_data_t timebase;
mach_timebase_info(&timebase);
// Cast to double is required to make this a floating point devision,
// otherwise it would be an interger division and only the result would
// be converted to floating point!
double clockTicksToNanosecons = (double)timebase.numer / timebase.denom;
uint64_t systemUptimeNanoseconds = (uint64_t)(
clockTicksToNanosecons * clockTicksSinceSystemBoot
);
uint64_t systemUptimeSeconds = systemUptimeNanoseconds / (1000 * 1000 * 1000);
printf("System uptime: %"PRIu64" seconds\n", systemUptimeSeconds);
}
You can also put a thread to sleep until a certain Mach Time has been reached. Here's some code for that:
// Sleep for 750 ns
uint64_t machTimeNow = mach_absolute_time();
uint64_t clockTicksToSleep = (uint64_t)(750 / clockTicksToNanosecons);
uint64_t machTimeIn750ns = machTimeNow + clockTicksToSleep;
mach_wait_until(machTimeIn750ns);
As Mach Time has no relation to any wallclock time, you can play around with your system date and time setting as you like, that won't have any effect on Mach Time.
There's one special consideration, though, that may make Mach Time unsuitable for certain use cases: The CPU clock is not running while your system is asleep! So if you make a thread wait for 5 minutes and after 1 minute the system goes to sleep and stays asleep for 30 minutes, the thread is still waiting another 4 minutes after the system has woken up as the 30 minutes sleep time don't count! The CPU clock was resting as well during that time. Yet in other cases this is exactly what you want to happen.
Mach Time is also a very precise way to measure time spent. Here's some code showing that task:
// Measure time
uint64_t machTimeBegin = mach_absolute_time();
sleep(1);
uint64_t machTimeEnd = mach_absolute_time();
uint64_t machTimePassed = machTimeEnd - machTimeBegin;
uint64_t timePassedNS = (uint64_t)(
machTimePassed * clockTicksToNanosecons
);
printf("Thread slept for: %"PRIu64" ns\n", timePassedNS);
You will see that the thread doesn't sleep for exactly one second, that's because it takes some time to put a thread to sleep, to wake it back up again and even when awake, it won't get CPU time immediately if all cores are already busy running a thread at that moment.
Update (2018-09-26)
Since macOS 10.12 (Sierra) there also exists mach_continuous_time. The only difference between mach_continuous_time and mach_absolute_time is that continues time also advances when the system is asleep. So in case this was a problem so far and a reason for not using Mach Time, 10.12 and up offer a solution to this problem. The usage is exactly the same as described above.
Also starting with macOS 10.9 (Mavericks), there is a mach_approximate_time and in 10.12 there's also a mach_continuous_approximate_time. These two are identical to mach_absolute_time and mach_continuous_time with the only difference, that they are faster yet less accurate. The standard functions require a call into the kernel as the kernel takes care of Mach Time. Such a call is somewhat expensive, especially on systems that already have a Meltdown fix. The approximate versions won't have to always call into the kernel. They use a clock in user space that is only synchronized with the kernel clock from time to time to prevent that it is running too far out of sync, yet a small deviation is always possible and thus it is only the "approximate" Mach Time.

Acoustic Echo Cancellation (AEC) in embedded software

I am doing a VoIP project on embedded device. I have built a sample using a 32bits MCU with a low grade audio codec. Now I found that there is echo issue on my device, that is I can hear what I said from the speaker. I have do some research and found that most appliaction use a DSP codec with acoustic echo cancellation feature. However, is it possible that I do the acoustic echo cancellation in the software, using my 32bits MCU?
Can you adive the algorithm, or even source code:P, for doing acoustic echo cancellation? I know sophisticated method is not possible on a MCU, whereas a simple algorithm is also welcomed.
Thank you
[Follow up] : I have tried some AEC code but they can not work well in my MCU, probably it is the limit of the MCU power. I found that my device become non-real-time when implemented these codes (but a VoIP need a real-time respond). At last I implemented a analog hardware solution by adding an AEC chips, because I do not want to write the code again in another DSP chip.

I had a heck of a time with echo cancellation. I wrote a softphone, and the user can switch their audio input and output devices around to suit their fancy. I tried the Speex echo cancellation library, and several other open source libs I found online. None worked well for me. I tried different speaker/mike configuration and the echo was always there in some form or fashion.
I believe it would be very hard to create AEC code that would work for all possible speaker configurations / room sizes / background noises..etc. Finally I sat down and wrote my own echo cancellation module for my softphone with this algorithm.
It's somewhat crude, but it has worked well and is reliable.
variable1:
Keep a record of what the average amplitude is of when the person to whom you're talking is speaking. (Don't factor quiet-time)
variable2:
Keep a record of what the average amplitude is on the input (mike), but only when there is voice- again- don't factor quiet time.
As soon as there's audio to play- cut the mike. And assuming the person listening is not talking, turn the mike on 150-300ms after the last audible audio frame comes in to be played.
If the audio from the microphones (that you're dropping during playback) is greater than oh- say (variable2 * 1.5), start sending the audio input frames for a specified duration, resetting that duration every time the input amplitude reaches (variable2 * 1.5).
That way the person talking will know they are being interrupted, and stop to see what the person is saying. If the person talking doesn't have too noisy of a background, they will probably hear most if not all of the interruption.
Like I said, not the most graceful, but it doesn't use a lot of resources (CPU, memory) and it actually works pretty darn well. I am very pleased with how mine sounds.
To implement it, I just made a few functions.
On a received audio frame, I call a function I called:
void audioin( AEC *ec, short *frame ) {
unsigned int tas=0; /* Total sum of all audio in frame (absolute value) */
int i=0;
for (;i<160;i++)
tas+=ABS(frame[i]);
tas/=160; /* 320 byte frames muLaw */
if (tas>300) { /* I assume this is audiable */
lockecho(ec);
ec->lastaudibleframe=GetTickCount64();
unlockecho(ec);
}
return;
}
and before sending a frame, I do:
#define ECHO_THRESHOLD 300 /* Time to keep suppression alive after last audible frame */
#define ONE_MINUTE 3000 /* 3000 20ms samples */
#define AVG_PERIOD 250 /* 250 20ms samples */
#define ABS(x) (x>0?x:-x)
char removeecho( AEC *ec, short *aecinput ) {
int tas=0; /* Average absolute amplitude in this signal */
int i=0;
unsigned long long *tot=0;
unsigned int *ctr=0;
unsigned short *avg=0;
char suppressframe=0;
lockecho(ec);
if (ec->lastaudibleframe+ECHO_THRESHOLD > GetTickCount64() ) {
/* If we're still within the threshold for echo (speaker state is ON) */
tot=&ec->t_aiws;
ctr=&ec->c_aiws;
avg=&ec->aiws;
} else {
/* If we're outside the threshold for echo (speaker state is OFF) */
tot=&ec->t_aiwos;
ctr=&ec->c_aiwos;
avg=&ec->aiwos;
}
for (;i<160;i++) {
tas+=ABS(aecinput[i]);
}
tas/=160;
if (tas>200) {
(*tot)+=tas;
(*avg)=(unsigned short)((*tot)/( (*ctr)?(*ctr):1));
(*ctr)++;
if ((*ctr)>AVG_PERIOD) {
(*tot)=(*avg);
(*ctr)=0;
}
}
if ( (avg==&ec->aiws) ) {
tas-=ec->aiwos;
if (tas<0) {
tas=0;
}
if ( ((unsigned short) tas > (ec->aiws*1.5)) && ((unsigned short)tas>=ec->aiwos) && (ec->aiwos!=0) ) {
suppressframe=0;
} else {
suppressframe=1;
}
}
if (suppressframe) { /* Silence frame */
memset(aecinput, 0, 320);
}
unlockecho(ec);
return suppressframe;
}
Which will silence the frame if it needs to. I keep all my variables, like the timers, and amplitude averages in the AEC struct, which I return from a call to
AEC *initecho( void ) {
AEC *ec=0;
ec=(AEC *)malloc(sizeof(AEC));
memset(ec, 0, sizeof(AEC));
ec->aiws=200; /* Just a default guess as to what the average amplitude would be */
return ec;
}
typedef struct aec {
unsigned long long lastaudibleframe; /* time stamp of last audible frame */
unsigned short aiws; /* Average mike input when speaker is playing */
unsigned short aiwos; /*Average mike input when speaker ISNT playing */
unsigned long long t_aiws, t_aiwos; /* Internal running total (sum of PCM) */
unsigned int c_aiws, c_aiwos; /* Internal counters for number of frames for averaging */
unsigned long lockthreadid; /* Thread ID with lock */
int stlc; /* Same thread lock-count */
} AEC;
You can adapt as you need to and play with the idea, but like I said. It actually sounds pretty dang good. The only problem I have is if they have a lot of background noise. But for me, if they pick up their USB handset or are using a headset, they can turn echo cancellation off, and not worry about it...but though PC speakers with a mike...I'm pretty happy with it.
I hope it helps, or gives you something to build on...

If you are doing a commercial project that this should be easy. You can integrate a commercial audio cancellation software in your VoIP application.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Timing different sections in CUDA kernel - optimization

I have a CUDA kernel that calls out to a series of device functions. What is the best way to get the execution time for each of the device functions? What is the best way to get the execution time for a section of code in one of the device functions?

Related

Measuring Program Execution Time with Cycle Counters

Canonical way(s) of determining system time in a microcontroller

Dynamic pool of workers with MPI for large array C++

Monotonic clock on OSX

Acoustic Echo Cancellation (AEC) in embedded software

Categories

Resources