How Fast is Convolution Using FFT - time-complexity

I read that in order to compute the convolution of two signals x,y (1D for example), the naïve method takes O(NM).
However, the FFT can be used to compute FFT^-1(FFT(x)FFT(y)), which takes O(N log(N)), in the case where N > M.
I wonder why this complexity is considered better than the former one, as M isn't necessarily bigger than log(N). Moreover, M is very often the length of a filter, which doesn't scale with the signal to be filtered, which actually gives us a complexity closer to O(N) than to O(N^2).

Fast convolution in the frequency domain is typically more efficient than direct convolution when the size of the filter exceeds a particular threshold. So for relatively small filters direct convolution is more efficient, whereas for longer filters there comes a point at which FFT-based convolution is more efficient. The actual value of M for this "tipping point" depends on a lot of factors, but it's typically somewhere between 10 and 100.
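To make the tipping point concrete, here is a back-of-the-envelope operation count (a sketch, not a benchmark — the `direct_ops`/`fft_ops` names and the 3*N*log2(N) constant are my own rough assumptions, and real constant factors vary by implementation):

```python
import math

# Direct convolution of an N-sample signal with an M-tap filter costs
# roughly N*M multiply-adds. FFT convolution needs ~3 FFTs of length ~N,
# i.e. on the order of 3*N*log2(N) operations (constant factors vary).
def direct_ops(n, m):
    return n * m

def fft_ops(n):
    return int(3 * n * math.log2(n))

n = 1_000_000
print(direct_ops(n, 10), fft_ops(n))    # short filter: direct wins
print(direct_ops(n, 1000), fft_ops(n))  # long filter: FFT wins
```

For a million-sample signal, a 10-tap filter costs ~10^7 multiply-adds directly but ~6x10^7 via FFTs, which is exactly the asker's point: for short filters that don't scale with N, direct convolution behaves like O(N) and wins.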

Related

Asynchrony loss function over an array of 1D signals

So I have an array of N 1D signals (e.g. time series) with the same number of samples per signal (all at equal resolution), and I want to define a differentiable loss function that penalizes asynchrony among them, i.e. one that is zero if all N signals are equal to each other. I've been searching the literature for something like this but haven't had luck yet.
A few remarks:
1 - Since N (the number of signals) could be quite large, I cannot afford to calculate a mean squared loss between every single pair, which could grow combinatorially large. Also, I'm not quite sure whether that would be optimal in any mathematical sense for the goal I want to achieve.
There are two naive loss functions that I could think of:
a) Total variation loss for each time sample across all signals (to force ideally zero variation). The problem here is that the weight needs to be very large to yield zero variation, masking any other loss term that is added; also, there is no inherent order among the N signals, which makes TV loss unsuitable to begin with.
b) Minimizing the sum of the variance at each time point across all signals. However, the choice of the reference for the variance (i.e. the mean) could be crucial, I believe, as just using the sample mean might not really yield the desired result; I'm not quite sure.
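For what it's worth, option (b) can be sketched in a few lines of numpy (the function name is mine; the same quadratic expression ports directly to an autodiff framework such as PyTorch, where it stays differentiable):

```python
import numpy as np

def asynchrony_loss(signals):
    """Sum over time of the per-time-sample variance across N signals.

    `signals` has shape (N, T). The loss is ~zero exactly when all N
    signals agree, and it is a quadratic form in the signals, hence
    differentiable everywhere.
    """
    return np.var(signals, axis=0).sum()

same = np.tile(np.sin(np.linspace(0, 1, 100)), (5, 1))
noisy = same + np.random.default_rng(0).normal(0, 0.1, (5, 100))
print(asynchrony_loss(same))   # ~0.0 for identical signals
print(asynchrony_loss(noisy))  # > 0 once the signals disagree
```

Note that this uses the sample mean as the variance reference, which is exactly the concern raised above; it is permutation-invariant in the N signals, though, which sidesteps the ordering problem of the TV option.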

Time complexity with respect to input

This is a constant doubt I'm having. For example, I have a 2-d array of size n^2 (n being the number of rows and of columns). Suppose I want to print all the elements of the 2-d array. When I calculate the time complexity of the algorithm with respect to n, it's O(n^2). But if I calculate it with respect to the input size (n^2), it's linear. Are both of these calculations correct? If so, why do people only use O(n^2) everywhere regarding 2-d arrays?
That is not how time complexity works. You cannot do "simple math" like that.
A two-dimensional square array of extent x has n = x*x elements. Printing these n elements takes n operations (or n/m if you print m items at a time), which is O(n). The necessary work increases linearly with the number of elements (which is, incidentally, quadratic with respect to the array extent -- but if you arranged the same number of items in a 4-dimensional array, would it be any different? Obviously not. That doesn't magically make it O(n^4)).
What you use time complexity for is not stuff like that anyway. What you want time complexity to tell you is an approximate idea of how some particular algorithm may change its behavior if you grow the number of inputs beyond some limit.
So, what you want to know is, if you do XYZ on one million items or on two million items, will it take approximately twice as long, or will it take approximately sixteen times as long, for example.
Time complexity analysis ignores "small details" such as how much time an actual operation takes. This tends to make the whole exercise increasingly academic and practically useless on modern architectures, because constant factors (such as memory latency, bus latency, cache misses, faults, access times, etc.) play an ever-increasing role: they have stayed mostly the same over decades, while the actual cost per step (instruction throughput, ALU power, whatever) goes down steadily with every new computer generation.
In practice, it happens quite often that the dumb, linear, brute force approach is faster than a "better" approach with better time complexity simply because the constant factor dominates everything.
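The point that work scales with the element count, not the number of dimensions, can be demonstrated directly (a small sketch; `count_visits` is my own illustrative helper):

```python
import numpy as np

# Visiting every element costs one step per element, however the
# elements happen to be arranged.
def count_visits(arr):
    visits = 0
    for _ in np.nditer(arr):
        visits += 1
    return visits

flat = np.arange(16)
square = flat.reshape(4, 4)       # extent x = 4, n = 16 elements
hyper = flat.reshape(2, 2, 2, 2)  # same 16 elements in 4 dimensions

print(count_visits(flat), count_visits(square), count_visits(hyper))  # 16 16 16
```

All three traversals take exactly n = 16 steps: linear in the input size, quadratic only in the 2-d extent.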

Proof that k-means always converges?

I understand the steps of the k-means algorithm.
However, I'm not sure whether the algorithm will always converge. Or can the observations keep switching from one centroid to another?
The algorithm always converges (by definition), but not necessarily to the global optimum.
Observations may keep switching from centroid to centroid; handling this is governed by a parameter of the algorithm (precision, or delta). This is sometimes referred to as "cycling": after a while, the algorithm cycles through the same centroid configurations. There are two standard remedies (which can both be used at the same time): a precision parameter and a maximum-number-of-iterations parameter.
Precision parameter: if the centroids' amount of change is less than a threshold delta, stop the algorithm.
Max number of iterations: if the algorithm reaches that number of iterations, stop the algorithm.
Note that the above schemes do not spoil the convergence characteristics of the algorithm: it will still converge, but not necessarily to the global optimum (this holds irrespective of the scheme used, as in many optimisation algorithms).
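Both stopping rules fit into a few lines. This is a minimal illustrative sketch (my own code, not a production implementation — use e.g. sklearn.cluster.KMeans in practice, which exposes the same two knobs as `tol` and `max_iter`):

```python
import numpy as np

def kmeans(X, k, delta=1e-4, max_iter=100, seed=0):
    """Minimal k-means with the two stopping rules described above:
    a precision threshold on centroid movement, and an iteration cap."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.linalg.norm(new - centroids) < delta:  # precision rule
            return new, labels
        centroids = new
    return centroids, labels  # iteration cap reached

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(X, 2)
```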
You may be interested in the related question on stats.SE Cycling in k-means algorithm and a referenced proof of convergence

How to speed up the rjags model training in Bayesian ranking?

All,
I am doing Bayesian modeling using rjags. However, when the number of observations is larger than 1000, the graph size becomes too big.
More specifically, I am working on a Bayesian ranking problem. Traditionally, one observation is one X[i, 1:N]-Y[i] pair, where X[i, 1:N] means the i-th item is represented by an N-size predictor vector and Y[i] is a response. The objective is to minimize the point-wise error of the predicted values, for example the least squares error.
A ranking problem is different. Since we care more about the order, we use a pairwise 1-0 indicator to represent the order between Y[i] and Y[j]: for example, when Y[i] > Y[j], I(i,j) = 1; otherwise I(i,j) = 0. We treat this 1-0 indicator as an observation. Therefore, assuming we have K items Y[1:K], the number of indicators is 0.5*K*(K-1). Hence when K increases from 500 to 5000, the number of observations grows enormously, on the order of 500^2 to 5000^2. The graph size of the rjags model is large too (for example, graph size > 500,000), and the log-posterior will be very small.
It also takes a long time to complete the training; I think more than 40 hours, which makes further experiments impractical for me. So, do you have any ideas on how to speed up rjags? I have heard that RStan is faster than rjags. Has anyone had similar experience?
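To illustrate how fast the quadratic blow-up described in the question kicks in, here is a small sketch of the pairwise-indicator construction (names like `pairwise_indicators` are my own; the question's own models are in JAGS/R, this is just Python to show the counting):

```python
import numpy as np

# Every unordered pair (i, j) with i < j yields one 0/1 indicator
# I(i, j) = [Y[i] > Y[j]], so the observation count is 0.5*K*(K-1).
def pairwise_indicators(Y):
    i, j = np.triu_indices(len(Y), k=1)   # all index pairs with i < j
    return (Y[i] > Y[j]).astype(int)

Y = np.array([3.0, 1.0, 2.0])
print(pairwise_indicators(Y))                      # pairs (0,1), (0,2), (1,2)
print(len(pairwise_indicators(np.zeros(500))))     # 124750 observations
print(len(pairwise_indicators(np.zeros(5000))))    # 12497500 observations
```

Going from K = 500 to K = 5000 multiplies the number of pairwise observations by roughly 100, which is why the model graph (and sampling time) explodes.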

What are the downsides of convolution by FFT compared to realspace convolution?

So I am aware that a convolution by FFT has a lower computational complexity than a convolution in real space. But what are the downsides of an FFT convolution?
Does the kernel size always have to match the image size, or are there functions that take care of this, for example in Python's numpy and scipy packages? And what about anti-aliasing effects?
FFT convolutions are based on the convolution theorem, which states that given two functions f and g, if Fd() and Fi() denote the direct and inverse Fourier transform, and * and . convolution and multiplication, then:
f*g = Fi(Fd(f).Fd(g))
To apply this to a signal f and a kernel g, there are some things you need to take care of:
f and g have to be of the same size for the multiplication step to be possible, so you need to zero-pad the kernel (or input, if the kernel is longer than it).
When doing a DFT, which is what FFT does, the resulting frequency domain representation of the function is periodic. This means that, by default, your kernel wraps around the edge when doing the convolution. If you want this, then all is great. But if not, you have to add an extra zero-padding the size of the kernel to avoid it.
Most (all?) FFT packages only work well (performance-wise) with sizes that do not have any large prime factors. Rounding the signal and kernel size up to the next power of two is a common practice that may result in a (very) significant speed-up.
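The wrap-around effect and the zero-padding fix can be seen directly with numpy's FFT routines (a small self-contained demonstration; the signal and kernel values are arbitrary):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])   # signal
g = np.array([1.0, 1.0])             # kernel

# Padding the kernel only up to len(f): the DFT treats the signal as
# periodic, so the kernel wraps around the edge (circular convolution).
circular = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g, n=len(f))))

# Zero-padding both to len(f) + len(g) - 1 removes the wrap-around and
# reproduces ordinary (linear) convolution.
n = len(f) + len(g) - 1
linear = np.real(np.fft.ifft(np.fft.fft(f, n=n) * np.fft.fft(g, n=n)))

print(circular)   # first sample is polluted by wrap-around
print(linear)     # matches np.convolve(f, g)
print(np.allclose(linear, np.convolve(f, g)))  # True
```

In the circular result, the first output sample picks up a contribution from the far end of the signal; with the extra padding, the result agrees with `np.convolve` to floating-point precision.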
If your signal and kernel sizes are f_l and g_l, doing a straightforward convolution in time domain requires g_l * (f_l - g_l + 1) multiplications and (g_l - 1) * (f_l - g_l + 1) additions.
For the FFT approach, you have to do 3 FFTs of size at least f_l + g_l, as well as f_l + g_l multiplications.
For large sizes of both f and g, the FFT is clearly superior with its n*log(n) complexity. For small kernels, the direct approach may be faster.
scipy.signal has both convolve and fftconvolve methods for you to play around with. And fftconvolve handles all the padding described above transparently for you.
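A quick check that the two scipy routines agree (random test data; `method="direct"` just forces scipy.signal.convolve to skip its automatic FFT/direct selection):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)   # signal
h = rng.standard_normal(31)     # kernel

direct = signal.convolve(x, h, method="direct")
fast = signal.fftconvolve(x, h)   # zero-padding handled internally

print(len(direct), len(fast))                 # both 1024 + 31 - 1 = 1054
print(np.allclose(direct, fast))              # True, up to rounding
```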
While fast convolution has better "big O" complexity than direct-form convolution, there are a few drawbacks and caveats. I did some thinking about this topic for an article I wrote a while back.
Better "big O" complexity is not always better. Direct form convolution can be faster than using FFTs for filters smaller than a certain size. The exact size depends on the platform and implementations used. The crossover point is usually in the 10-40 coefficient range.
Latency. Fast convolution is inherently a blockwise algorithm. Queueing up hundreds or thousands of samples at a time before transforming them may be unacceptable for some real-time applications.
Implementation complexity. Direct form is simpler in terms of memory, code space, and the theoretical background required of the writer/maintainer.
On a fixed-point DSP platform (not a general-purpose CPU): the limited word size makes large fixed-point FFTs nearly useless. At the other end of the size spectrum, these chips have specialized MAC instructions that are well suited to performing direct-form FIR computation, increasing the range over which the O(N^2) direct form is faster than O(N log N). These factors tend to create a limited "sweet spot" where fixed-point FFTs are useful for fast convolution.