Is there any obvious reason that Tensorflow uses COO format rather than CSR for sparse matrices?

I'm trying to take advantage of the performance of TensorFlow's built-in sparse matrix multiplication API.
And keveman recommended that tf.embedding_lookup_sparse is the right way.
But the performance of embedding_lookup_sparse is somewhat disappointing in my experiments. Even though the matrix multiplications involved are fairly small, <1, 3196> and <3196, 1024>, sparse matmul with 0.1 sparsity fails to beat the dense matrix multiplication.
If my implementation is correct, I think one of the reasons is that TensorFlow uses the COO format, which stores every (index, nonzero value) pair. I'm not an expert in this domain, but isn't it widely known that the CSR format is more performant for this kind of computation? Is there any obvious reason that TensorFlow internally uses COO rather than CSR for its sparse matrix representation?

Just for the record, you say matrix multiplication, but one of your matrices is in fact a vector (1 x 3196). So this would make it a matrix-vector multiplication (different BLAS kernel). I will assume you mean matrix-vector multiplication for my answer.
Yes, CSR should theoretically be faster than COO for matrix-vector multiplication; this is because the storage size in CSR format is O(2 nnz + n) vs O(3 nnz) (the per-nonzero row indices are replaced by n+1 row pointers), and sparse matrix-vector multiplication is in many cases memory bound.
The exact performance difference compared to a dense matrix multiplication varies though based on the problem size, sparsity pattern, data type and implementation. It is difficult to say off the bat which should be faster, because the sparse storage format introduces indirection, which potentially leads to reduced locality and poor(er) utilisation of arithmetic units (e.g. no use of vectorisation).
Particularly when the matrix and vector size are so small that almost everything fits in cache, I would expect limited performance benefits. Sparse matrix structures are typically more useful for truly large matrices, ranging from 10sK x 10sK to 1B x 1B, which wouldn't even fit in main memory using a dense representation. For small problem sizes, in my experience, the storage advantage compared to dense formats is usually negated by the loss in locality and arithmetic efficiency. To some extent this is addressed by hybrid storage formats (such as Block CSR) which try to take the best of both worlds, and are very useful for some applications (doesn't look like tensorflow supports this).
In tensorflow, I would assume the COO format is used because it is more efficient for other operations, for example it supports O(1) updates, insertions and deletions from the data structure. It seems reasonable to trade ~50% performance in sparse matrix-vector multiply to improve performance on these operations.
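As a rough illustration of the storage argument, here is a minimal sketch using scipy.sparse rather than TensorFlow (the size and density roughly follow the question, but are otherwise arbitrary):

import numpy as np
import scipy.sparse as sp

# Random sparse matrix roughly matching the question's shape and density.
A = sp.random(3196, 1024, density=0.1, dtype=np.float32, random_state=0)
x = np.random.rand(1024).astype(np.float32)

coo = A.tocoo()
csr = A.tocsr()

# COO stores one (row, col, value) triple per nonzero: 3 * nnz entries.
coo_entries = coo.row.size + coo.col.size + coo.data.size
# CSR stores column indices and values per nonzero plus n+1 row pointers: 2 * nnz + n + 1.
csr_entries = csr.indices.size + csr.indptr.size + csr.data.size
print(coo_entries, csr_entries)

# Both formats give the same matrix-vector product.
assert np.allclose(coo @ x, csr @ x, atol=1e-4)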

Related

A huge number of discrete features

I'm developing a regression model, but I ran into a problem when preparing the data. 17 out of 20 features are categorical, and each of them has a lot of categories. Using one-hot encoding, my data table is transformed into a 10000x6000 table. How should I prepare this kind of data?
I used PCA to try to reduce the dimensionality, but even capturing 70% of the variance requires 2500 components. That's why I'm asking here.
Unfortunately, I can't attach the dataset, as it is confidential
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data into a linear space. This means that it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so are able to represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
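A minimal sketch of what that could look like in Keras; the layer widths, the 64-dimensional bottleneck and the placeholder data are assumptions, not recommendations:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 6000   # width of the one-hot-encoded table from the question
bottleneck = 64    # assumed compressed dimension

# Non-linear activations so the mapping is more than a linear projection.
encoder = keras.Sequential([
    layers.Dense(512, activation="relu", input_shape=(input_dim,)),
    layers.Dense(bottleneck, activation="relu"),
])
decoder = keras.Sequential([
    layers.Dense(512, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),  # reconstruct the 0/1 one-hot input
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.randint(0, 2, size=(1000, input_dim)).astype("float32")  # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

X_compressed = encoder.predict(X)  # 64-dimensional features for the downstream regression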
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also PCA decomp.) will give you great insight about which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. SKLearn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it is great for tabular data (as is the case here), and it can easily be trained using numerical values for class features, meaning your data vector can be of dimension just 20!
Decision-tree models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: if you use a random forest, it can learn even with numerical values for categories, and you can avoid creating incredibly large data vectors.
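A minimal sketch of that approach, assuming the categorical columns are simply encoded as integer codes (all names and sizes are placeholders):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Placeholder data: 17 categorical and 3 numeric features, as in the question.
rng = np.random.default_rng(0)
n = 10_000
cat_cols = [f"cat_{i}" for i in range(17)]
num_cols = [f"num_{i}" for i in range(3)]
df = pd.DataFrame({c: rng.choice([f"level_{k}" for k in range(300)], size=n) for c in cat_cols})
for c in num_cols:
    df[c] = rng.normal(size=n)
y = rng.normal(size=n)  # placeholder target

# Integer-encode each categorical column instead of one-hot encoding it.
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))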

numpy difference between fft and rfft

I'm trying to understand the difference between numpy fft and rfft. I've read the doc, and it only says rfft is meant for real inputs.
I've tested their performance on a large real array and found that rfft is faster than fft by about a third. My question is: why is rfft faster? Thanks!
An RFFT has half the degrees of freedom on the input, and half the number of complex outputs, compared to an FFT. Thus the FFT computation tree can be pruned to remove those adds and multiplies not needed for the non-existent inputs, and those that are unnecessary because there is a smaller number of independent output values to compute.
This is because an FFT of a strictly real input (i.e. all the imaginary components of the input are zero) produces a complex-conjugate-mirrored result, where each half can be trivially derived from the other half.
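A quick way to see both points (a minimal sketch; the array length is arbitrary):

import numpy as np

x = np.random.rand(1024)   # strictly real input

full = np.fft.fft(x)       # 1024 complex outputs, conjugate-symmetric
half = np.fft.rfft(x)      # only the n//2 + 1 = 513 non-redundant outputs

# rfft returns the first half of the full spectrum...
assert np.allclose(full[:513], half)
# ...and the rest is just the complex conjugate of the mirrored first half.
assert np.allclose(full[513:], np.conj(full[1:512][::-1]))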

Representing high precision floats in tensorflow

We intend to train an ANN with some data which do not fit into float64 or any other existing data types in TensorFlow. For example, an input neuron may receive:
1.13760089015656552359796023723738662029191734968655617689014826158791613333
I have two questions:
Is there a way to represent the data in TensorFlow or Keras?
If there is not a clear and convenient solution, do you think that casting the data into complex128 would help us? For example:
1.137600890156565523597960237237 + 73866202919173496865561768901 i
For our case, more floating-point digits mean more accuracy, so we need to keep as many of them as possible.
We prefer to use a GPU, so the solution should support that as well.
Have you considered preprocessing the data? It is rather rare to see a noticeably important difference at such high precision. It might also be hard to calculate a meaningful gradient when you only operate in such a flat space.
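For context, float64 carries only about 15-17 significant decimal digits, so most of the digits in the example value are discarded on conversion; a quick check (a minimal sketch):

import numpy as np

s = "1.13760089015656552359796023723738662029191734968655617689014826158791613333"
x = np.float64(s)

# float64 has a 53-bit mantissa, i.e. roughly 16 significant decimal digits;
# everything beyond that is rounded away when the string is parsed.
print(repr(x))
print(np.finfo(np.float64).eps)  # ~2.2e-16, the relative spacing of adjacent float64 values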

Most efficient way to add two CSR sparse matrices with the same sparsity pattern in python

I am using sparse matrices in python, namely
scipy.sparse.csr_matrix
I am in principle free to choose the exact sparse implementation, as long as the matrices support matrix-vector multiplication and addition/subtraction of matrices with the same sparsity pattern. Currently, at every time step, I construct a new sparse matrix from scratch and add it to the existing matrix. I believe that my code could be unnecessarily losing time due to
Construction time of sparse matrix
Addition of the sparse matrices, assuming that the underlying algorithm inside CSR matrix implementation has to find matching sparse entries before adding them up.
My guess would be that the sparse matrix is internally stored as a numpy array of values plus a few index arrays denoting where those values are located. The question is whether it is possible to directly add the underlying value arrays without touching the sparsity structure. Is something like this possible?
new_values = np.linspace(0, 1, csr_mat.nnz)  # one new value per stored entry
csr_mat.data += new_values
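In scipy, the stored values do live in a plain numpy array, the .data attribute, with the pattern described by .indices and .indptr; a minimal sketch of adding two CSR matrices in place when they share a sparsity pattern (the matrices here are placeholders):

import numpy as np
import scipy.sparse as sp

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
B = A.copy()
B.data = np.random.rand(B.nnz)  # same pattern as A, different stored values

# Optional sanity check that the sparsity patterns really do match.
assert np.array_equal(A.indices, B.indices) and np.array_equal(A.indptr, B.indptr)

# In-place addition of the value arrays; .indices and .indptr stay untouched.
A.data += B.data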

What are the downsides of convolution by FFT compared to realspace convolution?

So I am aware that a convolution by FFT has a lower computational complexity than a convolution in real space. But what are the downsides of an FFT convolution?
Does the kernel size always have to match the image size, or are there functions that take care of this, for example in Python's numpy and scipy packages? And what about anti-aliasing effects?
FFT convolutions are based on the convolution theorem, which states that given two functions f and g, if Fd() and Fi() denote the direct and inverse Fourier transform, and * and . convolution and multiplication, then:
f*g = Fi(Fd(f).Fd(g))
To apply this to a signal f and a kernel g, there are some things you need to take care of:
f and g have to be of the same size for the multiplication step to be possible, so you need to zero-pad the kernel (or input, if the kernel is longer than it).
When doing a DFT, which is what FFT does, the resulting frequency domain representation of the function is periodic. This means that, by default, your kernel wraps around the edge when doing the convolution. If you want this, then all is great. But if not, you have to add an extra zero-padding the size of the kernel to avoid it.
Most (all?) FFT packages only work well (performance-wise) with sizes that do not have any large prime factors. Rounding the signal and kernel size up to the next power of two is a common practice that may result in a (very) significant speed-up.
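Putting the points above together, a minimal numpy sketch of the convolution theorem with the zero-padding needed to avoid wrap-around (the sizes are arbitrary):

import numpy as np

f = np.random.rand(1000)   # signal
g = np.random.rand(32)     # kernel

# Pad both to at least len(f) + len(g) - 1 so the circular convolution
# implied by the DFT matches the linear ("full") convolution.
n = len(f) + len(g) - 1
n_fast = 1 << (n - 1).bit_length()   # round up to a power of two for speed

F = np.fft.rfft(f, n_fast)
G = np.fft.rfft(g, n_fast)
fft_conv = np.fft.irfft(F * G, n_fast)[:n]

# Same result as a direct linear convolution, up to floating-point error.
assert np.allclose(fft_conv, np.convolve(f, g, mode="full"))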
If your signal and kernel sizes are f_l and g_l, doing a straightforward convolution in time domain requires g_l * (f_l - g_l + 1) multiplications and (g_l - 1) * (f_l - g_l + 1) additions.
For the FFT approach, you have to do 3 FFTs of size at least f_l + g_l, as well as f_l + g_l multiplications.
For large sizes of both f and g, the FFT is clearly superior with its n*log(n) complexity. For small kernels, the direct approach may be faster.
scipy.signal has both convolve and fftconvolve methods for you to play around. And fftconvolve handles all the padding described above transparently for you.
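For example (a minimal sketch; actual timings depend on your platform):

import numpy as np
from scipy import signal

x = np.random.rand(100_000)
small_kernel = np.random.rand(16)
large_kernel = np.random.rand(4096)

# Both functions return the same result; fftconvolve handles the padding itself.
assert np.allclose(signal.convolve(x, small_kernel), signal.fftconvolve(x, small_kernel))

# Rule of thumb: direct convolution tends to win for small kernels, the FFT
# approach for large ones; method="auto" lets scipy pick heuristically.
y = signal.convolve(x, large_kernel, method="auto")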
While fast convolution has better "big O" complexity than direct-form convolution, there are a few drawbacks or caveats. I did some thinking about this topic for an article I wrote a while back.
Better "big O" complexity is not always better. Direct form convolution can be faster than using FFTs for filters smaller than a certain size. The exact size depends on the platform and implementations used. The crossover point is usually in the 10-40 coefficient range.
Latency. Fast convolution is inherently a blockwise algorithm. Queueing up hundreds or thousands of samples at a time before transforming them may be unacceptable for some real-time applications.
Implementation complexity. Direct form is simpler in terms of the memory, code space and in the theoretical background of the writer/maintainer.
On a fixed-point DSP platform (not a general-purpose CPU): the limited word size of fixed-point FFTs makes large fixed-point FFTs nearly useless. At the other end of the size spectrum, these chips have specialized MAC instructions that are well designed for performing direct-form FIR computation, increasing the range over which the O(N^2) direct form is faster than O(N log N). These factors tend to create a limited "sweet spot" where fixed-point FFTs are useful for fast convolution.