Multivariate Gaussian likelihood without matrix inversion - numpy

There are several tricks available for sampling from a multivariate Gaussian without matrix inversion, Cholesky/LU decomposition among them. Are there any tricks for calculating the likelihood of a multivariate Gaussian without doing the full matrix inversion?
I'm working in python, using numpy arrays. scipy.stats.multivariate_normal is absurdly slow for the task, taking significantly longer than just doing the matrix inversion directly with numpy.linalg.inv.
So at this point I'm trying to understand what is best practice.
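One common answer (sketched here, not tied to any one library's internals): factor the covariance once with a Cholesky decomposition, read the log-determinant off the factor's diagonal, and evaluate the quadratic form with a triangular solve, so the inverse is never formed explicitly. The function name `gaussian_loglik` is illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_loglik(x, mean, cov):
    """Log-density of a multivariate normal via Cholesky, no explicit inverse."""
    d = x - mean
    c, low = cho_factor(cov)                     # cov = L @ L.T
    solved = cho_solve((c, low), d)              # cov^{-1} @ d, via triangular solves
    log_det = 2.0 * np.sum(np.log(np.diag(c)))   # log|cov| from the factor's diagonal
    k = len(mean)
    return -0.5 * (k * np.log(2 * np.pi) + log_det + d @ solved)
```

This should agree with `scipy.stats.multivariate_normal(...).logpdf(x)` while avoiding `numpy.linalg.inv`; if you evaluate many points against the same covariance, factor once and reuse the factorization.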

Related

Gaussian process regression for optimization, how hyperparameters affect the gradient of a Gaussian process

I am studying predictive control with a Gaussian process model. When I use the hyperparameters obtained by maximizing the marginal likelihood, I cannot get good results. When I set the hyperparameters by hand, the model's MSE becomes larger, but the control performance improves somewhat. Can someone explain this? Thanks

What does it mean to say convolution implementation is based on GEMM (matrix multiply) or it is based on 1x1 kernels?

I have been trying to understand (but miserably failing) how convolutions on images (with height, width, channels) are implemented in software.
I've heard people say their convolution implementation is done using GEMM, or done using "Direct convolution" or done using 1x1 kernels.
I find it very confusing and can't wrap my head around the many different ways it's described. I thought I understood a typical convolution, like PyTorch's conv2d, as a mathematical operation on an image, but what do people mean when they say they do conv2d in one of the following ways?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
GEMM
"direct convolution"
For convolution using GEMM, what I understand based on this paper is that the input image and the filters are each converted to 2D matrices using im2col and im2row ops, and these two matrices are then simply multiplied.
The 3D input image (height, width, input-channels) is converted to a 2D matrix, and the 4D kernel (output-channels, input-channels, kernel-height, kernel-width) is converted to a 2D matrix. Or does "GEMM-based implementation of convolution" mean something else? If that's what it means, how is it different from doing "convolution using 1x1 kernels"?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
You can have a 3x3 convolution, where a square containing 9 elements slides over the image (with some specified stride, dilation, etc.). In the 1x1 case the kernel is a single element (with stride=1 as well and no dilation).
So instead of a sliding window with summation you simply project each pixel linearly with this single-valued kernel.
It is a cheap operation and is used as part of the depthwise separable convolutions found in many modern architectures to increase or decrease the number of channels.
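The "linear projection at each pixel" view can be made concrete: a 1x1 convolution over a (H, W, C_in) image with C_out output channels is exactly one matrix multiply of the flattened pixels against a (C_in, C_out) weight matrix. A minimal numpy sketch (function and shapes are illustrative):

```python
import numpy as np

def conv1x1(image, weights):
    """1x1 convolution: a linear projection applied independently at each pixel.
    image: (H, W, C_in), weights: (C_in, C_out)."""
    h, w, c_in = image.shape
    # Flatten spatial dims, project every pixel's channel vector, restore shape.
    return (image.reshape(h * w, c_in) @ weights).reshape(h, w, -1)
```

Each output pixel is just `image[i, j] @ weights`, which is why 1x1 convolutions are cheap channel mixers.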
GEMM
At the top of the article you provided there is:
[...] function called GEMM. It’s part of the BLAS (Basic Linear Algebra
Subprograms)
So BLAS is a specification that describes a set of low-level algebraic operations and how they should be performed on a computer.
Now, there are many implementations of BLAS tailored to specific architectures or having traits useful in particular contexts. For example, there is cuBLAS, which is written and optimized for GPUs (and used heavily by higher-level deep learning libraries like PyTorch), or Intel's MKL for Intel CPUs (you can read more about BLAS anywhere on the web).
Usually those are written in low-level languages (Fortran, C, assembly, C++) for maximum performance.
GEMM is the GEneral Matrix Multiply routine, provided by the various BLAS implementations, and it is used to implement fully connected layers and convolutions.
It has nothing to do with deep learning convolution per se; it is a fast matrix multiplication routine (taking into account things like cache behavior).
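The im2col-plus-GEMM idea the question describes can be sketched in plain numpy: lay out one row per output position holding the flattened receptive field, flatten the kernels into a matrix, and let a single matrix multiply (the GEMM) do all the work. This is an illustrative sketch of the technique, not any library's actual implementation (real ones avoid the Python loops and the memory blow-up):

```python
import numpy as np

def conv2d_im2col(image, kernels):
    """'Valid' 2D convolution (cross-correlation, as in deep learning)
    via im2col + one matrix multiply.
    image: (H, W, C_in), kernels: (C_out, C_in, kh, kw)."""
    h, w, c_in = image.shape
    c_out, _, kh, kw = kernels.shape
    oh, ow = h - kh + 1, w - kw + 1
    # im2col: one row per output position, one column per (c_in, kh, kw) element.
    cols = np.empty((oh * ow, c_in * kh * kw))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw, :]           # (kh, kw, c_in)
            cols[i * ow + j] = patch.transpose(2, 0, 1).ravel()
    w_mat = kernels.reshape(c_out, -1)                     # (c_out, c_in*kh*kw)
    return (cols @ w_mat.T).reshape(oh, ow, c_out)         # the GEMM
```

Note that a 1x1 convolution is the degenerate case where im2col is just a reshape, which is one reason the two ideas get mentioned together.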
Direct convolutions
It is the O(n^2) approach: you simply multiply and accumulate items directly. There is a more efficient approach using the fast Fourier transform (FFT), which is O(n log n). Some info is presented in this answer; questions about this part would be better suited for the math-related Stack Exchanges.
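To make the two complexity claims concrete, here is a 1-D sketch (illustrative code, not production): direct convolution as a double loop, and the same result computed by multiplying FFTs.

```python
import numpy as np

def conv_direct(x, k):
    """Direct full 1-D convolution: O(n^2) multiply-adds."""
    n, m = len(x), len(k)
    out = np.zeros(n + m - 1)
    for i in range(n):
        for j in range(m):
            out[i + j] += x[i] * k[j]
    return out

def conv_fft(x, k):
    """Same result via the convolution theorem: O(n log n)."""
    size = len(x) + len(k) - 1
    return np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(k, size), size)
```

Both should match `np.convolve(x, k)`; for small kernels the direct method often wins in practice, which is why libraries pick between direct, FFT, and GEMM paths depending on sizes.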

GPU convolution performance in TensorFlow

How are the 2D / 3D convolution operations in TensorFlow implemented, as sparse matrix - dense vector multiplications? How is performance optimised for applying many filters to the same image? How is the performance optimised for applying the same filters for many images? Is there a paper that describes the performance for different ways of implementing the convolutions?
The reason I'm asking is that I have implemented and optimised many 2D and 3D (also 4D) convolvers with CUDA, and I'm wondering whether I should compare their performance to TensorFlow's.

Survival Analysis in TensorFlow

I have been using standard packages for survival analysis in R. I know how to do classification problems in TensorFlow such as logistic regression, but I am having difficulty mapping this to survival analysis problems. In a way, instead of one output vector you have two (time_to_event::continuous, censored::boolean). This has been done in Theano, here, but I am having difficulty translating this to TensorFlow.
You can use logistic regression to do survival analysis; however, another way to use TensorFlow is to have the model predict the parameters of a survival distribution. So if you used the Weibull distribution you could, instead of regressing onto the time to event and a censoring probability, estimate the characteristic life (alpha parameter) and the shape (beta parameter). That is, the TF model estimates the parameters of the survival distribution directly.
The loss function can then be the negative log-likelihood, which lets you incorporate both observed and censored data.
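For reference, the censored Weibull likelihood described above looks like this (a numpy sketch of the math; in a real model you would write it with TensorFlow ops so it is differentiable with respect to alpha and beta, and the function name is illustrative):

```python
import numpy as np

def weibull_nll(t, observed, alpha, beta):
    """Negative log-likelihood for Weibull survival data.
    t: event/censoring times, observed: 1 if the event was seen, 0 if right-censored.
    Observed points contribute the log-density; censored points contribute the
    log-survival function (probability of surviving past t)."""
    z = (t / alpha) ** beta
    log_pdf = np.log(beta / alpha) + (beta - 1) * np.log(t / alpha) - z
    log_surv = -z
    return -np.sum(np.where(observed == 1, log_pdf, log_surv))
```

Minimizing this over alpha and beta (or over per-example alpha, beta emitted by a network) is the maximum-likelihood fit that handles censoring naturally.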

Training complexity of Linear SVM

Which is the actual computational complexity of the learning phase of SVM (let's say, that implemented in LibSVM)?
Thank you
Training complexity of nonlinear SVM is generally between O(n^2) and O(n^3), with n the number of training instances. The following papers are good references:
Support Vector Machine Solvers by Bottou and Lin
SVM-optimization and steepest-descent line search by List and Simon
PS: If you want to use a linear kernel, do not use LIBSVM. LIBSVM is a general-purpose (nonlinear) SVM solver and is not an ideal implementation for linear SVM. Instead, consider LIBLINEAR (by the same authors as LIBSVM), Pegasos, or SVM^perf. These have much better training complexity for linear SVM, and training speed can be orders of magnitude better than with LIBSVM.
This is going to be heavily dependent on SVM type and kernel. There is a rather technical discussion in http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf; for a quick answer, expect it to be around O(n^2).