Is there any fast method to check whether a square matrix is symmetrical? - blas

I was wondering if there is any fast method to check whether a given square matrix is symmetric.
I've checked some BLAS packages but can't seem to find anything. Using Intel MKL, the best method seems to be to call somatcopy, which performs an out-of-place transposition of matrix A and stores the result in matrix B. It is then possible to check whether matrix A equals B.
This method, however, requires an extra matrix to be stored in memory, and I was looking for an approach with O(1) extra memory overhead.
Thanks in advance for any insight!
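For illustration, a minimal numpy sketch of an O(1)-extra-memory check (the question is about BLAS, but the idea is language-agnostic): compare each element of the upper triangle with its mirror and stop at the first mismatch.

```python
import numpy as np

def is_symmetric(a, atol=1e-12):
    """Symmetry check with O(1) extra memory: walk the upper triangle only."""
    n, m = a.shape
    if n != m:
        return False
    for i in range(n):
        for j in range(i + 1, n):
            if abs(a[i, j] - a[j, i]) > atol:
                return False
    return True
```

Note that np.allclose(a, a.T) is much faster in practice (a.T is a zero-copy view, so no transposed copy of A is stored), but the elementwise comparison still materializes an n x n boolean temporary, so it is not strictly O(1) in extra memory.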


Higher precision eigenvalues with numpy

I'm currently computing several eigenvalues of different matrices and trying to find their closed-form solutions. The matrix is Hermitian (self-adjoint) and tridiagonal. Additionally, every diagonal element is positive and every off-diagonal element is negative.
Due to what I suspect is the impossibility of algebraically solving the quintic, sympy cannot solve the eigenvalues of my 14x14 matrix.
Numpy has given me great results that I'm sometimes able to use via wolfram-alpha, but other times the precision is insufficient to determine which of several candidates the closed-form solution could be. As a result, I wish to increase the precision with which numpy.linalg.eigvalsh outputs eigenvalues. Any help would be greatly appreciated!
Eigenvalue problems of size >= 5 have no general closed-form solution (for the reason you mention), so all general eigensolvers are iterative. As a result, there are a few sources of error.
First, there are the errors from the convergence of the algorithm itself: even if all your computations were exact, you would need to run a certain number of iterations to get a certain accuracy.
Second, finite precision limits the overall accuracy.
Numerical analysts study how accurate a solution you can get for a given algorithm and precision, and there are results on this.
As for your specific problem, if you are not getting enough accuracy there are a few things you can try to do.
The first is to make sure you are using the best solver for your matrix: since it is symmetric and tridiagonal, use a solver specialized for that type (as suggested by norok2).
If that still doesn't give you enough accuracy, you can try to increase the precision.
However, the main issue with doing this in numpy is that the LAPACK functions under the hood are compiled for float64.
Thus, even if a numpy function accepts inputs of higher precision (float128), it will round them before calling the LAPACK functions.
It might be possible to recompile those functions for higher precision, but that may not be worth the effort for your particular problem.
(As a side note, I'm not very familiar with scipy, so it may be the case that they have eigensolvers written in python which support all different types, but you need to be careful that they are actually doing every step in the higher precision and not silently rounding to float64 somewhere.)
For your problem, I would suggest using the package mpmath, which supports arbitrary precision linear algebra.
It is a bit slower since everything is done in software, but for 14x14 matrices it should still be pretty quick.
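A minimal sketch with mpmath, assuming its mp.eigsy routine for real symmetric matrices; the matrix below is a hypothetical stand-in with the structure described (positive diagonal, negative off-diagonals):

```python
from mpmath import mp, matrix

mp.dps = 50  # work with 50 significant decimal digits

# Hypothetical 14x14 tridiagonal test matrix: positive diagonal,
# negative off-diagonals
n = 14
A = matrix(n, n)
for i in range(n):
    A[i, i] = 2
    if i + 1 < n:
        A[i, i + 1] = -1
        A[i + 1, i] = -1

E, Q = mp.eigsy(A)  # eigenvalues and eigenvectors in 50-digit arithmetic
print(E[0])
```

This particular test matrix happens to have the known closed form 2 - 2*cos(k*pi/15) for its eigenvalues, which makes a convenient correctness check for the setup.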

How to translate computation in index notation into sequence of SIMD ops in general case?

UPD: the question in its original form is poorly formulated because I strongly confused terminology (SIMD vs. vectorized computations) and gave too broad an example that does not specify exactly what the problem is; I voted to close it as "unclear what you're asking", and I'll link a better-formulated question above whenever it appears.
In mathematics, one would usually describe n-dimensional tensor computation using index notation, that would look something like:
A[i,j,k] = B[k,j] + C[d[k],i,B[k,j]] + d[k]*f[j] // for 0<i<N, 0<j<M, 0<k<K
but if we want to use any SIMD library to efficiently parallelize that computation (and take advantage of linear-algebraic magic), we have to express it using primitives from BLAS, numpy, tensorflow, OpenCL, ... which is often quite tricky.
Expressions in Einstein notation like A_ijk*B_kj are generally solved via np.einsum (using tensordot, sum and transpose, I guess?). Summation and other element-wise ops are also okay; "smart" indexing is quite tricky, though (especially if an index appears more than once in the expression).
I wonder if there any language-agnostic libraries that take an expression in certain form (lets say, form above) and translates it into some Intermediate Representation that can be efficiently executed using existing linear-algebra libraries?
There are libraries that attempt to parallelize loop computations (the user API usually looks like #pragma in C++ or @numba.jit in python), but I'm asking about a slightly different thing: translating an arbitrary expression in the form above into a finite sequence of SIMD commands such as elementwise ops, matvecs, tensordots, etc.
If there are no language-agnostic solutions yet, I am personally interested in numpy computations :)
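For the numpy case specifically, the example expression above can already be vectorized with broadcasting plus fancy indexing (the "gather"). A sketch with hypothetical small shapes, checked against the naive triple loop:

```python
import numpy as np

# Hypothetical shapes; B and d hold integers because they index into C
N, M, K, D, V = 3, 4, 5, 6, 7
rng = np.random.default_rng(0)
B = rng.integers(0, V, size=(K, M))  # used both as a value and as an index
C = rng.random((D, N, V))
d = rng.integers(0, D, size=K)
f = rng.random(M)

# Open indices, shaped so everything broadcasts to the result shape (N, M, K)
i = np.arange(N)[:, None, None]
j = np.arange(M)[None, :, None]
k = np.arange(K)[None, None, :]

# A[i,j,k] = B[k,j] + C[d[k], i, B[k,j]] + d[k] * f[j]
A = B[k, j] + C[d[k], i, B[k, j]] + d[k] * f[j]

# Reference: the naive loop nest
A_ref = np.empty((N, M, K))
for ii in range(N):
    for jj in range(M):
        for kk in range(K):
            A_ref[ii, jj, kk] = B[kk, jj] + C[d[kk], ii, B[kk, jj]] + d[kk] * f[jj]
assert np.allclose(A, A_ref)
```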
Further questions about the code:
I see B[k,j] is used as an index and as a value. Is everything integer? If not, which parts are FP, and where does the conversion happen?
Why does i not appear in the right hand side? Is the same data repeated N times?
Oh yikes, so you have a gather operation, with indices coming from d[k] and B[k,j]. Only a few SIMD instruction sets support this (e.g. AVX2).
I mostly manually vectorize stuff in C with Intel's x86 intrinsics (or use auto-vectorization and check the compiler's asm output to make sure it didn't suck), so IDK if there's any kind of platform-independent way to express that operation.
I wouldn't expect that many cross-platform SIMD languages would provide a gather or anything built on top of a gather. I haven't used numpy though.
I don't expect you'd find a BLAS, LAPACK, or other library function that includes a gather, unless you go looking for implementations of this exact problem.
With an efficient gather (e.g. Intel Skylake or Xeon Phi), it might vectorize ok if you use SIMD in the loop over j, so you load a whole vector at once from B[], and from f[], and use it with a vector holding d[k] broadcast to every position. You probably want to store a transposed result matrix, like A[i][k][j], so the final store doesn't have to be a scatter. You definitely need to avoid looping over k in the inner-most loop, since that makes loads from B[] non-contiguous, and you have d[k] instead of f[j] varying inside the inner loop.
I haven't done much with GPGPU, but they do SIMD differently. Instead of short vectors like CPUs use, they have effectively many scalar processors ganged together. OpenCL or CUDA or whatever other hot new GPGPU tech might handle your gathers much more efficiently.
"SIMD commands, like elementwise ops, matvecs, tensordots, etc."
When I think of "SIMD commands", I think of x86 assembly instructions (or ARM NEON, or whatever), or at least C / C++ intrinsics that compile to single instructions. :P
A matrix-vector product is not a single "instruction". If you used that terminology, every function that processes a buffer would be "a SIMD instruction".
The last part of your question seems to be asking for a programming-language independent version of numpy, for gluing together high-performance library functions. Or were you thinking that there might be something that would inter-optimize such operations, so you could write something that would compile to a vectorized loop that did stuff like use each input more than once without having to reload it in separate library calls?
IDK if there's anything like that, other than normal C compiler auto-vectorization of loops over arrays.

CGAffineTransform: apply a Scale to a Translation, how?

The affine transforms Apple use have "scale" defined as "does not affect translation"
This seems completely wrong to me: it doesn't match what I'd expect from normal affine transforms (where a scale multiplied by a translation DOES affect the translation), and it makes it extremely difficult to work with real-world problems, where "scaling" is expected to scale the entire co-ordinate system, not just the local co-ords of a single object at a time.
Is there a safe way within Apple's library to workaround this problem (i.e. make "scale" apply to the whole matrix, not just the non-translation parts)?
Or have I made a stupid mistake and completely misunderstood what's happening with the scaling, somehow?
I'm pretty sure that just means it doesn't affect the translation values in the matrix. CGAffineTransform isn't some special brand of math, it's just a regular transformation matrix. It works like any other transformation matrix you've ever used.
Ah. Embarrassing. My mistake: the arguments to concat were the wrong way around! At least I can leave this here and hopefully help the next person who makes such a dumb mistake.
I had a Concat call with the arguments the wrong way around; obviously, "translating" a "scale" works as expected: the scale doesn't affect the translate!
When I googled this issue, I hit a couple of pages that talked about CGAffineTransform doing scale and translate independently. Confirmation bias :( I read that and assumed it was true. Doh.
FYI: CGAffineTransformConcat(A, B) does Matrix A * Matrix B, i.e. "A's effects first, then B's effects".
So, make sure your scaling matrix is the second argument (or the "later" argument if you have a chain of nested Concat calls).
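To see why the order matters, here is a small numpy sketch of the underlying 3x3 matrix math (not Apple's API): CGAffineTransform uses the row-vector convention p' = p * M, so Concat(A, B) computes A * B.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, 0],
                     [0, 1, 0],
                     [tx, ty, 1]], dtype=float)

def scale(sx, sy):
    return np.array([[sx, 0, 0],
                     [0, sy, 0],
                     [0,  0, 1]], dtype=float)

T, S = translation(10, 0), scale(2, 2)
p = np.array([1, 1, 1], dtype=float)  # the point (1, 1) in homogeneous coords

print(p @ T @ S)  # translate, then scale: [22. 2. 1.] -- scale hits the translation
print(p @ S @ T)  # scale, then translate: [12. 2. 1.] -- translation is unaffected
```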

What is the most efficient way to implement a convolution filter within a pixel shader?

Implementing convolution in a pixel shader is somewhat costly due to the very high number of texture fetches.
A direct way of implementing a convolution filter is to make N x N lookups per fragment using two nested for loops per fragment. A simple calculation says that a 1024x1024 image blurred with a 4x4 Gaussian kernel would need 1024 x 1024 x 4 x 4 = 16M lookups.
What can one do about this?
Can one use some optimization that would need fewer lookups? I am not interested in kernel-specific optimizations like the ones for the Gaussian (or are they kernel-specific?)
Can one at least make these lookups faster by somehow exploiting the locality of the pixels one would work with?
Thanks!
Gaussian kernels are separable, which means you can do a horizontal pass first, then a vertical pass (or the other way around). That reduces the N x N lookups per pixel to 2N. This works for all separable filters, not just blur (not all filters are separable, but many are, and some are "as good as").
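A minimal numpy sketch of the two-pass idea, using a 7-tap kernel (14 lookups per pixel instead of 49):

```python
import numpy as np
from scipy.ndimage import convolve1d
from scipy.signal import convolve2d

def gaussian_1d(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

img = np.random.rand(256, 256)
k = gaussian_1d(sigma=1.5, radius=3)               # 7 taps

tmp = convolve1d(img, k, axis=1, mode='constant')  # horizontal pass
out = convolve1d(tmp, k, axis=0, mode='constant')  # vertical pass

# Same result as the full 2-D kernel; interiors compared to sidestep
# boundary-handling differences
ref = convolve2d(img, np.outer(k, k), mode='same')
assert np.allclose(out[3:-3, 3:-3], ref[3:-3, 3:-3])
```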
Alternatively, in the particular case of a blur filter (Gaussian or not), which is a kind of "weighted sum", you can take advantage of texture interpolation, which may be faster for small kernel sizes (but definitely not for large ones).
EDIT: [image illustrating the "linear interpolation" method]
EDIT (as requested by Jerry Coffin) to summarize the comments:
In the "texture filter" method, linear interpolation will produce a weighted sum of adjacent texels according to the inverse distance from the sample location to the texel center. This is done by the texturing hardware, for free. That way, 16 pixels can be summed in 4 fetches. Texture filtering can be exploited in addition to separating the kernel.
In the example image, on the top left, your sample (the circle) hits the center of a texel: what you get is the same as "nearest" filtering, that texel's value. On the top right, you are in the middle between two texels, so you get the 50/50 average of them (pictured by the lighter shade of blue). On the bottom right, you sample in between 4 texels, but somewhat closer to the top left one. That gives you a weighted average of all 4, with the weight biased towards the top left one (darkest shade of blue).
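A short sketch of the arithmetic behind that trick (the helper and kernel below are illustrative): linear filtering at texel position i + a returns (1-a)*t[i] + a*t[i+1], so fetching at a = w2/(w1+w2) with combined weight w1+w2 reproduces two discrete taps in one lookup.

```python
# Fold pairs of adjacent discrete taps into single linear-filtered fetches
def linear_taps(weights):
    taps = []
    for i in range(0, len(weights) - 1, 2):
        w = weights[i] + weights[i + 1]
        taps.append((i + weights[i + 1] / w, w))  # (fetch position, weight)
    if len(weights) % 2:                          # leftover single texel
        taps.append((float(len(weights) - 1), weights[-1]))
    return taps

print(linear_taps([1, 4, 6, 4, 1]))  # 5-tap binomial kernel -> 3 fetches
```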
The following suggestions are courtesy of datenwolf (see below):
"Another methods I'd like suggest is operating in fourier space, where convolution turns into a simple product of fourier transformed signal and fourier transformed kernel. Although the fourier transform on the GPU itself is quite tedious to implement, at least using OpenGL shaders. But it's quite easy done in OpenCL. Actually I implement such things using OpenCL, now, a lot of image processing in my 3D engine happens in OpenCL.
OpenCL has been specifically designed for running on GPUs. A Fast Fourier Transform is actually the piece of example code on Wikipedia's OpenCL article: en.wikipedia.org/wiki/OpenCL and yes the performance gain is tremendous. A FFT executes with at most O(n log n), the reverse the same. The filter kernel fourier representation can be precomputed. The way is FFT -> multiply with kernel -> IFFT, which boils down to O(n + 2n log n) operations. Take note the the actual convolution is just O(n) there.
In the case of a separable, finite convolution like a gaussian blur the separation solution will outperform the fourier method. But in case of generalized, possible non-separable kernels the fourier methods is probably the fastest method available.
OpenCL integrates nicely with OpenGL, e.g. you can use OpenGL buffers (textures and vertex) for both input and ouput of OpenCL programs."
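A minimal numpy sketch of that pipeline (CPU-side for clarity; an OpenCL version has the same structure): FFT both inputs, multiply pointwise, inverse FFT.

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(256, 256)
kernel = np.random.rand(9, 9)  # arbitrary, possibly non-separable kernel

# Zero-pad both to the full output size so the pointwise product yields
# linear (not circular) convolution
shape = (img.shape[0] + kernel.shape[0] - 1,
         img.shape[1] + kernel.shape[1] - 1)
out = np.fft.irfft2(np.fft.rfft2(img, shape) * np.fft.rfft2(kernel, shape), shape)

# Matches direct convolution
assert np.allclose(out, convolve2d(img, kernel, mode='full'))
```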
Besides being separable, Gaussian filters are also computable in O(1) per pixel:
There are recursive formulations like the Deriche one:
http://hal.inria.fr/docs/00/07/47/78/PDF/RR-1893.pdf
Rotoglup's answer to my question here may be worth reading; in particular, this blog post about Gaussian blur really helped me understand the concept of separable filters.
One more approach is approximating the Gaussian curve with a step-wise function: https://arxiv.org/pdf/1107.4958.pdf (I guess piece-wise linear functions could also be used, of course).
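For flavor, a minimal 1-D sketch of the running-sum idea behind such piecewise-constant approximations: a box blur costs O(1) per pixel regardless of radius, and a few repeated box passes approximate a Gaussian.

```python
import numpy as np

def box_blur_1d(a, radius):
    # Two cumsum lookups per output: O(1) work per pixel, any radius
    c = np.cumsum(np.pad(a, (radius + 1, radius), mode='edge'))
    return (c[2 * radius + 1:] - c[:-(2 * radius + 1)]) / (2 * radius + 1)

x = np.random.rand(1000)
y = x
for _ in range(3):                # ~3 box passes approximate a Gaussian
    y = box_blur_1d(y, radius=5)
```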

Why is Verlet integration better than Euler integration?

Can someone explain to me why Verlet integration is better than Euler integration? And why RK4 is better than Verlet? I don't understand why it is a better method.
The Verlet method is good at simulating systems with energy conservation, and the reason is that it is symplectic. To understand this statement, you have to describe a time step in your simulation as a function, f, that maps the state space into itself. In other words, each time step can be written in the following form:
(x(t+dt), v(t+dt)) = f(x(t), v(t))
The time step function, f, of the Verlet method has the special property that it conserves state-space volume. We can put this in mathematical terms. If you have a set A of states in the state space, then you can define f(A) by
f(A) = {f(x) | x in A}
Now let us assume that the sets A and f(A) are smooth and nice, so we can define their volume. Then a symplectic map, f, will always satisfy that the volume of f(A) is the same as the volume of A (and this holds for all nice and smooth choices of A). This is satisfied by the time step function of the Verlet method, and therefore the Verlet method is a symplectic method.
Now the final question is: why is a symplectic method good for simulating systems with energy conservation? I am afraid you will have to read a book to understand this.
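To make the difference concrete, a minimal sketch comparing explicit Euler with velocity Verlet on a unit harmonic oscillator (x'' = -x), whose exact energy 0.5*(x^2 + v^2) is constant:

```python
# Unit harmonic oscillator: acceleration a(x) = -x, exact energy is 0.5
def euler(x, v, dt):
    return x + dt * v, v - dt * x

def verlet(x, v, dt):
    a = -x
    x_new = x + dt * v + 0.5 * dt**2 * a
    v_new = v + 0.5 * dt * (a - x_new)  # a - x_new = a(old) + a(new)
    return x_new, v_new

dt, steps = 0.01, 100_000
for step in (euler, verlet):
    x, v = 1.0, 0.0
    for _ in range(steps):
        x, v = step(x, v, dt)
    print(step.__name__, 0.5 * (x**2 + v**2))
```

Here Euler multiplies the oscillator's energy by (1 + dt^2) every step, so it drifts exponentially (by roughly a factor of e^10 with these settings), while the symplectic Verlet step keeps it bounded near the true value of 0.5.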
The Euler method is a first-order integration scheme, i.e. the total error is proportional to the step size. However, it can be numerically unstable; in other words, the accumulated error can overwhelm the calculation, giving you nonsense. Please note, this instability can occur regardless of how small you make the step size or whether the system is linear or not. I am not familiar with Verlet integration, so I cannot speak to its efficacy. But the Runge-Kutta methods differ from the Euler method in more than just step size.
In essence, they are based on a better way of numerically approximating the derivative. The precise details escape me at the moment. In general, the fourth-order Runge-Kutta method is considered the workhorse of the integration schemes, but it does have some disadvantages. It is slightly dissipative, i.e. a small first-derivative-dependent term is added to your calculation which resembles an added friction. Also, it has a fixed step size, which can make it difficult to achieve the accuracy you desire. Alternatively, you can use an adaptive step-size scheme, like the Runge-Kutta-Fehlberg method, which gives fifth-order accuracy for an additional 6 function evaluations. This can greatly reduce the time necessary to perform your calculation while improving accuracy, as shown here.
If everything just coasts along in a linear way, it wouldn't matter what method you used, but when something interesting (i.e. non-linear) happens, you need to look more carefully, either by considering the non-linearity directly (Verlet) or by taking smaller time steps (RK4).