How to create a random NumPy 2D array with full rank, or a particular rank?

In NumPy, I can use the random package to create a 2D array, but I cannot make sure it has full rank or a particular rank. How can I get that? In the case of full rank, I can use linalg.matrix_rank to check, but I want to ensure this in a simpler way. A solution for the case of a square matrix is welcome.
Edit: When I asked this question, I was thinking of a built-in solution in NumPy. But if there is no such solution in NumPy, SciPy, or other Python modules, the situation here may be as in @hpaulj's comment: "But at some level this sounds more like a theoretical math topic than a Numpy programming one." As a math topic, the question becomes: find an algorithm to generate a random matrix (with entries drawn from a uniform distribution) that has a particular rank. So, closing this question or keeping it open to wait for nice algorithms in Python is the admins' choice.
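One common construction (this is not a NumPy built-in, and note that the resulting entries are not uniformly distributed) is to multiply an n x r random matrix by an r x n random matrix: the product has rank r with probability 1, and a square matrix with i.i.d. continuous random entries is itself full rank with probability 1. A minimal sketch:

    import numpy as np

    rng = np.random.default_rng()

    def random_matrix_with_rank(n, rank):
        """Return an n x n matrix that has the given rank with probability 1,
        built as the product of an n x rank and a rank x n random matrix."""
        A = rng.standard_normal((n, rank))
        B = rng.standard_normal((rank, n))
        return A @ B

    M = random_matrix_with_rank(6, 3)
    print(np.linalg.matrix_rank(M))   # 3 (almost surely)

    # Full rank: i.i.d. Gaussian (or uniform) entries give full rank with probability 1.
    F = rng.standard_normal((6, 6))
    print(np.linalg.matrix_rank(F))   # 6 (almost surely)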

Related

Is there any fast method to check whether a square matrix is symmetrical?

I was wondering whether there is any fast method to check if a given square matrix is symmetric.
I've checked some BLAS packages but didn't seem to find anything. Using Intel MKL, the best method seems to be to call somatcopy, which performs an out-of-place transposition of matrix A and stores the result in matrix B. It is then possible to check whether matrix A is equal to B.
This method, however, requires an extra matrix to be stored in memory, and I was looking for an approach with O(1) extra overhead.
Thanks in advance for any insight!
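For reference, here is a NumPy-flavoured sketch of the O(1)-extra-memory idea: compare each upper-triangle entry with its mirror and bail out early. Pure-Python loops are slow for large matrices (a JIT such as numba would help), whereas something like np.allclose(A, A.T) is fast but allocates full-size temporaries.

    import numpy as np

    def is_symmetric(A, rtol=1e-12, atol=0.0):
        """Check symmetry entry by entry over the upper triangle.
        Only O(1) extra memory is used, at the cost of Python-level loops."""
        n, m = A.shape
        if n != m:
            return False
        for i in range(n):
            for j in range(i + 1, n):
                if abs(A[i, j] - A[j, i]) > atol + rtol * abs(A[j, i]):
                    return False
        return True

    A = np.random.rand(4, 4)
    print(is_symmetric(A + A.T))   # True
    print(is_symmetric(A))         # almost surely False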

How are histograms constructed in sklearn's HistGradientBoostingClassifier to decide on the best split point?

Both lightgbm and sklearn's HistGradientBoostingClassifier estimators use histograms to decide on the best splits for continuous features.
Is it possible to explain intuitively (or with some example) the process of histogram creation and how it helps in deciding on a split point at a node faster?
I have looked for answers extensively over the Internet but could not find any simple or intuitive explanation of how the histograms are constructed.
I am not sure, but it could be related to how (unique) regression trees are constructed in XGBoost. For a continuous feature, you construct a histogram, decide on the split (e.g. weight < 70 kg), construct a regression tree, and compute the similarity score as well as the gain. However, when the range of values in the continuous feature is quite large, it is computationally expensive to try all possible split values. In that case, XGBoost basically makes the split using quantiles, which involves dividing all the observations into equally sized sets.
I guess sklearn's HistGradientBoostingClassifier might involve the above kind of optimization as well for coming up with the best split.
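To make the binning idea concrete, here is a small hand-rolled sketch of histogram-based split finding (illustrative only; it is not sklearn's or LightGBM's actual code, and the gain formula is a simplified gradient-boosting-style score): the feature is bucketed once into quantile bins, per-bin gradient sums are accumulated, and then only the bin boundaries are scanned as candidate splits instead of every unique feature value.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: one continuous feature (e.g. weight) and per-sample gradients.
    x = rng.normal(70, 15, size=10_000)
    g = rng.normal(size=10_000)

    # 1. Bin the feature once into (near-)quantile bins.
    n_bins = 32
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)                      # integer bin index per sample

    # 2. Build the histogram: per-bin gradient sum and sample count.
    grad_hist = np.bincount(bins, weights=g, minlength=n_bins)
    count_hist = np.bincount(bins, minlength=n_bins)

    # 3. Scan the n_bins - 1 bin boundaries as candidate splits.
    best_gain, best_bin = -np.inf, None
    total_grad, total_count = grad_hist.sum(), count_hist.sum()
    left_grad = left_count = 0.0
    for b in range(n_bins - 1):
        left_grad += grad_hist[b]
        left_count += count_hist[b]
        right_grad = total_grad - left_grad
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        gain = (left_grad**2 / left_count + right_grad**2 / right_count
                - total_grad**2 / total_count)
        if gain > best_gain:
            best_gain, best_bin = gain, b

    print("best split between bins", best_bin, "and", best_bin + 1)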

Higher precision eigenvalues with numpy

I'm currently computing several eigenvalues of different matrices and trying to find their closed-form solutions. The matrix is Hermitian/self-adjoint and tridiagonal. Additionally, every diagonal element is positive and every off-diagonal element is negative.
Due to what I suspect is the impossibility of solving the quintic algebraically, sympy cannot solve the eigenvalues of my 14x14 matrix.
Numpy has given me great results that I'm sometimes able to use via wolfram-alpha, but other times the precision is insufficient to determine which of several candidate closed forms the solution takes. As a result, I wish to increase the precision with which numpy.linalg.eigvalsh outputs eigenvalues. Any help would be greatly appreciated!
Eigenvalue problems of size >= 5 have no general closed-form solution (for the reason you mention), so all general eigensolvers are iterative. As a result, there are a few sources of error.
First, there are errors from the convergence of the algorithm itself: even if all your computations were exact, you would need to run a certain number of iterations to reach a given accuracy.
Second, finite precision limits the overall accuracy.
Numerical analysts study how accurate a solution you can get for a given algorithm and precision and there are results on this.
As for your specific problem, if you are not getting enough accuracy, there are a few things you can try.
The first is to make sure you are using the best solver for your matrix: since it is symmetric and tridiagonal, use a solver specialized for that type (as suggested by norok2).
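For instance, SciPy has dedicated routines for symmetric tridiagonal matrices; a minimal sketch (the diagonal and off-diagonal values below are placeholders for your actual 14x14 matrix):

    import numpy as np
    from scipy.linalg import eigvalsh_tridiagonal

    n = 14
    d = 2.0 * np.ones(n)        # main diagonal (positive, as described)
    e = -1.0 * np.ones(n - 1)   # off-diagonal (negative, as described)

    # Specialized tridiagonal solver: no need to form the dense matrix at all.
    w = eigvalsh_tridiagonal(d, e)
    print(w)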
If that still doesn't give you enough accuracy, you can try to increase the precision.
However, the main issue with doing this in numpy is that the LAPACK functions under the hood are compiled for float64.
Thus, even if the numpy function allows inputs of higher precision (float128), it will round them before calling the LAPACK functions.
It might be possible to recompile those functions for higher precision, but that may not be worth the effort for your particular problem.
(As a side note, I'm not very familiar with scipy, so it may be the case that they have eigensolvers written in python which support all different types, but you need to be careful that they are actually doing every step in the higher precision and not silently rounding to float64 somewhere.)
For your problem, I would suggest using the package mpmath, which supports arbitrary precision linear algebra.
It is a bit slower since everything is done in software, but for 14x14 matrices it should still be pretty quick.
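A minimal sketch of that route, assuming a real symmetric matrix (mpmath's eighe is the complex Hermitian counterpart), with a small tridiagonal example standing in for the 14x14 one:

    from mpmath import mp

    mp.dps = 50  # work with roughly 50 significant digits

    # Illustrative 4x4 symmetric tridiagonal matrix.
    n = 4
    A = mp.matrix(n, n)
    for i in range(n):
        A[i, i] = 2
        if i + 1 < n:
            A[i, i + 1] = A[i + 1, i] = -1

    E, Q = mp.eigsy(A)   # eigenvalues E and eigenvectors Q, all in extended precision
    print(E)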

How to speed up matrix functions such as expm function in scipy/numpy?

I'm using scipy and numpy to compute the matrix exponential of a 6x6 matrix many times.
Compared to Matlab, it's about 10 times slower.
The function I'm using is scipy.linalg.expm; I have also tried the deprecated methods scipy.linalg.expm2 and scipy.linalg.expm3, and those are only about two times faster than expm. My questions are:
1. What's wrong with expm2 and expm3, given that they are faster than expm?
2. I'm using the wheel package from http://www.lfd.uci.edu/~gohlke/pythonlibs/, and I found https://software.intel.com/en-us/articles/building-numpyscipy-with-intel-mkl-and-intel-fortran-on-windows. Is the wheel package compiled with MKL? If not, can I optimize numpy and scipy by compiling them myself with MKL?
3. Are there any other ways to optimize the performance?
Well, I think I have found answers to questions 1 and 2 by myself:
1. It seems expm2 and expm3 return an array rather than a matrix, but they are about 2 times faster than expm.
2. After a whole day trying to compile scipy with MKL, I succeeded. It's really hard to build scipy, especially on Windows with x64 and Python 3, and it turned out to be a waste of time: it's not even a bit faster than the whl package from http://www.lfd.uci.edu/~gohlke/pythonlibs/ .
I'm still hoping someone can answer question 3.
Your matrix is relatively small, so maybe the numerical part is not the bottleneck. You should use a profiler to make sure that the limitation really is in the exponentiation.
You can also take a look at the source code of these implementations and write an equivalent function with fewer conditionals and less checking.
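As a rough sketch of how one might check that (the 6x6 matrix here is just random filler):

    import numpy as np
    from scipy.linalg import expm
    import cProfile
    import timeit

    A = np.random.rand(6, 6)

    # How long does a single expm call actually take?
    print(timeit.timeit(lambda: expm(A), number=1000), "s for 1000 calls")

    # Where is the time spent (input validation, norm estimation, solves, ...)?
    cProfile.run("for _ in range(1000): expm(A)", sort="cumtime")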

If I use python pandas, is there any need for structured arrays?

Now that pandas provides a data frame structure, is there any need for structured/record arrays in numpy? There are some modifications I need to make to existing code that requires this structured array framework, but I am considering using pandas in its place from this point forward. Will I at any point find that I need some functionality of structured/record arrays that pandas does not provide?
pandas's DataFrame is a high level tool while structured arrays are a very low-level tool, enabling you to interpret a binary blob of data as a table-like structure. One thing that is hard to do in pandas is nested data types with the same semantics as structured arrays, though this can be imitated with hierarchical indexing (structured arrays can't do most things you can do with hierarchical indexing).
Structured arrays are also amenable to working with massive tabular data sets loaded via memory maps (np.memmap). This is a limitation that will be addressed in pandas eventually, though.
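A rough sketch of that use case (the file name and dtype below are made up for illustration):

    import numpy as np

    # A structured dtype mapped directly onto a binary file: only the rows
    # you touch are pulled into memory.
    dt = np.dtype([("id", "i8"), ("price", "f8"), ("volume", "i4")])

    np.zeros(1000, dtype=dt).tofile("ticks.bin")   # create a demo file

    table = np.memmap("ticks.bin", dtype=dt, mode="r+")
    table["price"][:10] = 1.5                      # column access, written through to disk
    print(table[0], table["price"].mean())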
I'm currently in the middle of a transition from various Numpy arrays to Pandas DataFrames. This has been relatively painless since Pandas, AFAIK, is built largely on top of Numpy. What I mean by that is that .mean(), .sum(), etc. all work as you would hope. On top of that, the ability to add a hierarchical index and use the .ix[] (index) attribute and .xs() (cross-section) method to pull out arbitrary pieces of the data has greatly improved the readability and performance of my code (mainly by reducing the number of round trips to my database).
One thing I haven't fully investigated yet is Pandas compatibility with the more advanced functionality of Scipy and Matplotlib. However, in case of any issues, it's easy enough to pull out a single column that behaves enough like an array for those libraries to work, or even convert to an array on the fly. A DataFrame's plotting methods, for instance, rely on matplotlib and take care of any conversion for you.
Also, if you're like me and your main use of Scipy is the statistics module, pystatsmodels is quickly maturing and relies heavily on pandas.
That's my two cents' worth
I never took the time to dig into pandas, but I use structured arrays quite often in numpy. Here are a few considerations:
Structured arrays are as convenient as recarrays with less overhead, if you don't mind losing the ability to access fields by attribute. But then, have you ever tried to use min or max as a field name in a recarray?
NumPy has been developed over a far longer period than pandas, with a larger crew, and it has become ubiquitous enough that a lot of third-party packages rely on it. You can expect structured arrays to be more portable than pandas DataFrames.
Are pandas DataFrames easily picklable? Can they be sent back and forth with PyTables, for example? (See the sketch below.)
Unless you're 100% sure that you'll never have to share your code with non-pandas users, you might want to keep some structured arrays around.
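For what it's worth, round-tripping between the two worlds is cheap, which softens the portability concern; a small sketch (column names and values are made up, and the pickling bit answers the question above):

    import pickle

    import pandas as pd

    df = pd.DataFrame({"min": [1, 2], "max": [3.0, 4.0]})  # field names that trip up recarray attribute access

    rec = df.to_records(index=False)   # DataFrame -> structured/record array
    back = pd.DataFrame(rec)           # and back again

    df2 = pickle.loads(pickle.dumps(df))   # DataFrames pickle fine
    print(rec.dtype, df2.equals(df))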