Working with columns of NumPy matrices

I've been unable to figure out how to access, add, multiply, replace, etc. single columns of a NumPy matrix. I can do this via looping over individual elements of the column, but I'd like to treat the column as a unit, something that I can do with rows.
When I've tried to search, I'm usually directed to answers about NumPy arrays, but this is not the same thing.

Can you provide the code that's giving you trouble? The column operations you list are among the most basic operations supported and optimized in NumPy. Consider looking over the tutorial on NumPy for MATLAB users, which has many examples of accessing rows or columns, performing vectorized operations on them, and modifying them with copies or in place.
NumPy for MATLAB Users
Just to clarify: suppose you have a 2-dimensional NumPy ndarray or matrix called a. Then a[:, 0] accesses the first column just the same as a[0] or a[0, :] accesses the first row. Any operation that works for rows should work for columns as well, with some caveats for broadcasting rules and for certain mathematical operations that depend on array alignment. You can also use the numpy.transpose(a) function (also exposed as a.T) to transpose a, making its columns become rows.
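For instance, here is a minimal sketch of the column operations the question mentions (the array values are arbitrary):
import numpy as np

a = np.arange(12).reshape(3, 4)    # a 3 x 4 array
a[:, 1] = a[:, 1] * 2              # multiply the second column in place
a[:, 2] = [10, 20, 30]             # replace the third column
total = a[:, 0] + a[:, 3]          # add two columns elementwise
print(a)
print(total)
Note that a[:, 1] is a view into a, so assigning to it modifies a directly.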

Related

Typed lists vs ND-arrays in Numba

Could someone please clarify what the benefit of using a Numba typed list over an ND array is? Also, how do the two compare in terms of speed, and in what context would using the typed list be recommended?
Typed lists are useful when you need to append a sequence of elements but you do not know the total number of elements and cannot even find a reasonable bound. Such a data structure is significantly more expensive than a 1D array (both in memory space and computation time).
1D arrays cannot be resized efficiently: a new array needs to be created and a copy must be performed. However, indexing a 1D array is very cheap, and NumPy provides many functions that operate on them natively (lists are implicitly converted to arrays when passed to a NumPy function, and this conversion is expensive). Note that if the number of items can be bounded by a reasonable size (i.e. not much higher than the actual number of elements), you can create a big array, add the elements, and finally work on a sub-view of the array.
ND arrays cannot be directly compared with lists. Note that lists of lists are similar to jagged arrays (they can contain lists of different sizes), while an ND array is like a (fixed-size) N x ... x M table. Lists of lists are very inefficient and often not needed.
As a result, use ND arrays when you can and when you do not need to resize them (or append/remove elements) often. Otherwise, use typed lists. A sketch of both patterns follows below.
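Here is a minimal sketch contrasting the two patterns, assuming Numba is installed (the function names are illustrative):
import numpy as np
from numba import njit
from numba.typed import List

@njit
def collect_positive_list(values):
    # Typed list: grows dynamically when no size bound is known.
    out = List()            # element type is inferred from the first append
    for v in values:
        if v > 0.0:
            out.append(v)
    return out

@njit
def collect_positive_array(values):
    # Pre-allocated array: bound the size, fill it, then return a sub-view.
    buf = np.empty(values.size, dtype=values.dtype)
    n = 0
    for v in values:
        if v > 0.0:
            buf[n] = v
            n += 1
    return buf[:n]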

what is the difference between series/dataframe and ndarray?

Leaving aside that they come from two different libraries:
I know that a Series/DataFrame can hold any data type, and an ndarray can also hold heterogeneous data (via the object dtype).
Also, all of NumPy's slicing operations are applicable to a Series.
Is there any other difference between them?
After some research I found the answer to the question I asked above. For anyone who needs it, here it is from the pandas docs:
A key difference between Series and ndarray is that operations between Series automatically align the data based on the label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.
An example:
s[1:] + s[:-1]
The result above produces NaN for both the first and the last index. If a label is not found in one Series or the other, the result is marked as missing (NaN).
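A minimal, runnable version of that example (the index labels are illustrative):
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.0), index=['a', 'b', 'c', 'd'])
print(s[1:] + s[:-1])
# a    NaN   <- 'a' appears only in s[:-1]
# b    2.0
# c    4.0
# d    NaN   <- 'd' appears only in s[1:]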

Numpy/Scipy pinv and pinv2 behave differently

I am working with two-dimensional arrays in NumPy for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2, everything runs smoothly. However, if I use scipy.linalg.pinv, problems sometimes (30-40% of the time) arise.
The reason I am using pinv2 is that I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j of all 1s, pinv(H) has huge coefficients at row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some NaN values (Y is an array of small integers).
Now, I am not well-versed enough in linear algebra and numerical computation to understand whether this is a bug or some precision-related property of the two functions. I would like you to answer this question so that, if it is a bug, I can file a report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56, and 99 have huge values, while in pinv2(H) the same rows have more reasonable values. Any help is appreciated.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses a least-squares solver, which may be quite compute-intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
Note that in some cases these functions cannot converge to a solution, and will then raise a scipy.linalg.LinAlgError. In that case you may try the other pinv implementation, which can greatly reduce the number of errors you receive.
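For example, a minimal fallback sketch under the pre-1.7.0 API discussed here (H is a placeholder array):
import numpy as np
from scipy import linalg

H = np.random.rand(100, 50)
try:
    H_pinv = linalg.pinv(H)     # least-squares based (pre-SciPy-1.7.0)
except linalg.LinAlgError:
    H_pinv = linalg.pinv2(H)    # SVD-based fallback (deprecated in 1.7.0)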
Starting with SciPy 1.7.0, pinv2 is deprecated, because pinv itself now uses an SVD-based solution:
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means numpy.linalg.pinv, scipy.linalg.pinv, and scipy.linalg.pinv2 now all compute equivalent solutions. They are also comparably fast in their computation, with SciPy being slightly faster.
import numpy as np
import scipy.linalg  # a bare "import scipy" does not expose scipy.linalg

arr = np.random.rand(1000, 2000)
res1 = np.linalg.pinv(arr)       # NumPy, SVD-based
res2 = scipy.linalg.pinv(arr)    # SciPy, SVD-based since 1.7.0
res3 = scipy.linalg.pinv2(arr)   # deprecated alias of the SVD solution
# All three agree to (at least) 10 decimal places.
np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)

Matrices with different row lengths in numpy

Is there a way of defining a matrix (say m) in numpy with rows of different lengths, but such that m stays 2-dimensional (i.e. m.ndim = 2)?
For example, if you define m = numpy.array([[1,2,3], [4,5]]), then m.ndim == 1. I understand why this happens, but I'm interested in whether there is any way to trick numpy into viewing m as 2D. One idea would be padding with a dummy value so that the rows become equally sized, but I have lots of such matrices and it would take up too much space. The reason I really need m to be 2D is that I am working with Theano, and the tensor that will be given the value of m expects a 2D value.
Here is some very new information about Theano: we have a new TypedList type, which allows a Python list whose elements all have the same type, e.g. 1D ndarrays. Everything is done except the documentation.
The functionality available for them is limited, but we added them to allow looping over a typed list with scan. That integration with scan is not finished yet, but you can already use them like this:
import theano
import theano.typed_list

# A typed list whose elements are 1D float vectors.
a = theano.typed_list.TypedListType(theano.tensor.fvector)()
# Loop over the typed list with scan, summing each element.
s, _ = theano.scan(fn=lambda i, tl: tl[i].sum(),
                   non_sequences=[a],
                   sequences=[theano.tensor.arange(2, dtype='int64')])
f = theano.function([a], s)
f([[1, 2, 3], [4, 5]])
One limitation is that the output of scan must be an ndarray, not a typed list.
No, this is not possible. NumPy arrays need to be rectangular in every pair of dimensions. This is due to the way they map onto memory buffers, as a pointer, itemsize, stride triple.
As for this taking up space: np.array([[1,2,3], [4,5]]) actually takes up more space than a 2×3 array, because it's an array of two pointers to Python lists (and even if the elements were converted to arrays, the memory layout would still be inefficient).
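To make the distinction concrete, here is a small sketch (padding with 0 is just an illustrative choice):
import numpy as np

# A "ragged" array is really a 1-D object array holding two Python lists.
ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)
print(ragged.ndim, ragged.dtype)     # 1 object

# Padding the rows to equal length gives a genuinely 2-D array.
padded = np.array([[1, 2, 3], [4, 5, 0]])
print(padded.ndim, padded.shape)     # 2 (2, 3)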

Reshaped views in Parallel Colt

In numpy, there is a flatten operation which allows you to, for example, flatten an m x n matrix down to an array of m*n elements, and a reshape operation which goes in the opposite direction. Much of the time this can be done with a view, without creating a copy of the original data.
Does such a capability exist in Parallel Colt, the Java matrix library? I have not been able to find one. There is a reshape method on one-dimensional matrices, but it appears to create copies.
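For reference, a minimal sketch of the NumPy behavior described above (ravel, unlike flatten, returns a view whenever the memory layout allows it):
import numpy as np

m = np.arange(6).reshape(2, 3)          # a 2 x 3 matrix
flat = m.ravel()                        # a view here: no data is copied
flat[0] = 99
print(m[0, 0])                          # 99 -> flat shares memory with m

restored = flat.reshape(2, 3)           # reshaping back is also a view here
print(np.shares_memory(m, restored))    # True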