numpy mask un-shaped N-dimensional array - numpy

Here is a Numpy array I would like to mask (note it is not a strict 2D array):
a = array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
This seems impossible, however. I would like to understand why, and if possible how to handle this kind of case, where I derive a mask from one array's values and apply it to another array with the same shape.
Thank you very much.

This is an object dtype array, containing 3 elements (which happen to be arrays themselves):
In [94]: a = np.array([np.array([0, 1, 2, 3, 4]), np.array([0, 1]), np.array([0,
...: 1, 2, 3, 4])], dtype=object)
In [95]: a
Out[95]: array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
In [96]: a.shape
Out[96]: (3,)
In [97]: a[1]
Out[97]: array([0, 1])
What do you mean by mask?
I can apply a boolean index to it:
In [99]: a[np.array([True,False,True])]
Out[99]: array([array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])], dtype=object)
a == np.array([0,1]) produces a warning and False. In general, == (and the other comparison tests) do not work well with object dtype arrays.
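If "mask" means an element-wise condition inside each sub-array, one option is to build the masks one sub-array at a time. This is only a sketch under that assumption: the second ragged array b and the condition > 1 are made up for illustration and are not from the question.

```python
import numpy as np

# Ragged object array from the question
a = np.array([np.array([0, 1, 2, 3, 4]),
              np.array([0, 1]),
              np.array([0, 1, 2, 3, 4])], dtype=object)

# Hypothetical second array with the same ragged layout
b = np.array([np.arange(10, 15),
              np.arange(20, 22),
              np.arange(30, 35)], dtype=object)

# One boolean mask per sub-array, computed from the values in a
masks = [sub > 1 for sub in a]

# Apply each mask to the matching sub-array of b
masked_b = [sub_b[m] for sub_b, m in zip(b, masks)]
```

The result is a plain list of arrays, since the selected pieces have different lengths as well.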

Maybe what you need is to use Pandas DataFrames that can hold missing values. In your case you could do something like this:
>>> import pandas as pd
>>> df = pd.DataFrame([a[0].tolist(), a[1].tolist(), a[2].tolist()])
>>> df
0 1 2 3 4
0 0 1 2.0 3.0 4.0
1 0 1 NaN NaN NaN
2 0 1 2.0 3.0 4.0
DataFrames are very powerful and they have more appropriate methods than Numpy arrays when you think of something that is more like a spreadsheet than like a matrix.
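As a rough sketch of how masking could look with this approach: the condition > 1 and the second frame other are assumptions chosen for illustration. Missing entries are NaN, and NaN compares as False, so the padding never passes the mask.

```python
import numpy as np
import pandas as pd

rows = [np.array([0, 1, 2, 3, 4]), np.array([0, 1]), np.array([0, 1, 2, 3, 4])]

# Short rows are padded with NaN
df = pd.DataFrame([r.tolist() for r in rows])

# Boolean mask computed from one frame (NaN > 1 is False)
mask = df > 1

# Hypothetical second frame with the same layout
other = df * 10

# Keep values of `other` where the mask is True, NaN elsewhere
result = other.where(mask)
```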

Related

Numpy reshape "2D many columns" to "3D of 2D single columns"

Some sklearn encoders don't accept many-columned 2D arrays.
Make example data
lzt_int = [1, 2, 3, 4, 5, 6]
d1_int = np.array(lzt_int)
d2_int_multi = d1_int.reshape(d1_int.shape[0] // 3, 3)
A many columned 2D array
>>> d2_int_multi
array([[1, 2, 3],
[4, 5, 6]])
I want to efficiently turn it into a 3D array of 2D single columns that looks like this.
array([
[[1],
[4]],
[[2],
[5]],
[[3],
[6]],
])
Transformation attempts
>>> d2_int_multi.reshape(3, 2, 1, order='C')
array([[[1],
[2]],
[[3],
[4]],
[[5],
[6]]])
>>> d2_int_multi.reshape(3, 2, 1, order='F')
array([[[1],
[5]],
[[4],
[3]],
[[2],
[6]]])
>>> d2_int_multi.reshape(3, 2, 1, order='A')
array([[[1],
[2]],
[[3],
[4]],
[[5],
[6]]])
For the sake of memory, I'd prefer not to pull out each column and turn it into a 2D array before adding it to a 3D array.
You want to add an extra axis and then transpose the result. You can do those operations with this line:
d2_int_multi[None].T
Since this doesn't move any data but only creates a new view of the original array, it's very efficient.
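A quick way to check both the layout and the "view, not copy" claim is sketched below. The variable names are illustrative, and d2.T[:, :, None] is just an alternative spelling of the same idea.

```python
import numpy as np

d2 = np.arange(1, 7).reshape(2, 3)   # [[1, 2, 3], [4, 5, 6]]

# Add a leading axis, then reverse all axes: (1, 2, 3) -> (3, 2, 1)
cols = d2[None].T

# Alternative spelling: transpose first, then add a trailing axis
cols_alt = d2.T[:, :, None]

# Both results still refer to the memory of d2
shares = np.shares_memory(cols, d2)
```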

Multidimensional numpy.outer without flatten

x is N by M matrix.
y is 1 by L vector.
I want to return "outer product" between x and y, let's call it z.
z[n,m,l] = x[n,m] * y[l]
I could probably do this using einsum.
np.einsum("ij,k->ijk", x[:, :, k], y[:, k])
or reshape afterwards.
np.outer(x[:, :, k], y).reshape((x.shape[0],x.shape[1],y.shape[0]))
But I'd like to do this with np.outer alone, or with something that is simpler and memory efficient.
Is there a way?
It's one of those numpy "can't know unless you happen to know" bits: np.outer flattens multidimensional inputs while np.multiply.outer doesn't:
m,n,l = 3,4,5
x = np.arange(m*n).reshape(m,n)
y = np.arange(l)
np.multiply.outer(x,y).shape
# (3, 4, 5)
The code for outer is:
multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
As its docs say, it flattens (i.e. ravel). If the arrays are already 1d, that expression could be written as
a[:,None] * b[None,:]
a[:,None] * b # broadcasting auto adds the None to b
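To make the flattening concrete, here is a small sketch (array sizes chosen arbitrarily) comparing np.outer with np.multiply.outer on a 2d input:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
y = np.arange(5)

# np.outer ravels x first, so the result is 2d: (12, 5)
flat = np.outer(x, y)

# np.multiply.outer keeps the dimensions of x: (3, 4, 5)
full = np.multiply.outer(x, y)

# Reshaping the flattened result recovers the same values
same = np.array_equal(flat.reshape(3, 4, 5), full)
```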
We could apply broadcasting rules to your (n,m)*(1,l):
In [2]: x = np.arange(12).reshape(3,4); y = np.array([[1,2]])
In [3]: x.shape, y.shape
Out[3]: ((3, 4), (1, 2))
You want a (n,m,l), which a (n,m,1) * (1,1,l) achieves. We need to add a trailing dimension to x. The extra leading 1 on y is automatic:
In [4]: z = x[...,None]*y
In [5]: z.shape
Out[5]: (3, 4, 2)
In [6]: z
Out[6]:
array([[[ 0, 0],
[ 1, 2],
[ 2, 4],
[ 3, 6]],
[[ 4, 8],
[ 5, 10],
[ 6, 12],
[ 7, 14]],
[[ 8, 16],
[ 9, 18],
[10, 20],
[11, 22]]])
Using einsum:
In [8]: np.einsum('nm,kl->nml', x, y).shape
Out[8]: (3, 4, 2)
The fact that you approved:
In [9]: np.multiply.outer(x,y).shape
Out[9]: (3, 4, 1, 2)
suggests y isn't really (1,l) but rather (l,). Adjusting for either is easy.
I don't think there's much difference in memory efficiency among these. In this small example In[4] is fastest, but not by much.
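The three approaches can be checked against each other for a (1,l) shaped y. This is a sketch using the same small arrays as above; ravel is used only to hand np.multiply.outer a 1d y.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
y = np.array([[1, 2]])            # shape (1, 2)

# Broadcasting: (3, 4, 1) * (1, 2) -> (3, 4, 2)
z_bcast = x[..., None] * y

# einsum: the size-1 axis k of y is summed away
z_einsum = np.einsum('nm,kl->nml', x, y)

# ufunc outer with y flattened to shape (2,)
z_outer = np.multiply.outer(x, y.ravel())
```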

numpy one dimensional array in case of lists of different length

First of all, I am really sorry if I am posting a very silly question. I am very new to Numpy.
Question:
Scenario 1 :
import numpy as np
data=[1,2,3,4]
type(data)
array=np.array(data)
array
array.ndim
array.shape
OUTPUT:
array
Out[63]: array([1, 2, 3, 4])
array.ndim
Out[64]: 1
array.shape
Out[65]: (4,)
My question is: what is the meaning of (4,)? Does it mean a single row with 4 elements? Can we say it is a row vector with one row and 4 columns?
If yes, then the second scenario is confusing.
Scenario 2 :
data1 =[[1,2,3,4],[5,6,7]]
array1 = np.array(data1)
array1
array1.ndim
array1.shape
OUTPUT:
array1
Out[67]: array([[1, 2, 3, 4], [5, 6, 7]], dtype=object)
array1.ndim
Out[68]: 1
array1.shape
Out[69]: (2,)
Here my question is: shouldn't array1.shape be (7,), since the dimension is 1?
I want to know how many rows and how many columns it has, and also why the output is (2,).
data is a list of numbers:
In [166]: data = [1,2,3,4]
arr is 1 dimensional array made from that list - with 4 elements
In [167]: arr = np.array(data)
In [168]: arr
Out[168]: array([1, 2, 3, 4])
In [169]: arr.ndim
Out[169]: 1
In [170]: arr.shape
Out[170]: (4,)
Note that the display of arr has the same [] as the original list.
When talking about 1d arrays, don't try to use row and column ideas. You did not start with a list of lists, and the array does not have rows. It is 1d.
(4,) is Python notation for a single element tuple.
data1 is a list of lists (of different length):
In [172]: data1 =[[1,2,3,4],[5,6,7]]
In [173]: data1
Out[173]: [[1, 2, 3, 4], [5, 6, 7]]
arr1 is a 1d array containing 2 lists. Note the object dtype:
In [174]: arr1 = np.array(data1)
In [175]: arr1
Out[175]: array([list([1, 2, 3, 4]), list([5, 6, 7])], dtype=object)
To get a 1d array of 7 numbers, we have to concatenate the sublists:
In [176]: np.hstack(data1)
Out[176]: array([1, 2, 3, 4, 5, 6, 7])
That operation is similar to a list join:
In [177]: data1[0] + data1[1]
Out[177]: [1, 2, 3, 4, 5, 6, 7]
If the sublists have equal length then we can make a 2d array - with 2 rows and 4 columns, shape (2,4):
In [178]: data2 =[[1,2,3,4],[5,6,7,8]]
In [179]: np.array(data2)
Out[179]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
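One caveat when re-running the ragged case: newer NumPy releases refuse to build an array from sublists of different lengths unless dtype=object is given explicitly. A minimal sketch with that argument added (np.concatenate is used here in place of np.hstack, which gives the same result for 1d pieces):

```python
import numpy as np

data1 = [[1, 2, 3, 4], [5, 6, 7]]

# Explicit dtype=object: a 1d array holding 2 Python lists
arr1 = np.array(data1, dtype=object)

# Joining the sublists gives the 1d array of 7 numbers
flat = np.concatenate(data1)
```

So arr1.shape counts the 2 list objects, while flat.shape counts the 7 numbers.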

Elegantly generate result array in numpy

I have my X and Y numpy arrays:
X = np.array([0,1,2,3])
Y = np.array([0,1,2,3])
And my function which maps x,y values to Z points:
def z(x,y):
    return x+y
I wish to produce the obvious thing required for a 3D plot: the 2-dimensional numpy array for the corresponding Z-values. I believe it should look like:
Z = np.array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6]])
I can do this in several lines, but I'm looking for the briefest most elegant piece of code.
For a function that is array aware it is more economical to use an open grid:
>>> import numpy as np
>>>
>>> X = np.array([0,1,2,3])
>>> Y = np.array([0,1,2,3])
>>>
>>> def z(x,y):
...     return x+y
...
>>> XX, YY = np.ix_(X, Y)
>>> XX, YY
(array([[0],
[1],
[2],
[3]]), array([[0, 1, 2, 3]]))
>>> z(XX, YY)
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6]])
If your grid axes are ranges you can directly create the grid using np.ogrid
>>> XX, YY = np.ogrid[:4, :4]
>>> XX, YY
(array([[0],
[1],
[2],
[3]]), array([[0, 1, 2, 3]]))
If the function is not array aware you can make it so using np.vectorize:
>>> def f(x, y):
...     if x > y:
...         return x
...     else:
...         return -x
...
>>> np.vectorize(f)(*np.ogrid[-3:4, -3:4])
array([[ 3, 3, 3, 3, 3, 3, 3],
[-2, 2, 2, 2, 2, 2, 2],
[-1, -1, 1, 1, 1, 1, 1],
[ 0, 0, 0, 0, 0, 0, 0],
[ 1, 1, 1, 1, -1, -1, -1],
[ 2, 2, 2, 2, 2, -2, -2],
[ 3, 3, 3, 3, 3, 3, -3]])
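np.vectorize is essentially a Python-level loop, so for a simple branch like this one an array-aware rewrite with np.where may be worth considering. This is a sketch of that alternative, not part of the answer above; it should give the same grid:

```python
import numpy as np

def f(x, y):
    # Scalar-only version, as in the example above
    return x if x > y else -x

XX, YY = np.ogrid[-3:4, -3:4]

# Loop-based evaluation through np.vectorize
Z_vectorize = np.vectorize(f)(XX, YY)

# Array-aware equivalent: pick XX where XX > YY, otherwise -XX
Z_where = np.where(XX > YY, XX, -XX)
```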
One very short way to achieve what you want is to produce a meshgrid from your coordinates:
X,Y = np.meshgrid(x,y)
z = X+Y
or more general:
z = f(X,Y)
or even in one line:
z = f(*np.meshgrid(x,y))
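One detail to watch: np.meshgrid defaults to indexing='xy', so the result is indexed as z[j, i] for x[i], y[j]. That goes unnoticed for a symmetric function like x+y with equal axes, but it does matter in general. A small sketch with a deliberately asymmetric, made-up function:

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([10, 20])

def f(x, y):
    # Asymmetric on purpose, so the axis order is visible
    return x - y

# Default 'xy' indexing: shape (len(y), len(x))
Z_xy = f(*np.meshgrid(x, y))

# 'ij' indexing: shape (len(x), len(y)), same layout as np.ix_
Z_ij = f(*np.meshgrid(x, y, indexing='ij'))
Z_open = f(*np.ix_(x, y))
```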
EDIT:
If your function may also return a constant, you have to somehow infer the dimensions that the result should have. If you want to keep using meshgrids, one very simple way is to rewrite your function like this:
def f(x,y):
    return x*0+y*0+a
where a would be your constant. numpy would then take care of the dimensions for you. This is of course a bit weird looking, so instead you could write
def f(x,y):
    return np.full(x.shape, a)
If you really want to go with functions that work both on scalars and arrays, it's probably best to go with np.vectorize as in @PaulPanzer's answer.
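With open grids from np.ix_ or np.ogrid, x.shape alone is not the full grid shape, so a constant-returning function could use the broadcast shape of both inputs instead. A sketch of that idea; the constant a and the name f_const are made up for illustration:

```python
import numpy as np

a = 7.0  # hypothetical constant

def f_const(x, y):
    # Shape that x and y broadcast to, for open or dense grids alike
    shape = np.broadcast(x, y).shape
    return np.full(shape, a)

# Open grid: XX is (4, 1), YY is (1, 3)
XX, YY = np.ogrid[:4, :3]
Z_open = f_const(XX, YY)

# Dense grid: both inputs are already (4, 3)
Z_dense = f_const(*np.meshgrid(np.arange(4), np.arange(3), indexing='ij'))
```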

Slice numpy array using a sparse matrix

Say I have a sparse matrix c and a numpy array a. I'd like to slice the entries of a based on some condition on c.
import scipy.sparse as sps
import numpy as np
x = np.array([1,0,0,1])
y = np.array([0,0,0,1])
c = sps.csc_matrix( (np.ones((4,)) , (x,y)), shape = (2,2),dtype=int)
a = np.array([ [1,2],[3,3]])
idx = c != 0
The variable idx is now a sparse matrix of booleans (it only lists True's). I would like to use it to index a, selecting the entries of a where c != 0.
c[idx]
works fine but the following will not work:
a[idx]
I could use idx.todense(), but I am finding that these .todense() calls take up too much memory...
You could index a by getting the indices of the rows and cols where c is nonzero. You can do that by converting c to the COO matrix and using the row and col attributes.
Here's some data for an example:
In [41]: a
Out[41]:
array([[10, 11, 12, 13],
[14, 15, 16, 17],
[18, 19, 20, 21],
[22, 23, 24, 25]])
In [42]: c
Out[42]:
<4x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Column format>
In [43]: c.A
Out[43]:
array([[0, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 1, 0],
[0, 0, 0, 1]])
Convert c to COO format:
In [45]: c2 = c.tocoo()
In [46]: c2
Out[46]:
<4x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in COOrdinate format>
In [47]: c2.row
Out[47]: array([2, 0, 2, 3], dtype=int32)
In [48]: c2.col
Out[48]: array([0, 2, 2, 3], dtype=int32)
Now index a with c2.row and c2.col to get the values from a at the positions where c is nonzero:
In [49]: a[c2.row, c2.col]
Out[49]: array([18, 12, 20, 25])
Note, however, that the order of the values is not the same as a[idx.A]:
In [50]: a[(c != 0).A]
Out[50]: array([12, 18, 20, 25])
By the way, this type of indexing of a is not "slicing". Slicing refers to indexing a with a "slice", created using the slice notation start:stop:step (or, less commonly, with a builtin slice object slice(start, stop, step)), e.g. a[1:3, :2]. What you are doing is sometimes called "advanced" indexing (e.g. http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html).
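If the row-major order of the boolean-mask result matters, the COO indices can be sorted before indexing. A sketch that rebuilds the example data above (the constructor arguments are reconstructed from the printed matrix, so treat them as an assumption) and uses np.lexsort for the ordering:

```python
import numpy as np
import scipy.sparse as sps

a = np.arange(10, 26).reshape(4, 4)

# Sparse matrix with ones at (2,0), (0,2), (2,2), (3,3)
rows = np.array([2, 0, 2, 3])
cols = np.array([0, 2, 2, 3])
c = sps.csc_matrix((np.ones(4, dtype=int), (rows, cols)), shape=(4, 4))

coo = c.tocoo()

# Values of a at the stored positions, in whatever order COO lists them
vals = a[coo.row, coo.col]

# Sort by row, then column, to match dense boolean indexing
order = np.lexsort((coo.col, coo.row))
vals_sorted = a[coo.row[order], coo.col[order]]

# Dense reference, used here only to check the small example
dense_ref = a[c.toarray() != 0]
```

Note that this assumes c stores no explicit zeros; if it might, calling c.eliminate_zeros() first would be one way to handle that.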