Numpy reshape "2D many columns" to "3D of 2D single columns"

Some sklearn encoders don't accept many-columned 2D arrays.
Make example data
lzt_int = [1, 2, 3, 4, 5, 6]
d1_int = np.array(lzt_int)
d2_int_multi = d1_int.reshape(d1_int.shape[0] // 3, 3)
A many-columned 2D array:
>>> d2_int_multi
array([[1, 2, 3],
       [4, 5, 6]])
I want to efficiently turn this into a 3D array of single-column 2D arrays that looks like this:
array([[[1],
        [4]],
       [[2],
        [5]],
       [[3],
        [6]]])
Transformation attempts
>>> d2_int_multi.reshape(3, 2, 1, order='C')
array([[[1],
        [2]],
       [[3],
        [4]],
       [[5],
        [6]]])
>>> d2_int_multi.reshape(3, 2, 1, order='F')
array([[[1],
        [5]],
       [[4],
        [3]],
       [[2],
        [6]]])
>>> d2_int_multi.reshape(3, 2, 1, order='A')
array([[[1],
        [2]],
       [[3],
        [4]],
       [[5],
        [6]]])
For the sake of memory, I'd prefer not to extract each column, turn it into a 2D array, and then add it to a 3D array.

You want to add an extra axis and then transpose the result. You can do those operations with this line:
d2_int_multi[None].T
Since this doesn't move any data but only creates a new view of the original array, it's very efficient.
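For example, a quick check on the example data above (a minimal sketch; np.shares_memory confirms that no data was copied):
>>> d3 = d2_int_multi[None].T
>>> d3.shape
(3, 2, 1)
>>> d3[0]                 # first column of d2_int_multi, as a 2D single column
array([[1],
       [4]])
>>> np.shares_memory(d3, d2_int_multi)
True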

Related

Selecting multiple slices from a numpy array at once [duplicate]

I'm looking for a way to select multiple slices from a numpy array at once. Say we have a 1D data array and want to extract three portions of it like below:
data_extractions = []
for start_index in range(0, 3):
data_extractions.append(data[start_index: start_index + 5])
Afterwards data_extractions will be:
data_extractions = [
    data[0:5],
    data[1:6],
    data[2:7]
]
Is there any way to perform the above operation without the for loop? Some sort of indexing scheme in numpy that would let me select multiple slices from an array and return them as that many arrays, say in an (n+1)-dimensional array?
I thought maybe I could replicate my data and then select a span from each row, but the code below throws an IndexError:
replicated_data = np.vstack([data] * 3)
data_extractions = replicated_data[[range(3)], [slice(0, 5), slice(1, 6), slice(2, 7)]]
You can use the indexes to select the rows you want into the appropriate shape.
For example:
data = np.random.normal(size=(100,2,2,2))
# Creating an array of row-indexes
indexes = np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
# data[indexes] returns an array of shape (3,5,2,2,2); converting
# to a list splits it along axis 0
data_extractions = list(data[indexes])
np.all(data_extractions[1] == data[1:6])
True
The final comparison is against the original data.
stride_tricks can do that
a = np.arange(10)
b = np.lib.stride_tricks.as_strided(a, (3, 5), 2 * a.strides)
b
# array([[0, 1, 2, 3, 4],
#        [1, 2, 3, 4, 5],
#        [2, 3, 4, 5, 6]])
Please note that b references the same memory as a, in fact multiple times (for example b[0, 1] and b[1, 0] are the same memory address). It is therefore safest to make a copy before working with the new structure.
The n-dimensional case can be done in a similar fashion, for example 2D -> 4D:
a = np.arange(16).reshape(4, 4)
b = np.lib.stride_tricks.as_strided(a, (3,3,2,2), 2*a.strides)
b.reshape(9,2,2) # this forces a copy
# array([[[ 0,  1],
#         [ 4,  5]],
#        [[ 1,  2],
#         [ 5,  6]],
#        [[ 2,  3],
#         [ 6,  7]],
#        [[ 4,  5],
#         [ 8,  9]],
#        [[ 5,  6],
#         [ 9, 10]],
#        [[ 6,  7],
#         [10, 11]],
#        [[ 8,  9],
#         [12, 13]],
#        [[ 9, 10],
#         [13, 14]],
#        [[10, 11],
#         [14, 15]]])
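On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view offers the same windowing through a safer interface (it returns a read-only view by default); a minimal sketch of the 1D case:
a = np.arange(10)
b = np.lib.stride_tricks.sliding_window_view(a, 5)[:3]   # first three length-5 windows
b
# array([[0, 1, 2, 3, 4],
#        [1, 2, 3, 4, 5],
#        [2, 3, 4, 5, 6]])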
Here is an approach with a strided-indexing scheme using np.lib.stride_tricks.as_strided that creates a view into the input array, so it is efficient to create and, being a view, occupies no additional memory. It also works for ndarrays with a generic number of dimensions.
Here's the implementation -
def strided_axis0(a, L):
    # Store the shape and strides info
    shp = a.shape
    s = a.strides
    # Compute length of output array along the first axis
    nd0 = shp[0] - L + 1
    # Setup shape and strides for use with np.lib.stride_tricks.as_strided
    # and get the (n+1)-dim output array
    shp_in = (nd0, L) + shp[1:]
    strd_in = (s[0],) + s
    return np.lib.stride_tricks.as_strided(a, shape=shp_in, strides=strd_in)
Sample run for a 4D array case -
In [44]: a = np.random.randint(11,99,(10,4,2,3)) # Array
In [45]: L = 5 # Window length along the first axis
In [46]: out = strided_axis0(a, L)
In [47]: np.allclose(a[0:L], out[0]) # Verify outputs
Out[47]: True
In [48]: np.allclose(a[1:L+1], out[1])
Out[48]: True
In [49]: np.allclose(a[2:L+2], out[2])
Out[49]: True
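Tying this back to the original 1D question, the same function reproduces the three requested slices (a quick check, reusing strided_axis0 from above):
In [50]: data = np.arange(10)
In [51]: out = strided_axis0(data, 5)[:3]   # first three length-5 windows
In [52]: np.allclose(out[1], data[1:6])
Out[52]: True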
You can slice your array with a prepared slicing array
a = np.array(list('abcdefg'))
b = np.array([
    [0, 1, 2, 3, 4],
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6]
])
a[b]
However, b doesn't have to be generated by hand in this way. It can be built more dynamically with broadcasting:
b = np.arange(5) + np.arange(3)[:, None]
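For instance, with hypothetical n_windows and window_len parameters, the same broadcast builds the index for any window count and length (a minimal sketch, reusing a from above):
n_windows, window_len = 3, 5
b = np.arange(window_len) + np.arange(n_windows)[:, None]   # shape (3, 5)
a[b]
# array([['a', 'b', 'c', 'd', 'e'],
#        ['b', 'c', 'd', 'e', 'f'],
#        ['c', 'd', 'e', 'f', 'g']],
#       dtype='<U1')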
In the general case you have to do some sort of iteration - and concatenation - either when constructing the indexes or when collecting the results. It's only when the slicing pattern is itself regular that you can use a generalized slicing via as_strided.
The accepted answer constructs an indexing array, one row per slice. So that is iterating over the slices, and arange itself is a (fast) iteration. And np.array concatenates them on a new axis (np.stack generalizes this).
In [264]: np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
Out[264]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
The index_tricks module provides convenience methods to do the same thing:
In [265]: np.r_[0:5, 1:6, 2:7]
Out[265]: array([0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6])
This takes the slicing notation, expands it with arange, and concatenates. It even lets me expand and concatenate into 2D:
In [269]: np.r_['0,2',0:5, 1:6, 2:7]
Out[269]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
In [270]: data=np.array(list('abcdefghijk'))
In [272]: data[np.r_['0,2',0:5, 1:6, 2:7]]
Out[272]:
array([['a', 'b', 'c', 'd', 'e'],
       ['b', 'c', 'd', 'e', 'f'],
       ['c', 'd', 'e', 'f', 'g']],
      dtype='<U1')
In [273]: data[np.r_[0:5, 1:6, 2:7]]
Out[273]:
array(['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'f', 'c', 'd', 'e',
       'f', 'g'],
      dtype='<U1')
Concatenating results after indexing also works.
In [274]: np.stack([data[0:5],data[1:6],data[2:7]])
My memory from other SO questions is that relative timings are in the same order of magnitude. It may vary for example with the number of slices versus their length. Overall the number of values that have to be copied from source to target will be the same.
If the slices vary in length, you'd have to use the flat indexing.
No matter which approach you choose, if two slices contain the same element, in-place mathematical operations on the resulting view don't behave correctly unless you use ufunc.at, which can be less efficient than a loop. For testing:
def as_strides(arr, window_size, stride, writeable=False):
    '''Get a strided sub-matrices view of a 4D ndarray.
    Args:
        arr (ndarray): input array with shape (batch_size, m1, n1, c).
        window_size (tuple): with shape (m2, n2).
        stride (tuple): stride of windows in (y_stride, x_stride).
        writeable (bool): it is recommended to keep it False unless needed
    Returns:
        subs (view): strided window view, with shape
            (batch_size, y_nwindows, x_nwindows, m2, n2, c)
    See also numpy.lib.stride_tricks.sliding_window_view
    '''
    batch_size = arr.shape[0]
    m1, n1, c = arr.shape[1:]
    m2, n2 = window_size
    y_stride, x_stride = stride
    view_shape = (batch_size, 1 + (m1 - m2) // y_stride,
                  1 + (n1 - n2) // x_stride, m2, n2, c)
    strides = (arr.strides[0], y_stride * arr.strides[1],
               x_stride * arr.strides[2]) + arr.strides[1:]
    subs = np.lib.stride_tricks.as_strided(arr, view_shape,
                                           strides=strides,
                                           writeable=writeable)
    return subs
import numpy as np
np.random.seed(1)
Xs = as_strides(np.random.randn(1, 5, 5, 2), (3, 3), (2, 2), writeable=True)[0]
print('input\n0,0\n', Xs[0, 0])
np.add.at(Xs, np.s_[:], 5)
print('unbuffered sum output\n0,0\n', Xs[0,0])
np.add.at(Xs, np.s_[:], -5)
Xs = Xs + 5
print('normal sum output\n0,0\n', Xs[0, 0])
We can use a list comprehension for this:
data=np.array([1,2,3,4,5,6,7,8,9,10])
data_extractions=[data[b:b+5] for b in [1,2,3,4,5]]
data_extractions
Results
[array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]), array([4, 5, 6, 7, 8]), array([5, 6, 7, 8, 9]), array([ 6, 7, 8, 9, 10])]

numpy one-dimensional array in case of lists of different lengths

First of all, I'm really sorry if I am posting a very silly question; I am very new to NumPy.
Question:
Scenario 1 :
import numpy as np
data=[1,2,3,4]
type(data)
array=np.array(data)
array
array.ndim
array.shape
OUTPUT:
array
Out[63]: array([1, 2, 3, 4])
array.ndim
Out[64]: 1
array.shape
Out[65]: (4,)
My question is: what is the meaning of (4,)? Does it mean a single row having 4 elements? Can we say it is a row vector which has one row and 4 columns?
If yes, then it creates confusion in the second scenario.
Scenario 2 :
data1 = [[1,2,3,4],[5,6,7]]
array1 = np.array(data1)
array1
array1
array1.ndim
array1.shape
OUTPUT:
array1
Out[67]: array([[1, 2, 3, 4], [5, 6, 7]], dtype=object)
array1.ndim
Out[68]: 1
array1.shape
Out[69]: (2,)
Here my question is: shouldn't array1.shape be (7,), as the dimension is 1?
I want to know how many rows and how many columns this has, and also why the output is (2,).
data is a list of numbers:
In [166]: data = [1,2,3,4]
arr is a 1-dimensional array made from that list, with 4 elements:
In [167]: arr = np.array(data)
In [168]: arr
Out[168]: array([1, 2, 3, 4])
In [169]: arr.ndim
Out[169]: 1
In [170]: arr.shape
Out[170]: (4,)
Note that the display of arr has the same [] as the original list.
When talking about 1d arrays, don't try to use row and column ideas. You did not start with a list of lists, and the array does not have rows. It is 1d.
(4,) is Python notation for a single-element tuple.
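If you do want an explicit row or column vector, you have to add a second dimension yourself; a minimal sketch:
arr.reshape(1, 4)   # explicit row vector, shape (1, 4): array([[1, 2, 3, 4]])
arr.reshape(4, 1)   # explicit column vector, shape (4, 1)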
data1 is a list of lists (of different length):
In [172]: data1 =[[1,2,3,4],[5,6,7]]
In [173]: data1
Out[173]: [[1, 2, 3, 4], [5, 6, 7]]
arr1 is a 1d array containing 2 lists. Note the object dtype:
In [174]: arr1 = np.array(data1)
In [175]: arr1
Out[175]: array([list([1, 2, 3, 4]), list([5, 6, 7])], dtype=object)
To get a 1d array of 7 numbers, we have to concatenate the sublists:
In [176]: np.hstack(data1)
Out[176]: array([1, 2, 3, 4, 5, 6, 7])
That operation is similar to a list join:
In [177]: data1[0] + data1[1]
Out[177]: [1, 2, 3, 4, 5, 6, 7]
If the sublists have equal length then we can make a 2d array - with 2 rows and 4 columns, shape (2,4):
In [178]: data2 =[[1,2,3,4],[5,6,7,8]]
In [179]: np.array(data2)
Out[179]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

numpy mask un-shaped N-dimensional array

Here is a Numpy array I would like to mask (note it is not a strict 2D array):
a = array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
This seems impossible, however. I would like to understand why, and possibly how to treat this kind of example, where I get a mask from a's values and apply it to another array with the same shape.
Thank you very much.
This is an object dtype array, containing 3 elements (which happen to be arrays themselves):
In [94]: a = np.array([np.array([0, 1, 2, 3, 4]), np.array([0, 1]),
    ...:               np.array([0, 1, 2, 3, 4])], dtype=object)
In [95]: a
Out[95]: array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
In [96]: a.shape
Out[96]: (3,)
In [97]: a[1]
Out[97]: array([0, 1])
What do you mean by mask?
I can apply a boolean index to it:
In [99]: a[np.array([True,False,True])]
Out[99]: array([array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])], dtype=object)
a == np.array([0,1]) produces a warning and False; in general, == (and other comparison tests) do not work well with object dtype arrays.
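If you need element-wise masking, one workaround is to loop over the object array's elements, building one boolean mask per sub-array and applying it to the matching sub-array of the other ragged array; a minimal sketch (names and the threshold are illustrative):
import numpy as np

a = np.array([np.array([0, 1, 2, 3, 4]), np.array([0, 1]),
              np.array([0, 1, 2, 3, 4])], dtype=object)
# another ragged array whose sub-arrays have the same lengths
b = np.array([np.arange(10, 15), np.arange(20, 22),
              np.arange(30, 35)], dtype=object)

masks = [sub > 1 for sub in a]                # one boolean mask per sub-array
picked = [bb[m] for bb, m in zip(b, masks)]   # apply each mask to b
print(picked)  # three masked sub-arrays: [12 13 14], [], [32 33 34]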
Maybe what you need is a Pandas DataFrame, which can hold missing values. In your case you could do something like this:
>>> import pandas as pd
>>> df = pd.DataFrame([a[0].tolist(), a[1].tolist(), a[2].tolist()])
>>> df
   0  1    2    3    4
0  0  1  2.0  3.0  4.0
1  0  1  NaN  NaN  NaN
2  0  1  2.0  3.0  4.0
DataFrames are very powerful and they have more appropriate methods than Numpy arrays when you think of something that is more like a spreadsheet than like a matrix.
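Once the data is in a DataFrame, the mask-and-apply workflow is straightforward; a minimal sketch (other is a hypothetical second table with the same shape):
>>> mask = df > 1          # boolean DataFrame; NaN compares as False
>>> other = df * 10        # stand-in for another same-shaped table
>>> masked = other[mask]   # keeps values where mask is True, NaN elsewhere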

Slice numpy array using a sparse matrix

Say I have a sparse matrix c and a numpy array a. I'd like to slice the entries of a based on some condition on c.
import scipy.sparse as sps
import numpy as np
x = np.array([1,0,0,1])
y = np.array([0,0,0,1])
c = sps.csc_matrix((np.ones(4), (x, y)), shape=(2, 2), dtype=int)
a = np.array([[1, 2], [3, 3]])
idx = c != 0
The variable idx is now a sparse matrix of booleans (it only lists the True entries). I would like to index the matrix a and pull out the entries of a where c != 0.
c[idx]
works fine but the following will not work:
a[idx]
I could use idx.todense(), but I am finding that these .todense() calls take up too much memory...
You could index a by getting the indices of the rows and cols where c is nonzero. You can do that by converting c to COO format and using its row and col attributes.
Here's some data for an example:
In [41]: a
Out[41]:
array([[10, 11, 12, 13],
       [14, 15, 16, 17],
       [18, 19, 20, 21],
       [22, 23, 24, 25]])
In [42]: c
Out[42]:
<4x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Column format>
In [43]: c.A
Out[43]:
array([[0, 0, 1, 0],
       [0, 0, 0, 0],
       [1, 0, 1, 0],
       [0, 0, 0, 1]])
Convert c to COO format:
In [45]: c2 = c.tocoo()
In [46]: c2
Out[46]:
<4x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in COOrdinate format>
In [47]: c2.row
Out[47]: array([2, 0, 2, 3], dtype=int32)
In [48]: c2.col
Out[48]: array([0, 2, 2, 3], dtype=int32)
Now index a with c2.row and c2.col to get the values from a at the positions where c is nonzero:
In [49]: a[c2.row, c2.col]
Out[49]: array([18, 12, 20, 25])
Note, however, that the order of the values is not the same as a[idx.A]:
In [50]: a[(c != 0).A]
Out[50]: array([12, 18, 20, 25])
By the way, this type of indexing of a is not "slicing". Slicing refers to indexing a with a "slice", created using the slice notation start:stop:step (or, less commonly, with a builtin slice object slice(start, stop, step)), e.g. a[1:3, :2]. What you are doing is sometimes called "advanced" indexing (e.g. http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html).
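If you need the values in the same row-major order that a[(c != 0).A] produced, you can sort the COO coordinates first; a small sketch continuing the session above:
In [51]: order = np.lexsort((c2.col, c2.row))   # row-major: sort by row, then col
In [52]: a[c2.row[order], c2.col[order]]
Out[52]: array([12, 18, 20, 25])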

referencing rows in a matrix using index from another matrix

You have an original sparse matrix X:
>>> print type(X)
<class 'scipy.sparse.csr.csr_matrix'>
>>> print X.todense()
[[1,4,3]
 [3,4,1]
 [2,1,1]
 [3,6,3]]
You have a second sparse matrix Z, which is derived from some rows of X (say the values are doubled so we can see the difference between the two matrices). In pseudo-code:
>>> Z = X[[0,2,3]]
>>> print Z.todense()
[[1,4,3]
 [2,1,1]
 [3,6,3]]
>>> Z = Z*2
>>> print Z.todense()
[[2, 8, 6]
 [4, 2, 2]
 [6, 12, 6]]
What's the best way of retrieving the rows in Z using the ORIGINAL indices from X. So for instance, in pseudo-code:
>>> print Z[[0,3]]
[[2, 8, 6]    # 0 from Z, what would be row 0 from X
 [6, 12, 6]]  # 2 from Z, what would be row 3 from X
That is, how can you retrieve rows from Z using indices that refer to the rows' original positions in X? To do this, you can't modify X in any way (you can't add an index column to the matrix X), but there are no other limits.
If you have the original indices in an array i, and the values in i are in increasing order (as in your example), you can use numpy.searchsorted(i, [0, 3]) to find the indices in Z that correspond to indices [0, 3] in the original X. Here's a demonstration in an IPython session:
In [39]: X = csr_matrix([[1,4,3],[3,4,1],[2,1,1],[3,6,3]])
In [40]: X.todense()
Out[40]:
matrix([[1, 4, 3],
        [3, 4, 1],
        [2, 1, 1],
        [3, 6, 3]])
In [41]: i = array([0, 2, 3])
In [42]: Z = 2 * X[i]
In [43]: Z.todense()
Out[43]:
matrix([[ 2,  8,  6],
        [ 4,  2,  2],
        [ 6, 12,  6]])
In [44]: Zsub = Z[searchsorted(i, [0, 3])]
In [45]: Zsub.todense()
Out[45]:
matrix([[ 2,  8,  6],
        [ 6, 12,  6]])