Some sklearn encoders don't accept many-columned 2D arrays.
Make example data:
lzt_int = [1, 2, 3, 4, 5, 6]
d1_int = np.array(lzt_int)
d2_int_multi = d1_int.reshape(d1_int.shape[0] // 3, 3)
A many-columned 2D array:
>>> d2_int_multi
array([[1, 2, 3],
       [4, 5, 6]])
I want to efficiently turn it into a 3D array of single-column 2D arrays that looks like this:
array([
    [[1],
     [4]],
    [[2],
     [5]],
    [[3],
     [6]],
])
Transformation attempts
>>> d2_int_multi.reshape(3, 2, 1, order='C')
array([[[1],
        [2]],
       [[3],
        [4]],
       [[5],
        [6]]])
>>> d2_int_multi.reshape(3, 2, 1, order='F')
array([[[1],
        [5]],
       [[4],
        [3]],
       [[2],
        [6]]])
>>> d2_int_multi.reshape(3, 2, 1, order='A')
array([[[1],
        [2]],
       [[3],
        [4]],
       [[5],
        [6]]])
For the sake of memory, I'd prefer not to access each column and turn it into a 2D array before adding it to a 3D array.
You want to add an extra axis and then transpose the result. You can do those operations with this line:
d2_int_multi[None].T
Since this doesn't move any data but only creates a new view of the original array, it's very efficient.
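For instance, with d2_int_multi from above, this produces exactly the desired layout (a quick check; np.shares_memory confirms no data is copied):
>>> d2_int_multi[None].T
array([[[1],
        [4]],
       [[2],
        [5]],
       [[3],
        [6]]])
>>> np.shares_memory(d2_int_multi, d2_int_multi[None].T)
True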
I'm looking for a way to select multiple slices from a numpy array at once. Say we have a 1D data array and want to extract three portions of it like below:
data_extractions = []
for start_index in range(0, 3):
    data_extractions.append(data[start_index: start_index + 5])
Afterwards data_extractions will be:
data_extractions = [
    data[0:5],
    data[1:6],
    data[2:7]
]
Is there any way to perform the above operation without the for loop? Some sort of indexing scheme in numpy that would let me select multiple slices from an array and return them as that many arrays, say in an n+1 dimensional array?
I thought maybe I could replicate my data and then select a span from each row, but the code below throws an IndexError:
replicated_data = np.vstack([data] * 3)
data_extractions = replicated_data[[range(3)], [slice(0, 5), slice(1, 6), slice(2, 7)]]
You can use an array of indexes to select the rows you want in the appropriate shape.
For example:
data = np.random.normal(size=(100,2,2,2))
# Creating an array of row-indexes
indexes = np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
# data[indexes] will return an element of shape (3,5,2,2,2). Converting
# to list happens along axis 0
data_extractions = list(data[indexes])
np.all(data_extractions[1] == data[1:6])
True
The final comparison is against the original data.
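For the 1D case from the question, the same idea looks like this (a minimal sketch):
data = np.arange(10)
indexes = np.array([np.arange(0, 5), np.arange(1, 6), np.arange(2, 7)])
data[indexes]
# array([[0, 1, 2, 3, 4],
#        [1, 2, 3, 4, 5],
#        [2, 3, 4, 5, 6]])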
stride_tricks can do that
a = np.arange(10)
b = np.lib.stride_tricks.as_strided(a, (3, 5), 2 * a.strides)
b
# array([[0, 1, 2, 3, 4],
#        [1, 2, 3, 4, 5],
#        [2, 3, 4, 5, 6]])
Please note that b references the same memory as a, in fact multiple times (for example b[0, 1] and b[1, 0] are the same memory address). It is therefore safest to make a copy before working with the new structure.
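For example (assuming b from above):
b_safe = b.copy()    # owns its own data
b_safe[0, 1] = 99    # leaves a and the other windows untouched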
The nd case can be done in a similar fashion, for example 2d -> 4d:
a = np.arange(16).reshape(4, 4)
b = np.lib.stride_tricks.as_strided(a, (3,3,2,2), 2*a.strides)
b.reshape(9, 2, 2)  # this forces a copy
# array([[[ 0,  1],
#         [ 4,  5]],
#        [[ 1,  2],
#         [ 5,  6]],
#        [[ 2,  3],
#         [ 6,  7]],
#        [[ 4,  5],
#         [ 8,  9]],
#        [[ 5,  6],
#         [ 9, 10]],
#        [[ 6,  7],
#         [10, 11]],
#        [[ 8,  9],
#         [12, 13]],
#        [[ 9, 10],
#         [13, 14]],
#        [[10, 11],
#         [14, 15]]])
This post presents an approach with a strided-indexing scheme using np.lib.stride_tricks.as_strided that basically creates a view into the input array, and as such is pretty efficient to create and, being a view, occupies no additional memory. Also, this works for ndarrays with a generic number of dimensions.
Here's the implementation -
def strided_axis0(a, L):
    # Store the shape and strides info
    shp = a.shape
    s = a.strides
    # Compute length of output array along the first axis
    nd0 = shp[0] - L + 1
    # Setup shape and strides for use with np.lib.stride_tricks.as_strided
    # and get (n+1) dim output array
    shp_in = (nd0, L) + shp[1:]
    strd_in = (s[0],) + s
    return np.lib.stride_tricks.as_strided(a, shape=shp_in, strides=strd_in)
Sample run for a 4D array case -
In [44]: a = np.random.randint(11,99,(10,4,2,3)) # Array
In [45]: L = 5 # Window length along the first axis
In [46]: out = strided_axis0(a, L)
In [47]: np.allclose(a[0:L], out[0]) # Verify outputs
Out[47]: True
In [48]: np.allclose(a[1:L+1], out[1])
Out[48]: True
In [49]: np.allclose(a[2:L+2], out[2])
Out[49]: True
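On NumPy 1.20+ the built-in np.lib.stride_tricks.sliding_window_view covers the same pattern without the manual stride math; note it places the window axis last, so a moveaxis is needed to match strided_axis0's layout (a quick check against the run above):
In [50]: w = np.lib.stride_tricks.sliding_window_view(a, L, axis=0)
In [51]: np.allclose(np.moveaxis(w, -1, 1), out)
Out[51]: True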
You can slice your array with a prepared slicing array
a = np.array(list('abcdefg'))
b = np.array([
    [0, 1, 2, 3, 4],
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6]
])
a[b]
However, b doesn't have to be generated by hand in this way. It can be built more dynamically with broadcasting:
b = np.arange(5) + np.arange(3)[:, None]
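The same broadcasting pattern scales to any number of fixed-width slices, e.g. (a small sketch reusing a from above):
n_slices, width = 3, 5
b = np.arange(width) + np.arange(n_slices)[:, None]
a[b]
# array([['a', 'b', 'c', 'd', 'e'],
#        ['b', 'c', 'd', 'e', 'f'],
#        ['c', 'd', 'e', 'f', 'g']], dtype='<U1')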
In the general case you have to do some sort of iteration - and concatenation - either when constructing the indexes or when collecting the results. It's only when the slicing pattern is itself regular that you can use a generalized slicing via as_strided.
The accepted answer constructs an indexing array, one row per slice. So that is iterating over the slices, and arange itself is a (fast) iteration. And np.array concatenates them on a new axis (np.stack generalizes this).
In [264]: np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
Out[264]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
There are indexing_tricks convenience methods to do the same thing:
In [265]: np.r_[0:5, 1:6, 2:7]
Out[265]: array([0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6])
This takes the slicing notation, expands it with arange, and concatenates. It even lets me expand and concatenate into 2d:
In [269]: np.r_['0,2', 0:5, 1:6, 2:7]
Out[269]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
In [270]: data=np.array(list('abcdefghijk'))
In [272]: data[np.r_['0,2',0:5, 1:6, 2:7]]
Out[272]:
array([['a', 'b', 'c', 'd', 'e'],
       ['b', 'c', 'd', 'e', 'f'],
       ['c', 'd', 'e', 'f', 'g']],
      dtype='<U1')
In [273]: data[np.r_[0:5, 1:6, 2:7]]
Out[273]:
array(['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'f', 'c', 'd', 'e',
       'f', 'g'],
      dtype='<U1')
Concatenating results after indexing also works.
In [274]: np.stack([data[0:5],data[1:6],data[2:7]])
My memory from other SO questions is that the relative timings are in the same order of magnitude. They may vary, for example, with the number of slices relative to their length. Overall, the number of values that have to be copied from source to target will be the same.
If the slices vary in length, you'd have to use flat indexing.
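For instance, with ragged slice lengths, one hedged sketch is to concatenate flat and split the result afterwards (reusing data from In [270]):
In [275]: lengths = [5, 2, 6]
In [276]: flat = data[np.r_[0:5, 1:3, 2:8]]            # slices of different lengths
In [277]: pieces = np.split(flat, np.cumsum(lengths)[:-1])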
No matter which approach you choose, if two slices contain the same element, the result doesn't support mathematical operations correctly unless you use ufunc.at, which can be less efficient than a loop. For testing:
import numpy as np

def as_strides(arr, window_size, stride, writeable=False):
    '''Get a strided sub-matrices view of a 4D ndarray.

    Args:
        arr (ndarray): input array with shape (batch_size, m1, n1, c).
        window_size (tuple): with shape (m2, n2).
        stride (tuple): stride of windows in (y_stride, x_stride).
        writeable (bool): it is recommended to keep it False unless needed.

    Returns:
        subs (view): strided window view, with shape
            (batch_size, y_nwindows, x_nwindows, m2, n2, c).

    See also numpy.lib.stride_tricks.sliding_window_view.
    '''
    batch_size = arr.shape[0]
    m1, n1, c = arr.shape[1:]
    m2, n2 = window_size
    y_stride, x_stride = stride

    view_shape = (batch_size, 1 + (m1 - m2) // y_stride,
                  1 + (n1 - n2) // x_stride, m2, n2, c)
    strides = (arr.strides[0], y_stride * arr.strides[1],
               x_stride * arr.strides[2]) + arr.strides[1:]

    subs = np.lib.stride_tricks.as_strided(arr,
                                           view_shape,
                                           strides=strides,
                                           writeable=writeable)
    return subs
np.random.seed(1)
Xs = as_strides(np.random.randn(1, 5, 5, 2), (3, 3), (2, 2), writeable=True)[0]
print('input\n0,0\n', Xs[0, 0])
np.add.at(Xs, np.s_[:], 5)
print('unbuffered sum output\n0,0\n', Xs[0, 0])
np.add.at(Xs, np.s_[:], -5)
Xs = Xs + 5
print('normal sum output\n0,0\n', Xs[0, 0])
We can use a list comprehension for this:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
data_extractions = [data[b:b + 5] for b in [1, 2, 3, 4, 5]]
data_extractions
Results
[array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]), array([4, 5, 6, 7, 8]), array([5, 6, 7, 8, 9]), array([ 6, 7, 8, 9, 10])]
Here is a Numpy array I would like to mask (note it is not a strict 2D array):
a = array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
This seems impossible, however. I would like to understand why, and possibly how to treat this kind of example, where I get a mask from a's values and apply it to another array with the same shape.
Thank you very much.
This is an object dtype array, containing 3 elements (which happen to be arrays themselves):
In [94]: a = np.array([np.array([0, 1, 2, 3, 4]), np.array([0, 1]), np.array([0, 1, 2, 3, 4])], dtype=object)
In [95]: a
Out[95]: array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
In [96]: a.shape
Out[96]: (3,)
In [97]: a[1]
Out[97]: array([0, 1])
What do you mean by mask?
I can apply a boolean index to it:
In [99]: a[np.array([True,False,True])]
Out[99]: array([array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])], dtype=object)
a == np.array([0, 1]) produces a warning and False; in general, == (and other comparison tests) does not work well with object dtype arrays.
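If the goal is an element-wise mask across the ragged sub-arrays, a workable (if loop-based) sketch is to mask each sub-array separately; b here is a hypothetical second array with the same ragged shape:
In [100]: b = np.array([np.arange(5) * 10, np.arange(2) * 10, np.arange(5) * 10], dtype=object)
In [101]: np.array([bb[aa > 1] for aa, bb in zip(a, b)], dtype=object)
Out[101]: array([array([20, 30, 40]), array([], dtype=int64), array([20, 30, 40])], dtype=object)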
Maybe what you need is Pandas DataFrames, which can hold missing values. In your case you could do something like this:
>>> import pandas as pd
>>> df = pd.DataFrame([a[0].tolist(), a[1].tolist(), a[2].tolist()])
>>> df
   0  1    2    3    4
0  0  1  2.0  3.0  4.0
1  0  1  NaN  NaN  NaN
2  0  1  2.0  3.0  4.0
DataFrames are very powerful and they have more appropriate methods than Numpy arrays when you think of something that is more like a spreadsheet than like a matrix.
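With the data in a DataFrame, a boolean mask built from one frame can also be applied to another frame of the same shape, for example:
>>> mask = df > 1          # boolean DataFrame; NaN compares as False
>>> df[mask]               # keeps values where mask is True, NaN elsewhere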
You have an original sparse matrix X:
>>> print type(X)
<class 'scipy.sparse.csr.csr_matrix'>
>>> print X.todense()
[[1, 4, 3]
 [3, 4, 1]
 [2, 1, 1]
 [3, 6, 3]]
You have a second sparse matrix Z, which is derived from some rows of X (say the values are doubled so we can see the difference between the two matrices). In pseudo-code:
>>> Z = X[[0, 2, 3]]
>>> print Z.todense()
[[1, 4, 3]
 [2, 1, 1]
 [3, 6, 3]]
>>> Z = Z * 2
>>> print Z.todense()
[[2, 8, 6]
 [4, 2, 2]
 [6, 12, 6]]
What's the best way of retrieving the rows in Z using the ORIGINAL indices from X? So for instance, in pseudo-code:
>>> print Z[[0, 3]]
[[2, 8, 6]    # row 0 in Z, which was row 0 in X
 [6, 12, 6]]  # row 2 in Z, which was row 3 in X
That is, how can you retrieve rows from Z using indices that refer to the rows' original positions in X? To do this, you can't modify X in any way (you can't add an index column to the matrix X), but there are no other limits.
If you have the original indices in an array i, and the values in i are in increasing order (as in your example), you can use numpy.searchsorted(i, [0, 3]) to find the indices in Z that correspond to indices [0, 3] in the original X. Here's a demonstration in an IPython session:
In [37]: from numpy import array, searchsorted

In [38]: from scipy.sparse import csr_matrix

In [39]: X = csr_matrix([[1,4,3],[3,4,1],[2,1,1],[3,6,3]])
In [40]: X.todense()
Out[40]:
matrix([[1, 4, 3],
        [3, 4, 1],
        [2, 1, 1],
        [3, 6, 3]])
In [41]: i = array([0, 2, 3])
In [42]: Z = 2 * X[i]
In [43]: Z.todense()
Out[43]:
matrix([[ 2,  8,  6],
        [ 4,  2,  2],
        [ 6, 12,  6]])
In [44]: Zsub = Z[searchsorted(i, [0, 3])]
In [45]: Zsub.todense()
Out[45]:
matrix([[ 2,  8,  6],
        [ 6, 12,  6]])
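If the values in i were not in increasing order, searchsorted would no longer apply; one hedged alternative is an explicit position lookup:
In [46]: lookup = {orig: pos for pos, orig in enumerate(i)}
In [47]: Zsub = Z[[lookup[idx] for idx in [0, 3]]]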
I am trying to create a dictionary from a relatively large numpy array. I tried using the dictionary constructor like so:
elements =dict((k,v) for (a[:,0] , a[:,-1]) in myarray)
I am assuming I am doing this incorrectly since I get the error: "ValueError: too many values to unpack"
The NumPy array looks like this:
[ 2.01206281e+13 -8.42110000e+04 -8.42110000e+04 ..., 0.00000000e+00
3.30000000e+02 -3.90343147e-03]
I want the first column (2.01206281e+13) to be the key and the last column (-3.90343147e-03) to be the value for each row in the array.
Am I on the right track/is there a better way to go about doing this?
Thanks
Edit: let me be more clear I want the first column to be the key and the last column to be the value. I want to do this for every row in the numpy array
This is kind of a hard question to answer without knowing what exactly myarray is, but this might help you get started.
>>> import numpy as np
>>> a = np.random.randint(0, 10, size=(3, 2))
>>> a
array([[1, 6],
       [9, 3],
       [2, 8]])
>>> dict(a)
{1: 6, 2: 8, 9: 3}
or
>>> a = np.random.randint(0, 10, size=(3, 5))
>>> a
array([[9, 7, 4, 4, 6],
       [8, 9, 1, 6, 5],
       [7, 5, 3, 4, 7]])
>>> dict(a[:, [0, -1]])
{7: 7, 8: 5, 9: 6}
elements = dict(zip(*[iter(myarray)] * 2))
What we do here is create an iterator over myarray, put it in a list, and double that list. The same iterator is now bound to both the first and the second position in the list, which we unpack as arguments to zip; zip then produces pairs of consecutive elements, which the dict constructor consumes.
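For example, on a flat sequence of alternating keys and values, the pairing looks like this:
>>> myarray = [1, 'a', 2, 'b', 3, 'c']
>>> dict(zip(*[iter(myarray)] * 2))
{1: 'a', 2: 'b', 3: 'c'}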