Add another column(index) into the array - numpy

I have an array
a = array([ 0.74552751, 0.70868784, 0.7351144 , 0.71597612, 0.77608263,
            0.71213591, 0.77297658, 0.75637376, 0.76636106, 0.76098067,
            0.79142821, 0.71932262, 0.68984604, 0.77008623, 0.76334351,
            0.76129872, 0.76717526, 0.78413129, 0.76483804, 0.75160062,
            0.7532506 ], dtype=float32)
I want to store my array in (item, value) format and can't seem to get it right.
I'm trying to get this format:
a = [(0, 0.001497),
     (1, 0.0061543),
     ..............
     (46, 0.001436781),
     (47, 0.00654533),
     (48, 0.0027139),
     (49, 0.00462962)],

Numpy arrays have a fixed data type that you must specify. It looks like a data type of int for your item and float for your value would work best. Something like:
import numpy as np
dtype = [("item", int), ("value", float)]
a = np.array([(0, 0.), (1, .1), (2, .2)], dtype=dtype)
The string part of the dtype is the name of each field. The names allow you to access the fields more easily like this:
print(a['value'])
# [0.  0.1 0.2]
a['value'] = [7, 8, 9]
print(a)
# [(0, 7.) (1, 8.) (2, 9.)]
If you need to copy another array into the array I describe above, you can do it just by using the field name:
new = np.empty(len(a), dtype)
new['item'] = [3, 4, 5]
new['value'] = a['value']
print(new)
# [(3, 7.) (4, 8.) (5, 9.)]
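Applied to the array from the question, a minimal sketch of the conversion (assuming the int/float dtype above is what's wanted; note the float32 values widen to float64, so the printed digits may differ slightly):

import numpy as np

a = np.array([0.74552751, 0.70868784, 0.7351144], dtype=np.float32)
dtype = [("item", int), ("value", float)]

# pair each index with its value in a structured array
pairs = np.empty(len(a), dtype=dtype)
pairs['item'] = np.arange(len(a))
pairs['value'] = a
print(pairs)
# [(0, 0.7455...) (1, 0.7086...) (2, 0.7351...)]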

Related

Given a (nested) view into a numpy 2D array, how to retrieve the coords w.r.t. the original array

Consider the following:
from scipy.ndimage import median_filter  # assuming scipy's median_filter

A = np.zeros((100,100)) # TODO: populate A
filt = median_filter(A, size=5) # doesn't impact A.shape
view = filt[30:40, 30:40]
subview = view[0:5, 0:5]
Is it possible to extract from subview the corresponding rectangle within A?
I'd like to do something like:
coords = get_rect(subview)
rect_A = A[coords]
But if I'm constantly having to pass bounding rects through the system, the code uglifies fast.
numpy must store this information internally, but is it possible to access it?
PS I'm not doing anything fancy like view = A[::2]
PPS From reviewing the excellent answer, it looks like it should be possible to subclass numpy.ndarray, adding a .parent property and a .get_global_rect() method. But it looks like a HARD task.
__array_interface__ is a way of viewing everything about a numpy array.
In [40]: x = np.arange(24).reshape(4,6)
In [41]: x.__array_interface__
Out[41]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (4, 6),
'version': 3}
In [42]: x.strides
Out[42]: (48, 8)
For a view:
In [43]: y = x[:3,1:4]
In [44]: y
Out[44]:
array([[ 1, 2, 3],
[ 7, 8, 9],
[13, 14, 15]])
In [45]: y.__array_interface__
Out[45]:
{'data': (43385720, False),
'strides': (48, 8),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 3),
'version': 3}
In [46]: y.base
Out[46]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23])
x.base is the same, the original np.arange(24).
The key differences in y are the shape and the data pointer, which "points" 8 bytes (one int64 element) further along.
So while one could, in theory, deduce the indexing used to create y, numpy does not have a function or method to do that for us. Keeping track of your own "coordinates" is the best option.
Another way to put it: y is a numpy.ndarray, just like x. It does not carry any extra information about how it was created, and the same applies to any further view taken of y.
As for the 1d base:
In [48]: x.base.strides
Out[48]: (8,)
In [49]: x.base.shape
Out[49]: (24,)
In [50]: x.base.__array_interface__
Out[50]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (24,),
'version': 3}
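That said, for the plain unit-step slicing the question describes, the byte offset and strides exposed above are enough to reconstruct the rectangle by hand. Below is a minimal sketch under those assumptions (C-contiguous parent, basic slicing with step 1); view_coords is a hypothetical helper, not a numpy API:

import numpy as np

def view_coords(view, parent):
    # byte offset of the view's first element within the parent's buffer
    offset = (view.__array_interface__['data'][0]
              - parent.__array_interface__['data'][0])
    start = []
    for stride in parent.strides:  # peel off one index per axis
        idx, offset = divmod(offset, stride)
        start.append(idx)
    return tuple(slice(s, s + n) for s, n in zip(start, view.shape))

A = np.arange(24).reshape(4, 6)
y = A[1:3, 2:5]
coords = view_coords(y, A)
print(coords)                           # (slice(1, 3, None), slice(2, 5, None))
print(np.shares_memory(A[coords], y))   # True

Since filt in the question has the same shape as A, coordinates recovered relative to filt apply equally to A.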

Transforming a sequence of integers into the binary representation of that sequence's strides [duplicate]

I'm looking for a way to select multiple slices from a numpy array at once. Say we have a 1D data array and want to extract three portions of it like below:
data_extractions = []
for start_index in range(0, 3):
    data_extractions.append(data[start_index: start_index + 5])
Afterwards data_extractions will be:
data_extractions = [
    data[0:5],
    data[1:6],
    data[2:7]
]
Is there any way to perform the above operation without the for loop? Some sort of indexing scheme in numpy that would let me select multiple slices from an array and return them as that many arrays, say in an n+1 dimensional array?
I thought maybe I could replicate my data and then select a span from each row, but the code below throws an IndexError
replicated_data = np.vstack([data] * 3)
data_extractions = replicated_data[[range(3)], [slice(0, 5), slice(1, 6), slice(2, 7)]]
You can use the indexes to select the rows you want into the appropriate shape.
For example:
data = np.random.normal(size=(100,2,2,2))
# Creating an array of row-indexes
indexes = np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
# data[indexes] will return an element of shape (3,5,2,2,2). Converting
# to list happens along axis 0
data_extractions = list(data[indexes])
np.all(data_extractions[1] == data[1:6])
True
The final comparison is against the original data.
stride_tricks can do that
a = np.arange(10)
b = np.lib.stride_tricks.as_strided(a, (3, 5), 2 * a.strides)
b
# array([[0, 1, 2, 3, 4],
# [1, 2, 3, 4, 5],
# [2, 3, 4, 5, 6]])
Please note that b references the same memory as a, in fact multiple times (for example b[0, 1] and b[1, 0] are the same memory address). It is therefore safest to make a copy before working with the new structure.
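On newer NumPy (1.20+, an assumption about your environment), the same windows can be built without computing strides by hand, and the result is read-only by default, which sidesteps the aliasing pitfall just mentioned:

import numpy as np

a = np.arange(10)
# each row is a length-5 window; keep the first three to match the example
b = np.lib.stride_tricks.sliding_window_view(a, 5)[:3]
# array([[0, 1, 2, 3, 4],
#        [1, 2, 3, 4, 5],
#        [2, 3, 4, 5, 6]])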
nd can be done in a similar fashion, for example 2d -> 4d
a = np.arange(16).reshape(4, 4)
b = np.lib.stride_tricks.as_strided(a, (3,3,2,2), 2*a.strides)
b.reshape(9,2,2) # this forces a copy
# array([[[ 0, 1],
# [ 4, 5]],
# [[ 1, 2],
# [ 5, 6]],
# [[ 2, 3],
# [ 6, 7]],
# [[ 4, 5],
# [ 8, 9]],
# [[ 5, 6],
# [ 9, 10]],
# [[ 6, 7],
# [10, 11]],
# [[ 8, 9],
# [12, 13]],
# [[ 9, 10],
# [13, 14]],
# [[10, 11],
# [14, 15]]])
In this post is an approach with a strided-indexing scheme using np.lib.stride_tricks.as_strided that basically creates a view into the input array; as such it is pretty efficient to create, and being a view it occupies no more memory.
Also, this works for ndarrays with generic number of dimensions.
Here's the implementation -
def strided_axis0(a, L):
    # Store the shape and strides info
    shp = a.shape
    s = a.strides

    # Compute length of output array along the first axis
    nd0 = shp[0] - L + 1

    # Setup shape and strides for use with np.lib.stride_tricks.as_strided
    # and get (n+1) dim output array
    shp_in = (nd0, L) + shp[1:]
    strd_in = (s[0],) + s
    return np.lib.stride_tricks.as_strided(a, shape=shp_in, strides=strd_in)
Sample run for a 4D array case -
In [44]: a = np.random.randint(11,99,(10,4,2,3)) # Array
In [45]: L = 5 # Window length along the first axis
In [46]: out = strided_axis0(a, L)
In [47]: np.allclose(a[0:L], out[0]) # Verify outputs
Out[47]: True
In [48]: np.allclose(a[1:L+1], out[1])
Out[48]: True
In [49]: np.allclose(a[2:L+2], out[2])
Out[49]: True
You can slice your array with a prepared slicing array
a = np.array(list('abcdefg'))
b = np.array([
    [0, 1, 2, 3, 4],
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6]
])
a[b]
However, b doesn't have to be generated by hand in this way. It can be built more dynamically with
b = np.arange(5) + np.arange(3)[:, None]
In the general case you have to do some sort of iteration - and concatenation - either when constructing the indexes or when collecting the results. It's only when the slicing pattern is itself regular that you can use a generalized slicing via as_strided.
The accepted answer constructs an indexing array, one row per slice. So that is iterating over the slices, and arange itself is a (fast) iteration. And np.array concatenates them on a new axis (np.stack generalizes this).
In [264]: np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
Out[264]:
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]])
The index_tricks convenience methods do the same thing:
In [265]: np.r_[0:5, 1:6, 2:7]
Out[265]: array([0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6])
This takes the slicing notation, expands it with arange and concatenates. It even lets me expand and concatenate into 2d
In [269]: np.r_['0,2',0:5, 1:6, 2:7]
Out[269]:
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]])
In [270]: data=np.array(list('abcdefghijk'))
In [272]: data[np.r_['0,2',0:5, 1:6, 2:7]]
Out[272]:
array([['a', 'b', 'c', 'd', 'e'],
['b', 'c', 'd', 'e', 'f'],
['c', 'd', 'e', 'f', 'g']],
dtype='<U1')
In [273]: data[np.r_[0:5, 1:6, 2:7]]
Out[273]:
array(['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'f', 'c', 'd', 'e',
'f', 'g'],
dtype='<U1')
Concatenating results after indexing also works.
In [274]: np.stack([data[0:5],data[1:6],data[2:7]])
My memory from other SO questions is that relative timings are in the same order of magnitude. It may vary for example with the number of slices versus their length. Overall the number of values that have to be copied from source to target will be the same.
If the slices vary in length, you'd have to use the flat indexing.
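As a small sketch of that fallback (my own example, not from the answer): concatenate the indices flat, then split the gathered values back by length:

import numpy as np

data = np.arange(20)
lengths = [5, 3, 7]
idx = np.r_[0:5, 1:4, 2:9]   # three slices of different lengths, flattened
parts = np.split(data[idx], np.cumsum(lengths)[:-1])
# [array([0, 1, 2, 3, 4]), array([1, 2, 3]), array([2, 3, 4, 5, 6, 7, 8])]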
No matter which approach you choose, if two slices contain the same element, the view doesn't support in-place mathematical operations correctly unless you use ufunc.at, which can be less efficient than a loop. For testing:
def as_strides(arr, window_size, stride, writeable=False):
    '''Get a strided sub-matrices view of a 4D ndarray.

    Args:
        arr (ndarray): input array with shape (batch_size, m1, n1, c).
        window_size (tuple): with shape (m2, n2).
        stride (tuple): stride of windows in (y_stride, x_stride).
        writeable (bool): it is recommended to keep it False unless needed.
    Returns:
        subs (view): strided window view, with shape
            (batch_size, y_nwindows, x_nwindows, m2, n2, c).

    See also numpy.lib.stride_tricks.sliding_window_view.
    '''
    batch_size = arr.shape[0]
    m1, n1, c = arr.shape[1:]
    m2, n2 = window_size
    y_stride, x_stride = stride

    view_shape = (batch_size, 1 + (m1 - m2) // y_stride,
                  1 + (n1 - n2) // x_stride, m2, n2, c)
    strides = (arr.strides[0], y_stride * arr.strides[1],
               x_stride * arr.strides[2]) + arr.strides[1:]
    subs = np.lib.stride_tricks.as_strided(arr,
                                           view_shape,
                                           strides=strides,
                                           writeable=writeable)
    return subs
import numpy as np
np.random.seed(1)
Xs = as_strides(np.random.randn(1, 5, 5, 2), (3, 3), (2, 2), writeable=True)[0]
print('input\n0,0\n', Xs[0, 0])
np.add.at(Xs, np.s_[:], 5)
print('unbuffered sum output\n0,0\n', Xs[0,0])
np.add.at(Xs, np.s_[:], -5)
Xs = Xs + 5
print('normal sum output\n0,0\n', Xs[0, 0])
We can use a list comprehension for this:
data=np.array([1,2,3,4,5,6,7,8,9,10])
data_extractions=[data[b:b+5] for b in [1,2,3,4,5]]
data_extractions
Results
[array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]), array([4, 5, 6, 7, 8]), array([5, 6, 7, 8, 9]), array([ 6, 7, 8, 9, 10])]

Numpy summation with sliding window is really slow

Code:
shape = np.array([6, 6])
grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
slices = [tuple(slice(box[i], box[i] + 2) for i in range(len(box))) for box in grid]
score = np.zeros((7,7,3))
column = np.random.randn(36, 12) #just for example
column
>> array([[ 0, 1, 2, 3, ... 425, 426, 427, 428, 429, 430, 431]])
column = column.reshape((16, 3, 3, 3))
for i, window in enumerate(slices):
    score[window] += column[i]
score
>> array([[[0.000e+00, 1.000e+00, 2.000e+00],
[3.000e+01, 3.200e+01, 3.400e+01],
[9.000e+01, 9.300e+01, 9.600e+01], ...
[8.280e+02, 8.300e+02, 8.320e+02],
[4.290e+02, 4.300e+02, 4.310e+02]]])
It works, but the last two lines take a really long time since they will run in a loop. The problem is that the grid variable contains an array of windows, and I don't know how to speed up the process.
Let's simplify the problem a bit - reduce the dimensions, and drop the final size 3 dimension:
In [265]: shape = np.array([4,4])
In [266]: grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
     ...: grid = [tuple(slice(box[i], box[i] + 3) for i in range(len(box))) for box in grid]
In [267]: len(grid)
Out[267]: 16
In [268]: score = np.arange(36).reshape(6,6)
In [269]: X = np.array([score[x] for x in grid]).reshape(4,4,3,3)
In [270]: X
Out[270]:
array([[[[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14]],
[[ 1, 2, 3],
[ 7, 8, 9],
[13, 14, 15]],
[[ 2, 3, 4],
[ 8, 9, 10],
[14, 15, 16]],
....
[[21, 22, 23],
[27, 28, 29],
[33, 34, 35]]]])
This is a moving window - the same (3,3) block shifted right by one, ..., shifted down by one, etc.
With as_strided it is possible to construct a view of the array that consists of all these windows, but without actually copying values. Having worked with as_strided before, I was able to construct the equivalent strides as:
In [271]: score.shape
Out[271]: (6, 6)
In [272]: score.strides
Out[272]: (48, 8)
In [273]: ast = np.lib.stride_tricks.as_strided
In [274]: x=ast(score, shape=(4,4,3,3), strides=(48,8,48,8))
In [275]: np.allclose(X,x)
Out[275]: True
This could be extended to your (28,28,3) dimensions, and turned into the summation.
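As one concrete reading of "turned into the summation" (my own sketch, not from the answer): with every window exposed as the trailing (3,3) block of x, sliding-window sums come from a single vectorized reduction:

window_sums = x.sum(axis=(2, 3))   # shape (4, 4): sum of each 3x3 window
np.allclose(window_sums[0, 0], score[0:3, 0:3].sum())  # True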
Generating such moving windows has been covered in previous SO questions. And it's also implemented in one of the image processing packages.
Adaptation for a 3 channel image,
In [45]: arr.shape
Out[45]: (6, 6, 3)
In [46]: arr.strides
Out[46]: (144, 24, 8)
In [47]: arr[:3,:3,0]
Out[47]:
array([[ 0., 1., 2.],
[ 6., 7., 8.],
[12., 13., 14.]])
In [48]: x = ast(arr, shape=(4,4,3,3,3), strides=(144,24,144,24,8))
In [49]: x[0,0,:,:,0]
Out[49]:
array([[ 0., 1., 2.],
[ 6., 7., 8.],
[12., 13., 14.]])
Since we are moving the window by one element at a time, the strides for x are easily derived from the source strides.
For 4x4 windows, just change the shape
x = ast(arr, shape=(3,3,4,4,3), strides=(144,24,144,24,8))
In Efficiently Using Multiple Numpy Slices for Random Image Cropping, @Divakar suggests using skimage.
With the default step=1, the result is compatible:
In [55]: from skimage.util.shape import view_as_windows
In [63]: y = view_as_windows(arr,(4,4,3))
In [64]: y.shape
Out[64]: (3, 3, 1, 4, 4, 3)
In [69]: np.allclose(x,y[:,:,0])
Out[69]: True

Why does Seaborn keep drawing non-existing range value on the x-axis?

Snippet:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
test = pd.DataFrame({'value':[1,2,5,7,8,10,11,12,15,16,18,20,36,37,39]})
test['range'] = pd.cut(test.value, np.arange(0,45,5)) # generate range
test = test.groupby('range')['value'].count().to_frame().reset_index() # count occurrences in each range
test = test[test.value!=0] #filter out rows with value = 0
plt.figure(figsize=(10,5))
plt.xticks(rotation=90)
plt.yticks(np.arange(0,10, 1))
sns.barplot(x=test.range, y=test.value)
Output: (a bar plot in which the empty ranges still appear on the x-axis)
If we look at what's in test:
range value
0 (0, 5] 3
1 (5, 10] 3
2 (10, 15] 3
3 (15, 20] 3
7 (35, 40] 3
The ranges (20,25], (25,30], (30,35] have already been filtered out, but they still appear in the plot. Why is that? How can I produce a plot without the empty ranges?
P.S. @jezrael's solution works perfectly with the snippet above. I tried it on a real dataset:
Snippet:
test['range'] = test['range'].cat.remove_unused_categories()
Warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I used the following instead to avoid the warning (note: the inplace argument has since been deprecated in newer pandas, so the .copy() approach in the EDIT below is preferable):
test['range'].cat.remove_unused_categories(inplace=True)
The warning is caused by building one DataFrame from a slice of another, so be aware:
test = blah blah blah
test_df = test[test.value!=0]
test_df['range'] = test_df['range'].cat.remove_unused_categories() # warning!
The problem is that the range column is categorical, so the unused categories are not removed by design, unlike in other operations. You need Series.cat.remove_unused_categories:
...
test = test[test.value!=0] #filter out rows with value = 0
print (test['range'])
0 (0, 5]
1 (5, 10]
2 (10, 15]
3 (15, 20]
7 (35, 40]
Name: range, dtype: category
Categories (8, interval[int64]):
[(0, 5] < (5, 10] < (10, 15] < (15, 20] < (20, 25] < (25, 30] < (30, 35] < (35, 40]]
test['range'] = test['range'].cat.remove_unused_categories()
print (test['range'])
0 (0, 5]
1 (5, 10]
2 (10, 15]
3 (15, 20]
7 (35, 40]
Name: range, dtype: category
Categories (5, interval[int64]):
[(0, 5] < (5, 10] < (10, 15] < (15, 20] < (35, 40]]
plt.figure(figsize=(10,5))
plt.xticks(rotation=90)
plt.yticks(np.arange(0,10, 1))
sns.barplot(x=test.range, y=test.value)
EDIT:
You need copy:
test_df = test[test.value!=0].copy()
test_df['range'] = test_df['range'].cat.remove_unused_categories() # no warning!
If you modify values in test_df later, the modifications will not propagate back to the original data (test), and pandas will not raise the warning.

Creating structured array from flat list

Suppose I have a flat list L like this:
In [97]: L
Out[97]: [2010.5, 1, 2, 3, 4, 5]
...and I want to get a structured array like this:
array((2010.5, [1, 2, 3, 4, 5]),
dtype=[('A', '<f4'), ('B', '<u4', (5,))])
How can I most efficiently make this conversion? I cannot pass the latter dtype directly to array(L, ...) or to array(tuple(L)):
In [98]: dtp
Out[98]: [('A', '<f4', 1), ('B', '<u4', 5)]
In [99]: array(L, dtp)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-99-32809e0456a7> in <module>()
----> 1 array(L, dtp)
TypeError: expected an object with a buffer interface
In [101]: array(tuple(L), dtp)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-101-4d0c49a9f01d> in <module>()
----> 1 array(tuple(L), dtp)
ValueError: size of tuple must match number of fields.
What does work is to pass a temporary dtype where each field has one entry, then view this with the dtype I actually want:
In [102]: tdtp
Out[102]:
[('a', numpy.float32),
('b', numpy.uint32),
('c', numpy.uint32),
('d', numpy.uint32),
('e', numpy.uint32),
('f', numpy.uint32)]
In [103]: array(tuple(L), tdtp).view(dtp)
Out[103]:
array((2010.5, [1, 2, 3, 4, 5]),
dtype=[('A', '<f4'), ('B', '<u4', (5,))])
But to create this temporary dtype is an additional step that I would like to avoid if possible.
Is it possible to go directly from my flat list to my structured dtype, without using the intermediate dtype shown above?
(Note: in my real use case I have a reading routine reading a custom file format and many values per entry; so I would prefer to avoid a situation where I need to construct both the temporary and the actual dtype by hand.)
Instead of passing tuple(L) to array, pass an argument with nested values that match the nesting of the dtype. For the example you showed, you could pass in (L[0], L[1:]):
In [28]: L
Out[28]: [2010.5, 1, 2, 3, 4, 5]
In [29]: dtp
Out[29]: [('A', '<f4', 1), ('B', '<u4', 5)]
In [30]: array((L[0], L[1:]), dtype=dtp)
Out[30]:
array((2010.5, [1L, 2L, 3L, 4L, 5L]),
dtype=[('A', '<f4'), ('B', '<u4', (5,))])
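For the many-values-per-entry case mentioned in the note, the split can be driven by the dtype itself instead of being written out by hand. A minimal sketch (pack_flat is a hypothetical helper, not a numpy function; it assumes the flat list is ordered field by field and that subarray fields are 1-D):

import numpy as np

def pack_flat(L, dtp):
    dt = np.dtype(dtp)
    out, i = [], 0
    for name in dt.names:
        shape = dt[name].shape        # () for scalar fields, (n,) for subarrays
        n = shape[0] if shape else 1
        chunk = L[i:i + n]
        out.append(chunk if shape else chunk[0])
        i += n
    return np.array(tuple(out), dtype=dt)

L = [2010.5, 1, 2, 3, 4, 5]
dtp = [('A', '<f4'), ('B', '<u4', (5,))]
print(pack_flat(L, dtp))
# (2010.5, [1, 2, 3, 4, 5])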