The best way to remove rows with identical values - numpy

I'm struggling with the numpy library.
I have a tensor of shape (batch_size, timestep, feature).
For example, let's create a dummy:
x = np.arange(42).reshape(2, 7, 3)
# now make some rows homogeneous
x[:, ::3, :] = 0
x[:, ::5, :] = 2
Now I need a numpyish way (one that is reproducible in tensorflow) to remove the rows (axis=-2) where all values are the same. So in the end the tensor should look like this:
[[[ 3  4  5]
  [ 6  7  8]
  [12 13 14]]

 [[24 25 26]
  [27 28 29]
  [33 34 35]]]
Thanks.
P.S. This is not the same question as "remove all zero rows": here we are talking about rows with homogeneous values, which is a bit trickier.

If you are okay with losing one dimension (so that your array remains homogeneous), then you can do:
x[~np.all(x == x[:, :, 0, np.newaxis], axis=-1)]
# out:
[[ 3  4  5]
 [ 6  7  8]
 [12 13 14]
 [24 25 26]
 [27 28 29]
 [33 34 35]]
Credit: @unutbu's answer to a similar problem, here adapted to one more dimension.
Why is the 3rd dimension removed? Imagine if your conditions were such that you wanted to select 2 rows from your first array and 3 from your second: then the result would be heterogeneous, which would have to be stored as a masked array or as a list of arrays.

There might be a more clever way using only numpy. However, you could just iterate over the 2nd dimension and do a comparison.
not_same = []
for n in range(x.shape[1]):  # iterate over the 2nd dimension
    # test if the row is homogeneous, i.e. every value equals its own first value,
    # in every batch entry (comparing against x[:, n, :1] rather than a single
    # scalar, so rows that are homogeneous at different values are also caught)
    not_same.append(~np.all(x[:, n, :] == x[:, n, :1]))
out = x[:, not_same, :]
This gives you:
array([[[ 3,  4,  5],
        [ 6,  7,  8],
        [12, 13, 14]],

       [[24, 25, 26],
        [27, 28, 29],
        [33, 34, 35]]])
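A loop-free variant of the same idea is possible (a sketch; like the loop above, it only drops a row when it is homogeneous in every batch entry, so the result stays rectangular):
homogeneous = np.all(x == x[:, :, :1], axis=(0, 2))  # (7,) mask: True where row n is constant in every batch
out = x[:, ~homogeneous, :]
Since the question asks for something reproducible in tensorflow, the same mask translates fairly directly (a sketch, assuming TF 2.x):
import tensorflow as tf

x_t = tf.constant(x)
homogeneous = tf.reduce_all(tf.equal(x_t, x_t[:, :, :1]), axis=[0, 2])
out = tf.boolean_mask(x_t, tf.logical_not(homogeneous), axis=1)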


Assign numpy matrix to pandas columns

I have a dataframe with 48870 rows and calculated embeddings with shape (48870, 768).
I want to assign these embeddings to a pandas column.
When I try
test['original_text_embeddings'] = embeddings
I get an error: Wrong number of items passed 768, placement implies 1
I know that something like df.loc['original_text_embeddings'] = embeddings[0] will work, but I need to automate this process.
A dataframe/column needs a 1d list/array:
In [84]: x = np.arange(12).reshape(3,4)
In [85]: pd.Series(x)
...
ValueError: Data must be 1-dimensional
Splitting the array into a list (of arrays):
In [86]: pd.Series(list(x))
Out[86]:
0     [0, 1, 2, 3]
1     [4, 5, 6, 7]
2    [8, 9, 10, 11]
dtype: object
In [87]: _.to_numpy()
Out[87]:
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])],
dtype=object)
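Applied to the question, that suggests (a sketch, assuming test and embeddings as described there):
test['original_text_embeddings'] = list(embeddings)  # one 768-long array per cell, object dtype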
Your embeddings have 768 columns, which would translate to 768 columns in a dataframe. You are trying to assign all columns from the embeddings to just one column in the dataframe, which is not possible.
What you could do is generate a new dataframe from the embeddings and concatenate the test df with the embedding df:
embedding_df = pd.DataFrame(embeddings)
test = pd.concat([test, embedding_df], axis=1)
Have a look at the documentation for handling indexes and concatenating on a different axis:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
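Note that pd.concat aligns on the index, so if test doesn't have a clean 0..n-1 RangeIndex you can end up with NaN-padded rows; resetting the index first avoids that (a sketch):
embedding_df = pd.DataFrame(embeddings)
test = pd.concat([test.reset_index(drop=True), embedding_df], axis=1)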

Explicit slicing across a particular dimension

I've got a 3D tensor x (e.g. 4x4x100). I want to obtain a subset of this by explicitly choosing elements across the last dimension. This would have been easy if I were choosing the same elements across the last dimension (e.g. x[:,:,30:50]), but I want to target different elements across that dimension using the 2D tensor indices, which specifies the idx across the third dimension. Is there an easy way to do this in numpy?
A simpler 2D example:
x = [[1,2,3,4,5,6],[10,20,30,40,50,60]]
indices = [1,3]
Let's say I want to grab two elements across the last dimension of x, starting from the points specified by indices. So my desired output is:
[[2,3],[40,50]]
Update: I think I could use a combination of take() and ravel_multi_index(), but some of the platforms that are inspired by numpy (like PyTorch) don't seem to have ravel_multi_index, so I'm looking for alternative solutions.
Iterating over the idx and collecting the slices is not a bad option if the number of 'rows' isn't too large (and the slices are relatively big).
In [55]: x = np.array([[1,2,3,4,5,6],[10,20,30,40,50,60]])
In [56]: idx = [1,3]
In [57]: np.array([x[j,i:i+2] for j,i in enumerate(idx)])
Out[57]:
array([[ 2,  3],
       [40, 50]])
Joining the slices like this only works if they all are the same size.
An alternative is to collect the indices into an array, and do one indexing.
For example with a similar iteration:
idxs = np.array([np.arange(i,i+2) for i in idx])
But broadcasted addition may be better:
In [58]: idxs = np.array(idx)[:,None] + np.arange(2)
In [59]: idxs
Out[59]:
array([[1, 2],
       [3, 4]])
In [60]: x[np.arange(2)[:,None], idxs]
Out[60]:
array([[ 2,  3],
       [40, 50]])
ravel_multi_index is not hard to replicate (if you don't need clipping etc):
In [65]: np.ravel_multi_index((np.arange(2)[:,None], idxs), x.shape)
Out[65]:
array([[ 1,  2],
       [ 9, 10]])
In [66]: x.flat[_]
Out[66]:
array([[ 2,  3],
       [40, 50]])
In [67]: np.arange(2)[:,None]*x.shape[1] + idxs
Out[67]:
array([[ 1,  2],
       [ 9, 10]])
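On newer numpy (1.15+), np.take_along_axis expresses the same per-row gather directly, reusing the idxs array built above:
np.take_along_axis(x, idxs, axis=1)
# array([[ 2,  3],
#        [40, 50]])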
In PyTorch you can use narrow along the 3rd axis:
x = [x.narrow(2, index, 2)[:, i] for i, index in enumerate(indices)]  # narrow along dim 2 first, then pick row i
x = torch.stack(x, dim=1)
By enumerating you get the index along the axis and the index from where you want to start slicing, in one go.
narrow gives you a zero-copy slice of length length, from a starting index start, along a certain axis.
you said you wanted:
dim = 2
start = index
length = 2
Then you simply have to stack these tensors back into a single 3D tensor.
This is the least work-intensive thing I can think of for PyTorch.
EDIT
If you just want different indices along different axes and indices is a 2D tensor, you can do:
x = [x[:,i,index] for i,index in enumerate(indices)]
x = torch.stack(x,dim=1)
You really should have given a proper working example; as it stands the question is unnecessarily confusing.
Here is how to do it in numpy -- no clue about torch, though.
The following picks a slice of length n along the third dimension starting from points idx depending on the other two dimensions:
# example
a = np.arange(60).reshape(2, 3, 10)
idx = [(1,2,3),(4,3,2)]
n = 4
# build auxiliary 4D array where the last two dimensions represent
# a sliding n-window of the original last dimension
j,k,l = a.shape
s,t,u = a.strides
aux = np.lib.stride_tricks.as_strided(a, (j,k,l-n+1,n), (s,t,u,u))
# pick desired offsets from sliding windows
aux[(*np.ogrid[:j, :k], idx)]
# array([[[ 1,  2,  3,  4],
#         [12, 13, 14, 15],
#         [23, 24, 25, 26]],
#
#        [[34, 35, 36, 37],
#         [43, 44, 45, 46],
#         [52, 53, 54, 55]]])
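On numpy >= 1.20 the auxiliary windowed array can also be built without manual stride arithmetic, via sliding_window_view (a sketch reusing the names above):
from numpy.lib.stride_tricks import sliding_window_view

aux = sliding_window_view(a, n, axis=-1)  # shape (j, k, l-n+1, n), zero-copy
aux[(*np.ogrid[:j, :k], idx)]             # same result as above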
I came up with the following using broadcasting:
x = np.array([[1,2,3,4,5,6,7,8,9,10],[10,20,30,40,50,60,70,80,90,100]])
i = np.array([1,5])
N = 2  # number of elements to extract along the last dimension; starting points specified in i
r = np.arange(x.shape[-1])
r = np.broadcast_to(r, x.shape)
ii = i[:, np.newaxis]
ii = np.broadcast_to(ii, x.shape)
mask = np.logical_and(r - ii >= 0, r - ii < N)  # strict upper bound: <= N would pick N+1 elements
output = x[mask].reshape(2, N)
Does this look alright?

Keras summation Layer acting weird, summing over training set

I am having trouble understanding the basic way Keras works. I am experimenting with a single summation layer, implemented as a Lambda layer using tensorflow as a backend:
import numpy as np
from keras.models import Sequential
from keras.layers import Lambda
from keras import backend as K

test_model = Sequential()
test_model.add(Lambda(lambda x: K.sum(x, axis=0), input_shape=(2, 3)))
x = np.reshape(np.arange(12), (2, 2, 3))
test_model.predict(x)
This returns:
array([[  6.,   8.,  10.],
       [ 12.,  14.,  16.]], dtype=float32)
Which is very weird, as it sums over the first index, which to my understanding corresponds to the index of the training data. Also, if I change the axis to axis=1 then the sum is taken over the second coordinate, which is what I would expect to get for axis=0.
What is going on? Why does it seem like the axis chosen affects how the data is passed to the lambda layer?
The input_shape is the shape of one sample of the batch.
It doesn't matter if you have 200 or 10000 samples in a batch, all the samples should be (2,3).
But the batch itself is what is passed along from one layer to another.
A batch contains "n" samples, each sample with the input_shape:
Batch shape then is: (n, 2, 3) -- n samples, each sample with input_shape = (2,3)
You don't define "n" when input_shape is required, because "n" will be defined when you use fit or another training command, with the batch_size. (In your example, n = 2)
This is the original array:
[[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]]
Sample 1 = [ 0 1 2], [ 3 4 5]
Sample 2 = [ 6 7 8], [ 9 10 11]
Summing on index 0 (the batch size dimension) will sum sample 1 with sample 2:
[ 6 8 10], [12 14 16]
Summing on index 1 will sum the first dimension of one sample's input shape:
[ 3, 5, 7 ], [15, 17, 19]
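So to sum within each sample while leaving the batch dimension alone, the Lambda should reduce over axis 1 (a sketch following the setup above):
test_model = Sequential()
test_model.add(Lambda(lambda x: K.sum(x, axis=1), input_shape=(2, 3)))
test_model.predict(x)
# array([[ 3.,  5.,  7.],
#        [15., 17., 19.]], dtype=float32)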

I am trying to array index a 4 dimensional numpy array.

I have a 4-dimensional array -- say a with shape (40, 40, 4, 1000).
I also have an index array -- say b = np.arange(35).
I am looking to make an array doing something like c = a[b, b, 3, 999], where the resulting array would have shape (35, 35). Would appreciate any thoughts on what the right way to do this is. Thank you. Neela.
Since b=np.arange(35) is just the first 35 indices, use slices instead:
c = a[:35,:35,3,999]
If the values in b are not contiguous, then you will need to adjust its shape
c = a[b[:,None], b[None,:], 3, 999]
e.g.
In [754]: a=np.arange(3*4*5).reshape(3,4,5)
In [755]: b=np.array([2,0,1])
In [756]: a[b[:,None],b[None,:],3]
Out[756]:
array([[53, 43, 48],
       [13,  3,  8],
       [33, 23, 28]])
b[:,None] is a (3,1) array, b[None,:] a (1,3), together they broadcast to (3,3) arrays.
You may need to read up on broadcasting and advanced indexing.
More explicitly this indexing is:
a[[[2],[0],[1]], [[2,0,1]], 3]
np.ix_ is a handy tool for generating indexes like this:
In [795]: I,J = np.ix_(b,b)
In [796]: I
Out[796]:
array([[2],
       [0],
       [1]])
In [797]: J
Out[797]: array([[2, 0, 1]])
In [798]: a[I,J,3]
Out[798]:
array([[53, 43, 48],
       [13,  3,  8],
       [33, 23, 28]])
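Back in the question's 4D case this becomes (a sketch, assuming a has shape (40, 40, 4, 1000) and b = np.arange(35)):
I, J = np.ix_(b, b)
c = a[I, J, 3, 999]  # shape (35, 35)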

Numpy rebinning a 2D array

I am looking for a fast formulation to do a numerical binning of a 2D numpy array. By binning I mean calculating submatrix averages or cumulative values. For example, x = numpy.arange(16).reshape(4, 4) would be split into 4 submatrices of 2x2 each, giving numpy.array([[2.5, 4.5], [10.5, 12.5]]), where 2.5 = numpy.average([0, 1, 4, 5]), etc.
How can I perform such an operation in an efficient way? I don't really have any idea how to do this...
Many thanks...
You can use a higher dimensional view of your array and take the average along the extra dimensions:
In [12]: a = np.arange(36).reshape(6, 6)
In [13]: a
Out[13]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])
In [14]: a_view = a.reshape(3, 2, 3, 2)
In [15]: a_view.mean(axis=3).mean(axis=1)
Out[15]:
array([[  3.5,   5.5,   7.5],
       [ 15.5,  17.5,  19.5],
       [ 27.5,  29.5,  31.5]])
In general, if you want bins of shape (a, b) for an array of (rows, cols), your reshaping of it should be .reshape(rows // a, a, cols // b, b). Note also that the order of the .mean calls matters: a_view.mean(axis=1).mean(axis=3) will raise an error, because a_view.mean(axis=1) only has three dimensions; a_view.mean(axis=1).mean(axis=2) will work, but it makes it harder to understand what is going on.
As is, the above code only works if you can fit an integer number of bins inside your array, i.e. if a divides rows and b divides cols. There are ways to deal with other cases, but you will have to define the behavior you want then.
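Wrapped up as a small helper (a sketch; as noted above, the bin sizes must divide the array's dimensions exactly):
def rebin_mean(a, bin_h, bin_w):
    """Average a 2D array over non-overlapping (bin_h, bin_w) blocks."""
    rows, cols = a.shape
    view = a.reshape(rows // bin_h, bin_h, cols // bin_w, bin_w)
    return view.mean(axis=3).mean(axis=1)

rebin_mean(np.arange(16).reshape(4, 4), 2, 2)
# array([[ 2.5,  4.5],
#        [10.5, 12.5]])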
See the SciPy Cookbook on rebinning, which provides this snippet:
import numpy as np

def rebin(a, *args):
    '''rebin ndarray data into a smaller ndarray of the same rank whose dimensions
    are factors of the original dimensions. E.g. an array with 6 columns and 4 rows
    can be reduced to have 6, 3, 2 or 1 columns and 4, 2 or 1 rows.
    example usages:
    >>> a = np.random.rand(6, 4); b = rebin(a, 3, 2)
    >>> a = np.random.rand(6); b = rebin(a, 2)
    '''
    shape = a.shape
    lenShape = len(shape)
    factor = np.asarray(shape) // np.asarray(args)  # integer bin size per axis
    # build an expression string that reshapes, sums each bin axis, then normalizes
    evList = ['a.reshape('] + \
             ['args[%d],factor[%d],' % (i, i) for i in range(lenShape)] + \
             [')'] + ['.sum(%d)' % (i + 1) for i in range(lenShape)] + \
             ['/factor[%d]' % i for i in range(lenShape)]
    print(''.join(evList))
    return eval(''.join(evList))
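For the 4x4 example from the question, rebin(np.arange(16).reshape(4, 4), 2, 2) prints the generated expression a.reshape(args[0],factor[0],args[1],factor[1],).sum(1).sum(2)/factor[0]/factor[1] and returns the same array([[ 2.5, 4.5], [10.5, 12.5]]) as above.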
I assume that you only want to know how to generally build a function that performs well and does something with arrays, just like numpy.reshape in your example. So if performance really matters and you're already using numpy, you can write your own C code for that, like numpy does. For example, the implementation of arange is completely in C. Almost everything in numpy that matters in terms of performance is implemented in C.
However, before doing so you should try to implement the code in Python and see if the performance is good enough. Try to make the Python code as efficient as possible. If it still doesn't suit your performance needs, go the C way.
You may read about that in the docs.