Why does numpy.delete always delete one extra element in my test? - numpy

I want to use numpy.delete to delete certain elements in an array
import numpy as np
aa = np.array([1,2,3,4,5,6,7,8])
bb = np.array([0,0,0,0,0,0,0,0], dtype='bool')
np.delete(aa,bb)
gives me the results:
array([2, 3, 4, 5, 6, 7, 8])
I expect the results like this:
array([1, 2, 3, 4, 5, 6, 7, 8])
And if I change the bb to
bb = np.array([1,0,0,0,0,0,0,0], dtype='bool')
I got:
np.delete(aa,bb)
array([3, 4, 5, 6, 7, 8])
The code is simple, but I do not understand why numpy.delete behaves like this. Any explanations?

When I paste your code into a repl, I get the intended output. I am using Numpy v1.19.4 and Python 3.8.5. Check if there is an update for Numpy, and make sure that you are not doing any operations after that may remove the first item in the array.

np.delete is using an array of ints as the indices to remove. If you use a bool, that is converted to an int (False =0, True=1). So what you are doing in your first example is saying remove the 0 index value and in the second example you have a 1 and a 0 so it's removing those indexes.
In the future numpy will not cast the booleans as integers.
FutureWarning: in the future insert will treat boolean arrays and array-likes as boolean index instead of casting it to integer

Related

Rearranging numpy arrays

I was not able to find a duplicate of my question, unfortunately, although I am sure that this is a problem which has been solved before
I have a numpy array with a certain set of indices, eg.
ind1 = np.array([1, 3, 5, 7])
With these indices, I can filter some values from another array. Lets call this other array rows. As an example, I can retrieve
rows[ind1] = [1, 10, 20, 15]
The order of rows[ind1] must not be changed in the following.
I have another index array, ind2
ind2 = np.array([4, 5, 6, 7])
I also have an array cols, where I can filter values from using ind2. I know that cols[ind2] results in an array which has the size of rows[ind1] and the entries are the same, but the order is different. An example:
cols[ind2] = [15, 20, 10, 1]
I would like to rearrange the order of cols[ind2], so that it corresponds to rows[ind1]. I am interested in the corresponding order of ind2.
In the example, the result should be
cols[ind2] = [1, 10, 20, 15]
ind2 = [7, 6, 5, 4]
Using numpy, I did not find a way to do this. Any ideas would be helpful. Thanks in advance.
There may be a better way, but you can do this using argsorts.
Let's call your "reordered ind2" ind3.
If you are sure that rows[ind1] and cols[ind2] will have the same length and all of the same elements, then the sorted versions of both will be the same i.e np.sort(rows[ind1]) = np.sort(cols[ind2]).
If this is the case, and you don't run into any problems with repeated elements (unsure of your exact use case), then what you can do is find the indices to put cols[ind2] in order, and then from there, find the indices to put np.sort(cols[ind2]) into the order of rows[ind1].
So, if
p1 = np.argsort(rows[ind1])
and
p2 = np.argsort(cols[ind2])
and
p3 = np.argsort(p1)
Then
ind3 = ind2[p2][p3]. The reason this works is because if you do an argsort of an argsort, it gives you the indices you need to reverse the first sort. p2 sorts cols[ind2] (that's the definition of argsort), and p3 unsorts the result of that back into the order of rows[ind1].

What does the [1] do when using .where()?

I m practicing on a Data Cleaning Kaggle excercise.
In parsing dates example I can´t figure out what the [1] does at the end of the indices object.
Thanks..
# Finding indices corresponding to rows in different date format
indices = np.where([date_lengths == 24])[1]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
As described in the documentation, numpy.where called with a single argument is equivalent to calling np.asarray([date_lengths == 24]).nonzero().
numpy.nonzero return a tuple with as many items as the dimensions of the input array with the indexes of the non-zero values.
>>> np.nonzero([1,0,2,0])
(array([0, 2]),)
Slicing [1] enables to get the second element (i.e. second dimension) but as the input was wrapped into […], this is equivalent to doing:
np.where(date_lengths == 24)[0]
>>> np.nonzero([1,0,2,0])[0]
array([0, 2])
It is an artefact of the extra [] around the condition. For example:
a = np.arange(10)
To find, for example, indices where a>3 can be done like this:
np.where(a > 3)
gives as output a tuple with one array
(array([4, 5, 6, 7, 8, 9]),)
So the indices can be obtained as
indices = np.where(a > 3)[0]
In your case, the condition is between [], which is unnecessary, but still works.
np.where([a > 3])
returns a tuple of which the first is an array of zeros, and the second array is the array of indices you want
(array([0, 0, 0, 0, 0, 0]), array([4, 5, 6, 7, 8, 9]))
so the indices are obtained as
indices = np.where([a > 3])[1]

Numpy multi-dimensional array slicing

I have a 3-D NumPy array with shape (100, 50, 20). I was trying to slice the third dimension of the array by using the index, e.g., from 1 to 6 and from 8 to 10.
I tried the following code, but it kept reporting a syntax error.
newarr [:,:,1:10] = oldarr[:,:,[1:7,8:11]]
You can use np.r_ to concatenate slice objects:
newarr [:,:,1:10] = oldarr[:,:,np.r_[1:7,8:11]]
Example:
np.r_[1:4,6:8]
array([1, 2, 3, 6, 7])

what does `replace` in `pandas.DataFrame.sample()` do? [duplicate]

Here explains the function numpy.random.choice. However, I am confused about the third parameter replace. What is it? And in which case will it be useful? Thanks!
It controls whether the sample is returned to the sample pool. If you want only unique samples then this should be false.
You can use it when you want sample some elements from a list, and meanwhile you want the elements no repeat, then you can set the "replace=False".
eg.
from numpy import random as rd
ary = list(range(10))
# usage
In[18]: rd.choice(ary, size=8, replace=False)
Out[18]: array([0, 5, 9, 8, 2, 1, 6, 3]) # no repeated elements
In[19]: rd.choice(ary, size=8, replace=True)
Out[19]: array([4, 9, 8, 5, 4, 1, 1, 9]) # elements may be repeated

Numpy rebinning a 2D array

I am looking for a fast formulation to do a numerical binning of a 2D numpy array. By binning I mean calculate submatrix averages or cumulative values. For ex. x = numpy.arange(16).reshape(4, 4) would have been splitted in 4 submatrix of 2x2 each and gives numpy.array([[2.5,4.5],[10.5,12.5]]) where 2.5=numpy.average([0,1,4,5]) etc...
How to perform such an operation in an efficient way... I don't have really any ideay how to perform this ...
Many thanks...
You can use a higher dimensional view of your array and take the average along the extra dimensions:
In [12]: a = np.arange(36).reshape(6, 6)
In [13]: a
Out[13]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
In [14]: a_view = a.reshape(3, 2, 3, 2)
In [15]: a_view.mean(axis=3).mean(axis=1)
Out[15]:
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5],
[ 27.5, 29.5, 31.5]])
In general, if you want bins of shape (a, b) for an array of (rows, cols), your reshaping of it should be .reshape(rows // a, a, cols // b, b). Note also that the order of the .mean is important, e.g. a_view.mean(axis=1).mean(axis=3) will raise an error, because a_view.mean(axis=1) only has three dimensions, although a_view.mean(axis=1).mean(axis=2) will work fine, but it makes it harder to understand what is going on.
As is, the above code only works if you can fit an integer number of bins inside your array, i.e. if a divides rows and b divides cols. There are ways to deal with other cases, but you will have to define the behavior you want then.
See the SciPy Cookbook on rebinning, which provides this snippet:
def rebin(a, *args):
'''rebin ndarray data into a smaller ndarray of the same rank whose dimensions
are factors of the original dimensions. eg. An array with 6 columns and 4 rows
can be reduced to have 6,3,2 or 1 columns and 4,2 or 1 rows.
example usages:
>>> a=rand(6,4); b=rebin(a,3,2)
>>> a=rand(6); b=rebin(a,2)
'''
shape = a.shape
lenShape = len(shape)
factor = asarray(shape)/asarray(args)
evList = ['a.reshape('] + \
['args[%d],factor[%d],'%(i,i) for i in range(lenShape)] + \
[')'] + ['.sum(%d)'%(i+1) for i in range(lenShape)] + \
['/factor[%d]'%i for i in range(lenShape)]
print ''.join(evList)
return eval(''.join(evList))
I assume that you only want to know how to generally build a function that performs well and does something with arrays, just like numpy.reshape in your example. So if performance really matters and you're already using numpy, you can write your own C code for that, like numpy does. For example, the implementation of arange is completely in C. Almost everything with numpy which matters in terms of performance is implemented in C.
However, before doing so you should try to implement the code in python and see if the performance is good enough. Try do make the python code as efficient as possible. If it still doesn't suit your performance needs, go the C way.
You may read about that in the docs.