NumPy: how to retrieve a sub-array of an array by specific indices?

I have an array:
>>> arr1 = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> arr1
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
I want to retrieve a list (or 1d-array) of elements of this array by giving a list of their indices, like so:
indices = [[0,0], [0,2], [2,0]]
print(arr1[indices])
# result
[1,6,7]
But it does not work. I have been looking for a solution for a while, but I only found ways to select per row and/or per column, not by specific index pairs.
Does anyone have any idea?
Cheers,
Aymeric

First make indices an array instead of a nested list:
indices = np.array([[0,0], [0,2], [2,0]])
Then index the first dimension of arr1 with the first column of indices, and the second dimension with the second column:
arr1[indices[:,0], indices[:,1]]
This gives array([1, 3, 7]), which is correct; your [1, 6, 7] example output is probably a typo, since arr1[0, 2] is 3, not 6.
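If you prefer to keep indices as a single (N, 2) array, an equivalent form is to transpose it into one index array per axis and unpack them as a tuple (a minimal sketch reusing arr1 and indices from above):
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([[0, 0], [0, 2], [2, 0]])

# indices.T has shape (2, 3): row 0 holds the row indices, row 1 the
# column indices, so unpacking it as a tuple matches the answer above.
print(arr1[tuple(indices.T)])  # [1 3 7]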


What does the [1] do when using .where()?

I'm practicing on a Data Cleaning Kaggle exercise.
In the parsing-dates example, I can't figure out what the [1] does at the end of the indices object.
Thanks.
# Finding indices corresponding to rows in different date format
indices = np.where([date_lengths == 24])[1]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
As described in the documentation, numpy.where called with a single argument is equivalent to calling np.asarray([date_lengths == 24]).nonzero().
numpy.nonzero returns a tuple with as many items as the input array has dimensions, each containing the indices of the non-zero values along that axis.
>>> np.nonzero([1,0,2,0])
(array([0, 2]),)
Indexing with [1] gets the second element (i.e., the second dimension), but since the input was wrapped in [...], this is equivalent to doing:
np.where(date_lengths == 24)[0]
>>> np.nonzero([1,0,2,0])[0]
array([0, 2])
It is an artefact of the extra [] around the condition. For example:
a = np.arange(10)
Finding, for example, the indices where a > 3 can be done like this:
np.where(a > 3)
gives as output a tuple with one array
(array([4, 5, 6, 7, 8, 9]),)
So the indices can be obtained as
indices = np.where(a > 3)[0]
In your case, the condition is between [], which is unnecessary, but still works.
np.where([a > 3])
returns a tuple whose first element is an array of zeros (the indices along the new leading axis) and whose second element is the array of indices you want:
(array([0, 0, 0, 0, 0, 0]), array([4, 5, 6, 7, 8, 9]))
so the indices are obtained as
indices = np.where([a > 3])[1]
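A quick check (a minimal sketch reusing the a from above) confirms that the two spellings select the same indices:
import numpy as np

a = np.arange(10)

# The wrapped condition just adds a leading axis of length 1, so its
# second tuple element equals the unwrapped version's first element.
assert np.array_equal(np.where([a > 3])[1], np.where(a > 3)[0])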

Sorting an array based on one column, then based on a second column

I would like to sort an array based on one column and then, for all rows whose values in that column are equal, sort them based on a second column. For example, suppose I have the array:
a = np.array([[0,1,1],[0,3,1],[1,7,2],[0,2,1]])
I can sort it by column 0 using:
sorted_array = a[np.argsort(a[:, 0])]
however, I want rows that have equal values in the [0] column to be sorted by the [1] column, so my result would look like:
desired_result = np.array([[0,1,1],[0,2,1],[0,3,1],[1,7,2]])
What is the best way to achieve that? Thanks.
You can sort the rows as tuples, then convert back to a numpy array (tuple comparison is lexicographic, so this sorts by column 0, then column 1, and so on):
out = np.array(sorted(map(tuple,a)))
Output:
array([[0, 1, 1],
       [0, 2, 1],
       [0, 3, 1],
       [1, 7, 2]])
You first sort the array on the secondary column, then you sort on the primary column, making sure to use a stable sorting method so that rows equal in the primary column keep their secondary-column order.
sorted_array = a[np.argsort(a[:, 1])]
sorted_array = sorted_array[np.argsort(sorted_array[:, 0], kind='stable')]
Or you can use np.lexsort:
sorted_array = a[np.lexsort((a[:,1], a[:, 0])), :]
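Note that np.lexsort treats its last key as the primary one, which is why column 0 comes last in the tuple above. A small check, reusing a from the question:
import numpy as np

a = np.array([[0, 1, 1], [0, 3, 1], [1, 7, 2], [0, 2, 1]])

# The last key (column 0) is the primary sort key; column 1 breaks ties.
order = np.lexsort((a[:, 1], a[:, 0]))
print(a[order])
# [[0 1 1]
#  [0 2 1]
#  [0 3 1]
#  [1 7 2]]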

Numpy Advanced Indexing : How the broadcast is happening?

Suppose x is the array
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
if we run the following statement
x[1:, [2,0,1]]
we get the following result
array([[ 6,  4,  5],
       [10,  8,  9]])
According to numpy's doc:
Advanced indexes always are broadcast and iterated as one:
I am unable to understand how the pairing of the indices and the broadcasting are happening here.
The selected answer is not correct.
Here the [2,0,1] indeed has shape (3,) and will not be extended during broadcasting.
The 1: means you first slice the array before broadcasting. During the broadcasting, think of the slice : as a placeholder for a 0-d scalar at each run. So we get:
shape([2,0,1]) = (3,)
shape([:]) = () -> (1,) -> (3,)
So it is the [:] that is conceptually extended into shape (3,), like this:
x[[1,1,1], [2,0,1]] =
[6 4 5]
x[[2,2,2], [2,0,1]] =
[10 8 9]
Finally, we stack the per-row results back together:
[[ 6  4  5]
 [10  8  9]]
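To verify this conceptual expansion (a quick sketch, taking x = np.arange(12).reshape(3, 4) as in the question):
import numpy as np

x = np.arange(12).reshape(3, 4)

# Pair each selected row with the full column index array, then stack:
# this reproduces the slice-plus-fancy-index result.
expanded = np.vstack([x[[1, 1, 1], [2, 0, 1]],
                      x[[2, 2, 2], [2, 0, 1]]])
print(np.array_equal(expanded, x[1:, [2, 0, 1]]))  # True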
From the NumPy User Guide, Section 3.4.7, Combining index arrays with slices:
the slice is converted to an index array np.array that is broadcast
with the index array to produce the resultant array.
In our case the slice 1: is converted to the index array np.array([[1], [2]]), which has shape (2, 1). This is the row index array. The next index array (the column index array) np.array([2, 0, 1]) has shape (3,):
row index array shape (2, 1)
column index array shape (3,)
The index arrays do not have the same shape, but they can be broadcast to a common shape (2, 3): the row index array is stretched along its second axis, and the column index array along a new leading axis.
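One way to check those shapes explicitly (a minimal sketch with x = np.arange(12).reshape(3, 4)):
import numpy as np

x = np.arange(12).reshape(3, 4)

# The explicit all-advanced-index equivalent: a (2, 1) row index array
# broadcasts against a (3,) column index array to give a (2, 3) result.
rows = np.array([[1], [2]])    # shape (2, 1)
cols = np.array([2, 0, 1])     # shape (3,)
print(np.array_equal(x[rows, cols], x[1:, [2, 0, 1]]))  # True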

Slicing a numpy array and passing the slice to a function

I want to have a function that can operate on either a row or a column of a 2D ndarray. Assume the array has C order. The function changes values in the 2D data.
Inside the function I want to have identical index syntax whether it is called with a row or a column. A row slice is [n,:] and a column slice is [:,n], so they can have different shapes, and inside the function this would seem to require different indexing expressions.
Is there a way to do this that does not require moving or allocating memory? I am under the impression that using reshape will force a copy to make the data contiguous. Is there a way to use nditer in the function?
Do you mean like this?
In [74]: def foo(arr, n):
    ...:     arr += n
    ...:
In [75]: arr = np.ones((2,3), int)
In [76]: foo(arr[0,:], 1)
In [77]: arr
Out[77]:
array([[2, 2, 2],
       [1, 1, 1]])
In [78]: foo(arr[:,1], [100,200])
In [79]: arr
Out[79]:
array([[  2, 102,   2],
       [  1, 201,   1]])
In the first case I'm adding 1 to one row of the array, i.e. a row slice. In the second case I'm adding an array (list) to a column. In that case n has to have the right length.
Usually we don't worry about whether the values are C contiguous. Striding takes care of access either way.
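Both arr[0,:] and arr[:,1] are 1-D views into the same buffer, so nothing is copied. A quick way to confirm this (a small sketch):
import numpy as np

arr = np.ones((2, 3), int)
row = arr[0, :]   # 1-D view, C-contiguous
col = arr[:, 1]   # 1-D view, strided, not C-contiguous

# Both views share arr's memory: writing to them writes through to arr.
print(np.shares_memory(arr, row), np.shares_memory(arr, col))  # True True
print(row.flags['C_CONTIGUOUS'], col.flags['C_CONTIGUOUS'])    # True False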

Numpy remove rows with same column values

How do I remove rows from an ndarray which have the same nth column value?
For example,
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
I want the rows to be unique by the third column, so that just the [1, 3, 5] row is left.
numpy.unique does not do it: it checks for uniqueness in every column, and I can't specify the column by which to check uniqueness.
How can I do this efficiently for thousand + rows?
Thank you.
You could try a combination of bincount, nonzero and in1d:
import numpy as np
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])

# A tuple containing the third-column values which occur exactly once
unique_in_column = (np.bincount(a[:, 2]) == 1).nonzero()

# Boolean mask selecting the rows whose third-column value is unique
unique_index = np.in1d(a[:, 2], unique_in_column[0])
unique_a = a[unique_index]
This should do the trick. Note that bincount only works for non-negative integer values; also, I'm not sure how this method scales with 1000+ rows.
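For reference, an alternative sketch that avoids bincount's non-negative-integer restriction, using np.unique with return_counts:
import numpy as np

a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])

# Keep only the rows whose third-column value occurs exactly once.
vals, counts = np.unique(a[:, 2], return_counts=True)
mask = np.isin(a[:, 2], vals[counts == 1])
print(a[mask])  # [[1 3 5]]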
In the end, I did this:
repeatdict = {}
todel = []
for i, row in enumerate(kplist):
    if repeatdict.get(row[2], 0):
        todel.append(i)
    else:
        repeatdict[row[2]] = 1
kplist = np.delete(kplist, todel, axis=0)
Basically, I iterated over the list, storing the values of the third column; if in a later iteration the same value is already found in the repeatdict dict, that row is marked for deletion by storing its index in the todel list.
Then we can get rid of the unwanted rows by calling np.delete with the list of all row indices we want to delete. Note that this keeps the first occurrence of each value, rather than dropping every duplicated value as the answer above does.
Also, I'm not marking my answer as the accepted answer, because I know there's probably a better way to do this with just numpy magic.
I'll wait.
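For reference, a vectorized sketch of the same keep-the-first-occurrence logic (with hypothetical data, since kplist isn't shown in the question):
import numpy as np

kplist = np.array([[1, 3, 4], [1, 3, 4], [1, 3, 5]])

# np.unique with return_index gives the index of the first occurrence
# of each distinct third-column value; sorting preserves row order.
_, first = np.unique(kplist[:, 2], return_index=True)
print(kplist[np.sort(first)])
# [[1 3 4]
#  [1 3 5]]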