Numpy Indexing Behavior - numpy

I am having a lot of trouble understanding numpy indexing for multidimensional arrays. In the example I am working with, I have a 2D array, A, which is 100x10, and a 1D array, B, of 100 values between 0 and 9 (column indices into A). In MATLAB, I would use A(sub2ind(size(A), (1:size(A,1))', B)) to return, for each row of A, the value at the index stored in the corresponding row of B.
So, as a test case, let's say I have this:
A = np.random.rand(100,10)
B = np.int32(np.floor(np.random.rand(100)*10))
If I print their shapes, I get:
print A.shape returns (100L, 10L)
print B.shape returns (100L,)
When I try to index into A using B naively (incorrectly)
Test1 = A[:,B]
print Test1.shape returns (100L, 100L)
but if I do
Test2 = A[range(A.shape[0]),B]
print Test2.shape returns (100L,)
which is what I want. I'm having trouble understanding the distinction being made here. In my mind, A[:,5] and A[range(A.shape[0]),5] should return the same thing, but that isn't what happens here. How is : different from range(sizeArray), which just produces the indices 0 through sizeArray-1, when used as an index?

Let's look at a simple array:
In [654]: X=np.arange(12).reshape(3,4)
In [655]: X
Out[655]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
With the slice we can pick 3 columns of X, in any order (and even repeated). In other words, take all the rows, but selected columns.
In [656]: X[:,[3,2,1]]
Out[656]:
array([[ 3,  2,  1],
       [ 7,  6,  5],
       [11, 10,  9]])
If instead I index the rows with a list (or array) of 3 values, NumPy pairs them up with the column values, effectively picking 3 values, X[0,3], X[1,2], X[2,1]:
In [657]: X[[0,1,2],[3,2,1]]
Out[657]: array([3, 6, 9])
If instead I give it a column vector to index the rows, I get the same thing as with the slice:
In [659]: X[[[0],[1],[2]],[3,2,1]]
Out[659]:
array([[ 3,  2,  1],
       [ 7,  6,  5],
       [11, 10,  9]])
This amounts to picking 9 individual values, as generated by broadcasting:
In [663]: np.broadcast_arrays(np.arange(3)[:,None],np.array([3,2,1]))
Out[663]:
[array([[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]]),
 array([[3, 2, 1],
        [3, 2, 1],
        [3, 2, 1]])]
numpy indexing can be confusing. But a good starting point is this page: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
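Applied to the shapes from the question, the difference can be checked directly. This is a minimal sketch: the seeded generator and the variable names all_pairs, rows, and picked are only illustrative choices, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only so the sketch is repeatable
A = rng.random((100, 10))
B = rng.integers(0, 10, size=100)    # one column index per row of A

# Slice + index array: every row of A is combined with every entry of B.
all_pairs = A[:, B]                  # all_pairs[i, j] == A[i, B[j]]

# Two index arrays of equal length are paired element by element.
rows = np.arange(A.shape[0])
picked = A[rows, B]                  # picked[i] == A[i, B[i]]

# The paired result is the diagonal of the "all pairs" result.
same = np.array_equal(picked, np.diagonal(all_pairs))

# With a scalar column the two spellings do agree, because the scalar
# broadcasts against rows.
scalar_same = np.array_equal(A[:, 5], A[rows, 5])

print(all_pairs.shape, picked.shape, same, scalar_same)
```

So : and range(n) only behave alike when the other index is a scalar; once both indices are arrays, NumPy broadcasts them against each other instead of taking every combination.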

Related

Finding values from different rows in pandas

I have a dataframe comprising the data and another dataframe, containing a single row carrying indices.
data = {'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8],'col_3': [5, 5, 6, 9],'col_4': [8, 7, 6, 5]}
df = pd.DataFrame(data)
ind = {'ind_1': [2], 'ind_2': [1],'ind_3': [3],'ind_4': [2]}
ind = pd.DataFrame(ind)
Both have the same number of columns. I want to extract the values of df corresponding to the index stored in ind so that I get a single row at the end.
For this data it should be: [6, 4, 9, 6]. I tried df.loc[ind.loc[0]] but that of course gives me four different rows, not one.
The other idea I have is to zip columns and rows and iterate over them. But I feel there should be a simpler way.
You can move to the NumPy domain and index there:
In [14]: df.to_numpy()[ind, np.arange(len(df.columns))]
Out[14]: array([[6, 4, 9, 6]], dtype=int64)
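This pairs the row positions 2, 1, 3, 2 from ind with the column positions 0, 1, 2, 3 (0 up to the number of columns minus 1), so we get the values at [2, 0], [1, 1], and so on.
There's also df.lookup, but it is deprecated (and was removed entirely in pandas 2.0), so it only works on older versions: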
In [19]: df.lookup(ind.iloc[0], df.columns)
~\Anaconda3\Scripts\ipython:1: FutureWarning: The 'lookup' method is deprecated and will beremoved in a future version.You can use DataFrame.melt and DataFrame.locas a substitute.
Out[19]: array([6, 4, 9, 6], dtype=int64)
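If a flat result is preferred over the (1, 4) array shown earlier, one option is to flatten ind before indexing. The sketch below assumes df has the default integer index, so that positions and labels coincide; row_pos and col_pos are names made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8],
                   'col_3': [5, 5, 6, 9], 'col_4': [8, 7, 6, 5]})
ind = pd.DataFrame({'ind_1': [2], 'ind_2': [1], 'ind_3': [3], 'ind_4': [2]})

row_pos = ind.to_numpy().ravel()     # row positions: 2, 1, 3, 2
col_pos = np.arange(df.shape[1])     # column positions: 0, 1, 2, 3

# Both index arrays are 1-D with the same length, so they are paired
# element by element and the result is 1-D as well.
values = df.to_numpy()[row_pos, col_pos]
print(values.tolist())               # [6, 4, 9, 6]
```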

What is the difference between np.array([val1, val2]) and np.array([[val1, val2]])?

What is the difference between np.array([1, 2]) and np.array([[1, 2]])?
Which one of them is a matrix?
I also do not understand the output for shape of the above tensors. The former returns (2,) and the latter returns (1,2).
np.array([1, 2]) builds an array from a single list, giving you a 1D array with shape (2,) because it contains one list of two elements.
When you use the double [ you are actually passing a list of lists, which gives you a 2D array, or matrix, with shape (1, 2): one row and two columns.
With the latter you are able to build more complex matrices like:
np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rendering a 3x3 matrix:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
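The difference shows up in ndim and in how many indices each array needs. A small sketch (the names a and b are chosen only for the example):

```python
import numpy as np

a = np.array([1, 2])       # one list          -> 1-D, shape (2,)
b = np.array([[1, 2]])     # a list of lists   -> 2-D, shape (1, 2)

print(a.ndim, a.shape)     # 1 (2,)
print(b.ndim, b.shape)     # 2 (1, 2)

print(a[1])                # 2      (one index reaches an element)
print(b[0])                # [1 2]  (one index gives the whole row)
print(b[0, 1])             # 2      (two indices reach an element)

# The data are the same; only the shape differs.
print(np.array_equal(a.reshape(1, 2), b))   # True
```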

Sort one list from another list in TensorFlow

I have two tf.Tensors A: [x0, x1, x2, x3, x4] and B: [2, 2, 1, 3, 2]. I would like to sort A using B.
Basically I would like to do the following, but using only TF operators:
list1, list2 = zip(*sorted(zip(list1, list2)))
I tried tf.sort() with tf.stack, but it seems to sort each dimension independently. I think I need to use tf.argsort, similarly to this answer (Sort array's rows by another array in Python), but the indexing fails because tensor indexing does not seem to be supported.
I think I found the solution:
list1 = [2, 2, 1, 3, 2]
list2 = [0, 1, 2, 3, 4]
ids = tf.argsort(list1)
out = tf.gather(list2, ids) # [2, 0, 1, 4, 3]
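For comparison with the rest of this page, the same argsort-then-gather pattern can be written in NumPy, where indexing with an integer array plays the role of tf.gather. This is a NumPy sketch, not TensorFlow code; the string labels stand in for x0..x4, and kind='stable' is chosen here so that equal keys keep their original order.

```python
import numpy as np

keys = np.array([2, 2, 1, 3, 2])
values = np.array(['x0', 'x1', 'x2', 'x3', 'x4'])

order = np.argsort(keys, kind='stable')   # positions that sort keys
sorted_keys = keys[order]
sorted_values = values[order]             # integer-array indexing acts as a gather

print(order.tolist())           # [2, 0, 1, 4, 3]
print(sorted_values.tolist())   # ['x2', 'x0', 'x1', 'x4', 'x3']
```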

numpy find values of maxima pointed to by argmax [duplicate]

This question already has answers here:
Index n dimensional array with (n-1) d array
I have a 3-d array. I find the indexes of the maxima along an axis using argmax. How do I now use these indexes to obtain the maximal values?
2nd part: How to do this for arrays of N-d?
Eg:
u = np.arange(12).reshape(3,4,1)
In [125]: e = u.argmax(axis=2)
In [130]: e
Out[130]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
It would be nice if u[e] produced the expected results, but it doesn't work.
The return value of argmax along an axis can't be simply used as an index. It only works in a 1d case.
In [124]: u = np.arange(12).reshape(3,4,1)
In [125]: e = u.argmax(axis=2)
In [126]: u.shape
Out[126]: (3, 4, 1)
In [127]: e.shape
Out[127]: (3, 4)
e is (3,4), but its values only index the last dimension of u.
In [128]: u[e].shape
Out[128]: (3, 4, 4, 1)
Instead we have to construct indices for the other 2 dimensions, ones which broadcast with e. For example:
In [129]: I,J=np.ix_(range(3),range(4))
In [130]: I
Out[130]:
array([[0],
       [1],
       [2]])
In [131]: J
Out[131]: array([[0, 1, 2, 3]])
Those are (3,1) and (1,4), which broadcast with the (3,4) e to give the desired output:
In [132]: u[I,J,e]
Out[132]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
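The same indexing is easier to verify when the last dimension is larger than 1, so that e is not all zeros. A minimal sketch, assuming a random (3, 4, 5) array in place of the one from the question:

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only so the sketch is repeatable
u = rng.random((3, 4, 5))
e = u.argmax(axis=2)                 # shape (3, 4), values in 0..4

I, J = np.ix_(range(3), range(4))    # shapes (3, 1) and (1, 4)
maxima = u[I, J, e]                  # I, J and e broadcast to (3, 4)

print(maxima.shape)                             # (3, 4)
print(np.array_equal(maxima, u.max(axis=2)))    # True
```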
This kind of question has been asked before, so it should probably be marked as a duplicate. The fact that your last dimension is size 1, and hence e is all 0s, distracts readers from the underlying issue (using a multidimensional argmax as an index).
numpy: how to get a max from an argmax result
Get indices of numpy.argmax elements over an axis
Assuming you've taken the argmax on the last dimension:
In [156]: ij = np.indices(u.shape[:-1])
In [157]: u[(*ij,e)]
Out[157]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
or:
ij = np.ix_(*[range(i) for i in u.shape[:-1]])
If the axis is in the middle, it'll take a bit more tuple fiddling to arrange the ij elements and e.
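One way to do that tuple fiddling is to build index grids for the remaining axes and insert the argmax result at the position of the reduced axis. The function below is a hypothetical helper written for this illustration (it is not a NumPy function) and assumes a non-negative axis.

```python
import numpy as np

def pick_along_axis(arr, idx, axis):
    """Hypothetical helper: take arr at positions idx along a non-negative axis.

    idx must have the shape of arr with that axis removed,
    e.g. idx = arr.argmax(axis=axis).
    """
    other_shape = arr.shape[:axis] + arr.shape[axis + 1:]
    grids = list(np.indices(other_shape))   # one index grid per remaining axis
    grids.insert(axis, idx)                 # idx goes where the reduced axis was
    return arr[tuple(grids)]

rng = np.random.default_rng(0)              # seeded only so the sketch is repeatable
u = rng.random((3, 4, 5))

e = u.argmax(axis=1)                        # middle axis, shape (3, 5)
maxima = pick_along_axis(u, e, axis=1)
print(np.array_equal(maxima, u.max(axis=1)))    # True
```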
So for a general N-d array:
dims = np.ix_(*[range(x) for x in u.shape[:-1]])
u.__getitem__((*dims,e))
On Python versions before 3.11, u[*dims,e] is a syntax error, but you don't have to call __getitem__ directly: u[(*dims, e)] or u[tuple(dims) + (e,)] builds the same index tuple.
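NumPy 1.15 and later also provide np.take_along_axis, which avoids building the other indices by hand. It expects the index array to have the same number of dimensions as the data, so the reduced axis is put back with length 1 first. A sketch under those assumptions, using a random (3, 4, 5) array and the illustrative name e_kept:

```python
import numpy as np

rng = np.random.default_rng(0)              # seeded only so the sketch is repeatable
u = rng.random((3, 4, 5))

axis = 1
e = u.argmax(axis=axis)                     # shape (3, 5)
e_kept = np.expand_dims(e, axis=axis)       # shape (3, 1, 5), same ndim as u

maxima = np.take_along_axis(u, e_kept, axis=axis).squeeze(axis=axis)
print(maxima.shape)                               # (3, 5)
print(np.array_equal(maxima, u.max(axis=axis)))   # True
```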

NumPy: generalize one-hot encoding to k-hot encoding

I'm using this code to one-hot encode values:
idxs = np.array([1, 3, 2])
vals = np.zeros((idxs.size, idxs.max()+1))
vals[np.arange(idxs.size), idxs] = 1
But I would like to generalize it to k-hot encoding (where shape of vals would be same, but each row can contain k ones).
Unfortunately, I can't figure out how to index multiple cols from each row. I tried vals[0:2, [[0, 1], [3]]] to select first and second column from first row and third column from second row, but it does not work.
It's called advanced indexing. To quote the question:
"to select first and second column from first row and third column from second row"
You just need to pass the respective rows and columns in separate iterables (tuple, list):
In [9]: a
Out[9]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [10]: a[[0, 0, 1],[0, 1, 3]]
Out[10]: array([0, 1, 8])
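The same pairing can be used for the k-hot assignment itself. The sketch below assumes the hot columns arrive as one list per row (idx_lists and n_cols are hypothetical inputs made up for the example) and flattens them into matching row and column index arrays.

```python
import numpy as np

idx_lists = [[0, 1], [3], [1, 2, 3]]   # hypothetical input: hot columns per row
n_cols = 4                             # assumed number of classes

# Repeat each row number once per hot column in that row,
# and flatten the column lists in the same order.
counts = [len(c) for c in idx_lists]
rows = np.repeat(np.arange(len(idx_lists)), counts)   # [0 0 1 2 2 2]
cols = np.concatenate(idx_lists)                      # [0 1 3 1 2 3]

vals = np.zeros((len(idx_lists), n_cols))
vals[rows, cols] = 1                   # paired (row, col) assignment
print(vals)
```

Because the rows may contain different numbers of ones, the ragged lists cannot be passed as a single rectangular index array; flattening them into two equal-length 1-D arrays sidesteps that.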