Numpy: np.sum with negative axis - numpy

I wonder what "If axis is negative it counts from the last to the first axis." means in the docs. I've tested these:
>>> t
array([[1, 2],
[3, 4]])
>>> np.sum(t, axis=1)
array([3, 7])
>>> np.sum(t, axis=0)
array([4, 6])
>>> np.sum(t, axis=-2)
array([4, 6])
I'm still confused and need an easily understood explanation.

First look at list indexing on a length-2 list:
>>> L = ['one', 'two']
>>> L[-1] # last element
'two'
>>> L[-2] # second-to-last element
'one'
>>> L[-3] # out of bounds - only two elements in this list
# IndexError: list index out of range
The axis argument is analogous, except that it indexes the dimensions of the ndarray rather than the elements of a list. This is easier to see with a non-square array:
>>> t = np.arange(1,11).reshape(2,5)
>>> t
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10]])
>>> t.ndim # two-dimensional array
2
>>> t.shape # a tuple of length t.ndim
(2, 5)
So let's look at the various ways to call sum:
>>> t.sum() # all elements
55
>>> t.sum(axis=0) # sum over 0th axis i.e. columns
array([ 7, 9, 11, 13, 15])
>>> t.sum(axis=1) # sum over 1st axis i.e. rows
array([15, 40])
>>> t.sum(axis=-2) # sum over -2th axis i.e. columns again (-2 % ndim == 0)
array([ 7, 9, 11, 13, 15])
Calling t.sum(axis=-3) raises an error (an AxisError), because this array has only 2 dimensions. It would be valid on a 3D array, though.
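The same rule can be checked on a 3D array, where axis=-3 is valid and axis=-4 is not. This is only a minimal sketch: the array `a` and the helper `normalize_axis` are made up for illustration and are not part of NumPy's public API.

```python
import numpy as np

def normalize_axis(axis, ndim):
    """Hypothetical helper: map a possibly negative axis to its non-negative equivalent."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of bounds for array of dimension {ndim}")
    return axis % ndim

a = np.arange(24).reshape(2, 3, 4)  # illustrative 3D array

# Each negative axis gives the same result as its non-negative counterpart.
for axis in (-1, -2, -3):
    positive = normalize_axis(axis, a.ndim)
    same = np.array_equal(a.sum(axis=axis), a.sum(axis=positive))
    print(axis, "->", positive, a.sum(axis=axis).shape, same)

# axis=-4 is out of bounds for a 3D array, just like L[-3] for a length-2 list.
try:
    a.sum(axis=-4)
except (ValueError, IndexError) as exc:  # NumPy's AxisError subclasses both
    print("error:", exc)
```

The modulo in the helper mirrors the `-2 % ndim == 0` comment in the answer above.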

Related

Assign numpy matrix to pandas columns

I have a dataframe with 48870 rows and calculated embeddings with shape (48870, 768).
I want to assign these embeddings to a pandas column.
When I try
test['original_text_embeddings'] = embeddings
I have an error: Wrong number of items passed 768, placement implies 1
I know that something like df.loc['original_text_embeddings'] = embeddings[0] will work, but I need to automate this process.
A dataframe/column needs a 1d list/array:
In [84]: x = np.arange(12).reshape(3,4)
In [85]: pd.Series(x)
...
ValueError: Data must be 1-dimensional
Splitting the array into a list (of arrays):
In [86]: pd.Series(list(x))
Out[86]:
0 [0, 1, 2, 3]
1 [4, 5, 6, 7]
2 [8, 9, 10, 11]
dtype: object
In [87]: _.to_numpy()
Out[87]:
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])],
dtype=object)
Your embeddings have 768 columns, which would translate to equally 768 columns in a data frame. You are trying to assign all columns from the embeddings to just one column in the data frame, which is not possible.
What you could do is generate a new data frame from the embeddings and concatenate it with the test data frame:
embedding_df = pd.DataFrame(embeddings, index=test.index) # reuse test's index so the rows align
test = pd.concat([test, embedding_df], axis=1)
Have a look at the documentation for handling indexes and concatenating on different axis:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
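Both suggestions can be tried side by side on a small stand-in. This is a sketch under assumptions: the frame `test`, its `original_text` column, the 4 x 3 `embeddings` matrix, and the `emb_` column prefix are invented here in place of the real 48870 x 768 data.

```python
import numpy as np
import pandas as pd

# Stand-in data: 4 rows and 3-dimensional embeddings instead of 48870 x 768.
test = pd.DataFrame({"original_text": ["a", "b", "c", "d"]})
embeddings = np.arange(12, dtype=float).reshape(4, 3)

# Option 1: a single object column, each cell holding one row of the matrix.
test["original_text_embeddings"] = list(embeddings)

# Option 2: one numeric column per embedding dimension, aligned on test's index.
embedding_df = pd.DataFrame(embeddings, index=test.index).add_prefix("emb_")
wide = pd.concat([test[["original_text"]], embedding_df], axis=1)

print(test.loc[0, "original_text_embeddings"])
print(wide.columns.tolist())
```

Option 1 keeps everything in one column but gives it dtype object, so vectorized numeric operations on that column are not available. Option 2 keeps numeric dtypes at the cost of many columns.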

Numpy Advanced Indexing confusion

If a is numpy array of shape (5,3), b is of shape (2,2) and c is of shape (2,2), what is the shape of a[b,c]?
Can anyone explain this to me with an example. I've read the docs but still I am not able to understand how it works.
Just for the purpose of expounding the concept of advanced indexing, here is a contrived example:
# input arrays
In [22]: a
Out[22]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
In [23]: b
Out[23]:
array([[0, 1],
[2, 3]])
In [24]: c
Out[24]:
array([[0, 1],
[2, 2]])
# advanced indexing
In [25]: a[b, c]
Out[25]:
array([[ 0, 4],
[ 8, 11]])
By the expression a[b, c], we are using the arrays b and c to selectively pull out elements from the array a.
To interpret the output of a[b, c]:
# b # c # 2D indices
[[0, 1], [[0, 1] ---> (0,0) (1,1)
[2, 3]] [2, 2]] ---> (2,2) (3,2)
Each of these 2D indices is applied to the array a, and the corresponding elements are returned as an array, which is the result of a[b, c]:
a[(0,0)] --> 0
a[(1,1)] --> 4
a[(2,2)] --> 8
a[(3,2)] --> 11
The above elements are returned as a 2D array since the arrays b and c are 2D arrays themselves.
Also, please note that advanced indexing always returns a copy.
In [27]: (a[b, c]).flags.owndata
Out[27]: True
However, an assignment using advanced indexing alters the original array in place. This behaviour also depends on two factors: whether the indexing operation is pure (only advanced indexing) or mixed (a combination of advanced and simple indexing), and, in the case of mixed indexing, the order in which the indexing steps are applied.
See: Views and copies confusion with NumPy arrays when combining index operations
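To answer the shape question directly and to check the copy versus in-place behaviour, here is a small sketch that reuses the arrays from the example above. The variable name `picked` is introduced only for illustration.

```python
import numpy as np

a = np.arange(15).reshape(5, 3)
b = np.array([[0, 1], [2, 3]])   # row indices
c = np.array([[0, 1], [2, 2]])   # column indices

picked = a[b, c]                 # result takes the broadcast shape of b and c
print(picked.shape)              # (2, 2)

# Reading gives a copy: changing it leaves `a` untouched.
picked[0, 0] = -1
print(a[0, 0])                   # 0

# Assigning through the same index expression writes into `a` itself.
a[b, c] = 99
print(a[0, 0], a[1, 1], a[2, 2], a[3, 2])   # 99 99 99 99
```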

NumPy: generalize one-hot encoding to k-hot encoding

I'm using this code to one-hot encode values:
idxs = np.array([1, 3, 2])
vals = np.zeros((idxs.size, idxs.max()+1))
vals[np.arange(idxs.size), idxs] = 1
But I would like to generalize it to k-hot encoding (where shape of vals would be same, but each row can contain k ones).
Unfortunately, I can't figure out how to index multiple columns from each row. I tried vals[0:2, [[0, 1], [3]]] to select the first and second column from the first row and the third column from the second row, but it does not work.
It's called advanced indexing.
You asked how "to select first and second column from first row and third column from second row".
You just need to pass the respective rows and columns in separate iterables (tuple, list):
In [9]: a
Out[9]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
In [10]: a[[0, 0, 1],[0, 1, 3]]
Out[10]: array([0, 1, 8])
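Applied to the k-hot case, the same idea could look like the sketch below. The input format is an assumption: `hot_cols` (a list with the hot column indices of each row) and `n_cols` are hypothetical names, not something given in the question.

```python
import numpy as np

# Hypothetical input: the "hot" column indices for each row (lengths may differ).
hot_cols = [[0, 1], [3], [1, 2, 3]]
n_cols = 4

# Repeat each row number once per hot column so rows and cols pair up element-wise.
rows = np.repeat(np.arange(len(hot_cols)), [len(c) for c in hot_cols])
cols = np.concatenate(hot_cols)

vals = np.zeros((len(hot_cols), n_cols))
vals[rows, cols] = 1
print(vals)
```

Here `rows` and `cols` have the same length, so `vals[rows, cols]` addresses one element per pair, exactly as in `a[[0, 0, 1], [0, 1, 3]]` above.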

Indexing a sub-array by lists [duplicate]

I have some array A and 2 lists of indices ind1 and ind2, one for each axis. Now this gives me a slice of the array, to which I need to assign some new values. Problem is, my approach for this does not work.
Let me demonstrate with an example. First I create an array, and try to access some slice:
>>> A=numpy.arange(9).reshape(3,3)
>>> ind1, ind2 = [0,1], [1,2]
>>> A
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> A[ind1,ind2]
array([1, 5])
Now this just gives me 2 values, not the 2-by-2 matrix I was going for. So I tried this:
>>> A[ind1,:][:,ind2]
array([[1, 2],
[4, 5]])
Okay, better. Now let's say these values should be 0:
>>> A[ind1,:][:,ind2]=0
>>> A
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
If I try to assign like this, the array A does not get updated, because of the double indexing (I am only assigning to some copy of A, which gets discarded). Is there some way to index the sub array by just indexing once?
Note: Indexing by selecting some appropriate range like A[:2,1:3] would work for this example, but I need something that works with any arbitrary list of indices.
What about using meshgrid to create your 2D indices? As follows:
>>> import numpy as np
>>> A = np.arange(9).reshape(3,3)
>>> ind1, ind2 = [0,1],[1,2]
>>> ind12 = np.meshgrid(ind1,ind2, indexing='ij')
>>> # = np.ix_(ind1,ind2) as pointed out by @Divakar
>>> A[tuple(ind12)] # index with a tuple; recent NumPy no longer treats a list of arrays as one
[[1 2]
[4 5]]
And finally
>>> A[tuple(ind12)] = 0
>>> A
[[0 0 0]
[3 0 0]
[6 7 8]]
This works with any arbitrary list of indices:
>>> A = np.arange(9).reshape(3,3) # start again from the original array
>>> ind1, ind2 = [0,2],[0,2]
>>> ind12 = np.meshgrid(ind1,ind2, indexing='ij')
>>> A[tuple(ind12)] = 100
>>> A
[[100 1 100]
[ 3 4 5]
[100 7 100]]
As pointed out by @hpaulj in the comments, np.ix_(ind1,ind2) is equivalent to the following use of np.meshgrid:
>>> np.meshgrid(ind1,ind2, indexing='ij', sparse=True)
This is more efficient, since the sparse index arrays have shapes (2, 1) and (1, 2) and are broadcast rather than stored in full. It is a point in favor of np.ix_ whenever you would otherwise always set indexing='ij' and sparse=True.
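For completeness, a short sketch of the np.ix_ variant on the same example. The names `rows` and `cols` are introduced here only to show the shapes of the open-mesh index arrays.

```python
import numpy as np

A = np.arange(9).reshape(3, 3)
ind1, ind2 = [0, 1], [1, 2]

# np.ix_ returns a tuple of index arrays that broadcast against each other.
rows, cols = np.ix_(ind1, ind2)
print(rows.shape, cols.shape)   # (2, 1) (1, 2)

# A single indexing step, so the assignment reaches A itself.
A[np.ix_(ind1, ind2)] = 0
print(A)
```

Because only one indexing operation is involved, there is no intermediate copy for the assignment to get lost in, unlike A[ind1,:][:,ind2] = 0.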

Numpy Indexing Behavior

I am having a lot of trouble understanding numpy indexing for multidimensional arrays. In the example I am working with, let's say that I have a 2D array, A, which is 100x10. Then I have another array, B, which is a 1D array of 100 values between 0 and 9 (indices for A). In MATLAB, I would use A(sub2ind(size(A), 1:size(A,1)', B)) to return, for each row of A, the value at the index stored in the corresponding row of B.
So, as a test case, let's say I have this:
A = np.random.rand(100,10)
B = np.int32(np.floor(np.random.rand(100)*10))
If I print their shapes, I get:
print A.shape returns (100L, 10L)
print B.shape returns (100L,)
When I try to index into A using B naively (incorrectly)
Test1 = A[:,B]
print Test1.shape returns (100L, 100L)
but if I do
Test2 = A[range(A.shape[0]),B]
print Test2.shape returns (100L,)
which is what I want. I'm having trouble understanding the distinction being made here. In my mind, A[:,5] and A[range(A.shape[0]),5] should return the same thing, but with B in place of 5 they don't. How is : different from range(sizeArray), which just produces the indices from 0 to sizeArray-1, when used as an index?
Let's look at a simple array:
In [654]: X=np.arange(12).reshape(3,4)
In [655]: X
Out[655]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
With the slice we can pick 3 columns of X, in any order (and even repeated). In other words, take all the rows, but selected columns.
In [656]: X[:,[3,2,1]]
Out[656]:
array([[ 3, 2, 1],
[ 7, 6, 5],
[11, 10, 9]])
If instead I use a list (or array) of 3 values, it pairs them up with the column values, effectively picking 3 values, X[0,3],X[1,2],X[2,1]:
In [657]: X[[0,1,2],[3,2,1]]
Out[657]: array([3, 6, 9])
If instead I gave it a column vector to index rows, I get the same thing as with the slice:
In [659]: X[[[0],[1],[2]],[3,2,1]]
Out[659]:
array([[ 3, 2, 1],
[ 7, 6, 5],
[11, 10, 9]])
This amounts to picking 9 individual values, as generated by broadcasting:
In [663]: np.broadcast_arrays(np.arange(3)[:,None],np.array([3,2,1]))
Out[663]:
[array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]),
array([[3, 2, 1],
[3, 2, 1],
[3, 2, 1]])]
numpy indexing can be confusing. But a good starting point is this page: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
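The same comparison on the question's 100x10 case, as a minimal sketch. The random generator and its seed, and the names `all_cols`, `paired` and `same`, are choices made here for illustration; np.take_along_axis is shown as one possible alternative, not as the method the answer recommends.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded only to make the sketch repeatable
A = rng.random((100, 10))
B = rng.integers(0, 10, size=100)       # one column index per row of A

all_cols = A[:, B]                      # every row, with the columns listed in B
paired = A[np.arange(A.shape[0]), B]    # A[i, B[i]] for each row i
print(all_cols.shape, paired.shape)     # (100, 100) (100,)

# The paired result is the diagonal of the slice result.
print(np.array_equal(paired, all_cols.diagonal()))   # True

# An alternative spelling of the row-wise pick.
same = np.take_along_axis(A, B[:, None], axis=1)[:, 0]
print(np.array_equal(paired, same))     # True
```

The slice keeps the row axis and adds a new axis of length len(B), while the two index arrays are paired element by element and produce a single axis.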