Bincount with indices - numpy

I am looking for an efficient way to compute, as an ndarray, the indices of the elements that fall into each bin of bincount.
To illustrate:
>>> x = np.array([0, 1, 1, 0, 2])
>>> b = np.bincount(x)
>>> b
array([2, 2, 1])
I am now looking for an ndarray that represents the indices of the elements of each bin:
[0 3 1 2 4]
I am looking for a fast NumPy solution that does not contain loops. Does anyone know how to implement this? Thanks very much in advance!
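One common loop-free approach (a sketch, not part of the original question) is a stable argsort of x, optionally followed by np.split to group the indices per bin using the counts in b:
>>> idx = np.argsort(x, kind='stable')   # indices of x, grouped by bin value
>>> idx
array([0, 3, 1, 2, 4])
>>> np.split(idx, np.cumsum(b)[:-1])     # optional: one array of element indices per bin
[array([0, 3]), array([1, 2]), array([4])]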

Related

Numpy Advanced Indexing confusion

If a is a numpy array of shape (5,3), b is of shape (2,2) and c is of shape (2,2), what is the shape of a[b,c]?
Can anyone explain this to me with an example? I've read the docs but I am still not able to understand how it works.
Just for the purpose of expounding the concept of advanced indexing, here is a contrived example:
# input arrays
In [22]: a
Out[22]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
In [23]: b
Out[23]:
array([[0, 1],
       [2, 3]])
In [24]: c
Out[24]:
array([[0, 1],
       [2, 2]])
# advanced indexing
In [25]: a[b, c]
Out[25]:
array([[ 0,  4],
       [ 8, 11]])
By the expression a[b, c], we are using the arrays b and c to selectively pull out elements from the array a.
To interpret the output of a[b, c]:
#     b            c               2D indices
#  [[0, 1],     [[0, 1],    --->   (0,0) (1,1)
#   [2, 3]]      [2, 2]]    --->   (2,2) (3,2)
These 2D indices are applied to the array a, and the corresponding elements are returned as an array in the result of a[b, c]:
a[(0,0)] --> 0
a[(1,1)] --> 4
a[(2,2)] --> 8
a[(3,2)] --> 11
The above elements are returned as a 2D array since the arrays b and c are 2D arrays themselves.
Also, please note that advanced indexing always returns a copy.
In [27]: (a[b, c]).flags.owndata
Out[27]: True
However, an assignment operation using advanced indexing will alter the original array in place. But this behaviour also depends on two factors:
whether your indexing operation is pure (only advanced indexing) or mixed (a combination of advanced & simple indexing);
in the case of mixed indexing, the order in which the two are applied.
See: Views and copies confusion with NumPy arrays when combining index operations
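As a minimal sketch (not part of the original answer) of that in-place assignment behaviour, with the same a, b and c as above redefined for completeness:
import numpy as np

a = np.arange(15).reshape(5, 3)
b = np.array([[0, 1], [2, 3]])
c = np.array([[0, 1], [2, 2]])

a[b, c] = -1                                # pure advanced indexing: writes through to a
print(a[0, 0], a[1, 1], a[2, 2], a[3, 2])   # -1 -1 -1 -1, so a itself was modified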

Elementwise multiplication of NumPy arrays of different shapes

When I use numpy.multiply(a, b) to multiply NumPy arrays with shapes (2, 1) and (2,), I get a 2-by-2 matrix. But what I want is element-wise multiplication.
I'm not familiar with NumPy's rules. Can anyone explain what's happening here?
When doing an element-wise operation between two arrays that are not of the same dimensionality, NumPy will perform broadcasting. In your case, NumPy will broadcast b along the rows of a:
import numpy as np
a = np.array([[1],
              [2]])
b = np.array([3, 4])   # make b an ndarray so it can be reindexed below
print(a * b)
Gives:
[[3 4]
 [6 8]]
To prevent this, you need to make a and b of the same dimensionality. You can add dimensions to an array by using np.newaxis or None in your indexing, like this:
print(a * b[:, np.newaxis])
Gives:
[[3]
 [8]]
Let's say you have two arrays, a and b, with shape (2,3) and (2,) respectively:
a = np.random.randint(10, size=(2,3))
b = np.random.randint(10, size=(2,))
The two arrays, for example, contain:
a = np.array([[8, 0, 3],
              [2, 6, 7]])
b = np.array([7, 5])
Now, to compute the element-wise product a*b, you have to specify what NumPy should do when it reaches the absent axis=1 of array b. You can do so by adding None:
result = a*b[:,None]
With result being:
array([[56,  0, 21],
       [10, 30, 35]])
Here are input arrays a and b with the shapes you mentioned:
In [136]: a
Out[136]:
array([[0],
       [1]])
In [137]: b
Out[137]: array([0, 1])
Now, when we do multiplication using either * or numpy.multiply(a, b), we get:
In [138]: a * b
Out[138]:
array([[0, 0],
       [0, 1]])
The result is a (2,2) array because numpy uses broadcasting.
#           b
#    a  |   0     1
#  ------------------
#    0  |  0*0   0*1
#    1  |  1*0   1*1
I just explained the broadcasting rules in broadcasting arrays in numpy
In your case
(2,1) + (2,) => (2,1) + (1,2) => (2,2)
It has to add a dimension to the 2nd argument, and can only add it at the beginning (to avoid ambiguity).
So if you want a (2,1) result, you have to expand the 2nd argument yourself, with reshape or [:, np.newaxis].
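As a small sketch (not from the original answers) of that explicit expansion, using the a and b of this example:
import numpy as np

a = np.array([[0], [1]])                  # shape (2, 1)
b = np.array([0, 1])                      # shape (2,)

print(a * b[:, np.newaxis])               # [[0]
                                          #  [1]]
print(np.multiply(a, b.reshape(2, 1)))    # same (2, 1) result via an explicit reshape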

Get column-index from column-name in pandas? [duplicate]

In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
Here is a solution using a list comprehension. cols is the list of columns to get indices for:
[df.columns.get_loc(c) for c in cols if c in df]
DSM's solution works, but if you wanted a direct equivalent to R's which, you could do (df.columns == name).nonzero()
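For example, a small sketch (not from the original answer) using DSM's frame, with the column order shown in his output:
from pandas import DataFrame

# same frame as in DSM's answer, columns given explicitly in the order shown there
df = DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
               columns=["apple", "orange", "pear"])
(df.columns == "pear").nonzero()
# -> (array([2]),)   a tuple with one array of positions per dimension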
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels), use get_indexer_for. It takes the same arguments as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, e.g. taking the nearest value within a tolerance for float values. If two indices have the same distance to the specified label or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
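If you specifically want the nearest-match behaviour described above, here is a sketch using get_indexer's method parameter (assuming your pandas version supports it; the expected output follows the tie-breaking rule stated above):
import pandas as pd

df = pd.DataFrame(
    {"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
    index=[0, .9, 1.1])

df.index.get_indexer([0, 1], method='nearest')
# expected: array([0, 2], dtype=int64) -- 1 is equidistant from 0.9 and 1.1,
# so the larger index value (1.1, at position 2) should be chosen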
When you might be looking to find multiple column matches, a vectorized solution using the searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values[location]
Using @DSM's example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location]  # (thanks to @roobie-nuby for pointing that out in the comments)
To modify DSM's answer a bit: in the current version of pandas (1.1.5), get_loc has some quirks depending on the type of Index, so you might get back an integer position, a boolean mask, or a slice. This is somewhat frustrating when all I want is one column's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
When the column might or might not exist, the following variant of the above works:
ix = None
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    ix = None

if ix is None:
    pass  # do something, e.g. handle the missing column
import random
import pandas as pd   # needed for pd.DataFrame below

def char_range(c1, c2):  # question 7001144
    for c in range(ord(c1), ord(c2) + 1):
        yield chr(c)

df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3)   # random data

rearranged = random.sample(range(26), 26)      # random column order
df = df.iloc[:, rearranged]
print(df.iloc[:, :15])                         # view of the first 15 columns

for col in df.columns:                         # list each column's index and name
    print(str(df.columns.get_loc(col)) + '\t' + col)

Indexing a sub-array by lists [duplicate]

This question already has an answer here:
Assign values to numpy.array
(1 answer)
Closed 5 years ago.
I have some array A and 2 lists of indices ind1 and ind2, one for each axis. Now this gives me a slice of the array, to which I need to assign some new values. Problem is, my approach for this does not work.
Let me demonstrate with an example. First I create an array, and try to access some slice:
>>> A=numpy.arange(9).reshape(3,3)
>>> ind1, ind2 = [0,1], [1,2]
>>> A
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> A[ind1,ind2]
array([1, 5])
Now this just gives me 2 values, not the 2-by-2 matrix I was going for. So I tried this:
>>> A[ind1,:][:,ind2]
array([[1, 2],
       [4, 5]])
Okay, better. Now let's say these values should be 0:
>>> A[ind1,:][:,ind2]=0
>>> A
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
If I try to assign like this, the array A does not get updated, because of the double indexing (I am only assigning to some copy of A, which gets discarded). Is there some way to index the sub-array by indexing just once?
Note: Indexing by selecting some appropriate range like A[:2,1:3] would work for this example, but I need something that works with any arbitrary list of indices.
What about using meshgrid to create your 2D indices? As follows:
>>> import numpy as np
>>> A = np.arange(9).reshape(3,3)
>>> ind1, ind2 = [0,1],[1,2]
>>> ind12 = tuple(np.meshgrid(ind1, ind2, indexing='ij'))  # a tuple, so it can be used directly as an index
>>> # = np.ix_(ind1, ind2), as pointed out by @Divakar
>>> A[ind12]
array([[1, 2],
       [4, 5]])
And finally
>>> A[ind12] = 0
>>> A
array([[0, 0, 0],
       [3, 0, 0],
       [6, 7, 8]])
Which works with any arbitrary list of indices.
>>> ind1, ind2 = [0,2],[0,2]
>>> A = np.arange(9).reshape(3, 3)                          # start from a fresh A again
>>> ind12 = tuple(np.meshgrid(ind1, ind2, indexing='ij'))
>>> A[ind12] = 100
>>> A
array([[100,   1, 100],
       [  3,   4,   5],
       [100,   7, 100]])
As pointed out by @hpaulj in the comments, note that np.ix_(ind1, ind2) is actually equivalent to the following use of np.meshgrid:
>>> np.meshgrid(ind1,ind2, indexing='ij', sparse=True)
This is, if anything, even more efficient, and is a point in np.ix_'s favour whenever you would always be passing indexing='ij' and sparse=True anyway.
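For reference, a minimal sketch (not part of the original answer) of doing the same assignment with np.ix_ directly:
>>> A = np.arange(9).reshape(3, 3)
>>> ind1, ind2 = [0, 1], [1, 2]
>>> A[np.ix_(ind1, ind2)] = 0
>>> A
array([[0, 0, 0],
       [3, 0, 0],
       [6, 7, 8]])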

Indexing per row in TensorFlow

I have a matrix:
Params =
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
For each row I want to select some elements using column indices:
col_indices =
[[0 1]
 [1 2]
 [2 3]]
In Numpy, I can create row indices:
row_indices =
[[0 0]
 [1 1]
 [2 2]]
and do params[row_indices, col_indices]
In TensorFlow, I did this:
tf_params = tf.constant(params)
tf_col_indices = tf.constant(col_indices, dtype=tf.int32)
tf_row_indices = tf.constant(row_indices, dtype=tf.int32)
tf_params[row_indices, col_indices]
But this raised an error:
ValueError: Shape must be rank 1 but is rank 3
What does it mean? How should I do this kind of indexing properly?
Thanks!
Tensor rank (sometimes referred to as order or degree or n-dimension) is the number of dimensions of the tensor. For example, the following tensor (defined as a Python list) has a rank of 2:
t = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
A rank-two tensor is what we typically think of as a matrix; a rank-one tensor is a vector. For a rank-two tensor you can access any element with the syntax t[i, j]. For a rank-three tensor you would need to address an element with t[i, j, k]. See this for more details.
ValueError: Shape must be rank 1 but is rank 3 means you are trying to create a 3-tensor (cube of numbers) instead of a vector.
To see how you can declare tensor constants of different shapes, see this.
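For reference, one way to do this kind of per-row gathering (a sketch, not from the original answer, assuming TensorFlow 2.x eager execution) is tf.gather_nd, which takes full (row, column) index pairs:
import numpy as np
import tensorflow as tf

params = tf.constant(np.arange(12).reshape(3, 4))
row_indices = tf.constant([[0, 0], [1, 1], [2, 2]])
col_indices = tf.constant([[0, 1], [1, 2], [2, 3]])

# Pair each row index with its column index along a new last axis, giving an
# index tensor of shape (3, 2, 2), then gather those (row, col) elements.
indices = tf.stack([row_indices, col_indices], axis=-1)
result = tf.gather_nd(params, indices)   # [[0 1] [5 6] [10 11]]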