Compute unique groups from Pandas group-by results - pandas

I'd like to count the unique groups from the result of a Pandas group-by operation. For instance here is an example data frame.
In [98]: df = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [10,10,11,10,10,15]})
In [99]: df.groupby('A').groups
Out[99]: {1: [0, 3], 2: [1, 4], 3: [2, 5]}
The conceptual groups are {1: [10, 10], 2: [10, 10], 3: [11, 15]}, where the index locations in the groups above are substituted with the values from column B. The first problem I've run into is how to convert those positions (e.g. [0, 3]) into values from the B column.
Once I can convert the groups into value groups from column B, I can compute the unique groups by hand; a secondary question is whether Pandas has a built-in routine for this, which I haven't seen.
Edit: updated with the target output.
This is the output I would be looking for in the simplest case:
{1: [10, 10], 2: [10, 10], 3: [11, 15]}
And counting the unique groups would produce something equivalent to:
{[10, 10]: 2, [11, 15]: 1}

How about:
>>> df = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [10,10,11,10,10,15]})
>>> df.groupby("A")["B"].apply(tuple).value_counts()
(10, 10)    2
(11, 15)    1
dtype: int64
or maybe
>>> df.groupby("A")["B"].apply(lambda x: tuple(sorted(x))).value_counts()
(10, 10)    2
(11, 15)    1
dtype: int64
if you don't care about the order within the group.
You can trivially call .to_dict() if you'd like, e.g.
>>> df.groupby("A")["B"].apply(tuple).value_counts().to_dict()
{(11, 15): 1, (10, 10): 2}
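As for the first sub-question (mapping the positions in .groups back to values from column B): the entries in .groups are index labels rather than positions, so with the default RangeIndex here you can recover them with .loc. A minimal sketch:
>>> {k: df.loc[v, 'B'].tolist() for k, v in df.groupby('A').groups.items()}
{1: [10, 10], 2: [10, 10], 3: [11, 15]}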

Maybe:
>>> df.groupby('A')['B'].aggregate(lambda ts: list(ts.values)).to_dict()
{1: [10, 10], 2: [10, 10], 3: [11, 15]}
For counting the groups you need to convert to tuples, because lists are not hashable:
>>> ts = df.groupby('A')['B'].aggregate(lambda ts: tuple(ts.values))
>>> ts.value_counts().to_dict()
{(11, 15): 1, (10, 10): 2}

Related

Finding values from different rows in pandas

I have one dataframe containing the data and another dataframe containing a single row of indices.
import pandas as pd
data = {'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8], 'col_3': [5, 5, 6, 9], 'col_4': [8, 7, 6, 5]}
df = pd.DataFrame(data)
ind = {'ind_1': [2], 'ind_2': [1], 'ind_3': [3], 'ind_4': [2]}
ind = pd.DataFrame(ind)
Both have the same number of columns. I want to extract the values of df corresponding to the index stored in ind so that I get a single row at the end.
For this data it should be: [6, 4, 9, 6]. I tried df.loc[ind.loc[0]] but that of course gives me four different rows, not one.
The other idea I have is to zip columns and rows and iterate over them. But I feel there should be a simpler way.
You can go to the NumPy domain and index there:
In [13]: import numpy as np
In [14]: df.to_numpy()[ind, np.arange(len(df.columns))]
Out[14]: array([[6, 4, 9, 6]], dtype=int64)
This pairs up 2, 1, 3, 2 from ind with 0, 1, 2, 3 (i.e., 0 to number of columns - 1), so we get the values at [2, 0], [1, 1] and so on.
There's also df.lookup but it's being deprecated, so...
In [19]: df.lookup(ind.iloc[0], df.columns)
~\Anaconda3\Scripts\ipython:1: FutureWarning: The 'lookup' method is deprecated and will be removed in a future version. You can use DataFrame.melt and DataFrame.loc as a substitute.
Out[19]: array([6, 4, 9, 6], dtype=int64)
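Since lookup is going away, another replacement (my suggestion, not the one named in the warning) is np.take_along_axis, which picks, for each column, the row index given for it:
In [20]: np.take_along_axis(df.to_numpy(), ind.to_numpy(), axis=0)
Out[20]: array([[6, 4, 9, 6]], dtype=int64)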

How to filter a pandas dataframe if a column is a list

In the dataframe that comes from
http://bit.ly/imdbratings
one column, actors_list, is a list of the actors in the movie.
How do I filter the dataframe for movies where Al Pacino took part?
e.g. [u'Marlon Brando', u'Al Pacino', u'James Caan']
You can filter with a map function inside the boolean mask. Suppose you are looking for actor number 32:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Actors': [[1, 2, 3], [2, 4, 3], [3, 4, 5, 32, 1],
                              [4, 5, 2, 3], [102, 302], [1, 2, 3, 32, 5]]})
df[df['Actors'].map(lambda x: 32 in x)]
Output:
  name            Actors
2    C  [3, 4, 5, 32, 1]
5    F  [1, 2, 3, 32, 5]
Or, if you want to check whether at least one actor from a list you care about is present in the movie, use any in combination with a lambda:
important_actors = [32, 3]
print(df[df['Actors'].map(lambda x: any(i in x for i in important_actors))])
Output:
  name            Actors
0    A         [1, 2, 3]
1    B         [2, 4, 3]
2    C  [3, 4, 5, 32, 1]
3    D      [4, 5, 2, 3]
5    F  [1, 2, 3, 32, 5]
That's the pattern. You can change any to all if you wish to keep only the movies in which all of the listed actors appear, and so on. Feel free to leave a comment if you need further explanation or have any doubts.
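Applied to the actual IMDB frame, the same map pattern would be (assuming the column is named actors_list and holds Python lists, as in the question):
pacino = df[df['actors_list'].map(lambda x: u'Al Pacino' in x)]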
You can also do it with string containment:
l = [u'Marlon Brando', u'Al Pacino', u'James Caan']
m = df['actors_list'].str.join('|').str.contains('|'.join(l))
df = df[m]
Or
m = pd.DataFrame(df['actors_list'].tolist()).isin(l).any(axis=1)
df = df[m.values]
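On pandas 0.25 or newer, an explode-based sketch (my addition, not part of the original answer) avoids the string round-trip and matches names exactly rather than by substring:
m = df['actors_list'].explode().isin(l).groupby(level=0).any()
df = df[m]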

Numpy: np.sum with negative axis

I wonder what "If axis is negative it counts from the last to the first axis." means in the docs. I've tested these:
>>> t
array([[1, 2],
[3, 4]])
>>> np.sum(t, axis=1)
array([3, 7])
>>> np.sum(t, axis=0)
array([4, 6])
>>> np.sum(t, axis=-2)
array([4, 6])
I'm still confused; I need an easily understood explanation.
First look at list indexing on a length-2 list:
>>> L = ['one', 'two']
>>> L[-1] # last element
'two'
>>> L[-2] # second-to-last element
'one'
>>> L[-3] # out of bounds - only two elements in this list
# IndexError: list index out of range
The axis argument is analogous, except it specifies a dimension of the ndarray. It will be easier to see with a non-square array:
>>> t = np.arange(1,11).reshape(2,5)
>>> t
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10]])
>>> t.ndim # two-dimensional array
2
>>> t.shape # a tuple of length t.ndim
(2, 5)
So let's look at the various ways to call sum:
>>> t.sum() # all elements
55
>>> t.sum(axis=0) # sum over 0th axis i.e. columns
array([ 7, 9, 11, 13, 15])
>>> t.sum(axis=1) # sum over 1st axis i.e. rows
array([15, 40])
>>> t.sum(axis=-2) # sum over -2th axis i.e. columns again (-2 % ndim == 0)
array([ 7, 9, 11, 13, 15])
Trying t.sum(axis=-3) raises an error, because this array only has 2 dimensions. You could use it on a 3-D array, though.
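A quick check of the same rule on a 3-D array:
>>> a = np.arange(24).reshape(2, 3, 4)
>>> a.sum(axis=-1).shape # same as axis=2: sum over the last axis
(2, 3)
>>> a.sum(axis=-3).shape # same as axis=0: sum over the first axis
(3, 4)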

NumPy: generalize one-hot encoding to k-hot encoding

I'm using this code to one-hot encode values:
import numpy as np
idxs = np.array([1, 3, 2])
vals = np.zeros((idxs.size, idxs.max() + 1))
vals[np.arange(idxs.size), idxs] = 1
But I would like to generalize it to k-hot encoding (where shape of vals would be same, but each row can contain k ones).
Unfortunately, I can't figure out how to index multiple columns from each row. I tried vals[0:2, [[0, 1], [3]]] to select the first and second column from the first row and the third column from the second row, but it does not work.
It's called advanced indexing.
to select first and second column from first row and third column from second row
You just need to pass the respective rows and columns in separate iterables (tuple, list):
In [8]: a = np.arange(10).reshape(2, 5)
In [9]: a
Out[9]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [10]: a[[0, 0, 1], [0, 1, 3]]
Out[10]: array([0, 1, 8])
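Applying this to the k-hot question: since rows can have different numbers of ones, one sketch (assuming the hot columns arrive as a ragged list of lists) is to flatten them into paired row/column index arrays first:
In [11]: idxs = [[0, 1], [3]] # columns to set hot, per row
In [12]: rows = np.repeat(np.arange(len(idxs)), [len(i) for i in idxs])
In [13]: cols = np.concatenate(idxs)
In [14]: vals = np.zeros((len(idxs), 4))
In [15]: vals[rows, cols] = 1
In [16]: vals
Out[16]:
array([[1., 1., 0., 0.],
       [0., 0., 0., 1.]])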

Numpy Indexing Behavior

I am having a lot of trouble understanding NumPy indexing for multidimensional arrays. In this example, let's say that I have a 2D array, A, which is 100x10. Then I have another array, B, a 100x1 1D array of values between 0 and 9 (indices into A). In MATLAB, I would use A(sub2ind(size(A), (1:size(A,1))', B)) to return, for each row of A, the value at the index stored in the corresponding row of B.
So, as a test case, let's say I have this:
A = np.random.rand(100,10)
B = np.int32(np.floor(np.random.rand(100)*10))
If I print their shapes, I get:
print A.shape returns (100L, 10L)
print B.shape returns (100L,)
When I try to index into A using B naively (incorrectly)
Test1 = A[:,B]
print Test1.shape returns (100L, 100L)
but if I do
Test2 = A[range(A.shape[0]),B]
print Test2.shape returns (100L,)
which is what I want. I'm having trouble understanding the distinction being made here. In my mind, A[:,5] and A[range(A.shape[0]),5] should return the same thing, but they don't here. How is : different from range(A.shape[0]), which just creates an array of indices from 0 to A.shape[0] - 1?
Let's look at a simple array:
In [654]: X=np.arange(12).reshape(3,4)
In [655]: X
Out[655]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
With the slice we can pick 3 columns of X, in any order (and even repeated). In other words, take all the rows, but selected columns.
In [656]: X[:,[3,2,1]]
Out[656]:
array([[ 3, 2, 1],
[ 7, 6, 5],
[11, 10, 9]])
If instead I use a list (or array) of 3 values, it pairs them up with the column values, effectively picking 3 values, X[0,3], X[1,2], X[2,1]:
In [657]: X[[0,1,2],[3,2,1]]
Out[657]: array([3, 6, 9])
If instead I gave it a column vector to index rows, I get the same thing as with the slice:
In [659]: X[[[0],[1],[2]],[3,2,1]]
Out[659]:
array([[ 3, 2, 1],
[ 7, 6, 5],
[11, 10, 9]])
This amounts to picking 9 individual values, as generated by broadcasting:
In [663]: np.broadcast_arrays(np.arange(3)[:,None],np.array([3,2,1]))
Out[663]:
[array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]),
array([[3, 2, 1],
[3, 2, 1],
[3, 2, 1]])]
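Back to the original arrays, the MATLAB sub2ind pattern translates directly (using randint here, which is equivalent to the question's floor-of-rand construction):
In [664]: A = np.random.rand(100, 10)
In [665]: B = np.random.randint(0, 10, size=100)
In [666]: A[np.arange(A.shape[0]), B].shape
Out[666]: (100,)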
NumPy indexing can be confusing, but a good starting point is this page: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html