Apply custom functions to groupby in pandas

I know there are questions/answers about how to use custom functions for groupby in pandas, but my case is slightly different.
My data is
  group_col       val_col
0         a    [1, 2, 34]
1         a        [2, 4]
2         b  [2, 3, 4, 5]
data = {'group_col': {0: 'a', 1: 'a', 2: 'b'}, 'val_col': {0: [1, 2, 34], 1: [2, 4], 2: [2, 3, 4, 5]}}
df = pd.DataFrame(data)
What I am trying to do is to group by group_col, then sum up the lengths of the lists in val_col for each group. My desired output is:
a 5
b 4
I wonder if I can do this in pandas?

You can try
df['val_col'].str.len().groupby(df['group_col']).sum()

Or, since summing lists concatenates them, sum per group first and then take the length of the result:
df.groupby('group_col')['val_col'].sum().str.len()
Output:
group_col
a 5
b 4
Name: val_col, dtype: int64
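Since the title asks about custom functions, here is an equivalent sketch using GroupBy.agg with a plain Python aggregation over each group's lists:
df.groupby('group_col')['val_col'].agg(lambda lists: sum(len(v) for v in lists))
# group_col
# a    5
# b    4
# Name: val_col, dtype: int64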

Related

Fastest way to get value frequency stored in dictionary format in groupby pandas

In order to calculate the frequency of each value by id, we can use value_counts and groupby.
>>> df = pd.DataFrame({"id":[1,1,1,2,2,2], "col":['a','a','b','a','b','b']})
>>> df
id col
0 1 a
1 1 a
2 1 b
3 2 a
4 2 b
5 2 b
>>> df.groupby('id')['col'].value_counts()
id  col
1   a      2
    b      1
2   b      2
    a      1
But I would like the results stored in dictionary format, not as a Series. How can I achieve that, and keep it fast for a large dataset?
The ideal format is:
id
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
You can unstack the groupby result to get a dict-of-dicts:
df.groupby('id')['col'].value_counts().unstack().to_dict(orient='index')
# {1: {'a': 2, 'b': 1}, 2: {'a': 1, 'b': 2}}
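For context, the intermediate unstack() produces a plain counts table indexed by id, which to_dict(orient='index') then converts into the dict-of-dicts above:
df.groupby('id')['col'].value_counts().unstack()
# col  a  b
# id
# 1    2  1
# 2    1  2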
If you want a Series of dicts (here one dict per column, keyed by id), use agg instead of to_dict:
df.groupby('id')['col'].value_counts().unstack().agg(pd.Series.to_dict)
col
a {1: 2, 2: 1}
b {1: 1, 2: 2}
dtype: object
I don't recommend storing data in this format; object columns are generally more troublesome to work with.
If unstacking generates NaNs, try an alternative with GroupBy.agg:
df.groupby('id')['col'].agg(lambda x: x.value_counts().to_dict())
id
1 {'a': 2, 'b': 1}
2 {'b': 2, 'a': 1}
Name: col, dtype: object
We can use pd.crosstab:
pd.Series(pd.crosstab(df.id,df.col).to_dict('i'))
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
dtype: object

Best way to extract rows with non-matching column values for two python pandas dataframes

I would like to know the best approach for the following pandas dataframe comparison task:
Two dataframes df_a and df_b with both having columns = ['W','X','Y','Z']:
import pandas as pd
df_a = pd.DataFrame([
['a', 2, 2, 3],
['b', 5, 3, 5],
['b', 7, 6, 44],
['c', 3, 12, 19],
['c', 7, 13, 45],
['c', 3, 13, 45],
['d', 5, 11, 90],
['d', 9, 33, 44]
], columns=['W','X','Y','Z'])
df_b = pd.DataFrame([
['a', 2, 2, 3],
['a', 4, 3, 15],
['b', 5, 12, 24],
['b', 7, 6, 44],
['c', 3, 12, 19],
['d', 3, 23, 45],
['d', 6, 11, 91],
['d', 9, 33, 44]
], columns=['W','X','Y','Z'])
Extract those rows from df_a that do not have a match in columns ['W','X'] in df_b
Extract those rows from df_b that do not have a match in columns ['W','X'] in df_a
Since I am kind of a newbie to pandas (and could not find any other source that covers this task), any help is very much appreciated.
Thanks in advance.
The basic way is to use a left outer merge with indicator=True and select the left_only rows with query:
cols = ['W', 'X']
df_a_only = (df_a.merge(df_b[cols], on=cols, indicator=True, how='left')
.query('_merge=="left_only"')[df_a.columns])
Out[87]:
W X Y Z
4 c 7 13 45
6 d 5 11 90
df_b_only = (df_b.merge(df_a[cols], on=cols, indicator=True, how='left')
.query('_merge=="left_only"')[df_b.columns])
Out[89]:
W X Y Z
1 a 4 3 15
6 d 3 23 45
7 d 6 11 91
Note: if your dataframes are huge, it is better to do one full outer merge rather than the two left outer merges above, and then select left_only and right_only rows accordingly. However, with a full outer merge you need to post-process NaNs, convert floats back to integers, and rename columns.
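For illustration, a rough sketch of that single full outer merge on the sample data (the suffixes, the rename, and the astype are the post-processing steps mentioned above):
cols = ['W', 'X']
merged = df_a.merge(df_b, on=cols, how='outer', indicator=True, suffixes=('_a', '_b'))
df_a_only = (merged.query('_merge == "left_only"')
             .rename(columns={'Y_a': 'Y', 'Z_a': 'Z'})[df_a.columns]
             .astype({'Y': int, 'Z': int}))  # unmatched rows made these columns float
df_b_only = (merged.query('_merge == "right_only"')
             .rename(columns={'Y_b': 'Y', 'Z_b': 'Z'})[df_b.columns]
             .astype({'Y': int, 'Z': int}))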

Intersect a dataframe with a larger one that includes it and remove common rows

I have two dataframes:
df_small = pd.DataFrame(np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]),
columns=['a', 'b', 'c'])
and
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
[31, 4, 5, 6, 75],
[73, 7, 8, 9, 23],
[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Now what I want is to intersect the two and take only the rows of df_large that do not match rows of df_small, hence the result should be:
df_result = pd.DataFrame(np.array([[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Use DataFrame.merge with indicator=True and a left join; because duplicate rows in df_small would create duplicate rows in the merge, remove them first with DataFrame.drop_duplicates:
m = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)['_merge'].ne('both')
df = df_large[m]
print (df)
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
Another solution is very similar, only filtering with query and removing the _merge column at the end:
df = (df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)
.query('_merge != "both"')
.drop('_merge', axis=1))
Use DataFrame.merge:
df_large.merge(df_small,how='outer',indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
Output:
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
You can avoid merging and make your code a bit more readable; it's really not that clear what happens when you merge and drop duplicates.
Indexes and Multiindexes were made for intersections and other set operations.
common_columns = df_large.columns.intersection(df_small.columns).to_list()
df_small_as_Multiindex = pd.MultiIndex.from_frame(df_small)
df_result = (df_large.set_index(common_columns)
             .drop(index=df_small_as_Multiindex)  # Drop the common rows
             .reset_index())  # Not needed if the a, b, c columns are meaningful indexes
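A hypothetical variation in the same index-based spirit uses MultiIndex.isin instead of drop, which also works if df_small happens to contain rows that are absent from df_large:
idx_large = pd.MultiIndex.from_frame(df_large[df_small.columns])
idx_small = pd.MultiIndex.from_frame(df_small)
df_result = df_large[~idx_large.isin(idx_small)]  # keep rows whose (a, b, c) combination is not in df_small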

How to get the index of each increment in pandas series?

How do I get the index of a pandas Series at the points where the value is incremented by one?
Ex. The input is
A
0 0
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
the output should be: [0, 1, 4, 6, 7]
You can use Series.duplicated and access the index; this should be slightly faster.
df.index[~df.A.duplicated()]
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
If you really want a list, you can do this,
df.index[~df.A.duplicated()].tolist()
# [0, 1, 4, 6, 7]
Note that duplicated (and drop_duplicates) will only work if your Series does not have any decrements.
Alternatively, you can use diff here, and index into df.index, similar to the previous solution:
import numpy as np
np.insert(df.index[df.A.diff().gt(0)], 0, 0)
# array([0, 1, 4, 6, 7])
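To illustrate the caveat above, a small hypothetical example with one decrement:
s = pd.Series([0, 1, 0, 1])
s.index[~s.duplicated()].tolist()  # [0, 1] -- misses the increment at index 3
s.index[s.diff().gt(0)].tolist()   # [1, 3] -- the diff-based check still catches it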
You can also use drop_duplicates:
df.drop_duplicates('A').index.tolist()
[0, 1, 4, 6, 7]
This checks that the next row's value is larger by exactly one (not by two or anything else); it returns the index of the row just before each such increment:
df[ ((df.A.shift(-1) - df.A) == 1.0)].index.values
The output is a numpy array:
array([2, 5])
Example:
# The value increases by exactly 1 after index 2 (1 -> 2) and after index 5 (3 -> 4)
df = pd.DataFrame({'A': [1, 1, 1, 2, 8, 3, 4, 4]})
df[ ((df.A.shift(-1) - df.A) == 1.0)].index.values
array([2, 5])

Using isnull() and groupby() on a pandas dataframe

Suppose I have a dataframe df with columns 'A', 'B', 'C'.
I would like to count the number of null values in column 'B' as grouped by 'A' and make a dictionary out of it:
I tried the following but it failed:
df.groupby('A')['B'].isnull().sum().to_dict()
Any help will be appreciated.
Setup
df = pd.DataFrame(dict(A=[1, 2] * 3, B=[1, 2, None, 4, None, None]))
df
A B
0 1 1.0
1 2 2.0
2 1 NaN
3 2 4.0
4 1 NaN
5 2 NaN
Option 1
df['B'].isnull().groupby(df['A']).sum().to_dict()
{1: 2.0, 2: 1.0}
Option 2
df.groupby('A')['B'].apply(lambda x: x.isnull().sum()).to_dict()
{1: 2, 2: 1}
Option 3
Getting creative
df.A[df.B.isnull()].value_counts().to_dict()
{1: 2, 2: 1}
Option 4
from collections import Counter
dict(Counter(df.A[df.B.isnull()]))
{1: 2, 2: 1}
Option 5
from collections import defaultdict
d = defaultdict(int)
for t in df.itertuples():
    d[t.A] += pd.isnull(t.B)
dict(d)
{1: 2, 2: 1}
Option 6
Unnecessarily complicated
(lambda t: dict(zip(t[1], np.bincount(t[0]))))(df.A[df.B.isnull()].factorize())
{1: 2, 2: 1}
Option 7
df.groupby([df.B.isnull(), 'A']).size().loc[True].to_dict()
{1: 2, 2: 1}
Or use the difference between count and size:
(df.groupby('A')['B'].size()-df.groupby('A')['B'].count()).to_dict()
Out[119]: {1: 2, 2: 1}
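For reference, this works because GroupBy.size counts rows including NaN while GroupBy.count excludes NaN, so the difference is the number of nulls per group. A quick check on the sample data:
df.groupby('A')['B'].size().to_dict()   # {1: 3, 2: 3}
df.groupby('A')['B'].count().to_dict()  # {1: 1, 2: 2}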