I know there are questions/answers about how to use a custom function for groupby in pandas, but my case is slightly different.
My data is
group_col val_col
0 a [1, 2, 34]
1 a [2, 4]
2 b [2, 3, 4, 5]
data = {'group_col': {0: 'a', 1: 'a', 2: 'b'}, 'val_col': {0: [1, 2, 34], 1: [2, 4], 2: [2, 3, 4, 5]}}
df = pd.DataFrame(data)
What I am trying to do is to group by group_col, then sum up the lengths of the lists in val_col for each group. My desired output is
a 5
b 4
I wonder if I can do this in pandas?
You can try
df['val_col'].str.len().groupby(df['group_col']).sum()
df.groupby('group_col')['val_col'].sum().str.len()
Output:
group_col
a 5
b 4
Name: val_col, dtype: int64
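For what it's worth, .str.len() works element-wise on lists as well as strings, so the first line simply computes each list's length and then groups. A minimal equivalent sketch with map(len), reusing the df from the question:
# map(len) gives each list's length; the lengths are then summed per group
df['val_col'].map(len).groupby(df['group_col']).sum()
# group_col
# a    5
# b    4
# Name: val_col, dtype: int64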
To calculate the frequency of each value by id, we can use value_counts together with groupby.
>>> df = pd.DataFrame({"id":[1,1,1,2,2,2], "col":['a','a','b','a','b','b']})
>>> df
id col
0 1 a
1 1 a
2 1 b
3 2 a
4 2 b
5 2 b
>>> df.groupby('id')['col'].value_counts()
id col
1 a 2
b 1
2 b 2
a 1
But I would like to get the results stored in dictionary format rather than as a Series. How can I achieve that, ideally in a way that stays fast on a large dataset?
The ideal format is:
id
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
You can unstack the groupby result to get a dict-of-dicts:
df.groupby('id')['col'].value_counts().unstack().to_dict(orient='index')
# {1: {'a': 2, 'b': 1}, 2: {'a': 1, 'b': 2}}
If you want a Series of dicts, use agg instead of to_dict:
df.groupby('id')['col'].value_counts().unstack().agg(pd.Series.to_dict)
col
a {1: 2, 2: 1}
b {1: 1, 2: 2}
dtype: object
I don't recommend storing data in this format; object columns are generally more troublesome to work with.
If unstacking generates NaNs, try an alternative with GroupBy.agg:
df.groupby('id')['col'].agg(lambda x: x.value_counts().to_dict())
id
1 {'a': 2, 'b': 1}
2 {'b': 2, 'a': 1}
Name: col, dtype: object
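For context, unstack produces those NaNs when some id never sees a particular value; a small made-up example:
df2 = pd.DataFrame({"id": [1, 1, 2], "col": ['a', 'b', 'a']})
# id 2 has no 'b', so unstack leaves a NaN and the counts become floats
df2.groupby('id')['col'].value_counts().unstack().to_dict(orient='index')
# {1: {'a': 1.0, 'b': 1.0}, 2: {'a': 1.0, 'b': nan}}
# the GroupBy.agg version simply omits the missing key
df2.groupby('id')['col'].agg(lambda x: x.value_counts().to_dict())
# id
# 1    {'a': 1, 'b': 1}
# 2    {'a': 1}
# Name: col, dtype: object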
We can use pd.crosstab:
pd.Series(pd.crosstab(df.id, df.col).to_dict('index'))
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
dtype: object
I would like to know the best approach for the following pandas dataframe comparison task:
Two dataframes df_a and df_b, both having columns ['W','X','Y','Z']:
import pandas as pd
df_a = pd.DataFrame([
['a', 2, 2, 3],
['b', 5, 3, 5],
['b', 7, 6, 44],
['c', 3, 12, 19],
['c', 7, 13, 45],
['c', 3, 13, 45],
['d', 5, 11, 90],
['d', 9, 33, 44]
], columns=['W','X','Y','Z'])
df_b = pd.DataFrame([
['a', 2, 2, 3],
['a', 4, 3, 15],
['b', 5, 12, 24],
['b', 7, 6, 44],
['c', 3, 12, 19],
['d', 3, 23, 45],
['d', 6, 11, 91],
['d', 9, 33, 44]
], columns=['W','X','Y','Z'])
Extract those rows from df_a that do not have a match in columns ['W','X'] in df_b
Extract those rows from df_b that do not have a match in columns ['W','X'] in df_a
Since I am kind of a newbie to pandas (and could not find any other source that covers this task), help is very much appreciated. Thanks in advance.
The basic way is a left outer merge with indicator=True, then selecting the left_only rows with query:
cols = ['W', 'X']
df_a_only = (df_a.merge(df_b[cols], on=cols, indicator=True, how='left')
                 .query('_merge == "left_only"')[df_a.columns])
Out[87]:
W X Y Z
4 c 7 13 45
6 d 5 11 90
df_b_only = (df_b.merge(df_a[cols], on=cols, indicator=True, how='left')
                 .query('_merge == "left_only"')[df_b.columns])
Out[89]:
W X Y Z
1 a 4 3 15
6 d 3 23 45
7 d 6 11 91
Note: if your dataframes are huge, it is better to do one full outer merge than the two left outer merges above, and then pick the left_only and right_only rows accordingly. However, with a full outer merge you need to post-process the NaNs, convert floats back to integers and rename columns.
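A rough sketch of that single outer merge, assuming df_a and df_b from the question (the suffix handling, rename and astype calls are the post-processing mentioned above):
cols = ['W', 'X']
# one full outer merge; _merge marks which side each row came from
merged = df_a.merge(df_b, on=cols, how='outer', indicator=True,
                    suffixes=('_a', '_b'))
# rows only in df_a: keep the _a columns, rename them back, restore integer dtype
df_a_only = (merged.loc[merged['_merge'] == 'left_only', cols + ['Y_a', 'Z_a']]
             .rename(columns={'Y_a': 'Y', 'Z_a': 'Z'})
             .astype({'Y': 'int64', 'Z': 'int64'}))
# rows only in df_b: same idea with the _b columns
df_b_only = (merged.loc[merged['_merge'] == 'right_only', cols + ['Y_b', 'Z_b']]
             .rename(columns={'Y_b': 'Y', 'Z_b': 'Z'})
             .astype({'Y': 'int64', 'Z': 'int64'}))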
I have two dataframes:
df_small = pd.DataFrame(np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]),
columns=['a', 'b', 'c'])
and
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
[31, 4, 5, 6, 75],
[73, 7, 8, 9, 23],
[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Now what I want is to compare the two and take only the rows in df_large that do not match the rows from df_small, hence the result should be:
df_result = pd.DataFrame(np.array([[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Use DataFrame.merge with indicator=True and a left join; duplicates must first be removed from df_small with DataFrame.drop_duplicates, otherwise the merge would duplicate rows of df_large and the mask would no longer align with it:
m = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)['_merge'].ne('both')
df = df_large[m]
print (df)
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
Another solution is very similar, only it filters with query and then removes the _merge column:
df = (df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)
         .query('_merge != "both"')
         .drop('_merge', axis=1))
Use DataFrame.merge:
df_large.merge(df_small,how='outer',indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
Output:
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
You can avoid merging and make your code a bit more readable; it's not that clear what happens when you merge and drop duplicates.
Indexes and MultiIndexes were made for intersections and other set operations.
common_columns = df_large.columns.intersection(df_small.columns).to_list()
df_small_as_Multiindex = pd.MultiIndex.from_frame(df_small)
df_result = (df_large.set_index(common_columns)
             .drop(index=df_small_as_Multiindex)  # drop the common rows
             .reset_index())  # not needed if the a, b, c columns are meaningful indexes
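One caveat, as an aside: drop raises a KeyError if df_small contains an (a, b, c) combination that never occurs in df_large; passing errors='ignore' (reusing the variables from the snippet above) sidesteps that:
df_result = (df_large.set_index(common_columns)
             .drop(index=df_small_as_Multiindex, errors='ignore')  # skip labels that are not present
             .reset_index())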
How do I get the indices of a pandas Series where the value is incremented by one?
Ex. The input is
A
0 0
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
the output should be: [0, 1, 4, 6, 7]
You can use Series.duplicated and index into df.index; this should be slightly faster.
df.index[~df.A.duplicated()]
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
If you really want a list, you can do this,
df.index[~df.A.duplicated()].tolist()
# [0, 1, 4, 6, 7]
Note that duplicated (and drop_duplicates) will only work if your Series does not have any decrements.
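For example, with a made-up Series that dips back down and then increases again:
s = pd.Series([0, 1, 1, 0, 1], name='A')
s.index[~s.duplicated()].tolist()
# [0, 1]  -- the increment back to 1 at index 4 is missed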
Alternatively, you can use diff here, and index into df.index, similar to the previous solution:
np.insert(df.index[df.A.diff().gt(0)], 0, 0)
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
Use drop_duplicates:
df.drop_duplicates('A').index.tolist()
[0, 1, 4, 6, 7]
This makes sure the following row is incremented by exactly one (not by two or anything else!):
df[ ((df.A.shift(-1) - df.A) == 1.0)].index.values
output is numpy array:
array([2, 5])
Example, where the value increases by 1 at positions 2 and 5:
df = pd.DataFrame({ 'A' : [1, 1, 1, 2, 8, 3, 4, 4]})
df[ ((df.A.shift(-1) - df.A) == 1.0)].index.values
array([2, 5])
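If you want the same strict '+1' check but the original question's expected output (the indices where the value itself steps up, plus the first row), a small adaptation as a sketch:
df = pd.DataFrame({'A': [0, 1, 1, 1, 2, 2, 3, 4, 4]})  # data from the question
# rows whose value is exactly one more than the previous row, plus row 0
mask = df.A.diff().eq(1)
mask.iloc[0] = True
df.index[mask].tolist()
# [0, 1, 4, 6, 7]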
Suppose I have a dataframe df with columns 'A', 'B', 'C'.
I would like to count the number of null values in column 'B' grouped by 'A' and make a dictionary out of it.
I tried the following but failed:
df.groupby('A')['B'].isnull().sum().to_dict()
Any help will be appreciated.
Setup
df = pd.DataFrame(dict(A=[1, 2] * 3, B=[1, 2, None, 4, None, None]))
df
A B
0 1 1.0
1 2 2.0
2 1 NaN
3 2 4.0
4 1 NaN
5 2 NaN
Option 1
df['B'].isnull().groupby(df['A']).sum().to_dict()
{1: 2.0, 2: 1.0}
Option 2
df.groupby('A')['B'].apply(lambda x: x.isnull().sum()).to_dict()
{1: 2, 2: 1}
Option 3
Getting creative
df.A[df.B.isnull()].value_counts().to_dict()
{1: 2, 2: 1}
Option 4
from collections import Counter
dict(Counter(df.A[df.B.isnull()]))
{1: 2, 2: 1}
Option 5
from collections import defaultdict
d = defaultdict(int)
for t in df.itertuples():
    d[t.A] += pd.isnull(t.B)
dict(d)
{1: 2, 2: 1}
Option 6
Unnecessarily complicated
(lambda t: dict(zip(t[1], np.bincount(t[0]))))(df.A[df.B.isnull()].factorize())
{1: 2, 2: 1}
Option 7
df.groupby([df.B.isnull(), 'A']).size().loc[True].to_dict()
{1: 2, 2: 1}
Or use the difference between count and size (size includes NaN values, while count does not):
(df.groupby('A')['B'].size()-df.groupby('A')['B'].count()).to_dict()
Out[119]: {1: 2, 2: 1}
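The same size-minus-count idea in a single groupby pass, as a sketch (using the df from the Setup above):
g = df.groupby('A')['B'].agg(['size', 'count'])
(g['size'] - g['count']).to_dict()
# {1: 2, 2: 1}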