Group by count in pandas dataframe

In a pandas DataFrame I want to create two new columns that count the occurrences of the same value, and a third column that calculates the ratio:
ratio = count_occurrence_both_columns / count_occurrence_columnA * 100
df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York","New York"], "column B": ["AT", "AT", "NY", "NY", "AT"]})
df
columnA   ColumnB  occurrence_columnA  occurrence_both_columns  Ratio
Atlanta   AT       2                   2                        100%
Atlanta   AT       2                   2                        100%
New York  NY       3                   2                        66.66%
New York  NY       3                   2                        66.66%
New York  AT       3                   1                        33.33%

First, you can create a dictionary that has the keys as column A unique values and the values as the count.
>>> column_a_mapping = df['column A'].value_counts().to_dict()
>>> column_a_mapping
{'New York': 3, 'Atlanta': 2}
Then, you can merge the two columns into a new column, so you can build the same kind of value-counts dictionary as above.
>>> df['both_columns'] = (
...     df[['column A', 'column B']]
...     .apply(lambda row: '_'.join(row), axis=1)
... )
>>> both_columns_mapping = df['both_columns'].value_counts().to_dict()
>>> both_columns_mapping
{'New York_NY': 2, 'Atlanta_AT': 2, 'New York_AT': 1}
Once you have the unique value counts, you can simply use the pd.Series.replace method.
>>> df['count_occurrence_both_columns'] = df['both_columns'].replace(both_columns_mapping)
>>> df['count_occurrence_columnA'] = df['column A'].replace(column_a_mapping)
Lastly, you can create your ratio column and drop the merged helper column:
>>> df['ratio'] = df['count_occurrence_both_columns'] / df['count_occurrence_columnA'] * 100
>>> df.drop('both_columns', axis=1, inplace=True)
You should obtain this dataframe:
   column A column B  count_occurrence_columnA  count_occurrence_both_columns       ratio
0   Atlanta       AT                         2                              2  100.000000
1   Atlanta       AT                         2                              2  100.000000
2  New York       NY                         3                              2   66.666667
3  New York       NY                         3                              2   66.666667
4  New York       AT                         3                              1   33.333333
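As an aside (not part of the answer above), Series.map performs the same dictionary lookup as replace and is generally faster for this pattern; a small sketch on the example frame:

```python
import pandas as pd

df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York", "New York"],
                   "column B": ["AT", "AT", "NY", "NY", "AT"]})

column_a_mapping = df['column A'].value_counts().to_dict()

# map() looks every value up in the dict; unlike replace(), values
# missing from the dict become NaN instead of passing through unchanged
df['count_occurrence_columnA'] = df['column A'].map(column_a_mapping)
print(df)
```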

Use pandas groupby to count the items
df['occurrence_columnA'] = df.groupby(['column A'])['column B'].transform(len)
df['occurrence_both_columns'] = df.groupby(['column A','column B'])['occurrence_columnA'].transform(len)
An alternative is transform('count'), but note that it ignores NaN values; transform('size') counts them.
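To make the NaN difference concrete, here is a small made-up example comparing transform('size') (counts every row) with transform('count') (skips NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York"],
                   "column B": ["AT", np.nan, "NY"]})

# 'size' counts all rows in each group, NaN included
df["occ_size"] = df.groupby("column A")["column B"].transform("size")
# 'count' counts only non-NaN values in each group
df["occ_count"] = df.groupby("column A")["column B"].transform("count")
print(df)
```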


extract conditional max based on a reference data frame

my reference data frame is of the following type:
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
A toy working df:
df = pd.DataFrame(
    [['France','AFN'],['France','ALL'],['France','EUR'],
     ['Albania','AFN'],['Albania','ALL'],['Albania','EUR'],
     ['Afghanistan','AFN'],['Afghanistan','ALL'],['Afghanistan','EUR']],
    columns=['country','currency'])
Since my working df may pair country and currency differently (for example country == 'France' with currency == 'AFN'), I would like to create a column with the max score based on either key; this country/currency combo would imply a score of 4.
Desired output:
country currency score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
6 Afghanistan AFN 4
7 Afghanistan ALL 4
8 Afghanistan EUR 4
Here is what I have so far, but it's multiline and extremely clunky:
df = pd.merge(df, tbl[['country', 'score']],
              how='left', on='country')
df['em_score'] = df['score']
df = df.drop('score', axis=1)
df = pd.merge(df, tbl[['currency', 'score']],
              how='left', on='currency')
df['em_score'] = df[['em_score', 'score']].max(axis=1)
df = df.drop('score', axis=1)
Here's a way to do it:
byCol = {col: tbl[[col, 'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
Explanation:
- for each column in tbl other than score (in your case, country and currency), create a Series with that column as index
- use pd.concat() to create a new dataframe with multiple columns, each a Series object created using join() between the working df and one of the Series objects from the previous step
- use max() on each row to get the desired em_score
Full test code with sample df:
import pandas as pd
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
df = pd.DataFrame(
    [['France','AFN'],['France','ALL'],['France','EUR'],
     ['Albania','AFN'],['Albania','ALL'],['Albania','EUR']],
    columns=['country','currency'])
print('', 'tbl', tbl, sep='\n')
print('', 'df', df, sep='\n')
byCol = {col: tbl[[col, 'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
print('','output',df,sep='\n')
Output:
tbl
country currency score
0 Afghanistan AFN 4
1 Albania ALL 2
2 France EUR 1
df
country currency
0 France AFN
1 France ALL
2 France EUR
3 Albania AFN
4 Albania ALL
5 Albania EUR
output
country currency em_score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
So, for the case where you have
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1],
                    ['France', 'AFN', 0]],
                   columns=['country', 'currency', 'score'])
this code will find, for each row, the max of the best score for its country and the best score for its currency (numpy must be imported as np):
import numpy as np

np.maximum(np.array(tbl.groupby(['country']).max().loc[tbl['country'], 'score']),
           np.array(tbl.groupby(['currency']).max().loc[tbl['currency'], 'score']))
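The same per-row maximum can also be sketched without numpy, using two map() lookups against the grouped maxima (column names as in the question):

```python
import pandas as pd

tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1],
                    ['France', 'AFN', 0]],
                   columns=['country', 'currency', 'score'])

# best score reachable through each key separately
by_country = tbl.groupby('country')['score'].max()
by_currency = tbl.groupby('currency')['score'].max()

# row-wise max of the two lookups
tbl['em_score'] = pd.concat([tbl['country'].map(by_country),
                             tbl['currency'].map(by_currency)],
                            axis=1).max(axis=1)
print(tbl)
```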

Pandas Column Transformation with list of dict in column

I am getting the data from a NoSQL database owned by a third party. After the data fetch, the dataframe looks like below. I wish to explode the performance column but can't figure out a way. Is it even possible?
import pandas as pd
cols = ['name', 'performance']
data = [
    ['bob', [{'dates': '15-12-2021', 'gdp': 19},
             {'dates': '16-12-2021', 'gdp': 36},
             {'dates': '12-12-2022', 'gdp': 39},
             {'dates': '13-12-2022', 'gdp': 35},
             {'dates': '14-12-2022', 'gdp': 35}]]]
df = pd.DataFrame(data, columns=cols)
Expected output:
cols = ['name', 'dates', 'gdp']
data = [
    ['bob', '15-12-2021', 19],
    ['bob', '16-12-2021', 36],
    ['bob', '12-12-2022', 39],
    ['bob', '13-12-2022', 35],
    ['bob', '14-12-2022', 35]]
df = pd.DataFrame(data, columns=cols)
Use DataFrame.explode with DataFrame.reset_index first, then flatten the dictionaries with json_normalize; DataFrame.pop is used to remove the performance column from the output DataFrame:
df1 = df.explode('performance').reset_index(drop=True)
df1 = df1.join(pd.json_normalize(df1.pop('performance')))
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
Another solution with a list comprehension, if the input DataFrame has only 2 columns:
L = [{**{'name':a},**x} for a, b in zip(df['name'], df['performance']) for x in b]
df1 = pd.DataFrame(L)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
If there are multiple columns, use DataFrame.join with the original DataFrame:
L = [{**{'i':a},**x} for a, b in df.pop('performance').items() for x in b]
df1 = df.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
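As a quick sanity check (with made-up data), the explode/json_normalize approach also handles several names with different numbers of records:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'amy'],
                   'performance': [[{'dates': '15-12-2021', 'gdp': 19}],
                                   [{'dates': '16-12-2021', 'gdp': 36},
                                    {'dates': '17-12-2021', 'gdp': 40}]]})

# one row per dict, then flatten each dict into its own columns
df1 = df.explode('performance').reset_index(drop=True)
df1 = df1.join(pd.json_normalize(df1.pop('performance')))
print(df1)
```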

Pandas dataframe median of a column with condition

So I have a dataframe with two columns (price, location). Now I want to get the median of price, if the location is e.g. "Paris". How do I achieve that?
dataframe:
location price
paris 5
paris 2
rome 5
paris 4
...
desired result: 4 (median of 2,5,4)
I think you need df.groupby to group on location, and then .median():
median = df.groupby('location').median()
To get the value for each location:
median.loc['paris', 'price']
Output:
4
import pandas as pd
# Build dataframe
data = [['Paris', 2], ['New York', 3], ['Rome', 4], ['Paris', 5], ['Paris', 4]]
df = pd.DataFrame(data, columns=['location', 'price'])
# Get paris only rows
df_paris = df[df['location'] == 'Paris']
# Print median
print(df_paris['price'].median())

Comparing strings in two different dataframe and adding a column [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have two dataframes as follows:
df1 =
Index Name Age
0 Bob1 20
1 Bob2 21
2 Bob3 22
The second dataframe is as follows -
df2 =
Index Country Name
0 US Bob1
1 UK Bob123
2 US Bob234
3 Canada Bob2
4 Canada Bob987
5 US Bob3
6 UK Mary1
7 UK Mary2
8 UK Mary3
9 Canada Mary65
I would like to compare the names from df1 to the countries in df2 and create a new dataframe as follows:
Index Country Name Age
0 US Bob1 20
1 Canada Bob2 21
2 US Bob3 22
Thank you.
Using merge() should solve the problem.
df3 = pd.merge(df1, df2, on='Name')
Full example:
import pandas as pd
df1 = pd.DataFrame({ "Name":["Bob1", "Bob2", "Bob3"], "Age":[20,21,22]})
df2 = pd.DataFrame({"Country": ["US", "UK", "US", "Canada", "Canada", "US", "UK", "UK", "UK", "Canada"],
                    "Name": ["Bob1", "Bob123", "Bob234", "Bob2", "Bob987", "Bob3", "Mary1", "Mary2", "Mary3", "Mary65"]})
df3 = pd.merge(df1, df2, on='Name')
df3
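One caveat worth noting: merge() defaults to an inner join, so any name in df1 without a match in df2 is dropped silently. If those rows should survive, how='left' keeps them with a missing Country (sketch with made-up names):

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Bob1", "Bob2", "Bob9"], "Age": [20, 21, 22]})
df2 = pd.DataFrame({"Country": ["US", "Canada"], "Name": ["Bob1", "Bob2"]})

# left join keeps every row of df1; unmatched names get NaN Country,
# and indicator=True shows which side each row came from
df3 = pd.merge(df1, df2, on='Name', how='left', indicator=True)
print(df3)
```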

How to make new cell based on appearance in dataframe cell

I want to create a new column in the DataFrame when a value appears in an existing list-type column and another column matches a condition.
Dataset:
name loto
0 Jason [22]
1 Molly [222]
2 Tina [232]
3 Jake [223]
4 Amy [73, 1, 2, 3]
If name == "Jason" and loto contains 22, then new = 1.
I tried to use np.where, but I'm having issues checking for an element inside the array.
import numpy as np
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'loto': [[22], [222], [232], [223], [73, 1, 2, 3]]}
df = pd.DataFrame(data, columns=['name', 'loto'])
df['new'] = np.where((22 in df['loto']) & (df['name'] == "Jason"), 1, 0)
First, create the value you want to check as a set, e.g. set([22]).
Then provide loto_chck to map and apply the condition in .loc:
loto_val = set([22])
loto_chck = loto_val.issubset
df.loc[(df['loto'].map(loto_chck)) & (df['name'] == 'Jason'), "new"] = 1
    name           loto  new
0  Jason           [22]  1.0
1  Molly          [222]  NaN
2   Tina          [232]  NaN
3   Jake          [223]  NaN
4    Amy  [73, 1, 2, 3]  NaN
You could try:
df['new'] = ((df.apply(lambda x: 22 in x.loto, axis=1)) &
             (df.name == 'Jason')).astype(int)
Even though it's not a good idea to store lists in a dataframe.
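For completeness, the np.where attempt from the question can be repaired by testing membership per element with apply; this mirrors the answers above but fills 0 instead of NaN:

```python
import numpy as np
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'loto': [[22], [222], [232], [223], [73, 1, 2, 3]]}
df = pd.DataFrame(data)

# membership must be tested per list element; apply() returns a boolean
# Series that np.where can combine with the name check
df['new'] = np.where(df['loto'].apply(lambda l: 22 in l)
                     & (df['name'] == 'Jason'), 1, 0)
print(df)
```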