extract conditional max based on a reference data frame - pandas

my reference data frame is of the following type:
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
a toy working df:
df = pd.DataFrame(
    [['France', 'AFN'], ['France', 'ALL'], ['France', 'EUR'],
     ['Albania', 'AFN'], ['Albania', 'ALL'], ['Albania', 'EUR'],
     ['Afghanistan', 'AFN'], ['Afghanistan', 'ALL'], ['Afghanistan', 'EUR']],
    columns=['country', 'currency'])
Since my working df may pair country and currency differently (for example country == 'France' with currency == 'AFN'), I would like to create a column with the max score based on either match; that combination, for instance, would imply a score of 4.
Desired output:
country currency score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
6 Afghanistan AFN 4
7 Afghanistan ALL 4
8 Afghanistan EUR 4
Here is what I have so far, but it's multi-line and extremely clunky:
df = pd.merge(df, tbl[['country', 'score']],
              how='left', on='country')
df['em_score'] = df['score']
df = df.drop('score', axis=1)
df = pd.merge(df, tbl[['currency', 'score']],
              how='left', on='currency')
df['em_score'] = df[['em_score', 'score']].max(axis=1)
df = df.drop('score', axis=1)

Here's a way to do it:
byCol = {col: tbl[[col, 'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
Explanation:
for each column in tbl other than score (in your case, country and currency), create a Series with that column as its index
use pd.concat() to create a new dataframe with multiple columns, each a Series object created with join() between the working df and one of the Series objects from the previous step
use max() on each row to get the desired em_score.
Full test code with sample df:
import pandas as pd
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
df = pd.DataFrame(
    [['France', 'AFN'], ['France', 'ALL'], ['France', 'EUR'],
     ['Albania', 'AFN'], ['Albania', 'ALL'], ['Albania', 'EUR']],
    columns=['country', 'currency'])
print('','tbl',tbl,sep='\n')
print('','df',df,sep='\n')
byCol = {col: tbl[[col, 'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
print('','output',df,sep='\n')
Output:
tbl
country currency score
0 Afghanistan AFN 4
1 Albania ALL 2
2 France EUR 1
df
country currency
0 France AFN
1 France ALL
2 France EUR
3 Albania AFN
4 Albania ALL
5 Albania EUR
output
country currency em_score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2

So, for the case where you have
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1],
                    ['France', 'AFN', 0]],
                   columns=['country', 'currency', 'score'])
this code will find, for each row, the max of the country's max score and the currency's max score (it needs import numpy as np):
np.maximum(np.array(tbl.groupby(['country']).max().loc[tbl['country'], 'score']),
           np.array(tbl.groupby(['currency']).max().loc[tbl['currency'], 'score']))
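For reference, here is a minimal runnable sketch of that idea, looking the per-key maxima up against a working df rather than against tbl itself (the variable names and the two-row df are mine):

```python
import numpy as np
import pandas as pd

tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
df = pd.DataFrame([['France', 'AFN'], ['Albania', 'EUR']],
                  columns=['country', 'currency'])

# Max score per country and per currency, taken from the reference table
by_country = tbl.groupby('country')['score'].max()
by_currency = tbl.groupby('currency')['score'].max()

# Element-wise max of the two lookups, aligned to the working df's rows
df['em_score'] = np.maximum(by_country.loc[df['country']].to_numpy(),
                            by_currency.loc[df['currency']].to_numpy())
print(df)
```

France/AFN picks up AFN's score of 4; Albania/EUR picks up Albania's score of 2.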

Related

Pandas Column Transformation with list of dict in column

I am getting the data from a NoSQL database owned by a third party. After fetching, the dataframe looks like below. I wish to explode the performance column but can't figure out a way. Is it even possible?
import pandas as pd
cols = ['name', 'performance']
data = [
    ['bob', [{'dates': '15-12-2021', 'gdp': 19},
             {'dates': '16-12-2021', 'gdp': 36},
             {'dates': '12-12-2022', 'gdp': 39},
             {'dates': '13-12-2022', 'gdp': 35},
             {'dates': '14-12-2022', 'gdp': 35}]]]
df = pd.DataFrame(data, columns=cols)
Expected output:
cols = ['name', 'dates', 'gdp']
data = [
    ['bob', '15-12-2021', 19],
    ['bob', '16-12-2021', 36],
    ['bob', '12-12-2022', 39],
    ['bob', '13-12-2022', 35],
    ['bob', '14-12-2022', 35]]
df = pd.DataFrame(data, columns=cols)
Use DataFrame.explode with DataFrame.reset_index first, and then flatten the dictionaries with json_normalize; DataFrame.pop is used to remove the performance column from the output DataFrame:
df1 = df.explode('performance').reset_index(drop=True)
df1 = df1.join(pd.json_normalize(df1.pop('performance')))
print(df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
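An equivalent one-step variant (my own sketch, not from the answer) hands json_normalize the records directly, using record_path to explode the nested list and meta to carry the parent column along:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob'],
                   'performance': [[{'dates': '15-12-2021', 'gdp': 19},
                                    {'dates': '16-12-2021', 'gdp': 36}]]})

# record_path explodes the nested list of dicts; meta repeats the parent column
df1 = pd.json_normalize(df.to_dict('records'),
                        record_path='performance', meta='name')
print(df1[['name', 'dates', 'gdp']])
```

This skips the explode/pop/join dance at the cost of a round-trip through to_dict('records').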
Another solution, with a list comprehension - if the input DataFrame has only these 2 columns:
L = [{**{'name': a}, **x} for a, b in zip(df['name'], df['performance']) for x in b]
df1 = pd.DataFrame(L)
print(df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
If there are multiple columns, use DataFrame.join with the original DataFrame:
L = [{**{'i': a}, **x} for a, b in df.pop('performance').items() for x in b]
df1 = df.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print(df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35

Group by count in pandas dataframe

In a pandas dataframe I want to create two new columns that count the occurrences of the same value, and a third column that calculates the ratio:
ratio = count_occurrence_both_columns / count_occurrence_columnA * 100
df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York","New York"], "column B": ["AT", "AT", "NY", "NY", "AT"]})
df

column A   column B   occurrence_columnA   occurrence_both_columns   ratio
Atlanta    AT         2                    2                         100%
Atlanta    AT         2                    2                         100%
New York   NY         3                    2                         66.66%
New York   NY         3                    2                         66.66%
New York   AT         3                    1                         33.33%
First, you can create a dictionary that has the keys as column A unique values and the values as the count.
>>> column_a_mapping = df['column A'].value_counts().to_dict()
>>> column_a_mapping
{'New York': 3, 'Atlanta': 2}
Then, you can create a new column that has the two columns merged in order to have the same value counts dictionary as above.
>>> df['both_columns'] = df[['column A', 'column B']].apply(lambda row: '_'.join(row), axis=1)
>>> both_columns_mapping = df['both_columns'].value_counts().to_dict()
>>> both_columns_mapping
{'New York_NY': 2, 'Atlanta_AT': 2, 'New York_AT': 1}
Once you have the unique value counts, you can simply use the pd.Series.replace method.
>>> df['count_occurrence_both_columns'] = df['both_columns'].replace(both_columns_mapping)
>>> df['count_occurrence_columnA'] = df['column A'].replace(column_a_mapping)
Lastly, you can drop the column that has both columns merged and then create your ratio column with:
>>> df['ratio'] = df['count_occurrence_both_columns'] / df['count_occurrence_columnA'] * 100
>>> df.drop('both_columns', axis=1, inplace=True)
You should obtain this dataframe:
column A   column B   count_occurrence_columnA   count_occurrence_both_columns   ratio
Atlanta    AT         2                          2                               100.000000
Atlanta    AT         2                          2                               100.000000
New York   NY         3                          2                               66.666667
New York   NY         3                          2                               66.666667
New York   AT         3                          1                               33.333333
Use pandas groupby to count the items:
df['occurrence_columnA'] = df.groupby(['column A'])['column B'].transform(len)
df['occurrence_both_columns'] = df.groupby(['column A', 'column B'])['occurrence_columnA'].transform(len)
An alternative is transform('count'), but that ignores NaNs.
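Putting the whole thing together with transform('size') (a sketch of mine; 'size' counts rows including NaNs, unlike 'count'):

```python
import pandas as pd

df = pd.DataFrame({'column A': ['Atlanta', 'Atlanta', 'New York', 'New York', 'New York'],
                   'column B': ['AT', 'AT', 'NY', 'NY', 'AT']})

# Per-group row counts, broadcast back to the original shape by transform
df['occurrence_columnA'] = df.groupby('column A')['column A'].transform('size')
df['occurrence_both_columns'] = df.groupby(['column A', 'column B'])['column B'].transform('size')
df['ratio'] = df['occurrence_both_columns'] / df['occurrence_columnA'] * 100
print(df)
```

This avoids both the intermediate merged column and the dict lookups.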

How to reset the index with respect to a group?

I have an id column for each person (rows with the same id belong to one person). The id column is currently a 10-digit value rather than a sequential numbering. How can I reset id to integers, e.g. 1, 2, 3, 4?
For example:
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi
Use factorize:
df['id']=df['id'].factorize()[0]+1
Output:
id col1
0 1 summer
1 1 goest
2 2 yes
3 2 No
4 2 why
5 3 Hi
Another option is to use categorical data:
df['id'] = df['id'].astype('category').cat.codes + 1
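A quick runnable check that both idioms agree on the sample data (scaffolding is mine). One caveat worth knowing: cat.codes numbers by sorted category order while factorize numbers by first appearance, so they coincide here only because '12a4' < '3b' < '4t' happens to match the appearance order:

```python
import pandas as pd

df = pd.DataFrame({'id': ['12a4', '12a4', '3b', '3b', '3b', '4t'],
                   'col1': ['summer', 'goest', 'yes', 'No', 'why', 'Hi']})

# factorize: codes in order of first appearance; cat.codes: codes in sorted order
via_factorize = df['id'].factorize()[0] + 1
via_category = df['id'].astype('category').cat.codes + 1
print(list(via_factorize), list(via_category))
```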
Try:
df.reset_index(inplace=True)
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
print(df)
class max_speed
falcon bird 389.0
parrot bird 24.0
lion mammal 80.5
monkey mammal NaN
This is how it looks; now let's replace the index:
df.reset_index(inplace=True)
print(df)
index class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal NaN
import pandas as pd
df = pd.DataFrame({'id': ['12a4', '12a4', '3b', '3b', '3b', '4t'],
                   'col1': ['summer', 'goest', 'yes', 'No', 'why', 'Hi']})
unique_id = df.drop_duplicates(subset=['id']).reset_index(drop=True)
id_dict = dict(zip(unique_id['id'], unique_id.index))
df['id'] = df['id'].apply(lambda x: id_dict[x])
df.drop_duplicates(subset=['id']).reset_index(drop=True) removes duplicate rows in column id.
# print(unique_id)
id col1
0 12a4 summer
1 3b yes
2 4t Hi
dict(zip(unique_id['id'], unique_id.index)) creates a dictionary from column id and index value.
# print(id_dict)
{'12a4': 0, '3b': 1, '4t': 2}
df['id'].apply(lambda x: id_dict[x]) maps each column value through the dict (add 1 if, as in the desired output, the numbering should start at 1).
# print(df)
id col1
0 0 summer
1 0 goest
2 1 yes
3 1 No
4 1 why
5 2 Hi

How to add data in separate columns in a Pandas DataFrame?

Question:
goldmedal = pd.DataFrame({'Country': ['india', 'japan', 'korea'],
                          'Medals': [5, 3, 4]})
silvermedal = pd.DataFrame({'Country': ['india', 'china', 'korea'],
                            'Medals': [9, 0, 6]})
bronzemedal = pd.DataFrame({'Country': ['japan', 'india', 'vietnam'],
                            'Medals': [4, 2, 2]})
I need to find the cumulative medals earned by the mentioned countries.
I tried this:
the add function: goldmedal.add(silvermedal, fill_value=0) gives
  Country     Medals
0 indiaindia  14
1 japanchina  3
2 koreakorea  10
the merge function: pd.merge(goldmedal, silvermedal, how='inner', on='Country') gives
  Country  Medals_x  Medals_y
0 india    5         9
1 korea    4         6
How do I get the following output?
Country Medals
0 india 16
1 china 0
2 korea 10
3 vietnam 2
4 japan 7
pd.concat([goldmedal, silvermedal, bronzemedal]).groupby('Country').sum().reset_index()
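The one-liner above, runnable end to end (a sketch; note that groupby sorts, so the countries come out in alphabetical order rather than the order shown in the question):

```python
import pandas as pd

goldmedal = pd.DataFrame({'Country': ['india', 'japan', 'korea'], 'Medals': [5, 3, 4]})
silvermedal = pd.DataFrame({'Country': ['india', 'china', 'korea'], 'Medals': [9, 0, 6]})
bronzemedal = pd.DataFrame({'Country': ['japan', 'india', 'vietnam'], 'Medals': [4, 2, 2]})

# Stack the three frames vertically, then sum medals per country
total = (pd.concat([goldmedal, silvermedal, bronzemedal])
           .groupby('Country').sum().reset_index())
print(total)
```

Unlike add or merge, concat+groupby handles countries that appear in only some of the frames.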

using pandas dataframe group agg function

There is a dataframe, say
df
Country Continent PopulationEst
0 Germany Europe 8.036970e+07
1 Canada North America 3.523987e+07
...
I want to create a dataframe that displays the size (the number of countries in each continent) and the sum, mean, and standard deviation of the estimated populations of the countries in each continent.
I did the following:
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
But the result df2 has multiple level columns like below:
df2.columns
MultiIndex(levels=[['PopulationEst'], ['size', 'sum', 'mean', 'std']],
labels=[[0, 0, 0, 0], [0, 1, 2, 3]])
How can I remove the PopulationEst from the columns, so just have ['size', 'sum', 'mean', 'std'] columns for the dataframe?
I think you need to add ['PopulationEst'], so agg uses only this column for aggregation:
df2 = df.groupby('Continent')['PopulationEst'].agg(['size', 'sum','mean','std'])
Sample:
df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
    'PopulationEst': [8, 4, 35, 50],
    'Continent': ['Europe', 'Europe', 'North America', 'North America']},
    columns=['Country', 'PopulationEst', 'Continent'])
print(df)
Country PopulationEst Continent
0 Germany 8 Europe
1 Germany 4 Europe
2 Canada 35 North America
3 Canada 50 North America
df2 = df.groupby('Continent')['PopulationEst'].agg(['size', 'sum','mean','std'])
print (df2)
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
print (df2)
PopulationEst
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
Another solution is with MultiIndex.droplevel:
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
df2.columns = df2.columns.droplevel(0)
print (df2)
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
I think this could do what you need (the dict keys must match the actual column names, so use PopulationEst, not PopEst):
grouping = {'PopulationEst': ['size', 'sum', 'mean', 'std']}
df.groupby('Continent').agg(grouping)
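On pandas 0.25+ the same result can also be written with named aggregation, which produces flat column names and so avoids the MultiIndex entirely (a sketch using the sample df from the accepted answer):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
                   'PopulationEst': [8, 4, 35, 50],
                   'Continent': ['Europe', 'Europe', 'North America', 'North America']})

# Each keyword names an output column: (source column, aggregation function)
df2 = df.groupby('Continent').agg(size=('PopulationEst', 'size'),
                                  sum=('PopulationEst', 'sum'),
                                  mean=('PopulationEst', 'mean'),
                                  std=('PopulationEst', 'std'))
print(df2)
```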