Comparing strings in two different dataframes and adding a column [duplicate] - pandas

This question already has answers here:
Pandas Merging 101
I have two dataframes as follows:
df1 =
Index Name Age
0 Bob1 20
1 Bob2 21
2 Bob3 22
The second dataframe is as follows:
df2 =
Index Country Name
0 US Bob1
1 UK Bob123
2 US Bob234
3 Canada Bob2
4 Canada Bob987
5 US Bob3
6 UK Mary1
7 UK Mary2
8 UK Mary3
9 Canada Mary65
I would like to match the names from df1 against the names in df2, pull in the country, and create a new dataframe as follows:
Index Country Name Age
0 US Bob1 20
1 Canada Bob2 21
2 US Bob3 22
Thank you.

Using merge() should solve the problem.
df3 = pd.merge(df1, df2, on='Name')
Full reproducible example:
import pandas as pd
df1 = pd.DataFrame({ "Name":["Bob1", "Bob2", "Bob3"], "Age":[20,21,22]})
df2 = pd.DataFrame({ "Country":["US", "UK", "US", "Canada", "Canada", "US", "UK", "UK", "UK", "Canada"],
"Name":["Bob1", "Bob123", "Bob234", "Bob2", "Bob987", "Bob3", "Mary1", "Mary2", "Mary3", "Mary65"]})
df3 = pd.merge(df1, df2, on='Name')
df3
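Note that merge() performs an inner join by default, so only names present in both frames are kept. Running the snippet above, df3 should print:
   Name  Age Country
0  Bob1   20      US
1  Bob2   21  Canada
2  Bob3   22      US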

Related

Pandas Column Transformation with list of dict in column

I am getting the data from a NoSQL database owned by a third party. After fetching it, the dataframe looks like below. I wish to explode the performance column but can't figure out a way. Is it even possible?
import pandas as pd
cols = ['name', 'performance']
data = [
    ['bob', [{'dates': '15-12-2021', 'gdp': 19},
             {'dates': '16-12-2021', 'gdp': 36},
             {'dates': '12-12-2022', 'gdp': 39},
             {'dates': '13-12-2022', 'gdp': 35},
             {'dates': '14-12-2022', 'gdp': 35}]]]
df = pd.DataFrame(data, columns=cols)
Expected output:
cols = ['name', 'dates', 'gdp']
data = [
    ['bob', '15-12-2021', 19],
    ['bob', '16-12-2021', 36],
    ['bob', '12-12-2022', 39],
    ['bob', '13-12-2022', 35],
    ['bob', '14-12-2022', 35]]
df = pd.DataFrame(data, columns=cols)
Use DataFrame.explode with DataFrame.reset_index first, and then flatten the dictionaries with json_normalize; DataFrame.pop is used to remove the performance column from the output DataFrame:
df1 = df.explode('performance').reset_index(drop=True)
df1 = df1.join(pd.json_normalize(df1.pop('performance')))
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
Another solution with a list comprehension, if the input DataFrame has only 2 columns:
L = [{**{'name':a},**x} for a, b in zip(df['name'], df['performance']) for x in b]
df1 = pd.DataFrame(L)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
If there are multiple columns, use DataFrame.join with the original DataFrame:
L = [{**{'i':a},**x} for a, b in df.pop('performance').items() for x in b]
df1 = df.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
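As an aside, the same flattening can be done with a single json_normalize call by treating each row as a record; a minimal sketch (the columns come out as dates, gdp, name, so reorder if needed):
records = df.to_dict('records')
df1 = pd.json_normalize(records, record_path='performance', meta=['name'])
df1 = df1[['name', 'dates', 'gdp']]  # match the expected column order
print(df1)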

Group by count in pandas dataframe

In a pandas dataframe I want to create two new columns that count the occurrences of the same value, and a third column that calculates the ratio:
ratio = count_occurrence_both_columns / count_occurrence_columnA * 100
df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York","New York"], "column B": ["AT", "AT", "NY", "NY", "AT"]})
df
Expected output:
  column A column B  occurrence_columnA  occurrence_both_columns   Ratio
0  Atlanta       AT                   2                        2    100%
1  Atlanta       AT                   2                        2    100%
2 New York       NY                   3                        2  66.66%
3 New York       NY                   3                        2  66.66%
4 New York       AT                   3                        1  33.33%
First, you can create a dictionary whose keys are column A's unique values and whose values are their counts.
>>> column_a_mapping = df['column A'].value_counts().to_dict()
>>> column_a_mapping
{'New York': 3, 'Atlanta': 2}
Then, you can create a new column that has the two columns merged in order to have the same value counts dictionary as above.
>>> df['both_columns'] = (
...     df[['column A', 'column B']]
...     .apply(lambda row: '_'.join(row), axis=1)
... )
>>> both_columns_mapping = df['both_columns'].value_counts().to_dict()
>>> both_columns_mapping
{'New York_NY': 2, 'Atlanta_AT': 2, 'New York_AT': 1}
Once you have the unique value counts, you can simply use the pd.Series.replace method.
>>> df['count_occurrence_both_columns'] = df['both_columns'].replace(both_columns_mapping)
>>> df['count_occurrence_columnA'] = df['column A'].replace(column_a_mapping)
Lastly, you can create your ratio column and then drop the merged helper column:
>>> df['ratio'] = df['count_occurrence_both_columns'] / df['count_occurrence_columnA'] * 100
>>> df.drop('both_columns', axis=1, inplace=True)
You should obtain this dataframe:
  column A column B  count_occurrence_columnA  count_occurrence_both_columns       ratio
0  Atlanta       AT                         2                              2  100.000000
1  Atlanta       AT                         2                              2  100.000000
2 New York       NY                         3                              2   66.666667
3 New York       NY                         3                              2   66.666667
4 New York       AT                         3                              1   33.333333
Use pandas groupby to count the items:
df['occurrence_columnA'] = df.groupby(['column A'])['column B'].transform(len)
df['occurrence_both_columns'] = df.groupby(['column A','column B'])['occurrence_columnA'].transform(len)
An alternative is transform('count'), but note that it ignores NaNs.
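For example, a sketch of the same counts written with transform('count'), reusing the column names from above:
df['occurrence_columnA'] = df.groupby('column A')['column B'].transform('count')
df['occurrence_both_columns'] = df.groupby(['column A', 'column B'])['occurrence_columnA'].transform('count')
# 'count' skips NaN in the counted column, while len/'size' includes it
df['ratio'] = df['occurrence_both_columns'] / df['occurrence_columnA'] * 100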

Get rid of NaN values and re-order columns in pandas

I have made an ad hoc example that you can run, to show a dataframe similar to the df3 I have to work with:
import pandas as pd
people1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13], [None, None], [None, None], [None, None]]
people2 = [[None, None], [None, None], [None, None], ['Mark', 20], ['Jane', 22], ['Jack', 23]]
df1 = pd.DataFrame(people1, columns=['Name', 'Age'])
df2 = pd.DataFrame(people2, columns=['Name', 'Age'])
people_list = [df1, df2]
df3 = pd.concat((people_list[0]['Name'], people_list[1]['Name']), axis=1)
df3
How would I modify the dataframe df3 to get rid of the NaN values and put the 2 columns next to each other? (I don't care about keeping the indices; I just want a clean dataframe with the 2 columns side by side.)
You can drop the NaN values first:
df3 = pd.concat([df1.dropna(), df2.dropna()])
Output:
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
3 Mark 20.0
4 Jane 22.0
5 Jack 23.0
Or if you want to concat side-by-side:
df3 = pd.concat([df1.dropna().reset_index(drop=True), df2.dropna().reset_index(drop=True)], axis=1)
Output:
Name Age Name Age
0 Alex 10.0 Mark 20.0
1 Bob 12.0 Jane 22.0
2 Clarke 13.0 Jack 23.0
If you just want to concat the Name columns side-by-side:
df3 = pd.concat([df1.dropna().reset_index(drop=True)['Name'], df2.dropna().reset_index(drop=True)['Name']], axis=1)
Output:
Name Name
0 Alex Mark
1 Bob Jane
2 Clarke Jack
If you want to modify only df3, it can be done via iloc and dropna:
df3 = pd.concat([df3.iloc[:, 0].dropna().reset_index(drop=True), df3.iloc[:, 1].dropna().reset_index(drop=True)], axis=1)
Output:
Name Name
0 Alex Mark
1 Bob Jane
2 Clarke Jack
people1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13], [None, None], [None, None], [None, None]]
people2 = [[None, None], [None, None], [None, None], ['Mark', 20], ['Jane', 22], ['Jack', 23]]
df1 = pd.DataFrame(people1, columns=['Name', 'Age']).dropna()
df2 = pd.DataFrame(people2, columns=['Name', 'Age']).dropna()
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
people_list = [df1, df2]
df3 = pd.concat((people_list[0]['Name'], people_list[1]['Name']), axis=1)
print(df3)
This will help you concatenate the two dataframes.
If I have understood correctly what you mean, this is a possible solution:
people1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13], [None, None], [None, None], [None, None]]
people2 = [[None, None], [None, None], [None, None], ['Mark', 20], ['Jane', 22], ['Jack', 23]]
df1 = pd.DataFrame(people1, columns=['Name1', 'Age']).dropna()
df2 = pd.DataFrame(people2, columns=['Name2', 'Age']).dropna().reset_index()
people_list = [df1, df2]
df3 = pd.concat((people_list[0]['Name1'], people_list[1]['Name2']), axis=1)
print(df3)
Name1 Name2
0 Alex Mark
1 Bob Jane
2 Clarke Jack
If you already have that dataframe:
count = df3.Name2.isna().sum()
df3.loc[:, 'Name2'] = df3.Name2.shift(-count)
df3 = df3.dropna()
print(df3)
Name1 Name2
0 Alex Mark
1 Bob Jane
2 Clarke Jack
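Alternatively, assuming every column of df3 ends up with the same number of non-missing values, the whole frame can be compacted column by column with apply; a minimal sketch:
df3 = df3.apply(lambda col: col.dropna().reset_index(drop=True))
print(df3)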

add "all" row to pandas group by

This is my code (using pandas 0.19.2):
from io import StringIO
import pandas as pd
data = StringIO("""category,region,sales
fruits,east,12
vegatables,east,3
fruits,west,5
vegatables,west,7
""")
df = pd.read_csv(data)
print(df.groupby('category', as_index=False).agg({'sales': sum}))
This is the output:
category sales
0 fruits 17
1 vegatables 10
My question is: how do I add an 'all' row so the output would look like this:
category sales
0 fruits 17
1 vegatables 10
all 27
You can try pivot_table and alter the new data:
new_df = df.pivot_table(columns='category',index='region', values='sales')
new_df['all'] = new_df.sum(axis=1)
Output:
category fruits vegatables all
region
east 12 3 15
west 5 7 12
And if you want it back in the shape of your original data:
new_df.stack().to_frame(name='Sales').reset_index()
Output:
region category Sales
0 east fruits 12
1 east vegatables 3
2 east all 15
3 west fruits 5
4 west vegatables 7
5 west all 12
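As an aside, pivot_table can compute those totals itself via margins=True; a sketch (margins_name sets the label, which defaults to 'All'):
new_df = df.pivot_table(columns='category', index='region', values='sales',
                        aggfunc='sum', margins=True, margins_name='all')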
Here is what I ended up doing:
from io import StringIO
import pandas as pd
data = StringIO("""category,region,sales
fruits,east,12
vegatables,east,3
fruits,west,5
vegatables,west,7
""")
df = pd.read_csv(data)
body = df.groupby('category', as_index=False).agg({'sales': sum})
head = df.groupby(lambda x: True, as_index=False)  # advanced pandas trickery
head = head.agg({'sales': sum})
head.insert(0, 'category', '*all*')
print(body.append(head))
Basically, create another dataframe with the 'all' row and concat it.
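Note that DataFrame.append was removed in pandas 2.0, so on current versions the same idea can be written with pd.concat; a minimal sketch:
body = df.groupby('category', as_index=False).agg({'sales': 'sum'})
head = pd.DataFrame({'category': ['*all*'], 'sales': [df['sales'].sum()]})
print(pd.concat([body, head], ignore_index=True))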

How do I update the value of column based on other dataframe

How do I update a column in pandas based on a condition from another dataframe?
I have 2 dataframes, df1 and df2:
import pandas as pd
df1 = pd.DataFrame({'names': ['andi', 'andrew', 'jhon', 'andreas'],
                    'salary': [1000, 2000, 2300, 1500]})
df2 = pd.DataFrame({'names': ['andi', 'andrew'],
                    'raise': [1500, 2500]})
Expected output:
names salary
andi 1500
andrew 2500
jhon 2300
andreas 1500
Use Series.combine_first with DataFrame.set_index:
df = (df2.set_index('names')['raise']
        .combine_first(df1.set_index('names')['salary'])
        .reset_index())
print (df)
names raise
0 andi 1500.0
1 andreas 1500.0
2 andrew 2500.0
3 jhon 2300.0
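Note that the combined column keeps the name raise and the rows come back sorted by names. To restore the column name from the expected output, a small tweak:
df = (df2.set_index('names')['raise']
        .combine_first(df1.set_index('names')['salary'])
        .rename('salary')  # restore the expected column name
        .reset_index())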
Using merge and update, similar to SQL:
df3 = pd.merge(df1, df2, how='left', on='names')
df3.loc[df3['raise'].notnull(),'salary'] = df3['raise']
df3
names salary raise
0 andi 1500.0 1500.0
1 andrew 2500.0 2500.0
2 jhon 2300.0 NaN
3 andreas 1500.0 NaN
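If the helper raise column should not remain in the result, a short follow-up (sketch):
df3 = df3.drop(columns='raise')
df3['salary'] = df3['salary'].astype(int)  # optional: back to integer dtype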