Pandas: Round integers before joining dataframes

I have two data frames that both contain coordinates. One of them, df1, has coordinates at a finer resolution (with decimals), and I would like to join it to df2, which has a coarser resolution:
import pandas as pd
df1 = pd.DataFrame({'x': [1.1, 2.2, 3.3],
'y': [2.3, 3.3, 4.1],
'val': [10,11,12]})
df2 = pd.DataFrame({'x': [1,2,3,5.5],
'y': [2,3,4,5.6]})
df1['x_org']=df1['x']
df1['y_org']=df1['y']
df1[['x','y']] = df1[['x','y']].round()
df1 = pd.merge(df1, df2, how='left', on=['x','y'])
df1 = df1.drop(['x','y'], axis=1)
# rename...
The code above does exactly what I want, but it is a bit cumbersome. Is there an easier way to achieve this?

Use:
df1.merge(df2,
how='left',
left_on=[df1['x'].round(), df1['y'].round()],
right_on=['x','y'],
suffixes=('','_')).drop(['x_','y_'], axis=1)
It is also possible to remove the columns ending in _ dynamically:
df = df1.merge(df2,
how='left',
left_on=[df1['x'].round(), df1['y'].round()],
right_on=['x','y'],
suffixes=('','_')).filter(regex='.*[^_]$')
print (df)
x y val
0 1.1 2.3 10
1 2.2 3.3 11
2 3.3 4.1 12
df = df1.merge(df2,
how='left',
left_on=[df1['x'].round(), df1['y'].round()],
right_on=['x','y'],
suffixes=('','_end')).filter(regex='.*(?<!_end)$')
print (df)
x y val
0 1.1 2.3 10
1 2.2 3.3 11
2 3.3 4.1 12
Or:
df = (df1.set_index(['x','y'], drop=False).rename(lambda x: round(x))
.merge(df2.set_index(['x','y']),
left_index=True,
right_index=True,
how='left').reset_index(drop=True))
print (df)
x y val
0 1.1 2.3 10
1 2.2 3.3 11
2 3.3 4.1 12

IIUC, you could pass the rounded values as joining keys:
pd.merge(df1.rename(columns={'x': 'x_org', 'y': 'y_org'}),
df2,
how='left',
left_on=[df1['x'].round(), df1['y'].round()],
right_on=['x', 'y'])#.drop(['x','y'], axis=1) # if x/y are unwanted
output:
x_org y_org val x y
0 1.1 2.3 10 1.0 2.0
1 2.2 3.3 11 2.0 3.0
2 3.3 4.1 12 3.0 4.0
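For comparison, the same idea can be sketched with explicit temporary key columns, which keeps the original coordinates untouched and avoids suffix handling. This is only a minimal sketch: the `label` column in df2 and the `x_key`/`y_key` names are assumptions added for illustration (the df2 in the question has no payload column, so there the join would bring nothing over).

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1.1, 2.2, 3.3],
                    'y': [2.3, 3.3, 4.1],
                    'val': [10, 11, 12]})
# 'label' is a hypothetical payload column, added so the join has something to bring over
df2 = pd.DataFrame({'x': [1, 2, 3, 5.5],
                    'y': [2, 3, 4, 5.6],
                    'label': ['a', 'b', 'c', 'd']})

# Hypothetical temporary names for the join keys
key_names = {'x': 'x_key', 'y': 'y_key'}

merged = (
    # Attach rounded copies of x/y to df1 under the temporary names
    df1.join(df1[['x', 'y']].round().rename(columns=key_names))
       # Rename df2's coordinates to the same temporary names and join on them
       .merge(df2.rename(columns=key_names), how='left', on=['x_key', 'y_key'])
       # The temporary keys are no longer needed
       .drop(columns=['x_key', 'y_key'])
)
print(merged)
```

One thing to keep in mind with any of these variants: the merge compares float keys for exact equality, so it relies on the rounded values in df1 matching the values in df2 exactly.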

Related

Pandas pivot table with prefix to columns

I have a dataframe:
df = C1 A1. A2. A3. Type
A 1. 5. 2. AG
A 7. 3. 8. SC
And I want to create:
df = C1 A1_AG A1_SC A2_AG A2_SC
A 1. 7. 5. 3
How can it be done?
You could use melt and transpose instead:
(df.melt('Type')
.assign(col=lambda d: d['Type']+'_'+d['variable'])
.set_index('col')[['value']].T
)
Output:
col AG_A1 SC_A1 AG_A2 SC_A2 AG_A3 SC_A3
value 1 7 5 3 2 8
With additional column(s):
(df.melt(['C1', 'Type'])
.assign(col=lambda d: d['Type']+'_'+d['variable'])
.pivot(index=['C1'], columns='col', values='value')
.reset_index()
)
Output:
col C1 AG_A1 AG_A2 AG_A3 SC_A1 SC_A2 SC_A3
0 A 1 5 2 7 3 8
Use DataFrame.set_index with DataFrame.unstack:
df = df.set_index(['C1','Type']).unstack()
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index()
print (df)
C1 A1_AG A1_SC A2_AG A2_SC A3_AG A3_SC
0 A 1.0 7.0 5.0 3.0 2.0 8.0
One convenience option with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_wider(index = 'C1', names_from = 'Type')
C1 A1_AG A1_SC A2_AG A2_SC A3_AG A3_SC
0 A 1.0 7.0 5.0 3.0 2.0 8.0
Of course, you can skip the convenience function and use pivot directly:
result = df.pivot(index='C1', columns='Type')
result.columns = result.columns.map('_'.join)
result.reset_index()
C1 A1_AG A1_SC A2_AG A2_SC A3_AG A3_SC
0 A 1.0 7.0 5.0 3.0 2.0 8.0
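Since the question only shows the frame as a table, here is a self-contained sketch of the pivot approach with the input built explicitly. The constructor below is an assumption reconstructed from the table in the question (values read as floats); the flattening step uses a list comprehension in place of map, which should be equivalent here.

```python
import pandas as pd

# Input reconstructed from the table in the question (an assumption)
df = pd.DataFrame({'C1': ['A', 'A'],
                   'A1': [1.0, 7.0],
                   'A2': [5.0, 3.0],
                   'A3': [2.0, 8.0],
                   'Type': ['AG', 'SC']})

# Pivot all remaining columns; the result has MultiIndex columns like ('A1', 'AG')
wide = df.pivot(index='C1', columns='Type')

# Flatten each (column, type) pair into a single 'column_type' label
wide.columns = [f'{col}_{typ}' for col, typ in wide.columns]
wide = wide.reset_index()
print(wide)
```

If only some of the value columns are wanted (the question lists A1 and A2 only), passing `values=['A1', 'A2']` to pivot restricts the result accordingly.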

Quickly replace values in a Pandas DataFrame

I have the following dataframe:
df = pd.DataFrame(
{
'A':[1,2],
'B':[3,4]
}, index=['1','2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
# A B Sum
# 1 1 3 4
# 2 2 4 6
# Sum 3 7 10
I want to:
replace 1 by 3*4/10
replace 2 by 3*6/10
replace 3 by 4*7/10
replace 4 by 7*6/10
What is the easiest way to do this? I want the solution to be able to extend to n number of rows and columns. Been cracking my head over this. TIA!
If I understood you correctly:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
'A':[1,2],
'B':[3,4]
}, index=['1','2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
conditions = [(df==1), (df==2), (df==3), (df==4)]
values = [(3*4)/10, (3*6)/10, (4*7)/10, (7*6)/10]
df[df.columns] = np.select(conditions, values, df)
Output:
A B Sum
1 1.2 2.8 4.2
2 1.8 4.2 6.0
Sum 2.8 7.0 10.0
Let's build it from the original df, before the Sum row and column are added:
import numpy as np
v = np.multiply.outer(df.sum(axis=1).values, df.sum().values) / df.sum().sum()
out = pd.DataFrame(v,index=df.index,columns=df.columns)
out
Out[20]:
A B
1 1.2 2.8
2 1.8 4.2
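The outer-product approach replaces each cell with (row total * column total) / grand total, so it should extend to any number of rows and columns. A minimal sketch of that, wrapped in a helper; the name `expected_table` and the 3x3 sample frame are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

def expected_table(df):
    """Replace each cell by row_total * col_total / grand_total (hypothetical helper)."""
    row_totals = df.sum(axis=1).to_numpy()
    col_totals = df.sum(axis=0).to_numpy()
    grand_total = row_totals.sum()
    values = np.outer(row_totals, col_totals) / grand_total
    return pd.DataFrame(values, index=df.index, columns=df.columns)

# The 2x2 frame from the question, before the Sum row/column are added
small = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['1', '2'])
out_small = expected_table(small)

# A made-up 3x3 frame to show that nothing is hard-coded to two rows or columns
bigger = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5], 'C': [0, 1, 2]},
                      index=['1', '2', '3'])
out_bigger = expected_table(bigger)
print(out_small)
print(out_bigger)
```

A quick sanity check on the result is that its row and column totals equal those of the input.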

join two df with unequal count of levels in column

I would like to join two dataframes (df1, df2), but I can't figure out how.
import pandas as pd
data = {'Step_ID': ["Step1", "Step1", "Step1", "Step2", "Step2", "Step3", "Step3"],
'value_01': [2, 2.3, 2.2, 0, 0, 5, 5.2]}
df1 = pd.DataFrame(data)
data = {'Step_ID': ["Step1", "Step1", "Step1", "Step1", "Step2", "Step2", "Step2", "Step3", "Step3", "Step3"],
'value_02': [2.3, 2.5, 2.1, 2.5, 0, 0, 0, 5.1, 5.6, 5.8]}
df2 = pd.DataFrame(data)
I would like to merge them on the column "Step_ID" as follows:
I tried several merges with different settings, but without any success.
pd.merge(df1, df2, left_on = ['Step_ID'], right_on = ['Step_ID'], how = 'outer')
The closest I got was with the following code, but it is not what I need:
df1.combine_first(df2)
Is there any way to join these two dataframes in the required way? See the picture above.
We can try with cumcount + merge
new_df = df1.assign(index_count=df1.groupby('Step_ID').cumcount())\
.merge(df2.assign(index_count=df2.groupby('Step_ID').cumcount()),
on=['Step_ID', 'index_count'], how='outer')\
.sort_values(['Step_ID', 'index_count'])\
.drop('index_count', axis=1)
print(new_df)
Step_ID value_01 value_02
0 Step1 2.0 2.3
1 Step1 2.3 2.5
2 Step1 2.2 2.1
7 Step1 NaN 2.5
3 Step2 0.0 0.0
4 Step2 0.0 0.0
8 Step2 NaN 0.0
5 Step3 5.0 5.1
6 Step3 5.2 5.6
9 Step3 NaN 5.8
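To see why this works, it may help to look at the helper column on its own: cumcount numbers the rows within each Step_ID group starting from 0, so (Step_ID, index_count) pairs up the n-th row of a step in df1 with the n-th row of the same step in df2. A small sketch using df1 from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Step_ID': ["Step1", "Step1", "Step1", "Step2", "Step2", "Step3", "Step3"],
                    'value_01': [2, 2.3, 2.2, 0, 0, 5, 5.2]})

# Position of each row within its Step_ID group: 0, 1, 2, ... restarting per group
counter = df1.groupby('Step_ID').cumcount()
print(df1.assign(index_count=counter))
```

With how='outer', rows of df2 whose (Step_ID, index_count) pair has no partner in df1 are kept and get NaN in value_01.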

Pandas | How to calculate the average value of each cell in multiple dataframe with the same shape?

I have several DataFrames like this:
They are stored in a list df_list = [df1,df2,df3,df4,df5....]
I want to generate a new DataFrame df_average.
In df_average, each cell equals the average of the corresponding cells of df1, df2, df3, df4, and so on. For example:
df_average[1,'Q1'] = average(df1[1,'Q1'],df2[1,'Q1'],df3[1,'Q1'],df4[1,'Q1']),
df_average[1,'Q2'] = average(df1[1,'Q2'],df2[1,'Q2'],df3[1,'Q2'],df4[1,'Q2'])
How can I do this efficiently?
The code below averages the values of each cell. The output has the same shape as the input dataframes:
# Import libraries
import pandas as pd
# Create DataFrame
df1 = pd.DataFrame({
'Q1': [1,2,3],
'Q2': [11,12,13],
'Q3': [10,20,30],
'Q4': [31,32,33],
'Q5': [61,62,63],
})
df2 = df1.copy()*2
df3 = df1.copy()*0.5
df4 = df1.copy()*-1
# Get average
df_average = (df1+df2+df3+df4)/4
df_average
Output:
Q1 Q2 Q3 Q4 Q5
0 0.625 6.875 6.25 19.375 38.125
1 1.250 7.500 12.50 20.000 38.750
2 1.875 8.125 18.75 20.625 39.375
You can use pd.concat, followed by a groupby on the index using mean for aggregation.
df1 = pd.DataFrame({'Q1':[1,2,3], 'Q2':[1,7,8], 'Q3':[8,9,1], 'Q4':[4,3,7]})
df2 = pd.DataFrame({'Q1':[7,9,10], 'Q2':[9,2,8], 'Q3':[3,4,2], 'Q4':[1,5,6]})
df_average = pd.concat([df1, df2])
df_average = df_average.groupby(df_average.index).agg({'Q1': 'mean',
'Q2': 'mean',
'Q3': 'mean',
'Q4': 'mean'})
print(df_average)
Q1 Q2 Q3 Q4
0 4.0 5.0 5.5 2.5
1 5.5 4.5 6.5 4.0
2 6.5 8.0 1.5 6.5
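Since the question stores the frames in a list, both ideas can be written so that they work for a list of any length. A minimal sketch, assuming all frames share the same index and columns; the two sample frames are the ones from the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'Q1': [1, 2, 3], 'Q2': [1, 7, 8], 'Q3': [8, 9, 1], 'Q4': [4, 3, 7]})
df2 = pd.DataFrame({'Q1': [7, 9, 10], 'Q2': [9, 2, 8], 'Q3': [3, 4, 2], 'Q4': [1, 5, 6]})
df_list = [df1, df2]

# Option 1: stack all frames, then average the rows that share an index label
avg_concat = pd.concat(df_list).groupby(level=0).mean()

# Option 2: element-wise sum of the frames divided by how many there are
avg_sum = sum(df_list) / len(df_list)

print(avg_concat)
print(avg_sum)
```

The two options differ when cells are missing: mean skips NaN values, while the element-wise sum propagates them.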

pandas groupby and agg operation of selected columns and row

I have a dataframe as below:
I am not sure if it is possible to use pandas to make an output as below:
difference=Response[df.Time=="pre"]-Response.min for each group
If pre is always the first row in each group and the value should be repeated on every row of the group:
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: x.iat[0] - x.min())
To show the value only on the first row of each group, you can replace the repeats with empty strings, but the column then mixes numbers and strings, which makes further processing harder:
df['diff'] = df['diff'].mask(df['IDs'].duplicated(), '')
EDIT:
df = pd.DataFrame({
'Response':[2,5,0.4,2,1,4],
'Time':[7,'pre',9,4,2,'pre'],
'IDs':list('aaabbb')
})
#print (df)
d = df[df.Time=="pre"].set_index('IDs')['Response'].to_dict()
print (d)
{'a': 5.0, 'b': 4.0}
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: d[x.name] - x.min())
print (df)
Response Time IDs diff
0 2.0 7 a 4.6
1 5.0 pre a 4.6
2 0.4 9 a 4.6
3 2.0 4 b 3.0
4 1.0 2 b 3.0
5 4.0 pre b 3.0
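The same result can probably be reached without the intermediate dictionary or a lambda, by broadcasting both the pre value and the group minimum with transform. This is a sketch of an alternative, not the method from the answer above; it assumes each group has exactly one pre row, and the names `pre` and `group_min` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Response': [2, 5, 0.4, 2, 1, 4],
    'Time': [7, 'pre', 9, 4, 2, 'pre'],
    'IDs': list('aaabbb')
})

# Keep Response only on the 'pre' rows (NaN elsewhere), then spread the first
# non-missing value of each group to all rows of that group
pre = df['Response'].where(df['Time'] == 'pre').groupby(df['IDs']).transform('first')

# Minimum Response per group, repeated on every row of the group
group_min = df.groupby('IDs')['Response'].transform('min')

df['diff'] = pre - group_min
print(df)
```

Unlike the x.iat[0] variant, this does not depend on where the pre row sits within its group.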