I have a few different data frames like the ones below.
df1
idx col1 col2 col3
2020-11-20 01:00:00 1 5 9
2020-11-20 02:00:00 2 6 10
2020-11-20 03:00:00 3 7 11
2020-11-20 04:00:00 4 8 12
df2
idx col4 col5 col6
2020-11-20 02:00:00 13 15 17
2020-11-20 03:00:00 14 16 18
df3
idx col7 col8 col9
2020-11-20 01:00:00 19 20 21
Essentially, I need to keep all the columns from all the DataFrames but align the values on the timestamp index of each one. My expected output is this:
df_merged
idx col1 col2 col3 col4 col5 col6 col7 col8 col9
2020-11-20 01:00:00 1 5 9 NaN NaN NaN 19 20 21
2020-11-20 02:00:00 2 6 10 13 15 17 NaN NaN NaN
2020-11-20 03:00:00 3 7 11 14 16 18 NaN NaN NaN
2020-11-20 04:00:00 4 8 12 NaN NaN NaN NaN NaN NaN
I have tried various things like merge, concat, and join, and have been doing it manually for hours now, and I am stumped as to why it won't work. These DataFrames are simplified versions, but my issue with these approaches is that my df1 has a length of 1619, df2 has a length of 1619, df3 has a length of 1617, and df4 (not shown here, but it follows the same idea) has a length of 1613. When I try
df_merged = reduce(lambda left, right: pd.merge(left, right, how='left'), [df1, df2, df3, df4])
what happens is that df_merged now has about 12k rows (not 1619 like the original df). I also tried dropping duplicates on the final df_merged, and that left me with only about 600 rows. I have tried manually combining them with loc, iloc and isin() as well, but still no luck.
Really any help would be greatly appreciated!
Use merge with how='outer'.
Demonstration:
# data preparation
import pandas as pd
string = """idx col1 col2 col3
2020-11-20 01:00:00 1 5 9
2020-11-20 02:00:00 2 6 10
2020-11-20 03:00:00 3 7 11
2020-11-20 04:00:00 4 8 12"""
# rsplit from the right keeps the timestamp, which itself contains a space, in one field
data = [x.rsplit(' ', 3) for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns = data[0])
string = """idx col4 col5 col6
2020-11-20 02:00:00 13 15 17
2020-11-20 03:00:00 14 16 18"""
data = [x.rsplit(' ', 3) for x in string.split('\n')]
df2 = pd.DataFrame(data[1:], columns = data[0])
string = """idx col7 col8 col9
2020-11-20 01:00:00 19 20 21"""
data = [x.rsplit(' ', 3) for x in string.split('\n')]
df3 = pd.DataFrame(data[1:], columns = data[0])
# solution
df.merge(df2, on='idx', how='outer').merge(df3, on='idx', how='outer')
Output:
idx col1 col2 col3 col4 col5 col6 col7 col8 col9
0 2020-11-20 01:00:00 1 5 9 NaN NaN NaN 19 20 21
1 2020-11-20 02:00:00 2 6 10 13 15 17 NaN NaN NaN
2 2020-11-20 03:00:00 3 7 11 14 16 18 NaN NaN NaN
3 2020-11-20 04:00:00 4 8 12 NaN NaN NaN NaN NaN NaN
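Since the timestamps are already the index of the original frames (unlike the string-parsed demo above), the same alignment can also be done in one line with pd.concat along the columns. A minimal sketch of that variant:

```python
import pandas as pd

# Rebuild the frames with the timestamps as the actual index,
# as in the original question (not the string-parsed demo).
df1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8],
                    'col3': [9, 10, 11, 12]},
                   index=pd.to_datetime(['2020-11-20 01:00:00',
                                         '2020-11-20 02:00:00',
                                         '2020-11-20 03:00:00',
                                         '2020-11-20 04:00:00']))
df2 = pd.DataFrame({'col4': [13, 14], 'col5': [15, 16], 'col6': [17, 18]},
                   index=pd.to_datetime(['2020-11-20 02:00:00',
                                         '2020-11-20 03:00:00']))
df3 = pd.DataFrame({'col7': [19], 'col8': [20], 'col9': [21]},
                   index=pd.to_datetime(['2020-11-20 01:00:00']))

# axis=1 aligns rows on the shared index and keeps every column;
# rows missing from a frame are filled with NaN, with no row duplication.
df_merged = pd.concat([df1, df2, df3], axis=1)
print(df_merged)
```

This avoids the row blow-up the asker saw, because concat aligns strictly on the index instead of joining on possibly duplicated key values.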
I have a dataframe that has multiple columns, out of which I need to pick a few, subtract the others from the first, and then multiply the result by another column.
For demonstration, I have simulated a simple dataframe, but with a similar structure to my actual dataframe.
Below is what I tried:
df = pd.DataFrame({'col0': ['09-June-2022', '10-June-2022',
                            '11-June-2022', '12-June-2022'],
                   'col1': [1, -2, 3, -4], 'col2': [-2, 5, 6, -8],
                   'col3': [-5, -5, 5, 9], 'col4': [3, 4, 5, 6]})
print(df)
columnlist = ['col1', 'col2', 'col3']
diff = 0
for c in columnlist:
    diff = diff - df[c]
final_calculation = diff.mul(df['col4'])
print(final_calculation)
On printing the df, it looks like this:
col0 col1 col2 col3 col4
0 09-June-2022 1 -2 -5 3
1 10-June-2022 -2 5 -5 4
2 11-June-2022 3 6 5 5
3 12-June-2022 -4 -8 9 6
The output that I get is:
0 18
1 8
2 -70
3 18
But it is incorrect. Ideally the final df should have been:
0 24
1 -8
2 -15
3 -30
I'm unable to wrap my head around this. I tried df.sub and df.diff, and tried using lambda and assign, but somehow I am unable to figure it out completely. Please help!
My actual df looks something like below:
df_export_interchangedata: GeneratedAt CalculatedEmissionFactor CPLE CPLW DUK LGEE MISO NYIS OVEC TVA
0 2018-07-01 01:00:00 0.000258 105.0 0.0 603.0 1133.0 0.0 0.0 578.0 621.0
1 2018-07-01 02:00:00 0.000251 0.0 0.0 535.0 992.0 0.0 0.0 577.0 795.0
2 2018-07-01 03:00:00 0.000246 2.0 0.0 123.0 897.0 0.0 0.0 545.0 801.0
3 2018-07-01 04:00:00 0.000239 520.0 0.0 0.0 833.0 0.0 0.0 467.0 778.0
4 2018-07-01 05:00:00 0.000233 596.0 0.0 18.0 679.0 0.0 0.0 343.0 637.0
... ... ... ... ... ... ... ... ... ... ...
60490 2022-05-30 20:00:00 NaN 182.0 0.0 0.0 101.0 2555.0 0.0 NaN 0.0
60491 2022-05-30 21:00:00 NaN 268.0 0.0 0.0 185.0 3555.0 0.0 NaN 0.0
60492 2022-05-30 22:00:00 NaN 30.0 0.0 0.0 124.0 3681.0 0.0 NaN 0.0
60493 2022-05-30 23:00:00 NaN 0.0 0.0 0.0 118.0 2846.0 0.0 NaN 0.0
60494 2022-05-31 00:00:00 NaN 0.0 0.0 0.0 0.0 2098.0 0.0 NaN 0.0
The idea is to multiply all columns except the first by -1, then sum across each row, and finally multiply by col4:
d = dict.fromkeys(columnlist[1:], -1)
d[columnlist[0]] = 1
print (d)
{'col2': -1, 'col3': -1, 'col1': 1}
df['out'] = df[columnlist].mul(pd.Series(d), axis=1).sum(axis=1).mul(df['col4'])
print (df)
col0 col1 col2 col3 col4 out
0 09-June-2022 1 -2 -5 3 24
1 10-June-2022 -2 5 -5 4 -8
2 11-June-2022 3 6 5 5 -40
3 12-June-2022 -4 -8 9 6 -30
EDIT:
df['out'] = df[columnlist].mul(pd.Series(d), axis=1).sum(axis=1)
print (df)
col0 col1 col2 col3 col4 out
0 09-June-2022 1 -2 -5 3 8
1 10-June-2022 -2 5 -5 4 -2
2 11-June-2022 3 6 5 5 -8
3 12-June-2022 -4 -8 9 6 -5
df['out'] *= df['col4'].to_numpy()
print (df)
col0 col1 col2 col3 col4 out
0 09-June-2022 1 -2 -5 3 24
1 10-June-2022 -2 5 -5 4 -8
2 11-June-2022 3 6 5 5 -40
3 12-June-2022 -4 -8 9 6 -30
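As a side note, the asker's original loop was close: seeding diff with the first column instead of 0 (which negated every column, col1 included) produces the same result as the dict approach. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'col0': ['09-June-2022', '10-June-2022',
                            '11-June-2022', '12-June-2022'],
                   'col1': [1, -2, 3, -4], 'col2': [-2, 5, 6, -8],
                   'col3': [-5, -5, 5, 9], 'col4': [3, 4, 5, 6]})
columnlist = ['col1', 'col2', 'col3']

# Seed the running difference with the first column, so only the
# remaining columns get subtracted (seeding with 0 negates col1 too).
diff = df[columnlist[0]].copy()
for c in columnlist[1:]:
    diff = diff - df[c]
df['out'] = diff * df['col4']
print(df)
```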
I have a data frame with many columns. I want col1 as the denominator and all the other columns as numerators. I have done this for just col2 (see the code below). I want to do this for all the other columns with short code.
df
Town col1 col2 col3 col4
A 8 7 5 2
B 8 4 2 3
C 8 5 8 5
here is my code for col2:
df['col2'] = df['col2'] / df['col1']
here is my result:
df
Town col1 col2 col3 col4
A 8 0.875 5 2
B 8 0.500 2 3
C 8 0.625 8 5
I want to do the same with all the other columns (i.e. col3, col4, ...).
It would be awesome if this could be done with pivot_table.
Thanks for your help!
Use df.iloc with df.div:
In [2084]: df.iloc[:, 2:] = df.iloc[:, 2:].div(df.col1, axis=0)
In [2085]: df
Out[2085]:
Town col1 col2 col3 col4
0 A 8 0.875 0.625 0.250
1 B 8 0.500 0.250 0.375
2 C 8 0.625 1.000 0.625
Or use df.filter and pd.concat with df.div:
In [2073]: x = df.filter(like='col').set_index('col1')
In [2078]: out = pd.concat([df.Town, x.div(x.index, axis=0).reset_index()], axis=1)
In [2079]: out
Out[2079]:
Town col1 col2 col3 col4
0 A 8 0.875 0.625 0.250
1 B 8 0.500 0.250 0.375
2 C 8 0.625 1.000 0.625
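If the frame can have arbitrarily many columns, one way (a sketch, not the only idiom) is to select everything except the key and the denominator with columns.difference:

```python
import pandas as pd

df = pd.DataFrame({'Town': ['A', 'B', 'C'], 'col1': [8, 8, 8],
                   'col2': [7, 4, 5], 'col3': [5, 2, 8],
                   'col4': [2, 3, 5]})

# Every column except the key and the denominator, however many exist.
cols = df.columns.difference(['Town', 'col1'])
df[cols] = df[cols].div(df['col1'], axis=0)
print(df)
```

This avoids hard-coding positions, so adding a col5 later needs no code change.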
I have 2 tables which I am merging (left join) on a common column, but that column does not have exact matches everywhere, and hence some of the merged values come out blank. I want to fill the missing values using the closest ten. For example, I have these two dataframes:
d = {'col1': [1.31, 2.22,3.33,4.44,5.55,6.66], 'col2': ['010100', '010101','101011','110000','114000','120000']}
df1=pd.DataFrame(data=d)
d2 = {'col2': ['010100', '010102','010144','114218','121212','166110'],'col4': ['a','b','c','d','e','f']}
df2=pd.DataFrame(data=d2)
# df1
col1 col2
0 1.31 010100
1 2.22 010101
2 3.33 101011
3 4.44 110000
4 5.55 114000
5 6.66 120000
# df2
col2 col4
0 010100 a
1 010102 b
2 010144 c
3 114218 d
4 121212 e
5 166110 f
After left merging on col2,
I get:
df1.merge(df2,how='left',on='col2')
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 NaN
2 3.33 101011 NaN
3 4.44 110000 NaN
4 5.55 114000 NaN
5 6.66 120000 NaN
Instead, what I want is: for every row where col4 is NaN, first round its col2 value to the closest ten and look for a match in df2's col2; if there is a match, take that col4. If not, try the closest hundred, then the closest thousand, ten thousand, and so on.
Ideally my answer should be:
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 a
2 3.33 101011 f
3 4.44 111100 d
4 5.55 114100 d
5 6.66 166100 f
Please help me code this.
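The expected output above is not fully derivable from the stated rounding cascade (for example, 120000's nearest code in df2 is 121212, which maps to e, not f). A common way to approximate closest-code matching in pandas is pd.merge_asof with direction='nearest' on numeric copies of the codes; the sketch below shows that variant rather than the exact tens/hundreds cascade:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1.31, 2.22, 3.33, 4.44, 5.55, 6.66],
                    'col2': ['010100', '010101', '101011',
                             '110000', '114000', '120000']})
df2 = pd.DataFrame({'col2': ['010100', '010102', '010144',
                             '114218', '121212', '166110'],
                    'col4': ['a', 'b', 'c', 'd', 'e', 'f']})

# "Closeness" needs numbers, so join on integer copies of the codes.
df1['key'] = df1['col2'].astype(int)
df2['key'] = df2['col2'].astype(int)

# merge_asof picks, for each left row, the nearest right-hand key;
# both frames must be sorted on the join key.
out = pd.merge_asof(df1.sort_values('key'),
                    df2.sort_values('key'),
                    on='key', direction='nearest',
                    suffixes=('', '_right'))
out = out.drop(columns=['key', 'col2_right'])
print(out)
```

Exact matches still win (distance zero), and every row gets the nearest available code, so no NaNs remain.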
I have a dataframe let's say:
col1 col2 col3
1 x 3
1 y 4
and I have a list:
2
3
4
5
Can I append the list to the data frame like this:
col1 col2 col3
1 x 3
1 y 4
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
Thank you.
Use concat (or, before pandas 2.0, append) with the DataFrame constructor:
df = df.append(pd.DataFrame([2,3,4,5], columns=['col1']))  # DataFrame.append was removed in pandas 2.0
df = pd.concat([df, pd.DataFrame([2,3,4,5], columns=['col1'])])
print (df)
col1 col2 col3
0 1 x 3.0
1 1 y 4.0
0 2 NaN NaN
1 3 NaN NaN
2 4 NaN NaN
3 5 NaN NaN
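On pandas 2.0 and later (where append is gone), concat is the only option; adding ignore_index=True also gives a clean 0..n-1 index instead of the repeated 0, 1, 0, 1, ... above. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1], 'col2': ['x', 'y'], 'col3': [3, 4]})

# DataFrame.append was removed in pandas 2.0; concat is the supported path.
# ignore_index=True renumbers the result 0..n-1 instead of 0,1,0,1,2,3.
out = pd.concat([df, pd.DataFrame({'col1': [2, 3, 4, 5]})],
                ignore_index=True)
print(out)
```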
Suppose I have a data frame in pandas like the following:
a 1 11
a 3 12
a 20 13
b 2 14
b 4 15
I want to generate a resulting data frame like this:
V1 1 2 3 4 20
a 11 NaN 12 NaN 13
b NaN 14 NaN 15 NaN
How can I get this transformation?
Thank you.
You can use pivot:
import pandas as pd
df = pd.DataFrame({'col1': ['a','a','a','b','b'],
'col2': [1,3,20,2,4],
'col3': [11,12,13,14,15]})
print(df.pivot(index='col1', columns='col2'))
Output:
col3
col2 1 2 3 4 20
col1
a 11 NaN 12 NaN 13
b NaN 14 NaN 15 NaN
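Naming the values column explicitly avoids the two-level ('col3', col2) column header in the output above. For example:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': [1, 3, 20, 2, 4],
                   'col3': [11, 12, 13, 14, 15]})

# values='col3' keeps the column header flat: just the col2 values,
# instead of a MultiIndex with 'col3' as the top level.
out = df.pivot(index='col1', columns='col2', values='col3')
print(out)
```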