Pandas: Joining dataframes from different sources - sql

Have the following datasets from two different sources i.e. Oracle and MySQL:
DF1 (Oracle):
A     B     C
      1122  8827
822   8282  6622
727   72    1183
      91    5092
992   113   7281
DF2 (MySQL):
E     F     G
      8827  6363
822   5526  9393
727   928   6671
      9221  7282
992   921   7262
      445   6298
Need to join these in pandas such that the below result is obtained.
Expected o/p:
A     B     C     F     G
822   8282  6622  5526  9393
727   72    1183  928   6671
992   113   7281  921   7262
      1122  8827
      91    5092
                  8827  6363
                  445   6298
Update_1:
As suggested, tried the following:
import pandas as pd
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print(df1)
data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print(df2)
DF11 = df1.set_index(df1['A'].fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
getting the following:
     A       B       C       F       G
0  727    72.0  1183.0   928.0  6671.0
1  822  8282.0  6622.0  5526.0  9393.0
2  992   113.0  7281.0   921.0  7262.0
3       1122.0  8827.0  8827.0  6363.0
4       1122.0  8827.0  9221.0  7282.0
5       1122.0  8827.0   445.0  6298.0
6         91.0  5092.0  8827.0  6363.0
7         91.0  5092.0  9221.0  7282.0
8         91.0  5092.0   445.0  6298.0
Q: How to avoid the repetition of values and get the expected o/p?

Your problem is complicated by nulls in the join key. You can try some logic like this to achieve your result, or create a different join key that doesn't have nulls.
DF11 = DF1.set_index(DF1['A'].fillna(DF1.groupby('A').cumcount().astype(str)+'A'))
DF22 = DF2.set_index(DF2['E'].fillna(DF2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
Output:
       A       B       C       F       G
0    NaN  1122.0  8827.0     NaN     NaN
1  822.0  8282.0  6622.0  5526.0  9393.0
2  727.0    72.0  1183.0   928.0  6671.0
3    NaN    91.0  5092.0     NaN     NaN
4  992.0   113.0  7281.0   921.0  7262.0
5    NaN     NaN     NaN  8827.0  6363.0
6    NaN     NaN     NaN  9221.0  7282.0
7    NaN     NaN     NaN   445.0  6298.0
Update: because your data has blanks ('') rather than np.nan, I had to add a call in those statements to replace '' with np.nan so that fillna works correctly.
df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))
Try this:
import pandas as pd
import numpy as np
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print(df1)
data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print(df2)
DF11 = df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].replace('',np.nan).fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
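The same null-safe-key idea can also be sketched with a throwaway key column instead of the index (the _k column name is just an illustration, not part of the original answer):
import numpy as np

# Build a null-safe key: real values of A/E where present, otherwise a
# unique per-frame placeholder so blank-key rows never match each other.
df1['_k'] = df1['A'].replace('', np.nan)
df1['_k'] = df1['_k'].fillna(df1.groupby('A').cumcount().astype(str) + 'A')
df2['_k'] = df2['E'].replace('', np.nan)
df2['_k'] = df2['_k'].fillna(df2.groupby('E').cumcount().astype(str) + 'E')

res = (df1.merge(df2, on='_k', how='outer')
          .drop(columns=['_k', 'E'])
          .reset_index(drop=True))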

Question: for your desired output, did you intentionally leave out column E?
If not...
I'm not sure the dataframes coming from different sources have any bearing on how they are joined together.
import pandas as pd
...
frames = [DF1, DF2]
result = pd.concat(frames)
This stacks the two frames row-wise; note that a plain concat unions the columns and fills the gaps with NaN rather than matching rows on the A/E keys, so it only approximates the expected output.
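For reference, a minimal sketch of what pd.concat does with two frames that share no column names (data abbreviated from the question):
import pandas as pd

DF1 = pd.DataFrame({'A': [822], 'B': [8282], 'C': [6622]})
DF2 = pd.DataFrame({'E': [822], 'F': [5526], 'G': [9393]})

# Row-wise concat: columns are unioned, missing cells become NaN
result = pd.concat([DF1, DF2], ignore_index=True)
print(result)
#        A       B       C      E       F       G
# 0  822.0  8282.0  6622.0    NaN     NaN     NaN
# 1    NaN     NaN     NaN  822.0  5526.0  9393.0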

Related

How to get each column merged from a separate file to line up next to each other rather than under each df?

Trying to append single-column dfs to one csv, each column representing one of the old dfs. I don't know how to stop the dfs from stacking in the csv file.
import os
import pandas as pd

master_df = pd.DataFrame()
for file in os.listdir('TotalDailyMCUSDEachPool'):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))
master_df.to_csv('MasterFile.csv', index=False)
DataFrame.append has no axis argument; you should use pd.concat([...], axis=1) for your purpose. Check this for more info: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html.
An example is as follows:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
    }
)
df2 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
    }
)
result = pd.concat([df1, df2], axis=1)
# A B
# 0 A0 B2
# 1 A1 B3
# 2 A2 B6
# 3 A3 B7
append works fine "out of the box" only if the column names of the datasets are the same and you just want to append rows. In this example:
master_df = pd.DataFrame()
df = pd.DataFrame({'Numbers': [1010, 2020, 3030, 2020, 1515, 3030, 4545]})
df2 = pd.DataFrame({'Others': [1015, 2132, 3030, 2020, 1515, 3030, 4545]})
master_df = master_df.append(df)
master_df = master_df.append(df2)
master_df
You would get this output:
Numbers Others
0 1010.0 NaN
1 2020.0 NaN
2 3030.0 NaN
3 2020.0 NaN
4 1515.0 NaN
5 3030.0 NaN
6 4545.0 NaN
0 NaN 1015.0
1 NaN 2132.0
2 NaN 3030.0
3 NaN 2020.0
4 NaN 1515.0
5 NaN 3030.0
6 NaN 4545.0
To prevent it, use pd.concat() with axis=1
master_df = pd.concat([df,df2], axis = 1)
master_df
which should result in the desired dataset:
Numbers Others
0 1010 1015
1 2020 2132
2 3030 3030
3 2020 2020
4 1515 1515
5 3030 3030
6 4545 4545
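One caveat worth noting: pd.concat with axis=1 aligns on the index, so if the per-file frames carry non-default indexes the columns can misalign. A small sketch of the reset_index guard (made-up sample data):
import pandas as pd

df = pd.DataFrame({'Numbers': [1010, 2020]}, index=[5, 6])
df2 = pd.DataFrame({'Others': [1015, 2132]})  # default index 0, 1

# Without reset_index the rows would not line up (union of indexes, NaNs).
master_df = pd.concat([df.reset_index(drop=True),
                       df2.reset_index(drop=True)], axis=1)
print(master_df)
#    Numbers  Others
# 0     1010    1015
# 1     2020    2132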

regroup uneven number of rows pandas df

I need to regroup a df from the above format into the one below, but it fails and the output shape is (number of unique IDs, 2). Is there a more obvious solution?
You can use groupby and pivot:
(df.assign(n=df.groupby('ID').cumcount().add(1))
   .pivot(index='ID', columns='n', values='Value')
   .add_prefix('val_')
   .reset_index()
)
Example input:
df = pd.DataFrame({'ID': [7, 7, 8, 11, 12, 18, 22, 22, 22],
                   'Value': list('abcdefghi')})
Output:
n ID val_1 val_2 val_3
0 7 a b NaN
1 8 c NaN NaN
2 11 d NaN NaN
3 12 e NaN NaN
4 18 f NaN NaN
5 22 g h i
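The same reshape can also be sketched without pivot, using unstack on a cumcount-based MultiIndex (equivalent logic, not from the original answer):
out = (df.set_index(['ID', df.groupby('ID').cumcount().add(1)])['Value']
         .unstack()
         .add_prefix('val_')
         .reset_index())
print(out)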

Split one column of dataframe to new columns in pandas

I need to change the following data frame, in which one column contains a list of tuples
df = pd.DataFrame({'columns1': list('AB'), 'columns2': [1, 2],
                   'columns3': [[(122, 0.5), (104, 0)], [(104, 0.6)]]})
print(df)
columns1 columns2 columns3
0 A 1 [(122, 0.5), (104, 0)]
1 B 2 [(104, 0.6)]
into this, in which each tuple's first element becomes the column header:
columns1 columns2 104 122
0 A 1 0.0 0.5
1 B 2 0.6 NaN
How can I do this using pandas in a Jupyter notebook?
Use a list comprehension to convert the values to dictionaries, sort the columns, and add them to the original with DataFrame.join:
df = pd.read_csv('Sample - Sample.csv.csv')
print(df)
column1 column2 column3
0 A U1 [(187, 0.674), (111, 0.738)]
1 B U2 [(54, 1.0)]
2 C U3 [(169, 0.474), (107, 0.424), (88, 0.519), (57,...
import ast

# column3 was read from a CSV, so each cell is a string; parse it first
df1 = pd.DataFrame([dict(ast.literal_eval(x)) for x in df.pop('column3')], index=df.index).sort_index(axis=1)
df = df.join(df1)
print(df)
column1 column2 54 57 64 88 107 111 169 187
0 A U1 NaN NaN NaN NaN NaN 0.738 NaN 0.674
1 B U2 1.0 NaN NaN NaN NaN NaN NaN NaN
2 C U3 NaN 0.526 0.217 0.519 0.424 NaN 0.474 NaN
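For completeness: the question's in-memory frame holds real tuples rather than strings, so a shorter variant without the ast parsing should work there (a sketch against the df defined in the question):
df1 = pd.DataFrame([dict(x) for x in df.pop('columns3')], index=df.index).sort_index(axis=1)
df = df.join(df1)
print(df)
#   columns1  columns2  104  122
# 0        A         1  0.0  0.5
# 1        B         2  0.6  NaN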

Pandas subtract column values only when non nan

I have a dataframe df as follows with about 200 columns:
Date      Run_1  Run_295  Prc
2/1/2020                  3
2/2/2020  2               6
2/3/2020         5        2
I want to subtract column Prc from the Run_ columns (Run_1, Run_295, ..., Run_300) only where they are non-NaN and non-empty, to get the following:
Date      Run_1  Run_295
2/1/2020
2/2/2020  -4
2/3/2020         3
I am not sure how to proceed with the above.
Code to reproduce the dataframe:
import pandas as pd
from io import StringIO
s = """Date,Run_1,Run_295,Prc
2/1/2020,,,3
2/2/2020,2,,6
2/3/2020,,5,2"""
df = pd.read_csv(StringIO(s))
print(df)
You can simply subtract the columns; NaN propagates through the subtraction, which is exactly what you want:
df.Run_1-df.Prc
Here is the complete code to your output:
df.Run_1 = df.Run_1 - df.Prc
df.Run_295 = df.Run_295 - df.Prc
df.drop('Prc', axis=1, inplace=True)
df
Date Run_1 Run_295
0 2/1/2020 NaN NaN
1 2/2/2020 -4.0 NaN
2 2/3/2020 NaN 3.0
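Since the real frame has about 200 columns, here is a sketch that handles all of them at once (it assumes every target column shares the Run_ prefix):
# Select every Run_* column and subtract Prc row-wise in one shot;
# NaN cells stay NaN automatically.
run_cols = df.filter(like='Run_').columns
df[run_cols] = df[run_cols].sub(df['Prc'], axis=0)
df = df.drop('Prc', axis=1)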
Three steps: melt to unpivot your dataframe, then loc to handle the assignment, and GroupBy to remake your original df.
I'm sure there is a better way to do this, but it avoids loops and apply:
cols = df.columns
s = pd.melt(df,id_vars=['Date','Prc'],value_name='Run Rate')
s.loc[s['Run Rate'].notnull(), 'Run Rate'] = s['Run Rate'] - s['Prc']
df_new = s.groupby([s["Date"], s["Prc"], s["variable"]])["Run Rate"].first().unstack(-1)
print(df_new[cols])
variable Date Run_1 Run_295 Prc
0 2/1/2020 NaN NaN 3
1 2/2/2020 -4.0 NaN 6
2 2/3/2020 NaN 3.0 2

Pandas DataFrame .merge not displaying Nan

I have a data frame df containing only dates from 2007-01-01 to 2018-04-30 (not as index)
I have a second data frame sub containing dates and values from 2007-01-01 to 2018-04-20
I want to have a result data frame res with ALL dates from df and the values from sub at the right place. I am using
res = pd.merge(df, sub, on='date', how='outer')
I expect to have NaNs from 2018-04-21 to 2018-04-30 in the res data frame.
Instead, res only has values up to 2018-04-20 (the missing dates were truncated).
Why?
IIUC, setting indexes and then joining will be useful here:
import pandas as pd
import numpy as np

## create sample data
df = pd.DataFrame({'mdates': pd.date_range('12/13/1989', periods=100, freq='D')})
df['val'] = np.random.randint(10, 500, 100)
df1 = pd.DataFrame({'mdates': pd.date_range('12/01/1989', periods=50, freq='D')})
## join data
df1 = df1.set_index('mdates').join(df.set_index('mdates'))
print(df1.head(20))
val
mdates
1989-12-01 NaN
1989-12-02 NaN
1989-12-03 NaN
1989-12-04 NaN
1989-12-05 NaN
1989-12-06 NaN
1989-12-07 NaN
1989-12-08 NaN
1989-12-09 NaN
1989-12-10 NaN
1989-12-11 NaN
1989-12-12 NaN
1989-12-13 215.0
1989-12-14 189.0
1989-12-15 97.0
1989-12-16 264.0
1989-12-17 419.0
1989-12-18 57.0
1989-12-19 376.0
1989-12-20 448.0
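Worth noting: an outer merge keeps keys from both frames, so if rows seem to vanish the usual suspect is the key column itself. A hedged sketch that normalizes the 'date' dtypes on the asker's frames before merging (dtype mismatch is one plausible cause, not a confirmed diagnosis):
# If one frame holds datetime64 dates and the other holds strings,
# the keys will not align as expected; convert both sides first.
df['date'] = pd.to_datetime(df['date'])
sub['date'] = pd.to_datetime(sub['date'])
res = pd.merge(df, sub, on='date', how='outer')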