Making multiple pandas data frames with multi-indexed columns and joining them all together - pandas

Some would say this should be two separate questions, but they are interrelated, so I am writing them both here.
1. Making multi-indexed columns
I have three data frames:
data_large = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 60, 50], "buy":[20, 30, 40]})
data_mini = pd.DataFrame({"name":["b", "c", "d"], "sell":[60, 20, 10], "buy":[30, 50, 40]})
data_topix = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 80, 0], "buy":[70, 30, 40]})
But first of all, I want to make their columns multi-indexed, with a product level above the existing columns.
This is what I tried, but it doesn't work as expected: name ends up under the index level Nikkei225Large.
iterables = [['Nikkei225Large'], ['name', 'buy', 'sell']]
index_large = pd.MultiIndex.from_product(iterables, names=['product', 'sell_buy'])
data_large.columns = index_large
2. Joining multiple data frames with multi-indexed columns, e.g. using reduce
Next, I want to outer-join the three data frames on the name column. For now, I just join them using reduce as below, but I want the joined result to keep the multi-indexed columns.
from functools import reduce

dfs = {0: data_large, 1: data_mini, 2: data_topix}

def agg_df(dfList):
    # pd.merge raises if both on= and left/right_index= are passed,
    # so merge on the shared 'name' column only
    df_agged = reduce(lambda left, right: pd.merge(left, right,
                                                   on='name',
                                                   how='outer'), dfList)
    return df_agged

df_final = agg_df(dfs.values())
Any help would be appreciated!

IIUC, you can do this using pd.concat with the keys parameter:
df_out = pd.concat([dfi.set_index('name') for dfi in [data_large, data_mini, data_topix]],
                   keys=['Nikkei225Large', 'Nikkei225Mini', 'Topix'], axis=1)\
         .rename_axis(index=['Name'], columns=['product', 'buy_sell'])
Output:
product  Nikkei225Large       Nikkei225Mini       Topix
buy_sell           sell   buy          sell   buy   sell   buy
Name
a                  10.0  20.0           NaN   NaN   10.0  70.0
b                  60.0  30.0          60.0  30.0   80.0  30.0
c                  50.0  40.0          20.0  50.0    0.0  40.0
d                   NaN   NaN          10.0  40.0    NaN   NaN
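Once df_out exists, the two column levels can be sliced directly. A minimal usage sketch (same names as above):
# All columns belonging to one product (this drops the 'product' level):
large = df_out['Nikkei225Large']
# The 'sell' column of every product, via a cross-section on the inner level:
sells = df_out.xs('sell', axis=1, level='buy_sell')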

Related

How to merge same name column from two different dataframes?

I have four different datasets and have already merged three of the dataframes correctly. The 3rd and 4th datasets share a column with the same name, and when I merge in the 4th dataset that shared column does not line up properly: user_id values get repeated, and where the del_keys column shows NaN, the actual value appears in an extra row at the end of the table instead. I don't want user_id to repeat; I want the same-named column merged on the basis of user_id, so the expected output has exactly one row per user_id.
Using merge on the user_id column:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'del_keys': [1.0, np.nan, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
final = df1.merge(df2, on="user_id", how="outer")
Use combine_first to get rid of the NaN values, then drop the duplicates:
final["del_keys"]=final['del_keys_y'].combine_first(final['del_keys_x'])
final.drop(columns=["del_keys_x","del_keys_y"],inplace=True)
final.drop_duplicates(subset="user_id")
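With the sample frames above, this should leave one row per user_id:
   user_id  del_keys
0        1       1.0
1        2       NaN
2        3       1.0
3        4       2.0
4        5       3.0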
I'm guessing that you use pd.concat to merge the dataframes.
Some dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'user_id': [1, 2, 3],
    'del_keys': [1.0, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
Merge using pd.concat:
df = pd.concat([df1, df2])
>>>
   user_id  del_keys
0        1       1.0
1        2       NaN
2        3       NaN
0        3       1.0
1        4       2.0
2        5       3.0
Remove duplicates using DataFrame.drop_duplicates:
(
    df
    .sort_values('del_keys')
    .drop_duplicates('user_id', keep='first')
    .sort_values('user_id')
)
>>>
   user_id  del_keys
0        1       1.0
1        2       NaN
0        3       1.0
1        4       2.0
2        5       3.0
First, we sort the values by del_keys so that all NaNs are at the bottom of the dataframe. Then we drop the duplicates, keeping the first occurrence for each user_id. Lastly, we sort again to restore the original order.
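Under the same assumptions, an equivalent one-liner is a groupby: GroupBy.first returns the first non-null value per group, so it keeps the non-NaN del_keys for each user_id where one exists:
df.groupby('user_id', as_index=False).first()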

Perform multiple math operations on columns in df

I want to perform the operation [(b - a) / a] * 100 on a dataframe (i.e., percentage change from a reference value), where a is the first column and b is every other column of the dataframe.
I tried the steps below and they work, but they are very messy!
df = pd.DataFrame({'obj1': [1, 3, 4],
                   'obj2': [6, 9, 10],
                   'obj3': [2, 6, 8]},
                  index=['circle', 'triangle', 'rectangle'])
# first we subtract all columns by the first column - as that is the starting point: b-a
df_aftersub = df.sub(pd.Series(df.iloc[:, [0]].squeeze()), axis='index')
# second we divide the result by the first column to get the change: (b-a)/a
df_change = df_aftersub.div(pd.Series(df.iloc[:, [0]].squeeze()), axis='index')
# third we multiply by 100 to get the percent change: (b-a)/a*100
df_final = df_change * 100
df_final
Output needed:
           obj1   obj2   obj3
circle      0.0  500.0  100.0
triangle    0.0  200.0  100.0
rectangle   0.0  150.0  100.0
How can I do it in fewer lines of code, with fewer temporary dataframes, and (if possible) in a way that is simple to understand?
First subtract the first column using DataFrame.sub and divide by it using DataFrame.div, then multiply by 100:
s = df.iloc[:, 0]
df_final = df.sub(s, axis=0).div(s, axis=0).mul(100)
print(df_final)
           obj1   obj2   obj3
circle      0.0  500.0  100.0
triangle    0.0  200.0  100.0
rectangle   0.0  150.0  100.0
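Since (b - a) / a equals b / a - 1, an equivalent sketch (same df as above) avoids the subtraction step entirely:
s = df.iloc[:, 0]
df_final = df.div(s, axis=0).sub(1).mul(100)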

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace all NaN values in a with the corresponding values of b**2, and then set b to NaN (i.e., shift the NaN values over while applying an operation to them).
Desired result:
       a     b
0    1.0   5.0
1  100.0   NaN
2    3.0  15.0
How is it possible with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that mask to update both columns with loc.
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df.loc[change, ['a', 'b']] = [df.loc[change, 'b']**2, np.nan]
print(df)
Note that the change variable is only there to avoid repeating df['a'].isnull() on both sides of the assignment. You could inline that expression to do this in one line, but I think that looks cluttered.
Result:
       a     b
0    1.0   5.0
1  100.0   NaN
2    3.0  15.0
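An alternative sketch under the same assumptions, using fillna and mask instead of a single loc assignment (starting again from the original df):
m = df['a'].isnull()                    # remember which rows had NaN in 'a'
df['a'] = df['a'].fillna(df['b'] ** 2)  # fill those rows from b**2
df['b'] = df['b'].mask(m)               # blank out 'b' on the same rows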

Pandas: Merging multiple dataframes efficiently

I have a situation where I need to merge multiple dataframes, which I can do easily using the code below:
# Merge all the datasets together
df_prep1 = df_prep.merge(df1,on='e_id',how='left')
df_prep2 = df_prep1.merge(df2,on='e_id',how='left')
df_prep3 = df_prep2.merge(df3,on='e_id',how='left')
df_prep4 = df_prep3.merge(df_4,on='e_id',how='left')
df_prep5 = df_prep4.merge(df_5,on='e_id',how='left')
df_prep6 = df_prep5.merge(df_6,on='e_id',how='left')
But what I want to understand is whether there is a more efficient way to perform this merge, perhaps using a helper function? If yes, how could I achieve that?
You can use reduce from the functools module to merge multiple dataframes:
from functools import reduce

dfs = [df_prep, df1, df2, df3, df_4, df_5, df_6]
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
You can put all your dfs into a list, or pass them in from a function, a loop, etc., and then keep one main df that you merge everything onto.
You can start with a bare df of keys and iterate through the list. In your case, since you are doing a left merge, your df_prep should already contain all of the e_id values that you want. You'll also need to decide what to do with any conflicting column names that you don't merge on, e.g. let pandas append _x and _y suffixes, or rename them. See this toy example:
main_df = pd.DataFrame({'e_id': [0, 1, 2, 3, 4]})
for x in range(3):
    dfx = pd.DataFrame({'e_id': [x], 'another_col' + str(x): [x * 10]})
    main_df = main_df.merge(dfx, on='e_id', how='left')
to get:
   e_id  another_col0  another_col1  another_col2
0     0           0.0           NaN           NaN
1     1           NaN          10.0           NaN
2     2           NaN           NaN          20.0
3     3           NaN           NaN           NaN
4     4           NaN           NaN           NaN
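If every frame is keyed on e_id, an index-based sketch is also possible (assuming the non-key column names don't overlap across frames, since DataFrame.join with a list of frames doesn't apply suffixes):
dfs = [df1, df2, df3, df_4, df_5, df_6]
out = (df_prep.set_index('e_id')
              .join([d.set_index('e_id') for d in dfs], how='left')
              .reset_index())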

Converting only specific columns in dataframe to numeric

I currently have a dataframe with n numeric-value columns and three columns that hold datetime and string values. I want to convert all but those three columns to numeric values, but am not sure of the best method. Below is a (simplified) sample dataframe:
df2 = pd.DataFrame(np.array([[1, '5-4-2016', 10], [1, '5-5-2016', 5],
                             [2, '5-4-2016', 10], [2, '5-5-2016', 7],
                             [5, '5-4-2016', 8]]),
                   columns=['ID', 'Date', 'Number'])
I tried something like the following, but was unsuccessful:
exclude = ['Date']
df = df.drop(exclude, axis=1).apply(pd.to_numeric,
                                    errors='coerce').combine_first(df)
The expected output (essentially, the dtypes of the 'ID' and 'Number' fields change to float while 'Date' stays the same):
    ID      Date  Number
0  1.0  5-4-2016    10.0
1  1.0  5-5-2016     5.0
2  2.0  5-4-2016    10.0
3  2.0  5-5-2016     7.0
4  5.0  5-4-2016     8.0
Have you tried Series.astype()?
df['ID'] = df['ID'].astype(float)
df['Number'] = df['Number'].astype(float)
or for all columns besides 'Date':
for col in [x for x in df.columns if x != 'Date']:
    df[col] = df[col].astype(float)
or, with transform (note the result needs to be assigned back):
cols = [x for x in df.columns if x != 'Date']
df[cols] = df[cols].transform(lambda x: x.astype(float), axis=1)
You need to call to_numeric with the option downcast='float' if you want the columns to change to float; otherwise they will be int. You also need to join the result back to the non-converted columns of the original df2:
df2[exclude].join(df2.drop(exclude, axis=1).apply(pd.to_numeric, downcast='float', errors='coerce'))
Out[1815]:
       Date   ID  Number
0  5-4-2016  1.0    10.0
1  5-5-2016  1.0     5.0
2  5-4-2016  2.0    10.0
3  5-5-2016  2.0     7.0
4  5-4-2016  5.0     8.0
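A compact in-place sketch under the same assumptions (convert every column except 'Date', keeping the original column order):
cols = df2.columns.difference(['Date'])
df2[cols] = df2[cols].apply(pd.to_numeric, downcast='float', errors='coerce')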