Properly map values between two dataframes - pandas

I have a dataframe df
d = {'Col1': [10,67], 'Col2': [30,10],'Col3': [70,40]}
df = pd.DataFrame(data=d)
which results in
   Col1  Col2  Col3
0    10    30    70
1    67    10    40
and df2
df2 = pd.DataFrame(data=([25, 36, 47, (0, 20)], [70, 85, 95, (20, 40)],
                         [12, 35, 49, (40, 60)], [50, 49, 21, (60, 80)],
                         [60, 75, 38, (80, 100)]),
                   columns=["Col1", "Col2", "Col3", "Range"])
which results in:
   Col1  Col2  Col3      Range
0    25    36    47    (0, 20)
1    70    85    95   (20, 40)
2    12    35    49   (40, 60)
3    50    49    21   (60, 80)
4    60    75    38  (80, 100)
Both frames are just for example purposes and may be much bigger in reality. Both frames share the same columns except for one.
I want to apply some function (x/y) between each value from df and a value in df2 from the same column. The value from df2, however, may be in a different row depending on the Range column.
For example 10 from df (Col1) falls in range (0,20) in df2 therefore I want to use 25 from Col1 (df2) and do 10/25.
30 from df (Col2) falls in range (20,40) in df2 therefore I want to take 85 from Col2 (df2) and do 30/85.
70 from df (Col3) falls in range (60,80) in df2 therefore I want to take 21 from Col3 (df2) and do 70/21.
I want to do this for each row in df.
I don't really know how to do the proper mapping; I always tend to start with some for loops, which are not very pretty, especially if both dataframes are of bigger shape. The expected output can be any array, dataframe or the like composed of the resulting numbers.

Here is one way to do it by defining a helper function:
def find_denominator_for(v):
    """Helper function.
    >>> find_denominator_for(10)
    {'Col1': 25, 'Col2': 36, 'Col3': 47}
    """
    for tup, sub_dict in df2.set_index("Range").to_dict(orient="index").items():
        if min(tup) <= v <= max(tup):
            return sub_dict

for col in df.columns:
    df[col] = df[col] / df[col].apply(lambda x: find_denominator_for(x)[col])
Then:
print(df)
# Output
   Col1      Col2      Col3
0  0.40  0.352941  3.333333
1  1.34  0.277778  0.421053
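For larger frames, a vectorized alternative sketch (my own, not part of the answer above) is to build a pd.IntervalIndex from the Range tuples and look up the matching df2 row for every value at once, run on the original df before it is overwritten by the loop above. It assumes the ranges are non-overlapping and treats them as closed on the left, so boundary values are handled slightly differently from the helper (which is inclusive on both ends), and values outside every range would need extra care (get_indexer returns -1 for them).
# sketch: assumes sorted, non-overlapping ranges interpreted as [low, high)
intervals = pd.IntervalIndex.from_tuples(df2["Range"], closed="left")
result = pd.DataFrame(
    {col: df[col].values / df2[col].values[intervals.get_indexer(df[col])]
     for col in df.columns}
)
For the example data this gives the same numbers as the helper-function approach.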

Related

Pandas groupby custom nlargest

When trying to solve my own question here, I came up with an interesting problem. Suppose I have this dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(dict(group=np.random.choice(["a", "b", "c", "d"], size=100),
                       values=np.random.randint(0, 100, size=100)))
I want to select the top values for each group, but according to some range. Let's say, the top x to y values for each group. If any group has fewer than x values in it, give top(min((y-x), x)) values for that group.
In general, I am looking for a custom-made alternative function which could be used with groupby objects to select not the top n values, but instead the top x to y range of values.
EDIT: nlargest() is a special case of the solution to my problem where x = 1 and y = n.
Any further help or guidance will be appreciated.
Adding an example with this df and top(3, 6), i.e. for every group output the values from the top 3rd until the top 6th value:
group value
a 190
b 166
a 163
a 106
b 86
a 77
b 70
b 69
c 67
b 54
b 52
a 50
c 24
a 20
a 11
As group c has just two members, it will output top(3)
group value
a 106
a 77
a 50
b 69
b 54
b 52
c 67
c 24
There are other ways of doing this and, depending on how large your dataframe is, you may want to search for groupby slicing or something similar. You may also need to check that my conditions are correct (<, <=, etc.).
x = 3
y = 6

# groups which don't meet the x minimum keep all of their rows
df1 = df[df.groupby('group')['value'].transform('count') < x]

# groups meeting the minimum: shift each group's values up by x-1 rows,
# drop the original column and take the nlargest y-x of what remains
df2 = df[df.groupby('group')['value'].transform('count') >= x].copy()
df2['shifted'] = df2.groupby('group')['value'].shift(periods=-(x - 1))
df2.drop('value', axis=1, inplace=True)
df2 = (df2.groupby('group')['shifted']
          .nlargest(y - x)
          .reset_index()
          .rename(columns={'shifted': 'value'})
          .drop('level_1', axis=1))

# putting it all together
df_final = pd.concat([df1, df2])
df_final
   group  value
8      c   67.0
12     c   24.0
0      a  106.0
1      a   77.0
2      a   50.0
3      b   70.0
4      b   69.0
5      b   54.0
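For comparison, a more compact sketch of the same idea (my own, not part of the answer above), assuming the group/value column names from the example table: sort each group descending, slice ranks x through y-1, and keep everything for groups that have fewer than x rows.
def top_range(s, x, y):
    # sort one group's values descending and take ranks x..y-1;
    # groups with fewer than x rows keep all of their values
    s = s.sort_values(ascending=False)
    return s.iloc[x - 1:y - 1] if len(s) >= x else s

result = df.groupby('group')['value'].apply(top_range, x=3, y=6)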

Change column values into rating and sum

Change the column values and sum the row according to conditions.
d = {'col1': [20, 40], 'col2': [30, 40], 'col3': [200, 300]}
df = pd.DataFrame(data=d)
   col1  col2  col3
0    20    30   200
1    40    40   300
col4 should give back the sum of the row after the values have been transferred to a rating:
col1: value between 0-20 -> 2 points, 20-40 -> 3 points
col2: value between 40-50 -> 2 points, 70-80 -> 3 points
col3: value between 0-100 -> 2 points, 100-300 -> 2 points
   col4 (Points)
0              2
1              6
Use pd.cut as follows. The values didn't add up though; happy to assist further if clarified.
Use pd.cut to bin and save the result in new columns suffixed with the name Points, then select only the columns containing the string Points and add them up.
df['col1Points'] = pd.cut(df.col1, [0, 20, 40], labels=[2, 3])
df['col2Points'] = pd.cut(df.col2, [40, 70, 80], labels=[2, 3])
df['col3Points'] = pd.cut(df.col3, [0, 100, 300], labels=[2, 3])
df['col4'] = df.filter(like='Points').sum(axis=1)
   col1  col2  col3 col1Points col2Points col3Points  col4
0    20    30   200          2        NaN          3   5.0
1    40    40   300          3        NaN          3   6.0
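As a side note on the NaN values above: pd.cut labels nothing below the first bin edge, and with the default right=True it also excludes the left edge of the first interval, so both 30 and 40 in col2 end up unlabelled. A minimal hedged tweak, assuming the first col2 band is actually meant to include 40:
# assumption: the lowest col2 bin should include its left edge (40)
df['col2Points'] = pd.cut(df.col2, [40, 70, 80], labels=[2, 3], include_lowest=True)
df['col4'] = df.filter(like='Points').sum(axis=1)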

Using pandas, how to join two tables on variable index?

There are two tables, and the entries may have different ID types. I need to join the two tables based on the id_type of df1 and the corresponding column of df2. For background, the IDs are security IDs in the financial world, and the ID type may be CUSIP, ISIN, RIC, etc.
print(df1)
   id id_type  value
0  11  type_A    0.1
1  22  type_B    0.2
2  13  type_A    0.3
print(df2)
   type_A  type_B type_C
0      11      21     xx
1      12      22     yy
2      13      23     zz
The desired output is
   type_A  type_B type_C  value
0      11      21     xx    0.1
1      12      22     yy    0.2
2      13      23     zz    0.3
Here is an alternative approach, which generalizes to many security types (CUSIP, ISIN, RIC, SEDOL, etc.).
First, create df1 and df2 along the lines of the original example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'sec_id': [11, 22, 33],
                    'sec_id_type': ['CUSIP', 'ISIN', 'RIC'],
                    'value': [100, 200, 300]})
df2 = pd.DataFrame({'CUSIP': [11, 21, 31],
                    'ISIN': [21, 22, 23],
                    'RIC': [31, 32, 33],
                    'SEDOL': [41, 42, 43]})
Second, create an intermediate data frame x1. We will use the first column for one join, and the second and third columns for a different join:
index = [idx for idx in df2.index for _ in df2.columns]
sec_id_types = df2.columns.to_list() * df2.shape[0]
sec_ids = df2.values.ravel()
data = [
    (idx, sec_id_type, sec_id)
    for idx, sec_id_type, sec_id in zip(index, sec_id_types, sec_ids)
]
x1 = pd.DataFrame.from_records(data, columns=['index', 'sec_id_type', 'sec_id'])
Join df1 and x1 to extract values from df1:
x2 = (x1.merge(df1, on=['sec_id_type', 'sec_id'], how='left')
        .dropna()
        .set_index('index'))
Finally, join df2 and x2 (from the previous step) to get the final result:
print(df2.merge(x2, left_index=True, right_index=True, how='left'))
   CUSIP  ISIN  RIC  SEDOL sec_id_type  sec_id  value
0     11    21   31     41       CUSIP      11  100.0
1     21    22   32     42        ISIN      22  200.0
2     31    23   33     43         RIC      33  300.0
The columns sec_id_type and sec_id show the joins work as expected.
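For what it's worth, a shorter melt-based sketch of the same join (my own), using the df1/df2 defined in this answer and assuming each (sec_id_type, sec_id) pair occurs at most once in df2:
# reshape df2 to long form: one row per (original row index, id type, id)
long = df2.reset_index().melt(id_vars='index',
                              var_name='sec_id_type', value_name='sec_id')
# keep only the rows matching df1 and bring value back onto df2's index
matched = (long.merge(df1, on=['sec_id_type', 'sec_id'])
               .set_index('index')['value'])
print(df2.assign(value=matched))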
NEW Solution 1: create a temporary column that determines the ID with np.where
df2['id'] = np.where(df2['type_A'] == df1['id'], df2['type_A'], df2['type_B'])
df = pd.merge(df2,df1[['id','value']],how='left',on='id').drop('id', axis=1)
NEW Solution 2: Can you simply merge on the index? If not go with solution #1.
df = pd.merge(df2, df1['value'], how ='left', left_index=True, right_index=True)
output:
   type_A  type_B type_C  value
0      11      21     xx    0.1
1      12      22     yy    0.2
2      13      23     zz    0.3
OLD Solution:
Through a combination of pd.merge, pd.melt and pd.concat, I found a solution, although I wonder if there is a shorter way (probably):
df_A_B = (pd.merge(df2[['type_A']], df2[['type_B']], how='left',
                   left_index=True, right_index=True)
            .melt(var_name='id_type', value_name='id'))
df_C = pd.concat([df2[['type_C']]] * 2).reset_index(drop=True)
df_A_B_C = pd.merge(df_A_B, df_C, how='left', left_index=True, right_index=True)
df3 = (pd.merge(df_A_B_C, df1, how='left', on=['id_type', 'id'])
         .dropna()
         .drop(['id_type', 'id'], axis=1))
df4 = pd.merge(df2, df3, how='left', on=['type_C'])
df4
output:
   type_A  type_B type_C  value
0      11      21     xx    0.1
1      12      22     yy    0.2
2      13      23     zz    0.3

Calculate row wise percentage in pandas

I have a data frame as shown below
id val1 val2 val3
a 100 60 40
b 20 18 12
c 160 140 100
For each row I want to calculate the percentage.
The expected output as shown below
id val1 val2 val3
a 50 30 20
b 40 36 24
c 40 35 25
I tried the following code:
df['sum'] = df['val1'] + df['val2'] + df['val3']
df['val1'] = df['val1'] / df['sum']
df['val2'] = df['val2'] / df['sum']
df['val3'] = df['val3'] / df['sum']
I would like to know if there is an easier or alternative way to do this in pandas.
We can do the following:
- Slice the correct columns with iloc
- Use apply with axis=1 to apply each calculation row-wise
- Use div, sum and mul to divide each value by the row's sum and multiply it by 100 to get the percentages as whole numbers rather than decimals
- Convert the floats back to int with astype
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.div(x.sum()).mul(100), axis=1).astype(int)
Output
  id  val1  val2  val3
0  a    50    30    20
1  b    40    36    24
2  c    40    35    25
Or a vectorized solution, accessing the NumPy arrays underneath our dataframe (note: this method should perform better in terms of speed):
df.iloc[:, 1:] = (df.iloc[:, 1:].values / df.iloc[:, 1:].values.sum(axis=1)[:, None] * 100).astype(int)
Or similar, but using the pandas DataFrame.div method (proposed by Jon Clements):
df.iloc[:, 1:] = df.iloc[:, 1:].div(df.iloc[:, 1:].sum(1), axis=0).mul(100)
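This last variant leaves floats behind; if whole-number percentages are wanted, as in the expected output, an astype(int) can be chained on just as in the first two solutions:
df.iloc[:, 1:] = df.iloc[:, 1:].div(df.iloc[:, 1:].sum(1), axis=0).mul(100).astype(int)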

Split string for a range of columns Pandas

How can I split the string into a list for each column of the following pandas dataframe with many columns?
col1 col2
0/1:9,12:21:99 0/1:9,12:22:99
0/1:9,12:23:99 0/1:9,15:24:99
Desired output:
col1 col2
[0/1,[9,12],21,99] [0/1,[9,12],22,99]
[0/1,[9,12],23,99] [0/1,[9,15],24,99]
I could do:
df['col1'].str.split(":", n = -1, expand = True)
df['col2'].str.split(":", n = -1, expand = True)
but I have many columns, so I was wondering if I could do it in a more automated way.
I would then like to calculate the mean of the 2nd element of each list for every row; that is, for the first row, get the mean of 21 and 22, and for the second row, get the mean of 23 and 24.
If the data is like your sample, you can make use of stack:
new_df = (df.iloc[:, 0:2]
            .stack()
            .str.split(':', expand=True))
Then new_df is double indexed:
          0     1   2   3
0 col1  0/1  9,12  21  99
  col2  0/1  9,12  22  99
1 col1  0/1  9,12  23  99
  col2  0/1  9,15  24  99
And if you want the mean of the 2nd numbers:
new_df[2].unstack(level=-1).astype(float).mean(axis=1)
gives:
0 21.5
1 23.5
dtype: float64
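If the nested-list layout shown in the question is wanted rather than expanded columns, a small sketch (my own; the parse helper and its variable names are just hypothetical placeholders, and the elements stay strings unless cast):
def parse(cell):
    # e.g. "0/1:9,12:21:99" -> ["0/1", ["9", "12"], "21", "99"]
    a, b, c, d = cell.split(':')
    return [a, b.split(','), c, d]

df_lists = df[['col1', 'col2']].apply(lambda s: s.map(parse))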