Calculate row wise percentage in pandas - pandas

I have a data frame as shown below
id val1 val2 val3
a 100 60 40
b 20 18 12
c 160 140 100
For each row I want to calculate the percentage.
The expected output as shown below
id val1 val2 val3
a 50 30 20
b 40 36 24
c 40 35 25
I tried following code
df['sum'] = df['val1'] + df['val2'] + df['val3']
df['val1'] = df['val1'] / df['sum'] * 100
df['val2'] = df['val2'] / df['sum'] * 100
df['val3'] = df['val3'] / df['sum'] * 100
I would like to know is there any easy and alternate way than this in pandas.

We can do the following:
Slice the value columns with iloc
Use apply with axis=1 to apply the calculation row-wise
Use div, sum and mul to divide each value by its row sum and multiply by 100, giving percentages as whole numbers rather than fractions
Convert the resulting floats back to int with astype
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: x.div(x.sum()).mul(100), axis=1).astype(int)
Output
id val1 val2 val3
0 a 50 30 20
1 b 40 36 24
2 c 40 35 25
Or a vectorized solution, accessing the numpy arrays underneath our dataframe.
note: this method should perform better in terms of speed
df.iloc[:, 1:] = (df.iloc[:, 1:].to_numpy() / df.iloc[:, 1:].to_numpy().sum(axis=1)[:, None] * 100).astype(int)
Or similar but using the pandas DataFrame.div method:
proposed by Jon Clements
df.iloc[:, 1:] = df.iloc[:, 1:].div(df.iloc[:, 1:].sum(1), axis=0).mul(100)
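As a sanity check, the div-based approach can be run end to end on the question's data (a sketch; `.astype(int)` is added here to match the integer output above):

```python
import pandas as pd

# Reconstructing the question's frame
df = pd.DataFrame({'id': list('abc'),
                   'val1': [100, 20, 160],
                   'val2': [60, 18, 140],
                   'val3': [40, 12, 100]})

# Divide each value column by its row sum, scale to whole-number percentages
df.iloc[:, 1:] = (df.iloc[:, 1:]
                  .div(df.iloc[:, 1:].sum(axis=1), axis=0)
                  .mul(100)
                  .astype(int))
print(df)
```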


Properly map values between two dataframes

I have dataframe df
d = {'Col1': [10,67], 'Col2': [30,10],'Col3': [70,40]}
df = pd.DataFrame(data=d)
which results in
Col1 Col2 Col3
0 10 30 70
1 67 10 40
and df2
df2 = pd.DataFrame(data=([25, 36, 47, (0, 20)], [70, 85, 95, (20, 40)],
                         [12, 35, 49, (40, 60)], [50, 49, 21, (60, 80)],
                         [60, 75, 38, (80, 100)]),
                   columns=["Col1", "Col2", "Col3", "Range"])
which results in:
Col1 Col2 Col3 Range
0 25 36 47 (0, 20)
1 70 85 95 (20, 40)
2 12 35 49 (40, 60)
3 50 49 21 (60, 80)
4 60 75 38 (80, 100)
Both frames are just for example purposes and might be much bigger in reality. Both frames have the same columns except for one.
I want to apply some function (x/y) between each value from df and a value in df2 from the same column. The value from df2, however, may be in varying rows depending on the Range column.
For example 10 from df (Col1) falls in range (0,20) in df2 therefore I want to use 25 from Col1 (df2) and do 10/25.
30 from df (Col2) falls in range (20,40) in df2 therefore I want to take 85 from Col2 (df2) and do 30/85.
70 from df (Col3) falls in range (60,80) in df2 therefore I want to take 21 from Col3 (df2) and do 70/21.
I want to do this for each row in df.
I don't really know how to do the proper mapping; I always tend to start with some for loops, which are not very pretty, especially if both dataframes are of bigger shape. The expected output can be any array, dataframe or the like composed of the resulting numbers.
Here is one way to do it by defining a helper function:
def find_denominator_for(v):
    """Helper function.
    >>> find_denominator_for(10)
    {'Col1': 25, 'Col2': 36, 'Col3': 47}
    """
    for tup, sub_dict in df2.set_index("Range").to_dict(orient="index").items():
        if min(tup) <= v <= max(tup):
            return sub_dict

for col in df.columns:
    df[col] = df[col] / df[col].apply(lambda x: find_denominator_for(x)[col])
Then:
print(df)
# Output
Col1 Col2 Col3
0 0.40 0.352941 3.333333
1 1.34 0.277778 0.421053
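A vectorized alternative avoids calling the helper once per cell. This is a sketch that assumes each Range tuple can be treated as a right-closed interval, which keeps the IntervalIndex non-overlapping and reproduces the output above for these values:

```python
import pandas as pd

d = {'Col1': [10, 67], 'Col2': [30, 10], 'Col3': [70, 40]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=([25, 36, 47, (0, 20)], [70, 85, 95, (20, 40)],
                         [12, 35, 49, (40, 60)], [50, 49, 21, (60, 80)],
                         [60, 75, 38, (80, 100)]),
                   columns=["Col1", "Col2", "Col3", "Range"])

# Turn the Range tuples into right-closed intervals, e.g. (20, 40]
idx = pd.IntervalIndex.from_tuples(df2["Range"], closed="right")
out = pd.DataFrame({
    # get_indexer finds, for each value, the row of df2 whose interval contains it
    col: df[col].to_numpy() / df2[col].to_numpy()[idx.get_indexer(df[col])]
    for col in df.columns
})
print(out)
```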

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
To this dataframe, I want to add 78 rows to make it exactly 100 rows. So I need to fill loc[22:99] with a certain value, let's say 100.
I tried something like this:
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any idea?
You can use reindex:
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]
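The reindex route can be sketched on a small stand-in frame (hypothetical sizes; assumes the default RangeIndex starting at 0):

```python
import pandas as pd

df = pd.DataFrame({'balance': [23, 22, 19, 20]})  # stand-in for the 22-row frame
target_len = 10                                   # stand-in for 100

# Existing rows 0..3 keep their values; new rows 4..9 are filled with 100
out = df.reindex(range(target_len), fill_value=100)
print(out)
```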

Pandas groupby custom nlargest

When trying to solve my own question here I came up with an interesting problem. Suppose I have this dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
np.random.seed(0)
df = pd.DataFrame(dict(group=np.random.choice(["a", "b", "c", "d"], size=100),
                       values=np.random.randint(0, 100, size=100)))
I want to select top values per each group, but I want to select the values according to some range. Let's say, top x to y values per each group. If any group has less than x values in it, give top(min((y-x), x)) values for that group.
In general, I am looking for a custom made alternative function which could be used with groupby objects to select not top n values, but instead top x to y range of values.
EDIT: nlargest() is a special case of the solution to my problem where x = 1 and y = n
Any further help or guidance will be appreciated.
Here is an example with this df and top(3, 6). For every group, output the 3rd through 6th largest values:
group value
a 190
b 166
a 163
a 106
b 86
a 77
b 70
b 69
c 67
b 54
b 52
a 50
c 24
a 20
a 11
As group c has just two members, it will output top(3)
group value
a 106
a 77
a 50
b 69
b 54
b 52
c 67
c 24
There are other ways of doing this and, depending on how large your dataframe is, you may want to search for groupby slicing or something similar. You may also need to check that my conditions are correct (<, <=, etc.):
x = 3
y = 6
# this gets the groups which don't meet the x minimum
df1 = df[df.groupby('group')['value'].transform('count') < x]
# this df takes those groups meeting the minimum and then shifts by x-1; does some cleanup and chooses nlargest
# (note: this assumes df is sorted by value in descending order, as in the example)
df2 = df[df.groupby('group')['value'].transform('count') >= x].copy()
df2['shifted'] = df2.groupby('group')['value'].shift(periods=-(x - 1))
df2.drop('value', axis=1, inplace=True)
df2 = df2.groupby('group')['shifted'].nlargest(y - x).reset_index().rename(columns={'shifted': 'value'}).drop('level_1', axis=1)
# putting it all together
df_final = pd.concat([df1, df2])
df_final
df_final
group value
8 c 67.0
12 c 24.0
0 a 106.0
1 a 77.0
2 a 50.0
3 b 70.0
4 b 69.0
5 b 54.0
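The same idea can be sketched more directly with a hypothetical helper that sorts each group and slices it positionally (the small-group fallback mirrors the answer's handling of group c):

```python
import pandas as pd

# The example data from the question
df = pd.DataFrame({'group': list('abaababbcbbacaa'),
                   'value': [190, 166, 163, 106, 86, 77, 70, 69,
                             67, 54, 52, 50, 24, 20, 11]})

def top_range(g, x=3, y=6):
    # Sort the group descending, then take positions x-1 .. y-2
    g = g.sort_values('value', ascending=False)
    if len(g) < x:                       # small groups: top min(y - x, x) values
        return g.head(min(y - x, x))
    return g.iloc[x - 1:y - 1]

out = df.groupby('group', group_keys=False).apply(top_range)
```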

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset of random values ranging between 100 and 500.
I need to create a new column 'deciles' out of it - like a ranking, with a total of 20 'deciles'. I need to assign a rank number out of 20 based on the value.
10 to 20 - is the first decile, number 1
20 to 30 - is the second decile, number 2
x = np.random.randint(100, 501, size=1000)  # column of 1000 rows with values between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map( lambda x: int(x/20) )
df.head()
Use integer division by 10:
df = pd.DataFrame({
'credit_score':[4,15,24,55,77,81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print (df)
credit_score credit_decile_rank
0 4 0
1 15 1
2 24 2
3 55 5
4 77 7
5 81 8
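Note that 20 fixed-width buckets are not deciles in the statistical sense. If equal-frequency buckets are actually wanted, pd.qcut splits on quantiles instead; this is a sketch with made-up data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'credit_score': rng.integers(100, 501, size=1000)})

# qcut makes 20 equal-frequency buckets (~50 rows each), labelled 1..20
df['credit_decile_rank'] = pd.qcut(df['credit_score'], q=20, labels=range(1, 21))
```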

Pandas sort grouby groups by arbitrary condition on its contents

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a groupby object.
EDIT:
I ended up using (almost) the solution suggested by @jezrael:
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby, otherwise the groups would be sorted lexicographically (A contains strings).
Use GroupBy.transform with 'max' to create a helper column (this also works when several groups share the same max), then sort with DataFrame.sort_values:
df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the max values are always unique, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
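On pandas >= 1.1 the helper column can be avoided entirely with the key argument of sort_values; a sketch of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaabcc'), 'B': [7, 8, 9, 100, 20, 30]})

# Sort B, but compare rows by their group's max; kind='stable' keeps
# rows within a group in their original relative order
out = df.sort_values('B',
                     key=lambda s: s.groupby(df['A']).transform('max'),
                     kind='stable')
print(out)
```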