Pandas - Converting columns to percentages based on the total column's value

There is a data frame with totals and counts:
pd.DataFrame({
    'categorie': ['a', 'b', 'c'],
    'total': [100, 1000, 500],
    'x': [10, 100, 5],
    'y': [100, 1000, 500]
})
categorie  total    x     y
a            100   10   100
b           1000  100  1000
c            500    5   500
I'd like to convert the count columns into percentages based on the totals:
categorie  total  x%   y%
a            100  10  100
b           1000  10  100
c            500   1  100
The following works for a single series:
(100 * df['x'] / df['total']).round(1)
How can I apply this to all columns in the data frame?

Try the div(), mul() and astype() methods:
df[['x%', 'y%']] = df[['x', 'y']].div(df['total'], axis=0).mul(100).astype(int)
Output of df:
categorie total x y x% y%
0 a 100 10 100 10 100
1 b 1000 100 1000 10 100
2 c 500 5 500 1 100
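If you want to keep one decimal place, as in the series expression from the question, you can swap astype(int) for round(1); same idea, only the rounding differs:
df[['x%', 'y%']] = df[['x', 'y']].div(df['total'], axis=0).mul(100).round(1)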

Related

Presto running cumulative sum with limit or cap

I am trying to add a column to my table with a running sum that restarts once a limit is reached. It should look like this (for example, the limit here is 500):
  x  output
100     100
200     300
100     400
300     300
200     500
200     200
Another option would be to create some kind of id for each batch of values whose sum stays within the 500 limit:
  x  output
100       1
200       1
100       1
300       2
200       2
200       3
I am using the SUM(x) OVER (...) window function, but I can't find a way to restart the sum once it reaches the 500 limit.
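To pin down the behavior being asked for, here is a small Python sketch of the running-sum-with-reset logic and the batch ids from the tables above. This only illustrates the expected output, it is not a Presto solution, and the function names are made up for the example:

def capped_running_sum(values, cap=500):
    # Running sum that restarts with the current value whenever adding it would exceed the cap.
    out, total = [], 0
    for v in values:
        total = v if total + v > cap else total + v
        out.append(total)
    return out

def capped_batch_ids(values, cap=500):
    # Batch id that increments each time the running sum would exceed the cap.
    ids, total, batch = [], 0, 1
    for v in values:
        if total + v > cap:
            batch += 1
            total = v
        else:
            total += v
        ids.append(batch)
    return ids

print(capped_running_sum([100, 200, 100, 300, 200, 200]))  # [100, 300, 400, 300, 500, 200]
print(capped_batch_ids([100, 200, 100, 300, 200, 200]))    # [1, 1, 1, 2, 2, 3]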

Pandas DataFrame subtract values

I'm new to Python.
I have a data frame (df) which has the following structure:
ID  rate  Sequential number
a    150                  1
a    150                  1
a     50                  2
b    250                  1
c     25                  1
d     25                  1
d     40                  2
d     30                  3
The IDs are customers, the values are monthly rates, and Sequential number always increases by 1 when a customer's monthly rate changes.
I want to do the following:
For every ID, find the maximum value in the column Sequential number and take the associated value in the column rate, then find the minimum value in Sequential number and take its associated rate, and subtract the two rates.
At the end I want an additional column in my data frame with the difference of the rates. Maybe a loop could do the following:
for id in df:
    find max() in column Sequential number and get value in rate
    - min() in column Sequential number and get value in rate
    return difference
The new data frame df_new should look like this:
ID  rate  Sequential number  rate_diff
a    150                  1          0
a    150                  1          0
a     50                  2       -100
b    250                  1          0
c     25                  1          0
d     25                  1          0
d     40                  2          0
d     30                  3          5
If an ID has only one entry, the rate_diff should be 0.
I already tried a lambda function:
df['rate_diff'] = df.groupby('ID')['rate'].transform(lambda x: x - x.min())
but this returns:
ID  rate  Sequential number  rate_diff
a    150                  1        100
a    150                  1        100
a     50                  2          0
b    250                  1          0
c     25                  1          0
d     25                  1          0
d     40                  2         15
d     30                  3          5
Maybe one of you has a small workaround for this! :-)
One approach with indexing:
g = df.groupby('ID')['Sequential number']
IMAX = g.idxmax()
IMIN = g.idxmin()
df['rate_diff'] = 0
df.loc[IMAX, 'rate_diff'] = (df.loc[IMAX, 'rate'].to_numpy()
                             - df.loc[IMIN, 'rate'].to_numpy())
Another, with groupby.transform + where:
g = df.sort_values(by=['ID', 'Sequential number']).groupby('ID')
m = g['Sequential number'].idxmax()
df['rate_diff'] = (g['rate'].transform(lambda x: x.iloc[-1] - x.iloc[0])
                   .where(df.index.isin(m), 0))
Output:
ID rate Sequential number rate_diff
0 a 150 1 0
1 a 150 1 0
2 a 50 2 -100
3 b 250 1 0
4 c 25 1 0
5 d 25 1 0
6 d 40 2 0
7 d 30 3 5
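For reference, here is a minimal sketch of the sample frame as a DataFrame constructor (values taken from the tables above), so either approach can be run as-is and reproduces the rate_diff column shown in the output:

import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'c', 'd', 'd', 'd'],
    'rate': [150, 150, 50, 250, 25, 25, 40, 30],
    'Sequential number': [1, 1, 2, 1, 1, 1, 2, 3],
})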

How to iterate over rows and get the max value of all previous rows

I have this dataframe:
pd.DataFrame({'ids': ['a', 'b', 'c', 'd', 'e', 'f'],
              'id_order': [1, 2, 3, 4, 5, 6],
              'value': [1000, 500, 3000, 2000, 1000, 5000]})
What I want is to iterate over the rows and get the maximum value of all previous rows.
For example, when I iterate to id_order==2 I would get 1000 (from id_order 1).
When I move forward to id_order==5 I would get 3000 (from id_order 3).
The desired outcome should be as follows:
pd.DataFrame({'ids': ['a', 'b', 'c', 'd', 'e', 'f'],
              'id_order': [1, 2, 3, 4, 5, 6],
              'value': [1000, 500, 3000, 2000, 1000, 5000],
              'outcome': [0, 1000, 1000, 3000, 3000, 3000]})
This will be done on a big dataset so efficiency is also a factor.
I would greatly appreciate your help in this.
Thanks
You can shift the value column and take the cumulative maximum:
df["outcome"] = df.value.shift(fill_value=0).cummax()
Since shifting leaves the first entry as NaN, we fill it with 0.
>>> df
ids id_order value outcome
0 a 1 1000 0
1 b 2 500 1000
2 c 3 3000 1000
3 d 4 2000 3000
4 e 5 1000 3000
5 f 6 5000 3000
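As a side note, because the values here are all positive, taking the cumulative maximum first and shifting afterwards produces the same column; both variants are fully vectorized, which matters for the big dataset mentioned in the question:

df["outcome"] = df["value"].cummax().shift(fill_value=0)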

How to Calculate Percentages for Groups in SQL

I have a table that looks something like this
Class ID Value
A 1 300
A 2 200
A 3 500
B 1 300
B 2 300
C 1 1000
Is there a way, using SQL, to calculate the percentage share each ID has of its class?
For example, the percentages for class A would be 30% for ID 1, 20% for ID 2, and 50% for ID 3, and so on for the other classes:
Class ID Value Percentage
A 1 300 30%
A 2 200 20%
A 3 500 50%
B 1 300 50%
B 2 300 50%
C 1 1000 100%
You can use window functions (if your database, which you did not disclose, supports them):
select
    t.*,
    1.0 * value / sum(value) over(partition by class) as ratio
from mytable t
This gives you a ratio, i.e. a value between 0 and 1. I find that more useful than a percentage, but you can multiply it by 100 if you like.
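For comparison with the pandas questions above, the same per-class share can be computed with groupby.transform. This is a sketch assuming the table is loaded into a DataFrame with columns Class, ID and Value:

import pandas as pd

df = pd.DataFrame({
    'Class': ['A', 'A', 'A', 'B', 'B', 'C'],
    'ID':    [1, 2, 3, 1, 2, 1],
    'Value': [300, 200, 500, 300, 300, 1000],
})

# each row's share of its class total (0..1); multiply by 100 for a percentage
df['Ratio'] = df['Value'] / df.groupby('Class')['Value'].transform('sum')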

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset of random values ranging between 100 and 500.
I need to create a new column 'deciles' out of it, like a ranking with a total of 20 deciles; I need to assign a rank number out of 20 based on the value.
10 to 20 is the first decile, number 1
20 to 30 is the second decile, number 2
x = np.random.randint(100, 501, size=1000)  # column of 1000 rows with values ranging between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map(lambda x: int(x / 20))
df.head()
Use integer division by 10:
df = pd.DataFrame({
    'credit_score': [4, 15, 24, 55, 77, 81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print(df)
credit_score credit_decile_rank
0 4 0
1 15 1
2 24 2
3 55 5
4 77 7
5 81 8
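If the goal is really 20 equally wide buckets over the 100-500 scores (rather than width-10 bins), one option is pd.cut with labels=False. This is a sketch under the assumption that equal-width buckets over the observed range are what's wanted:

import numpy as np
import pandas as pd

df = pd.DataFrame({'credit_score': np.random.randint(100, 501, size=1000)})

# labels=False gives bucket codes 0..19; add 1 to get ranks 1..20
df['credit_decile_rank'] = pd.cut(df['credit_score'], bins=20, labels=False) + 1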