Calculate Percent of Groupby Variable to Sum Column - pandas

I'm having trouble finding a similar example to learn from in Python. I have a dataset that looks like this:
ID Capacity
A 50
A 50
A 50
B 30
B 30
B 30
C 100
C 100
C 100
I need to find the percent of each ID for the sum of the "Capacity" column. So, the answer looks like this:
ID Capacity Percent_Capacity
A 50 0.2777
A 50 0.2777
A 50 0.2777
B 30 0.1666
B 30 0.1666
B 30 0.1666
C 100 0.5555
C 100 0.5555
C 100 0.5555
Thank you - still learning python.

total=df.groupby('ID')['Capacity'].first().sum()
df['percent_capacity'] = df['Capacity']/total
df
ID Capacity percent_capacity
0 A 50 0.277778
1 A 50 0.277778
2 A 50 0.277778
3 B 30 0.166667
4 B 30 0.166667
5 B 30 0.166667
6 C 100 0.555556
7 C 100 0.555556
8 C 100 0.555556

Using drop_duplicates:
df['percent_capacity'] = df['Capacity']/df.drop_duplicates(subset='ID')['Capacity'].sum()
Output:
ID Capacity percent_capacity
0 A 50 0.277778
1 A 50 0.277778
2 A 50 0.277778
3 B 30 0.166667
4 B 30 0.166667
5 B 30 0.166667
6 C 100 0.555556
7 C 100 0.555556
8 C 100 0.555556
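If a single row per ID is enough, the same idea can be written as a small aggregation. A minimal sketch, assuming Capacity really is constant within each ID and the frame is named df as above:

# one row per ID, since Capacity repeats within each ID
per_id = df.groupby('ID', as_index=False)['Capacity'].first()
per_id['Percent_Capacity'] = per_id['Capacity'] / per_id['Capacity'].sum()
print(per_id)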

Related

Pandas pivot? pivot_table? melt? stack or unstack?

I have a dataframe that looks like this:
id Revenue Cost qty time
0 A 400 50 2 1
1 A 900 200 8 2
2 A 800 100 8 3
3 B 300 20 1 1
4 B 600 150 4 2
5 B 650 155 4 3
And I'm trying to get to this:
id Type 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4
time will always just be 1, 2, 3 repeating, so I need to transpose or pivot on just time, with a column for each of 1-3.
Here is what I have tried so far:
pd.pivot_table(df, values = ['Revenue', 'qty', 'Cost'] , index=['id'], columns='time').reset_index()
But that just makes one really wide table that puts everything side by side rather than stacked, like this:
Revenue qty Cost
1 2 3 1 2 3 1 2 3
In that situation I would need to convert Revenue, qty and Cost into rows and use 1, 2, 3 as the column names, so the id would be duplicated for each type but listed out by time 1-3.
We can still do this with stack and unstack:
df.set_index(['id','time']).stack().unstack(level=1).reset_index()
Out[24]:
time id level_1 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4
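To match the Type column in the expected output, the automatically generated level_1 column can be renamed afterwards; a minimal follow-up sketch:

out = (df.set_index(['id', 'time'])
         .stack()
         .unstack(level=1)
         .reset_index()
         .rename(columns={'level_1': 'Type'})
         .rename_axis(columns=None))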
An alternative, using melt and pivot (pandas 1.1.0 or later):
(df
 .melt(["id", "time"])
 .pivot(["id", "variable"], "time", "value")
 .reset_index()
 .rename_axis(columns=None)
)
id variable 1 2 3
0 A Cost 50 200 100
1 A Revenue 400 900 800
2 A qty 2 8 8
3 B Cost 20 150 155
4 B Revenue 300 600 650
5 B qty 1 4 4
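If I remember correctly, newer pandas releases expect keyword arguments for pivot rather than positional ones; a minimal sketch of the same chain written that way (it also runs on 1.1.0):

out = (df
       .melt(["id", "time"])
       .pivot(index=["id", "variable"], columns="time", values="value")
       .reset_index()
       .rename_axis(columns=None))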

Clean the data based on a condition - pandas

I have a data frame as shown below
ID Unit_ID Price Duration
1 A 200 2
2 B 1000 3
2 C 1000 3
2 D 1000 3
2 F 1000 3
2 G 200 1
3 A 500 2
3 B 200 2
From the above data frame, if ID, Price and Duration are the same, replace the Price with the average (the Price divided by the count of such combinations).
For example, rows 2 to 5 have the same ID, Price and Duration, so the count is 4 and the new Price = 1000/4 = 250.
Expected Output:
ID Unit_ID Price Duration
1 A 200 2
2 B 250 3
2 C 250 3
2 D 250 3
2 F 250 3
2 G 200 1
3 A 500 2
3 B 200 2
Use GroupBy.transform with 'size' to get a Series the same length as the original, filled with the group counts, then divide by it with Series.div:
df['Price'] = df['Price'].div(df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
print (df)
ID Unit_ID Price Duration
0 1 A 200.0 2
1 2 B 250.0 3
2 2 C 250.0 3
3 2 D 250.0 3
4 2 F 250.0 3
5 2 G 200.0 1
6 3 A 500.0 2
7 3 B 200.0 2
Detail:
print (df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
0 1
1 4
2 4
3 4
4 4
5 1
6 1
7 1
Name: Price, dtype: int64
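The 'size' trick divides each row by the length of its group; the same thing can be written with an explicit (but slower) lambda if that reads more clearly. A minimal sketch, assuming the frame is named df:

# divide every row's Price by the number of rows in its (ID, Price, Duration) group
df['Price'] = (df.groupby(['ID', 'Price', 'Duration'])['Price']
                 .transform(lambda s: s / len(s)))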

Replace a particular value based on a condition using groupby in pandas

I have a dataframe as shown below
ID Sector Usage Price
1 A R 20
2 A C 100
3 A R 40
4 A R 1
5 A C 200
6 A C 1
7 A C 1
8 A R 1
1 B R 40
2 B C 200
3 B R 60
4 B R 1
5 B C 400
6 B C 1
7 B C 1
8 B R 1
From the above I would like to replace Price=1 with the average Price of the same Sector and Usage combination, computed over the values other than 1.
Expected Output:
ID Sector Usage Price
1 A R 20
2 A C 100
3 A R 40
4 A R 30
5 A C 200
6 A C 150
7 A C 150
8 A R 30
1 B R 40
2 B C 200
3 B R 60
4 B R 50
5 B C 400
6 B C 300
7 B C 300
8 B R 50
For example, in row 4 (Sector=A, Usage=R, Price=1) the Price has to be replaced by the average for Sector=A and Usage=R, i.e. (20+40)/2 = 30.
The idea is to first turn the 1s into missing values with Series.mask, then use GroupBy.transform to get the per-group means used for the replacement:
import numpy as np

m = df['Price'] == 1
s = df.assign(Price=df['Price'].mask(m)).groupby(['Sector','Usage'])['Price'].transform('mean')
df['Price'] = np.where(m, s, df['Price']).astype(int)
Or:
s = df['Price'].mask(df['Price'] == 1)
mean = df.assign(Price=s).groupby(['Sector','Usage'])['Price'].transform('mean')
df['Price'] = s.fillna(mean).astype(int)
print (df)
ID Sector Usage Price
0 1 A R 20
1 2 A C 100
2 3 A R 40
3 4 A R 30
4 5 A C 200
5 6 A C 150
6 7 A C 150
7 8 A R 30
8 1 B R 40
9 2 B C 200
10 3 B R 60
11 4 B R 50
12 5 B C 400
13 6 B C 300
14 7 B C 300
15 8 B R 50
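The same idea also works by turning the placeholder 1s into NaN with Series.replace before taking the group mean; a minimal sketch, assuming the frame is named df:

import numpy as np

# treat Price == 1 as missing so it is excluded from the group mean
s = df['Price'].replace(1, np.nan)
mean = s.groupby([df['Sector'], df['Usage']]).transform('mean')
df['Price'] = s.fillna(mean).astype(int)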

Groupby diff() on dates, groupby size and check the sequence of another column in pandas

I have data frame as shown below
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
13 4 F 10-Oct-18 500
14 4 M 10-Jan-19 200
15 4 F 10-Jun-19 600
16 2 M 29-Mar-18 100
17 2 M 29-Apr-18 100
18 2 F 29-Dec-18 500
F=Failure
M=Maintenance
Then I sorted the data by ID and Date using the code below.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
Then I want to filter the IDs having more than one failure with at least one maintenance in between them.
The expected DF as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 F 2018-06-22 600
3 1 M 2018-08-22 150
4 1 F 2019-03-22 750
5 2 F 2018-01-29 500
6 2 M 2018-03-29 100
7 2 M 2018-04-29 100
8 2 F 2018-12-29 500
10 4 F 2018-10-10 500
11 4 M 2019-01-10 200
12 4 F 2019-06-10 600
The logic used to get the above DF (call it sl9) is as follows:
Select IDs having more than one F and at least one M in between them.
Remove the row if, per ID, the first status is M.
Remove the row if, per ID, the last status is M.
If there are two consecutive F-F rows for an ID, ignore the first F row.
Then I ran the code below to calculate the duration.
sl9['Date'] = pd.to_datetime(sl9['Date'])
sl9['D'] = sl9.groupby('ID')['Date'].diff().dt.days
ID Status Date Cost D
0 1 F 2017-06-22 500 nan
1 1 M 2017-07-22 100 30.00
2 1 F 2018-06-22 600 335.00
3 1 M 2018-08-22 150 61.00
4 1 F 2019-03-22 750 212.00
5 2 F 2018-01-29 500 nan
6 2 M 2018-03-29 100 59.00
7 2 M 2018-04-29 100 31.00
8 2 F 2018-12-29 500 244.00
10 4 F 2018-10-10 500 nan
11 4 M 2019-01-10 200 92.00
12 4 F 2019-06-10 600 151.00
From the above DF, I want to create a DF as below.
ID Total_Duration No_of_F No_of_M
1 638 3 2
2 334 2 2
4 243 2 2
I tried the following code:
df1 = sl9.groupby('ID', sort=False)["D"].sum().reset_index(name ='Total_Duration')
and the output is shown below:
ID Total_Duration
0 1 638.00
1 2 334.00
2 4 243.00
The idea is to create a new column for each mask for easier debugging, because the solution is complicated:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
# remove rows of M groups that are the first or last group per ID
m1 = df['Status'].eq('M')
df['g'] = df['Status'].ne(df.groupby('ID')['Status'].shift()).cumsum()
df['f'] = df.groupby('ID')['g'].transform('first').eq(df['g']) & m1
df['l'] = df.groupby('ID')['g'].transform('last').eq(df['g']) & m1
df1 = df[~(df['f'] | df['l'])].copy()
# count the number of M and F per ID and compare with ge (>=)
df1['noM'] = df1['Status'].eq('M').groupby(df1['ID']).transform('sum').ge(1)
df1['noF'] = df1['Status'].eq('F').groupby(df1['ID']).transform('sum').ge(2)
# flag rows not followed by the same status (or M rows), so the first F of a consecutive F-F pair is dropped
df1['dupF'] = ~df.groupby('ID')['Status'].shift(-1).eq(df['Status']) | df1['Status'].eq('M')
df1 = df1[df1['noM'] & df1['noF'] & df1['dupF']]
df1 = df1.drop(['g','f','l','noM','noF','dupF'], axis=1)
print (df1)
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
7 1 F 2018-06-22 600
9 1 M 2018-08-22 150
10 1 F 2019-03-22 750
6 2 F 2018-01-29 500
16 2 M 2018-03-29 100
17 2 M 2018-04-29 100
18 2 F 2018-12-29 500
13 4 F 2018-10-10 500
14 4 M 2019-01-10 200
15 4 F 2019-06-10 600
And then:
#difference of days
df1['D'] = df1.groupby('ID')['Date'].diff().dt.days
#aggregate sum
df2 = df1.groupby('ID')['D'].sum().astype(int).to_frame('Total_Duration')
#count values by crosstab
df3 = pd.crosstab(df1['ID'], df1['Status']).add_prefix('No_of_')
#join together
df4 = df2.join(df3).reset_index()
print (df4)
ID Total_Duration No_of_F No_of_M
0 1 638 3 2
1 2 334 2 2
2 4 243 2 1
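If you prefer a single aggregation instead of joining crosstab output, named aggregation gives the same numbers; a minimal sketch, assuming df1 is the filtered frame from above and already has the D column:

df4 = (df1.groupby('ID')
          .agg(Total_Duration=('D', 'sum'),
               No_of_F=('Status', lambda s: s.eq('F').sum()),
               No_of_M=('Status', lambda s: s.eq('M').sum()))
          .astype({'Total_Duration': int})
          .reset_index())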

Aggregating counts of different column subsets

I have a dataset with a tree structure and for each path in the tree, I want to compute the corresponding counts at each level. Here is a minimal reproducible example with two levels.
import numpy as np
import pandas as pd

data = pd.DataFrame()
data['level_1'] = np.random.choice(['1', '2', '3'], 100)
data['level_2'] = np.random.choice(['A', 'B', 'C'], 100)
I know I can get the counts on the last level by doing
counts = data.groupby(['level_1','level_2']).size().reset_index(name='count_2')
print(counts)
level_1 level_2 count_2
0 1 A 10
1 1 B 12
2 1 C 8
3 2 A 10
4 2 B 10
5 2 C 10
6 3 A 17
7 3 B 12
8 3 C 11
What I would like to have is a dataframe with one row for each possible path in the tree with the counts at each level in that path. For the example above, it would be something like
level_1 level_2 count_1 count_2
0 1 A 30 10
1 1 B 30 12
2 1 C 30 8
3 2 A 30 10
4 2 B 30 10
5 2 C 30 10
6 3 A 40 17
7 3 B 40 12
8 3 C 40 11
This is an example with only two levels, which is easy to solve, but I would like to have a way to get those counts for an arbitrary number of levels.
This can be done with transform:
counts['count_1'] = counts.groupby(['level_1'])['count_2'].transform('sum')
counts
Out[445]:
level_1 level_2 count_2 count_1
0 1 A 7 30
1 1 B 13 30
2 1 C 10 30
3 2 A 7 30
4 2 B 7 30
5 2 C 16 30
6 3 A 9 40
7 3 B 10 40
8 3 C 21 40
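To answer the arbitrary-number-of-levels part, the same transform can be applied once per prefix of the path; a sketch under the assumption that the level columns are named level_1, level_2, ... as above:

levels = ['level_1', 'level_2']  # extend this list for deeper trees
counts = data.groupby(levels).size().reset_index(name=f'count_{len(levels)}')
# the count at level i is the sum of the leaf counts over the first i level columns
for i in range(1, len(levels)):
    counts[f'count_{i}'] = (counts.groupby(levels[:i])[f'count_{len(levels)}']
                                  .transform('sum'))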
You can do it directly from your original data:
groups = data.groupby('level_1').level_2
pd.merge(groups.value_counts(),
         groups.size(),
         left_index=True,
         right_index=True)
which gives:
level_2_x level_2_y
level_1 level_2
1 A 14 39
B 14 39
C 11 39
2 C 13 34
A 12 34
B 9 34
3 B 12 27
C 9 27
A 6 27
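The merged columns keep pandas' default _x/_y suffixes; assuming they come out as level_2_x and level_2_y as in the output above, a minimal follow-up sketch to make the result readable:

result = (pd.merge(groups.value_counts(), groups.size(),
                   left_index=True, right_index=True)
            .rename(columns={'level_2_x': 'count_2', 'level_2_y': 'count_1'})
            .reset_index())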