Selecting rows based on a column value - pandas

I have a data frame something like this:
data = {'ID': [1,2,3,4,5,6,7,8,9],
        'Doc': ['Order','Order','Inv','Order','Order','Shp','Order','Order','Inv'],
        'Rep': [101,101,101,102,102,102,103,103,103]}
frame = pd.DataFrame(data)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
5 Shp 6 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Now I want to select all the rows for Reps that have a Doc of type Inv.
I want a dataframe like this:
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
All reps will have Doc type Order, so I was trying to do something like this:
frame[frame.Rep == frame.Rep[frame.Doc == 'Inv']]
but I get an error:
ValueError: Can only compare identically-labeled Series objects

You can use boolean indexing twice - first get all Rep values matching the condition, and then select all rows with isin:
a = frame.loc[frame['Doc'] == 'Inv', 'Rep']
print (a)
2 101
8 103
Name: Rep, dtype: int64
df = frame[frame['Rep'].isin(a)]
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Solution with query:
a = frame.query("Doc == 'Inv'")['Rep']
df = frame.query("Rep in @a")
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Timings:
np.random.seed(123)
N = 1000000
L = ['Order','Shp','Inv']
frame = pd.DataFrame({'Doc': np.random.choice(L, N, p=[0.49, 0.5, 0.01]),
                      'ID': np.arange(1, N+1),
                      'Rep': np.random.randint(1000, size=N)})
print (frame.head())
Doc ID Rep
0 Shp 1 95
1 Order 2 147
2 Order 3 282
3 Shp 4 82
4 Shp 5 746
In [204]: %timeit (frame.groupby('Rep').filter(lambda x: 'Inv' in x['Doc'].values))
1 loop, best of 3: 250 ms per loop
In [205]: %timeit (frame[frame['Rep'].isin(frame.loc[frame['Doc'] == 'Inv', 'Rep'])])
100 loops, best of 3: 17.3 ms per loop
In [206]: %%timeit
...: a = frame.query("Doc == 'Inv'")['Rep']
...: frame.query("Rep in @a")
...:
100 loops, best of 3: 14.5 ms per loop
EDIT:
Thank you John Galt for the nice suggestion:
df = frame.query("Rep in %s" % frame.query("Doc == 'Inv'")['Rep'].tolist())
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
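A groupby-based variant is also possible - this is only a sketch of the transform('any') idea, my own alternative rather than one of the timed solutions above. It flags every row whose Rep group contains at least one 'Inv' document, which should give the same result as the isin approach:
# True for rows whose Rep group has at least one Doc == 'Inv'
mask = frame['Doc'].eq('Inv').groupby(frame['Rep']).transform('any')
df = frame[mask]
print (df)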

import pandas as pd
frame_Filtered = frame[frame['Doc'].str.contains('Inv|Order')]
print(frame_Filtered)
The output I got:
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103

Related

How to add subtotals to groupby and rank those subtotal categories in descending order?

Sample Dataset (Note that each combination of Col_A and Col_B is unique):
import pandas as pd
d = {'Col_A': [1,2,3,4,5,6,9,9,10,11,11,12,12,12,12,12,12,13,13],
     'Col_B': ['A','K','E','E','H','A','J','A','L','A','B','A','J','C','D','E','A','J','L'],
     'Value': [180,120,35,654,789,34,567,21,235,83,234,648,654,234,873,248,45,67,94]}
df = pd.DataFrame(data=d)
The requirement is to generate a table with each Col_B's amount, Col_A's counts, and total amount per Col_A. Show the categories in Col_B in descending order by their total amount.
This is what I have so far:
df.groupby(['Col_B','Col_A']).agg(['count','sum'])
The output would look like this. However, I'd like to add subtotals for each Col_B category and rank those subtotals of the categories in descending order so that it fulfills the requirement of getting each Col_B's amount.
Thanks in advance, everyone!
The expected result is not clear to me, but is this what you are looking for?
piv = df.groupby(['Col_B', 'Col_A'])['Value'].agg(['count', 'sum']).reset_index()
tot = piv.groupby('Col_B', as_index=False).sum().assign(Col_A='Total')
cat = pd.CategoricalDtype(tot.sort_values('sum')['Col_B'], ordered=True)
out = pd.concat([piv, tot]).astype({'Col_B': cat}) \
         .sort_values('Col_B', ascending=False, kind='mergesort') \
         .set_index(['Col_B', 'Col_A'])
>>> out
count sum
Col_B Col_A
J 9 1 567
12 1 654
13 1 67
Total 3 1288
A 1 1 180
6 1 34
9 1 21
11 1 83
12 2 693
Total 6 1011
E 3 1 35
4 1 654
12 1 248
Total 3 937
D 12 1 873
Total 1 873
H 5 1 789
Total 1 789
L 10 1 235
13 1 94
Total 2 329
C 12 1 234
Total 1 234
B 11 1 234
Total 1 234
K 2 1 120
Total 1 120
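If only the ranked subtotals themselves are needed (without the per-Col_A breakdown), a minimal sketch, assuming the df defined in the question, would be:
# total Value per Col_B category, largest first
subtotals = df.groupby('Col_B')['Value'].sum().sort_values(ascending=False)
print(subtotals)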

Pandas groupby to get dataframe of unique values

If I have this simple dataframe, how do I use groupby() to get the desired summary dataframe?
Using Python 3.8
Inputs
x = [1,1,1,2,2,2,2,2,3,3,4,4,4]
y = [100,100,100,101,102,102,102,102,103,103,104,104,104]
z = [1,2,3,1,1,2,3,4,1,2,1,2,3]
df = pd.DataFrame(list(zip(x, y, z)), columns =['id', 'set', 'n'])
display(df)
Desired Output
With df.drop_duplicates
df.drop(columns="n").drop_duplicates(['id','set'])
id set
0 1 100
3 2 101
4 2 102
8 3 103
10 4 104
Groupby and explode
df.groupby('id')['set'].unique().explode()
id
1 100
2 101
2 102
3 103
4 104
You can try using .explode() and then reset the index of the result:
> df.groupby('id')['set'].unique().explode().reset_index(name='unique_value')
id unique_value
0 1 100
1 2 101
2 2 102
3 3 103
4 4 104
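If the goal is a plain two-column DataFrame with a fresh 0..n index, the drop_duplicates route shown in the question can be finished the same way - a minimal sketch:
out = df[['id', 'set']].drop_duplicates().reset_index(drop=True)
print(out)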

Groupby diff() in dates, groupby size and check the sequence of other column in pandas

I have a data frame as shown below:
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
13 4 F 10-Oct-18 500
14 4 M 10-Jan-19 200
15 4 F 10-Jun-19 600
16 2 M 29-Mar-18 100
17 2 M 29-Apr-18 100
18 2 F 29-Dec-18 500
F=Failure
M=Maintenance
Then I sorted the data based on ID and Date using the code below.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
Then I want to filter the IDs having more than one failure (F) with at least one maintenance (M) in between them.
The expected DF is shown below:
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 F 2018-06-22 600
3 1 M 2018-08-22 150
4 1 F 2019-03-22 750
5 2 F 2018-01-29 500
6 2 M 2018-03-29 100
7 2 M 2018-04-29 100
8 2 F 2018-12-29 500
10 4 F 2018-10-10 500
11 4 M 2019-01-10 200
12 4 F 2019-06-10 600
The logic used to get the above DF is as follows (let the above DF be sl9):
Select IDs which have more than 1 F and at least one M in between them.
Remove the row if, per ID, the first status is M.
Remove the row if, per ID, the last status is M.
If there are two consecutive F rows for an ID, ignore the first F row.
Then I ran the code below to calculate the duration.
sl9['Date'] = pd.to_datetime(sl9['Date'])
sl9['D'] = sl9.groupby('ID')['Date'].diff().dt.days
ID Status Date Cost D
0 1 F 2017-06-22 500 nan
1 1 M 2017-07-22 100 30.00
2 1 F 2018-06-22 600 335.00
3 1 M 2018-08-22 150 61.00
4 1 F 2019-03-22 750 212.00
5 2 F 2018-01-29 500 nan
6 2 M 2018-03-29 100 59.00
7 2 M 2018-04-29 100 31.00
8 2 F 2018-12-29 500 244.00
10 4 F 2018-10-10 500 nan
11 4 M 2019-01-10 200 92.00
12 4 F 2019-06-10 600 151.00
From the above DF, I want to create a DF as below:
ID Total_Duration No_of_F No_of_M
1 638 3 2
2 334 2 2
4 243 2 2
I tried the following code:
df1 = sl9.groupby('ID', sort=False)["D"].sum().reset_index(name ='Total_Duration')
and the output is shown below:
ID Total_Duration
0 1 638.00
1 2 334.00
2 4 243.00
The idea is to create new columns for each mask for easier debugging, because the solution is complicated:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
#removed M groups if first or last groups per ID
m1 = df['Status'].eq('M')
df['g'] = df['Status'].ne(df.groupby('ID')['Status'].shift()).cumsum()
df['f'] = df.groupby('ID')['g'].transform('first').eq(df['g']) & m1
df['l'] = df.groupby('ID')['g'].transform('last').eq(df['g']) & m1
df1 = df[~(df['f'] | df['l'])].copy()
#count number of M and F and compare by ge for >=
df1['noM'] = df1['Status'].eq('M').groupby(df1['ID']).transform('size').ge(1)
df1['noF'] = df1['Status'].eq('F').groupby(df1['ID']).transform('size').ge(2)
#get non FF values for removing duplicated FF
df1['dupF'] = ~df.groupby('ID')['Status'].shift(-1).eq(df['Status']) | df1['Status'].eq('M')
df1 = df1[df1['noM'] & df1['noF'] & df1['dupF']]
df1 = df1.drop(['g','f','l','noM','noF','dupF'], axis=1)
print (df1)
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
7 1 F 2018-06-22 600
9 1 M 2018-08-22 150
10 1 F 2019-03-22 750
6 2 F 2018-01-29 500
16 2 M 2018-03-29 100
17 2 M 2018-04-29 100
18 2 F 2018-12-29 500
13 4 F 2018-10-10 500
14 4 M 2019-01-10 200
15 4 F 2019-06-10 600
And then:
#difference of days
df1['D'] = df1.groupby('ID')['Date'].diff().dt.days
#aggregate sum
df2 = df1.groupby('ID')['D'].sum().astype(int).to_frame('Total_Duration')
#count values by crosstab
df3 = pd.crosstab(df1['ID'], df1['Status']).add_prefix('No_of_')
#join together
df4 = df2.join(df3).reset_index()
print (df4)
ID Total_Duration No_of_F No_of_M
0 1 638 3 2
1 2 334 2 2
2 4 243 2 1
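The same summary can also be built with a single groupby using named aggregation instead of a separate crosstab - just a sketch, assuming the filtered df1 from above with the D column already computed:
# flag F/M rows first, then aggregate the day differences and the counts in one pass
df4 = (df1.assign(is_F=df1['Status'].eq('F'), is_M=df1['Status'].eq('M'))
          .groupby('ID')
          .agg(Total_Duration=('D', 'sum'),
               No_of_F=('is_F', 'sum'),
               No_of_M=('is_M', 'sum'))
          .reset_index())
print (df4)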

Pandas change order of columns in pivot table

The representation of the pivot table doesn't look like what I'm looking for, to be more specific the order of the resulting columns.
I can't figure out how to change it in a proper way.
Example df:
test_df = pd.DataFrame({'name': ['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
                        'month': [1,2,3,1,2,3,1,2,3],
                        'salary': [100,100,100,110,110,110,120,120,120],
                        'status': [1,1,2,1,1,3,2,2,1]})
Code to make the pivot:
test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
Actual output:
salary status
month 1 2 3 1 2 3
name
name_1 100 100 100 1 1 2
name_2 110 110 110 1 1 3
name_3 120 120 120 2 2 1
The output I want to see:
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
You would use sort_index, indicating the axis and the level:
piv = test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
piv.sort_index(axis='columns', level='month')
# salary status salary status salary status
#month 1 1 2 2 3 3
#name
#name_1 100 1 100 1 100 2
#name_2 110 1 110 1 110 3
#name_3 120 2 120 2 120 1
Use DataFrame.sort_index with the axis=1, level=1 arguments:
(test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
.sort_index(axis=1, level=1))
[out]
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
import pandas as pd
df = pd.DataFrame({'name': ['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
                   'month': [1,2,3,1,2,3,1,2,3],
                   'salary': [100,100,100,110,110,110,120,120,120],
                   'status': [1,1,2,1,1,3,2,2,1]})
df = df.pivot_table(index='name', columns=['month'],
values=['salary', 'status']).sort_index(axis='columns', level='month')
print(df)
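If you need full control over the resulting column order rather than a plain sort, you can also build the target MultiIndex explicitly and reindex - a sketch, assuming the piv frame from the first answer:
# month-major order of the column pairs, keeping the ('salary'/'status', month) labels
cols = pd.MultiIndex.from_product([piv.columns.levels[1],   # months 1, 2, 3
                                   piv.columns.levels[0]])  # 'salary', 'status'
piv2 = piv.reindex(columns=cols.swaplevel(0, 1))
print(piv2)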

Pandas MAX formula across different grouped rows

I have a dataframe that looks like this:
Auction_id bid_price min_bid rank
123 5 3 1
123 4 3 2
124 3 2 1
124 1 2 2
I'd like to create another column that returns MAX(rank 1 min_bid, rank 2 bid_price). I don't care what appears for the rank 2 column values. I'm hoping for the result to look something like this:
Auction_id bid_price min_bid rank custom_column
123 5 3 1 4
123 4 3 2 NaN/Don't care
124 3 2 1 2
124 1 2 2 NaN/Don't care
Should I be iterating through grouped auction_ids? Can someone provide the topics one would need to be familiar with to tackle this type of problem?
First, set the index equal to the Auction_id. Then you can use loc to select the appropriate values for each Auction_id and use max on their values. Finally, reset your index to return to your initial state.
df.set_index('Auction_id', inplace=True)
df['custom_column'] = pd.concat([df.loc[df['rank'] == 1, 'min_bid'],
                                 df.loc[df['rank'] == 2, 'bid_price']],
                                axis=1).max(axis=1)
df.reset_index(inplace=True)
>>> df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 4
2 124 3 2 1 2
3 124 1 2 2 2
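If you prefer NaN in the rank-2 rows, as in the desired output, one small follow-up step (a sketch) would be to blank them out afterwards:
import numpy as np
# keep the computed max only on rank-1 rows, NaN elsewhere
df['custom_column'] = df['custom_column'].where(df['rank'] == 1, np.nan)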
Here's one crude way to do it.
Create a maxminbid() function which computes val = MAX(rank 1 min_bid, rank 2 bid_price), assigns it to grp['custom_column'], and for rank == 2 stores NaN instead.
import numpy as np

def maxminbid(grp):
    val = max(grp.loc[grp['rank'] == 1, 'min_bid'].values,
              grp.loc[grp['rank'] == 2, 'bid_price'].values)[0]
    grp['custom_column'] = val
    grp.loc[grp['rank'] == 2, 'custom_column'] = np.nan  # pd.np is removed in modern pandas
    return grp
Then apply the maxminbid function to the Auction_id grouped object:
df.groupby('Auction_id').apply(maxminbid)
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
But I suspect there must be a more elegant solution than this one.
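For what it's worth, a somewhat shorter variant of the same idea - my own sketch, not claiming it is the elegant solution - computes the per-auction max once and maps it back onto the rank-1 rows only:
# per-Auction_id max of (rank-1 min_bid, rank-2 bid_price)
m = df.groupby('Auction_id').apply(
        lambda g: max(g.loc[g['rank'] == 1, 'min_bid'].iat[0],
                      g.loc[g['rank'] == 2, 'bid_price'].iat[0]))
# assign it only where rank == 1; other rows stay NaN
df['custom_column'] = df['Auction_id'].map(m).where(df['rank'] == 1)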
Here's an approach that does some reshaping with pivot(). Starting with the original frame (df):
Auction_id bid_price min_bid rank
0 123 5 3 1
1 123 4 3 2
2 124 3 2 1
3 124 1 2 2
Then reshape your frame (df):
pv = df.pivot(index="Auction_id", columns="rank")
pv
bid_price min_bid
rank 1 2 1 2
Auction_id
123 5 4 3 3
124 3 1 2 2
Adding a column to pv that contains the max. I'm using iloc to get a slice of the pv dataframe.
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
pv
bid_price min_bid custom_column
rank 1 2 1 2
Auction_id
123 5 4 3 3 4
124 3 1 2 2 2
And then add the max to the original frame (df) by mapping to our pv frame:
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
All the steps combined:
pv = df.pivot(index="Auction_id", columns="rank")
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN