Selecting rows based on a column value - pandas

I have a data frame something like this:
data = {'ID': [1,2,3,4,5,6,7,8,9],
        'Doc': ['Order','Order','Inv','Order','Order','Shp','Order','Order','Inv'],
        'Rep': [101,101,101,102,102,102,103,103,103]}
frame = pd.DataFrame(data)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
5 Shp 6 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Now I want to select all the rows for Reps that have a Doc of type Inv.
I want a dataframe like this:
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
All reps will have Doc type Order, so I was trying to do something like this:
frame[frame.Rep == frame.Rep[frame.Doc == 'Inv']]
but I get an error:
ValueError: Can only compare identically-labeled Series objects

You can use boolean indexing twice - first get all Rep values matching the condition, and then select all rows with isin:
a = frame.loc[frame['Doc'] == 'Inv', 'Rep']
print (a)
2 101
8 103
Name: Rep, dtype: int64
df = frame[frame['Rep'].isin(a)]
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Solution with query:
a = frame.query("Doc == 'Inv'")['Rep']
df = frame.query("Rep in @a")
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Timings:
np.random.seed(123)
N = 1000000
L = ['Order','Shp','Inv']
frame = pd.DataFrame({'Doc': np.random.choice(L, N, p=[0.49, 0.5, 0.01]),
                      'ID': np.arange(1, N+1),
                      'Rep': np.random.randint(1000, size=N)})
print (frame.head())
Doc ID Rep
0 Shp 1 95
1 Order 2 147
2 Order 3 282
3 Shp 4 82
4 Shp 5 746
In [204]: %timeit (frame.groupby('Rep').filter(lambda x: 'Inv' in x['Doc'].values))
1 loop, best of 3: 250 ms per loop
In [205]: %timeit (frame[frame['Rep'].isin(frame.loc[frame['Doc'] == 'Inv', 'Rep'])])
100 loops, best of 3: 17.3 ms per loop
In [206]: %%timeit
...: a = frame.query("Doc == 'Inv'")['Rep']
...: frame.query("Rep in @a")
...:
100 loops, best of 3: 14.5 ms per loop
EDIT:
Thank you John Galt for the nice suggestion:
df = frame.query("Rep in %s" % frame.query("Doc == 'Inv'")['Rep'].tolist())
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
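A groupby-based variant is also possible - this is only a sketch of the transform('any') idea, my own alternative rather than one of the timed solutions above. It flags every row whose Rep group contains at least one 'Inv' document, which should give the same result as the isin approach:
# True for rows whose Rep group has at least one Doc == 'Inv'
mask = frame['Doc'].eq('Inv').groupby(frame['Rep']).transform('any')
df = frame[mask]
print (df)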

import pandas as pd
frame_Filtered = frame[frame['Doc'].str.contains('Inv|Order')]
print(frame_Filtered)
The output I got:
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103

Related

How to add subtotals to groupby and rank those subtotal categories in descending order?

Sample Dataset (Note that each combination of Col_A and Col_B is unique):
import pandas as pd
d = {'Col_A': [1,2,3,4,5,6,9,9,10,11,11,12,12,12,12,12,12,13,13],
     'Col_B': ['A','K','E','E','H','A','J','A','L','A','B','A','J','C','D','E','A','J','L'],
     'Value': [180,120,35,654,789,34,567,21,235,83,234,648,654,234,873,248,45,67,94]}
df = pd.DataFrame(data=d)
The requirement is to generate a table with each Col_B's amount, Col_A's counts, and total amount per Col_A. Show the categories in Col_B in descending order by their total amount.
This is what I have so far:
df.groupby(['Col_B','Col_A']).agg(['count','sum'])
The output would look like this. However, I'd like to add subtotals for each Col_B category and rank those subtotals of the categories in descending order so that it fulfills the requirement of getting each Col_B's amount.
Thanks in advance, everyone!
The expected result is not clear to me, but is this what you are looking for?
piv = df.groupby(['Col_B', 'Col_A'])['Value'].agg(['count', 'sum']).reset_index()
tot = piv.groupby('Col_B', as_index=False).sum().assign(Col_A='Total')
cat = pd.CategoricalDtype(tot.sort_values('sum')['Col_B'], ordered=True)
out = pd.concat([piv, tot]).astype({'Col_B': cat}) \
         .sort_values('Col_B', ascending=False, kind='mergesort') \
         .set_index(['Col_B', 'Col_A'])
>>> out
count sum
Col_B Col_A
J 9 1 567
12 1 654
13 1 67
Total 3 1288
A 1 1 180
6 1 34
9 1 21
11 1 83
12 2 693
Total 6 1011
E 3 1 35
4 1 654
12 1 248
Total 3 937
D 12 1 873
Total 1 873
H 5 1 789
Total 1 789
L 10 1 235
13 1 94
Total 2 329
C 12 1 234
Total 1 234
B 11 1 234
Total 1 234
K 2 1 120
Total 1 120
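If only the ranked subtotals themselves are needed (without the per-Col_A breakdown), a minimal sketch, assuming the df defined in the question, would be:
# total Value per Col_B category, largest first
subtotals = df.groupby('Col_B')['Value'].sum().sort_values(ascending=False)
print(subtotals)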

Pandas groupby to get dataframe of unique values

If I have this simple dataframe, how do I use groupby() to get the desired summary dataframe?
Using Python 3.8
Inputs
x = [1,1,1,2,2,2,2,2,3,3,4,4,4]
y = [100,100,100,101,102,102,102,102,103,103,104,104,104]
z = [1,2,3,1,1,2,3,4,1,2,1,2,3]
df = pd.DataFrame(list(zip(x, y, z)), columns =['id', 'set', 'n'])
display(df)
Desired Output
With df.drop_duplicates
df.drop(columns="n").drop_duplicates(['id','set'])
id set
0 1 100
3 2 101
4 2 102
8 3 103
10 4 104
Groupby and explode
df.groupby('id')['set'].unique().explode()
id
1 100
2 101
2 102
3 103
4 104
You can try using .explode() and then reset the index of the result:
> df.groupby('id')['set'].unique().explode().reset_index(name='unique_value')
id unique_value
0 1 100
1 2 101
2 2 102
3 3 103
4 4 104
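If the goal is a plain two-column DataFrame with a fresh 0..n index, the drop_duplicates route shown in the question can be finished the same way - a minimal sketch:
out = df[['id', 'set']].drop_duplicates().reset_index(drop=True)
print(out)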

Groupby diff() in dates, groupby size and check the sequence of other column in pandas

I have a data frame as shown below:
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
13 4 F 10-Oct-18 500
14 4 M 10-Jan-19 200
15 4 F 10-Jun-19 600
16 2 M 29-Mar-18 100
17 2 M 29-Apr-18 100
18 2 F 29-Dec-18 500
F=Failure
M=Maintenance
Then I sorted the data based on ID and Date using the code below.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
Then I want to filter the IDs having more than one failure (F) with at least one maintenance (M) in between them.
The expected DF is shown below:
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 F 2018-06-22 600
3 1 M 2018-08-22 150
4 1 F 2019-03-22 750
5 2 F 2018-01-29 500
6 2 M 2018-03-29 100
7 2 M 2018-04-29 100
8 2 F 2018-12-29 500
10 4 F 2018-10-10 500
11 4 M 2019-01-10 200
12 4 F 2019-06-10 600
The logic used to get the above DF is as follows (let the above DF be sl9):
Select IDs which have more than 1 F and at least one M in between them.
Remove the row if, per ID, the first status is M.
Remove the row if, per ID, the last status is M.
If there are two consecutive F rows for an ID, ignore the first F row.
Then I ran the code below to calculate the duration.
sl9['Date'] = pd.to_datetime(sl9['Date'])
sl9['D'] = sl9.groupby('ID')['Date'].diff().dt.days
ID Status Date Cost D
0 1 F 2017-06-22 500 nan
1 1 M 2017-07-22 100 30.00
2 1 F 2018-06-22 600 335.00
3 1 M 2018-08-22 150 61.00
4 1 F 2019-03-22 750 212.00
5 2 F 2018-01-29 500 nan
6 2 M 2018-03-29 100 59.00
7 2 M 2018-04-29 100 31.00
8 2 F 2018-12-29 500 244.00
10 4 F 2018-10-10 500 nan
11 4 M 2019-01-10 200 92.00
12 4 F 2019-06-10 600 151.00
From the above DF, I want to create a DF as below:
ID Total_Duration No_of_F No_of_M
1 638 3 2
2 334 2 2
4 243 2 2
I tried the following code:
df1 = sl9.groupby('ID', sort=False)["D"].sum().reset_index(name ='Total_Duration')
and the output is shown below:
ID Total_Duration
0 1 638.00
1 2 334.00
2 4 243.00
The idea is to create new columns for each mask for easier debugging, because the solution is complicated:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
#removed M groups if first or last groups per ID
m1 = df['Status'].eq('M')
df['g'] = df['Status'].ne(df.groupby('ID')['Status'].shift()).cumsum()
df['f'] = df.groupby('ID')['g'].transform('first').eq(df['g']) & m1
df['l'] = df.groupby('ID')['g'].transform('last').eq(df['g']) & m1
df1 = df[~(df['f'] | df['l'])].copy()
#count number of M and F and compare by ge for >=
df1['noM'] = df1['Status'].eq('M').groupby(df1['ID']).transform('size').ge(1)
df1['noF'] = df1['Status'].eq('F').groupby(df1['ID']).transform('size').ge(2)
#get non FF values for removing duplicated FF
df1['dupF'] = ~df.groupby('ID')['Status'].shift(-1).eq(df['Status']) | df1['Status'].eq('M')
df1 = df1[df1['noM'] & df1['noF'] & df1['dupF']]
df1 = df1.drop(['g','f','l','noM','noF','dupF'], axis=1)
print (df1)
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
7 1 F 2018-06-22 600
9 1 M 2018-08-22 150
10 1 F 2019-03-22 750
6 2 F 2018-01-29 500
16 2 M 2018-03-29 100
17 2 M 2018-04-29 100
18 2 F 2018-12-29 500
13 4 F 2018-10-10 500
14 4 M 2019-01-10 200
15 4 F 2019-06-10 600
And then:
#difference of days
df1['D'] = df1.groupby('ID')['Date'].diff().dt.days
#aggregate sum
df2 = df1.groupby('ID')['D'].sum().astype(int).to_frame('Total_Duration')
#count values by crosstab
df3 = pd.crosstab(df1['ID'], df1['Status']).add_prefix('No_of_')
#join together
df4 = df2.join(df3).reset_index()
print (df4)
ID Total_Duration No_of_F No_of_M
0 1 638 3 2
1 2 334 2 2
2 4 243 2 1
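The same summary can also be built with a single groupby using named aggregation instead of a separate crosstab - just a sketch, assuming the filtered df1 from above with the D column already computed:
# flag F/M rows first, then aggregate the day differences and the counts in one pass
df4 = (df1.assign(is_F=df1['Status'].eq('F'), is_M=df1['Status'].eq('M'))
          .groupby('ID')
          .agg(Total_Duration=('D', 'sum'),
               No_of_F=('is_F', 'sum'),
               No_of_M=('is_M', 'sum'))
          .reset_index())
print (df4)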

Pandas change order of columns in pivot table

The representation of the pivot table doesn't look like what I'm looking for, to be more specific the order of the resulting columns.
I can't figure out how to change it in a proper way.
Example df:
test_df = pd.DataFrame({'name': ['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
                        'month': [1,2,3,1,2,3,1,2,3],
                        'salary': [100,100,100,110,110,110,120,120,120],
                        'status': [1,1,2,1,1,3,2,2,1]})
Code to make the pivot:
test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
Actual output:
salary status
month 1 2 3 1 2 3
name
name_1 100 100 100 1 1 2
name_2 110 110 110 1 1 3
name_3 120 120 120 2 2 1
The output I want to see:
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
You would use sort_index, indicating the axis and the level:
piv = test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
piv.sort_index(axis='columns', level='month')
# salary status salary status salary status
#month 1 1 2 2 3 3
#name
#name_1 100 1 100 1 100 2
#name_2 110 1 110 1 110 3
#name_3 120 2 120 2 120 1
Use DataFrame.sort_index with the axis=1, level=1 arguments:
(test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
.sort_index(axis=1, level=1))
[out]
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
import pandas as pd
df = pd.DataFrame({'name': ['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
                   'month': [1,2,3,1,2,3,1,2,3],
                   'salary': [100,100,100,110,110,110,120,120,120],
                   'status': [1,1,2,1,1,3,2,2,1]})
df = df.pivot_table(index='name', columns=['month'],
values=['salary', 'status']).sort_index(axis='columns', level='month')
print(df)
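If you need full control over the resulting column order rather than a plain sort, you can also build the target MultiIndex explicitly and reindex - a sketch, assuming the piv frame from the first answer:
# month-major order of the column pairs, keeping the ('salary'/'status', month) labels
cols = pd.MultiIndex.from_product([piv.columns.levels[1],   # months 1, 2, 3
                                   piv.columns.levels[0]])  # 'salary', 'status'
piv2 = piv.reindex(columns=cols.swaplevel(0, 1))
print(piv2)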

Pandas MAX formula across different grouped rows

I have a dataframe that looks like this:
Auction_id bid_price min_bid rank
123 5 3 1
123 4 3 2
124 3 2 1
124 1 2 2
I'd like to create another column that returns MAX(rank 1 min_bid, rank 2 bid_price). I don't care what appears for the rank 2 column values. I'm hoping for the result to look something like this:
Auction_id bid_price min_bid rank custom_column
123 5 3 1 4
123 4 3 2 NaN/Don't care
124 3 2 1 2
124 1 2 2 NaN/Don't care
Should I be iterating through grouped auction_ids? Can someone provide the topics one would need to be familiar with to tackle this type of problem?
First, set the index equal to the Auction_id. Then you can use loc to select the appropriate values for each Auction_id and use max on their values. Finally, reset your index to return to your initial state.
df.set_index('Auction_id', inplace=True)
df['custom_column'] = pd.concat([df.loc[df['rank'] == 1, 'min_bid'],
                                 df.loc[df['rank'] == 2, 'bid_price']],
                                axis=1).max(axis=1)
df.reset_index(inplace=True)
>>> df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 4
2 124 3 2 1 2
3 124 1 2 2 2
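If you prefer NaN in the rank-2 rows, as in the desired output, one small follow-up step (a sketch) would be to blank them out afterwards:
import numpy as np
# keep the computed max only on rank-1 rows, NaN elsewhere
df['custom_column'] = df['custom_column'].where(df['rank'] == 1, np.nan)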
Here's one crude way to do it.
Create a maxminbid() function which computes val = MAX(rank 1 min_bid, rank 2 bid_price), assigns it to grp['custom_column'], and for rank == 2 stores NaN instead.
import numpy as np

def maxminbid(grp):
    val = max(grp.loc[grp['rank'] == 1, 'min_bid'].values,
              grp.loc[grp['rank'] == 2, 'bid_price'].values)[0]
    grp['custom_column'] = val
    grp.loc[grp['rank'] == 2, 'custom_column'] = np.nan  # pd.np is removed in modern pandas
    return grp
Then apply the maxminbid function to the Auction_id grouped object:
df.groupby('Auction_id').apply(maxminbid)
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
But I suspect there must be a more elegant solution than this one.
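For what it's worth, a somewhat shorter variant of the same idea - my own sketch, not claiming it is the elegant solution - computes the per-auction max once and maps it back onto the rank-1 rows only:
# per-Auction_id max of (rank-1 min_bid, rank-2 bid_price)
m = df.groupby('Auction_id').apply(
        lambda g: max(g.loc[g['rank'] == 1, 'min_bid'].iat[0],
                      g.loc[g['rank'] == 2, 'bid_price'].iat[0]))
# assign it only where rank == 1; other rows stay NaN
df['custom_column'] = df['Auction_id'].map(m).where(df['rank'] == 1)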
Here's an approach that does some reshaping with pivot(). Starting with the original frame (df):
Auction_id bid_price min_bid rank
0 123 5 3 1
1 123 4 3 2
2 124 3 2 1
3 124 1 2 2
Then reshape your frame (df):
pv = df.pivot(index="Auction_id", columns="rank")
pv
bid_price min_bid
rank 1 2 1 2
Auction_id
123 5 4 3 3
124 3 1 2 2
Adding a column to pv that contains the max. I'm using iloc to get a slice of the pv dataframe.
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
pv
bid_price min_bid custom_column
rank 1 2 1 2
Auction_id
123 5 4 3 3 4
124 3 1 2 2 2
And then add the max to the original frame (df) by mapping to our pv frame:
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
All the steps combined:
pv = df.pivot(index="Auction_id", columns="rank")
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN