Pandas groupby to get dataframe of unique values

If I have this simple dataframe, how do I use groupby() to get the desired summary dataframe?
Using Python 3.8
Inputs
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4]
y = [100, 100, 100, 101, 102, 102, 102, 102, 103, 103, 104, 104, 104]
z = [1, 2, 3, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3]
df = pd.DataFrame(list(zip(x, y, z)), columns=['id', 'set', 'n'])
display(df)
Desired Output

With df.drop_duplicates
df.drop(columns="n").drop_duplicates(['id','set'])
id set
0 1 100
3 2 101
4 2 102
8 3 103
10 4 104

Groupby and explode
df.groupby('id')['set'].unique().explode()
id
1 100
2 101
2 102
3 103
4 104

You can try using .explode() and then reset the index of the result:
> df.groupby('id')['set'].unique().explode().reset_index(name='unique_value')
id unique_value
0 1 100
1 2 101
2 2 102
3 3 103
4 4 104
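Both answers can be bundled into one self-contained script to confirm they agree (the `display` call in the question assumes a Jupyter notebook, so plain `print` is used here):

```python
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4]
y = [100, 100, 100, 101, 102, 102, 102, 102, 103, 103, 104, 104, 104]
z = [1, 2, 3, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3]
df = pd.DataFrame(list(zip(x, y, z)), columns=['id', 'set', 'n'])

# Approach 1: drop the extra column, then drop duplicate (id, set) pairs
pairs_a = df.drop(columns='n').drop_duplicates(['id', 'set']).reset_index(drop=True)

# Approach 2: unique sets per id, exploded back to one row per (id, set) pair
pairs_b = (df.groupby('id')['set'].unique()
             .explode()
             .reset_index(name='set')
             .astype({'set': int}))  # explode yields object dtype, so cast back

print(pairs_a)
```

The cast back to int matters if you want the exploded result to compare equal to the drop_duplicates result, since `unique()` + `explode()` leaves an object-dtype column.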

Related

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1, 2],
                   [1, 4],
                   [1, 5],
                   [2, 65],
                   [2, 34],
                   [2, 23],
                   [2, 45]], columns=['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorizes the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use scipy's implementation of winsorize:
from scipy.stats.mstats import winsorize

df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01, 0.01]))
Output
>>> df
label score score_winsor
0 1 2 2
1 1 4 4
2 1 5 5
3 2 65 65
4 2 34 34
5 2 23 23
6 2 45 45
This works:
import numpy as np

df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
label score score_winsor
0 1 2 2.04
1 1 4 4.00
2 1 5 4.98
3 2 65 64.40
4 2 34 34.00
5 2 23 23.33
6 2 45 45.00
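The quantile-based transform is really a per-group clip, so pandas' `Series.clip` expresses the same thing directly; a minimal sketch reproducing the output above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [1, 5],
                   [2, 65], [2, 34], [2, 23], [2, 45]],
                  columns=['label', 'score'])

# Clip each group's scores to that group's own 1st/99th percentiles
df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda s: s.clip(lower=s.quantile(0.01), upper=s.quantile(0.99)))

print(df)
```

Unlike scipy's winsorize, which replaces whole order statistics (and so changes nothing on these tiny groups at the 1% level), the quantile clip interpolates, which is why values like 2.04 and 64.40 appear.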

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
    return min(r)

tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r here a list of the group's values, or a single element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
Or, using cumcount after sorting by price:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2
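Both answers can be checked side by side; `rank()` returns floats by default, so the sketch below casts to int (and uses `method='first'` to break any ties by position, which mimics `cumcount`):

```python
import pandas as pd

df = pd.DataFrame({'srch_id': [1, 1, 1, 3, 3, 3],
                   'price': [30, 20, 25, 15, 102, 39]})

# rank(): position of each price within its srch_id group
df['price_position'] = (df.groupby('srch_id')['price']
                          .rank(method='first')
                          .astype(int))

# cumcount() after sorting gives the same thing (0-based, hence the +1)
alt = (df.sort_values('price')
         .groupby('srch_id').price.cumcount()
         .add(1)
         .sort_index())

print(df)
```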

Selecting rows based on a column value

I have a data frame something like this
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
        'Doc': ['Order', 'Order', 'Inv', 'Order', 'Order', 'Shp', 'Order', 'Order', 'Inv'],
        'Rep': [101, 101, 101, 102, 102, 102, 103, 103, 103]}
frame = pd.DataFrame(data)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
5 Shp 6 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Now I want to select all rows for every Rep that has at least one Doc of type Inv.
I want a dataframe as
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
All reps will have Doc type Orders so I was trying to do something like this
frame[frame.Rep == frame.Rep[frame.Doc == 'Inv']]
but I get an error
ValueError: Can only compare identically-labeled Series objects
You can use twice boolean indexing - first get all Rep by condition and then all rows by isin:
a = frame.loc[frame['Doc'] == 'Inv', 'Rep']
print (a)
2 101
8 103
Name: Rep, dtype: int64
df = frame[frame['Rep'].isin(a)]
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Solution with query:
a = frame.query("Doc == 'Inv'")['Rep']
df = frame.query("Rep in @a")
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Timings:
np.random.seed(123)
N = 1000000
L = ['Order','Shp','Inv']
frame = pd.DataFrame({'Doc': np.random.choice(L, N, p=[0.49, 0.5, 0.01]),
'ID':np.arange(1,N+1),
'Rep':np.random.randint(1000, size=N)})
print (frame.head())
Doc ID Rep
0 Shp 1 95
1 Order 2 147
2 Order 3 282
3 Shp 4 82
4 Shp 5 746
In [204]: %timeit (frame.groupby('Rep').filter(lambda x: 'Inv' in x['Doc'].values))
1 loop, best of 3: 250 ms per loop
In [205]: %timeit (frame[frame['Rep'].isin(frame.loc[frame['Doc'] == 'Inv', 'Rep'])])
100 loops, best of 3: 17.3 ms per loop
In [206]: %%timeit
...: a = frame.query("Doc == 'Inv'")['Rep']
...: frame.query("Rep in @a")
...:
100 loops, best of 3: 14.5 ms per loop
EDIT:
Thank you John Galt for nice suggestion:
df = frame.query("Rep in %s" % frame.query("Doc == 'Inv'")['Rep'].tolist())
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
import pandas as pd
frame_Filtered=frame[frame['Doc'].str.contains('Inv|Order')]
print(frame_Filtered)
Output I got
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103
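Note that the `str.contains` filter above matches each row's Doc value on its own, so it also keeps Rep 102's Order rows and does not reproduce the desired output; the `isin`-based answer does. A self-contained check:

```python
import pandas as pd

frame = pd.DataFrame({'ID': range(1, 10),
                      'Doc': ['Order', 'Order', 'Inv', 'Order', 'Order',
                              'Shp', 'Order', 'Order', 'Inv'],
                      'Rep': [101, 101, 101, 102, 102, 102, 103, 103, 103]})

# Reps that have at least one 'Inv' row
inv_reps = frame.loc[frame['Doc'] == 'Inv', 'Rep']

# Keep every row belonging to those reps (Rep 102 has no Inv, so it drops out)
result = frame[frame['Rep'].isin(inv_reps)]
print(result)
```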

Rank multiple columns in pandas

I have a dataset whose values I want to replace by their rank order. The second column contains the same numbers as the first column, but in a different order.
here's an example:
>>> df
ind u v d
0 5 7 151
1 7 20 151
2 8 40 151
3 20 5 151
This should turn into:
>>>df
ind u v d
0 1 2 151
1 2 4 151
2 3 5 151
3 4 1 151
I reindexed the values in column 'u' by creating a new column:
>>>df['new_index'] = range(1, len(numbers) + 1)
but how do I now replace values of the second column referring to the indexes?
Thanks for any advice!
You can use Series.rank, but first you need to create one Series with unstack, and at the end rebuild the DataFrame with unstack again:
df[['u','v']] = df[['u','v']].unstack().rank(method='dense').astype(int).unstack(0)
print (df)
u v d
ind
0 1 2 151
1 2 4 151
2 3 5 151
3 4 1 151
If you use only DataFrame.rank, the output in v is different:
df[['u','v']] = df[['u','v']].rank(method='dense').astype(int)
print (df)
u v d
ind
0 1 2 151
1 2 3 151
2 3 4 151
3 4 1 151
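A runnable sketch of the unstack/rank/unstack round trip; the key point is that stacking `u` and `v` into one Series makes both columns share a single dense ranking:

```python
import pandas as pd

df = pd.DataFrame({'u': [5, 7, 8, 20],
                   'v': [7, 20, 40, 5],
                   'd': [151, 151, 151, 151]})

# Stack u and v into one long Series, dense-rank the pooled values,
# then unstack level 0 (the column names) back into two columns
df[['u', 'v']] = (df[['u', 'v']].unstack()
                                .rank(method='dense')
                                .astype(int)
                                .unstack(0))
print(df)
```

Because the pooled values {5, 7, 8, 20, 40} are ranked together, 7 maps to 2 in both columns, which is exactly what per-column ranking fails to do.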

Pandas MAX formula across different grouped rows

I have dataframe that looks like this:
Auction_id bid_price min_bid rank
123 5 3 1
123 4 3 2
124 3 2 1
124 1 2 2
I'd like to create another column that returns MAX(rank 1 min_bid, rank 2 bid_price). I don't care what appears for the rank 2 column values. I'm hoping for the result to look something like this:
Auction_id bid_price min_bid rank custom_column
123 5 3 1 4
123 4 3 2 NaN/Don't care
124 3 2 1 2
124 1 2 2 NaN/Don't care
Should I be iterating through grouped auction_ids? Can someone provide the topics one would need to be familiar with to tackle this type of problem?
First, set the index equal to the Auction_id. Then you can use loc to select the appropriate values for each Auction_id and use max on their values. Finally, reset your index to return to your initial state.
df.set_index('Auction_id', inplace=True)
df['custom_column'] = pd.concat([df.loc[df['rank'] == 1, 'min_bid'],
df.loc[df['rank'] == 2, 'bid_price']],
axis=1).max(axis=1)
df.reset_index(inplace=True)
>>> df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 4
2 124 3 2 1 2
3 124 1 2 2 2
Here's one crude way to do it.
Create a maxminbid() function that computes val = MAX(rank 1 min_bid, rank 2 bid_price), assigns it to grp['custom_column'], and then overwrites the rank 2 rows with NaN:
import numpy as np

def maxminbid(grp):
    val = max(grp.loc[grp['rank'] == 1, 'min_bid'].values,
              grp.loc[grp['rank'] == 2, 'bid_price'].values)[0]
    grp['custom_column'] = val
    grp.loc[grp['rank'] == 2, 'custom_column'] = np.nan
    return grp
Then apply maxminbid function on Auction_id grouped objects
df.groupby('Auction_id').apply(maxminbid)
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
But, I suspect, there must be some elegant solution than this one.
Here's an approach that does some reshaping with pivot()
Auction_id bid_price min_bid rank
0 123 5 3 1
1 123 4 3 2
2 124 3 2 1
3 124 1 2 2
Then reshape your frame (df)
pv = df.pivot(index="Auction_id", columns="rank")
pv
bid_price min_bid
rank 1 2 1 2
Auction_id
123 5 4 3 3
124 3 1 2 2
Adding a column to pv that contains the max. I'm using iloc to get a slice of the pv dataframe.
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
pv
bid_price min_bid custom_column
rank 1 2 1 2
Auction_id
123 5 4 3 3 4
124 3 1 2 2 2
and then add the max to the original frame (df) by mapping to our pv frame
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
all the steps combined
pv = df.pivot(index="Auction_id", columns="rank")
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
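For reference, here is the pivot-based approach as a self-contained script. Modern pandas requires keyword arguments for `pivot`, and selecting the two columns by their (value, rank) names is clearer than the `iloc` positions used above:

```python
import pandas as pd

df = pd.DataFrame({'Auction_id': [123, 123, 124, 124],
                   'bid_price': [5, 4, 3, 1],
                   'min_bid': [3, 3, 2, 2],
                   'rank': [1, 2, 1, 2]})

# One row per auction; rank spreads bid_price/min_bid across the columns
pv = df.pivot(index='Auction_id', columns='rank')

# Per auction: MAX(rank 2 bid_price, rank 1 min_bid)
custom = pv[[('bid_price', 2), ('min_bid', 1)]].max(axis=1)

# Map the per-auction value back onto the rank-1 rows only;
# rank-2 rows are left as NaN, matching the "don't care" requirement
df.loc[df['rank'] == 1, 'custom_column'] = df['Auction_id'].map(custom)
print(df)
```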