Pandas change order of columns in pivot table - pandas

The pivot table does not look the way I want; specifically, the order of the resulting columns is wrong.
I can't figure out how to change it properly.
Example df:
test_df = pd.DataFrame({'name':['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
'month':[1,2,3,1,2,3,1,2,3],
'salary':[100,100,100,110,110,110,120,120,120],
'status':[1,1,2,1,1,3,2,2,1]})
Code to make the pivot:
test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
Actual output:
salary status
month 1 2 3 1 2 3
name
name_1 100 100 100 1 1 2
name_2 110 110 110 1 1 3
name_3 120 120 120 2 2 1
The output I want to see:
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1

You would use sort_index, indicating the axis and the level:
piv = test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
piv.sort_index(axis='columns', level='month')
# salary status salary status salary status
#month 1 1 2 2 3 3
#name
#name_1 100 1 100 1 100 2
#name_2 110 1 110 1 110 3
#name_3 120 2 120 2 120 1

Use DataFrame.sort_index with the axis=1, level=1 arguments:
(test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
.sort_index(axis=1, level=1))
[out]
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1

import pandas as pd
df = pd.DataFrame({'name':
['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
'month':[1,2,3,1,2,3,1,2,3],
'salary':[100,100,100,110,110,110,120,120,120],
'status':[1,1,2,1,1,3,2,2,1]})
df = df.pivot_table(index='name', columns=['month'],
values=['salary', 'status']).sort_index(axis='columns', level='month')
print(df)
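As a quick check of the answers above, here is a minimal sketch that rebuilds the same pivot and confirms that sorting the column MultiIndex on its month level interleaves salary and status for each month. The names by_name and by_position are only illustrative; the point is that the level can be given either by name ('month') or by position (1).

```python
import pandas as pd

test_df = pd.DataFrame({
    'name': ['name_1'] * 3 + ['name_2'] * 3 + ['name_3'] * 3,
    'month': [1, 2, 3] * 3,
    'salary': [100] * 3 + [110] * 3 + [120] * 3,
    'status': [1, 1, 2, 1, 1, 3, 2, 2, 1],
})

piv = test_df.pivot_table(index='name', columns='month',
                          values=['salary', 'status'])

# Sort the columns by the 'month' level. Ties within a month are ordered by
# the remaining level (salary, then status) because sort_remaining
# defaults to True.
by_name = piv.sort_index(axis='columns', level='month')
by_position = piv.sort_index(axis=1, level=1)

print(by_name.columns.tolist())
```

Both calls should give the same frame, since 'month' is the name of column level 1 in this pivot.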

Related

Pandas groupby to get dataframe of unique values

If I have this simple dataframe, how do I use groupby() to get the desired summary dataframe?
Using Python 3.8
Inputs
x = [1,1,1,2,2,2,2,2,3,3,4,4,4]
y = [100,100,100,101,102,102,102,102,103,103,104,104,104]
z = [1,2,3,1,1,2,3,4,1,2,1,2,3]
df = pd.DataFrame(list(zip(x, y, z)), columns =['id', 'set', 'n'])
display(df)
Desired Output
With df.drop_duplicates
df.drop(columns="n").drop_duplicates(['id','set'])
id set
0 1 100
3 2 101
4 2 102
8 3 103
10 4 104
Groupby and explode
df.groupby('id')['set'].unique().explode()
id
1 100
2 101
2 102
3 103
4 104
You can try using .explode() and then reset the index of the result:
> df.groupby('id')['set'].unique().explode().reset_index(name='unique_value')
id unique_value
0 1 100
1 2 101
2 2 102
3 3 103
4 4 104
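The two approaches above can be compared side by side. This is a small sketch on the question's data, assuming the goal is one row per unique (id, set) pair; by_dedup and by_group are illustrative names. Note that explode() returns an object-dtype Series, so the sketch casts back to integers before building the frame.

```python
import pandas as pd

x = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4]
y = [100, 100, 100, 101, 102, 102, 102, 102, 103, 103, 104, 104, 104]
z = [1, 2, 3, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3]
df = pd.DataFrame({'id': x, 'set': y, 'n': z})

# Approach 1: keep only the key columns and drop repeated pairs.
by_dedup = df[['id', 'set']].drop_duplicates().reset_index(drop=True)

# Approach 2: collect the unique 'set' values per id, then expand each
# array of values into one row per value.
by_group = (df.groupby('id')['set']
              .unique()
              .explode()
              .astype('int64')   # explode() leaves an object dtype
              .reset_index(name='set'))

print(by_dedup)
print(by_group)
```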

Pandas pivot? pivot_table? melt? stack or unstack?

I have a dataframe that looks like this:
id Revenue Cost qty time
0 A 400 50 2 1
1 A 900 200 8 2
2 A 800 100 8 3
3 B 300 20 1 1
4 B 600 150 4 2
5 B 650 155 4 3
And I'm trying to get to this:
id Type 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4
Where time will always just be repeated 1-3, so I need to transpose or pivot on just time, with the column for 1-3
Here is what I have tried so far:
pd.pivot_table(df, values = ['Revenue', 'qty', 'Cost'] , index=['id'], columns='time').reset_index()
But that just makes one really long table that puts everything side by side vs stacked like this:
Revenue qty Cost
1 2 3 1 2 3 1 2 3
In that situation I would need to convert the Revenue, qty and Cost to a row and just use the 1, 2, 3 as the column names. So the ID would be duplicated for each 'type' but list it out based on time 1-3.
We can still do unstack and stack
df.set_index(['id','time']).stack().unstack(level=1).reset_index()
Out[24]:
time id level_1 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4
An alternative, using melt and pivot (requires pandas 1.1.0 or later, where pivot accepts a list for the index):
(df
.melt(["id", "time"])
.pivot(index=["id", "variable"], columns="time", values="value")
.reset_index()
.rename_axis(columns=None)
)
id variable 1 2 3
0 A Cost 50 200 100
1 A Revenue 400 900 800
2 A qty 2 8 8
3 B Cost 20 150 155
4 B Revenue 300 600 650
5 B qty 1 4 4
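The stack/unstack answer above leaves the new column called level_1. One way to get the Type header the question asks for is to name the columns axis before stacking, since stack() reuses that name for the new index level. This is a sketch under that assumption; tidy and lookup are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Revenue': [400, 900, 800, 300, 600, 650],
    'Cost': [50, 200, 100, 20, 150, 155],
    'qty': [2, 8, 8, 1, 4, 4],
    'time': [1, 2, 3, 1, 2, 3],
})

tidy = (df.set_index(['id', 'time'])
          .rename_axis(columns='Type')   # name the columns axis before stacking
          .stack()                       # Revenue/Cost/qty become an index level
          .unstack('time')               # time values 1..3 become the columns
          .reset_index()
          .rename_axis(columns=None))    # drop the leftover 'time' axis label

print(tidy)

# Look up one (id, Type) row to confirm the reshaping.
lookup = tidy.set_index(['id', 'Type'])
print(lookup.loc[('A', 'Revenue')].tolist())
```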

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
    return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Because r is either a list or an element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
Another option is to sort by price and number the rows within each group with cumcount:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2
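To tie the question's transform attempt to the rank answer: transform passes each group's prices to the function as a Series and broadcasts the result back to the original rows, while rank returns a position per row directly. The sketch below assumes integer positions are wanted and that ties should be broken by order of appearance, hence method='first'; both are assumptions, not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    'srch_id': [1, 1, 1, 3, 3, 3],
    'price': [30, 20, 25, 15, 102, 39],
})

prices = df.groupby('srch_id')['price']

# transform: one aggregated value per group, repeated on every row of it.
min_price = prices.transform('min')

# rank: position of each price inside its group. method='first' avoids
# fractional ranks on ties, so the result can be cast to int.
price_position = prices.rank(method='first').astype(int)

df['min_price'] = min_price
df['price_position'] = price_position
print(df)
```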

Selecting rows based on a column value

I have a data frame something like this
data = {'ID': [1,2,3,4,5,6,7,8,9],
'Doc':['Order','Order','Inv','Order','Order','Shp','Order', 'Order','Inv'],
'Rep':[101,101,101,102,102,102,103,103,103]}
frame = pd.DataFrame(data)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
5 Shp 6 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Now I want to select all rows for the Reps that have at least one row with Doc type Inv.
I want a dataframe as
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
All reps will have Doc type Orders so I was trying to do something like this
frame[frame.Rep == frame.Rep[frame.Doc == 'Inv']]
but I get an error
ValueError: Can only compare identically-labeled Series objects
You can use boolean indexing twice: first get all Rep values that match the condition, then select all rows for those Reps with isin:
a = frame.loc[frame['Doc'] == 'Inv', 'Rep']
print (a)
2 101
8 103
Name: Rep, dtype: int64
df = frame[frame['Rep'].isin(a)]
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Solution with query:
a = frame.query("Doc == 'Inv'")['Rep']
df = frame.query("Rep in @a")
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
Timings:
np.random.seed(123)
N = 1000000
L = ['Order','Shp','Inv']
frame = pd.DataFrame({'Doc': np.random.choice(L, N, p=[0.49, 0.5, 0.01]),
'ID':np.arange(1,N+1),
'Rep':np.random.randint(1000, size=N)})
print (frame.head())
Doc ID Rep
0 Shp 1 95
1 Order 2 147
2 Order 3 282
3 Shp 4 82
4 Shp 5 746
In [204]: %timeit (frame.groupby('Rep').filter(lambda x: 'Inv' in x['Doc'].values))
1 loop, best of 3: 250 ms per loop
In [205]: %timeit (frame[frame['Rep'].isin(frame.loc[frame['Doc'] == 'Inv', 'Rep'])])
100 loops, best of 3: 17.3 ms per loop
In [206]: %%timeit
...: a = frame.query("Doc == 'Inv'")['Rep']
...: frame.query("Rep in @a")
...:
100 loops, best of 3: 14.5 ms per loop
EDIT:
Thank you, John Galt, for the nice suggestion:
df = frame.query("Rep in %s" % frame.query("Doc == 'Inv'")['Rep'].tolist())
print (df)
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
6 Order 7 103
7 Order 8 103
8 Inv 9 103
import pandas as pd
frame_Filtered=frame[frame['Doc'].str.contains('Inv|Order')]
print(frame_Filtered)
Output I got
Doc ID Rep
0 Order 1 101
1 Order 2 101
2 Inv 3 101
3 Order 4 102
4 Order 5 102
6 Order 7 103
7 Order 8 103
8 Inv 9 103
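For reference, the two-step selection described earlier (find the Reps that have an Inv row, then keep every row for those Reps) can be run end to end as below. This is a sketch on the question's data; reps_with_inv, via_isin, and via_query are illustrative names, and the Rep values are converted to a plain list before being referenced from query with the @ prefix.

```python
import pandas as pd

frame = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Doc': ['Order', 'Order', 'Inv', 'Order', 'Order', 'Shp',
            'Order', 'Order', 'Inv'],
    'Rep': [101, 101, 101, 102, 102, 102, 103, 103, 103],
})

# Step 1: Reps that have at least one 'Inv' row.
reps_with_inv = frame.loc[frame['Doc'] == 'Inv', 'Rep'].tolist()

# Step 2a: keep every row whose Rep is in that list.
via_isin = frame[frame['Rep'].isin(reps_with_inv)]

# Step 2b: the same selection with query; '@' refers to a local variable.
via_query = frame.query("Rep in @reps_with_inv")

print(via_isin)
```

Rep 102 has no Inv row, so its rows (index 3 to 5) are dropped by both variants.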

Pandas MAX formula across different grouped rows

I have dataframe that looks like this:
Auction_id bid_price min_bid rank
123 5 3 1
123 4 3 2
124 3 2 1
124 1 2 2
I'd like to create another column that returns MAX(rank 1 min_bid, rank 2 bid_price). I don't care what appears for the rank 2 column values. I'm hoping for the result to look something like this:
Auction_id bid_price min_bid rank custom_column
123 5 3 1 4
123 4 3 2 NaN/Don't care
124 3 2 1 2
124 1 2 2 NaN/Don't care
Should I be iterating through grouped auction_ids? Can someone provide the topics one would need to be familiar with to tackle this type of problem?
First, set the index equal to the Auction_id. Then you can use loc to select the appropriate values for each Auction_id and use max on their values. Finally, reset your index to return to your initial state.
df.set_index('Auction_id', inplace=True)
df['custom_column'] = pd.concat([df.loc[df['rank'] == 1, 'min_bid'],
df.loc[df['rank'] == 2, 'bid_price']],
axis=1).max(axis=1)
df.reset_index(inplace=True)
>>> df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 4
2 124 3 2 1 2
3 124 1 2 2 2
Here's one crude way to do it.
Create a maxminbid() function that computes val = MAX(rank 1 min_bid, rank 2 bid_price), assigns it to grp['custom_column'], and sets the rank == 2 rows to NaN:
import numpy as np

def maxminbid(grp):
    # Each selection is a one-element array; take the larger value.
    val = max(grp.loc[grp['rank']==1, 'min_bid'].values,
              grp.loc[grp['rank']==2, 'bid_price'].values)[0]
    grp['custom_column'] = val
    # pd.np was removed in pandas 2.0, so use NumPy directly.
    grp.loc[grp['rank']==2, 'custom_column'] = np.nan
    return grp
Then apply maxminbid function on Auction_id grouped objects
df.groupby('Auction_id').apply(maxminbid)
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
But I suspect there must be a more elegant solution than this one.
Here's an approach that does some reshaping with pivot()
Auction_id bid_price min_bid rank
0 123 5 3 1
1 123 4 3 2
2 124 3 2 1
3 124 1 2 2
Then reshape your frame (df)
pv = df.pivot(index="Auction_id", columns="rank")
pv
bid_price min_bid
rank 1 2 1 2
Auction_id
123 5 4 3 3
124 3 1 2 2
Add a column to pv that contains the max. I'm using iloc to get a slice of the pv dataframe (bid_price for rank 2 and min_bid for rank 1).
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
pv
bid_price min_bid custom_column
rank 1 2 1 2
Auction_id
123 5 4 3 3 4
124 3 1 2 2 2
and then add the max to the original frame (df) by mapping to our pv frame
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
All the steps combined:
pv = df.pivot(index="Auction_id", columns="rank")
pv["custom_column"] = pv.iloc[:,[1,2]].max(axis=1)
df.loc[df["rank"] == 1,"custom_column"] = df["Auction_id"].map(pv["custom_column"])
df
Auction_id bid_price min_bid rank custom_column
0 123 5 3 1 4
1 123 4 3 2 NaN
2 124 3 2 1 2
3 124 1 2 2 NaN
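A vectorized variant of the same idea, offered only as a sketch: pick the relevant value on each row (min_bid on rank 1, bid_price on rank 2), take the maximum per auction with a grouped transform, and keep it on the rank 1 rows only. The names candidate and group_max are illustrative; rows with any other rank get NaN as their candidate and so do not affect the maximum.

```python
import pandas as pd

df = pd.DataFrame({
    'Auction_id': [123, 123, 124, 124],
    'bid_price': [5, 4, 3, 1],
    'min_bid': [3, 3, 2, 2],
    'rank': [1, 2, 1, 2],
})

# Value that takes part in the MAX for each row:
# min_bid on rank 1 rows, bid_price on rank 2 rows, NaN otherwise.
candidate = df['min_bid'].where(df['rank'] == 1,
                                df['bid_price'].where(df['rank'] == 2))

# Maximum candidate per auction, broadcast back to every row.
group_max = candidate.groupby(df['Auction_id']).transform('max')

# Keep the result on rank 1 rows; other rows become NaN.
df['custom_column'] = group_max.where(df['rank'] == 1)
print(df)
```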