Pandas pivot? pivot_table? melt? stack or unstack? - pandas

I have a dataframe that looks like this:
id Revenue Cost qty time
0 A 400 50 2 1
1 A 900 200 8 2
2 A 800 100 8 3
3 B 300 20 1 1
4 B 600 150 4 2
5 B 650 155 4 3
And I'm trying to get to this:
id Type 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4
Where time will always just be repeated 1-3, so I need to transpose or pivot on just time, with the column for 1-3
Here is what I have tried so far:
pd.pivot_table(df, values = ['Revenue', 'qty', 'Cost'] , index=['id'], columns='time').reset_index()
But that just makes one really long table that puts everything side by side vs stacked like this:
Revenue qty Cost
1 2 3 1 2 3 1 2 3
In that situation I would need to convert the Revenue, qty and Cost to a row and just use the 1, 2, 3 as the column names. So the ID would be duplicated for each 'type' but list it out based on time 1-3.

We can still do unstack and stack
df.set_index(['id','time']).stack().unstack(level=1).reset_index()
Out[24]:
time id level_1 1 2 3
0 A Revenue 400 900 800
1 A Cost 50 200 100
2 A qty 2 8 8
3 B Revenue 300 600 650
4 B Cost 20 150 155
5 B qty 1 4 4

An alternative, using melt and pivot on Pandas 1.1.0 :
(df
.melt(["id", "time"])
.pivot(["id", "variable"], "time", "value")
.reset_index()
.rename_axis(columns=None)
)
id variable 1 2 3
0 A Cost 50 200 100
1 A Revenue 400 900 800
2 A qty 2 8 8
3 B Cost 20 150 155
4 B Revenue 300 600 650
5 B qty 1 4 4

Related

Pandas: I want slice the data and shuffle them to genereate some synthetic data

Just want to help with data science to generate some synthetic data since we don't have enough labelled data. I want to cut the rows around the random position of the y column around 0s, don't cut 1 sequence.
After cutting, want to shuffle those slices and generate a new DataFrame.
It's better to have some parameters that adjust the maximum, and minimum sequence to cut, the number of cuts, and something like that.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
#number of cuts
N = 3
#create random N index values of index if y=0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
#create groups with check membership and cumulative sum
arr = df.index.isin(idx).cumsum()
#randomize unique integers - groups
u = np.unique(arr)
np.random.shuffle(u)
#change order of groups in DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print (df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1

Calculate Percent of Groupby Variable to Sum Column

I'm not finding a similar example to understand this in python. I have a dataset that looks like this:
ID Capacity
A 50
A 50
A 50
B 30
B 30
B 30
C 100
C 100
C 100
I need to find the percent of each ID for the sum of the "Capacity" column. So, the answer looks like this:
ID Capacity Percent_Capacity
A 50 0.2777
A 50 0.2777
A 50 0.2777
B 30 0.1666
B 30 0.1666
B 30 0.1666
C 100 0.5555
C 100 0.5555
C 100 0.5555
Thank you - still learning python.
total=df.groupby('ID')['Capacity'].first().sum()
df['percent_capacity'] = df['Capacity']/total
df
ID Capacity percent_capacity
0 A 50 0.277778
1 A 50 0.277778
2 A 50 0.277778
3 B 30 0.166667
4 B 30 0.166667
5 B 30 0.166667
6 C 100 0.555556
7 C 100 0.555556
8 C 100 0.555556
Using drop_duplicates:
df['percent_capacity'] = df['Capacity']/df.drop_duplicates(subset='ID')['Capacity'].sum()
Output:
ID Capacity percent_capacity
0 A 50 0.277778
1 A 50 0.277778
2 A 50 0.277778
3 B 30 0.166667
4 B 30 0.166667
5 B 30 0.166667
6 C 100 0.555556
7 C 100 0.555556
8 C 100 0.555556

Clean the data based on condition pandas

I have a data frame as shown below
ID Unit_ID Price Duration
1 A 200 2
2 B 1000 3
2 C 1000 3
2 D 1000 3
2 F 1000 3
2 G 200 1
3 A 500 2
3 B 200 2
From the above data frame if ID, Price and Duration are same then replace the Price by average (Price divided by count of Such combination).
For example from the above data frame from row 2 to 5 has same ID, Price and Duration, that means its count is 4, so the new Price = 1000/4 = 250.
Expected Output:
ID Unit_ID Price Duration
1 A 200 2
2 B 250 3
2 C 250 3
2 D 250 3
2 F 250 3
2 G 200 1
3 A 500 2
3 B 200 2
Use GroupBy.transform with GroupBy.size for Series with same size like original filled by counts, so possible divide by Series.div:
df['Price'] = df['Price'].div(df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
print (df)
ID Unit_ID Price Duration
0 1 A 200.0 2
1 2 B 250.0 3
2 2 C 250.0 3
3 2 D 250.0 3
4 2 F 250.0 3
5 2 G 200.0 1
6 3 A 500.0 2
7 3 B 200.0 2
Detail:
print (df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
0 1
1 4
2 4
3 4
4 4
5 1
6 1
7 1
Name: Price, dtype: int64

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Because r is either a list or an element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
is this:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2

Pandas change order of columns in pivot table

The representation of pivot tabel not looks like something I looking for, to be more specific the order of the resulting rows.
I can`t figure out how to change it in proper way.
Example df:
test_df = pd.DataFrame({'name':['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
'month':[1,2,3,1,2,3,1,2,3],
'salary':[100,100,100,110,110,110,120,120,120],
'status':[1,1,2,1,1,3,2,2,1]})
code for make pivot:
test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
Actual output:
salary status
month 1 2 3 1 2 3
name
name_1 100 100 100 1 1 2
name_2 110 110 110 1 1 3
name_3 120 120 120 2 2 1
The output I want to see:
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
You would use sort_index, indicating the axis and the level:
piv = test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
piv.sort_index(axis='columns', level='month')
# salary status salary status salary status
#month 1 1 2 2 3 3
#name
#name_1 100 1 100 1 100 2
#name_2 110 1 110 1 110 3
#name_3 120 2 120 2 120 1
Use DataFrame.sort_index with axis=1, level=1 arguments
(test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
.sort_index(axis=1, level=1))
[out]
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
import pandas as pd
df = pd.DataFrame({'name':
['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
'month':[1,2,3,1,2,3,1,2,3],
'salary':[100,100,100,110,110,110,120,120,120],
'status':[1,1,2,1,1,3,2,2,1]})
df = df.pivot_table(index='name', columns=['month'],
values=['salary', 'status']).sort_index(axis='columns', level='month')
print(df)