Pandas - Pivot/stack/unstack/melt

I have a dataframe that looks like this:
name  value 1  value 2
A         100      101
A         100      102
A         100      103
B         200      201
B         200      202
B         200      203
C         300      301
C         300      302
C         300      303
And I'm trying to get to this:
name  value 1  value 2  value 3  value 4  value 5  value 6
A         100      101      100      102      100      103
B         200      201      200      202      200      203
C         300      301      300      302      300      303
Here is what I have tried so far:
dataframe.stack()
dataframe.unstack()
dataframe.melt(id_vars=['name'])
I need to transpose the data so that:
The first row for each name remains as it is, but every subsequent value associated with the same name is moved into additional columns on that row.
For example, the second row for B should contribute its values as new columns after the existing value columns on B's row; it should not form a separate row altogether.

Try:
def fn(x):
    vals = x.values.ravel()
    return pd.DataFrame(
        [vals],
        columns=[f"value {i}" for i in range(1, vals.shape[0] + 1)],
    )

out = (
    df.set_index("name")
    .groupby(level=0)
    .apply(fn)
    .reset_index()
    .drop(columns="level_1")
)
print(out.to_markdown())
Prints:
|    | name   |   value 1 |   value 2 |   value 3 |   value 4 |   value 5 |   value 6 |
|---:|:-------|----------:|----------:|----------:|----------:|----------:|----------:|
|  0 | A      |       100 |       101 |       100 |       102 |       100 |       103 |
|  1 | B      |       200 |       201 |       200 |       202 |       200 |       203 |
|  2 | C      |       300 |       301 |       300 |       302 |       300 |       303 |

Flatten values for each name
(
    df.set_index('name')
    .groupby(level=0)
    .apply(lambda x: pd.Series(x.values.flat))
    .rename(columns=lambda x: f'value {x + 1}')
    .reset_index()
)
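For a single group, x.values.flat lays the three-row, two-column block out row by row, and rename then relabels the resulting 0-5 index as value 1 ... value 6. A quick illustration for group A (a sketch, assuming pandas imported as pd and the same df as above):
block = df.set_index('name').loc['A']      # the three rows for name A
print(pd.Series(block.values.flat))
# 0    100
# 1    101
# 2    100
# 3    102
# 4    100
# 5    103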

One option using melt, groupby, and pivot_wider (from pyjanitor):
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .melt('name', ignore_index=False)
 .sort_index()
 .drop(columns='variable')
 .assign(header=lambda df: df.groupby('name').cumcount() + 1)
 .pivot_wider('name', 'header', names_sep=' ')
)
  name  value 1  value 2  value 3  value 4  value 5  value 6
0    A      100      101      100      102      100      103
1    B      200      201      200      202      200      203
2    C      300      301      300      302      300      303
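For completeness, here is a pure-pandas sketch that stays close to the pivot wording of the question (assuming the same df; the helper column name k is arbitrary): number the repeats per name with cumcount, pivot on that counter, then reorder and flatten the resulting column MultiIndex.
wide = (df.assign(k=df.groupby('name').cumcount())
          .pivot(index='name', columns='k')
          .sort_index(axis=1, level='k'))   # interleave value 1 / value 2 per repeat
wide.columns = [f'value {i}' for i in range(1, len(wide.columns) + 1)]
wide = wide.reset_index()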

Related

Turn MultiIndex Series into pivot table design by unique value counts

Sample Data:
Date,code
06/01/2021,405
06/01/2021,405
06/01/2021,400
06/02/2021,200
06/02/2021,300
06/03/2021,500
06/02/2021,500
06/03/2021,300
06/05/2021,500
06/04/2021,500
06/03/2021,400
06/02/2021,400
06/04/2021,400
06/03/2021,400
06/01/2021,400
06/04/2021,200
06/05/2021,200
06/02/2021,200
06/06/2021,300
06/04/2021,300
06/06/2021,300
06/05/2021,400
06/03/2021,400
06/04/2021,400
06/04/2021,500
06/01/2021,200
06/02/2021,300
import pandas as pd
df = pd.read_csv("testfile.csv")
code_total = df.groupby(by="Date",)['code'].value_counts()
print(code_total)
Date        code
06/01/2021  400     2
            405     2
            200     1
06/02/2021  200     2
            300     2
            400     1
            500     1
06/03/2021  400     3
            300     1
            500     1
06/04/2021  400     2
            500     2
            200     1
            300     1
06/05/2021  200     1
            400     1
            500     1
06/06/2021  300     2
dates = set([x[0] for x in code_total.index])
codes = set([x[1] for x in code_total.index])
test = pd.DataFrame(code_total,columns=sorted(codes),index=sorted(dates))
print(test)
Is there a way to transpose the second index into a column and retain the value for the counts? Ultimately I'm trying to plot the count of unique error codes on a line graph. I've tried many different approaches but am always missing something. Any help would be appreciated.
Use Series.unstack:
df = df.groupby(by="Date",)['code'].value_counts().unstack(fill_value=0)
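Putting it together with the plotting goal from the question, a minimal sketch (assuming the file is named testfile.csv and matplotlib is installed for .plot()):
import pandas as pd

df = pd.read_csv("testfile.csv")
counts = df.groupby("Date")["code"].value_counts().unstack(fill_value=0)
# counts has one row per Date and one column per code,
# e.g. for 06/01/2021: 200 -> 1, 400 -> 2, 405 -> 2, and 0 elsewhere
counts.plot()   # one line per error code across dates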

Getting groups by group index

I want to access a group by its group index. My dataframe is given below:
import pandas as pd
from io import StringIO
import numpy as np
data = """
id,name
100,A
100,B
100,C
100,D
100,pp;
212,E
212,F
212,ds
212,G
212, dsds
212, sas
300,Endüstrisi`
"""
df = pd.read_csv(StringIO(data))
I want to group by 'id' and access the groups by their group index.
dfg=df.groupby('id',sort=False,as_index=False)
dfg.get_group(0)
I was expecting this to return the first group, i.e. the group for id = 100.
You need to pass a value of id:
dfg=df.groupby('id',sort=False)
a = dfg.get_group(100)
print (a)
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
dfg=df.groupby('id',sort=False)
a = dfg.get_group(df.loc[0, 'id'])
print (a)
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
If you need to enumerate groups, you can use GroupBy.ngroup:
dfg=df.groupby('id',sort=False)
a = df[dfg.ngroup() == 0]
print (a)
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
Detail:
print (dfg.ngroup())
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 1
9 1
10 1
11 2
dtype: int64
EDIT: Another idea, if you need to select groups by position (all ids form consecutive groups), is to compare against the unique values of id selected by position:
ids = df['id'].unique()
print (ids)
[100 212 300]
print (df[df['id'].eq(ids[0])])
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
print (df[df['id'].eq(ids[1])])
id name
5 212 E
6 212 F
7 212 ds
8 212 G
9 212 dsds
10 212 sas
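If you want a reusable way of fetching a group by its position, the ideas above can be wrapped in a small helper (a sketch; nth_group is a hypothetical name, not a pandas API):
def nth_group(df, by, i):
    gb = df.groupby(by, sort=False)
    key = list(gb.groups)[i]     # group keys in order of appearance
    return gb.get_group(key)

print(nth_group(df, 'id', 0))    # same rows as dfg.get_group(100)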

creating new column within multiindex

I have a simple dataframe:
       A         B
       1    2    1    2
Foo  100  200  300  400
Bar  100  200  300  400
I want to add a new column which is (B,2) - (A,2)
What I've tried is:
df["Chg","Period"] = [df.loc[:, [("B",2)]] - df.loc[:, [("A", 2)]]]
But I'm told that:
Length of values does not match length of index
I'm a bit confused - I thought that by having two column headers for my new column it would work, but I'm now struggling. Any help would be most appreciated
Thanks
Use tuples to select from the MultiIndex and also for the new MultiIndex column:
df[("Chg","Period")] = df[("B",2)] - df[("A", 2)]
print (df)
       A         B       Chg
       1    2    1    2 Period
Foo  100  200  300  400    200
Bar  100  200  300  400    200
If you want to work with multiple columns at once, e.g. subtract A from B into new MultiIndex levels, you can use DataFrame.xs, then create a MultiIndex with MultiIndex.from_product and add it to the original with DataFrame.join:
df1 = df.xs('B', axis=1, level=0) - df.xs('A', axis=1, level=0)
df1.columns = pd.MultiIndex.from_product([['Diff'], df1.columns])
print (df1)
    Diff
       1    2
Foo  200  200
Bar  200  200
df = df.join(df1)
print (df)
       A         B       Diff
       1    2    1    2     1    2
Foo  100  200  300  400   200  200
Bar  100  200  300  400   200  200
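An equivalent way to attach the difference under its own top level is pd.concat with a dict, which builds the extra column level directly (a sketch assuming the same df):
diff = df.xs('B', axis=1, level=0) - df.xs('A', axis=1, level=0)
out = pd.concat([df, pd.concat({'Diff': diff}, axis=1)], axis=1)
print(out)   # same result as the join above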

How to assign the multiple values of an output to new multiple columns of a dataframe?

I have the following function:
def sum(x):
    oneS = x.iloc[0:len(x)//10].agg('sum')
    twoS = x.iloc[len(x)//10:2*len(x)//10].agg('sum')
    threeS = x.iloc[2*len(x)//10:3*len(x)//10].agg('sum')
    fourS = x.iloc[3*len(x)//10:4*len(x)//10].agg('sum')
    fiveS = x.iloc[4*len(x)//10:5*len(x)//10].agg('sum')
    sixS = x.iloc[5*len(x)//10:6*len(x)//10].agg('sum')
    sevenS = x.iloc[6*len(x)//10:7*len(x)//10].agg('sum')
    eightS = x.iloc[7*len(x)//10:8*len(x)//10].agg('sum')
    nineS = x.iloc[8*len(x)//10:9*len(x)//10].agg('sum')
    tenS = x.iloc[9*len(x)//10:len(x)].agg('sum')
    return [oneS, twoS, threeS, fourS, fiveS, sixS, sevenS, eightS, nineS, tenS]
How to assign the outputs of this function to columns of dataframe (which already exists)
The dataframe I am applying the function is as below
Cycle Type Time
1 1 101
1 1 102
1 1 103
1 1 104
1 1 105
1 1 106
9 1 101
9 1 102
9 1 103
9 1 104
9 1 105
9 1 106
The dataframe I want to add the columns to is something like the one below, and the new columns OneS, TwoS, ... should be added as shown and filled with the results of the function.
Cycle Type OneS TwoS ThreeS
1 1
9 1
8 1
10 1
3 1
5 2
6 2
7 2
If I write a function for just one value and apply it like the following, it is possible:
grouped_data['fm']= data_train_bel1800.groupby(['Cycle', 'Type'])['Time'].apply( lambda x: fm(x))
But I want to do it all at once so that it is neat and clear.
You can use:
def f(x):
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)

df1 = (data_train_bel1800.groupby(['Cycle', 'Type'])['Time']
       .apply(f)
       .unstack()
       .add_prefix('new_')
       .reset_index())
print (df1)
   Cycle  Type  new_0  new_1  new_2  new_3  new_4  new_5  new_6  new_7  new_8  \
0      1     1      0    101    102    205    207    209    315    211    211
1      9     1      0    101    102    205    207    209    315    211    211

   new_9
0    106
1    106
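If you prefer the OneS, TwoS, ... names from the question instead of the new_ prefix, a rename mapping can be applied afterwards (sketch):
names = ['OneS', 'TwoS', 'ThreeS', 'FourS', 'FiveS',
         'SixS', 'SevenS', 'EightS', 'NineS', 'TenS']
df1 = df1.rename(columns={f'new_{i}': name for i, name in enumerate(names)})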

Merge two dataframes on different named columns for multiple columns

I have two dataframes: Users and Item_map.
Users consists of user and fake_item_ids stored in three columns.
Item_map consists of real_item_ids and fake_item_ids.
What I want is to replace all of the fake_item_ids with the real_item_ids.
To illustrate with dummy code:
DataFrame Users
user fake_0 fake_1
0 1 6786 3938
1 2 6786 6786
2 3 4345 4345
3 4 7987 3938
4 5 7987 5464
DataFrame Item_map
real_id fake_id
0 101 7987
1 202 6786
2 303 5464
3 404 4345
4 505 3938
Expected results:
DataFrame Users
user real_0 real_1
0 1 202 505
1 2 202 202
2 3 404 404
3 4 101 505
4 5 101 303
I have tried the following, based on an answer found here: how to concat two data frames with different column names in pandas? - python
users['fake_0'] = users.merge(items.rename(columns={'fake_id': 'fake_0'}), how='inner')['real_id']
which resulted in this:
user fake_0 fake_1
0 1 202 3938
1 2 202 6786
2 3 404 4345
3 4 101 3938
4 5 101 5464
This works, but it seems silly to do so for every column separately (I have nine columns that have fake_ids that need to be real_ids).
Any help is much appreciated!
Dummy code:
users = pd.DataFrame({
    'user': [1, 2, 3, 4, 5],
    'fake_0': [6786, 6786, 4345, 7987, 7987],
    'fake_1': [3938, 6786, 4345, 3938, 5464]
})
item_map = pd.DataFrame({
    'real_id': [101, 202, 303, 404, 505],
    'fake_id': [7987, 6786, 5464, 4345, 3938]
})
Using replace (here df is users and df1 is item_map):
df.replace(dict(zip(df1.fake_id,df1.real_id)))
Out[46]:
user fake_0 fake_1
0 1 202 505
1 2 202 202
2 3 404 404
3 4 101 505
4 5 101 303
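To also get the real_0 / real_1 column names from the expected output, the columns can be renamed after the replace (a sketch using the dummy-code names):
out = users.replace(dict(zip(item_map.fake_id, item_map.real_id)))
out.columns = out.columns.str.replace('fake', 'real')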
I'm not sure if this will be the most efficient solution, but it should work for your example with 10 columns without you having to edit anything.
First, create a lookup dictionary from your item_map:
d = pd.Series(index=item_map['fake_id'], data=item_map['real_id'].values).to_dict()
Then, use applymap to look up each column except 'user':
results = users.set_index('user').applymap(lambda x: d[x]).reset_index()
If you want, you can then rename the columns to get your desired output:
results.columns = [col.replace('fake', 'real') for col in results.columns]
Results:
user real_0 real_1
0 1 202 505
1 2 202 202
2 3 404 404
3 4 101 505
4 5 101 303
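As a side note, DataFrame.applymap is deprecated as of pandas 2.1 in favour of DataFrame.map, which takes the same element-wise function, so the lookup above would then read (a sketch, reusing d and users from above):
results = users.set_index('user').map(lambda x: d[x]).reset_index()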