How do I turn column headings into a column - pandas

I have a pandas dataframe that looks like this:
Year A B C D
1999 1 3 5 7
2000 11 13 17 19
2001 23 29 31 37
And I want it to look like this:
Year Type Value
1999 A 1
1999 B 3
1999 C 5
1999 D 7
2000 A 11
2000 B 13
Etc. Is there a way to do this and if so, how?
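For reproducibility, a minimal construction of the example frame (a sketch; values taken from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [1999, 2000, 2001],
                   'A': [1, 11, 23],
                   'B': [3, 13, 29],
                   'C': [5, 17, 31],
                   'D': [7, 19, 37]})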

You can recreate your df
pd.DataFrame({'Year': df.Year.repeat(df.shape[1] - 1),
              'Type': list(df)[1:] * len(df),
              'Value': np.concatenate(df.iloc[:, 1:].values)})
Out[95]:
Type Value Year
0 A 1 1999
0 B 3 1999
0 C 5 1999
0 D 7 1999
1 A 11 2000
1 B 13 2000
1 C 17 2000
1 D 19 2000
2 A 23 2001
2 B 29 2001
2 C 31 2001
2 D 37 2001
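In the output above the columns come out in alphabetical order and the index repeats per year; a small hedged follow-up to get the Year/Type/Value layout with a fresh index (out is just an illustrative name):
out = pd.DataFrame({'Year': df.Year.repeat(df.shape[1] - 1),
                    'Type': list(df)[1:] * len(df),
                    'Value': np.concatenate(df.iloc[:, 1:].values)})
# reorder the columns and rebuild a clean RangeIndex
out = out[['Year', 'Type', 'Value']].reset_index(drop=True)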

First use set_index, then stack, then rename_axis, and finally reset_index:
df = df.set_index('Year').stack().rename_axis(('Year','Type')).reset_index(name='Value')
print (df)
Year Type Value
0 1999 A 1
1 1999 B 3
2 1999 C 5
3 1999 D 7
4 2000 A 11
5 2000 B 13
6 2000 C 17
7 2000 D 19
8 2001 A 23
9 2001 B 29
10 2001 C 31
11 2001 D 37
Or use melt, but the order of the values is different:
df = df.melt('Year', var_name='Type', value_name='Value')
print (df)
Year Type Value
0 1999 A 1
1 2000 A 11
2 2001 A 23
3 1999 B 3
4 2000 B 13
5 2001 B 29
6 1999 C 5
7 2000 C 17
8 2001 C 31
9 1999 D 7
10 2000 D 19
11 2001 D 37
... so sorting is necessary:
df = (df.melt('Year', var_name='Type', value_name='Value')
.sort_values(['Year','Type'])
.reset_index(drop=True))
print (df)
Year Type Value
0 1999 A 1
1 1999 B 3
2 1999 C 5
3 1999 D 7
4 2000 A 11
5 2000 B 13
6 2000 C 17
7 2000 D 19
8 2001 A 23
9 2001 B 29
10 2001 C 31
11 2001 D 37
Numpy solution:
a = np.repeat(df['Year'], len(df.columns.difference(['Year'])))  # repeat each year once per value column
b = np.tile(df.columns.difference(['Year']), len(df.index))      # tile the value-column names across rows
c = df.drop(columns='Year').values.ravel()                       # flatten the values row by row
df = pd.DataFrame(np.column_stack([a,b,c]), columns=['Year','Type','Value'])
print (df)
Year Type Value
0 1999 A 1
1 1999 B 3
2 1999 C 5
3 1999 D 7
4 2000 A 11
5 2000 B 13
6 2000 C 17
7 2000 D 19
8 2001 A 23
9 2001 B 29
10 2001 C 31
11 2001 D 37

Related

Unpivot a data-frame that has information of two teams in one row?

I have some data that holds information about two opposing teams
home_x away_x
0 7 28
1 11 10
2 11 20
3 12 15
4 12 16
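A minimal setup for this frame (a sketch; values from the rows above):
import pandas as pd

df = pd.DataFrame({'home_x': [7, 11, 11, 12, 12],
                   'away_x': [28, 10, 20, 15, 16]})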
I know about .melt(), which returns something like this:
variable value
0 home_x 7
1 home_x 11
2 home_x 11
3 home_x 12
4 home_x 12
So each value is a row here.
There are several attributes for each team.
I want each row to have all the attributes for the respective team (home or away).
The ultimate goal is to have all the attributes of both teams in one row, which would double the number of rows.
home_x away_x
0 7 28
would be transformed into:
team1_x team2_x
0 7 28
0 28 7
sample df:
   home_x  away_x  home_y  away_y
0       7      28       7      20
1      28       7      28      13
2      28       7      28       4
3       7      28       7      58
4      11      10      11      10
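A sketch to build this sample frame (values as shown above):
df = pd.DataFrame({'home_x': [7, 28, 28, 7, 11],
                   'away_x': [28, 7, 7, 28, 10],
                   'home_y': [7, 28, 28, 7, 11],
                   'away_y': [20, 13, 4, 58, 10]})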
Try:
res = pd.DataFrame()
for c in df.columns.str.split("_").str[1].unique():
    p1 = df.filter(regex=f"{c}$")
    c1, c2 = p1.columns
    swap = p1.rename(columns={c1: c2, c2: c1})  # swap the home/away columns
    # interleave original and swapped rows (DataFrame.append is removed in pandas 2.0, so pd.concat is used)
    pair = pd.concat([p1, swap]).sort_index(ignore_index=True)
    res = pd.concat([res, pair], axis=1)
Then rename the columns:
import re
repl = {'home': 'team1', 'away': 'team2'}
res.columns = [re.sub('|'.join(repl.keys()), lambda x: repl[x.group()], i) for i in res.columns]
   team1_x  team2_x  team1_y  team2_y
0        7       28        7       20
1       28        7       20        7
2       28        7       28       13
3        7       28       13       28
4       28        7       28        4
5        7       28        4       28
6        7       28        7       58
7       28        7       58        7
8       11       10       11       10
9       10       11       10       11
Here is an approach:
Group on the last part of the underscore-split column names along axis=1, then iterate through the groups, reverse the column order of each, give both copies the same team1/team2 names, and add the suffix back:
def myinfo(data):
    c = data.columns.str.split("_").str[-1]
    f = lambda x: pd.DataFrame.set_axis(x, ["team1", "team2"], axis=1)
    l = [pd.concat([*map(f, (v, v.iloc[:, ::-1]))]).add_suffix(f"_{k}")
         for k, v in data.groupby(c, axis=1)]
    return pd.concat(l, axis=1).sort_index()
print(myinfo(df))
team1_x team2_x
0 7 28
0 28 7
1 11 10
1 10 11
2 11 20
2 20 11
3 12 15
3 15 12
4 12 16
4 16 12

python3 pandas: how to quickly find the number of months since the last non-zero value

There is Dataframe as following:
id year month y sinx
1 2019 1 0 1
1 2019 2 0 2
1 2019 3 1 3
1 2019 4 0 4
1 2019 5 0 5
1 2019 6 0 6
1 2019 7 0 7
1 2019 8 2 8
1 2019 9 0 9
1 2019 10 0 10
1 2019 11 0 11
1 2019 12 0 11
1 2020 1 0 12
1 2020 2 0 13
1 2020 3 2 14
1 2020 4 0 15
2 2019 1 0 1
2 2019 2 0 2
2 2019 3 0 3
2 2019 4 0 4
.......
For each id and each month, I want the number of months since the most recent earlier month whose value (the y column) is not 0. If there is no previous month, or no previous month with a non-zero value, set the value to -1.
For example, for the DataFrame above I want the following result.
Moreover, the DataFrame is large (about 5M rows), so the solution should be fast:
id year month y sinx num_month
1 2019 1 0 1 -1
1 2019 2 0 2 -1
1 2019 3 1 3 -1
1 2019 4 0 4 1
1 2019 5 0 5 2
1 2019 6 0 6 3
1 2019 7 0 7 4
1 2019 8 2 8 5
1 2019 9 0 9 1
1 2019 10 0 10 2
1 2019 11 0 11 3
1 2019 12 0 11 4
1 2020 1 0 12 5
1 2020 2 0 13 6
1 2020 3 2 14 7
1 2020 4 0 15 1
2 2019 1 0 1 -1
2 2019 2 0 2 -1
2 2019 3 1 3 -1
2 2019 4 0 4 1
.......
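A hedged reconstruction of the sample rows shown above (id 1 in full plus the first four id 2 rows; id 2's month-3 y follows the expected output), so the snippets below can be run:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':    [1] * 16 + [2] * 4,
    'year':  [2019] * 12 + [2020] * 4 + [2019] * 4,
    'month': list(range(1, 13)) + list(range(1, 5)) + list(range(1, 5)),
    'y':     [0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0],
    'sinx':  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12, 13, 14, 15, 1, 2, 3, 4],
})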
Getting the cumulative count is straightforward, but the logic to get the -1 values is a little tricky. These are all vectorized pandas methods, so it should be performant on millions of rows:
You can groupby the necessary columns as well as the cumsum of y in preparation for getting the cumcount().
However, the cumcount needs to extend one further row per group, so the relevant rows are fixed with np.where().
The slightly trickier part is changing the values to -1. Similar techniques to the previous steps are used, ultimately calling mask to change the relevant values to -1 based on a couple of conditions.
# set up the base cumcount
m = df.groupby(['id', 'y', df['y'].cumsum()]).cumcount() + 1
# extend the cumcount one further row per group
df['num_month'] = np.where((m == 1) & (m.shift() > 1), m.shift() + 1, m).astype(int)
# index location of the first value per group, returned as a series of the same length
s1 = df.groupby('id').transform('idxmin').iloc[:, 0]
# index location of the first non-zero value per group, returned as a series of the same length
s2 = df.groupby(['id', (df['y'] > 0).cumsum()]).transform('idxmin').iloc[:, 0]
# using the s1 and s2 conditions, update the necessary rows to -1
df['num_month'] = df['num_month'].mask((s1 == s2) | (s1 == s2.shift()), -1)
df
Out[1]:
id year month y sinx num_month
0 1 2019 1 0 1 -1
1 1 2019 2 0 2 -1
2 1 2019 3 1 3 -1
3 1 2019 4 0 4 1
4 1 2019 5 0 5 2
5 1 2019 6 0 6 3
6 1 2019 7 0 7 4
7 1 2019 8 2 8 5
8 1 2019 9 0 9 1
9 1 2019 10 0 10 2
10 1 2019 11 0 11 3
11 1 2019 12 0 11 4
12 1 2020 1 0 12 5
13 1 2020 2 0 13 6
14 1 2020 3 2 14 7
15 1 2020 4 0 15 1
16 2 2019 1 0 1 -1
17 2 2019 2 0 2 -1
18 2 2019 3 1 3 -1
19 2 2019 4 0 4 1
Let us try groupby with ffill
s = df.month.mask(df.y.eq(0)).groupby(df.year).apply(lambda x : x.ffill().shift())
df['New'] = (df.month-s).fillna(-1)
df
Out[35]:
id year month y sinx New
0 1 2019 1 0 1 -1.0
1 1 2019 2 0 2 -1.0
2 1 2019 3 1 3 -1.0
3 1 2019 4 0 4 1.0
4 1 2019 5 0 5 2.0
5 1 2019 6 0 6 3.0
6 1 2019 7 0 7 4.0
7 1 2019 8 2 8 5.0
8 1 2019 9 0 9 1.0
9 1 2019 10 0 10 2.0
10 1 2019 11 0 11 3.0
11 1 2019 12 0 11 4.0
12 1 2020 1 0 12 -1.0
13 1 2020 2 0 13 -1.0
14 1 2020 3 2 14 -1.0
15 1 2020 4 0 15 1.0
16 2 2019 1 0 1 -7.0
17 2 2019 2 0 2 -6.0
18 2 2019 3 0 3 -5.0
19 2 2019 4 0 4 -4.0
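Because this groups by year, the count resets at the year boundary and id 2 picks up month differences from id 1's last non-zero month, as the output above shows. A hedged variant (not from the original answers) that uses a running row counter per id instead of the raw month column, assuming rows are already ordered by id, year, and month:
# position of each row within its id group
seq = df.groupby('id').cumcount() + 1
# running position of the most recent non-zero y, strictly before the current row
last = seq.where(df['y'].ne(0)).groupby(df['id']).ffill().groupby(df['id']).shift()
# months since that position; -1 where no earlier non-zero value exists
df['num_month'] = (seq - last).fillna(-1).astype(int)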

Pandas Groupby and divide the dataset into subgroups based on user input and label numbers to each subgroup

Here is my data:
ID Mnth Amt Flg
B 1 10 0
B 2 12 0
B 3 14 0
B 4 41 0
B 5 134 0
B 6 14 0
B 7 134 0
B 8 134 0
B 9 12 0
B 10 41 0
B 11 4 0
B 12 14 0
B 12 14 0
A 1 34 0
A 2 22 0
A 3 56 0
A 4 129 0
A 5 40 0
A 6 20 0
A 7 58 0
A 8 123 0
If I give 3 as input, my output should be:
ID Mnth Amt Flg Level_Flag
B 1 10 0 0
B 2 12 0 1
B 3 14 0 1
B 4 41 0 1
B 5 134 0 2
B 6 14 0 2
B 7 134 0 2
B 8 134 0 3
B 9 12 0 3
B 10 41 0 3
B 11 4 0 4
B 12 14 0 4
B 12 14 0 4
A 1 34 0 0
A 2 22 0 0
A 3 56 0 1
A 4 129 0 1
A 5 40 0 1
A 6 20 0 2
A 7 58 0 2
A 8 123 0 2
So basically I want to divide the data into subgroups of 3 rows each, counted from the bottom up, and label those subgroups as shown in the Level_Flag column. I have other IDs like A, C, and so on, so I want to do this for each ID group. Thanks in advance.
Edit: I want the same thing done after grouping by ID.
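A hedged reconstruction of the sample frame (values from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':   ['B'] * 13 + ['A'] * 8,
    'Mnth': list(range(1, 13)) + [12] + list(range(1, 9)),
    'Amt':  [10, 12, 14, 41, 134, 14, 134, 134, 12, 41, 4, 14, 14,
             34, 22, 56, 129, 40, 20, 58, 123],
    'Flg':  [0] * 21,
})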
First we decide the number of labels nums by dividing the length of the group by n and rounding up. Then we repeat each of those numbers n times. Finally we reverse the array, chop it off at the length of the group, and reverse it one more time.
def create_flags(d, n):
    nums = np.ceil(len(d) / n)
    level_flag = np.repeat(np.arange(nums), n)[::-1][:len(d)][::-1]
    return level_flag
df['Level_Flag'] = df.groupby('ID')['ID'].transform(lambda x: create_flags(x, 3))
ID Mnth Amt Flg Level_Flag
0 B 1 10 0 0.0
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
To remove the rows of incomplete groups (those with fewer than 3 rows), use GroupBy.transform:
m = df.groupby(['ID', 'Level_Flag'])['Level_Flag'].transform('count').ge(3)
df = df[m]
ID Mnth Amt Flg Level_Flag
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
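If integer labels are preferred over the float output above, a small optional cleanup (a sketch, not part of the original answer):
# cast the group labels back to plain integers
df['Level_Flag'] = df['Level_Flag'].astype(int)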

Aggregating counts of different columns subsets

I have a dataset with a tree structure and for each path in the tree, I want to compute the corresponding counts at each level. Here is a minimal reproducible example with two levels.
import numpy as np
import pandas as pd

data = pd.DataFrame()
data['level_1'] = np.random.choice(['1', '2', '3'], 100)
data['level_2'] = np.random.choice(['A', 'B', 'C'], 100)
I know I can get the counts on the last level by doing
counts = data.groupby(['level_1','level_2']).size().reset_index(name='count_2')
print(counts)
level_1 level_2 count_2
0 1 A 10
1 1 B 12
2 1 C 8
3 2 A 10
4 2 B 10
5 2 C 10
6 3 A 17
7 3 B 12
8 3 C 11
What I would like to have is a dataframe with one row for each possible path in the tree with the counts at each level in that path. For the example above, it would be something like
level_1 level_2 count_1 count_2
0 1 A 30 10
1 1 B 30 12
2 1 C 30 8
3 2 A 30 10
4 2 B 30 10
5 2 C 30 10
6 3 A 40 17
7 3 B 40 12
8 3 C 40 11
This is an example with only two levels, which is easy to solve, but I would like to have a way to get those counts for an arbitrary number of levels.
This can be done with transform:
counts['count_1'] = counts.groupby(['level_1']).count_2.transform('sum')
counts
Out[445]:
level_1 level_2 count_2 count_1
0 1 A 7 30
1 1 B 13 30
2 1 C 10 30
3 2 A 7 30
4 2 B 7 30
5 2 C 16 30
6 3 A 9 40
7 3 B 10 40
8 3 C 21 40
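The question asks for an arbitrary number of levels; a hedged generalization of the transform idea, assuming the level columns are named level_1 ... level_n and counts already holds the leaf-level sizes in count_n:
level_cols = [c for c in counts.columns if c.startswith('level_')]
leaf = f'count_{len(level_cols)}'
# for every prefix of the path, sum the leaf counts within that prefix
for i in range(1, len(level_cols)):
    counts[f'count_{i}'] = counts.groupby(level_cols[:i])[leaf].transform('sum')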
You can also build it directly from your original data:
groups = data.groupby('level_1').level_2
pd.merge(groups.value_counts(), groups.size(),
         left_index=True, right_index=True)
which gives:
level_2_x level_2_y
level_1 level_2
1 A 14 39
B 14 39
C 11 39
2 C 13 34
A 12 34
B 9 34
3 B 12 27
C 9 27
A 6 27

Pandas - expanding mean with groupby

I'm trying to get an expanding mean. I can get it to work when I iterate and "group" just by filtering on the specific values, but it takes way too long. I feel like this should be easy to do with a groupby, but when I try it, it applies the expanding mean to the entire dataset rather than to each of the groups in the groupby.
for a quick example:
I want to take this (in this particular case, grouped by 'player' and 'year'), and get an expanding mean.
player pos year wk pa ra
a qb 2001 1 10 0
a qb 2001 2 5 0
a qb 2001 3 10 0
a qb 2002 1 12 0
a qb 2002 2 13 0
b rb 2001 1 0 20
b rb 2001 2 0 17
b rb 2001 3 0 12
b rb 2002 1 0 14
b rb 2002 2 0 15
to get:
player pos year wk pa ra avg_pa avg_ra
a qb 2001 1 10 0 10 0
a qb 2001 2 5 0 7.5 0
a qb 2001 3 10 0 8.3 0
a qb 2002 1 12 0 12 0
a qb 2002 2 13 0 12.5 0
b rb 2001 1 0 20 0 20
b rb 2001 2 0 17 0 18.5
b rb 2001 3 0 12 0 16.3
b rb 2002 1 0 14 0 14
b rb 2002 2 0 15 0 14.5
Not sure where I'm going wrong:
# Group by player and season - also put weeks in correct ascending order
grouped = calc_averages.groupby(['player','pos','seas']).apply(pd.DataFrame.sort_values, 'wk')
grouped['avg_pa'] = grouped['pa'].expanding().mean()
But this will give an expanding mean for the entire set, not for each player, season.
Try:
# note: select with double brackets; plain list selection on a groupby was removed in pandas 2.0
df.sort_values('wk').groupby(['player','pos','year'])[['pa','ra']]\
  .expanding().mean().reset_index()
Output:
player pos year level_3 pa ra
0 a qb 2001 0 10.000000 0.000000
1 a qb 2001 1 7.500000 0.000000
2 a qb 2001 2 8.333333 0.000000
3 a qb 2002 3 12.000000 0.000000
4 a qb 2002 4 12.500000 0.000000
5 b rb 2001 5 0.000000 20.000000
6 b rb 2001 6 0.000000 18.500000
7 b rb 2001 7 0.000000 16.333333
8 b rb 2002 8 0.000000 14.000000
9 b rb 2002 9 0.000000 14.500000
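To attach the expanding means to the original rows under the avg_pa/avg_ra names the question asks for, a hedged follow-up (assuming the frame built earlier):
# sort so the row order matches the group order of the expanding means
out = df.sort_values(['player', 'pos', 'year', 'wk']).reset_index(drop=True)
means = (out.groupby(['player', 'pos', 'year'])[['pa', 'ra']]
            .expanding().mean()
            .reset_index(drop=True))
out['avg_pa'] = means['pa'].to_numpy()
out['avg_ra'] = means['ra'].to_numpy()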