Reshape wide to long for many columns with a common prefix - pandas

My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
df = pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
                   'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
                   'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
                   'p1.g': {0: 6, 1: 24},
                   'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
                   'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
                   'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
                   'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?

A couple of options to do this:
With pd.wide_to_long, you need to reorder the name parts around the delimiter; in this case we move the a, b, ... to the front and the p1, p2 to the back before reshaping:
temp = df.copy()
temp = temp.rename(columns=lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames=["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i="index",
                 j="side")
   .droplevel('index')
   .reset_index())
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation of pd.wide_to_long is that we had to reorder the name parts first. The other limitation is that the stubnames have to be explicitly specified.
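If you would rather not hard-code the stubnames, a possible sketch (assuming every column follows the prefix.letter pattern of this example) is to derive them from the original column names:
# derive each stubname from the part after the delimiter, e.g. 'p1.a' -> 'a'
stubnames = sorted({c.split(".", 1)[1] for c in df.columns})
# ['a', 'b', 'c', 'd', 'e', 'f', 'g']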
Another option is via stack, where the columns are split on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index=None,
                names_to=("side", ".value"),
                names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The key here is .value. It tells pivot_longer that anything after the . should remain as column names, while anything before the . should be collected into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated; pivot_longer abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
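For completeness, here is a rough pandas-only sketch of what that pivot_longer call is doing under the hood with melt (not from the original answer; it assumes pandas >= 1.1 for melt(ignore_index=False) and a list-valued pivot index):
long = df.melt(ignore_index=False).reset_index()
long[['side', 'letter']] = long['variable'].str.split('.', n=1, expand=True)
(long.pivot(index=['index', 'side'], columns='letter', values='value')
     .reset_index('side')
     .reset_index(drop=True))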
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
First split the columns and convert them into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
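For reference, the same four steps can be written as a single chain (just a sketch; set_axis and pipe stand in for the intermediate temp assignments):
(df
 .set_axis(df.columns.str.split(".", expand=True), axis=1)
 .stack(1)
 .droplevel(0)
 .rename_axis('side')
 .pipe(lambda d: d.set_axis(d.columns.map(".".join), axis=1))
 .reset_index())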
With pivot_longer, one option is to reorder the parts of the column names before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version, however, the column reorder is not necessary; we can simply use multiple .value entries to reshape the dataframe. Note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
 .pivot_longer(
     index=None,
     names_to=('.value', 'side', '.value'),
     names_pattern=r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
 .pivot_longer(
     index=None,
     names_to=('.value', 'side', '.value'),
     names_sep=r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72

Related

Is there an easy way to select across rows in a panda frame to create a new column?

Question updated, see below
I have a large dataframe similar in structure to e.g.
df=pd.DataFrame({'A': [0, 0, 0, 11, 22,33], 'B': [10, 20,30, 110, 220, 330], 'C':['x', 'y', 'z', 'x', 'y', 'z']})
df
A B C
0 0 10 x
1 0 20 y
2 0 30 z
3 11 110 x
4 22 220 y
5 33 330 z
I want to create a new column by taking the value of B from a different row: the row whose C value equals the current row's C and whose A value is 0. The expected result is:
A B C new_B_based_on_A_and_C
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
I want a simple solution without a for loop over the rows. Something like:
df.apply(lambda row: df[(df['C'] == row.C) & (df['A'] == 0)]['B'].iloc[0], axis=1)
The dataframe is guaranteed to have those values, and the values are unique.
Update for a more general case
I am looking for a general solution that would also work with multiple columns to match on, e.g.
df=pd.DataFrame({'A': [0, 0, 0,0, 11, 22,33, 44], 'B': [10, 20,30, 40, 110, 220, 330, 440], 'C':['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'], 'D': [1, 1, 5, 5, 1,1 ,5, 5]})
A B C D
0 0 10 x 1
1 0 20 y 1
2 0 30 x 5
3 0 40 y 5
4 11 110 x 1
5 22 220 y 1
6 33 330 x 5
7 44 440 y 5
and the result would be then
A B C D new_B_based_on_A_C_D
0 0 10 x 1 10
1 0 20 y 1 20
2 0 30 x 5 30
3 0 40 y 5 40
4 11 110 x 1 10
5 22 220 y 1 20
6 33 330 x 5 30
7 44 440 y 5 40
You can do a map:
# you **must** make sure that for each unique `C` value,
# there is only one row with `A==0`.
df['new'] = df['C'].map(df.loc[df['A']==0].set_index('C')['B'])
Output:
A B C new
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
Explanation: Imagine you have a series s indicating the mapping:
idx
idx1 value1
idx2 value2
idx3 value3
then that's what map does: df['C'].map(s).
Now, for a dataframe d:
C B
c1 b1
c2 b2
c3 b3
we do s=d.set_index('C')['B'] to get the above form.
Finally, as mentioned, your mapping happens where A==0, so d = df[df['A']==0].
Composing the forward path:
mapping_data = df[df['A']==0]
mapping_series = mapping_data.set_index('C')['B']
new_values = df['C'].map(mapping_series)
and the first piece of code is just all these lines combined.
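To guard against the uniqueness requirement mentioned in the comment above, a quick sanity check (just a suggestion, not part of the original answer) could be:
# every unique C value should appear in exactly one A == 0 row
assert df.loc[df['A'] == 0, 'C'].is_unique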
If I understood the question, for the general case you could use a merge like this:
df.merge(df.loc[df['A'] == 0, ['B', 'C', 'D']], on=['C', 'D'], how='left', suffixes=('', '_new'))
Output:
A B C D B_new
0 0 10 x 1 10
1 0 20 y 1 20
2 0 30 x 5 30
3 0 40 y 5 40
4 11 110 x 1 10
5 22 220 y 1 20
6 33 330 x 5 30
7 44 440 y 5 40
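If you prefer the lookup style of the first answer for the multi-column case, a possible sketch is to join onto a Series indexed by the key columns (the name new_B_based_on_A_C_D is just illustrative, and it assumes each (C, D) pair has exactly one A == 0 row):
lookup = df.loc[df['A'] == 0].set_index(['C', 'D'])['B']
df.join(lookup.rename('new_B_based_on_A_C_D'), on=['C', 'D'])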

Length of passed values is 1, index implies 10

Why does this error occur, and what does it mean? It shows "Length of passed values is 1, index implies 10". I have run the code many times and get the same error:
ser = pd.Series(np.random.randint(1, 50, 10))
result = np.argwhere(ser % 3==0)
print(result)
argwhere() operates on a numpy array, not a pandas Series. See below:
a = np.random.randint(1, 50, 12)
a = pd.Series(a)
print(a)
np.argwhere(a.values%3==0)
output
0 28
1 46
2 4
3 40
4 19
5 26
6 6
7 24
8 26
9 30
10 33
11 27
dtype: int64
array([[ 6],
[ 7],
[ 9],
[10],
[11]])
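If the goal is the positions of the multiples of 3 (what np.argwhere gives for a plain array), another possible sketch is np.flatnonzero, which accepts the boolean Series directly (assuming numpy is imported as np):
np.flatnonzero(a % 3 == 0)
# array([ 6,  7,  9, 10, 11])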
Please read the documentation for numpy.random.randint; you will see that the parameters are (low, high, size).
In your case you are passing (1, 50, 10), so 10 random numbers are generated between 1 and 49.
If you want the multiples of 3, you need ser[ser % 3 == 0], not np.argwhere.
See similar issue raised earlier and answered on Stack Overflow
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 50, 10))
print (ser)
result = ser[ser % 3==0]
print(result)
Output of this will be:
Original Series.
0 17
1 34
2 29
3 15
4 24
5 20
6 21
7 48
8 6
9 42
dtype: int64
Multiples of 3 will be:
3 15
4 24
6 21
7 48
8 6
9 42
dtype: int64
Use Index.tolist:
In [1374]: ser
Out[1374]:
0 44
1 5
2 35
3 10
4 16
5 20
6 25
7 9
8 44
9 16
dtype: int64
In [1372]: l = ser[ser % 3 == 0].index.tolist()
In [1373]: l
Out[1373]: [7]
where l will be a list of indexes of elements which are a multiple of 3.

Pandas Dataframe aggregate different groups of columns

I have a dataframe
df = pd.DataFrame({'a': np.random.randint(1, 10, 8),
                   'b': np.random.randint(1, 10, 8),
                   'c': np.random.randint(1, 10, 8),
                   'd': np.random.randint(1, 10, 8),
                   'group': ['g1'] * 5 + ['g2'] * 3})
# left col is the index
>> a b c d group
0 5 6 3 2 g1
1 5 6 6 6 g1
2 3 9 5 3 g1
3 5 6 8 2 g1
4 2 2 9 6 g1
5 9 5 4 8 g2
6 1 3 5 2 g2
7 3 8 8 6 g2
I want to groupby "group" column and then do a few different operations:
• For column "a" I want to get the min and max value
• For the rest I want to sum them
min_max_col = ['a']
sum_cols = ['b','c','d']
Is there a simple way to do this?
The result should look something like this:
>> min max sum_b sum_c sum_d
g1 2 5 29 31 19
g2 1 9 16 17 16
Use agg
df = df.groupby('group').agg({'a':[ np.min, np.max], 'b': np.sum, 'c': np.sum, 'd': np.sum})
df.columns = ['min', 'max', 'sum_b', 'sum_c', 'sum_d']
df = df.reset_index()
group min max sum_b sum_c sum_d
0 g1 2 5 29 31 19
1 g2 1 9 16 17 16
This is different because it leverages pandas' internally referenced sum, min, and max functions (passed as strings rather than numpy callables). It is my opinion that we should leverage those as much as possible.
f = dict(
    a=['min', 'max'],
    b='sum',
    c='sum',
    d='sum'
)
df.groupby('group').agg(f)
a b c d
min max sum sum sum
group
g1 2 5 29 31 19
g2 1 9 16 17 16
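As a further option (a sketch, not from the answers above): named aggregation, available since pandas 0.25, produces the flat column names requested in the question directly, without a separate rename:
(df.groupby('group')
   .agg(min=('a', 'min'),
        max=('a', 'max'),
        sum_b=('b', 'sum'),
        sum_c=('c', 'sum'),
        sum_d=('d', 'sum'))
   .reset_index())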

Element wise multiplication of each row

I have two DataFrame objects, and I want to multiply each row of the first element-wise by the second:
df_prob_wc.shape # (3505, 13)
df_prob_c.shape # (13, 1)
I thought I could do it with DataFrame.apply()
df_prob_wc.apply(lambda x: x.multiply(df_prob_c), axis=1)
which gives me:
TypeError: ("'int' object is not iterable", 'occurred at index $')
or with
df_prob_wc.apply(lambda x: x * df_prob_c, axis=1)
which gives me:
TypeError: 'int' object is not iterable
But it's not working.
However, I can do this:
df_prob_wc.apply(lambda x: x * np.asarray([1,2,3,4,5,6,7,8,9,10,11,12,13]), axis=1)
What am I doing wrong here?
It seems you need to multiply by the Series created from df_prob_c via iloc:
df_prob_wc = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df_prob_wc)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df_prob_c = pd.DataFrame([4, 5, 6, 1, 2, 3])
# align the data: give df_prob_c the same index labels as df_prob_wc's columns
df_prob_c.index = df_prob_wc.columns
print (df_prob_c)
0
A 4
B 5
C 6
D 1
E 2
F 3
print (df_prob_wc.shape)
(3, 6)
print (df_prob_c.shape)
(6, 1)
print (df_prob_c.iloc[:,0])
A 4
B 5
C 6
D 1
E 2
F 3
Name: 0, dtype: int64
print (df_prob_wc.mul(df_prob_c.iloc[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
Another solution is to multiply by a numpy array; you only need [:, 0] to select the first column:
print (df_prob_wc.mul(df_prob_c.values[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
And another solution with DataFrame.squeeze:
print (df_prob_wc.mul(df_prob_c.squeeze(), axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
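One more possible sketch (not from the original answer): drop down to plain numpy broadcasting, which sidesteps index alignment entirely; an (n, k) array times a (k,) array multiplies each row element-wise:
pd.DataFrame(df_prob_wc.to_numpy() * df_prob_c.to_numpy().ravel(),
             index=df_prob_wc.index, columns=df_prob_wc.columns)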

Filter multiples in a pandas dataframe

My data can be easily converted into a pandas dataframe that looks something like:
import pandas as pd
data = {'a': ["t", "g"] * 9,
        'b': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
        'distance': [10, 15, 290, 300, 315, 320, 350, 360, 10, 25, 225, 240, 325, 335, 365, 205, 15, 35]}
df = pd.DataFrame(data, columns=['a', 'b', 'distance'])
print(df)
a b distance
0 t 1 10
1 g 2 15
2 t 3 290
3 g 4 300
4 t 5 315
5 g 6 320
6 t 1 350
7 g 2 360
8 t 3 10
9 g 4 25
10 t 5 225
11 g 6 240
12 t 1 325
13 g 2 335
14 t 3 365
15 g 4 205
16 t 5 15
17 g 6 35
I want to erase all the lines that have the same value in the "b" column but keep the one line with the smallest value in the "distance" column. In this case I would like to erase all the lines that have a "distance" greater than 200 so that, in this example, only the lines with the index 0,1,8,9,16,17 remain. In the end all the lines should have a different "b" value and the smallest "distance". It would look like:
a b distance
0 t 1 10
1 g 2 15
2 t 3 10
3 g 4 25
4 t 5 15
5 g 6 35
How could I do that?
groupby on the b column and call idxmin on the distance column, then use the result to index the original df:
In [114]:
df.loc[df.groupby('b')['distance'].idxmin()]
Out[114]:
a b distance
0 t 1 10
1 g 2 15
8 t 3 10
9 g 4 25
16 t 5 15
17 g 6 35
Here you can see that idxmin returns the indices of the lowest values:
In [115]:
df.groupby('b')['distance'].idxmin()
Out[115]:
b
1 0
2 1
3 8
4 9
5 16
6 17
Name: distance, dtype: int64
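A possible alternative sketch (not in the original answer): sort by distance and keep the first row per b, which selects the same rows:
df.sort_values('distance').drop_duplicates('b').sort_index()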
Try this:
df.groupby('b')[['a', 'b', 'distance']].min()
# a b distance
# b
# 1 t 1 10
# 2 g 2 15
# 3 t 3 10
# 4 g 4 25
# 5 t 5 15
# 6 g 6 35