Is there an easy way to select across rows in a pandas DataFrame to create a new column? - pandas

Question updated, see below
I have a large dataframe similar in structure to e.g.
df=pd.DataFrame({'A': [0, 0, 0, 11, 22,33], 'B': [10, 20,30, 110, 220, 330], 'C':['x', 'y', 'z', 'x', 'y', 'z']})
df
A B C
0 0 10 x
1 0 20 y
2 0 30 z
3 11 110 x
4 22 220 y
5 33 330 z
I want to create a new column by selecting the column value of B from a different row based on the value of C being equal to the current row and the value of A being 0, so the expected result is
A B C new_B_based_on_A_and_C
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
I want to have a simple solution without needing to have a for loop over the rows. Something like
df.apply(lambda row: df[(df['C']==row.C) & (df['A']==0)]['B'].iloc[0], axis=1)
The dataframe is guaranteed to have those values and the values are unique
Update for a more general case
I am looking for a general solution that would also work for multiple columns to match on e.g.
df=pd.DataFrame({'A': [0, 0, 0,0, 11, 22,33, 44], 'B': [10, 20,30, 40, 110, 220, 330, 440], 'C':['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'], 'D': [1, 1, 5, 5, 1,1 ,5, 5]})
A B C D
0 0 10 x 1
1 0 20 y 1
2 0 30 x 5
3 0 40 y 5
4 11 110 x 1
5 22 220 y 1
6 33 330 x 5
7 44 440 y 5
and the result would be then
A B C D new_B_based_on_A_C_D
0 0 10 x 1 10
1 0 20 y 1 20
2 0 30 x 5 30
3 0 40 y 5 40
4 11 110 x 1 10
5 22 220 y 1 20
6 33 330 x 5 30
7 44 440 y 5 40

You can do a map:
# you **must** make sure that for each unique `C` value,
# there is only one row with `A==0`.
df['new'] = df['C'].map(df.loc[df['A']==0].set_index('C')['B'])
Output:
A B C new
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
Explanation: Imagine you have a series s indicating the mapping:
idx
idx1 value1
idx2 value2
idx3 value3
then that's what map does: df['C'].map(s).
Now, for a dataframe d:
C B
c1 b1
c2 b2
c3 b3
we do s=d.set_index('C')['B'] to get the above form.
Finally, as mentioned, your mapping happens where A==0, so d = df[df['A']==0].
Composing the forward path:
mapping_data = df[df['A']==0]
mapping_series = mapping_data.set_index('C')['B']
new_values = df['C'].map(mapping_series)
and the first piece of code is just all these lines combined.
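As a runnable check of the steps above, here is a minimal sketch on the question's first example frame. The `is_unique` assertion is an addition for illustration only; it guards the requirement that each `C` value has exactly one row with `A == 0`.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 11, 22, 33],
                   'B': [10, 20, 30, 110, 220, 330],
                   'C': ['x', 'y', 'z', 'x', 'y', 'z']})

# Lookup table: C value -> B value, taken from the rows where A == 0
mapping_series = df.loc[df['A'] == 0].set_index('C')['B']

# Illustrative guard (not part of the original answer): one A == 0 row per C
assert mapping_series.index.is_unique

# Replace each C value by the B value it maps to
df['new'] = df['C'].map(mapping_series)
print(df['new'].tolist())  # [10, 20, 30, 10, 20, 30]
```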

If I understood the question, for the general case you could use a merge like this:
df.merge(df.loc[df['A'] == 0, ['B', 'C', 'D']], on=['C', 'D'], how='left', suffixes=('', '_new'))
Output:
A B C D B_new
0 10 x 1 10
0 20 y 1 20
0 30 x 5 30
0 40 y 5 40
11 110 x 1 10
22 220 y 1 20
33 330 x 5 30
44 440 y 5 40
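The merge can be tried end to end on the general-case frame from the question. This is a sketch rather than a definitive version: the `lookup` and `result` names are illustrative, and `validate='many_to_one'` is an optional addition that makes pandas raise if a (`C`, `D`) pair has more than one row with `A == 0`.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 11, 22, 33, 44],
                   'B': [10, 20, 30, 40, 110, 220, 330, 440],
                   'C': ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
                   'D': [1, 1, 5, 5, 1, 1, 5, 5]})

# Rows with A == 0 act as the lookup table, keyed by (C, D)
lookup = df.loc[df['A'] == 0, ['B', 'C', 'D']]

# Left merge keeps every original row; the looked-up B arrives as B_new
result = df.merge(lookup, on=['C', 'D'], how='left',
                  suffixes=('', '_new'), validate='many_to_one')
print(result['B_new'].tolist())  # [10, 20, 30, 40, 10, 20, 30, 40]
```

To match on more columns, extend both the column selection in `lookup` and the `on` list.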

Related

`groupby` - `qcut` but with condition

I have a dataframe as follow:
key1 key2 val
0 a x 8
1 a x 6
2 a x 7
3 a x 4
4 a x 9
5 a x 1
6 a x 2
7 a x 3
8 a x 10
9 a x 5
10 a y 4
11 a y 9
12 a y 1
13 a y 2
14 b x 17
15 b x 15
16 b x 18
17 b x 19
18 b x 12
19 b x 20
20 b x 14
21 b x 13
22 b x 16
23 b x 11
24 b y 2
25 b y 3
26 b y 10
27 b y 5
28 b y 4
29 b y 24
30 b y 22
What I need to do is:
Access each group by key1
Within each key1 group, run qcut on the observations where key2 == x
Assign observations that fall outside the bin range to the lowest or highest bin
In the dataframe above, the first group (key1 = a) spans index 0-13. However, only index 0-9 are used to create the bins (thresholds). The bins are then applied to index 0-13.
The second group (key1 = b) spans index 14-30. Only index 14-23 are used to create the bins, which are then applied to index 14-30.
However, index 24-28 and index 29-30 fall outside the bin range. Index 24-28 are assigned to the smallest bin and index 29-30 to the largest bin.
The output looks like this:
key1 key2 val labels
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
My solution: I create a dict to hold the bins (for simplicity, take qcut=2):
dict_bins = {}
key_unique = data['key1'].unique()
for k in key_unique:
    sub = data[(data['key1'] == k) & (data['key2'] == 'x')].copy()
    dict_bins[k] = pd.qcut(sub['val'], 2, labels=False, retbins=True)[1]
Then, I intend to use groupby with apply, but get stuck on accessing dict_bins
data['sort_key1'] = data.groupby(['key1'])['val'].apply(lambda g: --- stuck---)
Any other solution, or modification to my solution is appreciated.
Thank you
A first approach is to create a custom function:
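import numpy as np

def discretize(df):
    # Quantile edges are computed from the key2 == 'x' rows only
    bins = pd.qcut(df.loc[df['key2'] == 'x', 'val'], 2, labels=False, retbins=True)[1]
    # Open the outer edges so out-of-range values land in the lowest/highest bin
    bins = [-np.inf] + bins[1:-1].tolist() + [np.inf]
    return pd.cut(df['val'], bins, labels=False)
df['label'] = df.groupby('key1').apply(discretize).droplevel(0)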
Output:
>>> df
key1 key2 val label
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
You need to drop the first level of index to align indexes:
>>> df.groupby('key1').apply(discretize)
key1 # <- you have to drop this index level
a 0 1
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 1
12 0
13 0
b 14 1
15 0
16 1
17 1
18 0
19 1
20 0
21 0
22 1
23 0
24 0
25 0
26 0
27 0
28 0
29 1
30 1
Name: val, dtype: int64
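The same idea can be written with an explicit loop over the groups, which avoids the extra index level altogether. This is a sketch under assumptions: the small frame below is made up for illustration (it is not the question's data), and the `q` parameter and the `edges` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: per key1 group, four 'x' rows define the bins, two 'y' rows fall outside them
df = pd.DataFrame({
    'key1': ['a'] * 6 + ['b'] * 6,
    'key2': ['x', 'x', 'x', 'x', 'y', 'y'] * 2,
    'val': [1, 2, 3, 4, 0, 10, 10, 20, 30, 40, 5, 100],
})

def discretize(group, q=2):
    # Quantile edges from the key2 == 'x' rows only
    edges = pd.qcut(group.loc[group['key2'] == 'x', 'val'], q,
                    labels=False, retbins=True)[1]
    # Replace the outer edges with -inf/+inf so every value gets a bin
    edges = [-np.inf] + edges[1:-1].tolist() + [np.inf]
    return pd.cut(group['val'], edges, labels=False)

# Each piece keeps the original row index, so assignment aligns by index
df['label'] = pd.concat([discretize(g) for _, g in df.groupby('key1')])
print(df['label'].tolist())  # [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
```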

Reshape wide to long for many columns with a common prefix

My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
'p1.g': {0: 6, 1: 24},
'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
With pd.wide_to_long, you need to reorder the parts of each column name around the delimiter; in this case we move a, b, ... to the front and p1, p2 to the back before reshaping:
temp = df.copy()
temp = temp.rename(columns = lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames = ["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i = "index",
                 j = "side")
   .droplevel('index')
   .reset_index()
)
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation of pd.wide_to_long is that the column names have to be reordered first. The other is that the stubnames have to be specified explicitly.
Another option is via stack, where the columns are split, based on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
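For reference, a self-contained sketch of the stack route on a trimmed version of the example (only the `a` and `b` columns, an assumption made to keep it short). The final sort is an addition so that the row order matches the p1-then-p2 layout requested in the question; depending on the pandas version, `stack` may emit a FutureWarning about its new implementation.

```python
import pandas as pd

df = pd.DataFrame({'p1.a': [4, 0], 'p1.b': [1, 4],
                   'p2.a': [0, 0], 'p2.b': [3, 12]})

temp = df.copy()
# 'p1.a' -> ('p1', 'a'): the columns become a two-level MultiIndex
temp.columns = temp.columns.str.split('.', expand=True)

long_df = (temp.stack(0)              # move the p1/p2 level into the row index
               .droplevel(0)          # drop the original row labels
               .rename_axis('side')
               .reset_index()
               # stable sort groups the rows by side, keeping the original row order within each
               .sort_values('side', kind='mergesort', ignore_index=True))
print(long_df)
```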
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index = None,
names_to = ("side", ".value"),
names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The worker here is .value. This tells the code that anything after . should remain as column names, while anything before . should be collated into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated - it abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
first split the columns and convert into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
With pivot_longer, one option is to reorder the columns, before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version however, the column reorder is not necessary; we can simply use multiple .value to reshape the dataframe - note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_pattern = r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_sep = r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72

Select dataframe columns based on column values in Pandas

My dataframe looks like:
A B C D .... Y Z
0 5 12 14 4 2
3 6 15 10 1 30
2 10 20 12 5 15
I want to create another dataframe that only contains the columns with an average value greater than 10:
C D .... Z
12 14 2
15 10 30
20 12 15
Use:
df1 = df.loc[:, df.mean() > 10]
print (df1)
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.mean())
A 1.666667
B 7.000000
C 15.666667
D 12.000000
Y 3.333333
Z 15.666667
dtype: float64
print (df.mean() > 10)
A False
B False
C True
D True
Y False
Z True
dtype: bool
Alternative:
print (df[df.columns[df.mean() > 10]])
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.columns[df.mean() > 10])
Index(['C', 'D', 'Z'], dtype='object')
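A runnable version of the selection, assuming the frame holds only the six numeric columns shown in the output above (the elided columns from the question are left out). The `col_means` and `selected` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 3, 2], 'B': [5, 6, 10], 'C': [12, 15, 20],
                   'D': [14, 10, 12], 'Y': [4, 1, 5], 'Z': [2, 30, 15]})

# One mean per column; comparing it to 10 gives a boolean mask over the columns
col_means = df.mean()
selected = df.loc[:, col_means > 10]
print(list(selected.columns))  # ['C', 'D', 'Z']
```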

Element wise multiplication of each row

I have two DataFrame objects, and I want to multiply each row of the first element-wise by the values in the second:
df_prob_wc.shape # (3505, 13)
df_prob_c.shape # (13, 1)
I thought I could do it with DataFrame.apply()
df_prob_wc.apply(lambda x: x.multiply(df_prob_c), axis=1)
which gives me:
TypeError: ("'int' object is not iterable", 'occurred at index $')
or with
df_prob_wc.apply(lambda x: x * df_prob_c, axis=1)
which gives me:
TypeError: 'int' object is not iterable
But it's not working.
However, I can do this:
df_prob_wc.apply(lambda x: x * np.asarray([1,2,3,4,5,6,7,8,9,10,11,12,13]), axis=1)
What am I doing wrong here?
It seems you need to multiply by a Series created from df_prob_c with iloc:
df_prob_wc = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df_prob_wc)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df_prob_c = pd.DataFrame([4,5,6,1,2,3])
# align the index with the columns of df_prob_wc
df_prob_c.index = df_prob_wc.columns
print (df_prob_c)
0
A 4
B 5
C 6
D 1
E 2
F 3
print (df_prob_wc.shape)
(3, 6)
print (df_prob_c.shape)
(6, 1)
print (df_prob_c.iloc[:,0])
A 4
B 5
C 6
D 1
E 2
F 3
Name: 0, dtype: int64
print (df_prob_wc.mul(df_prob_c.iloc[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
Another solution is to multiply by a NumPy array; [:,0] selects the first column:
print (df_prob_wc.mul(df_prob_c.values[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
And another solution with DataFrame.squeeze:
print (df_prob_wc.mul(df_prob_c.squeeze(), axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
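The variants above all turn the one-column frame into something one-dimensional before multiplying. A compact sketch with a three-column frame (the column name `'p'` and the variable names are made up for illustration) shows that the label-aligned and the positional route give the same result when the labels are in the same order:

```python
import pandas as pd

df_prob_wc = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# One weight per column of df_prob_wc, stored as a (3, 1) frame as in the question
df_prob_c = pd.DataFrame({'p': [4, 5, 6]}, index=df_prob_wc.columns)

weights = df_prob_c.squeeze()                 # (3, 1) frame -> Series indexed A, B, C
by_label = df_prob_wc.mul(weights, axis=1)    # aligned on column labels
by_position = df_prob_wc * df_prob_c.to_numpy().ravel()  # plain broadcasting

assert by_label.equals(by_position)
print(by_label.values.tolist())  # [[4, 20, 42], [8, 25, 48], [12, 30, 54]]
```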

Filter multiples in a pandas dataframe

My data can be easily converted into a pandas dataframe that looks something like:
import pandas as pd
data={'a':["t", "g"]*9,'b':[1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6],'distance':[10, 15, 290, 300, 315, 320, 350, 360, 10, 25, 225, 240, 325, 335, 365, 205, 15, 35]}
df=pd.DataFrame(data,columns=['a','b','distance'])
print(df)
a b distance
0 t 1 10
1 g 2 15
2 t 3 290
3 g 4 300
4 t 5 315
5 g 6 320
6 t 1 350
7 g 2 360
8 t 3 10
9 g 4 25
10 t 5 225
11 g 6 240
12 t 1 325
13 g 2 335
14 t 3 365
15 g 4 205
16 t 5 15
17 g 6 35
I want to erase all the lines that have the same value in the "b" column but keep the one line with the smallest value in the "distance" column. In this case I would like to erase all the lines that have a "distance" greater than 200 so that, in this example, only the lines with the index 0,1,8,9,16,17 remain. In the end all the lines should have a different "b" value and the smallest "distance". It would look like:
a b distance
0 t 1 10
1 g 2 15
2 t 3 10
3 g 4 25
4 t 5 15
5 g 6 35
How could I do that?
Group by column b and call idxmin on the distance column, then use the result to index the original df:
In [114]:
df.loc[df.groupby('b')['distance'].idxmin()]
Out[114]:
a b distance
0 t 1 10
1 g 2 15
8 t 3 10
9 g 4 25
16 t 5 15
17 g 6 35
Here you can see that idxmin returns the indices of the lowest values:
In [115]:
df.groupby('b')['distance'].idxmin()
Out[115]:
b
1 0
2 1
3 8
4 9
5 16
6 17
Name: distance, dtype: int64
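A self-contained check of the idxmin approach on the question's data, next to a sort-based alternative that is not from the answer above (sort by distance, then keep the first row per `b`). Both names, `via_idxmin` and `via_sort`, are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['t', 'g'] * 9,
                   'b': [1, 2, 3, 4, 5, 6] * 3,
                   'distance': [10, 15, 290, 300, 315, 320, 350, 360, 10,
                                25, 225, 240, 325, 335, 365, 205, 15, 35]})

# Row label of the smallest distance within each b group, used to pick whole rows
via_idxmin = df.loc[df.groupby('b')['distance'].idxmin()]

# Alternative: smallest distances first, keep one row per b, restore the row order
via_sort = df.sort_values('distance').drop_duplicates('b').sort_index()

assert via_idxmin.equals(via_sort)
print(via_idxmin.index.tolist())  # [0, 1, 8, 9, 16, 17]
```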
Try this:
df.groupby('b')[['a','b','distance']].min()
# a b distance
# b
# 1 t 1 10
# 2 g 2 15
# 3 t 3 10
# 4 g 4 25
# 5 t 5 15
# 6 g 6 35
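One caveat with the min-based version: the minimum is taken for each column separately, so the result is only guaranteed to be a real row when the other columns are constant within each group, as `a` is in the question's data. A toy frame (made up for illustration) shows the difference from the idxmin approach:

```python
import pandas as pd

# Within the single group b == 1, column a is NOT constant
toy = pd.DataFrame({'a': ['z', 'a'], 'b': [1, 1], 'distance': [1, 2]})

per_column = toy.groupby('b')[['a', 'distance']].min()
whole_row = toy.loc[toy.groupby('b')['distance'].idxmin()]

print(per_column.loc[1, 'a'])   # a  (smallest value of column a on its own)
print(whole_row['a'].iloc[0])   # z  (value from the row with the smallest distance)
```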