Select dataframe columns based on column values in pandas

My dataframe looks like:
   A   B   C   D  ...  Y   Z
0  0   5  12  14       4   2
1  3   6  15  10       1  30
2  2  10  20  12       5  15
I want to create another dataframe that only contains the columns with an average value greater than 10:
    C   D  ...   Z
0  12  14        2
1  15  10       30
2  20  12       15

Use:
df = df.loc[:, df.mean() > 10]
print (df)
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.mean())
A 1.666667
B 7.000000
C 15.666667
D 12.000000
Y 3.333333
Z 15.666667
dtype: float64
print (df.mean() > 10)
A False
B False
C True
D True
Y False
Z True
dtype: bool
Alternative:
print (df[df.columns[df.mean() > 10]])
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.columns[df.mean() > 10])
Index(['C', 'D', 'Z'], dtype='object')
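For a quick reproducible check, here is a minimal sketch that builds the example frame above and applies the selection (one caveat: on pandas 2.x, df.mean() raises for non-numeric columns unless you pass numeric_only=True, so add that if your real frame mixes dtypes):

import pandas as pd

df = pd.DataFrame({'A': [0, 3, 2], 'B': [5, 6, 10], 'C': [12, 15, 20],
                   'D': [14, 10, 12], 'Y': [4, 1, 5], 'Z': [2, 30, 15]})

# df.mean() returns a Series indexed by column label, so the boolean
# mask aligns with the columns inside .loc
mask = df.mean() > 10
print(df.loc[:, mask])   # keeps C, D, Z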

`groupby` + `qcut`, but with a condition

I have a dataframe as follows:
key1 key2 val
0 a x 8
1 a x 6
2 a x 7
3 a x 4
4 a x 9
5 a x 1
6 a x 2
7 a x 3
8 a x 10
9 a x 5
10 a y 4
11 a y 9
12 a y 1
13 a y 2
14 b x 17
15 b x 15
16 b x 18
17 b x 19
18 b x 12
19 b x 20
20 b x 14
21 b x 13
22 b x 16
23 b x 11
24 b y 2
25 b y 3
26 b y 10
27 b y 5
28 b y 4
29 b y 24
30 b y 22
What I need to do is:
Access each group by key1
In each group of key1, apply qcut to the observations where key2 == 'x'
For observations that fall outside the bin range, assign them to the lowest or highest bin
According to the dataframe above, the first group (key1 = a) spans indx=0-13, but only indx=0-9 are used to create the bins (thresholds). Those bins are then applied to indx=0-13.
The second group (key1 = b) spans indx=14-30. Only indx=14-23 are used to create the bins, which are then applied to indx=14-30.
However, indx=24-28 and indx=29-30 fall outside the bin range, so indx=24-28 are assigned to the smallest bin and indx=29-30 to the largest.
The output looks like this:
key1 key2 val labels
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
My attempt: I create a dict to hold the bins (for simplicity, take qcut with 2 bins):
dict_bins = {}
key_unique = data['key1'].unique()
for k in key_unique:
    sub = data[(data['key1'] == k) & (data['key2'] == 'x')].copy()
    dict_bins[k] = pd.qcut(sub['val'], 2, labels=False, retbins=True)[1]
Then I intend to use groupby with apply, but I get stuck on accessing dict_bins inside the lambda:
data['sort_key1'] = data.groupby(['key1'])['val'].apply(lambda g: --- stuck---)
Any other solution, or a modification to mine, is appreciated.
Thank you
A first approach is to create a custom function:
import numpy as np

def discretize(df):
    # compute the bin edges from the key2 == 'x' rows only
    bins = pd.qcut(df.loc[df['key2'] == 'x', 'val'], 2, labels=False, retbins=True)[1]
    # widen the outer edges so out-of-range values land in the extreme bins
    bins = [-np.inf] + bins[1:-1].tolist() + [np.inf]
    return pd.cut(df['val'], bins, labels=False)

df['label'] = df.groupby('key1').apply(discretize).droplevel(0)
Output:
>>> df
key1 key2 val label
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
You need to drop the first index level to align the indexes:
>>> df.groupby('key1').apply(discretize)
key1 # <- you have to drop this index level
a 0 1
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 1
12 0
13 0
b 14 1
15 0
16 1
17 1
18 0
19 1
20 0
21 0
22 1
23 0
24 0
25 0
26 0
27 0
28 0
29 1
30 1
Name: val, dtype: int64
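As for the asker's own dict_bins approach, here is one hedged way to finish it with transform (assuming dict_bins has been filled as in the question; this relies on pandas setting .name on each group Series to the group key for user-defined functions, so treat it as a sketch to verify):

import numpy as np
import pandas as pd

def apply_bins(g):
    # g.name carries the key1 value for this group
    edges = dict_bins[g.name]
    # widen the outer edges so out-of-range values land in the extreme bins
    edges = [-np.inf] + list(edges[1:-1]) + [np.inf]
    return pd.cut(g, edges, labels=False)

data['labels'] = data.groupby('key1')['val'].transform(apply_bins)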

Is there an easy way to select across rows in a pandas DataFrame to create a new column?

Question updated, see below
I have a large dataframe similar in structure to e.g.
df = pd.DataFrame({'A': [0, 0, 0, 11, 22, 33], 'B': [10, 20, 30, 110, 220, 330], 'C': ['x', 'y', 'z', 'x', 'y', 'z']})
df
A B C
0 0 10 x
1 0 20 y
2 0 30 z
3 11 110 x
4 22 220 y
5 33 330 z
I want to create a new column by taking the value of B from a different row: the row whose C equals the current row's C and whose A equals 0. The expected result is
A B C new_B_based_on_A_and_C
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
I want a simple solution without a for loop over the rows. Something like
df.apply(lambda row: df[(df['C'] == row.C) & (df['A'] == 0)]['B'].iloc[0], axis=1)
The dataframe is guaranteed to contain such rows, and they are unique.
Update for a more general case
I am looking for a general solution that would also work with multiple columns to match on, e.g.
df = pd.DataFrame({'A': [0, 0, 0, 0, 11, 22, 33, 44], 'B': [10, 20, 30, 40, 110, 220, 330, 440], 'C': ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'], 'D': [1, 1, 5, 5, 1, 1, 5, 5]})
A B C D
0 0 10 x 1
1 0 20 y 1
2 0 30 x 5
3 0 40 y 5
4 11 110 x 1
5 22 220 y 1
6 33 330 x 5
7 44 440 y 5
and the result would be then
A B C D new_B_based_on_A_C_D
0 0 10 x 1 10
1 0 20 y 1 20
2 0 30 x 5 30
3 0 40 y 5 40
4 11 110 x 1 10
5 22 220 y 1 20
6 33 330 x 5 30
7 44 440 y 5 40
You can do a map:
# you **must** make sure that for each unique `C` value,
# there is only one row with `A==0`.
df['new'] = df['C'].map(df.loc[df['A']==0].set_index('C')['B'])
Output:
A B C new
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
Explanation: Imagine you have a series s indicating the mapping:
idx
idx1 value1
idx2 value2
idx3 value3
then that's what map does: df['C'].map(s).
Now, for a dataframe d:
C B
c1 b1
c2 b2
c3 b3
we do s=d.set_index('C')['B'] to get the above form.
Finally, as mentioned, your mapping happens where A==0, so d = df[df['A']==0].
Composing the forward path:
mapping_data = df[df['A']==0]
mapping_series = mapping_data.set_index('C')['B']
new_values = df['C'].map(mapping_series)
and the first piece of code is just all these lines combined.
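One caveat worth knowing: map returns NaN for any C value that has no A==0 counterpart, so you may want a fillna afterwards. A hypothetical extra row (not from the question) illustrates this:

extra = pd.DataFrame({'A': [44], 'B': [440], 'C': ['w']})
s = df.loc[df['A'] == 0].set_index('C')['B']
print(extra['C'].map(s))   # 0   NaN  -> no A==0 row with C == 'w'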
If I understood the question, for the general case you could use a merge like this:
df.merge(df.loc[df['A'] == 0, ['B', 'C', 'D']], on=['C', 'D'], how='left', suffixes=('', '_new'))
Output:
    A    B  C  D  B_new
0   0   10  x  1     10
1   0   20  y  1     20
2   0   30  x  5     30
3   0   40  y  5     40
4  11  110  x  1     10
5  22  220  y  1     20
6  33  330  x  5     30
7  44  440  y  5     40
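If you prefer to stay with map for the multi-column case, a sketch under the same uniqueness assumption keys the lookup on (C, D) pairs; map can look tuples up in a MultiIndex:

# build the lookup from the A == 0 rows, keyed by (C, D)
lookup = df.loc[df['A'] == 0].set_index(['C', 'D'])['B']

# map each row's (C, D) tuple through the lookup
df['new_B_based_on_A_C_D'] = pd.Series(
    list(zip(df['C'], df['D'])), index=df.index).map(lookup)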

Split a column by element and create new ones with pandas

Goal: I want to split a single column by its elements (not by splitting string cells) and, from that division, create new columns, where each element becomes the title of a new column and the corresponding values from the other column fill it.
Is there a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative if the input data has 2 columns:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount as a counter, then reshape with DataFrame.set_index and Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If the input data has 2 columns:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
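For completeness, a pivot-based sketch of the same reshape (two-column layout assumed; the cumcount counter makes the row labels unique, which pivot requires):

g = df.groupby('col1').cumcount()
df2 = (df.assign(idx=g)
         .pivot(index='idx', columns='col1', values='col2')
         .rename_axis(None, axis=0).rename_axis(None, axis=1))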

Finding the sum of each column and combining them to find the top 3 highest values

a = pd.DataFrame(df.groupby('actor_1_name')['gross'].sum())
b = pd.DataFrame(df.groupby('actor_2_name')['gross'].sum())
c = pd.DataFrame(df.groupby('actor_3_name')['gross'].sum())
x = [a,b,c]
y = pd.concat(x)
p =['actor_1_name','actor_2_name','actor_3_name','gross']
df.loc[y.nlargest(3).index,p]
I want to find the sum of each column, then combine them to find the top 3 highest values, but I'm getting an error and am not sure how to fix it. I need some assistance.
I believe you need:
df = pd.DataFrame({'actor_1_name':['a','a','a','b','b','c','c','d','d','e'],
                   'actor_2_name':['d','d','a','c','b','c','c','d','e','e'],
                   'actor_3_name':['c','c','a','b','b','b','c','e','e','e'],
                   'gross':[1,2,3,4,5,6,7,8,9,10]})
print (df)
actor_1_name actor_2_name actor_3_name gross
0 a d c 1
1 a d c 2
2 a a a 3
3 b c b 4
4 b b b 5
5 c c b 6
6 c c c 7
7 d d e 8
8 d e e 9
9 e e e 10
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3)
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3)
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3)
x = [a,b,c]
print (x)
[actor_1_name
d 17
c 13
e 10
Name: gross, dtype: int64, actor_2_name
e 19
c 17
d 11
Name: gross, dtype: int64, actor_3_name
e 27
b 15
c 10
Name: gross, dtype: int64]
df1 = pd.concat(x, axis=1, keys=['actor_1_name','actor_2_name','actor_3_name'])
print (df1)
actor_1_name actor_2_name actor_3_name
b NaN NaN 15.0
c 13.0 17.0 10.0
d 17.0 11.0 NaN
e 10.0 19.0 27.0
EDIT1:
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3).reset_index()
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3).reset_index()
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3).reset_index()
x = [a,b,c]
print (x)
[ actor_1_name gross
0 d 17
1 c 13
2 e 10, actor_2_name gross
0 e 19
1 c 17
2 d 11, actor_3_name gross
0 e 27
1 b 15
2 c 10]
df1 = pd.concat(x, axis=1, keys=['a','b','c'])
df1.columns = df1.columns.map('-'.join)
print (df1)
a-actor_1_name a-gross b-actor_2_name b-gross c-actor_3_name c-gross
0 d 17 e 19 e 27
1 c 13 c 17 b 15
2 e 10 d 11 c 10
EDIT2:
a = df.groupby('actor_1_name')['gross'].sum().nlargest(3).reset_index(drop=True)
b = df.groupby('actor_2_name')['gross'].sum().nlargest(3).reset_index(drop=True)
c = df.groupby('actor_3_name')['gross'].sum().nlargest(3).reset_index(drop=True)
x = [a,b,c]
print (x)
[0 17
1 13
2 10
Name: gross, dtype: int64, 0 19
1 17
2 11
Name: gross, dtype: int64, 0 27
1 15
2 10
Name: gross, dtype: int64]
df1 = pd.concat(x, axis=1, keys=['actor_1_name','actor_2_name','actor_3_name'])
print (df1)
actor_1_name actor_2_name actor_3_name
0 17 19 27
1 13 17 15
2 10 11 10
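If the goal is instead a single combined ranking (the total gross per actor across all three columns), a melt-based sketch (my reading of the question, not part of the answer above) would be:

total = (df.melt(id_vars='gross',
                 value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name'],
                 value_name='actor')
           .groupby('actor')['gross'].sum()
           .nlargest(3))
print(total)

Note that this counts a film's gross once per column in which the actor appears.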

Element wise multiplication of each row

I have two DataFrame objects onto which I want to apply an element-wise multiplication, row by row:
df_prob_wc.shape # (3505, 13)
df_prob_c.shape # (13, 1)
I thought I could do it with DataFrame.apply()
df_prob_wc.apply(lambda x: x.multiply(df_prob_c), axis=1)
which gives me:
TypeError: ("'int' object is not iterable", 'occurred at index $')
or with
df_prob_wc.apply(lambda x: x * df_prob_c, axis=1)
which gives me:
TypeError: 'int' object is not iterable
But it's not working.
However, I can do this:
df_prob_wc.apply(lambda x: x * np.asarray([1,2,3,4,5,6,7,8,9,10,11,12,13]), axis=1)
What am I doing wrong here?
It seems you need to multiply by the Series created from df_prob_c via iloc:
df_prob_wc = pd.DataFrame({'A':[1,2,3],
                           'B':[4,5,6],
                           'C':[7,8,9],
                           'D':[1,3,5],
                           'E':[5,3,6],
                           'F':[7,4,3]})
print (df_prob_wc)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df_prob_c = pd.DataFrame([4,5,6,1,2,3])
# align the data: the index of df_prob_c must match the columns of df_prob_wc
df_prob_c.index = df_prob_wc.columns
print (df_prob_c)
0
A 4
B 5
C 6
D 1
E 2
F 3
print (df_prob_wc.shape)
(3, 6)
print (df_prob_c.shape)
(6, 1)
print (df_prob_c.iloc[:,0])
A 4
B 5
C 6
D 1
E 2
F 3
Name: 0, dtype: int64
print (df_prob_wc.mul(df_prob_c.iloc[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
Another solution is to multiply by a NumPy array; you only need [:,0] to select the first column:
print (df_prob_wc.mul(df_prob_c.values[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
And another solution with DataFrame.squeeze:
print (df_prob_wc.mul(df_prob_c.squeeze(), axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
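Coming back to the original shapes, a minimal self-contained sketch (random stand-ins for the real data, assuming the 13 rows of df_prob_c correspond positionally to the 13 columns of df_prob_wc):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_prob_wc = pd.DataFrame(rng.random((3505, 13)))   # shape (3505, 13)
df_prob_c = pd.DataFrame(rng.random((13, 1)))       # shape (13, 1)

# flatten (13, 1) to a 1-D array of length 13 so it broadcasts
# across each row of df_prob_wc
out = df_prob_wc.mul(df_prob_c.values[:, 0], axis=1)
assert out.shape == (3505, 13)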