How to change a DataFrame inside an apply function - pandas

I want to use apply to dynamically modify the contents of my DataFrame. The table looks like this:
   index  price  signal  stoploss
0      0   1000    True     990.0
1      1   1010   False     990.0
2      2   1020    True    1010.0
3      3   1000   False    1010.0
4      4    990   False    1010.0
5      5    980   False    1010.0
6      6   1000   False    1010.0
7      7   1020    True    1010.0
8      8   1030   False    1010.0
9      9   1040   False    1010.0
My code is:
import pandas as pd

def test(row, dd):
    if row.signal:
        dd['inorder'] = True
        row['stoploss'] = 1

df = pd.DataFrame({'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'price': [1000, 1010, 1020, 1000, 990, 980, 1000, 1020, 1030, 1040],
                   'signal': [True, False, True, False, False, False, False, True, False, False]})

if __name__ == '__main__':
    df['stoploss'] = df.loc[df['signal'], 'price'] - 10
    df['stoploss'].ffill(inplace=True)
    xx = dict(inorder=False)
    df.apply(lambda row: test(row, xx), axis=1)
    print(df)
When I trace into the function test, I can see that the value is indeed changed to 1, but outside the scope of test it seems to have no effect on the DataFrame.
I tried another way to modify the content of the DataFrame,
for k, row in df.iterrows():
    if row.signal:
        xx['inorder'] = True
        df.loc[k, 'stoploss'] = 1
This one works, but it is obviously a lot slower than apply.
The correct result I expect is:
   index  price  signal  stoploss
0      0   1000    True       1.0
1      1   1010   False     990.0
2      2   1020    True       1.0
3      3   1000   False    1010.0
4      4    990   False    1010.0
5      5    980   False    1010.0
6      6   1000   False    1010.0
7      7   1020    True       1.0
8      8   1030   False    1010.0
9      9   1040   False    1010.0
How can I achieve that assignment with apply, please?
Thanks

If you look at the docs for apply, you'll notice that apply does not change the DataFrame in place, but rather returns a new DataFrame where the function has been applied.
So, in your second-to-last line, you can try
df = df.apply(lambda row: test(row, xx), axis=1)
Edit:
IMO, this isn't very well documented, but the call
df.apply(func, axis=1) will apply func to each row and set each row of the result to the return value of func.
As written, your example won't work because the function you're applying doesn't return anything. The following minimal example works the way you intend.
df = pd.DataFrame({'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'price': [1000, 1010, 1020, 1000, 990, 980, 1000, 1020, 1030, 1040],
                   'signal': [True, False, True, False, False, False, False, True, False, False]})
df['stoploss'] = df.loc[df['signal'], 'price'] - 10
df['stoploss'].ffill(inplace=True)

def test(row):
    row.stoploss = 1 if row.signal else row.stoploss
    return row

modified_df = df.apply(test, axis=1)
As an aside, I don't think you actually need to use apply to get the result you want. Have you tried something like
df.loc[df['signal'] == True, 'stoploss'] = 1
That would be a much simpler and faster way to get your target output.
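Putting it together, a minimal end-to-end sketch of that vectorized approach, using the same data as the question (the non-inplace ffill is my choice, to avoid chained-assignment issues on newer pandas):
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'price': [1000, 1010, 1020, 1000, 990, 980, 1000, 1020, 1030, 1040],
                   'signal': [True, False, True, False, False, False, False, True, False, False]})

# Seed stoploss at the signal rows (alignment fills the rest with NaN),
# then forward-fill the gaps
df['stoploss'] = df.loc[df['signal'], 'price'] - 10
df['stoploss'] = df['stoploss'].ffill()

# Overwrite the signal rows with 1; no apply or iterrows needed
df.loc[df['signal'], 'stoploss'] = 1
print(df)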

Related

Pandas: Fast way to get cols/rows containing na

In pandas we can drop cols/rows with .dropna(how=..., axis=...), but is there a way to get an array-like of True/False indicators for each col/row, which would indicate whether a col/row contains NA according to the how and axis arguments?
I.e. is there a way to convert .dropna(how=..., axis=...) into a method which, instead of actually removing anything, would just tell us which cols/rows would be removed if we called .dropna(...) with the specific how and axis?
Thank you for your time!
You can use isna() to replicate the behaviour of dropna() without actually removing data. To mimic the how and axis parameters, you can add any() or all() and set the axis accordingly.
Here is a simple example:
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
df.isna()
Output:
      0     1      2
0  True  True  False
1  True  True   True
Eq. to dropna(how='any', axis=1), i.e. the columns that would be dropped (note the reduction runs along the opposite axis):
df.isna().any(axis=0)
Output:
0    True
1    True
2    True
dtype: bool
Eq. to dropna(how='any', axis=0), i.e. the rows that would be dropped:
df.isna().any(axis=1)
Output:
0    True
1    True
dtype: bool
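If you want this behind a single call, here is a minimal sketch of a hypothetical helper (the name dropna_mask is mine, not a pandas API) that maps how/axis to the corresponding reduction:
import pandas as pd

def dropna_mask(df, how='any', axis=0):
    # Flags the rows (axis=0) or columns (axis=1) that
    # df.dropna(how=how, axis=axis) would remove.
    # dropna(axis=0) inspects values across each row, so the
    # boolean reduction runs along the opposite axis.
    na = df.isna()
    reduce_axis = 1 - axis
    return na.any(axis=reduce_axis) if how == 'any' else na.all(axis=reduce_axis)

df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
print(dropna_mask(df, how='all', axis=0))  # only row 1 is all-NA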

Pandas PerformanceWarning: DataFrame is highly fragmented. What's the efficient solution?

Here is some generic code representing what is happening in my script:
import pandas as pd
import numpy as np

dic = {}
for i in np.arange(0, 10):
    dic[str(i)] = df = pd.DataFrame(np.random.randint(0, 1000, size=(5000, 20)),
                                    columns=list('ABCDEFGHIJKLMNOPQRST'))

df_out = pd.DataFrame(index=df.index)
for i in np.arange(0, 10):
    df_out['A_' + str(i)] = dic[str(i)]['A'].astype('int')
    df_out['D_' + str(i)] = dic[str(i)]['D'].astype('int')
    df_out['H_' + str(i)] = dic[str(i)]['H'].astype('int')
    df_out['I_' + str(i)] = dic[str(i)]['I'].astype('int')
    df_out['M_' + str(i)] = dic[str(i)]['M'].astype('int')
    df_out['O_' + str(i)] = dic[str(i)]['O'].astype('int')
    df_out['Q_' + str(i)] = dic[str(i)]['Q'].astype('int')
    df_out['R_' + str(i)] = dic[str(i)]['R'].astype('int')
    df_out['S_' + str(i)] = dic[str(i)]['S'].astype('int')
    df_out['T_' + str(i)] = dic[str(i)]['T'].astype('int')
    df_out['C_' + str(i)] = dic[str(i)]['C'].astype('int')
You will notice that as soon as the number of columns inserted into df_out exceeds 100, I get the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead
The question is: how could I use pd.concat() and still have the custom column names that depend on the dictionary key?
IMPORTANT: I would still like to keep a specific selection of columns, not all of them, like in the example: A, D, H, I, etc.
SPECIAL EDIT (based on Corralien's answer)
cols = {'A': 'float',
        'D': 'bool'}

out = pd.DataFrame()
for c, df in dic.items():
    for col, ftype in cols.items():
        out = pd.concat([out, df[[col]].add_suffix(f'_{c}')],
                        axis=1).astype(ftype)
Many thanks for your help!
You can use a comprehension with pd.concat:
cols = {'A': 'float', 'D': 'bool'}
out = pd.concat([df[cols].astype(cols).add_prefix(f'{k}_')
                 for k, df in dic.items()], axis=1)
print(out)
# Output:
     0_A   0_D    1_A   1_D    2_A   2_D    3_A   3_D
0  116.0  True  396.0  True  944.0  True  398.0  True
1  128.0  True  102.0  True  561.0  True   70.0  True
2  982.0  True  613.0  True  822.0  True  246.0  True
3  830.0  True  366.0  True  861.0  True  906.0  True
4  533.0  True  741.0  True  305.0  True  874.0  True
Use concat and flatten the resulting MultiIndex with map:
cols = ['A', 'D']
df_out = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1).astype('int')
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')
print(df_out)
   A_0  D_0  A_1  D_1  A_2  D_2  A_3  D_3
0  116  341  396  502  944  483  398  839
1  128  621  102   70  561  656   70  169
2  982   44  613  775  822  379  246   25
3  830  987  366  481  861  632  906  676
4  533  349  741  410  305  422  874   19
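More generally, the fix the warning itself suggests is to build all the pieces first and call pd.concat once, instead of growing df_out column by column. A minimal sketch of that pattern with the question's full column selection (assuming dic as built in the question):
import pandas as pd

cols = ['A', 'D', 'H', 'I', 'M', 'O', 'Q', 'R', 'S', 'T', 'C']

# Collect every output column as a Series keyed by its final name,
# then concatenate once; a single concat allocates the result in one
# pass, so the frame never becomes fragmented.
pieces = {f'{col}_{k}': df[col].astype('int')
          for k, df in dic.items()
          for col in cols}
df_out = pd.concat(pieces, axis=1)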

Reorder pandas DataFrame columns when the columns have a name

I want to reorder the columns of a DataFrame generated from crosstab. However, the method I used doesn't work, because the columns have a name (columns.name).
Example data:
d = {'levels': ['High', 'High', 'Mid', 'Low', 'Low', 'Low', 'Mid'],
     'converted': [True, True, True, False, False, True, False]}
df = pd.DataFrame(data=d)
df
  levels  converted
0   High       True
1   High       True
2    Mid       True
3    Low      False
4    Low      False
5    Low       True
6    Mid      False
Then I used crosstab to count it:
cb = pd.crosstab(df['levels'], df['converted'])
cb
converted  False  True
levels
High           0     2
Low            2     1
Mid            1     1
I want to swap the order of the two columns. I tried cb[[True, False]] and got the error ValueError: Item wrong length 2 instead of 3.
I guess it's because it has columns.name, which is converted.
Try sort_index: when the column dtype is bool, normal label-based slicing does not work.
cb.sort_index(axis=1, ascending=False)
Out[190]:
converted  True  False
levels
High          2      0
Low           1      2
Mid           1      1
You can try the DataFrame reindex method as below:
import pandas as pd

d = {'levels': ['High', 'High', 'Mid', 'Low', 'Low', 'Low', 'Mid'],
     'converted': [True, True, True, False, False, True, False]}
df = pd.DataFrame(data=d)
print(df)

cb = pd.crosstab(df['levels'], df['converted'])
print(cb)

column_titles = [True, False]
cb = cb.reindex(columns=column_titles)
print(cb)
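As an aside (beyond the answers above): cb[[True, False]] fails because pandas interprets a plain list of booleans as a row mask, hence the complaint about length 2 versus the 3 rows, not as column labels. Since this crosstab has exactly two columns, a purely positional reversal also sidesteps the issue:
cb = cb.iloc[:, ::-1]  # reverse column order by position; labels are never consulted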

Add items to a dataframe if item in column already present

Other than brute forcing it with loops, given a dataframe df:
       A  B     C
0   True  1  23.0
1  False  2  25.0
2    ...  ...   ...
and a list of dicts lod:
[{'A': True, 'B': 2, 'C': 23}, {'A': True, 'B': 1, 'C': 24}, ...]
I would like to add the first element of lod, {'A': True, 'B': 2, 'C': 23}, because 23.0 is already in the C column of df, but not the second element, {'A': True, 'B': 1, 'C': 24}, because 24 is not a value in the C column of df.
So: add each item of the list of dicts to the dataframe if its column value is already present in the dataframe, otherwise continue to the next element.
You can convert the list of dicts to a DataFrame, then use isin:
add = pd.DataFrame([{'A': True, 'B': 2, 'C': 23}, {'A': True, 'B': 1, 'C': 24}])
s = pd.concat([df, add[add.C.isin(df.C)]])
s
Out[464]:
       A  B     C
0   True  1  23.0
1  False  2  25.0
0   True  2  23.0
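One small follow-up (my addition, not part of the answer): the result keeps each piece's original labels, which is why 0 appears twice in the index. If a fresh 0..n-1 index is preferred, concat accepts ignore_index:
s = pd.concat([df, add[add.C.isin(df.C)]], ignore_index=True)  # renumbers the result 0..n-1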

Recode a pandas.Series containing 0, 1, and NaN to False, True, and NaN

Suppose I have a Series with NaNs:
pd.Series([0, 1, None, 1])
I want to transform this to be equal to:
pd.Series([False, True, None, True])
You'd think x == 1 would suffice, but instead, this returns:
pd.Series([False, True, False, True])
where the null value has become False. This is because np.nan == 1 returns False, rather than None or np.nan as in R.
Is there a nice, vectorized way to get what I want?
Maybe map can do it:
import pandas as pd

x = pd.Series([0, 1, None, 1])
print(x.map({1: True, 0: False}))
0    False
1     True
2      NaN
3     True
dtype: object
You can use where:
In [11]: (s == 1).where(s.notnull(), np.nan)
Out[11]:
0      0
1      1
2    NaN
3      1
dtype: float64
Note: the True and False have been cast to float as 0 and 1.
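As a further option (assuming pandas 1.0 or later), the nullable boolean dtype keeps the missing value as <NA> instead of falling back to float or object:
import pandas as pd

x = pd.Series([0, 1, None, 1])
print(x.astype('boolean'))  # 0 -> False, 1 -> True, NaN -> <NA>, dtype: boolean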