Pandas Shift: Looking for a better alternative [duplicate]

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[1, 0, 0], [4, 5, 0], [7, 7, 7], [7, 4, 5], [4, 5, 0], [7, 8, 9], [3, 2, 9], [9, 3, 6], [6, 8, 5]]),
                  columns=['a', 'b', 'c'],
                  index=['1/1/2000', '1/1/2001', '1/1/2002', '1/1/2003', '1/1/2004', '1/1/2005', '1/1/2006', '1/1/2007', '1/1/2008'])
df['a_1'] = df['a'].shift(1)
df['a_3'] = df['a'].shift(3)
df['a_5'] = df['a'].shift(5)
df['a_7'] = df['a'].shift(7)
Above is a dummy example of how I am shifting.
Issues:
1. I need an extra line for each shift period; can this be done in one go?
2. The df above is small; on a massive dataframe this operation is slow. I checked other questions: most attribute this to shift not being Cython-optimized. Is there a faster way (apart from numba, which a few answers do mention)?

nums = [1, 3, 5, 7]
pd.concat([df] + [df['a'].shift(i).to_frame(f'a_{i}') for i in nums], axis=1)
result:
a b c a_1 a_3 a_5 a_7
1/1/2000 1 0 0 NaN NaN NaN NaN
1/1/2001 4 5 0 1.0 NaN NaN NaN
1/1/2002 7 7 7 4.0 NaN NaN NaN
1/1/2003 7 4 5 7.0 1.0 NaN NaN
1/1/2004 4 5 0 7.0 4.0 NaN NaN
1/1/2005 7 8 9 4.0 7.0 1.0 NaN
1/1/2006 3 2 9 7.0 7.0 4.0 NaN
1/1/2007 9 3 6 3.0 4.0 7.0 1.0
1/1/2008 6 8 5 9.0 7.0 7.0 4.0
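If the concern is speed with many lags, one option is to fill every lagged column from a single pre-allocated NumPy array and concatenate once. A minimal sketch (the add_lags helper and its name are only an illustration, not part of the accepted answer):
import numpy as np
import pandas as pd

def add_lags(df, col, lags):
    # Pre-allocate one NaN array and fill each lag by slicing, avoiding repeated shift calls
    vals = df[col].to_numpy(dtype=float)
    out = np.full((len(vals), len(lags)), np.nan)
    for j, k in enumerate(lags):
        out[k:, j] = vals[:len(vals) - k]
    lagged = pd.DataFrame(out, index=df.index,
                          columns=[f'{col}_{k}' for k in lags])
    return pd.concat([df, lagged], axis=1)

add_lags(df, 'a', [1, 3, 5, 7])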


Setting multiple columns at once gives a "Not in index" error

import pandas as pd

df = pd.DataFrame(
    [
        [5, 2],
        [3, 5],
        [5, 5],
        [8, 9],
        [90, 55]
    ],
    columns=['max_speed', 'shield']
)
df.loc[(df.max_speed > df.shield), ['stat', 'delta']] \
    = 'overspeed', df['max_speed'] - df['shield']
I am setting multiple columns using .loc as above, but in some cases I get a "Not in index" error. Am I doing something wrong above?
Create a list of tuples with the same length as the number of True values in the mask, pairing the repeated scalar 'overspeed' with the filtered difference Series:
m = (df.max_speed > df.shield)
s = df['max_speed'] - df['shield']
df.loc[m, ['stat', 'delta']] = list(zip(['overspeed'] * m.sum(), s[m]))
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
Another idea is to use a helper DataFrame:
df.loc[m, ['stat', 'delta']] = pd.DataFrame({'stat':'overspeed', 'delta':s})[m]
Details:
print(list(zip(['overspeed'] * m.sum(), s[m])))
[('overspeed', 3), ('overspeed', 35)]
print (pd.DataFrame({'stat':'overspeed', 'delta':s})[m])
stat delta
0 overspeed 3
4 overspeed 35
Simplest is to assign separately:
df.loc[m, 'stat'] = 'overspeed'
df.loc[m, 'delta'] = df['max_speed'] - df['shield']
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0

Replace outliers in Pandas dataframe by NaN

I'd like to replace outliers with np.nan. I have a dataframe containing floats, ints, and NaNs such as:
import pandas as pd
import numpy as np

df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
with this function:
def outliers(s, replace=np.nan):
    Q1, Q3 = np.percentile(s, [25, 75])
    IQR = Q3 - Q1
    return s.where((s >= (Q1 - 1.5 * IQR)) & (s <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=1)
but the result I get is almost entirely NaN.
Any idea on what's going on? I'd like the outliers to be calculated column wise.
Thanks as always for your help.
Don't use apply. Here is the annotated code for an optimized version:
def mask_outliers(df, replace):
    # Calculate the Q1 and Q3 quantiles
    q = df.agg('quantile', q=[.25, .75])

    # Calculate IQR = Q3 - Q1
    iqr = q.loc[.75] - q.loc[.25]

    # Calculate the lower and upper limits used to decide outliers
    lower = q.loc[.25] - 1.5 * iqr
    upper = q.loc[.75] + 1.5 * iqr

    # Replace the values that do not lie within [lower, upper]
    return df.where(df.ge(lower) & df.le(upper), replace)
Result
mask_outliers(df_ex, np.nan)
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0
This answer addresses the question:
Any idea on what's going on? I'd like the outliers to be calculated column wise.
whereas the other (accepted) answer only provides a better solution for what you want to achieve.
There are two issues to fix in order to make your code do what it should:
The NaN values have to be removed from the column before calling np.percentile(), otherwise both Q1 and Q3 come out as NaN.
This is one of the reasons for the many NaN values in the result of applying your code to the DataFrame. np.percentile() behaves differently here from pandas' .agg('quantile', ...), which computes the Q1 and Q3 thresholds while implicitly skipping the NaN values (see the quick check below).
The axis value has to be changed from 1 to 0 (i.e. .apply(outliers, axis=0)) in order to compute the outliers column-wise.
This is the other reason for the many NaN values in your result. The only row that is not set entirely to NaN is the one that contains no NaN value itself; otherwise that row, too, would have all its values replaced by NaN for the reason explained above.
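For illustration of the percentile point above (a quick check, not part of the original answer):
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0])
print(np.percentile(arr, [25, 75]))     # [nan nan] -- NaN propagates into both quantiles
print(np.nanpercentile(arr, [25, 75]))  # NaN entries are skipped, like pandas' quantile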
The following changes to your code:
colmn_noNaN = colmn.dropna()
Q1, Q3 = np.percentile(colmn_noNaN, [25 ,75])
and
df_ex_o = df_ex.apply(outliers, axis=0)
will solve the issues. Below is the entire code and its output:
import pandas as pd
import numpy as np

df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
# print(df_ex)

def outliers(colmn, replace=np.nan):
    colmn_noNaN = colmn.dropna()
    Q1, Q3 = np.percentile(colmn_noNaN, [25, 75])
    IQR = Q3 - Q1
    return colmn.where((colmn >= (Q1 - 1.5 * IQR)) & (colmn <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=0)
print(df_ex_o)
gives:
a b c d e f g
0 NaN 1.0 NaN NaN NaN 2 3.0
1 NaN 3.0 2.0 2.2 4.0 3 3.0
2 2.0 4.0 3.0 NaN NaN 8 9.0
3 -0.5 2.0 4.0 NaN NaN 2 11.0
4 6.0 NaN 2.0 3.0 5.0 12 NaN
5 NaN 11.0 2.0 3.0 5.0 8 2.0

Pandas: DataFrame Rolling Average on a Row

I have a row of values in a dataframe and want to calculate the rolling average (3-period), adding it as a new row.
existing_row 1 2 3 4 5 6 7 8 9
create_new_row 2 3 4 5 6 7 8
Use DataFrame.rolling with axis=1 and mean:
print (df)
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 9
df1 = df.rolling(3, axis=1).mean()
print (df1)
0 1 2 3 4 5 6 7 8
0 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
If you need to join the result to the original, pass both to concat:
df = pd.concat([df, df1], ignore_index=True)
print (df)
0 1 2 3 4 5 6 7 8
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
1 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
Use rolling(3, axis=1).mean() and append the result:
out = df.append(df.rolling(3, axis=1).mean(), ignore_index=True)
print(out)
# Output
A B C D E F G H I
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
1 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
Setup:
df = pd.DataFrame({'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'D': {0: 4}, 'E': {0: 5},
'F': {0: 6}, 'G': {0: 7}, 'H': {0: 8}, 'I': {0: 9}})
print(df)
# Output
A B C D E F G H I
0 1 2 3 4 5 6 7 8 9
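Note that DataFrame.append has been removed and rolling(..., axis=1) is deprecated in newer pandas releases; a transpose-based sketch that avoids both (using the single-row df from the Setup above):
roll = df.T.rolling(3).mean().T                 # rolling mean across the columns via a transpose
out = pd.concat([df, roll], ignore_index=True)  # stack the rolling averages as a new row
print(out)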

sum values of columns containing certain strings in pandas

There's a dataframe. How can I sum the column values a001 + a002, and b + b1?
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'a001': [1, np.nan, 3, 4],
                    'a002': [2, 3, 4, 5],
                    'b': [1, 2, 3, 4],
                    'b1': [2, 3, 4, np.nan],
                    })
id a001 a002 b b1
0 1 1.0 2 1 2.0
1 2 NaN 3 2 3.0
2 3 3.0 4 3 4.0
3 4 4.0 5 4 NaN
The final result will be,
id a b
0 1 3 3
1 2 3 5
2 3 7 7
3 4 9 4
I used and modified an answer from a previous question, but it raises AttributeError: 'str' object has no attribute 'str'.
categories = ['a', 'b']

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.str.contains(cat)]

df2.groupby(correct_categories(df2.columns), axis=1).sum()
Create a function that returns the category if the argument matches a category; otherwise, it returns the argument unchanged.
import re

def get_col_grouper(cats):
    def col_grouper(x):
        return re.sub(f'^({"|".join(cats)}).*', r'\1', x)
    return col_grouper

df2.groupby(get_col_grouper(['a', 'b']), axis=1, sort=False).sum()
id a b c c1 d03 d06
0 1.0 3.0 3.0 1.0 1.0 4.0 8.0
1 2.0 3.0 5.0 2.0 2.0 5.0 4.0
2 3.0 7.0 7.0 4.0 4.0 6.0 3.0
3 4.0 9.0 4.0 9.0 9.0 7.0 0.0
Setup
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'a001': [1, np.nan, 3, 4],
                    'a002': [2, 3, 4, 5],
                    'b': [1, 2, 3, 4],
                    'b1': [2, 3, 4, np.nan],
                    'c': [1, 2, 4, 9],
                    'c1': [1, 2, 4, 9],
                    'd03': [4, 5, 6, 7],
                    'd06': [8, 4, 3, None],
                    })
Use .str.extract to get the categories, and groupby with axis=1:
df2.groupby(df2.columns.str.extract(r'(\D+)', expand=False),
            axis=1, sort=False).sum()
Output:
id a b
0 1.0 3.0 3.0
1 2.0 3.0 5.0
2 3.0 7.0 7.0
3 4.0 9.0 4.0
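Since groupby(..., axis=1) is deprecated in newer pandas releases, an equivalent sketch of the same idea via a transpose (assuming all columns are numeric, as in the Setup above):
prefixes = df2.columns.str.extract(r'(\D+)', expand=False)  # non-digit prefix of each column name
df2.T.groupby(prefixes, sort=False).sum().T                 # group the transposed rows, then flip back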

Is there a way to horizontally concatenate dataframes of same length while ignoring the index?

I have dataframes I want to horizontally concatenate while ignoring the index.
I know that for arithmetic operations, ignoring the index can lead to a substantial speedup if you use the numpy array .values instead of the pandas Series. Is it possible to horizontally concatenate or merge pandas dataframes whilst ignoring the index? (To my dismay, ignore_index=True does something else.) And if so, does it give a speed gain?
import pandas as pd
df1 = pd.Series(range(10)).to_frame()
df2 = pd.Series(range(10), index=range(10, 20)).to_frame()
pd.concat([df1, df2], axis=1)
# 0 0
# 0 0.0 NaN
# 1 1.0 NaN
# 2 2.0 NaN
# 3 3.0 NaN
# 4 4.0 NaN
# 5 5.0 NaN
# 6 6.0 NaN
# 7 7.0 NaN
# 8 8.0 NaN
# 9 9.0 NaN
# 10 NaN 0.0
# 11 NaN 1.0
# 12 NaN 2.0
# 13 NaN 3.0
# 14 NaN 4.0
# 15 NaN 5.0
# 16 NaN 6.0
# 17 NaN 7.0
# 18 NaN 8.0
# 19 NaN 9.0
I know I can get the result I want by resetting the index of df2, but I wonder whether there is a faster (perhaps numpy-based) method to do this?
np.column_stack
Absolutely equivalent to EdChum's answer.
pd.DataFrame(
    np.column_stack([df1, df2]),
    columns=df1.columns.append(df2.columns)
)
0 0
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
Pandas Option with assign
You can do many things with the new columns.
I don't recommend this!
df1.assign(**df2.add_suffix('_').to_dict('list'))
0 0_
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
A pure numpy method would be to use np.hstack:
In[33]:
np.hstack([df1,df2])
Out[33]:
array([[0, 0],
[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5],
[6, 6],
[7, 7],
[8, 8],
[9, 9]], dtype=int64)
This can easily be converted to a DataFrame by passing it as the data argument to the DataFrame constructor:
In[34]:
pd.DataFrame(np.hstack([df1,df2]))
Out[34]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
With respect to whether the data is contiguous: the individual columns will be treated as separate arrays, since a DataFrame is essentially a dict of Series. Because you're passing numpy arrays, there is no extra memory allocation or copying needed for a simple, homogeneous dtype, so it should be fast.
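If you prefer to stay in pandas and keep the original column names and dtypes, a minimal sketch of the reset-index approach mentioned in the question:
out = pd.concat([d.reset_index(drop=True) for d in (df1, df2)], axis=1)
print(out)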