Sum values of columns containing certain strings in pandas

There's a dataframe. How can I sum the column values a001 + a002, and b + b1?
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'a001': [1, np.nan, 3, 4],
                    'a002': [2, 3, 4, 5],
                    'b': [1, 2, 3, 4],
                    'b1': [2, 3, 4, np.nan],
                    })
   id  a001  a002  b   b1
0   1   1.0     2  1  2.0
1   2   NaN     3  2  3.0
2   3   3.0     4  3  4.0
3   4   4.0     5  4  NaN
The desired result is:
   id  a  b
0   1  3  3
1   2  3  5
2   3  7  7
3   4  9  4
I tried adapting an answer from a previous question, but it raises AttributeError: 'str' object has no attribute 'str'.
categories = ['a', 'b']

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.str.contains(cat)]

df2.groupby(correct_categories(df2.columns), axis=1).sum()
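The error occurs because col inside the list comprehension is a plain Python string, which has no .str accessor (that is Series-only). A minimal fix, as a sketch (on a pandas version that still supports groupby(..., axis=1)): test substring membership directly, and fall back to the column name itself when no category matches (e.g. id), so the grouper has one label per column:
categories = ['a', 'b']

def correct_categories(cols):
    # plain 'in' test instead of the Series-only .str.contains;
    # keep the original column name when no category matches
    return [next((cat for cat in categories if cat in col), col)
            for col in cols]

df2.groupby(correct_categories(df2.columns), axis=1).sum()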

Create a function that returns the category when its argument matches one, and otherwise returns the argument unchanged (so non-matching columns such as id pass through):
import re

def get_col_grouper(cats):
    def col_grouper(x):
        return re.sub(f'^({"|".join(cats)}).*', r'\1', x)
    return col_grouper

df2.groupby(get_col_grouper(['a', 'b']), axis=1, sort=False).sum()
    id    a    b    c   c1  d03  d06
0  1.0  3.0  3.0  1.0  1.0  4.0  8.0
1  2.0  3.0  5.0  2.0  2.0  5.0  4.0
2  3.0  7.0  7.0  4.0  4.0  6.0  3.0
3  4.0  9.0  4.0  9.0  9.0  7.0  0.0
Setup
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'a001': [1, np.nan, 3, 4],
                    'a002': [2, 3, 4, 5],
                    'b': [1, 2, 3, 4],
                    'b1': [2, 3, 4, np.nan],
                    'c': [1, 2, 4, 9],
                    'c1': [1, 2, 4, 9],
                    'd03': [4, 5, 6, 7],
                    'd06': [8, 4, 3, None],
                    })

Use .str.extract to get the categories, and groupby with axis=1:
df2.groupby(df2.columns.str.extract(r'(\D+)', expand=False),
            axis=1, sort=False).sum()
Output:
    id    a    b
0  1.0  3.0  3.0
1  2.0  3.0  5.0
2  3.0  7.0  7.0
3  4.0  9.0  4.0
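On recent pandas, where the axis=1 argument to groupby is deprecated, an equivalent sketch (same grouping, under that assumption) is to transpose, group the rows by the extracted categories, and transpose back:
df2.T.groupby(df2.columns.str.extract(r'(\D+)', expand=False),
              sort=False).sum().T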

Pandas Shift: Looking for a better alternative

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 0, 0], [4, 5, 0], [7, 7, 7], [7, 4, 5], [4, 5, 0],
                            [7, 8, 9], [3, 2, 9], [9, 3, 6], [6, 8, 5]]),
                  columns=['a', 'b', 'c'],
                  index=['1/1/2000', '1/1/2001', '1/1/2002', '1/1/2003', '1/1/2004',
                         '1/1/2005', '1/1/2006', '1/1/2007', '1/1/2008'])
df['a_1'] = df['a'].shift(1)
df['a_3'] = df['a'].shift(3)
df['a_5'] = df['a'].shift(5)
df['a_7'] = df['a'].shift(7)
Above is a dummy example of how I am shifting.
Issues:
1. Each shift period needs its own line; can this be done in one go?
2. The df above is small; on a massive dataframe this operation is slow. Most related questions attribute this to shift not being Cython-optimized. Is there a faster way (apart from numba, which a few answers do discuss)?
Build all the shifted columns in a list comprehension and concatenate them in one go:
nums = [1, 3, 5, 7]
pd.concat([df] + [df['a'].shift(i).to_frame(f'a_{i}') for i in nums], axis=1)
Result:
          a  b  c  a_1  a_3  a_5  a_7
1/1/2000  1  0  0  NaN  NaN  NaN  NaN
1/1/2001  4  5  0  1.0  NaN  NaN  NaN
1/1/2002  7  7  7  4.0  NaN  NaN  NaN
1/1/2003  7  4  5  7.0  1.0  NaN  NaN
1/1/2004  4  5  0  7.0  4.0  NaN  NaN
1/1/2005  7  8  9  4.0  7.0  1.0  NaN
1/1/2006  3  2  9  7.0  7.0  4.0  NaN
1/1/2007  9  3  6  3.0  4.0  7.0  1.0
1/1/2008  6  8  5  9.0  7.0  7.0  4.0
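If shift itself is the bottleneck on a massive frame, a rough numpy sketch (assuming the shifted column is numeric) allocates one result array and fills slices directly, avoiding a separate pandas call per lag:
import numpy as np
import pandas as pd

nums = [1, 3, 5, 7]
a = df['a'].to_numpy(dtype=float)
out = np.full((len(a), len(nums)), np.nan)   # one allocation for all lags
for j, k in enumerate(nums):
    out[k:, j] = a[:-k]                      # equivalent to df['a'].shift(k)
shifted = pd.DataFrame(out, index=df.index, columns=[f'a_{k}' for k in nums])
result = pd.concat([df, shifted], axis=1)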

Replace outliers in Pandas dataframe by NaN

I'd like to replace outliers with np.nan. I have a dataframe containing floats, ints and NaNs, such as:
import numpy as np
import pandas as pd

df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
with this function:
def outliers(s, replace=np.nan):
    Q1, Q3 = np.percentile(s, [25, 75])
    IQR = Q3 - Q1
    return s.where((s >= (Q1 - 1.5 * IQR)) & (s <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=1)
but the result is almost entirely NaN.
Any idea what's going on? I'd like the outliers to be calculated column-wise.
Thanks as always for your help.
Don't use apply. Here is annotated code for a vectorized version:
def mask_outliers(df, replace):
    # Calculate the Q1 and Q3 quantiles
    q = df.agg('quantile', q=[.25, .75])
    # Calculate IQR = Q3 - Q1
    iqr = q.loc[.75] - q.loc[.25]
    # Calculate the lower and upper limits that decide outliers
    lower = q.loc[.25] - 1.5 * iqr
    upper = q.loc[.75] + 1.5 * iqr
    # Replace the values that do not lie within [lower, upper]
    return df.where(df.ge(lower) & df.le(upper), replace)
Result
mask_outliers(df_ex, np.nan)
      a     b    c    d    e   f     g
0   NaN   1.0  NaN  NaN  NaN   2   3.0
1   NaN   3.0  2.0  2.2  4.0   3   3.0
2   2.0   4.0  3.0  NaN  NaN   8   9.0
3  -0.5   2.0  4.0  NaN  NaN   2  11.0
4   6.0   NaN  2.0  3.0  5.0  12   NaN
5   NaN  11.0  2.0  3.0  5.0   8   2.0
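This works column-wise by construction: .agg('quantile', ...) computes the quantiles per column (skipping NaNs), and df.ge(lower) / df.le(upper) align the lower and upper Series with the columns.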
This answer addresses the actual question:
Any idea what's going on? I'd like the outliers to be calculated column-wise.
whereas the other (accepted) answer only provides a better solution for what you want to achieve.
There are two issues to fix in order to make your code do what it should:
1. The NaN values have to be removed from each column before calling np.percentile(), otherwise both Q1 and Q3 come out as NaN. This is one of the reasons for the many NaN values in the result of applying your code to the DataFrame: np.percentile() behaves differently from pandas' .agg('quantile', ...), which implicitly skips NaN values when calculating the Q1 and Q3 thresholds (see the sketch after this list).
2. The axis value has to be changed from 1 to 0 (i.e. .apply(outliers, axis=0)) so that outliers is applied column-wise. This is the other reason for the many NaN values in your result: the only row whose values are not all NaN is the one that contains no NaN itself; in every other row, the NaN drives both percentiles to NaN, which sets the entire row to NaN for the reason explained above.
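A quick sketch of the first point, with illustrative values:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(np.percentile(s, [25, 75]))        # [nan nan] -- the NaN propagates
print(s.quantile([.25, .75]).tolist())   # [1.5, 2.5] -- pandas skips the NaN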
The following changes to your code:
colmn_noNaN = colmn.dropna()
Q1, Q3 = np.percentile(colmn_noNaN, [25, 75])
and
df_ex_o = df_ex.apply(outliers, axis=0)
will solve both issues. Below is the entire code and its output:
import pandas as pd
import numpy as np

df_ex = pd.DataFrame({
    'a': [np.nan, np.nan, 2.0, -0.5, 6, 120],
    'b': [1, 3, 4, 2, 40, 11],
    'c': [np.nan, 2, 3, 4, 2, 2],
    'd': [6, 2.2, np.nan, 0, 3, 3],
    'e': [12, 4, np.nan, -5, 5, 5],
    'f': [2, 3, 8, 2, 12, 8],
    'g': [3, 3, 9.0, 11, np.nan, 2]})
# print(df_ex)

def outliers(colmn, replace=np.nan):
    colmn_noNaN = colmn.dropna()
    Q1, Q3 = np.percentile(colmn_noNaN, [25, 75])
    IQR = Q3 - Q1
    return colmn.where((colmn >= (Q1 - 1.5 * IQR)) & (colmn <= (Q3 + 1.5 * IQR)), replace)

df_ex_o = df_ex.apply(outliers, axis=0)
print(df_ex_o)
gives:
      a     b    c    d    e   f     g
0   NaN   1.0  NaN  NaN  NaN   2   3.0
1   NaN   3.0  2.0  2.2  4.0   3   3.0
2   2.0   4.0  3.0  NaN  NaN   8   9.0
3  -0.5   2.0  4.0  NaN  NaN   2  11.0
4   6.0   NaN  2.0  3.0  5.0  12   NaN
5   NaN  11.0  2.0  3.0  5.0   8   2.0
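Note that this matches the output of the vectorized mask_outliers above: with the NaNs dropped and axis=0, both versions compute the same column-wise thresholds.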

Pandas: DataFrame Rolling Average on a Row

I have a row of values in a dataframe and want to calculate the 3-period rolling average in a new row.
existing_row    1  2  3  4  5  6  7  8  9
create_new_row        2  3  4  5  6  7  8
Use DataFrame.rolling with axis=1 and mean:
print(df)
   0  1  2  3  4  5  6  7  8
0  1  2  3  4  5  6  7  8  9

df1 = df.rolling(3, axis=1).mean()
print(df1)
    0    1    2    3    4    5    6    7    8
0 NaN  NaN  2.0  3.0  4.0  5.0  6.0  7.0  8.0
If you need to join it to the original, pass both to concat:
df = pd.concat([df, df1], ignore_index=True)
print(df)
     0    1    2    3    4    5    6    7    8
0  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
1  NaN  NaN  2.0  3.0  4.0  5.0  6.0  7.0  8.0
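On recent pandas, where the axis argument of rolling is deprecated, an equivalent sketch (under that assumption) rolls along the transposed frame instead:
df1 = df.T.rolling(3).mean().T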
Alternatively, compute the rolling mean and append it (note that DataFrame.append was removed in pandas 2.0; use pd.concat there):
out = df.append(df.rolling(3, axis=1).mean(), ignore_index=True)
print(out)
# Output
     A    B    C    D    E    F    G    H    I
0  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
1  NaN  NaN  2.0  3.0  4.0  5.0  6.0  7.0  8.0
Setup:
df = pd.DataFrame({'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'D': {0: 4}, 'E': {0: 5},
                   'F': {0: 6}, 'G': {0: 7}, 'H': {0: 8}, 'I': {0: 9}})
print(df)
# Output
   A  B  C  D  E  F  G  H  I
0  1  2  3  4  5  6  7  8  9

How to assign the dataframe mean to specific rows of a dataframe?

I have a data frame like this:
df_a = pd.DataFrame({'a': [2, 4, 5, 6, 12],
                     'b': [3, 5, 7, 9, 15]})
Out[112]:
    a   b
0   2   3
1   4   5
2   5   7
3   6   9
4  12  15
and its mean:
df_a.mean()
Out[118]:
a    5.800
b    7.800
dtype: float64
I want this:
df_a[df_a.index.isin([3, 4])] = df.mean()
But I'm getting an error. How do I achieve this?
This is just a small example; in the data I'm actually working with there are many observations I need to change, and I keep their index values in a list.
If you want to overwrite the rows at the positions in your list, you can do it with iloc:
df_a = pd.DataFrame({'a': [2, 4, 5, 6, 12], 'b': [3, 5, 7, 9, 15]})
idx_list = [3, 4]
df_a.iloc[idx_list, :] = df_a.mean()
Output
     a    b
0  2.0  3.0
1  4.0  5.0
2  5.0  7.0
3  5.8  7.8
4  5.8  7.8
Edit: if you're using an older version of pandas and see NaNs instead of the wanted values, you can use a for loop:
df_a_mean = df_a.mean()
for i in idx_list:
    df_a.iloc[i, :] = df_a_mean
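A loc-based sketch as an alternative (assuming idx_list holds index labels rather than positions; casting to float first keeps the dtypes compatible with the fractional mean values):
df_a = pd.DataFrame({'a': [2, 4, 5, 6, 12], 'b': [3, 5, 7, 9, 15]}).astype(float)
idx_list = [3, 4]
df_a.loc[idx_list] = df_a.mean().to_numpy()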

How to avoid unnecessary multi-index entries in pandas dataframe concat?

I have the following data:
import pandas as pd

df1 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Play', 1: [1, 2, 3, 4, 5]}).set_index(['Room', 'User', 'Task'])
df2 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Play', 2: [1, 2, 3, 4, 5]}).set_index(['Room', 'User', 'Task'])
df3 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Clean', 1: [6, 7, 8, 9, 10]}).set_index(['Room', 'User', 'Task'])
df4 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Clean', 2: [6, 7, 8, 9, 10]}).set_index(['Room', 'User', 'Task'])
df = pd.concat([df1, df2, df3, df4]).sort_index()
In the concatenated result, every (Room, User, Task) key appears once per source frame, with NaN in the columns that frame doesn't provide. I wonder why the multi-index has a duplicate entry for each column there is.
I expected, and want, each multi-index key to occur only once, with all the NaN values gone (as in the outputs below).
This would significantly reduce the size of my dataframe, and later also the storage size on the physical drive.
If summing the values is acceptable:
df = df.sum(level=[0, 1, 2])
# alternative (required on newer pandas, where sum(level=...) was removed):
# df = df.groupby(level=[0, 1, 2]).sum()
print(df)
                      1     2
Room User   Task
1    Martin Clean   6.0   6.0
            Play    1.0   1.0
2    Martin Clean   7.0   7.0
            Play    2.0   2.0
3    Martin Clean   8.0   8.0
            Play    3.0   3.0
5    Martin Clean   9.0   9.0
            Play    4.0   4.0
8    Martin Clean  10.0  10.0
            Play    5.0   5.0
If taking only the first non-missing value per key is acceptable:
df = df.groupby(level=[0, 1, 2], sort=False).first()
print(df)
                      1     2
Room User   Task
1    Martin Clean   6.0   6.0
            Play    1.0   1.0
2    Martin Clean   7.0   7.0
            Play    2.0   2.0
3    Martin Clean   8.0   8.0
            Play    3.0   3.0
5    Martin Clean   9.0   9.0
            Play    4.0   4.0
8    Martin Clean  10.0  10.0
            Play    5.0   5.0
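(Both approaches give the same values here, since each key has exactly one non-missing entry per column.) Another sketch, given that df1/df2 and df3/df4 share identical index keys pairwise: join column-wise first, so the duplicate index entries and the NaNs never appear:
df = pd.concat([pd.concat([df1, df2], axis=1),
                pd.concat([df3, df4], axis=1)]).sort_index()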