Pandas rolling window with less than or equal to - pandas

I have a dataframe which is classified based on three dimensions:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
when I do a rolling of metric d by the following command:
>>> df.d.rolling(window = 3).mean()
0 NaN
1 NaN
2 2.0
Name: d, dtype: float64
but what I actually want is to perform a rolling <= given number, in a way that if for the first entry the result is the same number itself and then from the second entry it rolls for the window size of 1 and for third it rolls for the window size of 2 and from 3 onwards it rolls the running average of 3 previous windows.
So the result I am expecting is:
for the dataframe:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
>>> df.d.rolling(window = 3).mean()
0 1 #Since this is the first one and so average of the first number is equal to number itself.
1 1.5 # Average of 1 and 2 as rolling criteria is <= 3
2 2.0 # Since here we have 3 elements so from here on it follows the general trend.
Name: d, dtype: float64
Is it possible to roll this way?

I was able to roll using the following command:
>>> df.d.rolling(min_periods = 1, window = 3).mean()
0 1.0
1 1.5
2 2.0
Name: d, dtype: float64
with the help of min_periods one can specify the rolling window minimum config count.

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme value - more than 5 std.
I want to replace, per column, each value that is more than 5 std with the max other value.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B
Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range
def calc_zscore(col):
return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something

Different outcome using pandas nunique() and unique()

I have a big DF with 10 millions rows and I need to find the unique number for each column.
I wrote the function below:
(need to return a series)
def count_unique_values(df):
return pd.Series(df.nunique())
and I get this output:
Area 210
Item 436
Element 4
Year 53
Unit 2
Value 313640
dtype: int64
expected result should be value 313641.
when I just do
df['Value'].unique()
I do get that answer. Didn't figure out why I get less with nunique() just there.
Because DataFrame.nunique omit missing values, because default parameter dropna=True, Series.unique function not.
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'D':[np.nan,3,5,5,3,5],
})
print (df)
A D
0 a NaN
1 b 3.0
2 c 5.0
3 d 5.0
4 e 3.0
5 f 5.0
def count_unique_values(df):
return df.nunique()
print (count_unique_values(df))
A 6
D 2
dtype: int64
print (df['D'].unique())
[nan 3. 5.]
print (df['D'].nunique())
2
print (df['D'].unique())
[nan 3. 5.]
Solution is add parameter dropna=False:
print (df['D'].nunique(dropna=False))
3
print (df['D'].unique())
3
So in your function:
def count_unique_values(df):
return df.nunique(dropna=False)
print (count_unique_values(df))
A 6
D 3
dtype: int64

calculate the mean of one row according it's label

calculate the mean of the values in one row according it's label:
A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1, 1]
Result = pd.DataFrame(data=[A, B])
I want the output is: 0->3; 1-> 7.8
pandas has the groupby function, but I don't know how to implement this. Thanks
This is simple groupby problem ...
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index was unique you wouldn't have to do any groupby, just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note: the duplicate rows (that happen to have the same mean as I lazily used the same row contents), in general we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.
You're making it difficult on yourself by constructing the dataframe in such a way as to put the things you want to take the mean of and the things you want to be your labels as different rows.
Option 1
groubpy
This deals with the data presented in the dataframe Result
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
Overkill using np.bincount and because your grouping values are 0 and 1. I'd have a solution even if they weren't but it makes it simpler.
I wanted to use the raw lists A and B
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
Option 3
Construct a series instead of a dataframe.
Again using raw lists A and B
pd.Series(A, B).mean(level=0)
0 3
1 8
dtype: int64

apply() function to generate new value in a new column

I am new to python 3 and pandas. I tried to add a new column into a dataframe where the value is the difference between two existing columns.
My current code is:
import pandas as pd
import io
from io import StringIO
x="""a,b,c
1,2,3
4,5,6
7,8,9"""
with StringIO(x) as df:
new=pd.read_csv(df)
print (new)
y=new.copy()
y.loc[:,"d"]=0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"]=y["d"].apply(lambda x:y["a"]-y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.
You need y only for DataFrame for DataFrame.apply with axis=1 for process by rows:
y["d"]= y.apply(lambda x:x["a"]-x["b"], axis=1)
For better debugging is possible create custom function:
def f(x):
print (x)
a = x["a"]-x["b"]
return a
y["d"]= y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
Better solution if need only subtract columns:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the maximum value of a dataframe­(df). My problem is that I have a df with several columns (more than 10), one of a column has identifiers of same value. I need to extract the identifiers with the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupy(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
Op wants to provide these locations back to the frame, just create a transform and assign.
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2