Pandas rolling window with less than or equal to


I have a dataframe which is classified based on three dimensions:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
When I compute a rolling mean of metric d with the following command:
>>> df.d.rolling(window = 3).mean()
0 NaN
1 NaN
2 2.0
Name: d, dtype: float64
But what I actually want is a rolling mean over a window of at most the given size. The first entry should return the number itself (window of 1), the second entry should average the first two values (window of 2), and from the third entry onwards it should be the running average over the previous 3 values.
So, for the dataframe below, the result I am expecting is:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
>>> df.d.rolling(window = 3).mean()
0 1 #Since this is the first one and so average of the first number is equal to number itself.
1 1.5 # Average of 1 and 2 as rolling criteria is <= 3
2 2.0 # Since here we have 3 elements so from here on it follows the general trend.
Name: d, dtype: float64
Is it possible to roll this way?

I was able to roll using the following command:
>>> df.d.rolling(min_periods = 1, window = 3).mean()
0 1.0
1 1.5
2 2.0
Name: d, dtype: float64
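The min_periods argument sets the minimum number of observations a window needs in order to produce a value. With min_periods = 1, the early rows are averaged over however many values are available instead of returning NaN.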
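As a minimal sketch of the behaviour described above, the snippet below rebuilds only column d of the example frame and compares the default rolling mean with the min_periods = 1 version. The variable names strict and relaxed are illustrative and not part of the original post.

```python
import pandas as pd

# Only column d matters for the rolling mean, so the other columns are omitted.
df = pd.DataFrame({"d": [1, 2, 3]})

# Default: for an integer window, min_periods equals the window size,
# so the first two rows are NaN.
strict = df["d"].rolling(window=3).mean()

# min_periods=1: a value is produced as soon as one observation is available.
relaxed = df["d"].rolling(window=3, min_periods=1).mean()

print(strict.tolist())
print(relaxed.tolist())
```

Once at least three rows are available, both versions return the same numbers; they differ only at the start of the series.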

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values, more than 5 standard deviations out.
I want to replace, per column, each value that is more than 5 std with the maximum of the other values.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?

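# Mask where the condition applies. Note: this flags values that exceed the
# column's standard deviation by more than 5, which is not a z-score test.
s = df.mask((df - df.std()).gt(5))
# Fill with the max of each column and re-sort the frame.
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)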
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0

Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do:
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B

Calculate a column-wise z-score (if you deem a value an outlier when it lies more than a given number of standard deviations from the column mean), then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that, it's up to you to fill the values marked by the boolean mask:
df[outlier_mask] = something
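One way to fill the masked values without looping over columns is sketched below. The helper name cap_outliers and its n_std parameter are hypothetical, not from the original answers. It also assumes that only values above the mean count as outliers, matching the zscores > 5 mask. With only five rows, a sample z-score cannot exceed roughly 1.79, so the demo lowers the threshold to 1.5 purely to make the example data trigger.

```python
import pandas as pd

def cap_outliers(df, n_std=5.0):
    """Replace values more than n_std standard deviations above the column
    mean with the largest remaining value of that column (hypothetical helper)."""
    zscores = (df - df.mean()) / df.std()
    masked = df.mask(zscores > n_std)      # outliers become NaN
    # masked.max() is a Series indexed by column name, so fillna fills
    # each column with its own maximum of the non-outlier values.
    return masked.fillna(masked.max())

df = pd.DataFrame({"A": [1, 1, 2, 1, 191], "B": [2, 6, 8, 115, 1]})

# Threshold lowered for this tiny sample; on real data 5.0 may be appropriate.
capped = cap_outliers(df, n_std=1.5)
print(capped)
```

Because masking introduces NaN, the result columns are float64 even when the input is integer.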

Different outcome using pandas nunique() and unique()

I have a big DF with 10 million rows and I need to find the number of unique values in each column.
I wrote the function below:
(it needs to return a series)
def count_unique_values(df):
    return pd.Series(df.nunique())
and I get this output:
Area 210
Item 436
Element 4
Year 53
Unit 2
Value 313640
dtype: int64
The expected result for Value is 313641.
When I just do
df['Value'].unique()
I do get that count. I can't figure out why nunique() gives one fewer for that column only.

Because DataFrame.nunique omits missing values by default (its parameter dropna defaults to True), while Series.unique does not.
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'D':[np.nan,3,5,5,3,5],
})
print (df)
A D
0 a NaN
1 b 3.0
2 c 5.0
3 d 5.0
4 e 3.0
5 f 5.0
def count_unique_values(df):
    return df.nunique()
print (count_unique_values(df))
A 6
D 2
dtype: int64
print (df['D'].unique())
[nan 3. 5.]
print (df['D'].nunique())
2
The solution is to add the parameter dropna=False:
print (df['D'].nunique(dropna=False))
3
print (len(df['D'].unique()))
3
So in your function:
def count_unique_values(df):
    return df.nunique(dropna=False)
print (count_unique_values(df))
A 6
D 3
dtype: int64
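The difference can be checked on a single column. The sketch below reuses the values of column D from the sample; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, 5, 5, 3, 5])

n_without_nan = s.nunique()             # NaN excluded (dropna=True is the default)
n_with_nan = s.nunique(dropna=False)    # NaN counted as one distinct value
n_from_unique = len(s.unique())         # unique() keeps NaN in its result

print(n_without_nan, n_with_nan, n_from_unique)
```

A gap of exactly one between the two counts, as in the question, is what a column containing at least one missing value would produce.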

Calculate the mean of one row according to its label

I want to calculate the mean of the values in one row according to the labels in another row:
A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1, 1]
Result = pd.DataFrame(data=[A, B])
I want the output to be: 0 -> 3; 1 -> 8
pandas has the groupby function, but I don't know how to apply it here. Thanks

This is a simple groupby problem:
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
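An alternative sketch, assuming you are free to build the frame differently: put the values and the labels in two columns of a long-format frame, so no transpose is needed. The column names value and label are illustrative.

```python
import pandas as pd

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# One row per observation: the value and the label it belongs to.
long_df = pd.DataFrame({"value": A, "label": B})

# Mean of the values within each label.
means = long_df.groupby("label")["value"].mean()
print(means)
```

Recent pandas versions return these means as floats even when they happen to be whole numbers.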

Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index was unique you wouldn't have to do any groupby, just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note: the duplicate rows happen to have the same mean only because I lazily reused the same row contents. In general we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.

You're making it difficult on yourself by constructing the dataframe in such a way as to put the things you want to take the mean of and the things you want to be your labels as different rows.
Option 1
groupby
This deals with the data presented in the dataframe Result
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
Overkill, using np.bincount. This works because your grouping values are 0 and 1; I'd have a solution even if they weren't, but this keeps it simple.
Here I use the raw lists A and B:
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
Option 3
Construct a series instead of a dataframe.
Again using raw lists A and B
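pd.Series(A, B).groupby(level=0).mean()
0 3.0
1 8.0
dtype: float64
(Older pandas versions also accepted pd.Series(A, B).mean(level=0); the level argument was removed in pandas 2.0.)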

apply() function to generate new value in a new column

I am new to Python 3 and pandas. I tried to add a new column to a dataframe where the value is the difference between two existing columns.
My current code is:
import pandas as pd
import io
from io import StringIO
x="""a,b,c
1,2,3
4,5,6
7,8,9"""
with StringIO(x) as df:
    new=pd.read_csv(df)
print (new)
y=new.copy()
y.loc[:,"d"]=0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"]=y["d"].apply(lambda x:y["a"]-y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.

Call apply on the DataFrame y itself, not on the column y["d"]. DataFrame.apply with axis=1 processes the frame row by row:
y["d"]= y.apply(lambda x:x["a"]-x["b"], axis=1)
For easier debugging, you can write a custom function that prints each row:
def f(x):
    print (x)
    a = x["a"]-x["b"]
    return a
y["d"]= y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
A better solution, if you only need to subtract columns:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1
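The two approaches can be compared side by side. This is a minimal sketch that rebuilds the frame from the question; the names csv_text, row_wise, and vectorized are illustrative.

```python
from io import StringIO

import pandas as pd

csv_text = """a,b,c
1,2,3
4,5,6
7,8,9"""

y = pd.read_csv(StringIO(csv_text))

# Row-wise: the lambda receives one row (a Series) at a time.
row_wise = y.apply(lambda row: row["a"] - row["b"], axis=1)

# Vectorized: subtract whole columns at once, no Python-level loop.
vectorized = y["a"] - y["b"]

y["d"] = vectorized
print(y)
```

Both give the same values here; the vectorized form is usually the faster choice on large frames because it avoids calling a Python function per row.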

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value in a dataframe (df). My problem is that I have a df with several columns (more than 10), and one of the columns holds identifiers that repeat. For each identifier, I need to extract the row with the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!

Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to write these maxima back to the frame, just create a transform and assign it:
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
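If the goal is to keep the full rows (all the other columns as well) rather than just the maxima, one possible sketch is to combine groupby with idxmax and then select by label. This assumes the frame has a unique index; the names best_idx and best_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "b", "c", "c", "c"],
    "value": [0, 1, 1, 0, 2, 1],
})

# Index label of the row holding the maximum value within each id.
# For ties, idxmax returns the first occurrence.
best_idx = df.groupby("id")["value"].idxmax()

# Select those rows; every column of the original frame is kept.
best_rows = df.loc[best_idx]
print(best_rows)
```

With the tie in group b, only the first of the two rows is returned, which matches the desired output in the question.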
