Rolling means in a pandas DataFrame

I am trying to run some computations on DataFrames. I want to compute the average difference between two sets of rolling means: more specifically, the average of the differences between a set of long-term means (lst) and a set of shorter-term ones (lst_2). I combine the calculations with a double for loop as follows:
import pandas as pd
import numpy as np

def main(df):
    df = df.pct_change()
    lst = [100, 150, 200, 250, 300]
    lst_2 = [5, 10, 15, 20]
    result = pd.DataFrame(np.sum([calc(df, T, t) for T in lst for t in lst_2])) / (len(lst) + len(lst_2))
    return result

def calc(df, T, t):
    roll = pd.DataFrame(np.sign(df.rolling(t).mean() - df.rolling(T).mean()))
    return roll
Overall I should have 20 differences (5 and 100, 10 and 100, 15 and 100, ..., 20 and 300); I take the sign of each difference, and I want the average of these signs at each point in time. Ideally the result would be a single DataFrame.
I get the error cannot copy sequence with size 3951 to array axis with dimension 1056 when the double for loop runs. Obviously I understand that, because of the different rolling windows T and t, the dimensions of the DataFrames are not equal when it comes to the array conversion (with np.sum), but I thought it would insert NaN to align the dimensions.
Hope I have been clear enough. Thank you.
As requested in the comments, here is an example. Let's suppose the following DataFrame:
df = pd.DataFrame({'A': [100, 101.4636, 104.9477, 106.7089, 109.2701, 111.522, 113.3832, 113.8672, 115.0718, 114.6945, 111.7446, 108.8154]},
                  index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
df = df.pct_change()
and I have the following 2 sets of windows over which I need to compute means:
lst = [8, 10]
lst_2 = [3, 4]
Then I follow these steps:
1/
I want to compute the rolling mean(3) - rolling mean(8), and get the sign of it:
roll=np.sign(df.rolling(3).mean()-df.rolling(8).mean())
This should return the following:
roll = pd.DataFrame({'A': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, -1, -1, -1, -1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
2/
I redo step 1 with the remaining window combinations: 3-10, 4-8 and 4-10. So overall I get 4 roll DataFrames:
roll_3_8 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, -1, -1, -1, -1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
roll_3_10 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, -1, -1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
roll_4_8 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, -1, -1, -1, -1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
roll_4_10 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, -1, -1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
3/
Now that I have all the diffs, I simply want their average, so I sum the 4 roll DataFrames and divide by 4 (the number of differences computed). The result should be (before dropping any NaN values):
result = pd.DataFrame({'A': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, -1, -1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
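For what it's worth, the shape mismatch can be avoided by never leaving pandas: adding DataFrames with + aligns them on the index and propagates NaN, exactly as in the expected result above. A minimal sketch of main along those lines (note the divisor should be the number of (T, t) pairs, len(lst) * len(lst_2) = 20, not len(lst) + len(lst_2)):

import pandas as pd
import numpy as np

def calc(df, T, t):
    # sign of the short-window mean minus the long-window mean
    return np.sign(df.rolling(t).mean() - df.rolling(T).mean())

def main(df):
    df = df.pct_change()
    lst = [100, 150, 200, 250, 300]   # long windows
    lst_2 = [5, 10, 15, 20]           # short windows
    frames = [calc(df, T, t) for T in lst for t in lst_2]
    # '+' on DataFrames keeps index alignment and propagates NaN,
    # so rows where any window is still incomplete stay NaN
    return sum(frames) / len(frames)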

Related

Pandas | How to effectively filter a column

I'm looking for a way to quickly and effectively filter through a dataframe column and remove values that don't meet a condition.
Say, I have a column with the numbers 4, 5 and 10. I want to filter the column and replace any numbers above 7 with 0. How would I go about this?
You're talking about two separate things - filtering and value replacement. They both have their uses and end up being similar in nature, but for filtering I'll point you to this great answer.
Let's say our data frame is called df and looks like

    A   B
1   4  10
2   4   2
3  10   1
4   5   9
5  10   3
Column A fits your description of a column containing only the values 4, 5 and 10. If you wanted to replace numbers above 7 with 0, this would do it:
df["A"] = [0 if x > 7 else x for x in df["A"]]
If you read through the right-hand side, it cleanly explains what it is doing. It helps to include parentheses to separate the "what to do" from the "what you're doing it over":
df["A"] = [(0 if x > 7 else x) for x in df["A"]]
If you want to do a manipulation over multiple columns, zip lets you do it easily. For example, if you want the sum of columns A and B:
df["sum"] = [x[0] + x[1] for x in zip(df["A"], df["B"])]
Take care when you overwrite data - this removes information. It's good practice to keep the transformed data in a separate column so you can trace back when something inevitably goes wonky.
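A minimal illustration of that practice, reusing the df above (the column name A_capped is just for the example):

df["A_capped"] = [(0 if x > 7 else x) for x in df["A"]]  # df["A"] keeps the raw values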
There are many options. One possibility for if/then replacement is np.where:
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 200, 4, 5, 6, 11],
                   'y': [4, 5, 10, 24, 4, 3]})
df['y'] = np.where(df['y'] > 7, 0, df['y'])
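An equivalent pattern, for comparison, is in-place assignment through boolean indexing with .loc (same effect as the np.where line above):

df.loc[df['y'] > 7, 'y'] = 0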

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:

   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (this took around 27 seconds on a DataFrame of 1 million rows):
import pandas as pd

df = pd.DataFrame({"value": [1, 2], "accepted_values": [[1, 2, 3, 4], [5, 6, 7, 8]]})

def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all values in accepted_values")
If you create a DataFrame from the list column, you can compare it with DataFrame.eq and test whether at least one value per row matches with DataFrame.any (note eq needs axis=0 so that each value is compared against its own row):
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
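To make the intermediate step concrete, df1 for the two-row example expands each list into its own columns:

   0  1  2  3
0  1  2  3  4
1  5  6  7  8

df1.eq(df["value"], axis=0) then compares every cell against its row's value, and any(axis=1) flags the rows with at least one match.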
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a small optimisation of your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, measured with time.perf_counter()):
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))

How to plot outliers with regard to unique ids

I have an item_code column in my data and another column, sales, which holds the sales quantity for that item.
A particular item id can appear many times in the data; other columns tell these entries apart.
I want to plot only the outlier sales for each item (the data has thousands of different item ids, so plotting every entry would be difficult).
Since I'm very new to this, what is the right way and tool to do this?
You can use pandas. You need to choose a method for detecting outliers, but here is an example:
If you want outliers across all sales (not per group), you can use apply with a function (here, a lambda) to get the outlier indexes.
import pandas as pd
import numpy as np
%matplotlib inline

df = pd.DataFrame({'item_id': [1, 1, 2, 1, 2, 1, 2],
                   'sales': [0, 2, 30, 3, 30, 30, 55]})
df[df.apply(lambda x: np.abs(x.sales - df.sales.mean()) / df.sales.std() > 1, 1)
   ].set_index('item_id').plot(style='.', color='red')
In this example we generated a data sample and looked for the indexes of points lying more than one standard deviation from the mean (you can try another method). Then we just plot them, with the sales quantity on y and the item id on x. This method flags the points 0 and 55. If you want to search for outliers within groups, you can group the data first:
df.groupby('item_id').apply(lambda data: data.loc[
    data.apply(lambda x: np.abs(x.sales - data.sales.mean()) / data.sales.std() > 1, 1)
]).set_index('item_id').plot(style='.', color='red')
In this example we get the points 30 and 55, because 0 isn't an outlier within the group where item_id = 1, but 30 is.
Is this what you want to do? I hope it helps you get started.
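Another option worth mentioning (a different technique from the one above) is a per-item boxplot, since boxplots mark outliers as individual points automatically:

df.boxplot(column='sales', by='item_id')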

Pandas - given a sorted dataframe and a list of target values, how to retrieve rows next to these values in one go

Suppose I have a sorted dataframe and a list of target values as below
In [57]: df
Out[57]:
   value
0      1
1      2
2      3
3      4
4      5
5      6

In [58]: target_values = [1.5, 3.5, 5.5]
What I want is to get, for each target value, the first row whose value is >= that target. In the example above, the indexes of those rows are [1, 3, 5].
I can achieve the goal with the following code:
In [60]: [df[df.value >= t].iloc[0] for t in target_values]
However, this scans the DataFrame len(target_values) times. Is there a Pandas function that can achieve the goal in a single scan?
It's called searchsorted. You can use the pandas method or the numpy one:
pandas
df.value.searchsorted(target_values)
array([1, 3, 5])
numpy
df.value.values.searchsorted(target_values)
array([1, 3, 5])
# build a pairwise difference matrix
pairwise_diff = df.values[:, None] - target_values

# find the non-negative minimal diff for each target value
np.ma.array(pairwise_diff, mask=(pairwise_diff < 0)).argmin(0)
Out[178]: array([[1, 3, 5]], dtype=int64)
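One caveat for both versions: searchsorted assumes the column is already sorted ascending, as it is here. A minimal usage sketch going from the targets to the actual rows:

import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6]})
target_values = [1.5, 3.5, 5.5]
# side='left' (the default) returns the first position where each target
# could be inserted while keeping the column sorted, i.e. the first row
# with value >= target
idx = df['value'].searchsorted(target_values)
rows = df.iloc[idx]   # the rows with values 2, 4 and 6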

Are there functions to retrieve the histogram counts of a Series in pandas?

There is a method to plot histograms of Series, but is there a function to retrieve the histogram counts to do further calculations on top of them?
I keep using numpy's functions to do this and converting the result to a DataFrame or Series when I need it. It would be nice to stay with pandas objects the whole time.
If your Series is discrete, you can use value_counts:
In [11]: s = pd.Series([1, 1, 2, 1, 2, 2, 3])

In [12]: s.value_counts()
Out[12]:
2    3
1    3
3    1
dtype: int64
You can see that s.hist() is essentially equivalent to s.value_counts().plot().
If it were floats, an awful hacky solution could be to use groupby:
s.groupby(lambda i: np.floor(2*s[i]) / 2).count()
Since hist and value_counts don't use the Series' index, you may as well treat the Series like an ordinary array and use np.histogram directly, then build a Series from the result.

In [3]: import numpy as np; from numpy.random import randn; from pandas import Series

In [4]: s = Series(randn(100))

In [5]: counts, bins = np.histogram(s)

In [6]: Series(counts, index=bins[:-1])
Out[6]:
-2.968575     1
-2.355032     4
-1.741488     5
-1.127944    26
-0.514401    23
 0.099143    23
 0.712686    12
 1.326230     5
 1.939773     0
 2.553317     1
dtype: int32
This is a really convenient way to organize the result of a histogram for subsequent computation.
To index by the center of each bin instead of the left edge, you could use bins[:-1] + np.diff(bins)/2.
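Continuing the session above, that looks like:

In [7]: Series(counts, index=bins[:-1] + np.diff(bins)/2)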
If you know the number of bins you want, you can use pandas' cut function, which is now accessible via value_counts. Using the same random example:
s = pd.Series(np.random.randn(100))
s.value_counts(bins=5)
Out[55]:
(-0.512, 0.311]     40
(0.311, 1.133]      25
(-1.335, -0.512]    14
(1.133, 1.956]      13
(-2.161, -1.335]     8
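Note that value_counts sorts by count; if you'd rather see the bins in interval order, append sort_index:

s.value_counts(bins=5).sort_index()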
Based on this answer from a related question, you can get the bin edges and histogram counts as follows:
s = pd.Series(np.random.randn(100))
ax = s.hist()
for rect in ax.patches:
    # each patch is one histogram bar; its bounding box gives the bin
    # edges (x0, x1) and the count (y1)
    ((x0, y0), (x1, y1)) = rect.get_bbox().get_points()
    print(((x0, y0), (x1, y1)))