Rolling median of date-indexed data with duplicate dates - pandas

My date-indexed data can have multiple observations for a given date.
I want to get the rolling median of a value but am not getting the result that I am looking for:
df = pd.DataFrame({
    'date': ['2020-06-22', '2020-06-23', '2020-06-24', '2020-06-24', '2020-06-25', '2020-06-26'],
    'value': [2, 8, 5, 1, 3, 7]
})
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Attempt to get the 3-day rolling median of 'value':
df['value'].rolling('3D').median()
# This yields the following, i.e. one median value
# per **observation**
# (two values for 6/24 in this example):
date
2020-06-22 2.0
2020-06-23 5.0
2020-06-24 5.0
2020-06-24 3.5
2020-06-25 4.0
2020-06-26 4.0
Name: value, dtype: float64
# I was hoping to get one median value
# per **distinct date** in the index
# The median for 6/24, for example, would be computed
# from **all** observations on 6/22, 6/23 and 6/24 (2 observations)
date
2020-06-22 NaN
2020-06-23 NaN
2020-06-24 3.5
2020-06-25 4.0
2020-06-26 4.0
Name: value, dtype: float64
How do I need to change my code?

As far as I can tell, your code produces the right answer for the second occurrence of 2020-06-24: 3.5 is the median of the four values 2, 8, 5, 1. The first occurrence of 2020-06-24 only uses its own value and the values from the two prior days. Presumably (I am speculating here) the '3D' window looks at the rows preceding each observation in the time series, not the ones following it.
So I think your code only needs a small modification to satisfy your requirement: if there are multiple rows with the same date, we should just pick the last one. We do this below with groupby. You also want the first two values to be NaN rather than medians of shorter windows; this can be achieved by passing min_periods=3 to rolling. Here is all the code; I put the median into its own column.
df['median'] = df['value'].rolling('3D', min_periods=3).median()
df.groupby(level=0).last()
prints
value median
date
2020-06-22 2 NaN
2020-06-23 8 NaN
2020-06-24 1 3.5
2020-06-25 3 4.0
2020-06-26 7 4.0
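(Side note, not from the original answer: if you prefer a boolean mask over groupby, keeping only the last row per duplicated index value gives the same result here.)
# keep only the last observation for each distinct date (same df as above)
df[~df.index.duplicated(keep='last')]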


How to get the groupby nth row directly in the row as an item?

I have Date, Time, Open, High, Low, Close data on a minute basis for a stock, arranged in ascending order (date-wise). I want to create a new column and, for every day (for each row), insert yesterday's price, taken from the second row of the previous date. So, for instance, I have put the price 18812.3 against 11th Jan, since the previous date was 10th Jan and its second row has a price of 18812.3. Similarly, I have done it for the day before yesterday too. I tried using nth on the groupby object, but for that I have to create a groupby object first. The code below gets me a new DataFrame, but I would like to create a column directly holding the desired values.
test = bn_futures.groupby('Date')[['Open', 'High', 'Low', 'Close']].nth(1).reset_index()
Try: (check comments)
# Convert Date to datetime64 and set it as index
df = df.assign(Date=pd.to_datetime(df['Date'], dayfirst=True)).set_index('Date')
# Find second value for each day
prices = df.groupby(level=0)['Open'].nth(1).squeeze()
# Find last row for each day
mask = ~df.index.duplicated(keep='last')
# Create new columns
df.loc[mask, 'price at yesterday'] = prices.shift(1)
df.loc[mask, 'price 2d ago'] = prices.shift(2)
Output:
>>> df
Open price at yesterday price 2d ago
Date
2015-01-09 1 NaN NaN
2015-01-09 2 NaN NaN
2015-01-09 3 NaN NaN
2015-01-10 4 NaN NaN
2015-01-10 5 NaN NaN
2015-01-10 6 2.0 NaN
2015-01-11 7 NaN NaN
2015-01-11 8 NaN NaN
2015-01-11 9 5.0 2.0
Setup of an MRE:
df = pd.DataFrame({'Date': ['09-01-2015', '09-01-2015', '09-01-2015',
                            '10-01-2015', '10-01-2015', '10-01-2015',
                            '11-01-2015', '11-01-2015', '11-01-2015'],
                   'Open': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

Pandas: vectorize sliding time window aggregation

I have a big dataframe from which I need sliding time window averages for a given set of query points. I tried df.rolling, but that wouldn't allow me to query arbitrary points. The following works, but seems inefficient and does not allow for vectorized usage:
import pandas as pd
df = pd.DataFrame({'B': range(5)},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06')])
query = pd.date_range(df.index[0], df.index[-1], freq='s')
time_window = pd.Timedelta(seconds=2)
f = lambda t: df[(t - time_window < df.index) & (df.index <= t)]["B"].mean()
[f(t) for t in query] # works but is slow
f(query) # throws ValueError length must match
Probably this can be done better ...
Edit: The real application has measures which appear randomly between 30 and 90 seconds. Sometimes there are periods with several days or weeks without data. The time_window is typically 15 minutes. The overall time horizon is 10 years.
You're skipping just a small step.
Your "query" is really a time series resampling operation. That is, in addition to calculating a rolling mean, you are also trying to smoothly resample the time series at a frequency of one second. You can do that using the asfreq method, applying it prior to the rolling operation:
resample_rolling = df.asfreq('1s').rolling(pd.Timedelta(seconds=2)).mean()
print(np.array([f(t) for t in query]))
print(resample_rolling.to_numpy()[:, 0])
Output:
[0. 0. 1. 1.5 2. 3. 3.5]
[0. 0. 1. 1.5 2. 3. 3.5]
Note that by default, the asfreq method fills missing values with NaN.
>>> df.asfreq(pd.Timedelta(seconds=1))
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 NaN
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 NaN
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
The rolling operation then ignores those values. If instead you want to fill the values with something other than nans, you have two options. You can supply a fill_value:
>>> df.asfreq('1s', fill_value=0.0)
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 0.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
Or you can specify a method, such as backfill, which uses the next value in the series:
>>> df.asfreq('1s', method='backfill')
B
2013-01-01 09:00:00 0
2013-01-01 09:00:01 1
2013-01-01 09:00:02 1
2013-01-01 09:00:03 2
2013-01-01 09:00:04 3
2013-01-01 09:00:05 3
2013-01-01 09:00:06 4
The resulting rolling mean is then different, of course:
>>> df.asfreq('1s', method='backfill').rolling('1s').mean()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 3.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
After some research I came up with the following solution with two rolling windows, one for entering the window and one for leaving:
import pandas as pd, numpy as np
df = pd.DataFrame({'B': range(5)},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06')])
query = pd.date_range(df.index[0], df.index[-1], freq='s')
time_window = pd.Timedelta(seconds=2)
aggregates = ['mean']
### Preparation
# one data point for each point entering the window
df1 = df.rolling(window=time_window, closed='right').agg(aggregates)
# one data point for each point leaving the window - use reverted df
df2 = df[::-1].rolling(window=time_window, closed='left').agg(aggregates)
df2.index += time_window
# Caution: for my real data in the reverted rolling method, I had
# to add a small Timedelta to window to function properly
# merge both together and remove duplicates
df_windowed = pd.concat([df1, df2])
df_windowed.sort_index(inplace=True)
df_windowed = df_windowed[~df_windowed.index.duplicated(keep='first')]
### the vectorized function
# Caution: get_indexer returns -1 for not found values (below df.index.min()),
# which is interpreted as last value. But last value of df_windows is always NaN
f = lambda t: df_windowed.iloc[
    df_windowed.index.get_indexer(t, method='ffill') if isinstance(t, (pd.Index, pd.Series, np.ndarray)) else
    df_windowed.index.get_loc(t, method='ffill')
]["B"]["mean"].to_numpy()
f(query)

summary of converted and churned customers from time series dataframe

I have a dataframe similar to the one below, and I would like to create a few summary stats around the behaviour of customers over time:
pd.DataFrame([
['id1','23/5/2019','not_emailed']
,['id1','24/5/2019','not_emailed']
,['id1','25/5/2019','emailed']
,['id1','26/5/2019','emailed']
,['id1','27/5/2019','emailed']
,['id1','28/5/2019','emailed']
,['id1','29/5/2019','emailed']
,['id1','30/5/2019','emailed']
,['id1','31/5/2019','emailed']
,['id1','1/6/2019','emailed']
,['id1','2/6/2019','emailed']
,['id2','23/5/2019','not_emailed']
,['id2','24/5/2019','not_emailed']
,['id2','25/5/2019','emailed']
,['id2','26/5/2019','emailed']
,['id2','27/5/2019','emailed']
,['id3','29/5/2019','not_emailed']
,['id3','30/5/2019','emailed']
,['id3','31/5/2019','emailed']
,['id3','1/6/2019','emailed']
,['id3','2/6/2019','emailed']
,['id4','29/5/2019','not_emailed']
,['id4','30/5/2019','emailed']
,['id4','31/5/2019','emailed']
,['id4','1/6/2019','emailed']
,['id4','2/6/2019','emailed']
,['id4','2/7/2019','emailed']
,['id4','3/7/2019','emailed']
,['id4','4/7/2019','emailed']
],columns=['id','date','status'])
The main scenarios that could be observed in this data set are:
id1 emailed on the 25th but not converted
id2 emailed on the 27th and converted on the 28th, because we don't see any more logs for this id
id3 emailed on the 30th and converted on the 3rd, because we don't see any more logs for this id
id4 emailed on the 30th and converted on the 3rd, but churned again on the 2nd
I would like to get a summary of that information per day:
how many emailed, how many converted, how many churned that had previously converted.
A desired potential output could be:
pd.DataFrame([
['29/5/2019',10,3,1] ,
['30/5/2019',10,2,1]
],columns=['date','emailed_total','converted_total','churned_total']
)
Note that the numbers above are random and don't reflect the stats of the first dataset shared.
My approaches so far:
1) Partially solves the problem:
find the first day each customer was emailed
calculate the days passed since that first day
group by the elapsed days and aggregate
This works, but not for churned customers (a rough sketch of this approach follows this list).
2) Loop through dates:
filter out the unique ids emailed
loop through the dates in the future and calculate the differences between the sets
This does the job, but is not very clean or pythonic.
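A rough sketch of approach 1, assuming the frame above is assigned to df (churn is not handled here):
import pandas as pd

# parse the day-first dates
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# first day each id was emailed
first_emailed = (df.loc[df['status'] == 'emailed']
                   .groupby('id', as_index=False)['date'].min()
                   .rename(columns={'date': 'first_emailed'}))

# days elapsed since the first email, per row
tmp = df.merge(first_emailed, on='id', how='left')
tmp['days_since_email'] = (tmp['date'] - tmp['first_emailed']).dt.days

# how many distinct ids still appear in the logs n days after their first email
emailed_by_elapsed_day = (tmp.loc[tmp['days_since_email'] >= 0]
                             .groupby('days_since_email')['id'].nunique())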
I have written the code to answer your question as I understand it at the moment. But as I commented, the churn status is not fully worked out, so there are only two different totals; it is not finished, and the column names are not the ones you asked for either.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df2 = df.groupby(['date','status']).agg('count').unstack().fillna(0)
df2.columns = df2.columns.droplevel()
df2 = df2.rename_axis(columns=None).reset_index()
df2.sort_index(ascending=True, inplace=True)
df2
date emailed not_emailed
0 2019-05-23 0.0 2.0
1 2019-05-24 0.0 2.0
2 2019-05-25 2.0 0.0
3 2019-05-26 2.0 0.0
4 2019-05-27 2.0 0.0
5 2019-05-28 1.0 0.0
6 2019-05-29 1.0 2.0
7 2019-05-30 3.0 0.0
8 2019-05-31 3.0 0.0
9 2019-06-01 3.0 0.0
10 2019-06-02 3.0 0.0
11 2019-07-02 1.0 0.0
12 2019-07-03 1.0 0.0
13 2019-07-04 1.0 0.0
As per your request, I put together some insights about your data: time to conversion and time spent unconverted per id.
I hope it helps.
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df.sort_values(by='date', inplace=True)
dates = df['date'].unique()
ids = df['id'].unique()
df = df.set_index(['id', 'date'])
out = pd.DataFrame(index=dates)
for i, new_df in df.groupby(level=0):
    new_df = new_df.droplevel(0)
    new_df = new_df.rename(columns={'status': i})
    out = out.merge(new_df, how='outer', left_index=True, right_index=True)
not_converted=out[out.columns[out.iloc[-1,:]=='emailed']]
converted=out[out.columns[out.iloc[-1,:].isnull()]]
start_mailing_date_NC=(not_converted=='emailed').cumsum().idxmin() #not converted id metrics
delta_NC=(dates[-1]-start_mailing_date_NC) #dates[-1] could be changed to actual date
print("Days from first mail unconverted by id: ")
print(delta_NC.to_string())
print(' Mean Days not converted: %s'%(delta_NC.mean()))
print( '\n')
start_mailing_date=(converted=='emailed').cumsum().idxmin() #converted id metrics
conversion_mailing_date=(converted=='emailed').cumsum().idxmax()#converted id metrics
delta=(conversion_mailing_date-start_mailing_date)
print("Days to conversion by id: ")
print(delta.to_string())
print(' Mean Days to conversion: %s'%(delta.mean()))
output:
Days from first mail unconverted by id:
id4 42 days
Mean Days not converted: 42 days 00:00:00
Days to conversion by id:
id1 10 days
id2 4 days
id3 10 days
Mean Days to conversion: 8 days 00:00:00

DataFrame: Moving average with rolling, mean and shift while ignoring NaN

I have a data set of, let's say, 420x1. Now I would like to calculate the moving average of the past 30 days, excluding the current date.
If I do the following:
df.rolling(window = 30).mean().shift(1)
my df results in a window with lots of NaNs, which is probably caused by NaNs in the original dataframe here and there (a single NaN within the 30 data points causes the MA to be NaN).
Is there a method that ignores NaN (avoiding the apply method; I run it on large data, so performance is key)? I do not want to replace the value with 0 because that could skew the results.
The same applies to the moving standard deviation.
For example, you can add min_periods, and the NaN is gone:
df = pd.DataFrame({'A': [1, 2, 3, np.nan, 2, 3, 4, np.nan]})
df.A.rolling(window=2, min_periods=1).mean()
Out[7]:
0 1.0
1 1.5
2 2.5
3 3.0
4 2.0
5 2.5
6 3.5
7 4.0
Name: A, dtype: float64
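Applied to the setup in the question, a sketch with synthetic, hypothetical data (420 daily points with a few NaNs) could look like this; min_periods below 30 lets isolated NaNs be skipped instead of turning the whole window into NaN:
import numpy as np
import pandas as pd

# hypothetical stand-in for the 420x1 data set, with a few NaNs sprinkled in
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(420),
              index=pd.date_range('2020-01-01', periods=420, freq='D'))
s.iloc[[10, 50, 200]] = np.nan

# 30-day moving average and std of the past 30 days, excluding the current day
ma = s.rolling(window=30, min_periods=1).mean().shift(1)
mstd = s.rolling(window=30, min_periods=1).std().shift(1)
How low to set min_periods is a judgement call: min_periods=1 averages over however few observations are present, while something like min_periods=25 only tolerates a handful of missing days.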
Option 1
df.dropna().rolling('30D').mean()
Option 2
df.interpolate('index').rolling('30D').mean()
Option 2.5
df.interpolate('index').rolling(30).mean()
Option 3
s.rolling('30D').apply(np.nanmean)
Option 3.5
df.rolling(30).apply(np.nanmean)
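The options above assume a date-indexed DataFrame df (and a Series s for Option 3). A minimal, hypothetical setup to try them on:
import numpy as np
import pandas as pd

# small date-indexed frame with gaps, just to exercise the options above
df = pd.DataFrame({'value': [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]},
                  index=pd.date_range('2021-01-01', periods=6, freq='10D'))
s = df['value']

df.interpolate('index').rolling('30D').mean()  # e.g. Option 2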
You can try dropna() to remove the NaN values, or fillna() to replace NaN with a specific value.
Or you can filter out all NaN values with notnull() or isnull() within your operation.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df2)
one two three
a 0.434024 -0.749472 -1.393307
b NaN NaN NaN
c 0.897861 0.032307 -0.602912
d NaN NaN NaN
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
g NaN NaN NaN
h -1.772906 -1.342019 -0.948151
df3 = df2[df2['one'].notnull()]
# using ~isnull() would return the same result
# df3 = df2[~df2['one'].isnull()]
print(df3)
one two three
a 0.434024 -0.749472 -1.393307
c 0.897861 0.032307 -0.602912
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
h -1.772906 -1.342019 -0.948151
For further reference, pandas has clear documentation about handling missing data.

Pandas expanding window with min_periods

I want to compute expanding window statistics, but with a minimum number of periods of 3, rather than 1. That is, I want it to start computing the statistic after a window of 3 values, and then include all values up to that point:
value expanding_min
------------------------
6 NaN
5 NaN
2 NaN
3 2
1 1
however, using
df['expanding_min']= df.groupby(groupby)['value'].transform(lambda x: pd.rolling_min(x, window=len(x), min_periods=3))
or
df['expanding_min']= df.groupby(groupby)['value'].transform(lambda x: pd.expanding_min(x, min_periods=3))
I get the following error:
ValueError: min_periods (3) must be <= window (1)
This works for me, changing from value to df.value:
pd.expanding_min(df.value, min_periods=3)
or
pd.rolling_min(df.value, window=len(df.value), min_periods=3)
both output:
0 NaN
1 NaN
2 2
3 2
4 1
dtype: float64
Perhaps your window is being set by some other 'value' whose length is 1? That would explain why pandas is giving this error message.
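Side note (not from the original answers): in recent pandas versions the module-level pd.expanding_min and pd.rolling_min functions no longer exist, so the same computation would now be written with the expanding method. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'value': [6, 5, 2, 3, 1]})

# expanding minimum that only starts reporting once 3 observations are available
df['expanding_min'] = df['value'].expanding(min_periods=3).min()

# grouped version of the original attempt ('groupby' stands for the question's
# grouping columns and is hypothetical here):
# df['expanding_min'] = df.groupby(groupby)['value'].transform(
#     lambda x: x.expanding(min_periods=3).min())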