How to find consecutive zeros in time series - pandas

I have a data frame whose index is an hourly date and whose column is counts. It looks like the following table:
date counts
2017-03-31 00:00:00+00:00 0.0
2017-03-31 01:00:00+00:00 0.0
2017-03-31 02:00:00+00:00 0.0
2017-03-31 03:00:00+00:00 0.0
2017-03-31 04:00:00+00:00 0.0
... ...
2022-06-19 19:00:00+00:00 6.0
2022-06-19 20:00:00+00:00 6.0
2022-06-19 21:00:00+00:00 1.0
2022-06-19 22:00:00+00:00 1.0
2022-06-19 23:00:00+00:00 1.0
If there are 15 hours' worth of zero counts in a row, they are considered an error and I want to flag them. The data frame is not complete and there are missing dates (gaps) in the data.
I tried resampling the data frame to 15 hours and finding dates where the sum of the resampled 15 hours is zero, but that didn't give me the correct answer.

If counts is guaranteed to be non-negative, you can use rolling and check for the max value:
df["is_error"] = df["counts"].rolling(15).max() == 0
If counts can be negative, you have to check both min and max:
r = df["counts"].rolling(15)
df["is_error"] = r.min().eq(0) & r.max().eq(0)

Assuming the dates are sorted, group by runs of successive zeros and compare the group size; each group also includes the preceding non-zero row, so a size greater than 15 means at least 15 consecutive zeros:
m = df['counts'].ne(0)                                  # True on non-zero rows
c = df.groupby(m.cumsum())['counts'].transform('size')  # size of each run of zeros plus its leading non-zero
df['error'] = c.gt(15).mask(m, False)                   # flag only the zero rows in long runs
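A small demo of the mechanics, with the threshold shrunk from 15 to 3. One caveat: a run at the very start of the series has no leading non-zero row, so the size check effectively requires one extra zero there:
import pandas as pd

s = pd.Series([0, 0, 0, 0, 5, 0, 0, 1])   # leading run of 4 zeros, mid run of 2

m = s.ne(0)                                # True on non-zero rows
gid = m.cumsum()                           # zeros inherit the id of the preceding non-zero
size = s.groupby(gid).transform('size')    # run length + 1 for the non-zero row
flag = size.gt(3).mask(m, False)           # "size > 3" means "at least 3 consecutive zeros"
print(pd.DataFrame({'counts': s, 'group': gid, 'size': size, 'error': flag}))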

Related

Minimum value of a resample (not 0)

I have a dataframe (df) indexed by dates (freq: 15 minutes). A small example:
datetime             Value
2019-09-02 16:15:00  0.00
2019-09-02 16:30:00  3.07
2019-09-02 16:45:00  1.05
I want to resample my dataframe to a frequency of 1 month, and I also need to calculate the min value in each month, which I do like this:
df_min = df.resample('1M').min()
Up to this point all good, but I need the min value to not be 0, so I want something like min(i > 0), but I don't know how to get it.
Here is one way to do it. Assumption: datetime is the index.
import numpy as np
# make the 0s NaN, then take the min (min skips NaN by default)
df_min = df.replace(0, np.nan).resample('1M').min()
            Value
datetime
2019-09-30   1.05
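For completeness, a runnable version of the whole thing on the example above (note that pandas 2.2+ prefers the month-end alias 'ME' over 'M'):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Value": [0.00, 3.07, 1.05]},
    index=pd.DatetimeIndex(["2019-09-02 16:15:00",
                            "2019-09-02 16:30:00",
                            "2019-09-02 16:45:00"], name="datetime"),
)

# zeros become NaN, and min() skips NaN by default
df_min = df.replace(0, np.nan).resample("1M").min()
print(df_min)   # 2019-09-30 -> 1.05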

How to match a DatetimeIndex for all but the year?

I have a dataset with missing values and a DatetimeIndex. I would like to fill these values with the mean of other values reported at the same month, day and hour. If there are no values reported at that specific month/day/hour in any year, I would like to use the interpolated mean of the nearest reported hour. How can I achieve this? Right now my approach is this:
df_Na = df_Na[df_Na['Generation'].isna()]
df_raw = df_raw[~df_raw['Generation'].isna()]
# reduce to month
same_month = df_raw[df_raw.index.month.isin(df_Na.index.month)]
# reduce to same day
same_day = same_month[same_month.index.day.isin(df_Na.index.day)]
# reduce to hour
same_hour = same_day[same_day.index.hour.isin(df_Na.index.hour)]
df_Na contains all the missing values I would like to fill, and df_raw contains all the reported values from which I would like to compute the mean. I have a huge dataset, which is why I would like to avoid a for loop at all costs.
My Data looks like this:
df_Na
Generation
2017-12-02 19:00:00 NaN
2021-01-12 00:00:00 NaN
2021-01-12 01:00:00 NaN
..............................
2021-02-12 20:00:00 NaN
2021-02-12 21:00:00 NaN
2021-02-12 22:00:00 NaN
df_raw
Generation
2015-09-12 00:00:00 0.0
2015-09-12 01:00:00 19.0
2015-09-12 02:00:00 0.0
..............................
2021-12-11 21:00:00 0.0
2021-12-11 22:00:00 180.0
2021-12-11 23:00:00 0.0
Use GroupBy.transform with 'mean' to get the average per MM-DD HH group, and replace the missing values with DataFrame.fillna:
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
Then, if necessary, add DataFrame.interpolate:
df = df.interpolate(method='nearest')
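Putting the two steps together on a tiny invented frame (the dates and values are mine, just to make the mechanics visible; interpolate(method='nearest') requires scipy):
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2019-01-05 10:00", "2020-01-05 10:00",
                      "2020-03-01 00:00", "2021-01-05 10:00"])
df = pd.DataFrame({"Generation": [10.0, 20.0, np.nan, np.nan]}, index=idx)

# step 1: fill each NaN with the mean of all rows sharing its month-day-hour
key = df.index.strftime("%m-%d %H")
df = df.fillna(df.groupby(key).transform("mean"))   # 2021-01-05 10:00 -> 15.0

# step 2: rows with no counterpart in any year are still NaN and take the
# value of the nearest reported timestamp (requires scipy)
df = df.interpolate(method="nearest")               # 2020-03-01 -> 20.0
print(df)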

pandas groupby several criteria

I have a dataframe that contains a row for every minute of a year.
I need to simplify it to an hourly basis: keep only the hours of the year, with the maximum of the Reserved and Used columns for each hour.
I made this, which works, but not entirely for my purposes:
df = df.assign(date=df.date.dt.round('H'))
df1 = df.groupby('date').agg({'Reserved': ['max'], 'Used': ['max'] }).droplevel(1, axis=1).reset_index()
which just groups the minutes into hours.
date Reserved Used
0 2020-01-01 00:00:00 2176 0.0
1 2020-01-01 01:00:00 2176 0.0
2 2020-01-01 02:00:00 2176 0.0
3 2020-01-01 03:00:00 2176 0.0
4 2020-01-01 04:00:00 2176 0.0
... ... ... ...
8780 2020-12-31 20:00:00 3450 50.0
8781 2020-12-31 21:00:00 3450 0.0
8782 2020-12-31 22:00:00 3450 0.0
8783 2020-12-31 23:00:00 3450 0.0
8784 2021-01-01 00:00:00 3450 0.0
Now I need to group it further to plot several curves, each containing only 24 points (one for every hour), based on several criteria:
average Used and Reserved for the whole year (i.e. group together every 00 hour, every 01 hour, etc.)
average Used and Reserved for every month (i.e. group every 00 hour, 01 hour, etc. for each month individually)
average Used and Reserved for weekdays and for weekends
I know this is just a groupby similar to the one before, but I somehow miss the logic of doing it.
Could anybody help?
Thanks.
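No answer was recorded for this one, but the grouping logic follows the same pattern as the code above: group on attributes of the date column rather than on the column itself. A minimal sketch, assuming the hourly frame df1 produced by the snippet above (mean is used for the averages; swap in max if that is what you need):
# 24-point profile over the whole year: one row per hour of day
hourly = df1.groupby(df1['date'].dt.hour)[['Reserved', 'Used']].mean()

# the same 24-point profile, computed separately for each month
monthly = df1.groupby([df1['date'].dt.month, df1['date'].dt.hour])[['Reserved', 'Used']].mean()

# weekday vs. weekend profiles: dayofweek >= 5 means Saturday/Sunday
weekend = df1.groupby([df1['date'].dt.dayofweek >= 5, df1['date'].dt.hour])[['Reserved', 'Used']].mean()
The last two come back with a MultiIndex, so e.g. monthly.unstack(level=0).plot() draws one 24-point curve per month.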

Comparing one date with starting date and ending date with duplicates

I have the below sample of data, and I need to create a function that takes a sales date, compares it with the dates below, and returns the discount name and percentage. The discount dates are not unique and sometimes overlap, so if the sales date falls within more than one discount, the function has to return the discount with the highest percentage.
Discount Name Start Date End Date Percentage
0 First 2020-07-24 2020-11-25 0.10
1 First 2020-09-13 2020-10-29 0.10
2 First 2020-12-07 2020-12-10 0.10
3 First 2020-12-28 2021-01-19 0.10
4 First 2020-06-14 2020-06-14 0.10
5 Second 2020-06-16 2020-06-18 0.15
6 Second 2020-06-21 2020-06-22 0.15
7 Second 2020-06-22 2020-06-23 0.15
8 Second 2020-07-07 2020-07-08 0.15
9 Third 2020-06-02 2020-06-12 0.20
10 Third 2020-05-19 2020-06-01 0.20
11 Third 2020-05-06 2020-05-17 0.20
12 Third 2020-04-30 2020-05-03 0.20
I truly hope that someone can help me with this. Thanks.
This function should do the trick:
def discount_rate(df, sales_date):
    return df[(df['Start Date'] <= sales_date) & (df['End Date'] >= sales_date)]['Percentage'].max()
The sales_date argument should be of type datetime.datetime, and so should the Start Date and End Date columns.
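The snippet above returns only the percentage, while the question also asks for the discount name. One possible extension (the function name and the empty-match behaviour are my own choices, not from the thread):
def discount_info(df, sales_date):
    # rows whose [Start Date, End Date] interval contains the sales date
    hits = df[(df['Start Date'] <= sales_date) & (df['End Date'] >= sales_date)]
    if hits.empty:
        return None   # no discount applies on that date
    best = hits.loc[hits['Percentage'].idxmax()]   # on overlap, keep the highest percentage
    return best['Discount Name'], best['Percentage']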

Group by Index of Row in Pandas

I want to group and sum every 7 rows together (hence getting a total for each week). There are currently two columns: one for the date and the other for a float.
1/22/2020 NaN
1/23/2020 0.0
1/24/2020 1.0
1/25/2020 0.0
1/26/2020 3.0
1/27/2020 0.0
1/28/2020 0.0
1/29/2020 0.0
1/30/2020 0.0
1/31/2020 2.0
2/1/2020 1.0
2/2/2020 0.0
2/3/2020 3.0
2/4/2020 0.0
2/5/2020 0.0
2/6/2020 0.0
2/7/2020 0.0
2/8/2020 0.0
2/9/2020 0.0
2/10/2020 0.0
2/11/2020 1.0
2/12/2020 0.0
2/13/2020 1.0
2/14/2020 0.0
2/15/2020 0.0
2/16/2020 0.0
2/17/2020 0.0
2/18/2020 0.0
2/19/2020 0.0
2/20/2020 0.0
... ...
2/28/2020 0.0
2/29/2020 8.0
3/1/2020 6.0
3/2/2020 23.0
3/3/2020 20.0
3/4/2020 31.0
3/5/2020 68.0
3/6/2020 45.0
3/7/2020 119.0
3/8/2020 114.0
3/9/2020 64.0
3/10/2020 194.0
3/11/2020 397.0
3/12/2020 452.0
3/13/2020 590.0
3/14/2020 710.0
3/15/2020 61.0
3/16/2020 1389.0
3/17/2020 1789.0
3/18/2020 906.0
3/19/2020 3068.0
3/20/2020 4009.0
3/21/2020 4017.0
3/23/2020 25568.0
3/24/2020 10074.0
3/25/2020 12043.0
3/26/2020 18058.0
3/27/2020 17822.0
3/28/2020 19825.0
3/29/2020 19408.0
Assuming your date column is called dt and your value column is val:
import numpy as np
import pandas as pd

# in case it's not already datetime format:
df["dt"] = pd.to_datetime(df["dt"])
# your data looks sorted, but in case it's not - that's the prerequisite here:
df = df.sort_values("dt")
# integer-divide the row position by 7 to label each chunk of 7 rows
df = df.groupby(np.arange(len(df)) // 7).agg({"dt": ["min", "max"], "val": "sum"})
The aggregation for dt is done only so you can explicitly see each aggregated interval - taking just the min might be enough, for instance, or you could skip it altogether...
Alternatively, set the date column as the index and use resample:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.resample('1W').sum()
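Note that the two answers do not bin identically: resample('1W') uses calendar weeks (ending on Sunday by default) and is unaffected by gaps such as the missing 3/22/2020, while the positional groupby above chunks strictly every 7 rows regardless of dates. If you want 7-day bins anchored at the first date instead of calendar weeks, resample can do that as well, continuing from the snippet above:
# 7-day bins starting from the first timestamp rather than calendar weeks
df.resample('7D').sum()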