Minimum value of a resample (not 0) - dataframe

I have a dataframe (df) indexed by dates (freq: 15 minutes). A small example:
datetime             Value
2019-09-02 16:15:00   0.00
2019-09-02 16:30:00   3.07
2019-09-02 16:45:00   1.05
I want to resample my dataframe to a frequency of 1 month and calculate the minimum value within each month, which I do with:
df_min = df.resample('1M').min()
Up to this point all is good, but I need the minimum to exclude 0, so I want something like min(i > 0), and I don't know how to get it.

Here is one way to do it.
Assumption: datetime is the index.
import numpy as np

# turn the 0s into NaN so they are ignored, then take the min
df_min = df.replace(0, np.nan).resample('1M').min()

            Value
datetime
2019-09-30   1.05
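An alternative sketch is to filter rather than replace (assuming the value column is named Value, as above); the difference is that a month consisting entirely of zeros would then drop out of the result instead of showing NaN:
# keep only strictly positive values, then take the monthly minimum
df_min = df[df['Value'] > 0].resample('1M').min()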

Related

How to match Datetimeindex for all but the year?

I have a dataset with missing values and a DatetimeIndex. I would like to fill these values with the mean of the values reported at the same month, day and hour in other years. If no values are reported at that specific month/day/hour in any year, I would like to fall back to the interpolated mean of the nearest reported hour. How can I achieve this? Right now my approach is this:
df_Na = df_Na[df_Na['Generation'].isna()]
df_raw = df_raw[~df_raw['Generation'].isna()]
# reduce to month
same_month = df_raw[df_raw.index.month.isin(df_Na.index.month)]
# reduce to same day
same_day = same_month[same_month.index.day.isin(df_Na.index.day)]
# reduce to hour
same_hour = same_day[same_day.index.hour.isin(df_Na.index.hour)]
df_Na holds all the missing values I would like to fill, and df_raw holds all the reported values from which I would like to take the mean. I have a huge dataset, which is why I would like to avoid a for loop at all costs.
My Data looks like this:
df_Na
Generation
2017-12-02 19:00:00 NaN
2021-01-12 00:00:00 NaN
2021-01-12 01:00:00 NaN
..............................
2021-02-12 20:00:00 NaN
2021-02-12 21:00:00 NaN
2021-02-12 22:00:00 NaN
df_raw
Generation
2015-09-12 00:00:00 0.0
2015-09-12 01:00:00 19.0
2015-09-12 02:00:00 0.0
..............................
2021-12-11 21:00:00 0.0
2021-12-11 22:00:00 180.0
2021-12-11 23:00:00 0.0
Use GroupBy.transform with 'mean' to get the average per MM-DD HH group, and replace the missing values with DataFrame.fillna:
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
Then, if necessary, add DataFrame.interpolate:
df = df.interpolate(method='nearest')
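A minimal end-to-end sketch of this approach, with made-up values (the timestamps and numbers are illustrative, not from the question's data):
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2015-09-12 00:00', '2016-09-12 00:00', '2017-09-12 00:00'])
df = pd.DataFrame({'Generation': [10.0, np.nan, 20.0]}, index=idx)

# rows sharing the same month, day and hour (across years) form one group;
# each NaN is replaced by the mean of its group
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
# the 2016 row becomes (10.0 + 20.0) / 2 = 15.0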

pandas groupby several criteria

I have a dataframe that contains every minute of a year.
I need to simplify it to an hourly basis: keep only the hours of the year and take the maximum of the Reserved and Used columns for each hour.
I made this, which works, but not entirely for my purposes:
df = df.assign(date=df.date.dt.round('H'))
df1 = df.groupby('date').agg({'Reserved': ['max'], 'Used': ['max'] }).droplevel(1, axis=1).reset_index()
which just groups the minutes into hours.
date Reserved Used
0 2020-01-01 00:00:00 2176 0.0
1 2020-01-01 01:00:00 2176 0.0
2 2020-01-01 02:00:00 2176 0.0
3 2020-01-01 03:00:00 2176 0.0
4 2020-01-01 04:00:00 2176 0.0
... ... ... ...
8780 2020-12-31 20:00:00 3450 50.0
8781 2020-12-31 21:00:00 3450 0.0
8782 2020-12-31 22:00:00 3450 0.0
8783 2020-12-31 23:00:00 3450 0.0
8784 2021-01-01 00:00:00 3450 0.0
Now I need to group it further to plot several curves, each containing only 24 points (one for every hour), based on several criteria:
- average Used and Reserved for the whole year (so group together every 00 hour, every 01 hour, etc.)
- average Used and Reserved for every month (so group every 00 hour, 01 hour, etc. for each month individually)
- average Used and Reserved for weekdays and for weekends
I know this is only a groupby similar to the one before, but I somehow miss the logic of doing it.
Could anybody help?
Thanks.
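One way to express these three groupings, as a sketch assuming the hourly frame above with a datetime64 date column (the variable names are only illustrative):
# 1) average over the whole year, one point per hour of day (24 points)
by_hour = df.groupby(df['date'].dt.hour)[['Reserved', 'Used']].mean()

# 2) the same 24 points, but for each month individually
by_month_hour = df.groupby([df['date'].dt.month, df['date'].dt.hour])[['Reserved', 'Used']].mean()

# 3) the same 24 points, split into weekdays vs. weekends
is_weekend = df['date'].dt.dayofweek >= 5  # Saturday=5, Sunday=6
by_day_type = df.groupby([is_weekend, df['date'].dt.hour])[['Reserved', 'Used']].mean()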

How to fill pandas dataframe with max() values

I have a dataframe where each day starts at 7:00 and ends at 22:10, in 5-minute intervals.
The df contains around 200 days (weekends and some specific days are excluded):
Date Time Volume
0 2019-09-03 07:00:00 70000 778
1 2019-09-03 07:05:00 70500 1267
2 2019-09-03 07:10:00 71000 1208
3 2019-09-03 07:15:00 71500 715
4 2019-09-03 07:20:00 72000 372
I need another column, let's call it 'lastdayVolume', with the maximum value of Volume from the prior day.
For example, if on 2019-09-03 (between 7:00 and 22:10) the maximum Volume value in a single row is 50000, then every row of 2019-09-04 should have the value 50000 in column 'lastdayVolume'.
How would you do this without decreasing the length of the dataframe?
Have you tried
df.resample('1D', on='Date').max()
This should give you one row per day with the maximal value for that day.
EDIT: To combine that with the old data, you can use a left join. It's a bit messy, but
pd.merge(df, df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'),
         left_on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date),
         right_index=True, how='left')
Date Time Volume lastdayVolume
0 2019-09-03 07:00:00 70000 778 800.0
1 2019-09-03 07:05:00 70500 1267 800.0
2 2019-09-03 07:10:00 71000 1208 800.0
3 2019-09-03 07:15:00 71500 715 800.0
4 2019-09-03 07:20:00 72000 372 800.0
0 2019-09-02 08:00:00 70000 800 NaN
seems to work out.
Equivalently you can use the slightly shorter
df.join(df.resample('1D', on='Date')['Volume'].max().rename('lastdayVolume'), on=pd.to_datetime((df['Date'] - pd.Timedelta('1d')).dt.date))
here.
The first DataFrame is your old one; the second is the one calculated above (with the appropriate renaming). On the left, the merge key is your 'Date' column of timestamps, offset by one day and converted to an actual date; on the right it is simply the index.
The left join ensures you don't accidentally drop rows if you have no transactions the day before.
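An alternative sketch (not from the original answer): compute the per-day maxima once, then look up yesterday's value for every row, assuming Date is a datetime64 column:
# daily maxima, indexed by midnight timestamps
daily_max = df.resample('1D', on='Date')['Volume'].max()
# map each row's previous-day midnight onto that Series; days with no prior data give NaN
df['lastdayVolume'] = (df['Date'].dt.normalize() - pd.Timedelta('1d')).map(daily_max)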
EDIT 2: To take the maximum over a certain time range only, you can use
df.set_index('Date').between_time('15:30:00', '22:10:00')
to filter the DataFrame, and afterwards resample as before:
df.join(df.set_index('Date').between_time('15:30:00', '22:10:00').resample('1D')...
where the on parameter of resample is no longer necessary, as Date has moved into the index.

What is timestamp granularity computing in pandas?

I have the following dataset
import pandas as pd
import numpy as np

df = pd.DataFrame({'timestamp': pd.date_range('1/1/2020', '3/1/2020 23:59', freq='12h'),
                   'col1': np.random.randint(100, size=122)}) \
       .sort_values('timestamp')
I want to compute daily, weekly and monthly sums of col1. If I use 'W' granularity for the timestamp column, I receive ValueError: <Week: weekday=6> is a non-fixed frequency, and I read that it is recommended to use 7D, 30D, etc.
My question is: how does pandas compute 7D or 30D granularity? If I add another column
df['timestamp2']= df.timestamp.dt.floor('30D')
df.groupby('timestamp2')[['col1']].sum()
I get the following result:
timestamp2 col1
2019-12-10 778
2020-01-09 3100
2020-02-08 2470
Why does pandas return those dates if my minimum date is Jan 1, 2020 and my maximum timestamp is Mar 1, 2020?
The origin is the POSIX origin, 1970-01-01. With .floor('30D') the allowable values are 1970-01-01, 1970-01-31, ... and all other 30-day multiples. Your dates are close to the 608th-610th multiples.
pd.to_datetime('1970-01-01') + pd.DateOffset(days=30*608)
#Timestamp('2019-12-10 00:00:00')
pd.to_datetime('1970-01-01') + pd.DateOffset(days=30*609)
#Timestamp('2020-01-09 00:00:00')
If what you want is instead 30D periods from your first observation, then resample is how you can aggregate:
df.resample('30D', on='timestamp')['timestamp'].agg(['min', 'max'])
min max
timestamp
2020-01-01 2020-01-01 2020-01-30 12:00:00 # starts from 1st date
2020-01-31 2020-01-31 2020-02-29 12:00:00
2020-03-01 2020-03-01 2020-03-01 12:00:00
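Note that the ValueError comes from floor/round, which require a fixed frequency; resample itself accepts anchored frequencies such as weeks and calendar months, so the daily, weekly and monthly sums the question started from can be written directly (a sketch, using the df defined above):
daily = df.resample('D', on='timestamp')['col1'].sum()
weekly = df.resample('W', on='timestamp')['col1'].sum()   # weeks labelled by their Sunday end
monthly = df.resample('M', on='timestamp')['col1'].sum()  # labelled by calendar month end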

PANDAS: Converting all datetimes in a column to another format

I have a Pandas dataframe containing a datetime column in which all the values are formatted like this: 25/09/15 12:00:00. I'd like to reformat this field in all the rows to match this format: 25.09.15 12:00.
Here is some sample data:
Date | Value
25/08/15 12:00:00 | 49.0
25/08/15 13:00:00 | 49.5
The date column's datatype is string.
Thank you in advance.
Use Series.dt.strftime to format the datetimes:
df
Date Value
0 2015-08-25 12:00:00 49.0
1 2015-08-25 13:00:00 49.5
df['Date'] = df['Date'].dt.strftime('%d.%m.%y %H:%M')
df
Date Value
0 25.08.15 12:00 49.0
1 25.08.15 13:00 49.5
If the column type is str, you need to convert it to datetime first (dayfirst=True because the sample dates are day-first):
df.Date = pd.to_datetime(df.Date, dayfirst=True)
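Putting both steps together as a minimal, self-contained sketch (the sample values are taken from the question):
import pandas as pd

df = pd.DataFrame({'Date': ['25/08/15 12:00:00', '25/08/15 13:00:00'],
                   'Value': [49.0, 49.5]})

# parse the day-first strings, then render them in the target format
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True).dt.strftime('%d.%m.%y %H:%M')
#              Date  Value
# 0  25.08.15 12:00   49.0
# 1  25.08.15 13:00   49.5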