Suppose I have a DataFrame (DF) whose index is a timestamp running from 11 AM to 6 PM every day, covering 30 days. I want to group it into 30-minute intervals. This is the code I'm using:
out = DF.groupby(pd.Grouper(freq='30min'))
The start date of the output is correct, but the grouping covers the whole day (24 h). For example, the new timestamps look like this:
11:00:00
11:30:00
12:00:00
12:30:00
...
18:00:00
18:30:00
...
23:00:00
23:30:00
...
2:00:00
2:30:00
...
...
10:30:00
11:00:00
11:30:00
As a result, many of the output groups are empty, because I have no data between 6 PM and 11 AM.
One possible solution is DatetimeIndex.floor, which only creates groups for timestamps that actually occur in the index:
out = DF.groupby(DF.index.floor('30min'))
Or use dropna after the aggregate function:
out = DF.groupby(pd.Grouper(freq='30min')).mean().dropna()
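To see the difference between the two approaches, here is a minimal sketch on made-up data. The two-day, 10-minute sample and the column name value are assumptions for illustration only, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two days of 10-minute readings, 11:00 to 18:00 only.
idx = pd.date_range("2021-01-01 11:00", "2021-01-01 18:00", freq="10min").append(
    pd.date_range("2021-01-02 11:00", "2021-01-02 18:00", freq="10min")
)
DF = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# Grouping on the floored index only creates bins that contain data.
by_floor = DF.groupby(DF.index.floor("30min")).mean()

# pd.Grouper builds a regular 30-minute grid between the first and last
# timestamp, so the overnight bins show up as NaN rows.
by_grouper = DF.groupby(pd.Grouper(freq="30min")).mean()

print(len(by_floor), len(by_grouper), len(by_grouper.dropna()))
```

After dropna, the Grouper result should have the same number of rows as the floor-based one.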
As mentioned in a comment on the original post, this is the expected behaviour. If you want to remove the empty groups, simply filter them out afterwards. Assuming you aggregate with count, keep the rows where at least one column has a non-zero count:
df = df.groupby(pd.Grouper(freq='30min')).count()
df = df[(df > 0).any(axis=1)]
I have the following three dataframes:
df1:
date_time system_load
01-01-2017 00:00:00 208111
01-01-2017 01:00:00 208311
01-01-2017 02:00:00 208311
01-01-2017 03:00:00 208011
............... ...
31-12-2017 20:00:00 208611
31-12-2017 21:00:00 208411
31-12-2017 22:00:00 208111
31-12-2017 23:00:00 208911
The system load values of df1 have no problems.
df2:
date_time system_load
01-01-2018 00:00:00 208111
01-01-2018 01:00:00 208311
01-01-2018 02:00:00 208311
01-01-2018 03:00:00 208011
............... ...
31-12-2018 20:00:00 209611
31-12-2018 21:00:00 209411
31-12-2018 22:00:00 209111
31-12-2018 23:00:00 209911
The system load values of df2 are missing from 06-03-2018 20:00:00 up to 24-10-2018 22:00:00.
df3:
date_time system_load
01-01-2019 00:00:00 309119
01-01-2019 01:00:00 309391
01-01-2019 02:00:00 309811
01-01-2019 03:00:00 309711
............... ...
31-12-2019 20:00:00 309611
31-12-2019 21:00:00 309411
31-12-2019 22:00:00 309111
31-12-2019 23:00:00 309911
The system load values of df3 have no problems.
What I want is to interpolate, in a suitable way, the missing hourly records in df2 using the corresponding hourly records of df1 and df3 (06-03-2017 20:00:00 up to 24-10-2017 22:00:00 and 06-03-2019 20:00:00 up to 24-10-2019 22:00:00 respectively). Following "Pierre D"'s valuable comment, I attached my scaled data.
Here is a very basic strategy that just takes data from neighboring years to fill the missing values. The offset is chosen to be precisely 52 weeks, so as to reflect possible weekly seasonality.
# get the whole series together, and resample to have missing data as NaN:
s = pd.concat([df1, df2, df3])['system_load'].resample('H').asfreq()
offset = 52 * 7 * 24 # 52 weeks, 7 days/week, 24 hours/day
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(~s.isna(), filler)
# optional: make a new df2 with the filled values
df2mod = out.truncate(
before='2018',
after=pd.Timestamp('2019') - pd.Timedelta(1)
).to_frame('system_load')
Notes:
out contains the "filled" series for the whole system_load using neighboring years.
we use pandas.DataFrame.mean() to build the filler series as the mean of the two neighboring years, in a way that takes care of NaN (e.g. if one year or the other has NaN, then the mean is the only non-NaN value).
this is one of the most basic ways of filling the missing data, and likely won't fool a careful observer. Depending on the intended usage of the reconstructed data, a more elaborate strategy should be considered. Data reconstruction is an active field of research, and there are sophisticated methods in the literature. For example, one could use a GAN to build a resulting series that would be very hard to discriminate from real data.
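As a quick sanity check of the strategy above, here is a small self-contained sketch on synthetic data. The linear ramp values are invented for illustration; the gap matches the one described in the question. It uses fillna, which should behave the same as the where(~s.isna(), ...) call above.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series for 2017-2019 (made-up values: a simple ramp).
idx = pd.date_range("2017-01-01", "2019-12-31 23:00", freq="h")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Blank out the same span that is missing in df2.
s.loc["2018-03-06 20:00":"2018-10-24 22:00"] = np.nan

# Fill each missing hour with the mean of the values 52 weeks before and after.
offset = 52 * 7 * 24
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.fillna(filler)

print(out.isna().sum())
```

On a ramp the mean of the two neighbours reproduces the hidden values exactly, which makes it easy to confirm that the shifts line up. Real load data will of course not be reconstructed this cleanly.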
The goal is to compute the delta between two times, each in a separate DataFrame column and in 24-hour clock format, and add it as a new column "triptime".
Here is my input data, which has no dates, just 24-hour clock strings.
df = pd.DataFrame({'DepartureTime': ['2330', '1700', '0900'], 'ArrivalTime': ['0030','1900','1100']})
Here is my attempt
df['DepartureTime'] = pd.to_datetime(df.DepartureTime, format='%H%M')
df['ArrivalTime'] = pd.to_datetime(df.ArrivalTime, format='%H%M')
df['triptime'] = df.ArrivalTime - df.DepartureTime
This gives a wrong result for the first row, where the trip crosses midnight. Unfortunately my pipeline data assumes no change in dates. Any guidance on how I can have the triptime column show the actual trip time, without the days prefix?
IIUC, you can convert the difference to hours with .dt.total_seconds() divided by 3600. (On older pandas versions, .astype('timedelta64[h]') gave the same float result; recent versions no longer accept that conversion.)
df['triptime'] = (df.ArrivalTime - df.DepartureTime).dt.total_seconds() / 3600
#output
DepartureTime ArrivalTime triptime
0 1900-01-01 23:30:00 1900-01-01 00:30:00 -23.0
1 1900-01-01 17:00:00 1900-01-01 19:00:00 2.0
2 1900-01-01 09:00:00 1900-01-01 11:00:00 2.0
One way to get the interval when the day turns is to select all values less than zero and add 24. Apparently it solves the problem but it is not something I like. It seems highly susceptible to errors.
df.loc[df['triptime'] < 0, 'triptime'] = df['triptime'] + 24
#output
DepartureTime ArrivalTime triptime
0 1900-01-01 23:30:00 1900-01-01 00:30:00 1.0
1 1900-01-01 17:00:00 1900-01-01 19:00:00 2.0
2 1900-01-01 09:00:00 1900-01-01 11:00:00 2.0
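An alternative to patching the negative values is to take the difference modulo 24 hours. This is only a sketch, and it assumes that no trip lasts 24 hours or longer, which cannot be verified from clock times alone.

```python
import pandas as pd

df = pd.DataFrame({
    "DepartureTime": ["2330", "1700", "0900"],
    "ArrivalTime": ["0030", "1900", "1100"],
})

dep = pd.to_datetime(df["DepartureTime"], format="%H%M")
arr = pd.to_datetime(df["ArrivalTime"], format="%H%M")

# Difference in hours; negative when the arrival is after midnight.
hours = (arr - dep).dt.total_seconds() / 3600

# Wrap negative differences into the 0-24 range (assumes trips shorter than 24 h).
df["triptime"] = hours % 24
print(df)
```

This does the same thing as the "add 24 to negatives" step in one expression, with the same limitation.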
The most correct and fail-safe way would be to have the full dates in addition to the departure and arrival times.
If, after the calculations, you want to remove the dates and keep only the times, use .dt.time:
df['DepartureTime'] = df['DepartureTime'].dt.time
df['ArrivalTime'] = df['ArrivalTime'].dt.time
#output
DepartureTime ArrivalTime triptime
0 23:30:00 00:30:00 1.0
1 17:00:00 19:00:00 2.0
2 09:00:00 11:00:00 2.0
I am trying to count datetime occurrences in 12-hour bins using dt.floor, as follows.
I created a data frame that contains 2 days of timestamps at 1-hour intervals. I have two questions about the output.
I expected the summary to be for every 12 hours, i.e. the first row of output-1 would be 12:00 and the second row 24:00. Instead, I get 00:00 and 12:00. Why is this?
Is it possible to create the summary relative to a specific time, for example counting at every 6 AM and 6 PM?
Code and input:
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
input1.groupby(input1['datetime'].dt.floor('12H')).count()
output-1
datetime
datetime
2018-01-01 00:00:00 12
2018-01-01 12:00:00 12
2018-01-02 00:00:00 12
2018-01-02 12:00:00 12
output-2
datetime
datetime
2018-01-01 06:00:00 6
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 6
There is no 24th hour. The time part of a datetime in pandas exists in the range [00:00:00, 24:00:00), which ensures that there's only ever a single representation of the same exact time. (Notice the closure).
import pandas as pd
pd.to_datetime('2012-01-01 24:00:00')
#ParserError: hour must be in 0..23: 2012-01-01 24:00:00
For the second point: as of pandas 1.1.0 you can specify the offset parameter when you resample. You can also specify which side should be used for the labels. For older versions you will need to use the base argument.
# pandas < 1.1.0 (the base argument was removed in pandas 2.0)
#input1.resample('12H', on='datetime', base=6).count()
# pandas >= 1.1.0
input1.resample('12H', on='datetime', offset='6H').count()
# datetime
#datetime
#2017-12-31 18:00:00 6
#2018-01-01 06:00:00 12
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 6
# Change labels
input1.resample('12H', on='datetime', offset='6H', label='right').count()
# datetime
#datetime
#2018-01-01 06:00:00 6
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 12
#2018-01-03 06:00:00 6
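If you would rather stay with the dt.floor approach from the question, the same 6 AM / 6 PM bins can be sketched by shifting the timestamps before flooring and shifting back afterwards. This is a version-independent illustration, not part of the resample API; the names shift, bins and counts are just local variables.

```python
import pandas as pd

input1 = pd.DataFrame(
    {"datetime": pd.date_range("2018-01-01 00:00:00", periods=48, freq="h")}
)

# Move 06:00 to where 00:00 was, floor to 12-hour bins, then move back.
shift = pd.Timedelta(hours=6)
bins = (input1["datetime"] - shift).dt.floor("12h") + shift

counts = input1.groupby(bins)["datetime"].count()
print(counts)
```

The labels are the left edges of the bins, so the first group starts at 18:00 on the previous day, as in the offset='6H' output above.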
I modified your input data slightly, in order to use resample:
import pandas as pd
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
# add a dummy column
input1['x'] = 'x'
# convert datetime to index...
input1 = input1.set_index('datetime')
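# ...so we can use resample; loffset shifts the labels by 6 hours
# (the bins themselves still start at midnight; loffset was deprecated in pandas 1.1 and removed in 2.0)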
t = input1.resample('12h', loffset=pd.Timedelta(hours=6)).count()
# show results
print(t.head())
x
datetime
2018-01-01 06:00:00 12
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 12
I have a dataframe with a single column containing different times.
Time
-----
10:00
11:30
12:30
14:10
...
I need to compute quantiles on this dataframe with the code below:
df.quantile([0,0.5,1],numeric_only=False)
According to the documentation linked below, quantile does work on datetime data.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
As my column has dtype object, I need to convert it with pd.to_datetime or to pd.Timestamp.
When I convert it with pd.to_datetime, all my times get a date attached too.
If I format it as %H:%M, the column turns back into object, which does not work with quantile.
How can I convert to datetime format in %H:%M and still stick to datetime format?
Below was the code I used:
df = pd.DataFrame({"Time":["10:10","09:10","12:00","13:23","15:23","17:00","17:30"]})
df['Time2'] = pd.to_datetime(df['Time']).dt.strftime('%H:%M')
df['Time2'] = df['Time2'].astype('datetime64[ns]')
How can I convert to datetime format in %H:%M and still stick to datetime format?
That is impossible in pandas, because a datetime always carries a date. The closest alternative is probably to use timedeltas:
df = pd.DataFrame({"Time":["10:10","09:10","12:00","13:23","15:23","17:00","17:30"]})
df['Time2'] = pd.to_timedelta(df['Time'].add(':00'))
print (df)
Time Time2
0 10:10 10:10:00
1 09:10 09:10:00
2 12:00 12:00:00
3 13:23 13:23:00
4 15:23 15:23:00
5 17:00 17:00:00
6 17:30 17:30:00
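With the column stored as timedeltas, the quantiles asked for in the question can then be taken directly. A minimal sketch, assuming the same sample data as above; q is just a local name for the result.

```python
import pandas as pd

df = pd.DataFrame({"Time": ["10:10", "09:10", "12:00", "13:23", "15:23", "17:00", "17:30"]})
df["Time2"] = pd.to_timedelta(df["Time"] + ":00")

# Minimum, median and maximum time of day, returned as Timedelta values.
q = df["Time2"].quantile([0, 0.5, 1])
print(q)
```

The result is a Series of Timedelta values indexed by the requested quantiles, so no dates are involved at any point.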
Is there any way to reset the date portion to the last day of the month while preserving the time? For example:
2018-01-02 23:00:00 -> 2018-01-31 23:00:00
2018-04-04 10:00:00 -> 2018-04-30 10:00:00
The Oracle function last_day() does exactly this. Try:
select last_day(sysdate), sysdate
from dual
to see how it works.
Ironically, I usually find the preservation of the time to be counterintuitive, so my usual usage is more like:
select last_day(trunc(sysdate))
from dual