dt.floor count for every 12 hours in Pandas - pandas

I am trying to count the datetime occurrences every 12 hours as follows using dt.floor.
Here I created a data frame contains 2 days with 1-hour intervals. I have two questions regarding the output.
I am expecting the summary would be for every 12 hours i.e, first-row in the output1 should be 12:00 and second row would be 24:00. Instead, I get 00:00 and 12:00. Why is this?
Is it possible to create a summary using a specific time? for example, count every 6 Am and 6 PM?
code and input
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
input1.groupby(input1['datetime'].dt.floor('12H')).count()
output-1
datetime
datetime
2018-01-01 00:00:00 12
2018-01-01 12:00:00 12
2018-01-02 00:00:00 12
2018-01-02 12:00:00 12
output-2
datetime
datetime
2018-01-01 06:00:00 6
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 6

There is no 24th hour. The time part of a datetime in pandas exists in the range [00:00:00, 24:00:00), which ensures that there's only ever a single representation of the same exact time. (Notice the closure).
import pandas as pd
pd.to_datetime('2012-01-01 24:00:00')
#ParserError: hour must be in 0..23: 2012-01-01 24:00:00
For the second point as of pd.__version__ == '1.1.0' you can specify the offset parameter when you resample. You can also specify which side should be used for the labels. For older versions you will need to use the base argument.
# pandas < 1.1.0
#input1.resample('12H', on='datetime', base=6).count()
input1.resample('12H', on='datetime', offset='6H').count()
# datetime
#datetime
#2017-12-31 18:00:00 6
#2018-01-01 06:00:00 12
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 6
# Change labels
input1.resample('12H', on='datetime', offset='6H', label='right').count()
# datetime
#datetime
#2018-01-01 06:00:00 6
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 12
#2018-01-03 06:00:00 6

I modified your input data slightly, in order to use resample:
import pandas as pd
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
# add a dummy column
input1['x'] = 'x'
# convert datetime to index...
input1 = input1.set_index('datetime')
# ...so we can use resample, and loffset lets us start at 6 am
t = input1.resample('12h', loffset=pd.Timedelta(hours=6)).count()
# show results
print(t.head())
x
datetime
2018-01-01 06:00:00 12
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 12

Related

Replacing seconds from the date time in pandas [duplicate]

I have following dataframe in pandas
code time
1 003002
1 053003
1 060002
1 073001
1 073003
I want to generate following dataframe in pandas
code time new_time
1 003002 00:30:00
1 053003 05:30:00
1 060002 06:00:00
1 073001 07:30:00
1 073003 07:30:00
I am doing it with following code
df['new_time'] = pd.to_datetime(df['time'] ,format='%H%M%S').dt.time
How can I do it in pandas?
Use Series.dt.floor:
df['time'] = pd.to_datetime(df['time'], format='%H%M%S').dt.floor('T').dt.time
Or remove last 2 values by indexing, then change format to %H%M:
df['time'] = pd.to_datetime(df['time'].str[:-2], format='%H%M').dt.time
print (df)
code time
0 1 00:30:00
1 1 05:30:00
2 1 06:00:00
3 1 07:30:00
4 1 07:30:00
An option using astype:
pd.to_datetime(df_oclh.Time).astype('datetime64[m]').dt.time
'datetime64[m]' symbolizes the time we want to convert to which is datetime with minutes being the largest granulariy of time wanted. Alternatively you could use [s] for seconds (rid of milliseconds) or [H] for hours (rid of minutes, seconds and milliseconds)

expand datetime data in pandas like interpolation

I have the following data of dates and every date is assigned to the value 1
is there a way to somehow get a pandas list of hourly DateTime list such that all the values are 0 except for the one's I have in my xls file?
it is similar to interpolating but interpolating just interpolates whereas here I want just the rest of the date to be filled as 0.I want the entire 24 hours of the below dates to be assigned as one.I tried to do it in a for loop method but it just takes longer than ever and is very much nonpractical
Use pandas datetime accessor pd.Series.dt.date to extract the date part from datetime objects. And then use .isin() to match the values.
# sample data
df = pd.DataFrame({ # list of dates
"date": [date(2020,10,2), date(2020,10,4)]
})
df_hr = pd.DataFrame({ # list of hours from Oct.1 to 4
"hr": [datetime(2020,10,1,0,0) + i * timedelta(hours=1) for i in range(24*4)]
})
df_hr["flag"] = 0
df_hr.loc[df_hr["hr"].dt.date.isin(df["date"]), "flag"] = 1
# show the first and last hour of each day
df_hr.loc[[0,23,24,47,48,71,72,95]]
Out[111]:
hr flag
0 2020-10-01 00:00:00 0
23 2020-10-01 23:00:00 0
24 2020-10-02 00:00:00 1
47 2020-10-02 23:00:00 1
48 2020-10-03 00:00:00 0
71 2020-10-03 23:00:00 0
72 2020-10-04 00:00:00 1
95 2020-10-04 23:00:00 1

pandas resample uneven hourly data into 1D or 24h bins

I have weekly hourly FX data which I need to resample into '1D' or '24hr' bins Monday through Thursday 12:00pm and at 21:00 on Friday, totaling 5 days per week:
Date rate
2020-01-02 00:00:00 0.673355
2020-01-02 01:00:00 0.67311
2020-01-02 02:00:00 0.672925
2020-01-02 03:00:00 0.67224
2020-01-02 04:00:00 0.67198
2020-01-02 05:00:00 0.67223
2020-01-02 06:00:00 0.671895
2020-01-02 07:00:00 0.672175
2020-01-02 08:00:00 0.672085
2020-01-02 09:00:00 0.67087
2020-01-02 10:00:00 0.6705800000000001
2020-01-02 11:00:00 0.66884
2020-01-02 12:00:00 0.66946
2020-01-02 13:00:00 0.6701600000000001
2020-01-02 14:00:00 0.67056
2020-01-02 15:00:00 0.67124
2020-01-02 16:00:00 0.6691699999999999
2020-01-02 17:00:00 0.66883
2020-01-02 18:00:00 0.66892
2020-01-02 19:00:00 0.669345
2020-01-02 20:00:00 0.66959
2020-01-02 21:00:00 0.670175
2020-01-02 22:00:00 0.6696300000000001
2020-01-02 23:00:00 0.6698350000000001
2020-01-03 00:00:00 0.66957
So the number of hours in each some days of the week is uneven, ie "Monday" = 00:00:00 Monday through 12:00:00 Monday, "Tuesday" (and also Weds, Thu) = i.e. 13:00:00 Monday though 12:00:00 Tuesday, and Friday = 13:00:00 through 21:00:00
In trying to find a solution I see that base is now deprecated, and offset/origin methods aren't working as expected, likely due to uneven number of rows per day:
df.rate.resample('24h', offset=12).ohlc()
I've spent hours attempting to find a solution
How can one simply bin into ohlc() columns all data rows between each 12:00:00 timestamp?
the desired output would look something like this:
Out[69]:
open high low close
2020-01-02 00:00:00.0000000 0.673355 0.673355 0.673355 0.673355
2020-01-03 00:00:00.0000000 0.673110 0.673110 0.668830 0.669570
2020-01-04 00:00:00.0000000 0.668280 0.668280 0.664950 0.666395
2020-01-05 00:00:00.0000000 0.666425 0.666425 0.666425 0.666425
Is this what you are looking for, using both origin and offset as parameters:
df.resample('24h', origin='start_day', offset='13h').ohlc()
For your example, this gives me:
open high low close
datetime
2020-01-01 13:00:00 0.673355 0.673355 0.66884 0.66946
2020-01-02 13:00:00 0.670160 0.671240 0.66883 0.66957
Since the period lengths are unequal, IMO it is necessary to craft the mapping wheel yourself. Speaking precisely, the 1.5-day length on Monday makes it impossible for freq='D' to do the mapping correctly at once.
The hand-crafted code is also able to guard against records outside the well-defined periods.
Data
A slightly different timestamp is used to demonstrate the correctness of the code. The days are from Mon. to Fri.
import pandas as pd
import numpy as np
from datetime import datetime
import io
from pandas import Timestamp, Timedelta
df = pd.read_csv(io.StringIO("""
rate
Date
2020-01-06 00:00:00 0.673355
2020-01-06 23:00:00 0.673110
2020-01-07 00:00:00 0.672925
2020-01-07 12:00:00 0.672240
2020-01-07 13:00:00 0.671980
2020-01-07 23:00:00 0.672230
2020-01-08 00:00:00 0.671895
2020-01-08 12:00:00 0.672175
2020-01-08 23:00:00 0.672085
2020-01-09 00:00:00 0.670870
2020-01-09 12:00:00 0.670580
2020-01-09 23:00:00 0.668840
2020-01-10 00:00:00 0.669460
2020-01-10 12:00:00 0.670160
2020-01-10 21:00:00 0.670560
2020-01-10 22:00:00 0.671240
2020-01-10 23:00:00 0.669170
"""), sep=r"\s{2,}", engine="python")
df.set_index(pd.to_datetime(df.index), inplace=True)
Code
def find_day(ts: Timestamp):
"""Find the trading day with irregular length"""
wd = ts.isoweekday()
if wd == 1:
return ts.date()
elif wd in (2, 3, 4):
return ts.date() - Timedelta("1D") if ts.hour <= 12 else ts.date()
elif wd == 5:
if ts.hour <= 12:
return ts.date() - Timedelta("1D")
elif 13 <= ts.hour <= 21:
return ts.date()
# out of range or nulls
return None
# map the timestamps, and set as new index
df.set_index(pd.DatetimeIndex(df.index.map(find_day)), inplace=True)
# drop invalid values and collect ohlc
ans = df["rate"][df.index.notnull()].resample("D").ohlc()
Result
print(ans)
open high low close
Date
2020-01-06 0.673355 0.673355 0.672240 0.672240
2020-01-07 0.671980 0.672230 0.671895 0.672175
2020-01-08 0.672085 0.672085 0.670580 0.670580
2020-01-09 0.668840 0.670160 0.668840 0.670160
2020-01-10 0.670560 0.670560 0.670560 0.670560
I ended up using a combination of grouby and datetime day of the week identification to arrive at my specific solution
# get idxs of time to rebal (12:00:00)-------------------------------------
df['idx'] = range(len(df)) # get row index
days = [] # identify each row by day of week
for i in range(len(df.index)):
days.append(df.index[i].date().weekday())
df['day'] = days
dtChgIdx = [] # stores "12:00:00" rows
justDates = df.index.date.tolist() # gets just dates
res = [] # removes duplicate dates
[res.append(x) for x in justDates if x not in res]
justDates = res
grouped_dates = df.groupby(df.index.date) # group entire df by dates
for i in range(len(grouped_dates)):
tempDf = grouped_dates.get_group(justDates[i]) # look at each grouped dates
if tempDf['day'][0] == 6:
continue # skip Sundays
times = [] # gets just the time portion of index
for y in range(len(tempDf.index)):
times.append(str(tempDf.index[y])[-8:])
tempDf['time'] = times # add time column to df
tempDf['dayCls'] = np.where(tempDf['time'] == '12:00:00',1,0) # idx "12:00:00" row
dtChgIdx.append(tempDf.loc[tempDf['dayCls'] == 1, 'idx'][0]) # idx value

How to convert numbers in an hour column to actual hours

I have an 'hour' column in a pandas dataframe that is simply a list of numbers from 0 to 23 representing hours. How can I convert them to an hour format such as 01:00 when the numbers are single digit ( like 1 ) and double digit (like 18)? The single digit numbers need to have a leading zero, a colon and two trailing zeros. The double digit numbers need only a colon and two trailing zeros. How can this be accomplished in a dataframe? Also, I have a 'date' column that needs to merge with the hour column after the hour column is converted.
e.g. date hour
2018-07-01 0
2018-07-01 1
2018-07-01 3
...
2018-07-01 21
2018-07-01 22
2018-07-01 23
Needs to look like:
date
2018-07-01 01:00
...
2018-07-01 23:00
The source of the data is a .csv file.
Thanks for your consideration. I'm new to pandas and I can't find in their documentation how to do this considering the single and double digit numbers.
Convert hours to timedeltas by to_timedelta and add to datetimes converted by to_datetime if necessary:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')
print (df)
date hour
0 2018-07-01 00:00:00 0
1 2018-07-01 01:00:00 1
2 2018-07-01 03:00:00 3
3 2018-07-01 21:00:00 21
4 2018-07-01 22:00:00 22
5 2018-07-01 23:00:00 23
If need also remove hour column use DataFrame.pop
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('hour'), unit='h')
print (df)
date
0 2018-07-01 00:00:00
1 2018-07-01 01:00:00
2 2018-07-01 03:00:00
3 2018-07-01 21:00:00
4 2018-07-01 22:00:00
5 2018-07-01 23:00:00

Using grouper to group a timestamp in a specific range

Suppose that I have a data-frame (DF). Index of this data-frame is timestamp from 11 AM to 6 PM every day and this data-frame contains 30 days. I want to group it every 30 minutes. This is the function I'm using:
out = DF.groupby(pd.Grouper(freq='30min'))
The start date of output is correct, but it considers the whole day (24h) for grouping. For example, In the new timestamp, I have something like this:
11:00:00
11:30:00
12:00:00
12:30:00
...
18:00:00
18:30:00
...
23:00:00
23:30:00
...
2:00:00
2:30:00
...
...
10:30:00
11:00:00
11:30:00
As a result, many outputs are empty because from 6:00 PM to 11 AM, I don't have any data.
One possible solution should be DatetimeIndex.floor:
out = DF.groupby(DF.index.floor('30min'))
Or use dropna after aggregate function:
out = DF.groupby(pd.Grouper(freq='30min')).mean().dropna()
As mentioned in comment to original post this is as expected. If you want to remove empty groups simply slice them afterwards. Assuming in this case you are using count to aggregate:
df = df.groupby(pd.Grouper(freq='30min')).count()
df = df[df > 0]