Expand datetime data in pandas, like interpolation

I have the following data of dates, and every date is assigned the value 1.
Is there a way to get an hourly DateTime list in pandas such that all values are 0 except for the dates I have in my xls file?
It is similar to interpolating, but interpolation just fills in intermediate values, whereas here I want the rest of the dates to be filled with 0 and the entire 24 hours of each of the dates below to be assigned 1. I tried doing it in a for loop, but it takes far too long and is impractical.

Use the pandas datetime accessor pd.Series.dt.date to extract the date part from the datetime objects, and then use .isin() to match the values.
import pandas as pd
from datetime import date, datetime, timedelta

# sample data
df = pd.DataFrame({  # list of dates
    "date": [date(2020, 10, 2), date(2020, 10, 4)]
})
df_hr = pd.DataFrame({  # list of hours from Oct 1 to Oct 4
    "hr": [datetime(2020, 10, 1, 0, 0) + i * timedelta(hours=1) for i in range(24 * 4)]
})

# flag the hours whose date appears in df
df_hr["flag"] = 0
df_hr.loc[df_hr["hr"].dt.date.isin(df["date"]), "flag"] = 1

# show the first and last hour of each day
df_hr.loc[[0, 23, 24, 47, 48, 71, 72, 95]]
hr flag
0 2020-10-01 00:00:00 0
23 2020-10-01 23:00:00 0
24 2020-10-02 00:00:00 1
47 2020-10-02 23:00:00 1
48 2020-10-03 00:00:00 0
71 2020-10-03 23:00:00 0
72 2020-10-04 00:00:00 1
95 2020-10-04 23:00:00 1
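For what it's worth, the hourly frame can also be built with pd.date_range instead of the list comprehension, and the flag derived in one step. A small sketch using the same column names as above:
# equivalent construction with pd.date_range
df_hr = pd.DataFrame({"hr": pd.date_range("2020-10-01", periods=24 * 4, freq="H")})
# boolean match converted to 0/1
df_hr["flag"] = df_hr["hr"].dt.date.isin(df["date"]).astype(int)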

Related

Replacing seconds from the date time in pandas [duplicate]

I have the following dataframe in pandas
code time
1 003002
1 053003
1 060002
1 073001
1 073003
I want to generate the following dataframe in pandas
code time new_time
1 003002 00:30:00
1 053003 05:30:00
1 060002 06:00:00
1 073001 07:30:00
1 073003 07:30:00
I am parsing the time with the following code, but it keeps the seconds:
df['new_time'] = pd.to_datetime(df['time'], format='%H%M%S').dt.time
How can I zero out the seconds in pandas?
Use Series.dt.floor:
df['time'] = pd.to_datetime(df['time'], format='%H%M%S').dt.floor('T').dt.time
Or remove the last 2 characters by string indexing, then change the format to %H%M:
df['time'] = pd.to_datetime(df['time'].str[:-2], format='%H%M').dt.time
print (df)
code time
0 1 00:30:00
1 1 05:30:00
2 1 06:00:00
3 1 07:30:00
4 1 07:30:00
An option using astype:
pd.to_datetime(df_oclh.Time).astype('datetime64[m]').dt.time
'datetime64[m]' is the type we convert to: a datetime with minutes as the finest unit kept. Alternatively you could use [s] for seconds (to get rid of milliseconds) or [h] for hours (to get rid of minutes, seconds and milliseconds).
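For reference, a self-contained run of the floor approach on the sample data from the question (column names and string-typed times assumed from the question):
import pandas as pd

df = pd.DataFrame({'code': [1] * 5,
                   'time': ['003002', '053003', '060002', '073001', '073003']})
# parse HHMMSS, drop the seconds, keep only the time part
df['new_time'] = pd.to_datetime(df['time'], format='%H%M%S').dt.floor('min').dt.time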

dt.floor count for every 12 hours in Pandas

I am trying to count the datetime occurrences every 12 hours as follows using dt.floor.
Here I created a data frame that contains 2 days at 1-hour intervals. I have two questions regarding the output.
I expected the summary to cover every 12 hours, i.e. the first row in output-1 should be 12:00 and the second row 24:00. Instead, I get 00:00 and 12:00. Why is this?
Is it possible to create the summary anchored at a specific time, for example counting at every 6 AM and 6 PM?
code and input
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
input1.groupby(input1['datetime'].dt.floor('12H')).count()
output-1
datetime
datetime
2018-01-01 00:00:00 12
2018-01-01 12:00:00 12
2018-01-02 00:00:00 12
2018-01-02 12:00:00 12
output-2 (what I would like for the second question)
datetime
datetime
2018-01-01 06:00:00 6
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 6
There is no 24th hour. The time part of a datetime in pandas exists in the range [00:00:00, 24:00:00), which ensures that there is only ever a single representation of the same exact time (note the half-open interval).
import pandas as pd
pd.to_datetime('2012-01-01 24:00:00')
#ParserError: hour must be in 0..23: 2012-01-01 24:00:00
For the second point, as of pd.__version__ == '1.1.0' you can specify the offset parameter when you resample. You can also specify which side should be used for the labels. For older versions you will need to use the base argument.
# pandas < 1.1.0
#input1.resample('12H', on='datetime', base=6).count()
input1.resample('12H', on='datetime', offset='6H').count()
# datetime
#datetime
#2017-12-31 18:00:00 6
#2018-01-01 06:00:00 12
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 6
# Change labels
input1.resample('12H', on='datetime', offset='6H', label='right').count()
# datetime
#datetime
#2018-01-01 06:00:00 6
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 12
#2018-01-03 06:00:00 6
I modified your input data slightly, in order to use resample:
import pandas as pd
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
# add a dummy column
input1['x'] = 'x'
# convert datetime to index...
input1 = input1.set_index('datetime')
# ...so we can use resample, and loffset lets us start at 6 am
t = input1.resample('12h', loffset=pd.Timedelta(hours=6)).count()
# show results
print(t.head())
x
datetime
2018-01-01 06:00:00 12
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 12
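Note that loffset was deprecated in pandas 1.1 and later removed; the suggested replacement is to add the offset to the resulting index after resampling, roughly:
# same labels without loffset (input1 still has the DatetimeIndex set above)
t = input1.resample('12h').count()
t.index = t.index + pd.Timedelta(hours=6)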

How to convert numbers in an hour column to actual hours

I have an 'hour' column in a pandas dataframe that is simply a list of numbers from 0 to 23 representing hours. How can I convert them to an hour format such as 01:00, for both single-digit numbers (like 1) and double-digit numbers (like 18)? The single-digit numbers need a leading zero, a colon and two trailing zeros; the double-digit numbers need only the colon and two trailing zeros. How can this be accomplished in a dataframe? Also, I have a 'date' column that needs to merge with the hour column after the hour column is converted.
e.g. date hour
2018-07-01 0
2018-07-01 1
2018-07-01 3
...
2018-07-01 21
2018-07-01 22
2018-07-01 23
Needs to look like:
date
2018-07-01 01:00
...
2018-07-01 23:00
The source of the data is a .csv file.
Thanks for your consideration. I'm new to pandas and I can't find in the documentation how to do this while handling both the single- and double-digit numbers.
Convert the hours to timedeltas with to_timedelta and add them to the datetimes, converted by to_datetime if necessary:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')
print (df)
date hour
0 2018-07-01 00:00:00 0
1 2018-07-01 01:00:00 1
2 2018-07-01 03:00:00 3
3 2018-07-01 21:00:00 21
4 2018-07-01 22:00:00 22
5 2018-07-01 23:00:00 23
If you also need to remove the hour column, use DataFrame.pop:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('hour'), unit='h')
print (df)
date
0 2018-07-01 00:00:00
1 2018-07-01 01:00:00
2 2018-07-01 03:00:00
3 2018-07-01 21:00:00
4 2018-07-01 22:00:00
5 2018-07-01 23:00:00
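If the output needs to be the literal text form shown in the question (e.g. 2018-07-01 01:00) rather than a datetime, a small follow-up sketch using strftime:
# format as text; %H keeps the leading zero for single-digit hours
df['date'] = df['date'].dt.strftime('%Y-%m-%d %H:%M')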

What is timestamp granularity computing in pandas?

I have the following dataset
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': pd.date_range('1/1/2020', '3/1/2020 23:59', freq='12h'),
                   'col1': np.random.randint(100, size=122)}).\
     sort_values('timestamp')
I want to compute daily, weekly and monthly sums of col1. If I use 'W' granularity for the timestamp column I receive ValueError: <Week: weekday=6> is a non-fixed frequency, and I read that it is recommended to use 7D, 30D, etc.
My question is: how does pandas compute 7D or 30D granularity? If I add another column
df['timestamp2']= df.timestamp.dt.floor('30D')
df.groupby('timestamp2')[['col1']].sum()
I get the following result:
timestamp2 col1
2019-12-10 778
2020-01-09 3100
2020-02-08 2470
Why does pandas return those dates when my minimum timestamp is Jan 1, 2020 and my maximum is Mar 1, 2020?
The origin is the POSIX origin: 1970-01-01. With .floor('30D') the allowable values are 1970-01-01, 1970-01-31, ... and all other 30-day multiples. Your dates are close to the 608th-610th multiples.
pd.to_datetime('1970-01-01') + pd.DateOffset(days=30*608)
#Timestamp('2019-12-10 00:00:00')
pd.to_datetime('1970-01-01') + pd.DateOffset(days=30*609)
#Timestamp('2020-01-09 00:00:00')
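You can also go the other way and compute which 30-day multiple a given timestamp falls into, as a quick sanity check on the numbers above:
(pd.Timestamp('2020-01-01') - pd.Timestamp('1970-01-01')).days // 30
#608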
If what you want is instead 30D periods from your first observation, then resample is how you can aggregate:
df.resample('30D', on='timestamp')['timestamp'].agg(['min', 'max'])
min max
timestamp
2020-01-01 2020-01-01 2020-01-30 12:00:00 # starts from 1st date
2020-01-31 2020-01-31 2020-02-29 12:00:00
2020-03-01 2020-03-01 2020-03-01 12:00:00
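For the daily, weekly and monthly sums asked about at the top of the question, resample also accepts the non-fixed 'W' and monthly frequencies that dt.floor rejects; a small sketch using the same df:
daily = df.resample('D', on='timestamp')['col1'].sum()
weekly = df.resample('W', on='timestamp')['col1'].sum()
monthly = df.resample('M', on='timestamp')['col1'].sum()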

Group hourly tagged data bounded by start & end date to proportional daily data using Pandas DF

I'm using unavailability data per product, with an hourly start & end for each period the product was unavailable; the following is an example:
import pandas as pd
import datetime as dt

unavability = pd.DataFrame([[dt.datetime(2017, 10, 19, 11), dt.datetime(2017, 10, 19, 12), 'broom'],
                            [dt.datetime(2017, 10, 19, 9),  dt.datetime(2017, 10, 19, 10), 'broom'],
                            [dt.datetime(2017, 10, 19, 1),  dt.datetime(2017, 10, 19, 2),  'bike'],
                            [dt.datetime(2017, 10, 19, 22), dt.datetime(2017, 10, 20, 3),  'bike']],
                           columns=['start_date', 'end_date', 'product'])
print(unavability)
start_date end_date product
0 2017-10-19 11:00:00 2017-10-19 12:00:00 broom
1 2017-10-19 09:00:00 2017-10-19 10:00:00 broom
2 2017-10-19 01:00:00 2017-10-19 02:00:00 bike
3 2017-10-19 22:00:00 2017-10-20 03:00:00 bike
I'm looking to group the data into an unavailability proportion per date & product, so I would like to convert the DataFrame above into the following. Keep in mind that I want it to work even if the unavailability period lasts for more than 49 hours (overlaps 3 days):
desired = pd.DataFrame([[dt.datetime(2017, 10, 19), 'broom', 22/24.0],   # 2 hours of unavailability
                        [dt.datetime(2017, 10, 20), 'broom', 24/24.0],   # product fully available that day
                        [dt.datetime(2017, 10, 19), 'bike',  22/24.0],   # 2 hours of unavailability - from 22 to 24
                        [dt.datetime(2017, 10, 20), 'bike',  21/24.0]],  # 3 hours of unavailability - from 00 to 03
                       columns=['date', 'product', 'avalability_proportion'])
print(desired)
date product avalability_proportion
0 2017-10-19 broom 0.916667
1 2017-10-20 broom 1.000000
2 2017-10-19 bike 0.916667
3 2017-10-20 bike 0.875000
Some thoughts:
As a start, I thought of creating a transformation that generates the theoretical hours for all available products, as suggested here: Fill missing timeseries data using pandas or numpy, then joining it to the original data and somehow filling it, but I'm not sure that is a smart approach.
Any help on this would be awesome, thanks in advance!
My silly solution, hope this helps:
df = unavability

# if the date changes between start and end, remember the changed rows
df['is_date_changed'] = df.start_date.dt.date != df.end_date.dt.date
df.loc[df.is_date_changed, 'intermediate_date'] = pd.to_datetime(df.end_date.dt.date)
df_date_is_changed = df.loc[df.is_date_changed]
df_date_not_changed = df.loc[~df.is_date_changed]

# expand every changed row to two,
# and append those rows to the date_not_changed dataframe.
# for example,
# 2017-10-19 22:00:00 2017-10-20 03:00:00
# will be expanded into two rows:
# 2017-10-19 22:00:00 2017-10-20 00:00:00
# 2017-10-20 00:00:00 2017-10-20 03:00:00
for idx, row in df_date_is_changed.iterrows():
    row1 = [row['start_date'], row['intermediate_date'], row['product'], None, None]
    df_date_not_changed.loc[-1] = row1
    df_date_not_changed.index = df_date_not_changed.index + 1
    row2 = [row['intermediate_date'], row['end_date'], row['product'], None, None]
    df_date_not_changed.loc[-1] = row2
    df_date_not_changed.index = df_date_not_changed.index + 1

df = df_date_not_changed
# the calendar date each unavailability chunk belongs to
df['date'] = df.apply(lambda x: min(x['start_date'], x['end_date']), axis=1)
df.date = df.date.dt.date
# unavailable time per chunk, then sum per product & date
df['time_delta'] = df.end_date - df.start_date
df.groupby(['product', 'date']).agg({'time_delta': 'sum'})
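To get from the summed unavailable time to the availability proportion asked for in the question, one could divide by a full day. A minimal sketch (the name out is just for illustration, and days with no unavailability at all would still need to be added, e.g. via a reindex over all product/date combinations):
# divide the summed timedelta by 24 hours to get the unavailable fraction
out = df.groupby(['product', 'date']).agg({'time_delta': 'sum'})
out['avalability_proportion'] = 1 - out['time_delta'] / pd.Timedelta(hours=24)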