How to convert numbers in an hour column to actual hours - pandas

I have an 'hour' column in a pandas dataframe that is simply a list of numbers from 0 to 23 representing hours. How can I convert them to an hour format such as 01:00, for both single-digit numbers (like 1) and double-digit numbers (like 18)? The single-digit numbers need a leading zero, a colon and two trailing zeros. The double-digit numbers need only a colon and two trailing zeros. How can this be accomplished in a dataframe? Also, I have a 'date' column that needs to be merged with the hour column after the hour column is converted.
e.g. date hour
2018-07-01 0
2018-07-01 1
2018-07-01 3
...
2018-07-01 21
2018-07-01 22
2018-07-01 23
Needs to look like:
date
2018-07-01 01:00
...
2018-07-01 23:00
The source of the data is a .csv file.
Thanks for your consideration. I'm new to pandas and I can't find in their documentation how to do this considering the single and double digit numbers.

Convert the hours to timedeltas with to_timedelta and add them to the dates, converting those with to_datetime if they are not datetimes already:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')
print (df)
date hour
0 2018-07-01 00:00:00 0
1 2018-07-01 01:00:00 1
2 2018-07-01 03:00:00 3
3 2018-07-01 21:00:00 21
4 2018-07-01 22:00:00 22
5 2018-07-01 23:00:00 23
If you also need to remove the hour column, use DataFrame.pop:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('hour'), unit='h')
print (df)
date
0 2018-07-01 00:00:00
1 2018-07-01 01:00:00
2 2018-07-01 03:00:00
3 2018-07-01 21:00:00
4 2018-07-01 22:00:00
5 2018-07-01 23:00:00
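The question also asked for a zero-padded "HH:MM" display. A minimal sketch of two ways to get it, assuming the 'date' column was read from the CSV as strings (the sample frame is rebuilt by hand here, and the column names date_hour and hour_str are made up for illustration):

```python
import pandas as pd

# Sample data rebuilt by hand from the question (assumption: dates are strings).
df = pd.DataFrame({
    "date": ["2018-07-01"] * 3,
    "hour": [0, 1, 23],
})

# Combine date and hour into a real datetime column, as in the answer above.
combined = pd.to_datetime(df["date"]) + pd.to_timedelta(df["hour"], unit="h")

# Option 1: render the datetimes as 'YYYY-MM-DD HH:MM' strings for display.
df["date_hour"] = combined.dt.strftime("%Y-%m-%d %H:%M")

# Option 2: zero-pad the hour on its own and append ':00'.
df["hour_str"] = df["hour"].astype(str).str.zfill(2) + ":00"

print(df)
```

Keeping the datetime column for computation and formatting only for display is usually the more flexible choice, since the strings no longer support date arithmetic.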

Related

Replacing seconds from the date time in pandas [duplicate]

I have following dataframe in pandas
code time
1 003002
1 053003
1 060002
1 073001
1 073003
I want to generate following dataframe in pandas
code time new_time
1 003002 00:30:00
1 053003 05:30:00
1 060002 06:00:00
1 073001 07:30:00
1 073003 07:30:00
I am doing it with following code
df['new_time'] = pd.to_datetime(df['time'] ,format='%H%M%S').dt.time
How can I do it in pandas?
Use Series.dt.floor:
df['time'] = pd.to_datetime(df['time'], format='%H%M%S').dt.floor('T').dt.time
Or remove the last 2 characters by slicing, then change the format to %H%M:
df['time'] = pd.to_datetime(df['time'].str[:-2], format='%H%M').dt.time
print (df)
code time
0 1 00:30:00
1 1 05:30:00
2 1 06:00:00
3 1 07:30:00
4 1 07:30:00
An option using astype:
pd.to_datetime(df_oclh.Time).astype('datetime64[m]').dt.time
'datetime64[m]' is the dtype to convert to: a datetime truncated to minute resolution, so the seconds are dropped. Alternatively you could use [s] for seconds (dropping milliseconds) or [h] for hours (dropping minutes, seconds and milliseconds). Newer pandas versions may reject units coarser than seconds in astype, in which case dt.floor is the safer choice.
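A small sketch comparing the floor and slicing approaches side by side. It assumes the time column holds strings, so the leading zeros survive (a CSV reader may otherwise parse 003002 as an integer). The column names by_floor and by_slice are made up for illustration, and the 'min' alias is used in place of 'T':

```python
import pandas as pd

# Sample rebuilt from the question; times are kept as strings (assumption).
df = pd.DataFrame({
    "code": [1, 1, 1],
    "time": ["003002", "053003", "073001"],
})

parsed = pd.to_datetime(df["time"], format="%H%M%S")

# Floor to the minute: the seconds become 00.
df["by_floor"] = parsed.dt.floor("min").dt.time

# Same result by dropping the last two characters before parsing.
df["by_slice"] = pd.to_datetime(df["time"].str[:-2], format="%H%M").dt.time

print(df)
```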

Interpolate hourly load of a selected months of a year from the same months of the previous year and the next year in python pandas?

I have the following three dataframes:
df1:
date_time system_load
01-01-2017 00:00:00 208111
01-01-2017 01:00:00 208311
01-01-2017 02:00:00 208311
01-01-2017 03:00:00 208011
............... ...
31-12-2017 20:00:00 208611
31-12-2017 21:00:00 208411
31-12-2017 22:00:00 208111
31-12-2017 23:00:00 208911
The system load values of df1 have no problems.
df2:
date_time system_load
01-01-2018 00:00:00 208111
01-01-2018 01:00:00 208311
01-01-2018 02:00:00 208311
01-01-2018 03:00:00 208011
............... ...
31-12-2018 20:00:00 209611
31-12-2018 21:00:00 209411
31-12-2018 22:00:00 209111
31-12-2018 23:00:00 209911
The system load values of df2 are missing from 06-03-2018 20:00:00 through 24-10-2018 22:00:00.
df3:
date_time system_load
01-01-2019 00:00:00 309119
01-01-2019 01:00:00 309391
01-01-2019 02:00:00 309811
01-01-2019 03:00:00 309711
............... ...
31-12-2019 20:00:00 309611
31-12-2019 21:00:00 309411
31-12-2019 22:00:00 309111
31-12-2019 23:00:00 309911
The system load values of df3 have no problems.
What I want is to fill in the missing hourly records in df2 in a suitable way, using the corresponding hourly records of df1 and df3 (06-03-2017 20:00:00 through 24-10-2017 22:00:00 and 06-03-2019 20:00:00 through 24-10-2019 22:00:00 respectively). Following "Pierre D"'s valuable comment, I have attached my scaled data.
Here is a very basic strategy that just takes data from neighboring years to fill the missing values. The offset is chosen to be precisely 52 weeks, so as to reflect possible weekly seasonality.
# get the whole series together, and resample to have missing data as NaN
# (assumes each frame has date_time as its DatetimeIndex):
s = pd.concat([df1, df2, df3])['system_load'].resample('H').asfreq()
offset = 52 * 7 * 24 # 52 weeks, 7 days/week, 24 hours/day
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(~s.isna(), filler)
# optional: make a new df2 with the filled values
df2mod = out.truncate(
before='2018',
after=pd.Timestamp('2019') - pd.Timedelta(1)
).to_frame('system_load')
Notes:
out contains the "filled" series for the whole system_load, using neighboring years.
We use pandas.DataFrame.mean() to build the filler series as the mean of the two neighboring years. It skips NaN by default, so if one of the two years is NaN at a given hour, the mean is simply the other, non-NaN value.
This is one of the most basic ways of filling the missing data, and likely won't fool a careful observer. Depending on the intended usage of the reconstructed data, a more elaborate strategy should be considered. Data reconstruction is an active field of research, and there are sophisticated methods in the literature. For example, one could use a GAN to build a resulting series that would be very hard to discriminate from real data.
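The same strategy can be tried on synthetic data. This is only a sketch: the series values are hypothetical (a simple ramp), the gap mirrors the dates in the question, and the lowercase 'h' alias is used because newer pandas versions prefer it over 'H':

```python
import numpy as np
import pandas as pd

# Synthetic hourly series covering three years (hypothetical values).
idx = pd.date_range("2017-01-01", "2019-12-31 23:00", freq="h")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Knock out a gap in the middle year to imitate the missing records.
gap = slice("2018-03-06 20:00", "2018-10-24 22:00")
s.loc[gap] = np.nan

offset = 52 * 7 * 24  # 52 weeks expressed in hours
# Mean of the values 52 weeks earlier and 52 weeks later; mean() skips NaN.
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(s.notna(), filler)

# Number of values still missing inside the former gap.
print(out.loc[gap].isna().sum())
```

Because the synthetic series is a straight line, the mean of the two neighbors reproduces the removed values exactly. Real load data will not be that forgiving.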

expand datetime data in pandas like interpolation

I have the following data of dates, and every date is assigned the value 1.
Is there a way to get an hourly DateTime list in pandas such that all the values are 0 except for the ones I have in my xls file?
It is similar to interpolating, but interpolation fills in intermediate values, whereas here I just want the rest of the dates to be filled with 0. I want all 24 hours of each of the dates below to be assigned 1. I tried to do it with a for loop, but it takes far too long and is impractical.
Use pandas datetime accessor pd.Series.dt.date to extract the date part from datetime objects. And then use .isin() to match the values.
from datetime import date, datetime, timedelta
import pandas as pd
# sample data
df = pd.DataFrame({ # list of dates
"date": [date(2020,10,2), date(2020,10,4)]
})
df_hr = pd.DataFrame({ # list of hours from Oct.1 to 4
"hr": [datetime(2020,10,1,0,0) + i * timedelta(hours=1) for i in range(24*4)]
})
df_hr["flag"] = 0
df_hr.loc[df_hr["hr"].dt.date.isin(df["date"]), "flag"] = 1
# show the first and last hour of each day
df_hr.loc[[0,23,24,47,48,71,72,95]]
Out[111]:
hr flag
0 2020-10-01 00:00:00 0
23 2020-10-01 23:00:00 0
24 2020-10-02 00:00:00 1
47 2020-10-02 23:00:00 1
48 2020-10-03 00:00:00 0
71 2020-10-03 23:00:00 0
72 2020-10-04 00:00:00 1
95 2020-10-04 23:00:00 1
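An alternative sketch that builds the hourly grid with pd.date_range instead of a list comprehension. The flagged dates below are hypothetical stand-ins for the spreadsheet contents, and the name flag is carried over from the answer above:

```python
import pandas as pd

# Hypothetical flagged dates, standing in for the spreadsheet contents.
flagged = pd.to_datetime(["2020-10-02", "2020-10-04"])

# Hourly timestamps from Oct 1 00:00 through Oct 4 23:00.
hours = pd.date_range("2020-10-01", "2020-10-04 23:00", freq="h")

# normalize() resets each timestamp to midnight, so it can be matched against the dates.
flag = pd.Series(hours.normalize().isin(flagged).astype(int), index=hours, name="flag")

# Number of flagged hours per day: 24 for flagged dates, 0 otherwise.
print(flag.groupby(flag.index.date).sum())
```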

dt.floor count for every 12 hours in Pandas

I am trying to count the datetime occurrences every 12 hours using dt.floor, as follows.
Here I created a data frame that contains 2 days at 1-hour intervals. I have two questions regarding the output.
I am expecting the summary to be for every 12 hours, i.e. the first row in output-1 should be 12:00 and the second row 24:00. Instead, I get 00:00 and 12:00. Why is this?
Is it possible to create a summary using a specific time? For example, count at every 6 AM and 6 PM?
code and input
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
input1.groupby(input1['datetime'].dt.floor('12H')).count()
output-1
datetime
datetime
2018-01-01 00:00:00 12
2018-01-01 12:00:00 12
2018-01-02 00:00:00 12
2018-01-02 12:00:00 12
output-2
datetime
datetime
2018-01-01 06:00:00 6
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 6
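There is no 24th hour. The time part of a datetime in pandas lies in the half-open range [00:00:00, 24:00:00), which ensures that there's only ever a single representation of the same exact time: 24:00:00 is excluded because it is 00:00:00 of the next day. The labels 00:00 and 12:00 are the left edges of the 12-hour bins.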
import pandas as pd
pd.to_datetime('2012-01-01 24:00:00')
#ParserError: hour must be in 0..23: 2012-01-01 24:00:00
For the second point, as of pandas 1.1.0 you can specify the offset parameter when you resample. You can also specify which side should be used for the labels. For older versions you will need to use the base argument.
# pandas < 1.1.0
#input1.resample('12H', on='datetime', base=6).count()
input1.resample('12H', on='datetime', offset='6H').count()
# datetime
#datetime
#2017-12-31 18:00:00 6
#2018-01-01 06:00:00 12
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 6
# Change labels
input1.resample('12H', on='datetime', offset='6H', label='right').count()
# datetime
#datetime
#2018-01-01 06:00:00 6
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 12
#2018-01-03 06:00:00 6
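Since the question used dt.floor, here is a sketch of the same 6 AM / 6 PM binning without resample: shift the timestamps back 6 hours, floor, then shift forward again. The names shift, bins and counts are made up for illustration:

```python
import pandas as pd

input1 = pd.DataFrame({"datetime": pd.date_range("2018-01-01", periods=48, freq="h")})

shift = pd.Timedelta(hours=6)
# Shift back 6 hours, floor to 12-hour bins, then shift forward again,
# so the bin edges land on 06:00 and 18:00 instead of 00:00 and 12:00.
bins = (input1["datetime"] - shift).dt.floor("12h") + shift

counts = input1.groupby(bins)["datetime"].count()
print(counts)
```

The labels are the left edges of the bins, so the first bin is labelled 2017-12-31 18:00 and holds the six hours before 06:00 on January 1.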
I modified your input data slightly, in order to use resample:
import pandas as pd
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
# add a dummy column
input1['x'] = 'x'
# convert datetime to index...
input1 = input1.set_index('datetime')
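# ...so we can use resample. loffset shifts only the bin labels by 6 hours
# (the bins still start at 00:00 and 12:00); it was deprecated in pandas 1.1
# and removed in 2.0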
t = input1.resample('12h', loffset=pd.Timedelta(hours=6)).count()
# show results
print(t.head())
x
datetime
2018-01-01 06:00:00 12
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 12
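On pandas versions without loffset, the same label shift can be done by adding the offset to the index after resampling. A minimal sketch, reusing the dummy column x from the answer above:

```python
import pandas as pd

input1 = pd.DataFrame(
    {"x": "x"},
    index=pd.date_range("2018-01-01", periods=48, freq="h", name="datetime"),
)

# Bins still run 00:00-12:00 and 12:00-24:00 ...
t = input1.resample("12h").count()
# ... and only the labels are moved forward by 6 hours afterwards.
t.index = t.index + pd.Timedelta(hours=6)
print(t)
```

Every row still counts 12 because the bins themselves did not move. To count from 6 AM to 6 PM, the offset argument of resample is the one that changes the bin edges.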

Panda DF Convert All Dates to YYYY-MM-DD format

I have data stored in a DF that looks like the sample below, and I'm trying to convert the "DATE" column so that all the dates are in yyyy-mm-dd format instead of yyyy-dd-mm. You can see the yyyy-dd-mm format where the date rolls over to a new day according to the "TIME" column. (Some of the dates not shown are already in YYYY-MM-DD format, but I'm trying to change all of them to YYYY-MM-DD.)
DATE TIME BAFFIN BAY GATUN II GATUN I KLONDIKE IIIG \
8778 2016-01-01 1900 8.926278 8.046583 7.649784 7.333993
8779 2016-01-01 2000 8.817666 4.395097 4.748931 6.672631
8780 2016-01-01 2100 8.704014 6.384826 7.128692 6.115349
8781 2016-01-01 2200 8.496358 8.261933 8.166153 6.242737
8782 2016-01-01 2300 8.434297 4.656991 5.894877 5.781445
8783 2016-02-01 0000 8.528372 3.056838 3.086056 5.023564
8784 2016-02-01 0100 8.783731 4.614589 4.894076 5.042875
8785 2016-02-01 0200 8.572500 3.860174 4.641366 5.174426
8786 2016-02-01 0300 8.279557 2.076971 2.644479 5.492729
8787 2016-02-01 0400 8.378920 3.562210 2.806703 5.356025
I'm trying to convert the "DATE" column to a datetime column by specifying the format, but it does nothing:
df2['DATE'] = pd.to_datetime(df2['DATE'],format='%Y-%m-%d')
thank you in advance for your help!
Can you try this:
pd.to_datetime(df['DATE'], dayfirst=True)
0 2016-01-01
1 2016-01-01
2 2016-01-01
3 2016-01-01
4 2016-01-01
5 2016-01-02
6 2016-01-02
7 2016-01-02
8 2016-01-02
9 2016-01-02
Consider joining 'DATE' and 'TIME' to get a complete datetime column. Assuming both columns are of dtype object (string), you can concatenate them with the + operator and then call pd.to_datetime with an explicit format. Ex:
import pandas as pd
df = pd.DataFrame({'DATE': ['2016-01-01', '2016-02-01'],
'TIME': ['1900', '0000']})
df['DateTime'] = pd.to_datetime(df['DATE']+df['TIME'], format='%Y-%d-%m%H%M')
# df['DateTime']
# 0 2016-01-01 19:00:00
# 1 2016-01-02 00:00:00
# Name: DateTime, dtype: datetime64[ns]
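If only the "DATE" column needs fixing, a sketch along the same lines is to parse it with an explicit year-day-month format and then render it back as text. This assumes every value really is year-day-month. Rows that are already YYYY-MM-DD cannot be told apart from the string alone when both numbers are 12 or less, so mixed data would need another signal (such as row order):

```python
import pandas as pd

# Hypothetical sample in year-day-month order (the third value is 13 January).
df = pd.DataFrame({"DATE": ["2016-01-01", "2016-02-01", "2016-13-01"]})

# The strings are year-day-month, so say so explicitly when parsing.
parsed = pd.to_datetime(df["DATE"], format="%Y-%d-%m")

# Render back to text as YYYY-MM-DD.
df["DATE"] = parsed.dt.strftime("%Y-%m-%d")
print(df)
```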