Pandas Datetime conversion - pandas

I have the following dataframe:
import pandas as pd
Date = ['01-Jan','01-Jan','01-Jan','01-Jan']
Heure = ['00:00','01:00','02:00','03:00']
value = [1,2,3,4]
df = pd.DataFrame({'value':value,'Date':Date,'Hour':Heure})
print(df)
Date Hour value
0 01-Jan 00:00 1
1 01-Jan 01:00 2
2 01-Jan 02:00 3
3 01-Jan 03:00 4
I am trying to create a datetime index, knowing that the file I am working with is for 2015. I have tried a lot of things but can't get it to work! I tried converting only the day and the month, but even that does not work:
df.index = pd.to_datetime(df['Date'],format='%d-%m')
I expect the following result:
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
Does anyone know how to do it?
Thanks,

You need to explicitly add 2015 somehow, and include the Hour column as well. I would do something like this:
df.index = pd.to_datetime(df.Date + '-2015 ' + df.Hour, format='%d-%b-%Y %H:%M')
>>> df
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
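The attempt in the question fails because `%m` matches a numeric month such as `01`, while `Jan` is an abbreviated month name, which `%b` matches. A minimal self-contained sketch of the approach above, using the question's data and assuming an English locale for month names:

```python
import pandas as pd

df = pd.DataFrame({
    "value": [1, 2, 3, 4],
    "Date": ["01-Jan"] * 4,
    "Hour": ["00:00", "01:00", "02:00", "03:00"],
})

# Build a full "day-month-year hour:minute" string for each row.
stamp = df["Date"] + "-2015 " + df["Hour"]

# %b matches abbreviated month names ("Jan"); %m would expect "01".
df.index = pd.to_datetime(stamp, format="%d-%b-%Y %H:%M")
```

Appending the year before parsing avoids relying on any default year.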

When no year is given, the parsed dates default to the year 1900. You can swap that for 2015 by using replace:
s=pd.to_datetime(df['Date']+df['Hour'],format='%d-%b%H:%M').apply(lambda x : x.replace(year=2015))
s
Out[131]:
0 2015-01-01 00:00:00
1 2015-01-01 01:00:00
2 2015-01-01 02:00:00
3 2015-01-01 03:00:00
dtype: datetime64[ns]
df.index=s

Related

Select data between 2 datetime fields based on current date/time

I have a table that has the following values (reduced for brevity)
Period  Periodfrom           Periodto             Glperiodoracle  Glperiodcalendar
88      2022-01-01 00:00:00  2022-01-28 00:00:00  JAN-FY2022      JAN-2022
89      2022-01-29 00:00:00  2022-02-25 00:00:00  FEB-FY2022      FEB-2022
90      2022-02-26 00:00:00  2022-04-01 00:00:00  MAR-FY2022      MAR-2022
91      2022-04-02 00:00:00  2022-04-29 00:00:00  APR-FY2022      APR-2022
92      2022-04-30 00:00:00  2022-05-27 00:00:00  MAY-FY2022      MAY-2022
93      2022-05-28 00:00:00  2022-07-01 00:00:00  JUN-FY2022      JUN-2022
94      2022-07-02 00:00:00  2022-07-29 00:00:00  JUL-FY2022      JUL-2022
95      2022-07-30 00:00:00  2022-08-26 00:00:00  AUG-FY2022      AUG-2022
96      2022-08-27 00:00:00  2022-09-30 00:00:00  SEP-FY2022      SEP-2022
97      2022-10-01 00:00:00  2022-10-28 00:00:00  OCT-FY2023      OCT-2022
I want to make a stored procedure that when executed (without receiving parameters) will return the single row corresponding to the date between PeriodFrom and PeriodTo based on execution date.
I have something like this:
Select top 1 Period,
Periodfrom,
Periodto,
Glperiodoracle,
Glperiodcalendar
From Calendar_Period
Where Periodfrom <= getdate()
And Periodto >= getdate()
I understand that using BETWEEN could lead to errors, but would this work in the edge cases, taking seconds into account?
It looks like (i) your end date is inclusive and (ii) the time portion is always 00:00. So the correct and most performant query would be:
where cast(getdate() as date) between Periodfrom and Periodto
It will, for example, return the first row when the current time is 2022-01-28 23:59:59.999.
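The edge case can be illustrated outside SQL. Below is a minimal pandas sketch, not the stored procedure itself: `periods` holds only the first two rows of the table, and `find_period` is a hypothetical helper that compares either the raw timestamp or the timestamp truncated to midnight (the analogue of `cast(getdate() as date)`).

```python
import pandas as pd

# First two rows of the question's table (other columns omitted).
periods = pd.DataFrame({
    "Period": [88, 89],
    "Periodfrom": pd.to_datetime(["2022-01-01", "2022-01-29"]),
    "Periodto": pd.to_datetime(["2022-01-28", "2022-02-25"]),
})

def find_period(now, truncate):
    """Return matching Period values; hypothetical helper for illustration."""
    # normalize() drops the time portion, like casting to a date.
    point = now.normalize() if truncate else now
    mask = (periods["Periodfrom"] <= point) & (periods["Periodto"] >= point)
    return periods.loc[mask, "Period"].tolist()

late = pd.Timestamp("2022-01-28 23:59:59")
raw_match = find_period(late, truncate=False)   # compares full timestamp
date_match = find_period(late, truncate=True)   # compares date only
```

With the full timestamp, the last day of a period matches no row once the clock passes 00:00:00, because `Periodto` is stored at midnight. Truncating to the date first makes the inclusive end date behave as intended.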

Pandas DateTime Calculating Daily Averages

I have 2 columns of data in a pandas DataFrame that look like this, with the "DateTime" column in the format YYYY-MM-DD HH:MM:SS. This is the first 24 hours, but the DataFrame covers one full year (8784 x 2).
BAFFIN BAY DateTime
8759 8.112838 2016-01-01 00:00:00
8760 7.977169 2016-01-01 01:00:00
8761 8.420204 2016-01-01 02:00:00
8762 9.515370 2016-01-01 03:00:00
8763 9.222840 2016-01-01 04:00:00
8764 8.872423 2016-01-01 05:00:00
8765 8.776145 2016-01-01 06:00:00
8766 9.030668 2016-01-01 07:00:00
8767 8.394983 2016-01-01 08:00:00
8768 8.092915 2016-01-01 09:00:00
8769 8.946967 2016-01-01 10:00:00
8770 9.620883 2016-01-01 11:00:00
8771 9.535951 2016-01-01 12:00:00
8772 8.861761 2016-01-01 13:00:00
8773 9.077692 2016-01-01 14:00:00
8774 9.116074 2016-01-01 15:00:00
8775 8.724343 2016-01-01 16:00:00
8776 8.916940 2016-01-01 17:00:00
8777 8.920438 2016-01-01 18:00:00
8778 8.926278 2016-01-01 19:00:00
8779 8.817666 2016-01-01 20:00:00
8780 8.704014 2016-01-01 21:00:00
8781 8.496358 2016-01-01 22:00:00
8782 8.434297 2016-01-01 23:00:00
I am trying to calculate daily averages of the "BAFFIN BAY" and I've tried these approaches:
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DateTime')).mean()
davg_df2 = df2.groupby(pd.Grouper(freq='1D', key='DateTime')).mean()
davg_df2 = df2.groupby(by=df2['DateTime'].dt.date).mean()
All of these approaches yield the same answer, shown below:
BAFFIN BAY
DateTime
2016-01-01 6.008044
However, if you do the math, the correct average for 2016-01-01 is 8.813134. I'm assuming the grouping is by calendar day (24 hours) to produce consecutive daily averages, but the three approaches above are clearly picking up other data in my 8784 x 2 DataFrame. Thank you kindly for your help.
I just ran your df with this code and I get 8.813134:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.groupby(by=pd.Grouper(freq='D', key='DateTime')).mean()
print(df)
Output:
BAFFIN BAY
DateTime
2016-01-01 8.813134
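To check the grouping logic in isolation, here is a small sketch on synthetic data (the values are made up, not taken from the question): 48 hourly readings stored as text, converted with `pd.to_datetime` and then averaged per calendar day. It does not explain where the 6.008044 figure came from; it only shows that the conversion plus `pd.Grouper` gives one mean per day.

```python
import pandas as pd

# Two days of hourly timestamps, stored as strings on purpose.
stamps = pd.date_range("2016-01-01", periods=48, freq="60min")
df2 = pd.DataFrame({
    "BAFFIN BAY": [1.0] * 24 + [3.0] * 24,  # synthetic values
    "DateTime": stamps.strftime("%Y-%m-%d %H:%M:%S"),
})

# Convert the text column to real datetimes before grouping.
df2["DateTime"] = pd.to_datetime(df2["DateTime"], format="%Y-%m-%d %H:%M:%S")

# One mean per calendar day.
daily = df2.groupby(pd.Grouper(freq="D", key="DateTime"))["BAFFIN BAY"].mean()
```

If the daily mean on the real data still looks wrong, it may be worth checking whether more than 24 rows share the same date, for example with `df2["DateTime"].dt.date.value_counts()`.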

How to convert to datetime if the format of dates changes gradually through the column?

df.head():
start_date end_date
0 03.09.2013 03.09.2025
1 09.08.2019 14.05.2020
2 03.08.2015 03.08.2019
3 31.03.2014 31.03.2019
4 02.02.2015 02.02.2019
5 21.08.2019 21.08.2024
When I do df.tail():
start_date end_date
30373 2019-07-05 00:00:00 2023-07-05 00:00:00
30374 2019-06-11 00:00:00 2023-06-11 00:00:00
30375 19.01.2017 2020-02-09 00:00:00 #these 2 start dates are just same as in head
30376 11.12.2009 2011-12-11 00:00:00
30377 2019-07-30 00:00:00 2023-07-30 00:00:00
When I do
df['start_date'] = pd.to_datetime(df['start_date'])
some dates have the month and day swapped.
The format is inconsistent through the column. How do I convert it properly?
Use dayfirst=True parameter:
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
Or, if every value in the column uses the dotted form, specify the format explicitly (see http://strftime.org/ for the codes):
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m.%Y')
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
print (df)
start_date end_date
0 2013-09-03 2025-09-03
1 2019-08-09 2020-05-14
2 2015-08-03 2019-08-03
3 2014-03-31 2019-03-31
4 2015-02-02 2019-02-02
5 2019-08-21 2024-08-21
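Since the tail of the column mixes `DD.MM.YYYY` strings with `YYYY-MM-DD HH:MM:SS` values, another option is to parse each format explicitly and combine the results. This is a minimal sketch, assuming those are the only two formats present; `parse_mixed` is a hypothetical helper name, and `raw` is a small made-up sample in the same shapes as the question's data.

```python
import pandas as pd

raw = pd.Series(["03.09.2013", "2019-07-05 00:00:00", "19.01.2017"])

def parse_mixed(col):
    """Parse a column holding two known date formats (illustrative helper)."""
    text = col.astype(str)  # also covers cells that are already datetimes
    # Values that do not match the given format become NaT.
    dotted = pd.to_datetime(text, format="%d.%m.%Y", errors="coerce")
    iso = pd.to_datetime(text, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Fill the gaps left by the first pass with the second pass.
    return dotted.fillna(iso)

parsed = parse_mixed(raw)
```

Any value that matches neither format stays `NaT`, so `parsed.isna().sum()` is a quick way to see whether a third format is hiding in the column.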

Pandas datetime comparison

I have the following dataframe:
start = ['31/12/2011 01:00','31/12/2011 01:00','31/12/2011 01:00','01/01/2013 08:00','31/12/2012 20:00']
end = ['02/01/2013 01:00','02/01/2014 01:00','02/01/2014 01:00','01/01/2013 14:00','01/01/2013 04:00']
df = pd.DataFrame({'start':start,'end':end})
df['start'] = pd.to_datetime(df['start'],format='%d/%m/%Y %H:%M')
df['end'] = pd.to_datetime(df['end'],format='%d/%m/%Y %H:%M')
print(df)
end start
0 2013-01-02 01:00:00 2011-12-31 01:00:00
1 2014-01-02 01:00:00 2011-12-31 01:00:00
2 2014-01-02 01:00:00 2011-12-31 01:00:00
3 2013-01-01 14:00:00 2013-01-01 08:00:00
4 2013-01-01 04:00:00 2012-12-31 20:00:00
I am trying to compare df.end and df.start to two given dates, year_start and year_end:
year_start = pd.to_datetime(2013,format='%Y')
year_end = pd.to_datetime(2013+1,format='%Y')
print(year_start)
print(year_end)
2013-01-01 00:00:00
2014-01-01 00:00:00
But I can't get my comparison to work (see the comparison in conditions):
conditions = [(df['start'].any()< year_start) and (df['end'].any()> year_end)]
choices = [8760]
df['test'] = np.select(conditions, choices, default=0)
I also tried to define year_end and year_start as follows but it does not work either:
year_start = np.datetime64(pd.to_datetime(2013,format='%Y'))
year_end = np.datetime64(pd.to_datetime(2013+1,format='%Y'))
Any idea on how I could make it work?
Try this:
In [797]: df[(df['start']< year_start) & (df['end']> year_end)]
Out[797]:
end start
1 2014-01-02 01:00:00 2011-12-31 01:00:00
2 2014-01-02 01:00:00 2011-12-31 01:00:00
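The same element-wise mask can be passed to `np.select` to fill the `test` column from the question. The original attempt used `.any()`, which collapses each column to a single boolean, and `and`, which does not work element-wise; `&` on the two comparisons keeps one result per row. A minimal sketch with the question's data, building `year_start` and `year_end` with `pd.Timestamp` for simplicity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": ["31/12/2011 01:00", "31/12/2011 01:00", "31/12/2011 01:00",
              "01/01/2013 08:00", "31/12/2012 20:00"],
    "end": ["02/01/2013 01:00", "02/01/2014 01:00", "02/01/2014 01:00",
            "01/01/2013 14:00", "01/01/2013 04:00"],
})
df["start"] = pd.to_datetime(df["start"], format="%d/%m/%Y %H:%M")
df["end"] = pd.to_datetime(df["end"], format="%d/%m/%Y %H:%M")

year_start = pd.Timestamp(2013, 1, 1)
year_end = pd.Timestamp(2014, 1, 1)

# One boolean per row: started before the year and ended after it.
conditions = [(df["start"] < year_start) & (df["end"] > year_end)]
choices = [8760]
df["test"] = np.select(conditions, choices, default=0)
```

Rows 1 and 2 get 8760 and the rest get 0, matching the rows returned by the filter above.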

From 15-minute intervals to hourly interval counts

I am using an Excel sheet to display data from SQL with this query:
SELECT itable.Timestamp, itable.Time,
Sum(itable.CallsOffered)AS CallsOffered, Sum(itable.CallsAnswered)AS CallsAnswered, Sum(itable.CallsAnsweredAftThreshold)AS CallsAnsweredAftThreshold,
sum(CallsAnsweredDelay)AS CallsAnsweredDelay
FROM tablename itable
WHERE
(itable.Timestamp>=?) AND (itable.Timestamp<=?) AND
(itable.Application in ('1','2','3','4'))
GROUP BY itable.Timestamp, itable.Time
ORDER BY itable.Timestamp, itable.Time
and I get data at 15-minute intervals, like this:
Timestamp Time CallsOffered CallsAnswered CallsAnsweredAftThreshold CallsAnsweredDelay
6/1/2014 0:00 00:00 0 1 1 52
6/1/2014 0:15 00:15 3 1 1 23
6/1/2014 0:30 00:30 3 3 2 89
6/1/2014 0:45 00:45 0 0 0 0
6/1/2014 1:00 01:00 0 0 0 0
6/1/2014 1:15 01:15 4 1 1 12
6/1/2014 1:30 01:30 1 1 1 39
6/1/2014 1:45 01:45 0 0 0 0
6/1/2014 2:00 02:00 2 1 0 7
6/1/2014 2:15 02:15 1 1 1 80
6/1/2014 2:30 02:30 3 2 2 75
6/1/2014 2:45 02:45 0 0 0 0
6/1/2014 3:00 03:00 0 0 0 0
I want to convert the interval from 15 minutes to one hour, like this:
2014-07-01 00:00:00.000
2014-07-01 01:00:00.000
2014-07-01 02:00:00.000
2014-07-01 03:00:00.000
2014-07-01 04:00:00.000
2014-07-01 05:00:00.000
2014-07-01 06:00:00.000
2014-07-01 07:00:00.000
2014-07-01 08:00:00.000
2014-07-01 09:00:00.000
2014-07-01 10:00:00.000
2014-07-01 11:00:00.000
2014-07-01 12:00:00.000
2014-07-01 13:00:00.000
2014-07-01 14:00:00.000
The query I came up with is:
select
timestamp = DATEADD(hour,datediff(hour,0,app.Timestamp),0),
Sum(app.CallsOffered)AS CallsOffered,
Sum(app.CallsAnswered)AS CallsAnswered,
Sum(app.CallsAnsweredAftThreshold)AS CallsAnsweredAftThreshold,
sum(CallsAnsweredDelay)AS CallsAnsweredDelay,
max(MaxCallsAnsDelay) as MaxCallsAnsDelay ,
max(app.MaxCallsAbandonedDelay)as MaxCallsAbandonedDelay
from tablename app
where Timestamp >='2014-7-1' AND timestamp<='2014-7-2' and
(app.Application in (
'1',
'2',
'3',
'4'))
group by DATEADD(hour,datediff(hour,0,Timestamp),0)
order by Timestamp;
I get the result I want when I run it in Microsoft SQL Server Management Studio, but when I run the same query in Microsoft Query in Excel it gives me a long error, saying roughly that I can't start with timestamp and complaining about DATEADD and DATEDIFF.
Is there something I should change in my query, or anything else I can do, to get hourly counts instead of the 15-minute counts shown above?
Thank you in advance.
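For comparison, the truncate-to-the-hour-then-sum idea behind the `DATEADD`/`DATEDIFF` expression can be sketched in pandas. This does not address the Microsoft Query error; it only shows the aggregation on the first eight rows of the sample, with only the `CallsOffered` column kept.

```python
import pandas as pd

calls = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2014-06-01 00:00", "2014-06-01 00:15",
        "2014-06-01 00:30", "2014-06-01 00:45",
        "2014-06-01 01:00", "2014-06-01 01:15",
        "2014-06-01 01:30", "2014-06-01 01:45",
    ]),
    "CallsOffered": [0, 3, 3, 0, 0, 4, 1, 0],
})

# Round each timestamp down to the start of its hour, then sum per hour.
hour_start = calls["Timestamp"].dt.floor("60min")
hourly = calls.groupby(hour_start)["CallsOffered"].sum()
```

Each 15-minute row is assigned to the hour it falls in, so the four rows from 00:00 to 00:45 collapse into a single 00:00 row.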