Pandas Convert YYMMDD & HHMM column to single DateTime Column - pandas

I'm having all kinds of trouble combining these two date columns into a single datetime column. The data looks like this:
dfn2.head(3)
Out[134]:
Plant_Name YYMMDD HHMM BestGuess(kWh)
0 BII NEE STIPA 20180101 0100 20715.0
1 BII NEE STIPA 20180101 0200 15742.0
2 BII NEE STIPA 20180101 0300 16934.0
dfn2.dtypes
Out[138]:
Plant_Name object
YYMMDD object
HHMM object
BestGuess(kWh) float64
dtype: object
I've tried several options and I'm not getting the expected result from:
dfn2['Datetime'] = (pd.to_datetime(dfn2['YYMMDD'],format='%Y%m%d').add(pd.to_timedelta(dfn2['HHMM'], 'h')))
dfn2.head(3)
Out[101]:
Plant_Name YYMMDD HHMM BestGuess(kWh) Datetime
0 BII NEE STIPA 20180101 100 20715.0 2018-01-05 04:00:00
1 BII NEE STIPA 20180101 200 15742.0 2018-01-09 08:00:00
2 BII NEE STIPA 20180101 300 16934.0 2018-01-13 12:00:00
I'm expecting the 'Datetime' column of the first 3 rows to look like:
2018-01-01 01:00:00
2018-01-01 02:00:00
2018-01-01 03:00:00
not like what the result shows above. I've also tried this lambda solution and the result looks the same:
dfn2['DateTime'] = dfn2['YYMMDD'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d')) + (pd.to_timedelta(dfn2.HHMM, unit='H'))
dfn2.head(3)
Out[103]:
Plant_Name YYMMDD HHMM BestGuess(kWh) Datetime DateTime
0 BII NEE STIPA 20180101 100 20715.0 2018-01-05 04:00:00 2018-01-05 04:00:00
1 BII NEE STIPA 20180101 200 15742.0 2018-01-09 08:00:00 2018-01-09 08:00:00
2 BII NEE STIPA 20180101 300 16934.0 2018-01-13 12:00:00 2018-01-13 12:00:00
Am I missing something? thank you,

You can do something like this
pd.to_datetime(df['YYMMDD'].astype(str) + ' ' + df['HHMM'].astype(str))
P.S. you can do this directly while reading the CSV file using the parameterparse_dates=[['YYMMDD', 'HHMM']]

Related

Resample 10D but until end of months

I would like to resample a DataFrame with frequences of 10D but cutting the last decade always at the end of the month.
ES:
print(df)
 data
index
2010-01-01 145.08
2010-01-02 143.69
2010-01-03 101.06
2010-01-04 57.63
2010-01-05 65.46
...
2010-02-24 48.06
2010-02-25 87.41
2010-02-26 71.97
2010-02-27 73.1
2010-02-28 41.43
Apply something like df.resample('10DM').mean()
data
index
2010-01-10 97.33
2010-01-20 58.58
2010-01-31 41.43
2010-02-10 35.17
2010-02-20 32.44
2010-02-28 55.44
note that the 1st and 2nd decades are normal 10D resample, but the 3rd can be 8-9-10-11 days based on month and year.
Thanks in advance.
Sample data (easy to check):
# df = pd.DataFrame({"value": np.arange(1, len(dti)+1)}, index=dti)
>>> df
value
2010-01-01 1
2010-01-02 2
2010-01-03 3
2010-01-04 4
2010-01-05 5
...
2010-02-24 55
2010-02-25 56
2010-02-26 57
2010-02-27 58
2010-02-28 59
You need to create groups by (days, month, year):
grp = df.groupby([pd.cut(df.index.day, [0, 10, 20, 31]),
pd.Grouper(freq='M'),
pd.Grouper(freq='Y')])
Now you can compute the mean for each group:
out = grp['value'].apply(lambda x: (x.index.max(), x.mean())).apply(pd.Series) \
.reset_index(drop=True).rename(columns={0:'date', 1:'value'}) \
.set_index('date').sort_index()
Output result:
>>> out
value
date
2010-01-10 5.5
2010-01-20 15.5
2010-01-31 26.0
2010-02-10 36.5
2010-02-20 46.5
2010-02-28 55.5

How to convert to datetime if the format of dates changes gradually through the column?

df.head():
start_date end_date
0 03.09.2013 03.09.2025
1 09.08.2019 14.05.2020
2 03.08.2015 03.08.2019
3 31.03.2014 31.03.2019
4 02.02.2015 02.02.2019
5 21.08.2019 21.08.2024
when I do df.tail():
start_date end_date
30373 2019-07-05 00:00:00 2023-07-05 00:00:00
30374 2019-06-11 00:00:00 2023-06-11 00:00:00
30375 19.01.2017 2020-02-09 00:00:00 #these 2 start dates are just same as in head
30376 11.12.2009 2011-12-11 00:00:00
30377 2019-07-30 00:00:00 2023-07-30 00:00:00
when i do
df[start_date] = pd.to_datetime(df[start_date])
some dates have month converted as days.
The format is inconsistent through the column. How to convert properly?
Use dayfirst=True parameter:
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
Or specify format by http://strftime.org/:
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m.%Y')
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
print (df)
start_date end_date
0 2013-09-03 2025-09-03
1 2019-08-09 2020-05-14
2 2015-08-03 2019-08-03
3 2014-03-31 2019-03-31
4 2015-02-02 2019-02-02
5 2019-08-21 2024-08-21

Pandas Datetime conversion

I have the following dataframe;
Date = ['01-Jan','01-Jan','01-Jan','01-Jan']
Heure = ['00:00','01:00','02:00','03:00']
value =[1,2,3,4]
df = pd.DataFrame({'value':value,'Date':Date,'Hour':Heure})
print(df)
Date Hour value
0 01-Jan 00:00 1
1 01-Jan 01:00 2
2 01-Jan 02:00 3
3 01-Jan 03:00 4
I am trying to create a datetime index, knowing that the file I am working with is for 2015. I have tried a lot of things but can get it to work! I tried to only convert the date and the month, but even that does not work:
df.index = pd.to_datetime(df['Date'],format='%d-%m')
I expect the following result:
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
Does anyone know how to do it?
Thanks,
You need to explicitely add 2015 somehow, and include the Hour column as well. I would do something like this:
df.index = pd.to_datetime(df.Date + '-2015 ' + df.Hour, format='%d-%b-%Y %H:%M')
>>> df
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
You can replace the default 1900 by using replace
s=pd.to_datetime(df['Date']+df['Hour'],format='%d-%b%H:%M').apply(lambda x : x.replace(year=2015))
s
Out[131]:
0 2015-01-01 00:00:00
1 2015-01-01 01:00:00
2 2015-01-01 02:00:00
3 2015-01-01 03:00:00
dtype: datetime64[ns]
df.index=s

Interpolating datetime Index

I have a DataFrame (df) as follow where 'date' is a datetime index (Y-M-D):
df :
values
date
2010-01-01 10
2010-01-02 20
2010-01-03 - 30
I want to create a new df with interpolated datetime index as follow:
values
date
2010-01-01 12:00:00 10
2010-01-01 17:00:00 15 # mean value betw. 2010-01-01 and 2010-01-02
2010-01-02 12:00:00 20
2010-01-02 17:00:00 - 5 # mean value betw. 2010-01-02 and 2010-01-03
2010-01-03 12:00:00 -30
Can anyone help me on this?
I believe need add 12 hours to index first, then reindex by union new indices with 17 and last interpolate:
df1 = df.set_index(df.index + pd.Timedelta(12, unit='h'))
idx = (df.index + pd.Timedelta(17, unit='h')).union(df1.index)
df2 = df1.reindex(idx).interpolate()
print (df2)
values
date
2010-01-01 12:00:00 10.0
2010-01-01 17:00:00 15.0
2010-01-02 12:00:00 20.0
2010-01-02 17:00:00 -5.0
2010-01-03 12:00:00 -30.0
2010-01-03 17:00:00 -30.0

from 15 minutes interval to hourly interval counts

am using excel sheet to display data from sql with this query
SELECT itable.Timestamp, itable.Time,
Sum(itable.CallsOffered)AS CallsOffered, Sum(itable.CallsAnswered)AS CallsAnswered, Sum(itable.CallsAnsweredAftThreshold)AS CallsAnsweredAftThreshold,
sum(CallsAnsweredDelay)AS CallsAnsweredDelay
FROM tablename itable
WHERE
(itable.Timestamp>=?) AND (itable.Timestamp<=?) AND
(itable.Application in ('1','2','3','4'))
GROUP BY itable.Timestamp, itable.Time
ORDER BY itable.Timestamp, itable.Time
and i get a data with an interval of 15 minutes like this :
Timestamp Time CallsOffered CallsAnswered CallsAnsweredAftThreshold CallsAnsweredDelay
6/1/2014 0:00 00:00 0 1 1 52
6/1/2014 0:15 00:15 3 1 1 23
6/1/2014 0:30 00:30 3 3 2 89
6/1/2014 0:45 00:45 0 0 0 0
6/1/2014 1:00 01:00 0 0 0 0
6/1/2014 1:15 01:15 4 1 1 12
6/1/2014 1:30 01:30 1 1 1 39
6/1/2014 1:45 01:45 0 0 0 0
6/1/2014 2:00 02:00 2 1 0 7
6/1/2014 2:15 02:15 1 1 1 80
6/1/2014 2:30 02:30 3 2 2 75
6/1/2014 2:45 02:45 0 0 0 0
6/1/2014 3:00 03:00 0 0 0 0
and i want to convert the interval from being 15 minutes to hourly interval
like this
2014-07-01 00:00:00.000
2014-07-01 01:00:00.000
2014-07-01 02:00:00.000
2014-07-01 03:00:00.000
2014-07-01 04:00:00.000
2014-07-01 05:00:00.000
2014-07-01 06:00:00.000
2014-07-01 07:00:00.000
2014-07-01 08:00:00.000
2014-07-01 09:00:00.000
2014-07-01 10:00:00.000
2014-07-01 11:00:00.000
2014-07-01 12:00:00.000
2014-07-01 13:00:00.000
2014-07-01 14:00:00.000
the query i came up with is :
select
timestamp = DATEADD(hour,datediff(hour,0,app.Timestamp),0),
Sum(app.CallsOffered)AS CallsOffered,
Sum(app.CallsAnswered)AS CallsAnswered,
Sum(app.CallsAnsweredAftThreshold)AS CallsAnsweredAftThreshold,
sum(CallsAnsweredDelay)AS CallsAnsweredDelay,
max(MaxCallsAnsDelay) as MaxCallsAnsDelay ,
max(app.MaxCallsAbandonedDelay)as MaxCallsAbandonedDelay
from tablename app
where Timestamp >='2014-7-1' AND timestamp<='2014-7-2' and
(app.Application in (
'1',
'2',
'3',
'4')
group by DATEADD(hour,datediff(hour,0,Timestamp),0)
order by Timestamp;
i get the result i want when i run in in Microsoft Sql server Managment studio
but it gives me a long error when i try running the same query in Microsoft Query in excel the error is like i cant start with timestamp
and that its giving me error for DATEADD ,DATEDIFF
so is there something i should change in my query or anything i can do to get an hourly count interval instead of 15 minutes count interval as ive shown
and thank you in advance