I'm having all kinds of trouble combining these two date columns into a single datetime column. The data looks like this:
dfn2.head(3)
Out[134]:
Plant_Name YYMMDD HHMM BestGuess(kWh)
0 BII NEE STIPA 20180101 0100 20715.0
1 BII NEE STIPA 20180101 0200 15742.0
2 BII NEE STIPA 20180101 0300 16934.0
dfn2.dtypes
Out[138]:
Plant_Name object
YYMMDD object
HHMM object
BestGuess(kWh) float64
dtype: object
I've tried several options and I'm not getting the expected result from:
dfn2['Datetime'] = (pd.to_datetime(dfn2['YYMMDD'],format='%Y%m%d').add(pd.to_timedelta(dfn2['HHMM'], 'h')))
dfn2.head(3)
Out[101]:
Plant_Name YYMMDD HHMM BestGuess(kWh) Datetime
0 BII NEE STIPA 20180101 100 20715.0 2018-01-05 04:00:00
1 BII NEE STIPA 20180101 200 15742.0 2018-01-09 08:00:00
2 BII NEE STIPA 20180101 300 16934.0 2018-01-13 12:00:00
I'm expecting the 'Datetime' column of the first 3 rows to look like:
2018-01-01 01:00:00
2018-01-01 02:00:00
2018-01-01 03:00:00
not like what the result shows above. I've also tried this lambda solution and the result looks the same:
dfn2['DateTime'] = dfn2['YYMMDD'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d')) + (pd.to_timedelta(dfn2.HHMM, unit='H'))
dfn2.head(3)
Out[103]:
Plant_Name YYMMDD HHMM BestGuess(kWh) Datetime DateTime
0 BII NEE STIPA 20180101 100 20715.0 2018-01-05 04:00:00 2018-01-05 04:00:00
1 BII NEE STIPA 20180101 200 15742.0 2018-01-09 08:00:00 2018-01-09 08:00:00
2 BII NEE STIPA 20180101 300 16934.0 2018-01-13 12:00:00 2018-01-13 12:00:00
Am I missing something? thank you,
You can do something like this
pd.to_datetime(df['YYMMDD'].astype(str) + ' ' + df['HHMM'].astype(str))
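P.S. You can do this directly while reading the CSV file by passing the parameter parse_dates=[['YYMMDD', 'HHMM']] to read_csv (recent pandas versions deprecate this nested-list form in favor of calling pd.to_datetime after reading).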
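As for why the original attempts were off: pd.to_timedelta(dfn2['HHMM'], 'h') reads 0100 as 100 hours, which is 4 days and 4 hours, hence 2018-01-05 04:00:00. Here is a minimal sketch of two ways around that, assuming both columns hold zero-padded strings as in the question. The column name Datetime2 is only for illustration.

```python
import pandas as pd

# Small stand-in for the question's frame (only the two date columns).
dfn2 = pd.DataFrame({
    "YYMMDD": ["20180101", "20180101", "20180101"],
    "HHMM": ["0100", "0200", "0300"],
})

# Option 1: join the strings and parse them with an explicit format.
dfn2["Datetime"] = pd.to_datetime(
    dfn2["YYMMDD"] + " " + dfn2["HHMM"], format="%Y%m%d %H%M"
)

# Option 2: split HHMM into hours and minutes before building the offset,
# so that 0100 means 1 hour rather than 100 hours.
hhmm = dfn2["HHMM"].astype(int)
offset = pd.to_timedelta(hhmm // 100, unit="h") + pd.to_timedelta(hhmm % 100, unit="min")
dfn2["Datetime2"] = pd.to_datetime(dfn2["YYMMDD"], format="%Y%m%d") + offset

print(dfn2)
```

Option 2 is plain arithmetic, so it would also cope with an hour value of 2400 if the data used that convention (an assumption; it does not appear in the sample shown).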
I would like to resample a DataFrame with a frequency of 10 days, but with the last 10-day period of each month always ending at the end of the month.
Example:
print(df)
data
index
2010-01-01 145.08
2010-01-02 143.69
2010-01-03 101.06
2010-01-04 57.63
2010-01-05 65.46
...
2010-02-24 48.06
2010-02-25 87.41
2010-02-26 71.97
2010-02-27 73.1
2010-02-28 41.43
I would like to apply something like df.resample('10DM').mean() and get:
data
index
2010-01-10 97.33
2010-01-20 58.58
2010-01-31 41.43
2010-02-10 35.17
2010-02-20 32.44
2010-02-28 55.44
Note that the 1st and 2nd periods are a normal 10D resample, but the 3rd can span 8, 9, 10 or 11 days depending on the month and year.
Thanks in advance.
Sample data (easy to check):
# dti = pd.date_range("2010-01-01", "2010-02-28", freq="D")
# df = pd.DataFrame({"value": np.arange(1, len(dti)+1)}, index=dti)
>>> df
value
2010-01-01 1
2010-01-02 2
2010-01-03 3
2010-01-04 4
2010-01-05 5
...
2010-02-24 55
2010-02-25 56
2010-02-26 57
2010-02-27 58
2010-02-28 59
You need to create groups by (days, month, year):
# note: pandas 2.2+ prefers freq='ME' and freq='YE' for month-end and year-end
grp = df.groupby([pd.cut(df.index.day, [0, 10, 20, 31]),
                  pd.Grouper(freq='M'),
                  pd.Grouper(freq='Y')])
Now you can compute the mean for each group:
out = grp['value'].apply(lambda x: (x.index.max(), x.mean())).apply(pd.Series) \
                  .reset_index(drop=True).rename(columns={0:'date', 1:'value'}) \
                  .set_index('date').sort_index()
Output result:
>>> out
value
date
2010-01-10 5.5
2010-01-20 15.5
2010-01-31 26.0
2010-02-10 36.5
2010-02-20 46.5
2010-02-28 55.5
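The same grouping can also be sketched with plain integer keys instead of pd.cut and pd.Grouper. This is an alternative under the same sample data, not the method above: the names year, month and period are my own, and each group is labelled with its last observed date, as in the answer.

```python
import numpy as np
import pandas as pd

# Same sample data as above: daily values 1..59 for Jan-Feb 2010.
dti = pd.date_range("2010-01-01", "2010-02-28", freq="D")
df = pd.DataFrame({"value": np.arange(1, len(dti) + 1)}, index=dti)

# Grouping keys as plain arrays.
year = df.index.year.to_numpy()
month = df.index.month.to_numpy()
# Period within the month: days 1-10 -> 0, 11-20 -> 1, 21 to month end -> 2.
period = np.minimum((df.index.day.to_numpy() - 1) // 10, 2)

out = (
    df.assign(date=df.index)
      .groupby([year, month, period])
      .agg(date=("date", "max"), value=("value", "mean"))
      .set_index("date")
)
print(out)
```

Because the label is the last date present in each group, a month with missing trailing days would be labelled with its last available day rather than the calendar month end.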
When I do df.head():
start_date end_date
0 03.09.2013 03.09.2025
1 09.08.2019 14.05.2020
2 03.08.2015 03.08.2019
3 31.03.2014 31.03.2019
4 02.02.2015 02.02.2019
5 21.08.2019 21.08.2024
When I do df.tail():
start_date end_date
30373 2019-07-05 00:00:00 2023-07-05 00:00:00
30374 2019-06-11 00:00:00 2023-06-11 00:00:00
30375 19.01.2017 2020-02-09 00:00:00 #these 2 start dates are just same as in head
30376 11.12.2009 2011-12-11 00:00:00
30377 2019-07-30 00:00:00 2023-07-30 00:00:00
When I do
df['start_date'] = pd.to_datetime(df['start_date'])
some dates have the month and day swapped.
The format is inconsistent throughout the column. How do I convert it properly?
Use the dayfirst=True parameter:
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
Or specify the format explicitly (see http://strftime.org/ for the format codes):
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m.%Y')
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
print (df)
start_date end_date
0 2013-09-03 2025-09-03
1 2019-08-09 2020-05-14
2 2015-08-03 2019-08-03
3 2014-03-31 2019-03-31
4 2015-02-02 2019-02-02
5 2019-08-21 2024-08-21
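Since the tail of the column shows a second layout (2019-07-05 00:00:00) next to the dotted one, another option is to parse each layout separately and combine the results. This is a sketch under assumptions: the sample values are copied from the question, only these two layouts are assumed to occur, and the names dotted, iso and parsed are illustrative.

```python
import pandas as pd

# Mixed layouts, as in the head and tail of the question's column.
s = pd.Series([
    "03.09.2013",
    "2019-07-05 00:00:00",
    "19.01.2017",
    "2019-06-11 00:00:00",
])

# astype(str) guards against a column that mixes strings and Timestamp objects.
text = s.astype(str)

# Parse each layout on its own; values that do not match become NaT.
dotted = pd.to_datetime(text, format="%d.%m.%Y", errors="coerce")
iso = pd.to_datetime(text, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Take the dotted parse where it worked, otherwise the ISO parse.
parsed = dotted.fillna(iso)
print(parsed)
```

Any value still NaT afterwards matched neither layout, so parsed.isna() is a quick way to find rows that need a closer look.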
I have the following dataframe:
Date = ['01-Jan','01-Jan','01-Jan','01-Jan']
Heure = ['00:00','01:00','02:00','03:00']
value =[1,2,3,4]
df = pd.DataFrame({'value':value,'Date':Date,'Hour':Heure})
print(df)
Date Hour value
0 01-Jan 00:00 1
1 01-Jan 01:00 2
2 01-Jan 02:00 3
3 01-Jan 03:00 4
I am trying to create a datetime index, knowing that the file I am working with is for 2015. I have tried a lot of things but can't get it to work! I tried converting only the day and the month, but even that does not work:
df.index = pd.to_datetime(df['Date'],format='%d-%m')
I expect the following result:
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
Does anyone know how to do it?
Thanks,
You need to explicitly add 2015 somehow, and include the Hour column as well. Note also that 'Jan' is an abbreviated month name, so it needs %b rather than %m. I would do something like this:
df.index = pd.to_datetime(df.Date + '-2015 ' + df.Hour, format='%d-%b-%Y %H:%M')
>>> df
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
Parsing without a year defaults to 1900; you can then swap in 2015 by using replace:
s=pd.to_datetime(df['Date']+df['Hour'],format='%d-%b%H:%M').apply(lambda x : x.replace(year=2015))
s
Out[131]:
0 2015-01-01 00:00:00
1 2015-01-01 01:00:00
2 2015-01-01 02:00:00
3 2015-01-01 03:00:00
dtype: datetime64[ns]
df.index=s
I have a DataFrame (df) as follows, where 'date' is a datetime index (Y-M-D):
df :
values
date
2010-01-01 10
2010-01-02 20
2010-01-03 -30
I want to create a new df with an interpolated datetime index as follows:
values
date
2010-01-01 12:00:00 10
2010-01-01 17:00:00 15 # mean value betw. 2010-01-01 and 2010-01-02
2010-01-02 12:00:00 20
2010-01-02 17:00:00 -5 # mean value betw. 2010-01-02 and 2010-01-03
2010-01-03 12:00:00 -30
Can anyone help me with this?
I believe you need to add 12 hours to the index first, then reindex by the union with the new 17:00 timestamps, and finally interpolate:
df1 = df.set_index(df.index + pd.Timedelta(12, unit='h'))
idx = (df.index + pd.Timedelta(17, unit='h')).union(df1.index)
df2 = df1.reindex(idx).interpolate()
print (df2)
values
date
2010-01-01 12:00:00 10.0
2010-01-01 17:00:00 15.0
2010-01-02 12:00:00 20.0
2010-01-02 17:00:00 -5.0
2010-01-03 12:00:00 -30.0
2010-01-03 17:00:00 -30.0
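To try the steps above end to end, here is a self-contained sketch that rebuilds the question's frame first. The final iloc[:-1] is my own addition: it drops the trailing 17:00 row, which only repeats the last value and is not part of the expected output in the question.

```python
import pandas as pd

# Rebuild the question's frame: daily values on a datetime index named 'date'.
df = pd.DataFrame(
    {"values": [10, 20, -30]},
    index=pd.DatetimeIndex(["2010-01-01", "2010-01-02", "2010-01-03"], name="date"),
)

# Shift the original observations to 12:00.
df1 = df.set_index(df.index + pd.Timedelta(12, unit="h"))

# Add the 17:00 timestamps and merge them with the shifted index.
idx = (df.index + pd.Timedelta(17, unit="h")).union(df1.index)

# Default interpolation is linear by position, so each 17:00 row
# becomes the mean of its two neighbours.
df2 = df1.reindex(idx).interpolate()

# Drop the last 17:00 row, which has no following observation.
df2 = df2.iloc[:-1]
print(df2)
```

If the 17:00 values should be weighted by the actual time gaps instead of being simple means, interpolate(method='time') is the variant to look at, but that gives different numbers from the ones asked for here.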
I am using an Excel sheet to display data from SQL with this query:
SELECT itable.Timestamp, itable.Time,
Sum(itable.CallsOffered)AS CallsOffered, Sum(itable.CallsAnswered)AS CallsAnswered, Sum(itable.CallsAnsweredAftThreshold)AS CallsAnsweredAftThreshold,
sum(CallsAnsweredDelay)AS CallsAnsweredDelay
FROM tablename itable
WHERE
(itable.Timestamp>=?) AND (itable.Timestamp<=?) AND
(itable.Application in ('1','2','3','4'))
GROUP BY itable.Timestamp, itable.Time
ORDER BY itable.Timestamp, itable.Time
and I get data at 15-minute intervals like this:
Timestamp Time CallsOffered CallsAnswered CallsAnsweredAftThreshold CallsAnsweredDelay
6/1/2014 0:00 00:00 0 1 1 52
6/1/2014 0:15 00:15 3 1 1 23
6/1/2014 0:30 00:30 3 3 2 89
6/1/2014 0:45 00:45 0 0 0 0
6/1/2014 1:00 01:00 0 0 0 0
6/1/2014 1:15 01:15 4 1 1 12
6/1/2014 1:30 01:30 1 1 1 39
6/1/2014 1:45 01:45 0 0 0 0
6/1/2014 2:00 02:00 2 1 0 7
6/1/2014 2:15 02:15 1 1 1 80
6/1/2014 2:30 02:30 3 2 2 75
6/1/2014 2:45 02:45 0 0 0 0
6/1/2014 3:00 03:00 0 0 0 0
I want to convert the interval from 15 minutes to hourly, like this:
2014-07-01 00:00:00.000
2014-07-01 01:00:00.000
2014-07-01 02:00:00.000
2014-07-01 03:00:00.000
2014-07-01 04:00:00.000
2014-07-01 05:00:00.000
2014-07-01 06:00:00.000
2014-07-01 07:00:00.000
2014-07-01 08:00:00.000
2014-07-01 09:00:00.000
2014-07-01 10:00:00.000
2014-07-01 11:00:00.000
2014-07-01 12:00:00.000
2014-07-01 13:00:00.000
2014-07-01 14:00:00.000
The query I came up with is:
select
timestamp = DATEADD(hour,datediff(hour,0,app.Timestamp),0),
Sum(app.CallsOffered)AS CallsOffered,
Sum(app.CallsAnswered)AS CallsAnswered,
Sum(app.CallsAnsweredAftThreshold)AS CallsAnsweredAftThreshold,
sum(CallsAnsweredDelay)AS CallsAnsweredDelay,
max(MaxCallsAnsDelay) as MaxCallsAnsDelay ,
max(app.MaxCallsAbandonedDelay)as MaxCallsAbandonedDelay
from tablename app
where app.Timestamp >= '2014-7-1' and app.Timestamp <= '2014-7-2' and
(app.Application in ('1', '2', '3', '4'))
group by DATEADD(hour,datediff(hour,0,Timestamp),0)
order by Timestamp;
I get the result I want when I run it in Microsoft SQL Server Management Studio, but I get a long error when I run the same query in Microsoft Query in Excel. The error seems to say that I can't start with timestamp, and it also complains about DATEADD and DATEDIFF.
Is there something I should change in my query, or anything else I can do to get hourly counts instead of the 15-minute counts shown above?
Thank you in advance.
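This does not address the Microsoft Query error itself. For comparison with the pandas questions above, the hourly bucketing that the DATEADD/DATEDIFF expression performs can be sketched in pandas. The sample values are the first eight rows of the 15-minute output shown in the question, and the frame name calls is an assumption for illustration.

```python
import pandas as pd

# First eight 15-minute rows from the question (two full hours).
index = pd.date_range("2014-06-01 00:00", periods=8, freq=pd.Timedelta(minutes=15))
calls = pd.DataFrame(
    {
        "CallsOffered": [0, 3, 3, 0, 0, 4, 1, 0],
        "CallsAnswered": [1, 1, 3, 0, 0, 1, 1, 0],
    },
    index=index,
)

# Bucket the rows by hour and sum the counts, the same idea as
# GROUP BY DATEADD(hour, DATEDIFF(hour, 0, Timestamp), 0).
hourly = calls.resample(pd.Timedelta(hours=1)).sum()
print(hourly)
```

Columns that hold maxima rather than counts, such as MaxCallsAnsDelay in the query, would need max instead of sum, for example through agg with a per-column mapping.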