Pandas Datetime Excel data with different formats - dataframe

I've got automatic sensor data stored in .csv in inconsistent formats.
For instance:
2/29/2020 8:00:00 PM
2/29/2020 9:00:00 PM
2/29/2020 10:00:00 PM
2/29/2020 11:00:00 PM
3-1-2020 00:00
3-1-2020 01:00
3-1-2020 02:00
....
3-2-2020 00:00
3-2-2020 01:00
3-2-2020 02:00
3-2-2020 03:00
3-2-2020 04:00
I import this data with
import pandas as pd
df = pd.read_csv(path + r'\mydata.csv', sep=';', header=0)
Time = df.iloc[:, 0]
Time = pd.to_datetime(Time)
Now of course simply doing Time = pd.to_datetime(Time) doesn't work, as the data is in inconsistent formats. Do you have any suggestions on how to effectively convert this dataset to the format yyyy-mm-dd hh:mm?
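One possible approach, sketched under the assumption that only the two layouts shown in the sample occur (a month-first 12-hour form with slashes and a month-first 24-hour form with dashes): parse the column once per format with errors="coerce" and let the second pass fill whatever the first left as NaT. The names raw, fmt_12h and fmt_24h are made up for this illustration, and the format strings are inferred from the sample rows only.

```python
import pandas as pd

# Sample strings copied from the question; both layouts appear to be month-first.
raw = pd.Series([
    "2/29/2020 8:00:00 PM",
    "2/29/2020 11:00:00 PM",
    "3-1-2020 00:00",
    "3-2-2020 04:00",
])

# Assumed formats, inferred from the sample rows only.
fmt_12h = "%m/%d/%Y %I:%M:%S %p"
fmt_24h = "%m-%d-%Y %H:%M"

# Rows that do not match a format become NaT, so each pass only fills its own rows.
parsed = pd.to_datetime(raw, format=fmt_12h, errors="coerce")
parsed = parsed.fillna(pd.to_datetime(raw, format=fmt_24h, errors="coerce"))

# Render as yyyy-mm-dd hh:mm text (the parsed column itself stays datetime64).
formatted = parsed.dt.strftime("%Y-%m-%d %H:%M")
print(formatted.tolist())
```

Any row that matches neither format stays NaT, so parsed.isna().sum() is a quick way to check whether a third layout is hiding in the file.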

Related

Converting time variable into a numeric value in SAS

I've subtracted two time variables (sleeptime = waketime - bedtime) and, although I get the correct result, I need to categorize sleeptime into 2 categories (sleep=0 if sleeptime >= 7 hours, or sleep=1 if sleeptime < 7 hours).
The problem is that when I categorize the variable, I don't get the classification right. This is what I get:
bedtime waketime sleeptime sleep
22:00:00 07:00:00 09:00:00 1
22:30:00 06:30:00 08:00:00 1
00:55:00 08:10:00 07:15:00 0
02:30:00 08:30:00 06:00:00 1
Here's the code I've used:
data have; set want;
sleeptime = waketime - bedtime;
if sleeptime => '07:00:00't then sleep=0;
if sleeptime < '07:00:00't then sleep=1; run;
I've been thinking about converting sleeptime into a numeric value so that it's easier to categorize, for example:
bedtime waketime sleeptime sleeptime1
22:00:00 07:00:00 09:00:00 9
22:30:00 06:30:00 08:00:00 8
02:30:00 08:30:00 06:00:00 6
Any thoughts? Thanks for the help!
Time variables are numeric, so you're fine leaving it alone... but you're forgetting about midnight!
Either keep your variables as datetime (which keeps the date, so it lets you do this sort of thing just as you did it), or fudge it:
data have;
  input bedtime :time8. waketime :time8.;
  datalines;
22:00:00 07:00:00
22:30:00 06:30:00
00:55:00 08:10:00
02:30:00 08:30:00
;;;;
run;
data want;
  set have;
  sleeptime = waketime-bedtime + (86400*(bedtime gt waketime));
  format bedtime waketime sleeptime time8.;
run;
This only works if you're sure it's always going to be true that waketime should be after bedtime. Seems likely, but worth pointing out. (And, 86400 is the number of seconds in 24 hours - you can also use '24:00:00't if you want.)
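For readers coming from the pandas questions on this page, the same midnight adjustment can be sketched in Python. This is an analogue of the SAS step above, not a translation of it: the column names mirror the question, and the 7-hour threshold and the 0/1 coding are taken from the asker's description.

```python
import pandas as pd

# Same four rows as the SAS datalines above.
df = pd.DataFrame({
    "bedtime": ["22:00:00", "22:30:00", "00:55:00", "02:30:00"],
    "waketime": ["07:00:00", "06:30:00", "08:10:00", "08:30:00"],
})

bed = pd.to_timedelta(df["bedtime"])
wake = pd.to_timedelta(df["waketime"])

# Add 24 hours whenever bedtime is later in the day than waketime,
# i.e. the night crossed midnight (same assumption as in the SAS answer).
sleeptime = wake - bed
sleeptime = sleeptime.where(bed <= wake, sleeptime + pd.Timedelta(hours=24))

df["sleeptime"] = sleeptime
# sleep = 0 for 7 hours or more, 1 for less than 7 hours.
df["sleep"] = (sleeptime < pd.Timedelta(hours=7)).astype(int)
print(df)
```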

What's the difference between changing datetime string to datetime by pd.to_datetime & datetime.strptime()

I have a df that looks similar to this (shortened version, with fewer rows):
Time (EDT) Open High Low Close
0 02.01.2006 19:00:00 0.85224 0.85498 0.85224 0.85498
1 02.01.2006 20:00:00 0.85498 0.85577 0.85423 0.85481
2 02.01.2006 21:00:00 0.85481 0.85646 0.85434 0.85646
3 02.01.2006 22:00:00 0.85646 0.85705 0.85623 0.85651
4 02.01.2006 23:00:00 0.85643 0.85691 0.85505 0.85653
5 03.01.2006 00:00:00 0.85653 0.8569 0.85601 0.85626
6 03.01.2006 01:00:00 0.85626 0.85653 0.85524 0.8557
7 03.01.2006 02:00:00 0.85558 0.85597 0.85486 0.85597
8 03.01.2006 03:00:00 0.85597 0.85616 0.85397 0.8548
9 03.01.2006 04:00:00 0.85469 0.85495 0.8529 0.85328
10 03.01.2006 05:00:00 0.85316 0.85429 0.85222 0.85401
11 03.01.2006 06:00:00 0.85401 0.8552 0.853 0.8552
12 03.01.2006 07:00:00 0.8552 0.8555 0.85319 0.85463
13 03.01.2006 08:00:00 0.85477 0.85834 0.8545 0.85788
14 03.01.2006 09:00:00 0.85788 0.85838 0.85341 0.85416
15 03.01.2006 10:00:00 0.8542 0.8542 0.85006 0.85111
16 03.01.2006 11:00:00 0.85115 0.85411 0.85 0.85345
17 03.01.2006 12:00:00 0.85337 0.85432 0.8526 0.85413
18 03.01.2006 13:00:00 0.85413 0.85521 0.85363 0.85363
19 03.01.2006 14:00:00 0.85325 0.8561 0.85305 0.85606
20 03.01.2006 15:00:00 0.8561 0.85675 0.85578 0.85599
I need to convert the date strings to datetime, set the date column as the index, and resample. When I use method 1, the resampling comes out wrong and it creates extra future dates. For example, if my last date is 2018-11, I will see dates in 2018-12.
method 1:
df['Time (EDT)'] = pd.to_datetime(df['Time (EDT)'])  # this is also slow, because there are 90000 rows
df.set_index('Time (EDT)', inplace=True)
ohlc_dict = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
df = df.resample('4H', base=17, closed='left', label='left').agg(ohlc_dict)
result:
Time (EDT) Open High Low Close
1/1/2006 21:00 0.86332 0.86332 0.86268 0.86321
1/2/2006 1:00 0.86321 0.86438 0.86111 0.86164
1/2/2006 5:00 0.86164 0.86222 0.8585 0.86134
1/2/2006 9:00 0.86149 0.86297 0.85695 0.85793
1/2/2006 13:00 0.85801 0.85947 0.85759 0.8591
1/2/2006 17:00 0.8591 0.86034 0.85757 0.85825
1/2/2006 21:00 0.85825 0.85969 0.84377 0.84412
1/3/2006 1:00 0.84445 0.8468 0.84286 0.84642
1/3/2006 5:00 0.84659 0.8488 0.84494 0.84872
1/3/2006 9:00 0.84829 0.84915 0.84271 0.84416
1/3/2006 13:00 0.84372 0.8453 0.84346 0.84423
1/3/2006 17:00 0.84426 0.84693 0.84426 0.84516
1/3/2006 21:00 0.84523 0.8458 0.84442 0.84579
When I use method 2. It resamples properly.
method 2:
from datetime import datetime
def to_datetime_obj(date_string):
    # Parse a day-first string such as '02.01.2006 19:00:00'
    datetime_obj = datetime.strptime(date_string[:], '%d.%m.%Y %H:%M:%S')
    return datetime_obj
datetime_objs = None
date_list = df['Time (EDT)'].tolist()
datetime_objs = list(map(to_datetime_obj, date_list))  # this is also faster
df.iloc[:, :1] = datetime_objs
df.set_index('Time (EDT)', inplace=True)
ohlc_dict = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
df = df.resample('4H', base=17, closed='left', label='left').agg(ohlc_dict)
result:
Time (EDT) Open High Low Close
1/2/2006 17:00 0.85224 0.85577 0.85224 0.85481
1/2/2006 21:00 0.85481 0.85705 0.85434 0.85626
1/3/2006 1:00 0.85626 0.85653 0.8529 0.85328
1/3/2006 5:00 0.85316 0.85834 0.85222 0.85788
1/3/2006 9:00 0.85788 0.85838 0.85 0.85413
1/3/2006 13:00 0.85413 0.85675 0.85305 0.85525
1/3/2006 17:00 0.85525 0.85842 0.85502 0.85783
1/3/2006 21:00 0.85783 0.85898 0.85736 0.85774
1/4/2006 1:00 0.85774 0.85825 0.8558 0.85595
1/4/2006 5:00 0.85595 0.85867 0.85577 0.85839
1/4/2006 9:00 0.85847 0.85981 0.85586 0.8578
1/4/2006 13:00 0.85773 0.85886 0.85597 0.85653
1/4/2006 17:00 0.85653 0.85892 0.85642 0.8584
1/4/2006 21:00 0.8584 0.85863 0.85658 0.85715
1/5/2006 1:00 0.85715 0.8588 0.85641 0.85791
1/5/2006 5:00 0.85803 0.86169 0.85673 0.86065
The df.index of method 1 and 2 are the same visually before resampling.
They are both pandas.core.indexes.datetimes.DatetimeIndex
But when I compare them, they are actually different: method1_df.index != method2_df.index
Why is that? How to fix? Thanks.
It's surprising that a vectorized method (pd.to_datetime), written in Cython, is slower than a pure Python method (datetime.strptime).
You can pass the format to pd.to_datetime, which speeds it up a lot:
pd.to_datetime(df['Time (EDT)'], format='%d.%m.%Y %H:%M:%S')
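For your second problem, I think it has to do with the order of day and month in your string data. Without a format, pd.to_datetime reads an ambiguous string such as 02.01.2006 as month-first (February 1) by default, whereas your strptime format is day-first. Have you verified that the two methods actually give you the same datetimes?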
s1 = pd.to_datetime(df['Time (EDT)'])
s2 = pd.Series(map(to_datetime_obj, date_list))
(s1 == s2).all()
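A small self-contained check of that hypothesis, using three rows copied from the question. The names fmt, month_first, explicit, via_strptime and bars are made up for this sketch. Since base= was removed from resample in pandas 2.0, the sketch uses offset= instead, which is my substitution and not part of the original code.

```python
from datetime import datetime

import pandas as pd

# Three rows copied from the question's data (day.month.year).
df = pd.DataFrame({
    "Time (EDT)": ["02.01.2006 19:00:00", "02.01.2006 20:00:00", "02.01.2006 21:00:00"],
    "Open": [0.85224, 0.85498, 0.85481],
    "High": [0.85498, 0.85577, 0.85646],
    "Low": [0.85224, 0.85423, 0.85434],
    "Close": [0.85498, 0.85481, 0.85646],
})
fmt = "%d.%m.%Y %H:%M:%S"

# No format given: ambiguous strings are read month-first by default.
month_first = pd.to_datetime(df["Time (EDT)"])
# Explicit day-first format, and the strptime route from method 2.
explicit = pd.to_datetime(df["Time (EDT)"], format=fmt)
via_strptime = pd.Series([datetime.strptime(s, fmt) for s in df["Time (EDT)"]])

print(month_first.iloc[0])               # parsed as February
print((explicit == via_strptime).all())  # explicit format agrees with strptime

# Resample into 4-hour bars whose edges fall on 17:00, 21:00, 01:00, ...
df["Time (EDT)"] = explicit
df = df.set_index("Time (EDT)")
ohlc = {"Open": "first", "High": "max", "Low": "min", "Close": "last"}
bars = df.resample(
    pd.Timedelta(hours=4),
    offset=pd.Timedelta(hours=17),
    closed="left",
    label="left",
).agg(ohlc)
print(bars)
```

With the explicit format, the first bar is labelled 2006-01-02 17:00, which lines up with the first row of the method 2 result in the question.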
For me, datetime.strptime was 3 times faster than pd.to_datetime for 2 operations per row on a DataFrame with 880,000+ rows.

Pandas DateTime Calculating Daily Averages

I have 2 columns of data in a pandas DataFrame that looks like this, with the "DateTime" column in the format YYYY-MM-DD HH:MM:SS. This is the first 24 hours, but the DataFrame covers one full year (8784 x 2).
BAFFIN BAY DateTime
8759 8.112838 2016-01-01 00:00:00
8760 7.977169 2016-01-01 01:00:00
8761 8.420204 2016-01-01 02:00:00
8762 9.515370 2016-01-01 03:00:00
8763 9.222840 2016-01-01 04:00:00
8764 8.872423 2016-01-01 05:00:00
8765 8.776145 2016-01-01 06:00:00
8766 9.030668 2016-01-01 07:00:00
8767 8.394983 2016-01-01 08:00:00
8768 8.092915 2016-01-01 09:00:00
8769 8.946967 2016-01-01 10:00:00
8770 9.620883 2016-01-01 11:00:00
8771 9.535951 2016-01-01 12:00:00
8772 8.861761 2016-01-01 13:00:00
8773 9.077692 2016-01-01 14:00:00
8774 9.116074 2016-01-01 15:00:00
8775 8.724343 2016-01-01 16:00:00
8776 8.916940 2016-01-01 17:00:00
8777 8.920438 2016-01-01 18:00:00
8778 8.926278 2016-01-01 19:00:00
8779 8.817666 2016-01-01 20:00:00
8780 8.704014 2016-01-01 21:00:00
8781 8.496358 2016-01-01 22:00:00
8782 8.434297 2016-01-01 23:00:00
I am trying to calculate daily averages of the "BAFFIN BAY" column and I've tried these approaches:
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DateTime')).mean()
davg_df2 = df2.groupby(pd.Grouper(freq='1D', key='DateTime')).mean()
davg_df2 = df2.groupby(by=df2['DateTime'].dt.date).mean()
All of these approaches yield the same answer, shown below:
BAFFIN BAY
DateTime
2016-01-01 6.008044
However, if you do the math, the correct average for 2016-01-01 is 8.813134. I'm assuming the grouping is by day (24 hours), producing consecutive daily averages, but the 3 approaches above are clearly picking up other data in my 8784 x 2 DataFrame. Thank you kindly for your help.
I just ran your df with this code and I get 8.813134:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.groupby(by=pd.Grouper(freq='D', key='DateTime')).mean()
print(df)
Output:
BAFFIN BAY
DateTime
2016-01-01 8.813134
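Since the answer could not reproduce the problem, one way to narrow it down is to count how many rows land in each day next to the mean. The sketch below uses made-up values (0 to 47 over two days) rather than the asker's data, and the names idx and daily are illustrative. In a clean hourly series every day should show a count of 24, so a larger count for 2016-01-01 would point to extra rows carrying that date.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data: 48 hourly rows, timestamps stored as text.
idx = pd.date_range("2016-01-01", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    "BAFFIN BAY": np.arange(48, dtype=float),
    "DateTime": idx.strftime("%Y-%m-%d %H:%M:%S"),
})

# Make sure the key column really is datetime64 before grouping.
df["DateTime"] = pd.to_datetime(df["DateTime"])

# Daily mean plus the number of rows that contributed to it.
daily = (
    df.groupby(pd.Grouper(freq="D", key="DateTime"))["BAFFIN BAY"]
      .agg(["mean", "count"])
)
print(daily)
```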

Pandas: How to convert datetime to %H:%M and keep it in datetime format?

I have a dataframe with 1 column containing different times.
Time
-----
10:00
11:30
12:30
14:10
...
I need to compute quantiles on this dataframe with the code below:
df.quantile([0,0.5,1],numeric_only=False)
According to the documentation linked below, quantile does work on datetime data.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
As my column has dtype object, I need to convert it to datetime (pd.to_datetime or pd.Timestamp).
When I convert it with pd.to_datetime, all my times get dates added to them too.
If I format it to %H:%M, the column turns back into object, which doesn't work with quantile even with numeric_only=False.
How can I convert to datetime in %H:%M format and still keep the datetime dtype?
Below was the code I used:
df = pd.DataFrame({"Time":["10:10","09:10","12:00","13:23","15:23","17:00","17:30"]})
df['Time2'] = pd.to_datetime(df['Time']).dt.strftime('%H:%M')
df['Time2'] = df['Time2'].astype('datetime64[ns]')
How can I convert to datetime format in %H:%M and still stick to datetime format?
This is not possible in pandas; the closest alternative is to use timedeltas:
df = pd.DataFrame({"Time":["10:10","09:10","12:00","13:23","15:23","17:00","17:30"]})
df['Time2'] = pd.to_timedelta(df['Time'].add(':00'))
print (df)
Time Time2
0 10:10 10:10:00
1 09:10 09:10:00
2 12:00 12:00:00
3 13:23 13:23:00
4 15:23 15:23:00
5 17:00 17:00:00
6 17:30 17:30:00
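To connect this back to the original goal, here is a short sketch of taking quantiles on the timedelta column. The names q and q_text are made up for illustration, and the HH:MM rendering assumes every value is below 24 hours.

```python
import pandas as pd

df = pd.DataFrame({"Time": ["10:10", "09:10", "12:00", "13:23", "15:23", "17:00", "17:30"]})
df["Time2"] = pd.to_timedelta(df["Time"].add(":00"))

# Quantiles work directly on a timedelta64 column.
q = df["Time2"].quantile([0, 0.5, 1])
print(q)

# Optional: render the result as HH:MM text by adding it to an arbitrary date.
# Assumption: all values are below 24 hours, so the date part can be dropped.
q_text = (pd.Timestamp("1970-01-01") + q).dt.strftime("%H:%M")
print(q_text.tolist())
```

The quantile result stays a timedelta; only q_text is plain text, so any further arithmetic should be done on q.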

Pandas Datetime conversion

I have the following dataframe:
Date = ['01-Jan','01-Jan','01-Jan','01-Jan']
Heure = ['00:00','01:00','02:00','03:00']
value =[1,2,3,4]
df = pd.DataFrame({'value':value,'Date':Date,'Hour':Heure})
print(df)
Date Hour value
0 01-Jan 00:00 1
1 01-Jan 01:00 2
2 01-Jan 02:00 3
3 01-Jan 03:00 4
I am trying to create a datetime index, knowing that the file I am working with is for 2015. I have tried a lot of things but can't get it to work! I tried to convert only the day and the month, but even that does not work:
df.index = pd.to_datetime(df['Date'],format='%d-%m')
I expect the following result:
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
Does anyone know how to do it?
Thanks,
You need to explicitly add 2015 somehow, and include the Hour column as well. I would do something like this:
df.index = pd.to_datetime(df.Date + '-2015 ' + df.Hour, format='%d-%b-%Y %H:%M')
>>> df
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
You can replace the default year 1900 by using replace:
s=pd.to_datetime(df['Date']+df['Hour'],format='%d-%b%H:%M').apply(lambda x : x.replace(year=2015))
s
Out[131]:
0 2015-01-01 00:00:00
1 2015-01-01 01:00:00
2 2015-01-01 02:00:00
3 2015-01-01 03:00:00
dtype: datetime64[ns]
df.index=s