Efficient Dataframe column (Object) to DateTime conversion - pandas

I'm attempting to create a new column that contains the data of the Date input column as a datetime. I'd also happily accept changing the datatype of the Date column, but I'm just as unsure how to do that.
I'm currently using DateTime = dd.to_datetime. I'm importing from a CSV and letting dask decide on data types.
I'm fairly new to this, so I've tried a few stackoverflow answers, but I'm just fumbling and getting more errors than answers.
My input date string is, for example:
2019-20-09 04:00
This is what I currently have,
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
# Dataframes implement the Pandas API
import dask.dataframe as dd
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\State_Weathergrids.csv')
print(ddf.describe(include='all'))
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%y-%d-%m %H:%M')
The error I'm receiving is below. I'm assuming that the last line is the most relevant piece, but for the life of me I cannot work out why the date format given doesn't match the format I'm specifying.
TypeError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py in _convert_listlike_datetimes(arg, box, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
290 try:
--> 291 values, tz = conversion.datetime_to_datetime64(arg)
292 return DatetimeIndex._simple_new(values, name=name, tz=tz)
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
....
ValueError: time data '2019-20-09 04:00' does not match format '%y-%d-%m %H:%M' (match)
Data Frame current properties using describe:
Dask DataFrame Structure:
Location Date Temperature RH
npartitions=1
float64 object float64 float64
... ... ... ...
Dask Name: describe, 971 tasks
Sample Data
+-----------+------------------+-------------+--------+
| Location | Date | Temperature | RH |
+-----------+------------------+-------------+--------+
| 1075 | 2019-20-09 04:00 | 6.8 | 99.3 |
| 1075 | 2019-20-09 05:00 | 6.4 | 100.0 |
| 1075 | 2019-20-09 06:00 | 6.7 | 99.3 |
| 1075 | 2019-20-09 07:00 | 8.6 | 95.4 |
| 1075 | 2019-20-09 08:00 | 12.2 | 76.0 |
+-----------+------------------+-------------+--------+

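Try this:
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M', errors='coerce')
%Y matches a four-digit year such as 2019, whereas %y only matches a two-digit year, which is why the original format string did not match. errors='coerce' returns NaT wherever to_datetime fails; errors='ignore' would instead return the input unchanged.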
For more detail visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
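As a minimal sketch of the difference between %y and %Y, here is the same conversion in plain pandas rather than dask (as far as I know, dd.to_datetime passes these arguments through to pandas.to_datetime). The sample values are made up for illustration:

```python
import pandas as pd

# Hypothetical sample: one valid value in year-day-month order, one invalid value.
dates = pd.Series(["2019-20-09 04:00", "not a date"])

# %y expects a two-digit year, so the four-digit "2019" cannot match.
try:
    pd.to_datetime(dates, format="%y-%d-%m %H:%M")
    two_digit_year_failed = False
except ValueError:
    two_digit_year_failed = True

# %Y matches the four-digit year; errors="coerce" turns unparseable values into NaT.
parsed = pd.to_datetime(dates, format="%Y-%d-%m %H:%M", errors="coerce")
print(two_digit_year_failed)
print(parsed)
```

Checking how many values came back as NaT (for example with parsed.isna().sum()) is a quick way to see whether the format string really fits the whole column.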

Related

finding maximum value in a moving period? in data frame [duplicate]

I have a table df with columns "timestamp" and "Y". I want to add another column "MaxY" which contains the largest Y value at most 24 hours in the future. That is
df.MaxY.iloc[i] = df[(df.timestamp > df.timestamp.iloc[i]) &
(df.timestamp < df.timestamp.iloc[i] + timedelta(hours=24))].Y.max()
Obviously, computing it like that is very slow. Is there a better way?
In a similar case of computing "SumY" I can do it using a trick with cumsum(). However here similar tricks don't seem to work.
As requested, an example table (MaxY is the output. Input is the first two columns only).
-------------------------------
| timestamp | Y | MaxY |
-------------------------------
| 2016-03-29 12:00 | 1 | 3 | rows 2 and 3 fall within 24 hours, so MaxY = max(2,3)
| 2016-03-29 13:00 | 2 | 4 | rows 3 and 4 fall in the time interval, so MaxY = max(3, 4)
| 2016-03-30 11:00 | 3 | 4 | rows 4, 5, 6 all fall in the interval so MaxY = max(4, 3, 2)
| 2016-03-30 12:30 | 4 | 3 | max (3, 2)
| 2016-03-30 13:30 | 3 | 2 | row 6 is the only row in the interval
| 2016-03-30 14:00 | 2 | nan? | there are no rows in the time interval. Any value will do.
-------------------------------
Here's a way with resample/rolling. I get a weird warning using pandas version 0.18.0 and python 3.5. I don't think it's a concern but not sure why it is generated.
This assumes index is 'timestamp', if not, precede the following with df = df.set_index('timestamp'):
>>> df2 = df.resample('30min').sort_index(ascending=False).fillna(np.nan)
>>> df2 = df2.rolling(48,min_periods=1).max()
>>> df.join(df2,rsuffix='2')
Y Y2
timestamp
2016-03-29 12:00:00 1 3.0
2016-03-29 13:00:00 2 4.0
2016-03-30 11:00:00 3 4.0
2016-03-30 12:30:00 4 4.0
2016-03-30 13:30:00 3 3.0
2016-03-30 14:00:00 2 2.0
On this tiny dataframe it seems to be about twice as fast, but you'd have to test it on a larger dataframe to get a reasonable idea of relative speed.
Hopefully this is somewhat self-explanatory. The descending sort (ascending=False) is necessary because rolling only allows a backward-looking or centered window, as far as I can tell.
Consider an apply() solution that may run faster. Function returns the max of a time-conditional series from each row.
import pandas as pd
from datetime import timedelta
def daymax(row):
    ser = df.Y[(df.timestamp > row) &
               (df.timestamp <= row + timedelta(hours=24))]
    return ser.max()
df['MaxY'] = df.timestamp.apply(daymax)
print(df)
# timestamp Y MaxY
#0 2016-03-29 12:00:00 1 3.0
#1 2016-03-29 13:00:00 2 4.0
#2 2016-03-30 11:00:00 3 4.0
#3 2016-03-30 12:30:00 4 3.0
#4 2016-03-30 13:30:00 3 2.0
#5 2016-03-30 14:00:00 2 NaN
What's wrong with this?
df['MaxY'] = df[::-1].Y.shift(-1).rolling('24H').max()
df[::-1] reverses the df (you want it "backwards") and shift(-1) takes care of the "in the future".
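The "reverse it, then roll backwards" idea can be sketched as below. This is only one possible way to do it, assuming the frame is already sorted by timestamp; the origin date is an arbitrary anchor I picked so that the mirrored axis is still a DatetimeIndex, and the window uses the same bounds as the apply() answer (later than the row, at most 24 hours later):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-03-29 12:00", "2016-03-29 13:00", "2016-03-30 11:00",
        "2016-03-30 12:30", "2016-03-30 13:30", "2016-03-30 14:00",
    ]),
    "Y": [1, 2, 3, 4, 3, 2],
})

# Mirror the time axis so that "later" rows become "earlier" ones.
origin = pd.Timestamp("2000-01-01")  # arbitrary anchor, for illustration only
mirrored_time = origin + (df["timestamp"].max() - df["timestamp"])

# Reverse the rows so the mirrored index is increasing, as rolling() requires.
mirrored = pd.Series(
    df["Y"].to_numpy()[::-1],
    index=pd.DatetimeIndex(mirrored_time.to_numpy()[::-1]),
)

# closed="left" leaves out the current row; 24 hours back on the mirrored axis
# corresponds to 24 hours ahead on the real axis.
future_max = mirrored.rolling("24h", closed="left").max()

# Undo the reversal so the result lines up with the original rows.
df["MaxY"] = future_max.to_numpy()[::-1]
print(df)
```

On the sample table this gives 3, 4, 4, 3, 2 and NaN for the last row, which has no later rows in its window.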

argument must be a string or a number, not 'datetime.datetime', but i have a string (Pandas + Matplotlib)

I have a pandas dataframe
published | sentiment
2022-01-31 10:00:00 | 0
2021-12-29 00:30:00 | 5
2021-12-20 | -5
Since some rows don't have hours, minutes and seconds I delete them:
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
I get:
published | sentiment
2022-01-31 | 0
2021-12-29 | 5
2021-12-20 | -5
If I plot the data:
plt.pyplot.plot_date(df['published'],df['sentiment'] )
I get this error:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
But I don't know why since it should be a string.
How can I plot it (possibly keeping the temporal order)? Thank you
Try like this:
import pandas as pd
from matplotlib import pyplot as plt
values=[('2022-01-31 10:00:00',0),('2021-12-29 00:30:00',5),('2021-12-20',-5)]
cols=['published','sentiment']
df_dominant_topic2 = pd.DataFrame.from_records(values, columns=cols)
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
#you may sort the data by date
df_dominant_topic2.sort_values(by='published', ascending=True, inplace=True)
plt.plot(df_dominant_topic2['published'],df_dominant_topic2['sentiment'])
plt.show()
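If a real date axis is wanted instead of string labels, one option is to parse the truncated strings into datetimes before plotting. A minimal sketch, using a shorter frame name (df) than the one in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "published": ["2022-01-31 10:00:00", "2021-12-29 00:30:00", "2021-12-20"],
    "sentiment": [0, 5, -5],
})

# Keep only the YYYY-MM-DD part, then parse it into real datetime values.
df["published"] = pd.to_datetime(df["published"].str.slice(0, 10), format="%Y-%m-%d")

# Sorting datetimes is chronological rather than alphabetical.
df = df.sort_values("published").reset_index(drop=True)
print(df)
```

The resulting column can then be passed to plt.plot(df['published'], df['sentiment']) as in the answer above, and matplotlib will space the points according to the actual dates.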

How to read time column in pandas and how to convert it into milliseconds

I used this code to read the Excel file:
df=pd.read_excel("XYZ.xlsb",engine='pyxlsb',dtype={'Time':str})
The following is just to show what I am getting after reading the Excel file.
import pandas as pd
import numpy as np
data = {'Name':['T1','T2','T3'],
'Time column in excel':['01:57:15', '00:30:00', '05:00:00'],
'Time column in Python':['0.0814236111111111', '0.0208333333333333', '0.208333333333333']}
df = pd.DataFrame(data)
print (df)
| Name | Time column in excel | Time column in Python|
| T1 | 01:57:15 | 0.0814236111111111 |
| T2 | 00:30:00 | 0.0208333333333333 |
| T3 | 05:00:00 | 0.208333333333333 |
I want to read this time exactly as it appears in Excel and then convert it into milliseconds, because I want to use the time to calculate time differences as percentages in later steps.
Try dividing the microsecond of the datetime by 1000:
def get_milliseconds(dt):
    return dt.microsecond / 1000
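The values read from the file (for example 0.0814236111111111 for 01:57:15) look like fractions of a 24-hour day, which is how Excel stores times. Assuming that is the case here, a sketch of the conversion is to multiply the fraction by the number of milliseconds in a day. The column names Time_ms and Time_td are made up for this example:

```python
import pandas as pd

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day

df = pd.DataFrame({
    "Name": ["T1", "T2", "T3"],
    "Time": ["0.0814236111111111", "0.0208333333333333", "0.208333333333333"],
})

# Assumption: each value is a fraction of a day that was read in as a string.
day_fraction = df["Time"].astype(float)

# Round before casting, since the stored fractions are not exact.
df["Time_ms"] = (day_fraction * MS_PER_DAY).round().astype("int64")

# A timedelta column makes it easy to compare against what Excel displays.
df["Time_td"] = pd.to_timedelta(df["Time_ms"], unit="ms")
print(df)
```

With the three sample rows this gives 7035000, 1800000 and 18000000 milliseconds, matching 01:57:15, 00:30:00 and 05:00:00.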

Loading csv with pandas, wrong columns

I loaded a csv into a DataFrame with pandas.
The format is the following:
Timestamp | 1014.temperature | 1014.humidity | 1015.temperature | 1015.humidity ....
-------------------------------------------------------------------------------------
2017-... | 23.12 | 12.2 | 25.10 | 10.34 .....
The problem is that the '1014' or '1015' numbers are supposed to be ID's that are supposed to be in a special column.
I would like to end up with the following format for my DF:
TimeStamp | ID | Temperature | Humidity
-----------------------------------------------
. | | |
.
.
.
The CSV is tab separated.
Thanks in advance guys!
import pandas as pd
from io import StringIO
# create sample data frame
s = """Timestamp|1014.temperature|1014.humidity|1015.temperature|1015.humidity
2017|23.12|12.2|25.10|10.34"""
df = pd.read_csv(StringIO(s), sep='|')
df = df.set_index('Timestamp')
# split columns on '.' with list comprehension
l = [col.split('.') for col in df.columns]
# create multi index columns
df.columns = pd.MultiIndex.from_tuples(l)
# stack column level 0, reset the index and rename level_1
final = df.stack(0).reset_index().rename(columns={'level_1': 'ID'})
Timestamp ID humidity temperature
0 2017 1014 12.20 23.12
1 2017 1015 10.34 25.10
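An alternative sketch that avoids the MultiIndex is to melt the frame to long form, split the column label into ID and measurement, and pivot back. It reads tab-separated input, as described in the question; the intermediate names (long_df, column, measure, value) are my own:

```python
import pandas as pd
from io import StringIO

# Tab-separated sample in the same shape as the question's file.
raw = (
    "Timestamp\t1014.temperature\t1014.humidity\t1015.temperature\t1015.humidity\n"
    "2017\t23.12\t12.2\t25.10\t10.34\n"
)
df = pd.read_csv(StringIO(raw), sep="\t")

# One row per (Timestamp, original column) pair.
long_df = df.melt(id_vars="Timestamp", var_name="column", value_name="value")

# "1014.temperature" -> ID "1014", measure "temperature".
long_df[["ID", "measure"]] = long_df["column"].str.split(".", n=1, expand=True)

# Spread the measures back out into their own columns.
tidy = (
    long_df.pivot(index=["Timestamp", "ID"], columns="measure", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
print(tidy)
```

Note that the ID values stay strings here; tidy['ID'].astype(int) converts them if numeric IDs are needed.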

KeyError after resampling

I have two dataframes df indexed by Datetime and df2 which has column Date (Series).
Before resampling I can run:
>>> df[df2['Date'][0]]
and obtain all rows corresponding to day df2['Date'][0] which is 2013-08-07 in this example. However after resampling by day I can no longer obtain the row corresponding to that day as:
>>> df.resample('D', how=np.max)[df2['Date'][0]]
KeyError: u'no item named 2013-08-07'
although that day is in the dataset
>>> df.resample('D', how=np.max).head()
| Temp | etc
Date | |
---------------------------
2013-08-07 | 26.1 |
---------------------------
2013-08-08 | 28.2 |
---------------------------
etc
I am not sure whether it is a bug or it is designed to be like this, or, if the latter is true, why. But you can do this to get the desired result:
In [168]:
df1=pd.DataFrame(np.random.random(100), columns=['Temp'])
df1.index=pd.date_range('2013-08-07',periods=100,freq='5H')
df1.index.name='Date'
In [169]:
df2=pd.DataFrame(pd.date_range('2013-08-07',periods=23, freq='D'), columns=['Date'])
In [170]:
#You can do this
df3=df1.resample('D', how=np.max)
print df3[df3.index==df2['Date'][0]]
Temp
Date
2013-08-07 0.8128
[1 rows x 1 columns]
In [171]:
df3[df2['Date'][0]]
#Error
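In current pandas the how= argument is gone and the aggregation is called on the resampler instead. Selecting the row with .loc makes it explicit that a row label is meant, whereas df[key] looks for a column with that name. A minimal sketch, using deterministic data in place of the random values above so the result is predictable:

```python
import numpy as np
import pandas as pd

# Values 0..99 at 5-hour steps, starting at midnight on 2013-08-07.
df1 = pd.DataFrame(
    {"Temp": np.arange(100, dtype=float)},
    index=pd.date_range("2013-08-07", periods=100, freq="5h", name="Date"),
)
df2 = pd.DataFrame({"Date": pd.date_range("2013-08-07", periods=23, freq="D")})

# Daily maximum, written without the removed how= argument.
daily = df1.resample("D").max()

# .loc selects by row label, so the Timestamp from df2 finds the resampled row.
day = df2["Date"].iloc[0]
row = daily.loc[day]
print(row)
```

Here the first day contains the readings at 00:00, 05:00, 10:00, 15:00 and 20:00, so its maximum is 4.0.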