convert numpy.datetime64 into epoch time - numpy

I am trying to convert my numpy array new_feat_dt, which contains numpy.datetime64 values, into epoch time. I want to make sure that when the conversion happens the dates stay in UTC.
I am using numpy 1.16.4 and Python 3.6.
I have tried the two ways of converting shown in the code below.
import numpy as np
new_feat_dt = [np.datetime64('2019-07-25T14:23:01'), np.datetime64('2019-07-25T14:25:01'), np.datetime64('2019-07-25T14:27:01')]
final= [(x - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's') for x in new_feat_dt]
print (final)
print(type(final[0]))
final2= [np.datetime64(x,'s').astype(int) for x in new_feat_dt]
print (final2)
print(type(final2[0]))
Output of the above code:
[1564064581.0, 1564064701.0, 1564064821.0]
<class 'numpy.float64'>
[1564064581, 1564064701, 1564064821]
<class 'numpy.int32'>
The above happens because the times in the new_feat_dt array are treated as GMT. I want them to be treated as my local time zone ('US/Eastern').
The correct conversion should be:
[1564078981, 1564079101, 1564079221]

numpy.datetime64 is a timezone-naive datetime type. To attach timezone information to the datetimes, try using Python's datetime together with the pytz module.
import numpy as np
import pytz
from datetime import datetime
new_feat_dt = [np.datetime64('2019-07-25T14:23:01'), np.datetime64('2019-07-25T14:25:01'), np.datetime64('2019-07-25T14:27:01')]
eastern = pytz.timezone('US/Eastern')
final = [int(eastern.localize(dt.astype(datetime)).timestamp()) for dt in new_feat_dt]
print(final)
The output:
[1564078981, 1564079101, 1564079221]
It's probably better to initialize new_feat_dt with timezone-aware datetime.datetime objects in the first place.
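On Python 3.9+, the same localization can be done without pytz using the standard library's zoneinfo module. Unlike pytz, a zoneinfo tzinfo can be attached directly with replace(); a minimal sketch of the same conversion:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

import numpy as np

new_feat_dt = [np.datetime64('2019-07-25T14:23:01'),
               np.datetime64('2019-07-25T14:25:01'),
               np.datetime64('2019-07-25T14:27:01')]
eastern = ZoneInfo('US/Eastern')

# Interpret each naive datetime64 as US/Eastern wall time,
# then take the POSIX epoch timestamp.
final = [int(dt.astype(datetime).replace(tzinfo=eastern).timestamp())
         for dt in new_feat_dt]
print(final)  # [1564078981, 1564079101, 1564079221]
```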

Related

Xarray datetime to ordinal

In pandas there is a toordinal function to convert a datetime to an ordinal, as in Convert date to ordinal python? or Pandas datetime column to ordinal. I have an xarray DataArray with a time coordinate that I want to convert to ordinals. Is there something similar to pandas' toordinal in xarray?
sample:
Coordinates:
time
array(['2019-07-31T10:00:00.000000000', '2019-07-31T10:15:00.000000000',
'2019-07-31T10:30:00.000000000', '2019-07-31T10:45:00.000000000',
'2019-07-31T11:00:00.000000000', '2019-07-31T11:15:00.000000000',
'2019-07-31T11:30:00.000000000', '2019-07-31T11:45:00.000000000',
'2019-07-31T12:00:00.000000000'], dtype='datetime64[ns]')
I didn't find an xarray-native way to do it.
But you can work around it by converting the time values to datetime objects, on which you can then use toordinal:
import pandas as pd
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")
time_ordinal = [pd.to_datetime(x).toordinal() for x in ds.time.values]
print(time_ordinal[:5])
# [734869, 734869, 734869, 734869, 734870]
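For a large time coordinate, the per-element Python loop can be avoided. Since datetime64 counts from the Unix epoch while toordinal counts from year 1, a vectorized sketch is to take the day count and shift it by the epoch's own ordinal:

```python
import numpy as np
from datetime import date

times = np.array(['2019-07-31T10:00:00', '2019-07-31T10:15:00'],
                 dtype='datetime64[ns]')

# Whole days since the Unix epoch, shifted by the epoch's proleptic ordinal,
# reproduce datetime.date.toordinal() for the date part of each timestamp.
EPOCH_ORDINAL = date(1970, 1, 1).toordinal()  # 719163
time_ordinal = times.astype('datetime64[D]').astype(np.int64) + EPOCH_ORDINAL
print(time_ordinal)  # [737271 737271]
```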

Parse CSV with far future dates to Parquet

I'm trying to read a CSV into pandas and then write it to Parquet. The challenge is that the CSV has a date column with the value 3000-12-31, and apparently pandas has no way to store that value as an actual date. Because of that, PyArrow fails to convert the date value.
An example file and the code to reproduce it:
test.csv
t
3000-12-31
import pandas as pd
import pyarrow as pa
df = pd.read_csv("test.csv", parse_dates=["t"])
schema = pa.schema([pa.field("t", pa.date64())])
table = pa.Table.from_pandas(df, schema=schema)
This gives a somewhat unhelpful error:
TypeError: an integer is required (got type str)
What's the right way to do this?
Pandas datetime columns (which use the datetime64[ns] data type) indeed cannot store such dates.
One possible workaround is to convert the strings to datetime.datetime objects in an object-dtype column; pyarrow should then accept them to create a date column.
This conversion could e.g. be done with dateutil:
>>> import dateutil
>>> df['t'] = df['t'].apply(dateutil.parser.parse)
>>> df
t
0 3000-12-31 00:00:00
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
or, if you use a fixed format, datetime.datetime.strptime is probably more reliable:
>>> import datetime
>>> df['t'] = df['t'].apply(lambda s: datetime.datetime.strptime(s, "%Y-%m-%d"))
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
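For ISO-formatted columns, the strptime step can equally be done with the standard library's datetime.date.fromisoformat, which has no nanosecond limitation; a minimal sketch of just the parsing step:

```python
from datetime import date

values = ["3000-12-31", "1999-01-01"]

# datetime.date covers years 1..9999, unlike pandas' datetime64[ns]
# (which overflows after 2262-04-11), so far-future dates parse fine.
parsed = [date.fromisoformat(v) for v in values]
print(parsed[0].year)  # 3000
```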

numpy equivalent of pandas.Timestamp.floor('15min')

I'm trying to compute the floor of a datetime64-typed pandas series, i.e. the equivalent of pandas.Timestamp.floor('15min'), for '1D', '1H', '15min', '5min' and '1min' intervals.
I can do it if I convert datetime64 to pandas Timestamp directly:
pd.to_datetime(df.DATA_CZAS.to_numpy()).floor('15min')
But how can I do that without converting to pandas (which is quite slow)?
Note that I can't convert datetime64[ns] to int:
df.time_variable.astype(int)
>>> cannot astype a datetimelike from [datetime64[ns]] to [int32]
type(df.time_variable)
>>> pandas.core.series.Series
df.time_variable.dtypes
>>> dtype('<M8[ns]')
Fortunately, NumPy allows converting between datetimes of different resolutions, as well as to integers. So you can use the following code:
result = (a.astype('datetime64[m]').astype(int) // 15 * 15)\
.astype('datetime64[m]').astype('datetime64[s]')
Read the above code in the following sequence:
a.astype('datetime64[m]') - convert to minute resolution (the number of minutes since the Unix epoch).
.astype(int) - convert to int (the same number of minutes, but as an int).
(... // 15 * 15) - floor-divide by 15 and multiply by 15; this is where the rounding down happens.
.astype('datetime64[m]') - convert back to datetime (minute precision).
.astype('datetime64[s]') - convert back to the original (second) precision (optional).
To test the code I created the following array:
a = np.array(['2007-07-12 01:12:10', '2007-08-13 01:15:12',
'2007-09-14 01:17:16', '2007-10-15 01:30:00'], dtype='datetime64')
The result of my rounding down is:
array(['2007-07-12T01:00:00', '2007-08-13T01:15:00',
'2007-09-14T01:15:00', '2007-10-15T01:30:00'], dtype='datetime64[s]')
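The same trick generalizes to the other intervals from the question ('1D', '1H', '5min', '1min'), since each is a whole number of minutes and the Unix epoch falls on a day boundary. A small helper (the name floor_minutes is my own) might look like:

```python
import numpy as np

def floor_minutes(a, step):
    """Floor a datetime64 array to a multiple of `step` minutes."""
    m = a.astype('datetime64[m]').astype(np.int64)  # minutes since the epoch
    return ((m // step) * step).astype('datetime64[m]').astype('datetime64[s]')

a = np.array(['2007-07-12 01:12:10', '2007-10-15 01:30:00'],
             dtype='datetime64[s]')
print(floor_minutes(a, 15))    # 15-minute floor: 01:00:00 and 01:30:00
print(floor_minutes(a, 1440))  # '1D' floor: both dates at midnight
```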

How to convert numpy.timedelta64 to minutes

I have a duration column in a Pandas DataFrame and I'd like to convert it to minutes or seconds.
For example: I want to convert 00:27:00 to 27 minutes.
example = data['duration'][0]
example
result: numpy.timedelta64(1620000000000,'ns')
What's the best way to achieve this?
Use array.astype() to convert the type of an array safely:
>>> import numpy as np
>>> a = np.timedelta64(1620000000000,'ns')
>>> a.astype('timedelta64[m]')
numpy.timedelta64(27,'m')
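Note that astype truncates to whole units. If fractional minutes matter, dividing by a unit timedelta returns a plain float instead; a quick sketch:

```python
import numpy as np

d = np.timedelta64(1620000000000, 'ns')

# Dividing two timedelta64 values yields a float, so fractional parts
# survive (astype('timedelta64[m]') would truncate them).
minutes = d / np.timedelta64(1, 'm')
seconds = d / np.timedelta64(1, 's')
print(minutes, seconds)  # 27.0 1620.0
```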

Getting usable dates from Axes.get_xlim() in a pandas time series plot

I'm trying to get the xlimits of a plot as a python datetime object from a time series plot created with pandas. Using ax.get_xlim() returns the axis limits as a numpy.float64, and I can't figure out how to convert the numbers to a usable datetime.
import pandas
from matplotlib import dates
import matplotlib.pyplot as plt
from datetime import datetime
from numpy.random import randn
ts = pandas.Series(randn(10000), index=pandas.date_range('1/1/2000',
periods=10000, freq='H'))
ts.plot()
ax = plt.gca()
ax.set_xlim(datetime(2000,1,1))
d1, d2 = ax.get_xlim()
print "%s(%s) to %s(%s)" % (d1, type(d1), d2, type(d2))
print "Using matplotlib: %s" % dates.num2date(d1)
print "Using datetime: %s" % datetime.fromtimestamp(d1)
which returns:
262968.0 (<type 'numpy.float64'>) to 272967.0 (<type 'numpy.float64'>)
Using matplotlib: 0720-12-25 00:00:00+00:00
Using datetime: 1970-01-03 19:02:48
According to the pandas timeseries docs, pandas uses the numpy.datetime64 dtype. I'm using pandas version '0.9.0'.
I am using get_xlim() instead of directly accessing the pandas series because I am using the xlim_changed callback to do other things when the user moves around in the plot area.
Hack to get usable values
For the above example, the limits are returned in hours since the epoch. So I can convert to seconds since the epoch and use time.gmtime() to get something usable, but this still doesn't feel right.
In [66]: d1, d2 = ax.get_xlim()
In [67]: time.gmtime(d1*60*60)
Out[67]: time.struct_time(tm_year=2000, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=1, tm_isdst=0)
The current behavior of matplotlib.dates:
datetime objects are converted to floating point numbers which represent time in days since 0001-01-01 UTC, plus 1. For example, 0001-01-01, 06:00 is 1.25, not 0.25. The helper functions date2num(), num2date() and drange() are used to facilitate easy conversion to and from datetime and numeric ranges.
pandas.tseries.converter.PandasAutoDateFormatter() seems to build on this, so:
x = pandas.date_range(start='01/01/2000', end='01/02/2000')
plt.plot(x, x)
matplotlib.dates.num2date(plt.gca().get_xlim()[0])
gives:
datetime.datetime(2000, 1, 1, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x7ff73a60f290>)
# First convert to a pandas Period
period = pandas.Period(ordinal=int(d1), freq=ax.freq)
# Then convert to a pandas Timestamp
ts = period.to_timestamp()
# Then convert to a python datetime object
dt = ts.to_pydatetime()
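In the example above the limits were hours since the epoch, so a Period with hourly frequency reconstructs the date. A sketch of the round trip, assuming ax.freq is hourly (the freq string may need to be spelled 'H' on older pandas versions):

```python
import pandas as pd

d1 = 262968.0  # left x-limit as reported by ax.get_xlim() above

# Period ordinals count periods elapsed since the epoch, so an hourly
# Period built from the limit gives back the original timestamp.
period = pd.Period(ordinal=int(d1), freq='h')
print(period.to_timestamp())  # 2000-01-01 00:00:00
```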