numpy equivalent of pandas.Timestamp.floor('15min') - pandas

I'm trying to floor a pandas series of dtype datetime64, i.e. obtain the equivalent of pandas.Timestamp.floor('15min'), for '1D', '1H', '15min', '5min' and '1min' intervals.
I can do it if I convert datetime64 to pandas Timestamp directly:
pd.to_datetime(df.DATA_CZAS.to_numpy()).floor('15min')
But how can I do that without converting to pandas (which is quite slow)?
Note that I can't convert datetime64[ns] to int:
df.time_variable.astype(int)
>>> cannot astype a datetimelike from [datetime64[ns]] to [int32]
type(df.time_variable)
>>> pandas.core.series.Series
df.time_variable.dtypes
>>> dtype('<M8[ns]')

Fortunately, NumPy allows converting between datetimes of different
resolutions, and also to and from integers.
So you can use the following code:
result = (a.astype('datetime64[m]').astype(int) // 15 * 15)\
.astype('datetime64[m]').astype('datetime64[s]')
Read the above code in the following sequence:
a.astype('datetime64[m]') - convert to minute resolution (the
number of minutes since the Unix epoch).
.astype(int) - convert to int (the same number of minutes, but as int).
(... // 15 * 15) - divide by 15 with rounding down and multiply
by 15. Just here the rounding appears.
.astype('datetime64[m]') - convert back to datetime (minute
precision).
.astype('datetime64[s]') - convert to the original (second)
precision (optional).
To test the code I created the following array:
a = np.array(['2007-07-12 01:12:10', '2007-08-13 01:15:12',
'2007-09-14 01:17:16', '2007-10-15 01:30:00'], dtype='datetime64')
The result of my rounding down is:
array(['2007-07-12T01:00:00', '2007-08-13T01:15:00',
'2007-09-14T01:15:00', '2007-10-15T01:30:00'], dtype='datetime64[s]')
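The same idea generalizes to the other intervals from the question ('1D', '1H', '5min', '1min'). A sketch of a small helper (the name floor_datetime64 is mine, not a NumPy function):

```python
import numpy as np

def floor_datetime64(a, step):
    """Floor a datetime64 array down to a multiple of `step` (a np.timedelta64)."""
    unit = np.datetime_data(step.dtype)[0]       # e.g. 'm' for a 15-minute step
    step_int = step.astype(f'timedelta64[{unit}]').astype(int)
    ints = a.astype(f'datetime64[{unit}]').astype('int64')
    # integer floor division snaps each value down to the nearest multiple
    return (ints // step_int * step_int).astype(f'datetime64[{unit}]').astype('datetime64[s]')

a = np.array(['2007-07-12 01:12:10', '2007-08-13 01:15:12',
              '2007-09-14 01:17:16', '2007-10-15 01:30:00'], dtype='datetime64[s]')

print(floor_datetime64(a, np.timedelta64(15, 'm')))  # 15-minute floor
print(floor_datetime64(a, np.timedelta64(1, 'D')))   # daily floor
```

Because NumPy's integer floor division rounds toward negative infinity, this also floors correctly for dates before the epoch.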

Related

How to prevent value out of range for a double-precision float?

I have a large dataframe that will be regularly updated and then converted to a tensorflow dataset. The dataframe may be updated with values exceeding the range of a double-precision float, but tensorflow can't convert these values. I need a way to round all the out-of-range values to within range.
Using the code:
test_data_x = test_data_x.astype(float)
test_dataset_x = tf.data.Dataset.from_tensor_slices(test_data_x.to_dict(orient="list"))
produces the error "Can't convert Python sequence with a value out of range for a double-precision float."
Is there a way to convert my data so that all the values are made to be in range?
Reproducible example:
It's weird. The numbers don't even have to exceed the float64 maximum value!
max_f32 = np.finfo('float32').max
df = pd.DataFrame([[max_f32, max_f32 * 2], [max_f32, max_f32 * 2]])
print(df.dtypes)
0    float32
1    float64
dtype: object
tf.data.Dataset.from_tensor_slices(df.to_dict(orient="list"))
ValueError: Can't convert Python sequence with a value out of range for a double-precision float.
It works, however, if the float64 value is not so large:
df = pd.DataFrame([[max_f32, max_f32 + (max_f32/2**30)], [max_f32, max_f32 + 2*100]])
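The thread leaves the question open. One possible workaround (my sketch, not from the thread, assuming the values only need to fit the finite float32 range for TensorFlow to accept them) is to clip every value before conversion:

```python
import numpy as np
import pandas as pd

f32 = np.finfo('float32')
df = pd.DataFrame([[f32.max, float(f32.max) * 2],
                   [f32.max, float(f32.max) * 2]])

# Clip every value into the finite float32 range; out-of-range values
# (including inf) become +/- float32 max.
clipped = df.clip(lower=float(f32.min), upper=float(f32.max))

# tf.data.Dataset.from_tensor_slices(clipped.to_dict(orient="list"))
# should then accept the data (TensorFlow call commented out so this
# sketch runs without TensorFlow installed).
print(clipped)
```

Whether clipping is acceptable depends on the application; it silently changes out-of-range values rather than flagging them.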

How to plot unix timestamps

I have a time series given as sec.nsec (unix time?) where a signal is either 0 or 1, and I want to plot it as a square signal. Currently I have the following code:
from matplotlib.pyplot import *
time = ['1633093403.754783918', '1633093403.755350983', '1633093403.760918965', '1633093403.761298577', '1633093403.761340378', '1633093403.761907443']
data = [1, 0, 1, 0, 1, 0]
plot(time, data)
show()
This plots:
Is there any conversion needed for the time before plotting? I cannot use date:time labels, as these points might be only ns to ms apart.
Thank you.
EDIT: The values of the list for time are strings
To convert unix timestamp strings to datetime64 you need to first convert to float, and then convert to datetime64 with the correct units:
time = ['1633093403.754783918', '1633093403.755350983', '1633093403.760918965', '1633093403.761298577', '1633093403.761340378', '1633093403.761907443']
time = (np.asarray(time).astype(float)).astype('datetime64[s]')
print(time.dtype)
print(time)
yields:
datetime64[s]
['2021-10-01T13:03:23' '2021-10-01T13:03:23' '2021-10-01T13:03:23'
 '2021-10-01T13:03:23' '2021-10-01T13:03:23' '2021-10-01T13:03:23']
Note the nanoseconds have been stripped. If you want to keep those...
time = (np.asarray(time).astype(float)*1e9).astype('datetime64[ns]')
yields:
datetime64[ns]
['2021-10-01T13:03:23.754783744' '2021-10-01T13:03:23.755351040'
'2021-10-01T13:03:23.760918784' '2021-10-01T13:03:23.761298688'
'2021-10-01T13:03:23.761340416' '2021-10-01T13:03:23.761907456']
This all works because datetime64 uses the same epoch (zero point) as unix timestamps: 1970-01-01T00:00:00.000000.
Once you do this conversion, plotting should work fine.
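To get the square-wave look the question asks for, a step plot helps. A sketch (drawstyle='steps-post' is my suggestion, not from the answer; the Agg backend is only used so the sketch renders off-screen):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

time = ['1633093403.754783918', '1633093403.755350983',
        '1633093403.760918965', '1633093403.761298577',
        '1633093403.761340378', '1633093403.761907443']
data = [1, 0, 1, 0, 1, 0]

# float seconds -> integer nanoseconds -> datetime64[ns]
t = (np.asarray(time).astype(float) * 1e9).astype('datetime64[ns]')

fig, ax = plt.subplots()
# steps-post holds each value until the next sample, which gives
# the square signal rather than sloped lines between points
ax.plot(t, data, drawstyle='steps-post')
fig.autofmt_xdate()
fig.savefig('square_signal.png')
```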

How to filter on a column that has both float and datetime

I have a column in my dataframe that has both datetime values and float values. How do I filter out the float values? I have tried the following:
import datetime
a = pd.DataFrame([10.0,datetime.datetime.now(),20.0])
a = a[a.dtype!=float]
That does not work because pandas says the entire column is datatype object. The goal would be to get rid of the 10 and the 20 and just leave the current time value.
I highly suspect that the floats that you see are NaN values. So, I would suggest this:
a_float_free = a.dropna()
On the other hand, if my doubt is wrong you can then filter out the floats using
import datetime
a = pd.DataFrame([10.0,datetime.datetime.now(),20.0])
a_float_free = a[a[0].apply(lambda x: not isinstance(x, float))]
PS: In your dummy example in the question, you gave ints instead of floats; I took the liberty of changing them to floats.

multiplying difference between two dates in days by float vectorized form

I have a function which calculates the difference between two dates and then multiplies it by a rate. I would like to use this in a one-off example, but also apply it to a pd.Series in vectorized form for large-scale calculations. Currently it is getting hung up at
(start_date - end_date).days
AttributeError: 'Series' object has no attribute 'days'
pddt = lambda x: pd.to_datetime(x)
def cost(start_date, end_date, cost_per_day):
    start_date = pddt(start_date)
    end_date = pddt(end_date)
    total_days = (end_date - start_date).days
    cost = total_days * cost_per_day
    return cost
a={'start_date': ['2020-07-01','2020-07-02'], 'end_date': ['2020-07-04','2020-07-10'],'cost_per_day': [2,1.5]}
df = pd.DataFrame.from_dict(a)
costs = cost(df.start_date, df.end_date, df.cost_per_day)
cost_adhoc = cost('2020-07-15', '2020-07-22',3)
If I run it with the Series I get the following error:
AttributeError: 'Series' object has no attribute 'days'
If I try to correct it by adding .dt.days, then when I use a single input I get the following error:
AttributeError: 'Timestamp' object has no attribute 'dt'
You can change the function to use:
total_days = (end_date-start_date) / np.timedelta64(1, 'D')
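Putting that change into the question's function, a sketch that works both for a single pair of dates and for whole Series:

```python
import numpy as np
import pandas as pd

def cost(start_date, end_date, cost_per_day):
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    # dividing a timedelta by np.timedelta64(1, 'D') yields a float number
    # of days, and works for both a scalar difference and a Series
    total_days = (end_date - start_date) / np.timedelta64(1, 'D')
    return total_days * cost_per_day

df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
                   'end_date': ['2020-07-04', '2020-07-10'],
                   'cost_per_day': [2, 1.5]})

print(cost(df.start_date, df.end_date, df.cost_per_day))  # vectorized
print(cost('2020-07-15', '2020-07-22', 3))                # one-off scalar
```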
Assuming both variables are datetime objects, the expression (end_date-start_date) gives you a timedelta object [docs]. It holds time difference as days, seconds, and microseconds. To convert that to days for example, you would use (end_date-start_date).total_seconds()/(24*60*60).
For the given question, the goal is to multiply daily costs with the total number of days. pandas uses a subclass of timedelta (timedelta64[ns] by default) which facilitates getting the total days (no total_seconds() needed), see frequency conversion. All you need to do is change the timedelta to dtype timedelta64[D] (D for daily frequency):
import pandas as pd
df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
'end_date': ['2020-07-04', '2020-07-10'],
'cost_per_day': [2, 1.5]})
# make sure dtype is datetime:
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# multiply cost/d with total days: end_date-start_date converted to days
df['total_cost'] = df['cost_per_day'] * (df['end_date']-df['start_date']).astype('timedelta64[D]')
# df['total_cost']
# 0 6.0
# 1 12.0
# Name: total_cost, dtype: float64
Note: you don't need a pandas.DataFrame here; working with pandas.Series also does the trick. However, since pandas was created for these kinds of operations, it brings a lot of convenience. In particular, you don't need to do any iteration in Python; it's done for you in fast C code.

Getting usable dates from Axes.get_xlim() in a pandas time series plot

I'm trying to get the xlimits of a plot as a python datetime object from a time series plot created with pandas. Using ax.get_xlim() returns the axis limits as a numpy.float64, and I can't figure out how to convert the numbers to a usable datetime.
import pandas
from matplotlib import dates
import matplotlib.pyplot as plt
from datetime import datetime
from numpy.random import randn
ts = pandas.Series(randn(10000), index=pandas.date_range('1/1/2000',
periods=10000, freq='H'))
ts.plot()
ax = plt.gca()
ax.set_xlim(datetime(2000,1,1))
d1, d2 = ax.get_xlim()
print "%s(%s) to %s(%s)" % (d1, type(d1), d2, type(d2))
print "Using matplotlib: %s" % dates.num2date(d1)
print "Using datetime: %s" % datetime.fromtimestamp(d1)
which returns:
262968.0 (<type 'numpy.float64'>) to 272967.0 (<type 'numpy.float64'>)
Using matplotlib: 0720-12-25 00:00:00+00:00
Using datetime: 1970-01-03 19:02:48
According to the pandas timeseries docs, pandas uses the numpy.datetime64 dtype. I'm using pandas version '0.9.0'.
I am using get_xlim() instead of directly accessing the pandas series because I am using the xlim_changed callback to do other things when the user moves around in the plot area.
Hack to get usable values
For the above example, the limits are returned in hours since the Epoch. So I can convert to seconds since the Epoch and use time.gmtime() to get somewhere usable, but this still doesn't feel right.
In [66]: d1, d2 = ax.get_xlim()
In [67]: time.gmtime(d1*60*60)
Out[67]: time.struct_time(tm_year=2000, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=1, tm_isdst=0)
The current behavior of matplotlib.dates:
datetime objects are converted to floating point numbers which represent time in days since 0001-01-01 UTC, plus 1. For example, 0001-01-01, 06:00 is 1.25, not 0.25. The helper functions date2num(), num2date() and drange() are used to facilitate easy conversion to and from datetime and numeric ranges.
pandas.tseries.converter.PandasAutoDateFormatter() seems to build on this, so:
x = pandas.date_range(start='01/01/2000', end='01/02/2000')
plt.plot(x, x)
matplotlib.dates.num2date(plt.gca().get_xlim()[0])
gives:
datetime.datetime(2000, 1, 1, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x7ff73a60f290>)
# First convert to pandas Period
period = pandas.tseries.period.Period(ordinal=int(d1), freq=ax.freq)
# Then convert to pandas timestamp
ts = period.to_timestamp()
# Then convert to date object
dt = ts.to_datetime()
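With current matplotlib and pandas, plotting through plain ax.plot (rather than the pandas period-based ts.plot() of version 0.9) makes the axis use matplotlib date numbers, so num2date() applies directly inside the xlim_changed callback the question mentions. A sketch (names are illustrative, Agg backend only so it runs off-screen):

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend for the sketch
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(100),
               index=pd.date_range('2000-01-01', periods=100, freq='h'))

fig, ax = plt.subplots()
# Plot via matplotlib directly, so get_xlim() returns matplotlib
# date numbers that num2date() can convert.
ax.plot(ts.index, ts.values)

def on_xlim_changed(ax):
    d1, d2 = (mdates.num2date(x) for x in ax.get_xlim())
    print('visible range:', d1, 'to', d2)

ax.callbacks.connect('xlim_changed', on_xlim_changed)
ax.set_xlim(ts.index[10], ts.index[20])  # fires the callback
```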