I want to compute the time difference between consecutive times in a DatetimeIndex:
import pandas as pd
p = pd.DatetimeIndex(['1985-11-14', '1985-11-28', '1985-12-14', '1985-12-28'], dtype='datetime64[ns]')
I can compute the time difference of two times:
p[1] - p[0]
gives
Timedelta('14 days 00:00:00')
But p[1:] - p[:-1] doesn't work and gives
DatetimeIndex(['1985-12-28'], dtype='datetime64[ns]', freq=None)
and a future warning:
FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference()
Any thoughts on how I can (easily) compute the time difference between values in a DatetimeIndex? And why does it work for one value, but not for the entire DatetimeIndex?
Convert the DatetimeIndex to a Series using to_series() and then call diff to calculate inter-row differences:
In [5]:
p.to_series().diff()
Out[5]:
1985-11-14 NaT
1985-11-28 14 days
1985-12-14 16 days
1985-12-28 14 days
dtype: timedelta64[ns]
As to why it failed: on an index, the - operator attempts a set difference between the two index ranges, whereas what you want is to subtract the values of one range from the other, which is what diff does.
When you did p[1] - p[0], the - performed a scalar subtraction, but when you do this on an index, pandas thinks that you're performing a set operation.
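To make the distinction concrete, here is a minimal sketch of three ways to get the differences without relying on how - is interpreted for indexes. The variable names single, gaps and diffs are only illustrative, and going through .values is one possible workaround rather than the only one.

```python
import pandas as pd

p = pd.DatetimeIndex(['1985-11-14', '1985-11-28', '1985-12-14', '1985-12-28'])

# Scalar case: Timestamp - Timestamp gives a Timedelta.
single = p[1] - p[0]

# Element-wise case: subtracting the underlying NumPy arrays avoids any
# index-level meaning of "-" and pairs each value with its predecessor.
gaps = pd.TimedeltaIndex(p[1:].values - p[:-1].values)

# Series.diff() does the same pairing and keeps the original labels,
# with NaT in the first position.
diffs = p.to_series().diff()

print(single)
print(gaps)
print(diffs)
```

The array-based version returns one element fewer than the input, while diff() keeps the length and marks the first entry as missing.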
The - operator is working, it's just not doing what you expect. In the second situation it is acting to give the difference of the two datetime indices, that is the value that is in p[1:] but not in p[:-1]
There may be a better solution, but it would work to perform the operation element wise:
[e - k for e,k in zip(p[1:], p[:-1])]
I used None to fill the first difference value, but I'm sure you can figure out how you would like to deal with that case.
>>> [None] + [p[n] - p[n-1] for n in range(1, len(p))]
[None,
Timedelta('14 days 00:00:00'),
Timedelta('16 days 00:00:00'),
Timedelta('14 days 00:00:00')]
BTW, to just get the day difference:
[None] + [(p[n] - p[n-1]).days for n in range(1, len(p))]
[None, 14, 16, 14]
I have a dataframe with a column of dates. Unfortunately, my import (using read_excel) brought in some dates as datetimes and others as Excel dates stored as integers.
What I am seeking is a column with dates only, in the format %Y-%m-%d.
From research, Excel starts at 1900-01-00, so I could add these integers. I have tried to use str.extract and a regex in order to separate the column into two, one of datetimes, the other of integers. However, the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
Attempt to first separate the columns by extracting the integers (dates imported from MS Excel):
df.date_from.str.extract(r'(\d\d\d\d\d)')
however this gives NaN.
The reason I tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column (in other words, an error using the following code):
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time,'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
You can convert the values to timedeltas with to_timedelta, using errors='coerce' so that non-integers become NaT, and add a Timestamp called d for the Excel epoch. Then convert the datetimes with to_datetime, again with errors='coerce', and finally combine the two with Series.fillna in a custom function:
def f(x):
    # https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    dates = pd.to_datetime(x, errors='coerce')
    return (timedeltas + d).fillna(dates)
cols = ['date_from','date_to']
df[cols] = df[cols].apply(f)
print (df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16
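As a cross-check on the 1899-12-30 epoch used above, here is a small element-wise sketch. The helper name excel_serial_to_timestamp and the constant EXCEL_EPOCH are hypothetical, and the code assumes the workbook uses Excel's 1900 date system and that serial numbers arrive as plain Python int or float values.

```python
import pandas as pd

# Assumption: Excel 1900 date system, where serial numbers for modern dates
# count days from 1899-12-30.
EXCEL_EPOCH = pd.Timestamp('1899-12-30')

def excel_serial_to_timestamp(value):
    """Return a Timestamp for an Excel serial number or an existing date-like value."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return EXCEL_EPOCH + pd.to_timedelta(value, unit='D')
    return pd.Timestamp(value)

mixed = pd.Series([pd.Timestamp('2022-09-10'), 44476, pd.Timestamp('2021-02-16')])

# Map each element, then let pandas normalise the result to datetime64.
cleaned = pd.to_datetime(mixed.map(excel_serial_to_timestamp))
print(cleaned.dt.strftime('%Y-%m-%d'))
```

This is slower than the vectorized function above because it runs Python code per element, but it makes the branching between serial numbers and real datetimes explicit.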
I am trying to compute the floor of a pandas Series of dtype datetime64, to obtain the equivalent of pandas.Timestamp.round('15min') for the '1D', '1H', '15min', '5min' and '1min' intervals.
I can do it if I convert the datetime64 values to pandas Timestamps directly:
pd.to_datetime(df.DATA_CZAS.to_numpy()).floor('15min')
But how can I do that without the conversion to pandas (which is quite slow)?
Note that I can't convert datetime64[ns] to int, as:
df.time_variable.astype(int)
>>> cannot astype a datetimelike from [datetime64[ns]] to [int32]
type(df.time_variable)
>>> pandas.core.series.Series
df.time_variable.dtypes
>>> dtype('<M8[ns]')
Fortunately, NumPy can convert between datetimes of different resolutions, and also to and from integers.
So you can use the following code:
result = (a.astype('datetime64[m]').astype(int) // 15 * 15)\
.astype('datetime64[m]').astype('datetime64[s]')
Read the above code in the following sequence:
a.astype('datetime64[m]') - convert to minute resolution (the number of minutes since the Unix epoch).
.astype(int) - convert to int (the same number of minutes, but as int).
(... // 15 * 15) - divide by 15, rounding down, then multiply by 15. This is where the rounding happens.
.astype('datetime64[m]') - convert back to datetime (minute precision).
.astype('datetime64[s]') - convert to the original (second) precision (optional).
To test the code I created the following array:
a = np.array(['2007-07-12 01:12:10', '2007-08-13 01:15:12',
'2007-09-14 01:17:16', '2007-10-15 01:30:00'], dtype='datetime64')
The result of my rounding down is:
array(['2007-07-12T01:00:00', '2007-08-13T01:15:00',
'2007-09-14T01:15:00', '2007-10-15T01:30:00'], dtype='datetime64[s]')
I have a function which calculates the difference between two dates and then multiplies that by a rate. I would like to use this in a one-off example, but also apply it to a pd.Series in a vectorized way for large-scale calculations. Currently it is getting hung up at
(start_date - end_date).days
AttributeError: 'Series' object has no attribute 'days'
pddt = lambda x: pd.to_datetime(x)
def cost(start_date, end_date, cost_per_day):
    start_date = pddt(start_date)
    end_date = pddt(end_date)
    total_days = (end_date - start_date).days
    cost = total_days * cost_per_day
    return cost
a={'start_date': ['2020-07-01','2020-07-02'], 'end_date': ['2020-07-04','2020-07-10'],'cost_per_day': [2,1.5]}
df = pd.DataFrame.from_dict(a)
costs = cost(df.start_date, df.end_date, df.cost_per_day)
cost_adhoc = cost('2020-07-15', '2020-07-22',3)
If I run it with the Series, I get the following error:
AttributeError: 'Series' object has no attribute 'days'
If I try to correct it by adding .dt.days, then when I use only a single input I get the following error:
AttributeError: 'Timestamp' object has no attribute 'dt'
You can change the line in the function that computes total_days to:
total_days = (end_date-start_date) / np.timedelta64(1, 'D')
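Putting that one-line change into the question's function gives a version that should accept both scalars and Series. This is a sketch built from the question's own data; only the division line differs from the original function.

```python
import numpy as np
import pandas as pd

def cost(start_date, end_date, cost_per_day):
    """Days between the two dates, multiplied by the daily rate."""
    start = pd.to_datetime(start_date)
    end = pd.to_datetime(end_date)
    # Dividing by a one-day timedelta works for a single Timedelta
    # and for a timedelta64 Series alike, yielding float days.
    total_days = (end - start) / np.timedelta64(1, 'D')
    return total_days * cost_per_day

df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
                   'end_date': ['2020-07-04', '2020-07-10'],
                   'cost_per_day': [2, 1.5]})

costs = cost(df['start_date'], df['end_date'], df['cost_per_day'])
cost_adhoc = cost('2020-07-15', '2020-07-22', 3)
print(costs)
print(cost_adhoc)
```

Unlike .days, the division keeps fractional days, so partial days contribute to the cost if the inputs carry a time component.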
Assuming both variables are datetime objects, the expression (end_date-start_date) gives you a timedelta object. It holds the time difference as days, seconds, and microseconds. To convert that to days, for example, you would use (end_date-start_date).total_seconds()/(24*60*60).
For the given question, the goal is to multiply the daily cost by the total number of days. pandas stores the difference as timedelta64[ns], and dividing it by a one-day Timedelta gives the number of days as a float directly (no total_seconds() needed). Older pandas versions also allowed casting with .astype('timedelta64[D]'), but pandas 2.0 and later reject that cast because only the 's', 'ms', 'us' and 'ns' resolutions are supported:
import pandas as pd
df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
'end_date': ['2020-07-04', '2020-07-10'],
'cost_per_day': [2, 1.5]})
# make sure dtype is datetime:
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# multiply cost/d with total days: end_date-start_date converted to days
df['total_cost'] = df['cost_per_day'] * ((df['end_date']-df['start_date']) / pd.Timedelta(days=1))
# df['total_cost']
# 0 6.0
# 1 12.0
# Name: total_cost, dtype: float64
Note: you don't need to use a pandas.DataFrame here; working with pandas.Series also does the trick. However, since pandas was created for this kind of operation, it brings a lot of convenience. In particular, you don't need to do any iteration in Python; it's done for you in fast C code.
I'm trying to perform specific operations based on the age of data in days within a dataframe. What I am looking for is something like as follows:
import pandas as pd
if 10days < (pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)):
    print('The data is older than 10 days')
Is there something I can replace "10days" with or some other way I can perform operations based on the difference between two Timestamp values?
What you're looking for is pd.Timedelta('10D'), pd.Timedelta(10, unit='D') (or unit='days' or unit='day'), or pd.Timedelta(days=10). For example,
In [37]: pd.Timedelta(days=10) < pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)
Out[37]: False
In [38]: pd.Timedelta(days=5) < pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)
Out[38]: True
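Since the question mentions the age of data within a dataframe, the same comparison can be applied to a whole column. The sketch below uses a fixed reference time instead of pd.Timestamp.now() so the result is reproducible; the column names recorded and stale are made up for illustration.

```python
import pandas as pd

# Fixed reference time (an assumption for reproducibility);
# in practice this could be pd.Timestamp.now().
now = pd.Timestamp('2019-03-28')

df = pd.DataFrame({'recorded': pd.to_datetime(['2019-03-10', '2019-03-20', '2019-03-27'])})

# Timestamp - Series gives a timedelta64 Series, which compares
# element-wise against a single Timedelta.
age = now - df['recorded']
df['stale'] = age > pd.Timedelta(days=10)
print(df)
```

The resulting boolean column can then be used as a mask, for example df[df['stale']], to select only the rows older than the threshold.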
I am not sure I understand the min_periods parameter in pandas rolling functions: why does it have to be smaller than the window parameter?
I would like to compute (for instance) the rolling max minus rolling min with a window of ten values BUT I want to wait maybe 20 values before starting computations:
In[1]: import pandas as pd
In[2]: import numpy as np
In[3]: df = pd.DataFrame(columns=['A','B'], data=np.random.randint(low=0,high=100,size=(100,2)))
In[4]: roll = df['A'].rolling(window=10, min_periods=20)
In[5]: df['C'] = roll.max() - roll.min()
In[6]: roll
Out[6]: Rolling [window=10,min_periods=20,center=False,axis=0]
In[7]: df['C'] = roll.max()-roll.min()
I get the following error:
ValueError: Invalid min_periods size 20 greater than window 10
I thought that min_periods was there to tell how many values the function had to wait before starting computations. The documentation says:
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA)
I had not been careful about the "in window" detail here...
Then what would be the most efficient way to achieve what I am trying to achieve? Should I do something like:
roll = df.loc[20:,'A'].rolling(window=10)
df['C'] = roll.max() - roll.min()
Is there a more efficient way?
The min_periods=n option simply means that you require at least n valid observations to compute your rolling stats.
For example, suppose min_periods=5 and you have a rolling mean over the last 10 observations. Now, what happens if 6 of the last 10 observations are actually missing values? Then, given that 4 < 5 (there are only 4 non-missing values here and you require at least 5 non-missing observations), the rolling mean will be missing as well.
It's a very, very important option.
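A smaller version of that scenario is easy to try. The sketch below uses a window of 3 on a made-up series with one missing value; strict and relaxed are just illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])

# With an integer window, min_periods defaults to the window size,
# so any window touching the NaN yields NaN.
strict = s.rolling(window=3).mean()

# Requiring only 2 valid observations lets windows with one gap
# still produce a value.
relaxed = s.rolling(window=3, min_periods=2).mean()

print(pd.DataFrame({'s': s, 'strict': strict, 'relaxed': relaxed}))
```

Only the last row has three valid observations in its window, so it is the only non-missing value under the strict setting.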
From the documentation
min_periods : int, default None Minimum number of observations in
window required to have a value (otherwise result is NA).
The min_periods argument is just a way to apply the function to a smaller sample than the rolling window. So let's say you want the rolling minimum over a window of 10: passing min_periods=5 allows pandas to calculate the min of the first 5 data points, then the first 6, then 7, 8, 9 and finally 10. From then on, because it has at least 10 data points, pandas rolls a full window of 10.
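The ramp-up described above, and one possible way to get the delayed start the question asks for, can be sketched as follows. Masking with where is a suggestion rather than something min_periods does itself, and the sketch assumes "wait 20 values" means the first 20 positions of the series.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(30, dtype=float))

# Ramp-up: results start once 5 observations are available,
# and the window grows until it reaches 10.
partial = s.rolling(window=10, min_periods=5).min()

# Delayed start: compute the full-window spread, then blank out
# the first 20 positions.
roll = s.rolling(window=10)
spread = roll.max() - roll.min()
delayed = spread.where(np.arange(len(s)) >= 20)

print(partial.head(6))
print(delayed.iloc[18:23])
```

Compared with slicing the input via df.loc[20:, 'A'], masking the output keeps the windows at positions 20 to 28 fully populated, because they can still look back at earlier rows.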