Pandas DataFrame: How to query the closest datetime index?

How do I query for the closest index in a pandas DataFrame? The index is a DatetimeIndex:
2016-11-13 20:00:10.617989120 7.0 132.0
2016-11-13 22:00:00.022737152 1.0 128.0
2016-11-13 22:00:28.417561344 1.0 132.0
I tried this:
df.index.get_loc(df.index[0], method='nearest')
but it gives me InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I get the same error if I try this:
dt = datetime.datetime.strptime("2016-11-13 22:01:25", "%Y-%m-%d %H:%M:%S")
df.index.get_loc(dt, method='nearest')
But if I remove method='nearest' it works; that is not what I want, though. I want to find the index closest to my query datetime.

It seems you need to first get the position with get_loc and then select with []:
dt = pd.to_datetime("2016-11-13 22:01:25.450")
print (dt)
2016-11-13 22:01:25.450000
print (df.index.get_loc(dt, method='nearest'))
2
idx = df.index[df.index.get_loc(dt, method='nearest')]
print (idx)
2016-11-13 22:00:28.417561344
# if you need to select the row as a Series, use iloc
s = df.iloc[df.index.get_loc(dt, method='nearest')]
print (s)
b 1.0
c 132.0
Name: 2016-11-13 22:00:28.417561344, dtype: float64

Passing method='nearest' to DatetimeIndex.get_loc is now deprecated in favour of DatetimeIndex.get_indexer:
ts = pd.to_datetime('2022-05-26 13:19:48.154000') # example time
iloc_idx = df.index.get_indexer([ts], method='nearest') # returns absolute index into df e.g. array([5])
loc_idx = df.index[iloc_idx] # if you want named index
my_val = df.iloc[iloc_idx]
my_val = df.loc[loc_idx] # as above so below...
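Since get_indexer returns an array, df.iloc[iloc_idx] above yields a one-row DataFrame. A small hedged sketch, reusing the variables above, for getting the matched row back as a Series instead:
row = df.iloc[iloc_idx[0]]          # scalar position -> Series, not a 1-row DataFrame
nearest_ts = df.index[iloc_idx[0]]  # the matched timestamp itself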

I believe jezrael's solution works, but not on my DataFrame (I have no clue why). This is the solution that I came up with.
from bisect import bisect  # operate on the timestamps as a sorted container
import numpy as np

np_dt64 = np.datetime64('2016-11-13 22:01:25.450')  # the query timestamp
timestamps = np.array(df.index)
upper_index = bisect(timestamps, np_dt64, hi=len(timestamps) - 1)  # upper neighbour of the query
# keep whichever of the two neighbouring timestamps is closer to the query
df_index = df.index.get_loc(min(timestamps[upper_index], timestamps[max(upper_index - 1, 0)],
                                key=lambda x: abs(x - np_dt64)))
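A hedged aside: the index's own searchsorted method does the same binary search without the bisect import. A minimal sketch under the same assumptions (sorted index, np_dt64 as the query):
pos = df.index.searchsorted(np_dt64)  # insertion point of the query
pos = min(pos, len(df.index) - 1)     # clamp to a valid position
# step back if the left-hand neighbour is closer to the query
if pos > 0 and abs(df.index[pos - 1] - np_dt64) <= abs(df.index[pos] - np_dt64):
    pos -= 1
row = df.iloc[pos]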

I know it's an old question, but while searching for the same problem as Bryan Fok, I landed here. So for future searchers getting here, I post my solution.
My index had 4 non-unique items (possibly due to rounding errors when recording the data). The following worked and showed the correct data:
dt = pd.to_datetime("2016-11-13 22:01:25.450")
s = df.loc[df.index.unique()[df.index.unique().get_loc(dt, method='nearest')]]
However, in case your nearest index occurs multiple times, this will return multiple rows. If you want to catch that, you could test for it with:
if len(s) != len(df.columns):
    # do what is appropriate for your case,
    # e.g. select only the first occurrence
    s = s.iloc[0]
Edit: fixed the check after some testing.
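A hedged alternative for the duplicated-index case: drop the duplicates before the nearest lookup, assuming it is acceptable to keep only the first row of each duplicated timestamp (dedup is my name, not from the thread, and the index is assumed sorted):
dedup = df[~df.index.duplicated(keep='first')]  # make the index unique
pos = dedup.index.get_indexer([dt], method='nearest')[0]
s = dedup.iloc[pos]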

Related

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates; unfortunately my import (using read_excel) brought in some of the dates as datetimes and others as Excel serial integers.
What I am seeking is a column with dates only, in the format %Y-%m-%d.
From research, Excel dates count from 1900-01-00, so I could add these integers as days. I have tried to use str.extract and a regex in order to separate the columns into two: one of datetimes, the other of integers. However, the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
An attempt to first separate the columns by extracting the integers (dates imported from MS Excel):
df.date_from.str.extract(r'(\d\d\d\d\d)')
however, this gives NaN.
The reason I have tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column (in other words, an error using the following code):
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time, 'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and Python the most frustrating of all to get to grips with!
Thanks.
You can convert the values to timedeltas with to_timedelta using errors='coerce' (which produces NaT for values that are not integers), add the Excel epoch Timestamp d, then convert the datetimes with errors='coerce', and finally combine both with Series.fillna in a custom function:
def f(x):
    # https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)  # Excel's day-zero epoch
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    dates = pd.to_datetime(x, errors='coerce')
    return (timedeltas + d).fillna(dates)

cols = ['date_from','date_to']
df[cols] = df[cols].apply(f)
print (df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16
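As a hedged sanity check on the epoch choice: d is 1899-12-30 rather than 1900-01-01 because Excel's serial-date scheme wrongly treats 1900 as a leap year, so a two-day offset compensates. Round-tripping the serial 44476 from the question's data reproduces the 2021-10-07 shown above:
assert pd.Timestamp(1899, 12, 30) + pd.to_timedelta(44476, unit='d') == pd.Timestamp('2021-10-07')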

pandas groupby keeping other columns

This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum, but I need to use rank (so the difference is like that between agg and transform).
I have data on firms over time. This generates some dummy data that looks like my use case:
import numpy as np
import pandas as pd
dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates, len(cols)),
                  index=np.tile(dates, nfirms),
                  columns=cols)
df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))
I need to calculate ranks of column E within each date (the index), but keeping column id.
If I just use groupby and .rank I get this:
df.groupby(level=0)['E'].rank()
1926-01-31 3226.0
1926-02-28 1042.0
1926-03-31 1611.0
1926-04-30 2591.0
1926-05-31 30.0
...
2019-08-31 1973.0
2019-09-30 227.0
2019-10-31 4381.0
2019-11-30 1654.0
2019-12-31 1572.0
Name: E, Length: 5640000, dtype: float64
This has the same dimension as df but I'm not sure it's safe to merge on the index — I really need to join on the id column also. Can I assume that the order remains the same?
If the order in the output is the same as in the input, I think I can do this:
df['ranks'] = df.groupby(level=0)['E'].rank()
But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.
(I'm also not clear if calling .rank() is equivalent to .transform('rank').)
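For what it's worth, a hedged sketch addressing both sub-questions on a tiny frame (all names here are mine): transform-like groupby results such as rank come back aligned with the original rows, so assigning by index is safe, and .rank() within a group does match .transform('rank'):
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2000-01-31'] * 3 + ['2000-02-29'] * 3)  # tiny stand-in for the dates index
small = pd.DataFrame({'id': [0, 1, 2, 0, 1, 2],
                      'E': np.random.randn(6)}, index=idx)

ranks = small.groupby(level=0)['E'].rank()
small['ranks'] = ranks  # the result carries the identical index, so alignment is exact
assert ranks.equals(small.groupby(level=0)['E'].transform('rank'))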

pandas resample when cumulative function returns data frame

I would like to use the resample function from pandas but apply my own custom function. The problem I'm facing is that the custom function returns a pandas DataFrame instead of a single array.
The following example illustrate my problem:
>>> import pandas as pd
>>> import numpy as np
>>> def f(data):
...     return ((1+data).cumprod(axis=0)-1)
...
>>> data = np.random.randn(1000, 3)
>>> index = pd.date_range("20170101", periods=1000, freq="B")
>>> df = pd.DataFrame(data=data, index=index)
Now suppose I want to resample the business days to business end month frequency:
>>> resampler = df.resample("BM")
If I now apply my function f, I don't get the desired result. I would like to get the last row of the output of f.
>>> resampler.apply(f)
This is because the cumprod in my function f returns a pandas DataFrame. I could write f such that it returns just the last row; however, I would like to use this function in other places as well to return the whole DataFrame. This could be solved by introducing a flag like "last_row" in the function f which controls whether to return the complete result or just the last row, but this solution seems rather nasty.
Just define your function f with a last_row parameter. You can default it to False so that it returns the entire DataFrame; when True, it returns the last row.
def f(data, last_row=False):
    df = ((1+data).cumprod(axis=0)-1)
    if last_row:
        return df.iloc[-1]
    return df
Get the last row
df.resample('BM').apply(f, last_row=True)
0 1 2
2017-01-31 0.185662 -0.580058 -1.004879
2017-02-28 -1.004035 -0.999878 17.059846
2017-03-31 -0.995280 -1.000001 -1.000507
2017-04-28 -1.000656 -240.369487 -1.002645
2017-05-31 47.646827 -72.042190 -1.000016
....
Return all the rows as you already did.
df.resample('BM').apply(f)
I think you could refactor in the following way, which will be much faster for larger dataframes:
(1+df).resample('BM').prod() - 1
0 1 2
2017-01-31 -0.999436 -1.259078 -1.000215
2017-02-28 -1.221404 0.342863 9.841939
2017-03-31 -0.820196 -1.002598 -0.450662
2017-04-28 -1.000299 2.739184 -1.035557
2017-05-31 -0.999986 -0.920445 -2.103289
That gives the same answer as @TedPetrou, although you can't tell because we used different random seeds; you can easily test this yourself. The reason prod() can replace cumprod() here is that, within each monthly bin, the last row of a cumulative product is simply the product of the whole bin, so taking prod() per bin matches taking the last row of f per bin.
For this relatively small dataframe with 1,000 rows, this way is only around twice as fast, but if you increase the rows you'll find this way scales much better (about 250x faster at 10,000 rows).
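A hedged way to verify the equivalence yourself, fixing the seed so both versions see the same data (a sketch, not from the original answer):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 3),
                  index=pd.date_range("20170101", periods=1000, freq="B"))

via_apply = df.resample('BM').apply(lambda x: ((1 + x).cumprod(axis=0) - 1).iloc[-1])
via_prod = (1 + df).resample('BM').prod() - 1
assert np.allclose(via_apply, via_prod)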
Alternative approaches: These give different answers from the above (and from each other) but I wonder if they might be closer to what you are looking for?
(1+df).resample('BM').mean().expanding().apply( lambda x: x.prod() - 1)
(1+df).expanding().apply( lambda x: x.prod() - 1).resample('BM').mean()

What's the Pandas way to write `if()` conditional between two `timeseries` columns?

My naive approach to Pandas Series needs some pointers. I have one Pandas DataFrame built from two joined tables. The left table had a timestamp column titled Time1 and the right had Time2; my new DataFrame has both.
At this step I'm comparing the two datetime columns using helper functions g() and f():
df['date_error'] = g(df['Time1'], df['Time2'])
The working helper function g() compares two datetime values:
def g(newer, older):
    value = newer > older
    return value
This gives me a column of (True, False) values. When I use the conditional in the helper function f(), I get an error because newer and older are Pandas Series:
def f(newer, older):
    if newer > older:
        delta = (newer - older)
    else:
        # arbitrarily large value to maintain col dtype
        delta = datetime.timedelta(minutes=1000)
    return delta
Ok. Fine. I know I'm not unpacking the Pandas Series correctly, because I can get this to work with the following monstrosity:
def f(newer, older):
    delta = []
    for (k, v), (k2, v2) in zip(newer.iteritems(), older.iteritems()):
        if v > v2:
            delta.append(v - v2)
        else:
            # arbitrarily large value to maintain col dtype
            delta.append(datetime.timedelta(minutes=1000))
    return pd.Series(delta)
What's the Pandas way to write a conditional between two DataFrame columns?
Usually where is the pandas equivalent of if:
df = pd.DataFrame([['1/1/01 11:00', '1/1/01 12:00'],
                   ['1/1/01 14:00', '1/1/01 13:00']],
                  columns=['Time1', 'Time2']).apply(pd.to_datetime)
(df.Time1 - df.Time2).where(df.Time1 > df.Time2)
0 NaT
1 01:00:00
dtype: timedelta64[ns]
If you don't want nulls in this column you could call fillna with a timedelta afterwards, e.g. fillna(pd.Timedelta(minutes=1000)); however, note that this datatype supports a native null value, NaT (not a time).
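A hedged one-liner that folds the else branch into where itself, using the question's arbitrary 1000-minute sentinel (where's second argument supplies the value used where the condition is False):
delta = (df.Time1 - df.Time2).where(df.Time1 > df.Time2,
                                    pd.Timedelta(minutes=1000))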

Trouble working with date indexes with Multi-Index

I am trying to understand how the date-related features of indexing in pandas work.
If I have this data frame:
dates = pd.date_range('6/1/2000', periods=12, freq='M')
df1 = pd.DataFrame(np.random.randn(12, 2), index=dates, columns=['A', 'B'])
I know that we can extract records from 2000 using df1['2000'] or a range of dates using df1['2000-09':'2001-03'].
But suppose instead I have a dataframe with a multi-index
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df2 = pd.DataFrame(np.random.randn(12, 2), index=index, columns=['C', 'D'])
Is there a way to extract rows with a year 2000 as we did with a single index? It appears that df2.xs('2000-06-30') works for accessing a particular date, but df2.xs('2000') does not return anything. Is xs not the right way to go about this?
You don't need to use xs for this, but you can index using .loc.
One of the examples you tried would then look like df2.loc['2000-09':'2001-03']. The only problem is that the 'partial string parsing' feature does not yet work with a MultiIndex, so you have to provide actual datetimes:
In [17]: df2.loc[pd.Timestamp('2000-09'):pd.Timestamp('2001-04')]
Out[17]:
C D
date id
2000-09-30 K -0.441505 0.364074
2000-10-31 H 2.366365 -0.404136
2000-11-30 I 0.371168 1.218779
2000-12-31 J -0.579180 0.026119
2001-01-31 K 0.450040 1.048433
2001-02-28 H 1.090321 1.676140
2001-03-31 I -0.272268 0.213227
But note that in this case pd.Timestamp('2001-03') would be interpreted as 2001-03-01 00:00:00 (an actual moment in time). Therefore, you have to adjust the start/stop values a little bit.
A selection for a full year (e.g. df1['2000']) would then become df2.loc[pd.Timestamp('2000'):pd.Timestamp('2001')] or df2.loc[pd.Timestamp('2000-01-01'):pd.Timestamp('2000-12-31')].
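If adjusting the endpoints feels error-prone, a hedged alternative (not from the original answer) is to filter on the date level directly:
# select all rows whose 'date' level falls in the year 2000
mask = df2.index.get_level_values('date').year == 2000
df2_2000 = df2[mask]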