Trouble working with date indexes with Multi-Index - pandas

I am trying to understand how the date-related features of indexing in pandas work.
If I have this data frame:
import numpy as np
import pandas as pd

dates = pd.date_range('6/1/2000', periods=12, freq='M')
df1 = pd.DataFrame(np.random.randn(12, 2), index=dates, columns=['A', 'B'])
I know that we can extract records from 2000 using df1['2000'] or a range of dates using df1['2000-09':'2001-03'].
But suppose instead I have a dataframe with a multi-index:
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df2 = pd.DataFrame(np.random.randn(12, 2), index=index, columns=['C', 'D'])
Is there a way to extract rows with a year 2000 as we did with a single index? It appears that df2.xs('2000-06-30') works for accessing a particular date, but df2.xs('2000') does not return anything. Is xs not the right way to go about this?

You don't need xs for this; you can index using .loc.
One of the examples you tried would then look like df2.loc['2000-09':'2001-03']. The only problem is that the 'partial string parsing' feature does not yet work with a multi-index, so you have to provide actual datetimes:
In [17]: df2.loc[pd.Timestamp('2000-09'):pd.Timestamp('2001-04')]
Out[17]:
                     C         D
date       id
2000-09-30 K -0.441505  0.364074
2000-10-31 H  2.366365 -0.404136
2000-11-30 I  0.371168  1.218779
2000-12-31 J -0.579180  0.026119
2001-01-31 K  0.450040  1.048433
2001-02-28 H  1.090321  1.676140
2001-03-31 I -0.272268  0.213227
But note that in this case pd.Timestamp('2001-03') would be interpreted as 2001-03-01 00:00:00 (an actual moment in time), not as the whole month. Therefore, you have to adjust the start/stop values a little bit, as with the pd.Timestamp('2001-04') above.
A selection for a full year (eg df1['2000']) would then become df2.loc[pd.Timestamp('2000'):pd.Timestamp('2001')] or df2.loc[pd.Timestamp('2000-01-01'):pd.Timestamp('2000-12-31')]. (pd.Timestamp('2001') parses as 2001-01-01 00:00:00; since the dates here are all month-ends, that upper bound neither clips nor adds a row.)
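If you'd rather not construct explicit Timestamps, a boolean mask built from the 'date' level of the index also works. A minimal sketch, using the df2 from the question:
# get_level_values('date') returns the DatetimeIndex of that level,
# so its .year attribute sidesteps partial string parsing entirely
mask = df2.index.get_level_values('date').year == 2000
df2[mask]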

Related

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates; unfortunately my import (using read_excel) brought some dates in as datetimes and others as Excel serial integers.
What I am seeking is a column with dates only, in the format %Y-%m-%d.
From research, Excel starts at 1900-01-00, so I could add these integers to that date. I have tried to use str.extract and a regex in order to separate the columns in two: one of datetimes, the other of integers. However, the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
An attempt to first separate the columns by extracting the integers (dates imported from MS Excel):
df.date_from.str.extract(r'(\d\d\d\d\d)')
However, this gives NaN.
The reason I have tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column (in other words, an error using the following code):
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time, 'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
You can convert the values to timedeltas with to_timedelta and errors='coerce' (non-integers become NaT), add the Excel epoch Timestamp d, then convert the values to datetimes with errors='coerce', and finally combine both results with Series.fillna, all in a custom function:
def f(x):
    # Excel epoch per https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)
    # integers become day offsets; datetimes become NaT
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    # datetimes pass through unchanged
    dates = pd.to_datetime(x, errors='coerce')
    # Excel serials where available, real datetimes elsewhere
    return (timedeltas + d).fillna(dates)

cols = ['date_from', 'date_to']
df[cols] = df[cols].apply(f)
print(df)
   date_from    date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16
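The epoch is 1899-12-30 rather than 1900-01-01 because Excel's serial dates treat 1900 as a leap year, which it was not (see the linked answer). A quick sanity check on one serial from the example, assuming that epoch:
# serial 44476 should land on 2021-10-07, matching row 1 of the output above
pd.Timestamp(1899, 12, 30) + pd.to_timedelta(44476, unit='d')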

pandas groupby keeping other columns

This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum, but I need to use rank (so the difference is like that between agg and transform).
I have data on firms over time. This generates some dummy data that looks like my use case:
import numpy as np
import pandas as pd
dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms * ndates, len(cols)),
                  index=np.tile(dates, nfirms),
                  columns=cols)
df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))
I need to calculate ranks of column E within each date (the index), but keeping column id.
If I just use groupby and .rank I get this:
df.groupby(level=0)['E'].rank()
1926-01-31 3226.0
1926-02-28 1042.0
1926-03-31 1611.0
1926-04-30 2591.0
1926-05-31 30.0
...
2019-08-31 1973.0
2019-09-30 227.0
2019-10-31 4381.0
2019-11-30 1654.0
2019-12-31 1572.0
Name: E, Length: 5640000, dtype: float64
This has the same dimension as df, but I'm not sure it's safe to merge on the index alone; I really need to join on the id column also. Can I assume that the order remains the same?
If the order in the output is the same as in the input, I think I can do this:
df['ranks'] = df.groupby(level=0)['E'].rank()
But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.
(I'm also not clear if calling .rank() is equivalent to .transform('rank').)
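For what it's worth, groupby rank behaves like a transform: the result is aligned on the original index and preserves the original row order, so assigning it straight back is safe. A small sketch to check both points, using the df built above:
# rank() on a groupby and transform('rank') give identical, input-ordered results
r1 = df.groupby(level=0)['E'].rank()
r2 = df.groupby(level=0)['E'].transform('rank')
assert r1.equals(r2)
df['ranks'] = r1  # id and the other columns stay untouched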

pandas PeriodIndex, select 12 months of data based on last period

I have a large table of data, indexed with periods 2017-4 through 2019-3. What's the best way to get two 12-month slices of the data?
I'm basically trying to find the correct way to select df['2018-4':'2019-3'] and df['2017-4':'2018-3'] without manually typing in the slices.
Play data:
np.random.seed(0)
ind = pd.period_range(start='2017-4', end='2019-3', freq='M')
df = pd.DataFrame(np.random.randint(0, 100, (len(ind), 2)), columns=['A', 'B'], index=ind)
df.head()
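One way to derive both slices from the last period itself, rather than typing them. A minimal sketch; subtracting an integer from a Period shifts it by that many months because freq='M':
last = df.index[-1]                 # Period('2019-03', 'M')
recent = df.loc[last - 11:last]     # 2018-4 through 2019-3
prior = df.loc[last - 23:last - 12] # 2017-4 through 2018-3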

Pandas DataFrame How to query the closest datetime index?

How do I query for the closest index in a pandas DataFrame? The index is a DatetimeIndex:
2016-11-13 20:00:10.617989120 7.0 132.0
2016-11-13 22:00:00.022737152 1.0 128.0
2016-11-13 22:00:28.417561344 1.0 132.0
I tried this:
df.index.get_loc(df.index[0], method='nearest')
but it gives me InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I get the same error if I try this:
dt = datetime.datetime.strptime("2016-11-13 22:01:25", "%Y-%m-%d %H:%M:%S")
df.index.get_loc(dt, method='nearest')
But if I remove method='nearest' it works. That is not what I want, though: I want to find the closest index to my query datetime.
It seems you first need to get the position with get_loc and then select with []:
dt = pd.to_datetime("2016-11-13 22:01:25.450")
print (dt)
2016-11-13 22:01:25.450000
print (df.index.get_loc(dt, method='nearest'))
2
idx = df.index[df.index.get_loc(dt, method='nearest')]
print (idx)
2016-11-13 22:00:28.417561344
#if need select row to Series use iloc
s = df.iloc[df.index.get_loc(dt, method='nearest')]
print (s)
b 1.0
c 132.0
Name: 2016-11-13 22:00:28.417561344, dtype: float64
The method argument of DatetimeIndex.get_loc is now deprecated in favour of DatetimeIndex.get_indexer...
ts = pd.to_datetime('2022-05-26 13:19:48.154000') # example time
iloc_idx = df.index.get_indexer([ts], method='nearest') # returns absolute index into df e.g. array([5])
loc_idx = df.index[iloc_idx] # if you want named index
my_val = df.iloc[iloc_idx]
my_val = df.loc[loc_idx] # as above so below...
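Note that get_indexer returns an array, so df.iloc[iloc_idx] is a one-row DataFrame rather than a Series. If you want the row as a Series, index with the scalar instead (a small usage note, assuming the same df and ts):
my_val = df.iloc[iloc_idx[0]]  # Series for the single nearest row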
I believe jezrael's solution works, but not on my dataframe (and I have no clue why). This is the solution that I came up with:
from bisect import bisect  # treat the sorted index as a bisectable container
# np_dt64 is the query time as a numpy.datetime64
timestamps = np.array(df.index)
upper_index = bisect(timestamps, np_dt64, hi=len(timestamps) - 1)  # index just above the query
# pick whichever of the upper and lower neighbours is closest
df_index = df.index.get_loc(min(timestamps[upper_index], timestamps[upper_index - 1],
                                key=lambda x: abs(x - np_dt64)))
I know it's an old question, but while searching for the same problems as Bryan Fok, I landed here. So for future searchers getting here, I post my solution.
My index had 4 non-unique items (possibly due to rounding errors when recording the data). The following worked and showed the correct data:
dt = pd.to_datetime("2016-11-13 22:01:25.450")
s = df.loc[df.index.unique()[df.index.unique().get_loc(dt, method='nearest')]]
However, in case your nearest index occurs multiple times, this will return multiple rows. If you want to catch that, you could test for it with:
if len(s) != len(df.columns):
    # do what is appropriate for your case,
    # e.g. select only the first occurrence
    s = s.iloc[0]
Edit: fixed the catching after some tests.

Using slicers with MultiIndex in pandas

I am trying to use a slicer on a pandas dataframe with a MultiIndex:
dates = pd.date_range('6/30/2000', periods=12, freq='M')
index = MultiIndex.from_arrays([dates, list('HIJKHIJKHIJKHIJK')], names=['date', 'id'])
df = DataFrame(randn(12, 4), index=index, columns=['A', 'B', 'C', 'D'])
I would like to get the rows where id='H'. From this comment in a related question I asked, and from reading the documentation, I thought this:
df.loc[(slice(None), 'H'),:]
or perhaps this:
df.loc[(slice(None), ['H']),:]
would work. The first one returns this error:
IndexError: index 12 is out of bounds for axis 1 with size 12
and the second one gives this error:
IndexError: indices are out-of-bounds
From looking at other questions, I thought perhaps I need to sort by the 2nd-level index before trying to slice. I'm not really sure what I'm doing here, but I tried to use df.sort_index() but am having trouble with the syntax. I'm also not sure whether that is even the issue here.
You've got a problem in your line
index = MultiIndex.from_arrays([dates, list('HIJKHIJKHIJKHIJK')], names=['date', 'id'])
The arrays should be the same length. This not raising is probably a bug.
If you fix that then either of
In [58]: df.xs('H', level='id')
Out[58]:
                   A         B         C         D
date
2000-06-30 -0.645203  0.965441  0.150037 -0.083979
2000-10-31 -1.222954  0.498284 -1.249005 -1.664407
2001-02-28 -0.941248  2.025381  0.450256  1.182266
In [59]: df.loc[(slice(None), 'H'), :]
Out[59]:
                      A         B         C         D
date       id
2000-06-30 H  -0.645203  0.965441  0.150037 -0.083979
2000-10-31 H  -1.222954  0.498284 -1.249005 -1.664407
2001-02-28 H  -0.941248  2.025381  0.450256  1.182266
should work, depending on whether you'd like to drop the id level.
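As a more readable alternative to the slice(None) tuple, pandas also provides pd.IndexSlice; a small sketch equivalent to the .loc call above:
idx = pd.IndexSlice
df.loc[idx[:, 'H'], :]
If slicing a MultiIndex raises UnsortedIndexError or emits a PerformanceWarning, sort it first with df = df.sort_index().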