Error: "not found in axis" when dropping list of index values - pandas

I am trying to drop two days every year from a dataframe with hourly values from 6am-8pm for dates from the 15.07 to 20.10. Therefore, I created a list with all the dates that should be dropped like this:
drop_list = []
for i in range(0,6):
    Year = 1999 + i
    drop_list.append(str(Year)+'-07-15')
    drop_list.append(str(Year)+'-07-16')
My data looks like this:
data
When I now call:
y = y.drop(drop_list)
I get:
KeyError: "['1999-07-15' '1999-07-16' '2000-07-15' '2000-07-16' '2001-07-15'\n '2001-07-16' '2002-07-15' '2002-07-16' '2003-07-15' '2003-07-16'\n '2004-07-15' '2004-07-16'] not found in axis"
Any suggestions what I am missing?

If you want to drop from a datetime index, you need to pass datetimes.
E.g.
>>> s = pd.Series([1,2,3], index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))
>>> s
2020-01-01 1
2020-01-02 2
2020-01-03 3
dtype: int64
>>> s.drop(pd.to_datetime(['2020-01-01']))
2020-01-02 2
2020-01-03 3
dtype: int64
In your case, y.drop(pd.to_datetime(drop_list)) might work. For the future, please see https://stackoverflow.com/help/how-to-ask
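Note that with hourly data from 6am to 8pm, pd.to_datetime(drop_list) produces midnight timestamps, which would probably still not match any label in the index. A minimal sketch of one alternative, assuming y is a Series with an hourly DatetimeIndex (the sample data here is made up for illustration): normalize the index to dates and filter with a boolean mask instead of calling drop.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: hourly values, 6am to 8pm
idx = pd.date_range("1999-07-15 06:00", "1999-07-17 20:00", freq=pd.Timedelta(hours=1))
idx = idx[(idx.hour >= 6) & (idx.hour <= 20)]
y = pd.Series(range(len(idx)), index=idx)

# Same dates as in the question: 15 and 16 July of 1999..2004
drop_list = [f"{year}-07-{day}" for year in range(1999, 2005) for day in (15, 16)]

# normalize() sets every timestamp to midnight, so whole days can be matched
mask = y.index.normalize().isin(pd.to_datetime(drop_list))
y_kept = y[~mask]
```

Unlike drop, the mask does not raise when some of the listed dates are absent from the index.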

Related

Slicing by date, using a variable start date

I am trying to slice according to a date column (which is calculated based on the index), and to cumulatively sum only from the Start Date beside it.
Here is a small sample code to copy/run:
import numpy
import pandas
data = pandas.DataFrame(
{"Bought" : [1,3,4,6]}, index=pandas.to_datetime(['01-01-2020','02-01-2020','03-01-2020','04-01-2020']))
data['StartDate'] = data.index
data['Cum bought2'] = data.loc[data['StartDate']:]['Bought'].cumsum()
It gives me the error "cannot do slice indexing on DatetimeIndex with these indexers".
If I change the data.loc[data['StartDate']:] to a set value (i.e. '02-01-2020'), then it works fine. But I want the start date to be variable and taken from another column.
Edit1: new example. This is close, but the 3rd row shouldn't calculate a value since the Start Date hasn't been reached yet.
import numpy
import pandas
data = pandas.DataFrame(
{"Bought" : [1,3,4,6]}, index=pandas.to_datetime(['01-01-2020','02-01-2020','03-01-2020','04-01-2020']))
data['StartDate'] = ['02-01-2020','02-01-2020','04-01-2020','04-01-2020']
data['Cum Bought'] = data.loc[data['StartDate'].iloc[0]:]['Bought'].cumsum()
Edit2: Also, any idea how to resolve if there are pandas.NaT in the Start Date? I don't want to delete those rows completely, just treat them as zero in calculations.
import numpy
import pandas
data = pandas.DataFrame(
{"Bought" : [1,3,4,6]}, index=pandas.to_datetime(['01-01-2020','02-01-2020','03-01-2020','04-01-2020']))
data['StartDate'] = [pandas.NaT,'02-01-2020','04-01-2020','04-01-2020']
data['Cum Bought'] = data.loc[data['StartDate'].iloc[0]:]['Bought'].cumsum()
You're trying to index with a Series as bound of a slice, which doesn't make sense. You need one value. data.loc[data['StartDate'].iloc[0]:] or data.loc[data['StartDate'].min():] would work.
In your case, you should probably just use:
data['Cum bought2'] = data['Bought'].cumsum()
Or if you're not sure that the dates are sorted:
data['Cum bought2'] = data['Bought'].sort_index().cumsum()
Output:
            Bought  StartDate  Cum bought2
2020-01-01       1 2020-01-01            1
2020-02-01       3 2020-02-01            4
2020-03-01       4 2020-03-01            8
2020-04-01       6 2020-04-01           14
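For the cases in Edit1 and Edit2, one possible reading is that a row should contribute to the running total only once its own start date has been reached, and that a missing start date counts as zero. A rough sketch under that assumption (the interpretation is a guess, not something the question confirms):

```python
import pandas as pd

data = pd.DataFrame(
    {"Bought": [1, 3, 4, 6]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
)
data["StartDate"] = pd.to_datetime(
    [pd.NaT, pd.Timestamp("2020-02-01"), pd.Timestamp("2020-04-01"), pd.Timestamp("2020-04-01")]
)

# True where the row's date is on or after its start date.
# Comparisons against NaT evaluate to False, so those rows are excluded.
started = data.index.to_series() >= data["StartDate"]

# Rows that have not started contribute 0 to the running total
data["Cum Bought"] = data["Bought"].where(started, 0).cumsum()
```

With this sample, only the second and fourth rows add to the total.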

Inconsistent output for pandas groupby-resample with missing values in first time bin

I am finding an inconsistent output with pandas groupby-resample behavior.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00']),
data={'category':['A','A','B']})
# Output:
# category
#2022-01-01 01:00:00 A
#2022-01-02 01:00:00 A
#2022-01-02 01:00:00 B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
#Output:
#category
#A 2022-01-01 1
# 2022-01-02 1
#B 2022-01-02 1
#dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00','2022-1-1 1:00']),
data={'category':['A','A','B','B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
# 2022-01-01 2022-01-02
# category
# A 1 1
# B 1 1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.
I submitted bug report 46826 to pandas.
The result should be a Series with a MultiIndex in both cases. There was a bug which caused df.groupby.resample.size to return a wide DF for cases in which all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
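If code has to run on both affected and fixed versions, one option is to normalize the result after the call. A small sketch, assuming the wide DataFrame has the time bins as columns as shown in the question:

```python
import pandas as pd

df2 = pd.DataFrame(
    index=pd.DatetimeIndex(
        ["2022-01-01 01:00", "2022-01-02 01:00", "2022-01-02 01:00", "2022-01-01 01:00"]
    ),
    data={"category": ["A", "A", "B", "B"]},
)

res = df2.groupby("category").resample("1D").size()

# On affected versions the result is a wide DataFrame; stacking the date
# columns restores the (category, time) MultiIndex layout.
if isinstance(res, pd.DataFrame):
    res = res.stack()
```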

Pandas Series: Decrement DateTime by 100 Years

I have a pandas series as follows...
0 2039-03-16
1 2056-01-21
2 2051-11-18
3 2064-03-05
4 2048-06-05
Name: BIRTH, dtype: datetime64
It was created from string data as follows
s = data['BIRTH']
s = pd.to_datetime(s)
s
I want to convert all dates after year 2040 to 1940
I can do this for a single record as follows
s.iloc[0].replace(year=s.iloc[0].year-100)
but I really want to just run it over the whole series. I can't work it out. Help!??
PS - I know there's ways outside of pandas using Python's DT module but I'd like to learn how to do this within Pandas please
Using DateOffset is the obvious choice here:
df['date'] - pd.offsets.DateOffset(years=100)
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Assign it back:
df['date'] -= pd.offsets.DateOffset(years=100)
df
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
We have the offsets module to deal with non-fixed frequencies; it comes in handy in situations like these.
To fix your code, you'd have wanted to apply datetime.replace rowwise using apply (not recommended):
df['date'].apply(lambda x: x.replace(year=x.year-100))
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Or using a list comprehension,
df.assign(date=[x.replace(year=x.year-100) for x in df['date']])
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Neither of these handles NaT entries very well.
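The question asks to shift only the dates after 2040, whereas the snippets above shift every row. A minimal sketch of a conditional version, assuming a datetime Series named s (the values, including the missing entry, are illustrative):

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2039-03-16", "2056-01-21", None], name="BIRTH"))

# Replace only the entries whose year is after 2040 with the shifted value.
# For NaT the year is NaN, the comparison is False, and the entry stays NaT.
shifted = s.mask(s.dt.year > 2040, s - pd.DateOffset(years=100))
```

Here the 2039 date is left alone, the 2056 date becomes 1956, and the missing entry stays missing.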

Problems getting two columns into datetime.datetime format

I have code at the moment written to change two columns of my dataframe from strings into datetime.datetime objects similar to the following:
from datetime import datetime as dt
import pandas as pd
def converter(date):
    # Parse a string such as '12/31/9999 0:00:00' into a datetime.datetime
    return dt.strptime(date, '%m/%d/%Y %H:%M:%S')
df = pd.DataFrame({'A':['12/31/9999 0:00:00','1/1/2018 0:00:00'],
                   'B':['4/1/2015 0:00:00','11/1/2014 0:00:00']})
df['A'] = df['A'].apply(converter)
df['B'] = df['B'].apply(converter)
When I run this code and print the dataframe, it comes out like this
A B
0 9999-12-31 00:00:00 2015-04-01
1 2018-01-01 00:00:00 2014-11-01
When I checked the data types of each column, they read
A object
B datetime64[ns]
But when I check the format of the actual cells of the first row, they read
<class 'datetime.datetime'>
<class 'pandas._libs.tslib.Timestamp'>
After experimenting around, I think I've run into an out of bounds error because of the date '12/31/9999 0:00:00' in column 'A' and this is causing this column to be cast as a datetime.datetime object. My question is how I can also convert column 'B' of my dataframe to a datetime.datetime object so that I can run a query on the columns similar to
df.query('A > B')
without getting an error or the wrong output.
Thanks in advance
Since '9999' is just some dummy year, you can simplify your life by choosing a dummy year which is in bounds (or one that makes more sense given your actual data):
import pandas as pd
df.replace('9999', '2060', regex=True).apply(pd.to_datetime)
Output:
A B
0 2060-12-31 2015-04-01
1 2018-01-01 2014-11-01
A datetime64[ns]
B datetime64[ns]
dtype: object
As @coldspeed points out, it's perhaps better to remove those bad dates:
df.apply(pd.to_datetime, errors='coerce')
# A B
#0 NaT 2015-04-01
#1 2018-01-01 2014-11-01
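Once both columns share the datetime64 dtype, the query from the question should run as expected. A self-contained sketch using the dummy-year replacement, where 2060 is an arbitrary in-bounds stand-in and the explicit format string is taken from the question's converter:

```python
import pandas as pd

df = pd.DataFrame({'A': ['12/31/9999 0:00:00', '1/1/2018 0:00:00'],
                   'B': ['4/1/2015 0:00:00', '11/1/2014 0:00:00']})

# Swap the out-of-bounds dummy year for an in-bounds one, then parse each column
parsed = (df.replace('9999', '2060', regex=True)
            .apply(pd.to_datetime, format='%m/%d/%Y %H:%M:%S'))

# Both columns are now datetime64, so they can be compared directly
later = parsed.query('A > B')
```

In this sample both rows satisfy A > B, so the query keeps both.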

detecting jumps on pandas index dates

I managed to load historical data on data series on a large set of financial instruments, indexed by date.
I am plotting volume and price information without any issue.
What I want to achieve now is to determine whether there is any big jump in dates, to see if I am missing large chunks of data.
The idea I had in mind was to plot the difference between two consecutive dates in the index; if the number is greater than 3 or 4 (which is longer than a weekend plus a bank holiday on a Friday or Monday), then there is an issue.
The problem is that I can't figure out how to simply compute df[next day]-df[day], where df is indexed by day.
You can use the shift Series method (note the DatetimeIndex method shifts by freq):
In [11]: rng = pd.DatetimeIndex(['20120101', '20120102', '20120106']) # DatetimeIndex like df.index
In [12]: s = pd.Series(rng) # df.index instead of rng
In [13]: s - s.shift()
Out[13]:
0 NaT
1 1 days, 00:00:00
2 4 days, 00:00:00
dtype: timedelta64[ns]
In [14]: s - s.shift() > pd.Timedelta(days=3)
Out[14]:
0 False
1 False
2 True
dtype: bool
Depending on what you want, perhaps you could either do any, or find the problematic values...
In [15]: (s - s.shift() > pd.Timedelta(days=3)).any()
Out[15]: True
In [16]: s[s - s.shift() > pd.Timedelta(days=3)]
Out[16]:
2 2012-01-06 00:00:00
dtype: datetime64[ns]
Or perhaps find the maximum jump (and where it is):
In [17]: (s - s.shift()).max() # it's weird this returns a Series...
Out[17]:
0 4 days, 00:00:00
dtype: timedelta64[ns]
In [18]: (s - s.shift()).idxmax()
Out[18]: 2
If you really wanted to plot this, simply plotting the difference would work:
(s - s.shift()).plot()
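An equivalent way to sketch the check is to call diff on the index converted to a Series, which keeps the dates themselves as labels. This uses the same three sample dates as above; the 3-day threshold is the one suggested in the question:

```python
import pandas as pd

idx = pd.DatetimeIndex(["2012-01-01", "2012-01-02", "2012-01-06"])  # stands in for df.index

# Gap between each date and the previous one; the first entry is NaT
gaps = idx.to_series().diff()

# Keep only the gaps longer than the allowed threshold
threshold = pd.Timedelta(days=3)
jumps = gaps[gaps > threshold]

largest_gap = gaps.max()        # size of the largest gap
largest_gap_at = gaps.idxmax()  # date at which the largest gap ends
```

Because the result is labeled by date, jumps shows directly where each missing chunk ends.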