Pandas -- get dates closest to nth day of month - pandas

This Python code identifies the rows where the day of the month equals 5. For a month that does not contain day 5, e.g. because it falls on a weekend or holiday, I want the mask to be True for the earlier date that is closest to day 5. I could write a loop to identify such dates, but is there an array formula to do this?
import pandas as pd
infile = "dates.csv"
df = pd.read_csv(infile)
dtimes = pd.to_datetime(df.iloc[:,0])  # parse the first column as datetimes
mask = (dtimes.dt.day == 5)            # True where the day of month equals 5

For test purposes, I created the following DataFrame (with a single column):
xxx
0 2022-11-03
1 2022-11-04
2 2022-11-07
3 2022-12-02
4 2022-12-05
5 2022-12-06
6 2023-01-04
7 2023-01-05
8 2023-01-06
9 2023-02-02
10 2023-02-03
11 2023-02-06
12 2023-02-07
13 2023-04-02
14 2023-04-05
15 2023-04-06
Because I based my solution on the groupby method, I created dtimes
as a Series whose index is equal to its values:
wrk = pd.to_datetime(df.iloc[:,0])
dtimes = pd.Series(wrk.values, index=wrk)
Then, to find the valid date within the current group of dates
(a single month), I defined the following function:
def findDate(grp):
    if grp.size == 0:
        return None
    dd = grp.dt.day
    if dd.eq(5).any():
        dd = dd[dd.eq(5)]   # the 5th is present in this month
    else:
        dd = dd[dd.lt(5)]   # otherwise keep only days before the 5th
    return dd.index[-1]     # the latest qualifying date in the group
To find valid dates, for "existing" months, run:
validDates = dtimes.groupby(pd.Grouper(freq='M')).apply(findDate).dropna()
The result is:
xxx
2022-11-30 2022-11-04
2022-12-31 2022-12-05
2023-01-31 2023-01-05
2023-02-28 2023-02-03
2023-04-30 2023-04-05
dtype: datetime64[ns]
And to create your mask, run:
mask = dtimes.isin(validDates).values
To see the filtered rows, run:
df[mask]
getting:
xxx
1 2022-11-04
4 2022-12-05
7 2023-01-05
10 2023-02-03
14 2023-04-05
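For convenience, here are the pieces above combined into one self-contained sketch. The inline DataFrame stands in for dates.csv (assuming a single date column named 'xxx', as in the test data); note that on newer pandas (roughly 2.2+) the month-end frequency alias 'M' is deprecated in favour of 'ME'.
import pandas as pd

# Inline sample standing in for dates.csv (assumption: one date column named 'xxx').
df = pd.DataFrame({"xxx": [
    "2022-11-03", "2022-11-04", "2022-11-07",
    "2022-12-02", "2022-12-05", "2022-12-06",
    "2023-01-04", "2023-01-05", "2023-01-06",
    "2023-02-02", "2023-02-03", "2023-02-06", "2023-02-07",
    "2023-04-02", "2023-04-05", "2023-04-06",
]})

wrk = pd.to_datetime(df.iloc[:, 0])
dtimes = pd.Series(wrk.values, index=wrk)   # values indexed by themselves

def findDate(grp):
    if grp.size == 0:
        return None
    dd = grp.dt.day
    if dd.eq(5).any():
        dd = dd[dd.eq(5)]   # the 5th is present in this month
    else:
        dd = dd[dd.lt(5)]   # otherwise the latest day before the 5th
    return dd.index[-1]

# 'M' groups by month end; on pandas >= 2.2 use freq='ME' instead.
validDates = dtimes.groupby(pd.Grouper(freq="M")).apply(findDate).dropna()
mask = dtimes.isin(validDates).values
print(df[mask])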

Related

Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples online only ever seem to show examples with ignore_index=True, and in my case I want to keep proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (days) and the number of periods is 6.
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index, adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with wonky duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index() # resets index
I also needed this and solved it by combining the code you shared with the code from this other answer (add to a dataframe as I go with datetime index), ending up with the following code, which works for me.
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here 'value' is the column header of your own dataframe.
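Note that DataFrame.append, used in both snippets above, was deprecated and then removed in pandas 2.0, and the closed= argument of date_range was replaced by inclusive=. A minimal sketch of the equivalent call on current versions (assuming the df with the 'date' column from above):
# pandas >= 2.0: pd.concat replaces the removed DataFrame.append,
# and inclusive='right' replaces the removed closed='right'.
future = pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1],
                                             periods=6, freq='D',
                                             inclusive='right')})
df = pd.concat([df, future], ignore_index=True)  # avoids duplicate integer index values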

Pandas dataframe diff except some rows?

df
end_date dt_eps
0 20200930 0.9625
1 20200630 0.5200
2 20200331 0.2130
3 20191231 1.2700
4 20190930 -0.1017
5 20190630 -0.1058
6 20190331 0.0021
7 20181231 0.0100
Note: the value of end_date is always the last day of a calendar quarter, the rows are sorted from most recent to oldest, and the column type is string.
Goal
Create a q_dt_eps column: the diff of dt_eps with the nearest earlier end_date, except that it equals dt_eps itself when the quarter is Q1. For example, q_dt_eps for 20200930 is 0.4425 (0.9625 - 0.5200), while for 20200331 (a Q1 date) it is simply 0.2130.
Try
df['q_dt_eps']=df['dt_eps'].diff(periods=-1)
But this does not keep the original dt_eps value when the quarter is Q1.
You can just convert the date to datetime, extract the quarter, and then create your new column using np.where: keep the original value when the quarter equals 1, otherwise use the difference to the next row (the previous quarter).
import numpy as np
import pandas as pd
df = pd.DataFrame({'end_date': ['20200930', '20200630', '20200331',
                                '20191231', '20190930', '20190630',
                                '20190331', '20181231'],
                   'dt_eps': [0.9625, 0.52, 0.213, 1.27, -.1017, -.1058, .0021, .01]})
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y%m%d')
df['qtr'] = df['end_date'].dt.quarter
df['q_dt_eps'] = np.where(df['qtr']==1, df['dt_eps'], df['dt_eps'].diff(-1))
df
end_date dt_eps qtr q_dt_eps
0 2020-09-30 0.9625 3 0.4425
1 2020-06-30 0.5200 2 0.3070
2 2020-03-31 0.2130 1 0.2130
3 2019-12-31 1.2700 4 1.3717
4 2019-09-30 -0.1017 3 0.0041
5 2019-06-30 -0.1058 2 -0.1079
6 2019-03-31 0.0021 1 0.0021
7 2018-12-31 0.0100 4 NaN
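The same result can also be written without the helper 'qtr' column, using Series.where, which keeps the diff wherever the condition holds and falls back to the original dt_eps on Q1 rows. A small sketch under the same setup (end_date already converted to datetime):
# Keep diff(-1) for Q2-Q4, fall back to the raw dt_eps where the quarter is Q1.
q1 = df['end_date'].dt.quarter == 1
df['q_dt_eps'] = df['dt_eps'].diff(-1).where(~q1, df['dt_eps'])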

How could I remove duplicates if duplicates mean less than 30 days?

(using sql or pandas)
I want to delete records if the date difference between two records is less than 30 days.
But the first record of each ID must be retained.
#example
ROW ID DATE
1 A 2020-01-01 -- first
2 A 2020-01-03
3 A 2020-01-31
4 A 2020-02-05
5 A 2020-02-28
6 A 2020-03-09
7 B 2020-03-06 -- first
8 B 2020-05-07
9 B 2020-06-02
#expected results
ROW ID DATE
1 A 2020-01-01
4 A 2020-02-05
6 A 2020-03-09
7 B 2020-03-06
8 B 2020-05-07
ROW 2,3 are within 30 days from ROW 1
ROW 5 is within 30 days from ROW 4
ROW 9 is within 30 days from ROW 8
This task cannot be handled with purely vectorized methods. The reason is that once a row is recognized as a duplicate, it "does not count" when checking further rows. E.g. after the rows 2020-01-03 and 2020-01-31 have been deleted (as "too close" to the previous kept row), the 2020-02-05 row should be left, because the distance to the previous kept row (2020-01-01) is now big enough.
So I came up with a solution based on a "function with memory":
def isDupl(elem):
    if isDupl.prev is None:
        isDupl.prev = elem        # first date in the group: keep it
        return False
    dDiff = (elem - isDupl.prev).days
    rv = dDiff <= 30              # duplicate if within 30 days of the last kept date
    if not rv:
        isDupl.prev = elem        # remember the newly kept date
    return rv
This function should be invoked for each DATE in the current group (rows with the same ID), but before that isDupl.prev must be set to None.
So the function to apply to each group of rows is:
def isDuplGrp(grp):
    isDupl.prev = None            # reset the "memory" for each new ID
    return grp.DATE.apply(isDupl)
And to get the expected result, run:
df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
(you may save it back to df).
The result is:
ROW ID DATE
0 1 A 2020-01-01
3 4 A 2020-02-05
5 6 A 2020-03-09
6 7 B 2020-03-06
7 8 B 2020-05-07
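The same sequential rule can also be written as an explicit loop per group, which avoids the function attribute acting as shared state. A sketch, assuming DATE has already been converted with pd.to_datetime:
def keepMask(grp):
    keep, prev = [], None
    for d in grp['DATE']:
        if prev is None or (d - prev).days > 30:
            keep.append(True)    # far enough from the last kept date
            prev = d
        else:
            keep.append(False)   # within 30 days of the last kept date
    return pd.Series(keep, index=grp.index)

result = df[df.groupby('ID', group_keys=False).apply(keepMask)]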
And finally, a remark about the other solution:
It contains rows:
3 4 A 2020-02-05
4 5 A 2020-02-28
which are only 23 days apart, so this solution is wrong.
The same pertains to rows:
5 A 2020-02-28
6 A 2020-03-09
which are also too close in time.
You can try this:
Convert date to datetime64
Get the first date of each group with df.groupby('ID')['DATE'].transform('first')
Add a filter to keep only dates at least 30 days after that first date
Append the first date of each group to the dataframe
Code:
df['DATE'] = pd.to_datetime(df['DATE'])
df1 = df[(df['DATE'] - df.groupby('ID')['DATE'].transform('first')) >= pd.Timedelta(30, unit='D')]
df1 = df1.append(df.groupby('ID', as_index=False).agg('first')).sort_values(by=['ID', 'DATE'])
print(df1)
ROW ID DATE
0 1 A 2020-01-01
2 3 A 2020-01-31
3 4 A 2020-02-05
4 5 A 2020-02-28
5 6 A 2020-03-09
1 7 B 2020-03-06
7 8 B 2020-05-07
8 9 B 2020-06-02

pandas to_datetime convert 6PM to 18

Is there a nice way to convert Series data represented like 1PM or 11AM to 13 and 11 respectively, with to_datetime or similar (other than re)?
data:
series
1PM
11AM
2PM
6PM
6AM
desired output:
series
13
11
14
18
6
pd.to_datetime(df['series']) gives the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 11:00:00
You can provide the format you want to use, here %I%p:
pd.to_datetime(df['series'], format='%I%p').dt.hour
The .dt.hour accessor will thus obtain the hour for that timestamp. This gives us:
>>> df = pd.DataFrame({'series': ['1PM', '11AM', '2PM', '6PM', '6AM']})
>>> pd.to_datetime(df['series'], format='%I%p').dt.hour
0 13
1 11
2 14
3 18
4 6
Name: series, dtype: int64
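To match the desired output exactly, the hours can simply be written back over the original column (a small follow-up to the snippet above):
# Overwrite the text column with the integer hour values.
df['series'] = pd.to_datetime(df['series'], format='%I%p').dt.hour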

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first index level goes from 0 to 23 (the hour) and the second from 0 to 55 in steps of 5 (the minute). In other words, we have daily 5-minute increment data.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output adds a new column called '5min_ret' to the original dataframe, in which each row gets the value for the correct hour/5-minute pair from the smaller dataframe containing 5min_Ret:
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to use merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex, like so:
ES['min'] = ES.index.minute
Now you can merge with your multiindex DF containing the column '5min_Ret' that I named df_multi such as:
ES = ES.merge(df_multi.reset_index(), left_on=['Hour', 'min'],
              right_on=['level_0', 'level_1'], how='left')
Here you merge on 'Hour' and 'min' from ES with 'level_0' and 'level_1', which are created from the MultiIndex of df_multi when you call reset_index; how='left' keeps every row of the left df (ES).
You should get a new column in ES named '5min_Ret' with the values you are looking for. You can drop the column 'min' if you don't need it anymore with ES = ES.drop('min', axis=1).
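Putting this together on toy data (a sketch: the shapes and names mirror the question, the numbers are made up), and using reset_index/set_index so the Date index survives the merge:
import numpy as np
import pandas as pd

# Toy stand-in for ES: 5-minute bars indexed by Date, with an 'Hour' column.
idx = pd.date_range('2006-01-03 08:30', periods=5, freq='5min')
ES = pd.DataFrame({'Price': [1260.58, 1261.29, 1261.13, 1260.96, 1261.21],
                   'Hour': idx.hour}, index=idx)
ES.index.name = 'Date'

# Toy stand-in for the MultiIndexed frame: (hour, 5-minute slot) -> 5min_Ret.
mi = pd.MultiIndex.from_product([range(24), range(0, 60, 5)])
df_multi = pd.DataFrame({'5min_Ret': np.random.randn(len(mi)) * 1e-5}, index=mi)

# reset_index() turns the unnamed MultiIndex levels into 'level_0'/'level_1'.
ES['min'] = ES.index.minute
out = (ES.reset_index()
         .merge(df_multi.reset_index(),
                left_on=['Hour', 'min'], right_on=['level_0', 'level_1'],
                how='left')
         .set_index('Date')
         .drop(columns=['level_0', 'level_1', 'min']))
print(out)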