pandas reindex fill in missing dates

I have a dataframe with an index of dates. Each date is the first of the month. I want to fill in all missing dates in the index at a daily level.
I thought this should work:
daily=pd.date_range('2016-01-01', '2018-01-01', freq='D')
df=df.reindex(daily)
But it returns NaN in rows that should have data (the first-of-the-month dates). Can anyone see the issue?

Use reindex with the parameter method='ffill', or resample with ffill for a more general solution that does not require creating a new index with date_range:
df = pd.DataFrame({'a': range(13)},
                  index=pd.date_range('2016-01-01', '2017-01-01', freq='MS'))
print (df)
a
2016-01-01 0
2016-02-01 1
2016-03-01 2
2016-04-01 3
2016-05-01 4
2016-06-01 5
2016-07-01 6
2016-08-01 7
2016-09-01 8
2016-10-01 9
2016-11-01 10
2016-12-01 11
2017-01-01 12
daily=pd.date_range('2016-01-01', '2018-01-01', freq='D')
df1 = df.reindex(daily, method='ffill')
Another solution:
df1 = df.resample('D').ffill()
print (df1.head())
a
2016-01-01 0
2016-01-02 0
2016-01-03 0
2016-01-04 0
2016-01-05 0
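One difference between the two is worth noting: resample only fills out to the last existing index value, while reindex fills the entire range you pass in. A quick check with the df above:
df1 = df.reindex(daily, method='ffill')
df2 = df.resample('D').ffill()
print (df1.index.max())   # 2018-01-01: extends to the end of daily
print (df2.index.max())   # 2017-01-01: stops at the last original index date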

Related

Get the value in a dataframe based on a value and a date in another dataframe

I tried countless answers to similar problems here on SO but couldn't find anything that works for this scenario. It's driving me nuts.
I have these two Dataframes:
df_op:
index  Date                 Close   Name  LogRet
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059
1      2022-11-29 00:00:00  280.57  QQQ   -0.0076
2      2022-12-13 00:00:00  342.46  ADBE   0.0126
3      2022-12-13 00:00:00  256.92  MSFT   0.0173
df_quotes:
index  Date                 Close   Name
72     2022-11-29 00:00:00  141.17  AAPL
196    2022-11-29 00:00:00  240.33  MSFT
73     2022-11-30 00:00:00  148.03  AAPL
197    2022-11-30 00:00:00  255.14  MSFT
11     2022-11-30 00:00:00  293.36  QQQ
136    2022-12-01 00:00:00  344.11  ADBE
198    2022-12-01 00:00:00  254.69  MSFT
12     2022-12-02 00:00:00  293.72  QQQ
I would like to add a column to df_op that indicates the close of the stock in df_quotes 2 days later. For example, the first row of df_op should become:
index  Date                 Close   Name  LogRet   Next
0      2022-11-29 00:00:00  240.33  MSFT  -0.0059  254.69
In other words:
for each row in df_op, find the corresponding Name in df_quotes with a Date 2 days later, and copy its Close to df_op in a new column 'Next'.
I tried tens of combinations like this without success:
df_quotes[df_quotes['Date'].isin(df_op['Date'] + pd.DateOffset(days=2)) & df_quotes['Name'].isin(df_op['Name'])]
How can I do this without resorting to loops?
Try this:
# first convert to datetime
df_op['Date'] = pd.to_datetime(df_op['Date'])
df_quotes['Date'] = pd.to_datetime(df_quotes['Date'])
# merge on Date and Name, but the quote date is offset back 2 business days
(pd.merge(df_op,
          df_quotes[['Date', 'Close', 'Name']].rename({'Close': 'Next'}, axis=1),
          left_on=['Date', 'Name'],
          right_on=[df_quotes['Date'] - pd.tseries.offsets.BDay(2), 'Name'],
          how='left')
   .drop(['Date_x', 'Date_y'], axis=1))
Output:
Date index Close Name LogRet Next
0 2022-11-29 0 240.33 MSFT -0.0059 254.69
1 2022-11-29 1 280.57 QQQ -0.0076 NaN
2 2022-12-13 2 342.46 ADBE 0.0126 NaN
3 2022-12-13 3 256.92 MSFT 0.0173 NaN
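The NaNs in rows 1-3 are expected with this excerpt: df_quotes simply has no row for that Name two business days after the op date. An equivalent formulation that can be easier to read is to shift the quote dates into an explicit merge key first; a sketch on the same frames, where lookup and out are just illustrative names:
# shift each quote date back 2 business days so it lines up with the op
# date it answers for, then do a plain left merge on Date and Name
lookup = df_quotes[['Date', 'Close', 'Name']].rename(columns={'Close': 'Next'})
lookup['Date'] = lookup['Date'] - pd.tseries.offsets.BDay(2)
out = df_op.merge(lookup, on=['Date', 'Name'], how='left')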

Pandas DF: Convert All Dates to YYYY-MM-DD format

I have data that looks like this stored in a DF, and I'm trying to convert the "DATE" column so that all the dates are in YYYY-MM-DD format instead of YYYY-DD-MM. You can see the date change to a new day via the "TIME" column. (Some of the dates not shown are already in YYYY-MM-DD format, but I'm trying to change all of them to YYYY-MM-DD.)
DATE TIME BAFFIN BAY GATUN II GATUN I KLONDIKE IIIG \
8778 2016-01-01 1900 8.926278 8.046583 7.649784 7.333993
8779 2016-01-01 2000 8.817666 4.395097 4.748931 6.672631
8780 2016-01-01 2100 8.704014 6.384826 7.128692 6.115349
8781 2016-01-01 2200 8.496358 8.261933 8.166153 6.242737
8782 2016-01-01 2300 8.434297 4.656991 5.894877 5.781445
8783 2016-02-01 0000 8.528372 3.056838 3.086056 5.023564
8784 2016-02-01 0100 8.783731 4.614589 4.894076 5.042875
8785 2016-02-01 0200 8.572500 3.860174 4.641366 5.174426
8786 2016-02-01 0300 8.279557 2.076971 2.644479 5.492729
8787 2016-02-01 0400 8.378920 3.562210 2.806703 5.356025
I'm trying to convert the "DATE" column to a datetime column, specifying the format, but it does nothing:
df2['DATE'] = pd.to_datetime(df2['DATE'],format='%Y-%m-%d')
Thank you in advance for your help!
Can you try this?
pd.to_datetime(df['DATE'], dayfirst=True)
0 2016-01-01
1 2016-01-01
2 2016-01-01
3 2016-01-01
4 2016-01-01
5 2016-01-02
6 2016-01-02
7 2016-01-02
8 2016-01-02
9 2016-01-02
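To persist the conversion, assign the result back; dayfirst=True asks the parser to treat the middle field of strings like '2016-02-01' as the day:
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)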
Consider joining 'DATE' and 'TIME' to get a complete datetime column. Assuming both columns are of dtype object (string), you can combine them using the + operator and then call pd.to_datetime with a specified format. Ex:
import pandas as pd
df = pd.DataFrame({'DATE': ['2016-01-01', '2016-02-01'],
                   'TIME': ['1900', '0000']})
df['DateTime'] = pd.to_datetime(df['DATE'] + df['TIME'], format='%Y-%d-%m%H%M')
# df['DateTime']
# 0   2016-01-01 19:00:00
# 1   2016-01-02 00:00:00
# Name: DateTime, dtype: datetime64[ns]
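If the goal is literally a column of YYYY-MM-DD strings (rather than datetime64 values, which already display that way), you can format back out with dt.strftime. A sketch, assuming the raw strings are uniformly year-day-month; since the question notes some rows are already YYYY-MM-DD, a single format string won't fit those, and the dayfirst approach above is the safer route:
# parse year-day-month strings, then format back to plain YYYY-MM-DD text
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y-%d-%m')
df['DATE'] = df['DATE'].dt.strftime('%Y-%m-%d')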

Selecting data from one dataframe based on a column in a second dataframe

I have a dataframe (df) that contains the datetime columns StartDate and EndDate and the volume of a product.
If I want to look at one particular date that fits between StartDate and EndDate and its total volume, I can do it with no problem at all (see code).
However, if I create a second dataframe (call it report) with a list of dates for which I would like to look at the total volume of product from the first df, I get an error:
Can only compare identically-labeled Series objects
I read up on things like dropping the index on the second df or sorting the dates, but they don't seem to work.
So this is my working code for requesting the volume that fits within StartDate and EndDate (say, from the first of July 2019):
df[(df['StartDate'] >= '2019-07-01') & (df['EndDate'] <= '2019-10-31')]['Volume'].sum()
but if i create a second df (report):
report = pd.Series(pd.date_range('today', periods=len(df), freq='D').normalize(),name='Date')
report = pd.DataFrame(report)
and request what I want to see:
report['trial'] = df[(df['StartDate'] >= report.Date) & (df['EndDate'] <= report.Date)]['Volume'].sum()
I get this error: 'Can only compare identically-labeled Series objects'
Any advice/suggestions welcome, thanks!
First, some sample data:
np.random.seed(42)
dates = pd.date_range('2019-01-01', '2019-12-01', freq='MS')
df = pd.DataFrame({
    'StartDate': dates,
    'EndDate': dates + pd.offsets.MonthEnd(),
    'Volume': np.random.randint(1, 10, len(dates))
})
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
2 2019-03-01 2019-03-31 8
3 2019-04-01 2019-04-30 5
4 2019-05-01 2019-05-31 7
5 2019-06-01 2019-06-30 3
6 2019-07-01 2019-07-31 7
7 2019-08-01 2019-08-31 8
8 2019-09-01 2019-09-30 5
9 2019-10-01 2019-10-31 4
10 2019-11-01 2019-11-30 8
11 2019-12-01 2019-12-31 8
And the report dates:
reports = pd.to_datetime(['2019-01-15', '2019-02-15', '2019-08-15'])
Using numpy's array broadcasting:
start = df['StartDate'].values
end = df['EndDate'].values
d = reports.values[:, None]
df[np.any((start <= d) & (d <= end), axis=0)]
Result:
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
7 2019-08-01 2019-08-31 8
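The same broadcast mask can also answer the "total volume" part of the question directly: reduce along the other axis instead of filtering rows. A sketch reusing start, end and d from above (totals is just an illustrative name):
# mask has one row per report date and one column per df row; weighting
# by Volume and summing each row gives the total volume per report date
mask = (start <= d) & (d <= end)
totals = pd.Series((mask * df['Volume'].values).sum(axis=1), index=reports)
# 2019-01-15    7
# 2019-02-15    4
# 2019-08-15    8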

Is there a way of grouping by month in Pandas starting at a specific day number?

I'm trying to group some data by month in Python, but I need the month to start on the 25th of each month. Is there a way to do that in Pandas?
For weeks there is a way of starting on Monday, Tuesday, ... but for months it's always the full month:
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})
# for groupby
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
   .agg({'date': 'min', 'val': 'sum'}))
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})
s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see how the cumcount restarts at day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, March dates in Dat2 start only from the original date 2017-03-25, and so on.
The value of 1 is in March (Dat2) and the value of 2 is in May (also Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So we have the correct grouping:
1 is in March,
2 is in May.
The advantage over the other answer is that you have all dates on the first day of a month, of course bearing in mind that e.g. 2017-03-01 in the result means the period from 2017-03-25 to 2017-04-24 (inclusive).
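If you would rather label each group by its real period start (the 25th) than by the shifted first of the month, you can add the 24-day offset back onto the group index. A small sketch continuing from the Dat2 setup above (out is just an illustrative name):
# group on the shifted dates, then move the labels back to the true
# period starts: 2017-02-01 becomes 2017-02-25, and so on
out = df.groupby(pd.Grouper(key='Dat2', freq='MS'))['Val'].sum()
out.index = out.index + pd.DateOffset(days=24)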

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify the best approach to clean all date columns in the dataframe and fix my bug. My code is modelled off the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day=x % 100, month=x // 100 % 100, year=x // 10000, freq='D')

df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])  # convert to datetime
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv))  # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert out-of-bounds datetimes. (Note also that df_Fix_Date as defined above is a Series, so indexing it with ['maturity_date'] is what raises the KeyError; the snippets below assume a DataFrame with a maturity_date column.)
One solution is to replace 9999 with 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace all dates with a year higher than 2261 with 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace the problematic dates with NaT via the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
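A practical refinement of the errors='coerce' variant: coercion turns every perpetual bond into NaT, so it can be worth flagging those rows before converting, so the information is not silently lost. A sketch, assuming maturity_date is still a string column; is_perpetual is just an illustrative name:
# flag the sentinel year first, then coerce; NaT rows can later be
# recognized as perpetual rather than as bad data
df_Fix_Date['is_perpetual'] = df_Fix_Date['maturity_date'].str.startswith('9999')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')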