Selecting data from one dataframe base on column in second dataframe - pandas

I have a dataframe (df), contains datetime columns startdate, enddate and volume of product
If I want to look at one particular date that fit in between startdate and enddate and its total volume, i can do it with no problem at all (see code).
However if I create a second dataframe (call it report), create a list of date that I would like to look at the total volume of product from first df, I came up with an error:
Can only compare identically-labeled Series objects
I read up on things like dropping index on the second df or sorting dates but they don't seem to work
So my working code for requesting volume fitted within startdate and enddate, say first of july 2019:
df[(df['StartDate'] >= '2019-07-01') & (df['EndDate'] <= '2019-10-31')]['Volume'].sum()
but if i create a second df (report):
report = pd.Series(pd.date_range('today', periods=len(df), freq='D').normalize(),name='Date')
report = pd.DataFrame(report)
and request what i want to see:
report['trial'] = df[(df['StartDate'] >= report.Date) & (df['EndDate'] <= report.Date)]['Volume'].sum()
got this error: 'Can only compare identically-labeled Series objects'
Any advice/suggestions welcome, thanks!

First, some sample data:
np.random.seed(42)
dates = pd.date_range('2019-01-01', '2019-12-01', freq='MS')
df = pd.DataFrame({
'StartDate': dates,
'EndDate': dates + pd.offsets.MonthEnd(),
'Volume': np.random.randint(1, 10, len(dates))
})
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
2 2019-03-01 2019-03-31 8
3 2019-04-01 2019-04-30 5
4 2019-05-01 2019-05-31 7
5 2019-06-01 2019-06-30 3
6 2019-07-01 2019-07-31 7
7 2019-08-01 2019-08-31 8
8 2019-09-01 2019-09-30 5
9 2019-10-01 2019-10-31 4
10 2019-11-01 2019-11-30 8
11 2019-12-01 2019-12-31 8
And the report dates:
reports = pd.to_datetime(['2019-01-15', '2019-02-15', '2019-08-15'])
Using numpy's array broadcasting:
start = df['StartDate'].values
end = df['EndDate'].values
d = reports.values[:, None]
df[np.any((start <= d) & (d <= end), axis=0)]
Result:
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
7 2019-08-01 2019-08-31 8

Related

Error in creating a new columns in pandas dataframe

Tried creating a new column to categorize different time frames into categories using np.select. However, python throws an error saying shape mismatch. I'm not sure how to get it corrected.
For your logic, it's simple to use hour attribute of datetime
import numpy as np
s = pd.Series(pd.date_range("1-Apr-2021", "now", freq="4H"), name="start_date")
(s.to_frame()
.join(pd.Series(np.select([s.dt.hour.between(1,6),
s.dt.hour.between(7,12)],
[1,2],0), name="cat"))
.head(8)
)
start_date
cat
0
2021-04-01 00:00:00
0
1
2021-04-01 04:00:00
1
2
2021-04-01 08:00:00
2
3
2021-04-01 12:00:00
2
4
2021-04-01 16:00:00
0
5
2021-04-01 20:00:00
0
6
2021-04-02 00:00:00
0
7
2021-04-02 04:00:00
1

How can I increment a date in datetime64[ns] format by 6 calendar months?

I never realised that incrementing a simple date in Python would be such an insurmountable challenge, but have given up after 2 hours of trying and searching on this forum. I have a dataframe with a column effective_date, which contains entries like 2019-01-02 and datatype datetime64[ns].
I've tried:
data['effective_date'] = pd.to_datetime(data['effective_date'].values)
data['six_mth_interval'] = data['effective_date'].apply(lambda x: x['effective_date'].values + relativedelta(months=6))
... but I get the following error:
<ipython-input-315-b81c59eb6b0d> in <lambda>(x)
----> 1 data['six_mth_interval'] = data['effective_date'].apply(lambda x: x['effective_date'] + relativedelta(months=6))
TypeError: 'Timestamp' object is not subscriptable
Existing articles on S/O have not been helpful.
Run:
data['six_mth_interval'] = data.effective_date + pd.DateOffset(months=6)
For an example DataFrame the result is:
effective_date six_mth_interval
0 2019-08-01 2020-02-01
1 2019-08-02 2020-02-02
2 2019-08-25 2020-02-25
3 2019-08-26 2020-02-26
4 2019-08-27 2020-02-27
5 2019-08-28 2020-02-28
6 2019-08-29 2020-02-29
7 2019-08-30 2020-02-29
8 2019-08-31 2020-02-29
9 2019-09-01 2020-03-01
10 2019-09-02 2020-03-02
Your code failed because when you apply a function to a Series
(a column of a DataFrame), then the argument is each single
element of this Series.
So you could write:
data['six_mth_interval'] = data['effective_date'].apply(
lambda x: x + pd.DateOffset(months=6))
but my code is faster.

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want to take daily mean of all trip counts every 20 minutes. Desired output (for above values) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can aggregate by times created by Series.dt.time, because there are always 00, 20, 40 minutes only and no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time).mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print (df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0

pandas reindex fill in missing dates

I have a dataframe with an index of dates. Each data is the first of the month. I want to fill in all missing dates in the index at a daily level.
I thought this should work:
daily=pd.date_range('2016-01-01', '2018-01-01', freq='D')
df=df.reindex(daily)
But it's returning NA in rows that should have data in (1st of the month dates) Can anyone see the issue?
Use reindex with parameter method='ffill' or resample with ffill for more general solution, because is not necessary create new index by date_range:
df = pd.DataFrame({'a': range(13)},
index=pd.date_range('2016-01-01', '2017-01-01', freq='MS'))
print (df)
a
2016-01-01 0
2016-02-01 1
2016-03-01 2
2016-04-01 3
2016-05-01 4
2016-06-01 5
2016-07-01 6
2016-08-01 7
2016-09-01 8
2016-10-01 9
2016-11-01 10
2016-12-01 11
2017-01-01 12
daily=pd.date_range('2016-01-01', '2018-01-01', freq='D')
df1 = df.reindex(daily, method='ffill')
Another solution:
df1 = df.resample('D').ffill()
print (df1.head())
a
2016-01-01 0
2016-01-02 0
2016-01-03 0
2016-01-04 0
2016-01-05 0

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify what the best approach to clean all date columns in the datframe and fix my bug? My code modelled off the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
return pd.Period(day = x%100, month = x//100 % 100, year = x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
There is problem you cannot convert to out of bounds datetimes.
One solution is replace 9999 to 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is replace all dates with year higher as 2261 to 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace problematic dates to NaTs by parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT