Error in creating a new column in a pandas DataFrame

I tried creating a new column to classify different time frames into categories using np.select. However, Python throws a shape-mismatch error, and I'm not sure how to correct it.

For your logic, it's simpler to use the hour attribute of the datetime accessor:
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("1-Apr-2021", "now", freq="4H"), name="start_date")
(s.to_frame()
  .join(pd.Series(np.select([s.dt.hour.between(1, 6),
                             s.dt.hour.between(7, 12)],
                            [1, 2], 0), name="cat"))
  .head(8)
)
            start_date  cat
0  2021-04-01 00:00:00    0
1  2021-04-01 04:00:00    1
2  2021-04-01 08:00:00    2
3  2021-04-01 12:00:00    2
4  2021-04-01 16:00:00    0
5  2021-04-01 20:00:00    0
6  2021-04-02 00:00:00    0
7  2021-04-02 04:00:00    1
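If you would rather create the new column by direct assignment, the same np.select logic works; a shape-mismatch error from np.select usually means the condition arrays (or the choices) were built from something of a different length than the column itself. A minimal sketch on an illustrative frame (the column name start_date is only an assumption for the example):
import numpy as np
import pandas as pd

# illustrative frame; in practice use your own DataFrame and datetime column
df = pd.DataFrame({"start_date": pd.date_range("1-Apr-2021", periods=8, freq="4H")})

hours = df["start_date"].dt.hour
# the condition list and the choice list must have the same length, and every
# condition must be an array the same length as the column
df["cat"] = np.select([hours.between(1, 6), hours.between(7, 12)], [1, 2], default=0)
print(df)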

Related

How can I increment a date in datetime64[ns] format by 6 calendar months?

I never realised that incrementing a simple date in Python would be such an insurmountable challenge, but have given up after 2 hours of trying and searching on this forum. I have a dataframe with a column effective_date, which contains entries like 2019-01-02 and datatype datetime64[ns].
I've tried:
data['effective_date'] = pd.to_datetime(data['effective_date'].values)
data['six_mth_interval'] = data['effective_date'].apply(lambda x: x['effective_date'].values + relativedelta(months=6))
... but I get the following error:
<ipython-input-315-b81c59eb6b0d> in <lambda>(x)
----> 1 data['six_mth_interval'] = data['effective_date'].apply(lambda x: x['effective_date'] + relativedelta(months=6))
TypeError: 'Timestamp' object is not subscriptable
Existing articles on S/O have not been helpful.
Run:
data['six_mth_interval'] = data.effective_date + pd.DateOffset(months=6)
For an example DataFrame the result is:
effective_date six_mth_interval
0 2019-08-01 2020-02-01
1 2019-08-02 2020-02-02
2 2019-08-25 2020-02-25
3 2019-08-26 2020-02-26
4 2019-08-27 2020-02-27
5 2019-08-28 2020-02-28
6 2019-08-29 2020-02-29
7 2019-08-30 2020-02-29
8 2019-08-31 2020-02-29
9 2019-09-01 2020-03-01
10 2019-09-02 2020-03-02
Your code failed because when you apply a function to a Series (a column of a DataFrame), the argument passed to the function is each single element of that Series, not the Series itself.
So you could write:
data['six_mth_interval'] = data['effective_date'].apply(
    lambda x: x + pd.DateOffset(months=6))
but my code above is faster, since it operates on the whole column at once.
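A quick way to see the difference is to time both approaches (a rough sketch; exact numbers depend on the pandas version, data size and machine):
import time

import pandas as pd

# illustrative frame with enough rows to make the difference visible
data = pd.DataFrame(
    {"effective_date": pd.date_range("2019-01-01", periods=100_000, freq="min")}
)

t0 = time.perf_counter()
data["vectorized"] = data["effective_date"] + pd.DateOffset(months=6)
t1 = time.perf_counter()
data["applied"] = data["effective_date"].apply(lambda x: x + pd.DateOffset(months=6))
t2 = time.perf_counter()

print(f"vectorized: {t1 - t0:.3f}s  apply: {t2 - t1:.3f}s")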

dt.floor count for every 12 hours in Pandas

I am trying to count the datetime occurrences every 12 hours using dt.floor, as follows.
Here I created a data frame containing 2 days with 1-hour intervals. I have two questions regarding the output.
I am expecting the summary to be for every 12 hours, i.e. the first row in output-1 should be 12:00 and the second row 24:00. Instead, I get 00:00 and 12:00. Why is this?
Is it possible to create a summary using a specific time? For example, count at every 6 AM and 6 PM?
Code and input:
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
input1.groupby(input1['datetime'].dt.floor('12H')).count()
output-1
datetime
datetime
2018-01-01 00:00:00 12
2018-01-01 12:00:00 12
2018-01-02 00:00:00 12
2018-01-02 12:00:00 12
output-2
datetime
datetime
2018-01-01 06:00:00 6
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 6
There is no 24th hour. The time part of a datetime in pandas exists in the range [00:00:00, 24:00:00), which ensures that there's only ever a single representation of the same exact time. (Note the half-open interval.)
import pandas as pd
pd.to_datetime('2012-01-01 24:00:00')
#ParserError: hour must be in 0..23: 2012-01-01 24:00:00
For the second point, as of pd.__version__ == '1.1.0' you can specify the offset parameter when you resample. You can also specify which side should be used for the labels. For older versions you will need to use the base argument.
# pandas < 1.1.0
#input1.resample('12H', on='datetime', base=6).count()
input1.resample('12H', on='datetime', offset='6H').count()
# datetime
#datetime
#2017-12-31 18:00:00 6
#2018-01-01 06:00:00 12
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 6
# Change labels
input1.resample('12H', on='datetime', offset='6H', label='right').count()
# datetime
#datetime
#2018-01-01 06:00:00 6
#2018-01-01 18:00:00 12
#2018-01-02 06:00:00 12
#2018-01-02 18:00:00 12
#2018-01-03 06:00:00 6
I modified your input data slightly, in order to use resample:
import pandas as pd
input1 = pd.DataFrame(pd.date_range('1/1/2018 00:00:00', periods=48, freq='H'))
input1.columns = ["datetime"]
# add a dummy column
input1['x'] = 'x'
# convert datetime to index...
input1 = input1.set_index('datetime')
# ...so we can use resample; loffset shifts the result labels forward by 6 hours
t = input1.resample('12h', loffset=pd.Timedelta(hours=6)).count()
# show results
print(t.head())
x
datetime
2018-01-01 06:00:00 12
2018-01-01 18:00:00 12
2018-01-02 06:00:00 12
2018-01-02 18:00:00 12

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want the mean trip count across days for each 20-minute time of day. The desired output (for the above values) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can aggregate by the times created by Series.dt.time, because the minutes are always 00, 20 or 40 only and there are no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time).mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print (df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0

Selecting data from one dataframe based on a column in a second dataframe

I have a dataframe (df) that contains the datetime columns StartDate and EndDate, plus the volume of product.
If I want to look at one particular date that fits between StartDate and EndDate and its total volume, I can do it with no problem at all (see code).
However, if I create a second dataframe (call it report) with a list of dates for which I would like to look at the total volume of product from the first df, I get an error:
Can only compare identically-labeled Series objects
I read up on things like dropping the index on the second df or sorting the dates, but they don't seem to work.
So my working code for requesting the volume that falls within StartDate and EndDate, say from the first of July 2019:
df[(df['StartDate'] >= '2019-07-01') & (df['EndDate'] <= '2019-10-31')]['Volume'].sum()
But if I create a second df (report):
report = pd.Series(pd.date_range('today', periods=len(df), freq='D').normalize(),name='Date')
report = pd.DataFrame(report)
and request what I want to see:
report['trial'] = df[(df['StartDate'] >= report.Date) & (df['EndDate'] <= report.Date)]['Volume'].sum()
I get this error: 'Can only compare identically-labeled Series objects'.
Any advice/suggestions welcome, thanks!
First, some sample data:
import numpy as np
import pandas as pd

np.random.seed(42)
dates = pd.date_range('2019-01-01', '2019-12-01', freq='MS')
df = pd.DataFrame({
    'StartDate': dates,
    'EndDate': dates + pd.offsets.MonthEnd(),
    'Volume': np.random.randint(1, 10, len(dates))
})
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
2 2019-03-01 2019-03-31 8
3 2019-04-01 2019-04-30 5
4 2019-05-01 2019-05-31 7
5 2019-06-01 2019-06-30 3
6 2019-07-01 2019-07-31 7
7 2019-08-01 2019-08-31 8
8 2019-09-01 2019-09-30 5
9 2019-10-01 2019-10-31 4
10 2019-11-01 2019-11-30 8
11 2019-12-01 2019-12-31 8
And the report dates:
reports = pd.to_datetime(['2019-01-15', '2019-02-15', '2019-08-15'])
Using numpy's array broadcasting:
start = df['StartDate'].values
end = df['EndDate'].values
d = reports.values[:, None]
df[np.any((start <= d) & (d <= end), axis=0)]
Result:
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
7 2019-08-01 2019-08-31 8
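Here reports.values[:, None] has shape (3, 1), so comparing it against the length-12 start/end arrays yields a (3, 12) boolean matrix, and np.any(..., axis=0) keeps every row of df whose interval contains at least one report date. If you also want the total Volume per report date, as in the original report['trial'] attempt, the same mask can be reused (a small sketch building on the sample data above; the column name trial is kept only to mirror the question):
# one row per report date, one column per row of df
mask = (start <= d) & (d <= end)

# total Volume of all df rows whose [StartDate, EndDate] interval contains each report date
report = pd.DataFrame({'Date': reports})
report['trial'] = (mask * df['Volume'].values).sum(axis=1)
print(report)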

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify what the best approach is to clean all date columns in the dataframe and fix my bug. My code, modelled off the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day = x%100, month = x//100 % 100, year = x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert to out-of-bounds datetimes.
One solution is to replace 9999 with 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace all dates with a year greater than 2261 with 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace problematic dates with NaT by using the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
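If you would rather keep a marker for the perpetual bonds instead of losing them to NaT, one further option (a minimal sketch, not part of the original answer, assuming df_Fix_Date is a DataFrame holding the raw maturity_date strings as above) is to coerce and then fill the NaT values with the largest whole day pandas can represent:
import pandas as pd

# parse what fits in datetime64[ns]; out-of-range dates such as 9999-12-31 become NaT
dates = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')

# pd.Timestamp.max is 2262-04-11 23:47:16.854775807; floor it to a whole day so
# perpetual bonds sort after every real maturity date
df_Fix_Date['maturity_date'] = dates.fillna(pd.Timestamp.max.floor('D'))
print(df_Fix_Date)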