Calculate age in months - optimize date transformations in pandas - pandas

I am trying a very simple thing: calculating the age in months between two columns and saving it to a new column.
df['AGE'] = (df.apply(lambda x: (x['DAX'].year - int(x['BIRTH_DATE'][:4])) * 12
                                + x['DAX'].month - int(x['BIRTH_DATE'][5:7])
                                if x['BIRTH_DATE'] is not None
                                and int(x['BIRTH_DATE'][:4]) > 1900
                                else -1,  # data quality
                      axis=1)
             .astype(np.int8))
I am doing this while loading a pretty big 2 GB CSV file. DAX is parsed directly in the reader while BIRTH_DATE is left as a string.
This simple calculation increases the load time roughly tenfold. Is there a smarter way to calculate age in months on big data frames?
Here is a sample of data:
DAX BIRTH_DATE
2015-01-01 1931-12-03
2015-01-01 1991-04-19
2015-01-01 1992-10-11
2015-01-01 1982-05-20
2015-01-01 1987-12-20
2015-01-01 1976-07-30
2015-01-01 1951-05-11
2015-01-01 1993-05-06
2015-01-01 1989-02-27
I am trying to get another column 'AGE' as number of months since birthday.

first convert BIRTH_DATE to datetime dtype:
In [257]: df['BIRTH_DATE'] = pd.to_datetime(df['BIRTH_DATE'], errors='coerce')
check:
In [258]: df.dtypes
Out[258]:
DAX datetime64[ns]
BIRTH_DATE datetime64[ns]
dtype: object
now we can do this simple math:
In [259]: df['AGE'] = df.DAX.dt.year*12 + df.DAX.dt.month - \
(df.BIRTH_DATE.dt.year*12 + df.BIRTH_DATE.dt.month)
In [260]: df
Out[260]:
DAX BIRTH_DATE AGE
0 2015-01-01 1931-12-03 997
1 2015-01-01 1991-04-19 285
2 2015-01-01 1992-10-11 267
3 2015-01-01 1982-05-20 392
4 2015-01-01 1987-12-20 325
5 2015-01-01 1976-07-30 462
6 2015-01-01 1951-05-11 764
7 2015-01-01 1993-05-06 260
8 2015-01-01 1989-02-27 311
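One thing the vectorized math above drops is the original data-quality rule (-1 for missing or pre-1901 birth dates). A minimal sketch restoring it, on made-up sample rows (note that np.int8 from the question would overflow for ages above 127 months, so int16 is used here):

```python
import pandas as pd

# Hypothetical sample rows: one valid date, one missing, one pre-1901
df = pd.DataFrame({
    "DAX": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-01"]),
    "BIRTH_DATE": ["1991-04-19", None, "1899-05-06"],
})

# errors="coerce" turns None and unparsable strings into NaT
birth = pd.to_datetime(df["BIRTH_DATE"], errors="coerce")

months = (df["DAX"].dt.year * 12 + df["DAX"].dt.month
          - (birth.dt.year * 12 + birth.dt.month))

# Keep the computed value only where the birth date is present and plausible
valid = birth.notna() & (birth.dt.year > 1900)
df["AGE"] = months.where(valid, -1).astype("int16")
```

This stays fully vectorized, so it keeps the speed advantage over apply() while reproducing the -1 sentinel.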

As you have not provided any sample data, I'm not completely sure what format your data is in. Something like this should work, though, and be considerably faster than using apply():
df['AGE'] = (df.DAX - df.BIRTH_DATE.astype('datetime64[ns]')).dt.days / 30
Again, without the data I'm not sure what your data-quality step needs to do, but it can probably be handled after the line above like so:
df.loc[df['AGE'].isnull(), 'AGE'] = -1

Related

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. I have data for multiple companies; each row is associated with a specific datadate, so I have many rows related to many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29,20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the previous 6-month return of the Market yield on US Treasury securities at 10-year constant maturity (from dataframe 2). The result should either be in a new dataframe or in an additional column in dataframe 1. Both dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six months is not
# a fixed length of time, so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
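The steps above can be exercised end-to-end on synthetic stand-ins for the two frames (column names as in the question, values made up):

```python
import numpy as np
import pandas as pd

# Hypothetical daily yields covering ~400 days from early 2009
df2 = pd.DataFrame({
    "date": pd.date_range("2009-01-02", periods=400, freq="D"),
    "dgs10": np.linspace(2.4, 3.4, 400),
})
df1 = pd.DataFrame({"ID": [1, 2],
                    "ipodate": pd.to_datetime(["2010-01-15", "2009-03-01"])})

# 180-day trailing mean, then invalidate the first 180 days of history
tmp = df2.rolling("180D", on="date").mean()
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan

out = df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
```

The second IPO date falls inside the first 180 days of yield history, so its mean comes back as NaN, which is what the invalidation rule intends.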

Python: Convert string to datetime, calculate time difference, and select rows with time difference more than 3 days

I have a dataframe that contains two string date columns. First I would like to convert the two columns to datetime and calculate the time difference. Then I would like to select rows with a time difference of more than 3 days.
simple df
ID Start End
234 2020-11-16 20:25 2020-11-18 00:10
62 2020-11-02 02:50 2020-11-15 21:56
771 2020-11-17 03:03 2020-11-18 00:10
desired df
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
Current input
df['End'] = pd.to_datetime(df['End'])
df['Start'] = pd.to_datetime(df['Start'])
df['Time difference'] = df['End'] - df['Start']
How can I select rows that have a time difference of more than 3 days?
Thanks in advance! I appreciate any help on this!!
You're just missing one line: convert to days, then filter:
df[df['Time difference'].dt.days > 3]
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
df = df.set_index('ID').apply(lambda x: pd.to_datetime(x))  # set ID as index so both date columns can be coerced to datetime
df = df.assign(Timedifference=df['End'].sub(df['Start'])).reset_index()  # calculate time difference and reset index
df[df['Timedifference'].dt.days.gt(3)]  # boolean mask to filter the desired rows
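A variant worth noting: `.dt.days` truncates, so a gap of, say, 3 days 12 hours would not pass `.dt.days > 3`. Comparing the timedelta column directly against a pd.Timedelta keeps sub-day precision (sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [234, 62, 771],
    "Start": ["2020-11-16 20:25", "2020-11-02 02:50", "2020-11-17 03:03"],
    "End": ["2020-11-18 00:10", "2020-11-15 21:56", "2020-11-18 00:10"],
})
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])
df["Time difference"] = df["End"] - df["Start"]

# Strictly more than 3 full days, including the hours/minutes part
over_3_days = df[df["Time difference"] > pd.Timedelta(days=3)]
```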

difference in two date column in Pandas

I am trying to get the difference between two date columns. The script and data used are below, but I am getting the same result for all three rows.
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I am expecting results as in the DifferenceinDays column, which was calculated in Excel, but in Python I get the same value for all three rows. Please refer to the code below. Can anyone tell me how to calculate the difference between two date columns? I am trying to get the number of hours between them.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs']=(df.End-df.Start)
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]
IIUC, divide by np.timedelta64(1,'h').
Additionally, it looks like Excel calculates the hours differently; I'm unsure why.
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35
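An equivalent that avoids the NumPy import is `Series.dt.total_seconds()`; a small sketch using the first row's values:

```python
import pandas as pd

start = pd.to_datetime(pd.Series(["10/26/2013 12:43"]))
end = pd.to_datetime(pd.Series(["12/15/2014 0:04"]))

# timedelta -> seconds -> hours
hrs = (end - start).dt.total_seconds() / 3600
```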

Date serial number and date need to convert in date format

When I read a Google spreadsheet into a dataframe, I get data in the format below:
42836
42837
42838
42844
42845
42846
42849
42850
42851
2/1/2018
2/2/2018
But I need to convert all of them to date format.
IIUC, set up the origin date and use np.where. Based on my experience,
the origin in Excel is December 30, 1899.
s1=pd.to_datetime(pd.to_numeric(df.date,errors='coerce'),errors='coerce',origin='1899-12-30',unit='D')
s2=pd.to_datetime(df.date,errors='coerce')
df['new']=np.where(df.date.str.contains('/'),s2,s1)
df
Out[282]:
date new
0 42837 2017-04-12
1 42838 2017-04-13
2 42844 2017-04-19
3 42845 2017-04-20
4 42846 2017-04-21
5 42849 2017-04-24
6 42850 2017-04-25
7 42851 2017-04-26
8 2/1/2018 2018-02-01
9 2/2/2018 2018-02-02
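The approach above, packaged as a self-contained sketch on a hypothetical mixed column:

```python
import numpy as np
import pandas as pd

# A mix of Excel serial numbers (as strings) and slash-formatted dates
df = pd.DataFrame({"date": ["42837", "42851", "2/1/2018"]})

# Serials: interpret as days since the Excel origin 1899-12-30
s1 = pd.to_datetime(pd.to_numeric(df["date"], errors="coerce"),
                    errors="coerce", origin="1899-12-30", unit="D")
# Slash dates: parse directly; serials coerce to NaT here, but they
# are never selected by the np.where mask below
s2 = pd.to_datetime(df["date"], errors="coerce")
df["new"] = np.where(df["date"].str.contains("/"), s2, s1)
```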
Use datetime with timedelta.
The base date is 1.1.1900; then add the days as a timedelta.
The for loop just shows the first three of your dates.
If you need a different format, use strftime("%Y-%m-%d %H:%M:%S", gmtime()).
import datetime as dt
date = dt.datetime(1900,1,1)
dates = [42836, 42837, 42838]
for aDay in dates:
    print(date + dt.timedelta(days=aDay))

convert to date time format

I have a column Monthyear (dtype = object) that I want to convert to datetime format.
I tried the code below, but it is not working:
AGENT MONTHYEAR
45 SEP-2018
567 AUG-2017
432 APR-2018
Reatiers_Sales_Monthlywises_above_13['MONTHYEARS'] = Reatiers_Sales_Monthlywises_above_13['MONTHYEAR'].apply(lambda x: x.strftime('%B-%Y'))
Reatiers_Sales_Monthlywises_above_13
Please help me convert this object dtype to datetime.
If you want to keep it in year-month format, you need to convert it to period dtype:
pd.to_datetime(df.MONTHYEAR).dt.to_period('M')
Out[206]:
0 2018-09
1 2017-08
2 2018-04
Name: MONTHYEAR, dtype: period[M]
If you want it as datetime dtype, it will be in year-month-day format:
pd.to_datetime(df.MONTHYEAR)
Out[207]:
0 2018-09-01
1 2017-08-01
2 2018-04-01
Name: MONTHYEAR, dtype: datetime64[ns]
Note: strftime in your apply would convert it to string/object dtype, so I'm not sure whether that was your intention.
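If format inference ever guesses wrong on a column like this, passing an explicit format string is a safer variant (sample values from the question):

```python
import pandas as pd

s = pd.Series(["SEP-2018", "AUG-2017", "APR-2018"])

# %b matches abbreviated month names (case-insensitively), %Y the year
parsed = pd.to_datetime(s, format="%b-%Y")
periods = parsed.dt.to_period("M")
```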
Try using the dateutil parser.
It will convert the string into a date.
NOTE: it fills in 03 as the day because the current day is 03.
from dateutil import parser
df = pd.DataFrame(data={"AGENT":[45,567,432],
"MONTHYEAR":['SEP-2018','AUG-2017','APR-2018']})
df['MONTHYEAR'] = df['MONTHYEAR'].apply(lambda x :parser.parse(str(x)))
AGENT MONTHYEAR
0 45 2018-09-03
1 567 2017-08-03
2 432 2018-04-03