Convert to datetime format - pandas

I have a column MONTHYEAR (dtype object) that I want to convert to datetime format.
I tried the code below, but it is not working:
AGENT MONTHYEAR
45 SEP-2018
567 AUG-2017
432 APR-2018
Reatiers_Sales_Monthlywises_above_13['MONTHYEARS'] = Reatiers_Sales_Monthlywises_above_13['MONTHYEAR'].apply(lambda x: x.strftime('%B-%Y'))
Reatiers_Sales_Monthlywises_above_13
Please help me convert this object dtype to datetime.

If you want to keep it in year-month format, you need to convert it to period dtype:
pd.to_datetime(df.MONTHYEAR).dt.to_period('M')
Out[206]:
0 2018-09
1 2017-08
2 2018-04
Name: MONTHYEAR, dtype: period[M]
If you want it as datetime dtype, it will be in year-month-day format:
pd.to_datetime(df.MONTHYEAR)
Out[207]:
0 2018-09-01
1 2017-08-01
2 2018-04-01
Name: MONTHYEAR, dtype: datetime64[ns]
Note: the strftime in your apply would convert the column to string/object dtype, so I'm not sure that is what you intended.
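If pd.to_datetime struggles to infer strings like 'SEP-2018', you can also pass an explicit format; a minimal sketch using the question's column names (the derived column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"AGENT": [45, 567, 432],
                   "MONTHYEAR": ["SEP-2018", "AUG-2017", "APR-2018"]})

# %b matches abbreviated month names; matching is case-insensitive, so "SEP" works
df["MONTHYEAR_DT"] = pd.to_datetime(df["MONTHYEAR"], format="%b-%Y")

# optionally keep only year-month resolution as a period column
df["MONTHYEAR_M"] = df["MONTHYEAR_DT"].dt.to_period("M")
```

With an explicit format, unparsable values fail loudly instead of being silently guessed, which is usually what you want when cleaning a column.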

Try using dateutil's parser.
It will convert each string into a date.
NOTE: it fills in 03 as the day because the current day is the 3rd; fields missing from the string default to the current date.
from dateutil import parser
df = pd.DataFrame(data={"AGENT": [45, 567, 432],
                        "MONTHYEAR": ['SEP-2018', 'AUG-2017', 'APR-2018']})
df['MONTHYEAR'] = df['MONTHYEAR'].apply(lambda x: parser.parse(str(x)))
AGENT MONTHYEAR
0 45 2018-09-03
1 567 2017-08-03
2 432 2018-04-03
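If you would rather pin the day to the 1st instead of inheriting today's day, parser.parse accepts a default datetime that supplies any fields missing from the string; a sketch (not part of the original answer):

```python
from datetime import datetime

import pandas as pd
from dateutil import parser

df = pd.DataFrame({"MONTHYEAR": ["SEP-2018", "AUG-2017", "APR-2018"]})

# fields absent from the string (here: the day) are taken from `default`
default = datetime(2000, 1, 1)
df["MONTHYEAR"] = df["MONTHYEAR"].apply(lambda x: parser.parse(x, default=default))
```

This makes the result independent of when the code happens to run.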

Related

Python: Convert string to datetime, calculate time difference, and select rows with time difference more than 3 days

I have a dataframe that contains two string date columns. First I would like to convert the two columns to datetime and calculate the time difference. Then I would like to select rows with a time difference of more than 3 days.
simple df
ID Start End
234 2020-11-16 20:25 2020-11-18 00:10
62 2020-11-02 02:50 2020-11-15 21:56
771 2020-11-17 03:03 2020-11-18 00:10
desired df
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
Current input
df['End'] = pd.to_datetime(df['End'])
df['Start'] = pd.to_datetime(df['Start'])
df['Time difference'] = df['End'] - df['Start']
How can I select rows that have a time difference of more than 3 days?
Thanks in advance! I appreciate any help on this!!
You're just missing one line: convert to days, then query.
df[df['Time difference'].dt.days > 3]
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
df = df.set_index('ID').apply(lambda x: pd.to_datetime(x))  # set ID as index so the date columns can be coerced to datetime
df = df.assign(Timedifference=df['End'].sub(df['Start'])).reset_index()  # calculate time difference and reset index
df[df['Timedifference'].dt.days.gt(3)]  # boolean mask to filter the desired rows
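An alternative that keeps the timedelta's full precision (dt.days truncates the time-of-day part, so e.g. a gap of 3 days 12:00 would be dropped by dt.days > 3) is to compare against pd.Timedelta directly; a sketch using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({"ID": [234, 62, 771],
                   "Start": ["2020-11-16 20:25", "2020-11-02 02:50", "2020-11-17 03:03"],
                   "End": ["2020-11-18 00:10", "2020-11-15 21:56", "2020-11-18 00:10"]})
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])
df["Time difference"] = df["End"] - df["Start"]

# compare the timedelta directly instead of truncating to whole days
result = df[df["Time difference"] > pd.Timedelta(days=3)]
```

For this data both approaches select only ID 62; they differ only on rows between exactly 3 and 4 days.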

difference in two date column in Pandas

I am trying to get the difference between two date columns with the script and data below, but I am getting the same result for all three rows.
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I am expecting results like the DifferenceinDays column, which was calculated in Excel, but in Python I am getting the same values for all three rows. The code I used is below. Can anyone tell me how to calculate the difference between two date columns? I am trying to get the number of hours between them.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs']=(df.End-df.Start)
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]
IIUC, divide the timedelta by np.timedelta64(1, 'h') to get hours.
Additionally, it looks like Excel calculates the hours differently; I'm unsure why.
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35
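A pure-pandas alternative that avoids the numpy import is dt.total_seconds(); a sketch on the first two rows of the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Start": ["10/26/2013 12:43", "2/3/2014 12:43"],
                   "End": ["12/15/2014 0:04", "3/25/2015 0:04"]})
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])

# total_seconds() gives the exact elapsed seconds; divide by 3600 for hours
df["hrs"] = (df["End"] - df["Start"]).dt.total_seconds() / 3600
```

Both rows really do span 414 days 11:21, i.e. 9947.35 hours, so the identical values in the question are correct.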

Date serial numbers and dates need to be converted to date format

When I read a Google spreadsheet into a dataframe, I get data in the format below:
42836
42837
42838
42844
42845
42846
42849
42850
42851
2/1/2018
2/2/2018
But I need to convert all of them to date format.
IIUC, set up the origin date and use np.where. Based on my experience,
the origin in Excel is December 30, 1899.
s1 = pd.to_datetime(pd.to_numeric(df.date, errors='coerce'), errors='coerce', origin='1899-12-30', unit='D')
s2 = pd.to_datetime(df.date, errors='coerce')
df['new'] = np.where(df.date.str.contains('/'), s2, s1)
df
Out[282]:
date new
0 42837 2017-04-12
1 42838 2017-04-13
2 42844 2017-04-19
3 42845 2017-04-20
4 42846 2017-04-21
5 42849 2017-04-24
6 42850 2017-04-25
7 42851 2017-04-26
8 2/1/2018 2018-02-01
9 2/2/2018 2018-02-02
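A self-contained sketch of the same idea, with explicit formats so neither branch depends on format inference (column name `date` as in the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": ["42837", "42838", "2/1/2018"]})

# branch 1: treat the value as an Excel serial number (days since 1899-12-30);
# non-numeric strings become NaT via the coercions
s1 = pd.to_datetime(pd.to_numeric(df["date"], errors="coerce"),
                    errors="coerce", origin="1899-12-30", unit="D")

# branch 2: treat the value as a month/day/year string; serials become NaT
s2 = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")

# pick branch 2 when the value contains a slash, branch 1 otherwise
df["new"] = np.where(df["date"].str.contains("/"), s2, s1)
```

Each row is valid in exactly one branch, so the np.where selection always lands on a real timestamp.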
Use datetime with timedelta.
The base date is December 30, 1899 (Excel's origin once its spurious 1900 leap day is accounted for); then add the serial number as a timedelta.
The for loop just shows the first three of your dates.
If you need a different format, call strftime on the result, e.g. (date + dt.timedelta(days=aDay)).strftime("%Y-%m-%d").
import datetime as dt
date = dt.datetime(1899, 12, 30)
dates = [42836, 42837, 42838]
for aDay in dates:
    print(date + dt.timedelta(days=aDay))

Pandas: tz_convert using apply returns object rather than datetime

I have a dataframe indexed by timestamps in UTC, along with 2 columns specifying the time zone and daylight savings offsets in minutes from UTC:
time_zone daylight_saving
END_DATE
2017-06-02 00:00:00+00:00 0 60
2017-06-02 01:00:00+00:00 0 60
2017-06-02 02:00:00+00:00 0 60
2017-06-02 03:00:00+00:00 0 60
2017-06-02 04:00:00+00:00 0 60
I'm attempting to convert the timestamps to the local timezone by using pytz.FixedOffset. Using a static offset works fine, I get a datetime with the appropriate timezone:
In [51]: df.tz_convert(pytz.FixedOffset(120))[['time_zone','daylight_saving']].head()
Out[51]:
time_zone daylight_saving
END_DATE
2017-06-02 02:00:00+02:00 0 60
2017-06-02 03:00:00+02:00 0 60
2017-06-02 04:00:00+02:00 0 60
2017-06-02 05:00:00+02:00 0 60
2017-06-02 06:00:00+02:00 0 60
In [52]: df.tz_convert(pytz.FixedOffset(120))[['time_zone','daylight_saving']].head().index
Out[52]:
DatetimeIndex(['2017-06-02 02:00:00+02:00', '2017-06-02 03:00:00+02:00',
'2017-06-02 04:00:00+02:00', '2017-06-02 05:00:00+02:00',
'2017-06-02 06:00:00+02:00'],
dtype='datetime64[ns, pytz.FixedOffset(120)]', name='END_DATE', freq=None)
In order to do this using the offset columns, however, I need to use the apply method:
In [63]: r_df.apply(lambda r:
r['END_DATE'].tz_convert(pytz.FixedOffset(r['time_zone'] +
r['daylight_saving'])), axis=1).head()
Out[63]:
0 2017-06-02 01:00:00+01:00
1 2017-06-02 02:00:00+01:00
2 2017-06-02 03:00:00+01:00
3 2017-06-02 04:00:00+01:00
4 2017-06-02 05:00:00+01:00
dtype: object
As you can see in the output, this returns an object series, not a datetime series as I expected.
If I try to convert it back using pd.to_datetime, I am forced to return it to UTC, defeating the purpose of applying the timezone.
Is there any way to convert this back to a dt while retaining the tz info?
I stumbled onto the same issue and reported it to the Pandas community, who redirected me to an older report of the same problem. Sadly there is still no fix, but if you would like to track it you can check out:
The issue I reported.
The issue I was redirected to.
I encountered the same issue today in the exact same situation.
Found a workaround by chaining the call to tz_convert with dt.tz_localize(tz=None).
# function to apply
def tz_func(x):
    return x.dt.tz_convert(x.name).dt.tz_localize(tz=None)

# group by timezone and transform with the function
r_df.groupby("time_zone")["END_DATE"].transform(tz_func)
The resulting series will then be of datetime dtype rather than object, since keeping the localized (mixed-offset) timestamps is what forces the Series to object dtype.
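A self-contained sketch of that workaround, using a combined offset column in minutes as in the original question (the frame and column `offset` here are made up for illustration):

```python
import pandas as pd
import pytz

ts = pd.to_datetime(["2017-06-02 00:00", "2017-06-02 01:00",
                     "2017-06-02 02:00"]).tz_localize("UTC")
df = pd.DataFrame({"END_DATE": ts,
                   "time_zone": [0, 0, 0],
                   "daylight_saving": [60, 60, 60]})
df["offset"] = df["time_zone"] + df["daylight_saving"]

def to_local(group):
    # group.name holds the group key, i.e. the offset in minutes
    return group.dt.tz_convert(pytz.FixedOffset(group.name)).dt.tz_localize(None)

# one tz_convert per distinct offset, then drop the tz to get datetime64[ns]
df["local"] = df.groupby("offset")["END_DATE"].transform(to_local)
```

Grouping means tz_convert runs once per distinct offset rather than once per row, and dropping the tz at the end is what keeps the result out of object dtype.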

Calculate age in months - optimize date transformations in pandas

I am trying a very simple thing - to calculate age in months between two columns and save it to a new column
df['AGE'] = df.apply(lambda x: (x['DAX'].year - int(x['BIRTH_DATE'][:4])) * 12 +
                               x['DAX'].month - int(x['BIRTH_DATE'][5:7])
                     if x['BIRTH_DATE'] is not None and int(x['BIRTH_DATE'][:4]) > 1900
                     else -1,  # data quality
                     axis=1).astype(np.int16)  # int16: ages in months can exceed int8's max of 127
I am doing this while loading a pretty big 2 GB csv file. DAX is parsed directly in the reader while BIRTH_DATE is left as a string.
And this simple calculation bumps up the load time by a factor of 10. Is there a smarter way to calculate age in months on big dataframes?
Here is a sample of data:
DAX BIRTH_DATE
2015-01-01 1931-12-03
2015-01-01 1991-04-19
2015-01-01 1992-10-11
2015-01-01 1982-05-20
2015-01-01 1987-12-20
2015-01-01 1976-07-30
2015-01-01 1951-05-11
2015-01-01 1993-05-06
2015-01-01 1989-02-27
I am trying to get another column 'AGE' as number of months since birthday.
first convert BIRTH_DATE to datetime dtype:
In [257]: df['BIRTH_DATE'] = pd.to_datetime(df['BIRTH_DATE'], errors='coerce')
check:
In [258]: df.dtypes
Out[258]:
DAX datetime64[ns]
BIRTH_DATE datetime64[ns]
dtype: object
now we can do this simple math:
In [259]: df['AGE'] = df.DAX.dt.year*12 + df.DAX.dt.month - \
(df.BIRTH_DATE.dt.year*12 + df.BIRTH_DATE.dt.month)
In [260]: df
Out[260]:
DAX BIRTH_DATE AGE
0 2015-01-01 1931-12-03 997
1 2015-01-01 1991-04-19 285
2 2015-01-01 1992-10-11 267
3 2015-01-01 1982-05-20 392
4 2015-01-01 1987-12-20 325
5 2015-01-01 1976-07-30 462
6 2015-01-01 1951-05-11 764
7 2015-01-01 1993-05-06 260
8 2015-01-01 1989-02-27 311
As you have not provided any sample data, I'm not completely sure what format your data is in. Something like this should work, though, and be considerably faster than using apply():
df['AGE'] = (df.DAX - df.BIRTH_DATE.astype('datetime64[ns]')).dt.days / 30
Again, without the data I'm not sure what your data-quality step needs to do, but it can probably be handled after the line above like so:
df.loc[df['AGE'].isnull(), 'AGE'] = -1
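Putting the pieces of the first answer together — coercing BIRTH_DATE, computing whole months vectorized, and flagging bad rows with -1 as in the original apply — might look like this sketch:

```python
import pandas as pd

df = pd.DataFrame({"DAX": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-01"]),
                   "BIRTH_DATE": ["1931-12-03", "1991-04-19", None]})

# invalid or missing birth dates become NaT instead of raising
bd = pd.to_datetime(df["BIRTH_DATE"], errors="coerce")

# exact month difference, no division by an average month length
months = (df["DAX"].dt.year - bd.dt.year) * 12 + (df["DAX"].dt.month - bd.dt.month)

# NaT rows produce NaN here; flag them with -1 as the data-quality marker
df["AGE"] = months.fillna(-1).astype(int)
```

Unlike dividing the day count by 30, this reproduces the calendar-month arithmetic of the original apply (e.g. 997 months for 1931-12-03) without row-wise Python overhead.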