How to remove '-' from dates in a df column of dates - pandas

I want to concatenate 3 columns. Columns 1 and 2 hold dates in the format "%Y-%m-%d", but I want to change each date in both columns to "%Y%m%d" and then append a third column of strings. I have tried the code below.
df['code'] = df['date_one'].strftime("%Y%m%d") + df['date_two'].strftime("%Y%m%d") + df['id'].astype(str)
But I keep getting an error saying the Series object has no attribute strftime. Can someone please help me?

You should use pd.to_datetime for this, then call strftime through the .dt accessor, which applies it to every value in the Series:
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y%m%d')
Date
0 20120507
1 20120507
2 20120507
3 20120507
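Applied to the three columns from the question, the same idea could look like the sketch below. The sample values are made up for illustration; only the column names date_one, date_two, and id come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "date_one": ["2012-05-07", "2013-01-15"],
    "date_two": ["2012-06-01", "2013-02-28"],
    "id": [7, 42],
})

# Parse each date column, reformat it through the .dt accessor,
# then concatenate the resulting strings with the id column.
df["code"] = (
    pd.to_datetime(df["date_one"]).dt.strftime("%Y%m%d")
    + pd.to_datetime(df["date_two"]).dt.strftime("%Y%m%d")
    + df["id"].astype(str)
)
print(df["code"])
```

If the columns already have a datetime dtype, the pd.to_datetime calls are harmless and can be dropped.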

Related

Changing date order

I have a CSV file containing a set of dates.
The format is like:
14/06/2000
15/08/2002
10/10/2009
09/09/2001
01/03/2003
11/12/2000
25/11/2002
23/09/2001
For some reason pandas.to_datetime() does not work on my data.
So, I have split the column into 3 columns, as day, month and year.
And now I am trying to combine the columns without "/" with:
df["period"] = df["y"].astype(str) + df["m"].astype(str)
But the problem is instead of getting:
200006
I get:
20006
One zero is missing.
Could you please help me with that?
This converts the column of dates to datetimes with pd.to_datetime() and then sorts it:
# This assumes the column name is 0, as it was in my df;
# change it to whatever the column is called in your dataframe.
# Note: infer_datetime_format is deprecated since pandas 2.0, where format inference is the default.
df[0] = pd.to_datetime(df[0], infer_datetime_format=True)
df[0] = df[0].sort_values(ascending = False, ignore_index = True)
df
The dayfirst= parameter might help you:
print(df)
0
0 14/06/2000
1 15/08/2002
2 10/10/2009
3 09/09/2001
4 01/03/2003
5 11/12/2000
6 25/11/2002
7 23/09/2001
pd.to_datetime(df[0], dayfirst=True).sort_values()
0 2000-06-14
5 2000-12-11
3 2001-09-09
7 2001-09-23
1 2002-08-15
6 2002-11-25
4 2003-03-01
2 2009-10-10
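To get the "200006" style period the question asks for, one option is to format the parsed dates directly; another is to zero-pad the month when joining split columns. This is a minimal sketch that assumes the dates are always in day/month/year order. The column name date is hypothetical, while y and m mirror the question.

```python
import pandas as pd

# Sample dates in day/month/year order, as in the question.
df = pd.DataFrame({"date": ["14/06/2000", "15/08/2002", "10/10/2009"]})

# Option 1: parse with an explicit format, then emit year + zero-padded month.
parsed = pd.to_datetime(df["date"], format="%d/%m/%Y")
df["period"] = parsed.dt.strftime("%Y%m")

# Option 2: with integer year and month columns, pad the month to two digits
# before concatenating, so 6 becomes "06".
df["y"] = parsed.dt.year
df["m"] = parsed.dt.month
df["period_from_parts"] = df["y"].astype(str) + df["m"].astype(str).str.zfill(2)

print(df[["period", "period_from_parts"]])
```

The missing zero in the question comes from converting an integer month to a string, which drops the padding; str.zfill(2) restores it.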

Pyspark: how to fix 'could not parse datatype: interval' error

I'm trying to add a new column to a pyspark df by subtracting the values of two existing columns.
I already had a date_of_birth column available, so I inserted a current_date column with the following code:
import datetime
from pyspark.sql.functions import lit
currentdate = "14-12-2021"
day, month, year = currentdate.split('-')
today = datetime.date(int(year), int(month), int(day))
df = df.withColumn("current_date", lit(today))
Displaying my df confirms that it worked. Looks a little something like this:
id date_of_birth current_date
01 1995-01-01 2021-12-2021
02 1987-02-16 2021-12-2021
I inserted the age column by subtracting date_of_birth from current_date.
df = df.withColumn('age', (df['current_date'] - df['date_of_birth']))
Cell runs without a problem.
Here's where I'm stuck:
Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:
'could not parse datatype: interval'
I used df.dtypes to check what's happening, and apparently my newly inserted age column is of interval type.
How can I fix this?
Is there a way to display the age in years (int) in this particular scenario?
PS: both the date_of_birth and current_date cols have a date type.
Solved it. Mike's comment helped tons. Thank you!
Here's how I solved it:
import pyspark.sql.functions as f
from pyspark.sql.functions import lit
# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))
# update data with current_date() function
df = df.withColumn("current_date", f.current_date())
# insert new column age with dummy data (in this case, 1s)
df = df.withColumn("age", lit(1))
# update data with months_between() function, divide by 12 to obtain years
df = df.withColumn("age", f.months_between(df.current_date, df.date_of_birth) / 12)
# round and cast as integer to get rid of decimals
df = df.withColumn("age", f.round(df["age"]).cast('integer'))
I would use one of the PySpark functions for calculating the difference between dates:
pyspark.sql.functions.datediff
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.datediff.html
pyspark.sql.functions.months_between
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.months_between.html

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates. Unfortunately, my import (using read_excel) brought in some dates as datetimes and others as Excel dates stored as integers.
What I am seeking is a column with dates only, in the format %Y-%m-%d.
From my research, Excel starts at 1900-01-00, so I could add these integers to that date. I have tried to use str.extract and a regex to separate the column into two, one of datetimes and the other of integers. However, the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
Attempt to first separate the columns by extracting the integers( dates imported from MS excel)
df.date_from.str.extract(r'(\d\d\d\d\d)')
However, this gives NaN.
The reason I have tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column (in other words, an error using the following code):
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time, 'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
Convert the values to timedeltas with to_timedelta and errors='coerce', so that anything that is not an integer becomes NaT, and add a Timestamp called d as the starting date. Then convert the datetimes with to_datetime and errors='coerce', and combine both results with Series.fillna in a custom function:
def f(x):
    #https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    dates = pd.to_datetime(x, errors='coerce')
    return (timedeltas + d).fillna(dates)
cols = ['date_from','date_to']
df[cols] = df[cols].apply(f)
print (df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16
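An alternative is to convert each cell individually, which avoids relying on how errors='coerce' treats mixed types. This is a rough sketch, not the answer's method: the helper name to_date is hypothetical, it assumes every cell is either a number (an Excel serial date) or a Timestamp, and it reuses the 1899-12-30 starting date from the answer.

```python
import numbers

import pandas as pd

# Starting date taken from the answer; Excel serial numbers count days from here.
EXCEL_EPOCH = pd.Timestamp(1899, 12, 30)


def to_date(value):
    """Convert one cell: numbers are Excel serial days, anything else is parsed as a date."""
    if isinstance(value, numbers.Real):
        return EXCEL_EPOCH + pd.Timedelta(days=float(value))
    return pd.Timestamp(value)


df = pd.DataFrame({
    "date_from": [pd.Timestamp("2022-09-10"), 44476, pd.Timestamp("2021-02-16")],
    "date_to": [pd.Timestamp("2022-12-11"), 44455, pd.Timestamp("2021-12-16")],
})

for col in ["date_from", "date_to"]:
    # map() applies the helper cell by cell; to_datetime makes the dtype explicit.
    df[col] = pd.to_datetime(df[col].map(to_date))

print(df)
```

This is slower than the vectorized version on large frames, and missing values would need their own branch in the helper.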

Extract value from Series

I want to add a column for each hour.
First I get all the distinct hours from the column:
from collections import Counter
coun_ = set(train_df['time1'].dt.hour)
Then I add new columns to the data frame and fill them with a default value:
for i in coun_:
    train_df['hour'+str(i)] = 0
Now I want to get the hour from time1 and set 1 in the matching column. For example, if the hour of time1 equals 10, then I put 1 in hour10. I have tried several ways without success; this is one of them:
for hour in [train_df]:
    hour['hour' + hour['time1'].dt.hour.to_string()] = 1
The question is how I can extract only value from Series and concat it?
Use get_dummies with DataFrame.add_prefix and append the result to the original with DataFrame.join:
train_df = train_df.join(pd.get_dummies(train_df['time1'].dt.hour).add_prefix('hour'))
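A small self-contained run of this approach might look like the sketch below. The timestamps are made up, and dtype=int is an added choice so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical sample data; only the column name time1 comes from the question.
train_df = pd.DataFrame({
    "time1": pd.to_datetime([
        "2014-02-20 10:02:45",
        "2014-02-22 11:19:50",
        "2014-02-23 10:30:00",
    ])
})

# One indicator column per distinct hour, named hour10, hour11, ...
dummies = pd.get_dummies(train_df["time1"].dt.hour, dtype=int).add_prefix("hour")

# join aligns on the index, so each row gets its own indicators.
train_df = train_df.join(dummies)
print(train_df)
```

This replaces both the loop that creates the zero-filled columns and the loop that tries to set the 1s.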

Excluding specfic columns in Pandas for column based computations

Year A B C D
1900 1 2 3 4
1901 2 3 4 5
I have a dataset which aligns with the above format.
When I perform calculations on the column values, the year is included in the calculation and distorts the result. For example:
df['mean'] = df.mean(axis='columns')
In the above example I just want to exclude Year from the calculation. I have 100-plus columns in my data frame and I cannot list each of them manually. 'Year' is also the index for my dataframe.
I realized the problem and the solution.
df.set_index(['Year'])
df['mean'] = df.mean(axis='columns')
This did not work, because set_index returns a new DataFrame and leaves df unchanged.
But when I added inplace=True, it worked:
df.set_index(['Year'], inplace=True)
df['mean'] = df.mean(axis='columns')
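The same fix can also be written by assigning the result of set_index back to df instead of using inplace. This is a minimal sketch using the two sample rows from the question:

```python
import pandas as pd

# The two sample rows from the question.
df = pd.DataFrame({
    "Year": [1900, 1901],
    "A": [1, 2],
    "B": [2, 3],
    "C": [3, 4],
    "D": [4, 5],
})

# set_index returns a new DataFrame, so keep the result.
df = df.set_index("Year")

# Year is now the index, so it is no longer part of the row-wise mean.
df["mean"] = df.mean(axis="columns")
print(df)
```

Either form works; the point is that Year must actually be moved into the index before the mean is computed.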
You can also drop the Year column into a new dataframe, compute the mean over the remaining columns, and then add the Year column back:
df2 = df.drop(columns='Year')
df2['Mean'] = df2.mean(axis='columns')
df2 = pd.concat([df['Year'], df2], axis=1)