Pandas: Read dates in a column which has different formats

I would like to read a CSV that has dates in one column, but the dates are in different formats within that column.
Specifically, some dates are in "dd/mm/yyyy" format, and some are in "4####" format (Excel 1900 date system, where the serial number represents days elapsed since 1900-01-01).
Is there any way to use read_csv or pandas.to_datetime to convert the column to datetime?
I have tried using pandas.to_datetime with no extra parameters, to no avail.
df["Date"] = pd.to_datetime(df["Date"])
Returns
ValueError: year 42613 is out of range
Presumably it can read the "dd/mm/yyyy" format fine but produces an error for the "4####" format.
Note: the column is mixed type as well
Appreciate any help
Example
import pandas as pd

dates = ['25/07/2016', '42315']
df = pd.DataFrame(dates, columns=['Date'])
# desired output: ['25/07/2016', '07/11/2015']

Let's try:
dates = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
m = dates.isna()
dates.loc[m] = (
    pd.TimedeltaIndex(df.loc[m, 'Date'].astype(int), unit='d')
    + pd.Timestamp(year=1899, month=12, day=30)
)
df['Date'] = dates
Or alternatively with seconds conversion:
dates = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
m = dates.isna()
dates.loc[m] = pd.to_datetime(
    (df.loc[m, 'Date'].astype(int) - 25569) * 86400.0,
    unit='s'
)
df['Date'] = dates
df:
Date
0 2016-07-25
1 2015-11-07
Explanation:
First, convert to datetime everything that pd.to_datetime can handle on its own:
dates = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
Check which values couldn't be converted:
m = dates.isna()
Convert the remaining NaTs:
a. Offset as days since 1899-12-30 using TimedeltaIndex + pd.Timestamp:
dates.loc[m] = (
    pd.TimedeltaIndex(df.loc[m, 'Date'].astype(int), unit='d')
    + pd.Timestamp(year=1899, month=12, day=30)
)
b. Or convert serial days to seconds mathematically:
dates.loc[m] = pd.to_datetime(
    (df.loc[m, 'Date'].astype(int) - 25569) * 86400.0,
    unit='s'
)
Update the Date column:
df['Date'] = dates
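For context on the two constants: Excel's 1900 date system counts serial 1 as 1900-01-01 but also treats 1900 as a leap year, so anchoring at 1899-12-30 gives the correct calendar dates for modern serials, and 25569 is the serial of the Unix epoch 1970-01-01. A minimal end-to-end sketch of option (a) with the example data, using pd.to_timedelta in place of TimedeltaIndex (same result here):

import pandas as pd

# hypothetical column mixing dd/mm/yyyy strings and Excel serial numbers
df = pd.DataFrame({'Date': ['25/07/2016', '42315']})

# 1) parse everything pd.to_datetime can handle; the rest becomes NaT
dates = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')

# 2) treat the failures as Excel serial days anchored at 1899-12-30
m = dates.isna()
dates.loc[m] = (
    pd.to_timedelta(df.loc[m, 'Date'].astype(int), unit='D')
    + pd.Timestamp(year=1899, month=12, day=30)
)

df['Date'] = dates
print(df)
#         Date
# 0 2016-07-25
# 1 2015-11-07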

Related

Convert HH:MM:SS and MM:SS to seconds in same column in Pandas

I am trying to convert a timestamp string into an integer and having trouble.
The column in my dataframe looks like this:
time
30:03
1:15:02
I have tried df['time'].str.split(':').apply(lambda x: int(x[0]) * 60 + int(x[1])), but that only works when every value in the column is MM:SS, and my column is mixed. Some people took 30 minutes to finish the task and some took over an hour, etc.
You can make a custom convert function where you check for time format:
def convert(x):
    x = x.split(":")
    if len(x) == 2:  # MM:SS
        return int(x[0]) * 60 + int(x[1])
    return int(x[0]) * 3600 + int(x[1]) * 60 + int(x[2])  # HH:MM:SS

df["time"] = df["time"].apply(convert)
print(df)
Prints:
   time
0  1803
1  4502
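If you prefer a vectorized alternative, here is a sketch under the assumption that every value is either MM:SS or H:MM:SS: pad the short entries with a leading '0:' so everything parses as a timedelta, then take the total seconds.

import pandas as pd

df = pd.DataFrame({'time': ['30:03', '1:15:02']})

# pad MM:SS values to 0:MM:SS so pd.to_timedelta reads them all the same way
padded = df['time'].where(df['time'].str.count(':') == 2, '0:' + df['time'])
df['time'] = pd.to_timedelta(padded).dt.total_seconds().astype(int)
print(df)
#    time
# 0  1803
# 1  4502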

Converting to datetime - ParserError: Unknown string format: 2022-02-17 7

I have a pandas dataframe with some string values where the hour of the date is written as a single digit when it is smaller than 10, like this:
2022-02-17 7
I now want to get this strings to datetime format but when applying
df['datetime'] = pd.to_datetime(df['datetime'], infer_datetime_format=True)
I get a ParserError:
ParserError: Unknown string format: 2022-02-17 7
How can I solve this?
Use:
df = pd.DataFrame({'datetime': ['2022-02-17 7']})
df['datetime'] = df['datetime'].str.replace(r' (\d)', r' 0\1', regex=True)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H')
The result:
             datetime
0 2022-02-17 07:00:00
A more general case which can handle two-digit hours too:
df = pd.DataFrame({'datetime': ['2022-02-17 17', '2022-02-17 7']})
df['datetime'] = df['datetime'].str.replace(r'\s(\d)$', r' 0\1', regex=True)
The result:
        datetime
0  2022-02-17 17
1  2022-02-17 07
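Putting the general case together with the final parse, a minimal runnable sketch:

import pandas as pd

df = pd.DataFrame({'datetime': ['2022-02-17 17', '2022-02-17 7']})

# zero-pad a trailing single-digit hour, then parse with an explicit format
df['datetime'] = df['datetime'].str.replace(r'\s(\d)$', r' 0\1', regex=True)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H')
print(df)
#              datetime
# 0 2022-02-17 17:00:00
# 1 2022-02-17 07:00:00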

list comprehension vs lambda function in pandas dataframe

I'm trying to convert decimal years to datetime format in Python. I've managed to make the conversion using a list comprehension, but I cannot get a lambda function working to do the same thing. What am I doing wrong? How can I use a lambda function to make this conversion?
from datetime import datetime, timedelta
import calendar
import pandas as pd

df = pd.DataFrame(data=[2021.3, 2021.6], columns=['dec_date'])
# define a function to convert decimal dates to datetime
def convert_partial_year(number):
    # round down to get the year
    year = int(number)
    # get the fractional year
    year_fraction = number - year
    # get the number of days in the given year
    days_in_year = 365 + calendar.isleap(year)
    # convert the fractional year into days
    d = timedelta(days=year_fraction * days_in_year)
    # convert the year into a date format
    day_one = datetime(year, 1, 1)
    # add the days into the year onto the date format
    date = d + day_one
    # return the result
    return date
# my lambda function does not work
df.assign(
    date=lambda x: convert_partial_year(x.dec_date)
)
# my list comprehension does work
df.assign(
    date=[convert_partial_year(x) for x in df.dec_date]
)
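For what it's worth, the likely cause is that inside assign the lambda receives the whole DataFrame, so x.dec_date is a Series and int(number) cannot be applied to it. A sketch, reusing df and convert_partial_year from above, that applies the converter element-wise so the lambda form works too:

# the lambda gets the whole DataFrame, so apply the converter per element
df = df.assign(
    date=lambda x: x.dec_date.apply(convert_partial_year)
)
print(df)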

remove a few dates from a date range

from datetime import timedelta, date
def daterange(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)

start_dt = date(2015, 12, 20)
end_dt = date(2016, 1, 11)
for dt in daterange(start_dt, end_dt):
    print(dt.strftime("%Y-%m-%d"))
I have a date range as stated above, but there are a few dates in this range that I need to ignore. These dates are in a dataframe.
How can I take these dates out of the date range? Any suggestions are welcome. The dataframe with the distinct dates is built as below.
Pardata = spark.read.parquet("/mnt/Test/data.parquet")
Pardata.createOrReplaceTempView("parfile")
ParRes = spark.sql("SELECT DISTINCT date FROM parfile")
Use left_anti join:
dates = [[dt.strftime("%Y-%m-%d")] for dt in daterange(start_dt, end_dt)]
dates_df = spark.createDataFrame(dates, ["date"])
dates_df.join(ParRes, dates_df["date"] == ParRes["date"], "left_anti").show()
First, create a DataFrame dates_df from that range of dates. Then use a left_anti join, which keeps only the rows of dates_df whose date does not appear in ParRes.
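Since both DataFrames use the same column name, the join condition can also be given with the on= shortcut; a small sketch assuming the Spark session, dates_df and ParRes from above:

# equivalent, slightly more concise form of the anti-join
result = dates_df.join(ParRes, on="date", how="left_anti")
result.show()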

Difference between 2 dates which are objects in dataframe

The dataframe has 2 date columns of "object" datatype. StartDate and EndDate are in mm/dd/yyyy format.
Name StartDate EndDate
bou1 1/9/2017 1/10/2017
bou2 12/31/2016 1/10/2017
Output:
Name StartDate EndDate Diff
bou1 1/9/2017 1/10/2017 1
bou2 12/31/2016 1/10/2017 10
Any suggestions would be appreciated !!
You first need to convert those columns to datetime and then subtract.
Try:
import numpy as np

df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
df['difInDate'] = abs(df['StartDate'].sub(df['EndDate'], axis=0)) / np.timedelta64(1, 'D')
print(df['difInDate'])
abs is just there to make the days positive, because StartDate - EndDate is negative when the start comes before the end.
Alternatively, you can use df['EndDate'].sub(df['StartDate']) instead.
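As a sketch of that alternative (same assumptions as above), the reverse subtraction needs no abs(), since EndDate is the later date:

df['difInDate'] = df['EndDate'].sub(df['StartDate'], axis=0) / np.timedelta64(1, 'D')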
import pandas as pd

# Recreating your dataframe with dates stored as strings
df = pd.DataFrame({'Name': ['bou1', 'bou2'],
                   'StartDate': ['01/09/2017', '12/31/2016'],
                   'EndDate': ['01/10/2017', '01/10/2017']})
# Date strings converted with pd.to_datetime
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
# Subtracting the columns gives timedeltas; .dt.days outputs the difference in days
df['Diff'] = (df['EndDate'] - df['StartDate']).dt.days
# Just prints the columns in your order
df[['Name', 'StartDate', 'EndDate', 'Diff']]
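A related note: since the source strings are mm/dd/yyyy, passing the format explicitly avoids any month/day ambiguity during parsing. A small self-contained sketch:

import pandas as pd

df = pd.DataFrame({'Name': ['bou1', 'bou2'],
                   'StartDate': ['01/09/2017', '12/31/2016'],
                   'EndDate': ['01/10/2017', '01/10/2017']})
# an explicit format removes any mm/dd vs dd/mm ambiguity
df['StartDate'] = pd.to_datetime(df['StartDate'], format='%m/%d/%Y')
df['EndDate'] = pd.to_datetime(df['EndDate'], format='%m/%d/%Y')
df['Diff'] = (df['EndDate'] - df['StartDate']).dt.days
print(df[['Name', 'StartDate', 'EndDate', 'Diff']])
#    Name  StartDate    EndDate  Diff
# 0  bou1 2017-01-09 2017-01-10     1
# 1  bou2 2016-12-31 2017-01-10    10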