I have one column Start_Time with a timestamp, and one column Time_Zone_Offset, an integer. How can I add the Time_Zone_Offset to Start_Time as a number of hours?
Example MyTable:
id Start_Time Time_Zone_Offset
1 2020-01-12 00:00:00 1
2 2020-01-12 00:00:00 2
Desired output:
id Local_Start_Time
1 2020-01-12 01:00:00
2 2020-01-12 02:00:00
Attempt:
SELECT id, Start_time + INTERVAL time_zone_offset HOURS AS Local_Start_Time
FROM MyTable
This doesn't seem to work, and I can't use from_utc_timestamp as I don't have the actual timezone details, just the time-zone offset at the time being considered.
(Hope you are using pyspark)
Indeed, I couldn't make it work with plain SQL, but I managed to get the result by converting to a Unix timestamp. It's probably not the best way, but it works (I proceeded step by step to make sure the references were working; I thought I would need a user-defined function, but apparently not).
from pyspark.sql import functions as F
# Step 1: convert the timestamp to epoch seconds
df2 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time"))
df2.show()
# Step 2: add the offset (hours) as seconds
df3 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time") + df["Time_Zone_Offset"] * 60 * 60)
df3.show()
# Step 3: format back ('yyyy-MM-dd', not 'YYYY-MM-DD': 'YYYY' is week-year and 'DD' is day-of-year)
df4 = df3.withColumn("Start_Time", F.from_unixtime("Start_Time", "yyyy-MM-dd HH:00:00"))
df4.show()
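If you prefer to stay in the DataFrame API, the same unix-timestamp trick can skip the string formatting entirely by casting the shifted epoch seconds back to a timestamp. A minimal sketch, assuming the column names from the question:

from pyspark.sql import functions as F

# Shift the epoch seconds by the offset (hours -> seconds) and cast back to a timestamp
df_local = df.withColumn(
    "Local_Start_Time",
    (F.unix_timestamp("Start_Time") + F.col("Time_Zone_Offset") * 3600).cast("timestamp")
)
df_local.show()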
Adding an alternative to Benoit's answer, using a Python UDF:
from pyspark.sql import SQLContext
from datetime import datetime, timedelta
from pyspark.sql.types import TimestampType

# Defining a Python function to add hours onto a datetime
def addHours(my_datetime, hours):
    # Accounting for NULL (None in Python) values
    if (hours is None) or (my_datetime is None):
        adjusted_datetime = None
    else:
        adjusted_datetime = my_datetime + timedelta(hours=hours)
    return adjusted_datetime

# Registering the function as a UDF for use in SQL, and defining the output type as
# TimestampType (this is important, the default is StringType)
sqlContext.udf.register("add_hours", addHours, TimestampType())
followed by:
SELECT id, add_hours(Start_Time, Time_Zone_Offset) AS Local_Start_Time
FROM MyTable
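The same function can also be used directly from the DataFrame API instead of SQL. A minimal sketch, assuming the column names from the question:

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType
# Wrap the same Python function as a column-level UDF
add_hours_udf = F.udf(addHours, TimestampType())
df.withColumn("Local_Start_Time", add_hours_udf("Start_Time", "Time_Zone_Offset")).show()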
Starting with Spark 3.0, you can use the make_interval(years, months, weeks, days, hours, mins, secs) function to add intervals built from values in other columns.
SELECT
id
, Start_time + make_interval(0, 0, 0, 0, time_zone_offset, 0, 0) AS Local_Start_Time
FROM MyTable
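The same expression works from the DataFrame API through expr(). A minimal sketch, assuming Spark 3.0+ and the column names from the question:

from pyspark.sql import functions as F
# Build the interval from the offset column and add it to the timestamp
df.withColumn(
    "Local_Start_Time",
    F.col("Start_Time") + F.expr("make_interval(0, 0, 0, 0, Time_Zone_Offset, 0, 0)")
).show()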
For anyone else coming to this question and using Spark SQL via Databricks, the dateadd function works the same way as in most other SQL dialects:
select dateadd(microsecond,30,'2022-11-04') as microsecond
,dateadd(millisecond,30,'2022-11-04') as millisecond
,dateadd(second ,30,'2022-11-04') as second
,dateadd(minute ,30,'2022-11-04') as minute
,dateadd(hour ,30,'2022-11-04') as hour
,dateadd(day ,30,'2022-11-04') as day
,dateadd(week ,30,'2022-11-04') as week
,dateadd(month ,30,'2022-11-04') as month
,dateadd(quarter ,30,'2022-11-04') as quarter
,dateadd(year ,30,'2022-11-04') as year
Output (one row; unit and result shown side by side):

microsecond   2022-11-04T00:00:00.000+0000
millisecond   2022-11-04T00:00:00.030+0000
second        2022-11-04T00:00:30.000+0000
minute        2022-11-04T00:30:00.000+0000
hour          2022-11-05T06:00:00.000+0000
day           2022-12-04T00:00:00.000+0000
week          2023-06-02T00:00:00.000+0000
month         2025-05-04T00:00:00.000+0000
quarter       2030-05-04T00:00:00.000+0000
year          2052-11-04T00:00:00.000+0000
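Applied to the original question, the hour unit would take the offset column as its value argument. A minimal sketch, assuming a Databricks/Spark SQL session named spark and that dateadd accepts a column expression for the value, as its documented signature suggests:

# Add the per-row hour offset with dateadd (Databricks SQL)
spark.sql("""
    SELECT id,
           dateadd(hour, Time_Zone_Offset, Start_Time) AS Local_Start_Time
    FROM MyTable
""").show()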
I have a date column of format YYYY-MM-DD and want to convert it to an int type, consecutively, where 1= Jan 1, 2000. So if I have a date 2000-01-31, it will convert to 31. If I have a date 2020-01-31 it will convert to (365*20yrs + 5 leap days), etc.
Is this possible to do in pandas?
I looked at Pandas: convert date 'object' to int, but this solution converts to an int 8 digits long.
First subtract the origin Timestamp from the column, convert the resulting timedeltas to days with Series.dt.days, and finally add 1:
df = pd.DataFrame({"Date": ["2000-01-29", "2000-01-01", "2014-03-31"]})
d = '2000-01-01'
df["new"] = pd.to_datetime(df["Date"]).sub(pd.Timestamp(d)).dt.days + 1
print( df )
Date new
0 2000-01-29 29
1 2000-01-01 1
2 2014-03-31 5204
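Going the other way, the integer can be turned back into a date by adding it (minus 1) as a timedelta to the same origin. A minimal sketch reusing d and the new column from above (the "back" column name is just for illustration):

# Recover the calendar date from the 1-based day number
df["back"] = pd.Timestamp(d) + pd.to_timedelta(df["new"] - 1, unit="D")
print(df)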
I am working with time series data in a pandas df that doesn't have a real calendar date, just an index value that marks an equal time interval between each value. I'm trying to convert it into a datetime type with daily or weekly frequency. Is there a way to keep the values the same while changing the type (i.e. without assigning an actual calendar date)?
Index,Col1,Col2
1,6.5,0.7
2,6.2,0.3
3,0.4,2.1
pd.to_datetime can create dates when given time units relative to some origin. The default origin is the POSIX epoch 1970-01-01 00:00:00, and the default unit is nanoseconds.
import pandas as pd
df['date1'] = pd.to_datetime(df.index, unit='D', origin='2010-01-01')
df['date2'] = pd.to_datetime(df.index, unit='W')
Output:
# Col1 Col2 date1 date2
#Index
#1 6.5 0.7 2010-01-02 1970-01-08
#2 6.2 0.3 2010-01-03 1970-01-15
#3 0.4 2.1 2010-01-04 1970-01-22
Alternatively, you can add timedeltas to the specified start:
pd.to_datetime('2010-01-01') + pd.to_timedelta(df.index, unit='D')
or just keep them as a timedelta:
pd.to_timedelta(df.index, unit='D')
#TimedeltaIndex(['1 days', '2 days', '3 days'], dtype='timedelta64[ns]', name='Index', freq=None)
I'm using Pandas to read a .csv file that has a 'Timestamp' date column in the format:
31/12/2016 00:00
I use the following line to convert it to a datetime64 dtype:
time = pd.to_datetime(df['Timestamp'])
The column has an entry every 15 minutes for almost a year, and I've run into a problem when I want to plot more than one month's worth.
Pandas seems to swap day and month on reading (the file uses DD/MM/YYYY but it is parsed as MM/DD/YYYY), so my plots have 30-day gaps whenever the datetime crosses into a new day; a plot of the first 5 days shows these jumps.
This is the raw data in the file on either side of the jump:
01/01/2017 23:45
02/01/2017 00:00
If I print the values being plotted (after reading) around the 1st jump, I get:
2017-01-01 23:45:00
2017-02-01 00:00:00
So is there a way to get pandas to read the dates properly?
Thanks!
You can specify a format parameter in pd.to_datetime to tell pandas how to parse the date exactly, which I suppose is what you need:
time = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M')
pd.to_datetime('02/01/2017 00:00')
#Timestamp('2017-02-01 00:00:00')
pd.to_datetime('02/01/2017 00:00', format='%d/%m/%Y %H:%M')
#Timestamp('2017-01-02 00:00:00')
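If you would rather not spell out the full format string, pandas can also be told to treat the first field as the day. A minimal sketch (the file name is just an example); note that dayfirst is a parsing hint rather than a strict format, so an explicit format= is the safer option:

# Day-first parsing at read time
df = pd.read_csv("data.csv", parse_dates=["Timestamp"], dayfirst=True)
# ...or when converting an existing column
time = pd.to_datetime(df["Timestamp"], dayfirst=True)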
I have the following two time columns, "Time1" and "Time2". I have to calculate the "Difference" column, which is (Time2 - Time1), in Pandas:
Time1 Time2 Difference
8:59:45 9:27:30 -1 days +23:27:45
9:52:29 10:08:54 -1 days +23:16:26
8:07:15 8:07:53 00:00:38
When Time1 and Time2 are in different hours, I get results like "-1 days +23:27:45". My desired output for the first two rows is given below:
Time1 Time2 Difference
8:59:45 9:27:30 00:27:45
9:52:29 10:08:54 00:16:26
How can I get this output in Pandas?
Both time values are in 'datetime64[ns]' dtype.
The issue is not that time1 and time2 are in different hours; it's that time2 ends up before time1, so time2 - time1 is negative, and this is how negative timedeltas are stored. If you just want the difference in minutes as a negative number, you could extract the minutes before calculating the difference:
(df.Time1.dt.minute - df.Time2.dt.minute)
I was not able to reproduce the issue using pandas 0.17.1:
import pandas as pd

d = {
    "start_time": ["8:59:45", "9:52:29", "8:07:15"],
    "end_time": ["9:27:30", "10:08:54", "8:07:53"]
}

df = pd.DataFrame(data=d)
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df.end_time - df.start_time
0 00:27:45
1 00:16:25
2 00:00:38
dtype: timedelta64[ns]