Adding variable hours to a timestamp in Spark SQL

I have one column Start_Time with a timestamp, and one column Time_Zone_Offset, an integer. How can I add the Time_Zone_Offset to Start_Time as a number of hours?
Example MyTable:
id  Start_Time           Time_Zone_Offset
1   2020-01-12 00:00:00  1
2   2020-01-12 00:00:00  2
Desired output:
id  Local_Start_Time
1   2020-01-12 01:00:00
2   2020-01-12 02:00:00
Attempt:
SELECT id, Start_time + INTERVAL time_zone_offset HOURS AS Local_Start_Time
FROM MyTable
This doesn't seem to work, and I can't use from_utc_timestamp as I don't have the actual timezone details, just the time-zone offset at the time being considered.

(I hope you are using PySpark.)
Indeed, I couldn't make it work with SQL, but I managed to get the result by converting to a Unix timestamp. It's probably not the best way, but it works. (I proceeded step by step to make sure the references were working; I thought I would need a user-defined function, but apparently not.)
from pyspark.sql import functions as F

# Step 1: convert Start_Time to a Unix timestamp (seconds since the epoch)
df2 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time"))
df2.show()
# Step 2: add the offset, converted from hours to seconds
df3 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time") + df["Time_Zone_Offset"] * 60 * 60)
df3.show()
# Step 3: convert back to a timestamp string; note the lowercase pattern letters
# ('yyyy-MM-dd', not 'YYYY-MM-DD', which would mean week-year and day-of-year)
df4 = df3.withColumn("Start_Time", F.from_unixtime("Start_Time", "yyyy-MM-dd HH:00:00"))
df4.show()
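For reference, the same steps can be collapsed into a single expression (a minimal sketch; df, Start_Time and Time_Zone_Offset are the names from the question):
from pyspark.sql import functions as F

# One-pass version of the same idea: to seconds, add offset hours, back to timestamp.
df_local = df.withColumn(
    "Local_Start_Time",
    F.from_unixtime(
        F.unix_timestamp("Start_Time") + F.col("Time_Zone_Offset") * 3600
    ).cast("timestamp"),
)
df_local.select("id", "Local_Start_Time").show()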

Adding an alternative to Benoit's answer, using a Python UDF:
from pyspark.sql import SQLContext
from datetime import datetime, timedelta
from pyspark.sql.types import TimestampType

# Defining a Python function to add hours onto a datetime
def addHours(my_datetime, hours):
    # Accounting for NULL (None in Python) values
    if (hours is None) or (my_datetime is None):
        adjusted_datetime = None
    else:
        adjusted_datetime = my_datetime + timedelta(hours=hours)
    return adjusted_datetime

# Registering the function as a UDF for use in SQL, and declaring the return type as
# TimestampType (this is important; the default is StringType).
# sqlContext is assumed to already exist, e.g. the one provided in a notebook session.
sqlContext.udf.register("add_hours", addHours, TimestampType())
followed by:
SELECT id, add_hours(Start_Time, Time_Zone_Offset) AS Local_Start_Time
FROM MyTable
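To run that SQL from PySpark, something along these lines should work (a minimal sketch; registering the DataFrame as a temporary view named MyTable, and the DataFrame name df, are assumptions):
# Expose the DataFrame to SQL, then call the registered UDF from a query.
df.createOrReplaceTempView("MyTable")
result = sqlContext.sql("""
    SELECT id, add_hours(Start_Time, Time_Zone_Offset) AS Local_Start_Time
    FROM MyTable
""")
result.show()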

Starting with Spark 3.0, you can use the make_interval(years, months, weeks, days, hours, mins, secs) function to add intervals built from values in other columns.
SELECT
id
, Start_time + make_interval(0, 0, 0, 0, time_zone_offset, 0, 0) AS Local_Start_Time
FROM MyTable
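The same expression can be used from the DataFrame API by wrapping the function in expr (a minimal sketch; df is assumed to be the DataFrame behind MyTable):
from pyspark.sql import functions as F

# make_interval is exposed as a SQL function, so call it via expr (Spark 3.0+).
df_local = df.withColumn(
    "Local_Start_Time",
    F.col("Start_Time") + F.expr("make_interval(0, 0, 0, 0, Time_Zone_Offset, 0, 0)"),
)
df_local.select("id", "Local_Start_Time").show()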

For anyone else coming to this question and using Spark SQL via Databricks, the dateadd function works the same way as in most other SQL dialects:
select dateadd(microsecond,30,'2022-11-04') as microsecond
,dateadd(millisecond,30,'2022-11-04') as millisecond
,dateadd(second ,30,'2022-11-04') as second
,dateadd(minute ,30,'2022-11-04') as minute
,dateadd(hour ,30,'2022-11-04') as hour
,dateadd(day ,30,'2022-11-04') as day
,dateadd(week ,30,'2022-11-04') as week
,dateadd(month ,30,'2022-11-04') as month
,dateadd(quarter ,30,'2022-11-04') as quarter
,dateadd(year ,30,'2022-11-04') as year
Output (a single row, shown transposed here for readability):
microsecond   2022-11-04T00:00:00.000+0000
millisecond   2022-11-04T00:00:00.030+0000
second        2022-11-04T00:00:30.000+0000
minute        2022-11-04T00:30:00.000+0000
hour          2022-11-05T06:00:00.000+0000
day           2022-12-04T00:00:00.000+0000
week          2023-06-02T00:00:00.000+0000
month         2025-05-04T00:00:00.000+0000
quarter       2030-05-04T00:00:00.000+0000
year          2052-11-04T00:00:00.000+0000
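Applied to the original question, this becomes a one-liner (a sketch assuming the Databricks SQL dialect this answer refers to, where the three-argument dateadd also accepts a column for the amount; the spark session and the MyTable view are assumptions):
# dateadd(unit, amount, expr) with a per-row amount taken from Time_Zone_Offset.
spark.sql("""
    SELECT id, dateadd(hour, Time_Zone_Offset, Start_Time) AS Local_Start_Time
    FROM MyTable
""").show()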

Related

How to convert a numeric column (without a calendar date) into datetime

I am working with time series data in a pandas DataFrame that doesn't have a real calendar date, just an index value indicating an equal time interval between each value. I'm trying to convert it into a datetime type with daily or weekly frequency. Is there a way to keep the values the same while changing the type (i.e. without setting an actual calendar date)?
Index,Col1,Col2
1,6.5,0.7
2,6.2,0.3
3,0.4,2.1
pd.to_datetime can create dates when given time units relative to some origin. The default is the POSIX origin 1970-01-01 00:00:00 and time in nanoseconds.
import pandas as pd
df['date1'] = pd.to_datetime(df.index, unit='D', origin='2010-01-01')
df['date2'] = pd.to_datetime(df.index, unit='W')
Output:
# Col1 Col2 date1 date2
#Index
#1 6.5 0.7 2010-01-02 1970-01-08
#2 6.2 0.3 2010-01-03 1970-01-15
#3 0.4 2.1 2010-01-04 1970-01-22
Alternatively, you can add timedeltas to the specified start:
pd.to_datetime('2010-01-01') + pd.to_timedelta(df.index, unit='D')
or just keep them as a timedelta:
pd.to_timedelta(df.index, unit='D')
#TimedeltaIndex(['1 days', '2 days', '3 days'], dtype='timedelta64[ns]', name='Index', freq=None)
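For completeness, a minimal self-contained sketch reproducing the above on the sample data from the question (the origin 2010-01-01 is just the example value used above):
import pandas as pd

# Rebuild the sample frame with the integer index from the question.
df = pd.DataFrame({"Col1": [6.5, 6.2, 0.4], "Col2": [0.7, 0.3, 2.1]},
                  index=pd.Index([1, 2, 3], name="Index"))

# Interpret the index as a number of days since an arbitrary origin...
df["date1"] = pd.to_datetime(df.index, unit="D", origin="2010-01-01")
# ...or simply keep it as an elapsed duration.
df["elapsed"] = pd.to_timedelta(df.index, unit="D")
print(df)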

Get the time spent since midnight in dataframe

I have a dataframe which has a column of type Timestamp. I want to find the time elapsed (in seconds) since midnight as a new column. How can I do this in a simple way?
Eg :
Input :
samples['time']
2018-10-01 00:00:01.000000000
2018-10-01 00:00:12.000000000
type(samples['time'].iloc[0])
<class 'pandas._libs.tslib.Timestamp'>
Output :
samples['time_elapsed']
1
12
The current answers are either too complicated or too specialized.
samples = pd.DataFrame(data=['2018-10-01 00:00:01', '2018-10-01 00:00:12'], columns=['time'], dtype='datetime64[ns]')
samples['time_elapsed'] = ((samples['time'] - samples['time'].dt.normalize()) / pd.Timedelta('1 second')).astype(int)
print(samples)
time time_elapsed
0 2018-10-01 00:00:01 1
1 2018-10-01 00:00:12 12
normalize() removes the time component from the datetime (it moves the clock back to midnight).
pd.Timedelta('1 second') sets the unit of measurement, i.e. the number of seconds in the timedelta.
.astype(int) truncates the fractional seconds to an integer; apply round() first if rounding is preferred.
Note that the date part may differ from row to row (the rows are not necessarily all from the same day), so you cannot take a single "base date" (midnight) for the whole DataFrame, as is done in one of the other solutions.
My intention was also not to "contaminate" the source DataFrame with intermediate columns, e.g. the time (actually date and time) string converted to a "true" datetime.
So my proposition is to:
convert the date/time string to a datetime,
take the time part from it,
compute the number of seconds from its hour / minute / second parts,
with all of the above steps in a dedicated function.
So to do the task, define a function:
def secSinceMidnight(datTimStr):
    tt = pd.to_datetime(datTimStr).time()
    return tt.hour * 3600 + tt.minute * 60 + tt.second
Then call:
samples['Secs'] = samples.time.apply(secSinceMidnight)
For source data:
samples = pd.DataFrame(data=[
[ '2018-10-01 00:00:01' ], [ '2018-10-01 00:00:12' ],
[ '2018-11-02 01:01:10' ], [ '2018-11-04 03:02:15' ] ],
columns = ['time']);
when you print the result, you will see:
time Secs
0 2018-10-01 00:00:01 1
1 2018-10-01 00:00:12 12
2 2018-11-02 01:01:10 3670
3 2018-11-04 03:02:15 10935
Doing this in Pandas is very simple!
midnight = pd.Timestamp('2018-10-01 00:00:00')
print((pd.Timestamp('2018-10-01 00:00:01.000000000') - midnight).seconds)
>
1
And by extension we can use an apply on a Pandas Series:
samples = pd.DataFrame(['2018-10-01 00:00:01.000000000', '2018-10-01 00:00:12.000000000'], columns=['time'])
samples.time = pd.to_datetime(samples.time)
midnight = pd.Timestamp('2018-10-01 00:00:00')
samples['time_elapsed'] = samples['time'].apply(lambda x: (x - midnight).seconds)
samples
>
time time_elapsed
0 2018-10-01 00:00:01 1
1 2018-10-01 00:00:12 12
Note that the other answers here use an alternative method: comparing the timestamp to itself normalized to a date, which zeros out the time-of-day component and so is equivalent to midnight of that day. That method might be slightly more performant.
I ran into the same problem in my one of my projects and here's how I solved it (assuming your time column has already been converted to Timestamp):
(samples['time'] - samples['time'].dt.normalize()) / pd.Timedelta(seconds=1)
The beauty of this approach is that you can change the last part to get seconds, minutes, hours or days elapsed:
... / pd.Timedelta(seconds=1) # seconds elapsed
... / pd.Timedelta(minutes=1) # minutes elapsed
... / pd.Timedelta(hours=1) # hours elapsed
... / pd.Timedelta(days=1) # days elapsed
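As a quick check, here is a minimal runnable sketch of that approach on the sample values from the question:
import pandas as pd

samples = pd.DataFrame({"time": pd.to_datetime(["2018-10-01 00:00:01",
                                                "2018-10-01 00:00:12"])})
# Seconds since midnight of each row's own date (a float; cast if an int is wanted).
samples["time_elapsed"] = (samples["time"] - samples["time"].dt.normalize()) / pd.Timedelta(seconds=1)
print(samples)
#                  time  time_elapsed
# 0 2018-10-01 00:00:01           1.0
# 1 2018-10-01 00:00:12          12.0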
datetime = samples['time']
(datetime - datetime.dt.normalize()).dt.total_seconds()
We can do:
(samples['time'].dt.hour * 3600
 + samples['time'].dt.minute * 60
 + samples['time'].dt.second)

How to generate a time series column from today to the next 600 days in pandas?

I'm a new pandas learner. I can generate a new column as follows:
dates = pd.date_range('2010-01-01', '2011-8-23', freq='D')
Output:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
'2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
'2010-01-09', '2010-01-10',
...
'2011-08-14', '2011-08-15', '2011-08-16', '2011-08-17',
'2011-08-18', '2011-08-19', '2011-08-20', '2011-08-21',
'2011-08-22', '2011-08-23'],
dtype='datetime64[ns]', length=600, freq='D')
My question is: what should we do if we only know the starting date and the period (600 days), but not the ending date? How should the code be modified?
And a follow-up question: how do I set the starting date to today's or yesterday's date?
Just change periods to 600 (it is 5 in the example below) and you should get your output:
pd.date_range(start='2010-01-01', periods=5, freq='D')
Out[335]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
'2010-01-05'],
dtype='datetime64[ns]', freq='D')
To get today's date:
pd.to_datetime('today')
Out[338]: Timestamp('2017-09-29 00:00:00')
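Putting the two together gives a direct answer to the question (a small sketch; normalize() just drops the time-of-day part and is an extra step not shown above):
import pandas as pd

# 600 consecutive days starting from today.
start = pd.to_datetime('today').normalize()
dates = pd.date_range(start=start, periods=600, freq='D')

# Or start from yesterday instead:
dates_from_yesterday = pd.date_range(start=start - pd.Timedelta(days=1), periods=600, freq='D')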
First, import core package datetime
import datetime
Then you can instantiate a datetime object and add 600 days to it using a timedelta:
start_date = "2010-01-01"
start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
end_date = start_date + datetime.timedelta(days=600)
To now get the string back, we can use strftime() like:
end_date = end_date.strftime("%Y-%m-%d")
> '2011-08-24'
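If the goal is still a pandas series of dates, the computed end date can then be fed to pd.date_range (a small sketch continuing from the variables above):
import pandas as pd

# Build the daily series between the computed endpoints.
# Note: date_range includes both endpoints, so this yields 601 entries;
# use periods=600 instead if exactly 600 days are required.
dates = pd.date_range(start=start_date, end=end_date, freq='D')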

Difference between two SimpleDateFormat dates in SQL

I am attempting to calculate the difference between an entry and exit time by doing something like this:
SELECT entry_time - exit_time AS dwell
However, I believe the values are in SimpleDateFormat:
2017-07-15T13:00:37-05:00
and as a result, I have had trouble figuring out what to CAST or CONVERT them to in PostgreSQL.
I was able to do this in PySpark, but now I need to do this just using standard SQL. Here is an example of what worked in Spark:
import pyspark.sql.functions as F
timeFmt = "yyyy-MM-dd'T'HH:mm:ss"
timeDiff = (F.unix_timestamp('exit_time', format=timeFmt)
- F.unix_timestamp('entry_time', format=timeFmt))
new_sdf = sdf.withColumn("Dwell", timeDiff)
Any help is greatly appreciated!
You can just cast such a value to get a timestamp in PostgreSQL:
SELECT CAST('2017-07-15T13:00:37-05:00' AS timestamp with time zone);
timestamptz
------------------------
2017-07-15 20:00:37+02
(1 row)
Then use timestamp arithmetic and subtract the two timestamps to get an interval.
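For the dwell-time calculation itself, something like the following should work from Python (a rough sketch using psycopg2; the table name visits and the connection details are hypothetical):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection details
with conn.cursor() as cur:
    # Cast the ISO-8601 strings to timestamptz, subtract to get an interval,
    # and extract the elapsed time in seconds.
    cur.execute("""
        SELECT EXTRACT(EPOCH FROM (exit_time::timestamptz - entry_time::timestamptz)) AS dwell_seconds
        FROM visits
    """)
    rows = cur.fetchall()
print(rows)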

Subtract Days from a Date in Apache Hive

How can I subtract a number of days from a date, getting another date as the result? For example: 01/12/2016 - 10 = 21/11/2016.
(date argument)
hive> select date_sub(date '2016-12-01',10);
OK
2016-11-21
or
(string argument)
hive> select date_sub('2016-12-01',10);
OK
2016-11-21
date_sub(date/timestamp/string startdate, tinyint/smallint/int days)
Subtracts a number of days to startdate: date_sub('2008-12-31', 1) =
'2008-12-30'. Prior to Hive 2.1.0 (HIVE-13248) the return type was a
String because no Date type existed when the method was created.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
There is a Hive UDF to subtract days from a date (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions). You have two options: transform your date into the following format so you can use the UDF directly:
yyyy-MM-dd
or transform your current date to a timestamp and apply the UDF, for example:
date_sub(from_unixtime(unix_timestamp('12/03/2010', 'dd/MM/yyyy')), 10) -- subtracts 10 days
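Since the parent question is about Spark, the PySpark equivalent looks something like this (a small sketch; df and the column name event_date are hypothetical, and to_date with a format string needs Spark 2.2+):
from pyspark.sql import functions as F

# Parse the dd/MM/yyyy string into a date, then subtract 10 days.
df_out = df.withColumn("new_date", F.date_sub(F.to_date("event_date", "dd/MM/yyyy"), 10))
df_out.show()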
I hope it helps,
regards!