Extracting HOUR from an interval in Spark SQL - apache-spark-sql

I was wondering how to properly extract the number of hours between two given timestamp objects.
For instance, when the following SQL query gets executed:
select x, extract(HOUR FROM x) as result
from
(select (TIMESTAMP'2021-01-22T05:00:00' - TIMESTAMP'2021-01-01T09:00:00') as x)
The result value is 20, while I'd expect it to be 500.
It seems odd to me, considering that the value of x itself shows the expected result.
Can anyone explain what I'm doing wrong, and perhaps suggest another way to write the query so that it returns the desired result?
Thanks in advance!

I think you have to do the maths with this one, as datediff in Spark SQL only supports days. This worked for me:
SELECT (unix_timestamp(to_timestamp('2021-01-22T05:00:00')) - unix_timestamp(to_timestamp('2021-01-01T09:00:00'))) / 60 / 60 diffInHours
My results (in a Synapse notebook, not Databricks, but I expect it to be the same) show diffInHours = 500.0.
The unix_timestamp function converts the timestamp to a Unix timestamp (in seconds), and then you can apply date math to it. Subtracting the two gives the number of seconds between the timestamps. Divide by 60 for the number of minutes, and by 60 again for the number of hours.
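To address the EXTRACT question directly: subtracting two timestamps yields an interval (here 20 days and 20 hours), and extract(HOUR FROM x) returns only the hours field of that interval, which is why you get 20 instead of 500. A hedged sketch of combining the fields instead (assuming a Spark 3.x environment where extract works on day-time intervals, as the question's own query suggests):
-- x is an interval of 20 days 20 hours; EXTRACT returns individual fields
SELECT
  extract(DAY FROM x)  AS days,   -- 20
  extract(HOUR FROM x) AS hours,  -- 20 (the hours field, not the total)
  extract(DAY FROM x) * 24 + extract(HOUR FROM x) AS total_hours  -- 500
FROM (SELECT TIMESTAMP'2021-01-22T05:00:00' - TIMESTAMP'2021-01-01T09:00:00' AS x);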

Related

Calculate the only necessary difference and Group by between two Timestamps in PostgreSQL

I've seen a similar question, but didn't find a helpful answer for my problem.
I have one query:
select au.username
,ej.id as "job_id"
,max(ec."timestamp") as "last_commit"
,min(ec."timestamp") as "first_commit"
,age(max(ec."timestamp"), min(ec."timestamp")) as "diff as age"
,to_char(age(max(ec."timestamp")::timestamp, min(ec."timestamp")::timestamp),'HH:MI:SS') as "diff as char"
,et.id as "task_id"
from table and etc..
And this is my output (sorry for the picture, but it's the best view):
So, as you can see, I have timestamps with time zones, and I'm trying to calculate the difference between last_commit and first_commit. With the age function it goes well, but I need to extract only the hours and minutes from this subtraction. N.B.: only hours and minutes, not days. For example, for job_id = 1 (first row) the difference is 2 minutes and 42 seconds, and for job_id = 2 (second row) it is 2 hours, 2 minutes and 55 seconds, not 16 days times 24 hours; I don't need to count the days. When I try to_char, it doesn't return exactly what I expect. The last two columns (in green in my picture) show what I expect and want. So, for every row, calculate the difference between the last and first commit including only hours and minutes (in other words, calculate only the time, not the dates), and also calculate the total sum per task_id, as represented in the last column of the picture.
Thanks.
Try this:
SELECT age(max(ec."timestamp"), min(ec."timestamp")) - date_trunc('day', age(max(ec."timestamp"), min(ec."timestamp")))
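A hedged illustration with made-up timestamps, showing how subtracting the day-truncated age strips the days and leaves only the sub-day part:
-- age(...) here is 16 days 02:02:55; date_trunc('day', ...) keeps the 16 days,
-- so the subtraction leaves 02:02:55
SELECT age(TIMESTAMP '2021-01-17 12:02:55', TIMESTAMP '2021-01-01 10:00:00')
     - date_trunc('day', age(TIMESTAMP '2021-01-17 12:02:55', TIMESTAMP '2021-01-01 10:00:00')) AS hhmmss;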
You can try converting a Timestamp type to just time, like in this answer.
The resulting SQL string is:
select au.username
,ej.id as "job_id"
,max(ec."timestamp") as "last_commit"
,min(ec."timestamp") as "first_commit"
,(max(ec."timestamp")::time-min(ec."timestamp")::time) as "diff as age"
,to_char(age(max(ec."timestamp")::timestamp, min(ec."timestamp")::timestamp),'HH:MI:SS') as "diff as char"
,et.id as "task_id"
There are other possible solutions working with timestamps, but this is the one I consider simplest.
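For illustration, a hedged sketch of the ::time cast with made-up values; note that it keeps only the time-of-day part, so the result can go negative when the later commit's time of day is earlier than the first's:
SELECT TIMESTAMP '2021-01-17 12:02:55'::time
     - TIMESTAMP '2021-01-01 10:00:00'::time AS diff; -- 02:02:55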

How do I do date diff in a Spark SQL environment?

I have a table with a creation date and an action date. I'd like to get the number of minutes between the two dates. I looked at the docs and I'm having trouble finding a solution.
%sql
SELECT datediff(creation_dt, actions_dt)
FROM actions
limit 10
This gives me the number of days between the two dates. One record looks like
2019-07-31 23:55:22.0 | 2019-07-31 23:55:21 | 0
How can I get the number of minutes?
As stated in the comments, if you are using Spark or PySpark then the withColumn method is best.
BUT
If you are using the Spark SQL environment, then you could use the unix_timestamp() function to get what you need:
select ((unix_timestamp('2019-09-09','yyyy-MM-dd') - unix_timestamp('2018-09-09','yyyy-MM-dd'))/60);
Swap the dates with your column names and define your date pattern as the parameters. Both dates are converted into seconds, and the difference is taken. We then divide by 60 to get the minutes.
525600.0
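Adapted to the actions table from the question, a hedged sketch (assuming creation_dt and actions_dt are already timestamp columns, so no format pattern is needed):
SELECT creation_dt,
       actions_dt,
       (unix_timestamp(actions_dt) - unix_timestamp(creation_dt)) / 60 AS diff_minutes
FROM actions
LIMIT 10;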

SQL Server adding two time columns in a single table and putting result into a third column

I have a table containing two time columns like this:
Time1 Time2
07:34:33 08:22:44
I want to add the times in these two columns and put the result of the addition into a third column, say Time3.
Any help would be appreciated. Thanks!
If the value you expect as the result is 15:57:17 then you can get it by calculating for instance the number of seconds from midnight for Time1 and add that value to Time2:
select dateadd(second,datediff(second,0,time1),time2) as Time3
from your_table
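A worked sketch with the sample values from the question: DATEDIFF(second, 0, time1) counts the seconds from midnight to time1 (07:34:33 is 27273 seconds), which DATEADD then adds onto time2:
SELECT DATEADD(second,
               DATEDIFF(second, 0, CAST('07:34:33' AS time)), -- 27273 seconds since midnight
               CAST('08:22:44' AS time)) AS Time3;            -- 15:57:17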
I'm not sure how meaningful adding two discrete time values together is, though, unless they are meant to represent durations. In that case the time datatype might not be the best fit: it is meant for time-of-day data, only has a range of 00:00:00.0000000 through 23:59:59.9999999, and an addition could overflow (and hence wrap around).
If the result you want isn't 15:57:17 then you should clarify the question and add the desired output.
The engine doesn't understand addition of two time values, because it thinks you can't add two times of day. You get:
Msg 8117, Level 16, State 1, Line 8
Operand data type time is invalid for add operator.
If these are elapsed times, not times of day, you could take them apart with DATEPART, but in SQL Server 2008 you will have to use a CONVERT to put the value back together, plus have all the gymnastics to do base 60 addition.
If you have the option, it would be best to store the time values as NUMERIC with a positive scale, where the unit of measure is hours, and then break them down when finally reporting them. Something like this:
DECLARE
@r NUMERIC(7, 5); -- duration in hours, e.g. 8.856 hours
SET @r = 8.856;
SELECT FLOOR(@r) AS Hours, FLOOR(60 * (@r - FLOOR(@r))) AS Minutes, 60 * ((60 * @r) - FLOOR(60 * @r)) AS Seconds
Returns
Hours Minutes Seconds
8 51 21.60000
There is an advantage to writing a user-defined function to do this, to eliminate the repeated 60 * @r calculations.
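A hedged sketch of such a function (the name dbo.HoursToHMS and the inline table-valued form are my own choices, not from the answer); it centralizes the arithmetic in one place:
CREATE FUNCTION dbo.HoursToHMS (@r NUMERIC(7, 5))
RETURNS TABLE
AS
RETURN
(
    SELECT FLOOR(@r) AS Hours,
           FLOOR(60 * (@r - FLOOR(@r))) AS Minutes,
           60 * ((60 * @r) - FLOOR(60 * @r)) AS Seconds
);
-- Usage: SELECT * FROM dbo.HoursToHMS(8.856);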

sqlalchemy select by date column only x newest days

Suppose I have a table MyTable with a column some_date (date type, of course) and I want to select the newest 3 months of data (or x days).
What is the best way to achieve this?
Please note that the date should not be measured from today but rather from the date range in the table (which might be older than today).
I need to find the maximum date and compare it to each row - if the difference is less than x days, return it.
All of this should be done with sqlalchemy and without loading the entire table.
What is the best way of doing it? Must I use a subquery to find the maximum date? How do I select the last X days?
Any help is appreciated.
EDIT:
The following query works in Oracle but seems inefficient (is max calculated for each row?), and I don't think it'll work for all dialects:
select * from my_table where (select max(some_date) from my_table) - some_date < 10
You can do this in a single query and without resorting to creating datediff.
Here is an example I used for getting everything in the past day:
from datetime import datetime, timedelta

one_day = timedelta(hours=24)
one_day_ago = datetime.now() - one_day
Message.query.filter(Message.created > one_day_ago).all()
You can adapt the timedelta to whatever time range you are interested in.
UPDATE
Upon re-reading your question, it looks like I failed to take into account that you want to compare two dates which are in the database, rather than today's date. I'm pretty sure this sort of behavior is going to be database-specific. In Postgres, you can use straightforward arithmetic.
Operations with DATEs:
1. The difference between two DATEs is always an INTEGER, representing the number of DAYS difference:
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
2. You may add or subtract an INTEGER to/from a DATE to produce another DATE:
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
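Applied to the question's table, that arithmetic supports a single-query form; here is a hedged Postgres sketch (table and column names from the question) that computes the max once in a CTE, sidestepping the per-row max worry:
WITH mx AS (SELECT max(some_date) AS max_date FROM my_table)
SELECT t.*
FROM my_table t, mx
WHERE mx.max_date - t.some_date < 10; -- date - date yields INTEGER days, per rule 1 above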
You're probably using timestamps if you are storing dates in Postgres. Doing math with timestamps produces an interval object. SQLAlchemy works with timedeltas as a representation of intervals. So you could do something like:
from datetime import timedelta

one_day = timedelta(hours=24)
Model.query.join(ModelB, Model.created - ModelB.created < one_day)
I haven't tested this exactly, but I've done things like this and they have worked.
I ended up doing two selects - one to get the max date and another to get the data.
Using the datediff recipe from this thread, I added a datediff function and used the query q = session.query(MyTable).filter(datediff(max_date, some_date) < 10).
I still don't think this is the best way, but until someone proves me wrong, it will have to do...
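In plain SQL, those two selects amount to something like this sketch (:max_date is the value fetched by the first statement and bound into the second):
SELECT max(some_date) FROM my_table;                      -- step 1: fetch the max date
SELECT * FROM my_table WHERE :max_date - some_date < 10;  -- step 2: last 10 days of data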

Difference between Timestamp is 15 minutes

This is the CREATED_TIME, 2012-07-17 00:00:22, and this is the corresponding timestamp, 1342508427000. Here the timestamp is 5 seconds more than CREATED_TIME. I need to handle the scenario below.
Currently I have a query, in which I am joining on created_time and timestamp like this-
ON (UNIX_TIMESTAMP(testingtable1.created_time) = (prod_and_ts.timestamps / 1000))
So in the above case it will not match, as the timestamp is 5 seconds more than created_time. But if the difference between the two is within 15 minutes, then I need them to match.
So I wrote the JOIN condition below - I am not sure whether this is the right way to do it or not:
ON ((UNIX_TIMESTAMP(testingtable1.created_time) - (prod_and_ts.timestamps / 1000)) / 60* 1000 <= 15)
How can I do the above, so that if the difference between the timestamps is within 15 minutes, the data gets matched by the ON clause?
I'd prefer the (specifically designed for this purpose!) date and time functions instead of doing all these kinds of calculations with timestamps. You wouldn't believe how much trouble this can cause. Make sure you read and understand this and this.
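If you do stay with the question's UNIX_TIMESTAMP arithmetic, a hedged correction of the ON clause: once the millisecond value is divided by 1000, both sides are in seconds, so compare against 15 * 60, and wrap the difference in ABS so it matches in either direction:
ON ABS(UNIX_TIMESTAMP(testingtable1.created_time) - prod_and_ts.timestamps / 1000) <= 15 * 60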