How do I do a date diff in a Spark SQL environment?

I have a table with a creation date and an action date. I'd like to get the number of minutes between the two dates. I looked at the docs and I'm having trouble finding a solution.
%sql
SELECT datediff(creation_dt, actions_dt)
FROM actions
limit 10
This gives me the number of days between the two dates. One record looks like
2019-07-31 23:55:22.0 | 2019-07-31 23:55:21 | 0
How can I get the number of minutes?

As stated in the comments, if you are using Spark or PySpark then the withColumn method is best.
BUT
If you are using the Spark SQL environment, then you could use the unix_timestamp() function to get what you need:
select ((unix_timestamp('2019-09-09','yyyy-MM-dd') - unix_timestamp('2018-09-09','yyyy-MM-dd'))/60);
Swap the date literals for your column names and pass your date pattern as the second parameter.
Both dates are converted into seconds and the difference is taken. We then divide by 60 to get the minutes.
525600.0
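Applied to the table from the original question (the columns are already timestamps, so no format pattern is needed), the same idea would look roughly like this:
%sql
SELECT creation_dt,
       actions_dt,
       (unix_timestamp(creation_dt) - unix_timestamp(actions_dt)) / 60 AS diff_minutes
FROM actions
LIMIT 10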

Related

SQL - Returning max count, after breaking down a day into hourly rows

I need to write a SQL query that helps return the highest count in a given hourly range. The problem is that my table just logs orders as they come and doesn't have a unique identifier that separates one hour from the next.
So basically, I need to find the highest number of orders in any given hour between 7/08/2022 and 7/15/2022, from a table that does not distinguish distinct hour sets and logs orders as they come.
I have tried to use a query that combines MAX(), COUNT(), and DATETIME(), but to no avail.
Can I please receive some help?
I've had to tackle this kind of measurement in the past.
Here's what I did for 15 minute intervals:
My datetime column is named datreg in my database log area.
cast(round(floor(cast(datreg as float(53))*24*4)/(24*4),5) as smalldatetime)
I multiply by an extra 4 in this formula to split each hour of the 24-hour period into four 15-minute intervals. For you it would look like this to get just hourly intervals:
cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime)
This is a little piece of magic when it comes to dashboards and reports.
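Putting that bucket expression to work on the original question, a minimal sketch (assuming SQL Server, which the smalldatetime cast implies, and a hypothetical log table orders with a datetime column datreg) would count orders per hourly bucket over the requested range and keep the busiest hour:
WITH hourly AS (
    -- truncate each order timestamp to the start of its hour, then count per hour
    SELECT cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime) AS hour_bucket,
           COUNT(*) AS order_count
    FROM orders
    WHERE datreg >= '2022-07-08' AND datreg < '2022-07-16'
    GROUP BY cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime)
)
SELECT TOP 1 hour_bucket, order_count
FROM hourly
ORDER BY order_count DESC;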

extracting HOUR from an interval in spark sql

I was wondering how to properly extract the number of hours between two given timestamp objects.
For instance, when the following SQL query gets executed:
select x, extract(HOUR FROM x) as result
from
(select (TIMESTAMP'2021-01-22T05:00:00' - TIMESTAMP'2021-01-01T09:00:00') as x)
The result value is 20, while I'd expect it to be 500.
It seems odd to me, considering that the displayed value of x reflects the expected return value.
Can anyone please explain what I'm doing wrong and perhaps suggest another way of writing the query so that the desired result is returned?
Thanks in advance!
I think you have to do the maths with this one, as datediff in Spark SQL only supports days. This worked for me:
SELECT (unix_timestamp(to_timestamp('2021-01-22T05:00:00') ) - unix_timestamp(to_timestamp('2021-01-01T09:00:00'))) / 60 / 60 diffInHours
My results (in a Synapse notebook, not Databricks, but I expect it to be the same) came out to the expected 500 hours.
The unix_timestamp function converts the timestamp to a Unix timestamp (in seconds), and then you can apply date math to it. Subtracting them gives the number of seconds between the two timestamps. Divide by 60 for the number of minutes between the two dates, and by 60 again for the number of hours between the two dates.
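Applied to column values instead of literals, the same calculation would look something like this (the table events and the columns start_ts and end_ts are placeholder names):
SELECT (unix_timestamp(end_ts) - unix_timestamp(start_ts)) / 60 / 60 AS diffInHours
FROM events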

What is the fastest way to populate table with dates after certain day?

Let's assume that we have the following input parameters:
date [Date]
period [Integer]
The task is the following: build the table which has two columns: date and dayname.
So, if we have date = 2018-07-12 and period = 3 the table should look like this:
date |dayname
-------------------
2018-07-12|THURSDAY
2018-07-13|FRIDAY
2018-07-14|SATURDAY
My solution is the following:
select add_days(:date, -1) into previousDay from "DUMMY";
for i in 1..:period do
    select add_days(:previousDay, :i) into nextDay from "DUMMY";
    :result.insert((:nextDay, dayname(:nextDay)));
end for;
but I don't like the loop. I assume it might become a performance problem if there are more complicated values that I want to put into the result table.
What would be the better solution to achieve the target?
Running through a loop and inserting values one by one is most certainly the slowest possible option to accomplish the task.
Instead, you could use SAP HANA's time series feature.
With a statement like
SELECT to_date(GENERATED_PERIOD_START)
FROM SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', '01.01.0001', '31.12.9999')
you could generate a bounded range of valid dates with a given interval length.
In my tests, using this approach brought the time to insert a set of dates from ca. 9 minutes down to 7 seconds...
I've written about that some time ago here and here if you want some more examples for that.
In those examples, I even included the use of series tables that allow for efficient compression of timestamp column values.
The Series Data functions include SERIES_GENERATE_DATE, which returns a set of values in DATE format, so you don't have to convert the returned data into the desired date format yourself.
Here is some sample code:
declare d int := 5;
declare dstart date := '01.01.2018';
SELECT generated_period_start FROM SERIES_GENERATE_DATE('INTERVAL 1 DAY', :dstart, add_days(:dstart, :d));
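To produce both columns the original question asks for, DAYNAME (a built-in HANA function) can be applied to the generated dates. A minimal sketch, reusing the variables above:
SELECT generated_period_start AS "date",
       dayname(generated_period_start) AS dayname
FROM SERIES_GENERATE_DATE('INTERVAL 1 DAY', :dstart, add_days(:dstart, :d));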

sqlalchemy select by date column only x newest days

Suppose I have a table MyTable with a column some_date (date type, of course) and I want to select the newest 3 months of data (or x days).
What is the best way to achieve this?
Please notice that the date should not be measured from today but rather from the date range in the table (which might be older than today).
I need to find the maximum date and compare it to each row - if the difference is less than x days, return it.
All of this should be done with sqlalchemy and without loading the entire table.
What is the best way of doing it? Must I have a subquery to find the maximum date? How do I select the last X days?
Any help is appreciated.
EDIT:
The following query works in Oracle but seems inefficient (is max calculated for each row?) and I don't think that it'll work for all dialects:
select * from my_table where (select max(some_date) from my_table) - some_date < 10
You can do this in a single query and without resorting to creating datediff.
Here is an example I used for getting everything in the past day:
from datetime import datetime, timedelta

one_day = timedelta(hours=24)
one_day_ago = datetime.now() - one_day
Message.query.filter(Message.created > one_day_ago).all()
You can adapt the timedelta to whatever time range you are interested in.
UPDATE
Upon re-reading your question, it looks like I failed to take into account the fact that you want to compare two dates which are in the database rather than today's date. I'm pretty sure that this sort of behavior is going to be database specific. In Postgres, you can use straightforward arithmetic.
Operations with DATEs
1. The difference between two DATES is always an INTEGER, representing the number of DAYS difference
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
2. You may add or subtract an INTEGER to a DATE to produce another DATE
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
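So, if some_date really is a DATE column, the query from the question maps directly onto this arithmetic. A sketch in Postgres syntax, using the table and column names from the question:
SELECT *
FROM my_table
-- the uncorrelated MAX() subquery is evaluated once, not once per row
WHERE some_date > (SELECT MAX(some_date) FROM my_table) - 10;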
You're probably using timestamps if you are storing dates in Postgres. Doing math with timestamps produces an interval object. SQLAlchemy works with timedeltas as a representation of intervals. So you could do something like:
one_day = timedelta(hours=24)
Model.query.join(ModelB, Model.created - ModelB.created < one_day)
I haven't tested this exactly, but I've done things like this and they have worked.
I ended up doing two selects: one to get the max date and another to get the data.
Using the datediff recipe from this thread, I added a datediff function and used the query q = session.query(MyTable).filter(datediff(max_date, some_date) < 10).
I still don't think this is the best way, but until someone proves me wrong, it will have to do...

How to find the last 7 days' records using Pig Latin?

I am a beginner with Pig Latin. I have a requirement to find the last 7 days' records from a CSV that contains the last 4 years of data.
Can anyone help me understand this?
A more generic way is to compare each line of data and check whether it is older than 7 days or not.
For this, we need to capture the date in each line of data. Let the data set be a relation dataSet with a field named date of chararray type.
In Pig 0.11 you can convert the date field from chararray to datetime data type using the ToDate() function, and then get the difference between the CurrentTime() and date using DaysBetween() and filter according to it. For example:
lastSevenDaysRec = FILTER dataSet BY DaysBetween(CurrentTime(), ToDate(date, 'yyyy MM dd')) <= 7;
You can check the Pig Latin documentation for a detailed understanding of the different date/time functions. You can also have a look at the valid formats to use in the ToDate function.
Assuming that your set of data is A and there is one line per day, and it has a field named date, you could try something similar to this:
B = GROUP A BY date;
C = ORDER B BY group DESC;
D = LIMIT C 7;
And then, D would hold the last seven days' records, grouped by date.