Recursive CTE in Amazon Redshift - sql

We are trying to port some code to run on Amazon Redshift, but Redshift won't run the recursive CTE. Any good soul who knows how to port this?
with tt as (
      select t.*, row_number() over (partition by id order by time) as seqnum
      from t
     ),
     recursive cte as (
      select t.*, time as grp_start
      from tt
      where seqnum = 1
      union all
      select tt.*,
             (case when tt.time < cte.grp_start + interval '3 second'
                   then tt.time
                   else tt.grp_start
              end)
      from cte join
           tt
           on tt.seqnum = cte.seqnum + 1
     )
select cte.*,
       (case when grp_start = lag(grp_start) over (partition by id order by time)
             then 0 else 1
        end) as isValid
from cte;
Alternatively, different code that reproduces the logic below would also work.
The result is binary:
it is 1 if it is the first known value of an ID
it is 1 if it is 3 seconds or later than the previous "1" of that ID
it is 0 if it is less than 3 seconds after the previous "1" of that ID
Note 1: this is not the difference in seconds from the previous record
Note 2: there are many IDs in the data set
Note 3: the original dataset has ID and Date
Desired output:
https://i.stack.imgur.com/k4KUQ.png
Dataset poc:
http://www.sqlfiddle.com/#!15/41d4b

As of this writing, Redshift does support recursive CTEs: see the AWS documentation for WITH RECURSIVE.
Two things to note when creating a recursive CTE in Redshift:
start the query with with recursive
declare column names for every recursive CTE
Consider the following example, which builds a list of dates with a recursive CTE:
with recursive
start_dt as (select current_date s_dt)
, end_dt as (select dateadd(day, 1000, current_date) e_dt)
-- the recursive cte; note the declaration of the column `dt`
, dates (dt) as (
    -- start at the start date
    select s_dt dt from start_dt
    union all
    -- recursive part
    select dateadd(day, 1, dt)::date dt -- cast to date to avoid a type mismatch
    from dates
    where dt <= (select e_dt from end_dt) -- stop at the end date
)
select *
from dates
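Applying those two rules, here is a sketch of how the query in the question might be ported (it assumes a source table t with columns id and time; it also adds the id condition the original join appears to be missing, since there are many IDs, and picks the case branches so that grp_start is kept while a row is within 3 seconds of it, per the stated rules):
with recursive
tt as (
    select id, time,
           row_number() over (partition by id order by time) as seqnum
    from t
),
-- column names must be declared for the recursive CTE
cte (id, time, seqnum, grp_start) as (
    -- anchor: the first row of each id starts its own group
    select id, time, seqnum, time as grp_start
    from tt
    where seqnum = 1
    union all
    -- recursive part: walk forward one row per id, carrying grp_start along
    select tt.id, tt.time, tt.seqnum,
           case when tt.time < cte.grp_start + interval '3 seconds'
                then cte.grp_start  -- still inside the 3-second window
                else tt.time        -- 3 seconds or later: a new window starts here
           end
    from cte
    join tt on tt.id = cte.id
           and tt.seqnum = cte.seqnum + 1
)
select id, time,
       case when time = grp_start then 1 else 0 end as isValid
from cte;
A row is flagged 1 exactly when it starts its own window, which matches the three rules above.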

The code below could help you. It computes each row's gap in seconds from the previous row per id (sec_diff), then flags rows where there is no previous gap or where the prior gap exceeds the current one by more than 3.
SELECT id, time,
       CASE WHEN sec_diff IS NULL OR prev_sec_diff - sec_diff > 3
            THEN 1
            ELSE 0
       END
FROM (
    SELECT id, time, sec_diff,
           LAG(sec_diff) OVER (PARTITION BY id ORDER BY time ASC) AS prev_sec_diff
    FROM (
        SELECT id, time,
               DATE_PART('s', time - LAG(time) OVER (PARTITION BY id ORDER BY time ASC)) AS sec_diff
        FROM hon
    ) x
) y

Related

greenplum string_agg conversion into hivesql supported

We are migrating a Greenplum SQL query to HiveQL, and it uses string_agg, shown below. How do we migrate it? The sample Greenplum code below needs to be ported to Hive.
select string_agg(Display_String, ';' order by data_day)
from
(
    select data_day,
           sum(revenue)/1000000.00 as revenue,
           data_day||' '||trim(to_char(sum(revenue),'9,999,999,999')) as Display_String
    from (
        select case when data_date = current_date then 'D:'
                    when data_date = current_date - 1 then ' D-01:'
                    when data_date = current_date - 2 then ' D-02:'
                    when data_date = current_date - 7 then ' D-07:'
                    when data_date = current_date - 28 then ' D-28:'
               end data_day,
               revenue/1000000.00 revenue
        from test.testable
        where data_date between current_date - 28 and current_date
          and hour <= (select hour
                       from (select row_number() over (order by hour desc) iRowsID, hour
                             from test.testable
                             where data_date = current_date and type = 'UVC') tbl1
                       where iRowsID = 2)
          and type in ('UVC')
        order by 1 desc
    ) a
    group by 1
) aa;
There is nothing exactly like this in Hive. However, you can use collect_list with a partition by/order by window to calculate it.
select concat_ws(';', max(concat_str))
from (
    select collect_list(Display_String) over (order by data_day) concat_str
    from (your above SQL) s
) concat_qry
Explanation -
collect_list concatenates the values, and the order by in the window orders the data by the day column.
The outermost max() picks up the complete concatenated string.
Please note this is a very slow operation. Test performance before implementing it.
Here is a sample SQL to help you.
select id, concat_ws(';', max(concat_str))
from (
    select s.id,
           collect_list(s.c) over (partition by s.id order by s.c) concat_str
    from (
        select 1 id, 'ax' c union
        select 1, 'b' union
        select 2, 'f' union
        select 2, 'g' union all
        select 1, 'b' union all
        select 1, 'b'
    ) s
) gs
group by id

How to get max date among other ids for current id using BigQuery?

I need to get the max date for each row over the other ids. Of course, I can do this with CROSS JOIN and JOIN, like this:
WITH t AS (
    SELECT 1 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-09-01','2021-09-09', INTERVAL 1 DAY)) rep_date
    UNION ALL
    SELECT 2 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-20','2021-09-03', INTERVAL 1 DAY)) rep_date
    UNION ALL
    SELECT 3 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-25','2021-09-05', INTERVAL 1 DAY)) rep_date
)
SELECT id, rep_date, MAX(rep_date) OVER (PARTITION BY id) max_date, max_date_over_others
FROM t
JOIN (
    SELECT t.id, MAX(max_date) max_date_over_others
    FROM t
    CROSS JOIN (
        SELECT id, MAX(rep_date) max_date
        FROM t
        GROUP BY 1
    ) t1
    WHERE t1.id <> t.id
    GROUP BY 1
) USING (id)
But it's too unwieldy for huge tables. So I'm looking for some simpler way to do this. Any ideas?
Your version is good enough, I think. But if you want to try other options, consider the approach below. It may look more verbose at first glance, but it should be more optimal and cheaper compared with your cross join version, because it collapses the table to one row per id before applying window functions.
with temp as (  -- assumes t from the question (as a table or a preceding CTE)
    select id,
           greatest(
               ifnull(max(max_date_for_id) over preceding_ids, '1970-01-01'),
               ifnull(max(max_date_for_id) over following_ids, '1970-01-01')
           ) as max_date_for_rest_ids
    from (
        select id, max(rep_date) max_date_for_id
        from t
        group by id
    )
    window
        preceding_ids as (order by id rows between unbounded preceding and 1 preceding),
        following_ids as (order by id rows between 1 following and unbounded following)
)
select *
from t
join temp
using (id)
Assuming your original table data just has columns id and dt, wouldn't this solve it? I'm using the fact that if an id holds the overall max dt, then the second-highest dt is its max over the other id values.
WITH max_dates AS
(
    SELECT
        id,
        MAX(dt) AS max_dt
    FROM
        data
    GROUP BY
        id
),
with_top1_value AS
(
    SELECT
        *,
        MAX(max_dt) OVER () AS max_overall_dt_1,
        MIN(max_dt) OVER () AS min_overall_dt
    FROM
        max_dates
),
with_top2_values AS
(
    SELECT
        *,
        MAX(CASE WHEN max_dt = max_overall_dt_1 THEN min_overall_dt ELSE max_dt END) OVER () AS max_overall_dt_2
    FROM
        with_top1_value
)
SELECT
    *,
    CASE WHEN max_dt = max_overall_dt_1 THEN max_overall_dt_2 ELSE max_overall_dt_1 END AS max_dt_of_others
FROM
    with_top2_values
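For example, with the sample data from the question (max rep_date per id: 1 → 2021-09-09, 2 → 2021-09-03, 3 → 2021-09-05), max_dt_of_others comes out as 2021-09-05 for id 1 and 2021-09-09 for ids 2 and 3.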

SQL - get counts based on rolling window per unique id

I'm working with a table that has an id and a date column. For each id, there's a 90-day window in which multiple transactions can be made. The 90-day window starts when the first transaction is made, and the clock resets once the 90 days are over. When a new transaction triggers a new 90-day window, I want the count to restart at one. I would like to generate something like this, with the two additional columns (window and count), in SQL:
id date window count
name1 7/7/2019 first 1
name1 12/31/2019 second 1
name1 1/23/2020 second 2
name1 1/23/2020 second 3
name1 2/12/2020 second 4
name1 4/1/2020 third 1
name2 6/30/2019 first 1
name2 8/14/2019 first 2
I think getting the rank of the window can be done with a CASE statement and MIN(date) OVER (PARTITION BY id). This is what I have in mind for that:
CASE WHEN date = MIN(date) OVER (PARTITION BY id) THEN 'first'
     WHEN DATEDIFF(day, MIN(date) OVER (PARTITION BY id), date) <= 90 THEN 'first'
     WHEN DATEDIFF(day, MIN(date) OVER (PARTITION BY id), date) > 90  AND DATEDIFF(day, MIN(date) OVER (PARTITION BY id), date) <= 180 THEN 'second'
     WHEN DATEDIFF(day, MIN(date) OVER (PARTITION BY id), date) > 180 AND DATEDIFF(day, MIN(date) OVER (PARTITION BY id), date) <= 270 THEN 'third'
     ELSE NULL END
And incrementing the counts within the windows would be ROW_NUMBER() OVER (PARTITION BY id, window)?
You cannot solve this problem with window functions only. You need to iterate through the dataset, which can be done with a recursive query:
with
tab as (
    select t.*, row_number() over(partition by id order by date) rn
    from mytable t
),
cte as (
    select id, date, rn, date date0 from tab where rn = 1
    union all
    select t.id, t.date, t.rn,
           case when t.date > c.date0 + interval '90' day then t.date else c.date0 end
    from cte c
    inner join tab t on t.id = c.id and t.rn = c.rn + 1
)
select
    id,
    date,
    dense_rank() over(partition by id order by date0) grp,
    count(*) over(partition by id order by date0, date) cnt
from cte
The first query in the with clause ranks records having the same id by increasing date; then, the recursive query traverses the data set and computes the starting date of each group. The last step is numbering the groups and computing the window count.
GMB is totally correct that a recursive CTE is needed. I offer this as an alternative form for two reasons. First, because it uses SQL Server syntax, which appears to be the database being used in the question. Second, because it directly calculates window and count without window functions:
with t as (
      select t.*, row_number() over (partition by id order by date) as seqnum
      from tbl t
     ),
     cte as (
      select t.id, t.date, dateadd(day, 90, t.date) as window_end,
             1 as window, 1 as count, seqnum
      from t
      where seqnum = 1
      union all
      select t.id, t.date,
             (case when t.date > cte.window_end then dateadd(day, 90, t.date)
                   else cte.window_end
              end) as window_end,
             (case when t.date > cte.window_end then window + 1 else window end) as window,
             (case when t.date > cte.window_end then 1 else cte.count + 1 end) as count,
             t.seqnum
      from cte join
           t
           on t.id = cte.id and
              t.seqnum = cte.seqnum + 1
     )
select id, date, window, count
from cte
order by 1, 2;
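To trace the sample data: for name1 the first window ends 90 days after 7/7/2019, i.e. 10/5/2019; 12/31/2019 falls past that, so it opens the second window ending 3/30/2020; 1/23/2020 and 2/12/2020 fall inside it; and 4/1/2020 opens the third window, matching the desired output.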
Here is a db<>fiddle.

SQL: how to write a query that returns missing date ranges?

I am trying to figure out how to write a query that looks at certain records and finds missing date ranges between today and 9999-12-31.
My data looks like below:
ID    | start_dt                | end_dt                  | prc_or_disc_1
10412 | 2018-07-17 00:00:00.000 | 2018-07-20 00:00:00.000 | 1050.000000
10413 | 2018-07-23 00:00:00.000 | 2018-07-26 00:00:00.000 | 1040.000000
So for this data I would want my query to return:
2018-07-10 | 2018-07-16
2018-07-21 | 2018-07-22
2018-07-27 | 9999-12-31
I'm not really sure where to start. Is this possible?
You can do that using the lag() function in MS SQL (available starting with SQL Server 2012).
with myData as
(
    select *,
           lag(end_dt, 1) over (order by start_dt) as lagEnd
    from myTable
),
myMax as
(
    select max(end_dt) as maxDate from myTable
)
select dateadd(d, 1, lagEnd) as StartDate, dateadd(d, -1, start_dt) as EndDate
from myData
where lagEnd is not null and dateadd(d, 1, lagEnd) < start_dt
union all
select dateadd(d, 1, maxDate) as StartDate, cast('99991231' as Datetime) as EndDate
from myMax
where maxDate < '99991231';
If lag() is not available (MS SQL 2008 and earlier), you can mimic it with row_number() and a self-join.
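For illustration, a sketch of that workaround (assuming the same myTable as above; the final gap out to 9999-12-31 would still be handled as in the union all branch above):
with numbered as
(
    select *,
           row_number() over (order by start_dt) as rn
    from myTable
)
-- prev.end_dt plays the role of lag(end_dt) over (order by start_dt)
select dateadd(d, 1, prev.end_dt) as StartDate,
       dateadd(d, -1, cur.start_dt) as EndDate
from numbered cur
join numbered prev on prev.rn = cur.rn - 1
where dateadd(d, 1, prev.end_dt) < cur.start_dt;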
select
    CASE WHEN DATEDIFF(day, end_dt, ISNULL(LEAD(start_dt) over (order by ID), '99991231')) > 1
         THEN end_dt + 1
    END as F1,
    CASE WHEN DATEDIFF(day, end_dt, ISNULL(LEAD(start_dt) over (order by ID), '99991231')) > 1
         THEN ISNULL(LEAD(start_dt) over (order by ID) - 1, '99991231')
    END as F2
from t
Working SQLFiddle example is -> Here
FOR 2008 VERSION
SELECT
    X.end_dt + 1 as F1,
    ISNULL(Y.start_dt - 1, '99991231') as F2
FROM t X
LEFT JOIN (
    SELECT
        *,
        (SELECT MAX(ID) FROM t WHERE ID < A.ID) as ID2
    FROM t A
) Y ON X.ID = Y.ID2
WHERE DATEDIFF(day, X.end_dt, ISNULL(Y.start_dt, '99991231')) > 1
Working SQLFiddle example is -> Here
This should work in 2008; it assumes that the ranges in your table do not overlap. It also eliminates rows where the end_date of the current row is the day before the start date of the next row (i.e. adjacent ranges produce no gap).
with dtRanges as (
    select start_dt, end_dt, row_number() over (order by start_dt) as rownum
    from table1
)
select t2.end_dt + 1, coalesce(start_dt_next - 1, '99991231')
from
(
    select dr1.start_dt, dr1.end_dt, dr2.start_dt as start_dt_next
    from dtRanges dr1
    left join dtRanges dr2 on dr2.rownum = dr1.rownum + 1
) t2
where
    t2.end_dt + 1 <> coalesce(start_dt_next, '99991231')
http://sqlfiddle.com/#!18/65238/1
SELECT
    *
FROM
(
    SELECT
        end_dt + 1 AS start_dt,
        LEAD(start_dt - 1, 1, '9999-12-31') OVER (ORDER BY start_dt) AS end_dt
    FROM
        yourTable
) gaps
WHERE
    gaps.end_dt >= gaps.start_dt
I would, however, strongly urge you to use end dates that are "exclusive": that is, the range covers everything up to but excluding end_dt.
That way, a range of one day becomes '2018-07-09', '2018-07-10'.
It's really clear that the range is one day long: subtract one from the other and you get a day.
Also, if you ever need hour or minute granularity, you don't need to change your data. It just works. Always. Reliably. Intuitively.
If you search the web you'll find plenty of documentation on why inclusive-start and exclusive-end is a very good idea from a software perspective. (Then, in the query above, you can remove the wonky +1 and -1, as sketched below.)
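A sketch of the same gap query under that convention (assuming end_dt is stored as an exclusive bound; the reported gap_end is exclusive as well):
SELECT
    *
FROM
(
    SELECT
        end_dt AS gap_start,  -- an exclusive end is already the first missing day
        LEAD(start_dt, 1, '9999-12-31') OVER (ORDER BY start_dt) AS gap_end
    FROM
        yourTable
) gaps
WHERE
    gaps.gap_end > gaps.gap_start  -- keep only non-empty gaps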
This solves your case, but provide some sample data if there will ever be overlaps, fringe cases, etc.
Take one day after your end date and one day before the next line's start date.
DECLARE @t TABLE (ID int, start_dt DATETIME, end_dt DATETIME, prc VARCHAR(100))
INSERT INTO @t (id, start_dt, end_dt, prc)
VALUES
(10410, '2018-07-09 00:00:00.00', '2018-07-12 00:00:00.000', '1025.000000'),
(10412, '2018-07-17 00:00:00.00', '2018-07-20 00:00:00.000', '1050.000000'),
(10413, '2018-07-23 00:00:00.00', '2018-07-26 00:00:00.000', '1040.000000')

SELECT DATEADD(DAY, 1, end_dt),
       DATEADD(DAY, -1, LEAD(start_dt, 1, '9999-12-31') OVER (ORDER BY id))
FROM @t
You may want to take a look at this:
http://sqlfiddle.com/#!18/3a224/1
You just have to edit the begin range to today and the end range to 9999-12-31.

SQL SELECT rows where the difference between consecutive columns is less than X

Basically the same as Mysql: Find rows, where timestamp difference is less than x, but I want to stop at the first value whose timestamp difference is larger than X.
I got so far:
SELECT *
FROM (
    SELECT *,
           (LEAD(datetime) OVER (ORDER BY datetime)) - datetime AS difference
    FROM history
) AS sq
WHERE difference < '00:01:00'
This seems to correctly return all rows where the difference between the row and the one "behind" it is less than a minute. But that means I still get large jumps in the datetimes, which I don't want: I want to select the most recent "run" of rows, where a "run" is defined as "the timestamps in datetime differ by less than a minute".
e.g., I have rows whose hypothetical timestamps are as follows:
24, 22, 21, 19, 18, 12, 11, 9, 7...
And my limit of differences is 3, i.e. I want the run of the rows whose difference between "timestamps" is less than 3; therefore just:
24, 22, 21, 19, 18
Is this possible in SQL?
You can use lag to get the previous row's timestamp and check whether the current row is within 3 minutes of it. Reset the group when the condition fails. After this grouping is done, find the latest such group using max. Then get all the rows from that latest group.
Include a partition by clause in the window functions lag, sum, and max if this has to be done for each id in the table (a per-id sketch follows the two queries below).
with grps as (
    select x.*, sum(col) over(order by dt) grp
    from (select t.*
                 -- checking if the current row's timestamp is within 3 minutes of the previous row
                 ,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
                                   AND LAG(dt) OVER (ORDER BY dt) + interval '3 minute' THEN 0 ELSE 1 END col
          from t) x
)
select dt
from (select g.*, max(grp) over() maxgrp -- getting the latest group
      from grps g
     ) g
where grp = maxgrp
The above would return the members of the latest group even if it has only one row. To avoid such results, get the latest group that has more than one row.
with grps as (
    select x.*, sum(col) over(order by dt) grp
    from (select t.*
                 ,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
                                   AND LAG(dt) OVER (ORDER BY dt) + 3 THEN 0 ELSE 1 END col
          from t) x
)
,grpcnts as (select g.*, count(*) over(partition by grp) grpcnt from grps g)
select dt
from (select g.*, max(grp) over() maxgrp
      from grpcnts g
      where grpcnt > 1
     ) g
where grp = maxgrp
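As noted above, doing this per id just means adding partition by id to each window function. A minimal sketch, assuming the same table t with an added id column:
with grps as (
    select x.*, sum(col) over(partition by id order by dt) grp
    from (select t.*
                 -- flag a new run when the row is not within 3 of the previous row for this id
                 ,case WHEN dt BETWEEN LAG(dt) OVER (PARTITION BY id ORDER BY dt)
                                   AND LAG(dt) OVER (PARTITION BY id ORDER BY dt) + 3 THEN 0 ELSE 1 END col
          from t) x
)
select id, dt
from (select g.*, max(grp) over(partition by id) maxgrp -- latest run per id
      from grps g
     ) g
where grp = maxgrp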
You can do this by using a flag based on the lead() or lag() values. I believe this does what you want:
SELECT h.*
FROM (SELECT h.*,
             -- count the breaks (gaps of a minute or more) between this row and the latest row
             SUM( (next_datetime >= datetime + interval '1 minute')::int ) OVER (ORDER BY datetime DESC) as grp
      FROM (SELECT h.*,
                   LEAD(h.datetime) OVER (ORDER BY h.datetime) as next_datetime
            FROM history h
           ) h
     ) h
WHERE grp IS NULL OR grp = 0;  -- grp is NULL for the newest row, which has no next_datetime
This can be solved easily with recursive CTEs (this selects your rows one by one and stops when there is no row within interval '1 min'):
with recursive h as (
    select * from (
        select *
        from history
        order by history.datetime desc
        limit 1
    ) s
    union all
    select * from (
        select history.*
        from h
        join history on history.datetime >= h.datetime - interval '1 min'
                    and history.datetime < h.datetime
        order by history.datetime desc
        limit 1
    ) s
)
select * from h
This should be efficient if you have an index on history.datetime. Though, if you care about performance, you should test it against the window-function based ones. (I personally get a headache when I see as many subqueries and window functions as this problem needs. The irony in my answer is that PostgreSQL does not support the ORDER BY clause directly inside recursive CTEs, so I had to use two otherwise meaningless subqueries to "hide" them.)
rextester
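If the index mentioned above does not exist yet, a minimal sketch of creating it (PostgreSQL syntax; the index name is made up for illustration):
-- a plain btree index supports the ORDER BY datetime DESC ... LIMIT 1 lookups in both directions
CREATE INDEX history_datetime_idx ON history (datetime);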