Greenplum string_agg conversion to HiveQL - hive

We are migrating a Greenplum SQL query to HiveQL, and it uses string_agg, which Hive does not provide. How do we migrate it? Below is the sample Greenplum code that needs to be converted for Hive.
select string_agg(Display_String, ';' order by data_day )
from
(
select data_day,
sum(revenue)/1000000.00 as revenue,
data_day||' '||trim(to_char(sum(revenue),'9,999,999,999')) as Display_String
from(
select case when data_date = current_date then 'D:'
when data_date = current_date - 1 then ' D-01:'
when data_date = current_date - 2 then ' D-02:'
when data_date = current_date - 7 then ' D-07:'
when data_date = current_date - 28 then ' D-28:'
end data_day, revenue/1000000.00 revenue
from test.testable
where data_date between current_date - 28 and current_date
and type = 'UVC'
and hour <= (select hour
             from (select row_number() over (order by hour desc) irowsid, hour
                   from test.testable
                   where data_date = current_date and type = 'UVC') tbl1
             where irowsid = 2)
order by 1 desc) a
group by 1)aa;

There is nothing exactly like this in Hive. However, you can use collect_list with a partition by/order by window to calculate it.
select concat_ws(';', max(concat_str))
FROM (
SELECT collect_list(Display_String) over (order by data_day) concat_str
FROM
(your above SQL) s ) concat_qry
Explanation -
collect_list gathers the strings, and while doing so the order by sorts the data by the day column.
The outermost MAX() picks up the fully accumulated value of the concatenated string.
Please note this is a very slow operation. Test performance as well before implementing it.
Here is a sample SQL and result to help you.
select
id, concat_ws(';', max(concat_str))
from
( select
s.id, collect_list(s.c) over (partition by s.id order by s.c ) concat_str
from
( select 1 id,'ax' c union
select 1,'b'
union select 2,'f'
union select 2,'g'
union all select 1,'b'
union all select 1,'b' )s
) gs
group by id
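For intuition, the effect of string_agg(value, ';' ORDER BY key) per group can be sketched in plain Python. This is a minimal sketch with made-up data; the function name and rows are hypothetical, not part of either database's API.

```python
from collections import defaultdict

def string_agg(rows, sep=";"):
    """Group (id, value) pairs and join each group's values in sorted
    order, mimicking string_agg(value, sep ORDER BY value) per id."""
    groups = defaultdict(list)
    for id_, value in rows:
        groups[id_].append(value)
    return {id_: sep.join(sorted(vals)) for id_, vals in groups.items()}

rows = [(1, "ax"), (1, "b"), (2, "f"), (2, "g")]
result = string_agg(rows)
# result == {1: "ax;b", 2: "f;g"}
```

The Hive workaround reaches the same end state by keeping the cumulative collect_list array per window and then taking the largest one with MAX().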


Pro Rata in BigQuery [duplicate]

I want to pro-rate a table of spend rows with date ranges into one row per day (the source and target layouts were shown as screenshots). Essentially, I want to create rows for the days between date_start and date_end and then divide spend by how many days there are.
I am currently using the query below to do this, using BigQuery scripting. I know this is probably a horrible way of querying this, but I'm not sure how else to do it. It takes about 30 seconds to run this query for just 3 rows.
DECLARE i INT64 DEFAULT 1;
DECLARE n int64;
SET n = (SELECT COUNT(*) FROM `pro_rata_test.data`);
DELETE FROM `pro_rata_test.pro_rata` WHERE TRUE;
WHILE i <= n DO
INSERT INTO
pro_rata_test.pro_rata
SELECT
day,
country,
campaign,
other,
SUM(spend)/(
SELECT
DATETIME_DIFF(DATETIME(TIMESTAMP(date_end)),
DATETIME(TIMESTAMP(date_start)),
DAY) + 1
FROM (
SELECT *, ROW_NUMBER() OVER(ORDER BY date_start) AS rn FROM `pro_rata_test.data`)
WHERE
rn = i) AS spend
FROM (
SELECT *, ROW_NUMBER() OVER(ORDER BY date_start) AS rn FROM `pro_rata_test.data`),
UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) day
WHERE
rn = i
GROUP BY
day,
country,
campaign,
other
ORDER BY
day;
SET
i = i + 1;
END WHILE;
Try generate_date_array and unnest:
with mytable as (
select date '2021-01-01' as date_start, date '2021-01-10' as date_end, 100 as spend, 'FR' as country, 'Campaign1' as campaign, 'test1' as Other union all
select date '2021-01-11', date '2021-02-27', 150, 'UK', 'Campaign1', 'test2' union all
select date '2021-03-20', date '2021-04-20', 500, 'UK', 'Campaign2', 'test2'
)
select
day,
country,
campaign,
other,
spend/(date_diff(date_end, date_start, day)+1) as spend
from mytable, unnest(generate_date_array(date_start, date_end)) as day
order by day
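The same pro-rata expansion is easy to sanity-check outside SQL. Here is a minimal Python sketch of the logic (the row layout mirrors the sample table above, but the function name is made up):

```python
from datetime import date, timedelta

def pro_rate(rows):
    """Expand each (date_start, date_end, spend, country, campaign, other)
    row into one row per day, splitting spend evenly across the inclusive
    date range -- the same thing GENERATE_DATE_ARRAY + UNNEST does."""
    out = []
    for start, end, spend, country, campaign, other in rows:
        n_days = (end - start).days + 1
        for i in range(n_days):
            out.append((start + timedelta(days=i), country, campaign, other,
                        spend / n_days))
    out.sort(key=lambda r: r[0])
    return out

rows = [(date(2021, 1, 1), date(2021, 1, 10), 100, "FR", "Campaign1", "test1")]
daily = pro_rate(rows)
# 10 days, 10.0 spend per day
```

Like the set-based SQL above, this touches each source row once, instead of re-scanning the table per row as the WHILE loop does.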

Multiple SELECTS and a CTE table

I have this statement, which returns values for the dates that exist in the table; the CTE then fills in the half-hourly intervals.
with cte (reading_date) as (
select date '2020-11-17' from dual
union all
select reading_date + interval '30' minute
from cte
where reading_date + interval '30' minute < date '2020-11-19'
)
select c.reading_date, d.reading_value
from cte c
left join dcm_reading d on d.reading_date = c.reading_date
order by c.reading_date
However, later on I needed to use a SELECT within a SELECT like this:
SELECT serial_number,
register,
reading_date,
reading_value,
ABS(A_plus)
FROM
(
SELECT
serial_number,
register,
TO_DATE(reading_date, 'DD-MON-YYYY HH24:MI:SS') AS reading_date,
reading_value,
LAG(reading_value,1, 0) OVER(ORDER BY reading_date) AS previous_read,
LAG(reading_value, 1, 0) OVER (ORDER BY reading_date) - reading_value AS A_plus,
reading_id
FROM DCM_READING
WHERE device_id = 'KXTE4501'
AND device_type = 'E'
AND serial_number = 'A171804699'
AND reading_date BETWEEN TO_DATE('17-NOV-2019' || ' 000000', 'DD-MON-YYYY HH24MISS') AND TO_DATE('19-NOV-2019' || ' 235959', 'DD-MON-YYYY HH24MISS')
ORDER BY reading_date)
ORDER BY serial_number, reading_date;
For extra information:
I am selecting data from an existing table and using the LAG function to work out the difference in reading_value from the previous record. However, later on I needed to insert dummy data where there are missing half-hour reads. The CTE brings back a list of all half-hour intervals between the two dates I am querying on.
Ultimately I want a result that has all the reading_dates at half-hour intervals, the reading_value (if there is one), and the difference between the reading_values that do exist. For the half-hourly reads that have no data in DCM_READING I just want to return NULL.
Is it possible to use a CTE with multiple SELECTs?
Not sure what you would like to achieve, but you can have multiple CTEs or even nest them:
with
cte_1 as
(
select username
from dba_users
where oracle_maintained = 'N'
),
cte_2 as
(
select owner, round(sum(bytes)/1024/1024) as megabytes
from dba_segments
group by owner
),
cte_3 as
(
select username, megabytes
from cte_1
join cte_2 on cte_1.username = cte_2.owner
)
select *
from cte_3
order by username;
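As for the gap-filling part of the question, the recursive CTE plus LEFT JOIN pattern boils down to: generate every half-hour slot, then attach a reading where one exists, else NULL. A minimal Python sketch with hypothetical reading data:

```python
from datetime import datetime, timedelta

def fill_half_hours(start, end, readings):
    """Generate every half-hour slot in [start, end) and attach the
    reading value for that slot if one exists, else None -- the same
    shape as the recursive CTE LEFT JOINed to dcm_reading."""
    out = []
    t = start
    while t < end:
        out.append((t, readings.get(t)))
        t += timedelta(minutes=30)
    return out

# Hypothetical data: one reading at 01:00 on the queried day.
readings = {datetime(2020, 11, 17, 1, 0): 42.5}
slots = fill_half_hours(datetime(2020, 11, 17), datetime(2020, 11, 18),
                        readings)
```

Every slot without a matching key comes back as None, which is exactly what the LEFT JOIN produces for missing half-hour reads.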

Window functions ordered by date when some dates don't exist

Suppose this example query:
select
id
, date
, sum(var) over (partition by id order by date rows 30 preceding) as roll_sum
from tab
When some dates are not present in the date column, the window will not account for the missing dates. How can I make this window aggregation include those missing dates?
Many thanks!
You can join a sequence containing all dates from a desired interval.
select
*
from (
select
d.date,
q.id,
q.roll_sum
from unnest(sequence(date '2000-01-01', date '2030-12-31')) as d(date)
left join ( your_query ) q on q.date = d.date
) v
where v.date > (select min(my_date) from tab2)
and v.date < (select max(my_date) from tab2)
In standard SQL, you would typically use a window range specification, like:
select
id,
date,
sum(var) over (
partition by id
order by date
range interval '30' day preceding
) as roll_sum
from tab
However, I am unsure that Presto supports this syntax. You can resort to a correlated subquery instead:
select
id,
date,
(
select sum(var)
from tab t1
where
t1.id = t.id
and t1.date >= t.date - interval '30' day
and t1.date <= t.date
) roll_sum
from tab t
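The correlated-subquery logic is easy to sanity-check outside the database. A small Python sketch with made-up rows (function and table names are hypothetical):

```python
from datetime import date, timedelta

def rolling_30d_sum(rows):
    """For each (id, d, var) row, sum var over the same id's rows whose
    date falls in [d - 30 days, d] -- the correlated-subquery approach."""
    out = []
    for id_, d, _ in rows:
        total = sum(v for i2, d2, v in rows
                    if i2 == id_ and d - timedelta(days=30) <= d2 <= d)
        out.append((id_, d, total))
    return out

rows = [(1, date(2021, 1, 1), 10),
        (1, date(2021, 1, 15), 20),
        (1, date(2021, 2, 20), 5)]
# Jan 15's window reaches back to Jan 1 -> 30; Feb 20 covers only itself -> 5
```

Like the SQL version, this is O(n^2) per id, which is fine for checking correctness but worth remembering for large tables.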
I don't think Presto supports window functions with interval ranges. Alas. There is an old-fashioned way of doing this, by counting "ins" and "outs" of values:
with inout as (
      select id, date, var, 1 as is_orig
      from t
      union all
      select id, date + interval '30' day, -var, 0
      from t
     )
select g.*
from (select id, date,
             sum(sum(var)) over (partition by id order by date) as running_30,
             sum(is_orig) as is_orig
      from inout
      group by id, date
     ) g
where is_orig > 0
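The "ins and outs" trick itself is worth seeing in isolation. In this Python sketch (made-up data), each value enters the window at its date and leaves 30 days later; a running sum over the per-id event stream gives the rolling total at each original row:

```python
from datetime import date, timedelta

def rolling_sum_in_out(rows, window_days=30):
    """'Ins and outs': emit +var at each row's date and -var window_days
    later; a running sum over the sorted events is the rolling total."""
    events = []
    for id_, d, v in rows:
        events.append((id_, d, v, 1))                                 # "in"
        events.append((id_, d + timedelta(days=window_days), -v, 0))  # "out"
    events.sort(key=lambda e: (e[0], e[1], e[3]))  # outs before ins on ties
    out, running = [], {}
    for id_, d, v, is_orig in events:
        running[id_] = running.get(id_, 0) + v
        if is_orig:
            out.append((id_, d, running[id_]))
    return out

rows = [(1, date(2021, 1, 1), 10),
        (1, date(2021, 1, 15), 20),
        (1, date(2021, 2, 20), 5)]
result = rolling_sum_in_out(rows)
```

This is O(n log n) instead of the correlated subquery's O(n^2), which is why the old trick survives.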

Recursive CTE in Amazon Redshift

We are trying to port some code to run on Amazon Redshift, but Redshift won't run the recursive CTE. Any good soul who knows how to port this?
with tt as (
select t.*, row_number() over (partition by id order by time) as seqnum
from t
),
recursive cte as (
select t.*, time as grp_start
from tt
where seqnum = 1
union all
select tt.*,
(case when tt.time < cte.grp_start + interval '3 second'
then tt.time
else tt.grp_start
end)
from cte join
tt
on tt.seqnum = cte.seqnum + 1
)
select cte.*,
(case when grp_start = lag(grp_start) over (partition by id order by time)
then 0 else 1
end) as isValid
from cte;
Or, a different code to reproduce the logic below.
It is a binary result that:
it is 1 if it is the first known value of an ID
it is 1 if it is 3 seconds or more after the previous "1" of that ID
it is 0 if it is less than 3 seconds after the previous "1" of that ID
Note 1: this is not the difference in seconds from the previous record
Note 2: there are many IDs in the data set
Note 3: original dataset has ID and Date
Desired output:
https://i.stack.imgur.com/k4KUQ.png
Dataset poc:
http://www.sqlfiddle.com/#!15/41d4b
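The rule above can be checked against a plain Python implementation (ids, timestamps, and the function name are hypothetical):

```python
from datetime import datetime, timedelta

def flag_starts(rows, gap=timedelta(seconds=3)):
    """Per id: flag 1 for the first record and for any record at least
    `gap` after the previous record flagged 1; otherwise flag 0."""
    out = []
    last_start = {}
    for id_, t in sorted(rows, key=lambda r: (r[0], r[1])):
        anchor = last_start.get(id_)
        if anchor is None or t - anchor >= gap:
            out.append((id_, t, 1))
            last_start[id_] = t   # this record becomes the new "1" anchor
        else:
            out.append((id_, t, 0))
    return out

t0 = datetime(2021, 1, 1, 12, 0, 0)
rows = [(1, t0), (1, t0 + timedelta(seconds=1)),
        (1, t0 + timedelta(seconds=3)), (1, t0 + timedelta(seconds=4))]
flags = flag_starts(rows)
```

Note the anchor advances only when a record is flagged 1, which is why this differs from a simple "seconds since previous record" LAG.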
As of this writing, Redshift does support recursive CTEs: see the documentation here
To note when creating a recursive CTE in Redshift:
start the query: with recursive
column names must be declared for all recursive cte's
Consider the following example for creating a list of dates using recursive CTE's:
with recursive
start_dt as (select current_date s_dt)
, end_dt as (select dateadd(day, 1000, current_date) e_dt)
-- the recursive cte; note declaration of the column `dt`
, dates (dt) as (
-- start at the start date
select s_dt dt from start_dt
union all
-- recursive lines
select dateadd(day, 1, dt)::date dt -- converted to date to avoid type mismatch
from dates
where dt <= (select e_dt from end_dt) -- stop at the end date
)
select *
from dates
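SQLite accepts the same WITH RECURSIVE shape, so the pattern can be tried quickly from Python's stdlib. A sketch with a made-up date range (dates handled as ISO-8601 strings, and dateadd replaced by SQLite's date() function):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE dates(dt) AS (
        SELECT '2021-01-01'          -- anchor: the start date
        UNION ALL
        SELECT date(dt, '+1 day')    -- recursive step: add one day
        FROM dates
        WHERE dt < '2021-01-10'      -- stop at the end date
    )
    SELECT dt FROM dates
""").fetchall()
# 10 rows: 2021-01-01 .. 2021-01-10
```

The structure is identical to the Redshift example: an anchor member, a UNION ALL recursive member, and a termination condition.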
The below code could help you.
SELECT id, time, CASE WHEN sec_diff is null or prev_sec_diff - sec_diff > 3
then 1
else 0
end FROM (
select id, time, sec_diff, lag(sec_diff) over(
partition by id order by time asc
)
as prev_sec_diff
from (
select id, time, date_part('s', time - lag(time) over(
partition by id order by time asc
)
)
as sec_diff from hon
) x
) y

Filter rows by those created within a close timeframe

I have an application where users create orders that are stored in an Oracle database. I'm trying to find a bug that only happens when a user creates orders within 30 seconds of the last order they created.
Here is the structure of the order table:
order_id | user_id | creation_date
I would like to write a query that can give me a list of orders where the creation_date is within 30 seconds of the last order for the same user. The results will hopefully help me find the bug.
I tried using the Oracle LAG() function, but it doesn't seem to work with the WHERE clause.
Any thoughts?
SELECT O.*
FROM YourTable O
WHERE EXISTS (
SELECT *
FROM YourTable O2
WHERE
O.creation_date > O2.creation_date
AND O.user_id = O2.user_id
AND O.creation_date - (30 / 86400) <= O2.creation_date
);
See this in action in a Sql Fiddle.
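The EXISTS logic can be mirrored in plain Python to double-check the 30-second rule (the orders below are hypothetical):

```python
from datetime import datetime, timedelta

def orders_within_30s(orders):
    """Return orders placed no more than 30 seconds after an earlier
    order by the same user -- the EXISTS self-join logic."""
    hits = []
    for oid, user, created in orders:
        if any(u2 == user and c2 < created
               and created - c2 <= timedelta(seconds=30)
               for _, u2, c2 in orders):
            hits.append((oid, user, created))
    return hits

orders = [(1, "a", datetime(2021, 5, 1, 12, 0, 0)),
          (2, "a", datetime(2021, 5, 1, 12, 0, 20)),
          (3, "a", datetime(2021, 5, 1, 12, 1, 30))]
hits = orders_within_30s(orders)
# only order 2 is within 30 seconds of an earlier order by the same user
```

In the SQL, `O.creation_date - (30 / 86400)` does the same subtraction because Oracle DATE arithmetic is in fractional days.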
You can use the LAG function if you want, you would just have to wrap the query into a derived table and then put your WHERE condition in the outer query.
SELECT distinct
t1.order_id, t1.user_id, t1.creation_date
FROM
YourTable t1
join YourTable t2
on t2.user_id = t1.user_id
and t2.creation_date between t1.creation_date - 30/86400 and t1.creation_date
and t2.rowid <> t1.rowid
order by 3 desc
Example of using LAG(). Note that extracting only the SECOND component of each timestamp breaks across minute boundaries, so compute the full difference in seconds by casting to DATE and subtracting (Oracle DATE arithmetic yields fractional days):
SELECT id
     , (CAST(creation_date AS DATE) - CAST(prev_date AS DATE)) * 86400 time_diff_in_seconds
     , creation_date, prev_date
FROM
(
  SELECT id, creation_date
       , LAG(creation_date, 1, creation_date) OVER (ORDER BY creation_date) prev_date
  FROM
  ( -- Table/data --
    SELECT 1 id, timestamp '2013-03-20 13:56:58' creation_date FROM dual
    UNION ALL
    SELECT 2, timestamp '2013-03-20 13:57:27' FROM dual
    UNION ALL
    SELECT 3, timestamp '2013-03-20 13:59:16' FROM dual
  )
)
--WHERE (CAST(creation_date AS DATE) - CAST(prev_date AS DATE)) * 86400 <= 30
/
ID TIME_DIFF_IN_SECONDS
--------------------------
1          0   <<-- kept if the WHERE is uncommented
2         29   <<-- kept if the WHERE is uncommented
3        109