How to dense rank sets of data - SQL

I am trying to get a dense rank to group sets of data together. In my table I have ID, GRP_SET, SUB_SET, and INTERVAL, which simply represents a date field. When records are inserted with an ID, they get inserted as GRP_SETs of three rows, shown as a SUB_SET. As you can see, when inserts happen the interval can change slightly before the set finishes inserting.
Here is some example data and the DRANK column represents what ranking I'm trying to get.
with q as (
select 1 id, 'a' GRP_SET, 1 as SUB_SET, 123 as interval, 1 as DRANK from dual union all
select 1, 'a', 2, 123, 1 from dual union all
select 1, 'a', 3, 124, 1 from dual union all
select 1, 'b', 1, 234, 2 from dual union all
select 1, 'b', 2, 235, 2 from dual union all
select 1, 'b', 3, 235, 2 from dual union all
select 1, 'a', 1, 331, 3 from dual union all
select 1, 'a', 2, 331, 3 from dual union all
select 1, 'a', 3, 331, 3 from dual)
select * from q
Example Data
ID GRP_SET SUB_SET INTERVAL DRANK
1 a 1 123 1
1 a 2 123 1
1 a 3 124 1
1 b 1 234 2
1 b 3 235 2
1 b 2 235 2
1 a 1 331 3
1 a 2 331 3
1 a 3 331 3
Here is the query I have that gets close, but I seem to need something like:
Partition By: ID
Order within partition by: ID, Interval
Change Rank when: ID, GRP_SET (change)
select
id, GRP_SET, SUB_SET, interval,
DENSE_RANK() over (partition by ID order by id, GRP_SET) as DRANK_TEST
from q
Order by
id, interval
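Stated procedurally, the desired DRANK starts at 1 and increments whenever GRP_SET changes while the rows are scanned in interval order. A minimal Python sketch of that rule (illustrative only, not tied to any SQL engine):

```python
# Rows as (grp_set, sub_set, interval), already ordered by interval, sub_set.
rows = [
    ('a', 1, 123), ('a', 2, 123), ('a', 3, 124),
    ('b', 1, 234), ('b', 2, 235), ('b', 3, 235),
    ('a', 1, 331), ('a', 2, 331), ('a', 3, 331),
]

def change_rank(rows):
    """Start at 1 and increment each time grp_set differs from the previous row."""
    ranks, prev, rank = [], None, 0
    for grp, _, _ in rows:
        if grp != prev:
            rank += 1
            prev = grp
        ranks.append(rank)
    return ranks

print(change_rank(rows))  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

This is exactly what plain DENSE_RANK() cannot express, because the same GRP_SET value ('a') must map to two different ranks depending on position.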

Using the MODEL clause
Behold, for you are pushing your requirements beyond the limits of what is easy to express in "ordinary" SQL. But luckily, you're using Oracle, which features the MODEL clause, a device whose mystery is only exceeded by its power (excellent whitepaper here). You shall write:
SELECT
id, grp_set, sub_set, interval, drank
FROM (
SELECT id, grp_set, sub_set, interval, 1 drank
FROM q
)
MODEL PARTITION BY (id)
DIMENSION BY (row_number() OVER (ORDER BY interval, sub_set) rn)
MEASURES (grp_set, sub_set, interval, drank)
RULES (
drank[any] = NVL(drank[cv(rn) - 1] +
DECODE(grp_set[cv(rn) - 1], grp_set[cv(rn)], 0, 1), 1)
)
Proof on SQLFiddle
Explanation:
SELECT
id, grp_set, sub_set, interval, drank
FROM (
-- Here, we initialise your "dense rank" to 1
SELECT id, grp_set, sub_set, interval, 1 drank
FROM q
)
-- Then we partition the data set by ID (that's your requirement)
MODEL PARTITION BY (id)
-- We generate row numbers for all columns ordered by interval and sub_set,
-- such that we can then access row numbers in that particular order
DIMENSION BY (row_number() OVER (ORDER BY interval, sub_set) rn)
-- These are the columns that we want to generate from the MODEL clause
MEASURES (grp_set, sub_set, interval, drank)
-- And the rules are simple: Each "dense rank" value is equal to the
-- previous "dense rank" value + 1, if the grp_set value has changed
RULES (
drank[any] = NVL(drank[cv(rn) - 1] +
DECODE(grp_set[cv(rn) - 1], grp_set[cv(rn)], 0, 1), 1)
)
Of course, this only works if there are no interleaving events, i.e. no other grp_set value appears between intervals 123 and 124.

This might work for you. The complicating factor is that you want the same "DENSE RANK" for intervals 123 and 124 and for intervals 234 and 235. So we'll truncate them to the nearest 10 for purposes of ordering the DENSE_RANK() function:
SELECT id, grp_set, sub_set, interval, drank
, DENSE_RANK() OVER ( PARTITION BY id ORDER BY TRUNC(interval, -1), grp_set ) AS drank_test
FROM q
Please see SQL Fiddle demo here.
If you need intervals to be closer together before they are grouped, you can scale the value before truncating. This would group them in buckets of roughly 3 (though maybe you don't need it that granular):
SELECT id, grp_set, sub_set, interval, drank
, DENSE_RANK() OVER ( PARTITION BY id ORDER BY TRUNC(interval*10/3, -1), grp_set ) AS drank_test
FROM q
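As a sanity check on the bucketing arithmetic, here it is in Python (the helper names are illustrative; both mirror Oracle's TRUNC for positive inputs):

```python
def trunc_tens(x):
    # Mirrors Oracle TRUNC(x, -1) for positive x: floor to a multiple of 10.
    return x // 10 * 10

def trunc_threes(x):
    # Mirrors TRUNC(x * 10 / 3, -1) for positive x: buckets of roughly 3 units.
    return x * 10 // 3 // 10 * 10

print([trunc_tens(x) for x in (123, 124, 234, 235, 331)])
# [120, 120, 230, 230, 330]
print([trunc_threes(x) for x in (123, 124, 125, 126)])
# [410, 410, 410, 420]
```

So 123 and 124 share a bucket, as do 234 and 235, which is what makes the ORDER BY inside DENSE_RANK() treat them as one group.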

Related

How can I write BigQuery SQL to group data by start and end date of column changing?

ID|FLAG|TMST
1|1|2022-01-01
1|1|2022-01-02
...(all dates between 01-02 and 02-05 have 1, there are rows for all these dates)
1|1|2022-02-15
1|0|2022-02-16
1|0|2022-02-17
...(all dates between 02-17 and 05-15 have 0, there are rows for all these dates)
1|0|2022-05-15
1|1|2022-05-16
->
ID|FLAG|STRT_MONTH|END_MONTH
1|1|202201|202202
1|0|202203|202204
1|1|202205|999912
I have the first dataset and I am trying to get the second dataset. How can I write bigquery SQL to group by the ID then get the start and end month of when a flag changes? If a specific month has both 0,1 flag like month 202202, I would like to consider that month to be a 1.
You might consider the gaps-and-islands approach below.
WITH sample_table AS (
SELECT 1 id, 1 flag, DATE '2022-01-01' tmst UNION ALL
SELECT 1 id, 1 flag, '2022-01-02' tmst UNION ALL
-- (all dates between 01-02 and 02-05 have 1, there are rows for all these dates)
SELECT 1 id, 1 flag, '2022-02-15' tmst UNION ALL
SELECT 1 id, 0 flag, '2022-02-16' tmst UNION ALL
SELECT 1 id, 0 flag, '2022-02-17' tmst UNION ALL
SELECT 1 id, 0 flag, '2022-03-01' tmst UNION ALL
SELECT 1 id, 0 flag, '2022-04-01' tmst UNION ALL
-- (all dates between 02-17 and 05-15 have 0, there are rows for all these dates)
SELECT 1 id, 0 flag, '2022-05-15' tmst UNION ALL
SELECT 1 id, 1 flag, '2022-05-16' tmst
),
aggregation AS (
SELECT id, DATE_TRUNC(tmst, MONTH) month, IF(SUM(flag) > 0, 1, 0) flag
FROM sample_table
GROUP BY 1, 2
)
SELECT id, ANY_VALUE(flag) flag,
MIN(month) start_month,
IF(MAX(month) = ANY_VALUE(max_month), '9999-12-01', MAX(month)) end_month
FROM (
SELECT * EXCEPT(gap), COUNTIF(gap) OVER w1 AS part FROM (
SELECT *, flag <> LAG(flag) OVER w0 AS gap, MAX(month) OVER w0 AS max_month
FROM aggregation
WINDOW w0 AS (PARTITION BY id ORDER BY month)
) WINDOW w1 AS (PARTITION BY id ORDER BY month)
)
GROUP BY 1, part;
Query results
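The shape of that query can be followed in a small Python sketch: aggregate to month level (a month with any flag=1 counts as 1), cut islands wherever the flag changes, and report each island's first and last month, leaving the last island open-ended (names here are illustrative):

```python
# (month, flag) rows for one id, already aggregated so that a month
# containing any flag=1 row counts as 1 (e.g. 202202 here).
months = [
    ('202201', 1), ('202202', 1),
    ('202203', 0), ('202204', 0),
    ('202205', 1),
]

def islands(months, open_end='999912'):
    out, start, prev_flag = [], None, None
    for i, (month, flag) in enumerate(months):
        if flag != prev_flag:            # a flag change closes the current island
            if start is not None:
                out.append((prev_flag, start, months[i - 1][0]))
            start, prev_flag = month, flag
    out.append((prev_flag, start, open_end))  # the last island stays open
    return out

print(islands(months))
# [(1, '202201', '202202'), (0, '202203', '202204'), (1, '202205', '999912')]
```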

BigQuery: Possible two merge 2 arrays of ranges where you split the range if they overlap

I was wondering if the following case is possible within BigQuery.
There are 2 tables of intervals. The intervals in a single table do not overlap with other intervals in the same table. The intervals however can overlap with intervals in the other table.
I want to merge the intervals, but also divide the intervals into multiple intervals if they overlap. So if for example the interval is in table A from 5/8/2020 - 5/9/2020 and there is an interval in B 18/8/2020 - 1/9/2020, then I want to split the interval as 5/8/2020 - 18/8/2020 (in A), 18/8/2020 - 1/9/2020 (in A and B) and 1/9/2020 - 5/9/2020 (in A).
A more extensive example: We have a table with intervals where people eat Apples
ID StartDate EndDate
1 01/01/19 01/04/19
2 01/01/19 03/01/19
And a table with intervals where people eat Bananas
ID StartDate EndDate
1 15/12/18 12/01/19
1 01/02/19 17/02/19
1 15/03/19 15/04/19
2 01/06/19 01/07/19
And now we want to combine those intervals and classify the intervals as either, apple eaters, banana eaters, or apple and banana eaters.
ID StartDate EndDate type
1 15/12/18 01/01/19 B
1 01/01/19 12/01/19 AB
1 12/01/19 01/02/19 A
1 01/02/19 17/02/19 AB
1 17/02/19 15/03/19 A
1 15/03/19 01/04/19 AB
1 01/04/19 15/04/19 B
2 01/01/19 03/01/19 A
2 01/06/19 01/07/19 B
Is it possible to solve this with BigQuery?
Consider the query below:
WITH stacked AS (
SELECT ID, date, STRING_AGG(type, '' ORDER BY type) type FROM (
SELECT *, 'A' type FROM Apples
UNION ALL
SELECT *, 'B' type FROM Bananas
), UNNEST (GENERATE_DATE_ARRAY(PARSE_DATE('%d/%m/%y', StartDate), PARSE_DATE('%d/%m/%y', EndDate), INTERVAL 1 DAY)) date
GROUP BY 1, 2
),
partitioned AS (
SELECT ID, date, type,
COUNTIF(flag) OVER w AS div,
type = 'AB' AND LEAD(type) OVER w <>'AB' in_AB,
type <> 'AB' AND LAG(type) OVER w = 'AB' out_AB,
type <> 'AB' AND LEAD(type, 1, 'A') OVER w <> 'AB' bw_AB
FROM (
SELECT ID, date, type, type <> LAG(type) OVER (PARTITION BY ID ORDER BY date) AS flag
FROM stacked
)
WINDOW w AS (PARTITION BY ID ORDER BY date)
)
SELECT ID,
MIN(IF(out_AB, date - 1, date)) StartDate,
MAX(IF(in_AB or bw_AB, date, date + 1)) EndDate,
ANY_VALUE(type) type
FROM partitioned
GROUP BY ID, div
ORDER BY 1, 2;
With sample tables:
CREATE TEMP TABLE Apples AS
select 1 ID, '01/01/19' StartDate, '01/04/19'EndDate union all
select 2, '01/01/19', '03/01/19';
CREATE TEMP TABLE Bananas AS
select 1 ID, '15/12/18' StartDate, '12/01/19' EndDate union all
select 1, '01/02/19', '17/02/19' union all
select 1, '15/03/19', '15/04/19' union all
select 2, '01/06/19', '01/07/19';
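The core idea of the query above can be sketched in Python: explode each interval into days, tag each day A, B, or AB, then collapse consecutive runs of the same tag back into ranges. Day numbers stand in for dates, and the names are illustrative:

```python
from itertools import groupby

def classify(a_days, b_days):
    # Tag every covered day A, B, or AB, then collapse consecutive runs of
    # the same tag back into (start, end, tag) ranges.
    # Note: this sketch assumes the covered days are contiguous; the real
    # query additionally partitions by ID and works on actual dates.
    days = sorted(a_days | b_days)
    tagged = [(d, ('A' if d in a_days else '') + ('B' if d in b_days else ''))
              for d in days]
    out = []
    for tag, run in groupby(tagged, key=lambda t: t[1]):
        run = list(run)
        out.append((run[0][0], run[-1][0], tag))
    return out

# Table A covers days 10..19, table B covers days 15..24.
print(classify(set(range(10, 20)), set(range(15, 25))))
# [(10, 14, 'A'), (15, 19, 'AB'), (20, 24, 'B')]
```

The GENERATE_DATE_ARRAY/UNNEST step in the query plays the role of the day-set expansion here, and the COUNTIF window does the run-collapsing.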

Oracle SQL recursive adding values

I have the following data in the table
Period Total_amount R_total
01/01/20 2 2
01/02/20 5 null
01/03/20 3 null
01/04/20 8 null
01/05/20 31 null
Based on the above data I would like to have the following situation.
Period Total_amount R_total
01/01/20 2 2
01/02/20 5 3
01/03/20 3 0
01/04/20 8 8
01/05/20 31 23
Additional data
01/06/20 21 0 (previously it would be -2)
01/07/20 25 25
01/08/20 29 4
Pattern to the additional data is:
if total_amount < previous(r_total) then 0
Based on the filled data, we can spot the pattern is:
R_total = total_amount - previous(R_total)
Could you please help me out with this issue?
As Gordon Linoff suspected, it is possible to solve this problem with analytic functions. The benefit is that the query will likely be much faster. The price to pay for that benefit is that you need to do a bit of math beforehand (before ever thinking about "programming" and "computers").
A bit of elementary arithmetic shows that R_TOTAL is an alternating sum of TOTAL_AMOUNT. This can be arranged easily by using ROW_NUMBER() (to get the signs) and then an analytic SUM(), as shown below.
Table setup:
create table sample_data (period, total_amount) as
select to_date('01/01/20', 'mm/dd/rr'), 2 from dual union all
select to_date('01/02/20', 'mm/dd/rr'), 5 from dual union all
select to_date('01/03/20', 'mm/dd/rr'), 3 from dual union all
select to_date('01/04/20', 'mm/dd/rr'), 8 from dual union all
select to_date('01/05/20', 'mm/dd/rr'), 31 from dual
;
Query and result:
with
prep (period, total_amount, sgn) as (
select period, total_amount,
case mod(row_number() over (order by period), 2) when 0 then 1 else -1 end
from sample_data
)
select period, total_amount,
sgn * sum(sgn * total_amount) over (order by period) as r_total
from prep
;
PERIOD TOTAL_AMOUNT R_TOTAL
-------- ------------ ----------
01/01/20 2 2
01/02/20 5 3
01/03/20 3 0
01/04/20 8 8
01/05/20 31 23
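As a cross-check on the alternating-sum identity, unrolling R(n) = A(n) - R(n-1) gives R(n) = A(n) - A(n-1) + A(n-2) - ..., so both forms can be computed and compared in a few lines of Python:

```python
amounts = [2, 5, 3, 8, 31]

# Direct recurrence: R(1) = A(1), R(n) = A(n) - R(n-1).
r = []
for a in amounts:
    r.append(a - r[-1] if r else a)

# Alternating-sum form used by the analytic query: give row n the sign
# sgn(n) = +1 for even n, -1 for odd n, then R(n) = sgn(n) * running_sum.
signs = [1 if (i + 1) % 2 == 0 else -1 for i in range(len(amounts))]
running, alt = 0, []
for a, s in zip(amounts, signs):
    running += s * a
    alt.append(s * running)

print(r)    # [2, 3, 0, 8, 23]
print(alt)  # [2, 3, 0, 8, 23]
```

The second form is what the ROW_NUMBER()-plus-SUM() query computes in a single pass, with no recursion.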
This may be possible with window functions, but the simplest method is probably a recursive CTE:
with t as (
select t.*, row_number() over (order by period) as seqnum
from yourtable t
),
cte(period, total_amount, r_amount, seqnum) as (
select period, total_amount, r_amount, seqnum
from t
where seqnum = 1
union all
select t.period, t.total_amount, t.total_amount - cte.r_amount, t.seqnum
from cte join
t
on t.seqnum = cte.seqnum + 1
)
select *
from cte;
This question explicitly talks about "recursively" adding values. If you want to solve this using another mechanism, you might explain the logic in detail and ask if there is a non-recursive CTE solution.

Group by 1 minute interval for the chain of actions sql BigQuery

I need to group the data with 1 minute interval for the chain of actions. My data looks like this:
id MetroId Time ActionName refererurl
111 a 2020-09-01-09:19:00 First www.stackoverflow/a12345
111 b 2020-09-01-12:36:54 First www.stackoverflow/a12345
111 f 2020-09-01-12:36:56 First www.stackoverflow/xxxx
111 b 2020-09-01-12:36:58 Midpoint www.stackoverflow/a12345
111 f 2020-09-01-12:37:01 Midpoint www.stackoverflow/xxx
111 b 2020-09-01-12:37:03 Third www.stackoverflow/a12345
111 b 2020-09-01-12:37:09 Complete www.stackoverflow/a12345
222 d 2020-09-01-15:17:44 First www.stackoverflow/a2222
222 d 2020-09-01-15:17:48 Midpoint www.stackoverflow/a2222
222 d 2020-09-01-15:18:05 Third www.stackoverflow/a2222
I need to grab the data with the following condition: if an id and url combination has a Complete value in the action_name column, grab that row. If it doesn't have Complete, then grab Third, and so on.
SELECT ARRAY_AGG(current_query_result
ORDER BY CASE ActionName
WHEN 'Complete' THEN 1
WHEN 'Third' THEN 2
WHEN 'Midpoint' THEN 3
WHEN 'First' THEN 4
END
LIMIT 1
)[OFFSET(0)]
FROM
(
SELECT d.id, c.Time, c.ActionName, c.refererurl, c.MetroId
FROM
`bq_query_table_c` c
INNER JOIN `bq_table_d` d ON d.id = c.CreativeId
WHERE
c.refererurl LIKE "https://www.stackoverflow/%"
AND c.ActionName in ('First', 'Midpoint', 'Third', 'Complete')
) current_query_result
GROUP BY
id,
refererurl,
MetroId,
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time),
INTERVAL MOD(UNIX_SECONDS(PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time)), 1 * 60)
SECOND
)
Desired output:
id MetroId Time ActionName refererurl
111 a 2020-09-01-09:19:00 First www.stackoverflow/a12345
111 f 2020-09-01-12:37:01 Midpoint www.stackoverflow/xxx
111 b 2020-09-01-12:37:09 Complete www.stackoverflow/a12345
222 d 2020-09-01-15:18:05 Third www.stackoverflow/a2222
This reads like a gaps-and-islands problem, where a gap is greater than 1 minute and the islands represent the "chains of actions".
I would start by building groups that represent the islands: for this, you can use lag() to retrieve the previous action time, and a cumulative sum that increments for every gap of more than 1 minute between two consecutive actions:
select t.*,
sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
over(partition by x_id, x_url order by time) grp
from (
select d.id, c.time, c.actionname, c.refererurl,
lag(time) over(partition by id, refererurl order by time) lag_time
from `bq_query_table_c` c
inner join `bq_table_d` d on d.id = c.creativeid
where c.refererurl like "https://www.stackoverflow/%"
and c.actionname in ('First', 'Midpoint', 'Third', 'Complete')
) t
grp is the island identifier.
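The cumulative-sum trick can be mirrored in a few lines of Python (epoch seconds stand in for timestamps; island_ids is an illustrative name):

```python
# Action timestamps (epoch seconds) for one (id, refererurl) partition.
times = [0, 4, 9, 15, 300, 310]  # a >60 s pause before 300 starts a new island

def island_ids(times, gap=60):
    grp, out, prev = 0, [], None
    for t in times:
        if prev is not None and t > prev + gap:
            grp += 1            # running count of "gap exceeded" flags
        out.append(grp)
        prev = t
    return out

print(island_ids(times))  # [0, 0, 0, 0, 1, 1]
```

Each distinct value of grp marks one island, which is then used as the grouping key instead of a fixed 1-minute bucket.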
From there on, we can use your original logic that filters the preferred action per group. We don't need to aggregate by 1 minute intervals - we can use grp instead:
select
array_agg(t order by case actionname
when 'Complete' then 1
when 'Third' then 2
when 'Midpoint' then 3
when 'First' then 4
end limit 1)[offset(0)]
from (
select t.*,
sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
over(partition by x_id, x_url order by time) grp
from (
select d.id, c.time, c.actionname, c.refererurl,
lag(time) over(partition by id, refererurl order by time) lag_time
from `bq_query_table_c` c
inner join `bq_table_d` d on d.id = c.creativeid
where c.refererurl like "https://www.stackoverflow/%"
and c.actionname in ('First', 'Midpoint', 'Third', 'Complete')
) t
) t
group by id, refererurl, grp
Note that if there are, say, two "Complete" actions on a single island, it is undefined which one will be picked (your original query has pretty much the same flaw). To make the results deterministic, you want to add another sorting criterion to ARRAY_AGG(), like time for example:
array_agg(t order by case actionname
when 'Complete' then 1
when 'Third' then 2
when 'Midpoint' then 3
when 'First' then 4
end, time limit 1)[offset(0)]
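The preference rule itself, Complete over Third over Midpoint over First with time as the tiebreaker, is just a min-by-key; an illustrative Python sketch:

```python
# Lower number = preferred action; mirrors the CASE expression in the query.
PRIORITY = {'Complete': 1, 'Third': 2, 'Midpoint': 3, 'First': 4}

def pick(actions):
    """actions: list of (action_name, time); return the preferred row."""
    return min(actions, key=lambda a: (PRIORITY[a[0]], a[1]))

print(pick([('First', 1), ('Midpoint', 2), ('Third', 3)]))    # ('Third', 3)
print(pick([('Third', 3), ('Complete', 9), ('Complete', 5)])) # ('Complete', 5)
```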
Below is for BigQuery Standard SQL
#standardSQL
WITH temp AS (
SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
SELECT * ,
TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
FROM (
SELECT
AS VALUE ARRAY_AGG(t
ORDER BY STRPOS('First,Midpoint,Third,Complete',action_name) DESC
LIMIT 1
)[OFFSET(0)]
FROM temp t
WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
GROUP BY id, url,
TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND
)
)
)
WHERE NOT IFNULL(time_lag, 777) < 60
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.bq_table` AS (
SELECT 111 id, '2020-09-01-09:19:00' time, 'First' action_name, 'www.stackoverflow/a12345' url UNION ALL
SELECT 111, '2020-09-01-12:36:54', 'First', 'www.stackoverflow/a12345' UNION ALL
SELECT 111, '2020-09-01-12:36:58', 'Midpoint', 'www.stackoverflow/a12345' UNION ALL
SELECT 111, '2020-09-01-12:37:03', 'Third', 'www.stackoverflow/a12345' UNION ALL
SELECT 111, '2020-09-01-12:37:09', 'Complete', 'www.stackoverflow/a12345' UNION ALL
SELECT 222, '2020-09-01-15:17:44', 'First', 'www.stackoverflow/a2222' UNION ALL
SELECT 222, '2020-09-01-15:17:48', 'Midpoint', 'www.stackoverflow/a2222' UNION ALL
SELECT 222, '2020-09-01-15:18:05', 'Third', 'www.stackoverflow/a2222'
), temp AS (
SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
SELECT * ,
TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
FROM (
SELECT
AS VALUE ARRAY_AGG(t
ORDER BY STRPOS('First,Midpoint,Third,Complete',action_name) DESC
LIMIT 1
)[OFFSET(0)]
FROM temp t
WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
GROUP BY id, url,
TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND
)
)
)
WHERE NOT IFNULL(time_lag, 777) < 60
with result
Row id time action_name url
1 111 2020-09-01-09:19:00 First www.stackoverflow/a12345
2 111 2020-09-01-12:37:09 Complete www.stackoverflow/a12345
3 222 2020-09-01-15:18:05 Third www.stackoverflow/a2222
Note: I am still not 100% sure about your use case - but the above is based on what has been discussed/commented so far.

take sum of last 7 days from the observed date in BigQuery

I have a table on which I want to compute the sum of revenue over the last 7 days from the observed day. Here is my table:
with temp as
(
select DATE('2019-06-29') as transaction_date, "x"as id, 0 as revenue
union all
select DATE('2019-06-30') as transaction_date, "x"as id, 80 as revenue
union all
select DATE('2019-07-04') as transaction_date, "x"as id, 64 as revenue
union all
select DATE('2019-07-06') as transaction_date, "x"as id, 64 as revenue
union all
select DATE('2019-07-11') as transaction_date, "x"as id, 75 as revenue
union all
select DATE('2019-07-12') as transaction_date, "x"as id, 0 as revenue
)
select * from temp
I want to take a sum of the last 7 days for each transaction_date. For instance, for the last record, which has transaction_date = 2019-07-12, I would like to add another column that adds up revenue for the last 7 days from 2019-07-12 (back to 2019-07-06), hence the value of the new rollup_revenue column would be 0 + 75 + 64 = 139. Likewise, I need to compute the rollup for all the dates for every ID.
Note - the ID may or may not appear daily.
I have tried self join but I am unable to figure it out.
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
SUM(revenue) OVER(
PARTITION BY id ORDER BY UNIX_DATE(transaction_date)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
) rollup_revenue
FROM `project.dataset.temp`
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.temp` AS (
SELECT DATE '2019-06-29' AS transaction_date, 'x' AS id, 0 AS revenue UNION ALL
SELECT '2019-06-30', 'x', 80 UNION ALL
SELECT '2019-07-04', 'x', 64 UNION ALL
SELECT '2019-07-06', 'x', 64 UNION ALL
SELECT '2019-07-11', 'x', 75 UNION ALL
SELECT '2019-07-12', 'x', 0
)
SELECT *,
SUM(revenue) OVER(
PARTITION BY id ORDER BY UNIX_DATE(transaction_date)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
) rollup_revenue
FROM `project.dataset.temp`
-- ORDER BY transaction_date
with result
Row transaction_date id revenue rollup_revenue
1 2019-06-29 x 0 0
2 2019-06-30 x 80 80
3 2019-07-04 x 64 144
4 2019-07-06 x 64 208
5 2019-07-11 x 75 139
6 2019-07-12 x 0 139
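The frame RANGE BETWEEN 6 PRECEDING AND CURRENT ROW over the day number means: for each row, sum the revenue of rows whose date falls in [d - 6, d]. A small Python check against the sample data (the helper name is illustrative):

```python
from datetime import date, timedelta

rows = [
    (date(2019, 6, 29), 0), (date(2019, 6, 30), 80),
    (date(2019, 7, 4), 64), (date(2019, 7, 6), 64),
    (date(2019, 7, 11), 75), (date(2019, 7, 12), 0),
]

def rolling_7_day(rows):
    out = []
    for d, _ in rows:
        lo = d - timedelta(days=6)  # 6 preceding days + the current day = 7 days
        out.append(sum(rev for dd, rev in rows if lo <= dd <= d))
    return out

print(rolling_7_day(rows))  # [0, 80, 144, 208, 139, 139]
```

Note that RANGE (not ROWS) is what makes missing dates work: the frame is defined by day values, so an ID that skips days still gets a true calendar window.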
One option uses a correlated subquery to find the rolling sum:
SELECT
transaction_date,
revenue,
(SELECT SUM(t2.revenue) FROM temp t2
WHERE t2.transaction_date BETWEEN DATE_SUB(t1.transaction_date, INTERVAL 6 DAY)
AND t1.transaction_date) AS rev_7_days
FROM temp t1
ORDER BY
transaction_date;