BigQuery SQL: Rolling count distinct bounded between two conditions

I am trying to find the rolling distinct count of ip_var bounded between two events (flagged in two different columns) in BigQuery SQL.
For example, I have a table of the form:
id TIME_STAMP event_1 event_2 ip_var
A 1 0 0 1
A 2 1 0 1
A 2 0 0 2
A 3 0 0 2
A 4 0 0 3
A 5 0 1 4
A 6 0 0 1
A 7 0 0 1
B 1 0 0 2
B 2 0 0 2
B 2 1 0 3
B 3 0 0 3
B 4 0 0 3
B 4 0 1 4
B 6 0 0 5
B 7 0 0 6
For each id, I need the distinct count of ip_var from when event_1 happens until event_2 happens. It is always guaranteed that event_2 happens after event_1.
I have tried using a rolling count for the problem, without much success.
The final output should look like
id bounded_count
A 2
B 1

Below is for BigQuery Standard SQL
#standardSQL
SELECT id, COUNT(DISTINCT ip_var) bounded_count
FROM (
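SELECT *,
-- grp: number of event_1 rows strictly before the current row (identifies the window)
COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) grp,
-- in_window: an event_1 has occurred before this row with no matching event_2 yet (QUALIFY is a reserved keyword in BigQuery, so it is avoided as an alias)
COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) != COUNTIF(event_2 = 1) OVER(win) in_window
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY id ORDER BY time_stamp)
)
WHERE in_window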
GROUP BY id, grp
Applied to the sample data from your question, the result is
Row id bounded_count
1 A 2
2 B 1
Note: the above solution also works when an id has multiple qualifying event_1/event_2 pairs, as in the example below (same code, I just added more rows to the sample data and included grp in the output)
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'A' id, 1 time_stamp, 0 event_1, 0 event_2, 1 ip_var UNION ALL
SELECT 'A', 2, 1, 0, 1 UNION ALL
SELECT 'A', 2, 0, 0, 2 UNION ALL
SELECT 'A', 3, 0, 0, 2 UNION ALL
SELECT 'A', 4, 0, 0, 3 UNION ALL
SELECT 'A', 5, 0, 1, 4 UNION ALL
SELECT 'A', 6, 0, 0, 1 UNION ALL
SELECT 'A', 7, 0, 0, 1 UNION ALL
SELECT 'A', 12, 1, 0, 1 UNION ALL
SELECT 'A', 13, 0, 0, 2 UNION ALL
SELECT 'A', 14, 0, 0, 3 UNION ALL
SELECT 'A', 15, 0, 0, 4 UNION ALL
SELECT 'A', 16, 0, 0, 5 UNION ALL
SELECT 'A', 17, 0, 1, 1 UNION ALL
SELECT 'A', 18, 0, 0, 1 UNION ALL
SELECT 'B', 1, 0, 0, 2 UNION ALL
SELECT 'B', 2, 0, 0, 2 UNION ALL
SELECT 'B', 2, 1, 0, 3 UNION ALL
SELECT 'B', 3, 0, 0, 3 UNION ALL
SELECT 'B', 4, 0, 0, 3 UNION ALL
SELECT 'B', 5, 0, 1, 4 UNION ALL
SELECT 'B', 6, 0, 0, 5 UNION ALL
SELECT 'B', 7, 0, 0, 6
)
SELECT id, COUNT(DISTINCT ip_var) bounded_count, grp
FROM (
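SELECT *,
-- grp: number of event_1 rows strictly before the current row (identifies the window)
COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) grp,
-- in_window: an event_1 has occurred before this row with no matching event_2 yet (QUALIFY is a reserved keyword in BigQuery, so it is avoided as an alias)
COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) != COUNTIF(event_2 = 1) OVER(win) in_window
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY id ORDER BY time_stamp)
)
WHERE in_window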
GROUP BY id, grp
with result
Row id bounded_count grp
1 A 2 1
2 A 4 2
3 B 1 1
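To make the grouping logic easier to follow, here is a minimal Python sketch of the same idea. It is only an illustration, not a replacement for the query. The names bounded_counts and SAMPLE are made up for this example, and it assumes the rows arrive already ordered by time_stamp within each id, with ties processed in the order listed in the question.

```python
from collections import defaultdict

def bounded_counts(rows):
    """rows: (id, time_stamp, event_1, event_2, ip_var), ordered by time_stamp per id.

    Returns {(id, grp): distinct ip_var count} for rows strictly between an
    event_1 row and the next event_2 row. Windows without any row in between
    are skipped, as they would produce no output row in the SQL either.
    """
    open_window = {}        # id -> set of ip_var seen since the last event_1
    grp = defaultdict(int)  # id -> number of event_1 rows seen so far
    result = {}
    for id_, _ts, event_1, event_2, ip_var in rows:
        if event_1 == 1:
            grp[id_] += 1
            open_window[id_] = set()
        elif event_2 == 1:
            seen = open_window.pop(id_, None)
            if seen:
                result[(id_, grp[id_])] = len(seen)
        elif id_ in open_window:
            open_window[id_].add(ip_var)
    return result

# Sample data from the question
SAMPLE = [
    ("A", 1, 0, 0, 1), ("A", 2, 1, 0, 1), ("A", 2, 0, 0, 2), ("A", 3, 0, 0, 2),
    ("A", 4, 0, 0, 3), ("A", 5, 0, 1, 4), ("A", 6, 0, 0, 1), ("A", 7, 0, 0, 1),
    ("B", 1, 0, 0, 2), ("B", 2, 0, 0, 2), ("B", 2, 1, 0, 3), ("B", 3, 0, 0, 3),
    ("B", 4, 0, 0, 3), ("B", 4, 0, 1, 4), ("B", 6, 0, 0, 5), ("B", 7, 0, 0, 6),
]

print(bounded_counts(SAMPLE))
```

For the question's sample this gives 2 for A and 1 for B. Note that several rows share a time_stamp, so the processing order of those ties matters here just as it does for the ORDER BY time_stamp window in SQL.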

Hmmm . . . You can use window functions to calculate the timestamps for each event. The rest is just filtering and aggregation:
WITH t as (
SELECT "A" as id, 1 as time_stamp, 0 as event_1, 0 as event_2, 1 as ip_var UNION ALL
SELECT "A", 2, 1, 0, 1 UNION ALL
SELECT "A", 2, 0, 0, 2 UNION ALL
SELECT "A", 3, 0, 0, 2 UNION ALL
SELECT "A", 4, 0, 0, 3 UNION ALL
SELECT "A", 5, 0, 1, 4 UNION ALL
SELECT "A", 6, 0, 0, 1 UNION ALL
SELECT "A", 7, 0, 0, 1 UNION ALL
SELECT "B", 1, 0, 0, 2 UNION ALL
SELECT "B", 2, 0, 0, 2 UNION ALL
SELECT "B", 2, 1, 0, 3 UNION ALL
SELECT "B", 3, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 1, 4 UNION ALL
SELECT "B", 6, 0, 0, 5 UNION ALL
SELECT "B", 7, 0, 0, 6
)
select id, count(distinct ip_var) as bounded_count
from (select t.*,
min(case when event_1 = 1 then time_stamp end) over (partition by id) as timestamp_1,
max(case when event_2 = 1 then time_stamp end) over (partition by id) as timestamp_2
from t
) t
where time_stamp > timestamp_1 and time_stamp < timestamp_2
group by id

One way to do it is:
1. Find the start_time and end_time for each ID
2. For each ID, filter out events that are not in the counting window
3. Count distinct ip_var
I used temp tables so that I could print the intermediate step and demonstrate the idea. In practice, you should turn the second temp table, id_start_end, into a WITH clause to be more efficient.
CREATE TEMP TABLE t as
SELECT "A" id, 1 time_stamp, 0 event_1, 0 event_2, 1 ip_var UNION ALL
SELECT "A", 2, 1, 0, 1 UNION ALL
SELECT "A", 2, 0, 0, 2 UNION ALL
SELECT "A", 3, 0, 0, 2 UNION ALL
SELECT "A", 4, 0, 0, 3 UNION ALL
SELECT "A", 5, 0, 1, 4 UNION ALL
SELECT "A", 6, 0, 0, 1 UNION ALL
SELECT "A", 7, 0, 0, 1 UNION ALL
SELECT "B", 1, 0, 0, 2 UNION ALL
SELECT "B", 2, 0, 0, 2 UNION ALL
SELECT "B", 2, 1, 0, 3 UNION ALL
SELECT "B", 3, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 1, 4 UNION ALL
SELECT "B", 6, 0, 0, 5 UNION ALL
SELECT "B", 7, 0, 0, 6;
CREATE TEMP TABLE id_start_end AS
SELECT ids.id, t_start.time_stamp as start_time, t_end.time_stamp as end_time FROM
(SELECT DISTINCT id FROM t) ids
JOIN t AS t_start ON ids.id = t_start.id AND t_start.event_1 = 1
JOIN t AS t_end ON ids.id = t_end.id AND t_end.event_2 = 1;
SELECT * FROM id_start_end;
SELECT t.id, COUNT(DISTINCT ip_var)
FROM t JOIN id_start_end
ON t.id = id_start_end.id
AND t.time_stamp < id_start_end.end_time
AND t.time_stamp > id_start_end.start_time
GROUP BY t.id
Output table id_start_end:
+----+------------+----------+
| id | start_time | end_time |
+----+------------+----------+
| A | 2 | 5 |
| B | 2 | 4 |
+----+------------+----------+
Final output:
+----+-----+
| id | f0_ |
+----+-----+
| B | 1 |
| A | 2 |
+----+-----+

Related

Order by multiple columns in the SELECT query

How can I order the results of my select query so that they look like this?
1, 1, 0
1, 2, 0
1, 3, 0
1, 1, 1
1, 2, 1
1, 3, 1
2, 1, 0
2, 2, 0
2, 1, 1
2, 2, 1
I tried this query but the result is not what I'm looking for:
select * from my_table order by col1, col2, col3
In which col1 represents the first number, col2 is the second one and col3 is the last number in the above example.
This query returns:
1, 1, 0
1, 1, 1
1, 2, 0
1, 2, 1
...
Thanks
Sort should be 1-3-2, I'd say. See line #15.
SQL> with test (c1, c2, c3) as
2 (select 2, 1, 0 from dual union all
3 select 1, 3, 1 from dual union all
4 select 1, 1, 1 from dual union all
5 select 1, 1, 0 from dual union all
6 select 1, 2, 0 from dual union all
7 select 2, 2, 0 from dual union all
8 select 2, 2, 1 from dual union all
9 select 2, 1, 1 from dual union all
10 select 1, 3, 0 from dual union all
11 select 1, 2, 1 from dual
12 )
13 select *
14 from test
15 order by c1, c3, c2;
C1 C2 C3
---------- ---------- ----------
1 1 0
1 2 0
1 3 0
1 1 1
1 2 1
1 3 1
2 1 0
2 2 0
2 1 1
2 2 1
10 rows selected.
SQL>
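The same 1-3-2 ordering can be checked outside the database. A minimal Python sketch, using the rows from the test CTE above (the variable names are made up for this illustration):

```python
# Rows as (c1, c2, c3), in the same order as the test CTE
rows = [
    (2, 1, 0), (1, 3, 1), (1, 1, 1), (1, 1, 0), (1, 2, 0),
    (2, 2, 0), (2, 2, 1), (2, 1, 1), (1, 3, 0), (1, 2, 1),
]

# Equivalent of ORDER BY c1, c3, c2: sort on the first, third, then second column
ordered = sorted(rows, key=lambda r: (r[0], r[2], r[1]))

for row in ordered:
    print(row)
```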

How to identify pattern in SQL

This is my table. It consists of columns A, B and C. Only one of the columns is true in any given row.
My task is to identify a pattern based on the latest five rows.
For example, I need to search the entire table to find every place where these five values were repeated.
If they were repeated, what was the next value available after the pattern, and how many times were A, B and C values found after the pattern?
How can this be done in SQL? I am using Oracle 11g. Thanks.
You can convert the a, b, c values of each row to a single ternary (base-3) digit, and then compute, for each row, the 5-digit ternary number formed by that row's digit and the digits of the previous 4 rows. Rows with the same number end the same five-row pattern, so analytic functions can then find the next occurrence and count the occurrences:
SELECT id,
a,
b,
c,
CASE
WHEN grp_value IS NULL
THEN NULL
ELSE MIN(id) OVER (
PARTITION BY grp_value
ORDER BY id
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
) + 1
END AS row_after_next_match,
CASE
WHEN grp_value IS NULL
THEN 0
ELSE COUNT(id) OVER ( PARTITION BY grp_value )
END AS num_matches
FROM (
SELECT id,
a,
b,
c,
value,
81 * LAG(value,4) OVER ( ORDER BY id ) +
27 * LAG(value,3) OVER ( ORDER BY id ) +
9 * LAG(value,2) OVER ( ORDER BY id ) +
3 * LAG(value,1) OVER ( ORDER BY id ) +
1 * value AS grp_value
FROM (
SELECT id,
a,
b,
c,
DECODE(1,a,0,b,1,c,2) AS value
FROM table_name
)
)
ORDER BY id
Which, for the sample data:
CREATE TABLE table_name (
id PRIMARY KEY,
a,
b,
c,
CHECK (a IN (0,1)),
CHECK (b IN (0,1)),
CHECK (c IN (0,1)),
CHECK (a+b+c = 1)
) AS
SELECT 1, 1, 0, 0 FROM DUAL UNION ALL
SELECT 2, 1, 0, 0 FROM DUAL UNION ALL
SELECT 3, 0, 1, 0 FROM DUAL UNION ALL
SELECT 4, 1, 0, 0 FROM DUAL UNION ALL
SELECT 5, 0, 1, 0 FROM DUAL UNION ALL
SELECT 6, 0, 0, 1 FROM DUAL UNION ALL
SELECT 7, 1, 0, 0 FROM DUAL UNION ALL
SELECT 8, 0, 1, 0 FROM DUAL UNION ALL
SELECT 9, 1, 0, 0 FROM DUAL UNION ALL
SELECT 10, 0, 1, 0 FROM DUAL UNION ALL
SELECT 11, 0, 0, 1 FROM DUAL UNION ALL
SELECT 12, 1, 0, 0 FROM DUAL UNION ALL
SELECT 13, 1, 0, 0 FROM DUAL UNION ALL
SELECT 14, 1, 0, 0 FROM DUAL UNION ALL
SELECT 15, 1, 0, 0 FROM DUAL UNION ALL
SELECT 16, 1, 0, 0 FROM DUAL UNION ALL
SELECT 17, 1, 0, 0 FROM DUAL UNION ALL
SELECT 18, 1, 0, 0 FROM DUAL UNION ALL
SELECT 19, 1, 0, 0 FROM DUAL UNION ALL
SELECT 20, 1, 0, 0 FROM DUAL
Outputs:
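+----+---+---+---+----------------------+-------------+
| ID | A | B | C | ROW_AFTER_NEXT_MATCH | NUM_MATCHES |
+----+---+---+---+----------------------+-------------+
|  1 | 1 | 0 | 0 |                      |           0 |
|  2 | 1 | 0 | 0 |                      |           0 |
|  3 | 0 | 1 | 0 |                      |           0 |
|  4 | 1 | 0 | 0 |                      |           0 |
|  5 | 0 | 1 | 0 |                      |           1 |
|  6 | 0 | 0 | 1 |                   12 |           2 |
|  7 | 1 | 0 | 0 |                   13 |           2 |
|  8 | 0 | 1 | 0 |                      |           1 |
|  9 | 1 | 0 | 0 |                      |           1 |
| 10 | 0 | 1 | 0 |                      |           1 |
| 11 | 0 | 0 | 1 |                      |           2 |
| 12 | 1 | 0 | 0 |                      |           2 |
| 13 | 1 | 0 | 0 |                      |           1 |
| 14 | 1 | 0 | 0 |                      |           1 |
| 15 | 1 | 0 | 0 |                      |           1 |
| 16 | 1 | 0 | 0 |                   18 |           5 |
| 17 | 1 | 0 | 0 |                   19 |           5 |
| 18 | 1 | 0 | 0 |                   20 |           5 |
| 19 | 1 | 0 | 0 |                   21 |           5 |
| 20 | 1 | 0 | 0 |                      |           5 |
+----+---+---+---+----------------------+-------------+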
db<>fiddle here
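The base-3 encoding used in the query can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Oracle implementation. The names window_codes and values are made up, with values holding the per-row digit (0 for a, 1 for b, 2 for c) for ids 1 to 20 of the sample data.

```python
from collections import Counter

def window_codes(values, width=5, base=3):
    """Return, per row, the base-3 number formed by that row and the previous
    width-1 rows (most recent row is the lowest digit). The first width-1 rows
    get None, like the NULL produced by LAG in the query."""
    codes = []
    for i in range(len(values)):
        if i < width - 1:
            codes.append(None)
            continue
        code = 0
        for digit in values[i - width + 1 : i + 1]:
            code = code * base + digit
        codes.append(code)
    return codes

# Digits for ids 1..20 of the sample data (a -> 0, b -> 1, c -> 2)
values = [0, 0, 1, 0, 1, 2, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]

codes = window_codes(values)
matches = Counter(code for code in codes if code is not None)

print(codes[5], codes[10])   # ids 6 and 11 end the same five-row pattern
print(matches[codes[5]])     # how often that pattern occurs
```

Ids 6 and 11 both encode to 32, which is why both show NUM_MATCHES = 2 in the output above.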

Making pairs of rows without repeating values with given allocation window using SQL

I have two different data sources with events (say, backend and frontend events). An event can be reported by only the first, only the second or by both sources. I'm trying to find an approach to combine these two sources into one, where all events will be reported once. Also, I don't want to lose events.
I have no identifiers that I could use to join these sources. Instead, I have only the event type, the event datetime and a time window that I could use to join these events. The hard part begins when several events from both sides are caught by the same window, say, three events from source 'A' and two events from source 'B'. I don't know how these events should be combined with each other, but that's not the issue: I want them to be combined pairwise without repetitions, so that one event from source 'A' matches only one event from source 'B', and vice versa. It is also desirable (but not obligatory) to combine the closest events first.
And I do this in BigQuery, so I can't use recursive queries.
Here is an example (note that I don't know actual true_parent value):
with raw_data as (
SELECT 1 source_number, 1 dt, '1.1' name, 1 event_type, null true_parent
union all select 1, 60, '1.2', 1, null
union all select 1, 69, '1.3', 1, null
union all select 2, 0, '2.1', 1, '1.1'
union all select 2, 0, '2.2', 1, null
union all select 2, 2, '2.3', 1, '1.2'
union all select 2, 2, '2.4', 1, null
union all select 2, 69, '2.5', 1, '1.3'
union all select 1, 60, '1.1', 2, null
union all select 1, 60, '1.2', 2, null
union all select 1, 69, '1.3', 2, null
union all select 2, 0, '2.1', 2, '1.1'
union all select 2, 0, '2.2', 2, '1.2'
union all select 1, 0, '1.1', 3, null
union all select 1, 1, '1.2', 3, null
union all select 1, 2, '1.3', 3, null
union all select 2, 101, '2.1', 3, '1.3'
union all select 2, 0, '2.2', 3, '1.1'
union all select 2, 3, '2.3', 3, '1.2'
union all select 1, 1, '1.1', 4, null
union all select 1, 100, '1.2', 4, null
union all select 1, 200, '1.3', 4, null
union all select 2, 5, '2.1', 4, '1.1'
union all select 2, 15, '2.2', 4, '1.2'
union all select 2, 102, '2.3', 4, '1.3'
)
, windows as (
select 1 source_number, 20 time_window
union all select 2, 80
)
, dat as (
select
*
from raw_data
left join windows using(source_number)
)
, parent_selection as (
select
c.event_type,
c.name,
c.source_number,
c.dt,
p.name parent,
c.true_parent
from dat c
left join dat p
on c.event_type = p.event_type
and c.source_number > p.source_number
and ABS(c.dt - p.dt) <= c.time_window + p.time_window
)
select distinct
*
except (true_parent)
replace(case when true_parent is null then name else parent end as parent)
from parent_selection
where true_parent = parent or true_parent is null
order by event_type, parent, name
I used this child-parent abstraction because it is handy to group by parent in the next steps, but I would also appreciate any other abstraction that could be used to make these pairwise connections.
I just want an algorithm to replace the last part of the query, as I don't know the actual true_parent value.
Output:
event_type name source_number dt parent
1 1.1 1 1 1.1
1 2.1 2 0 1.1
1 1.2 1 60 1.2
1 2.3 2 2 1.2
1 1.3 1 69 1.3
1 2.5 2 69 1.3
1 2.2 2 0 2.2
1 2.4 2 2 2.4
2 1.1 1 60 1.1
2 2.1 2 0 1.1
2 1.2 1 60 1.2
2 2.2 2 0 1.2
2 1.3 1 69 1.3
3 1.1 1 0 1.1
3 2.2 2 0 1.1
3 1.2 1 1 1.2
3 2.3 2 3 1.2
3 1.3 1 2 1.3
3 2.1 2 101 1.3
4 1.1 1 1 1.1
4 2.1 2 5 1.1
4 1.2 1 100 1.2
4 2.2 2 15 1.2
4 1.3 1 200 1.3
4 2.3 2 102 1.3
Explanation:
In event_type 1, 1.1 should be combined with 2.1 or 2.2, 1.2 with 2.3 or 2.4, and 1.3 with 2.5, according to the closest dt value. I don't care whether 1.1 is combined with 2.1 or 2.2, but once one of them has been added to the pair, the other one must not be.
In event_type 2, 1.1 and 1.2 should be combined with 2.1 or 2.2; the order doesn't matter. 1.3 wouldn't be combined with any other event.
In event_type 3, 2.1 can be combined only with 1.2 or 1.3, but not with 1.1, because 1.1 doesn't fit its time window. So 2.1 is combined with 1.3, as it is closer than 1.2.
The remaining 2.2 and 2.3 can be combined with 1.1 and 1.2, but not with 1.3, because it is already occupied by 2.1.
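For reference, the pairing rule described above (one-to-one, within the window, closest first) can be written procedurally as a greedy pass. This is a minimal Python sketch for illustration only, not a BigQuery solution. The name greedy_pairs is made up, and max_gap = 100 assumes the combined window 20 + 80 from the windows CTE.

```python
def greedy_pairs(source_a, source_b, max_gap):
    """source_a, source_b: lists of (name, dt). Returns (pairs, unmatched names).

    Candidate pairs within max_gap are taken closest-first; each event is used
    at most once. Ties are broken by name so the result is deterministic.
    """
    candidates = sorted(
        (abs(dt_a - dt_b), name_a, name_b)
        for name_a, dt_a in source_a
        for name_b, dt_b in source_b
        if abs(dt_a - dt_b) <= max_gap
    )
    used_a, used_b, pairs = set(), set(), []
    for _gap, name_a, name_b in candidates:
        if name_a not in used_a and name_b not in used_b:
            pairs.append((name_a, name_b))
            used_a.add(name_a)
            used_b.add(name_b)
    unmatched = [n for n, _ in source_a if n not in used_a]
    unmatched += [n for n, _ in source_b if n not in used_b]
    return pairs, unmatched

# event_type 2 from the example data
source_1 = [("1.1", 60), ("1.2", 60), ("1.3", 69)]
source_2 = [("2.1", 0), ("2.2", 0)]

pairs, unmatched = greedy_pairs(source_1, source_2, max_gap=100)
print(pairs)
print(unmatched)
```

A greedy pass like this is not guaranteed to find the largest possible set of pairs: taking the closest pair first can leave another event without any partner inside its window.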
I've finally found a satisfying solution.
The idea is to zip-join the two sources, ordered by datetime, while still respecting the time window limitation. I took these steps to achieve all the goals:
1. Calculate session numbers for each source
2. Shift every parent's session if no parents are found
3. Shift all downstream sessions accordingly
4. Adjust children's sessions to the parent's
5. Group each session on its parent
It still fails on event_type = 6, but this can be fixed by tweaking time_window.
with raw_data as (
SELECT 1 source_number, 1 dt, '1.1' name, 1 event_type
union all select 2, 0, '2.1', 1
union all select 2, 1, '2.2', 1
union all select 2, 69, '2.5', 1
union all select 3, 60, '3.1', 1
union all select 3, 60, '3.2', 1
union all select 3, 69, '3.3', 1
union all select 1, 1, '1.1', 2
union all select 1, 100, '1.2', 2
union all select 2, 100, '2.1', 2
union all select 2, 200, '2.2', 2
union all select 2, 202, '2.3', 2
union all select 3, 1, '3.1', 2
union all select 3, 10, '3.2', 2
union all select 3, 100, '3.3', 2
union all select 4, 5, '4.1', 2
union all select 4, 15, '4.2', 2
union all select 4, 200, '4.3', 2
union all select 5, 1, '5.1', 2
union all select 5, 5, '5.2', 2
union all select 5, 15, '5.3', 2
union all select 5, 99, '5.4', 2
union all select 5, 100, '5.5', 2
union all select 5, 101, '5.6', 2
union all select 6, 50, '6.1', 2
union all select 6, 140, '6.2', 2
union all select 6, 200, '6.3', 2
union all select 6, 290, '6.4', 2
union all select 7, 50, '7.1', 2
union all select 7, 200, '7.2', 2
union all select 7, 210, '7.3', 2
union all select 7, 1000, '7.4', 2
union all select 1, 1, '1.1', 6
union all select 2, 55, '2.1', 6
union all select 3, 85, '3.1', 6
union all select 4, 255, '4.1', 6
union all select 1, 1, '1.1', 7
union all select 1, 1000, '1.2', 7
union all select 2, 1001, '2.1', 7
union all select 3, 1020, '3.1', 7
union all select 4, 1030, '4.1', 7
)
, windows as (
select 1 source_number, 0 time_window
union all select 2, 60
union all select 3, 60
union all select 4, 120
union all select 5, 120
union all select 6, 150
union all select 7, 500
)
, dat as (
select
*
from raw_data
left join windows using(source_number)
)
, sessions as (
select
*,
row_number() over(partition by event_type, source_number order by dt) session
from dat
)
, calc_parent_shift as (
select
a.*,
case
when count(b.session) = 0
then greatest(countif(count(b.session) = 0) over (w_session) - 1, 0)
else 0
end as parent_shift
from sessions a
left join sessions b
on a.event_type = b.event_type
and a.source_number > b.source_number
and a.session <= b.session
and ABS(a.dt - b.dt) <= a.time_window + b.time_window
group by 1, 2, 3, 4, 5, 6
window w_session as (
partition by a.event_type, a.session order by a.dt
)
)
, shift_parent_session as (
select
* except(session),
session + max(parent_shift) over (w_shift) as session,
session as old_session
from calc_parent_shift
window w_shift as (
partition by event_type, source_number
order by session
)
)
, shift_child_session as (
select
a.* except(session),
ifnull(array_agg(b.session order by b.source_number, b.session)[offset(0)] - a.old_session, 0) as child_shift,
a.session + greatest (
max(ifnull(array_agg(b.session order by b.source_number, b.session)[offset(0)] - a.old_session, 0)) over (w)
- max(a.parent_shift) over (w)
, 0
) as session
from shift_parent_session a
left join shift_parent_session b
on a.event_type = b.event_type
and a.source_number > b.source_number
and a.session <= b.session
and ABS(a.dt - b.dt) <= a.time_window + b.time_window
group by 1, 2, 3, 4, 5, 6, 7, a.session
window w as (
partition by a.event_type, a.source_number
order by a.session
)
)
, session_groups as (
select
a.* except (parent_shift, child_shift, old_session),
min(b.source_number) parent_source_number
from shift_child_session a
left join shift_child_session b
on a.event_type = b.event_type
and a.session = b.session
and ABS(a.dt - b.dt) <= a.time_window + b.time_window
and a.source_number >= b.source_number
group by 1, 2, 3, 4, 5, 6
)
, result as (
select
event_type,
session,
parent_source_number,
array_agg(source_number) source_number,
array_agg(dt) dt,
array_agg(name) name
from session_groups
group by 1, 2, 3
order by 1, 2, 3
)
select * from result
# where event_type = 6
order by event_type, session

Bigquery aggregating into array based on id and id_type

I have a table that looks similar to this:
WITH
table AS (
SELECT 1 object_id, 234 type_id, 2 type_level UNION ALL
SELECT 1, 23, 1 UNION ALL
SELECT 1, 24, 1 UNION ALL
SELECT 1, 2, 0 UNION ALL
SELECT 1, 2, 0 UNION ALL
SELECT 2, 34, 1 UNION ALL
SELECT 2, 46, 1 UNION ALL
SELECT 2, 465, 2 UNION ALL
SELECT 2, 349, 2 UNION ALL
SELECT 2, 4, 0 UNION ALL
SELECT 2, 3, 0 )
SELECT
object_id,
type_id,
type_level
FROM
table
Now I am trying to create three new columns, type_level_0_array, type_level_1_array and type_level_2_array, for each object, and to aggregate the type_id values of the corresponding type level into those arrays (I am not looking for a comma-separated string).
So my resultant table should look like the following:
+----+--------------------+--------------------+--------------------+
| id | type_level_0_array | type_level_1_array | type_level_2_array |
+----+--------------------+--------------------+--------------------+
| 1 | 2 | 24,23 | 234 |
+----+--------------------+--------------------+--------------------+
| 2 | 3,4 | 34,46 | 465,349 |
+----+--------------------+--------------------+--------------------+
Is there any way to accomplish that?
Update:
Although it seems that my type_id has certain pattern e.g. level 0 types are of 1 length, level 1 types are of 2 length and so on, in my real dataset there is no such pattern. The identification of level is solely possible by looking at type_level of any row.
Below is for BigQuery Standard SQL
#standardSQL
SELECT object_id,
ARRAY_AGG(DISTINCT IF(type_level = 0, type_id, NULL) IGNORE NULLS) AS type_level_0_array,
ARRAY_AGG(DISTINCT IF(type_level = 1, type_id, NULL) IGNORE NULLS) AS type_level_1_array,
ARRAY_AGG(DISTINCT IF(type_level = 2, type_id, NULL) IGNORE NULLS) AS type_level_2_array
FROM `project.dataset.table`
GROUP BY object_id
You can test and play with the above using the sample data from your question, as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 object_id, 234 type_id, 2 type_level UNION ALL
SELECT 1, 23, 1 UNION ALL
SELECT 1, 24, 1 UNION ALL
SELECT 1, 2, 0 UNION ALL
SELECT 1, 2, 0 UNION ALL
SELECT 2, 34, 1 UNION ALL
SELECT 2, 46, 1 UNION ALL
SELECT 2, 465, 2 UNION ALL
SELECT 2, 349, 2 UNION ALL
SELECT 2, 4, 0 UNION ALL
SELECT 2, 3, 0 )
SELECT object_id,
ARRAY_AGG(DISTINCT IF(type_level = 0, type_id, NULL) IGNORE NULLS) AS type_level_0_array,
ARRAY_AGG(DISTINCT IF(type_level = 1, type_id, NULL) IGNORE NULLS) AS type_level_1_array,
ARRAY_AGG(DISTINCT IF(type_level = 2, type_id, NULL) IGNORE NULLS) AS type_level_2_array
FROM `project.dataset.table`
GROUP BY object_id
with result
Row  object_id  type_level_0_array  type_level_1_array  type_level_2_array
1    1          2                   24                  234
                                    23
2    2          4                   34                  349
                3                   46                  465
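The same conditional aggregation can be sketched in Python to show what each array ends up containing. This is only an illustration: the name arrays_by_level is made up, and duplicates are dropped in first-seen order, whereas ARRAY_AGG(DISTINCT ...) does not guarantee any element order.

```python
from collections import defaultdict

def arrays_by_level(rows):
    """rows: (object_id, type_id, type_level).
    Returns {object_id: {type_level: [distinct type_ids]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for object_id, type_id, type_level in rows:
        bucket = grouped[object_id][type_level]
        if type_id not in bucket:   # plays the role of DISTINCT
            bucket.append(type_id)
    return {obj: dict(levels) for obj, levels in grouped.items()}

# Sample data from the question
rows = [
    (1, 234, 2), (1, 23, 1), (1, 24, 1), (1, 2, 0), (1, 2, 0),
    (2, 34, 1), (2, 46, 1), (2, 465, 2), (2, 349, 2), (2, 4, 0), (2, 3, 0),
]

result = arrays_by_level(rows)
print(result[1])
print(result[2])
```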
Try this. Works for me.
BigQuery raises an error if an array in the query result contains NULL elements, which is why IGNORE NULLS is required.
EDIT: I've updated the code to be based on the type_level column
WITH table
AS (
SELECT 1 object_id, 234 type_id, 2 type_level UNION ALL
SELECT 1, 23, 1 UNION ALL
SELECT 1, 24, 1 UNION ALL
SELECT 1, 2, 0 UNION ALL
SELECT 1, 2, 0 UNION ALL
SELECT 2, 34, 1 UNION ALL
SELECT 2, 46, 1 UNION ALL
SELECT 2, 465, 2 UNION ALL
SELECT 2, 349, 2 UNION ALL
SELECT 2, 4, 0 UNION ALL
SELECT 2, 3, 0 )
SELECT
object_id
-- DISTINCT drops the repeated type_id 2 for object 1; GROUP BY gives one row per object
, ARRAY_AGG(DISTINCT CASE WHEN type_level = 0 THEN type_id ELSE NULL END IGNORE NULLS) AS type_level_0_array
, ARRAY_AGG(DISTINCT CASE WHEN type_level = 1 THEN type_id ELSE NULL END IGNORE NULLS) AS type_level_1_array
, ARRAY_AGG(DISTINCT CASE WHEN type_level = 2 THEN type_id ELSE NULL END IGNORE NULLS) AS type_level_2_array
FROM
table
GROUP BY
object_id

Oracle PL/SQL: How to find duplicate sequences in large table?

I have a ~20000 row table like this (seq = sequence):
id seq_num seq_count seq_id a b c d
----------------------------------------------------
1 1 3 A400 1 0 0 0
2 2 3 A400 0 1 0 0
3 3 3 A400 0 0 1 0
4 1 2 V2303 1 1 1 1
5 2 2 V2303 1 1 1 1
6 1 3 G2 1 0 0 0
7 2 3 G2 0 1 0 0
8 3 3 G2 0 0 1 0
9 1 3 U900 1 0 0 0
10 2 3 U900 2 2 1 1
11 3 3 U900 5 3 8 5
I want to find the seq_id of a-b-c-d sequences that have duplicates in the table; the output could just be a dbms_output.put_line or anything. So as you can see, seq_id G2 is a duplicate of A400 because all of their rows match up, but U900 has no duplicates even though one row matches A400 and G2.
Is there a good way to check for duplicates like this on large sets of data? I cannot create new tables to temporarily hold data. So far I've been trying with cursors mostly but no luck.
Thank you, let me know if you need any more info about my problem.
Oracle Setup:
CREATE TABLE table_name ( id, seq_num, seq_count, seq_id, a, b, c, d ) AS
SELECT 1, 1, 3, 'A400', 1, 0, 0, 0 FROM DUAL UNION ALL
SELECT 2, 2, 3, 'A400', 0, 1, 0, 0 FROM DUAL UNION ALL
SELECT 3, 3, 3, 'A400', 0, 0, 1, 0 FROM DUAL UNION ALL
SELECT 4, 1, 2, 'V2303', 1, 1, 1, 1 FROM DUAL UNION ALL
SELECT 5, 2, 2, 'V2303', 1, 1, 1, 1 FROM DUAL UNION ALL
SELECT 6, 1, 3, 'G2', 1, 0, 0, 0 FROM DUAL UNION ALL
SELECT 7, 2, 3, 'G2', 0, 1, 0, 0 FROM DUAL UNION ALL
SELECT 8, 3, 3, 'G2', 0, 0, 1, 0 FROM DUAL UNION ALL
SELECT 9, 1, 3, 'U900', 1, 0, 0, 0 FROM DUAL UNION ALL
SELECT 10, 2, 3, 'U900', 2, 2, 1, 1 FROM DUAL UNION ALL
SELECT 11, 3, 3, 'U900', 5, 3, 8, 5 FROM DUAL;
Query:
SELECT s.seq_id,
t.seq_id AS matched_seq_id
FROM table_name s
INNER JOIN
table_name t
ON ( s.seq_num = t.seq_num
AND s.seq_count = t.seq_count
AND s.seq_id < t.seq_id
AND s.a = t.a
AND s.b = t.b
AND s.c = t.c
AND s.d = t.d )
GROUP BY
t.seq_id,
s.seq_id
HAVING COUNT( DISTINCT t.seq_num ) = MAX( t.seq_count );
Results:
SEQ_ID MATCHED_SEQ_ID
------ --------------
A400 G2
Assuming results fit in a string about 2000 characters long, the fastest way is probably to use listagg():
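select abcds, listagg(seq_id, ',') within group (order by seq_id) as seq_ids
from (select seq_id,
-- separate a, b, c, d with ':' so that e.g. (1, 23) and (12, 3) do not produce the same string
listagg(a || ':' || b || ':' || c || ':' || d, ',') within group (order by seq_num) as abcds
from table_name
group by seq_id
) t
group by abcds
having count(*) >= 2;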
This returns the matches as a comma-delimited list.
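The idea behind the listagg() approach, one signature per seq_id and then grouping by signature, can be sketched in Python. This is only an illustration of the logic: the name duplicate_sequences is made up, and the rows carry just the columns the comparison needs.

```python
from collections import defaultdict

def duplicate_sequences(rows):
    """rows: (seq_num, seq_id, a, b, c, d).
    Returns sorted groups of seq_ids whose ordered (a, b, c, d) rows are identical."""
    per_seq = defaultdict(list)
    for seq_num, seq_id, a, b, c, d in rows:
        per_seq[seq_id].append((seq_num, (a, b, c, d)))

    by_signature = defaultdict(list)
    for seq_id, items in per_seq.items():
        # Order by seq_num, then keep only the values: this is the sequence's signature
        signature = tuple(values for _num, values in sorted(items))
        by_signature[signature].append(seq_id)

    return sorted(sorted(ids) for ids in by_signature.values() if len(ids) >= 2)

# Sample data from the question
rows = [
    (1, "A400", 1, 0, 0, 0), (2, "A400", 0, 1, 0, 0), (3, "A400", 0, 0, 1, 0),
    (1, "V2303", 1, 1, 1, 1), (2, "V2303", 1, 1, 1, 1),
    (1, "G2", 1, 0, 0, 0), (2, "G2", 0, 1, 0, 0), (3, "G2", 0, 0, 1, 0),
    (1, "U900", 1, 0, 0, 0), (2, "U900", 2, 2, 1, 1), (3, "U900", 5, 3, 8, 5),
]

groups = duplicate_sequences(rows)
print(groups)
```

Using tuples as the signature avoids the string-length limit and the delimiter concerns that come with concatenating values into a single string.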