In BigQuery, remove rows that are duplicated on every column besides one

When removing duplicate rows in BigQuery using multiple columns, a common solution is to use row_number() partitioned by the columns that define a duplicate. In our case, we have a wide table (30 columns) and want to remove duplicates based on the uniqueness of 29 of those columns:
with
t1 as (
select 1 as a, 2 as b, 3 as c, 4 as d, 5 as e, 6 as f, 7 as g, 8 as h, 9 as i union all
select 2 as a, 3 as b, 3 as c, 4 as d, 5 as e, 6 as f, 7 as g, 8 as h, 9 as i union all
select 3 as a, 4 as b, 3 as c, 4 as d, 5 as e, 6 as f, 7 as g, 8 as h, 9 as i union all
select 4 as a, 5 as b, 3 as c, 4 as d, 5 as e, 6 as f, 7 as g, 8 as h, 9 as i union all
select 5 as a, 6 as b, 3 as c, 4 as d, 5 as e, 6 as f, 7 as g, 8 as h, 9 as i union all
select 6 as a, 2 as b, 3 as c, 4 as d, 5 as e, 6 as f, 7 as g, 8 as h, 9 as i
)
In the table above, we want to remove duplicates considering all columns except column a. Rows 1 and 6 are therefore duplicates, and we want to remove one of them, preferably the one with the higher value in column a (row 6 in this example). Is this possible without writing row_number() over (partition by b, c, d, e, f, g, h, i, ...)?

You may consider the query below:
SELECT *
FROM t1
QUALIFY ROW_NUMBER() OVER (
PARTITION BY TO_JSON_STRING((SELECT AS STRUCT t1.* EXCEPT(a)))
ORDER BY a ASC
) = 1;
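If the goal is to rewrite the table itself rather than just query it, the same pattern can be wrapped in a CREATE OR REPLACE TABLE statement. A minimal sketch, assuming the data lives in a placeholder table `project.dataset.wide_table`:
CREATE OR REPLACE TABLE `project.dataset.wide_table` AS
SELECT *
FROM `project.dataset.wide_table` t
-- serialize every column except a into one string per row, and keep the row with the lowest a in each group
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY TO_JSON_STRING((SELECT AS STRUCT t.* EXCEPT(a)))
  ORDER BY a ASC
) = 1;
-- note: some older BigQuery releases required an explicit WHERE clause alongside QUALIFY; add WHERE TRUE if you see that error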

Another option:
select any_value(t).* replace(min(a) as a)
from your_table t
group by to_json_string((select as struct * except(a) from unnest([t])))
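For what it's worth, here is a sketch of that approach wired up against the t1 CTE from the question so it can be run as-is; UNNEST([t]) turns the current row into a one-row table, so the subquery can drop column a before serializing the rest as the grouping key:
with t1 as (
  select 1 as a, 2 as b, 3 as c, 4 as d, 5 as e, 6 as f, 7 as g, 8 as h, 9 as i union all
  select 2, 3, 3, 4, 5, 6, 7, 8, 9 union all
  select 3, 4, 3, 4, 5, 6, 7, 8, 9 union all
  select 4, 5, 3, 4, 5, 6, 7, 8, 9 union all
  select 5, 6, 3, 4, 5, 6, 7, 8, 9 union all
  select 6, 2, 3, 4, 5, 6, 7, 8, 9
)
-- rows 1 and 6 share the same values in b through i, so they collapse into one row keeping the lower a
select any_value(t).* replace(min(a) as a)
from t1 t
group by to_json_string((select as struct * except(a) from unnest([t])))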

Related

SQL - How to Label Number for Each ROW & BLOCK

I have a table column like this:
A, A, B, C, A, A, B, D, E, E, E
I would like to label each row and block with a number, like this:
(A, 1), (A, 1), (B, 2), (C, 3), (A, 4), (A, 4), (B, 5), (D, 6), (E, 7), (E, 7), (E, 7)
How can I do this? Thank you.
Assuming you have a table like this:
SELECT * FROM t ORDER BY ord
let, ord
--------
A, 1
A, 2
B, 3
C, 4
A, 5
A, 6
B, 7
D, 8
E, 9
E, 10
E, 11
If you do this:
with cte as(
select let, ord, case when lag(let) over(order by ord) <> let then 1 else 0 end as letchanged
from yourtable
)
select let,
1 + sum(letchanged) over(order by ord rows unbounded preceding) as ctr
from cte
Then you will get:
let, ctr
--------
A, 1
A, 1
B, 2
C, 3
A, 4
A, 4
B, 5
D, 6
E, 7
E, 7
E, 7
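In BigQuery specifically, the same idea can also be written with COUNTIF instead of summing a 0/1 flag; this is just a stylistic variant of the query above, assuming the same table name yourtable:
select let, ord,
  1 + countif(changed) over(order by ord) as ctr
from (
  select let, ord,
    -- true only when the letter differs from the previous row's letter
    lag(let) over(order by ord) is not null
    and lag(let) over(order by ord) <> let as changed
  from yourtable
)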

Build query for nested structure in bigquery

I had the following code snippet
WITH sequences AS
(SELECT 1 AS id, [STRUCT(0 AS a, 1 AS b)] AS some_numbers
UNION ALL SELECT 2 AS id, [STRUCT(2 AS b, 4 AS a)] AS some_numbers
UNION ALL SELECT 3 AS id, [STRUCT(5 AS b, 3 AS a), STRUCT (7 AS b, 4 AS a)]
AS some_numbers)
SELECT id AS matching_rows
FROM sequences
WHERE EXISTS (SELECT 1
FROM UNNEST(some_numbers)
WHERE b > 3);
And I got the following output
Row matching_rows
1 2
2 3
As per the WHERE condition, the result should be the 3rd row only. Why does it show the 2nd row as well?
Struct fields are unioned by position, not name. So this:
WITH sequences AS
(SELECT 1 AS id, [STRUCT(0 AS a, 1 AS b)] AS some_numbers
UNION ALL SELECT 2 AS id, [STRUCT(2 AS b, 4 AS a)] AS some_numbers
UNION ALL SELECT 3 AS id, [STRUCT(5 AS b, 3 AS a), STRUCT (7 AS b, 4 AS a)]
AS some_numbers)
SELECT id AS matching_rows
FROM sequences
WHERE EXISTS (SELECT 1
FROM UNNEST(some_numbers)
WHERE b > 3);
is equivalent to this:
WITH sequences AS
(SELECT 1 AS id, [STRUCT(0 AS a, 1 AS b)] AS some_numbers
UNION ALL SELECT 2 AS id, [STRUCT(2 AS a, 4 AS b)] AS some_numbers
UNION ALL SELECT 3 AS id, [STRUCT(5 AS a, 3 AS b), STRUCT (7 AS a, 4 AS b)]
AS some_numbers)
SELECT id AS matching_rows
FROM sequences
WHERE EXISTS (SELECT 1
FROM UNNEST(some_numbers)
WHERE b > 3);
In all but the first query of the union, you can also drop the AS <name> aliases, since they don't affect the result.
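A minimal sketch of the fix is therefore to write every struct with its fields in the same order, so the positional matching lines up with the field names; with that change only the 3rd row should match, as intended:
WITH sequences AS
(SELECT 1 AS id, [STRUCT(0 AS a, 1 AS b)] AS some_numbers
UNION ALL SELECT 2 AS id, [STRUCT(4 AS a, 2 AS b)] AS some_numbers
UNION ALL SELECT 3 AS id, [STRUCT(3 AS a, 5 AS b), STRUCT(4 AS a, 7 AS b)]
AS some_numbers)
SELECT id AS matching_rows
FROM sequences
WHERE EXISTS (SELECT 1
FROM UNNEST(some_numbers)
WHERE b > 3);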

BigQuery function to collapse row data into JSON or structure

I'm looking for a way to group by a number of columns in BigQuery but keep more detail about the rows being aggregated than is otherwise possible.
Data:
ID A B C D
2 1 2 3 4
2 2 3 4 5
1 1 2 1 3
My query will look something like this:
SELECT id, TAKE_ANY(a), sum(b), count(d), max(d), MAGIC(a,b,c,d) FROM table GROUP BY 1
And the output I would like is something like:
1, 1, 2, 1, 3, [ (1,2,1,3)]
2, 2, 5, 2, 5, [ (1,2,3,4), (2,3,4,5) ]
Does anything like the MAGIC function exist that will package the data into a structure of some sort?
The option below (for BigQuery Standard SQL) is for the case where by [ (1,2,3,4), (2,3,4,5) ] you actually mean a STRING rather than an ARRAY of STRUCTs (which is not entirely clear from the question, but seems possible):
#standardSQL
SELECT
id,
ANY_VALUE(a) any_a,
SUM(b) sum_b,
COUNT(d) count_d,
MAX(d) max_d,
FORMAT('[%s]', STRING_AGG(FORMAT('(%i,%i,%i,%i)', a, b, c, d), ',')) a_b_c_d
FROM `project.dataset.table`
GROUP BY id
If applied to the dummy data from your question, as below,
#standardSQL
WITH `project.dataset.table` AS (
SELECT 2 id, 1 a, 2 b, 3 c, 4 d UNION ALL
SELECT 2, 2, 3, 4, 5 UNION ALL
SELECT 1, 1, 2, 1, 3
)
SELECT
id,
ANY_VALUE(a) any_a,
SUM(b) sum_b,
COUNT(d) count_d,
MAX(d) max_d,
FORMAT('[%s]', STRING_AGG(FORMAT('(%i,%i,%i,%i)', a, b, c, d), ',')) a_b_c_d
FROM `project.dataset.table`
GROUP BY id
ORDER BY id
the result will be:
Row id any_a sum_b count_d max_d a_b_c_d
1 1 1 2 1 3 [(1,2,1,3)]
2 2 1 5 2 5 [(1,2,3,4),(2,3,4,5)]
Inside your select list, use ARRAY_AGG with the STRUCT function. For example:
SELECT id, ARRAY_AGG(STRUCT(a, b, c, d))
FROM table
GROUP BY id
This will return an array containing all the values of those columns for each group.
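Putting that together with the other aggregates from the question, the hypothetical MAGIC(a,b,c,d) call could be replaced like this (the column aliases are only illustrative):
SELECT
  id,
  ANY_VALUE(a) AS any_a,
  SUM(b) AS sum_b,
  COUNT(d) AS count_d,
  MAX(d) AS max_d,
  ARRAY_AGG(STRUCT(a, b, c, d)) AS detail  -- the "MAGIC" column: one struct per aggregated row
FROM `project.dataset.table`
GROUP BY id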

BigQuery: select the nth smallest value in window, ordered by another value

My table has two integer columns: a and b. For each row, I want to select the nth smallest value of b among the rows with smaller a values. Here's a sample input/output, with n=2.
Input:
a | b
-------
1 | 4
2 | 2
3 | 5
4 | 3
5 | 9
6 | 1
7 | 7
8 | 6
9 | 0
Output:
a | 2nd min b
-------------
1 | null ← only 1 element in [4], no 2nd min
2 | 4 ← 2nd min between [4,2]
3 | 4 ← 2nd min between [4,2,5]
4 | 3 ← 2nd min between [4,2,5,3]
5 | 3 ← etc.
6 | 2
7 | 2
8 | 2
9 | 1
I used n=2 here to keep it simple, but in practice, I want the 2000th smallest value (or some other large-ish constant). The column a can be assumed to contain distinct integers (and even 1, 2, 3, … if that's easier).
The problem is that if I use ORDER BY b in my window clause and NTH_VALUE, it just computes the answer on the wrong set of values:
WITH data AS (
SELECT 1 AS a, 4 AS b
UNION ALL SELECT 2 AS a, 2 AS b
UNION ALL SELECT 3 AS a, 5 AS b
UNION ALL SELECT 4 AS a, 3 AS b
UNION ALL SELECT 5 AS a, 9 AS b
UNION ALL SELECT 6 AS a, 1 AS b
)
SELECT nth_value(b, 2) over (order by a)
from data
returns [null, 2, 2, 2, 2, 2]: the values are ordered by a (so in the same order as they appear), so the value b=2 is always the one in second place. I want to order by a and then take the nth smallest value of b. Any idea how to write this in BigQuery (preferably Standard SQL)?
The query below is for BigQuery Standard SQL and produces the correct result for the given example:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 a, 4 b UNION ALL
SELECT 2, 2 UNION ALL
SELECT 3, 5 UNION ALL
SELECT 4, 3 UNION ALL
SELECT 5, 9 UNION ALL
SELECT 6, 1 UNION ALL
SELECT 7, 7 UNION ALL
SELECT 8, 6 UNION ALL
SELECT 9, 0
)
SELECT
a,
(SELECT b FROM
(SELECT b FROM UNNEST(c) b ORDER BY b LIMIT 2)
ORDER BY b DESC LIMIT 1
) b2
FROM (
SELECT a, IF(ARRAY_LENGTH(c) > 1, c, [NULL]) c
FROM (
SELECT a, ARRAY_AGG(b) OVER (ORDER BY a) c
FROM `project.dataset.table`
)
)
-- ORDER BY a
with the expected result as below:
Row a b2
1 1 null
2 2 4
3 3 4
4 4 3
5 5 3
6 6 2
7 7 2
8 8 2
9 9 1
Note: to make it work for the 2000th element, change 2 to 2000 in LIMIT 2.
Meantime, I admit it looks a little ugly/messy to me, and I am not sure about scalability, but you can give it a shot.
Quick Update
Below is a slightly less ugly-looking version (same output, of course):
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 a, 4 b UNION ALL
SELECT 2, 2 UNION ALL
SELECT 3, 5 UNION ALL
SELECT 4, 3 UNION ALL
SELECT 5, 9 UNION ALL
SELECT 6, 1 UNION ALL
SELECT 7, 7 UNION ALL
SELECT 8, 6 UNION ALL
SELECT 9, 0
)
SELECT a, c[SAFE_ORDINAL(2)] b2 FROM (
SELECT x.a, ARRAY_AGG(y.b ORDER BY y.b LIMIT 2) c
FROM `project.dataset.table` x
CROSS JOIN `project.dataset.table` y
WHERE y.a <= x.a
GROUP BY x.a
)
-- ORDER BY a
For the 2000th element, replace 2 with 2000 in LIMIT 2 and SAFE_ORDINAL(2).
There is still potentially the same scalability issue because of the (now explicit) CROSS JOIN.
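For concreteness, the n = 2000 version described above would look like this; only the two constants change:
#standardSQL
SELECT a, c[SAFE_ORDINAL(2000)] b2 FROM (
  SELECT x.a, ARRAY_AGG(y.b ORDER BY y.b LIMIT 2000) c
  FROM `project.dataset.table` x
  CROSS JOIN `project.dataset.table` y
  WHERE y.a <= x.a
  GROUP BY x.a
)
-- ORDER BY a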

Use a calculated field in the where clause

Is there a way to use a calculated field in the where clause?
I want to do something like
SELECT a, b, a+b as TOTAL FROM (
select 7 as a, 8 as b FROM DUAL
UNION ALL
select 8 as a, 8 as b FROM DUAL
UNION ALL
select 0 as a, 0 as b FROM DUAL
)
WHERE TOTAL <> 0
;
but I get ORA-00904: "TOTAL": invalid identifier.
So I have to use
SELECT a, b, a+b as TOTAL FROM (
select 7 as a, 8 as b FROM DUAL
UNION ALL
select 8 as a, 8 as b FROM DUAL
UNION ALL
select 0 as a, 0 as b FROM DUAL
)
WHERE a+b <> 0
;
Logically, the SELECT clause is one of the last parts of a query to be evaluated, so its aliases and derived columns are not yet available in the WHERE clause. (Except in ORDER BY, which logically happens last.)
Using a derived table is a way around this:
select *
from (SELECT a, b, a+b as TOTAL FROM (
select 7 as a, 8 as b FROM DUAL
UNION ALL
select 8 as a, 8 as b FROM DUAL
UNION ALL
select 0 as a, 0 as b FROM DUAL)
)
WHERE TOTAL <> 0
;
This will work...
select *
from (SELECT a, b, a+b as TOTAL FROM (
select 7 as a, 8 as b FROM DUAL
UNION ALL
select 8 as a, 8 as b FROM DUAL
UNION ALL
select 0 as a, 0 as b FROM DUAL)
) Temp
WHERE TOTAL <> 0;
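An equivalent workaround is subquery factoring (a WITH clause), which Oracle also supports; it is the same derived-table idea written as a named query:
WITH totals AS (
  SELECT a, b, a+b as TOTAL FROM (
    select 7 as a, 8 as b FROM DUAL
    UNION ALL
    select 8 as a, 8 as b FROM DUAL
    UNION ALL
    select 0 as a, 0 as b FROM DUAL
  )
)
SELECT *
FROM totals
WHERE TOTAL <> 0;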