How do you aggregate two columns into arrays in BigQuery? - sql

Say I have a table with 4 columns, a of type string, b of type integer, c of type integer, and d of type integer. How would I go ahead and use something like ARRAY_AGG on a STRUCT(b, c) and d separately (in other words, have two separate columns that would be arrays)?
The Query I have so far:
SELECT table1.a, table1.x, table2.y
FROM (
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x
FROM project.table
GROUP BY a
ORDER BY a
) AS table1
LEFT OUTER JOIN (
SELECT a, ARRAY_AGG(d) AS y
FROM project.table
GROUP BY a, d
ORDER BY a, d
) table2 ON table1.a = table2.a
GROUP BY table1.a
ORDER BY table1.a;
This gives me the error: SELECT list expression references table1.x which is neither grouped nor aggregated at [1:20]
But if I try to add table1.x to the GROUP BY clause at the end, I get a new error: Grouping by expressions of type ARRAY is not allowed at [14:22]

Below is for BigQuery Standard SQL
#standardSQL
SELECT a,
ARRAY(
SELECT AS STRUCT b, c FROM t.x GROUP BY b, c
) AS x, y
FROM (
SELECT a,
ARRAY_AGG(STRUCT(b, c)) AS x,
ARRAY_AGG(DISTINCT d) AS y
FROM `project.dataset.table`
GROUP BY a
) t

Why not just do this in one query?
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x, ARRAY_AGG(DISTINCT d) AS y
FROM project.table
GROUP BY a
ORDER BY a;
Your version will work without the outer GROUP BY, but it is needlessly complicated.
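To make the intended result shape concrete, here is a minimal Python sketch that mimics what a single GROUP BY a with two aggregates would return. It is only an emulation of the semantics, not BigQuery itself: the sample rows and the helper name aggregate are hypothetical, and BigQuery does not guarantee array element order unless ARRAY_AGG is given its own ORDER BY.

```python
from collections import defaultdict

# Hypothetical sample rows (a, b, c, d); the original table's data is not shown.
rows = [
    ("k1", 1, 10, 100),
    ("k1", 2, 20, 100),
    ("k2", 3, 30, 200),
]

def aggregate(rows):
    """Group by a; collect (b, c) pairs and the distinct d values per group."""
    x = defaultdict(list)  # plays the role of ARRAY_AGG(STRUCT(b, c))
    y = defaultdict(list)  # plays the role of ARRAY_AGG(DISTINCT d)
    for a, b, c, d in rows:
        x[a].append((b, c))
        if d not in y[a]:
            y[a].append(d)
    # One output row per value of a, each carrying two independent arrays.
    return [(a, x[a], y[a]) for a in sorted(x)]

result = aggregate(rows)
for row in result:
    print(row)
```

Each group yields exactly one row, which is why no join between two separately aggregated subqueries is needed.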

Related

How do I add two sum queries together?

I have the following two pseudo queries:
SELECT Sum(a)
FROM b
WHERE c
and
SELECT Sum(d)
FROM b
WHERE e
I want to sum these queries together to one value but I can't figure out the syntax. Note the FROM statement is the same ("b"). I've tried a UNION query but this gives me two values...
You can use iif() inside sum() where you apply the conditions:
select sum(iif(c, a, 0)) + sum(iif(e, d, 0))
from b
Since both queries will always return a single record, you could alternatively cross join the two subqueries and simply add the results, e.g.:
select r1 + r2 from
(select sum(a) as r1 from b where c) t1,
(select sum(d) as r2 from b where e) t2
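As a quick check that both suggestions agree with running the two queries separately, here is a small sketch using Python's built-in sqlite3. The columns flag_c and flag_e are hypothetical stand-ins for the conditions c and e, and CASE WHEN is used in place of iif() because it is available in every SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# flag_c / flag_e are assumed boolean columns standing in for conditions c and e.
conn.execute("CREATE TABLE b (a INTEGER, d INTEGER, flag_c INTEGER, flag_e INTEGER)")
conn.executemany(
    "INSERT INTO b VALUES (?, ?, ?, ?)",
    [(1, 10, 1, 0), (2, 20, 1, 1), (3, 30, 0, 1)],
)

# Baseline: run the two queries separately and add the results in Python.
separate = (
    conn.execute("SELECT SUM(a) FROM b WHERE flag_c = 1").fetchone()[0]
    + conn.execute("SELECT SUM(d) FROM b WHERE flag_e = 1").fetchone()[0]
)

# Conditional aggregation: one pass over the table.
combined = conn.execute(
    """
    SELECT SUM(CASE WHEN flag_c = 1 THEN a ELSE 0 END)
         + SUM(CASE WHEN flag_e = 1 THEN d ELSE 0 END)
    FROM b
    """
).fetchone()[0]

# Cross join of two single-row subqueries.
cross = conn.execute(
    """
    SELECT r1 + r2
    FROM (SELECT SUM(a) AS r1 FROM b WHERE flag_c = 1) t1,
         (SELECT SUM(d) AS r2 FROM b WHERE flag_e = 1) t2
    """
).fetchone()[0]

print(separate, combined, cross)
```

One difference to keep in mind: if no row satisfies one of the conditions, the cross-join form adds a NULL and returns NULL, while the conditional form still returns a number because of the ELSE 0 branch.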
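Try (note UNION ALL: a plain UNION would collapse the two rows into one if both sums happen to be equal):
SELECT SUM(col1)
FROM
(
SELECT Sum(a) col1
FROM b
WHERE c
UNION ALL
SELECT Sum(d) col1
FROM b
WHERE e) t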
Please try the following:
SELECT Sum(sumVal)
FROM
(SELECT Sum(a) sumVal
FROM b
WHERE c
UNION ALL
SELECT Sum(d) sumVal
FROM b
WHERE e
) t
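When summing over a union of per-query sums, the choice between UNION and UNION ALL matters, because UNION removes duplicate rows. The sketch below, run through Python's sqlite3 with hypothetical data and flag columns (flag_c and flag_e stand in for the conditions c and e), picks values so that both partial sums are equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data chosen so that SUM(a) and SUM(d) are both 5.
conn.execute("CREATE TABLE b (a INTEGER, d INTEGER, flag_c INTEGER, flag_e INTEGER)")
conn.executemany(
    "INSERT INTO b VALUES (?, ?, ?, ?)",
    [(2, 0, 1, 0), (3, 0, 1, 0), (0, 5, 0, 1)],
)

template = """
    SELECT SUM(col1) FROM (
        SELECT SUM(a) AS col1 FROM b WHERE flag_c = 1
        {op}
        SELECT SUM(d) AS col1 FROM b WHERE flag_e = 1
    ) t
"""

# UNION deduplicates the two identical rows (5 and 5) before the outer SUM.
with_union = conn.execute(template.format(op="UNION")).fetchone()[0]
# UNION ALL keeps both rows, so the outer SUM sees 5 + 5.
with_union_all = conn.execute(template.format(op="UNION ALL")).fetchone()[0]

print(with_union, with_union_all)
```

With this data the UNION variant silently returns half of the expected total, which is why UNION ALL is the safer operator here.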
Try using this:
;WITH
t1 as ( select sum(a) as a from b where c>20)
,
t2 as (select sum(d) as d from b where e is not null)
select t1.a + t2.d as s from t1 cross join t2

Integrate two sql queries

I have these two queries:
SELECT DISTINCT a, b, c, d FROM x WHERE b IN (1, 2)
SELECT DISTINCT c, d FROM y
I would now like to merge these queries such that the statement initiated in the first query only includes rows where the c, d combination is in the output resulting from the second query. Any thoughts on how to do this? My table is large, so efficiency is important.
Use exists?
SELECT DISTINCT a, b, c, d
FROM x
WHERE b IN (1, 2) AND
EXISTS (SELECT 1 FROM y WHERE x.c = y.c and x.d = y.d);
When using exists, the select distinct is only necessary if x has duplicate values. Otherwise it is not necessary.
For performance, you want an index on y(c, d). An index on x(b, a, c, d) would also help in most databases.
Note: The distinct is not necessary in the subquery. In some databases, you can use in with composite values as well.
SELECT DISTINCT x.a,x.b,x.c,x.d
FROM x
INNER JOIN y ON x.c = y.c
AND x.d = y.d
WHERE b in (1,2)
Regarding efficiency, your indexing will determine how well that performs.
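The practical difference between the two answers shows up when y contains duplicate (c, d) pairs: EXISTS never multiplies rows of x, whereas the join does and then relies on DISTINCT to collapse them again. Here is a small sketch using Python's sqlite3; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
conn.execute("CREATE TABLE y (c INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?, ?)",
    [(1, 1, 10, 20), (2, 2, 10, 20), (3, 1, 99, 99), (4, 3, 10, 20)],
)
# y holds the same (c, d) pair twice.
conn.executemany("INSERT INTO y VALUES (?, ?)", [(10, 20), (10, 20)])

# Semi-join: each qualifying row of x appears once, no DISTINCT needed.
exists_rows = conn.execute(
    """
    SELECT a, b, c, d FROM x
    WHERE b IN (1, 2)
      AND EXISTS (SELECT 1 FROM y WHERE x.c = y.c AND x.d = y.d)
    ORDER BY a
    """
).fetchall()

join_sql = """
    SELECT {distinct} x.a, x.b, x.c, x.d
    FROM x
    INNER JOIN y ON x.c = y.c AND x.d = y.d
    WHERE b IN (1, 2)
    ORDER BY x.a
"""
# Without DISTINCT the join repeats each x row once per matching y row.
join_rows = conn.execute(join_sql.format(distinct="")).fetchall()
join_distinct_rows = conn.execute(join_sql.format(distinct="DISTINCT")).fetchall()

print(exists_rows)
print(len(join_rows), len(join_distinct_rows))
```

Both forms return the same final rows once DISTINCT is applied; the EXISTS form simply avoids producing the duplicates in the first place.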

Redshift Join VS. Union with Group By

Let's say I would like to pull the fields dim, a, b, c, d from two tables, one of which contains a, b and the other c, d.
I'm wondering if there's a preferred way (between the following) to do it - Performance wise:
1:
select t1.dim,a,b,c,d
from
(select dim,sum(a) as a,sum(b)as b from t1 group by dim)t1
join
(select dim,sum(c) as c,sum(d) as d from t2 group by dim)t2
on t1.dim=t2.dim;
2:
select dim,sum(a) as a,sum(b) as b,sum(c) as c,sum(d) as d
from
(
select dim,a,b,null as c, null as d from t1
union
select dim,null as a, null as b, c, d from t2
)a
group by dim
This is for a large amount of data, of course (5-30M records in the final query).
Thanks!
The first method filters out any dim values that are not in both tables. The second uses union, which is inefficient because it removes duplicates. So, neither is appealing.
I would go for:
select dim, sum(a) as a, sum(b) as b, sum(c) as c, sum(d) as d
from (select dim, a, b, null as c, null as d from t1
union all
select dim, null as a, null as b, c, d from t2
) a
group by dim;
You could also pre-aggregate the values in each subquery. Or use full outer join for the first method.
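The behavioral differences between the three variants can be seen on a tiny data set. The sketch below uses Python's sqlite3 rather than Redshift, so it only illustrates the result sets and says nothing about performance; the sample rows are hypothetical and deliberately include a duplicate row in t1 and dim values that exist in only one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (dim TEXT, a INTEGER, b INTEGER)")
conn.execute("CREATE TABLE t2 (dim TEXT, c INTEGER, d INTEGER)")
# t1 has a duplicated row for 'x'; 'y' is only in t1 and 'z' only in t2.
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [("x", 1, 2), ("x", 1, 2), ("y", 5, 6)])
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", [("x", 10, 20), ("z", 30, 40)])

# Method 1: inner join of pre-aggregated subqueries.
joined = conn.execute(
    """
    SELECT t1.dim, a, b, c, d
    FROM (SELECT dim, SUM(a) AS a, SUM(b) AS b FROM t1 GROUP BY dim) t1
    JOIN (SELECT dim, SUM(c) AS c, SUM(d) AS d FROM t2 GROUP BY dim) t2
      ON t1.dim = t2.dim
    ORDER BY t1.dim
    """
).fetchall()

stacked_sql = """
    SELECT dim, SUM(a) AS a, SUM(b) AS b, SUM(c) AS c, SUM(d) AS d
    FROM (SELECT dim, a, b, NULL AS c, NULL AS d FROM t1
          {op}
          SELECT dim, NULL AS a, NULL AS b, c, d FROM t2) u
    GROUP BY dim
    ORDER BY dim
"""
# Method 2 as asked: UNION drops the duplicated t1 row before summing.
union = conn.execute(stacked_sql.format(op="UNION")).fetchall()
# Suggested variant: UNION ALL keeps every row.
union_all = conn.execute(stacked_sql.format(op="UNION ALL")).fetchall()

print(joined)
print(union)
print(union_all)
```

Only the UNION ALL variant both keeps every dim value and sums the duplicated row correctly, which matches the reasoning in the answer.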

2 equals select expression within select

I have to process my data by Levenshtein function.
In this case I'm using nested selection
SELECT levenshtein(a.param, b.param), *
FROM (
SELECT 5 fields
FROM table t
JOIN x
JOIN y
GROUP BY 1, 2, 3
) a,
(
SELECT 5 fields
FROM table t
JOIN x
JOIN y
GROUP BY 1, 2, 3
) b
is there a way to not duplicate inner SELECT ?
The solution is pretty simple, thanks to @Nikarus for the suggestion about the WITH expression:
WITH subtable AS (
SELECT 5 fields
FROM table t
JOIN x
JOIN y
GROUP BY 1, 2, 3
)
SELECT levenshtein(a.param, b.param), *
FROM subtable a, subtable b
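The WITH pattern can be tried end to end with Python's sqlite3. This is only a sketch under several assumptions: SQLite has no built-in levenshtein, so a plain Python implementation is registered under that name; the words table and its rows are hypothetical; and the a.param < b.param filter is an addition that keeps each unordered pair once.

```python
import sqlite3

def levenshtein(s, t):
    """Edit distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name used in the question.
conn.create_function("levenshtein", 2, levenshtein)
conn.execute("CREATE TABLE words (param TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("kitten",), ("sitting",)])

# The subquery is written once in the CTE and referenced twice.
pairs = conn.execute(
    """
    WITH subtable AS (SELECT param FROM words)
    SELECT a.param, b.param, levenshtein(a.param, b.param)
    FROM subtable a, subtable b
    WHERE a.param < b.param
    """
).fetchall()

print(pairs)
```

Whether the database evaluates the CTE once or inlines it twice depends on the engine and version, so the CTE removes the textual duplication but not necessarily the repeated work.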

one to one distinct restriction on selection

I encountered the following problem. There are two tables (the x values are sorted
in increasing order):
Table A
id x
1 1
1 3
1 4
1 7
Table B
id x
1 2
1 5
I want to join these two tables:
1) on the condition of the equality of id and
2) each row of A should be matched to only one row of B, and vice versa (a one-to-one relationship), based on the absolute difference of the x values (a pair with a smaller difference has
higher priority to match).
This description alone is ambiguous: if two candidate pairs share a row in one of the tables and have the same difference, there is no way to decide which one goes first. So define A as the "main" table: the row in table A with the smaller line number always goes first.
Expected result of demo:
id A.x B.x abs_diff
1 1 2 1
1 4 5 1
End of table (the two extra rows in A are not matched, because of the one-to-one rule).
I am using PostgreSQL, so I tried DISTINCT ON, but it does not solve the problem.
select distinct on (A.x) id,A.x,B.x,abs_diff
from
(A join B
on A.id=B.id)
order by A.x,greatest(A.x,B.x)-least(A.x,B.x)
Do you have any ideas? It seems to be tricky in plain SQL.
Try:
select a.id, a.x as ax, b.x as bx, x.min_abs_diff
from table_a a
join table_b b
on a.id = b.id
join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
from table_a a
join table_b b
on a.id = b.id
group by a.id) x
on x.id = a.id
and abs(a.x - b.x) = x.min_abs_diff
fiddle: http://sqlfiddle.com/#!15/ab5ae/5/0
Although it doesn't match your expected output, I think the output is correct based on what you described, as you can see each pair has a difference with an absolute value of 1.
Edit - Try the following, based on order of a to b:
select *
from (select a.id,
a.x as ax,
b.x as bx,
x.min_abs_diff,
row_number() over(partition by a.id, b.x order by a.id, a.x) as rn
from table_a a
join table_b b
on a.id = b.id
join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
from table_a a
join table_b b
on a.id = b.id
group by a.id) x
on x.id = a.id
and abs(a.x - b.x) = x.min_abs_diff) x
where x.rn = 1
Fiddle: http://sqlfiddle.com/#!15/ab5ae/19/0
One possible solution for your currently ambiguous question:
SELECT *
FROM (
SELECT id, x AS a, lead(x) OVER (PARTITION BY grp ORDER BY x) AS b
FROM (
SELECT *, count(tbl) OVER (PARTITION BY id ORDER BY x) AS grp
FROM (
SELECT TRUE AS tbl, * FROM table_a
UNION ALL
SELECT NULL, * FROM table_b
) x
) y
) z
WHERE b IS NOT NULL
ORDER BY 1,2,3;
This way, every a.x is assigned the next bigger (or same) b.x, unless there is another a.x that is still smaller than the next b.x (or the same).
Produces the requested result for the demo case. Not sure about various ambiguous cases.
SQL Fiddle.
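For comparison, the matching rule as stated in the question (smallest absolute difference first, ties broken by row order in A, each row used at most once) can be written as a greedy procedure outside SQL. This Python sketch is one possible reading of those rules, not a translation of either query above; the function name match_one_to_one is hypothetical, and breaking remaining ties by row order in B is an added assumption.

```python
def match_one_to_one(a_rows, b_rows):
    """Greedy one-to-one matching of (id, x) rows by smallest |a.x - b.x|."""
    # Build every candidate pair with equal ids.
    candidates = []
    for ai, (a_id, ax) in enumerate(a_rows):
        for bi, (b_id, bx) in enumerate(b_rows):
            if a_id == b_id:
                candidates.append((abs(ax - bx), ai, bi))
    # Sort by difference, then by position in A (the "main" table), then in B.
    candidates.sort()

    used_a, used_b, result = set(), set(), []
    for diff, ai, bi in candidates:
        if ai in used_a or bi in used_b:
            continue  # one of the two rows is already matched
        used_a.add(ai)
        used_b.add(bi)
        result.append((a_rows[ai][0], a_rows[ai][1], b_rows[bi][1], diff))
    return sorted(result)

# The demo data from the question.
table_a = [(1, 1), (1, 3), (1, 4), (1, 7)]
table_b = [(1, 2), (1, 5)]
matches = match_one_to_one(table_a, table_b)
for row in matches:
    print(row)
```

The candidate list grows with the product of the two table sizes per id, so this is meant as a reference for checking SQL results on small inputs rather than as a solution for large tables.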