Redshift: JOIN vs. UNION with GROUP BY

Let's say I would like to pull the fields dim, a, b, c, d from two tables, one of which contains a, b and the other c, d.
I'm wondering whether one of the following two approaches is preferable, performance-wise:
1:
select t1.dim,a,b,c,d
from
(select dim,sum(a) as a,sum(b)as b from t1 group by dim)t1
join
(select dim,sum(c) as c,sum(d) as d from t2 group by dim)t2
on t1.dim=t2.dim;
2:
select dim,sum(a) as a,sum(b) as b,sum(c) as c,sum(d) as d
from
(
select dim,a,b,null as c, null as d from t1
union
select dim,null as a, null as b, c, d from t2
)a
group by dim
This is for a large amount of data, of course (5-30M records in the final query).
Thanks!

The first method filters out any dim values that are not in both tables. union is inefficient because it has to remove duplicate rows (and here, removing duplicates would also change the sums). So, neither is appealing.
I would go for:
select dim, sum(a) as a, sum(b) as b, sum(c) as c, sum(d) as d
from (select dim, a, b, null as c, null as d from t1
union all
select dim, null as a, null as b, c, d from t2
) a
group by dim;
You could also pre-aggregate the values in each subquery. Or use full outer join for the first method.
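To make the difference between the three variants concrete, here is a minimal sketch that runs them against an in-memory SQLite database from Python. SQLite stands in for Redshift only to show the result sets (it says nothing about Redshift performance), and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (dim TEXT, a INTEGER, b INTEGER);
    CREATE TABLE t2 (dim TEXT, c INTEGER, d INTEGER);
    -- 'x' has two identical rows in t1; 'y' exists only in t1, 'z' only in t2.
    INSERT INTO t1 VALUES ('x', 1, 10), ('x', 1, 10), ('y', 2, 20);
    INSERT INTO t2 VALUES ('x', 3, 30), ('z', 4, 40);
""")

# Method 1: aggregate each table, then inner join on dim.
inner_join = conn.execute("""
    SELECT t1.dim, a, b, c, d
    FROM (SELECT dim, SUM(a) AS a, SUM(b) AS b FROM t1 GROUP BY dim) t1
    JOIN (SELECT dim, SUM(c) AS c, SUM(d) AS d FROM t2 GROUP BY dim) t2
      ON t1.dim = t2.dim
    ORDER BY t1.dim
""").fetchall()

# Method 2 and the suggested fix differ only in the set operator.
UNION_QUERY = """
    SELECT dim, SUM(a) AS a, SUM(b) AS b, SUM(c) AS c, SUM(d) AS d
    FROM (SELECT dim, a, b, NULL AS c, NULL AS d FROM t1
          {op}
          SELECT dim, NULL AS a, NULL AS b, c, d FROM t2) u
    GROUP BY dim
    ORDER BY dim
"""
union_dedup = conn.execute(UNION_QUERY.format(op="UNION")).fetchall()
union_all = conn.execute(UNION_QUERY.format(op="UNION ALL")).fetchall()

print(inner_join)   # only 'x' survives the inner join
print(union_dedup)  # the duplicate 'x' row was collapsed before summing
print(union_all)    # every dim is kept and every row is counted
```

With this data the inner join returns only dim 'x', the plain union undercounts a and b for 'x', and union all keeps 'y' and 'z' with NULL for the sums that have no rows.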

Related

union all two table instead of join

I have several tables that I cannot join, because the join gets really complicated and BigQuery is not able to process it. So I am trying to union all the tables and then group by. I have an issue during this process. I have two tables called t1 and t2 with the headers below; they don't have null values:
t1: a, b, c, d
t2: a, b, c, e
To union and group them I have the code below:
WITH
all_tables_unioned AS (
SELECT
*,
NULL e
FROM
`t1`
UNION ALL
SELECT
*,
NULL d
FROM
`t2` )
SELECT
a,
b,
c,
MAX(d) AS d,
MAX(e) AS e
FROM
all_tables_unioned
GROUP BY
a,
b,
c
Unfortunately, when I run this I get a table a, b, c, d, e in which the e column is all null!
I ran the query for each table separately before the union all to make sure the columns are not null. I do not really know what is wrong with my query.
union all does not go by column names. Just list all the columns explicitly:
WITH all_tables_unioned AS (
SELECT a, b, c, d, NULL as e
FROM `t1`
UNION ALL
SELECT a, b, c, NULL as d, e
FROM `t2`
)
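Regardless of the names you assign, union all matches columns by position. In your query, the * for t2 expands to a, b, c, e, so t2's e values land in the fourth column (d) and the NULL lands in the fifth (e).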
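The positional matching is easy to reproduce on a tiny example. The sketch below uses SQLite from Python instead of BigQuery (an assumption made purely so the example is runnable; the positional rule for union all is the same), with one made-up row per table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER, d TEXT);
    CREATE TABLE t2 (a INTEGER, b INTEGER, c INTEGER, e TEXT);
    INSERT INTO t1 VALUES (1, 1, 1, 'd-value');
    INSERT INTO t2 VALUES (1, 1, 1, 'e-value');
""")

GROUPED = """
    SELECT a, b, c, MAX(d) AS d, MAX(e) AS e
    FROM ({inner}) u
    GROUP BY a, b, c
"""

# SELECT * puts t2's e in the 4th position, which the union calls d.
by_star = conn.execute(GROUPED.format(inner="""
    SELECT *, NULL AS e FROM t1
    UNION ALL
    SELECT *, NULL AS d FROM t2
""")).fetchall()

# Listing the columns explicitly lines the positions up.
explicit = conn.execute(GROUPED.format(inner="""
    SELECT a, b, c, d, NULL AS e FROM t1
    UNION ALL
    SELECT a, b, c, NULL AS d, e FROM t2
""")).fetchall()

print(by_star)   # e is NULL; t2's value ended up under d
print(explicit)  # d and e each carry their own value
```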

How do you aggregate two columns into arrays in BigQuery?

Say I have a table with 4 columns, a of type string, b of type integer, c of type integer, and d of type integer. How would I go ahead and use something like ARRAY_AGG on a STRUCT(b, c) and d separately (in other words, have two separate columns that would be arrays)?
The Query I have so far:
SELECT table1.a, table1.x, table2.y
FROM (
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x
FROM project.table
GROUP BY a
ORDER BY a
) AS table1
LEFT OUTER JOIN (
SELECT a, ARRAY_AGG(d) AS y
FROM project.table
GROUP BY a, d
ORDER BY a, d
) table2 ON table1.a = table2.a
GROUP BY table1.a
ORDER BY table1.a;
This gives me the error: SELECT list expression references table1.x which is neither grouped nor aggregated at [1:20]
But if I try to add table1.x to the GROUP BY clause at the end, I get a new error: Grouping by expressions of type ARRAY is not allowed at [14:22]
Below is for BigQuery Standard SQL
#standardSQL
SELECT a,
ARRAY(
SELECT AS STRUCT b, c FROM t.x GROUP BY b, c
) AS x, y
FROM (
SELECT a,
ARRAY_AGG(STRUCT(b, c)) AS x,
ARRAY_AGG(DISTINCT d) AS y
FROM `project.dataset.table`
GROUP BY a
) t
Why not just do this in one query?
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x, ARRAY_AGG(DISTINCT d) AS y
FROM project.table
GROUP BY a
ORDER BY a;
Your version will work without the outer GROUP BY. But it is needlessly complicated.

How do I add two sum queries together?

I have the following two pseudo queries:
SELECT Sum(a)
FROM b
WHERE c
and
SELECT Sum(d)
FROM b
WHERE e
I want to sum these queries together to one value but I can't figure out the syntax. Note the FROM statement is the same ("b"). I've tried a UNION query but this gives me two values...
You can use iif() inside sum() where you apply the conditions:
select sum(iif(c, a, 0)) + sum(iif(e, d, 0))
from b
Since both queries will always return a single record, you could alternatively cross join the two subqueries and simply add the results, e.g.:
select r1 + r2 from
(select sum(a) as r1 from b where c) t1,
(select sum(d) as r2 from b where e) t2
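Here is a small runnable sketch of these approaches, using SQLite from Python. Two things are assumptions for the sake of the example: the conditions c and e are modelled as 0/1 flag columns, and CASE is used in place of iif() so it runs on any SQLite version. The data is chosen so that both partial sums are equal, which also shows why a plain UNION of the two sums is risky.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- c and e are hypothetical 0/1 flags standing in for the two conditions.
    CREATE TABLE b (a INTEGER, d INTEGER, c INTEGER, e INTEGER);
    INSERT INTO b VALUES (5, 1, 1, 0), (0, 5, 0, 1), (7, 7, 0, 0);
""")

def scalar(sql):
    """Run a query that returns one row with one column and return the value."""
    return conn.execute(sql).fetchone()[0]

# Conditional aggregation in a single pass over b.
conditional = scalar("""
    SELECT SUM(CASE WHEN c THEN a ELSE 0 END)
         + SUM(CASE WHEN e THEN d ELSE 0 END)
    FROM b
""")

# Cross join of two single-row subqueries.
cross = scalar("""
    SELECT r1 + r2
    FROM (SELECT SUM(a) AS r1 FROM b WHERE c) t1,
         (SELECT SUM(d) AS r2 FROM b WHERE e) t2
""")

# Summing over a union of the two sums: UNION ALL keeps both rows,
# plain UNION merges them when the two sums happen to be equal.
SUM_OF_UNION = """
    SELECT SUM(col1)
    FROM (SELECT SUM(a) AS col1 FROM b WHERE c
          {op}
          SELECT SUM(d) AS col1 FROM b WHERE e) t
"""
via_union_all = scalar(SUM_OF_UNION.format(op="UNION ALL"))
via_union = scalar(SUM_OF_UNION.format(op="UNION"))

print(conditional, cross, via_union_all, via_union)
```

Both partial sums are 5 here, so the first three variants give 10 while the plain UNION variant gives 5.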
Try (with UNION ALL, so that two equal sums are not merged into a single row):
SELECT SUM(col1)
FROM
(
SELECT Sum(a) col1
FROM b
WHERE c
UNION ALL
SELECT Sum(d) col1
FROM b
WHERE e) t
Please try the following:
Select sum(sumVal)
FROM
(SELECT Sum(a) sumVal
FROM b
where c
UNION ALL
SELECT Sum(d) sumVal
FROM b
where e
) t
Try to use this:
;WITH
t1 as (select sum(a) as a from b where c>20)
,
t2 as (select sum(d) as d from b where e is not null)
select t1.a + t2.d as s from t1 cross join t2

UNION without comparing one of the columns

I have two queries
select A, B, C, D from T1, T2
select A, B, C, D from T2, T3
I want to do a UNION of the two queries (no duplicates) but not comparing column D, that is if columns A B and C are the same then they are considered duplicates regardless of D. I do not want to select from joined tables T1, T2, and T3. Is this possible on a single statement?
(this is Oracle)
Use UNION and GROUP BY to do this, like the following:
select A, B, C
from(
select A, B, C, D from T1, T2
union
select A, B, C, D from T2, T3
)t
group by A, B, C
You also have to specify which D value you want when A, B, C are the same. Here I assume you want max(D):
select A, B, C, max(D) as D
from(
select A, B, C, D from T1, T2
union
select A, B, C, D from T2, T3
)t
group by A, B, C
Whichever value you want to keep, when you use group by in Oracle you can only select columns that appear in the group by, or other columns wrapped in aggregate functions.
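A minimal sketch of the idea, run on SQLite from Python rather than Oracle. The two tables q1 and q2 are hypothetical stand-ins for the results of the two joined queries, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- q1 and q2 stand in for "select A, B, C, D from T1, T2" and "... from T2, T3".
    CREATE TABLE q1 (A INTEGER, B INTEGER, C INTEGER, D INTEGER);
    CREATE TABLE q2 (A INTEGER, B INTEGER, C INTEGER, D INTEGER);
    INSERT INTO q1 VALUES (1, 2, 3, 10), (4, 5, 6, 20);
    INSERT INTO q2 VALUES (1, 2, 3, 99), (7, 8, 9, 30);
""")

# A plain UNION compares all four columns, so (1, 2, 3) appears twice.
plain_union = conn.execute("""
    SELECT A, B, C, D FROM q1
    UNION
    SELECT A, B, C, D FROM q2
    ORDER BY A, D
""").fetchall()

# Grouping by A, B, C and aggregating D leaves one row per (A, B, C).
grouped = conn.execute("""
    SELECT A, B, C, MAX(D) AS D
    FROM (SELECT A, B, C, D FROM q1
          UNION ALL
          SELECT A, B, C, D FROM q2) t
    GROUP BY A, B, C
    ORDER BY A
""").fetchall()

print(plain_union)
print(grouped)
```

Since the GROUP BY already collapses the duplicates, UNION ALL is enough inside the subquery in this sketch; UNION there gives the same result.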

Where one or another column exists in a sub select

I'm looking to do something like this:
SELECT a, b, c, d FROM someTable
WHERE a in (SELECT testA FROM otherTable);
Only I want to be able to test whether either of 2 columns exists in a sub select of 2 columns.
SELECT a, b, c, d FROM someTable
WHERE a OR b in (SELECT testA, testB FROM otherTable);
We are using MS SQL Server 2012
Try this
SELECT a, b, c, d
FROM someTable
WHERE a IN (SELECT testA FROM otherTable)
OR b IN (SELECT testB FROM otherTable)
or
SELECT a, b, c, d
FROM someTable
WHERE EXISTS
(SELECT NULL
FROM otherTable
WHERE testA = a OR testB = a
OR testA = b OR testB = b)
UPDATE:
If performance is bad, you may need to add an index on the testB column.
Another option for MS SQL is CROSS APPLY:
SELECT a, b, c, d
FROM someTable ST
CROSS APPLY (
-- TOP 1 returns at most one row per ST row, so matches are not duplicated
SELECT TOP 1 1 AS matched
FROM otherTable OT
WHERE OT.testA = ST.a OR OT.testB = ST.b
) AS m
If none of this works well, try using UNION. UNION often performs better than OR, because each branch can use its own index. Note that UNION also removes duplicate rows from the result.
SELECT a, b, c, d
FROM someTable
WHERE a IN (SELECT testA FROM otherTable)
UNION
SELECT a, b, c, d
FROM someTable
WHERE b IN (SELECT testB FROM otherTable)
UPDATE 2:
For further reading on OR and UNION differences
Why is UNION faster than an OR statement
Try this..
SELECT a, b, c, d
FROM someTable
WHERE Exists
(
SELECT 1
FROM otherTable
Where a = testA OR b = testB
)
If I'm understanding your question correctly, LEFT JOIN is probably the way to go here:
SELECT a, b, c, d
FROM TableA ta
LEFT JOIN TableB tb
ON ta.a = tb.a
AND ta.b = tb.b
WHERE tb.a IS NOT NULL
AND tb.c IS NOT NULL
You could also use UNION and INNER JOIN:
SELECT a, b, c, d
FROM someTable
INNER JOIN OtherTable OT on someTable.B = OT.testB
UNION
SELECT a, b, c, d
FROM someTable
INNER JOIN OtherTable OT ON someTable.A= OT.testA
Note that the JOIN approach can be much faster if the joined columns are indexed.
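For comparison, here is a rough sketch that runs the OR, UNION, EXISTS and JOIN variants on the same made-up data. It uses SQLite from Python rather than SQL Server, so it only illustrates which rows come back, not how fast. One assumption to note: otherTable deliberately contains two rows with the same testA, to show what a bare join does in that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE someTable (a INTEGER, b INTEGER, c TEXT, d TEXT);
    CREATE TABLE otherTable (testA INTEGER, testB INTEGER);
    INSERT INTO someTable VALUES
        (1, 100, 'a-match', 'r1'), (2, 200, 'b-match', 'r2'),
        (3, 300, 'no-match', 'r3'), (4, 400, 'both', 'r4');
    INSERT INTO otherTable VALUES (1, 0), (0, 200), (4, 400), (4, 999);
""")

with_or = conn.execute("""
    SELECT a, b, c, d FROM someTable
    WHERE a IN (SELECT testA FROM otherTable)
       OR b IN (SELECT testB FROM otherTable)
    ORDER BY a
""").fetchall()

with_union = conn.execute("""
    SELECT a, b, c, d FROM someTable
    WHERE a IN (SELECT testA FROM otherTable)
    UNION
    SELECT a, b, c, d FROM someTable
    WHERE b IN (SELECT testB FROM otherTable)
    ORDER BY a
""").fetchall()

with_exists = conn.execute("""
    SELECT a, b, c, d FROM someTable st
    WHERE EXISTS (SELECT 1 FROM otherTable ot
                  WHERE ot.testA = st.a OR ot.testB = st.b)
    ORDER BY a
""").fetchall()

# A bare join returns one output row per matching otherTable row.
with_join = conn.execute("""
    SELECT st.a, st.b, st.c, st.d
    FROM someTable st
    JOIN otherTable ot ON ot.testA = st.a OR ot.testB = st.b
    ORDER BY st.a
""").fetchall()

print(with_or)
print(len(with_join))  # one more row than the others: a = 4 matched twice
```

The first three queries return the same three rows. The bare join returns the a = 4 row twice, which is why the JOIN answers pair it with UNION (or would need DISTINCT).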