Integrate two sql queries - sql

I have these two queries:
SELECT DISTINCT a, b, c, d FROM x WHERE b IN (1, 2)
SELECT DISTINCT c, d FROM y
I would now like to combine these queries so that the first query only returns rows whose (c, d) combination appears in the output of the second query. Any thoughts on how to do this? My table is large, so efficiency is important.

Use exists?
SELECT DISTINCT a, b, c, d
FROM x
WHERE b IN (1, 2) AND
EXISTS (SELECT 1 FROM y WHERE x.c = y.c and x.d = y.d);
When using EXISTS, the SELECT DISTINCT is only necessary if x has duplicate rows. Otherwise it can be dropped.
For performance, you want an index on y(c, d). An index on x(b, a, c, d) would also be helpful in most databases.
Note: The distinct is not necessary in the subquery. In some databases, you can use in with composite values as well.
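To make the difference between the EXISTS form and a plain join concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The sample rows and the index name idx_y_c_d are assumptions made up for illustration; only the query shapes come from the answers in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (a INTEGER, b INTEGER, c INTEGER, d INTEGER);
    CREATE TABLE y (c INTEGER, d INTEGER);
    -- Index on y(c, d), as suggested above (the index name is hypothetical).
    CREATE INDEX idx_y_c_d ON y (c, d);
    INSERT INTO x VALUES (1, 1, 10, 100), (2, 2, 20, 200),
                         (3, 3, 10, 100), (4, 1, 30, 300);
    -- y deliberately contains the pair (10, 100) twice.
    INSERT INTO y VALUES (10, 100), (10, 100), (20, 200);
""")

# Semi-join: each row of x is returned at most once, however many matches y has.
exists_rows = conn.execute("""
    SELECT a, b, c, d
    FROM x
    WHERE b IN (1, 2)
      AND EXISTS (SELECT 1 FROM y WHERE y.c = x.c AND y.d = x.d)
    ORDER BY a
""").fetchall()

# Inner join without DISTINCT: duplicates in y multiply the matching rows of x.
join_rows = conn.execute("""
    SELECT x.a, x.b, x.c, x.d
    FROM x
    INNER JOIN y ON x.c = y.c AND x.d = y.d
    WHERE x.b IN (1, 2)
    ORDER BY x.a
""").fetchall()

print(exists_rows)
print(join_rows)
```

With this sample data the join returns one extra copy of the row whose (c, d) pair is duplicated in y, which is why the join version needs DISTINCT while the EXISTS version only needs it when x itself has duplicates.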

SELECT DISTINCT x.a,x.b,x.c,x.d
FROM x
INNER JOIN y ON x.c = y.c
AND x.d = y.d
WHERE b in (1,2)
Regarding efficiency, your indexing will determine how well that performs.

Related

How do you aggregate two columns into arrays in BigQuery?

Say I have a table with 4 columns, a of type string, b of type integer, c of type integer, and d of type integer. How would I go ahead and use something like ARRAY_AGG on a STRUCT(b, c) and d separately (in other words, have two separate columns that would be arrays)?
The Query I have so far:
SELECT table1.a, table1.x, table2.y
FROM (
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x
FROM project.table
GROUP BY a
ORDER BY a
) AS table1
LEFT OUTER JOIN (
SELECT a, ARRAY_AGG(d) AS y
FROM project.table
GROUP BY a, d
ORDER BY a, d
) table2 ON table1.a = table2.a
GROUP BY table1.a
ORDER BY table1.a;
This gives me the error: SELECT list expression references table1.x which is neither grouped nor aggregated at [1:20]
But if I try to add table1.x to the GROUP BY clause at the end, I get a new error: Grouping by expressions of type ARRAY is not allowed at [14:22]
Below is for BigQuery Standard SQL
#standardSQL
SELECT a,
ARRAY(
SELECT AS STRUCT b, c FROM t.x GROUP BY b, c
) AS x, y
FROM (
SELECT a,
ARRAY_AGG(STRUCT(b, c)) AS x,
ARRAY_AGG(DISTINCT d) AS y
FROM `project.dataset.table`
GROUP BY a
) t
Why not just do this in one query?
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x, ARRAY_AGG(DISTINCT d) AS y
FROM project.table
GROUP BY a
ORDER BY a;
Your version will work without the outer GROUP BY. But it is needlessly complicated.
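For readers without a BigQuery project at hand, the shape of the single-query result can be sketched in plain Python. This is only an illustration of what the two aggregates collect per value of a; the sample rows are invented, and a real ARRAY_AGG without an ORDER BY clause does not promise any particular element order.

```python
from collections import defaultdict

# Hypothetical rows of (a, b, c, d), standing in for project.table.
rows = [("k1", 1, 2, 7), ("k1", 3, 4, 7), ("k2", 5, 6, 8)]

x_by_a = defaultdict(list)  # plays the role of ARRAY_AGG(STRUCT(b, c))
y_by_a = defaultdict(list)  # plays the role of ARRAY_AGG(DISTINCT d)

for a, b, c, d in rows:
    x_by_a[a].append({"b": b, "c": c})
    if d not in y_by_a[a]:
        y_by_a[a].append(d)

# One output row per value of a, mirroring GROUP BY a ... ORDER BY a.
result = [(a, x_by_a[a], y_by_a[a]) for a in sorted(x_by_a)]
print(result)
```

Both arrays are built in one pass over the same group, which is why no self-join is needed.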

Redundant use of distinct in group by?

I'm reviewing some SQL queries in SAS and I encountered the following query structure:
SELECT distinct A, B, Sum(C) FROM Table1 GROUP BY A, B;
I would like to know if it's strictly equivalent to:
SELECT A, B, Sum(C) FROM Table1 GROUP BY A, B;
Or if I'm missing a nuance, in the output or the way the computation is handled
The two queries are equivalent.
Generally,
SELECT DISTINCT a, b, c
FROM <something>
is equivalent to
SELECT a, b, c
FROM <something>
GROUP BY a, b, c
In your case, <something> happens to be the result of a GROUP BY query, which already has distinct combinations of A and B. This is enough to ensure that the triples A, B, SUM(C) are going to be unique as well.
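A quick way to convince yourself is to run both forms against the same data. The sketch below uses SQLite through Python's sqlite3 module rather than SAS, so it only checks the results, not how SAS plans the computation; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (A TEXT, B TEXT, C INTEGER);
    INSERT INTO Table1 VALUES ('p', 'q', 1), ('p', 'q', 2),
                              ('p', 'r', 3), ('s', 'q', 3);
""")

# ORDER BY is added only so the two result lists can be compared directly.
with_distinct = conn.execute(
    "SELECT DISTINCT A, B, SUM(C) FROM Table1 GROUP BY A, B ORDER BY A, B"
).fetchall()

without_distinct = conn.execute(
    "SELECT A, B, SUM(C) FROM Table1 GROUP BY A, B ORDER BY A, B"
).fetchall()

print(with_distinct)
print(without_distinct)
```

Two groups share the same sum here, yet the rows still differ in A or B, so DISTINCT has nothing to remove.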

Using having count() in exists clause

I am trying to write a SQL query where the subquery in an 'exists' clause has a 'having' clause. The strange thing is that there is no error, and the subquery works as a stand-alone query. However, the whole query gives exactly the same results with the 'having' clause as without.
This is kind of what my query looks like:
SELECT X
FROM A
WHERE exists (
SELECT X, count(distinct Y)
FROM B
GROUP BY X
HAVING count(distinct Y) > 2)
So I'm trying to select the rows from A where X has more than two occurrences of Y in B.
However, the results also include records that do not exist in the subquery. What am I doing wrong here?
You don't correlate the two queries:
SELECT X
FROM A
WHERE (
SELECT COUNT(DISTINCT y)
FROM b
WHERE b.x = a.x
) > 2
Your query says something like this:
select X from A if there are any groups in B (grouped by X) with more than two distinct values of Y.
If your 'exists subquery' returns even one record from table B, the condition is true for every row and you will get all the rows from A.
Try:
select X
from A
where exists (select 1
from B
where B.x = A.x
group by b.x
having count(distinct b.y) > 2
)
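The effect of the missing correlation is easy to reproduce. Here is a small sketch with Python's sqlite3 module; the table contents are invented so that only X = 1 has more than two distinct Y values in B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (X INTEGER);
    CREATE TABLE B (X INTEGER, Y TEXT);
    INSERT INTO A VALUES (1), (2), (3);
    -- X = 1 has three distinct Y values, X = 2 has one, X = 3 has none.
    INSERT INTO B VALUES (1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'a');
""")

# Uncorrelated: the subquery is non-empty, so EXISTS is true for every row of A.
uncorrelated = [row[0] for row in conn.execute("""
    SELECT X FROM A
    WHERE EXISTS (SELECT X, COUNT(DISTINCT Y)
                  FROM B
                  GROUP BY X
                  HAVING COUNT(DISTINCT Y) > 2)
    ORDER BY X
""")]

# Correlated: the subquery only looks at the rows of B for the current A.X.
correlated = [row[0] for row in conn.execute("""
    SELECT X FROM A
    WHERE EXISTS (SELECT 1
                  FROM B
                  WHERE B.X = A.X
                  GROUP BY B.X
                  HAVING COUNT(DISTINCT B.Y) > 2)
    ORDER BY X
""")]

print(uncorrelated)
print(correlated)
```

The uncorrelated version returns every X in A, matching the behaviour described in the question, while the correlated one keeps only the X that qualifies.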
I had a similar situation and solved by a JOIN since the other answers didn't work for me. I tried to correlate to your generic example. Hope it is helpful to someone else!
SELECT X
FROM A
JOIN (SELECT X, COUNT(DISTINCT y)
FROM B
GROUP BY X
HAVING count(distinct Y) > 2) C
ON A.X = C.X

Does the optimizer filter subqueries with outer where clauses

Take the following query:
select * from
(
select a, b
from c
UNION
select a, b
from d
)
where a = 'mung'
order by b
Will the optimizer generally work out that I am filtering a on the value 'mung' and consequently apply that filter to each of the queries inside the subquery?
OR
Will it run each query within the subquery union and return the results to the outer query for filtering (as the query would perhaps suggest)?
In that case the following query would perform better:
select * from
(
select a, b
from c
where a = 'mung'
UNION
select a, b
from d
where a = 'mung'
)
order by b
Obviously query 1 is best for maintenance, but is it sacrificing much performance for this?
Which is best?
Edit
Sorry I neglected to add the order by clause in the queries to indicate why it all needed to be a subquery in the first place!
Edit
Ok, I thought it was going to be a simple answer, but I forgot I was talking about databases here! Running it through the analyser indicates that there is no optimization going on, but this could of course be because I only have 4 rows in my tables. I was hoping for a simpler yes or no.. Will have to set up some more complex tests.
I am on MySQL at the moment, but I am after a general answer here...
Use:
SELECT x.a,
x.b
FROM TABLE_X x
WHERE x.a = 'mung'
UNION
SELECT y.a,
y.b
FROM TABLE_Y y
WHERE y.a = 'mung'
The subquery is unnecessary in the given context.
The query above will use indexes on column "a" if they are available; with the subquery method, the outer filter is applied to the derived result, where there is no index to utilize.
Use UNION ALL if you don't have to be concerned with duplicates. It will be faster than UNION.
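Whatever a given optimizer decides to do, the two formulations are at least interchangeable in terms of results, which can be checked with a small sketch. This uses SQLite via Python's sqlite3 module, not MySQL, and the rows are invented; it says nothing about the plan either engine would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE c (a TEXT, b INTEGER);
    CREATE TABLE d (a TEXT, b INTEGER);
    INSERT INTO c VALUES ('mung', 1), ('other', 2), ('mung', 3);
    INSERT INTO d VALUES ('mung', 1), ('mung', 4), ('else', 5);
""")

# Filter applied outside the UNION subquery.
outer_filter = conn.execute("""
    SELECT * FROM (
        SELECT a, b FROM c
        UNION
        SELECT a, b FROM d
    )
    WHERE a = 'mung'
    ORDER BY b
""").fetchall()

# Filter pushed into each branch; ORDER BY applies to the whole UNION.
pushed_filter = conn.execute("""
    SELECT a, b FROM c WHERE a = 'mung'
    UNION
    SELECT a, b FROM d WHERE a = 'mung'
    ORDER BY b
""").fetchall()

# UNION ALL skips duplicate elimination, so ('mung', 1) appears twice.
union_all = conn.execute("""
    SELECT a, b FROM c WHERE a = 'mung'
    UNION ALL
    SELECT a, b FROM d WHERE a = 'mung'
    ORDER BY b
""").fetchall()

print(outer_filter)
print(pushed_filter)
print(union_all)
```

Note that the ORDER BY also works on the flat UNION, so sorting alone does not force the subquery wrapper.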

can this be written with an outer join

The requirement is to copy rows from Table B into Table A. Only rows with an id that doesn't already exist need to be copied over:
INSERT INTO A(id, x, y)
SELECT id, x, y
FROM B b
WHERE b.id NOT IN (SELECT id FROM A WHERE x='t');
                                    ^^^^^^^^^^^
Now, I was trying to write this with an outer join to compare the explain paths, but I can't write this (efficiently at least).
Note that the SQL highlighted with ^'s makes this tricky.
Try:
INSERT INTO A(id, x, y)
SELECT b.id, b.x, b.y
FROM B b
LEFT JOIN A a
ON a.id = b.id
AND a.x = 't'
WHERE a.id IS NULL
But I prefer the subquery representation as I think it more clearly expresses what you are doing.
Why are you not happy with what you have? If you check your explain plan, I promise you it says that an anti-join is performed, if the optimizer thinks that is the most efficient way (which it most likely will).
For everyone who reads this: SQL is not what is actually executed. SQL is a way of telling the database what you want, not what to do. All decent databases will be able to treat NOT EXISTS and NOT IN as equal (when they are, i.e. when there are no null values) and perform an anti-join. The trick with an outer join and an IS NULL condition doesn't work on SQL Server, though (SQL Server is not clever enough to transform it to an anti-join).
Your query will perform better than the query with outer join.
I guess the following query will do the job:
INSERT INTO A(id, x, y)
SELECT b.id, b.x, b.y
FROM B b
LEFT JOIN A a
ON b.id = a.id AND a.x = 't'
WHERE a.id IS NULL
INSERT INTO A (id, x, y)
SELECT
B.id, B.x, B.y
FROM
B
WHERE
NOT EXISTS (SELECT * FROM A WHERE B.id = A.id AND A.x = 't')
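To compare the three formulations side by side, here is a rough sketch using Python's sqlite3 module. It only runs the SELECT part of each statement, the aliases src and dst and all sample rows are made up, and it also shows the NULL caveat mentioned above: once the NOT IN subquery can return a NULL id, NOT IN stops matching anything while NOT EXISTS is unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, x TEXT, y TEXT);
    CREATE TABLE B (id INTEGER, x TEXT, y TEXT);
    INSERT INTO A VALUES (1, 't', 'a'), (2, 'u', 'b');
    INSERT INTO B VALUES (1, 'n', 'c'), (2, 'n', 'd'), (3, 'n', 'e');
""")

NOT_IN = """
    SELECT src.id, src.x, src.y
    FROM B AS src
    WHERE src.id NOT IN (SELECT id FROM A WHERE x = 't')
    ORDER BY src.id
"""

LEFT_JOIN = """
    SELECT src.id, src.x, src.y
    FROM B AS src
    LEFT JOIN A AS dst ON dst.id = src.id AND dst.x = 't'
    WHERE dst.id IS NULL
    ORDER BY src.id
"""

NOT_EXISTS = """
    SELECT src.id, src.x, src.y
    FROM B AS src
    WHERE NOT EXISTS (SELECT * FROM A AS dst
                      WHERE dst.id = src.id AND dst.x = 't')
    ORDER BY src.id
"""

# Without NULL ids in A, all three forms select the same rows of B.
not_in_rows = conn.execute(NOT_IN).fetchall()
left_join_rows = conn.execute(LEFT_JOIN).fetchall()
not_exists_rows = conn.execute(NOT_EXISTS).fetchall()

# Add a row with a NULL id and x = 't' to A, then rerun two of the forms.
conn.execute("INSERT INTO A VALUES (NULL, 't', 'z')")
not_in_with_null = conn.execute(NOT_IN).fetchall()
not_exists_with_null = conn.execute(NOT_EXISTS).fetchall()

print(not_in_rows, left_join_rows, not_exists_rows)
print(not_in_with_null, not_exists_with_null)
```

If A.id is declared NOT NULL the caveat does not apply, and the choice between the three forms comes down to readability and whatever plan your database produces.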