Error in using EXCEPT with INTERSECT in SQL - sql

Suppose I have three tables, Table A, Table B and Table C.
Table A contains the column t1 with entries 1, 2, 2, 3, 4, 4.
Table B has column t2 with entries 1, 3, 4, 4.
Table C has column t3 with entries 1, 2, 4, 4.
The query given was
SELECT * FROM A EXCEPT (SELECT * FROM B INTERSECT SELECT * FROM C)
I saw this question in a test paper. It was mentioned that the expected answer was 2 but the answer obtained from this query was 1,2,4. I am not able to understand the principle behind this.

Well, as I see it, both the expected answer and the answer you obtained are wrong. It may depend on the RDBMS you are using, but analyzing your query, the result should be 2, 3. First the INTERSECT between tables B and C is evaluated; the values that intersect are 1 and 4. Taking that result, you then take all the values from table A except 1 and 4, which leaves us with 2 and 3 (since EXCEPT and INTERSECT return only distinct values). Here is a sqlfiddle with this for you to try.

Because of the brackets, the INTERSECT between B and C is done first, resulting in (1,4). You can even verify this just by taking the latter part and running it in isolation:
SELECT * FROM B INTERSECT SELECT * FROM C
The next step is to select everything in A EXCEPT those that exist in the previous result of (1,4), which leaves (2,3).
The answer should be 2 and 3, not 1, 2, and 4.
BTW, it should be mentioned that even if you had no parentheses in the query at all, the result would still be the same, because the INTERSECT operator has higher precedence than the EXCEPT/UNION operators. This is documented for SQL Server, but it's consistent with the standard and applies to any DBMS that implements these operators.
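The two-step evaluation described above is easy to reproduce. Below is a sketch using an in-memory SQLite database via Python purely as a test bed for the question's hypothetical tables A, B, and C; note that SQLite gives all compound operators equal precedence, so a subquery stands in for the parentheses:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE A(t1 INT);
CREATE TABLE B(t2 INT);
CREATE TABLE C(t3 INT);
INSERT INTO A VALUES (1),(2),(2),(3),(4),(4);
INSERT INTO B VALUES (1),(3),(4),(4);
INSERT INTO C VALUES (1),(2),(4),(4);
""")

# Step 1: INTERSECT keeps the distinct values common to B and C.
inner = sorted(cur.execute(
    "SELECT t2 FROM B INTERSECT SELECT t3 FROM C").fetchall())
# inner == [(1,), (4,)]

# Step 2: EXCEPT removes those values from A (deduplicating A as well).
result = sorted(cur.execute(
    "SELECT t1 FROM A EXCEPT "
    "SELECT * FROM (SELECT t2 FROM B INTERSECT SELECT t3 FROM C)"
).fetchall())
# result == [(2,), (3,)]
```

Running this confirms the answer 2, 3 rather than either result quoted in the test paper.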

Related

Compare columns in SQL Select without additional select statement

I'm currently writing a relatively complex SQL statement which selects data from multiple tables and has quite a few sub-statements and joins.
In my "final" data set, I want to return raw data as well as comparisons between the raw data. While I can do this when the raw data is found using a Join, is it possible to do this while the raw data is found in a sub-query?
For example:
If I have a query which is
SELECT
A
,(SELECT B FROM BETA WHERE Row = ALPHA.Betalink) B
FROM ALPHA
WHERE A > 1
Can I add a column which compares A and B without adding another Select?
The only way I know to solve this would be to do the above select, then select on that:
SELECT
A
,B
,greater(A,B)
FROM
(SELECT A
,(SELECT B FROM BETA WHERE Row = ALPHA.Betalink) B
FROM ALPHA
WHERE A > 1
)
TIA
I think you are looking for a WITH clause.
Most importantly, the query is run once, while a correlated subquery is run for every returned row.
You can read about it here:
subquery_factoring_clause
In your example it could look like this:
WITH SUBQ_DATA as (SELECT B,Row FROM BETA)
SELECT alpha.A
,sub.B
,greater(alpha.A,sub.B)
FROM ALPHA alpha
JOIN SUBQ_DATA sub on sub.Row = alpha.Betalink
WHERE A > 1
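The rewrite above can be run end to end; in this sketch, SQLite via Python stands in for the original engine, the ALPHA/BETA contents are invented, and SQLite's scalar two-argument MAX replaces the hypothetical greater() function (the "Row" column is quoted since it is a keyword in some dialects):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE ALPHA(A INT, Betalink INT);
CREATE TABLE BETA("Row" INT, B INT);
INSERT INTO ALPHA VALUES (2, 10), (3, 20);
INSERT INTO BETA  VALUES (10, 5), (20, 1);
""")

# The correlated subquery becomes a plain JOIN against the CTE, and the
# comparison column is computed in the same select list.
rows = sorted(cur.execute("""
WITH SUBQ_DATA AS (SELECT B, "Row" FROM BETA)
SELECT alpha.A, sub.B, MAX(alpha.A, sub.B)  -- two-argument MAX is scalar
FROM ALPHA alpha
JOIN SUBQ_DATA sub ON sub."Row" = alpha.Betalink
WHERE alpha.A > 1
""").fetchall())
# rows == [(2, 5, 5), (3, 1, 3)]
```

Each output row carries the raw A and B plus the derived comparison, with no second wrapping SELECT.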

BigQuery: Union / Join two tables with many columns, some overlapping

I have two large tables with some overlapping columns, some of which contain the same values in the overlapping columns. Here's a toy example (in the actual example, there are dozens of columns, both those that overlap and those that don't):
Table 1: a, b, c
Table 2: a, d, e
Some values of a are in only one table, some are in both.
Is there a query that will let me generate a table with all values where available:
Table 3: a, b, c, d, e
My current query requires listing every column, which is very verbose with dozens of columns, and inflexible when the schema changes:
SELECT
coalesce(t1.a,
t2.a) AS a,
t1.b,
t1.c,
t2.d,
t2.e
FROM
t1
FULL JOIN
t2
USING
(a)
Things I've tried: UNION seems to require the same schema, SELECT t1.*, t2.* raises an error on overlapping columns, SELECT t1.* ... USING (a) will give nulls for values in a where there are values only in t1.a.
Before BigQuery Standard SQL became available to all of us on June 2, 2016, I was extremely happy with what is now called BigQuery Legacy SQL. I still enjoy it from time to time for some specific use cases.
I think the case you described in your question is exactly one where you can leverage a feature of Legacy SQL to resolve your issue.
So, below is for BigQuery Legacy SQL
#legacySQL
SELECT *
FROM [project:dataset.table1],
[project:dataset.table2]
Note: in BigQuery Legacy SQL, the comma (,) means UNION ALL
Super-simplified example of above is
#legacySQL
SELECT *
FROM (SELECT 1 a, 2 b, 3 c, 11 x),
(SELECT 1 a, 4 d, 5 e, 12 x)
with result
Row a b c x d e
1 1 2 3 11 null null
2 1 null null 12 4 5
Note: you cannot mix Legacy and Standard SQL in same query, so if you need use Standard SQL against resulted UNION - you will need first to materialize(save) result as a table and then query that table using Standard SQL
Is there any way with Standard SQL
You can use INFORMATION_SCHEMA to script out the columns from both tables and build a list of all involved columns - but you will still need to copy-paste the result into the final query to run it.
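The comma-union can also be reproduced explicitly in Standard SQL with a null-padded UNION ALL, at the cost of listing the columns once. A minimal sketch, with SQLite via Python standing in for BigQuery and invented one-row tables t1/t2:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE t1(a INT, b INT, c INT);
CREATE TABLE t2(a INT, d INT, e INT);
INSERT INTO t1 VALUES (1, 2, 3);
INSERT INTO t2 VALUES (1, 4, 5);
""")

# Pad each side with NULLs for the columns it lacks, then UNION ALL --
# the manual equivalent of the Legacy SQL comma.
rows = cur.execute("""
SELECT a, b, c, NULL AS d, NULL AS e FROM t1
UNION ALL
SELECT a, NULL, NULL, d, e FROM t2
""").fetchall()
# rows == [(1, 2, 3, None, None), (1, None, None, 4, 5)]
```

This produces the same null-filled shape as the Legacy SQL result table shown above.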

INTERSECT and UNION giving different counts of duplicate rows

I have two tables A and B with the same column names. I have to combine them into table C.
When I run the following query, the counts do not match -
select * into C
from
(
select * from A
union
select * from B
)X
The record count of C does not match the counts of A and B; there is a difference of 89 rows. So I figured out that there are duplicates.
I used the following query to find the duplicates -
select * from A
INTERSECT
select * from B
-- 80 rows returned
Can anybody tell me why INTERSECT returns 80 dups whereas the count difference when using UNION is 89?
There are probably duplicates inside of A and/or B as well. All set operators perform an implicit DISTINCT on the result (logically, not necessarily physically).
Duplicate rows are usually a data-quality issue or an outright bug. I usually mitigate this risk by adding unique indexes on all columns and column sets that are supposed to be unique. I especially make sure that every table has a primary key if that is at all possible.
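A toy version of the situation, with an invented duplicate inside A, shows why the UNION shortfall and the INTERSECT count need not agree (SQLite via Python, table contents hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE A(x INT);
CREATE TABLE B(x INT);
INSERT INTO A VALUES (1), (1), (2);  -- duplicate row inside A itself
INSERT INTO B VALUES (2), (3);
""")

union_count = cur.execute(
    "SELECT COUNT(*) FROM (SELECT x FROM A UNION SELECT x FROM B)"
).fetchone()[0]      # 3 distinct values: 1, 2, 3
intersect_count = cur.execute(
    "SELECT COUNT(*) FROM (SELECT x FROM A INTERSECT SELECT x FROM B)"
).fetchone()[0]      # 1 common value: 2

# COUNT(A) + COUNT(B) - union_count = 5 - 3 = 2, yet intersect_count = 1:
# the extra "missing" row is the duplicate inside A, not an overlap with B.
```

The same mechanism would account for a 9-row gap between the 89 missing rows and the 80 INTERSECT rows in the question.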

Two ways to use Count, are they equivalent?

Is
SELECT COUNT(a.attr)
FROM TABLE a
equivalent to
SELECT B
FROM (SELECT COUNT(a.attr) as B
FROM TABLE a)
I would guess no, but I'm not sure.
I'm also assuming the answer would be the same for functions like min, max, avg, correct?
EDIT:
This is all out of curiosity, I'm still new at this. Is there a difference between the value returned for the count of the following and the above?
SELECT B, C
FROM (SELECT COUNT(a.attr) as B, a.C
FROM TABLE a
GROUP BY c)
EDIT AGAIN: I looked into it, lesson learned: I should be awake when I try to learn about these things.
Technically, they are not the same: the first one is a simple select, the second one is a select with a subselect.
But every sane optimizer will generate the same execution plan for both of them.
The results are the same, and would be the same as:
SELECT E
FROM
(SELECT D as E
FROM
(SELECT C as D
FROM
(SELECT B as C
FROM
(SELECT COUNT(a.attr) as B
FROM TABLE a))))
And equally as pointless.
The second query is essentially obfuscating a COUNT and should be avoided.
EDIT:
Yes, your edited query that was added to the OP is the same thing. It's just adding a subquery for no reason.
Am posting this answer to supplement what has already been said in the other answers, and because you cannot format comments :)
You can always check the execution plan to see if queries are equivalent; this is what SQL Server makes of it:
DECLARE @A TABLE
(
attr int,
c int
)
INSERT @A(attr,c) VALUES(1,1)
INSERT @A(attr,c) VALUES(2,1)
INSERT @A(attr,c) VALUES(3,1)
INSERT @A(attr,c) VALUES(4,2)
INSERT @A(attr,c) VALUES(5,2)
SELECT count(attr) FROM @A
SELECT B
FROM (SELECT COUNT(attr) as B
FROM @A) AS T
SELECT B, C
FROM (SELECT COUNT(attr) as B, c AS C
FROM @A
GROUP BY c) AS T
Here's the execution plan of the SELECT statements; as you can see, there is no difference between the first two:
Yes, they are. All you're doing in the second one is naming the returned count B. They will return the same results.
http://www.roseindia.net/sql/sql-as-keyword.shtml
EDIT:
Better example:
http://www.w3schools.com/sql/sql_alias.asp
The third example will be different because it contains a group by. It will return the count for every distinct a.C entry. Example
B C
w/e a
w/e a
w/e b
w/e a
w/e c
Would return
3 a
1 b
1 c
Not necessarily in that order
Easiest way to check all of this is to try it for yourself and see what it returns.
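Taking that advice, all three shapes can be tried in a few lines with an in-memory SQLite database (the table name T and the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE T(attr INT, c INT);
INSERT INTO T VALUES (1,1),(2,1),(3,1),(4,2),(5,2);
""")

direct = cur.execute("SELECT COUNT(attr) FROM T").fetchone()[0]
wrapped = cur.execute(
    "SELECT B FROM (SELECT COUNT(attr) AS B FROM T)"
).fetchone()[0]
# direct == wrapped == 5: the subquery only renames the count.

grouped = sorted(cur.execute(
    "SELECT B, C FROM (SELECT COUNT(attr) AS B, c AS C FROM T GROUP BY c)"
).fetchall())
# grouped == [(2, 2), (3, 1)]: one count per distinct c -- a different query.
```

The first two queries are interchangeable; the GROUP BY version answers a different question entirely.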
Your first code sample is correct, but the second doesn't make much sense.
You just select all the data twice without any additional operations.
So, the output of the first and second samples will be equal.

When to use EXCEPT as opposed to NOT EXISTS in Transact SQL?

I just recently learned of the existence of the new "EXCEPT" clause in SQL Server (a bit late, I know...) through reading code written by a co-worker. It truly amazed me!
But then I have some questions regarding its usage: when is it recommended to be employed? Is there a difference, performance-wise, between using it versus a correlated query employing "AND NOT EXISTS..."?
After reading EXCEPT's article in the BOL, I thought it was just shorthand for the second option, but I was surprised when I rewrote a couple of queries using it (so they had the "AND NOT EXISTS" syntax much more familiar to me) and then checked the execution plans - surprise! The EXCEPT version had a shorter execution plan and executed faster, too. Is this always so?
So I'd like to know: what are the guidelines for using this powerful tool?
EXCEPT treats NULL values as matching.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE value NOT IN
(
SELECT value
FROM p
)
will return an empty rowset.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE NOT EXISTS
(
SELECT NULL
FROM p
WHERE p.value = q.value
)
will return
NULL
1
, and this one:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
EXCEPT
SELECT *
FROM p
will return:
1
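The three behaviors are easy to verify; here is a condensed sketch of the same CTEs run through SQLite via Python (row order may differ on other engines):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cte = ("WITH q(value) AS (SELECT NULL UNION ALL SELECT 1), "
       "p(value) AS (SELECT NULL UNION ALL SELECT 2) ")

# NOT IN: any NULL in the subquery makes the predicate unknown -> no rows.
not_in = cur.execute(
    cte + "SELECT * FROM q WHERE value NOT IN (SELECT value FROM p)"
).fetchall()
# not_in == []

# NOT EXISTS: NULL = NULL is unknown, so the NULL row survives too.
not_exists = cur.execute(
    cte + "SELECT * FROM q WHERE NOT EXISTS "
          "(SELECT 1 FROM p WHERE p.value = q.value)"
).fetchall()
# set(not_exists) == {(None,), (1,)}

# EXCEPT: NULLs compare as matching, so only 1 remains.
excepted = cur.execute(
    cte + "SELECT * FROM q EXCEPT SELECT * FROM p"
).fetchall()
# excepted == [(1,)]
```

Three predicates, three different result sets from the same data, all driven by how each construct treats NULL.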
A recursive reference is also allowed in an EXCEPT clause in a recursive CTE, though it behaves in a strange way: it returns everything except the last row of the previous set, not everything except the whole previous set:
WITH q (value) AS
(
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
),
rec (value) AS
(
SELECT value
FROM q
UNION ALL
SELECT *
FROM (
SELECT value
FROM q
EXCEPT
SELECT value
FROM rec
) q2
)
SELECT TOP 10 *
FROM rec
---
1
2
3
-- original set
1
2
-- everything except the last row of the previous set, that is 3
1
3
-- everything except the last row of the previous set, that is 2
1
2
-- everything except the last row of the previous set, that is 3, etc.
1
SQL Server developers must just have forgotten to forbid it.
I have done a lot of analysis of EXCEPT, NOT EXISTS, NOT IN and LEFT OUTER JOIN. Generally the LEFT OUTER JOIN is the fastest for finding missing rows, especially when joining on a primary key. NOT IN can be very fast if you know the subquery will return a small list.
I use EXCEPT a lot to compare what is being returned when rewriting code. Run the old code, saving the results. Run the new code, saving the results, and then use EXCEPT to capture all the differences. It is a very quick and easy way to find differences, especially when you need all differences including NULLs. Very good for quick, on-the-fly coding.
But every situation is different. As I say to every developer I have ever mentored: try it. Do timings all the different ways. Try it, time it, do it.
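The regression-diff trick looks like this in practice; a sketch with invented old_results/new_results tables (SQLite via Python), including a NULL to show why EXCEPT beats an equality-join diff here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE old_results(id INT, total INT);
CREATE TABLE new_results(id INT, total INT);
INSERT INTO old_results VALUES (1, 10), (2, 20), (3, NULL);
INSERT INTO new_results VALUES (1, 10), (2, 25), (3, NULL);
""")

# Rows the old query produced that the new query no longer does.
# (3, NULL) matches (3, NULL) -- EXCEPT treats NULLs as equal, so the
# NULL row is not reported as a false difference.
diff = cur.execute(
    "SELECT * FROM old_results EXCEPT SELECT * FROM new_results"
).fetchall()
# diff == [(2, 20)]
```

Only the genuinely changed row surfaces, which is exactly what you want when validating a rewrite.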
EXCEPT compares all (paired) columns of two full selects.
NOT EXISTS compares two or more tables according to the conditions specified in the WHERE clause of the subquery following the NOT EXISTS keyword.
EXCEPT can be rewritten using NOT EXISTS.
(EXCEPT ALL can be rewritten using ROW_NUMBER and NOT EXISTS.)
Got this from here
There is no accounting for SQL Server's execution plans. Whenever I have had performance issues, I have found it utterly arbitrary (from a user's perspective; I'm sure the algorithm writers would understand why) when one syntax produces a better execution plan than another.
In this case, something about the comparison of query parameters allows SQL Server to figure out a shortcut that it couldn't from a straight select statement. I'm sure that is a deficiency in the algorithm. In other words, you could logically infer the same thing, but the optimizer doesn't make that translation for an EXISTS query. Sometimes that is because an algorithm that could reliably figure it out would take longer to execute than the query itself, or at least the algorithm designer thought so.
If your query is fine-tuned, there is no performance difference between using the EXCEPT clause and NOT EXISTS/NOT IN. The first time I ran EXCEPT after rewriting my correlated query, I was surprised: it returned the result in just 7 seconds, while the correlated query took 22 seconds. Then I added a DISTINCT clause to my correlated query and reran it; it also returned in 7 seconds. So EXCEPT is good when you don't know how, or don't have time, to fine-tune your query; otherwise both are the same performance-wise.