A – ( B ∩ A )
I was wondering what this set expression translates to when compared with SQL (operators).
If A and B are tables of the same "type" (same number of columns and compatible datatypes on corresponding columns), this can be translated to SQL like this:
A EXCEPT (A INTERSECT B)
which is of course equivalent (both in set and relational algebra) to:
A EXCEPT B
If A and B are tables of different "type", then set operations do not make sense between them. (Joins are something different, they should not be confused with Unions, Differences or Intersections, no matter how popular a link that "explains" them is.)
The fact that Unions, Differences and Intersections can also be expressed in several ways (using (LEFT) JOIN, (NOT) IN, (NOT) EXISTS combinations, besides the explicit UNION, EXCEPT and INTERSECT operators) does not change that.
The syntax is not exactly as above. One can use either (works in Postgres and SQL-Server. It also works in Oracle if one replaces EXCEPT with MINUS):
SELECT *
FROM a
EXCEPT
( SELECT *
FROM a
INTERSECT
SELECT *
FROM b
) ;
or this (works in Postgres 8.4 and above: SQL-Fiddle test)
SELECT *
FROM
( TABLE a
EXCEPT
( TABLE a INTERSECT TABLE b )
) t ;
and even this (Look ma, no SELECT!):
TABLE a
EXCEPT
( TABLE a INTERSECT TABLE b ) ;
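The equivalence of the long and short forms can be checked on a small example. This is only a sketch: it assumes SQLite (through Python's sqlite3 module) as a stand-in engine, and the two-column tables and their rows are made up for illustration. SQLite does not accept a parenthesised compound operand, so the intersection is wrapped in a derived table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (2, 'y'), (4, 'w');
""")

# A - (A ∩ B): the intersection is computed first, then removed from A.
long_form = sorted(conn.execute("""
    SELECT * FROM a
    EXCEPT
    SELECT * FROM (SELECT * FROM a INTERSECT SELECT * FROM b)
""").fetchall())

# A - B: the shorter, equivalent form.
short_form = sorted(conn.execute("""
    SELECT * FROM a
    EXCEPT
    SELECT * FROM b
""").fetchall())

print(long_form == short_form, short_form)
```

Both queries return the rows of a that have no identical row in b.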
Just to give another option in terms of SQL:
SELECT id FROM A
MINUS
SELECT id FROM B
UPD:
Beware that MINUS removes duplicates from the final result set and only exists in Oracle.
SQL standard uses EXCEPT which is supported by other vendors:
SELECT id FROM A
EXCEPT
SELECT id FROM B
In the standard it has a DISTINCT option, which should remove dupes. I guess you will have to check the docs of a specific vendor to see if duplicates will be removed. For example, SQL Server's implementation of EXCEPT is the same as MINUS in Oracle.
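A quick way to see the duplicate removal is to put a repeated value into A. The sketch below assumes SQLite via Python's sqlite3 module, whose EXCEPT also returns distinct rows; the tables and values are invented, and other vendors should still be checked against their own docs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (1), (2), (3);  -- note the duplicate 1
    INSERT INTO b VALUES (3);
""")

a_rows = sorted(conn.execute("SELECT id FROM a").fetchall())

# EXCEPT removes the rows found in b and also collapses the duplicate 1.
except_rows = sorted(
    conn.execute("SELECT id FROM a EXCEPT SELECT id FROM b").fetchall()
)

print(a_rows, except_rows)
```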
This computes the set difference of A and B, and in maths it is also equivalent to A - B. So what you are interested in here are the elements in A that are not in B.
You can have a look at this blog post to see how set difference can be done in mysql.
That would be SET A MINUS (SET A INNER JOIN SET B). You are taking the records that occur in both A and B (INNER JOIN), then removing those from SET A (MINUS)
Intersection in set theory terms is equivalent to INNER JOIN in SQL terms.
Edit: As for your question about being the same as NOT IN, not really. NOT IN is an operator used on column values ('selection'). The MINUS, JOIN, etc are set operators to perform operations between sets of rows.
A – ( B ∩ A ) equivalent is A LEFT OUTER JOIN B WHERE TableB.id IS null
Another option will be to use
SELECT * FROM A
WHERE
NOT EXISTS (SELECT * FROM B WHERE A.Id = B.id)
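The LEFT OUTER JOIN ... IS NULL form and the NOT EXISTS form can be compared side by side. This is a sketch under assumptions: SQLite through Python's sqlite3 module, single-column tables, and made-up ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER);
    CREATE TABLE B (id INTEGER);
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO B VALUES (2), (4);
""")

# Anti-join via LEFT OUTER JOIN: keep only the A rows with no match in B.
left_join_rows = sorted(conn.execute("""
    SELECT A.id
    FROM A
    LEFT OUTER JOIN B ON B.id = A.id
    WHERE B.id IS NULL
""").fetchall())

# Anti-join via NOT EXISTS.
not_exists_rows = sorted(conn.execute("""
    SELECT A.id
    FROM A
    WHERE NOT EXISTS (SELECT * FROM B WHERE A.id = B.id)
""").fetchall())

print(left_join_rows, not_exists_rows)
```

Unlike EXCEPT, neither of these forms removes duplicate rows of A on its own.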
Related
What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) has just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
Any table lacking join definition will result in a Cartesian product - every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and it joins without any join predicate to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
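The row multiplication is easy to reproduce at a small scale. The sketch below assumes SQLite via Python's sqlite3 module rather than Oracle, and uses 5 rows per table (an arbitrary stand-in for the 100 and 10,000 mentioned above) so that the count stays readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
ROWS_PER_TABLE = 5  # small stand-in for the row counts in the text

for name in ("A", "B", "C", "D"):
    conn.execute(f"CREATE TABLE {name} (col1 INTEGER)")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?)",
        [(i,) for i in range(ROWS_PER_TABLE)],
    )

# Comma syntax: C and D have no join condition at all.
comma_count = conn.execute(
    "SELECT COUNT(*) FROM A, B, C, D WHERE A.col1 = B.col1"
).fetchone()[0]

# Explicit syntax: the same query, with the Cartesian products spelled out.
explicit_count = conn.execute("""
    SELECT COUNT(*)
    FROM A
    INNER JOIN B ON A.col1 = B.col1
    CROSS JOIN C
    CROSS JOIN D
""").fetchone()[0]

print(comma_count, explicit_count, ROWS_PER_TABLE ** 3)
```

A joined to B yields 5 rows here, and each unconstrained table multiplies that by another 5.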
If you don't want to use a table, exclude it entirely from your SQL. If you can't, because of some constraint we don't know about, then include the proper join predicates for every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join, you don't need anything from it, and for some very strange reason you still can't leave the table out of your SQL completely (this does occasionally happen in reusable code), then you can disable the joins by making the predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
data b,
data c,
data d
WHERE a.col = b.col
AND DECODE('Y','Y',NULL,a.col) = c.col(+)
AND DECODE('Y','Y',NULL,a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = c.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first Y that you set to Y or N (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate, this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables is the right approach 95% of the time.
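The effect of the switch can be sketched outside Oracle as well. The example below is an assumption-laden stand-in: it uses SQLite via Python's sqlite3 module, replaces DECODE (which SQLite lacks) with an equivalent CASE expression, and passes the flag as a bind parameter named disable_join, which is a hypothetical name. It shows the logical result only, not the performance effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (col INTEGER)")
# Same test data as CONNECT BY LEVEL < 10: the values 1..9.
conn.executemany("INSERT INTO data VALUES (?)", [(i,) for i in range(1, 10)])

QUERY = """
    SELECT a.col, c.col, d.col
    FROM data a
    INNER JOIN data b ON a.col = b.col
    LEFT OUTER JOIN data c
        ON (CASE WHEN :disable_join = 'Y' THEN NULL ELSE a.col END) = c.col
    LEFT OUTER JOIN data d
        ON (CASE WHEN :disable_join = 'Y' THEN NULL ELSE a.col END) = d.col
    ORDER BY a.col
"""

def run(disable_join):
    """Run the query with the join switch set to 'Y' or 'N'."""
    return conn.execute(QUERY, {"disable_join": disable_join}).fetchall()

disabled = run("Y")  # predicates compare NULL, so c and d never match
enabled = run("N")   # predicates compare a.col, so c and d join normally

print(disabled[0], enabled[0], len(disabled), len(enabled))
```

With the joins disabled, every row of a still appears exactly once, with NULLs in place of the c and d columns, so there is no Cartesian product.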
Can I use WHERE after JOIN USING?
In my case if I run on snowflake multiple times the same code:
with CTE1 as
(
select *
from A
left join B
on A.date_a = B.date_b
)
select *
from CTE1
inner join C
using(var1_int)
where CTE1.date_a >= date('2020-10-01')
limit 1000;
sometimes I get a result and sometimes I get the error:
SQL compilation error: Can not convert parameter 'DATE('2020-10-01')' of type [DATE] into expected type [NUMBER(38,0)]
where NUMBER(38,0) is the type of var1_int column
Your problem has nothing to do with the existence of a where clause. Of course you can use a where clause after joins. That is how SQL queries are constructed.
According to the error message, CTE1.date_a is a number. Comparing it to a date results in a type-conversion error. If you provided sample data and desired results, then it might be possible to suggest a way to fix the problem.
tl;dr: Instead of JOIN .. USING() always prefer JOIN .. ON.
You are right to be suspicious of the results. Given your staging, only one of these queries returns without errors:
select a.date_1, id_1
from AE_USING_TST_A a
left join AE_USING_TST_B b
on a.date_1 = b.date_2
join AE_USING_TST_C v
using(id_1)
where A.date_1 >= date('2020-10-01')
-- Can not convert parameter 'DATE('2020-10-01')' of type
-- [DATE] into expected type [NUMBER(38,0)]
;
select a.date_1, a.id_1
from AE_USING_TST_A a
left join AE_USING_TST_B b
on a.date_1 = b.date_2
join AE_USING_TST_C v
on a.id_1=v.id_1
where A.date_1 >= date('2020-10-01')
-- 2020-10-11 2
;
I would call this a bug, except that the documentation is clear about not writing this kind of query with JOIN .. USING:
To use the USING clause properly, the projection list (the list of columns and other expressions after the SELECT keyword) should be “*”. This allows the server to return the key_column exactly once, which is the standard way to use the USING clause. For examples of standard and non-standard usage, see the examples below.
https://docs.snowflake.com/en/sql-reference/constructs/join.html
The documentation doubles down on the problems of using USING() in non-standard situations, with a different query acting "wrong":
The following example shows non-standard usage; the projection list contains something other than “*”. Because the usage is non-standard, the output contains two columns named “userid”, and the second occurrence (which you might expect to contain a value from table ‘r’) contains a value that is not in the table (the value ‘a’ is not in the table ‘r’).
So just prefer JOIN .. ON. For extra discussion on the SQL ANSI standard not defining behavior for some cases of USING() check:
https://community.snowflake.com/s/question/0D50Z00008WRZBBSA5/bug-with-join-using-
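The "key column exactly once" behaviour that the quoted documentation describes for SELECT * can be observed in other engines too. The sketch below does not reproduce the Snowflake error; it assumes SQLite via Python's sqlite3 module, with invented tables a and c, and only compares the column lists produced by USING and ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id_1 INTEGER, date_1 TEXT);
    CREATE TABLE c (id_1 INTEGER, label TEXT);
    INSERT INTO a VALUES (2, '2020-10-11');
    INSERT INTO c VALUES (2, 'two');
""")

# USING merges the key column: id_1 shows up once in SELECT *.
using_cols = [
    d[0] for d in conn.execute("SELECT * FROM a JOIN c USING (id_1)").description
]

# ON keeps both copies, so each one has to be qualified explicitly.
on_cols = [
    d[0]
    for d in conn.execute("SELECT * FROM a JOIN c ON a.id_1 = c.id_1").description
]

print(using_cols, on_cols)
```

With ON, every reference such as a.id_1 or a.date_1 names exactly one column of one table, which is the reason to prefer it when the projection list is anything other than "*".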
I have a query that requires me to join/refer to the same table; however, I am unable to get a result using the query.
Below is a sample of my query
SELECT a."column1", b."column1" as anotherColumn
FROM table1 AS a, table2 AS b
where a."x" = b."x"
AND NOT a."y" = b."y"
This query takes forever to load. However, if I just run:
SELECT a."column1"
FROM table1 AS A
it only takes 14sec.
I'm currently using PostgreSQL with pgAdmin. table1 currently has 1.4 million rows.
Is it because there is a lock on the table 1 when it was first referred to as a?
EDIT : Each row contains the record of "author","book published" and in this case, there might be many authors for a book hence being collaborators. What I am trying to achieve is to find out the number of collaborators for each author
What I am trying to achieve is to find out the number of collaborators for each author
Something like this would count the number of authors, and I guess where that number is greater than 1, the number of collaborators is that number - 1
select b.name, count(a.*)-1 as num_collaborators
from books b
inner join authors a on b.id = a.book_id
group by b.name
having count(a.*) > 1
--original
SELECT a."column1", b."column1" as anotherColumn
FROM table1 AS a, table2 AS b
;
--amended
SELECT a."column1", b."column1" as anotherColumn
FROM table1 AS a, table2 AS b
where a."x" = b."x"
AND NOT a."y" = b."y"
Over 25 years ago ANSI standards for SQL introduced a more "explicit" syntax for joins and using this is well established as "best practice" now.
One of the greatest benefits of this "explicit join syntax" is that accidentally forgetting to join properly becomes impossible, unlike the original query, which did forget the joining predicate. (When that happens, an unexpected Cartesian product is produced.)
So, I encourage you to stop using commas between table names. Taking that simple step will help you use better join syntax.
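For the stated goal (the number of collaborators for each author), a self-join written with explicit JOIN syntax might look like the sketch below. Everything here is an assumption: the table name authorship, its columns author and book, and the sample rows are invented, and SQLite via Python's sqlite3 module stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authorship (author TEXT, book TEXT);
    INSERT INTO authorship VALUES
        ('alice', 'book1'), ('bob', 'book1'), ('carol', 'book1'),
        ('alice', 'book2'), ('dave', 'book2'),
        ('erin',  'book3');
""")

# Pair every author with the other authors of the same book, then count
# the distinct partners per author.
collaborators = conn.execute("""
    SELECT a.author, COUNT(DISTINCT b.author) AS num_collaborators
    FROM authorship AS a
    INNER JOIN authorship AS b
        ON a.book = b.book
       AND a.author <> b.author
    GROUP BY a.author
    ORDER BY a.author
""").fetchall()

print(collaborators)
```

Authors who only ever wrote alone (erin in this data) drop out of the INNER JOIN; a LEFT JOIN would keep them with a count of 0. On a table with over a million rows, an index on the book column is likely needed for the self-join to finish in reasonable time.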
Can anyone give me a good example of a subquery using TSQL 2008?
Maximilian Mayer believes that, due to referencing MS documentation, my assertion that there is a difference between a subquery and a subSelect is incorrect. Frankly, I'd consider MSDN's "Subquery Fundamentals" a better choice. Quote:
You are making distinctions between terms that actually mean the same.
O RLY?
A subQUERY...
IE:
WHERE id IN (SELECT n.id FROM TABLE n)
OR id = (SELECT MAX(m.id) FROM TABLE m)
OR EXISTS(SELECT 1/0 FROM TABLE) --won't return a math error for division by zero
...affects the WHERE or HAVING clauses -- the filtering of data -- for a SELECT, INSERT, UPDATE or DELETE statement. The value from a subquery is never directly visible in the SELECT clause.
A subSELECT...
IE:
SELECT t.column,
(SELECT x.col FROM TABLE x) AS col2
FROM TABLE t
...does not affect the filtering of data in the main query, and the value is exposed directly in the SELECT clause. But it's only one value - you can't return two or more columns into a single column in the outer query.
A subselect is a consistent means of performing a LEFT JOIN in ANSI-89 join syntax - if there is no supporting row, the column will be null. Additionally, a non-correlated subselect will return the same value for every row of the main query.
Correlation
If a subquery or subselect is correlated, that query runs once for every record of the main query returned -- which doesn't scale well as the number of rows in the result set increases.
Derived Table/Inline View
IE:
SELECT x.*,
y.max_date,
y.num
FROM TABLE x
JOIN (SELECT t.id,
t.num,
MAX(t.date) AS max_date
FROM TABLE t
GROUP BY t.id, t.num) y ON y.id = x.id
...is a JOIN to a derived table (AKA inline view).
"Inline view" is a better term, because that is all that happens when you reference a non-materialized view -- a view is just a prepared SQL statement. There's no performance or efficiency difference if you create a view with a query like the one in the example, and reference the view name in place of the SELECT statement within the brackets of the JOIN. The example has the same information as a correlated subquery, but the performance benefit of using a join and none of the subquery detriments. And you can return more than one column, because it is a view/derived table.
Conclusion
It should be obvious why I and others make distinctions. The concept of relying on the word "subquery" to categorize any SELECT statement that isn't the main clause is fatally flawed, because it's also a specific case under a categorization of the same word (IE: subquery-subselect, subquery-subquery, subquery-join...). Now think of helping someone who says "I've got a problem with a subquery..."
Maximilian Mayer's idea of "official" documentation was written by technical writers, who often have no experience in the subject and are only summarizing what they've been told to from knowledgeable people who have simplified things. Ultimately, it's just text on a page or screen -- like what you're reading now -- and the decision is up to you if the details I've laid out make sense to you.
For variety's sake, here's one in the where clause:
select
a.firstname,
a.lastname
from
employee a
where
a.companyid in (
select top 10
c.companyid
from
company c
where
c.num_employees > 1000
)
...returns all employees in the top ten companies with over 1000 employees.
SELECT
*,
(SELECT TOP 1 SomeColumn FROM dbo.SomeOtherTable)
FROM
dbo.MyTable
SELECT a.*, b.*
FROM TableA AS a
INNER JOIN
(
SELECT *
FROM TableB
) as b
ON a.id = b.id
That's a normal subquery, running once for the whole result set.
On the other hand
SELECT a.*, (SELECT b.somecolumn FROM TableB AS b WHERE b.id = a.id)
FROM TableA AS a
is a correlated subquery, running once for every row in the result set.
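The two shapes can be run against the same data to see how their results differ. This is a sketch with assumptions: SQLite via Python's sqlite3 module instead of SQL Server, invented rows, and ids that are unique in TableB so the correlated subquery returns a single value per outer row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER, name TEXT);
    CREATE TABLE TableB (id INTEGER, somecolumn TEXT);
    INSERT INTO TableA VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO TableB VALUES (1, 'p'), (2, 'q');
""")

# Join to a derived table: rows of TableA without a match are dropped.
joined = conn.execute("""
    SELECT a.id, b.somecolumn
    FROM TableA AS a
    INNER JOIN (SELECT * FROM TableB) AS b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

# Correlated subquery in the SELECT list: every TableA row is kept, and the
# value is NULL where TableB has no matching row.
correlated = conn.execute("""
    SELECT a.id,
           (SELECT b.somecolumn FROM TableB AS b WHERE b.id = a.id)
    FROM TableA AS a
    ORDER BY a.id
""").fetchall()

print(joined, correlated)
```

The correlated form behaves like a LEFT JOIN here, which is why id 3 appears with a NULL.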
I've just learned ( yesterday ) to use "exists" instead of "in".
BAD
select * from table where nameid in (
select nameid from othertable where otherdesc = 'SomeDesc' )
GOOD
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
And I have some questions about this:
1) The explanation as I understood was: "The reason why this is better is because only the matching values will be returned instead of building a massive list of possible results". Does that mean that while the first subquery might return 900 results the second will return only 1 ( yes or no )?
2) In the past I have had the RDBMS complaining: "only the first 1000 rows might be retrieved". Would this second approach solve that problem?
3) What is the scope of the alias in the second subquery? Does the alias only live inside the parentheses?
for example
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
AND
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeOtherDesc' )
That is, if I use the same alias ( o for table othertable ) In the second "exist" will it present any problem with the first exists? or are they totally independent?
Is this something Oracle only related or it is valid for most RDBMS?
Thanks a lot
It's specific to each DBMS and depends on the query optimizer. Some optimizers detect IN clause and translate it.
In all DBMSes I tested, alias is only valid inside the ( )
BTW, you can rewrite the query as:
select t.*
from table t
join othertable o on t.nameid = o.nameid
and o.otherdesc in ('SomeDesc','SomeOtherDesc');
And, to answer your questions:
1) Yes
2) Yes
3) Yes
You are treading into complicated territory, known as 'correlated sub-queries'. Since we don't have detailed information about your tables and the key structures, some of the answers can only be 'maybe'.
In your initial IN query, the notation would be valid whether or not OtherTable contains a column NameID (and, indeed, whether OtherDesc exists as a column in Table or OtherTable - which is not clear in any of your examples, but presumably is a column of OtherTable). This behaviour is what makes a correlated sub-query into a correlated sub-query. It is also a routine source of angst for people when they first run into it - invariably by accident. Since the SQL standard mandates the behaviour of interpreting a name in the sub-query as referring to a column in the outer query if there is no column with the relevant name in the tables mentioned in the sub-query but there is a column with the relevant name in the tables mentioned in the outer (main) query, no product that wants to claim conformance to (this bit of) the SQL standard will do anything different.
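The name-resolution rule described above can be seen in a small experiment. The sketch assumes SQLite via Python's sqlite3 module, which follows the same outward lookup; main_table stands in for the question's table (TABLE itself is a reserved word), and OtherTable is deliberately created without a NameID column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (nameid INTEGER);
    CREATE TABLE othertable (otherdesc TEXT);  -- no nameid column here
    INSERT INTO main_table VALUES (1), (2), (3);
    INSERT INTO othertable VALUES ('SomeDesc');
""")

# nameid inside the sub-query cannot be found in othertable, so it is
# resolved against main_table: the sub-query is silently correlated.
rows = sorted(conn.execute("""
    SELECT nameid
    FROM main_table
    WHERE nameid IN (
        SELECT nameid FROM othertable WHERE otherdesc = 'SomeDesc'
    )
""").fetchall())

print(rows)
```

The query runs without error and every row passes the filter, because each row's NameID is compared with itself. Qualifying the column names with table aliases turns this kind of mistake into an error instead.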
The answer to your Q1 is "it depends", but given plausible assumptions (NameID exists as a column in both tables; OtherDesc only exists in OtherTable), the results should be the same in terms of the data set returned, but may not be equivalent in terms of performance.
The answer to your Q2 is that in the past, you were using an inferior if not defective DBMS. If it supported EXISTS, then the DBMS might still complain about the cardinality of the result.
The answer to your Q3 as applied to the first EXISTS query is "t is available as an alias throughout the statement, but o is only available as an alias inside the parentheses". As applied to your second example box - with AND connecting two sub-selects (the second of which is missing the open parenthesis when I'm looking at it), then "t is available as an alias throughout the statement and refers to the same table, but there are two different aliases both labelled 'o', one for each sub-query". Note that the query might return no data if OtherDesc is unique for a given NameID value in OtherTable; otherwise, it requires two rows in OtherTable with the same NameID and the two OtherDesc values for each row in Table with that NameID value.
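A corrected version of the two-EXISTS query, with the opening parenthesis restored, illustrates both points: the alias o can be reused because each sub-query has its own scope, and a row only qualifies if OtherTable holds both descriptions for its NameID. The sketch assumes SQLite via Python's sqlite3 module; main_table replaces the reserved word TABLE and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (nameid INTEGER);
    CREATE TABLE othertable (nameid INTEGER, otherdesc TEXT);
    INSERT INTO main_table VALUES (1), (2);
    INSERT INTO othertable VALUES
        (1, 'SomeDesc'), (1, 'SomeOtherDesc'),
        (2, 'SomeDesc');
""")

# Each sub-query declares its own alias o; the two do not clash.
rows = conn.execute("""
    SELECT t.nameid
    FROM main_table t
    WHERE EXISTS (
        SELECT o.nameid FROM othertable o
        WHERE t.nameid = o.nameid AND o.otherdesc = 'SomeDesc')
      AND EXISTS (
        SELECT o.nameid FROM othertable o
        WHERE t.nameid = o.nameid AND o.otherdesc = 'SomeOtherDesc')
    ORDER BY t.nameid
""").fetchall()

print(rows)
```

Only NameID 1 is returned, since NameID 2 has just one of the two descriptions.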
Oracle-specific: When you write a query using the IN clause, you're telling the rule-based optimizer that you want the inner query to drive the outer query. When you write EXISTS in a where clause, you're telling the optimizer that you want the outer query to be run first, using each value to fetch a value from the inner query. See "Difference between IN and EXISTS in subqueries".
Probably.
Alias declared inside subquery lives inside subquery. By the way, I don't think your example with 2 ANDed subqueries is valid SQL. Did you mean UNION instead of AND?
Personally I would use a join, rather than a subquery for this.
SELECT t.*
FROM yourTable t
INNER JOIN otherTable ot
ON (t.nameid = ot.nameid AND ot.otherdesc = 'SomeDesc')
It is difficult to generalize that EXISTS is always better than IN. Logically if that is the case, then SQL community would have replaced IN with EXISTS...
Also, please note that IN and EXISTS are not same, the results may be different when you use the two...
With IN, usually it's a full table scan of the inner table once, without removing NULLs (so if you have NULLs in your inner table, IN will not remove NULLs by default)... EXISTS, on the other hand, removes NULLs, and in the case of a correlated subquery it runs the inner query for every row from the outer query.
Assuming there are no NULLs and it's a simple query (with no correlation), EXISTS might perform better if the row you are looking for is not the last row. If it happens to be the last row, EXISTS may need to scan to the end like IN, so performance is similar...
But IN and EXISTS are not interchangeable...
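One concrete case where the two constructs diverge is the negated form with a NULL in the inner table. The sketch below assumes SQLite via Python's sqlite3 module, with invented data; it shows the result sets only and says nothing about performance, which depends on the optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (nameid INTEGER);
    CREATE TABLE othertable (nameid INTEGER);
    INSERT INTO main_table VALUES (1), (2), (3);
    INSERT INTO othertable VALUES (1), (NULL);
""")

def fetch(sql):
    """Run a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())

in_rows = fetch("""
    SELECT nameid FROM main_table
    WHERE nameid IN (SELECT nameid FROM othertable)""")
exists_rows = fetch("""
    SELECT nameid FROM main_table t
    WHERE EXISTS (SELECT 1 FROM othertable o WHERE o.nameid = t.nameid)""")

# The NULL makes "nameid NOT IN (...)" unknown for every non-matching row,
# while NOT EXISTS simply finds no matching row.
not_in_rows = fetch("""
    SELECT nameid FROM main_table
    WHERE nameid NOT IN (SELECT nameid FROM othertable)""")
not_exists_rows = fetch("""
    SELECT nameid FROM main_table t
    WHERE NOT EXISTS (SELECT 1 FROM othertable o WHERE o.nameid = t.nameid)""")

print(in_rows, exists_rows, not_in_rows, not_exists_rows)
```

The positive forms agree on this data, but NOT IN returns nothing while NOT EXISTS returns the two unmatched rows.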