The overwhelming majority of people support my own view that there is no difference between the following statements:
SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT NULL FROM tableB WHERE tableA.x = tableB.y)
Yet today I came face-to-face with the opposite claim when in our internal developer meeting it was advocated that select 1 is the way to go and select * selects all the (unnecessary) data, hence hurting performance.
I seem to remember that there was some old version of Oracle or something where this was true, but I cannot find references to that. So, I'm curious - how was this practice born? Where did this myth originate from?
Added: Since some people insist on having evidence that this is indeed a false belief, here is a Google query that shows plenty of people saying so. If you're too lazy, check this direct link, where one person even compares execution plans and finds that they are equivalent.
The main part of your question is - "where did this myth come from?"
So to answer that, I guess one of the first performance hints people learn with SQL is that select * is inefficient in most situations. The fact that it isn't inefficient in this specific situation is hence somewhat counterintuitive, so it's not surprising that people are skeptical about it. Some simple research or experiments should be enough to banish most myths, although human history shows that myths are quite hard to banish.
As a demo, try these
SELECT * FROM tableA WHERE EXISTS (SELECT 1/0 FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT CAST('bollocks' as int) FROM tableB WHERE tableA.x = tableB.y)
Now read the ANSI standard. ANSI-92, page 191, case 3a
If the <select list> "*" is simply contained in a <subquery>
that is immediately contained in an <exists predicate>, then
the <select list> is equivalent to a <value expression> that
is an arbitrary <literal>.
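The equivalence the standard describes is easy to check empirically. Below is a minimal sketch, assuming SQLite driven from Python's sqlite3 module (my choice for the demo, not an engine mentioned in the question) and some made-up rows. It only shows that the four select lists return the same rows, it says nothing about compile-time cost on any particular engine.

```python
import sqlite3

# Hypothetical data; tableA/tableB and columns x/y follow the question,
# the "payload" column is made up so that * expands to more than one column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (x INTEGER);
    CREATE TABLE tableB (y INTEGER, payload TEXT);
    INSERT INTO tableA VALUES (1), (2), (3);
    INSERT INTO tableB VALUES (1, 'a'), (1, 'b'), (3, 'c');
""")

# Run the same EXISTS query with each select list from the question.
select_lists = ["*", "y", "1", "NULL"]
results = {}
for select_list in select_lists:
    sql = (
        "SELECT x FROM tableA WHERE EXISTS "
        f"(SELECT {select_list} FROM tableB WHERE tableA.x = tableB.y) "
        "ORDER BY x"
    )
    results[select_list] = [row[0] for row in conn.execute(sql)]

for select_list, rows in results.items():
    print(select_list, rows)
```

Every variant returns the rows with x = 1 and x = 3, including SELECT NULL: EXISTS only asks whether the subquery produces a row, not what is in it.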
Finally, the behaviour on most RDBMS should ignore THE * in the EXISTS clause. As per this question yesterday ( Sql Server 2005 - Insert if not exists ) this doesn't work on SQL Server 2000 but I know it does on SQL Server 2005+
For SQL Server Conor Cunningham from the Query Optimiser team explains why he typically uses SELECT 1
The QP will take and expand all *'s
early in the pipeline and bind them to
objects (in this case, the list of
columns). It will then remove
unneeded columns due to the nature of
the query.
So for a simple EXISTS subquery like
this:
SELECT col1 FROM MyTable WHERE EXISTS
(SELECT * FROM Table2 WHERE
MyTable.col1=Table2.col2)
The * will be
expanded to some potentially big
column list and then it will be
determined that the semantics of the
EXISTS does not require any of those
columns, so basically all of them can
be removed.
"SELECT 1" will avoid having to
examine any unneeded metadata for that
table during query compilation.
However, at runtime the two forms of
the query will be identical and will
have identical runtimes.
Edit: However I have looked at this in some detail since posting this answer and come to the conclusion that SELECT 1 does not avoid this column expansion. Full details here.
This question has an answer that says it was some version of MS Access that actually did not ignore the field of the SELECT clause. I have done some Access development, and I have heard that SELECT 1 is best practice, so this seems very likely to me to be the source of the "myth."
Performance of SQL EXISTS usage variants
We have a database that holds data for numerous customers. We want to give customers access to the database, but only to the data that belongs to them. Parsing the select to then insert in the where clause "and Company.Name = 'Acme'" strikes me as weak because SQL selects can be very complex and handling 100% of all cases may be difficult.
Is there some way to do the equivalent of (I know this is not valid SQL):
select * from * where Company.Name = 'Acme' and (passed_in_select)
You can nest a full select in as an inner part of a large select. Is there some way to do the above? This way it's a very simple restriction on the select and that is likely to work 100% of the time.
Here is a system solution called "virtual private database" for Oracle database:
https://docs.oracle.com/cd/B28359_01/network.111/b28531/vpd.htm
For other databases look whether there is similar built-in solution.
But there is very simple solution using the WITH clause:
WITH
tab_a__ AS (SELECT * FROM tab_a WHERE comp='xy'),
tab_b__ AS (SELECT * FROM tab_b WHERE comp='xy')
SELECT ... -- original select, with tab_a and tab_b renamed to tab_a__ and tab_b__
You just have to find all used tables in the select, add __ behind and add the CTEs to the WITH clause.
Notes: Some databases do not support WITH clause though it is an SQL standard. Some databases can have alias length limitation you could exceed by adding the suffix.
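Here is a small runnable sketch of the WITH approach, assuming SQLite through Python's sqlite3 module and made-up columns (id, val, a_id, note) alongside the comp column. The customer's query is written against the suffixed names, and the restriction is simply prepended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab_a (id INTEGER, comp TEXT, val TEXT);
    CREATE TABLE tab_b (a_id INTEGER, comp TEXT, note TEXT);
    INSERT INTO tab_a VALUES (1, 'xy', 'ours'), (2, 'zz', 'theirs');
    INSERT INTO tab_b VALUES (1, 'xy', 'n1'), (2, 'zz', 'n2');
""")

# The customer's query after renaming tab_a -> tab_a__ and tab_b -> tab_b__.
# (How the renaming is done is not shown here; it is assumed to have happened.)
customer_query = """
    SELECT a.val, b.note
    FROM tab_a__ a JOIN tab_b__ b ON b.a_id = a.id
"""

# Prepend the CTEs that restrict every table to one company.
restricted = """
    WITH
    tab_a__ AS (SELECT * FROM tab_a WHERE comp = 'xy'),
    tab_b__ AS (SELECT * FROM tab_b WHERE comp = 'xy')
""" + customer_query

rows = conn.execute(restricted).fetchall()
print(rows)
```

Only the 'xy' rows come back. Note that this by itself does not stop a query from naming tab_a directly, so the renaming step has to be reliable for the restriction to mean anything.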
select * from
(
select * from table_a
) outer_table_a
where outer_table_a.col_a = 'test'
I do this sort of thing often especially when I want to perform some aggregation on the data in the inner query (sum, max, etc.) I do this with SQL Server, I do not know if it is valid with other DBMS but I would be surprised if it were not.
I don't know if I would rely on this approach to effectively grant permissions. Perhaps views would allow you to lock things down a bit tighter. It sounds like you're planning to tack something on dynamically to a query that you may not have written? In that case whoever writes that query could transform your column of interest, which would result in visibility over things you didn't intend, like:
select * from
(
select 'test' as col_a, launch_codes from table_a
) outer_table_a
where outer_table_a.col_a = 'test'
Is there a good or standard SQL method of asserting that a join does not duplicate any rows (produces 0 or 1 copies of the source table row)? Assert as in causes the query to fail or otherwise indicate that there are duplicate rows.
A common problem in a lot of queries is when a table is expected to be 1:1 with another table, but there might exist 2 rows that match the join criteria. This can cause errors that are hard to track down, especially for people not necessarily entirely familiar with the tables.
It seems like there should be something simple and elegant - this would be very easy for the SQL engine to detect (have I already joined this source row to a row in the other table? ok, error out) but I can't seem to find anything on this. I'm aware that there are long / intrusive solutions to this problem, but for many ad hoc queries those just aren't very fun to work out.
EDIT / CLARIFICATION: I'm looking for a one-step query-level fix. Not a verification step on the results of that query.
If you are only testing for linked rows rather than requiring output, then you'd use EXISTS.
More correctly, you need a "semi-join" but this isn't supported by most RDBMS unless as EXISTS
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT * FROM TableB b WHERE a.id = b.id)
Also see:
Using 'IN' with a sub-query in SQL Statements
EXISTS vs JOIN and use of EXISTS clause
SELECT JoinField
FROM MyJoinTable
GROUP BY JoinField
HAVING COUNT(*) > 1
LIMIT 1
Is that simple enough? Don't have Postgres but I think it's valid syntax.
Something along the lines of
SELECT a.id, COUNT(b.id)
FROM TableA a
JOIN TableB b ON a.id = b.id
GROUP BY a.id
HAVING COUNT(b.id) > 1
Should return rows in TableA that have more than one associated row in TableB.
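To see what that query reports, here is a minimal sketch assuming SQLite via Python's sqlite3 module and invented sample rows (the detail column is made up). It is a separate check run before the real join, not the single-step assertion the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER);
    CREATE TABLE TableB (id INTEGER, detail TEXT);
    INSERT INTO TableA VALUES (1), (2), (3);
    -- id 2 matches twice, so the join is not 1:1 for that row
    INSERT INTO TableB VALUES (1, 'only'), (2, 'first'), (2, 'second');
""")

# Source rows that would be duplicated by the join, with their match counts.
offenders = conn.execute("""
    SELECT a.id, COUNT(b.id)
    FROM TableA a
    JOIN TableB b ON a.id = b.id
    GROUP BY a.id
    HAVING COUNT(b.id) > 1
""").fetchall()

print(offenders)
if offenders:
    print("join is not 1:1 for ids:", [row[0] for row in offenders])
```

An empty result means the join cannot duplicate any TableA row for the current data. A calling script could raise an error instead of printing when the list is non-empty.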
I was asked this question during one of my interviews.
Can you do JOIN using UNION keyword?
Can you do UNION using JOIN keyword?
That is -
1. I should get same output as JOIN without using JOIN keyword, but using UNION Keyword?
2. I should get same output as UNION without using UNION keyword, but using JOIN Keyword?
Can you give me an example of how to do this if possible?
An interview is the framework on which you set out your wares. Remember: don't answer questions ;)
Think of a press conference: the spokesperson is not looking to answer difficult questions from journos to catch themselves out. Rather, they are looking for questions to which they already have answers, being the information they want to release (and no more!)
If I faced this question in an interview, I would use it to demonstrate my knowledge of relational algebra because that's what I'd have gone into the interview with the intention of doing; I'd be alert for the "Talk about relational algebra here" question and this would be it.
Loosely speaking, JOIN is the counterpart of logical AND, whereas UNION is the counterpart of logical OR. Therefore, similar questions using conventional logic could be, "Can you do AND using OR?" and "Can you do OR using AND?" The answer would depend on what else you could use e.g. NOT might come in handy ;)
I'd also be tempted to discuss the differences between the set of primitive operators, the set of operators necessary for computational completeness and the set of operators and shorthands required for practical purposes.
Trying to answer the question directly raises further questions. JOIN implies 'natural join' in relational algebra whereas in SQL it implies INNER JOIN. If the question specifically relates to SQL, do you have to answer for all the JOIN types? What about UNION JOIN?
To employ one example, SQL's outer join is famously a UNION. Chris Date expresses it better than I could ever hope to:
Outer join is expressly designed to
produce nulls in its result and should
therefore be avoided, in general.
Relationally speaking, it's a kind of
shotgun marriage: It forces tables
into a kind of union—yes, I do mean
union, not join—even when the tables
in question fail to conform to the
usual requirements for union (see
Chapter 6). It does this, in effect,
by padding one or both of the tables
with nulls before doing the union,
thereby making them conform to those
usual requirements after all. But
there's no reason why that padding
shouldn't be done with proper values
instead of nulls
SQL and Relational Theory, 1st Edition by C.J. Date
This would be a good discussion point if, "I hate nulls" is something you wanted to get across in the interview!
These are just a few thoughts that spring to mind. The crucial point is, by asking these questions the interviewer is offering you a branch. What will YOU hang on it? ;)
As this is an interview question, they are testing your understanding of both these functions.
The likely answer they are expecting is "generally no, you cannot do this, as they perform different actions", and you would explain this in more detail by stating that a union appends rows to the end of the result set whereas a join adds further columns.
The only way you could have a Join and a Union work is where rows contain data from only one of the two sources:
SELECT A.AA, '' AS BB FROM A
UNION ALL
SELECT '' AS AA, B.BB FROM B
Is the same as:
SELECT ISNULL(A.AA, '') AS AA, ISNULL(B.BB, '') AS BB FROM A
FULL OUTER JOIN B ON 1=0
Or to do this with only one column where the types match:
SELECT A.AA AS TT FROM A
UNION ALL
SELECT B.BB AS TT FROM B
Is the same as:
SELECT ISNULL(A.AA, B.BB) AS TT
FROM A
FULL OUTER JOIN B ON 1=0
One case where you would do this is if you have data spread over multiple tables but you want to see it all together. However, I would advise using a UNION in this case rather than a FULL OUTER JOIN, because the query then does what a reader would expect.
Do you mean something like this?
create table Test1 (TextField nvarchar(50), NumField int)
create table Test2 (NumField int)
create table Test3 (TextField nvarchar(50), NumField int)
insert into Test1 values ('test1a', 1)
insert into Test1 values ('test1b', 2)
insert into Test2 values (1)
insert into Test3 values ('test3a', 4)
insert into Test3 values ('test3b', 5)
select Test1.*
from Test1 inner join Test2 on Test1.NumField = Test2.NumField
union
select * from Test3
(written on SQL Server 2008)
UNION works when both SELECT statements have the same number of columns, AND the columns have the same (or at least similar) data types.
UNION doesn't care if both SELECT statements select data only from a single table, or if one or both of them are already JOINs on more than one table.
I think it also depends on other operations available.
If I remember well, UNION can be done using a FULL OUTER join:
Table a (x, y)
Table b (x, y)
CREATE VIEW one
AS
SELECT a.x AS Lx
, b.x AS Rx
, a.y AS Ly
, b.y AS Ry
FROM a FULL OUTER JOIN b
ON a.x = b.x
AND a.y = b.y
CREATE VIEW unionTheHardWay
AS
SELECT COALESCE(Lx, Rx) AS x
, COALESCE(Ly, Ry) AS y
FROM one
Every now and then I see these being used, but it never seems to be anything that can't be performed as equally well, if not better, by using a normal join or subquery.
I see them as being misleading (they're arguably harder to accurately visualize compared to conventional joins and subqueries), often misunderstood (e.g. using SELECT * will behave the same as SELECT 1 in the EXISTS/NOT EXISTS subquery), and from my limited experience, slower to execute.
Can someone describe and/or provide me an example where they are best suited or where there is no option other than to use them? Note that since their execution and performance are likely platform dependent, I'm particularly interested in their use in MySQL.
Every now and then I see these being used, but it never seems to be anything that can't be performed as equally well, if not better, by using a normal join or subquery.
This article (though SQL Server related):
IN vs. JOIN vs. EXISTS
may be of interest to you.
In a nutshell, JOIN is a set operation, while EXISTS is a predicate.
In other words, these queries:
SELECT *
FROM a
JOIN b
ON some_condition(a, b)
vs.
SELECT *
FROM a
WHERE EXISTS
(
SELECT NULL
FROM b
WHERE some_condition(a, b)
)
are not the same: the former can return more than one record from a, while the latter cannot.
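A quick way to see the difference, sketched with SQLite through Python's sqlite3 module. The tables a and b and the column a_id are invented, and the equality on a_id stands in for some_condition(a, b).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2);
    -- three rows in b match the single row a.id = 1
    INSERT INTO b VALUES (1), (1), (1);
""")

# JOIN: one output row per matching (a, b) pair.
join_rows = conn.execute(
    "SELECT a.id FROM a JOIN b ON b.a_id = a.id"
).fetchall()

# EXISTS: each row of a is kept or dropped, never multiplied.
exists_rows = conn.execute(
    "SELECT a.id FROM a WHERE EXISTS (SELECT NULL FROM b WHERE b.a_id = a.id)"
).fetchall()

print("JOIN:  ", join_rows)
print("EXISTS:", exists_rows)
```

The JOIN returns a.id = 1 three times, once per match in b, while the EXISTS form returns it once.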
Their counterparts, NOT EXISTS vs. LEFT JOIN / IS NULL, are the same logically but not performance-wise.
In fact, the former may be more efficient in SQL Server:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
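The logical equivalence of the two anti-join spellings can be checked on a toy data set. This sketch assumes SQLite via Python's sqlite3 module and invented tables; it shows only that the results agree, not anything about the relative performance discussed in the linked articles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1);
""")

# Rows of a with no match in b, written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT a.id FROM a
    WHERE NOT EXISTS (SELECT NULL FROM b WHERE b.a_id = a.id)
    ORDER BY a.id
""").fetchall()

# The same rows, written as LEFT JOIN / IS NULL.
left_join_rows = conn.execute("""
    SELECT a.id FROM a
    LEFT JOIN b ON b.a_id = a.id
    WHERE b.a_id IS NULL
    ORDER BY a.id
""").fetchall()

print(not_exists_rows, left_join_rows)
```

Both forms return ids 2 and 3. Unlike the positive case, the duplicate matches for id 1 cannot multiply anything here, because matched rows are filtered out entirely.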
They are well suited when the main query returns far fewer rows than the table in which you want to look them up. Example:
SELECT st.State
FROM states st
WHERE st.State LIKE 'N%' AND EXISTS(SELECT 1 FROM addresses a WHERE a.State = st.State)
Doing this with a join will be much slower. A better example is when you want to check whether an item exists in one of multiple tables.
You can't [easily] use a join in an UPDATE statement, so WHERE EXISTS works excellently there:
UPDATE mytable t
SET columnX = 'SomeValue'
WHERE EXISTS
(SELECT 1
FROM myothertable ot
WHERE ot.columnA = t.columnY
AND ot.columnB = 'XYX'
);
Edit: Basing this on Oracle more than MySQL, and yes there are ways to do it with an inline view, but IMHO this is cleaner.
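For anyone who wants to try the pattern without an Oracle instance, here is a sketch assuming SQLite via Python's sqlite3 module and invented sample rows. The table alias on the UPDATE target is dropped and the table name is used directly inside the subquery, which is my adaptation for portability rather than part of the statement above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (columnY INTEGER, columnX TEXT);
    CREATE TABLE myothertable (columnA INTEGER, columnB TEXT);
    INSERT INTO mytable VALUES (1, 'old'), (2, 'old'), (3, 'old');
    -- two matching rows for columnY = 1, a non-matching columnB for 2
    INSERT INTO myothertable VALUES (1, 'XYX'), (2, 'other'), (1, 'XYX');
""")

# Update only the rows that have at least one qualifying row in myothertable.
conn.execute("""
    UPDATE mytable
    SET columnX = 'SomeValue'
    WHERE EXISTS
      (SELECT 1
       FROM myothertable ot
       WHERE ot.columnA = mytable.columnY
         AND ot.columnB = 'XYX')
""")

rows = conn.execute(
    "SELECT columnY, columnX FROM mytable ORDER BY columnY"
).fetchall()
print(rows)
```

Only the row with columnY = 1 is changed, and it is updated once even though two rows in myothertable match it.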
I need to perform a query like this:
SELECT *,
(SELECT Table1.Column
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 WHERE tmp = 1
I know I can take a workaround but I would like to know if this syntax is possible as it is (I think) in Mysql.
The query you posted won't work on SQL Server, because the subquery in your select clause could possibly return more than one row. I don't know how MySQL will treat it, but from what I'm reading MySQL will also yield an error if the subquery returns any duplicates. I do know that SQL Server won't even compile it.
The difference is that MySQL will at least attempt to run the query and if you're very lucky (Table2Id is unique in Table1) it will succeed. More probably it will return an error. SQL Server won't try to run it at all.
Here is a query that should run on either system, and won't cause an error if Table2Id is not unique in Table1. It will return "duplicate" rows in that case, where the only difference is the source of the Table1.Column value:
SELECT Table2.*, Table1.Column AS tmp
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
Perhaps if you shared what you were trying to accomplish we could help you write a query that does it.
SELECT *
FROM (
SELECT t.*,
(
SELECT Table1.Column
FROM Table1
INNER JOIN
Table2
ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 t
) q
WHERE tmp = 1
This is valid syntax, but it will fail (both in MySQL and in SQL Server) if the subquery returns more than one row.
What exactly are you trying to do?
Please provide some sample data and desired resultset.
I agree with Joel's solution but I want to discuss why your query would be a bad idea to use (even though the syntax is essentially valid). This is a correlated subquery. The first issue with these is that they don't work if the subquery could possibly return more than one value for a record. The second and more critical problem (in my mind) is that they must work row by row rather than on the set of data. This means they will virtually always affect performance. So correlated subqueries should almost never be used in a production system. In this simple case, the join Joel showed is the correct solution.
If the subquery is more complicated, you may want to turn it into a derived table instead (this also fixes the problem of more than one value being associated with a record). While a derived table looks a lot like a correlated subquery to the uninitiated, it does not perform the same way because it acts on the set of data rather than row by row, and thus will often be significantly faster. You are essentially making the query a table in the join.
Below is an example of your query re-written as a derived table. (Of course in production code you would not use select * either especially in a join, spell out the fields you need)
SELECT *
FROM Table2 t2
JOIN
(SELECT Table1.[Column] as tmp, Table1.Table2Id
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id ) as t
ON t.Table2Id = t2.Id
WHERE tmp = 1
You've already got a variety of answers, some of them more useful than others. But to answer your question directly:
No, SQL Server will not allow you to reference the column alias (defined in the select list) in the predicate (the WHERE clause). I think that is sufficient to answer the question you asked.
Additional details:
(this discussion goes beyond the original question you asked.)
As you noted, there are several workarounds available.
Most problematic with the query you posted (as others have already pointed out) is that we aren't guaranteed that the subquery in the SELECT list returns only one row. If it does return more than one row, SQL Server will throw a "too many rows" exception:
Subquery returned more than 1 value.
This is not permitted when the subquery
follows =, !=, <, <= , >, >= or when the
subquery is used as an expression.
For the following discussion, I'm going to assume that issue is already sufficiently addressed.
Sometimes, the easiest way to make the alias available in the predicate is to use an inline view.
SELECT v.*
FROM ( SELECT *
, (SELECT Table1.Column
FROM Table1
JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
) as tmp
FROM Table2
) v
WHERE v.tmp = 1
Note that SQL Server won't push the predicate for the outer query (WHERE v.tmp = 1) into the subquery in the inline view. So you need to push that in yourself, by including the WHERE Table1.Column = 1 predicate in the subquery, particularly if you're depending on that to make the subquery return only one value.
That's just one approach to working around the problem; there are others. I suspect that the query plan for this SQL Server query is not going to be optimal. For performance, you probably want to go with a JOIN or an EXISTS predicate.
NOTE: I'm not an expert on using MySQL. I'm not all that familiar with MySQL support for subqueries. I do know (from painful experience) that subqueries weren't supported in MySQL 3.23, which made migrating an application from Oracle 8 to MySQL 3.23 particularly painful.
Oh and btw... of no interest to anyone in particular, the Teradata DBMS engine DOES have an extension that allows for the NAMED keyword in place of the AS keyword, and a NAMED expression CAN be referenced elsewhere in the QUERY, including the WHERE clause, the GROUP BY clause and the ORDER BY clause. Shuh-weeeet
That kind of syntax is basically valid (you need to move the where tmp=... to an outer "select * from (....)", though), although it's ambiguous since you have two sets named "Table2". You should probably define aliases on at least one of your usages of that table to clear up the ambiguity.
Unless you intended that to return a column from table1 corresponding to columns in table2 ... in which case you might have wanted to simply join the tables?