I was asked this question during one of my interviews.
Can you do JOIN using UNION keyword?
Can you do UNION using JOIN keyword?
That is -
1. I should get the same output as a JOIN without using the JOIN keyword, but using the UNION keyword.
2. I should get the same output as a UNION without using the UNION keyword, but using the JOIN keyword.
Can you give me an example of how to do this if possible?
An interview is the framework on which you set out your wares. Remember: don't answer questions ;)
Think of a press conference: the spokesperson is not looking to answer difficult questions from journalists designed to catch them out. Rather, they are looking for questions to which they already have answers, namely the information they want to release (and no more!)
If I faced this question in an interview, I would use it to demonstrate my knowledge of relational algebra, because that's what I'd have gone into the interview with the intention of doing. I'd be alert for the "talk about relational algebra here" question, and this would be it.
Loosely speaking, JOIN is the counterpart of logical AND, whereas UNION is the counterpart of logical OR. Therefore, similar questions using conventional logic could be, "Can you do AND using OR?" and "Can you do OR using AND?" The answer would depend on what else you could use, e.g. NOT might come in handy ;)
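To make that concrete, here is a rough sketch of "AND from OR and NOT" in set terms (my own illustration, not from the question; assume two union-compatible single-column tables A and B). INTERSECT, the AND/JOIN analogue, can be rebuilt from UNION and EXCEPT alone:

-- (A OR B) minus the symmetric difference is exactly (A AND B)
(SELECT x FROM A UNION SELECT x FROM B)
EXCEPT
(
    (SELECT x FROM A EXCEPT SELECT x FROM B)   -- A AND NOT B
    UNION
    (SELECT x FROM B EXCEPT SELECT x FROM A)   -- B AND NOT A
);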
I'd also be tempted to discuss the differences between the set of primitive operators, the set of operators necessary for computational completeness and the set of operators and shorthands required for practical purposes.
Trying to answer the question directly raises further questions. JOIN implies 'natural join' in relational algebra whereas in SQL it implies INNER JOIN. If the question specifically relates to SQL, do you have to answer for all the JOIN types? What about UNION JOIN?
To employ one example, SQL's outer join is famously a UNION. Chris Date expresses it better than I could ever hope to:
Outer join is expressly designed to produce nulls in its result and should therefore be avoided, in general. Relationally speaking, it's a kind of shotgun marriage: It forces tables into a kind of union—yes, I do mean union, not join—even when the tables in question fail to conform to the usual requirements for union (see Chapter 6). It does this, in effect, by padding one or both of the tables with nulls before doing the union, thereby making them conform to those usual requirements after all. But there's no reason why that padding shouldn't be done with proper values instead of nulls.
SQL and Relational Theory, 1st Edition by C.J. Date
This would be a good discussion point if, "I hate nulls" is something you wanted to get across in the interview!
These are just a few thoughts that spring to mind. The crucial point is, by asking these questions the interviewer is offering you a branch. What will YOU hang on it? ;)
As this is an interview question, they are testing your understanding of both these functions.
The likely answer they are expecting is "generally no, you cannot do this, as they perform different actions", and you would explain this in more detail by stating that a UNION appends rows to the end of the result set, whereas a JOIN adds further columns.
The only way a JOIN and a UNION can produce the same output is where rows contain data from only one of the two sources:
SELECT A.AA, '' AS BB FROM A
UNION ALL
SELECT '' AS AA, B.BB FROM B
Is the same as:
SELECT ISNULL(A.AA, '') AS AA, ISNULL(B.BB, '') AS BB FROM A
FULL OUTER JOIN B ON 1=0
Or to do this with only one column where the types match:
SELECT A.AA AS TT FROM A
UNION ALL
SELECT B.BB AS TT FROM B
Is the same as:
SELECT ISNULL(A.AA, B.BB) AS TT
FROM A
FULL OUTER JOIN B ON 1=0
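A minimal setup to check these equivalences for yourself (the table and column names are just the ones used above):

CREATE TABLE A (AA varchar(10));
CREATE TABLE B (BB varchar(10));
INSERT INTO A VALUES ('a1'), ('a2');
INSERT INTO B VALUES ('b1');
-- Both forms return three rows, because ON 1=0 never matches:
-- the FULL OUTER JOIN simply stacks A's rows (with NULL B columns)
-- on top of B's rows (with NULL A columns), just like the UNION ALL.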
One case where you would do this is if you have data spread over multiple tables but you want to see it all together; however, I would advise using a UNION in this case rather than a FULL OUTER JOIN, because the query then plainly does what a reader would expect.
Do you mean something like this?
create table Test1 (TextField nvarchar(50), NumField int)
create table Test2 (NumField int)
create table Test3 (TextField nvarchar(50), NumField int)
insert into Test1 values ('test1a', 1)
insert into Test1 values ('test1b', 2)
insert into Test2 values (1)
insert into Test3 values ('test3a', 4)
insert into Test3 values ('test3b', 5)
select Test1.*
from Test1 inner join Test2 on Test1.NumField = Test2.NumField
union
select * from Test3
(written on SQL Server 2008)
UNION works when both SELECT statements have the same number of columns, AND the columns have the same (or at least compatible) data types.
UNION doesn't care if both SELECT statements select data only from a single table, or if one or both of them are already JOINs on more than one table.
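For example (hypothetical tables, purely to illustrate the rule), a CAST can reconcile close-but-different types, and either side of the UNION may itself be a join:

SELECT Name, CAST(Amount AS decimal(10,2)) AS Amount
FROM Orders2014
UNION
SELECT c.Name, o.Amount
FROM Customers c
INNER JOIN Orders o ON o.CustomerId = c.Id;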
I think it also depends on other operations available.
If I remember well, UNION can be done using a FULL OUTER join:
Table a (x, y)
Table b (x, y)
CREATE VIEW one
AS
SELECT a.x AS Lx
, b.x AS Rx
, a.y AS Ly
, b.y AS Ry
FROM a FULL OUTER JOIN b
ON a.x = b.x
AND a.y = b.y
CREATE VIEW unionTheHardWay
AS
SELECT COALESCE(Lx, Rx) AS x
, COALESCE(Ly, Ry) AS y
FROM one
By experimentation, and surprisingly, I have found out that LEFT JOINing a point-table is much faster on large tables than simply assigning a single value to a column. By a point-table I mean a table 1x1 (1 row and 1 column).
Approach 1. By a simple assigning value, I mean this (slower):
SELECT A.*, 'Value' AS NewColumn
FROM Table1 A
Approach 2. By left-joining a point-table, I mean this (faster):
WITH B AS (SELECT 'Value' AS NewColumn)
SELECT *
FROM Table1 A
LEFT JOIN B
ON A.ID <> B.NewColumn
Now the core of my question. Can someone advise me how to get rid of the whole ON clause:
ON A.ID <> B.NewColumn?
Checking the join condition seems an unnecessary waste of time, because the key of table A can never equal the single value in table B. Worse, it would throw rows out of the results if A.ID happened to have the same value as 'Value'. Removing that condition, or maybe changing <> to =, seems like further room to improve the join's performance.
Update February 23, 2015
Bounty question addressed to performance experts. Which of the approaches mentioned in my question and answers is the fastest.
Approach 1 Simple assigning value,
Approach 2 Left joining a point-table,
Approach 3 Cross joining a point-table (thanks to answer of Gordon Linoff)
Approach 4 Any other approach which may be suggested during the bounty period.
As I have empirically measured the query execution time (in seconds) of the three approaches, the second approach with the LEFT JOIN is the fastest, then the CROSS JOIN method, and last the simple assignment of a value. Surprising as it is. A performance expert with a Solomon's sword is needed to confirm or deny it.
I'm surprised this is faster for a simple expression, but you seem to want a cross join:
WITH B AS (SELECT 'Value' as NewColumn)
SELECT *
FROM Table1 A CROSS JOIN
B;
I use this construct to put "parameters" in queries (values that can easily be changed). However, I don't see why it would be faster. If the expression is more complicated (such as a subquery or very complicated calculation), then this method only evaluates it once. In the original query, it would normally be evaluated only once, but there might be cases where it is evaluated for each row.
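As an aside, here is roughly what that "parameters in queries" construct looks like in practice (a sketch with made-up column names, just to show the shape):

WITH params AS (
    SELECT CAST('20150101' AS date) AS StartDate,
           CAST('20151231' AS date) AS EndDate
)
SELECT t.*
FROM Table1 t
CROSS JOIN params p
WHERE t.CreatedDate BETWEEN p.StartDate AND p.EndDate;
-- changing the "parameters" means editing one place only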
You can also try with CROSS APPLY:
SELECT A.*, B.*
FROM Table1 A
CROSS APPLY (SELECT 'Value' AS NewColumn) B
Can you try to insert into a temp table instead of outputting to screen:
SELECT A.*, 'Value' as NewColumn
INTO #Table1Assign
FROM Table1 A
and
WITH B AS (SELECT 'Value' AS NewColumn)
SELECT *
INTO #Table1Join
FROM Table1 A
LEFT JOIN B
ON A.ID <> B.NewColumn
That takes the actual transmission and rendering of the data to SSMS out of the equation, which could be caused by network slowdown or processing on the client.
When I run this with a 1M row table, I consistently get better performance with the simple assigning method, even if I switch to CROSS JOIN for the join method.
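If you want to take even SSMS out of the picture entirely, SQL Server's statistics output is a reasonable harness (a minimal sketch; substitute each variant in the marked slot):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- run one approach here, e.g.:
SELECT A.*, 'Value' AS NewColumn
INTO #Table1Assign
FROM Table1 A;
-- then compare CPU time, elapsed time and logical reads in the Messages tab
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;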
I doubt that the second approach will be faster, with three SELECTs and a LEFT JOIN.
First of all, you should test the same query with various sample data, repeatedly.
What is the real scenario like?
An inner join will definitely be faster than a left join.
How about this ?
DECLARE @t TABLE (id int, c2 varchar(10))
INSERT INTO @t
SELECT 1, 'A' UNION ALL
SELECT 2, 'A' UNION ALL
SELECT 3, 'B' UNION ALL
SELECT 4, 'B'

DECLARE @t1 TABLE (NewCol varchar(10))
INSERT INTO @t1 VALUES ('Value')

-- Approach 1
--SELECT * FROM @t OUTER APPLY @t1
-- (create an index on both join columns)

-- Approach 2
SELECT * FROM @t a INNER JOIN @t1 b ON a.c2 <> b.NewCol

-- Approach 3
DECLARE @value varchar(20)
SELECT @value = NewCol FROM @t1
SELECT *, @value AS value FROM @t
Too much text for a comment, so added this as an answer although I'm actually more adding to the question (**)
Somehow I think this is going to be one of those 'it depends' situations. I think it depends a lot on the amount of rows involved and even more on what happens afterwards with the data. Is it simply returned, is it used in a GROUP BY or DISTINCT later on, do we further JOIN or calculate with it etc..
Anyway, I think this IS an interesting question, in that I've had to find out the hard way that having a dozen 'parameters' in a single-row temp-table was faster than having them assigned upfront to 12 variables. Many, many moons ago, the code I was given looked like an absurd construction to me, so I rewrote it to use @variables instead. This was in a +1000-line stored procedure which needed some extra performance squeezed out of it. After quite a bit of refactoring, it turned out to run remarkably slower than before the change?!?!!
I've never really understood why, and at the time I simply reverted to the old version. My best guess is some weird combination of parameter sniffing vs (auto-created?) statistics on the temp-table in question; if anyone could shed some light on your question, it would probably lead to an answer to mine too =)
(**: I realize SO is not a forum so I apologise upfront, simply wanted to chime in that the observed behaviour of the OP isn't entirely anecdotal)
SELECT * can prevent SQL Server from using indexes effectively; you should always specify your columns.
Other than that I would use
DECLARE #Value VARCHAR(30) = 'Value'
SELECT t.Id, t.C2, #Value NewColumn
FROM Table1 t
The overwhelming majority of people support my own view that there is no difference between the following statements:
SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT NULL FROM tableB WHERE tableA.x = tableB.y)
Yet today I came face-to-face with the opposite claim when in our internal developer meeting it was advocated that select 1 is the way to go and select * selects all the (unnecessary) data, hence hurting performance.
I seem to remember that there was some old version of Oracle or something where this was true, but I cannot find references to that. So, I'm curious - how was this practice born? Where did this myth originate from?
Added: Since some people insist on having evidence that this is indeed a false belief, here is a Google query which shows plenty of people saying so. If you're too lazy, check this direct link where one guy even compares execution plans to find that they are equivalent.
The main part of your question is - "where did this myth come from?"
So to answer that, I guess one of the first performance hints people learn with SQL is that SELECT * is inefficient in most situations. The fact that it isn't inefficient in this specific situation is hence somewhat counter-intuitive, so it's not surprising that people are skeptical about it. But some simple research or experiments should be enough to banish most myths. Although human history kind of shows that myths are quite hard to banish.
As a demo, try these
SELECT * FROM tableA WHERE EXISTS (SELECT 1/0 FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT CAST('bollocks' as int) FROM tableB WHERE tableA.x = tableB.y)
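Neither of these raises a division-by-zero or conversion error (on SQL Server, at least), precisely because the select list of an EXISTS subquery is never actually evaluated.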
Now read the ANSI standard. ANSI-92, page 191, case 3a
If the <select list> "*" is simply contained in a <subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.
Finally, the behaviour on most RDBMSs should be to ignore the * in the EXISTS clause. As per this question from yesterday (Sql Server 2005 - Insert if not exists), this doesn't work on SQL Server 2000, but I know it does on SQL Server 2005+.
For SQL Server Conor Cunningham from the Query Optimiser team explains why he typically uses SELECT 1
The QP will take and expand all *'s early in the pipeline and bind them to objects (in this case, the list of columns). It will then remove unneeded columns due to the nature of the query.
So for a simple EXISTS subquery like this:
SELECT col1 FROM MyTable WHERE EXISTS (SELECT * FROM Table2 WHERE MyTable.col1=Table2.col2)
The * will be expanded to some potentially big column list and then it will be determined that the semantics of the EXISTS does not require any of those columns, so basically all of them can be removed.
"SELECT 1" will avoid having to examine any unneeded metadata for that table during query compilation. However, at runtime the two forms of the query will be identical and will have identical runtimes.
Edit: However I have looked at this in some detail since posting this answer and come to the conclusion that SELECT 1 does not avoid this column expansion. Full details here.
This question has an answer that says it was some version of MS Access that actually did not ignore the field list of an EXISTS subquery's SELECT clause. I have done some Access development, and I have heard that SELECT 1 is best practice, so this seems very likely to me to be the source of the "myth."
Performance of SQL EXISTS usage variants
Every now and then I see these being used, but it never seems to be anything that can't be performed as equally well, if not better, by using a normal join or subquery.
I see them as being misleading (they're arguably harder to accurately visualize compared to conventional joins and subqueries), often misunderstood (e.g. using SELECT * will behave the same as SELECT 1 in the EXISTS/NOT EXISTS subquery), and from my limited experience, slower to execute.
Can someone describe and/or provide me an example where they are best suited or where there is no option other than to use them? Note that since their execution and performance are likely platform dependent, I'm particularly interested in their use in MySQL.
Every now and then I see these being used, but it never seems to be anything that can't be performed as equally well, if not better, by using a normal join or subquery.
This article (though SQL Server related):
IN vs. JOIN vs. EXISTS
may be of interest to you.
In a nutshell, JOIN is a set operation, while EXISTS is a predicate.
In other words, these queries:
SELECT *
FROM a
JOIN b
ON some_condition(a, b)
vs.
SELECT *
FROM a
WHERE EXISTS
(
SELECT NULL
FROM b
WHERE some_condition(a, b)
)
are not the same: the former can return a row from a more than once (once per matching row in b), while the latter returns each row from a at most once.
Their counterparts, NOT EXISTS vs. LEFT JOIN / IS NULL, are the same logically but not performance-wise.
In fact, the former may be more efficient in SQL Server:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
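For concreteness, the two logically equivalent "anti-join" forms mentioned above look like this (using the same generic tables a and b and the placeholder some_condition as in the queries earlier; b.y is assumed NOT NULL in b itself):

-- NOT EXISTS form
SELECT a.*
FROM a
WHERE NOT EXISTS (SELECT NULL FROM b WHERE some_condition(a, b));

-- LEFT JOIN / IS NULL form
SELECT a.*
FROM a
LEFT JOIN b ON some_condition(a, b)
WHERE b.y IS NULL;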
They are best suited when the main query returns far fewer rows than the table in which you want to find them. Example:
SELECT st.State
FROM states st
WHERE st.State LIKE 'N%' AND EXISTS(SELECT 1 FROM addresses a WHERE a.State = st.State)
Doing this with a join would be much slower. Or, a better example: if you want to check whether an item exists in one of multiple tables.
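A sketch of that multiple-tables case (hypothetical tables; the point is that each EXISTS can stop at the first match instead of joining everything):

SELECT i.ItemId
FROM items i
WHERE EXISTS (SELECT 1 FROM warehouse_a a WHERE a.ItemId = i.ItemId)
   OR EXISTS (SELECT 1 FROM warehouse_b b WHERE b.ItemId = i.ItemId)
   OR EXISTS (SELECT 1 FROM warehouse_c c WHERE c.ItemId = i.ItemId);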
You can't [easily] use a join in an UPDATE statement, so WHERE EXISTS works excellently there:
UPDATE mytable t
SET columnX = 'SomeValue'
WHERE EXISTS
(SELECT 1
FROM myothertable ot
WHERE ot.columnA = t.columnY
AND ot.columnB = 'XYX'
);
Edit: I'm basing this on Oracle more than MySQL, and yes, there are ways to do it with an inline view, but IMHO this is cleaner.
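For comparison, the join-based alternative alluded to here might look like this in MySQL, where multi-table UPDATE is supported directly (a sketch, not from the original answer):

UPDATE mytable t
JOIN myothertable ot
  ON ot.columnA = t.columnY
 AND ot.columnB = 'XYX'
SET t.columnX = 'SomeValue';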
I have a result set A, which is 10 rows, the integers 1-10 {1,2,3,4,5,6,7,8,9,10}, and B, which is 10 rows consisting of the evens 1-20 {2,4,6,8,10,12,14,16,18,20}. I want to find the elements that are in one set but not both. There are no other columns in the rows.
I know a UNION will be A + B. I can find the ones in both A and B with A INTERSECT B. I can find all of the rows in A that are not in B with A EXCEPT B.
This brings me to the question: how do I find all rows that are in A or B, but not both? Is there an equivalent of (A EXCEPT B) UNION (B EXCEPT A) in the SQL spec? I'm wanting a set of {1,3,5,7,9,12,14,16,18,20}. I believe this can also be written A UNION B EXCEPT (A INTERSECT B).
Is there a mathy reason in set theory why this can't be done in one operation (that can be explained to someone who doesn't understand set theory)? Or, is it just not implemented because it is so simple to build yourself? Or, do I just not know of a better way to do it?
I'm thinking this must be in the SQL spec somewhere: I know the thing is humongous.
There's another way to do what you want, using a FULL OUTER JOIN with a WHERE clause to remove the rows that appear in both tables. This is probably more efficient than the constructs you suggested, but you should of course measure the performance of both to be sure. Here's a query you might be able to use:
SELECT COALESCE(A.id, B.id) AS id
FROM A
FULL OUTER JOIN B
ON A.id = B.id
WHERE A.id IS NULL OR B.id IS NULL
The "exclusive-or" type operation is also called symmetric set difference in set theory. Using this phrase in search, I found a page describing a number of techniques to implement the Symmetric Difference in SQL. It describes a couple of queries and how to optimise them. Although the details appear to be specific to Oracle, the general techniques are probably applicable to any DBMS.
I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e. the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
Neither of these methods reevaluates the whole subquery each time.
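To see which of these strategies the optimizer actually picked for your data, you can inspect the estimated plan (SQL Server; a minimal sketch):

SET SHOWPLAN_TEXT ON;
GO
SELECT a.* FROM a WHERE a.c IN (SELECT d FROM b);
GO
SET SHOWPLAN_TEXT OFF;
GO
-- look for Nested Loops / Merge Join / Hash Match operators with a "Left Semi Join" qualifier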
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all of the big four RDBMSs.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b ON a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This removes the duplicates that could be generated by the JOIN, but runs just as fast, if not faster.
Speaking from experience on a table with 49,000,000 rows, I would recommend LEFT OUTER JOIN.
Using IN or EXISTS took 5 minutes to complete, whereas the LEFT OUTER JOIN finished in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINs. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
,t1id int references t1(t1id)
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop; this difference is expected, because as I mentioned above, the join has different semantics.
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however, the SQL Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever need to be migrated in the future, the database engine may not be so forgiving (for example, using a join instead of an IN subquery makes a huge difference in MySQL).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
I've always been a supporter of the IN methodology. This link contains details of a test conducted in PostgreSQL:
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php