can this be written with an outer join - sql

The requirement is to copy rows from Table B into Table A. Only rows with an id that doesn't already exist need to be copied over:
INSERT INTO A(id, x, y)
SELECT id, x, y
FROM B b
WHERE b.id NOT IN (SELECT id FROM A WHERE x='t');
                                    ^^^^^^^^^^^
Now, I was trying to write this with an outer join to compare the explain plans, but I can't write it (efficiently, at least).
Note that the SQL highlighted with ^'s makes this tricky.

Try:
INSERT INTO A(id, x, y)
SELECT b.id, b.x, b.y
FROM B b
LEFT JOIN A a
ON a.id = b.id
AND a.x = 't'
WHERE a.id IS NULL;
But I prefer the subquery representation as I think it more clearly expresses what you are doing.

Why are you not happy with what you have? If you check your explain plan, I promise you it says that an anti-join is performed, if the optimizer thinks that is the most efficient way (which it most likely will).
For everyone who reads this: SQL is not what actually gets executed. SQL is a way of telling the database what you want, not how to do it. All decent databases will be able to treat NOT EXISTS and NOT IN as equal (when they are, i.e. when there are no NULL values) and perform an anti-join. The trick with an outer join and an IS NULL condition doesn't get the same treatment on SQL Server, though (SQL Server is not clever enough to transform it into an anti-join).
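To illustrate the "no null values" caveat above, here is a minimal sketch using Python's sqlite3 module. The sample rows are made up for illustration, and SQLite stands in for whatever database you actually run. With a NULL id in A, NOT IN returns nothing, while NOT EXISTS still finds the missing row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, x TEXT, y TEXT);
    CREATE TABLE B (id INTEGER, x TEXT, y TEXT);
    -- Hypothetical sample data: note the NULL id in A.
    INSERT INTO A VALUES (1, 't', 'a1'), (NULL, 't', 'a-null');
    INSERT INTO B VALUES (1, 't', 'b1'), (2, 't', 'b2');
""")

# 2 NOT IN (1, NULL) evaluates to NULL (unknown), so the row is filtered out.
not_in = conn.execute(
    "SELECT id FROM B WHERE id NOT IN (SELECT id FROM A WHERE x = 't') ORDER BY id"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL does no harm.
not_exists = conn.execute("""
    SELECT b.id FROM B b
    WHERE NOT EXISTS (SELECT 1 FROM A a WHERE a.id = b.id AND a.x = 't')
    ORDER BY b.id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2,)]
```

So the two forms are only interchangeable when the subquery column cannot be NULL.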

Your query will perform better than the query with outer join.
I guess the following query will do the job:
INSERT INTO A(id, x, y)
SELECT b.id, b.x, b.y
FROM B b
LEFT JOIN A a
ON b.id = a.id AND a.x='t'
WHERE a.id IS NULL

INSERT INTO A (id, x, y)
SELECT
B.id, B.x, B.y
FROM
B
WHERE
NOT EXISTS (SELECT * FROM A WHERE B.id = A.id AND A.x = 't')
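If you want to convince yourself that the three forms discussed in this thread (NOT IN, LEFT JOIN with IS NULL, NOT EXISTS) copy the same rows, a quick check along these lines may help. This is only a sketch: it uses SQLite via Python's sqlite3 module, the sample data is invented, and it assumes id is never NULL.

```python
import sqlite3

# Hypothetical sample data; id 2 exists in A but with x = 'f'.
SETUP = """
    CREATE TABLE A (id INTEGER, x TEXT, y TEXT);
    CREATE TABLE B (id INTEGER, x TEXT, y TEXT);
    INSERT INTO A VALUES (1, 't', 'a1'), (2, 'f', 'a2');
    INSERT INTO B VALUES (1, 't', 'b1'), (2, 't', 'b2'), (3, 't', 'b3');
"""

VARIANTS = {
    "not_in": """
        INSERT INTO A (id, x, y)
        SELECT id, x, y FROM B
        WHERE id NOT IN (SELECT id FROM A WHERE x = 't')
    """,
    "left_join": """
        INSERT INTO A (id, x, y)
        SELECT b.id, b.x, b.y
        FROM B b
        LEFT JOIN A a ON a.id = b.id AND a.x = 't'
        WHERE a.id IS NULL
    """,
    "not_exists": """
        INSERT INTO A (id, x, y)
        SELECT b.id, b.x, b.y FROM B b
        WHERE NOT EXISTS (SELECT 1 FROM A a WHERE a.id = b.id AND a.x = 't')
    """,
}


def rows_after_insert(insert_sql):
    """Run one INSERT variant against a fresh database and return A's rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    conn.execute(insert_sql)
    rows = conn.execute("SELECT id, x, y FROM A ORDER BY id, y").fetchall()
    conn.close()
    return rows


results = {name: rows_after_insert(sql) for name, sql in VARIANTS.items()}
for name, rows in results.items():
    print(name, rows)
```

All three variants leave A with the same four rows here. To compare plans rather than results, you would run your database's EXPLAIN on each SELECT.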

Related

In SQL is there a way to use select * on a join?

Using Snowflake, I have 2 tables, one with many columns and the other with a few. Trying to select * on their join, I get the following error:
SQL compilation error:duplicate column name
That makes sense because my joining columns are in both tables. I could probably use select with column names instead of *, but is there a way I could avoid that? Or at least have the query infer the column names dynamically from whichever table it gets?
I am quite sure Snowflake will let you select everything from both sides of a join of two or more tables via
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
What you will not be able to do is refer to a column that exists in both tables by its bare name in ORDER BY, thus this will not work:
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY x
Even though some databases know that, because of JOIN ON a.x = b.x, there is only one x, Snowflake will not allow it (well, it didn't last time I tried this).
But you can use the aliased name or the output column position, thus both of the following will work:
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY a.x
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY 1 -- assuming x is the first column
In general the * and a.* forms are super convenient, but are actually bad for performance.
When selecting this way you are at risk of getting the columns back in a different order if the table has been recreated, which makes the code that reads the results unstable. This also impacts VIEWs.
It also means all metadata for the table needs to be loaded to know what the complete shape of the data will be. Whereas if you ask for x, y, z only and a w is later added to the table, the whole query plan can be compiled faster.
Lastly, if you use SELECT * FROM table in a sub-select and only a subset of those columns is needed, the compiler has to prune the rest, work it can skip when you name only the columns you need. And if every column is attached to a correctly aliased table, then when a second table later gains a column with the same name, bare column names do not suddenly become ambiguous. That ambiguity only surfaces when the SQL is run, which might be an "annual report" that doesn't happen that often. Wow, what a long use-aliases rant.
You can prefix the name of the column with the name of the table:
select table_a.id, table_b.name from table_a join table_b using (id)
The same works in combination with *:
select table_a.id, table_b.* from table_a join table_b using (id)
It works in the "join" and "where" parts of the statement as well:
select table_a.id, table_b.* from table_a join table_b
on table_a.id = table_b.id where table_b.name LIKE 'b%'
You can use table aliases to make the statement shorter:
select a.id, b.* from table_a a join table_b b
on a.id = b.id
Aliases can also be applied to columns for use in subqueries, client software and (depending on the database server) other parts of the statement, for example 'order by':
select a.id as a_id, b.* from table_a a join table_b b
on a.id = b.id order by a_id
If you're after a result that includes all the distinct non-join columns from each table in the join with the join columns included in the output only once (given they will be identical for an inner-join) you can use NATURAL JOIN.
e.g.
select * from d1 natural inner join d2 order by id;
See examples: https://docs.snowflake.com/en/sql-reference/constructs/join.html#examples
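The difference between a.*, b.* and NATURAL JOIN is easiest to see by looking at the column names that come back. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in, so Snowflake's exact behaviour may differ. The d1 and d2 tables and their columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables that share only the join column "id".
    CREATE TABLE d1 (id INTEGER, name TEXT);
    CREATE TABLE d2 (id INTEGER, value TEXT);
    INSERT INTO d1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO d2 VALUES (1, 'x'), (2, 'y');
""")


def column_names(sql):
    """Return the result-set column names reported for a query."""
    cur = conn.execute(sql)
    return [col[0] for col in cur.description]


# Qualified stars keep every column from both tables, so "id" appears twice.
qualified_star = column_names("SELECT d1.*, d2.* FROM d1 JOIN d2 ON d1.id = d2.id")

# NATURAL JOIN matches on the shared column and emits it only once.
natural = column_names("SELECT * FROM d1 NATURAL INNER JOIN d2")

print(qualified_star)  # ['id', 'name', 'id', 'value']
print(natural)         # ['id', 'name', 'value']
```

Keep in mind that NATURAL JOIN joins on every column name the two tables have in common, so it is only safe when the shared names are exactly your join keys.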

How good is it to write a query like this?

select
a,
b,
(select x from table3 where id = z.id) as c,
d
from
table1 z, table2 zz
where z.id = zz.id;
I know that the query can be simplified easily like below:
select a,
b,
c.x,
d
from
table1 z, table2 zz, table3 c
where z.id = zz.id and z.id = c.id;
But I want to know what the performance impact is: does extra execution happen in case 1, or do they both have the same performance? Asking just for knowledge.
If you want to use a correlated subquery (which is fine), then you should do:
select a, b,
(select t3.x from table3 t3 where t3.id = z.id) as c,
d
from table1 z join
table2 zz
on z.id = zz.id;
Important changes:
Qualify all column names (I don't know where a, b and d come from).
Use explicit join.
You can also write this query as:
select a, b, t3.x, d
from table1 z join
table2 zz
on z.id = zz.id left join
table3 t3
on t3.id = z.id;
This query is subtly different from the previous one. The previous one will return an error if the subquery returns more than one row. This one will return each such value in a separate row.
That said, the Oracle optimizer is quite good. I would be surprised if there were any noticeable performance difference.
The first query, with a correlated sub-query, will always return data even if table3 is empty. You need an outer join to get the same result:
select a,
b,
c.x,
d
from table1 z
join table2 zz on z.id = zz.id
left join table3 c on z.id = c.id
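A small experiment makes the difference concrete. This is a sketch only: it runs on SQLite via Python's sqlite3 module rather than Oracle, the rows are invented, and column d is left out since the question doesn't say which table it comes from. table3 deliberately has no row for id = 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, a TEXT);
    CREATE TABLE table2 (id INTEGER, b TEXT);
    CREATE TABLE table3 (id INTEGER, x TEXT);
    INSERT INTO table1 VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table2 VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO table3 VALUES (1, 'x1');  -- no row for id = 2
""")

# Correlated scalar subquery: a missing table3 row yields NULL, the row stays.
scalar = conn.execute("""
    SELECT z.a, zz.b,
           (SELECT t3.x FROM table3 t3 WHERE t3.id = z.id) AS c
    FROM table1 z JOIN table2 zz ON z.id = zz.id
    ORDER BY z.id
""").fetchall()

# LEFT JOIN: same result as the scalar subquery for this data.
left = conn.execute("""
    SELECT z.a, zz.b, c.x
    FROM table1 z
    JOIN table2 zz ON z.id = zz.id
    LEFT JOIN table3 c ON z.id = c.id
    ORDER BY z.id
""").fetchall()

# INNER JOIN: the row without a table3 match disappears.
inner = conn.execute("""
    SELECT z.a, zz.b, c.x
    FROM table1 z
    JOIN table2 zz ON z.id = zz.id
    JOIN table3 c ON z.id = c.id
    ORDER BY z.id
""").fetchall()

print(scalar)  # [('a1', 'b1', 'x1'), ('a2', 'b2', None)]
print(left)    # [('a1', 'b1', 'x1'), ('a2', 'b2', None)]
print(inner)   # [('a1', 'b1', 'x1')]
```

The case where table3 has several rows per id is not shown, because SQLite does not raise an error for a multi-row scalar subquery the way Oracle does.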
Using a join makes the query more readable, but the performance is the same:
select a,
b,
c.x,
d
from table1 z
join table2 zz on z.id = zz.id
join table3 c on z.id = c.id;
If your subquery is returning a single value based on a single input, it is a scalar subquery. A scalar subquery MIGHT improve performance for your query. It will do so under a couple of basic conditions. First, if z.id has a relatively low number of possible values. Scalar subquery processing will cache up to 254 values, if I recall. Second, if the rest of the query is returning a relatively high number of rows. In this case, if you only return a few rows, then the caching will not have an opportunity to help. But if you are returning a lot of rows, the caching benefits will build up.
Others have already highlighted how your original queries are not quite equivalent.
See more on scalar subqueries here -> Scalar Subqueries

How do I select a row from one table where the value row does not exist in another table?

Let's say I have two identical tables, A and B, with the column "x".
I want to select all elements in A, where the value of x in A is not in any value of x of B.
How do I do that?
You could also do something like this:
SELECT * FROM TableA
LEFT JOIN TableB on TableA.X = TableB.X
WHERE TableB.X IS NULL
(For the very straightforward example in your question, a NOT EXISTS / NOT IN approach is probably preferable, but if your real query is more complex, this is an option you might want to consider; if, for instance, you want some information from TableB where there is a match, but also want to know where there isn't one.)
I'm having some trouble understanding what you need.
Anyway try this:
SELECT * FROM tableA
WHERE x not IN (SELECT x FROM tableB)
select *
from TableA
except
select *
from TableB
The fastest is usually the LEFT JOIN, though that depends on the database and its optimizer:
SELECT * FROM A LEFT JOIN B ON A.X = B.X WHERE B.X IS NULL
Use this:
select * from a where x not in (select x from b)
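The answers above are not all interchangeable. Here is a rough comparison, sketched with SQLite through Python's sqlite3 module. The second column (note) and the sample rows are assumptions added only to show the difference: the LEFT JOIN and NOT IN forms compare x alone, while EXCEPT compares whole rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (x INTEGER, note TEXT);
    CREATE TABLE TableB (x INTEGER, note TEXT);
    -- Hypothetical data: x = 2 exists in both tables but with different notes.
    INSERT INTO TableA VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO TableB VALUES (1, 'a'), (2, 'zzz');
""")

left_join = conn.execute("""
    SELECT TableA.* FROM TableA
    LEFT JOIN TableB ON TableA.x = TableB.x
    WHERE TableB.x IS NULL
    ORDER BY TableA.x
""").fetchall()

not_in = conn.execute("""
    SELECT * FROM TableA
    WHERE x NOT IN (SELECT x FROM TableB)
    ORDER BY x
""").fetchall()

# EXCEPT removes only rows that match in every column.
except_rows = conn.execute("""
    SELECT * FROM TableA
    EXCEPT
    SELECT * FROM TableB
    ORDER BY x
""").fetchall()

print(left_join)    # [(3, 'c')]
print(not_in)       # [(3, 'c')]
print(except_rows)  # [(2, 'b'), (3, 'c')]
```

If the tables really have x as their only column, all three return the same rows (apart from EXCEPT also removing duplicates).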

Join a table only if result set > 0

I have a table A joined with a table B which give me a result set.
I want to join a table C to the previous ones in order to restrict the result set. But in case this join yields no results, I would like to get the same result set as before (ignoring C).
Can you think of a way to do that in SQL?
SELECT *
FROM TableA
INNER JOIN TableB
ON TableA.ID = TableB.TableAID
LEFT JOIN TableC
ON TableC.ID = TableB.TableCID
This will return all rows from Tables A & B but only the rows from TableC where the ON criteria match.
Otherwise conditional joins don't really apply in standard SQL. If you are using SQL Server you can perform some stored procedure logic to check the results from TableC and, if there are none, only get data from Tables A & B. But this approach will be provider-specific.
Not possible with regular SQL since it involves logic.
Your best bet is to make a small script, e.g. (in pseudo code)
select * into #tmp from x inner join y inner join z where blabla;
if (exists (select * from #tmp))
BEGIN
select * from #tmp
END
else
BEGIN
select * from x inner join y where blabla;
END
Edit:
But if I were you, I would just always join with C using a LEFT JOIN, so you can see if the result was in one or the other result set...
e.g.
select x.*, y.*, case when z.id is null then 0 else 1 end from x inner join y left join z on blabla where blabla;
But that of course assumes you are able to alter the code path that reads the result.
I see a problem with the LEFT/OUTER JOIN methods. If you use them, you could get some results that are in A and B but not in C. If I understand correctly, the purpose is to join AB with C, meaning the result of joining with C must satisfy all three restrictions. So @Cine's solution is the appropriate one in this case.
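The "try the restricted query, fall back if it is empty" idea from the pseudo code can also live in application code. The following is a minimal sketch in Python with sqlite3. The table layout follows the TableA/TableB/TableC answer above, but the sample rows and the helper names (make_db, restricted_or_fallback) are hypothetical.

```python
import sqlite3

WITH_C = """
    SELECT a.ID, b.ID
    FROM TableA a
    INNER JOIN TableB b ON a.ID = b.TableAID
    INNER JOIN TableC c ON c.ID = b.TableCID
    ORDER BY a.ID
"""

WITHOUT_C = """
    SELECT a.ID, b.ID
    FROM TableA a
    INNER JOIN TableB b ON a.ID = b.TableAID
    ORDER BY a.ID
"""


def make_db(c_ids):
    """Build a throwaway database; c_ids are the IDs present in TableC."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE TableA (ID INTEGER);
        CREATE TABLE TableB (ID INTEGER, TableAID INTEGER, TableCID INTEGER);
        CREATE TABLE TableC (ID INTEGER);
        INSERT INTO TableA VALUES (1), (2);
        INSERT INTO TableB VALUES (10, 1, 100), (20, 2, 200);
    """)
    conn.executemany("INSERT INTO TableC VALUES (?)", [(i,) for i in c_ids])
    return conn


def restricted_or_fallback(conn):
    """Return the A-B-C join if it has rows, otherwise the plain A-B join."""
    rows = conn.execute(WITH_C).fetchall()
    if rows:
        return rows
    return conn.execute(WITHOUT_C).fetchall()


with_match = restricted_or_fallback(make_db([100]))
no_match = restricted_or_fallback(make_db([]))
print(with_match)  # [(1, 10)]
print(no_match)    # [(1, 10), (2, 20)]
```

This costs a second round trip when C matches nothing, which is the same trade-off the temp-table script makes on the server side.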

SQL Select queries

Which is better and what is the difference?
SELECT * FROM TABLE_A A WHERE A.ID IN (SELECT B.ID FROM TABLE_B B)
or
SELECT * FROM TABLE_A A, TABLE_B B WHERE A.ID = B.ID
The "best" way is to use the standard ANSI JOIN syntax:
SELECT (columns)
FROM TABLE_A a
INNER JOIN TABLE_B b
ON b.ID = a.ID
The first WHERE IN version will often result in the same execution plan, but on certain platforms it can be slower - it's not always consistent. The IN query (which is equivalent to EXISTS) is also going to become progressively more cumbersome to write and maintain as you start to add more tables or create more complex join conditions - it's not as flexible as an actual JOIN.
The second, comma-separated syntax is not as consistently supported as JOIN. It does work on most SQL DBMSes, but it's not the "preferred" version because if you leave out the WHERE clause then you end up with a cross-product. Whereas if you forget to write in the JOIN condition, you'll just end up with a syntax error. JOIN tends to be preferred because of this safety net.
I upvoted @Aaronaught's answer, but I have some comments:
Both the comma-style join syntax and the JOIN syntax are ANSI. The first is SQL-89, and the second is SQL-92. The SQL-89 syntax is still part of the standard, to support backward compatibility.
Can you give an example of an RDBMS that supports the SQL-92 syntax but not the SQL-89? I don't think there are any, so "not as consistently supported" may not be accurate.
You can also omit the join condition using JOIN syntax, and create a Cartesian product. Example: SELECT ... FROM A JOIN B is valid (correction: this is true only in some brands that implement the standard syntax loosely, such as MySQL).
But in any case I agree this is easier to spot when you use SQL-92 syntax. If you use SQL-89 syntax you may end up with a long WHERE clause and it's too easy to miss one of your join conditions.
The difference is that the first does a subquery which can be slower in some databases. And the second does a join, combining both tables in the same query.
Generally, the second would be faster if the database won't optimize it since with a subquery the database would have to keep the results of the subquery in memory.
These two queries return different results. You select only columns from TABLE_A in the first.
There are at least three differences between query X:
SELECT * FROM TABLE_A A WHERE A.ID IN (SELECT B.ID FROM TABLE_B B)
and Y:
SELECT * FROM TABLE_A A, TABLE_B B WHERE A.ID = B.ID
1) As Michas said, the set of columns will be different: query Y will return the columns from tables A & B, but query X only returns the columns from table A. If you explicitly name which columns you want back, query X can only include columns from table A, but query Y can also include columns from table B.
2) The number of rows may be different. If table B has more than one row with an ID matching an ID from table A, then more rows will be returned by query Y than by X.
create table TABLE_A (ID int, st VARCHAR(10))
create table TABLE_B (ID int, st VARCHAR(10))
insert into TABLE_A values (1, 'A-a')
insert into TABLE_B values (1, 'B-a')
insert into TABLE_B values (1, 'B-b')
SELECT * FROM TABLE_A A WHERE A.ID IN (SELECT B.ID FROM TABLE_B B)
ID st
----------- ----------
1 A-a
(1 row(s) affected)
SELECT * FROM TABLE_A A, TABLE_B B WHERE A.ID = B.ID
ID st ID st
----------- ---------- ----------- ----------
1 A-a 1 B-a
1 A-a 1 B-b
(2 row(s) affected)
3) The execution plans will probably be different, since the queries are asking the database for different results. Inner joins used to run faster than IN or EXISTS and may still run faster in some cases. But since the results can be different, you need to make sure that the data supports the transformation from an IN or EXISTS to a join.
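One way to check whether the data "supports the transformation" is to look for duplicate IDs in TABLE_B before swapping IN for a join. The sketch below reuses the sample rows from point 2, run on SQLite via Python's sqlite3 module. The duplicate check itself is just one possible approach, not something the answers above prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_A (ID INTEGER, st VARCHAR(10));
    CREATE TABLE TABLE_B (ID INTEGER, st VARCHAR(10));
    INSERT INTO TABLE_A VALUES (1, 'A-a');
    INSERT INTO TABLE_B VALUES (1, 'B-a');
    INSERT INTO TABLE_B VALUES (1, 'B-b');
""")

in_rows = conn.execute("""
    SELECT A.ID, A.st FROM TABLE_A A
    WHERE A.ID IN (SELECT B.ID FROM TABLE_B B)
""").fetchall()

join_rows = conn.execute("""
    SELECT A.ID, A.st FROM TABLE_A A
    INNER JOIN TABLE_B B ON A.ID = B.ID
""").fetchall()

# IDs that occur more than once in TABLE_B multiply rows in the join.
duplicate_ids = conn.execute("""
    SELECT ID FROM TABLE_B GROUP BY ID HAVING COUNT(*) > 1
""").fetchall()

print(in_rows)        # [(1, 'A-a')]
print(join_rows)      # [(1, 'A-a'), (1, 'A-a')]
print(duplicate_ids)  # [(1,)]
```

If duplicate_ids comes back empty (for example because ID is a primary key in TABLE_B), the join returns the same number of TABLE_A rows as the IN query.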