What is the functional difference between UNION and OR? - sql

Is there any difference between using UNION and using OR in the WHERE clause?
Using UNION with two different tables:
SELECT A.col
FROM A, B1
WHERE A.col = B1.col
UNION
SELECT A.col
FROM A, B2
WHERE A.col = B2.col
Using OR in the WHERE clause:
SELECT A.col
FROM A, B1, B2
WHERE A.col=B1.col or A.col=B2.col
Apart from the performance difference, is there any difference in their meaning?

These queries are nothing alike. First, you should learn to use proper explicit JOIN syntax.
If you ran them -- even on test tables -- you would quickly find the differences.
For instance, the UNION query removes duplicates. The OR query does not.
The OR query does a Cartesian product of the tables. So, if any of the tables is empty (such as B1 or B2), then no rows are returned. The UNION query will still return the matching values from the other two tables.
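Both differences are easy to reproduce on test tables. Below is a minimal sketch using Python's built-in sqlite3 module; the sample rows are made up for illustration, and SQLite stands in for whatever engine you actually use.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A  (col INTEGER);
    CREATE TABLE B1 (col INTEGER);
    CREATE TABLE B2 (col INTEGER);
    INSERT INTO A  VALUES (1), (2);
    INSERT INTO B1 VALUES (1), (1);   -- 1 appears twice in B1
    INSERT INTO B2 VALUES (2);
""")

UNION_QUERY = """
    SELECT A.col FROM A, B1 WHERE A.col = B1.col
    UNION
    SELECT A.col FROM A, B2 WHERE A.col = B2.col
"""
OR_QUERY = """
    SELECT A.col FROM A, B1, B2
    WHERE A.col = B1.col OR A.col = B2.col
"""

def run(sql):
    """Return the single result column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

union_rows = run(UNION_QUERY)   # duplicates removed
or_rows = run(OR_QUERY)         # one row per matching combination

# Empty one of the tables and run both queries again.
conn.execute("DELETE FROM B2")
union_rows_empty_b2 = run(UNION_QUERY)  # still returns the B1 matches
or_rows_empty_b2 = run(OR_QUERY)        # Cartesian product is empty

print(union_rows, or_rows, union_rows_empty_b2, or_rows_empty_b2)
```

With this data the UNION query returns each value once, the OR query returns each value once per matching combination of rows, and emptying B2 makes the OR query return nothing at all.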

Related

Can I select several tables in the same WITH query?

I have a long query with a with structure. At the end of it, I'd like to output two tables. Is this possible?
(The tables and queries are in snowflake SQL by the way.)
The code looks like this:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
..... many more alias tables and subqueries here .....
)
select * from table_g where z = 3 ;
But for the very last row, I'd like to query table_g twice, once with z = 3 and once with another condition, so I get two tables as the result. Is there a way of doing that (ending with two queries rather than just one) or do I have to re-run the whole code for each table I want as output?
One query = one result set. That's just the way RDBMSs work.
A CTE (WITH statement) is just syntactic sugar for a subquery.
For instance, a query similar to yours:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
select id,
product_c
from x.z )
select *
from table_a
inner join table_b on table_a.id = table_b.id
inner join table_c on table_b.id = table_c.id;
Is 100% identical to:
select *
from
(select id, product_a from x.x) table_a
inner join (select id, product_b from x.y) table_b
on table_a.id = table_b.id
inner join (select id, product_c from x.z) table_c
on table_b.id = table_c.id
The CTE version doesn't give you any extra features that aren't available in the non-CTE version (with the exception of a recursive CTE), and the execution path will be 100% the same. (EDIT: Please see Simon's answer and comment below, where he notes that Snowflake may materialize the derived table defined by the CTE so that it only has to perform that step once should the CTE be referenced multiple times in the main query.) As such, there is still no way to get a second result set from the single query.
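To check the equivalence on a small scale, here is a sketch using Python's sqlite3 module. It assumes plain table names x and y in place of the schema-qualified x.x and x.y, uses invented rows, and only compares results. It says nothing about how Snowflake plans either form.

```python
import sqlite3

# Hypothetical stand-ins for x.x and x.y from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (id INTEGER, product_a TEXT);
    CREATE TABLE y (id INTEGER, product_b TEXT);
    INSERT INTO x VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO y VALUES (1, 'b1'), (2, 'b2');
""")

# The same join written with CTEs ...
CTE_FORM = """
    WITH table_a AS (SELECT id, product_a FROM x),
         table_b AS (SELECT id, product_b FROM y)
    SELECT table_a.id, product_a, product_b
    FROM table_a
    INNER JOIN table_b ON table_a.id = table_b.id
    ORDER BY table_a.id
"""
# ... and with inline derived tables.
SUBQUERY_FORM = """
    SELECT table_a.id, product_a, product_b
    FROM (SELECT id, product_a FROM x) table_a
    INNER JOIN (SELECT id, product_b FROM y) table_b
        ON table_a.id = table_b.id
    ORDER BY table_a.id
"""

cte_rows = conn.execute(CTE_FORM).fetchall()
subquery_rows = conn.execute(SUBQUERY_FORM).fetchall()
print(cte_rows == subquery_rows)
```

Each execute call yields exactly one result set, so a second filtered output needs a second statement either way.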
While the two forms are logically equivalent, they don't necessarily get the same execution plan.
The first case is when one of the stages in the CTE is expensive and is reused by other CTEs or joined to many times. On Snowflake, with that stage written as a CTE, I have witnessed it running the "expensive" part only a single time, which can be good. For example:
WITH expensive_select AS (
SELECT a.a, b.b, c.c
FROM table_a AS a
JOIN table_b AS b
JOIN table_c AS c
WHERE complex_filters
), do_some_thing_with_results AS (
SELECT a, stuff -- a is needed for the join below
FROM expensive_select
WHERE filters_1
), do_some_agregation AS (
SELECT a, SUM(b) as sum_b
FROM expensive_select
WHERE filters_2
GROUP BY a
)
SELECT a.a
,a.b
,b.stuff
,c.sum_b
FROM expensive_select AS a
LEFT JOIN do_some_thing_with_results AS b ON a.a = b.a
LEFT JOIN do_some_agregation AS c ON a.a = c.a;
This was originally unrolled, and the expensive part was some views for which the date range filter applied at the top level was not getting pushed down (due to window functions), which resulted in full table scans, multiple times. By pushing the filters into the CTE, the cost was paid once. (In our case, putting date range filters in the CTE made Snowflake notice the filters and push them down into the view. Things can change, though: a few weeks later the original code ran as well as the modified version, so they "fixed" something.)
In other cases like this, the different paths that used the CTE each used smaller subsets of the results, so using the CTE reduced the remote I/O and improved performance, although there were then more stalls in the execution plan.
I also use CTEs like this to make the code easier to read, by giving the CTE a meaningful name and then aliasing it to something short for use. Really love that.

Which way for combining of conditions is preferable for unions?

There are some tables with some fields (on PostgreSQL):
Table1 {a, b, c}
Table2 {c, d, a}
Table3 {f, g, a}
I need to union several queries and get one field. All of these queries share the same condition, CONDITION_A, and each has its own additional conditions: CONDITIONS_1, CONDITIONS_2, CONDITIONS_3.
Which way for combining of conditions is preferable for unions?
1) CONDITION_A is embedded into each query:
SELECT a FROM Table1 WHERE <CONDITIONS_1> AND a.someParam=<CONDITION_A>
UNION
SELECT a FROM Table2 WHERE <CONDITIONS_2> AND a.someParam=<CONDITION_A>
UNION
SELECT a FROM Table3 WHERE <CONDITIONS_3> AND a.someParam=<CONDITION_A>
2) CONDITION_A is applied after the unions:
(SELECT a FROM Table1 WHERE <CONDITIONS_1>
UNION
SELECT a FROM Table2 WHERE <CONDITIONS_2>
UNION
SELECT a FROM Table3 WHERE <CONDITIONS_3> ) WHERE a.someParam=<CONDITION_A>
First, do you really need UNION? If you can use UNION ALL, the query will run better because UNION incurs the overhead of removing duplicates.
Putting the conditions as close to the FROM as possible is probably a good idea. This gives the optimizer more opportunities to optimize the query. For instance, each table might have indexes that include someParam, which can be used in the subqueries.
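The difference between UNION and UNION ALL can be seen on a toy data set. The sketch below uses Python's sqlite3 module; the some_param column and the sample rows are hypothetical, and the shared condition is placed inside each branch, next to its FROM.

```python
import sqlite3

# Hypothetical tables: "a" is the field wanted, some_param carries CONDITION_A.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (a INTEGER, some_param TEXT);
    CREATE TABLE Table2 (a INTEGER, some_param TEXT);
    INSERT INTO Table1 VALUES (1, 'keep'), (2, 'drop'), (3, 'keep');
    INSERT INTO Table2 VALUES (1, 'keep'), (4, 'keep'), (5, 'drop');
""")

# Shared condition repeated in each branch.
WITH_UNION = """
    SELECT a FROM Table1 WHERE some_param = 'keep'
    UNION
    SELECT a FROM Table2 WHERE some_param = 'keep'
"""
# Same query, but without the duplicate-removal step.
WITH_UNION_ALL = WITH_UNION.replace("UNION", "UNION ALL")

union_rows = sorted(r[0] for r in conn.execute(WITH_UNION))
union_all_rows = sorted(r[0] for r in conn.execute(WITH_UNION_ALL))
print(union_rows, union_all_rows)
```

UNION ALL keeps the value that appears in both tables twice, so it is only a drop-in replacement when duplicates cannot occur or do not matter.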

Different Values In Two SQL Tables

I'm trying to select different values from two tables in SQL but my code isn't working. The first part of it works:
SELECT distinct a.c1, b."Commodity.Code"::numeric FROM coletados a, commod b
WHERE a.c1 = b."Commodity.Code"::numeric
But when I try to select different values, it doesn't work. My entire SQL statement is:
SELECT * FROM commod b
WHERE b."Commodity.Code"::numeric =!
(SELECT DISTINCT a.c1, b."Commodity.Code"::numeric
FROM coletados a, commod b
WHERE a.c1 = b."Commodity.Code"::numeric)
In reality, I just want the column of numbers that are different in the two tables, so I don't need the '*', but I don't know if I can select the same variable (a.c1 or b."Commodity.Code") twice. Thanks for all the help.
You are comparing one value to two values. In Postgres, one method is:
select *
from commod b
where (b.c1, "Commodity.Code"::numeric) not in (select a.c1, a."Commodity.Code"::numeric
from coletados a
);
Or, using your approach:
select *
from commod b
where "Commodity.Code"::numeric not in (select a."Commodity.Code"::numeric
from coletados a
where a.c1 = b.c1
);
That is, the subquery does not need a join, just a correlation clause.
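As a rough illustration of the "no join in the subquery" idea, here is a sketch in Python's sqlite3. It assumes, as the question's first query suggests, that coletados only has c1 and commod only has "Commodity.Code" (stored as text here), and it uses CAST(... AS NUMERIC) in place of the Postgres ::numeric cast. The sample values are invented.

```python
import sqlite3

# Hypothetical data: codes 101 and 102 exist in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE coletados (c1 NUMERIC);
    CREATE TABLE commod ("Commodity.Code" TEXT);
    INSERT INTO coletados VALUES (101), (102);
    INSERT INTO commod VALUES ('101'), ('102'), ('103'), ('104');
""")

# Uncorrelated form: one value compared against a one-column subquery.
NOT_IN = """
    SELECT CAST(b."Commodity.Code" AS NUMERIC) AS code
    FROM commod b
    WHERE CAST(b."Commodity.Code" AS NUMERIC) NOT IN (SELECT a.c1 FROM coletados a)
    ORDER BY code
"""
# Correlated form: the subquery refers to the outer row instead of joining.
NOT_EXISTS = """
    SELECT CAST(b."Commodity.Code" AS NUMERIC) AS code
    FROM commod b
    WHERE NOT EXISTS (
        SELECT 1 FROM coletados a
        WHERE a.c1 = CAST(b."Commodity.Code" AS NUMERIC)
    )
    ORDER BY code
"""

not_in_rows = [r[0] for r in conn.execute(NOT_IN)]
not_exists_rows = [r[0] for r in conn.execute(NOT_EXISTS)]
print(not_in_rows, not_exists_rows)
```

One caveat: if the NOT IN subquery can return NULL, NOT IN matches no rows at all, which is a common reason to prefer the NOT EXISTS form.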

PostgreSQL union two tables and join with a third table

I want to union two tables and join them with a third metadata table, and I would like to know which approach is the best/fastest.
The database is PostgreSQL.
Below are my two suggestions, but other approaches are welcome.
To do the join before the union on both tables:
SELECT a.id, a.feature_type, b.datetime, b.file_path
FROM table1 a, metadata b WHERE a.metadata_id = b.id
UNION ALL
SELECT a.id, a.feature_type, b.datetime, b.file_path
FROM table2 a, metadata b WHERE a.metadata_id = b.id
Or to do the union first and then do the join:
SELECT a.id, a.feature_type, b.datetime, b.file_path
FROM
(
SELECT id, feature_type, metadata_id FROM table1
UNION ALL
SELECT id, feature_type, metadata_id FROM table2
)a, metadata b
WHERE a.metadata_id = b.id
Run EXPLAIN ANALYZE on both statements and you will see which one is more efficient.
It can be unpredictable because of the SQL engine's optimizer, so it's better to look at the execution plan. In the end, both approaches may be executed in the same way.
As far as I can remember, running EXPLAIN will reveal that PostgreSQL interprets the second as the first, provided that there is no GROUP BY clause (explicit, or implicit due to UNION instead of UNION ALL) in any of the subqueries.
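Before comparing plans, it can help to confirm that the two statements really return the same rows. The sketch below does that with Python's sqlite3 and invented sample data; it uses explicit JOIN syntax and says nothing about what PostgreSQL's planner will do, which is what EXPLAIN ANALYZE is for.

```python
import sqlite3

# Hypothetical rows; metadata_id 3 deliberately has no metadata row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (id INTEGER PRIMARY KEY, datetime TEXT, file_path TEXT);
    CREATE TABLE table1 (id INTEGER, feature_type TEXT, metadata_id INTEGER);
    CREATE TABLE table2 (id INTEGER, feature_type TEXT, metadata_id INTEGER);
    INSERT INTO metadata VALUES (1, '2020-01-01', '/data/a'), (2, '2020-01-02', '/data/b');
    INSERT INTO table1 VALUES (10, 'point', 1), (11, 'line', 2);
    INSERT INTO table2 VALUES (20, 'polygon', 1), (21, 'point', 3);
""")

# Join each table to metadata, then combine.
JOIN_THEN_UNION = """
    SELECT a.id, a.feature_type, b.datetime, b.file_path
    FROM table1 a JOIN metadata b ON a.metadata_id = b.id
    UNION ALL
    SELECT a.id, a.feature_type, b.datetime, b.file_path
    FROM table2 a JOIN metadata b ON a.metadata_id = b.id
"""
# Combine the tables first, then join once.
UNION_THEN_JOIN = """
    SELECT a.id, a.feature_type, b.datetime, b.file_path
    FROM (
        SELECT id, feature_type, metadata_id FROM table1
        UNION ALL
        SELECT id, feature_type, metadata_id FROM table2
    ) a
    JOIN metadata b ON a.metadata_id = b.id
"""

rows_join_first = sorted(conn.execute(JOIN_THEN_UNION).fetchall())
rows_union_first = sorted(conn.execute(UNION_THEN_JOIN).fetchall())
print(rows_join_first == rows_union_first)
```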

Does the optimizer filter subqueries with outer where clauses

Take the following query:
select * from
(
select a, b
from c
UNION
select a, b
from d
)
where a = 'mung'
order by b
Will the optimizer generally work out that I am filtering a on the value 'mung' and consequently apply that filter to each of the queries in the subquery?
Or
will it run each query within the subquery union and return the results to the outer query for filtering (as the query would perhaps suggest)?
In which case the following query would perform better :
select * from
(
select a, b
from c
where a = 'mung'
UNION
select a, b
from d
where a = 'mung'
)
order by b
Obviously query 1 is best for maintenance, but is it sacrificing much performance for this?
Which is best?
Edit
Sorry I neglected to add the order by clause in the queries to indicate why it all needed to be a subquery in the first place!
Edit
Ok, I thought it was going to be a simple answer, but I forgot I was talking about databases here! Running it through the analyser indicates that there is no optimization going on, but this could of course be because I only have 4 rows in my tables. I was hoping for a simpler yes or no.. Will have to set up some more complex tests.
I am on MySQL at the moment, but I am after a general answer here...
Use:
SELECT x.a,
x.b
FROM TABLE_X x
WHERE x.a = 'mung'
UNION
SELECT y.a,
y.b
FROM TABLE_Y y
WHERE y.a = 'mung'
The subquery is unnecessary in the given context.
The query above will use indexes on column "a" if they are available; with the subquery method, there is no index to utilize.
Use UNION ALL if you don't have to be concerned with duplicates. It will be faster than UNION.
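Whether a given engine pushes the outer filter into the union branches is best checked on that engine. As a sketch of how to look, the Python snippet below runs both forms in SQLite (not MySQL) on invented rows, confirms they return the same result, and collects the EXPLAIN QUERY PLAN text for each so the plans can be compared by eye. The index names and sample data are assumptions for illustration.

```python
import sqlite3

# Hypothetical tables c and d with an index on column a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE c (a TEXT, b INTEGER);
    CREATE TABLE d (a TEXT, b INTEGER);
    CREATE INDEX c_a ON c (a);
    CREATE INDEX d_a ON d (a);
    INSERT INTO c VALUES ('mung', 2), ('bean', 1);
    INSERT INTO d VALUES ('mung', 1), ('mung', 2), ('pea', 3);
""")

# Filter applied outside the union.
OUTER_FILTER = """
    SELECT * FROM (
        SELECT a, b FROM c
        UNION
        SELECT a, b FROM d
    ) AS u
    WHERE a = 'mung'
    ORDER BY b
"""
# Filter applied inside each branch; ORDER BY applies to the whole union.
INNER_FILTER = """
    SELECT a, b FROM c WHERE a = 'mung'
    UNION
    SELECT a, b FROM d WHERE a = 'mung'
    ORDER BY b
"""

outer_rows = conn.execute(OUTER_FILTER).fetchall()
inner_rows = conn.execute(INNER_FILTER).fetchall()

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

outer_plan = plan(OUTER_FILTER)
inner_plan = plan(INNER_FILTER)
for line in outer_plan + ["---"] + inner_plan:
    print(line)
```

The plan text varies between SQLite versions, so read it rather than relying on any particular wording. The MySQL equivalent is to run EXPLAIN on both statements with realistic data volumes.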