Which way of combining conditions is preferable for unions?

There are some tables with some fields (on PostgreSQL):
Table1 {a, b, c}
Table2 {c, d, a}
Table3 {f, g, a}
I need to union several queries and get one field. All of these queries share the same condition, CONDITION_A, and each has its own additional conditions: CONDITIONS_1, CONDITIONS_2, CONDITIONS_3.
Which way of combining the conditions is preferable for unions?
1) CONDITION_A is embedded into each query:
SELECT a FROM Table1 WHERE <CONDITIONS_1> AND a.someParam=<CONDITION_A>
UNION
SELECT a FROM Table2 WHERE <CONDITIONS_2> AND a.someParam=<CONDITION_A>
UNION
SELECT a FROM Table3 WHERE <CONDITIONS_3> AND a.someParam=<CONDITION_A>
2) CONDITION_A is embedded after unions.
(SELECT a FROM Table1 WHERE <CONDITIONS_1>
UNION
SELECT a FROM Table2 WHERE <CONDITIONS_2>
UNION
SELECT a FROM Table3 WHERE <CONDITIONS_3> ) WHERE a.someParam=<CONDITION_A>

First, do you really need UNION? If you can use UNION ALL, the query will run faster, because UNION incurs the overhead of removing duplicates.
Putting the conditions as close to the FROM clause as possible is probably a good idea. This gives the optimizer more opportunities to optimize the query. For instance, each table might have an index that includes someParam, and that index can be used when the condition sits inside the subquery.
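To make the two options concrete, here is a minimal sketch that runs both forms and a UNION ALL variant. It uses Python's sqlite3 module as a stand-in for PostgreSQL; the table contents, the some_param column and the b >= 10 / d >= 10 conditions are made up for illustration. It only shows that the results match, not which form is faster on your data.

```python
import sqlite3

# Hypothetical schema and data, for illustration only
# (SQLite stands in for PostgreSQL here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, b INTEGER, some_param TEXT);
    CREATE TABLE table2 (a INTEGER, d INTEGER, some_param TEXT);
    INSERT INTO table1 VALUES (1, 10, 'x'), (2, 20, 'y'), (3, 30, 'x');
    INSERT INTO table2 VALUES (1, 10, 'x'), (4, 40, 'x'), (5, 50, 'y');
""")

# Option 1: the shared condition is repeated inside every branch.
inside = conn.execute("""
    SELECT a FROM table1 WHERE b >= 10 AND some_param = 'x'
    UNION
    SELECT a FROM table2 WHERE d >= 10 AND some_param = 'x'
    ORDER BY a
""").fetchall()

# Option 2: the shared condition is applied once, after the union.
# The filtered column has to be carried through the union for this to work.
outside = conn.execute("""
    SELECT a FROM (
        SELECT a, some_param FROM table1 WHERE b >= 10
        UNION
        SELECT a, some_param FROM table2 WHERE d >= 10
    ) AS u
    WHERE u.some_param = 'x'
    ORDER BY a
""").fetchall()

# UNION ALL skips duplicate removal, so a = 1 appears once per source table.
keep_dups = conn.execute("""
    SELECT a FROM table1 WHERE some_param = 'x'
    UNION ALL
    SELECT a FROM table2 WHERE some_param = 'x'
    ORDER BY a
""").fetchall()

print(inside, outside, keep_dups)
```

Note that option 2 only works if the column being filtered is part of the union's select list, which is one more reason to keep the condition inside each branch.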

Related

Performance of JOIN then UNION vs. UNION then JOIN

I have a SQL query along the following lines:
WITH a AS (
SELECT *
FROM table1
INNER JOIN table3 ON table1.id = table3.id
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
INNER JOIN table3 ON table2.id = table3.id
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM a
UNION
SELECT *
FROM b
)
SELECT *
FROM combined
I rewrote this as:
WITH a AS (
SELECT *
FROM table1
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM (
SELECT *
FROM a
UNION
SELECT *
FROM b
) AS unioned
INNER JOIN table3 ON unioned.id = table3.id
)
SELECT *
FROM combined
I expected that this might be more performant, since it's only doing the JOIN once, or at the very least that it would have no effect on execution time. I was surprised to find that the query now takes almost twice as long to run.
This is no problem, since it worked perfectly well before. I only really rewrote it out of personal style preference, so I'll stick with the original. But I'm no expert when it comes to databases/SQL, so I was interested to know if anyone can share any insights as to why this second approach is so much less performant.
If it makes a difference, it's a Redshift database, table1 and table2 are both around ~250 million rows, table3 is ~1 million rows, and combined has less than 1000 rows.
The SQL optimizer has more information on "bare" tables than on "computed" tables. So, it is easier to optimize the two CTEs in the first version, where each JOIN works directly on a base table.
In a database that uses indexes, this might affect index usage. In Redshift, this might incur additional data movement.
In this particular case, though, I suspect the issue has to do with the filtering done by the JOIN operation. The UNION incurs overhead to remove duplicates. If the JOIN filters rows before the UNION, duplicate removal works on fewer rows than when the filtering happens afterwards.
In addition, the UNION may affect where the data is located, so the second version might require additional data movement.

What is functional difference between UNION and OR?

I wonder whether UNION and OR in a WHERE clause differ in any way.
Using a UNION statement with two different tables:
SELECT A.col
FROM A, B1
WHERE A.col = B1.col
UNION
SELECT A.col
FROM A, B2
WHERE A.col = B2.col
Using OR in the WHERE clause:
SELECT A.col
FROM A, B1, B2
WHERE A.col=B1.col or A.col=B2.col
Apart from the performance difference, is there any difference in their meaning?
These queries are nothing alike. First, you should learn to use proper explicit JOIN syntax.
If you ran them, even on test tables, you would quickly find the differences.
For instance, the UNION query removes duplicates. The OR query does not.
The OR query does a Cartesian product of the tables. So, if any of the tables is empty (such as B1 or B2), then no rows are returned. The UNION query will still return the matches between A and the other, non-empty table.
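Both differences are easy to reproduce on test tables. The sketch below uses Python's sqlite3 module with made-up rows (the data is an assumption, the table and column names follow the question), and writes the joins explicitly:

```python
import sqlite3

# Tiny made-up test tables; B2 is deliberately left empty at first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A  (col INTEGER);
    CREATE TABLE B1 (col INTEGER);
    CREATE TABLE B2 (col INTEGER);
    INSERT INTO A  VALUES (1), (2);
    INSERT INTO B1 VALUES (1), (1);   -- two matches for A.col = 1
""")

UNION_QUERY = """
    SELECT A.col FROM A JOIN B1 ON A.col = B1.col
    UNION
    SELECT A.col FROM A JOIN B2 ON A.col = B2.col
    ORDER BY 1
"""

OR_QUERY = """
    SELECT A.col
    FROM A CROSS JOIN B1 CROSS JOIN B2
    WHERE A.col = B1.col OR A.col = B2.col
    ORDER BY 1
"""

# With B2 empty, the Cartesian product is empty, so OR returns nothing,
# while UNION still returns the (deduplicated) match from B1.
union_rows = conn.execute(UNION_QUERY).fetchall()
or_rows_empty_b2 = conn.execute(OR_QUERY).fetchall()

# Once B2 has a row, the OR query returns rows, duplicates included.
conn.execute("INSERT INTO B2 VALUES (2)")
or_rows_filled_b2 = conn.execute(OR_QUERY).fetchall()

print(union_rows, or_rows_empty_b2, or_rows_filled_b2)
```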

Multiple reusable SQL queries

Sorry this is a newbie question. I have one very long SQL query that is getting harder to manage. In fact there are some sub-queries that are being used multiple times. What is the best way to break up the query? I would prefer to keep it in the database, rather than take it out into the calling program. It goes something like this.
Select A, B, C
from (select D from Table_1 where ...)
Union Select E, F
from Table_2
Inner Join (Select D, E from Table_1 where...)..
So what I would like to do is
Result1 = Select D, E from Table_1 where....
Result2 = Select A, B, C from Result1 Union Select E, F from Table_2 Inner Join Result1 ...
What is the best way to do this? I can't use Views because I don't have privileges. How can I use the results from the first query in the second query? Can cursors be used in this case?
Using a CTE, you can reference the same subquery multiple times (this is the main difference from derived tables):
with CTE as
(Select D, E from Table_1 where...)
Select A, B, C
from CTE
Union
Select E, F
from Table_2
Inner Join CTE ..
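As a runnable illustration of the same pattern, here is a small sketch using Python's sqlite3 module. The tables, the active column and the join condition are hypothetical, since the original query elides them; the point is only that the CTE is defined once and referenced twice.

```python
import sqlite3

# Hypothetical tables and data; the real WHERE and ON clauses are elided
# in the question, so the ones below are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (d INTEGER, e TEXT, active INTEGER);
    CREATE TABLE table_2 (d INTEGER, f TEXT);
    INSERT INTO table_1 VALUES (1, 'one', 1), (2, 'two', 0), (3, 'three', 1);
    INSERT INTO table_2 VALUES (1, 'uno'), (2, 'dos');
""")

rows = conn.execute("""
    WITH cte AS (
        SELECT d, e FROM table_1 WHERE active = 1
    )
    SELECT d, e FROM cte                 -- first use of the CTE
    UNION
    SELECT t2.d, t2.f
    FROM table_2 AS t2
    INNER JOIN cte ON cte.d = t2.d       -- second use of the same CTE
    ORDER BY 1, 2
""").fetchall()

print(rows)
```

Both branches of a UNION must return the same number of columns, which is why each SELECT here returns two.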

PostgreSQL union two tables and join with a third table

I want to union two tables and join them with a third metadata table, and I would like to know which approach is the best/fastest.
The database is PostgreSQL.
Below are my two suggestions, but other approaches are welcome.
To do the join before the union on both tables:
SELECT a.id, a.feature_type, b.datetime, b.file_path
FROM table1 a, metadata b WHERE a.metadata_id = b.id
UNION ALL
SELECT a.id, a.feature_type, b.datetime, b.file_path
FROM table2 a, metadata b WHERE a.metadata_id = b.id
Or to do the union first and then do the join:
SELECT a.id, a.feature_type, b.datetime, b.file_path
FROM
(
SELECT id, feature_type, metadata_id FROM table1
UNION ALL
SELECT id, feature_type, metadata_id FROM table2
) a, metadata b
WHERE a.metadata_id = b.id
Run EXPLAIN ANALYZE on both statements; then you will see which one is more efficient.
It can be unpredictable because of the SQL engine's optimizer, so it is better to look at the execution plan. In the end, both approaches may be executed the same way.
As far as I can remember, running EXPLAIN will reveal that PostgreSQL plans the second query the same way as the first, provided that there is no GROUP BY clause (explicit, or implicit due to UNION instead of UNION ALL) in any of the subqueries.
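Before comparing plans, it can help to confirm that the two forms really return the same rows. The sketch below does that with Python's sqlite3 module and made-up data (SQLite is only a stand-in; its EXPLAIN QUERY PLAN output is not comparable to PostgreSQL's EXPLAIN ANALYZE, so the plan is fetched without being interpreted).

```python
import sqlite3

# Made-up rows; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (id INTEGER PRIMARY KEY, datetime TEXT, file_path TEXT);
    CREATE TABLE table1 (id INTEGER, feature_type TEXT, metadata_id INTEGER);
    CREATE TABLE table2 (id INTEGER, feature_type TEXT, metadata_id INTEGER);
    INSERT INTO metadata VALUES (1, '2020-01-01', '/tmp/a'), (2, '2020-01-02', '/tmp/b');
    INSERT INTO table1 VALUES (10, 'point', 1), (11, 'line', 2);
    INSERT INTO table2 VALUES (20, 'polygon', 1), (21, 'point', 99);
""")

JOIN_FIRST = """
    SELECT a.id, a.feature_type, b.datetime, b.file_path
    FROM table1 a JOIN metadata b ON a.metadata_id = b.id
    UNION ALL
    SELECT a.id, a.feature_type, b.datetime, b.file_path
    FROM table2 a JOIN metadata b ON a.metadata_id = b.id
"""

UNION_FIRST = """
    SELECT a.id, a.feature_type, b.datetime, b.file_path
    FROM (
        SELECT id, feature_type, metadata_id FROM table1
        UNION ALL
        SELECT id, feature_type, metadata_id FROM table2
    ) a JOIN metadata b ON a.metadata_id = b.id
"""

# Sort in Python so the comparison does not depend on row order.
rows_join_first = sorted(conn.execute(JOIN_FIRST).fetchall())
rows_union_first = sorted(conn.execute(UNION_FIRST).fetchall())

# The engine's plan for each form; inspect it rather than guessing.
plan_join_first = conn.execute("EXPLAIN QUERY PLAN " + JOIN_FIRST).fetchall()
plan_union_first = conn.execute("EXPLAIN QUERY PLAN " + UNION_FIRST).fetchall()

print(rows_join_first == rows_union_first)
```

The row with metadata_id = 99 has no match in metadata, so the inner join drops it in both forms.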

Does the optimizer filter subqueries with outer where clauses

Take the following query:
select * from
(
select a, b
from c
UNION
select a, b
from d
) t
where a = 'mung'
order by b
Will the optimizer generally work out that I am filtering a on the value 'mung' and consequently apply that filter to each of the queries in the subquery?
Or will it run each query within the subquery union and return the results to the outer query for filtering (as the query would perhaps suggest)?
In that case the following query would perform better:
select * from
(
select a, b
from c
where a = 'mung'
UNION
select a, b
from d
where a = 'mung'
) t
order by b
Obviously query 1 is best for maintenance, but is it sacrificing much performance for this?
Which is best?
Edit
Sorry I neglected to add the order by clause in the queries to indicate why it all needed to be a subquery in the first place!
Edit
Ok, I thought it was going to be a simple answer, but I forgot I was talking about databases here! Running it through the analyser indicates that there is no optimization going on, but this could of course be because I only have 4 rows in my tables. I was hoping for a simpler yes or no.. Will have to set up some more complex tests.
I am on MySQL at the moment, but I am after a general answer here...
Use:
SELECT x.a,
x.b
FROM TABLE_X x
WHERE x.a = 'mung'
UNION
SELECT y.a,
y.b
FROM TABLE_Y y
WHERE y.a = 'mung'
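The subquery is unnecessary in the given context: an ORDER BY placed after the last SELECT applies to the whole UNION result.
The query above will use indexes on column "a" if they are available. With the subquery method, the outer filter runs against the derived result, so there is no index to utilize.
Use UNION ALL if you don't have to be concerned with duplicates; it will be faster than UNION.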