I have a problem where I need to find records that either have a measurement matching a value, or do not have that measurement at all. I tried several different approaches, using JOINs, NOT IN and NOT EXISTS, but the query ended up being extremely slow every time. I then tried splitting the query in two; each half runs very fast (about three seconds), but combining them with OR takes more than five minutes.
Reading on SO I tried UNION, which is very fast, but very inconvenient for the script I am using.
So two questions:
Why is UNION so much faster? (Or: why is OR so slow?)
Is there any way I can force MSSQL to use a different approach for the OR statement that is fast?
The reason is that using OR in a query will often cause the Query Optimizer to abandon use of index seeks and revert to scans. If you look at the execution plans for your two queries, you'll most likely see scans where you are using the OR and seeks where you are using the UNION. Without seeing your query it's not really possible to give you any ideas on how you might be able to restructure the OR condition. But you may find that inserting the rows into a temporary table and joining on to it may yield a positive result.
Also, it is generally best to use UNION ALL rather than UNION if you want all results, as you remove the cost of row-matching.
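As a sketch of the kind of rewrite involved (your actual query isn't shown, so dbo.Measurements, value and @target here are stand-ins):
-- OR form: one index seek can rarely cover both branches,
-- so the optimizer often falls back to a scan
SELECT id, value
FROM dbo.Measurements
WHERE value = @target OR value IS NULL
-- UNION ALL form: each branch can get its own index seek; UNION ALL is
-- safe here because no row can satisfy both predicates at once
SELECT id, value
FROM dbo.Measurements
WHERE value = @target
UNION ALL
SELECT id, value
FROM dbo.Measurements
WHERE value IS NULL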
There is currently no way in SQL Server to force a UNION execution plan if no UNION statement was used. If the only difference between the two parts is the WHERE clause, create a view with the complex query. The UNION query then becomes very simple:
SELECT * FROM dbo.MyView WHERE <cond1>
UNION ALL
SELECT * FROM dbo.MyView WHERE <cond2>
It is important to use UNION ALL in this context whenever possible. If you just use UNION, SQL Server has to filter out duplicate rows, which in most cases requires an expensive sort operation.
I'm wondering whether there is a difference between SQL variables and subqueries: whether one uses more processing power, one is quicker, or one is merely more readable.
For (a very basic) example, I like to use variables to hold polygons and transformations in PostGIS:
WITH region_polygon AS (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
), raster_pixels AS (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
)
SELECT x, y
FROM raster_pixels a, region_polygon b
WHERE ST_Within(a.geom, b.geom)
But would it be better in any way to use subqueries?
SELECT x, y
FROM (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
) a, (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
) b
WHERE ST_Within(a.geom, b.geom)
Note that I'm using PostgreSQL.
There's an important syntactic advantage of common table expressions over derived tables when it comes to reuse. Consider the following, equivalent examples using self-joins:
Using common table expressions
WITH a(v) AS (SELECT 1 UNION SELECT 2)
SELECT *
FROM a AS x, a AS y
Using derived tables
SELECT *
FROM (SELECT 1 UNION SELECT 2) x(v),
(SELECT 1 UNION SELECT 2) y(v)
As you can see, using common table expressions, the view (SELECT 1 UNION SELECT 2) can be reused multiple times in your query. With derived tables, you will have to repeat your view declaration. In my example, this is still OK. In your own example, this starts getting a bit more hairy.
It's all about scope
Views in SQL are all about scoping. There are essentially four levels of declaring views:
As derived tables. They can be consumed exactly once.
As common table expressions. They can be consumed several times, but only in one query.
As views. They can be consumed several times in several queries.
As materialized views. Same as views, but the data is pre-calculated.
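In PostgreSQL syntax, the last two levels would be declared like this (a sketch; the names are invented):
CREATE VIEW two_numbers AS SELECT 1 AS v UNION SELECT 2;
CREATE MATERIALIZED VIEW two_numbers_cached AS SELECT 1 AS v UNION SELECT 2;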
Some databases (in particular PostgreSQL) also support table-valued functions. From a mere syntax perspective, they're just like views - parameterised views.
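For instance, a parameterised view as a table-valued function might look like this in PostgreSQL (a minimal, self-contained sketch; the function name is invented):
CREATE FUNCTION numbers_up_to(n int)
RETURNS TABLE (v int) AS $$
  SELECT generate_series(1, n);
$$ LANGUAGE sql STABLE;
-- consumed exactly like a view, but with a parameter:
SELECT * FROM numbers_up_to(5);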
Performance
Note that these thoughts only focus on syntax, not query planning. The different approaches may have very different performance implications, depending on the database vendor.
Those aren't variables, they're common table expressions (CTEs). In your query above, the execution plans are likely identical, because the optimizer should recognize they are equivalent queries. I prefer to use CTEs because I think they're easier to read than subqueries, but that's it.
Edit: Upon further reading, it looks like PostgreSQL does treat common table expressions differently than other databases; you can't update a CTE in PostgreSQL, for instance. I'll leave my answer here because I believe for your query there won't be a difference, but I'm not terribly familiar with PostgreSQL.
As pointed out, this construct is called a common table expression (CTE), not a variable.
I prefer to use a CTE rather than a subquery because it is far easier for me to read and write, especially when you have several nested CTEs.
You can write a CTE once and refer to it several times in the rest of the query. With a subquery you'd have to repeat the code several times.
An important difference between PostgreSQL and other databases (at least MS SQL Server) is that PostgreSQL evaluates each CTE only once:
A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work. Another possible application is to prevent unwanted multiple evaluations of functions with side-effects.
However, the other side of this coin is that the optimizer is less able to push restrictions from the parent query down into a WITH query than an ordinary sub-query. The WITH query will generally be evaluated as written, without suppression of rows that the parent query might discard afterwards. (But, as mentioned above, evaluation might stop early if the reference(s) to the query demand only a limited number of rows.)
MS SQL Server would inline each reference to a CTE into the main query and optimize the whole result, but PostgreSQL doesn't. In some sense PostgreSQL is more flexible here: if you want the subquery to be evaluated only once, put it in a CTE; if you don't, put it in a subquery and repeat the code. In SQL Server you'd have to use a temporary table explicitly.
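To illustrate the evaluate-once behaviour (a sketch; events, payload and slow_fn are hypothetical names):
-- PostgreSQL evaluates this CTE body once and reuses the result for both
-- references; MS SQL Server would instead inline it into each reference.
WITH scored AS (
    SELECT id, slow_fn(payload) AS score FROM events
)
SELECT a.id, a.score, b.score AS next_score
FROM scored a
JOIN scored b ON b.id = a.id + 1;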
Your example in the question is too simple, and most likely both variants are equivalent - check the execution plan.
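A self-contained way to see the materialisation in a plan (generate_series is built in; look for the CTE Scan node):
EXPLAIN ANALYZE
WITH t AS (SELECT generate_series(1, 1000) AS n)
SELECT * FROM t UNION ALL SELECT * FROM t;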
The official docs mention this, as I quoted above, but Nick Barnes gave a link to a good article explaining it in more detail, and I thought it was worth putting in an answer rather than a comment.
When optimising queries in PostgreSQL (true at least in 9.4 and older), it’s worth keeping in mind that – unlike newer versions of various other databases – PostgreSQL will always materialise a CTE term in a query.
This can have quite surprising effects for those used to working with DBs like MS SQL:
A query that should touch a small amount of data instead reads a whole table and possibly spills it to a tempfile;
You cannot UPDATE or DELETE FROM a CTE term, because it’s more like a read-only temp table rather than a dynamic view.
So there is no definite answer whether a CTE is better than a subquery in PostgreSQL. In some cases it can be faster, in some cases slower. But, IMHO, in most cases a CTE is easier to write, read and maintain.
And, obviously, there are cases where you have no option but to use a recursive CTE (recursive queries are typically used to deal with hierarchical or tree-structured data); a sketch follows.
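A minimal example, assuming a hypothetical employees(id, name, manager_id) table:
WITH RECURSIVE subordinates AS (
    SELECT id, name, manager_id
    FROM employees
    WHERE id = 1                -- root of the tree
    UNION ALL
    SELECT e.id, e.name, e.manager_id
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;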
Is it expensive to select all columns from a SQL table, compared to specifying which columns to retrieve?
SELECT * FROM table
vs
SELECT col1, col2, col3 FROM table
Might be useful to know that some of the tables I'm querying have over 100 columns.
It may be. It depends on what indexes are defined on the table.
If there's a non-clustered index on col1,col2,col3 then that index may be used to satisfy the query (since it's narrower than the table itself, be it a heap or a clustered index), which should result in lower I/O costs.
Explicitly listing columns is also generally preferred, since it lets you determine which queries are using particular columns in a table.
If the table has no indexes, or only a single clustered index, or there are no indexes that cover your particular query, then every page of the heap/clustered index is going to have to be accessed anyway. Even then, if you have any off-row data (e.g. a largish varchar(max)) that you don't include in the SELECT, you can avoid that I/O cost.
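For instance (a sketch; dbo.MyTable and its columns are stand-ins):
-- A covering index lets SQL Server answer the narrow query from the
-- index alone, without touching the full-width rows:
CREATE NONCLUSTERED INDEX IX_MyTable_col1_col2_col3
    ON dbo.MyTable (col1, col2, col3)
SELECT col1, col2, col3 FROM dbo.MyTable  -- can be satisfied by the index
SELECT * FROM dbo.MyTable                 -- must read every page of the table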
Performance in this case depends on the proper use of indexes in your query.
If your DB is normalized properly and the columns in your WHERE clause are indexed, performance will generally be better.
E.g.:
select * from tableName where id=232
Here the index on id is used.
You can refer to the following links:
Performance issue in using SELECT *?
What is the reason not to use select *?
Let's take it apart into the main issues:
The actual database/application: How you type your query MIGHT change how the SQL application actually optimizes it, where it gets the data from, etc. Then again, it might not. It's hard to generalize here; it depends on the database application and setup.
Programmer resources: Using * instead of typing things out is easier and quicker for you. Yay! And if the "implication" behind the command is literally "get everything", maybe it's a nice bit of programmer communication to use * instead of listing all the columns out by hand. Being hit in the face with a list of hundreds of column names as a programmer reading code afterwards is an unpleasant experience. On the other hand, listing things by hand can act as a bit of a signal that there's some reason you're asking for those columns specifically. It's not a strong signal, but it's still a signal.
Other resources/IO/memory, etc: Now, if you don't actually need all 100 columns and you're querying them because you're lazy, then we get into further grey area. What's the database being loaded from? Where are the query results going? How fast are the read/write speeds on those things? Do you really want to do that with all the columns? How much memory or resources are going to be used in actioning the query? Will it be using an index? Is it indexed? Do you even need to care about optimization at this stage?
So the long and short of it is, it's a grey area...
Generally you should only select the columns that you need for the query.
Sometimes selecting all the columns for a query which is used later in a stored procedure won't make any difference, due to how the optimiser plans the whole stored procedure. Indexes on the columns involved will also have an effect.
Suppose there is a table with 20 columns. Is there any speed difference between a select * and a select that includes all 20 columns explicitly?
If there is no difference, what would you advise? Should I use the lazy select * or should I create a query string with each column?
(If it makes any difference: I use SQL Server.)
Not so much about speed, but maintainability...
If your application requests columns specifically, and the table structure changes (column removed or renamed) then the statement will break when it runs, indicating exactly where the issue lies.
select * will still work after a structure change, but may cause a more subtle issue later on in the application that will be more difficult to trace.
It is additional work up front, but better for maintainability to explicitly list columns.
Assuming the WHERE clause is parameterised, the SQL is only parsed once and should be in the statement cache after that, so execution speed won't be much different.
As a good practice, it's better to include column names in the select statement itself, because the structure of the table can change in the future and you would be pulling unwanted data.
Performance-wise
When you use select *, the SQL Server compiler has to replace the * with the column names, which is an additional task, but I think that cost is negligible.
In SQL 2008, I have a query like so:
QUERY A
UNION
QUERY B
UNION
QUERY C
Will it be slower/faster than putting the result of all 3 queries in say, a temporary table and then SELECTing them with DISTINCT?
It depends on the query -- without knowing the complexity of queries A, B or C it's not one that can be answered, so your best bet is to profile and then judge based on that.
However...
I'd probably go with a union regardless: a temporary table can be quite expensive, especially as it gets big. Remember that with a temporary table you're explicitly creating extra operations, and thus more I/O to stress the disk subsystem. If you can do a select without resorting to a temporary table, that's always (probably) going to be faster.
There's bound to be an exception (or seven) to this rule, hence you're better off profiling against a realistically large dataset to make sure you get some solid figures to make a suitable decision on.
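For reference, the temp-table variant being weighed would look roughly like this (a sketch; tableA/tableB/tableC stand in for queries A, B and C):
SELECT id, name INTO #results FROM tableA          -- QUERY A
INSERT INTO #results SELECT id, name FROM tableB   -- QUERY B
INSERT INTO #results SELECT id, name FROM tableC   -- QUERY C
SELECT DISTINCT id, name FROM #results
DROP TABLE #results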
DISTINCT and UNION stand for different tasks: the first eliminates duplicate rows within a result set, while the second combines result sets (and, without ALL, removes duplicates across them too). I don't know what you want to do, but it seems you want the distinct rows of 3 different queries combined. In that case:
query A UNION query B......
that would be the fastest, depending of course on what you want to do.
I have two queries that I'm UNIONing together such that I already know there will be no duplicate elements between the two queries. Therefore, UNION and UNION ALL will produce the same results.
Which one should I use?
You should use the one that matches the intent of what you are looking for. If you want to ensure that there are no duplicates use UNION, otherwise use UNION ALL. Just because your data will produce the same results right now doesn't mean that it always will.
That said, UNION ALL will be faster on any sane database implementation; see the articles below for examples, and the tiny demonstration after them. Typically, the two behave the same except that UNION performs an extra step to remove identical rows (as one might expect), and that step can come to dominate execution time.
SQL Server article
Oracle article
MySQL article
DB2 documentation
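A minimal demonstration of the semantic difference (SQL Server/PostgreSQL/MySQL syntax; Oracle would need FROM dual):
SELECT 1 AS v UNION     SELECT 1   -- one row: the duplicate is removed
SELECT 1 AS v UNION ALL SELECT 1   -- two rows: results are simply concatenated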
I see that you've tagged this question PERFORMANCE, so I assume that's your primary consideration.
UNION ALL will absolutely outperform UNION since SQL doesn't have to check the two sets for dups.
Unless you need SQL to perform the duplicate checking for you, always use UNION ALL.
I would use UNION ALL anyway. Even though you know that there are not going to be duplicates, your database server engine might not know that.
So, just to give the DB server extra information so that its query planner can (probably) make a better choice, use UNION ALL.
Having said that, if your DB server's query planner is smart enough to infer that information from the UNION clause and table indexes, then the results (performance- and semantics-wise) should be the same.
Either way, it strongly depends on the DB server you are using.
According to http://blog.sqlauthority.com/2007/03/10/sql-server-union-vs-union-all-which-is-better-for-performance/, it is better to use UNION ALL, at least for performance, since it does not actively remove duplicates and as such is faster.
Since there will be no duplicates from the two queries, use UNION ALL. You don't need to check for duplicates, and UNION ALL will perform the task more efficiently.