To union or union all, that is the question - sql

I have two queries that I'm UNIONing together such that I already know there will be no duplicate elements between the two queries. Therefore, UNION and UNION ALL will produce the same results.
Which one should I use?

You should use the one that matches the intent of what you are looking for. If you want to ensure that there are no duplicates use UNION, otherwise use UNION ALL. Just because your data will produce the same results right now doesn't mean that it always will.
That said, UNION ALL will be faster on any sane database implementation; see the articles below for examples. The two are otherwise the same, except that UNION performs an extra step to remove identical rows (as one might expect), and that step can dominate execution time.
SQL Server article
Oracle article
MySQL article
DB2 documentation
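To make the semantic difference concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever DBMS you run. The tables a and b and their contents are made up for illustration; they are not from the question.

```python
import sqlite3

# In-memory SQLite database; tables a and b are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (val INTEGER);
    CREATE TABLE b (val INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT val FROM a UNION SELECT val FROM b ORDER BY val"
).fetchall()

# UNION ALL simply concatenates the two results, duplicates included.
union_all_rows = conn.execute(
    "SELECT val FROM a UNION ALL SELECT val FROM b ORDER BY val"
).fetchall()

print(union_rows)      # [(1,), (2,), (3,)]
print(union_all_rows)  # [(1,), (2,), (2,), (3,)]
```

If the two inputs can never overlap, both queries return the same rows, and the only thing UNION adds is the cost of the duplicate check.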

I see that you've tagged this question PERFORMANCE, so I assume that's your primary consideration.
UNION ALL will absolutely outperform UNION, since the database engine doesn't have to check the two sets for duplicates.
Unless you need the database to perform the duplicate checking for you, always use UNION ALL.

I would use UNION ALL anyway. Even though you know that there are not going to be duplicates, depending on your database server engine, it might not know that.
So, just to give the DB server extra information so that its query planner can (probably) make a better choice, use UNION ALL.
Having said that, if your DB server's query planner is smart enough to infer that information from the UNION clause and table indexes, then the results (both performance-wise and semantics-wise) should be the same.
Either way, it strongly depends on the DB server you are using.

According to http://blog.sqlauthority.com/2007/03/10/sql-server-union-vs-union-all-which-is-better-for-performance/, at least for performance it is better to use UNION ALL, since it does not actively remove duplicates and as such is faster.

Since there will be no duplicates between the two queries, use UNION ALL. You don't need to check for duplicates, and UNION ALL will perform the task more efficiently.

Related

Is it safe to run a lot of UNION ALL in one query?

I received a query from an analyst to automate, and it contains about 10-15 UNION and UNION ALL commands. Each subquery contains several joins of big tables.
I think it would be safer and more efficient to save the results of the subqueries into temp tables and union them.
So how does UNION (ALL) work at a deep level? Is my approach better for optimisation?
I don't know what you mean by "safe", but a priori there is no reason to split the query into separate tables -- unless you want to query those tables independently.
Splitting the results into separate tables has some big downsides -- namely, managing the extra tables you make. I have memories of nightmares tracking down errors because some intermediate table was not updated correctly.
In a UNION ALL, the subqueries are really isolated from each other. There is no reason to split the queries into tables to make the overall query more performant.
Introducing intermediate tables introduces complexity, so you need a good reason for that. There are good reasons. For instance, if the view is used many times and the subqueries are expensive, then materializing them makes the query more efficient. In fact, you might find that indexed views (SQL Server's approach to materialized views) are a handy way of doing this. Materialized views (whatever they are called) get around the issue of out-of-date intermediate tables.
Unfortunately, saving each subquery into a temp table won't really help/change much from a memory consumption point of view. If you run the full query with so many unions, SQL Server will consume server memory to hold the temporary results, before delivering the final result set. Temp tables also get stored in server memory, which means the memory footprint would be about the same.
For a quick comment on UNION versus UNION ALL, if you can logically accept the latter instead of the former, then it might help performance. This is because UNION by itself means to aggregate results, and also remove duplicates. UNION ALL has no such duplicate removal step, which means it generally will outperform plain UNION.

Why is UNION faster than an OR statement [duplicate]

I have a problem where I need to find records that either have a measurement that matches a value, or do not have that measurement at all. I solved that problem with three or four different approaches, using JOINs, using NOT IN and using NOT EXISTS. However, the query ended up being extremely slow every time. I then tried splitting the query in two, and they both run very fast (three seconds). But combining the queries using OR takes more than five minutes.
Reading on SO I tried UNION, which is very fast, but very inconvenient for the script I am using.
So two questions:
Why is UNION so much faster? (Or why is OR so slow)?
Is there any way I can force MSSQL to use a different approach for the OR statement that is fast?
The reason is that using OR in a query will often cause the Query Optimizer to abandon use of index seeks and revert to scans. If you look at the execution plans for your two queries, you'll most likely see scans where you are using the OR and seeks where you are using the UNION. Without seeing your query it's not really possible to give you any ideas on how you might be able to restructure the OR condition. But you may find that inserting the rows into a temporary table and joining on to it may yield a positive result.
Also, it is generally best to use UNION ALL rather than UNION if you want all results, as you remove the cost of row-matching.
There is currently no way in SQL Server to force a UNION execution plan if no UNION statement was used. If the only difference between the two parts is the WHERE clause, create a view with the complex query. The UNION query then becomes very simple:
SELECT * FROM dbo.MyView WHERE <cond1>
UNION ALL
SELECT * FROM dbo.MyView WHERE <cond2>
It is important to use UNION ALL in this context whenever possible. If you just use UNION, SQL Server has to filter out duplicate rows, which requires an expensive sort operation in most cases.
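Here is a rough sketch of that view-plus-UNION ALL pattern, run against SQLite through Python's sqlite3 module rather than SQL Server. The samples table, the my_view view and the two conditions are invented for illustration. The point is only that the rewrite returns the same rows as the OR version when the two conditions cannot both match the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (id INTEGER PRIMARY KEY, measurement INTEGER);
    INSERT INTO samples (id, measurement)
        VALUES (1, 10), (2, NULL), (3, 7), (4, 10);
    -- Stand-in for the "complex query" wrapped in a view.
    CREATE VIEW my_view AS SELECT id, measurement FROM samples;
""")

# Single query with OR.
or_rows = conn.execute(
    "SELECT id FROM my_view WHERE measurement = 10 OR measurement IS NULL"
).fetchall()

# Same conditions split into two branches combined with UNION ALL.
union_all_rows = conn.execute("""
    SELECT id FROM my_view WHERE measurement = 10
    UNION ALL
    SELECT id FROM my_view WHERE measurement IS NULL
""").fetchall()

# Row order is not guaranteed without ORDER BY, so compare sorted lists.
print(sorted(or_rows) == sorted(union_all_rows))  # True
```

If the two conditions could overlap, UNION ALL would return the overlapping rows twice, and you would need either UNION or an extra exclusion in the second branch.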

UNION vs DISTINCT in performance

In SQL 2008, I have a query like so:
QUERY A
UNION
QUERY B
UNION
QUERY C
Will it be slower/faster than putting the result of all 3 queries in say, a temporary table and then SELECTing them with DISTINCT?
It depends on the query -- without knowing the complexity of queries A, B or C it's not one that can be answered, so your best bet is to profile and then judge based on that.
However...
I'd probably go with a union regardless: a temporary table can be quite expensive, especially as it gets big. Remember with a temporary table, you're explicitly creating extra operations and thus more i/o to stress the disk sub-system out. If you can do a select without resorting to a temporary table, that's always (probably) going to be faster.
There's bound to be an exception (or seven) to this rule, hence you're better off profiling against a realistically large dataset to make sure you get some solid figures to make a suitable decision on.
DISTINCT and UNION perform different tasks: DISTINCT eliminates duplicate rows from a result set, while UNION combines result sets (removing duplicates as it does so). I don't know what you want to do, but it seems you want the distinct rows from the combined results of 3 different queries. In that case:
query A UNION query B......
that would be the fastest, depending of course on what you want to do.
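For what it's worth, the two approaches from the question can be checked for equivalence with a small sketch like the one below. It uses SQLite via Python's sqlite3 module instead of SQL Server 2008, and the tables qa, qb and qc are hypothetical stand-ins for queries A, B and C.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE qa (val INTEGER);
    CREATE TABLE qb (val INTEGER);
    CREATE TABLE qc (val INTEGER);
    INSERT INTO qa VALUES (1), (2);
    INSERT INTO qb VALUES (2), (3);
    INSERT INTO qc VALUES (3), (4);
""")

# Approach 1: let UNION remove the duplicates.
union_rows = conn.execute("""
    SELECT val FROM qa
    UNION SELECT val FROM qb
    UNION SELECT val FROM qc
    ORDER BY val
""").fetchall()

# Approach 2: collect everything in a temporary table, then SELECT DISTINCT.
conn.executescript("""
    CREATE TEMP TABLE combined AS
        SELECT val FROM qa
        UNION ALL SELECT val FROM qb
        UNION ALL SELECT val FROM qc;
""")
distinct_rows = conn.execute(
    "SELECT DISTINCT val FROM combined ORDER BY val"
).fetchall()

print(union_rows == distinct_rows)  # True
```

This only shows that the results match. Which one is faster on your data is something you would still have to measure on your own server.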

Does the way you write sql queries affect performance?

say i have a table
Id int
Region int
Name nvarchar
select * from table1 where region = 1 and name = 'test'
select * from table1 where name = 'test' and region = 1
will there be a difference in performance?
assume no indexes
is it the same with LINQ?
Because your qualifiers are, in essence, actually the same (it doesn't matter what order the where clauses are put in), then no, there's no difference between those.
As for LINQ, you will need to know what query LINQ to SQL actually emits (you can use a SQL Profiler to find out). Sometimes the query will be the simplest query you can think of, sometimes it will be a convoluted variety of such without you realizing it, because of things like dependencies on FKs or other such constraints. LINQ also wouldn't use an * for select.
The only real way to know is to find out the SQL Server Query Execution plan of both queries. To read more on the topic, go here:
SQL Server Query Execution Plan Analysis
Should it? No. SQL is a declarative language based on relational algebra, and the DBMS should optimize irrespective of order within the statement.
Does it? Possibly. Some DBMS' may store data in a certain order (e.g., maintain a key of some sort) despite what they've been told. But, and here's the crux: you cannot rely on it.
You may need to switch DBMS' at some point in the future. Even a later version of the same DBMS may change its behavior. The only thing you should be relying on is what's in the SQL standard.
Regarding the query given: with no indexes or primary key on the two fields in question, you should assume that you'll need a full table scan for both cases. Hence they should run at the same speed.
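A quick way to check this yourself is to compare the plans for both orderings. The sketch below does that in SQLite through Python's sqlite3 module, so it says nothing definitive about SQL Server; the sample rows are made up, and the table follows the one in the question with no indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Id INTEGER, Region INTEGER, Name TEXT);
    INSERT INTO table1 VALUES (1, 1, 'test'), (2, 2, 'test'), (3, 1, 'other');
""")

q1 = "SELECT * FROM table1 WHERE region = 1 AND name = 'test'"
q2 = "SELECT * FROM table1 WHERE name = 'test' AND region = 1"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable plan step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

same_rows = conn.execute(q1).fetchall() == conn.execute(q2).fetchall()
same_plan = plan(q1) == plan(q2)
print(same_rows, same_plan)
```

On SQL Server you would compare the execution plans of the two statements in the same spirit.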
I don't recommend the *, because the engine has to look up the table schema to expand it before executing the query. Instead, list only the table fields you want, to avoid unnecessary overhead.
And yes, the engine optimizes your queries, but help it :) with that.
For simple queries, likely there is little or no difference, but yes indeed the way you write a query can have a huge impact on performance.
In SQL Server (performance issues are very database specific), a correlated subquery will usually have poor performance compared to doing the same thing in a join to a derived table.
Other things in a query that can affect performance include using SARGable [1] where clauses instead of non-SARGable ones, selecting only the fields you need and never using select * (especially not when doing a join, as at least one field is repeated), using a set-based query instead of a cursor, avoiding a wildcard as the first character in a like clause, and on and on. There are very large books that devote chapters to more efficient ways to write queries.
[1] "SARGable" predicates, for those that don't know, are stage 1 predicates in DB2 parlance (and possibly other DBMS'). Stage 1 predicates are more efficient since they can be evaluated against indexes, and DB2 uses those first.
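To illustrate the SARGable point, here is a small sketch in SQLite (via Python's sqlite3 module), not SQL Server or DB2. The index name idx_region is hypothetical, and the exact plan wording differs between engines and versions, so treat the printed text as indicative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Id INTEGER, Region INTEGER, Name TEXT);
    CREATE INDEX idx_region ON table1 (Region);  -- hypothetical index
""")

def plan(sql):
    # Join the readable plan steps from EXPLAIN QUERY PLAN into one string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)

# SARGable: the bare column is compared to a constant,
# so the index on Region can be searched.
sargable = plan("SELECT Name FROM table1 WHERE Region = 1")

# Non-SARGable: wrapping the column in an expression hides it from the index.
non_sargable = plan("SELECT Name FROM table1 WHERE Region + 0 = 1")

print(sargable)
print(non_sargable)
```

In this sketch the first plan mentions idx_region and the second falls back to scanning the table, which is the same effect as applying a function to an indexed column in a where clause.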

What's optimal? UNION vs WHERE IN (str1, str2, str3)

I'm writing a program that sends an email out at a client's specific local time. I have a .NET method that takes a timezone & time and destination timezone and returns the time in that timezone. So my method is to select every distinct timezone in the database, check if it is the correct time using the method, then select every client out of the database with that timezone(s).
The query will look like one of these. Keep in mind the order of the result set does not matter, so a union would be fine. Which runs faster, or do they really do the same thing?
SELECT email FROM tClient WHERE timezoneID in (1, 4, 9)
or
SELECT email FROM tClient WHERE timezoneID = 1
UNION ALL SELECT email FROM tClient WHERE timezoneID = 4
UNION ALL SELECT email FROM tClient WHERE timezoneID = 9
Edit: timezoneID is a foreign key to tTimezone, a table with primary key timezoneID and varchar(20) field timezoneName.
Also, I went with WHERE IN since I didn't feel like opening up the analyzer.
Edit 2: Query processes 200k rows in under 100 ms, so at this point I'm done.
Hey! These queries are not equivalent.
The results will be the same only if each email belongs to just one time zone. Of course it does, but the SQL engine doesn't know that and tries to remove duplicates. So the first query should be faster.
Always use UNION ALL, unless you know why you want to use UNION.
If you are not sure what the difference is, see this SO question.
Note: the exclamation above refers to a previous version of the question.
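A small sketch of why plain UNION is not equivalent, using SQLite through Python's sqlite3 module. The data is hypothetical and deliberately breaks the "one email, one time zone" assumption by registering the same address in two zones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tClient (email TEXT, timezoneID INTEGER);
    -- Hypothetical data: the same address appears in two time zones.
    INSERT INTO tClient VALUES
        ('a@example.com', 1),
        ('a@example.com', 4),
        ('b@example.com', 9);
""")

in_count = len(conn.execute(
    "SELECT email FROM tClient WHERE timezoneID IN (1, 4, 9)"
).fetchall())

union_all_count = len(conn.execute("""
    SELECT email FROM tClient WHERE timezoneID = 1
    UNION ALL SELECT email FROM tClient WHERE timezoneID = 4
    UNION ALL SELECT email FROM tClient WHERE timezoneID = 9
""").fetchall())

union_count = len(conn.execute("""
    SELECT email FROM tClient WHERE timezoneID = 1
    UNION SELECT email FROM tClient WHERE timezoneID = 4
    UNION SELECT email FROM tClient WHERE timezoneID = 9
""").fetchall())

print(in_count, union_all_count, union_count)  # 3 3 2
```

IN and UNION ALL both return every matching row, while UNION collapses the repeated address into one row.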
For most database related performance questions, the real answer is to run it and analyze what the DB does for your dataset. Run an explain plan or trace to see if your query is hitting the proper indexes or create indexes if necessary.
I would likely go with the first using the IN clause since that carries the most semantics of what you want. The timezoneID seems like a primary key on some timezone table, so it should be a foreign key on email and indexed. Depending on the DB optimizer, I would think it should do an index scan on the foreign key index.
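As a sketch of the "run an explain plan" advice, the snippet below compares the plan for the IN query before and after adding an index on timezoneID. It uses SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 module; the index name idx_client_tz is made up, and other engines have their own plan tools and may decide differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tClient (email TEXT, timezoneID INTEGER)")

query = "SELECT email FROM tClient WHERE timezoneID IN (1, 4, 9)"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable plan step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# No index yet, so the planner has to scan the whole table.
before = plan(query)

# idx_client_tz is a hypothetical index name.
conn.execute("CREATE INDEX idx_client_tz ON tClient (timezoneID)")

# With the index in place the planner can look the values up instead.
after = plan(query)

print(before)
print(after)
```

Reading the two plans side by side tells you whether the index is actually being used, which is more reliable than guessing from the query text.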
In the book "SQL Performance Tuning", the authors found that the UNION queries were slower in all 7 DBMS' that they tested (SQL Server 2000, Sybase ASE 12.5, Oracle 9i, DB2, etc.): http://books.google.com/books?id=3H9CC54qYeEC&pg=PA32&vq=UNION&dq=sql+performance+tuning&source=gbs_search_s&sig=ACfU3U18uYZWYVHxr2I3uUj8kmPz9RpmiA#PPA33,M1
The later DBMS' may have optimized that difference away, but it's doubtful. Also, the UNION method is much longer and more difficult to maintain (what if you want a third?) vs. the IN.
Unless you have good reason to use UNION, stick with the OR/IN method.
My first guess would be that SELECT email FROM tClient WHERE timezoneID in (1, 4, 9) will be faster, as it requires only a single scan of the table to find the results, but I suggest checking the execution plan for both queries.
I do not have MS SQL Query Analyzer at hand to actually check my hypothesis, but I think the WHERE IN variant would be faster, because with UNION the server will have to do 3 table scans, whereas WHERE IN will need only one. If you have Query Analyzer, check the execution plans for both queries.
On the Internet you may often encounter suggestions to avoid using WHERE IN, but that refers to cases where subqueries are used. So this case is outside the scope of that recommendation, and it is also easier to read and understand.
I think that several very important pieces of information are missing from the question. First of all, it is of great importance whether timezoneID is indexed or not, whether it is part of the primary key, etc. I would advise everyone to have a look at the analyzer, but in my experience the WHERE clause should be faster, especially with an index. The logic is something like this: there is additional overhead in the union query, such as checking the types and number of columns in each part.
Some DBMS query optimizers modify your query to make it more efficient, so depending on the DBMS you're using, you probably shouldn't care.