UNION vs DISTINCT in performance - sql

In SQL 2008, I have a query like so:
QUERY A
UNION
QUERY B
UNION
QUERY C
Will it be slower/faster than putting the result of all 3 queries in say, a temporary table and then SELECTing them with DISTINCT?

It depends on the query -- without knowing the complexity of queries A, B or C, this isn't something that can be answered in the abstract, so your best bet is to profile and then judge based on that.
However...
I'd probably go with a UNION regardless: a temporary table can be quite expensive, especially as it gets big. Remember that with a temporary table, you're explicitly creating extra operations, and thus more I/O to stress the disk subsystem. If you can do a SELECT without resorting to a temporary table, that will probably always be faster.
There's bound to be an exception (or seven) to this rule, hence you're better off profiling against a realistically large dataset to make sure you get some solid figures to make a suitable decision on.
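To make the comparison concrete, here is a minimal sketch of the two approaches from the question. It uses Python's `sqlite3` as a stand-in engine (not SQL Server 2008), and the tables `t1`/`t2`/`t3` are invented stand-ins for queries A, B and C -- the point is only that both forms return the same deduplicated rows:

```python
import sqlite3

# Hypothetical tables standing in for queries A, B and C.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1(x); INSERT INTO t1 VALUES (1), (2), (3);
    CREATE TABLE t2(x); INSERT INTO t2 VALUES (2), (3), (4);
    CREATE TABLE t3(x); INSERT INTO t3 VALUES (3), (4), (5);
""")

# Approach 1: UNION deduplicates across the three result sets in one pass.
union_rows = con.execute("""
    SELECT x FROM t1 UNION SELECT x FROM t2 UNION SELECT x FROM t3
    ORDER BY x
""").fetchall()

# Approach 2: dump everything into a temp table, then SELECT DISTINCT.
con.executescript("""
    CREATE TEMP TABLE staging AS SELECT x FROM t1;
    INSERT INTO staging SELECT x FROM t2;
    INSERT INTO staging SELECT x FROM t3;
""")
distinct_rows = con.execute(
    "SELECT DISTINCT x FROM staging ORDER BY x").fetchall()

assert union_rows == distinct_rows == [(1,), (2,), (3,), (4,), (5,)]
```

The results are identical; the temp-table route just adds the extra writes the answer above warns about.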

DISTINCT and UNION perform different tasks. DISTINCT eliminates duplicate rows within a single result set, while UNION combines several result sets (and, unlike UNION ALL, also removes duplicates across them). I don't know exactly what you want to do, but it sounds like you want distinct rows from 3 different queries combined into one result. In that case:
query A UNION query B UNION query C
would likely be the fastest, depending of course on what you want to do.

Related

Is it safe to run a lot of UNION ALL in one query?

I received a query from an analyst to automate, and it contains about 10-15 UNION and UNION ALL operations. Each subquery contains several joins of big tables.
I think it would be a safer and more optimal approach to save the results of the subqueries into temp tables and union those.
So how does UNION (ALL) work at a deep level? Is my way better for optimisation?
I don't know what you mean by "safe", but a priori there is no reason to split the query into separate tables -- unless you want to query those tables independently.
Splitting the results into separate tables has some big downsides -- namely, managing the extra tables you make. I have memories of nightmares tracking down errors because some intermediate table was not updated correctly.
In a UNION ALL, the subqueries are really isolated from each other. There is no reason to split the queries into tables to make the overall query more performant.
Introducing intermediate tables introduces complexity, so you need a good reason for that. There are good reasons. For instance, if the view is used many times and the subqueries are expensive, then materializing them makes the query more efficient. In fact, you might find that indexed views (SQL Server's approach to materialized views) are a handy way of doing this. Materialized views (whatever they are called) get around the issue of out-of-date intermediate tables.
Unfortunately, saving each subquery into a temp table won't really help or change much from a memory-consumption point of view. If you run the full query with so many unions, SQL Server will consume server memory to hold the intermediate results before delivering the final result set. Temp tables live in tempdb, which is itself backed by server memory (and disk), which means the memory footprint would be about the same.
For a quick comment on UNION versus UNION ALL: if you can logically accept the latter instead of the former, it might help performance. This is because UNION on its own means to combine the results and also remove duplicates. UNION ALL has no such duplicate-removal step, which means it will generally outperform plain UNION.
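The duplicate-removal difference is easy to observe. A small sketch using Python's `sqlite3` (not SQL Server, but the UNION semantics are the same), with invented tables `a` and `b`:

```python
import sqlite3

# UNION must sort/hash the combined rows to drop duplicates;
# UNION ALL just concatenates the inputs.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(v); INSERT INTO a VALUES (1), (2), (2);
    CREATE TABLE b(v); INSERT INTO b VALUES (2), (3);
""")

union_all = con.execute("SELECT v FROM a UNION ALL SELECT v FROM b").fetchall()
union     = con.execute("SELECT v FROM a UNION SELECT v FROM b").fetchall()

print(len(union_all))  # 5 rows: nothing removed
print(len(union))      # 3 rows: the duplicate 2s are removed
```

That extra dedup pass is the cost you pay for plain UNION even when the inputs happen to contain no duplicates at all.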

Why is there a HUGE performance difference between temp table and subselect

This is a question about SQL Server 2008 R2
I'm not a DBA, by far. I'm a java developer, who has to write SQL from time to time. (mostly embedded in code). I want to know if I did something wrong here, and if so, what I can do to avoid it to happen again.
Q1:
SELECT something FROM (SELECT * FROM T1 WHERE condition1) JOIN ...
Q1 features 14 joins
Q2 is the same as Q1, with one exception. (SELECT * FROM T1 WHERE condition1) is executed before, and stored in a temp table.
This is not a correlated sub-query.
Q2:
SELECT * INTO #tempTable FROM T1 WHERE condition1
SELECT something FROM #tempTable JOIN ...
again, 14 joins.
The thing that puzzles me now is that Q1 took > 2min, (tried it a few times, to avoid caching to play a role) while Q2 (both queries combined) took 2sec!!! What gives?
Why it's not recommended to use subqueries?
A database optimizer (regardless of which database you are using) cannot always properly optimize a query that contains subqueries. In this case, the optimizer's problem is choosing the right way to join the result sets. There are several algorithms for joining two result sets, and the choice of algorithm depends on the number of records contained in each one. When you join two physical tables (a subquery is not a physical table), the database can easily estimate the amount of data in each result set from the available statistics. When one of the result sets is a subquery, it is very difficult to estimate how many records it returns. In that case the database can choose the wrong join plan, which can lead to a dramatic reduction in query performance.
Rewriting the query to use temporary tables is intended to simplify the optimizer's job. In the rewritten query, all result sets participating in the joins are physical tables, so the database can easily determine the size of each one. This allows the database to choose a good query plan far more reliably, and to make a reasonable choice regardless of the conditions. A query rewritten with temporary tables tends to behave consistently across databases, which is especially important when developing portable solutions. In addition, the rewritten query is easier to read, easier to understand, and easier to debug.
Of course, rewriting the query with temporary tables can introduce some slowdown due to the additional expense of creating them. If the database would not have been mistaken in its choice of query plan, the old query would run faster than the new one. However, this slowdown is usually negligible: creating a temporary table typically takes a few milliseconds, so the delay rarely has a significant impact on system performance and can usually be ignored.
Important! Do not forget to create indexes for temporary tables. The index fields should include all fields that are used in join conditions.
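The materialize-then-index pattern described above looks roughly like this. The sketch uses Python's `sqlite3` as a stand-in (on SQL Server 2008 R2 you would use a `#tempTable` instead), and the tables `t1`/`detail` and the `grp = 'keep'` filter are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1(id INTEGER, grp TEXT, val INTEGER);
    CREATE TABLE detail(t1_id INTEGER, note TEXT);
    INSERT INTO t1 VALUES (1, 'keep', 10), (2, 'drop', 20), (3, 'keep', 30);
    INSERT INTO detail VALUES (1, 'a'), (3, 'b');

    -- Materialize the filtered subquery into a temp table...
    CREATE TEMP TABLE filtered AS SELECT * FROM t1 WHERE grp = 'keep';
    -- ...and index the field used in the join condition.
    CREATE INDEX idx_filtered_id ON filtered(id);
""")

# The 14-join query would now join against the small, indexed temp table.
rows = con.execute("""
    SELECT f.id, d.note
    FROM filtered f JOIN detail d ON d.t1_id = f.id
    ORDER BY f.id
""").fetchall()

assert rows == [(1, 'a'), (3, 'b')]
```

Because `filtered` is a real table, the optimizer knows its exact row count and has an index on the join column, which is precisely the information it lacked for the inline subquery.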
There are a lot of things to tackle here: indexes, execution plans, etc. Testing and comparing results is the way to go.
You could take a look at the usual suspects, indexes. Look at the execution plans and compare them. Make sure the WHERE clause is using the correct indexes, and ensure you are using indexes on your JOINs.
These answers sure will help you a lot.
Performance: Subquery or Joining
Is there a speed difference between CTE , SubQuery and Temp tables?

UNION in a subquery throwing the numbers

I'm working on a project for a landing page. Basically, there are multiple criteria the user can select that will run a query against a DB2 database and return the results. The queries are broken down into various pieces that are assembled depending on user criteria, with parameters inserted. While I'm having some difficulty with a few that return giant datasets pulled from even larger tables and joins, there's one that stands out as an oddball when I run some performance numbers on the database.
One thing that all of these fully-assembled queries have in common is that they are filtered on a list of user ids. There are half a dozen or so of these queries that return datasets of varying sizes. Most of them are pretty straightforward, i.e.:
TABLE.COLUMN IN (subquery with a few joins that returns a column of user ids)
These subqueries take diddly for time to run by themselves. However, one of these requires a union. Essentially, one table contains a key that has to be used to gather user ids from two different tables, so two sets of user ids must be unioned to get a single list for the subquery, ie:
TABLE.COLUMN IN (subquery UNION subquery)
It's my guess that the DB2 optimizer runs into a lot more limitations when going over a subquery with a union than one with a simple series of joins and can't handle it as well. This particular subquery is middle-of-the-road when it comes to the amount of data it collects, so it's not an issue with a giant dataset.
I'm wondering what alternatives I might have to a union that would at least bring this subquery in line with the others. It's a bit maddening that making changes may help this particular case, but show a detriment to the others, or vice versa. I've tinkered with a few things, but with no luck. The explain plan shows that the proper indexes are being utilized, at least. I know that I don't have much in the way of examples, but these queries are pretty massive overall and it would be difficult to post the necessary data concisely, but let me know if it's necessary and I'll try to knock something together. Thanks.
You could try these two alternatives to a union:
WHERE TABLE.COLUMN IN (subquery1)
OR TABLE.COLUMN IN (subquery2)
Or using filtering joins:
SELECT *
FROM TABLE T
LEFT JOIN
(
subquery1
) f1
ON f1.COLUMN = T.COLUMN
LEFT JOIN
(
subquery2
) f2
ON f2.COLUMN = T.COLUMN
WHERE f1.COLUMN IS NOT NULL
OR f2.COLUMN IS NOT NULL
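As a sanity check, the OR-of-INs rewrite returns the same users as the original `IN (subquery UNION subquery)` form. A minimal sketch using Python's `sqlite3` (not DB2), with invented table and column names `users`, `src1`, `src2`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users(id INTEGER);
    CREATE TABLE src1(user_id INTEGER);
    CREATE TABLE src2(user_id INTEGER);
    INSERT INTO users VALUES (1), (2), (3), (4);
    INSERT INTO src1 VALUES (1), (2);
    INSERT INTO src2 VALUES (2), (4);
""")

# Original form: a single IN over the union of the two subqueries.
original = con.execute("""
    SELECT id FROM users
    WHERE id IN (SELECT user_id FROM src1 UNION SELECT user_id FROM src2)
    ORDER BY id
""").fetchall()

# Rewrite: two INs joined with OR, no UNION needed.
rewritten = con.execute("""
    SELECT id FROM users
    WHERE id IN (SELECT user_id FROM src1)
       OR id IN (SELECT user_id FROM src2)
    ORDER BY id
""").fetchall()

assert original == rewritten == [(1,), (2,), (4,)]
```

Note that the LEFT JOIN variant above is only row-for-row equivalent if each subquery returns distinct values; duplicates in the subqueries would multiply rows in the join output.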

Why is UNION faster than an OR statement [duplicate]

This question already has answers here:
UNION ALL vs OR condition in sql server query
(3 answers)
Closed 9 years ago.
I have a problem where I need to find records that either have a measurement that matches a value, or do not have that measurement at all. I solved that problem with three or four different approaches, using JOINs, using NOT IN and using NOT EXISTS. However, the query ended up being extremely slow every time. I then tried splitting the query in two, and they both run very fast (three seconds). But combining the queries using OR takes more than five minutes.
Reading on SO I tried UNION, which is very fast, but very inconvenient for the script I am using.
So two questions:
Why is UNION so much faster? (Or why is OR so slow)?
Is there any way I can force MSSQL to use a different approach for the OR statement that is fast?
The reason is that using OR in a query will often cause the Query Optimizer to abandon use of index seeks and revert to scans. If you look at the execution plans for your two queries, you'll most likely see scans where you are using the OR and seeks where you are using the UNION. Without seeing your query it's not really possible to give you any ideas on how you might be able to restructure the OR condition. But you may find that inserting the rows into a temporary table and joining on to it may yield a positive result.
Also, it is generally best to use UNION ALL rather than UNION if duplicates are acceptable in the results, as you remove the cost of the duplicate-elimination step.
There is currently no way in SQL Server to force a UNION execution plan if no UNION statement was used. If the only difference between the two parts is the WHERE clause, create a view with the complex query. The UNION query then becomes very simple:
SELECT * FROM dbo.MyView WHERE <cond1>
UNION ALL
SELECT * FROM dbo.MyView WHERE <cond2>
It is important to use UNION ALL in this context whenever possible. If you just use UNION, SQL Server has to filter out duplicate rows, which requires an expensive sort operation in most cases.
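Here is a minimal sketch of that view-plus-UNION-ALL pattern, using Python's `sqlite3` as a stand-in engine. The table `orders` and the two conditions (`status = 'open'`, `amount > 15`) are invented placeholders for the complex query and `<cond1>`/`<cond2>`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(id INTEGER, status TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        (1, 'open', 10), (2, 'closed', 20), (3, 'open', 30);
    -- The "complex query" lives once, inside the view.
    CREATE VIEW MyView AS SELECT id, status, amount FROM orders;
""")

rows = con.execute("""
    SELECT * FROM MyView WHERE status = 'open'
    UNION ALL
    SELECT * FROM MyView WHERE amount > 15
    ORDER BY id
""").fetchall()

# Row id 3 matches both branches; UNION ALL keeps both copies,
# which is why the conditions should be mutually exclusive
# (or plain UNION used) when duplicates are not acceptable.
assert len(rows) == 4
assert rows[2] == rows[3] == (3, 'open', 30)
```

The complex logic is written once in the view, and each branch of the UNION ALL stays a trivial filtered SELECT.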

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem; a try-it-and-see methodology won't pay off here, as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
    select foo.a, bar.b, baz.c
    from foo, bar, baz
    -- updated for clarity's sake
    where foo.a = bar.b
    and bar.b = baz.c
) t
group by a, b, c
vice
create table results as
select foo.a, bar.b, baz.c
from foo, bar, baz
where foo.a = bar.b
and bar.b = baz.c;
create index results_spanning on results(a, b, c);
select * from results group by a, b, c;
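For reference, both roads produce the same grouped rows; the question is purely about cost. A small sketch using Python's `sqlite3`, with invented single-column shapes for `foo`, `bar` and `baz`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE foo(a INTEGER); INSERT INTO foo VALUES (1), (2), (2);
    CREATE TABLE bar(b INTEGER); INSERT INTO bar VALUES (1), (2);
    CREATE TABLE baz(c INTEGER); INSERT INTO baz VALUES (2);
""")

# Road 1: GROUP BY directly over the multi-table select.
direct = con.execute("""
    SELECT a, b, c
    FROM foo JOIN bar ON foo.a = bar.b JOIN baz ON bar.b = baz.c
    GROUP BY a, b, c
""").fetchall()

# Road 2: materialize the join, build the spanning index, then group.
con.executescript("""
    CREATE TABLE results AS
        SELECT a, b, c
        FROM foo JOIN bar ON foo.a = bar.b JOIN baz ON bar.b = baz.c;
    CREATE INDEX results_spanning ON results(a, b, c);
""")
indexed = con.execute(
    "SELECT a, b, c FROM results GROUP BY a, b, c").fetchall()

assert direct == indexed == [(2, 2, 2)]
```

Since the table is queried only once, road 2 pays for writing every joined row plus building the index before the GROUP BY even starts, which is why the answers below favor road 1.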
So in case it isn't clear: the top query performs the GROUP BY outright against the multi-table select, thus preventing me from using an index. The second approach lets me create a new table that stores the results of the query, then create a spanning index, and finally run the GROUP BY query so it can utilize the index.
What is the complexity difference of these two approaches, i.e. how do they scale and which is preferable in the case of large quantities of data. Also, the main issue is the performance of the overall select so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three columns indexed in their own right? How often do you want to run the query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on commit, fast refreshable views, and they will add to your commit time processing slightly, but they will always give the same result as running the base query. On demand MVs will become stale as the underlying data changes until these are refreshed. You'll need to determine whether this is acceptable or not.
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally only apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place? The query as you posted it doesn't do any aggregation, so the only reason for doing GROUP BY would be to eliminate duplicate rows, i.e. a DISTINCT operation. If that is actually the case then you are doing some form of cartesian join, and one way of tuning the query would be to fix the WHERE clause so that it only returns discrete records.