I am trying to compare Spark SQL query performance for OR vs IN vs UNION ALL.
Option-1:
select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ = '0102' OR prod_typ = '0265';
Option-2:
select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ IN ('0102', '0265');
Option-3:
select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ = '0102'
union all
select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ = '0265';
I have checked the query plans for all of the above options and have found that with OR, the source table is scanned only ONCE. However, with UNION ALL I found that the source table is being scanned twice. Hence, logically speaking, the query using OR and IN would be more efficient than the one with UNION ALL.
But I read somewhere that UNION ALL is preferred (over OR) in such scenarios. Hence, I am a bit confused as to which one to follow: OR vs IN vs UNION ALL.
Can anyone please help me understand which is the right approach? Any help is highly appreciated.
Note: We use SparkSQL version 2.4.0
Thanks
UNION ALL can also be made to reuse exchanges in specific situations by excluding the optimizer rule PushDownPredicate.
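As a quick sanity check that the three forms return the same rows, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine. This is an assumption made for illustration: SQLite is not Spark, the sample rows are made up, and nothing here says anything about Spark's scan counts or plans.

```python
import sqlite3

# In-memory stand-in for the cust_prod table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust_prod (cust_id INTEGER, prod_id INTEGER, prod_typ TEXT)")
conn.executemany(
    "INSERT INTO cust_prod VALUES (?, ?, ?)",
    [(1, 10, "0102"), (2, 20, "0265"), (3, 30, "0999"), (4, 40, "0102")],
)

queries = {
    "or": """
        SELECT cust_id, prod_id, prod_typ FROM cust_prod
        WHERE prod_typ = '0102' OR prod_typ = '0265'
    """,
    "in": """
        SELECT cust_id, prod_id, prod_typ FROM cust_prod
        WHERE prod_typ IN ('0102', '0265')
    """,
    "union_all": """
        SELECT cust_id, prod_id, prod_typ FROM cust_prod WHERE prod_typ = '0102'
        UNION ALL
        SELECT cust_id, prod_id, prod_typ FROM cust_prod WHERE prod_typ = '0265'
    """,
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}
print(results["or"] == results["in"] == results["union_all"])
```

The UNION ALL form only matches the other two because the two predicates cannot both be true for the same row; with overlapping predicates it would return duplicates. Which form is faster still has to be judged from the query plans in Spark itself, as the question does.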
I have a doc_no and need to find out its type, so I query the table:
SELECT o.type_id
FROM operation_out o
WHERE o.doc_no = 17025337;
But what should I do if I have 3 different tables
operation_out, operation_in, operation_reverse
and the given doc_no can be in any of them?
I have already tried union:
SELECT type_id
FROM (SELECT doc_no, mt_o.type_id
FROM mt_operation_out mt_o
UNION ALL
SELECT doc_no, mt_in.type_id
FROM mt_operation_in mt_in
UNION ALL
SELECT doc_no, mt_r.type_id
FROM mt_operation_reverse mt_r)
WHERE doc_no = 17025337;
Is there another alternative way of writing this? Is there any syntax in SQL for this kind of task?
Thank you all for your recommendations. In the end, I created a view based on the UNION ALL, and I'm going to use it wherever needed.
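For readers who want to see the view approach end to end, here is a minimal sketch using Python's sqlite3 module as a stand-in database. The view name all_operations, the extra source column, and the sample rows are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mt_operation_out     (doc_no INTEGER, type_id INTEGER);
CREATE TABLE mt_operation_in      (doc_no INTEGER, type_id INTEGER);
CREATE TABLE mt_operation_reverse (doc_no INTEGER, type_id INTEGER);

-- Invented sample rows.
INSERT INTO mt_operation_out     VALUES (1, 100);
INSERT INTO mt_operation_in      VALUES (17025337, 200);
INSERT INTO mt_operation_reverse VALUES (3, 300);

-- Hypothetical view; 'source' records which table a row came from.
CREATE VIEW all_operations AS
    SELECT doc_no, type_id, 'out' AS source FROM mt_operation_out
    UNION ALL
    SELECT doc_no, type_id, 'in' FROM mt_operation_in
    UNION ALL
    SELECT doc_no, type_id, 'reverse' FROM mt_operation_reverse;
""")


def find_type(doc_no):
    """Return (type_id, source) pairs for a document number, from any table."""
    return conn.execute(
        "SELECT type_id, source FROM all_operations WHERE doc_no = ?",
        (doc_no,),
    ).fetchall()


print(find_type(17025337))
```

Whether the doc_no filter is pushed down into each branch of the view depends on the database's optimizer, so it is worth checking the plan on the real system.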
We have separate Databases in DB2 for each customer but with same table structure in each of them. For a .Net Application I need to scan all the databases and show result for the matching entries to the user. I was wondering would it be faster to do a UNION ALL for all the databases or run each query in parallel and then combine them from my .Net Application.
Select EmpName, EmpSal, EmpDate
from A.Emptable
where EmpDate > '2015-01-01'
UNION ALL
Select EmpName, EmpSal, EmpDate
from B.Emptable
where EmpDate > '2015-01-01'
UNION ALL
Select EmpName, EmpSal, EmpDate
from C.Emptable
where EmpDate > '2015-01-01'
VS.
Creating a .NET method GetEmpData to call each query and combine their results as:
var response = await Task.WhenAll(GetEmpData(A, "2015-01-01"), GetEmpData(B, "2015-01-01"), GetEmpData(C, "2015-01-01"));
var result = response[0].Concat(response[1]).Concat(response[2]).ToList();
Thanks.
If your federation is optimally configured for all aspects (especially pushdown and performance aspects), then I expect the UNION ALL to be more maintainable in the long term. As regards relative performance, you can measure, because so many factors influence that.
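The application-side alternative from the question (one query per database, run concurrently, results concatenated) can be sketched as follows. This is only an illustration under loud assumptions: it uses Python threads instead of .NET tasks, in-memory SQLite databases instead of DB2, and invented rows in CUSTOMER_ROWS. It says nothing about which approach is faster on a real federation.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-customer data standing in for the separate databases A, B, C.
CUSTOMER_ROWS = {
    "A": [("Ann", 5000, "2015-03-01"), ("Al", 4000, "2014-12-31")],
    "B": [("Bob", 6000, "2015-06-15")],
    "C": [("Cy", 3000, "2013-01-01")],
}


def get_emp_data(customer, cutoff):
    """Query one customer's database; each call opens its own connection."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Emptable (EmpName TEXT, EmpSal INTEGER, EmpDate TEXT)")
        conn.executemany("INSERT INTO Emptable VALUES (?, ?, ?)", CUSTOMER_ROWS[customer])
        return conn.execute(
            "SELECT EmpName, EmpSal, EmpDate FROM Emptable WHERE EmpDate > ?",
            (cutoff,),
        ).fetchall()
    finally:
        conn.close()


def get_all_emp_data(customers, cutoff):
    """Run the per-customer queries concurrently and concatenate the results."""
    with ThreadPoolExecutor(max_workers=len(customers)) as pool:
        # map() yields results in the same order as `customers`.
        per_customer = pool.map(lambda c: get_emp_data(c, cutoff), customers)
        return [row for rows in per_customer for row in rows]


result = get_all_emp_data(["A", "B", "C"], "2015-01-01")
print(result)
```

Timing this against the single UNION ALL statement on the actual system is the only reliable way to compare the two.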
I would like to optimize performance while bringing together queries on many SAS data sets with the same metadata. At this point I have:
select * from
(select t1.column_a, t1.column_b
from table t1)
Union
(select t2.column_a, t2.column_b
from table t2)
and so on.
Each query returns unique rows. Do I save time if I use UNION ALL instead?
Yes, you are correct. See the question "What is the difference between UNION and UNION ALL?"
If you are sure that you don't have duplicates, then you can just use UNION ALL instead of UNION. UNION is slower because it has to remove the duplicates.
At some point during UNION there will be checks for duplicates. Even if these checks all turn up false, they are an extra step. UNION ALL will probably be more efficient but as dfundako pointed out, you'll have to test and see to be sure of a difference in speed.
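The difference in behaviour is easy to see on a tiny example. The sketch below uses Python's sqlite3 module as a stand-in (an assumption; the question is about SAS data sets) with two made-up tables t1 and t2 that share one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (column_a INTEGER, column_b TEXT);
CREATE TABLE t2 (column_a INTEGER, column_b TEXT);
INSERT INTO t1 VALUES (1, 'x'), (2, 'y');
INSERT INTO t2 VALUES (2, 'y'), (3, 'z');  -- (2, 'y') also exists in t1
""")

# UNION removes duplicate rows, which costs an extra deduplication step.
union_rows = conn.execute("""
    SELECT column_a, column_b FROM t1
    UNION
    SELECT column_a, column_b FROM t2
""").fetchall()

# UNION ALL simply appends the second result to the first.
union_all_rows = conn.execute("""
    SELECT column_a, column_b FROM t1
    UNION ALL
    SELECT column_a, column_b FROM t2
""").fetchall()

print(len(union_rows), len(union_all_rows))
```

If the inputs are already known to be disjoint, both forms return the same rows and UNION ALL just skips the duplicate check.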
Can I be sure that the result set of the following script will always be sorted like this O-R-D-E-R ?
SELECT 'O'
UNION ALL
SELECT 'R'
UNION ALL
SELECT 'D'
UNION ALL
SELECT 'E'
UNION ALL
SELECT 'R'
Can it be proved to sometimes be in a different order?
There is no inherent order, you have to use ORDER BY. For your example you can easily do this by adding a SortOrder to each SELECT. This will then keep the records in the order you want:
SELECT 'O', 1 SortOrder
UNION ALL
SELECT 'R', 2
UNION ALL
SELECT 'D', 3
UNION ALL
SELECT 'E', 4
UNION ALL
SELECT 'R', 5
ORDER BY SortOrder
You cannot guarantee the order unless you specifically provide an order by with the query.
No it does not. SQL tables are inherently unordered. You need to use order by to get things in a desired order.
The issue is not whether it works once when you try it out. The issue is whether you can trust this behavior. And you cannot. SQL Server does not even guarantee the ordering for this:
select *
from (select t.*
from t
order by col1
) t
The SQL Server documentation for ORDER BY says:
When ORDER BY is used in the definition of a view, inline function,
derived table, or subquery, the clause is used only to determine the
rows returned by the TOP clause. The ORDER BY clause does not
guarantee ordered results when these constructs are queried, unless
ORDER BY is also specified in the query itself.
A fundamental principle of the SQL language is that tables are not ordered. So, although your query might work in many databases, you should use the version suggested by BlueFeet to guarantee the ordering of results.
Try removing all of the ALLs, for example. Or even just one of them. Now consider that the type of optimization that has to happen there (and many other types) will also be possible when the SELECT queries are actual queries against tables, and are optimized separately. Without an ORDER BY, ordering within each query will be arbitrary, and you can't guarantee that the queries themselves will be processed in any order.
Saying UNION ALL with no ORDER BY is like saying "Just throw all the marbles on the floor." Maybe every time you throw all the marbles on the floor, they end up being organized by color. That doesn't mean the next time you throw them on the floor they'll behave the same way. The same is true for ordering in SQL Server - if you don't say ORDER BY then SQL Server assumes you don't care about order. You may see by coincidence a certain order being returned all the time, but many things can affect the arbitrary order that has been selected next time. Data changes, statistics changes, recompile, plan flush, upgrade, service pack, hotfix, trace flag... ad nauseam.
I will put this in large letters to make it clear:
You cannot guarantee an order without ORDER BY
Some further reading:
Bad habits to kick : relying on undocumented behavior
Also, please read this post by Conor Cunningham, a pretty smart guy on the SQL team.
No. You get the records in whatever way SQL Server fetches them for you. You can apply an order on a unioned result set by 1-based index thusly:
SELECT 1, 'O'
UNION ALL
SELECT 2, 'R'
UNION ALL
SELECT 3, 'D'
UNION ALL
SELECT 4, 'E'
UNION ALL
SELECT 5, 'R'
ORDER BY 1
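To make the point that the sort key, not the textual order of the branches, decides the output, here is a small sketch that writes the branches in scrambled order. It uses Python's sqlite3 module as a stand-in for SQL Server (an assumption), and the column names letter and sort_order are chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The branches are deliberately out of order; only ORDER BY fixes the sequence.
letters = conn.execute("""
    SELECT 'D' AS letter, 3 AS sort_order
    UNION ALL SELECT 'R', 5
    UNION ALL SELECT 'O', 1
    UNION ALL SELECT 'E', 4
    UNION ALL SELECT 'R', 2
    ORDER BY sort_order
""").fetchall()

word = "".join(letter for letter, _ in letters)
print(word)
```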
Here's my query:
SELECT my_view.*
FROM my_view
WHERE my_view.trial in (select 2 as trial_id from dual union select 3 from dual union select 4 from dual)
and my_view.location like ('123-%')
When I execute this query it returns results which do not conform to the my_view.location like ('123-%') condition. It's as if that condition is being ignored completely. I can even change it to my_view.location IS NULL and it returns the same results, despite that field being not-nullable.
I know this query seems ridiculous with the selects from dual, but I've structured it this way to replicate a problem I have when I use a 'WITH' clause (the results of that query are where the selects from dual inline view are).
I can modify the query like so and it returns the expected results:
SELECT my_view.*
FROM my_view
WHERE my_view.trial in (2, 3, 4)
and my_view.location like ('123-%')
Unfortunately I do not know the trial values up front (they are queried for in a 'WITH' clause) so I cannot structure my query this way. What am I doing wrong?
I will say that the my_view view is composed of 3 other views whose results are combined with UNION ALL, and each of them retrieves some data over a DB link. Not that I believe that should matter, but in case it does.
One thing you could try if you don't have luck with this route is to replace "IN" with an "EXISTS" or "NOT EXISTS" statement.
If you could accomplish what you want using joins, that would be the best option because of performance. If you have views pulling data from views, you can often make a single query to do what you want that gives you better performance using subqueries.
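The IN-to-EXISTS rewrite suggested above can be sketched like this. It is only an illustration of the rewrite, under stated assumptions: Python's sqlite3 module stands in for Oracle, my_view is a plain table with invented rows, and the CTE name trials is hypothetical. It does not reproduce the DB link behaviour from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Plain table standing in for the real view; rows are invented.
CREATE TABLE my_view (trial INTEGER, location TEXT NOT NULL);
INSERT INTO my_view VALUES (2, '123-A'), (3, '999-B'), (4, '123-C'), (5, '123-D');
""")

# Original shape: IN over a subquery fed by a WITH clause.
in_sql = """
    WITH trials(trial_id) AS (SELECT 2 UNION SELECT 3 UNION SELECT 4)
    SELECT v.trial, v.location
    FROM my_view v
    WHERE v.trial IN (SELECT trial_id FROM trials)
      AND v.location LIKE '123-%'
"""

# Rewritten shape: correlated EXISTS instead of IN.
exists_sql = """
    WITH trials(trial_id) AS (SELECT 2 UNION SELECT 3 UNION SELECT 4)
    SELECT v.trial, v.location
    FROM my_view v
    WHERE EXISTS (SELECT 1 FROM trials t WHERE t.trial_id = v.trial)
      AND v.location LIKE '123-%'
"""

in_rows = sorted(conn.execute(in_sql).fetchall())
exists_rows = sorted(conn.execute(exists_sql).fetchall())
print(in_rows == exists_rows)
```

Both forms should return the same rows on any engine; if they do not on the real system, that points at how the predicate is being applied, which the plan will show.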
If you do
EXPLAIN PLAN FOR
SELECT my_view.*
FROM my_view
WHERE my_view.trial in (select 2 as trial_id from dual union select 3 from dual union select 4 from dual)
and my_view.location like ('123-%');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
you should see where the location predicate is (or is not) being applied. My bet is that it has something to do with the DB links and you won't be able to reproduce it if all the tables are local.
Optimizing a distributed query gets complicated.
Try changing the UNION query to use UNION ALL, as in:
SELECT my_view.*
FROM my_view
WHERE my_view.trial in (select 2 as trial_id from dual
UNION ALL
select 3 AS TRIAL_ID from dual
UNION ALL
select 4 AS TRIAL_ID from dual)
and my_view.location like ('123-%')
I also put in "AS TRIAL_ID" on the 3 and 4 cases. I agree that neither of these should matter, but I've run into cases occasionally where things that I thought shouldn't matter mattered.
Good luck.