In SQL, does the COUNT(*) happen after the JOIN? - sql

I have a query like follows :
SELECT LOCATION_CODE AS "Location",
       COUNT(prha.authorization_status) AS "Reqn Lines Count Approved"
FROM table1 t1
JOIN table2 t2 ... etc
JOIN ...
My question is: suppose that I want to tally up both the counts of something and the "opposite" counts (i.e. counting the NULLs and zeros), all within one query.
So I was wondering, is this possible? Or does the COUNT(*) only happen after the JOINs are applied? Thanks.

I'm not sure that I completely understand what you're asking, but recent versions of Oracle do not technically have to perform joins at all if they would not affect the required result.
If you were counting records from a table, and joined to a table against which there was a foreign key constraint, then the optimiser can infer that the join is not required and can omit it.
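For illustration, a hedged sketch of a query where that kind of join elimination could apply, using hypothetical tables (order_lines is assumed to have an enforced, NOT NULL foreign key to orders):
-- Nothing is selected from orders, and the FK guarantees each order_lines row
-- matches exactly one orders row, so the optimiser may answer this from
-- order_lines alone and omit the join (hypothetical table and column names).
SELECT COUNT(*)
FROM order_lines ol
JOIN orders o ON o.order_id = ol.order_id;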
Furthermore, I seem to recall that the optimiser can also perform aggregations prior to joins in some circumstances, if it would be more efficient to do so. For example, when joining a DW fact table to a dimension table, grouping at the atomic level of the dimension, and selecting many dimension columns, the aggregation can be performed on the fact table prior to the join to the dimension, in order to reduce the size of the sort needed for the aggregation.
So while under normal circumstances the join is going to be executed first, in some cases it will not.
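As for tallying the matching counts and the "opposite" (NULL or zero) counts in one query, conditional aggregation (a COUNT or SUM over a CASE expression) is one common way to do it. A minimal sketch, assuming hypothetical column names (authorization_status, line_qty), a hypothetical 'APPROVED' status value, and the same FROM/JOIN clauses as in the question:
SELECT LOCATION_CODE AS "Location",
       -- COUNT ignores NULLs, so each CASE without an ELSE tallies one bucket
       COUNT(CASE WHEN prha.authorization_status = 'APPROVED' THEN 1 END) AS "Reqn Lines Count Approved",
       COUNT(CASE WHEN prha.authorization_status IS NULL OR prha.line_qty = 0 THEN 1 END) AS "Reqn Lines Count Null Or Zero"
FROM ... -- same joins as above
GROUP BY LOCATION_CODE;
Logically, this aggregation is applied to the already-joined rows, so the JOINs still come first.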

Related

Performance over PostgreSQL conditional join - Query optimization

Let's assume I have three tables. subscriptions has a field called type, which can only have two values:
FREE
PREMIUM
The other two tables are called premium_users and free_users. I'd like to perform a LEFT JOIN, starting from the subscriptions table, but the thing is that depending on the value of the type field I will ONLY find the matching row in one or the other table, i.e. if type equals 'FREE', then the matching row will ONLY be in the free_users table, and vice versa.
I'm thinking of some ways to do this, such as LEFT JOINing both tables and then using a COALESCE function to get the non-null value, or with a UNION of two queries that each use an INNER JOIN, but I'm not quite sure which would be the best way in terms of performance. Also, as you would guess, the free_users table is almost five times larger than the premium_users table. Another thing you should know is that I'm joining by the user_id field, which is the PK in both free_users and premium_users.
So, my question is: which would be the most performant way to do a JOIN that depending on the value of type column will match to one table or another. Would this solution be any different if instead of two tables there were three, or even more?
Disclaimer: this DB is PostgreSQL and is already up and running in production, and as much as I'd like to have a single users table, that won't happen in the short term.
What is the best in terms of performance? Well, you should try on your data and your systems.
My recommendation is two left joins:
select s.*,
       coalesce(fu.name, pu.name) as name
from subscriptions s
left join free_users fu
  on fu.free_id = s.subscription_id and s.type = 'free'
left join premium_users pu
  on pu.premium_id = s.subscription_id and s.type = 'premium';
You want indexes on free_users(free_id) and premium_users(premium_id). These are probably "free" because these ids should be the primary keys in their tables.
If you use union all, then the optimizer may not use indexes for the joins. And not using indexes could have a dastardly impact on performance.
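For comparison, the union all alternative mentioned in the question might look like the following sketch (same assumed table and column names as above), with each branch filtering on type and inner joining its own table:
-- union all sketch: one branch per subscription type
select s.*, fu.name
from subscriptions s
join free_users fu on fu.free_id = s.subscription_id
where s.type = 'free'
union all
select s.*, pu.name
from subscriptions s
join premium_users pu on pu.premium_id = s.subscription_id
where s.type = 'premium';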

Why does the query optimizer use sort after merge join?

Consider this query:
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer.dbo.pt
where exists (select 1 from [dbo].[dt] dt
where pt.map=dt.map and dt.pda=pt.pda and dt.canceled=0)
except
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer_reporting.dbo.pt
I made this to make sure there is no data difference in the same table (pt) between a replication publisher database (appnameV2_Developer) and its subscriber database (appnameV2_Developer_reporting). The specific replication article has a semijoin on dt.
dt is a transaction header table with PK (map,pda)
pt is a transaction detail table with PK (map,pda,line)
Here's the execution plan
So, we have a Right Semi Join merge join. I would expect its result to be ordered by (map,pda,line). But then, a sort operator on (map,pda,line) is called.
Why does this sort occur (or, more accurately: why is the data not already sorted by that point)? Is the query optimizer lacking the logic of "when merge joining then its output is (still) sorted on the join predicates"? Am I missing something?
Because it decided to use a "Merge Join" to execute the EXCEPT clause. In order to perform a Merge Join both datasets must have the same ordering.
The thing is, the inner Merge Join (before the EXCEPT) is based on the table dt, not on pt. Therefore, the resulting rows won't have the same ordering as the other side of the EXCEPT, that is based on pt.
Why does SQL Server do that? Not clear. I would have done it differently. Maybe the stats are not updated. Maybe there is a small number of rows, so the strategy does not matter too much.
The results from the first merge will be sorted by map, pda, line. However, you yourself mentioned join predicates, and the join predicates for this first merge are only based on map, pda (they're the predicates from inside the exists clause, except the cancelled one has been pushed down to the index scan). All that that first merge required was input sorted by map and pda, and so that's the only sort order guaranteed on that data, so far as the rest of the query is concerned.
But as we know, the output of this first merge was actually derived from input that was additionally sorted by line. It appears the optimizer isn't currently able to spot this circumstance. It may be that the order of optimizations means that it's unlikely ever to recognise this situation. So currently, it introduces the extra sort.

Performance of JOINS in SAP HANA Calculation View

For Example:
I have 4 columns (A,B,C,D).
I thought that instead of connecting each and every column in the join, I should create a concatenated column in both projections (CA_CONCAT -> A+B+C+D) and join on that, just to check which method performs better.
It was working faster earlier, but in a few CVs this method is sometimes slower, especially when filtering!
Can anyone suggest which is the more efficient method?
I don't think JOIN conditions with concatenated fields will perform better.
Although we say that in general there is no need for indexes on column tables in a HANA database, column tables have a structure that effectively works like an index on every column.
So if you concatenate 4 columns and produce a new calculated field, you first lose the option to use those per-column structures on the 4 columns and on the corresponding joining columns.
I did not check the execution plan, but it will probably do a full scan on these columns.
In fact, I'm surprised you mentioned that it worked faster and that you experienced problems only in a few CVs.
Concatenation, or applying any function to a database column, is by itself extra work on top of the SELECT. It might involve an implicit type cast, which can add more overhead than expected.
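To make the comparison concrete, the two join styles look roughly like this in plain SQL (hypothetical tables tab1 and tab2; in a calculation view the same idea applies to the join node):
-- Join on the individual columns: each column can be matched using
-- HANA's per-column structures, and filters can be pushed to both sides.
SELECT *
FROM tab1 t1
JOIN tab2 t2
  ON t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d;
-- Join on a concatenated key: the key expression has to be computed for
-- every row before matching, which tends to block pruning and filter pushdown.
SELECT *
FROM tab1 t1
JOIN tab2 t2
  ON (t1.a || t1.b || t1.c || t1.d) = (t2.a || t2.b || t2.c || t2.d);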
First, I would suggest setting your table to column store and checking the new performance.
After that, I would suggest separating the JOIN into multiple JOINs if you are using an OR condition in your join.
Third, an INNER JOIN will give you better performance compared to a LEFT JOIN or LEFT OUTER JOIN.
Another thing about JOINs and performance: you'd better join on PRIMARY KEYS and not on every column.
For me, both times the join on multiple fields performed faster than the join on the concatenated field. For the filtering scenario, PlanViz shows that when I join on multiple fields, the filter gets pushed down to both tables. On the other hand, when I join on the concatenated field, only one table gets filtered.
However, if you put a filter on both fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then the filter can be pushed down to both tables.
Like:
Select * from CalculationView where PRODUCT = 'A' and MATERIAL = 'A'

Is the GROUP BY clause applied after the WHERE clause in Hive?

Suppose I have the following SQL:
select user_group, count(*)
from table
where user_group is not null
group by user_group
Suppose further that 99% of the data has null user_group.
Will this discard the rows with null before the GROUP BY, or will one poor reducer end up with 99% of the rows that are later discarded?
I hope it is the former. That would make more sense.
Bonus points if you say what will happen by Hive version. We are using 0.11 and migrating to 0.13.
Bonus points if you can point to any documentation that confirms.
Sequence
FROM & JOINs determine & filter the rows
WHERE filters the rows further
GROUP BY combines those rows into groups
HAVING filters the groups
SELECT evaluates the expressions for the remaining rows/groups
ORDER BY arranges the remaining rows/groups
The first step is always the FROM clause. In your case, this is pretty straight-forward, because there's only one table, and there aren't any complicated joins to worry about. In a query with joins, these are evaluated in this first step. The joins are assembled to decide which rows to retrieve, with the ON clause conditions being the criteria for deciding which rows to join from each table. The result of the FROM clause is an intermediate result. You could think of this as a temporary table, consisting of combined rows which satisfy all the join conditions. (In your case the temporary table isn't actually built, because the optimizer knows it can just access your table directly without joining to any others.)
The next step is the WHERE clause. In a query with a WHERE clause, each row in the intermediate result is evaluated against the WHERE conditions and either discarded or retained. So rows with a NULL user_group are discarded before the GROUP BY clause is reached.
Next comes the GROUP BY. If there's a GROUP BY clause, the intermediate result is now partitioned into groups, one group for every combination of values in the columns in the GROUP BY clause.
Now comes the HAVING clause. The HAVING clause operates once on each group, and all rows from groups which do not satisfy the HAVING clause are eliminated.
Next comes the SELECT. From the rows of the new intermediate result produced by the GROUP BY and HAVING clauses, the SELECT now assembles the columns it needs.
Finally, the last step is the ORDER BY clause.
This query discards the rows with NULL before the GROUP BY operation.
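If you want to confirm the behaviour on your own Hive version, EXPLAIN shows where the filter sits relative to the aggregation; in the resulting plan, the Filter Operator on user_group should appear in the map-side operator tree, before the Group By Operator:
-- same query as in the question, prefixed with EXPLAIN
explain
select user_group, count(*)
from table
where user_group is not null
group by user_group;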
Hope this link is useful:
http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.2.0/bk_dataintegration/content/hive-013-feature-subqueries-in-where-clauses.html

Where clause affecting join

Dumb question time. Oracle 10g.
Is it possible for a where clause to affect a join?
I've got a query in the form of:
select *
from (select product, product_name
      from products p
      join product_serial ps on p.id = ps.id
      join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';
Obviously this is a contrived example. No real need to show the table structure as it's all imaginary. Unfortunately, I can't show the real table structure or query. In this case, p.product_value is a VARCHAR2 field which in certain rows has an ID stored inside it rather than text. (Yes, bad design - but something I inherited and am unable to change.)
The issue is in the join. If I leave out the where clause, the query works and rows are returned. However, if I add the where clause, I get an "invalid number" error on the pd.product_value = to_number(p.product_value) join condition.
Obviously, the "invalid number" error happens when rows are joined which contain non-digits in the p.product_value field. However, my question is how are those rows being selected? If the join succeeds without the outer where clause, shouldn't the outer where clause just select rows from the result of the join? It appears what is happening is the where clause is affecting what rows are joined, despite the join being in an inner query.
Is my question making sense?
It affects the plan that's generated.
The actual order that tables are joined (and so filtered) is not dictated by the order you write your query, but by the statistics on the tables.
In one version, the generated plan coincidentally means that the 'bad' rows never get processed, because the preceding joins filter the result set down to the point where they are never joined on.
The introduction of the WHERE clause has meant that Oracle now believes a different join order is better (because filtering by the product name requires a certain index, or because it narrows the data down a lot, etc).
This new order means that the 'bad' rows get processed before the join that filters them out.
I would endeavour to clean the data before querying it. Possibly by creating a derived column where the value is already cast to a number, or left as NULL if it is not possible to do so.
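One hedged sketch of such a derived value on 10g, assuming the valid values are plain unsigned integers, is to guard the conversion with REGEXP_LIKE so that non-numeric strings come back as NULL:
-- convert product_value only when it looks numeric, otherwise return NULL
-- (the regexp is an assumption about what a "valid" value looks like)
SELECT p.id,
       CASE
         WHEN REGEXP_LIKE(p.product_value, '^[0-9]+$')
           THEN TO_NUMBER(p.product_value)
       END AS product_value_num
FROM products p;
The same expression could be used to populate the derived column, so later queries never have to call to_number on raw data.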
You can also use EXPLAIN PLAN to see the different plans being generated from your queries.
Short answer: yes.
Long answer: the query engine is free to rewrite your query however it wants, as long as it returns the same results. All of the query is available to it to use for the purpose of producing the most efficient query it can.
In this case, I'd guess that there is an index that covers what you want, but it doesn't cover product_name. When you add that to the where clause, the index isn't used; instead there's a scan in which both conditions are tested at the same time, hence your error.
Which is really an error in your join condition: you shouldn't be using to_number unless you are sure the value is a number.
I guess your to_number(p.product_value) is only valid for rows with a valid product_name.
What happens is that your join is applied before your where clause, resulting in the failure of the to_number function.
What you need to do is include your product_name like '%prototype%' as a JOIN clause like this:
select *
from (select product, product_name
      from products p
      join product_serial ps on p.id = ps.id
      join product_data pd on product_name like '%prototype%'
                          and pd.product_value = to_number(p.product_value));
For more background (and a really good read), I'd suggest reading Jonathan Gennick's Subquery Madness.
Basically, the problem is that Oracle is free to evaluate predicates in any order. So it is free to push (or not push) the product_name predicate into your subquery. It is free to evaluate the join conditions in any order. So if Oracle happens to pick a query plan where it filters out the non-numeric product_value rows before it applies the to_number, the query will succeed. If it happens to pick a plan where it applies the to_number before filtering out the non-numeric product_value rows, you'll get an error. Of course, it's also possible that it will return the first N rows successfully and then you'll get an error when you try to fetch row N+1, because row N+1 is the first time that it tries to apply the to_number predicate to non-numeric data.
Other than fixing the data model, you could potentially throw some hints into the query to force Oracle to evaluate the predicate that ensures that all the non-numeric data is filtered out before the to_number predicate is applied. But in general, it's a bit challenging to fully hint a query in a way that will force the optimizer to always evaluate things in the "proper" order.