Where clause affecting join - sql

Dumb question time. Oracle 10g.
Is it possible for a where clause to affect a join?
I've got a query in the form of:
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';
Obviously this is a contrived example. No real need to show the table structure as it's all imaginary. Unfortunately, I can't show the real table structure or query. In this case, p.product_value is a VARCHAR2 field which in certain rows has an ID stored in it rather than text. (Yes, bad design - but something I inherited and am unable to change)
The issue is in the join. If I leave out the where clause, the query works and rows are returned. However, if I add the where clause, I get an "invalid number" error on the pd.product_value = to_number(p.product_value) join condition.
Obviously, the "invalid number" error happens when rows whose p.product_value contains non-digits are joined. My question is: how are those rows being selected at all? If the join succeeds without the outer where clause, shouldn't the outer where clause just select rows from the result of the join? It appears that the where clause is affecting which rows are joined, despite the join being in an inner query.
Is my question making sense?

It affects the plan that's generated.
The actual order in which tables are joined (and so filtered) is dictated not by the order you write your query, but by the statistics on the tables.
In one version, the plan generated coincidentally means that the 'bad' rows never get processed, because the preceding joins filtered the result set down to the point that they're never joined on.
The introduction of the WHERE clause has meant that Oracle now believes a different join order is better (because filtering by the product name requires a certain index, or because it narrows the data down a lot, etc).
This new order means that the 'bad' rows get processed before the join that filters them out.
I would endeavour to clean the data before querying it, possibly by creating a derived column where the value is already cast to a number, or left as NULL where that is not possible.
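For example (a minimal sketch, assuming the embedded IDs are plain unsigned integers; CASE short-circuits, so TO_NUMBER is only applied to values that pass the check):
select p.*,
       case
         when regexp_like(trim(p.product_value), '^[0-9]+$')
         then to_number(p.product_value)
       end as product_value_num
from products p;
You could materialise that as a view (or an extra column) and join on product_value_num instead.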
You can also use EXPLAIN PLAN to see the different plans being generated from your queries.
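For example:
explain plan for
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';

select * from table(dbms_xplan.display);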

Short answer: yes.
Long answer: the query engine is free to rewrite your query however it wants, as long as it returns the same results. All of the query is available to it to use for the purpose of producing the most efficient query it can.
In this case, I'd guess that there is an index that covers what you are wanting, but it doesn't cover product_name. When you add that to the where clause, the index isn't used and instead there's a scan where both conditions are tested at the same time, hence your error.
Which is really an error in your join condition: you shouldn't be using to_number unless you are sure the value is a number.

I guess your to_number(p.product_value) only succeeds for rows with a valid product_name.
What happens is that your join is applied before your where clause, resulting in the failure of the to_number function.
What you need to do is include your product_name like '%prototype%' predicate in the JOIN clause, like this:
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on product_name like '%prototype%' AND
pd.product_value = to_number(p.product_value));

For more background (and a really good read), I'd suggest reading Jonathan Gennick's Subquery Madness.
Basically, the problem is that Oracle is free to evaluate predicates in any order. So it is free to push (or not push) the product_name predicate into your subquery, and it is free to evaluate the join conditions in any order. If Oracle happens to pick a query plan where it filters out the non-numeric product_value rows before it applies the to_number, the query will succeed. If it happens to pick a plan where it applies the to_number before filtering out the non-numeric product_value rows, you'll get an error. Of course, it's also possible that it will return the first N rows successfully and then you'll get an error when you try to fetch row N+1, because row N+1 is the first time that it tries to apply the to_number predicate to non-numeric data.
Other than fixing the data model, you could potentially throw some hints into the query to force Oracle to evaluate the predicate that ensures that all the non-numeric data is filtered out before the to_number predicate is applied. But in general, it's a bit challenging to fully hint a query in a way that will force the optimizer to always evaluate things in the "proper" order.
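For instance, a sketch using Oracle's NO_MERGE and NO_PUSH_PRED hints to keep the inline view intact and stop the product_name predicate from being pushed into it (this influences, but does not strictly guarantee, the evaluation order):
select /*+ NO_MERGE(product_result) NO_PUSH_PRED(product_result) */ *
from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';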

Why does the query optimizer use sort after merge join?

Consider this query:
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer.dbo.pt
where exists (select 1 from [dbo].[dt] dt
where pt.map=dt.map and dt.pda=pt.pda and dt.canceled=0)
except
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer_reporting.dbo.pt
I wrote this query to make sure there is no data difference in the same table (pt) between a replication publisher database (appnameV2_Developer) and its subscriber database (appnameV2_Developer_reporting). The specific replication article has a semi-join on dt.
dt is a transaction header table with PK (map,pda)
pt is a transaction detail table with PK (map,pda,line)
Here's the execution plan
So, we have a Right Semi Join merge join. I would expect its result to be ordered by (map,pda,line). But then, a sort operator on (map,pda,line) is called.
Why does this sort occur (or, more accurately: why is the data not already sorted by that point)? Is the query optimizer lacking the logic of "when merge joining then its output is (still) sorted on the join predicates"? Am I missing something?
Because it decided to use a "Merge Join" to execute the EXCEPT clause. In order to perform a Merge Join, both datasets must have the same ordering.
The thing is, the inner Merge Join (the one before the EXCEPT) is driven by the table dt, not by pt. Therefore, the resulting rows won't have the same ordering as the other side of the EXCEPT, which is based on pt.
Why does SQL Server do that? It's not clear. I would have done it differently. Maybe the stats are not up to date. Maybe the number of rows is small enough that the strategy doesn't matter much.
The results from the first merge will be sorted by map, pda, line. However, you yourself mentioned join predicates, and the join predicates for this first merge are only based on map and pda (they're the predicates from inside the exists clause, except that the canceled one has been pushed down to the index scan). All that first merge required was input sorted by map and pda, so that's the only sort order guaranteed on that data, as far as the rest of the query is concerned.
But as we know, the output from this first merge was actually derived from input that was additionally sorted by line. It appears the optimizer isn't currently able to spot this circumstance, and it may be that the order of optimizations means it's unlikely ever to recognise this situation. So currently, it introduces the extra sort.
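If the extra sort turned out to be expensive, one experiment (a sketch, not a recommendation) would be to append a query-level hint so the plan no longer relies on merge-join ordering at all, then compare the plans:
select map, line, pda, item, qty, qty_gift, pricelist, price, linevalue, vat,
       vat_value, disc_perc, disc_value, dt_disc_value, netvalue, imp_qty,
       imp_value, exp_qty, exp_value, price1, price2, price3, price4,
       justification, notes
from appnameV2_Developer.dbo.pt
where exists (select 1 from [dbo].[dt] dt
              where pt.map = dt.map and dt.pda = pt.pda and dt.canceled = 0)
except
select map, line, pda, item, qty, qty_gift, pricelist, price, linevalue, vat,
       vat_value, disc_perc, disc_value, dt_disc_value, netvalue, imp_qty,
       imp_value, exp_qty, exp_value, price1, price2, price3, price4,
       justification, notes
from appnameV2_Developer_reporting.dbo.pt
option (hash join); -- forces hash joins throughout, removing the ordering requirement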

Does this query include a correlated or non-correlated subquery?

So I've written a simple query that gives me the ID numbers for properties that show up only once in the property_usage table, along with the code for their associated usage type. Since I didn't want to include a column that shows the count of how many times each property ID shows up in the property_usage table, I wrote two nested subqueries to get a list of all the property IDs that show up more than once. I then use the result of those subqueries (a single column of property IDs) to filter out the properties that show up more than once in the table.
Here's the query:
select pu.property_id, pu.usage_type_id
from acres_final_40.property_usage pu
where pu.property_id not in
(select multiple_use_properties
from
(select pu.property_id multiple_use_properties, count(pu.property_id)
from acres_final_40.property_usage pu
group by pu.property_id having count(pu.property_id) > 1))
order by pu.property_id;
My question is: is that innermost subquery correlated or noncorrelated with the outermost query?
I have the following thoughts (see the paragraph below), but I'd like to know for sure whether I'm right about this. I'm learning all this stuff on my own and don't have anyone I can ask about this in person!
My feeling is that it's not, because it seems like the pu.property_id column from the outermost query isn't a value that's passed into the innermost query. It seems like the innermost query may technically be a derived table, in which case my code is sloppy because I don't alias the table name in the FROM clause of that SELECT statement.
In a SQL database query, a correlated subquery (also known as a synchronized subquery) is a subquery (a query nested inside another query) that uses values from the outer query.
(Wikipedia, emph. mine.)
Yours does not, so it's not a correlated subquery. Basically, if you can cut away the subquery and run it as an independent query, without the outer context, it's definitely uncorrelated. It can be done in your case.
BTW you could probably rewrite it using a not exists clause that checks whether another record with the same property_id but a different key exists, and get a better query plan than using count(). This is my speculation, though; only an explain plan would show whether there's a benefit.
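A sketch of that rewrite (this subquery is correlated), using Oracle's rowid to identify "another record"; any unique key would work as well:
select pu.property_id, pu.usage_type_id
from acres_final_40.property_usage pu
where not exists
  (select 1
   from acres_final_40.property_usage pu2
   where pu2.property_id = pu.property_id
     and pu2.rowid <> pu.rowid)
order by pu.property_id;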

Inconsistent results from BigQuery: same query, different number of rows

I noticed today that one of my queries was giving inconsistent results: every time I run it, a different number of rows is returned (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between runs. So I'm wondering if you have also noticed that kind of behaviour.
I usually run queries that return a lot of rows (>1000), so a few missing rows here and there are hardly noticeable. But this query returns only a few rows, and the count varies every time, between 10 and 20 rows :-/
If a Google engineer is reading this, here are two Job ID of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2
Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There are a couple of things you can do to try to find the cause:
replace select * with an explicit selection of fields from your tables (a combination of fields that uniquely determines each row)
order the result by these fields, so that the order is the same each time you execute the query
simplify your query. In the above query, you can remove the first condition and turn the two left outer joins into inner joins and get the same result. After that, you can start removing tables and conditions one by one.
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)
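For example, the first three steps combined into one candidate query (a sketch, assuming deviceId, email and date together identify each row):
SELECT t1.deviceId, t2.email, t3.date
FROM mydataset.table1 AS t1
JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11')
ORDER BY t1.deviceId, t2.email, t3.date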

Best practices of Oracle LEFT OUTER JOIN

I am new to SQL; I use SQL Developer (Oracle DB).
When I need to select data that may have null values, I write one of these selects:
1)
SELECT i.number
,i.amount
,(SELECT value FROM ATTRIBUTES a
WHERE a.parameter = 'aaa' AND a.item_nr = i.number) AS atr_value
FROM ITEMS i
2)
SELECT i.number
,i.amount
,a.value as atr_value
FROM ITEMS i
left outer join ATTRIBUTES a
on a.parameter = 'aaa'
and a.item_nr = i.number
Questions:
What is the difference?
What is the first approach called (how can I google it)? Where can I read about it?
Which one should I use from now on (what is best practice)? Maybe there is a better way to select the same data?
Your two queries are not exactly the same. If you have multiple matches in the second table, then the first generates an error and the second generates multiple rows.
Which is better? As a general rule, the LEFT JOIN method (the second method) is considered better practice than the correlated scalar subquery in the SELECT list (the first method). Oracle has a pretty good optimizer and it offers many ways of optimizing joins. I also think Oracle can use JOIN algorithms for correlated subqueries (not all databases are so smart). And, with the right indexes, the two forms probably have very similar performance.
There are situations where correlated subqueries have better performance than the equivalent JOIN construct. For this example, though, I would expect the performance to be similar.
In case there is never more than one matching row in table attributes, the two queries do the same thing. It's just two ways to query the same data. Both queries are fine and straightforward.
In case there can be more than one match, query one (which is using a correlated subquery) would fail. It would be inappropriate for the given task then.
The query with the outer join is easier to extend when you want a second column from the attributes table.
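For example (a sketch; the unit column is hypothetical):
SELECT i.number
,i.amount
,a.value as atr_value
,a.unit as atr_unit -- hypothetical second column; no extra subquery needed
FROM ITEMS i
left outer join ATTRIBUTES a
on a.parameter = 'aaa'
and a.item_nr = i.number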
The first query makes it crystal-clear that you expect zero or one matches in table attributes for each item. In case of data inconsistency, or if you have an error in your query such as a forgotten criterion, it will fail, which is good.
The second query would simply retrieve more rows in case of such error, which may not be desired.
So which query to choose is a matter of personal preference, and of how you want the query to deal with inconsistencies.

In SQL, does the COUNT(*) happen after the JOIN?

I have a query like follows :
SELECT LOCATION_CODE AS "Location",
       COUNT(prha.authorization_status) AS "Reqn Lines Count Approved"
FROM table1 t1
JOIN table2 t2 ... etc
My question is: suppose that I want to tally up both the counts of something and the "opposite" counts (i.e. counting the nulls and zeros), all within one query.
So I was wondering, is this possible? Or does the COUNT(*) function only occur after the JOINs? Thanks
I'm not sure that I completely understand what you're asking, but recent versions of Oracle do not technically have to perform joins at all if they would not affect the required result.
If you were counting records from a table, and joined to a table against which there was a foreign key constraint, then the optimiser can infer that the join is not required and can omit it.
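For example (a sketch with hypothetical tables; the foreign key must be enabled and validated, and the column not null, for the optimiser to do this):
-- child.parent_id is a NOT NULL, validated foreign key to parent.id
select count(*)
from child c
join parent p on p.id = c.parent_id;
-- the optimiser can drop the join to parent entirely: the constraint
-- guarantees every child row matches exactly one parent row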
Furthermore, I seem to recall that the optimiser can also perform aggregations prior to joins as well in some circumstances, if it would be more efficient to do so (for example, if joining between a DW fact table and dimension table, grouping at the atomic level of the dimension and selecting many dimension columns -- the aggregation can be performed on the fact table prior to the join to the dimension, in order to reduce the size of the sort needed on the aggregation).
So while under normal circumstances the join is going to be executed first, in some cases it will not.
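As for tallying a count and its "opposite" in one query: yes, and the usual approach is conditional aggregation, which is applied on top of whatever joins the plan performs. A minimal sketch, assuming the prha table from your fragment and a hypothetical 'APPROVED' status value (COUNT ignores nulls, so each CASE counts only its own rows):
SELECT LOCATION_CODE AS "Location",
       COUNT(CASE WHEN prha.authorization_status = 'APPROVED' THEN 1 END) AS "Reqn Lines Count Approved",
       COUNT(CASE WHEN prha.authorization_status IS NULL THEN 1 END) AS "Reqn Lines Count Missing"
FROM table1 t1
JOIN table2 t2 ... etc
GROUP BY LOCATION_CODE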