Good morning,
Yesterday I was writing a query joining two decent-sized result sets (<50k rows each), and part of my JOIN was a clause to check whether the data matched or both sides were null (simplified version below):
SELECT * FROM a JOIN b ON a.class = b.class OR (a.class IS NULL AND b.class IS NULL)
However, I noticed a serious performance issue centered around the use of the OR statement. I worked around the issue using the following:
SELECT * FROM a JOIN b ON NVL(a.class, 'N/A') = NVL(b.class, 'N/A')
The first query has an unacceptably long run time, while the second is a couple of orders of magnitude faster (>45 minutes vs. <1). I would expect the OR to run slower because it makes more comparisons, but the cases in which both a.class and b.class are null are comparatively few in this particular dataset.
What would cause such a dramatic increase in performance time? Does Oracle SQL not short-circuit boolean comparisons like many other languages? Is there a way to salvage the first query over the second (for use in general SQL not just Oracle)?
You're returning a cross product with any record with a null class. Is this OK for your results?
I created a sample query in 11gR2:
WITH a as
(select NULL as class, 5 as columna from dual
UNION
select NULL as class, 7 as columna from dual
UNION
select NULL as class, 9 as columna from dual
UNION
select 'X' as class, 3 as columna from dual
UNION
select 'Y' as class, 2 as columna from dual),
b as
(select NULL as class, 2 as columnb from dual
UNION
select NULL as class, 15 as columnb from dual
UNION
select NULL as class, 5 as columnb from dual
UNION
select 'X' as class, 7 as columnb from dual
UNION
select 'Y' as class, 9 as columnb from dual)
SELECT * from a JOIN b ON (a.class = b.class
OR (a.class is null AND b.class is null))
When I run EXPLAIN PLAN on this query, it indicates the tables (inline views in my case) are joined via NESTED LOOPS. A NESTED LOOPS join operates by taking the first row of one table and scanning every row of the other table for matches, then taking the second row of the first table and scanning the second table again, and so on. Because the OR portion of your JOIN condition is not a direct comparison between the two tables, the optimizer must use NESTED LOOPS.
Behind the scenes it may look something like:
Get Table A, row 1. If class is null, this row is a candidate for the null-match branch.
While still on Table A row 1, search Table B for all rows where class is null.
Perform a cross product of Table A row 1 and all rows found in Table B.
Include these rows in the result set.
Get Table A, row 2. If class is null, repeat the search against Table B.
.... etc
When I change the SELECT statement to SELECT * FROM a JOIN b ON NVL(a.class, 'N/A') = NVL(b.class, 'N/A'), EXPLAIN PLAN indicates that a HASH JOIN is used. A hash join essentially hashes each join key of the smaller table, then scans the larger table, probing the hash table for each row to find matches. In this case, since it's a simple equijoin, the optimizer can hash each row of the driving table without problems.
Behind the scenes it may look something like:
Go through table A, converting NULL class values to 'N/A'
Hash each row of table A as you go.
Hash Table A is now in temp space or memory.
Scan table B, converting NULL class values to 'N/A', then computing hash of value. Lookup hash in hash table, if it exists, include the joined row from Table A and B in the result set.
Continue scanning B.
If you run an EXPLAIN PLAN on your queries, you probably will find similar results.
Even though the end result is the same, since you aren't joining the tables in the first query with "OR", the optimizer can't use a better join methodology. NESTED LOOPS can be very slow if the driving table is large or if you are forcing a full table scan against a large secondary table.
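The hash-join mechanics described above can be sketched in plain Python. This is a simplified illustration, not Oracle's actual implementation; the sample rows mirror the 11gR2 example:

```python
def hash_join(table_a, table_b, sentinel="N/A"):
    """Equi-join two lists of (class, value) rows, treating NULL (None)
    class values as equal by mapping them to a sentinel first."""
    # Build phase: hash every row of the driving table by its join key.
    buckets = {}
    for cls, val in table_a:
        key = sentinel if cls is None else cls   # NVL(a.class, 'N/A')
        buckets.setdefault(key, []).append((cls, val))
    # Probe phase: scan the second table once, looking up each key.
    result = []
    for cls, val in table_b:
        key = sentinel if cls is None else cls   # NVL(b.class, 'N/A')
        for a_row in buckets.get(key, []):
            result.append((a_row, (cls, val)))
    return result

a = [(None, 5), (None, 7), (None, 9), ("X", 3), ("Y", 2)]
b = [(None, 2), (None, 15), (None, 5), ("X", 7), ("Y", 9)]
rows = hash_join(a, b)
print(len(rows))  # 11: the 9 null-null pairs plus X-X and Y-Y
```

Note that the build side is scanned once and the probe side once, rather than the probe side once per build row as in a nested loop.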
You can use the ANSI COALESCE function to emulate the Oracle NVL function in other database systems. The real issue here is that you're attempting to join on a NULL value, when you really should have a "NO CLASS" or some other method of identifying a "null" class, in the sense of null = nothing rather than null = unknown.
Addendum to answer your question in the comments:
For the null (OR) query, the SQL engine will do the following:
Read Row 1 from Table A; its class is null.
Search Table B for all rows where class is null; 3 rows are found.
Since the first row matches all 3 of them, 3 rows are added to the result set: A1B1, A1B2, A1B3.
Read Row 2 from Table A; its class is null.
Search Table B again for all rows where class is null; the same 3 rows are found.
3 more rows are added to the result set: A2B1, A2B2, A2B3.
Read Row 3 from Table A; its class is null.
Search Table B again; the same 3 rows are found.
3 more rows are added to the result set: A3B1, A3B2, A3B3.
Rows 4 and 5 aren't null, so they won't be processed in this portion of the join.
For the 'N/A' query, the SQL engine will do the following:
Read Row 1 from Table A, class is null, convert to 'N/A', hash this value.
Read Row 2 from Table A, class is null, convert to 'N/A', hash this value.
Read Row 3 from Table A, class is null, convert to 'N/A', hash this value.
Read Row 4 from Table A, class not null, hash this value.
Read Row 5 from Table A, class not null, hash this value.
The hash table for Table A is now in memory.
Read Row 1 from Table B, class is null, convert to 'N/A', hash the value.
Compare hashed value to hash table in memory, for each match add a row to the result set. 3 rows are found, A1, A2, and A3. Results are added A1B1, A2B1, A3B1.
Read Row 2 from Table B, class is null, convert to 'N/A', hash the value.
Compare hashed value to hash table in memory, for each match add a row to the result set. 3 rows are found, A1, A2, and A3. Results are added A1B2, A2B2, A3B2.
Read Row 3 from Table B, class is null, convert to 'N/A', hash the value.
Compare hashed value to hash table in memory, for each match add a row to the result set. 3 rows are found, A1, A2, and A3. Results are added A1B3, A2B3, A3B3.
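The whole addendum can be checked end to end. Here is a sketch using SQLite (an assumption: the thread is about Oracle, but the join logic is portable), reproducing the five-row samples and confirming both join predicates return the same 11 rows:

```python
import sqlite3

# Reproduce the WITH-query sample data from the answer above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (class TEXT, columna INTEGER)")
con.execute("CREATE TABLE b (class TEXT, columnb INTEGER)")
con.executemany("INSERT INTO a VALUES (?, ?)",
                [(None, 5), (None, 7), (None, 9), ("X", 3), ("Y", 2)])
con.executemany("INSERT INTO b VALUES (?, ?)",
                [(None, 2), (None, 15), (None, 5), ("X", 7), ("Y", 9)])

# The OR predicate from the first query.
or_join = con.execute("""
    SELECT COUNT(*) FROM a JOIN b
    ON a.class = b.class OR (a.class IS NULL AND b.class IS NULL)
""").fetchone()[0]

# The sentinel predicate (COALESCE is the portable spelling of NVL).
coalesce_join = con.execute("""
    SELECT COUNT(*) FROM a JOIN b
    ON COALESCE(a.class, 'N/A') = COALESCE(b.class, 'N/A')
""").fetchone()[0]

print(or_join, coalesce_join)  # 11 11: 9 null-null pairs + X-X + Y-Y
```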
In the first case, because each null is treated as distinct, the database can't use any optimization: for every row from table a it must check each row from table b. In the second case, the database first changes all nulls to 'N/A' and then simply compares a.class with b.class, which allows an optimized join.
Comparing nulls in Oracle is very expensive. Null is an undefined value: one null is not equal to another null.
Compare the results of two almost identical queries:
select 1 from dual where null is null
select 1 from dual where null = null
Only the first query, with the special IS NULL clause, returns the correct answer. This is also why null values cannot be indexed.
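The same pair of queries can be reproduced in SQLite, which follows the same three-valued logic: IS NULL evaluates to true, while NULL = NULL evaluates to UNKNOWN, which a WHERE clause treats as false.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The special IS NULL predicate matches.
is_null = con.execute("SELECT 1 WHERE NULL IS NULL").fetchall()

# Plain equality against NULL yields UNKNOWN, so no row is returned.
equals = con.execute("SELECT 1 WHERE NULL = NULL").fetchall()

print(is_null)  # [(1,)]
print(equals)   # []
```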
Try this one:
SELECT a.* FROM Table1 a JOIN JTable1 b ON a.class = b.class
UNION ALL
SELECT a.* FROM Table1 a CROSS JOIN JTable1 b
WHERE a.class IS NULL AND b.class IS NULL
It should be orders of magnitude faster.
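A sketch of this UNION ALL decomposition in SQLite, assuming the intent is to split the OR predicate into an equality branch and a both-null branch, each of which can use a better join method. The sample data matches the 11gR2 example from the earlier answer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (class TEXT, columna INTEGER)")
con.execute("CREATE TABLE b (class TEXT, columnb INTEGER)")
con.executemany("INSERT INTO a VALUES (?, ?)",
                [(None, 5), (None, 7), (None, 9), ("X", 3), ("Y", 2)])
con.executemany("INSERT INTO b VALUES (?, ?)",
                [(None, 2), (None, 15), (None, 5), ("X", 7), ("Y", 9)])

# Branch 1: plain equijoin (hash-joinable).
# Branch 2: cross product restricted to the null-class rows on both sides.
decomposed = con.execute("""
    SELECT a.columna, b.columnb FROM a JOIN b ON a.class = b.class
    UNION ALL
    SELECT a.columna, b.columnb FROM a CROSS JOIN b
    WHERE a.class IS NULL AND b.class IS NULL
""").fetchall()

print(len(decomposed))  # 11: 2 equality matches + 9 null-null pairs
```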
The explanation is simple:
The first query has to use NESTED LOOPS for the join operation, which is what happens whenever you use OR in the join condition.
The second can use a HASH JOIN, which is faster.
Why don't you make it a little bit easier, like:
SELECT *
FROM a,b
WHERE
a.class(+)=b.class(+)
I think it's more readable.
In my query I want to exclude zero values while keeping NULLs.
I have tried a few options that work. The query has a few JOINs in it.
However, they all seem to generate different results: the outputs contain different numbers of certain values.
E.g. let's say one option gives me 10 rows with value 'A' in COLUMN1, 5 rows of 'B' and 3 rows of 'C', while another gives 7 rows of 'A', 9 rows of 'B' and 2 rows of 'C'. Which one is the most suitable (or neither), and why:
where..
and a.exitreason<>'0' or a.exitreason is null
and (a.exitreason<>'0' or a.exitreason is null)
and ( isnull(a.exitreason,'') <>'0' OR a.exitreason is null)
Or include it in my JOIN part of the query (table LocalOffice)?
Thanks!
SELECT DISTINCT s.PeriodDate,s.Number,SiteID,
s.LocalID,s.Appointment,s.Agreement,s.AgreementCode,a.ExitReason
FROM Office s
INNER JOIN Employer e ON s.PeriodDate=e.PeriodDate AND s.EmployerID=e.EmployerID
LEFT JOIN LocalOffice a ON a.LocalOfficeID=a.LocalOfficeID
WHERE.....
The details you provided for your particular case aren't clear, but the WHERE clauses you've tried rightly give different result sets, because they're logically requesting different things.
Based on your description that you want all rows where the value is anything other than 0, or is NULL, your second condition (the parenthesized one) will get you that.
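The difference between the unparenthesized and parenthesized variants comes down to operator precedence: AND binds tighter than OR, so without parentheses the IS NULL test escapes the rest of the WHERE clause. A small SQLite illustration with hypothetical data, where keep stands in for the query's other WHERE conditions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (keep INTEGER, exitreason TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, "0"), (1, "1"), (1, None), (0, None)])

# Parsed as (keep = 1 AND exitreason <> '0') OR exitreason IS NULL,
# so the (0, NULL) row sneaks in through the OR.
unparenthesized = con.execute("""
    SELECT COUNT(*) FROM t
    WHERE keep = 1 AND exitreason <> '0' OR exitreason IS NULL
""").fetchone()[0]

# The parentheses keep the keep = 1 filter in force for all rows.
parenthesized = con.execute("""
    SELECT COUNT(*) FROM t
    WHERE keep = 1 AND (exitreason <> '0' OR exitreason IS NULL)
""").fetchone()[0]

print(unparenthesized, parenthesized)  # 3 2
```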
I have columns A, B in Table1 and columns B, C in Table2, and I need to perform a UNION between them.
For example: select A, B from Table1 UNION select 0, B from Table2.
I don't want to use this zero to solve the column mismatch. Is there any other solution?
I am asking the question with a simple example, but in my case the table structures are very large and the queries are already built. Now I need to fix this union query by replacing the zero (due to a DB2 upgrade).
Can anyone help?
For two legs A and B in a union to be union compatible it is required that:
a) A and B have the same number of columns
b) The types for each column in A is compatible with the corresponding column in B
In your query you can use NULL, which belongs to every type:
select a, b from T1
UNION
select null, b from T2
Under certain circumstances, you may have to explicitly cast null to the same type as A has (probably not in this case):
select a, b from T1
UNION
select cast(null as ...), b from T2
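A minimal SQLite illustration of the NULL placeholder; the table and column names are hypothetical stand-ins for the question's tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (a INTEGER, b INTEGER)")
con.execute("CREATE TABLE t2 (b INTEGER, c INTEGER)")
con.execute("INSERT INTO t1 VALUES (1, 10)")
con.execute("INSERT INTO t2 VALUES (20, 2)")

# NULL (here explicitly cast to the first column's type) fills the
# missing column without inventing a sentinel value like 0.
rows = con.execute("""
    SELECT a, b FROM t1
    UNION
    SELECT CAST(NULL AS INTEGER), b FROM t2
""").fetchall()

print(set(rows))  # {(1, 10), (None, 20)}
```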
A column returned in a SQL result set can only have one data type.
A union or union all results in rows from the first query and the rows of the second query (in case of union they are deduplicated).
So the first column of the first query needs to match the data type of the first column of the second query.
You can check this by running a describe:
describe select a,b from t1
If you work within a GUI (JDBC Connection) you could also use
call admin_cmd('describe select a,b from t1')
So if some column does not match, you have to explicitly cast the data types.
I have two columns in one table, say column_A and column_B. I need to compare each value of column_A with all the values of column_B, and return true if the column_A value is found in any of the rows of column_B. How can I get this?
I have tried the command below:
select column_A, column_B, if(column_A = column_B, True, False) as test from sample;
If I use that command, it only checks within the same row. But I need true if a value of column A is found in any of the rows of column B.
How can I check one value of column A against all the values of column B?
Or is there any way to iterate and compare each value between the two columns?
Solution
create temporary table t as select rand() as id, column_A, column_B from sample; --> Refer 1
select distinct t3.id,t3.column_A,t3.column_B,t3.match from ( --> Refer 3
select t1.id as id, t1.column_A as column_A, t1.column_B as column_B,--> Refer 2
if(t2.column_B is null, False, True) as match from t t1 LEFT OUTER JOIN
t t2 ON t1.column_A = t2.column_B
) t3;
Explanation
Create an identifier column to keep track of the rows in the original table. I am using rand() here. We will take advantage of this to recover the original rows in Step 3. I create a temporary table t here for simplicity in the next steps.
Use a LEFT OUTER JOIN with self to do your test that requires matching each column with another across all rows, yielding the match column. Note that more rows may be created here than in the sample table, but we have a handle on the duplicates, since the id column for them will be the same.
In this step, we apply distinct to get the original rows as in Sample table. You can then ditch the id column.
Notes
Self joins are costly in terms of performance, but that is unavoidable for a solution to this question.
The distinct used in Step 3 is costly too. A more performant approach would be to use window functions, partitioning by id and picking the first row in each window. You can explore that.
You can do a left join of the table to itself and check whether the joined column is null. If it is null, then that value is not found in the other column. Use IF or CASE WHEN to turn that check into a true/false value.
Select t1.column_A,
t1.column_B,
IF(t2.column_B is null, 'False', 'True') as test
from Sample t1
Left Join Sample t2
On t1.column_A = t2.column_B;
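The self-LEFT-JOIN test above can be sketched in SQLite. SQLite lacks IF(), so CASE WHEN is used as the portable equivalent, and the sample rows are made up: 4 appears nowhere in column_B, so its test is False.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample (column_A INTEGER, column_B INTEGER)")
con.executemany("INSERT INTO sample VALUES (?, ?)",
                [(1, 3), (2, 1), (4, 2)])

# For each row of t1, the left join finds any t2 row whose column_B
# equals t1.column_A; a null on the t2 side means "not found anywhere".
rows = con.execute("""
    SELECT t1.column_A,
           CASE WHEN t2.column_B IS NULL THEN 'False' ELSE 'True' END AS test
    FROM sample t1
    LEFT JOIN sample t2 ON t1.column_A = t2.column_B
""").fetchall()

print(sorted(rows))  # [(1, 'True'), (2, 'True'), (4, 'False')]
```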
I am trying to replace a manual process with an SQL-SERVER (2012) based automated one. Prior to doing this, I need to analyse the data in question over time to produce some data quality measures/statistics.
Part of this entails comparing the values in two columns. I need to count where they match and where they do not so I can prove my varied stats tally. This should be simple but seems not to be.
Basically, I have a table containing two columns both of which are defined identically as type INT with null values permitted.
SELECT * FROM TABLE
WHERE COLUMN1 is NULL
returns zero rows
SELECT * FROM TABLE
WHERE COLUMN2 is NULL
also returns zero rows.
SELECT COUNT(*) FROM TABLE
returns 3780
and
SELECT * FROM TABLE
returns 3780 rows.
So I have established that there are 3780 rows in my table and that there are no NULL values in the columns I am interested in.
SELECT * FROM TABLE
WHERE COLUMN1=COLUMN2
returns zero rows as expected.
Conversely therefore in a table of 3780 rows, with no NULL values in the columns being compared, I expect the following SQL
SELECT * FROM TABLE
WHERE COLUMN1<>COLUMN2
or in desperation
SELECT * FROM TABLE
WHERE NOT (COLUMN1=COLUMN2)
to return 3780 rows but it doesn't. It returns 3709!
I have tried SELECT * instead of SELECT COUNT(*) in case NULL values in some other columns were impacting but this made no difference, I still got 3709 rows.
Also, there are some negative values in 73 rows for COLUMN1 - is this what causes the issue (but 73+3709=3782 not 3780 my number of rows)?
What is a better way of proving the values in these numeric columns never match?
Update 09/09/2016: At Lamak's suggestion below I isolated the 71 missing rows and found that in each one, COLUMN1 is NULL and COLUMN2 = -99. So the issue is NULL values, but why doesn't
SELECT * FROM TABLE WHERE COLUMN1 is NULL
pick them up? Here is the information in Information Schema Views and System Views:
ORDINAL_POSITION COLUMN_NAME DATA_TYPE CHARACTER_MAXIMUM_LENGTH IS_NULLABLE
1 ID int NULL NO
.. .. .. .. ..
7 COLUMN1 int NULL YES
8 COLUMN2 int NULL YES
CONSTRAINT_NAME
PK__TABLE___...
name type_desc is_unique is_primary_key
PK__TABLE___... CLUSTERED 1 1
Suspect the CHARACTER_MAXIMUM_LENGTH of NULL must be the issue?
You can find the counts using the LEFT JOIN queries below.
--To find COLUMN1=COLUMN2 Count
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.COLUMN1=T2.COLUMN2
WHERE t2.id is not null
--To find COLUMN1<>COLUMN2 Count
--------------------------------
SELECT COUNT(T1.ID)
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.COLUMN1=T2.COLUMN2
WHERE t2.id is null
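A sketch of these two counting queries in SQLite with made-up data. One caveat worth noting: if COLUMN2 contains duplicate values, the first count can exceed the number of rows, because each T1 row joins to every matching T2 row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER, c2 INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(1, 10, 20), (2, 20, 30), (3, 5, 99)])

# c1 values that appear somewhere in c2 survive the join with a non-null
# t2 side; the rest get a null t2 side from the LEFT JOIN.
matches = con.execute("""
    SELECT COUNT(t1.id) FROM t t1
    LEFT JOIN t t2 ON t1.c1 = t2.c2
    WHERE t2.id IS NOT NULL
""").fetchone()[0]

non_matches = con.execute("""
    SELECT COUNT(t1.id) FROM t t1
    LEFT JOIN t t2 ON t1.c1 = t2.c2
    WHERE t2.id IS NULL
""").fetchone()[0]

print(matches, non_matches)  # only c1=20 appears in c2, so 1 and 2
```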
Following the exhaustive comment chain above, with all help gratefully received, I suspect this to be a problem with the data types in the table creation script for the columns in question. I have no explanation, from an SQL point of view, as to why "IS NULL" intermittently failed to pick up the NULL values.
I was able to identify the 71 rows that were not being picked up as expected by using an EXCEPT.
I.e. I flipped the SQL that was missing the 71 rows, namely:
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
through an except:
SELECT * FROM TABLE
EXCEPT
SELECT * FROM TABLE WHERE COLUMN1 <> COLUMN2
Through that I could see that COLUMN1 was always NULL in the missing 71 rows - even though the "is NULL" was not picking them up for me when I ran
SELECT * FROM TABLE WHERE COLUMN1 IS NULL
which returned zero rows.
Regarding the comparison of the values stored in the columns: as my data volumes are low (3780 rows), I am just forcing the issue by using ISNULL to substitute 9999 (a numeric value I know my data will never contain) to make it work.
SELECT * FROM TABLE
WHERE ISNULL(COLUMN1, 9999) <> COLUMN2
I then get the 3780 rows as expected. It's not ideal, but it'll have to do, and it is more or less appropriate since there are NULL values in there that have to be handled.
Also, using Bertrand's tip above I could view the table creation script, and the columns were definitely set up as INT.
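The ISNULL workaround can be illustrated in SQLite, where COALESCE plays the same role. The data here is made up: one NULL/-99 pair standing in for the question's 71 rows. The plain <> silently skips the NULL row; the sentinel version counts every row where the columns differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (c1 INTEGER, c2 INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, 2), (3, 4), (None, -99)])

# NULL <> -99 evaluates to UNKNOWN, so the NULL row is not counted.
plain = con.execute("SELECT COUNT(*) FROM t WHERE c1 <> c2").fetchone()[0]

# Substituting a sentinel makes the comparison yield true/false everywhere.
sentinel = con.execute(
    "SELECT COUNT(*) FROM t WHERE COALESCE(c1, 9999) <> c2").fetchone()[0]

print(plain, sentinel)  # 2 3
```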
I'm trying to clean up some data in SQL server and add a foreign key between the two tables.
I have a large quantity of orphaned rows in one of the tables that I would like to delete. I don't know why the following query would return 0 rows in MS SQL server.
--This Query returns no Rows
select * from tbl_A where ID not in ( select distinct ID from tbl_B
)
When I include IS NOT NULL in the subquery I get the results that I expect.
-- Rows are returned that contain all of the records in tbl_A but Not in tbl_B
select * from tbl_A where ID not in ( select distinct ID from tbl_B
where ID is not null )
The ID column is nullable and does contain null values. If I run just the subquery I get exactly the same results, except that the first query returns one extra NULL row, as expected.
This is the expected behavior of a NOT IN subquery: when the subquery returns even a single NULL value, NOT IN will not match any rows.
If you don't want to add an explicit null check, use NOT EXISTS instead:
select *
from tbl_A A
where not exists (select distinct ID
from tbl_B b
where a.id = b.id)
As to why the NOT IN is causing issues, here are some posts that discuss it:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL
NOT EXISTS vs NOT IN
What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?
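A minimal SQLite reproduction of both behaviors; the table names follow the question and the data is made up. A single NULL in tbl_B makes NOT IN return no rows at all, while NOT EXISTS returns the orphans as expected:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_A (ID INTEGER)")
con.execute("CREATE TABLE tbl_B (ID INTEGER)")
con.executemany("INSERT INTO tbl_A VALUES (?)", [(1,), (2,), (3,)])
con.executemany("INSERT INTO tbl_B VALUES (?)", [(1,), (None,)])

# ID NOT IN (1, NULL): for ID = 2, "2 <> NULL" is UNKNOWN, so the whole
# predicate is UNKNOWN and the row is filtered out. Same for ID = 3.
not_in = con.execute(
    "SELECT * FROM tbl_A WHERE ID NOT IN (SELECT ID FROM tbl_B)").fetchall()

# NOT EXISTS only asks whether a matching row exists, so NULLs in tbl_B
# simply never match and the orphans come back.
not_exists = con.execute("""
    SELECT * FROM tbl_A a
    WHERE NOT EXISTS (SELECT 1 FROM tbl_B b WHERE a.ID = b.ID)
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
```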
Matching on NULL with equals (=) will return NULL or UNKNOWN as opposed to true/false from a logic standpoint. E.g. see http://msdn.microsoft.com/en-us/library/aa196339(v=sql.80).aspx for discussion.
If you want to include finding NULL values in table A where there is no NULL in table B (if B is the "parent" and A is the "child" in the "foreign key" relationship you desire) then you would need a second statement, something like the following. Also I would recommend qualifying the ID field with a table prefix or alias since the field names are the same in both tables. Finally, I would not recommend having NULL values as the key. But in any case:
select * from tbl_A as A where (A.ID not in ( select distinct B.ID from tbl_B as B ))
or (A.ID is NULL and not exists(select * from tbl_B as B where B.ID is null))
The problem is the non-comparability of nulls. If you ask NOT IN and there are nulls in the subquery, the engine cannot say that anything is definitely not in the list, because it treats those nulls as "unknown", and so the answer is always "unknown" in the three-valued logic that SQL uses.
Now of course that all assumes you have ANSI_NULLS ON (which is the default). If you turn that off then NULLs suddenly become comparable, and the query will give you results, probably the results you expect.
If the ids are never negative, you might consider something like:
select *
from tbl_A
where coalesce(ID, -1) not in ( select distinct coalesce(ID, -1) from tbl_B )
(Or if id is a string, use something like coalesce(id, '<null>').)
This may not work in all cases, but it has the virtue of simplicity on the coding level.
You probably have ANSI_NULLS switched off. That setting makes null values comparable, so null = null returns true.
Prefix the first query with
SET ANSI_NULLS ON
GO