The old IN vs. EXISTS vs. LEFT JOIN (WHERE ___ IS or IS NOT NULL) performance

I have found myself in quite a pickle. I have tables of only one column (suppression or inclusion lists) that are more or less varchar(25), but the thing is I won't have time to index them before using them in the main query, and, depending on how important it is, I won't know how many rows are in each table. The base table at the heart of all this is some 1.4 million rows and some 50 columns.
My assumptions are as follows:
IN shouldn't be used in cases with a lot of values (rows) returned, because it looks through the values serially, right? (IN on a subquery, not values passed directly.)
Joins (INNER for inclusion, and LEFT with a check for NULLs for suppression) are the best for large sets of data (over 1k rows or so to match against).
EXISTS has always concerned me because it seems to do a subquery for every row (all 1.4 million? Yikes.)
My gut says: if feasible, get the count of the suppression table and use either IN (for sub-1k rows) or an INNER/LEFT JOIN (for suppression tables above 1k rows). Note, any field I will be suppressing on will be indexed in the big base table, but the suppression table won't be. Thoughts?
Thanks in advance for any and all comments and/or advice.

Assuming TSQL to mean SQL Server, have you seen this link comparing NOT IN, NOT EXISTS, and LEFT JOIN / IS NULL? In summary: as long as the columns being compared cannot be NULL, NOT IN and NOT EXISTS are more efficient than LEFT JOIN/IS NULL...
Something to keep in mind about the difference between IN and EXISTS: EXISTS is a boolean operator and returns true on the first match. Although the syntax looks like a correlated subquery, EXISTS has performed better than IN...
Also, IN and EXISTS only check for the existence of a matching value, so there is no duplication of records like you can get when JOINing...
It really depends, so if you're really out to find what performs best you'll have to test & compare what the query plans are doing...
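To make the three patterns concrete, here is a sketch of each against a hypothetical suppression list (BigTable, SuppressionList, and the Email column are made-up names standing in for the tables in the question):
-- NOT IN: fine as long as SuppressionList.Email can never be NULL
SELECT b.*
FROM BigTable AS b
WHERE b.Email NOT IN (SELECT s.Email FROM SuppressionList AS s);

-- NOT EXISTS: same result, and safe even if the list contains NULLs
SELECT b.*
FROM BigTable AS b
WHERE NOT EXISTS (
    SELECT 1 FROM SuppressionList AS s WHERE s.Email = b.Email
);

-- LEFT JOIN / IS NULL: same result expressed as a join
SELECT b.*
FROM BigTable AS b
LEFT JOIN SuppressionList AS s ON s.Email = b.Email
WHERE s.Email IS NULL;
All three return the same rows; the question is only which plan the optimizer builds for your data.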

It won't matter what technique you use: if there is no index on the table on which you apply a filter or join, the system will do a table scan.
RE: Exists
It is not necessarily the case that the system will do a subquery for all 1.4 million rows. SQL Server is smart enough to run the inner EXISTS query once and then evaluate that against the main query. In some cases, EXISTS can perform equal to or better than a JOIN.

Performance of JOINS in SAP HANA Calculation View

For Example:
I have 4 columns (A,B,C,D).
I thought that instead of connecting each and every column in the join, I should make a concatenated column in both projections (CA_CONCAT -> A+B+C+D) and join on that, just to check which method performs better.
It was working faster earlier, but in a few CVs this method is sometimes slower, especially at the time of filtering!
Can anyone suggest which is the more efficient method?
I don't think JOIN conditions on concatenated fields will perform better.
Although we generally say there is no need for indexes on column tables in a HANA database, column tables have a structure that works like an index on every column.
So if you concatenate 4 columns and produce a new calculated field, you first lose the option to use those indexes on the 4 columns and on the corresponding join columns.
I did not check the execution plan, but it will probably do a full scan on these columns.
In fact, I'm surprised you mentioned that it worked faster and that you experienced problems only on a few CVs.
Concatenation, or applying any function to a database column, is by itself extra workload on top of the SELECT process. It might involve an implicit type cast, which can bring more overhead than expected.
First, I would suggest setting your table to column store and checking the new performance.
After that, I would suggest separating the JOIN into multiple JOINs if you are using an OR condition in your join.
Third, an INNER JOIN will give you better performance compared to a LEFT JOIN or LEFT OUTER JOIN.
Another thing about JOINs and performance: you are better off using them on PRIMARY KEYS and not on arbitrary columns.
For me, joining on multiple fields performs faster than joining on a concatenated field both times. For the filtering scenario, PlanViz shows that when I join on multiple fields, the filter gets pushed down to both tables. On the other hand, when I join on a concatenated field, only one table gets filtered.
However, if you put the filter on both fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then you can push the filter down to both tables.
Like:
SELECT * FROM CalculationView WHERE PRODUCT = 'A' AND MATERIAL = 'A'
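As a sketch of the two join styles being compared (Tab1/Tab2 and the column names are assumptions based on the question; HANA's || operator is used for concatenation):
-- Join on the individual columns: HANA can use its per-column
-- structures and push filters down to both tables.
SELECT t1.*, t2.*
FROM Tab1 AS t1
JOIN Tab2 AS t2
  ON  t2.A = t1.A
  AND t2.B = t1.B
  AND t2.C = t1.C
  AND t2.D = t1.D;

-- Join on a concatenated calculated column: the expression must be
-- evaluated per row and defeats the per-column access paths.
SELECT t1.*, t2.*
FROM Tab1 AS t1
JOIN Tab2 AS t2
  ON t2.A || t2.B || t2.C || t2.D = t1.A || t1.B || t1.C || t1.D;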

Will a SQL DELETE with a subquery execute inefficiently if there are many rows in the source table?

I am looking at an application and I found this SQL:
DELETE FROM Phrase
WHERE PhraseId NOT IN (SELECT Id FROM PhraseSource)
The intention of the SQL is to delete rows from Phrase that are not in the PhraseSource table.
The two tables are identical and have the following structure
Id - GUID primary key
...
...
...
Modified int
The ... columns are about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows with a higher number in the Modified column and different text and numeric data.
Can someone tell me: will this query execute the SELECT Id FROM PhraseSource for every row in the Phrase table? If so, is there a more efficient way this could be coded?
1. Will this query execute the SELECT Id from PhraseSource for every row?
No.
In SQL you express what you want to do, not how you want it to be done.¹ The engine will create an execution plan to do what you want in the most performant way it can.
For your query, executing the query for each row is not necessary. Instead the engine will create an execution plan that executes the subquery once, then does a left anti-semi join to determine what IDs are not present in the PhraseSource table.
You can verify this when you include the Actual Execution Plan in SQL Server Management Studio.
2. Is there a more efficient way that this could be coded?
A little bit more efficient, as follows:
DELETE p
FROM Phrase AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM PhraseSource AS ps
    WHERE ps.Id = p.PhraseId
);
This has been shown in tests done by user Aaron Bertrand on sqlperformance.com: Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?:
Conclusion
[...] for the pattern of finding all rows in table A where some condition does not exist in table B, NOT EXISTS is typically going to be your best choice.
Another benefit of using NOT EXISTS with a correlated subquery is that it does not have problems when PhraseSource.Id can be NULL. I suggest you read up on IN/NOT IN vs NULL values in the subquery. E.g. you can read more about that on sqlbadpractices.com: Using NOT IN operator with null values.
The PhraseSource.Id column is probably not nullable in your schema, but I prefer using a method that is resilient in all possible schemas.
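As a quick sketch of that pitfall, using the tables from the question:
-- If any PhraseSource.Id is NULL, "PhraseId NOT IN (...)" evaluates to
-- UNKNOWN for every row (PhraseId <> NULL is never TRUE), so this
-- statement silently deletes nothing:
DELETE FROM Phrase
WHERE PhraseId NOT IN (SELECT Id FROM PhraseSource);
-- The NOT EXISTS version above does not have this problem.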
¹ Exceptions exist when forcing the engine to use a specific path, e.g. with Table Hints or Query Hints. The engine doesn't always get things right.
In this case the subquery could be evaluated for each row if the database system is not smart enough (though in the case of MS SQL Server, I suppose it should be able to recognize that it doesn't need to evaluate the subquery more than once).
Still there is a better solution:
DELETE p
FROM Phrase p
LEFT JOIN PhraseSource ps ON ps.Id = p.PhraseId
WHERE ps.Id IS NULL
This uses a LEFT JOIN, which matches the rows of both tables but leaves the ps columns NULL where there is no match. You then check the ps side for NULLs to see which Phrases have no match, and delete those.
All types of JOIN statements are very nicely described in this answer.
Here you can see three different approaches for a similar issue compared on MySQL. As #Drammy mentions, to actually see the performance of a given approach, you could see the execution plan on your target database and do performance testing on different approaches of the same problem.
That query should optimise into a join. Have you looked at the execution plan?
If you're experiencing poor performance it is likely because of the guid primary keys.
A primary key is clustered by default. If the GUID primary key is clustered on your table, that means the data in the table is ordered by the primary key. The problem with GUIDs as clustered keys is that their values are effectively random, so modifications to the table cause page splits and fragmentation, and data gets shuffled around on disk.
This article is a good read on the topic:
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
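If the clustered GUID key does turn out to be the problem, one common mitigation is to keep the GUID as a nonclustered primary key and cluster on a narrow, ever-increasing surrogate. This is a sketch only; the PhraseKey column is an assumption, not part of the original schema:
CREATE TABLE Phrase (
    PhraseKey int IDENTITY(1,1) NOT NULL,  -- narrow, ever-increasing clustering key (hypothetical)
    PhraseId  uniqueidentifier NOT NULL DEFAULT NEWID(),
    Modified  int NOT NULL,
    CONSTRAINT PK_Phrase PRIMARY KEY NONCLUSTERED (PhraseId)
);

CREATE UNIQUE CLUSTERED INDEX CIX_Phrase_PhraseKey ON Phrase (PhraseKey);
With this layout, new rows are appended at the end of the clustered index regardless of their GUID value.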

order of tables in FROM clause

For an SQL query like this:
SELECT * FROM TABLE_A a
JOIN TABLE_B b
  ON a.propertyA = b.propertyA
JOIN TABLE_C c
  ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't matter for the results, but does it affect performance?
One can assume that the data in table C is much larger than that in A or B.
For each SQL statement, the engine will create a query plan, so no matter how you order the tables, the engine will choose a correct path to build the query.
More on query plans: http://en.wikipedia.org/wiki/Query_plan
There are ways, depending on which RDBMS you are using, to enforce the join order and plan using hints, if you feel the engine is not choosing the correct path.
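For example, on SQL Server (an assumption; the question does not name an RDBMS), a query hint can force the optimizer to join the tables in the order they are written:
SELECT *
FROM TABLE_A a
JOIN TABLE_B b ON a.propertyA = b.propertyA
JOIN TABLE_C c ON b.propertyB = c.propertyB
OPTION (FORCE ORDER);  -- join in the written order; use only when the optimizer misfires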
Sometimes the order of the tables does make a difference (when you are using different joins).
Joins are conceptually based on the cross product:
if you write a query like A join B join C,
it is treated as ((A join B) join C),
meaning the result of joining tables A and B comes first, and that result is then joined with table C.
So if inner joining A (100 records) and B (200 records) gives 100 records,
those 100 records are then compared with the 1,000 records of C.
No.
Well, there is a very, very tiny chance of this happening; see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there is not enough time for the optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases, but I've never seen this happen, or even heard of it happening to anybody, in real life. You don't need to worry about it.

Optimize Joins (more than 2 tables, where filter)

A couple of Sybase database query questions:
If I do a join and have a WHERE clause, would the filter be applied prior to the actual join? In other words, is it faster than a join without any WHERE conditions?
I have an example involving 3 tables (with columns listed below):
A: O1,....
B: E1,E2,...
C: O1, E1, E2
So my join looks like:
select A.*, B.* from B, C, A
where
C.E1=B.E1 and C.E2=B.E2 and C.O1=A.O1
and A.O2 in (...)
and B.E3 in (...)
Would my joins be significantly faster if I eliminated C and added O1 to table B instead?
B:E1,E2,O1....
First, you should use proper join syntax:
select A.*, B.*
from B
join C on C.E1 = B.E1 and C.E2 = B.E2
join A on C.O1 = A.O1
where A.O2 in (...) and B.E3 in (...)
The "," means cross join and it is prone to problems since it is easily missed. Your query becomes much different if you say "from B C, A". Also, it gives you access to outer joins, and makes the statement much more readable.
The answer to your question is "yes". Your query will run faster if there are fewer tables being joined. If you are doing a join on a primary key, and the tables fit into memory, then the join is not expensive. However, it is still faster to just find the data in the record in B.
That said, there could be some boundary cases in some databases where this is not necessarily true. For instance, if the column is a long string that holds only one distinct value, then adding the column onto B's pages could increase the number of pages needed for B. Such extreme cases are unlikely, and you should see performance improvements.
Speed generally depends on the number of rows the SQL server has to read.
I don't think it makes any difference whether you use a where clause or a join.
It depends on how many rows are in the eliminated table.
It can also depend on the order in which you add the joins or where clauses: e.g. if there are only a few rows in C and you add that first as a table or where clause, it immediately cuts down on the number of possible matches. If, however, there are millions of rows in C, then you have to read millions to find the matches.
Modern optimizers can rearrange your query to be more efficient, but don't rely on it.
What can really cut down the number of rows read is adding indexes to the join columns: if you have an index on A.O1 and on C.O1, it can cut down massively on the number of reads.
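For example (a sketch; the index names are made up):
create index idx_A_O1 on A (O1);
create index idx_C_O1 on C (O1);
-- With both sides of the C.O1 = A.O1 join indexed, the server can use
-- index lookups instead of reading every row.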

MYSQL - Difference between IN and EXISTS

MySQL question:
What is the difference between [NOT] IN and [NOT] EXISTS when doing subqueries in MySQL?
EXISTS
EXISTS is literally for checking the existence of rows matching the specified criteria. In current standard SQL, it allows you to specify more than one criterion for comparison (e.g. if you want to know when col_a and col_b both match), which makes it a little stronger than the IN clause. MySQL's IN does support tuples, but the syntax is not portable, so EXISTS is the better choice for both readability and portability.
The other thing to be aware of with EXISTS is how it operates: EXISTS returns a boolean, and it returns true on the first match. So if you're dealing with duplicates/multiples, EXISTS can be faster to execute than IN or JOINs, depending on the data and your needs.
IN
IN is syntactic sugar for a chain of OR comparisons. While it's very accommodating, there are issues with feeding it lots of values for the comparison (north of 1,000).
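A sketch of the multi-column comparison mentioned above (the orders/archive tables and col_a/col_b columns are made-up names):
-- EXISTS: portable multi-column match
SELECT o.*
FROM orders o
WHERE EXISTS (
    SELECT 1 FROM archive a
    WHERE a.col_a = o.col_a AND a.col_b = o.col_b
);

-- Tuple IN: works on MySQL, but the syntax is not portable
SELECT o.*
FROM orders o
WHERE (o.col_a, o.col_b) IN (SELECT a.col_a, a.col_b FROM archive a);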
NOT
The NOT operator just reverses the logic.
Subqueries vs JOINs
The mantra "always use joins" is flawed, because a JOIN risks inflating the result set when there is more than one child record for a parent. Yes, you can use DISTINCT or GROUP BY to deal with this, but doing so very likely renders the performance benefit of the JOIN moot. Know your data and what you want in the result set; these are key to writing SQL that performs well.
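For instance, with hypothetical parent/child tables, a JOIN repeats each parent row once per matching child, while EXISTS returns it once:
-- A parent with three children comes back three times here:
SELECT p.*
FROM parent p
JOIN child c ON c.parent_id = p.id;

-- EXISTS returns each qualifying parent exactly once, no DISTINCT needed:
SELECT p.*
FROM parent p
WHERE EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.id);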
To reiterate knowing when and why to use what: LEFT JOIN / IS NULL is the fastest exclusion pattern on MySQL if the columns compared are NOT nullable; otherwise NOT IN/NOT EXISTS are the better choices.
Reference:
MySQL: LEFT JOIN/IS NULL, NOT IN, NOT EXISTS on nullable columns
MySQL: LEFT JOIN/IS NULL, NOT IN, NOT EXISTS on NOT nullable columns
They work very differently:
EXISTS takes a single argument, which should be a subquery (derived table), and checks whether the subquery returns at least one row.
IN takes two arguments: the first should be a single value (or a tuple), and the second a subquery or a list of values; it checks whether the first value is contained in the second.
However, both can be used to check whether a row in table A has a matching row in table B. Unless you are careful and know what you are doing, I would steer clear of IN in MySQL, as it often gives much poorer performance on more complex queries. Use NOT EXISTS or a LEFT JOIN ... WHERE ... IS NULL instead.