SQL INNER JOIN vs WHERE IN big performance difference

SQL INNER JOIN vs WHERE IN big performance difference - sql

I have read multiple sources and still don't understand where a big difference comes from in a query I have for Microsoft SQL Server.
I need to count different alerts linked to vehicles (IdMateriel is synonymous to the id of the vehicle) based on types (CodeAlerte), state (Etat), and a Top true/false column, but beofre the different counts I need to select the data.
TLDR : There are two parameters, the current date as an SQL DATETIME, and the VARCHAR(MAX) string of entity codes separated by commas, which I split using STRING_SPLIT to use them either in WHERE IN clause or INNER JOIN. Using it in the first clause is ~10x faster than the second clause although it seems equivalent to me. Why?
First, the queries are based on the view created as follows:
CREATE OR ALTER VIEW [dbo].[AlertesVehicules]
WITH SCHEMABINDING
AS
SELECT dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte,
dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat,
Vehicule.Top, Vehicule.EtatVehicule,COUNT_BIG(*) AS COUNT
FROM dbo.Alerte
INNER JOIN dbo.Materiel on dbo.Alerte.IdMateriel= dbo.Materiel.Id
INNER JOIN dbo.Vehicule on dbo.Vehicule.Id= dbo.Materiel.Id
INNER JOIN dbo.TypeAlerte on dbo.Alerte.IdTypeAlerte = dbo.TypeAlerte.Id
WHERE dbo.Materiel.EntiteGestion is NOT NULL
AND dbo.TypeAlerte.CodeAlerte IN ('P07','P08','P09','P11','P12','P13','P14')
GROUP BY dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte,
dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat,
Vehicule.Top, Vehicule.EtatVehicule
GO
CREATE UNIQUE CLUSTERED INDEX IX_AlerteVehicule
ON dbo.AlertesVehicules (EntiteGestion,IdMateriel,CodeAlerte,Etat,TopAlerteMajeure);
GO
This first version of the query takes ~100ms:
SELECT DISTINCT a.IdMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTabAlertes
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
INNER JOIN (SELECT value AS entiteGestion FROM STRING_SPLIT(#entiteGestion, ',')) eg
ON a.EntiteGestion = eg.entiteGestion
WHERE
a.CodeAlerte IN ('P08','P09')
AND #currentDate <= ISNULL(ase.DateFin, #currentDate)
According to SQL Sentry Plan Explorer, the actual execution plan starts with an index seek taking ~30% of the time on dbo.Alerte with the predicate Alerte.IdTypeAlerte=TypeAlerte.Id, outputting 369 000 rows of Etat, IdMateriel and IdTypeAlerte, which it then directly filters down to 7 742 based on predicate PROBE(Opt_Bitmapxxxx,Alerte.IdMateriel), and then inner joins taking ~25% of the time with the 2 results of another index seek of TypeAlerte but with predicates TypeAlerte.CodeAlerte = N'P08' and =N'P09'. So just these two parts take > 50ms, but I don't understand why there are so many initial results.
The second version takes ~10ms :
SELECT DISTINCT a.dMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTab
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
WHERE
a.EntiteGestion IN (SELECT value FROM STRING_SPLIT(#entiteGestion, ','))
AND a.CodeAlerte IN ('P08','P09')
AND (#currentDate <= ISNULL(ase.DateFin, #currentDate))
For this one, SQL Sentry Plan Explorer starts with a View Clustered Index Seek directly on the view AlertesVehicules with Seek Predicates AlertesVehicule.EntiteGestion > Exprxxx1 and <Exprxxx2and Predicate AlertesVehicules.CodeAlerte=N'P08' and =N'P09'
Why are those two treated so differently when it seems to me that they are exactly equivalent?
For references, here are some threads I already looked into, but didn't seem to find an explanation in (except that there shouldn't be a difference):
SQL JOIN vs IN performance?
Performance gap between WHERE IN (1,2,3,4) vs IN (select * from STRING_SPLIT('1,2,3,4',','))

Related

How can this change be making my query slow (OR vs UNION) and can I fix it?

I've just been debugging a slow SQL query.
It's a join between 2 tables, with a WHERE clause conditioning on either a property of 1 table OR the other.
If I re-write it as a UNION then it's suddenly 2 orders of magnitude faster, even though those 2 queries produce identical outputs:
DECLARE #UserId UNIQUEIDENTIFIER = '0019813D-4379-400D-9423-56E1B98002CB'
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (#UserId) OR Bookings.MixedDealBroker in (#UserId))
--Execution time: ~4000ms
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (#UserId))
UNION
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (#UserId))
--Execution time: ~70ms
This seems rather surprising to me! I would have expected the SQL compiler to be entirely capable of identifying that the 2nd form was equivalent and would have used that compilation approach if it were available.
Some context notes:
I've checked and IN (#UserId) vs = #UserId makes no difference.
Nor does JOIN vs LEFT JOIN.
Those tables each have 100,000s records, and the filter cuts it down to ~100.
In the slow version it seems to be reading every row of both tables.
So:
Does anyone have any ideas for how this comes about.
What (if anything) can I do to fix the performance without just re-writing the query as a series of UNIONs (not viable for a variety of reasons.)
=-=-=-=-=-=-=
Execution Plans:

This is a common limitation of SQL engines, not just in SQL Server, but also other database systems as well. The OR complicates the predicate enough that the execution plan selected isn't always ideal. This probably relates to the fact that only one index can be seeked into per instance of a table object at a time (for the most part), or in your specific case, your OR predicate is across two different tables, and other factors with how SQL engines are designed.
By using a UNION clause, you now have two instances of the Bookings table referenced, which can individually be seeked on separately in the most efficient way possible. That allows the SQL Engine to pick a better execution plan to serve you query.
This is pretty much just one of those things that are the way they are because that's just the way it is, and you need to remember the UNION clause workaround for future encounters of this kind of performance issue.
Also, in response to your comment:
I don't understand how the difference can affect the EP, given that the 2 different "phrasings" of the query are identical?
A new execution plan is generated every time one doesn't exist in the plan cache for a given query, essentially. The way the Engine determines if a plan for a query is already cached is based on the exact hashing of that query statement, so even an extra space character at the end of the query can result in a new plan being generated. Theoretically that plan can be different. So a different written query (despite being logically the same) can surely result in a different execution plan.
There are other reasons a plan can change on re-generation too, such as different data and statistics of that data, in the tables referenced in the query between executions. But these reasons don't really apply to your question above.

As already stated, the OR condition prevents the database engine from efficiently using the indexes in a single query. Because the OR condition spans tables, I doubt that the Tuning Advisor will come up with anything useful.
If you have a case where the query you have posted is part of a larger query, or the results are complex and you do not want to repeat code, you can wrap your initial query in a Common Table Expression (CTE) or a subquery and then feed the combined results into the remainder of your query. Sometimes just selecting one or more PKs in your initial query will be sufficient.
Something like:
SELECT <complex select list>
FROM (
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (#UserId))
UNION
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings B
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (#UserId))
) PRE
JOIN Bookings B ON B.ID = PRE.BookingsID
JOIN BookingPricings BP ON BP.ID = PRE.BookingPricingsID
<more joins>
WHERE <more conditions>
Having just the IDs in your initial select make the UNION more efficient. The UNION can also be changed to a yet more-efficient UNION ALL with careful use of additional conditions, such as AND Bookings.MixedDealBroker <> #UserId in the second part, to avoid overlapping results.

Left join Ignore

I have recently noticed that SQL Server 2016 appears to be ignoring left joins if any column is not used in the select or where clause. Also not found in Actual execution plan.
This is good for if anyone added extra join but still not affecting performance.
I have query that took 9 sec, if I add column in select clause for Left join tables but without that only 1 sec.
Can anyone please check and suggest, Is that true or not?
Query with Actual execution plan. You can see there is no any table from left join in execution plan.

I'm not 100% sure what the question is asking, but a SQL optimizer can ignore left join. Consider this type of query:
select a.*
from a left join
b
on a.b_id = b.id;
If b.id is declared as unique (or equivalently a primary key) then the above query returns exactly the same result set as:
select a.*
from a;
I am not per se aware that SQL Server started putting this optimization in 2016. But the optimization is perfectly valid and (I believe) other optimizers do implement it.
Remember, SQL is a declarative language, not a procedural language. The SQL query describes the result set, not how it is produced.

If you have a left join and your matching condition don't return any data from the joined table it will return data as inner join return, when select statement does not contains columns from right tables. Not only in ms server 2016 but most of the DB's.
Left join reduces the performance of the query if there are large amount of data available in join tables.

Access SQL Syntax error: missing operator

I am trying to convert a T-SQL query to MS Access SQL and getting a syntax error that I am struggling to find. My MS Access SQL query looks like this:
INSERT INTO IndvRFM_PreSort (CustNum, IndvID, IndvRScore, IndRecency, IndvFreq, IndvMonVal )
SELECT
IndvMast.CustNum, IndvMast.IndvID, IndvMast.IndvRScore,
IndvMast.IndRecency, IndvMast.IndvFreq, IndvMast.IndvMonVal
FROM
IndvMast
INNER JOIN
OHdrMast ON IndvMast.IndvID = OHdrMast.IndvID
INNER JOIN
MyParameterSettings on 1=1].ProdClass
INNER JOIN
[SalesTerritoryFilter_Check all that apply] ON IndvMast.SalesTerr = [SalesTerritoryFilter_Check all that apply].SalesTerr
WHERE
(((OHdrMast.OrdDate) >= [MyParameterSettings].[RFM_StartDate]))
GROUP BY
IndvMast.CustNum, IndvMast.IndvID, IndvMast.IndvRScore,
IndvMast.IndRecency, IndvMast.IndvFreq, IndvMast.IndvMonVal,
[CustTypeFilter_Check all that apply].IncludeInRFM,
[ProductClassFilter_Check all that apply].IncludeInRFM,
[SourceCodeFilter_Check all that apply].IncludeInRFM,
IndvMast.FlgDontUse
I have reviewed differences between MS Access SQL and T-SQL at http://rogersaccessblog.blogspot.com/2013/05/what-are-differences-between-access-sql.html and a few other locations but with no luck.
All help is appreciated.
update: I have removed many lines trying to find the syntax error and I am still getting the same error when running just (which runs fine using T-SQL):
SELECT
IndvMast.CustNum, IndvMast.IndvID, IndvMast.IndvRScore,
IndvMast.IndRecency, IndvMast.IndvFreq, IndvMast.IndvMonVal
FROM
IndvMast
INNER JOIN
OHdrMast ON IndvMast.IndvID = OHdrMast.IndvID
INNER JOIN
[My Parameter Settings] ON 1 = 1

There are a number of items in your query that should also have failed in any SQL-compliant database:
You have fields from tables in GROUP BY not referenced in FROM or JOIN clauses.
Number of fields in SELECT query do not match number of fields in INSERT INTO clause.
The MyParameterSettings table is not properly joined with valid ON expression.
Strictly MS Access SQL items:
For more than one join, MS Access SQL requires paired parentheses but even this can get tricky if some tables are joined together and their paired result joins to outer where you get nested joins.
Expressions like ON 1=1 must be used in WHERE clause and for cross join tables as MyParameterSettings appears to be, use comma-separated tables.
For above reasons and more, it is advised for beginners to this SQL dialect to use the Query Design builder providing table diagrams and links (if you have the MS Access GUI .exe of course). Then, once all tables connect graphically with at least one field selected, jump into SQL view for any nuanced scripting logic.
Below is an adjustment to SQL statement to demonstrate the parentheses pairings and for best practices, uses table aliases especially with long table names.
INSERT INTO IndvRFM_PreSort (CustNum, IndvID, IndvRScore, IndRecency, IndvFreq, IndvMonVal)
SELECT
i.CustNum, i.IndvID, i.IndvRScore, i.IndRecency, i.IndvFreq, i.IndvMonVal
FROM
[MyParameterSettings] p, (IndvMast i
INNER JOIN
OHdrMast o ON i.IndvID = o.IndvID)
INNER JOIN
[SalesTerritoryFilter_Check all that apply] s ON i.SalesTerr = s.SalesTerr
WHERE
(o.OrdDate >= p.[RFM_StartDate])
GROUP BY
i.CustNum, i.IndvID, i.IndvRScore, i.IndRecency, i.IndvFreq, i.IndvMonVal
And in your smaller SQL subset, the last table does not need an ON 1=1 condition and may be redundant as well in SQL Server. Simply a comma separate table will suffice if you intend for cross join. The same is done in above example:
SELECT
IndvMast.CustNum, IndvMast.IndvID, IndvMast.IndvRScore,
IndvMast.IndRecency, IndvMast.IndvFreq, IndvMast.IndvMonVal
FROM
[My Parameter Settings], IndvMast
INNER JOIN
OHdrMast ON IndvMast.IndvID = OHdrMast.IndvID

I suppose there are some errors in your query, the first (more important).
Why do you use HAVING clause to add these conditions?
HAVING (((IndvMast.IndRecency)>(date()-7200))
AND (([CustTypeFilter_Check all that apply].IncludeInRFM)=1)
AND (([ProductClassFilter_Check all that apply].IncludeInRFM)=1)
AND (([SourceCodeFilter_Check all that apply].IncludeInRFM)=1)
AND ((IndvMast.FlgDontUse) Is Null))
HAVING usually used about conditions on aggregate functions (COUNT, SUM, MAX, MIN, AVG), for scalar value you must put in WHERE clause.
The second: You have 12 parenthesis opened and 11 closed in HAVING clause

Left join or Select in select (SQL - Speed of query)

I have something like this:
SELECT CompanyId
FROM Company
WHERE CompanyId not in
(SELECT CompanyId
FROM Company
WHERE (IsPublic = 0) and CompanyId NOT IN
(SELECT ShoppingLike.WhichId
FROM Company
INNER JOIN
ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
WHERE (ShoppingLike.IsWaiting = 0) AND
(ShoppingLike.ShoppingScoreTypeId = 2) AND
(ShoppingLike.UserId = 75)
)
)
It has 3 select, I want to know how could I have it without making 3 selects, and which one has better speed for 1 million record? "select in select" or "left join"?

My experiences are from Oracle. There is never a correct answer to optimising tricky queries, it's a collaboration between you and the optimiser. You need to check explain plans and sometimes traces, often at each stage of writing the query, to find out what the optimiser in thinking. Having said that:
You could remove the outer SELECT by putting the entire contents of it's subquery WHERE clause in a NOT(...). On the face of it will prevent that outer full scan of Company (or it's index of CompanyId). Try it, check the output is the same and get timings, then remove it temporarily before trying the below. The NOT() may well cause the optimiser to stop considering an ANTI-JOIN against the ShoppingLike subquery due to an implicit OR being created.
Ensure that CompanyId and WhichId are defined as NOT NULL columns. Without this (or the likes of an explicit CompanyId IS NOT NULL) then ANTI-JOIN options are often discarded.
The inner most subquery is not correlated (does not reference anything from it's outer query) so can be extracted and tuned separately. As a matter of style I'd swap the table names round the INNER JOIN as you want ShoppingLike scanned first as it has all the filters against it. It wont make any difference but it reads easier and makes it possible to use a hint to scan tables in the order specified. I would even question the need for the Company table in this subquery.
You've used NOT IN when sometimes the very similar NOT EXISTS gives the optimiser more/alternative options.
All the above is just trial and error unless you start trying the explain plan. Oracle can, with a following wind, convert between LEFT JOIN and IN SELECT. 1M+ rows will create time to invest.

ORDER BY column that has allow null is slow. Why?

So really my question is WHY this worked.
Anyway, I had this query that does a few inner joins, has a where clause and does an order by on a nvarchar column. If I run the query WITHOUT order by, the query takes less than a second. If I run the query WITH order by, it takes 12 seconds.
Now I had a great idea and changed all the INNER JOINs to LEFT JOINs. And also included the ORDER BY clause. That took less than a second. So I remembered the difference between LEFT JOINs and INNER JOINs. INNER JOINs check for NULL and LEFT JOINs don't. So I went into the table design and unchecked "Allow Nulls". Now I run the query WITH INNER JOINs and a ORDER BY clause and the query takes less than a second. WHY?
From what I understand, the FROM, JOINS, WHERE, then SELECT clauses should run first and return a result set. Then the ORDER BY clause runs at the very end on the resultant record set. Therefore the query should have taken AT MOST a second, yes, even with the column allowing nulls. So why would the query take less than a second WITHOUT the order by clause, but take 12 seconds WITH order by clause? That doesn't make sense to me.
Query below:
SELECT PlanInfo.PlanId, PlanName, COALESCE(tResponsible, '') AS tResponsible, Processor, CustName, TaskCategoryId, MapId, tEnd,
CASE MapId WHEN 9 THEN 1 ELSE 2 END AS sor
FROM PlanInfo INNER JOIN [orders].dbo.BaanOrders_Ext ON PlanInfo.PlanName = [orders].dbo.BaanOrders_Ext.OrderNo
INNER JOIN [orders].dbo.BaanOrders ON PlanInfo.PlanName = [orders].dbo.BaanOrders.OrderNo
INNER JOIN Tasks ON PlanInfo.PlanId = Tasks.PlanId
INNER JOIN EngSchedToTimingMap ON Tasks.CatId = EngSchedToTimingMap.TaskCategoryId
WHERE (MapId = 9 OR MapId = 11 or MapId = 13 or MapId = 15)
AND([orders].dbo.BaanOrders_Ext.Processor = 'metest' OR tResponsible = 'metest')
ORDER BY PlanInfo.PlanId

I would have to guess that it is due to having an index on PlanInfo.PlanId, on which you are sorting.
SQL Server could streamline collection so that it follows the index and build the rest of the columns along that order. When the field is NULLable, the index cannot be used for sorting, because it will not contain the NULL values, which incidentally come first, so it decides to optimize along a different path.
Showing the Execution Plan always helps. Either paste the images of the plans, or just show the text-mode plans, i.e. add the line above the query, then execute it
SET SHOWPLAN_TEXT ON;
<the query>

When you use the ORDER BY clause, you force the database engine to sort the results. This takes some time (especially if the result contains many rows) - thus it is possible that a query that runs 1 second without an ORDER BY clause runs 12 seconds with it. Note that sorting takes at best O(N*log(N)) time where N is the number of rows.
The reason why NULLs are generally slow is the fact that they must be treated specially. Sorting with NULLs adds more complex comparison conditions and slows the sorting down.

If your question is "Why does the ORDER BY clause cause my query to run longer?" the answer is because sorting the results is added to the query execution plan.
If you use the "Show Estimated Query Execution Plan" tool in SQL Server Studio, it will show you exactly what it thinks the SQL Server engine will do.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas