Simple optimization in a SQL join

Take the following simple SQL. It pulls a set of fields from two tables: a "jobs" table, and a supporting table that we join to it. Note that it's a left join because in this case the supporting table data is not required to be there.
select [fields]
from jobs j
left join supporting_data sd on sd.id = j.supporting_data_id
Would the query perform any differently when written as follows:
select [fields]
from jobs j
left join supporting_data sd
on (j.supporting_data_id > 0) and (sd.id = j.supporting_data_id)
The difference is that if the main table has a supporting_data_id of -1, which I commonly see in databases to indicate "no value", then boolean short-circuit evaluation should kick in and stop the query from checking the supporting_data table at all for that record.
Of course there should always be an index on the field. But if I had a record with jobs.supporting_data_id = -1, wouldn't the first form cause the database engine to probe the index for that record anyway? Maybe negligible... Just wondering if there is any difference internally.
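One way to settle this for a given engine is to measure rather than reason about short-circuiting. A minimal sketch for SQL Server, assuming a hypothetical id column on jobs (the original elides the select list), that compares I/O and timing for the two variants:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant 1: plain join condition.
SELECT j.id  -- placeholder for [fields]
FROM jobs j
LEFT JOIN supporting_data sd ON sd.id = j.supporting_data_id;

-- Variant 2: guard predicate added to the ON clause.
SELECT j.id  -- placeholder for [fields]
FROM jobs j
LEFT JOIN supporting_data sd
    ON (j.supporting_data_id > 0) AND (sd.id = j.supporting_data_id);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
If the guard predicate saves anything, it shows up as fewer logical reads on supporting_data in the second run.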

Related

SQL INNER JOIN vs WHERE IN big performance difference

I have read multiple sources and still don't understand where the big difference comes from in a query I have for Microsoft SQL Server.
I need to count different alerts linked to vehicles (IdMateriel is the id of the vehicle) based on type (CodeAlerte), state (Etat), and a Top true/false column, but before the different counts I need to select the data.
TL;DR: There are two parameters: the current date as a SQL DATETIME, and a VARCHAR(MAX) string of entity codes separated by commas, which I split using STRING_SPLIT and use either in a WHERE ... IN clause or in an INNER JOIN. The WHERE ... IN form is ~10x faster than the INNER JOIN form, although the two seem equivalent to me. Why?
First, the queries are based on the view created as follows:
CREATE OR ALTER VIEW [dbo].[AlertesVehicules]
WITH SCHEMABINDING
AS
SELECT dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte,
       dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat,
       dbo.Vehicule.Top, dbo.Vehicule.EtatVehicule, COUNT_BIG(*) AS COUNT
FROM dbo.Alerte
INNER JOIN dbo.Materiel ON dbo.Alerte.IdMateriel = dbo.Materiel.Id
INNER JOIN dbo.Vehicule ON dbo.Vehicule.Id = dbo.Materiel.Id
INNER JOIN dbo.TypeAlerte ON dbo.Alerte.IdTypeAlerte = dbo.TypeAlerte.Id
WHERE dbo.Materiel.EntiteGestion IS NOT NULL
  AND dbo.TypeAlerte.CodeAlerte IN ('P07','P08','P09','P11','P12','P13','P14')
GROUP BY dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte,
         dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat,
         dbo.Vehicule.Top, dbo.Vehicule.EtatVehicule
GO
CREATE UNIQUE CLUSTERED INDEX IX_AlerteVehicule
ON dbo.AlertesVehicules (EntiteGestion,IdMateriel,CodeAlerte,Etat,TopAlerteMajeure);
GO
This first version of the query takes ~100ms:
SELECT DISTINCT a.IdMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTabAlertes
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
INNER JOIN (SELECT value AS entiteGestion FROM STRING_SPLIT(@entiteGestion, ',')) eg
ON a.EntiteGestion = eg.entiteGestion
WHERE
a.CodeAlerte IN ('P08','P09')
AND @currentDate <= ISNULL(ase.DateFin, @currentDate)
According to SQL Sentry Plan Explorer, the actual execution plan starts with an index seek (~30% of the time) on dbo.Alerte with the predicate Alerte.IdTypeAlerte = TypeAlerte.Id, outputting 369,000 rows of Etat, IdMateriel and IdTypeAlerte. It then filters those directly down to 7,742 rows based on the predicate PROBE(Opt_Bitmapxxxx, Alerte.IdMateriel), and then performs an inner join (~25% of the time) with the 2 results of another index seek on TypeAlerte, with predicates TypeAlerte.CodeAlerte = N'P08' and = N'P09'. Just these two parts take > 50ms, and I don't understand why there are so many initial results.
The second version takes ~10ms:
SELECT DISTINCT a.IdMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTab
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
WHERE
a.EntiteGestion IN (SELECT value FROM STRING_SPLIT(@entiteGestion, ','))
AND a.CodeAlerte IN ('P08','P09')
AND (@currentDate <= ISNULL(ase.DateFin, @currentDate))
For this one, SQL Sentry Plan Explorer shows a plan that starts with a View Clustered Index Seek directly on the view AlertesVehicules, with Seek Predicates AlertesVehicules.EntiteGestion > Exprxxx1 and < Exprxxx2, and Predicate AlertesVehicules.CodeAlerte = N'P08' and = N'P09'.
Why are those two treated so differently when it seems to me that they are exactly equivalent?
For reference, here are some threads I already looked into but didn't find an explanation in (other than that there shouldn't be a difference):
SQL JOIN vs IN performance?
Performance gap between WHERE IN (1,2,3,4) vs IN (select * from STRING_SPLIT('1,2,3,4',','))
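An aside not drawn from these threads, but a common workaround when STRING_SPLIT misleads the optimizer: materialize the split values into a temp table first, so the join sees a real row count and statistics instead of a fixed guess. A sketch against the first query:
-- Materialize the list once; the optimizer now has an actual cardinality.
SELECT value AS entiteGestion
INTO #entiteGestionList
FROM STRING_SPLIT(@entiteGestion, ',');

SELECT DISTINCT a.IdMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTabAlertes
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
INNER JOIN #entiteGestionList eg ON a.EntiteGestion = eg.entiteGestion
WHERE a.CodeAlerte IN ('P08','P09')
  AND @currentDate <= ISNULL(ase.DateFin, @currentDate);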

How to LEFT JOIN 4 tables in SQL?

I have 4 tables - Controls, Risks, Processes & Regulations. They share common ID numbers; for example, ID1 exists across all 4 tables. The problem is that the number of instances of each ID varies from table to table (for example, ID1 exists 5 times in Controls, 3 times in Risks, 0 times in Processes & once in Regulations).
I need to LEFT JOIN all these tables so they are all joined by ID number.
The code below works until Line 3, but when I add Line 4, it gives me a "Resultant table not allowed to have more than one AutoNumber field" error:
SELECT *
FROM Controls
LEFT JOIN Processes ON Processes.TO_PRC_ID = Controls.TO_PRC_ID
LEFT JOIN Risks ON Risks.TO_PRC_ID = Controls.TO_PRC_ID
LEFT JOIN Regulations ON Regulations.TO_PRC_ID = Controls.TO_PRC_ID
MS Access requires extra parentheses for multiple joins:
SELECT *
FROM (Controls
    LEFT JOIN Processes_Risks
        ON Processes_Risks.TO_PRC_ID = Controls.TO_PRC_ID)
LEFT JOIN Issues
    ON Issues.TO_PRC_ID = Controls.TO_PRC_ID
And the process continues:
SELECT *
FROM ((Controls
    LEFT JOIN Processes_Risks
        ON Processes_Risks.TO_PRC_ID = Controls.TO_PRC_ID)
    LEFT JOIN Issues
        ON Issues.TO_PRC_ID = Controls.TO_PRC_ID)
LEFT JOIN Regulations
    ON . . .
You have two or more tables with the same column name, so use fully qualified column names (and aliases) in the select list, plus the parentheses Access requires:
SELECT c.TO_PRC_ID AS ControlID, p.TO_PRC_ID AS ProcessID,
       r1.TO_PRC_ID AS RiskID, r2.TO_PRC_ID AS RegulationID
FROM ((Controls AS c
    LEFT JOIN Processes AS p ON p.TO_PRC_ID = c.TO_PRC_ID)
    LEFT JOIN Risks AS r1 ON r1.TO_PRC_ID = c.TO_PRC_ID)
LEFT JOIN Regulations AS r2 ON r2.TO_PRC_ID = c.TO_PRC_ID
There are two different problems here. One problem is getting the right syntax for joining four tables. The other problem is the error message "Resultant table not allowed to have more than one AutoNumber field".
I don't have a copy of the tables being joined, but I suspect that more than one of them has an AutoNumber field in it. This is a field that automatically generates a record number when a new record is added to a table. Because the left join includes all of the fields in all of the tables, it will eventually include two different AutoNumber fields. MS Access cannot cope with that situation, so it declares there to be an error.
The proper, though tedious, way to remove an AutoNumber field from a join is to select all of the other fields explicitly. So, instead of
FROM CONTROLS
one would need to code
FROM (SELECT A, B, C, D, WHATEVER FROM CONTROLS)
to eliminate the problem field.
If the tables have many fields, this becomes tedious to code. One alternative is to copy a table into a temporary table, drop the AutoNumber field from the copy, and use the copy instead of the original in the join. Whether this is a good or bad idea depends on the circumstances, such as how large the tables are, how often this would need to be done, and whether there is a good way to clean up the temporary tables later.
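Putting the two together, a sketch of the full four-table join with the AutoNumber columns projected away via derived tables; the non-ID column names here (ControlName and so on) are placeholders for each table's real non-AutoNumber fields:
SELECT *
FROM (((SELECT TO_PRC_ID, ControlName FROM Controls) AS c
    LEFT JOIN (SELECT TO_PRC_ID, ProcessName FROM Processes) AS p
        ON p.TO_PRC_ID = c.TO_PRC_ID)
    LEFT JOIN (SELECT TO_PRC_ID, RiskName FROM Risks) AS r
        ON r.TO_PRC_ID = c.TO_PRC_ID)
LEFT JOIN (SELECT TO_PRC_ID, RegulationName FROM Regulations) AS g
    ON g.TO_PRC_ID = c.TO_PRC_ID;
If your Access version rejects AS aliases on derived tables, the temp-table copy described above is the fallback.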

Need Input | SQL Dynamic Query

I have a requirement where I need to build a dynamic query based on user input and return the count of records in the result set.
There are 6 tables that always need an INNER JOIN, and the remaining table joins depend on user input; this should be performance oriented.
Here is the requirement
select count(A.A1) from table A
INNER JOIN table B on B.B1=A.A1
INNER JOIN table C on C.C1=B.B1
INNER JOIN table D on D.D1=C.C1
INNER JOIN table E on E.E1=D.D1
INNER JOIN table F on F.F1=E.E1
Now if the user selects some value in the UI, then I have to execute the query as:
select count(A.A1) from table A
INNER JOIN table B on B.B1=A.A1
INNER JOIN table C on C.C1=B.B1
INNER JOIN table D on D.D1=C.C1
INNER JOIN table E on E.E1=D.D1
INNER JOIN table F on F.F1=E.E1
INNER JOIN table G on G.G1=F.F1
Where G.Name like '%Germany%'
The user can send 1-5 choices, and I have to build the query accordingly and send back the result set.
If I add all the joins up front and then add WHERE clauses per the choices, the query is easy to build and serves the purpose; but if the user did not select anything, I am creating unnecessary joins.
So which is the better way: having all the joins in advance and then filtering, or adding joins on demand, with filters, using a dynamic query?
It would be great if someone could provide input.
When SQL Server executes a query, the first step is planning it, i.e. deciding on a strategy to get the query result.
If you use inner joins, you make it compulsory to include all the tables, because an inner join means there must be matching rows on both sides of the join, so the query planner can't discard any table.
However, if you change the inner joins to left outer joins, it's no longer compulsory that there are matching rows on both sides, so the query planner can decide whether or not to include the tables on the right. If you use left outer joins and you don't select, filter, or do any other operation on fields from the right side of the joins, the query planner can discard those tables when executing the query. That's the easiest way to get rid of your concerns.
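A sketch of that idea applied to the question's query, keeping its placeholder table names:
-- With LEFT OUTER JOINs, a table whose columns are never selected or
-- filtered on can be dropped from the plan. Note that the engine must
-- also be able to prove the join cannot multiply rows, which in practice
-- means a trusted foreign key or a unique key on the join column.
SELECT COUNT(A.A1)
FROM A
LEFT JOIN B ON B.B1 = A.A1
LEFT JOIN C ON C.C1 = B.B1
LEFT JOIN D ON D.D1 = C.C1
LEFT JOIN E ON E.E1 = D.D1
LEFT JOIN F ON F.F1 = E.E1;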
On the other hand, if you want to control which tables to include or not include, and create a custom query for each case, you can use several techniques:
making a graph that includes the definition of the table relations, and using a graph manipulation library to get the necessary tables from the graph. I did this once, but it is quite hard to achieve if you don't have experience with graphs.
using Entity Framework. You build a simple model including all the tables, and then, for each query, you programmatically build the query in LINQ, and EF takes care of generating and executing the SQL for you.
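A third option, shown here only as a hypothetical sketch in T-SQL: build the statement conditionally and run it with sp_executesql, so each query shape gets its own cached, parameterized plan.
DECLARE @countryFilter nvarchar(100) = N'Germany';  -- user input (placeholder)

DECLARE @sql nvarchar(max) = N'
SELECT COUNT(A.A1)
FROM A
INNER JOIN B ON B.B1 = A.A1
INNER JOIN C ON C.C1 = B.B1
INNER JOIN D ON D.D1 = C.C1
INNER JOIN E ON E.E1 = D.D1
INNER JOIN F ON F.F1 = E.E1';

-- Append the optional join and filter only when the user made a choice.
IF @countryFilter IS NOT NULL
    SET @sql += N'
INNER JOIN G ON G.G1 = F.F1
WHERE G.Name LIKE ''%'' + @country + ''%''';

EXEC sp_executesql @sql, N'@country nvarchar(100)', @country = @countryFilter;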

Speeding up inner-joins and subqueries while restricting row size and table membership

I'm developing an RSS feed reader that uses a Bayesian filter to filter out boring blog posts.
The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationship between entries, users and Bayesian filter classifications.
After a user marks an entry as read, it is added to the metadata table (so that the user isn't presented with material they have already read) and deleted from the stream table. Every three minutes, a background process repopulates the Stream table with new entries (i.e. whenever the daemon adds new entries after checking the RSS feeds for updates).
Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries at a time; that would reduce duplication, make processing faster and give me some flexibility in how I display the entries.
The query (takes about 9 seconds on 3600 items with no indexes):
insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id
where subscriptions_users.user_id = 1
and entries.id not in (select entry_id
from metadata
where metadata.user_id = 1)
and entries.id not in (select entry_id
from stream where user_id = 1);
The query explained: insert into stream all of the entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. do not exist in metadata) and which do not already exist in the stream.
Attempted solution: adding limit 100 to the end speeds up the query considerably, but repeated executions keep adding a different set of 100 entries that do not already exist in the table (with each successive query taking longer and longer).
This is close but not quite what I wanted to do.
Does anyone have any advice (nosql?) or know a more efficient way of composing the query?
Use:
INSERT INTO STREAM (entry_id, user_id)
SELECT e.id, su.user_id
FROM ENTRIES e
JOIN SUBSCRIPTIONS_USERS su ON su.subscription_id = e.subscription_id
                           AND su.user_id = 1
LEFT JOIN METADATA md ON md.entry_id = e.id
                     AND md.user_id = 1
LEFT JOIN STREAM s ON s.entry_id = e.id
                  AND s.user_id = 1
WHERE md.entry_id IS NULL
  AND s.entry_id IS NULL
In MySQL, the LEFT JOIN / IS NULL pattern is the most efficient means of getting data that exists in one table but not another.
Check the query performance before looking at indexes.
In Postgres, NOT IN, NOT EXISTS, and LEFT JOIN / IS NULL are all equivalent.
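For concreteness, the NOT EXISTS form of the same insert (same tables as the question; in Postgres all three forms typically produce the same plan):
INSERT INTO stream (entry_id, user_id)
SELECT e.id, su.user_id
FROM entries e
JOIN subscriptions_users su ON su.subscription_id = e.subscription_id
WHERE su.user_id = 1
  AND NOT EXISTS (SELECT 1 FROM metadata md
                  WHERE md.user_id = 1 AND md.entry_id = e.id)
  AND NOT EXISTS (SELECT 1 FROM stream s
                  WHERE s.user_id = 1 AND s.entry_id = e.id);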
The query (takes about 9 seconds on 3600 items with no indexes):
Then I would try to start off with some indexes...
Or use LEFT JOIN / IS NULL (and indexes):
SELECT *
FROM TABLEA A
LEFT JOIN TABLEB B ON A.ID = B.ID
WHERE B.ID IS NULL
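Following that suggestion, a sketch of indexes that would support the join and both anti-join probes; the column choices assume the schema implied by the question:
-- Drive the filter on user and the join to entries.
CREATE INDEX ix_subs_users_user ON subscriptions_users (user_id, subscription_id);
CREATE INDEX ix_entries_subscription ON entries (subscription_id);
-- Let each anti-join probe be a single index lookup.
CREATE INDEX ix_metadata_user_entry ON metadata (user_id, entry_id);
CREATE INDEX ix_stream_user_entry ON stream (user_id, entry_id);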
One way to optimize the select is to replace the subqueries with joins.
Something like:
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id
left join metadata md on md.user_id = subscriptions_users.user_id and md.entry_id = entries.id
left join stream str on str.user_id = subscriptions_users.user_id and str.entry_id = entries.id
where subscriptions_users.user_id = 1 and md.entry_id is null and str.entry_id is null;
You would have to make sure that the join conditions for the left joins are correct. I am not sure what your exact schema is, so I can't be certain.
Adding indexes would also help.

LEFT INNER JOIN vs. LEFT OUTER JOIN - Why does the OUTER take longer?

We have the query below. Using a LEFT OUTER join, it takes 9 seconds to execute. Changing the LEFT OUTER to a LEFT INNER reduces the execution time to 2 seconds, and the same number of rows is returned. Since the same number of rows from the dbo.Accepts table is being processed regardless of the join type, why would the outer take over 4x longer?
SELECT CONVERT(varchar, a.ReadTime, 101) as ReadDate,
a.SubID,
a.PlantID,
a.Unit as UnitID,
a.SubAssembly,
m.Lot
FROM dbo.Accepts a WITH (NOLOCK)
LEFT OUTER Join dbo.Marker m WITH (NOLOCK) ON m.SubID = a.SubID
WHERE a.LastModifiedTime BETWEEN @LastModifiedTimeStart AND @LastModifiedTimeEnd
AND a.SubAssembly = '400'
The fact that the same number of rows is returned is an after-the-fact observation; the query optimizer cannot know in advance that every row in Accepts has a matching row in Marker, can it?
Say you join two tables A and B, where A has 1 million rows and B has 1 row. If you write A INNER JOIN B, only rows that match in both A and B can result, so the query plan is free to scan B first, then use an index to do a range scan in A, and perhaps return 10 rows. But if you write A LEFT OUTER JOIN B, then at least all rows in A have to be returned, so the plan is forced to scan everything in A no matter what it finds in B. By using an OUTER join you are eliminating one possible optimization.
If you do know that every row in Accepts will have a match in Marker, then why not declare a foreign key to enforce this? The optimizer will see the constraint, and if it is trusted, will take it into account in the plan.
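A sketch of that suggestion, assuming Marker.SubID is a primary or unique key and Accepts.SubID is non-nullable (otherwise the constraint cannot guarantee a match for every row):
-- WITH CHECK validates existing rows, so the optimizer can trust the constraint.
ALTER TABLE dbo.Accepts WITH CHECK
    ADD CONSTRAINT FK_Accepts_Marker
    FOREIGN KEY (SubID) REFERENCES dbo.Marker (SubID);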
1) in a query window in SQL Server Management Studio, run the command:
SET SHOWPLAN_ALL ON
2) run your slow query
3) your query will not run, but the execution plan will be returned. store this output
4) run your fast version of the query
5) your query will not run, but the execution plan will be returned. store this output
6) compare the slow query version output to the fast query version output.
7) if you still don't know why one is slower, post both outputs in your question (edit it) and someone here can help from there.
This is because the LEFT OUTER join is doing more work than an INNER join BEFORE sending the results back.
The inner join looks for all records where the ON condition is true (so when it builds its intermediate result, it only includes records where m.SubID = a.SubID). Then it applies your WHERE clause (the last-modified-time range).
The left outer join takes all of the records in your first table; where the ON condition is not true (m.SubID does not equal a.SubID), it simply fills the second table's columns with NULLs for that record.
The reason you get the same number of results in the end is probably coincidence, due to the WHERE clause being applied AFTER all of that copying of records.
See the Join (SQL) article on Wikipedia.
Wait -- did you actually mean that "the same number of rows ... are being processed" or that "the same number of rows are being returned"? In general, the outer join would process many more rows, including those for which there is no match, even if it returns the same number of records.