How to speed up query with multiple INNER JOINs - sql

I've been toying around with switching from ms-access files to SQLite files for my simple database needs; for the usual reasons: smaller file size, less overhead, open source, etc.
One thing that is preventing me from making the switch is what seems to be a lack of speed in SQLite. For simple SELECT queries, SQLite seems to perform as well as, or better than MS-Access. The problem occurs with a fairly complex SELECT query with multiple INNER JOIN statements:
SELECT DISTINCT
DESCRIPTIONS.[oCode] AS OptionCode,
DESCRIPTIONS.[descShort] AS OptionDescription
FROM DESCRIPTIONS
INNER JOIN tbl_D_E ON DESCRIPTIONS.[oCode] = tbl_D_E.[D]
INNER JOIN tbl_D_F ON DESCRIPTIONS.[oCode] = tbl_D_F.[D]
INNER JOIN tbl_D_H ON DESCRIPTIONS.[oCode] = tbl_D_H.[D]
INNER JOIN tbl_D_J ON DESCRIPTIONS.[oCode] = tbl_D_J.[D]
INNER JOIN tbl_D_T ON DESCRIPTIONS.[oCode] = tbl_D_T.[D]
INNER JOIN tbl_Y_D ON DESCRIPTIONS.[oCode] = tbl_Y_D.[D]
WHERE ((tbl_D_E.[E] LIKE '%')
AND (tbl_D_H.[oType] ='STANDARD')
AND (tbl_D_J.[oType] ='STANDARD')
AND (tbl_Y_D.[Y] = '41')
AND (tbl_Y_D.[oType] ='STANDARD')
AND (DESCRIPTIONS.[oMod]='D'))
In MS-Access, this query executes in about 2.5 seconds. In SQLite, it takes a little over 8 minutes. It takes the same amount of time whether I'm running the query from VB code or from the command prompt using sqlite3.exe.
So my questions are the following:
Is SQLite just not optimized to handle multiple INNER JOIN statements?
Have I done something obviously stupid in my query (because I am new to SQLite) that makes it so slow?
And before anyone suggests a completely different technology, no I can not switch. My choices are MS-Access or SQLite. :)
UPDATE:
Assigning an INDEX to each of the columns in the SQLite database reduced the query time from over 8 minutes down to about 6 seconds. Thanks to Larry Lustig for explaining why the INDEXing was needed.

As requested, I'm reposting my previous comment as an actual answer (when I first posted the comment I was not able, for some reason, to post it as an answer):
MS Access is very aggressive about indexing columns on your behalf, whereas SQLite will require you to explicitly create the indexes you need. So, it's possible that Access has indexed either [Description] or [D] for you but that those indexes are missing in SQLite. I don't have experience with that amount of JOIN activity in SQLite. I used it in one Django project with a relatively small amount of data and did not detect any performance issues.

Do you have issues with referencial integrity? I ask because have the impression you've got unnecessary joins, so I re-wrote your query as:
SELECT DISTINCT
t.[oCode] AS OptionCode,
t.[descShort] AS OptionDescription
FROM DESCRIPTIONS t
JOIN tbl_D_H h ON h.[D] = t.[oCode]
AND h.[oType] = 'STANDARD'
JOIN tbl_D_J j ON j.[D] = t.[oCode]
AND j.[oType] = 'STANDARD'
JOIN tbl_Y_D d ON d.[D] = t.[oCode]
AND d.[Y] = '41'
AND d.[oType] ='STANDARD'
WHERE t.[oMod] = 'D'

If DESCRIPTIONS and tbl_D_E have multiple row scans then oCode and D should be indexed. Look at example here to see how to index and tell how many row scans there are (http://www.siteconsortium.com/h/p1.php?id=mysql002).
This might fix it though ..
CREATE INDEX ocode_index ON DESCRIPTIONS (oCode) USING BTREE;
CREATE INDEX d_index ON tbl_D_E (D) USING BTREE;
etc ....
Indexing correctly is one piece of the puzzle that can easily double, triple or more the speed of the query.

Related

Perfomance tweak for this query

I am writting the following query,
Execution plan
Its takes 30 seconds to load just 80 rows.
Is there anything we can do to reduce the time of running this query?
select
CO.ContributorsName [ContributorsName]
, D.DocumentLastPublished DocumentLastPublished
, CO.ContributorsImage [AuthorImage]
, T.NodeAliasPath
, D.DocumentID
, BD.*
from CMS_Tree T
inner join Cms_Class CC
on T.NodeClassID = CC.ClassID
and CC.ClassName = 'wv.blogdata'
inner join Cms_Document D
on T.NodeID = D.DocumentNodeID
inner join WV_BlogData BD
on D.DocumentForeignKeyValue = BD.BlogDataID
and COALESCE(BD.IsDeleted, 0) = 0
inner join WV_Contributors CO
on BD.AuthorID = CO.ContributorsID
where (
'ALL' = 'ALL'
or category = 'All'
)
and DocumentCulture = 'en-US'
Don't use * for all tables.Only specify column names what columns you need.Check your WHERE Clause also.
Covering indexes
(Looking at your execution plan, it looks like you've already got the appropriate covering indexes, but this is good general advice, and still worth a try)
If this is a frequently used query, make sure you've got the appropriate covering indexes on the tables involved. See this MSDN page for how to identify potential missing indexes. Note that adding indexes will improve query performance, at the cost of degrading your insert performance. You will also need to make sure you've got the appropriate maintenance plans in place to ensure your indexes don't get fragmented or unbalanced.
Query changes
I'd also recommend trying some changes to your query and comparing the execution plans.
It's difficult to make any meaningful suggestions without looking at your database and being able to try a few things.
From a cursory look at your query, the most obvious thing I can see is that you're performing an inner join on Cms_Class, but not selecting any of the data from it, or even joining it to other tables (apart from CMS_Tree). I'd suggest removing this join and using an exists statement instead, like so:
select
CO.ContributorsName [ContributorsName]
, D.DocumentLastPublished DocumentLastPublished
, CO.ContributorsImage [AuthorImage]
, T.NodeAliasPath
, D.DocumentID
, BD.*
from CMS_Tree T
inner join Cms_Document D
on T.NodeID = D.DocumentNodeID
inner join WV_BlogData BD
on D.DocumentForeignKeyValue = BD.BlogDataID
and COALESCE(BD.IsDeleted, 0) = 0
inner join WV_Contributors CO
on BD.AuthorID = CO.ContributorsID
where (
'ALL' = 'ALL'
or category = 'All'
)
and DocumentCulture = 'en-US'
and exists
(
select null
from Cms_Class CC
where T.NodeClassID = CC.ClassID
and CC.ClassName = 'wv.blogdata'
)
Give it a try, look at the execution plans, and see if it makes a difference for you.
If you create new covering indexes, re-run your queries and look at the execution plans again, because the most efficient query with missing indexes might not be the most efficient query once you've added indexes.
Document caching (SQL isn't always the best solution for accessing data)
Assuming you've done both of these, and the query performance is still too poor, you may want to ask yourself if you really need to query live data. Looking at your query, it looks like you're querying data from a CMS. The data in a CMS is only going to change when a content author actually makes a change. Most of the time, the data will stay the same from request to request. This means that doing a direct query from SQL every time you want to access content might be overkill for your needs.
A good use-case example is to look at how Umbraco CMS accesses its data. It keeps an XML document cache of all of the published documents on a given site. When a content author publishes changes, it then updates the XML document cache.
Accessing the cache is much more efficient than talking to SQL directly, and they even warn users not to use their SQL API for serving up CMS content, because it is too slow.

SQL Server choosing inefficient execution plan

I've got a query that gets run in certain circumstances with an 'over-simplified' execution plan that actually turns out to be quite slow (3-5 seconds). The query is:
SELECT DISTINCT Salesperson.*
FROM Salesperson
INNER JOIN SalesOrder on Salesperson.Id = SalesOrder.SalespersonId
INNER JOIN PrelimOrder on SalesOrder.Id = PrelimOrder.OrderId
INNER JOIN PrelimOrderStatus on PrelimOrder.CurrentStatusId = PrelimOrderStatus.Id
INNER JOIN PrelimOrderStatusType on PrelimOrderStatus.StatusTypeId = PrelimOrderStatusType.Id
WHERE
PrelimOrderStatusType.StatusTypeCode = 'Draft'
AND Salesperson.EndDate IS NULL
and the slow execution plan looks like:
The thing that stands out straight away is that the actual number of rows/executions is significantly higher than the respective estimates:
If I remove the Salesperson.EndDate IS NULL clause, then a faster, parallelized execution plan is run:
A similar execution plan also runs quite fast if I remove the DISTINCT keyword.
From what I can gather, it seems that the optimiser decides, based on its incorrect estimates, that the query won't be costly to run and therefore doesn't choose the parallelized plan. But I can't for the life of me figure out why it is choosing the incorrect plan. I have checked my statistics and they are all as they should be. I have tested in both SQL Server 2008 to 2016 with identical results.
SELECT DISTINCT is expensive. So, it is best to avoid it. Something like this:
SELECT sp.*
FROM Salesperson sp
WHERE EXISTS (SELECT 1
FROM SalesOrder so INNER JOIN
PrelimOrder po
ON so.Id = po.OrderId INNER JOIN
PrelimOrderStatus pos
ON po.CurrentStatusId = pos.Id INNER JOIN
PrelimOrderStatusType post
ON pos.StatusTypeId = post.Id
WHERE sp.Id = so.SalespersonId AND
post.StatusTypeCode = 'Draft'
) AND
sp.EndDate IS NULL;
Note: An index on SalesPerson(EndDate, Id) would be helpful.
As #Gordon Linoff already said, DISTINCT usually is bad news for performance. Often it means you're amassing way too much data and then squeezing it back together in a more compact set. Better to keep it small all throughout the process, if possible.
Also, it's kind of counter-intuitive that the query plan with index scans turns out to be faster than the one with index seeks; it seems (in this case) parallelism makes up for it. You could try playing around with the
Cost Threshold For Parallelism Option but beware that this is a server-wide setting! (then again, in my opinion the default of 5 is rather high for most use-cases I've run into personally; CPU's are aplenty these days, time still isn't =).
Bit of a long reach, but I was wondering if you could 'split' the query in 2, thus eliminating (a small) part of the guesswork of the server. I'm assuming here that StatusTypeCode is unique. (verify the datatype of the variable too!)
DECLARE #StatusTypeId int
SELECT #StatusTypeId = Id
FROM PrelimOrderStatusType
WHERE StatusTypeCode = 'Draft'
SELECT Salesperson.*
FROM Salesperson
WHERE Salesperson.EndDate IS NULL
AND EXISTS ( SELECT *
FROM SalesOrder
ON SalesOrder.SalespersonId = Salesperson.Id
JOIN PrelimOrder
ON PrelimOrder.OrderId = SalesOrder.Id
JOIN PrelimOrderStatus
ON PrelimOrderStatus.Id = PrelimOrder.CurrentStatusId
AND PrelimOrderStatus.StatusTypeId = #StatusTypeId)
If it doesn't help, could you give give the definition of the indexes that are being used?

Left join or Select in select (SQL - Speed of query)

I have something like this:
SELECT CompanyId
FROM Company
WHERE CompanyId not in
(SELECT CompanyId
FROM Company
WHERE (IsPublic = 0) and CompanyId NOT IN
(SELECT ShoppingLike.WhichId
FROM Company
INNER JOIN
ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
WHERE (ShoppingLike.IsWaiting = 0) AND
(ShoppingLike.ShoppingScoreTypeId = 2) AND
(ShoppingLike.UserId = 75)
)
)
It has 3 select, I want to know how could I have it without making 3 selects, and which one has better speed for 1 million record? "select in select" or "left join"?
My experiences are from Oracle. There is never a correct answer to optimising tricky queries, it's a collaboration between you and the optimiser. You need to check explain plans and sometimes traces, often at each stage of writing the query, to find out what the optimiser in thinking. Having said that:
You could remove the outer SELECT by putting the entire contents of it's subquery WHERE clause in a NOT(...). On the face of it will prevent that outer full scan of Company (or it's index of CompanyId). Try it, check the output is the same and get timings, then remove it temporarily before trying the below. The NOT() may well cause the optimiser to stop considering an ANTI-JOIN against the ShoppingLike subquery due to an implicit OR being created.
Ensure that CompanyId and WhichId are defined as NOT NULL columns. Without this (or the likes of an explicit CompanyId IS NOT NULL) then ANTI-JOIN options are often discarded.
The inner most subquery is not correlated (does not reference anything from it's outer query) so can be extracted and tuned separately. As a matter of style I'd swap the table names round the INNER JOIN as you want ShoppingLike scanned first as it has all the filters against it. It wont make any difference but it reads easier and makes it possible to use a hint to scan tables in the order specified. I would even question the need for the Company table in this subquery.
You've used NOT IN when sometimes the very similar NOT EXISTS gives the optimiser more/alternative options.
All the above is just trial and error unless you start trying the explain plan. Oracle can, with a following wind, convert between LEFT JOIN and IN SELECT. 1M+ rows will create time to invest.

Question about SQL Server Optmization Sub Query vs. Join

Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?
SELECT DISTINCT S#
FROM shipments
WHERE P# IN (SELECT P#
FROM parts
WHERE color = ‘Red’)
vs.
SELECT DISTINCT S#
FROM shipments, parts
WHERE shipments.P# = parts.P#
AND parts.color = ‘Red’
The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the Execution Plan. You'll also want to look at SQL Profiler as well. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.
Starting here, this answer has been updated
The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:
Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)
yielded a table scan on Table1 and a mere index scan on table 2 followed by a right semi join.
A query of the form:
Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID)
Where A.Gender='M'
yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.
These are not duplicate queries though since the second one may return multiple, identical records (one for each record in table 2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.
Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:
Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B
on (A.UID=B.UID)
Where A.Gender='M'
The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.
When you add a predicate to the search on table 2 using a subquery:
Select F1, F2, F3 From Table1 Where F4='X' And UID in
(Select UID From Table2 Where F1='Y')
then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.
When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.
One last thing: someone suggested (and your question now includes) a query of the form:
Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'
The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out difference but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.
With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.
So: why choose one over the other? It really comes down to your personal preferences since there appears to be no advantage for an inner join to a subquery in terms of time complexity across the range of examples I tests.
In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.
So, the takeaway is simply that the best solution is the one that makes your development the most efficient.
Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:
SELECT DISTINCT S#
FROM SHIPMENTS s
JOIN PARTS p ON p.p# = s.p#
AND p.color = 'Red'
Check your explain plans to see which is better, because it depends on data and table setup.
If you aren't selecting anything from the table I would use an EXISTS clause.
SELECT DISTINCT S#
FROM shipments a
WHERE EXISTS (SELECT 1
FROM parts b
WHERE b.color = ‘Red’
AND a.P# = b.P#)
This will optimize out to be the same as the second one you posted.
SELECT DISTINCT S#
FROM shipments,parts
WHERE shipments.P# = parts.P# and parts.color = ‘Red’;
Using IN forces SQL Server to not use indexing on that column, and subqueries are usually slower

Aggregating two selects with a group by in SQL is really slow

I am currently working with a query in in MSSQL that looks like:
SELECT
...
FROM
(SELECT
...
)T1
JOIN
(SELECT
...
)T2
GROUP BY
...
The inner selects are relatively fast, but the outer select aggregates the inner selects and takes an incredibly long time to execute, often timing out. Removing the group by makes it run somewhat faster and changing the join to a LEFT OUTER JOIN speeds things up a bit as well.
Why would doing a group by on a select which aggregates two inner selects cause the query to run so slow? Why does an INNER JOIN run slower than a LEFT OUTER JOIN? What can I do to troubleshoot this further?
EDIT: What makes this even more perplexing is the two inner queries are date limited and the overall query only runs slow when looking at date ranges between the start of July and any other day in July, but if the date ranges are anytime before the the July 1 and Today then it runs fine.
Without some more detail of your query its impossible to offer any hints as to what may speed your query up. A possible guess is the two inner queries are blocking access to any indexes which might have been used to perform the join resulting in large scans but there are probably many other possible reasons.
To check where the time is used in the query check the execution plan, there is a detailed explanation here
http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx
The basic run down is run the query, and display the execution plan, then look for any large percentages - they are what is slowing your query down.
Try rewriting your query without the nested SELECTs, which are rarely necessary. When using nested SELECTs - except for trivial cases - the inner SELECT resultsets are not indexed, which makes joining them to anything slow.
As Tetraneutron said, post details of your query -- we may help you rewrite it in a straight-through way.
Have you given a join predicate? Ie join table A ON table.ColA = table.ColB. If you don't give a predicate then SQL may be forced to use nested loops, so if you have a lot of rows in that range it would explain a query slow down.
Have a look at the plan in the SQL studio if you have MS Sql Server to play with.
After your t2 statement add a join condition on t1.joinfield = t2.joinfield
The issue was with fragmented data. After the data was defragmented the query started running within reasonable time constraints.
JOIN = Cartesian Product. All columns from both tables will be joined in numerous permutations. It is slow because the inner queries are querying each of the separate tables, but once they hit the join, it becomes a Cartesian product and is more difficult to manage. This would occur at the outer select statement.
Have a look at INNER JOINs as Tetraneutron recommended.