I have two tables:
Query is (short version):
SELECT TOP (2000) *
FROM [...].[Documents] AS [D]
INNER JOIN dbo.SE_CMSClientMatters AS [mat] ON [mat].[Mcode] = [D].[custom2]
I cannot for the life of me figure out why it does not work - i.e. it never completes execution. It ran for 13 hours with no sign of completing.
Documents table has around 12 million rows, ClientMatters has around 330,000 rows.
The original query has several other left joins and this is the only inner join. If I leave it out, the query completes in 20 seconds!
It's 4 a.m., so either I'm losing it or I've missed something obvious. PS - I did rebuild the indexes. The Custom2 field is part of a group of indexed fields (see image).
Any help appreciated - thanks!
One of the issues I see is
mat.MCode is of type varchar(27)
D.custom2 is of type nvarchar(32)
This is horrible (performance-wise) for a join: one column is Unicode, the other is not, so SQL Server must implicitly convert one side of every comparison before it can match rows.
Try to cast one to the other - something like this:
SELECT TOP (2000) *
FROM [...].[Documents] AS [D]
INNER JOIN dbo.SE_CMSClientMatters AS [mat]
ON CAST([mat].[Mcode] AS NVARCHAR(32)) = [D].[custom2]
As a general rule, you should always use the same datatype in the columns you join on - and joining is typically much easier and faster on numerical datatypes than on string-based ones.
If you can, try to convert one of these two columns to the same datatype as the other - I'm pretty sure that would speed things up significantly.
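If the schema change is feasible, here's a minimal sketch - it assumes nothing else (constraints, application code) depends on Mcode staying varchar(27):

-- Align the type with Documents.custom2 (nvarchar(32)).
-- Any index that includes Mcode must be dropped first and re-created afterwards,
-- and ALTER COLUMN makes the column nullable unless you restate NOT NULL.
ALTER TABLE dbo.SE_CMSClientMatters
    ALTER COLUMN Mcode NVARCHAR(32);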
And also: an index on Documents where Custom2 is in the second position can NOT be used for a seek on this join - try creating a separate index on custom2 alone:
CREATE NONCLUSTERED INDEX IX_Documents_custom2 ON dbo.Documents(custom2)
I have a query that joins to a subquery. When the filter value for an indexed nvarchar column is passed in as a string literal, performance tanks. However, when I pass in the same value as a parameter, the execution plan changes drastically and the query executes quickly. I apologize for the naming.
In the situation below, myStringColumn is indexed and the tables have millions of rows.
This one is bad:
SELECT myColumn FROM myTable1 LEFT OUTER JOIN
(SELECT myColumn2 FROM myTable2
WHERE myStringColumn = 'myFilter') AS sub ON...
This one is good:
DECLARE @myParameter AS nvarchar(50) = N'myFilter'
SELECT myColumn FROM myTable1 LEFT OUTER JOIN
(SELECT myColumn2 FROM myTable2
WHERE myStringColumn = @myParameter) AS sub ON...
What I do not understand is why the performance of the second is so much better. From what I can tell, it's because the nested loops join in the plan is much better in the second query, but I do not understand why. My guess is that for some reason SQL Server has to loop through more rows per iteration in the first, but I am at a loss as to why merely changing a string literal to a parameter would have that much effect.
Questions: Why is the second query better than the first? What is SQL Server doing that makes the first so much slower?
Thank you.
The difference is that when you declare the parameter, you do it correctly: you use the N prefix and declare the variable as nvarchar, the same type as your column.
In the WHERE clause, by contrast, you did not add the N prefix, so the literal is a varchar value; if you wrote myStringColumn = N'myFilter', it would behave the same way as the parameter version.
As to why this happens, and why it is important to match the type of the value you pass to the column type: when the two sides of a comparison have different types, SQL Server converts the side with the lower data type precedence, and varchar ranks below nvarchar. So when a varchar column is compared against an nvarchar value, SQL Server has to CONVERT the column value in every row (so basically all rows) to nvarchar before comparing - the index on the column is effectively lost, and that is where the huge performance problem comes from. Matching the types avoids the conversion entirely.
Here are some more explanations - https://www.sqlshack.com/query-performance-issues-on-varchar-data-type-using-an-n-prefix/
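To see the effect in isolation, here's a small hypothetical repro (table and index names are made up) against a varchar column - the direction in which the conversion really hurts:

-- A varchar column with a plain index on it.
CREATE TABLE dbo.Demo (id int IDENTITY PRIMARY KEY, val varchar(50));
CREATE INDEX IX_Demo_val ON dbo.Demo (val);

-- Seek: the varchar literal matches the column type.
SELECT id FROM dbo.Demo WHERE val = 'myFilter';

-- Potential scan: N'myFilter' is nvarchar, which outranks varchar, so the
-- column side is converted; with a SQL collation this typically prevents an
-- index seek, because every row's val is converted before the comparison.
SELECT id FROM dbo.Demo WHERE val = N'myFilter';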
I have a query in SQL Server with a join that is taking forever. I'm hoping someone might have a tip to speed it up.
I think the problem is that I'm joining on a field called Reseller_CSN, which has values like '_0070000050'.
I've tried using the SUBSTRING function in the join to return everything but the underscore, e.g. '0070000050', but I keep getting an error when I try to cast or convert the result to int or bigint.
Any tips would be greatly appreciated, the query is below:
SELECT
t1.RESELLER_CSN
,t1.FIRST_YEAR_RENEWAL
,t1.SEAT_SEGMENT
,t2.Target_End_Date_CY
,t2.Target_End_Date_PQ
,t2.Target_End_Date_CY_1
,t2.Target_End_Date_CY_2
,t1.ASSET_SUB_END_DATE
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
ON SUBSTRING(t1.RESELLER_CSN,2,11) = SUBSTRING(t2.RESELLER_CSN,2,11)
A join on processed columns invariably takes more effort than a join on raw columns. In this case, you can improve performance by using computed columns. For instance, on the first table, you can do:
alter table new_all_renewals add CSN_num as SUBSTRING(RESELLER_CSN, 2, 11);
create index IX_new_all_renewals_CSN_num on new_all_renewals(CSN_num);
This will generate an index on the computed column, which should speed up the query. (Note: you'll need to reference the computed column in the join, rather than repeating the function.)
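If the second table gets the same treatment, the join can then use both indexes - a sketch, where the mirrored column and index names are assumptions:

-- Mirror the computed column and index on the second table.
ALTER TABLE new_all_renewals_vwTable ADD CSN_num AS SUBSTRING(RESELLER_CSN, 2, 11);
CREATE INDEX IX_new_all_renewals_vwTable_CSN_num ON new_all_renewals_vwTable (CSN_num);

-- The join now references the computed (indexed) columns, not the function.
SELECT
    t1.RESELLER_CSN
   ,t1.FIRST_YEAR_RENEWAL
   ,t2.Target_End_Date_CY
   ,t1.ASSET_SUB_END_DATE
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
    ON t1.CSN_num = t2.CSN_num;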
I am trying to get some more context on this out of curiosity. So far, when I run two separate SQL statements, SQL Profiler shows me no CPU time, fewer reads, and a shorter duration than when I take the same script and use an INNER JOIN. Is this a typical case? I am looking for help understanding this better.
Simple example:
SELECT * FROM dbo.ChargeCode
SELECT * FROM dbo.ChargeCodeGroup
vs
SELECT *
FROM dbo.ChargeCode c
INNER JOIN dbo.ChargeCodeGroup cc ON c.ChargeCodeGroupID = cc.ChargeCodeGroupID
My guess is that the INNER JOIN costs extra CPU cycles because it's doing a nested loop. Am I on the right track with this?
The simple answer is that you're doing two different things here. In your first example you're retrieving two separate entities. In your second example, you're asking the RDBMS to combine (join) two entities into a single result set.
A join is one of the most powerful capabilities of an RDBMS - and it will (usually) do it as efficiently as it possibly can - but that's not to say it's free or cheap.
SELECT * FROM sometable
must scan the whole table.
If there is an index on the ChargeCodeGroupID column of either table, the INNER JOIN can be much faster, because it only has to scan the index. (Judging by the column names, there probably is one.) Of course, if there is no index on either ChargeCodeGroupID column, the second query will be slower than the first.
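If either index is missing, creating it is a one-liner per table. A sketch - the index names here are made up, and on ChargeCodeGroup, ChargeCodeGroupID is likely already the indexed primary key:

CREATE NONCLUSTERED INDEX IX_ChargeCode_GroupID
    ON dbo.ChargeCode (ChargeCodeGroupID);
CREATE NONCLUSTERED INDEX IX_ChargeCodeGroup_GroupID
    ON dbo.ChargeCodeGroup (ChargeCodeGroupID);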
There is a simple SQL JOIN statement below:
SELECT
REC.[BarCode]
,REC.[PASSEDPROCESS]
,REC.[PASSEDNODE]
,REC.[ENABLE]
,REC.[ScanTime]
,REC.[ID]
,REC.[Se_Scanner]
,REC.[UserCode]
,REC.[aufnr]
,REC.[dispatcher]
,REC.[matnr]
,REC.[unitcount]
,REC.[maktx]
,REC.[color]
,REC.[machinecode]
,P.PR_NAME
,N.NO_NAME
,I.[inventoryID]
,I.[status]
FROM tbBCScanRec as REC
left join TB_R_INVENTORY_BARCODE as R
ON REC.[BarCode] = R.[barcode]
AND REC.[PASSEDPROCESS] = R.[process]
AND REC.[PASSEDNODE] = R.[node]
left join TB_INVENTORY as I
ON R.[inventid] = I.[id]
INNER JOIN TB_NODE as N
ON N.NO_ID = REC.PASSEDNODE
INNER JOIN TB_PROCESS as P
ON P.PR_CODE = REC.PASSEDPROCESS
The table tbBCScanRec has 556,553 records, the table TB_R_INVENTORY_BARCODE has 260,513 records, and the table TB_INVENTORY has 7,688. However, the last two tables (TB_NODE and TB_PROCESS) both have fewer than 30 records.
Incredibly, when it runs in SQL Server 2005, it takes 8 hours to return the result set.
Why does it take so much time to execute?
If the two inner joins are removed, it takes just ten seconds to finish running.
What is the matter?
There are at least two UNIQUE NONCLUSTERED indexes.
One is IX_INVENTORY_BARCODE_PROCESS_NODE on the table TB_R_INVENTORY_BARCODE, which covers four columns (inventid, barcode, process, and node).
The other is IX_BARCODE_PROCESS_NODE on the table tbBCScanRec, which covers three columns (BarCode, PASSEDPROCESS, and PASSEDNODE).
Well, the standard answer to questions like this:
Make sure you have all the necessary indexes in place, i.e. indexes on N.NO_ID, REC.PASSEDNODE, P.PR_CODE, REC.PASSEDPROCESS (see the sketch after this list)
Make sure that the types of the columns you join on are the same, so that no implicit conversion is necessary.
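For example, a minimal sketch - the index names are made up, and some of these indexes may already exist:

CREATE NONCLUSTERED INDEX IX_TB_NODE_NO_ID ON TB_NODE (NO_ID);
CREATE NONCLUSTERED INDEX IX_TB_PROCESS_PR_CODE ON TB_PROCESS (PR_CODE);
-- The existing IX_BARCODE_PROCESS_NODE leads with BarCode, so it cannot be
-- seeked by process/node alone; this one can.
CREATE NONCLUSTERED INDEX IX_tbBCScanRec_Process_Node
    ON tbBCScanRec (PASSEDPROCESS, PASSEDNODE);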
You are working with around 500 million rows (556,553 × 30 × 30) in the worst case.
You probably have to add indexes on your tables.
If you are using SQL Server, you can examine the query plan to see where you are losing time.
See the documentation here : http://msdn.microsoft.com/en-us/library/ms190623(v=sql.90).aspx
The query plan will help you to create indexes.
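For example, in SSMS you can turn on "Include Actual Execution Plan" (Ctrl+M) before running the query, or capture the estimated plan from T-SQL:

SET SHOWPLAN_XML ON;
GO
-- The slow query goes here; a trimmed version of the join from the question:
SELECT REC.BarCode, P.PR_NAME, N.NO_NAME
FROM tbBCScanRec AS REC
INNER JOIN TB_NODE AS N ON N.NO_ID = REC.PASSEDNODE
INNER JOIN TB_PROCESS AS P ON P.PR_CODE = REC.PASSEDPROCESS;
GO
SET SHOWPLAN_XML OFF;
GO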
When you check the indexing, make sure there are clustered indexes as well - nonclustered indexes reference the clustered index, so a table without one (a heap) can make them much less effective. Outdated statistics could also be a problem.
However, why do you need to fetch ALL of the data? What is the purpose of that? You should have WHERE clauses restricting the result set to only what you need, for instance:
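Something like the following - the one-week filter is purely hypothetical; substitute whatever restriction the application actually needs:

SELECT REC.BarCode, REC.ScanTime, P.PR_NAME, N.NO_NAME
FROM tbBCScanRec AS REC
INNER JOIN TB_NODE AS N ON N.NO_ID = REC.PASSEDNODE
INNER JOIN TB_PROCESS AS P ON P.PR_CODE = REC.PASSEDPROCESS
WHERE REC.ScanTime >= DATEADD(DAY, -7, GETDATE());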
I have this query:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
The biggest table here is RESULT, which contains 11.1M records. The other two tables have about 1M each.
This query runs slowly (more than 10 minutes) and returns about 800 records. The execution plan shows a clustered index scan (over its PRIMARY KEY, result.result_number, which doesn't actually take part in the query) across all 11M records.
RESULT.TEST_NUMBER is a clustered primary key.
If I change 2010-03-17 09:00 to 2010-03-17 10:00, I get about 40 records. It executes in 300 ms, and the plan shows an index seek (over the result.test_number index).
If I replace the * in the SELECT clause with result.test_number (covered by an index), then everything becomes fast in the first case too. This points to disk I/O issues, but it doesn't explain the change of plan.
So, any ideas?
UPDATE:
sampled_date is in the sample table and is covered by an index.
The other fields in this query are covered by indexes too: test.sample_number and result.test_number.
UPDATE 2:
Apparently SQL Server, for some reason, doesn't want to use the index.
I did a small experiment: I removed the INNER JOIN with result, selected all the test.test_number values, and after that ran
SELECT * FROM RESULT WHERE TEST_NUMBER IN (...)
This, of course, runs fast. But I cannot work out what the difference is, or why the query optimizer chooses such an inappropriate way to select the data in the first case.
UPDATE 3:
After backing up the database and restoring it under a new name, both queries run fast, as expected, even over much wider ranges...
So - are there any special commands to clean up or optimize, or whatever, that could be relevant here? :-(
A couple things to try:
Update statistics
Add hints to the query about what index to use (in SQL Server you might say WITH (INDEX(myindex)) after specifying a table)
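For example, a forced-index sketch (IX_result_test_number is a hypothetical index name):

SELECT result.test_number
FROM result WITH (INDEX (IX_result_test_number))
INNER JOIN test ON test.test_number = result.test_number;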
EDIT: You noted that copying the database made it work, which tells me that the index statistics were out of date. You can update them with something like UPDATE STATISTICS mytable on a regular basis.
Use EXEC sp_updatestats to update the whole database.
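For example, with the table names from the question:

UPDATE STATISTICS dbo.sample;
UPDATE STATISTICS dbo.test;
UPDATE STATISTICS dbo.result;
-- or refresh every table in the current database at once:
EXEC sp_updatestats;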
The first thing I would do is specify the exact columns I want, and see if the problem persists. I doubt you need all the columns from all three tables.
It sounds like it has trouble getting all the rows out of the result table. How big is a row? Look at how big all the data in the table is and divide it by the number of rows (in SSMS: right-click the table -> Properties, Storage tab).
Try putting the WHERE clause into a subquery to force it to be evaluated first:
SELECT *
FROM
(SELECT * FROM sample
WHERE sampled_date
BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00') s
INNER JOIN test ON s.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
Or, this might work better if you expect a small number of samples:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sample.sample_ID in (
SELECT sample_ID
FROM sample
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
If you do a SELECT *, you want all the data from the table. The data for the table is in the clustered index - the leaf nodes of the clustered index are the data pages.
So if you want all of those data pages anyway, and since you're joining 1 million rows to 11 million rows (1 out of 11 isn't very selective for SQL Server), using an index to find the rows and then doing bookmark lookups into the actual data pages for each row found might just not be very efficient - so SQL Server uses the clustered index scan instead.
So to make a long story short: only select those columns you really need! You thus give SQL Server a chance to use an index, do a seek there, and find the necessary data.
If you only select three or four columns, the chances that SQL Server will find and use an index that contains those columns are just so much higher than if you ask for all the data from all the tables involved.
Another option would be to try to express a subquery, e.g. using a Common Table Expression, that grabs the data from the two smaller tables and reduces the number of rows even further, and then join that hopefully quite small result against the main table. With a small intermediate result of only 40 or 800 rows (rather than two tables with 1 million rows each), SQL Server might be more inclined to use a Clustered Index Seek and do bookmark lookups on 40 or 800 rows, rather than a full Clustered Index Scan.
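A sketch of that CTE approach, using the table and column names from the question:

WITH filtered AS
(
    SELECT test.test_number
    FROM sample
    INNER JOIN test ON sample.sample_number = test.sample_number
    WHERE sample.sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
SELECT result.*
FROM filtered
INNER JOIN result ON result.test_number = filtered.test_number;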