I have a query in SQL Server with a join that is taking forever. I'm hoping someone might have a tip to speed it up.
I think the problem is I'm joining on a field called Reseller_CSN which has values like '_0070000050'
I've tried using the SUBSTRING function in the join to return everything but the underscore (e.g. '0070000050'), but I keep getting an error when I try to cast or convert the result to int or bigint.
Any tips would be greatly appreciated, the query is below:
SELECT
t1.RESELLER_CSN
,t1.FIRST_YEAR_RENEWAL
,t1.SEAT_SEGMENT
,t2.Target_End_Date_CY
,t2.Target_End_Date_PQ
,t2.Target_End_Date_CY_1
,t2.Target_End_Date_CY_2
,t1.ASSET_SUB_END_DATE
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
ON SUBSTRING(t1.RESELLER_CSN,2,11) = SUBSTRING(t2.RESELLER_CSN,2,11)
A join on processed columns invariably takes more effort than a join on raw columns. In this case, you can improve performance by using computed columns. For instance, on the first table, you can do:
alter table new_all_renewals add CSN_num as SUBSTRING(RESELLER_CSN, 2, 11);
create index idx_new_all_renewals_CSN_num on new_all_renewals(CSN_num);
This will generate an index on the column, which should speed the query. (Note: you'll need to reference the computed column rather than actually using the function.)
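Assuming a matching computed column and index are also added to new_all_renewals_vwTable in the same way, the join can then reference the computed columns directly; a minimal sketch of the rewritten join (CSN_num is just the example name used above):
-- assumes CSN_num has been added and indexed on both tables as shown above
SELECT
 t1.RESELLER_CSN
,t1.FIRST_YEAR_RENEWAL
,t1.SEAT_SEGMENT
,t2.Target_End_Date_CY
,t1.ASSET_SUB_END_DATE
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
ON t1.CSN_num = t2.CSN_num;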
I have a query that joins to a subquery. When the filter value for an indexed nvarchar column is passed in as a string literal, the performance tanks. However, when I pass in the same value as a parameter, the execution plan changes drastically and the query executes quickly. I apologize for the placeholder naming.
In the situation below, myStringColumn is indexed and the tables have millions of rows.
This one is bad:
SELECT myColumn from myTable1 LEFT OUTER JOIN
(SELECT myColumn2 from myTable2
WHERE myStringColumn = 'myFilter') ON...
This one is good:
Declare @myParameter as nvarchar = N'myFilter'
SELECT myColumn from myTable1 LEFT OUTER JOIN
(SELECT myColumn2 from myTable2
WHERE myStringColumn = @myParameter) ON...
What I do not understand is why the performance of the second is so much better. From what I can tell, it's because the nested loops join in the plan is much better in the second query, but I do not understand why. My guess is that for some reason SQL Server has to loop through more rows on each iteration in the first, but I am at a loss as to why merely changing from a string literal to a parameter would have that much effect.
Questions: Why is the second query better than the first? What is SQL Server doing that makes the first so much slower?
Thank you.
The difference is that when you declare the parameter, you do it correctly: you use the N prefix and declare the variable as nvarchar, the same type as your column.
When you write the WHERE with the literal, you are not adding the N prefix, so internally a varchar value is created; if you wrote myStringColumn = N'myFilter' it would work the same way as the parameter version.
As for why this happens and why it is important to match the column type to the parameters you pass: if the parameter is not of the column's type, SQL Server will convert ALL the values in the column (so basically every row) to the type of the parameter passed in the query (not the other way around). That is why you have the huge performance problem - the index is essentially lost in this case, because all the column values are being CONVERTED to varchar.
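To illustrate, a minimal sketch of the literal form with the N prefix added (same placeholder names as the question; the rest of the query is unchanged):
SELECT myColumn from myTable1 LEFT OUTER JOIN
(SELECT myColumn2 from myTable2
 WHERE myStringColumn = N'myFilter') ON...   -- the N prefix makes the literal nvarchar, matching the column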
Here are some more explanations - https://www.sqlshack.com/query-performance-issues-on-varchar-data-type-using-an-n-prefix/
I have two tables:
Query is (short version):
SELECT TOP (2000) *
FROM [...].[Documents] AS [D]
INNER JOIN dbo.SE_CMSClientMatters AS [mat] ON [mat].[Mcode] = [D].[custom2]
I cannot figure out for the life of me why it does not work - i.e. it never completes execution; it ran for 13 hours and still showed no sign of completing.
Documents table has around 12 million rows, ClientMatters has around 330,000 rows.
The original query has several other left joins and this is the only inner join. If I leave it out, the query completes in 20 seconds!
It's 4am, so either I'm losing it or I have missed something obvious. PS - I did rebuild the indexes. The Custom2 field is part of a group of indexed fields (see image).
Any help appreciated - thanks!
One of the issues I see is
mat.MCode is of type varchar(27)
D.custom2 is of type nvarchar(32)
This is horrible (performance-wise) when joining - one column is Unicode, the other is not.
Try to cast one to the other - something like this:
SELECT TOP (2000) *
FROM [...].[Documents] AS [D]
INNER JOIN dbo.SE_CMSClientMatters AS [mat]
ON CAST([mat].[Mcode] AS NVARCHAR(32)) = [D].[custom2]
As a general rule, you should always try to use the same datatype in columns that you use for joining - and joining is typically much easier and faster on numerical datatypes, rather than string-based datatypes.
If you can, try to convert one of these two columns to the same datatype as the other - I'm pretty sure that would speed things up significantly.
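If altering the table is an option, a minimal sketch of converting Mcode to the same type as custom2 (any index or constraint that includes Mcode may need to be dropped and recreated around this, and nullability should be re-specified if the column is NOT NULL):
-- was varchar(27); widen and convert so it matches Documents.custom2 (nvarchar(32))
ALTER TABLE dbo.SE_CMSClientMatters
    ALTER COLUMN Mcode NVARCHAR(32);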
And also: an index on Documents where Custom2 is in the second position will NOT be usable for this join - try to create a separate index on custom2 alone:
CREATE NONCLUSTERED INDEX IX_Documents_custom2 ON dbo.Documents(custom2)
For Example:
I have 4 columns (A,B,C,D).
I thought that instead of joining on each and every column, I should create a concatenated column in both projections (CA_CONCAT -> A+B+C+D) and join on that, just to check which method performs better.
It was working faster at first, but in a few CVs this method is sometimes slower, especially when filtering!
Can anyone suggest which method is more efficient?
I don't think JOIN conditions on concatenated fields will perform better.
Although we generally say there is no need for explicit indexes on column tables in a HANA database, column tables have a structure that effectively works like an index on every column.
So if you concatenate 4 columns to produce a new calculated field, you first lose the option of using those implicit indexes on the 4 columns and on the corresponding join columns.
I did not check the execution plan, but it will probably do a full scan on these columns.
In fact, I'm surprised you mention that it worked faster and that you experienced problems only in a few cases.
Concatenation, or applying any function to a database column, is by itself extra work on top of the SELECT process. It may also involve an implicit type cast, which can add more overhead than expected.
First, I would suggest setting your table to column store and checking the new performance.
After that, I would suggest splitting the JOIN into multiple JOINs if you are using an OR condition in your join (a sketch of this rewrite follows below).
Third, an INNER JOIN will give you better performance compared to a LEFT JOIN or LEFT OUTER JOIN.
Another thing about JOINs and performance: you are better off joining on PRIMARY KEYS and not on arbitrary columns.
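A minimal sketch of the OR-condition split mentioned above, using hypothetical tables Tab1/Tab2 and columns A, B, C:
-- instead of one join with an OR condition:
--   ... FROM Tab1 t1 INNER JOIN Tab2 t2 ON t1.A = t2.A OR t1.B = t2.B
-- run two joins and combine them, so each join can use its own column:
SELECT t1.A, t1.B, t2.C
FROM Tab1 t1 INNER JOIN Tab2 t2 ON t1.A = t2.A
UNION
SELECT t1.A, t1.B, t2.C
FROM Tab1 t1 INNER JOIN Tab2 t2 ON t1.B = t2.B;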
For me, both times the join on multiple fields performed faster than the join on the concatenated field. For the filtering scenario, PlanViz shows that when I join on multiple fields, the filter gets pushed down to both tables. On the other hand, when I join on the concatenated field, only one table gets filtered.
However, if you put a filter on both fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then you can push the filter down to both tables.
Like:
Select * from CalculationView where PRODUCT = 'A' and MATERIAL = 'A'
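For reference, the two join variants being compared look roughly like this (Tab1/Tab2 and the aliases are illustrative; A, B, C, D and CA_CONCAT are the columns from the question):
-- join on each field: filters on the individual columns can be pushed down to both sides
SELECT t1.A, t1.B, t1.C, t1.D
FROM Tab1 t1
INNER JOIN Tab2 t2
    ON  t1.A = t2.A
    AND t1.B = t2.B
    AND t1.C = t2.C
    AND t1.D = t2.D;

-- join on the concatenated calculated column: typically only one side gets the filter pushed down
SELECT t1.A, t1.B, t1.C, t1.D
FROM Tab1 t1
INNER JOIN Tab2 t2
    ON t1.CA_CONCAT = t2.CA_CONCAT;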
I have written a query to fetch polygon data from a SQL Server database.
I have the following query to fetch the results:
SELECT ZIP,
NAME,
STABB,
AREA,
TYPE,
orgZc.OrganizationId,
orgZc.[ZipCode] AS ORGzip,
REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))','')AS WKT
FROM USZIP
INNER JOIN ORGANIZATION_ZIP_CODES orgZc ON orgZc.[ZipCode]=USZIP.zip
WHERE orgZc.OrganizationId=@ORGANIZATION_ID
On this table I have already added a spatial index as below:
CREATE SPATIAL INDEX SIndx_SpatialTable_geometry_col1
ON USZIP(GEOM) WITH ( BOUNDING_BOX = ( -90, -180, 90, 180 ) );
But it takes 38 sec to fetch the 2483 records. Can anyone help me optimize this query?
My guess is that the important part of your query is the FROM and WHERE clauses. However, you can test this by removing the line:
REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))','')AS WKT
to see if that processing is taking up a lot of time.
For this part of the query:
FROM USZIP INNER JOIN
ORGANIZATION_ZIP_CODES orgZc
ON orgZc.[ZipCode] = USZIP.zip
WHERE orgZc.OrganizationId = @ORGANIZATION_ID;
You say that the zip code is "a primary column". However, it has to be the first column in a composite index (or primary key) in order to be used for the join. So, you really want an index on USZIP(zip) for the join to work. (I'm guessing this is true based on the name of the table, but I want to be explicit.)
Second, your where clause is limited to one OrganizationId, presumably of many. If so, you want an index on ORGANIZATION_ZIP_CODES(OrganizationId). Or, better yet, on ORGANIZATION_ZIP_CODES(OrganizationId, ZipCode).
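A minimal sketch of those indexes (the index names are just placeholders):
-- supports the join on zip
CREATE INDEX IX_USZIP_zip
    ON USZIP (zip);

-- supports the filter on OrganizationId and covers the join column
CREATE INDEX IX_ORGANIZATION_ZIP_CODES_OrgId_Zip
    ON ORGANIZATION_ZIP_CODES (OrganizationId, ZipCode);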
I found the solution. I added a new column and populated it with the result of REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))',''). Now I can fetch from the newly added column directly without doing any manipulation, and it takes 3 sec to fetch the 2483 records.
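A minimal sketch of that change (the column name WKT_TEXT is hypothetical):
-- one-time change: store the pre-computed WKT text so the query no longer calls STAsText()/REPLACE per row
ALTER TABLE USZIP ADD WKT_TEXT NVARCHAR(MAX);
GO
UPDATE USZIP
SET WKT_TEXT = REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),
                   'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))','');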
Do T-SQL queries in SQL Server support short-circuiting?
For instance, I have a situation where I have two databases and I'm comparing data between two tables to match and copy some info across. In one table, the "ID" field will always have leading zeros (such as "000000001234"), and in the other table, the ID field may or may not have leading zeros (might be "000000001234" or "1234").
So my query to match the two is something like:
select * from table1 where table1.ID LIKE '%1234'
To speed things up, I'm thinking of adding an OR before the LIKE that just says:
table1.ID = table2.ID
to handle the case where both IDs have the padded zeros and are equal.
Will doing so speed up the query by matching items on the "=" and not evaluating the LIKE for every single row (will it short-circuit and skip the LIKE)?
SQL Server does NOT short-circuit WHERE conditions.
It can't, since it's a cost-based system: How SQL Server short-circuits WHERE condition evaluation.
You could add a computed column to the table. Then, index the computed column and use that column in the join.
Ex:
Alter Table Table1 Add PaddedId As Right('000000000000' + Id, 12)
Create Index idx_WhateverIndexNameYouWant On Table1(PaddedId)
Then your query would be...
select * from table1 where table1.PaddedID ='000000001234'
This will use the index you just created to quickly return the row.
You want to make sure that at least one of the tables is using its actual data type for the IDs and that it can use an index seek if possible. It depends on the selectivity of your query and the rate of matches though to determine which one should be converted to the other. If you know that you have to scan through the entire first table, then you can't use a seek anyway and you should convert that ID to the data type of the other table.
To make sure that you can use indexes, also avoid LIKE. As an example, it's much better to have:
WHERE
T1.ID = CAST(T2.ID AS VARCHAR) OR
T1.ID = RIGHT('0000000000' + CAST(T2.ID AS VARCHAR), 10)
than:
WHERE
T1.ID LIKE '%' + CAST(T2.ID AS VARCHAR)
As Steven A. Lowe mentioned, the second query might be inaccurate as well.
If you are going to be using all of the rows from T1 though (in other words a LEFT OUTER JOIN to T2) then you might be better off with:
WHERE
CAST(T1.ID AS INT) = T2.ID
Do some query plans with each method if you're not sure and see what works best.
The absolute best route, though, is as others have suggested: change the data types of the tables to match, if that's at all possible. Even if you can't do it before this project is due, put it on your "to do" list for the near future.
How about,
table1WithZero.ID = REPLICATE('0', 12-len(table2.ID))+table2.ID
In this case, it should be able to use the index on table1.
Just in case it's useful, as the linked page in Mladen Prajdic's answer explains, CASE expressions are short-circuit evaluated.
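For instance, if ordered evaluation is what you are after, you could express the two tests from the question inside a CASE (a hypothetical sketch; the caveats in the linked article still apply):
-- the WHEN branches are evaluated in order, so the cheap equality test runs first
-- and the LIKE is only evaluated for rows where the equality does not match
SELECT *
FROM table1
WHERE 1 = CASE
              WHEN table1.ID = '000000001234' THEN 1
              WHEN table1.ID LIKE '%1234'     THEN 1
              ELSE 0
          END;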
If the ID is purely numeric (as in your example), I would recommend (if possible) changing that field to a number type instead. If the database is already in use it might be hard to change the type, though.
Fix the database to be consistent.
select * from table1 where table1.ID LIKE '%1234'
will match '1234', '01234', '00000000001234', but also '999991234'. Using LIKE with a leading wildcard pretty much guarantees an index scan rather than a seek (assuming table1.ID is indexed!). Cleaning up the data will improve performance significantly.
If cleaning up the data is not possible, write a user-defined function (UDF) to strip off leading zeros, e.g.
select * from table1 where dbo.udfStripLeadingZeros(table1.ID) = '1234'
This may not improve performance (since the function will have to run for each row), but it will eliminate false matches and make the intent of the query more obvious.
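A minimal sketch of what such a UDF might look like (one possible implementation, not a built-in function):
CREATE FUNCTION dbo.udfStripLeadingZeros (@value VARCHAR(50))
RETURNS VARCHAR(50)
AS
BEGIN
    -- find the first non-zero character; the appended '.' makes an all-zero input return an empty string
    RETURN SUBSTRING(@value, PATINDEX('%[^0]%', @value + '.'), LEN(@value));
END;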
EDIT: Tom H's suggestion to CAST to an integer would be best, if that is possible.