SQL Server Query Join Optimization - sql

I have looked for answers online but can't find a definitive one. For example, you have two join clauses:
1.
JOIN T2 ON T1.[ID] = T2.[ID]
2.
JOIN T2 ON T1.[ID] = REPLACE(T2.[ID],'A', '')
Now the second one performs worse because of the function in the join clause. What is the exact reason for this?
And, for example, if this code were in a stored procedure, what would be the best way to optimise it? Removing the REPLACE function and applying it at the table level, so that all of this is completed before any joins?
Any advice or links to further information would be great. Thanks

In your second example, you are attempting to find a record in T2 - but instead of comparing T1.ID against the stored T2.ID value, you are comparing it against a function of T2.ID - REPLACE(T2.[ID],'A', '')
If you had an index on T2.ID - at best it would scan the index and not seek it - thus causing a performance difference.
This is where it gets harder to explain - the index is stored as a B+tree of the values of T2.ID in the table. The index understands that field and can search / sort by it, but it doesn't understand any logic applied to it.
It does not know whether REPLACE('A123','A', '') = 123 - without executing the function on the value in the index and checking the resulting equality.
AAA123 would also match, as would 1A23, 12A3, 123A etc. - there is a never-ending number of combinations that would in fact match. The only way it can figure out whether a single index entry matches is to run that value through the function and then check the equality.
Since it can only figure that out by running the index value through the function, it can only answer the query correctly if it does that for every entry in the index - i.e. an index scan, with every entry passed into the function and the output checked.
As Jeroen mentions, the term is SARGable or SARGability (Search ARGument ABLE), although I personally prefer to explain it as Seek ARGument ABLE, since that is a closer match to the query plan operator.
It should be noted that this concept has nothing to do with it being a join, any predicate within SQL has this restriction - a single table query with a where predicate can have the same issue.
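A quick single-table illustration of the same issue (the Orders table, its index and the dates here are hypothetical, purely for the sketch):
-- Non-SARGable: the function wraps the column, so an index on OrderDate
-- can only be scanned, with YEAR() evaluated for every row.
SELECT OrderID FROM dbo.Orders WHERE YEAR(OrderDate) = 2023;
-- SARGable: the column is left bare, so the same index can be seeked.
SELECT OrderID FROM dbo.Orders WHERE OrderDate >= '20230101' AND OrderDate < '20240101';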
Can this problem be avoided? It can but only in some instances, where you can reverse the operation.
Consider a table with an ID column. I could construct a predicate such as this:
WHERE ID * 2 = @paramValue
The only way SQL Server would know if an ID entry multiplied by 2 is the passed in value is to process every entry, double it and check. So that is the index scan scenario again.
In this instance we can re-write it:
WHERE ID = @paramValue / 2.0
Now SQL Server performs the mathematics once, dividing the passed-in value, and can then check that against the index in a seekable manner. The rewritten SQL looks like a trivially different way of stating the problem, but it makes a very large difference to how the database can resolve the predicate.
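Put side by side (the table and index names here are made up for the sketch), you can run both forms with the actual execution plan enabled and watch the scan become a seek:
CREATE TABLE dbo.Demo (ID INT NOT NULL);
CREATE INDEX IX_Demo_ID ON dbo.Demo (ID);
DECLARE @paramValue INT = 84;
-- Index scan: every stored ID has to be doubled before the comparison.
SELECT ID FROM dbo.Demo WHERE ID * 2 = @paramValue;
-- Index seek: the arithmetic is applied once, to the parameter.
SELECT ID FROM dbo.Demo WHERE ID = @paramValue / 2.0;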

SQL Server has four basic methods for handling joins (as do other databases):
Nested loop without an index. This is like two nested for loops and is usually the slowest method.
Index lookup (nested loop with an index). This is a scan of one table with an index lookup into the second.
Merge join. This assumes that the two tables are ordered and loops through the two tables at the same time (this can also be accomplished using indexes).
Hash join. The keys for the two tables are hashed and hash-tables are used for matching.
In general, the first of these is the slowest and the second -- using an index -- is often the fastest (there are exceptions).
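If you want to see each algorithm in a plan, SQL Server's query-level join hints can force one (for experimentation only; dbo.T1 and dbo.T2 are placeholders):
SELECT * FROM dbo.T1 INNER JOIN dbo.T2 ON T1.ID = T2.ID OPTION (LOOP JOIN);  -- nested loops
SELECT * FROM dbo.T1 INNER JOIN dbo.T2 ON T1.ID = T2.ID OPTION (MERGE JOIN); -- merge join (sorts may be added)
SELECT * FROM dbo.T1 INNER JOIN dbo.T2 ON T1.ID = T2.ID OPTION (HASH JOIN);  -- hash join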
When you use an equality comparison between two columns in the table, SQL Server has a lot of information for deciding on the best join algorithm to use:
It has information on indexes.
It has statistics on the column.
Without this information, SQL Server often defaults to the nested-loop join. I find that it does this even when it could use the expression for a merge- or hash-based join.
As a note, you can work around this by using a computed column:
alter table t2 add id_no_a as (replace(id, 'A', '')) persisted;
create index idx_t2_id_no_a on t2(id_no_a);
Then phrase the join as:
on T1.[ID] = t2.id_no_a
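so the full query becomes something like this (a sketch; the column list is abbreviated):
select t1.*, t2.*
from t1 join t2
on t1.[ID] = t2.id_no_a;
REPLACE with constant arguments is deterministic, which is what allows the computed column to be persisted and indexed in the first place.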

Example of using union to avoid searches without index:
DECLARE @T1 TABLE (ID VARCHAR(16), CODE INT)
DECLARE @T2 TABLE (ID VARCHAR(16), CODE INT)
INSERT INTO @T1 VALUES ('ASD',1)
INSERT INTO @T1 VALUES ('DFG',2)
INSERT INTO @T1 VALUES ('RTY',3)
INSERT INTO @T1 VALUES ('AZX',4)
INSERT INTO @T1 VALUES ('GTY',5)
INSERT INTO @T1 VALUES ('KKO',6)
INSERT INTO @T2 VALUES ('ASD',1)
INSERT INTO @T2 VALUES ('SD',2)
INSERT INTO @T2 VALUES ('DFG',3)
INSERT INTO @T2 VALUES ('RTY',4)
INSERT INTO @T2 VALUES ('AZX',5)
INSERT INTO @T2 VALUES ('ZX',6)
INSERT INTO @T2 VALUES ('GTY',7)
INSERT INTO @T2 VALUES ('GTYA',8)
INSERT INTO @T2 VALUES ('KKO',9)
INSERT INTO @T2 VALUES ('KKOA',10)
INSERT INTO @T2 VALUES ('AKKOA',11)
SELECT * FROM @T1 T1 INNER JOIN (SELECT ID FROM @T2 WHERE ID NOT LIKE '%A%') T2 ON T2.ID = T1.ID
UNION ALL
SELECT * FROM @T1 T1 INNER JOIN (SELECT REPLACE(ID,'A','') ID FROM @T2 WHERE ID LIKE '%A%') T2 ON T2.ID = T1.ID
This is what you can do without schema changes.
With schema changes you need to add a computed, indexed column to T2 and join on it. This is much faster, and most of the effort shifts to inserts/updates, which have to maintain the extra column and the index on it.

Related

Count rows with column varbinary NOT NULL takes a lot of time

This query
SELECT COUNT(*)
FROM Table
WHERE [Column] IS NOT NULL
takes a lot of time. The table has 5000 rows, and the column is of type VARBINARY(MAX).
What can I do?
Your query needs to do a table scan on a column that can potentially be very large without any way to index it. There isn't much you can do to fix this without changing your approach.
One option is to split the table into two tables. The first table could have all the details you have now in it and the second table would have just the file. You can make this a 1-1 table to ensure data is not duplicated.
You would only add the binary data as needed into the second table. If it is not needed anymore, you simply delete the record. This will allow you to simply write a JOIN query to get the information you are looking for.
SELECT
COUNT(*)
FROM dbo.Table1
INNER JOIN dbo.Table2
ON Table1.Id = Table2.Id
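One possible shape for that split (all names here are illustrative; the primary key on Table2 doubling as a foreign key enforces the 1-1 relationship):
CREATE TABLE dbo.Table1 (
    Id INT NOT NULL PRIMARY KEY
    -- ... plus the existing detail columns ...
);
CREATE TABLE dbo.Table2 (
    Id INT NOT NULL PRIMARY KEY REFERENCES dbo.Table1 (Id),
    FileData VARBINARY(MAX) NOT NULL
);
The COUNT(*) join query above then only touches rows that actually have binary data.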

SQL Server performance issue. Whenever the number of records in the IN parameter increases, query performance degrades significantly

Select count (*)
from table
where id in (1,2,3,4,5,6.....500)
These IDs are populated externally through some script. As soon as the IN list exceeds a certain number of entries, the query slows down six-fold.
Any suggestion or help will be appreciated
As has been suggested in the comments, you can use a temporary table to hold the externally generated IDs and then join on them. You can create the temporary table as such:
Create table #TEMP(ID INT)
INSERT INTO #TEMP (ID) VALUES
(1), (2), (3), (4), (5) --Populate this with parameter as ID's are externally generated.
And then join as such:
Select t.*
from table t
inner join #Temp temp on t.ID = temp.ID
I sincerely hope this is an example and you are not actually trying to do this:
Select count (*)
from table
where id in (1,2,3,4,5,6.....500)
Since if, in your case, the IDs are unique (which they mostly are) and are not being deleted (which is quite a usual practice), then the total number of IDs in the IN clause will be the result of COUNT(*). In that case you don't need an IN clause at all - you can just count the number of values in the parameter you planned to use in the IN clause, and that should be good.
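And if you do want the database to confirm that those IDs actually exist in the table, you can count via the temp-table join instead of an IN list (a sketch; [table] stands in for the real table name from the question):
Select count(*)
from [table] t
inner join #Temp temp on t.ID = temp.ID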
Hope this helps!!!

Optimization - sql: How to show all data that exists in multiple tables

I have two tables. I want to find all the rows in table One that exist in table Two, and vice versa. I have an answer, but I want it faster.
Example:
Create table One (ID INT, Value INT, location VARCHAR(10))
Create table Two (ID INT, Value INT, location VARCHAR(10))
INSERT INTO One VALUES(1,2,'Hanoi')
INSERT INTO One VALUES(2,1,'Hanoi')
INSERT INTO One VALUES(1,4,'Hanoi')
INSERT INTO One VALUES(3,5,'Hanoi')
INSERT INTO Two VALUES(1,5,'Saigon')
INSERT INTO Two VALUES(4,6,'Saigon')
INSERT INTO Two VALUES(5,7,'Saigon')
INSERT INTO Two VALUES(2,8,'Saigon')
INSERT INTO Two VALUES(2,8,'Saigon')
And answers:
SELECT * FROM One WHERE ID IN (SELECT ID FROM Two)
UNION ALL
SELECT * FROM Two WHERE ID IN (SELECT ID FROM One)
With this query, the system scans the tables four times (as the execution plan shows).
I want the system to scan the tables only twice (table One once, table Two once).
Am I crazy?
You can try something like:
-- CREATE TABLES
IF OBJECT_ID ( 'tempdb..#One' ) IS NOT NULL
DROP TABLE #One;
IF OBJECT_ID ( 'tempdb..#Two' ) IS NOT NULL
DROP TABLE #Two;
CREATE TABLE #One (ID INT, Value INT, location VARCHAR(10))
CREATE TABLE #Two (ID INT, Value INT, location VARCHAR(10))
-- INSERT DATA
INSERT INTO #One VALUES(1,2,'Hanoi')
INSERT INTO #One VALUES(2,1,'Hanoi')
INSERT INTO #One VALUES(1,4,'Hanoi')
INSERT INTO #One VALUES(3,5,'Hanoi')
INSERT INTO #Two VALUES(1,5,'Saigon')
INSERT INTO #Two VALUES(4,6,'Saigon')
INSERT INTO #Two VALUES(5,7,'Saigon')
INSERT INTO #Two VALUES(2,8,'Saigon')
INSERT INTO #Two VALUES(2,8,'Saigon')
-- CREATE INDEX
CREATE NONCLUSTERED INDEX IX_One ON #One (ID) INCLUDE (Value, location)
CREATE NONCLUSTERED INDEX IX_Two ON #Two (ID) INCLUDE (Value, location)
-- SELECT DATA
SELECT o.ID
,o.Value
,o.location
FROM #One o
WHERE EXISTS (SELECT 1 FROM #Two t WHERE o.ID = t.ID)
UNION ALL
SELECT t.ID
,t.Value
,t.location
FROM #Two t
WHERE EXISTS (SELECT 1 FROM #One o WHERE t.ID = o.ID)
but it depends on how big your data is. If the data is really big (millions of rows) and you are running the Enterprise edition of SQL Server, you may consider using columnstore indexes.
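On the temp tables above, the columnstore suggestion would look something like this (illustrative only; whether it pays off depends entirely on the data volume):
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_One ON #One (ID, Value, location)
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Two ON #Two (ID, Value, location)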
The reason you're scanning the tables twice is that you are reading from table X and looking up the corresponding value in table Y, and once that's finished you do the same starting from table Y and looking for matches in table X. After that, both results are combined and returned to the caller.
In a way, that's not a bad thing, although if the tables are 'wide' and contain a lot of columns you don't need, you are doing a lot of IO for no good reason. Additionally, in your example, the search for a matching ID in the other table requires scanning the whole table, because there is no 'logic' to the ID field; it simply is a list of values. To speed things up you should add an index on the ID field, which helps the system find a particular ID value MUCH MUCH faster. Additionally, this also limits the amount of data that needs to be read during the lookup phase: the server will only read from the index, which contains only the ID values (**) and not all the other, unneeded fields.
To be honest, I find your requirement a bit strange, but I'm guessing it's mostly a simplification to make it understandable here on SO. My first reaction was to suggest a JOIN between both tables, but since the ID fields are non-unique this results in duplicates! To work around that I added a DISTINCT, but then things slowed down severely. In the end, doing just the WHERE ID IN (...) turned out to be the most efficient approach.
Adding indexes on the ID field made it faster although the effect wasn't as big as I expected, probably because there are few other fields and the gain in IO is negligible (read: it all fits in memory even though I tried this on 5 million rows).
FYI: Personally I prefer the construction WHERE EXISTS() over WHERE IN (...) but they're both equivalent and actually produced the exact same query plan.
(**: aside from the indexed fields, every index also contains the clustered index -- which usually is the Primary Key of the table -- fields in its leaf data. For more information Kimberly L. Tripp has some interesting articles about indexes and how they work.)

redshift select distinct returns repeated values

I have a database where each object property is stored in a separate row. The attached query does not return distinct values in a Redshift database, but works as expected when tested in any MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug and behavior is intentional, though not straightforward.
In Redshift, you can declare constraints on tables but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The difference here is that when you run a SELECT DISTINCT query against a column that doesn't have a primary key declared, Redshift will scan the whole column and compute the unique values; if you run the same against a column that has a primary key constraint, it will just return the output without performing the unique-list filtering. This is how you can end up with duplicate entries if you have inserted them.
Why is this done? Redshift is optimized for large datasets, and it's much faster to copy data if you don't need to check constraint validity for every row that you copy or insert. If you want, you can declare a primary key constraint as part of your data model, but you will need to support it explicitly, by removing duplicates or designing your ETL in such a way that no duplicates arise.
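One common way to clean up after the fact (the table names here are hypothetical, and this rebuild approach is just one option) is to recreate the table from a DISTINCT copy:
CREATE TABLE my_table_dedup AS SELECT DISTINCT * FROM my_table;
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_dedup RENAME TO my_table;
DROP TABLE my_table_old;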
More information with specific examples in this Heap blog post Redshift Pitfalls And How To Avoid Them
Perhaps you can solve this by using appropriate joins.
For example, I have duplicate values in table1 and I want the values of table1 by joining it to table2, with some logic behind joining the two tables according to your conditions.
So I can do something like this:
select distinct table1.col1 from table1 left outer join table2 on table1.col1 = table2.col1
This worked very well for me: I got unique values from table1 and could remove the duplicates.

Optimizing an Oracle SQL query which uses the IN clause extensively

I maintain an application where I am trying to optimize an Oracle SQL query in which multiple IN clauses are used. This query is now a blocker, as it hogs nearly 3 minutes of execution time and affects application performance severely. The query is called from Java code (JDBC) and looks like this:
Select distinct col1,col2,col3,.. colN from Table1
where 1=1 and not(col1 in (idsetone1,idsetone2,... idsetoneN)) or
(col1 in(idsettwo1,idsettwo2,...idsettwoN))....
(col1 in(idsetN1,idsetN2,...idsetNN))
The ID sets are retrieved from a different schema and therefore a JOIN between column1 of table 1 and ID sets is not possible. ID sets have grown over time with use of the application and currently they number more than 10,000 records.
How can I start with optimizing this query ?
I really doubt that "The ID sets are retrieved from a different schema and therefore a JOIN between column1 of table 1 and ID sets is not possible." Of course you can join the tables, provided you have SELECT privileges on them.
Anyway, let's assume it is not possible for whatever reason. One solution could be to first insert all the entries into a nested table and then use that:
CREATE OR REPLACE TYPE NUMBER_TABLE_TYPE AS TABLE OF NUMBER;
Select distinct col1,col2,col3,.. colN from Table1
where 1=1
and not (col1 MEMBER OF NUMBER_TABLE_TYPE(idsetone1,idsetone2,... idsetoneN))
OR
(col1 MEMBER OF NUMBER_TABLE_TYPE(idsettwo1,idsettwo2,...idsettwoN))
Regarding the max. number of elements Oracle Documentation says: Because a nested table does not have a declared size, you can put as many elements in the constructor as necessary.
I don't know how seriously you can take this statement.
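An alternative sketch that avoids MEMBER OF entirely: expand the nested table into rows with TABLE() and its COLUMN_VALUE pseudocolumn, and fall back to plain IN / NOT IN (the id-set placeholders here are the question's):
Select distinct col1,col2,col3,.. colN from Table1
where col1 NOT IN (SELECT column_value FROM TABLE(NUMBER_TABLE_TYPE(idsetone1,idsetone2,... idsetoneN)))
or col1 IN (SELECT column_value FROM TABLE(NUMBER_TABLE_TYPE(idsettwo1,idsettwo2,...idsettwoN)))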
You should put all the items into one temporary table and do an explicit join:
Select your cols
from Table1
left join table_with_items
on table_with_items.id = Table1.col1
where table_with_items.id is null;
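If you go this route, the temporary table itself could be a global temporary table, the usual Oracle equivalent (the name matches the query above; the column type is an assumption):
CREATE GLOBAL TEMPORARY TABLE table_with_items (
    id NUMBER
) ON COMMIT DELETE ROWS;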
Also, that distinct suggests a problem in your business logic or in the architecture of the application. Why do you have duplicate ids? You should get rid of that distinct.