Trying to optimize a system objects search - sql

I’m trying to search the database for any stored procedures that contain one of about 3500 different values.
I created a table to store the values in. I’m running the query below. The problem is, just testing it with a SELECT TOP 100 is taking 3+ mins to run (I have 3500+ values). I know it’s happening due to the query using LIKE.
I’m wondering if anyone has an idea on how I could optimize the search. The only results I need are the names of every value being searched for (pulled directly from the table I created: “SearchTerms”) and then a column that displays a 1 if it exists, 0 if it doesn’t.
Here’s the query I’m running:
SELECT
trm.Pattern,
(CASE
WHEN sm.object_id IS NULL THEN 0
ELSE 1
END) AS “Exists”
FROM dbo.SearchTerms trm
LEFT OUTER JOIN sys.sql_modules sm
ON sm.definition LIKE '%' + trm.Pattern + '%'
ORDER BY trm.Pattern
Note: it’s a one-time deal —it’s not something that will be run consistently.

Try CTE and get your Patterns which exists in any stored procedure with WHERE condition using EXISTS (...). Then use LEFT JOIN with dbo.SearchTerms and your CTE to get 1 or 0 value for Exists column.
;WITH ExistsSearchTerms AS (
SELECT Pattern
FROM dbo.SearchTerms
WHERE EXISTS (SELECT 1 FROM sys.sql_modules sm WHERE sm.definition LIKE '%' + Pattern + '%')
)
SELECT trm.Pattern, IIF(trmExist.Pattern IS NULL, 0, 1) AS "Exists"
FROM dbo.SearchTerms trm
LEFT JOIN dbo.SearchTerms trmExist
ON trm.Pattern = trmExist.Pattern
ORDER BY Pattern
Reference :
SQL performance on LEFT OUTER JOIN vs NOT EXISTS
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server

Related

Hive - Multiple sub-queries in where clause is failing

I am trying to create a table by checking two sub-query expressions within the where clause but my query fails with the below error :
Unsupported sub query expression. Only 1 sub query expression is
supported
Code snippet is as follows (Not the exact code. Just for better understanding) :
Create table winners row format delimited fields terminated by '|' as
select
games,
players
from olympics
where
exists (select 1 from dom_sports where dom_sports.players = olympics.players)
and not exists (select 1 from dom_sports where dom_sports.games = olympics.games)
If I execute same command with only one sub-query in where clause it is getting executed successfully. Having said that is there any alternative to achieve the same in a different way ?
Of course. You can use left join.
Inner join will act as exists. and left join + where clause will mimic the not exists.
There can be issue with granularity but that depends on your data.
select distinct
olympics.games,
olympics.players
from olympics
inner join dom_sports dom_sports on dom_sports.players = olympics.players
left join dom_sports dom_sports2 where dom_sports2.games = olympics.games
where dom_sports2.games is null

Improve performance of SQL Query with dynamic like

I need to search for people whose FirstName is included (a substring of) in the FirstName of somebody else.
SELECT DISTINCT top 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]
FROM [dbo].[people] people
INNER JOIN [dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
WHERE EXISTS (SELECT *
FROM [dbo].[people_NAME] peopleName2
WHERE peopleName2.[Id] != people.[id]
AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')
It is so slow! I know it's because of the "'%' + peopleName.[FirstName] + '%'", because if I replace it with a hardcoded value like '%G%', it runs instantly.
With my dynamic like, my top 10 takes mores that 10 seconds!
I want to be able to run it on much bigger database.
What can I do?
Take a look at my answer about using LIKE operator here
It could be quite performant if you use some tricks
You can gain much speed if you play with collation, try this:
SELECT DISTINCT TOP 10 p.[Id], n.[LastName], n.[FirstName]
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n on n.[Id] = p.[Id]
WHERE EXISTS (
SELECT 'x' x
FROM [dbo].[people_NAME] n2
WHERE n2.[Id] != p.[id]
AND
lower(n2.[FirstName]) collate latin1_general_bin
LIKE
'%' + lower(n1.[FirstName]) + '%' collate latin1_general_bin
)
As you can see we are using binary comparision instead of string comparision and this is much more performant.
Pay attention, you are working with people's names, so you can have issues with special unicode characters or strange accents.. etc.. etc..
Normally the EXISTS clause is better than INNER JOIN but you are using also a DISTINCT that is a GROUP BY on all columns.. so why not to use this?
You can switch to INNER JOIN and use the GROUP BY instead of the DISTINCT so testing COUNT(*)>1 will be (very little) more performant than testing WHERE n2.[Id] != p.[id], especially if your TOP clause is extracting many rows.
Try this:
SELECT TOP 10 p.[Id], n.[LastName], n.[FirstName]
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n on n.[Id] = p.[Id]
INNER JOIN [dbo].[people_NAME] n2 on
lower(n2.[FirstName]) collate latin1_general_bin
LIKE
'%' + lower(n1.[FirstName]) + '%' collate latin1_general_bin
GROUP BY n1.[Id], n1.[FirstName]
HAVING COUNT(*)>1
Here we are matching also the name itself, so we will find at least one match for each name.
But We need only names that matches other names, so we will keep only rows with match count greater than one (count(*)=1 means that name match only with itself).
EDIT: I did all test using a random names table with 100000 rows and found that in this scenario, normal usage of LIKE operator is about three times worse than binary comparision.
This is a hard problem. I don't think a full text index will help, because you want to compare two columns.
That doesn't leave good options. One possibility is to implement ngrams. These are sequences of characters (say, 3 in a row) that come from a string. From my first name, you would have:
gor
ord
rdo
don
Then you can use these for direct matching on another column. Then you have to do additional work to see if the full name for one column matches another. But the ngrams should significantly reduce the work space.
Also, implementing ngrams requires work. One method uses a trigger which calculates the ngrams for each name and then inserts them into an ngram table.
I'm not sure if all this work is worth the effort to solve your problem. But it is possible to speed up the search.
You can do this,
With CTE as
(
SELECT top 10 peopleName.[Id], peopleName.[LastName], peopleName.[FirstName]
FROM
[dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
WHERE EXISTS (SELECT 1
FROM [dbo].[people_NAME] peopleName2
WHERE peopleName2.[Id] != people.[id]
AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')
order by peopleName.[Id]
)
//here join CTE with people table if at all it is require
select * from CTE
IF joining with people is not require then no need of CTE.
Have you tried a JOIN instead of a correlated query ?.
Being unable to use an index it won't have an optimal performance, but it should be a bit better than a correlated subquery.
SELECT DISTINCT top 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]
FROM [dbo].[people] people
INNER JOIN [dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
INNER JOIN [dbo].[people_NAME] peopleName2 on peopleName2.[Id] <> people.[id] AND
peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%'

Check the query efficiency

I have this below SQL query that I want to get an opinion on whether I can improve it using Temp Tables or something else or is this good enough? So basically I am just feeding the result set from inner query to the outer one.
SELECT S.SolutionID
,S.SolutionName
,S.Enabled
FROM dbo.Solution S
WHERE s.SolutionID IN (
SELECT DISTINCT sf.SolutionID
FROM dbo.SolutionToFeature sf
WHERE sf.SolutionToFeatureID IN (
SELECT sfg.SolutionToFeatureID
FROM dbo.SolutionFeatureToUsergroup SFG
WHERE sfg.UsergroupID IN (
SELECT UG.UsergroupID
FROM dbo.Usergroup UG
WHERE ug.SiteID = #SiteID
)
)
)
It's going to depend largely on the indexes you have on those tables. Since you are only selecting data out of the Solution table, you can put everything else in an exists clause, do some proper joins, and it should perform better.
The exists clause will allow you to remove the distinct you have on the SolutionToFeature table. Distinct will cause a performance hit because it is basically creating a temp table behind the scenes to do the comparison on whether or not the record is unique against the rest of the result set. You take a pretty big hit as your tables grow.
It will look something similar to what I have below, but without sample data or anything I can't tell if it's exactly right.
Select S.SolutionID, S.SolutionName, S.Enabled
From dbo.Solutin S
Where Exists (
select 1
from dbo.SolutionToFeature sf
Inner Join dbo.SolutionToFeatureTousergroup SFG on sf.SolutionToFeatureID = SFG.SolutionToFeatureID
Inner Join dbo.UserGroup UG on sfg.UserGroupID = UG.UserGroupID
Where S.SolutionID = sf.SolutionID
and UG.SiteID = #SiteID
)

Optimize query in TSQL 2005

I have to optimize this query can some help me fine tune it so it will return data faster?
Currently the output is taking somewhere around 26 to 35 seconds. I also created index based on attachment table following is my query and index:
SELECT DISTINCT o.organizationlevel, o.organizationid, o.organizationname, o.organizationcode,
o.organizationcode + ' - ' + o.organizationname AS 'codeplusname'
FROM Organization o
JOIN Correspondence c ON c.organizationid = o.organizationid
JOIN UserProfile up ON up.userprofileid = c.operatorid
WHERE c.status = '4'
--AND c.correspondence > 0
AND o.organizationlevel = 1
AND (up.site = 'ALL' OR
up.site = up.site)
--AND (#Dept = 'ALL' OR #Dept = up.department)
AND EXISTS (SELECT 1 FROM Attachment a
WHERE a.contextid = c.correspondenceid
AND a.context = 'correspondence'
AND ( a.attachmentname like '%.rtf' or a.attachmentname like '%.doc'))
ORDER BY o.organizationcode
I can't just change anything in db due to permission issues, any help would be much appreciated.
I believe your headache is coming from this part in specific...like in a where exists can be your performance bottleneck.
AND EXISTS (SELECT 1 FROM Attachment a
WHERE a.contextid = c.correspondenceid
AND a.context = 'correspondence'
AND ( a.attachmentname like '%.rtf' or a.attachmentname like '%.doc'))
This can be written as a join instead.
SELECT DISTINCT o.organizationlevel, o.organizationid, o.organizationname, o.organizationcode,
o.organizationcode + ' - ' + o.organizationname AS 'codeplusname'
FROM Organization o
JOIN Correspondence c ON c.organizationid = o.organizationid
JOIN UserProfile up ON up.userprofileid = c.operatorid
left join article a on a.contextid = c.correspondenceid
AND a.context = 'correspondence'
and right(attachmentname,4) in ('.doc','.rtf')
....
This eliminates both the like and the where exists. put your where clause at the bottom.it's a left join, so a.anycolumn is null means the record does not exist and a.anycolumn is not null means a record was found. Where a.anycolumn is not null will be the equivalent of a true in the where exists logic.
Edit to add:
Another thought for you...I'm unsure what you are trying to do here...
AND (up.site = 'ALL' OR
up.site = up.site)
so where up.site = 'All' or 1=1? is the or really needed?
and quickly on right...Right(column,integer) gives you the characters from the right of the string (I used a 4, so it'll take the 4 right chars of the column specified). I've found it far faster than a like statement runs.
This is always going to return true so you can eliminate it (and maybe the join to up)
AND (up.site = 'ALL' OR up.site = up.site)
If you can live with dirty reads then with (nolock)
And I would try Attachement as a join. Might not help but worth a try. Like is relatively expensive and if it is doing that in a loop where it could it once that would really help.
Join Attachment a
on a.contextid = c.correspondenceid
AND a.context = 'correspondence'
AND ( a.attachmentname like '%.rtf' or a.attachmentname like '%.doc'))
I know there are some people on SO that insist that exists is always faster than a join. And yes it is often faster than a join but not always.
Another approach is the create a #temp table using
CREATE TABLE #Temp (contextid INT PRIMARY KEY CLUSTERED);
insert into #temp
Select distinct contextid
from atachment
where context = 'correspondence'
AND ( attachmentname like '%.rtf' or attachmentname like '%.doc'))
order by contextid;
go
select ...
from correspondence c
join #Temp
on #Temp.contextid = c.correspondenceid
go
drop table #temp
Especially if productID is the primary key or part of the primary key on correspondence creating the PK on #temp will help.
That way you can be sure that like expression is only evaluated once. If the like is the expensive part and in a loop then it could be tanking the query. I use this a lot where I have a fairly expensive core query and I need to those results to pick up reference data from multiple tables. If you do a lot of joins some times the query optimizer goes stupid. But if you give the query optimizer PK to PK then it does not get stupid and is fast. The down side is it takes about 0.5 seconds to create and populate the #temp.

I Need some sort of Conditional Join

Okay, I know there are a few posts that discuss this, but my problem cannot be solved by a conditional where statement on a join (the common solution).
I have three join statements, and depending on the query parameters, I may need to run any combination of the three. My Join statement is quite expensive, so I want to only do the join when the query needs it, and I'm not prepared to write a 7 combination IF..ELSE.. statement to fulfill those combinations.
Here is what I've used for solutions thus far, but all of these have been less than ideal:
LEFT JOIN joinedTable jt
ON jt.someCol = someCol
WHERE jt.someCol = conditions
OR #neededJoin is null
(This is just too expensive, because I'm performing the join even when I don't need it, just not evaluating the join)
OUTER APPLY
(SELECT TOP(1) * FROM joinedTable jt
WHERE jt.someCol = someCol
AND #neededjoin is null)
(this is even more expensive than always left joining)
SELECT #sql = #sql + ' INNER JOIN joinedTable jt ' +
' ON jt.someCol = someCol ' +
' WHERE (conditions...) '
(this one is IDEAL, and how it is written now, but I'm trying to convert it away from dynamic SQL).
Any thoughts or help would be great!
EDIT:
If I take the dynamic SQL approach, I'm trying to figure out what would be most efficient with regards to structuring my query. Given that I have three optional conditions, and I need the results from all of them my current query does something like this:
IF condition one
SELECT from db
INNER JOIN condition one
UNION
IF condition two
SELECT from db
INNER JOIN condition two
UNION
IF condition three
SELECT from db
INNER JOIN condition three
My non-dynamic query does this task by performing left joins:
SELECT from db
LEFT JOIN condition one
LEFT JOIN condition two
LEFT JOIN condition three
WHERE condition one is true
OR condition two is true
OR condition three is true
Which makes more sense to do? since all of the code from the "SELECT from db" statement is the same? It appears that the union condition is more efficient, but my query is VERY long because of it....
Thanks!
LEFT JOIN
joinedTable jt ON jt.someCol = someCol AND jt.someCol = conditions AND #neededjoin ...
...
OR
LEFT JOIN
(
SELECT col1, someCol, col2 FROM joinedTable WHERE someCol = conditions AND #neededjoin ...
) jt ON jt.someCol = someCol
...
OR
;WITH jtCTE AS
(SELECT col1, someCol, col2 FROM joinedTable WHERE someCol = conditions AND #neededjoin ...)
SELECT
...
LEFT JOIN
jtCTE ON jtCTE.someCol = someCol
...
To be honest, there is no such construct as a conditional JOIN unless you use literals.
If it's in the SQL statement it's evaluated... so don't have it in the SQL statement by using dynamic SQL or IF ELSE
the dynamic sql solution is usually the best for these situations, but if you really need to get away from that a series of if statments in a stroed porc will do the job. It's a pain and you have to write much more code but it will be faster than trying to make joins conditional in the statement itself.
I would go for a simple and straightforward approach like this:
DECLARE #ret TABLE(...) ;
IF <coondition one> BEGIN ;
INSERT INTO #ret() SELECT ...
END ;
IF <coondition two> BEGIN ;
INSERT INTO #ret() SELECT ...
END ;
IF <coondition three> BEGIN ;
INSERT INTO #ret() SELECT ...
END ;
SELECT DISTINCT ... FROM #ret ;
Edit: I am suggesting a table variable, not a temporary table, so that the procedure will not recompile every time it runs. Generally speaking, three simpler inserts have a better chance of getting better execution plans than one big huge monster query combining all three.
However, we can not guess-timate performance. we must benchmark to determine it. Yet simpler code chunks are better for readability and maintainability.
Try this:
LEFT JOIN joinedTable jt
ON jt.someCol = someCol
AND jt.someCol = conditions
AND #neededJoin = 1 -- or whatever indicates join is needed
I think you'll find it is good performance and does what you need.
Update
If this doesn't give the performance I claimed, then perhaps that's because the last time I did this using joins to a table. The value I needed could come from one of 3 tables, based on 2 columns, so I built a 'join-map' table like so:
Col1 Col2 TableCode
1 2 A
1 4 A
1 3 B
1 5 B
2 2 C
2 5 C
1 11 C
Then,
SELECT
V.*,
LookedUpValue =
CASE M.TableCode
WHEN 'A' THEN A.Value
WHEN 'B' THEN B.Value
WHEN 'C' THEN C.Value
END
FROM
ValueMaster V
INNER JOIN JoinMap M ON V.Col1 = M.oOl1 AND V.Col2 = M.Col2
LEFT JOIN TableA A ON M.TableCode = 'A'
LEFT JOIN TableB B ON M.TableCode = 'B'
LEFT JOIN TableC C ON M.TableCode = 'C'
This gave me a huge performance improvement querying these tables (most of them dozens or hundreds of million-row tables).
This is why I'm asking if you actually get improved performance. Of course it's going to throw a join into the execution plan and assign it some cost, but overall it's going to do a lot less work than some plan that just indiscriminately joins all 3 tables and then Coalesce()s to find the right value.
If you find that compared to dynamic SQL it's only 5% more expensive to do the joins this way, but with the indiscriminate joins is 100% more expensive, it might be worth it to you to do this because of the correctness, clarity, and simplicity over dynamic SQL, all of which are probably more valuable than a small improvement (depending on what you're doing, of course).
Whether the cost scales with the number of rows is also another factor to consider. If even with a huge amount of data you only save 200ms of CPU on a query that isn't run dozens of times a second, it's a no-brainer to use it.
The reason I keep hammering on the fact that I think it's going to perform well is that even with a hash match, it wouldn't have any rows to probe with, or it wouldn't have any rows to create a hash of. The hash operation is going to stop a lot earlier compared to using the WHERE clause OR-style query of your initial post.
The dynamic SQL solution is best in most respects; you are trying to run different queries with different numbers of joins without rewriting the query to do different numbers of joins - and that doesn't work very well in terms of performance.
When I was doing this sort of stuff an æon or so ago (say the early 90s), the language I used was I4GL and the queries were built using its CONSTRUCT statement. This was used to generate part of a WHERE clause, so (based on the user input), the filter criteria it generated might look like:
a.column1 BETWEEN 1 AND 50 AND
b.column2 = 'ABCD' AND
c.column3 > 10
In those days, we didn't have the modern JOIN notations; I'm going to have to improvise a bit as we go. Typically there is a core table (or a set of core tables) that are always part of the query; there are also some tables that are optionally part of the query. In the example above, I assume that 'c' is the alias for the main table. The way the code worked would be:
Note that table 'a' was referenced in the query:
Add 'FullTableName AS a' to the FROM clause
Add a join condition 'AND a.join1 = c.join1' to the WHERE clause
Note that table 'b' was referenced...
Add bits to the FROM clause and WHERE clause.
Assemble the SELECT statement from the select-list (usually fixed), the FROM clause and the WHERE clause (occasionally with decorations such as GROUP BY, HAVING or ORDER BY too).
The same basic technique should be applied here - but the details are slightly different.
First of all, you don't have the string to analyze; you know from other circumstances which tables you need to add to your query. So, you still need to design things so that they can be assembled, but...
The SELECT clause with its select-list is probably fixed. It will identify the tables that must be present in the query because values are pulled from those tables.
The FROM clause will probably consist of a series of joins.
One part will be the core query:
FROM CoreTable1 AS C1
JOIN CoreTable2 AS C2
ON C1.JoinColumn = C2.JoinColumn
JOIN CoreTable3 AS M
ON M.PrimaryKey = C1.ForeignKey
Other tables can be added as necessary:
JOIN AuxilliaryTable1 AS A
ON M.ForeignKey1 = A.PrimaryKey
Or you can specify a full query:
JOIN (SELECT RelevantColumn1, RelevantColumn2
FROM AuxilliaryTable1
WHERE Column1 BETWEEN 1 AND 50) AS A
In the first case, you have to remember to add the WHERE criterion to the main WHERE clause, and trust the DBMS Optimizer to move the condition into the JOIN table as shown. A good optimizer will do that automatically; a poor one might not. Use query plans to help you determine how able your DBMS is.
Add the WHERE clause for any inter-table criteria not covered in the joining operations, and any filter criteria based on the core tables. Note that I'm thinking primarily in terms of extra criteria (AND operations) rather than alternative criteria (OR operations), but you can deal with OR too as long as you are careful to parenthesize the expressions sufficiently.
Occasionally, you may have to add a couple of JOIN conditions to connect a table to the core of the query - that is not dreadfully unusual.
Add any GROUP BY, HAVING or ORDER BY clauses (or limits, or any other decorations).
Note that you need a good understanding of the database schema and the join conditions. Basically, this is coding in your programming language the way you have to think about constructing the query. As long as you understand this and your schema, there aren't any insuperable problems.
Good luck...
Just because no one else mentioned this, here's something that you could use (not dynamic). If the syntax looks weird, it's because I tested it in Oracle.
Basically, you turn your joined tables into sub-selects that have a where clause that returns nothing if your condition does not match. If the condition does match, then the sub-select returns data for that table. The Case statement lets you pick which column is returned in the overall select.
with m as (select 1 Num, 'One' Txt from dual union select 2, 'Two' from dual union select 3, 'Three' from dual),
t1 as (select 1 Num from dual union select 11 from dual),
t2 as (select 2 Num from dual union select 22 from dual),
t3 as (select 3 Num from dual union select 33 from dual)
SELECT m.*
,CASE 1
WHEN 1 THEN
t1.Num
WHEN 2 THEN
t2.Num
WHEN 3 THEN
t3.Num
END SelectedNum
FROM m
LEFT JOIN (SELECT * FROM t1 WHERE 1 = 1) t1 ON m.Num = t1.Num
LEFT JOIN (SELECT * FROM t2 WHERE 1 = 2) t2 ON m.Num = t2.Num
LEFT JOIN (SELECT * FROM t3 WHERE 1 = 3) t3 ON m.Num = t3.Num