I need to search for people whose FirstName is contained in (is a substring of) the FirstName of somebody else.
SELECT DISTINCT TOP 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]
FROM [dbo].[people] people
INNER JOIN [dbo].[people_NAME] peopleName ON peopleName.[Id] = people.[Id]
WHERE EXISTS (SELECT *
              FROM [dbo].[people_NAME] peopleName2
              WHERE peopleName2.[Id] != people.[Id]
                AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')
It is so slow! I know it's because of the '%' + peopleName.[FirstName] + '%', because if I replace it with a hardcoded value like '%G%', it runs instantly.
With my dynamic LIKE, my TOP 10 takes more than 10 seconds!
I want to be able to run it on a much bigger database.
What can I do?
Take a look at my answer about using the LIKE operator here.
It can be quite performant if you use some tricks.
You can gain a lot of speed if you play with collation; try this:
SELECT DISTINCT TOP 10 p.[Id], n.[LastName], n.[FirstName]
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n ON n.[Id] = p.[Id]
WHERE EXISTS (
    SELECT 'x' x
    FROM [dbo].[people_NAME] n2
    WHERE n2.[Id] != p.[Id]
      AND lower(n2.[FirstName]) COLLATE Latin1_General_BIN
          LIKE '%' + lower(n.[FirstName]) + '%' COLLATE Latin1_General_BIN
)
As you can see, we are using binary comparison instead of string comparison, and this is much more performant.
Pay attention: you are working with people's names, so you can have issues with special Unicode characters, accents, and so on.
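For instance, a quick illustration of the accent issue (lower() does not strip accents, so under the binary collation 'é' and 'e' remain different characters):

-- returns 0: with the binary collation the accented 'é' does not match 'e'
SELECT CASE WHEN lower('José') COLLATE Latin1_General_BIN LIKE '%jose%'
            THEN 1 ELSE 0 END AS is_match;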
Normally the EXISTS clause is better than an INNER JOIN, but you are also using a DISTINCT, which is effectively a GROUP BY on all selected columns, so why not use this?
You can switch to an INNER JOIN and use GROUP BY instead of DISTINCT; testing COUNT(*) > 1 will then be (very slightly) more performant than testing WHERE n2.[Id] != p.[Id], especially if your TOP clause is extracting many rows.
Try this:
SELECT TOP 10 p.[Id], n.[LastName], n.[FirstName]
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n ON n.[Id] = p.[Id]
INNER JOIN [dbo].[people_NAME] n2 ON
    lower(n2.[FirstName]) COLLATE Latin1_General_BIN
    LIKE '%' + lower(n.[FirstName]) + '%' COLLATE Latin1_General_BIN
GROUP BY p.[Id], n.[LastName], n.[FirstName]
HAVING COUNT(*) > 1
Here we are also matching the name against itself, so we will find at least one match for each name.
But we need only names that match other names, so we keep only rows with a match count greater than one (COUNT(*) = 1 means the name matched only itself).
EDIT: I ran all tests using a table of 100,000 random names and found that in this scenario, normal usage of the LIKE operator is about three times slower than the binary comparison.
This is a hard problem. I don't think a full text index will help, because you want to compare two columns.
That doesn't leave good options. One possibility is to implement ngrams. These are sequences of characters (say, 3 in a row) that come from a string. From my first name, you would have:
gor
ord
rdo
don
You can use these for direct matching on another column; you then have to do additional work to check whether the full name from one column actually matches the other. But the ngrams should significantly reduce the search space.
Also, implementing ngrams requires work. One method uses a trigger which calculates the ngrams for each name and then inserts them into an ngram table.
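As a rough illustration (not the exact method), here is a minimal T-SQL sketch; the side table name, the INT type for Id, and sys.all_objects as a row source for positions are all assumptions:

-- Hypothetical side table: one row per (person, trigram)
CREATE TABLE dbo.people_NAME_ngram (
    Id    INT     NOT NULL,  -- assumed to match the type of people_NAME.Id
    ngram CHAR(3) NOT NULL
);
CREATE INDEX ix_people_NAME_ngram ON dbo.people_NAME_ngram (ngram, Id);

-- Populate trigrams for the existing rows
INSERT INTO dbo.people_NAME_ngram (Id, ngram)
SELECT n.Id, SUBSTRING(n.FirstName, v.pos, 3)
FROM dbo.people_NAME n
CROSS APPLY (
    SELECT TOP (CASE WHEN LEN(n.FirstName) > 2
                     THEN LEN(n.FirstName) - 2 ELSE 0 END)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
    FROM sys.all_objects   -- any sufficiently large row source
) v;

-- Candidate pre-filter: rows containing the first trigram of n.FirstName;
-- the LIKE then confirms a real substring match on far fewer rows
SELECT n.Id AS inner_id, n2.Id AS outer_id
FROM dbo.people_NAME n
JOIN dbo.people_NAME_ngram g
    ON g.ngram = SUBSTRING(n.FirstName, 1, 3)
   AND g.Id <> n.Id
JOIN dbo.people_NAME n2
    ON n2.Id = g.Id
WHERE LEN(n.FirstName) >= 3
  AND n2.FirstName LIKE '%' + n.FirstName + '%';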
I'm not sure if all this work is worth the effort to solve your problem. But it is possible to speed up the search.
You can do this:
WITH CTE AS
(
    SELECT TOP 10 peopleName.[Id], peopleName.[LastName], peopleName.[FirstName]
    FROM [dbo].[people] people
    INNER JOIN [dbo].[people_NAME] peopleName ON peopleName.[Id] = people.[Id]
    WHERE EXISTS (SELECT 1
                  FROM [dbo].[people_NAME] peopleName2
                  WHERE peopleName2.[Id] != people.[Id]
                    AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')
    ORDER BY peopleName.[Id]
)
-- here join CTE with the people table, if it is required at all
SELECT * FROM CTE
If joining with people is not required, then there is no need for the CTE.
Have you tried a JOIN instead of a correlated query?
Being unable to use an index, it won't have optimal performance, but it should be a bit better than a correlated subquery.
SELECT DISTINCT TOP 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]
FROM [dbo].[people] people
INNER JOIN [dbo].[people_NAME] peopleName ON peopleName.[Id] = people.[Id]
INNER JOIN [dbo].[people_NAME] peopleName2 ON peopleName2.[Id] <> people.[Id]
    AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%'
Related
I have a client with a stored procedure that currently takes 25 minutes to run. I have narrowed the cause down to the following statement (column and table names changed):
UPDATE e
SET e.Possible_Project_Ref = cp.order_project_no,
    e.Possible_Project_id = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
    SELECT TOP 1 p.order_project_no, p.order_uid
    FROM [order] p
    WHERE e.Subject LIKE '%' + p.order_title + '%'
      AND p.order_date < e.timestamp
    ORDER BY p.order_date DESC
) AS cp
WHERE e.Possible_Project_Ref IS NULL;
There are 3 slightly different versions of the above, each joining to one of three tables. The issue is the CROSS APPLY with LIKE '%' + p.order_title + '%'. I have tried looking into CONTAINS() and FREETEXT(), but as far as my testing and investigation go, you cannot do CONTAINS(e.Subject, p.order_title) or FREETEXT(e.Subject, p.order_title).
Have I misread something, or is there a better way to write the above query?
Any help on this is much appreciated.
EDIT
Updated query to actual query used. Execution plan:
https://www.brentozar.com/pastetheplan/?id=B1YPbJiX5
Tmp table has the following indexes:
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_first_recipient ON #customer_emails_tmp (First_Recipient);
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_first_recipient_domain_name ON #customer_emails_tmp (First_Recipient_Domain_Name);
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_client_id ON #customer_emails_tmp (customer_emails_client_id);
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_subject ON #customer_emails_tmp ([subject]);
There is no index on the [order] table for column order_title
Edit 2
The purpose of this SP is to link orders (amongst others) to sent emails. This is done via multiple UPDATE statements; all the other UPDATE statements take less than a second, but this one (and 2 others exactly the same, looking at 2 other tables) takes an extraordinary amount of time.
I cannot remove the filter on Possible_Project_Ref IS NULL, as we only want to update the ones that are null.
Also, I cannot change WHERE e.Subject LIKE '%' + p.order_title + '%' to WHERE e.Subject LIKE p.order_title + '%' because the subject line may not start with p.order_title; for example, it could start with FW: or RE:.
Reviewing your execution plan, I think the main issue is you're reading a lot of data from the order table. You are reading 27,447,044 rows just to match up to find 783 rows. Your 20k row temp table is probably nothing by comparison.
Without knowing your data or desired business logic, here are a couple of things I'd consider:
Updating First Round of Exact Matches
I know you need to keep your %SearchTerm% parameters, but some data might have exact matches. So if you run an initial update for exact matches, it will reduce the number of rows you have to search with %SearchTerm%.
Run something like this before your current update
/*Recommended index for this update*/
CREATE INDEX ix_test ON [order](order_title,order_date) INCLUDE (order_project_no, order_uid)
UPDATE e
SET Possible_Project_Ref = cp.order_project_no
,Possible_Project_id = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
SELECT TOP 1 p.order_project_no, p.order_uid
FROM [order] p
WHERE e.Subject = p.order_title
AND p.order_date < e.timestamp
ORDER BY p.order_date DESC
) as cp
WHERE e.Possible_Project_Ref IS NULL;
Narrowing Search Range
This will technically change your matching criteria, but there are probably logical assumptions you can make that won't impact the final results. Here are a couple of ideas to get you thinking this way, but only you know your business. The end goal should be to narrow the data read from the order table.
Is there a customer id you can match on? Something like e.customerID = p.customerID? Do you really match any email to any order?
Can you narrow your search date range to something like x days before the timestamp? Do you really need to search all historical orders for all of time? Would you even want a match if an email matched an order from 5 years ago? For this, try updating your APPLY date filter to something like p.order_date BETWEEN DATEADD(dd,-30,e.[timestamp]) AND e.[timestamp], as sketched below.
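A sketch of the original update with that window added (the 30-day cutoff is an assumption to validate against your business rules):

UPDATE e
SET Possible_Project_Ref = cp.order_project_no
   ,Possible_Project_id = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
    SELECT TOP 1 p.order_project_no, p.order_uid
    FROM [order] p
    WHERE e.Subject LIKE '%' + p.order_title + '%'
      -- assumed window: only orders from the 30 days before the email
      AND p.order_date BETWEEN DATEADD(dd, -30, e.[timestamp]) AND e.[timestamp]
    ORDER BY p.order_date DESC
) AS cp
WHERE e.Possible_Project_Ref IS NULL;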
Other Miscellaneous Notes
If I'm understanding this correctly, you are trying to link emails to some sort of project number. Ideally, when the emails are generated, they would be linked to a project immediately. I know this is not always possible resource- and time-wise, but the clean solution is to calculate this at the beginning of the process, not afterwards. Generally, any time you have to use fuzzy string matching, you will have data issues. I know business always wants results "yesterday" and pushes for the shortcut, and nobody ever wants to update legacy processes, but sometimes you need to if you want clean data.
I'd also review your indexes on the temp table. Generally I find that the cost to create the indexes, and for SQL Server to maintain them as the temp table is updated, is not worth it. So 9 times out of 10, I leave the temp table as a plain heap with no indexes.
First, filter out the NULLs when you create #customer_emails_tmp, not after. Then you can lose WHERE e.Possible_Project_Ref IS NULL. This way you only bring in the rows you need, instead of retrieving rows you don't need and then filtering them.
Next, use this for your WHERE clause:
WHERE EXISTS (SELECT 1 FROM [order] AS p WHERE p.order_date < e.timestamp)
If a row in e has no order dated before its timestamp, that row is excluded up front.
Next remove the timestamp filter from your APPLY subquery. Now your subquery looks like this:
SELECT TOP 1 p.order_project_no, p.order_uid
FROM [order] AS p
WHERE e.Subject LIKE '%' + p.order_title + '%'
ORDER BY p.order_date DESC
This way you are applying your "Subject Like" filter to a much smaller set of rows. The final query would look like this:
UPDATE e
SET e.Possible_Project_Ref = cp.order_project_no,
    e.Possible_Project_id = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
SELECT TOP 1 p.order_project_no, p.order_uid
FROM [order] p
WHERE e.Subject LIKE '%' + p.order_title + '%'
ORDER BY p.order_date DESC
) as cp
WHERE EXISTS (SELECT 1 FROM [order] AS p WHERE p.order_date < e.timestamp);
I’m trying to search the database for any stored procedures that contain one of about 3500 different values.
I created a table to store the values in. I'm running the query below. The problem is, just testing it with a SELECT TOP 100 takes 3+ minutes to run (I have 3,500+ values). I know it's happening because the query uses LIKE.
I'm wondering if anyone has an idea on how I could optimize the search. The only results I need are the names of every value being searched for (pulled directly from the table I created, "SearchTerms") and then a column that displays 1 if it exists, 0 if it doesn't.
Here’s the query I’m running:
SELECT
    trm.Pattern,
    (CASE
         WHEN sm.object_id IS NULL THEN 0
         ELSE 1
     END) AS [Exists]
FROM dbo.SearchTerms trm
LEFT OUTER JOIN sys.sql_modules sm
    ON sm.definition LIKE '%' + trm.Pattern + '%'
ORDER BY trm.Pattern
Note: it's a one-time deal; it's not something that will be run consistently.
Try a CTE to get the Patterns that exist in any stored procedure, using a WHERE condition with EXISTS (...). Then LEFT JOIN dbo.SearchTerms with your CTE to get the 1 or 0 value for the Exists column.
;WITH ExistsSearchTerms AS (
    SELECT Pattern
    FROM dbo.SearchTerms
    WHERE EXISTS (SELECT 1 FROM sys.sql_modules sm WHERE sm.definition LIKE '%' + Pattern + '%')
)
SELECT trm.Pattern, IIF(trmExist.Pattern IS NULL, 0, 1) AS [Exists]
FROM dbo.SearchTerms trm
LEFT JOIN ExistsSearchTerms trmExist
    ON trm.Pattern = trmExist.Pattern
ORDER BY trm.Pattern
References:
SQL performance on LEFT OUTER JOIN vs NOT EXISTS
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
Oracle database.
I've got the following segment of SQL that's performing a full table scan on the PROVIDERS P1 table. I believe this is because it's dynamically building a LIKE clause, as you can see on the line marked XXX.
I've got an index on PROVIDERS.TERMINAL_NUMBER, and the following SQL snippet does use the correct index.
select * from providers where terminal_number like '1234%'
so why does the following not hit that index?
SELECT P1.PROVIDER_NUMBER, P1.TERMINAL_NUMBER, PC."ORDER"
FROM PROVIDERS P1
INNER JOIN PROVIDER_CONFIG PC
    ON PC.PROVIDER_NUMBER = P1.PROVIDER_NUMBER
WHERE EXISTS (
    SELECT E2.*
    FROM EQUIPMENT E1
    INNER JOIN EQUIPMENT E2
        ON E1.MERCHANT_NUMBER = E2.MERCHANT_NUMBER
    WHERE E1.TERMINAL_NUMBER = 'SA323F'
      AND E1.STATUS IN (0, 9)
      AND E2.STATUS IN (0, 9)
XXX
      AND P1.TERMINAL_NUMBER LIKE SUBSTR(E2.TERMINAL_NUMBER, 0, LENGTH(E2.TERMINAL_NUMBER) - 1) || '%'
)
ORDER BY PC."ORDER" DESC
Here ...
select * from providers where terminal_number like '1234%'
... the Optimiser knows all the fitting numbers start with a fixed prefix and so will be co-located in the index. Hence reading the index is likely to be very efficient.
But here there is no such knowledge ...
P1.TERMINAL_NUMBER LIKE SUBSTR(E2.TERMINAL_NUMBER, 0, length(E2.TERMINAL_NUMBER) - 1) || '%'
There can be any number of different prefixes from E2.TERMINAL_NUMBER and the query will be returning records from all over the PROVIDERS table. So indexed reads will be highly inefficient, and a blunt approach of full scans is the right option.
It may be possible to rewrite the query so it works more efficiently - for instance you would want a Fast Full Index Scan rather than a Full Table Scan. But without knowing your data and business rules we're not really in a position to help, especially when dynamic query generation is involved.
One thing which might improve performance would be to replace the WHERE EXISTS with a WHERE IN...
SELECT P1.PROVIDER_NUMBER, P1.TERMINAL_NUMBER, PC."ORDER" FROM PROVIDERS P1
INNER JOIN PROVIDER_CONFIG PC
ON PC.PROVIDER_NUMBER = P1.PROVIDER_NUMBER
WHERE substr(P1.TERMINAL_NUMBER, 1, 5) IN (
SELECT SUBSTR(E2.TERMINAL_NUMBER, 1, 5)
FROM EQUIPMENT E1
INNER JOIN EQUIPMENT E2
ON E1.MERCHANT_NUMBER = E2.MERCHANT_NUMBER
WHERE E1.TERMINAL_NUMBER = 'SA323F'
AND E1.STATUS IN (0, 9)
AND E2.STATUS IN (0, 9)
)
ORDER BY PC."ORDER" DESC
This would work if the length of the terminal number is constant. Only you know your data, so only you can tell whether it will fly.
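If you go down this road, a function-based index on the same prefix expression might help Oracle avoid the full scan; a sketch, assuming the five-character prefix is right for your data:

CREATE INDEX idx_providers_tn_prefix
    ON PROVIDERS (SUBSTR(TERMINAL_NUMBER, 1, 5));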
If this query does not use an index:
select *
from providers
where terminal_number like '1234%'
Then presumably terminal_number is numeric and not a string. The type conversion prevents the use of the index.
If you want to use an index, then convert the value to a string and use a string index:
create index idx_providers_terminal_number_str on providers(cast(terminal_number as varchar2(255)));
Then write the query as:
select *
from providers
where cast(terminal_number as varchar2(255)) like '1234%'
Please see the DDL below:
CREATE TABLE [dbo].[TBX_RRDGenieDeletedItem](
[DeletedId] [decimal](25, 0) NOT NULL
) ON [PRIMARY]
INSERT INTO TBX_RRDGenieDeletedItem values (90309955000010401948421)
CREATE TABLE [dbo].[dbNicheCIS](
[OccurrenceID] [decimal](25, 0) NULL,
[OccurrenceFileNo] [varchar](20) NULL
)
INSERT INTO dbNicheCIS values (90309955000010401948421,'3212')
CREATE TABLE [dbo].[Asset_Table](
[user_crimenumber] [varchar](4000) NOT NULL
)
INSERT INTO Asset_Table VALUES ('3212; 4512; 34322; 45674; 33221')
The only table I designed was dbNicheCIS. I am trying to find all of the rows in tbx_rrdgeniedeleteditem that are also in Asset_Table, using the LIKE statement. Asset_Table contains the OccurrenceFileNo (note that Asset_Table contains OccurrenceFileNo 3212, which relates to OccurrenceID 90309955000010401948421). I have tried this:
Select user_crimenumber from tbx_rrdgeniedeleteditem --asset_table.user_crimenumber
inner join dbNicheCIS on tbx_rrdgeniedeleteditem.deletedid = dbNicheCIS.OccurrenceID
cross join asset_table
where deletedid like '903%' and asset_table.user_crimenumber like '%' + occurrencefileno + '%'
It works, but it takes hours to run. Is there a better way to approach it than a cross join?
You can use an INNER JOIN, and you can also eliminate LIKE for the number comparison, like below:
SELECT user_crimenumber
FROM tbx_rrdgeniedeleteditem
INNER JOIN dbNicheCIS
    ON tbx_rrdgeniedeleteditem.deletedid = dbNicheCIS.OccurrenceID
INNER JOIN asset_table
    ON CAST(LEFT([DeletedId], 3) AS [decimal](25, 0)) = 903
   AND asset_table.user_crimenumber LIKE '%' + occurrencefileno + '%'
You can make use of the IN operator in this case:
SELECT * FROM TBX_RRDGenieDeletedItem
WHERE DeletedId IN (
SELECT DISTINCT OccurrenceID FROM dbNicheCIS
INNER JOIN Split(...) ON ...)
Updated: You can create a custom split function which will split the values into a temp table and then do the join.
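A minimal sketch of that idea; it assumes SQL Server 2016+, where the built-in STRING_SPLIT can stand in for a custom split function:

SELECT d.DeletedId
FROM TBX_RRDGenieDeletedItem d
WHERE d.DeletedId IN (
    SELECT c.OccurrenceID
    FROM dbNicheCIS c
    INNER JOIN (
        -- one row per semicolon-separated crime number
        SELECT LTRIM(RTRIM(s.value)) AS OccurrenceFileNo
        FROM Asset_Table a
        CROSS APPLY STRING_SPLIT(a.user_crimenumber, ';') s
    ) x ON x.OccurrenceFileNo = c.OccurrenceFileNo
);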
You need to index your tables to get faster query response.
CREATE INDEX [IX_dbNicheCIS_OccurrenceID] ON [dbNicheCIS]
([OccurrenceID] ASC, [OccurrenceFileNo] ASC)
CREATE INDEX [IX_TBX_RRDGenieDeletedItem_DeletedId] ON [dbo].[TBX_RRDGenieDeletedItem]
([DeletedId] ASC)
Creating such indexes replaces "Table scan" in the query execution plan with the faster "Index scan" and "Index seek". But you cannot solve the LIKE '%' + occurrencefileno + '%' problem with simple indexes.
There you will have to use full-text indexes. After you define a full-text index on asset_table.user_crimenumber, you can use the following query:
SELECT user_crimenumber
FROM tbx_rrdgeniedeleteditem di --asset_table.user_crimenumber
JOIN dbNicheCIS dnc
ON di.deletedid = dnc.OccurrenceID
CROSS JOIN asset_table at
WHERE di.deletedid like '903%'
AND CONTAINS(at.user_crimenumber, occurrencefileno)
But it is bad practice to store your occurrencefileno list as a varchar value delimited with semicolons. If you were the author of this database design, you should have tried to normalize the data, so that you get one row for every occurrencefileno rather than a string like '3212; 4512; 34322; 45674; 33221'.
As a first step before querying, you can also create a normalized version of asset_table.user_crimenumber and then use this table, with normal indexes, as the base for your further queries.
To split your asset_table.user_crimenumber fields you can use the Fn_Split() function as mentioned in this answer.
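For example, a sketch of that normalization pre-step (the #asset_crimenumbers name and the trimming are assumptions; fnSplit is the split function referenced above, returning an item column):

-- Materialize one row per crime number, then index it
SELECT a.user_crimenumber,
       LTRIM(RTRIM(f.item)) AS crimenumber
INTO   #asset_crimenumbers
FROM   asset_table a
CROSS APPLY dbo.fnSplit(a.user_crimenumber, ';') f;

CREATE INDEX ix_asset_crimenumbers ON #asset_crimenumbers (crimenumber);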
There is also the option of using fnSplit to rewrite your query this way:
SELECT user_crimenumber
FROM tbx_rrdgeniedeleteditem di --asset_table.user_crimenumber
JOIN dbNicheCIS dnc
ON di.deletedid = dnc.OccurrenceID
INNER JOIN (
SELECT at.user_crimenumber, f.item FROM asset_table at
CROSS APPLY dbo.fnSplit(at.user_crimenumber,';') f ) at
ON at.item=dnc.occurrencefileno
WHERE di.deletedid like '903%'
If you create fnSplit as a CLR function in C#, as described here, you may get even faster results. But it will not speed up your query magically.
I have to optimize this query. Can someone help me fine-tune it so it will return data faster?
Currently the output takes somewhere around 26 to 35 seconds. I also created an index based on the attachment table. Following are my query and index:
SELECT DISTINCT o.organizationlevel, o.organizationid, o.organizationname, o.organizationcode,
       o.organizationcode + ' - ' + o.organizationname AS 'codeplusname'
FROM Organization o
JOIN Correspondence c ON c.organizationid = o.organizationid
JOIN UserProfile up ON up.userprofileid = c.operatorid
WHERE c.status = '4'
  --AND c.correspondence > 0
  AND o.organizationlevel = 1
  AND (up.site = 'ALL' OR
       up.site = up.site)
  --AND (@Dept = 'ALL' OR @Dept = up.department)
  AND EXISTS (SELECT 1 FROM Attachment a
              WHERE a.contextid = c.correspondenceid
                AND a.context = 'correspondence'
                AND (a.attachmentname LIKE '%.rtf' OR a.attachmentname LIKE '%.doc'))
ORDER BY o.organizationcode
I can't just change anything in the db due to permission issues, so any help would be much appreciated.
I believe your headache is coming from this part specifically; a LIKE inside a WHERE EXISTS can be a performance bottleneck:
AND EXISTS (SELECT 1 FROM Attachment a
WHERE a.contextid = c.correspondenceid
AND a.context = 'correspondence'
AND ( a.attachmentname like '%.rtf' or a.attachmentname like '%.doc'))
This can be written as a join instead.
SELECT DISTINCT o.organizationlevel, o.organizationid, o.organizationname, o.organizationcode,
       o.organizationcode + ' - ' + o.organizationname AS 'codeplusname'
FROM Organization o
JOIN Correspondence c ON c.organizationid = o.organizationid
JOIN UserProfile up ON up.userprofileid = c.operatorid
LEFT JOIN Attachment a ON a.contextid = c.correspondenceid
    AND a.context = 'correspondence'
    AND RIGHT(a.attachmentname, 4) IN ('.doc', '.rtf')
....
This eliminates both the LIKE and the WHERE EXISTS. Put your WHERE clause at the bottom. Since it's a left join, a.anycolumn IS NULL means the record does not exist, and a.anycolumn IS NOT NULL means a record was found. WHERE a.anycolumn IS NOT NULL is the equivalent of a true in the WHERE EXISTS logic.
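Putting it together, a sketch of the completed rewrite (the filters from the original query are assumed unchanged):

SELECT DISTINCT o.organizationlevel, o.organizationid, o.organizationname, o.organizationcode,
       o.organizationcode + ' - ' + o.organizationname AS 'codeplusname'
FROM Organization o
JOIN Correspondence c ON c.organizationid = o.organizationid
JOIN UserProfile up ON up.userprofileid = c.operatorid
LEFT JOIN Attachment a ON a.contextid = c.correspondenceid
    AND a.context = 'correspondence'
    AND RIGHT(a.attachmentname, 4) IN ('.doc', '.rtf')
WHERE c.status = '4'
  AND o.organizationlevel = 1
  AND a.contextid IS NOT NULL   -- plays the role of the old EXISTS
ORDER BY o.organizationcode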
Edit to add:
Another thought for you...I'm unsure what you are trying to do here...
AND (up.site = 'ALL' OR
up.site = up.site)
So WHERE up.site = 'ALL' OR 1=1? Is the OR really needed?
And quickly, on RIGHT: RIGHT(column, integer) gives you that many characters from the right end of the string (I used 4, so it takes the 4 rightmost characters of the column). I've found it far faster than a LIKE.
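For example:

SELECT RIGHT('report.doc', 4)   -- returns '.doc'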
This is always going to return true, so you can eliminate it (and maybe the join to up):
AND (up.site = 'ALL' OR up.site = up.site)
If you can live with dirty reads, then use WITH (NOLOCK).
And I would try Attachment as a join. It might not help, but it's worth a try. LIKE is relatively expensive, and if it is being evaluated in a loop where it could be done once, fixing that would really help.
JOIN Attachment a
    ON a.contextid = c.correspondenceid
    AND a.context = 'correspondence'
    AND (a.attachmentname LIKE '%.rtf' OR a.attachmentname LIKE '%.doc')
I know there are some people on SO who insist that EXISTS is always faster than a join. And yes, it is often faster than a join, but not always.
Another approach is to create a #temp table:
CREATE TABLE #Temp (contextid INT PRIMARY KEY CLUSTERED);

INSERT INTO #Temp
SELECT DISTINCT contextid
FROM Attachment
WHERE context = 'correspondence'
  AND (attachmentname LIKE '%.rtf' OR attachmentname LIKE '%.doc')
ORDER BY contextid;
GO

SELECT ...
FROM correspondence c
JOIN #Temp
    ON #Temp.contextid = c.correspondenceid
GO

DROP TABLE #Temp
Especially if correspondenceid is the primary key, or part of the primary key, on correspondence, creating the PK on #Temp will help.
That way you can be sure the LIKE expression is only evaluated once. If the LIKE is the expensive part and it's in a loop, it could be tanking the query. I use this a lot where I have a fairly expensive core query and I need those results to pick up reference data from multiple tables. If you do a lot of joins, sometimes the query optimizer goes stupid. But if you give the query optimizer PK to PK, it does not get stupid and is fast. The downside is that it takes about 0.5 seconds to create and populate the #temp.