In SQL Server 2005, I have a Product search that looks like:
select ProductID, Name, Email
from Product
where Name = @Name
I've been asked to ignore a couple "special" characters in Product.Name, so that a search for "Potatoes" returns "Po-ta-toes" as well as "Potatoes". My first thought is to just do this:
select ProductID, Name, Email
from Product
where REPLACE(Name, '-', '') = @Name
...but on second thought, I wonder if I'm killing performance by running a function on EVERY candidate result. Does SQL have some optimization magic that helps it do this kind of thing quickly? Can you think of anything easier I might be able to try with the requirements I have?
More standards-based: You could add a new column, e.g., searchable_name, precalculate the results of the REPLACE (and any other tweaks, e.g., SOUNDEX) on INSERT/UPDATE and store them in the new column, then search against that column.
Less standards-based: Lots of RDBMS provide a feature where you can create an INDEX using a function; this is often called a functional index. Your situation seems fairly well suited to such a feature.
Most powerful/flexible: Use a dedicated search tool such as Lucene. It might seem overkill for this situation, but they were designed for searching, and most offer sophisticated stemming algorithms that would almost certainly solve this problem.
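For the first two suggestions, here is a minimal sketch of what that could look like in SQL Server 2005: a persisted computed column holding the cleaned-up name, with an ordinary index on it (the column and index names here are just examples, not anything from the original schema):

ALTER TABLE Product
    ADD SearchableName AS REPLACE(Name, '-', '') PERSISTED;

CREATE INDEX IX_Product_SearchableName ON Product (SearchableName);

-- the search can then seek the index instead of running REPLACE on every row
SELECT ProductID, Name, Email
FROM Product
WHERE SearchableName = @Name;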
You will likely get better performance if you are willing to force the first character to be alphabetic, like this...
select ProductID, Name, Email
from Product
where REPLACE(Name, '-', '') = @Name
And Name Like Left(@Name, 1) + '%'
If the name column is indexed, you will likely get an index seek instead of a scan. The downside is, you will not return rows where the value is "-po-ta-to-es" because the first character does not match.
Can you add a field to your product table with a search-able version of the product name with special characters already removed? Then you can do the 'replace' only once for each record, and do efficient searches against the new field.
Alright... this is both an interesting solution and a question of "is there a better way to do this?"
I have a database table of addresses, broken into various fields (name, phone, suite, address, etc..)
we want to be able to do a very loose search against the address records, with multiple parameters.
so if someone searches "123 Mac", that's an easy enough match against an address.. but if they search for "David lincoln", where the name is "David something" and the address is "123 lincoln park", that's a much trickier search.
I came across another post where someone used a cross apply for search parameters, which I thought was just nifty, so I fashioned something similar.
I take the user search string, and break it into values (split) on the spaces, and insert that into a temp table in memory.
On the search table, I've created a somewhat constrained view, with a "searchText" column where I've literally mashed all the conceivable search fields into one big text field. I then created an index on this view. (had to force the use of the view/index, which does perform substantially better than the engine attempting to build a plan against the underlying tables)
And finally the query:
create table #searchValues (SString varchar(100) null)
insert into #searchValues (sstring) select '%'+[value]+'%' from dbo.Split(ltrim(rtrim(replace(@searchstring,'%',' '))),' ')
select top 50 addressID, ROW_NUMBER() OVER (ORDER BY a.[Usage] desc, searchText) AS RowNumber
from vwAddressSearch a WITH (NOEXPAND)
cross apply #searchValues s
where searchText like s.SString
group by addressID, [Usage], searchText
having count(*) = (select count(*) from #searchValues)
So this works well enough... (I take the output of this query and re-join it back to the main table to pull all the relevant values)
Also note this is an AND logic, not an OR, which is why I'm having to group and compare counts.
But here's the somewhat fugly part: that indexed view is still around 900k rows, and the more search terms there are, the more cross apply and text searching there ends up being. The performance is okay: not great, not horrible.
It actually seems to perform slightly better than a manual select where ... searchtext like '%one%' and searchtext like '%two%', etc.
Anyway, the question here: the group by and count compare works, but it seems a little ugly to me. Is there a better way to do this?
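For comparison, the same "every search term must match" requirement (relational division) can be written without the GROUP BY/COUNT comparison, using a double negation: keep an address only if there is no search value it fails to match. A minimal sketch against the same view and temp table as above (whether it actually performs better is something you would have to test):

select top 50 a.addressID
from vwAddressSearch a WITH (NOEXPAND)
where not exists (
    select 1
    from #searchValues s
    where a.searchText not like s.SString
)
order by a.[Usage] desc, a.searchText;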
I have a repository of SQL queries and I want to understand which queries use certain tables or fields.
Let's say I want to understand what queries use the email field, how can I write it?
Example SQL query:
select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2
,email as email_user_too_3
,back_email as wrong_email -- wrong field
from users
So, to state the problem more accurately: you are sorting through a list of SQL queries [as text], and you need to find the queries that use certain fields, using SQL and RegEx (Regular Expressions) in PostgreSQL. (Please tag the question so that StackOverflow indexes it correctly and, more importantly, so readers have more context about the question.)
PostgreSQL has Regular Expression support OOTB (Out Of The Box), so we can skip exploring other ways to do this. (If you are reading this as a Microsoft SQL Server person, then I strongly suggest you have a read of this brilliant article on Microsoft's website on defining a Table-Valued UDF (User Defined Function).)
The simplest way I could think of to approach your problem is to throw away what we don't want out of the query text first, and then filter what's left.
This way, after throwing away the stuff you don't need, you will be left with a set of "tokens" that you can easily filter. I'm putting "tokens" in quotes since we are not really parsing the SQL language, but if we did, that would be the first step: to extract tokens. (:
Take this query for example:
With Queries (
Id
, QueryText
) As (
values (1, 'select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2,
email as email_user_too_3,
back_email as wrong_email -- wrong field
from users')
)
Select QueryText
, found
From (
Select Id
, QueryText
, regexp_split_to_table (QueryText, '(--[\s\w]+|select|from|as|where|[ \s\n,])') As found
From Queries
) As Result
Where found != ''
And found = 'back_email'
I have mocked up the concept of a "query repository" with a WITH clause for ease of writing the pseudo-code.
I have also selected a few words/characters to split QueryText on, like select, where, etc. We don't need these in our 'found' set.
And in the end, as you can see above, I simply used found as what's left and filtered it with the field name you are looking for. (Assuming that you know the field you are looking for)
You could improve upon the RegEx I used, or change the method as you wish to make it better, but I think the general concept addresses what you need to achieve. One problem I can see with my solution right off the bat is that you can search for anything, not just names of the selected fields, which raises the question: why use RegEx and not LIKE? But again, as I mentioned, you can improve upon the RegEx and address specific requirements you may have; using LIKE might limit you in that direction. (In other words, only you know what's good for you. I can't say that from here.)
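For example, here are the two styles side by side, reusing the Queries CTE from the example above: a plain LIKE only does a substring match, while the regex word boundaries \m and \M let you avoid matching email inside back_email. This is just a sketch, not a full tokenizer:

-- substring match: searching for 'email' would also find 'back_email'
Select Id, QueryText
From Queries
Where QueryText Like '%back_email%';

-- word-boundary match: \m = start of word, \M = end of word
Select Id, QueryText
From Queries
Where QueryText ~* '\mback_email\M';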
You can play with the query online here: db-fiddle query and use https://regex101.com/ for testing your RegEx.
Disclaimer I'm not a PostgreSQL developer. There must be other, perhaps better ways of doing this. (:
I have a scenario where I need to perform the following operation:
SELECT *
FROM
[dbo].[MyTable]
WHERE
[Url] LIKE '%<some url>%';
I have to use two % (wildcard characters) at the beginning and the end of the Url ('%<some url>%'), as the user should be able to search the complete URL even if they type partial text. For example, if the URL is http://www.google.co.in and the user types "goo", then the URL must appear in the search results. The LIKE operator is causing performance issues. I need an alternative so that I can get rid of this statement and the wildcards. In other words, I don't want to use LIKE in this scenario. I tried using T-SQL CONTAINS but it is not solving my problem. Is there any other alternative available that can perform pattern matching and return results quickly?
Starting a like with a % is going to cause a scan. No getting around it. It has to evaluate every value.
If you index the column it should be an index (rather than table) scan.
You don't have an alternative that will not cause a scan.
Charindex and patindex are alternatives but will still scan and not fix the performance issue.
Could you break the components out into a separate table?
www
google
co
in
And then search on like 'goo%'?
That would use an index as it does not start with %.
Better yet you could search on 'google' and get an index seek.
And you would want the string to be unique in that table, with a separate join table on an int PK, so it does not return multiple 'www' rows, for instance.
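A minimal sketch of that design (all table and column names here are made up for illustration):

CREATE TABLE dbo.UrlToken (
    TokenID int IDENTITY(1,1) PRIMARY KEY,
    Token   varchar(200) NOT NULL UNIQUE      -- e.g. 'www', 'google', 'co', 'in'
);

CREATE TABLE dbo.MyTableUrlToken (
    MyTableID int NOT NULL,                   -- FK to MyTable
    TokenID   int NOT NULL,                   -- FK to UrlToken
    PRIMARY KEY (MyTableID, TokenID)
);

-- 'goo%' has no leading wildcard, so the unique index on Token can be used;
-- an exact match on 'google' gets an index seek
SELECT t.MyTableID
FROM dbo.MyTableUrlToken AS t
JOIN dbo.UrlToken AS u ON u.TokenID = t.TokenID
WHERE u.Token LIKE 'goo%';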
I suspect full-text CONTAINS was not faster because full-text kept the URL as one word.
You could create a FULLTEXT index.
First create your catalog:
CREATE FULLTEXT CATALOG ft AS DEFAULT;
Now, assuming your table is called MyTable, the column is TextColumn, and the table has a unique, single-column key index called UX_MyTable_TextColumn (the full-text KEY INDEX must be a unique index on the table):
CREATE FULLTEXT INDEX ON [dbo].[MyTable](TextColumn)
KEY INDEX UX_MyTable_TextColumn
Now you can search the table using CONTAINS:
SELECT *
FROM MyTable
WHERE CONTAINS(TextColumn, 'searchterm')
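If partial words such as 'goo' need to match, CONTAINS also supports prefix terms; a small sketch follows. Note that whether this helps depends on how the word breaker tokenizes your URLs (as pointed out elsewhere in this thread, the whole URL may be indexed as a single word):

SELECT *
FROM MyTable
WHERE CONTAINS(TextColumn, '"goo*"')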
To my knowledge there's no alternative to like or contains (full text search feature) which would give better performance.
What you can do is try to improve performance by optimising your query.
To do that, you need to know a bit about your users & how they'll use your system.
I suspect most people will enter a URL from the start of the address (i.e. without protocol), so you could do something like this:
declare @searchTerm nvarchar(128) = 'goo'
set @searchTerm = coalesce(replace(@searchTerm,'''',''''''),'')
select @searchTerm
SELECT *
FROM [dbo].[MyTable]
WHERE [Url] LIKE 'http://' + #searchTerm + '%'
or [Url] LIKE 'https://' + #searchTerm + '%'
or [Url] LIKE 'http://www.' + #searchTerm + '%'
or [Url] LIKE 'https://www.' + #searchTerm + '%'
or [Url] LIKE '%' + #searchTerm + '%'
option (fast 1); --get back the first result asap;
That then gives you some optimisation; i.e. if the URL is http://www.google.com, the index on the Url column can be used, since http://www.goo is at the start of the string.
The option (fast 1) piece on the end is there to ensure this benefit is seen; since the last [Url] LIKE '%' + @searchTerm + '%' can't make use of indexes, we'd rather return the first results as soon as we can than wait for that slow part to complete.
Have a think about other common usage patterns and ways around those.
As written, your query cannot be further optimized, and there is no way of getting around the LIKE to do your searching. The only thing you can do to improve performance is reduce the SELECT to return only the columns you need if you don't need all of them, and create an index on URL with those columns included. The LIKE will not be able to use the index for seeking, but the reduced data size for scanning can help. If you have a SQL Server edition that supports compression, that will help as well.
For instance, if you really need only column A, write
SELECT A FROM [dbo].[MyTable] WHERE [Url] LIKE '%<some url>%';
And create the index as
CREATE INDEX IX_MyTable_URL
ON MyTable([Url])
INCLUDE (A) WITH (DATA_COMPRESSION = PAGE);
If A is already included in your primary key, the INCLUDE is unnecessary.
Your query is a very simple one and I see no reason for it to be slow. The DBMS will read record by record and compare strings. Usually it can even do this in parallel threads.
What do you think can be the reason for your statement being so slow? Are there billions of records in your table? Do your records contain so much data?
Your best bet is not to care about the query, but about the database and your system. Others have already suggested an index on the url column, so rather than scanning the table, the index can be scanned. Is max degree of parallelism mistakenly set? Is your table fragmented? Is your hardware appropriate? These are the things to consider here.
However: charindex('oogl', url) > 0 does the same as url like '%oogl%', but internally they work somewhat differently. For some people the LIKE expression turned out faster, for others the CHARINDEX method. Maybe it depends on the query, the number of processors, the operating system, whatever. It may be worth a try.
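If you want to test that, here is a minimal side-by-side using the column from the question (both return the same rows; measure each on your data):

SELECT * FROM [dbo].[MyTable] WHERE [Url] LIKE '%oogl%';
SELECT * FROM [dbo].[MyTable] WHERE CHARINDEX('oogl', [Url]) > 0;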
I have a query which slows down immensely when I add an additional WHERE clause,
which is essentially just a LIKE lookup on a varchar(500) field:
where...
and (xxxxx.yyyy like '% blahblah %')
I've been racking my brain, but pretty much the query slows down terribly when I add this in.
I'm wondering if anyone has suggestions in terms of changing field type, index setup, or index hints or something that might assist.
Any help appreciated.
SQL 2000 Enterprise.
HERE IS SOME ADDITIONAL INFO:
Oops: as some background, unfortunately I do need (in the case of the LIKE statement) to have the % at the front.
There is business logic behind that which I can't avoid.
I have since created a full text catalogue on the field which is causing me problems
and converted the search to use the contains syntax.
Unfortunately, although this has increased performance on occasion, it appears to be slow (slower) for new word searches.
So if I search for apple, apple appears to be faster on subsequent searches, but not for a new search for orange (for example).
So I don't think I can go with that (unless you can suggest some tinkering to make it more consistent).
Additional info:
the table contains only around 60k records
the field i'm trying to filter is a varchar(500)
sql 2000 on windows server 2003
The query I'm using is definitely convoluted.
Sorry, I've had to replace proprietary stuff, but it should give you an indication of the query:
SELECT TOP 99 AAAAAAAA.Item_ID, AAAAAAAA.CatID, AAAAAAAA.PID, AAAAAAAA.Description,
AAAAAAAA.Retail, AAAAAAAA.Pack, AAAAAAAA.CatID, AAAAAAAA.Code, BBBBBBBB.blahblah_PictureFile AS PictureFile,
AAAAAAAA.CL1, AAAAAAAA.CL1, AAAAAAAA.CL2, AAAAAAAA.CL3
FROM CCCCCCC INNER JOIN DDDDDDDD ON CCCCCCC.CID = DDDDDDDD.CID
INNER JOIN AAAAAAAA ON DDDDDDDD.CID = AAAAAAAA.CatID LEFT OUTER JOIN BBBBBBBB
ON AAAAAAAA.PID = BBBBBBBB.Product_ID INNER JOIN EEEEEEE ON AAAAAAAA.BID = EEEEEEE.ID
WHERE
(CCCCCCC.TID = 654321) AND (DDDDDDDD.In_Use = 1) AND (AAAAAAAA.Unused = 0)
AND (DDDDDDDD.Expiry > '10-11-2010 09:23:38') AND
(
(AAAAAAAA.Code = 'red pen') OR
(
(my_search_description LIKE '% red %') AND (my_search_description LIKE '% nose %')
AND (DDDDDDDD.CID IN (63,153,165,305,32,33))
)
)
AND (DDDDDDDD.CID IN (20,32,33,63,64,65,153,165,232,277,294,297,300,304,305,313,348,443,445,446,447,454,472,479,481,486,489,498))
ORDER BY AAAAAAAA.f_search_priority DESC, DDDDDDDD.Priority DESC, AAAAAAAA.Description ASC
You can see that throwing in the my_search_description filter also includes a DDDDDDDD.CID filter (business logic).
This is the part which is slowing things down (from a 1.5-2 second load of my pages down to a 6-8 second load (ow ow ow))
It might be my lack of understanding of how to get the full text search catalogue working.
I'm very impressed by the answers, so if anyone has any tips I'd be most grateful.
If you haven't already, enable full text indexing.
Unfortunately, using the LIKE clause on a query really does slow things down. Full Text Indexing is really the only way that I know of to speed things up (at the cost of storage space, of course).
Here's a link to an overview of Full-Text Search in SQL Server which will show you how to configure things and change your queries to take advantage of the full-text indexes.
More details would certainly help, but...
Full-text indexing can certainly be useful (depending on the details of the table and your query). Full-text indexing requires a good bit of extra work both in setup and querying, but it's the only way to try to do the sort of search you seek efficiently.
The problem with LIKE that starts with a Wildcard is that SQL server has to do a complete table scan to find matching records - not only does it have to scan every row, but it has to read the contents of the char-based field you are querying.
With or without a full-text index, one thing can possibly help: Can you narrow the range of rows being searched, so at least SQL doesn't need to scan the whole table, but just some subset of it?
The '% blahblah %' is a problem for improving performance. Putting the wildcard at the beginning tells SQL Server that the string can begin with any legal character, so it must scan the entire index. Your best bet if you must have this filter is to focus on your other filters for improvement.
Using LIKE with a wildcard at the beginning of the search pattern forces the server to scan every row. It's unable to use any indexes. Indexes work from left to right, and since there is no constant on the left, no index is used.
From your WHERE clause, it looks like you're trying to find rows where a specific word exists in an entry. If you're searching for a whole word, then full text indexing may be a solution for you.
Full text indexing creates an index entry for each word that's contained in the specified column. You can then quickly find rows that contain a specific word.
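As a sketch of what that word-level search looks like with full-text (assuming my_search_description is a column on AAAAAAAA; this mirrors the CONTAINS conversion the asker says they already tried):

SELECT Item_ID
FROM AAAAAAAA
WHERE CONTAINS(my_search_description, 'red AND nose')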
As other posters have correctly pointed out, the use of the wildcard character % at the start of the LIKE expression results in a query plan that uses a SCAN operation. A scan operation touches every row in the table or index, dependent on the type of scan operation being performed.
So the question really then becomes, do you actually need to search for the given text string anywhere within the column in question?
If not, great, problem solved; but if it is essential to your business logic, then you have two routes of optimization:
1. Really go to town on increasing the overall selectivity of your query by focusing your optimization efforts on the remaining search arguments.
2. Implement a Full Text Indexing solution.
I don't think this is a valid answer, but I'd like to throw it out there for some more experienced posters' comments... are these equivalent?
where (xxxxx.yyyy like '% blahblah %')
vs
where patindex('%blahblah%', xxxxx.yyyy) > 0
As far as I know, that's equivalent from a database logic standpoint, as it's forcing the same scan. Guess it couldn't hurt to try?
So I have a stored procedure that accepts a product code like 1234567890. I want to facilitate a wildcard search option for those products. (i.e. 123456*) and have it return all those products that match. What is the best way to do this?
I have in the past used something like below:
SELECT @product_code = REPLACE(@product_code, '*', '%')
and then do a LIKE search on the product_code field, but I feel like it can be improved.
What you're doing already is about the best you can do.
One optimization you might try is to ensure there's an index on the columns you're allowing this on. SQL Server will still need to do a full scan for the wildcard search, but it'll be only over the specific index rather than the full table.
As always, checking the query plan before and after any changes is a great idea.
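As a sketch of that suggestion (the table and index names below are assumptions): with an index on the code column, a pattern that only has a trailing wildcard, such as '123456%', can use an index seek, while a leading wildcard still scans the index rather than the whole table.

CREATE INDEX IX_Product_ProductCode ON dbo.Product (product_code);

-- '123456*' becomes '123456%', which can seek the index above;
-- '*3456*' would become '%3456%' and still scan it
SELECT product_code
FROM dbo.Product
WHERE product_code LIKE REPLACE('123456*', '*', '%');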
A couple of random ideas
It depends, but you might like to consider:
Always look for a substring by default. e.g. if the user enters "1234", you search for:
WHERE product LIKE '%1234%'
Allow users full control. i.e. simply take their input and pass it to the LIKE clause. This means that they can come up with their own custom searches. This will only be useful if your users are interested in learning.
WHERE product LIKE @input