Sybase Multiple Substrings Search - sql

I need to get data back from a text field. The input is not all going to be pretty...some of the users don't spell well or consistently. I need to look for a variety of misspellings as well as alternative terms.
I am working with Sybase ASE and am wondering if the AND statement is getting unwieldy and may not be optimal? Here is one attempt:
AND (entry_txt like 'fight' OR
entry_txt like 'confron%' OR
entry_txt like 'aggres%' OR
entry_txt like 'grab' OR
entry_txt like 'push' OR
entry_txt like 'strike' OR
entry_txt like 'hit' OR
entry_txt like 'assa%')
It will get longer as I add some new requirements for additional terms as well as some proprietary names and 8-9 more variations therein! Is there a more efficient way to do this or is that it?
I have also read that LIKE should be used for partial string comparison and IN for values from a set. How about values from a set of partial strings? Could I /should I use IN here and does that help performance?
I am searching thousands of docs so there is a lot of data to have to go through.

Yes, for the ones that you don't have % you can use IN, for the others you still need to use OR.
It would look something like this:
AND (entry_txt in ('fight', 'grab', 'push', 'strike', 'hit')
OR entry_txt like 'confron%'
OR entry_txt like 'aggres%'
OR entry_txt like 'assa%')

You can actually put "like" expressions in an expression - another column in a table, or a variable.
So you could create a table with one varchar column called "like_expr" or something like that.
Then put all the above expressions into it, including the ones without % in, because they'll just degenerate to an equality operation.
In terms of efficiency, if entry_txt is indexed then the index can be used. I would think Sybase would find it easier to join to the like_expr table than to do lots and lots of ORs, but both should use the index - that should be a separate issue.)
create table abe (a varchar(20))
insert abe values ('hello')
create table abe2 (l varchar(20))
insert abe2 values ('h%')
select * from abe a where exists (select 1 from abe2 where a.a like l)
a
hello

Related

Can you use DOES NOT CONTAIN in SQL to replace not like?

I have a table called logs.lognotes, and I want to find a faster way to search for customers who do not have a specific word or phrase in the note. I know I can use "not like", but my question is, can you use DOES NOT CONTAINS to replace not like, in the same way you can use:
SELECT *
FROM table
WHERE CONTAINS (column, ‘searchword’)
Yes, you should be able to use NOT on any boolean expression, as mentioned in the SQL Server Docs here. And, it would look something like this:
SELECT *
FROM table
WHERE NOT CONTAINS (column, ‘searchword’)
To search for records that do not contain the 'searchword' in the column. And, according to
Performance of like '%Query%' vs full text search CONTAINS query
this method should be faster than using LIKE with wildcards.
You can also simply use this:
select * from tablename where not(columnname like '%value%')

How to overcome ( SQL Like operator ) performance issues?

I have Users table contains about 500,000 rows of users data
The Full Name of the User is stored in 4 columns each have type nvarchar(50)
I have a computed column called UserFullName that's equal to the combination of the 4 columns
I have a Stored Procedure searching in Users table by name using like operatior as below
Select *
From Users
Where UserFullName like N'%'+#FullName+'%'
I have a performance issue while executing this SP .. it take long time :(
Is there any way to overcome the lack of performance of using Like operator ?
Not while still using the like operator in that fashion. The % at the start means your search needs to read every row and look for a match. If you really need that kind of search you should look into using a full text index.
Make sure your computed column is indexed, that way it won't have to compute the values each time you SELECT
Also, depending on your indexing, using PATINDEX might be quicker, but really you should use a fulltext index for this kind of thing:
http://msdn.microsoft.com/en-us/library/ms187317.aspx
If you are using index it will be good.
So you can give the column an id or something, like this:
Alter tablename add unique index(id)
Take a look at that article http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/like-performance-tuning .
It easily describes how LIKE works in terms of performance.
Likely, you are facing such issues because your whole table shoould be traversed because of first % symbol.
You should try creating a list of substrings(in a separate table for example) representing k-mers and search for them without preceeding %. Also, an index for such column would help. Please read more about KMer here https://en.m.wikipedia.org/wiki/K-mer .
This will not break an index and will be more efficient to search.

Is there an alternation for wildcards using LIKE in T-SQL?

My query looks for dataset containing a particular label, let say:
SELECT * FROM Authors
WHERE Title LIKE #pattern
where #pattern is defined by user. So, %abc% would match abcd, 0abc, etc. Sometimes there are labels like
Xabc-ONE
blaYabc-TWO-sometext
Zabc-THREE
blubXabc-FOUR
and I'm looking for labels containing abc and ONE or TWO, something like %abc%(ONE|TWO)%. Is it possible?
You can add support for regular expressions to SQL Server via a CLR function, as shown in this answer, but it may not be possible for you in your environment. Check with your friendly sysadmin!
Maybe I don't understand your question right, but why not simply
SELECT * FROM Authors
WHERE Title LIKE '%abc%ONE%' OR '%abc%TWO%'
?
LIKE is just a search with wildcards, nothing more, so there's actually no other way of doing what you want with LIKE. If you need more, have a look into regular expressions. But be aware that it's slower than LIKE and in your case absolutely not necessary.
UPDATE:
From comments "don't want to care for what the user wants to match"...then simply do it like this:
SELECT * FROM Authors
WHERE Title LIKE CONCAT('%', $userInput, '%ONE%') OR CONCAT('%', $userInput, '%TWO%')
Or do I still don't get you right?
If you are using SQL Server you can enable the full text search engine and use the keyboards contains and near to find abc near ONE
I will say in the query(pseudo code)
Select something from table where CONTAINS(column_name, 'abc NEAR ONE')
http://msdn.microsoft.com/en-us/library/ms142568.aspx

Best SQL query for list of records containing certain characters?

I'm working with a relatively large SQL Server 2000 DB at the moment. It's 80 GB in size, and have millions and millions of records.
I currently need to return a list of names that contains at least one of a series of illegal characters. By illegal characters is just meant an arbitrary list of characters that is defined by the customer. In the below example I use question mark, semi-colon, period and comma as the illegal character list.
I was initially thinking to do a CLR function that worked with regular expressions, but as it's SQL server 2000, I guess that's out of the question.
At the moment I've done like this:
select x from users
where
columnToBeSearched like '%?%' OR
columnToBeSearched like '%;%' OR
columnToBeSearched like '%.%' OR
columnToBeSearched like '%,%' OR
otherColumnToBeSearched like '%?%' OR
otherColumnToBeSearched like '%;%' OR
otherColumnToBeSearched like '%.%' OR
otherColumnToBeSearched like '%,%'
Now, I'm not a SQL expert by any means, but I get the feeling that the above query will be very inefficient. Doing 8 multiple wildcard searches in a table with millions of records, seems like it could slow the system down rather seriously. While it seems to work fine on test servers, I am getting the "this has to be completely wrong" vibe.
As I need to execute this script on a live production server eventually, I hope to achieve good performance, so as not to clog the system. The script might need to be expanded later on to include more illegal characters, but this is very unlikely.
To sum up: My aim is to get a list of records where either of two columns contain a customer-defined "illegal character". The database is live and massive, so I want a somewhat efficient approach, as I believe the above queries will be very slow.
Can anyone tell me the best way for achieving my result? Thanks!
/Morten
It doesn't get used much, but the LIKE statement accepts patterns in a similar (but much simplified) way to Regex. This link is the msdn page for it.
In your case you could simplify to (untested):
select x from users
where
columnToBeSearched like '%[?;.,]%' OR
otherColumnToBeSearched like '%[?;.,]%'
Also note that you can create the LIKE pattern as a variable, allowing for the customer defined part of your requirements.
One other major optimization: If you've got an updated date (or timestamp) on the user row (for any audit history type of thing), then you can always just query rows updated since the last time you checked.
If this is a query that will be run repeatedly, you are probably better off creating an index for it. The syntax escapes me at the moment, but you could probably create a computed column (edit: probably a PERSISTED computed column) which is 1 if columnToBeSearched or otherColumnToBeSearched contain illegal characters, and 0 otherwise. Create an index on that column and simply select all rows where the column is 1. This assumes that the set of illegal characters is fixed for that database installation (I assume that that's what you mean by "specified by the customer"). If, on the other hand, each query might specify a different set of illegal characters, this won't work.
By the way, if you don't mind the risk of reading uncommitted rows, you can run the query in a transaction with the the isolation level READ UNCOMMITTED, so that you won't block other transactions.
You can try to partition your data horizontally and "split" your query in a number of smaller queries. For instance you can do
SELECT x FROM users
WHERE users.ID BETWEEN 1 AND 5000
AND -- your filters on columnToBeSearched
putting your results back together in one list may be a little inconvenient, but if it's a report you're only extracting once (or once in a while) it may be feasible.
I'm assuming ID is the primary key of users or a column that has a index defined, which means SQL should be able to create an efficient execution plan, where it evaluates users.ID BETWEEN 1 AND 5000 (fast) before trying to check the filters (which may be slow).
Look up PATINDEX it allows you to put in an array of characters PATINDEX('[._]',ColumnName) returns a 0 or a value of the first occurance of an illegal character found in a certain value. Hope this helps.

How to use like condition with multiple values in sql server 2005?

I need to filter out records based on some text matching in nvarchar(1000) column.
Table has more than 400 thousands records and growing. For now, I am using Like condition:-
SELECT
*
FROM
table_01
WHERE
Text like '%A1%'
OR Text like '%B1%'
OR Text like '%C1%'
OR Text like '%D1%'
Is there any preferred work around?
SELECT
*
FROM
table_01
WHERE
Text like '%[A-Z]1%'
This will check if the texts contains A1, B1, C1, D1, ...
Reference to using the Like Condition in SQL Server
You can try the following if you know the exact position of your sub string:
SELECT
*
FROM
table_01
WHERE
SUBSTRING(Text,1,2) in ('B1','C1','D1')
Have a look at LIKE on msdn.
You could reduce the number filters by combining more details into a single LIKE clause.
SELECT
*
FROM
table_01
WHERE
Text like '%[ABCD]1%'
If you can create a FULLTEXT INDEX on that column of your table (that assumes a lot of research on performance and space), then you are probably going to see a big improvement on performance on text matching. You can go to this link to see what FULLTEXT SEARCH is
and this link to see how to create a FULLTEXT INDEX.
I needed to do this so that I could allow two different databases in a filter for the DatabaseName column in an SQL Server Profiler Trace Template.
All you can do is fill in the body of a Like clause.
Using the reference in John Hartscock's answer, I found out that the like clause uses a sort of limited regex pattern.
For the OP's scenario, MSMS has the solution.
Assuming I want databases ABCOne, ABCTwo, and ABCThree, I come up with what is essentially independent whitelists for each character:
Like ABC[OTT][NWH][EOR]%
Which is easily extensible to any set of strings. It won't be ironclad, that last pattern would also match ABCOwe, ABCTnr, or ABCOneHippotamus, but if you're filtering a limited set of possible values there's a good chance you can make it work.
You could alternatively use the [^] operator to present a blacklist of unacceptable characters.