How to search for string in SQL treating apostrophe and single quote as equal - sql

We have a database where our customer has typed "Bob's" one time and "Bob’s" another time. (Note the slight difference between the single-quote and apostrophe.)
When someone searches for "Bob's" or "Bob’s", I want to find all cases regardless of what they used for the apostrophe.
The only thing I can come up with is looking at people's queries and replacing every occurrence of one or the other with (’|'') (Note the escaped single quote) and using SIMILAR TO.
SELECT * from users WHERE last_name SIMILAR TO 'O(’|'')Dell'
Is there a better way, ideally some kind of setting that allows these to be interchangeable?

You can use regular-expression matching:
with a_table(str) as (
values
('Bob''s'),
('Bob’s'),
('Bobs')
)
select *
from a_table
where str ~ 'Bob[''’]s';
str
-------
Bob's
Bob’s
(2 rows)
Personally I would replace all apostrophes in a table with one query (I had the same problem in one of my projects).
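A minimal sketch of that one-query cleanup, run against an in-memory SQLite database from Python so it is self-contained. The users table and last_name column come from the question; the normalize helper and the sample rows are made up for illustration. The same REPLACE-based UPDATE should carry over to Postgres, but I have only sketched it here, not run it against your schema.

```python
import sqlite3

TYPOGRAPHIC = "\u2019"  # right single quotation mark: ’
PLAIN = "'"             # plain ASCII apostrophe


def normalize(text: str) -> str:
    """Hypothetical helper: map the typographic apostrophe to the ASCII one."""
    return text.replace(TYPOGRAPHIC, PLAIN)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (last_name TEXT)")
conn.executemany(
    "INSERT INTO users (last_name) VALUES (?)",
    [("O'Dell",), ("O\u2019Dell",), ("ODell",)],
)

# One-off cleanup: rewrite every stored typographic apostrophe.
conn.execute(
    "UPDATE users SET last_name = REPLACE(last_name, ?, ?)",
    (TYPOGRAPHIC, PLAIN),
)

# Normalize the search term the same way before comparing.
rows = conn.execute(
    "SELECT last_name FROM users WHERE last_name = ?",
    (normalize("O\u2019Dell"),),
).fetchall()
print(rows)
```

After the cleanup, a plain equality comparison finds both spellings, as long as every incoming search term goes through the same normalization.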

If both forms are valid and carry the same information, consider normalizing your data before it reaches the database. You could replace one character with the other in your application code or in a BEFORE INSERT trigger.
If you have more cases like the one you've mentioned, then unfortunately plain LIKE queries would be the way to go.
You could also show hints to your customer when they create a new user: fetch records from the database and return the closest matches, if there are any, to avoid such problems.
I'm afraid there is no setting in Postgres that makes these two symbols equivalent in queries. At least I'm not familiar with one.

Related

SQL Server Text Searching

I have a business requirement where we need to do some crazy name matching against records stored in the database, and I was wondering if there is any easy way to do it using SQL Server.
Name Stored in the DB : Austin K
Name to be Matched from UI : Austin Kierland
That's just a sample. In reality, there could be whole lot of different permutations and combinations.
If it were the other way round, I could have used a wildcard, but in this case the name in the database is shorter than the search criteria.
Any suggestions?
Realistically - no. Databases were meant for comparing absolute values, not for messy comparisons. The way they store their data internally just isn't fit for really messy matching. Actually even a superpowerful dedicated search engine like Google, that has a LOT of messy matching features, wouldn't be able to pull off your example without prior knowledge.
I don't know how the requirement is precisely worded, but I'd either shoot down the feature request as "technically impossible", or implement a rule set that defines which messy matches are tried. For your example, you could easily 'hard code' that multiple searches are executed when capitalized words are entered, shortening them to a single letter. No idea if that's a solution to your problem though.
You can do a normal search using the LIKE operator, which determines whether a character string matches a specified pattern. The problem you will run into is the likelihood of returning multiple records or the wrong people. I've had a similar requirement myself for a business app, and the best solution is to require other qualifying values rather than just the name. If you do a partial name search without other qualifying data, you are certainly going to come across false positives and/or multiple records.
In my case I built a web service that checks eligibility, allowing a text search on first and last name but also requiring date of birth, primary person SSN, and gender, which ensured the matching person was indeed the person intended. If my situation were like yours, with name as the only search criterion, my recommendation to the business would be that we cannot perform the search until qualifying data is entered into the database; otherwise there is no accurate way to query the results they are looking for.
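Here is a rough sketch of how a LIKE search plus one qualifying value could look, using SQLite from Python so it runs on its own. Two things are my assumptions rather than anything from the answers above: the people table with its date_of_birth column, and the reversed LIKE (the search term on the left, the stored name plus % on the right), which is one way to handle a stored name that is shorter than the search term.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a name plus one qualifying value.
conn.execute("CREATE TABLE people (name TEXT, date_of_birth TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("Austin K", "1980-02-14"),
        ("Austin K", "1975-09-30"),
        ("Austin M", "1980-02-14"),
    ],
)


def find_people(full_name, date_of_birth):
    """Match rows whose stored name is a prefix of full_name and whose
    date of birth is equal to the one supplied."""
    return conn.execute(
        "SELECT name, date_of_birth FROM people "
        "WHERE ? LIKE name || '%' AND date_of_birth = ?",
        (full_name, date_of_birth),
    ).fetchall()


print(find_people("Austin Kierland", "1980-02-14"))
```

Without the date of birth, both "Austin K" rows would match, which is exactly the multiple-record problem described above. Note that a % or _ inside a stored name would act as a wildcard in this form.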

Wildcards in database

Does anyone have any pointers on how I can store wildcards in a database and then see which row(s) a string matches? Can it be done?
e.g.
DB contains a table like:
So john3136 should get 10 times his regular pay. fred3136 would get half his regular pay.
harry3136 probably crashes the app since there is no matching data ;-)
The code needs to do something like:
foreach(Employee e in all_employees) {
SELECT Multiplier FROM PayScales WHERE
//??? e.Name matches the PayScales.Name wildcard
}
Thanks!
Edit
This is a real world issue: I've got a parameter file that contains wildcards. The code currently iterates through employees, iterates through the param file looking for a match - you can see why I'd like to "databaserize" it ;-)
Wildcards are optional. The row could have said "john3136" to only match one employee. (The real app isn't actually employees, so it does make sense even if it looks like overkill in this simple example)
One option open: I do know all the employee names before I start, so I could iterate through them and effectively expand the wildcards in a temporary table. (so if I have john3136* in the starting table, it might expand to john3136, john31366 etc based on the list of employees). I was hoping to find a better way than this since it requires more maintenance (e.g. if we add functionality to add an employee we need to maintain the expanded wildcards table).
SELECT * FROM payscales
WHERE e.Name
LIKE regexp_replace(name, E'^\\*|\\*$', '%', 'g');
I don't know which database you're using. The above query works on PostgreSQL: it simply replaces a leading or trailing * with %, which is the LIKE wildcard.
If no wildcard is present, it must match the full string.
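The same idea can be tried out with SQLite from Python. This is only a sketch under a few assumptions: SQLite has no built-in regexp_replace, so it uses plain REPLACE (which converts every *, not just a leading or trailing one); the rows 'john*' and 'fred3136' are made-up sample data, since the table from the question is not shown; and multiplier_for is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PayScales (Name TEXT, Multiplier REAL)")
conn.executemany(
    "INSERT INTO PayScales VALUES (?, ?)",
    [("john*", 10.0), ("fred3136", 0.5)],  # hypothetical sample rows
)


def multiplier_for(employee_name):
    """Return the multiplier of the first PayScales row whose stored
    wildcard pattern matches employee_name, or None if nothing matches."""
    row = conn.execute(
        "SELECT Multiplier FROM PayScales "
        "WHERE ? LIKE REPLACE(Name, '*', '%')",
        (employee_name,),
    ).fetchone()
    return row[0] if row else None


for name in ("john3136", "fred3136", "harry3136"):
    print(name, multiplier_for(name))
```

Returning None for an unmatched name lets the caller decide what to do instead of crashing. Be aware that an underscore in a stored name is also a LIKE wildcard, and that SQLite's LIKE ignores ASCII case by default.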

Best SQL query for list of records containing certain characters?

I'm working with a relatively large SQL Server 2000 DB at the moment. It's 80 GB in size, and has millions and millions of records.
I currently need to return a list of names that contains at least one of a series of illegal characters. By illegal characters is just meant an arbitrary list of characters that is defined by the customer. In the below example I use question mark, semi-colon, period and comma as the illegal character list.
I was initially thinking to do a CLR function that worked with regular expressions, but as it's SQL server 2000, I guess that's out of the question.
At the moment I've done it like this:
select x from users
where
columnToBeSearched like '%?%' OR
columnToBeSearched like '%;%' OR
columnToBeSearched like '%.%' OR
columnToBeSearched like '%,%' OR
otherColumnToBeSearched like '%?%' OR
otherColumnToBeSearched like '%;%' OR
otherColumnToBeSearched like '%.%' OR
otherColumnToBeSearched like '%,%'
Now, I'm not a SQL expert by any means, but I get the feeling that the above query will be very inefficient. Doing 8 multiple wildcard searches in a table with millions of records, seems like it could slow the system down rather seriously. While it seems to work fine on test servers, I am getting the "this has to be completely wrong" vibe.
As I need to execute this script on a live production server eventually, I hope to achieve good performance, so as not to clog the system. The script might need to be expanded later on to include more illegal characters, but this is very unlikely.
To sum up: My aim is to get a list of records where either of two columns contain a customer-defined "illegal character". The database is live and massive, so I want a somewhat efficient approach, as I believe the above queries will be very slow.
Can anyone tell me the best way for achieving my result? Thanks!
/Morten
It doesn't get used much, but the LIKE statement accepts patterns in a similar (but much simplified) way to Regex. This link is the msdn page for it.
In your case you could simplify to (untested):
select x from users
where
columnToBeSearched like '%[?;.,]%' OR
otherColumnToBeSearched like '%[?;.,]%'
Also note that you can create the LIKE pattern as a variable, allowing for the customer defined part of your requirements.
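To illustrate building the pattern from a customer-defined list, here is a small sketch that runs against SQLite from Python. Big caveat: SQLite's LIKE has no [...] character sets, so this stand-in uses SQLite's GLOB operator (with * instead of %) purely to demonstrate the idea; on SQL Server you would build '%[?;.,]%' and keep LIKE. The build_glob_pattern helper, the id column and the sample rows are hypothetical.

```python
import sqlite3


def build_glob_pattern(illegal_chars):
    """Build '*[...]*' from a customer-defined list of characters.

    Characters that are special inside a bracket set are rejected
    rather than escaped, to keep the sketch simple.
    """
    for ch in illegal_chars:
        if ch in "[]^-":
            raise ValueError(f"{ch!r} needs special handling in a bracket set")
    return "*[" + "".join(illegal_chars) + "]*"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, "
    "columnToBeSearched TEXT, otherColumnToBeSearched TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "Smith", "John"),
        (2, "Smith;", "Jane"),
        (3, "Jones", "A.B."),
        (4, "What?", "ok"),
        (5, "Lee", "Kim"),
    ],
)

pattern = build_glob_pattern("?;.,")
flagged = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM users "
        "WHERE columnToBeSearched GLOB ? OR otherColumnToBeSearched GLOB ? "
        "ORDER BY id",
        (pattern, pattern),
    )
]
print(pattern, flagged)
```

Passing the pattern as a bound parameter means the character list can change per customer without touching the query text.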
One other major optimization: If you've got an updated date (or timestamp) on the user row (for any audit history type of thing), then you can always just query rows updated since the last time you checked.
If this is a query that will be run repeatedly, you are probably better off creating an index for it. The syntax escapes me at the moment, but you could probably create a computed column (edit: probably a PERSISTED computed column) which is 1 if columnToBeSearched or otherColumnToBeSearched contain illegal characters, and 0 otherwise. Create an index on that column and simply select all rows where the column is 1. This assumes that the set of illegal characters is fixed for that database installation (I assume that that's what you mean by "specified by the customer"). If, on the other hand, each query might specify a different set of illegal characters, this won't work.
By the way, if you don't mind the risk of reading uncommitted rows, you can run the query in a transaction with the isolation level READ UNCOMMITTED, so that you won't block other transactions.
You can try to partition your data horizontally and "split" your query in a number of smaller queries. For instance you can do
SELECT x FROM users
WHERE users.ID BETWEEN 1 AND 5000
AND -- your filters on columnToBeSearched
Putting your results back together in one list may be a little inconvenient, but if it's a report you're only extracting once (or once in a while) it may be feasible.
I'm assuming ID is the primary key of users or a column that has an index defined, which means SQL Server should be able to create an efficient execution plan, where it evaluates users.ID BETWEEN 1 AND 5000 (fast) before trying to check the filters (which may be slow).
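Stitching the batches back together can be done by a small driver loop. The sketch below uses SQLite from Python as a stand-in for SQL Server; the scan_in_batches name, the batch size and the sample data are assumptions for illustration, and the filter is reduced to a single LIKE on one column.

```python
import sqlite3


def scan_in_batches(conn, like_pattern, batch_size):
    """Yield matching IDs, scanning one primary-key range per query."""
    (max_id,) = conn.execute(
        "SELECT COALESCE(MAX(ID), 0) FROM users"
    ).fetchone()
    for start in range(1, max_id + 1, batch_size):
        end = start + batch_size - 1
        rows = conn.execute(
            "SELECT ID FROM users "
            "WHERE ID BETWEEN ? AND ? AND columnToBeSearched LIKE ? "
            "ORDER BY ID",
            (start, end, like_pattern),
        )
        for (row_id,) in rows:
            yield row_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (ID INTEGER PRIMARY KEY, columnToBeSearched TEXT)"
)
# Ten hypothetical rows; IDs 3 and 8 contain a semicolon.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, "bad;name" if i in (3, 8) else f"name{i}") for i in range(1, 11)],
)

matches = list(scan_in_batches(conn, "%;%", batch_size=4))
print(matches)
```

Each query only touches a bounded ID range, so on a live server you could also pause between batches to keep the load down.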
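Look up PATINDEX. It accepts a set of characters in its pattern: PATINDEX('%[._]%', ColumnName) returns 0 if none of the characters is found, or otherwise the position of the first occurrence of an illegal character in the value. Note the surrounding % wildcards; without them the pattern only matches a value that consists of a single illegal character. Hope this helps.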

How do I perform a simple one-statement SQL search across tables?

Suppose that two tables exist: users and groups.
How does one provide "simple search" in which a user enters text and results contain both users and groups whose names contain the text?
The result of the search must distinguish between the two types.
The trick is to combine a UNION with a literal string to determine the type of 'object' returned. In most (?) cases, UNION ALL will be more efficient, and should be used unless duplicates need to be removed from the combined result. The following pattern should suffice:
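-- $text is interpolated by the application; escape it or use a bound parameter.
-- Single quotes are used for string literals because double quotes denote
-- identifiers in standard SQL.
SELECT 'group' AS type, name
FROM groups
WHERE name LIKE '%$text%'
UNION ALL
SELECT 'user' AS type, name
FROM users
WHERE name LIKE '%$text%'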
NOTE: I've added the answer myself, because I came across this problem yesterday, couldn't find a good solution, and used this method. If someone has a better approach, please feel free to add it.
If you use "UNION ALL" then the db doesn't try to remove duplicates - you won't have duplicates between the two queries anyway (since the first column is different), so UNION ALL will be faster.
(I assume that you don't have duplicates inside each query that you want to remove)
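For what it's worth, here is a runnable sketch of the UNION ALL pattern with a bound parameter instead of string interpolation, using SQLite from Python. The simple_search function name and the sample rows are made up; the table names follow the question. I quoted "groups" as an identifier because it is a keyword in some databases, and added an ORDER BY only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute('CREATE TABLE "groups" (name TEXT)')
conn.executemany("INSERT INTO users VALUES (?)", [("admin_bob",), ("carol",)])
conn.executemany('INSERT INTO "groups" VALUES (?)', [("admins",), ("staff",)])


def simple_search(text):
    """Return (type, name) pairs for users and groups whose name contains text."""
    pattern = f"%{text}%"
    return conn.execute(
        """
        SELECT 'group' AS type, name FROM "groups" WHERE name LIKE ?
        UNION ALL
        SELECT 'user' AS type, name FROM users WHERE name LIKE ?
        ORDER BY type, name
        """,
        (pattern, pattern),
    ).fetchall()


print(simple_search("admin"))
```

The literal first column is what lets the caller tell the two kinds of result apart.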
Using LIKE will cause a number of problems, as it requires a table scan every single time the LIKE pattern starts with a %. This forces SQL to check every single row and work its way, byte by byte, through the string you are using for comparison. While this may be fine when you start, it quickly causes scaling issues.
A better way to handle this is using Full Text Search. While this would be a more complex option, it will provide you with better results for very large databases. Then you can use a functioning version of the example Bobby Jack gave you to UNION ALL your two result sets together and display the results.
I would suggest another addition:
SELECT 'group' AS type, name
FROM groups
WHERE UPPER(name) LIKE UPPER('%$text%')
UNION ALL
SELECT 'user' AS type, name
FROM users
WHERE UPPER(name) LIKE UPPER('%$text%')
You could convert $text to upper case first or just do it in the query. This way you get a case-insensitive search.

Need Pattern for dynamic search of multiple sql tables

I'm looking for a pattern for performing a dynamic search on multiple tables.
I have no control over the legacy (and poorly designed) database table structure.
Consider a scenario similar to a resume search where a user may want to perform a search against any of the data in the resume and get back a list of resumes that match their search criteria. Any field can be searched at anytime and in combination with one or more other fields.
The actual sql query gets created dynamically depending on which fields are searched. Most solutions I've found involve complicated if blocks, but I can't help but think there must be a more elegant solution since this must be a solved problem by now.
Yeah, so I've started down the path of dynamically building the sql in code. Seems godawful. If I really try to support the requested ability to query any combination of any field in any table this is going to be one MASSIVE set of if statements. shiver
I believe I read that COALESCE only works if your data does not contain NULLs. Is that correct? If so, no go, since I have NULL values all over the place.
As far as I understand (and I'm also someone who has written against a horrible legacy database), there is no such thing as dynamic WHERE clauses. It has NOT been solved.
Personally, I prefer to generate my dynamic searches in code. Makes testing convenient. Note, when you create your SQL queries in code, don't concatenate in user input. Use your @variables!
The only alternative is to use the COALESCE operator. Let's say you have the following table:
Users
-----------
Name nvarchar(20)
Nickname nvarchar(10)
and you want to search optionally for name or nickname. The following query will do this:
SELECT Name, Nickname
FROM Users
WHERE
Name = COALESCE(@name, Name) AND
Nickname = COALESCE(@nick, Nickname)
If you don't want to search for something, just pass in a null. For example, passing in "brian" for @name and null for @nick results in the following query being evaluated:
SELECT Name, Nickname
FROM Users
WHERE
Name = 'brian' AND
Nickname = Nickname
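The COALESCE operator turns the null parameter into an identity comparison (Nickname = Nickname), which is true for every row where the column is not NULL and so doesn't affect the WHERE clause. Rows where the column itself is NULL are filtered out, because NULL = NULL does not evaluate to true in SQL.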
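The COALESCE pattern, and the NULL concern raised in the question, can be checked with a quick SQLite session from Python. The Users table matches the one above; the sample rows, the two function names and the "IS NULL OR" variant are my additions for illustration, the latter being one common way to keep rows whose column is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Name TEXT, Nickname TEXT)")
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [("brian", "bri"), ("brian", None), ("alice", "al")],
)


def search_coalesce(name=None, nick=None):
    """Optional filters via COALESCE; None means 'do not filter'."""
    return conn.execute(
        "SELECT Name, Nickname FROM Users "
        "WHERE Name = COALESCE(?, Name) AND Nickname = COALESCE(?, Nickname) "
        "ORDER BY Name, Nickname",
        (name, nick),
    ).fetchall()


def search_null_safe(name=None, nick=None):
    """Same optional filters, but rows with NULL columns are kept."""
    return conn.execute(
        "SELECT Name, Nickname FROM Users "
        "WHERE (? IS NULL OR Name = ?) AND (? IS NULL OR Nickname = ?) "
        "ORDER BY Name, Nickname",
        (name, name, nick, nick),
    ).fetchall()


print(search_coalesce(name="brian"))
print(search_null_safe(name="brian"))
```

With COALESCE, the "brian" row whose Nickname is NULL silently drops out of the result, while the second variant returns both "brian" rows. So if your data has NULLs all over the place, the first form will lose rows.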
Search and normalization can be at odds with each other. So probably the first thing would be to get some kind of "view" that shows all the fields that can be searched as a single row, with a single key getting you the resume. Then you can throw something like Lucene in front of that to give you a full-text index of those rows. The way that works is, you ask it for "x" in this view and it returns the key to you. It's a great solution, and it came recommended by Joel himself on the podcast within the first 2 months, IIRC.
What you need is something like SphinxSearch (for MySQL) or Apache Lucene.
As you said in your example, let's imagine a resume that is composed of several fields:
Name,
Address,
Education (this could be a table on its own) or
Work experience (this could grow into its own table where each row represents a previous job)
So searching for a word in all those fields with WHERE rapidly becomes a very long query with several JOINs.
Instead you could change your frame of reference and think of the whole resume as what it is, a single document, and you just want to search that document.
This is what tools like Sphinx Search do. They create a full-text index of your 'document', and then you can query Sphinx and it will tell you which record in the database the match was found in.
They give really good search results.
Don't worry about these tools not being part of your RDBMS. It will save you a lot of headaches to use the appropriate model ("documents") rather than the incorrect one ("tables") for this application.
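To make the "document" idea concrete, here is a toy inverted index in plain Python. It is only a stand-in to show the shape of the approach and is nowhere near what Sphinx or Lucene actually do (no stemming, ranking or on-disk index); the resume data, field names and function names are all invented for the example.

```python
from collections import defaultdict

# Hypothetical resumes, keyed by the primary key in the database.
resumes = {
    1: {
        "name": "Ann Lee",
        "education": ["BSc Physics"],
        "work": ["Data analyst at Acme"],
    },
    2: {
        "name": "Raj Patel",
        "education": ["MSc Statistics"],
        "work": ["Analyst at Initech", "Intern at Acme"],
    },
}


def flatten(resume):
    """Join all searchable fields of one resume into a single document."""
    parts = [resume["name"], *resume["education"], *resume["work"]]
    return " ".join(parts)


def build_index(docs):
    """Map each lower-cased word to the set of resume keys containing it."""
    index = defaultdict(set)
    for key, resume in docs.items():
        for word in flatten(resume).lower().split():
            index[word].add(key)
    return index


def search(index, query):
    """Return the keys of resumes that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(resumes)
print(search(index, "acme"), search(index, "acme statistics"))
```

The search returns keys only; you then fetch the full resume rows from the database by primary key, which keeps the JOIN-heavy part out of the search path.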