Best SQL query for list of records containing certain characters? - sql

I'm working with a relatively large SQL Server 2000 DB at the moment. It's 80 GB in size, and have millions and millions of records.
I currently need to return a list of names that contains at least one of a series of illegal characters. By illegal characters is just meant an arbitrary list of characters that is defined by the customer. In the below example I use question mark, semi-colon, period and comma as the illegal character list.
I was initially thinking to do a CLR function that worked with regular expressions, but as it's SQL server 2000, I guess that's out of the question.
At the moment I've done like this:
select x from users
where
columnToBeSearched like '%?%' OR
columnToBeSearched like '%;%' OR
columnToBeSearched like '%.%' OR
columnToBeSearched like '%,%' OR
otherColumnToBeSearched like '%?%' OR
otherColumnToBeSearched like '%;%' OR
otherColumnToBeSearched like '%.%' OR
otherColumnToBeSearched like '%,%'
Now, I'm not a SQL expert by any means, but I get the feeling that the above query will be very inefficient. Doing 8 multiple wildcard searches in a table with millions of records, seems like it could slow the system down rather seriously. While it seems to work fine on test servers, I am getting the "this has to be completely wrong" vibe.
As I need to execute this script on a live production server eventually, I hope to achieve good performance, so as not to clog the system. The script might need to be expanded later on to include more illegal characters, but this is very unlikely.
To sum up: My aim is to get a list of records where either of two columns contain a customer-defined "illegal character". The database is live and massive, so I want a somewhat efficient approach, as I believe the above queries will be very slow.
Can anyone tell me the best way for achieving my result? Thanks!
/Morten

It doesn't get used much, but the LIKE statement accepts patterns in a similar (but much simplified) way to Regex. This link is the msdn page for it.
In your case you could simplify to (untested):
select x from users
where
columnToBeSearched like '%[?;.,]%' OR
otherColumnToBeSearched like '%[?;.,]%'
Also note that you can create the LIKE pattern as a variable, allowing for the customer defined part of your requirements.
One other major optimization: If you've got an updated date (or timestamp) on the user row (for any audit history type of thing), then you can always just query rows updated since the last time you checked.

If this is a query that will be run repeatedly, you are probably better off creating an index for it. The syntax escapes me at the moment, but you could probably create a computed column (edit: probably a PERSISTED computed column) which is 1 if columnToBeSearched or otherColumnToBeSearched contain illegal characters, and 0 otherwise. Create an index on that column and simply select all rows where the column is 1. This assumes that the set of illegal characters is fixed for that database installation (I assume that that's what you mean by "specified by the customer"). If, on the other hand, each query might specify a different set of illegal characters, this won't work.
By the way, if you don't mind the risk of reading uncommitted rows, you can run the query in a transaction with the the isolation level READ UNCOMMITTED, so that you won't block other transactions.

You can try to partition your data horizontally and "split" your query in a number of smaller queries. For instance you can do
SELECT x FROM users
WHERE users.ID BETWEEN 1 AND 5000
AND -- your filters on columnToBeSearched
putting your results back together in one list may be a little inconvenient, but if it's a report you're only extracting once (or once in a while) it may be feasible.
I'm assuming ID is the primary key of users or a column that has a index defined, which means SQL should be able to create an efficient execution plan, where it evaluates users.ID BETWEEN 1 AND 5000 (fast) before trying to check the filters (which may be slow).

Look up PATINDEX it allows you to put in an array of characters PATINDEX('[._]',ColumnName) returns a 0 or a value of the first occurance of an illegal character found in a certain value. Hope this helps.

Related

Replace specific characters within SQL query

I'm struggling with some special characters that work fine with my SQL query, however will create problems in a secondary system (Excel), so I would like to replace them already during the query if possible.
TRANSACTIONS
ID DESC
1 14ft
2 15/16ft
3 17ft
This is just a dummy example, but "/" represents one of the characters I need to remove, but there are a few different. Although it should technically work, I can't use:
select ID, case when DESC = '15/16ft' then '15_16ft' else DESC from TRANSACTIONS
I can't keep track on all the strings, so I should approach based on character. I'd prefer converting them to another char or removing them altogether.
Unfortunately not sure on the exact db engine, although good chance it's an IBM based product, but most "generic" SQL queries tend to run fine. And just to emphazise that I'm looking to convert data within the SQL query, not update the database records. Thanks a lot!

How to search for string in SQL treating apostrophe and single quote as equal

We have a database where our customer has typed "Bob's" one time and "Bob’s" another time. (Note the slight difference between the single-quote and apostrophe.)
When someone searches for "Bob's" or "Bob’s", I want to find all cases regardless of what they used for the apostrophe.
The only thing I can come up with is looking at people's queries and replacing every occurrence of one or the other with (’|'') (Note the escaped single quote) and using SIMILAR TO.
SELECT * from users WHERE last_name SIMILAR TO 'O(’|'')Dell'
Is there a better way, ideally some kind of setting that allows these to be interchangeable?
You can use regexp matching
with a_table(str) as (
values
('Bob''s'),
('Bob’s'),
('Bobs')
)
select *
from a_table
where str ~ 'Bob[''’]s';
str
-------
Bob's
Bob’s
(2 rows)
Personally I would replace all apostrophes in a table with one query (I had the same problem in one of my projects).
If you find that both of the cases above are valid and present the same information then you might actually consider taking care of your data before it arrives into the database for later retrieval. That means you could effectively replace one sign into another within your application code or before insert trigger.
If you have more cases like the one you've mentioned then specifying just LIKE queries would be a way to go, unfortunately.
You could also consider hints for your customer while creating another user that would fetch up records from database and return closest matches if there are any to avoid such problems.
I'm afraid there is no setting that makes two of these symbols the same in DQL of Postgres. At least I'm not familiar with one.

Performance of SQL functions vs. code functions

We're currently investigating the load against our SQL server and looking at ways to alleviate it. During my post-secondary education, I was always told that, from a performance standpoint, it was cheaper to make SQL Server do the work. But is this true?
Here's an example:
SELECT ord_no FROM oelinhst_sql
This returns 783119 records in 14 seconds. The field is a char(8), but all of our order numbers are six-digits long so each has two blank characters leading. We typically trim this field, so I ran the following test:
SELECT LTRIM(ord_no) FROM oelinhst_sql
This returned the 783119 records in 13 seconds. I also tried one more test:
SELECT LTRIM(RTRIM(ord_no)) FROM oelinhst_sql
There is nothing to trim on the right, but I was trying to see if there was any overhead in the mere act of calling the function, but it still returned in 13 seconds.
My manager was talking about moving things like string trimming out of the SQL and into the source code, but the test results suggest otherwise. My manager also says he heard somewhere that using SQL functions meant that indexes would not be used. Is there any truth to this either?
Only optimize code that you have proven to be the slowest part of your system. Your data so far indicates that SQL string manipulation functions are not effecting performance at all. take this data to your manager.
If you use a function or type cast in the WHERE clause it can often prevent the SQL server from using indexes. This does not apply to transforming returned columns with functions.
It's typically user defined functions (UDFs) that get a bad rap with regards to SQL performance and might be the source of the advice you're getting.
The reason for this is you can build some pretty hairy functions that cause massive overhead with exponential effect.
As you've found with rtrim and ltrim this isn't a blanket reason to stop using all functions on the sql side.
It somewhat depends on what all is encompassed by: "things like string trimming", but, for string trimming at least, I'd definitely let the database do that (there will be less network traffic as well). As for the indexes, they will still be used if you're where clause is just using the column itself (as opposed to a function of the column). Use of the indexes won't be affected whatsoever by using functions on the actual columns you're retrieving (just on how you're selecting the rows).
You may want to have a look at this for performance improvement suggestions: http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/
As I said in my comment, reduce the data read per query and you will get a speed increase.
You said:
our order numbers are six-digits long
so each has two blank characters
leading
Makes me think you are storing numbers in a string, if so why are you not using a numeric data type? The smallest numeric type which will take 6 digits is an INT (I'm assuming SQL Server) and that already saves you 4 bytes per order number, over the number of rows you mention that's quite a lot less data to read off disk and send over the network.
Fully optimise your database before looking to deal with the data outside of it; it's what a database server is designed to do, serve data.
As you found it often pays to measure but I what I think your manager may have been referring to is somthing like this.
This is typically much faster
SELECT SomeFields FROM oelinhst_sql
WHERE
datetimeField > '1/1/2011'
and
datetimeField < '2/1/2011'
than this
SELECT SomeFields FROM oelinhst_sql
WHERE
Month(datetimeField) = 1
and
year(datetimeField) = 2011
even though the rows that are returned are the same

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. But it appears that the substring call is being applied before the filtering happens, the query is failing.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this would give me an error, because when it reaches "X" it will try using a negative number for in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular Database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names, (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
Martin gave this link that pretty much explains what is going on - the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it i will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if i can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsfroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
Build a working table from all of
the table constructors in the FROM
clause.
Remove from the working table those
rows that do not satisfy the WHERE
clause.
Construct the expressions in the
SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called query execution plan. It's based on query optimization rules, indexes, temporaty buffers and execution time statistics. If you are using SQL Managment Studio you have toolbox over your query editor where you can look at estimated execution plan, it shows how your query will change to gain some speed. So if just used your Name table and it is in buffer, engine might first try to subquery your data, and then join it with other table.

How do I perform a simple one-statement SQL search across tables?

Suppose that two tables exist: users and groups.
How does one provide "simple search" in which a user enters text and results contain both users and groups whose names contain the text?
The result of the search must distinguish between the two types.
The trick is to combine a UNION with a literal string to determine the type of 'object' returned. In most (?) cases, UNION ALL will be more efficient, and should be used unless duplicates are required in the sub-queries. The following pattern should suffice:
SELECT "group" type, name
FROM groups
WHERE name LIKE "%$text%"
UNION ALL
SELECT "user" type, name
FROM users
WHERE name LIKE "%$text%"
NOTE: I've added the answer myself, because I came across this problem yesterday, couldn't find a good solution, and used this method. If someone has a better approach, please feel free to add it.
If you use "UNION ALL" then the db doesn't try to remove duplicates - you won't have duplicates between the two queries anyway (since the first column is different), so UNION ALL will be faster.
(I assume that you don't have duplicates inside each query that you want to remove)
Using LIKE will cause a number of problems as it will require a table scan every single time when the LIKE comparator starts with a %. This forces SQL to check every single row and work it's way, byte by byte, through the string you are using for comparison. While this may be fine when you start, it quickly causes scaling issues.
A better way to handle this is using Full Text Search. While this would be a more complex option, it will provide you with better results for very large databases. Then you can use a functioning version of the example Bobby Jack gave you to UNION ALL your two result sets together and display the results.
I would suggest another addition
SELECT "group" type, name
FROM groups
WHERE UPPER(name) LIKE UPPER("%$text%")
UNION ALL
SELECT "user" type, name
FROM users
WHERE UPPER(name) LIKE UPPER("%$text%")
You could convert $text to upper case first or do just do it in the query. This way you get a case insensitive search.