SQL Query Optimisation (Direction of Condition Evaluation) - sql

Let's say I have a dictionary of 26000 words, 1000 words per letter of the alphabet.
If I want to find all the words that have an 'e' in them, I write:
SELECT *
FROM dict
WHERE word LIKE '%e%';
If I wanted to reduce that to only the words beginning with 'a' I could change the like condition or I could do this:
SELECT *
FROM dict
WHERE word LIKE '%e%'
AND id < 1000;
Lots of words have the letter 'e' in them, so if the conditions are evaluated left to right, many rows would pass the first check only to fail the second; I would expect faster results if the conditions were evaluated from right to left.
My question is: would it be better to have id < 1000 as the first or the second condition, or does this depend on the type of database?

The location of the condition is irrelevant; the same number of scans (if applicable) will be required. The conditions are not evaluated in the order they are written -- the optimizer determines what is applied, and when, based on table statistics and indexes (if any exist). Those statistics change and can become out of date (which is why maintenance is important).

It would be bad to assume id < 1000 to be the equivalent of
SELECT * FROM dict WHERE word LIKE 'a%'.
If you designed your database this way it would violate First Normal Form (1NF). Specifically: there is no top-to-bottom ordering to the rows.
Technically there isn't a way to ensure this ordering stays valid, especially if you wanted to add a word starting with 'A' after you set up your initial state.
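If the intent really is "words beginning with 'a' that contain an 'e'", a safer approach is to state that directly in the predicate instead of relying on id ranges. A minimal sketch, reusing the table and column names from the question:
SELECT *
FROM dict
WHERE word LIKE 'a%'
AND word LIKE '%e%';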

One of the key design principles of modern relational database management systems is that you, the user, have no true control or say over how the data is actually being stored on the hard drive by the RDBMS. This means that you cannot assume that the data is (a) stored in alphabetical order on the drive, or (b) that when you retrieve the data, it will be retrieved in alphabetical order. The only way to be absolutely 100% sure that you are getting the data you want is to spell out the way you want it, and anything else is an assumption that some day may blow up in your face.
Why does this matter? Because your query assumes that the data you'll be getting will be in alphabetical order, starting with "A" and going up. (And that assumes consistent case--what about "A" vs "a"? Anything with leading spaces or numbers? Different systems handle different data differently...) Fixing this is simple enough: add an ORDER BY clause, such as:
select * from dict where word like ("%e%") and id < 1000 order by word;
Of course, if you have more than 1000 words beginning with "A" and containing "e", you're in trouble... and if you have fewer than 1000, you end up with a bunch of "B" words. Try something like:
select * from dict where left(word, 1) = "A" and word like ("%e%");
Depending on your RDBMS and any indexing you have on the table, the system could first identify all "A" words, and then run the "contains e" check on only them.
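As a sketch of the indexing side (the index name is made up): an ordinary index on word gives the optimizer the option of a range scan for a prefix predicate such as word LIKE 'A%', so only the "A" rows need the "contains e" check.
CREATE INDEX idx_dict_word ON dict (word);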

Try switching your where clause conditions around and then compare the execution plans.
This will show you the difference, if any (I would guess they will be identical, in this case).
The bottom line is, most of the time it makes no difference.
However, it can change the execution plan.
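A minimal sketch of how to make that comparison, assuming MySQL (other systems have their own equivalents, such as EXPLAIN PLAN or SET SHOWPLAN):
EXPLAIN SELECT * FROM dict WHERE word LIKE '%e%' AND id < 1000;
EXPLAIN SELECT * FROM dict WHERE id < 1000 AND word LIKE '%e%';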

Related

Filter a query using the CONCAT function or similar

I have a query that is filtered on a list of order numbers. The actual field for the order number is 9 characters long (char). However, occasionally the system that the end users get their order numbers from will add an extra 0 or a single alpha character to the beginning of this order number. I am trying to account for that using the existing SQL and although it is running, it takes far longer (and sometimes won't even run).
Is the approach I am taking below the best way to account for these differences?
order number field example:
066005485,
066005612
example of what may be entered and I need to account for:
0066005485,
A066005612
Here is what I have tried that seems to not work or at least be EXTREMELY slow:
SELECT S.order_no AS 'contract_no',
S.SIZE_INDEX AS 'technical_index',
S.open_qty AS 'contract_open_qty',
S.order_qty AS 'contract_order_qty',
E.excess,
(S.order_qty - E.excess) AS 'new_contract_size_qty'
FROM EXCESS E
JOIN SIM S ON RIGHT(E.GPS_CONTRACT_NUMBER,9) = S.order_no AND E.[AFS TECH INDEX] = S.size_index
WHERE S.order_no IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697')
OR CONCAT('0',S.order_no) IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697')
ORDER BY S.order_no,
S.size_index
Any thoughts on something that may work better or that I am missing?
I can't do anything about the nasty join that requires the RIGHT function. If you have any influence over the database designers, it could be fruitful to either have that key (E.GPS_CONTRACT_NUMBER) cleaned up before it is put into the table, or get them to add another field where the RIGHT(E.GPS_CONTRACT_NUMBER, 9) has already been performed and an index can be created.
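A sketch of what that extra field could look like, assuming SQL Server (the query uses CONCAT and bracketed names); the column and index names here are made up:
ALTER TABLE EXCESS ADD GPS_CONTRACT_NUMBER_9 AS RIGHT(GPS_CONTRACT_NUMBER, 9) PERSISTED;
CREATE INDEX IX_EXCESS_GPS9 ON EXCESS (GPS_CONTRACT_NUMBER_9, [AFS TECH INDEX]);
-- the join could then use the indexed column directly:
-- JOIN SIM S ON E.GPS_CONTRACT_NUMBER_9 = S.order_no AND E.[AFS TECH INDEX] = S.size_index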
But there is definitely something you can do to remove the CONCAT function calculation and take advantage of any index on S.order_no. I noticed your WHERE clause looks like order_no IN listofvals OR CONCAT('0', order_no) IN samelistofvals. So instead of adding a zero onto order_no, remove a zero from everything in the IN list.
Where order_no IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697',
'066003816','066003817','066005485','066005612','066005390','066005616','066005617','066005952','066005798','066006673','066005802','066006196','066006197','066006199','066006205','066006697')
Notice that the IN-list is on two lines and the second line is just the first repeated with the leading 0 removed and any entry beginning with "A" removed entirely. This simplifies the Where clause and allows use of indexes, if any exist.
If the efficiency problem is in the WHERE clause (not considering the JOIN operation), you can try to improve the situation by using the "pseudo-regex" pattern matching that LIKE offers:
WHERE
S.order_no LIKE '[A0]06600%'
OR
S.order_no LIKE '06600%'
Warning: this pattern will also match strings that end with other numbers (e.g. 8648).
Does it work for you?

Querying time higher with 'Where' than without it

I have what I think is a strange issue. Normally, I would think that a query should take less time if I put a restriction on it (so that fewer rows are processed). But I don't know why, this is not the case. Maybe I'm doing something wrong, but I don't get an error; the query just seems to run 'till infinity'.
This is the query:
SELECT
A.ENTITYID AS ORG_ID,
A.ID_VALUE AS LEI,
A.MODIFIED_BY,
A.AUDITDATETIME AS LAST_DATE_MOD
FROM (
SELECT
CASE WHEN IFE.NEWVALUE IS NOT NULL
then EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_TYPE')
ELSE NULL
end as ID_TYPE,
case when IFE.NEWVALUE is not null
then EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_VALUE')
ELSE NULL
END AS ID_VALUE,
(select u.username from admin.users u where u.userid = ife.analystuserid) as Modified_by,
ife.*
FROM ife.audittrail ife
WHERE
--IFE.AUDITDATETIME >= '01-JUN-2016' AND
attributeid = 499
AND ROWNUM <= 10000
AND (CASE WHEN IFE.NEWVALUE IS NOT NULL then EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_TYPE') ELSE NULL end) = '38') A
--WHERE A.AUDITDATETIME >= '01-JUN-2016';
So I tried each of the two commented clauses (one at a time, of course).
And with both of them the same thing happens: the query runs for such a long time that I have to abort it.
Do you know why this could be happening? How could I apply the restriction, maybe in a different way?
The values of the field AUDITDATETIME are in that format: '06-MAY-2017', for example.
Thank you very much in advance
I think you may misunderstand how databases work.
Firstly, read up on EXPLAIN - you can find out exactly what is taking time, and why, by learning to read the EXPLAIN statement.
Secondly - the performance characteristics of any given query are determined by a whole range of things, but usually the biggest effort goes not in processing rows, but finding them.
Without an index, the database has to look at every row in the database and compare it to your where clause. It's the equivalent of searching in the phone book for a phone number, rather than a name (the phone book is indexed on "last name").
You can improve this by creating indexes - for instance, on columns "AUDITDATETIME" and "attributeid".
Unlike the phone book, a database server can support multiple indexes - and if those indexes match your where clause, your query will be (much) faster.
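As a sketch only (the index name is made up, and the best column order depends on how you usually filter):
CREATE INDEX idx_audittrail_attr_date ON ife.audittrail (attributeid, auditdatetime);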
Finally, using an XML string extraction for a comparison in the where clause is likely to be extremely slow unless you've got an index on that XML data.
This is the equivalent of searching the phone book and translating the street address from one language to another - not only do you have to inspect every address, you have to execute an expensive translation step for each item.
You probably need index(es)... We can all make guesses on what indexes you already have, and need to add, but most DBMSs have built-in query optimizers.
If you are using MS SQL Server, you can execute the query with the actual execution plan included; that will tell you what index you need to add to optimize this particular query. It will even let you copy/paste the command to create it.
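The query in the question looks like Oracle (xmltype, EXTRACTVALUE, ROWNUM), where the rough equivalent is EXPLAIN PLAN. A minimal sketch:
EXPLAIN PLAN FOR
<the full SELECT from the question>;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);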

For an Oracle NUMBER datatype, LIKE operator vs BETWEEN..AND operator

Assume mytable is an Oracle table and it has a field called id. The datatype of id is NUMBER(8). Compare the following queries:
select * from mytable where id like '715%'
and
select * from mytable where id between 71500000 and 71599999
I would think the second is more efficient since I think "number comparison" would require fewer assembly-level instructions than "string comparison". I need a confirmation or correction. Please confirm/correct and add any further comments related to either operator.
UPDATE: I forgot to mention 1 important piece of info. id in this case must be an 8-digit number.
If you only want values between 71500000 and 71599999 then yes, the second one is much more efficient. The first one would also return values between 7150 and 7159, between 71500 and 71599, and so forth. You would either need to sift through unnecessary results or write another couple of lines of code to filter the rest of them out. The second option is definitely more efficient for what you seem to want to do.
It seems like the execution plan on the second query is more efficient.
The first query is doing a full table scan of the id's, whereas the second query is not.
(Test data and the execution plans of both queries were shown as screenshots.)
I don't like the idea of using LIKE with a numeric column.
Also, it may not give the results you are looking for.
If you have a value of 715000000, it will show up in the query result, even though it is larger than 71599999.
Also, I do not like between on principle.
If a thing is between two other things, it should not include those two other things. But this is just a personal annoyance.
I prefer to use >= and <=. This avoids confusion when I read the query. In addition, sometimes I have to change the query to something like >= a and < c. If I started by using the BETWEEN operator, I would have to rewrite it when I don't want to be inclusive.
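For the query in the question, the inclusive version reads as:
select * from mytable where id >= 71500000 and id <= 71599999
and the half-open variant, if it is ever needed, would be id >= 71500000 and id < 71600000.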
Harv
In addition to the other points raised, using LIKE in the manner you suggest would cause Oracle to not use any indexes on the ID column due to the implicit conversion of the data from number to character, resulting in a full table scan when using LIKE versus an index range scan when using BETWEEN. Assuming, of course, you have an index on ID. Even if you don't, however, Oracle will have to do the type conversion on each value it scans in the LIKE case, which it won't have to do in the other.
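In other words, the LIKE version behaves roughly as if it had been written like this (a sketch, not the exact internal rewrite), and the function applied to the column is what rules out an index range scan:
select * from mytable where to_char(id) like '715%'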
You can use a math function; otherwise you would have to use the TO_CHAR function in order to use LIKE, and that will cause performance problems.
select * from mytable where floor(id /100000) = 715
or
select * from mytable where floor(id / 100000) = TO_NUMBER('715') -- this version can take the '715' prefix as a parameter

Best SQL query for list of records containing certain characters?

I'm working with a relatively large SQL Server 2000 DB at the moment. It's 80 GB in size and has millions and millions of records.
I currently need to return a list of names that contain at least one of a series of illegal characters. By illegal characters I just mean an arbitrary list of characters defined by the customer. In the example below I use question mark, semicolon, period and comma as the illegal character list.
I was initially thinking to do a CLR function that worked with regular expressions, but as it's SQL server 2000, I guess that's out of the question.
At the moment I've done it like this:
select x from users
where
columnToBeSearched like '%?%' OR
columnToBeSearched like '%;%' OR
columnToBeSearched like '%.%' OR
columnToBeSearched like '%,%' OR
otherColumnToBeSearched like '%?%' OR
otherColumnToBeSearched like '%;%' OR
otherColumnToBeSearched like '%.%' OR
otherColumnToBeSearched like '%,%'
Now, I'm not a SQL expert by any means, but I get the feeling that the above query will be very inefficient. Doing 8 wildcard searches in a table with millions of records seems like it could slow the system down rather seriously. While it seems to work fine on test servers, I am getting the "this has to be completely wrong" vibe.
As I need to execute this script on a live production server eventually, I hope to achieve good performance, so as not to clog the system. The script might need to be expanded later on to include more illegal characters, but this is very unlikely.
To sum up: My aim is to get a list of records where either of two columns contains a customer-defined "illegal character". The database is live and massive, so I want a somewhat efficient approach, as I believe the above queries will be very slow.
Can anyone tell me the best way for achieving my result? Thanks!
/Morten
It doesn't get used much, but the LIKE operator accepts patterns in a similar (but much simplified) way to regex; the MSDN page for LIKE documents them.
In your case you could simplify to (untested):
select x from users
where
columnToBeSearched like '%[?;.,]%' OR
otherColumnToBeSearched like '%[?;.,]%'
Also note that you can create the LIKE pattern as a variable, allowing for the customer defined part of your requirements.
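A sketch of the variable approach, assuming SQL Server T-SQL (the variable name is made up; SQL Server 2000 needs the separate SET rather than an inline default):
DECLARE @illegalPattern varchar(50)
SET @illegalPattern = '%[?;.,]%'
SELECT x FROM users
WHERE columnToBeSearched LIKE @illegalPattern
OR otherColumnToBeSearched LIKE @illegalPattern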
One other major optimization: If you've got an updated date (or timestamp) on the user row (for any audit history type of thing), then you can always just query rows updated since the last time you checked.
If this is a query that will be run repeatedly, you are probably better off creating an index for it. The syntax escapes me at the moment, but you could probably create a computed column (edit: probably a PERSISTED computed column) which is 1 if columnToBeSearched or otherColumnToBeSearched contain illegal characters, and 0 otherwise. Create an index on that column and simply select all rows where the column is 1. This assumes that the set of illegal characters is fixed for that database installation (I assume that that's what you mean by "specified by the customer"). If, on the other hand, each query might specify a different set of illegal characters, this won't work.
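A sketch of that idea (the column and index names are made up; note that the PERSISTED keyword only exists from SQL Server 2005 onward, and an indexed computed column must be deterministic, which a constant LIKE pattern is):
ALTER TABLE users ADD hasIllegalChar AS
CASE WHEN columnToBeSearched LIKE '%[?;.,]%'
OR otherColumnToBeSearched LIKE '%[?;.,]%'
THEN 1 ELSE 0 END;
CREATE INDEX IX_users_hasIllegalChar ON users (hasIllegalChar);
SELECT x FROM users WHERE hasIllegalChar = 1;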
By the way, if you don't mind the risk of reading uncommitted rows, you can run the query in a transaction with the isolation level READ UNCOMMITTED, so that you won't block other transactions.
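For example (accepting that this allows dirty reads):
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
-- then run the search query from above in the same session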
You can try to partition your data horizontally and "split" your query into a number of smaller queries. For instance, you can do:
SELECT x FROM users
WHERE users.ID BETWEEN 1 AND 5000
AND -- your filters on columnToBeSearched
Putting your results back together in one list may be a little inconvenient, but if it's a report you're only extracting once (or once in a while) it may be feasible.
I'm assuming ID is the primary key of users, or a column that has an index defined, which means SQL should be able to create an efficient execution plan, where it evaluates users.ID BETWEEN 1 AND 5000 (fast) before trying to check the filters (which may be slow).
Look up PATINDEX; it allows you to put in a set of characters. PATINDEX('%[._]%', ColumnName) returns 0, or the position of the first occurrence of an illegal character found in the value. Hope this helps.

How do I perform a simple one-statement SQL search across tables?

Suppose that two tables exist: users and groups.
How does one provide "simple search" in which a user enters text and results contain both users and groups whose names contain the text?
The result of the search must distinguish between the two types.
The trick is to combine a UNION with a literal string to determine the type of 'object' returned. In most (?) cases, UNION ALL will be more efficient, and should be used unless you need duplicate rows removed across the sub-queries. The following pattern should suffice:
SELECT "group" type, name
FROM groups
WHERE name LIKE "%$text%"
UNION ALL
SELECT "user" type, name
FROM users
WHERE name LIKE "%$text%"
NOTE: I've added the answer myself, because I came across this problem yesterday, couldn't find a good solution, and used this method. If someone has a better approach, please feel free to add it.
If you use "UNION ALL" then the db doesn't try to remove duplicates - you won't have duplicates between the two queries anyway (since the first column is different), so UNION ALL will be faster.
(I assume that you don't have duplicates inside each query that you want to remove)
Using LIKE will cause a number of problems, as it will require a table scan every single time when the LIKE pattern starts with a %. This forces SQL to check every single row and work its way, byte by byte, through the string you are using for comparison. While this may be fine when you start, it quickly causes scaling issues.
A better way to handle this is using Full Text Search. While this would be a more complex option, it will provide you with better results for very large databases. Then you can use a functioning version of the example Bobby Jack gave you to UNION ALL your two result sets together and display the results.
I would suggest another addition
SELECT "group" type, name
FROM groups
WHERE UPPER(name) LIKE UPPER("%$text%")
UNION ALL
SELECT "user" type, name
FROM users
WHERE UPPER(name) LIKE UPPER("%$text%")
You could convert $text to upper case first or just do it in the query. This way you get a case-insensitive search.