Is it possible to use LIKE with a set of strings instead of a single element? - sql

I have a list of proper names (in a table), and another table with a free-text field. I want to check whether that field contains any of the proper names. If it were just one, I could do
WHERE free_text LIKE "%proper_name%"
but how do you do that for an entire list? Is there a better string function I can use with a list?
Thanks

No, like does not have that capability.
Many databases support regular expressions, which enable to you do what you want. For instance, in Postgres this is phrased as:
where free_text ~ 'name1|name2|name3'
Many databases also have full-text search capabilities that speed such searches.
Both capabilities are highly specific to the database you are using.

Well, you can use LIKE in a standard JOIN, but the query most likely will be slow, because it will search each proper name in each free_text.
For example, if you have 10 proper names in a list and a certain free_text value contains the first name, the server will continue processing the rest of 9 names.
Here is the query:
SELECT -- DISTINCT
free_text_table.*
FROM
free_text_table
INNER JOIN proper_names_table ON free_text_table.free_text LIKE proper_names_table.proper_name
;
If a certain free_text value contains several proper names, that row will be returned several times, so you may need to add DISTINCT to the query. It depends on what you need.
It is possible to use LATERAL JOIN to avoid Cartesian product (where each row in free_text_table is compared to each rows in proper_names_table). The end result may be faster, than the simple variant. It depends on your data distribution.
Here is SQL Server syntax.
SELECT
free_text_table.*
FROM
free_text_table
CROSS APPLY
(
SELECT TOP(1)
proper_names_table.proper_name
FROM proper_names_table
WHERE free_text_table.free_text LIKE proper_names_table.proper_name
-- ORDER BY proper_names_table.frequency
) AS A
;
Here we don't need DISTINCT, there will be at most one row in the result for each row from free_text_table (one or zero). Optimiser should be smart enough to stop reading and processing proper_names_table as soon as the first match is found due to TOP(1) clause.
If you also can somehow order your proper names and put those that are most likely to be found first, then the query is more likely to be faster than a simple JOIN. (Add a suitable ORDER BY clause in subquery).

Related

Optimising LIKE expressions that start with wildcards

I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.
SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'
I have around 2 million records in this table and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.
I think I'm right in saying that any indexes won't be used for seek operations because of the preceeding wildcard.
Using full text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that you could replace the search string for Guil* in the below query and this would return results). Certainly running the following returns no results
SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')
Is there any way to optimise queries with preceding wildcards?
Here is one (not really recommended) solution.
Create a table AddressSubstrings. This table would have multiple rows per address and the primary key of table.
When you insert an address into table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert:
abcd
bcd
cd
d
along with the unique id of the row in Table. (This can all be done using a trigger.)
Create an index on AddressSubstrings(AddressSubstring).
Then you can phrase your query as:
SELECT *
FROM Table t JOIN
AddressSubstrings ads
ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';
Now there will be a matching row starting with nham. So, like should make use of an index (and a full text index also works).
If you are interesting in the right way to handle this problem, a reasonable place to start is the Postgres documentation. This uses a method similar to the above, but using n-grams. The only problem with n-grams for your particular problem is that they require re-writing the comparison as well as changing the storing.
I can't offer a complete solution to this difficult problem.
But if you're looking to create a suffix search capability, in which, for example, you'd be able to find the row containing HWilson with ilson and the row containing ABC123000654 with 654, here's a suggestion.
WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'
Of course this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL server, allow the definition, and indexing, of computed or virtual columns.
I've deployed this technique, to the delight of end users, in a health-care system with lots of record IDs like ABC123000654.
Not without a serious preparation effort, hwilson1.
At the risk of repeating the obvious - any search path optimisation - leading to the decision whether an index is used, or which type of join operator to use, etc. (independently of which DBMS we're talking about) - works on equality (equal to) or range checking (greater-than and less-than).
With leading wildcards, you're out of luck.
The workaround is a serious preparation effort, as stated up front:
It would boil down to Vertica's text search feature, where that problem is solved. See here:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Tables/TextSearch/UsingTextSearch.htm
For any other database platform, including MS SQL, you'll have to do that manually.
In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.
You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:
If your input table (just showing the columns that matter) is this:
id |the_search_col |other_col
42|The Restaurant at the End of the Universe|Arthur Dent
43|The Hitch-Hiker's Guide to the Galaxy |Ford Prefect
Your auxiliary search table could contain:
id |seq|search_token
42| 1|Restaurant
42| 2|End
42| 3|Universe
43| 1|Hitch-Hiker
43| 2|Guide
43| 3|Galaxy
Normally, you suppress typical "fillers" like articles and prepositions and apostrophe-s , and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find splitting token candidates .... :-]
You could start by the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause, like here:
Pivot sql convert rows to columns
Then, use a combination of, probably nested, CHARINDEX() and SUBSTRING() using the index you get from the CROSS JOIN with a series of index integers as described in my post suggested above, and use that very index as the sequence for the auxiliary search table.
Lay an index on search_token and you'll have a very fast access path to a big table.
Not a stroll in the park, I agree, but promising ...
Happy playing -
Marco the Sane

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. But it appears that the substring call is being applied before the filtering happens, the query is failing.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this would give me an error, because when it reaches "X" it will try using a negative number for in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular Database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names, (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
Martin gave this link that pretty much explains what is going on - the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it i will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if i can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsfroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
Build a working table from all of
the table constructors in the FROM
clause.
Remove from the working table those
rows that do not satisfy the WHERE
clause.
Construct the expressions in the
SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called query execution plan. It's based on query optimization rules, indexes, temporaty buffers and execution time statistics. If you are using SQL Managment Studio you have toolbox over your query editor where you can look at estimated execution plan, it shows how your query will change to gain some speed. So if just used your Name table and it is in buffer, engine might first try to subquery your data, and then join it with other table.

Placing index columns on the left of a mysql WHERE statement?

I was curious since i read it in a doc. Does writing
select * from CONTACTS where id = ‘098’ and name like ‘Tom%’;
speed up the query as oppose to
select * from CONTACTS where name like ‘Tom%’ and id = ‘098’;
The first has an indexed column on the left side. Does it actually speed things up or is it superstition?
Using php and mysql
Check the query plans with explain. They should be exactly the same.
This is purely superstition. I see no reason that either query would differ in speed. If it was an OR query rather than an AND query however, then I could see that having it on the left may spped things up.
interesting question, i tried this once. query plans are the same (using EXPLAIN).
but considering short-circuit-evaluation i was wondering too why there is no difference (or does mysql fully evaluate boolean statements?)
You may be mis-remembering or mis-reading something else, regarding which side the wildcards are on a string literal in a Like predicate. Putting the wildcard on the right (as in yr example), allows the query engine to use any indices that might exist on the table column you are searching (in this case - name). But if you put the wildcard on the left,
select * from CONTACTS where name like ‘%Tom’ and id = ‘098’;
then the engine cannot use any existing index and must do a complete table scan.

How best to sum multiple boolean values via SQL?

I have a table that contains, among other things, about 30 columns of boolean flags that denote particular attributes. I'd like to return them, sorted by frequency, as a recordset along with their column names, like so:
Attribute Count
attrib9 43
attrib13 27
attrib19 21
etc.
My efforts thus far can achieve something similar, but I can only get the attributes in columns using conditional SUMs, like this:
SELECT SUM(IIF(a.attribIndex=-1,1,0)), SUM(IIF(a.attribWorkflow =-1,1,0))...
Plus, the query is already getting a bit unwieldy with all 30 SUM/IIFs and won't handle any changes in the number of attributes without manual intervention.
The first six characters of the attribute columns are the same (attrib) and unique in the table, is it possible to use wildcards in column names to pick up all the applicable columns?
Also, can I pivot the results to give me a sorted two-column recordset?
I'm using Access 2003 and the query will eventually be via ADODB from Excel.
This depends on whether or not you have the attribute names anywhere in data. If you do, then birdlips' answer will do the trick. However, if the names are only column names, you've got a bit more work to do--and I'm afriad you can't do it with simple SQL.
No, you can't use wildcards to column names in SQL. You'll need procedural code to do this (i.e., a VB Module in Access--you could do it within a Stored Procedure if you were on SQL Server). Use this code build the SQL code.
It won't be pretty. I think you'll need to do it one attribute at a time: select a string whose value is that attribute name and the count-where-True, then either A) run that and store the result in a new row in a scratch table, or B) append all those selects together with "Union" between them before running the batch.
My Access VB is more than a bit rusty, so I don't trust myself to give you anything like executable code....
Just a simple count and group by should do it
Select attribute_name
, count(*)
from attribute_table
group by attribute_name
To answer your comment use Analytic Functions for that:
Select attribute_table.*
, count(*) over(partition by attribute_name) cnt
from attribute_table
In Access, Cross Tab queries (the traditional tool for transposing datasets) need at least 3 numeric/date fields to work. However since the output is to Excel, have you considered just outputting the data to a hidden sheet then using a pivot table?

How do I perform a simple one-statement SQL search across tables?

Suppose that two tables exist: users and groups.
How does one provide "simple search" in which a user enters text and results contain both users and groups whose names contain the text?
The result of the search must distinguish between the two types.
The trick is to combine a UNION with a literal string to determine the type of 'object' returned. In most (?) cases, UNION ALL will be more efficient, and should be used unless duplicates are required in the sub-queries. The following pattern should suffice:
SELECT "group" type, name
FROM groups
WHERE name LIKE "%$text%"
UNION ALL
SELECT "user" type, name
FROM users
WHERE name LIKE "%$text%"
NOTE: I've added the answer myself, because I came across this problem yesterday, couldn't find a good solution, and used this method. If someone has a better approach, please feel free to add it.
If you use "UNION ALL" then the db doesn't try to remove duplicates - you won't have duplicates between the two queries anyway (since the first column is different), so UNION ALL will be faster.
(I assume that you don't have duplicates inside each query that you want to remove)
Using LIKE will cause a number of problems as it will require a table scan every single time when the LIKE comparator starts with a %. This forces SQL to check every single row and work it's way, byte by byte, through the string you are using for comparison. While this may be fine when you start, it quickly causes scaling issues.
A better way to handle this is using Full Text Search. While this would be a more complex option, it will provide you with better results for very large databases. Then you can use a functioning version of the example Bobby Jack gave you to UNION ALL your two result sets together and display the results.
I would suggest another addition
SELECT "group" type, name
FROM groups
WHERE UPPER(name) LIKE UPPER("%$text%")
UNION ALL
SELECT "user" type, name
FROM users
WHERE UPPER(name) LIKE UPPER("%$text%")
You could convert $text to upper case first or do just do it in the query. This way you get a case insensitive search.