Select rows in SQL with regex values that match input? - sql

Rather than selecting rows based on whether their string value equals a given regular expression input, I want to select rows with regular expressions that match a given string input.
As far as purpose, I'm am trying to identify website names from input URLs.
TABLE
WEBSITE REGEX
The New York Times ^.+\.nytimes.com.*$
Is there a good way to do this? I'm using postgres, and I was hoping to avoid large loops.
Thanks!

This seems to work fine:
CREATE TABLE Sites
(
SiteName text,
RegEx text
);
INSERT INTO Sites VALUES ('NY Times', '^.+\.nytimes.com.*$');
Then you can do:
SELECT * FROM Sites
WHERE 'http://www.nytimes.com/Foo' ~ RegEx;
Fiddle
Keep in mind this might start to get slow if you had a lot of rows, as it's going to have to do a sequential table scan each time and run the regular expression against each row. A better approach might be to parse the URL first and normalize it in some way, then look for an exact match in the table.

Related

Optimising LIKE expressions that start with wildcards

I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.
SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'
I have around 2 million records in this table and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.
I think I'm right in saying that any indexes won't be used for seek operations because of the preceeding wildcard.
Using full text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that you could replace the search string for Guil* in the below query and this would return results). Certainly running the following returns no results
SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')
Is there any way to optimise queries with preceding wildcards?
Here is one (not really recommended) solution.
Create a table AddressSubstrings. This table would have multiple rows per address and the primary key of table.
When you insert an address into table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert:
abcd
bcd
cd
d
along with the unique id of the row in Table. (This can all be done using a trigger.)
Create an index on AddressSubstrings(AddressSubstring).
Then you can phrase your query as:
SELECT *
FROM Table t JOIN
AddressSubstrings ads
ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';
Now there will be a matching row starting with nham. So, like should make use of an index (and a full text index also works).
If you are interesting in the right way to handle this problem, a reasonable place to start is the Postgres documentation. This uses a method similar to the above, but using n-grams. The only problem with n-grams for your particular problem is that they require re-writing the comparison as well as changing the storing.
I can't offer a complete solution to this difficult problem.
But if you're looking to create a suffix search capability, in which, for example, you'd be able to find the row containing HWilson with ilson and the row containing ABC123000654 with 654, here's a suggestion.
WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'
Of course this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL server, allow the definition, and indexing, of computed or virtual columns.
I've deployed this technique, to the delight of end users, in a health-care system with lots of record IDs like ABC123000654.
Not without a serious preparation effort, hwilson1.
At the risk of repeating the obvious - any search path optimisation - leading to the decision whether an index is used, or which type of join operator to use, etc. (independently of which DBMS we're talking about) - works on equality (equal to) or range checking (greater-than and less-than).
With leading wildcards, you're out of luck.
The workaround is a serious preparation effort, as stated up front:
It would boil down to Vertica's text search feature, where that problem is solved. See here:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Tables/TextSearch/UsingTextSearch.htm
For any other database platform, including MS SQL, you'll have to do that manually.
In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.
You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:
If your input table (just showing the columns that matter) is this:
id |the_search_col |other_col
42|The Restaurant at the End of the Universe|Arthur Dent
43|The Hitch-Hiker's Guide to the Galaxy |Ford Prefect
Your auxiliary search table could contain:
id |seq|search_token
42| 1|Restaurant
42| 2|End
42| 3|Universe
43| 1|Hitch-Hiker
43| 2|Guide
43| 3|Galaxy
Normally, you suppress typical "fillers" like articles and prepositions and apostrophe-s , and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find splitting token candidates .... :-]
You could start by the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause, like here:
Pivot sql convert rows to columns
Then, use a combination of, probably nested, CHARINDEX() and SUBSTRING() using the index you get from the CROSS JOIN with a series of index integers as described in my post suggested above, and use that very index as the sequence for the auxiliary search table.
Lay an index on search_token and you'll have a very fast access path to a big table.
Not a stroll in the park, I agree, but promising ...
Happy playing -
Marco the Sane

SQL Contains - only match at start

For some reason I cannot find the answer on Google! But with the SQL contains function how can I tell it to start at the beginning of a string, I.e I am looking for the full-text equivalent to
LIKE 'some_term%'.
I know I can use like, but since I already have the full-text index set up, AND the table is expected to have thousands of rows, I would prefer to use Contains.
Thanks!
You want something like this:
Rather than specify multiple terms, you can use a 'prefix term' if the
terms begin with the same characters. To use a prefix term, specify
the beginning characters, then add an asterisk (*) wildcard to the end
of the term. Enclose the prefix term in double quotes. The following
statement returns the same results as the previous one.
-- Search for all terms that begin with 'storm'
SELECT StormID, StormHead, StormBody FROM StormyWeather
WHERE CONTAINS(StormHead, '"storm*"')
http://www.simple-talk.com/sql/learn-sql-server/full-text-indexing-workbench/
You can use CONTAINS with a LIKE subquery for matching only a start:
SELECT *
FROM (
SELECT *
FROM myTable WHERE CONTAINS('"Alice in wonderland"')
) AS S1
WHERE S1.edition LIKE 'Alice in wonderland%'
This way, the slow LIKE query will be run against a smaller set
The only solution I can think of it to actually prepend a unique word to the beginning of every field in the table.
e.g. Update every row so that 'xfirstword ' appears at the start of the text (e.g. Field1). Then you can search for CONTAINS(Field1, 'NEAR ((xfirstword, "TERM*"),0)')
Pretty crappy solution, especially as we know that the full text index stores the actual position of each word in the text (see this link for details: http://msdn.microsoft.com/en-us/library/ms142551.aspx)
I am facing the similar issue. This is what I have implemented as a work around.
I have made another table and pulled only the rows like 'some_term%'.
Now, on this new table I have implemented the FullText search.
Please do inform me if you tried some other better approach

How to use like condition with multiple values in sql server 2005?

I need to filter out records based on some text matching in nvarchar(1000) column.
Table has more than 400 thousands records and growing. For now, I am using Like condition:-
SELECT
*
FROM
table_01
WHERE
Text like '%A1%'
OR Text like '%B1%'
OR Text like '%C1%'
OR Text like '%D1%'
Is there any preferred work around?
SELECT
*
FROM
table_01
WHERE
Text like '%[A-Z]1%'
This will check if the texts contains A1, B1, C1, D1, ...
Reference to using the Like Condition in SQL Server
You can try the following if you know the exact position of your sub string:
SELECT
*
FROM
table_01
WHERE
SUBSTRING(Text,1,2) in ('B1','C1','D1')
Have a look at LIKE on msdn.
You could reduce the number filters by combining more details into a single LIKE clause.
SELECT
*
FROM
table_01
WHERE
Text like '%[ABCD]1%'
If you can create a FULLTEXT INDEX on that column of your table (that assumes a lot of research on performance and space), then you are probably going to see a big improvement on performance on text matching. You can go to this link to see what FULLTEXT SEARCH is
and this link to see how to create a FULLTEXT INDEX.
I needed to do this so that I could allow two different databases in a filter for the DatabaseName column in an SQL Server Profiler Trace Template.
All you can do is fill in the body of a Like clause.
Using the reference in John Hartscock's answer, I found out that the like clause uses a sort of limited regex pattern.
For the OP's scenario, MSMS has the solution.
Assuming I want databases ABCOne, ABCTwo, and ABCThree, I come up with what is essentially independent whitelists for each character:
Like ABC[OTT][NWH][EOR]%
Which is easily extensible to any set of strings. It won't be ironclad, that last pattern would also match ABCOwe, ABCTnr, or ABCOneHippotamus, but if you're filtering a limited set of possible values there's a good chance you can make it work.
You could alternatively use the [^] operator to present a blacklist of unacceptable characters.

MySql Regular Expression Select Columns Matching (dynamic)multiple Values within Stored Procedure

I am trying to generate a query where I want to select columns(text) matching multiple values.
eg: I have two columns, id and description. suppose my first row contains description column with value
Google is website and an awesome
search engine
, second row with description column value
Amazon website is an awesome eCommerce
store
I have created a query
Select * from table_name where
description REGEXP 'Website \|
Search'
It returns both the rows i.e with both google and amazon, but i want to return only google as i want those rows with both the words website and search also the number of words to be matched is also not fixed, basically the query I am creating is for a search drop down,
All the words that are passed should be present in the column, the order of the words present in the column is not important. If there are other better options besides using regex , please do point out.
Editing: the number of words that are passed are dynamic and not known, the user may pass additional words to be matched against the column. I would be using the query within a stored Procedure
Really don;t think the regex solution is going to be good for you from a performance point of view. Think you should be looking for FULL text searches.
Specifically you need to create a full text index with something like this in the table definition:
create table testTable
(
Id int auto_increment not null,
TextCol varchar(500)
fulltext(TextCol)
);
Then your query gets easier:
select * from testTable where Match(TextCol) against ('web')
AND Match(TextCol) against ('server')
Strongly suggest you read the MySQL docs regarding FULLTEXT matching and there are lots of little tricks and features that will be useful in this task (including more efficient ways to run the query above)
Edit: Perhaps Boolean mode will help you to an easy solution like this:
Match(textCol) against ('web+ Server+' in boolean mode)
All you have to do is build the against string so I think this can be done in an SP with out dynamic SQL

How do I perform a simple one-statement SQL search across tables?

Suppose that two tables exist: users and groups.
How does one provide "simple search" in which a user enters text and results contain both users and groups whose names contain the text?
The result of the search must distinguish between the two types.
The trick is to combine a UNION with a literal string to determine the type of 'object' returned. In most (?) cases, UNION ALL will be more efficient, and should be used unless duplicates are required in the sub-queries. The following pattern should suffice:
SELECT "group" type, name
FROM groups
WHERE name LIKE "%$text%"
UNION ALL
SELECT "user" type, name
FROM users
WHERE name LIKE "%$text%"
NOTE: I've added the answer myself, because I came across this problem yesterday, couldn't find a good solution, and used this method. If someone has a better approach, please feel free to add it.
If you use "UNION ALL" then the db doesn't try to remove duplicates - you won't have duplicates between the two queries anyway (since the first column is different), so UNION ALL will be faster.
(I assume that you don't have duplicates inside each query that you want to remove)
Using LIKE will cause a number of problems as it will require a table scan every single time when the LIKE comparator starts with a %. This forces SQL to check every single row and work it's way, byte by byte, through the string you are using for comparison. While this may be fine when you start, it quickly causes scaling issues.
A better way to handle this is using Full Text Search. While this would be a more complex option, it will provide you with better results for very large databases. Then you can use a functioning version of the example Bobby Jack gave you to UNION ALL your two result sets together and display the results.
I would suggest another addition
SELECT "group" type, name
FROM groups
WHERE UPPER(name) LIKE UPPER("%$text%")
UNION ALL
SELECT "user" type, name
FROM users
WHERE UPPER(name) LIKE UPPER("%$text%")
You could convert $text to upper case first or do just do it in the query. This way you get a case insensitive search.