SQL Regexp like pattern match on any combination of words - sql

I'm working in a SQL dialect that uses Perl-style regular expressions (regexp_like) and need to combine two tables based on pattern matching. One item might be 'Mammogram-bilateral' while the other is 'bilateral mammogram scan'. I need help matching rows where 9 out of 10 words match, and if that finds nothing then 8 out of 10 and so on, or failing that the rows with the most characters (or words) in common across both tables.
I really need help getting the base of this going; the rest I could clean up myself (for instance, I need 'removal of pacemaker' to be different from 'insertion of pacemaker', but I understand fixing that might be a big task). The problem I'm having is getting rows like that to match with just a regexp_like(x,y,'i') type join.
I'm a newbie to regular expressions. I have spent hours searching for help but can't find anything; sorry if I missed something.
UPDATE: Okay, for clarification, I've currently got this running:
regexp_substr(x.x.concept_name,'\w+\b',1,1) = regexp_substr(y.x,'\w+\b',1,1)
AND regexp_substr (x.x,'\w+\b',1,2) = regexp_substr (x.y,'\w+\b',1,2)
AND regexp_substr (x.x,'\w+\b',1,3) = regexp_substr (x.y,'\w+\b',1,3)
and so on...
So that matches the first 3 words (it also just occurred to me to use a whitespace count as a filter, in this case whether the string has 3 whitespaces). I would basically need to do 1,2 = (1,1||1,2||1,3) and so on forever, and some of these have over 100 whitespaces...
So regexp_like doesn't quite work, but I'm trying to find a regexp_substr() that does.
Update 2: Levenshtein distance might help for some of these, but I would need to find the shortest distance between the two values, and I'm not aware of a way to do that.
update 4:
'Spinal Fusion' = 'Fusion of the Spine'
'Mammogram-bilateral' = 'bilateral mammogram scan'
'Echocardiogram (ECG)' = 'ECG'
update 5: I actually use regexp_ilike(x,y), but regexp_like() seems to be more common. It's Vertica SQL syntax, which uses PCRE (Perl).
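Not a Vertica answer, but the "N out of M words, in any order" idea can be sketched outside the database. The following is a minimal Python sketch under stated assumptions: the function names (words, overlap_score, best_match) are made up for illustration, and the score is simply the share of the shorter name's words that also appear in the other name.

```python
import re

def words(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(a, b):
    """Fraction of the shorter name's words that also occur in the other name."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

def best_match(name, candidates):
    """Return the candidate with the highest score (the first one wins ties)."""
    return max(candidates, key=lambda c: overlap_score(name, c))

candidates = ["bilateral mammogram scan", "ECG", "Fusion of the Spine"]
print(best_match("Mammogram-bilateral", candidates))
print(overlap_score("Spinal Fusion", "Fusion of the Spine"))
print(overlap_score("removal of pacemaker", "insertion of pacemaker"))
```

Word order and punctuation no longer matter, so 'Mammogram-bilateral' and 'bilateral mammogram scan' score 1.0. The sketch does not solve the harder cases from the question: 'Spinal' and 'Spine' count as different words, and 'removal of pacemaker' still scores fairly high against 'insertion of pacemaker'.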

Related

Replace function, keep unknown substrings/wildcards

I have tried looking for answers online, but I am lacking the right nomenclature to find any answers matching my question.
The DB I am working with is an inconsistent mess. I am currently trying to import a number of maintenance codes which I have to link to a pre-existing Excel table. For this reason, the maintenance codes I import have to be very uniform.
The table is designed to work with a 2-3 digit number (a time length), followed by a time unit.
For example, SERV-01W and SERV-03M .
As these used to be added to the DB by hand, a large number of older maintenance codes are actually written with 1 digit numbers.
For example, SERV-1W and SERV-3M.
I would like to replace the old codes by the new codes. In other words, I want to add a leading 0 if only one digit is used in the code.
REPLACE(T.Code,'-[0-9][DWM]','-0[0-9][DWM]') unfortunately does not work, most likely because I am using wildcards in the result string.
What would be a good way of handling this issue?
Thank you in advance.
Assuming I understand your requirement this should get you what you are after:
WITH VTE AS(
SELECT *
FROM (VALUES('SERV-03M'),
('SERV-01W'),
('SERV-1Q'),
('SERV-4X')) V(Example))
SELECT Example,
ISNULL(STUFF(Example, NULLIF(PATINDEX('%-[0-9][A-z]%',Example),0)+1,0,'0'),Example) AS NewExample
FROM VTE;
Instead of trying to replace the pattern, I used PATINDEX to find the pattern and then inject the extra '0' character. If the pattern wasn't found, so 0 was returned by PATINDEX, I forced the expression to return NULL and then wrapped the entire thing with a further ISNULL, so that the original value was returned.
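For readers who want to check the logic outside SQL Server, here is the same "find the pattern, then inject a 0" idea as a small Python sketch. It is not T-SQL, and pad_code and SINGLE_DIGIT are names made up for this illustration; the pattern mirrors the PATINDEX one (hyphen, one digit, then a letter).

```python
import re

# A hyphen, exactly one digit, and then a letter, e.g. the "-1" in "SERV-1W".
# The lookahead keeps the letter out of the match so only the digit is rewritten.
SINGLE_DIGIT = re.compile(r"-(\d)(?=[A-Za-z])")

def pad_code(code):
    """Insert a leading 0 when the code has a single digit; otherwise return it unchanged."""
    return SINGLE_DIGIT.sub(r"-0\1", code, count=1)

for example in ["SERV-03M", "SERV-01W", "SERV-1Q", "SERV-4X"]:
    print(example, "->", pad_code(example))
```

Codes that already have two digits do not match the pattern, so they come back untouched, which is what the ISNULL/NULLIF wrapping achieves in the query above.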
I find a CASE expression to be a simple way to express the logic:
SELECT (CASE WHEN code LIKE '%-[0-9][0-9]%'
THEN code
ELSE REPLACE(code, '-', '-0')
END)
That is, if the code has two digits, then do nothing. Otherwise, add a zero. The code should be quite clear on what it is doing.
This is not generalizable (it doesn't add two zeros for instance), but it does do exactly what you are asking for.

DB2 complex like

I have to write a select statement that matches the following pattern:
[A-Z][0-9][0-9][0-9][0-9][A-Z][0-9][0-9][0-9][0-9][0-9]
The only thing I'm sure of is that the first A-Z WILL be there. All the rest is optional and the optional part is the problem. I don't really know how I could do that.
Some example data:
B/0765/E 3
B/0765/E3
B/0764/A /02
B/0749/K
B/0768/
B/0784//02
B/0807/
My guess is that I had best remove all the white spaces and the / characters from the data and then execute the select statement. But I'm having some problems actually writing the LIKE pattern. Could anyone help me out?
The underlying reason for this is that I'm migrating a database. In the old database the values are in a single field, but in the new one they are split into several fields. First, though, I have to write a "control script" to find out which records in the old database are not correct.
Even the following isn't working:
where someColumn LIKE '[a-zA-Z]%';
You can use regular expressions via XQuery to define this pattern. There are many questions on Stack Overflow about patterns in DB2, and they have been solved with regular expressions:
DB2: find field value where first character is a lower case letter
Emulate REGEXP like behaviour in SQL
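As a way to pin down the pattern before translating it to whatever DB2 offers, here is a hedged sketch in Python (not DB2). It assumes, based on the sample data, that every part after the first letter is independently optional; normalize and is_valid are hypothetical helper names.

```python
import re

# First letter is mandatory; up to 4 digits, an optional letter and
# up to 5 digits may follow (assumption drawn from the sample rows).
CODE = re.compile(r"[A-Z][0-9]{0,4}[A-Z]?[0-9]{0,5}")

def normalize(raw):
    """Remove whitespace and slashes, as the question proposes."""
    return re.sub(r"[\s/]", "", raw)

def is_valid(raw):
    """True when the whole normalized value fits the pattern."""
    return CODE.fullmatch(normalize(raw)) is not None

samples = ["B/0765/E 3", "B/0765/E3", "B/0764/A /02", "B/0749/K",
           "B/0768/", "B/0784//02", "B/0807/", "0765/E", "B/0765/EE"]
for value in samples:
    print(repr(value), is_valid(value))
```

The last two values are made-up bad rows (no leading letter, two letters in a row) to show what the control script would flag.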

using regular expression in sql

I have the following rows in my table
COL1 EXTRA DOUBLE TEST
12 TEST
123 EXTRA
125 EXTRA 95 DOUBLE
EXTRA 45 99 DOUBLE
I am using regular expressions to filter out the rows and move them appropriately to different columns. So:
For the first row, I want 12 to be extracted and put in column TEST.
For the second row, I want 123 to be extracted and put in column EXTRA.
For 3rd row, I want 125 to be extracted and put in column EXTRA.
I want to ignore 95.
For the last row, I want 45 to be extracted and put in column EXTRA.
I can extract the values and put them in appropriate columns through my query, I am using this regular expression for extracting the values:
'%[0-9]%[^A-Z]%[0-9]%'
the problem with this regular expression is that it extracts 12, but it does not extract 123 from the second row, if I change the regular expression to:
'%[0-9]*%[^A-Z]%[0-9]%'
then it extracts 123, but for the third row, it concatenates 125 with 95 so I get 12595. Is there any way I can avoid 95 and just get the value 125? If I remove the star then it does not do any concatenation.
Any help will be appreciated. I posted this question before, but some of you were asking for more explanation so I posted a new question for that.
I believe that the regex you are looking for is below. It matches a run of digits, followed by one character that is not an uppercase letter, and then swallows any further digits so they are ignored. However, I believe that when you use the %regex%regex%... form, each piece is evaluated separately, so I am not sure about the nuances of regex in SQL. If you run this against rubular.com it seems to solve the problem you are asking about. Hopefully it can be of some use in your regex search :)
([0-9]*)([^A-Z])(?>[0-9]*)
However, I did just look at your other examples where the letters come first, and this would not work for those. But maybe it can still be of use to you.
SQL Server does not natively support Regex. It does support some limited pattern matching through Like and Patindex.
If you really want to use Regex inside of SQL Server you can use a .NET language like C# to create a CLR function and import that into SQL Server, but that comes with a number of drawbacks. If you want to use Regex, the better way is to have an application that runs on top of SQL Server. That application could be written in any language that can talk to SQL Server over ODBC, like C# or Python, and in fact I talk about interfacing Python with SQL Server to use regex in an intro article on Simple-Talk.
But the patterns you provide use SQL Server's more limited pattern matching capabilities rather than Regex, so that seems to be what you want. There is a full description at Pattern Matching in Search Conditions.
As for solving your particular problem, you don't seem to have one particular pattern but several possible patterns anyway. That type of situation is almost impossible to handle with a single SQL Server pattern, and the regex logic gets unnecessarily complicated too. So, if I were in your position I would not try to create a single pattern but a series of cases, and then extract the number you need based on which case applies.
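To make the "series of cases" idea concrete, here is a rough application-side sketch in Python. The three cases are inferred only from the four sample rows in the question, and CASES and extract are names invented for this example, so real data may need more cases.

```python
import re

# Each case pairs a target column with a pattern whose group 1 is the number to keep.
CASES = [
    ("TEST",  re.compile(r"^(\d+) TEST\b")),    # "12 TEST"
    ("EXTRA", re.compile(r"^(\d+) EXTRA\b")),   # "123 EXTRA", "125 EXTRA 95 DOUBLE"
    ("EXTRA", re.compile(r"^EXTRA (\d+)\b")),   # "EXTRA 45 99 DOUBLE"
]

def extract(col1):
    """Return (column, number) for the first case that matches, or None."""
    value = col1.strip()
    for column, pattern in CASES:
        match = pattern.search(value)
        if match:
            return column, int(match.group(1))
    return None

for row in ["12 TEST", "123 EXTRA", "125 EXTRA 95 DOUBLE", "EXTRA 45 99 DOUBLE"]:
    print(row, "->", extract(row))
```

Because each pattern captures only the first number in its case, the 95 and 99 in the longer rows are never concatenated onto the result.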
Assuming this is SQL 2005 (or later I guess... I can only speak for 2005), and all different permutations of COL1 data are in your question:
UPDATE NameOfYourTable
SET TEST = SUBSTRING(Col1, 0, LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] TEST%', Col1) - 1))
WHERE COL1 LIKE '%[0-9] TEST%'
UPDATE NameOfYourTable
SET EXTRA = SUBSTRING(Col1, 0, LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] EXTRA%', Col1) - 1))
WHERE COL1 LIKE '%[0-9] EXTRA%'
UPDATE NameOfYourTable
SET EXTRA = SUBSTRING(Col1, PATINDEX('%[0-9]%', Col1), LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] [0-9]%', Col1) + LEN('EXTRA ')))
WHERE COL1 LIKE 'EXTRA [0-9]%'
Somehow though, I really don't think this is going to resolve your problem. I would strongly advise you to make sure this will catch all the cases you need to handle by running this on some test data.
If you have a lot of different cases to handle, then the better alternative I think would be to make a small console program in something like C# (that has much better RegExp support) to sift through your data and apply updates that way. Trying to handle numerous permutations of COL1 data is going to be a nightmare in SQL.
Also read these on LIKE, PATINDEX and their (limited) pattern-matching abilities:
LIKE: http://msdn.microsoft.com/en-us/library/ms179859(v=sql.90).aspx
PATINDEX: http://msdn.microsoft.com/en-us/library/ms188395(v=sql.90).aspx

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, so the query fails.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will try to use a negative number for the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if i can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
Build a working table from all of the table constructors in the FROM clause.
Remove from the working table those rows that do not satisfy the WHERE clause.
Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
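The difference between the logical order and what the asker observed can be modelled in a few lines. This is only a toy Python model of the NAMES / LONG_NAMES example, not how any engine is implemented; strict_substring stands in for SQL Server's SUBSTRING, which raises an error on a negative length, and all function names are invented for the sketch.

```python
NAMES = [{"id": 1, "name": "Peter"}, {"id": 2, "name": "X"}]
LONG_NAMES = [{"id": 1, "name_length": 5}]

def strict_substring(text, start, length):
    """1-based substring that rejects a negative length, like the failing call."""
    if length < 0:
        raise ValueError("negative length passed to substring")
    return text[start - 1:start - 1 + length]

def logical_order():
    """FROM/JOIN first, then evaluate the SELECT expression on surviving rows."""
    working = [n for n in NAMES for l in LONG_NAMES if n["id"] == l["id"]]
    return [strict_substring(r["name"], 1, len(r["name"]) - 3) for r in working]

def expression_first():
    """Evaluate the SELECT expression on every NAMES row, then apply the join."""
    computed = [(n["id"], strict_substring(n["name"], 1, len(n["name"]) - 3))
                for n in NAMES]  # fails on "X" before the join can discard it
    long_ids = {l["id"] for l in LONG_NAMES}
    return [value for row_id, value in computed if row_id in long_ids]

def guarded():
    """The CASE WHEN style guard: safe whichever order is used."""
    return [strict_substring(n["name"], 1, len(n["name"]) - 3)
            if len(n["name"]) > 3 else None for n in NAMES]

print(logical_order())
print(guarded())
```

Calling expression_first() raises, which matches the failure described in the question; the guarded version never passes a negative length, so it does not depend on evaluation order.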
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution time statistics. If you are using SQL Server Management Studio, there is a toolbar above the query editor where you can look at the estimated execution plan; it shows how your query will be rearranged to gain some speed. So if you have just used your NAMES table and it is in the buffer, the engine might first evaluate the expression against that data, and then join it with the other table.

MySQL: select the closest match?

I want to show the closest related item for a product. So say I am showing a product and the style number is SG-sfs35s. Is there a way to select whatever product's style number is closest to that?
Thanks.
EDIT: To answer your questions: I definitely want to keep the first 2 letters, as that is the manufacturer code, but for the part after the first dash, just whatever matches closest. So for example SG-sfs35s would match SG-shs35s much more than SG-sht64s. I hope this makes sense. Whenever I do LIKE product_style_number it only pulls the exact match.
There normally isn't a simple way to match product codes that are roughly similar.
A more SQL friendly solution is to create a new table that maps each product to all the products it is similar to.
This table would either need to be maintained manually, or a more sophisticated script can be executed periodically to update it.
If your product codes follow a consistent pattern (all the letters are the same for similar products, with only the numbers changing), then you should be able to use a regular expression to match the similar items. There are docs on this here...
It sounds like what you want is Levenshtein distance.
Unfortunately, there isn't a built-in Levenshtein function for MySQL, but some folks have come up with a user-defined function that does it (dead link).
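Since the linked function is gone, here is what the distance computes, as a plain Python sketch rather than a MySQL function. levenshtein is the standard dynamic-programming edit distance; closest is a hypothetical helper that, following the question's edit, only compares codes sharing the manufacturer prefix before the first dash.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]

def closest(code, candidates):
    """Closest other code with the same manufacturer prefix, or None."""
    prefix = code.split("-", 1)[0]
    same_maker = [c for c in candidates
                  if c != code and c.split("-", 1)[0] == prefix]
    return min(same_maker, key=lambda c: levenshtein(code, c), default=None)

codes = ["SG-sfs35s", "SG-shs35s", "SG-sht64s", "XX-sfs35s"]
print(closest("SG-sfs35s", codes))
```

Computing the distance against every row is fine for a small catalogue; for a large one, precomputing a similarity table as suggested in the other answer is likely the more practical route.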
You will probably want to do it as a stored procedure, as I expect that the algorithm may not be trivial.
For example, you may split the term at the -, so you have two parts. You do a LIKE query on each part and use that to make a decision.
You could just loop though, replacing the last character with "%" until you get at least one result, in your stored procedure.
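A rough sketch of that loop, written in Python against an in-memory SQLite table instead of a MySQL stored procedure so it can be run as-is. The products table, its code column and similar_by_prefix are all assumptions for the illustration; min_len=3 keeps the manufacturer prefix "SG-" intact.

```python
import sqlite3

def similar_by_prefix(conn, code, min_len=3):
    """Shorten the LIKE prefix one character at a time until another row matches."""
    for n in range(len(code), min_len - 1, -1):
        rows = conn.execute(
            "SELECT code FROM products WHERE code LIKE ? AND code <> ? ORDER BY code",
            (code[:n] + "%", code),
        ).fetchall()
        if rows:
            return [row[0] for row in rows]
    return []

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (code TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("SG-sfs35s",), ("SG-shs35s",), ("SG-sht64s",), ("XX-sfs35s",)])
print(similar_by_prefix(conn, "SG-sfs35s"))
```

Note the limitation: a prefix search cannot tell that SG-shs35s shares its ending with SG-sfs35s, so both SG-sh... codes come back together once the prefix shrinks to "SG-s". Codes containing % or _ would also need escaping before being used in LIKE.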
Sounds like you need something like Lucene, though I'm not sure if that would be overkill for your situation. But it certainly would be able to do text searches and return the most similar results first.
If you need something more simple I would try to start by searching with the full product code, then if that doesn't work try to use wildcards/remove some characters until you return a result.
JD Isaacks.
This situation of yours is very simple to solve.
It's not like you need to use artificial intelligence the way Google does.
http://www.w3schools.com/sql/sql_wildcards.asp
Take a look at this manual at w3schools about wildcards to use with your SELECT code.
But also you will need to create a new table with 3 columns: LeftCode, RightCode and WildCard.
Example:
Rows on Table:
LeftCode = SG | RightCode = 35s | WildCard = SG-s_s35s
LeftCode = SG | RightCode = 64s | WildCard = SG-s_t64s
SQL Code
If the user typed the code that matches the row1 of the table:
SELECT * FROM PRODUCTS WHERE CODE LIKE "$WildCard";
Where $WildCard is the PHP variable containing the column 3 of the new table.
I hope I helped, even 4 years late...