Replace function, keep unknown substrings/wildcards - sql

I have tried looking for answers online, but I am lacking the right nomenclature to find any answers matching my question.
The DB I am working with is an inconsistent mess. I am currently trying to import a number of maintenance codes which I have to link to a pre-existing Excel table. For this reason, the maintenance codes I import have to follow one uniform format.
The table is designed to work with a 2-3 digit number (a length of time), followed by a time unit.
For example, SERV-01W and SERV-03M .
As these used to be added to the DB by hand, a large number of older maintenance codes are actually written with 1 digit numbers.
For example, SERV-1W and SERV-3M.
I would like to replace the old codes with the new codes. In other words, I want to add a leading 0 if only one digit is used in the code.
REPLACE(T.Code,'-[0-9][DWM]','-0[0-9][DWM]') unfortunately does not work, most likely because I am using wildcards in the result string.
What would be a good way of handling this issue?
Thank you in advance.

Assuming I understand your requirement this should get you what you are after:
WITH VTE AS(
    SELECT *
    FROM (VALUES('SERV-03M'),
                ('SERV-01W'),
                ('SERV-1Q'),
                ('SERV-4X')) V(Example))
SELECT Example,
       ISNULL(STUFF(Example, NULLIF(PATINDEX('%-[0-9][A-z]%',Example),0)+1,0,'0'),Example) AS NewExample
FROM VTE;
Instead of trying to replace the pattern, I used PATINDEX to find the pattern and then inject the extra '0' character. If the pattern wasn't found, so 0 was returned by PATINDEX, I forced the expression to return NULL and then wrapped the entire thing with a further ISNULL, so that the original value was returned.
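For comparison outside the database, the same find-and-insert idea can be sketched in Python. This is only an illustration and not part of the T-SQL answer: the helper name pad_code is hypothetical, and the unit letters are assumed to be limited to D, W and M as in the question.

```python
import re

# Hypothetical helper: a hyphen followed by exactly one digit and then a
# unit letter (D, W or M, as in the question) gets a '0' inserted.
_SINGLE_DIGIT = re.compile(r"-(\d)(?=[DWM])")

def pad_code(code: str) -> str:
    # The captured digit is written back after the inserted '0'.
    # Codes that already have two digits do not match and are returned as-is.
    return _SINGLE_DIGIT.sub(r"-0\1", code)

for example in ["SERV-03M", "SERV-01W", "SERV-1W", "SERV-3M"]:
    print(example, "->", pad_code(example))
```

As with the PATINDEX version, a value that does not match the pattern comes back unchanged, so no separate "not found" branch is needed.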

I find a CASE expression to be a simple way to express the logic:
SELECT (CASE WHEN code LIKE '%-[0-9][0-9]%'
             THEN code
             ELSE REPLACE(code, '-', '-0')
        END)
That is, if the code has two digits, then do nothing. Otherwise, add a zero. The code should be quite clear on what it is doing.
This is not generalizable (it doesn't add two zeros for instance), but it does do exactly what you are asking for.
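If a more general version were needed (padding to an arbitrary width rather than adding a single zero), the logic could look roughly like the following Python sketch. The function name pad_number and the assumption that the number sits between the last hyphen and a single trailing unit letter are mine, not the answer's.

```python
import re

def pad_number(code: str, width: int = 2) -> str:
    # Zero-pad the digits between the hyphen and the trailing unit letter
    # to `width` characters; longer numbers are left untouched.
    def repl(match):
        return "-" + match.group(1).zfill(width) + match.group(2)
    return re.sub(r"-(\d+)([A-Za-z])$", repl, code)

print(pad_number("SERV-1W"))      # default width of 2
print(pad_number("SERV-1W", 3))   # pads with two zeros
```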

Related

SQL Regexp like pattern match on any combination of words

So I'm working in the Perl SQL flavor with regexp_like and need to combine two tables based on pattern matching. One item might be 'Mammogram-bilateral'; the other may be 'bilateral mammogram scan'. I really need help with matching to get 9 out of 10 words matched; if that doesn't work, then 8 out of 10 and so on, or the most like characters (or words) in both tables.
I really need help getting the base of this going; the rest I could clean up myself (for instance, I need 'removal of pacemaker' to be different from 'insertion of pacemaker', but I understand fixing that might be a big task). The problem I'm having is in getting rows like that to match on just a regexp_like(x,y,'i') type join.
Newbie to regular expressions. I have spent hours searching for help but can't find anything; sorry if I missed something.
UPDATE: Okay, for clarification, I've currently got this running:
regexp_substr(x.x.concept_name,'\w+\b',1,1) = regexp_substr(y.x,'\w+\b',1,1)
AND regexp_substr (x.x,'\w+\b',1,2) = regexp_substr (x.y,'\w+\b',1,2)
AND regexp_substr (x.x,'\w+\b',1,3) = regexp_substr (x.y,'\w+\b',1,3)
and so on...
So that matches the first 3 words (it also just occurred to me to do a whitespace count as a filter, in this case if it has 3 whitespaces). I would basically need to do 1,2 = (1,1||1,2||1,3) and so on forever; some of these have over 100 whitespaces though...
So regexp_like doesn't quite work, but I'm trying to find a regexp_substr() that does.
Update 2: Levenshtein distance might help for some of these, but I would need to find the shortest distance between those two; I'm not aware of a way to do that though.
update 4:
'Spinal Fusion' = 'Fusion of the Spine'
'Mammogram-bilateral' = 'bilateral mammogram scan'
'Echocardiogram (ECG)' = 'ECG'
update 5: I actually use regexp_ilike(x,y), but regexp_like() seems to be more common. It's Vertica SQL syntax, which uses PCRE (Perl).
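To make the "n out of m words matched" idea concrete, here is a minimal Python sketch of an order-independent word-overlap score. It is not Vertica syntax and not a proposed answer; the function names words and overlap, and the choice to divide by the shorter item's word count, are assumptions made for illustration.

```python
import re

def words(text: str) -> set:
    # Lower-case the text and split on anything that is not a letter or digit,
    # so 'Mammogram-bilateral' yields {'mammogram', 'bilateral'}.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a: str, b: str) -> float:
    # Share of the shorter item's words that also occur in the other item.
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

print(overlap("Mammogram-bilateral", "bilateral mammogram scan"))
print(overlap("removal of pacemaker", "insertion of pacemaker"))
```

Word order is ignored, which handles the mammogram pair, but 'removal of pacemaker' and 'insertion of pacemaker' still share two of three words, and 'Spinal' versus 'Spine' counts as no match at all. Both limits mirror the concerns raised in the question.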

DB2 complex like

I have to write a select statement matching the following pattern:
[A-Z][0-9][0-9][0-9][0-9][A-Z][0-9][0-9][0-9][0-9][0-9]
The only thing I'm sure of is that the first A-Z WILL be there. All the rest is optional and the optional part is the problem. I don't really know how I could do that.
Some example data:
B/0765/E 3
B/0765/E3
B/0764/A /02
B/0749/K
B/0768/
B/0784//02
B/0807/
My guess is that I had best remove all the white space and the / characters from the data and then execute the select statement. But I'm having some problems writing the LIKE pattern. Could anyone help me out?
The underlying reason for this is that I'm migrating a database. In the old database the values are in a single field, but in the new one they are split into several fields. I first have to write a "control script" to find out which records in the old database are not correct.
Even the following isn't working:
where someColumn LIKE '[a-zA-Z]%';
You can use regular expressions via XQuery to define this pattern. There are many questions on Stack Overflow about patterns in DB2, and they have been solved with regular expressions:
DB2: find field value where first character is a lower case letter
Emulate REGEXP like behaviour in SQL
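As a rough illustration of what such a regular expression could look like, here is a Python sketch of the control check (not DB2 syntax). It assumes the cleanup step proposed in the question (dropping whitespace and slashes) and assumes that each part after the first letter may be shortened or missing independently. The names PATTERN, normalize and is_valid are hypothetical.

```python
import re

# One required letter, then up to 4 digits, an optional letter, up to 5 digits.
PATTERN = re.compile(r"[A-Z][0-9]{0,4}[A-Z]?[0-9]{0,5}")

def normalize(value: str) -> str:
    # Drop whitespace and slashes, as the question proposes.
    return re.sub(r"[\s/]", "", value)

def is_valid(value: str) -> bool:
    # fullmatch requires the whole cleaned value to fit the pattern.
    return PATTERN.fullmatch(normalize(value)) is not None

samples = ["B/0765/E 3", "B/0764/A /02", "B/0749/K", "B/0768/", "B/0784//02", "0765/E"]
for sample in samples:
    print(repr(sample), is_valid(sample))
```

If the digit groups must be either complete or absent, the quantifiers would change to optional groups such as `([0-9]{4})?`; the question does not say which reading is intended.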

using regular expression in sql

I have the following rows in my table
COL1 EXTRA DOUBLE TEST
12 TEST
123 EXTRA
125 EXTRA 95 DOUBLE
EXTRA 45 99 DOUBLE
I am using regular expressions to filter out the rows and move them appropriately to different columns. So:
For the first row, I want 12 to be extracted and put in column TEST.
For the second row, I want 123 to be extracted and put in column EXTRA.
For 3rd row, I want 125 to be extracted and put in column EXTRA.
I want to ignore 95.
For the last row, I want 45 to be extracted and put in column EXTRA.
I can extract the values and put them in appropriate columns through my query, I am using this regular expression for extracting the values:
'%[0-9]%[^A-Z]%[0-9]%'
The problem with this regular expression is that it extracts 12, but it does not extract 123 from the second row. If I change the regular expression to:
'%[0-9]*%[^A-Z]%[0-9]%'
then it extracts 123, but for the third row it concatenates 125 with 95, so I get 12595. Is there any way I can avoid 95 and just get the value 125? If I remove the star, then it does not do any concatenation.
Any help will be appreciated. I posted this question before, but some of you were asking for more explanation so I posted a new question for that.
I believe that the regex you are looking for is below. It matches a run of digits, followed by a single character that is not an uppercase letter, followed by a further run of digits that is matched but not captured. However, I believe that when you use %regex%regex%..., each piece is evaluated separately, so I am not sure about the nuances of regex in SQL. If you run it against rubular.com, it seems to solve the problem you are asking about. Hopefully it can be of some use in your regex search :)
([0-9]*)([^A-Z])(?>[0-9]*)
However, I did just look at your other examples where the letters come first, and it would not work there. But maybe this can still be of use to you.
SQL Server does not natively support regex. It does support some limited pattern matching through LIKE and PATINDEX.
If you really want to use regex inside SQL Server, you can use a .NET language like C# to create a CLR function and import that into SQL Server, but that comes with a number of drawbacks. If you want to use regex, the better way is to have an application that runs on top of SQL Server. That application could be written in any language that can talk to ODBC, like C# or Python; in fact, in an intro article on Simple-Talk I talk about interfacing Python with SQL Server to use regex.
But the patterns you provide use SQL Server's more limited pattern-matching capabilities rather than regex, so that seems to be what you want. There is a full description at Pattern Matching in Search Conditions.
As for solving your particular problem, you don't seem to have one particular pattern but several possible patterns anyway. That type of situation is almost impossible to handle with a single SQL Server pattern, and the regex logic gets unnecessarily complicated too. So, if I were in your position, I would not try to create a single pattern but a series of cases, and then extract the number you need based on that.
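The "series of cases" idea can be sketched in application code as follows. This is a minimal Python illustration covering only the four sample rows from the question; the CASES list and the extract function are hypothetical names, and real data would likely need more cases.

```python
import re

# Ordered (pattern, target column) cases; the first pattern that matches wins.
CASES = [
    (re.compile(r"^(\d+) TEST"), "TEST"),    # '12 TEST'
    (re.compile(r"^(\d+) EXTRA"), "EXTRA"),  # '123 EXTRA', '125 EXTRA 95 DOUBLE'
    (re.compile(r"^EXTRA (\d+)"), "EXTRA"),  # 'EXTRA 45 99 DOUBLE'
]

def extract(col1: str):
    # Return (column name, number as text), or None when no case applies.
    for pattern, column in CASES:
        match = pattern.search(col1)
        if match:
            return column, match.group(1)
    return None

for row in ["12 TEST", "123 EXTRA", "125 EXTRA 95 DOUBLE", "EXTRA 45 99 DOUBLE"]:
    print(row, "->", extract(row))
```

Because each case captures only the first run of digits, the trailing 95 and 99 are never concatenated onto the result.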
Assuming this is SQL 2005 (or later I guess... I can only speak for 2005), and all different permutations of COL1 data are in your question:
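-- Rows like '12 TEST': take everything up to the digit before ' TEST'
UPDATE NameOfYourTable
SET TEST = SUBSTRING(Col1, 0, LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] TEST%', Col1) - 1))
WHERE COL1 LIKE '%[0-9] TEST%'
-- Rows like '123 EXTRA' or '125 EXTRA 95 DOUBLE': same idea for ' EXTRA'
UPDATE NameOfYourTable
SET EXTRA = SUBSTRING(Col1, 0, LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] EXTRA%', Col1) - 1))
WHERE COL1 LIKE '%[0-9] EXTRA%'
-- Rows like 'EXTRA 45 99 DOUBLE': take the first number after 'EXTRA '
-- (LEN ignores trailing spaces, so the result may carry one trailing space)
UPDATE NameOfYourTable
SET EXTRA = SUBSTRING(Col1, PATINDEX('%[0-9]%', Col1), LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] [0-9]%', Col1) + LEN('EXTRA ')))
WHERE COL1 LIKE 'EXTRA [0-9]%'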
Somehow though, I really don't think this is going to resolve your problem. I would strongly advise you to make sure this will catch all the cases you need to handle by running this on some test data.
If you have a lot of different cases to handle, then the better alternative I think would be to make a small console program in something like C# (that has much better RegExp support) to sift through your data and apply updates that way. Trying to handle numerous permutations of COL1 data is going to be a nightmare in SQL.
Also read these on LIKE, PATINDEX and their (limited) pattern-matching abilities:
LIKE: http://msdn.microsoft.com/en-us/library/ms179859(v=sql.90).aspx
PATINDEX: http://msdn.microsoft.com/en-us/library/ms188395(v=sql.90).aspx

How to find strings which are similar to given string in SQL server?

I have a SQL Server table which contains several string columns. I need to write an application which gets a string and searches for similar strings in the SQL Server table.
For example, if I give the "مختار" or "مختر" as input string, I should get these from SQL table:
1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری
I've searched the net for a solution but I have found nothing useful. I've read this question, but this will not help me because:
I am using MS SQL Server not MySQL
my table contents are in Persian, so I can't use Levenshtein distance and similar methods
I prefer an SQL Server only solution, not an indexing or daemon based solution.
The best solution would be one which helps us sort the results by similarity, but it's optional.
Do you have any suggestions for that?
Thanks
MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?
SELECT * FROM table WHERE input LIKE N'%مختار%'
Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?
--This part searches for a string you want
--(nvarchar so that Persian text is preserved)
DECLARE @MyString nvarchar(max)
SET @MyString = (SELECT column FROM table
                 WHERE **LOGIC TO FIND THE STRING GOES HERE**)
--This part searches for that string
SELECT searchColumn, ABS(LEN(searchColumn) - LEN(@MyString)) AS Similarity
FROM table WHERE data LIKE '%' + @MyString + '%'
ORDER BY Similarity, searchColumn
The similarity part is something like the thing you posted. Strings whose length is closer to that of the search string count as "more similar" and appear higher in the query results.
The absolute part can be avoided obviously but I did it just in case.
Hope that helps =-)
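The ranking idea (filter by substring, then order by difference in length) can be tried out quickly in Python. This is a sketch for illustration, not SQL Server code; the function name rank_by_length and the sample list are assumptions, with the names taken from the question.

```python
def rank_by_length(search: str, candidates: list) -> list:
    # Keep candidates containing the search string (the LIKE '%...%' step),
    # then order by absolute length difference, ties broken alphabetically.
    hits = [c for c in candidates if search in c]
    return sorted(hits, key=lambda c: (abs(len(c) - len(search)), c))

names = ["شهاب الدین مختاری", "مختاری", "شهاب مختاری", "احمدی"]
for name in rank_by_length("مختار", names):
    print(name)
```

Note that this only finds exact substrings: searching for "مختر", the second input from the question, returns nothing, so a substring filter alone does not cover that case.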
Besides the LIKE operator, you can use the condition WHERE instr(columnname, search) > 0; however, this is generally slower. (INSTR exists in MySQL and Oracle; the SQL Server counterpart is CHARINDEX.) It returns the starting position of one string within another. Thus, if searching in ABCDEFG for CD, it would return 3; 3 > 0, so the record would be returned. However, in the case you've described, LIKE seems to be the best solution.
The general problem is that in languages where the same letter has different written forms at the beginning, in the middle and at the end of a word, and thus different codes, we can try to use specific Persian collations, but in general this will not help.
The second option is to use SQL Server's full-text search (FTS) abilities, but again, if it has no special language module for the language, it is much less useful.
And the most general way is to use your own language processing, which is a very complex task. The following keywords and Google can help you understand the size of the problem: DLP, words and terms, bigrams, n-grams, grammar and morphology inflection.
Try the built-in SOUNDEX() and DIFFERENCE() functions. I hope they work fine for Persian.
Look at the following reference:
http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/
The Similarity() function helps you sort results by similarity (as you asked in your question), and it can also use algorithms other than Levenshtein edit distance, depending on the value of the @method parameter:
0 The Levenshtein edit distance algorithm
1 The Jaccard similarity coefficient algorithm
2 A form of the Jaro-Winkler distance algorithm
3 Longest common subsequence algorithm
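To give a feel for what one of these measures computes, here is a minimal Python sketch of a Jaccard similarity coefficient over character bigrams. It is not the implementation behind Similarity(), whose internals are not described here; the function names bigrams and jaccard and the choice of bigrams as the unit are assumptions.

```python
def bigrams(text: str) -> set:
    # Set of adjacent character pairs, e.g. "abc" -> {"ab", "bc"}.
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard(a: str, b: str) -> float:
    # Size of the intersection divided by size of the union of the bigram sets.
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Strings shorter than two characters have no bigrams to compare.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)

print(jaccard("مختار", "مختاری"))
print(jaccard("مختر", "مختاری"))
```

Because it compares sets of character pairs rather than whole substrings, this measure still gives a non-zero score for "مختر" against "مختاری", and the scores can be used directly for sorting.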
The LIKE operator may not do what he is asking for. For example, say I have the record value 'please, I want to ask a question' in my database, and in my query I want to find a similar match for 'Can I ask a question, please'. LIKE can attempt this using '%[your sentence]' or '[your sentence]%', but it is not advisable to use it for string similarity, because sentences may change and your LIKE logic may not fetch the matching records. It is advisable to use naive Bayes text classification for similarities, assigning labels to your sentences, or you can try the semantic search feature in SQL Server.

MySQL: select the closest match?

I want to show the closest related item for a product. So say I am showing a product and the style number is SG-sfs35s. Is there a way to select whatever product's style number is closest to that?
Thanks.
EDIT: to answer your questions. Well, I definitely want to keep the first 2 letters, as that is the manufacturer code, but as for the part after the first dash, just whatever matches closest. So for example SG-sfs35s would match SG-shs35s much more than SG-sht64s. I hope this makes sense. Whenever I do LIKE product_style_number it only pulls the exact match.
There normally isn't a simple way to match product codes that are roughly similar.
A more SQL friendly solution is to create a new table that maps each product to all the products it is similar to.
This table would either need to be maintained manually, or a more sophisticated script can be executed periodically to update it.
If your product codes follow a consistent pattern (all the letters are the same for similar products, with only the numbers changing), then you should be able to use a regular expression to match the similar items. There are docs on this here...
It sounds like what you want is Levenshtein distance.
Unfortunately, there isn't a built-in Levenshtein function for MySQL, but some folks have come up with a user-defined function that does it (dead link).
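Since the linked function is no longer available, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance, together with a hypothetical helper closest that only compares codes sharing the manufacturer prefix before the dash. It illustrates the algorithm and is not the MySQL user-defined function the answer referred to.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest(code: str, candidates: list) -> str:
    # Only compare codes that share the manufacturer prefix before the dash.
    prefix = code.split("-", 1)[0]
    same_maker = [c for c in candidates
                  if c != code and c.split("-", 1)[0] == prefix]
    # min() raises ValueError when no candidate shares the prefix.
    return min(same_maker, key=lambda c: levenshtein(code, c))

print(closest("SG-sfs35s", ["SG-sht64s", "SG-shs35s", "AB-sfs35s"]))
```

With the codes from the question, SG-shs35s differs from SG-sfs35s by a single substitution, so it ranks ahead of SG-sht64s.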
You will probably want to do it as a stored procedure, as I expect that the algorithm may not be trivial.
For example, you may split the term at the -, so you have two parts. You do a LIKE query on each part and use that to make a decision.
You could just loop though, replacing the last character with "%" until you get at least one result, in your stored procedure.
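The loop described above can be sketched in Python as follows, with a prefix test standing in for LIKE 'prefix%'. The function name match_by_shrinking and the min_len parameter (which keeps at least the two-letter manufacturer code) are assumptions for illustration; in a stored procedure the same loop would issue one LIKE query per iteration.

```python
def match_by_shrinking(code: str, candidates: list, min_len: int = 2) -> list:
    # Try the full code first, then drop one trailing character at a time,
    # mimicking LIKE 'prefix%' with a shorter and shorter prefix.
    for end in range(len(code), min_len - 1, -1):
        prefix = code[:end]
        hits = [c for c in candidates if c != code and c.startswith(prefix)]
        if hits:
            return hits
    return []

products = ["SG-shs35s", "SG-sht64s", "AB-sfs35s"]
print(match_by_shrinking("SG-sfs35s", products))
```

For SG-sfs35s the first prefix with any hits is SG-s, which returns both SG codes without ranking them, so this approach favors matching leading characters and cannot tell that SG-shs35s is the closer of the two.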
Sounds like you need something like Lucene, though I'm not sure if that would be overkill for your situation. But it certainly would be able to do text searches and return the most similar ones first.
If you need something simpler, I would start by searching with the full product code; then, if that doesn't work, try using wildcards or removing some characters until you get a result.
JD Isaacks.
This situation of yours is very simple to solve.
It's not like you need to use artificial intelligence like Google does.
http://www.w3schools.com/sql/sql_wildcards.asp
Take a look at this manual at w3schools about wildcards to use with your SELECT code.
But also you will need to create a new table with 3 columns: LeftCode, RightCode and WildCard.
Example:
Rows on Table:
LeftCode = SG | RightCode = 35s | WildCard = SG-s_s35s
LeftCode = SG | RightCode = 64s | WildCard = SG-s_t64s
SQL Code
If the user typed the code that matches the row1 of the table:
SELECT * FROM PRODUCTS WHERE CODE LIKE "$WildCard";
Where $WildCard is the PHP variable containing the column 3 of the new table.
I hope I helped, even 4 years late...