I have the following rows in my table
COL1                  EXTRA    DOUBLE    TEST
12 TEST
123 EXTRA
125 EXTRA 95 DOUBLE
EXTRA 45 99 DOUBLE
I am using regular expressions to filter out the rows and move them appropriately to different columns. So:
For the first row, I want 12 to be extracted and put in column TEST.
For the second row, I want 123 to be extracted and put in column EXTRA.
For the third row, I want 125 to be extracted and put in column EXTRA.
I want to ignore 95.
For the last row, I want 45 to be extracted and put in column EXTRA.
I can extract the values and put them in the appropriate columns through my query. I am using this regular expression for extracting the values:
'%[0-9]%[^A-Z]%[0-9]%'
The problem with this regular expression is that it extracts 12, but it does not extract 123 from the second row. If I change the regular expression to:
'%[0-9]*%[^A-Z]%[0-9]%'
then it extracts 123, but for the third row it concatenates 125 with 95, so I get 12595. Is there any way I can avoid 95 and just get the value 125? If I remove the star, it does not do any concatenation.
Any help will be appreciated. I posted this question before, but some of you were asking for more explanation so I posted a new question for that.
I believe the regex you are looking for is below. It captures a run of digits, then a character that is not an uppercase letter, and then atomically consumes any digits that follow so they are left out of the capture. Note, though, that when you write %pattern%pattern%... in SQL, I believe each wildcard segment is evaluated separately, so I am not sure about the nuances of regex support in SQL. However, if you run this against rubular.com it seems to solve the problem you are describing. Hopefully it can be of some use in your search :)
([0-9]*)([^A-Z])(?>[0-9]*)
However, I did just look at your other examples where the letters come first, and this would not handle those. But maybe it can still be of use to you.
SQL Server does not natively support regex. It does support some limited pattern matching through LIKE and PATINDEX.
If you really want to use regex inside of SQL Server, you can use a .NET language like C# to create a CLR assembly and import that into SQL Server, but that comes with a number of drawbacks. If you want to use regex, the better way is to have an application that runs on top of SQL Server. That application could be written in any language that can talk to SQL Server over ODBC, such as C# or Python; in fact, in an intro article on Simple-Talk I talk about interfacing Python with SQL Server to use regex.
But the patterns you provide use SQL Server's more limited pattern-matching capabilities rather than regex, so that seems to be what you want. There is a full description in Pattern Matching in Search Conditions.
As for solving your particular problem, you don't seem to have one particular pattern but several possible patterns anyway. That type of situation is almost impossible to handle with a single SQL Server pattern, and the regex logic gets unnecessarily complicated too. So, if I were in your position, I would not try to create a single pattern but a series of cases, and then extract the number you need based on that.
Assuming this is SQL 2005 (or later I guess... I can only speak for 2005), and all different permutations of COL1 data are in your question:
UPDATE NameOfYourTable
SET TEST = SUBSTRING(Col1, 0, LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] TEST%', Col1) - 1))
WHERE COL1 LIKE '%[0-9] TEST%'
UPDATE NameOfYourTable
SET EXTRA = SUBSTRING(Col1, 0, LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] EXTRA%', Col1) - 1))
WHERE COL1 LIKE '%[0-9] EXTRA%'
UPDATE NameOfYourTable
SET EXTRA = SUBSTRING(Col1, PATINDEX('%[0-9]%', Col1), LEN(Col1) - (LEN(Col1) - PATINDEX('%[0-9] [0-9]%', Col1) + LEN('EXTRA ')))
WHERE COL1 LIKE 'EXTRA [0-9]%'
Somehow though, I really don't think this is going to resolve your problem. I would strongly advise you to make sure this will catch all the cases you need to handle by running this on some test data.
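To act on that advice, a throwaway copy of the sample data (table and column names mirror the statements above; this is just a sketch) makes it easy to eyeball the results first:
CREATE TABLE #NameOfYourTable (Col1 varchar(50), EXTRA varchar(20), [DOUBLE] varchar(20), TEST varchar(20));

INSERT INTO #NameOfYourTable (Col1) VALUES ('12 TEST');
INSERT INTO #NameOfYourTable (Col1) VALUES ('123 EXTRA');
INSERT INTO #NameOfYourTable (Col1) VALUES ('125 EXTRA 95 DOUBLE');
INSERT INTO #NameOfYourTable (Col1) VALUES ('EXTRA 45 99 DOUBLE');

-- ...run the three UPDATE statements above against #NameOfYourTable...

SELECT Col1, EXTRA, TEST FROM #NameOfYourTable;  -- expect 12 in TEST; 123, 125 and 45 in EXTRA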
If you have a lot of different cases to handle, then the better alternative I think would be to make a small console program in something like C# (that has much better RegExp support) to sift through your data and apply updates that way. Trying to handle numerous permutations of COL1 data is going to be a nightmare in SQL.
Also read these on LIKE, PATINDEX and their (limited) pattern-matching abilities:
LIKE: http://msdn.microsoft.com/en-us/library/ms179859(v=sql.90).aspx
PATINDEX: http://msdn.microsoft.com/en-us/library/ms188395(v=sql.90).aspx
Related
So I'm working with the Perl-flavored SQL regex functions (regexp_like) and need to combine two tables based on pattern matching. One item might be 'Mammogram-bilateral' while the other may be 'bilateral mammogram scan'. I really need help with matching 9 out of 10 words; if that doesn't work, then 8 out of 10 and so on, or the most alike characters (or words) in both tables.
I really need help getting the base of this going; the rest I could clean up myself (for instance, 'removal of pacemaker' needs to be different from 'insertion of pacemaker', but I understand fixing that might be a big task). The problem I'm having is getting rows like that to match on just a regexp_like(x, y, 'i') type join.
I'm a newbie to regular expressions; I have spent hours searching for help but can't find anything. Sorry if I missed something.
UPDATE: Okay, so for clarification, I've currently got this running:
regexp_substr(x.concept_name, '\w+\b', 1, 1) = regexp_substr(y.concept_name, '\w+\b', 1, 1)
AND regexp_substr(x.concept_name, '\w+\b', 1, 2) = regexp_substr(y.concept_name, '\w+\b', 1, 2)
AND regexp_substr(x.concept_name, '\w+\b', 1, 3) = regexp_substr(y.concept_name, '\w+\b', 1, 3)
and so on...
So that matches the first 3 words (it also just occurred to me to use a whitespace count as a filter, in this case requiring 3 whitespaces). I would basically need to do 1,2 = (1,1 || 1,2 || 1,3) and so on forever; some of these have over 100 whitespaces though...
So regexp_like doesn't quite work, but I'm trying to find a regexp_substr() approach that does.
Update 2: Levenshtein distance might help for some of these, but I would need to find the shortest distance between the two strings; I'm not aware of a way to do that though.
Update 4:
'Spinal Fusion' = 'Fusion of the Spine'
'Mammogram-bilateral' = 'bilateral mammogram scan'
'Echocardiogram (ECG)' = 'ECG'
Update 5: I actually use regexp_ilike(x, y), but regexp_like() seems to be more common. It's Vertica SQL syntax, which uses PCRE (Perl).
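For reference, the word-by-word comparison above can be made order-independent with the same functions. A rough sketch (table and column names are placeholders; rows with fewer than three words won't match, since REGEXP_SUBSTR returns NULL for a missing word):
-- Each of the first three words of x.concept_name must appear somewhere in y.concept_name, in any order.
SELECT x.concept_name, y.concept_name
FROM   table_x x
JOIN   table_y y
  ON  REGEXP_LIKE(y.concept_name, '\b' || REGEXP_SUBSTR(x.concept_name, '\w+', 1, 1) || '\b', 'i')
 AND  REGEXP_LIKE(y.concept_name, '\b' || REGEXP_SUBSTR(x.concept_name, '\w+', 1, 2) || '\b', 'i')
 AND  REGEXP_LIKE(y.concept_name, '\b' || REGEXP_SUBSTR(x.concept_name, '\w+', 1, 3) || '\b', 'i');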
First of all, before detailing the problem I'm dealing with, let me tell you that I'm currently an SQL newbie, so whenever possible I'll appreciate plain explanations and simple solutions. Here's what I have:
Given this query:
SELECT
table1.id as id,
table1.tag1 as tag1,
table2.tag2 as tag2,
table2.tag2 like '%' + table1.tag1 + '%' as match
FROM table1
INNER JOIN table2
ON table1.id = table2.id
I'm getting this table:
id   tag1        tag2        match
1    ice cream   ice-cream   false
2    sweets      sweets      true
3    bakery      bakery      true
4    sweets      ice-cream   false
The problem I want to solve is that I'd like the "match" column to treat similar words, like those in the first row, as "true". Therefore, in my desired output, I'd like that cell to be "true" instead of "false".
Thanks in advance.
"Similar" can be estimated in a fair few ways. Some methods are...elaborate. A good place to start is with "edit distance." This is also named "Levenshtein distance" after its creator. The idea is pretty easy to understand, and the results make sense. (It also sounds pretty much like what you're asking for.) While there are variations, the basic idea is to count up how many characters you need to change to convert one string into another. So "ice cream" to "ice-cream" requires one change. That's close. "ice cream" to "nice dream" takes a lot more changes. You can look up the algorithm and find plenty about it with good examples. Closer to home, spell-checkers have traditionally had this algorithm in their bag of tricks. That's one way that they can suggest similar words with different roots.
Levenshtein is not enabled in Postgres by default, but it is included in a standard extension named "fuzzystrmatch":
https://www.postgresql.org/docs/current/fuzzystrmatch.html
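A minimal sketch of what that looks like in practice (the literals come from the question's table):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;   -- once per database

SELECT levenshtein('ice cream', 'ice-cream');   -- 1: a single character differs
SELECT levenshtein('ice cream', 'nice dream');  -- 2: one insertion plus one substitution

-- and in the query from the question, something like:
--   levenshtein(table2.tag2, table1.tag1) <= 1 AS match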
That extension also includes some "phonetic" matching algorithms, which aren't so much what it sounds like you're after. Depending on how you're deploying, there's another extension with a bunch of fuzzy string matching tools but, honestly, I'd start with Levenshtein anyway.
https://github.com/eulerto/pg_similarity
If you end up on RDS, pg_similarity is supported.
Other suggestions you'll likely hear include LIKE, regex, and trigrams (great, but a bit more involved).
Fuzzy string matching is a big subject, and super interesting. If you pursue this further, it will help to know what kind of record counts you're dealing with, how long your strings are (shorter strings are harder to fuzzy compare as there's not as much to work with), your Postgres version, etc.
You need to decide exactly what you want as a match. Let me assume that a space can match any character. Then use:
table2.tag2 like '%' + replace(table1.tag1, ' ', '_') + '%' as match
Alternatively, you might want to remove all spaces and hyphens for the comparison:
replace(replace(table2.tag2, ' ', ''), '-', '') like '%' + replace(replace(table1.tag1, ' ', ''), '-', '') + '%'
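Dropped into the query from the question, that might look like this (a sketch, assuming the same dialect as the original query):
SELECT
    table1.id   AS id,
    table1.tag1 AS tag1,
    table2.tag2 AS tag2,
    -- strip spaces and hyphens on both sides before comparing
    replace(replace(table2.tag2, ' ', ''), '-', '')
        LIKE '%' + replace(replace(table1.tag1, ' ', ''), '-', '') + '%' AS match
FROM table1
INNER JOIN table2
    ON table1.id = table2.id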
Morris's answer is good for Postgres. If you are on Redshift, you can create a Python UDF for fuzzy matching that takes 2 strings as input and returns either a binary judgement or some measure of similarity between the strings. Here is a good example of a Levenshtein algorithm implementation with a Python UDF: Periscope community thread
The function returns the string "distance" between two words (how many characters are different).
You can use the output as levenshtein(table2.tag2,table1.tag1)<=1 as match
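If you go that route, such a UDF might look roughly like this (a sketch; it assumes Redshift's plpythonu scalar UDFs, and the function name f_levenshtein is a placeholder):
-- Scalar Python UDF computing Levenshtein distance between two strings.
CREATE OR REPLACE FUNCTION f_levenshtein(a varchar, b varchar)
RETURNS int
IMMUTABLE
AS $$
    if a is None or b is None:
        return None
    prev = range(len(b) + 1)                          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]
$$ LANGUAGE plpythonu;

-- e.g.  SELECT ..., f_levenshtein(table2.tag2, table1.tag1) <= 1 AS match FROM ...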
I have tried looking for answers online, but I am lacking the right nomenclature to find any answers matching my question.
The DB I am working with is an inconsistent mess. I am currently trying to import a number of maintenance codes which I have to link to a pre-existing Excel table. For this reason, the maintenance codes I import have to be very universal.
The table is designed to work with a 2-3 digit number (a time length), followed by a time unit.
For example, SERV-01W and SERV-03M.
As these used to be added to the DB by hand, a large number of older maintenance codes are actually written with 1 digit numbers.
For example, SERV-1W and SERV-3M.
I would like to replace the old codes by the new codes. In other words, I want to add a leading 0 if only one digit is used in the code.
REPLACE(T.Code,'-[0-9][DWM]','-0[0-9][DWM]') unfortunately does not work, most likely because I am using wildcards in the result string.
What would be a good way of handling this issue?
Thank you in advance.
Assuming I understand your requirement this should get you what you are after:
WITH VTE AS(
SELECT *
FROM (VALUES('SERV-03M'),
('SERV-01W'),
('SERV-1Q'),
('SERV-4X')) V(Example))
SELECT Example,
ISNULL(STUFF(Example, NULLIF(PATINDEX('%-[0-9][A-z]%',Example),0)+1,0,'0'),Example) AS NewExample
FROM VTE;
Instead of trying to replace the pattern, I used PATINDEX to find the pattern and then inject the extra '0' character. If the pattern wasn't found, so 0 was returned by PATINDEX, I forced the expression to return NULL and then wrapped the entire thing with a further ISNULL, so that the original value was returned.
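If the goal is to rewrite the stored codes rather than just select them, the same expression can drive an UPDATE; a sketch, with placeholder table and column names:
UPDATE dbo.MaintenanceCodes
SET    Code = STUFF(Code, PATINDEX('%-[0-9][A-Z]%', Code) + 1, 0, '0')
WHERE  Code LIKE '%-[0-9][A-Z]%';   -- only the one-digit codes match this pattern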
I find a simple CASE expression to be a simple way to express the logic:
SELECT (CASE WHEN code LIKE '%-[0-9][0-9]%'
             THEN code
             ELSE REPLACE(code, '-', '-0')
        END)
That is, if the code has two digits, then do nothing. Otherwise, add a zero. The code should be quite clear on what it is doing.
This is not generalizable (it doesn't add two zeros for instance), but it does do exactly what you are asking for.
I have to write a SELECT statement matching the following pattern:
[A-Z][0-9][0-9][0-9][0-9][A-Z][0-9][0-9][0-9][0-9][0-9]
The only thing I'm sure of is that the first A-Z WILL be there. All the rest is optional and the optional part is the problem. I don't really know how I could do that.
Some example data:
B/0765/E 3
B/0765/E3
B/0764/A /02
B/0749/K
B/0768/
B/0784//02
B/0807/
My guess is that I'd best remove all the whitespace and the / characters in the data and then execute the select statement. But I'm having some problems actually writing the LIKE pattern... Anyone who could help me out?
The underlying reason for this is that I'm migrating a database. In the old database the values are just in 1 field, but in the new one they are split into several fields, and I first have to write a "control script" to find which records in the old database are not correct.
Even the following isn't working:
where someColumn LIKE '[a-zA-Z]%';
You can use regular expressions via XQuery to define this pattern. There are many questions on Stack Overflow about pattern matching in DB2 that have been solved with regular expressions.
DB2: find field value where first character is a lower case letter
Emulate REGEXP like behaviour in SQL
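A rough, untested sketch of that approach (the exact regex depends on which partial forms you want to accept, and the linked answers show the precise DB2 syntax; the table and column names are placeholders):
-- fn:matches() applies a real regular expression, unlike LIKE.
SELECT *
FROM   someTable
WHERE  XMLEXISTS(
         'fn:matches($C, "^[A-Z]([0-9]{1,4}([A-Z][0-9]{0,5})?)?$")'
         PASSING someColumn AS "C"
       );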
I have a SQL Server table which contains several string columns. I need to write an application which gets a string and searches for similar strings in the SQL Server table.
For example, if I give the "مختار" or "مختر" as input string, I should get these from SQL table:
1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری
I've searched the net for a solution but have found nothing useful. I've read this question, but it will not help me because:
I am using MS SQL Server not MySQL
my table contents are in Persian, so I can't use Levenshtein distance and similar methods
I prefer an SQL Server only solution, not an indexing or daemon based solution.
The best solution would be one which helps us sort results by similarity, but that's optional.
Do you have any suggestion for that?
Thanks
MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?
SELECT * FROM table WHERE input LIKE N'%مختار%'
Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?
--This part searches for a string you want
declare @MyString varchar(max)
set @MyString = (Select column from table
where **LOGIC TO FIND THE STRING GOES HERE**)
--This part searches for that string
select searchColumn, ABS(Len(searchColumn) - Len(@MyString)) as Similarity
from table where searchColumn LIKE '%' + @MyString + '%'
Order by Similarity, searchColumn
The similarity part is something like what you posted: if the strings are "more similar", meaning they have a similar length, they will appear higher in the results.
The ABS() part could obviously be left out, but I did it just in case.
Hope that helps =-)
Besides the LIKE operator, you can use the condition WHERE CHARINDEX(search, columnname) > 0; however, this is generally slower. CHARINDEX returns the starting position of one string within another, so searching ABCDEFG for CD returns 3, and 3 > 0 means the record is returned. In the case you've described, though, LIKE seems to be the best solution.
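For example (the N'' prefix keeps the Persian text as Unicode; the column name is a placeholder):
SELECT * FROM table WHERE CHARINDEX(N'مختار', searchColumn) > 0;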
The general problem is that in languages where the same letter has a different written form at the beginning, middle, and end of a word, and thus different character codes, we can try specific Persian collations, but in general this will not help.
The second option is to use SQL Server's full-text search abilities, but again, if there is no special language module for the language, it is much less useful.
The most general way is to do your own language processing, which is a very complex task. These keywords and Google can help you understand the size of the problem: NLP, words and terms, bi-grams, n-grams, grammar and morphological inflection.
Try the built-in SOUNDEX() and DIFFERENCE() functions. I hope they work fine for Persian.
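A quick sketch of what that looks like (table and column names are placeholders; both functions are built around English phonetics, so results on Persian text may not be meaningful):
-- DIFFERENCE() compares the SOUNDEX codes of two values: 4 = most similar, 0 = least.
SELECT name,
       DIFFERENCE(name, N'مختار') AS similarity
FROM   dbo.People
ORDER BY similarity DESC;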
Look at the following reference:
http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/
The Similarity() function helps you sort results by similarity (as you asked in your question), and it can also use algorithms other than Levenshtein edit distance, depending on the value of the @method parameter:
0 The Levenshtein edit distance algorithm
1 The Jaccard similarity coefficient algorithm
2 A form of the Jaro-Winkler distance algorithm
3 Longest common subsequence algorithm
The LIKE operator may not do what he is asking for. For example, if I have the record value 'please, i want to ask a question' in my database, and in my query I want to find a similar match such as 'Can i ask a question, please', LIKE can only handle this with patterns like '%[your sentence]' or '[your sentence]%'. It is not advisable to use it for string similarity, because sentences may change and your LIKE logic may not fetch all the matching records. It is advisable to use naive Bayes text classification, assigning labels to your sentences, or you can try the semantic search feature in SQL Server.