Regular Expression puzzle - vb.net

In Visual Basic .NET:
Dim result As Match = Regex.Match(aStr, aMatchStr)
If result.Success Then
    Dim result0 As String = result.Groups(0).Value
    Dim result1 As String = result.Groups(1).Value
End If
With: aStr equal to (whitespace is normal space and there are seven spaces between n and ():
"AMEVDIEERPK + 7 Oxidation (M)"
Why does result1 become an empty string for aMatchStr equal to
"\s*(\d*).*?Oxidation\s+\(M\)"
but becomes "7" for aMatchStr equal to
"\s*(\d*)\s*Oxidation\s+\(M\)"
?
(result0 becomes equal to "AMEVDIEERPK + 7 Oxidation (M)")
(This is from MSQuant, MascotResultParser.vb, function modificationParseMatch()).

\s* Zero or more whitespace
(\d*) Zero or more digits (captured)
.*? Any characters (non-greedy, so up to the next match)
Oxidation Matches the word Oxidation
\s+\(M\) Matches one or more whitespace, then (M)
The problem here is that .*? matches zero or more of any characters before the word Oxidation, including digits. Because \d* is also allowed to match nothing, the match can start at the beginning of the string with an empty \d*, and .*? then eats the digits that you wanted \d* to capture.
\s*(\d*)\s*Oxidation\s+\(M\)
The difference here is that you are specifying whitespace only before the Oxidation. Not eating the digits.
Change the \d* to \d+ to catch the numbers

I think it's because the matching starts at the first character and moves on from there...
For your first regular expression:
Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*).*?Oxidation\s+\(M\)"? Yes... stop matching.
For your second regular expression:
Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
Does "MEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
Does "EVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
...
Does " 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? Yes
If for the first regular expression you'd used \d+ instead of \d* you'd have got a better result.
This is not exactly how regular expressions work, but you get the idea.
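The walk-through above can be checked with a small sketch. It uses Python's re module rather than .NET (for these particular constructs the two engines should behave the same), and it assumes single spaces in the input, as the string is displayed here. The dictionary keys are just illustrative labels.

```python
import re

s = "AMEVDIEERPK + 7 Oxidation (M)"

patterns = {
    "lazy": r"\s*(\d*).*?Oxidation\s+\(M\)",       # first pattern from the question
    "strict": r"\s*(\d*)\s*Oxidation\s+\(M\)",     # second pattern from the question
    "lazy_plus": r"\s*(\d+).*?Oxidation\s+\(M\)",  # \d+ instead of \d*
}

results = {}
for name, pattern in patterns.items():
    m = re.search(pattern, s)
    # Record where the match starts and what group 1 captured.
    results[name] = (m.start(), m.group(1))
    print(name, m.start(), repr(m.group(1)), repr(m.group(0)))
```

The lazy pattern succeeds at the very first position with an empty group 1, while the other two cannot succeed until the engine has advanced to the space before the 7.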

Thanks for the quick responses!
The numbers in the input are left out if there is only one
(peptide) modification instead of 7 as in the previous
example, e.g.:
"AMEVDIEERPK + Oxidation (M)"
and there would be no match if "\d+" was used. But maybe I
should use two regular expressions, one for each of these two
cases. This would increase the complexity of the program
somewhat (as I want to avoid memory garbage from
constructing regular expression for each string to be
matched), but is acceptable.
What I really wanted to do was to let the user specify a
match rule without requiring the rule to match from the
beginning of the (peptide) modification (that's why I tried
to introduce the non-greedy match).
Right now the user's rule is prepended with "\s*(\d*)\s*"
and the user must thus specify "Oxidation\s+\(M\)" to
match. Specifying e.g. "dation\s+\(M\)" will not work.
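A minimal sketch of the two situations described in this message, written in Python's re module (not .NET) and assuming single spaces in the input. It shows that \d+ finds nothing when the count is left out, and that a partial rule such as "dation" does match but leaves the captured count empty.

```python
import re

with_count = "AMEVDIEERPK + 7 Oxidation (M)"
without_count = "AMEVDIEERPK + Oxidation (M)"

PREFIX = r"\s*(\d*)\s*"  # the prefix that is prepended to the user's rule

required = re.compile(r"\s*(\d+)\s*Oxidation\s+\(M\)")
optional = re.compile(PREFIX + r"Oxidation\s+\(M\)")
partial = re.compile(PREFIX + r"dation\s+\(M\)")

# \d+ demands at least one digit, so a line without a count does not match.
no_match = required.search(without_count)

# \d* accepts the missing count and captures an empty string.
empty_count = optional.search(without_count).group(1)

# The partial rule matches in the middle of "Oxidation", far from the 7,
# so the count is lost.
partial_count = partial.search(with_count).group(1)

print(no_match, repr(empty_count), repr(partial_count))
```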

To answer your second message, you (or your user) can specify \w*dation\s+\(M\) to match either Oxidation (M) or Gradation (M) or dation (M).

With the syntax update, it seems we don't need to worry about the difference between \d+ and \d*. There's always a + sign present, even if there are no digits. Matching this + constrains the regex to the point that it works as expected:
"\s* // whitespace before +
\+ // The + sign itself
\s* // whitespace after +
(\d*) // optional digits
.*? // any characters between the last digit and Oxidation (M)
Oxidation\s+\(M\)"
Since the + must be matched first, and must be matched precisely once, the AMEVDIEERPK prefix cannot be matched by .*?.
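The commented pattern above maps naturally onto a verbose-mode regular expression. This is a sketch in Python's re module (not .NET); the function name count_of is hypothetical, and the inputs assume single spaces.

```python
import re

PATTERN = re.compile(
    r"""
    \s*                 # whitespace before +
    \+                  # the + sign itself
    \s*                 # whitespace after +
    (\d*)               # optional digits
    .*?                 # any characters between the digits and the name
    Oxidation\s+\(M\)
    """,
    re.VERBOSE,
)


def count_of(line):
    """Return the captured digits ('' if absent), or None if there is no match."""
    m = PATTERN.search(line)
    return m.group(1) if m else None


print(repr(count_of("AMEVDIEERPK + 7 Oxidation (M)")))
print(repr(count_of("AMEVDIEERPK + Oxidation (M)")))
```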

I settled on using \w* for now. The user will be required
to specify matching for any white space, but it covers the
majority of cases for this particular application and how it
is commonly used.
So for the example the regular expression is then:
\s*(\d*)\s*\w*Oxidation\s+\(M\)
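To see what the \w* prefix changes, here is a short sketch in Python's re module (not .NET). The helper count_for and the constant PREFIX are hypothetical names, and the inputs assume single spaces.

```python
import re

PREFIX = r"\s*(\d*)\s*"  # prepended to every user rule


def count_for(rule, text):
    """Apply PREFIX + rule and return the captured count, or None if no match."""
    m = re.search(PREFIX + rule, text)
    return m.group(1) if m else None


s = "AMEVDIEERPK + 7 Oxidation (M)"

full_rule = count_for(r"\w*Oxidation\s+\(M\)", s)   # the settled pattern
partial_rule = count_for(r"\w*dation\s+\(M\)", s)   # partial name, with \w*
bare_partial = count_for(r"dation\s+\(M\)", s)      # partial name, without \w*

print(repr(full_rule), repr(partial_rule), repr(bare_partial))
```

With \w* in front, the rule has to start at the beginning of the word, so the prefix sits next to the count again. Without it, the match starts inside the word and the count stays empty.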

". * ?" in this example will always match zero characters, since "* ?" does shortest possible match. As a result, since the thing right before the 'O' is a space, "\ d *" can match 0 digits.
(Sorry about the spaces in the quotes; the auto-formatter was eating my syntax.)
Reference: Quantifiers in Regular Expressions (MSDN)

I am sorry, there is more to the syntax...
The plus sign can not be relied on. It separates the
(peptide) sequence and the (peptide) modifications. There
can be more than one modification for each sequence. Sample
with two modifications (there are 7 spaces between "2" and
"L"):
"KLIDLTQFPAFVTPMGK + Oxidation (M); 2 Lysine-13C615N2 (K-full)"
The user could specify "\S+\s+\(K-full\)" for the second
modification and "2" should be extracted.
Here are some more sample lines (after the plus sign):
" Phospho (ST); 2 Dimethyl (K); Dimethyl (N-term)"
" Phospho (ST); 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"
" N-Acetyl (Protein)"
" 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"
" N-Acetyl (Protein); 2 Lysine-13C615N2 (K-full)"
" Oxidation (M); N-Acetyl (Protein)"
" Oxidation (M); N-Acetyl (Protein); Lysine-13C615N2 (K-full)"
" N-Acetyl (Protein); Lysine-13C615N2 (K-full)"
" Oxidation (M); Lysine-13C615N2 (K-full)"
" Oxidation (M)"
" 2 Oxidation (M); Lysine-13C615N2 (K-full)"
A sample file with user defined rules can be found at
(packed in 7-zip format):
<http://www.pil.sdu.dk/1/MSQuant/CEBIquantModes,2008-11-10.7z>
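One possible way to handle lines like the samples above is to split the modification list on ";" first and only then apply the user's rule to each modification name. This is only a sketch of that idea in Python, not how MSQuant does it. The names MODIFICATION, parse_modifications and count_for_rule are hypothetical, and a missing count is assumed to mean 1, following the earlier remark that the number is left out when there is only one modification.

```python
import re

# Optional leading count, then the modification name.
MODIFICATION = re.compile(r"\s*(\d*)\s*(\S.*?)\s*")


def parse_modifications(mods_text):
    """Split the text after the plus sign into (count, name) pairs."""
    pairs = []
    for part in mods_text.split(";"):
        m = MODIFICATION.fullmatch(part)
        if m is None:
            continue  # skip empty pieces
        count = int(m.group(1)) if m.group(1) else 1  # assumption: no number means 1
        pairs.append((count, m.group(2)))
    return pairs


def count_for_rule(line, rule):
    """Return the count of the first modification whose name matches the rule."""
    mods_text = line.partition("+")[2]
    for count, name in parse_modifications(mods_text):
        if re.search(rule, name):
            return count
    return None


line = "KLIDLTQFPAFVTPMGK + Oxidation (M); 2 Lysine-13C615N2 (K-full)"
print(count_for_rule(line, r"\S+\s+\(K-full\)"))
print(count_for_rule(line, r"dation\s+\(M\)"))
```

Because the count is separated from the name before the rule is applied, the rule no longer has to match from the start of the modification.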

Related

regex: match everything, but not a certain string including white space (regular expression, in spite of, anything but, VBA visual basic)

Folks, there are already billions of questions on "regex: match everything, but not ...", but none seems to fit my simple question.
A simple string: "1 Rome, 2 London, 3 Wembley Stadium". I want to match only the names and not the ranks, i.e. extract "Rome, London, Wembley Stadium".
Using a regex tester (https://extendsclass.com/regex-tester.html), I can simply match the opposite by:
([0-9]+\s*) and it gives me the ranks:
"1 ", "2 " and "3 ".
But how to reverse it? I tried something like:
[^0-9 |;]+[^0-9 |;], but it also excludes white spaces that I want to maintain (e.g. after the comma and in between Wembley and Stadium). I guess the "0-9 " needs to be treated somehow as one continuous string. I tried various brackets, quotation marks, \s*, but nothing yet.
Note: I'm working in a Visual Basic environment that does not allow lookbehinds!
You can use
\d+\s*(.*?)(?=,\s*\d+\s|$)
See the regex demo; get the values from match.SubMatches(0). Details:
\d+ - one or more digits
\s* - zero or more whitespaces
(.*?) - Group 1: zero or more chars other than line break chars, as few as possible
(?=,\s*\d+\s|$) - a positive lookahead that requires a comma, zero or more whitespaces, one or more digits and then a whitespace OR the end of the string immediately to the right of the current location.
Here is a demo of how to get all matches:
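Sub TestRegEx()
    Dim regex As Object, matches As Object, match As Object
    Dim str As String
    str = "1 Rome, 2 London, 3 Wembley Stadium"
    ' Late binding, so no reference to the VBScript regular expression library is needed
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = "\d+\s*(.*?)(?=,\s*\d+\s|$)"
    regex.Global = True   ' find all matches, not just the first one
    Set matches = regex.Execute(str)
    For Each match In matches
        Debug.Print match.SubMatches(0)   ' Group 1: the name without its rank
    Next
End Sub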
Output:
Rome
London
Wembley Stadium
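For readers without a VBA environment at hand, the same pattern can be tried in Python's re module. This is a sketch only; the variable names are illustrative. Like the VBA answer, it relies on a lookahead and needs no lookbehind.

```python
import re

text = "1 Rome, 2 London, 3 Wembley Stadium"

# Group 1 holds the name; the lookahead stops it before the next ", <rank> "
# or at the end of the string.
names = re.findall(r"\d+\s*(.*?)(?=,\s*\d+\s|$)", text)

print(names)
print(", ".join(names))
```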

Regex to replace multiple patterns with single not working

I am working on replacing multiple occurrences of the string 0000 with a single random number in HANA SQL.
I have used these patterns
'(\w+)\s+\1'
'([0000 ]+) \1'
but all occurrences are replaced except the last occurrence of the pattern
SELECT REPLACE_REGEXPR('(\w+)\s+\1' IN '0000 0000 0000' WITH ROUND(RAND()*1000) OCCURRENCE ALL) AS a2
FROM DUMMY;
Current output is
RANDOM 0000
expected output is
RANDOM
Try this regex:
((0000) +)+(0000)
Look Here
And if it's OK to match any digits, and more or fewer than 4 of them:
(\d+ +)+\d+
Good Luck!
You may use
\b(\d+)(?:\s+\1)+\b
See the regex demo
You need \d to match digits (if you need to match letters and _ keep on using \w).
Also, to match 1 or more repetitions of a sequence of patterns you need (?:....)+, a + quantified non-capturing group.
Pattern details
\b - word boundary
(\d+) - Group 1: one or more digits
(?:\s+\1)+ - 1+ repetitions of 1+ whitespaces and the same value as captured in Group 1
\b - word boundary
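The difference between the original pattern and the suggested one can be reproduced outside HANA. The following sketch uses Python's re.sub instead of REPLACE_REGEXPR, and the literal "RANDOM" stands in for the random number.

```python
import re

text = "0000 0000 0000"

# Original pattern: it consumes values in pairs, so the third 0000 has no
# partner left and stays in the output.
pairwise = re.sub(r"(\w+)\s+\1", "RANDOM", text)

# Suggested pattern: one value followed by one or more repetitions of itself,
# so the whole run is replaced in a single match.
whole_run = re.sub(r"\b(\d+)(?:\s+\1)+\b", "RANDOM", text)

print(pairwise)
print(whole_run)
```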

Teradata regular expressions, look behind

I have a field, Simplified_Description and I'm looking for patterns in it. Specifically, I'm looking for a pattern like 6 X 8 or 6X8 or 600X800. I want to pull out the first and second numbers into new fields. I've been able to get the first number (with much help) using a look-ahead.
REGEXP_substr(Simplified_Description, '[0-9]+(?= {0,1}[X] {0,1}[0-9]+)') AS FirstNum,
When I try to get the second number by changing the look-ahead to a look-behind (by simply adding in a "<"),
REGEXP_substr(Simplified_Description, '[0-9]+(?<= {0,1}[X] {0,1}[0-9]+)') AS SecondNum
I now get an error
SELECT Failed. [9134] The pattern specified is not a valid pattern.
I am a complete newb on regular expressions, especially on look-ahead and look-behind, so it's possible I have some extremely simple error, but I can't figure it out as what I'm doing appears to be the correct syntax.
You may use the following regex to extract the first number:
REGEXP_substr(Simplified_Description, '\d+(?=\s*X\s*\d)') AS FirstNum
and this regex for the second number:
REGEXP_substr(Simplified_Description, '\d+\s*X\s*\K\d+') AS SecondNum
See the regex 1 and regex 2 demo.
Pattern 1 details
\d+ - 1 or more digits that are followed with...
(?=\s*X\s*\d) - a sequence of patterns:
\s* - 0+ whitespaces
X - an X char
\s* - 0+ whitespaces
\d - a digit.
Pattern 2 details
\d+ - 1 or more digits
\s*X\s* - an X char enclosed with any 0+ whitespace chars
\K - a match reset operator that omits (removes) the text matched so far from the match value
\d+ - 1 or more digits.
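The two extractions can be sketched in Python for quick experimentation. Python's re module has no \K operator, so the second number is taken from a capture group instead; that is a substitution made for this sketch, not what the Teradata answer does. The function names are hypothetical.

```python
import re


def first_number(description):
    """Digits that are followed by optional spaces, an X, optional spaces, a digit."""
    m = re.search(r"\d+(?=\s*X\s*\d)", description)
    return m.group(0) if m else None


def second_number(description):
    """Digits after '<digits> X'; a capture group replaces \\K here."""
    m = re.search(r"\d+\s*X\s*(\d+)", description)
    return m.group(1) if m else None


for sample in ["6 X 8", "6X8", "600X800"]:
    print(sample, first_number(sample), second_number(sample))
```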

Find Each Occurrence of X and Insert a Carriage Return

A colleague has some data he is putting into a flat file (.txt) and needs to insert a carriage return before EACH occurrence of 'POL01', 'SUB01','VEH01','MCO01'.
I did use:
For Each line1 As String In System.IO.File.ReadAllLines(BodyFileLoc)
If line1.Contains("POL01") Or line1.Contains("SUB01") Or line1.Contains("VEH01") Or line1.Contains("MCO01") Then
Writer.WriteLine(Environment.NewLine & line1)
Else
Writer.WriteLine(line1)
End If
Next
But unfortunately it turns out that the file is not formatted in 'lines' by SSIS but as one whole string.
How can I insert a carriage return before every occurrence of the above?
Test Text
POL01CALT302276F 332 NBPM 00101 20151113201511130001201611132359 2015111300010020151113000100SUB01CALT302276F 332 NBPMP01 Akl Abi-Khalil 19670131 M U33 Stoford Close SW19 6TJ 2015111300010020151113000100VEH01CALT302276F 332 NBPM001LV56 LEJ N 2006VAUXHALL CA 2015111300010020151113000100MCO01CALT302276F 332 NBPM0101 0 2015111300010020151113000100POL01CALT742569N
You can use regular expressions for this, specifically by using Regex.Replace to find and replace each occurrence of the strings you're looking for with a newline followed by the matching text:
' Requires: Imports System.Text.RegularExpressions
Dim str As String = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
Dim output As String = Regex.Replace(str, "((?:POL|SUB|VEH|MCO)01)", Environment.NewLine & "$1")
'output contains:
'xxx
'POL01xxx
'SUB01xxx
'VEH01xxx
'MCO01xxx
There may be a better way to construct this regular expression, but this is a simple alternation on the different letters, followed by 01. This matched text is represented by the $1 in the replacement string.
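The same replacement can be sketched in Python for comparison. The function name and the newline parameter are hypothetical; pass "\r\n" if a Windows-style carriage return plus line feed is wanted, which is what Environment.NewLine yields on Windows.

```python
import re

# One of the four record markers: POL01, SUB01, VEH01 or MCO01.
MARKERS = re.compile(r"(?:POL|SUB|VEH|MCO)01")


def break_before_markers(text, newline="\n"):
    """Insert a line break before every marker in a single long string."""
    return MARKERS.sub(lambda m: newline + m.group(0), text)


sample = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
print(break_before_markers(sample))
```

A function is used for the replacement so that the newline string is inserted literally, whatever characters it contains.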
If you're new to regular expressions, there are a number of tools that help you understand them - for example, regex101.com will show you an explanation of the one I have used here.

Extra blank space between words

Please help me with 2 questions on how to do the GREL expression for:
1. If there are double spaces between 2 words in a column, how can I eliminate 1 space? Example: Robert--Smith to Robert-Smith (the minus character stands for a blank, for illustration).
2. How can I look for an exact word in a text filter?
Thanks!
1°) try transform---> value.replace("  "," ")
Or, simply common transforms ----> collapse consecutive white spaces
2°) Column ---> text filters and enter your word
Or, do column---> Facet---> Custom text facet and type : value.contains(" your_word ")
or value.contains(/(yourexactword)/)
This will return a True or False facet
H.
@hpiedcoq's answer is the right one if you need to do this in GREL. If not, you can just use the point-and-click interface:
for the first question: Select your column and select Edit cells > Common transforms > Collapse consecutive white space
for the second question: select your column > text filter > enter the word you are looking for. You can select case sensitive if you want to take into account upper and lower case in your search.
1.1 transform -- > value.replace("  "," ")
Replaces every double whitespace with a single one.
1.2 transform -- > value.trim()
Deletes whitespace before and after the string (whitespace inside the string is left as it is).
1.3 transform -- > value.replace(/\b  \b/," ")
Replace with regular expression, deletes only double whitespace between two words.
Text filter > turn on regular expression and use \b.
Text filter with regular expression: \bWord\b = exact word; the characters before and after the word may or may not be whitespace.
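The two ideas in this thread (collapsing repeated spaces and matching an exact word with \b) can be tried out in Python before typing them into OpenRefine. This is a sketch in Python's re module, not GREL, and the function names are hypothetical. Unlike a plain replace of two spaces by one, the pattern below collapses runs of any length.

```python
import re


def collapse_spaces(value):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", value)


def contains_word(value, word):
    """True if word occurs as a whole word, delimited by word boundaries."""
    return re.search(r"\b" + re.escape(word) + r"\b", value) is not None


print(collapse_spaces("Robert  Smith"))
print(contains_word("Robert Smith", "Smith"))
print(contains_word("Robert Smithson", "Smith"))
```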