Using regexp_like to select rows where an attribute string contains several different words - sql

A bit new to regexp and looking for some help understanding some of the capabilities. I am currently trying to select some sets of data that start with a word followed by a space and then one of several possible words.
Example 1:
I am basically looking to select data such as Product1 green, Product1 red, Product1 blue (green, red or blue basically) but not:
xyz Product1, Product1 black, Product1 white, Product1 garbage red.
I have tried the following queries without much success:
Where regexp_like(item, 'Product1 [green | red | blue]');
Where regexp_like(item, 'Product1 [green, red, blue]');
Where regexp_like(item, '^Product1 [green, red, blue]');
Hypothetically, does anybody know of a way I could also implement an 'AND', for example selecting items which contain both of the words green and red in the same attribute?
Example 2:
Similar situation, but trying to match a word after a punctuation
Where regexp_like (job, 'Commerce [[:punct:]] .*');
With this query I am looking to select jobs which have
Commerce - test
Commerce : abcdefg
These queries are not working as I would expect them to and I'm not able to quite figure out why. I am assuming I have misunderstood the construct of these regular expressions.
Any help / explanations would be greatly appreciated!

For the first, try the following
WHERE REGEXP_LIKE(ITEM, '^Product1.*(green|red|blue)')
or
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue)')
or
WHERE REGEXP_LIKE(ITEM, '^Product1 +(green|red|blue)')
depending on what you expect after the Product1 - the first case allows zero or more characters of any kind, the second requires that there be a single space after Product1, and the third requires one or more blanks after Product1.
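The difference between the bracket attempts from the question and the grouped alternation can be checked outside the database. Below is a minimal sketch using Python's re module purely as a stand-in: it is a different engine from the one behind REGEXP_LIKE, but anchors, groups, alternation and bracket sets behave the same way for these patterns. The sample strings are the ones from the question; the trailing $ is my own addition, assuming nothing should follow the colour.

```python
import re

items = [
    "Product1 green", "Product1 red", "Product1 blue",
    "xyz Product1", "Product1 black", "Product1 white",
    "Product1 garbage red",
]

def matching(pattern):
    """Return the items in which the pattern is found (like REGEXP_LIKE)."""
    return [s for s in items if re.search(pattern, s)]

# Brackets define a set of single characters, so this only asks for
# "Product1 " followed by ONE of: g r e n space | d b l u
bracket_hits = matching(r"Product1 [green | red | blue]")

# Grouped alternation: exactly one space, then one of the three words.
group_hits = matching(r"^Product1 (green|red|blue)$")

# ".*" lets anything sit between Product1 and the colour.
loose_hits = matching(r"^Product1.*(green|red|blue)")
```

Note that the bracket version accepts Product1 black (because of the b) and that the .* version accepts Product1 garbage red, which the question wanted to exclude.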
Not sure where you're going exactly on the second one. If you really want strings that begin with 'Commerce', followed by a space, followed by a punctuation character, another space, and then anything, try
WHERE REGEXP_LIKE(JOB, '^Commerce [[:punct:]] .*');
If instead of a punctuation character you're looking for either ':' or '-', try
WHERE REGEXP_LIKE(JOB, '^Commerce [:-] .*');
I'm no great expert on regular expressions but I'll try to offer some explanations:
^ requires that the following element be at the beginning of the string. Thus, in the first case ^Product1 means "'Product1' must be at the start of the string".
In regular expressions parentheses are used to group expressions, so in the first case (green|red|blue) are grouped together.
| is a logical OR, so (green|red|blue) means "must be one of 'green' or 'red' or 'blue'".
Square brackets are used for character classes. You can use either predefined classes, such as [:punct:] or [:space:] (these names are only recognized inside an enclosing pair of square brackets, hence [[:punct:]]), or you can make up your own as in [:-]. During regular expression interpretation a square bracket character class, no matter how long, represents a single character in the string being matched. So in the regular expression ^Commerce [:-] .* the character class [:-] means "look for either a colon or a dash". If you want to indicate that you expect multiple occurrences of characters in the class, one after another, use one of the repetition operators (* or +) after the class - so [abc]* would match all of abcabcabc.
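The "one class, one character" rule can be tried out with a small sketch. It is written in Python only for convenience; Python's re has no POSIX names such as [[:punct:]], so the punct class below is a stand-in built from string.punctuation (my assumption, not something the database does). The job strings beyond the two from the question are made up for illustration.

```python
import re
import string

jobs = ["Commerce - test", "Commerce : abcdefg", "Commerce test", "Commerce -- test"]

# A user-defined class: exactly one character, either ':' or '-'.
sep_hits = [j for j in jobs if re.search(r"^Commerce [:-] .*", j)]

# A class followed by * repeats the class, not one particular character.
repeated = re.fullmatch(r"[abc]*", "abcabcabc") is not None

# Stand-in for [[:punct:]]: a class listing the ASCII punctuation characters.
punct = "[" + re.escape(string.punctuation) + "]"
punct_hits = [j for j in jobs if re.search(r"^Commerce " + punct + r" .*", j)]
```

"Commerce -- test" is rejected in both cases because the class consumes a single character and the pattern then insists on a space.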
Also keep in mind that in a regular expression every character means something, so you can't use whitespace to make regular expressions more legible because the whitespace becomes something that will be looked for when the expression is interpreted.
Share and enjoy.
Edit
Didn't notice your question about AND earlier. A simple way to AND together multiple expressions is to just put them one after another, which matches them in that order. To look for (green|red|blue), followed by a space, followed by (green|red|blue), a simple expression would be
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue) (green|red|blue)')
If potentially multiple spaces were to be allowed between the colors
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue) +(green|red|blue)')
could be used.
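Putting the groups one after another matches them as a sequence. If "AND" is meant as "both words appear somewhere, in either order", one option is to test each word separately, which in SQL would amount to two REGEXP_LIKE conditions joined with AND. A rough sketch of both readings, in Python as a stand-in and with made-up sample rows:

```python
import re

items = ["Product1 green red", "Product1 red green", "Product1 green", "Product1 blue red"]

# Sequence: one colour, one or more spaces, then another colour.
sequence = [s for s in items
            if re.search(r"^Product1 (green|red|blue) +(green|red|blue)", s)]

# Order-independent AND: each word is tested on its own.
both = [s for s in items
        if re.search(r"green", s) and re.search(r"red", s)]

# The same AND in a single pattern, using lookaheads. This is an engine
# feature that POSIX-style engines may not offer, so treat it as optional.
both_single = [s for s in items
               if re.search(r"^(?=.*green)(?=.*red)", s)]
```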
Resistance is useless.

Related

REGEXP_REPLACE explanation

Hi, may I know what the query below means?
REGEXP_REPLACE(number,'[^'' ''-/0-9:-@A-Z''[''-`a-z{-~]', 'xy') ext_number
part 1
In terms of explaining what the function call is doing:
It is a function call that analyses an input string 'number' with a regex (2nd argument) and replaces any parts of the string which match it with a specific string (3rd argument). As for the name after the parenthesis I am not sure, but the documentation for the function is here
part 2
Sorry to be writing a question within an answer here but I cannot respond in comments yet (not enough rep)
Does this regex work? Unless sql uses different syntax this would appear to be a non-functional regex. There are some red flags, e.g:
The entire regex is wrapped in square brackets, indicating a set of characters, but seems to predominantly hold an expression
There is a range indicator between a single quote and a character (invalid range: if a dash was required in the match it should be escaped with a '\' (backslash))
One set of square brackets is never closed
After some minor tweaks this regex is valid syntax:
^'' ''\-\/0-9:-@A-Z''[''-a-z{-~]`, but it does not match anything I can think of. It is important to know what string is being examined and what the context of the program is in order to identify what the regex might be attempting to do.
It seems like it is meant to replace all ASCII control characters in the column or variable number with xy.
[] encloses a class of characters. Any character in that class matches. [^] negates that, hence all characters match, that are not in the class.
- is a range operator, e.g. a-z means all characters from a to z, like abc...xyz.
It seems like characters enclosed in ' are meant to be escaped (the second ' is there to escape the ' in the string literal itself). At least this would make some sense. However, for none of the DBMSs I found that have a regexp_replace() function (Postgres, Oracle, DB2, MariaDB, MySQL) did I find anything in the docs that would indicate this escape mechanism. They all use \, but maybe I missed something? Unfortunately you didn't tag which DBMS you're actually using!
Now if you take an ASCII table you'll see that the ranges in the expression make up all printable characters (counting space as printable) in groups from space to /, 0 to 9, : to @, etc. Actually it might have been shorter to express it as '' ''-~, space to ~.
Given the negation, all these don't match. The ones left are from NUL to US and DEL. These match and get replaced by xy one by one.
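This reading can be checked mechanically. The sketch below uses Python's re as a stand-in and follows the interpretation above, i.e. it assumes the doubled quotes merely wrap a plain space and a plain [ (if the quotes were literal members of the class, the result would differ). The sample input is made up.

```python
import re

# The class from the question with the quote wrapping undone.
long_class = r"[^ -/0-9:-@A-Z\[-`a-z{-~]"
# The shorter equivalent suggested above: anything outside space..~
short_class = r"[^ -~]"

ascii_chars = [chr(i) for i in range(128)]
replaced_long = [c for c in ascii_chars if re.fullmatch(long_class, c)]
replaced_short = [c for c in ascii_chars if re.fullmatch(short_class, c)]

# Hypothetical input: a tab and a DEL character embedded in a number.
cleaned = re.sub(long_class, "xy", "12\t34\x7f")
```

Within ASCII both classes pick out the same 33 characters: NUL to US plus DEL.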

Objective C - RegEx - Invalid Range when trying to match spaces [duplicate]

How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters ?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
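These four cases are easy to verify. A small sketch in Python, used here only as a convenient stand-in (it treats the hyphen in a class the same way for these cases); the helper chars_matched is a name made up for this example.

```python
import re

def chars_matched(char_class, candidates="abcd-"):
    """Return the candidate characters accepted by a one-character class."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))

only_hyphen = chars_matched(r"[-]")    # hyphen alone
trailing = chars_matched(r"[abc-]")    # hyphen last: literal
leading = chars_matched(r"[-abc]")     # hyphen first: literal
as_range = chars_matched(r"[ab-d]")    # hyphen between two characters: range

# The class from the question with a hyphen appended at the very end.
with_hyphen = re.fullmatch(r"[a-zA-Z0-9!$* \t\r\n-]+", "foo-bar 42") is not None
```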
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all the same. A hyphen between two ranges is treated as a plain symbol. The regex [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" without quotes to match any type of hyphen. The '-' character is just one type of hyphen which also happens to be a special character in Regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

How can I negate this expression

I have absolutely no clue how to work with regex's. I am less than a beginner.
I want to find any invalid css names from a string, so I can exclude them. Looking online I found a way to select the valid names using this:
/-?[_a-zA-Z]+[_a-zA-Z0-9-]*/g
What I want to do is negate this expression, so that only '1999' is matched in this example input:
holding-page single 1999 contact id-12 contact single single
To "negate" an expression, turn it into a negative look ahead:
/(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*)\S+(?!\S)/g
What this does is match a complete term, but one that does not match your positive regex.
A "complete term" is matched using (?<!\S)\S+(?!\S), which is \S+ (one or more non-whitespace) wrapped in negative look arounds for not a non-whitespace to prevent matching part of a term.
Note that "not a non-whitespace" is not the same as "whitespace", because "not a non-whitespace" also matches the start and end of the input, so leading and trailing terms that are invalid will match too.
Your positive regex has been turned into a negative look ahead by enclosing it in (?!...).
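The pattern can be tried on the sample input from the question. The sketch below uses Python's re as a stand-in (it supports the same lookarounds). The second pattern is my own stricter variant, not part of the answer: it adds (?!\S) inside the lookahead so that the valid-name pattern has to cover the whole term, which also flags terms that merely start like a valid name. The second input string is made up.

```python
import re

# The pattern from the answer: a whole term that does not start like a valid name.
invalid = re.compile(r"(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*)\S+(?!\S)")

# Hypothetical stricter variant: the valid-name pattern must reach the end
# of the term, otherwise the term counts as invalid.
invalid_strict = re.compile(
    r"(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*(?!\S))\S+(?!\S)")

sample = "holding-page single 1999 contact id-12 contact single single"
found = invalid.findall(sample)

mixed = "ok-name 1999 ab$c _x"
found_loose = invalid.findall(mixed)
found_strict = invalid_strict.findall(mixed)
```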

Teradata regular expressions, 0 or 1 spaces

In Teradata, I'm looking for one regular expression pattern that would allow me to find a pattern of some numbers, then a space or maybe no space, and then 'SF'. It should return 7 in both cases below:
SELECT
REGEXP_INSTR('12345 1000SF', pattern),
REGEXP_INSTR('12345 1000 SF', pattern)
Or, my actual goal is to extract the 1000 in both cases if there's an easier way, probably using REGEXP_SUBSTR. More details are below if you need them.
I have a column that contains free text and I would like to extract the square footage. But, in some cases, there is a space between the number and 'SF' and in some cases there is not:
'other stuff 1000 SF'
'other stuff 1000SF'
I am trying to use the REGEXP_INSTR function to find the starting position. Through google, I have found the pattern for the first to be
'([0-9])+ SF'
When I try the pattern for the second, I try
'([0-9])+SF'
and I get the error
SELECT Failed. [2662] SUBSTR: string subscript out of bounds
I've also found answers to similar questions, but they don't work for Teradata. For example, I don't think you can use ? in Teradata.
The error message indicates you're using SUBSTR, not REGEXP_SUBSTR.
Try this:
RegExp_Substr(col, '[0-9]*(?= {0,1}SF)')
Find multiple digits followed by a single optional blank followed by SF and extract those digits.
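The lookahead idea can be tested quickly outside Teradata. This is only a sketch in Python, whose re module is a different engine from Teradata's; the one deliberate change is [0-9]+ instead of [0-9]*, an assumption on my side so that an empty match is impossible. The sample strings are taken from the question.

```python
import re

# Digits that are followed (without being consumed) by an optional blank and "SF".
pattern = re.compile(r"[0-9]+(?= {0,1}SF)")

samples = ["12345 1000SF", "12345 1000 SF", "other stuff 1000 SF"]
found = [pattern.search(s) for s in samples]

digits = [m.group() for m in found]
positions = [m.start() + 1 for m in found]  # 1-based, in the style of REGEXP_INSTR
```

The first two samples give position 7, which is the value the question asked for.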
I would pattern it like this:
\b(\d+)\s*[Ss][Ff]\b
\b # word boundary
(\d+) # 1 or more digits (captured)
\s* # 0 or more white-space characters
[Ss] # character class
[Ff] # character class
\b # word boundary
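The commented layout above maps directly onto a "verbose" regex mode where the engine ignores whitespace and # comments. A minimal sketch using Python's re.VERBOSE as a stand-in (whether and how Teradata offers such a mode is not covered here); extract_sf is a made-up helper name.

```python
import re

square_feet = re.compile(r"""
    \b        # word boundary
    (\d+)     # 1 or more digits (captured)
    \s*       # 0 or more white-space characters
    [Ss]      # S in either case
    [Ff]      # F in either case
    \b        # word boundary
""", re.VERBOSE)

def extract_sf(text):
    """Return the captured digits, or None when no square footage is found."""
    m = square_feet.search(text)
    return m.group(1) if m else None
```

For example, extract_sf('other stuff 1000 SF') and extract_sf('other stuff 1000SF') both return '1000', while a string such as '1000 SFX' yields None because of the closing word boundary.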

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote begins with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lot of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expression, but I have not truly figured out how to include any character or line break, except the right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
@"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multi-line regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
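To see why the negated class works across line breaks where the other two attempts do not, here is a small sketch in Python (a stand-in for RegexKitLite; the three patterns are the ones discussed above, minus the Objective-C string escapes). The text is a shortened, made-up stand-in for the sample in the question.

```python
import re

# The first footnote spans three lines, the second sits on one line.
text = (
    "<p>[Footnote 1: first line</p>\n"
    "<p>2. second line</p>\n"
    "<p>3. third line.]</p>\n"
    "<p>Body text that must survive.</p>\n"
    "[Footnote 2: a one-line footnote.]\n"
)

# Negated class: everything except ], newlines included.
negated = re.findall(r"\[Footnote[^\]]*\]", text)

# The question's first attempt also excludes \n, so footnote 1 is missed.
no_newline = re.findall(r"\[Footnote[^\]\n]*\]", text)

# Greedy dot in single-line mode runs on to the LAST ] in the text.
greedy = re.findall(r"(?s)\[Footnote.*\]", text)
```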
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
@"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.
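The "peeling back" can be written as a loop that repeats the replacement until a pass finds nothing. A rough sketch in Python as a stand-in, assuming the goal is to delete the footnotes and that nested footnotes use the same capitalised [Footnote marker; strip_footnotes is a made-up name.

```python
import re

# Matches only innermost footnotes: no [ or ] allowed inside.
innermost = re.compile(r"\[Footnote[^\]\[]*\]")

def strip_footnotes(text):
    """Remove footnotes layer by layer; return the text and the number of passes."""
    passes = 0
    while True:
        text, count = innermost.subn("", text)
        if count == 0:
            return text, passes
        passes += 1
```

With the input 'a [Footnote foo [Footnote bar] foo] b' the first pass removes the inner footnote and the second removes the outer one.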