Regex: Difference between (?=.*gh) and (?=\w*gh) - regex-lookarounds

I am new to regex and I can't seem to solve this:
When trying to match words that contain "gh" using positive lookahead:
(?=\w*gh) works perfectly, but (?=.*gh) matches every word.
Can someone help me with this, please? Why does the regex with the wildcard fail to match only words that contain 'gh'?
E.g.:
\b(?=\w*gh)[a-z]+\b matches only the words with 'gh' (right, tight) in: right, cat, tight, dog
but
\b(?=.*gh)[a-z]+\b matches everything: right, cat, tight, dog

. in regex matches any character except line terminators. This includes whitespace.
So (?=.*gh) will match everything in the text up to and including the characters 'gh'
e.g.
Isle of Wight (the lookahead spans "Isle of Wigh")
If you have two or more words containing 'gh', it'll match the entire text up to the last one, since . also matches the preceding 'gh's
e.g.
Isle of Wight flights (the lookahead spans "Isle of Wight fligh")
\w matches only word characters, i.e. [a-zA-Z0-9_], so it won't match whitespace and therefore won't capture all the words before the one containing 'gh'
e.g.
Isle of Wight (the lookahead spans only "Wigh")
Update
Your edited regex will take the text captured by the lookahead (see above) and then match all words in it. 'Dog' is never captured in your example.
Try using https://regex101.com/
Given: right, cat, tight, dog
\b(?=.*gh)[a-z]+\b matches right, cat, tight (but not dog, because no 'gh' follows it)
\b(?=\w*gh)[a-z]+\b matches right, tight
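To check the difference outside regex101, here is a minimal Python sketch using the standard re module. The variable names are only illustrative, and the sample text is the one from the question.

```python
import re

text = "right, cat, tight, dog"

# .* may cross spaces and commas, so any word followed later by 'gh' passes.
dot_matches = re.findall(r"\b(?=.*gh)[a-z]+\b", text)

# \w* stays inside the current word, so only words containing 'gh' pass.
word_matches = re.findall(r"\b(?=\w*gh)[a-z]+\b", text)

print(dot_matches)   # ['right', 'cat', 'tight']
print(word_matches)  # ['right', 'tight']
```

'dog' is missing from the first list because no 'gh' comes after it in the text.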

Related

openrefine how remove certain words from the end of each cell

I have a column in OpenRefine whose cells have content like:
This dog is a great dog.
This cat is a great cat,
I would like to remove the words dog and cat from the end of each cell (if the punctuation could be removed too, that would be great).
I have tried with
\bdog\s*$
but I get errors, or no replacement is done.
I am using OpenRefine 3.3.
value.replace(\bdog|\bcat\s*$,'')
The error I get:
Parsing error at offset 14: Missing number, string, identifier, regex, or parenthesized expression
Desired output:
This dog is a great
This cat is a great
Also, it would be great if I could remove all trailing characters like " : , . (actually I am looking for a regex to cluster publishers; this is librarian data), so if you could suggest words I should remove from the end of the cells I would be grateful.
I combined Ettore's answer with the split() function: value.split(' ')[-1] selects the last word of a string.
The result is:
replace(value,value.split(' ')[-1],'') + value.split(' ')[-1].replace(/cat|dog/,'')
where
replace(value,value.split(' ')[-1],'') selects your string except the last word
value.split(' ')[-1].replace(/cat|dog/,'') removes cat or dog from the last word, leaving any punctuation attached to it.
Note that the expression only works because of the punctuation at the end of the string: without it, the last word would also be removed wherever else it occurs in the cell. Not a perfect solution, but you may be able to build something from here.
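For comparison, the same goal can be sketched with a single anchored regex. This is Python rather than GREL, so it only illustrates the pattern; the function name strip_trailing_words and the pattern itself are assumptions of this sketch, not part of the answers above.

```python
import re

# Hypothetical helper: drops a final "dog" or "cat" plus any trailing
# non-word characters (punctuation, spaces) from the end of a cell value.
TRAILING = re.compile(r"\s*\b(?:dog|cat)\W*$")

def strip_trailing_words(value: str) -> str:
    return TRAILING.sub("", value)

print(strip_trailing_words("This dog is a great dog."))  # This dog is a great
print(strip_trailing_words("This cat is a great cat,"))  # This cat is a great
```

Because the pattern is anchored with $, earlier occurrences of dog or cat in the cell are left alone.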

Grammar and unicode characters

Why does the grammar below fail to parse Unicode characters?
It parses fine after removing the word boundaries around <.sym>.
#!/usr/bin/env perl6
grammar G {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<✓> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('✓'); # Nil
From the « and » "left and right word boundary" doc:
[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.
✓ isn't a word character. So the word boundary assertion fails.
What is and isn't a "word character"
"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:
Characters whose Unicode general category starts with an L, which stands for Letter (see footnote 1).
Characters whose Unicode general category is Nd, which stands for Number, decimal (see footnote 2).
_, an underscore.
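The three-part rule above can be sketched in Python with the standard unicodedata module. This only restates the rule as listed here; it is not how P6 implements \w, and the helper name is_word_char is hypothetical.

```python
import unicodedata

def is_word_char(ch: str) -> bool:
    """Apply the rule listed above: a letter (L*), a decimal digit (Nd), or '_'."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd" or ch == "_"

# Print each sample character with its general category and the verdict.
for ch in "y✓ら१¹一_":
    print(ch, unicodedata.category(ch), is_word_char(ch))
```

Under this rule ✓ is rejected, which is why a word boundary assertion next to it cannot succeed.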
"alpha 'Nd under"
In a comment below #p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".
But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/) (see footnote 2).
This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".
Footnotes
1 Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes but also others including Lu (Letter, uppercase) and Lo (Letter, other), which latter includes the ら character JJ also mentions. There are other letter sub-categories too.
2 Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one and १ is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one but it is excluded from the Nd category (and is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
I keep starting my answers with "Raiph is right". But he is. Also, an example of why this is so:
for <y ✓ Ⅲ> {
    say $_.uniprops;
    say m/<|w>/;
}
The first line of the loop prints the Unicode properties; the second tests against the word boundary anchor <|w>. Only the first character, which can be part of an actual word, matches that anchor: it is a letter (Ll), while the other two are not. You can use any Ll character as part of a word, and in your grammar, but only characters with that Unicode property can actually form words.
grammar G {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<ら> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('ら'); # This is a hiragana letter, so it works.

Regex to match BIN ranges

I'm trying to write a regex that matches the numbers 456725 to 456744 (Last 2 digits, 25-44), but can't seem to figure out a correct regex format. I've tried ^(4567[2-4][0-9]) but using this also matches 456745 which it shouldn't.
If you use ^(4567[2-4][0-9]), you allow any digit in [2-4] followed by any digit in [0-9], i.e. everything from 456720 to 456749, which is not what you want.
So you need to change it to something like:
^4567(?:2[5-9]|3[0-9]|4[0-4])
Explanation
^ asserts position at start of the string
4567 matches the characters 4567 literally
Non-capturing group (?:2[5-9]|3[0-9]|4[0-4])
  1st Alternative 2[5-9]
    2 matches the character 2 literally
    Match a single character present in the list [5-9]
  2nd Alternative 3[0-9]
    3 matches the character 3 literally
    Match a single character present in the list [0-9]
  3rd Alternative 4[0-4]
    4 matches the character 4 literally
    Match a single character present in the list [0-4]
You could use the page regex101 to learn more and read good explanations on the subject. Hope it helps.
If your variable is just an integer, it is best to compare it as such.
As for the regex: the ^(4567 part is correct. Your issue is that [2-4] and [0-9] are independent of each other. You need to put the pieces together so that only 25-29, 30-39 and 40-44 are allowed.
This should get you on the right track:
^(4567(?:2[5-9]|3[0-9]|4[0-4]))$
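A quick way to convince yourself that the anchored pattern covers exactly 456725 to 456744 is to test it against every number sharing the 4567 prefix. The sketch below uses Python; the names BIN_RANGE and in_range_as_int are illustrative only.

```python
import re

BIN_RANGE = re.compile(r"^(4567(?:2[5-9]|3[0-9]|4[0-4]))$")

# Collect every six-digit number starting with 4567 that the regex accepts.
matched = [n for n in range(456700, 456800) if BIN_RANGE.match(str(n))]

def in_range_as_int(n: int) -> bool:
    # The plain integer comparison suggested above, for values that are numbers.
    return 456725 <= n <= 456744

print(matched[0], matched[-1], len(matched))  # 456725 456744 20
```

Both approaches agree on every value in that block, including the rejected 456745.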

Regex for letters, digits, no spaces

I'm trying to create a regex to check for 6-12 characters, one being a digit, the rest being any characters, no spaces. Can regex do this? I'm trying to do this in Objective-C and I'm not familiar with regex at all. I've been reading a couple of tutorials, but most are for matching simple cases of a number, or a set of numbers, not exactly what I'm looking for. I can do it with methods, but I was wondering if that would be too slow, and I figured I could try learning something new.
asdfg1 == ok
asdfg 1 != ok
asdfgh != ok
123456 != ok
asdfasgdasgdasdfasdf != ok
Use this regex: ^(?=.*\d)(?=.*[a-zA-Z])[^ ]{6,12}$
It seems that you mean "letter" when you say "character", right? And (thanks to burning_LEGION for pointing that out) there may be only one digit?
In that case, use
^(?=\D*\d\D*$)[^\W_]{6,12}$
Explanation:
^ # Start of string
(?=\D*\d\D*$) # Assert that there is exactly one digit in the string
[^\W_] # Match a letter or digit (explanation below)
{6,12} # 6-12 times
$ # End of string
[^\W_] might look a little odd. How does it work? Well, \w matches any letter, digit or underscore. \W matches anything that \w doesn't match. So [^\W] (meaning "match any character that is not not alphanumeric/underscore") is essentially the same as \w, but by adding _ to this character class, we can remove the underscore from the list of allowed characters.
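The pattern can be checked against the examples from the question. This sketch uses Python's re module rather than Objective-C, purely to exercise the regex; the name ONE_DIGIT is made up for the example.

```python
import re

# Exactly one digit overall, 6-12 letters/digits, no spaces or underscores.
ONE_DIGIT = re.compile(r"^(?=\D*\d\D*$)[^\W_]{6,12}$")

samples = ["asdfg1", "asdfg 1", "asdfgh", "123456", "asdfasgdasgdasdfasdf"]
results = {s: bool(ONE_DIGIT.match(s)) for s in samples}

for sample, ok in results.items():
    print(repr(sample), "ok" if ok else "not ok")
```

Only asdfg1 passes: the others fail on the space, the missing digit, the extra digits, or the length.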
I haven't tried it, but I think this is the answer:
(^[^\d\x20]*\d[^\d\x20]*$){6,12}
This is for one digit: ^[^\d\x20]{0,11}\d{1}[^\d\x20]{0,11}$ but it doesn't limit the length to 6-12. You can check the length with another function first and, if it is between 6 and 12, check it with the regex I wrote.

Extracting functions arguments using RegExp (PREG)

Consider the following function arguments (they have already been extracted from the function):
Monkey,"Blue Monkey", "Red, blue and \"Green'", 'Red, blue and "Green\''
Is there a way to extract the arguments to get the following array output using regexp and stripping white spaces:
[Monkey, "Blue Monkey", "Red, blue and \"Green'", 'Red, blue and "Green\'']
I'm stuck using this RegExp, which is not permissive enough:
/(("[^"]+"|[^\s,]+))/g
This looks a little nasty but it works:
/(?:"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*[\x5C"])*"|'(?:[^\x5C']+|\x5C(?:\x5C\x5C)*[\x5C'])*'|[^"',]+)+/g
I used \x5C instead of the plain backslash character \ as too many of those can be confusing.
This regular expression consists of the parts:
"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*[\x5C"])*" matches double quoted string declarations
'(?:[^\x5C']+|\x5C(?:\x5C\x5C)*[\x5C'])*' matches single quoted string declarations
[^"',]+ matches anything else (except commas).
The parts of "(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*[\x5C"])*" are:
[^\x5C"]+ matches anything except the backslash and quote character
\x5C(?:\x5C\x5C)*[\x5C"] matches proper escape sequences like \", \\, \\\", \\\\, etc.
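As a rough check, the expression can be run over the sample arguments. The question is about PREG, so this Python sketch only exercises the same pattern; the names ARG_PATTERN and ARGS are assumptions of the example. Matches can carry the whitespace that follows a comma, so each one is stripped afterwards.

```python
import re

# The pattern from above, unchanged; \x5C stands for a literal backslash.
ARG_PATTERN = re.compile(
    r"""(?:"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*[\x5C"])*"|'(?:[^\x5C']+|\x5C(?:\x5C\x5C)*[\x5C'])*'|[^"',]+)+"""
)

# The sample argument list from the question, kept verbatim in a raw string.
ARGS = r"""Monkey,"Blue Monkey", "Red, blue and \"Green'", 'Red, blue and "Green\''"""

# findall returns whole matches here because the pattern has no capturing groups.
arguments = [match.strip() for match in ARG_PATTERN.findall(ARGS)]

for argument in arguments:
    print(argument)
```

The commas inside the quoted strings are kept, so the list has four entries.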
Not sure exactly what you're seeking, nor yet how to do this in SQL, but isn't something like this sufficient:
(Using python as an example)
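import re
x = '''Monkey, "Blue Monkey", "Red, blue and "Green\\"", 'Red, blue and "Green\\'\''''
# Split on commas followed by optional whitespace.
# Note: this also splits on commas inside quoted strings.
l = re.split(r',\s*', x)
print(x)
for a in l:
    print(a)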