Workaround for Impala Regex lookahead and lookbehind - sql

If I use Hive, the below works fine. But if I use Impala, it throws error:
select regexp_replace("foobarbarfoo","bar(?=bar)","<NA>");
WARNINGS: Could not compile regexp pattern: bar(?=bar)
Error: invalid perl operator: (?=
Basically, Impala doesn't support lookahead and lookbehind
https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_incompatible_changes.html#incompatible_changes_200
Is there a workaround for this today? Maybe use UDF?
Thanks.

Since you are using regexp_replace, match and capture the part of string you want to keep (but want to use as must-have context) and replace with a backreference. See the regexp_replace Impala reference:
These examples show how you can replace parts of a string matching a pattern with replacement text, which can include backreferences to any () groups in the pattern string. The backreference numbers start at 1, and any \characters must be escaped as \\.
So, here, you may use
select regexp_replace("foobarbarfoo","bar(bar)","<NA>\\1");
^ ^ ^^^
Note it will not work to replace consecutive matches, however, it will work in the current scenario and foobarbarfoo will turn into foo<NA>barfoo (note that Go regex engine is also RE2, hence this option is chosen at regex101.com).

Related

Trino regexp_replace this character in the beginning but not in the middle Trino [duplicate]

I am a complete Reg-exp noob, so please bear with me. Tried to google this, but haven't found it yet.
What would be an appropriate way of writing a Regular expression matching files starting with a dot, such as .buildpath or .htaccess?
Thanks a lot!
In most regex languages, ^\. or ^[.] will match a leading dot.
The ^ matches the beginning of a string in most languages. This will match a leading .. You need to add your filename expression to it.
^\.
Likewise, $ will match the end of a string.
You may need to substitute the \ for the respective language escape character. However, under Powershell the Regex I use is: ^(\.)+\/
Test case:
"../NameOfFile.txt" -match '^(\\.)+\\\/'
works, while
"_./NameOfFile.txt" -match '^(\\.)+\\\/'
does not.
Naturally, you may ask, well what is happening here?
The (\\.) searches for the literal . followed by a +, which matches the previous character at least once or more times.
Finally, the \\\/ ensures that it conforms to a Window file path.
It depends a bit on the regular expression library you use, but you can do something like this:
^\.\w+
The ^ anchors the match to the beginning of the string, the \. matches a literal period (since an unescaped . in a regular expression typically matches any character), and \w+ matches 1 or more "word" characters (alphanumeric plus _).
See the perlre documentation for more info on Perl-style regular expressions and their syntax.
It depends on what characters are legal in a filename, which depends on the OS and filesystem.
For example, in Windows that would be:
^\.[^<>:"/\\\|\?\*\x00-\x1f]+$
The above expression means:
Match a string starting with the literal character .
Followed by at least one character which is not one of (whole class of invalid chars follows)
I used this as reference regarding which chars are disallowed in filenames.
To match the string starting with dot in java you will have to write a simple expression
^\\..*
^ means regular expression is to be matched from start of string
\. means it will start with string literal "."
.* means dot will be followed by 0 or more characters

Regex to convert CamelCase to snake case in Redshift

I am trying to convert CamelCase to either snake case or separated by a delimiter using regex in SQL (AWS Redshift). So something like
regexp_replace(MyString, '([A-Z]+)', '-$1')
except I need to specify not at the beginning of the string. Right now,
MyString -> -my-string instead of my-string.
How do I do this?
Match and capture any char before the uppercase letters, and restore it using another backreference in the replacement pattern:
regexp_replace(MyString, '(.)([A-Z]+)', '$1-$2')
^^^ ^^^^^
See the regex demo.
I understand you already LOWER the result after the regex replacement.

How to negate regex with lookarounds

I have a regex-expression
(?<=#)'|'(?=%)
It successfully matches any apostrophe that is placed around %# in this objective-c string
#"UPDATE RESTAURANTS SET CITY='%#', NAME='%#' ", city, #"Joy's Restaurant";
But I want the opposite thing, to match any apostrophe that is NOT around %# i.e. to only match the apostrophe in Joy's Restaurant in this example.
Any ideas how to do that?
Negative lookarounds are pretty straight forward. Use (?!…) for a negative lookahead and (?<!…) for a negative lookbehind. For example:
(?<!#)'(?!%)
Will match any apostrophe so long as it is not immediately preceded by a # and it is not followed by a %. Notice that you have to remove the alternation (|) as you want to make sure that both lookarounds are satisfied.
Use a Negative Lookbehind and Negative Lookahead instead.
(?<!#)'(?!%)
Live Demo
Alternatively you can use the alternation operator in context placing what you want to exclude on the left, ( saying throw this away, it's garbage ) and place what you want to match in a capturing group on the right side.
'%#'|(')
Live Demo

Write regex for pattern like W00001

I am new to Regular Expressions and any help is highly appreciated.
Pattern like W00000,W00001,W00002,W00004
Must begin with W
Each string before comma must be six characters
String can only be repeated four times
Comma in between
Must not begin or end with comma
I tried below pattern and some others, like (^[W]{1}\d{5}){1,4}'), and none of them work correctly:
Select 'X' from dual Where REGEXP_LIKE ('W12342','(^[W]{1}\d{5})(?<!,)$')
My understanding is that the OP is saying the match should fail if the string begins or ends with a comma, not just that the preceding or trailing commas shouldn't match, so anchors are needed. Also, based on the regex he attempted, I infer that a single group, such as W00000, should match. So, I think the regex should be this, if the characters following the W must always be digits:
^W[:digit:]{5}(,W[:digit:]{5}){0,3}$
Or this, if they can be something other than digits:
^W[^,]{5}(,W[^,]{5}){0,3}$
UPDATE:
The OP posted the following comment:
I am on Oracle 11g and [:digit:] doesn't work. When I replace it with [0-9] it then works fine.
According to the documentation, Oracle 11g conforms to the POSIX regex standard and should be able to use POSIX character classes such as [:digit:]. However, I noticed in the docs that Oracle 11g does support Perl-style backslash character class abbreviations, which I didn't think was the case when I originally wrote this answer. In that case, the following should work:
^W\d{5}(,W\d{5}){0,3}$
Well in that case, you can do this:
(W[^,]{5},){3}W[^,]{5}
If I understood correctly, this should do it!
^W[0-9]{5}(,W[0-9]{5}){0,3}$
One W12345 pattern, maybe followed by one to 3 ,W12345 blocks.
Edit1: Adding ^$ to fail if there is a comma
Edit2: Fix class, since it fails on Oracle 11g

Regular expression to match specific variations of function

I am trying to construct a regular expression to find the text of the following variations.
NSLocalizedString(#"TEXT")
NSLocalizedStringFromTable(#"TEXT")
NSLocalizedStringWithDefaultValue(#"TEXT")
...
The goal is to extract TEXT. I have been able to construct a regex for each individual function or macro, e.g., (?<=NSLocalizedString)\(#"(.*?)". However, I am looking for a solution that does the job no matter what the name of the function as long as it starts with NSLocalizedString.
I assumed it was as simple as (?<=NSLocalizedString\w+)\(#"(.*?)", but that does't seem to do the trick.
How about this one?
/NSLocalizedString\w*\(#"(.*)"\)/
Explanation:
NSLocalizedString 'NSLocalizedString'
\w+ word characters (a-z, A-Z, 0-9, _) (0 or
more times (matching the most amount
possible))
\( '('
#" '#"'
( group and capture to \1:
.* any character except \n (0 or more times
(matching the most amount possible))
) end of \1
" '"'
\) ')'
The only reason your regex doesn't work is because the regex engine doesn't support variable length lookbehinds. The (?<=NSLocalizedString\w+) is variable length so can't be used.
Firstly it needs to be \w* not \w+, to allow your first example string to match.
If you move the \w* outside the lookbehind (?<=NSLocalizedString)\w* it will work just fine.
Alternatively, since you have to use a capturing group to grab the text value anyway, theres no need for the lookbehind at all. Change the (?<= to a (?: and it becomes a non-capturing group (which can be variable length), and then just grab your text value from group 1.
Your attempt was:
(?<=NSLocalizedString\w+)\(#"(.*?)"
Both of these minor changes should make it work:
(?<=NSLocalizedString)\w*\(#"(.*?)"
(?:NSLocalizedString\w*)\(#"(.*?)"
The following is actually not supported in Objective-C:
The solution that will extract exactly TEXT without using any groups is:
NSLocalizedString\w*\(#"\K[^"]*
It avoids the need to use a negative lookbehind (which can't be used for reasons I explain below) by using the \K modifier, which chops off anything before it from the match.