Redshift Regex count. Repetition operator error - sql

I am trying to do a simple regex pattern match in Redshift.
I have this code and get the following error:
REGEXP_COUNT ( "code", '^(?=.{8}$)[A-z]{2,5}[0-9]{3,6}$' )
ERROR: Invalid preceding regular expression prior to repetition operator. The error occured while parsing the regular expression fragment: '^(?>>>HERE>>>=.{8}$)[A-'.
The pattern works fine when tested in Python and in online checkers, so I'm guessing it's a regex dialect problem. I checked the PostgreSQL regex documentation for help, as I can't find much detail on Redshift itself.
Thanks,
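One way to sidestep the lookahead, sketched in Python since the pattern was tested there: the lookahead (?=.{8}$) only asserts a total length of 8, so the check can be split into a length test plus a lookahead-free pattern. The helper name below is hypothetical, and [A-z] is narrowed to [A-Za-z] on the assumption that only letters were intended (the range A-z also covers [, \, ], ^, _ and the backtick). Whether the simplified pattern is accepted by Redshift's regex dialect has not been verified here.

```python
import re

# Lookahead-free part of the pattern: 2-5 letters followed by 3-6 digits.
# Assumption: [A-z] in the original was meant as [A-Za-z].
LETTERS_THEN_DIGITS = re.compile(r"[A-Za-z]{2,5}[0-9]{3,6}")


def is_valid_code(code: str) -> bool:
    """Hypothetical helper with the intent of ^(?=.{8}$)[A-z]{2,5}[0-9]{3,6}$."""
    # The explicit length test replaces the (?=.{8}$) lookahead.
    return len(code) == 8 and LETTERS_THEN_DIGITS.fullmatch(code) is not None


if __name__ == "__main__":
    for sample in ["AB123456", "ABCDE123", "ABC1234", "ABCDEF12"]:
        print(sample, is_valid_code(sample))
```

In SQL the same split would presumably become a length comparison on the column combined with a count or match against the simpler pattern.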

Related

Extract characters between a string and the first occurrence of something in BigQuery

I want to extract a set of characters between "u1=" and the first semi-colon using a regex. For instance, given the following string: id=1w54;name=nick;u1=blue;u2=male;u3=ohio;u5=
The desired regex output should be just blue.
I tested (?<=u1=)[^;]* on https://regex101.com and it works. However, when I run this in BigQuery, using regexp_extract(string, '(?<=u1=)[^;]*') , I get an error that reads "Cannot parse regular expression: invalid perl operator: (?<"
I'm confused why this isn't working in BQ. Any help would be appreciated.
You can use regexp_extract() like this:
regexp_extract(string, 'u1=([^;]+)')
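The idea behind that answer is to move the "u1=" prefix out of the lookbehind and into the match itself, then keep only the capture group. A minimal Python sketch of the same pattern, using the sample string from the question (the function name is hypothetical):

```python
import re

SAMPLE = "id=1w54;name=nick;u1=blue;u2=male;u3=ohio;u5="


def extract_u1(text: str):
    """Return the value between 'u1=' and the next ';', or None if absent."""
    # The prefix is matched normally; only the group in parentheses is returned.
    match = re.search(r"u1=([^;]+)", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_u1(SAMPLE))
```

Because the pattern uses + rather than *, an empty value such as "u1=;" yields no match instead of an empty string.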

Django __iregex crashing for regular expression ^(\\. \\.)$

When I try to make an __iregex call using the regular expression '^(\\. \\.)$' I get:
DataError: invalid regular expression: parentheses () not balanced
I am using the PostgreSQL backend, so the Django documentation states that the equivalent SQL command should be
SELECT ... WHERE title ~* '^(\\. \\.)$';
When I run this query manually through the PSQL command line it works fine. Is there some bug with Django that I don't know about that is causing this to crash?
Edit: Also, it fails for variations of this regular expression, for example
'^(S\\. \\.)$'
'^(\\. S\\.)$'
'^(\\. \\.S)$'
The solution is to replace all " " characters with \s before sending the regexp into __iregex.
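A minimal sketch of that replacement step, assuming the pattern is built as a Python string before being handed to the queryset (the helper name and the model in the comment are hypothetical). Note that \s matches any whitespace character, so it is slightly broader than a literal space.

```python
import re


def escape_spaces(pattern: str) -> str:
    """Replace every literal space in a regex pattern with \\s."""
    return pattern.replace(" ", r"\s")


# The pattern from the question, written as a raw string.
safe_pattern = escape_spaces(r"^(\. \.)$")

if __name__ == "__main__":
    print(safe_pattern)
    # Hypothetical usage with a model that has a title field:
    #   Article.objects.filter(title__iregex=safe_pattern)
    print(bool(re.search(safe_pattern, ". .")))
```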

BigQuery - Illegal Escape Sequence

I'm having an issue matching a regular expression in BigQuery. I have the following line of code that tries to identify user agents:
when regexp_contains((cs_user_agent), '^AppleCoreMedia\/1\.(.*)iPod') then "iOS App - iPod"
However, BigQuery doesn't seem to like escape sequences for some reason and I get this error that I can't figure out:
Syntax error: Illegal escape sequence: \/ at [4:63]
This code works fine in a regex validator I use, but BigQuery is unhappy with it and I can't figure out why. Thanks in advance for the help.
Use a raw string literal (note the r prefix) so that the backslashes reach the regex engine unchanged: regexp_contains((cs_user_agent), r'^AppleCoreMedia\/1\.(.*)iPod')
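Python has the same r prefix, so the effect can be illustrated there. The sketch below also shows that the forward slash does not need escaping in the pattern at all, at least in Python's re module. The user agent strings are made up for illustration, and the function name is hypothetical.

```python
import re

# Raw strings keep the backslashes exactly as typed.
ESCAPED_SLASH = r"^AppleCoreMedia\/1\.(.*)iPod"
# The forward slash is not a regex metacharacter, so the escape is optional.
PLAIN_SLASH = r"^AppleCoreMedia/1\.(.*)iPod"

# Made-up sample values, not taken from real logs.
IPOD_UA = "AppleCoreMedia/1.0.0.12B466 (iPod; U; CPU OS 8_1_3 like Mac OS X)"
IPHONE_UA = "AppleCoreMedia/1.0.0.12B466 (iPhone; U; CPU OS 8_1_3 like Mac OS X)"


def classify(user_agent: str, pattern: str):
    """Return the label from the question when the pattern matches, else None."""
    return "iOS App - iPod" if re.search(pattern, user_agent) else None


if __name__ == "__main__":
    print(classify(IPOD_UA, ESCAPED_SLASH))
    print(classify(IPOD_UA, PLAIN_SLASH))
    print(classify(IPHONE_UA, PLAIN_SLASH))
```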

Workaround for Impala Regex lookahead and lookbehind

If I use Hive, the query below works fine. But if I use Impala, it throws an error:
select regexp_replace("foobarbarfoo","bar(?=bar)","<NA>");
WARNINGS: Could not compile regexp pattern: bar(?=bar)
Error: invalid perl operator: (?=
Basically, Impala doesn't support lookahead and lookbehind:
https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_incompatible_changes.html#incompatible_changes_200
Is there a workaround for this today? Maybe use UDF?
Thanks.
Since you are using regexp_replace, match and capture the part of the string that you want to keep as required context, then put it back with a backreference. See the regexp_replace Impala reference:
These examples show how you can replace parts of a string matching a pattern with replacement text, which can include backreferences to any () groups in the pattern string. The backreference numbers start at 1, and any \ characters must be escaped as \\.
So, here, you may use
select regexp_replace("foobarbarfoo","bar(bar)","<NA>\\1");
                                         ^   ^       ^^^
Note that this will not replace consecutive matches. It does work in the current scenario, though: foobarbarfoo turns into foo<NA>barfoo. (The Go regex engine is also RE2, hence this option is chosen at regex101.com.)
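To make the difference concrete, here is a small Python sketch comparing the lookahead version with the capture-and-backreference version. Python's re module supports both, so it can stand in for the comparison; the function names are hypothetical, and Python writes the backreference as \1 in a raw string rather than \\1.

```python
import re


def replace_with_lookahead(text: str) -> str:
    """Original approach: 'bar' is replaced only when followed by 'bar'."""
    return re.sub(r"bar(?=bar)", "<NA>", text)


def replace_with_capture(text: str) -> str:
    """Workaround: consume the trailing 'bar' and restore it with \\1."""
    return re.sub(r"bar(bar)", r"<NA>\1", text)


if __name__ == "__main__":
    for sample in ["foobarbarfoo", "foobarbarbarfoo"]:
        print(sample, replace_with_lookahead(sample), replace_with_capture(sample))
```

With three consecutive bar occurrences the two versions diverge: the lookahead replaces the first two, while the capture version consumes the second bar as context and so replaces only the first.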

Error: Failed to parse regular expression "": pattern too large - compile failed

I have run into the following behavior:
I have a BQ query with hundreds of fields extracted using the REGEXP_EXTRACT function.
I added a new expression and got the following error: Failed to parse regular expression "": pattern too large - compile failed.
When I query this expression alone, everything runs fine; in the larger query, I get the error.
Here is a reproduction of the problem based on the GitHub sample data and a simple regex:
SELECT repository.description,
REGEXP_EXTRACT(repository.description,r'(?:\w){0}(\w)') as Pos1,
REGEXP_EXTRACT(repository.description,r'(?:\w){1}(\w)') as Pos2,
REGEXP_EXTRACT(repository.description,r'(?:\w){2}(\w)') as Pos3,
.
. here it goes on and on in the same pattern
.
REGEXP_EXTRACT(repository.description,r'(?:\w){198}(\w)') as Pos199,
REGEXP_EXTRACT(repository.description,r'(?:\w){199}(\w)') as Pos200,
REGEXP_EXTRACT(repository.description,r'(?:\w){200}(\w)') as Pos201,
FROM [publicdata:samples.github_nested] LIMIT 1000
It returns:
Failed to parse regular expression "(?:\w){162}(\w)": pattern too large - compile failed
but when running:
SELECT repository.description,
REGEXP_EXTRACT(repository.description,r'(?:\w){162}(\w)') as Pos163,
FROM [publicdata:samples.github_nested] LIMIT 1000
Everything runs OK...
Is there a limit to # of REGEXP_EXTRACTs, or their combined complexity, that can be used in a single query?
I'll look into the issue. As a workaround, it looks like what you're trying to do is to split out the field into separate fields per character position... so turn "abc" into {pos1: "a", pos2: "b", pos3: "c"}. Is that correct? If so, you might want to try the LEFT() and RIGHT() functions. As in
LEFT(repository.description, 1) as pos1,
RIGHT(LEFT(repository.description, 2), 1) as pos2,
RIGHT(LEFT(repository.description, 3), 1) as pos3
This should use fewer resources than compiling 200 regular expressions (although it is still not likely to be fast).
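The LEFT/RIGHT combination can be modeled in a few lines of Python to see what it returns per position. This is only a sketch of the string logic under the assumption that LEFT and RIGHT behave like plain prefix and suffix slices; the function names are hypothetical and nothing here has been run against BigQuery.

```python
def left(text: str, n: int) -> str:
    """First n characters of text (model of LEFT)."""
    return text[:n]


def right(text: str, n: int) -> str:
    """Last n characters of text (model of RIGHT)."""
    return text[-n:] if n > 0 else ""


def char_at(text: str, pos: int) -> str:
    """Character at 1-based position pos, taken as RIGHT(LEFT(text, pos), 1)."""
    return right(left(text, pos), 1)


def split_positions(text: str, count: int) -> dict:
    """Build {pos1: ..., pos2: ...} for the first count positions."""
    return {f"pos{i}": char_at(text, i) for i in range(1, count + 1)}


if __name__ == "__main__":
    print(split_positions("abc", 3))
    # Caveat of this model: past the end of the string, the last
    # character is repeated instead of an empty value being returned.
    print(char_at("ab", 3))
```

In this model a position beyond the end of the string repeats the last character, so a length check may be needed if short values matter. The regex version also differs in that it is unanchored and only considers word characters.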