BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

I need to extract 8 digits after a known string:
| MyString | Extract: |
| ---------------------------- | -------- |
| mypasswordis 12345678 | 12345678 |
| # mypasswordis 12345678 | 12345678 |
| foobar mypasswordis 12345678 | 12345678 |
I can do this with regex like:
(?<=mypasswordis.*)[0-9]{8}
However, when I want to do this in BigQuery using the REGEXP_EXTRACT command, I get the error message, "Cannot parse regular expression: invalid perl operator: (?<".
I searched through the re2 library and saw there doesn't seem to be an equivalent for positive lookbehind.
Is there any way I can do this using other methods? Something like
SELECT REGEXP_EXTRACT(MyString, r"(?<=mypasswordis.*)[0-9]{8}")

You need a capturing group here to extract a part of a pattern, see the REGEXP_EXTRACT docs you linked to:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group. If the expression does not contain a capturing group, the function returns the entire matching substring.
Also, the .* pattern is too costly; you only need to match the whitespace between the word and the digits.
In general, to "convert" a (?<=mypasswordis).* pattern with a positive lookbehind, you can use mypasswordis(.*).
In this case, you can use
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]{8})")
Or just
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]+)")
See the re2 regex online test.
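To check the capturing-group version against the three sample rows from the question, a minimal BigQuery sketch could look like this. The CTE name samples and the output column name extracted are made up for illustration; only MyString comes from the question.

```sql
-- Sample rows copied from the question; "samples" is a hypothetical CTE name.
WITH samples AS (
  SELECT 'mypasswordis 12345678' AS MyString UNION ALL
  SELECT '# mypasswordis 12345678' UNION ALL
  SELECT 'foobar mypasswordis 12345678'
)
SELECT
  MyString,
  -- RE2 has no lookbehind, so the known prefix is matched as part of the
  -- pattern and only the capturing group is returned.
  REGEXP_EXTRACT(MyString, r'mypasswordis\s*([0-9]{8})') AS extracted
FROM samples;
```

Rows where the prefix or the eight digits are missing should come back as NULL, since REGEXP_EXTRACT returns NULL when there is no match.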

Avoid regular expressions where you can; they can be quite slow. Try SUBSTR and INSTR instead, for example:
SELECT SUBSTR(MyString, INSTR(MyString,'mypasswordis') + LENGTH('mypasswordis')+1)
Otherwise, Wiktor Stribiżew's answer is probably the right one.

Use REGEXP_REPLACE instead to match what you don't want and delete that:
REGEXP_REPLACE(str, r'^.*mypasswordis ', '')

Related

How to add delimiter to String after every n character using hive functions?

I have the hive table column value as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any hive string manipulation function support available to achieve the above output?

For fixed length this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution, which works for strings of any length: split the string at each empty position that is preceded by two digits (\\d{2}) counted from the end of the previous match (\\G), expressed as a lookbehind ((?<= )); then concatenate the array with ':' and remove the delimiter at the end (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If it can contain not only digits, use dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')

This is actually quite simple if you're familiar with regex & lookahead.
Replace every 2 characters that are followed by another character, with themselves + ':'
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
| _c0 |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+

Get the last part of the value returned by split_part() function

I have a file_path string separated by forward slashes. I want to split them based on the forward slashes and return the file name.
INPUT
//a/b/c/xyz.png
OUTPUT
xyz.png
CURRENT SOLUTION
SELECT REVERSE(SPLIT_PART(REVERSE('//a/b/c/xyz.py'), '/', 1)) as "file_name";
Is there a more efficient way of doing this?

regexp_match() is more concise:
select (regexp_match('//a/b/c/xyz.py', '[^/]+$'))[1]

I would just use regexp_replace() to remove everything before the last slash (included):
select regexp_replace('//a/b/c/xyz.png', '.*/', '')
Demo on DB Fiddle:
| regexp_replace |
| :------------- |
| xyz.png |
You can also use substring(), which may or may not be more efficient:
substring('//a/b/c/xyz.png' from '[^/]*$')

PostgreSQL 14 and later support a negative index in split_part, which makes this a straightforward operation.
split_part
Splits string at occurrences of delimiter and returns the n'th field (counting from one), or when n is negative, returns the |n|'th-from-last field.
split_part('abc,def,ghi,jkl', ',', -2) → ghi
In this particular scenario:
SELECT SPLIT_PART('//a/b/c/xyz.py', '/', -1) as "file_name";

Extract particular character using StandardSQL

I would like to extract a particular value from strings using StandardSQL.
I would like to extract the value after limit=.
For instance, from the strings below I would like to extract 10, 3 and null. I would also like to replace every null with 1.
partner=&limit=10
partner=aex&limit=3&filters%5Bpartner%5D
partner=aex&limit=&filters%5Bpartner%5D
I only know how to use the substring function, but the problem here is that the position of limit= is not always the same.

You can use REGEXP_EXTRACT. For example:
SELECT REGEXP_EXTRACT('partner=aex&limit=3&filters%5Bpartner%5D', 'limit=(\\d+)');
+-------+
| $col1 |
+-------+
| 3 |
+-------+
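The question also asks for missing values to become 1. One way to sketch that in BigQuery Standard SQL is to wrap the extraction in IFNULL, since REGEXP_EXTRACT returns NULL when the pattern does not match. The column aliases query_string and limit_value are hypothetical names chosen for this example.

```sql
-- The three input strings are the ones given in the question.
SELECT
  query_string,
  -- No digits after "limit=" means no match, so REGEXP_EXTRACT yields NULL
  -- and IFNULL substitutes the default '1'.
  IFNULL(REGEXP_EXTRACT(query_string, r'limit=(\d+)'), '1') AS limit_value
FROM UNNEST([
  'partner=&limit=10',
  'partner=aex&limit=3&filters%5Bpartner%5D',
  'partner=aex&limit=&filters%5Bpartner%5D'
]) AS query_string;
```

The result is a STRING; if a number is needed, the whole expression can be wrapped in CAST(... AS INT64).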

Query to search substring in column

I have a table that has a substring value in the column and I want to write a query that checks if input string has the substring.
My table looks like:
| company | host |
| ------- | ---------- |
| ebay | ebay.com |
| google | google.com |
| yahoo | yahoo.com |
My input will be like www.ebay.com or https://www.ebay.com or www.qa.ebay.com or www.dev.ebay.com..
If I get any of the inputs I want to return the first record.
I tried looking at CHARINDEX and INSTR, but they work in reverse. In my scenario the substring to search for is in the table, and the full string is the input.
Any help is appreciated.

You can use like for this, but you also need string concatenation. In ANSI standard SQL, this looks like:
select t.*
from t
where #inputstring like concat('%.', t.host)
where #inputstring is the string you are inputting.
Note: You can also use the concatenation infix operation, which is typically || (standard) or +.
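As a rough illustration of the || variant, the sketch below runs the same comparison against the three rows from the question. It assumes a PostgreSQL-style dialect (VALUES inside a CTE, || for concatenation), and the literal 'www.qa.ebay.com' simply stands in for the input string.

```sql
-- Inline copy of the table from the question; "t" mirrors the answer's table name.
WITH t (company, host) AS (
  VALUES ('ebay', 'ebay.com'),
         ('google', 'google.com'),
         ('yahoo', 'yahoo.com')
)
SELECT t.*
FROM t
-- The input is on the left of LIKE; the pattern is built from the stored host.
WHERE 'www.qa.ebay.com' LIKE '%.' || t.host;
```

Because the pattern requires a dot before the host, an input with no subdomain (for example a bare ebay.com) would not match; using '%' || t.host instead would relax that, at the cost of also matching hosts that merely end in the same text.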

You can use the SQL wildcard like so:
SELECT * FROM table WHERE host LIKE '%ebay.com';

Go for this:
SELECT * FROM table WHERE host LIKE '%SearchString%'
It will pull all rows containing the SearchString.

You can achieve this using like operator.
Select * from yourtable
where ? like concat('%', company, '%');
Replace the ? parameter with your input.

PostgreSQL String search for partial patterns removing extraneous characters

Looking for a simple SQL (PostgreSQL) regular expression or similar solution (maybe soundex) that will allow a flexible search, so that dashes, spaces and such are ignored and only the raw characters are compared against the table.
Currently using:
SELECT * FROM Productions WHERE part_no ~* '%search_term%'
If the user types UTR-1, it fails to bring up UTR1 or UTR 1 stored in the database.
Likewise, no match happens when a part_no has a dash and the user omits this character (or vice versa).
EXAMPLE search for part UTR-1 should find all matches below.
UTR1
UTR --1
UTR 1
any suggestions...

You may well find the official, built-in (from 8.3 at least) full-text search capabilities in PostgreSQL worth looking at:
http://www.postgresql.org/docs/8.3/static/textsearch.html
For example:
It is possible for the parser to produce overlapping tokens from the
same piece of text.
As an example, a hyphenated word will be reported both as the entire word
and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
alias | description | token
-----------------+------------------------------------------+---------------
numhword | Hyphenated word, letters and digits | foo-bar-beta1
hword_asciipart | Hyphenated word part, all ASCII | foo
blank | Space symbols | -
hword_asciipart | Hyphenated word part, all ASCII | bar
blank | Space symbols | -
hword_numpart | Hyphenated word part, letters and digits | beta1

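SELECT *
FROM Productions
WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g')
= REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '', 'g')
In PostgreSQL, REGEXP_REPLACE replaces only the first match unless the 'g' flag is given, so the flag is needed to strip every non-alphanumeric character. Create an index on REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g') for this to work fast.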
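The index mentioned above could be sketched as a PostgreSQL expression index like the one below. The index name productions_part_no_norm_idx is made up for this example, and the 'g' flag is included so that every non-alphanumeric character is removed, not only the first one.

```sql
-- Hypothetical index name; the expression must match the one used in the query.
CREATE INDEX productions_part_no_norm_idx
  ON Productions (REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g'));

-- A lookup written with the identical expression can then use the index.
SELECT *
FROM Productions
WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g')
    = REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '', 'g');
```

The planner only considers an expression index when the query repeats the indexed expression exactly, so the pattern and the flags should be kept identical in both places. The comparison is case-sensitive as written; wrapping both sides in UPPER() would be one way to change that, with the index adjusted to match.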
