regex extract big query with numeric data - sql

how would I be able to grab the number 2627995 from this string
"hellotest/2627995?hl=en"
I want to grab the number 2627995, here is my current regex but it does not work when I use regex extract from big query
(\/)\d{7,7}
SELECT
REGEXP_EXTRACT(DESC, r"(\/)\d{7,7}")
AS number
FROM
`string table`
here is the output
Thank you!!

I think you just want to match all digits coming after the last path separator, before either the start of the query parameter, or the end of the URL.
SELECT REGEXP_EXTRACT(DESC, r"/(\d+)(?:\?|$)") AS number
FROM `string table`
Demo

Try this one: r"\/(\d+)"

Your code returns the slash because you captured it (see the parentheses in (\/)\d{7,7}). REGEXP_EXTRACT only returns the captured substring.
Thus, you could just wrap the other part of your regex with the parentheses:
SELECT
REGEXP_EXTRACT(DESC, r"/(\d{7})")
AS number
FROM
`string table`
NOTE:
In BigQuery, regex is specified with string literals, not regex literals (that are usually delimited with forward slashes), that is why you do not need to escape the / char (it is not a special regex metacharacter)
{7,7} is equal to {7} limiting quantifier, meaning seven occurrences.
Also, if you are sure the number is at the end of string or is followed with a query string, you can enhance it as
REGEXP_EXTRACT(DESC, r"/(\d+)(?:[?#]|$)")
where the regex means
/ - a / char
(\d+) - Group 1 (the actual output): one or more digits
(?:[?#]|$) - either ? or # char, or end of string.

Related

Remove template text on regexp_replace in Oracle's SQL

I am trying to remove template text like &#x; or &#xx; or &#xxx; from long string
Note: x / xx / xxx - is number, The length of the number is unknown, The cell type is CLOB
for example:
SELECT 'H'ello wor±ld' FROM dual
A desirable result:
Hello world
I know that regexp_replace should be used, But how do you use this function to remove this text?
You can use
SELECT REGEXP_REPLACE(col,'&&#\d+;')
FROM t
where
& is put twice to provide escaping for the substitution character
\d represents digits and the following + provides the multiple occurrences of them
ending the pattern with ;
or just use a single ampersand ('&#\d+;') for the pattern as in the case of Demo , since an ampersand has a special meaning for Oracle, a usage is a bit problematic.
In case you wanted to remove the entities because you don't know how to replace them by their character values, here is a solution:
UTL_I18N.UNESCAPE_REFERENCE( xmlquery( 'the_double_quoted_original_string' RETURNING content).getStringVal() )
In other words, the original 'H'ello wor±ld' should be passed to XMLQUERY as '"H'ello wor±ld"'.
And the result will be 'H'ello wo±ld'

How can I remove characters in a string after a specific special character (~) in snowflake sql?

I am using Snowflake SQL. I would like to remove characters from a string after a special character ~. How can I do that?
here is the whole scenario. Let me explain. I do have a string like 'CK#123456~fndkjfgdjkg'. Now, i want only the number after #.And not anything after ~. This is number length varies for that field value. It might be 1 or 5 or 3. And i want to add the condition in where class where this number is equal to check_num from other table after joining. I am trying REGEXP_SUBSTR(A.SRC_TXT, '(?<=CK#)(.+?\b)') = C.CHK_NUM in the where condition. I am getting the error as 'No repititive argument after ?'
You can use a regex for this
-- To remove just the character after a ~
select regexp_replace('fo~o bar','~.', '');
-- returns 'fo bar'
--If you want to keep the ~
select regexp_replace('fo~o bar','~.', '~');
-- returns 'fo~ bar'
--If you want to remove everything after the ~
select regexp_replace('fo~o bar','~.*', '');
--returns 'fo'
If you need to remove other specific character sets after a ~, you can probably do this with a slightly more complicated regex, but I'd need examples of your desired input/output to help with that.
EDIT for updated question
This regex replace should get what you need.
select regexp_replace('CK#123456~fndkjfgdjkg','CK#(\\d*)~.*', '\\1');
-- returns 123456
(\\d*) gets ANY number of digits in a row, and the \\1 causes it to replace the match with what was in the first set of parenthesis, which is your list of digits. the CK# and ~.* are there to make sure the whole string gets matched and replaced.
If the CK# can vary as well, you can use .*? like this.
select regexp_replace('ABCD123HI#123456~fndkjfgdjkg','.*?#(\\d*)~.*', '\\1')
-- returns 123456
I'd probably do something like the following, easy enough but not as cool as RegEx type of functions.
set my_string='fooo~12345';
set search_for_me = '~';
SELECT SUBSTR($my_string, 1, DECODE(position($search_for_me, $my_string), 0, length($my_string), position($search_for_me, $my_string)));
I hope this helps...Rich
It looks like lookahead and lookbehinds are not supported in REGEXP functions, they seem to work in the PATTERN clause of a LIST command. Snowflake documentation makes no mention either way of lookahead or lookbehinds.
In your example:
It seems that the query engine is looking for that repeating argument, where you are attempting a lookbehind
You have not specified what you wanted extracted. You have two capture groups, but in this scenario everything would be returned
Since you are looking to remove everything after ~ you have a delimiter, why not use it in your REGEXP_SUBSTR function?
Try the following:
SELECT $1,REGEXP_SUBSTR($1,'\\w+#(.+?)~',1,1,'is',1)
FROM VALUES
('CK#123456~fndkjfgdjkg')
,('QH#128fklj924~fndkjfgdjkg')
;
This looks for:
One or more word characters
Followed by #
Capturing one or more characters upto and not including ~
Returns the characters within the capture group
You can change the .+? to \\d+? to make sure the pattern is only digits. Backslashes must be escaped with a backslash.
The descriptions for each argument of the function can be found here:
https://docs.snowflake.net/manuals/sql-reference/functions/regexp_substr.html
You could check this!!
select substr('CK#123456~fndkjfgdjkg',4,6) from dual;
OUTPUT
123456
https://docs.snowflake.net/manuals/sql-reference/functions/substr.html

How can i find two extra characters in DB2 and list down those in a column?

I have written this expression for checking extra characters and I am counting the occurrence of those extra characters.
REGEXP_COUNT('Mr.John® Êlite', regexp_extract ('Mr.John® Êlite','[^\x00-\x7F]'))
It's working fine if the string has only one extra character e.g
Mr. John®
It will take out ® and give me count as 1.
But if my string has two extra characters, it will only pick the first one and ignore the second character e.g
Mr.John® Êlite
My function will extract ® and ignore Ê.
I have tried subquery as well.Not working.Need help
As noted by Wiktor Stribiżew REGEXP_COUNT needs just a source string and regexp:
db2 "values REGEXP_COUNT('Mr.John® Êlite', '[^\x00-\x7F]')"
1
-----------
2
Because you used REGEXP_EXTRACT, it does extract the first occurrence only:
The REGEXP_EXTRACT scalar function returns one occurrence of a substring of a string that matches the regular expression pattern.
and only then you do actual count.

Oracle sql REGEXP_REPLACE expression to replace a number in a string matching a pattern

I have a string 'ABC.1.2.3'
I wish to replace the middle number with 1.
Input 'ABC.1.2.3'
Output 'ABC.1.1.3'
Input 'XYZ.2.2.1'
Output 'XYZ.2.1.1'
The is, replace the number after second occurrence of '.' with 1.
I know my pattern is wrong, the sql that I have at the moment is :
select REGEXP_REPLACE ('ABC.1.2.8', '(\.)', '.1.') from dual;
You can use capturing groups to refer to surrounding numbers in replacement string later:
select REGEXP_REPLACE ('ABC.1.2.8', '([0-9])\.[0-9]+\.([0-9])', '\1.1.\2') from dual;
You could use
^([^.]*\.[^.]*\.)\d+(.*)
See a demo on regex101.com.
This is:
^ # start of the string
([^.]*\.[^.]*\.) # capture anything including the second dot
\d+ # 1+ digits
(.*) # the rest of the string up to the end
This is replaced by
$11$2

Oracle SQL - find string pattern in string

I need to extract some text from a string, but only where the text matches a string pattern. The string pattern will consist of...
2 numbers, a forward slash and 6 numbers
e.g. 12/123456
or
2 numbers, a forward slash, 6 numbers, a hyphen and 2 numbers
e.g. 12/123456-12
I know how to use INSTR to find a specific string. Is it possible to find a string that matches a specific pattern?
You'll need to use regexp_like to filter the results and regexp_substr to get the substring.
Here is roughly what it should look like:
select id, myValue, regexp_substr(myValue, '[0-9]{2}/[0-9]{6}') as myRegExMatch
from Foo
where regexp_like(myValue,'^([a-zA-Z0-9 ])*[0-9]{2}/[0-9]{6}([a-zA-Z0-9 ])*$')
with a link to a SQLFiddle that you can see in action and adjust to your taste.
The regexp_like provided in the sample above takes into consideration the alphanumerics and whitespace characters that may bound the number pattern.
Use regexp_like.
where regexp_like(col_name,'\s[0-9]{2}\/[0-9]{6}(-[0-9]{2})?\s')
\s matches a space. Include them at the start and end of pattern.
[0-9]{2}\/[0-9]{6} matches 2 numerics, a forward slash and 6 numerics
(-[0-9]{2})? is optional for a hyphen and 2 numerics following the previous pattern.
regexp_like(col_name,'^\d{2}/\d{6}($|-\d{2}$)')
or
regexp_like(col_name,'^\d{2}/\d{6}(-\d{2})?$')