I need to extract some text from a string, but only where the text matches a string pattern. The string pattern will consist of...
2 numbers, a forward slash and 6 numbers
e.g. 12/123456
or
2 numbers, a forward slash, 6 numbers, a hyphen and 2 numbers
e.g. 12/123456-12
I know how to use INSTR to find a specific string. Is it possible to find a string that matches a specific pattern?
You'll need to use regexp_like to filter the results and regexp_substr to get the substring.
Here is roughly what it should look like:
select id, myValue, regexp_substr(myValue, '[0-9]{2}/[0-9]{6}') as myRegExMatch
from Foo
where regexp_like(myValue,'^([a-zA-Z0-9 ])*[0-9]{2}/[0-9]{6}([a-zA-Z0-9 ])*$')
with a link to a SQLFiddle that you can see in action and adjust to your taste.
The regexp_like provided in the sample above takes into consideration the alphanumerics and whitespace characters that may bound the number pattern.
Use regexp_like.
where regexp_like(col_name,'\s[0-9]{2}\/[0-9]{6}(-[0-9]{2})?\s')
\s matches a space. Include them at the start and end of pattern.
[0-9]{2}\/[0-9]{6} matches 2 numerics, a forward slash and 6 numerics
(-[0-9]{2})? is optional for a hyphen and 2 numerics following the previous pattern.
regexp_like(col_name,'^\d{2}/\d{6}($|-\d{2}$)')
or
regexp_like(col_name,'^\d{2}/\d{6}(-\d{2})?$')
Related
REGEXP_REPLACE("My dog is funny and happy", r"(\S+ \S+ \S+)", r"*") This is my SQL for achieving this. My output should look something like this = My dog is funny *and happy
When I try the above query it removes the first few words. How do I work this out?
You should use a backreference:
REGEXP_REPLACE("My dog is funny and happy", r"^((?:\S+\s+){4})", r"\1*")
REGEXP_REPLACE("My dog is funny and happy", r"^(?:\S+\s+){4}", r"\0*")
See the regex demo. Details:
^ - start of string
((?:\S+\s+){4}) - Group 1 (\1 in the replacement will refer to this group value): four occurrences of one or more non-whitespaces followed with one or more whitespaces.
\0 refers to the whole match value.
See the regexp_replace reference:
REGEXP_REPLACE(value, regexp, replacement)
Returns a STRING where all substrings of value that match regular
expression regexp are replaced with replacement.
You can use backslashed-escaped digits (\1 to \9) within the
replacement argument to insert text matching the corresponding
parenthesized group in the regexp pattern. Use \0 to refer to the
entire matching text.
how would I be able to grab the number 2627995 from this string
"hellotest/2627995?hl=en"
I want to grab the number 2627995, here is my current regex but it does not work when I use regex extract from big query
(\/)\d{7,7}
SELECT
REGEXP_EXTRACT(DESC, r"(\/)\d{7,7}")
AS number
FROM
`string table`
here is the output
Thank you!!
I think you just want to match all digits coming after the last path separator, before either the start of the query parameter, or the end of the URL.
SELECT REGEXP_EXTRACT(DESC, r"/(\d+)(?:\?|$)") AS number
FROM `string table`
Demo
Try this one: r"\/(\d+)"
Your code returns the slash because you captured it (see the parentheses in (\/)\d{7,7}). REGEXP_EXTRACT only returns the captured substring.
Thus, you could just wrap the other part of your regex with the parentheses:
SELECT
REGEXP_EXTRACT(DESC, r"/(\d{7})")
AS number
FROM
`string table`
NOTE:
In BigQuery, regex is specified with string literals, not regex literals (that are usually delimited with forward slashes), that is why you do not need to escape the / char (it is not a special regex metacharacter)
{7,7} is equal to {7} limiting quantifier, meaning seven occurrences.
Also, if you are sure the number is at the end of string or is followed with a query string, you can enhance it as
REGEXP_EXTRACT(DESC, r"/(\d+)(?:[?#]|$)")
where the regex means
/ - a / char
(\d+) - Group 1 (the actual output): one or more digits
(?:[?#]|$) - either ? or # char, or end of string.
I have written this expression for checking extra characters and I am counting the occurrence of those extra characters.
REGEXP_COUNT('Mr.John® Êlite', regexp_extract ('Mr.John® Êlite','[^\x00-\x7F]'))
It's working fine if the string has only one extra character e.g
Mr. John®
It will take out ® and give me count as 1.
But if my string has two extra characters, it will only pick the first one and ignore the second character e.g
Mr.John® Êlite
My function will extract ® and ignore Ê.
I have tried subquery as well.Not working.Need help
As noted by Wiktor Stribiżew REGEXP_COUNT needs just a source string and regexp:
db2 "values REGEXP_COUNT('Mr.John® Êlite', '[^\x00-\x7F]')"
1
-----------
2
Because you used REGEXP_EXTRACT, it does extract the first occurrence only:
The REGEXP_EXTRACT scalar function returns one occurrence of a substring of a string that matches the regular expression pattern.
and only then you do actual count.
I need to find rows where the phone number field contains unexpected characters.
Most of the values in this field look like:
123456-7890
This is expected. However, we are also seeing character values in this field such as * and #.
I want to find all rows where these unexpected character values exist.
Expected:
Numbers are expected
Hyphen with numbers is expected (hyphen alone is not)
NULL is expected
Empty is expected
Tried this:
WHERE phone_num is not like ' %[0-9,-,' ' ]%
Still getting rows where phone has numbers.
from https://regexr.com/3c53v address you can edit regex to match your needs.
I am going to use example regex for this purpose
select * from Table1
Where NOT REGEXP_LIKE(PhoneNumberColumn, '^[+]*[(]{0,1}[0-9]{1,4}[)]{0,1}[-\s\./0-9]*$')
You can use translate()
...
WHERE translate(Phone_Number,'a1234567890-', 'a') is NOT NULL
This will strip out all valid characters leaving behind the invalid ones. If all the characters are valid, the result would be NULL. This does not validate the format, for that you'd need to use REGEXP_LIKE or something similar.
You can use regexp_like().
...
WHERE regexp_like(phone_num, '[^ 0123456789-]|^-|-$')
[^ 0123456789-] matches any character that is not a space nor a digit nor a hyphen. ^- matches a hyphen at the beginning and -$ on the end of the string. The pipes are "ors" i.e. a|b matches if pattern a matches of if pattern b matches.
Oracle has REGEXP_LIKE for regex compares:
WHERE REGEXP_LIKE(phone_num,'[^0-9''\-]')
If you're unfamiliar with regular expressions, there are plenty of good sites to help you build them. I like this one
I have a questions regarding below data.
You clearly can see each EMP_IDENTIFIER has connected with EMP_ID.
So I need to pull only identifier which is 10 characters that will insert another column.
How would I do that?
I did some traditional way, using INSTR, SUBSTR.
I just want to know is there any other way to do it but not using INSTR, SUBSTR.
EMP_ID(VARCHAR2)EMP_IDENTIFIER(VARCHAR2)
62049 62049-2162400111
6394 6394-1368000222
64473 64473-1814702333
61598 61598-0876000444
57452 57452-0336503555
5842 5842-0000070666
75778 75778-0955501777
76021 76021-0546004888
76274 76274-0000454999
73910 73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the second argument of the function is 1); it looks for the second occurrence (the argument 2); it doesn't do any special matching such as ignore case (the argument "null" to the function), and when the match is found, return the fragment of the match pattern included in the second set of parentheses (the last argument, 2, to the regexp function). The second fragment is the \d+ - the sequence of digits, without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.