What does '^\d+\D+$' mean in regexp?

I came across this expression:
select * from table where regexp_like(field, '^\d+\D+$');
I'm sure of what the expression does, but please can someone explain what '^\d+\D+$' refers to exactly?
Thanks.

^ beginning of string
\d single digit
+ one or more occurrences of preceding
\D nondigit character
+ one or more occurrences
$ end of string
So, it means one or more digits followed by one or more nondigits, and that should be the whole string, not a substring.
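As a quick check of the breakdown above, here is a minimal Python sketch. Python's `re` module stands in for the database engine, which is an assumption: regexp_like runs in the database's own regex engine, although `^`, `\d`, `\D`, `+` and `$` should behave the same way for simple single-line values like these. The sample values are made up.

```python
import re

# Same pattern as in the query: digits first, then non-digits, nothing else.
PATTERN = re.compile(r'^\d+\D+$')

def matches(value: str) -> bool:
    """Return True when the whole value is digits followed by non-digits."""
    return PATTERN.search(value) is not None

# Hypothetical sample values, not taken from any real table.
samples = ["123abc", "7 km", "abc123", "123", "12ab34"]
for s in samples:
    print(f"{s!r}: {matches(s)}")
```

Only the first two samples match. "abc123" fails at `^\d+`, "123" has no non-digit part, and "12ab34" has digits left over before `$`.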

Related

regex extract big query with numeric data

How can I grab the number 2627995 from this string?
"hellotest/2627995?hl=en"
Here is my current regex, but it does not return the number when I use REGEXP_EXTRACT in BigQuery:
(\/)\d{7,7}
SELECT
REGEXP_EXTRACT(DESC, r"(\/)\d{7,7}")
AS number
FROM
`string table`
here is the output
Thank you!!
I think you just want to match all digits coming after the last path separator, before either the start of the query parameter, or the end of the URL.
SELECT REGEXP_EXTRACT(DESC, r"/(\d+)(?:\?|$)") AS number
FROM `string table`
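To see what that pattern accepts, here is a small Python sketch. It uses Python's `re` instead of BigQuery's engine, which is an assumption, but the constructs involved (a capturing group, a non-capturing group and `$`) are common to both. The helper name `extract_id` and the extra URLs are hypothetical.

```python
import re

def extract_id(url: str):
    """Return the digits after a '/' that are followed by '?' or the end of the string."""
    m = re.search(r"/(\d+)(?:\?|$)", url)
    return m.group(1) if m else None

print(extract_id("hellotest/2627995?hl=en"))  # digits before the query string
print(extract_id("hellotest/2627995"))        # digits at the very end
print(extract_id("hellotest/abc?hl=en"))      # no digits after the slash, so None
```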
Try this one: r"\/(\d+)"
Your code returns the slash because that is what you captured (see the parentheses in (\/)\d{7,7}). When the pattern contains a capturing group, REGEXP_EXTRACT returns only the captured substring.
So just move the parentheses to the part of the regex you want returned:
SELECT
REGEXP_EXTRACT(DESC, r"/(\d{7})")
AS number
FROM
`string table`
NOTE:
In BigQuery, a regex is written as a string literal, not as a regex literal delimited with forward slashes, so you do not need to escape the / char (it is not a regex metacharacter).
{7,7} is equivalent to the limiting quantifier {7}, meaning exactly seven occurrences.
Also, if you are sure the number is at the end of the string or is followed by a query string or fragment, you can tighten it to
REGEXP_EXTRACT(DESC, r"/(\d+)(?:[?#]|$)")
where the regex means
/ - a / char
(\d+) - Group 1 (the actual output): one or more digits
(?:[?#]|$) - either ? or # char, or end of string.
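The effect of where the parentheses sit can be reproduced outside BigQuery. The sketch below uses Python's `re` and returns group 1 of the first match, which is meant to mimic how REGEXP_EXTRACT treats a pattern with one capturing group; the helper `first_group` is a hypothetical name, not a BigQuery function.

```python
import re

url = "hellotest/2627995?hl=en"

def first_group(pattern: str, text: str):
    """Return capture group 1 of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

original = first_group(r"(\/)\d{7,7}", url)       # the group wraps only the slash
fixed = first_group(r"/(\d{7})", url)             # the group wraps the seven digits
anchored = first_group(r"/(\d+)(?:[?#]|$)", url)  # digits must end the path

print(original, fixed, anchored)
```

The first pattern yields "/" and the other two yield "2627995", which is the behavior described in the answer.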

Extract string between different special symbols

I have the following string in my query
.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt
It begins with a period. I need to extract the segment between the final \ and the period before the file extension, so the expected result is
ABC__123_123_123_ABC123
I am fairly new to REGEXP and could not work out an elegant (or workable) solution from the Q&A here or elsewhere. The pattern is the same in every query, in both the number and the order of its segments, but to learn something I would prefer not to just count characters and cut.
You can use the REGEXP_REPLACE function, for example
REGEXP_REPLACE(col,'(.*\\)(.*)\.(.*)','\2')
to extract the piece between the last backslash and the last dot. The leading backslashes in \\ and \. are escape characters: they make the regex treat \ and . as literal characters rather than as special ones.
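The same replace-with-group-2 idea can be tried in Python. This is only a sketch: `re.sub` plays the role of REGEXP_REPLACE, and the greedy `.*` behaves the same way in both, but Oracle's engine is not being run here.

```python
import re

path = r'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt'

# Group 1: everything up to the last backslash (greedy).
# Group 2: everything up to the last dot. Group 3: the extension.
stem = re.sub(r'(.*\\)(.*)\.(.*)', r'\2', path)
print(stem)
```

Because the pattern matches the whole string, replacing the match with `\2` leaves only the file name without its directory and extension.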
You just need regexp_substr and a simple regexp: ([^\]+)\.[^.]*$
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'([^\]+)\.[^.]*$',
1, -- position
1, -- occurrence
null, -- match_parameter
1 -- subexpr
) substring
from dual;
([^\]+)\.[^.]*$ means:
([^\]+) - one or more (+) characters other than a backslash ([] is a set, ^ negates it), captured as group \1 (subexpression #1)
\. - then a literal dot (. is a special character meaning any character, so it is escaped with \, the escape character)
[^.]* - zero or more characters other than .
$ - end of line
So this regexp finds a substring made of one or more non-backslash characters, followed by a dot, followed by zero or more non-dot characters, anchored at the end of the string. The subexpr parameter = 1 tells Oracle to return the first subexpression (i.e. the first group matched in (...)).
The other parameters are described in the doc.
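For readers who want to experiment with this pattern outside Oracle, here is a rough Python equivalent. Two things are assumptions of the sketch: Python's `re` replaces Oracle's engine, and the backslash has to be doubled inside the character class (`[^\\]`), whereas the Oracle pattern above writes it as `[^\]`. The helper name `file_stem` is hypothetical.

```python
import re

path = r'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt'

def file_stem(text: str):
    """Return group 1: the run of non-backslash characters before the final extension."""
    m = re.search(r'([^\\]+)\.[^.]*$', text)
    return m.group(1) if m else None

print(file_stem(path))
```

Returning `m.group(1)` corresponds to passing subexpr = 1 to regexp_substr; `m.group(0)` would include the extension as well.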
Here is a simple example that works in Oracle 11g R2, PCRE2 and some other languages.
Oracle 11g R2 using the regexp_substr function (Reference documentation)
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}',
1,
1
) substring
from dual;
Pattern: ((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}
Result: ABC__123_123_123_ABC123
This is about as simple as it gets. The pattern sticks to constructs that most regex flavors share, so it should be portable, in case someone else wants to go the simplest way.
Hopefully, this will help you out!
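The pattern above can be checked in Python as well, since it only uses `\w`, `\d`, groups and counted repetition. The sketch below also tries a shorter spelling without the `{1}` quantifiers and the single-token groups; that compact form is an assumption of this example, not part of the answer.

```python
import re

path = r'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt'

# Pattern exactly as given in the answer.
VERBOSE = r'((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}'
# Assumed compact form: same tokens, redundant groups and {1} removed.
COMPACT = r'\w+_{2}(?:\d{3}_){3}\w+\d+'

verbose_match = re.search(VERBOSE, path).group(0)
compact_match = re.search(COMPACT, path).group(0)
print(verbose_match, compact_match)
```

Both spellings find the same substring in this input. Unlike the end-anchored patterns, they depend on the exact shape of the file name (two underscores, then three groups of three digits).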

How can I extract a substring from a character column without using SUBSTR()?

I have a question regarding the data below.
As you can see, each EMP_IDENTIFIER is prefixed with its EMP_ID.
I need to pull out only the identifier part, which is 10 characters long, and insert it into another column.
How would I do that?
I did it the traditional way, using INSTR and SUBSTR.
I just want to know whether there is any other way to do it without INSTR and SUBSTR.
EMP_ID (VARCHAR2)   EMP_IDENTIFIER (VARCHAR2)
62049               62049-2162400111
6394                6394-1368000222
64473               64473-1814702333
61598               61598-0876000444
57452               57452-0336503555
5842                5842-0000070666
75778               75778-0955501777
76021               76021-0546004888
76274               76274-0000454999
73910               73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
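A quick way to try the '-(.{10})' pattern on the sample rows is the Python sketch below. Python's `re` is used in place of Oracle's REGEXP_SUBSTR, which is an assumption of the example; `second_part` is a hypothetical helper name.

```python
import re

# A few EMP_IDENTIFIER values from the question.
rows = ["62049-2162400111", "6394-1368000222", "5842-0000070666"]

def second_part(identifier: str):
    """Return the first 10 characters that follow a dash, or None if there are fewer."""
    m = re.search(r'-(.{10})', identifier)
    return m.group(1) if m else None

for row in rows:
    print(row, second_part(row))
```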
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits (\d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the third argument, 1); it looks for the second occurrence (the fourth argument, 2); it doesn't do any special matching such as ignoring case (the fifth argument, null); and when the match is found, it returns the fragment matched by the second set of parentheses (the last argument, 2). That second fragment is the \d+, the sequence of digits without the leading and/or trailing dash.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.
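The "n-th occurrence" idea can be explored with the Python sketch below, with one caveat that is specific to this sketch: in Python's `re`, successive matches do not overlap, so a dash consumed as the end of one match cannot also start the next one. With the pattern from the answer, two numeric groups that share a separator are therefore not both found by Python. Whether Oracle's occurrence counting behaves the same way is worth testing on a real database. The lookaround variant and the helper `nth_numeric_group` are assumptions of this example.

```python
import re

# Pattern from the answer: separators are part of the match, group 2 holds the digits.
CONSUMING = re.compile(r'(-|^)(\d+)(-|$)')
# Assumed variant: separators are only checked by lookarounds, never consumed.
LOOKAROUND = re.compile(r'(?<![^-])(\d+)(?![^-])')

def nth_numeric_group(text, n, pattern, group):
    """Return the digits of the n-th (1-based) match, or None if there are fewer."""
    found = [m.group(group) for m in pattern.finditer(text)]
    return found[n - 1] if len(found) >= n else None

sample = "AS23302-ATX-20032-33900293-CWV20-3499-RA"
print(nth_numeric_group(sample, 2, CONSUMING, 2))   # second match in Python's engine
print(nth_numeric_group(sample, 2, LOOKAROUND, 1))  # second purely numeric group
```

In Python the first call returns 3499, because the dash after 20032 was already consumed, while the lookaround variant returns 33900293.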

Sybase to Teradata inquiry LIKE '[0-9]'

CASE
WHEN <in_data> LIKE '[0-9][0-9][0-9][0-9][0-9][0-9]' THEN SUBSTR(<in_data>,1,3)
ELSE '000'
END
We're doing a migration project from Sybase to Teradata and are having trouble figuring this one out :) I'm still new to Teradata.
I would like to ask for the equivalent Teradata code for this:
LIKE '[0-9][0-9][0-9][0-9][0-9][0-9]'
Basically, it just checks whether the value consists of six digits.
Can someone give me a hint on this?
You can also use REGEXP_SUBSTR to directly extract the three digits:
COALESCE(REGEXP_SUBSTR(in_data,'^[0-9]{3}(?=[0-9]{3}$)'), '000')
This matches the first three digits and then uses a lookahead to require three more digits, without adding them to the overall match.
^ marks the beginning of the string and $ the end, so no other characters may come before or after the six digits. (?=...) is a so-called "lookahead": those three digits are checked but not returned.
If there's no match the regex returns NULL which is changed to '000'.
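The lookahead-plus-default logic can be sketched in Python as follows. This is an illustration under assumptions: Python's `re` stands in for Teradata's REGEXP_SUBSTR, `None` stands in for SQL NULL, and the function name is hypothetical.

```python
import re

def first_three_or_default(in_data):
    """Return the first three digits of a six-digit string, otherwise '000'."""
    if in_data is None:  # plays the role of a NULL input
        return '000'
    # Match three digits only if exactly three more digits follow, then the end.
    m = re.search(r'^[0-9]{3}(?=[0-9]{3}$)', in_data)
    return m.group(0) if m else '000'  # the fallback mirrors COALESCE(..., '000')

for value in ["123456", "12345", "1234567", "12a456", None]:
    print(repr(value), first_three_or_default(value))
```

Only "123456" yields "123"; every other sample falls back to "000", matching the ELSE branch of the original CASE expression.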
Teradata's LIKE only understands the % and _ wildcards, so a bracket pattern such as [0-9][0-9][0-9][0-9][0-9][0-9] has to be matched as a regular expression instead (in Teradata, through a regex function such as REGEXP_SIMILAR).
For an exact match you also need anchors, i.e. to match a string that consists of exactly six digits:
regexp '^[0-9]{6}$'
or
regexp '^[[:digit:]]{6}$'

SQL : REGEX MATCH - Character followed by numbers inside quotes

I have a column in SQL which holds values inside double quotes, like "P1234567", "P1234" etc.
I need to identify only the values which start with the letter P followed by exactly seven digits. I tried where column like '"P[0-9][0-9][0-9][0-9][0-9][0-9][0-9]"' but it doesn't seem to work.
Can someone please correct me or point me to a thread which can help me out?
Thanks
Standard SQL has no regex support, but most SQL engines have regex extensions added to them on top of the standard SQL. So, for example, if you're using MySQL then you'd do this:
... WHERE column REGEXP '^"P[0-9]{7}"'
And if you're using Postgres then that would be:
... WHERE column ~ '^"P[0-9]{7}"'
(updated to match the double-quote part of the question, I'd misunderstood that to begin with)
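To see which values the pattern accepts, here is a small Python sketch. Python's `re` is an assumed stand-in for the MySQL or Postgres engine, and the second pattern, with a trailing `$`, is an addition of this example for the case where nothing may follow the closing quote.

```python
import re

# Pattern from the answer: anchored at the start only.
PREFIX = re.compile(r'^"P[0-9]{7}"')
# Assumed stricter variant: also anchored at the end.
EXACT = re.compile(r'^"P[0-9]{7}"$')

def check(value: str):
    """Return (matches PREFIX, matches EXACT) for one column value."""
    return (PREFIX.search(value) is not None, EXACT.search(value) is not None)

for value in ['"P1234567"', '"P1234"', '"P12345678"', '"P1234567" extra']:
    print(value, check(value))
```

The closing quote in the pattern already rules out eight or more digits; the end anchor only matters when extra text follows the quoted value.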
How about using length and isnumeric:
SELECT *
FROM mytable
WHERE mycolumn LIKE '"P%'
AND LEN(mycolumn) = 10 -- 2 chars for quotes + 1 for 'P' + 7 for the digits
AND ISNUMERIC(SUBSTRING(mycolumn, 3, 7)) = 1
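This answer is for SQL Server; other DBMSs may use different syntax for length. Note that isnumeric also returns 1 for some strings that are not purely digits (for example ones containing a sign, a decimal point or a currency symbol), so this check is looser than the regex versions.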