Oracle SQL regexp_substr number extraction behavior - sql

In a sense I've answered my own question, but I'm trying to understand the answer better:
When using regexp_substr (in Oracle) to extract the first occurrence of a number (single- or multi-digit), how and why do the quantifiers * and + affect the results? Why does + provide the behavior I'm looking for when * does not? * is my default in most regular expressions, so I was surprised it didn't suit my need.
For example, in the following:
select test,
regexp_substr(TEST,'\d') Pattern1,
regexp_substr(TEST,'\d*') Pattern2,
regexp_substr(TEST,'\d+') Pattern3
from (
select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);
The use of regexp_substr(TEST,'\d*') returns a null value for the input "W 123". Since 'zero or more' digits exist in the string, I'm confused by this behavior. I'm also confused about why it does work on the string '123 W'.
My understanding is that * means zero or more occurrences of the element it follows, and + means one or more occurrences of the preceding element. In the example provided for Pattern2 [\d*], why does it successfully capture "123" from "123 W" but not from "W 123"? Zero or more occurrences of a digit do exist; they just don't exist at the beginning of the string. Is there additional [implied] logic attached to using *?
Note: I looked around for a while for similar questions that would help me capture the '123' from 'W 123', but the closest I found were variations of regexp_replace, which would not meet my needs.

So the regexp_count indicates there are FOUR substrings that match the \d* pattern.
The third of those is the '123'. The implication is that the first and second are derived from the W and space and what you have is a zero length result that 'consumes' one character of the source string.
select test,
regexp_count(TEST,'\d*') Pattern2_c,
regexp_substr(TEST,'\d*') Pattern2,
regexp_substr(TEST,'\d*',1,1) Pattern2_1,
regexp_substr(TEST,'\d*',1,2) Pattern2_2,
regexp_substr(TEST,'\d*',1,3) Pattern2_3,
regexp_substr(TEST,'\d*',1,4) Pattern2_4
from (select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);
Oracle treats a zero-length string as NULL, which is why the zero-length matches show up as NULL rather than as an empty string.
The result doesn't "feel" right, but then if you ask a computer deep philosophical questions about how many zero length substrings are contained in a string, I wouldn't bet on any answer.
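A minimal sketch of the zero-length/NULL point, assuming nothing beyond the standard dual table. In Oracle an empty VARCHAR2 string is NULL, so a zero-length regex match comes back as NULL too. The column aliases are made up for illustration.

```sql
-- Oracle treats the empty string '' as NULL for VARCHAR2 values.
select case when '' is null then 'NULL' else 'NOT NULL' end as empty_string_is,
       -- \d* matches zero characters at the start of 'W 123',
       -- so the zero-length result is reported as NULL.
       case when regexp_substr('W 123', '\d*') is null
            then 'NULL' else 'NOT NULL' end                as star_match_is
from dual;
-- Both columns should show NULL.
```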

After thinking through this, it actually makes sense. The pattern \d* says to match any digit zero or more times. The problem is that the beginning of the string will always match this pattern, precisely because of the "zero or more".
If the string begins with digits, then the match includes those digits, so given 123 W, the pattern matches 123. Given the string W 123, however, the pattern also matches at the beginning, but it matches zero characters. This is why you get a NULL result.
This is a general regex thing and not an Oracle thing. You have to be careful with the * quantifier.
Here are two regex fiddle examples to illustrate this, using the string W 123:
\d+ shows 1 match on 123
\d* shows 1 match on nothing (i.e. the beginning of the string)
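To make the difference concrete, here is a small sketch that uses \d+ to pull the first and second numbers out of a string. The second sample row ('A 12 B 345') and the aliases are invented for illustration.

```sql
select test,
       regexp_substr(test, '\d+')       as first_number,   -- first run of one or more digits
       regexp_substr(test, '\d+', 1, 2) as second_number   -- second run, NULL if there is none
from (
  select 'W 123'      test from dual
  union all
  select 'A 12 B 345' test from dual
);
-- Expected: 'W 123' -> 123 and NULL; 'A 12 B 345' -> 12 and 345.
```

Because + requires at least one digit, the engine cannot settle for an empty match at the first character and keeps scanning until it reaches the digits.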

Related

How do I extract consonants from a string field?

How do I extract only the consonants from a field in records that contain names?
For example, if I had the following record in the People table:
Field  Value
Name   Richard
How could I extract only the consonants in "Richard" to get "R,c,r,d"?
If you mean "how can I remove all vowels from the input" so that 'Richard' becomes 'Rchrd', then you can use the translate function as Boneist has shown, but with a couple more subtle additions.
First, you can completely remove a character with translate, if it appears in the second argument and it doesn't have a corresponding "translate to" character in the third argument.
Second, alas, if the third (and last) argument to translate is null the function returns null (and the same if the last argument is the empty string; there is a very small number of instances where Oracle does not treat the empty string as null, but this is not one of them). So, to make the whole thing work, you need to add an extra character to both the second and the third argument - a character you do NOT want to remove. It may be anything (it doesn't even need to appear in the input string), just not one of the characters to remove. In the illustration below I use the period character (.) but you can use any other character - just not a vowel.
Pay attention too to upper vs lower case letters. Ending up with:
with
sample_inputs (name) as (
select 'Richard' from dual union all
select 'Aliosha' from dual union all
select 'Ai' from dual union all
select 'Ng' from dual
)
select name, translate(name, '.aeiouAEIOU', '.') as consonants
from sample_inputs
;
NAME    CONSONANTS
------- ----------
Richard Rchrd
Aliosha lsh
Ai
Ng      Ng
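As an alternative sketch (a different technique from the translate call in this answer), regexp_replace can remove the vowels case-insensitively without the extra "keep" character, and a second pass can insert commas if the comma-separated form from the question is wanted. The consonants_csv alias is made up here.

```sql
with
sample_inputs (name) as (
  select 'Richard' from dual union all
  select 'Ng'      from dual
)
select name,
       -- occurrence 0 = replace every match; 'i' = case-insensitive
       regexp_replace(name, '[aeiou]', null, 1, 0, 'i') as consonants,
       -- append a comma after every remaining character, then trim the last one
       rtrim(regexp_replace(regexp_replace(name, '[aeiou]', null, 1, 0, 'i'),
                            '(.)', '\1,'), ',')          as consonants_csv
from sample_inputs;
-- Expected: Richard -> Rchrd and R,c,h,r,d; Ng -> Ng and N,g.
```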
You should be able to string a few replace functions together, nesting one replace per vowel (in both upper and lower case):
Select replace(replace(Value, 'A', ''), 'E', '') ... etc.
You can easily do this with the translate() function, e.g.:
WITH people AS (SELECT 'Name' field, 'Richard' val FROM dual UNION ALL
SELECT 'Name' field, 'Siobhan' val FROM dual)
SELECT field, val, TRANSLATE(val, 'aeiou', ',,,,,') updated_val
FROM people;
FIELD VAL     UPDATED_VAL
----- ------- -----------
Name  Richard R,ch,rd
Name  Siobhan S,,bh,n
The translate function simply takes a list of characters and - based on the second list of characters, which defines the translation - translates the input string.
So in the above example, the a (first character in the first list) becomes a , (first character in the second list), the e (second character in the first list) becomes a , (second character in the second list), etc.
N.B. I really, really hope your key-value table is just a made-up example for the situation you're trying to solve, and not an actual production table; in general, key-value tables are a terrible idea in a relational database!

How can I extract a substring from a character column without using SUBSTR()?

I have a question regarding the data below.
As you can see, each EMP_IDENTIFIER starts with its EMP_ID.
I need to pull out only the identifier part, which is 10 characters long, so that it can be inserted into another column.
How would I do that?
I did it the traditional way, using INSTR and SUBSTR.
I just want to know whether there is any other way to do it without using INSTR or SUBSTR.
EMP_ID (VARCHAR2)  EMP_IDENTIFIER (VARCHAR2)
62049              62049-2162400111
6394               6394-1368000222
64473              64473-1814702333
61598              61598-0876000444
57452              57452-0336503555
5842               5842-0000070666
75778              75778-0955501777
76021              76021-0546004888
76274              76274-0000454999
73910              73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
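Here is how that expression might look in a full query, next to a second variant that simply takes everything after the last dash. The table name t matches the SUBSTR answer; the inline sample rows (two taken from the question's data) and the alias names are assumptions for illustration.

```sql
with t (emp_id, emp_identifier) as (
  select '62049', '62049-2162400111' from dual union all
  select '6394',  '6394-1368000222'  from dual
)
select emp_id,
       -- capture group 1: the 10 characters right after the first dash
       regexp_substr(emp_identifier, '-(.{10})', 1, 1, null, 1) as ten_after_dash,
       -- one or more non-dash characters anchored at the end of the string
       regexp_substr(emp_identifier, '[^-]+$')                  as after_last_dash
from t;
-- Expected for both columns: 2162400111 and 1368000222.
```

The second variant does not depend on the identifier being exactly 10 characters long, which may or may not be what you want.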
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits (\d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the third argument of the function is 1); it looks for the second occurrence (the argument 2); it doesn't do any special matching such as ignoring case (the argument null); and when the match is found, it returns the fragment of the match included in the second set of parentheses (the last argument, 2). That second fragment is the \d+, the sequence of digits, without the leading and/or trailing dash.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.
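A small sketch of that call on a made-up input (the string and the alias are hypothetical). The numeric groups are deliberately kept apart here: each match includes its trailing dash, so two numeric groups sitting side by side share a delimiter, and that case is worth testing separately.

```sql
with samples (input_str) as (
  select 'AB12-4471-XY-9902-Z' from dual
)
select input_str,
       -- occurrence 2, subexpression 2: the digits of the second all-digit group
       regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) as second_numeric_group
from samples;
-- Expected: 9902 (the first all-digit group is 4471; AB12 is skipped
-- because it is not purely numeric).
```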

Argument '0' is out of range error

I have a query (SQL) to pull out a street name from a string. It's looking for the last occurrence of a digit, and then pulling the proceeding text as the street name. I keep getting the Oracle
"argument '0' is out of range"
error, but I'm struggling to figure out how to fix it.
The part of the query in question is
substr(address,regexp_instr(address,'[[:digit:]]',1,regexp_count(address,'[[:digit:]]'))+2)
Any help would be amazing. (Using SQL Developer.)
The fourth parameter of regexp_instr is the occurrence:
occurrence is a positive integer indicating which occurrence of
pattern in source_string Oracle should search for. The default is 1,
meaning that Oracle searches for the first occurrence of pattern.
In this case, if an address contains no digits, regexp_count returns 0, which is not a valid occurrence.
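One way to keep the original expression but avoid the error, sketched under the assumption that an address without digits should be returned unchanged, is to guard the call with CASE. The table name t and the sample rows are made up for illustration.

```sql
with t (address) as (
  select '12 Oak Lane' from dual union all
  select 'Main Street' from dual
)
select address,
       case
         -- only ask for the last digit when there is at least one
         when regexp_count(address, '[[:digit:]]') > 0
         then substr(address,
                     regexp_instr(address, '[[:digit:]]', 1,
                                  regexp_count(address, '[[:digit:]]')) + 2)
         else address   -- assumption: no house number, keep the whole address
       end as street
from t;
-- Expected: 'Oak Lane' and 'Main Street'.
```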
A simpler solution, which does not require separate treatment for addresses without a house number, is this:
with t (address) as (
select '422 Hickory Str.' from dual union all
select 'One US Bank Plaza' from dual
)
select regexp_substr(address, '\s*([^0-9]*)$', 1, 1, null, 1) as street from t;
The output looks like this:
STREET
-------------------------
Hickory Str.
One US Bank Plaza
The third argument to regexp_substr is the first of the three 1's. It means start the search at the first character of address. The second 1 means find the first occurrence of the search pattern. The null means no special match modifiers (such as case insensitive - nothing like that needed here). The last 1 means "return the first SUBEXPRESSION from the match pattern". Subexpressions are parts of the match expression enclosed in parentheses.
The match pattern has a $ at the end - meaning "anchor at the end of the input string" ($ means the end of the string). Then [...] means match any of the characters in square brackets, but the ^ in [^...] changes it to match any character OTHER THAN what is in the square brackets. 0-9 means all characters between 0 and 9; so [^0-9] means match any character(s) OTHER THAN digits, and the * after that means "any number of such characters" (between 0 and everything in the input string). \s is "blank space" - if there are any blank spaces following a possible number in the address, you don't want them included right at the beginning of the street name. The subexpression is just [^0-9]* meaning the non-digits, not including any spaces before them (because the \s* is outside the left parenthesis).
My example illustrates a potential problem though - sometimes an address does, in fact, have a "number" in it, but spelled out as a word instead of using digits. What I show is in fact a real-life address in my town.
Good luck!
looking for the last occurrence of a digit, and then pulling the proceeding text as the street name
You could simply do:
SELECT REGEXP_REPLACE( address, '^(.*)\d+\D*$', '\1' )
AS street_name
FROM address_table;

regexp after a word appear

I'm using a regexp to find the text after a certain word appears.
Fiddle demo
The problem is that some addresses use different abbreviations for "big house"; some have a space after them, some have a dot:
Quinta
QTA
Qta.
I want all the text after any of those appears, ignoring case.
I tried this one, but I'm not sure how to include multiple starting words:
SELECT
REGEXP_SUBSTR ("Address", '[^QUINTA]+') "REGEXPR_SUBSTR"
FROM Address;
Solution:
I believe this will match the abbreviations you want:
SELECT
REGEXP_REPLACE("Address", '^.*Q(UIN)?TA\.? *|^.*', '', 1, 1, 'i')
"REGEXPR_SUBSTR"
FROM Address;
Demo in SQL fiddle
Explanation:
It tries to match everything from the beginning of the string:
until it finds Q + UIN (optional) + TA + . (optional) + any number of spaces;
if it doesn't find that, it matches the whole string with ^.*.
Since I'm using REGEXP_REPLACE, it replaces the match with an empty string, thus removing all characters until "QTA", any of its alternations, or the whole string.
Notice the last parameter passed to REGEXP_REPLACE: 'i'. That is a flag that sets a case-insensitive match (flags described here).
The part you were interested in making optional uses a ( pattern ) that is a group with the ? quantifier (which makes it optional). Therefore, Q(UIN)?TA matches either "QUINTA" or "QTA".
Alternatively, in the scope of your question, if you wanted different options, you need to use alternation with a |. For example (pattern1|pattern2|etc) matches any one of the 3 options. Also, the regex (QUINTA|QTA) matches exactly the same as Q(UIN)?TA
What was wrong with your pattern:
The construct you were trying ([^QUINTA]+) uses a character class, and it matches any character except Q, U, I, N, T or A, repeated 1 or more times. But it's applied to characters, not words. For example, [^QUINTA]+ matches the string "BCDEFGHJKLMOPRSVWXYZ" completely, and it fails to match "TIA".
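A short sketch of the alternation form on a few made-up addresses (the sample rows, the sample_addresses name, and the after_keyword alias are all hypothetical). It uses (QUINTA|QTA) instead of the negated character class.

```sql
with sample_addresses ("Address") as (
  select 'Av. Principal Quinta Los Pinos' from dual union all
  select 'Calle 5 QTA La Esperanza'       from dual union all
  select 'Calle 8 Qta. Maria'             from dual
)
select "Address",
       -- remove everything up to and including the keyword, an optional dot,
       -- and any following spaces; 'i' makes the match case-insensitive
       regexp_replace("Address", '^.*(QUINTA|QTA)\.? *', '', 1, 1, 'i') as after_keyword
from sample_addresses;
-- Expected: 'Los Pinos', 'La Esperanza', 'Maria'.
```

Unlike the pattern in the solution, this sketch has no |^.* fallback, so an address without any of the keywords would be returned unchanged rather than emptied.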

SQL - need help in parsing text of a field

I have a select query and it fetches a field with complex data. I need to parse that data in specified format. please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me in writing a sql regular expression to accomplish this output.
The first step in figuring out the regular expression is to be able to describe it in plain language. Based on what we know from your post (and, as others have said, more info is really needed), some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
with tbl(str) as (
select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
)
select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^        Match the beginning of the line
.        followed by any character
*        followed by 0 or more 'any characters' (refers to the previous character class)
=        followed by an equal sign
(        start remembered group 1
[^ ]+    which is a set of one or more characters that are not a space
)        end remembered group 1
.*=      followed by any number of any characters but ending in an equal sign
([^ ]+)  followed by the second remembered group of non-space characters
$        followed by the end of the line
The replace string explained:
\1 The first remembered group
| a pipe character
\2 The second remembered group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
with tbl(str) as (
select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
)
select regexp_substr(str, '=([^ |]*)( |\||$)', 1, level, null, 1) output, level
from tbl
connect by level <= regexp_count(str, '=')
ORDER BY level;
OUTPUT                    LEVEL
-------------------- ----------
zz                            1
PB                            2
I                             3
                              4
test2                         5