Argument '0' is out of range error - sql

I have a SQL query that pulls a street name out of a string. It looks for the last occurrence of a digit and then takes the text that follows it as the street name. I keep getting the Oracle error
"argument '0' is out of range"
but I'm struggling to figure out how to fix it.
The part of the query in question is
substr(address,regexp_instr(address,'[[:digit:]]',1,regexp_count(address,'[[:digit:]]'))+2)
Any help would be amazing. (I'm using SQL Developer.)

The fourth parameter of regexp_instr is the occurrence:
occurrence is a positive integer indicating which occurrence of
pattern in source_string Oracle should search for. The default is 1,
meaning that Oracle searches for the first occurrence of pattern.
In this case, if an address contains no digits, regexp_count returns 0, which is not a valid occurrence.
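To make the failure mode concrete, here is a minimal sketch in Python rather than Oracle SQL. Python's re module stands in for the Oracle functions, and the helper name text_after_last_digit is made up for illustration. It mirrors the original expression but checks for the zero-digit case first, which is the same guard the Oracle query would need before passing the count as the occurrence.

```python
import re

def text_after_last_digit(address):
    """Return the text starting two characters after the last digit.

    Returns None when the address has no digit at all, which is the case
    that makes the Oracle expression fail with occurrence = 0.
    """
    # 0-based positions of every digit (the analogue of regexp_count + regexp_instr).
    positions = [m.start() for m in re.finditer(r'[0-9]', address)]
    if not positions:
        return None  # no digit: nothing to anchor on
    # Oracle's "position + 2" (1-based) skips the digit and one more character.
    return address[positions[-1] + 2:]

print(text_after_last_digit('422 Hickory Str.'))   # Hickory Str.
print(text_after_last_digit('One US Bank Plaza'))  # None
```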

A simpler solution, which does not require separate treatment for addresses without a house number, is this:
with t (address) as (
select '422 Hickory Str.' from dual union all
select 'One US Bank Plaza' from dual
)
select regexp_substr(address, '\s*([^0-9]*)$', 1, 1, null, 1) as street from t;
The output looks like this:
STREET
-------------------------
Hickory Str.
One US Bank Plaza
The third argument to regexp_substr is the first of the three 1's. It means start the search at the first character of address. The second 1 means find the first occurrence of the search pattern. The null means no special match modifiers (such as case insensitive - nothing like that needed here). The last 1 means "return the first SUBEXPRESSION from the match pattern". Subexpressions are parts of the match expression enclosed in parentheses.
The match pattern has a $ at the end - meaning "anchor at the end of the input string" ($ means the end of the string). Then [...] means match any of the characters in square brackets, but the ^ in [^...] changes it to match any character OTHER THAN what is in the square brackets. 0-9 means all characters between 0 and 9; so [^0-9] means match any character(s) OTHER THAN digits, and the * after that means "any number of such characters" (between 0 and everything in the input string). \s is "blank space" - if there are any blank spaces following a possible number in the address, you don't want them included right at the beginning of the street name. The subexpression is just [^0-9]* meaning the non-digits, not including any spaces before them (because the \s* is outside the left parenthesis).
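As a rough cross-check of the pattern, the same expression can be tried in Python's re module. This is only a sketch: Oracle and Python regex flavours differ in places, and the names STREET_PATTERN and street_name are invented for the example.

```python
import re

# Same pattern as in the query: optional whitespace, then a captured run
# of non-digits anchored at the end of the string.
STREET_PATTERN = re.compile(r'\s*([^0-9]*)$')

def street_name(address):
    # The pattern can always match (at worst the empty string at the very
    # end), so search() never returns None here.
    return STREET_PATTERN.search(address).group(1)

print(street_name('422 Hickory Str.'))   # Hickory Str.
print(street_name('One US Bank Plaza'))  # One US Bank Plaza
```

When the address ends in a digit, the captured group is empty. Python reports that as an empty string, whereas Oracle treats a zero-length string as NULL.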
My example illustrates a potential problem though - sometimes an address does, in fact, have a "number" in it, but spelled out as a word instead of using digits. What I show is in fact a real-life address in my town.
Good luck!

looking for the last occurrence of a digit, and then pulling the proceeding text as the street name
If you want the text that comes before the last digit, you could simply do:
SELECT REGEXP_REPLACE( address, '^(.*)\d+\D*$', '\1' )
AS street_name
FROM address_table;

How do '~' and '^' actually work in PostgreSQL, with practical examples?

I'm trying to solve an exercise for which a lot of users have posted solutions that use the "~" syntax, as below:
select
business_postal_code as zip,
count(distinct case when left(business_address,1) ~ '^[0-9]' then lower(split_part(business_address, ' ', 2))
else lower(split_part(business_address, ' ', 1)) end ) as n_street
from sf_restaurant_health_violations
where business_postal_code is not null
group by 1
order by 2 desc, 1 asc;
Link to the case: https://platform.stratascratch.com/coding/10182-number-of-streets-per-zip-code?python=
But I couldn't understand how this part of the code actually works: ... ~ '^ ....
Let's simplify the query in your question to the component parts you're asking about. Once we see how they work individually, perhaps the whole query will make more sense.
To start, the ~ (tilde) is the POSIX, case-sensitive regular expression operator. The linked PostgreSQL documentation provides brief descriptions and usage examples of it and its sibling operators:
Operator  Description                                          Example
--------  ---------------------------------------------------  ------------------------
~         Matches regular expression, case sensitive           'thomas' ~ '.*thomas.*'
~*        Matches regular expression, case insensitive         'thomas' ~* '.*Thomas.*'
!~        Does not match regular expression, case sensitive    'thomas' !~ '.*Thomas.*'
!~*       Does not match regular expression, case insensitive  'thomas' !~* '.*vadim.*'
We can see that each operator has two operands: a constant string on the left, and a pattern on the right. If the string on the left is a match for the pattern on the right, the statement is true, otherwise it is false.
In the given example for the operator you're asking about, 'thomas' is a match for the pattern '.*thomas.*' by standard regular expression rules. The '.*' pre-and-postfixes mean "match any character (except newline) any number of times (zero or more)". The whole pattern then means, "match any character any number of times, then the literal string 'thomas', then any character any number of times". One such match would be 'john thomas jones' where 'john ' matches the first '.*' and ' jones' matches the second '.*'.
I don't think this is a great example because it is functionally equivalent to 'thomas' LIKE '%thomas%' which is likely to run faster, among other benefits like being a SQL-standard operator.
A better example is the query in your question where the pattern '^[0-9]' is used. Setting aside the ^ for now, this pattern means, "match any character in 0-9 (0, 1, 2, ..., 8, 9)", which would be much more verbose if you were to use the LIKE operator: field LIKE '%0%' OR field LIKE '%1%' OR field LIKE '%2%' ....
The ^ operator is not PostgreSQL-specific. Rather it is a special character in regular expressions with one of two meanings (aside from its use as a literal character; more about that in this answer):
1. The match should begin at the start of the line/string.
   For example, the string "Hello, World!" would contain a match for the pattern 'World' since the word "World" appears in it, but would not contain a match for the pattern '^World' since the word "World" is not at the start of the string.
   The string "Hello, World!" would contain a match for both of the following patterns: 'Hello' and '^Hello' since the word "Hello" is at the start of the string.
2. The given character set should be negated when making a match.
   For example, the pattern [^0-9] means, "match any character that is not in the range 0-9". So 'a' would match, '&' would match, and 'G' would match, but '7' would not match since it is in the character set that is being excluded.
The query in your question uses the first of the two meanings. The pattern '^[0-9]' means, "match any character in the range 0-9 starting at the beginning of the string". So '0123' would match since the string starts with "0", but 'a5' would not match since the string starts with "a", which is not in the character set that is being matched.
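The two meanings of ^ can be checked quickly outside the database. The sketch below uses Python's re module as a stand-in for PostgreSQL's ~ operator; the helper name contains is invented, and re.search is used because ~ also succeeds when the pattern matches anywhere in the string.

```python
import re

def contains(pattern, string):
    """True when the pattern matches somewhere in the string."""
    return re.search(pattern, string) is not None

text = "Hello, World!"

# Meaning 1: ^ outside brackets anchors the match to the start.
print(contains(r'World', text))    # True
print(contains(r'^World', text))   # False
print(contains(r'^Hello', text))   # True

# Meaning 2: ^ as the first character inside brackets negates the set.
print(contains(r'[^0-9]', 'a'))    # True
print(contains(r'[^0-9]', '7'))    # False

# The pattern from the question: a digit at the start of the string.
print(contains(r'^[0-9]', '0123')) # True
print(contains(r'^[0-9]', 'a5'))   # False
```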
Back to the query in your question, then. The relevant part reads:
1 count(distinct
2 case
3 when left(business_address, 1) ~ '^[0-9]'
4 then lower(split_part(business_address, ' ', 2))
5 else lower(split_part(business_address, ' ', 1))
6 end
7 ) as n_street
Line 3 contains a regular expression match that will determine if we should use this case in the overall CASE statement. If the string matches the pattern, the expression will be true and we will use this case. If the string does not match the pattern, the expression will be false and we will try the next case.
The string we are matching to the pattern is left(business_address, 1). The LEFT function takes the first n characters from the string. Since n is "1" here, this returns the first character of the field business_address.
The pattern we are trying to match this string to is '^[0-9]' which we have already said means, "match any character in the range 0-9 starting at the beginning of the string". Technically we don't need the ^ regex operator here since LEFT(..., 1) will return at most one character (which will always be the first character in the resulting string).
As an example, if business_address is "123 Jones Street, Anytown, USA", then LEFT(business_address, 1) will return "1" which will match the pattern (and therefore the expression will be true and we will use the first case).
If, instead, business_address were "Jones Plaza, Suite 123, Anytown, USA", then LEFT(business_address, 1) would return "J" which would not match the pattern (since the first character is "J" which is not in the range 0-9). Our expression would be false and we would continue to the next case.
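The branch logic walked through above can be sketched in Python as well. This is an illustration only: split_part here is a rough, hand-written stand-in for the PostgreSQL function (1-based field number, empty string when the field is missing), and street_token is a made-up name.

```python
import re

def split_part(text, delimiter, n):
    """Rough stand-in for PostgreSQL split_part: 1-based field, '' if missing."""
    parts = text.split(delimiter)
    return parts[n - 1] if 0 < n <= len(parts) else ''

def street_token(business_address):
    first_char = business_address[:1]        # LEFT(business_address, 1)
    if re.search(r'^[0-9]', first_char):     # ... ~ '^[0-9]'
        # Address starts with a house number: take the second word.
        return split_part(business_address, ' ', 2).lower()
    # Otherwise take the first word.
    return split_part(business_address, ' ', 1).lower()

print(street_token('123 Jones Street, Anytown, USA'))        # jones
print(street_token('Jones Plaza, Suite 123, Anytown, USA'))  # jones
```

Both sample addresses end up under the same street token, which is what lets the query count distinct streets per zip code.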

How to extract a text between brackets in oracle sql query

I am trying to extract a value between the brackets from a string.
For example, I have this string:
No information was found [AI1234].
And I want to get the result between the brackets, i.e. AI1234.
However the expression is not always the same. It may vary.
I am trying to write a query like this:
REGEXP_SUBSTR(mssg, '\((.+)\)', 1, 1, NULL, 1) AS "description" from book
But it is not returning anything. What am I missing?
I also already tried something like the query below, but the thing is that the length of the bracketed text may vary. So this one returns something, but not what I am looking for:
substr(mssg,instr(mssg,'(')-8,10) as "description"
If you're looking for a group of digits between square brackets, try this:
WITH
indata(msg) AS (
  SELECT 'No information was found [1234]' FROM dual
)
SELECT
  REGEXP_SUBSTR(
    msg                       -- the string
  , '^[^[]+[[](\d+)[]].*$'    -- the pattern (with a captured
                              -- string "\d+" in round parentheses)
  , 1                         -- start from position 1
  , 1                         -- first found occurrence
  , ''                        -- no modifiers
  , 1                         -- first captured group
  ) AS extr
FROM indata;
extr
------
1234
You should read up more on Oracle regular expressions.
Please try this (Oracle 11g and above):
SELECT REGEXP_SUBSTR(mssg, '\[[^0-9]*(\d+)[^0-9]*\]', 1, 1, NULL, 1) description
FROM book;
** This helped me to answer here.
UPDATE: This will be OK.
SELECT REGEXP_SUBSTR('No information was found [{AI1234}].', '[[({][^0-9]*(\d+)[^0-9]*[]})]', 1, 1, NULL, 1) description
FROM dual;
UPDATE: Final solution
SELECT REGEXP_SUBSTR('No information was found [{AI1234}].', '[[({]+([^][)(}{]*)[])}]+', 1, 1, NULL, 1) description
FROM dual;
Here, you should take care with [^][)(}{].
DO NOT swap the bracket characters.
I'll quote from the Oracle 11g regexp reference:
[ ]
Bracket expression for specifying a matching list that should match any one of the expressions represented in the list. A non-matching list expression begins with a circumflex (^) and specifies a list that matches any character except for the expressions represented in the list.
To specify a right bracket (]) in the bracket expression, place it first in the list (after the initial circumflex (^), if any).
To specify a hyphen in the bracket expression, place it first in the list (after the initial circumflex (^), if any), last in the list, or as an ending range point in a range expression.
This part - [^ ] - was a hard nut to crack, and I finally found the solution in the reference; that's why I emphasize it.
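For comparison, here is a sketch of the final pattern ported to Python's re module. It is not the Oracle expression verbatim: Python also accepts ] as the first character of a set, but escaping the brackets is the more common and less surprising spelling there, so that is what is used. The names BRACKETED and between_brackets are invented for the example.

```python
import re

# Port of '[[({]+([^][)(}{]*)[])}]+' with the bracket characters escaped:
#   one or more opening brackets, a captured run of non-bracket characters,
#   then one or more closing brackets.
BRACKETED = re.compile(r'[\[({]+([^\]\[)(}{]*)[\])}]+')

def between_brackets(message):
    match = BRACKETED.search(message)
    return match.group(1) if match else None

print(between_brackets('No information was found [{AI1234}].'))  # AI1234
print(between_brackets('No information was found [AI1234].'))    # AI1234
print(between_brackets('No brackets here'))                      # None
```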

Oracle SQL regexp_substr number extraction behavior

In a sense I've answered my own question, but I'm trying to understand the answer better:
When using regexp_substr (in oracle) to extract the first occurrence of a number (either single or multi digits), how/why do the modifiers * and + impact the results? Why does + provide the behavior I'm looking for and * does not? * is my default usage in most regular expressions so I was surprised it didn't suit my need.
For example, in the following:
select test,
regexp_substr(TEST,'\d') Pattern1,
regexp_substr(TEST,'\d*') Pattern2,
regexp_substr(TEST,'\d+') Pattern3
from (
select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);
The use of regexp_substr(TEST,'\d*') returns a null value for the input "W 123". Since 'zero or more' digits exist in the string, I'm confused by this behavior. I'm also confused about why it does work on the string '123 W'.
My understanding is that * means zero or more occurrences of the element it follows and + means one or more occurrences of the preceding element. In the example provided for Pattern2 (\d*), why does it successfully capture "123" from "123 W" but not from "W 123"? Zero or more occurrences of a digit do exist; they just don't exist at the beginning of the string. Is there additional [implied] logic attached to using *?
Note: I looked around for a while trying to find similar questions that would help me capture the '123' from 'W 123', but the closest I found was variations of regexp_replace, which would not meet my needs.
So the regexp_count indicates there are FOUR substrings that match the \d* pattern.
The third of those is the '123'. The implication is that the first and second are derived from the W and space and what you have is a zero length result that 'consumes' one character of the source string.
select test,
regexp_count(TEST,'\d*') Pattern2_c,
regexp_substr(TEST,'\d*') Pattern2,
regexp_substr(TEST,'\d*',1,1) Pattern2_1,
regexp_substr(TEST,'\d*',1,2) Pattern2_2,
regexp_substr(TEST,'\d*',1,3) Pattern2_3,
regexp_substr(TEST,'\d*',1,4) Pattern2_4
from (select '123 W' TEST from dual
union
select 'W 123' TEST from dual
);
Oracle has a weird thing about zero length strings and null.
The result doesn't "feel" right, but then if you ask a computer deep philosophical questions about how many zero length substrings are contained in a string, I wouldn't bet on any answer.
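A similar count can be reproduced outside Oracle. The sketch below uses Python's re.findall, assuming Python 3.7 or later, where empty matches are allowed directly after a non-empty one. It is only an analogy: Oracle's matching rules are not guaranteed to be identical, and all_matches is a made-up helper name.

```python
import re

def all_matches(text):
    """Every non-overlapping match of \\d* in text, empty matches included."""
    return re.findall(r'\d*', text)

for test in ('123 W', 'W 123'):
    matches = all_matches(test)
    print(repr(test), len(matches), matches)
```

On Python 3.7+ both strings yield four matches, and for 'W 123' the third one is '123', with zero-length matches at the W, at the space, and at the end of the string.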
After thinking through this, it actually makes sense. The pattern \d* says to match a digit zero or more times. The problem here is that the beginning of the string will always match this pattern, because of the zero or more times.
If the string begins with digits, then the match will include those digits, so given 123 W, the pattern matches 123. However, given the string W 123, the pattern also matches at the beginning, but it matches zero characters. This is why you get a NULL result.
This is a general regex thing and not an Oracle thing. You have to be careful with the * quantifier.
Here are two regex fiddle examples to illustrate this, using the string W 123:
\d+ shows 1 match on 123
\d* shows 1 match on nothing (i.e. the beginning of the string)
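The same two cases can be reproduced with a few lines of Python, used here as a stand-in for the regex fiddles. first_match is a made-up helper. Note that Python returns an empty string for the zero-length match, whereas Oracle shows NULL because it treats zero-length strings as NULL.

```python
import re

def first_match(pattern, text):
    """Text of the first match, or None if the pattern does not match at all."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(repr(first_match(r'\d+', 'W 123')))  # '123'
print(repr(first_match(r'\d*', 'W 123')))  # '' (zero-length match at position 0)
print(repr(first_match(r'\d*', '123 W')))  # '123'
```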

regular expression using oracle REGEXP_INSTR

I want to use the REGEXP_INSTR function in a query to search for any match for user input, but I don't know how to write the regular expression. For example, it should match any value that includes the word car, followed by an unspecified number of letters/numbers/spaces, and then the word Paterson. Can anyone please help me with writing this regex?
Ok, so let's break this down.
"any value that includes the word car"
I surmise from this that the word car doesn't need to be at the start of the string, therefore I would start the format string with...
'^.*'
Here the '^' character means the start of the string, the '.' means any character and '*' means 0 or more of the preceding character. So zero or more of any character after the start of the string.
Then the word 'car', so...
'^.*car'
Next up...
"followed by unspecified numbers of letters/numbers/spaces"
I'm guessing that unspecified means zero or more. This is very similar to what we did to identify any characters that might come before 'car', where the '.' means any character:
'^.*car.*'
However, if unspecified means one or more, then you can use '+' in place of '*'
"then the word Paterson"
I'm going to assume that as this is the end of the description, there are no more characters after 'Paterson'.
'^.*car.*Paterson$'
The '$' symbol means that the 'n' of 'Paterson' must be at the end of the string.
Code example:
select
REGEXP_INSTR('123456car1234ABCDPaterson', '^.*car.*Paterson$') as rgx
from dual
Output
RGX
----------
1
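To experiment with the pattern outside the database, a rough Python stand-in for REGEXP_INSTR with default arguments might look like the sketch below. The function regexp_instr here is a hypothetical helper written for this example, not an Oracle or Python API: it returns the 1-based start of the first match, or 0 when there is none.

```python
import re

def regexp_instr(source, pattern):
    """1-based start position of the first match, or 0 when nothing matches."""
    match = re.search(pattern, source)
    return match.start() + 1 if match else 0

value = '123456car1234ABCDPaterson'

print(regexp_instr(value, r'^.*car.*Paterson$'))           # 1
print(regexp_instr('Paterson car', r'^.*car.*Paterson$'))  # 0
print(regexp_instr(value, r'car.*Paterson'))               # 7
```

Because the pattern starts with '^.*', a successful match always begins at position 1, so the result is effectively a yes/no flag. Dropping the leading '^.*' reports where 'car' itself starts.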

How can I extract a substring from a character column without using SUBSTR()?

I have a question regarding the data below.
You can clearly see that each EMP_IDENTIFIER is connected with an EMP_ID.
I need to pull out only the identifier, which is 10 characters, so that it can be inserted into another column.
How would I do that?
I did it the traditional way, using INSTR and SUBSTR.
I just want to know whether there is any other way to do it without using INSTR and SUBSTR.
EMP_ID (VARCHAR2)  EMP_IDENTIFIER (VARCHAR2)
62049              62049-2162400111
6394               6394-1368000222
64473              64473-1814702333
61598              61598-0876000444
57452              57452-0336503555
5842               5842-0000070666
75778              75778-0955501777
76021              76021-0546004888
76274              76274-0000454999
73910              73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the first 10-character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
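As a quick sanity check of the '-(.{10})' pattern against the sample data, here is a small Python sketch. Python's re is used only as a stand-in for REGEXP_SUBSTR, and after_dash is an invented helper name.

```python
import re

def after_dash(emp_identifier):
    """First 10 characters that follow a dash, or None if there are not enough."""
    match = re.search(r'-(.{10})', emp_identifier)
    return match.group(1) if match else None

print(after_dash('62049-2162400111'))  # 2162400111
print(after_dash('5842-0000070666'))   # 0000070666
print(after_dash('6394'))              # None
```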
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the third argument of the function is 1); it looks for the second occurrence (the argument 2); it doesn't do any special matching such as ignore case (the argument "null" to the function); and when the match is found, it returns the fragment of the match that corresponds to the second set of parentheses (the last argument, 2, to the regexp function). The second fragment is the \d+ - the sequence of digits, without the leading and/or trailing dash -.
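One caveat: each match consumes its trailing -, and the search for the next occurrence resumes after it, so two numeric groups that sit directly next to each other cannot both be found. In your example (62049-2162400111) the first match is 62049-, nothing is left to satisfy the leading (-|^) for the second group, and the second occurrence comes back as NULL. In something like AS23302-ATX-20032-33900293-CWV20-3499-RA, the matches are -20032- and then -3499-, so the second occurrence is 3499 rather than 33900293. The pattern returns the second numeric group only when the numeric groups are not adjacent.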
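One thing worth checking with this pattern is how it treats numeric groups that sit next to each other, because each match consumes its trailing dash. The sketch below reproduces the occurrence logic with Python's re.finditer, which also returns non-overlapping matches. The behaviour is assumed, not guaranteed, to carry over to Oracle, and both helper names are made up. A split-based variant is shown next to it for comparison.

```python
import re

NUMERIC_GROUP = re.compile(r'(-|^)(\d+)(-|$)')

def nth_numeric_group_regex(text, n):
    """n-th (1-based) match of the pattern, returning the digits only."""
    groups = [m.group(2) for m in NUMERIC_GROUP.finditer(text)]
    return groups[n - 1] if n <= len(groups) else None

def nth_numeric_group_split(text, n):
    """n-th (1-based) dash-separated field that consists only of digits."""
    groups = [part for part in text.split('-') if re.fullmatch(r'\d+', part)]
    return groups[n - 1] if n <= len(groups) else None

sample = 'AS23302-ATX-20032-33900293-CWV20-3499-RA'

print(nth_numeric_group_regex(sample, 2))              # 3499
print(nth_numeric_group_split(sample, 2))              # 33900293
print(nth_numeric_group_regex('62049-2162400111', 2))  # None
print(nth_numeric_group_split('62049-2162400111', 2))  # 2162400111
```

The regex version skips 33900293 because the dash in front of it was already used to close the match on 20032. Splitting on the delimiter first avoids sharing a delimiter between neighbouring groups.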