How do ''~'' and ''^'' actually work in PostgreSQL, with practical examples? - sql

I'm trying to solve a problem for which many users have posted solutions that use the "~" syntax, as shown below:
select
business_postal_code as zip,
count(distinct case when left(business_address,1) ~ '^[0-9]' then lower(split_part(business_address, ' ', 2))
else lower(split_part(business_address, ' ', 1)) end ) as n_street
from sf_restaurant_health_violations
where business_postal_code is not null
group by 1
order by 2 desc, 1 asc;
Link to the problem: https://platform.stratascratch.com/coding/10182-number-of-streets-per-zip-code?python=
But I couldn't understand how this part of the code actually works: ... ~ '^ ....

Let's simplify the query in your question to the component parts you're asking about. Once we see how they work individually, perhaps the whole query will make more sense.
To start, the ~ (tilde) is the POSIX, case-sensitive regular expression operator. The linked PostgreSQL documentation provides brief descriptions and usage examples of it and its sibling operators:
Operator | Description | Example
~ | Matches regular expression, case sensitive | 'thomas' ~ '.*thomas.*'
~* | Matches regular expression, case insensitive | 'thomas' ~* '.*Thomas.*'
!~ | Does not match regular expression, case sensitive | 'thomas' !~ '.*Thomas.*'
!~* | Does not match regular expression, case insensitive | 'thomas' !~* '.*vadim.*'
We can see that each operator takes two operands: a string on the left (a constant in these examples) and a pattern on the right. If the string on the left matches the pattern on the right, the expression is true; otherwise it is false.
In the given example for the operator you're asking about, 'thomas' is a match for the pattern '.*thomas.*' by standard regular expression rules. The '.*' pre-and-postfixes mean "match any character (except newline) any number of times (zero or more)". The whole pattern then means, "match any character any number of times, then the literal string 'thomas', then any character any number of times". One such match would be 'john thomas jones' where 'john ' matches the first '.*' and ' jones' matches the second '.*'.
I don't think this is a great example because it is functionally equivalent to 'thomas' LIKE '%thomas%' which is likely to run faster, among other benefits like being a SQL-standard operator.
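To see the four operators side by side, here is a minimal sketch that can be pasted into psql. It uses the string literals from the documentation examples plus the 'john thomas jones' string mentioned above; the column aliases are made up for illustration.

```sql
-- PostgreSQL: each expression evaluates to a boolean.
SELECT
    'thomas' ~   '.*thomas.*'            AS match_cs,        -- true
    'thomas' ~*  '.*Thomas.*'            AS match_ci,        -- true
    'thomas' !~  '.*Thomas.*'            AS no_match_cs,     -- true: 'T' is not 't'
    'thomas' !~* '.*vadim.*'             AS no_match_ci,     -- true
    'john thomas jones' ~ '.*thomas.*'   AS regex_contains,  -- true
    'john thomas jones' LIKE '%thomas%'  AS like_contains;   -- true, same result via LIKE
```

The last two columns show the regex form and the LIKE form agreeing on a simple "contains" test.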
A better example is the query in your question, where the pattern '^[0-9]' is used. Setting aside the ^ for now, this pattern means, "match any character in 0-9 (0, 1, 2, ..., 8, 9)", which would be much more verbose with the LIKE operator, since PostgreSQL's LIKE has no character ranges: field LIKE '%0%' OR field LIKE '%1%' OR field LIKE '%2%' ....
The ^ is not PostgreSQL-specific. Rather, it is a special character in regular expressions with one of two meanings (aside from its use as a literal character; more about that in this answer):
1. The match should begin at the start of the line/string.
   For example, the string "Hello, World!" would contain a match for the pattern 'World' since the word "World" appears in it, but would not contain a match for the pattern '^World' since the word "World" is not at the start of the string.
   The string "Hello, World!" would contain a match for both of the following patterns: 'Hello' and '^Hello' since the word "Hello" is at the start of the string.
2. The given character set should be negated when making a match.
   For example, the pattern [^0-9] means, "match any character that is not in the range 0-9". So 'a' would match, '&' would match, and 'G' would match, but '7' would not match since it is in the character set that is being excluded.
The query in your question uses the first of the two meanings. The pattern '^[0-9]' means, "match any character in the range 0-9 starting at the beginning of the string". So '0123' would match since the string starts with "0", but 'a5' would not match since the string starts with "a", which is not in the character set that is being matched.
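Both meanings of ^ can be checked directly in psql. The following is a small sketch using only the strings discussed above; the column aliases are illustrative names, not anything from the original query.

```sql
-- PostgreSQL: ^ as an anchor versus ^ as a negation inside [...]
SELECT
    'Hello, World!' ~ 'World'   AS unanchored,          -- true: found anywhere
    'Hello, World!' ~ '^World'  AS anchored_miss,       -- false: not at the start
    'Hello, World!' ~ '^Hello'  AS anchored_hit,        -- true
    'a'    ~ '[^0-9]'           AS negated_hit,         -- true: 'a' is not a digit
    '7'    ~ '[^0-9]'           AS negated_miss,        -- false: '7' is excluded
    '0123' ~ '^[0-9]'           AS starts_with_digit,   -- true
    'a5'   ~ '^[0-9]'           AS starts_with_letter;  -- false
```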
Back to the query in your question, then. The relevant part reads:
1 count(distinct
2 case
3 when left(business_address, 1) ~ '^[0-9]'
4 then lower(split_part(business_address, ' ', 2))
5 else lower(split_part(business_address, ' ', 1))
6 end
7 ) as n_street
Line 3 contains a regular expression match that will determine if we should use this case in the overall CASE statement. If the string matches the pattern, the expression will be true and we will use this case. If the string does not match the pattern, the expression will be false and we will try the next case.
The string we are matching to the pattern is left(business_address, 1). The LEFT function takes the first n characters from the string. Since n is "1" here, this returns the first character of the field business_address.
The pattern we are trying to match this string to is '^[0-9]' which we have already said means, "match any character in the range 0-9 starting at the beginning of the string". Technically we don't need the ^ regex operator here since LEFT(..., 1) will return at most one character (which will always be the first character in the resulting string).
As an example, if business_address is "123 Jones Street, Anytown, USA", then LEFT(business_address, 1) will return "1" which will match the pattern (and therefore the expression will be true and we will use the first case).
If, instead, business_address were "Jones Plaza, Suite 123, Anytown, USA", then LEFT(business_address, 1) would return "J" which would not match the pattern (since the first character is "J" which is not in the range 0-9). Our expression would be false and we would continue to the next case.
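As a rough check of the walkthrough above, the CASE expression can be run against the two sample addresses on their own. The VALUES list and the alias names (t, addr, street_word) are assumptions made for this sketch; the real query reads from sf_restaurant_health_violations.

```sql
-- PostgreSQL: apply the CASE expression to the two example addresses.
SELECT
    addr,
    left(addr, 1) ~ '^[0-9]' AS starts_with_digit,
    CASE
        WHEN left(addr, 1) ~ '^[0-9]'
            THEN lower(split_part(addr, ' ', 2))  -- skip the house number
        ELSE lower(split_part(addr, ' ', 1))      -- first word is the street
    END AS street_word
FROM (VALUES
    ('123 Jones Street, Anytown, USA'),
    ('Jones Plaza, Suite 123, Anytown, USA')
) AS t(addr);
-- Row 1: starts_with_digit = true,  street_word = 'jones'
-- Row 2: starts_with_digit = false, street_word = 'jones'
```

Both rows reduce to the same street word, so count(distinct ...) would count them as one street.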

Related

Search first character in a column PostgreSQL

I want to filter on the first character of a column using a character list (bracket expression), but my query returns every row, even though some customers have names that start with non-letters.
I use PostgreSQL.
SELECT name
FROM customs
WHERE name ~* '[a-z]'
https://www.postgresql.org/docs/current/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP:
Unlike LIKE patterns, a regular expression is allowed to match anywhere within a string, unless the regular expression is explicitly anchored to the beginning or end of the string.
Some examples:
'abc' ~ 'abc' true
'abc' ~ '^a' true
'abc' ~ '(b|d)' true
'abc' ~ '^(b|c)' false
So your condition should be
WHERE name ~* '^[a-z]'
if you want to match only at the beginning of name.
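The difference between the two conditions shows up as soon as a name starts with a non-letter but contains a letter later on. Here is a minimal sketch; the sample names are invented for illustration and do not come from the customs table.

```sql
-- PostgreSQL: unanchored versus anchored bracket expression.
SELECT
    name,
    name ~* '[a-z]'  AS unanchored,  -- true if a letter appears anywhere
    name ~* '^[a-z]' AS anchored     -- true only if the first character is a letter
FROM (VALUES
    ('Alice'),
    ('42nd Street Deli'),
    ('#hashtag')
) AS t(name);
-- 'Alice'            : unanchored = true, anchored = true
-- '42nd Street Deli' : unanchored = true, anchored = false
-- '#hashtag'         : unanchored = true, anchored = false
```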
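You could also dispense with regular expression matching and simply use a range comparison:
where name >= 'a' and name < '{'
{ is the character that follows z in the ASCII chart. To select the names that do not start with a lowercase letter instead, invert the condition: where name < 'a' or name >= '{'. Note: For this or any solution, you may need to check whether or not the collation is case-sensitive.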

regular expression using oracle REGEXP_INSTR

I want to use the REGEXP_INSTR function in a query to search for any match for user input, but I don't know how to write the regular expression. For example, it should match any value that includes the word car followed by an unspecified number of letters/numbers/spaces and then the word Paterson. Can anyone please help me with writing this regex?
Ok, so let's break this down.
"any value that includes the word car"
I surmise from this that the word car doesn't need to be at the start of the string, therefore I would start the format string with...
'^.*'
Here the '^' character means the start of the string, the '.' means any character and '*' means 0 or more of the preceding character. So zero or more of any character after the start of the string.
Then the word 'car', so...
'^.*car'
Next up...
"followed by unspecified numbers of letters/numbers/spaces"
I'm guessing that unspecified means zero or more. This is very similar to what we did to identify any characters that might come before 'car', where the '.' means any character:
'^.*car.*'
However, if unspecified means one or more, then you can use '+' in place of '*'
"then the word Paterson"
I'm going to assume that as this is the end of the description, there are no more characters after 'Paterson'.
'^.*car.*Paterson$'
The '$' symbol means that the 'n' of 'Paterson' must be at the end of the string.
Code example:
select
REGEXP_INSTR('123456car1234ABCDPaterson', '^.*car.*Paterson$') as rgx
from dual
Output
RGX
----------
1
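To illustrate the '*' versus '+' choice and the effect of the '$' anchor, here is a small sketch along the same lines. The input strings are made up for illustration; REGEXP_INSTR returns the position where the match starts, or 0 when there is no match.

```sql
-- Oracle: compare '*' (zero or more) with '+' (one or more), and test the $ anchor.
select
  REGEXP_INSTR('carPaterson',             '^.*car.*Paterson$') as star_empty_gap,  -- 1
  REGEXP_INSTR('carPaterson',             '^.*car.+Paterson$') as plus_empty_gap,  -- 0: nothing between the words
  REGEXP_INSTR('my car 42 Paterson',      '^.*car.+Paterson$') as plus_with_gap,   -- 1
  REGEXP_INSTR('my car 42 Paterson Road', '^.*car.*Paterson$') as trailing_text    -- 0: text after 'Paterson'
from dual
```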

Argument '0' is out of range error

I have a SQL query to pull out a street name from a string. It's looking for the last occurrence of a digit, and then pulling the proceeding text as the street name. I keep getting the Oracle error
"argument '0' is out of range"
but I'm struggling to figure out how to fix it.
The part of the query in question is
substr(address,regexp_instr(address,'[[:digit:]]',1,regexp_count(address,'[[:digit:]]'))+2)
Any help would be amazing. (Using SQL Developer.)
The fourth parameter of regexp_instr is the occurrence:
occurrence is a positive integer indicating which occurrence of
pattern in source_string Oracle should search for. The default is 1,
meaning that Oracle searches for the first occurrence of pattern.
In this case, if an address contains no digits, regexp_count returns 0, which is not a valid occurrence.
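One way to keep the original expression is to guard it so that regexp_instr is only evaluated when at least one digit exists. This is a sketch under the assumption that addresses without digits should be returned unchanged; the sample rows and the CTE name t are made up for illustration.

```sql
-- Oracle: only call regexp_instr when regexp_count is at least 1.
with t (address) as (
  select '422 Hickory Str.'  from dual union all
  select 'One US Bank Plaza' from dual
)
select address,
       case
         when regexp_count(address, '[[:digit:]]') > 0
         then substr(address,
                     regexp_instr(address, '[[:digit:]]', 1,
                                  regexp_count(address, '[[:digit:]]')) + 2)
         else address  -- no digits: assumed fallback is the whole address
       end as street
from t;
-- '422 Hickory Str.'  -> 'Hickory Str.'
-- 'One US Bank Plaza' -> 'One US Bank Plaza'
```

This relies on CASE evaluating the THEN branch only for rows where the WHEN condition holds.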
A simpler solution, which does not require separate treatment for addresses without a house number, is this:
with t (address) as (
select '422 Hickory Str.' from dual union all
select 'One US Bank Plaza' from dual
)
select regexp_substr(address, '\s*([^0-9]*)$', 1, 1, null, 1) as street from t;
The output looks like this:
STREET
-------------------------
Hickory Str.
One US Bank Plaza
The third argument to regexp_substr is the first of the three 1's. It means start the search at the first character of address. The second 1 means find the first occurrence of the search pattern. The null means no special match modifiers (such as case insensitive - nothing like that needed here). The last 1 means "return the first SUBEXPRESSION from the match pattern". Subexpressions are parts of the match expression enclosed in parentheses.
The match pattern has a $ at the end - meaning "anchor at the end of the input string" ($ means the end of the string). Then [...] means match any of the characters in square brackets, but the ^ in [^...] changes it to match any character OTHER THAN what is in the square brackets. 0-9 means all characters between 0 and 9; so [^0-9] means match any character(s) OTHER THAN digits, and the * after that means "any number of such characters" (between 0 and everything in the input string). \s is "blank space" - if there are any blank spaces following a possible number in the address, you don't want them included right at the beginning of the street name. The subexpression is just [^0-9]* meaning the non-digits, not including any spaces before them (because the \s* is outside the left parenthesis).
My example illustrates a potential problem though - sometimes an address does, in fact, have a "number" in it, but spelled out as a word instead of using digits. What I show is in fact a real-life address in my town.
Good luck!
looking for the last occurrence of a digit, and then pulling the proceeding text as the street name
You could simply capture the non-digit text that follows the last digit:
SELECT REGEXP_REPLACE( address, '^.*\d\s*(\D*)$', '\1' )
AS street_name
FROM address_table;
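With REGEXP_REPLACE and a '\1' backreference, the position of the capture group decides which side of the last digit is kept. The following sketch contrasts the two placements on made-up sample rows; the CTE and the column aliases are illustrative only.

```sql
-- Oracle: the capture group placement selects the text before or after the last digit.
with t (address) as (
  select '422 Hickory Str.'  from dual union all
  select 'One US Bank Plaza' from dual
)
select address,
       regexp_replace(address, '^(.*)\d+\D*$',   '\1') as before_last_digit,
       regexp_replace(address, '^.*\d\s*(\D*)$', '\1') as after_last_digit
from t;
-- '422 Hickory Str.'  -> before_last_digit = '42', after_last_digit = 'Hickory Str.'
-- 'One US Bank Plaza' -> no digit, no match: both columns return the address unchanged
```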

regexp after a word appear

I'm using regexp to find the text after a certain word appears.
Fiddle demo
The problem is that some addresses use different abbreviations for big house; some are followed by a space, some by a dot:
Quinta
QTA
Qta.
I want all the text after any of those appears, ignoring case.
I tried this one, but I'm not sure how to include multiple starting words:
SELECT
REGEXP_SUBSTR ("Address", '[^QUINTA]+') "REGEXPR_SUBSTR"
FROM Address;
Solution:
I believe this will match the abbreviations you want:
SELECT
REGEXP_REPLACE("Address", '^.*Q(UIN)?TA\.? *|^.*', '', 1, 1, 'i')
"REGEXPR_SUBSTR"
FROM Address;
Demo in SQL fiddle
Explanation:
It tries to match everything from the beginning of the string:
until it finds Q + UIN (optional) + TA + . (optional) + any number of spaces.
if it doesn't find it, then it matches the whole string with ^.*.
Since I'm using REGEXP_REPLACE, it replaces the match with an empty string, thus removing all characters until "QTA", any of its alternations, or the whole string.
Notice the last parameter passed to REGEXP_REPLACE: 'i'. That is a flag that sets a case-insensitive match (flags described here).
The part you were interested in making optional uses a ( pattern ) that is a group with the ? quantifier (which makes it optional). Therefore, Q(UIN)?TA matches either "QUINTA" or "QTA".
Alternatively, in the scope of your question, if you wanted different options, you need to use alternation with a |. For example (pattern1|pattern2|etc) matches any one of the 3 options. Also, the regex (QUINTA|QTA) matches exactly the same as Q(UIN)?TA
What was wrong with your pattern:
The construct you were trying ([^QUINTA]+) uses a character class, and it matches any character except Q, U, I, N, T or A, repeated 1 or more times. But it's applied to characters, not words. For example, [^QUINTA]+ matches the string "BCDEFGHJKLMOPRSVWXYZ" completely, and it fails to match "TIA".
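The alternation form can also be combined with REGEXP_SUBSTR and a subexpression index, which returns the text after the marker directly. This is an alternative sketch, not the approach used in the answer above; the sample addresses and the names t, addr and after_marker are invented for illustration.

```sql
-- Oracle: return subexpression 2, i.e. everything after QUINTA/QTA, an optional dot and spaces.
with t (addr) as (
  select 'Av. Principal Quinta Las Rosas' from dual union all
  select 'Calle 5 QTA Mi Casa'            from dual union all
  select 'Calle 5 Qta. Azul'              from dual union all
  select 'Calle 5 Edificio Sol'           from dual
)
select addr,
       regexp_substr(addr, '(QUINTA|QTA)\.? *(.*)$', 1, 1, 'i', 2) as after_marker
from t;
-- 'Av. Principal Quinta Las Rosas' -> 'Las Rosas'
-- 'Calle 5 QTA Mi Casa'            -> 'Mi Casa'
-- 'Calle 5 Qta. Azul'              -> 'Azul'
-- 'Calle 5 Edificio Sol'           -> NULL (no marker found)
```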

Impossible to match a digit with a REGEXP_REPLACE

I'm trying to extract the '930' from 'EM 930' with the following regexp:
REGEXP_REPLACE(info,'^[:space:]*[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*[:space:]*([0-9]+)[:space:]*$','\1')
But it returns the original string.
Any idea why?
Subsidiary question:
Why does "\1" return the original string when the pattern is not matched? I expected it to return NULL, as in my other regexp experiences (e.g. Perl).
How can I rewrite this in a performant way so that I get either the matched string or NULL?
Your space character class was not exactly correct. If we change [:space:] to [[:space:]], your regexp_replace works as you expect:
REGEXP_REPLACE(info, '^[[:space:]]*[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*[[:space:]]*([0-9]+)[[:space:]]*$','\1')
For the sake of succinctness, we could use the upper character class, [[:upper:]], for [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. This changes the function invocation to:
regexp_replace(info, '^[[:space:]]*[[:upper:]]*[[:space:]]*([0-9]+)[[:space:]]*$','\1')
Or escape characters could be used in lieu of character classes:
\s space
\w word character
\d digit character
regexp_replace(info, '^\s*\w*\s*(\d+)\s*$','\1')
Explanation:
Since your malformed character class, [:space:], does not match the space between 'EM' and '930', your search pattern does not match the source string at all. (Outside an enclosing bracket expression, [:space:] is read as a set of the literal characters ':', 's', 'p', 'a', 'c' and 'e'.)
Your search pattern, '^[[:space:]]*[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*[[:space:]]*([0-9]+)[[:space:]]*$', is anchored to the beginning and end of the column, info, so it can match the column at most once.
In your case, there is no match, so the backreference '\1', which is associated with the group '([0-9]+)', has no value.
Consequently, no characters are replaced and you are left with the original value of the column, info: 'EM 930'.
Interesting variations to better understand this function:
- If your corrected function invocation had no replacement parameter ('\1'), then NULL would be returned:
SELECT regexp_replace(info, '^\s*\w*\s*(\d+)\s*$' ) FROM dual;
- Since you do have a replacement parameter, '\1', and it now has a matching character group, the captured run of digits is returned:
930
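The malformed and corrected patterns can be compared on sample values, together with one possible answer to the subsidiary question: REGEXP_SUBSTR with a subexpression index returns NULL when the pattern does not match. This is a sketch, with the CTE, the second sample row 'EM' and the column aliases made up for illustration.

```sql
-- Oracle: [:space:] versus [[:space:]], and REGEXP_SUBSTR for "match or NULL".
with t (info) as (
  select 'EM 930' from dual union all
  select 'EM'     from dual
)
select info,
       regexp_replace(info, '^[:space:]*[[:upper:]]*[:space:]*([0-9]+)[:space:]*$', '\1')
         as malformed,
       regexp_replace(info, '^[[:space:]]*[[:upper:]]*[[:space:]]*([0-9]+)[[:space:]]*$', '\1')
         as corrected,
       regexp_substr(info, '^[[:space:]]*[[:upper:]]*[[:space:]]*([0-9]+)[[:space:]]*$', 1, 1, null, 1)
         as digits_or_null
from t;
-- 'EM 930' -> malformed = 'EM 930', corrected = '930', digits_or_null = '930'
-- 'EM'     -> malformed = 'EM',     corrected = 'EM',  digits_or_null = NULL
```

REGEXP_REPLACE leaves a non-matching input untouched, which is why only the REGEXP_SUBSTR column yields NULL for 'EM'.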