Substring in Postgresql - sql

I have a quick question.
I'm studying some SQL exercises, and in one of them I need to remove the number from a street name.
For example:
5026 S CRENSHAW BLVD - Should be CRENSHAW
2635 WHITTIER BLVD - Should be WHITTIER
308 WESTWOOD PLZ # 1390L - Should be WESTWOOD
1111 WILSHIRE BLVD - Should be WILSHIRE
The given answer for handling the addresses above was this:
substring(facility_address FROM '[\d]+\s?\w?\s([\w]+)\s?')
I would like to understand how the substring works, and what [\d] etc. mean.
Could someone explain?
Thank you very much indeed! :)

You should read up on regular expressions.
'[\d]+\s?\w?\s([\w]+)\s?' is a regular expression.
I'll try to break it down:
First of all, these are quantifiers:
+ means one or more
? means one or none
And now for the regular expression:
[\d]+ matches one or more digits. I think the square brackets are actually not necessary here.
\s? optionally matches a whitespace character (optionally meaning it may or may not be there)
\w? optionally matches a word character
\s matches a whitespace character, but this time it's not optional since there is no ? at the end
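([\w]+) matches one or more word characters. Notice the parentheses, which denote a so-called 'capture group'. When the pattern contains parentheses, the substring function returns only the text matched inside the first pair.
\s? optionally matches a trailing whitespace character. It is outside the capture group, so it does not change the result.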
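The breakdown above can be checked against the sample addresses. This is a minimal sketch in Python rather than PostgreSQL, on the assumption that \d, \s and \w behave the same way for this data; the helper name street_name is made up for illustration, and match.group(1) plays the role of the value substring() returns.

```python
import re

# Same pattern as in the question. Group 1 is the capture group.
PATTERN = re.compile(r"[\d]+\s?\w?\s([\w]+)\s?")

def street_name(address):
    """Return the text captured by the first group, or None if nothing matches."""
    match = PATTERN.search(address)
    return match.group(1) if match else None

addresses = [
    "5026 S CRENSHAW BLVD",
    "2635 WHITTIER BLVD",
    "308 WESTWOOD PLZ # 1390L",
    "1111 WILSHIRE BLVD",
]

for address in addresses:
    print(address, "->", street_name(address))
```

Note how the optional \s?\w? part absorbs the single-letter direction in "5026 S CRENSHAW BLVD" and matches nothing in the other addresses.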

Related

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression on hive.
We need to exclude the numbers that start with +37 or 0037 (regexp_like may return false for them), and a valid number must contain no letters or spaces.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.
To exclude numbers which start with +01 or +001 or +0001 and have only digits, without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d in Hive should be escaped using a double backslash: \\+ and \\d.
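To see which numbers this pattern rejects, here is a small sketch in Python (not Hive). The pattern is written with single backslashes because a Python raw string does not need the extra escaping; the function name is_excluded is an assumption for illustration.

```python
import re

# In a Hive string literal the same pattern is written '^\\+0{1,3}1\\d+$'.
EXCLUDED = re.compile(r"^\+0{1,3}1\d+$")

def is_excluded(tel_number):
    """True for numbers that start with +01, +001 or +0001 followed only by digits."""
    return EXCLUDED.search(tel_number) is not None

for number in ["+0112345", "+00112345", "+000112345", "+3912345", "+01 12345"]:
    print(number, "->", is_excluded(number))
```

The last sample contains a space, so it does not match the pattern at all; whether such a number should count as excluded or as malformed is a separate decision.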
The general question is whether you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern, or whether you want to describe a well-formed telephone number and include everything that matches the pattern.
Which way to go depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
Your number can start with either + or 00: ^(\+|00)
It cannot be followed by a 37 which in regex can be expressed by the following set of alternatives:
a. It is followed first by a 3 then by anything but 7: 3[0-689]
b. It is followed first by anything but 3 then by any number: [0-24-9]\d
After that there is a sequence of numbers of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note, as leftjoin has pointed out: to use this regex in Hive you might need to additionally escape the backslashes \ in the pattern.
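The assembled pattern can be tried out quickly. This is a sketch in Python, where single backslashes are enough; the function name is_well_formed and the sample numbers are made up for illustration.

```python
import re

# + or 00, then two digits that are not "37", then at least one more digit.
WELL_FORMED = re.compile(r"^(\+|00)(3[0-689]|[0-24-9]\d)\d+$")

def is_well_formed(tel_number):
    """True if the number has an international prefix other than +37 / 0037."""
    return WELL_FORMED.search(tel_number) is not None

for number in ["+3912345", "004912345", "+3712345", "003712345", "12345", "+39 12345"]:
    print(number, "->", is_well_formed(number))
```

Note that this pattern requires a + or 00 prefix, so a bare "12345" is rejected as well.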
You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
See the regex demo. Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.
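A quick way to check the lookahead version is shown below, sketched in Python instead of Hive (so the backslashes are not doubled). The name is_valid and the sample numbers are assumptions for illustration.

```python
import re

# Negative lookahead rejects +37 / 0037 at the start; the rest must be digits.
VALID = re.compile(r"^(?!\+37|0037)\+?\d+$")

def is_valid(tel_number):
    """True for digit-only numbers (optional leading +) not starting with +37 or 0037."""
    return VALID.search(tel_number) is not None

for number in ["+3912345", "12345", "003912345", "+3712345", "003712345", "+39 12345", "39a45"]:
    print(number, "->", is_valid(number))
```

Unlike the alternation-based pattern, this one also accepts numbers without any international prefix.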

REGEXP_REPLACE explanation

Hi, may I know what the query below means?
REGEXP_REPLACE(number,'[^'' ''-/0-9:-@A-Z''[''-`a-z{-~]', 'xy') ext_number
part 1
In terms of explaining what the function call is doing:
It is a function call that analyses the input string 'number' with a regex (the 2nd argument) and replaces every part of the string that matches it with the 3rd argument, 'xy'. As for the name after the parenthesis I am not sure, but the documentation for the function is here
part 2
Sorry to be writing a question within an answer here but I cannot respond in comments yet (not enough rep)
Does this regex work? Unless sql uses different syntax this would appear to be a non-functional regex. There are some red flags, e.g:
The entire regex is wrapped in square brackets, indicating a set of characters, but seems to predominantly hold an expression
There is a range indicator between a single quote and a character (invalid range: if a dash was required in the match it should be escaped with a '\' (backslash))
One set of square brackets is never closed
After some minor tweaks this regex is valid syntax:
^'' ''\-\/0-9:-#A-Z''[''-a-z{-~]`, but does not match anything I can think of. It is important to know what string is being examined and what the context of the program is in order to identify what the regex might be attempting to do.
It seems like it is meant to replace all ASCII control characters in the column or variable number with xy.
[] encloses a class of characters. Any character in that class matches. [^] negates that, so every character that is not in the class matches.
- is a range operator, e.g. a-z means all characters from a to z, like abc...xyz.
It seems like characters enclosed in ' are meant to be escaped (the second ' is there to escape the ' in the string itself). At least this would make some sense. However, for none of the DBMSs I found that have a regexp_replace() function (Postgres, Oracle, DB2, MariaDB, MySQL) did I find anything in the docs that would indicate this escape mechanism. They all use \, but maybe I missed something? Unfortunately you didn't tag which DBMS you're actually using!
Now if you take an ASCII table you'll see that the ranges in the expression make up all printable characters (counting space as printable) in groups from space to /, 0 to 9, : to @, etc. Actually it might have been shorter to express it as '' ''-~, space to ~.
Given the negation, all these don't match. The ones left are from NUL to US and DEL. These match and get replaced by xy one by one.
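The claim can be verified by brute force over the ASCII table. The sketch below is in Python and makes two assumptions: the quoted pieces in the original stand for a literal space and a literal [, and the third range runs from : to @ (the block between the digits and the uppercase letters). The names LONG_CLASS, SHORT_CLASS and replace_non_printable are made up for illustration.

```python
import re

# The character class from the question, written with backslash escaping.
LONG_CLASS = re.compile(r"[^ -/0-9:-@A-Z\[-`a-z{-~]")
# The shorter equivalent suggested above: everything outside space..~
SHORT_CLASS = re.compile(r"[^ -~]")

def replace_non_printable(text, replacement="xy"):
    """Replace every character matched by the negated class, one by one."""
    return LONG_CLASS.sub(replacement, text)

ascii_chars = [chr(code) for code in range(128)]
matched_long = [c for c in ascii_chars if LONG_CLASS.fullmatch(c)]
matched_short = [c for c in ascii_chars if SHORT_CLASS.fullmatch(c)]

# Both classes match exactly the control characters NUL..US plus DEL.
print(matched_long == matched_short)
print([ord(c) for c in matched_long])
print(repr(replace_non_printable("ab\tc\x7f")))
```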

Objective C - RegEx - Invalid Range when trying to match spaces [duplicate]

How can I rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match a hyphen along with the existing characters?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
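These four cases are easy to confirm. The sketch below uses Python's re module, which follows the same hyphen rules as described here; the helper chars_matched is a made-up name, and the last pattern is the one from the question with an unescaped hyphen appended at the end of the class.

```python
import re

def chars_matched(char_class, candidates="abcd-"):
    """Return the candidate characters that the given character class matches."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))

for char_class in ["[-]", "[abc-]", "[-abc]", "[ab-d]"]:
    print(char_class, "matches", chars_matched(char_class))

# The pattern from the question with a trailing, unescaped hyphen added.
EXTENDED = re.compile(r"[a-zA-Z0-9!$* \t\r\n-]")
print(bool(EXTENDED.fullmatch("-")), bool(EXTENDED.fullmatch("#")))
```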
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all equivalent. A hyphen placed between two ranges is treated as a literal symbol. The regex [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" without quotes to match any type of hyphen. The '-' character is just one type of hyphen, which also happens to be a special character in regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

Regex for letters, digits, no spaces

I'm trying to create a regex to check for 6-12 characters, one being a digit, the rest being any characters, no spaces. Can regex do this? I'm trying to do this in Objective-C and I'm not familiar with regex at all. I've been reading a couple of tutorials, but most are for matching simple cases of a number or a set of numbers, not exactly what I'm looking for. I can do it with methods, but I was wondering whether that would be too slow, and I figured I could try learning something new.
asdfg1 == ok
asdfg 1 != ok
asdfgh != ok
123456 != ok
asdfasgdasgdasdfasdf != ok
Use this regex: ^(?=.*\d)(?=.*[a-zA-Z])[^ ]{6,12}$
It seems that you mean "letter" when you say "character", right? And (thanks to burning_LEGION for pointing that out) there may be only one digit?
In that case, use
^(?=\D*\d\D*$)[^\W_]{6,12}$
Explanation:
^ # Start of string
(?=\D*\d\D*$) # Assert that there is exactly one digit in the string
[^\W_] # Match a letter or digit (explanation below)
{6,12} # 6-12 times
$ # End of string
[^\W_] might look a little odd. How does it work? Well, \w matches any letter, digit or underscore. \W matches anything that \w doesn't match. So [^\W] (meaning "match any character that is not not alphanumeric/underscore") is essentially the same as \w, but by adding _ to this character class, we can remove the underscore from the list of allowed characters.
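Both patterns can be run against the examples from the question. This is a sketch in Python rather than Objective-C (the lookahead syntax is the same); the constant names AT_LEAST_ONE_DIGIT and EXACTLY_ONE_DIGIT are made up for illustration.

```python
import re

# First answer: at least one digit, at least one letter, no spaces.
AT_LEAST_ONE_DIGIT = re.compile(r"^(?=.*\d)(?=.*[a-zA-Z])[^ ]{6,12}$")
# Second answer: exactly one digit, only letters and digits.
EXACTLY_ONE_DIGIT = re.compile(r"^(?=\D*\d\D*$)[^\W_]{6,12}$")

samples = ["asdfg1", "asdfg 1", "asdfgh", "123456", "asdfasgdasgdasdfasdf", "asdf12"]

for sample in samples:
    first = AT_LEAST_ONE_DIGIT.search(sample) is not None
    second = EXACTLY_ONE_DIGIT.search(sample) is not None
    print(repr(sample), first, second)
```

The two patterns agree on the five examples from the question and differ on "asdf12", which has two digits: the first pattern accepts it, the second rejects it.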
I didn't try it, but I think this is the answer:
(^[^\d\x20]*\d[^\d\x20]*$){6,12}
This is for one digit: ^[^\d\x20]{0,11}\d{1}[^\d\x20]{0,11}$ but I can't get it limited to a length of 6-12. You can use another function to check the length first and, if it is from 6 to 12, check with the regex I wrote.

Regex issue using ICU regex/regexkitlite

Starting a new question as my other question solved a different issue with the regex.
Here's my regex:
(?i)\\d{1,4}(?<!v(?:ol)?\\.?\\s?)(?![^\\(]*\\))
Regex split up for clarity:
(?i) - case insensitive
\\d{1,4} - a number with 1-4 digits
(?<!v(?:ol)?\\.?\\s?) - the number cannot be preceded by 'v', 'v.', 'vol', 'vol.', with or without a space on the end.
(?![^\\(]*\\)) - Number cannot be inside parentheses.
It all works except for the 'vol.' bit:
@"Words words 342 words (2342) (words 2 words) (words).ext" result 342 - correct.
@"Words - words words (2010) (words 2 words) (words).ext" result nil - correct.
@"words words v34 35.ext" result 34 - incorrect.
@"Words vol.342 343 (1234) (3 words) (desc).ext" result 342 - incorrect.
What am I doing wrong with my 'vol.' section?
You need to put the lookbehind before the number. Also, you need to add digits as illegal characters inside the lookbehind, or the 4 in v.34 will match. Try
(?i)(?<!v(?:ol)?\\.?\\s*\\d*)\\d{1,4}(?![^(]*\\))
This is expecting (edit: wrongly, as it turns out) that regexkitlite supports infinite repetition inside lookbehind, which not many regex flavors do.
A look into the docs shows that it does support finite (but variable) repetition inside lookbehind, and if you are aware that the following will only work if there is at most one space between vol. and the number, then you could try
(?i)(?<!v(?:ol)?\\.?\\s?)(?<!\\d)\\d{1,4}(?![^(]*\\))
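The logic of this last pattern can be checked against the four file names from the question. The sketch below is in Python, not ICU/RegexKitLite, and it is a translation rather than the same pattern: Python's re module only accepts fixed-width lookbehind, so the variable-width lookbehind is expanded here into one fixed-width lookbehind per prefix ('v', 'v.', 'vol', 'vol.', each with and without one trailing space). The names PREFIXES, NUMBER and first_number are made up for illustration.

```python
import re

# Prefixes that must not directly precede the number (optionally plus one space).
PREFIXES = ["v", "v.", "vol", "vol."]

# One fixed-width negative lookbehind per prefix variant.
lookbehinds = "".join(
    f"(?<!{re.escape(p)})(?<!{re.escape(p)}\\s)" for p in PREFIXES
)

# Not preceded by a prefix or a digit, 1-4 digits, not inside parentheses.
NUMBER = re.compile(lookbehinds + r"(?<!\d)\d{1,4}(?![^(]*\))", re.IGNORECASE)

def first_number(name):
    """Return the first number that passes all the checks, or None."""
    match = NUMBER.search(name)
    return match.group(0) if match else None

names = [
    "Words words 342 words (2342) (words 2 words) (words).ext",
    "Words - words words (2010) (words 2 words) (words).ext",
    "words words v34 35.ext",
    "Words vol.342 343 (1234) (3 words) (desc).ext",
]

for name in names:
    print(name, "->", first_number(name))
```

The (?<!\d) check is what stops the engine from skipping the 3 in v34 and matching the 4 on its own.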