Validate UK postcode using regular expression in oracle - sql

Below is the list of valid postcodes:
A1 1AA
A11 1AA
AA1 1AA
AA11 1AA
A1A 1AA
BFPO 1
BFPO 11
BFPO 111
I tried with (([A-Z]{1,2}[0-9]{1,2})\ ([0-9][A-Z]{2}))|(GIR\ 0AA)$ but it is not working. Could you please help me with proper query to validate all the postcode formats.

First, rather than guessing based on the set of data at hand, let's look at what UK postcodes are.
EC1V 9HQ
The first one or two letters is the postcode area and it identifies the main Royal Mail sorting office which will process the mail. In this case EC would go to the Mount Pleasant sorting office in London.
The second part is usually just one or two numbers but for some parts of London it can be a number and a letter. This is the postcode district and tells the sorting office which delivery office the mail should go to.
This third part is the sector and is usually just one number. This tells the delivery office which local area or neighbourhood the mail should go to.
The final part of the postcode is the unit code which is always two letters. This identifies a group of up to 80 addresses and tells the delivery office which postal route (or walk) will deliver the item.
Digesting that...
1 or 2 letters.
A number and maybe an alphanumeric.
A space.
"Usually" a number, but I can't find any instances otherwise.
2 letters.
\A[[:alpha:]]{1,2}\d[[:alnum:]]? \d[[:alpha:]]{2}\z
We can't use \w because that contains an underscore.
I used the more exact \A and \z over ^ and $ because \A and \z match the exact beginning and end of the string, whereas ^ and $ match the beginning and end of a line. $ in particular is tolerant of a trailing newline.
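As a quick way to try the basic pattern outside the database, here is a minimal Python sketch. It is a translation, not the Oracle expression itself: Python's re module has no POSIX classes, so [A-Za-z] and [A-Za-z0-9] stand in for [[:alpha:]] and [[:alnum:]], and Python spells the strict end-of-string anchor \Z. The function name is_basic_postcode is made up for illustration.

```python
import re

# Python translation of the basic pattern (assumption: ASCII letters only).
BASIC = re.compile(r"\A[A-Za-z]{1,2}\d[A-Za-z0-9]? \d[A-Za-z]{2}\Z")

def is_basic_postcode(s):
    """Return True if s has the ordinary area/district/sector/unit shape."""
    return BASIC.search(s) is not None

samples = ["A1 1AA", "A11 1AA", "AA1 1AA", "AA11 1AA", "A1A 1AA", "EC1V 9HQ"]
results = {s: is_basic_postcode(s) for s in samples}

# $ tolerates one trailing newline, the strict end anchor does not.
dollar_accepts_newline = re.search(r"^[A-Z]\d \d[A-Z]{2}$", "A1 1AA\n") is not None
strict_accepts_newline = is_basic_postcode("A1 1AA\n")
```

Note that the BFPO entries from the question do not fit this shape, so they would need their own alternative.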
Of course, there are special cases: XXXX 1ZZ for various overseas territories, where XXXX is one of a fixed list of codes.
\A(ASCN|STHL|TDCU|BBND|BIQQ|FIQQ|PCRN|SIQQ|TKCA) 1ZZ\z
Then a couple of really special cases.
GIR 0AA for Girobank.
AI-2640 for Anguilla.
\A(AI-2640|GIR 0AA)\z
Put them all together into one big (...|...|...) mess. It's good to build the query in three pieces and put it together with the x modifier to ignore whitespace.
REGEXP_LIKE(
postcode,
'\A
(
[[:alpha:]]{1,2}\d[[:alnum:]]?\ \d[[:alpha:]]{2} |
(ASCN|STHL|TDCU|BBND|BIQQ|FIQQ|PCRN|SIQQ|TKCA)\ 1ZZ |
(AI-2640|GIR\ 0AA)
)
\z',
'x'
)
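The same three-piece layout can be sketched in Python with re.VERBOSE, which plays the role of the x modifier. This is an illustration under the same translation assumptions as before (ASCII classes instead of POSIX classes, \Z for the end of the string); is_postcode is a hypothetical helper name.

```python
import re

POSTCODE = re.compile(
    r"""\A
    (
        [A-Za-z]{1,2}\d[A-Za-z0-9]?\ \d[A-Za-z]{2}              # ordinary postcodes
      | (ASCN|STHL|TDCU|BBND|BIQQ|FIQQ|PCRN|SIQQ|TKCA)\ 1ZZ      # overseas territories
      | (AI-2640|GIR\ 0AA)                                       # the really special cases
    )
    \Z""",
    re.VERBOSE,  # whitespace in the pattern is ignored, so spaces are written as "\ "
)

def is_postcode(s):
    """Return True if s matches one of the three alternatives."""
    return POSTCODE.match(s) is not None
```

For example, is_postcode("GIR 0AA") and is_postcode("ASCN 1ZZ") are accepted, while "EC1V9HQ" (no space) and the BFPO entries from the question are not.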
Or you can make the basic regex less strict and accept 2-4 alphanumerics for the first part. Then there's only the special case for Anguilla to worry about.
\A([[:alnum:]]{2,4} \d[[:alpha:]]{2}|AI-2640)\z
On the downside, this will let in post codes that don't exist. On the up side, you don't have to keep tweaking for additional special cases. That's probably fine for this level of filtering.

Related

Regex to match BIN ranges

I'm trying to write a regex that matches the numbers 456725 to 456744 (Last 2 digits, 25-44), but can't seem to figure out a correct regex format. I've tried ^(4567[2-4][0-9]) but using this also matches 456745 which it shouldn't.
If you do it like ^(4567[2-4][0-9]), you are allowing any number in the range between [2-4] together with any number in the range between [0-9], which is obviously not what you wanted.
So you need to change it to something like:
^4567(?:2[5-9]|3[0-9]|4[0-4])
Explanation
^ asserts position at start of the string
4567 matches the characters 4567 literally
Non-capturing group (?:2[5-9]|3[0-9]|4[0-4])
1st Alternative 2[5-9]
2 matches the character 2 literally
Match a single character present in the list [5-9]
2nd Alternative 3[0-9]
3 matches the character 3 literally
Match a single character present in the list [0-9]
3rd Alternative 4[0-4]
4 matches the character 4 literally
Match a single character present in the list [0-4]
You could use the page regex101 to learn more and read good explanations on the subject. Hope it helps.
If your variable is just an integer it is best to just compare it as such...
As for the regex, the ^(4567 part is correct; your issue is that [2-4] and [0-9] are independent of each other. You need to put the pieces together so that only 25-29, 30-39 and 40-44 are allowed.
This should get you on the right track:
^(4567(?:2[5-9]|3[0-9]|4[0-4]))$
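One way to convince yourself the alternation covers exactly 456725 to 456744 is to run it over every number with the 4567 prefix. The sketch below does that in Python and also shows the plain integer comparison; the function names are made up for illustration.

```python
import re

BIN_RANGE = re.compile(r"^4567(?:2[5-9]|3[0-9]|4[0-4])$")

def in_range_regex(bin_number):
    """Regex check on the string form of the number."""
    return BIN_RANGE.match(bin_number) is not None

def in_range_int(bin_number):
    """Plain integer comparison, usually the simpler choice."""
    return 456725 <= bin_number <= 456744

# Every six-digit number starting 4567 that the regex accepts.
matched = [n for n in range(456700, 456800) if in_range_regex(str(n))]
```

The list matched should run from 456725 through 456744 with no gaps, which is the same set in_range_int accepts.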

regexp_like to select rows where an attribute string contains several different words

A bit new to regexp and looking for some help understanding some of the capabilities. I am currently trying to select some sets of data that start with a word followed by a space and then several possible words.
Example 1:
I am basically looking to select data such as Product1 green, Product1 red, Product1 blue (green, red or blue basically) but not:
xyz Product1, Product1 black, Product1 white, Product1 garbage red.
I have tried the following queries with not much success:
Where regexp_like(item, 'Product1 [green | red | blue]');
Where regexp_like(item, 'Product1 [green, red, blue]');
Where regexp_like(item, '^Product1 [green, red, blue]');
Hypothetically, does anybody know of a way I could also implement an 'AND', for example selecting items which contain the words green and red in the same attribute.
Example 2:
Similar situation, but trying to match a word after a punctuation
Where regexp_like (job, 'Commerce [[:punct:]] .*');
With this query I am looking to select jobs which have
Commerce - test
Commerce : abcdefg
These queries are not working as I would expect them to and I'm not able to quite figure out why. I am assuming I have misunderstood the construct of these regular expressions.
Any help / explanations would be greatly appreciated!
For the first, try the following
WHERE REGEXP_LIKE(ITEM, '^Product1.*(green|red|blue)')
or
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue)')
or
WHERE REGEXP_LIKE(ITEM, '^Product1 +(green|red|blue)')
depending on what you expect after the Product1 - the first case allows zero or more characters of any kind, the second requires that there be a single space after Product1, and the third requires one or more blanks after Product1.
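To see how the three variants differ on the data from the question, here is a small Python sketch. re.search looks for a match anywhere in the string, which is how REGEXP_LIKE behaves too; the variable and function names are made up for illustration.

```python
import re

items = [
    "Product1 green", "Product1 red", "Product1 blue",
    "xyz Product1", "Product1 black", "Product1 white", "Product1 garbage red",
]

any_chars = re.compile(r"^Product1.*(green|red|blue)")     # anything in between
one_space = re.compile(r"^Product1 (green|red|blue)")      # exactly one space
many_spaces = re.compile(r"^Product1 +(green|red|blue)")   # one or more spaces

def select(pattern):
    """Return the items the pattern matches, in their original order."""
    return [item for item in items if pattern.search(item)]
```

Here select(any_chars) also returns "Product1 garbage red", which the question wanted excluded, so the second or third form is probably closer to what was asked. None of the three is anchored at the end, so "Product1 greenish" would still match; append $ if that matters.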
Not sure where you're going exactly on the second one. If you really want strings that begin with 'Commerce', followed by a space, followed by a punctuation character, another space, and then anything, try
WHERE REGEXP_LIKE(JOB, '^Commerce [[:punct:]] .*');
If instead of a punctuation character you're looking for either ':' or '-', try
WHERE REGEXP_LIKE(JOB, '^Commerce [:-] .*');
I'm no great expert on regular expressions but I'll try to offer some explanations:
^ requires that the following element be at the beginning of the string. Thus, in the first case ^Product1 means "'Product1' must be at the start of the string".
In regular expressions parentheses are used to group expressions, so in the first case (green|red|blue) are grouped together.
| is a logical OR, so (green|red|blue) means "must be one of 'green' or 'red' or 'blue'".
Square brackets are used for character classes. You can use either predefined classes, such as [:punct:] or [:space:] (these go inside the outer brackets, as in [[:punct:]]), or you can make up your own as in [:-]. During regular expression interpretation a square bracket character class, no matter how long, represents a single character in the string being matched. So in the regular expression ^Commerce [:-] .* the character class [:-] means "look for either a colon or a dash". If you want to indicate that you expect multiple occurrences of characters in the class, one after another, use one of the repetition operators (* or +) after the class - so [abc]* would match all of abcabcabc.
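Also keep in mind that by default every character in a regular expression means something, so you can't use whitespace to make regular expressions more legible because the whitespace becomes something that will be looked for when the expression is interpreted. (The exception in Oracle is the 'x' match parameter, which tells REGEXP_LIKE to ignore whitespace in the pattern.)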
Share and enjoy.
Edit
Didn't notice your question about AND earlier. A simple way to AND together multiple expressions is to just put them one after another. To look for (green|red|blue), followed by a space, followed by (green|red|blue) a simple expression would be
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue) (green|red|blue)')
If potentially multiple spaces were to be allowed between the colors
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue) +(green|red|blue)')
could be used.
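Putting the groups one after another fixes the order and would also accept the same colour twice. If the goal is "contains green and red, in any order", one possible approach (a sketch, not the only way) is to test each word separately and require every test to succeed; in SQL that would amount to two REGEXP_LIKE conditions joined with AND. The helper name contains_all is hypothetical.

```python
import re

def contains_all(text, words):
    """True only if every word occurs somewhere in text as a whole word."""
    return all(re.search(r"\b" + re.escape(w) + r"\b", text) for w in words)

# The sequential pattern from above, for comparison.
sequence = re.compile(r"^Product1 (green|red|blue) (green|red|blue)")

both_present = contains_all("Product1 red green", ["green", "red"])
only_one = contains_all("Product1 green green", ["green", "red"])
sequence_accepts_repeat = sequence.search("Product1 green green") is not None
```

The last two lines show the difference: the sequential pattern accepts "Product1 green green", while the per-word check rejects it because red is missing.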
Resistance is useless.

Different ways to display mobile phone numbers?

I want to give my users the ability to choose how their phone numbers are formatted within my software...
Here in Australia I normally display landline phones like:
(07) 5588 2299
+61 7 5588 2299 (international)
And mobile (cell) phones like:
0422 444 666
+61 4 22 444 666 (international)
Is there any standard format for phone numbers in different countries (and even here in Australia if I have it wrong) that I could use as a template?
Thanks in advance!
My original thought was to allow them to specify a mask and a number, sort of like a printf format string method.
That way the mask +99 9 999 9999 with the number 61755882299 would format as +61 7 558 82299.
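A minimal sketch of that mask idea in Python, assuming each 9 in the mask consumes one digit and any digits left over are simply appended (which is what the example above implies). The function name apply_mask is made up.

```python
def apply_mask(mask, digits):
    """Replace each 9 in mask with the next digit; copy other characters as-is."""
    out = []
    remaining = iter(digits)
    for ch in mask:
        if ch == "9":
            d = next(remaining, None)
            if d is None:
                break  # ran out of digits before the mask ended
            out.append(d)
        else:
            out.append(ch)
    out.extend(remaining)  # digits the mask did not account for go on the end
    return "".join(out).rstrip()
```

With this, apply_mask("+99 9 999 9999", "61755882299") gives +61 7 558 82299, and the eleven-place mask "+99 9 9999 9999" gives +61 7 5588 2299.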
But then I reconsidered: if you want to give users formatting ability over their mobile phone numbers, just provide a free text entry field.
Let them enter whatever they want. They may want to enter 1-800-BITE-ME or any number of variations you won't have thought of :-)
That will only ever become a problem if you need to use the phone numbers from within your code and, even then, you could just strip out the non-digit characters first (after possibly converting letters to digits first).
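If you do end up needing the raw digits, the clean-up could look roughly like this in Python. It assumes the standard telephone keypad letter layout and that every letter in the input is part of the number; text such as "ext" would be converted too, so strip that separately first. The name digits_only is hypothetical.

```python
import re

# Standard keypad layout: ABC=2, DEF=3, GHI=4, JKL=5, MNO=6, PQRS=7, TUV=8, WXYZ=9.
KEYPAD = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "22233344455566677778889999",
)

def digits_only(raw):
    """Convert letters to keypad digits, then drop everything that is not a digit."""
    return re.sub(r"\D", "", raw.upper().translate(KEYPAD))
```

For example, digits_only("1-800-BITE-ME") returns 1800248363 and digits_only("+61 7 5588 2299") returns 61755882299.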
However, allowing phone numbers in multiple countries can be ambiguous sometimes.
For example, 61755882299 can be read as +61 7 5588 2299 in Australia or 617-558-8229 in the United States (where the extra 9 is disregarded; 617 is a valid U.S. area code).
(It is allowable to have extra numbers in a phone number; this is a common practice if letters are used instead of numbers; for example: 1-800-MATTRESS.)
In the U.S., the formatting of mobile and landline phone numbers does not differ. Usually numbers are separated by hyphens, but they can also be separated by periods and spaces, and the area code can be enclosed in parentheses:
Examples:
888-555-1234
888.555.1234
888 555-1234
(888) 555-1234

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote begins with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthy amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expressions, but I have not truly figured out how to include any character or line break, except right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
@"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
@"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.
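The behaviour of both patterns is easy to check in any regex engine. The following Python sketch (not Objective-C, and with a shortened stand-in for the sample text) shows the negated class matching across line breaks, and the "peeling" loop for nested footnotes. The names FOOTNOTE, INNERMOST and strip_nested are made up for illustration.

```python
import re

FOOTNOTE = re.compile(r"\[Footnote[^\]]*\]")      # anything except ] in between
INNERMOST = re.compile(r"\[Footnote[^\]\[]*\]")   # neither [ nor ] in between

text = (
    "<p>[Footnote 1: first line.</p>\n"
    "<p>second line.</p>\n"
    "<p>third line.]</p>\n"
    "Body text.--\n"
    "[Footnote 2: single line.]\n"
)

found = FOOTNOTE.findall(text)      # both footnotes, line breaks included
stripped = FOOTNOTE.sub("", text)   # text with the footnotes removed

def strip_nested(s):
    """Remove innermost footnotes repeatedly until a pass removes nothing."""
    while True:
        s, count = INNERMOST.subn("", s)
        if count == 0:
            return s
```

A negated character class matches newlines without any dot-matches-all flag, which is why the first footnote is found even though it spans three lines.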

ASP Regular Expression for UK Telephone format in VB.net

I want regular expression validator for my telephone field in VB.net. Please see the requirement below:
The telephone format should be (+)xx-(0)xxxx-xxxxxx ext xxxx (optional). For example, my number would be entered as 44-7966-591739, and the screen would be formatted to show +44-(0)7966-591739 ext
Please suggest.
Best Regards,
Yuv
+44-(0)7966-591739
The (0) is not valid in phone number display. Remove it.
It's +44 7966 591739 or 07966 591739.
The RegEx pattern is inefficient in multiple ways:
(\d{4}|\d{3})
The above simplifies to:
\d{3,4}
There are bigger problems:
^(((\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3})|((\+44\s?\d{3}|\(?0\d{3}\)?)\s?\d{3}\s?\d{4})|((\+44\s?\d{2}|\(?0\d{2}\)?)\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$
Having found the leading +44 or leading 0 once, why keep on searching for it again and again?
^((\+44\s?..|0..).....|(\+44\s?..|0..).....|(\+44\s?..|0..).....)
simplifies to
^(\+44\s?|0)(.. .....|.. .....|.. .....)
However, the above pattern caters only for UK 4+6, 3+7 and 2+8 format numbers and not for 3+6, 4+5, 5+5 and 5+4 format numbers.
The pattern is inadequate.
Phone number validation and formatting needs to be broken down into separate steps. Allow a wide range of input formats, extract the vital digits and throw away the various dial prefixes, then strictly format the remaining number in international or national format.
For London numbers, the correct format with spaces is:
+44 20 3555 7890 or 020 3555 7890 or (020) 3555 7890
and without spaces:
+442035557890 or 02035557890.
(0) in parentheses is NEVER valid. Do not use it.
UK phone numbers use a variety of formats: 2+8, 3+7, 3+6, 4+6, 4+5, 5+5, 5+4. Some users don't know which format goes with which number range and might use the wrong one on input. Let them do that; you're interested in the DIGITS.
Step 1: Check the input format looks valid
Make sure that the input looks like a UK phone number. Accept various dial prefixes, +44, 011 44, 00 44 with or without parentheses, hyphens or spaces; or national format with a leading 0. Let the user use any format they want for the remainder of the number: (020) 3555 7788 or 00 (44) 203 555 7788 or 02035-557-788 even if it is the wrong format for that particular number. Don't worry about unbalanced parentheses. The important part of the input is making sure it's the correct number of digits. Punctuation and spaces don't matter.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$
The above pattern matches optional opening parentheses, followed by 00 or 011 and optional closing parentheses, followed by an optional space or hyphen, followed by optional opening parentheses. Alternatively, the initial opening parentheses are followed by a literal + without a following space or hyphen. Any of the previous two options are then followed by 44 with optional closing parentheses, followed by optional space or hyphen, followed by optional 0 in optional parentheses, followed by optional space or hyphen, followed by optional opening parentheses (international format). Alternatively, the pattern matches optional initial opening parentheses followed by the 0 trunk code (national format).
The previous part is then followed by the NDC (area code) and the subscriber phone number in 2+8, 3+7, 3+6, 4+6, 4+5, 5+5 or 5+4 format with or without spaces and/or hyphens. This also includes provision for optional closing parentheses and/or optional space or hyphen after where the user thinks the area code ends and the local subscriber number begins. The pattern allows any format to be used with any GB number. The display format must be corrected by later logic if the wrong format for this number has been used by the user on input.
The pattern ends with an optional extension number arranged as an optional space or hyphen followed by x, ext and optional period, or #, followed by the extension number digits. The entire pattern does not bother to check for balanced parentheses as these will be removed from the number in the next step.
At this point you don't care whether the number begins 01 or 07 or something else. You don't care whether it's a valid area code. Later steps will deal with those issues.
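To try the Step 1 pattern against the example inputs, here is a small Python sketch. The pattern is copied from above and only split into three string pieces for readability; the helper name looks_like_gb_number is made up.

```python
import re

LOOKS_LIKE_GB = re.compile(
    # dial prefix: 00 44 / 011 44 / +44 with optional (0), or the national 0
    r"^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)"
    # area code and subscriber number in any of the supported groupings
    r"(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})"
    r"|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}"
    r"|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))"
    # optional extension
    r"(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$"
)

def looks_like_gb_number(raw):
    """Step 1 only: does the input have a plausible shape and digit count?"""
    return LOOKS_LIKE_GB.match(raw) is not None
```

All three example inputs from the text, (020) 3555 7788, 00 (44) 203 555 7788 and 02035-557-788, pass this check, while an input that is one digit short such as 020 3555 778 does not.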
Step 2: Extract the NSN so it can be checked in more detail for length and range
After checking the input looks like a GB telephone number using the pattern above, the next step is to extract the NSN part so that it can be checked in greater detail for validity and then formatted in the right way for the applicable number range.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$
Use the above pattern to extract the '44' from $1 to know that international format was used, otherwise assume national format if $1 is null.
Extract the optional extension number details from $3 and store them for later use.
Extract the NSN (including spaces, hyphens and parentheses) from $2.
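As a rough illustration of Step 2, the following Python sketch applies the extraction pattern and returns the three captured pieces, with the punctuation already stripped from the NSN as described in Step 3. The pattern is the one above, split into pieces; the function name split_number and the dictionary keys are made up.

```python
import re

EXTRACT = re.compile(
    r"^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)"  # $1: country code
    r"([1-9]\d{1,4}\)?[\s\d-]+)"                                                # $2: NSN with punctuation
    r"(?:((?:x|ext\.?\s?|\#)\d+)?)$"                                            # $3: extension
)

def split_number(raw):
    """Return the pieces of a GB number, or None if the input does not match."""
    m = EXTRACT.match(raw)
    if m is None:
        return None
    country, nsn_raw, extension = m.groups()
    return {
        "international": country == "44",   # group 1 is None for national format
        "nsn": re.sub(r"\D", "", nsn_raw),  # keep only the digits of the NSN
        "extension": extension,             # None when no extension was given
    }
```

For instance, split_number("00 (44) 203 555 7788") reports international format with NSN 2035557788, and split_number("020 3555 7788 ext 123") reports national format with the extension text ext 123.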
Step 3: Validate the NSN
Remove the spaces, hyphens and parentheses from $2 and use further RegEx patterns to check the length and range and identify the number type.
These patterns will be much simpler, since they will not have to deal with various dial prefixes or country codes.
The pattern to match valid mobile numbers is therefore as simple as
^7([45789]\d{2}|624)\d{6}$
Premium rate is
^9[018]\d{8}$
There will be a number of other patterns for each number type: landlines, business rate, non-geographic, VoIP, etc.
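With the NSN reduced to bare digits, the per-type checks become a short lookup. This Python sketch covers only the two patterns quoted above; a real table would carry one entry per number type. The names NSN_TYPES and classify are hypothetical.

```python
import re

# Only the two patterns quoted in the text; other number types are omitted.
NSN_TYPES = [
    ("mobile", re.compile(r"^7([45789]\d{2}|624)\d{6}$")),
    ("premium rate", re.compile(r"^9[018]\d{8}$")),
]

def classify(nsn):
    """Return the first matching number type, or 'unclassified'."""
    for name, pattern in NSN_TYPES:
        if pattern.match(nsn):
            return name
    return "unclassified"
```

So classify("7966591739") gives mobile, and a London NSN such as 2035557788 comes back unclassified until a landline pattern is added to the table.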
By breaking the problem into several steps, a very wide range of input formats can be allowed, and the number range and length for the NSN checked in very great detail.
Step 4: Store the number
Once the NSN has been extracted and validated, store the number with country code and all the other digits with no spaces or punctuation, e.g. 442035557788.
Step 5: Format the number for display
Another set of simple rules can be used to format the number with the requisite +44 or 0 added at the beginning.
The rule for numbers beginning 03 is
^44(3\d{2})(\d{3})(\d{4})$
formatted as
0$1 $2 $3 or as +44 $1 $2 $3
and for numbers beginning 02 is
^44(2\d)(\d{4})(\d{4})$
formatted as
(0$1) $2 $3 or as +44 $1 $2 $3
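The two display rules can be sketched as a small rule table in Python. Only the 03 and 02 rules from the text are included, the replacement templates use Python's \g<n> syntax in place of $n, and the names FORMAT_RULES and format_number are made up.

```python
import re

# (pattern on the stored number, national template, international template)
FORMAT_RULES = [
    (re.compile(r"^44(3\d{2})(\d{3})(\d{4})$"),
     r"0\g<1> \g<2> \g<3>", r"+44 \g<1> \g<2> \g<3>"),
    (re.compile(r"^44(2\d)(\d{4})(\d{4})$"),
     r"(0\g<1>) \g<2> \g<3>", r"+44 \g<1> \g<2> \g<3>"),
]

def format_number(stored, international=False):
    """Format a stored number such as 442035557788, or return None if no rule applies."""
    for pattern, national, intl in FORMAT_RULES:
        m = pattern.match(stored)
        if m:
            return m.expand(intl if international else national)
    return None
```

With the stored value 442035557788 this yields (020) 3555 7788 in national format and +44 20 3555 7788 in international format.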
The full list is quite long. I could copy and paste it all into this thread, but it would be hard to maintain that information in multiple places over time. For the present the complete list can be found at: http://aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_GB_Telephone_Numbers
For validation:
As bobince points out, you should be flexible with phone numbers because there are so many ways to enter them.
One simple yet effective way to validate the value is first strip all non-numeric values, then make sure it is at least 11 digits long, and - if you're limiting to UK numbers - then check it starts with either 0 or 44.
In VB.NET (with Imports System.Text.RegularExpressions), something along the lines of this:
If Regex.Replace(Phone, "\D", "").Length < 11 Then
    ' Invalid number
End If
(The \D is regex for anything not 0-9.)
To format a number as requested, assuming you've got a relatively fixed input that you want to display to a page, something like this might work:
replace:
(\d{2,3})\D*0?\D*(\d{4})\D*(\d{6})\D*(\d*)
with:
+$1-(0)$2-$3 ext $4
That's fairly flexible but won't accept any old phone number. It currently requires an international code at the start, and I'm not quite sure on the rules of them to know if it's going to work perfectly, but it might be good enough for what you need.
An explanation of that regex, in regex comment mode (so it can be used directly as a regex if necessary):
(?x) # enable regex comment mode (whitespace ignored, hashes start comments)
# international code:
(\d{2,3}) # matches 3 or 2 digits; captured to group 1.
# optional 0 with potential spaces, dashes or parens:
\D* # matches as many non-digits as possible, none required.
0? # optionally match a zero
\D* # matches as many non-digits as possible, none required.
# main part of number:
(\d{4}) # match 4 digits; captured to group 2
\D* # matches as many non-digits as possible, none required.
(\d{6}) # match 6 digits; captured to group 3.
# optional extension:
\D* # matches as many non-digits as possible, none required.
(\d*) # match as many digits as possible, none required; captured to group 4.
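Here is the same replace tried out in Python, as a sketch only. It assumes six digits for the last mandatory group, so that the question's example 44-7966-591739 (a 4+6 number) fits, and it uses \g<n> in place of $n. It reproduces the layout the question asked for, including the (0); the function name reformat is made up.

```python
import re

# Six digits in the third group is an assumption made to fit 44-7966-591739.
PATTERN = re.compile(r"(\d{2,3})\D*0?\D*(\d{4})\D*(\d{6})\D*(\d*)")
TEMPLATE = r"+\g<1>-(0)\g<2>-\g<3> ext \g<4>"

def reformat(raw):
    """Apply the replace once; input that does not match is returned unchanged."""
    return PATTERN.sub(TEMPLATE, raw, count=1)

plain = reformat("44-7966-591739")
with_ext = reformat("44-7966-591739 ext 1234")
```

When no extension digits are present the result still ends in "ext " with nothing after it, so you may want to trim that afterwards.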
Never include a (0) in parentheses in the international format.
ITU E.123 warns against it: http://www.itu.int/rec/T-REC-E.123-200102-I/en
as does: http://revk.www.me.uk/2009/09/it-is-not-44-0207-123-4567.html