printing mobile numbers with sed

printing mobile numbers with sed - awk

I have input records like this :
Addison Clark asdj asjdasjd asjdasndasd 9098890099 BE ME BA
Debby Adam asjhdj23 j23 j123jn123 123jnwjb12hg3 123jh123 jhj23 123 9283774849 MBA MIB PHD BE BA
where first two columns contain the first and the last name and among rest of the text in each line can contain anything including mobile number. My target is to extract first and last name and the mobile number.
I have tried
sed -re 's/^(\b\w+\b) (\b\w+\b).*([0-9]{10}).*/\1 \2 \3/'
which works totally fine , but when I change it to
sed -re 's/^(\b\w+\b) (\b\w+\b).*([0-9]+).*/\1 \2 \3/'
It prints only the first digit in mobile no but not the entire mobile numbers . Any idea what may be wrong with the second command ?

Just use Awk with the default field separator,
awk '{for(i=3;i<=NF;i++){if ($i ~ /^[[:digit:]]{10}$/) { number=$i; break } } printf "%s %s %s\n",$1,$2,number }' file
Addison Clark 9098890099
Debby Adam 9283774849
The idea is to loop from the 3rd field to end of the file to match the mobile number pattern and once found break out of the loop and print rest of the fields.
Look at this, regEx tester page, which said for your [0-9]+ match,
3rd Capturing Group ([0-9]+)
Match a single character present in the list below [0-9]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
0-9 a single character in the range between 0 (ASCII 48) and 9 (ASCII 57)
Look at what the + quantifier means taken from this page,
Limiting Repetition
There's an additional quantifier that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,1} is the same as ?, {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
So [0-9]+ literally means match single or more characters, with minimum being one.

Related

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression on hive.
We need to exclude the numbers with +37 or 0037 at the beginning of the record (it could be a false result on the regex like) and without letters or space.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.

To exclude numbers which start with +01 0r +001 or +0001 and having only digits without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d in Hive should be escaped using double-slash: \\+ and \\d.

The general question is, if you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern or if you want to describe a well-formed telephone number and include everything that matches the pattern.
Which way to go, depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
Your number can start with either + or 00: ^(\+|00)
It cannot be followed by a 37 which in regex can be expressed by the following set of alternatives:
a. It is followed first by a 3 then by anything but 7: 3[0-689]
b. It is followed first by anything but 3 then by any number: [0-24-9]\d
After that there is a sequence of numbers of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note: as leftjoin has pointed out: To use this regex in hive you might need to additionally escape the backslashes \ in the pattern.

You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
See the regex demo. Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.

Regex to match BIN ranges

I'm trying to write a regex that matches the numbers 456725 to 456744 (Last 2 digits, 25-44), but can't seem to figure out a correct regex format. I've tried ^(4567[2-4][0-9]) but using this also matches 456745 which it shouldn't.

If you do it like ^(4567[2-4][0-9]), you are allowing any number in the range between [2-4] together with any number in the range between [0-9], which is obviously not what you wanted.
So you need to change for something like:
^4567(?:2[5-9]|3[0-9]|4[0-4])
Explanation
^ asserts position at start of the string
4567 matches the characters 4567 literally
Non-capturing group (?:2[5-9]|3[0-9]|4[0-4])
1st Alternative 2[5-9]
2 matches the character 2 literally
Match a single character present in the list [5-9]
2nd Alternative 3[0-9]
3 matches the character 3 literally
Match a single character present in the list [0-9]
3rd Alternative 4[0-4]
4 matches the character 4 literally
Match a single character present in the list [0-4]
You could use the page regex101 to learn more and read good explanations on the subject. Hope it helps.

If your variable is just an integer it is best to just compare it as such...
For the regex though..the ^(4567 is correct your issue is the [2-4] and [0-9] those are independent of each other. You need to put the pieces together so only 25-29 and 40-44 are allowed.
This should get you on the right track:
^(4567(?:2[5-9]|3[0-9]|4[0-4]))$

how to limit number of characters with complete word in regular expression

I have a requirement to Print Long String into different strings with character limit as 20 with complete word and spaces, symbols, commas, dots has to be allowed.
Lets say the String is:
I have String Search the whole web or only webpages After doing some
research I think that I want to combine an if/then statement with
lookahead, i.e. go to the character limit and if there is a character
following it that is a space, add an ellipses, if it is a number or
letter, go to the final space within the limit and add an ellipses
It has to print like:
I have String Search ------> 20Characters with Complete Word
the
whole web or ------> 16C because limit is 20 but next word
Complete's at 21C So its limit to 16C
only webpages After ------->
19C because limit is 20 but next word ends at 25C

Use this RegEx Pattern: (.{1,20})(?:\s|$)
Escaped RegEx: (.{1,20})(?:\\s|$)
Explained demo here: http://regex101.com/r/pU4kI8

Regex issue using ICU regex/regexkitlite

Starting a new question as my other question solved a different issue with the regex.
Here's my regex:
(?i)\\d{1,4}(?<!v(?:ol)?\\.?\\s?)(?![^\\(]*\\))
Regex split up for clarity:
(?i) - case insensitive
\\d{1,4} - a number with 1-4 digits
(?<!v(?:ol)?\\.?\\s?) the number cannot be preceded by 'v', 'v.', 'vol', 'vol.', with or without a space on the end.
(?![^\\(]*\\)) - Number cannot be inside parentheses.
It all works except for the 'vol.' bit.:
#"Words words 342 words (2342) (words 2 words) (words).ext" result 342 - correct.
#"Words - words words (2010) (words 2 words) (words).ext" result nil - correct.
#"words words v34 35.ext" result 34 - incorrect.
#"Words vol.342 343 (1234) (3 words) (desc).ext" result 342 - incorrect.
What am I doing wrong with my 'vol.' section?

You need to put the lookbehind before the number. Also, you need to add digits as illegal characters inside the lookbehind, or the 4 in v.34 will match. Try
(?i)(?<!v(?:ol)?\\.?\\s*\\d*)\\d{1,4}(?![^(]*\\))
This is expecting (edit: wrongly, as it turns out) that regexkitlite supports infinite repetition inside lookbehind which not many regex flavors do.
A look into the docs shows that it does support finite (but variable) repetition inside lookbehind, and if you are aware that the following will only work if there is at most one space between vol. and the number, then you could try
(?i)(?<!v(?:ol)?\\.?\\s?)(?<!\\d)\\d{1,4}(?![^(]*\\))

ASP Regular Expression for UK Telephone format in VB.net

I want regular expression validator for my telephone field in VB.net. Please see the requirement below:
Telephone format should be (+)xx-(0)xxxx-xxxxxx ext xxxx (Optional) example my number would appear as 44-7966-591739 Screen would be formatted to show +44-(0)7966-591739 ext
Please suggest.
Best Regards,
Yuv

+44-(0)7966-591739
The (0) is not valid in phone number display. Remove it.
It's +44 7966 591739 or 07966 591739.
The RegEx pattern is inefficient in multiple ways:
(\d{4}|\d{3})
The above simplifies to:
\d{3,4}
There are bigger problems:
^(((+44\s?\d{4}|(?0\d{4})?)\s?\d{3}\s?\d{3})|((+44\s?\d{3}|(?0\d{3})?)\s?\d{3}\s?\d{4})|((+44\s?\d{2}|(?0\d{2})?)\s?\d{4}\s?\d{4}))(\s?#(\d{4}|\d{3}))?$
Having found the leading +44 or leading 0 once, why keep on searching for it again and again?
^((+44\s?..|0..).....|(+44\s?..|0..).....|(+44\s?..|0..).....)
simplifies to
^(+44\s?|0)(.. .....|.. .....|.. .....)
However, the above pattern caters only for UK 4+6, 3+7 and 2+8 format numbers and not for 3+6, 4+5, 5+5 and 5+4 format numbers.
The pattern is inadequate.
Phone number validation and formatting needs to be broken down into separate steps. Allow a wide range of input formats, extract the vital digits and throw away the various dial prefixes, then strictly format the remaining number in international or national format.
For London numbers, the correct format with spaces is:
+44 20 3555 7890 or 020 3555 7890 or (020) 3555 7890
and without spaces:
+442035557890 or 02035557890.
(0) in parentheses is NEVER valid. Do not use it.
UK phone numbers use a variety of formats: 2+8, 3+7, 3+6, 4+6, 4+5, 5+5, 5+4. Some users don't know which format goes with which number range and might use the wrong one on input. Let them do that; you're interested in the DIGITS.
Step 1: Check the input format looks valid
Make sure that the input looks like a UK phone number. Accept various dial prefixes, +44, 011 44, 00 44 with or without parentheses, hyphens or spaces; or national format with a leading 0. Let the user use any format they want for the remainder of the number: (020) 3555 7788 or 00 (44) 203 555 7788 or 02035-557-788 even if it is the wrong format for that particular number. Don't worry about unbalanced parentheses. The important part of the input is making sure it's the correct number of digits. Punctuation and spaces don't matter.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$
The above pattern matches optional opening parentheses, followed by 00 or 011 and optional closing parentheses, followed by an optional space or hyphen, followed by optional opening parentheses. Alternatively, the initial opening parentheses are followed by a literal + without a following space or hyphen. Any of the previous two options are then followed by 44 with optional closing parentheses, followed by optional space or hyphen, followed by optional 0 in optional parentheses, followed by optional space or hyphen, followed by optional opening parentheses (international format). Alternatively, the pattern matches optional initial opening parentheses followed by the 0 trunk code (national format).
The previous part is then followed by the NDC (area code) and the subscriber phone number in 2+8, 3+7, 3+6, 4+6, 4+5, 5+5 or 5+4 format with or without spaces and/or hyphens. This also includes provision for optional closing parentheses and/or optional space or hyphen after where the user thinks the area code ends and the local subscriber number begins. The pattern allows any format to be used with any GB number. The display format must be corrected by later logic if the wrong format for this number has been used by the user on input.
The pattern ends with an optional extension number arranged as an optional space or hyphen followed by x, ext and optional period, or #, followed by the extension number digits. The entire pattern does not bother to check for balanced parentheses as these will be removed from the number in the next step.
At this point you don't care whether the number begins 01 or 07 or something else. You don't care whether it's a valid area code. Later steps will deal with those issues.
Step 2: Extract the NSN so it can be checked in more detail for length and range
After checking the input looks like a GB telephone number using the pattern above, the next step is to extract the NSN part so that it can be checked in greater detail for validity and then formatted in the right way for the applicable number range.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$
Use the above pattern to extract the '44' from $1 to know that international format was used, otherwise assume national format if $1 is null.
Extract the optional extension number details from $3 and store them for later use.
Extract the NSN (including spaces, hyphens and parentheses) from $2.
Step 3: Validate the NSN
Remove the spaces, hyphens and parentheses from $2 and use further RegEx patterns to check the length and range and identify the number type.
These patterns will be much simpler, since they will not have to deal with various dial prefixes or country codes.
The pattern to match valid mobile numbers is therefore as simple as
^7([45789]\d{2}|624)\d{6}$
Premium rate is
^9[018]\d{8}$
There will be a number of other patterns for each number type: landlines, business rate, non-geographic, VoIP, etc.
By breaking the problem into several steps, a very wide range of input formats can be allowed, and the number range and length for the NSN checked in very great detail.
Step 4: Store the number
Once the NSN has been extracted and validated, store the number with country code and all the other digits with no spaces or punctuation, e.g. 442035557788.
Step 5: Format the number for display
Another set of simple rules can be used to format the number with the requisite +44 or 0 added at the beginning.
The rule for numbers beginning 03 is
^44(3\d{2})(\d{3])(\d{4})$
formatted as
0$1 $2 $3 or as +44 $1 $2 $3
and for numbers beginning 02 is
^44(2\d)(\d{4})(\d{4})$
formatted as
(0$1) $2 $3 or as +44 $1 $2 $3
The full list is quite long. I could copy and paste it all into this thread, but it would be hard to maintain that information in multiple places over time. For the present the complete list can be found at: http://aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_GB_Telephone_Numbers

For validation:
As bobince points out, you should be flexible with phone numbers because there are so many ways to enter them.
One simple yet effective way to validate the value is first strip all non-numeric values, then make sure it is at least 11 digits long, and - if you're limiting to UK numbers - then check it starts with either 0 or 44.
I can't be bothered looking up vb.net syntax, but something along the lines of this:
if Phone.replaceAll('\D','').length < 11
// Invalid Number
endif;
(The \D is regex for anything not 0-9.)
To format a number as requested, assuming you've got a relatively fixed input that you want to display to a page, something like this might work:
replace:
(\d{2,3})\D*0?\D*(\d{4})\D*(\d{5})\D*(\d*)
with:
+$1-(0)$2-$3 ext $4
That's fairly flexible but wont accept any old phone number. It currently required an international code at the start, and I'm not quite sure on the rules of them to know if it's going to work perfectly, but it might be good enough for what you need.
An explanation of that regex, in regex comment mode (so it can be used directly as a regex if necessary):
(?x) # enable regex comment mode (whitespace ignored, hashes start comments)
# international code:
(\d{2,3}) # matches 3 or 2 digits; captured to group 1.
# optional 0 with potental spaces dashes or parens:
\D* # matches as many non-digits as possible, none required.
0? # optionally match a zero
\D* # matches as many non-digits as possible, none required.
# main part of number:
(\d{4}) # match 4 digits; captured to group 2
\D* # matches as many non-digits as possible, none required.
(\d{5}) # match 5 digits; captured to group 3.
# optional prefix:
\D* # matches as many non-digits as possible, none required.
(\d*) # match as many digits as possible, none required; captured to group 4.

Never include a (0) in parentheses in the international format.
ITU E.123 warns against it: http://www.itu.int/rec/T-REC-E.123-200102-I/en
as does: http://revk.www.me.uk/2009/09/it-is-not-44-0207-123-4567.html

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

printing mobile numbers with sed - awk

Related

Regex like telephone number on Hive without prefix (+01)

Regex to match BIN ranges

how to limit number of characters with complete word in regular expression

Regex issue using ICU regex/regexkitlite

ASP Regular Expression for UK Telephone format in VB.net

Categories

Resources