Inverse of this regex expression - regex-negation

I have a list:
50 - David Herd (1961-1968)
49 - Teddy Sheringham (1997-2001)
48 - George Wall (1906-1915)
47 - Stan Pearson (1935-1954)
46 - Harry Gregg (1957-1966)
45 - Paddy Crerand (1963-1971)
44 - Jaap Stam (1998-2001)
43 - Paul Ince (1989-1995)
42 - Dwight Yorke (1998-2002)
I want to select all characters EXCEPT the first and last name with the space in between in order to delete them and leave just the first name, space and last name.
So far I can select the first name, space and last name with:
([a-zA-Z]+\s[a-zA-Z]+)
But I am unsure of how to 'invert' this expression. Any pointers would be much appreciated.

If regex replacement is an option for you, you could try the following in regex mode:
Find: \d+ - (\w+(?: \w+)+) \(\d{4}-\d{4}\)
Replace: $1
Demo

One option is to match the whole line, including the surrounding text, and capture the first name, space and last name in a group.
In the replacement, use the capture group.
^.*?\b([a-zA-Z]+\s[a-zA-Z]+)\b.*
Regex demo
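A minimal sketch of both answers in Python, assuming the list is available as plain strings. The function names are made up for illustration, and Python writes the replacement as \1 where the editor syntax above uses $1.

```python
import re

# Pattern from the first answer: number, dash, name, year range.
FULL_LINE = re.compile(r"\d+ - (\w+(?: \w+)+) \(\d{4}-\d{4}\)")
# Pattern from the second answer: capture the first "word space word" pair.
FIRST_PAIR = re.compile(r"^.*?\b([a-zA-Z]+\s[a-zA-Z]+)\b.*")

def keep_name_full_line(line: str) -> str:
    """Replace the whole match with capture group 1 (the name)."""
    return FULL_LINE.sub(r"\1", line)

def keep_name_first_pair(line: str) -> str:
    """Replace the whole line with the first two alphabetic words."""
    return FIRST_PAIR.sub(r"\1", line)

if __name__ == "__main__":
    for line in ["50 - David Herd (1961-1968)", "44 - Jaap Stam (1998-2001)"]:
        print(keep_name_full_line(line), "|", keep_name_first_pair(line))
```

The first pattern is stricter: it only rewrites lines that have the exact "number - name (year-year)" shape, while the second keeps the first two alphabetic words of any line.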

Related

Separate area code from phone with HIVEQL

I have a table with DDD (area code) and Phone fields. Some rows were registered correctly; in others the DDD is attached to the phone number and I need to separate them.
my table:
Modified table:
I am starting my studies in HIVEQL, how can I make this change?
Use regexp_extract(str, regex, group_number) to extract ddd and telefone. Demo:
with mytable as (--test data
select stack(3,'5566997000000','5521997000001','24997000011') as str
)
select regexp_extract(str,'^(?:55)?(\\d{2})(\\d+)',1) as ddd,
regexp_extract(str,'^(?:55)?(\\d{2})(\\d+)',2) as telefone
from mytable
Result:
ddd telefone
66 997000000
21 997000001
24 997000011
Regexp '^(?:55)?(\\d{2})(\\d+)' meaning:
^ - beginning of the string anchor
(?:55)? - non-capturing group with 55 country code zero or one time (optional)
(\\d{2}) - capturing group with two digits - ddd
(\\d+) - capturing group with 1+ digits - telefone
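To check the pattern outside Hive, here is a small Python sketch of the same regex. The function name split_phone is made up for illustration, and Python needs only a single backslash in a raw string where the Hive string literal needs two.

```python
import re

# Same pattern as in the Hive query: optional 55, two-digit ddd, rest.
PHONE = re.compile(r"^(?:55)?(\d{2})(\d+)")

def split_phone(raw: str):
    """Return (ddd, telefone), or None if the string does not match."""
    m = PHONE.match(raw)
    if m is None:
        return None
    return m.group(1), m.group(2)

if __name__ == "__main__":
    for raw in ["5566997000000", "5521997000001", "24997000011"]:
        print(raw, "->", split_phone(raw))
```

One thing to keep in mind: because the 55 prefix is optional and matched greedily, a number stored without the country code whose area code is itself 55 would have that 55 consumed as the country code. If such rows can exist, a length check on the input may be needed first.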

How to extract alphanumeric phrase from a string - if it doesn't exist I wish to flag it

I have a column that can have the following possible values -
ITO26218361281- JANE
SBC28791827135 VATS
SOT21092832917 JOHN DOE
TIM INQ12109283291
JANE DOE 12/15
I only want to extract the 14-character alphanumeric phrase from strings that can look like the above. If the record is like (5), I still want that record to exist so I can call it out as an error. I don't need the exact text to be the same, I just need it to be flagged as an error.
Result expected -
ITO26218361281
SBC28791827135
SOT21092832917
INQ12109283291
JANE DOE 12/15 (or flagged as error)
You can use a pattern to match what you need: 3 letters followed by 11 digits.
Using this in a WHERE clause, you can match all "valid" values:
SELECT *
FROM TableName
WHERE ColumnName like '%[A-Z][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
I'm not a master of patterns or regular expressions, so I used a "simple to understand" pattern here.
With this query, you can extract the data into another table, or UPDATE a table column with the flag you want.
You can see this query working here in sqlfiddle.com
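The query above only filters rows. As a sketch of the extract-or-flag step, here is the same "3 letters plus 11 digits" rule in Python. The function name extract_code and the "ERROR" flag value are assumptions made for illustration.

```python
import re

# 3 uppercase letters followed by 11 digits, anywhere in the string.
CODE = re.compile(r"[A-Z]{3}[0-9]{11}")

def extract_code(value: str) -> str:
    """Return the 14-character code, or the flag 'ERROR' if none is found."""
    m = CODE.search(value)
    return m.group(0) if m else "ERROR"

if __name__ == "__main__":
    rows = [
        "ITO26218361281- JANE",
        "SBC28791827135 VATS",
        "SOT21092832917 JOHN DOE",
        "TIM INQ12109283291",
        "JANE DOE 12/15",
    ]
    for row in rows:
        print(row, "->", extract_code(row))
```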

Query to return all records that have a reoccurring string/number

I am looking for assistance in finding records that have a reoccurring string/number in an attribute due to input mismanagement. For example, the table will look similar to the following:
ID|stuff
1 | 23 jackson jackson st
2 | 89 jackson st
3 | 1 1 jackson st
4 | 66 jackson st
I'd like the return to look like the following:
ID|stuff
1 | 23 jackson jackson st
3 | 1 1 jackson st
Please note that in the above example, 's' doesn't cause ID 2 to be returned, even though it's in both jackSon and St.
Any help would be greatly appreciated, thank you.
You can use back-references in Oracle regular expressions. I think this does what you want:
select *
from t
where regexp_like(' ' || stuff, ' ([^ ]+) .*\1');
Here is a db<>fiddle.
Use this WHERE predicate
where regexp_like(stuff, '(^|\W)(\w+)($|\W).*\2')
Note that the leading group (^|\W) and the trailing group ($|\W) mean that the start/end of the string or a non-word character delimits the second group, which is the first instance of the duplicated word.
The second group is defined as (\w+), one or more word characters.
Alternatively, you may want to use \s (white space) instead of \W - see here for further details.
You should also not underestimate tabs and other white space, which the simple solution ignores.
Here is sample data returned by this regexp, which also handles non-word delimiters:
23 jackson jackson st
1 1 jackson st
68 jackson.st.jackson
See also this answer with a similar topic.
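To compare the two predicates outside the database, here is a Python sketch using the same patterns. The function names are made up for illustration, and Python's re module is only an approximation of Oracle's regex engine (for instance, \w is Unicode-aware in Python by default).

```python
import re

# First answer: space-delimited token followed later by the same text.
SPACE_DELIMITED = re.compile(r" ([^ ]+) .*\1")
# Second answer: whole word delimited by string edges or non-word chars.
WORD_DELIMITED = re.compile(r"(^|\W)(\w+)($|\W).*\2")

def has_repeat_space(stuff: str) -> bool:
    """Mirror regexp_like(' ' || stuff, ' ([^ ]+) .*\\1')."""
    return SPACE_DELIMITED.search(" " + stuff) is not None

def has_repeat_word(stuff: str) -> bool:
    """Mirror regexp_like(stuff, '(^|\\W)(\\w+)($|\\W).*\\2')."""
    return WORD_DELIMITED.search(stuff) is not None

if __name__ == "__main__":
    rows = ["23 jackson jackson st", "89 jackson st", "1 1 jackson st",
            "66 jackson st", "68 jackson.st.jackson"]
    for row in rows:
        print(row, has_repeat_space(row), has_repeat_word(row))
```

The last row shows the difference: only the word-delimited pattern treats the dots as separators.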

How to select a particular word between non-alphabet characters?

I have below values in a column.
TOM-TOM
TOMMY
TOM 12
123_TOM
SITTOM
TOM TIM
TOM,TIN
TOP TOM TON
TOMA
ATOM
How to select only these rows:
TOM-TOM
TOM 12
123_TOM
TOM TIM
TOM,TIN
TOP TOM TON
but not the below rows
SITTOM
TOMMY
TOMA
ATOM
If one character before or after TOM is a non-alphabet, those rows should be shown.
You can achieve this with a straightforward query using REGEXP_LIKE:
select t.col
from t
where regexp_like(t.col, '(^|[^a-zA-Z])TOM([^a-zA-Z]|$)')
Here's a breakdown of the regular expression used:
(^|[^a-zA-Z]) - start of a line or non-alphabetic character;
TOM - TOM;
([^a-zA-Z]|$) - end of a line or non-alphabetic character.
If you want to take non-English letters into account, you can use [:alpha:] instead of a-zA-Z:
select t.col
from t
where regexp_like(t.col, '(^|[^[:alpha:]])TOM([^[:alpha:]]|$)')
Try this on SQLFiddle: http://sqlfiddle.com/#!4/439b6/1
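As a quick check of the first pattern against the values in the question, here is a Python sketch. The function name contains_tom is made up for illustration. Python's re module does not support POSIX classes such as [:alpha:], so only the a-zA-Z variant is shown.

```python
import re

# TOM preceded by start-of-string or a non-letter, and followed by
# a non-letter or end-of-string.
TOM = re.compile(r"(^|[^a-zA-Z])TOM([^a-zA-Z]|$)")

def contains_tom(value: str) -> bool:
    """True when TOM appears as a standalone word in value."""
    return TOM.search(value) is not None

if __name__ == "__main__":
    values = ["TOM-TOM", "TOMMY", "TOM 12", "123_TOM", "SITTOM",
              "TOM TIM", "TOM,TIN", "TOP TOM TON", "TOMA", "ATOM"]
    print([v for v in values if contains_tom(v)])
```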

Compare two addresses which are not in standard format

I have to compare addresses from two tables and get the Id if the address matches.
Each table has three columns Houseno, street, state
The addresses are not in a standard format in either of the tables. There are approx. 50,000 rows that I need to scan through.
In some places it's Ave., Avenue, Ave ., Str, Street, ST., Lane, Ln., Place, PL, Cir, CIRCLE,
in any combination with a dot, comma, spaces or hyphen.
I was thinking of combining all three columns. What would be the best way to do it in SQL or PL/SQL? For example:
table1
HNO    STR           State
-----  ------------  -----
12     6th Ave       NY
10     3rd Aven      SD
12-11  Fouth St      NJ
11     sixth Lane    NY
A23    Main Parkway  NY
A-21   124 th Str.   VA
table2
id  HNO   STR           state
--  ----  ------------  -----
1   12    6 Ave.        NY
13  10    3 Avenue      SD
15  1121  Fouth Street  NJ
33  23    9th Lane      NY
24  X23   Main Cir.     NY
34  A1    124th Street  VA
There is no simple way to achieve what you want. There is expensive software (search for "address standardization software") that can do this, but it is rarely 100% automatic.
What this type of software does is to take the data, use complex heuristics to try to figure out the "official" address and then return that (sometimes with the confidence that the result is correct, sometimes a list of results sorted by confidence).
For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.
Oracle has a built-in package, UTL_MATCH, which has an EDIT_DISTANCE function (based on the Levenshtein algorithm; this is a measure of how many changes you would need to make to turn one string into another). More info about this package / function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
You would need to make some decisions around whether to compare each column or concatenate and then compare and what a reasonable threshold is. For example, you may want to do a manual check on any with an edit distance of less than 8 on the concatenated values.
Let me know if you want any help with the syntax, the edit_distance function just takes 2 varchar2 args (the strings you want to compare) and returns a number.
This is not a perfect solution in that if you set the threshold high you will have a lot of manual checking to do to discard some, and if you set it too low you will miss some matches, but it may be about the best if you want a relatively simple solution.
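To make the idea of an edit distance and a threshold concrete, here is a minimal Python sketch of the Levenshtein algorithm. It is an illustration only, not Oracle's implementation; the function names and the default threshold of 8 (taken from the example figure above) are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def needs_manual_check(addr1: str, addr2: str, threshold: int = 8) -> bool:
    """Hypothetical rule: pairs closer than the threshold get reviewed."""
    return edit_distance(addr1.lower(), addr2.lower()) < threshold

if __name__ == "__main__":
    print(edit_distance("12 6th Ave NY", "12 6 Ave. NY"))
    print(needs_manual_check("12 6th Ave NY", "12 6 Ave. NY"))
```

With short strings like these, a threshold of 8 lets through many unrelated pairs, which is the trade-off described above.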
The way we did this for one of our applications was to use a third-party address normalization API (e.g. Pitney Bowes), normalize each address (an address being a combination of street address, city, state and zip) and create a T-SQL hash for that address. For the address to compare, do the same thing and compare the two hashes; if they match, we have a match.
You can create a cursor that first does a group by on rows where the house number and city are equal.
Then, in a loop,
you can split each row into words with INSTR and SUBSTR, using chr(32) (a space) as the separator.
After that you can compare the substrings, treating equivalent forms as equal, for example 6 = 6th or street = str.
Good luck!
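The token-by-token comparison can be sketched in Python as below. The abbreviation table and the function name normalize_street are assumptions made for illustration; a real list of street-type variants would be much longer.

```python
import re

# Hypothetical mapping of street-type variants to one canonical form.
STREET_TYPES = {
    "ave": "avenue", "aven": "avenue", "avenue": "avenue",
    "st": "street", "str": "street", "street": "street",
    "ln": "lane", "lane": "lane",
    "pl": "place", "place": "place",
    "cir": "circle", "circle": "circle",
}

def normalize_street(text: str) -> str:
    """Lowercase, drop punctuation, strip ordinal suffixes (6th -> 6)
    and map street-type variants to a canonical word."""
    text = re.sub(r"[.,\-]", " ", text.lower())
    # Only handles suffixes attached to the number ("124th", not "124 th").
    text = re.sub(r"\b(\d+)(?:st|nd|rd|th)\b", r"\1", text)
    return " ".join(STREET_TYPES.get(tok, tok) for tok in text.split())

if __name__ == "__main__":
    pairs = [("6th Ave", "6 Ave."), ("3rd Aven", "3 Avenue"),
             ("Fouth St", "Fouth Street"), ("Main Parkway", "Main Cir.")]
    for left, right in pairs:
        print(left, "|", right, "|",
              normalize_street(left) == normalize_street(right))
```

Spelled-out numbers such as "sixth" and detached suffixes such as "124 th" are not covered by this sketch and would need extra rules.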