Extracting numbers from a string without the following characters in SQL

So I have street addresses like the following:
123 Street Ave
1234 Road St Apt B
12345 Passage Way
Now, I'm having a hard time extracting just the street numbers without any of the street names.
I just want:
123
1234
12345

The way you put it, two simple options return the desired result. One uses a regular expression (it selects the digits at the start of the string), while the other returns the first substring (everything before the first space).
with test (address) as
  (select '123 Street Ave'     from dual union all
   select '1234 Road St Apt B' from dual union all
   select '12345 Passage Way'  from dual
  )
select
  address,
  regexp_substr(address, '^\d+') result_1,
  substr(address, 1, instr(address, ' ') - 1) result_2
from test;

ADDRESS            RESULT_1           RESULT_2
------------------ ------------------ ------------------
123 Street Ave     123                123
1234 Road St Apt B 1234               1234
12345 Passage Way  12345              12345
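For readers who want to check the two ideas outside the database, here is a minimal Python sketch of the same logic. It is an illustration only, not part of the Oracle answer, and the function names leading_number and first_token are made up for this example. One difference worth noting: when the string contains no space, the Oracle substr/instr expression returns NULL, whereas this sketch returns the whole string.

```python
import re

def leading_number(address):
    """Return the run of digits at the start of the string, or None."""
    match = re.match(r"\d+", address)  # re.match anchors at the start, like '^\d+'
    return match.group(0) if match else None

def first_token(address):
    """Return everything before the first space (the whole string if there is none)."""
    head, _sep, _tail = address.partition(" ")
    return head

for addr in ["123 Street Ave", "1234 Road St Apt B", "12345 Passage Way"]:
    print(addr, "->", leading_number(addr), first_token(addr))
```

The regular-expression version is the safer of the two when some addresses do not start with a number, because it returns nothing instead of the first word.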

Google Sheets "=QUERY()" JOIN ON or equivalent

I have one large spreadsheet with names, addresses, phone numbers, emails, etc. Some records have a second address, stored in a column named "Address 2". I was hoping to write a query whose output duplicates those rows, the only difference being that the "Address 2" value appears in the main address column.
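Data:
+---+--------+--------------+------------------+--------------+-------------------------+---------------+------------+
|   | A      | B            | C                | D            | E                       | F             | G          |
+---+--------+--------------+------------------+--------------+-------------------------+---------------+------------+
| 1 | Status | Name         | Address          | Phone        | Email                   | Address2      | Hire Date  |
| 2 |        | Joe Smith    | 123 Smith St     | 201 555 3099 | Joe@stackoverflow.com   | 7th Avenue Sq |            |
| 4 | Q      | Jane Smith   | 321 Not Smith St |              |                         |               | 12/15/1980 |
| 5 |        | Robert Smith |                  | 818 555 4321 | Robert@googlesheets.com |               | 12/13/1981 |
+---+--------+--------------+------------------+--------------+-------------------------+---------------+------------+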
I'm looking for a query output like this:
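+---+--------+--------------+------------------+--------------+-------------------------+------------+
|   | A      | B            | C                | D            | E                       | F          |
+---+--------+--------------+------------------+--------------+-------------------------+------------+
| 1 | Status | Name         | Address          | Phone        | Email                   | Hire Date  |
| 2 |        | Joe Smith    | 123 Smith St     | 201 555 3099 | Joe@stackoverflow.com   |            |
| 3 |        | Joe Smith    | 7th Avenue Sq    | 201 555 3099 | Joe@stackoverflow.com   |            |
| 4 | Q      | Jane Smith   | 321 Not Smith St |              |                         | 12/15/1980 |
| 5 |        | Robert Smith |                  | 818 555 4321 | Robert@googlesheets.com | 12/13/1981 |
+---+--------+--------------+------------------+--------------+-------------------------+------------+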
I was trying something like:
=QUERY({Sheet1!$A2:$G,Sheet1!$B2:$B,Sheet1!$F2:$J },"SELECT Col1, Col2, Col3, Col4, Col5, Col7 JOIN Col6 ON Col2 = Col2")
Which I think is more or less how it would be in SQL, but Google sheets doesn't have a join function.
Is there any way to get this done?
The simplest thing you can do is:
=QUERY({A1:E, G1:G; A2:B, F2:F, D2:E, G2:G}, "where Col3 is not null", )
Something like this?
You can stack ranges that have the same number of columns with {};
this sample creates two QUERY calls and stacks their results.
=ArrayFormula(
LAMBDA(DATA_1,DATA_2,
QUERY({DATA_1;DATA_2},"WHERE Col2 IS NOT NULL ORDER BY Col2",1)
)(
QUERY({A1:G4},"SELECT "&JOIN(",","Col"&{1,2,3,4,5,7}),1),
QUERY({A1:G4},"SELECT "&JOIN(",","Col"&{1,2,6,4,5,7})&" WHERE Col6 IS NOT NULL LABEL "&JOIN(",","Col"&{1,2,6,4,5,7}&"''"),1)
)
)
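The stacking idea in both formulas is essentially a UNION: one copy of every row, plus a second copy of each row that has an Address2, with the Address2 value moved into the Address column. A rough Python sketch of that logic follows, assuming each row is a dict keyed by the column headers from the question. The function name expand_second_addresses is hypothetical, and unlike the second formula the sketch does not sort by name; it simply places each duplicate right after its original row.

```python
def expand_second_addresses(rows):
    """Return the rows without the Address2 column, duplicating rows that have one."""
    result = []
    for row in rows:
        primary = {key: value for key, value in row.items() if key != "Address2"}
        result.append(primary)
        if row.get("Address2"):
            # Second copy: same person, but with Address2 in the Address column.
            result.append({**primary, "Address": row["Address2"]})
    return result

sheet = [
    {"Status": "", "Name": "Joe Smith", "Address": "123 Smith St",
     "Phone": "201 555 3099", "Address2": "7th Avenue Sq", "Hire Date": ""},
    {"Status": "Q", "Name": "Jane Smith", "Address": "321 Not Smith St",
     "Phone": "", "Address2": "", "Hire Date": "12/15/1980"},
    {"Status": "", "Name": "Robert Smith", "Address": "",
     "Phone": "818 555 4321", "Address2": "", "Hire Date": "12/13/1981"},
]

for line in expand_second_addresses(sheet):
    print(line)
```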

Comparing string values within a table

Is there any way to compare two string columns to each other and get the matches?
I have two columns containing names, one with the full name and the other with (mostly) just the surname.
I tried soundex, but it only returns rows where the values in both columns sound almost the same.
SELECT * FROM TABLE
WHERE soundex(FullName) = soundex(Surname)
1 John Doe Doe
2 Peter Parker Parker
3 Brian Griffin Brian Griffin
With soundex, only the 3rd row matches.
A simple option is to use instr, which shows whether surname exists in fullname:
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'Doe'           from dual union all
   select 2, 'Peter Parker' , 'Parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select *
from test
where instr(fullname, surname) > 0;

        ID FULLNAME      SURNAME
---------- ------------- -------------
         1 John Doe      Doe
         2 Peter Parker  Parker
         3 Brian Griffin Brian Griffin
Another option is to use one of the UTL_MATCH functions, e.g. Jaro-Winkler similarity, which shows how well those strings match:
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'Doe'           from dual union all
   select 2, 'Peter Parker' , 'Parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select id, fullname, surname,
  utl_match.jaro_winkler_similarity(fullname, surname) jws
from test
order by id;

        ID FULLNAME      SURNAME              JWS
---------- ------------- ------------- ----------
         1 John Doe      Doe                   48
         2 Peter Parker  Parker                62
         3 Brian Griffin Brian Griffin        100
Feel free to explore the other functions that package offers.
Also, note that I didn't pay attention to possible letter case differences (e.g. "DOE" vs. "Doe"). If you need that as well, compare e.g. upper(surname) to upper(fullname).
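Please use the INSTR function. Its first argument is the string to search in and the second is the substring to look for:
SELECT * FROM TABLE
WHERE instr(FullName, Surname) > 0;
For a case-insensitive match:
SELECT * FROM TABLE
WHERE instr(upper(FullName), upper(Surname)) > 0;
-- Note: the query below compares the two strings lexicographically; it does not test for a substring match.
SELECT * FROM TABLE
WHERE upper(FullName) > upper(Surname);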
As far as I know there is nothing out of the box when matching becomes complicated. For the cases shown, however, the following expression would suffice:
where fullname like '%' || surname
Update
The main problem may be false positives:
The last name 'Park' appears in 'Peter Parker'. The query above solves this by looking at the full name's end.
Another problem may be upper / lower case as mentioned in the other answers (not shown in your sample data).
You want the last name 'PARKER' to match 'Peter Parker'.
But when looking at the strings case insensitively, another problem arises:
The last name 'Strong' will suddenly match 'Louis Armstrong'.
A solution for this is to add a blank to make the difference:
where ' ' || upper(fullname) like '% ' || upper(surname)
' LOUIS ARMSTRONG' like '% STRONG' -> false
' LOUIS ARMSTRONG' like '% ARMSTRONG' -> true
' LOUIS ARMSTRONG' like '% LOUIS ARMSTRONG' -> true
Demo: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=0ac5c80061b4aeac1153a8c5976e6e54
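The blank-prefix trick above can be tried out in a few lines of Python. This is only a sketch of the same comparison, with a hypothetical function name surname_matches; it does not reproduce LIKE wildcard handling, so a surname containing % or _ would behave differently in SQL.

```python
def surname_matches(fullname, surname):
    """True if the full name ends with the surname, starting at a word boundary.

    Mirrors: ' ' || upper(fullname) like '% ' || upper(surname)
    """
    return (" " + fullname.upper()).endswith(" " + surname.upper())

for candidate in ["Strong", "ARMSTRONG", "Louis Armstrong"]:
    print(candidate, "->", surname_matches("Louis Armstrong", candidate))
```

Prefixing both sides with a blank is what forces the surname to start at the beginning of a word, or at the beginning of the whole name.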

Oracle SQL Repeated words in the String

I need your suggestions/inputs on the following task. I have the following table:
ID     ID_NAME
------ ---------------------------------
1      TOM HANKS TOM JR
2      PETER PATROL PETER JOHN PETER
3      SAM LIVING
4      JOHNSON & JOHNSON INC
5      DUHGT LLC
6      THE POST OF THE OFFICE
7      TURNING REP WEST
8      GEORGE JOHN
I need a SQL query to find repeated words for every ID. If a repeated word exists, I need the number of times it is repeated.
For instance, in ID 2 the word PETER appears 3 times, and in ID 1 the word TOM appears twice, so I need output something like this:
ID     ID_NAME                           COUNT
------ --------------------------------- --------
1      TOM HANKS TOM JR                  2
2      PETER PATROL PETER JOHN PETER     3
3      SAM LIVING                        0
4      JOHNSON & JOHNSON INC             2
5      DUHGT LLC                         0
6      THE POST OF THE OFFICE            2
7      TURNING REP WEST                  0
8      GEORGE JOHN                       0
Just an FYI: the table has 560K rows.
I tried the query below, but it didn't work; it literally looks at every single word.
SELECT RESULT, COUNT(*)
FROM (SELECT
REGEXP_SUBSTR(COL_NAME, '[^ ]+', 1, COLUMN_VALUE) RESULT
FROM TABLE_NAME T ,
TABLE(CAST(MULTISET(SELECT DISTINCT LEVEL
FROM TABLE_NAME X
CONNECT BY LEVEL <= LENGTH(X.COL_NAME) - LENGTH(REPLACE(X.COL_NAME, ' ', '')) + 1
) AS SYS.ODCINUMBERLIST)) T1
)
WHERE RESULT IS NOT NULL
GROUP BY RESULT
ORDER BY 1;
Please let me know your inputs.
The query below counts repeated words and returns the highest count (if a word appears three times and another appears twice, the result will be the number 3). It treats JOHN as different from John (if capitalization shouldn't count as "different" then wrap the input strings within UPPER(...)). It only considers space as a word delimiter; if something else, like dash, is also considered as a delimiter, add to the REGEXP search pattern. Make sure you put a dash right at the end of a square-bracketed matching character list, etc. - the usual "tricks" for matching character lists. More generally, adapt as needed.
The query first breaks each input string into individual words, and counts how many times each word appears. For the count, I only need the words ("tokens") in the GROUP BY clause, I don't need to actually SELECT them, this is why the innermost query may look odd if you aren't forewarned. (Now you are!)
It also seems you want to show null rather than 1 if there are no repeated words, so I wrote the query to accommodate that. (Not sure why 1 wasn't OK.)
with
test_data ( id, id_name ) as (
select 1, 'TOM HANKS TOM JR' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE' from dual union all
select 7, 'TURNING REP WEST' from dual union all
select 8, 'GEORGE JOHN' from dual
)
-- end of test data; SQL query begins below this line
select id, id_name, case when max(cnt) >= 2 then max(cnt) end as max_count
from (
select id, id_name, count(*) as cnt
from test_data
connect by level <= 1 + regexp_count(id_name, ' ')
and prior id = id
and prior sys_guid() is not null
group by id, id_name, regexp_substr(id_name, '[^ ]+', 1, level)
)
group by id, id_name
order by id -- if needed
;
Output:
ID ID_NAME                        MAX_COUNT
-- ----------------------------- ----------
 1 TOM HANKS TOM JR                       2
 2 PETER PATROL PETER JOHN PETER          3
 3 SAM LIVING
 4 JOHNSON & JOHNSON INC                  2
 5 DUHGT LLC
 6 THE POST OF THE OFFICE                 2
 7 TURNING REP WEST
 8 GEORGE JOHN
8 rows selected.
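The split-then-count logic of the query above can be sketched in Python as follows. This is an illustration of the idea rather than a translation of the CONNECT BY query; the function name max_repeat_count is made up, and, like the SQL, it is case-sensitive and treats only the space as a delimiter.

```python
from collections import Counter

def max_repeat_count(text):
    """Highest number of times any space-delimited word occurs, or None if no word repeats."""
    counts = Counter(text.split(" "))
    counts.pop("", None)  # ignore empty tokens produced by consecutive spaces
    highest = max(counts.values(), default=0)
    return highest if highest >= 2 else None

for name in ["TOM HANKS TOM JR", "PETER PATROL PETER JOHN PETER", "SAM LIVING"]:
    print(name, "->", max_repeat_count(name))
```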
EDIT:
If you only need to find the rows where the string column has at least one repeated word, and you don't care what the highest "repeated word count" is or how many words are repeated, the solution is simpler and more efficient; you don't need to split the input string into component words and count them.
(The OP indicated in the comments, after long dialogue, that this would suffice.)
In the solution the "match pattern" in regexp_like searches for a string of letters, preceded by either the beginning of the string or a space or a dash and ended by space, comma, period, question mark, exclamation point or dash. Both "markers", for beginning and end of a word, can be modified as needed. Make sure the dash is either the first or last character in [...], anywhere else it has a special meaning.
Then it looks for a repetition of the word. That's what \2 does in the match pattern. It's 2 and not 1 because the "word" is in the second pair of parentheses; I need the first pair for the alternation, EITHER start-of-string OR (space or dash).
Look at the first and the last string for special situations that this query covers correctly. Think of any other possible situations that the query may or may not cover.
with
test_data ( id, id_name ) as (
select 1, 'TOM HANKS TOM-ALAN' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE' from dual union all
select 7, 'TURNING REP WEST' from dual union all
select 8, 'GEORGE JOHN-JOHN' from dual
)
-- end of test data; SQL query begins below this line
select id, id_name
from test_data
where regexp_like(id_name, '(^|[ -])([[:alpha:]]+)[ ,.?!-].*\2')
order by id -- if needed
;
ID ID_NAME
-- -----------------------------
 1 TOM HANKS TOM-ALAN
 2 PETER PATROL PETER JOHN PETER
 4 JOHNSON & JOHNSON INC
 6 THE POST OF THE OFFICE
 8 GEORGE JOHN-JOHN
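The same backreference pattern can be tried in Python's re module. The sketch below assumes that [A-Za-z] is an acceptable stand-in for the POSIX class [[:alpha:]], which Python's re does not support; the function name has_repeated_word is hypothetical.

```python
import re

# Group 1: start of string, or a space/dash. Group 2: the word itself.
# The word must be followed by a delimiter, and \2 then looks for it again later on.
REPEATED_WORD = re.compile(r"(^|[ -])([A-Za-z]+)[ ,.?!-].*\2")

def has_repeated_word(text):
    return REPEATED_WORD.search(text) is not None

samples = {
    1: "TOM HANKS TOM-ALAN",
    2: "PETER PATROL PETER JOHN PETER",
    3: "SAM LIVING",
    4: "JOHNSON & JOHNSON INC",
    5: "DUHGT LLC",
    6: "THE POST OF THE OFFICE",
    7: "TURNING REP WEST",
    8: "GEORGE JOHN-JOHN",
}
print([key for key, text in samples.items() if has_repeated_word(text)])
```

One situation the pattern does not guard against: the repetition is not required to be a whole word, so a string such as 'THE OTHER' also matches, because THE occurs inside OTHER.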
The next solution finds the first repeated word and then, in a second step, counts how many times it occurs. Edited to stop it from counting matches inside longer words.
with s (ID, ID_NAME) as (
select 1, 'TOM HANKS TOM JR' from dual union all
select 10, 'TO TOM TOM TOM TOM TO TO TO STOM HANKS TOM TOMMY' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'qwe JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE ' from dual union all
select 7, 'TURNING REP WEST ' from dual union all
select 8, 'GEORGE JOHN ' from dual)
select id,
case when r1 = 0 then 0
else regexp_count(id_name, r3)
- regexp_count(id_name, r3||'\w+') -- exclude words with a tail
- regexp_count(id_name, '\w+'||r3) -- exclude words with a head
+ regexp_count(id_name, '\w+'||r3||'\w+') -- add back words with both head and tail (subtracted twice above)
end as rep_count
from (
select
s.*,
regexp_instr(s.id_name, '(^|\s)(\w+)(\s|$)(.*(\2))+') as r1 ,
regexp_replace(s.id_name, '.*?(^|\s)(\w+)(\s)(.*(\s)\2(\s|$))+.*$', '\2') as r3
from s);
The result is:
        ID  REP_COUNT
---------- ----------
         1          2
        10          4
         2          3
         3          0
         4          2
         5          0
         6          2
         7          0
         8          0

Remove duplicate address values where length of second column is less than the length of the greatest matching address

I'm not sure if I worded the title properly so I apologize. I feel this is best explained by showing my data.
Address 1                              Address 2          City         State AddressInfo#
-------------------------------------- ------------------ ------------ ----- --------------
1 Main St #100 Burbville, CA, 99999    1 Main St #100     Burbville    CA    1001
1 Main St #100 Burbville, CA, 99999    1 Main St          Burbville    CA    1001
1 Main St #100 Burbville, CA, 99999    1 Main st          Burbville    CA    1001
...
4 Old Ave Ste 401 Southtown, OH, 44444 4 Old Ave Ste 401  Southtown    OH    1004
4 Old Ave Ste 401 Southtown, OH, 44444 4 Old Ave Ste 401  Southtown    OH    1004
...
8 New Blvd #800 NewCity, MT, 88888     8 New Blvd #800    NewCity      MT    1008
8 New Blvd #800 NewCity, MT, 88888     8 New Blvd         NewCity      MT    1008
8 New Blvd #800 NewCity, MT, 88888     8 New Blvd         NewCity      MT    1008
I would like to find a way to remove all records where Address 2 is missing the full street address or simply contains an exact duplicate like AddressInfo# 1004.
Expected Output:
Address 1                              Address 2          City         State AddressInfo#
-------------------------------------- ------------------ ------------ ----- --------------
1 Main St #100 Burbville, CA, 99999    1 Main St #100     Burbville    CA    1001
...
4 Old Ave Ste 401 Southtown, OH, 44444 4 Old Ave Ste 401  Southtown    OH    1004
...
8 New Blvd #800 NewCity, MT, 88888     8 New Blvd #800    NewCity      MT    1008
You could rebuild your data into a new table using
select
address_1,max(address_2) as address_2, addressinfo
from
table1
group by address_1,addressinfo
http://sqlfiddle.com/#!6/3d22c/2
Edit 1:
To select city and state as well you need to include it as a group by expression:
select
address_1,max(address_2) as address_2, addressinfo,
city, state
from
table1
group by address_1,addressinfo, city, state
http://sqlfiddle.com/#!6/4527c/1
Edit 2:
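Here MAX does deliver the longest value, as needed. It works because MAX returns the value that sorts last, and a string sorts after any of its own proper prefixes, so this holds only when the shorter values are true prefixes of the longest one.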
Here is an example of this: http://sqlfiddle.com/#!6/3fba8/1
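The syntax below is untested, but the approach is valid: rank the rows of each AddressInfo# by the length of Address2 and keep only the longest.
with cte as
(
select address1, address2, city, state, [AddressInfo#],
ROW_NUMBER() OVER(partition by [AddressInfo#] order by len(address2) desc) as alen
from dbo.Table1
)
select * from cte
where alen = 1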
SELECT DISTINCT
Address1
, Address2
, [AddressInfo#]
, City
, State
-- + any other fields
FROM dbo.Table1 AS t
WHERE NOT EXISTS (
SELECT *
FROM dbo.Table1 AS x
WHERE x.Address1 = t.Address1
-- + any other criteria for "uniqueness"
AND LEFT( x.Address2, LEN( t.Address2 ) ) = t.Address2
AND LEN( x.Address2 ) > LEN( t.Address2 )
);
This query will first get all the rows where there is not another row with the same Address1 and an Address2 matching the current value up to the length of the field, but at least one character longer. The DISTINCT is then applied to eliminate exact duplicates. (This assumes no NULL values.)
A similar query could use the LIKE operator, but this would need to account for special characters in the data, such as "%", "_", or brackets.
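To make the NOT EXISTS logic easier to follow, here is a small Python sketch of the same rule: drop a row when another row with the same Address1 has a longer Address2 that starts with this row's Address2, then remove exact duplicates. The comparison is case-insensitive, which is an assumption meant to mimic a case-insensitive SQL Server collation, and the function name drop_shorter_prefixes is made up for the example.

```python
def drop_shorter_prefixes(rows):
    """rows: list of (address1, address2) tuples. Returns the surviving rows, de-duplicated."""
    kept = []
    seen = set()
    for addr1, addr2 in rows:
        folded = addr2.casefold()
        # Is there a longer Address2 for the same Address1 that begins with this one?
        dominated = any(
            other1 == addr1
            and len(other2) > len(addr2)
            and other2.casefold().startswith(folded)
            for other1, other2 in rows
        )
        key = (addr1, folded)
        if not dominated and key not in seen:  # the 'seen' check plays the role of DISTINCT
            seen.add(key)
            kept.append((addr1, addr2))
    return kept

data = [
    ("1 Main St #100 Burbville, CA, 99999", "1 Main St #100"),
    ("1 Main St #100 Burbville, CA, 99999", "1 Main St"),
    ("1 Main St #100 Burbville, CA, 99999", "1 Main st"),
    ("4 Old Ave Ste 401 Southtown, OH, 44444", "4 Old Ave Ste 401"),
    ("4 Old Ave Ste 401 Southtown, OH, 44444", "4 Old Ave Ste 401"),
]
for row in drop_shorter_prefixes(data):
    print(row)
```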
Some form of:
UPDATE A
SET Address2 = CASE WHEN Address1 = Address2 THEN NULL ELSE
CASE WHEN CHARINDEX(',',Address2,CHARINDEX(',',Address2)) = 0 THEN NULL ELSE Address2 END
END
FROM Address AS A

Splitting a string by multiple delimiters

I have a set of addresses:
34 Main St Suite 23
435 Center Road Ste 3
34 Jack Corner Bldg 4
2 Some Street Building 345
The delimiters would be:
Suite, Ste, Bldg, Building
I would like to separate these addresses into address1 and address2 like this:
+---------------------+--------------+
| Address1 | Address2 |
+---------------------+--------------+
| 34 Main St | Suite 23 |
| 435 Center Road | Ste 3 |
| 34 Jack Corner | Bldg 4 |
| 2 Some Street | Building 345 |
+---------------------+--------------+
How can I define a set of delimiters and delimit in this fashion?
SELECT
T.Address,
Left(T.Address, IsNull(X.Pos - 1, 2147483647)) Address1,
Substring(T.Address, X.Pos + 1, 2147483647) Address2 -- Null if no second
FROM
(
VALUES
('34 Main St Suite 23'),
('435 Center Road Ste 3'),
('34 Jack Corner Bldg 4'),
('2 Some Street Building 345'),
('123 Sterling Rd'),
('405 29th St Bldg 4 Ste 217')
) T (Address)
OUTER APPLY (
SELECT TOP 1 NullIf(PatIndex(Delimiter, T.Address), 0) Pos
FROM (
VALUES ('% Suite %'), ('% Ste %'), ('% Bldg %'), ('% Building %')
) X (Delimiter)
WHERE T.Address LIKE X.Delimiter
ORDER BY Pos
) X
I used PatIndex() with patterns that require a space on each side of the delimiter, so an address like "Sterling Rd" won't give you a false match on "Ste".
Result set:
Address1        Address2
--------------- --------------
34 Main St      Suite 23
435 Center Road Ste 3
34 Jack Corner  Bldg 4
2 Some Street   Building 345
123 Sterling Rd NULL
405 29th St     Bldg 4 Ste 217
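The behaviour of the query (split at the earliest delimiter word that has a space on both sides) can be sketched in Python with a single regular expression. This is an illustration of the idea only, not a translation of the T-SQL; the function name split_address is hypothetical.

```python
import re

# A delimiter only counts when it is a whole word with a space on each side.
DELIMITER = re.compile(r" (Suite|Ste|Bldg|Building) ")

def split_address(address):
    """Return (address1, address2); address2 is None when no delimiter is found."""
    match = DELIMITER.search(address)  # search() finds the leftmost delimiter
    if match is None:
        return address, None
    return address[:match.start()], address[match.start() + 1:]

for addr in ["34 Main St Suite 23", "123 Sterling Rd", "405 29th St Bldg 4 Ste 217"]:
    print(split_address(addr))
```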
You can use a table of delimiters on which to perform your split. In this example I am using XML to do the parsing, but after you've swapped in a reliable delimiter in place of your set (Ste, Suite, etc.) then you can perform the splitting using any of many t-sql based methods.
declare @tab table (s varchar(100))
insert into @tab
select '34 Main St Suite 23' union all
select '435 Center Road Ste 3' union all
select '34 Jack Corner Bldg 4' union all
select '2 Some Street Building 345' union all
select '20950 N. Tatum Blvd., Ste 300' union all
select '1524 McHenry Ave Ste 470';
declare @delimiters table (d varchar(100));
insert into @delimiters
select 'Suite' union all
select 'Ste' union all
select 'Bldg' union all
select 'Building';
select s,
cast('<r>'+ replace(s, d, '</r><r>'+d) + '</r>' as xml),
[Street1] = cast('<r>'+ replace(s, d, '</r><r>'+d) + '</r>' as xml).value('r[1]', 'varchar(100)'),
[Street2] = cast('<r>'+ replace(s, d, '</r><r>'+d) + '</r>' as xml).value('r[2]', 'varchar(100)')
from @tab t
cross apply @delimiters d
where charindex(' '+d+' ', s) > 0;
select Addr,CASE WHEN CHARINDEX('suite',addr,1)>0 then LEFT(addr,CHARINDEX('suite',addr,1)-1)
WHEN CHARINDEX('Ste',addr,1)>0 then LEFT(addr,CHARINDEX('Ste',addr,1)-1)
WHEN CHARINDEX('Bldg',addr,1)>0 then LEFT(addr,CHARINDEX('Bldg',addr,1)-1)
WHEN CHARINDEX('Building',addr,1)>0 then LEFT(addr,CHARINDEX('Building',addr,1)-1)
END as [Address],
CASE WHEN CHARINDEX('suite',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('suite',addr,1)-1))
WHEN CHARINDEX('Ste',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('Ste',addr,1)-1))
WHEN CHARINDEX('Bldg',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('Bldg',addr,1)-1))
WHEN CHARINDEX('Building',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('Building',addr,1)-1))
END as [Address1]
from Addr
If you're going to try to parse this data, and it's NOT going to be delimited by something (e.g. a comma), it's going to be much harder and you will have to make some assumptions. Having a larger data set can help you make stronger assumptions, but it will still be very brittle.
Looking at your data, I think you can make the following assumptions:
1) Address 2 is always the last 2 words (when split with spaces), so you could split the address based on spaces, and use the last 2 as Address 2, and the rest as Address 1.
2) You can assume Address 1 is the first 3 words, and the rest is Address 2.
To split up this data, I would either use the T-SQL equivalent of split(' ', $data) to get an array of the words, or use a T-SQL equivalent of strpos and strrpos to find the second-to-last space or the position of the third space, and then substr everything before and after it into the appropriate variables.
It's up to you to make the decision based on the data available to pick the more robust assumptions and work with them.
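As a quick way to see how the first assumption behaves, here is a Python sketch that treats the last two words as Address 2. It is only an illustration of that assumption, and the function name split_last_two_words is made up.

```python
def split_last_two_words(address):
    """Treat the last two space-separated words as Address 2 (assumption 1 above)."""
    parts = address.rsplit(" ", 2)  # split at most twice, starting from the right
    if len(parts) < 3:
        return address, None  # fewer than three words: nothing to split off
    return parts[0], parts[1] + " " + parts[2]

for addr in ["34 Main St Suite 23", "2 Some Street Building 345", "123 Sterling Rd"]:
    print(split_last_two_words(addr))
```

The last sample shows how brittle the assumption is: an address with no unit at all, such as '123 Sterling Rd', is still split, giving '123' and 'Sterling Rd'.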