Splitting a string by multiple delimiters - SQL

I have a set of addresses:
34 Main St Suite 23
435 Center Road Ste 3
34 Jack Corner Bldg 4
2 Some Street Building 345
The delimiters would be:
Suite, Ste, Bldg, Building
I would like to separate these addresses into address1 and address2 like this:
+-----------------+--------------+
| Address1        | Address2     |
+-----------------+--------------+
| 34 Main St      | Suite 23     |
| 435 Center Road | Ste 3        |
| 34 Jack Corner  | Bldg 4       |
| 2 Some Street   | Building 345 |
+-----------------+--------------+
How can I define a set of delimiters and split the addresses in this fashion?

SELECT
T.Address,
Left(T.Address, IsNull(X.Pos - 1, 2147483647)) Address1,
Substring(T.Address, X.Pos + 1, 2147483647) Address2 -- Null if no second
FROM
(
VALUES
('34 Main St Suite 23'),
('435 Center Road Ste 3'),
('34 Jack Corner Bldg 4'),
('2 Some Street Building 345'),
('123 Sterling Rd'),
('405 29th St Bldg 4 Ste 217')
) T (Address)
OUTER APPLY (
SELECT TOP 1 NullIf(PatIndex(Delimiter, T.Address), 0) Pos
FROM (
VALUES ('% Suite %'), ('% Ste %'), ('% Bldg %'), ('% Building %')
) X (Delimiter)
WHERE T.Address LIKE X.Delimiter
ORDER BY Pos
) X
I used PatIndex() with space-padded patterns so an address like "Sterling Rd" won't give you a false match on "Ste".
Result set:
Address1         Address2
---------------  --------------
34 Main St       Suite 23
435 Center Road  Ste 3
34 Jack Corner   Bldg 4
2 Some Street    Building 345
123 Sterling Rd  NULL
405 29th St      Bldg 4 Ste 217

You can use a table of delimiters on which to perform your split. In this example I am using XML to do the parsing, but once you've swapped in a reliable delimiter in place of your set (Ste, Suite, etc.), you can perform the split using any of the many T-SQL based methods.
-- Table variables are declared with @ (a # prefix is only valid for temp tables created with CREATE TABLE)
declare @tab table (s varchar(100));
insert into @tab
select '34 Main St Suite 23' union all
select '435 Center Road Ste 3' union all
select '34 Jack Corner Bldg 4' union all
select '2 Some Street Building 345' union all
select '20950 N. Tatum Blvd., Ste 300' union all
select '1524 McHenry Ave Ste 470';
declare @delimiters table (d varchar(100));
insert into @delimiters
select 'Suite' union all
select 'Ste' union all
select 'Bldg' union all
select 'Building';
-- Wrap each half in <r> elements, splitting just before the delimiter
select s,
cast('<r>'+ replace(s, d, '</r><r>'+d) + '</r>' as xml),
[Street1] = cast('<r>'+ replace(s, d, '</r><r>'+d) + '</r>' as xml).value('r[1]', 'varchar(100)'),
[Street2] = cast('<r>'+ replace(s, d, '</r><r>'+d) + '</r>' as xml).value('r[2]', 'varchar(100)')
from @tab t
cross apply @delimiters d
where charindex(' '+d+' ', s) > 0;

-- Note: CHARINDEX without surrounding spaces also matches inside words such as 'Sterling'
select Addr,
  CASE WHEN CHARINDEX('suite',addr,1)>0 then LEFT(addr,CHARINDEX('suite',addr,1)-1)
       WHEN CHARINDEX('Ste',addr,1)>0 then LEFT(addr,CHARINDEX('Ste',addr,1)-1)
       WHEN CHARINDEX('Bldg',addr,1)>0 then LEFT(addr,CHARINDEX('Bldg',addr,1)-1)
       WHEN CHARINDEX('Building',addr,1)>0 then LEFT(addr,CHARINDEX('Building',addr,1)-1)
  END as [Address1],
  CASE WHEN CHARINDEX('suite',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('suite',addr,1)-1))
       WHEN CHARINDEX('Ste',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('Ste',addr,1)-1))
       WHEN CHARINDEX('Bldg',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('Bldg',addr,1)-1))
       WHEN CHARINDEX('Building',addr,1)>0 then RIGHT(addr,len(addr)-(CHARINDEX('Building',addr,1)-1))
  END as [Address2]
from Addr

If you're going to parse this data and it's NOT delimited by something (e.g. a comma), it's going to be much harder and you will have to make some assumptions. A larger data set can help you make stronger assumptions, but the result will still be very brittle.
Looking at your data, I think you can make the following assumptions:
1) Address 2 is always the last 2 words (when split with spaces), so you could split the address based on spaces, and use the last 2 as Address 2, and the rest as Address 1.
2) You can assume Address 1 is the first 3 words, and the rest is Address 2.
To split up this data, I would either use a T-SQL equivalent of split(' ', $data) to get an array of the words, or use T-SQL equivalents of strpos and strrpos to find the second-to-last space (or the third space) and then substr everything before and after it into the appropriate columns.
It's up to you to make the decision based on the data available to pick the more robust assumptions and work with them.
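For illustration, the first assumption (Address 2 is always the last two words) could be sketched in T-SQL with REVERSE and CHARINDEX standing in for strrpos. This is only a minimal sketch: the inline sample rows and the SecondLastSpace name are made up here, and it assumes the addresses are already trimmed.

```sql
-- Assumption: Address2 is always the last two words of a trimmed address.
SELECT
    A.Address,
    -- Fall back to the whole address when there are fewer than two spaces
    COALESCE(LEFT(A.Address, LEN(A.Address) - P.SecondLastSpace), A.Address) AS Address1,
    RIGHT(A.Address, P.SecondLastSpace - 1) AS Address2  -- NULL when fewer than two spaces
FROM (
    VALUES
        ('34 Main St Suite 23'),
        ('435 Center Road Ste 3'),
        ('2 Some Street Building 345')
) A (Address)
CROSS APPLY (
    -- Position of the second space counted from the end of the string;
    -- NULLIF turns "not found" (0) into NULL.
    SELECT NULLIF(
        CHARINDEX(' ', REVERSE(A.Address),
                  CHARINDEX(' ', REVERSE(A.Address)) + 1),
        0) AS SecondLastSpace
) P;
```

The brittleness is easy to see: an address such as '123 Sterling Rd' has no second line at all, yet this rule would still split it into '123' and 'Sterling Rd'.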

Related

SQL - Select records with duplication in the same column

Is there a way to select records with duplication within 1 column?
So, for instance you have Address_Table:
Address_Line_1                    Address_Line_2
123 street                        Town
321 street 321 street 321 street  Town
456 street                        Town
789 street 789 street             Town
Is there a way to select all records, like 321 & 789 street, whose Address_Line_1 value contains duplicates of itself?
Thanks
Just a thought, and not fully tested.
Select A.*
From YourTable A
Cross Apply (
select Dupes = Avg(Hits) -- perhaps Max(Hits) instead
From (
Select Value
,Hits = sum(1) over (partition by Value)
From string_split([Address_Line_1],' ')
) B1
) B
Where Dupes>1
Results
Address_Line_1                    Address_Line_2
321 street 321 street 321 street  Town
789 street 789 street             Town
STRING_SPLIT requires a database COMPATIBILITY_LEVEL of at least 130. If yours is lower, you can raise it:
ALTER DATABASE [DatabaseName] SET COMPATIBILITY_LEVEL = 130
Then you can try this:
SELECT ads = STUFF((
SELECT ' ' + value
FROM STRING_SPLIT(Address_Line_1, ' ')
GROUP BY value
FOR XML PATH('')
), 1, 1, '') , Address_Line_1, Address_Line_2 FROM Adress

Comparing string values within a table

Is there any way to compare two string columns with each other and get the matches?
I have two columns containing names, one with the full name and the other with (mostly) just the surname.
I tried it with soundex, but it only returns rows where the values in both columns sound almost the same.
SELECT * FROM TABLE
WHERE soundex(FullName) = soundex(Surname)
1  John Doe       Doe
2  Peter Parker   Parker
3  Brian Griffin  Brian Griffin
With soundex it will only match the 3rd line.
A simple option is to use instr, which shows whether surname exists in fullname:
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'Doe'           from dual union all
   select 2, 'Peter Parker' , 'Parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select *
from test
where instr(fullname, surname) > 0;

        ID FULLNAME      SURNAME
---------- ------------- -------------
         1 John Doe      Doe
         2 Peter Parker  Parker
         3 Brian Griffin Brian Griffin
Another option is to use one of UTL_MATCH functions, e.g. Jaro-Winkler similarity which shows how well those strings match:
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'Doe'           from dual union all
   select 2, 'Peter Parker' , 'Parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select id, fullname, surname,
  utl_match.jaro_winkler_similarity(fullname, surname) jws
from test
order by id;

        ID FULLNAME      SURNAME              JWS
---------- ------------- ------------- ----------
         1 John Doe      Doe                   48
         2 Peter Parker  Parker                62
         3 Brian Griffin Brian Griffin        100

Feel free to explore the other functions that package offers.
Also, note that I didn't pay attention to possible letter case differences (e.g. "DOE" vs. "Doe"). If you need that as well, compare e.g. upper(surname) to upper(fullname).
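A minimal sketch of that case-insensitive comparison, using the same kind of test CTE as above but with deliberately mixed-case surnames (these sample rows are made up for illustration):

```sql
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'DOE'    from dual union all
   select 2, 'Peter Parker' , 'parker' from dual union all
   select 3, 'Brian Griffin', 'Smith'  from dual
  )
select *
from test
-- upper() on both sides makes the substring search ignore letter case
where instr(upper(fullname), upper(surname)) > 0;
```

Rows 1 and 2 should match despite the case differences, while row 3 should not, because 'SMITH' does not occur in 'BRIAN GRIFFIN'.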
Please use the instr function. The string to search in comes first and the substring to look for second:
SELECT * FROM TABLE
WHERE instr(FullName, Surname) > 0;
SELECT * FROM TABLE
WHERE instr(upper(FullName), upper(Surname)) > 0;
SELECT * FROM TABLE
WHERE upper(FullName) > upper(Surname);
As far as I know there is nothing out of the box when matching becomes complicated. For the cases shown, however, the following expression would suffice:
where fullname like '%' || surname
Update
The main problem may be false positives:
The last name 'Park' appears in 'Peter Parker'. The query above solves this by looking at the end of the full name.
Another problem may be upper / lower case as mentioned in the other answers (not shown in your sample data).
You want the last name 'PARKER' to match 'Peter Parker'.
But when looking at the strings case insensitively, another problem arises:
The last name 'Strong' will suddenly match 'Louis Armstrong'.
A solution for this is to add a blank to make the difference:
where ' ' || upper(fullname) like '% ' || upper(surname)
' LOUIS ARMSTRONG' like '% STRONG' -> false
' LOUIS ARMSTRONG' like '% ARMSTRONG' -> true
' LOUIS ARMSTRONG' like '% LOUIS ARMSTRONG' -> true
Demo: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=0ac5c80061b4aeac1153a8c5976e6e54

Extracting numbers from a string without the following characters in SQL

So I have street addresses like the following:
123 Street Ave
1234 Road St Apt B
12345 Passage Way
Now, I'm having a hard time extracting just the street numbers without any of the street names.
I just want:
123
1234
12345
The way you put it, two simple options return the desired result. One uses regular expressions (selects the first number in a string), while another one returns the first substring (which is delimited by a space).
with test (address) as
  (select '123 Street Ave'     from dual union all
   select '1234 Road St Apt B' from dual union all
   select '12345 Passage Way'  from dual
  )
select
  address,
  regexp_substr(address, '^\d+') result_1,
  substr(address, 1, instr(address, ' ') - 1) result_2
from test;

ADDRESS            RESULT_1           RESULT_2
------------------ ------------------ ------------------
123 Street Ave     123                123
1234 Road St Apt B 1234               1234
12345 Passage Way  12345              12345

Converting query with aggregate functions to a group by

What I am trying to do is extract what and how many orders are ordered by customers.
I am able to get all the data but what I want is to group it based on a TrackingID unique to each customer, and thus get only one row per customer, regardless of how many items ordered.
The Code I currently have is
Select OT.TrackingID As FW_ID
,( Select
   SUBSTRING(CT.Name, 1, CHARINDEX(' ', CT.Name) - 1)
   Where LEN(CT.Name) - LEN(REPLACE(CT.Name, ' ', '')) > 0
 ) As Forename
,( Select
   SUBSTRING(CT.Name, CHARINDEX(' ', CT.Name) + 1, 8000)
   Where LEN(CT.Name) - LEN(REPLACE(CT.Name, ' ', '')) > 0
 ) As Surname
,( Select CAST(1 as VARCHAR) + ' p1 male'
   Where OT.ArticleNr = 1
   And CMP.GroupNr IN (2,5)) As Amount_male_t1
,( Select CAST(1 as VARCHAR) + ' p1 female'
   Where OT.ArticleNr = 2
   And CMP.GroupNr IN (2,5)) As Amount_female_t1
,( Select CAST(1 as VARCHAR) + ' p2 male'
   Where OT.ArticleNr = 1
   And CMP.GroupNr IN (3,6)) As Amount_male_t2
,( Select CAST(1 as VARCHAR) + ' p2 female'
   Where OT.ArticleNr = 2
   And CMP.GroupNr IN (3,6)) As Amount_female_t2
From OrderTable As OT
JOIN CustomerTable As CT   -- customer data (Name)
ON OT.CustomerNr = CT.CustomerNr
JOIN CampaignTable As CMP  -- campaign data (GroupNr)
ON OT.TrackingID = CMP.TrackingID
Where CMP.GroupNr IN (2,3,5,6)
And OT.NewOrder = 1
An example of what I can get from this is
FW_ID  Forename  Surname  Amount_male_t1  Amount_female_t1  Amount_male_t2  Amount_female_t2
101    John      Doe      1 p1 male       NULL              NULL            NULL
101    John      Doe      NULL            1 p1 female       NULL            NULL
102    Steve     Boss     NULL            NULL              1 p2 male       NULL
102    Steve     Boss     NULL            NULL              1 p2 male       NULL
And what I want is
FW_ID  Forename  Surname  Amount_male_t1  Amount_female_t1  Amount_male_t2  Amount_female_t2
101    John      Doe      1 p1 male       1 p1 female       NULL            NULL
102    Steve     Boss     NULL            NULL              2 p2 male       NULL
The problem is that when I use Group By on OT.TrackingID, I get an error when using MAX() on the names because they are already aggregated, and I get errors when trying to turn the package counters into COUNT() functions.
Help would be most appreciated.
The joined tables look something like this:
OrderTable:
TrackingID  CustomerNr  OrderNr  ArticleNr  NewOrder  OrderDate
101         10054       25       1          1         2014-06-09
101         10054       24       2          1         2014-06-09
102         10036       23       1          1         2014-06-08
102         10036       22       1          1         2014-06-07
103         10044       21       2          0         2014-06-06
CustomerTable:
CustomerNr  Name          Adress      ZipCode  CustomerCreatedDate
10054       John Doe      Upstreet    123456   2013-05-18
10036       Steve Boss    Downstreet  234567   2014-06-07
10044       Eric Cartman  Sidestreet  345678   2014-02-21
CampaignTable:
TrackingID  GroupNr  ProductDescription
101         2        Group 2 & 5 are offered package 1
102         3        Group 3 & 6 are offered package 2
103         5        Group 2 & 5 are offered package 1
NOTE: If someone could give advice as to why my question is downvoted that would be most appreciated. I don't quite know what I've done wrong.
One approach would be to use a view, as mentioned. You can do this in-line in the query as a derived table, so it doesn't need to be saved in the schema.
I've made a demo to show how this can be done with the sample table data you provided. From here, if you want to change the representation of the data to a single line per customer, you can pivot it.
SELECT SplitNames.Forename, SplitNames.LastName
FROM CustomerTable
INNER JOIN
(
  -- Derived table that splits Name at the first space
  SELECT CT.CustomerNr,
    SUBSTRING(CT.Name, 1, CHARINDEX(' ', CT.Name) - 1) As Forename,
    SUBSTRING(CT.Name, CHARINDEX(' ', CT.Name) + 1, LEN(CT.Name)) As LastName -- + 1 skips the space itself
  FROM CustomerTable CT
) SplitNames ON CustomerTable.CustomerNr = SplitNames.CustomerNr
In general you should try to use as few subqueries in the select list as possible, as they make it hard to aggregate your results properly.
You could use your query as a view and then query that view with DISTINCT and COUNT.
I'm sorry, I understand this is not the best solution, since it takes two steps and could probably be solved in a better way, but I can't find anything better at the moment without any example data in a database of my own.
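As a possible single-step alternative, the counters can be written as conditional aggregates directly in a GROUP BY query. The following is only a sketch under assumptions: it uses the table and column names from the question, introduces its own aliases (C for CustomerTable, CP for CampaignTable), and has not been run against the real schema.

```sql
SELECT
    OT.TrackingID AS FW_ID,
    -- Name is in the GROUP BY, so expressions over it need no MAX().
    -- A name without a space would end up in both columns.
    LEFT(C.Name, CHARINDEX(' ', C.Name + ' ') - 1)      AS Forename,
    SUBSTRING(C.Name, CHARINDEX(' ', C.Name) + 1, 8000) AS Surname,
    -- SUM counts the matching orders; NULLIF turns a count of 0 into NULL,
    -- and NULL + string stays NULL (with the default CONCAT_NULL_YIELDS_NULL ON).
    CAST(NULLIF(SUM(CASE WHEN OT.ArticleNr = 1 AND CP.GroupNr IN (2, 5) THEN 1 ELSE 0 END), 0)
         AS VARCHAR(10)) + ' p1 male'   AS Amount_male_t1,
    CAST(NULLIF(SUM(CASE WHEN OT.ArticleNr = 2 AND CP.GroupNr IN (2, 5) THEN 1 ELSE 0 END), 0)
         AS VARCHAR(10)) + ' p1 female' AS Amount_female_t1,
    CAST(NULLIF(SUM(CASE WHEN OT.ArticleNr = 1 AND CP.GroupNr IN (3, 6) THEN 1 ELSE 0 END), 0)
         AS VARCHAR(10)) + ' p2 male'   AS Amount_male_t2,
    CAST(NULLIF(SUM(CASE WHEN OT.ArticleNr = 2 AND CP.GroupNr IN (3, 6) THEN 1 ELSE 0 END), 0)
         AS VARCHAR(10)) + ' p2 female' AS Amount_female_t2
FROM OrderTable AS OT
JOIN CustomerTable AS C
    ON OT.CustomerNr = C.CustomerNr
JOIN CampaignTable AS CP
    ON OT.TrackingID = CP.TrackingID
WHERE CP.GroupNr IN (2, 3, 5, 6)
  AND OT.NewOrder = 1
GROUP BY OT.TrackingID, C.Name;
```

With the sample rows from the question, this should collapse the two orders for tracking ID 102 into a single '2 p2 male' value and put both of John Doe's packages on one row.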

Remove duplicate address values where length of second column is less than the length of the greatest matching address

I'm not sure if I worded the title properly so I apologize. I feel this is best explained by showing my data.
Address 1                               Address 2          City       State  AddressInfo#
--------------------------------------  -----------------  ---------  -----  ------------
1 Main St #100 Burbville, CA, 99999     1 Main St #100     Burbville  CA     1001
1 Main St #100 Burbville, CA, 99999     1 Main St          Burbville  CA     1001
1 Main St #100 Burbville, CA, 99999     1 Main st          Burbville  CA     1001
...
4 Old Ave Ste 401 Southtown, OH, 44444  4 Old Ave Ste 401  Southtown  OH     1004
4 Old Ave Ste 401 Southtown, OH, 44444  4 Old Ave Ste 401  Southtown  OH     1004
...
8 New Blvd #800 NewCity, MT, 88888      8 New Blvd #800    NewCity    MT     1008
8 New Blvd #800 NewCity, MT, 88888      8 New Blvd         NewCity    MT     1008
8 New Blvd #800 NewCity, MT, 88888      8 New Blvd         NewCity    MT     1008
I would like to find a way to remove all records where Address 2 is missing the full street address or simply contains an exact duplicate like AddressInfo# 1004.
Expected Output:
Address 1                               Address 2          City       State  AddressInfo#
--------------------------------------  -----------------  ---------  -----  ------------
1 Main St #100 Burbville, CA, 99999     1 Main St #100     Burbville  CA     1001
...
4 Old Ave Ste 401 Southtown, OH, 44444  4 Old Ave Ste 401  Southtown  OH     1004
...
8 New Blvd #800 NewCity, MT, 88888      8 New Blvd #800    NewCity    MT     1008
You could rebuild your data into a new table using
select
address_1,max(address_2) as address_2, addressinfo
from
table1
group by address_1,addressinfo
http://sqlfiddle.com/#!6/3d22c/2
Edit 1:
To select city and state as well you need to include it as a group by expression:
select
address_1,max(address_2) as address_2, addressinfo,
city, state
from
table1
group by address_1,addressinfo, city, state
http://sqlfiddle.com/#!6/4527c/1
Edit 2:
The max function does deliver the longest value here as needed. This works if the shorter values are true starts of the longer values.
Here is an example of this: http://sqlfiddle.com/#!6/3fba8/1
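To make that caveat concrete, here is a small T-SQL sketch with made-up inline values. MAX on a string column compares values in collation order, not by length, so it only picks the longest value when the shorter ones are prefixes of it.

```sql
-- Shorter value is a true prefix: MAX should pick the longer, more complete one.
SELECT MAX(v.Address2) AS PickedAddress2
FROM (VALUES ('1 Main St'), ('1 Main St #100')) AS v (Address2);

-- Shorter value is NOT a prefix: MAX compares character by character,
-- so under typical collations the shorter '9 Elm' sorts after '10 Elm Ave Apt 2'.
SELECT MAX(v.Address2) AS PickedAddress2
FROM (VALUES ('9 Elm'), ('10 Elm Ave Apt 2')) AS v (Address2);
```

If the shorter values in your data are not always prefixes, ordering by LEN(address_2) is the safer choice.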
This may have syntax errors but this is a valid approach
with cte as
(
  select address1, address2, city, state, [AddressInfo#],
    ROW_NUMBER() OVER(partition by [AddressInfo#] order by len(address2) desc) as alen
  from table1 -- table name assumed; the original omitted the FROM clause
)
select * from cte
where alen = 1
SELECT DISTINCT
Address1
, Address2
, [AddressInfo#]
, City
, State
-- + any other fields
FROM dbo.Table1 AS t
WHERE NOT EXISTS (
SELECT *
FROM dbo.Table1 AS x
WHERE x.Address1 = t.Address1
-- + any other criteria for "uniqueness"
AND LEFT( x.Address2, LEN( t.Address2 ) ) = t.Address2
AND LEN( x.Address2 ) > LEN( t.Address2 )
);
This query will first get all the rows where there is not another row with the same Address1 and an Address2 matching the current value up to the length of the field, but at least one character longer. The DISTINCT is then applied to eliminate exact duplicates. (This assumes no NULL values.)
A similar query could use the LIKE operator, but this would need to account for special characters in the data, such as "%", "_", or brackets.
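A sketch of what that LIKE variant might look like in T-SQL, assuming the same dbo.Table1 as above. The wildcard characters are escaped by wrapping them in brackets, and the opening bracket has to be replaced first so that the brackets added by the later replacements are not escaped again.

```sql
SELECT DISTINCT
    t.Address1,
    t.Address2,
    t.[AddressInfo#],
    t.City,
    t.State
FROM dbo.Table1 AS t
WHERE NOT EXISTS (
    SELECT *
    FROM dbo.Table1 AS x
    WHERE x.Address1 = t.Address1
      -- Escape [, % and _ in the shorter value, then require at least
      -- one additional character after it ('_%').
      AND x.Address2 LIKE
          REPLACE(REPLACE(REPLACE(t.Address2, '[', '[[]'), '%', '[%]'), '_', '[_]') + '_%'
);
```

The LEFT/LEN version above avoids the escaping altogether, which is one reason to prefer it when the data may contain these characters.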
Some form of:
UPDATE A
SET Address2 = CASE WHEN Address1 = Address2 THEN NULL ELSE
CASE WHEN CHARINDEX(',',Address2,CHARINDEX(',',Address2)) = 0 THEN NULL ELSE Address2 END
END
FROM Address AS A