T-SQL: Parsing names to ignore spaces and middle initials

I have a poorly maintained database that includes employee information. Human Resources requested a report that lists instances where the employee name associated with an insurance coverage does not match the name on the insurance policy.
There are inconsistencies in the formatting of the names in both tables. It's always last name then first name, but you might see any of the following in either table for a fictional employee named Steven J. Smith:
Smith, Steven
Smith,Steven
Smith, Steven J.
Smith,Steven J.
I need to run a query looking for instances where EMPLOYEE.EMP_NAME <> INSURANCE.SUBSCRIBER_NAME while allowing for the differences in name formatting shown above (i.e. recognizing that "Smith,Steven J." and "Smith, Steven" are (probably) the same person and ignoring them).
SELECT
EMPLOYEE.EMP_NO
, EMPLOYEE.EMP_NAME
, INSURANCE.SUBSCRIBER_NAME
, INSURANCE.PAYOR_NAME
FROM EMPLOYEE
INNER JOIN INSURANCE ON EMPLOYEE.EMP_NO = INSURANCE.EMP_NO
WHERE EMPLOYEE.EMP_NAME <> INSURANCE.SUBSCRIBER_NAME
I know I want to do a substring to ignore the middle initial, but how do I account for ignoring whether or not there is a space after the comma?

Why not just remove all commas and spaces with REPLACE?
WHERE REPLACE(REPLACE(EMPLOYEE.EMP_NAME,' ',''),',','') <> REPLACE(REPLACE(INSURANCE.SUBSCRIBER_NAME,' ',''),',','')

You could simply replace out the comma
WHERE replace (EMPLOYEE.EMP_NAME,',','') <> replace (INSURANCE.SUBSCRIBER_NAME,',','')
To find most mismatches...
;with cE as
(select
EMP_NO,
REPLACE(REPLACE(REPLACE(EMP_NAME,',',''),' ',''),'.','') as namekey
from EMPLOYEE),
ci as
(select
EMP_NO,
REPLACE(REPLACE(REPLACE(SUBSCRIBER_NAME,',',''),' ',''),'.','') as namekey
from INSURANCE)
select *
from ce
inner join ci on ce.EMP_NO = ci.EMP_NO
where
not
(
(LEN(ce.namekey)< LEN(ci.namekey) and ci.namekey like ce.namekey+'%')
or
(LEN(ce.namekey)>= LEN(ci.namekey) and ce.namekey like ci.namekey+'%')
)

You can remove the space after the comma and then cut off the middle initial:
declare @Temp table (Name nvarchar(128))
insert into @Temp
select 'Smith, Steven' union all
select 'Smith,Steven' union all
select 'Smith, Steven J.' union all
select 'Smith,Steven J.'
select
case
-- after ', ' becomes ',', any remaining space precedes the middle initial
when N1.Name like '% %' then left(N1.Name, charindex(' ', N1.Name) - 1)
else N1.Name
end as Name_New,
T.Name
from @Temp as T
outer apply (select replace(T.Name, ', ', ',') as Name) as N1

Thanks, your answers helped a lot. I ended up cutting the name into [lastname][firstname] with no spaces and cutting off the middle initial if it was there. Here's what eventually worked in eliminating the vast majority of the same-name matches:
((CASE
WHEN CHARINDEX(' ',REPLACE(REPLACE(EMPLOYEE.EMP_NAME,', ',''),',','')) = 0
THEN UPPER(REPLACE(REPLACE(EMPLOYEE.EMP_NAME,', ',''),',',''))
ELSE UPPER(LEFT(REPLACE(REPLACE(EMPLOYEE.EMP_NAME,', ',''),',',''),CHARINDEX(' ',REPLACE(REPLACE(EMPLOYEE.EMP_NAME,', ',''),',',''))))
END) <>
(CASE
WHEN CHARINDEX(' ',REPLACE(REPLACE(INSURANCE.SUBSCRIBER_NAME,', ',''),',','')) = 0
THEN UPPER(REPLACE(REPLACE(INSURANCE.SUBSCRIBER_NAME,', ',''),',',''))
ELSE UPPER(LEFT(REPLACE(REPLACE(INSURANCE.SUBSCRIBER_NAME,', ',''),',',''),CHARINDEX(' ',REPLACE(REPLACE(INSURANCE.SUBSCRIBER_NAME,', ',''),',',''))))
END))
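The same comparison can be sketched more compactly by computing each key once with CROSS APPLY instead of repeating the nested REPLACE calls. This is only an illustration under assumptions: the table variable @Names and its sample rows are hypothetical, and the approach assumes last names contain no spaces (a name like "Van Dyke, Tom" would be cut short). Appending a space before calling CHARINDEX guarantees a space is always found, so no CASE is needed.

```sql
-- Hypothetical sample data standing in for the EMPLOYEE/INSURANCE join
DECLARE @Names TABLE (EMP_NAME varchar(100), SUBSCRIBER_NAME varchar(100));
INSERT INTO @Names VALUES
 ('Smith, Steven', 'Smith,Steven J.'),   -- same person, different formatting
 ('Smith,Steven',  'Smith, Stephen'),    -- genuinely different first name
 ('Jones, Amy',    'Jones, Amy');        -- identical

SELECT n.EMP_NAME, n.SUBSCRIBER_NAME
FROM @Names AS n
-- Step 1: drop the space after the comma and append a trailing space
CROSS APPLY (SELECT REPLACE(n.EMP_NAME, ', ', ',') + ' '        AS e,
                    REPLACE(n.SUBSCRIBER_NAME, ', ', ',') + ' ' AS s) AS t
-- Step 2: keep everything before the first space (cuts off any middle initial)
CROSS APPLY (SELECT UPPER(LEFT(t.e, CHARINDEX(' ', t.e) - 1)) AS EmpKey,
                    UPPER(LEFT(t.s, CHARINDEX(' ', t.s) - 1)) AS SubKey) AS k
WHERE k.EmpKey <> k.SubKey;
```

With this sample data, only the Steven/Stephen row should be reported as a mismatch.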

Related

Get every other row in SQL Server without using ROW_NUMBER(): Or... How is my row_number() wrong?

My company uses GoldMine CRM (on SQL Server), and I'm writing a query to find duplicate records. The query is great, but it returns two rows for each duplicate, and I only need one. However, I can't seem to use Row_Number() at all - it always returns a blank column. Here's my query:
SELECT *
FROM (
SELECT
c11.company AS Company1,
c12.company AS Company2,
c11.phone1 AS DuplicatePhone,
c11.address1 AS C1Address1,
c12.address1 AS C2Address1,
c11.zip AS Zip1,
c12.zip AS Zip2,
c11.contact AS Contact_1,
c12.contact AS Contact_2,
SUBSTRING(c11.company, 1, CHARINDEX(' ', c11.company)) AS C1_Firstword,
SUBSTRING(c12.company, 1, CHARINDEX(' ', c12.company)) AS C2_Firstword,
c11.accountno AS Acctno1,
c12.accountno AS Acctno2
FROM db.contact1 AS c11
INNER JOIN db.contact1 AS c12
ON c11.phone1 = c12.phone1
WHERE c11.state = 'MA'
AND c12.state = 'MA'
AND c11.phone1 IS NOT NULL
AND c11.phone1 <> ''
AND c11.accountno <> c12.accountno) AS foo
WHERE (PATINDEX('%' + foo.C1_Firstword + '%', foo.company2) > 0
OR PATINDEX('%' + foo.C2_Firstword + '%', foo.company1) > 0)
ORDER BY foo.DuplicatePhone
The query first looks for records with the same phone number, and then looks for similarities in the company name (sometimes our contacts share a phone number without being duplicates, but it's common to find duplicates where one name is 'John Smith Enterprises' and the other 'Smith Enterprises')
I've tried every iteration of ROW_NUMBER() in this query and in a far simpler one, eg:
SELECT c1.accountno, c1.company, ROW_NUMBER() OVER(ORDER BY c1.Accountno ASC) row_num
FROM bpmain1.dbo.contact1 c1
WHERE c1.state = 'MA'
... and I always get a blank column. My theory is that the SQL panel in GoldMine is stopping me from using it, since the results that I get back from GoldMine always include a column 'Row' that's numbered (As though GoldMine "conveniently" wraps every query with an empty ROW_NUMBER() clause.)
So, I end up with two rows for each duplicate instance, and I only need one - it doesn't matter which one. The point of using ROW_NUMBER() was to get me every other result. Any other ideas?

Just change your where conditions:
FROM db.contact1 c11 INNER JOIN
db.contact1 c12
ON c11.phone1 = c12.phone1 AND c11.state = c12.state
WHERE c11.state = 'MA' AND
c11.phone1 <> '' AND
c11.accountno < c12.accountno
--------------------^ this is the key change
That is, you don't need to remove duplicates. Just don't generate them in the first place by returning the contacts in account order.
Note that the condition c11.phone1 IS NOT NULL is redundant. Both the JOIN conditions and the <> '' filter out NULL values.
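A minimal sketch of why the < comparison yields one row per pair, using a hypothetical table variable @contact rather than the real GoldMine tables:

```sql
-- Hypothetical sample data: accounts 1 and 2 share a phone number
DECLARE @contact TABLE (accountno int, company varchar(50), phone1 varchar(20));
INSERT INTO @contact VALUES
 (1, 'John Smith Enterprises', '555-0100'),
 (2, 'Smith Enterprises',      '555-0100'),
 (3, 'Acme',                   '555-0199');

SELECT a.accountno AS Acctno1, b.accountno AS Acctno2, a.phone1
FROM @contact AS a
INNER JOIN @contact AS b ON a.phone1 = b.phone1
-- "<" keeps the pair (1, 2) only; "<>" would also return its mirror (2, 1)
WHERE a.accountno < b.accountno;
```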

Count the number of not null columns using a case statement

I need some help with my query...I am trying to get a count of names in each house, all the col#'s are names.
Query:
SELECT House#,
COUNT(CASE WHEN col#1 IS NOT NULL THEN 1 ELSE 0 END) +
COUNT(CASE WHEN col#2 IS NOT NULL THEN 1 ELSE 0 END) +
COUNT(CASE WHEN col#3 IS NOT NULL THEN 1 ELSE 0 END) as count
FROM myDB
WHERE House# in (house#1,house#2,house#3)
GROUP BY House#
Desired results:
house 1 - the count is 3 /
house 2 - the count is 2 /
house 3 - the count is 1
...with my current query the results for count would be just 3's

In this case, it seems that counting names is the same as counting the commas (,) plus one:
SELECT House_Name,
LEN(Names) - LEN(REPLACE(Names,',','')) + 1 as Names
FROM dbo.YourTable;
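To try the comma-counting idea on its own, here is a small sketch. The table variable @houses and its rows are hypothetical and assume the names are stored as one comma-separated string per house.

```sql
-- Hypothetical sample data: one comma-separated list of names per house
DECLARE @houses TABLE (House_Name varchar(16), Names varchar(256));
INSERT INTO @houses VALUES
 ('house 1', 'peter, paul, mary'),
 ('house 2', 'sarah, sally'),
 ('house 3', 'joe');

-- The number of names is the number of commas plus one
SELECT House_Name,
       LEN(Names) - LEN(REPLACE(Names, ',', '')) + 1 AS NameCount
FROM @houses;
```

For these rows the counts come out as 3, 2 and 1. Note that an empty string would still be counted as one name, so blank values need separate handling.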

Another option since Lamak stole my thunder, would be to split it and normalize your data, and then aggregate. This uses a common split function but you could use anything, including STRING_SPLIT for SQL Server 2016+ or your own...
declare @table table (house varchar(16), names varchar(256))
insert into @table
values
('house 1','peter, paul, mary'),
('house 2','sarah, sally'),
('house 3','joe')
select
t.house
,NumberOfNames = count(s.Item)
from
@table t
cross apply dbo.DelimitedSplit8K(names,',') s
group by
t.house

Notice how the answers you are getting are quite complex for what they're doing? That's because relational databases are not designed to store data that way.
On the other hand, if you change your data structure to something like this:
house name
1 peter
1 paul
1 mary
2 sarah
2 sally
3 joe
The query now is:
select house, count(name)
from housenames
group by house
So my recommendation is to do that: use a design that's more suitable for SQL Server to work with, and your queries become simpler and more efficient.

One dirty trick is to replace commas with empty strings and compare the lengths:
SELECT house +
' has ' +
CAST((LEN(names) - LEN(REPLACE(names, ',', '')) + 1) AS VARCHAR) +
' names'
FROM mytable

You can parse using xml and find count as below:
Select *, a.xm.value('count(/x)','int') from (
Select *, xm = CAST('<x>' + REPLACE((SELECT REPLACE(names,', ','$$$SSText$$$') AS [*] FOR XML PATH('')),'$$$SSText$$$','</x><x>')+ '</x>' AS XML) from #housedata
) a

select House, 'has '+cast((LEN(Names)-LEN(REPLACE(Names, ',', ''))+1) as varchar)+' names'
from TempTable

Using Pivot with non Numerical Data

This is the first time I have ever tried to use PIVOT.
I am using Microsoft SQL Server.
I have been reading up on PIVOT and decided that it would work well for a project that exports patient data to a formatted file (i.e. a report) that can be printed out.
VPatientPlusAllergyData is a view. Here is a sample of its output, with some of the data cut out for ease of reading:
strPatientFullName    strAllergy    strAllergyMedication
------------------------------------------------------------
Smith, John Henry     Dogs          Pounces
Smith, John Henry     Dogs          Orange Juice
Smith, John Henry     Mustard       Ketchup
Smith, John Henry     Mustard       Sugar
This is the result I want
strPatientFullName    strAllergy1    strAllergy1Medications    strAllergy2    strAllergy2Medications
------------------------------------------------------------------------------------------------------
Smith, John Henry     Dogs           Pounces, Orange Juice     Mustard        Ketchup, Sugar
After reading W3Schools, watching a YouTube video and reading some articles on this site, I'm wondering whether what I am trying to do is possible.
Below is a code snippet. I got stuck on what to put in the IN clause, and that is when I started to question whether PIVOT is the answer to my particular problem.
GO
SELECT
strPatientFullName
,strStreetAddress
,strCity
,strState
,strZipcode
,strPrimaryPhoneNumber
,strSecondaryPhoneNumber
,blnSmoker
,decPackYears
,blnHeadOfHousehold
,dtmDateOfBirth
,strSex
,strAllergy
,strAllergyMedication
,strEmailAddress
,strRecordCreator
FROM ( SELECT * FROM VPatientPlusAllergyData ) PatientAllergyData
PIVOT
(
MAX(strAllergyMedication)
FOR strAllergy
IN ()
)
GO
Hoping someone more familiar with Pivot will show me what I am missing or enlighten me to a much more efficient solution.
Thanks for the help
****** EDIT: I have decided that, while I would love to put this sort of operation on the server side, for my particular application it was simpler to create a set of views, run SELECT queries on the client side, concatenate the results there, and then implement an "EXPORT PROCESSING" screen.
I appreciate all the help. Maybe one day I will write a script and have it execute server side, but for the moment this works well enough ******

Here's an example of how you could do something like this with a STUFF statement, conditional aggregation and dynamic SQL.
DECLARE @SQL NVARCHAR(MAX) = '';
SELECT @SQL += '
, MAX(CASE WHEN RN = ' + RN + ' THEN strAllergy END) strAllergy' + RN + '
, MAX(CASE WHEN RN = ' + RN + ' THEN strAllergyMedications END) strAllergyMedications' + RN
FROM (
SELECT CAST(ROW_NUMBER() OVER (PARTITION BY strPatientFullName, strAllergy ORDER BY (SELECT NULL)) AS VARCHAR(5)) RN
FROM VPatientPlusAllergyData) T
GROUP BY RN;
SELECT @SQL = 'SELECT strPatientFullName' + @SQL + '
FROM (
SELECT strPatientFullname
, strAllergy
, STUFF((SELECT '', '' + strAllergyMedication FROM VPatientPlusAllergyData WHERE strPatientFullName = T.strPatientFullName AND strAllergy = T.strAllergy FOR XML PATH ('''')), 1, 2, '''') strAllergyMedications
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM VPatientPlusAllergyData T
GROUP BY strPatientFullname, strAllergy) T
GROUP BY strPatientFullname;';
PRINT @SQL;
EXEC(@SQL);
As scsimon mentions in the comments, dynamic SQL may be necessary if there can be any number of allergies. A stuff statement is one way of getting the comma separated values into a single column. And the conditional aggregation works in the same way that a PIVOT would normally work, but is far easier (IMO) to write and understand than a normal PIVOT statement.

So to get to what you want you are actually looking at needing the following techniques:
For the case of strAllergyMedications you are needing to Concatenate Rows to a Delimited String
Then to make your rows into columns you need to PIVOT, but because you are pivoting 2 columns you would have to PIVOT twice or use Conditional Aggregation
The main trick to pulling it off is to prepare your table by doing the concatenation and coming up with a Row Number for the Allergy. Here is an example using a Common Table Expression [CTE] and STUFF() with a sub select XML to create the delimited string and create the Row Number.
DECLARE @VPatientPlusAllergyData AS TABLE (strPatientFullName VARCHAR(100), strAllergy VARCHAR(50), strAllergyMedication VARCHAR(100))
INSERT INTO @VPatientPlusAllergyData VALUES
('Smith, John Henry','Dogs','Pounces')
,('Smith, John Henry','Dogs','Orange Juice')
,('Smith, John Henry','Mustard','Ketchup')
,('Smith, John Henry','Mustard','Sugar')
;WITH cte AS (
SELECT DISTINCT
v1.strPatientFullName
,v1.strAllergy
,strAllergyMedications = STUFF(
(SELECT ', ' + v2.strAllergyMedication
FROM
@VPatientPlusAllergyData v2
WHERE
v1.strPatientFullName = v2.strPatientFullName
AND v1.strAllergy = v2.strAllergy
FOR XML PATH(''))
,1,2,'')
,AllergyRowNum = DENSE_RANK() OVER (PARTITION BY v1.strPatientFullName ORDER BY v1.strAllergy)
FROM
@VPatientPlusAllergyData v1
)
SELECT
strPatientFullName
,strAllergy1 = MAX(CASE WHEN AllergyRowNum = 1 THEN strAllergy END)
,strAllergy1Medications = MAX(CASE WHEN AllergyRowNum = 1 THEN strAllergyMedications END)
,strAllergy2 = MAX(CASE WHEN AllergyRowNum = 2 THEN strAllergy END)
,strAllergy2Medications = MAX(CASE WHEN AllergyRowNum = 2 THEN strAllergyMedications END)
FROM
cte
GROUP BY
strPatientFullName
And while I was preparing and posting this, @ZLK wrote a nice method to do it dynamically.

sort by second string in database field

I have the SQL statement below, which is meant to sort an address field (AddressLine1) by street name rather than by house number. It seems to run fine, but I want the street names to appear alphabetically, and the ASC at the end of the ORDER BY doesn't help.
For example, the AddressLine1 field might contain
"5 Elm Close". A normal ORDER BY sorts on the number; the statement below should instead sort on the second word, "Elm".
(Using SQL Server)
SELECT tblcontact.ContactID, tblcontact.Forename, tblcontact.Surname,
tbladdress.AddressLine1, tbladdress.AddressLine2
FROM tblcontact
INNER JOIN tbladdress
ON tblcontact.AddressID = tbladdress.AddressID
LEFT JOIN tblDonate
ON tblcontact.ContactID = tblDonate.ContactID
WHERE (tbladdress.CollectionArea = 'Queens Park')
GROUP BY tblcontact.ContactID, tblcontact.Forename, tblcontact.Surname,
tbladdress.AddressLine1, tbladdress.AddressLine2
ORDER BY REVERSE(LEFT(REVERSE(tbladdress.AddressLine1),
charindex(' ', REVERSE(tbladdress.AddressLine1)+' ')-1)) asc
Gordon's statement sorts as below
1 Kings Road
10 Olivier Way
11 Albert Street
11 Kings Road
11 Princes Road
120 High Street

Try this. I based it on Gordon's code but removed the LEFT(AddressLine1, 1) portion, because a single-character string can never match the pattern "n + space + %".
This works on my SQL-Server 2012 environment:
WITH tbladdress AS
(
SELECT AddressLine1 FROM (VALUES ('1 Kings Road'),('10 Olivier Way'), ('11 Albert Street')) AS V(AddressLine1)
)
SELECT
AddressLine1
FROM tbladdress
order by (case when tbladdress.AddressLine1 like '[0-9]% %'
then substrING(tbladdress.AddressLine1, charindex(' ', tbladdress.AddressLine1) + 1, len(tbladdress.AddressLine1))
else tbladdress.AddressLine1
end)
This is edited to be more similar to Gordon's code (position of closing parentheses, substr instead of substring):
order by (case when tbladdress.AddressLine1 like '[0-9]% %'
then substr(tbladdress.AddressLine1, charindex(' ', tbladdress.AddressLine1) + 1), len(tbladdress.AddressLine1)
else tbladdress.AddressLine1
end)

If you assume that the street name is the first or second value in a space separated string, you could try:
order by (case when left(tbladdress.AddressLine1, 1) like '[0-9]% %'
then substr(tbladdress.AddressLine1, charindex(' ', tbladdress.AddressLine1) + 1), len(tbladdress.AddressLine1) )
else tbladdress.AddressLine1
end)

I don't think you need to use REVERSE() at all. That seems like a trap.
ORDER BY
CASE
WHEN ISNUMERIC(LEFT(tbladdress.AddressLine1,CHARINDEX(' ',tbladdress.AddressLine1) - 1)) = 1
THEN RIGHT(tbladdress.AddressLine1,LEN(tbladdress.AddressLine1) - CHARINDEX(' ',tbladdress.AddressLine1))
ELSE tbladdress.AddressLine1
END,
CASE
WHEN ISNUMERIC(LEFT(tbladdress.AddressLine1,CHARINDEX(' ',tbladdress.AddressLine1) - 1)) = 1
THEN CAST(LEFT(tbladdress.AddressLine1,CHARINDEX(' ',tbladdress.AddressLine1) - 1) AS INT)
ELSE NULL
END
Also, you have a GROUP BY with no aggregate function. While that's not wrong, per se, it is weird. Just use DISTINCT if you're getting duplicate records.

This is the bit of code that works in sql server
order by (case when tbladdress.AddressLine1 like '[0-9]% %'
then substrING(tbladdress.AddressLine1, charindex(' ', tbladdress.AddressLine1) + 1, len(tbladdress.AddressLine1))
else tbladdress.AddressLine1
end)
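As a self-contained sketch of that ORDER BY, the example below also adds the house number as a secondary sort key, so that addresses on the same street come out in numeric order. The table variable @addr and its rows are hypothetical, and TRY_CAST assumes SQL Server 2012 or later.

```sql
-- Hypothetical sample data
DECLARE @addr TABLE (AddressLine1 varchar(100));
INSERT INTO @addr VALUES
 ('1 Kings Road'), ('10 Olivier Way'), ('11 Albert Street'),
 ('11 Kings Road'), ('Rose Cottage');

SELECT a.AddressLine1
FROM @addr AS a
-- position of the first space, computed once per row
CROSS APPLY (SELECT CHARINDEX(' ', a.AddressLine1) AS sp) AS p
ORDER BY
  -- primary key: the text after the leading number, or the whole line if there is none
  CASE WHEN a.AddressLine1 LIKE '[0-9]% %'
       THEN SUBSTRING(a.AddressLine1, p.sp + 1, LEN(a.AddressLine1))
       ELSE a.AddressLine1
  END,
  -- secondary key: the leading number as an integer (NULL when it is not a plain number)
  CASE WHEN a.AddressLine1 LIKE '[0-9]% %'
       THEN TRY_CAST(LEFT(a.AddressLine1, p.sp - 1) AS int)
  END;
```

With this data the rows should come back as Albert Street, the two Kings Road addresses (1 before 11), Olivier Way, then Rose Cottage.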

No column name showing up in SQL

I'm using SQL Server 2008, but the column name does not appear. I need to put in an empty string because the rows are populated manually in the report.
(SELECT '' As 'Total No of people')
It seems to show up as (No column name)

You can have
SELECT ID as 'ID',
(SELECT <....> FROM table WHERE <...> ) AS 'Total No of people'
FROM somewhere
You have to put the column name after the ) for the inner select

I will say it works correctly! http://sqlfiddle.com/#!3/d41d8/18149
But perhaps your problem is that you do (technically using a subquery)
SELECT ID, (SELECT '' As 'Total No of people') FROM SomeWhere
and that is wrong...
SELECT ID, '' As 'Total No of people' FROM SomeWhere
or
SELECT ID, (SELECT '') As 'Total No of people' FROM SomeWhere
but there is no reason for the inner SELECT

Make sure you put it in the right order: SELECT '' As 'Total No of people' from PEOPLE
