How to match people between separate systems using SQL? - sql

I would like to know if there is a way to match people between two separate systems, using (mostly) SQL.
We have two separate Oracle databases where people are stored. There is no link between the two (i.e. cannot join on person_id); this is intentional. I would like to create a query that checks to see if a given group of people from system A exists in system B.
I am able to create tables if that makes it easier. I can also run queries and do some data manipulation in Excel when creating my final report. I am not very familiar with PL/SQL.
In system A, we have information about people (name, DOB, soc, sex, etc.). In system B we have the same types of information about people. There could be data entry errors (person enters an incorrect spelling), but I am not going to worry about this too much, other than maybe just comparing the first 4 letters. This question deals with that problem more specifically.
They way I thought about doing this is through correlated subqueries. So, roughly,
select a.lastname, a.firstname, a.soc, a.dob, a.gender
case
when exists (select 1 from b where b.lastname = a.lastname) then 'Y' else 'N'
end last_name,
case
when exists (select 1 from b where b.firstname = a.firstname) then 'Y' else 'N'
end first_name,
case [etc.]
from a
This gives me what I want, I think...I can export the results to Excel and then find records that have 3 or more matches. I believe that this shows that a given field from A was found in B. However, I ran this query with just three of these fields and it took over 3 hours to run (I'm looking in 2 years of data). I would like to be able to match on up to 5 criteria (lastname, firstname, gender, date of birth, soc). Additionally, while soc number is the best choice for matching, it is also the piece of data that tends to be missing the most often. What is the best way to do this? Thanks.

You definitely want to weigh the different matches. If an SSN matches, that's a pretty good indication. If a firstName matches, that's basically worthless.
You could try a scoring method based on weights for the matches, combined with the phonetic string matching algorithms you linked to. Here's an example I whipped up in T-SQL. It would have to be ported to Oracle for your issue.
--Score Threshold to be returned
DECLARE #Threshold DECIMAL(5,5) = 0.60
--Weights to apply to each column match (0.00 - 1.00)
DECLARE #Weight_FirstName DECIMAL(5,5) = 0.10
DECLARE #Weight_LastName DECIMAL(5,5) = 0.40
DECLARE #Weight_SSN DECIMAL(5,5) = 0.40
DECLARE #Weight_Gender DECIMAL(5,5) = 0.10
DECLARE #NewStuff TABLE (ID INT IDENTITY PRIMARY KEY, FirstName VARCHAR(MAX), LastName VARCHAR(MAX), SSN VARCHAR(11), Gender VARCHAR(1))
INSERT INTO #NewStuff
( FirstName, LastName, SSN, Gender )
VALUES
( 'Ben','Sanders','234-62-3442','M' )
DECLARE #OldStuff TABLE (ID INT IDENTITY PRIMARY KEY, FirstName VARCHAR(MAX), LastName VARCHAR(MAX), SSN VARCHAR(11), Gender VARCHAR(1))
INSERT INTO #OldStuff
( FirstName, LastName, SSN, Gender )
VALUES
( 'Ben','Stickler','234-62-3442','M' ), --3/4 Match
( 'Albert','Sanders','523-42-3441','M' ), --2/4 Match
( 'Benne','Sanders','234-53-2334','F' ), --2/4 Match
( 'Ben','Sanders','234623442','M' ), --SSN has no dashes
( 'Ben','Sanders','234-62-3442','M' ) --perfect match
SELECT
'NewID' = ns.ID,
'OldID' = os.ID,
'Weighted Score' =
(CASE WHEN ns.FirstName = os.FirstName THEN #Weight_FirstName ELSE 0 END)
+
(CASE WHEN ns.LastName = os.LastName THEN #Weight_LastName ELSE 0 END)
+
(CASE WHEN ns.SSN = os.SSN THEN #Weight_SSN ELSE 0 END)
+
(CASE WHEN ns.Gender = os.Gender THEN #Weight_Gender ELSE 0 END)
,
'RAW Score' = CAST(
((CASE WHEN ns.FirstName = os.FirstName THEN 1 ELSE 0 END)
+
(CASE WHEN ns.LastName = os.LastName THEN 1 ELSE 0 END)
+
(CASE WHEN ns.SSN = os.SSN THEN 1 ELSE 0 END)
+
(CASE WHEN ns.Gender = os.Gender THEN 1 ELSE 0 END) ) AS varchar(MAX))
+
' / 4',
os.FirstName ,
os.LastName ,
os.SSN ,
os.Gender
FROM #NewStuff ns
--make sure that at least one item matches exactly
INNER JOIN #OldStuff os ON
os.FirstName = ns.FirstName OR
os.LastName = ns.LastName OR
os.SSN = ns.SSN OR
os.Gender = ns.Gender
where
(CASE WHEN ns.FirstName = os.FirstName THEN #Weight_FirstName ELSE 0 END)
+
(CASE WHEN ns.LastName = os.LastName THEN #Weight_LastName ELSE 0 END)
+
(CASE WHEN ns.SSN = os.SSN THEN #Weight_SSN ELSE 0 END)
+
(CASE WHEN ns.Gender = os.Gender THEN #Weight_Gender ELSE 0 END)
>= #Threshold
ORDER BY ns.ID, 'Weighted Score' DESC
And then, here's the output.
NewID OldID Weighted Raw First Last SSN Gender
1 5 1.00000 4 / 4 Ben Sanders 234-62-3442 M
1 1 0.60000 3 / 4 Ben Stickler 234-62-3442 M
1 4 0.60000 3 / 4 Ben Sanders 234623442 M
Then, you would have to do some post processing to evaluate the validity of each possible match. If you ever get a 1.00 for weighted score, you can assume that it's the right match, unless you get two of them. If you get a last name and SSN (a combined weight of 0.8 in my example), you can be reasonably certain that it's correct.

Example of HLGEM's JOIN suggestion:
SELECT a.lastname,
a.firstname,
a.soc,
a.dob,
a.gender
FROM TABLE a
JOIN TABLE b ON SOUNDEX(b.lastname) = SOUNDEX(a.lastname)
AND SOUNDEX(b.firstname) = SOUNDEX(a.firstname)
AND b.soc = a.soc
AND b.dob = a.dob
AND b.gender = a.gender
Reference: SOUNDEX

I would probably use joins instead of correlated subqueries but you will have to join on all the fields, so not sure how much that might improve things. But since correlated subqueries often have to evaluate row-by-row and joins don't it could improve things a good bit if you have good indexing. But as with all performance tuning only trying the techinque will let you knw ofor sure.
I did a similar task looking for duplicates in our SQL Server system and I broke it out into steps. So first I found everyone where the names and city/state were an exact match. Then I looked for additional possible matches (phone number, ssn, inexact name match etc. AS I found a possible match between two profiles, I added it to a staging table with a code for what type of match found it. Then I assigned a confidence amount to each type of match and added up the confidence for each potential match. So if the SOC matches, you might want a high confidence, same if the name is eact and the gender is exact and the dob is exact. Less so if the last name is exact and the first name is not exact, etc. By adding a confidence, I was much better able to see which possible mathes were more likely to be the same person. SQl Server also has a soundex function which can help with names that are slightly different. I'll bet Oracle has something similar.
After I did this, I learned how to do fuzzy grouping in SSIS and was able to generate more matches with a higher confidence level. I don't know if Oracle's ETL tools havea a way to do fuzzy logic, but if they do it can really help with this type of task. If you happen to also have SQL Server, SSIS can be run connecting to Oracle, so you could use fuzzy grouping yourself. It can take a long time to run though.
I will warn you that name, dob and gender are not likely to ensure they are the same person especially for common names.

Are there indexes on all of the columns in table b in the WHERE clause? If not, that will force a full scan of the table for each row in table a.

You can use soundex but you can also use utl_match for fuzzy comparing of string, utl_match makes it possible to define a treshold: http://www.psoug.org/reference/utl_match.html

Related

Dynamic SQL: CASE expression in HAVING clause for SSRS dataset query

One of my tables contains 6 bit flags:
tblDocumentFact.useCase1
tblDocumentFact.useCase2
tblDocumentFact.useCase3
tblDocumentFact.useCase4
tblDocumentFact.useCase5
tblDocumentFact.useCase6
The bit flags are used to restrict the returned data via a HAVING clause, for example:
HAVING tblDocumentFact.useCase4 = 1 /* '1' means 'True' */
That works in a static query. The query is for a dataset for a SQL Server Reporting Services report. Rather than have 6 reports, one per bit flag, I'd like to have 1 report with an #UserChoice input parameter. I'm trying to write a dynamic query to structure the HAVING clause in accordance with the #UserChoice parameter. I'm thinking that #UserChoice could be set to an integer value (1, 2, 3, 4, 5 or 6) when the user clicks a 1-of-6 option button. I've tried to do this via CASE expressions as shown below, but it doesn't work--the query returns no rows. What's the correct approach here?
HAVING (
(CASE WHEN #UserChoice =1 THEN 'dbo.tblDocumentFact.useCase1' END) = '1'
OR (CASE WHEN #UserChoice =2 THEN 'dbo.tblDocumentFact.useCase2' END) = '1'
OR (CASE WHEN #UserChoice =3 THEN 'dbo.tblDocumentFact.useCase3' END) = '1'
OR (CASE WHEN #UserChoice =4 THEN 'dbo.tblDocumentFact.useCase4' END) = '1'
OR (CASE WHEN #UserChoice =5 THEN 'dbo.tblDocumentFact.useCase5' END) = '1'
OR (CASE WHEN #UserChoice =6 THEN 'dbo.tblDocumentFact.useCase6' END) = '1'
)
You need to rephrase your logic slightly:
HAVING
(#UserChoice = 1 AND 'dbo.tblDocumentFact.useCase1' = '1') OR
(#UserChoice = 2 AND 'dbo.tblDocumentFact.useCase2' = '2') OR
(#UserChoice = 3 AND 'dbo.tblDocumentFact.useCase3' = '3') OR
(#UserChoice = 4 AND 'dbo.tblDocumentFact.useCase4' = '4') OR
(#UserChoice = 5 AND 'dbo.tblDocumentFact.useCase5' = '5') OR
(#UserChoice = 6 AND 'dbo.tblDocumentFact.useCase6' = '6');
A CASE expression can't be used in the way you were using it, because what follows THEN or ELSE has to be a literal value, not a logical condition.
To expand a bit on the comment under Tim's post, I think the reason it doesn't work out is because your cases are emitting strings containing column names not the values of columns
HAVING
CASE WHEN #UserChoice = 1 THEN dbo.tblDocumentFact.useCase1 END = 1
OR CASE WHEN #UserChoice = 2 THEN dbo.tblDocumentFact.useCase2 END = 1
...
It might even clean up to this:
HAVING
CASE #UserChoice
WHEN 1 THEN dbo.tblDocumentFact.useCase1
WHEN 2 THEN dbo.tblDocumentFact.useCase2
...
END = 1
The problem (I believe; in sql server at least, not totally sure about SSRS) is that when you say:
CASE WHEN #UserChoice = 1 THEN 'dbo.tblDocumentFact.useCase1' END = '1'
Your case when is emitting the literal string dbo.tblDocumentFact.useCase1 not the value of that column on that row. And of course this literal string is never equal to a literal string of 1
Overall I prefer Tim's solution; I think the query optimizer will more likely be able to use an index on the bit columns in that form, but be aware that use of ORs can cause sql server to ignore indexes; the DBAs at my old place frequently rewrote queries like:
SELECT * FROM Person WHERE FirstName = 'john' OR LastName = 'Smith'
Into this:
SELECT * FROM Person WHERE FirstName = 'john'
UNION
SELECT * FROM Person WHERE LastName = 'Smith'
Because the server wouldn't combine the index on FirstName and the other index on LastName when we used OR, but it would parallel execute using both indexes in the UNION form
Consider as an alternative, combining those bit flags into a single integer, either as a binary 2's complement (if you want to be able to say user choice 1 and 2 by searching for 3 or choice 2 and 4 and 6 by searching 42 [2^(2 -1) + 2^(4-1) + 2^(6-1)]) or just a straight int you can compare to #userChoice, and indexing it

How to to get two columns of data unrelated to each other in one sql query statement?

I need to get a state level count on number of services. For the purposes of this I only have two services. The first column is the states, the second column is the first services and the third column is the second service. What I am struggling with is to have the second and third column show up on the results in one query. Here is my code:
SELECT Distinct allstates.Name, count (data.StateName) as CareCase_Management_Services, count(data.StateName) Caregiver_Support_Services
From
(select distinct Name from USstate) allstates
Left Join
Client2017 data
on
allstates.Name = data.StateName and
data.FiscalYear = 2017 and
data.SrvstartCareCaseMgmtCode NOT IN('999','', '998') and
data.SrvstartCaregiverSuppCode NOT IN('999','', '998')
GROUP BY allstates.Name
ORDER BY allstates.Name ASC
I understand that you are looking to compute, for each state, the count of services that match certain criteria. There are two types of services, stored in two different columns.
If so, your query could be simplified using conditional aggregation :
SELECT
allstates.Name,
SUM(CASE WHEN c.SrvstartCareCaseMgmtCode NOT IN ('999', '', '998') THEN 1 ELSE 0 END) CareCase_Management_Services,
SUM(CASE WHEN c.SrvstartCaregiverSuppCode NOT IN ('999', '', '998') THEN 1 ELSE 0 END) Caregiver_Support_Services
FROM
(SELECT DISTINCT Name FROM USstate) s
LEFT JOIN Client2017 c ON s.Name = c.StateName AND c.FiscalYear = 2017
GROUP BY allstates.Name
With this technique, each service is counted according to its own logic ; when conditions are met, the record is counted in (1 is added to the SUM()), else it is ignored (+ 0).
NB : do you really have duplicated state names in USstate ? if no, you can replace subquery (SELECT DISTINCT Name FROM USstate) s with just USstate

SQL sub query logic

I am trying to calculate values in a column called Peak, but I need to apply different calculations dependant on the 'ChargeCode'.
Below is kind of what I am trying to do, but it results in 3 columns called Peak - Which I know is what I asked for :)
Can anyone help with the correct syntax, so that I end up with one column called Peak?
Use Test
Select Chargecode,
(SELECT 1 Where Chargecode='1') AS [Peak],
(SELECT 1 Where Chargecode='1242') AS [Peak],
Peak*2 AS [Peak],
CallType
from Daisy_March2014
Thanks
You want a case statement. I think this is what you are looking for:
Select Chargecode,
(case when chargecode = '1'
when chargecode = '1242' then 2
else 2 * Peak
end) as Peak,
CallType
from Daisy_March2014;
Thanks Gordon, I have marked you response as Answered. Here is the final working code:
(case when chargecode in ('1') then 1 when chargecode in ('1264') then 2 else Peak*2 end) as Peak,
Since it depends on your charge code, I'm going to make a wild assumption that this might be an ongoing thing where new charge codes / rules could be added. Why not store this as metadata either in the charge code table or in a new table? You could generate the initial data with this:
SELECT ChargeCode,
Multiplier
INTO ChargeMeta
FROM (
Select 1 AS ChargeCode,
1 AS Multiplier
UNION ALL
SELECT 1242 AS ChargeCode,
1 AS Multiplier
UNION ALL
SELECT ChargeCode,
2 AS Multiplier
FROM Daisy_March2014
WHERE ChargeCode NOT IN (1,1242)
) SQ
Then just join to your original data.
SELECT a.ChargeCode,
a.Peak*b.Multiplier AS Peak
FROM Daisy_March2014 a
JOIN ChargeMeta b
ON a.ChargeCode = b.ChargeCode
If you do not want to maintain all charge code multipliers, you could maintain your non-standard ones, and store the standard one in the SQL. This would be about the same as a case statement, but it may still add benefit to store the overrides in a table. At the very least, it makes it easier to re-use elsewhere. No need to check all the queries that deal with Peak values and make them consistent, if ChargeCode 42 needs to have a new multiplier set.
If you want to store the default in the table, you could use two joins instead of one, storing the default charge code under a value that will never be used. (-1?)
SELECT a.ChargeCode,
a.Peak*COALESCE(b.Multiplier,c.Multiplier) AS Peak
FROM Daisy_March2014 a
LEFT JOIN ChargeMeta b ON a.ChargeCode = b.ChargeCode
LEFT JOIN ChargeMeta c ON c.ChargeCode = -1

multiple count(distinct)

I get an error unless I remove one of the count(distinct ...). Can someone tell me why and how to fix it?
I'm in vfp. iif([condition],[if true],[else]) is equivalent to case when
SELECT * FROM dpgift where !nocalc AND rectype = "G" AND sol = "EM112" INTO CURSOR cGift
SELECT
list_code,
count(distinct iif(language != 'F' AND renew = '0' AND type = 'IN',donor,0)) as d_Count_E_New_Indiv,
count(distinct iif(language = 'F' AND renew = '0' AND type = 'IN',donor,0)) as d_Count_F_New_Indiv /*it works if i remove this*/
FROM cGift gift
LEFT JOIN
(select didnumb, language, type from dp) d
on cast(gift.donor as i) = cast(d.didnumb as i)
GROUP BY list_code
ORDER by list_code
edit:
apparently, you can't use multiple distinct commands on the same level. Any way around this?
VFP does NOT support two "DISTINCT" clauses in the same query... PERIOD... I've even tested on a simple table of my own, DIRECTLY from within VFP such as
select count( distinct Col1 ) as Cnt1, count( distinct col2 ) as Cnt2 from MyTable
causes a crash. I don't know why you are trying to do DISTINCT as you are just testing a condition... I more accurately appears you just want a COUNT of entries per each category of criteria instead of actually DISTINCT
Because you are not "alias.field" referencing your columns in your query, I don't know which column is the basis of what. However, to help handle your DISTINCT, and it appears you are running from WITHIN a VFP app as you are using the "INTO CURSOR" clause (which would not be associated with any OleDB .net development), I would pre-query and group those criteria, something like...
select list_code,
donor,
max( iif( language != 'F' and renew = '0' and type = 'IN', 1, 0 )) as EQualified,
max( iif( language = 'F' and renew = '0' and type = 'IN', 1, 0 )) as FQualified
from
list_code
group by
list_code,
donor
into
cursor cGroupedByDonor
so the above will ONLY get a count of 1 per donor per list code, no matter how many records that qualify. In addition, if one record as an "F" and another does NOT, then you'll have a value of 1 in EACH of the columns... Then you can do something like...
select
list_code,
sum( EQualified ) as DistEQualified,
sum( FQualified ) as DistFQualified
from
cGroupedByDonor
group by
list_code
into
cursor cDistinctByListCode
then run from that...
You can try using either another derived table or two to do the calculations you need, or using projections (queries in the field list). Without seeing the schema, it's hard to know which one will work for you.

How would I write this SQL query?

I have the following tables:
PERSON_T DISEASE_T DRUG_T
========= ========== ========
PERSON_ID DISEASE_ID DRUG_ID
GENDER PERSON_ID PERSON_ID
NAME DISEASE_START_DATE DRUG_START_DATE
DISEASE_END_DATE DRUG_END_DATE
I want to write a query that takes an input of a disease id and returns one row for each person in the database with a column for the gender, a column for whether or not they have ever had the disease, and a column for each drug which specifies if they took the drug before contracting the disease. I.E. true would mean drug_start_date < disease_start_date. False would mean drug_start_date>disease_start_date or the person never took that particular drug.
We currently pull all of the data from the database and use Java to create a 2D array with all of these values. We are investigating moving this logic into the database. Is it possible to create a query that will return the result set as I want it or would I have to create a stored procedure? We are using Postgres, but I assume an SQL answer for another database will easily translate to Postgres.
Based on the info provided:
SELECT p.name,
p.gender,
CASE WHEN d.disease_id IS NULL THEN 'N' ELSE 'Y' END AS had_disease,
dt.drug_id
FROM PERSON p
LEFT JOIN DISEASE d ON d.person_id = p.person_id
AND d.disease_id = ?
LEFT JOIN DRUG_T dt ON dt.person_id = p.person_id
AND dt.drug_start_date < d.disease_start_date
..but there's going to be a lot of rows that will look duplicate except for the drug_id column.
You're essentially looking to create a cross-tab query with the drugs. While there are plenty of OLAP tools out there that can do this sort of thing (among all sorts of other slicing and dicing of the data), doing something like this in traditional SQL is not easy (and, in general, impossible to do without some sort of procedural syntax in all but the simplest scenarios).
You essentially have two options when doing this with SQL (well, more accurately, you have one option, and another more complicated but flexible option that derives from it):
Use a series of CASE statements in your query to produce columns that are representative of each individual drug. This requires knowing the list of variable values (i.e. drugs) ahead of time
Use a procedural SQL language, such as T-SQL, to dynamically construct a query that uses case statements as described above, but along with obtaining that list of values from the data itself.
The two options essentially do the same thing, you're just trading simplicity and ease of maintenance for flexibility in the second option.
For example, using option 1:
select
p.NAME,
p.GENDER,
(case when d.DISEASE_ID is null then 0 else 1 end) as HAD_DISEASE,
(case when sum(case when dr.DRUG_ID = 1 then 1 else 0 end) > 0 then 1 else 0 end) as TOOK_DRUG_1,
(case when sum(case when dr.DRUG_ID = 2 then 1 else 0 end) > 0 then 1 else 0 end) as TOOK_DRUG_2,
(case when sum(case when dr.DRUG_ID = 3 then 1 else 0 end) > 0 then 1 else 0 end) as TOOK_DRUG_3
from PERSON_T p
left join DISEASE_T d on d.PERSON_ID = p.PERSON_ID and d.DISEASE_ID = #DiseaseId
left join DRUG_T dr on dr.PERSON_ID = p.PERSON_ID and dr.DRUG_START_DATE < d.DISEASE_START_DATE
group by p.PERSON_ID, p.NAME, p.GENDER, d.DISEASE_ID
As you can tell, this gets a little laborious as you get outside of just a few potential values.
The other option is to construct this query dynamically. I don't know PostgreSQL and what, if any, procedural capabilities it has, but the overall procedure would be this:
Gather list of potential DRUG_ID values along with names for the columns
Prepare three string values: the SQL prefix (everything before the first drug-related CASE statement, the SQL stuffix (everything after the last drug-related CASE statement), and the dynamic portion
Construct the dynamic portion by combining drug CASE statements based upon the previously retrieved list
Combine them into a single (hopefully valid) SQL statement and execute