SQL, Matching Rows Across Multiple Tables - sql

If I have two tables, A and B which have identical layout of:
Forename
Middlename
Surname
Date of Birth
Table A contains my data, table B contains data I wish to compare to table A.
I'd like to return all matches that are full matches (Forename, Middlename and Surname) as well as partial matches (First initial, surname, dob).
What would be the most efficient way of doing this and being able to distinguish between the two?
My initial thought is that I could do this in two passes. However, there must be a more efficient way, since two passes over a large number of records could be slow.

You can do this:
select T1.*, T2.*, 'exact-match' as mode
from T1 inner join T2
on T1.fname = T2.fname
and T1.mname = T2.mname
and T1.lname = T2.lname
and T1.dob = T2.dob
UNION
select T1.*, T2.*, 'partial-match' as mode
from T1 inner join T2
on left(T1.fname,1) = LEFT(T2.fname,1)
and T1.lname = T2.lname
and T1.dob = T2.dob
where T1.fname <> T2.fname
The WHERE clause is there because otherwise exact matches would also satisfy the partial-match test. You can get rid of it if you like. The second part of the query ignores middle name, and treats "Tim Q Jones" and "Tom X Jones" as a partial match if they're born on the same day. That's what you asked for, right?
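One wrinkle with the WHERE clause above: two rows with the same forename, surname and date of birth but different middle names end up in neither branch. Here is a minimal sketch of one way around that. It assumes none of the name columns are NULL, and the temp tables #T1 and #T2 with their sample rows are hypothetical. The partial branch excludes only the rows the exact branch already returned, so UNION ALL can replace UNION and skip the duplicate-removal step.

```sql
-- Hypothetical sample data; column names follow the answer above.
CREATE TABLE #T1 (fname varchar(50), mname varchar(50), lname varchar(50), dob date);
CREATE TABLE #T2 (fname varchar(50), mname varchar(50), lname varchar(50), dob date);

INSERT INTO #T1 VALUES ('Tim', 'Q', 'Jones', '1980-01-01'),
                       ('Ann', 'B', 'Smith', '1975-05-05');
INSERT INTO #T2 VALUES ('Tim', 'Q', 'Jones', '1980-01-01'),  -- exact match for Tim Q Jones
                       ('Tom', 'X', 'Jones', '1980-01-01'),  -- partial: same initial, surname, dob
                       ('Ann', 'C', 'Smith', '1975-05-05');  -- partial: only the middle name differs

SELECT T1.fname, T1.mname, T1.lname, T1.dob,
       T2.fname AS fname2, T2.mname AS mname2,
       'exact-match' AS mode
FROM #T1 AS T1
INNER JOIN #T2 AS T2
        ON T1.fname = T2.fname
       AND T1.mname = T2.mname
       AND T1.lname = T2.lname
       AND T1.dob   = T2.dob
UNION ALL
SELECT T1.fname, T1.mname, T1.lname, T1.dob,
       T2.fname AS fname2, T2.mname AS mname2,
       'partial-match' AS mode
FROM #T1 AS T1
INNER JOIN #T2 AS T2
        ON LEFT(T1.fname, 1) = LEFT(T2.fname, 1)
       AND T1.lname = T2.lname
       AND T1.dob   = T2.dob
-- Exclude exactly the pairs that the first branch returns
WHERE T1.fname <> T2.fname
   OR T1.mname <> T2.mname;
```

If the name columns can be NULL, the comparisons in both branches need NULL handling as well, which this sketch leaves out.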

If you really want to avoid two queries, you could do something like this:
SELECT A.*,
CASE WHEN A.Middlename <> B.Middlename THEN 'Partial'
ELSE 'Full'
END AS MatchType
FROM A
JOIN B ON (A.Forename = B.Forename AND
A.Middlename = B.Middlename AND
A.Surname = B.Surname)
OR
(LEFT(A.Forename,1) = LEFT(B.Forename,1) AND
A.Surname = B.Surname AND
A.DoB = B.DoB)
A JOIN with two different sets of JOIN criteria, and a case in the select that identifies which of the sets must have resulted in the joined records (If Middlename doesn't match, it must not have been a "full" match that resulted in the join).

This will do it in a single pass.
The condition for recognizing a full match has to be on both forename and middlename, otherwise it will classify some matches incorrectly.
select A.Forename, A.Middlename, A.Surname, A.DateOfBirth,
Case
when A.ForeName=B.ForeName and A.Middlename = B.middlename then 'full'
Else 'partial'
end as MatchType
from A
inner join B on
-- (Forename, Middlename and Surname)
(A.ForeName=B.ForeName
and A.Middlename = B.middlename
and A.Surname = B.surname)
or
-- (First initial, surname, dob)
(A.ForeName LIKE LEFT(B.ForeName,1)+'%'
and A.Surname = B.surname
and A.DateOfBirth = B.DateOfBirth)

Select
T1.Forename
, T1.Middlename
, T1.Surname
, T1.[Date of Birth]
, Case When T1.[Forename] = T2.[Forename] and T1.Middlename = T2.Middlename
Then 'Full'
else 'Partial'
end as Match_Type
From Table1 as T1
Inner Join Table2 as T2
on Left(T1.[Forename], 1) = Left(T2.[Forename], 1)
and T1.[Date of Birth] = T2.[Date of Birth]
and T1.Surname = T2.Surname

Related

Need to optimise select query

I have a query that does a select with joins across multiple tables that contain about 90 million rows in total. I only need data from the last 30 days. The problem is that while the select query is running, SQL Server throws a timeout and new records are not created during that time frame. The query takes about 5 seconds to complete.
I would like to optimise this query so that it won't go through the entire tables looking at the datetime and would only search the latest entries.
Right now it seems that I need to index the datetime column. Please advise whether I need to create indexes or whether there is another way to optimise this query.
SELECT [table1].Column1 AS InvoiceNo,
'ND' AS VATRegistrationNumber,
'ND' AS RegistrationNumber,
Column2 AS Country,
[table2].Column3 + ' ' + [table2].Column4 AS Name,
CAST([table1].Column5 AS date) AS InvoiceDate,
'SF' AS InvoiceType,
'' AS SpecialTaxation,
'' AS VATPointDate,
ROUND([table1Line].Column6, 2) AS TaxableValue,
CASE
WHEN [table1Line].Column7 = 9 THEN 'PVM2'
WHEN [table1Line].Column7 = 21 THEN 'PVM1'
WHEN [table1Line].Column7 = 0 THEN 'PVM14'
END AS TaxCode,
CAST([table1Line].Column7 AS int) AS TaxPercentage,
table1Line.Column8 - ROUND([table1Line].Column6, 2) AS Amount,
'' AS VATPointDate2,
[table1].Column1 AS InvoiceNo,
'' AS ReferenceNo,
'' AS ReferenceDate,
[table1].CustomerPersonID AS CustomerID
FROM [table1]
INNER JOIN [table2] ON [table1].CustomerPersonID = [table2].ID
INNER JOIN [table3] ON [table2].Column9 = [table3].ID
INNER JOIN [table1Line] ON [table1].ID = [table1Line].table1ID
INNER JOIN [table4] ON table1Line.TaxID = [table4].ID
INNER JOIN [table5] ON [table1].CompanyID = [table5].ID
INNER JOIN table6 ON [table1].SalesChannelID = table6.ID
WHERE Column5 LIKE '%date%'
AND table6.id = 5
OR table6.id = 2
AND Column5 LIKE '%date%'
ORDER BY Column5 DESC;
First things first: each database behaves a little differently, because the optimizer builds its plans from that database's own statistics and data distribution.
Version differences also play a part in the performance of the server.
Besides that, here are a few things to do to optimize this query.
When writing joins, I put the joined table first in the ON clause and compare it against the table already specified. This is a readability convention; the order of the operands in ON does not change the execution plan.
For example, t2 is checked against t1:
select t1.name, t2.car
from customers as t1
left join purchases as t2
on t2.customerid = t1.customerid
The next thing I see is the LIKE condition in the WHERE clause.
In your example the date is being compared as text.
I would recommend comparing it as a datetime instead of a string data type.
I would include that in the code below, but I'm not sure what format your date string is in.
'%date%' means "contains date".
Because the pattern starts with a wildcard, the engine cannot seek on an index; it has to read every row and try the pattern at each character position, from left to right.
So if your date text is 20200130, it tries to match starting at the 2, then at the 0, then at the next 2, and so on.
That significantly increases the processing time.
I also see that the date condition accidentally appears twice instead of once.
I would recommend doing:
WHERE LTRIM(RTRIM(Column5)) LIKE 'date'
As for the inner joins, I would not use them.
Use a LEFT JOIN, and then in the WHERE clause make sure the joined data has no NULL values.
This makes the LEFT JOIN return the same rows as the INNER JOIN. Whether it runs any faster depends on the plan the optimizer chooses, so compare the execution plans before relying on it.
For instance, the first join would look like this:
FROM [table1]
LEFT JOIN [table2] ON [table2].ID = [table1].CustomerPersonID
WHERE table2.id IS NOT NULL
I see an error in the WHERE clause:
AND table6.id = 5
OR table6.id = 2
AND binds more tightly than OR, so the two ID conditions need parentheses:
AND (table6.id = 5 OR table6.id = 2)
So here should be an optimized version of your code:
SELECT [table1].Column1 AS InvoiceNo,
'ND' AS VATRegistrationNumber,
'ND' AS RegistrationNumber,
Column2 AS Country,
[table2].Column3 + ' ' + [table2].Column4 AS Name,
CAST([table1].Column5 AS date) AS InvoiceDate,
'SF' AS InvoiceType,
'' AS SpecialTaxation,
'' AS VATPointDate,
ROUND([table1Line].Column6, 2) AS TaxableValue,
(CASE WHEN [table1Line].Column7 = 9 THEN 'PVM2'
WHEN [table1Line].Column7 = 21 THEN 'PVM1'
WHEN [table1Line].Column7 = 0 THEN 'PVM14'
ELSE '' END ) AS TaxCode,
CAST([table1Line].Column7 AS int) AS TaxPercentage,
table1Line.Column8 - ROUND([table1Line].Column6, 2) AS Amount,
'' AS VATPointDate2,
[table1].Column1 AS InvoiceNo,
'' AS ReferenceNo,
'' AS ReferenceDate,
[table1].CustomerPersonID AS CustomerID
FROM [table1]
LEFT JOIN [table2] ON [table2].ID = [table1].CustomerPersonID
LEFT JOIN [table3] ON [table3].ID = [table2].Column9
LEFT JOIN [table1Line] ON [table1Line].table1ID = [table1].ID
LEFT JOIN [table4] ON [table4].ID = table1Line.TaxID
LEFT JOIN [table5] ON [table5].ID = [table1].CompanyID
LEFT JOIN [table6] ON table6.ID = [table1].SalesChannelID
WHERE table2.ID IS NOT null
AND table3.ID IS NOT null
AND table1Line.ID IS NOT null
AND table4.ID IS NOT null
AND table5.ID IS NOT null
AND table6.ID IS NOT null
AND LTRIM(RTRIM(Column5)) LIKE 'date'
AND (table6.id = 5 OR table6.id = 2)
ORDER BY Column5 DESC;
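For the stated goal of reading only the last 30 days, a range predicate on the bare column is usually what lets SQL Server seek on an index instead of scanning. A minimal sketch follows, assuming Column5 on table1 is (or can be changed to) a datetime column. The index name and the @cutoff variable are hypothetical, and only two of the joined tables are shown to keep it short.

```sql
-- Assumption: [table1].Column5 is a datetime/datetime2 column.
-- Hypothetical index so the date range can be resolved with a seek.
CREATE NONCLUSTERED INDEX IX_table1_Column5
    ON [table1] (Column5);

-- Compute the cutoff once, outside the predicate.
DECLARE @cutoff datetime = DATEADD(DAY, -30, GETDATE());

SELECT [table1].Column1 AS InvoiceNo,
       CAST([table1].Column5 AS date) AS InvoiceDate
FROM [table1]
INNER JOIN table6 ON table6.ID = [table1].SalesChannelID
WHERE [table1].Column5 >= @cutoff     -- no function or wildcard wrapped around the column
  AND table6.ID IN (2, 5)
ORDER BY [table1].Column5 DESC;
```

Wrapping the column in LTRIM/RTRIM or CAST inside the WHERE clause would generally stop the index from being used for a seek, which is why the comparison is written against the column as stored.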

SQL Nested select grouped with several rows of results

Hope this makes sense..
I have the following database tables.
I am trying to group a resultset together in an SQL statement.
This is my current SQL statements:
SELECT
Patient.ID,
Patient.Name,
AnimalType.Value as AnimalType,
Patient.Age,
Customer.Firstname,
Customer.Lastname
FROM Patient
INNER JOIN Customer ON Patient.Owner_FK = Customer.ID
INNER JOIN AnimalType ON Patient.Type_FK = AnimalType.ID
SELECT
Treatment.Treatment_Date,
TreatmentType.Type
FROM Treatment
INNER JOIN TreatmentItem ON Treatment.ID = TreatmentItem.Treatment_FK
INNER JOIN TreatmentType ON TreatmentItem.TreatmentType_FK = TreatmentType.ID
INNER JOIN Patient ON TreatmentItem.Patient_FK = Patient.ID
WHERE Patient.ID = 132
There are two issues with this:
I have a static ID, and the results are split.
This is the result of the above SQL statements:
My issue is that the last result set should be shown together with the corresponding animal (patient),
but without duplicate data. I could get the data all in one go, but then I would have a lot of duplicate rows with only the TreatmentType being different.
So how do I make this work?
I have searched to no avail, and have not been able to make a correct Group by, that would make it work.
Does it make any sense ?
Is it even possible ?
example of desired result:
I believe you can achieve what you want with a single query, CASE expressions, and the ROW_NUMBER() function, but it would require converting all non-text columns to text.
Here is a rough stab at a potential solution (I did not build your DB, so I haven't verified that this exact SQL runs, but the overall concept works).
WITH CTE_PatientTreatments AS (
SELECT
-- Get the row number for each treatment for a given patient
ROW_NUMBER() OVER (PARTITION BY Patient.ID ORDER BY Treatment.ID) AS RowNum,
Patient.ID,
Patient.Name,
AnimalType.Value as AnimalType,
Patient.Age,
Customer.Firstname,
Customer.Lastname,
Treatment.Treatment_Date,
TreatmentType.Type
FROM Patient
INNER JOIN Customer ON Patient.Owner_FK = Customer.ID
INNER JOIN AnimalType ON Patient.Type_FK = AnimalType.ID
INNER JOIN TreatmentItem ON TreatmentItem.Patient_FK = Patient.ID
INNER JOIN Treatment ON Treatment.ID = TreatmentItem.Treatment_FK
INNER JOIN TreatmentType ON TreatmentItem.TreatmentType_FK = TreatmentType.ID
WHERE Patient.ID = 132
)
-- Only display patient information for the first row
SELECT -- Convert numeric columns to text so that the "ELSE ''" doesn't get coerced into a number (0)
CASE WHEN (RowNum > 1) THEN '' ELSE CAST(ID AS VARCHAR) END AS ID,
CASE WHEN (RowNum > 1) THEN '' ELSE Name END AS Name,
CASE WHEN (RowNum > 1) THEN '' ELSE AnimalType END AS AnimalType,
CASE WHEN (RowNum > 1) THEN '' ELSE CAST(Age AS VARCHAR) END AS Age,
CASE WHEN (RowNum > 1) THEN '' ELSE Firstname END AS Firstname,
CASE WHEN (RowNum > 1) THEN '' ELSE Lastname END AS Lastname,
Treatment_Date,
Type
FROM CTE_PatientTreatments AS pt
-- ORDER BY is not allowed inside a CTE (unless TOP or OFFSET is used), so sort in the outer query.
-- The pt. qualifier sorts on the CTE's own ID column, not on the blanked-out ID alias above,
-- which keeps rows for the same patient together.
ORDER BY pt.ID, pt.RowNum

If else condition in MSSQL

Suppose I have a serial number, a test name and a few other columns. I want to write a condition: if TESTNAME is null for a particular serial number, set TESTNAME to blank; otherwise perform the inner join.
SELECT
(A.PTNUMBER + '-' +A.SL_NO) AS ENUMBER,
D.ENGINEER AS REQ, D.DATETIME as "DATE",
(select Value
from DROPDOWN
where B.TEST_NAME=CONVERT(VARCHAR,DropdownID)) TESTNAME,
TABLE_NAME AS TABLETD
FROM INSPECTION D
INNER JOIN TABLEA A ON D.ENGID = CONVERT(VARCHAR,A.EN_ID)
INNER JOIN TABLEB B ON B.ENGID = CONVERT(VARCHAR,A.EN_ID)
INNER JOIN TABLEC C ON C.ENGID = CONVERT(VARCHAR,A.EN_ID)
I'm not sure what you mean by setting TESTNAME to blank, but if you mean in a SELECT query, you can do something like this:
select *,
case when TESTNAME is null and serial_number = some_value then '' else TESTNAME end as TESTNAME
from mytable
You could combine a case expression and coalesce() along with your join to choose the value you want to return.
select serial_number, ...
,case when coalesce(testname,'') <> ''
then t2.testname
else coalesce(testname,'') end
from t
inner join t2
on ...
You can use isnull() or coalesce() in sql server to return a different value to replace null.
select isnull(testname,'')
or
select coalesce(testname,'')
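The main difference between the two is that coalesce() accepts more than two parameters and returns the first one that is not null. The two also differ in how the result's data type is determined: isnull() uses the type of its first argument, while coalesce() follows data type precedence across all of its arguments.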
select coalesce(testname,testname2,'')
coalesce() is also standard ANSI sql, so you will find it in most RDBMS. isnull() is specific to sql server.
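One practical consequence of choosing between the two in SQL Server is the type of the result. A small sketch, using a hypothetical variable in place of the testname column:

```sql
-- Hypothetical variable standing in for a narrow varchar column that is NULL.
DECLARE @testname varchar(3) = NULL;

SELECT ISNULL(@testname, 'unknown')   AS via_isnull,    -- result is typed varchar(3), so the replacement is cut to 'unk'
       COALESCE(@testname, 'unknown') AS via_coalesce;  -- result type comes from all arguments, so 'unknown' survives
```

When the replacement value is an empty string, as in this question, either function gives the same result.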
SELECT (A.PTNUMBER + '-' + A.SL_NO) AS ENUMBER,
D.ENGINEER AS REQ,
D.DATETIME as "DATE",
case
when SerialNo = xxx and TESTNAME is null then ''
else (select Value from DROPDOWN where B.TEST_NAME = CONVERT(VARCHAR, DropdownID))
end AS TESTNAME,
TABLE_NAME AS TABLETD
FROM INSPECTION D
INNER JOIN TABLEA A ON D.ENGID = CONVERT(VARCHAR, A.EN_ID)
INNER JOIN TABLEB B ON B.ENGID = CONVERT(VARCHAR, A.EN_ID)
INNER JOIN TABLEC C ON C.ENGID = CONVERT(VARCHAR, A.EN_ID);

In T-SQL, how can I guarantee that a column value is updated only after all other column values in the same UPDATE statement have been updated?

I'd like to do a sort of "bulk update" of a table using an inner join. Here's how it works.
There are two tables in question, one of which is being updated. I'll call the one being updated OriginalTable and the other one UpdateData.
Both OriginalTable and UpdateData contain a PK column called Id, which we join on later. OriginalTable contains a number of other columns, all of which are nvarchars, I'll call these the data columns. Finally, OriginalTable also contains a Checksum column, which I'd like to have contain a SHA1 hash of the string concatenated data in a given row's data columns. UpdateData contains a subset of the data columns in the OriginalTable, and that's what's used to specify what to update the data in OriginalTable to.
If a value in UpdateData is a non-empty string or NULL, then I'd like to update the corresponding row in OriginalTable with that value. If the value is an empty string, then I don't want to modify the value of OriginalTable row.
Let's say that the data columns in OriginalTable are: FirstName, LastName, MI, Age. Let's say the data columns in UpdateData are: FirstName, LastName, Age. This basically means we're updating information for everything but MI.
I can accomplish this update with this SQL:
UPDATE T
SET FirstName = CASE UD.FirstName WHEN '' THEN T.FirstName ELSE UD.FirstName END,
LastName = CASE UD.LastName WHEN '' THEN T.LastName ELSE UD.LastName END,
Age = CASE UD.Age WHEN '' THEN T.Age ELSE UD.Age END
FROM #OriginalTable T
INNER JOIN #UpdateData UD
ON T.Id = UD.Id;
This is well and good. Now, the challenge I'm facing is that when it comes to updating the Checksum value for a row, I don't know how to guarantee that the hash calculation will occur only after the other columns have been updated. I should mention that it's essential that the checksum calculation occurs in the same statement. I'd like to do something like this:
UPDATE T
SET FirstName = CASE UD.FirstName WHEN '' THEN T.FirstName ELSE UD.FirstName END,
LastName = CASE UD.LastName WHEN '' THEN T.LastName ELSE UD.LastName END,
Age = CASE UD.Age WHEN '' THEN T.Age ELSE UD.Age END,
Checksum = CONVERT(varchar(40), HASHBYTES('SHA1', T.FirstName + T.LastName + T.MI + T.Age), 2)
FROM #OriginalTable T
INNER JOIN #UpdateData UD
ON T.Id = UD.Id;
I need to calculate the value of the Checksum column with the values in OriginalTable after the update. How can I accomplish this hash calculation in the same UPDATE statement, guaranteeing it's calculated for data after the other columns have been updated?
You can try something like this (untested):
UPDATE T
SET FirstName = UD.FirstName,
LastName = UD.LastName,
Age = UD.Age,
Checksum = CONVERT(varchar(40), HASHBYTES('SHA1', UD.FirstName + UD.LastName + T.MI + UD.Age), 2)
FROM #OriginalTable T
INNER JOIN (
SELECT T1.Id,
CASE UD1.FirstName WHEN '' THEN T1.FirstName ELSE UD1.FirstName END AS FirstName,
CASE UD1.LastName WHEN '' THEN T1.LastName ELSE UD1.LastName END AS LastName,
CASE UD1.Age WHEN '' THEN T1.Age ELSE UD1.Age END AS Age
FROM #OriginalTable T1
INNER JOIN #UpdateData UD1 ON T1.Id = UD1.Id) AS UD
ON T.Id = UD.Id;
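The reason the question's second statement hashes stale data is that every column reference on the right-hand side of SET sees the row as it was before the UPDATE. The derived table above works around that by computing the new values first. The same idea can also be written with CROSS APPLY, which avoids joining #OriginalTable a second time. This is a sketch only: the sample rows and the NewVals alias are hypothetical, and the Checksum column is bracketed here purely as a precaution because it shares its name with a built-in function.

```sql
-- Hypothetical sample tables matching the layout described in the question.
CREATE TABLE #OriginalTable (Id int PRIMARY KEY, FirstName nvarchar(50), LastName nvarchar(50),
                             MI nvarchar(5), Age nvarchar(5), [Checksum] varchar(40));
CREATE TABLE #UpdateData (Id int PRIMARY KEY, FirstName nvarchar(50), LastName nvarchar(50),
                          Age nvarchar(5));

INSERT INTO #OriginalTable (Id, FirstName, LastName, MI, Age)
VALUES (1, N'Ann', N'Smith', N'B', N'40');
INSERT INTO #UpdateData (Id, FirstName, LastName, Age)
VALUES (1, N'', N'Jones', N'41');   -- empty string: keep the existing FirstName

UPDATE T
SET FirstName  = NewVals.FirstName,
    LastName   = NewVals.LastName,
    Age        = NewVals.Age,
    -- Hash the values being written, not the pre-update values in T
    [Checksum] = CONVERT(varchar(40),
                         HASHBYTES('SHA1', NewVals.FirstName + NewVals.LastName + T.MI + NewVals.Age), 2)
FROM #OriginalTable AS T
INNER JOIN #UpdateData AS UD
        ON T.Id = UD.Id
-- Compute each post-update value once so SET and the hash use the same expressions
CROSS APPLY (SELECT CASE UD.FirstName WHEN N'' THEN T.FirstName ELSE UD.FirstName END AS FirstName,
                    CASE UD.LastName  WHEN N'' THEN T.LastName  ELSE UD.LastName  END AS LastName,
                    CASE UD.Age       WHEN N'' THEN T.Age       ELSE UD.Age       END AS Age) AS NewVals;

SELECT Id, FirstName, LastName, MI, Age, [Checksum] FROM #OriginalTable;
```

Note that if any of the concatenated values is NULL, the + concatenation yields NULL under the default settings and the checksum becomes NULL too, so NULLs coming from UpdateData may need their own handling.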

SQL - check for duplicate names/nicknames

I have a query that gives me duplicate names in my table. But I need to add checking of nicknames. I've tried many variations but am still stumped. The following query ran for over 12 minutes, so I canceled it.
WITH TEAM2 as
(
SELECT ID, LastName, FirstName, Name,
ROW_NUMBER() OVER (PARTITION BY LastName, FirstName order by LastName, FirstName,ID DESC) RN
FROM dbo.vw_Users_Details
WHERE Lastname <> ''
AND Firstname <> ''
AND Not_Dupe_Flag <> 1
)
SELECT a.ID, a.LastName, a.FirstName
FROM TEAM2 a
where exists (select 1
from TEAM2 b
where (b.FirstName = a.FirstName
and b.LastName = a.LastName
and b.RN > 1)
OR
(b.LastName = a.LastName
AND EXISTS (SELECT 1 FROM pdNicknames AS c WHERE c.NAME = a.firstname AND c.variation = b.firstname)
and b.RN > 1)
)
order by a.LastName, a.FirstName, a.id
You can use the having clause.
For example:
select b.Branches_ShortName
from kplus..Folders f
inner join kplus..Portfolios p on p.Portfolios_Id = f.Portfolios_Id
inner join kplus..Branches b on b.Branches_Id = p.Branches_Id
group by Branches_ShortName
having count(Branches_ShortName) > 1
This will provide only the Branches that have more than 1 Folder :)
Okay, you're attempting to find all users who share the same name/nickname.
I believe the following should work;
SELECT a.ID, a.LastName, a.FirstName
FROM dbo.vw_Users_Details as a
WHERE a.LastName <> ''
AND a.FirstName <> ''
AND EXISTS (SELECT '1'
FROM dbo.vw_Users_Details as b
LEFT JOIN pdNicknames as c
ON (c.name = b.FirstName
AND c.variation = a.FirstName)
OR (c.name = a.FirstName
AND c.variation = b.FirstName)
WHERE b.ID <> a.ID
AND b.LastName = a.LastName
AND (b.FirstName = a.FirstName
OR (c.name IS NOT NULL OR c.variation IS NOT NULL)
)
)
I make no guarantees about the execution performance of this statement, as you haven't provided enough information for us to know. However, it's likely to be better, given you won't need the OLAP function (ROW_NUMBER()); I do recommend indexes on the various name columns and on variation, of course. I left off Not_Dupe_Flag because I'm a little confused by its use (because you seem to be using '1' as 'false', which is the opposite of how most comparisons are set up); at minimum, never include 'Not' as part of a boolean variable name, since it makes reasoning about it difficult (use Unique_Name or Duplicated_Name, either of which is immediately understandable).
EDIT:
If you need to restrict your selection, I recommend encapsulating the query in a view (including the ROW_NUMBER() function) and querying the view. Alternatively, if your RDBMS supports it, wrap the query in a CTE. Multiple nested FROM clauses are like multiple nested if statements: confusing. Being able to logically separate parts of the query with a view or CTE goes a long way toward retaining sanity.