Allow nulls / de-duplicate within multi-table join? T-SQL

I was wondering if there is a way in either SSIS or T-SQL (SQL Server 2012) to easily return non-duplicate data when doing a multi-table join (per column, not per row).
I am trying to denormalize / flatten a bunch of data for conversion into a warehouse and I am winding up duplicating a ton of data. I'm hoping there is a sort of rollup/summary function or a design concept I am missing that can help me when merging multiple tables to a single destination.
Example
Let's say for example I have three tables: CUSTOMERS, CUSTOMER_ADDRESSES and CUSTOMER_ACCOUNTS. They and their data look like this:
CUSTOMERS
CUST_ID NAME
1 Burton Guster
CUSTOMER_ADDRESSES
CUST_ID ADDR_SEQ ADDRESS
1 1 123 Awesome St
1 2 456 Fake St
CUSTOMER_ACCOUNTS
CUST_ID ACCT_SEQ ACCT_TYPE ACCOUNT_OPEN_DT
1 1 TAP 1/1/1989
1 2 PHARMA 1/1/2010
I join them using a query like this:
SELECT a.CUST_ID, a.NAME, b.ADDRESS, c.ACCT_TYPE, c.ACCOUNT_OPEN_DT
FROM CUSTOMERS a
JOIN CUSTOMER_ADDRESSES b on a.CUST_ID = b.CUST_ID
JOIN CUSTOMER_ACCOUNTS c on a.CUST_ID = c.CUST_ID
Obviously each row joins to each row and as expected my output looks like this:
ID NAME ADDRESS ACCT_TYPE ACCT_OPEN_DT
1 Burton Guster 123 Awesome St TAP 1/1/1989
1 Burton Guster 123 Awesome St PHARMA 1/1/2010
1 Burton Guster 456 Fake St TAP 1/1/1989
1 Burton Guster 456 Fake St PHARMA 1/1/2010
Is there any way for me to get something like this instead?:
ID NAME ADDRESS ACCT_TYPE ACCT_OPEN_DT
1 Burton Guster 123 Awesome St TAP 1/1/1989
1 NULL 456 Fake St PHARMA 1/1/2010
The goal being to group each column, returning the distinct value per column only once. The larger set would be grouped by the customer ID.
Thank you

Sure, it can be done, although it's kinda awkward to do... :-)
You can use ROW_NUMBER() to get a running row number per customer from each table independently. Then you can use these row numbers to bring the data together:
;WITH custCTE AS (
SELECT CUST_ID, NAME, 1 AS CUST_ROW_N
FROM CUSTOMERS
),
addrCTE AS (
SELECT CUST_ID, ADDRESS, ROW_NUMBER() OVER(PARTITION BY CUST_ID ORDER BY ADDR_SEQ) CUST_ROW_N
FROM CUSTOMER_ADDRESSES
),
acctCTE AS (
SELECT CUST_ID, ACCT_TYPE, ACCOUNT_OPEN_DT, ROW_NUMBER() OVER(PARTITION BY CUST_ID ORDER BY ACCT_SEQ) CUST_ROW_N
FROM CUSTOMER_ACCOUNTS
)
SELECT COALESCE(a.CUST_ID, b.CUST_ID, c.CUST_ID), a.NAME, b.ADDRESS, c.ACCT_TYPE, c.ACCOUNT_OPEN_DT
FROM custCTE a FULL JOIN addrCTE b ON
a.CUST_ID = b.CUST_ID AND a.CUST_ROW_N = b.CUST_ROW_N FULL JOIN acctCTE c ON
(b.CUST_ID = c.CUST_ID AND b.CUST_ROW_N = c.CUST_ROW_N) OR (a.CUST_ID = c.CUST_ID AND a.CUST_ROW_N = c.CUST_ROW_N)
Here's an SQLFiddle
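For readers without a SQL Server instance at hand, here is a minimal sketch of the same row-number alignment, run through Python's built-in sqlite3 module (it assumes a SQLite build with window functions, 3.25 or later). Because older SQLite builds have no FULL JOIN, the sketch swaps the FULL JOINs for a hypothetical "spine" CTE holding every (CUST_ID, RN) pair, and LEFT JOINs each table onto it. The names spine and RN are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMERS (CUST_ID INTEGER, NAME TEXT);
CREATE TABLE CUSTOMER_ADDRESSES (CUST_ID INTEGER, ADDR_SEQ INTEGER, ADDRESS TEXT);
CREATE TABLE CUSTOMER_ACCOUNTS (CUST_ID INTEGER, ACCT_SEQ INTEGER,
                                ACCT_TYPE TEXT, ACCOUNT_OPEN_DT TEXT);
INSERT INTO CUSTOMERS VALUES (1, 'Burton Guster');
INSERT INTO CUSTOMER_ADDRESSES VALUES (1, 1, '123 Awesome St'), (1, 2, '456 Fake St');
INSERT INTO CUSTOMER_ACCOUNTS VALUES (1, 1, 'TAP', '1/1/1989'), (1, 2, 'PHARMA', '1/1/2010');
""")

query = """
WITH custCTE AS (
    SELECT CUST_ID, NAME, 1 AS RN
    FROM CUSTOMERS
),
addrCTE AS (
    SELECT CUST_ID, ADDRESS,
           ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY ADDR_SEQ) AS RN
    FROM CUSTOMER_ADDRESSES
),
acctCTE AS (
    SELECT CUST_ID, ACCT_TYPE, ACCOUNT_OPEN_DT,
           ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY ACCT_SEQ) AS RN
    FROM CUSTOMER_ACCOUNTS
),
-- Hypothetical helper: every (customer, row number) pair seen in any table.
spine AS (
    SELECT CUST_ID, RN FROM custCTE
    UNION
    SELECT CUST_ID, RN FROM addrCTE
    UNION
    SELECT CUST_ID, RN FROM acctCTE
)
SELECT s.CUST_ID, a.NAME, b.ADDRESS, c.ACCT_TYPE, c.ACCOUNT_OPEN_DT
FROM spine s
LEFT JOIN custCTE a ON a.CUST_ID = s.CUST_ID AND a.RN = s.RN
LEFT JOIN addrCTE b ON b.CUST_ID = s.CUST_ID AND b.RN = s.RN
LEFT JOIN acctCTE c ON c.CUST_ID = s.CUST_ID AND c.RN = s.RN
ORDER BY s.CUST_ID, s.RN
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data from the question this yields two rows for the customer, and NAME is NULL (None in Python) on the second one, which is the shape asked for.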

Related

SQL subquery with latest record

I've read just about every question on here that I can find that is referencing getting the latest record from a subquery, but I just can't work out how to make it work in my situation.
I'm creating an SSRS report for use on SQL Server 2008.
In the database is a table of contacts and DBSdata. I want to pull up a list of contacts and the latest record (many of the fields from that row) from the DBSdata table (expiry date furthest in the future)
Contacts
========
PKContactID ContactName
----------- -----------
1 JONES Chris
2 SMITH Mary
3 GREY Jean
DBSdata
=======
Ordinal FKContactID ExpiryDate IssueDate DBSType
------- ----------- ---------- --------- -------
3 1 2021-09-01 2019-09-01 Internal
2 1 2019-08-31 2017-08-31 External
1 1 2017-07-01 2015-07-01 Internal
2 2 2021-04-15 2019-04-15 Internal
1 2 2019-05-05 2017-05-06 External
1 3 2018-01-03 2016-03-02 External
And the result I'd like is:
Latest DBS
==========
PKContactID ContactName ExpiryDate IssueDate DBSType
-------------------------------------------------------------------
3 GREY Jean 2018-01-03 2016-03-02 External
1 JONES Chris 2021-09-01 2019-09-01 Internal
2 SMITH Mary 2021-04-15 2019-04-15 Internal
[The DBSData table doesn't have its own primary key field. That's not something I have control over, unfortunately. The ordinal increases per contact, so FKContactID+Ordinal is unique.]
This is the code I've kind of got to, but it isn't working. The system I'm uploading the SSRS to doesn't give me any useful error message at all, so I can't be more specific about what isn't working I'm afraid. I get none of the SSRS report displayed, just an error saying the dataset source isn't working.
SELECT
c.PKContactID, c.ContactName, d.ExpiryDate, d.IssueDate, d.DBSType
FROM
Contacts c
LEFT JOIN (
SELECT TOP 1 FKContactID, ExpiryDate, IssueDate, DBSType
FROM DBSData
WHERE FKContactID = c.PKContactID
ORDER BY ExpiryDate DESC
) d ON c.PKContactID = d.FKContactID
ORDER BY
c.ContactName
I suspect it's something to do with that WHERE in the subquery, but if I don't have that, the subquery runs against the WHOLE table and returns 1 row, not the top 1 for that contact.
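Your method would work using APPLY instead of JOIN. A derived table in a JOIN cannot reference columns of the other tables in the FROM clause (here c.PKContactID), whereas the subquery of an APPLY can: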
SELECT c.PKContactID, c.ContactName,
d.ExpiryDate, d.IssueDate, d.DBSType
FROM Contacts c OUTER APPLY
(SELECT TOP 1 d.*
FROM DBSData d
WHERE d.FKContactID = c.PKContactID
ORDER BY d.ExpiryDate DESC
) d
ORDER BY c.ContactName;
Technically APPLY implements something called a lateral join. This is like a correlated subquery, but it can return multiple rows and multiple columns. Lateral joins are very powerful, and this is a good example for using them.
For performance, you want indexes on DBSData(FKContactID, ExpiryDate DESC) (perhaps including the other columns you want as well) and Contacts(ContactName).
With the right indexes, I would expect this to have performance at least as good as other methods.
An alternative that also typically has good performance is using a correlated subquery for filtering:
SELECT c.PKContactID, c.ContactName,
d.ExpiryDate, d.IssueDate, d.DBSType
FROM Contacts c LEFT JOIN
DBSData d
ON d.FKContactID = c.PKContactID AND
d.ExpiryDate = (SELECT MAX(d2.ExpiryDate)
FROM DBSData d2
WHERE d2.FKContactID = d.FKContactID
);
Note that to match the LEFT JOIN, the correlation condition needs to be in the ON clause, not the WHERE clause.
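To make the ON versus WHERE point concrete, here is a small sketch using Python's built-in sqlite3 module rather than SQL Server. It loads the sample data from the question plus one hypothetical contact ('DOE Sam', ID 4) who has no DBSData rows, and runs the same query with the correlation condition first in the ON clause and then in a WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contacts (PKContactID INTEGER, ContactName TEXT);
CREATE TABLE DBSData (Ordinal INTEGER, FKContactID INTEGER,
                      ExpiryDate TEXT, IssueDate TEXT, DBSType TEXT);
-- Contact 4 is a made-up row with no DBS records at all.
INSERT INTO Contacts VALUES
    (1, 'JONES Chris'), (2, 'SMITH Mary'), (3, 'GREY Jean'), (4, 'DOE Sam');
INSERT INTO DBSData VALUES
    (3, 1, '2021-09-01', '2019-09-01', 'Internal'),
    (2, 1, '2019-08-31', '2017-08-31', 'External'),
    (1, 1, '2017-07-01', '2015-07-01', 'Internal'),
    (2, 2, '2021-04-15', '2019-04-15', 'Internal'),
    (1, 2, '2019-05-05', '2017-05-06', 'External'),
    (1, 3, '2018-01-03', '2016-03-02', 'External');
""")

# The only difference between the two runs is the keyword that
# introduces the "latest expiry date" condition.
template = """
SELECT c.PKContactID, c.ContactName, d.ExpiryDate, d.IssueDate, d.DBSType
FROM Contacts c
LEFT JOIN DBSData d
       ON d.FKContactID = c.PKContactID
    {keyword} d.ExpiryDate = (SELECT MAX(d2.ExpiryDate)
                              FROM DBSData d2
                              WHERE d2.FKContactID = d.FKContactID)
ORDER BY c.ContactName
"""

in_on = conn.execute(template.format(keyword="AND")).fetchall()
in_where = conn.execute(template.format(keyword="WHERE")).fetchall()

print("condition in ON   :", in_on)
print("condition in WHERE:", in_where)
```

With the condition in ON, the contact without DBS records is kept with NULLs in the DBS columns. Moving it to WHERE filters that row out, which effectively turns the LEFT JOIN into an inner join.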
Finally, if you do use window functions, I would recommend a subquery for getting the first row:
SELECT c.PKContactID, c.ContactName,
d.ExpiryDate, d.IssueDate, d.DBSType
FROM Contacts c LEFT JOIN
(SELECT d.*,
ROW_NUMBER() OVER (PARTITION BY d.FKContactID ORDER BY d.ExpiryDate DESC) as seqnum
FROM DBSData d
) d
ON d.FKContactID = c.PKContactID AND
d.seqnum = 1;
Doing the subquery before the JOIN gives more opportunities for the optimizer to produce a better execution plan.
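As a rough check of the "number the rows first, then join" idea, the following sketch runs it against the question's sample data with Python's built-in sqlite3 module (assuming SQLite 3.25 or later for window functions). The rows are ranked by ExpiryDate, since that is what the question defines as "latest"; the inner alias x is a name chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contacts (PKContactID INTEGER, ContactName TEXT);
CREATE TABLE DBSData (Ordinal INTEGER, FKContactID INTEGER,
                      ExpiryDate TEXT, IssueDate TEXT, DBSType TEXT);
INSERT INTO Contacts VALUES (1, 'JONES Chris'), (2, 'SMITH Mary'), (3, 'GREY Jean');
INSERT INTO DBSData VALUES
    (3, 1, '2021-09-01', '2019-09-01', 'Internal'),
    (2, 1, '2019-08-31', '2017-08-31', 'External'),
    (1, 1, '2017-07-01', '2015-07-01', 'Internal'),
    (2, 2, '2021-04-15', '2019-04-15', 'Internal'),
    (1, 2, '2019-05-05', '2017-05-06', 'External'),
    (1, 3, '2018-01-03', '2016-03-02', 'External');
""")

query = """
SELECT c.PKContactID, c.ContactName, d.ExpiryDate, d.IssueDate, d.DBSType
FROM Contacts c
LEFT JOIN (
    -- Rank each contact's DBS rows, latest expiry date first.
    SELECT x.*,
           ROW_NUMBER() OVER (PARTITION BY x.FKContactID
                              ORDER BY x.ExpiryDate DESC) AS seqnum
    FROM DBSData x
) d ON d.FKContactID = c.PKContactID
   AND d.seqnum = 1
ORDER BY c.ContactName
"""

latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

The three rows printed match the "Latest DBS" table the question asks for.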
Here's one option using row_number():
SELECT *
FROM (
SELECT
c.PKContactID, c.ContactName, d.ExpiryDate, d.IssueDate, d.DBSType,
row_number() over (partition by c.PKContactID order by d.ExpiryDate desc) rn
FROM
Contacts c
LEFT JOIN DBSData d ON d.FKContactID = c.PKContactID
) t
WHERE rn = 1
ORDER BY ContactName
Online Demo
This solution gives the result you expected, assuming the highest Ordinal per contact is also the latest record, and it should perform well. Note that the inner join leaves out contacts that have no DBSdata rows.
select c.PKContactID, c.ContactName, d.ExpiryDate, d.IssueDate, d.DBSType
from Contacts c
inner join DBSdata d
on c.PKContactID = d.FKContactID
where d.Ordinal in (select max(d2.Ordinal) from DBSdata d2 where d2.FKContactID = c.PKContactID)
order by c.ContactName

Most popular pairs of shops for workers from each company

I've got 2 tables, one with sales and one with companies:
Sales Table
Transaction_Id Shop_id Sale_date Client_ID
92356 24234 11.09.2018 12356
92345 32121 11.09.2018 32121
94323 24321 11.09.2018 21231
94278 45321 11.09.2018 42123
Company table
Client_ID Company_name
12345 ABC
13322 ABC
32321 BCD
22221 BCD
What I want to achieve is a distinct count of clients from each company for each pair of shops (clients who had at least 1 transaction in both shops):
Shop_Id_1 Shop_id_2 Company_name Count(distinct Client_id)
12356 12345 ABC 31
12345 14278 ABC 23
14323 12345 BCD 32
14278 12345 BCD 43
I think that I have to use a self join, but my queries, even when filtered to one week, are killing the DB. Any thoughts on that? I'm using Microsoft SQL Server 2012.
Thanks
I think this is a self-join and aggregation, with a twist. The twist is that you want to include the company in each sales record, so it can be used in the self-join:
with sc as (
select s.*, c.company_name
from sales s join
companies c
on s.client_id = c.client_id
)
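select sc1.shop_id as shop_id_1, sc2.shop_id as shop_id_2, sc1.company_name,
count(distinct sc1.client_id) as num_clients
from sc sc1 join
sc sc2
on sc1.client_id = sc2.client_id and
sc1.company_name = sc2.company_name and
sc1.shop_id < sc2.shop_id -- each pair of different shops once, no shop paired with itself
group by sc1.shop_id, sc2.shop_id, sc1.company_name;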
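A small sketch of this self-join on made-up data may help check the logic. It uses Python's built-in sqlite3 module instead of SQL Server, and all shop IDs, client IDs and table contents below are hypothetical. The sketch uses the condition sc1.shop_id < sc2.shop_id, which keeps each unordered pair of shops once and excludes a shop paired with itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (transaction_id INTEGER, shop_id INTEGER, client_id INTEGER);
CREATE TABLE companies (client_id INTEGER, company_name TEXT);
-- Hypothetical data: clients 100 and 101 work for ABC, client 200 for BCD.
INSERT INTO companies VALUES (100, 'ABC'), (101, 'ABC'), (200, 'BCD');
INSERT INTO sales VALUES
    (1, 10, 100), (2, 20, 100), (7, 10, 100),  -- client 100: shops 10 (twice) and 20
    (3, 10, 101), (4, 20, 101), (5, 30, 101),  -- client 101: shops 10, 20 and 30
    (6, 10, 200);                              -- client 200: one shop only, no pair
""")

query = """
WITH sc AS (
    SELECT s.*, c.company_name
    FROM sales s
    JOIN companies c ON s.client_id = c.client_id
)
SELECT sc1.shop_id AS shop_id_1,
       sc2.shop_id AS shop_id_2,
       sc1.company_name,
       COUNT(DISTINCT sc1.client_id) AS num_clients
FROM sc sc1
JOIN sc sc2
  ON sc1.client_id = sc2.client_id
 AND sc1.company_name = sc2.company_name
 AND sc1.shop_id < sc2.shop_id
GROUP BY sc1.shop_id, sc2.shop_id, sc1.company_name
ORDER BY shop_id_1, shop_id_2, sc1.company_name
"""

pairs = conn.execute(query).fetchall()
for row in pairs:
    print(row)
```

Client 100 has two transactions in shop 10, but COUNT(DISTINCT ...) still counts that client once for the pair (10, 20).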
I think there are some issues with your question. I interpreted it to mean that the company table contains the shop IDs, not the client IDs.
First you can create a solution to get the shops as columns for each company. Here I chose a maximum of 5 shops per company. Don't forget the semicolon at the end of the previous statement before the CTEs.
WITH CTE_Comp AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY CompanyName ORDER BY ShopID) AS RowNumb
FROM Company AS C
)
SELECT C1.ShopID,
C2.ShopID AS ShopID_2,
C3.ShopID AS ShopID_3,
C4.ShopID AS ShopID_4,
C5.ShopID AS ShopID_5,
C1.CompanyName
INTO ShopsByCompany
FROM CTE_Comp AS C1
LEFT JOIN CTE_Comp AS C2 ON C1.CompanyName = C2.CompanyName AND C2.RowNumb = 2
LEFT JOIN CTE_Comp AS C3 ON C1.CompanyName = C3.CompanyName AND C3.RowNumb = 3
LEFT JOIN CTE_Comp AS C4 ON C1.CompanyName = C4.CompanyName AND C4.RowNumb = 4
LEFT JOIN CTE_Comp AS C5 ON C1.CompanyName = C5.CompanyName AND C5.RowNumb = 5
WHERE C1.RowNumb = 1
After that, in a few steps, I think you could get the desired result:
WITH ClientsPerShop AS
(
SELECT ShopID,
COUNT (DISTINCT ClientID) AS TotalClients
FROM Sales
GROUP BY ShopID
)
, ClientsPerCompany AS
(
SELECT CompanyName,
SUM (TotalClients) AS ClientsPerComp
FROM Company AS C
INNER JOIN ClientsPerShop AS CPS ON C.ShopID = CPS.ShopID
GROUP BY CompanyName
)
SELECT *
FROM ClientsPerCompany AS CPA
INNER JOIN ShopsByCompany AS SBC ON SBC.CompanyName = CPA.CompanyName
Hopefully this will bring you closer to your solution, best of luck!

T SQL Adress Table with the same Company need latest Contact

I have an address table with primary and secondary company locations, for example:
ADDRESSES:
ID CompanyName AdressType MainID Location
1 ExampleCompany H 0 Germany
2 ExampleCompany N 1 Sweden
3 ExampleCompany N 1 Germany
and we have another table, Contacts, which includes the latest contact for each of the company locations:
Contacts
ID SuperID Datecreate Notes
1 1 10.04.2018 XY
2 3 09.04.2018 YX
3 2 11.04.2018 XX
Now we want to select the latest contact per company and sort the results, so that we get a list of all the customers we have not contacted in a long time.
I thought about something like this:
SELECT
ADDRH.ID,
ADDRH.COMPANY1,
TOPCONT.ID,
TOPCONT.DATECREATE,
TOPCONT.NOTES0
FROM dbo.ADDRESSES ADDRH
OUTER APPLY (SELECT TOP 1 ID, SUPERID, DATECREATE, CREATEDBY, NOTES0 FROM DBO.CONTACTS CONT WHERE ADDRH.ID = CONT.SUPERID ORDER BY DATECREATE DESC) TOPCONT
WHERE
TOPCONT.ID IS NOT NULL
ORDER BY TOPCONT.DATECREATE
But this still ignores the fact that the same company appears multiple times in the addresses table. How can I create a list that has each company with its latest contact?
Thanks for your help
Greetings
Well, you have to remove duplicates from address as well. Because of the structure of your data, I think the best approach is to use row_number():
SELECT ac.*
-- the two ID columns need distinct aliases inside a derived table
FROM (SELECT a.ID AS ADDRESS_ID, a.COMPANY1, c.ID AS CONTACT_ID, c.DATECREATE, c.NOTES0,
ROW_NUMBER() OVER (PARTITION BY a.COMPANY1 ORDER BY c.DATECREATE DESC) as seqnum
FROM dbo.ADDRESSES a JOIN
DBO.CONTACTS c
ON a.ID = c.SUPERID
WHERE c.ID IS NOT NULL
) ac
WHERE seqnum = 1
ORDER BY ac.DATECREATE;
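Here is a runnable sketch of this per-company ROW_NUMBER() approach, using Python's built-in sqlite3 module (SQLite 3.25 or later) in place of SQL Server. Several things are assumptions for the illustration: dates are stored as ISO strings so that they sort correctly as text, the company name column is called COMPANY1 as in the query above, and a second company ('OtherCompany') is made up so that the final ordering is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ADDRESSES (ID INTEGER, COMPANY1 TEXT, MAINID INTEGER, LOCATION TEXT);
CREATE TABLE CONTACTS (ID INTEGER, SUPERID INTEGER, DATECREATE TEXT, NOTES0 TEXT);
INSERT INTO ADDRESSES VALUES
    (1, 'ExampleCompany', 0, 'Germany'),
    (2, 'ExampleCompany', 1, 'Sweden'),
    (3, 'ExampleCompany', 1, 'Germany'),
    (4, 'OtherCompany',   0, 'France');   -- hypothetical second company
INSERT INTO CONTACTS VALUES
    (1, 1, '2018-04-10', 'XY'),
    (2, 3, '2018-04-09', 'YX'),
    (3, 2, '2018-04-11', 'XX'),
    (4, 4, '2018-03-01', 'ZZ');           -- hypothetical contact
""")

query = """
SELECT ac.COMPANY1, ac.CONTACT_ID, ac.DATECREATE, ac.NOTES0
FROM (
    -- Rank all contacts of a company, across all of its addresses.
    SELECT a.ID AS ADDRESS_ID, a.COMPANY1,
           c.ID AS CONTACT_ID, c.DATECREATE, c.NOTES0,
           ROW_NUMBER() OVER (PARTITION BY a.COMPANY1
                              ORDER BY c.DATECREATE DESC) AS seqnum
    FROM ADDRESSES a
    JOIN CONTACTS c ON a.ID = c.SUPERID
) ac
WHERE ac.seqnum = 1
ORDER BY ac.DATECREATE
"""

latest_per_company = conn.execute(query).fetchall()
for row in latest_per_company:
    print(row)
```

Each company appears once, and the company that has gone longest without contact comes first.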

SQL Query: Using AND/OR in the WHERE clause

I'm currently working on a project where I have a list of dental patients, and I'm supposed to be displaying all the patients who have two specific procedure codes attached to their profiles. These patients must have BOTH procedure codes, not one or the other. At first I thought I could accomplish this by using a basic AND statement in my WHERE clause, like so.
SELECT [stuff]
FROM [this table]
WHERE ProcedureCode = 'ABC123' AND ProcedureCode = 'DEF456';
The query of course returns nothing because the codes are entered individually and you can't have both Procedure 1 and Procedure 2 simultaneously.
I tried switching the "AND" to "OR" just out of curiosity. Of course I'm now getting results for patients who only have one code or the other, but the patients who have both are showing up too, and the codes are displayed as separate results, like so:
Patient ID Last Name First Name Code Visit Date
1111111 Doe Jane ABC123 11-21-2015
5555555 Smith John ABC123 12-08-2015
5555555 Smith John DEF456 12-08-2015
My SQL is pretty rusty these days. I'm trying to think of a way to filter out patients like Jane Doe and only include patients like John Smith who have both procedure codes. Ideas?
ADDING INFO BASED ON CHRISTIAN'S ANSWER:
This is what the updated query looks like:
SELECT PatientID, LastName, FirstName, Code, VisitDate
FROM VisitInfo
WHERE PatientID IN
(
SELECT PatientID
FROM VisitInfo
WHERE Code = 'ABC123' OR Code = 'DEF456'
GROUP BY PatientID
HAVING COUNT(*) > 1
)
AND (Code = 'ABC123' OR Code = 'DEF456');
So I'm still getting results like the following where a patient is only showing one procedure code but possibly multiple instances of it:
Patient ID Last Name First Name Code Visit Date
1111111 Doe Jane ABC123 11-02-2015
1111111 Doe Jane ABC123 11-21-2015
5555555 Smith John ABC123 12-08-2015
5555555 Smith John DEF456 12-08-2015
5555555 Smith John ABC123 12-14-2015
9999999 Jones Mike DEF456 11-22-2015
9999999 Jones Mike DEF456 12-06-2015
Even though Jane Doe and Mike Jones have 2 results, they're both the same code, so we don't want to include them. And even though John Smith still has 2 of the same code, his results also include both codes, so we want to keep him and other patients like him.
ANOTHER UPDATE:
I just learned that I now need to include a few basic demographic details for the patients in question, so I've joined my VisitInfo table with a PatientInfo table. The updated query looks like this:
SELECT v.PatientID, v.LastName, v.FirstName, v.Code, v.VisitDate, p.DateOfBirth, p.Gender, p.PrimaryPhone
FROM VisitInfo v JOIN PatientInfo p ON v.PatientID = p.PatientID
WHERE v.PatientID IN
(
SELECT PatientID
FROM VisitInfo
WHERE Code = 'ABC123' OR Code = 'DEF456'
GROUP BY PatientID
HAVING COUNT(*) > 1
)
AND (Code = 'ABC123' OR Code = 'DEF456');
I wasn't sure if the new JOIN would affect anyone's answers...
SELECT *
FROM TABLE
WHERE Patient_ID IN
(
SELECT Patient_ID
FROM TABLE
WHERE Code = 'ABC123' OR Code = 'DEF456'
GROUP BY Patient_ID
HAVING COUNT(*) = 2
)
AND (Code = 'ABC123' OR Code = 'DEF456')
UPDATE 1:
As a patient can have the same procedure code more than once, this way will work better:
SELECT *
FROM TABLE T1
WHERE EXISTS (SELECT 1
FROM TABLE T2
WHERE T1.Patient_ID = T2.Patient_ID
AND T2.Code = 'ABC123')
AND EXISTS (SELECT 1
FROM TABLE T2
WHERE T1.Patient_ID = T2.Patient_ID
AND T2.Code = 'DEF456')
AND T1.Code IN ('ABC123','DEF456')
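The sketch below checks the EXISTS approach against the sample rows from the question, using Python's built-in sqlite3 module and the column names from the asker's VisitInfo query. For comparison it also runs a variant that is not part of the answer above: the grouped subquery from the first attempt with COUNT(*) replaced by COUNT(DISTINCT Code), so that repeated visits with the same code are counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VisitInfo (PatientID INTEGER, LastName TEXT, FirstName TEXT,
                        Code TEXT, VisitDate TEXT);
INSERT INTO VisitInfo VALUES
    (1111111, 'Doe',   'Jane', 'ABC123', '11-02-2015'),
    (1111111, 'Doe',   'Jane', 'ABC123', '11-21-2015'),
    (5555555, 'Smith', 'John', 'ABC123', '12-08-2015'),
    (5555555, 'Smith', 'John', 'DEF456', '12-08-2015'),
    (5555555, 'Smith', 'John', 'ABC123', '12-14-2015'),
    (9999999, 'Jones', 'Mike', 'DEF456', '11-22-2015'),
    (9999999, 'Jones', 'Mike', 'DEF456', '12-06-2015');
""")

# Approach from the answer: one EXISTS per required code.
exists_query = """
SELECT PatientID, LastName, FirstName, Code, VisitDate
FROM VisitInfo T1
WHERE EXISTS (SELECT 1 FROM VisitInfo T2
              WHERE T2.PatientID = T1.PatientID AND T2.Code = 'ABC123')
  AND EXISTS (SELECT 1 FROM VisitInfo T2
              WHERE T2.PatientID = T1.PatientID AND T2.Code = 'DEF456')
  AND T1.Code IN ('ABC123', 'DEF456')
ORDER BY PatientID, VisitDate, Code
"""

# Variant (an assumption of this sketch): count distinct codes per patient.
distinct_query = """
SELECT PatientID, LastName, FirstName, Code, VisitDate
FROM VisitInfo
WHERE Code IN ('ABC123', 'DEF456')
  AND PatientID IN (SELECT PatientID
                    FROM VisitInfo
                    WHERE Code IN ('ABC123', 'DEF456')
                    GROUP BY PatientID
                    HAVING COUNT(DISTINCT Code) = 2)
ORDER BY PatientID, VisitDate, Code
"""

with_exists = conn.execute(exists_query).fetchall()
with_distinct = conn.execute(distinct_query).fetchall()
for row in with_exists:
    print(row)
```

Both queries return only John Smith's three visits. Jane Doe and Mike Jones are dropped even though each of them has two rows.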
There's a bunch of ways to skin this particular cat - here's another one:
WITH ABC123 AS (SELECT DISTINCT PATIENTID
FROM VISITINFO
WHERE PROCEDURECODE = 'ABC123'),
DEF456 AS (SELECT DISTINCT PATIENTID
FROM VISITINFO
WHERE PROCEDURECODE = 'DEF456')
SELECT v.PATIENTID, v.LASTNAME, v.FIRSTNAME, v.PROCEDURECODE, v.VISITDATE,
p.DateOfBirth, p.Gender, p.PrimaryPhone
FROM VISITINFO v
INNER JOIN ABC123 a
ON a.PATIENTID = v.PATIENTID
INNER JOIN DEF456 d
ON d.PATIENTID = v.PATIENTID
INNER JOIN PatientInfo p
ON v.PatientID = p.PatientID
WHERE v.PROCEDURECODE IN ('ABC123', 'DEF456')
ORDER BY v.PATIENTID, v.VISITDATE, v.PROCEDURECODE;
What we're doing here is using a couple of CTEs to get the PATIENTIDs of the patients who have had each of the procedures in question performed. We then start with all records in VISITINFO and inner join those with the two CTEs. Because an INNER JOIN requires that matching rows exist on both sides of the join, this retains only the visits of patients who appear in both CTEs, based on the join criterion, which in this case is the PATIENTID.
Best of luck.
Select the records with the two codes, but use COUNT OVER to count the distinct codes per patient. Only keep those records with a count of 2 codes for the patient, i.e. patients with both codes. Note that not every DBMS allows DISTINCT in a window function (Oracle does, SQL Server does not).
select patient_id, last_name, first_name, code, visit_date
from
(
select mytable.*, count(distinct code) over (partition by patient_id) as cnt
from mytable
where code in ('ABC123','DEF456')
) data
where cnt = 2;

Finding the MIN value from the SUM of multiple row grouped by some id

I am using SQLplus connecting to oracle database 12c
I have two tables, Customer and Account, where customers can have multiple accounts (one customer id, but multiple account types). I am trying to get the information on the customer with the largest amount of debt. Suppose we have the following information:
Customer table:
BSB# Customer# Name Address
------------------------------
0123 123456 Adam ABC st
0234 234566 Dave CBC rd
0345 345667 Max DSE st
Account table:
BSB# Customer# Type Balance
---------------------------------
0123 123456 Saving -2300
0123 123456 Credit -500
0123 123456 iSaver 200
0234 234566 Saving 5000
0345 345667 Credit -1500
0345 345667 iSaver -200
The desired output:
Customer# Name Address
-------------------------
123456 Adam ABC St
I can print out the sum of everyone's account balance, but I am not sure how to print out just the information on the account with the lowest accumulated balance. I tried using MIN(SUM(A.Balance)) but I keep getting the error saying "not a single-group function". I wouldn't be surprised if I made a mistake somewhere.
I am relatively new to sql and this is what I have so far. Any advice or pointers would be nice...
SELECT C.Customer#, C.Name, Address, SUM(A.Balance)
FROM Customer C
RIGHT OUTER JOIN Account A
ON C.BSB# = A.BSB#
AND
C.Customer# = A.Customer#
GROUP BY C.Customer#, C.Name, Address;
Thanks!
The following will order the customers by the sum of the balances:
SELECT C.Customer#, C.Name, Address, SUM(A.Balance)
FROM Customer C JOIN
Account A
ON C.BSB# = A.BSB# AND
C.Customer# = A.Customer#
GROUP BY C.Customer#, C.Name, Address
ORDER BY SUM(A.Balance);
If you want only one row, the syntax depends on the database. Here are some methods:
In SQL Server, change the select to:
SELECT TOP 1 . . .
In MySQL, PostgreSQL or SQLite, add a limit clause to the end of the query:
LIMIT 1
In databases that support the standard syntax (including Oracle 12c and later), add a fetch clause to the end of the query:
FETCH FIRST 1 ROWS ONLY
And in any version of Oracle, you can do this with a subquery:
SELECT *
FROM (SELECT C.Customer#, C.Name, Address, SUM(A.Balance)
FROM Customer C JOIN
Account A
ON C.BSB# = A.BSB# AND
C.Customer# = A.Customer#
GROUP BY C.Customer#, C.Name, Address
ORDER BY SUM(A.Balance)
) t
WHERE rownum = 1;
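As a quick check of the "order by the sum, keep the first row" idea, here is a sketch on the question's sample data using Python's built-in sqlite3 module. It is SQLite, not Oracle, so it uses LIMIT 1 as the row-limiting clause, and the columns BSB# and Customer# are renamed to the hypothetical BSB and CustomerNo to avoid quoting the # character.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (BSB TEXT, CustomerNo INTEGER, Name TEXT, Address TEXT);
CREATE TABLE Account (BSB TEXT, CustomerNo INTEGER, Type TEXT, Balance INTEGER);
INSERT INTO Customer VALUES
    ('0123', 123456, 'Adam', 'ABC st'),
    ('0234', 234566, 'Dave', 'CBC rd'),
    ('0345', 345667, 'Max',  'DSE st');
INSERT INTO Account VALUES
    ('0123', 123456, 'Saving', -2300),
    ('0123', 123456, 'Credit',  -500),
    ('0123', 123456, 'iSaver',   200),
    ('0234', 234566, 'Saving',  5000),
    ('0345', 345667, 'Credit', -1500),
    ('0345', 345667, 'iSaver',  -200);
""")

query = """
SELECT C.CustomerNo, C.Name, C.Address, SUM(A.Balance) AS TotalBalance
FROM Customer C
JOIN Account A
  ON C.BSB = A.BSB
 AND C.CustomerNo = A.CustomerNo
GROUP BY C.CustomerNo, C.Name, C.Address
ORDER BY TotalBalance   -- most negative total (largest debt) first
LIMIT 1                 -- SQLite spelling of "first row only"
"""

most_in_debt = conn.execute(query).fetchone()
print(most_in_debt)
```

Adam's accounts add up to -2600, the lowest total, so his row is the one returned. If two customers tied for the lowest total, this form would return only one of them.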
You can use the following statement:
with MinAccounts as
(
SELECT BSB#, Customer#, Balance
FROM (
SELECT BSB#, t.Customer# , sum(t.Balance) Balance, min(sum(t.Balance)) over () minBalance
FROM Account t
GROUP BY BSB#, Customer#
) d
WHERE Balance = minBalance)
SELECT *
FROM MinAccounts a
JOIN Customer c ON a.BSB# = c.BSB# AND a.Customer# = c.Customer#;