Distinct Households Comparison Using Join - sql

I am trying to compare two lists of unique Household IDs using the Distinct clause. The problem comes when I try to pull in a third column consisting of timestamps into the results.
When I include only the two Household ID columns in the Select statement, the results seem to make sense. I get back two lists of unique IDs.
Here is that query:
select distinct e.household_id, a.hhid
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
However, when I just add the "e.imp_ts" column to the Select statement, it looks like SQL completely disregards the Distinct part of the query and pulls in all the duplicate households in the files.
select distinct e.household_id, a.hhid, e.imp_ts
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
Can someone please explain why the query doesn't work when I simply add a third column to the Select statement?
Thank you!

It is not that the second query "doesn't work"; it is being asked to provide different results than the first query. DISTINCT applies to the whole row, that is, to the combination of every column in the SELECT list, not just to household_id. As others in the comments have pointed out, because the imp_ts column contains more granular data, the DISTINCT can no longer return a unique list of household IDs. For example, household ID 12345 may have 5 records, each with a different timestamp, and all 5 rows are distinct once the timestamp is included.
In order to resolve this, you have some choices:
1. Remove imp_ts from the query.
2. Return the minimum (most likely first) timestamp.
3. Return the maximum (most likely last) timestamp.
For #2 and #3 above, you can use MIN() or MAX() with a GROUP BY to achieve those results. Here is an example of using MIN():
select e.household_id, a.hhid, MIN(e.imp_ts) AS min_imp_ts
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
group by e.household_id, a.hhid
I would suggest looking up GROUP BY examples online to get a better idea of what is happening.
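To see the difference on a small scale, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The table and column names follow the question, but the sample rows are invented for illustration:

```python
import sqlite3

# In-memory database with invented sample rows (not the asker's real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exposure (household_id INTEGER, imp_ts TEXT);
    CREATE TABLE audience (hhid INTEGER);
    INSERT INTO exposure VALUES
        (12345, '2020-01-01 10:00:00'),
        (12345, '2020-01-02 11:30:00'),
        (12345, '2020-01-03 09:15:00'),
        (67890, '2020-01-01 08:00:00');
    INSERT INTO audience VALUES (12345);
""")

JOIN = """
    FROM exposure e
    LEFT OUTER JOIN audience a ON e.household_id = a.hhid
"""

# DISTINCT over the two ID columns: one row per household pair.
ids_only = conn.execute(
    "SELECT DISTINCT e.household_id, a.hhid" + JOIN
).fetchall()

# DISTINCT over IDs plus timestamp: every (id, id, timestamp) combination is unique.
with_ts = conn.execute(
    "SELECT DISTINCT e.household_id, a.hhid, e.imp_ts" + JOIN
).fetchall()

# GROUP BY with MIN(): back to one row per household, carrying the earliest timestamp.
first_seen = conn.execute(
    "SELECT e.household_id, a.hhid, MIN(e.imp_ts) AS min_imp_ts" + JOIN +
    " GROUP BY e.household_id, a.hhid ORDER BY e.household_id"
).fetchall()

print(len(ids_only), len(with_ts), len(first_seen))
```

With this sample data the DISTINCT over three columns returns every exposure row, while the GROUP BY version returns one row per household again.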

Related

Avoid repeated information when having multiple joins?

I have the following query that uses joins to join multiple tables
select DISTINCT
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblTypes.Article_Type_Name,
tblimages.image_path as "Extra images"
from tblArticles inner join tblWriters
on tblArticles.Writer_ID_Fkey = tblWriters.Writer_ID inner join
tblArticleType on tblArticles.Article_ID = tblArticleType.Article_ID_Fkey inner join
tblTypes on tblArticleType.Article_Type_ID_Fkey = tblTypes.Article_Type_ID left outer join tblExtraImages
on tblArticles.Article_ID = tblExtraImages.Article_ID_Fkey left outer join tblimages
on tblExtraImages.image_id_fkey = tblimages.image_id
order by tblArticles.Article_Sequence, tblArticles.Article_Date_Created;
And I get the following results:
If an article has more than one type_name then I will get repeated columns for the rest of the records. Is there another way of joining these tables that would prevent that from happening?
The simplest method is to just remove column Article_Type_Name from the select clause. This allows SELECT DISTINCT to identify the rows as duplicates, and eliminate them.
Another option is to use an aggregate function on the column. In recent SQL Server versions (2017 and later), STRING_AGG() comes in handy (you can also use MIN() or MAX()):
select
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
string_agg(tblTypes.Article_Type_Name, ',')
within group(order by tblTypes.Article_Type_Name) Article_Type_Name_List,
tblimages.image_path as Extra_Images
from ..
group by
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblimages.image_path
What you're seeing here is a Cartesian product: you've joined the tables in such a way that multiple rows from one side match rows from the other.
If you don't care about the article type, either group by the other columns and take MAX(Article_Type_Name), or leave the type out entirely by selecting distinct records in a subquery that does not include the article type column. If your SQL Server version is recent enough and you want to know all the article types, you could STRING_AGG them into a comma-separated list.
Ultimately, what you choose to do depends on what you want the rows for: filter them out, or group them down.
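As a rough illustration of the "group them down" option, here is a small sketch in Python with sqlite3. SQLite has no STRING_AGG() with WITHIN GROUP, so GROUP_CONCAT() is used as a stand-in, and the simplified tables (articles, article_types) and their rows are hypothetical, not the asker's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema: one article can have several types.
    CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE article_types (article_id INTEGER, type_name TEXT);
    INSERT INTO articles VALUES (1, 'Intro to Joins'), (2, 'Indexing Basics');
    INSERT INTO article_types VALUES (1, 'Tutorial'), (1, 'SQL'), (2, 'SQL');
""")

# Plain join: article 1 appears once per type, so its title is repeated.
joined = conn.execute("""
    SELECT a.title, t.type_name
    FROM articles a
    JOIN article_types t ON t.article_id = a.article_id
""").fetchall()

# Grouped: one row per article, types collapsed into a comma-separated list.
# GROUP_CONCAT does not guarantee the order of the concatenated values.
grouped = conn.execute("""
    SELECT a.title, GROUP_CONCAT(t.type_name, ',') AS type_list
    FROM articles a
    JOIN article_types t ON t.article_id = a.article_id
    GROUP BY a.article_id, a.title
    ORDER BY a.article_id
""").fetchall()

print(joined)
print(grouped)
```

On SQL Server the equivalent of the grouped query would use STRING_AGG(), as in the answer above.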

How to Count records in query with association table in SQL Server?

I have three tables
PackingLists
ItemsToPackingLists
Items
I would like to have a list of all PackingLists with the Number of items per PackingList and the WeightInGramms for the PackingList.
I wrote the following query, but it gives wrong results. I guess I have to arrange the joins differently somehow.
Any help how to refactor the query is appreciated.
SELECT p.ID,
p.NameOfPackingList,
COUNT(ItemsToP.ItemID) AS NumberOfDifferentItems,
SUM(items.WeightInGrams * ItemsToP.Quantity) AS WeightInGramms
FROM PackingLists AS p
LEFT OUTER JOIN ItemsToPackingLists AS ItemsToP
ON (ItemsToP.PackingListID = p.ID)
LEFT OUTER JOIN Items AS items
ON (ItemsToP.ItemID = items.ID)
GROUP BY p.ID,p.NameOfPackingList
It is not really clear what you want to get, but here are two options to check.
Use COUNT(DISTINCT ItemsToP.ItemID) instead of COUNT(ItemsToP.ItemID). You might be including the same item twice in one packing list (with different quantities), and the column name 'NumberOfDifferentItems' suggests using DISTINCT as well.
However, your question asks for the 'Number of items per PackingList'. To my understanding, you should then sum the quantities with SUM(ItemsToP.Quantity) instead of counting the IDs.
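The three candidate measures can be compared side by side. The following is a sketch only, run against SQLite through Python's sqlite3 module, with invented sample rows in which one item is attached to the same packing list twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PackingLists (ID INTEGER PRIMARY KEY, NameOfPackingList TEXT);
    CREATE TABLE Items (ID INTEGER PRIMARY KEY, WeightInGrams INTEGER);
    CREATE TABLE ItemsToPackingLists (PackingListID INTEGER, ItemID INTEGER, Quantity INTEGER);
    -- Invented sample data: item 2 is listed twice on packing list 1.
    INSERT INTO PackingLists VALUES (1, 'Hiking'), (2, 'Empty');
    INSERT INTO Items VALUES (1, 2000), (2, 150);
    INSERT INTO ItemsToPackingLists VALUES (1, 1, 1), (1, 2, 2), (1, 2, 1);
""")

rows = conn.execute("""
    SELECT p.ID,
           p.NameOfPackingList,
           COUNT(ItemsToP.ItemID)          AS LinkRows,                -- counts association rows
           COUNT(DISTINCT ItemsToP.ItemID) AS NumberOfDifferentItems,  -- counts unique items
           SUM(ItemsToP.Quantity)          AS TotalQuantity,           -- counts physical pieces
           SUM(items.WeightInGrams * ItemsToP.Quantity) AS WeightInGramms
    FROM PackingLists AS p
    LEFT OUTER JOIN ItemsToPackingLists AS ItemsToP ON ItemsToP.PackingListID = p.ID
    LEFT OUTER JOIN Items AS items ON ItemsToP.ItemID = items.ID
    GROUP BY p.ID, p.NameOfPackingList
    ORDER BY p.ID
""").fetchall()

for row in rows:
    print(row)
```

Note that the SUM() columns come back as NULL (None in Python) for a packing list with no items; wrapping them in COALESCE(..., 0) is one way to show zero instead.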

Duplicate results on inner join

I've written the below query but I'm getting multiple duplicate rows in the results, please can anyone see where I'm going wrong?
use Customers
select customer_details.Customer_ID,
customer_details.customer_name,
metering_point_details.MPAN_ID,
Agents.DA_DC_Charge
from Customer_Details
left join Metering_Point_Details
on customer_details.customer_id = Metering_Point_Details.Customer_ID
left join agents
on customer_details.Customer_ID = agents.customer_id
order by customer_id
It doesn't really matter, but you're not using an INNER JOIN; both joins are LEFT JOINs. Regardless, the unexpected rows indicate that your JOIN criteria are not specific enough to return your expected output. If the rows are fully duplicated, you can use SELECT DISTINCT. To see why you're getting the duplicates, use SELECT * so you can inspect the full detail of the multiple rows returned by your JOIN criteria. That should either help you make the criteria more specific or show you that one of the tables in your JOIN contains duplicated records.
With sample data we can dissect the problem more, but odds are you won't need it once you see why the rows are duplicated.
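One common way such rows multiply, sketched here with invented data in SQLite via Python's sqlite3 module, is joining two independent child tables to the same parent: each metering point row is paired with each agent row for the customer. Whether this is what happens in the asker's database is an assumption; counting the rows per child table is a quick way to check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_details (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE metering_point_details (customer_id INTEGER, mpan_id TEXT);
    CREATE TABLE agents (customer_id INTEGER, da_dc_charge REAL);
    -- Invented sample data: one customer, 2 metering points, 3 agent rows.
    INSERT INTO customer_details VALUES (1, 'Acme Ltd');
    INSERT INTO metering_point_details VALUES (1, 'MPAN-A'), (1, 'MPAN-B');
    INSERT INTO agents VALUES (1, 10.0), (1, 12.5), (1, 15.0);
""")

# Both child tables join only on customer_id, so the result has 2 * 3 rows.
rows = conn.execute("""
    SELECT c.customer_id, m.mpan_id, a.da_dc_charge
    FROM customer_details c
    LEFT JOIN metering_point_details m ON m.customer_id = c.customer_id
    LEFT JOIN agents a ON a.customer_id = c.customer_id
""").fetchall()

# Diagnostic: how many rows does each child table contribute per customer?
per_table = conn.execute("""
    SELECT c.customer_id,
           (SELECT COUNT(*) FROM metering_point_details m
             WHERE m.customer_id = c.customer_id) AS meters,
           (SELECT COUNT(*) FROM agents a
             WHERE a.customer_id = c.customer_id) AS agent_rows
    FROM customer_details c
""").fetchall()

print(len(rows), per_table)
```

If both counts are greater than one for a customer, the join needs an extra condition tying agents to metering points, or one of the tables needs to be aggregated before joining.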

Select Statement with Distinct returning multiple rows and need only first result

I am having a challenge with my query returning multiple results.
SELECT DISTINCT gpph.id, gpph.cname, gc2a.assetfilename, gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
A record, 5679599, has 2 'Primary Images' and is returning 2 results for that id but I only need the first result back. Is there any way to do this IN the current query? Do I need to write multiple queries?
I need some direction on how to constrain the results to only 1 result for Primary Image. I have looked at a ton of similar questions, but most of them only need the guidance of adding 'distinct' to the beginning of the query rather than a constraint in the where clause.
Edit: This problem is created by a user inputting 2 Primary Images on one record in the database. My business requirements only state to take the first result.
Any help would be awesome!
Given that the choice of which row to return is arbitrary, we can just use an aggregate on the value. This then needs a GROUP BY clause, which eliminates the need for DISTINCT.
SELECT gpph.id, gpph.cname, MAX(gc2a.assetfilename) AS assetfilename, gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
GROUP BY gpph.id, gpph.cname, gpph.alternateURL
In this instance, MAX(gc2a.assetfilename) gives you the alphabetically highest value when there is more than one record. It is not the ideal choice; a timestamp that records the order of the rows would be more helpful, since the word 'first' would then have a well-defined meaning.
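To make the contrast concrete, here is a sketch in Python with sqlite3. It assumes a hypothetical created_at column on the asset table, which the question does not mention, and uses simplified stand-in tables (products, asset_refs) with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the two views; created_at is a hypothetical column.
    CREATE TABLE products (id INTEGER PRIMARY KEY, cname TEXT);
    CREATE TABLE asset_refs (id INTEGER, assetfilename TEXT, assettype TEXT, created_at TEXT);
    INSERT INTO products VALUES (5679599, 'Widget'), (5679600, 'Gadget');
    INSERT INTO asset_refs VALUES
        (5679599, 'alpha.jpg',  'Primary Image', '2019-01-01'),
        (5679599, 'zeta.jpg',   'Primary Image', '2019-06-01'),
        (5679600, 'gadget.jpg', 'Primary Image', '2019-03-01');
""")

# Aggregate approach: picks the alphabetically highest file name per product.
by_max = conn.execute("""
    SELECT p.id, p.cname, MAX(r.assetfilename) AS assetfilename
    FROM products p
    JOIN asset_refs r ON r.id = p.id
    WHERE r.assettype = 'Primary Image'
    GROUP BY p.id, p.cname
    ORDER BY p.id
""").fetchall()

# Timestamp approach: keeps the row whose created_at is the earliest for that product.
by_earliest = conn.execute("""
    SELECT p.id, p.cname, r.assetfilename
    FROM products p
    JOIN asset_refs r ON r.id = p.id
    WHERE r.assettype = 'Primary Image'
      AND r.created_at = (
          SELECT MIN(r2.created_at)
          FROM asset_refs r2
          WHERE r2.id = r.id AND r2.assettype = 'Primary Image'
      )
    ORDER BY p.id
""").fetchall()

print(by_max)
print(by_earliest)
```

The two queries agree when a product has a single primary image and can differ when it has several. If two rows share the same earliest timestamp, the second query returns both, so a tie-breaker would still be needed.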
Replace DISTINCT with GROUP BY, and aggregate the column that differs between the duplicate rows (assetfilename). An aggregate such as MAX() cannot appear in the WHERE clause, and aggregating gpph.id would not collapse the two rows, because they share the same id:
SELECT gpph.id, gpph.cname, MAX(gc2a.assetfilename) AS assetfilename, gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
GROUP BY gpph.id, gpph.cname, gpph.alternateURL

Use two DISTINCT statements in SQL

I have combined two different tables together, one side is named DynDom and the other is CATH. I am trying to remove duplicates from that table such as below:
However, if I select distinct DynDom pdbcode from the table, it returns distinct values of that pdbcode.
and
Based on the pictures above, I commented out the DynDom/CATH columns in the table and ran the query separately for DynDom/CATH, and it returned those values accordingly, which is what I need. I was wondering if it's possible for me to use 2 distinct statements to return distinct values of the entire table based on the pdbcode.
Here's my code :
select DISTINCT
cath_dyndom_table_2."DYNDOM_DOMAINID",
cath_dyndom_table_2."DYNDOM_DSTART",
cath_dyndom_table_2."DYNDOM_DEND",
cath_dyndom_table_2."DYNDOM_CONFORMERID",
cath_dyndom_table_2.pdbcode,
cath_dyndom_table_2."DYNDOM_ChainID",
cath_dyndom_table_2.cath_pdbcode,
cath_dyndom_table_2."CATH_BEGIN",
cath_dyndom_table_2."CATH_END"
from
cath_dyndom_table_2
where
pdbcode = '2hun'
order by
cath_dyndom_table_2."DYNDOM_DOMAINID",
cath_dyndom_table_2."DYNDOM_DSTART",
cath_dyndom_table_2."DYNDOM_DEND",
cath_dyndom_table_2.pdbcode,
cath_dyndom_table_2.cath_pdbcode,
cath_dyndom_table_2."CATH_BEGIN",
cath_dyndom_table_2."CATH_END";
In the end, I would like to search domains from DynDom and CATH based on the pdbcode, and return the rows without duplicate values.
Thank you.
UPDATE :
This is the VIEW that I have created.
CREATE VIEW cath_dyndom_table AS
SELECT
r.domainid AS "DYNDOM_DOMAINID",
r.DomainStart AS "DYNDOM_DSTART",
r.Domain_End AS "DYNDOM_DEND",
r.ddid AS "DYN_DDID",
r.confid AS "DYNDOM_CONFORMERID",
r.pdbcode,
r.chainid AS "DYNDOM_ChainID",
d.cath_pdbcode,
d.cathbegin AS "CATH_BEGIN",
d.cathend AS "CATH_END"
FROM dyndom_domain_table r
FULL OUTER JOIN cath_domains d ON d.cath_pdbcode::character(4) = r.pdbcode
ORDER BY confid ASC;
What you are getting is the Cartesian product of the two tables.
In order to get one line without duplicates, you need to have a 1-to-1 relation between both tables.
You can see HERE what are cartesian joins and HERE how to avoid them!
It sounds as though you want a UNION of domain name and ranges from each table - this can be achieved like so:
SELECT DYNDOM_DOMAINID, DYNDOM_DSTART, DYNDOM_DEND
FROM DynDom
UNION
SELECT RTRIM(cath_pdbcode), CATH_BEGIN, CATH_END
FROM CATH
This should eliminate exact duplicates (i.e. where the domain name, start and end are all identical) but will not eliminate duplicate domain names with different ranges. If these exist, you will need to decide how to handle them (retain them as separate entries, combine them with the lowest start and highest end, or whatever other option is preferred).
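A small sketch of that behaviour, using Python's sqlite3 module rather than PostgreSQL. The table names DynDom and CATH follow the query above, while the lowercase column names and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DynDom (domainid TEXT, dstart INTEGER, dend INTEGER);
    CREATE TABLE CATH (cath_pdbcode TEXT, cath_begin INTEGER, cath_end INTEGER);
    -- Invented rows: one range is identical in both tables, the others differ.
    INSERT INTO DynDom VALUES ('2hunA', 1, 100), ('2hunA', 101, 200);
    INSERT INTO CATH   VALUES ('2hunA', 1, 100), ('2hunA', 150, 250);
""")

# UNION removes rows that are identical in every column.
union_rows = sorted(conn.execute("""
    SELECT domainid, dstart, dend FROM DynDom
    UNION
    SELECT RTRIM(cath_pdbcode), cath_begin, cath_end FROM CATH
""").fetchall())

# UNION ALL keeps every row, including the exact duplicate.
union_all_rows = conn.execute("""
    SELECT domainid, dstart, dend FROM DynDom
    UNION ALL
    SELECT RTRIM(cath_pdbcode), cath_begin, cath_end FROM CATH
""").fetchall()

print(union_rows)
print(len(union_all_rows))
```

The shared (1, 100) range appears once in the UNION result, while the same domain name with different ranges is kept as separate rows.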
EDIT: Actually, I believe you can get the desired results simply by changing the JOIN ON condition in your view to be:
FULL OUTER JOIN cath_domains d
ON d.cath_pdbcode::character(5) = r.pdbcode || r.chainid AND
r.DomainStart <= d.cathbegin AND
r.Domain_End >= d.cathend
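The effect of the stricter join condition can be sketched as follows. This uses SQLite through Python's sqlite3 module, so it swaps in a plain INNER JOIN and omits the ::character(5) cast, which is PostgreSQL syntax. The sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dyndom_domain_table (pdbcode TEXT, chainid TEXT,
                                      DomainStart INTEGER, Domain_End INTEGER);
    CREATE TABLE cath_domains (cath_pdbcode TEXT, cathbegin INTEGER, cathend INTEGER);
    -- Invented rows: two DynDom domains and two CATH domains on chain A of 2hun.
    INSERT INTO dyndom_domain_table VALUES ('2hun', 'A', 1, 120), ('2hun', 'A', 121, 300);
    INSERT INTO cath_domains VALUES ('2hunA', 5, 110), ('2hunA', 130, 290);
""")

# Joining on the code alone pairs every DynDom row with every CATH row (2 * 2).
code_only = conn.execute("""
    SELECT r.DomainStart, r.Domain_End, d.cathbegin, d.cathend
    FROM dyndom_domain_table r
    JOIN cath_domains d ON d.cath_pdbcode = r.pdbcode || r.chainid
""").fetchall()

# Adding the range conditions keeps only CATH domains contained in the DynDom domain.
with_ranges = sorted(conn.execute("""
    SELECT r.DomainStart, r.Domain_End, d.cathbegin, d.cathend
    FROM dyndom_domain_table r
    JOIN cath_domains d
      ON d.cath_pdbcode = r.pdbcode || r.chainid
     AND r.DomainStart <= d.cathbegin
     AND r.Domain_End >= d.cathend
""").fetchall())

print(len(code_only))
print(with_ranges)
```

With a FULL OUTER JOIN, as in the view, rows from either table that find no partner under the stricter condition would still appear, padded with NULLs on the other side.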