TSQL Distinct does not work - sql

I have the following query and I try to keep a unique value for each GLOBAL_CONTENT_ID using the DISTINCT keyword. Unfortunately I cannot make it work.
SELECT DISTINCT
CD.GLOBAL_CONTENT_ID, CD.DOWNLOAD_ID, PA.PHYSICAL_ASSET_ID
FROM
[CONTENT_DOWNLOAD] CD
INNER JOIN
PHYSICAL_ASSET AS PA ON CD.GLOBAL_CONTENT_ID = PA.GLOBAL_CONTENT_ID
WHERE
CD.UPC = '00600753515501'
ORDER BY
CD.GLOBAL_CONTENT_ID
Any idea?
Thanks

The DISTINCT keyword will ensure that no duplicate records appear in your result set. However, it makes no guarantee that a given column cannot have duplicate values across multiple records, if the combination of values in those records be distinct.
One option to get the distinct GLOBAL_CONTENT_ID values would be to use the following query:
SELECT DISTINCT CD.GLOBAL_CONTENT_ID
FROM [CONTENT_DOWNLOAD] CD
INNER JOIN PHYSICAL_ASSET AS PA ON CD.GLOBAL_CONTENT_ID = PA.GLOBAL_CONTENT_ID
WHERE CD.UPC = '00600753515501'
ORDER BY CD.GLOBAL_CONTENT_ID

DISTINCT works on every column in the SELECT clause, not just a single column. If one of those columns has a different value, then the row is considered different and returned as another row. In your query, you are including 'PHYSICAL_ASSET_ID' which has a different value for each row which is why you are getting multiple rows.

Related

Prevent duplicate record when inner join query in SQL

I used the inner join command to get the data from two tables.
But, when I run the SQL query.
I got the same record duplicated 48 times.
The SQL query I created is below
SELECT
ABS_LIMIT.B1_NAME, ABS_LIMIT.B2_NAME, ABS_LIMIT.B3_NAME, ABS_LIMIT.ELEM_NAME
FROM
ABS_LIMIT
INNER JOIN
RTU_SCAN ON RTU+SCAN.B1_NAME = ABS_LIMIT.B1_NAME
WHERE
ABS_LIMIT.B3_NAME LIKE 'AMP%';
Does anyone have any idea how to remove the duplicate from the query result?
You never SELECT any columns from RTU_SCAN so you can use EXISTS rather than an INNER JOIN:
SELECT a.B1_NAME,
a.B2_NAME,
a.B3_NAME,
a.ELEM_NAME
FROM ABS_LIMIT a
WHERE EXISTS (SELECT 1 FROM RTU_SCAN r WHERE r.B1_NAME = a.B1_NAME)
AND a.B3_NAME LIKE 'AMP%';
Then, if there are duplicates in RTU_SCAN they will not propagate duplicate rows in the output.
Alternatively, you could use DISTINCT to remove duplicates:
SELECT DISTINCT
a.B1_NAME,
a.B2_NAME,
a.B3_NAME,
a.ELEM_NAME
FROM ABS_LIMIT a
INNER JOIN RTU_SCAN r
ON r.B1_NAME = a.B1_NAME
AND a.B3_NAME LIKE 'AMP%';
However, it will probably be less efficient to generate duplicates and then filter them out using DISTINCT compared to using EXISTS and not generating the duplicates in the first place.

Why do I get a duplicate column name error only when I SELECT FROM (SELECT)

I imagine this is a really basic oversight on my part but I have an SQL query which works fine. But I when I SELECT from that result (SELECT FROM (SELECT))
I get a 'duplicate column' error. There are duplicate column names, for sure, in two tables where I compare them but they do not cause a problem in the initial result. For example:
SELECT _dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id
Works fine but when I try to select from it, I get the error:
SELECT DISTINCT tag FROM
(SELECT _dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id) a
Regardless of the DISTINCT. Ok, I can change the column names to be unique but the question really is - why do i get the error when I SELECT FROM (SELECT) and not in the initial query?
Thanks
Solution:
SELECT DISTINCT tag_id, tag FROM (SELECT _dia_tagsrel.tag_id, _dia_tagsrel.article_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id) a
I only needed to SELECT one of the duplicate columns, even though I was comparing the both of them. Provided by answer below.
In you are second query i.e., the sub query, you are selecting tag_id twice. Though it is from two different tables, it works out whey you are selecting the data. But when you select the columns with same name twice, it provides you duplicate error. Below is the way you have selected the column which is incorrect
_dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
While using sub queries, merge, in or exists clause, avoid using the same column names multiple times.
Simple join works out no need of having subquery,
SELECT _dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id
Your first query returns four columns:
tag_id
article_id
tag_id
tag
Duplicate column names are allowed in a result set, but are not allowed in a table -- or derived table, view, CTE, or most subqueries (an exception are EXISTS subqueries).
I hope you can see the duplicate. There is no need to select tag_id twice, because the JOIN requires that the values are the same. So just select three columns:
SELECT tr.tag_id, tr.article_id, t.tag
FROM _dia_tagsrel tr JOIN
_dia_tags t
ON tr.tag_id = t.tag_id
Your subquery has two tag_ids, so how database engine decide which one you want to use.
So, either use one (join requires tag_ids to be same) or re-name it :
If _dia_tag has unique tags then you can use EXISTS instead of INNER JOIN:
SELECT t.tag
FROM _dia_tags t
WHERE EXISTS (SELECT 1 FROM _dia_tagsrel tr WHERE tr.tag_id = t.tag_id);

Avoid repeated information when having multiple joins?

I have the following query that uses joins to join multiple tables
select DISTINCT
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblTypes.Article_Type_Name,
tblimages.image_path as "Extra images"
from tblArticles inner join tblWriters
on tblArticles.Writer_ID_Fkey = tblWriters.Writer_ID inner join
tblArticleType on tblArticles.Article_ID = tblArticleType.Article_ID_Fkey inner join
tblTypes on tblArticleType.Article_Type_ID_Fkey = tblTypes.Article_Type_ID left outer join tblExtraImages
on tblArticles.Article_ID = tblExtraImages.Article_ID_Fkey left outer join tblimages
on tblExtraImages.image_id_fkey = tblimages.image_id
order by tblArticles.Article_Sequence, tblArticles.Article_Date_Created;
And I get the following results:
If an article has more than one type_name then I will get repeated columns for the rest of the records. Is there another way of joining these tables that would prevent that from happening?
The simplest method is to just remove column Article_Type_Name from the select clause. This allows SELECT DISTINCT to identify the rows as duplicates, and eliminate them.
Another option is to use an aggregation function on the column. In recent SQL Server versions, STRING_AGG() comes handy (you can also use MIN() or MAX()):
select
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
string_agg(tblTypes.Article_Type_Name, ',')
within group(order by tblTypes.Article_Type_Name) Article_Type_Name_List,
tblimages.image_path as Extra_Images
from ..
group by
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblimages.image_path
What you're seeing here is a Cartesian product; you've joined Tables in such a way that multiple rows from one side match with rows from the other
If you don't care about the article_type, then group the other columns and take the max(article_type), or omit it in a subquery that selects distinct records, not including the article type column, from the table that contains article type). If your SQLS is recent enough and you want to know all the article types you could STRING_AGG them into a csv list
Ultimately what you choose to do depends on what you want them for; filter the rows out, or group them down

How To Get Count of the records in sql

I have 2 tables in my database one tblNews and another tblNewsComments
I want to select 10 records from tblNewsComments than have must Comments of news
I used this query but it give an error
SELECT tblNews.id,
tblNews.newsTitle,
tblNews.createdate,
tblNews.viewcount,
COUNT(tblNewsComments.id) AS comcounts
FROM tblNews
INNER JOIN tblNewsComments ON tblNews.id = tblNewsComments.newsID
GROUP BY tblNews.id
Try to replace
GROUP BY tblNews.id
With
GROUP BY tblNews.id,
tblNews.newsTitle,
tblNews.createdate,
tblNews.viewcount
All the expressions in the SELECT list should be in the GROUP BY or inside an aggregate function.
I've always found this to be an annoyance in SQL. There's nothing logically wrong with your query; you're grouping by news item and selecting various attributes of the news item, and then selecting the count of comments linked to the news item. That makes sense.
The error arises because the SQL engine isn't smart enough to realize that all the columns in tblNews are at the same data context, and that grouping by tblNews.id effectively guarantees that there will only be one newsTitle, createdate, and viewcount for each group. It should be able to realize that, I think, and carry out the query. But it doesn't do that; the only column it considers to be unique in the group data context is the exact column that you grouped by, id.
One solution, as Multisync just posted, is to group by ALL the columns you want to include in the select clause. I don't think this is the best solution, however, as you shouldn't have to specify all those columns in the group by clause, and that would force you to keep adding to that list whenever you want to add a new TblNews column to the select clause.
The solution I've always used is to wrap the column in an ineffectual aggregate function in the select clause; I always use max():
select
tblNews.id,
max(tblNews.newsTitle),
max(tblNews.createdate),
max(tblNews.viewcount),
count(tblNewsComments.id) comcounts
from
tblNews
inner join tblNewsComments on tblNews.id=tblNewsComments.newsID
group by
tblNews.id
;
Or with subquery:
SELECT n.id,
n.newsTitle,
n.createdate,
n.viewcount,
(SELECT COUNT(*) FROM tblNewsComments c ON n.id = c.newsID) AS comcounts
FROM tblNews n
you have to select one column and group by another...other column will not work as they are not in the aggregate function.
SELECT tblNews.id, COUNT(tblNewsComments.newsID) AS comcounts
FROM tblNews
INNER JOIN tblNewsComments ON tblNews.id = tblNewsComments.newsID
GROUP BY tblNews.id
Read Here

Use two DISTINCT statements in SQL

I have combined two different tables together, one side is named DynDom and the other is CATH. I am trying to remove duplicates from that table such as below:
However, if i select distinct Dyndom pdbcode from the table, it returns distinct values of that pdbcode.
and
Based on the pictures above, I commented out the DynDom/CATH columns in the table and ran the query separately for DynDom/CATH and it returned those values accordingly, which is what i need and i was wondering if it's possible for me to use 2 distinct statements to return distinct values of the entire table based on the pdbcode.
Here's my code :
select DISTINCT
cath_dyndom_table_2."DYNDOM_DOMAINID",
cath_dyndom_table_2."DYNDOM_DSTART",
cath_dyndom_table_2."DYNDOM_DEND",
cath_dyndom_table_2."DYNDOM_CONFORMERID",
cath_dyndom_table_2.pdbcode,
cath_dyndom_table_2."DYNDOM_ChainID",
cath_dyndom_table_2.cath_pdbcode,
cath_dyndom_table_2."CATH_BEGIN",
cath_dyndom_table_2."CATH_END"
from
cath_dyndom_table_2
where
pdbcode = '2hun'
order by
cath_dyndom_table_2."DYNDOM_DOMAINID",
cath_dyndom_table_2."DYNDOM_DSTART",
cath_dyndom_table_2."DYNDOM_DEND",
cath_dyndom_table_2.pdbcode,
cath_dyndom_table_2.cath_pdbcode,
cath_dyndom_table_2."CATH_BEGIN",
cath_dyndom_table_2."CATH_END";
In the end, i would like to search domains from DynDom and CATH, based on the pdbcode and return the rows without having duplicate values.
Thank you.
UPDATE :
This is my VIEW table that i have done.
CREATE VIEW cath_dyndom_table AS
SELECT
r.domainid AS "DYNDOM_DOMAINID",
r.DomainStart AS "DYNDOM_DSTART",
r.Domain_End AS "DYNDOM_DEND",
r.ddid AS "DYN_DDID",
r.confid AS "DYNDOM_CONFORMERID",
r.pdbcode,
r.chainid AS "DYNDOM_ChainID",
d.cath_pdbcode,
d.cathbegin AS "CATH_BEGIN",
d.cathend AS "CATH_END"
FROM dyndom_domain_table r
FULL OUTER JOIN cath_domains d ON d.cath_pdbcode::character(4) = r.pdbcode
ORDER BY confid ASC;
What you are getting is the cartesian product of the ´two tables`.
In order to get one line without duplicates you need to have to have a 1-to-1 relation between both tables.
You can see HERE what are cartesian joins and HERE how to avoid them!
It sounds as though you want a UNION of domain name and ranges from each table - this can be achieved like so:
SELECT DYNDOM_DOMAINID, DYNDOM_DSTART, DYNDOM_DEND
FROM DynDom
UNION
SELECT RTRIM(cath_pdbcode), CATH_BEGIN, CATH_END
FROM CATH
This should eliminate exact duplicates (ie. where the domain name, start and end are all identical) but will not eliminate duplicate domain names with different ranges - if these exist you will need to decide how to handle them (retain them as separate entries, combine them with lowest start and highest end, or whatever other option is preferred).
EDIT: Actually, I believe you can get the desired results simply by changing the JOIN ON condition in your view to be:
FULL OUTER JOIN cath_domains d
ON d.cath_pdbcode::character(5) = r.pdbcode || r.chainid AND
r.DomainStart <= d.cathbegin AND
r.Domain_End >= d.cathend