Duplicate results returned from query when distinct is used

On a current project I need to paginate results returned from SQL. I have hit a corner case: the query can accept identifiers as part of the where clause, which normally isn't an issue, but in one case a single identifier is passed in that has a one-to-many relationship with one of the tables the query joins on, so multiple rows come back in the results. That issue was fixed by introducing a distinct to the query. The following query returns the correct result of one row (all table/field names have been changed, of course):
select distinct [item_table].[item_id]
, row_number() over (order by [item_table].[pub_date] desc, [item_table].[item_id]) as [row_num]
from [item_table]
join [OneToOneRelationShip] on [OneToOneRelationShip].[other_id] = [item_table].[other_id]
left join [OneToNoneOrManyRelationship] on [OneToNoneOrManyRelationship].[item_id] = [item_table].[item_id]
where [item_table].[pub_item_web] = 1
and [item_table].[live_item] = 1
and [item_table].[item_id] in (1404309)
However, when I introduce pagination into the query, it returns multiple rows when it should only return one. The method I am using for pagination is as follows:
select [item_id]
from (
select distinct [item_table].[item_id]
, row_number() over (order by [item_table].[pub_date] desc, [item_table].[item_id]) as [row_num]
from [item_table]
join [OneToOneRelationShip] on [OneToOneRelationShip].[other_id] = [item_table].[other_id]
left join [OneToNoneOrManyRelationship] on [OneToNoneOrManyRelationship].[item_id] = [item_table].[item_id]
where [item_table].[pub_item_web] = 1
and [item_table].[live_item] = 1
and [item_table].[item_id] in (1404309)
) as [items]
where [items].[row_num] between 0 and 100
I worry that adding a distinct to the outer query will cause an incorrect number of results to be returned and I am unsure of how else to fix this issue. The database I am querying is MS SQL Server 2008.

About 5 minutes after posting the question a possible solution hit me: if I group by the item_id (and any sort criteria), of which there should only be one instance, that should solve the issue. After testing, this was the query I was left with:
select [item_id]
from (
select [item_table].[item_id]
, row_number() over (order by [item_table].[pub_date] desc, [item_table].[item_id]) as [row_num]
from [item_table]
join [OneToOneRelationShip] on [OneToOneRelationShip].[other_id] = [item_table].[other_id]
left join [OneToNoneOrManyRelationship] on [OneToNoneOrManyRelationship].[item_id] = [item_table].[item_id]
where [item_table].[pub_item_web] = 1
and [item_table].[live_item] = 1
and [item_table].[item_id] in (1404309)
group by [item_table].[item_id], [item_table].[pub_date]
) as [items]
where [items].[row_num] between 0 and 100

I don't see where the DISTINCT is adding any value in your first query. The results are [item_table].[item_id] and [row_num]. Because the value of [row_num] is already distinct, the combination of [item_table].[item_id] and [row_num] will be distinct. When adding the DISTINCT keyword to the query, no rows are excluded.
In the second query, your results will return [item_id] from the sub-query where [row_num] meets the criteria. If there were duplicate [item_id] values in the sub-query, there will be duplicates in the final results, but now you don't display [row_num] to distinguish the duplicates.
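To make the point above concrete, here is a minimal sketch that reproduces the behaviour in SQLite through Python's sqlite3 module. SQLite is only a stand-in for SQL Server 2008 here, and the table name one_to_many, the tag column, and the sample rows are made up for illustration. It assumes the bundled SQLite is recent enough to support window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_table (item_id INTEGER PRIMARY KEY, pub_date TEXT);
    CREATE TABLE one_to_many (item_id INTEGER, tag TEXT);  -- hypothetical child table
    INSERT INTO item_table VALUES (1404309, '2014-01-01');
    INSERT INTO one_to_many VALUES (1404309, 'a'), (1404309, 'b'), (1404309, 'c');
""")

# row_number() is assigned before DISTINCT is applied, so every joined row
# gets its own number and DISTINCT has nothing to collapse.
distinct_rows = conn.execute("""
    SELECT DISTINCT i.item_id,
           row_number() OVER (ORDER BY i.pub_date DESC, i.item_id) AS row_num
    FROM item_table i
    LEFT JOIN one_to_many m ON m.item_id = i.item_id
    WHERE i.item_id IN (1404309)
""").fetchall()

# GROUP BY collapses the joined rows first; row_number() then numbers the groups.
grouped_rows = conn.execute("""
    SELECT i.item_id,
           row_number() OVER (ORDER BY i.pub_date DESC, i.item_id) AS row_num
    FROM item_table i
    LEFT JOIN one_to_many m ON m.item_id = i.item_id
    WHERE i.item_id IN (1404309)
    GROUP BY i.item_id, i.pub_date
""").fetchall()

print(len(distinct_rows), grouped_rows)
```

With three child rows, the DISTINCT variant still yields three (item_id, row_num) pairs, while the grouped variant yields a single numbered row, which is why the group by version paginates as expected.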

Related

LAG() function in sql 2008

I have looked at a few other questions regarding this problem. We are trying to get a stored procedure working that contains the LAG() function, but the machine we are now trying to install an instance on runs SQL 2008, so we can't use it:
SELECT se.SetID,SetName,ParentSetId,
qu.QuestionID,qu.QuestionText,qu.QuestionTypeID,qu.IsPublished,qu.IsFilter,
qu.IsRequired,qu.QueCode,qu.IsDisplayInTable,
Case when (LAG(se.ParentSetId) OVER(ORDER BY se.ParentSetId) <> ParentSetId) then 2 else 1 end level ,
QuestionType
FROM tblSet se
LEFT join tblQuestion qu on qu.SetID=se.SetID
Inner join tblQuestionType qt on qt.QuestionTypeID=qu.QuestionTypeID and qt.IsAnswer=1
where CollectionId=#colID and se.IsDeleted=0
order by se.SetID
What I've tried so far (edited to reflect Zohar Peled's suggestion):
SELECT se.SetID,se.SetName,se.ParentSetId,
qu.QuestionID,qu.QuestionText,qu.QuestionTypeID,qu.IsPublished,qu.IsFilter,
qu.IsRequired,qu.QueCode,qu.IsDisplayInTable,
(case when row_number() over (partition by se.parentsetid
order by se.parentsetid
) = 1
then 1 else 2
end) as level,
QuestionType
FROM tblSet se
left join tblSet se2 on se.ParentSetId = se2.ParentSetId -1
LEFT join tblQuestion qu on qu.SetID=se.SetID
Inner join tblQuestionType qt on qt.QuestionTypeID=qu.QuestionTypeID and qt.IsAnswer=1
where se.CollectionId=#colID and se.IsDeleted=0
order by se.SetID
It does not seem to return all of the same records when I run the two side by side, and the level value also seems to be different.
I have put some of the output into an HTML-formatted table: the first results are from the version containing LAG(), and the second are from the new version, where the levels are not coming out the same:
https://jsfiddle.net/gyn8Lv3u/

LAG() can be implemented using a self-join as Jeroen wrote in his comment, or by using a correlated subquery. In this case, it's a simple lag() so the correlated subquery is also simple:
SELECT se.SetID,SetName,ParentSetId,
qu.QuestionID,qu.QuestionText,qu.QuestionTypeID,qu.IsPublished,qu.IsFilter,
qu.IsRequired,qu.QueCode,qu.IsDisplayInTable,
Case when (
(
SELECT TOP 1 ParentSetId
FROM tblSet seInner
WHERE seInner.ParentSetId < se.ParentSetId
ORDER BY seInner.ParentSetId DESC
)
<> ParentSetId) then 2 else 1 end level ,
QuestionType
FROM tblSet se
LEFT join tblQuestion qu on qu.SetID=se.SetID
Inner join tblQuestionType qt on qt.QuestionTypeID=qu.QuestionTypeID and qt.IsAnswer=1
where CollectionId=#colID and se.IsDeleted=0
order by se.SetID
If you had specified an offset, it would be harder to implement using a correlated subquery, and a self-join would make a much easier solution.
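For the self-join route mentioned above, one possible shape is to number the rows with row_number() and join each row to the one numbered just before it. The sketch below runs in SQLite via Python's sqlite3 purely as a stand-in (it has lag(), so the two versions can be compared side by side); the trimmed-down tblSet and its sample rows are made up, and SetID is added to the ordering as an assumed tie-breaker so that the "previous" row is well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical version of tblSet with only the columns needed here.
    CREATE TABLE tblSet (SetID INTEGER PRIMARY KEY, ParentSetId INTEGER);
    INSERT INTO tblSet VALUES (1, 0), (2, 0), (3, 1), (4, 1), (5, 2);
""")

# Reference version using lag(), not available on SQL Server 2008.
with_lag = conn.execute("""
    SELECT SetID,
           CASE WHEN lag(ParentSetId) OVER (ORDER BY ParentSetId, SetID) <> ParentSetId
                THEN 2 ELSE 1 END AS level
    FROM tblSet
    ORDER BY SetID
""").fetchall()

# Emulation: number the rows, then self-join each row to row number rn - 1.
with_self_join = conn.execute("""
    WITH numbered AS (
        SELECT SetID, ParentSetId,
               row_number() OVER (ORDER BY ParentSetId, SetID) AS rn
        FROM tblSet
    )
    SELECT cur.SetID,
           CASE WHEN prev.ParentSetId <> cur.ParentSetId THEN 2 ELSE 1 END AS level
    FROM numbered cur
    LEFT JOIN numbered prev ON prev.rn = cur.rn - 1
    ORDER BY cur.SetID
""").fetchall()

print(with_lag == with_self_join, with_self_join)
```

A larger offset would only change the join condition (for example cur.rn - 2), which is what makes the self-join easier to generalise than the correlated subquery.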

Sample data and desired results would help. This construct:
(case when (LAG(se.ParentSetId) OVER(ORDER BY se.ParentSetId) <> ParentSetId) then 2 else 1
end) as level
is quite strange. You are lagging by the only column used in the order by. That makes sense. But then you are comparing the value to the same column, implying that there are duplicates.
If you have duplicates, then order by se.ParentSetId is unstable. That is, the "previous" row is indeterminate because of the duplicate values being ordered. You can run the query twice and get different results.
I am guessing you want one row with the value 1 for each parent set id. If so, then in either database, you would use:
(case when row_number() over (partition by se.parentsetid
order by se.parentsetid
) = 1
then 1 else 2
end) as level
This also has the problem with an unstable ordering. You can fix this by changing the order by to what you really want.

Convert subselect to a join

I understand that a join is generally preferred to a sub-select.
I'm unable to see how to turn the 3 sub-selects into joins.
My sub-selects fetch the first row only.
I'm perfectly willing to leave this alone if it is not offensive SQL.
This is my query, and yes, those really are the table and column names:
select x1.*, x2.KTNR, x3.J6NQ
from
(select D0HONB as HONB, D0HHNB as HHNB,
(
select DHHHNB
from ECDHREP
where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
order by DHEJDT desc
FETCH FIRST 1 ROW ONLY
) as STC_HHNB,
(
select FIQ9NB
from DCFIREP
where FIQ7NB = D0Q7NB
AND FIBAEQ = D0ATEQ
and FISQCD = D0KNCD
and FIGZSZ in ('POS', 'ACT', 'MAN', 'HLD')
order by FIYCNB desc
FETCH FIRST 1 ROW ONLY
) as BL_Q9NB,
(
select AAKPNR
from C1AACPP
where AACEEQ = D0ATEQ and AARCCE = D0KNCD and AARDCE = D0KOCD
order by AAHMDT desc, AANENO desc
FETCH FIRST 1 ROW ONLY
) as NULL_KPNR
from ECD0REP
) as x1
left outer join (
select AAKPNR as null_kpnr, max(ABKTNR) as KTNR
from C1AACPP
left outer join C1ABCPP on AAKPNR = ABKPNR
group by AAKPNR
) as X2 on x1.NULL_KPNR = x2.null_KPNR
left outer join (
select ACKPNR as KPNR, count(*) as J6NQ
from C1ACCPP
WHERE ACJNDD = 'Y'
group by ACKPNR
) as X3 on x1.NULL_KPNR = x3.KPNR

You've got a combination of correlated subselects and nested table expressions (NTE).
Personally, I'd call it offensive if I had to maintain it. ;)
Consider common table expressions and joins. Without your data and table structure, I can't give you the real statement, but the general form would look like:
with
STC_HHNB as (
select DHHHNB, DHAOEQ, DHJRCD, DHEJDT
from ECDHREP )
, BL_Q9NB as ( <....>
where FIGZSZ in ('POS', 'ACT', 'MAN', 'HLD'))
<...>
select <...>
from STC_HHNB
join BL_Q9NB on <...>
Two important reasons to favor a CTE over an NTE: the results of a CTE can be reused, and it's easy to build a statement with CTEs incrementally.
By re-used, I mean you can have
with
cte1 as (<...>)
, cte2 as (select <...> from cte1 join <...>)
, cte3 as (select <...> from cte1 join <...>)
, cte4 as (select <...> from cte2 join cte3 on <...>)
select * from cte4;
The optimizer can choose to build a temporary result set for cte1 and use it multiple times. From a building standpoint, you can see I'm building on each preceding CTE.
Here's a good article
https://www.mcpressonline.com/programming/sql/simplify-sql-qwithq-common-table-expressions
Edit
Let's dig into your first correlated sub-query.
select D0HONB as HONB, D0HHNB as HHNB,
(
select DHHHNB
from ECDHREP
where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
order by DHEJDT desc
FETCH FIRST 1 ROW ONLY
) as STC_HHNB
from ECD0REP
What you're asking the DB to do is, for every row read in ECD0REP, go out and get a row from ECDHREP. If you're unlucky, the DB will have to read lots of records in ECDHREP to find that one row. Generally, consider that with a correlated sub-query the inner query may need to read every row. So if there are M rows in the outer and N rows in the inner, then you're looking at MxN rows being read.
I've seen this before, especially on the IBM i, as that's how an RPG developer would do it:
read ECD0REP;
dow not %eof(ECD0REP);
//need to get DHHHNB from ECDHREP
chain (D0ATEQ, D0KNCD) ECDHREP;
STC_HHNB = DHHHNB;
read ECD0REP;
enddo;
But that's not the way to do it in SQL. SQL is (supposed to be) set based.
So what you need to do is think of how to select the set of records out of ECDHREP that will match up to the set of records you want from ECD0REP.
with cte1 as (
select DHHHNB, DHAOEQ, DHJRCD
from ECDHREP
)
select D0HONB as HONB
, D0HHNB as HHNB
, DHHHNB as STC_HHNB
from ECD0REP join cte1
on DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
Now maybe that's not quite correct. Perhaps there are multiple rows in ECDHREP with the same values (DHAOEQ, DHJRCD); thus you needed the FETCH FIRST in your correlated sub-query. Fine, you can focus on the CTE and figure out what needs to be done to get that one row you want. Perhaps MAX(DHHHNB) or MIN(DHHHNB) would work. If nothing else, you could use ROW_NUMBER() to pick out just one row...
with cte1 as (
select DHHHNB, DHAOEQ, DHJRCD
, row_number() over(partition by DHAOEQ, DHJRCD
order by DHAOEQ, DHJRCD)
as rowNbr
from ECDHREP
), cte2 as (
select DHHHNB, DHAOEQ, DHJRCD
from cte1
where rowNbr = 1
)
select D0HONB as HONB
, D0HHNB as HHNB
, DHHHNB as STC_HHNB
from ECD0REP join cte2
on DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
Now you're dealing with sets of records, joining them together for your final results.
Worst case, the DB has to read M + N records.
It's not really about performance, it's about thinking in sets.
Sure, with a simple statement using a correlated sub-query, the optimizer will probably be able to rewrite it into a join.
But it's best to write the best code you can, rather than hope the optimizer can correct it.
I've seen and rewritten queries with hundreds of correlated and regular sub-queries. In fact, I've seen a query that had to be broken into two because there were too many sub-queries. The DB has a limit of 256 per statement.
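As a rough illustration of the set-based rewrite, the sketch below compares a correlated "latest row" sub-query with a CTE that ranks rows per key by date and keeps rank 1. It runs on SQLite via Python's sqlite3 as a stand-in for DB2 for i, so LIMIT 1 takes the place of FETCH FIRST 1 ROW ONLY, and the accounts and history tables, their columns, and the sample rows are all hypothetical. Note that ordering the row_number() by the date column descending is what preserves the "most recent" semantics of the original sub-query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the outer table and the history table.
    CREATE TABLE accounts (acct INTEGER PRIMARY KEY);
    CREATE TABLE history (acct INTEGER, eff_date TEXT, status TEXT);
    INSERT INTO accounts VALUES (1), (2), (3);
    INSERT INTO history VALUES
        (1, '2020-01-01', 'A'),
        (1, '2021-01-01', 'B'),
        (2, '2019-05-05', 'C');
""")

# Row-at-a-time style: one correlated lookup per outer row.
correlated = conn.execute("""
    SELECT a.acct,
           (SELECT h.status FROM history h
            WHERE h.acct = a.acct
            ORDER BY h.eff_date DESC
            LIMIT 1) AS latest_status
    FROM accounts a
    ORDER BY a.acct
""").fetchall()

# Set-based style: rank the history rows once, then join on rank 1.
# LEFT JOIN keeps accounts with no history, matching the NULL a sub-query returns.
set_based = conn.execute("""
    WITH ranked AS (
        SELECT acct, status,
               row_number() OVER (PARTITION BY acct ORDER BY eff_date DESC) AS rn
        FROM history
    )
    SELECT a.acct, r.status AS latest_status
    FROM accounts a
    LEFT JOIN ranked r ON r.acct = a.acct AND r.rn = 1
    ORDER BY a.acct
""").fetchall()

print(correlated == set_based, set_based)
```

Both forms return the same rows on this sample; which one performs better on a real system depends on indexes and the optimizer, so treat this as a comparison of shapes rather than a benchmark.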

I'm going to have to differ with Charles here if the FETCH FIRST 1 ROW ONLY clauses are necessary. In this case you likely can't pull those sub-selects out into a CTE because that CTE would only have a single row in it. I suspect you could pull the outer sub-select into a CTE, but you would still need the sub-selects in the CTE. Since there appears to be no sharing, I would call this personal preference. BTW, I don't think pulling the sub-selects into a join will work for you either, in this case, for the same reason.
What is the difference between a sub-select and a CTE?
with mycte as (
select field1, field2
from mytable
where somecondition = true)
select *
from mycte
vs.
select *
from (select field1, field2
from mytable
where somecondition = true) a
It's really just a personal preference, though depending on the specific requirements, a CTE can be used multiple times within the SQL statement, while a sub-select will be more correct in other cases, like the FETCH FIRST clause in your question.
EDIT
Let's look at the first sub-query. With the appropriate index:
(
select DHHHNB
from ECDHREP
where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD
order by DHEJDT desc
FETCH FIRST 1 ROW ONLY
) as STC_HHNB,
only has to read one record per row in the output set. I don't think that is terribly onerous. This is the same for the third correlated sub-query as well.
That index on the first correlated sub-query would be:
create index ECDHREP_X1
on ECDHREP (DHAOEQ, DHJRCD, DHEJDT);
The second correlated sub-query might need more than one read per row, just because of the IN predicate, but it is far from needing a full table scan.

Count query giving wrong column name error

select COUNT(analysed) from Results where analysed="True"
I want to display count of rows in which analysed value is true.
However, my query gives the error: "The multi-part identifier "Results.runId" could not be bound.".
This is the actual query:
select ((SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True')/failCount) as PercentAnalysed
from Runs
where Runs.runId=Analysed.runId
My table schema is:
The value I want for a particular runId is: (the number of entries where analysed=true)/failCount
EDIT : How to merge these two queries?
i) select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner from Runs,Product where Runs.prodId=Product.prodId
ii) select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I tried this :
select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner,counts.runId,(cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs,Product
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
where Runs.prodId=Product.prodId
but it gives an error.

Your problems are arising from improper joining of tables. You need information from both Runs and Results, but they aren't combined properly in your query. You have the right idea with a nested subquery, but it's in the wrong spot. You're also referencing the Analysed table in the outer where clause, but it hasn't been included in the from clause.
Try this instead:
select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I've set this up as an inner join to eliminate any runs which don't have analysed results; you can change it to a left join if you want those rows, but will need to add code to handle the null case. I've also added casts to the two numbers, because otherwise the query will perform integer division and truncate any fractional amounts.
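The two points above (integer division and the null case with a left join) can be checked with a small sketch. It uses SQLite through Python's sqlite3 as a stand-in for SQL Server, so CAST(... AS REAL) replaces decimal(10,4) and the count alias is renamed cnt; the sample rows are made up. COALESCE is one possible way to handle runs with no analysed results, not necessarily what the original poster needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Runs (runId INTEGER PRIMARY KEY, failCount INTEGER);
    CREATE TABLE Results (runId INTEGER, Analysed TEXT);
    INSERT INTO Runs VALUES (1, 4), (2, 5);
    INSERT INTO Results VALUES (1, 'True'), (1, 'True'), (1, 'True'), (1, 'False');
""")

rows = conn.execute("""
    SELECT r.runId,
           -- integer / integer truncates the fraction
           COALESCE(c.cnt, 0) / r.failCount AS truncated,
           -- casting one operand forces fractional division
           CAST(COALESCE(c.cnt, 0) AS REAL) / r.failCount AS percent_analysed
    FROM Runs r
    LEFT JOIN (
        SELECT runId, COUNT(*) AS cnt
        FROM Results
        WHERE Analysed = 'True'
        GROUP BY runId
    ) c ON c.runId = r.runId
    ORDER BY r.runId
""").fetchall()

print(rows)
```

Run 1 has three analysed results out of four failures, so the integer division gives 0 while the cast version gives 0.75; run 2 has no analysed rows and falls back to 0 through COALESCE.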

I'd try the following query:
SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True'
This will count all of your rows where Analysed is 'True'. This should work if the datatype of your Analysed column is either BIT (Boolean) or a string type (VARCHAR, NVARCHAR).

Use CASE in Count
SELECT COUNT(CASE WHEN analysed='True' THEN analysed END) [COUNT]
FROM Results

select COUNT(*) from Results where analysed='True'

Unexpected SQL Results - Need advice

The top query is just showing the results I am expecting before I join another table to perform some calculations in Crystal Reports.
select
DriveID, FromDateTime, accountid, LocationID, StatusID
from
DriveMaster
where
AccountID = '3813'
order by
FromDateTime desc;
Results:
[image: Query Results]
This second query is where I apply a couple of filters (specifying accountid, locationid, statusid, and which occur before a specific date).
select
dm.driveid, dm.fromdatetime, dm.accountid, dpact.ProcedureProjection,
dpact.ProceduresPerformed, dpact.ProductProjection, dpact.ProductsCollected,
dm.locationid
from
rpt_drivemaster dm
inner join
driveprojectionandcollectedtotals dpact on dm.driveid = dpact.driveid
where
dm.statusid = 2
and dm.accountid = '3813'
and dm.locationid = '4018'
and dm.fromdatetime < '20140602'
and dm.fromdatetime in (select top 3 dm2.fromdatetime
from rpt_drivemaster dm2
where dm2.accountid = '3813'
and dm2.statusid = 2
order by dm2.fromdatetime desc);
The only results I am getting, however, are:
[image: Query Results 2]
Based on the earlier query, I was expecting results for DriveIDs of:
1. 314933
2. 205250
3. 184779
Any suggestions on what I am missing here?

The problem is in the subquery passed to the IN: it returns three rows, with DriveID 548002, 314933 and 205250. The first row has LocationID = 31036, so it doesn't make it into the result set because your main query has the condition dm.locationid='4018'. You should apply this condition in the subquery too, to get the desired result:
select top 3 dm2.fromdatetime
from rpt_drivemaster dm2
where dm2.accountid='3813'
and dm2.statusid=2
and dm2.LocationID = '4018'
order by dm2.fromdatetime desc
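A small reproduction of this effect, sketched in SQLite via Python's sqlite3 (LIMIT 3 stands in for TOP 3). The DriveIDs and location values come from the question, but the drives table name and the dates are invented for illustration, and the fromdatetime < '20140602' predicate is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical single table standing in for rpt_drivemaster; dates are made up.
    CREATE TABLE drives (driveid INTEGER, fromdatetime TEXT, accountid TEXT,
                         locationid TEXT, statusid INTEGER);
    INSERT INTO drives VALUES
        (548002, '2014-05-30', '3813', '31036', 2),
        (314933, '2014-05-20', '3813', '4018', 2),
        (205250, '2014-05-10', '3813', '4018', 2),
        (184779, '2014-05-01', '3813', '4018', 2);
""")

def top_three(extra_filter):
    """Run the outer query with an optional extra predicate inside the IN subquery."""
    sql = f"""
        SELECT dm.driveid FROM drives dm
        WHERE dm.statusid = 2 AND dm.accountid = '3813' AND dm.locationid = '4018'
          AND dm.fromdatetime IN (
              SELECT dm2.fromdatetime FROM drives dm2
              WHERE dm2.accountid = '3813' AND dm2.statusid = 2 {extra_filter}
              ORDER BY dm2.fromdatetime DESC
              LIMIT 3)
        ORDER BY dm.fromdatetime DESC
    """
    return [row[0] for row in conn.execute(sql)]

without_filter = top_three("")
with_filter = top_three("AND dm2.locationid = '4018'")
print(without_filter, with_filter)
```

Without the location predicate, one of the three "top" dates belongs to a drive at another location and is then discarded by the outer filter, leaving two rows; with it, all three expected DriveIDs come back.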

Sql Server Query Selecting Top and grouping by

Spouses table:
SpouseID
SpousePreviousAddresses table:
PreviousAddressID, SpouseID, FromDate, AddressTypeID
What I have now updates only the single most recent address in the whole table, regardless of SpouseID, setting its AddressTypeID = 1.
I want to assign AddressTypeID = 1 to the most recent SpousePreviousAddresses row
for each unique SpouseID in the SpousePreviousAddresses table.
UPDATE spa
SET spa.AddressTypeID = 1
FROM SpousePreviousAddresses AS spa INNER JOIN Spouses ON spa.SpouseID = Spouses.SpouseID,
(SELECT TOP 1 SpousePreviousAddresses.* FROM SpousePreviousAddresses
INNER JOIN Spouses AS s ON SpousePreviousAddresses.SpouseID = s.SpouseID
WHERE SpousePreviousAddresses.CountryID = 181 ORDER BY SpousePreviousAddresses.FromDate DESC) as us
WHERE spa.PreviousAddressID = us.PreviousAddressID
I think I need a group by but my sql isn't all that hot. Thanks.
Update that is Working
I was wrong about having found a solution to this earlier. Below is the solution I am going with
WITH result AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY SpouseID ORDER BY FromDate DESC) AS rowNumber, *
FROM SpousePreviousAddresses
WHERE CountryID = 181
)
UPDATE result
SET AddressTypeID = 1
FROM result WHERE rowNumber = 1

Presuming you are using SQL Server 2005 (based on the error message you got from the previous attempt), probably the most straightforward way to do this would be to use the ROW_NUMBER() function coupled with a Common Table Expression. I think this might do what you are looking for:
WITH result AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY SpouseID ORDER BY FromDate DESC) as rowNumber,
*
FROM
SpousePreviousAddresses
)
UPDATE SpousePreviousAddresses
SET
AddressTypeID = 2
FROM
SpousePreviousAddresses spa
INNER JOIN result r ON spa.SpouseId = r.SpouseId
WHERE r.rowNumber = 1
AND spa.PreviousAddressID = r.PreviousAddressID
AND spa.CountryID = 181
In SQL Server 2005 the ROW_NUMBER() function is one of the most powerful around. It is very useful in lots of situations. The time spent learning about it will be repaid many times over.
The CTE is used to simplify the code a bit, as it removes the need for a temporary table of some kind to store the intermediate result.
The resulting query should be fast and efficient. I know the select in the CTE uses *, which is a bit of overkill as we don't need all the columns, but it may help if anyone wants to see what is happening inside the query.

Here's one way to do it:
UPDATE spa1
SET spa1.AddressTypeID = 1
FROM SpousePreviousAddresses AS spa1
LEFT OUTER JOIN SpousePreviousAddresses AS spa2
ON (spa1.SpouseID = spa2.SpouseID AND spa1.FromDate < spa2.FromDate)
WHERE spa1.CountryID = 181 AND spa2.SpouseID IS NULL;
In other words, update the row spa1 for which no other row spa2 exists with the same spouse and a greater (more recent) date.
There's exactly one row for each value of SpouseID that has the greatest date compared to all other rows (if any) with the same SpouseID.
There's no need to use a GROUP BY, because there's kind of an implicit grouping done by the join.
update: I think you misunderstand the purpose of the OUTER JOIN. If there is no row spa2 that matches all the join conditions, then all columns of spa2.* are returned as NULL. That's how outer joins work. So you can search for the cases where spa1 has no matching row spa2 by testing that spa2.SpouseID IS NULL.
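To see the outer-join test in action, here is a minimal sketch in SQLite via Python's sqlite3 that only selects the rows the UPDATE would touch, and compares them with the ROW_NUMBER() approach from the other answer. The sample rows are made up, and every address is in country 181 so that the two approaches are directly comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SpousePreviousAddresses (
        PreviousAddressID INTEGER PRIMARY KEY, SpouseID INTEGER,
        FromDate TEXT, CountryID INTEGER);
    INSERT INTO SpousePreviousAddresses VALUES
        (1, 10, '2001-01-01', 181),
        (2, 10, '2005-01-01', 181),
        (3, 20, '2003-01-01', 181);
""")

# Anti-join: keep spa1 only when no later row spa2 exists for the same spouse.
anti_join = conn.execute("""
    SELECT spa1.PreviousAddressID
    FROM SpousePreviousAddresses spa1
    LEFT OUTER JOIN SpousePreviousAddresses spa2
      ON spa1.SpouseID = spa2.SpouseID AND spa1.FromDate < spa2.FromDate
    WHERE spa1.CountryID = 181 AND spa2.SpouseID IS NULL
    ORDER BY spa1.PreviousAddressID
""").fetchall()

# Same rows picked by ranking each spouse's addresses by date, newest first.
ranked = conn.execute("""
    SELECT PreviousAddressID FROM (
        SELECT PreviousAddressID,
               row_number() OVER (PARTITION BY SpouseID ORDER BY FromDate DESC) AS rn
        FROM SpousePreviousAddresses
        WHERE CountryID = 181
    )
    WHERE rn = 1
    ORDER BY PreviousAddressID
""").fetchall()

print(anti_join, ranked)
```

On this sample both queries pick address 2 for spouse 10 and address 3 for spouse 20. If a spouse had a more recent address outside country 181, the two could differ, since the anti-join as written compares against all of that spouse's rows.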

UPDATE spa SET spa.AddressTypeID = 1
WHERE spa.SpouseID IN (
SELECT DISTINCT s1.SpouseID FROM Spa S1, SpousePreviousAddresses S2
WHERE s1.SpouseID = s2.SpouseID
AND s2.CountryID = 181
AND s1.PreviousAddressId = s2.PreviousAddressId
ORDER BY S2.FromDate DESC)
Just a guess.
