T-SQL SELECT TOP returns duplicates


I'm using SQL Server 2008 R2.
I'm not sure if I've discovered a strange SQL quirk, or (more likely) something in my code is causing this strange behaviour, particularly as Google has turned up nothing. I have a view called vwResponsible_Office_Address.
SELECT * FROM vwResponsible_Office_Address
..returns 403 rows
This code:
SELECT TOP 1000 * FROM vwResponsible_Office_Address
..returns 409 rows, as it includes 6 duplicates.
However this:
SELECT TOP 1000 * FROM vwResponsible_Office_Address
ORDER BY ID
..returns 403 rows again.
I can post the code for the view if it's relevant, but does it make sense for SELECT TOP to ever work in this way? I understand that SELECT TOP is free to return records in any order but don't understand why the number of records returned should vary.
The view does use CROSS APPLY, which might be affecting the result set somehow?
EDIT: View definition as requested
CREATE VIEW [dbo].[vwResponsible_Office_Address]
AS
SELECT fp.Entity_ID [Reg_Office_Entity_ID],
fp.Entity_Name [Reg_Office_Entity_Name],
addr.Address_ID
FROM [dbo].[Entity_Relationship] er
INNER JOIN [dbo].[Entity] fp
ON er.[Related_Entity_ID] = fp.[Entity_ID]
INNER JOIN [dbo].[Entity_Address] ea
ON ea.[Entity_ID] = fp.[Entity_ID]
CROSS APPLY (
SELECT TOP 1 Address_ID
FROM [dbo].[vwEntity_Address] vea
WHERE [vea].[Entity_ID] = fp.Entity_ID
ORDER by ea.[Address_Type_ID] ASC, ea.[Address_ID] DESC
) addr
WHERE [Entity_Relationship_Type_ID] = 25 -- fee payment relationship
UNION
SELECT ets.[Entity_ID],
ets.[Entity_Name],
addr.[Address_ID]
FROM dbo.[vwEntity_Entitlement_Status] ets
INNER JOIN dbo.[Entity_Address] ea
ON ea.[Entity_ID] = ets.[Entity_ID]
CROSS APPLY (
SELECT TOP 1 [Address_ID]
FROM [dbo].[vwEntity_Address] vea
WHERE vea.[Entity_ID] = ets.[Entity_ID]
ORDER by ea.[Address_Type_ID] ASC, ea.[Address_ID] DESC
) addr
WHERE ets.[Entitlement_Type_ID] = 40 -- registered office
AND ets.[Entitlement_Status_ID] = 11 -- active

I would assume that there is some non-determinism going on, which means that different access methods can return different results.
Looking at the view definition, the only place where that appears likely is if vwEntity_Address has duplicate rows for an Entity_ID.
In that case the TOP 1 Address_ID returned is arbitrary, which will affect the result of the UNION operation when it removes duplicates.
This definitely looks extremely suspect:
SELECT TOP 1 [Address_ID]
FROM [dbo].[vwEntity_Address] vea
WHERE vea.[Entity_ID] = ets.[Entity_ID]
ORDER by ea.[Address_Type_ID] ASC, ea.[Address_ID] DESC
You are ordering by values from the outer query (ea) inside the CROSS APPLY. This has no effect whatsoever, because those values are constant for any particular CROSS APPLY invocation.
Can you try changing it to:
SELECT TOP 1 [Address_ID]
FROM [dbo].[vwEntity_Address] vea
WHERE vea.[Entity_ID] = ets.[Entity_ID]
ORDER by vea.[Address_ID] DESC
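To check whether vwEntity_Address really has more than one row per Entity_ID, as suspected above, a query along these lines could be run first. It is only a diagnostic sketch: the dbo schema and the view and column names are taken from the view definition in the question, and the Address_Count alias is made up here.

```sql
-- Lists every Entity_ID that has more than one row in the view.
-- If this returns rows, TOP 1 without a meaningful ORDER BY
-- can pick a different Address_ID from one execution plan to the next.
SELECT vea.Entity_ID, COUNT(*) AS Address_Count
FROM dbo.vwEntity_Address AS vea
GROUP BY vea.Entity_ID
HAVING COUNT(*) > 1
ORDER BY Address_Count DESC;
```

If the result is empty, the non-determinism has to come from somewhere else.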

I was wondering if your view included a function, until I got to the end, where you say you use CROSS APPLY. I would assume that is your problem; if you're interested in the details, take a look at the various query plans.
EDIT: Expansion of answer
That is, your function is non-deterministic and can either return more than one row per input or return the same row for different inputs. In combination, this means that you'll get exactly what you are seeing: duplicate rows under some circumstances. Adding a DISTINCT to your view is the costly way to solve the problem; a better way would be to change your function so that any input produces only one output row, and each output row is produced by only one input.
EDIT: I didn't see that you're now including your view definition.
Your problem is definitely the CROSS APPLY. In particular, you are sorting inside the CROSS APPLY by values from OUTSIDE of it, which makes the TOP 1 effectively random.

Related

Using second highest value in an ON clause

I have an existing MSSQL view where I need to include a new join to the view. To get the correct record data I need to select the entry where the ActivityKey is the second highest (essentially the second most recent revision of the policy).
select
...
from polmem a
left join polMemPremium wpmp on (wpmp.policyNumber=pf.sreference
and wpmp.lPolicyMemberKey=a.lPolicyMemberKey
and wpmp.lPolicyActivityKey = (select Max(wpmp.lPolicyActivityKey) where wpmp.lPolicyActivityKey
NOT IN (SELECT MAX(wpmp.lPolicyActivityKey))))
where
...
But the above results in this error:
An aggregate cannot appear in an ON clause unless it is in a subquery contained in a HAVING clause or select list, and the column being aggregated is an outer reference.
Essentially the error is telling me I need to have the aggregate
(select Max(wpmp.lPolicyActivityKey) where wpmp.lPolicyActivityKey NOT IN (SELECT MAX(wpmp.lPolicyActivityKey)))
in a Having and then list most if not all of the columns in the view's Select statement in a Group By. My issue is as this is a view used in multiple places and doing what MSSQL wants is a massive change to the view for the sake of what I thought would be a relatively simple addition. I'm just wondering if I'm approaching this wrong and if there is a better way to achieve what I want?

Just try something like:
select ...
from .....
..........
cross apply (select
*
,row_number() over (order by wpmp.lPolicyActivityKey desc) as rn
from web_PolicyMemberPremium wpmp
where wpmp.policyNumber=pf.sreference
and wpmp.lPolicyMemberKey=a.lPolicyMemberKey) wpmp
....
where ...
and wpmp.rn = 2
I used CROSS APPLY, which means there must be a matching policy row in the table, otherwise the outer rows are excluded. You could use OUTER APPLY instead and change the WHERE clause to isnull(wpmp.rn,2) = 2 or similar, but that doesn't make much sense to me.
PS. It would help us a lot (and mostly you) if you formatted the code nicely.
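As a self-contained illustration of the ROW_NUMBER() idea above, here is a minimal sketch. The table variable @Premium and its sample rows are invented for the example; only the column names come from the question.

```sql
-- Hypothetical sample data, invented for illustration only.
DECLARE @Premium TABLE (
    policyNumber       varchar(20),
    lPolicyMemberKey   int,
    lPolicyActivityKey int
);

INSERT INTO @Premium (policyNumber, lPolicyMemberKey, lPolicyActivityKey)
VALUES ('P1', 1, 10), ('P1', 1, 20), ('P1', 1, 30),
       ('P2', 2, 5);

-- Number the rows of each policy/member from the highest activity key down,
-- then keep only the second one.
SELECT ranked.policyNumber, ranked.lPolicyMemberKey, ranked.lPolicyActivityKey
FROM (
    SELECT p.*,
           ROW_NUMBER() OVER (PARTITION BY p.policyNumber, p.lPolicyMemberKey
                              ORDER BY p.lPolicyActivityKey DESC) AS rn
    FROM @Premium AS p
) AS ranked
WHERE ranked.rn = 2;
-- P1 / member 1 has keys 30, 20, 10, so the row with key 20 is returned.
-- P2 / member 2 has a single row, so it has no rn = 2 row and is left out.
```

The second case shows the behaviour mentioned above: a policy with only one revision disappears from the result unless it is handled separately.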

Order of Operation in SQL Server Query

I have the query below, which selects items and one of their features from a table of a hundred thousand items.
But I am concerned about the performance of the subquery. Will it be executed before or after the WHERE clause?
Suppose I am selecting 25 items out of 10000: will this subquery be executed for only 25 items, or for all 10000?
declare @BlockStart int = 1
, @BlockSize int = 25
;
select *, (
select Value_Float
from Features B
where B.Feature_ModelID = Itm.ModelID
and B.Feature_PropertyID = 5
) as Price
from (
select *
, row_number() over (order by ModelID desc) as RowNumber
from Models
) Itm
where Itm.RowNumber >= @BlockStart
and Itm.RowNumber < @BlockStart + @BlockSize
order by ModelID desc

The sub query in the FROM clause produces a full set of results, but the sub query in the SELECT clause will (generally!) only be run for the records included with the final result set.
As with all things SQL, there is a query optimizer involved, which may at times decide to create seemingly-strange execution plans. In this case, I believe we can be pretty confident, but I need to caution about making sweeping generalizations about SQL language order of operations.
Moving on, have you seen the OFFSET/FETCH syntax available in SQL Server 2012 and later? This seems like a better way to handle the @BlockStart and @BlockSize values, especially as it looks like you're paging on the clustered key. (If you end up paging on an alternate column, the linked article shows a much faster method.)
Also, at risk of making generalizations again, if you can know that only one Features record exists per ModelID with Feature_PropertyID = 5, you will tend to get better performance using a JOIN:
SELECT m.*, f.Value_Float As Price
FROM Models m
LEFT JOIN Features f ON f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
ORDER BY m.ModelID DESC
OFFSET @BlockStart - 1 ROWS FETCH NEXT @BlockSize ROWS ONLY -- @BlockStart is 1-based, as in the question
If you can't make that guarantee, you may get better performance from an APPLY operation:
SELECT m.*, f.Value_Float As Price
FROM Models m
OUTER APPLY (
SELECT TOP 1 Value_Float
FROM Features f
WHERE f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
) f
ORDER BY m.ModelID DESC
OFFSET @BlockStart - 1 ROWS FETCH NEXT @BlockSize ROWS ONLY -- @BlockStart is 1-based, as in the question
Finally, this smells like yet another variation of the Entity-Attribute-Value pattern... which, while it has its place, should typically be a pattern of last resort.

Filtering on ROW_NUMBER() is changing the results

I implemented an OData service of my own that takes an SQL statement and applies the top / skip filter using ROW_NUMBER(). Most statements tested so far work well, except for one involving two levels of LEFT JOIN. For some reason I can't explain, the data returned by the SQL changes when I apply a WHERE clause on the row number column.
For readability (and testing), I removed most of the sql to keep only the faulty part. Basically, you have a Patients table that may have 0 to N Diagnostics and the Diagnostics may have 0 to N Treatments:
SELECT RowNumber, PatientID, DiagnosticID, TreatmentID
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNumber
, *
FROM PATIENTS
LEFT JOIN DIAGNOSTICS ON DIAGNOSTICS.PatientID = PATIENTS.PatientID
LEFT JOIN TREATMENTS ON TREATMENTS.DiagnosticID = DIAGNOSTICS.DiagnosticID
) AS Wrapper
--WHERE RowNumber BETWEEN 1 AND 10
--If I uncomment the line above, I get 10 rows that differ from the first 10 rows of this query
These are the results I got from the statement above. The result on the left shows the first 10 rows without the WHERE clause, while the one on the right shows the results with the WHERE clause.
For the record, I'm using SQL Server 2008 R2 SP3. My application is in C# but the problem occurs in SQL server too so I don't think .NET is involved in this case.
EDIT
About the ORDER BY (SELECT NULL), I took that code a while ago from this SO question. However, an order by null will work only if the statement is sorted... in my case, I forgot about adding an order by clause so that's why I was getting some random sorting.

Let me first ask: why do you expect it to be the same? Or rather, why do you expect it to be anything in particular? You haven't imposed an ordering, so the query optimizer is free to use whatever execution operators are most efficient (according to its cost scheme). When you add the WHERE clause, the plan will change and the natural ordering of the results will be different. This can also happen when adding joins or subqueries, for example.
If you want the results to come back in a specific order, you need to actually use the ORDER BY subclause of the ROW_NUMBER() window function. I'm not sure why you are ordering by SELECT NULL, but I can guarantee you that's the problem.
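A minimal sketch of what that could look like for the query in the question. It assumes that PatientID, DiagnosticID and TreatmentID together identify each joined row uniquely (for example because each is the key of its table); the aliases p, d and t are introduced here for illustration.

```sql
SELECT RowNumber, PatientID, DiagnosticID, TreatmentID
FROM
(
    -- Number the rows by real columns instead of (SELECT NULL), so the
    -- numbering does not depend on the execution plan the optimizer picks.
    SELECT ROW_NUMBER() OVER (ORDER BY p.PatientID, d.DiagnosticID, t.TreatmentID) AS RowNumber
         , p.PatientID
         , d.DiagnosticID
         , t.TreatmentID
    FROM PATIENTS AS p
    LEFT JOIN DIAGNOSTICS AS d ON d.PatientID = p.PatientID
    LEFT JOIN TREATMENTS AS t ON t.DiagnosticID = d.DiagnosticID
) AS Wrapper
WHERE RowNumber BETWEEN 1 AND 10
ORDER BY RowNumber;
```

If the ordering columns do not identify rows uniquely, ties can still be numbered differently between runs, so a unique tie-breaker column should be the last item in the ORDER BY.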

SELECT MAX() too slow - any alternatives?

I've inherited a SQL Server based application and it has a stored procedure that contains the following, but it hits timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives, such as ROW_NUMBER() OVER( PARTITION BY...
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.

SQL performance problems are seldom addressed by rewriting the query. The compiler already knows how to rewrite it anyway. The problem is always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2 you need an index on BItems.ICID.
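The two indexes described above could be created roughly as follows. This is a sketch only: the index names are made up, and the dbo schema is an assumption since the question does not show one.

```sql
-- Supports MAX(StatusTime) ... GROUP BY BID: the latest StatusTime for each BID
-- can be read from the index without scanning the whole table.
CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);

-- Supports the seek for WHERE B.ICID = 2.
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);
```

Comparing the execution plan before and after creating them shows whether the optimizer actually uses them.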
The query could probably also be expressed as a correlated APPLY, because that seems to be what is really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the original would return multiple rows when several rows share the maximum StatusTime. It is only a guess that this is what is desired ('the most recent BData for this BItem').

Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also, please always reference tables using their schema prefix (see "Bad habits to kick: avoiding the schema prefix").

This may be a late response, but I recently ran into the same performance issue, where a simple query involving MAX() took more than 1 hour to execute.
After looking at the execution plan, it seems that in order to perform the MAX() function, every record meeting the WHERE clause condition is fetched. In your case, every record in your table needs to be fetched before MAX() is performed. Also, indexing BData.StatusTime alone will not speed up the query. Indexing is useful for looking up a particular record, but it will not help with performing the comparison.
In my case I didn't have the GROUP BY, so all I did was use an ORDER BY ... DESC clause with SELECT TOP 1. The query went from over 1 hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully that speeds up your query.
Cheers!

The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE b.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.

It depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery, but more than likely it won't be any faster.
The best option would probably be to add an index on BID, with StatusTime in its INCLUDE list, and if possible to filter that index to the BID values whose BItems row has ICID = 2.

[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers / suggestions. Unfortunately I couldn't get any further with this, so I have given up trying for now.
It looks like the best solution is to rewrite the application to UPDATE the latest data into a different table, so that getting the latest readings is a really quick and simple SELECT.
Thanks again for the suggestions.

In an EXISTS can my JOIN ON use a value from the original select

I have an order system. Users can be attached to different orders as different types of user. They can download documents associated with an order. Documents are only given to certain types of users on the order. I'm having trouble writing the query to check a user's permission to view a document and select the info about the document.
I have the following tables and (applicable) fields:
Docs: DocNo, FileNo
DocAccess: DocNo, UserTypeWithAccess
FileUsers: FileNo, UserType, UserNo
I have the following query:
SELECT Docs.*
FROM Docs
WHERE DocNo = 1000
AND EXISTS (
SELECT * FROM DocAccess
LEFT JOIN FileUsers
ON FileUsers.UserType = DocAccess.UserTypeWithAccess
AND FileUsers.FileNo = Docs.FileNo /* Errors here */
WHERE DocAccess.UserNo = 2000 )
The trouble is that in the Exists Select, it does not recognize Docs (at Docs.FileNo) as a valid table. If I move the second on argument to the where clause it works, but I would rather limit the initial join rather than filter them out after the fact.
I can get around this a couple ways, but this seems like it would be best. Anything I'm missing here? Or is it simply not allowed?

I think this is a limitation of your database engine. In most databases, Docs would be in scope for the entire subquery, including both the WHERE and ON clauses.
However, you do not need to worry about where you put the particular clause. SQL is a descriptive language, not a procedural language. The purpose of SQL is to describe the output. The SQL engine, parser, and compiler should be choosing the most optimal execution path. Not always true. But, move the condition to the where clause and don't worry about it.

I am not clear why you need to join with FileUsers at all in your subquery.
What is the purpose and idea of the query (in plain English)?
In any case, if you do need to join with FileUsers, then I suggest using an INNER JOIN and moving the second filter to the WHERE condition. I don't think you can use it in the JOIN condition of a subquery; at least, I've never seen it used this way before. I believe you can only correlate through the WHERE clause.
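A minimal sketch of that suggestion. It assumes that UserNo lives on FileUsers (as in the table list in the question) and that DocAccess should be matched to the outer document on DocNo; neither is confirmed by the question's own query.

```sql
SELECT Docs.*
FROM Docs
WHERE Docs.DocNo = 1000
  AND EXISTS (
        SELECT 1
        FROM DocAccess
        INNER JOIN FileUsers
            ON FileUsers.UserType = DocAccess.UserTypeWithAccess
        WHERE DocAccess.DocNo = Docs.DocNo    -- assumption: access rows are per document
          AND FileUsers.FileNo = Docs.FileNo  -- correlation moved from ON to WHERE
          AND FileUsers.UserNo = 2000         -- assumption: UserNo is a FileUsers column
  );
```

With an INNER JOIN, a condition placed in WHERE gives the same result as the same condition placed in ON, so moving it does not change which rows qualify.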

You have to use aliases to get this working:
SELECT
doc.*
FROM
Docs doc
WHERE
doc.DocNo = 1000
AND EXISTS (
SELECT
*
FROM
DocAccess acc
LEFT OUTER JOIN
FileUsers usr
ON
usr.UserType = acc.UserTypeWithAccess
AND usr.FileNo = doc.FileNo
WHERE
acc.UserNo = 2000
)
This also makes it more clear which table each field belongs to (think about using the same table twice or more in the same query with different aliases).
If you would only like to limit the output to one row you can use TOP 1:
SELECT TOP 1
doc.*
FROM
Docs doc
INNER JOIN
FileUsers usr
ON
usr.FileNo = doc.FileNo
INNER JOIN
DocAccess acc
ON
acc.UserTypeWithAccess = usr.UserType
WHERE
doc.DocNo = 1000
AND acc.UserNo = 2000
Of course, the second query works a bit differently from the first one (both JOINs are INNER). Depending on your data model, you might even leave the TOP 1 out of that query.
