COUNT(DISTINCT column_name) Discrepancy vs. COUNT(column_name) in SQL Server 2008?

I'm running into a problem that's driving me nuts.
When running the query below, I get a count of 233,769
SELECT COUNT(distinct Member_List_Link.UserID)
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
But if I run the same query without the distinct keyword, I get a count of 233,748
SELECT COUNT(Member_List_Link.UserID)
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And Member_List_Link.GroupID = 5
AND MasterMembers.ValidUsers = 1 AND Member_List_Link.Status = 1
To test, I recreated all the tables, placed them into temp tables, and ran the queries again:
SELECT COUNT(distinct #Temp_Member_List_Link.UserID)
FROM #Temp_Member_List_Link with (nolock)
INNER JOIN #Temp_MasterMembers with (nolock)
ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID
WHERE #Temp_MasterMembers.Active = 1 And
#Temp_Member_List_Link.GroupID = 5 AND
#Temp_MasterMembers.ValidUsers = 1 AND
#Temp_Member_List_Link.Status = 1
And without the distinct keyword
SELECT COUNT(#Temp_Member_List_Link.UserID)
FROM #Temp_Member_List_Link with (nolock)
INNER JOIN #Temp_MasterMembers with (nolock)
ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID
WHERE #Temp_MasterMembers.Active = 1 And
#Temp_Member_List_Link.GroupID = 5 AND
#Temp_MasterMembers.ValidUsers = 1 AND
#Temp_Member_List_Link.Status = 1
On a side note, I recreated the temp tables by simply running (SELECT * INTO #temp... FROM Member_List_Link)
And now when I check to see the difference between COUNT(column) vs. COUNT(distinct column) with these temp tables, I don't see any!
So why is there a discrepancy with the original tables?
I'm running SQL Server 2008 (Dev Edition).
UPDATE - Including statistics profile
PhysicalOp column only for the first query (without distinct)
NULL
Compute Scalar
Stream Aggregate
Clustered Index Seek
PhysicalOp column only for the second query (with distinct)
NULL
Compute Scalar
Stream Aggregate
Parallelism
Stream Aggregate
Hash Match
Hash Match
Bitmap
Parallelism
Index Seek
Parallelism
Clustered Index Scan
Rows and Executes for the 1st query (without distinct)
Rows Executes
1 1
0 0
1 1
1 1
Rows and Executes for the 2nd query (with distinct)
Rows Executes
1 1
0 0
1 1
16 1
16 16
233767 16
233767 16
281901 16
281901 16
281901 16
234787 16
234787 16
Adding OPTION(MAXDOP 1) to the 2nd query (with distinct)
Rows Executes
1 1
0 0
1 1
233767 1
233767 1
281901 1
548396 1
And the resulting PhysicalOp
NULL
Compute Scalar
Stream Aggregate
Hash Match
Hash Match
Index Seek
Clustered Index Scan

FROM http://msdn.microsoft.com/en-us/library/ms187373.aspx
NOLOCK Is equivalent to READUNCOMMITTED. For more information, see READUNCOMMITTED later in this topic.
READUNCOMMITTED can read rows twice if they are the subject of a transaction, since both the roll-forward and roll-back versions of a row exist within the database while the transaction is in progress.
By default all queries run at READ COMMITTED, which excludes uncommitted rows.
When you insert into a temp table, the SELECT gives you only committed rows. I believe this covers all the symptoms you are trying to explain.
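A quick way to test that theory (a sketch, reusing the tables and filters from the question) is to drop the NOLOCK hints and compare both counts in a single READ COMMITTED batch:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- No NOLOCK hints: both aggregates see only committed rows,
-- so any remaining difference cannot be blamed on dirty reads.
SELECT COUNT(Member_List_Link.UserID) AS plain_count,
       COUNT(DISTINCT Member_List_Link.UserID) AS distinct_count
FROM Member_List_Link
INNER JOIN MasterMembers
    ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1
  AND Member_List_Link.GroupID = 5
  AND MasterMembers.ValidUsers = 1
  AND Member_List_Link.Status = 1;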

I think I have got the answer to your question, but tell me first: is UserID a primary key in your original table?
If yes, then the SELECT ... INTO (CTAS-style) query used to create the temp table would not copy the primary key of the original table; it only copies NOT NULL constraints that are not part of a primary key.
So your original table had a primary key, which means COUNT(DISTINCT column_name) doesn't count tuples with NULLs there, but when you created the temp tables the primary key, and hence its NOT NULL guarantee, didn't get copied across.
Is that clear?
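A quick way to check that claim (a sketch, assuming SQL Server and the table names from the question) is to look for key constraints on the temp copy right after the SELECT ... INTO:
SELECT * INTO #Temp_Member_List_Link FROM Member_List_Link;

-- Returns no rows: SELECT ... INTO copies the data and column nullability,
-- but it does not copy the PRIMARY KEY constraint itself.
SELECT name, type_desc
FROM tempdb.sys.key_constraints
WHERE parent_object_id = OBJECT_ID('tempdb..#Temp_Member_List_Link');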

It's hard to reproduce this behaviour, so I'm punching in the dark here:
The WITH (NOLOCK) hint enables reading of uncommitted data. I'm guessing you've added it so nothing gets locked for your users? If you remove those hints and issue a
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
Prior to executing the query, you should get more reliable results. But then, the tables may receive locks while executing the query.
If that doesn't work, my guess is that DISTINCT uses an index to optimize. Check the query plan and rebuild indexes as necessary; that could be the source of your problem.
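If you want to try the rebuild suggestion, a minimal sketch (ALL rebuilds every index on the table; substitute specific index names if you prefer):
ALTER INDEX ALL ON Member_List_Link REBUILD;
ALTER INDEX ALL ON MasterMembers REBUILD;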

What result do you get with
SELECT count(*) FROM (
SELECT distinct Member_List_Link.UserID
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
) as m
And with:
SELECT count(*) FROM (
SELECT distinct Member_List_Link.UserID
FROM Member_List_Link
INNER JOIN MasterMembers
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
) as m

Ray, please try the following
SELECT COUNT(*)
FROM
(
SELECT Member_List_Link.UserID, ROW_NUMBER() OVER (PARTITION BY Member_List_Link.UserID ORDER BY (SELECT NULL)) N
FROM Member_List_Link with (nolock)
INNER JOIN MasterMembers with (nolock)
ON Member_List_Link.UserID = MasterMembers.UserID
WHERE MasterMembers.Active = 1 And
Member_List_Link.GroupID = 5 AND
MasterMembers.ValidUsers = 1 AND
Member_List_Link.Status = 1
) A
WHERE N = 1

When you use COUNT with a DISTINCT column, it doesn't count rows where that column is NULL.
create table #tmp(name char(4) null)
insert into #tmp values(null)
insert into #tmp values(null)
insert into #tmp values("AAA")
Query:-
1> select count(*) from #tmp
2> go
3
1> select count(distinct name) from #tmp
2> go
1
1> select distinct name from #tmp
2> go
name
NULL
AAA
But it works in a derived table:
1> select count(*) from ( select distinct name from #tmp) a
2> go
2
Note: I tested this in Sybase.
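For what it's worth, the same behaviour reproduces on SQL Server (a quick sketch, not part of the original test):
create table #tmp2(name char(4) null)
insert into #tmp2 values(null)
insert into #tmp2 values(null)
insert into #tmp2 values('AAA')
select count(*) from #tmp2              -- returns 3
select count(distinct name) from #tmp2  -- returns 1, with the warning
-- "Null value is eliminated by an aggregate or other SET operation."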

Related

SQL execution time

I'm facing an execution time problem here and I'm really stuck.
UPDATE [OT]
SET AAA = 8
WHERE ID in
(
select UID from [IP] B
inner join [OT] A
on B.ADDR = A.ADDR and A.ID=B.ID
)
AND AAA= 6 ;
The OT table has duplicate IDs, count(*) about 900,000 rows.
The IP table has unique IDs, count(*) about 800,000 rows.
AAA is a column of the OT table, type tinyint.
select count(*) from OT where AAA = 6 is about 150,000.
I do not know why this query takes more than 1 hour.
My other, similar queries only take 10 seconds.
I would just write this more simply as:
UPDATE OT
SET AAA = 8
FROM OT
WHERE EXISTS (SELECT 1 FROM IP WHERE IP.ADDR = OT.ADDR AND IP.ID = OT.ID) AND
OT.AAA = 6 ;
For performance, you want indexes on OT(AAA) and IP(ADDR, ID).
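If those indexes don't exist yet, they might look like this (the index names are just placeholders):
CREATE INDEX IX_OT_AAA ON OT (AAA);
CREATE INDEX IX_IP_ADDR_ID ON IP (ADDR, ID);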

Performance Issue in Left outer join Sql server

In my project I need to find the tasks that differ between the old and the new revision in the same table.
id | task | latest_Rev
1 A N
1 B N
2 C Y
2 A Y
2 B Y
Expected Result:
id | task | latest_Rev
2 C Y
So I tried following query
Select new.*
from Rev_tmp nw with (nolock)
left outer
join rev_tmp old with (nolock)
on nw.id -1 = old.id
and nw.task = old.task
and nw.latest_rev = 'y'
where old.task is null
When my table has more than 20k records, this query takes a long time.
How can I reduce the time?
My company doesn't allow the use of subqueries.
Use LAG function to remove the self join
SELECT *
FROM (SELECT *,
CASE WHEN latest_Rev = 'y' THEN Lag(latest_Rev) OVER(partition BY task ORDER BY id) ELSE NULL END AS prev_rev
FROM Rev_tmp) a
WHERE prev_rev IS NULL AND latest_Rev = 'y'
My answer assumes
You can't change the indexes
You can't use subqueries
All fields are indexed separately
If you look at the query, the only value that really reduces the result set is latest_rev = 'Y'. If you were to eliminate that condition, you'd definitely get a table scan. So we want that condition to be evaluated using an index. Unfortunately, a field that only holds the values 'Y' and 'N' is likely to be ignored because it will have terrible selectivity. You might get better performance if you coax SQL Server into using that index anyway. If the index on latest_rev is called idx_latest_rev, then try this:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
Select new.*
from Rev_tmp nw with (index(idx_latest_rev))
left outer
join rev_tmp old
on nw.id -1 = old.id
and nw.task = old.task
where old.task is null
and nw.latest_rev = 'y'
latest_Rev should be a bit type (boolean equivalent); it's better for performance.
Maybe you can also add an index on the id, task, latest_Rev columns.
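If adding an index is allowed, a sketch of what that could look like (the index name is a placeholder):
CREATE INDEX IX_Rev_tmp_id_task_latest ON Rev_tmp (id, task, latest_Rev);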
You can try this query (replace left outer by not exists)
Select *
from Rev_tmp nw
where nw.latest_rev = 'y' and not exists
(
select * from rev_tmp old
where nw.id -1 = old.id and nw.task = old.task
)

how to properly merge these 2 query into one update?

This currently works, but I would like to change the update statement to include the action of the insert below it. Is it possible?
UPDATE cas
SET [Locked] = CASE WHEN cas.Locked <> @TargetState AND cas.LastChanged = filter.SourceDateTime THEN @TargetState ELSE cas.[Locked] END
OUTPUT inserted.Id, inserted.Locked, CASE WHEN inserted.Locked = @TargetState AND
inserted.LastChanged = filter.SourceDateTime THEN 1
WHEN inserted.LastChanged <> filter.SourceDateTime THEN -1 -- out of sync
WHEN deleted.Locked = @TargetState THEN -2 -- was not in a good state
ELSE 0 END --generic failure
INTO #OUTPUT
FROM dbo.Target cas WITH(READPAST, UPDLOCK, ROWLOCK) INNER JOIN #table filter ON cas.Id = filter.Id
INSERT INTO #OUTPUT
SELECT filter.id, NULL, CASE WHEN cas.id IS NOT NULL THEN -3 -- row was/is locked
ELSE -4 END -- not found
FROM #table filter left join dbo.target cas with(nolock) on filter.id = cas.id
WHERE NOT EXISTS (SELECT 1 FROM #OUTPUT result WHERE filter.id = result.UpdatedId)
I do not think what you want is possible.
You start with a table to be updated. Let’s say this table contains a set of IDs, say, 1 to 6
You join onto a temp table containing a different set of IDs that may partially overlap (say, 4 to 9)
You issue the update using an inner join. Only rows 4 to 6 are updated
The output clause picks up data only for modified rows, so you only get data for rows 4 to 6
If you flipped this to an outer join (such that all temp table rows are selected), you still only update rows 4 to 6, and the output clause still only kicks out data for rows 4 to 6
So, no, I see no way of achieving this goal in a single SQL statement.
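A small self-contained sketch of that last point (throwaway table variables, only to show that OUTPUT returns rows for the updated 4-6 range and nothing else):
DECLARE @t TABLE (id int PRIMARY KEY, touched bit NOT NULL DEFAULT 0);
DECLARE @filter TABLE (id int PRIMARY KEY);
INSERT INTO @t (id) VALUES (1),(2),(3),(4),(5),(6);
INSERT INTO @filter (id) VALUES (4),(5),(6),(7),(8),(9);

UPDATE t
SET touched = 1
OUTPUT inserted.id    -- returns only 4, 5 and 6
FROM @t t
INNER JOIN @filter f ON f.id = t.id;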

Why is Selecting From Table Variable Far Slower than List of Integers

I have a pretty big MSSQL stored procedure in which I need to conditionally check for certain IDs:
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where b.SomeID in (1,2,3,4,5)
I wanted to conditionally check the SomeID field, so I did the following:
DECLARE @AwesomeIDs TABLE (ID int)
if @enteredText = 'This'
INSERT INTO @AwesomeIDs
VALUES(1),(2),(3)
if @enteredText = 'That'
INSERT INTO @AwesomeIDs
VALUES(4),(5)
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where b.SomeID in (Select ID from @AwesomeIDs)
Nothing else has changed, yet I can't even get the latter query to return 5 records. The top query returns 5000 records in less than 3 seconds. Why is selecting from a table variable so drastically slower?
Two other possible options you can consider
Option 1
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where
( b.SomeID IN (1,2,3) AND @enteredText = 'This')
OR
( b.SomeID IN (4,5) AND @enteredText = 'That')
Option 2
Select SomeColumns
From BigTable b
Join LotsOfTables l on b.LongStringField = l.LongStringField
Where EXISTS (Select 1
from @AwesomeIDs
WHERE b.SomeID = ID)
Mind you, for table variables SQL Server always assumes there is only ONE row in the table (except SQL 2014, where the assumption is 100 rows), and that can affect the estimated and actual plans. But 1 row against 3 is not really a deal breaker.
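To see the estimate problem in action, a minimal sketch (names from the question; OPTION (RECOMPILE) is one common workaround, not mentioned above, that lets the optimizer use the table variable's actual row count):
DECLARE @AwesomeIDs TABLE (ID int PRIMARY KEY);
INSERT INTO @AwesomeIDs VALUES (1),(2),(3),(4),(5);

-- Without the hint the plan is built assuming @AwesomeIDs holds 1 row (on SQL Server 2008).
SELECT SomeColumns
FROM BigTable b
JOIN LotsOfTables l ON b.LongStringField = l.LongStringField
WHERE b.SomeID IN (SELECT ID FROM @AwesomeIDs)
OPTION (RECOMPILE);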

What is the correct pattern for working without deferred constraints?

I am using databases that aren't Oracle or Postgresql, which means I don't have access to deferred constraints, which means that constraints must be valid at all times (instead of just on commit).
Let's say I'm storing a linked list type structure in a database like so:
id parentId
---------------
1 null
2 1
3 2
4 3
5 4
6 5
parentId is a foreign key reference to id, and is required to be unique via a constraint.
Let's say I wanted to move item 5 to sit just before item 1, so our DB would look like this:
id parentId
---------------
1 null
2 5 <-- different
3 2
4 3
5 1 <-- different
6 4 <-- different
Three rows need to be altered, which is three update statements. Any one of these update statements will cause a constraint violation: all three statements must be complete before the constraint would be valid again.
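For example (a sketch, with an assumed table name of linked_list), each of these statements, run against the starting data, already violates the UNIQUE(parentId) constraint on its own:
UPDATE linked_list SET parentId = 5 WHERE id = 2;  -- fails: row 6 still has parentId = 5
UPDATE linked_list SET parentId = 1 WHERE id = 5;  -- fails: row 2 still has parentId = 1
UPDATE linked_list SET parentId = 4 WHERE id = 6;  -- fails: row 5 still has parentId = 4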
My question is: what is the best way of not violating the uniqueness constraint?
I can currently conceive of two different solutions, neither of which I like:
Set each affected parentId to null and then perform the three updates
Completely change my data model so it's more of a 'copy on write' style versioned database, where these sorts of issues are not a problem.
You can do this in a single query. I'm sure there are many variations of this, but here is what I would use...
DECLARE
@node_id INT,
@new_parent_id INT
SELECT
@node_id = 5,
@new_parent_id = 1
UPDATE
yourTable
SET
parent_id = CASE WHEN yourTable.id = target_node.id THEN new_antiscendant.id
WHEN yourTable.id = descendant.id THEN target_node.parent_id
WHEN yourTable.id = new_descendant.id THEN target_node.id
END
FROM
yourTable AS target_node
LEFT JOIN
yourTable AS descendant
ON descendant.parent_id = target_node.id
LEFT JOIN
yourTable AS new_antiscendant
ON new_antiscendant.id = @new_parent_id
LEFT JOIN
yourTable AS new_descendant
ON COALESCE(new_descendant.parent_id, -1) = COALESCE(new_antiscendant.id, -1)
INNER JOIN
yourTable
ON yourTable.id IN (target_node.id, descendant.id, new_descendant.id)
WHERE
target_node.id = @node_id
This will work even if @new_parent_id is NULL or refers to the last record in the list.
MySQL doesn't like self joins in updates, so the approach there would probably be to do the LEFT JOINs into a temporary table to get the new mapping, then join on that table to update all three records in a single query.
INSERT INTO
yourTempTable
SELECT
yourTable.id AS node_id,
CASE WHEN yourTable.id = target_node.id THEN new_antiscendant.id
WHEN yourTable.id = descendant.id THEN target_node.parent_id
WHEN yourTable.id = new_descendant.id THEN target_node.id
END AS new_parent_id
FROM
yourTable AS target_node
LEFT JOIN
yourTable AS descendant
ON descendant.parent_id = target_node.id
LEFT JOIN
yourTable AS new_antiscendant
ON new_antiscendant.id = @new_parent_id
LEFT JOIN
yourTable AS new_descendant
ON COALESCE(new_descendant.parent_id, -1) = COALESCE(new_antiscendant.id, -1)
INNER JOIN
yourTable
ON yourTable.id IN (target_node.id, descendant.id, new_descendant.id)
WHERE
target_node.id = @node_id
UPDATE
yourTable
SET
parent_id = yourTempTable.new_parent_id
FROM
yourTable
INNER JOIN
yourTempTable
ON yourTempTable.node_id = yourTable.id
(The exact syntax depends on your RDBMS.)