Why do I get this unexpected SQL performance gain?

This is more of a quiz question than me panicking over a deadline, but understanding how and why this happens would no doubt let me scratch my head a little less!
So I have this UPDATE statement:
/*** @Table is a TABLE variable ***/
UPDATE O
SET O.PPTime = T.PPTime
FROM @Table AS [O]
INNER JOIN
(SELECT O.OSID, O.STID, DATEDIFF(SECOND, O.StartDateTime, O.EndDateTime) AS [PPTime]
FROM tblO AS [O]
INNER JOIN tblS AS [S] ON O.OSID = S.OSID
INNER JOIN tblE AS [E] ON S.EID = E.EID
INNER JOIN tblEF AS [EF] ON E.EFID = EF.EFID
GROUP BY O.OSID, O.STID, O.StartDateTime, O.EndDateTime) AS [T]
ON O.OSID = T.OSID
WHERE O.PPTime IS NULL
The execution time is approximately 12 seconds.
Below I have added a small WHERE clause, which has no effect on how many rows are returned:
/*** @Table is a TABLE variable ***/
UPDATE O
SET O.PPTime = T.PPTime
FROM @Table AS [O]
INNER JOIN
(SELECT O.OSID, O.STID, DATEDIFF(SECOND, O.StartDateTime, O.EndDateTime) AS [PPTime]
FROM tblO AS [O]
INNER JOIN tblS AS [S] ON O.OSID = S.OSID
INNER JOIN tblE AS [E] ON S.EID = E.EID
INNER JOIN tblEF AS [EF] ON E.EFID = EF.EFID
WHERE O.OSID >= 0 /*** Somehow fixes performance slow down! ***/
GROUP BY O.OSID, O.STID, O.StartDateTime, O.EndDateTime) AS [T]
ON O.OSID = T.OSID
WHERE O.PPTime IS NULL
The execution time is now less than a second. If I run both SELECT statements individually, they execute in the same time and return the same data.
Why do I get such a performance gain?

After reviewing the code, I noticed that adding a primary key and/or an index to the table variable did the trick! One for me to remember!
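As a rough sketch of that fix, the declaration below gives the table variable a primary key. The original post never shows the declaration, so the column types, and the assumption that OSID is unique within the table variable, are guesses made for illustration.

```sql
-- Hypothetical declaration: column types are assumed, and OSID is assumed
-- to be unique in the table variable (otherwise the PRIMARY KEY would fail).
DECLARE @Table TABLE
(
    OSID   INT NOT NULL PRIMARY KEY, -- backed by a unique clustered index on OSID
    PPTime INT NULL
);
```

With an index on the join column, the optimizer at least has the option of seeking into the table variable on OSID instead of scanning it.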

Related

The generated query by EF Core runs very slow in some conditions

We are using EF Core in our application.
In one of the services, EF Core generates the T-SQL below and runs it, but it is very slow!
exec sp_executesql N'SELECT [di].[Id] AS [Key], [di].[Code], [di].[DocumentTypeRef], [di].[IsActive], [di].[IsVisible], [di].[IsGlobal], [di].[IsPrintable], [di].[RecordPrefix], [di].[Owner_PersonRef], [owner].[FullName] AS [DocumentInfoOwnerFullName], [dil].[_Title] AS [ThisIsA_Title], [dil].[_Description] AS [_Description0], [docw].[WorkbenchRef], [t].[_Title] AS [_WorkbenchTitle], [t].[_Description] AS [_WorkbenchDescription], [docc].[CompanyRef], [cl].[_Title] AS [_CompanyTitle], [dt].[IsRecordable] AS [DocumentTypeIsRecordable], [dt].[ReviewCycle] AS [DocumentTypeReviewCycle], [dtl].[_Title] AS [_DocumentTypeTitle], [dtl].[_Description] AS [_DocumentTypeDescription], [t0].[Id] AS [DocumentVersionKey], [t0].[Creator_PersonRef] AS [DocumentVersionCreator_PersonRef], [t0].[EffectiveDate] AS [DocumentVersionEffectiveDate], [t0].[ExpireDate] AS [DocumentVersionExpireDate], [t0].[IsActive] AS [DocumentVersionIsActive], [t0].[PublishDate] AS [DocumentVersionPublishDate], [t0].[VersionNo] AS [DocumentVersionVersionNo], [t0].[ReviewDate] AS [DocumentVersionReviewDate], [t0].[ItemRowRef_DocVersionState], [t1].[FullName] AS [DocumentVersionCreatorFullName], [t2].[_Title] AS [_DocVersionStateTitle]
FROM [QMS].[DocumentInfo] AS [di]
INNER JOIN [QMS].[DocumentInfoLanguage] AS [dil] ON [di].[Id] = [dil].[DocumentInfoRef]
INNER JOIN [QMS].[DocumentCompany] AS [docc] ON [di].[Id] = [docc].[DocumentInfoRef]
INNER JOIN [HRM].[CompanyLanguage] AS [cl] ON [docc].[CompanyRef] = [cl].[CompanyRef]
INNER JOIN [QMS].[DocumentType] AS [dt] ON [di].[DocumentTypeRef] = [dt].[Id]
INNER JOIN [QMS].[DocumentTypeLanguage] AS [dtl] ON [dt].[Id] = [dtl].[DocumentTypeRef]
INNER JOIN [BAS].[PersonLanguage] AS [owner] ON [di].[Owner_PersonRef] = [owner].[PersonRef]
LEFT JOIN [QMS].[DocumentWorkbench] AS [docw] ON [di].[Id] = [docw].[DocumentInfoRef]
LEFT JOIN (
SELECT [x].*
FROM [QMS].[WorkbenchLanguage] AS [x]
WHERE [x].[LanguageRef] = @__languageRef_0
) AS [t] ON [docw].[WorkbenchRef] = [t].[WorkbenchRef]
LEFT JOIN (
SELECT [x0].*
FROM [QMS].[DocumentVersion] AS [x0]
WHERE ([x0].[ExpireDate] IS NULL OR ([x0].[ExpireDate] > GETDATE())) AND ([x0].[EffectiveDate] < GETDATE())
) AS [t0] ON [di].[Id] = [t0].[DocumentInfoRef]
LEFT JOIN (
SELECT [x1].*
FROM [BAS].[PersonLanguage] AS [x1]
WHERE [x1].[LanguageRef] = @__languageRef_1
) AS [t1] ON [t0].[Creator_PersonRef] = [t1].[PersonRef]
LEFT JOIN (
SELECT [x2].*
FROM [BAS].[ItemRowLanguage] AS [x2]
WHERE [x2].[LanguageRef] = @__languageRef_2
) AS [t2] ON [t0].[ItemRowRef_DocVersionState] = [t2].[ItemRowRef]
WHERE (((((([dil].[LanguageRef] = @__languageRef_3) AND ([cl].[LanguageRef] = @__languageRef_4)) AND ([dtl].[LanguageRef] = @__languageRef_5)) AND ([owner].[LanguageRef] = @__languageRef_6))
AND (CHARINDEX(N''we'', [dil].[_Title]) > 0))
AND [docc].[CompanyRef] IN (CAST(3 AS smallint))) AND ([di].[IsVisible] = 1)
ORDER BY (SELECT 1)
OFFSET @__p_8 ROWS FETCH NEXT @__p_9 ROWS ONLY',N'@__languageRef_0 int,@__languageRef_1 int,@__languageRef_2 int,@__languageRef_3 int,@__languageRef_4 int,@__languageRef_5 int,@__languageRef_6 int,@__p_8 int,@__p_9 int',@__languageRef_0=1,@__languageRef_1=1,@__languageRef_2=1,@__languageRef_3=1,@__languageRef_4=1,@__languageRef_5=1,@__languageRef_6=1,@__p_8=0,@__p_9=25
I captured this query with SQL Profiler and ran it in SSMS.
It ran, but it was slow there too!
I tried to find the problem. After a while I realized that when I removed some portions of the query, it ran fast. For example, when I deleted some of the JOINs, everything ran perfectly, and when I deleted the left-hand side of the WHERE condition, everything was OK again. Even when I replaced CHARINDEX with LIKE, the query ran fast.
I finally realized that the query only runs slowly when these parts are combined, which is very strange.
It is possible that I am wrong, but no matter how hard I tried, I could not understand the reason for this behavior.
Can anyone help me understand this problem and find a solution for it?
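For reference, the CHARINDEX-to-LIKE swap mentioned above can be sketched as follows. The table variable and its sample rows are made up for illustration; only the two predicate shapes come from the question. The two filters are interchangeable only when the search term contains no LIKE wildcard characters (`%`, `_`, `[`).

```sql
-- Hypothetical sample data; only the predicate shapes mirror the question.
DECLARE @Titles TABLE (Title NVARCHAR(100) NOT NULL);
INSERT INTO @Titles (Title) VALUES (N'Welcome'), (N'Answer'), (N'Manual');

-- Predicate shape that appears in the generated query:
SELECT Title FROM @Titles WHERE CHARINDEX(N'we', Title) > 0;

-- Substring filter written with LIKE instead:
SELECT Title FROM @Titles WHERE Title LIKE N'%we%';
```

Both forms have to examine every candidate row, so any difference in speed more likely comes from the optimizer estimating the two predicates differently and picking a different plan. Comparing the actual execution plans of the fast and slow variants would confirm or rule that out.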

Slow insert into but fast select query

A SELECT query with joins returns results in less than 1 second (1000 rows in 600 ms), but inserting the same result into a temp table or a physical table takes 15-16 seconds.
I tested I/O performance: a SELECT or an INSERT INTO without any joins takes under a second to write 1000 rows.
I tried trace flag 1118.
I tried adding a clustered index on the temp table and doing the INSERT INTO with TABLOCK and MAXDOP hints.
None of these improved performance.
Thanks for all your comments. 6000 to 20000 rows need to be inserted every 5 seconds from Kafka:
I get the data from Kafka into SQL Server using a table-type variable
I pass it as a parameter to a stored procedure
I load this data, joined with other tables, into a temporary table #table
I use #table to merge the data into the application table
I found a workaround that helps me achieve the target turnaround time, but I don't know the exact reason for the behaviour. As I mentioned in the problem statement, the bottleneck was writing the result set of the SELECT statement, which joins the table variable with various other tables, to the temp table.
I put this SELECT into a stored procedure and inserted the output of executing that procedure into a temp table. Now the insert takes less than 1 second.
SELECT
i.Id AS IId,
df.Id AS dfid,
MAX(CASE
WHEN lp.Value IS NULL and f.pnp = 1 THEN 0
WHEN lp.Value = 0 and f.tzan = 1 and f.pnp = 0 THEN NULL
ELSE lp.Value
END) 'FV',
MAX(lp.TS),
MAX(lp.Eid),
MAX(0+lp.IsDelayedStream)
FROM
f1 f WITH (NOLOCK)
INNER JOIN ft1 ft WITH (NOLOCK) ON f.FeedTypeId = ft.Id
INNER JOIN FeedDataField fdf WITH (NOLOCK)
ON fdf.FeedId = f.Id
INNER JOIN df1 df WITH (NOLOCK)
ON fdf.dfId = df.Id
INNER JOIN ds1 ds WITH (NOLOCK)
ON df.dsid = ds.Id
INNER JOIN dp1 dp WITH (NOLOCK)
ON ds.dpId = dp.Id
INNER JOIN dc1 dc WITH (NOLOCK)
ON dc.dcId = ds.dcId
INNER JOIN i1 i WITH (NOLOCK)
ON f.iId = I.Id
INNER JOIN id1 id WITH (NOLOCK)
ON id.iId = i.Id
INNER JOIN IdentifierType it WITH (NOLOCK)
ON id.ItId = it.Id
INNER JOIN ivw_Tdf tdf WITH(NOEXPAND)
ON tdf.iId = i.Id
INNER JOIN z.dbo.[tlp] lp
ON lp.Ticker = id.Name AND lp.Field = df.SourceName AND
lp.Contributor = dc.Name AND lp.YellowKey = tdf.TextValue
WHERE
ft.Name in ('X', 'Y') AND f.SA = 1
AND dp.Name = 'B' AND (i.Inactive IS NULL OR i.Inactive = 0)
AND it.Name = 'T' AND id.ValidTo = @InfinityDate
AND tdf.SourceName = 'MSD'
AND tdf.ValidTo = @Infinity
GROUP BY i.Id, df.Id
OPTION(MAXDOP 4, OPTIMIZE FOR (@Infinity = '9999-12-31 23:59:59',
@InfinityDate = '9999-12-31'))
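The workaround described above, capturing a stored procedure's result set in a temp table, is the INSERT ... EXEC pattern. A minimal sketch follows. The procedure name `dbo.usp_SelectFeedValues`, the temp table name, and the two-column placeholder result set are all hypothetical; in the real case the procedure body would be the joined SELECT shown above and would take the table-type parameter.

```sql
-- Hypothetical wrapper procedure; the body is a placeholder for the joined SELECT.
CREATE PROCEDURE dbo.usp_SelectFeedValues
AS
BEGIN
    SET NOCOUNT ON;
    SELECT CAST(1 AS INT) AS IId, CAST(2 AS INT) AS dfid; -- placeholder result set
END;
GO

-- Temp table whose columns match the procedure's result set.
CREATE TABLE #FeedValues (IId INT NOT NULL, dfid INT NOT NULL);

-- INSERT ... EXEC writes the procedure's result set into the temp table.
INSERT INTO #FeedValues (IId, dfid)
EXEC dbo.usp_SelectFeedValues;
```

The post does not establish why this is faster. Comparing the actual execution plan of the direct INSERT ... SELECT with the plan of the SELECT inside the procedure would be the natural next step.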

How do I optimize my query in MySQL?

I need to improve my query, especially its execution time. This is my query:
SELECT SQL_CALC_FOUND_ROWS p.*,v.type,v.idName,v.name as etapaName,m.name AS manager,
c.name AS CLIENT,
(SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(duration)))
FROM activities a
WHERE a.projectid = p.projectid) AS worked,
(SELECT SUM(TIME_TO_SEC(duration))
FROM activities a
WHERE a.projectid = p.projectid) AS worked_seconds,
(SELECT SUM(TIME_TO_SEC(remain_time))
FROM tasks t
WHERE t.projectid = p.projectid) AS remain_time
FROM projects p
INNER JOIN users m
ON p.managerid = m.userid
INNER JOIN clients c
ON p.clientid = c.clientid
INNER JOIN `values` v
ON p.etapa = v.id
WHERE 1 = 1
ORDER BY idName
ASC
The execution time of this is approx. 5 seconds. If I remove this part: (SELECT SUM(TIME_TO_SEC(remain_time)) FROM tasks t WHERE t.projectid = p.projectid) AS remain_time
the execution time drops to 0.3 seconds. Is there a way to get the remain_time values with a shorter execution time?
The SQL is invoked from PHP (if this is relevant to any proposed solution).
It sounds like you need an index on tasks.
Try adding this one:
create index idx_tasks_projectid_remaintime on tasks(projectid, remain_time);
The correlated subquery should just use the index and go much faster.
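One way to check whether the subquery actually uses the new index is to run the inner SELECT under EXPLAIN. This is only a sketch: the project id 42 is a made-up value standing in for a real `projectid`.

```sql
-- 42 is a hypothetical projectid; substitute one that exists in your data.
EXPLAIN
SELECT SUM(TIME_TO_SEC(t.remain_time))
FROM tasks t
WHERE t.projectid = 42;
```

If the index is picked up, the `key` column of the EXPLAIN output should name `idx_tasks_projectid_remaintime`. Because the index contains both columns the subquery reads, `Extra` would typically report `Using index` as well.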
Optimizing the query as it is written would give significant performance benefits (see below). But the FIRST QUESTION TO ASK when approaching any optimization is whether you really need to see all the data: there is no filtering of the result set here. This has a HUGE impact on how you optimize a query.
Adding an index on the query above will only help if the optimizer is opening a new cursor on the tasks table for every row returned by the main query. In the absence of any filtering, it will be much faster to do a full table scan of the tasks table.
SELECT ilv.*, remaining.rtime
FROM (
SELECT p.*,v.type, v.idName, v.name as etapaName,
m.name AS manager, c.name AS CLIENT,
SEC_TO_TIME(asbq.worked) AS worked, asbq.worked AS seconds_worked
FROM projects p
INNER JOIN users m
ON p.managerid = m.userid
INNER JOIN clients c
ON p.clientid = c.clientid
INNER JOIN `values` v
ON p.etapa = v.id
LEFT JOIN (
SELECT a.projectid, SUM(TIME_TO_SEC(duration)) AS worked
FROM activities a
GROUP BY a.projectid
) asbq
ON asbq.projectid=p.projectid
) ilv
LEFT JOIN (
SELECT t.projectid, SUM(TIME_TO_SEC(remain_time)) as rtime
FROM tasks t
GROUP BY t.projectid) remaining
ON ilv.projectid=remaining.projectid

Very slow stored procedure

I am having a hard time with query optimization; I'm very close to the point of redesigning the database, and Stack Overflow is my last hope. I don't think that just showing you the query is enough, so I've linked not only the database script but also a database backup in case you don't want to generate the data by hand.
Here you can find both the script and the backup
The problems start when you try to do the following...
exec LockBranches @count=64,@lockedBy='034C0396-5C34-4DDA-8AD5-7E43B373AE5A',@lockedOn='2011-07-01 01:29:43.863',@unlockOn='2011-07-01 01:32:43.863'
The main problems occur in this part:
UPDATE B
SET B.LockedBy = @lockedBy,
B.LockedOn = @lockedOn,
B.UnlockOn = @unlockOn,
B.Complete = 1
FROM
(
SELECT TOP (@count) B.LockedBy, B.LockedOn, B.UnlockOn, B.Complete
FROM Objectives AS O
INNER JOIN Generations AS G ON G.ObjectiveID = O.ID
INNER JOIN Branches AS B ON B.GenerationID = G.ID
INNER JOIN
(
SELECT SB.BranchID AS BranchID, SUM(X.SuitableProbes) AS SuitableProbes
FROM SpicieBranches AS SB
INNER JOIN Probes AS P ON P.SpicieID = SB.SpicieID
INNER JOIN
(
SELECT P.ID, 1 AS SuitableProbes
FROM Probes AS P
/* ----> */ INNER JOIN Results AS R ON P.ID = R.ProbeID /* SSMS Estimated execution plan says this operation is the roughest */
GROUP BY P.ID
HAVING COUNT(R.ID) > 0
) AS X ON P.ID = X.ID
GROUP BY SB.BranchID
) AS X ON X.BranchID = B.ID
WHERE
(O.Active = 1)
AND (B.Sealed = 0)
AND (B.GenerationNo < O.BranchGenerations)
AND (B.LockedBy IS NULL OR DATEDIFF(SECOND, B.UnlockOn, GETDATE()) > 0)
AND (B.Complete = 1 OR X.SuitableProbes = O.BranchSize * O.EstimateCount * O.ProbeCount)
) AS B
EDIT: Here are the amounts of rows in each table:
Spicies 71536
Results 10240
Probes 10240
SpicieBranches 4096
Branches 256
Estimates 5
Generations 1
Versions 1
Objectives 1
Somebody else might be able to explain better than I can why this is much quicker. Experience tells me that when you have a bunch of queries that run slowly together but should be quick in their individual parts, it's worth trying a temporary table.
This is much quicker
ALTER PROCEDURE LockBranches
-- Add the parameters for the stored procedure here
@count INT,
@lockedOn DATETIME,
@unlockOn DATETIME,
@lockedBy UNIQUEIDENTIFIER
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON
--Create Temp Table
SELECT SpicieBranches.BranchID AS BranchID, SUM(X.SuitableProbes) AS SuitableProbes
INTO #BranchSuitableProbeCount
FROM SpicieBranches
INNER JOIN Probes AS P ON P.SpicieID = SpicieBranches.SpicieID
INNER JOIN
(
SELECT P.ID, 1 AS SuitableProbes
FROM Probes AS P
INNER JOIN Results AS R ON P.ID = R.ProbeID
GROUP BY P.ID
HAVING COUNT(R.ID) > 0
) AS X ON P.ID = X.ID
GROUP BY SpicieBranches.BranchID
UPDATE B SET
B.LockedBy = @lockedBy,
B.LockedOn = @lockedOn,
B.UnlockOn = @unlockOn,
B.Complete = 1
FROM
(
SELECT TOP (@count) Branches.LockedBy, Branches.LockedOn, Branches.UnlockOn, Branches.Complete
FROM Objectives
INNER JOIN Generations ON Generations.ObjectiveID = Objectives.ID
INNER JOIN Branches ON Branches.GenerationID = Generations.ID
INNER JOIN #BranchSuitableProbeCount ON Branches.ID = #BranchSuitableProbeCount.BranchID
WHERE
(Objectives.Active = 1)
AND (Branches.Sealed = 0)
AND (Branches.GenerationNo < Objectives.BranchGenerations)
AND (Branches.LockedBy IS NULL OR DATEDIFF(SECOND, Branches.UnlockOn, GETDATE()) > 0)
AND (Branches.Complete = 1 OR #BranchSuitableProbeCount.SuitableProbes = Objectives.BranchSize * Objectives.EstimateCount * Objectives.ProbeCount)
) AS B
END
This is much quicker with an average execution time of 54ms compared to 6 seconds with the original one.
EDIT
Had a look and combined my ideas with those from RBarryYoung's solution. If you use the following to create the temporary table
SELECT SB.BranchID AS BranchID, COUNT(*) AS SuitableProbes
INTO #BranchSuitableProbeCount
FROM SpicieBranches AS SB
INNER JOIN Probes AS P ON P.SpicieID = SB.SpicieID
WHERE EXISTS(SELECT * FROM Results AS R WHERE R.ProbeID = P.ID)
GROUP BY SB.BranchID
then you can get this down to 15ms which is 400x better than we started with. Looking at the execution plan shows that there is a table scan happening on the temp table. Normally you avoid table scans as best you can but for 128 rows (in this case) it is quicker than whatever it was doing before.
This is basically a complete guess here, but in times past I've found that joining onto the results of a sub-query can be horrifically slow. That is, the subquery was being evaluated way too many times when it really didn't need to.
The way around this was to move the subqueries into CTEs and to join onto those instead. Good luck!
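The CTE suggestion above could be sketched like this, using the table and column names from the question and the EXISTS form of the probe count from the other answers. It is an illustration of the shape only, written as a plain SELECT rather than the full UPDATE. Note that SQL Server generally expands a CTE inline rather than materializing it, so this may well compile to the same plan as the derived-table version; the temp-table approach is what forces materialization.

```sql
-- Sketch only: the probe-count subquery moved into a CTE and joined by name.
WITH BranchSuitableProbes AS
(
    SELECT SB.BranchID, COUNT(*) AS SuitableProbes
    FROM SpicieBranches AS SB
    INNER JOIN Probes AS P ON P.SpicieID = SB.SpicieID
    WHERE EXISTS (SELECT * FROM Results AS R WHERE R.ProbeID = P.ID)
    GROUP BY SB.BranchID
)
SELECT B.ID, BSP.SuitableProbes
FROM Branches AS B
INNER JOIN BranchSuitableProbes AS BSP ON BSP.BranchID = B.ID;
```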
It appears the join on the two uniqueidentifier columns is the source of the problem. One has a clustered index, the other a non-clustered index (on the FK table). It is good that there are indexes on them. Unfortunately, GUIDs are notoriously poor performers when joining large numbers of rows.
As troubleshooting steps:
What state are the indexes in? When were the statistics last updated?
How does that subquery perform on its own when executed ad hoc? That is, when you run the statement below by itself, how fast does the result set return? Is that acceptable?
After rebuilding the two indexes and updating statistics, is there any measurable difference?
SELECT P.ID, 1 AS SuitableProbes FROM Probes AS P
INNER JOIN Results AS R ON P.ID = R.ProbeID
GROUP BY P.ID HAVING COUNT(R.ID) > 0
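The rebuild-and-update step from the checklist above might look like the following. The `dbo` schema is an assumption, and since the post does not name the two indexes, the sketch simply rebuilds every index on both tables.

```sql
-- Assumes both tables live in the dbo schema; index names are not given in
-- the post, so ALL is used to rebuild every index on each table.
ALTER INDEX ALL ON dbo.Probes REBUILD;
ALTER INDEX ALL ON dbo.Results REBUILD;

-- Refresh statistics on both tables by reading every row.
UPDATE STATISTICS dbo.Probes WITH FULLSCAN;
UPDATE STATISTICS dbo.Results WITH FULLSCAN;
```

Re-running the subquery and comparing its timing before and after gives the "measurable difference" the checklist asks about.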
The following runs about 15x faster on my system:
UPDATE B
SET B.LockedBy = @lockedBy,
B.LockedOn = @lockedOn,
B.UnlockOn = @unlockOn,
B.Complete = 1
FROM
(
SELECT TOP (@count) B.LockedBy, B.LockedOn, B.UnlockOn, B.Complete
FROM Objectives AS O
INNER JOIN Generations AS G ON G.ObjectiveID = O.ID
INNER JOIN Branches AS B ON B.GenerationID = G.ID
INNER JOIN
(
SELECT SB.BranchID AS BranchID, COUNT(*) AS SuitableProbes
FROM SpicieBranches AS SB
INNER JOIN Probes AS P ON P.SpicieID = SB.SpicieID
WHERE EXISTS(SELECT * FROM Results AS R WHERE R.ProbeID = P.ID)
GROUP BY SB.BranchID
) AS X ON X.BranchID = B.ID
WHERE
(O.Active = 1)
AND (B.Sealed = 0)
AND (B.GenerationNo < O.BranchGenerations)
AND (B.LockedBy IS NULL OR DATEDIFF(SECOND, B.UnlockOn, GETDATE()) > 0)
AND (B.Complete = 1 OR X.SuitableProbes = O.BranchSize * O.EstimateCount * O.ProbeCount)
) AS B
Insertion of sub query into local temporary table
SELECT SB.BranchID AS BranchID, SUM(X.SuitableProbes) AS SuitableProbes
into #temp FROM SpicieBranches AS SB
INNER JOIN Probes AS P ON P.SpicieID = SB.SpicieID
INNER JOIN
(
SELECT P.ID, 1 AS SuitableProbes
FROM Probes AS P
/* ----> */ INNER JOIN Results AS R ON P.ID = R.ProbeID /* SSMS Estimated execution plan says this operation is the roughest */
GROUP BY P.ID
HAVING COUNT(R.ID) > 0
) AS X ON P.ID = X.ID
GROUP BY SB.BranchID
The query below joins to pre-filtered subsets of each table instead of to the complete tables:
UPDATE B
SET B.LockedBy = @lockedBy,
B.LockedOn = @lockedOn,
B.UnlockOn = @unlockOn,
B.Complete = 1
FROM
(
SELECT TOP (@count) B.LockedBy, B.LockedOn, B.UnlockOn, B.Complete
FROM
(
SELECT ID, BranchGenerations, (BranchSize * EstimateCount * ProbeCount) AS MultipliedFactor
FROM Objectives AS O WHERE (O.Active = 1)
) O
INNER JOIN Generations AS G ON G.ObjectiveID = O.ID
INNER JOIN
(
SELECT Sealed, GenerationNo, GenerationID, LockedBy, LockedOn, UnlockOn, ID, Complete
FROM Branches AS B
WHERE B.Sealed = 0 AND (B.LockedBy IS NULL OR DATEDIFF(SECOND, B.UnlockOn, GETDATE()) > 0)
) B ON B.GenerationID = G.ID
INNER JOIN
(
SELECT * FROM #temp
) AS X ON X.BranchID = B.ID
WHERE
(B.GenerationNo < O.BranchGenerations)
AND (B.Complete = 1 OR X.SuitableProbes = O.MultipliedFactor)
) AS B

LINQ thinks I need an extra INNER JOIN, but why?

I have a LINQ query, which for some reason is generating an extra/duplicate INNER JOIN. This is causing the query to not return the expected output. If I manually comment that extra JOIN from the generated SQL, then I get seemingly correct output.
Can you detect what I might have done in this LINQ to have caused this extra JOIN?
Thanks.
Here is my approx LINQ
predicate=predicate.And(condition1);
predicate1=predicate1.And(condition2);
predicate1=predicate1.And(condition3);
predicate2=predicate2.Or(predicate1);
predicate=predicate.And(predicate2);
var ids = context.Code.Where(predicate);
var rs = from r in ids
group r by r.PersonID into g
let matchcount=g.Select(p => p.phonenumbers.PhoneNum).Distinct().Count()
where matchcount ==2
select new
{
personid = g.Key
};
and here is the generated SQL (the duplicate join is [t7])
Declare @p1 VarChar(10)='Home'
Declare @p2 VarChar(10)='111'
Declare @p3 VarChar(10)='Office'
Declare @p4 VarChar(10)='222'
Declare @p5 int=2
SELECT [t9].[PersonID] AS [pid]
FROM (
SELECT [t3].[PersonID], (
SELECT COUNT(*)
FROM (
SELECT DISTINCT [t7].[PhoneValue]
FROM [dbo].[Person] AS [t4]
INNER JOIN [dbo].[PersonPhoneNumber] AS [t5] ON [t5].[PersonID] = [t4].[PersonID]
INNER JOIN [dbo].[CodeMaster] AS [t6] ON [t6].[Code] = [t5].[PhoneType]
INNER JOIN [dbo].[PersonPhoneNumber] AS [t7] ON [t7].[PersonID] = [t4].[PersonID]
WHERE ([t3].[PersonID] = [t4].[PersonID]) AND ([t6].[Enumeration] = @p0) AND ((([t6].[CodeDescription] = @p1) AND ([t5].[PhoneValue] = @p2)) OR (([t6].[CodeDescription] = @p3) AND ([t5].[PhoneValue] = @p4)))
) AS [t8]
) AS [value]
FROM (
SELECT [t0].[PersonID]
FROM [dbo].[Person] AS [t0]
INNER JOIN [dbo].[PersonPhoneNumber] AS [t1] ON [t1].[PersonID] = [t0].[PersonID]
INNER JOIN [dbo].[CodeMaster] AS [t2] ON [t2].[Code] = [t1].[PhoneType]
WHERE ([t2].[Enumeration] = @p0) AND ((([t2].[CodeDescription] = @p1) AND ([t1].[PhoneValue] = @p2)) OR (([t2].[CodeDescription] = @p3) AND ([t1].[PhoneValue] = @p4)))
GROUP BY [t0].[PersonID]
) AS [t3]
) AS [t9]
WHERE [t9].[value] = @p5
They aren't being duplicated. You are asking for two different values from the data source.
let matchcount=g.Select(p => p.phonenumbers.PhoneNum).Distinct().Count()
is causing
SELECT COUNT(*)
FROM (
SELECT DISTINCT [t7].[PhoneValue]
FROM [dbo].[Person] AS [t4]
INNER JOIN [dbo].[PersonPhoneNumber] AS [t5] ON [t5].[PersonID] = [t4].[PersonID]
INNER JOIN [dbo].[CodeMaster] AS [t6] ON [t6].[Code] = [t5].[PhoneType]
INNER JOIN [dbo].[PersonPhoneNumber] AS [t7] ON [t7].[PersonID] = [t4].[PersonID]
WHERE ([t3].[PersonID] = [t4].[PersonID]) AND ([t6].[Enumeration] = @p0) AND ((([t6].[CodeDescription] = @p1) AND ([t5].[PhoneValue] = @p2)) OR (([t6].[CodeDescription] = @p3) AND ([t5].[PhoneValue] = @p4)))
) AS [t8]
and
from r in ids
group r by r.PersonID into g
is causing
SELECT [t0].[PersonID]
FROM [dbo].[Person] AS [t0]
INNER JOIN [dbo].[PersonPhoneNumber] AS [t1] ON [t1].[PersonID] = [t0].[PersonID]
INNER JOIN [dbo].[CodeMaster] AS [t2] ON [t2].[Code] = [t1].[PhoneType]
WHERE ([t2].[Enumeration] = @p0) AND ((([t2].[CodeDescription] = @p1) AND ([t1].[PhoneValue] = @p2)) OR (([t2].[CodeDescription] = @p3) AND ([t1].[PhoneValue] = @p4)))
GROUP BY [t0].[PersonID]
) AS [t3]
As for the INNER JOINs, you are getting them because of the relationships between those tables. For instance, Person is 1..1 (or 1..*) with PersonPhoneNumber. In either case I assume PersonID on PersonPhoneNumber is both an FK and a PK value, so the data source has to go out to that external table to see whether the value for the PersonPhoneNumber navigation property actually exists. It does this by performing an INNER JOIN on that table.
My gut feeling is that the .Distinct().Count() is treated separately by the LINQ to SQL translation.
I'd also wager that the execution plan on SQL just threw out the dupe.
Try to rewrite with an explicit condition instead of that abstract "predicate" construction. From what I see in the SQL, that composition might look weird to a parser in isolation, and the one join [t5] which you just called a dupe :-) is there to serve that condition.
Also, try to tell us what it is that you really want to find with that query, and try to write normal SQL that does what you want. I'm supposed to be human :-) and it looks weird to me as well :-))
Technically speaking, you forced the double join by using a condition in two separate queries (every var assignment is technically a separate query).
Also, doing a GROUP BY on a column without any aggregation is not always equivalent to SELECT DISTINCT. In particular, SELECT DISTINCT on a join is allowed to take precedence over the join: queries are declarative (they can undergo reordering) and you were trying to force this one to be procedural. So LINQ gave you exactly the procedural version :-) and then SQL reordered it according to SQL rules :-))
So, just write normal SQL first, and if you can't LINQ-ize it, put it into a sproc. It's going to make it faster anyway :-)
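As an illustration of the "write normal SQL first" advice, here is one guess at what the query is meant to do: find the people who match both phone conditions. The intent is inferred from the LINQ and the generated SQL, not stated by the asker. The parameter values are copied from the question, except `@p0`, which the question never declares, so its value here is a placeholder.

```sql
-- @p0 is not declared in the question; this value is a placeholder.
DECLARE @p0 VARCHAR(20) = 'PhoneType';
DECLARE @p1 VARCHAR(10) = 'Home';
DECLARE @p2 VARCHAR(10) = '111';
DECLARE @p3 VARCHAR(10) = 'Office';
DECLARE @p4 VARCHAR(10) = '222';

-- Keep only the phone rows that satisfy either condition, then require
-- two distinct matching numbers per person.
SELECT ppn.PersonID
FROM dbo.PersonPhoneNumber AS ppn
INNER JOIN dbo.CodeMaster AS cm ON cm.Code = ppn.PhoneType
WHERE cm.Enumeration = @p0
  AND ((cm.CodeDescription = @p1 AND ppn.PhoneValue = @p2)
    OR (cm.CodeDescription = @p3 AND ppn.PhoneValue = @p4))
GROUP BY ppn.PersonID
HAVING COUNT(DISTINCT ppn.PhoneValue) = 2;
```

Unlike the generated SQL, the distinct count here runs over the filtered phone rows rather than over every phone number the person has. That difference is what the extra [t7] join introduces.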