I am trying to find a SQL statement to replace an old SQL query. In short, I need to express a left join using only WHERE conditions.
Here is my test environment:
create table Mst
(
Id bigint not null primary key clustered,
Firstname nvarchar(200) not null,
Lastname nvarchar(200) not null
);
create table Dtl
(
Id bigint not null primary key clustered,
MstId bigint not null references Mst(Id),
DetailDescr nvarchar(500) not null
);
I fill the tables with some data:
declare @i as bigint = 0;
while @i < 999
begin
    insert into Mst values (@i, N'Name ' + Str(@i), N'Lastname ' + Str(@i));
    if (@i % 10 = 0)
        insert into Dtl values (@i*5+0, @i, N'Description 1 for ' + Str(@i));
    if (@i % 2 = 0)
        insert into Dtl values (@i*5+1, @i, N'Description 2 for ' + Str(@i));
    if (@i % 3 = 0)
        insert into Dtl values (@i*5+2, @i, N'Description 3 for ' + Str(@i));
    set @i = @i + 1;
end;
The usual way for a left join is this:
select m.Id, m.Firstname, m.Lastname, d.DetailDescr
From Mst m left join Dtl d
on m.id = d.MstId;
This query returns 1266 rows. But in the old application that I am migrating, the SELECT and FROM parts are predefined:
select m.Id, m.Firstname, m.Lastname, d.DetailDescr
From Mst m, Dtl d
The old WHERE condition, which is defined in a separate software module, uses a LEFT JOIN operator that is no longer available:
where m.id *= d.MstId
So we have to migrate that approach, modifying only the WHERE condition if possible. For an inner join, the WHERE condition is easy to define:
where m.id = d.MstId
But I need a left join, and I cannot find a way to get one by modifying only the WHERE condition. Rewriting only the WHERE condition would be the best approach in this particular application.
Thanks in advance for your ideas.
Once upon a time, SQL did not support outer join syntax. It was an ancient world, where telephones were connected by wires to walls, where countries in Europe each had their own currencies, and most Americans watched one of three or four major networks on television.
At that time, Microsoft did not even have a real database. But Sybase offered an outer join operator in the WHERE clause, *=, which Microsoft eventually adopted in SQL Server. SQL Server supported it through SQL Server 2008 R2 (from SQL Server 2005 on, only in databases set to compatibility level 80). Hence, no currently supported version of SQL Server supports outer joins in the WHERE clause.
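Happily, a much better standard syntax now exists (lest we be despondent and think that things do not get better over time). The "comma operator" in the FROM clause is relegated to its original definition, a CROSS JOIN. A CROSS JOIN pairs every row of one table with every row of the other, and the WHERE clause can only remove pairs; it can never add back a row of Mst that has no partner in Dtl. For instance, if Dtl has no rows, then the CROSS JOIN returns no rows, whatever the WHERE clause says.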
That is, there is no way to do what you want generically in the WHERE clause. There are queries that can replicate an outer JOIN, but they require much more surgery to the query. But there is a good alternative, which is to write your queries with the correct, modern syntax.
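To make the "much more surgery" concrete, here is a minimal sketch of one such replacement: the comma join with an equality condition, plus a UNION ALL branch that adds the unmatched Mst rows back with a NULL detail. It runs the question's test data through SQLite via Python's sqlite3 module, which is an assumption made only so the example is self-contained; SQL Server itself is not involved, and the column types are simplified.

```python
import sqlite3

# SQLite stands in for SQL Server; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Mst (Id INTEGER PRIMARY KEY, Firstname TEXT NOT NULL, Lastname TEXT NOT NULL);
    CREATE TABLE Dtl (Id INTEGER PRIMARY KEY,
                      MstId INTEGER NOT NULL REFERENCES Mst(Id),
                      DetailDescr TEXT NOT NULL);
""")

# Same fill pattern as the question's WHILE loop (i = 0 .. 998).
for i in range(999):
    conn.execute("INSERT INTO Mst VALUES (?, ?, ?)", (i, f"Name {i}", f"Lastname {i}"))
    if i % 10 == 0:
        conn.execute("INSERT INTO Dtl VALUES (?, ?, ?)", (i * 5 + 0, i, f"Description 1 for {i}"))
    if i % 2 == 0:
        conn.execute("INSERT INTO Dtl VALUES (?, ?, ?)", (i * 5 + 1, i, f"Description 2 for {i}"))
    if i % 3 == 0:
        conn.execute("INSERT INTO Dtl VALUES (?, ?, ?)", (i * 5 + 2, i, f"Description 3 for {i}"))

# The modern syntax the answer recommends.
left_join_sql = """
    SELECT m.Id, m.Firstname, m.Lastname, d.DetailDescr
    FROM Mst m LEFT JOIN Dtl d ON m.Id = d.MstId
"""

# The predefined SELECT/FROM with a plain equality: this is only an inner join.
comma_join_sql = """
    SELECT m.Id, m.Firstname, m.Lastname, d.DetailDescr
    FROM Mst m, Dtl d
    WHERE m.Id = d.MstId
"""

# Emulation: inner join rows, plus every Mst row that has no Dtl row at all.
emulated_sql = comma_join_sql + """
    UNION ALL
    SELECT m.Id, m.Firstname, m.Lastname, NULL
    FROM Mst m
    WHERE NOT EXISTS (SELECT 1 FROM Dtl d WHERE d.MstId = m.Id)
"""

left_rows = conn.execute(left_join_sql).fetchall()
inner_rows = conn.execute(comma_join_sql).fetchall()
emulated_rows = conn.execute(emulated_sql).fetchall()
print(len(inner_rows), len(left_rows), len(emulated_rows))
```

The emulation needs a whole second SELECT appended to the statement, which is exactly why it cannot be delivered by a module that is only allowed to supply the WHERE condition.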
I need help in optimizing this SQL query.
In the main SELECT statement there are three columns that depend on the outer query result. This is why my query is taking a long time to return data. I have tried using left joins instead, but could not get them to work properly.
Can anyone help me to resolve this issue?
SELECT
DISTINCT ou.OrganizationUserID AS StudentID,
ou.FirstName,
ou.LastName,
(
SELECT
STRING_AGG(
(ug.UG_Name),
','
)
FROM
Groups ug
INNER JOIN ApplicantUserGroup augm ON augm.AUGM_UserGroupID = ug.UG_ID
WHERE
augm.AUGM_OrganizationUserID = ou.OrganizationUserID
AND ug.UG_IsDeleted = 0
AND augm.AUGM_IsDeleted = 0
) AS UserGroups,
order1.OrderNumber AS OrderId -- UAT-2455
,
(
SELECT
STRING_AGG(
(CActe.CustomAttribute),
','
)
FROM
CustomAttributeCte CActe
WHERE
CActe.HierarchyNodeID = dpm.DPM_ID
AND CActe.OrganizationUserID = ps.OrganizationUserID
) AS CustomAttributes -- UAT-2455
,
(
SELECT
STRING_AGG(
(CActe.CustomAttributeID),
','
)
FROM
CustomAttributeCte CActe
WHERE
CActe.HierarchyNodeID = dpm.DPM_ID
AND CActe.OrganizationUserID = ps.OrganizationUserID
) AS CustomAttributeID
FROM
ApplicantData acd WITH (NOLOCK)
INNER JOIN ClientPackage ps WITH (NOLOCK) ON acd.ClientSubscriptionID = ps.ClientSubscriptionID
INNER JOIN [ClientOrder] order1 WITH (NOLOCK) ON order1.OrderID = ps.OrderID
AND order1.IsDeleted = 0
INNER JOIN OUser ou WITH (NOLOCK) ON ou.OrganizationUserID = ps.OrganizationUserID
It looks like this query can be simplified, and the dependent subqueries in your SELECT clause removed. Consider your second and third dependent subqueries. You can refactor them into one nondependent subquery with a LEFT JOIN. Using nondependent subqueries is more efficient because the query planner can run them just once, rather than once for each row.
You want two STRING_AGG() results from the same table. This subquery gives those two outputs for every possible combination of HierarchyNodeID and OrganizationUserID values. STRING_AGG() is an aggregate function like SUM() and so works nicely with GROUP BY.
SELECT HierarchyNodeID, OrganizationUserID,
STRING_AGG((CActe.CustomAttribute), ',') CustomAttributes, -- UAT-2455
STRING_AGG((CActe.CustomAttributeID), ',') CustomAttributeIDs -- UAT-2455
FROM CustomAttributeCte CActe
GROUP BY HierarchyNodeID, OrganizationUserID
You can run this subquery itself to convince yourself it works.
Now, we can LEFT JOIN that into your query. Like this. (For readability I took out the NOLOCKs and used JOIN: it means the same thing as INNER JOIN.)
SELECT DISTINCT
ou.OrganizationUserID AS StudentID,
ou.FirstName,
ou.LastName,
'tempvalue' AS UserGroups, -- shortened for testing
order1.OrderNumber AS OrderId, -- UAT-2455
uat2455.CustomAttributes, -- UAT-2455
uat2455.CustomAttributeIDs -- UAT-2455
FROM ApplicantData acd
JOIN ClientPackage ps
ON acd.ClientSubscriptionID = ps.ClientSubscriptionID
JOIN ClientOrder order1
ON order1.OrderID = ps.OrderID
AND order1.IsDeleted = 0
JOIN OUser ou
ON ou.OrganizationUserID = ps.OrganizationUserID
LEFT JOIN (
SELECT HierarchyNodeID, OrganizationUserID,
STRING_AGG((CActe.CustomAttribute), ',') CustomAttributes, -- UAT-2455
STRING_AGG((CActe.CustomAttributeID), ',') CustomAttributeIDs -- UAT-2455
FROM CustomAttributeCte CActe
GROUP BY HierarchyNodeID, OrganizationUserID
) uat2455
ON uat2455.HierarchyNodeID = dpm.DPM_ID
AND uat2455.OrganizationUserId = ps.OrganizationUserID
See how we collapsed your second and third dependent subqueries to just one, then used it as a virtual table with LEFT JOIN? We transformed the WHERE clauses from the dependent subqueries into an ON clause.
You can test this: run it with TOP(50) and eyeball the results.
When you're happy, the next step is to transform your first dependent subquery the same way.
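If you want to convince yourself that the two forms return the same thing before touching the real query, a small stand-in is enough. The sketch below is an assumption-laden miniature: it uses SQLite through Python's sqlite3 module, hypothetical tables Users and Attr, and SQLite's group_concat() in place of STRING_AGG(). It compares the dependent-subquery form with the pre-aggregated LEFT JOIN form.

```python
import sqlite3

# Hypothetical miniature schema; group_concat() plays the role of STRING_AGG().
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (UserID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Attr  (UserID INTEGER, Attribute TEXT);
    INSERT INTO Users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Attr  VALUES (1, 'red'), (1, 'blue'), (2, 'green');
""")

# Dependent form: the subquery refers to the current outer row (u.UserID).
correlated = conn.execute("""
    SELECT u.UserID,
           (SELECT group_concat(a.Attribute, ',')
              FROM Attr a
             WHERE a.UserID = u.UserID) AS Attributes
    FROM Users u
    ORDER BY u.UserID
""").fetchall()

# Nondependent form: aggregate once per UserID, then LEFT JOIN the result.
joined = conn.execute("""
    SELECT u.UserID, agg.Attributes
    FROM Users u
    LEFT JOIN (SELECT UserID, group_concat(Attribute, ',') AS Attributes
                 FROM Attr
                GROUP BY UserID) agg
           ON agg.UserID = u.UserID
    ORDER BY u.UserID
""").fetchall()


def normalize(rows):
    """Sort each aggregated list, because concatenation order is not guaranteed."""
    return [(uid, sorted(attrs.split(",")) if attrs is not None else None)
            for uid, attrs in rows]


print(normalize(correlated))
print(normalize(joined))
```

The user with no attributes comes back with NULL in both forms, which is why the join has to be a LEFT JOIN rather than an inner one.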
Pro tip: Don't use WITH (NOLOCK), ever, unless a database administration expert tells you to after looking at your specific query. If your query's purpose is a historical report and you don't care whether the most recent transactions in your database are represented exactly right, you can instead precede your query with the statement below, which lets the query run without waiting on locks.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
Pro tip: Be obsessive about formatting your queries for readability. You, your colleagues, and yourself a year from now must be able to read and reason about queries like this.
I created a procedure that dynamically collects records from various projects (databases) into a temporary table, and then inserts from that temporary table into a permanent table, filtered by a WHERE clause. Unfortunately, when I checked the execution plan, I found that this part of the query accounts for much of the load. How can I optimize the INSERT or the WHERE clause?
INSERT INTO dbo.PROJECTS_TESTS ( PROJECTID, ANOTHERTID, DOMAINID, is_test)
SELECT * FROM #temp_Test AS tC
WHERE NOT EXISTS (SELECT TOP 1 1
FROM dbo.PROJECTS_TESTS AS ps WITH (NOLOCK)
WHERE ps.PROJECTID = tC.projectId
AND ps.ANOTHERTID = tC.anotherLink
AND ps.DOMAINID = tC.DOMAINID
AND ps.is_test = tC.test_project
)
I think you'd be better served by doing a JOIN than EXISTS. Depending on the cardinality of your join condition (currently in your WHERE) you might need DISTINCT in there too.
INSERT INTO dbo.PROJECTS_TESTS ( PROJECTID, ANOTHERTID, DOMAINID, is_test)
-- DISTINCT is only needed if #temp_Test can contain duplicate rows
SELECT DISTINCT tC.projectId, tC.anotherLink, tC.DOMAINID, tC.test_project
FROM #temp_Test AS tC
LEFT OUTER JOIN dbo.PROJECTS_TESTS AS ps
  ON ps.PROJECTID = tC.projectId
 AND ps.ANOTHERTID = tC.anotherLink
 AND ps.DOMAINID = tC.DOMAINID
 AND ps.is_test = tC.test_project
WHERE ps.PROJECTID IS NULL -- keep only rows with no match in the target
or something like that
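To check that the LEFT JOIN ... IS NULL form selects the same rows as the original NOT EXISTS form, here is a minimal sketch on a stand-in schema. Everything in it is an assumption for illustration: SQLite via Python's sqlite3 module instead of SQL Server, and hypothetical tables target and staging with two key columns instead of four.

```python
import sqlite3

# Hypothetical stand-ins: target ~ dbo.PROJECTS_TESTS, staging ~ #temp_Test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target  (project_id INTEGER, domain_id INTEGER);
    CREATE TABLE staging (project_id INTEGER, domain_id INTEGER);
    INSERT INTO target  VALUES (1, 100), (2, 200);
    INSERT INTO staging VALUES (2, 200), (3, 300), (3, 300), (4, 400);
""")

# Original style: keep staging rows for which no matching target row exists.
not_exists_sql = """
    SELECT DISTINCT s.project_id, s.domain_id
    FROM staging s
    WHERE NOT EXISTS (SELECT 1
                        FROM target t
                       WHERE t.project_id = s.project_id
                         AND t.domain_id  = s.domain_id)
"""

# Answer's style: outer join, then keep rows where the target side came back NULL.
left_join_sql = """
    SELECT DISTINCT s.project_id, s.domain_id
    FROM staging s
    LEFT JOIN target t
           ON t.project_id = s.project_id
          AND t.domain_id  = s.domain_id
    WHERE t.project_id IS NULL
"""

missing_a = sorted(conn.execute(not_exists_sql).fetchall())
missing_b = sorted(conn.execute(left_join_sql).fetchall())

# Insert only the missing rows, then look at the final content of the target.
conn.execute("INSERT INTO target (project_id, domain_id) " + left_join_sql)
target_rows = sorted(conn.execute("SELECT project_id, domain_id FROM target").fetchall())
print(missing_a, missing_b, target_rows)
```

Both forms pick the same rows here, so which one is faster on the real tables is a question for the execution plan and the indexes on the four compared columns, not for the syntax.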
I've been struggling with an elegant solution for this for a while, and thought I'd finally cracked it but am now getting the error
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
Which is frustrating me!
In essence the query is:
select
u.username + ' ' + u.surname,
CASE WHEN ugt.type = 'Contract'
THEN
(
select sum(dbo.GET_INVOICE_WEEKLY_AVERAGE_VALUE(pc.placementid, u.UserId))
from PlacementConsultants pc
where pc.UserId = u.UserId
and pc.CommissionPerc >= 80
)
END
from usergradetypes ugt
inner join usergrades ug on ug.gradeid = ugt.usergradetypeid
inner join users u on u.userid = ug.userid
The function GET_INVOICE_WEEKLY_AVERAGE_VALUE is as follows
ALTER function [dbo].[GET_INVOICE_WEEKLY_AVERAGE_VALUE]( @placementid INT, @userid INT )
RETURNS numeric(9,2)
AS
BEGIN
DECLARE @retVal numeric(9,2)
DECLARE @rollingweeks int
SET @rollingweeks = (select ISNULL(rollingweeks,26) FROM UserGradeTypes ugt
inner join UserGrades ug on ugt.UserGradeTypeID = ug.gradeid
WHERE ug.userid = @userid)
SELECT @retVal =
sum(dbo.GET_INVOICE_NET_VALUE(id.InvoiceId)) / @rollingweeks from PlacementInvoices pli
inner join invoicedetails id on id.invoiceid = pli.InvoiceId
where pli.PlacementID = @placementid
and pli.CreatedOn between DATEADD(ww,-@rollingweeks,getdate()) and GETDATE()
RETURN @retVal
END
The query runs fine without the sum but when I'm trying to sum the value of the deals, it's falling over (which I need to do for a summary page)
I do not know why this fails:
select sum(dbo.GET_INVOICE_WEEKLY_AVERAGE_VALUE(pc.placementid, u.UserId))
but this works:
select sum(dbo.GET_INVOICE_WEEKLY_AVERAGE_VALUE(pc.placementid, pc.UserId))
It is curious and seems like a bug to me.
The error message, though, suggests that all the columns inside the sum() need to come from either the outer referenced tables or the inner referenced tables, but not both. I don't understand the reason for this. My best guess is that mixing the two types of references confuses the optimizer.
I haven't seen this error message before, by the way.
EDIT:
It is very easy to reproduce, and does not require a function call:
with t as (select 1 as col)
select t.*,
(select sum(t2.col + t.col) from t t2) as newcol
from t;
Very interesting. I think this might violate the standard. The equivalent query does run on Oracle.
I have a stored procedure that I have developed on a SQL2008 server that runs <1sec. On another server which is SQL2005 the same sp on the same database takes ~1minute. Without going into the details of the database schema can anyone see anything obvious in this SP that may cause this performance discrepancy? Could it be the use of the CTE? Is there an alternative?
EDIT - I have now noticed that if I run the SQL directly on SQL 2005 it runs in ~4secs but executing the SP still takes over a minute?? Looks like the problem may lie in the SP execution??
CREATE PROCEDURE Workflow.GetTopTasks
-- Add the parameters for the stored procedure here
@ownerUserId int,
@topN int = 10
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
SET ROWCOUNT @topN;
-- Insert statements for procedure here
WITH cteCalculatedDate (MilestoneDateId, CalculatedMilestoneDate)
AS
(
-- Anchor member definition
SELECT md.MilestoneDateId, md.SpecifiedDate
FROM Workflow.MilestoneDate md
WHERE md.RelativeMilestoneDateId IS NULL
UNION ALL
-- Recursive member definition
SELECT md.MilestoneDateId, CalculatedMilestoneDate + md.RelativeDays
FROM Workflow.MilestoneDate md
INNER JOIN cteCalculatedDate cte
on md.RelativeMilestoneDateId = cte.MilestoneDateId
)
-- Statement that executes the CTE
select
we.*
from Workflow.WorkflowElement we
left outer join cteCalculatedDate cte
on cte.MilestoneDateId = we.DueDateId
inner join Workflow.WorkflowInstance wi
on wi.WorkflowInstanceId = we.WorkflowInstanceId
left outer join Workflow.SchemeWorkflow sw
on sw.WorkflowInstanceId = wi.WorkflowInstanceId
left outer join Workflow.Scheme s
on s.SchemeId = sw.SchemeId
inner join Workflow.WorkflowDefinition wd
on wd.WorkflowDefinitionId = wi.WorkflowDefinitionId
where
we.OwnerId = @ownerUserId -- for given owner
and we.CompletedDate is null -- is not completed
and we.ElementTypeId <= 4 -- is Action, Data, Decision or Document (Not End, Start or KeyDate)
and cte.CalculatedMilestoneDate is not null -- has a duedate
UNION
select
we.*
from Workflow.WorkflowElement we
left outer join cteCalculatedDate cte
on cte.MilestoneDateId = we.DueDateId
inner join Workflow.WorkflowInstance wi
on wi.WorkflowInstanceId = we.WorkflowInstanceId
left outer join Workflow.SchemeWorkflow sw
on sw.WorkflowInstanceId = wi.WorkflowInstanceId
left outer join Workflow.Scheme s
on s.SchemeId = sw.SchemeId
inner join Workflow.WorkflowDefinition wd
on wd.WorkflowDefinitionId = wi.WorkflowDefinitionId
where
we.OwnerId = @ownerUserId -- for given owner
and we.CompletedDate is null -- is not completed
and we.ElementTypeId <= 4 -- is Action, Data, Decision or Document (Not End, Start or KeyDate)
and cte.CalculatedMilestoneDate is null -- does NOT have a duedate
SET ROWCOUNT 0
END
EDIT - I have now noticed that if I
run the SQL directly on SQL 2005 it
runs in ~4secs but executing the SP
still takes over a minute??
Bad parameter sniffing then:
http://elegantcode.com/2008/05/17/sql-parameter-sniffing-and-what-to-do-about-it/
SQL poor stored procedure execution plan performance - parameter sniffing
Parameter sniffing was bad in 2005, but better in 2008.
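Your UNION selects the rows where CalculatedMilestoneDate IS NOT NULL and then the rows where it IS NULL.
Together the two branches cover every row, so the UNION is redundant: you can remove it entirely by dropping the condition on CalculatedMilestoneDate from the WHERE clause. (UNION also removes duplicate rows, so add DISTINCT to the single query if you depend on that.)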
Other than that, you should verify that both databases have the same indexes defined.
-- Statement that executes the CTE
select
we.*
from Workflow.WorkflowElement we
left outer join cteCalculatedDate cte
on cte.MilestoneDateId = we.DueDateId
inner join Workflow.WorkflowInstance wi
on wi.WorkflowInstanceId = we.WorkflowInstanceId
left outer join Workflow.SchemeWorkflow sw
on sw.WorkflowInstanceId = wi.WorkflowInstanceId
left outer join Workflow.Scheme s
on s.SchemeId = sw.SchemeId
inner join Workflow.WorkflowDefinition wd
on wd.WorkflowDefinitionId = wi.WorkflowDefinitionId
where
we.OwnerId = @ownerUserId -- for given owner
and we.CompletedDate is null -- is not completed
and we.ElementTypeId <= 4 -- is Action, Data, Decision or Document (Not End, Start or KeyDate)
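The claim that the two UNION branches add up to the unfiltered query is easy to check on a toy table. This is only a sketch under stated assumptions: SQLite via Python's sqlite3 module, and a hypothetical table element with a nullable due_date column standing in for the joined CalculatedMilestoneDate.

```python
import sqlite3

# Hypothetical stand-in: due_date plays the role of cte.CalculatedMilestoneDate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE element (id INTEGER PRIMARY KEY, owner_id INTEGER, due_date TEXT);
    INSERT INTO element VALUES
        (1, 7, '2010-01-01'),
        (2, 7, NULL),
        (3, 8, NULL),
        (4, 7, '2010-02-01');
""")

# Two complementary branches glued together, as in the stored procedure.
union_rows = sorted(conn.execute("""
    SELECT id, owner_id, due_date FROM element
    WHERE owner_id = 7 AND due_date IS NOT NULL
    UNION
    SELECT id, owner_id, due_date FROM element
    WHERE owner_id = 7 AND due_date IS NULL
""").fetchall())

# One query without any condition on due_date.
single_rows = sorted(conn.execute("""
    SELECT id, owner_id, due_date FROM element
    WHERE owner_id = 7
""").fetchall())

print(union_rows == single_rows)
```

On this toy table the rows are unique, so the duplicate removal that UNION performs makes no difference; on the real query that depends on whether the joins can multiply rows of WorkflowElement.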
If the schemas match, then perhaps important indexes are missing on the SQL Server 2005 instance. Try running the Database Engine Tuning Advisor and applying its index recommendations.
SQL newbie here :)
Here are my tables if anyone's interested.
AHH, cant post image yet
http://img832.imageshack.us/img832/442/72098588.jpg
What I'm trying to do is query the tblPatientStats table within a date interval (@StartDate, @EndDate)
and group them accordingly in a data grid on winforms.
So each row in tblPatientStats has either a RefDoctor or a RefMode, or both, or neither.
So the query should return a table with the Name of the patient from tblPatient, the RefMode from tblRefMode, the Name of the RefDoctor (Title + FirstName + lastName) and SessionDate from tblPatientStats
==> yfrog dot com/0yhi2dj
Here is my attempt so far.
INSERT #Final(Name, Doctor, Mode, SessionDate)
SELECT DISTINCT (FirstNames + LastName) as Name,
(tblRefDoctor.RefDTitle + ' ' + tblRefDoctor.RefDFNames + ' ' + tblRefDoctor.RefDName) AS Doctor,
tblRefMode.RefMode AS Mode, SessionDate
FROM tblPatientStats, tblPatient
left outer join tblRefDoctor on (RefDoctor = tblRefDoctor.RefDoctor)
left outer join tblRefMode on (RefModeID = tblRefMode.RefModeID)
WHERE
tblPatientStats.RefDoctor IS NOT NULL or tblPatientStats.RefModeID IS NOT NULL
AND
tblPatient.PatientID = tblPatientStats.PatientID
AND tblPatientStats.SessionDate between @StartDate AND @EndDate
What am I doing wrong? The query times out every single time, the tables are small, less than 10K records each.
Any help would be much appreciated.
I suspect the issue is because of the cartesian join on
tblPatientStats, tblPatient
Whilst there is a join condition in the WHERE clause, there is an issue with the precedence of the boolean operators. The order of precedence is NOT, then AND, then OR, so I think you need brackets around the ORed conditions.
The WHERE condition on the original query with brackets applied to show the effective operator precedence is
WHERE
tblPatientStats.RefDoctor IS NOT NULL or
(tblPatientStats.RefModeID IS NOT NULL
AND tblPatient.PatientID = tblPatientStats.PatientID
AND tblPatientStats.SessionDate between @StartDate AND @EndDate)
This is almost certainly not the desired semantics and will likely bring back too many rows.
I've moved the join condition between tblPatientStats and tblPatient up into the JOIN clauses and added brackets to the ORed conditions.
FROM tblPatientStats
inner join tblPatient on tblPatient.PatientID = tblPatientStats.PatientID
left outer join tblRefDoctor on tblPatientStats.RefDoctor = tblRefDoctor.RefDoctor
left outer join tblRefMode on tblPatientStats.RefModeID = tblRefMode.RefModeID
WHERE
(tblPatientStats.RefDoctor IS NOT NULL or tblPatientStats.RefModeID IS NOT NULL)
AND tblPatientStats.SessionDate between @StartDate AND @EndDate
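The precedence point can be seen in isolation with three boolean flags. The sketch below is an illustration only, using SQLite through Python's sqlite3 module and a hypothetical table flags that holds every combination of a, b and c; AND binds tighter than OR in SQLite just as it does in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per combination of three 0/1 flags.
conn.execute("CREATE TABLE flags (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO flags VALUES (?, ?, ?)",
    [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)],
)


def select_where(condition):
    """Return the flag combinations that satisfy the given WHERE condition."""
    sql = f"SELECT a, b, c FROM flags WHERE {condition} ORDER BY a, b, c"
    return conn.execute(sql).fetchall()


unbracketed = select_where("a = 1 OR b = 1 AND c = 1")    # as written in the question
and_first = select_where("a = 1 OR (b = 1 AND c = 1)")    # how it is actually parsed
or_first = select_where("(a = 1 OR b = 1) AND c = 1")     # what was intended

print(unbracketed == and_first, len(unbracketed), len(or_first))
```

In the question's query, a corresponds to the RefDoctor test, which means any row with a RefDoctor passes the filter regardless of the PatientID match and the date range, so the cartesian product of the two tables is barely reduced.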