Query including subquery and group by slower than expected - sql

The whole query below runs incredibly slowly.
The subquery [alias stage_1] takes only 1.37 minutes and returns 9514 records; however, the whole query takes over 20 minutes and returns 2606 records.
I could use a #temp table to hold the subquery to improve the performance, but I would prefer not to.
An overview of the query: table WeeklySpace inner joins to the Spaceblock_Name_to_PG table on SpaceblockName_SID; this cuts down the results in WeeklySpace and adds PG_Code to them. WeeklySpace is then full outer joined to Sales_PG_Wk across 3 fields. The where clause narrows the results, and may be changed. The results from the subquery are then summed. The final summing cannot be done in the subquery because of the group by and sum over used there.
I believe the issue is that the subquery is recalculated repeatedly during the group by in the final summing. The field SpaceblockName_SID also appears to be involved, as without it the run time with a group by in the subquery isn't affected.
I have read through loads of suggestions and tried them all to resolve the issue.
These include:
Adding TOP 2147483647 with Order by to force intermediate materialization, both in the subquery and using a CTE.
Adding a join after stage_1.
Casting SpaceblockName_SID from an int to a varchar and back again.
The execution plans (cut in two parts, shown below the code) for the subquery and the whole query appear similar. The cost is around the Full Outer Join (Hash Match), which I expected.
The query is running on SQL Server 2005 (T-SQL).
Any help greatly appreciated!
select
Cost_centre
, Fin_week
, SpaceblockName_SID
, sum(Propor_rep_SRV) as Total_SpaceblockName_SID_SRV
from
(
select
coalesce(space_side.fin_week , sales_side.fin_week) as Fin_week
,coalesce(space_side.cost_centre , sales_side.cost_Centre) as Cost_centre
,space_side.SpaceblockName_SID
,case
when space_side.SpaceblockName_SID is null
then sales_side.SalesExVAT
else sum(space_side.TLM)
/nullif(sum (sum(space_side.TLM) ) over (partition by coalesce(space_side.fin_week , sales_side.fin_week)
, coalesce(space_side.cost_centre , sales_side.cost_Centre)
, coalesce( Spaceblock_Name_to_PG.PG_Code, sales_side.PG_Code)) ,0)*sales_side.SalesExVAT
end as Propor_rep_SRV
from
WeeklySpace as space_side
INNER JOIN
Spaceblock_Name_to_PG
ON space_side.SpaceblockName_SID = Spaceblock_Name_to_PG.SpaceblockName_SID
and Spaceblock_Name_to_PG.PG_Code < 10000
full outer join
sales_pg_wk as sales_side
on space_side.fin_week = sales_side.fin_week
and space_side.Cost_Centre = sales_side.Cost_Centre
and Spaceblock_Name_to_PG.PG_code = sales_side.pg_code
where
coalesce(space_side.fin_week, sales_side.fin_week) between 201538 and 201550
and
coalesce(space_side.cost_centre, sales_side.cost_Centre) in (3, 2800)
group by
coalesce(space_side.fin_week, sales_side.fin_week)
,coalesce(space_side.cost_centre, sales_side.cost_Centre)
,coalesce( Spaceblock_Name_to_PG.PG_Code, sales_side.PG_Code)
,sales_side.SalesExVAT
,space_side.SpaceblockName_SID
) as stage_1
group by
Cost_centre
, Fin_week
, SpaceblockName_SID
Execution plan left hand side
Execution plan right hand side

You didn't mention whether indexes exist on the columns used in your query. If they don't, create them and check the performance of the query again.

Looking at your logic, I think you could split this in two with a UNION:
one part with Spaceblock_Name_to_PG.PG_Code < 10000 and the other with Spaceblock_Name_to_PG.PG_Code >= 10000.
Also consider the change below.
The query may be doing a bunch of joins whose rows you are going to throw out anyway.
full outer join sales_pg_wk as sales_side
on space_side.fin_week = sales_side.fin_week
and space_side.Cost_Centre = sales_side.Cost_Centre
and Spaceblock_Name_to_PG.PG_code = sales_side.pg_code
and space_side.fin_week between 201538 and 201550
and sales_side.fin_week between 201538 and 201550
and space_side.cost_centre in (3, 2800)
and sales_side.cost_Centre in (3, 2800)

Related

Improve Query Performance, Adding where clause grinds query to a halt

The following SQL runs in around 0.338s.
After adding one more condition to the where clause, the query times out. All I want to achieve is a list of test results for a particular test_code.
Result_Set will have many Test_Results on the index Result_Set_Row_ID
Date_Received_Index will have many Result_Sets on the index Result_Set_Row_ID
I have tried altering the order of JOINS, adding clauses to the join statements.
SELECT
Date_Received_Index.Registration_Number,
Date_Received_Index.Specimen_Number,
Result,
Result_Comment,
Result_Comment_Exp ,
Result_Exp,
Short_Exp,
Test_Code,
Test_Exp,
Test_Row_ID,
Units,
Result_Set.Set_Code ,
Result_Set.Date_Time_Authorised,
Result_Set.Date_Booked_In ,
Date_Received_Index.Discipline,
Date_Received_Index.Namespace
FROM
Result_Set
INNER JOIN Test_Result ON Result_Set.Result_Set_Row_ID = Test_Result.Result_Set_Row_ID
INNER JOIN Date_Received_Index ON (Date_Received_Index.Request_Row_ID = Result_Set.Request_Row_ID)
WHERE
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1 AND
Date_Received_Index.Namespace = 'CHM'
adding a WHERE clause e.g.
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1 AND
Date_Received_Index.Namespace = 'CHM'
AND Test_Code = 'K'
results in the query timing out
I would like to be able to construct an SQL statement that is performant and just selects the test_code specified in the where clause.
This comes down to the Query Plan. Can you share the query plan?
My suspicion is that the column Test_Code is not indexed and the addition to the WHERE clause is causing the optimizer to select the wrong query plan.
I think the SQL optimizer is not able to optimize the portion
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1
Without knowing your schema, my question would be: of the three columns used in the where clause (Date_Received_Index.Date_Received, Date_Received_Index.Namespace, and Test_Code), are any indexed? You already indicated Test_Code is not.
Depending on what version of Caché you are using you might try
SELECT
Date_Received_Index.Registration_Number,
Date_Received_Index.Specimen_Number,
Result,
Result_Comment,
Result_Comment_Exp ,
Result_Exp,
Short_Exp,
Test_Code,
Test_Exp,
Test_Row_ID,
Units,
Result_Set.Set_Code ,
Result_Set.Date_Time_Authorised,
Result_Set.Date_Booked_In ,
Date_Received_Index.Discipline,
Date_Received_Index.Namespace
FROM %PARALLEL
Result_Set
INNER JOIN Test_Result ON Result_Set.Result_Set_Row_ID = Test_Result.Result_Set_Row_ID
INNER JOIN Date_Received_Index ON (Date_Received_Index.Request_Row_ID = Result_Set.Request_Row_ID)
WHERE
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1
AND Date_Received_Index.Namespace = 'CHM'
AND Test_Code = 'K'
Use of %PARALLEL can cause the query to be run using multiple threads. If the server has a large number of CPUs it may run faster even if it's not optimized.

SQL query gets data very slowly from different tables

I am writing a SQL query to get data from different tables, but it runs very slowly.
It takes over 2 minutes to complete.
Here is what I am doing:
1. I am getting date differences, and based on the date difference I am getting account numbers.
2. I am comparing tables to get the exact data I need.
Here is my query:
select T.accountno,
MAX(T.datetxn) as MxDt,
datediff(MM,MAX(T.datetxn), '2011-6-30') as Diffs,
max(P.Name) as POName
from Account_skd A,
AccountTxn_skd T,
POName P
where A.AccountNo = T.AccountNo and
GPOCode = A.OfficeCode and
Code = A.POCode and
A.servicecode = T.ServiceCode
group by T.AccountNo
order by len(T.AccountNo) DESC
Please help me use joins or any other approach to get the data in much less time, say 5-10 seconds.
Since it appears you are getting EVERY ACCOUNT, and performance is slow, I would try by creating a prequery by just account, then do a single join to the other join tables something like..
select
T.Accountno,
T.MxDt,
datediff(MM, T.MxDt, '2011-6-30') as Diffs,
P.Name as POName
from
( select T1.AccountNo,
Max( T1.DateTxn ) MxDt
from AccountTxn_skd T1
group by T1.AccountNo ) T
JOIN Account_skd A
on T.AccountNo = A.AccountNo
JOIN POName P
on A.POCode = P.Code -- GUESSING as you didn't qualify alias.field
AND A.OfficeCode = P.GPOCode -- in your query for these two fields
order by
len(T.AccountNo) DESC
You had other elements based on the T.ServiceCode matching, but since you are only grouping on the account number anyhow, did it matter which service code was used? Otherwise, you would need to group by both the account AND service code (which I would have added the service code into the prequery and added as join condition to the account table too).

Timeout running SQL query

I'm trying to use the aggregation features of the django ORM to run a query on a MSSQL 2008R2 database, but I keep getting a timeout error. The query (generated by django) which fails is below. I've tried running it directly in SQL Server Management Studio and it works, but takes 3.5 min.
It does look like it's aggregating over a bunch of fields which it doesn't need to, but I wouldn't have thought that should really cause it to take that long. The database isn't that big either: auth_user has 9 records, tickets_ticket has 1210, and tickets_ticket_watchers has 1876. Is there something I'm missing?
SELECT
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined],
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined]
HAVING
(COUNT([tickets_ticket].[id]) > 0 OR COUNT(T3.[id]) > 0 )
EDIT:
Here are the relevant indexes (excluding those not used in the query):
auth_user.id (PK)
auth_user.username (Unique)
tickets_ticket.id (PK)
tickets_ticket.capturer_id
tickets_ticket.responsible_id
tickets_ticket_watchers.id (PK)
tickets_ticket_watchers.user_id
tickets_ticket_watchers.ticket_id
EDIT 2:
After a bit of experimentation, I've found that the following query is the smallest that results in the slow execution:
SELECT
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id]
The weird thing is that if I comment out any two lines in the above, it runs in less than 1s, but it doesn't seem to matter which lines I remove (although obviously I can't remove a join without also removing the relevant SELECT line).
EDIT 3:
The python code which generated this is:
User.objects.annotate(
Count('tickets_captured'),
Count('assigned_tickets'),
Count('tickets_watched')
)
A look at the execution plan shows that SQL Server is first doing a cross-join on all the tables, resulting in about 280 million rows and 6 GB of data. I assume that this is where the problem lies, but why is it happening?
SQL Server is doing exactly what it was asked to do. Unfortunately, Django is not generating the right query for what you want. It looks like you need to count distinct instead of just count; see "Django annotate() multiple times causes wrong answers".
As for why the query works that way: the query says to join the four tables together. So if a user has 2 captured tickets, 3 assigned tickets, and 4 watched tickets, the join will return 2*3*4 = 24 rows, one for each combination of tickets. The distinct part will remove all the duplicates.
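A minimal sketch of that row multiplication, using Python's built-in sqlite3 rather than SQL Server. The table and column names follow the question; the sample rows and the aliases u, t1, t3 and w are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE auth_user (id INTEGER PRIMARY KEY);
    CREATE TABLE tickets_ticket (
        id INTEGER PRIMARY KEY, capturer_id INTEGER, responsible_id INTEGER);
    CREATE TABLE tickets_ticket_watchers (
        id INTEGER PRIMARY KEY, ticket_id INTEGER, user_id INTEGER);
    INSERT INTO auth_user (id) VALUES (1);
    -- hypothetical data: user 1 captured tickets 1-2 and is responsible for tickets 3-5
    INSERT INTO tickets_ticket (id, capturer_id, responsible_id) VALUES
        (1, 1, NULL), (2, 1, NULL), (3, NULL, 1), (4, NULL, 1), (5, NULL, 1);
    -- hypothetical data: user 1 watches tickets 1-4
    INSERT INTO tickets_ticket_watchers (id, ticket_id, user_id) VALUES
        (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1);
""")

# Same join shape as the generated query: three independent one-to-many joins.
JOINS = """
    FROM auth_user u
    LEFT JOIN tickets_ticket t1 ON u.id = t1.capturer_id
    LEFT JOIN tickets_ticket t3 ON u.id = t3.responsible_id
    LEFT JOIN tickets_ticket_watchers w ON u.id = w.user_id
    GROUP BY u.id
"""

# Plain COUNT counts every row of the 2 x 3 x 4 combination.
plain = conn.execute(
    "SELECT COUNT(t1.id), COUNT(t3.id), COUNT(w.ticket_id)" + JOINS
).fetchone()

# COUNT(DISTINCT ...) collapses the repeated ids again.
distinct = conn.execute(
    "SELECT COUNT(DISTINCT t1.id), COUNT(DISTINCT t3.id), "
    "COUNT(DISTINCT w.ticket_id)" + JOINS
).fetchone()

print(plain)     # (24, 24, 24)
print(distinct)  # (2, 3, 4)
```

COUNT(DISTINCT ...) fixes the answers but the engine still has to build every combination first, so on larger tables the intermediate result can remain expensive.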
what about this?
SELECT auth_user.*,
C1.tickets_captured__count,
C2.assigned_tickets__count,
C3.tickets_watched__count
FROM
auth_user
LEFT JOIN
( SELECT capturer_id, COUNT(*) AS tickets_captured__count
FROM tickets_ticket GROUP BY capturer_id ) AS C1 ON auth_user.id = C1.capturer_id
LEFT JOIN
( SELECT responsible_id, COUNT(*) AS assigned_tickets__count
FROM tickets_ticket GROUP BY responsible_id ) AS C2 ON auth_user.id = C2.responsible_id
LEFT JOIN
( SELECT user_id, COUNT(*) AS tickets_watched__count
FROM tickets_ticket_watchers GROUP BY user_id ) AS C3 ON auth_user.id = C3.user_id
WHERE C1.tickets_captured__count > 0 OR C2.assigned_tickets__count > 0
--WHERE C1.tickets_captured__count is not null OR C2.assigned_tickets__count is not null -- also works (I think with better performance)

SQL SUM function doubling the amount it should using multiple tables

My query below is doubling the amount on the last record it returns. I have 3 tables - activities, bookings and tempbookings. The query needs to list the activities and attached information and pull the total number (using the SUM) of places booked (as BookingTotal) from the booking table by each activity and then it needs to calculate the same for tempbookings (as tempPlacesReserved) providing the reservedate field inside that table is in the future.
However, the first issue is that if there are no records for an activity in the tempbookings table, it does not return any records for that activity at all. To get around this I created dummy records in the past so that it still returns the record, but if I can make it so I don't have to do this I would prefer it!
The main issue I have is that on the final record of the returned results it doubles the booking total and the places reserved which of course makes the whole query useless.
I know that I am doing something wrong I just haven't been able to sort it, I have searched similar issues online but am unable to apply them to my situation correctly.
Any help would be appreciated.
P.S. I'm aware that normally you wouldn't need to fully label all the paths to the databases, tables and fields as I have but for the program I am planning to use it in I have to do it this way.
Code:
SELECT [LeisureActivities].[dbo].[activities].[activityID],
[LeisureActivities].[dbo].[activities].[activityName],
[LeisureActivities].[dbo].[activities].[activityDate],
[LeisureActivities].[dbo].[activities].[activityPlaces],
[LeisureActivities].[dbo].[activities].[activityPrice],
SUM([LeisureActivities].[dbo].[bookings].[bookingPlaces]) AS 'bookingTotal',
SUM (CASE WHEN[LeisureActivities].[dbo].[tempbookings].[tempReserveDate] > GetDate() THEN [LeisureActivities].[dbo].[tempbookings].[tempPlaces] ELSE 0 end) AS 'tempPlacesReserved'
FROM [LeisureActivities].[dbo].[activities],
[LeisureActivities].[dbo].[bookings],
[LeisureActivities].[dbo].[tempbookings]
WHERE ([LeisureActivities].[dbo].[activities].[activityID]=[LeisureActivities].[dbo].[bookings].[activityID]
AND [LeisureActivities].[dbo].[activities].[activityID]=[LeisureActivities].[dbo].[tempbookings].[tempActivityID])
AND [LeisureActivities].[dbo].[activities].[activityDate] > GetDate ()
GROUP BY [LeisureActivities].[dbo].[activities].[activityID],
[LeisureActivities].[dbo].[activities].[activityName],
[LeisureActivities].[dbo].[activities].[activityDate],
[LeisureActivities].[dbo].[activities].[activityPlaces],
[LeisureActivities].[dbo].[activities].[activityPrice];
Your current query is using an INNER JOIN between each of the tables so if the tempBookings table has no records, you will not return anything.
I would advise that you start to use JOIN syntax. You might also need to use subqueries to get the totals.
SELECT a.[activityID],
a.[activityName],
a.[activityDate],
a.[activityPlaces],
a.[activityPrice],
coalesce(b.bookingTotal, 0) bookingTotal,
coalesce(t.tempPlacesReserved, 0) tempPlacesReserved
FROM [LeisureActivities].[dbo].[activities] a
LEFT JOIN
(
select activityID,
SUM([bookingPlaces]) AS bookingTotal
from [LeisureActivities].[dbo].[bookings]
group by activityID
) b
ON a.[activityID]=b.[activityID]
LEFT JOIN
(
select tempActivityID,
SUM(CASE WHEN [tempReserveDate] > GetDate() THEN [tempPlaces] ELSE 0 end) AS tempPlacesReserved
from [LeisureActivities].[dbo].[tempbookings]
group by tempActivityID
) t
ON a.[activityID]=t.[tempActivityID]
WHERE a.[activityDate] > GetDate();
Note: I am using aliases because it is easier to read
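A small sketch of why summing inside each subquery before joining avoids inflated totals, run with Python's built-in sqlite3 instead of SQL Server. The sample rows are invented, and the date conditions from the question are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activities (activityID INTEGER PRIMARY KEY, activityName TEXT);
    CREATE TABLE bookings (activityID INTEGER, bookingPlaces INTEGER);
    CREATE TABLE tempbookings (tempActivityID INTEGER, tempPlaces INTEGER);
    -- hypothetical data
    INSERT INTO activities VALUES (1, 'Swimming'), (2, 'Climbing');
    -- activity 1: one booking, two temp bookings
    -- activity 2: one booking, no temp bookings
    INSERT INTO bookings VALUES (1, 5), (2, 3);
    INSERT INTO tempbookings VALUES (1, 2), (1, 1);
""")

# Join first, sum afterwards: the single booking row of activity 1 is
# repeated once per matching tempbookings row, so its total is doubled.
joined_then_summed = conn.execute("""
    SELECT a.activityID, SUM(b.bookingPlaces), COALESCE(SUM(t.tempPlaces), 0)
    FROM activities a
    JOIN bookings b ON b.activityID = a.activityID
    LEFT JOIN tempbookings t ON t.tempActivityID = a.activityID
    GROUP BY a.activityID
    ORDER BY a.activityID
""").fetchall()

# Sum inside each subquery first: every subquery returns at most one row
# per activity, so the joins cannot multiply rows.
summed_then_joined = conn.execute("""
    SELECT a.activityID,
           COALESCE(b.bookingTotal, 0),
           COALESCE(t.tempPlacesReserved, 0)
    FROM activities a
    LEFT JOIN (SELECT activityID, SUM(bookingPlaces) AS bookingTotal
               FROM bookings GROUP BY activityID) b
           ON b.activityID = a.activityID
    LEFT JOIN (SELECT tempActivityID, SUM(tempPlaces) AS tempPlacesReserved
               FROM tempbookings GROUP BY tempActivityID) t
           ON t.tempActivityID = a.activityID
    ORDER BY a.activityID
""").fetchall()

print(joined_then_summed)  # [(1, 10, 3), (2, 3, 0)]
print(summed_then_joined)  # [(1, 5, 3), (2, 3, 0)]
```

Activity 1 has 5 booked places, but the join-then-sum form reports 10 because it has two temp bookings; activity 2 is still returned with zero reserved places without needing any dummy rows.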
Use the newer SQL-92 join syntax, and make the join to tempbookings an outer join. Also clean up your SQL with table aliases, which makes it easier to read. As to why the last row has doubled values, I don't know, but on the off chance that it is caused by the extra dummy records you entered, get rid of them; using an outer join to tempbookings removes the need for them. The other possibility is that the join condition you had to the tempbookings table (t.tempActivityID = a.activityID) is insufficient to guarantee that each row matches only one record in the activities table. If, for example, it matches two records in activities, then the rows from tempbookings would be repeated twice in the output, causing the sum to be doubled.
SELECT a.activityID, a.activityName, a.activityDate,
a.activityPlaces, a.activityPrice,
SUM(b.bookingPlaces) bookingTotal,
SUM (CASE WHEN t.tempReserveDate > GetDate()
THEN t.tempPlaces ELSE 0 end) tempPlacesReserved
FROM LeisureActivities.dbo.activities a
Join LeisureActivities.dbo.bookings b
On b.activityID = a.activityID
Left Join LeisureActivities.dbo.tempbookings t
On t.tempActivityID = a.activityID
WHERE a.activityDate > GetDate ()
GROUP BY a.activityID, a.activityName,
a.activityDate, a.activityPlaces,
a.activityPrice;

Improvement in SQL Query - Access - Join / IN / Exists

This is my SQL query, which I am using in Access. It provides the desired result.
But I just wanted an opinion on whether the approach is correct.
How can this be sped up?
SELECT INVDETAILS2.F5
, INVDETAILS2.F16
, ExpectedResult.DLID
, ExpectedResult.NumRows
FROM INVDETAILS2
INNER
JOIN (INVDL INNER JOIN ExpectedResult ON INVDL.DLID =ExpectedResult.DLID)
ON (INVDETAILS2.F14 = ROUND(ExpectedResult.Total))
AND (INVDETAILS2.F1 = INVDL.RegionCode)
WHERE INVDETAILS2.F29 ='2013-03-06'
AND INVDETAILS2.F5 IN (SELECT INVDETAILS2.F5
FROM (ExpectedResult
INNER JOIN INVDL
ON ExpectedResult.DLID = INVDL.DLID)
INNER JOIN INVDETAILS2
ON INVDL.RegionCode = INVDETAILS2.F1
AND round(ExpectedResult.Total)
= INVDETAILS2.F14
WHERE INVDETAILS2.F29='2013-03-06'
GROUP BY INVDETAILS2.F5
HAVING Count(ExpectedResult.DLID)<2
)
;
Approximate Number of Rows in
"ExpectedResult" - Millions
"INVDL" - 80,000
"INVDETAILS" - 300,000 in total; approx. 10,000 for one date, and fewer again for each region per date.
Please provide a better query if possible.
Two things you could investigate that might help speed things up:
Indexing
Make sure that you have indexed all of the columns involved in JOINs, WHERE clauses, and GROUP BY clauses.
JOIN expressions involving functions
A couple of your JOINs use Round(ExpectedResult.Total), so if you have an index on ExpectedResult.Total your query won't be able to use it. You may get a performance boost if you add a RoundedTotal column (Long Integer, Indexed), populate it with
UPDATE [ExpectedResult] SET [RoundedTotal]=Round([Total])
and then use the RoundedTotal column in your JOINs.
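The effect can be sketched with Python's built-in sqlite3. This is SQLite, not Access, so the planner and its output differ; the index names idx_total and idx_rounded and the sample rows are assumptions made for the illustration. The point it shows is the general one: a predicate on Round(Total) cannot seek into an index on Total, while a predicate on a stored, indexed RoundedTotal column can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ExpectedResult (
        DLID INTEGER PRIMARY KEY, Total REAL, RoundedTotal INTEGER);
    CREATE INDEX idx_total ON ExpectedResult (Total);
    CREATE INDEX idx_rounded ON ExpectedResult (RoundedTotal);
    -- hypothetical data
    INSERT INTO ExpectedResult (DLID, Total) VALUES (1, 9.6), (2, 10.2), (3, 14.4);
    -- same idea as the UPDATE suggested above: store the rounded value once
    UPDATE ExpectedResult SET RoundedTotal = ROUND(Total);
""")

def plan(sql):
    """Return the 'detail' text of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

# Function applied to the column: every row has to be visited (a SCAN).
on_function = plan("SELECT DLID FROM ExpectedResult WHERE ROUND(Total) = 10")

# Plain indexed column: the planner can seek into the index (a SEARCH).
on_column = plan("SELECT DLID FROM ExpectedResult WHERE RoundedTotal = 10")

matches = conn.execute(
    "SELECT DLID FROM ExpectedResult WHERE RoundedTotal = 10 ORDER BY DLID"
).fetchall()

print(on_function)
print(on_column)
print(matches)  # [(1,), (2,)]
```

The exact plan wording varies between SQLite versions, so the sketch only relies on the first line starting with SCAN for the function predicate and SEARCH for the stored column. Note also that Access's Round and SQLite's ROUND treat exact halves differently, which is why the sample values avoid them.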