SELECT projectID, urlID, COUNT(1) AS totalClicks, projectPage,
       (SELECT COUNT(1)
        FROM tblStatSessionRoutes, tblStatSessions
        WHERE tblStatSessionRoutes.statSessionID = tblStatSessions.ID
          AND tblStatSessions.projectID = tblAdClicks.projectID
          AND (tblStatSessionRoutes.leftPageID = tblAdClicks.projectPage OR
               tblStatSessionRoutes.rightPageID = tblAdClicks.projectPage)) AS totalViews
FROM tblAdClicks
WHERE projectID IN (SELECT projectID FROM tblProjects WHERE userID = 5)
GROUP BY projectID, urlID, projectPage
ORDER BY CASE projectID WHEN 170 THEN 1 ELSE 0 END, projectID
This is by no means an especially complex query, but because the database is normalised to a good level, and we are dealing with a significant amount of data, this query can be quite slow for the user.
Does anyone have tips on how to improve its speed? If I strategically denormalise parts of the database, would this help? Will running it in a stored proc offer significant improvements?
The way I handle the data in my code is efficient; the bottleneck really is this query.
Thanks!
De-normalising your database should be a last resort since (to choose just one reason) you don't want to encourage data inconsistencies which de-normalisation will allow.
First thing is to see if you can get some clues from the query execution plan. It could be, for example, that your sub-selects are costing too much, and would be better done first into temp tables which you then JOIN in your main query.
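For example, here is a minimal sketch of that temp-table idea against the query above (untested; SQL Server 2008+ syntax assumed, ORDER BY trimmed for brevity; note it counts a route twice if its left and right page are ever the same, a caveat another answer raises below):
-- Pre-aggregate page views per project/page once, treating the left
-- and right pages of each route as separate rows:
SELECT ss.projectID, x.pageID, COUNT(1) AS totalViews
INTO #pageViews
FROM tblStatSessionRoutes r
JOIN tblStatSessions ss ON r.statSessionID = ss.ID
CROSS APPLY (VALUES (r.leftPageID), (r.rightPageID)) AS x(pageID)
GROUP BY ss.projectID, x.pageID;

-- Then JOIN to the pre-aggregated counts instead of running the
-- correlated subquery once per group:
SELECT a.projectID, a.urlID, COUNT(1) AS totalClicks, a.projectPage,
       ISNULL(MAX(pv.totalViews), 0) AS totalViews
FROM tblAdClicks a
LEFT JOIN #pageViews pv
       ON pv.projectID = a.projectID
      AND pv.pageID = a.projectPage
WHERE a.projectID IN (SELECT projectID FROM tblProjects WHERE userID = 5)
GROUP BY a.projectID, a.urlID, a.projectPage;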
Also, if you see lots of table-scans, you could benefit from improved indexes.
If you haven't already, you should spend a few minutes re-formatting your query for readability. It's amazing how often the obvious optimisation will jump out at you while doing this.
I would try to break up that
projectID IN (SELECT projectID FROM tblProjects WHERE userID = 5)
and use a JOIN instead:
SELECT
    a.projectID, a.urlID, COUNT(1) AS totalClicks, a.projectPage,
    (SELECT COUNT(1) ....) AS totalViews
FROM
    dbo.tblAdClicks a
INNER JOIN
    dbo.tblProjects p ON a.projectID = p.projectID
WHERE
    p.userID = 5
GROUP BY
    a.projectID, a.urlID, a.projectPage
ORDER BY
    CASE a.projectID WHEN 170 THEN 1 ELSE 0 END, a.projectID
Not sure just how much this will help - should help a bit, I hope!
Other than that, I would check that you have indices on the relevant columns, e.g. on a.projectID (to help with the JOIN), and maybe on a.urlID and a.projectPage (to help with the GROUP BY).
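For example (the index names and column choices are guesses on my part, not tested against your schema):
-- Covering the JOIN and the GROUP BY on tblAdClicks:
CREATE INDEX IX_tblAdClicks_Project_Url_Page
    ON dbo.tblAdClicks (projectID, urlID, projectPage);

-- Helping the correlated subquery's join and page comparisons:
CREATE INDEX IX_tblStatSessionRoutes_Session_Pages
    ON dbo.tblStatSessionRoutes (statSessionID, leftPageID, rightPageID);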
If your dbms has a tool that explains its query plan, use that first. (Your first correlated subquery might be running once per row.) Then make sure every column referenced in a WHERE clause has an index.
This subquery--WHERE projectID IN (SELECT projectID FROM tblProjects WHERE userID = 5)--can surely benefit from being cut and implemented as a view. Then join to the view.
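A minimal sketch of that (the view name is illustrative, and the totalViews subquery is omitted for brevity):
CREATE VIEW vwUserProjects AS
    SELECT userID, projectID
    FROM tblProjects;

-- Join to the view instead of using IN (...):
SELECT ac.projectID, ac.urlID, COUNT(1) AS totalClicks, ac.projectPage
FROM tblAdClicks ac
INNER JOIN vwUserProjects up ON up.projectID = ac.projectID
WHERE up.userID = 5
GROUP BY ac.projectID, ac.urlID, ac.projectPage;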
It's not unusual to treat clickstream data as a data warehouse application. If you need to go that route, I'd usually implement a separate data warehouse rather than denormalize a well-designed OLTP database.
I doubt that running it as a stored proc will help you.
I would try to remove the correlated subquery (the inner (SELECT COUNT(1) ...)). Having to join against your session routes where either the left page or the right page matches makes things a bit tricky. Something along the lines of (but I haven't tested this):
SELECT tblAdClicks.projectID, tblAdClicks.urlID, COUNT(1) AS totalClicks, tblAdClicks.projectPage,
       SUM(CASE WHEN leftRoute.statSessionID IS NOT NULL
                  OR rightRoute.statSessionID IS NOT NULL THEN 1 ELSE 0 END) AS totalViews
FROM tblAdClicks
JOIN tblProjects ON tblProjects.projectID = tblAdClicks.projectID
LEFT JOIN tblStatSessions ON tblStatSessions.projectID = tblAdClicks.projectID
LEFT JOIN tblStatSessionRoutes leftRoute
       ON leftRoute.statSessionID = tblStatSessions.ID
      AND leftRoute.leftPageID = tblAdClicks.projectPage
LEFT JOIN tblStatSessionRoutes rightRoute
       ON rightRoute.statSessionID = tblStatSessions.ID
      AND rightRoute.rightPageID = tblAdClicks.projectPage
WHERE tblProjects.userID = 5
GROUP BY tblAdClicks.projectID, tblAdClicks.urlID, tblAdClicks.projectPage
ORDER BY CASE tblAdClicks.projectID WHEN 170 THEN 1 ELSE 0 END, tblAdClicks.projectID
If I were to add cache tables to help this, I'd try, as indicated, to reduce the two probes of tblStatSessionRoutes (one for the left page, one for the right) to a single query. If you know that leftPageID will never equal rightPageID, it should be possible to use a trigger to populate an additional table with the left and right views as separate rows, for example.
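A minimal sketch of that trigger idea (untested; the cache table name and column types are made up):
-- Hypothetical cache table: one row per (session, page) view.
CREATE TABLE tblStatSessionPageViews (
    statSessionID INT NOT NULL,
    pageID        INT NOT NULL
);
GO
CREATE TRIGGER trgStatSessionRoutes_AfterInsert
ON tblStatSessionRoutes
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Split each new route into two rows, one per side, so view counts
    -- become a single equality join on pageID.
    INSERT INTO tblStatSessionPageViews (statSessionID, pageID)
    SELECT statSessionID, leftPageID FROM inserted
    UNION ALL
    SELECT statSessionID, rightPageID FROM inserted;
END;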
Related
I have the following query to return a list of current employees and the number of 'corrections' they have. This is working correctly but is very slow.
I was previously not using a subquery, instead opting for a count (from...) as an aggregate subselect, but I have read that a subquery should be much faster. Changing the code to the below did improve performance, but not anywhere near as much as I was expecting.
SELECT DISTINCT
tblStaff.StaffID, CorrectionsOut.Count AS CorrectionsAssigned
FROM tblStaff
LEFT JOIN tblMeetings ON tblMeetings.StaffID = tblStaff.StaffID
JOIN tblTasks ON tblTasks.TaskID = tblMeetings.TaskID
--Get Corrections Issued
LEFT JOIN(
SELECT
COUNT(DISTINCT tblMeetings.TaskID) AS Count, tblMeetings.StaffID
FROM tblRegister
JOIN tblMeetings ON tblRegister.MeetingID = tblMeetings.MeetingID
WHERE tblRegister.FDescription IS NOT NULL
AND tblRegister.CorrectionOutDate IS NULL
GROUP BY tblMeetings.StaffID
) AS CorrectionsOut ON CorrectionsOut.StaffID = tblStaff.StaffID
WHERE tblStaff.CurrentEmployee = 1
I need a vendor-neutral solution, as we are transitioning from SQL Server to Postgres. Note this is a simplified example of the query; the real one has quite a few counts. My current query time without the counts is less than half a second, but with the counts it is approx 20 seconds, if it runs at all without locking or otherwise failing.
I would get rid of the joins that you are not using, which probably makes the SELECT DISTINCT unnecessary as well:
SELECT s.StaffID, co.Count AS CorrectionsAssigned
FROM tblStaff s LEFT JOIN
(SELECT COUNT(DISTINCT m.TaskID) AS Count, m.StaffID
FROM tblRegister r JOIN
     tblMeetings m
     ON r.MeetingID = m.MeetingID
WHERE r.FDescription IS NOT NULL AND
r.CorrectionOutDate IS NULL
GROUP BY m.StaffID
) co
ON co.StaffID = s.StaffID
WHERE s.CurrentEmployee = 1;
Getting rid of the SELECT DISTINCT and the duplicate rows added by the tasks should help performance.
For additional benefit, you would want to be sure you have indexes on the JOIN keys, and perhaps on the filtering criteria.
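For example (hypothetical index names; this syntax is valid in both SQL Server and Postgres):
-- The JOIN keys:
CREATE INDEX ix_tblmeetings_meeting_staff
    ON tblMeetings (MeetingID, StaffID);

-- The filtering criteria, as a filtered (partial) index matching
-- the subquery's WHERE clause:
CREATE INDEX ix_tblregister_meeting_open
    ON tblRegister (MeetingID)
    WHERE FDescription IS NOT NULL AND CorrectionOutDate IS NULL;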
I wrote this view when a deadline was looming.
WITH AllCategories
AS (SELECT CaseTable.CaseID,
CT.Category,
CT.CategoryType,
Q.Note AS CategoryCaseNote,
Q.CategoryID,
Q.CategoryIsDefaultValue
FROM CaseTable
INNER JOIN
((SELECT CaseID, -- Filled categories in table
CategoryCaseNote AS Note,
CategoryID,
-1 AS QuestionID,
0 AS CategoryIsDefaultValue
FROM CaseCategory)
UNION ALL
(SELECT -1 AS CaseID, -- possible categories
NULL AS Note,
CategoryID AS CategoryID,
QuestionID,
1 AS CategoryIsDefaultValue
FROM SHOW_QuestionCategory)) AS Q
ON (Q.QuestionID = -1
OR Q.QuestionID = CaseTransactionTable.QuestionID)
AND (Q.CaseID = -1
OR Q.CaseID = CaseTable.CaseTransactionID)
LEFT OUTER JOIN
CategoryTable AS CT
ON Q.CategoryID = CT.CategoryID)
SELECT A.*
FROM AllCategories AS A
INNER JOIN
(SELECT CaseID,
CategoryID,
MIN(CategoryIsDefaultValue) AS CategoryIsDefaultValue
FROM AllCategories
GROUP BY CaseID, CategoryID) AS B
ON A.CaseID = B.CaseID
AND A.CategoryID = B.CategoryID
AND A.CategoryIsDefaultValue = B.CategoryIsDefaultValue
Now it's becoming a bottleneck because of the very expensive join between CaseTable and the subquery with the UNION (it accounts for over 30% of the cost of a frequently used procedure; in the execution plan it's a nested-loops node with ~70% of the SELECT's cost).
I have tried to rewrite it multiple times, but those attempts only resulted in worse performance.
The table CaseCategory has a unique index on the tuple (CaseID, CategoryID).
It's probably a combination of bad cardinality estimates and the use of a CTE. With what you've told us, I can only give some general guidance; the info you provided on the index means nothing without knowing the cardinality and distribution of the data. BTW, not sure if this qualifies as an answer, but it's too long for a comment. Feel free to downvote :)
There is a stored procedure selecting from the view, am I correct? I also presume you have some WHERE clause somewhere, right?
In that case, get rid of the view altogether and move the code into the procedure. This will allow you to get rid of the CTE (which is most likely executed twice) and to save the intermediate results from what is now the CTE into a #temp table. It could also be beneficial to apply the same #temp-table strategy to the UNION ALL subquery.
Make sure to apply the WHERE predicates as soon as possible (SQL Server is usually good at pushing predicates down, but this combination of proc, view and CTE might confuse it).
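A minimal sketch of that #temp-table strategy (untested; @CaseID is a hypothetical parameter standing in for whatever your procedure filters on, and the join predicates are copied verbatim from the view above):
-- 1) Materialise the UNION ALL subquery once:
SELECT CaseID, Note, CategoryID, QuestionID, CategoryIsDefaultValue
INTO #Q
FROM ((SELECT CaseID, CategoryCaseNote AS Note, CategoryID,
              -1 AS QuestionID, 0 AS CategoryIsDefaultValue
       FROM CaseCategory)
      UNION ALL
      (SELECT -1 AS CaseID, NULL AS Note, CategoryID,
              QuestionID, 1 AS CategoryIsDefaultValue
       FROM SHOW_QuestionCategory)) AS Q;

-- 2) Materialise what is now the CTE, applying the predicate as
--    early as possible (join conditions as in the view above):
SELECT CaseTable.CaseID, CT.Category, CT.CategoryType,
       Q.Note AS CategoryCaseNote, Q.CategoryID, Q.CategoryIsDefaultValue
INTO #AllCategories
FROM CaseTable
INNER JOIN #Q AS Q
        ON (Q.QuestionID = -1 OR Q.QuestionID = CaseTransactionTable.QuestionID)
       AND (Q.CaseID = -1 OR Q.CaseID = CaseTable.CaseTransactionID)
LEFT OUTER JOIN CategoryTable AS CT
        ON Q.CategoryID = CT.CategoryID
WHERE CaseTable.CaseID = @CaseID;

-- 3) The final SELECT then reads the temp table twice instead of
--    re-evaluating the CTE:
SELECT A.*
FROM #AllCategories AS A
INNER JOIN (SELECT CaseID, CategoryID,
                   MIN(CategoryIsDefaultValue) AS CategoryIsDefaultValue
            FROM #AllCategories
            GROUP BY CaseID, CategoryID) AS B
        ON A.CaseID = B.CaseID
       AND A.CategoryID = B.CategoryID
       AND A.CategoryIsDefaultValue = B.CategoryIsDefaultValue;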
I have the below SQL query and would like an opinion on whether I can improve it, using temp tables or something else, or whether it is good enough as it is. Basically I am just feeding the result set from the inner query to the outer one.
SELECT S.SolutionID
,S.SolutionName
,S.Enabled
FROM dbo.Solution S
WHERE s.SolutionID IN (
SELECT DISTINCT sf.SolutionID
FROM dbo.SolutionToFeature sf
WHERE sf.SolutionToFeatureID IN (
SELECT sfg.SolutionToFeatureID
FROM dbo.SolutionFeatureToUsergroup SFG
WHERE sfg.UsergroupID IN (
SELECT UG.UsergroupID
FROM dbo.Usergroup UG
WHERE ug.SiteID = @SiteID
)
)
)
It's going to depend largely on the indexes you have on those tables. Since you are only selecting data out of the Solution table, you can put everything else in an exists clause, do some proper joins, and it should perform better.
The exists clause will allow you to remove the distinct you have on the SolutionToFeature table. Distinct causes a performance hit because it basically creates a temp table behind the scenes to check each record for uniqueness against the rest of the result set. You take a pretty big hit as your tables grow.
It will look something similar to what I have below, but without sample data or anything I can't tell if it's exactly right.
Select S.SolutionID, S.SolutionName, S.Enabled
From dbo.Solution S
Where Exists (
select 1
from dbo.SolutionToFeature sf
Inner Join dbo.SolutionFeatureToUsergroup SFG on sf.SolutionToFeatureID = SFG.SolutionToFeatureID
Inner Join dbo.UserGroup UG on sfg.UserGroupID = UG.UserGroupID
Where S.SolutionID = sf.SolutionID
and UG.SiteID = @SiteID
)
I have a situation where I have to join a table multiple times. Most of the joins need to be left joins, since some of the values are not available. How can I overcome the query's poor performance when joining multiple times?
The Scenario
Tables
[Project]: ProjectId Guid, Name VARCHAR(MAX).
[UDF]: EntityId Guid, EntityType Char(1), UDFCode Guid, UDFName varchar(20)
[UDFDetail]: UDFCode Guid, Description VARCHAR(MAX)
Relationship:
[Project].ProjectId - [UDF].EntityId
[UDFDetail].UDFCode - [UDF].UDFCode
The UDF table holds custom fields for projects, based on the UDFName column. The value for these fields, however, is stored in UDFDetail, in the Description column.
I have lots of custom columns for Project, and they are stored in the UDF table.
So for example, to get two fields for the project I do the following select:
SELECT
p.Name ProjectName,
ud1.Description Field1,
ud1.UDFCode Field1Id,
ud2.Description Field2,
ud2.UDFCode Field2Id
FROM
Project p
LEFT JOIN UDF u1 ON
u1.EntityId = p.ProjectId AND u1.UDFName='Field1'
LEFT JOIN UDFDetail ud1 ON
ud1.UDFCode = u1.UDFCode
LEFT JOIN UDF u2 ON
u2.EntityId = p.ProjectId AND u2.UDFName='Field2'
LEFT JOIN UDFDetail ud2 ON
ud2.UDFCode = u2.UDFCode
The Problem
Imagine the above select but joining with like 15 fields. In my query I have around 10 fields already and the performance is not very good. It is taking about 20 seconds to run. I have good indexes for these tables, so looking at the execution plan, it is doing only index seeks without any lookups. Regarding the joins, it needs to be left join, because Field 1 might not exist for that specific project.
The Question
Is there a more performant way to retrieve the data?
How would you do the query to retrieve 10 different fields for one project in a schema like this?
Your choices are pivot, explicit aggregation (with conditional functions), or the joins. If you have the appropriate indexes set up, the joins may be the fastest method.
The correct index would be UDF(EntityId, UDFName, UDFCode).
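As a statement (the index name is illustrative):
CREATE INDEX IX_UDF_EntityId_UDFName_UDFCode
    ON UDF (EntityId, UDFName, UDFCode);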
You can test if the group by is faster by running a query such as:
SELECT count(*)
FROM Project p LEFT JOIN
UDF u1
ON u1.EntityId = p.ProjectId LEFT JOIN
UDFDetail ud1
ON ud1.UDFCode = u1.UDFCode;
If this runs fast enough, then you can consider the group by approach.
You can try this very weird contraption (it does not look pretty, but it does a single set of outer joins). The intermediate result is a very "wide" and "long" dataset, which we can then "compact" with aggregation (for example, for each ProjectName, each Field1 column will have N results, N-1 NULLs and one non-null value, which is then picked out with a simple MAX aggregation) [N is the number of fields].
select ProjectName,
       max(Field1) as Field1, max(Field1Id) as Field1Id,
       max(Field2) as Field2, max(Field2Id) as Field2Id
from (
select
p.Name as ProjectName,
case when u.UDFName='Field1' then ud.Description else NULL end as Field1,
case when u.UDFName='Field1' then ud.UDFCode else NULL end as Field1Id,
case when u.UDFName='Field2' then ud.Description else NULL end as Field2,
case when u.UDFName='Field2' then ud.UDFCode else NULL end as Field2Id
from Project p
left join UDF u on p.ProjectId=u.EntityId
left join UDFDetail ud on u.UDFCode=ud.UDFCode
) tmp
group by ProjectName
The query can actually be rewritten without the inner query, but that should not make a big difference :), and looking at Gordon Linoff's suggestion and your answer, it might actually take just about 20 seconds as well, but it is still worth a try.
What more can I do to optimize this query?
SELECT * FROM
(SELECT `item`.itemID, COUNT(`votes`.itemID) AS `votes`,
        `item`.title, `item`.itemTypeID, `item`.submitDate,
        `item`.deleted, `item`.ItemCat, `item`.counter,
        `item`.userID, `users`.name,
        TIMESTAMPDIFF(minute, `submitDate`, NOW()) AS 'timeMin',
        `myItems`.userID as userIDFav, `myItems`.deleted as myDeleted
FROM (votes `votes` RIGHT OUTER JOIN item `item`
ON (`votes`.itemID = `item`.itemID))
INNER JOIN
users `users`
ON (`users`.userID = `item`.userID)
LEFT OUTER JOIN
myItems `myItems`
ON (`myItems`.itemID = `item`.itemID)
WHERE (`item`.deleted = 0)
GROUP BY `item`.itemID,
`votes`.itemID,
`item`.title,
`item`.itemTypeID,
`item`.submitDate,
`item`.deleted,
`item`.ItemCat,
`item`.counter,
`item`.userID,
`users`.name,
`myItems`.deleted,
`myItems`.userID
ORDER BY `item`.itemID DESC) as myTable
where myTable.userIDFav = 3 or myTable.userIDFav is null
limit 0, 20
I'm using MySQL
Thanks
What does the analyzer say about this query? Without knowing how many rows are in the tables, you can't pick an optimization. So run the analyzer and you'll see which parts cost what.
Of course, as #theomega said, look at the execution plan.
But I'd also suggest to try and "clean up" your statement. (I don't know which one is faster - that depends on your table sizes.) Usually, I'd try to start with a clean statement and start optimizing from there. But typically, a clean statement makes it easier for the optimizer to come up with a good execution plan.
So here are some observations about your statement that might make things slow:
a couple of outer joins (makes it hard for the optimizer to figure out an index to use)
a group by
a lot of columns to group by
As far as I understand your SQL, this statement should do most of what yours is doing:
SELECT `item`.itemID, `item`.title, `item`.itemTypeID,
       `item`.submitDate, `item`.deleted, `item`.ItemCat,
       `item`.counter, `item`.userID, `users`.name,
       TIMESTAMPDIFF(minute, `submitDate`, NOW()) AS 'timeMin'
FROM (item `item` INNER JOIN users `users`
      ON (`users`.userID = `item`.userID))
WHERE (`item`.deleted = 0)
Of course, this misses the info from the tables you outer joined; I'd suggest trying to add the required columns via a subselect:
SELECT `item`.itemID,
       (SELECT COUNT(itemID)
        FROM votes v
        WHERE v.itemID = `item`.itemID) AS `votes`, <etc.>
This way, you can get rid of one outer join and the group by. The outer join is replaced by the subselect, so there is a trade-off which may be bad for the "cleaner" statement.
Depending on the cardinality between item and myItems, you can do the same or you'd have to stick with the outer join (but no need to reintroduce the group by).
Hope this helps.
Some quick semi-random thoughts:
Are your itemID and userID columns indexed?
What happens if you add "EXPLAIN " to the start of the query and run it? Does it use indexes? Are they sensible?
Do you need to run the whole inner query and filter on it, or could you move the where myTable.userIDFav = 3 or myTable.userIDFav is null part into the inner query?
You do seem to have too many fields in the Group By list; since one of them is itemID, I suspect that you could use an inner SELECT to perform the grouping and an outer SELECT to return the set of fields desired.
Can't you add the where clause myTable.userIDFav = 3 or myTable.userIDFav is null to WHERE (item.deleted = 0)?
Regards
Lieven
Look at the way your query is built: you join a lot of tables, then limit the output to 20 rows. Since your conditions only apply to item and myItems, you should outer-join those two tables first, limit the output to the first 20 rows, and only then join and aggregate the rest. As written, you are doing a lot of work that is then discarded.
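A rough sketch of that idea (untested; columns trimmed for brevity, and it assumes at most one myItems row per item for the given user):
SELECT t.itemID, COUNT(`votes`.itemID) AS votes, t.title,
       `users`.name, TIMESTAMPDIFF(minute, t.submitDate, NOW()) AS timeMin
FROM (SELECT `item`.itemID, `item`.title, `item`.userID, `item`.submitDate
      FROM item `item`
      LEFT OUTER JOIN myItems `myItems` ON `myItems`.itemID = `item`.itemID
      WHERE `item`.deleted = 0
        AND (`myItems`.userID = 3 OR `myItems`.userID IS NULL)
      ORDER BY `item`.itemID DESC
      LIMIT 0, 20) AS t
INNER JOIN users `users` ON `users`.userID = t.userID
LEFT OUTER JOIN votes `votes` ON `votes`.itemID = t.itemID
GROUP BY t.itemID, t.title, `users`.name, t.submitDate;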