Improving performance on SQL query

I'm currently having performance problems with an expensive SQL query, and I'd like to improve it.
This is what the query looks like:
SELECT TOP 50 MovieID
FROM (SELECT [MovieID], COUNT(*) AS c
      FROM [tblMovieTags]
      WHERE [TagID] IN (SELECT TOP 7 [TagID]
                        FROM [tblMovieTags]
                        WHERE [MovieID]=12345
                        ORDER BY Relevance ASC)
      GROUP BY [MovieID]
      HAVING COUNT(*) > 1) a
INNER JOIN [tblMovies] m ON m.MovieID=a.MovieID
WHERE (Hidden=0) AND m.Active=1 AND m.Processed=1
ORDER BY c DESC, m.IMDB DESC
I'm trying to find movies that have at least 2 tags matching those of MovieID 12345.
The basic database schema: each movie has 4 to 5 tags, and I want a list of movies similar to a given movie based on its tags. A minimum of 2 tags must match.
This query is causing my server problems as I have hundreds of concurrent users at any given time.
I have already created indexes based on execution plan suggestions, and that has made it quicker, but it's still not enough.
Is there anything I could do to make this faster?

I like to use temp tables because they can speed up your queries (if used correctly) and make them easier to read. Try the query below and see if it speeds things up. There were a few fields (Hidden, IMDB) that weren't in your schema, so I left them out.
This query may, or may not, be exactly what you are looking for. The point of it is to show you how to use temp tables to increase the performance and improve readability. Some minor tweaks may be necessary.
SELECT TOP 7 [TagID]
INTO #MovieTags
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY Relevance ASC

SELECT mt.MovieID, COUNT(mt.MovieTagID) AS TagCount
INTO #Movies
FROM tblMovieTags mt
INNER JOIN #MovieTags t ON t.TagID=mt.TagID
INNER JOIN tblMovies m ON m.MovieID=mt.MovieID AND m.Active=1 AND m.Processed=1
GROUP BY mt.MovieID
HAVING COUNT(mt.MovieTagID) > 1

SELECT TOP 50 MovieID FROM #Movies ORDER BY TagCount DESC

DROP TABLE #MovieTags
DROP TABLE #Movies
Edit
Parameterized Queries
You will also want to use parameterized queries rather than concatenating your values into your SQL string. Check out this short, to-the-point blog post that explains why you should use parameterized queries. This, combined with the temp table method, should improve your performance significantly.
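For illustration, a minimal sketch of a parameterized call in T-SQL using sp_executesql (the calling pattern here is an assumption for illustration, not code from the question):
-- Hypothetical example: pass MovieID as a parameter instead of
-- concatenating it into the SQL string.
DECLARE @sql NVARCHAR(MAX) = N'
    SELECT TOP 7 [TagID]
    FROM [tblMovieTags]
    WHERE [MovieID] = @MovieID
    ORDER BY Relevance ASC';
EXEC sp_executesql @sql, N'@MovieID INT', @MovieID = 12345;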

I want to see if there is some unnecessary processing happening in that query you wrote. Try the following query and let us know if it's faster, slower, etc., and whether it even returns the same data.
I just threw this together, so no guarantees on perfect syntax.
SELECT TOP 7 [TagID]
INTO #MovieTags
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY Relevance ASC
;WITH cte_movies AS
(
SELECT
mt.MovieID
,mt.TagID
FROM
tblMovieTags mt
INNER JOIN #MovieTags t ON mt.TagId = t.TagId
INNER JOIN tblMovies m ON mt.MovieID = m.MovieID
WHERE
(Hidden=0) AND m.Active=1 AND m.Processed=1
),
cte_movietags AS
(
SELECT
MovieId
,COUNT(MovieId) AS TagCount
FROM
cte_movies
GROUP BY MovieId
)
SELECT
MovieId
FROM
cte_movietags
WHERE
TagCount > 1
ORDER BY
MovieId
GO
DROP TABLE #MovieTags

Related

Is there a better way to optimize this SQL query?

Assume my table has millions of records.
The following columns have indexes:
type
date
unique_id
Here is my query:
SELECT TOP (1000) T.TIME, T.TYPE, F.NAME,
B.NAME, T.MESSAGE
FROM MY_TABLE T
LEFT OUTER JOIN FOO F ON F.ID = T.FID
LEFT OUTER JOIN BAR B ON B.ID = T.BID
WHERE T.TYPE IN ('success', 'failure')
AND T.DATE BETWEEN 1592585183437 AND 1594232320525
AND T.UNIQUE_ID = 'my unique ID'
ORDER BY T.DATE DESC
My question is am I causing myself any trouble with this query if I have tons of records in my table? Can this be optimized further?
My question is am I causing myself any trouble with this query if I have tons of records in my table?
Wrong question. As long as you are only asking for data you need, you NEED it. Any trouble means you STILL need it.
The query looks as good as it gets. I do somewhat doubt the TOP (1000), though; that is a LOT of data unless you compress it somehow.
The question is more whether you have the proper indices; without them, this query will be slow. I would also not use strings like this for TYPE, but then the question is about the query, not possible failures in table design.
Check indices.
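For illustration, a composite index matching that query's predicates might look like this (SQL Server syntax; the index name and column order are assumptions, not from the answer):
-- Equality columns first (UNIQUE_ID, TYPE), then the range/sort column (DATE).
CREATE INDEX IX_MY_TABLE_UNIQUEID_TYPE_DATE
ON MY_TABLE (UNIQUE_ID, TYPE, DATE DESC)
INCLUDE (TIME, FID, BID, MESSAGE);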

Query optimization for postgresql

I have to resolve a problem in my class about query optimization in postgresql.
I have to optimize the following query.
"The query determines the yearly loss in revenue if orders just with a quantity of more than the average quantity of all orders in the system would be taken and shipped to customers."
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where i_data like '%b'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
Is it possible through indices or something else to optimize that query (Materialized view is possible as well)?
Execution plan can be found here. Thanks.
First, if you have to search from the back of the data, simply create an index on the reverse of the data:
create index on item(reverse(i_data));
Then query it like so:
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where reverse(i_data) like 'b%'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
Remember that making indexes may not speed up the query when you have to retrieve something like 30% of the table. In this case a bitmap index might help you, but as far as I remember it is not available in Postgres. So think about which table to index; maybe it would be worth indexing the big table by ol_i_id, since the join you are making only needs to match less than 10% of the big table and the small table is loaded into RAM (I might be mistaken here, but at least in SAS a hash join means that you load the smaller table into RAM).
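A minimal sketch of that join-key index in PostgreSQL syntax:
-- Index the big table on the join column used by both the subquery and the outer query.
create index on orderline (ol_i_id);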
You may try aggregating the data before doing any joins and reuse the grouped data. I assume that you need to do everything in one query without explicitly creating any staging tables by hand. Also, recently I have been working a lot with SQL Server, so I may mix up the syntax, but give it a try. There are many assumptions I have made about the data and the structure of the table, but hopefully it will work.
;WITH GrOrderline AS (
SELECT ol_i_id, ol_quantity, SUM(ol_amount) AS Yearly, COUNT(*) AS cnt
FROM orderline
GROUP BY ol_i_id, ol_quantity
),
AvgOrderline AS (
SELECT
o.ol_i_id,
-- weighted average: the rows were pre-grouped, so weight each quantity by its count
SUM(o.ol_quantity * o.cnt * 1.0) / SUM(o.cnt) AS AvgQ
FROM GrOrderline AS o
INNER JOIN item AS i ON (o.ol_i_id = i.i_id AND RIGHT(i.i_data, 1) = 'b')
GROUP BY o.ol_i_id
)
SELECT SUM(Yearly)/2.0 AS avg_yearly
FROM GrOrderline o INNER JOIN AvgOrderline a ON (o.ol_i_id = a.ol_i_id AND o.ol_quantity < a.AvgQ)
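Since the question explicitly allows materialized views, the grouped data from the first CTE could also be precomputed, so the scan and aggregation of the big table are not repeated on every run. A minimal sketch in PostgreSQL syntax (the view name is an assumption):
create materialized view gr_orderline as
select ol_i_id, ol_quantity, sum(ol_amount) as yearly, count(*) as cnt
from orderline
group by ol_i_id, ol_quantity;

-- refresh after orderline changes, e.g. from a scheduled job
refresh materialized view gr_orderline;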

Check the query efficiency

I have the SQL query below, and I'd like an opinion on whether I can improve it using temp tables or something else, or whether it's good enough as it is. Basically I'm just feeding the result set from the inner query to the outer one.
SELECT S.SolutionID
,S.SolutionName
,S.Enabled
FROM dbo.Solution S
WHERE s.SolutionID IN (
SELECT DISTINCT sf.SolutionID
FROM dbo.SolutionToFeature sf
WHERE sf.SolutionToFeatureID IN (
SELECT sfg.SolutionToFeatureID
FROM dbo.SolutionFeatureToUsergroup SFG
WHERE sfg.UsergroupID IN (
SELECT UG.UsergroupID
FROM dbo.Usergroup UG
WHERE ug.SiteID = @SiteID
)
)
)
It's going to depend largely on the indexes you have on those tables. Since you are only selecting data out of the Solution table, you can put everything else in an EXISTS clause, do some proper joins, and it should perform better.
The EXISTS clause will allow you to remove the DISTINCT you have on the SolutionToFeature table. DISTINCT causes a performance hit because it effectively builds a worktable behind the scenes to check each record's uniqueness against the rest of the result set; you take a pretty big hit as your tables grow.
It will look something similar to what I have below, but without sample data or anything I can't tell if it's exactly right.
Select S.SolutionID, S.SolutionName, S.Enabled
From dbo.Solution S
Where Exists (
select 1
from dbo.SolutionToFeature sf
Inner Join dbo.SolutionFeatureToUsergroup SFG on sf.SolutionToFeatureID = SFG.SolutionToFeatureID
Inner Join dbo.UserGroup UG on sfg.UserGroupID = UG.UserGroupID
Where S.SolutionID = sf.SolutionID
and UG.SiteID = @SiteID
)
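Since performance here depends largely on the indexes, a hedged sketch of supporting indexes for that join path (the index names and column choices are assumptions, not part of the answer):
-- One index per join step, each covering the probe column plus the column joined next.
CREATE INDEX IX_Usergroup_SiteID
ON dbo.Usergroup (SiteID, UsergroupID);
CREATE INDEX IX_SolutionFeatureToUsergroup_UsergroupID
ON dbo.SolutionFeatureToUsergroup (UsergroupID, SolutionToFeatureID);
CREATE INDEX IX_SolutionToFeature_SolutionID
ON dbo.SolutionToFeature (SolutionID, SolutionToFeatureID);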

Alternative for joining two tables multiple times

I have a situation where I have to join a table multiple times. Most of them need to be left joins, since some of the values are not available. How to overcome the query poor performance when joining multiple times?
The Scenario
Tables
[Project]: ProjectId Guid, Name VARCHAR(MAX).
[UDF]: EntityId Guid, EntityType Char(1), UDFCode Guid, UDFName varchar(20)
[UDFDetail]: UDFCode Guid, Description VARCHAR(MAX)
Relationship:
[Project].ProjectId - [UDF].EntityId
[UDFDetail].UDFCode - [UDF].UDFCode
The UDF table holds custom fields for projects, based on the UDFName column. The value for these fields, however, is stored on the UDFDetail, in the column Description.
I have lots of custom columns for Project, and they are stored in the UDF table.
So for example, to get two fields for the project I do the following select:
SELECT
p.Name ProjectName,
ud1.Description Field1,
ud1.UDFCode Field1Id,
ud2.Description Field2,
ud2.UDFCode Field2Id
FROM
Project p
LEFT JOIN UDF u1 ON
u1.EntityId = p.ProjectId AND u1.UDFName='Field1'
LEFT JOIN UDFDetail ud1 ON
ud1.UDFCode = u1.UDFCode
LEFT JOIN UDF u2 ON
u2.EntityId = p.ProjectId AND u2.UDFName='Field2'
LEFT JOIN UDFDetail ud2 ON
ud2.UDFCode = u2.UDFCode
The Problem
Imagine the above select but joining with like 15 fields. In my query I have around 10 fields already and the performance is not very good. It is taking about 20 seconds to run. I have good indexes for these tables, so looking at the execution plan, it is doing only index seeks without any lookups. Regarding the joins, it needs to be left join, because Field 1 might not exist for that specific project.
The Question
Is there a more performant way to retrieve the data?
How would you do the query to retrieve 10 different fields for one project in a schema like this?
Your choices are pivot, explicit aggregation (with conditional functions), or the joins. If you have the appropriate indexes set up, the joins may be the fastest method.
The correct index would be UDF(EntityId, UDFName, UDFCode).
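Written out as DDL, that suggestion would be (index name assumed):
CREATE INDEX IX_UDF_EntityId_UDFName ON UDF (EntityId, UDFName, UDFCode);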
You can test if the group by is faster by running a query such as:
SELECT count(*)
FROM Project p LEFT JOIN
     UDF u1
     ON u1.EntityId = p.ProjectId LEFT JOIN
     UDFDetail ud1
     ON ud1.UDFCode = u1.UDFCode;
If this runs fast enough, then you can consider the group by approach.
You can try this very weird contraption (it does not look pretty, but it does a single set of outer joins). The intermediate result is a very "wide" and "long" dataset, which we can then "compact" with aggregation: for each ProjectName, each Field1 column will have N results, of which N-1 are NULLs and 1 is the non-NULL value, which is then selected with a simple MAX aggregation (N is the number of fields).
select ProjectName,
       max(Field1) as Field1, max(Field1Id) as Field1Id,
       max(Field2) as Field2, max(Field2Id) as Field2Id
from (
select
p.Name as ProjectName,
case when u.UDFName='Field1' then ud.Description else NULL end as Field1,
case when u.UDFName='Field1' then ud.UDFCode else NULL end as Field1Id,
case when u.UDFName='Field2' then ud.Description else NULL end as Field2,
case when u.UDFName='Field2' then ud.UDFCode else NULL end as Field2Id
from Project p
left join UDF u on p.ProjectId=u.EntityId
left join UDFDetail ud on u.UDFCode=ud.UDFCode
) tmp
group by ProjectName
The query can actually be rewritten without the inner query, but that should not make a big difference :), and looking at Gordon Linoff's suggestion and your answer, it might actually take just about 20 seconds as well, but it is still worth giving it a try.
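For completeness, the PIVOT option mentioned at the top of the first answer would look roughly like this in T-SQL. This is a sketch: it pivots only the Description values, so pulling each field's UDFCode as well would need a second pivot or the conditional-aggregation form above.
SELECT ProjectName, [Field1], [Field2]
FROM (
    SELECT p.Name AS ProjectName, u.UDFName, ud.Description
    FROM Project p
    LEFT JOIN UDF u ON u.EntityId = p.ProjectId
    LEFT JOIN UDFDetail ud ON ud.UDFCode = u.UDFCode
) src
PIVOT (MAX(Description) FOR UDFName IN ([Field1], [Field2])) pvt;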

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=1) AS likes,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=0) AS dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added likes and dislikes columns to the COMMENTS table and created a trigger to automatically increment/decrement these columns as likes get inserted/deleted/updated in the LIKER table, then the SELECT statement would be simpler and more efficient than it is now. So I am asking: is it more efficient to have this complex query with the COUNTs, or to have the extra columns and triggers?
And to generalise: is it more efficient to COUNT, or to have an extra column holding the count, when it is queried on a regular basis?
Your query is very inefficient. You can eliminate those subqueries, which will dramatically increase performance.
Your two subqueries can be replaced by joining comment_likers once for counting and aggregating with simply:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
Making the whole query this (the likers table is joined twice: once to count everyone's votes and once to fetch the current user's own vote, and the aggregates now need a GROUP BY):
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
COALESCE(SUM(cl.liker), 0) AS likes,
COALESCE(SUM(ABS(cl.liker - 1)), 0) AS dislikes,
my.liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers cl ON comments.comment_id = cl.comment_id
LEFT JOIN comment_likers my ON comments.comment_id = my.comment_id
AND my.usr_id = $usrID
WHERE comments.topic_id=$tpcID
GROUP BY comments.comment_id, comments.descr, comments.created, usrs.usr_name, my.liker
ORDER BY comments.created DESC;
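For comparison, the counter-column alternative described in the question would look something like this (MySQL syntax; the likes/dislikes columns and the trigger name are assumptions, and matching triggers for DELETE and UPDATE would be needed as well):
-- Hypothetical trigger keeping denormalized counters on comments in sync on INSERT.
DELIMITER //
CREATE TRIGGER trg_comment_likers_insert
AFTER INSERT ON comment_likers
FOR EACH ROW
BEGIN
    IF NEW.liker = 1 THEN
        UPDATE comments SET likes = likes + 1 WHERE comment_id = NEW.comment_id;
    ELSE
        UPDATE comments SET dislikes = dislikes + 1 WHERE comment_id = NEW.comment_id;
    END IF;
END//
DELIMITER ;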