Query optimization for PostgreSQL - SQL

I have to solve a problem for my class about query optimization in PostgreSQL.
I have to optimize the following query.
"The query determines the yearly loss in revenue if only orders with a quantity greater than the average quantity of all orders in the system were taken and shipped to customers."
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where i_data like '%b'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
Is it possible to optimize that query through indexes or something else (a materialized view is possible as well)?
The execution plan can be found here. Thanks.

First, if you have to search from the back of the data, simply create an index on the reverse of the data:
create index on item (reverse(i_data));
Then query it like so:
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where reverse(i_data) like 'b%'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
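One caveat: unless the database uses the C collation, a btree index will only support the LIKE 'b%' prefix search if it is built with a pattern-matching operator class, for example (assuming i_data is text or varchar):
create index on item (reverse(i_data) text_pattern_ops);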

Remember that creating indexes may not speed up the query when you have to retrieve something like 30% of the table. In this case a bitmap index might help you, but as far as I remember it is not available in Postgres. So think about which table to index; it may be worth indexing the big table by ol_i_id, as the join you are making only needs to match less than 10% of the big table, and the small table is loaded into RAM (I might be mistaken here, but at least in SAS a hash join means that you load the smaller table into RAM).
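For example (a sketch using the column names from the question; whether the planner actually uses it depends on the data distribution and statistics):
create index on orderline (ol_i_id);
With that index in place the planner can consider an index or bitmap scan on orderline for the join instead of a full sequential scan.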
You may try aggregating the data before doing any joins and reusing the grouped data. I assume that you need to do everything in one query without explicitly creating any staging tables by hand. Also, I have recently been working a lot with SQL Server, so I may mix up the syntax, but give it a try. I have made many assumptions about the data and the structure of the tables, but hopefully it will work.
;WITH GrOrderline AS (
    -- Pre-aggregate the big table once per (item, quantity) pair
    SELECT ol_i_id, ol_quantity, SUM(ol_amount) AS Yearly, COUNT(*) AS cnt
    FROM orderline
    GROUP BY ol_i_id, ol_quantity
),
AvgOrderline AS (
    -- Weighted average quantity per item, restricted to items whose data ends in 'b'
    SELECT o.ol_i_id, SUM(o.ol_quantity * o.cnt) / SUM(o.cnt) AS AvgQ
    FROM GrOrderline AS o
    INNER JOIN item AS i ON (o.ol_i_id = i.i_id AND RIGHT(i.i_data, 1) = 'b')
    GROUP BY o.ol_i_id
)
SELECT SUM(o.Yearly) / 2.0 AS avg_yearly
FROM GrOrderline o
INNER JOIN AvgOrderline a ON (o.ol_i_id = a.ol_i_id AND o.ol_quantity < a.AvgQ)
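Since the question allows a materialized view, another option is to precompute the inner aggregate once. This is only a PostgreSQL sketch using the tables and columns from the question; the view name is made up, and the view has to be refreshed with REFRESH MATERIALIZED VIEW whenever the underlying data changes:
create materialized view item_avg_qty as
select i_id, avg(ol_quantity) as a
from item
join orderline on ol_i_id = i_id
where i_data like '%b'
group by i_id;
create index on item_avg_qty (i_id);
select sum(ol_amount) / 2.0 as avg_yearly
from orderline
join item_avg_qty t on ol_i_id = t.i_id
where ol_quantity < t.a;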

Related

How can I rewrite this query (it includes UNION, WITH and self-join applied for filtering)?

I wrote this view when a deadline was approaching.
WITH AllCategories
AS (SELECT CaseTable.CaseID,
CT.Category,
CT.CategoryType,
Q.Note AS CategoryCaseNote,
Q.CategoryID,
Q.CategoryIsDefaultValue
FROM CaseTable
INNER JOIN
((SELECT CaseID, -- Filled categories in table
CategoryCaseNote AS Note,
CategoryID,
-1 AS QuestionID,
0 AS CategoryIsDefaultValue
FROM CaseCategory)
UNION ALL
(SELECT -1 AS CaseID, -- possible categories
NULL AS Note,
CategoryID AS CategoryID,
QuestionID,
1 AS CategoryIsDefaultValue
FROM SHOW_QuestionCategory)) AS Q
ON (Q.QuestionID = -1
OR Q.QuestionID = CaseTransactionTable.QuestionID)
AND (Q.CaseID = -1
OR Q.CaseID = CaseTable.CaseTransactionID)
LEFT OUTER JOIN
CategoryTable AS CT
ON Q.CategoryID = CT.CategoryID)
SELECT A.*
FROM AllCategories AS A
INNER JOIN
(SELECT CaseID,
CategoryID,
MIN(CategoryIsDefaultValue) AS CategoryIsDefaultValue
FROM AllCategories
GROUP BY CaseID, CategoryID) AS B
ON A.CaseID = B.CaseID
AND A.CategoryID = B.CategoryID
AND A.CategoryIsDefaultValue = B.CategoryIsDefaultValue
Now it's becoming a bottleneck because of a very expensive join between CaseTable and the subquery with UNION (resulting in over 30% of the cost of a frequently used procedure; in the execution plan it's a nested loops node with ~70% of the cost of the select).
I have tried to rewrite it multiple times, but these attempts only resulted in worse performance.
Table CaseCategory has a unique index on the tuple (CaseID, CategoryID).
It's probably a combination of problems with bad cardinality estimates and the use of a CTE. With what you've told us, I can only give some general guidance. The info you provided on the index means little without knowing the cardinality and distribution of the data. BTW, not sure if this qualifies as an answer, but it's too long for a comment. Feel free to downvote :)
There is a stored procedure selecting from the view, am I correct? I also presume you have a WHERE clause somewhere, right?
In that case, get rid of the view altogether and move the code into the procedure. This will allow you to get rid of the CTE (which is most likely executed twice) and to save the intermediate results from what is now the CTE into a #temp table. It could also be beneficial to apply the same #temp-table strategy to the UNION ALL subquery, as sketched below.
Make sure to apply the WHERE predicates as early as possible (SQL Server is usually good at pushing them down, but this combination of proc-view-CTE might confuse it).
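A minimal sketch of that shape inside the procedure (assuming SQL Server, the table and column names from the view above, and a hypothetical @CaseID parameter standing in for whatever filter the procedure applies; the QuestionID predicate is left out here because the posted view references CaseTransactionTable, which is not in its FROM clause):
-- Materialize the UNION ALL subquery once, filtered as early as possible
SELECT CaseID, CategoryCaseNote AS Note, CategoryID,
       -1 AS QuestionID, 0 AS CategoryIsDefaultValue
INTO #Q
FROM CaseCategory
WHERE CaseID = @CaseID              -- hypothetical: push the proc's filter in early
UNION ALL
SELECT -1 AS CaseID, NULL AS Note, CategoryID,
       QuestionID, 1 AS CategoryIsDefaultValue
FROM SHOW_QuestionCategory;
-- Materialize what is now the CTE so it is evaluated only once
SELECT CaseTable.CaseID, CT.Category, CT.CategoryType,
       Q.Note AS CategoryCaseNote, Q.CategoryID, Q.CategoryIsDefaultValue
INTO #AllCategories
FROM CaseTable
INNER JOIN #Q AS Q
    ON (Q.CaseID = -1 OR Q.CaseID = CaseTable.CaseTransactionID)
LEFT OUTER JOIN CategoryTable AS CT ON Q.CategoryID = CT.CategoryID;
-- The final SELECT then joins #AllCategories against its own
-- MIN(CategoryIsDefaultValue) aggregate, exactly as the original view does.
Indexing #AllCategories on (CaseID, CategoryID) before that final aggregate join may also be worth trying.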

Improving performance on SQL query

I'm currently having performance problems with an expensive SQL query, and I'd like to improve it.
This is what the query looks like:
SELECT TOP 50 MovieID
FROM (SELECT [MovieID], COUNT(*) AS c
FROM [tblMovieTags]
WHERE [TagID] IN (SELECT TOP 7 [TagID]
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY Relevance ASC)
GROUP BY [MovieID]
HAVING COUNT(*) > 1) a
INNER JOIN [tblMovies] m ON m.MovieID=a.MovieID
WHERE (Hidden=0) AND m.Active=1 AND m.Processed=1
ORDER BY c DESC, m.IMDB DESC
What I'm trying to do is find movies that have at least 2 matching tags for MovieID 12345.
Regarding the basic database schema: each movie has 4 to 5 tags. I want a list of movies similar to a given movie based on its tags; a minimum of 2 tags must match.
This query is causing my server problems as I have hundreds of concurrent users at any given time.
I have already created indexes based on execution plan suggestions, and that has made it quicker, but it's still not enough.
Is there anything I could do to make this faster?
I like to use temp tables because they can speed up your queries (if used correctly) and make them easier to read. Try the query below and see if it speeds things up. There were a few fields (hidden, imdb) that weren't in your schema, so I left them out.
This query may or may not be exactly what you are looking for. The point of it is to show you how to use temp tables to increase performance and improve readability. Some minor tweaks may be necessary.
SELECT TOP 7 [TagID],[MovieTagID],[MovieID]
INTO #MovieTags
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY Relevance ASC
SELECT mt.MovieID, COUNT(mt.MovieTagID) AS TagCount
INTO #Movies
FROM #MovieTags mt
INNER JOIN tblMovies m ON m.MovieID=mt.MovieID AND m.Active=1 AND m.Processed=1
GROUP BY mt.MovieID
HAVING COUNT(mt.MovieTagID) > 1
SELECT TOP 50 * FROM #Movies
DROP TABLE #MovieTags
DROP TABLE #Movies
Edit
Parameterized Queries
You will also want to use parameterized queries rather than concatenating your values into your SQL string. Check out this short, to-the-point blog post that explains why you should use parameterized queries. This, combined with the temp table method, should improve your performance significantly.
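A minimal sketch of the idea in T-SQL itself (in practice you would pass the parameter from your application code through your data-access library; the variable name here is made up):
DECLARE @MovieID int = 12345;
EXEC sp_executesql
    N'SELECT TOP 7 [TagID]
      FROM [tblMovieTags]
      WHERE [MovieID] = @MovieID
      ORDER BY Relevance ASC',
    N'@MovieID int',
    @MovieID = @MovieID;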
I want to see if there is some unnecessary processing happening in the query you wrote. Try the following query and let us know if it's faster, slower, etc., and whether it's even getting the same data.
I just threw this together, so no guarantees on perfect syntax.
SELECT TOP 7 [TagID]
INTO #MovieTags
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY TagID
;WITH cte_movies AS
(
SELECT
mt.MovieID
,mt.TagID
FROM
tblMovieTags mt
INNER JOIN #MovieTags t ON mt.TagId = t.TagId
INNER JOIN tblMovies m ON mt.MovieID = m.MovieID
WHERE
(Hidden=0) AND m.Active=1 AND m.Processed=1
),
cte_movietags AS
(
SELECT
MovieId
,COUNT(MovieId) AS TagCount
FROM
cte_movies
GROUP BY MovieId
)
SELECT
MovieId
FROM
cte_movietags
WHERE
TagCount > 1
ORDER BY
MovieId
GO
DROP TABLE #MovieTags

How to reduce scope of subquery?

I've got SQL running on MS SQL Server similar to the following:
SELECT
CustNum,
Name
FROM
Cust
LEFT JOIN (
SELECT
CustNum, MAX(OrderDate) as LastOrderDate
FROM
Orders
GROUP BY
CustNum) as Orders
ON Orders.CustNum = Cust.CustNum
WHERE
Region = 1
It contains a subquery to find the MAX record from a child table. The concern is that these tables have a very large number of rows. It seems like the subquery would operate on all the rows of the child table, even though only a very few of them are actually needed because of the WHERE clause on the outer query.
Is there a way to reduce the scope of the inner query? Something like adding a WHERE clause to only include the records that are included in the outer query? Something like
WHERE CustomerOrders.CustomerNumber = Customers.CustomerNumber -- Customers from the outer query.
I suspect that this is not necessary, but I am getting some push back from another developer and I wanted to be sure (my SQL is a little rusty).
You are correct about the subquery. It will have to summarize all the data. You could re-write the query like this:
SELECT Cust.CustNum, Name, max(OrderDate) as LastOrderDate
FROM Cust LEFT JOIN
Orders
ON Orders.CustNum = Cust.CustNum
WHERE Region = 1
group by Cust.CustNum, Name
This would let the SQL optimizer choose the optimal path.
If you know that there are very, very few customers matching Region = 1 and you have an index on CustNum, OrderDate in Orders, you could write the query like this:
select CustNum, Name,
(select top 1 OrderDate
from Orders o
where Cust.CustNum = o.CustNum
order by OrderDate desc
) as LastOrderDate
from Cust
Where Region = 1
I think you would get a very similar effect by using cross apply.
By the way, I'm not a fan of rewriting queries for such purposes. But I haven't found a SQL optimizer that would do anything other than summarize all the Orders rows in this case.
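For reference, the CROSS APPLY variant both answers allude to might look like this (a sketch using the table and column names from the question; OUTER APPLY preserves customers with no orders, like the original LEFT JOIN):
SELECT c.CustNum, c.Name, lo.LastOrderDate
FROM Cust c
OUTER APPLY (
    SELECT TOP 1 ord.OrderDate AS LastOrderDate
    FROM Orders ord
    WHERE ord.CustNum = c.CustNum
    ORDER BY ord.OrderDate DESC
) lo
WHERE c.Region = 1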
No, it's generally not necessary if your statistics etc. are up to date. That's the job of the optimiser. You can try the CROSS APPLY operator if you think you're missing out on some shortcuts, but generally if you have all the constraints and stats in place it will be fine.
Your proposed additional WHERE might make sense to you, but as it doesn't correlate to anything in the actual query you posted, it will change the results (if it works at all). If you want comments on that, you need to post the tables, relations, etc.
The best way is to check the execution plan and see if it's doing anything dumb.

SubQuery vs TempTable before Merge

I have a complex query that I want to use as the Source of a Merge into a table. This will be executed over millions of rows. Currently I am trying to apply constraints to the data by inserting it into a temp table before the merge.
The operations are:
Filter out duplicate data.
Join some tables to pull in additional data
Insert into the temp table.
Here is the query.
-- Get all Orders that aren't in the system
WITH Orders AS
(
SELECT *
FROM [Staging].Orders o
WHERE NOT EXISTS
(
SELECT 1
FROM Maps.VendorBOrders vbo
JOIN OrderFact of
ON of.Id = vbo.OrderFactId
AND InternalOrderId = o.InternalOrderId
AND of.DataSetId = o.DataSetId
AND of.IsDelete = 0
)
)
INSERT INTO #VendorBOrders
(
CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,BillingProviderId
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,RelatedOrderId
,DataSetId
)
SELECT
vc.CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,bp.Id
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,ro.Id
,o.DataSetId
FROM
Orders o
-- Join related orders to match orders sharing same instance
JOIN Maps.VendorBRelatedOrder ro
ON ro.OrderControlNumber = o.OrderControlNumber
AND ro.EquitableCustomerId = o.EquitableCustomerId
AND ro.DataSetId = o.DataSetId
JOIN BillingProvider bp
ON bp.ProviderNPI = o.ProviderNPI
-- Join on customers and fail if the customer doesn't exist
LEFT OUTER JOIN [Maps].VendorBCustomer vc
ON vc.ExtenalCustomerId = o.ExtenalCustomerId
AND vc.VendorId = o.VendorId;
I am wondering if there is anything I can do to optimize it for time. I have tried using the DB Engine Tuner, but this query takes 100x more CPU Time than the other queries I am running. Is there anything else that I can look into or can the query not be improved further?
A CTE is just syntax; that CTE is evaluated (run) as part of that join.
First, just run it as a SELECT statement (no insert).
If the SELECT is slow, then:
Move that CTE to a #temp table so it is evaluated once and materialized (see the sketch below)
Put an index (the PK, if applicable) on the three join columns
If the SELECT is not slow, then it is the insert time on #VendorBOrders:
First, create only the PK and sort the insert on the PK so as not to fragment that clustered index
Then, AFTER the insert is complete, build any other necessary indexes
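A minimal sketch of the #temp approach, assuming SQL Server and the tables and columns used in the query above (the temp table and index names are made up, and the OrderFact alias is renamed here):
-- Materialize what is now the CTE so it is evaluated only once
SELECT o.*
INTO #Orders
FROM [Staging].Orders o
WHERE NOT EXISTS
(
    -- same NOT EXISTS subquery as in the original CTE
    SELECT 1
    FROM Maps.VendorBOrders vbo
    JOIN OrderFact fct
        ON fct.Id = vbo.OrderFactId
       AND InternalOrderId = o.InternalOrderId
       AND fct.DataSetId = o.DataSetId
       AND fct.IsDelete = 0
);
-- Put an index on the three join columns used by the later joins
CREATE CLUSTERED INDEX IX_Orders_JoinCols
    ON #Orders (OrderControlNumber, EquitableCustomerId, DataSetId);
-- The INSERT INTO #VendorBOrders ... SELECT then reads FROM #Orders o
-- with the same joins as in the original query.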
Generally, when I do speed testing, I check the parts of the SQL to see where the problem lies. Turn on the 'Execution plan' and see where a lot of the time is going. Also, if you want to do it quick and dirty, highlight your CTE and run just that. Is that fast? Yes? Move on.
I have at times found that a single index being off throws off a whole complex chain of joins, merely because the database has to do one part of something large and then find that piece.
Another idea: if you have a fast tempdb on your production environment or the like, dump your CTE to a temp table as well, index it, and see if that speeds things up. Sometimes CTEs, table variables, and temp tables lose some performance at joins. I have found that creating an index on a partial object improves performance at times, but you are also putting more load on tempdb to do this, so keep that in mind.

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=1)likes,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=0)dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added likes and dislikes columns to the COMMENTS table and created a trigger to automatically increment / decrement these columns as likes get inserted / deleted / updated in the LIKER table, then the SELECT statement would be simpler and more efficient than it is now. What I am asking is: is it more efficient to have this complex query with the COUNTs, or to have the extra columns and triggers?
And to generalise: is it more efficient to COUNT, or to have an extra column holding the count, when it is queried on a regular basis?
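For illustration, the counter-column trigger described above could look roughly like this (a sketch only, assuming MySQL syntax and that likes / dislikes columns have been added to comments; matching AFTER DELETE and AFTER UPDATE triggers would be needed to keep the counts correct, and in the mysql client you would wrap this in DELIMITER statements):
CREATE TRIGGER comment_likers_after_insert
AFTER INSERT ON comment_likers
FOR EACH ROW
BEGIN
    IF NEW.liker = 1 THEN
        UPDATE comments SET likes = likes + 1 WHERE comment_id = NEW.comment_id;
    ELSE
        UPDATE comments SET dislikes = dislikes + 1 WHERE comment_id = NEW.comment_id;
    END IF;
END;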
Your query is very inefficient. You can easily eliminate those subqueries, which will dramatically increase performance.
Your two subqueries can be replaced by simply:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers ON comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID
WHERE comments.topic_id=$tpcID
GROUP BY comments.comment_id, comments.descr, comments.created, usrs.usr_name, comment_likers.liker
ORDER BY comments.created DESC;