Postgres consistently favoring nested loop join over merge join

I have a complex query:
SELECT DISTINCT ON (delivery.id)
       delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message
    ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing
    ON dl_processing.rel_id = delivery.id AND dl_processing.status = 1000
-- LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
JOIN mailer.mailer_mailing AS mailing
    ON message.mailing_id = mailing.id
WHERE
    NOT EXISTS (SELECT dl_finished.id
                FROM mailer.mailer_message_recipient_rel_log AS dl_finished
                WHERE dl_finished.rel_id = delivery.id AND dl_finished.status <> 1000) AND
    dl_processing.date <= NOW() - (36000 * INTERVAL '1 second') AND
    NOT EXISTS (SELECT ml.id
                FROM mailer.mailer_message_log AS ml
                WHERE ml.message_id = message.id) AND
    -- (r.times_bounced < 5 OR r.times_bounced IS NULL) AND
    NOT EXISTS (SELECT ur.id
                FROM mailer.mailer_unsubscribed_recipient AS ur
                WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id, dl_processing.id DESC
LIMIT 1000;
It is running very slowly, and the reason seems to be that Postgres consistently avoids merge joins in its query plan, even though I have all the indexes I would need for them. The plan looks really depressing:
http://explain.depesz.com/s/tVY
http://i.stack.imgur.com/Myw4R.png
Why would this happen? How do I troubleshoot such an issue?
UPD: with @wildplasser's help I have reworked the query to fix the performance (while changing its semantics somewhat):
SELECT delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message
    ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing
    ON dl_processing.rel_id = delivery.id
   AND dl_processing.status IN (1000, 2, 5)
   AND dl_processing.date <= NOW() - (36000 * INTERVAL '1 second')
LEFT JOIN mailer.mailer_recipient AS r
    ON delivery.email = r.email
WHERE
    (r.times_bounced < 5 OR r.times_bounced IS NULL) AND
    NOT EXISTS (SELECT dl_other.id
                FROM mailer.mailer_message_recipient_rel_log AS dl_other
                WHERE dl_other.rel_id = delivery.id AND dl_other.id > dl_processing.id) AND
    NOT EXISTS (SELECT ml.id
                FROM mailer.mailer_message_log AS ml
                WHERE ml.message_id = message.id) AND
    NOT EXISTS (SELECT ur.id
                FROM mailer.mailer_unsubscribed_recipient AS ur
                JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id
                WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id
LIMIT 1000
It now runs well, but the query plan still sports these horrible nested loop joins <_<:
http://explain.depesz.com/s/MTo3
I would still like to know why that is.

The reason is that Postgres is actually doing the right thing, and I suck at math. Suppose table A has N rows, and table B has M rows, and they are being joined via a column that they both have a B-tree index for. Then the following is true:
Nested loop join's time complexity is not O(MN), as I naively thought, but O(M log N) or O(N log M), depending on which table is scanned linearly. If both are scanned by index, we get O(M log M log N) or O(N log M log N), respectively. But since that is only required when a specific row order is needed for yet another join or for the ORDER BY clause, as we'll see it's not a bad deal at all.
Merge join's time complexity is O(M log M + N log N), which means that it loses to the nested loop join, provided the constant factors are comparable, and AFAIK they are close to 1 in most implementations. And since a merge join has to iterate both tables in the same sort order, an additional sort is required whenever a different order is needed, which easily makes the complexity worse than that of the nested loop join.
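To put rough numbers on it (a back-of-the-envelope estimate of my own, assuming M = N = 1,000,000 rows, unit constant factors and base-2 logarithms, so log N is about 20):
index nested loop:  M log N = 1,000,000 * 20 ~ 20,000,000 steps
merge join with both inputs sorted first:  M log M + N log N = 2 * 1,000,000 * 20 ~ 40,000,000 steps
So even before counting the extra sort that is needed when the inputs don't already come in the join order, the merge join is behind.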
So basically despite being associated with the merge sort, which we all love, merge join almost always sucks.
The reason my first query was so slow was that it had to perform a sort before applying the limit, and it was also bad in many other ways. After applying @wildplasser's suggestions, I managed to reduce the number of (still expensive) nested loops and also allow the limit to be applied without a sort, so Postgres most likely won't need to run the outer scan to completion, and that is where the bulk of the performance gain comes from.
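For anyone wanting to verify such planner decisions on their own query, here is a hedged troubleshooting sketch (these are session-local debugging knobs, not a fix; the statement below uses a cut-down two-table version of my query just to show the pattern):
-- Temporarily forbid nested loops, re-run EXPLAIN ANALYZE, and compare the
-- estimated costs and actual times of the two plans Postgres produces.
SET enable_nestloop = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (delivery.id) delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing
    ON dl_processing.rel_id = delivery.id AND dl_processing.status = 1000
ORDER BY delivery.id, dl_processing.id DESC
LIMIT 1000;
RESET enable_nestloop;
If the forced plan comes out with a higher cost estimate and a slower actual time, that confirms the planner's choice; if it is actually cheaper, the next things to check are table statistics (run ANALYZE) and cost settings such as random_page_cost.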

Related

Position of ON and WHERE clauses and the efficiency performance

I have two tables, one called Health_User and the other called Diary. They have users' demographic information, and their recorded values respectively. What I want to do is retrieving the recorded values, but:
Excluding testers (not real users) with the "is_tester" column (boolean values) in Health_User, and
Excluding unreasonable values with too high or too low measurements in Diary.
So I have several queries which should get the same results:
# Query 1
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN (
SELECT id
FROM Health_User
WHERE is_tester = false
) AS u
ON d.user_id = u.id
WHERE ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
# Query 2
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN Health_User AS u
ON d.user_id = u.id
WHERE u.is_tester = false
AND ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
# Query 3
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Health_User AS u
JOIN (
SELECT id, user_id, glucose_value, unit
FROM Diary
WHERE ((glucose_value >= 20 AND glucose_value <= 600 AND unit = 'mg/dL')
OR (glucose_value >= 20/18.02 AND glucose_value <= 600/18.02 AND unit = 'mmol/L'))
) AS d
ON d.user_id = u.id
WHERE u.is_tester = false;
Here I have three questions:
Question 1: I would speculate that Query 1 would have better performance than Query 2, because a) it joins only one column instead of the whole table of Health_User and b) it filters out testers before joining the tables. Am I correct?
Question 2: The filter conditions are more complex for Diary (see the last WHERE clause in Query 1). Is it better to move Diary inside the JOIN and keep Health_User outside, as in Query 3, or does it make no difference?
Question 3: Is there any even better solution in terms of performance?
There would be a difference if the database executed the queries in the order your queries suggest (first filter, then join or vice versa).
As it is, PostgreSQL has a query optimizer that rearranges the query to find the most efficient execution order, and all your queries will end up with the same execution plan, which you can verify using the SQL statement EXPLAIN.
For inner joins, it does not influence the result if you filter before or after the join; you could also write all the conditions into the join condition without changing the result. The optimizer knows that.
You can speed up execution by creating appropriate indexes. Whether a certain index is useful depends on the distribution of the data. The rule of thumb is that indexes on selective conditions (ones that filter out many rows) are more useful. Work with EXPLAIN to find the best indexes.
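A minimal sketch of that workflow (the table and column names come from the question; the particular indexes are assumptions about your data, not guaranteed wins):
-- All three variants should produce the same plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN Health_User AS u ON d.user_id = u.id
WHERE u.is_tester = false
AND ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
-- Candidate indexes for the join and the selective filters:
CREATE INDEX idx_diary_user_id ON Diary (user_id);
CREATE INDEX idx_diary_unit_glucose ON Diary (unit, glucose_value);
CREATE INDEX idx_health_user_real_users ON Health_User (id) WHERE is_tester = false;
Whether the partial index on Health_User pays off depends on how many rows are testers; EXPLAIN will show which of these indexes the planner actually uses.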

optimizing a large "distinct" select in postgres

I have a rather large dataset (millions of rows). I'm having trouble introducing a "distinct" concept to a certain query. (I'm putting "distinct" in quotes because this could be provided by the Postgres keyword DISTINCT or by a GROUP BY form.)
A non-distinct search takes 1-2 ms; all attempts to introduce a "distinct" concept have grown this to the 50,000-90,000 ms range.
My goal is to show the latest resources based on their most recent appearance in an event stream.
My non-distinct query is essentially this:
SELECT
resource.id AS resource_id,
stream_event.event_timestamp AS event_timestamp
FROM
resource
JOIN
resource_2_stream_event ON (resource.id = resource_2_stream_event.resource_id)
JOIN
stream_event ON (resource_2_stream_event.stream_event_id = stream_event.id)
WHERE
stream_event.viewer = 47
ORDER BY event_timestamp DESC
LIMIT 25
;
I've tried many different forms of queries (and subqueries) using DISTINCT, GROUP BY and MAX(event_timestamp). The issue isn't getting a query that works, it's getting one that works in a reasonable execution time. Looking at the EXPLAIN ANALYZE output for each one, everything is running off of indexes. The problem seems to be that with any attempt to deduplicate my results, Postgres must assemble the entire result set on disk; since each table has millions of rows, this becomes a bottleneck.
--
update
here's a working group-by query:
EXPLAIN ANALYZE
SELECT
resource.id AS resource_id,
max(stream_event.event_timestamp) AS stream_event_event_timestamp
FROM
resource
JOIN resource_2_stream_event ON (resource_2_stream_event.resource_id = resource.id)
JOIN stream_event ON stream_event.id = resource_2_stream_event.stream_event_id
WHERE (
(stream_event.viewer_id = 57) AND
(resource.condition_1 IS NOT True) AND
(resource.condition_2 IS NOT True) AND
(resource.condition_3 IS NOT True) AND
(resource.condition_4 IS NOT True) AND
(
(resource.condition_5 IS NULL) OR (resource.condition_6 IS NULL)
)
)
GROUP BY (resource.id)
ORDER BY stream_event_event_timestamp DESC LIMIT 25;
Looking at the query planner (via EXPLAIN ANALYZE), it seems that adding the MAX + GROUP BY clause (or a DISTINCT) forces a sequential scan, which takes about half the total time. There already is an index that contains every "condition", and I tried creating a set of indexes (one for each column); none of them helped.
In any event, the difference is between 2 ms and 72,000 ms.
Often, DISTINCT ON is the most efficient way to get one row per group. I would suggest trying:
SELECT DISTINCT ON (r.id) r.id AS resource_id, se.event_timestamp
FROM resource r JOIN
resource_2_stream_event r2se
ON r.id = r2se.resource_id JOIN
stream_event se
ON r2se.stream_event_id = se.id
WHERE se.viewer = 47
ORDER BY r.id, se.event_timestamp DESC
LIMIT 25;
An index on resource(id, event_timestamp) might help performance.
EDIT:
You might try using a CTE to get what you want:
WITH CTE as (
SELECT r.id AS resource_id,
se.event_timestamp AS stream_event_event_timestamp
FROM resource r JOIN
resource_2_stream_event r2se
ON r2se.resource_id = r.id JOIN
stream_event se
ON se.id = r2se.stream_event_id
WHERE ((se.viewer_id = 57) AND
(r.condition_1 IS NOT True) AND
(r.condition_2 IS NOT True) AND
(r.condition_3 IS NOT True) AND
(r.condition_4 IS NOT True) AND
( (r.condition_5 IS NULL) OR (r.condition_6 IS NULL)
)
)
)
SELECT resource_id, max(stream_event_event_timestamp) as stream_event_event_timestamp
FROM CTE
GROUP BY resource_id
ORDER BY stream_event_event_timestamp DESC
LIMIT 25;
Postgres materializes the CTE. So, if there are not that many matches, this may speed the query by using indexes for the CTE.
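One hedged caveat that is not part of the original answer: since PostgreSQL 12 a plain CTE is no longer materialized automatically (the planner may inline it), so on newer versions you can request the materialization behavior described above explicitly. A minimal sketch, with the resource.condition_* filters from the CTE above omitted for brevity:
WITH cte AS MATERIALIZED (
    SELECT r.id AS resource_id,
           se.event_timestamp AS stream_event_event_timestamp
    FROM resource r
    JOIN resource_2_stream_event r2se ON r2se.resource_id = r.id
    JOIN stream_event se ON se.id = r2se.stream_event_id
    WHERE se.viewer_id = 57
)
SELECT resource_id, max(stream_event_event_timestamp) AS stream_event_event_timestamp
FROM cte
GROUP BY resource_id
ORDER BY stream_event_event_timestamp DESC
LIMIT 25;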

How can I optimize this SQL query? (Solarwinds Orion)

I'm very new to SQL, and still learning. I'm using a reporting tool called Solarwinds Orion, and I'm honestly not sure how specific the query I have written is to the program, so if there's anything in the query that's confusing, let me know and I'll try to figure out if it's specific to the program or not.
The problem with the query I'm running is that it times out after a very long time (maybe an hour) of running. The database I'm using is huge. Unfortunately I don't really know how huge, but I've been told it's huge.
Is there anything I am doing wrong that would have a huge performance impact?
SELECT TOP 10000
Nodes.Caption AS NodeName,
NetflowApplicationSummary.AppName AS Application_Name,
SUM(NetflowApplicationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
AVG(Case OutBandwidth
When 0 Then 0
Else (NetflowApplicationSummary.TotalBytes/OutBandwidth) * 100
End) AS TEST_PERCENT
FROM
((NetflowApplicationSummary
INNER JOIN Nodes ON (NetflowApplicationSummary.NodeID = Nodes.NodeID))
INNER JOIN InterfaceTraffic ON (Nodes.NodeID = InterfaceTraffic.InterfaceID))
INNER JOIN Interfaces ON (Nodes.NodeID = Interfaces.NodeID)
WHERE
( InterfaceTraffic.DateTime > (GetDate()-30) )
AND
(Nodes.WANCircuit = 1)
GROUP BY Nodes.Caption, NetflowApplicationSummary.AppName
EDIT: I ran COUNT() on each of my tables with the below result.
SELECT COUNT(*) FROM NetflowApplicationSummary;  -- 50,671,011
SELECT COUNT(*) FROM Nodes;                      -- 898
SELECT COUNT(*) FROM InterfaceTraffic;           -- 18,000,166
SELECT COUNT(*) FROM Interfaces;                 -- 3,938
-- Total: 68,676,013
I really have no idea if 68 million items is a huge database to be honest.
A couple of notes:
The INNER JOIN operator is associative, so get rid of those parentheses in the FROM clause and let the optimizer figure out the best join order.
You may have an implied cursor from the getdate() function being called for every row. Store the value in a local variable and compare to that.
The resulting SQL should look like this:
DECLARE @Date AS datetime = GETDATE() - 30;
SELECT TOP 10000
Nodes.Caption AS NodeName,
NetflowApplicationSummary.AppName AS Application_Name,
SUM(NetflowApplicationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
AVG(Case OutBandwidth
When 0 Then 0
Else (NetflowApplicationSummary.TotalBytes/OutBandwidth) * 100
End) AS TEST_PERCENT
FROM NetflowApplicationSummary
INNER JOIN Nodes ON NetflowApplicationSummary.NodeID = Nodes.NodeID
INNER JOIN InterfaceTraffic ON Nodes.NodeID = InterfaceTraffic.InterfaceID
INNER JOIN Interfaces ON Nodes.NodeID = Interfaces.NodeID
WHERE InterfaceTraffic.DateTime > @Date
AND Nodes.WANCircuit = 1
GROUP BY Nodes.Caption, NetflowApplicationSummary.AppName
Also, make sure you have an index on table InterfaceTraffic with a leading field of DateTime. If it doesn't exist, you may need to pay the one-time penalty of creating it.
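A minimal sketch of that index (the index name and the INCLUDE column are placeholders; tailor them to what the execution plan shows):
CREATE NONCLUSTERED INDEX IX_InterfaceTraffic_DateTime
    ON InterfaceTraffic ([DateTime])
    INCLUDE (InterfaceID);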
If this doesn't help, then you may need to post the execution plan where it can be inspected.
Out of interest, also perform a count() on all four tables and post that result, just so members here can make their own assessment of how big your database really is. It is amazing how many non-technical people still think a 1 or 10 GB database is huge, while I run that easily on my workstation!

Improvement in SQL Query - Access - Join / IN / Exists

This is my SQL query, which I am using in Access. It is producing the desired result.
But I just wanted an opinion on whether the approach is correct.
How can this be sped up?
SELECT INVDETAILS2.F5
, INVDETAILS2.F16
, ExpectedResult.DLID
, ExpectedResult.NumRows
FROM INVDETAILS2
INNER JOIN (INVDL INNER JOIN ExpectedResult ON INVDL.DLID = ExpectedResult.DLID)
    ON (INVDETAILS2.F14 = ROUND(ExpectedResult.Total))
   AND (INVDETAILS2.F1 = INVDL.RegionCode)
WHERE INVDETAILS2.F29 ='2013-03-06'
AND INVDETAILS2.F5 IN (SELECT INVDETAILS2.F5
FROM (ExpectedResult
INNER JOIN INVDL
ON ExpectedResult.DLID = INVDL.DLID)
INNER JOIN INVDETAILS2
ON INVDL.RegionCode = INVDETAILS2.F1
AND round(ExpectedResult.Total)
= INVDETAILS2.F14
WHERE INVDETAILS2.F29='2013-03-06'
GROUP BY INVDETAILS2.F5
HAVING Count(ExpectedResult.DLID)<2
)
;
Approximate Number of Rows in
"ExpectedResult" - Millions
"INVDL" - 80,000
"INVDETAILS2" - 300,000 total; approx. 10,000 for one date, and fewer again for each region per date.
Please provide a better query if possible.
Two things you could investigate that might help speed things up:
Indexing
Make sure that you have indexed all of the columns involved in JOINs, WHERE clauses, and GROUP BY clauses.
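For example (a hedged sketch in Access SQL; the index names are made up, the useful column combinations depend on your data, and you can skip any that duplicate an existing primary key or index):
CREATE INDEX idxInvDetails2F29F1 ON INVDETAILS2 (F29, F1);
CREATE INDEX idxInvDLRegionCode ON INVDL (RegionCode);
CREATE INDEX idxExpectedResultDLID ON ExpectedResult (DLID);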
JOIN expressions involving functions
A couple of your JOINs use Round(ExpectedResult.Total), so if you have an index on ExpectedResult.Total your query won't be able to use it. You may get a performance boost if you add a RoundedTotal column (Long Integer, Indexed), populate it with
UPDATE [ExpectedResult] SET [RoundedTotal]=Round([Total])
and then use the RoundedTotal column in your JOINs.
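A hedged sketch of the supporting DDL (Access/Jet SQL; the column and index names are placeholders):
ALTER TABLE ExpectedResult ADD COLUMN RoundedTotal LONG;
CREATE INDEX idxExpectedResultRoundedTotal ON ExpectedResult (RoundedTotal);
After running the UPDATE above to populate it, join on ExpectedResult.RoundedTotal instead of ROUND(ExpectedResult.Total) so the new index can be used.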

Bad performance of SQL query due to ORDER BY clause

I have a query joining 4 tables with a lot of conditions in the WHERE clause. The query also includes an ORDER BY clause on a numeric column. It takes 6 seconds to return, which is too long, and I need to speed it up. Surprisingly, I found that if I remove the ORDER BY clause it takes 2 seconds. Why does the ORDER BY make such a massive difference, and how can I optimize it? I am using SQL Server 2005. Many thanks.
I cannot confirm that the ORDER BY makes a big difference since I am clearing the execution plan cache. However, can you shed some light on how to speed this up a little bit? The query is as follows (for simplicity there is "SELECT *", but I am only selecting the columns I need).
SELECT *
FROM View_Product_Joined j
INNER JOIN [dbo].[OPR_PriceLookup] pl on pl.siteID = NodeSiteID and pl.skuid = j.skuid
LEFT JOIN [dbo].[OPR_InventoryRules] irp on irp.ID = pl.SkuID and irp.InventoryRulesType = 'Product'
LEFT JOIN [dbo].[OPR_InventoryRules] irs on irs.ID = pl.siteID and irs.InventoryRulesType = 'Store'
WHERE SiteName = N'EcommerceSite'
  AND Published = 1
  AND DocumentCulture = N'en-GB'
  AND NodeAliasPath LIKE N'/Products/Cats/Computers/Computer-servers/%'
  AND NodeSKUID IS NOT NULL
  AND SKUEnabled = 1
  AND pl.PriceLookupID IN (SELECT TOP 1 PriceLookupID
                           FROM OPR_PriceLookup pl2
                           WHERE pl.skuid = pl2.skuid
                             AND (pl2.RoleID = -1 OR pl2.RoleId = 13)
                           ORDER BY pl2.RoleID DESC)
ORDER BY NodeOrder ASC
Why does the ORDER BY make such a massive difference, and how can I optimize it?
The ORDER BY needs to sort the resultset which may take long if it's big.
To optimize it, you may need to index the tables properly.
The index access path, however, has its drawbacks so it can even take longer.
If you have something other than equijoins in your query, or ranged predicates (like <, > or BETWEEN), or a GROUP BY clause, then the index used for the ORDER BY may prevent the other indexes from being used.
If you post the query, I'll probably be able to tell you how to optimize it.
Update:
Rewrite the query:
SELECT *
FROM View_Product_Joined j
LEFT JOIN
[dbo].[OPR_InventoryRules] irp
ON irp.ID = j.skuid
AND irp.InventoryRulesType = 'Product'
LEFT JOIN
[dbo].[OPR_InventoryRules] irs
ON irs.ID = j.NodeSiteID
AND irs.InventoryRulesType = 'Store'
CROSS APPLY
(
SELECT TOP 1 *
FROM OPR_PriceLookup pl
WHERE pl.siteID = j.NodeSiteID
AND pl.skuid = j.skuid
AND pl.RoleID IN (-1, 13)
ORDER BY
pl.RoleID desc
) pl
WHERE SiteName = N'EcommerceSite'
AND Published = 1
AND DocumentCulture = N'en-GB'
AND NodeAliasPath LIKE N'/Products/Cats/Computers/Computer-servers/%'
AND NodeSKUID IS NOT NULL
AND SKUEnabled = 1
ORDER BY
NodeOrder ASC
The relation View_Product_Joined, as the name suggests, is probably a view.
Could you please post its definition?
If it is indexable, you may benefit from creating an index on View_Product_Joined (SiteName, Published, DocumentCulture, SKUEnabled, NodeOrder).
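A heavily hedged sketch of what that could look like, assuming the view meets SQL Server's indexed-view requirements (it must be created WITH SCHEMABINDING, and it cannot contain outer joins or TOP, among other restrictions); the unique key column chosen here is only a guess:
-- An indexed view needs a unique clustered index before any other index can be added:
CREATE UNIQUE CLUSTERED INDEX IX_ViewProductJoined_Key
    ON dbo.View_Product_Joined (NodeSKUID);
CREATE NONCLUSTERED INDEX IX_ViewProductJoined_Filter
    ON dbo.View_Product_Joined (SiteName, Published, DocumentCulture, SKUEnabled, NodeOrder);
If the view cannot be made schema-bound, the alternative is to index the underlying tables on these columns instead.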