Position of ON and WHERE clauses and their effect on performance - SQL

I have two tables, one called Health_User and the other called Diary. They hold users' demographic information and their recorded values, respectively. What I want to do is retrieve the recorded values, but:
Excluding testers (not real users) with the "is_tester" column (boolean values) in Health_User, and
Excluding unreasonable values with too high or too low measurements in Diary.
So I have several queries which should get the same results:
-- Query 1
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN (
SELECT id
FROM Health_User
WHERE is_tester = false
) AS u
ON d.user_id = u.id
WHERE ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
-- Query 2
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN Health_User AS u
ON d.user_id = u.id
WHERE u.is_tester = false
AND ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
-- Query 3
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Health_User AS u
JOIN (
SELECT id, user_id, glucose_value, unit
FROM Diary
WHERE ((glucose_value >= 20 AND glucose_value <= 600 AND unit = 'mg/dL')
OR (glucose_value >= 20/18.02 AND glucose_value <= 600/18.02 AND unit = 'mmol/L'))
) AS d
ON d.user_id = u.id
WHERE u.is_tester = false;
Here I have three questions:
Question 1: I would speculate that Query 1 would have better performance than Query 2, because a) it joins only one column instead of the whole table of Health_User and b) it filters out testers before joining the tables. Am I correct?
Question 2: The conditional limitation is more complex for Diary (see the last WHERE clause in Query 1). Is it better to move Diary inside the JOIN and leave Health_User outside, like Query 3, or does it make no difference?
Question 3: Is there any even better solution in terms of performance?

There would be a difference if the database executed the queries in the order your queries suggest (first filter, then join or vice versa).
As it is, PostgreSQL has a query optimizer that rearranges the query to find the most efficient execution order, and all your queries will end up with the same execution plan, which you can verify using the SQL statement EXPLAIN.
For inner joins, it does not influence the result if you filter before or after the join; you could also write all the conditions into the join condition without changing the result. The optimizer knows that.
You can speed up execution by creating appropriate indexes. Whether a certain index is useful depends on the distribution of the data. The rule of thumb is that indexes on selective conditions (those that filter out many rows) are more useful. Work with EXPLAIN to find the best indexes.
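As a sketch of that workflow (the index choices below are guesses based on the predicates in your queries, not recommendations verified against your data):

EXPLAIN ANALYZE
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN Health_User AS u ON d.user_id = u.id
WHERE u.is_tester = false
AND ((d.glucose_value BETWEEN 20 AND 600 AND d.unit = 'mg/dL')
OR (d.glucose_value BETWEEN 20/18.02 AND 600/18.02 AND d.unit = 'mmol/L'));

-- Hypothetical indexes; run EXPLAIN again afterwards to see whether they are used.
CREATE INDEX diary_user_id_idx ON Diary (user_id);
CREATE INDEX health_user_real_users_idx ON Health_User (id) WHERE is_tester = false;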


Is it possible to use UNION here instead of OR?

UPDATE: Changed title. Previous title "Does UNION instead of OR always speed up queries?"
Here is my query. The question is concerning the second last line with the OR:
SELECT distinct bigUnionQuery.customer
FROM ((SELECT buyer.customer
FROM membership_vw buyer
JOIN account_vw account
ON account.buyer = buyer.id
WHERE account.closedate >= 'some_date')
UNION
(SELECT joint.customer
FROM entity_vw joint
JOIN transactorassociation_vw assoc
ON assoc.associatedentity = joint.id
JOIN account_vw account
ON account.buyer = assoc.entity
WHERE assoc.account is null and account.closedate >= 'some_date')
UNION
(SELECT joint.customer
FROM entity_vw joint
JOIN transactorassociation_vw assoc
ON assoc.associatedentity = joint.id
JOIN account_vw account
ON account.id = assoc.account
WHERE account.closedate >= '2021-02-11 00:30:22.339'))
AS bigUnionQuery
JOIN entity_vw
ON entity_vw.customer = bigUnionQuery.customer OR entity_vw.id = bigUnionQuery.customer
WHERE entity_vw.lastmodifieddate >= 'some_date';
The original query doesn't have the OR in the second last line. Adding the OR here has slowed down the query. I'm wondering if there is a way to use UNION here to speed it up.
I tried doing (pseudo):
bigUnionQuery bq join entity_vw e on e.customer = bq.customer
union
bigUnionQuery bq join entity_vw e on e.id = bq.customer
But that slowed down the query even more, probably because the bigUnionQuery is a large, slow query, and running it twice in the UNION is not the correct way. What would be the right way to use UNION here, or is it always going to be faster with OR?
Does UNION instead of OR always speed up queries? In some cases it does; it also depends on your indexes. I have worked on tables with a million records, and my queries' speed usually improves when I use UNION instead of OR.
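One way to avoid evaluating the expensive union twice is to compute it once in a CTE and then UNION the two cheap join variants against it. A sketch, assuming PostgreSQL (the MATERIALIZED keyword needs version 12+; older versions materialize CTEs by default), with the three branches copied unchanged from the question:

WITH bq AS MATERIALIZED (
SELECT buyer.customer
FROM membership_vw buyer
JOIN account_vw account ON account.buyer = buyer.id
WHERE account.closedate >= 'some_date'
UNION
SELECT joint.customer
FROM entity_vw joint
JOIN transactorassociation_vw assoc ON assoc.associatedentity = joint.id
JOIN account_vw account ON account.buyer = assoc.entity
WHERE assoc.account IS NULL AND account.closedate >= 'some_date'
UNION
SELECT joint.customer
FROM entity_vw joint
JOIN transactorassociation_vw assoc ON assoc.associatedentity = joint.id
JOIN account_vw account ON account.id = assoc.account
WHERE account.closedate >= 'some_date'
)
SELECT bq.customer
FROM bq JOIN entity_vw e ON e.customer = bq.customer
WHERE e.lastmodifieddate >= 'some_date'
UNION
SELECT bq.customer
FROM bq JOIN entity_vw e ON e.id = bq.customer
WHERE e.lastmodifieddate >= 'some_date';

The final UNION already removes duplicates, so the DISTINCT is no longer needed. Whether this beats the OR join depends on your data and indexes, so compare plans with EXPLAIN ANALYZE.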

Oracle complex query with multiple joins on same table

I am dealing with a monster query (~800 lines) on Oracle 11, and it is consuming expensive resources.
The main problem here is a table mouvement with about 18 million rows, on which I have about 30 left joins like these:
LEFT JOIN mouvement mracct_ad1
ON mracct_ad1.code_portefeuille = t.code_portefeuille
AND mracct_ad1.statut_ligne = 'PROPRE'
AND substr(mracct_ad1.code_valeur,1,4) = 'MRAC'
AND mracct_ad1.code_transaction = t.code_transaction
LEFT JOIN mouvement mracct_zias
ON mracct_zias.code_portefeuille = t.code_portefeuille
AND mracct_zias.statut_ligne = 'PROPRE'
AND substr(mracct_zias.code_valeur,1,4) = 'PRAC'
AND mracct_zias.code_transaction = t.code_transaction
LEFT JOIN mouvement mracct_zixs
ON mracct_zias.code_portefeuille = t.code_portefeuille
AND mracct_zias.statut_ligne = 'XROPRE'
AND substr(mracct_zias.code_valeur,1,4) = 'MRAT'
AND mracct_zias.code_transaction = t.code_transaction
Is there some way I can get rid of the left joins (a union join, for example) to make the query faster and consume less? An execution plan hint or something?
Just a note on performance. Usually you want to "rephrase" conditions like:
AND substr(mracct_ad1.code_valeur,1,4) = 'MRAC'
In simple words, expressions on the left side of the equality prevent the best use of indexes and may push the SQL optimizer toward a less than optimal plan. The database engine will end up doing more work than is really needed, and the query will be [much] slower. In extreme cases the optimizer can even decide to use a full table scan. In this case you can rephrase it as:
AND mracct_ad1.code_valeur like 'MRAC%'
or:
AND mracct_ad1.code_valeur >= 'MRAC' AND mracct_ad1.code_valeur < 'MRAD'
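Both rewrites let a plain B-tree index on code_valeur be used as a range scan; alternatively, if you prefer to keep the substr() predicate, Oracle supports a function-based index on it. A sketch with made-up index names:

CREATE INDEX mouvement_code_valeur_ix ON mouvement (code_valeur);
-- or, to serve the original substr() predicate directly:
CREATE INDEX mouvement_code_valeur_fbi ON mouvement (substr(code_valeur, 1, 4));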
I am guessing so. Your code sample doesn't make much sense, but you can probably do conditional aggregation:
left join
(select m.code_portefeuille, m.code_transaction,
max(case when m.statut_ligne = 'PROPRE' and m.code_valeur like 'MRAC%' then ? end) as ad1,
max(case when m.statut_ligne = 'PROPRE' and m.code_valeur like 'PRAC%' then ? end) as zia,
. . . -- for all the rest of the joins as well
from mouvement m
group by m.code_portefeuille, m.code_transaction
) m
on m.code_portefeuille = t.code_portefeuille and m.code_transaction = t.code_transaction
You can probably replace all 30 joins with a single join to the aggregated table.
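The ? placeholders stand for whichever mouvement column each original left join was meant to deliver. For instance, with a hypothetical column montant (made up for illustration; substitute your real column), the first branch would read:

max(case when m.statut_ligne = 'PROPRE' and m.code_valeur like 'MRAC%' then m.montant end) as ad1,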

Having clause not working

I wrote this query against an Oracle 11g database:
SELECT DISTINCT JOC_FIN_CLTH_DFCT_LOT.LOT_NO,
I.ISSUE_DATE,
R.PROC_DESC,
R.RECV_DATE,
M.DFCT_DATE,
JOC_FIN_CLTH_DFCT_LOT.FCD_MAIN_ID
FROM JOC_FIN_CLTH_DFCT_LOT,
JOC_FIN_CLTH_DFCT_MAIN M,
JOC_DAILY_FABRC_RECV_FOLD R,
JOC_LOT_ISSUE_REG I
WHERE M.FCD_MAIN_ID = JOC_FIN_CLTH_DFCT_LOT.FCD_MAIN_ID
AND R.LOT_NO = JOC_FIN_CLTH_DFCT_LOT.LOT_NO
AND I.LOT_NO = R.LOT_NO
AND I.LOT_YEAR = R.LOT_YEAR
AND JOC_FIN_CLTH_DFCT_LOT.LOT_YEAR = R.LOT_YEAR
AND JOC_FIN_CLTH_DFCT_LOT.LOT_YEAR = '1213'
AND JOC_FIN_CLTH_DFCT_LOT.FCDL_ID IN
( SELECT MIN (DFCT_LOT.FCDL_ID)
FROM JOC_FIN_CLTH_DFCT_LOT DFCT_LOT, JOC_FIN_CLTH_DFCT_MAIN DFT_MAIN
WHERE DFCT_LOT.FCD_MAIN_ID IN (DFT_MAIN.FCD_MAIN_ID)
GROUP BY DFCT_LOT.FCD_MAIN_ID)
ORDER BY JOC_FIN_CLTH_DFCT_LOT.FCD_MAIN_ID
It retrieves the data within 2 sec (no. of rows = 5100).
But when I use this query in my front-end application it takes too much time, so after troubleshooting I found that the subquery causes the problem when the data is retrieved, so I simplified the query:
SELECT DISTINCT DFCT_LOT.LOT_NO,
I.ISSUE_DATE,
R.PROC_DESC,
R.RECV_DATE,
M.DFCT_DATE,
DFCT_LOT.FCD_MAIN_ID
FROM JOC_FIN_CLTH_DFCT_LOT DFCT_LOT,
JOC_FIN_CLTH_DFCT_MAIN M,
JOC_DAILY_FABRC_RECV_FOLD R,
JOC_LOT_ISSUE_REG I,
JOC_FIN_CLTH_DFCT_MAIN DFT_MAIN
WHERE M.FCD_MAIN_ID = DFCT_LOT.FCD_MAIN_ID
AND R.LOT_NO = DFCT_LOT.LOT_NO
AND I.LOT_NO = R.LOT_NO
AND I.LOT_YEAR = R.LOT_YEAR
AND DFCT_LOT.LOT_YEAR = R.LOT_YEAR
AND DFCT_LOT.LOT_YEAR = '1213'
AND DFCT_LOT.FCD_MAIN_ID IN (DFT_MAIN.FCD_MAIN_ID)
GROUP BY DFCT_LOT.FCDL_ID,
DFCT_LOT.FCD_MAIN_ID,
DFCT_LOT.LOT_NO,
I.ISSUE_DATE,
R.PROC_DESC,
R.RECV_DATE,
M.DFCT_DATE,
DFCT_LOT.FCD_MAIN_ID
HAVING DFCT_LOT.FCDL_ID in MIN (DFCT_LOT.FCDL_ID)
ORDER BY DFCT_LOT.FCD_MAIN_ID
This is the simplified form of the above query, but the number of rows increases: no. of rows = 5578, while I know the actual no. of rows = 5100.
The HAVING clause is not working here.
Kindly look into my query and guide me.
In your second query you join the table JOC_FIN_CLTH_DFCT_MAIN a second time, once as M and once as DFT_MAIN. In the first query that second reference sits inside the IN subquery, where its GROUP BY collapses it to one row per FCD_MAIN_ID. When there is more than one record in JOC_FIN_CLTH_DFCT_MAIN for the same FCD_MAIN_ID, the flattened join in the second query produces more combinations, and this explains why you get more results with it.
But there are several other differences. In the second query you group by many more columns than in the first. Moreover, MIN(DFCT_LOT.FCDL_ID) does not really make sense in the second query: FCDL_ID is already in the GROUP BY, so the MIN is exactly the same as plain DFCT_LOT.FCDL_ID. As a consequence, the HAVING clause is a plain tautology, and you could just as well leave it out and still get the same results.
If you are sure the first query gives the results you want, then I would suggest a different way to achieve a possible optimisation of it:
SELECT DISTINCT
L.LOT_NO,
I.ISSUE_DATE,
R.PROC_DESC,
R.RECV_DATE,
L.DFCT_DATE,
L.FCD_MAIN_ID
FROM (SELECT L.FCD_MAIN_ID,
L.LOT_NO,
L.LOT_YEAR,
M.DFCT_DATE,
ROW_NUMBER() OVER (PARTITION BY L.FCD_MAIN_ID
ORDER BY L.FCDL_ID) AS RN
FROM JOC_FIN_CLTH_DFCT_LOT L
INNER JOIN JOC_FIN_CLTH_DFCT_MAIN M
ON M.FCD_MAIN_ID = L.FCD_MAIN_ID
) L
INNER JOIN JOC_DAILY_FABRC_RECV_FOLD R
ON R.LOT_NO = L.LOT_NO
AND R.LOT_YEAR = L.LOT_YEAR
INNER JOIN JOC_LOT_ISSUE_REG I
ON I.LOT_NO = R.LOT_NO
AND I.LOT_YEAR = R.LOT_YEAR
WHERE L.LOT_YEAR = '1213'
AND L.RN = 1
ORDER BY L.FCD_MAIN_ID
Note that I have used the ANSI/ISO syntax for joins, which I would strongly advise you to use. Defining join conditions in the WHERE clause is something of the eighties; don't do it. Queries become much more readable once you are used to the ANSI/ISO syntax.
The suggested query selects all the columns that are needed from both JOC_FIN_CLTH_DFCT_LOT and JOC_FIN_CLTH_DFCT_MAIN in the sub-query, that way you don't have to include those tables again.
The major trick is the use of the ROW_NUMBER window function, which assigns a sequence number within each partition defined by the PARTITION BY clause. The outer query then filters for only those records that got number 1, which are the records where the value of FCDL_ID is minimal for a given FCD_MAIN_ID.
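A minimal illustration of the trick, with made-up rows:

SELECT FCD_MAIN_ID,
FCDL_ID,
ROW_NUMBER() OVER (PARTITION BY FCD_MAIN_ID ORDER BY FCDL_ID) AS RN
FROM JOC_FIN_CLTH_DFCT_LOT;

-- FCD_MAIN_ID  FCDL_ID  RN
-- 10           101      1   <- kept by the RN = 1 filter
-- 10           105      2
-- 11           201      1   <- kept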

optimizing a large "distinct" select in postgres

I have a rather large dataset (millions of rows). I'm having trouble introducing a "distinct" concept to a certain query. (I'm putting distinct in quotes, because this could be provided by the Postgres keyword DISTINCT or a GROUP BY form.)
A non-distinct search takes 1-2 ms; all attempts to introduce a "distinct" concept have grown this to the 50,000-90,000 ms range.
My goal is to show the latest resources based on their most recent appearance in an event stream.
My non-distinct query is essentially this:
SELECT
resource.id AS resource_id,
stream_event.event_timestamp AS event_timestamp
FROM
resource
JOIN
resource_2_stream_event ON (resource.id = resource_2_stream_event.resource_id)
JOIN
stream_event ON (resource_2_stream_event.stream_event_id = stream_event.id)
WHERE
stream_event.viewer = 47
ORDER BY event_timestamp DESC
LIMIT 25
;
I've tried many different forms of queries (and subqueries) using DISTINCT, GROUP BY and MAX(event_timestamp). The issue isn't getting a query that works, it's getting one that works in a reasonable execution time. Looking at the EXPLAIN ANALYZE output for each one, everything is running off of indexes. The problem seems to be that with any attempt to deduplicate my results, Postgres must assemble the entire result set on disk; since each table has millions of rows, this becomes a bottleneck.
Update: here's a working GROUP BY query:
EXPLAIN ANALYZE
SELECT
resource.id AS resource_id,
max(stream_event.event_timestamp) AS stream_event_event_timestamp
FROM
resource
JOIN resource_2_stream_event ON (resource_2_stream_event.resource_id = resource.id)
JOIN stream_event ON stream_event.id = resource_2_stream_event.stream_event_id
WHERE (
(stream_event.viewer_id = 57) AND
(resource.condition_1 IS NOT True) AND
(resource.condition_2 IS NOT True) AND
(resource.condition_3 IS NOT True) AND
(resource.condition_4 IS NOT True) AND
(
(resource.condition_5 IS NULL) OR (resource.condition_6 IS NULL)
)
)
GROUP BY (resource.id)
ORDER BY stream_event_event_timestamp DESC LIMIT 25;
Looking at the query planner (via EXPLAIN ANALYZE), it seems that adding the MAX + GROUP BY clause (or a DISTINCT) forces a sequential scan, which takes about half the time to compute. There already is an index that contains every "condition", and I tried creating a set of indexes (one for each element). None work.
In any event, the difference is between 2 ms and 72,000 ms.
Often, DISTINCT ON is the most efficient way to get one row per something. I would suggest trying:
SELECT DISTINCT ON (r.id) r.id AS resource_id, se.event_timestamp
FROM resource r JOIN
resource_2_stream_event r2se
ON r.id = r2se.resource_id JOIN
stream_event se
ON r2se.stream_event_id = se.id
WHERE se.viewer = 47
ORDER BY r.id, se.event_timestamp DESC
LIMIT 25;
An index on stream_event(viewer, event_timestamp) might help performance.
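A sketch of that index plus one for the junction table (index names made up, and the column choices are a guess from the query above):

CREATE INDEX stream_event_viewer_ts_idx ON stream_event (viewer, event_timestamp DESC);
CREATE INDEX r2se_event_resource_idx ON resource_2_stream_event (stream_event_id, resource_id);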
EDIT:
You might try using a CTE to get what you want:
WITH CTE as (
SELECT r.id AS resource_id,
se.event_timestamp AS stream_event_event_timestamp
FROM resource r JOIN
resource_2_stream_event r2se
ON r2se.resource_id = r.id JOIN
stream_event se
ON se.id = r2se.stream_event_id
WHERE ((se.viewer_id = 57) AND
(r.condition_1 IS NOT True) AND
(r.condition_2 IS NOT True) AND
(r.condition_3 IS NOT True) AND
(r.condition_4 IS NOT True) AND
( (r.condition_5 IS NULL) OR (r.condition_6 IS NULL)
)
)
)
SELECT resource_id, max(stream_event_event_timestamp) as stream_event_event_timestamp
FROM CTE
GROUP BY resource_id
ORDER BY stream_event_event_timestamp DESC
LIMIT 25;
Postgres materializes the CTE. So, if there are not that many matches, this may speed the query by using indexes for the CTE.
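One caveat for readers on PostgreSQL 12 or later: plain CTEs are no longer materialized unconditionally there, so to keep the behavior this answer relies on, spell the first line as

WITH CTE AS MATERIALIZED (

and leave the rest of the query unchanged.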

Postgres consistently favoring nested loop join over merge join

I have a complex query:
SELECT DISTINCT ON (delivery.id)
delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status = 1000
-- LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id
WHERE
NOT EXISTS (SELECT dl_finished.id FROM mailer.mailer_message_recipient_rel_log AS dl_finished WHERE dl_finished.rel_id = delivery.id AND dl_finished.status <> 1000) AND
dl_processing.date <= NOW() - (36000 * INTERVAL '1 second') AND
NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
-- (r.times_bounced < 5 OR r.times_bounced IS NULL) AND
NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id, dl_processing.id DESC
LIMIT 1000;
It is running very slowly, and the reason seems to be that Postgres is consistently avoiding using merge joins in its query plan despite me having all the indices that I would need for this. It looks really depressing:
http://explain.depesz.com/s/tVY
http://i.stack.imgur.com/Myw4R.png
Why would this happen? How do I troubleshoot such an issue?
UPD: with @wildplasser's help I have reworked the query to fix performance (while changing its semantics somewhat):
SELECT delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status in (1000, 2, 5) AND dl_processing.date <= NOW() - (36000 * INTERVAL '1 second')
LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
WHERE
(r.times_bounced < 5 OR r.times_bounced IS NULL) AND
NOT EXISTS (SELECT dl_other.id FROM mailer.mailer_message_recipient_rel_log AS dl_other WHERE dl_other.rel_id = delivery.id AND dl_other.id > dl_processing.id) AND
NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id
LIMIT 1000
It now runs well, but the query plan still sports these horrible nested loop joins <_<:
http://explain.depesz.com/s/MTo3
I would still like to know why that is.
The reason is that Postgres is actually doing the right thing, and I suck at math. Suppose table A has N rows, and table B has M rows, and they are being joined via a column that they both have a B-tree index for. Then the following is true:
Nested loop join's time complexity is not O(MN), as I naively thought, but O(M log N) or O(N log M), depending on which table is scanned linearly. If both are scanned by an index, we get O(M log M log N) or O(N log M log N), respectively. But since that is only required if a specific order of the rows is needed for yet another join or for the ORDER BY clause, as we'll see it's not a bad deal at all.
Merge join's time complexity is O(M log M + N log N), which means that it loses to the nested loop join, provided that the asymptotic proportionality coefficients are the same, and AFAIK they should both be equal to 1 in most implementations. Since both tables must be iterated by the same index in the same direction, if a different order is required, an additional sort is needed, which easily makes the complexity worse than in the case of the nested loop join.
So basically, despite being associated with the merge sort, which we all love, merge join almost always sucks.
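To put rough, purely illustrative numbers on that: with M = N = 10^6 rows and base-2 logarithms, an index nested loop costs on the order of M log N ≈ 10^6 × 20 = 2 × 10^7 operations, while a merge join that has to sort both inputs first costs on the order of M log M + N log N ≈ 4 × 10^7, and that is before the merge pass itself.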
The reason why my first query was so slow is that it had to perform a sort before applying the limit, and it was also bad in many other ways. After applying @wildplasser's suggestions, I managed to reduce the number of (still expensive) nested loops and also allow the limit to be taken without a sort, thus ensuring that Postgres most likely won't need to run the outer scan to completion, which is where the bulk of the performance gain comes from.