UPDATE: Changed title. Previous title "Does UNION instead of OR always speed up queries?"
Here is my query. The question is concerning the second last line with the OR:
SELECT distinct bigUnionQuery.customer
FROM ((SELECT buyer.customer
FROM membership_vw buyer
JOIN account_vw account
ON account.buyer = buyer.id
WHERE account.closedate >= 'some_date')
UNION
(SELECT joint.customer
FROM entity_vw joint
JOIN transactorassociation_vw assoc
ON assoc.associatedentity = joint.id
JOIN account_vw account
ON account.buyer = assoc.entity
WHERE assoc.account is null and account.closedate >= 'some_date')
UNION
(SELECT joint.customer
FROM entity_vw joint
JOIN transactorassociation_vw assoc
ON assoc.associatedentity = joint.id
JOIN account_vw account
ON account.id = assoc.account
WHERE account.closedate >= '2021-02-11 00:30:22.339'))
AS bigUnionQuery
JOIN entity_vw
ON entity_vw.customer = bigUnionQuery.customer OR entity_vw.id = bigUnionQuery.customer
WHERE entity_vw.lastmodifieddate >= 'some_date';
The original query doesn't have the OR in the second last line. Adding the OR here has slowed down the query. I'm wondering if there is a way to use UNION here to speed it up.
I tried doing (pseudo):
bigUnionQuery bq join entity_vw e on e.customer = bq.customer
union
bigUnionQuery bq join entity_vw e on e.id = bq.customer
But that slowed down the query even more, probably because the bigUnionQuery is a large, slow query, and running it twice in the UNION is not the correct way. What would be the right way to use UNION here, or is it always going to be faster with OR?
Does UNION instead of OR always speed up queries? Not always, but in some cases it does, and it depends on your indexes too. I have worked on tables with about 1 million records, and my queries usually got faster when I used UNION instead of OR.
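One way to keep the UNION rewrite from running the expensive subquery twice is to store its result once and join against that. The following is only a minimal sketch using Python's sqlite3 module: the temp table name big_union and the sample rows are assumptions made for illustration, and whether this is faster than the OR on your database has to be checked against the real execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_vw (id INTEGER, customer INTEGER, lastmodifieddate TEXT);
    INSERT INTO entity_vw VALUES
        (1, 10, '2021-03-01'),
        (2, 10, '2021-01-01'),
        (10, 99, '2021-03-05'),
        (3, 30, '2021-03-02');
    -- Hypothetical stand-in for the expensive three-way UNION: computed once.
    CREATE TEMP TABLE big_union AS
        SELECT 10 AS customer UNION SELECT 30;
""")

# Original shape: one join with an OR in the join condition.
or_version = """
    SELECT DISTINCT b.customer
    FROM big_union b
    JOIN entity_vw e ON e.customer = b.customer OR e.id = b.customer
    WHERE e.lastmodifieddate >= '2021-02-11'
"""

# Rewritten shape: one join per condition, combined with UNION.
union_version = """
    SELECT b.customer
    FROM big_union b
    JOIN entity_vw e ON e.customer = b.customer
    WHERE e.lastmodifieddate >= '2021-02-11'
    UNION
    SELECT b.customer
    FROM big_union b
    JOIN entity_vw e ON e.id = b.customer
    WHERE e.lastmodifieddate >= '2021-02-11'
"""

or_rows = sorted(conn.execute(or_version).fetchall())
union_rows = sorted(conn.execute(union_version).fetchall())
print(or_rows, union_rows)
```

Both statements return the same customers here; the difference is that each branch of the UNION has a plain equality join that an index on entity_vw.customer or entity_vw.id can serve.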
Related
I am dealing with a monster query (~800 lines) on Oracle 11, and it is consuming a lot of resources.
The main problem is the table mouvement, with about 18 million rows, which the query left-joins roughly 30 times:
LEFT JOIN mouvement mracct_ad1
ON mracct_ad1.code_portefeuille = t.code_portefeuille
AND mracct_ad1.statut_ligne = 'PROPRE'
AND substr(mracct_ad1.code_valeur,1,4) = 'MRAC'
AND mracct_ad1.code_transaction = t.code_transaction
LEFT JOIN mouvement mracct_zias
ON mracct_zias.code_portefeuille = t.code_portefeuille
AND mracct_zias.statut_ligne = 'PROPRE'
AND substr(mracct_zias.code_valeur,1,4) = 'PRAC'
AND mracct_zias.code_transaction = t.code_transaction
LEFT JOIN mouvement mracct_zixs
ON mracct_zias.code_portefeuille = t.code_portefeuille
AND mracct_zias.statut_ligne = 'XROPRE'
AND substr(mracct_zias.code_valeur,1,4) = 'MRAT'
AND mracct_zias.code_transaction = t.code_transaction
Is there some way to get rid of the left joins (a union, for example) so that the query runs faster and consumes fewer resources? Should I look at the execution plan or something else?
Just a note on performance. Usually you want to "rephrase" conditions like:
AND substr(mracct_ad1.code_valeur,1,4) = 'MRAC'
In simple words, wrapping the column in an expression (here substr) on the left side of the equality prevents the best usage of indexes and may push the SQL optimizer toward a less than optimal plan. The database engine ends up doing more work than is really needed, and the query will be [much] slower. In extreme cases the optimizer can even decide to use a full table scan. In this case you can rephrase it as:
AND mracct_ad1.code_valeur like 'MRAC%'
or:
AND mracct_ad1.code_valeur >= 'MRAC' AND mracct_ad1.code_valeur < 'MRAD'
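To see the effect on a small scale, here is a sketch that compares the two forms with Python's sqlite3 module. It runs on SQLite, not Oracle, so the plan text is only illustrative; the index name and the sample rows are assumptions. On Oracle you would compare the real execution plans instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mouvement (id INTEGER PRIMARY KEY, code_valeur TEXT);
    -- Hypothetical index on the filtered column.
    CREATE INDEX idx_mouvement_code_valeur ON mouvement (code_valeur);
    INSERT INTO mouvement (code_valeur) VALUES
        ('MRAC001'), ('MRAC002'), ('MRAT001'), ('PRAC001');
""")

def plan(where_clause):
    """Return SQLite's query plan text for a filter on mouvement."""
    sql = "EXPLAIN QUERY PLAN SELECT id FROM mouvement WHERE " + where_clause
    return " | ".join(row[3] for row in conn.execute(sql))

def ids(where_clause):
    """Return the ids matched by a filter on mouvement."""
    sql = "SELECT id FROM mouvement WHERE " + where_clause + " ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

wrapped = "substr(code_valeur, 1, 4) = 'MRAC'"                # column hidden inside a function
ranged = "code_valeur >= 'MRAC' AND code_valeur < 'MRAD'"     # bare column, range predicate

plan_wrapped = plan(wrapped)
plan_ranged = plan(ranged)
print(plan_wrapped)
print(plan_ranged)
```

Both filters match the same rows, but only the range form lets SQLite seek into the index (a SEARCH step) rather than walk every entry (a SCAN step).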
I am guessing so. Your code sample doesn't make much sense, but you can probably do conditional aggregation:
left join
(select m.code_portefeuille, m.code_transaction,
max(case when m.statut_ligne = 'PROPRE' and m.code_valeur like 'MRAC%' then ? end) as ad1,
max(case when m.statut_ligne = 'PROPRE' and m.code_valeur like 'PRAC%' then ? end) as zia,
. . . -- for all the rest of the joins as well
from mouvement m
group by m.code_portefeuille, m.code_transaction
) m
on m.code_portefeuille = t.code_portefeuille and m.code_transaction = t.code_transaction
You can probably replace all 30 joins with a single join to the aggregated table.
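A small runnable sketch of that idea, using Python's sqlite3 module. The column montant stands in for the ? placeholder above and is purely hypothetical, as are the sample rows. Note that the rewrite only matches the original joins if each (portefeuille, transaction, category) combination has at most one row, or if an aggregate such as MAX is acceptable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (code_portefeuille TEXT, code_transaction TEXT);
    CREATE TABLE mouvement (
        code_portefeuille TEXT, code_transaction TEXT,
        statut_ligne TEXT, code_valeur TEXT,
        montant REAL  -- hypothetical value column
    );
    INSERT INTO t VALUES ('P1', 'T1'), ('P2', 'T2');
    INSERT INTO mouvement VALUES
        ('P1', 'T1', 'PROPRE', 'MRAC001', 100.0),
        ('P1', 'T1', 'PROPRE', 'PRAC001', 7.5),
        ('P1', 'T1', 'AUTRE',  'MRAC002', 999.0);
""")

# One pass over mouvement: each former LEFT JOIN becomes one CASE column.
query = """
    SELECT t.code_portefeuille, t.code_transaction, m.ad1, m.zias
    FROM t
    LEFT JOIN (
        SELECT code_portefeuille, code_transaction,
               MAX(CASE WHEN statut_ligne = 'PROPRE'
                         AND code_valeur LIKE 'MRAC%' THEN montant END) AS ad1,
               MAX(CASE WHEN statut_ligne = 'PROPRE'
                         AND code_valeur LIKE 'PRAC%' THEN montant END) AS zias
        FROM mouvement
        GROUP BY code_portefeuille, code_transaction
    ) m
      ON m.code_portefeuille = t.code_portefeuille
     AND m.code_transaction = t.code_transaction
    ORDER BY t.code_portefeuille
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Rows of t without any matching mouvement still come back, with NULL in every aggregated column, which is what the individual left joins would have produced.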
I have two tables, one called Health_User and the other called Diary. They have users' demographic information, and their recorded values respectively. What I want to do is retrieving the recorded values, but:
Excluding testers (not real users) with the "is_tester" column (boolean values) in Health_User, and
Excluding unreasonable values with too high or too low measurements in Diary.
So I have several queries which should get the same results:
# Query 1
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN (
SELECT id
FROM Health_User
WHERE is_tester = false
) AS u
ON d.user_id = u.id
WHERE ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
# Query 2
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN Health_User AS u
ON d.user_id = u.id
WHERE u.is_tester = false
AND ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
# Query 3
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Health_User AS u
JOIN (
SELECT id, user_id, glucose_value, unit
FROM Diary
WHERE ((glucose_value >= 20 AND glucose_value <= 600 AND unit = 'mg/dL')
OR (glucose_value >= 20/18.02 AND glucose_value <= 600/18.02 AND unit = 'mmol/L'))
) AS d
ON d.user_id = u.id
WHERE u.is_tester = false;
Here I have three questions:
Question 1: I would speculate that Query 1 would have better performance than Query 2, because a) it joins only one column instead of the whole table of Health_User and b) it filters out testers before joining the tables. Am I correct?
Question 2: The conditional limitation is more complex for Diary (See the last WHERE clause in Query 1). Is it better to switch Diary inside the JOIN and make Health_User outside like Query 3, or it makes no difference?
Question 3: Is there any even better solution in terms of performance?
There would be a difference if the database executed the queries in the order your queries suggest (first filter, then join or vice versa).
As it is, PostgreSQL has a query optimizer that rearranges the query to find the most efficient execution order, and all your queries will end up with the same execution plan, which you can verify using the SQL statement EXPLAIN.
For inner joins, it does not influence the result if you filter before or after the join; you could also write all the conditions into the join condition without changing the result. The optimizer knows that.
You can speed up execution by creating appropriate indexes. Whether a certain index is useful depends on the distribution of the data. The rule of thumb is that indexes on selective conditions (ones that filter out many rows) are more useful. Work with EXPLAIN to find the best indexes.
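The point that filtering before or after an inner join does not change the result can be checked on a toy data set. This sketch uses Python's sqlite3 module rather than PostgreSQL, writes the boolean as 0/1, and uses made-up rows, so it only demonstrates that the three query shapes are equivalent; for the actual plans, run EXPLAIN on your PostgreSQL instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Health_User (id INTEGER PRIMARY KEY, is_tester INTEGER);
    CREATE TABLE Diary (id INTEGER PRIMARY KEY, user_id INTEGER,
                        glucose_value REAL, unit TEXT);
    INSERT INTO Health_User VALUES (1, 0), (2, 1);
    INSERT INTO Diary VALUES
        (1, 1, 100, 'mg/dL'),   -- kept
        (2, 1, 700, 'mg/dL'),   -- too high
        (3, 1, 5.5, 'mmol/L'),  -- kept
        (4, 2, 100, 'mg/dL'),   -- tester
        (5, 1, 0.5, 'mmol/L');  -- too low
""")

# {d} is replaced by the table alias (or nothing) in each query.
value_filter = """
    (({d}glucose_value >= 20 AND {d}glucose_value <= 600 AND {d}unit = 'mg/dL')
     OR ({d}glucose_value >= 20/18.02 AND {d}glucose_value <= 600/18.02
         AND {d}unit = 'mmol/L'))
"""
cols = "d.user_id, d.id AS diary_id, d.glucose_value, d.unit"

query1 = f"""SELECT {cols} FROM Diary AS d
    JOIN (SELECT id FROM Health_User WHERE is_tester = 0) AS u ON d.user_id = u.id
    WHERE {value_filter.format(d='d.')} ORDER BY d.id"""
query2 = f"""SELECT {cols} FROM Diary AS d
    JOIN Health_User AS u ON d.user_id = u.id
    WHERE u.is_tester = 0 AND {value_filter.format(d='d.')} ORDER BY d.id"""
query3 = f"""SELECT {cols} FROM Health_User AS u
    JOIN (SELECT id, user_id, glucose_value, unit FROM Diary
          WHERE {value_filter.format(d='')}) AS d ON d.user_id = u.id
    WHERE u.is_tester = 0 ORDER BY d.id"""

r1, r2, r3 = (conn.execute(q).fetchall() for q in (query1, query2, query3))
print(r1)
```

All three return the same two diary rows, so the choice between them is a matter of readability as long as the optimizer treats them alike.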
I'd like to make a list based on whether a field in the original table is in two lists. My code is thus:
SELECT *
FROM ListofPlaces
WHERE Property = 'MODERATE'
and (HOMELAND in (
SELECT distinct HOMELAND
FROM PLANS
WHERE left(plans.code, 1) = '1')
or HOMELAND in (
'PlaceA'
, 'PlaceB'
, 'PlaceC'
, 'PlaceD'
, 'PlaceE'))
The list and the subquery each work fine on their own, taking 00:00:01.43 for the subquery and 00:00:00.13 for the list; however, they take around a minute once combined.
I have tried using a left join, but this leads to a more significant reduction in performance.
The table PLANS is a larger table of 4M+ rows, whilst ListofPlaces has fewer than 1000.
My question is whether I'm using the and/or operators efficiently, and if so, is there a more efficient way to run this query?
Try rewriting this using UNION:
SELECT *
FROM ListofPlaces
WHERE Property = 'MODERATE' AND
HOMELAND IN (SELECT HOMELAND
FROM PLANS
WHERE left(plans.code, 1) = '1'
)
UNION
SELECT *
FROM ListofPlaces
WHERE Property = 'MODERATE' AND
HOMELAND in ('PlaceA', 'PlaceB', 'PlaceC', 'PlaceD', 'PlaceE');
The optimizer can sometimes be confused by ORs. UNION (rather than UNION ALL) is needed here if the two branches can return the same rows, because it removes the duplicates. If you know the branches are disjoint, use UNION ALL.
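To illustrate the UNION versus UNION ALL point, here is a small sketch with Python's sqlite3 module. SQLite has no left() function, so substr(code, 1, 1) is used in its place, and the rows are invented: PlaceA deliberately satisfies both branches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ListofPlaces (HOMELAND TEXT, Property TEXT);
    CREATE TABLE PLANS (HOMELAND TEXT, code TEXT);
    INSERT INTO ListofPlaces VALUES
        ('PlaceA', 'MODERATE'), ('PlaceZ', 'MODERATE'), ('PlaceQ', 'HIGH');
    INSERT INTO PLANS VALUES
        ('PlaceA', '1xx'), ('PlaceZ', '1yy'), ('PlaceZ', '2zz');
""")

branch_subquery = """
    SELECT * FROM ListofPlaces
    WHERE Property = 'MODERATE'
      AND HOMELAND IN (SELECT HOMELAND FROM PLANS WHERE substr(code, 1, 1) = '1')
"""
branch_list = """
    SELECT * FROM ListofPlaces
    WHERE Property = 'MODERATE'
      AND HOMELAND IN ('PlaceA', 'PlaceB')
"""
or_query = """
    SELECT * FROM ListofPlaces
    WHERE Property = 'MODERATE'
      AND (HOMELAND IN (SELECT HOMELAND FROM PLANS WHERE substr(code, 1, 1) = '1')
           OR HOMELAND IN ('PlaceA', 'PlaceB'))
"""

or_rows = sorted(conn.execute(or_query).fetchall())
union_rows = sorted(conn.execute(branch_subquery + " UNION " + branch_list).fetchall())
union_all_rows = sorted(conn.execute(branch_subquery + " UNION ALL " + branch_list).fetchall())
print(len(or_rows), len(union_rows), len(union_all_rows))
```

UNION matches the OR version, while UNION ALL returns PlaceA twice because it appears in both branches.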
Why not join?
SELECT LoP.*
FROM ListofPlaces LoP
left join PLANS Pl
on LoP.HOMELAND = Pl.HOMELAND
WHERE Property = 'MODERATE'
and (left(Pl.code, 1) = '1'
or LoP.HOMELAND in (
'PlaceA'
, 'PlaceB'
, 'PlaceC'
, 'PlaceD'
, 'PlaceE'))
I'm trying to use the aggregation features of the Django ORM to run a query on an MSSQL 2008 R2 database, but I keep getting a timeout error. The query (generated by Django) which fails is below. I've tried running it directly in SQL Server Management Studio and it works, but takes 3.5 minutes.
It does look like it's aggregating over a bunch of fields which it doesn't need to, but I wouldn't have thought that should really cause it to take that long. The database isn't that big either: auth_user has 9 records, ticket_ticket has 1210, and ticket_watchers has 1876. Is there something I'm missing?
SELECT
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined],
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined]
HAVING
(COUNT([tickets_ticket].[id]) > 0 OR COUNT(T3.[id]) > 0 )
EDIT:
Here are the relevant indexes (excluding those not used in the query):
auth_user.id (PK)
auth_user.username (Unique)
tickets_ticket.id (PK)
tickets_ticket.capturer_id
tickets_ticket.responsible_id
tickets_ticket_watchers.id (PK)
tickets_ticket_watchers.user_id
tickets_ticket_watchers.ticket_id
EDIT 2:
After a bit of experimentation, I've found that the following query is the smallest that results in the slow execution:
SELECT
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id]
The weird thing is that if I comment out any two lines in the above, it runs in less than 1 s, but it doesn't seem to matter which lines I remove (although obviously I can't remove a join without also removing the relevant SELECT line).
EDIT 3:
The python code which generated this is:
User.objects.annotate(
Count('tickets_captured'),
Count('assigned_tickets'),
Count('tickets_watched')
)
A look at the execution plan shows that SQL Server is first doing a cross join on all the tables, resulting in about 280 million rows and 6 GB of data. I assume that this is where the problem lies, but why is it happening?
SQL Server is doing exactly what it was asked to do. Unfortunately, Django is not generating the right query for what you want. It looks like you need to count distinct, instead of just count; see "Django annotate() multiple times causes wrong answers".
As for why the query works that way: the query says to join the four tables together. So if a user has 2 captured tickets, 3 assigned tickets, and 4 watched tickets, the join will return 2*3*4 = 24 rows, one for each combination of tickets. The distinct part will remove all the duplicates.
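The multiplication is easy to reproduce on a toy data set. This is a minimal sketch with Python's sqlite3 module and made-up rows (one user with 2 captured, 3 assigned and 4 watched tickets), not the real schema. On the Django side, the corresponding change would be passing distinct=True to each Count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE auth_user (id INTEGER PRIMARY KEY);
    CREATE TABLE tickets_ticket (id INTEGER PRIMARY KEY,
                                 capturer_id INTEGER, responsible_id INTEGER);
    CREATE TABLE tickets_ticket_watchers (id INTEGER PRIMARY KEY,
                                          user_id INTEGER, ticket_id INTEGER);
    INSERT INTO auth_user VALUES (1);
    -- 2 tickets captured by user 1, 3 tickets assigned to user 1
    INSERT INTO tickets_ticket VALUES
        (1, 1, NULL), (2, 1, NULL),
        (3, NULL, 1), (4, NULL, 1), (5, NULL, 1);
    -- user 1 watches 4 tickets
    INSERT INTO tickets_ticket_watchers VALUES
        (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 1, 4);
""")

query = """
    SELECT COUNT(tt.id),          COUNT(T3.id),          COUNT(w.ticket_id),
           COUNT(DISTINCT tt.id), COUNT(DISTINCT T3.id), COUNT(DISTINCT w.ticket_id)
    FROM auth_user u
    LEFT OUTER JOIN tickets_ticket tt ON u.id = tt.capturer_id
    LEFT OUTER JOIN tickets_ticket T3 ON u.id = T3.responsible_id
    LEFT OUTER JOIN tickets_ticket_watchers w ON u.id = w.user_id
    GROUP BY u.id
"""

counts = conn.execute(query).fetchone()
print(counts)  # plain counts first, then distinct counts
```

Every plain COUNT sees all 24 joined rows, while the DISTINCT counts recover 2, 3 and 4. The join still builds the full intermediate result, so DISTINCT fixes the numbers but not necessarily the running time.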
What about this?
SELECT auth_user.*,
C1.tickets_captured__count,
C2.assigned_tickets__count,
C3.tickets_watched__count
FROM
auth_user
LEFT JOIN
( SELECT capturer_id, COUNT(*) AS tickets_captured__count
FROM tickets_ticket GROUP BY capturer_id ) AS C1 ON auth_user.id = C1.capturer_id
LEFT JOIN
( SELECT responsible_id, COUNT(*) AS assigned_tickets__count
FROM tickets_ticket GROUP BY responsible_id ) AS C2 ON auth_user.id = C2.responsible_id
LEFT JOIN
( SELECT user_id, COUNT(*) AS tickets_watched__count
FROM tickets_ticket_watchers GROUP BY user_id ) AS C3 ON auth_user.id = C3.user_id
WHERE C1.tickets_captured__count > 0 OR C2.assigned_tickets__count > 0
--WHERE C1.tickets_captured__count is not null OR C2.assigned_tickets__count is not null -- also works (I think with better performance)
I have created a query in MS Access to simulate a FULL OUTER JOIN and combine the results that looks something like the following:
SELECT NZ(estimates.employee_id, actuals.employee_id) AS employee_id
, NZ(estimates.a_date, actuals.a_date) AS a_date
, estimates.estimated_hours
, actuals.actual_hours
FROM (SELECT *
FROM estimates
LEFT JOIN actuals ON estimates.employee_id = actuals.employee_id
AND estimates.a_date = actuals.a_date
UNION ALL
SELECT *
FROM estimates
RIGHT JOIN actuals ON estimates.employee_id = actuals.employee_id
AND estimates.a_date = actuals.a_date
WHERE estimates.employee_id IS NULL
OR estimates.a_date IS NULL) AS qFullJoinEstimatesActuals
I have saved this query as an object (let's call it qEstimatesAndActuals). My objective is to LEFT JOIN qEstimatesAndActuals with another table. Something like the following:
SELECT *
FROM qJoinedTable
LEFT JOIN (SELECT *
FROM labor_rates) AS rates
ON qJoinedTable.employee_id = rates.employee_id
AND qJoinedTable.a_date BETWEEN rates.begin_date AND rates.end_date
MS Access accepts the syntax and runs the query, but it omits results that are clearly within the result set. Wondering if the date format was somehow lost, I placed a FORMAT around the begin_date and end_date to force them to be interpreted as Short Dates. Oddly, this produced a different result set, but it still omitted results that it shouldn't have.
I am wondering if the queries are performed in such a way that you can't LEFT JOIN the result set of a UNION ALL. Does anyone have any thoughts/ideas on this? Is there a better way of accomplishing the end goal?
I would try breaking each part of the query into its own access query object, e.g.
SELECT *
FROM estimates
LEFT JOIN actuals ON estimates.employee_id = actuals.employee_id
AND estimates.a_date = actuals.a_date
Would be qryOne
SELECT *
FROM estimates
RIGHT JOIN actuals ON estimates.employee_id = actuals.employee_id
AND estimates.a_date = actuals.a_date
WHERE estimates.employee_id IS NULL
OR estimates.a_date IS NULL
Would be qryTwo
SELECT * FROM qryOne
UNION ALL
SELECT * FROM qryTwo
Would be qryFullJoinEstimatesActuals, and finally
SELECT NZ(estimates.employee_id, actuals.employee_id) AS employee_id
, NZ(estimates.a_date, actuals.a_date) AS a_date
, estimates.estimated_hours
, actuals.actual_hours
FROM qryFullJoinEstimatesActuals
I've found that constructs that don't work in complex Access SQL statements often do work properly if they are broken down into individual query objects and reassembled step-by-step. Additionally, you can test each part of the query individually. This will help you find a workaround if one proves to be necessary.
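For testing the logic outside Access, the same step-by-step construction can be sketched with Python's sqlite3 module. This is an approximation under stated assumptions: COALESCE stands in for NZ, the RIGHT JOIN is written as a LEFT JOIN starting from actuals (older SQLite versions have no RIGHT JOIN), and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE estimates (employee_id INTEGER, a_date TEXT, estimated_hours REAL);
    CREATE TABLE actuals   (employee_id INTEGER, a_date TEXT, actual_hours REAL);
    INSERT INTO estimates VALUES (1, '2021-01-01', 8), (2, '2021-01-01', 4);
    INSERT INTO actuals   VALUES (1, '2021-01-01', 7), (3, '2021-01-02', 5);
""")

full_join = """
    SELECT COALESCE(e_emp, a_emp)   AS employee_id,
           COALESCE(e_date, a_date) AS a_date,
           estimated_hours,
           actual_hours
    FROM (
        -- Step 1 (qryOne): every estimate, with its actual if there is one.
        SELECT e.employee_id AS e_emp, e.a_date AS e_date, e.estimated_hours,
               a.employee_id AS a_emp, a.a_date AS a_date, a.actual_hours
        FROM estimates e
        LEFT JOIN actuals a
          ON e.employee_id = a.employee_id AND e.a_date = a.a_date
        UNION ALL
        -- Step 2 (qryTwo): actuals that have no matching estimate.
        SELECT e.employee_id, e.a_date, e.estimated_hours,
               a.employee_id, a.a_date, a.actual_hours
        FROM actuals a
        LEFT JOIN estimates e
          ON e.employee_id = a.employee_id AND e.a_date = a.a_date
        WHERE e.employee_id IS NULL
    ) AS q
    ORDER BY employee_id
"""

rows = conn.execute(full_join).fetchall()
print(rows)
```

Each employee and date appears exactly once, with NULL on whichever side has no row, which is the shape the final Access query should produce before it is joined to labor_rates.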
You can find exactly how to do this here.
You're missing an INNER JOIN.... UNION ALL step.
Consistent with the odd behavior surrounding the dates, this issue turned out to be related to the use of NZ to select a date from qFullJoinEstimatesActuals. The use of NZ appears to make the data type ambiguous. As such, the following line from the example in my post caused the error:
, NZ(estimates.a_date, actuals.a_date) AS a_date
The ambiguous data type of a_date caused the BETWEEN operator to produce erroneous results when comparing a_date to rates.begin_date and rates.end_date in the LEFT JOIN. The issue was resolved by type casting the result of the NZ function, as follows:
, CDate(NZ(estimates.a_date, actuals.a_date)) AS a_date