Enhance PostgreSQL query (many subqueries) - sql

I have a PostgreSQL query which takes a long time to execute (5 min), because of the subqueries I think. I would like to find a way to improve this query:
select v.id, v.pos, v.time, v.status, vi.name,vi.type,
(select c.fullname
from company c
where vi.registered_owner_code = c.owcode ) AS registered_owner
,(select c.fullname
from company c
where vi.group_beneficial_owner_code=c.owcode) AS group_beneficial_owner
,(select c.fullname
from company c
where vi.operator_code = c.owcode ) AS operator
,(select c.fullname
from company c
where vi.manager_code = c.owcode ) AS manager
from (car_pos v left join cars vi on v.id = vi.id)
where age(now(), v.time::time with time zone) < '1 days'::interval

because of subqueries I think
This is not really a guessing game. You can get the query execution plan in pgAdmin or directly in the console:
http://www.pgadmin.org/docs/1.4/query.html
http://www.postgresql.org/docs/current/static/sql-explain.html
Then you can see what's going on and what takes that much time.
After that analysis you can add indexes or change something else, but at least you will know what needs to be changed.
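For example, a minimal sketch (run it against the real query to get real numbers):
EXPLAIN ANALYZE
-- EXPLAIN shows the plan; EXPLAIN ANALYZE actually executes the query
-- and reports actual row counts and timings per plan node.
SELECT v.id, v.pos, v.time, v.status
FROM car_pos v
WHERE age(now(), v.time::time with time zone) < '1 days'::interval;
Look for sequential scans on large tables and for plan nodes whose actual time dominates the total.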

The WHERE condition can't use an index as written; you have to change that one. v.time should not be wrapped in a function call, age() in this case, because that prevents an index on the column from being used.

3 key ingredients:
Do away with the correlated subqueries; use JOIN instead, like other answers have already mentioned.
In the WHERE clause, don't wrap your column in an expression, which cannot utilize an index. @Frank already mentions it. Only the most basic, stable expressions can be rewritten by the query planner to use an index. See how I rewrote it.
Create suitable indexes.
SELECT v.id, v.pos, v.time, v.status, c.name, c.type
,r.fullname AS registered_owner
,g.fullname AS group_beneficial_owner
,o.fullname AS operator
,m.fullname AS manager
FROM car_pos v
LEFT JOIN cars c USING (id)
LEFT JOIN company r ON r.owcode = c.registered_owner_code
LEFT JOIN company g ON g.owcode = c.group_beneficial_owner_code
LEFT JOIN company o ON o.owcode = c.operator_code
LEFT JOIN company m ON m.owcode = c.manager_code
WHERE v.time > (now() - interval '1 day');
You need unique indexes on cars.id and company.owcode (primary keys do the job, too).
And you need an index on car_pos.time like:
CREATE INDEX car_pos_time_idx ON car_pos (time DESC);
Works without descending order, too. If you have lots of rows (-> big table, big index), you might want to create a partial index that covers only recent history and recreate it on a daily or weekly basis at off hours:
CREATE INDEX car_pos_time_recent_idx ON car_pos (time DESC)
WHERE time > $mydate;
Where $mydate is the result of (now() - interval '1 day'). This matches the condition of your query logically at any time. Effectiveness slowly deteriorates over time.
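A nightly rebuild could look like this (a sketch; the index name and the literal cutoff are placeholders your job would substitute):
-- Recreate the partial index with a fresh cutoff, e.g. from a cron job.
-- The literal must be computed by the calling script, since index
-- predicates cannot contain volatile functions like now().
DROP INDEX IF EXISTS car_pos_time_recent_idx;
CREATE INDEX car_pos_time_recent_idx ON car_pos (time DESC)
WHERE time > '2012-01-01 00:00:00+00';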
Aside: don't name a column of type timestamp "time", that's misleading from a documentation point of view. Actually, rather don't use time as column name at all. It's a reserved word in every SQL standard and a type name in PostgreSQL.
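If renaming is an option, something like this avoids the reserved word (recorded_at is just an illustrative name):
-- "time" must be double-quoted precisely because it is a reserved word.
ALTER TABLE car_pos RENAME COLUMN "time" TO recorded_at;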

select v.id, v.pos, v.time, v.status, vi.name,vi.type,
c1.fullname as Registered_owner,
c2.fullname as group_beneficial_owner,
c3.fullname AS operator,
c4.fullname AS manager
from car_pos v
left outer join cars vi on v.id = vi.id
left outer join company c1 on vi.registered_owner_code = c1.owcode
left outer join company c2 on vi.group_beneficial_owner_code = c2.owcode
left outer join company c3 on vi.operator_code = c3.owcode
left outer join company c4 on vi.manager_code = c4.owcode
where age(now(), v.time::time with time zone) < '1 days'::interval

One trivial solution would be to convert it to joins
select v.id, v.pos, v.time, v.status, vi.name,vi.type,
reg_owner.fullname AS registered_owner,
gr_ben_owner.fullname AS group_beneficial_owner,
op.fullname AS operator,
man.fullname AS manager
from
car_pos v
left join cars vi on v.id = vi.id
left join company reg_owner on vi.registered_owner_code = reg_owner.owcode
left join company gr_ben_owner on vi.group_beneficial_owner_code = gr_ben_owner.owcode
left join company op on vi.operator_code = op.owcode
left join company man on vi.manager_code = man.owcode
where age(now(), v.time::time with time zone) < '1 days'::interval
I suspect, however, that it might be possible to join the company table only once. I'm not 100% sure about the exact syntax, and I have doubts that this will improve performance (because of all the CASE WHEN, GROUP BY, etc.) compared to the four-join solution, but I think this should work too. (I assumed that cars-car_pos is a one-to-one relationship.)
select v.id, MAX(v.pos) as pos, MAX(v.time) as vtime, MAX(v.status) as status, MAX(vi.name) as name,MAX(vi.type) as type,
MAX(CASE WHEN c.owcode = vi.registered_owner_code THEN c.fullname END) AS registered_owner,
MAX(CASE WHEN c.owcode = vi.group_beneficial_owner_code THEN c.fullname END) AS group_beneficial_owner,
MAX(CASE WHEN c.owcode = vi.operator_code THEN c.fullname END) AS operator,
MAX(CASE WHEN c.owcode = vi.manager_code THEN c.fullname END) AS manager
from
car_pos v
left join cars vi on v.id = vi.id
left join company c on c.owcode IN (vi.registered_owner_code, vi.group_beneficial_owner_code, vi.operator_code, vi.manager_code)
group by v.id
having max(v.time) > now() - interval '1 day'
If you could put the table creation DDL scripts and some inserts into the question, it would be easy to try in SQL Fiddle...

Related

SQL subquery with join to main query

I have this:
SELECT
SU.FullName as Salesperson,
COUNT(DS.new_dealsheetid) as Units,
SUM(SO.New_profits_sales_totaldealprofit) as TDP,
SUM(SO.New_profits_sales_totaldealprofit) / COUNT(DS.new_dealsheetid) as PPU,
-- opportunities subquery
(SELECT COUNT(*) FROM Opportunity O
LEFT JOIN Account A ON O.AccountId = A.AccountId
WHERE A.OwnerId = SU.SystemUserId AND
YEAR(O.CreatedOn) = 2022)
-- /opportunities subquery
FROM New_dealsheet DS
LEFT JOIN SalesOrder SO ON DS.New_DSheetId = SO.SalesOrderId
LEFT JOIN New_salespeople SP ON DS.New_SalespersonId = SP.New_salespeopleId
LEFT JOIN SystemUser SU ON SP.New_SystemUserId = SU.SystemUserId
WHERE
YEAR(SO.New_purchaseordersenddate) = 2022 AND
SP.New_SalesGroupIdName = 'LO'
GROUP BY SU.FullName
I'm getting an error from the subquery:
Column 'SystemUser.SystemUserId' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Is it possible to use the SystemUser table join from the main query in this way?
As has been mentioned extensively in the comments, the error is actually telling you the problem; SU.SystemUserId isn't in the GROUP BY nor in an aggregate function, and it appears in the SELECT of the query (albeit in the WHERE of a correlated subquery). Any columns in the SELECT must be either aggregated or in the GROUP BY when using one of the other for a query scope. As the column in question isn't aggregated nor in the GROUP BY, the error occurs.
There are, however, other problems. As mentioned in the comments too, your LEFT JOINs make little sense, as many of the tables you LEFT JOIN to require a column in that table to have a non-NULL value; it is impossible for a column to have a non-NULL value if a row was not found.
You also use syntax like YEAR(<Column Name>) = <int Value> in the WHERE; this is not SARGable, and thus should be avoided. Use explicit date boundaries instead.
I assume here that SU.SystemUserId is a primary key, and so should be in the GROUP BY. This is probably a good thing anyway, as a person's full name isn't something that can be used to determine who a person is on their own (take this from someone who shared their name, date of birth and post code with another person in their youth; it caused many problems on the rudimentary IT systems of the time). This results in a query like this:
SELECT SU.FullName AS Salesperson,
COUNT(DS.new_dealsheetid) AS Units,
SUM(SO.New_profits_sales_totaldealprofit) AS TDP,
SUM(SO.New_profits_sales_totaldealprofit) / COUNT(DS.new_dealsheetid) AS PPU,
(SELECT COUNT(*)
FROM dbo.Opportunity O
JOIN dbo.Account A ON O.AccountId = A.AccountId --A.OwnerID must have a non-NULL value, so why was this a LEFT JOIN?
WHERE A.OwnerId = SU.SystemUserId
AND O.CreatedOn >= '20220101' --Don't use YEAR(<Column Name>) = <int Value> syntax, it isn't SARGable
AND O.CreatedOn < '20230101') AS SomeColumnAlias
FROM dbo.New_dealsheet DS
JOIN dbo.SalesOrder SO ON DS.New_DSheetId = SO.SalesOrderId --SO.New_purchaseordersenddate must have a non-NULL value, so why was this a LEFT JOIN?
JOIN dbo.New_salespeople SP ON DS.New_SalespersonId = SP.New_salespeopleId --SP.New_SalesGroupIdName must have a non-NULL value, so why was this a LEFT JOIN?
LEFT JOIN dbo.SystemUser SU ON SP.New_SystemUserId = SU.SystemUserId --This actually looks like it can be a LEFT JOIN.
WHERE SO.New_purchaseordersenddate >= '20220101' --Don't use YEAR(<Column Name>) = <int Value> syntax, it isn't SARGable
AND SO.New_purchaseordersenddate < '20230101'
AND SP.New_SalesGroupIdName = 'LO'
GROUP BY SU.FullName,
SU.SystemUserId;
Doing such a subquery is bad from a performance point of view;
better to do it like this:
SELECT
SU.FullName as Salesperson,
COUNT(DS.new_dealsheetid) as Units,
SUM(SO.New_profits_sales_totaldealprofit) as TDP,
SUM(SO.New_profits_sales_totaldealprofit) / COUNT(DS.new_dealsheetid) as PPU,
SUM(csq.cnt) as Count
FROM New_dealsheet DS
LEFT JOIN SalesOrder SO ON DS.New_DSheetId = SO.SalesOrderId
LEFT JOIN New_salespeople SP ON DS.New_SalespersonId = SP.New_salespeopleId
LEFT JOIN SystemUser SU ON SP.New_SystemUserId = SU.SystemUserId
-- Moved subquery as sub-join
LEFT JOIN (SELECT A.OwnerId, YEAR(O.CreatedOn) AS [Year], COUNT(*) AS cnt
           FROM Opportunity O
           LEFT JOIN Account A ON O.AccountId = A.AccountId
           GROUP BY A.OwnerId, YEAR(O.CreatedOn)) csq
    ON csq.OwnerId = SU.SystemUserId AND csq.[Year] = 2022
WHERE
YEAR(SO.New_purchaseordersenddate) = 2022 AND
SP.New_SalesGroupIdName = 'LO'
GROUP BY SU.FullName
So you have a nice join and a clean result. (The query above is untested.)

Left outer join with left() function

I'm running the below query daily (overnight) and it's taking considerable time to run (1-1.5 hours). I'm certain the "Acc.DateKey >= LEFT(LocationKey, 8)" part is the reason; if this part of the join is removed, the query executes in around 5 minutes. I just cannot think of a more efficient way.
Acc.DateKey is a bigint typically 20180101 etc., with the location key being a bigint typically 201801011234 etc.
So far I've considered including a new column in the LO table "AccLocationKey" which will be inserted with the LEFT (LocationKey, 8) function when loaded.
I've decided to pose the question here first - could this be improved upon without changing the LO table?
SELECT
ISNULL(MAX(L.LocationKey),(SELECT MIN(LocationKey) FROM LO WHERE Location = Acc.Location)) AS LocationKey
FROM
Acc
LEFT OUTER JOIN
(
SELECT
LocationKey
,Location
FROM
LO
)AS L
ON Acc.Location = L.Location AND Acc.DateKey >= LEFT(LocationKey,8)
Let's rewrite the query without the subquery in the SELECT:
SELECT COALESCE(MAX(L.LocationKey), MIN(L.min_locationkey)) AS LocationKey
FROM Acc LEFT OUTER JOIN
     (SELECT MIN(l.LocationKey) OVER (PARTITION BY l.Location) AS min_locationkey,
             l.*
      FROM LO l
     ) L
     ON Acc.Location = L.Location AND Acc.DateKey >= LEFT(L.LocationKey, 8);
Probably your best chance at performance is to add a computed column and appropriate index. So:
alter table lo add locationkey_datekey as (try_convert(bigint, LEFT(LocationKey, 8))) persisted;
Then, the appropriate index:
create index idx_lo_location_datekey on lo(location, locationkey_datekey);
Then use this in the query:
SELECT COALESCE(MAX(L.LocationKey), MIN(L.min_locationkey)) AS LocationKey
FROM Acc LEFT OUTER JOIN
     (SELECT MIN(l.LocationKey) OVER (PARTITION BY l.Location) AS min_locationkey,
             l.*
      FROM LO l
     ) L
     ON Acc.Location = L.Location AND Acc.DateKey >= L.locationkey_datekey;
Happily, this index will also work for the window function.

SQL Using WITH and JOIN together to create a view efficiently

I'm trying to use data from one table named period, which specifies what period a date falls into, and then join that result into another table using the following statement.
WITH rep_prod AS (
SELECT t.tran_num, t.amount, t.provider_id, t.clinic,
t.tran_date, t.type, t.impacts, p.period_id, p.fiscal_year, p.period_weeks
FROM transactions t, period p
WHERE tran_date BETWEEN period_start AND period_end
)
SELECT r.tran_num, r.amount, r.provider_id, d.first_name, d.last_name,
d.clinic, r.tran_date, r.period_id, r.period_weeks, r.type, r.impacts
FROM rep_prod AS r
INNER JOIN provider AS d
ON r.provider_id = d.provider_id AND r.clinic = d.clinic
Looking to create this as a view on my DB, is there a more efficient way to accomplish this? This is currently holding around 6.2 million rows and it will only continue to get bigger. This query alone took over 7 minutes to complete, granted I'm using SQL Express with the memory limitations.
Update: Changed query to reflect the removal of SELECT DISTINCT.
EDIT: @Rabbit So you're suggesting something like this?
SELECT t.tran_num, t.amount, t.provider_id, d.first_name, d.last_name,
d.clinic, t.tran_date, p.period_id, p.period_weeks, p.fiscal_year, p.period_start, p.period_end, t.type, t.impacts
FROM transactions t
INNER JOIN provider d
ON d.provider_id = t.provider_id AND d.clinic = t.clinic
INNER JOIN period p
ON t.tran_date BETWEEN p.period_start AND p.period_end
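Wrapped as a view definition, that query would look like this (a sketch; rep_prod is a placeholder view name):
CREATE VIEW rep_prod AS
SELECT t.tran_num, t.amount, t.provider_id, d.first_name, d.last_name,
       d.clinic, t.tran_date, p.period_id, p.period_weeks, p.fiscal_year,
       p.period_start, p.period_end, t.type, t.impacts
FROM transactions t
INNER JOIN provider d
    ON d.provider_id = t.provider_id AND d.clinic = t.clinic
INNER JOIN period p
    ON t.tran_date BETWEEN p.period_start AND p.period_end;
Note that a view stores only the query, not its results, so the join cost is paid on every read; the range join on period_start/period_end is the part worth indexing.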

Left Outer Joins if condition not true

I am struggling a little bit with an SQL query. Too much time spent in Rails land lately!
I have three tables
Panels
id
BookingPanel (join table)
panel_id
booking_id
Booking
id
from_date
to_date
I want to select all the Panels that do not have a booking on a certain date. I tried the following:
SELECT COUNT(*) FROM "panels"
LEFT JOIN "booking_panels" ON "booking_panels"."panel_id" = "panels"."id"
LEFT JOIN "bookings" ON "bookings"."id" = "booking_panels"."booking_id"
WHERE (bookings.from_date != '2015-04-11' AND bookings.to_date != '2015-04-16')
For some reason it doesn't return anything. If I change the dates where clause to = instead of != then it correctly find the records that are booked on that date. Why doesn't != find the opposite?
Applying the WHERE condition you do squashes your nice left joins into the equivalent of inner joins, on account of the fact that NULL != <anything> never evaluates to true. This variation should do the trick:
SELECT COUNT(*)
FROM "panels"
LEFT JOIN "booking_panels"
ON "booking_panels"."panel_id" = "panels"."id"
LEFT JOIN "bookings"
ON "bookings"."id" = "booking_panels"."booking_id"
AND (bookings.from_date = '2015-04-11' OR bookings.to_date = '2015-04-16')
WHERE "bookings"."id" IS NULL
I would think about not exists for this purpose:
select count(*)
from panels p
where not exists (select 1
from booking_panels bp join
bookings b
on b.id = bp.booking_id
where bp.panel_id = p.id and
<date> >= b.from_date and
<date> < b.to_date + interval '1 day'
);
Note: your question specifically is about one date. However, the sample query has two dates, which is rather confusing.

How do I optimize my query in MySQL?

I need to improve my query, especially the execution time. This is my query:
SELECT SQL_CALC_FOUND_ROWS p.*,v.type,v.idName,v.name as etapaName,m.name AS manager,
c.name AS CLIENT,
(SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(duration)))
FROM activities a
WHERE a.projectid = p.projectid) AS worked,
(SELECT SUM(TIME_TO_SEC(duration))
FROM activities a
WHERE a.projectid = p.projectid) AS worked_seconds,
(SELECT SUM(TIME_TO_SEC(remain_time))
FROM tasks t
WHERE t.projectid = p.projectid) AS remain_time
FROM projects p
INNER JOIN users m
ON p.managerid = m.userid
INNER JOIN clients c
ON p.clientid = c.clientid
INNER JOIN `values` v
ON p.etapa = v.id
WHERE 1 = 1
ORDER BY idName ASC
The execution time of this is approx. 5 sec. If I remove this part: (SELECT SUM(TIME_TO_SEC(remain_time)) FROM tasks t WHERE t.projectid = p.projectid) AS remain_time
the execution time is reduced to 0.3 sec. Is there a way to get the values of remain_time that reduces the execution time?
The SQL is invoked from PHP (if this is relevant to any proposed solution).
It sounds like you need an index on tasks.
Try adding this one:
create index idx_tasks_projectid_remaintime on tasks(projectid, remain_time);
The correlated subquery should just use the index and go much faster.
Optimizing the query as it is written would give significant performance benefits (see below). But the FIRST QUESTION TO ASK when approaching any optimization is whether you really need to see all the data; there is no filtering of the result set implemented here. That has a huge impact on how you optimize a query.
Adding an index on the query above will only help if the optimizer is opening a new cursor on the tasks table for every row returned by the main query. In the absence of any filtering, it will be much faster to do a full table scan of the tasks table.
SELECT ilv.*, remaining.rtime
FROM (
SELECT p.*,v.type, v.idName, v.name as etapaName,
m.name AS manager, c.name AS CLIENT,
SEC_TO_TIME(asbq.worked) AS worked, asbq.worked AS worked_seconds
FROM projects p
INNER JOIN users m
ON p.managerid = m.userid
INNER JOIN clients c
ON p.clientid = c.clientid
INNER JOIN `values` v
ON p.etapa = v.id
LEFT JOIN (
SELECT a.projectid, SUM(TIME_TO_SEC(duration)) AS worked
FROM activities a
GROUP BY a.projectid
) asbq
ON asbq.projectid=p.projectid
) ilv
LEFT JOIN (
    SELECT t.projectid, SUM(TIME_TO_SEC(remain_time)) AS rtime
    FROM tasks t
    GROUP BY t.projectid
) remaining
ON ilv.projectid = remaining.projectid;