Optimizing SQL condition to apply condition to all dependent rows

I have the following query, split up into a view for readability:
CREATE TEMPORARY VIEW task_depcount AS
SELECT
t.*,
COUNT(p.id) AS unfinished_dep_count
FROM
task t
LEFT JOIN taskdependency d on t.id = d.task_id
LEFT JOIN task p on d.parent_task_id = p.id and p.status != 'SUCCESS'
GROUP BY t.id;
SELECT t.id, t.task_type, t.status
FROM task_depcount t
WHERE t.status = 'READY' AND t.unfinished_dep_count = 0;
Looking at the EXPLAIN ANALYZE output, this is obviously very inefficient, since an index scan cannot be used on a COUNT() result. Rewriting it into a single query with HAVING would not improve it either.
So here's the question: is there a way to write this query so that the database isn't forced to do sequential scans all over? The database is PostgreSQL 9.2, with no option to upgrade to a newer version.
Or, to state the intended result in plain English: I need all the tasks whose dependencies all have status 'SUCCESS', or that have no dependencies at all.

You can use not exists:
SELECT t.*
FROM task t
WHERE NOT EXISTS (SELECT 1
FROM taskdependency d JOIN
task p
ON d.parent_task_id = p.id
WHERE t.id = d.task_id AND p.status <> 'SUCCESS'
);
With the right indexes, this should be much faster.
The use of an aggregation function such as COUNT() -- whether in a view, subquery, or CTE -- requires processing all the data. With NOT EXISTS, the processing for each task can stop at the first unsuccessful dependency (if any), and no aggregation is needed.
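To make the NOT EXISTS approach concrete, here is a minimal sketch that runs it against SQLite from Python. The sample rows, the index name, and the added status = 'READY' filter (taken from the question's final SELECT) are assumptions for illustration. SQLite is only a stand-in; PostgreSQL's planner and indexes will behave differently.

```python
import sqlite3

# Hypothetical miniature of the task / taskdependency schema from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task (id INTEGER PRIMARY KEY, task_type TEXT, status TEXT);
    CREATE TABLE taskdependency (task_id INTEGER, parent_task_id INTEGER);
    CREATE INDEX taskdependency_task_idx ON taskdependency (task_id);

    INSERT INTO task VALUES
        (1, 'build',  'SUCCESS'),
        (2, 'test',   'READY'),    -- depends on 1, which has finished
        (3, 'deploy', 'READY'),    -- depends on 2, which has not finished
        (4, 'lint',   'READY'),    -- no dependencies
        (5, 'docs',   'WAITING');  -- no dependencies, but not READY
    INSERT INTO taskdependency VALUES (2, 1), (3, 2);
""")

# A task qualifies when it is READY and no dependency row points
# at a parent whose status is anything other than SUCCESS.
runnable = [row[0] for row in conn.execute("""
    SELECT t.id
    FROM task t
    WHERE t.status = 'READY'
      AND NOT EXISTS (SELECT 1
                      FROM taskdependency d
                      JOIN task p ON d.parent_task_id = p.id
                      WHERE d.task_id = t.id
                        AND p.status <> 'SUCCESS')
    ORDER BY t.id
""")]
print(runnable)
```

Tasks without any dependency rows pass the NOT EXISTS test automatically, which covers the "no dependencies at all" case without a separate branch.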

create temporary view task_depcount as
select t.*
from
task t
left join
taskdependency d on t.id = d.task_id
left join
task p on d.parent_task_id = p.id
group by t.id
having not bool_or(p.status != 'SUCCESS') or not bool_or(d.task_id is not null)
;
select t.id, t.task_type, t.status
from task_depcount t
where t.status = 'READY';

Related

SQL UPDATE statement with LEFT JOIN, GROUP BY and HAVING?

I need to update some rows in a table. I've created a Select statement to make sure I've got the rows I wanted to select.
I want to update task_status_id in the table task. I've tried various ways but always end up with a syntax error, and I honestly have no idea how to do it, even though I've tried to follow others' examples by using INNER JOIN and putting the SELECT statement in parentheses. Any help would be appreciated.
UPDATE statement to merge with the SELECT statement.
UPDATE task
SET task_status_id = (SELECT task_status_id
FROM task_status
WHERE task_type_id = 1
AND name = 'Completed');
WHERE
SELECT
t.task_id
FROM task t
LEFT JOIN user u
ON t.user_id = u.user_id
LEFT JOIN contract co
ON u.user_id = co.user_id
LEFT JOIN task_status ts
ON t.task_status_id = ts.task_status_id
WHERE co.status = 'Closed' AND
t.task_type_id = 1 AND
t.task_status_id != (SELECT task_status_id
FROM task_status
WHERE task_type_id = 1
AND name = 'Completed')
GROUP BY t.task_id
HAVING count(t.contract_id) <= 2;
First of all, it doesn't make sense to use LEFT JOIN contract co and then filter results using co.status = 'Closed', because if you're going to filter by a column from a joined table then you should use INNER JOIN (unless you're comparing to null in the filter).
Secondly, the following comparison is only valid if the subquery returns a single row; NOT IN is safer than != here:
AND t.task_status_id != (SELECT task_status_id
FROM task_status
WHERE task_type_id = 1
AND name = 'Completed')
However, since you already joined the task_status table you can replace the above block of code with the following (assuming that task_status_id is a unique column):
AND ts.name != 'Completed'
Either way, you should post sample data and expected result.
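As a rough illustration of how the UPDATE and the SELECT could be merged, the sketch below feeds the selected task_id values into the UPDATE through WHERE ... IN. It runs on SQLite from Python. The sample rows are invented, the user join and the GROUP BY / HAVING from the question are left out for brevity, and the column layout is an assumption based on the question's query. Some engines (MySQL, for example) reject a subquery that reads the table being updated, so a join-based UPDATE may be needed there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the question's tables.
    CREATE TABLE task_status (task_status_id INTEGER PRIMARY KEY,
                              task_type_id INTEGER, name TEXT);
    CREATE TABLE contract (contract_id INTEGER PRIMARY KEY,
                           user_id INTEGER, status TEXT);
    CREATE TABLE task (task_id INTEGER PRIMARY KEY, user_id INTEGER,
                       task_type_id INTEGER, task_status_id INTEGER);

    INSERT INTO task_status VALUES (1, 1, 'Open'), (2, 1, 'Completed');
    INSERT INTO contract VALUES (10, 100, 'Closed'), (11, 101, 'Active');
    INSERT INTO task VALUES
        (1, 100, 1, 1),   -- open task, closed contract: should be completed
        (2, 101, 1, 1),   -- open task, active contract: untouched
        (3, 100, 1, 2);   -- already completed: untouched
""")

# INNER JOINs replace the LEFT JOINs, and ts.name replaces the != subquery,
# following the two points made in the answer above.
conn.execute("""
    UPDATE task
    SET task_status_id = (SELECT task_status_id FROM task_status
                          WHERE task_type_id = 1 AND name = 'Completed')
    WHERE task_id IN (
        SELECT t.task_id
        FROM task t
        JOIN contract co    ON co.user_id = t.user_id
        JOIN task_status ts ON ts.task_status_id = t.task_status_id
        WHERE co.status = 'Closed'
          AND t.task_type_id = 1
          AND ts.name <> 'Completed'
    )
""")

statuses = conn.execute(
    "SELECT task_id, task_status_id FROM task ORDER BY task_id"
).fetchall()
print(statuses)
```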

Need help in optimizing sql query

I am new to SQL and have created the query below to fetch the required results. However, the query is quite slow and seems to take ages to run. Any help with optimization would be great.
Below is the SQL query I am using:
SELECT
Date_trunc('week',a.pair_date) as pair_week,
a.used_code,
a.used_name,
b.line,
b.channel,
count(
case when b.sku = c.sku then used_code else null end
)
from
a
left join b on a.ma_number = b.ma_number
and (a.imei = b.set_id or a.imei = b.repair_imei
)
left join c on a.used_code = c.code
group by 1,2,3,4,5
I would rewrite the query as:
select Date_trunc('week',a.pair_date) as pair_week,
a.used_code, a.used_name, b.line, b.channel,
count(*) filter (where b.sku = c.sku)
from a left join
b
on a.ma_number = b.ma_number and
a.imei in ( b.set_id, b.repair_imei ) left join
c
on a.used_code = c.code
group by 1,2,3,4,5;
For this query, you want indexes on b(ma_number, set_id, repair_imei) and c(code, sku). However, this doesn't leave much scope for optimization.
There might be some other possibilities, depending on the tables. For instance, or/in in the on clause is usually a bad sign -- but it is unclear what your intention really is.

Why do multiple EXISTS break a query

I am attempting to include a new table with values that need to be checked and included in a stored procedure. Statement 1 is the existing table that needs to be checked against, while statement 2 is the new table to check against.
I currently have 2 EXISTS conditions that function independently and produce the results I am expecting. By this I mean that if I comment out statement 1, statement 2 works, and vice versa. When I put them together, the query doesn't complete: there is no error, but it times out, which is unexpected because each statement only takes a few seconds.
I understand there is likely a better way to do this but before I do, I would like to know why I cannot seem to do multiple exists statements like this? Are there not meant to be multiple EXISTS conditions in the WHERE clause?
SELECT *
FROM table1 S
WHERE
--Statement 1
EXISTS
(
SELECT 1
FROM table2 P WITH (NOLOCK)
INNER JOIN table3 SA ON SA.ID = P.ID
WHERE P.DATE = @Date AND P.OTHER_ID = S.ID
AND
(
SA.FILTER = ''
OR
(
SA.FILTER = 'bar'
AND
LOWER(S.OTHER) = 'foo'
)
)
)
OR
(
--Statement 2
EXISTS
(
SELECT 1
FROM table4 P WITH (NOLOCK)
INNER JOIN table5 SA ON SA.ID = P.ID
WHERE P.DATE = @Date
AND P.OTHER_ID = S.ID
AND LOWER(S.OTHER) = 'foo'
)
)
EDIT: I have included the query details. Table 1-5 represent different tables, there are no repeated tables.
Too long to comment.
Your query as written seems correct. The timeout can only be troubleshot from the execution plan, but here are a few things that could be happening or that you could benefit from.
Parameter sniffing on @Date. Try hard-coding this value and see if you still get the same slowness.
No covering index on P.OTHER_ID or P.DATE or P.ID or SA.ID which would cause a table scan for these predicates
Indexes for the above columns which aren't optimal (including too many columns, etc)
Your query being serial when it may benefit from parallelism.
Using the LOWER function on a database which doesn't have a case sensitive collation (most don't, though this function doesn't slow things down that much)
You have a bad query plan in cache. Try adding OPTION (RECOMPILE) at the bottom so you get a new query plan. This is also done when comparing the speed of two queries to ensure they aren't using cached plans, or one isn't when another is which would skew the results.
Since your query is timing out, try including the estimated execution plan and post it for us at Paste The Plan.
I found putting 2 EXISTS in the WHERE condition made the whole process take significantly longer. What I found fixed it was using UNION and keeping the EXISTS in separate queries. The final result looked like the following:
SELECT *
FROM table1 S
WHERE
--Statement 1
EXISTS
(
SELECT 1
FROM table2 P WITH (NOLOCK)
INNER JOIN table3 SA ON SA.ID = P.ID
WHERE P.DATE = @Date AND P.OTHER_ID = S.ID
AND
(
SA.FILTER = ''
OR
(
SA.FILTER = 'bar'
AND
LOWER(S.OTHER) = 'foo'
)
)
)
UNION
--Statement 2
SELECT *
FROM table1 S
WHERE
EXISTS
(
SELECT 1
FROM table4 P WITH (NOLOCK)
INNER JOIN table5 SA ON SA.ID = P.ID
WHERE P.DATE = @Date
AND P.OTHER_ID = S.ID
AND LOWER(S.OTHER) = 'foo'
)
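The rewrite relies on OR between two EXISTS conditions selecting the same rows as a UNION of the two single-EXISTS queries. The sketch below checks that equivalence on a small SQLite database from Python. The tables are stripped down to the correlated columns, the data is invented, and the date parameter and inner joins are omitted, so treat it as an illustration of the logic only, not of SQL Server's plan choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, reduced versions of table1, table2 and table4.
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, other TEXT);
    CREATE TABLE table2 (id INTEGER, other_id INTEGER);
    CREATE TABLE table4 (id INTEGER, other_id INTEGER);

    INSERT INTO table1 VALUES (1, 'foo'), (2, 'bar'), (3, 'FOO'), (4, 'baz');
    INSERT INTO table2 VALUES (10, 1), (11, 2);
    INSERT INTO table4 VALUES (20, 1), (21, 3), (22, 4);
""")

# Both conditions in a single WHERE clause, combined with OR.
with_or = conn.execute("""
    SELECT s.id FROM table1 s
    WHERE EXISTS (SELECT 1 FROM table2 p WHERE p.other_id = s.id)
       OR (EXISTS (SELECT 1 FROM table4 p WHERE p.other_id = s.id)
           AND LOWER(s.other) = 'foo')
    ORDER BY s.id
""").fetchall()

# One query per condition; UNION removes rows matched by both branches.
with_union = conn.execute("""
    SELECT s.id FROM table1 s
    WHERE EXISTS (SELECT 1 FROM table2 p WHERE p.other_id = s.id)
    UNION
    SELECT s.id FROM table1 s
    WHERE EXISTS (SELECT 1 FROM table4 p WHERE p.other_id = s.id)
      AND LOWER(s.other) = 'foo'
    ORDER BY 1
""").fetchall()

print(with_or, with_union)
```

Row 1 matches both branches but appears once in the UNION result. UNION ALL would return it twice, so it is not a drop-in replacement for OR.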

SELECT DISTINCT count() in Microsoft Access

I've created a database where we can track bugs we have raised with our developers (Table: ApplixCalls) and track any correspondence related to the logged bugs (Table: Correspondence).
I'm trying to create a count where we can see the number of bugs which have no correspondence or only correspondence from us. This should give us the visibility to see where we should be chasing our developers for updates etc.
So far I have this SQL:
SELECT DISTINCT Count(ApplixCalls.OurRef) AS CountOfOurRef
FROM ApplixCalls LEFT JOIN Correspondence ON ApplixCalls.OurRef = Correspondence.OurRef
HAVING (((Correspondence.OurRef) Is Null)
AND ((ApplixCalls.Position)<>'Closed'))
OR ((ApplixCalls.Position)<>'Closed')
AND ((Correspondence.[SBSUpdate?])=True);
I'm finding that this part is counting every occasion we have sent an update, when I need it to count 1 where OurRef is unique and it only has updates from us:
OR ((ApplixCalls.Position)<>'Closed')
AND ((Correspondence.[SBSUpdate?])=True);
Hopefully that makes sense...
Is there a way around this?
MS Access does not support count(distinct). In your case, you can use a subquery. In addition, your query should not work. Perhaps this is what you intend:
SELECT COUNT(*)
FROM (SELECT ApplixCalls.OurRef
FROM ApplixCalls LEFT JOIN
Correspondence
ON ApplixCalls.OurRef = Correspondence.OurRef
WHERE (Correspondence.OurRef Is Null AND ApplixCalls.Position <> 'Closed') OR
(ApplixCalls.Position <> 'Closed' AND Correspondence.[SBSUpdate?] = True)
GROUP BY ApplixCalls.OurRef
) as x;
Modifications:
You have a HAVING clause with no GROUP BY. I think this should be a WHERE (although I am not 100% sure of the logic you intend).
The SELECT DISTINCT is replaced by SELECT . . . GROUP BY.
The COUNT(DISTINCT) is now COUNT(*) with a subquery.
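The difference between counting joined rows and counting grouped references can be seen in a small sketch, run here on SQLite from Python rather than on Access. The sample data is invented, the column is named SBSUpdate without the question mark, and True is stored as 1; all of these are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ApplixCalls (OurRef INTEGER PRIMARY KEY, Position TEXT);
    CREATE TABLE Correspondence (OurRef INTEGER, SBSUpdate INTEGER);

    INSERT INTO ApplixCalls VALUES
        (1, 'Open'), (2, 'Open'), (3, 'Closed'), (4, 'Open');
    -- Bug 1: no correspondence. Bug 2: three updates from us.
    -- Bug 3: closed. Bug 4: one update from the developers.
    INSERT INTO Correspondence VALUES (2, 1), (2, 1), (2, 1), (3, 1), (4, 0);
""")

# Shared FROM / WHERE: open bugs with no correspondence or with our updates.
from_where = """
    FROM ApplixCalls a
    LEFT JOIN Correspondence c ON a.OurRef = c.OurRef
    WHERE a.Position <> 'Closed'
      AND (c.OurRef IS NULL OR c.SBSUpdate = 1)
"""

# Counting joined rows counts bug 2 once per update we sent.
row_count = conn.execute("SELECT COUNT(a.OurRef)" + from_where).fetchone()[0]

# Grouping in a subquery first yields one row per bug, which is then counted.
bug_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT a.OurRef" + from_where
    + " GROUP BY a.OurRef) AS x"
).fetchone()[0]

print(row_count, bug_count)
```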
EDIT:
Based on the description in your comments:
SELECT COUNT(*)
FROM (SELECT ApplixCalls.OurRef
FROM ApplixCalls LEFT JOIN
Correspondence
ON ApplixCalls.OurRef = Correspondence.OurRef
WHERE (Correspondence.OurRef Is Null AND ApplixCalls.Position <> 'Closed') OR
(ApplixCalls.Position <> 'Closed' AND Correspondence.[SBSUpdate?] = True)
GROUP BY ApplixCalls.OurRef
HAVING SUM(IIF(Correspondence.[SBSUpdate?] = False, 1, 0)) = 0
) as x;
I cannot understand why you are using a HAVING clause. I hope this query will fulfill your need.
SELECT DISTINCT Count(ApplixCalls.OurRef) AS CountOfOurRef
FROM ApplixCalls LEFT JOIN Correspondence ON ApplixCalls.OurRef = Correspondence.OurRef
HAVING (((Correspondence.OurRef) Is Null)
AND ((ApplixCalls.Position)<>'Closed'))
OR ((ApplixCalls.Position)<>'Closed')
AND ((Correspondence.[SBSUpdate?])=True);
If you are counting all the elements that match your condition, you don't need DISTINCT; DISTINCT is for removing duplicate results.
SELECT Count(distinct ApplixCalls.OurRef) AS CountOfOurRef
FROM ApplixCalls LEFT JOIN Correspondence ON ApplixCalls.OurRef = Correspondence.OurRef
WHERE (((Correspondence.OurRef) Is Null)
AND ((ApplixCalls.Position)<>'Closed'))
OR ((ApplixCalls.Position)<>'Closed')
AND ((Correspondence.[SBSUpdate?])=True);

"Simple" SQL Query

Each of my clients can have many todo items and every todo item has a due date.
What would be the query for discovering the next undone todo item by due date for each client? In the event that a client has more than one todo, the one with the lowest id is the correct one.
Assuming the following minimal schema:
clients (id, name)
todos (id, client_id, description, timestamp_due, timestamp_completed)
Thank you.
I haven't tested this yet, so you may have to tweak it:
SELECT
TD1.client_id,
TD1.id,
TD1.description,
TD1.timestamp_due
FROM
Todos TD1
LEFT OUTER JOIN Todos TD2 ON
TD2.client_id = TD1.client_id AND
TD2.timestamp_completed IS NULL AND
(
TD2.timestamp_due < TD1.timestamp_due OR
(TD2.timestamp_due = TD1.timestamp_due AND TD2.id < TD1.id)
)
WHERE
TD1.timestamp_completed IS NULL AND
TD2.id IS NULL
Instead of trying to sort and aggregate, you're basically answering the question, "Is there any other todo that would come before this one?" (based on your definition of "before"). If not, then this is the one that you want.
This should be valid on most SQL platforms.
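Here is a small runnable sketch of that "is there any other todo that would come before this one?" test, using SQLite from Python. The sample rows are invented and dates are stored as ISO-8601 text so that string comparison matches date order. The sketch also filters td1 on timestamp_completed IS NULL so that completed todos are never returned; that filter is an assumption about the intended result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE todos (id INTEGER PRIMARY KEY, client_id INTEGER,
                        description TEXT, timestamp_due TEXT,
                        timestamp_completed TEXT);
    INSERT INTO todos VALUES
        (1, 1, 'already done',   '2024-01-01', '2024-01-02'),
        (2, 1, 'later',          '2024-03-01', NULL),
        (3, 1, 'next (tie)',     '2024-02-01', NULL),
        (4, 1, 'tie, higher id', '2024-02-01', NULL),
        (5, 2, 'only todo',      '2024-05-01', NULL),
        (6, 3, 'done',           '2024-01-01', '2024-01-05');
""")

# td2 looks for an undone todo of the same client that sorts before td1
# (earlier due date, or same due date and lower id). If none exists,
# td2.id is NULL and td1 is the next todo for its client.
next_todos = conn.execute("""
    SELECT td1.client_id, td1.id
    FROM todos td1
    LEFT JOIN todos td2
           ON td2.client_id = td1.client_id
          AND td2.timestamp_completed IS NULL
          AND (td2.timestamp_due < td1.timestamp_due
               OR (td2.timestamp_due = td1.timestamp_due AND td2.id < td1.id))
    WHERE td1.timestamp_completed IS NULL
      AND td2.id IS NULL
    ORDER BY td1.client_id
""").fetchall()
print(next_todos)
```

Client 3 has only a completed todo, so it does not appear in the result at all.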
This question is the classic pick-a-winner for each group. It gets posted about twice a day.
SELECT *
FROM todos t
WHERE t.timestamp_completed is null
and
(
SELECT top 1 t2.id
FROM todos t2
WHERE t.client_id = t2.client_id
and t2.timestamp_completed is null
--there is no earlier record
and
(t.timestamp_due > t2.timestamp_due
or (t.timestamp_due = t2.timestamp_due and t.id > t2.id)
)
) is null
SELECT c.name, MIN(t.id)
FROM clients c, todos t
WHERE c.id = t.client_id AND t.timestamp_completed IS NULL
GROUP BY c.id
HAVING t.timestamp_due <= MIN(t.timestamp_due)
Avoids a subquery, correlated or otherwise but introduces a bunch of aggregate operations which aren't much better.
Some Jet SQL. I realize it is unlikely that the questioner is using Jet; however, the reader may be.
SELECT c.name, t.description, t.timestamp_due
FROM (clients c
INNER JOIN
(SELECT t.client_id, Min(t.id) AS MinOfid
FROM todos t
WHERE t.timestamp_completed Is Null
GROUP BY t.client_id) AS tm
ON c.id = tm.client_id)
INNER JOIN todos t ON tm.MinOfid = t.id
The following should get you close: first get the minimum due date for each client, then look up the client/todo information.
SELECT
C.Id,
C.Name,
T.Id,
T.Description,
T.timestamp_due
FROM
(
SELECT
client_id,
MIN(timestamp_due) AS "DueDate"
FROM todos
WHERE timestamp_completed IS NULL
GROUP BY client_id
) AS MinValues
INNER JOIN Clients C
ON (MinValues.client_id = C.Id)
INNER JOIN todos T
ON (MinValues.client_id = T.client_id
AND MinValues.DueDate = T.timestamp_due)
ORDER BY C.Name
NOTE: Written assuming SQL Server