Given two tables Friend_request (requester_id, sent_to_id, time) and Request_accepted (acceptor_id, requestor_id, time), find the overall acceptance rate of requests.
So my code/basic logic is the following:
select count(acceptor_id)/count(requester_id)
from Friend_Request left join Request_Accepted
on Friend_Request.sent_to_id = Request_Accepted.acceptor_id
where Request_Accepted.acceptor_id is not null
Would that be correct?
In theory it would be almost correct.
SELECT count(ra.acceptor_id) / count(fr.requester_id)
FROM Friend_Request fr
LEFT JOIN Request_Accepted ra ON ra.acceptor_id = fr.sent_to_id
Just remove the WHERE clause, because you want to keep the rows with NULL acceptors from your Request_Accepted table. Otherwise you would be dividing a number by itself.
Edit: Although you have to consider if they send multiple requests with one acceptance... then that's a different can of worms.
I handled this question in a slightly different manner and thought I would share.
( SELECT COUNT(DISTINCT requestor_id, acceptor_id)
FROM Request_Accepted) /
( SELECT COUNT(DISTINCT requester_id, sent_to_id)
FROM Friend_Request)
I guess there would be a more efficient solution.
Since all the request actions are stored in the Friend_request table, and all the accept actions are stored in the Request_Accepted table, we can simply get the counts of each table and divide them to get the rate we need.
requests_count = select count(*) from Friend_request;
accepts_count = select count(*) from Request_Accepted;
accept_rate = accepts_count / requests_count;
In some storage engines (for example MySQL's MyISAM), a whole-table count(*) is read from stored metadata and is effectively O(1); in others (such as InnoDB or PostgreSQL) it still requires a scan, but a plain count(*) with no join is still about as cheap as this calculation can get.
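Put together as one statement, the idea is roughly the following (a sketch; MySQL's / already returns a decimal here, but some databases may need an explicit cast to avoid integer division):
SELECT (SELECT COUNT(*) FROM Request_Accepted) /
       (SELECT COUNT(*) FROM Friend_Request) AS accept_rate;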
I've got a query that's decently sized on its own, but there's one section of it that turns it into something ridiculously large (billions of rows returned, that kind of thing).
There must be a better way to write it than what I have done.
To simplify the section of the query in question, it takes the client details from one table and tries to find the most recent transaction dates in their savings and spending accounts (not the actual situation, but close enough).
I've joined it with left joins because if someone (for example) doesn't have a savings account, I still want the client details to pop up. But when there are hundreds of thousands of clients, each with tens of thousands of transactions, it's a little slow to run.
select c.client_id, max(e.transaction_date), max(s.transaction_date)
from client_table c
left join everyday_account e
on c.client_id = e.client_id
left join savings_account s
on c.client_id = s.client_id
group by c.client_id
I'm still new to this so I'm not great at knowing how to optimise things, so is there anything I should be looking at? Perhaps different joins, or something other than max()?
I've probably missed some key details while trying to simplify it, let me know if so!
Sometimes aggregating first, then joining to the aggregated result is faster. But this depends on the actual DBMS being used and several other factors.
select c.client_id, e.max_everyday_transaction_date, s.max_savings_transaction_date
from client_table c
left join (
select client_id, max(transaction_date) as max_everyday_transaction_date
from everyday_account
group by client_id
) e on c.client_id = e.client_id
left join (
select client_id, max(transaction_date) as max_savings_transaction_date
from savings_account
group by client_id
) s on c.client_id = s.client_id
The indexes suggested by Tim Biegeleisen should help in this case as well.
But as the query has to process all rows from all tables, there is no good way to speed it up other than throwing more hardware at it. If your database supports it, make sure parallel query is enabled; that distributes the total work over multiple threads in the backend, which can substantially improve query performance if the I/O system can keep up.
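In PostgreSQL, for example, the per-query degree of parallelism is controlled by a setting like this (the value is only illustrative; other databases have their own switches):
-- allow up to 4 parallel workers per Gather node in this session
SET max_parallel_workers_per_gather = 4;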
There are no WHERE or HAVING clauses, which basically means there is no explicit filtering in your SQL query. However, we can still try to optimize the joins using appropriate indices. Consider:
CREATE INDEX idx1 ON everyday_account (client_id, transaction_date);
CREATE INDEX idx2 ON savings_account (client_id, transaction_date);
These two indexes, if the optimizer chooses to use them, should speed up the two left joins in your query. I also include transaction_date in both indexes, in case that helps.
Side note: You might want to also consider just having a single table containing all customer accounts. Include a separate column which distinguishes between everyday and savings accounts.
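A rough sketch of that idea (table, column, and index names here are hypothetical, not from your schema):
-- one table for all account transactions, with a type discriminator
CREATE TABLE account_transaction (
    client_id        INT     NOT NULL,
    account_type     CHAR(1) NOT NULL,  -- 'E' = everyday, 'S' = savings
    transaction_date DATE    NOT NULL
);
CREATE INDEX idx_acct_client_type_date ON account_transaction (client_id, account_type, transaction_date);
-- latest transaction per client and account type in a single pass
SELECT client_id,
       MAX(CASE WHEN account_type = 'E' THEN transaction_date END) AS max_everyday_transaction_date,
       MAX(CASE WHEN account_type = 'S' THEN transaction_date END) AS max_savings_transaction_date
FROM account_transaction
GROUP BY client_id;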
I would suggest correlated subqueries:
select client_id,
(select max(e.transaction_date)
from everyday_account e
where c.client_id = e.client_id
),
(select max(s.transaction_date)
from savings_account s
where c.client_id = s.client_id
)
from client_table c;
Along with indexes on everyday_account(client_id, transaction_date desc) and savings_account(client_id, transaction_date desc).
The subqueries should basically be index lookups (or very limited index scans), with no additional joining needed.
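For reference, those indexes could be created along these lines (index names are illustrative; the DESC modifier is honored by some databases and simply ignored by others):
create index idx_everyday_client_date on everyday_account (client_id, transaction_date desc);
create index idx_savings_client_date on savings_account (client_id, transaction_date desc);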
For this problem, I have 2 main existing postgres tables I am working with. The first table is named client, the second table is named task.
A single client can have multiple tasks, each with its own scheduled_date and scheduled_time.
I'm trying to run a query that will return a list of all clients along with the date/time of their latest task.
Currently, my query works and looks something like this...
SELECT
c.*,
(t1.scheduled_date||' '||t1.scheduled_time)::timestamp AS latest_task_datetime
FROM
client c
LEFT JOIN
task t1 ON t1.client_id = c.client_id
LEFT JOIN
task t2 ON t2.client_id = t1.client_id AND (((t1.scheduled_date||' '||t1.scheduled_time)::timestamp < (t2.scheduled_date||' '||t2.scheduled_time)::timestamp) OR ((t1.scheduled_date||' '||t1.scheduled_time)::timestamp = (t2.scheduled_date||' '||t2.scheduled_time)::timestamp AND t1.task_id < t2.task_id));
The problem is that the actual query I'm working with involves a lot more tables (7+), each containing a lot of data, so because of the two left joins shown above the execution time goes from 4 seconds to almost 45 seconds, which of course is very bad.
Does anyone know a possible faster way to write this query to run more efficiently?
A question you might have after seeing this is why I have scheduled_date and scheduled_time as separate columns rather than a single timestamp column. The answer is that this is an existing table I can't change, at least not easily without a lot of work updating the rest of the server to support it.
Edit: Not quite the solution, but I just ended up doing it a different way. (See my comment below)
If you want to get multiple columns of information from different tables -- but one row for each client and his/her latest task, then you can use distinct on:
SELECT DISTINCT ON (c.client_id) c.*, t.*
FROM client c LEFT JOIN
task t
ON t.client_id = c.client_id
ORDER BY c.client_id, t.scheduled_date desc, t.scheduled_time desc;
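If you also want the combined latest_task_datetime column from the question, the same DISTINCT ON query can produce it; this sketch assumes scheduled_date is a date and scheduled_time is a time, so adding them yields a timestamp:
SELECT DISTINCT ON (c.client_id)
       c.*,
       t.scheduled_date + t.scheduled_time AS latest_task_datetime
FROM client c
LEFT JOIN task t ON t.client_id = c.client_id
ORDER BY c.client_id, t.scheduled_date DESC, t.scheduled_time DESC;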
I have a table that contains all the pupils.
I need to look through my registered table and find all students and see what their current status is.
If reg = 'y' then include it in the search; however, a student may change from 'y' to 'n', so I need to use start_date to determine the most recent reg status.
The next step is that if the latest reg is 'n', don't pass it through. However, if the latest reg is 'y', then search the pupils table using pupilnumber; if that pupil number is in the pupils table, add it to the count.
Select Count(*)
From Pupils Partition(Pupils_01)
Where Pupilnumber in (Select t1.pupilnumber
From registered t1
Where T1.Start_Date = (Select Max(T2.Start_Date)
From registered T2
Where T2.Pupilnumber = T1.Pupilnumber)
And T1.reg = 'N');
This query works, but it is very slow because there are a large number of records in the pupils table.
Just wondering if there is any way of making it more efficient
Worrying about query performance but not indexing your tables is, well, looking for a kind word here... ummm... daft. That's the whole point of indexes. Any variation on the query is going to be much slower than it needs to be.
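For example, something along these lines (index names are illustrative; the composite index on registered supports finding each pupil's latest start_date efficiently):
CREATE INDEX idx_registered_pupil_date ON registered (pupilnumber, start_date);
CREATE INDEX idx_pupils_pupilnumber ON pupils (pupilnumber);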
I'd guess that using analytic functions would be the most efficient approach since it avoids the need to hit the table twice.
SELECT COUNT(*)
FROM (SELECT pupilnumber,
             start_date,
             reg,
             rank() over (partition by pupilnumber order by start_date desc) rnk
      FROM registered)
WHERE rnk = 1
AND reg = 'Y'
You can look at the execution plan for this query; it will show you the high-cost operations. If you see full table scans in the execution plan, you should add indexes. You can also try "exists" instead of "in".
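As a sketch of the EXISTS variant (keeping the latest-start_date logic, and counting pupils whose latest status is 'Y' as described in the question):
SELECT COUNT(*)
FROM Pupils p
WHERE EXISTS (SELECT 1
              FROM registered t1
              WHERE t1.pupilnumber = p.pupilnumber
                AND t1.reg = 'Y'
                AND t1.start_date = (SELECT MAX(t2.start_date)
                                     FROM registered t2
                                     WHERE t2.pupilnumber = t1.pupilnumber));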
This query MIGHT be more efficient for you, and I hope at a minimum you have indexes on "pupilnumber" in the respective tables.
To clarify what I am doing: the first inner query is a join between the registered table and the pupils table, which pre-qualifies that the pupils DO exist in the pupils table... You can always re-add the "partition" reference if that helps. From that, it grabs each pupil AND their max date, so it is not doing a correlated subquery for every student... get all students and their max date first...
THEN, join that result back to the registered table... again on the pupil AND the max date being the same, and qualify the final registration status as 'Y'. This should give you the count you need.
select
count(*) as RegisteredPupils
from
( select
t2.pupilnumber,
max( t2.Start_Date ) as MostRecentReg
from
registered t2
join Pupils p
on t2.pupilnumber = p.pupilnumber
group by
t2.pupilnumber ) as MaxPerPupil
JOIN registered t1
on MaxPerPupil.pupilNumber = t1.pupilNumber
AND MaxPerPupil.MostRecentReg = t1.Start_Date
AND t1.Reg = 'Y'
Note: If you have multiple records in the registration table, such as a person taking multiple classes registered on the same date, then you COULD get a false count. If that might be the case, you could change from
COUNT(*)
to
COUNT( DISTINCT T1.PupilNumber )
I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=1)likes,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=0)dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added likes and dislikes columns to the COMMENTS table and created a trigger to automatically increment / decrement these columns as likes get inserted / deleted / updated in the LIKER table, then the SELECT statement would be simpler and more efficient than it is now. So I am asking: is it more efficient to have this complex query with the COUNTs, or to have the extra columns and triggers?
And to generalise: when querying on a regular basis, is it more efficient to COUNT on the fly or to maintain an extra column that holds the count?
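For illustration, a rough sketch of the insert-side trigger described above might look like this in MySQL (likes and dislikes are the hypothetical new columns on comments; deletes and updates would need similar triggers):
DELIMITER //
CREATE TRIGGER comment_likers_after_insert
AFTER INSERT ON comment_likers
FOR EACH ROW
BEGIN
  IF NEW.liker = 1 THEN
    UPDATE comments SET likes = likes + 1 WHERE comment_id = NEW.comment_id;
  ELSE
    UPDATE comments SET dislikes = dislikes + 1 WHERE comment_id = NEW.comment_id;
  END IF;
END//
DELIMITER ;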
Your query is very inefficient. You can easily eliminate those subqueries, which should dramatically increase performance.
The two subqueries can be replaced by simply:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers ON comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID
WHERE comments.topic_id=$tpcID
GROUP BY comments.comment_id
ORDER BY comments.created DESC;
In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the following table joins users to feeders, aka, users and groups)
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
EXPLAIN ANALYZE and time the query to see if there is a problem.
Also, you could try expressing the query as a union:
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again, EXPLAIN ANALYZE and benchmark.
To find out if there is a performance problem, measure it. PostgreSQL can explain it for you.
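For example, prefix the query with EXPLAIN (ANALYZE, BUFFERS) to see the actual plan, row counts, and timings:
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT feed_items.*
FROM feed_items
INNER JOIN followings
  ON (followings.feeder_id = feed_items.subject_id
      AND followings.feeder_type = feed_items.subject_type)
  OR (followings.feeder_id = feed_items.actor_id
      AND followings.feeder_type = feed_items.actor_type)
WHERE followings.follower_id = 42
ORDER BY feed_items.created_at DESC
LIMIT 30 OFFSET 0;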
I don't think that the query needs simplifying; if you identify a performance problem, then you may need to revise your indexes.