Calculations based on condition in PostgreSQL - sql

I am having trouble doing calculations in one table using conditional statements. I have a table 'df' with the following column names:
id - int
time - timestamp
correctness - boolean
subject - text
Every student (id) completes tasks on a particular subject (subject). The system stores "true" in the correctness column if the assignment is completed correctly and "false" if not. The time (time) when the student completes the task is also saved by the system.
I need to write an optimal SQL query that counts all students who completed 20 tasks successfully within an hour during March 2020.
Thanks in advance!

You can do this with no subqueries:
select distinct id
from df
where correctness
  and time >= '2020-03-01' and time < '2020-04-01'
group by id, date_trunc('hour', time)
having count(*) >= 20;
Note: the correctness filter restricts this to successfully completed tasks, which your question asks for. Also be aware that date_trunc('hour', ...) buckets tasks by calendar hour, so it will not catch 20 tasks inside a 60-minute window that straddles an hour boundary.
For performance, you want an index on (time).
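For example (a sketch; the partial index assumes you only ever count correct tasks, as in the query above):
-- covers the filtered and grouped columns; the predicate keeps the index small
create index df_correct_id_time_idx on df (id, time) where correctness;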

You need to look at each correct task and see if there are 20 correct tasks delivered within the hour before it.
That means you have to inner join the table to itself and then count.
select distinct on (tasks.id) tasks.id, tasks.time, count(*) as correct_in_hour
from df tasks
inner join df previous_tasks
  on previous_tasks.id = tasks.id
 and previous_tasks.correctness
 and previous_tasks.time <= tasks.time
 and tasks.time - previous_tasks.time < interval '1 hour'
 and previous_tasks.time >= '2020-03-01'
where tasks.correctness
  and tasks.time >= '2020-03-01' and tasks.time < '2020-04-01'
group by 1, 2
having count(*) >= 20
order by tasks.id, tasks.time;
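An alternative that handles a sliding 60-minute window (not just windows aligned to the calendar hour) is a window function with a RANGE frame, available since PostgreSQL 11. A sketch, assuming the df table from the question:
select distinct id
from (
    select id,
           count(*) over (
               partition by id
               order by time
               range between interval '1 hour' preceding and current row
           ) as correct_in_hour
    from df
    where correctness
      and time >= '2020-03-01' and time < '2020-04-01'
) s
where correct_in_hour >= 20;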

Related

How to calculate average time between different events from the same table

I am currently trying to calculate the average processing time of messages in Postgres. There are multiple stages in the processing lifecycle, and I would like to identify the average processing time between each stage. I have successfully calculated the average processing time for the full lifecycle using the following:
select AVG(e2.timestamp - e.timestamp) avg_gap
from event e
join event e2 on (e.message_id = e2.message_id)
where e.event_stage= 'start' and e.timestamp > '2022-10-01T00:00:08.000001Z' and e.timestamp < '2022-10-31T23:59:59.999999Z'
and e2.event_stage= 'end' and e2.timestamp > '2022-10-01T00:00:08.000001Z' and e2.timestamp < '2022-10-31T23:59:59.999999Z'
However I would now like to add additional event stages to the query to calculate the average processing time between each stage of the lifecycle.
As I am an SQL noob, I tried to update my query to the below, but I receive the error operator does not exist: interval & interval.
select AVG((e3.timestamp - e2.timestamp) & (e2.timestamp - e.timestamp)) avg_gap
from event e
join event e2 on (e.message_id = e2.message_id)
join event e3 on (e2.message_id = e3.message_id)
where e.event_stage= 'start' and e.timestamp > '2022-10-01T00:00:08.000001Z' and e.timestamp < '2022-10-31T23:59:59.999999Z'
and e2.event_stage= 'validation' and e2.timestamp > '2022-10-01T00:00:08.000001Z' and e2.timestamp < '2022-10-31T23:59:59.999999Z'
and e3.event_stage= 'end' and e3.timestamp > '2022-10-01T00:00:08.000001Z' and e3.timestamp < '2022-10-31T23:59:59.999999Z'
I was hoping that the above would provide me with an average processing time from start to validation, and validation to end.
NOTE - There are other stages that I would eventually like to include such as parsing and transforming.
Is it possible for someone to provide some input on how to add multiple stages to the query?
EDIT - table structure as per the below:
Can you clarify the table structure?
Along with event_stage, do you also have some kind of an ID associated with the stage?
That would help if you wanted to use lead/lag to get the timestamp for the "next stage".
Something like this -
SELECT message_id,
       event_stage,
       timestamp,
       lead(timestamp, 1) over (partition by message_id order by stage_id) as next_stage_timestamp
FROM event
You could then use "next_stage_timestamp - timestamp" to get your difference and average it grouping by event_stage.
Like this -
select event_stage,
       avg(next_stage_timestamp - timestamp) as avg_time
from above_results
group by event_stage
This is better than doing multiple self joins. However, it would work only if you had some kind of an ID associated with each stage.
So your table would be like this -
message_id  stage_id  event_stage   timestamp
---------------------------------------------
A           1         Start         00
A           2         Calculate     20
A           3         Intermediate  30
A           4         Validate      40
A           5         End           60
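Putting the two steps together in one query (a sketch; stage_id stands for the proposed stage-ordering column, which is not in the original table):
with stage_gaps as (
    select message_id,
           event_stage,
           lead(timestamp) over (partition by message_id order by stage_id) - timestamp as gap
    from event
)
select event_stage,
       avg(gap) as avg_gap
from stage_gaps
where gap is not null  -- the last stage has no next stage
group by event_stage;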

Cancel rate keeps returning 100%, not sure if I'm misunderstanding CTEs

I'm trying to calculate cancel rate for transactions with close dates in 2022. However, I keep running into an issue where cancel rate keeps returning 100% and I'm wondering if it may be because of my misunderstanding of how to use CTEs.
Data lives in two tables: ad and transactions. Ad is set up in such a way that if a transaction had a close date and that close date later changes, both are recorded.
As an example, if transaction 10 was supposed to close on 9/15/22 but closed on 9/22/22 instead, there is an entry for (orderid: 10 | closedate: 9/15/22) and one for (orderid: 10 | closedate: 9/22/22). I'm only interested in the earliest possible close date, hence the MIN(a.closedate).
WITH cancelled AS (
SELECT a.orderid AS "Order",
    MIN(a.closedate) AS "CloseDate"
FROM ad.order_history a
WHERE a.closedate < current_date
AND a.status = 'Cancelled'
AND a.closedate IS NOT NULL
AND a.orderid IS NOT NULL
AND a.closedate >= '2022-01-01'
GROUP by 1
ORDER BY 2
)
SELECT DATE_TRUNC('month', c.closedate) AS "Month",
COUNT(DISTINCT t.ad_id) AS "Total Orders",
COUNT(DISTINCT c.order) / COUNT(DISTINCT t.id) AS "Cancel Rate"
FROM transactions t
LEFT JOIN cancelled c ON t.ad_id = c.order
WHERE (t.ad_id IS NOT null OR t.order_number IS NOT NULL)
AND DATE_TRUNC('year', c.closedate) >= '2022-01-01'
AND c.closedate < current_date
AND t.deleted_at IS NULL
GROUP BY 1
When I run this query, the 'Cancel Rate' returns as 100%, which makes me a little confused. Logically, counting only distinct t.ad_id, a.orderid, and t.id should return the same number. I thought the CTE picks out certain ids from a.orderid, so c.order should not be equal to a.orderid, as it holds only the transactions that have been cancelled, not all transactions generally.
I must have misunderstood/misused the CTE then, since it keeps returning 100%, which tells me it's picking out all of the a.orderid values, not just the cancelled ones. I'm not quite sure how to fix it/get it to work correctly and would appreciate any pointers. Thank you!
Not an answer, but a long-form comment with the idea of getting at an answer. Observations:
You are not working on table ad, you are using table order_history in schema ad.
As mentioned in the comments, I don't see a.orderid AS "Order" and ON t.ad_id = c.order working. Same for MIN(a.closedate) AS "CloseDate" and c.closedate. The double quotes preserve the mixed case of the identifier, while the unquoted references later on fold to lowercase, so they would not match. This leads me to believe the query you posted is not the query you are running. To your question add the actual query you are running.
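A minimal demonstration of the folding rule (a throwaway table just for illustration):
create temp table demo ("CloseDate" date);
select "CloseDate" from demo;  -- OK: quoted, exact case
select closedate from demo;    -- ERROR: column "closedate" does not exist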
You describe the purpose of order_history, but you have not described the purpose of transactions and, more importantly, what their relationship is. Also add the schema definitions for both tables.
You can simplify this:
a.closedate < current_date ... AND a.closedate >= '2022-01-01'
to:
WHERE a.closedate BETWEEN '2022-01-01' AND (current_date - '1 day'::interval)
Why t.ad_id IS NOT null?

PSQL Filter query by time intervals

I have a query that will count the number of all completed issuances from a specific network. The problem is that the DB has a lot of issuances, starting from 2019-2020, and it counts all of them, while I need the ones since last month (relative to the current time, not some fixed date), in a practical way. Examples:
This is the query that counts all, which is about 12k
select count(*)
from issuances_extended
where network = 'ethereum'
and status = 'completed'
And this is the query I wrote that counts from a month ago to current time, which is about 100
select count(*)
from issuances_extended
where network = 'ethereum'
and issued_at > now() - interval '1 month'
and status = 'completed'
But I have a lot to count (1, 2, 3, 4, 5 months ago, year to date) and different networks, so going my way is ultimately a very inefficient solution. Is there a better way? It seems like this could be done via JS transformers, but I couldn't figure it out.
Try using GROUP BY and DATE_TRUNC.
SELECT DATE_TRUNC('month', issued_at) as month, count(*) as issuances
FROM issuances_extended
WHERE network = 'ethereum'
AND status = 'completed'
GROUP BY DATE_TRUNC('month', issued_at)
How to Group by Month in PostgreSQL
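If you need rolling ranges (last 1/3/5 months, year to date) rather than calendar months, a single scan with FILTER clauses can produce all the counts for every network at once. A sketch using the columns from the question:
select network,
       count(*) filter (where issued_at > now() - interval '1 month')  as last_month,
       count(*) filter (where issued_at > now() - interval '3 months') as last_3_months,
       count(*) filter (where issued_at >= date_trunc('year', now()))  as year_to_date
from issuances_extended
where status = 'completed'
group by network;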

sql query to get today new records compared with yesterday

I have this table:
COD (Integer) (PK)
ID (Varchar)
DATE (Date)
I just want to get the new IDs from today, compared with yesterday (the IDs from today that are not present yesterday).
This needs to be done with just one query, with maximum efficiency, because the table will have 4-5 million records.
As a Java developer I am able to do this with 2 queries, but doing it with just one is beyond my knowledge, so any help would be much appreciated.
EDIT: the date format is dd/mm/yyyy and every day each ID may come 0 or 1 times.
Here is a solution that will go over the base data one time only. It selects the id and the date where the date is either yesterday or today (or both). Then it groups by id - each group will have either one or two rows. Then it filters by the condition that the MIN date in the group is "today". Those are the ids that exist today but did not exist yesterday.
DATE is an Oracle keyword, best not used as a column name. I changed that to DT. I also assume that your "dt" field is a pure date (as pure as it can be in Oracle, meaning: time of day, which is always present, is 00:00:00).
select id
from your_table
where dt in (trunc(sysdate), trunc(sysdate) - 1)
group by id
having min(dt) = trunc(sysdate)
;
Edit: Gordon makes a good point: perhaps you may have more than one such row per ID, in the same day? In that case the time-of-day may also be different from 00:00:00.
If so, the solution can be adapted:
select id
from your_table
where dt >= trunc(sysdate) - 1 and dt < trunc(sysdate) + 1
group by id
having min(dt) >= trunc(sysdate)
;
Either way: (1) the base table is read just once; (2) the column DT is not wrapped within any function, so if there is an index on that column, it can be used to access just the needed rows.
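For example, a composite index like this would let the database range-scan on DT and answer the query from the index alone (a sketch, using the DT column name from above):
create index your_table_dt_id_idx on your_table (dt, id);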
The typical method would use not exists:
select t.*
from t
where t.date >= trunc(sysdate) and t.date < trunc(sysdate + 1) and
not exists (select 1
from t t2
where t2.id = t.id and
t2.date >= trunc(sysdate - 1) and t2.date < trunc(sysdate)
);
This is a general solution. If you know that there is at most one record per day, there are better solutions, such as using lag().
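For completeness, a sketch of the lag() variant under that assumption (at most one row per id and day; dt as in the first answer):
select id
from (
    select id, dt,
           lag(dt) over (partition by id order by dt) as prev_dt
    from your_table
    where dt >= trunc(sysdate) - 1 and dt < trunc(sysdate) + 1
)
where dt >= trunc(sysdate)
  and prev_dt is null;  -- no row yesterday, so the id is new today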
Use MINUS. I suppose your date column has a time part, so you need to truncate it.
select id from mytable where trunc(date) = trunc(sysdate)
minus
select id from mytable where trunc(date) = trunc(sysdate) - 1;
I suggest the following function-based index. Without it, the query would have to full scan the table, which would probably be quite slow.
create index idx on mytable ( trunc(date), id );

Using a top 10 query to then search all records associated with them

I'm not super experienced with SQL in general, and I'm trying to accomplish a pretty specific task: I want to first run a query to get the IDs of all my units with the top number of hits, and then from that run again to get the messages and counts of all the types of hits for those IDs in a specific time period. For the first query, I have this:
SELECT entity, count(entity) as Count
from plugin_status_alerts
where entered BETWEEN now() - INTERVAL '14 days' AND now()
group by entity
order by count(entity) DESC
limit 10
which results in this return:
"38792";3
"39416";2
"37796";2
"39145";2
"37713";2
"37360";2
"37724";2
"39152";2
"39937";2
"39667";2
The idea is to then use that result set to run another query that orders by entity and status_code. I tried something like this:
SELECT status_code, entity, COUNT(status_code) statusCount
FROM plugin_status_alerts
where updated BETWEEN now() - INTERVAL '14 days' AND now() AND entity IN
(SELECT id.entity, count(id.entity) as Count
from plugin_status_alerts id
where id.updated BETWEEN now() - INTERVAL '14 days' AND now()
group by id.entity
order by count(id.entity) DESC
limit 10
)
GROUP BY status_code, entity
but I get the error
ERROR: subquery has too many columns
I'm not sure if this is the route I should be going, or if maybe I should be trying a self join - either way, I'm not sure how to correct what's happening now.
Use a JOIN instead of IN (subquery). That's typically faster, and you can use additional values from the subquery if you need to (like the total count per entity):
SELECT entity, status_code, count(*) AS status_ct
FROM (
SELECT entity -- not adding count since you don't use it, but you could
FROM plugin_status_alerts
WHERE entered BETWEEN now() - interval '14 days' AND now()
GROUP BY entity
ORDER BY count(*) DESC, entity -- as tie breaker to get stable result
LIMIT 10
) sub
JOIN plugin_status_alerts USING (entity)
WHERE updated BETWEEN now() - interval '14 days' AND now()
GROUP BY 1, 2;
Notes
If you don't have future entries by design, you can simplify:
WHERE entered > now() - interval '14 days'
Since the subquery only returns a single column (entity), which is merged with the USING clause, column names are unambiguous and we don't need table qualification here.
LIMIT 10 after you sort by the count is likely to be ambiguous. Multiple rows can tie for the 10th row. Without additional items in ORDER BY, Postgres returns arbitrary picks, which may or may not be fine. But the result of the query can change between calls without any changes to the underlying data. Typically, that's not desirable and you should add columns or expressions to the list to break ties.
count(*) is a bit faster than count(status_code) and does the same here - unless status_code can be NULL, in which case count(status_code) returns 0 for the NULL group (count() never returns NULL) instead of the actual row count, which is either useless or actively wrong. Use count(*) either way here; a tiny illustration follows these notes.
GROUP BY 1, 2 is just syntactical shorthand. Details:
Select first row in each GROUP BY group?
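The tiny illustration promised above (made-up values):
select count(*)           as all_rows,       -- returns 3
       count(status_code) as non_null_rows   -- returns 2: the NULL row is skipped
from (values ('a'), ('b'), (null)) t(status_code);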
When you plug your first query into the second and use it in the IN clause, you still return two columns where IN wants only one. Either do this:
SELECT status_code, entity, COUNT(status_code) statusCount
FROM plugin_status_alerts
where updated BETWEEN now() - INTERVAL '14 days' AND now()
AND entity IN (
SELECT id.entity
from plugin_status_alerts id
where id.updated BETWEEN now() - INTERVAL '14 days' AND now()
group by id.entity
order by count(id.entity) DESC
limit 10
)
GROUP BY status_code, entity
Or use the first query as a derived table and join with it.