How to optimize query to compute row-dependent datetime relationships? - sql

Say I have a simplified model in which a patient can have zero or more events. An event has a category and a date. I want to support questions like:
Find all patients that were given a medication after an operation and
the operation happened after an admission.
Where medication, operation and admission are all types of event categories. There are ~100 possible categories.
I'm expecting 1000s of patients and every patient has ~10 events per category.
The naive solution I came up with was to have two tables, patient and event, create an index on event.category, and then query using inner joins like:
SELECT COUNT(DISTINCT patient.id) FROM patient
INNER JOIN event AS medication
        ON medication.patient_id = patient.id
       AND medication.category = 'medication'
INNER JOIN event AS operation
        ON operation.patient_id = patient.id
       AND operation.category = 'operation'
INNER JOIN event AS admission
        ON admission.patient_id = patient.id
       AND admission.category = 'admission'
WHERE medication.date > operation.date
  AND operation.date > admission.date;
However this solution does not scale well as more categories/filters are added. With 1,000 patients and 45,000 events I see the following performance behaviour:
| number of inner joins | approx. query response |
| --------------------- | ---------------------- |
| 2 | 100ms |
| 3 | 500ms |
| 4 | 2000ms |
| 5 | 8000ms |
Explain:
Does anyone have any suggestions on how to optimize this query/data model?
Extra info:
Postgres 10.6
In the Explain output, project_result is equivalent to patient in the simplified model.
Advanced use case:
Find all patients that were given a medication within 30 days after an
operation and the operation happened within 7 days after an admission.

First, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:
SELECT COUNT(DISTINCT patient_id) -- still not optimal
FROM   event a
JOIN   event o USING (patient_id)
JOIN   event m USING (patient_id)
WHERE  a.category = 'admission'
AND    o.category = 'operation'
AND    m.category = 'medication'
AND    m.date > o.date
AND    o.date > a.date;
Next, get rid of the repeated multiplication of rows and the DISTINCT to counter that in the outer SELECT by using EXISTS semi-joins instead:
SELECT COUNT(*)
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date > o.date
      )
   )
AND    a.category = 'admission';
Note that there can still be duplicate admissions, but that's probably a more fundamental issue in your data model / query design and would need clarification, as discussed in the comments.
If you indeed want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step - and repeat a similar approach for every additional step. Probably fastest for your case (re-introducing the patient table to the query):
SELECT count(*)
FROM   patient p
CROSS  JOIN LATERAL (  -- get earliest admission
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id
   AND    e.category = 'admission'
   ORDER  BY e.date
   LIMIT  1
   ) a
CROSS  JOIN LATERAL (  -- get earliest operation after that
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id
   AND    e.category = 'operation'
   AND    e.date > a.date
   ORDER  BY e.date
   LIMIT  1
   ) o
WHERE  EXISTS (  -- the *last* step can still be a plain EXISTS
   SELECT FROM event m
   WHERE  m.patient_id = p.id
   AND    m.category = 'medication'
   AND    m.date > o.date
   );
See:
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest record per user
You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store an integer (or even int2 or "char") value as FK.
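A minimal sketch of that design; the table name event_category and the column names are assumptions for illustration, not from the original question:

-- Hypothetical lookup table; "char" is a compact 1-byte Postgres type.
CREATE TABLE event_category (
   category_id   "char" PRIMARY KEY
 , category_name text NOT NULL UNIQUE   -- e.g. 'medication', 'operation'
);

-- The event table then stores the small FK instead of the long text value.
CREATE TABLE event (
   id          serial PRIMARY KEY
 , patient_id  int    NOT NULL REFERENCES patient (id)
 , category_id "char" NOT NULL REFERENCES event_category (category_id)
 , date        date   NOT NULL
);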
For best performance (and this is crucial) have a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here. Postgres can use the index with default ASC sort order almost as efficiently in your case.
If VACUUM (preferably in the form of autovacuum) can keep up with write operations or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.
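A sketch of that index on the question's original two-table layout (the index name is arbitrary):

-- Multicolumn index matching the typical access path:
-- filter by patient and category, then scan / order by date.
CREATE INDEX event_patient_category_date_idx
    ON event (patient_id, category, date DESC);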
Related:
Optimizing queries on a range of timestamps (two columns)
Select Items that has one item but not the other
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
To implement your additional time frames (your "advanced use case"), build on the second query since we have to consider all events again.
You should really have case IDs or something more definitive to tie operation to admission and medication to operation etc. where relevant. (Could simply be the id of the referenced event!) Dates / timestamps alone are error-prone.
SELECT COUNT(*)                   -- to count cases
    -- COUNT(DISTINCT patient_id) -- to count patients
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date      -- or ">"
   AND    o.date <  a.date + 7  -- based on data type "date"!
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date       -- or ">"
      AND    m.date <  o.date + 30  -- syntax for timestamp is different
      )
   )
AND    a.category = 'admission';
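As a side note on the case-ID suggestion above, a minimal DDL sketch; the column name follows_event_id is hypothetical and it assumes event has a primary key id:

-- Hypothetical: let each event optionally reference the event it follows,
-- e.g. an operation pointing at its admission, a medication at its operation.
ALTER TABLE event
    ADD COLUMN follows_event_id int REFERENCES event (id);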
About date / timestamp arithmetic:
How to get the end of a day?
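In short (this is general Postgres behaviour, independent of the question's schema): adding an integer to a date adds whole days, while a timestamp needs an interval:

SELECT date '2024-01-01' + 30;                             -- date + integer → 2024-01-31
SELECT timestamp '2024-01-01 00:00' + interval '30 days';  -- timestamp + interval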

You might find that conditional aggregation does what you want. The time component can be difficult to handle if your sequences get complicated (see below), but the basic idea is:
select e.patient_id
from event e
group by e.patient_id
having (max(e.date) filter (where e.category = 'medication') >
        min(e.date) filter (where e.category = 'operation')
       ) and
       (min(e.date) filter (where e.category = 'operation') >
        min(e.date) filter (where e.category = 'admission')
       );
This can be generalized for further categories.
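As a sketch, the same pattern extended with a hypothetical fourth category, 'discharge', required to come after the medication (the category name is an assumption for illustration):

select e.patient_id
from event e
group by e.patient_id
having max(e.date) filter (where e.category = 'medication') >
       min(e.date) filter (where e.category = 'operation')
   and min(e.date) filter (where e.category = 'operation') >
       min(e.date) filter (where e.category = 'admission')
   and max(e.date) filter (where e.category = 'discharge') >
       min(e.date) filter (where e.category = 'medication');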
Using group by and having should have the consistent performance characteristics that you want (although for simple queries it might be slower). The trick with this -- or any approach -- is what happens when there are multiple categories for a given patient.
For instance, this or your approach will find:
admission --> operation --> admission --> medication
I suspect that you don't really want to find these records. You probably need an intermediate level, representing some sort of "episode" for a given patient.
If that is the case, you should ask another question with clearer examples of the data, the questions you might want to ask, and cases that match and do not match the conditions.

Related

SQL - Guarantee at least n unique users with 2 appearances each in query

I'm working with AWS Personalize and one of the service Quotas is to have "At least 1000 records containing a min of 25 unique users with at least 2 records each", I know my raw data has those numbers but I'm trying to find a way to guarantee that those numbers will always be met, even if the query is run by someone else in the future.
The easy way out would be to just use the full dataset, but right now we are working towards a POC, so that is not really my first option. I have covered the "two records each" section by just counting the appearances, but I don't know how to guarantee the min of 25 users.
It is important to say that my data is not shuffled in any way at the time of saving.
My query
SELECT C.productid AS ITEM_ID,
       A.userid AS USER_ID,
       A.createdon AS "TIMESTAMP",
       B.fromaddress_countryname AS "LOCATION"
FROM A AS orders
JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
JOIN C AS order_items ON orders.order_id = order_items.order_id
WHERE orders.userid IN (
    SELECT orders.userid
    FROM A AS orders
    GROUP BY orders.userid
    HAVING count(*) > 2
)
LIMIT 10
I use the LIMIT to just query a subset since I'm in AWS Athena.
The IN query is not very efficient since it needs to compare each row with all the elements of the subquery (in the worst case) to find a match.
It would be easier to start by storing all users with at least 2 records in a common table expression (CTE) and do a join to select them.
To ensure at least 25 distinct users you will need a window function to count the unique users since the first row and add a condition on that count. Since you can't use a window function in the where clause, you will need a second CTE and a final query that queries it.
For example:
with users as (
    select userid              -- users with at least 2 records
    from orders
    group by 1
    having count(*) > 1        -- this condition ensures at least 2 records
),
cte as (
    SELECT C.productid AS ITEM_ID,
           A.userid AS USER_ID,
           A.createdon AS "TIMESTAMP",
           B.fromaddress_countryname AS "LOCATION",
           count(distinct A.userid) over (rows between unbounded preceding and current row) as n_distinct_users
    FROM A AS orders
    JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
    JOIN C AS order_items ON orders.order_id = order_items.order_id
    JOIN users ON A.userid = users.userid -- ensure only users with at least 2 records
    ORDER BY A.userid -- needed for the window function
)
select * from cte where n_distinct_users < 26
Sorting by userid in the CTE will ensure that at least 2 records per userid appear in the results.

Count of records in the 2nd table according to the values in the first table

I have 3 tables like this:
And I would like to get this:
The output should contain all rows from the Notification table (selected attributes), then the Name of the Location where the notification occurred (Notification.Location_ID = Location.ID),
and the count of processes that happened in the same location during the time period of the notification (COUNT(Process.ID) WHERE Notification.Location_ID = Process.Location_ID AND DateTime > Begin AND DateTime < End).
I think I have a problem with joining the Process table properly.
How should the whole SQL query look to get the wanted output? Thanks.
It looks like the join between the first two tables is OK, so I would propose adding the following subquery at the end of the SELECT clause, before FROM, to calculate the counter for each row:
(SELECT Count(*) FROM Process WHERE Process.Location_ID = Location.ID) AS Counter
If you have included the third table (Process) in the join, you will have to remove it.
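A sketch of that approach with the time window from the question added to the subquery; the table and column names (Notification, Location, Process, Begin, End, DateTime) are taken from the question, and quoting or case may need adjusting for your database (End is a reserved word in some dialects):

SELECT n.Name,
       n.Begin,
       n."End",
       l.Name AS Location_Name,
       (SELECT COUNT(*)
        FROM Process p
        WHERE p.Location_ID = n.Location_ID
          AND p.DateTime > n.Begin
          AND p.DateTime < n."End") AS Counter
FROM Notification n
JOIN Location l ON l.ID = n.Location_ID;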
I think you want to join process on common location ID and the timestamp being in the time range.
SELECT n.name,
n.begin,
n.end,
l.name,
count(p.id)
FROM notification n
LEFT JOIN location l
ON l.id = n.location_id
LEFT JOIN process p
ON p.location_id = n.location_id
AND p.datetime >= n.begin
AND p.datetime < n.end
GROUP BY n.name,
n.begin,
n.end,
l.name;

How do I join with the latest update?

I have two tables:
delivery with columns uid, dtime, candy which records which candy was given to which user when
lookup with columns uid and ltime which records when the user's pocket was examined
I need to know the result of the lookup, i.e., the result table should have columns uid, ltime, candy, telling me what was found in the user's pocket (assume the user eats the old candy when given the new one).
There were several deliveries before each lookup.
I need only the latest one.
E.g.,
select l.uid, l.ltime,
d.candy /* ... for max(d.dtime):
IOW, I want to sort (d.dtime, d.candy)
by the first field in decreasing order,
then take the second field in the first element */
from delivery d
join lookup l
on d.uid = l.uid
and d.dtime <= l.ltime
group by l.uid, l.ltime
So, how do I know what was found by the lookup?
Use TOP 1 WITH TIES to get the latest delivery and join back to the lookup table:
Select * from lookup
Inner Join (
    Select Top 1 with Ties uid, dtime, candy
    From delivery
    Order by row_number() over (partition by uid order by dtime desc)
) as Delivery
    on lookup.uid = Delivery.uid and lookup.ltime >= Delivery.dtime
I would suggest outer apply:
select l.*, d.candy
from lookup l outer apply
(select top 1 d.*
from delivery d
where d.uid = l.uid and d.dtime <= l.ltime
order by d.dtime desc
) d;
That answers your question. But, wouldn't the user have all the candies since the last lookup? Or, are we assuming that the user eats the candy on hand when the user is given another? Perhaps the pocket only holds one candy.

Retrieve records from multiple Records returned by Sub-Query

I have this database diagram:
The diagram represents a database for an insurance company.
The final_cost table represents the cost that the company has to pay to repair a car.
The cars table has the field car_type, which takes one of the following values (1, 2, 3), where 1 refers to small cars, 2 refers to trucks, and 3 refers to buses.
I want to retrieve the kind (1, 2, or 3) that has the maximum repair cost during the year 2013.
I wrote the following query:
select innerr.car_type from (
    select car_type, sum(final_cost.cost)
    from car_acc
    inner join cars on cars.car_id = car_acc.car_id
    inner join final_cost on final_cost.car_acc_id = car_acc.car_acc_id
    where (extract(year from final_cost.fittest_date) = 2013)
    group by car_type
) innerr;
but I don't know how to get the car_type with the maximum repair cost from the inner sub-query.
You can have access to anything and everything from a subquery if you use it right. The best way to build a complicated query is to start simply, seeing what data you have and usually the answer, or the next step, will be obvious.
So let's start by displaying all the accidents for 2013. We aren't interested in the individual cars, just the most expensive accidents by type. So...
select c.car_type, f.cost
from car_acc a
join cars c
on c.car_id = a.car_id
join final_cost f
on f.car_acc_id = a.car_acc_id
where f.fittest_date >= date '2013-01-01'
and f.fittest_date < date '2014-01-01';
I've changed the filtering criteria to a sargable form for efficiency. I don't usually worry about performance early in the design of a query, but when it's this obvious, why not?
Anyway, we now have a list of all 2013 accidents, by car type and the cost of each one. So now we only have to group by the type and take the Max of the cost of each group.
select c.car_type, Max( f.cost ) MaxCost
from car_acc a
join cars c
on c.car_id = a.car_id
join final_cost f
on f.car_acc_id = a.car_acc_id
where f.fittest_date >= date '2013-01-01'
and f.fittest_date < date '2014-01-01'
group by c.car_type;
Now we have a list of car types and the most expensive accidents for that type for 2013. With only three rows in the result set, it's easy to see which is the car type we're looking for. Now we just have to isolate that one row. The easiest step from here is to use this query in a CTE.
with MaxPerType( car_type, MaxCost )as(
select c.car_type, Max( f.cost ) MaxCost
from car_acc a
join cars c
on c.car_id = a.car_id
join final_cost f
on f.car_acc_id = a.car_acc_id
where f.fittest_date >= date '2013-01-01'
and f.fittest_date < date '2014-01-01'
group by c.car_type
)
select m.car_type, m.MaxCost
from MaxPerType m
where m.MaxCost =(
select Max( MaxCost )
from MaxPerType );
So the CTE gives us the largest cost per type and the subquery in the main query gives us the largest cost overall. So the result is the type(s) that match the largest cost overall.
You could try either ORDER BY or, better yet, use the MAX function: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions085.htm
Try this:
SELECT A.car_type
FROM (SELECT c.car_type, SUM(fc.cost) totalCost
FROM car_acc ca
INNER JOIN cars c ON c.car_id = ca.car_id
INNER JOIN final_cost fc ON fc.CAR_ACC_ID = ca.CAR_ACC_ID
WHERE EXTRACT(YEAR FROM fc.fittest_date) = 2013
GROUP BY c.car_type
ORDER BY totalCost DESC
) AS A
WHERE ROWNUM = 1;

Flattening Join in PostgreSQL

Is it possible to join a table so that only a specific row at a specific ordered offset is joined instead of every matching record in table?
I have two tables, Customer and MonthlyRecommendation. MonthlyRecommendation points to Customer and tracks one product recommendation made by the customer at some day in each month.
I'm trying to write a query that retrieves each customer, along with the last 12-months of recommendations. Simply doing:
SELECT c.id, m.date, m.product
FROM Customer AS c
INNER JOIN MonthlyRecommendation AS m ON m.customer_id = c.id
will get me the data I want, but I need it flattened so that each customer's data is in one row, and the result signature looks like:
id, date_01, product_01, date_02, product_02, ..., date_12, product_12
Is there any way to do this in PostgreSQL? For similar problems, I would normally just make 12 separate JOINs, joining on a specific sub-condition for each one, but in this case the condition is relative to the order of the date values in the table. I'd like to be able to specify an ORDER BY, with maybe a LIMIT and OFFSET, but I don't believe any SQL dialect supports that.
Some databases support the pivot operation directly. In Postgres, you could use the crosstab function from the tablefunc extension. But the aggregation method is simple enough:
SELECT c.id,
       '2013-01' as date_01, max(case when m.date = '2013-01' then m.product end) as product_01,
       '2013-02' as date_02, max(case when m.date = '2013-02' then m.product end) as product_02,
       . . .
       '2013-12' as date_12, max(case when m.date = '2013-12' then m.product end) as product_12
FROM Customer c INNER JOIN
     MonthlyRecommendation m
     ON m.customer_id = c.id
GROUP BY c.id;
Of course, the above query is just guessing at a format for date. You'll need to put the right comparison in for your data.
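For instance, if MonthlyRecommendation.date is an actual date or timestamp column rather than a 'YYYY-MM' text value, one option (an assumption, not part of the original answer) is to bucket it by month with to_char:

SELECT c.id,
       max(m.date)    FILTER (WHERE to_char(m.date, 'YYYY-MM') = '2013-01') AS date_01,
       max(m.product) FILTER (WHERE to_char(m.date, 'YYYY-MM') = '2013-01') AS product_01,
       -- ... repeat for the remaining months ...
       max(m.date)    FILTER (WHERE to_char(m.date, 'YYYY-MM') = '2013-12') AS date_12,
       max(m.product) FILTER (WHERE to_char(m.date, 'YYYY-MM') = '2013-12') AS product_12
FROM Customer c
JOIN MonthlyRecommendation m ON m.customer_id = c.id
GROUP BY c.id;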