This is a common question but from the questions I've browsed I wasn't able to find a good answer. For every employee I want to display their last known status and the time of that status.
I have two SQL Server tables (well actually they're views):
Employee

| EmployeeID | Name |
| ---------- | ---- |
| 123        | xyz  |

Clock

| EmployeeID | ClockType | Time                |
| ---------- | --------- | ------------------- |
| 123        | I         | 2022-12-19 10:00:00 |
| 123        | G         | 2022-12-19 19:21:00 |
There is some additional data in those tables, but I think I'll be able to figure that out myself. My problem right now is that I need to find the latest entry for every employee in a performant way.
Current approach:
SELECT
    e.EmployeeID,
    c.Time AS LastClock,
    c.ClockType
FROM Employee e
LEFT JOIN Clock c ON e.EmployeeID = c.EmployeeID
    AND c.Time = (
        SELECT MAX(Time) FROM Clock c1
        WHERE e.EmployeeID = c1.EmployeeID
    )
This works, but with many millions of rows, this is really slow. I've also tried simply limiting the inner select with a condition to only return the last X days but that breaks the requirements since I need it for every employee, even if they weren't here in the past month.
What are better ways to do this? The code would be called about every minute, so performance is quite important.
I think a reasonable way of writing it is the following:
WITH c AS (
    SELECT clock.*,
           row_number() OVER (PARTITION BY employeeid ORDER BY time DESC) AS rn
    FROM clock
)
SELECT
    e.EmployeeID,
    c.Time AS LastClock,
    c.ClockType
FROM Employee e
LEFT JOIN c ON e.EmployeeID = c.EmployeeID
WHERE COALESCE(c.rn, 1) = 1
However, I think the most important factor for the performance of this (or your original query) is ensuring you have a proper index on clock (employeeid, time), e.g. create index ix_clock on clock (employeeid, time)
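A runnable sketch of the ROW_NUMBER() approach, using Python's bundled sqlite3 as a stand-in for SQL Server (SQLite 3.25+ is assumed for window functions; the rows are the sample data from the question plus one invented employee with no clock entries, to show the LEFT JOIN behavior):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (EmployeeID INTEGER, Name TEXT);
    CREATE TABLE Clock (EmployeeID INTEGER, ClockType TEXT, Time TEXT);
    -- the index that makes the per-employee "latest row" lookup cheap
    CREATE INDEX ix_clock ON Clock (EmployeeID, Time);

    INSERT INTO Employee VALUES (123, 'xyz'), (456, 'abc');
    INSERT INTO Clock VALUES
        (123, 'I', '2022-12-19 10:00:00'),
        (123, 'G', '2022-12-19 19:21:00');
""")

rows = conn.execute("""
    WITH c AS (
        SELECT Clock.*,
               ROW_NUMBER() OVER (PARTITION BY EmployeeID
                                  ORDER BY Time DESC) AS rn
        FROM Clock
    )
    SELECT e.EmployeeID, c.Time AS LastClock, c.ClockType
    FROM Employee e
    LEFT JOIN c ON e.EmployeeID = c.EmployeeID
    WHERE COALESCE(c.rn, 1) = 1   -- keep employees with no clock rows at all
    ORDER BY e.EmployeeID
""").fetchall()
print(rows)  # [(123, '2022-12-19 19:21:00', 'G'), (456, None, None)]
```

Employee 456 still appears (with NULLs) because COALESCE treats the unmatched LEFT JOIN row as rank 1.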
For better performance, use the query below; it returns the same result:
SELECT e.EmployeeID, c.Time AS LastClock, c.ClockType
FROM Employee e
LEFT JOIN (
SELECT EmployeeID, Time, ClockType
FROM Clock
WHERE Time = (SELECT MAX(Time) FROM Clock c1 WHERE c1.EmployeeID = Clock.EmployeeID)
) c ON e.EmployeeID = c.EmployeeID
I need help figuring out how to write a query against the employee database's employees table that will generate a list of all employees (EMPY_ID) and their retirement dates (RETIRE_DT) then take those results and run a second query against the payroll database's paychecks table looking for all payments (PAY_DT) made to the employee (EMPY_ID) after they retire (RETIRE_DT). I have this so far but I think I am heading down the wrong path. Still learning. Any help is appreciated.
WITH R AS (
    SELECT p.*
    FROM [paydatabase].[dbo].[PAY_DT] p
    INNER JOIN [employeedb].[dbo].[employees] e ON p.EMPY_ID = e.EMPY_ID
    WHERE e.RETIRE_DT <= '2020-01-01'
)
SELECT * FROM R WHERE PAY_DT >= '2020-01-01'
I think the idea is:
SELECT p.*
FROM [employeedb].[dbo].[employees] e
JOIN [paydatabase].[dbo].[PAY_DT] p
  ON p.EMPY_ID = e.EMPY_ID
 AND p.PAY_DT >= e.RETIRE_DT;
That is, you want to compare the pay date to the retirement date, not to a constant.
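The idea can be checked with a small sqlite3 sketch (a single in-memory database and an invented paychecks layout stand in for the two real databases; dates are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (EMPY_ID INTEGER, RETIRE_DT TEXT);
    CREATE TABLE paychecks (EMPY_ID INTEGER, PAY_DT TEXT);
    INSERT INTO employees VALUES (1, '2019-06-30'), (2, '2021-03-31');
    INSERT INTO paychecks VALUES
        (1, '2019-06-15'),  -- paid before retirement: should not match
        (1, '2019-07-15'),  -- paid after retirement: should match
        (2, '2021-03-15');  -- paid before retirement: should not match
""")

late_payments = conn.execute("""
    SELECT p.EMPY_ID, p.PAY_DT
    FROM employees e
    JOIN paychecks p
      ON p.EMPY_ID = e.EMPY_ID
     AND p.PAY_DT >= e.RETIRE_DT  -- compare to that row's own date, not a constant
""").fetchall()
print(late_payments)  # [(1, '2019-07-15')]
```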
Say I have a simplified model in which a patient can have zero or more events. An event has a category and a date. I want to support questions like:
Find all patients that were given a medication after an operation and
the operation happened after an admission.
Where medication, operation and admission are all types of event categories. There are ~100 possible categories.
I'm expecting 1000s of patients and every patient has ~10 events per category.
The naive solution I came up with was to have two tables, a patient and an event table. Create an index on event.category and then query using inner-joins like:
SELECT COUNT(DISTINCT(patient.id)) FROM patient
INNER JOIN event AS medication
ON medication.patient_id = patient.id
AND medication.category = 'medication'
INNER JOIN event AS operation
ON operation.patient_id = patient.id
AND operation.category = 'operation'
INNER JOIN event AS admission
ON admission.patient_id = patient.id
AND admission.category = 'admission'
WHERE medication.date > operation.date
AND operation.date > admission.date;
However this solution does not scale well as more categories/filters are added. With 1,000 patients and 45,000 events I see the following performance behaviour:
| number of inner joins | approx. query response |
| --------------------- | ---------------------- |
| 2 | 100ms |
| 3 | 500ms |
| 4 | 2000ms |
| 5 | 8000ms |
Explain: (execution plan screenshot not reproduced here)
Does anyone have any suggestions on how to optimize this query/data model?
Extra info:
Postgres 10.6
In the Explain output, project_result is equivalent to patient in the simplified model.
Advanced use case:
Find all patients that were given a medication within 30 days after an
operation and the operation happened within 7 days after an admission.
First, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:
SELECT COUNT(DISTINCT patient) -- still not optimal
FROM event a
JOIN event o USING (patient_id)
JOIN event m USING (patient_id)
WHERE a.category = 'admission'
AND o.category = 'operation'
AND m.category = 'medication'
AND m.date > o.date
AND o.date > a.date;
Next, get rid of the repeated multiplication of rows and the DISTINCT to counter that in the outer SELECT by using EXISTS semi-joins instead:
SELECT COUNT(*)
FROM event a
WHERE EXISTS (
SELECT FROM event o
WHERE o.patient_id = a.patient_id
AND o.category = 'operation'
AND o.date > a.date
AND EXISTS (
SELECT FROM event m
WHERE m.patient_id = a.patient_id
AND m.category = 'medication'
AND m.date > o.date
)
)
AND a.category = 'admission';
Note, there can still be duplicates in the admission, but that's probably a principal problem in your data model / query design, and would need clarification as discussed in the comments.
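A minimal runnable sketch of the EXISTS version, with Python's bundled sqlite3 standing in for Postgres (sample rows are invented; note that SQLite needs SELECT 1 where Postgres allows a bare SELECT inside EXISTS):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (patient_id INTEGER, category TEXT, date TEXT);
    INSERT INTO event VALUES
        (1, 'admission',  '2020-01-01'),
        (1, 'operation',  '2020-01-05'),
        (1, 'medication', '2020-01-10'),  -- patient 1 has the full chain
        (2, 'admission',  '2020-02-01'),
        (2, 'operation',  '2020-02-03');  -- patient 2 has no medication
""")

n, = conn.execute("""
    SELECT COUNT(*)
    FROM event a
    WHERE EXISTS (
        SELECT 1 FROM event o
        WHERE o.patient_id = a.patient_id
          AND o.category = 'operation'
          AND o.date > a.date
          AND EXISTS (
              SELECT 1 FROM event m
              WHERE m.patient_id = a.patient_id
                AND m.category = 'medication'
                AND m.date > o.date))
      AND a.category = 'admission'
""").fetchone()
print(n)  # 1 -- only patient 1's admission qualifies
```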
If you indeed want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step - and repeat a similar approach for every additional step. Probably fastest for your case (re-introducing the patient table to the query):
SELECT count(*)
FROM patient p
CROSS JOIN LATERAL ( -- get earliest admission
SELECT e.date
FROM event e
WHERE e.patient_id = p.id
AND e.category = 'admission'
ORDER BY e.date
LIMIT 1
) a
CROSS JOIN LATERAL ( -- get earliest operation after that
SELECT e.date
FROM event e
WHERE e.patient_id = p.id
AND e.category = 'operation'
AND e.date > a.date
ORDER BY e.date
LIMIT 1
) o
WHERE EXISTS ( -- the *last* step can still be a plain EXISTS
SELECT FROM event m
WHERE m.patient_id = p.id
AND m.category = 'medication'
AND m.date > o.date
);
See:
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest record per user
You might optimize your table design by shortening the lengthy (and redundant) category names: use a lookup table and store only a small integer (or even int2 or "char") value as FK.
For best performance (and this is crucial) have a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here. Postgres can use the index with default ASC sort order almost as efficiently in your case.
If VACUUM (preferably in the form of autovacuum) can keep up with write operations or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.
Related:
Optimizing queries on a range of timestamps (two columns)
Select Items that has one item but not the other
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
To implement your additional time frames (your "advanced use case"), build on the second query since we have to consider all events again.
You should really have case IDs or something more definitive to tie operation to admission and medication to operation etc. where relevant. (Could simply be the id of the referenced event!) Dates / timestamps alone are error-prone.
SELECT COUNT(*) -- to count cases
-- COUNT(DISTINCT patient_id) -- to count patients
FROM event a
WHERE EXISTS (
SELECT FROM event o
WHERE o.patient_id = a.patient_id
AND o.category = 'operation'
AND o.date >= a.date -- or ">"
AND o.date < a.date + 7 -- based on data type "date"!
AND EXISTS (
SELECT FROM event m
WHERE m.patient_id = a.patient_id
AND m.category = 'medication'
AND m.date >= o.date -- or ">"
AND m.date < o.date + 30 -- syntax for timestamp is different
)
)
AND a.category = 'admission';
About date / timestamp arithmetic:
How to get the end of a day?
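The time-window logic from the advanced use case can also be sketched with sqlite3; note that SQLite spells date arithmetic as date(x, '+7 days') where Postgres (with data type date) can write x + 7. Sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (patient_id INTEGER, category TEXT, date TEXT);
    INSERT INTO event VALUES
        (1, 'admission',  '2020-01-01'),
        (1, 'operation',  '2020-01-05'),  -- 4 days after admission: in window
        (1, 'medication', '2020-01-20'),  -- 15 days after operation: in window
        (2, 'admission',  '2020-02-01'),
        (2, 'operation',  '2020-02-20');  -- 19 days after admission: too late
""")

n, = conn.execute("""
    SELECT COUNT(*)
    FROM event a
    WHERE EXISTS (
        SELECT 1 FROM event o
        WHERE o.patient_id = a.patient_id
          AND o.category = 'operation'
          AND o.date >= a.date
          AND o.date < date(a.date, '+7 days')    -- 7-day window
          AND EXISTS (
              SELECT 1 FROM event m
              WHERE m.patient_id = a.patient_id
                AND m.category = 'medication'
                AND m.date >= o.date
                AND m.date < date(o.date, '+30 days')))  -- 30-day window
      AND a.category = 'admission'
""").fetchone()
print(n)  # 1 -- patient 2's operation falls outside the 7-day window
```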
You might find that conditional aggregation does what you want. The time component can be difficult to handle if your sequences get complicated (see below), but here is the basic idea:
select e.patient_id
from events e
group by e.patient_id
having (max(e.date) filter (where e.category = 'medication') >
        min(e.date) filter (where e.category = 'operation')
       ) and
       (min(e.date) filter (where e.category = 'operation') >
        min(e.date) filter (where e.category = 'admission')
       );
This can be generalized for further categories.
Using group by and having should have the consistent performance characteristics that you want (although for simple queries it might be slower). The trick with this -- or any approach -- is what happens when there are multiple categories for a given patient.
For instance, this or your approach will find:
admission --> operation --> admission --> medication
I suspect that you don't really want to find these records. You probably need an intermediate level, representing some sort of "episode" for a given patient.
If that is the case, you should ask another question with clearer examples of the data, the questions you might want to ask, and cases that match and do not match the conditions.
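The conditional-aggregation idea can be sketched with Python's bundled sqlite3; CASE expressions are used here as a portable stand-in for Postgres's FILTER clause (the two are equivalent for this purpose), and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (patient_id INTEGER, category TEXT, date TEXT);
    INSERT INTO events VALUES
        (1, 'admission',  '2020-01-01'),
        (1, 'operation',  '2020-01-05'),
        (1, 'medication', '2020-01-10'),  -- ordered chain: qualifies
        (2, 'operation',  '2020-02-01'),
        (2, 'medication', '2020-01-15');  -- medication before operation, no admission
""")

matches = conn.execute("""
    SELECT patient_id
    FROM events
    GROUP BY patient_id
    HAVING MAX(CASE WHEN category = 'medication' THEN date END) >
           MIN(CASE WHEN category = 'operation'  THEN date END)
       AND MIN(CASE WHEN category = 'operation'  THEN date END) >
           MIN(CASE WHEN category = 'admission'  THEN date END)
""").fetchall()
print(matches)  # [(1,)]
```

Patient 2 drops out both because the ordering fails and because the missing admission makes its MIN() NULL, which can never compare true.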
I want to select employees, having more than 10 products and older than 50. I also want to have their last product selected. I use the following query:
SELECT
PE.EmployeeID, E.Name, E.Age,
COUNT(*) as ProductCount,
(SELECT TOP(1) xP.Name
FROM ProductEmployee xPE
INNER JOIN Product xP ON xPE.ProductID = xP.ID
WHERE xPE.EmployeeID = PE.EmployeeID
AND xPE.Date = MAX(PE.Date)) as LastProductName
FROM
ProductEmployee PE
INNER JOIN
Employee E ON PE.EmployeeID = E.ID
WHERE
E.Age > 50
GROUP BY
PE.EmployeeID, E.Name, E.Age
HAVING
COUNT(*) > 10
Here is the execution plan link: https://www.dropbox.com/s/rlp3bx10ty3c1mf/ximExPlan.sqlplan?dl=0
However it takes too much time to execute it. What's wrong with it? Is it possible to make a more efficient query?
I have one limitation - I can not use CTE. I believe it will not bring performance here anyway though.
Before creating an index, I believe we can restructure the query. Your query can be rewritten like this:
SELECT E.ID,
       E.Name,
       E.Age,
       CS.ProductCount,
       CS.LastProductName
FROM Employee E
CROSS APPLY (SELECT TOP 1 P.Name AS LastProductName,
                    ProductCount
             FROM (SELECT *,
                          COUNT(1) OVER (PARTITION BY EmployeeID) AS ProductCount -- product count per employee
                   FROM ProductEmployee PE
                   WHERE PE.EmployeeID = E.ID) PE
             JOIN Product P
               ON PE.ProductID = P.ID
             WHERE ProductCount > 10 -- keep only employees who have more than 10 products
             ORDER BY Date DESC) CS  -- to find the latest sold product
WHERE E.Age > 50
This should work:
SELECT *
FROM Employee AS E
INNER JOIN (
SELECT PE.EmployeeID
FROM ProductEmployee AS PE
GROUP BY PE.EmployeeID
HAVING COUNT(*) > 10
) AS PE
ON PE.EmployeeID = E.ID
CROSS APPLY (
SELECT TOP (1) P.*
FROM Product AS P
INNER JOIN ProductEmployee AS PE2
ON PE2.ProductID = P.ID
WHERE PE2.EmployeeID = E.ID
ORDER BY PE2.Date DESC
) AS P
WHERE E.Age > 50;
Proper indexes should speed the query up. You're filtering by Age, so the following one should help:
CREATE INDEX ix_Employee_Age_Name
ON Employee (Age, Name);
The subquery that finds employees with more than 10 records should be calculated first, and CROSS APPLY should bring back data more efficiently with the TOP operator rather than comparing each row to a MAX value.
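To illustrate the overall shape (filter employees by product count first, then fetch only the latest product per employee), here is a sketch with Python's bundled sqlite3. SQLite has no CROSS APPLY, so correlated subqueries stand in for it, and all table contents are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (ID INTEGER, Name TEXT, Age INTEGER);
    CREATE TABLE Product (ID INTEGER, Name TEXT);
    CREATE TABLE ProductEmployee (EmployeeID INTEGER, ProductID INTEGER, Date TEXT);
    INSERT INTO Employee VALUES (1, 'Ann', 55), (2, 'Bob', 60);
    INSERT INTO Product VALUES (10, 'Widget'), (11, 'Gadget');
""")
# Ann gets 10 Widget sales plus a later Gadget sale (11 products total);
# Bob gets a single sale and should be filtered out.
conn.executemany("INSERT INTO ProductEmployee VALUES (?, ?, ?)",
                 [(1, 10, f"2023-01-{d:02d}") for d in range(1, 11)])
conn.execute("INSERT INTO ProductEmployee VALUES (1, 11, '2023-02-01')")
conn.execute("INSERT INTO ProductEmployee VALUES (2, 10, '2023-01-01')")

rows = conn.execute("""
    SELECT E.ID, E.Name,
           (SELECT COUNT(*) FROM ProductEmployee PE
            WHERE PE.EmployeeID = E.ID) AS ProductCount,
           (SELECT P.Name
            FROM ProductEmployee PE
            JOIN Product P ON PE.ProductID = P.ID
            WHERE PE.EmployeeID = E.ID
            ORDER BY PE.Date DESC
            LIMIT 1) AS LastProductName   -- latest product, like TOP (1) ... ORDER BY Date DESC
    FROM Employee E
    WHERE E.Age > 50
      AND (SELECT COUNT(*) FROM ProductEmployee PE
           WHERE PE.EmployeeID = E.ID) > 10
""").fetchall()
print(rows)  # [(1, 'Ann', 11, 'Gadget')]
```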
The answer by #Prdp is great, but I thought I'd drop in an alternative. Sometimes windowed functions do not perform very well, and it's worth replacing them with ol' good subqueries.
Also, do not use datetime; use datetime2. This is suggested by Microsoft:
https://msdn.microsoft.com/en-us/library/ms187819.aspx
Use the time, date, datetime2 and datetimeoffset data
types for new work. These types align with the SQL Standard. They are
more portable. time, datetime2 and datetimeoffset provide
more seconds precision. datetimeoffset provides time zone support
for globally deployed applications.
By the way, here's a tip: try to name your surrogate primary keys after the table, so they become more meaningful and joins feel more natural, e.g.:
In Employee table replace ID with EmployeeID
In Product table replace ID with ProductID
I find this a good practice.
with usersOver50with10productsOrMore (employeeID, productID, date, id, name, age, products) as (
    select employeeID, productID, date, id, name, age, count(productID)
    from productEmployee
    join employee on productEmployee.employeeID = employee.id
    where age >= 50
    group by employeeID, productID, date, id, name, age
    having count(productID) >= 10
)
select sfq.name, sfq.age, pro.name, sfq.products, max(date)
from usersOver50with10productsOrMore as sfq
join product pro on sfq.productID = pro.id
group by sfq.name, sfq.age, pro.name, sfq.products;
There is no need to find the last productID for the entire table; just filter the last product from the results of employees with 10 or more products and over the age of 50.
I have a query that pulls user id and various events associated with each user. Since a user can have multiple events, I need the first event ever associated with each user.
Add a constraint to any answers - our server is delicate and has a hard time handling subqueries.
Here is the initial query:
select
c.id as eid,
c.created as contact_created,
e.event_time,
cs.name as initial_event_type
from bm_emails.cid378 c
inner join bm_sets.event378 e on e.eid = c.id
inner join bm_config.classes cs on cs.id = e.class_id
group by
eid, initial_class
order by eid desc
Which produces results that look like this:
| eid    | contact_created     | event_time          | initial_event_type |
| ------ | ------------------- | ------------------- | ------------------ |
| 283916 | 2015-03-09 10:56:22 | 2015-03-09 10:57:21 | Hot                |
| 283916 | 2015-03-09 10:56:22 | 2015-03-09 10:56:22 | Warm               |
| 283914 | 2015-03-09 10:17:32 | 2015-03-09 10:17:32 | Warm               |
| 283912 | 2015-03-09 10:11:03 | 2015-03-09 10:11:03 | Warm               |
| 283910 | 2015-03-09 09:54:15 | 2015-03-09 09:54:15 | Hot                |
So in this case user 283916 has been returned twice in the results. What I'd like is to return only one result for this user, the one where initial_event_type says "warm" since that happened first (min event_time).
Here is what I tried. Presumably it would work on a more powerful server, but ours just can't handle subqueries - it takes a long time, and our developer gets upset whenever I leave queries running.
select
c.id as eid,
c.created as contact_created,
e.event_time,
cs.name as initial_class
from bm_emails.cid378 c
inner join bm_sets.event378 e on e.eid = c.id
inner join bm_config.classes cs on cs.id = e.class_id
where concat(e.eid, e.event_time) in ( select concat(eid, min(event_time))
from bm_sets.event378
group by eid)
group by
eid, initial_class
order by eid desc
Is it possible to pull this data without use of sub queries? I've seen people do multiple joins on the same table before and I think that may be the right path but, like our server, my brain is not powerful enough to figure out how to start on that path.
Any other more efficient solutions?
** Following the answer below, here is the outcome of the explain statement (output not reproduced here).
Is it possible to pull this data without use of sub queries
The fact that you use a subquery is not the problem. The problem is caused by filtering on concat(eid, min(event_time)) in your subquery, since there is likely no index on this expression, requiring a table scan. A better option would be a filtered subquery:
select c.id as eid,
       c.created as contact_created,
       e.event_time,
       cs.name as initial_class
from bm_emails.cid378 c
inner join bm_sets.event378 e on e.eid = c.id
inner join bm_config.classes cs on cs.id = e.class_id
where e.event_time = (select min(event_time)
                      from bm_sets.event378
                      where eid = e.eid)
order by eid desc
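A sketch of this filtered subquery using Python's bundled sqlite3; the simplified two-table schema is an assumed stand-in for bm_emails.cid378 / bm_sets.event378 / bm_config.classes, with the sample rows taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER, created TEXT);
    CREATE TABLE events (eid INTEGER, event_time TEXT, class TEXT);
    INSERT INTO contacts VALUES (283916, '2015-03-09 10:56:22');
    INSERT INTO events VALUES
        (283916, '2015-03-09 10:56:22', 'Warm'),  -- earliest event should win
        (283916, '2015-03-09 10:57:21', 'Hot');
""")

rows = conn.execute("""
    SELECT c.id AS eid, e.event_time, e.class AS initial_event_type
    FROM contacts c
    JOIN events e ON e.eid = c.id
    WHERE e.event_time = (SELECT MIN(event_time)   -- filtered correlated subquery
                          FROM events
                          WHERE eid = e.eid)
    ORDER BY c.id DESC
""").fetchall()
print(rows)  # [(283916, '2015-03-09 10:56:22', 'Warm')]
```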
An employee is mapped to 2 supervisors for specific periods. I need to find the supervisor to which the employee was mapped for the maximum period.
The employee was mapped to supervisor A from '01/01/2010' to '31/08/2010'
and to supervisor B from '01/09/2010' to '31/12/2010',
so the supervisor with the maximum period is 'A'.
This should be found using a SQL Server query.
As no DDL has been posted as yet, this may or may not help.
Select e.EmployeeName,
       s.SupervisorName,
       es.StartDate,
       es.EndDate,
       EmpMaxDays.MaxDays as 'TotalNumberOfDaysAssigned'
From dbo.Employees e
Left Join dbo.EmployeeSupervisors es on es.EmployeeID = e.EmployeeId
Left Join
(
    Select Max(DateDiff(day, es2.StartDate, es2.EndDate)) as 'MaxDays',
           es2.EmployeeId
    From dbo.EmployeeSupervisors es2
    Group By es2.EmployeeId
) EmpMaxDays on es.EmployeeId = EmpMaxDays.EmployeeId
Left Join dbo.Supervisors s on es.SupervisorId = s.SupervisorId
Where DateDiff(day, es.StartDate, es.EndDate) = EmpMaxDays.MaxDays
  And es.EmployeeId = EmpMaxDays.EmployeeId
I suggest you use rank partitioning. This way you can select the rows where the rank = 1 (the correct match).
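Since no DDL was posted, here is a hedged sketch of rank partitioning with Python's bundled sqlite3 (SQLite 3.25+ for window functions; the table name, columns, and the julianday()-based day count are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EmployeeSupervisors
        (EmployeeID INTEGER, Supervisor TEXT, StartDate TEXT, EndDate TEXT);
    INSERT INTO EmployeeSupervisors VALUES
        (1, 'A', '2010-01-01', '2010-08-31'),  -- 242 days
        (1, 'B', '2010-09-01', '2010-12-31');  -- 121 days
""")

rows = conn.execute("""
    SELECT EmployeeID, Supervisor
    FROM (SELECT *,
                 RANK() OVER (PARTITION BY EmployeeID
                              ORDER BY julianday(EndDate)
                                     - julianday(StartDate) DESC) AS rnk
          FROM EmployeeSupervisors) AS t
    WHERE rnk = 1   -- the supervisor with the longest mapped period
""").fetchall()
print(rows)  # [(1, 'A')]
```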