Efficiently identify all FK items with n>3 dates within any 8 week period from a SQL table? - sql

I have a ~400,000 row table containing the dates at which a collection of ~30,000 people had appointments. Each row has the patient ID number and an appointment date. I want to efficiently select people who had at least 4 appointments in an 8 week span. Ideally, I would also flag the appointments that were within this 8 week span as I did so. I am working in a server environment that does not allow CLR aggregate functions. Is this possible to do in SQL server? If so, how?
What I've thought about:
If I could write my own aggregate function to do this via GROUP BY that would obviously be best - but I can't seem to find any way to do it with the built in aggregate functions.
I can add a column to my original table giving a date 8 weeks out from any given appointment, but can't come up with any way that doesn't involve a for loop to then ask the question row by row whether there are at least 3 other appointments within that window.
Finally, I've even thought that perhaps I could just do GROUP BY but somehow create 100 new columns (as there are up to that many appointments for some patients) to create a table that contains every appointment indexed by patient, but even as a SQL newbie I'm pretty sure that as soon as I get to the point of imagining adding 100 new columns I'm going down the wrong road....
For clarity of discussion, here is some notation:
MyTable:
ApptID PatientID ApptDate (in smalldatetime)
--------------------------------------------------
Apt1 Pt1 Datetime1
Apt2 Pt1 Datetime2
Apt3 Pt2 Datetime3
... ... ...
Desired output (one option):
PatientID 4aptsIn8weeks? (Boolean) InitialApptDateForWin
Pt1 1 Datetime1
Pt2 0 NULL
Pt3 1 Datetime3
...
Desired output (another option):
ApptID PatientID ApptDate InAn8wkWindow? InitialApptDateForWin
Apt1 Pt1 Datetime1 1 Datetime1
Apt2 Pt1 Datetime2 1 Datetime1
Apt3 Pt2 Datetime3 0 NULL
... ... ...
But really, any output format that will in the end let me select patients and appointments that meet this criterion would be dandy....
Thanks for any ideas!
EDIT: Here's a slightly decompressed outline of my implementation of the selected answer below, just in case the details are helpful for anyone else (being new to SQL, it took me a couple stabs to get it working):
WITH MyTableAlias AS (
SELECT * FROM MyTable
)
SELECT MyTableAlias.PatientID, MyTable.Apptdate AS V1,
MyTableAlias.Apptdate AS V2
INTO temp1
FROM MyTable INNER JOIN MyTableAlias
ON (
MyTable.PatientID = MyTableAlias.PatientID
AND (DATEDIFF(Wk,MyTable.Apptdate,MyTableAlias.Apptdate) <=8 )
);
-- Since this gives for any given two visit dates 4 hits
-- (V1-V1, V1-V2, V2-V1, V2-V2), delete the ones where the second visit is being
-- selected as V1:
DELETE FROM temp1
WHERE V2<V1;
-- So far we have just selected pairs of visits within an 8 week
-- span of each other, including an entry for each visit being
-- within 8 weeks of itself, but for the rest only including the item
-- where the second visit is after the first. Now we want to look
-- for examples of first visits where there are at least 4 hits:
SELECT PatientID, V1, MAX(V2) AS lastvisitinspan, DATEDIFF(Wk,V1,MAX(V2))
AS nWeeksInSpan, COUNT(*) AS nVisitsInSpan
INTO MyOutputTable
FROM temp1
GROUP BY PatientID, V1
HAVING COUNT(*)>3;
-- From here on it's just a matter of how I want to handle patients with two
-- separate V1 examples meeting criteria...

Rough outline of the query:
INNER JOIN the table ("table") with itself ("alias"), the ON clause would be:
table.patientid = alias.patientid
table.appointment_date < alias.appointment_date
datediff(table.appointment_date, alias.appointment_date) <= 8 week
Then GROUP BY table.patientid, table.appointment_date
Output table.patientid, table.appointment_date, MAX(alias.appointment_date), COUNT(*)
Add a HAVING COUNT(*) > n clause
There are some issues though:
With 400,000 rows the JOIN could produce a very large result set
It will count some date ranges twice. E.g. if there were 4 visits in 9 week period then it will return two rows (#1, #2, #3 and #2, #3, #4).
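The outline above can be sketched as a small runnable example. This is only an illustration under some assumptions: it uses SQLite through Python's sqlite3 module rather than SQL Server, the sample dates are made up, and julianday() stands in for a day-based DATEDIFF (on SQL Server something like DATEDIFF(day, ...) <= 56 would play that role; note that DATEDIFF(wk, ...) counts week boundaries crossed rather than elapsed weeks). Because the join uses a strict "<", the starting appointment is not counted among the matches, so "at least 4 appointments" becomes COUNT(*) >= 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ApptID TEXT, PatientID TEXT, ApptDate TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [
        # Pt1: four appointments within 50 days (made-up sample data)
        ("Apt1", "Pt1", "2020-01-01"),
        ("Apt2", "Pt1", "2020-01-10"),
        ("Apt3", "Pt1", "2020-02-01"),
        ("Apt4", "Pt1", "2020-02-20"),
        # Pt2: four appointments spread over the year
        ("Apt5", "Pt2", "2020-01-01"),
        ("Apt6", "Pt2", "2020-03-15"),
        ("Apt7", "Pt2", "2020-06-01"),
        ("Apt8", "Pt2", "2020-09-01"),
    ],
)

# Self-join: for each appointment t, find the later appointments a of the
# same patient that fall within 56 days (8 weeks) of t.
windows = conn.execute(
    """
    SELECT t.PatientID,
           t.ApptDate       AS window_start,
           MAX(a.ApptDate)  AS last_in_window,
           COUNT(*) + 1     AS n_appts        -- +1 for the starting appointment
    FROM MyTable AS t
    JOIN MyTable AS a
      ON a.PatientID = t.PatientID
     AND a.ApptDate > t.ApptDate
     AND julianday(a.ApptDate) - julianday(t.ApptDate) <= 56
    GROUP BY t.PatientID, t.ApptDate
    HAVING COUNT(*) >= 3                      -- 3 later visits = 4 in total
    ORDER BY t.PatientID, t.ApptDate
    """
).fetchall()

print(windows)
```

Each returned row is one qualifying window start, so a patient with overlapping windows can appear more than once, which is the double-counting issue mentioned above.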

Related

Firebird select distinct with count

In Firebird 2.5 I have a table of hardware device events; each row contains a timestamp, a device ID and an integer status of the event. I need to retrieve a rowset of the subset of IDs with non-0 statuses and the number of instances of the non-0 events for each ID, within a specified date range. I can get the subset of IDs with non-0 statuses in the specified date range, but I can't figure out how to get the count of non-0-status rows associated with each ID in the same rowset. I'd prefer to do this in a query rather than a stored proc, if possible.
The table is:
RPR_HISTORY
TSTAMP timestamp
RPRID integer
PARID integer
LASTRES integer
LASTCUR float
The rowset I want is like
RPRID ERRORCOUNT
-------------------
18 4
19 2
66 7
The query
select distinct RPRID from RPR_HISTORY
where (LASTRES <> 0)
and (TSTAMP >= :STARTSTAMP);
gives me the IDs I'm looking for, but obviously not the count of non-0-status rows for each ID. I've tried a bunch of combinations of nested queries derived from the above; all generate errors, usually grouping or aggregation errors. It seems like a straightforward thing to do but is just escaping me.
Got it! The query
select rh.RPRID, count(rh.RPRID) from RPR_HISTORY rh
where (rh.LASTRES <> 0)
and (rh.TSTAMP >= :STARTSTAMP)
and rh.RPRID in
(select distinct rd.RPRID from RPR_HISTORY rd where rd.LASTRES <> 0)
group by rh.RPRID;
returns the rowset I need.
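As far as I can tell, the IN subquery is not needed, because the WHERE clause already restricts the rows to non-0 statuses before grouping. Here is a minimal sketch to check that, run against SQLite via Python's sqlite3 rather than Firebird 2.5; the sample rows and the start timestamp are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RPR_HISTORY "
    "(TSTAMP TEXT, RPRID INTEGER, PARID INTEGER, LASTRES INTEGER, LASTCUR REAL)"
)
conn.executemany(
    "INSERT INTO RPR_HISTORY VALUES (?, ?, ?, ?, ?)",
    [
        ("2020-01-01 08:00:00", 18, 1, 3, 0.5),  # non-0, in range
        ("2020-01-02 08:00:00", 18, 1, 7, 0.5),  # non-0, in range
        ("2020-01-03 08:00:00", 18, 1, 0, 0.5),  # status 0, ignored
        ("2020-01-02 09:00:00", 19, 1, 5, 0.1),  # non-0, in range
        ("2019-12-01 09:00:00", 19, 1, 5, 0.1),  # non-0, before the start
        ("2020-01-05 10:00:00", 66, 2, 0, 0.2),  # status 0, ignored
    ],
)
start = "2020-01-01 00:00:00"  # hypothetical value for :STARTSTAMP

# Plain GROUP BY: filter first, then count per device ID.
simple = conn.execute(
    """
    SELECT RPRID, COUNT(*) AS ERRORCOUNT
    FROM RPR_HISTORY
    WHERE LASTRES <> 0 AND TSTAMP >= ?
    GROUP BY RPRID
    ORDER BY RPRID
    """,
    (start,),
).fetchall()

# The query from the post, with the extra IN subquery.
with_in = conn.execute(
    """
    SELECT rh.RPRID, COUNT(rh.RPRID)
    FROM RPR_HISTORY rh
    WHERE rh.LASTRES <> 0 AND rh.TSTAMP >= ?
      AND rh.RPRID IN
          (SELECT DISTINCT rd.RPRID FROM RPR_HISTORY rd WHERE rd.LASTRES <> 0)
    GROUP BY rh.RPRID
    ORDER BY rh.RPRID
    """,
    (start,),
).fetchall()

print(simple, with_in)
```

On this sample data both forms return the same rowset, so the shorter one is probably enough.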

How do I stop my query from pulling duplicates?

Yes, I know this seems simple:
SELECT DISTINCT(...)
Except, it apparently isn't
Here is my actual Query:
SELECT
DeclinationReasons.Reason,
EmployeeInformation.ID,
EmployeeInformation.Employee,
EmployeeInformation.Active,
CompletedTrainings.DecShotDate,
CompletedTrainings.DecShotLocation,
CompletedTrainings.DecReason,
CompletedTrainings.DecExplanation,
IIf([DecShotLocation]="MCS","Yes","No") AS YesMCS,
IIf([DecReason]=1,1,0) AS YesAllergy,
IIf([DecReason]=2,1,0) AS YesImmune,
IIf([DecReason]=3,1,0) AS YesAdverse,
IIf([DecReason]=4,1,0) AS YesMedical,
IIf([DecReason]=5,1,0) AS YesSpiritual,
IIf([DecReason]=6,1,0) AS YesOther,
IIf([DecReason]=7,1,0) AS YesAlready
FROM
EmployeeInformation
INNER JOIN (CompletedTrainings
LEFT JOIN DeclinationReasons ON CompletedTrainings.DecReason = DeclinationReasons.ReasonID)
ON EmployeeInformation.ID = CompletedTrainings.Employee
GROUP BY
DeclinationReasons.Reason,
EmployeeInformation.ID,
EmployeeInformation.Employee,
EmployeeInformation.Active,
CompletedTrainings.DecShotDate,
CompletedTrainings.DecShotLocation,
CompletedTrainings.DecReason,
CompletedTrainings.DecExplanation,
IIf([DecShotLocation]="MCS","Yes","No"),
IIf([DecReason]=1,1,0),
IIf([DecReason]=2,1,0),
IIf([DecReason]=3,1,0),
IIf([DecReason]=4,1,0),
IIf([DecReason]=5,1,0),
IIf([DecReason]=6,1,0),
IIf([DecReason]=7,1,0)
HAVING
((((EmployeeInformation.Active) Like -1)
AND ((CompletedTrainings.DecShotDate + 365 >= DATE())
OR (CompletedTrainings.DecShotDate IS NULL))));
This joins a few tables (obviously) in order to get a number of records. The problem is that if someone appears more than once in the table, with a NULL in the date field of one record and a date in another, the query pulls both the NULL and the date, or pulls multiple NULLs. It might also pull multiple dates, but no such records are present right at the moment.
I need the NULLs; they are actual data in this particular case. But if someone has a date and a NULL, I need to pull only the newest record. I thought I could add MAX(RecordID) from the table, but that didn't change the results of the query either.
That code:
SELECT
DeclinationReasons.Reason,
EmployeeInformation.ID,
EmployeeInformation.Employee,
EmployeeInformation.Active,
MAX(CompletedTrainings.RecordID),
CompletedTrainings.DecShotDate
...
It returned the same problem: duplicated EmployeeInformation.ID values with different DecShotDate values.
Currently it returns:
ID   Active   DecShotDate   etc. x a bunch
--------------------------------------------------
1    -1       date date     whatever goes
2    -1                     in these
2    -1       date date     columns
These are being used in a report that determines the total number of employees who fit the criteria of the report. The NULLs in DecShotDate are needed as they show people who did not refuse to get a flu vaccine in the current year, while the dates are people who did refuse.
Now, I have come up with one simple solution: I could add a column to the CompletedTrainings table that contains a date or other value, and add that to the HAVING statement. This might be the right solution, as this is a yearly training questionnaire that employees have to fill out. But I am asking for advice before doing this.
Am I right in thinking I need to add a column to filter by so that older data isn't being pulled, or should I be able to do this by pulling recordID, and did I just bork that part of the query up?
Edited to add raw table views:
EmployeeInformation Table:
ID   Last   First    empID   Active   Termdate   DoH    Title   PT/FT/PD   PI
--------------------------------------------------------------------------------
1    Doe    Jane     982     -1                  date   Sr      PD         X
2    Roe    John     278     0        date       date   Jr      PD         X
3    Moe    Larry    1232    -1                  date   Sr      FT         X
4    Zoe    Debbie   1424    -1                  date   Sr      PT         X
DeclinationReasons Table:
ReasonID   Reason
---------------------------
1          Allergy
2          Already got it
3          Illness
CompletedTrainings Table:
RecordID   Employee   Training   ...   DecShotdate   DecShotLocation   DecShotReason   DecExp
------------------------------------------------------------------------------------------------
1          1          4                date          location          2               text
2          1          4
3          2          4
4          3          4                date          location          3               text
5          3          4                date          location          1               text
6          4          4
After some serious soul searching, I decided to use another column and filter by that.
In the end my query looks like this:
SELECT *
FROM (
(
SELECT RecordID, DecShotDate, DecShotLocation, DecReason, DecExplanation, Employee,
IIf([DecShotLocation]="MCS","Yes","No") AS YesMCS, IIf([DecReason]=1,1,0) AS YesAllergy,
IIf([DecReason]=2,1,0) AS YesImmune, IIf([DecReason]=3,1,0) AS YesAdverse,
IIf([DecReason]=4,1,0) AS YesMedical, IIf([DecReason]=5,1,0) AS YesSpiritual,
IIf([DecReason]=6,1,0) AS YesOther, IIf([DecReason]=7,1,0) AS YesAlready
FROM CompletedTrainings WHERE (CompletedDate > DATE() - 365 ) AND (Training = 69)) AS T1
LEFT JOIN
(
SELECT ID, Active FROM EmployeeInformation) AS T2 ON T1.Employee = T2.ID)
LEFT JOIN
(
SELECT Reason, ReasonID FROM DeclinationReasons) AS T3 ON T1.DecReason = T3.ReasonID;
This may not have been the best solution, but it did exactly what I needed, which is to get the information by latest entry into the database.
Previously I had tried to use MAX(), DISTINCT(), etc. but always had a problem of multiple records being retrieved. In this case, I intentionally SELECT the most recent records first, then join them to the results of the next query, and so on, until I have all the required data for my report.
I write this in hopes someone else finds it useful. Or even better if someone tells me why this is wrong, so as to improve my own skills.
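For anyone who wants to try the MAX(RecordID) idea instead, here is a minimal sketch of one way it could work. It assumes that a higher RecordID always means a newer entry, it runs on SQLite via Python's sqlite3 rather than Access, and the dates are invented placeholders for the "date" cells in the raw table view. The likely reason the earlier attempt changed nothing is that DecShotDate was still a grouping column, so each distinct date (and the NULL) remained its own group. Here the maximum is computed per employee in a subquery first and then joined back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CompletedTrainings "
    "(RecordID INTEGER PRIMARY KEY, Employee INTEGER, DecShotDate TEXT)"
)
conn.executemany(
    "INSERT INTO CompletedTrainings VALUES (?, ?, ?)",
    [
        (1, 1, "2020-10-01"),  # placeholder date
        (2, 1, None),          # newer record for employee 1, no declination
        (3, 2, None),
        (4, 3, "2020-10-05"),  # placeholder date
        (5, 3, "2020-10-20"),  # placeholder date, newer record for employee 3
        (6, 4, None),
    ],
)

# Step 1 (subquery): newest RecordID per employee.
# Step 2 (join): fetch the full row for exactly that record.
latest = conn.execute(
    """
    SELECT ct.Employee, ct.RecordID, ct.DecShotDate
    FROM CompletedTrainings AS ct
    JOIN (SELECT Employee, MAX(RecordID) AS LatestID
          FROM CompletedTrainings
          GROUP BY Employee) AS newest
      ON ct.RecordID = newest.LatestID
    ORDER BY ct.Employee
    """
).fetchall()

print(latest)
```

This gives one row per employee, NULL or not, without needing the extra filter column, though the added column may still be the clearer choice for a yearly questionnaire.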

Select the FIRST record with a value greater than XXX

OK, another newbie SQL question which I'm sure has a simple solution and I'll kick myself when someone posts the answer!
I have two tables as follows
PRICE_DTA
PRC_DATE PRC_TIME PRC_PRICE PRC_ITEM
2008-01-01 06.00.00 1.05 JUMPER
2008-01-01 09.00.00 1.20 JUMPER
2008-01-25 17.00.00 1.75 JUMPER
2008-01-02 09.00.00 2.25 TROUSERS
2008-10-25 12.00.00 2.95 TROUSERS
SALE_DTA
TRN_DATE TRN_TIME TRN_PRICE_PAID TRN_ITEM
2008-01-01 08.30.00 JUMPER
2008-01-03 10.00.00 JUMPER
2008-01-03 17.00.00 JUMPER
2008-01-01 13.00.00 TROUSERS
2008-01-02 09.00.00 TROUSERS
The way the prices work is that you get the NEXT available price (prices aren't set until after the purchase because we bulk all the orders up and get a cheaper price the more we buy in one go). So, in the example, the 08.30.00 purchase on 2008-01-01 will have been for 1.20 because that is the first available price after the purchase date/time.
So, I need to populate the prices in the SALE_DTA table using the TRN_DATE/TRN_TIME fields to go and get the next available price off the PRICE_DTA table. NOTE: The DATE and TIME fields on both tables are CHAR fields, not date/timestamp fields.
I can concatenate the date and time easily enough but I'm not sure how to find the FIRST record on PRICE_DTA with a date/time stamp greater than that. I know on UNISYS DMS II I can use a 'FIND NEXT GREATER THAN' but can't find a similar command in SQL?
I'm happy to create a temporary table as part of the solution if that makes it simpler.
The generic SQL solution for this can be done with a couple of joins:
SELECT
* --TODO - Pick appropriate columns
FROM
SALE_DTA s
INNER JOIN
PRICE_DTA p
ON
p.PRC_ITEM = s.TRN_ITEM and
(p.PRC_DATE > s.TRN_DATE or
(p.PRC_DATE = s.TRN_DATE and
p.PRC_TIME > s.TRN_TIME
))
LEFT JOIN
PRICE_DTA p2
ON
p2.PRC_ITEM = s.TRN_ITEM and
(p2.PRC_DATE > s.TRN_DATE or
(p2.PRC_DATE = s.TRN_DATE and
p2.PRC_TIME > s.TRN_TIME
)) and
(p2.PRC_DATE < p.PRC_DATE or
(p2.PRC_DATE = p.PRC_DATE and
p2.PRC_TIME < p.PRC_TIME
))
WHERE
p2.PRC_ITEM IS NULL
Hopefully, you can see the logic here. The INNER JOIN is used to match rows in SALE_DTA with all rows in PRICE_DTA that occur afterwards. We then do a second join (the LEFT JOIN) to this PRICE_DTA again, this time trying to locate a row with this join (p2) such that it still occurs after the s date/time, but occurs before the p date/time.
Finally, in the WHERE clause, we eliminate any rows where this LEFT JOIN actually succeeded. Therefore, by deduction, we know that the row that we matched in p is the earliest row from PRICE_DTA which occurs after the SALE_DTA date/time.
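To see the two-join logic in action, here is a small sketch that loads the sample rows from the question and runs the same query. It uses SQLite through Python's sqlite3 as a stand-in for the real database, and the column types (TEXT, REAL) are my assumption. Since the date and time columns are character data in a sortable layout, plain string comparison is enough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PRICE_DTA (PRC_DATE TEXT, PRC_TIME TEXT, PRC_PRICE REAL, PRC_ITEM TEXT)"
)
conn.execute(
    "CREATE TABLE SALE_DTA (TRN_DATE TEXT, TRN_TIME TEXT, TRN_PRICE_PAID REAL, TRN_ITEM TEXT)"
)
conn.executemany(
    "INSERT INTO PRICE_DTA VALUES (?, ?, ?, ?)",
    [
        ("2008-01-01", "06.00.00", 1.05, "JUMPER"),
        ("2008-01-01", "09.00.00", 1.20, "JUMPER"),
        ("2008-01-25", "17.00.00", 1.75, "JUMPER"),
        ("2008-01-02", "09.00.00", 2.25, "TROUSERS"),
        ("2008-10-25", "12.00.00", 2.95, "TROUSERS"),
    ],
)
conn.executemany(
    "INSERT INTO SALE_DTA VALUES (?, ?, NULL, ?)",
    [
        ("2008-01-01", "08.30.00", "JUMPER"),
        ("2008-01-03", "10.00.00", "JUMPER"),
        ("2008-01-03", "17.00.00", "JUMPER"),
        ("2008-01-01", "13.00.00", "TROUSERS"),
        ("2008-01-02", "09.00.00", "TROUSERS"),
    ],
)

# p  = any price set after the sale
# p2 = a price set after the sale but before p; keep p only if no p2 exists
next_prices = conn.execute(
    """
    SELECT s.TRN_ITEM, s.TRN_DATE, s.TRN_TIME, p.PRC_PRICE
    FROM SALE_DTA s
    INNER JOIN PRICE_DTA p
      ON p.PRC_ITEM = s.TRN_ITEM
     AND (p.PRC_DATE > s.TRN_DATE
          OR (p.PRC_DATE = s.TRN_DATE AND p.PRC_TIME > s.TRN_TIME))
    LEFT JOIN PRICE_DTA p2
      ON p2.PRC_ITEM = s.TRN_ITEM
     AND (p2.PRC_DATE > s.TRN_DATE
          OR (p2.PRC_DATE = s.TRN_DATE AND p2.PRC_TIME > s.TRN_TIME))
     AND (p2.PRC_DATE < p.PRC_DATE
          OR (p2.PRC_DATE = p.PRC_DATE AND p2.PRC_TIME < p.PRC_TIME))
    WHERE p2.PRC_ITEM IS NULL
    ORDER BY s.TRN_ITEM, s.TRN_DATE, s.TRN_TIME
    """
).fetchall()

for row in next_prices:
    print(row)
```

The 08.30.00 jumper sale on 2008-01-01 comes out at 1.20, matching the example in the question. Note that the comparison is strict, so a sale at exactly the same date/time as a price row picks up the following price.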
You can certainly get the data required, but DB2 doesn't support JOIN within an UPDATE statement. So you can take a different route, like this:
Create an auxiliary table
create table SALE_DTA_temp(TRN_DATE,TRN_TIME,TRN_PRICE_PAID,TRN_ITEM)
Do an insert into the temp table from the query
insert into SALE_DTA_temp
select sd.TRN_DATE,
sd.TRN_TIME,
tab.max_PRC_PRICE as TRN_PRICE_PAID,
sd.TRN_ITEM
from SALE_DTA sd
join
(
select PRC_DATE, max(PRC_PRICE) as max_PRC_PRICE
from PRICE_DTA
group by PRC_DATE
) tab on sd.TRN_DATE = tab.PRC_DATE
Drop the old table
drop table SALE_DTA
Rename the table
RENAME TABLE SALE_DTA_temp TO SALE_DTA

Determine records which held particular "state" on a given date

I have a state machine architecture, where a record will have many state transitions, the one with the greatest sort_key column being the current state. My problem is to determine which records held a particular state (or states) for a given date.
Example data:
items table
id
1
item_transitions table
id item_id created_at to_state sort_key
1 1 05/10 "state_a" 1
2 1 05/12 "state_b" 2
3 1 05/15 "state_a" 3
4 1 05/16 "state_b" 4
Problem:
Determine all records from items table which held state "state_a" on date 05/15. This should obviously return the item in the example data, but if you query with date "05/16", it should not.
I presume I'll be using a LEFT OUTER JOIN to join the item_transitions table to itself and narrow down the possibilities until I have something to query on that will give me the items that I need. Perhaps I am overlooking something much simpler.
Your question rephrased means "give me all items which have been changed to state_a on 05/15 or before and have not changed to another state afterwards" (up to that date). Please note that for the example I added 2001 as the year to get a valid date. If your "created_at" column is not a datetime, I strongly suggest changing it.
So first you can retrieve the last sort_key for all items before the threshold date:
SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id
Next step is to join this result back to the item_transitions table to see to which state the item was switched at this specific sort_key:
SELECT *
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
Finally you only want those who switched to 'state_a' so just add a condition:
SELECT DISTINCT it.item_id
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
WHERE it.to_state='state_a'
You did not mention which DBMS you use, but I think this query should work with the most common ones.
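A quick way to try the final query is the sketch below. It runs on SQLite via Python's sqlite3, stores created_at as ISO date strings (with 2001 as the year, as above), and wraps the query in a small helper, items_in_state, which is just a name I made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item_transitions "
    "(id INTEGER, item_id INTEGER, created_at TEXT, to_state TEXT, sort_key INTEGER)"
)
conn.executemany(
    "INSERT INTO item_transitions VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2001-05-10", "state_a", 1),
        (2, 1, "2001-05-12", "state_b", 2),
        (3, 1, "2001-05-15", "state_a", 3),
        (4, 1, "2001-05-16", "state_b", 4),
    ],
)

QUERY = """
    SELECT DISTINCT it.item_id
    FROM item_transitions it
    JOIN (SELECT item_id, MAX(sort_key) AS last_change_sort_key
          FROM item_transitions
          WHERE created_at <= ?
          GROUP BY item_id) tmp
      ON it.item_id = tmp.item_id
     AND it.sort_key = tmp.last_change_sort_key
    WHERE it.to_state = ?
    ORDER BY it.item_id
"""


def items_in_state(state, on_date):
    """Return ids of items whose latest transition up to on_date was to `state`."""
    return [row[0] for row in conn.execute(QUERY, (on_date, state))]


print(items_in_state("state_a", "2001-05-15"))
print(items_in_state("state_a", "2001-05-16"))
```

With the example data, the item is found for 05/15 and not for 05/16, as the question expects.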

Find duplicates in Select statement after an if check

I am working on a project that keeps a track of repaired cell phones.
In the select statement, I would like to find the duplicate IMEI numbers and check if the AddedDate between the duplicates is less than 30 days. In other words, the select should list all the phones, even including the duplicated IMEI numbers if the AddedDate is more than 30 days.
I hope I described it clear enough. Thank you.
Additional notes:
I have tried it by including groupBy under a sub-select which did find the duplicates, but I wasn't able to implement an if condition. Instead, I was going to place all duplicates into a dynamic table and then use a select statement against this table. Before doing so, I thought of posting my question here.
For example DB_Phones has the following rows
ID - AddedDate - IMEI
1 - 01.10.2012 - 123456789012345
2 - 15.10.2012 - 987654321012345
3 - 20.10.2012 - 123456789012345
Based on the table above, I would like to list only the second row (ID# 2) because the last duplicate (ID# 3) wasn't added 30 days after the row with the ID# 1. If rows were as below:
ID - AddedDate - IMEI
1 - 01.10.2012 - 123456789012345
2 - 15.10.2012 - 987654321012345
3 - 20.10.2012 - 123456789012345
4 - 21.11.2012 - 123456789012345
Then the second and fourth row should be returned. I need to return just one of the duplicates (last one) if the 30 day condition is met.
I hope it makes more sense now. Thanks again.
A guess at what you're after:
SELECT
r.*,
(SELECT COUNT(*) FROM Repairs r2 WHERE r.IMEI = r2.IMEI
AND r.ID != r2.ID) as NumberOfAllDuplicates,
(SELECT COUNT(*) FROM Repairs r2 WHERE r.IMEI = r2.IMEI
AND ABS(DATEDIFF(day, r.AddedDate, r2.AddedDate)) < 30
AND r.ID != r2.ID) as NumberOfNearDuplicates
FROM
Repairs r
This depends on having an ID field, and everything existing in one table. With the correlated sub queries, it may not be very fast on long data.
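Here is a small sketch of that query against the second example from the question. It is run on SQLite via Python's sqlite3, so julianday() differences stand in for DATEDIFF(day, ...), and the dates are rewritten in ISO form. Filtering on NumberOfNearDuplicates = 0 is my own addition to show how the counts might be used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Repairs (ID INTEGER PRIMARY KEY, AddedDate TEXT, IMEI TEXT)")
conn.executemany(
    "INSERT INTO Repairs VALUES (?, ?, ?)",
    [
        (1, "2012-10-01", "123456789012345"),
        (2, "2012-10-15", "987654321012345"),
        (3, "2012-10-20", "123456789012345"),
        (4, "2012-11-21", "123456789012345"),
    ],
)

# Same two correlated subqueries as above: all other rows sharing the IMEI,
# and those of them added less than 30 days apart from this row.
counts = conn.execute(
    """
    SELECT r.ID,
           (SELECT COUNT(*) FROM Repairs r2
             WHERE r.IMEI = r2.IMEI AND r.ID != r2.ID) AS NumberOfAllDuplicates,
           (SELECT COUNT(*) FROM Repairs r2
             WHERE r.IMEI = r2.IMEI
               AND ABS(julianday(r.AddedDate) - julianday(r2.AddedDate)) < 30
               AND r.ID != r2.ID) AS NumberOfNearDuplicates
    FROM Repairs r
    ORDER BY r.ID
    """
).fetchall()

# Rows with no other repair of the same phone within 30 days.
no_near_duplicates = [row[0] for row in counts if row[2] == 0]

print(counts)
print(no_near_duplicates)
```

On this data the rows without a near duplicate are IDs 2 and 4, which matches the rows the question wants returned, though other data may need a stricter rule for picking "the last one".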