Detecting duplicates which fall outside of a date interval - sql

I searched in SO but couldnt find a direct answer.
There are patients, hospitals, medical branches(ER,urology,orthopedics,internal disease etc), medical operation codes (examination,surgical operation, MRI, ultrasound or sth. else) and patient visiting dates.
Patient visits doctor, doctor prescribes medicine and asks to come again for control check.
If patient returns after 10 days, (s)he has to pay another examination fee to the same hospital. Hospitals may appoint a date after 10 days telling there are no available slots in following 10 days, in order to get the examination fee.
Table structure is like:
Patient id.no Hospital Medical Branch Medical Op. Code Date
1 H1 M0 P1 01/05/2011
5 H1 M1 P9 03/05/2011
3 H2 M0 P2 09/05/2011
1 H1 M0 P1 14/05/2011
3 H1 M0 P2 20/05/2011
5 H1 M2 P9 25/05/2011
1 H1 M0 P3 26/05/2011
Here, visiting patients no. 3 and 5 does not constitute a problem as patient no. 3 visits different hospitals and patient no.5 visits different medical branches. They would pay the examination fee even if they visited within 10 days.
Patient no.1, however, visits same hospital, same branch and is subject to same process (P1: examination) on 01/05 and 14/05.
26/05 doesnt count because it is not medical examination.
What I want to flag is same patient, same hospital, same branch and same medical operation code (that is specifically medical examination : P1 ), with date range more than 10 days.
The format of resulting table:
HOSPITAL TOTAL NUM. of PATIENTS NUM. of PATIENTS OUT OF DATE RANGE
H1 x a
H2 y b
H3 z c
Thanks.

Once again, it's analytic functions to the rescue.
This query uses the LAG() function to link a record in YOUR_TABLE with the previous (defined by DATE) matching record (defined by PATIENT_ID) in the table.
select hospital_id
, count(*) as total_num_of_patients
, sum (out_of_range) as num_of_patients_out_of_range
from (
select patient_id
, hospital_id
, case
when hospital_id_1 = hospital_id_0
and visit_1 > visit_0 + 10
and med_op_code_1 = med_op_code_0
then 1
else 0
end as out_of_range
from (
select patient_id
, hospital_id as hospital_id_1
, date as visit_1
, med_op_code as med_op_code_1
, lag (date) over (partition by patient_id order by date) as visit_0
, lag (hopital_id) over (partition by patient_id order by date) as hopital_id_0
, lag (med_op_code) over (partition by patient_id order by date) as med_op_code_0
from your_table
where med_op_code = 'P1'
)
)
group by hospital_id
/
Caveat: I haven't tested this code, so it may contain syntax errors. I will check it the next time I can access an Oracle database.

This is a little rough, as I haven't got an Oracle DB to hand, but the key feature is the same: the analytical function LAG(). Along with its companion function, LEAD(), they're great for helping to deal with things like periods of activity.
Here's my attempt at the code:
select n.hospital, COUNT(n.patient_id) as patients_out_of_date_range
from (
select *
from (
select d.*, lag(date, 1) over (partition by d.patient_id, d.hospital, d.medical_branch, d.medical_op_code order by d.date) as prev_date
from datatable d inner join
(
select d.patient_id, d.hospital, d.medical_branch, d.medical_op_code
from datatable d
where d.medical_op_code = 'P1'
group by d.patient_id, d.hospital, d.medical_branch, d.medical_op_code
having COUNT(d.date) > 1
) t on d.patient_id = t.patient_id and d.hospital = t.hospital and d.medical_branch = t.medical_branch and d.medical_op_code = t.medical_op_code
) m
where date - prev_date > 10
) n
group by n.hospital
Like I say, this isn't tested, but it should at least get you started in the right direction.
Some references:
http://www.adp-gmbh.ch/ora/sql/analytical/lag.html
http://www.oracle-base.com/articles/misc/LagLeadAnalyticFunctions.php

I think this is what you're trying for:
WITH Patient_Visits (Patient_Id, Hospital_Id, Branch_Id, Visit_Date, Visit_Order) as (
SELECT Patient_Id, Hospital_Id, BranchId, Visit_Date,
ROW_NUMBER() OVER(PARTITION BY Patient_ID, Hospital_Id, Branch_Id,
ORDER_BY Patient_Id, Hospital_Id, Branch_Id, Visit_Date)
FROM Hospital_Visits
WHERE Procedure_Id = 'P1'),
Hospital_Recent_Visits (Hospital_Id, Recent_Visitor_Count) as (
SELECT a.Hospital_Id, COUNT(DISTINCT a.Patient_Id)
FROM Patient_Visits as a
JOIN Patient_Visits as b
ON b.Hospital_Id = a.Hospital_Id
AND b.Branch_Id = a.Branch_Id
AND b.Patient_Id = a.Patient_Id
AND b.Visit_Order = a.Visit_Order - 1
AND b.Visit_Date + 10 > a.Visit_Date
GROUP BY a.Hospital_Id, a.Patient_Id),
Hospital_Patient_Count (Hospital_Id, Patient_Count) as (
SELECT Hospital_Id, COUNT(DISTINCT Patient_Id)
FROM Hospital_Visits
GROUP BY Hospital_Id, Patient_Id)
SELECT a.Hospital_Id, b.Patient_Count, c.Recent_Visitor_Count
FROM Hospitals as a
LEFT JOIN Hospital_Patient_Count as b
ON b.Hospital_Id = a.Hospital_Id
LEFT JOIN Hospital_Recent_Visits as c
ON c.Hospital_id = a.Hospital_Id
Please note that this was written and tested against a DB2 system. I think Oracle databases have the relevant functionality, so the query should still work as written. However, DB2 appears to lack some of the OLAP functions Oracle has (my version, at least), which could be useful in knocking out some of the CTEs.

Related

Need to pull customers with specific activity this year (Leads), where that activity (Lead) was their last one

I'm using SSMS to create a report showing customer accounts where the Sales Reps didn't follow up on leads we received this year. That would be indicated in accounts wheriin the list of activities (actions in the account), 'Lead' is the last one listed (the rep didn't take any actions after receiving the lead).
My code is pulling the latest 'Lead' activity for all customers who've had at least one lead this year:
CustomerName
Activity
Date
Bob's Tires
Lead
2021-01-05
Ned's Nails
Lead
2021-02-02
Good Eats
Lead
2021-02-03
I need it to only pull customers where the Lead was the last activity:
CustomerName
Activity
Date
Ned's Nails
Lead
2021-02-02
Here is my code and example tables. What am I missing? I've tried many things with no luck.
WITH activities AS (
SELECT
a. *
, CASE WHEN a.ContactDate = MAX(CASE WHEN a.Activity LIKE 'Lead%'
THEN a.ContactDate END) OVER (PARTITION BY a.AcctID)
THEN 1 ELSE 0 END AS no_followup
FROM AcctActivities a
WHERE a.ContactDate >= '2021-01-01'
)
SELECT
c.Name,
act.Activity,
act.ContactDate
FROM Customers c
INNER JOIN activities act ON c.AcctID = act.AcctID AND act.no_followup = 1
ORDER BY c.AcctID, act.ContactDate ASC
Table 1: Customers (c)
AcctID
CustomerName
11
Bob's Tires
12
Ned's Nails
13
Good Eats
14
Embers
Table 2: Activities (a)
AcctivityID
AcctID
Activity
Date
1
11
Contact Added
2021-01-01
2
11
Lead
2021-01-05
3
11
Phone Call
2021-01-06
4
12
Lead
2021-02-02
5
13
Lead
2021-02-03
6
13
Phone Call
2021-01-15
7
13
Sales Email
2021-01-15
8
14
Cold Call
2021-01-20
Your approach filters which rows may be considered in the max value in the comparison. I've included the suggested modification below which also modifies your CASE expression to consider that the current row is a lead as the case expression may filter the bounded values to consider (i.e. it will give you the latest lead activity but the latest lead activity may not be your latest activity).
Another modification, possibly optional but safe is adding the ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING in the OVER clause of your partition. While you could have also used UNBOUNDED PRECEDING instead of CURRENT ROW, it seems like extra processing when all the rows before the ordered ContactDate would be already be less than the current value and you are interested in the maximum value for contact date . The window function by default considers all rows current and before. The amendment would ask the window function to look at all the results in the partition after the current row.
Eg.
WITH activities AS
(
SELECT
a. *,
CASE
WHEN a.Activity LIKE 'Lead%' AND
a.ContactDate = (MAX(a.ContactDate) OVER (PARTITION BY a.AcctID ORDER BY a.ContactDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ))
THEN 1
ELSE 0
END AS no_followup
FROM
AcctActivities a
WHERE
a.ContactDate >= '2021-01-01'
)
SELECT
c.Name,
act.Activity,
act.ContactDate
FROM
Customers c
INNER JOIN
activities act ON c.AcctID = act.AcctID
AND act.no_followup = 1
ORDER BY
c.AcctID, act.ContactDate ASC
Furthermore, if you are only interested in the customer details and all the resulting activity names would be Lead you may consider the following approach which uses aggregation and having clause to filter your desired results. This approach returns less details in the resulting CTE by filtering early.
WITH customers_with_last_activity_as_lead as (
SELECT AcctID, MAX(ContactDate) as ContactDate
FROM AcctActivities a
WHERE a.ContactDate >= '2021-01-01'
GROUP BY AcctID
HAVING
MAX(a.ContactDate) = MAX(
CASE
WHEN a.Activity LIKE 'Lead%' THEN a.ContactDate
END
)
)
SELECT
c.Name,
-- 'Lead' as Activity, -- Uncomment this line if it is that you would like to see this constant value in your resulting queries.
act.ContactDate
FROM
Customers c
INNER JOIN
customers_with_last_activity_as_lead act ON c.AcctID = act.AcctID
ORDER BY
c.AcctID, act.ContactDate ASC
if all the values aren't a constant/literal Lead then the following approach may assist in retrieving the correct activity name also
WITH customers_with_last_activity_as_lead as (
SELECT
AcctID,
REPLACE(MAX(CONCAT(ContactDate,Activity)),MAX(ContactDate),'') as Activity,
MAX(ContactDate) as ContactDate
FROM AcctActivities a
WHERE a.ContactDate >= '2021-01-01'
GROUP BY AcctID
HAVING
MAX(a.ContactDate) = MAX(
CASE
WHEN a.Activity LIKE 'Lead%' THEN a.ContactDate
END
)
)
SELECT
c.Name,
act.Activity,
act.ContactDate
FROM
Customers c
INNER JOIN
customers_with_last_activity_as_lead act ON c.AcctID = act.AcctID
ORDER BY
c.AcctID, act.ContactDate ASC
Let me know if this works for you.
I agree with Gordon that your question isn't totally clear (do you actually only care about activities this year? What is the intended meaning of no_followup?). Having said that, I think this is what you want:
select c.Name,
lastActivity.Activity,
lastActivity.ContactDate
from Customers c
cross apply (
select top 1 a.Activity, a.ContactDate
from activities a
where a.acctId = c.acctId
-- and a.ContactDate >= '2021-01-01' uncomment this if you only care about activity this year
order by a.ContactDate desc
) lastActivity
where lastActivity.Activity like 'Lead%'

Determine cluster of access time within 10min intervals per user per day in SQL Server

How to query in SQL from the sample data, it will group or cluster the access_time per user per day within 10min intervals?
This is a complete guess, based on reading between the lines, and is untested due to a lack of consumable sample data.
It, however, looks like you are after a triangular JOIN (these can perform poorly, especially as this won't be SARGable) and a DENSE_RANK:
SELECT YT.[date],
YT.User_ID,
YT2.AccessTime,
DENSE_RANK() OVER (PARTITION BY YT.[date], YT.User_ID ORDER BY YT1.AccessTime) AS Cluster
FROM dbo.YourTable YT
JOIN dbo.YourTable YT2 ON YT.[date] = YT2.[date]
AND YT.User_ID = YT2.User_ID
AND YT.AccessTime <= YT2.AccessTime --This will join the row to itself
AND DATEADD(MINUTE,10,YT.AccessTime) >= YT2.AccessTime; --That is intentional
If I have understood your problem you want to group all accesses for a user in a day when all accesses of that group are in a time interval of 10 minutes. Not counting single accesses, so an access distant more than 10 minutes from every other is not counted as a cluster.
You can identify the clusters joining the accesses table with itself to get all possible time intervals of 10 minutes and number them.
Finally simply rejoin access table to get accesses for each cluster:
; with
user_clusters as (
select a1.date, a1.user_id, a1.access_time cluster_start, a2.access_time cluster_end,
ROW_NUMBER() over (partition by a1.date, a1.user_id order by a1.access_time) user_cluster_id
from ACCESS_TIMES a1
join ACCESS_TIMES a2 on a1.date = a2.date and a1.user_id = a2.user_id
and a1.access_time < a2.access_time
and datediff(minute, a1.access_time, a2.access_time)<10
)
select *
from user_clusters c
join ACCESS_TIMES a on a.date = c.date and a.user_id = c.user_id and a.access_time between c.cluster_start and cluster_end
order by a.date, a.user_id, c.user_cluster_id, a.access_time
output:
date user_id access_time user_cluster_id
'2020-09-19', 'AA083P', '2020-09-19 18:15:00', 1
'2020-09-19', 'AA083P', '2020-09-19 18:22:00', 1
'2020-09-19', 'AA083P', '2020-09-19 18:22:00', 2
'2020-09-19', 'AA083P', '2020-09-19 18:28:00', 2
'2020-09-20', 'AB162Y', '2020-09-20 19:34:00', 1
'2020-09-20', 'AB162Y', '2020-09-20 19:37:00', 1

how to filter data in sql based on percentile

I have 2 tables, the first one is contain customer information such as id,age, and name . the second table is contain their id, information of product they purchase, and the purchase_date (the date is from 2016 to 2018)
Table 1
-------
customer_id
customer_age
customer_name
Table2
------
customer_id
product
purchase_date
my desired result is to generate the table that contain customer_name and product who made purchase in 2017 and older than 75% of customer that make purchase in 2016.
Depending on your flavor of SQL, you can get quartiles using the more general ntile analytical function. This basically adds a new column to your query.
SELECT MIN(customer_age) as min_age FROM (
SELECT customer_id, customer_age, ntile(4) OVER(ORDER BY customer_age) AS q4 FROM table1
WHERE customer_id IN (
SELECT customer_id FROM table2 WHERE purchase_date = 2016)
) q
WHERE q4=4
This returns the lowest age of the 4th-quartile customers, which can be used in a subquery against the customers who made purchases in 2017.
The argument to ntile is how many buckets you want to divide into. In this case 75%+ equals 4th quartile, so 4 buckets is OK. The OVER() clause specifies what you want to sort by (customer_age in our case), and also lets us partition (group) the data if we want to, say, create multiple rankings for different years or countries.
Age is a horrible field to include in a database. Every day it changes. You should have date-of-birth or something similar.
To get the 75% oldest value in 2016, there are several possibilities. I usually go for row_number() and count(*):
select min(customer_age)
from (select c.*,
row_number() over (order by customer_age) as seqnum,
count(*) over () as cnt
from customers c join
where exists (select 1
from customer_products cp
where cp.customer_id = c.customer_id and
cp.purchase_date >= '2016-01-01' and
cp.purchase_date < '2017-01-01'
)
)
where seqnum >= 0.75 * cnt;
Then, to use this for a query for 2017:
with a2016 as (
select min(customer_age) as customer_age
from (select c.*,
row_number() over (order by customer_age) as seqnum,
count(*) over () as cnt
from customers c
where exists (select 1
from customer_products cp
where cp.customer_id = c.customer_id and
cp.purchase_date >= '2016-01-01' and
cp.purchase_date < '2017-01-01'
)
) c
where seqnum >= 0.75 * cnt
)
select c.*, cp.product_id
from customers c join
customer_products cp
on cp.customer_id = c.customer_id and
cp.purchase_date >= '2017-01-01' and
cp.purchase_date < '2018-01-01' join
a2016 a
on c.customer_age >= a.customer_age;

Need to sub count in a count

In the following, we get a count of how many times a patient did not show for an medical appointment (NOSHOW). It is based on if they did not show for the current day, then we display their count of this from the past. How can I get also the department that they did not show for? We have 6 different medical department so the manager wants to see if the problem is only for say dental, or for all. This will assist them from perhaps not booking someone etc.
SELECT Distinct
Appt_DateTime j,
Patient_Name j,
Appt_Status j,
Appt_Sched_Department_ID j,
Appt_Sched_Department_Descr j,
Patient_id j,
Patient_number j,
Appt_NoShow_Date j,
ISNULL(P.NotShowCount,0) AS NotShowCount
FROM
vwGenPatInfo vwGenPatInfo j
INNER JOIN vwGenPatApptInfo vwGenPatApptInfo ON vwGenPatInfo.Patient_ID=vwGenPatApptInfo.Patient_ID
LEFT JOIN (
SELECT Patient_ID, COUNT(Appt_Status) AS NotShowCount
FROM (SELECT Appt_DateTime, Appt_Status, Appt_Sched_Department_ID, Appt_Sched_Department_Descr, Appt_NoShow_Date, Patient_ID
FROM vwGenPatapptInfo AS vwGenPatApptInfo
WHERE (Appt_Status = 'N') AND (Appt_DateTime < DATEADD(day, DATEDIFF(day, 0, GETDATE()), - 1))) AS L
GROUP BY
Patient_ID) AS P ON vwGenPatInfo.Patient_ID=P.Patient_ID
WHERE
vwGenPatApptInfo.Appt_Status='N'
ORDER BY
vwGenPatApptInfo.Appt_Sched_Department_ID,
vwGenPatApptInfo.Appt_DateTime
the data currently is like this: the last number is the count of previous noshows. So we want to break this down like Adult Medicine_NS 3, Dental_NS 9. the datetime showing is the noshow from previous day. The call room will call them to re-schedule.
Patient_Name Appt_Sched_Departmen Appt_NoShow_Date Previous No Show Count
8/31/2016 No Shows 8/30/2016
Patient_number
Sinca Blay Adult Medicine 8/30/2016 12:05:46PM 12
Wiske Semns Adult Medicine 8/30/2016 5:25:32PM 4
Rose Alhar Adult Medicine 8/30/2016 5:57:01PM 6
You could use analytic, aggregate and ranking window functions for this.
Here is an untested query to give you an idea:
SELECT Patient_ID,
Patient_Name,
Appt_Sched_Department_ID,
Appt_Sched_Department_Descr,
Appt_NoShow_Date_Today,
Appt_NoShow_Date_Prev,
NotShowCount
FROM (SELECT app.Patient_ID,
pat.Patient_Name,
app.Appt_Sched_Department_ID,
app.Appt_Sched_Department_Descr,
app.Appt_DateTime AS Appt_NoShow_Date_Today,
LEAD(app.Appt_DateTime)
OVER (PARTITION BY app.Patient_ID
ORDER BY app.Appt_DateTime DESC) AS Appt_NoShow_Date_Prev,
COUNT(app.Appt_Status)
OVER (PARTITION BY app.Patient_ID) AS NotShowCount,
ROW_NUMBER()
OVER (PARTITION BY app.Patient_ID
ORDER BY app.Appt_DateTime DESC) AS rn
FROM vwGenPatApptInfo app
INNER JOIN vwGenPatInfo pat
ON pat.Patient_ID = app.Patient_ID
WHERE app.Appt_Status = 'N'
) AS base
WHERE rn = 1
AND Appt_NoShow_Date_Today >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)
ORDER BY Appt_Sched_Department_ID,
Appt_NoShow_Date_Today
The inner query gets all records with status 'N', but adds information:
Appt_NoShow_Date_Prev: the Appt_DateTime value of the next record if the records are sorted by descending Appt_DateTime (for the same patient: this is the "window").
NotShowCount: The number of records in that window.
rn: the sequential record number in that window, again when records are sorted by descending Appt_DateTime.
The outer query only keeps the records with rn = 1, which means: the record with the most recent Appt_DateTime per patient. As a second condition, this Appt_DateTime must be today (after last midnight): this makes sure we only list patients which were absent today.

sql query to find customers who order too frequently?

My database isn't actually customers and orders, it's customers and prescriptions for their eye tests (just in case anyone was wondering why I'd want my customers to make orders less frequently!)
I have a database for a chain of opticians, the prescriptions table has the branch ID number, the patient ID number, and the date they had their eyes tested. Over time, patients will have more than one eye test listed in the database. How can I get a list of patients who have had a prescription entered on the system more than once in six months. In other words, where the date of one prescription is, for example, within three months of the date of the previous prescription for the same patient.
Sample data:
Branch Patient DateOfTest
1 1 2007-08-12
1 1 2008-08-30
1 1 2008-08-31
1 2 2006-04-15
1 2 2007-04-12
I don't need to know the actual dates in the result set, and it doesn't have to be exactly three months, just a list of patients who have a prescription too close to the previous prescription. In the sample data given, I want the query to return:
Branch Patient
1 1
This sort of query isn't going to be run very regularly, so I'm not overly bothered about efficiency. On our live database I have a quarter of a million records in the prescriptions table.
Something like this
select p1.branch, p1.patient
from prescription p1, prescription p2
where p1.patient=p2.patient
and p1.dateoftest > p2.dateoftest
and datediff('day', p2.dateoftest, p1.dateoftest) < 90;
should do... you might want to add
and p1.dateoftest > getdate()
to limit to future test prescriptions.
This one will efficiently use an index on (Branch, Patient, DateOfTest) which you of course should have:
SELECT Patient, DateOfTest, pDate
FROM (
SELECT (
SELECT TOP 1 DateOfTest AS last
FROM Patients pp
WHERE pp.Branch = p.Branch
AND pp.Patient = p.Patient
AND pp.DateOfTest BETWEEN DATEADD(month, -3, p.DateOfTest) AND p.DateOfTest
ORDER BY
DateOfTest DESC
) pDate
FROM Patients p
) po
WHERE pDate IS NOT NULL
On way:
select d.branch, d.patient
from data d
where exists
( select null from data d1
where d1.branch = d.branch
and d1.patient = d.patient
and "difference (d1.dateoftest ,d.dateoftest) < 6 months"
);
This part needs changing - I'm not familiar with SQL Server's date operations:
"difference (d1.dateoftest ,d.dateoftest) < 6 months"
Self-join:
select a.branch, a.patient
from prescriptions a
join prescriptions b
on a.branch = b.branch
and a.patient = b.patient
and a.dateoftest > b.dateoftest
and a.dateoftest - b.dateoftest < 180
group by a.branch, a.patient
This assumes you want patients who visit the same branch twice. If you don't, take out the branch part.
SELECT Branch
,Patient
FROM (SELECT Branch
,Patient
,DateOfTest
,DateOfOtherTest
FROM Prescriptions P1
JOIN Prescriptions P2
ON P2.Branch = P1.Branch
AND P2.Patient = P2.Patient
AND P2.DateOfTest <> P1.DateOfTest
) AS SubQuery
WHERE DATEDIFF(day, SubQuery.DateOfTest, SubQuery.DateOfOtherTest) < 90