SQL Server - get records based on the earliest date being within a year

Bit of a complicated one for me.
I have a database full of hundreds of thousands of records, many of which are duplicated.
I need to get all records from the last year, but only where every instance of that record is within the last year, e.g. if a record is duplicated and one copy is older than a year, it shouldn't be included.
So far I have the below...
Step 1 - find out earliest date for each record
SELECT MIN(CreateDate) AS Date, Email FROM Results R
WHERE (R.Email IS NOT NULL AND R.Email <> '')
GROUP BY R.Email
I created this as a view and called it EarliestInteraction
Step 2 - grab all within the last year
Note - the records need to be within the last year and also need to be present in at least one of the log tables.
So far I have done this...
SELECT * FROM EarliestInteraction ECI
WHERE ( CAST(ECI.Date AS DATE) >= CAST(GETDATE() - 365 AS DATE) )
AND (
EXISTS (
SELECT Id FROM LOG1 R
WHERE Source = 'LOGGED'
AND R.Email = ECI.Email
)
OR
EXISTS (
SELECT Id FROM LOG2 R WHERE (R.Email IS NOT NULL AND R.Email <> '')
AND R.Email = ECI.Email
AND R.EventType IN (
'LOGGED'
))
)
My question is, is this a good way of doing this and accurate?
Or am I missing something that would bring back earlier duplicates...
Any thoughts on if this is accurate or achieves the brief would be great.

You want records where there is no record for the same email address from more than a year ago:
select r.*
from results r
where not exists (select 1
from results r2
where r2.email = r.email and
r2.CreateDate < dateadd(year, -1, getdate())
);
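To check the NOT EXISTS logic on a small data set, here is a rough sketch in Python using an in-memory SQLite database as a stand-in for SQL Server. The helper name `emails_first_seen_within` and the sample rows are made up for illustration, and the cutoff is passed in as an ISO date string rather than computed with dateadd/getdate.

```python
import sqlite3

def emails_first_seen_within(rows, cutoff):
    """Return emails for which no row is older than cutoff (ISO date strings)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Results (Email TEXT, CreateDate TEXT)")
    conn.executemany("INSERT INTO Results VALUES (?, ?)", rows)
    found = conn.execute(
        """
        SELECT DISTINCT r.Email
        FROM Results r
        WHERE r.Email IS NOT NULL AND r.Email <> ''
          AND NOT EXISTS (SELECT 1
                          FROM Results r2
                          WHERE r2.Email = r.Email
                            AND r2.CreateDate < ?)
        ORDER BY r.Email
        """,
        (cutoff,),
    ).fetchall()
    conn.close()
    return [email for (email,) in found]

# Hypothetical sample data: a@ has an old duplicate, b@ does not.
rows = [
    ("a@example.com", "2024-03-01"),
    ("a@example.com", "2022-01-15"),
    ("b@example.com", "2024-02-10"),
    ("b@example.com", "2024-05-05"),
    ("", "2024-04-01"),
]
print(emails_first_seen_within(rows, "2023-06-01"))
```

Only b@example.com should come back here, because a@example.com has one row before the cutoff. The log-table EXISTS checks from the question would be added as further conditions in the same WHERE clause.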

Related

How can I count duplicates that fall within a date range? (SQL)

I have a table that contains Applicant ID, Application Date and Job Description.
I am trying to identify duplicates, defined as when the same Applicant ID applies for the same Job Description within 3 days of their other application.
I have already done this for the same date, this way:
CREATE TABLE Duplicates
SELECT
COUNT(ApplicantID) AS ApplicantCount,
ApplicantID,
ApplicationDate,
JobDescription
FROM Applications
GROUP BY ApplicantID, ApplicationDate, JobDescription;
DELETE FROM Duplicates WHERE ApplicantCount < 2;
SELECT COUNT(*) FROM Duplicates;
I'm now trying to make it so it doesn't have to match exactly on the ApplicationDate, but falls within a range. How do you do this?
You can use lead()/lag(). Here is an example that returns the first application when there is a duplicate:
SELECT a.*
FROM (SELECT a.*,
LEAD(ApplicationDate) OVER (PARTITION BY ApplicantID, JobDescription ORDER BY ApplicationDate) as next_ad
FROM Applications a
) a
WHERE next_ad <= ApplicationDate + INTERVAL 3 DAY;
You can also phrase this using exists:
select a.*
from applications a
where exists (select 1
from applications a2
where a2.ApplicantID = a.ApplicantID and
a2.JobDescription = a.JobDescription and
a2.ApplicationDate > a.ApplicationDate and
a2.ApplicationDate <= a.ApplicationDate + interval 3 day
);
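A quick way to try the EXISTS version without a MySQL server is the sketch below, which runs the same idea against an in-memory SQLite database from Python. SQLite has no INTERVAL syntax, so the day difference is computed with julianday() instead; the function name `first_of_near_duplicates` and the sample rows are assumptions for illustration only.

```python
import sqlite3

def first_of_near_duplicates(rows, days=3):
    """Return applications followed by another one from the same applicant
    for the same job within `days` days."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Applications "
        "(ApplicantID INTEGER, ApplicationDate TEXT, JobDescription TEXT)"
    )
    conn.executemany("INSERT INTO Applications VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT a.ApplicantID, a.ApplicationDate, a.JobDescription
        FROM Applications a
        WHERE EXISTS (SELECT 1
                      FROM Applications a2
                      WHERE a2.ApplicantID = a.ApplicantID
                        AND a2.JobDescription = a.JobDescription
                        AND a2.ApplicationDate > a.ApplicationDate
                        AND julianday(a2.ApplicationDate)
                            - julianday(a.ApplicationDate) <= ?)
        ORDER BY a.ApplicantID, a.ApplicationDate
        """,
        (days,),
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample data.
rows = [
    (1, "2024-01-01", "Clerk"),
    (1, "2024-01-03", "Clerk"),   # 2 days after the first: a duplicate
    (1, "2024-01-10", "Clerk"),   # 7 days later: not a duplicate
    (2, "2024-01-01", "Clerk"),
    (2, "2024-01-02", "Driver"),  # different job: not a duplicate
]
print(first_of_near_duplicates(rows))
```

Wrapping the SELECT in a COUNT(*) gives the number of duplicated applications rather than the rows themselves.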

Using a stored procedure in Teradata to build a summarial history table

I am using Teradata SQL Assistant connected to an enterprise DW. I have written the query below to show an inventory of outstanding items as of a specific point in time. The table referenced loads and stores new records by load date as changes are made to their state (and does not delete historical records). The output of my query is 1 row for the specified date. Can I create a stored procedure or recursive query of some sort to build a history of these summary rows (with 1 new row per day)? I have not used such functions in the past; links to pertinent previously answered questions, or suggestions on how I could get on the right track in researching other possible solutions, are totally fine if applicable. I'm just trying to bridge this gap in my knowledge.
SELECT
'2017-10-02' as Dt
,COUNT(DISTINCT A.RECORD_NBR) as Pending_Records
,SUM(A.PAY_AMT) AS Total_Pending_Payments
FROM DB.RECORD_HISTORY A
INNER JOIN
(SELECT MAX(LOAD_DT) AS LOAD_DT
,RECORD_NBR
FROM DB.RECORD_HISTORY
WHERE LOAD_DT <= '2017-10-02'
GROUP BY RECORD_NBR
) B
ON A.RECORD_NBR = B.RECORD_NBR
AND A.LOAD_DT = B.LOAD_DT
WHERE
A.RECORD_ORDER =1 AND Final_DT Is Null
GROUP BY Dt
ORDER BY 1 desc
Here is my interpretation of your query:
For the most recent load_dt (up until 2017-10-02) for record_order #1,
return
1) the number of different pending records
2) the total amount of pending payments
Is this correct? If you're looking for this info, but one row for each "Load_Dt", you just need to remove that INNER JOIN:
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
ORDER BY 1 DESC
If you want to get the summary info per record_order, just add record_order as a grouping column:
SELECT
load_Dt,
record_order,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE final_Dt IS NULL
GROUP BY load_Dt, record_order
ORDER BY 1,2 DESC
If you want to get one row per day (if there are calendar days with no corresponding "load_dt" days), then you can SELECT from the sys_calendar.calendar view and LEFT JOIN the query above on the "load_dt" field:
SELECT cal.calendar_date, src.Pending_Records, src.Total_Pending_Payments
FROM sys_calendar.calendar cal
LEFT JOIN (
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
) src ON cal.calendar_date = src.load_Dt
WHERE cal.calendar_date BETWEEN <start_date> AND <end_date>
ORDER BY 1 DESC
I don't have access to a TD system, so you may get syntax errors. Let me know if that works or you're looking for something else.
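If the goal is really the original "as of" snapshot repeated for every calendar day (the latest row per record up to that day, then the pending filter), one possible shape is to join the calendar to the history table and pick the latest load_dt per record with a correlated subquery. The sketch below tries that idea in Python against in-memory SQLite, not Teradata; the `calendar` table stands in for sys_calendar.calendar, and the function name and sample rows are invented for illustration.

```python
import sqlite3

def daily_pending_history(history_rows, calendar_days):
    """For each calendar day, summarize the latest row per record as of that day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE record_history (record_nbr INTEGER, load_dt TEXT,"
        " record_order INTEGER, pay_amt INTEGER, final_dt TEXT)"
    )
    conn.execute("CREATE TABLE calendar (dt TEXT)")
    conn.executemany(
        "INSERT INTO record_history VALUES (?, ?, ?, ?, ?)", history_rows
    )
    conn.executemany(
        "INSERT INTO calendar VALUES (?)", [(d,) for d in calendar_days]
    )
    result = conn.execute(
        """
        SELECT c.dt,
               COUNT(DISTINCT h.record_nbr) AS pending_records,
               COALESCE(SUM(h.pay_amt), 0)  AS total_pending_payments
        FROM calendar c
        LEFT JOIN record_history h
               ON h.load_dt = (SELECT MAX(h2.load_dt)
                               FROM record_history h2
                               WHERE h2.record_nbr = h.record_nbr
                                 AND h2.load_dt <= c.dt)
              AND h.record_order = 1
              AND h.final_dt IS NULL
        GROUP BY c.dt
        ORDER BY c.dt
        """
    ).fetchall()
    conn.close()
    return result

# Hypothetical data: record 1 is pending from 10-01 and finalized on 10-03;
# record 2 is pending from 10-02 onward.
history_rows = [
    (1, "2017-10-01", 1, 100, None),
    (1, "2017-10-03", 1, 100, "2017-10-03"),
    (2, "2017-10-02", 1, 50, None),
]
days = ["2017-09-30", "2017-10-01", "2017-10-02", "2017-10-03"]
for row in daily_pending_history(history_rows, days):
    print(row)
```

Unlike grouping by load_dt directly, this counts a record on every day it is still pending, not just on the day its row was loaded. On a large history table the correlated subquery may be slow, so treat it as a starting point rather than a tuned solution.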

Query for active records between date range or most recent before date range

I need to find active records that fall between a range of date parameters from a table containing applications. First, I look for a record within the date range in a table called 'app_note' and check if it is linked to an application. If there is no app_note record in the date range, I must look at the most recent app note from before the date range. If this app note indicates a status of active, I select it.
The app_indiv table connects an individual to an application. There can be multiple app_indiv records for each individual and multiple app_notes for each app_indiv. Here is what I have so far:
SELECT DISTINCT individual.indiv_id
FROM individual INNER JOIN
app_indiv ON app_indiv.indiv_id = individual.indiv_id INNER JOIN
app_note ON app_indiv.app_indiv_id = app_note.app_indiv_id
WHERE (app_note.when_mod BETWEEN @date_from AND @date_to)
/* OR most recent app_note indicates active */
How can I get the most recent app_note record if there is not one in the date range? Since there are multiple app_note records possible, I don't know how to make it only retrieve the most recent.
SELECT *
FROM individual i
INNER JOIN app_indiv ai
ON ai.indiv_id = i.indiv_id
OUTER APPLY
(
SELECT TOP 1 * FROM app_note an
WHERE an.app_indiv_id = ai.app_indiv_id
AND an.when_mod < @date_to
ORDER BY an.when_mod DESC
) d
WHERE d.status = 'active'
Find the last note less than end date, check to see if it's active and if so show the individual record.
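The "latest note before the end date" idea can be tried out with the sketch below, which uses Python and in-memory SQLite. SQLite has no OUTER APPLY or TOP, so a correlated MAX(when_mod) subquery plays the same role here. The `status` column is taken from the answer above rather than from the question's schema, and the function name and sample rows are hypothetical.

```python
import sqlite3

def active_individuals(app_indivs, notes, date_to):
    """Return indiv_ids whose most recent note on or before date_to is 'active'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE app_indiv (app_indiv_id INTEGER, indiv_id INTEGER)")
    conn.execute(
        "CREATE TABLE app_note (app_indiv_id INTEGER, when_mod TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO app_indiv VALUES (?, ?)", app_indivs)
    conn.executemany("INSERT INTO app_note VALUES (?, ?, ?)", notes)
    result = conn.execute(
        """
        SELECT DISTINCT ai.indiv_id
        FROM app_indiv ai
        JOIN app_note an ON an.app_indiv_id = ai.app_indiv_id
        WHERE an.when_mod = (SELECT MAX(n2.when_mod)
                             FROM app_note n2
                             WHERE n2.app_indiv_id = ai.app_indiv_id
                               AND n2.when_mod <= ?)
          AND an.status = 'active'
        ORDER BY ai.indiv_id
        """,
        (date_to,),
    ).fetchall()
    conn.close()
    return [indiv_id for (indiv_id,) in result]

# Hypothetical sample data.
app_indivs = [(10, 1), (20, 2), (30, 3)]
notes = [
    (10, "2024-01-05", "active"),
    (10, "2024-02-01", "closed"),  # latest note for individual 1 is not active
    (20, "2023-12-01", "active"),  # before the range, still the latest note
    (30, "2024-03-01", "active"),  # after date_to, so it is ignored
]
print(active_individuals(app_indivs, notes, "2024-02-15"))
```

Note that this only looks at the end of the range, as the answer does; if any note inside the range should qualify regardless of status, an extra OR condition on when_mod would be needed.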
(untested) You'll need to use a CASE switch.
SELECT DISTINCT individual.indiv_id
FROM individual INNER JOIN
app_indiv ON app_indiv.indiv_id = individual.indiv_id INNER JOIN
app_note ON app_indiv.app_indiv_id = app_note.app_indiv_id
WHERE (CASE WHEN app_note.when_mod BETWEEN @date_from AND @date_to
THEN (SELECT appnote.when_mod from individual where appnote.when_mod BETWEEN @date_from AND @date_to)
WHEN app_note.when_mod NOT BETWEEN @date_from and @date_to
THEN (SELECT appnote.when_mod from individual appnote.when_mod LIMIT 1))
Query might not be correct. Switch might need to be in the first SELECT part of the query.
It seems to me that you really only care about the end date of your date range, since you want to be able to look farther back if there's nothing in that date range. I would use a CTE and the ROW_NUMBER() function. The CTE is just a cleaner way to write a sub-query (although a CTE can do a lot more than that). The ROW_NUMBER function numbers the rows based on the ORDER BY clause. The PARTITION BY resets the numbering to one each time you hit a new value in that column.
with AppNoteCTE as
(select
<not sure what columns you need here>
app_indiv_id,
ROW_NUMBER() OVER (PARTITION BY APP_INDIV_ID ORDER BY WHEN_MOD DESC) RN
FROM
APP_NOTE
WHERE
WHEN_MOD <= @endDate)
SELECT DISTINCT individual.indiv_id
FROM individual INNER JOIN
app_indiv ON app_indiv.indiv_id = individual.indiv_id INNER JOIN
AppNoteCTE ON app_indiv.app_indiv_id = AppNoteCTE.app_indiv_id
and AppNoteCTE.RN = 1

Calculating current consecutive days from a table

I have what seems to be a common business request, but I can't find a clear solution. I have a daily report (amongst many) that gets generated based on failed criteria and gets saved to a table. Each report has a type id tied to it to signify which report it is, and there is an import event id that signifies the day the imports came in (a date column is added for extra clarification). I've added a sqlfiddle to show the basic schema of the table (renamed for privacy reasons).
http://www.sqlfiddle.com/#!3/81945/8
All reports currently generated are working fine, so nothing needs to be modified on the table. However, for one report (type 11), not only do I need to pull the invoices that showed up today, I also need to add one column that totals the number of consecutive days, counting back from the run date, that the invoice has been on the report (including the current day). The result should look like the following, based on the schema provided:
INVOICE   MESSAGE   EVENT_DATE      CONSECUTIVE_DAYS_ON_REPORT
12345     Yes       July, 30 2013   6
54355     Yes       July, 30 2013   2
644644    Yes       July, 30 2013   4
I only need the latest run of consecutive days, not any earlier run that may show up. I've tried self joins to no avail; my last attempt is also included in the sqlfiddle. Any suggestions or ideas? I'm quite stuck at the moment.
FYI: I am working in SQL Server 2000! I have seen a lot of neat tricks that have come out in 2005 and 2008, but I can't access them.
Your help is greatly appreciated!
Something like this? http://www.sqlfiddle.com/#!3/81945/14
SELECT
[final].*,
[last].total_rows
FROM
tblEventInfo AS [final]
INNER JOIN
(
SELECT
[first_of_last].type_id,
[first_of_last].invoice,
MAX([all_of_last].event_date) AS event_date,
COUNT(*) AS total_rows
FROM
(
SELECT
[current].type_id,
[current].invoice,
MAX([current].event_date) AS event_date
FROM
tblEventInfo AS [current]
LEFT JOIN
tblEventInfo AS [previous]
ON [previous].type_id = [current].type_id
AND [previous].invoice = [current].invoice
AND [previous].event_date = [current].event_date-1
WHERE
[current].type_id = 11
AND [previous].type_id IS NULL
GROUP BY
[current].type_id,
[current].invoice
)
AS [first_of_last]
INNER JOIN
tblEventInfo AS [all_of_last]
ON [all_of_last].type_id = [first_of_last].type_id
AND [all_of_last].invoice = [first_of_last].invoice
AND [all_of_last].event_date >= [first_of_last].event_date
GROUP BY
[first_of_last].type_id,
[first_of_last].invoice
)
AS [last]
ON [last].type_id = [final].type_id
AND [last].invoice = [final].invoice
AND [last].event_date = [final].event_date
The inner most query looks up the starting record of the last block of consecutive records.
Then that joins on to all the records in that block of consecutive records, giving the final date and the count of rows (consecutive days).
Then that joins on to the row for the last day to get the message, etc.
Make sure that in reality you have an index on (type_id, invoice, event_date).
You have multiple problems. Tackle them separately and build up.
Problems:
1) Identifying consecutive ranges: subtract the row_number from the range column and group by the result
2) No ROW_NUMBER() functions in SQL 2000: Fake it with a correlated subquery.
3) You actually want DENSE_RANK() instead of ROW_NUMBER: Make a list of unique dates first.
Solutions:
3)
SELECT MAX(id) AS id,invoice,event_date FROM tblEventInfo GROUP BY invoice,event_date
2)
SELECT t2.invoice,t2.event_date,t2.id,
DATEDIFF(day,(SELECT COUNT(DISTINCT event_date) FROM (SELECT MAX(id) AS id,invoice,event_date FROM tblEventInfo GROUP BY invoice,event_date) t1 WHERE t2.invoice = t1.invoice AND t2.event_date > t1.event_date),t2.event_date) grp
FROM (SELECT MAX(id) AS id,invoice,event_date FROM tblEventInfo GROUP BY invoice,event_date) t2
ORDER BY invoice,grp,event_date
1)
SELECT
t3.invoice AS INVOICE,
MAX(t3.event_date) AS EVENT_DATE,
COUNT(t3.event_date) AS CONSECUTIVE_DAYS_ON_REPORT
FROM (
SELECT t2.invoice,t2.event_date,t2.id,
DATEDIFF(day,(SELECT COUNT(DISTINCT event_date) FROM (SELECT MAX(id) AS id,invoice,event_date FROM tblEventInfo GROUP BY invoice,event_date) t1 WHERE t2.invoice = t1.invoice AND t2.id > t1.id),t2.event_date) grp
FROM (SELECT MAX(id) AS id,invoice,event_date FROM tblEventInfo GROUP BY invoice,event_date) t2
) t3
GROUP BY t3.invoice,t3.grp
The rest of your question is a little ambiguous. If two ranges are of equal length, do you want both or just the most recent? Should the output MESSAGE be 'Yes' if any message = 'Yes' or only if the most recent message = 'Yes'?
This should give you enough of a breadcrumb though
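To see the "date minus row number" grouping from problem 1 in action, here is a small sketch in Python with in-memory SQLite (not SQL Server 2000). The row number is faked with a correlated COUNT(*), as suggested in problem 2, and the distinct list of dates from problem 3 is built first. The function name, the type_id filter placement and the sample rows are assumptions made for illustration.

```python
import sqlite3

def current_streaks(rows, run_date, type_id=11):
    """Return (invoice, last_day, streak_length) for streaks ending on run_date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tblEventInfo (type_id INTEGER, invoice INTEGER, event_date TEXT)"
    )
    conn.executemany("INSERT INTO tblEventInfo VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH days AS (          -- one row per invoice per day
            SELECT DISTINCT invoice, event_date
            FROM tblEventInfo
            WHERE type_id = ?
        ),
        numbered AS (           -- day number minus rank is constant within a streak
            SELECT d.invoice, d.event_date,
                   CAST(julianday(d.event_date) AS INTEGER)
                   - (SELECT COUNT(*) FROM days p
                      WHERE p.invoice = d.invoice
                        AND p.event_date < d.event_date) AS grp
            FROM days d
        )
        SELECT invoice, MAX(event_date), COUNT(*)
        FROM numbered
        GROUP BY invoice, grp
        HAVING MAX(event_date) = ?
        ORDER BY invoice
        """,
        (type_id, run_date),
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample data.
rows = [(11, 12345, f"2013-07-{d:02d}") for d in (20, 21, 25, 26, 27, 28, 29, 30)]
rows += [
    (11, 54355, "2013-07-29"),
    (11, 54355, "2013-07-30"),
    (11, 54355, "2013-07-30"),  # duplicate row on the same day
    (5, 54355, "2013-07-28"),   # different report type, ignored
    (11, 999, "2013-07-27"),
    (11, 999, "2013-07-28"),    # streak ended before the run date
]
print(current_streaks(rows, "2013-07-30"))
```

The HAVING clause keeps only the streak that ends on the run date, which is one way to read "only the latest consecutive days" from the question.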
I had a similar requirement not long ago: getting a "Top 5" ranking along with the number of consecutive periods spent in the Top 5. The only solution I found was to do it in a cursor. The cursor has a date = @daybefore, and inside the cursor, if your data does not match, quit the loop; otherwise set @daybefore = dateadd(dd, -1, @daybefore).
Let me know if you want an example. There just seem to be a large number of enthusiasts, who hit downvote when they see the word "cursor" even if they don't have a better solution...
Here, try a scalar function like this:
CREATE FUNCTION ConsequtiveDays
(
    @invoice bigint, @date datetime
)
RETURNS int
AS
BEGIN
    -- SQL Server 2000 does not allow an initializer in DECLARE
    DECLARE @ct int, @Count_Date datetime, @Last_Date datetime
    SELECT @ct = 0, @Last_Date = @date
    DECLARE counter CURSOR LOCAL FAST_FORWARD
    FOR
        SELECT event_date FROM tblEventInfo
        WHERE invoice = @invoice
        ORDER BY event_date DESC
    OPEN counter
    FETCH NEXT FROM counter
    INTO @Count_Date
    -- walk backwards until the gap to the previously seen date exceeds one day
    WHILE @@FETCH_STATUS = 0 AND DATEDIFF(dd, @Count_Date, @Last_Date) < 2
    BEGIN
        SET @ct = @ct + 1
        SET @Last_Date = @Count_Date
        FETCH NEXT FROM counter
        INTO @Count_Date
    END
    CLOSE counter
    DEALLOCATE counter
    RETURN @ct
END
GO

SQL Server 2008: Using Multiple dts Ranges to Build a Set of Dates

I'm trying to build a query for a medical database that counts the number of patients that were on at least one medication from a class of medications (the medications listed below in the FAST_MEDS CTE) and had either:
1) A diagnosis of myopathy (the list of diagnoses in the FAST_DX CTE)
2) A CPK lab value above 1000 (the lab value in the FAST_LABS CTE)
and this diagnosis or lab happened AFTER a patient was on a statin.
The query I've included below does that under the assumption that once a patient is on a statin, they're on a statin forever. The first CTE collects the ids of patients that were on a statin along with the first date of their diagnosis, the second those with a diagnosis, and the third those with a high lab value. After this I count those that match the above criteria.
What I would like to do is drop the assumption that once a patient is on a statin, they're on it for life. The table edw_dm.patient_medications has columns called start_dts and end_dts. This table has one row for each prescription written, with start_dts and end_dts denoting the start and end date of the prescription. End_dts could be null, which I'll take to mean that the patient is currently on this medication (it could be a missing record, but I can't do anything about this). If a patient is on two different statins, the start and end dates can overlap, and there may be multiple records of the same medication for a patient, as in a record showing 3-11-2000 to 4-5-2003 and another for the same patient showing 5-6-2007 to 7-8-2009.
I would like to use these two columns to build a query where I'm only counting the patients that had a lab value or diagnosis done during a time when they were already on a statin, or in the first n (say 3) months after they stopped taking a statin. I'm really not sure how to go about rewriting the first CTE to get this information and how to do the comparison after the CTEs are built. I know this is a vague question, but I'm really stumped. Any ideas?
As always, thank you in advance.
Here's the current query:
WITH FAST_MEDS AS
(
select distinct
statins.mrd_pt_id, min(year(statins.order_dts)) as statin_yr
from
edw_dm.patient_medications as statins
inner join mrd.medications as mrd
on statins.mrd_med_id = mrd.mrd_med_id
WHERE mrd.generic_nm in (
'Lovastatin (9664708500)',
'lovastatin-niacin',
'Lovastatin/Niacin',
'Lovastatin',
'Simvastatin (9678583966)',
'ezetimibe-simvastatin',
'niacin-simvastatin',
'ezetimibe/Simvastatin',
'Niacin/Simvastatin',
'Simvastatin',
'Aspirin Buffered-Pravastatin',
'aspirin-pravastatin',
'Aspirin/Pravastatin',
'Pravastatin',
'amlodipine-atorvastatin',
'Amlodipine/atorvastatin',
'atorvastatin',
'fluvastatin',
'rosuvastatin'
)
and YEAR(statins.order_dts) IS NOT NULL
and statins.mrd_pt_id IS NOT NULL
group by statins.mrd_pt_id
)
select *
into #meds
from FAST_MEDS
;
--return patients who had a diagnosis in the list and the year that
--diagnosis was given
with
FAST_DX AS
(
SELECT pd.mrd_pt_id, YEAR(pd.init_noted_dts) as init_yr
FROM edw_dm.patient_diagnoses as pd
inner join mrd.diagnoses as mrd
on pd.mrd_dx_id = mrd.mrd_dx_id
and mrd.icd9_cd in
('728.89','729.1','710.4','728.3','729.0','728.81','781.0','791.3')
)
select *
into #dx
from FAST_DX;
--return patients who had a high cpk value along with the year the cpk
--value was taken
with
FAST_LABS AS
(
SELECT
pl.mrd_pt_id, YEAR(pl.order_dts) as lab_yr
FROM
edw_dm.patient_labs as pl
inner join mrd.labs as mrd
on pl.mrd_lab_id = mrd.mrd_lab_id
and mrd.lab_nm = 'CK (CPK)'
WHERE
pl.lab_val between 1000 AND 999998
)
select *
into #labs
from FAST_LABS;
-- count the number of patients who had a lab value or a medication
-- value taken sometime AFTER their initial statin diagnosis
select
count(distinct p.mrd_pt_id) as ct
from
mrd.patient_demographics as p
join #meds as m
on p.mrd_pt_id = m.mrd_pt_id
AND
(
EXISTS (
SELECT 'A' FROM #labs l WHERE p.mrd_pt_id = l.mrd_pt_id
and l.lab_yr >= m.statin_yr
)
OR
EXISTS(
SELECT 'A' FROM #dx d WHERE p.mrd_pt_id = d.mrd_pt_id
AND d.init_yr >= m.statin_yr
)
)
You probably don't need to select all of your CTE defined queries into temp tables.
I think that the query you're after has the form:
WITH FAST_MEDS(PatientID, StartDate, EndDate) AS
(
--your query for patients on statins, projecting the patient ID and the start/end date for the medication
),
FAST_DX(PatientID, Date) AS
(
--your query for patients with certain diagnosis, projecting the patient ID and the date
),
FAST_LABS(PatientID, Date) AS
(
--your query for patients with certain labs, projecting the patient ID and the date
)
SELECT PatientID
FROM FAST_MEDS
WHERE PatientID IN (SELECT PatientID FROM FAST_DX WHERE (Date BETWEEN StartDate AND EndDate) OR (EndDate IS NULL AND StartDate < Date))
OR PatientID IN (SELECT PatientID FROM FAST_LABS WHERE (Date BETWEEN StartDate AND EndDate) OR (EndDate IS NULL AND StartDate < Date))
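The question also asks for a grace period of n months after the prescription ends, which the outline above does not cover. One way to sketch that, using Python and in-memory SQLite rather than SQL Server, is to extend the upper bound of each prescription interval before comparing. Here the diagnoses and labs are collapsed into a single hypothetical `events` table, SQLite's date(x, '+3 months') stands in for DATEADD(month, 3, x), and all table names, the function name and the sample rows are made up for illustration.

```python
import sqlite3

def count_patients_with_event_on_statin(meds, events, grace_months=3):
    """Count patients with an event during a prescription, or within
    grace_months after it ended. A NULL end date means still on the drug."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE meds (patient_id INTEGER, start_dts TEXT, end_dts TEXT)")
    conn.execute("CREATE TABLE events (patient_id INTEGER, event_dts TEXT)")
    conn.executemany("INSERT INTO meds VALUES (?, ?, ?)", meds)
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    modifier = f"+{int(grace_months)} months"
    (count,) = conn.execute(
        """
        SELECT COUNT(DISTINCT m.patient_id)
        FROM meds m
        WHERE EXISTS (SELECT 1
                      FROM events e
                      WHERE e.patient_id = m.patient_id
                        AND e.event_dts >= m.start_dts
                        AND (m.end_dts IS NULL
                             OR e.event_dts <= date(m.end_dts, ?)))
        """,
        (modifier,),
    ).fetchone()
    conn.close()
    return count

# Hypothetical sample data (ISO dates).
meds = [
    (1, "2000-03-11", "2003-04-05"),
    (2, "2007-05-06", "2009-07-08"),
    (3, "2015-01-01", None),          # still on the medication
    (4, "2010-01-01", "2011-01-01"),
]
events = [
    (1, "2003-06-01"),  # within 3 months after the prescription ended
    (2, "2010-01-01"),  # too long after the end date
    (3, "2020-01-01"),  # during an open-ended prescription
    (4, "2009-12-31"),  # before the prescription started
]
print(count_patients_with_event_on_statin(meds, events))
```

Because each prescription row is checked independently, overlapping or repeated prescriptions for the same patient do not need to be merged first; a patient is counted once if any of their intervals contains an event.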