How to select duplicates by first order of appearance - sql

I want to select unique values from a SQL database, but I want to make sure that I keep only the first duplicate in order of appearance (in my case, the time of the hospital stay, the intime column).
You can see the code below.
I am trying to take only the IDs from the first time each patient was hospitalized, as determined by the intime column.
I have no way to verify that, by ordering as I did and then using GROUP BY, SQL will in fact return the IDs in that order.
Thank you very much.
WITH ccupatients AS
(SELECT HADM_ID
FROM `physionet-data.mimiciii_clinical.icustays` i
WHERE first_careunit = 'CCU'
ORDER BY intime)
SELECT hadm_id
FROM ccupatients
GROUP BY hadm_id

Use ROW_NUMBER() if your RDBMS supports it. This works by ranking records by increasing intime within groups of records having the same hadm_id, and then filtering in the outer query on the top record per group:
SELECT hadm_id
FROM (
SELECT hadm_id, ROW_NUMBER() OVER(PARTITION BY hadm_id ORDER BY intime) rn
FROM `physionet-data.mimiciii_clinical`.icustays
WHERE first_careunit = 'CCU'
) x
WHERE rn = 1
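To sanity-check the ROW_NUMBER() approach without BigQuery access, here is a minimal sketch using Python's built-in sqlite3 module (assuming the bundled SQLite is 3.25 or later, which is needed for window functions). The rows are invented for illustration; only the column names hadm_id, intime and first_careunit come from the question. It also selects intime so you can see which stay was kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE icustays (hadm_id INTEGER, first_careunit TEXT, intime TEXT)"
)
# Made-up sample rows, deliberately inserted out of chronological order.
conn.executemany(
    "INSERT INTO icustays VALUES (?, ?, ?)",
    [
        (100, "CCU", "2101-03-05 10:00:00"),
        (100, "CCU", "2101-03-01 08:00:00"),
        (200, "CCU", "2102-07-10 12:00:00"),
        (200, "MICU", "2102-07-01 09:00:00"),  # filtered out: not CCU
        (300, "MICU", "2103-01-01 00:00:00"),  # filtered out: not CCU
    ],
)

# Rank CCU stays per hadm_id by intime and keep the earliest one.
first_ccu_stays = conn.execute(
    """
    SELECT hadm_id, intime
    FROM (
        SELECT hadm_id, intime,
               ROW_NUMBER() OVER (PARTITION BY hadm_id ORDER BY intime) AS rn
        FROM icustays
        WHERE first_careunit = 'CCU'
    ) x
    WHERE rn = 1
    ORDER BY hadm_id
    """
).fetchall()

print(first_ccu_stays)
```

The result does not depend on insertion order, which is the guarantee the ORDER BY inside the CTE in the question cannot give.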
If your RDBMS does not support window functions such as ROW_NUMBER(), another option is to use a NOT EXISTS condition with a correlated subquery:
SELECT hadm_id
FROM `physionet-data.mimiciii_clinical`.icustays i
WHERE
first_careunit = 'CCU'
AND NOT EXISTS (
SELECT 1
FROM `physionet-data.mimiciii_clinical`.icustays i1
WHERE
i1.first_careunit = 'CCU'
AND i1.hadm_id = i.hadm_id
AND i1.intime < i.intime
)
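One difference from ROW_NUMBER() is worth checking: with NOT EXISTS, two stays that share the same earliest intime both survive the filter. The sketch below shows this on invented data, again using Python's sqlite3 module as a stand-in for the real database (an assumption for illustration, not the asker's setup).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE icustays (hadm_id INTEGER, first_careunit TEXT, intime TEXT)"
)
# Made-up rows: hadm_id 200 has two CCU stays with an identical intime.
conn.executemany(
    "INSERT INTO icustays VALUES (?, ?, ?)",
    [
        (100, "CCU", "2101-03-05 10:00:00"),
        (100, "CCU", "2101-03-01 08:00:00"),
        (200, "CCU", "2102-07-10 12:00:00"),
        (200, "CCU", "2102-07-10 12:00:00"),
    ],
)

# Keep a row only if no earlier CCU stay exists for the same hadm_id.
earliest_rows = conn.execute(
    """
    SELECT hadm_id
    FROM icustays i
    WHERE first_careunit = 'CCU'
      AND NOT EXISTS (
          SELECT 1
          FROM icustays i1
          WHERE i1.first_careunit = 'CCU'
            AND i1.hadm_id = i.hadm_id
            AND i1.intime < i.intime
      )
    ORDER BY hadm_id
    """
).fetchall()

print(earliest_rows)
```

Here hadm_id 200 comes back twice because of the tie. If only the ID is needed, SELECT DISTINCT hadm_id removes the repeat.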

Related

BigQuery - Extract last entry of each group

I have one table where multiple records are inserted for each product group. Now I want to extract (SELECT) only the last entries. For more, see the screenshot. The yellow highlighted records should be returned by the select query.
The HAVING MAX and HAVING MIN clause for the ANY_VALUE function is now in preview
HAVING MAX and HAVING MIN were just introduced for some aggregate functions - https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
With them, the query can be very simple. Consider the approach below:
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
if applied to sample data in your question - output is
You might consider below as well
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with aggregate functions than window functions, the query below might be another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
You can use a window function to partition by the key and pick the required row according to the ORDER BY field.
For example:
select * from (
select *,
rank() over (partition by product order by DateTime desc) as rank
from `project.dataset.table`)
where rank = 1
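In SQL Server you can select the last record of each group with TOP (1) WITH TIES. BigQuery has no TOP clause, so this form does not apply there:
SELECT TOP (1) WITH TIES * FROM Tablename ORDER BY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [DateTime] DESC)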

SQL Max or empty value grouped by conditions

I have a table like this
and I want my output to look like this
I need to look at the ID and then take the max created date and max completed date for that ID. There are also some cases where the completed date is still empty, so in that case I just need to look at the max created date. I'm not sure how to tackle this; doing a GROUP BY doesn't account for my multiple scenarios.
Use ROW_NUMBER:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY QUOTE_NUMBER
ORDER BY WORKBOOK_CREATED_DATE DESC) rn
FROM yourTable
)
SELECT *
FROM cte
WHERE rn = 1;

Find the second largest value with Groupings

In SQL Server, I am attempting to pull the second latest NOTE_ENTRY_DT_TIME (items highlighted in screenshot). With the query written below it still pulls the latest date (I believe it's because of the grouping but the grouping is required to join later). What is the best method to achieve this?
SELECT
hop.ACCOUNT_ID,
MAX(hop.NOTE_ENTRY_DT_TIME) AS latest_noteid
FROM
NOTES hop
WHERE
hop.GEN_YN IS NULL
AND hop.NOTE_ENTRY_DT_TIME < (SELECT MAX(hope.NOTE_ENTRY_DT_TIME)
FROM NOTES hope
WHERE hop.GEN_YN IS NULL)
GROUP BY
hop.ACCOUNT_ID
Data sample in the table:
One of the "easier" ways to get the Nth row in a group is to use a CTE and ROW_NUMBER:
WITH CTE AS(
SELECT Account_ID,
Note_Entry_Dt_Time,
ROW_NUMBER() OVER (PARTITION BY Account_ID ORDER BY Note_Entry_Dt_Time DESC) AS RN
FROM dbo.YourTable)
SELECT Account_ID,
Note_Entry_Dt_Time
FROM CTE
WHERE RN = 2;
Of course, if an ACCOUNT_ID only has 1 row, then it will not be returned in the result set.
The OP's statement "The row will not always be 2." from the comments conflicts with their statement "I am attempting to pull the second latest NOTE_ENTRY_DT_TIME" in the question. At a best guess, this means that the OP has several rows with the same date, any of which could be the "latest" date. If so, they would simply need to replace ROW_NUMBER with DENSE_RANK. Their sample data, however, doesn't suggest this is the case.
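To make the ROW_NUMBER versus DENSE_RANK distinction concrete, here is a small sketch with invented data in which the latest date appears twice for one account. It uses Python's sqlite3 module (assuming SQLite 3.25 or later) rather than SQL Server, and the table and helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (account_id INTEGER, note_entry_dt_time TEXT)")
# Made-up rows: account 1 has a tie on its latest date, account 2 has one row.
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [
        (1, "2020-01-03"),
        (1, "2020-01-03"),
        (1, "2020-01-02"),
        (1, "2020-01-01"),
        (2, "2020-05-01"),
    ],
)


def second_latest(ranking_function):
    """Return rows ranked 2nd per account using the given ranking function."""
    # Only these two fixed names are ever interpolated into the SQL text.
    if ranking_function not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError(ranking_function)
    sql = f"""
        WITH cte AS (
            SELECT account_id, note_entry_dt_time,
                   {ranking_function}() OVER (
                       PARTITION BY account_id
                       ORDER BY note_entry_dt_time DESC
                   ) AS rn
            FROM notes
        )
        SELECT DISTINCT account_id, note_entry_dt_time
        FROM cte
        WHERE rn = 2
        ORDER BY account_id
    """
    return conn.execute(sql).fetchall()


by_row_number = second_latest("ROW_NUMBER")
by_dense_rank = second_latest("DENSE_RANK")
print(by_row_number)
print(by_dense_rank)
```

With ROW_NUMBER the "second" row is just the other copy of the tied latest date, while DENSE_RANK yields the second distinct date. Account 2 is absent from both results because it has only one row.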
You can use window functions:
select *
from (
select
n.*,
row_number() over(partition by account_id order by note_entry_dt_time desc) rn
from notes n
) t
where rn = 2

SQL query to grab most current record with multiple groupings

I am using SQL Server 2014 and Management Studio. Let me try to explain what I am doing.
I have a table which look similar to the following (very simplified)
I want to create a query which will grab the most current record for each parameter if the Well Global ID is the same. What I want would look like the following:
With me not being a great SQL jockey I would like a little help.
The closest thing I could find was the following which doesn't take into account the parameter field so it would just grab the most current record if the Global ID matches:
SELECT TOP 1000
[OBJECTID], SampleDate,
Collector, Parameter, Result, Unit,
WellGlobalID, GlobalID
FROM
WellSamples
WHERE
SampleDate IN (SELECT MAX(SampleDate)
FROM WellSamples
GROUP BY WellGlobalID);
Use the ROW_NUMBER function.
SELECT *
FROM (
SELECT w.*,
ROW_NUMBER() OVER(PARTITION BY parameter,wellglobalid
ORDER BY sampledate DESC) as RN
FROM WellSamples w
) x
WHERE RN = 1
ROW_NUMBER would be my solution https://msdn.microsoft.com/en-us/library/ms186734.aspx
SELECT
[OBJECTID]
,SampleDate
,Collector
,Parameter
,Result
,Unit
,WellGlobalID
,GlobalID
FROM (
SELECT
[OBJECTID]
,SampleDate
,Collector
,Parameter
,Result
,Unit
,WellGlobalID
,GlobalID
,ROW_NUMBER() OVER (PARTITION BY Parameter, WellGlobalID ORDER BY SampleDate DESC) AS [ROW_NUM]
FROM WellSamples
) tbl
WHERE ROW_NUM = 1
You need a subquery since window functions (ROW_NUMBER) can't be used in a WHERE clause.
You could also do this with a subquery. First find the most recent date for each parameter and well, then join back to get the rest of the data, like this:
SELECT w.parameter, w.sampledate, w.result, w.wellglobalid
FROM wellsamples w
INNER JOIN
(SELECT MAX(sampledate) AS mxdate, parameter, wellglobalid
FROM wellsamples
GROUP BY parameter, wellglobalid) sub
ON w.parameter = sub.parameter
AND w.wellglobalid = sub.wellglobalid
AND w.sampledate = sub.mxdate
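The grouping columns in the subquery matter here. The sketch below, on invented data and using Python's sqlite3 module as a stand-in for SQL Server, compares grouping by parameter alone with grouping by both parameter and wellglobalid, which is what the question asks for (latest sample per parameter within each well). The helper name and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wellsamples "
    "(wellglobalid TEXT, parameter TEXT, sampledate TEXT, result REAL)"
)
# Made-up samples for two wells.
conn.executemany(
    "INSERT INTO wellsamples VALUES (?, ?, ?, ?)",
    [
        ("A", "pH", "2020-01-01", 7.0),
        ("A", "pH", "2020-02-01", 7.2),
        ("A", "Nitrate", "2020-01-15", 1.1),
        ("B", "pH", "2020-03-01", 6.8),
        ("B", "pH", "2020-01-10", 6.5),
    ],
)


def latest_samples(per_well):
    """Join each row to the MAX(sampledate) of its group."""
    # The extra fragments are fixed strings chosen by a boolean, not user input.
    extra_cols = ", wellglobalid" if per_well else ""
    extra_join = "AND w.wellglobalid = sub.wellglobalid" if per_well else ""
    sql = f"""
        SELECT w.wellglobalid, w.parameter, w.sampledate, w.result
        FROM wellsamples w
        INNER JOIN (
            SELECT MAX(sampledate) AS mxdate, parameter{extra_cols}
            FROM wellsamples
            GROUP BY parameter{extra_cols}
        ) sub
          ON w.parameter = sub.parameter
         {extra_join}
         AND w.sampledate = sub.mxdate
        ORDER BY w.wellglobalid, w.parameter
    """
    return conn.execute(sql).fetchall()


by_parameter_only = latest_samples(per_well=False)
by_parameter_and_well = latest_samples(per_well=True)
print(by_parameter_only)
print(by_parameter_and_well)
```

Grouping by parameter alone drops well A's pH reading, because the overall latest pH sample belongs to well B. Note that this join returns more than one row per group when several samples share the latest date.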

Over clause in SQL Server

I have the following query
select * from
(
SELECT distinct
rx.patid
,rx.fillDate
,rx.scriptEndDate
,MAX(datediff(day, rx.filldate, rx.scriptenddate)) AS longestScript
,rx.drugClass
,COUNT(rx.drugName) over(partition by rx.patid,rx.fillDate,rx.drugclass) as distinctFamilies
FROM [I 3 SCI control].dbo.rx
where rx.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
GROUP BY rx.patid, rx.fillDate, rx.scriptEndDate,rx.drugName,rx.drugClass
) r
order by distinctFamilies desc
which produces results that look like
This should mean that, for that patID, there are 5 unique drug names between the two dates in the table. However, when I run the following query:
select distinct *
from rx
where patid = 1358801781 and fillDate between '2008-10-17' and '2008-11-16' and drugClass='H4B'
I have a result set returned that looks like
You can see that while there are in fact five rows returned for the second query between the dates of 2008-10-17 and 2009-01-15, there are only three unique names. I've tried various ways of modifying the over clause, all with different levels of non-success. How can I alter my query so that I only find unique drugNames within the timeframe specified for each row?
Taking a shot at it:
SELECT DISTINCT
patid,
fillDate,
scriptEndDate,
MAX(DATEDIFF(day, fillDate, scriptEndDate)) AS longestScript,
drugClass,
MAX(rn) OVER(PARTITION BY patid, fillDate, drugClass) as distinctFamilies
FROM (
SELECT patid, fillDate, scriptEndDate, drugClass,rx.drugName,
DENSE_RANK() OVER(PARTITION BY patid, fillDate, drugClass ORDER BY drugName) as rn
FROM [I 3 SCI control].dbo.rx
WHERE drugClass IN ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
)x
GROUP BY x.patid, x.fillDate, x.scriptEndDate,x.drugName,x.drugClass,x.rn
ORDER BY distinctFamilies DESC
Not sure if DISTINCT is really necessary - left it in since you've used it.
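The idea behind the answer is that COUNT(drugName) OVER (...) counts rows, not distinct names, while the highest DENSE_RANK within a partition equals the number of distinct values in it. Here is a reduced sketch of just that trick, without the GROUP BY and DATEDIFF parts, run through Python's sqlite3 module (assuming SQLite 3.25 or later). The drug names and rows are invented placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rx (patid INTEGER, fillDate TEXT, drugClass TEXT, drugName TEXT)"
)
# Made-up rows: five fills for patient 1, but only three distinct drug names.
conn.executemany(
    "INSERT INTO rx VALUES (?, ?, ?, ?)",
    [
        (1, "2008-10-17", "H4B", "drug_a"),
        (1, "2008-10-17", "H4B", "drug_a"),
        (1, "2008-10-17", "H4B", "drug_b"),
        (1, "2008-10-17", "H4B", "drug_b"),
        (1, "2008-10-17", "H4B", "drug_c"),
        (2, "2008-11-01", "H2F", "drug_x"),
    ],
)

# row_count counts every row; distinct_names takes the top DENSE_RANK instead.
rows = conn.execute(
    """
    SELECT DISTINCT
           patid, fillDate, drugClass,
           COUNT(drugName) OVER (PARTITION BY patid, fillDate, drugClass) AS row_count,
           MAX(rn) OVER (PARTITION BY patid, fillDate, drugClass) AS distinct_names
    FROM (
        SELECT patid, fillDate, drugClass, drugName,
               DENSE_RANK() OVER (
                   PARTITION BY patid, fillDate, drugClass
                   ORDER BY drugName
               ) AS rn
        FROM rx
    ) x
    ORDER BY patid
    """
).fetchall()

print(rows)
```

For patient 1 the plain windowed COUNT reports 5, while the DENSE_RANK version reports 3, matching the mismatch described in the question.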