How to de-duplicate SQL table rows by multiple columns with hierarchy? - sql

I have a table with multiple records for each patient.
My end goal is a table that is 1-to-1 between Patient_id and Value.
I would like to de-duplicate (in respect to patient_id) my rows based on "a hierarchical series of aggregate functions" (if someone has a better way to phrase this, I'd appreciate that as well.)
+----+------------+------------+------------+----------+-----------------+-------+
| ID | patient_id | Date | Date2 | Priority | Source | Value |
+----+------------+------------+------------+----------+-----------------+-------+
| 1 | 1 | 2017-09-09 | 2018-09-09 | 1 | 'verified' | 55 |
| 2 | 1 | 2017-09-09 | 2018-11-11 | 2 | 'verified' | 78 |
| 3 | 1 | 2017-11-11 | 2018-09-09 | 3 | 'verified' | 23 |
| 4 | 1 | 2017-11-11 | 2018-11-11 | 1 | 'self_reported' | 11 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 2 | 'self_reported' | 90 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 3 | 'self_reported' | 34 |
| 6 | 2 | 2017-11-11 | 2018-09-09 | 2 | 'self_reported' | 21 |
+----+------------+------------+------------+----------+-----------------+-------+
For each patient_id, I would like to get the row(s) that has/have the MAX(Date). In the case that there are still duplicated patient_id, I would like to get the row(s) with the MIN(Priority). In the case that there are still duplicated rows I would like to get the row(s) with the MIN(Date2).
The way I've approached this problem is using a series of queries like this to de-duplicate on the columns one at a time.
SELECT *
FROM #table t1
LEFT JOIN
(SELECT
patient_id,
MIN(priority) AS min_priority
FROM #table
GROUP BY patient_id) t2 ON t2.patient_id = t1.patient_id
WHERE t2.min_priority = t1.priority
Is there a way to do this that allows me to de-dup on multiple columns at once? Is there a more elegant way to do this?
I'm able to get my results, but my solution feels very inefficient, and I keep running into this. Thank you for any input.

You could use row_number(), if your RDBMS supports it:
select ID, patient_id, Date, Date2, Priority, Source, Value
from (
    select
        t.*,
        row_number() over(partition by patient_id order by Date desc, Priority, Date2) rn
    from mytable t
) ranked
where rn = 1
Another option is to filter with a correlated subquery that sorts the record according to your criteria, like so:
select t.*
from mytable t
where id = (
select id
from mytable t1
where t1.patient_id = t.patient_id
order by t1.Date desc, t1.Priority, t1.Date2
limit 1
)
The actual syntax for limit varies across RDBMSs (SQL Server, for example, uses top 1 instead).
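As a quick check of the row_number() approach, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for the asker's RDBMS, the derived-table alias ranked is a made-up name, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows from the question:
# (ID, patient_id, Date, Date2, Priority, Source, Value)
rows = [
    (1, 1, "2017-09-09", "2018-09-09", 1, "verified", 55),
    (2, 1, "2017-09-09", "2018-11-11", 2, "verified", 78),
    (3, 1, "2017-11-11", "2018-09-09", 3, "verified", 23),
    (4, 1, "2017-11-11", "2018-11-11", 1, "self_reported", 11),
    (5, 1, "2017-09-09", "2018-09-09", 2, "self_reported", 90),
    (5, 1, "2017-09-09", "2018-09-09", 3, "self_reported", 34),
    (6, 2, "2017-11-11", "2018-09-09", 2, "self_reported", 21),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (ID, patient_id, Date, Date2, Priority, Source, Value)"
)
conn.executemany("insert into mytable values (?, ?, ?, ?, ?, ?, ?)", rows)

# Rank rows per patient: latest Date first, then lowest Priority, then earliest Date2.
query = """
select patient_id, Value
from (
    select t.*,
           row_number() over (
               partition by patient_id
               order by Date desc, Priority, Date2
           ) as rn
    from mytable t
) ranked
where rn = 1
order by patient_id
"""
result = conn.execute(query).fetchall()
print(result)
```

For patient 1 the two rows dated 2017-11-11 survive the first criterion, and the one with Priority 1 wins the tie, so each patient ends up with exactly one Value.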

Related

BigQuery: delete duplicated row that are not fully duplicated (delete desire row)

I have a table recording customer steps on a daily basis. The table has Id, Date and Step columns. Some rows contain different steps on the same day for the same Id, as shown in the sample below on 5/3/2020 and 5/4/2020 for Id 1:
| Id | Date | Step |
|:-----|:---------|:-----|
| 1 | 5/1/2020 | 1 |
| 1 | 5/2/2020 | 1 |
| 1 | 5/3/2020 | 0 |
| 1 | 5/3/2020 | 5 |
| 1 | 5/4/2020 | 2 |
| 1 | 5/4/2020 | 10 |
| 1 | 5/5/2020 | 1 |
| 2 | 5/1/2020 | 1 |
| 2 | 5/2/2020 | 2 |
| 2 | 5/3/2020 | 0 |
I want to delete the rows with the lower step, which for Id 1 are the 0-step row on 5/3/2020 and the 2-step row on 5/4/2020.
I had tried using row_number() like this:
SELECT
Id,
Date,
step,
ROW_NUMBER() OVER (PARTITION BY Id, Date ORDER BY Id, Date) AS rn
FROM
`dataset.step`
WHERE rn>1
But that will give me the rows with the higher step, which is not what I want.
I am also able to select the rows with fewer steps like this:
SELECT * FROM
`dataset.step` AS A
INNER JOIN
`dataset.step` AS B
ON A.Id = B.Id
AND A.Date = B.Date
WHERE A.step < B.step
But I found no way to use it for a delete.
Use the approach below:
select *
from your_table
qualify 1 = row_number() over win
window win as (partition by id, date order by step desc)
Applied to the sample data in your question, this keeps only the row with the highest step for each Id and Date.
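Since the question asks for a delete, here is a rough sketch of how the asker's own self-join condition (a row is unwanted when another row for the same Id and Date has a higher step) can be turned into a delete with EXISTS. It runs against an in-memory SQLite database via Python's sqlite3, not BigQuery, so the DML syntax there may need adjusting; the table name steps is a placeholder for `dataset.step`.

```python
import sqlite3

# Sample rows from the question: (Id, Date, step)
rows = [
    (1, "5/1/2020", 1), (1, "5/2/2020", 1), (1, "5/3/2020", 0),
    (1, "5/3/2020", 5), (1, "5/4/2020", 2), (1, "5/4/2020", 10),
    (1, "5/5/2020", 1), (2, "5/1/2020", 1), (2, "5/2/2020", 2),
    (2, "5/3/2020", 0),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table steps (Id, Date, step)")
conn.executemany("insert into steps values (?, ?, ?)", rows)

# Delete every row for which a row with the same Id and Date
# but a strictly higher step exists.
conn.execute("""
delete from steps
where exists (
    select 1
    from steps as b
    where b.Id = steps.Id
      and b.Date = steps.Date
      and b.step > steps.step
)
""")

remaining = conn.execute(
    "select Id, Date, step from steps order by rowid"
).fetchall()
print(remaining)
```

Rows that tie on the highest step for a day are all kept by this condition, so exact duplicates would still need a separate pass.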

Retrieve SQL records where only the last unique entries match criteria in postgresql

I've got a long table that tracks a numerical 'state' value (0=new, 1=setup mode, 2=retired, 3=active, 4=inactive) of a collection of 'devices' historically. These devices may be activated/deactivated throughout the year, so the table is a continuous collection of state changes (mostly states 3 and 4), ordered by id, with a timestamp on the end, for example:
id | device_id | new_state | when
----------+-----------+-----------+----------------------------
218010581 | 2505 | 0 | 2022-06-06 16:28:11.174084
218010580 | 2505 | 1 | 2022-06-06 16:28:11.174084
218010634 | 2505 | 3 | 2022-06-06 16:29:25.129019
218087737 | 659 | 3 | 2022-06-07 22:55:48.705208
218087744 | 1392 | 3 | 2022-06-07 22:55:59.016974
218087757 | 1556 | 3 | 2022-06-07 22:56:09.811876
218087758 | 2071 | 1 | 2022-06-07 22:56:20.850095
218087765 | 2071 | 3 | 2022-06-07 22:56:29.122074
When I want to look for a list of devices and see their 'history', I know I can use something like:
select *
from devstatechange
where device_id = 2345
order by "when";
id | device_id | new_state | when
-----------+-----------+-----------+----------------------------
184682659 | 2345 | 0 | 2021-05-27 17:03:36.894429
184682658 | 2345 | 1 | 2021-05-27 17:03:36.894429
184684721 | 2345 | 3 | 2021-05-27 17:31:01.968314
194933399 | 2345 | 4 | 2021-08-31 23:30:05.555407
195213746 | 2345 | 3 | 2021-09-03 16:53:39.043005
206278232 | 2345 | 4 | 2021-12-31 22:30:08.820068
206515355 | 2345 | 3 | 2022-01-03 16:06:01.223759
215709888 | 2345 | 4 | 2022-04-30 23:30:30.309389
215846807 | 2345 | 3 | 2022-05-02 19:40:31.525514
select *
from devstatechange
where device_id = 2351
order by "when";
id | device_id | new_state | when
-----------+-----------+-----------+----------------------------
186091252 | 2351 | 0 | 2021-06-09 15:36:02.775035
186091253 | 2351 | 1 | 2021-06-09 15:36:02.775035
186091349 | 2351 | 3 | 2021-06-09 15:37:56.965599
197880878 | 2351 | 4 | 2021-09-30 23:30:06.691835
197945073 | 2351 | 3 | 2021-10-01 15:32:35.907913
208981857 | 2351 | 4 | 2022-01-31 22:30:09.521694
209722639 | 2351 | 3 | 2022-02-09 15:20:12.412816
217666572 | 2351 | 4 | 2022-05-31 23:30:30.881928
What I am really looking for is a query that returns a unique list of devices where the latest dated entry for each device has a state of '4' ('inactive'), excluding devices that do not match.
So in using the above data samples, even though both devices 2345 and 2351 have states of 3 and 4 throughout their history, only device 2351 has its last dated entry with a state of 4, meaning it is currently in an 'inactive' state. Device 2345 would not appear in the result set since its last dated entry has a state of 3: it's still active.
Stabbing in the dark, I've tried variants of:
SELECT DISTINCT *
FROM devstatechange
WHERE MAX("when") AND new_state = 4
ORDER BY "when";
SELECT DISTINCT device_id, new_state, MAX("when")
FROM devstatechange
WHERE new_state = 4
ORDER BY "when";
with obviously no success.
I'm thinking I might need to 'group' the entries together, but I don't know how to specify 'return last entry only if new_state = 4' in SQL, or rather PostgreSQL.
Any tidbits or pokes in the right direction would be appreciated.
SELECT * FROM (
SELECT DISTINCT ON (device_id)
*
FROM devstatechange
ORDER BY device_id, "when" DESC
) AS latest
WHERE new_state = 4;
The DISTINCT ON keyword together with the ORDER BY will pull the newest row for each device. The outer query then filters these by your condition.
You can use the row_number() function, partitioned by device_id and ordered by "when" descending.
Try the following CTE:
with cte as
(
    select id, device_id, new_state, "when",
        row_number() over (partition by device_id order by "when" desc) as rn
    from devstatechange
)
select * from cte where rn = 1 and new_state = 4
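The problem with:
SELECT DISTINCT * FROM devstatechange WHERE MAX("when") AND new_state=4 ORDER BY "when";
is that MAX("when") is an aggregate over all the entries in the table rather than a per-device condition, so it cannot be used in WHERE this way.
You should compare against the per-device maximum instead, with the outer table aliased as dev1:
"when" = (select max("when") from devstatechange dev2 where dev2.device_id = dev1.device_id)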
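To show where that condition fits, here is a small sketch of the complete query, run against an in-memory SQLite database via Python's sqlite3 as a stand-in for PostgreSQL. The rows are a subset of the sample data from the question, and the aliases dev1 and dev2 follow the fragment above.

```python
import sqlite3

# A subset of the sample rows: (id, device_id, new_state, "when")
rows = [
    (206515355, 2345, 3, "2022-01-03 16:06:01.223759"),
    (215709888, 2345, 4, "2022-04-30 23:30:30.309389"),
    (215846807, 2345, 3, "2022-05-02 19:40:31.525514"),
    (209722639, 2351, 3, "2022-02-09 15:20:12.412816"),
    (217666572, 2351, 4, "2022-05-31 23:30:30.881928"),
    (218010634, 2505, 3, "2022-06-06 16:29:25.129019"),
]

conn = sqlite3.connect(":memory:")
conn.execute('create table devstatechange (id, device_id, new_state, "when")')
conn.executemany("insert into devstatechange values (?, ?, ?, ?)", rows)

# Keep a row only if it is the newest entry for its device AND its state is 4.
query = """
select dev1.id, dev1.device_id, dev1.new_state, dev1."when"
from devstatechange dev1
where dev1.new_state = 4
  and dev1."when" = (
      select max(dev2."when")
      from devstatechange dev2
      where dev2.device_id = dev1.device_id
  )
order by dev1.device_id
"""
inactive = conn.execute(query).fetchall()
print(inactive)
```

Only device 2351 is returned: device 2345 has state-4 rows in its history, but its newest row has state 3.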
You can use a CTE to obtain the last state of each device and then select only those whose last state is 4, like this:
WITH device_last_state AS (
    SELECT DISTINCT ON (device_id)
        id,
        device_id,
        first_value(new_state) OVER (PARTITION BY device_id ORDER BY "when" DESC) AS new_state,
        "when"
    FROM devstatechange
    ORDER BY device_id, "when" DESC
)
SELECT * FROM device_last_state
WHERE new_state = 4

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, I need a record for each user for each day. If a day is missing for a user, the last recorded score should be carried forward, so I would end up with something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to keep the same score across following empty dates, but I really don't know how to add new rows for missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success.
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
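Applied to the sample data from your question, this returns one row per user per day between that user's first and last date, with missing days filled by the last recorded score.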
One option uses generate_date_array() to create the series of dates for each user, then brings in the table with a left join.
select d.date, d.user_id,
    last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
    -- assumes date is stored as a DATE; per-user bounds are aggregated before unnesting
    select r.user_id, day as date
    from (
        select user_id, min(date) as min_date, max(date) as max_date
        from mytable
        group by user_id
    ) r
    cross join unnest(generate_date_array(r.min_date, r.max_date, interval 1 day)) as day
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array(), but in a very particular way:
with t as (
    select t.*,
        date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
    from t
)
select row_number() over (order by t.user_id, dte) as id,
    t.user_id, dte, t.score
from t cross join
    unnest(generate_date_array(date,
                               coalesce(next_date, date),
                               interval 1 day
                              )
          ) dte;
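The answers above rely on BigQuery's generate_date_array(). To make the underlying idea (build every day between each user's first and last date, then carry the last known score forward) easy to try locally, here is a rough equivalent for SQLite via Python's sqlite3. It swaps in a recursive CTE for the date series and a correlated subquery for the forward fill, and it assumes the dates are stored as ISO strings (YYYY-MM-DD) rather than the 20201120 form in the question; the table name scores is hypothetical.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings (assumption).
rows = [
    ("2020-11-20", 1, 26), ("2020-11-21", 1, 14), ("2020-11-25", 1, 0),
    ("2020-11-14", 2, 32), ("2020-11-16", 2, 0), ("2020-11-20", 2, 23),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table scores (date text, user_id integer, score integer)")
conn.executemany("insert into scores values (?, ?, ?)", rows)

query = """
with recursive days(user_id, day, last_day) as (
    -- one starting row per user: first and last recorded date
    select user_id, min(date), max(date) from scores group by user_id
    union all
    -- step one day at a time until the user's last date is reached
    select user_id, date(day, '+1 day'), last_day from days where day < last_day
)
select d.day, d.user_id,
       -- most recent score recorded on or before this day
       (select s.score
        from scores s
        where s.user_id = d.user_id and s.date <= d.day
        order by s.date desc
        limit 1) as score
from days d
order by d.user_id, d.day
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

This yields 13 rows for the sample data, matching the desired output in the question apart from the date format.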

SQL Server: most efficient way to update multiple records depending on each other

I want to update multiple records from table "a" depending on each other. The values of the table "a" look like:
+------------+---------------+-------+
| date | transfervalue | value |
+------------+---------------+-------+
| 01.03.2018 | 0 | 10 |
| 02.03.2018 | 0 | 6 |
| 03.03.2018 | 0 | 13 |
+------------+---------------+-------+
After the update the values of the table "a" should look like:
+------------+---------------+-------+
| date | transfervalue | value |
+------------+---------------+-------+
| 01.03.2018 | 0 | 10 |
| 02.03.2018 | 10 | 6 |
| 03.03.2018 | 16 | 13 |
+------------+---------------+-------+
What is the most efficient way to do this? I've tried three different solutions, but the last solution doesn't work.
Solution 1: do a loop and iterate over each day to do the update statement
Solution 2: do an update statement for each day
Solution 3: do the update for the whole timespan in one statement
The output of solution 3 was:
+------------+---------------+-------+
| date | transfervalue | value |
+------------+---------------+-------+
| 01.03.2018 | 0 | 10 |
| 02.03.2018 | 10 | 6 |
| 03.03.2018 | 6 | 13 |
+------------+---------------+-------+
You seem to want a cumulative sum:
with toupdate as (
select t.*, sum(value) over (order by date rows between unbounded preceding and 1 preceding) as running_value
from t
)
update toupdate
set transfervalue = coalesce(running_value, 0);
This should work:
select t1.*,
coalesce((select sum(value) from table1 t2 where t2.date < t1.date), 0) MyNewValue
from table1 t1
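To check the "sum of all earlier values" idea against the expected output from the question, here is a minimal sketch that applies the same correlated subquery as an UPDATE. It uses an in-memory SQLite database via Python's sqlite3 rather than SQL Server, and stores the dates as ISO strings (an assumption) so that comparing them orders the days correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table a (date text, transfervalue integer, value integer)")
conn.executemany(
    "insert into a values (?, ?, ?)",
    [("2018-03-01", 0, 10), ("2018-03-02", 0, 6), ("2018-03-03", 0, 13)],
)

# transfervalue becomes the sum of value over all strictly earlier dates,
# or 0 when there is no earlier row. The subquery reads only the value
# column, which the update does not modify.
conn.execute("""
update a
set transfervalue = coalesce(
    (select sum(b.value) from a as b where b.date < a.date),
    0
)
""")

updated = conn.execute(
    "select date, transfervalue, value from a order by date"
).fetchall()
print(updated)
```

The third row gets 16 (10 + 6), not 6, because each transfervalue is computed from the original value column and does not depend on other rows having been updated first.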

Select latest values for group of related records

I have a table that accommodates data that is logically groupable by multiple properties (foreign key for example). Data is sequential over continuous time interval; i.e. it is a time series data. What I am trying to achieve is to select only latest values for each group of groups.
Here is example data:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 1 | 01.01.2016 | 1 |
| A | 2 | 02.01.2016 | 1 |
| A | 3 | 03.01.2016 | 1 |
| A | 4 | 01.01.2016 | 2 |
| A | 5 | 02.01.2016 | 2 |
| A | 6 | 03.01.2016 | 2 |
| B | 1 | 01.01.2016 | 1 |
| B | 2 | 02.01.2016 | 1 |
| B | 3 | 03.01.2016 | 1 |
| B | 4 | 01.01.2016 | 2 |
| B | 5 | 02.01.2016 | 2 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
And here is example of desired output:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 3 | 03.01.2016 | 1 |
| A | 6 | 03.01.2016 | 2 |
| B | 3 | 03.01.2016 | 1 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
To put this in perspective — for every related object I want to select each code with latest date.
Here is a select I came up with. I've used the ROW_NUMBER() OVER (PARTITION BY...) approach:
SELECT indicators.code, indicators.dimension, indicators.unit, x.value, x.date, x.ticker, x.name
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY indicator_id ORDER BY date DESC) AS r,
t.indicator_id, t.value, t.date, t.company_id, companies.sic_id,
companies.ticker, companies.name
FROM fundamentals t
INNER JOIN companies on companies.id = t.company_id
WHERE companies.sic_id = 89
) x
INNER JOIN indicators on indicators.id = x.indicator_id
WHERE x.r <= (SELECT count(*) FROM companies where sic_id = 89)
It works, but the problem is that it is painfully slow: when working with about 5% of production data, which is roughly 3 million fundamentals records, this select takes about 10 seconds to finish. My guess is that this happens because the subselect selects a huge number of records first.
Is there any way to speed this query up or am I digging in wrong direction trying to do it the way I do?
Postgres offers the convenient distinct on for this purpose:
select distinct on (relation_id, code) t.*
from t
order by relation_id, code, date desc;
So your query uses different column names than your sample data, so it's hard to tell, but it looks like you just want to group by everything except for date? Assuming you don't have multiple most recent dates, something like this should work. Basically don't use the window function, use a proper group by, and your engine should optimize the query better.
SELECT mytable.code,
mytable.value,
mytable.date,
mytable.relation_id
FROM mytable
JOIN (
SELECT code,
max(date) as date,
relation_id
FROM mytable
GROUP BY code, relation_id
) Q1
ON Q1.code = mytable.code
AND Q1.date = mytable.date
AND Q1.relation_id = mytable.relation_id
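As a sanity check, the join against the per-group maximum date can be run on the sample data with an in-memory SQLite database via Python's sqlite3. This is only a sketch: SQLite stands in for Postgres, the dates are rewritten as ISO strings (an assumption) so that max() picks the latest day, and it says nothing about performance on the real tables.

```python
import sqlite3

# Rebuild the sample data: for each code and relation_id, three daily values.
rows = []
for code in ("A", "B"):
    for relation_id, first_value in ((1, 1), (2, 4)):
        for offset in range(3):
            rows.append(
                (code, first_value + offset, f"2016-01-0{offset + 1}", relation_id)
            )

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (code, value, date, relation_id)")
conn.executemany("insert into mytable values (?, ?, ?, ?)", rows)

# Find the latest date per (code, relation_id), then join back for the full row.
query = """
select mytable.code, mytable.value, mytable.date, mytable.relation_id
from mytable
join (
    select code, max(date) as date, relation_id
    from mytable
    group by code, relation_id
) q1
  on q1.code = mytable.code
 and q1.date = mytable.date
 and q1.relation_id = mytable.relation_id
order by mytable.code, mytable.relation_id
"""
latest = conn.execute(query).fetchall()
print(latest)
```

If a group had two rows sharing its latest date, both would be returned, which is the caveat the answer mentions.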
Other option:
SELECT DISTINCT Code,
Relation_ID,
FIRST_VALUE(Value) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Value,
FIRST_VALUE(Date) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Date
FROM mytable
This returns the top value within whatever you partition by, ranked by whatever you order by.
I believe we can try something like this
SELECT CODE,Relation_ID,Date,MAX(value)value FROM mytable
GROUP BY CODE,Relation_ID,Date