Select latest values for group of related records - sql

I have a table that accommodates data that is logically groupable by multiple properties (foreign key for example). Data is sequential over continuous time interval; i.e. it is a time series data. What I am trying to achieve is to select only latest values for each group of groups.
Here is example data:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 1 | 01.01.2016 | 1 |
| A | 2 | 02.01.2016 | 1 |
| A | 3 | 03.01.2016 | 1 |
| A | 4 | 01.01.2016 | 2 |
| A | 5 | 02.01.2016 | 2 |
| A | 6 | 03.01.2016 | 2 |
| B | 1 | 01.01.2016 | 1 |
| B | 2 | 02.01.2016 | 1 |
| B | 3 | 03.01.2016 | 1 |
| B | 4 | 01.01.2016 | 2 |
| B | 5 | 02.01.2016 | 2 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
And here is example of desired output:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 3 | 03.01.2016 | 1 |
| A | 6 | 03.01.2016 | 2 |
| B | 3 | 03.01.2016 | 1 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
To put this in perspective — for every related object I want to select each code with latest date.
Here is a select I came with. I've used ROW_NUMBER OVER (PARTITION BY...) approach:
SELECT indicators.code, indicators.dimension, indicators.unit, x.value, x.date, x.ticker, x.name
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY indicator_id ORDER BY date DESC) AS r,
t.indicator_id, t.value, t.date, t.company_id, companies.sic_id,
companies.ticker, companies.name
FROM fundamentals t
INNER JOIN companies on companies.id = t.company_id
WHERE companies.sic_id = 89
) x
INNER JOIN indicators on indicators.id = x.indicator_id
WHERE x.r <= (SELECT count(*) FROM companies where sic_id = 89)
It works but the problem is that it is painfully slow; when working with about 5% of production data which equals to roughly 3 million fundamentals records this select take about 10 seconds to finish. My guess is that happens due to subselect selecting huge amounts of records first.
Is there any way to speed this query up or am I digging in wrong direction trying to do it the way I do?

Postgres offers the convenient distinct on for this purpose:
select distinct on (relation_id, code) t.*
from t
order by relation_id, code, date desc;

So your query uses different column names than your sample data, so it's hard to tell, but it looks like you just want to group by everything except for date? Assuming you don't have multiple most recent dates, something like this should work. Basically don't use the window function, use a proper group by, and your engine should optimize the query better.
SELECT mytable.code,
mytable.value,
mytable.date,
mytable.relation_id
FROM mytable
JOIN (
SELECT code,
max(date) as date,
relation_id
FROM mytable
GROUP BY code, relation_id
) Q1
ON Q1.code = mytable.code
AND Q1.date = mytable.date
AND Q1.relation_id = mytable.relation_id

Other option:
SELECT DISTINCT Code,
Relation_ID,
FIRST_VALUE(Value) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Value,
FIRST_VALUE(Date) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Date
FROM mytable
This will return top value for what ever you partition by, and for whatever you order by.

I believe we can try something like this
SELECT CODE,Relation_ID,Date,MAX(value)value FROM mytable
GROUP BY CODE,Relation_ID,Date

Related

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, from this, I need to have a record for each user for each day where if a day is missing for a user, then the last score recorded should be maintained then I would have something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to to this in BigQuery using StandardSQL. I have an idea of how to keep the same score across following empty dates, but I really don't know how to add new rows for missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
if applied to sample data from your question - output is
One option uses generate_date_array() to create the series of dates of each user, then brings the table with a left join.
select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
select t.user_id, d.date
from mytable t
cross join unnest(generate_date_array(min(date), max(date), interval 1 day)) d(date)
group by t.user_id
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join join
unnest(generate_date_array(date,
coalesce(next_date, date)
interval 1 day
)
) dte;

SQL - Group two rows by columns that value and null on different columns

Question
Say I have a table with such rows:
id | country | place | last_action | second_to_last_action
----------------------------------------------------------
1 | US | 2 | reply |
1 | US | 2 | | comment
4 | DE | 5 | reply |
4 | | | | comment
What I want to do is to combine these by id, country and place so that the last_action and second_to_last_action would be on the same row
id | country | place | last_action | second_to_last_action
----------------------------------------------------------
1 | US | 2 | reply | comment
4 | DE | 5 | reply | comment
How would I approach this? I guess I would need an aggregate here but my mind is hitting completely blank on which one should I use.
It can be expected that there will always be a matching pair.
Background:
Note: this table has been derived from something like this:
id | country | place | action | time
----------------------------------------------------------
1 | US | 2 | reply | 16:15
1 | US | 2 | comment | 15:16
1 | US | 2 | view | 13:16
4 | DE | 5 | reply | 17:15
4 | DE | 5 | comment | 16:16
4 | DE | 5 | view | 14:12
Code used to partition was:
row_number() over (partition by id order by time desc) as event_no
And then I got the last and second_to_last action by getting event_no 1 & 2. So if there's more efficient way to get the last two actions in two distinct columns I would be happy to hear that.
You can fix your first data by using aggregation:
select id, country, place, max(last_action), max(second_to_last_action)
from derived
group by id, country, place;
You can do this from the original table using conditional aggregation:
select id, country, place,
max(case when seqnum = 1 then action end) as last_action,
max(case when seqnum = 2 then action end) as second_to_last_action
from (select t.*,
row_number() over (partition by id order by time desc) as seqnum
from t
) t
group by id, country, place;

How to de-duplicate SQL table rows by multiple columns with hierarchy?

I have a table with multiple records for each patient.
My end goal is a table that is 1-to-1 between Patient_id and Value.
I would like to de-duplicate (in respect to patient_id) my rows based on "a hierarchical series of aggregate functions" (if someone has a better way to phrase this, I'd appreciate that as well.)
+----+------------+------------+------------+----------+-----------------+-------+
| ID | patient_id | Date | Date2 | Priority | Source | Value |
+----+------------+------------+------------+----------+-----------------+-------+
| 1 | 1 | 2017-09-09 | 2018-09-09 | 1 | 'verified' | 55 |
| 2 | 1 | 2017-09-09 | 2018-11-11 | 2 | 'verified' | 78 |
| 3 | 1 | 2017-11-11 | 2018-09-09 | 3 | 'verified' | 23 |
| 4 | 1 | 2017-11-11 | 2018-11-11 | 1 | 'self_reported' | 11 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 2 | 'self_reported' | 90 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 3 | 'self_reported' | 34 |
| 6 | 2 | 2017-11-11 | 2018-09-09 | 2 | 'self_reported' | 21 |
+----+------------+------------+------------+----------+-----------------+-------+
For each patient_id, I would like to get the row(s) that has/have the MAX(Date). In the case that there are still duplicated patient_id, I would like to get the row(s) with the MIN(Priority). In the case that there are still duplicated rows I would like to get the row(s) with the MIN(Date2).
The way I've approached this problem is using a series of queries like this to de-duplicate on the columns one at a time.
SELECT *
FROM #table t1
LEFT JOIN
(SELECT
patient_id,
MIN(priority) AS min_priority
FROM #table
GROUP BY patient_id) t2 ON t2.patient_id = t1.patient_id
WHERE t2.min_priority = t1.priority
Is there a way to do this that allows me to de-dup on multiple columns at once? Is there a more elegant way to do this?
I'm able to get my results, but my solution feels very inefficient, and I keep running into this. Thank you for any input.
You could use row_number(), if your RDBMS supports it:
select ID, patient_id, Date, Date2, Priority, Source, Value
from (
select
t.*,
row_number() over(partition by patient_id order by Date desc, Priority, Date2) rn
from mytable t
) where rn = 1
Another option is to filter with a correlated subquery that sorts the record according to your criteria, like so:
select t.*
from mytable t
where id = (
select id
from mytable t1
where t1.patient_id = t.patient_id
order by t1.Date desc, t1.Priority, t1.Date2
limit 1
)
The actual syntax for limit varies accross RDBMS.

SQL formula for Row number

I'm trying to rank the rows in the following table that looks like this:
| ID | Key | Date | Row|
*****************************
| P175 | 5 | 2017-01| 2 |
| P175 | 5 | 2017-02| 2 |
| P175 | 5 | 2017-03| 2 |
| P175 | 12 | 2017-03| 1 |
| P175 | 12 | 2017-04| 1 |
| P175 | 12 | 2017-05| 1 |
This person has two Keys at once during 2017-03, but I want the formula to put '1' for the rows where Key=12 since it reflects the most recent records.
I want the same formula to also work for the people who don't have overlapping Keys, putting '1' for the most recent records:
| ID | Key | Date | Row|
*****************************
| P170 | 8 | 2017-01| 2 |
| P170 | 8 | 2017-02| 2 |
| P170 | 8 | 2017-03| 2 |
| P170 | 6 | 2017-04| 1 |
| P170 | 6 | 2017-05| 1 |
I've tried variations of ROW_NUMBER() OVER PARTITION BY and DENSE_RANK but cannot figure out the correct formula. Thanks for your help.
First calculate the max date for the key. Then use dense_rank():
select t.*,
dense_rank() over (partition by id order by max_date desc, key) as row
from (select t.*, max(date) over (partition by id, key) as max_date
from t
) t;
If the ranges for each key did not overlap, you could do this with a cumulative count distinct:
select t.*, count(distinct key) over (partition by id order by date desc) as rank
from t;
However, this would not work in the first case. I just find it interesting that this does almost the same thing as the first query.
I guess you are looking for something like this
select personid, mykey, month,
dense_rank() over (partition by personid order by mykey desc) rown
from personkeys
order by month
see the example
http://sqlfiddle.com/#!15/cf751/8

LEFT JOINing the max/top

I have two tables from which I'm trying to run a query to return the maximum (or top) transaction for each person. I should note that I cannot change the table structure. Rather, I can only pull data.
People
+-----------+
| id | name |
+-----------+
| 42 | Bob |
| 65 | Ted |
| 99 | Stu |
+-----------+
Transactions (there is no primary key)
+---------------------------------+
| person | amount | date |
+---------------------------------+
| 42 | 3 | 9/14/2030 |
| 42 | 4 | 7/02/2015 |
| 42 | *NULL* | 2/04/2020 |
| 65 | 7 | 1/03/2010 |
| 65 | 7 | 5/20/2020 |
+---------------------------------+
Ultimately, for each person I want to return the highest amount. If that doesn't work then I'd like to look at the date and return the most recent date.
So, I'd like my query to return:
+----------------------------------------+
| person_id | name | amount | date |
+----------------------------------------+
| 42 | Bob | 4 | 7/02/2015 | (<- highest amount)
| 65 | Ted | 7 | 5/20/2020 | (<- most recent date)
| 99 | Stu | *NULL* | *NULL* | (<- no records in Transactions table)
+----------------------------------------+
SELECT People.id, name, amount, date
FROM People
LEFT JOIN (
SELECT TOP 1 person_id
FROM Transactions
WHERE person_id = People.id
ORDER BY amount DESC, date ASC
)
ON People.id = person_id
I can't figure out what I am doing wrong, but I know it's wrong. Any help would be much appreciated.
You are almost there but since there are duplicate Id in the Transaction table ,so you need to remove those by using Row_number() function
Try this :
With cte as
(Select People,amount,date ,row_number() over (partition by People
order by amount desc, date desc) as row_num
from Transac )
Select * from People as a
left join cte as b
on a.ID=b.People
and b.row_num=1
The result is in Sql Fiddle
Edit: Row_number() from MSDN
Returns the sequential number of a row within a partition of a result set,
starting at 1 for the first row in each partition.
Partition is used to group the result set and Over by clause is used
Determine the partitioning and ordering of the rowset before the
associated window function is applied.