Combining multiple rows based on recent values in PostgreSQL

This is my first question here, so I will try to explain it well.
I have a specific need that I tried, and failed, to write a query for. I also searched online without finding anything, though my search terms were probably poor, as it does not seem like it should be that hard.
Here is an example of the table and data I have (dates are in dd-MM-yyyy format):
----------------------------------------------------------------------------
| id | asset_id | value | start_date | end_date |
----------------------------------------------------------------------------
| 1 | 1 | value1 | 20-10-2020 | 31-10-2020 |
----------------------------------------------------------------------------
| 1 | 1 | value1 | 01-11-2020 | 05-11-2020 |
----------------------------------------------------------------------------
| 1 | 2 | value2 | 05-10-2020 | 10-10-2020 |
----------------------------------------------------------------------------
| 1 | 2 | value3 | 10-10-2020 | 15-10-2020 |
----------------------------------------------------------------------------
| 1 | 3 | value3 | 15-08-2020 | 31-08-2020 |
----------------------------------------------------------------------------
| 1 | 3 | value1 | 01-09-2020 | 05-09-2020 |
----------------------------------------------------------------------------
| 1 | 3 | value1 | 05-09-2020 | 10-09-2020 |
----------------------------------------------------------------------------
What I need is to look at the two most recent rows in each (id, asset_id) group. If the two rows have the same value, combine them into one row, taking the start_date from the earlier row and the end_date from the later one. If the values do not match, nothing should be done.
For the input table above, the desired output is:
----------------------------------------------------------------------------
| id | asset_id | value | start_date | end_date |
----------------------------------------------------------------------------
| 1 | 1 | value1 | 20-10-2020 | 05-11-2020 |
----------------------------------------------------------------------------
| 1 | 2 | value2 | 05-10-2020 | 10-10-2020 |
----------------------------------------------------------------------------
| 1 | 2 | value3 | 10-10-2020 | 15-10-2020 |
----------------------------------------------------------------------------
| 1 | 3 | value3 | 15-08-2020 | 31-08-2020 |
----------------------------------------------------------------------------
| 1 | 3 | value1 | 01-09-2020 | 10-09-2020 |
----------------------------------------------------------------------------
For the group (id, asset_id) = (1,1), the two rows from the input table have the same value, so they should be combined as described: the 1st and 2nd input rows become the 1st output row. For the (1,2) group, the values differ, so nothing should be combined. For the (1,3) group, the two most recent rows (the 6th and 7th input rows) should be combined into the 5th output row.
It does not seem hard, but I am having trouble coming up with a working query. I made an sqlfiddle where anyone can try it.
Any help is really appreciated.

You can filter the top two rows per group with row_number(), then aggregate by value: if both rows in the group have the same value, they are grouped together; otherwise they end up in two different groups. Note that the rn <= 2 filter drops any rows older than the two most recent in each group.
So:
select id, asset_id, value, min(start_date) as start_date, max(end_date) as end_date
from (
    select t.*,
           row_number() over (partition by id, asset_id order by start_date desc) as rn
    from mytable t
) t
where rn <= 2
group by id, asset_id, value;
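The desired output in the question also keeps rows older than the two most recent ones. A minimal sketch of one way to do that is below, assuming it is acceptable to group every older row on its own row number so it passes through unchanged. It runs against an in-memory SQLite database through Python's sqlite3 module purely as a stand-in for PostgreSQL (SQLite 3.25 or later is needed for window functions). The ISO date strings and the extra `case` expression in the `group by` are assumptions of this sketch, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable "
    "(id int, asset_id int, value text, start_date text, end_date text)"
)
# Sample rows from the question, with dates rewritten as ISO strings so they sort correctly.
conn.executemany(
    "insert into mytable values (?, ?, ?, ?, ?)",
    [
        (1, 1, "value1", "2020-10-20", "2020-10-31"),
        (1, 1, "value1", "2020-11-01", "2020-11-05"),
        (1, 2, "value2", "2020-10-05", "2020-10-10"),
        (1, 2, "value3", "2020-10-10", "2020-10-15"),
        (1, 3, "value3", "2020-08-15", "2020-08-31"),
        (1, 3, "value1", "2020-09-01", "2020-09-05"),
        (1, 3, "value1", "2020-09-05", "2020-09-10"),
    ],
)

merge_recent_query = """
select id, asset_id, value,
       min(start_date) as start_date,
       max(end_date)   as end_date
from (
    select t.*,
           row_number() over (partition by id, asset_id
                              order by start_date desc) as rn
    from mytable t
) t
-- The two most recent rows share bucket 0; every older row gets its own bucket.
group by id, asset_id, value, case when rn <= 2 then 0 else rn end
order by id, asset_id, min(start_date)
"""

merged = conn.execute(merge_recent_query).fetchall()
for row in merged:
    print(row)
```

The same `group by` expression should carry over to PostgreSQL unchanged, since it only relies on `row_number()` and a `case` expression.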

Related

Replace null values with most recent non-null values SQL

I have a table where each row consists of an ID, a date, and variable values (e.g. var1).
When var1 is null in a row, I would like to replace the null with the most recent non-null value before that date for that ID. How can I do this quickly for a very large table?
Suppose I start with this table:
+----+------------+-------+
| id |date | var1 |
+----+------------+-------+
| 1 |'01-01-2022'|55 |
| 2 |'01-01-2022'|12 |
| 3 |'01-01-2022'|45 |
| 1 |'01-02-2022'|Null |
| 2 |'01-02-2022'|Null |
| 3 |'01-02-2022'|20 |
| 1 |'01-03-2022'|15 |
| 2 |'01-03-2022'|Null |
| 3 |'01-03-2022'|Null |
| 1 |'01-04-2022'|Null |
| 2 |'01-04-2022'|77 |
+----+------------+-------+
Then I want this
+----+------------+-------+
| id |date | var1 |
+----+------------+-------+
| 1 |'01-01-2022'|55 |
| 2 |'01-01-2022'|12 |
| 3 |'01-01-2022'|45 |
| 1 |'01-02-2022'|55 |
| 2 |'01-02-2022'|12 |
| 3 |'01-02-2022'|20 |
| 1 |'01-03-2022'|15 |
| 2 |'01-03-2022'|12 |
| 3 |'01-03-2022'|20 |
| 1 |'01-04-2022'|15 |
| 2 |'01-04-2022'|77 |
+----+------------+-------+

A CTE suits this well. This snippet (SQL Server syntax) returns every row with the null values filled in; turning it into an update query is all that remains.
WITH selectcte AS
(
    SELECT * FROM testnulls WHERE var1 IS NOT NULL
)
SELECT t1A.id, t1A.date, ISNULL(t1A.var1, t1B.var1) AS varvalue
FROM testnulls t1A  -- read the full table so rows with a null var1 are kept
OUTER APPLY (SELECT TOP 1 var1
             FROM selectcte
             WHERE id = t1A.id AND date < t1A.date
             ORDER BY date DESC) t1B
You can read more about CTEs here:
https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-ver16
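For engines without OUTER APPLY, the same fill can be sketched with window functions only. This is a swapped-in technique, not the approach above: a running count of non-null values assigns each non-null row and the nulls that follow it to one group, and max() then picks the single non-null value in that group. The sketch uses Python's sqlite3 module with an in-memory database as a stand-in, and it assumes the sample dates are consecutive and stored as ISO strings; both are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table testnulls (id int, date text, var1 int)")
# Sample data from the question; None becomes SQL NULL.
conn.executemany(
    "insert into testnulls values (?, ?, ?)",
    [
        (1, "2022-01-01", 55), (2, "2022-01-01", 12), (3, "2022-01-01", 45),
        (1, "2022-01-02", None), (2, "2022-01-02", None), (3, "2022-01-02", 20),
        (1, "2022-01-03", 15), (2, "2022-01-03", None), (3, "2022-01-03", None),
        (1, "2022-01-04", None), (2, "2022-01-04", 77),
    ],
)

fill_query = """
select id, date,
       max(var1) over (partition by id, grp) as var1
from (
    select id, date, var1,
           -- count() skips nulls, so the counter only advances on non-null rows
           count(var1) over (partition by id order by date) as grp
    from testnulls
) s
order by date, id
"""

filled = conn.execute(fill_query).fetchall()
for row in filled:
    print(row)
```

Rows that come before the first non-null value for an id fall into group 0 and stay null, which matches the "most recent value before that date" requirement.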

Merging multiple "state-change" time series

Given a number of tables like the following, representing state-changes at time t of an entity identified by id:
| A | | B |
| t | id | a | | t | id | b |
| - | -- | - | | - | -- | - |
| 0 | 1 | 1 | | 0 | 1 | 3 |
| 1 | 1 | 2 | | 2 | 1 | 2 |
| 5 | 1 | 3 | | 3 | 1 | 1 |
where t is in reality a DateTime field with millisecond precision (making discretisation infeasible), how would I go about creating the following output?
| output |
| t | id | a | b |
| - | -- | - | - |
| 0 | 1 | 1 | 3 |
| 1 | 1 | 2 | 3 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 1 |
| 5 | 1 | 3 | 1 |
The idea is that for any given input timestamp, the entire state of a selected entity can be extracted by selecting one row from the resulting table. So the latest state of each variable corresponding to any time needs to be present in each row.
I've tried various JOIN statements, but I seem to be getting nowhere.
Note that in my use case:
rows also need to be joined by entity id
there may be more than two source tables to be merged
I'm running PostgreSQL, but I will eventually translate the query to SQLAlchemy, so a pure SQLAlchemy solution would be even better
I've created a db<>fiddle with the example data.

I think you want a full join and some other manipulations. The ideal would be:
select t, id,
       last_value(a.a ignore nulls) over (partition by id order by t) as a,
       last_value(b.b ignore nulls) over (partition by id order by t) as b
from a full join
     b
     using (t, id);
However, Postgres doesn't support ignore nulls. An alternative is to number the groups with a running count of non-null values (each non-null value starts a new group) and then take the single non-null value in each group with max():
select t, id,
       max(a) over (partition by id, grp_a) as a,
       max(b) over (partition by id, grp_b) as b
from (select *,
             count(a.a) over (partition by id order by t) as grp_a,
             count(b.b) over (partition by id order by t) as grp_b
      from a full join
           b
           using (t, id)
     ) ab;
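Since the question mentions more than two source tables, here is a hedged sketch of a variant that replaces the full join with a UNION ALL followed by a GROUP BY on (t, id), so that further tables only need one more UNION ALL branch. The count/max grouping is the same as above. It runs on an in-memory SQLite database through Python's sqlite3 module as a stand-in for PostgreSQL; the CTE names `events` and `grouped` are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table a (t int, id int, a int);
create table b (t int, id int, b int);
insert into a values (0, 1, 1), (1, 1, 2), (5, 1, 3);
insert into b values (0, 1, 3), (2, 1, 2), (3, 1, 1);
""")

merge_query = """
with events as (
    -- One row per (t, id); a column is null when its table has no change at t.
    select t, id, max(a) as a, max(b) as b
    from (
        select t, id, a, null as b from a
        union all
        select t, id, null as a, b from b
    ) u
    group by t, id
),
grouped as (
    select t, id, a, b,
           count(a) over (partition by id order by t) as grp_a,
           count(b) over (partition by id order by t) as grp_b
    from events
)
select t, id,
       max(a) over (partition by id, grp_a) as a,
       max(b) over (partition by id, grp_b) as b
from grouped
order by id, t
"""

state = conn.execute(merge_query).fetchall()
for row in state:
    print(row)
```

Because it uses only CTEs, UNION ALL and window functions, the query should be straightforward to express with SQLAlchemy's `union_all`, `cte` and `over` constructs, though that translation is not shown here.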

calculating sum of rows with identical id

Let's imagine a table with two columns ex:
| Value | ID |
+-------+----+
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 1 | 2 |
| 2 | 2 |
| 2 | 2 |
What I am trying to do is calculate the sum of the values that share the same id and display them in a separate table, like:
| Sum | ID |
+-----+----+
| 9 | 1 |
| 5 | 2 |
and so on.
I could find a sum of a known id by
SELECT SUM(VALUE) FROM MYTABLE WHERE ID = 1;
However, I am not sure how to find the sum for each id separately. Could you give me an idea of how to proceed?

Select SUM(VALUE),ID FROM MYTABLE GROUP BY ID

Use GROUP BY clause:
SELECT SUM(VALUE) Sum, ID FROM MYTABLE GROUP BY ID;

SELECT SUM(VALUE),ID FROM MYTABLE Group By ID
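As a quick check of the GROUP BY answers above, this small sketch loads the sample rows into an in-memory SQLite database through Python's sqlite3 module and runs the grouped sum. The `total` alias and the `order by` are additions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (value int, id int)")
conn.executemany(
    "insert into mytable values (?, ?)",
    [(2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (2, 2)],
)

# GROUP BY collapses the rows of each id into one row; SUM runs once per group.
totals = conn.execute(
    "select sum(value) as total, id from mytable group by id order by id"
).fetchall()
print(totals)
```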

Window functions limited by value in separate column

I have a "responses" table in my postgres database that looks like
| id | question_id |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
I want to produce a table with the response and question id, as well as the id of the previous response with that same question id, as such
| id | question_id | lag_resp_id |
| 1 | 1 | |
| 2 | 2 | |
| 3 | 1 | 1 |
| 4 | 2 | 2 |
| 5 | 2 | 4 |
Obviously pulling "lag(responses.id) over (order by responses.id)" will pull the previous response id regardless of question_id. I attempted the below subquery, but I know it is wrong since I am basically making a table of all lag ids for each question id in the subquery.
select
responses.question_id,
responses.id as response_id,
(select
lag(r2.id, 1) over (order by r2.id)
from
responses as r2
where
r2.question_id = responses.question_id
)
from
responses
I don't know if I'm on the right track with the subquery, or if I need to do something more advanced (which may involve "partition by", which I do not know how to use).
Any help would be hugely appreciated.

Use partition by. There is no need for a correlated subquery here.
select id,question_id,
lag(id) over (partition by question_id order by id) lag_resp_id
from responses
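To see what the partition does, the query above can be run on the sample data. The sketch below uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for Postgres (SQLite 3.25 or later supports lag()); the final `order by id` is added here only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table responses (id int, question_id int)")
conn.executemany(
    "insert into responses values (?, ?)",
    [(1, 1), (2, 2), (3, 1), (4, 2), (5, 2)],
)

lag_query = """
select id, question_id,
       -- lag() looks one row back within the same question_id only
       lag(id) over (partition by question_id order by id) as lag_resp_id
from responses
order by id
"""

lagged = conn.execute(lag_query).fetchall()
for row in lagged:
    print(row)
```

The first response for each question has no predecessor inside its partition, so its lag_resp_id is null (None in Python).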

selecting data with highest field value in a field

I have a table, and I'd like to select rows with the highest value. For example:
----------------
| user | index |
----------------
| 1 | 1 |
| 2 | 1 |
| 2 | 2 |
| 3 | 4 |
| 3 | 7 |
| 4 | 1 |
| 5 | 1 |
----------------
Expected result:
----------------
| user | index |
----------------
| 1 | 1 |
| 2 | 2 |
| 3 | 7 |
| 4 | 1 |
| 5 | 1 |
----------------
How can I do this? I assume it can be done with some Oracle function I am not aware of.
Thanks in advance :-)

You can use the MAX() function for that, grouping by the user column, like this:
SELECT "user"
,MAX("index") AS "index"
FROM Table1
GROUP BY "user"
ORDER BY "user";
Result:
| USER | INDEX |
----------------
| 1 | 1 |
| 2 | 2 |
| 3 | 7 |
| 4 | 1 |
| 5 | 1 |
See this SQLFiddle

If you need more than one column, use row_number():
select "user", "index"
from (
    select u.*, row_number() over (partition by "user" order by "index" desc) as rnk
    from some_table u
)
where rnk = 1;
user is a reserved word, so it has to be quoted here; you should use a different name for the column.

select "user", max("index") as "index" from tbl
group by "user";

Alternatively, you can use analytic functions:
select "user", "index", max("index") over (partition by "user") as highest from YOURTABLE
Note: Try not to use words like user, index or date as column names, as they are reserved words in Oracle. If you do use them, put them in quotation marks, e.g. "index", "date".
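The row_number() approach mentioned in the answers is the one that extends to tables with extra columns, because it keeps the whole winning row rather than only the maximum. Here is a minimal sketch of it, run on an in-memory SQLite database through Python's sqlite3 module as a stand-in for Oracle. The `label` column and its values are hypothetical, added only to show a third column travelling with the selected row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "user" and "index" are quoted because they are keywords; label is a made-up extra column.
conn.execute('create table some_table ("user" int, "index" int, label text)')
conn.executemany(
    "insert into some_table values (?, ?, ?)",
    [
        (1, 1, "a"), (2, 1, "b"), (2, 2, "c"), (3, 4, "d"),
        (3, 7, "e"), (4, 1, "f"), (5, 1, "g"),
    ],
)

top_query = """
select "user", "index", label
from (
    select u.*,
           -- rank rows within each user, highest index first
           row_number() over (partition by "user" order by "index" desc) as rnk
    from some_table u
) ranked
where rnk = 1
order by "user"
"""

top_rows = conn.execute(top_query).fetchall()
for row in top_rows:
    print(row)
```

If two rows of the same user tie on the highest index, row_number() keeps only one of them arbitrarily; rank() would keep both.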
