SQL Group by only correlative rows - sql

Say I have the following table:
Code A B C Date ID
------------------------------
50 1 1 A 2018-01-08 150001
50 1 1 A 2018-01-15 165454
50 1 1 B 2018-02-01 184545
50 1 1 A 2018-02-02 195487
I need the sql query to output the following:
Code A B C Min(Date) Min(ID)
-------------------------------
50 1 1 A 2018-01-08 150001
50 1 1 B 2018-02-01 184545
50 1 1 A 2018-02-02 195487
If I use a standard GROUP BY, rows 1, 2 and 4 are grouped into one row, which is not what I want.
I want to select the row with MIN(Date) and MIN(ID) from each run of consecutive duplicate records, where duplicates are defined by the columns Code, A, B and C.
In this case the first two rows are duplicates, so I want the MIN() row for them.
The third and fourth rows are distinct.
Note that the database is Vertica 8.1, which is very similar to Oracle or PostgreSQL.

I think you would need the analytic function LAG(). Using this function, you can get the value of the previous row (or NULL if it's the first row itself). So you can check if the value on the previous row is different or not, and filter accordingly.
I'm not familiar with Vertica, but this should be the correct documentation for it: https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/LAGAnalytic.htm
Please try the query below, it should do it:
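SELECT l.Code, l.A, l.B, l.C, l.Date, l.ID
FROM (SELECT t.*,
             -- value of C on the previous row of the same Code/A/B group, ordered by date
             LAG(t.C, 1) OVER (PARTITION BY t.Code, t.A, t.B ORDER BY t.Date) prev_val
      FROM table_1 t) l
WHERE l.C != l.prev_val
   OR l.prev_val IS NULL   -- the first row of a group has no predecessor
ORDER BY l.Code, l.A, l.B, l.Date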
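The LAG() filter returns the first row of each run, so its ID equals MIN(ID) only when IDs increase together with Date. If a true MIN() per run is needed, one option is to turn the change flags into a running group number and aggregate on it. The following is a minimal sketch, not tested on Vertica: it runs the SQL on SQLite through Python's sqlite3 module as a stand-in, and the names is_start and island are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (Code INT, A INT, B INT, C TEXT, Date TEXT, ID INT)")
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?, ?, ?, ?)",
    [
        (50, 1, 1, "A", "2018-01-08", 150001),
        (50, 1, 1, "A", "2018-01-15", 165454),
        (50, 1, 1, "B", "2018-02-01", 184545),
        (50, 1, 1, "A", "2018-02-02", 195487),
    ],
)

island_query = """
WITH flagged AS (
    SELECT t.*,
           -- 1 when C differs from the previous row (or there is no previous row)
           CASE WHEN LAG(C) OVER (PARTITION BY Code, A, B ORDER BY Date) = C
                THEN 0 ELSE 1 END AS is_start
    FROM table_1 t
),
islands AS (
    SELECT *,
           -- running count of run starts = sequence number of the run
           SUM(is_start) OVER (PARTITION BY Code, A, B ORDER BY Date) AS island
    FROM flagged
)
SELECT Code, A, B, C, MIN(Date) AS min_date, MIN(ID) AS min_id
FROM islands
GROUP BY Code, A, B, C, island
ORDER BY Code, A, B, min_date
"""

island_rows = conn.execute(island_query).fetchall()
for row in island_rows:
    print(row)
```

With the sample data this yields three rows, one per run, matching the desired output in the question.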

Related

Get certain rows, plus rows before and after

Let's say I have the following data set:
ID   Identifier  Admission_Date  Release_Date
---------------------------------------------
234  2           5/1/22          5/5/22
234  1           4/25/22         4/30/22
234  2           4/20/22         4/24/22
234  2           4/15/22         4/18/22
789  1           7/15/22         7/19/22
789  2           7/8/22          7/14/22
789  2           7/1/22          7/5/22
321  2           6/1/21          6/3/21
321  2           5/27/21         5/31/21
321  1           5/20/21         5/26/21
321  2           5/15/21         5/19/21
321  2           5/6/21          5/10/21
I want all rows with identifier=1. I also want rows that are either directly below or above rows with Identifier=1 - sorted by most recent to least recent.
There is always a row below rows with identifier=1. There may or may not be a row above. If there is no row with identifier=1 for an ID, then it will not be brought in with a prior step.
The resulting data set should be as follows:
ID   Identifier  Admission Date  Release Date
---------------------------------------------
234  2           5/1/22          5/5/22
234  1           4/25/22         4/30/22
234  2           4/20/22         4/24/22
789  1           7/15/22         7/19/22
789  2           7/8/22          7/14/22
321  2           5/27/21         5/31/21
321  1           5/20/21         5/26/21
321  2           5/15/21         5/19/21
I am using DBeaver connected to a PostgreSQL database.
I admittedly don't know Postgres well, so the following could possibly be optimised. However, using a combination of lag and lead to obtain the previous and next dates (assuming Admission_Date is the column to order by), you could try:
with d as (
  select *,
    case when identifier = 1 then lag(admission_date) over(partition by id order by admission_date desc) end pd,
    case when identifier = 1 then lead(admission_date) over(partition by id order by admission_date desc) end nd
  from t
)
select id, identifier, admission_date, release_date
from d
where identifier = 1
  or exists (
    select * from d d2
    where d2.id = d.id
      -- pd/nd are taken from the identifier = 1 row (d2), not from the outer row
      and (d.admission_date = d2.pd or d.admission_date = d2.nd)
  )
order by id, admission_date desc;
One way:
SELECT (x.my_row).*                     -- decompose fields from row type
FROM  (
   SELECT identifier
        , lag(t)  OVER w AS t0          -- take whole row
        , t              AS t1
        , lead(t) OVER w AS t2
   FROM   tbl t
   WINDOW w AS (PARTITION BY id ORDER BY admission_date)
   ) sub
CROSS  JOIN LATERAL (
   VALUES (t0), (t1), (t2)              -- pivot
   ) x(my_row)
WHERE  sub.identifier = 1
AND    (x.my_row).id IS NOT NULL;       -- exclude rows with NULL ( = missing row)
The query is designed to only make a single pass over the table.
Uses some advanced SQL / Postgres features.
About LATERAL:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
About the VALUES expression:
Postgres: convert single row to multiple rows (unpivot)
The manual about extracting fields from a composite type.
If there are many rows per id, other solutions will be (much) faster - with proper index support. You did not specify ...
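For comparison, a simpler variant just looks at the identifier of the neighbouring rows with LAG() and LEAD() and keeps a row when it, its predecessor, or its successor has identifier = 1. This is only a sketch under a few assumptions: it runs on SQLite through Python's sqlite3 module rather than on Postgres, the dates are stored as ISO strings so that they sort correctly, and prev_ident and next_ident are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INT, identifier INT, admission_date TEXT, release_date TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (234, 2, "2022-05-01", "2022-05-05"),
        (234, 1, "2022-04-25", "2022-04-30"),
        (234, 2, "2022-04-20", "2022-04-24"),
        (234, 2, "2022-04-15", "2022-04-18"),
        (789, 1, "2022-07-15", "2022-07-19"),
        (789, 2, "2022-07-08", "2022-07-14"),
        (789, 2, "2022-07-01", "2022-07-05"),
        (321, 2, "2021-06-01", "2021-06-03"),
        (321, 2, "2021-05-27", "2021-05-31"),
        (321, 1, "2021-05-20", "2021-05-26"),
        (321, 2, "2021-05-15", "2021-05-19"),
        (321, 2, "2021-05-06", "2021-05-10"),
    ],
)

neighbour_query = """
SELECT id, identifier, admission_date, release_date
FROM (
    SELECT t.*,
           -- identifier of the chronologically previous / next row for the same id
           LAG(identifier)  OVER (PARTITION BY id ORDER BY admission_date) AS prev_ident,
           LEAD(identifier) OVER (PARTITION BY id ORDER BY admission_date) AS next_ident
    FROM tbl t
) sub
WHERE identifier = 1 OR prev_ident = 1 OR next_ident = 1
ORDER BY id, admission_date DESC
"""

neighbour_rows = conn.execute(neighbour_query).fetchall()
for row in neighbour_rows:
    print(row)
```

This returns the eight rows from the expected result. The ids come out in numeric order (234, 321, 789) rather than in the order shown in the question, because the sample does not say how ids should be ordered relative to each other.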

Select the first row in the last group of consecutive rows

How would I select the row that is the first occurrence in the last 'grouping' of consecutive rows, where a grouping is defined by the consecutive appearance of a particular column value (state in the example below)?
For example, given the following table:
id  datetime                    state       value_needed
--------------------------------------------------------
1   2021-04-01 09:42:41.319000  incomplete  A
2   2021-04-04 09:42:41.319000  done        B
3   2021-04-05 09:42:41.319000  incomplete  C
4   2021-04-05 10:42:41.319000  incomplete  C
5   2021-04-07 09:42:41.319000  done        D
6   2021-04-12 09:42:41.319000  done        E
I would want the row with id=5, as it is the first occurrence of state=done in the last (i.e. most recent) grouping of state=done.
Assuming all columns NOT NULL.
SELECT *
FROM tbl t1
WHERE NOT EXISTS (
SELECT FROM tbl t2
WHERE t2.state <> t1.state
AND t2.datetime > t1.datetime
)
ORDER BY datetime
LIMIT 1;
NOT EXISTS is only true for the last group of peers. (There is no later row with a different state.)
ORDER BY datetime and take the first. Voilà.
Here's a window function solution that accesses your table only once (which may or may not perform better for large data sets):
SELECT *
FROM (
SELECT *,
LEAD (state) OVER (ORDER BY datetime DESC)
IS DISTINCT FROM state AS first_in_group
FROM tbl
) t
WHERE first_in_group
ORDER BY datetime DESC
LIMIT 1
A dbfiddle based on Erwin Brandstetter's. To illustrate, here's the value of first_in_group for each row:
id datetime state value_needed first_in_group
---------------------------------------------------------------------
6 2021-04-12 09:42:41.319 done E f
5 2021-04-07 09:42:41.319 done D t
4 2021-04-05 10:42:41.319 incomplete C f
3 2021-04-05 09:42:41.319 incomplete C t
2 2021-04-04 09:42:41.319 done B t
1 2021-04-01 09:42:41.319 incomplete A t
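Both approaches can be checked against the sample data. The sketch below is an adaptation, not the original Postgres queries: it runs on SQLite through Python's sqlite3 module, so SELECT FROM becomes SELECT 1, and IS DISTINCT FROM is replaced by a CASE that treats a NULL comparison as "different". Like the first answer, it assumes all columns are NOT NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INT, datetime TEXT, state TEXT, value_needed TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "2021-04-01 09:42:41.319", "incomplete", "A"),
        (2, "2021-04-04 09:42:41.319", "done", "B"),
        (3, "2021-04-05 09:42:41.319", "incomplete", "C"),
        (4, "2021-04-05 10:42:41.319", "incomplete", "C"),
        (5, "2021-04-07 09:42:41.319", "done", "D"),
        (6, "2021-04-12 09:42:41.319", "done", "E"),
    ],
)

# Anti-join: keep rows that have no later row with a different state
not_exists_query = """
SELECT id FROM tbl t1
WHERE NOT EXISTS (
    SELECT 1 FROM tbl t2
    WHERE t2.state <> t1.state
      AND t2.datetime > t1.datetime
)
ORDER BY datetime
LIMIT 1
"""

# Window function: flag rows whose chronological predecessor has a different state
window_query = """
SELECT id FROM (
    SELECT id, datetime,
           CASE WHEN LEAD(state) OVER (ORDER BY datetime DESC) = state
                THEN 0 ELSE 1 END AS first_in_group
    FROM tbl
) t
WHERE first_in_group = 1
ORDER BY datetime DESC
LIMIT 1
"""

id_not_exists = conn.execute(not_exists_query).fetchone()[0]
id_window = conn.execute(window_query).fetchone()[0]
print(id_not_exists, id_window)
```

Both queries return id 5 for this data.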

SQL compares the value of 2 columns and select the column with max value row-by-row

I have a table something like:
GROUP  NAME  Value_1  Value_2
-----------------------------
1      ABC   0        0
1      DEF   4        4
50     XYZ   6        6
50     QWE   6        7
100    XYZ   26       2
100    QWE   26       2
What I would like to do is group by GROUP and select the name with the highest Value_1. If the Value_1 values are the same, compare Value_2 and select the row with the max. If they're still the same, select the first one.
The output will be something like:
GROUP  NAME  Value_1  Value_2
-----------------------------
1      DEF   4        4
50     QWE   6        7
100    XYZ   26       2
The challenge for me here is that I don't know how many categories there are in NAME, so a simple CASE WHEN does not work. Thanks for the help.
You can use window functions to solve the bulk of your problem:
select t.*
from (select t.*,
             -- GROUP is a reserved word, so the column name has to be quoted
             row_number() over (partition by "GROUP" order by value_1 desc, value_2 desc) as seqnum
      from t
     ) t
where seqnum = 1;
The one caveat is the condition:
If they're still the same, select the first one.
SQL tables represent unordered (multi-) sets. There is no "first" one unless a column specifies the ordering. The best you can do is choose an arbitrary value when all the other values are the same.
That said, you might have another column that has an ordering. If so, add that as a third key to the order by.
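To make the effect of such a third key concrete, here is a small sketch. It assumes a hypothetical column seq that records the order in which rows were loaded (the question does not say such a column exists), and it runs on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is a hypothetical ordering column, not part of the original table
conn.execute(
    'CREATE TABLE t (seq INTEGER, "group" INT, name TEXT, value_1 INT, value_2 INT)'
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "ABC", 0, 0),
        (2, 1, "DEF", 4, 4),
        (3, 50, "XYZ", 6, 6),
        (4, 50, "QWE", 6, 7),
        (5, 100, "XYZ", 26, 2),
        (6, 100, "QWE", 26, 2),
    ],
)

top_query = """
SELECT "group", name, value_1, value_2
FROM (
    SELECT t.*,
           -- seq breaks ties deterministically when value_1 and value_2 are equal
           ROW_NUMBER() OVER (PARTITION BY "group"
                              ORDER BY value_1 DESC, value_2 DESC, seq) AS seqnum
    FROM t
) ranked
WHERE seqnum = 1
ORDER BY "group"
"""

top_rows = conn.execute(top_query).fetchall()
for row in top_rows:
    print(row)
```

With seq as the tie-breaker, group 100 resolves to XYZ, as in the desired output. Without it, either XYZ or QWE could be returned for that group.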

Select Distinct Rows And Include Non-Distinct Identifier

Consider a table like the following.
ID Value Change_Date
1 A 1/1/2017
1 B 1/2/2017
1 B 1/3/2017
2 C 1/1/2017
2 C 1/3/2017
3 D 1/1/2017
3 E 1/3/2017
3 F 1/4/2017
3 D 1/10/2017
I would like to perform a select statement which effectively does a distinct, but includes the change_date value of the first chronological occurrence of the distinct row. So the above table would be rendered into something like this:
ID Value Change_Date
1 A 1/1/2017
1 B 1/2/2017
2 C 1/1/2017
3 D 1/1/2017
3 E 1/3/2017
3 F 1/4/2017
3 D 1/10/2017
Is this even remotely possible in Oracle? I'm trying to weed out a bunch of dummy updates from an audit table, but need the change_date so I can show when it changed from value to value. A query like
select distinct id, value from my_table
will obviously get me seed information, but I need to tie it back to the proper dates. I could possibly use this and min(change_date), but that would mean the query would see the first row for id 3 as being the same as the last row for id 3, which is not correct.
EDIT: Please note, I'm not looking for the simple distinct on ID and Value. I need to also include when it switches back to a previous value, as seen for ID 3. All four values of ID 3 should be preserved in the output.
You seem to only want the first row when multiple id/value pairs are in a row. Use lag():
select t.*
from (select t.*,
lag(value) over (partition by id order by change_date) as prev_value
from t
) t
where prev_value is null or prev_value <> value;
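As a quick check that the value that comes back later (D for ID 3) survives the filter, the query can be run on the sample data. This sketch uses SQLite through Python's sqlite3 module instead of Oracle, and it stores the dates as ISO strings so that ORDER BY change_date sorts chronologically. Both are assumptions made only for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INT, value TEXT, change_date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "A", "2017-01-01"),
        (1, "B", "2017-01-02"),
        (1, "B", "2017-01-03"),
        (2, "C", "2017-01-01"),
        (2, "C", "2017-01-03"),
        (3, "D", "2017-01-01"),
        (3, "E", "2017-01-03"),
        (3, "F", "2017-01-04"),
        (3, "D", "2017-01-10"),
    ],
)

change_query = """
SELECT id, value, change_date
FROM (
    SELECT t.*,
           -- value on the chronologically previous row for the same id
           LAG(value) OVER (PARTITION BY id ORDER BY change_date) AS prev_value
    FROM t
) x
WHERE prev_value IS NULL OR prev_value <> value
ORDER BY id, change_date
"""

change_rows = conn.execute(change_query).fetchall()
for row in change_rows:
    print(row)
```

The result has seven rows, including both D rows for ID 3, because each row is compared only with its immediate predecessor.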

Very special kind of AVG statement

Table example:
time a b c
-------------
12:00 1 0 1
12:00 2 3 1
13:00 3 2 1
13:00 3 3 3
14:00 1 1 1
How can I get AVG(a) over only the rows WHERE b != 0, together with AVG(c) over all rows, grouped by time? Is it possible to solve this with SQL only? In other words, the query should not count the 1st row when computing AVG(a), but should still count it for AVG(c).
You can utilize CASE statements to get conditional aggregates:
SELECT AVG(CASE WHEN b != 0 THEN a END)
,AVG(c)
FROM YourTable
GROUP BY time
This works because a value not captured by the WHEN criteria in a CASE expression defaults to NULL, and NULL values are ignored by aggregate functions.
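The conditional aggregate can be tried on the sample table as follows. This is a small sketch using SQLite through Python's sqlite3 module. The table name YourTable comes from the query above, while the aliases avg_a and avg_c are added for readability. Note that SQLite's AVG returns floating-point values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (time TEXT, a INT, b INT, c INT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [
        ("12:00", 1, 0, 1),
        ("12:00", 2, 3, 1),
        ("13:00", 3, 2, 1),
        ("13:00", 3, 3, 3),
        ("14:00", 1, 1, 1),
    ],
)

avg_query = """
SELECT time,
       -- rows with b = 0 contribute NULL here and are skipped by AVG
       AVG(CASE WHEN b != 0 THEN a END) AS avg_a,
       -- every row contributes to the average of c
       AVG(c) AS avg_c
FROM YourTable
GROUP BY time
ORDER BY time
"""

avg_rows = conn.execute(avg_query).fetchall()
for row in avg_rows:
    print(row)
```

For 12:00 the average of a is 2.0, because the row with b = 0 is left out, while the average of c still uses both rows.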
SELECT AVG(a), AVG(c) from table WHERE b != 0
group by time
Yea... is this what you need?
You might want to try something like
SELECT T.tTIME
, AVG(CASE WHEN T.B != 0 THEN T.A END)
, AVG(T.C)
FROM #T T
GROUP BY T.tTIME
The output is the following:
tTIME (No column name) (No column name)
12:00:00.0000000 2 1
13:00:00.0000000 3 2
14:00:00.0000000 1 1