Get certain rows, plus rows before and after - sql

Let's say I have the following data set:
ID
Identifier
Admission_Date
Release_Date
234
2
5/1/22
5/5/22
234
1
4/25/22
4/30/22
234
2
4/20/22
4/24/22
234
2
4/15/22
4/18/22
789
1
7/15/22
7/19/22
789
2
7/8/22
7/14/22
789
2
7/1/22
7/5/22
321
2
6/1/21
6/3/21
321
2
5/27/21
5/31/21
321
1
5/20/21
5/26/21
321
2
5/15/21
5/19/21
321
2
5/6/21
5/10/21
I want all rows with identifier=1. I also want rows that are either directly below or above rows with Identifier=1 - sorted by most recent to least recent.
There is always a row below rows with identifier=1. There may or may not be a row above. If there is no row with identifier=1 for an ID, then it will not be brought in with a prior step.
The resulting data set should be as follows:
ID
Identifier
Admission Date
Release Date
234
2
5/1/22
5/5/22
234
1
4/25/22
4/30/22
234
2
4/20/22
4/24/22
789
1
7/15/22
7/19/22
789
2
7/8/22
7/14/22
321
2
5/27/21
5/31/21
321
1
5/20/21
5/26/21
321
2
5/15/21
5/19/21
I am using DBeaver, which runs PostgreSQL.

I admittedly don't know Postgres well so the following could possibly be optimised, however using a combination of lag and lead to obtain the previous and next dates (assuming Admission_date is the one to order by) you could try
with d as (
select *,
case when identifier = 1 then Lag(admission_date) over(partition by id order by Admission_Date desc) end pd,
case when identifier = 1 then Lead(admission_date) over(partition by id order by Admission_Date desc) end nd
from t
)
select id, Identifier, Admission_Date, Release_Date
from d
where identifier = 1
or exists (
select * from d d2
where d2.id = d.id
and (d.Admission_Date = pd or d.admission_date = nd)
)
order by Id, Admission_Date desc;

One way:
SELECT (x.my_row).* -- decompose fields from row type
FROM (
SELECT identifier
, lag(t) OVER w AS t0 -- take whole row
, t AS t1
, lead(t) OVER w AS t2
FROM tbl t
WINDOW w AS (PARTITION BY id ORDER BY admission_date)
) sub
CROSS JOIN LATERAL (
VALUES (t0), (t1), (t2) -- pivot
) x(my_row)
WHERE sub.identifier = 1
AND (x.my_row).id IS NOT NULL; -- exclude rows with NULL ( = missing row)
db<>fiddle here
The query is designed to only make a single pass over the table.
Uses some advanced SQL / Postgres features.
About LATERAL:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
About the VALUES expression:
Postgres: convert single row to multiple rows (unpivot)
The manual about extracting fields from a composite type.
If there are many rows per id, other solutions will be (much) faster - with proper index support. You did not specify ...

Related

Postgres query with limit that selects all records with similar identifier

I have a table that looks something like this:
customer_id
data
1
123
1
456
2
789
2
101
2
121
2
123
3
123
4
456
What I would like to do is perform a SELECT combined with a LIMIT X to get X number of records as well as any other records that have the same customer_id
Example query: SELECT customer_id, data FROM table ORDER BY customer_id LIMIT 3;
This query returns:
customer_id
data
1
123
1
456
2
789
I'd like a query that will look at the last customer_id value and return all remaining records that match beyond the LIMIT specified. Is it possible to do this in a single operation?
Desired output:
customer_id
data
1
123
1
456
2
789
2
101
2
121
2
123
In Postgres 13 can use with ties:
select t.*
from t
order by customer_id
fetch first 3 rows with ties;
In earlier versions you can use in:
select t.*
from t
where t.customer_id in (select t2.customer_id
from t t2
order by t2.customer_id
limit 3
);
You can use corelated subquery with count as follows:
Select t.*
From t
Where 3 >= (select count(distinct customer_id)
From t tt
where t.customer_id >= tt.customer_id)

How to join two tables without performing Cartesian product in SQL

I have index_date information for IDs and I want to extract baseline ( information between index_date and Index_date minus 6 months). I want to do this without using Cartesian product.
Total Table
ID index_date detail
1 01Jan2012 xyz
1 01Dec2011 pqr
1 01Nov2010 pqr
2 26Feb2013 abc
3 02Mar2013 abc
3 02Feb2013 ert
3 02Jan2013 tyu
4 07May2015 rts
I have a table A extracted from Total which has the index_dates:
ID index_date index_detail
1 01Jan2012 xyz
2 26Feb2013 abc
3 02Mar2013 abc
4 07May2015 rts
I want to extract baseline periods data for IDs in A from from the Total table
Table want :
ID date index_date detail index_detail
1 01Jan2012 01Jan2012 xyz xyz
1 01Dec2011 01Jan2012 pqr xyz
2 26Feb2013 26Feb2013 abc abc
3 02Mar2013 02Mar2013 abc abc
3 02Feb2013 02Mar2013 ert abc
3 02Jan2013 02Mar2013 tyu abc
4 07May2015 07May2015 rts rts
code used :
create table want as
select a.* , b.date,b.detail
from table_a as a
right join
Total as b
on a.id = b.id where
a.index_date > b.date
AND b.date >= add_months( a.index_date, -6)
;
But this requires Cartesian Product. Is there a way to do this without requiring Cartesian product.
DBMS - Hive
Sorry, I don't know it.
I'll give the solution on pure SQL for MySQL 8+ - maybe you'll find the way to convert it to Hive syntax.
SELECT id,
index_date date,
FIRST_VALUE(index_date) OVER (PARTITION BY ID ORDER BY STR_TO_DATE(index_date, '%d%b%Y') DESC) index_date,
detail,
FIRST_VALUE(detail) OVER (PARTITION BY ID ORDER BY STR_TO_DATE(index_date, '%d%b%Y') DESC) index_detail
FROM test
ORDER BY 1 ASC, 2 DESC
fiddle
I would recommend three steps:
Convert the date to a number.
Find the minimum date in a six month period.
Get the first value in that group.
This looks like:
select t.*, t2.index_date, t2.detail
from (select t.*,
min(index_date) over (partition by id
order by months
range between 6 preceding and current row
) as sixmonth_date
from (select t.*,
year(index_date) * 12 + month(index_date) as months
from total t
) t
) t left join
total t2
on t2.id = t.id and t2.index_date = t.sixmonth_date;
This is marginally simpler if first_value() accepts range window frames -- but I'm not sure if it does. It is worth trying, though:
select t.*,
min(index_date) over (partition by id
order by months
range between 6 preceding and current row
) as sixmonth_date,
first_value(detail) over (partition by id
order by months
range between 6 preceding and current row
) as sixmonth_value
from (select t.*,
year(index_date) * 12 + month(index_date) as months
from total t
) t

Filter and keep most recent duplicate

Please help me with this one, I'm stuck and cant figure out how to write my Query. I'm working with SQL Server 2014.
Table A (approx 65k ROWS) CEID = primary key
CEID State Checksum
1 2 666
2 2 666
3 2 666
4 2 333
5 2 333
6 9 333
7 9 111
8 9 111
9 9 741
10 2 656
Desired output
CEID State Checksum
3 2 666
6 9 333
8 9 111
9 9 741
10 2 656
I want to keep the row with highest CEID if "state" is equal for all duplicate checksums. If state differs but Checksum is equal i want to keep the row with highest CEID for State=9. Unique rows like CEID 9 and 10 should be included in result regardless of State.
This join returns all duplicates:
SELECT a1.*, a2.*
FROM tableA a1
INNER JOIN tableA a2 ON a1.ChecksumI = a2.ChecksumI
AND a1.CEID <> a2.CEID
I've also identified MAX(CEID) for each duplicate checksum with this query
SELECT a.Checksum, a.State, MAX(a.CEID) CEID_MAX ,COUNT(*) cnt
FROM tableA a
GROUP BY a.Checksum, a.State
HAVING COUNT(*) > 1
ORDER BY a.Checksum, a.State
With the first query, I can't figure out how to SELECT the row with the highest CEID per Checksum.
The problem I encounter with last one is that GROUP BY isn't allowed in subqueries when I try to join on it.
You can use row_number() with partition by checksum and order by State desc and CEID desc. Note both of your condition may be fulfill by ORDER BY State desc, CEID desc
And take the first row_number
;with
cte as
(
select *, rn = row_number() over (Partition by Checksum order by State desc, CEID desc)
from TableA
)
select *
from cte
where rn = 1
order by CEID;

SQL query to get value of another column corresponding to a max value of a column based on group by

I have the following table:
ID BLOWNUMBER TIME LADLE
--- ---------- ---------- -----
124 1 01/01/2012 2
124 1 02/02/2012 1
124 1 03/02/2012 0
124 2 04/01/2012 1
125 2 04/06/2012 1
125 2 01/03/2012 0
I want to have the TIME for the maximum value of LADLE for a group of ID & BLOWNUMBER.
Output required:
124 1 01/01/2012
124 2 04/01/2012
125 2 04/06/2012
If you're using SQL Server (or another engine which supports CTE's and ROW_NUMBER), you can use this CTE (Common Table Expression) query:
;WITH CTE AS
(
SELECT
ID, BlowNumber, [Time],
RN = ROW_NUMBER() OVER (PARTITION BY ID, BLOWNUMBER ORDER BY [Time] DESC)
FROM Sample
)
SELECT *
FROM CTE
WHERE RN = 1
See this SQL Fiddle here for an online live demo.
This CTE "partitions" your data by (ID, BLOWNUMBER), and the ROW_NUMBER() function hands out numbers, starting at 1, for each of those "partitions", ordered by the [Time] columns (newest time value first).
Then, you just select from that CTE and use RN = 1 to get the most recent of each data partition.
If you are using sqllite (probably compatible with other DBs as well); you could do:
select
ct.id
, ct.blownumber
, time
from
new
, (
select
id
, blownumber
, max(ladle) as ldl
from
new
group by
id
, blownumber
) ct
where
ct.id = new.id
and ct.blownumber = new.blownumber
and ct.ldl = new.ladle;

Implementing Hierarchy in SQL

Suppose I have a table which has a "CDATE" representing the date when I retrieved the data, a "SECID" identifying the security I retrieved data for, a "SOURCE" designating where I got the data and the "VALUE" which I got from the source. My data might look as following:
CDATE | SECID | SOURCE | VALUE
--------------------------------
1/1/2012 1 1 23
1/1/2012 1 5 45
1/1/2012 1 3 33
1/4/2012 2 5 55
1/5/2012 1 5 54
1/5/2012 1 3 99
Suppose I have a HIERARCHY table like the following ("SOURCE" with greatest HIERARCHY number takes precedence):
SOURCE | NAME | HIERARCHY
---------------------------
1 ABC 10
3 DEF 5
5 GHI 2
Now let's suppose I want my results to be picked according to the hierarchy above. So applying the hierarch and selecting the source with the greatest HIERARCHY number I would like to end up with the following:
CDATE | SECID | SOURCE | VALUE
---------------------------------
1/1/2012 1 1 23
1/4/2012 2 5 55
1/5/2012 1 3 99
This joins on your hierarchy and selects the top-ranked source for each date and security.
SELECT CDATE, SECID, SOURCE, VALUE
FROM (
SELECT t.CDATE, t.SECID, t.SOURCE, t.VALUE,
ROW_NUMBER() OVER (PARTITION BY t.CDATE, t.SECID
ORDER BY h.HIERARCHY DESC) as nRow
FROM table1 t
INNER JOIN table2 h ON h.SOURCE = t.SOURCE
) A
WHERE nRow = 1
You can get the results you want with the below. It combines your data with your hierarchies and ranks them according to the highest hierarchy. This will only return one result arbitrarily though if you have a source repeated for the same date.
;with rankMyData as (
select
d.CDATE
, d.SECID
, d.SOURCE
, d.VALUE
, row_number() over(partition by d.CDate, d.SECID order by h.HIERARCHY desc) as ranking
from DATA d
inner join HIERARCHY h
on h.source = d.source
)
SELECT
CDATE
, SECID
, SOURCE
, VALUE
FROM rankMyData
where ranking = 1