Select distinct rows "modulo null" - sql

Suppose I have a table mytable:
a b c d
------------------------
1 2 3 4
1 1 1 null
1 2 3 4
1 null null null
1 2 null null
1 null 1 null
null null null null
Now the first and third rows of this table are exact duplicates. However, we can also think of the fifth row as duplicating the information contained in the first row, in the sense that 1 2 null null is just a copy of 1 2 3 4 but with some data missing. Let's say that 1 2 null null is covered by 1 2 3 4.
"Covered by" is a relationship like <=, while "exact duplication" is a relationship like ==. In the table above, we also have that the sixth row is covered by the second row, the fourth row is covered by all other rows except for the last, the last row is covered by all other rows, and the first and third rows are covered by each other.
Now I want to deduplicate mytable using this notion of covering. Said differently, I want the "minimal cover." That means that whenever row1 <= row2, row1 should be removed from the result. In this case, the outcome is
a b c d
------------------------
1 2 3 4
1 1 1 null
This is like SELECT DISTINCT, but with enhanced null-handling behavior.
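For contrast, a plain SELECT DISTINCT would only collapse the exact duplicate (the first and third rows): DISTINCT considers two NULLs to be not distinct from each other, but it has no notion of one row covering another.
SELECT DISTINCT a, b, c, d
FROM mytable;
-- keeps 6 of the 7 rows above, whereas the desired deduplicated result has only 2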
More formally, we can define deduplicate(table) as the subset of rows of table such that:
for every row r of table, there exists a row c of deduplicate(table) such that r <= c, and
if c1 and c2 are any two separate rows in deduplicate(table), then c1 <= c2 does not hold.
Or algorithmically:
def deduplicate(table):
    outcome = set()
    for nextRow in table:
        if any(nextRow <= o for o in outcome):
            # nextRow is covered by a row we already kept
            continue
        else:
            # evict any kept rows that nextRow covers, then keep nextRow
            for possiblyNowADuplicate in list(outcome):  # iterate over a copy; we mutate outcome
                if possiblyNowADuplicate <= nextRow:
                    # it is now a duplicate
                    outcome.remove(possiblyNowADuplicate)
            outcome.add(nextRow)
    return outcome
How can I do this in SQL?
(I'm working in Presto, which allegedly implements modern ANSI SQL; moreover, the table I'm working with has many more columns and tons more rows than mytable, so the solution has to scale reasonably well, both in code complexity (ideally should not require code length O(n^2) in the number of columns!), and in terms of execution time.)
Edit: Based on @toonice's response, I have the following refinements:
On further reflection, it'd be nice if the query code length were O(1) in the number of columns (possibly excluding a single explicit naming of the columns to be operated on in a subtable select, for maintainability). Having a complex boolean condition for each column in both a GROUP BY and an ORDER BY is a bit much; I'd have to write a Python script to generate my SQL query. It may be that this is unavoidable, however.
I am operating on at least millions of rows. I cannot do this in O(n^2) time. So:
Is it possible to do this faster?
If not, I should mention that in my real dataset, I have a nonnull column "userid" such that each userid has at most say 100 rows associated with it. Can we take advantage of this segmentation to do the quadratic stuff only over each userid, and then recombine the data all back together? (And there are 60k users, so I definitely cannot name them explicitly in the query.)

Please try the following...
SELECT DISTINCT leftTable.a,
leftTable.b,
leftTable.c,
leftTable.d
FROM tblTable AS leftTable
JOIN tblTable AS rightTable ON ( ( leftTable.a = rightTable.a OR
rightTable.a IS NULL ) AND
( leftTable.b = rightTable.b OR
rightTable.b IS NULL ) AND
( leftTable.c = rightTable.c OR
rightTable.c IS NULL ) AND
( leftTable.d = rightTable.d OR
rightTable.d IS NULL ) )
GROUP BY rightTable.a,
rightTable.b,
rightTable.c,
rightTable.d
ORDER BY ISNULL( leftTable.a ),
leftTable.a DESC,
ISNULL( leftTable.b ),
leftTable.b DESC,
ISNULL( leftTable.c ),
leftTable.c DESC,
ISNULL( leftTable.d ),
leftTable.d DESC;
This statement starts by performing an INNER JOIN on two copies of tblTable, which I have given the aliases of leftTable and rightTable. This join appends a copy of each record from rightTable to every record in leftTable where the record from leftTable covers the one from rightTable.
The resulting dataset is then grouped to eliminate any duplicate entries in the fields from leftTable.
The grouped dataset is then ordered into descending order, with surviving NULL values being placed after non-NULL values.
Extension
You can use SELECT DISTINCT leftTable.* on the first line if you are happy with selecting all fields from leftTable - I've just gotten into the habit of listing the fields. Either will work just fine in this case. leftTable.* may prove more wieldy if you are dealing with a large number of fields. I'm not sure if there is a difference in execution time between the two methods.
I have not been able to find a way to say "where all fields are equal" in a WHERE clause, either by saying leftTable.* = rightTable.* or something equivalent. Our situation is further complicated by the fact that we are not testing for equivalence, but for covering. Whilst I'd love it if there were a way to test for covering en masse, I'm afraid that you will just have to do a lot of copying, pasting and carefully changing letters so that the test used for each field in my Answer is applied to each of your fields.
Also, I have not been able to find a way to GROUP BY all fields, either in the order that they occur in the table or in any order, short of specifying every field to be grouped on. This too would be nice to know, but for now I think you will have to specify each field from rightTable. Seek out the glories and beware the dangers of copy, paste and edit!
If you do not care about if a row is ordered first or last when the value it is being ordered on is NULL, then you can speed up the statement slightly by removing the ISNULL() conditions from the ORDER BY clause.
If you do not care about ordering at all you can further speed up the statement by removing the ORDER BY clause entirely. Depending on the quirks of your language, you will want to replace it with either nothing or with ORDER BY NULL. Some languages, such as MySQL, automatically sort by the fields specified in a GROUP BY clause unless an ORDER BY clause is specified. ORDER BY NULL is effectively a way of telling it not to do any sorting.
If we are only deduplicating covered records for each user (i.e. each user's records have no bearing on the records of other users), then the following statement should be used...
SELECT DISTINCT leftTable.userid,
leftTable.a,
leftTable.b,
leftTable.c,
leftTable.d
FROM tblTable AS leftTable
JOIN tblTable AS rightTable ON ( leftTable.userid = rightTable.userid AND
( leftTable.a = rightTable.a OR
rightTable.a IS NULL ) AND
( leftTable.b = rightTable.b OR
rightTable.b IS NULL ) AND
( leftTable.c = rightTable.c OR
rightTable.c IS NULL ) AND
( leftTable.d = rightTable.d OR
rightTable.d IS NULL ) )
GROUP BY rightTable.userid,
rightTable.a,
rightTable.b,
rightTable.c,
rightTable.d
ORDER BY leftTable.userid,
ISNULL( leftTable.a ),
leftTable.a DESC,
ISNULL( leftTable.b ),
leftTable.b DESC,
ISNULL( leftTable.c ),
leftTable.c DESC,
ISNULL( leftTable.d ),
leftTable.d DESC;
By eliminating, in a dataset that large, the need to join every other user's records to those of each user, you remove a lot of processing overhead, far more than is added by selecting one more field for output, testing one more pair of fields when joining, adding one more level of grouping, and ordering by one more field.
I'm afraid that I cannot think of any other way to make this statement more efficient. If anyone does know of a way, then I would like to hear about it.
If you have any questions or comments, then please feel free to post a Comment accordingly.
Appendix
This code was tested in MySQL using a dataset created using the following script...
CREATE TABLE tblTable
(
    a INT,
    b INT,
    c INT,
    d INT
);

INSERT INTO tblTable ( a,
                       b,
                       c,
                       d )
VALUES ( 1, 2, 3, 4 ),
       ( 1, 1, 1, NULL ),
       ( 1, 2, 3, 4 ),
       ( 1, NULL, NULL, NULL ),
       ( 1, 2, NULL, NULL ),
       ( 1, NULL, 1, NULL ),
       ( NULL, NULL, NULL, NULL );

Related

More than one row returned by a subquery used as an expression when UPDATE on multiple rows

I'm trying to update rows in a single table by splitting them into two "sets" of rows.
The top part of the set should have its status set to X and the bottom one should have its status set to Y.
I've tried putting together a query that looks like this
WITH x_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
LIMIT 5
), y_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
((SELECT id from x_status), 'X'),
((SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (folks.ids);
When I run this query I get the following error:
pq: more than one row returned by a subquery used as an expression
This makes sense: folks.ids is expected to return a list of IDs, hence the IN clause in the UPDATE statement, but I suspect the problem is that I cannot return the list in the VALUES statement in the FROM clause, as it turns into something like this:
(1, 2, 3, 4, 5, 5)
(6, 7, 8, 9, 1)
Is there a way this UPDATE can be done using a CTE query at all? I could split this into two separate UPDATE queries, but a CTE query would be better and, in theory, faster.
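For reference, the two-statement version I'd like to avoid would look roughly like this (a sketch; it repeats the same subquery twice):
-- mark the five most recently registered 'foo' rows
UPDATE people
SET status = 'X'
WHERE id IN (SELECT id
             FROM people
             WHERE surname = 'foo'
             ORDER BY date_registered DESC
             LIMIT 5);

-- mark the rest
UPDATE people
SET status = 'Y'
WHERE surname = 'foo'
  AND id NOT IN (SELECT id
                 FROM people
                 WHERE surname = 'foo'
                 ORDER BY date_registered DESC
                 LIMIT 5);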
I think I understand now... if I get your problem, you want to set the status to 'X' for the five most recently registered records and 'Y' for everything else?
In that case I think the row_number() analytic function would work -- it should do it in a single pass, with two scans, and it eliminates one ORDER BY. Let me know if something like this does what you seek.
with ranked as (
select
id, row_number() over (order by date_registered desc) as rn
from people
)
update people p
set
status = case when r.rn <= 5 then 'X' else 'Y' end
from ranked r
where
p.id = r.id
Any time you do an update from another data set, it's helpful to have a WHERE clause that defines the relationship between the two datasets (the non-ANSI join syntax). This makes it iron-clad exactly what you are updating.
Also I believe this code is pretty readable so it will be easier to build on if you need to make tweaks.
Let me know if I missed the boat.
So after more tinkering, I've come up with a solution.
The reason the previous query fails is that we are not grouping the IDs in the subqueries into arrays, so the result expands into a huge list, as I suspected.
The solution is to group the IDs in the subqueries into an ARRAY -- that way they are returned as a single value in the ids column.
This is the query that does the job. Note that we must unnest the IDs in the WHERE clause:
WITH x_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
LIMIT 5
), y_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
(ARRAY(SELECT id from x_status), 'X'),
(ARRAY(SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (SELECT * from unnest(folks.ids));

SQL group by selecting top rows with possible nulls

The example table:
id   name   create_time           group_id
-------------------------------------------
1    a      2022-01-01 12:00:00   group1
2    b      2022-01-01 13:00:00   group1
3    c      2022-01-01 12:00:00   NULL
4    d      2022-01-01 13:00:00   NULL
5    e      NULL                  group2
I need to get the top 1 row (the one with the minimal create_time) per group_id, with these conditions:
create_time can be null - it should be treated as a minimal value
group_id can be null - all rows with a null group_id should be returned (if that's not possible, we can use coalesce(group_id, id) or something like that, assuming that ids are unique and never collide with group ids)
it should be possible to apply pagination on the query (so join can be a problem)
the query should be as universal as possible (so no vendor-specific things). Again, if that's not possible, it should work in MySQL 5 & 8, PostgreSQL 9+ and H2
The expected output for the example:
id   name   create_time           group_id
-------------------------------------------
1    a      2022-01-01 12:00:00   group1
3    c      2022-01-01 12:00:00   NULL
4    d      2022-01-01 13:00:00   NULL
5    e      NULL                  group2
I've already read similar questions on SO, but 90% of the answers rely on vendor-specific keywords (numerous answers use PARTITION BY, like https://stackoverflow.com/a/6841644/5572007) and the others don't honor null values in the group condition columns, and probably not pagination either (like https://stackoverflow.com/a/14346780/5572007).
You can combine two queries with UNION ALL. E.g.:
select id, name, create_time, group_id
from mytable
where group_id is not null
and not exists
(
select null
from mytable older
where older.group_id = mytable.group_id
and older.create_time < mytable.create_time
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;
This is standard SQL and very basic at that. It should work in about every RDBMS.
As to pagination: this is usually costly, as you run the same query again and again in order to always pick the "next" part of the result, instead of running the query only once. The best approach is usually to use the primary key to get to the next part so an index on the key can be used. In the above query we'd ideally add where id > :last_biggest_id to the queries and limit the result, which would be fetch next <n> rows only in standard SQL. Every time we run the query, we use the last read ID as :last_biggest_id, so we read on from there.
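Applied to the query above, a keyset-paginated page could look roughly like this (a sketch: :last_biggest_id is a placeholder for however your environment passes the last ID read, the page size of 20 is arbitrary, and your DBMS may want LIMIT instead of the standard fetch clause):
select id, name, create_time, group_id
from mytable
where group_id is not null
and id > :last_biggest_id
and not exists
(
select null
from mytable older
where older.group_id = mytable.group_id
and older.create_time < mytable.create_time
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
and id > :last_biggest_id
order by id
fetch next 20 rows only;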
Variables, however, are dealt with differently in the various DBMS; most commonly they are preceded by either a colon, a dollar sign or an at sign. And the standard fetch clause, too, is supported by only some DBMS, while others have a LIMIT or TOP clause instead.
If these little differences make it impossible to apply them, then you must find a workaround. For the variable this can be a one-row-table holding the last read maximum ID. For the fetch clause this can mean you simply fetch as many rows as you need and stop there. Of course this isn't ideal, as the DBMS doesn't know then that you only need the next n rows and cannot optimize the execution plan accordingly.
And then there is the option not to do the pagination in the DBMS, but read the complete result into your app and handle pagination there (which then becomes a mere display thing and allocates a lot of memory of course).
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
Not sure how you imagine "pagination" should work. Here's one way:
and (
select count(distinct coalesce(t2.group_id, t2.id)) from T t2
where coalesce(t2.group_id, t2.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5 /* for example */
order by coalesce(t1.group_id, t1.id)
I'm assuming there's an implicit cast from 0 to a date value with a resulting value lower than all those in your database. Not sure if that's reliable. (Try '19000101' instead?) Otherwise the rest should be universal. You could probably also parameterize that in the same way as the page range.
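If the implicit cast turns out not to be reliable, a sketch with an explicit sentinel timestamp (assuming your DBMS accepts the standard timestamp literal syntax) would be:
select * from T t1
where coalesce(create_time, timestamp '1900-01-01 00:00:00') = (
select min(coalesce(create_time, timestamp '1900-01-01 00:00:00')) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)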
You've also got a potential complication with collisions between the group_id and id spaces. Yours don't appear to have that problem, though having mixed data types creates its own issues.
This all gets more difficult when you want to order by other columns like name:
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
) and (
select count(*) from (
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
) t3
where t3.name < t1.name or t3.name = t1.name
and coalesce(t3.group_id, t3.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5
order by t1.name;
That does handle ties but also makes the simplifying assumption that name can't be null which would add yet another small twist. At least you can see that it's possible without CTEs and window functions but expect these to also be a lot less efficient to run.
https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=9697fd274e73f4fa7c1a3a48d2c78691
I would guess
SELECT id, name, MAX(create_time), group_id
FROM tb GROUP BY group_id
UNION ALL
SELECT id, name, create_time, group_id
FROM tb WHERE group_id IS NULL
ORDER BY name
I should point out that 'name' is a reserved word.

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem -- they're both remittances for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:
(screenshot of the two rows omitted)
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

SQL percentage of the total

Hi, how can I get the percentage of each record over the total?
Let's imagine I have one table with the following:
ID code Points
1 101 2
2 201 3
3 233 4
4 123 1
The percentage for ID 1 is 20%, for ID 2 it is 30%, and so on.
How do I get it?
There are a couple of approaches to getting that result.
You essentially need the "total" points from the whole table (or whatever subset), repeated on each row. Getting the percentage is then a simple matter of arithmetic; the exact expression depends on the datatypes and on how you want the result formatted.
Here's one way (out of a couple of possible ways) to get the specified result:
SELECT t.id
, t.code
, t.points
-- , s.tot_points
, ROUND(t.points * 100.0 / s.tot_points,1) AS percentage
FROM onetable t
CROSS
JOIN ( SELECT SUM(r.points) AS tot_points
FROM onetable r
) s
ORDER BY t.id
The view query s is run first, and it gives a single row. The join operation matches that row with every row from t. And that gives us the values we need to calculate a percentage.
Another way to get this result, without using a join operation, is to use a subquery in the SELECT list to return the total.
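That variant would look something like this (one possible sketch):
SELECT t.id
     , t.code
     , t.points
     , ROUND(t.points * 100.0 / ( SELECT SUM(r.points)
                                     FROM onetable r
                                 ),1) AS percentage
  FROM onetable t
 ORDER BY t.id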
Note that the join approach can be extended to get percentage for each "group" of records.
id type points %type
-- ---- ------ -----
1 sold 11 22%
2 sold 4 8%
3 sold 25 50%
4 bought 1 50%
5 bought 1 50%
6 sold 10 20%
To get that result, we can use the same query, but with a view query for s that returns a total for each r.type (GROUP BY r.type), and then the join operation isn't a CROSS join, but a match based on type:
SELECT t.id
, t.type
, t.points
-- , s.tot_points_by_type
, ROUND(t.points * 100.0 / s.tot_points_by_type,1) AS `%type`
FROM onetable t
JOIN ( SELECT r.type
, SUM(r.points) AS tot_points_by_type
FROM onetable r
GROUP BY r.type
) s
ON s.type = t.type
ORDER BY t.id
To do that same result with the subquery, that's going to be a correlated subquery, and that subquery is likely to get executed for every row in t.
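For illustration, a sketch of that correlated form:
SELECT t.id
     , t.type
     , t.points
     , ROUND(t.points * 100.0 / ( SELECT SUM(r.points)
                                     FROM onetable r
                                    WHERE r.type = t.type
                                 ),1) AS `%type`
  FROM onetable t
 ORDER BY t.id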
This is why it's more natural for me to use a join operation, rather than a subquery in the SELECT list... even when a subquery works the same. (The patterns we use for more complex queries, like assigning aliases to tables, qualifying all column references, and formatting the SQL... those patterns just work their way back into simple queries. The rationale for these patterns is kind of lost in simple queries.)
Try like this:
select id, code, points, (points * 100)/(select sum(points) from table1) from table1
To add to a good list of responses, this should be fast performance-wise, and rather easy to understand:
DECLARE @T TABLE (ID INT, code VARCHAR(256), Points INT)
INSERT INTO @T VALUES (1,'101',2), (2,'201',3), (3,'233',4), (4,'123',1)
;WITH CTE AS
(SELECT * FROM @T)
SELECT C.*, CAST(ROUND((C.Points/B.TOTAL)*100, 2) AS DEC(32,2)) [%_of_TOTAL]
FROM CTE C
JOIN (SELECT CAST(SUM(Points) AS DEC(32,2)) TOTAL FROM CTE) B ON 1=1
Just replace the table variable with your actual table inside the CTE.

How can I select adjacent rows to an arbitrary row (in sql or postgresql)?

I want to select some rows based on certain criteria, and then take one entry from that set and the 5 rows before it and after it.
Now, I can do this numerically if there is a primary key on the table (e.g. primary keys that are numerically 5 less than the target row's key and 5 more than the target row's key).
So select the row with the primary key of 7 and the nearby rows:
select primary_key from table where primary_key >= (7-5) order by primary_key limit 11;
2
3
4
5
6
-=7=-
8
9
10
11
12
But if I select only certain rows to begin with, I lose that numeric method of using primary keys (and that was assuming the keys didn't have any gaps in their order anyway), and need another way to get the closest rows before and after a certain targeted row.
The primary key output of such a select might look more random and thus less susceptible to mathematical locating (since some results would be filtered out, e.g. with a where active=1):
select primary_key from table where primary_key >= (34-5)
and active=1 order by primary_key limit 11;
30
-=34=-
80
83
100
113
125
126
127
128
129
Note how, due to the gaps in the primary keys caused by the example where condition (for example because there are many inactive items), I'm no longer getting the closest 5 above and 5 below; instead I'm getting the closest 1 below and the closest 9 above.
There are a lot of ways to do it if you run two queries from a programming language, but here's one way to do it in a single SQL query:
(SELECT * FROM table WHERE id >= 34 AND active = 1 ORDER BY id ASC LIMIT 6)
UNION
(SELECT * FROM table WHERE id < 34 AND active = 1 ORDER BY id DESC LIMIT 5)
ORDER BY id ASC
This would return the 5 rows above, the target row, and 5 rows below.
Here's another way to do it with the analytic functions lead and lag. It would be nice if we could use analytic functions in the WHERE clause, so instead you need to use subqueries or CTEs. Here's an example that will work with the pagila sample database.
WITH base AS (
SELECT lag(customer_id, 5) OVER (ORDER BY customer_id) lag,
lead(customer_id, 5) OVER (ORDER BY customer_id) lead,
c.*
FROM customer c
WHERE c.active = 1
AND c.last_name LIKE 'B%'
)
SELECT base.* FROM base
JOIN (
-- Select the center row, coalesce so it still works if there aren't
-- 5 rows in front or behind
SELECT COALESCE(lag, 0) AS lag, COALESCE(lead, 99999) AS lead
FROM base WHERE customer_id = 280
) sub ON base.customer_id BETWEEN sub.lag AND sub.lead
The problem with sgriffinusa's solution is that you don't know which row_number your center row will end up being. He assumed it will be row 30.
For a similar query I use analytic functions without a CTE. Something like:
select ...,
LEAD(gm.id) OVER (ORDER BY Cit DESC) as leadId,
LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
LAG(gm.id) OVER (ORDER BY Cit DESC) as lagId,
LAG(gm.id, 2) OVER (ORDER BY Cit DESC) as lagId2
...
where id = 25912
or leadId = 25912 or leadId2 = 25912
or lagId = 25912 or lagId2 = 25912
Such a query works faster for me than a CTE with a join (the answer from Scott Bailey), but of course it is less elegant.
You could do this utilizing row_number() (available as of PostgreSQL 8.4). This may not be the correct syntax (I'm not familiar with PostgreSQL), but hopefully the idea will be illustrated:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
FROM table
WHERE active=1) t
WHERE 25 < r and r < 35
This will generate a first column having sequential numbers. You can use this to identify the single row and the rows above and below it.
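To avoid hard-coding the 25-and-35 window, one possible refinement (a sketch; the target key of 34 is just the example value from the question) is to number the rows a second time and look up the target row's number:
SELECT t.*
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
      FROM table
      WHERE active=1) t
JOIN (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, primary_key
      FROM table
      WHERE active=1) c
  ON c.primary_key = 34
WHERE t.r BETWEEN c.r - 5 AND c.r + 5
Ideally the numbered set would go in a CTE so it is written only once.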
If you wanted to do it in a 'relationally pure' way, you could write a query that sorted and numbered the rows. Like:
select (
select count(*) from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
Then use that as a common table expression. Write a select which filters it down to the rows you're interested in, then join it back onto itself using a criterion that the index of the right-hand copy of the table is no more than k larger or smaller than the index of the row on the left. Project over just the rows on the right. Like:
with numbered_emps as (
select (
select count(*)
from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
)
select b.*
from numbered_emps a, numbered_emps b
where a.name like '% Smith' -- this is your main selection criterion
and ((b.idx - a.idx) between -5 and 5) -- this is your adjacency fuzzy-join criterion
What could be simpler!
I'd imagine the row-number based solutions will be faster, though.