How can I select adjacent rows to an arbitrary row (in sql or postgresql)? - sql

I want to select some rows based on certain criteria, and then take one entry from that set and the 5 rows before it and after it.
Now, I can do this numerically if there is a primary key on the table, (e.g. primary keys that are numerically 5 less than the target row's key and 5 more than the target row's key).
So select the row with the primary key of 7 and the nearby rows:
select primary_key from table where primary_key > (7-5) order by primary_key limit 11;
2
3
4
5
6
-=7=-
8
9
10
11
12
But if I select only certain rows to begin with, I lose that numeric method of using primary keys (and that was assuming the keys didn't have any gaps in their order anyway), and need another way to get the closest rows before and after a certain targeted row.
The primary key output of such a select might look more random and thus less succeptable to mathematical locating (since some results would be filtered, out, e.g. with a where active=1):
select primary_key from table where primary_key > (34-5)
order by primary_key where active=1 limit 11;
30
-=34=-
80
83
100
113
125
126
127
128
129
Note how due to the gaps in the primary keys caused by the example where condition (for example becaseu there are many inactive items), I'm no longer getting the closest 5 above and 5 below, instead I'm getting the closest 1 below and the closest 9 above, instead.

There's a lot of ways to do it if you run two queries with a programming language, but here's one way to do it in one SQL query:
(SELECT * FROM table WHERE id >= 34 AND active = 1 ORDER BY id ASC LIMIT 6)
UNION
(SELECT * FROM table WHERE id < 34 AND active = 1 ORDER BY id DESC LIMIT 5)
ORDER BY id ASC
This would return the 5 rows above, the target row, and 5 rows below.

Here's another way to do it with analytic functions lead and lag. It would be nice if we could use analytic functions in the WHERE clause. So instead you need to use subqueries or CTE's. Here's an example that will work with the pagila sample database.
WITH base AS (
SELECT lag(customer_id, 5) OVER (ORDER BY customer_id) lag,
lead(customer_id, 5) OVER (ORDER BY customer_id) lead,
c.*
FROM customer c
WHERE c.active = 1
AND c.last_name LIKE 'B%'
)
SELECT base.* FROM base
JOIN (
-- Select the center row, coalesce so it still works if there aren't
-- 5 rows in front or behind
SELECT COALESCE(lag, 0) AS lag, COALESCE(lead, 99999) AS lead
FROM base WHERE customer_id = 280
) sub ON base.customer_id BETWEEN sub.lag AND sub.lead
The problem with sgriffinusa's solution is that you don't know which row_number your center row will end up being. He assumed it will be row 30.

For similar query I use analytic functions without CTE. Something like:
select ...,
LEAD(gm.id) OVER (ORDER BY Cit DESC) as leadId,
LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
LAG(gm.id) OVER (ORDER BY Cit DESC) as lagId,
LAG(gm.id, 2) OVER (ORDER BY Cit DESC) as lagId2
...
where id = 25912
or leadId = 25912 or leadId2 = 25912
or lagId = 25912 or lagId2 = 25912
such query works more faster for me than CTE with join (answer from Scott Bailey). But of course less elegant

You could do this utilizing row_number() (available as of 8.4). This may not be the correct syntax (not familiar with postgresql), but hopefully the idea will be illustrated:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
FROM table
WHERE active=1) t
WHERE 25 < r and r < 35
This will generate a first column having sequential numbers. You can use this to identify the single row and the rows above and below it.

If you wanted to do it in a 'relationally pure' way, you could write a query that sorted and numbered the rows. Like:
select (
select count(*) from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
Then use that as a common table expression. Write a select which filters it down to the rows you're interested in, then join it back onto itself using a criterion that the index of the right-hand copy of the table is no more than k larger or smaller than the index of the row on the left. Project over just the rows on the right. Like:
with numbered_emps as (
select (
select count(*)
from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
)
select b.*
from numbered_emps a, numbered_emps b
where a.name like '% Smith' -- this is your main selection criterion
and ((b.idx - a.idx) between -5 and 5) -- this is your adjacency fuzzy-join criterion
What could be simpler!
I'd imagine the row-number based solutions will be faster, though.

Related

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID,
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remitts for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:
enter image description here
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id,
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that that that does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Returning the lowest integer not in a list in SQL

Supposed you have a table T(A) with only positive integers allowed, like:
1,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18
In the above example, the result is 10. We always can use ORDER BY and DISTINCT to sort and remove duplicates. However, to find the lowest integer not in the list, I came up with the following SQL query:
select list.x + 1
from (select x from (select distinct a as x from T order by a)) as list, T
where list.x + 1 not in T limit 1;
My idea is start a counter and 1, check if that counter is in list: if it is, return it, otherwise increment and look again. However, I have to start that counter as 1, and then increment. That query works most of the cases, by there are some corner cases like in 1. How can I accomplish that in SQL or should I go about a completely different direction to solve this problem?
Because SQL works on sets, the intermediate SELECT DISTINCT a AS x FROM t ORDER BY a is redundant.
The basic technique of looking for a gap in a column of integers is to find where the current entry plus 1 does not exist. This requires a self-join of some sort.
Your query is not far off, but I think it can be simplified to:
SELECT MIN(a) + 1
FROM t
WHERE a + 1 NOT IN (SELECT a FROM t)
The NOT IN acts as a sort of self-join. This won't produce anything from an empty table, but should be OK otherwise.
SQL Fiddle
select min(y.a) as a
from
t x
right join
(
select a + 1 as a from t
union
select 1
) y on y.a = x.a
where x.a is null
It will work even in an empty table
SELECT min(t.a) - 1
FROM t
LEFT JOIN t t1 ON t1.a = t.a - 1
WHERE t1.a IS NULL
AND t.a > 1; -- exclude 0
This finds the smallest number greater than 1, where the next-smaller number is not in the same table. That missing number is returned.
This works even for a missing 1. There are multiple answers checking in the opposite direction. All of them would fail with a missing 1.
SQL Fiddle.
You can do the following, although you may also want to define a range - in which case you might need a couple of UNIONs
SELECT x.id+1
FROM my_table x
LEFT
JOIN my_table y
ON x.id+1 = y.id
WHERE y.id IS NULL
ORDER
BY x.id LIMIT 1;
You can always create a table with all of the numbers from 1 to X and then join that table with the table you are comparing. Then just find the TOP value in your SELECT statement that isn't present in the table you are comparing
SELECT TOP 1 table_with_all_numbers.number, table_with_missing_numbers.number
FROM table_with_all_numbers
LEFT JOIN table_with_missing_numbers
ON table_with_missing_numbers.number = table_with_all_numbers.number
WHERE table_with_missing_numbers.number IS NULL
ORDER BY table_with_all_numbers.number ASC;
In SQLite 3.8.3 or later, you can use a recursive common table expression to create a counter.
Here, we stop counting when we find a value not in the table:
WITH RECURSIVE counter(c) AS (
SELECT 1
UNION ALL
SELECT c + 1 FROM counter WHERE c IN t)
SELECT max(c) FROM counter;
(This works for an empty table or a missing 1.)
This query ranks (starting from rank 1) each distinct number in ascending order and selects the lowest rank that's less than its number. If no rank is lower than its number (i.e. there are no gaps in the table) the query returns the max number + 1.
select coalesce(min(number),1) from (
select min(cnt) number
from (
select
number,
(select count(*) from (select distinct number from numbers) b where b.number <= a.number) as cnt
from (select distinct number from numbers) a
) t1 where number > cnt
union
select max(number) + 1 number from numbers
) t1
http://sqlfiddle.com/#!7/720cc/3
Just another method, using EXCEPT this time:
SELECT a + 1 AS missing FROM T
EXCEPT
SELECT a FROM T
ORDER BY missing
LIMIT 1;

filtering rows by checking a condition for group in one statement only

I have the following statement:
SELECT
(CONVERT(VARCHAR(10), f1, 120)) AS ff1,
CONVERT(VARCHAR(10), f2, 103) AS ff2,
...,
Bonus,
Malus,
ClientID,
FROM
my_table
WHERE
<my_conditions>
ORDER BY
f1 ASC
This select returns several rows for each ClientID. I have to filter out all the rows with the Clients that don't have any row with non-empty Bonus or Malus.
How can I do it by changing this select by one statement only and without duplicating all this select?
I could store the result in a #temp_table, then group the data and use the result of the grouping to filter the temp table. - BUT I should do it by one statement only.
I could perform this select twice - one time grouping it and then I can filter the rows based on grouping result. BUT I don't want to select it twice.
May be CTE (Common Table Expressions) could be useful here to perform the select one time only and to be able to use the result for grouping and then for selecting the desired result based on the grouping result.
Any more elegant solution for this problem?
Thank you in advance!
Just to clarify what the SQL should do I add an example:
ClientID Bonus Malus
1 1
1
1 1
2
2
3 4
3 5
3 1
So in this case I don't want the ClientID=2 rows to appear (they are not interesting). The result should be:
ClientID Bonus Malus
1 1
1
1 1
3 4
3 5
3 1
SELECT Bonus,
Malus,
ClientID
FROM my_table
WHERE ClientID not in
(
select ClientID
from my_table
group by ClientID
having count(Bonus) = 0 and count(Malus) = 0
)
A CTE will work fine, but in effect its contents will be executed twice because they are being cloned into all the places where the CTE is being used. This can be a net performance win or loss compared to using a temp table. If the query is very expensive it might come out as a loss. If it is cheap or if many rows are being returned the temp table will lose the comparison.
Which solution is better? Look at the execution plans and measure the performance.
The CTE is the easier, more maintainable are less redundant alternative.
You haven't specified what are data types of Bonus and Malus columns. So if they're integer (or can be converted to integer), then the query below should be helpful. It calculates sum of both columns for each ClientID. These sums are the same for each detail line of the same client so we can use them in WHERE condition. Statement SUM() OVER() is called "windowed function" and can't be used in WHERE clause so I had to wrap your select-list with a parent one just because of syntax.
SELECT *
FROM (
SELECT
CONVERT(VARCHAR(10), f1, 120) AS ff1,
CONVERT(VARCHAR(10), f2, 103) AS ff2,
...,
Bonus,
Malus,
ClientID,
SUM(Bonus) OVER (PARTITION BY ClientID) AS ClientBonusTotal,
SUM(Malus) OVER (PARTITION BY ClientID) AS ClientMalusTotal
FROM
my_table
WHERE
<my_conditions>
) a
WHERE ISNULL(a.ClientBonusTotal, 0) <> 0 OR ISNULL(a.ClientMalusTotal, 0) <> 0
ORDER BY f1 ASC

How do I get the top 10 results of a query?

I have a postgresql query like this:
with r as (
select
1 as reason_type_id,
rarreason as reason_id,
count(*) over() count_all
from
workorderlines
where
rarreason != 0
and finalinsdate >= '2012-12-01'
)
select
r.reason_id,
rt.desc,
count(r.reason_id) as num,
round((count(r.reason_id)::float / (select count(*) as total from r) * 100.0)::numeric, 2) as pct
from r
left outer join
rtreasons as rt
on
r.reason_id = rt.rtreason
and r.reason_type_id = rt.rtreasontype
group by
r.reason_id,
rt.desc
order by r.reason_id asc
This returns a table of results with 4 columns: the reason id, the description associated with that reason id, the number of entries having that reason id, and the percent of the total that number represents.
This table looks like this:
What I would like to do is only display the top 10 results based off the total number of entries having a reason id. However, whatever is leftover, I would like to compile into another row with a description called "Other". How would I do this?
with r2 as (
...everything before the select list...
dense_rank() over(order by pct) cause_rank
...the rest of your query...
)
select * from r2 where cause_rank < 11
union
select
NULL as reason_id,
'Other' as desc,
sum(r2.num) over() as num,
sum(r2.pct) over() as pct,
11 as cause_rank
from r2
where cause_rank >= 11
As said above Limit and for the skipping and getting the rest use offset... Try This Site
Not sure about Postgre but SELECT TOP 10... should do the trick if you sort correctly
However about the second part: You might use a Right Join for this. Join the TOP 10 Result with the whole table data and use only the records not appearing on the left side. If you calculate the sum of those you should get your "Sum of the rest" result.
I assume that vw_my_top_10 is the view showing you the top 10 records. vw_all_records shows all records (including the top 10).
Like this:
SELECT SUM(a_field)
FROM vw_my_top_10
RIGHT JOIN vw_all_records
ON (vw_my_top_10.Key = vw_all_records.Key)
WHERE vw_my_top_10.Key IS NULL

Selecting a single (random) row for an SQL join

I've got an sql query that selects data from several tables, but I only want to match a single(randomly selected) row from another table.
Easier to show some code, I guess ;)
Table K is (k_id, selected)
Table C is (c_id, image)
Table S is (c_id, date)
Table M is (c_id, k_id, score)
All ID-columns are primary keys, with appropriate FK constraints.
What I want, in english, is for eack row in K that has selected = 1 to get a random row from C where there exists a row in M with (K_id, C_id), where the score is higher than a given value, and where c.image is not null and there is a row in s with c_id
Something like:
select k.k_id, c.c_id, m.score
from k,c,m,s
where k.selected = 1
and m.score > some_value
and m.k_id = k.k_id
and m.c_id = c.c_id
and c.image is not null
and s.c_id = c.c_id;
The only problem is this returns all the rows in C that match the criteria - I only want one...
I can see how to do it using PL/SQL to select all relevent rows into a collection and then select a random one, but I'm stuck as to how to select a random one.
you can use the 'order by dbms_random.random' instruction with your query.
i.e.:
SELECT column FROM
(
SELECT column FROM table
ORDER BY dbms_random.value
)
WHERE rownum = 1
References:
http://awads.net/wp/2005/08/09/order-by-no-order/
http://www.petefreitag.com/item/466.cfm
with analytics:
SELECT k_id, c_id, score
FROM (SELECT k.k_id, c.c_id, m.score,
row_number() over(PARTITION BY k.k_id ORDER BY NULL) rk
FROM k, c, m, s
WHERE k.selected = 1
AND m.score > some_value
AND m.k_id = k.k_id
AND m.c_id = c.c_id
AND c.image IS NOT NULL
AND s.c_id = c.c_id)
WHERE rk = 1
This will select one row that satisfies your criteria per k_id. This will likely select the same set of rows if you run the query several times. If you want more randomness (each run produces a different set of rows), you would replace ORDER BY NULL by ORDER BY dbms_random.value
I'm not too familiar with oracle SQL, but try using LIMIT random(), if there is such a function available.