Find row number in a sort based on row id, then find its neighbours - sql

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve the following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number supported in Firebird 3.0)? I am by no means an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?

Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
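To find the "row" number of a row with a given id:
select id, rownum
from (select id, row_number() over (order by name, id) as rownum
from t
) numbered
where id = <id>
This assumes the ids are unique. The filter on id has to sit outside the subquery: window functions are evaluated after WHERE, so filtering in the same query block would number only the one matching row and always return 1.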
To solve the second problem:
select t.*
from (select id, row_number() over (order by name, id) as rownum
from t
) t join
(select id, row_number() over (order by name, id) as rownum
from t
) tid
on t.rownum between tid.rownum - 5 and tid.rownum + 5
where tid.id = <id>
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column when a row is inserted. If your records are never deleted, this can serve as your counter, simplifying your queries.
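The row-number lookup and the neighbour query from this answer can be tried out with a minimal sketch like the one below. It uses Python's built-in sqlite3 module as a stand-in for Firebird (an assumption: the bundled SQLite must be 3.25 or newer for window functions), and the sample names, the `NUMBERED` fragment and the helper functions `row_number_of` and `neighbours` are hypothetical, introduced only for illustration.

```python
import sqlite3

# In-memory stand-in for the `people` table; the names are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
names = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank", "Grace",
         "Heidi", "Ivan", "Judy", "Mallory", "Niaj", "Olivia", "Peggy"]
# Insert in reverse so that id order differs from name order.
conn.executemany("INSERT INTO people (id, name) VALUES (?, ?)",
                 list(enumerate(reversed(names), start=1)))

# Number every row of the full sorted result (no WHERE here on purpose).
NUMBERED = """
    SELECT id, name, row_number() OVER (ORDER BY name, id) AS rn
    FROM people
"""

def row_number_of(target_id):
    """Position of target_id in the sorted result, or None if it does not exist."""
    row = conn.execute(
        f"SELECT rn FROM ({NUMBERED}) AS numbered WHERE id = ?", (target_id,)
    ).fetchone()
    return row[0] if row else None

def neighbours(target_id, radius=5):
    """Rows within `radius` positions of target_id, in sorted order."""
    return conn.execute(
        f"""
        SELECT t.id, t.name, t.rn
        FROM ({NUMBERED}) AS t
        JOIN ({NUMBERED}) AS tid
          ON t.rn BETWEEN tid.rn - ? AND tid.rn + ?
        WHERE tid.id = ?
        ORDER BY t.rn
        """,
        (radius, radius, target_id),
    ).fetchall()

print(row_number_of(7))
print(neighbours(7, radius=2))
```

Note that the database still has to number the whole sorted set before the outer filter can pick a row, so this shortens the query text but is not guaranteed to be faster than counting preceding rows.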

Related

(Hive) SQL retrieving data from a column that has 1 to N relationship in another column

How can I retrieve rows where BID comes up multiple times in AID
You can see the sample below, AID and BID columns are under the PrimaryID, and BIDs are under AID. I want to come up with an output that only takes records where BIDs had 1 to many relationship with records on AIDs column. Example output below.
I provided a small sample of data; I am trying to retrieve 20+ columns and joining 4 tables. I have unique PrimaryIDs and under those I have multiple unique AIDs; however, under these AIDs I can have multiple non-unique BIDs that can repeatedly come up under different AIDs.
Hive supports window functions. A window function can associate every row in a group with an attribute of the group, and count() is one of the supported functions. In your case you can use it and select the rows for which that count is > 1.
In the partition by clause you specify which columns define the group, the same way that you would in the more familiar group by clause.
Something like this:
select * from
(
Select *,
count(*) over (partition by primaryID,AID) counts
from mytable
) x
Where counts>1
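As a rough illustration of the query above, the sketch below runs the same count(*) over (partition by ...) pattern against a tiny made-up table. It uses Python's sqlite3 module rather than Hive (an assumption made so the example is runnable; SQLite 3.25+ is needed for window functions), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (primaryID INTEGER, AID TEXT, BID TEXT)")
# Made-up sample rows: A1 and A3 have several BIDs, A2 and A4 have one each.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, "A1", "B1"), (1, "A1", "B2"), (1, "A2", "B3"),
     (2, "A3", "B4"), (2, "A3", "B5"), (2, "A3", "B6"), (2, "A4", "B7")],
)

# Every row gets the size of its (primaryID, AID) group attached,
# then the outer query keeps only rows from groups with more than one row.
rows = conn.execute("""
    SELECT primaryID, AID, BID, counts
    FROM (
        SELECT *, count(*) OVER (PARTITION BY primaryID, AID) AS counts
        FROM mytable
    ) AS x
    WHERE counts > 1
    ORDER BY primaryID, AID, BID
""").fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY with HAVING, this keeps the individual rows, so the other 20+ columns can be selected alongside the count.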

How does one get the total rows for a partition in postgresql

I'm using a window function to help me paginate through a list of records in the database.
For example
I have a list of dogs and they all have a breed associated with them.
I want to show 10 dogs from each breed to my users.
So that would be
select * from dogs
join (
SELECT id, row_number() OVER (PARTITION BY breed) as row_number FROM dogs
) rn on dogs.id = rn.id
where (row_number between 1 and 10)
That will give me ~ten dogs from each breed..
What I need though is a count. Is there a way to get the count of the partitions. I want to know how many Staffies I have waiting for adoption.
I do notice that there's a percentage and all the docs I find seem to indicate there's something called total rows. But I don't see it.
Just run the window aggregate function count() over the same partition (without adding ORDER BY!) to get the total count for the partition:
SELECT *
FROM (
SELECT *
, row_number() OVER (PARTITION BY breed ORDER BY id) AS rn
, count(*) OVER (PARTITION BY breed) AS breed_count -- !
FROM dogs
) sub
WHERE rn < 11;
Also removed the unnecessary join and simplified.
See:
Run a query with a LIMIT/OFFSET and also get the total number of rows
And I added ORDER BY to the window definition of row_number() to get a deterministic result. Without it, Postgres is free to return any 10 arbitrary rows per breed. Any write to the table (or VACUUM, etc.) can and will change the result without ORDER BY.
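Here is a small sketch of the same idea: a per-partition row number with ORDER BY for a stable pick, plus count(*) over the same partition without ORDER BY for the total. It runs on Python's sqlite3 module instead of Postgres (an assumption for the sake of a runnable example; SQLite 3.25+ is required), and the dogs, the column list and the helper `first_per_breed` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, name TEXT, breed TEXT)")
# Invented sample data: 3 Staffies, 1 Beagle, 2 Poodles.
conn.executemany(
    "INSERT INTO dogs VALUES (?, ?, ?)",
    [(1, "Rex", "Staffy"), (2, "Bo", "Beagle"), (3, "Max", "Staffy"),
     (4, "Lulu", "Poodle"), (5, "Ace", "Staffy"), (6, "Coco", "Poodle")],
)

def first_per_breed(limit):
    """First `limit` dogs of each breed (by id), each with its breed's total count."""
    return conn.execute("""
        SELECT id, breed, rn, breed_count
        FROM (
            SELECT *
                 , row_number() OVER (PARTITION BY breed ORDER BY id) AS rn
                 , count(*)     OVER (PARTITION BY breed)             AS breed_count
            FROM dogs
        ) AS sub
        WHERE rn <= ?
        ORDER BY breed, rn
    """, (limit,)).fetchall()

for row in first_per_breed(2):
    print(row)
```

The total is repeated on every row of the partition, so it stays available even though only the first few rows per breed are returned.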
Aside, pagination with LIMIT / OFFSET does not scale well. Consider:
Optimize query with OFFSET on large table

How to select 1 row per id?

I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
this solution looks more neat but I can't get it to work (still returns multiple rows per single order_id). What am I doing wrong?
If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row and that is usually not desirable. You can tweak this using Oracle's keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
Also, saying that you "can't get it to work (still returns multiple rows per single order_id)" is kind of difficult to believe. The way you put it, it can't be true.
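To make the difference between the two approaches concrete, the sketch below compares row_number() with plain MAX() aggregation on a tiny invented table. It uses Python's sqlite3 module (an assumption; the question appears to target another database, and SQLite 3.25+ is needed for window functions). The table name `orders` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, ship_address TEXT, created_at TEXT)"
)
# Order 1 has two spellings/addresses; the more recent one sorts lower alphabetically.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Zebra Street 1", "2020-01-01"),
     (1, "Apple Road 9", "2021-06-01"),
     (2, "5 Main St", "2019-03-03")],
)

# row_number(): keeps one whole row per order_id, the most recent one.
latest = conn.execute("""
    SELECT order_id, ship_address, created_at
    FROM (
        SELECT o.*,
               row_number() OVER (PARTITION BY order_id
                                  ORDER BY created_at DESC) AS seqnum
        FROM orders AS o
    ) AS t
    WHERE seqnum = 1
    ORDER BY order_id
""").fetchall()

# MAX() per column: one row per order_id, but values may come from different rows.
aggregated = conn.execute("""
    SELECT order_id, max(ship_address), max(created_at)
    FROM orders
    GROUP BY order_id
    ORDER BY order_id
""").fetchall()

print(latest)
print(aggregated)
```

For order 1 the aggregated result pairs the alphabetically largest address with the latest date, a combination that never existed as a single row, while the row_number() result returns the most recent row intact.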

SQL to find best row in group based on multiple columns?

Let's say I have an Oracle table with measurements in different categories:
CREATE TABLE measurements (
category CHAR(8),
value NUMBER,
error NUMBER,
created DATE
)
Now I want to find the "best" row in each category, where "best" is defined like this:
It has the lowest error.
If there are multiple measurements with the same error, the one that was created most recently is considered to be the best.
This is a variation of the greatest N per group problem, but including two columns instead of one. How can I express this in SQL?
Use ROW_NUMBER:
WITH cte AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY category ORDER BY error, created DESC) rn
FROM measurements m
)
SELECT category, value, error, created
FROM cte
WHERE rn = 1;
For a brief explanation, the PARTITION BY clause instructs the DB to generate a separate row number for each group of records in the same category. The ORDER BY clause places those records with the smallest error first. Should two or more records in the same category be tied with the lowest error, then the next sorting level would place the record with the most recent creation date first.
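The two-level ordering can be checked with a short sketch such as the following. It runs the same CTE on Python's sqlite3 module instead of Oracle (an assumption made for runnability; SQLite 3.25+ is required), with column types simplified and sample measurements invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        category TEXT, value REAL, error REAL, created TEXT
    )
""")
# Made-up data: in 'temp', two rows tie on the lowest error (0.2).
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [("temp", 20.1, 0.5, "2023-01-01"),
     ("temp", 20.4, 0.2, "2023-01-02"),
     ("temp", 20.3, 0.2, "2023-01-05"),
     ("hum", 55.0, 1.0, "2023-02-01"),
     ("hum", 54.0, 1.5, "2023-02-03")],
)

# Lowest error first; among equal errors, the most recently created row first.
best = conn.execute("""
    WITH cte AS (
        SELECT m.*,
               row_number() OVER (PARTITION BY category
                                  ORDER BY error, created DESC) AS rn
        FROM measurements AS m
    )
    SELECT category, value, error, created
    FROM cte
    WHERE rn = 1
    ORDER BY category
""").fetchall()

for row in best:
    print(row)
```

In the 'temp' category the tie on error 0.2 is broken by the later creation date, so the row created on 2023-01-05 is chosen.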

MySQL/Ms SQL latest records with multiple id's

I'm no sql-expert, but came across this problem:
I have to retrieve data from Microsoft SQL 2008 server. It holds different measurement data from different probes, that don't have any recording intervals. Meaning that some probe can transfer data in the database once every week, another once every second. Probes are identified by id's (not unique), and the point is to retrieve only the last record from each id (probe). Table looks like this (last 5, order by SampleDateTime desc):
TagID SampleDateTime SampleValue QualityID
13 634720670797944946 112 192
23 634720670797944946 38.1 192
17 634720670797944946 107.5 192
14 634720670748012090 110.6 192
19 634720670748012090 99.7 192
I CAN'T modify the server or even the settings, am only authorized to do queries. And I'd need to retrieve the requested data on even intervals (say once every minute or so). There are over 100 probes (with different id's) of which about 40 need to be read. So I am guessing that if this could be done in a single query it could be way more efficient than to get each row in a separate query.
Using MySQL and a similar table got the desired result this way (suggestions for a better way highly appreciated!):
SELECT TagID,SampleDateTime,SampleValue FROM
(
SELECT TagID,SampleDateTime,SampleValue FROM measurements
WHERE TagID IN(101,102,103) ORDER BY SampleDateTime DESC
)
AS table1 GROUP BY TagID;
Thought that would do the trick (didn't manage with MAX() or DISTINCT or no matter what I tried), as it did, with the correct data even. But naturally it doesn't work in Ms SQL because of 'GROUP BY'.
Column 'table1.SampleValue' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I'm extremely stuck with this and so any insight would be more than welcome.
I am slightly confused as you have tagged MySQL and SQL-Server. For SQL-Server, I would use the ROW_NUMBER function to assist:
SELECT m.TagID, m.SampleDateTime, m.SampleValue, m.QualityID
FROM ( SELECT *, ROW_NUMBER() OVER(PARTITION BY TagID ORDER BY SampleDateTime DESC) [RowNumber]
FROM Measurements
) m
WHERE Rownumber = 1
The ROW_NUMBER function does exactly what it says on the tin: it gives each row a number based on criteria you provide. So in the example above PARTITION BY TagID tells ROW_NUMBER to start again at 1 each time a new TagID is encountered. ORDER BY SampleDateTime DESC tells ROW_NUMBER to start numbering each TagID at the latest entry and work upwards to the earliest entry.
Your query works in MySQL but fails in SQL Server because MySQL allows an implicit group by: since you have only specified GROUP BY TagID, any fields that are in the select list and not contained within an aggregate function get the values of a "random" row assigned to them (the latest row in your case, because you specified ORDER BY SampleDateTime DESC in the subquery). SQL Server rejects such a select list.
Just in case it is required the following should work in most DBMS and is a better way of producing a similar query to the one you have been running in MySQL:
SELECT m.TagID, m.SampleDateTime, m.SampleValue, m.QualityID
FROM Measurements m
INNER JOIN
( SELECT TagID, MAX(SampleDateTime) AS SampleDateTime
FROM Measurements
GROUP BY TagID
) MaxTag
ON MaxTag.TagID = m.TagID
AND MaxTag.SampleDateTime = m.SampleDateTime
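The sketch below runs both variants side by side to show where they can differ. It uses Python's sqlite3 module as a stand-in for SQL Server (an assumption; SQLite 3.25+ is needed for the ROW_NUMBER variant), and the rows, including the deliberate timestamp tie for TagID 23, are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Measurements (
        TagID INTEGER, SampleDateTime INTEGER, SampleValue REAL, QualityID INTEGER
    )
""")
# Made-up rows; TagID 23 has two samples sharing its latest timestamp.
conn.executemany(
    "INSERT INTO Measurements VALUES (?, ?, ?, ?)",
    [(13, 100, 111.0, 192), (13, 200, 112.0, 192),
     (17, 50, 107.0, 192), (17, 300, 107.5, 192),
     (23, 150, 38.1, 192), (23, 150, 38.4, 192)],
)

# Variant 1: ROW_NUMBER keeps exactly one row per TagID.
by_row_number = conn.execute("""
    SELECT m.TagID, m.SampleDateTime, m.SampleValue
    FROM (
        SELECT *, row_number() OVER (PARTITION BY TagID
                                     ORDER BY SampleDateTime DESC) AS RowNumber
        FROM Measurements
    ) AS m
    WHERE m.RowNumber = 1
    ORDER BY m.TagID
""").fetchall()

# Variant 2: join back to the per-tag maximum; ties on the maximum all match.
by_join = conn.execute("""
    SELECT m.TagID, m.SampleDateTime, m.SampleValue
    FROM Measurements AS m
    INNER JOIN (
        SELECT TagID, max(SampleDateTime) AS SampleDateTime
        FROM Measurements
        GROUP BY TagID
    ) AS MaxTag
      ON MaxTag.TagID = m.TagID
     AND MaxTag.SampleDateTime = m.SampleDateTime
    ORDER BY m.TagID, m.SampleValue
""").fetchall()

print(by_row_number)
print(by_join)
```

When timestamps are unique per tag the two variants agree. With a tie on the latest timestamp, the join returns every tied row, whereas ROW_NUMBER returns one of them, chosen arbitrarily unless the ORDER BY includes a tie-breaker.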