Global row numbers in chunked query - sql

I would like to include a column row_number in my result set with the row number sequence, where 1 is the newest item, without gaps. This works:
SELECT id, row_number() over (ORDER BY id desc) AS row_number, title
FROM mytable
WHERE group_id = 10;
Now I would like to query for the same data in chunks of 1000 each to be easier on memory:
SELECT id, row_number() over (ORDER BY id desc) AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 0 AND id < 1000
ORDER BY id ASC;
Here the row_number restarts from 1 for every chunk, but I would like it to be as if it were part of the global query, as in the first case. Is there an easy way to accomplish this?

Assuming:
id is defined as PRIMARY KEY - which means UNIQUE and NOT NULL. Else you may have to deal with NULL values and / or duplicates (ties).
You have no concurrent write access on the table - or you don't care what happens after you have taken your snapshot.
A MATERIALIZED VIEW, like you demonstrate in your answer, is a good choice.
CREATE MATERIALIZED VIEW mv_temp AS
SELECT row_number() OVER (ORDER BY id DESC) AS rn, id, title
FROM mytable
WHERE group_id = 10;
But the index and subsequent queries must be on the row number rn to get data in chunks of 1000:
CREATE INDEX ON mv_temp (rn);
SELECT * FROM mv_temp WHERE rn BETWEEN 1001 AND 2000;
Your implementation would require a guaranteed gap-less id column - which would void the need for an added row number to begin with ...
When done:
DROP MATERIALIZED VIEW mv_temp;
The index dies with the table (materialized view in this case) automatically.
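If you do need to pick up rows that changed after you took the snapshot (and before you are done), the materialized view can be rebuilt in place - which of course renumbers the rows:
REFRESH MATERIALIZED VIEW mv_temp;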
Related, with more details:
Optimize query with OFFSET on large table

You want to have a query for the first 1000 rows, then one for the next 1000, and so on?
Usually you just write one query (the one you already use), have your app fetch 1000 records, do something with them, then fetch the next 1000, and so on. Hence there is no need for separate queries.
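In PostgreSQL, for instance, that "one query, fetch in batches" approach can use a server-side cursor. A sketch (the cursor name is arbitrary):
BEGIN;
DECLARE chunk_cur CURSOR FOR
    SELECT id, row_number() OVER (ORDER BY id DESC) AS rn, title
    FROM mytable
    WHERE group_id = 10;
FETCH 1000 FROM chunk_cur;  -- first chunk, rn 1 .. 1000
FETCH 1000 FROM chunk_cur;  -- next chunk, numbering continues without gaps
CLOSE chunk_cur;
COMMIT;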
However, it would be rather easy to write such partial queries:
select *
from
(
    SELECT id, row_number() over (ORDER BY id desc) AS rn, title
    FROM mytable
    WHERE group_id = 10
) numbered
where rn between 1 and 1000; -- <- simply change the row number range here
-- e.g. where rn between 1001 and 2000 for the second chunk

You need pagination. Try this:
SELECT id, row_number() over (ORDER BY id desc)+0 AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 0 AND id < 1000
ORDER BY id ASC;
Next time, when you change the start value of id in the WHERE clause, add the same offset to row_number() as well, like below:
SELECT id, row_number() over (ORDER BY id desc)+1000 AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 1000 AND id < 2000
ORDER BY id ASC;
Or better, you can use the OFFSET and LIMIT approach for pagination:
https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
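A sketch of that OFFSET / LIMIT approach; the global row number of each row is then the OFFSET plus its position within the chunk:
SELECT id, title
FROM mytable
WHERE group_id = 10
ORDER BY id DESC
LIMIT 1000 OFFSET 1000;  -- second chunk, i.e. global rows 1001 - 2000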

In the end, I did it this way:
First I create a temporary materialized view:
CREATE MATERIALIZED VIEW vw_temp AS SELECT id, row_number() over (ORDER BY id desc) AS rn, title
FROM mytable
WHERE group_id = 10;
Then I define the index:
CREATE INDEX idx_temp ON vw_temp USING btree(id);
Now I can perform all operations very quickly, and with numbered rows:
SELECT * FROM vw_temp WHERE id BETWEEN 1000 AND 2000;
After doing the operations, cleanup:
DROP INDEX idx_temp;
DROP MATERIALIZED VIEW vw_temp;
Even though Thorsten Kettner's answer seems the cleanest one, it was not practical for me due to being too slow. Thanks for contributing, everyone. For those interested in the practical use case: I use this for feeding data to the Sphinx indexer.

ORDER BY before WHERE - alternatives?

I'm using a PostgreSQL database (12.8).
Given events sorted by start_time, I would like to retrieve the next 10 events after the event with id = 10.
My first idea was something like this:
SELECT *
FROM events
ORDER BY start_time
WHERE id > 10
LIMIT 10;
However, this does not work because the WHERE clause is always applied before the ORDER BY clause. In other words, first all events with an id > 10 are selected, and only those remaining events are then ordered by start_time.
Next, I came up with a CTE expression. If the record with ID 10 has row_number 3:
WITH ordered_events AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY start_time DESC) AS row_number
FROM events
)
SELECT *
FROM ordered_events
WHERE row_number > 3
LIMIT 10;
This works as desired, if I knew the row_number beforehand. However, I only know the ID, i.e. id = 10.
a) Ideally, I would do something like this (however, I don't know if there is any way to write such an expression):
WITH ordered_events AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY start_time DESC) AS row_number
FROM events
)
SELECT *
FROM ordered_events
WHERE row_number > (row_number of row where the ID is 10) # <------
LIMIT 10;
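Written out, that placeholder would be a scalar subquery against the same CTE, roughly this sketch (assuming the CTE can be referenced twice like that):
WITH ordered_events AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY start_time DESC) AS row_number
FROM events
)
SELECT *
FROM ordered_events
WHERE row_number > (SELECT row_number FROM ordered_events WHERE id = 10)
ORDER BY row_number
LIMIT 10;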
b) The only alternative I came up with is to first retrieve the row_number for all records and then make a second query.
First query:
SELECT id, ROW_NUMBER() OVER (ORDER BY start_time DESC) AS row_number
FROM events;
Now I know that the row_number for the record with ID 10 is 3. Thus, I have enough information to make the query with the CTE expression as described above. However, the performance is bad because I need to retrieve all records from events in the first query.
I would really appreciate it, if you could help me to figure out a (performant) way of doing this.
You only want rows with a start time less than ID 10's start time. Use a WHERE clause for this.
select *
from events where start_time < (select start_time from events where id = 10)
order by start_time desc
limit 10;
This query may benefit from an index on start_time. (I take it for granted that there already is an index on ID to find the row with ID 10 quickly.)
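For example (letting PostgreSQL pick the index name):
CREATE INDEX ON events (start_time);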

How to get the nth record in a SQL Server table without changing the order? (SQL Server)

For example, I have data like this (SQL Server):
id name
4 anu
3 lohi
1 pras
2 chand
I want the 2nd record in the table (meaning 3 lohi).
If I use the row_number() function it changes the order and I get (2 chand).
I want the 2nd record from the table data.
Can anyone please give me the query for the above scenario?
There is no such thing as the nth row in a table. And for a simple reason: SQL tables represent unordered sets (technically multi-sets because they allow duplicates).
You can do what you want using offset/fetch:
select t.*
from t
order by id desc
offset 1 rows fetch first 1 row only;
This assumes that the descending ordering on id is what you want, based on your example data.
You can also do this using row_number():
select t.*
from (select t.*,
             row_number() over (order by id desc) as seqnum
      from t
     ) t
where seqnum = 2;
I should note that SQL Server allows you to assign row_number() without an effective sort, using something like this:
select t.*
from (select t.*,
             row_number() over (order by (select NULL)) as seqnum
      from t
     ) t
where seqnum = 2;
However, this returns an arbitrary row. There is no guarantee it returns the same row each time it runs, nor that the row is "second" in any meaningful use of the term.

Subquery: select only one column

I have this query which creates a pagination system. I want to SELECT only A.*; I don't want to show the row_number value even though I need it.
SELECT *
FROM (SELECT A.*, rownum row_number
      FROM (select * from dual
           ) A
      WHERE rownum <= 10)
WHERE row_number >= 1
The result:
D ROW_NUMBER
- ----------
X 1
What I want:
D
-
X
Thanks for the help.
If your table has a primary key, you may perform the pagination filter on this key only and, in a second step, select the data based on the PK.
This will allow you to use SELECT *:
select * from tab
where id in (
    SELECT id
    FROM (SELECT id, rownum row_number
          FROM (select id from tab
               ) A
          WHERE rownum <= 10)
    WHERE row_number >= 1)
You'll pay a small performance penalty, as each selected row must be additionally accessed via the primary key index (but this will not be noticeable for 10 rows or so).
Another point with pagination is that you typically need to present the data in some order and not randomly as in your example.
In that case the innermost subquery will be:
select id from tab order by <some column>
Here you benefit, as you only need to sort the PK and the sort key rather than the whole row (but again this will not be noticeable for 10 rows).
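Put together with an ordering column (created_at is only a stand-in here), the whole paginated query could look roughly like this sketch:
select *
from tab
where id in (
    SELECT id
    FROM (SELECT id, rownum row_number
          FROM (select id from tab
                order by created_at
               ) A
          WHERE rownum <= 10)
    WHERE row_number >= 1)
order by created_at;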

Is there an efficiency problem with my query to select middle rows?

I am trying to select rows 3 - 5 of:
SELECT *
FROM Finance_User
ORDER BY email DESC
I originally had just:
SELECT
ROW_NUMBER() OVER (ORDER BY email DESC) AS RowNum, *
FROM
Finance_User
WHERE
RowNum BETWEEN 3 AND 5
But this did not work as RowNum was an invalid column.
Instead I did the below:
WITH OrderedUsers AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY email DESC) AS RowNum, *
FROM
Finance_User
)
SELECT *
FROM OrderedUsers
WHERE RowNum BETWEEN 3 AND 5
This works perfectly fine. However, I am concerned that there might be performance issues with this as it seems to be selecting from the table twice?
ROW_NUMBER() with a CTE (or a subquery) won't scan the table twice. However, using the window function might incur additional processing for the RDBMS.
You could achieve the same results with ORDER BY ... OFFSET ... FETCH ..., available starting with SQL Server 2012, which are provided specifically for paging a result set:
SELECT *
FROM Finance_User
ORDER BY email DESC
OFFSET 2 ROWS FETCH NEXT 3 ROWS ONLY
From the documentation:
We recommend that you use the OFFSET and FETCH clauses instead of the TOP clause to implement a query paging solution and limit the number of rows sent to a client application.
Your query is fine.
The WITH ... AS clause defines a Common Table Expression, which means this query may be reused later and can be cached if needed and possible. So there should not be any problem with "selecting from the table twice".
You can get the same result with this query:
SELECT * from
(SELECT ROW_NUMBER() OVER (ORDER BY email DESC) AS RowNum, * FROM Finance_User) AS numbered
where RowNum between 3 and 5
And finally, you can always check the execution plan to make sure.

Select random row for each group in a postgres table

I have a table that is roughly:
id | category | link | caption | image
My goal is to fetch a random row from each distinct category in the table, for all the categories in the table. The plan is to then assign each row to a variable for its respective category.
Right now I'm using multiple SELECT statements resembling:
SELECT link, caption, image FROM table
WHERE category='whatever'
ORDER BY RANDOM() LIMIT 1
But this seems inelegant and creates more trips to the DB, which is expensive.
I'm pretty sure there's a way to do this with window functions in Postgres, but I have no experience with them and I'm not entirely sure how to use one to get what I want.
Try something like:
SELECT DISTINCT ON (category) *
FROM table
ORDER BY category, random();
Or with window functions:
SELECT *
FROM (
SELECT *, row_number() OVER (PARTITION BY category ORDER BY random()) as rn
FROM table ) sub
WHERE rn = 1;
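If you ever need more than one random row per category, the window-function version extends naturally; a sketch, with "items" standing in for the real table name:
SELECT *
FROM (
    SELECT *, row_number() OVER (PARTITION BY category ORDER BY random()) AS rn
    FROM items
) sub
WHERE rn <= 3;  -- e.g. 3 random rows per category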