I need to implement pagination that is semi-resilient to the data changing between page requests. Standard pagination relies on SQL's LIMIT and OFFSET; however, the offset can become inaccurate as new rows are created or their position in the sort shifts.
One idea is to hold onto the last data point returned from the API and fetch the elements that follow it. I don't really know SQL (we're using Postgres), but this is my (certainly flawed) attempt at doing something like that. I am trying to store the position of the last element as 'rownum' and then use it in the following query.
WITH rownum AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY rank ASC, id) AS rownum
WHERE id = #{after_id}
FROM items )
SELECT * FROM items
OFFSET rownum
ORDER BY rank ASC, id
LIMIT #{pagination_limit}
I can see some issues with this, like if the last item changes significantly in rank. If anyone can think of another way to do this, that would be great. But I would like to confine it to a single DB query if possible since this is the applications most frequently hit API.
Your syntax doesn't quite work: OFFSET comes after ORDER BY, FROM comes before WHERE, etc.
This simpler query would do what I think your code is supposed to do:
SELECT *
FROM items
WHERE (rank, id) > (
SELECT (rank, id)
FROM items
WHERE id = #{after_id}
)
ORDER BY rank, id
LIMIT #{pagination_limit};
Comparing the composite type (rank, id) guarantees identical sort order.
Make sure you have two indexes:
A multicolumn index on (rank, id).
Another one on just (id) - you probably have a pk constraint on the column doing that already. (A multicolumn index with leading id would do the job as well.)
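In Postgres that could look like this (a sketch; the index name is made up, and the PK on id already covers the second index):
CREATE INDEX items_rank_id_idx ON items (rank, id);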
More about indexes:
Is a composite index also good for queries on the first field?
If rank is not volatile it would be more efficient to parameterize it additionally instead of retrieving it dynamically - but the volatility of rank seems to be the point of your deliberations ...
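For instance, if the client also passes back the rank of the last row, the subquery can be dropped entirely (a sketch; the #{after_rank} parameter is an assumption, not part of the question's API):
SELECT *
FROM items
WHERE (rank, id) > (#{after_rank}, #{after_id})
ORDER BY rank, id
LIMIT #{pagination_limit};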
I now think the best way to solve this problem is by storing the datetime of the original query and filtering out results created after that moment on subsequent queries, thus ensuring the offset stays mostly correct. Maybe a persistent copy of the data could be used to ensure it is in the same state it was in when the original query was made.
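A sketch of that idea, assuming items has a created_at timestamp column and the client passes back the time of its first request:
SELECT *
FROM items
WHERE created_at <= #{first_requested_at}
ORDER BY rank, id
LIMIT #{pagination_limit}
OFFSET #{page_offset};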
Related
I have a table with ShipCountry, ShipCity and Freight columns in a SQL database. I tried to retrieve data from that table using the query below.
Select ShipCountry from CountryDetails Group by ShipCountry
If I run this query I get the results in ascending order. Instead, I need the data in database order. How can I achieve this with a SQL query?
Note: if I run the query below, it returns the data in database order. I only get sorted data when I add the GROUP BY clause to my query.
Select ShipCountry from CountryDetails
Using GROUP BY for ordering is improper; GROUP BY is for aggregate functions such as MIN, MAX or COUNT.
If you need a specific order, use ORDER BY instead:
Select ShipCountry from CountryDetails Order by ShipCountry
Otherwise, if you don't want any ordering, simply use:
Select ShipCountry from CountryDetails
Remember that the values stored in the DB have no inherent order; they are returned in whatever sequence the engine happens to use to retrieve them.
Whenever you need an order, you must explicitly use ORDER BY.
To avoid "redundant values", use DISTINCT rather than GROUP BY, e.g.:
Select distinct ShipCountry from CountryDetails
As has already been stated, what you describe might lead to unexpected results for your end users.
Let's assume you have a table without any indexes or keys (a so-called heap). A heap can pretty much be compared to a phone book (yeah, I've been around for a while) consisting of hundreds of pages on which information is randomly ordered. A heap is exactly that: a lot of randomly ordered data. Whenever you query such a table, the query analyzer will do its very best to figure out the fastest way to deliver the data.
Such decisions from the query analyzer are guided by statistics; a collection of metrics about the data and the distribution thereof. SQL Server uses these statistics to figure out the cardinality (the uniqueness of values), and thus pick the fastest way to return data.
When you simply issue a SELECT * FROM myTable on a heap, those statistics will determine the order in which your data is returned. However, this also means that over time, the statistics will change, as more data flows into the table. This has the effect that the sort order of your data today is not necessarily the sort order in which the data is returned tomorrow, or even five minutes from now.
If that is fine with your end users, then a SELECT * FROM myTable is the right solution for you. But, if you absolutely need to have the data returned in a certain order, you should always implement an ORDER BY clause.
If you want the same "database order" in most cases: if you sort by the primary key, the result will usually match the unordered output, as you say.
Here id is the primary key; if you cannot use the primary key, add an identity column and use it, as in the sketch after the example data:
id name
1 elly
2 ahmad
3 joseph
4 omar
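That is (a sketch, assuming the identity column added to CountryDetails is named id):
Select ShipCountry from CountryDetails Order by id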
While trying to implement server-side pagination in Postgres, I came across the point that when using the LIMIT and OFFSET keywords you have to provide an ORDER BY clause on a unique column, probably the primary key.
In my case I am using UUIDs for the pkeys, so I can't rely on a sequential order of increasing keys; ORDER BY pkey DESC might not always put newer rows on top.
So I resorted to using the created-date column, a timestamp column which should be unique.
But my question is: what if the UI client wants to sort by some other column? In the event that it might not always be a unique column, I resort to ORDER BY user_column, created_dt DESC so as to maintain predictable results for Postgres pagination.
Is this the right approach? I am not sure if I am going the right way. Please advise.
I talked about this exact problem on an old blog post (in the context of using an ORM):
One last note about using sorting and paging in conjunction. A query that implements paging can have odd results if the ORDER BY clause does not include a field that represents an empirical sequence in the data; sort order is not guaranteed beyond what is explicitly specified in the ORDER BY clause in most (maybe all) database engines. An example: if you have 100 orders that all occurred on the exact same date, and you ask for the first page of this data sorted by this date, then ask for the second page of data sorted the same way, it is entirely possible that you will get some of the data duplicated across both pages. So depending on the query and the distribution of data that is “sortable,” it can be a good practice to always include a unique field (like a primary key) as the final field in a sort clause if you are implementing paging.
http://psandler.wordpress.com/2009/11/20/dynamic-search-objects-part-5sorting/
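Applied to the question above, that advice might look like this (a sketch; the table name is made up):
SELECT *
FROM records
ORDER BY user_column, created_dt DESC, pkey
LIMIT 20 OFFSET 40;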
The strategy of using a column that uniquely identifies a record, such as a pkey or an insertion date, may not be possible in some cases.
I have an application where the user sets up his own grid query; it can include any column from multiple tables, and perhaps none of them is a unique identifier.
In such a case ROW_NUMBER() can be useful: you simply select the row number and use the same sort in its OVER clause. It would be something like:
select col1, col2, col3, row_number() over(order by col3) from tableX order by col3
It's important that OVER (ORDER BY ...) matches the outer ORDER BY. That way your paging will have consistent results every time.
I want to get the n-th to m-th records in a table. What's the better choice between the two solutions below?
Solution 1:
SELECT * FROM Table WHERE ID >= n AND ID <= m
Solution 2:
SELECT * FROM
(SELECT *,
ROW_NUMBER() OVER (ORDER BY ID) AS row
FROM Table
)a
WHERE row >= n AND row <= m
As others have already pointed out, the queries return different results; you are comparing apples to oranges.
But the underlying question remains: which is faster: keyset driven paging or rownumber driven paging?
Keyset Paging
Keyset driven paging relies on remembering the top and bottom keys of the last displayed page, and requesting the next or previous set of rows, based on the top/last keyset:
Next page:
select top (<pagesize>) ...
from <table>
where key > @last_key_on_current_page
order by key;
Previous page:
select top (<pagesize>) ...
from <table>
where key < @first_key_on_current_page
order by key desc;
This approach has two main advantages over the ROW_NUMBER approach, or over the equivalent LIMIT approach of MySQL:
is correct: unlike the row number based approach, it correctly handles new entries and deleted entries. The last row of page 4 does not show up as the first row of page 5 just because row 23 on page 2 was deleted in the meantime. Nor do rows mysteriously vanish between pages. These anomalies are common with the row_number based approach, but the keyset based solution does a much better job of avoiding them.
is fast: all operations can be solved with a fast row positioning followed by a range scan in the desired direction
However, this approach is difficult to implement, hard to understand by the average programmer and not supported by the tools.
Row Number Driven
This is the common approach introduced with Linq queries:
select ...
from (
select ..., row_number() over (...) as rn
from table) as t
where rn between @firstRow and @lastRow;
(or a similar query using TOP)
This approach is easy to implement and is supported by tools (specifically by Linq's .Skip and .Take operators). But it is guaranteed to scan the index in order to count the rows: it usually works very fast for page 1 and gradually slows down as one goes to higher and higher page numbers.
As a bonus, with this solution it is very easy to change the sort order (simply change the OVER clause).
Overall, given the ease of the ROW_NUMBER() based solutions, the support they have from Linq, and the simplicity of using arbitrary orders, ROW_NUMBER() based solutions are adequate for moderate data sets. For large and very large data sets, ROW_NUMBER() can incur serious performance issues.
One other thing to consider is that there is often a definite pattern of access: the first few pages are hot and pages after 10 are basically never viewed (e.g. most recent posts). In this case, the penalty ROW_NUMBER() incurs for visiting bottom pages (display pages for which a large number of rows have to be counted to get the starting result row) may well be ignored.
And finally, keyset pagination is great for dictionary navigation, which ROW_NUMBER() cannot accommodate easily. Dictionary navigation is where, instead of using page numbers, users can navigate to certain anchors, like alphabet letters. A typical example is a Rolodex-like contacts sidebar: you click on M and you navigate to the first customer name that starts with M.
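A sketch of that anchor-style lookup (table and column names are assumed):
select top (20) *
from customers
where name >= 'M'
order by name, customer_id;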
The 2nd solution is your best choice. It takes into account the fact that you could have holes in your ID column. I'd rewrite it as a CTE though, instead of a subquery...
;WITH MyCTE AS
(SELECT *,
ROW_NUMBER() OVER (ORDER BY ID) AS row
FROM Table)
SELECT *
FROM MyCTE
WHERE row >= @start
AND row <= @end
They are different queries.
Assuming ID is a surrogate key, it may have gaps. ROW_NUMBER will be contiguous.
If you can guarantee you have no gaps in the data, go with the 1st one, because I'd hope ID is indexed. The 2nd one is more "correct", though.
MySQL
Suppose you want to retrieve just a single record by some id, but you want to know what its position would have been if you'd encountered it in a large ordered set.
Case in point is a photo gallery. You land on a single photo, but the system must know what its offset is in the entire gallery.
I suppose I could use custom indexing fields to keep track of positions, but there must be a more graceful way in SQL alone.
So, first you create a virtual table with the position # ordered by whatever your ORDER BY is, then you select the highest one from that set. That's the position in the greater result set. You can run into problems if you don't order by a unique value/set of values...
If you create an index on (photo_gallery_id, date_created_on) it may do an index scan (depending on the distribution of photos), which ought to be faster than a table scan (provided your gallery_id isn't 90% of the photos or whatnot).
SELECT @row := 0;
SELECT MAX( position )
FROM ( SELECT @row := @row + 1 AS position
FROM photos
WHERE photo_gallery_id = 43
AND date_created_on <= 'the-date-time-your-photo-was'
ORDER BY date_created_on ) positions;
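The index mentioned above could be created like this (the index name is made up):
CREATE INDEX idx_photos_gallery_date
ON photos (photo_gallery_id, date_created_on);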
Not really. I think Oracle gives you a ROWID or something like that, but most engines don't. A custom ordering, like a column in your database that tells you what position the entry has in the gallery, is good because you can never be sure that SQL will put things in the table in the order you think they should be in.
As you are not specific about what database you're using, in SQL Server 2005 you could use
SELECT
ROW_NUMBER() OVER (ORDER BY PhotoID)
, PhotoID
FROM dbo.Photos
You don't say what DBMS you are using, and the "solution" will vary accordingly. In Oracle you could do this (but I would urge you not to!):
select photo, offset
from
( select id
, photo
, row_number() over (partition by gallery_id order by photo_seq) as offset
from photos
)
where id = 123
That query will select all photos (full table scan) and then pick out the one you asked for - not a performant query!
I would suggest if you really need this information it should be stored.
Assuming the position is determined solely by the id, would it not be as simple as counting all records with a smaller id value?:
select
po.[id]
...
((select count(pi.[id]) from photos pi where pi.[id] < po.[id]) + 1) as [index]
...
from photos po
...
I'm not sure what the performance implications of such a query would be, but I would think returning a lot of records could be a problem.
You must understand the difference between an "application key" and a "technical key".
The technical key exists for the sole purpose of making an item unique. It's usually an INTEGER or BIGINT, generated (identity, whatever). This key is used to locate objects in the database, to quickly figure out whether an object has already been persisted (IDs must be > 0, so an object with the default ID == 0 is not in the DB yet), etc.
The application key is something you need in order to make sense of an object within the context of your application. In this case, it's the ordering of the photos in the gallery. This has no meaning whatsoever for the database.
Think of an ordered list: this is the default in most languages. You have a set of items accessed by an index. For a database, this index is an application key, since sets in the database are unordered (or rather, the database doesn't guarantee any ordering unless you specify ORDER BY). For the very same reason, paging through results from a query is such a pain: databases really don't like the idea of "position".
So what you must do is add an index column (i.e. an INTEGER which says at which position in the gallery your image is; not a database index for quicker access, even though you should create an index on this column ...) and maintain it. For every insertion, you must UPDATE index = index + 1 WHERE index >= insertion_point, etc.
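A sketch of that maintenance on insert, with assumed table and column names (position is the application-key column):
-- make room at the insertion point (position 5 in gallery 43) ...
UPDATE photos SET position = position + 1
WHERE photo_gallery_id = 43 AND position >= 5;
-- ... then insert the new photo at the freed position
INSERT INTO photos (photo_gallery_id, position, photo)
VALUES (43, 5, 'new-photo.jpg');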
Yes, it sucks. The only solution I know of: Use an ORM framework which solves this for you.
There's no need for an extra table; why not just count the records instead?
You know the order in which they are displayed (it can vary, but you know it).
You also know the ID of the current record; let's say the set is ordered by date:
The offset of the record is the total number of records with a date < that date.
SELECT COUNT(1) FROM ... WHERE date < "the-date"
This gives you the number you can use as the offset for the other queries...
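For example (a sketch with assumed names; say the count comes back as 42, making the photo the 43rd row in the ordered set):
SELECT COUNT(1) FROM photos WHERE date_created_on < 'the-date';
-- use the count as the offset, e.g. to fetch the page starting at that photo:
SELECT * FROM photos ORDER BY date_created_on LIMIT 10 OFFSET 42;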
EDIT: I'm still waiting for more answers. Thanks!
In the SQL 2000 days, I used to use the temp table method: you create a temp table with a new identity column and primary key, then select where the identity column is between A and B.
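Roughly like this sketch (simplified, using the columns from my query below):
CREATE TABLE #paged (rownum INT IDENTITY(1,1) PRIMARY KEY, postID INT);
INSERT INTO #paged (postID)
SELECT postID FROM MyTable ORDER BY postDate DESC;
SELECT m.postID, m.postTitle, m.postDate
FROM #paged p
JOIN MyTable m ON m.postID = p.postID
WHERE p.rownum BETWEEN 21 AND 30;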
When SQL 2005 came along I found out about Row_Number() and I've been using it ever since...
But now, I found a serious performance issue with Row_Number().
It performs very well when you are working with not-so-gigantic result sets and sorting over an identity column. However, it performs very poorly when you are working with large result sets (like over 10,000 records) and sorting over a non-identity column. Row_Number() performs poorly even if you sort by an identity column once the result set is over 250,000 records. For me, it got to the point where it throws a "command timeout!" error.
What do you use to paginate a large result set on SQL 2005?
Is the temp table method still better in this case? I'm not sure if the method using a temp table with SET ROWCOUNT will perform better... but some say it can give wrong row numbers if you have a multi-column primary key.
In my case, I need to be able to sort the result set by a date type column... for my production web app.
Let me know what you use for high-performing pagination in SQL 2005. I'd also like to know a smart way of creating indexes; I suspect choosing the right primary keys and/or indexes (clustered/non-clustered) will play a big role here.
Thanks in advance.
P.S. Does anyone know what stackoverflow uses?
EDIT: Mine looks something like...
SELECT postID, postTitle, postDate
FROM
(SELECT postID, postTitle, postDate,
ROW_NUMBER() OVER(ORDER BY postDate DESC, postID DESC) as RowNum
FROM MyTable
) as DerivedMyTable
WHERE RowNum BETWEEN @startRowIndex AND (@startRowIndex + @maximumRows) - 1
postID: Int, Identity (auto-increment), Primary key
postDate: DateTime
EDIT: Is everyone using Row_Number()?
The row_number() technique should be quick. I have seen good results for 100,000 rows.
Are you using row_number() similar to the following:
SELECT column_list
FROM
(SELECT column_list,
ROW_NUMBER() OVER(ORDER BY OrderByColumnName) as RowNum
FROM MyTable m
) as DerivedTableName
WHERE RowNum BETWEEN @startRowIndex AND (@startRowIndex + @maximumRows) - 1
...and do you have a covering index for the column_list and/or an index on the 'OrderByColumnName' column?
Well, for your sample query ROW_NUMBER() should be pretty fast with thousands of rows, provided you have an index on your postDate field. If you don't, the server needs to perform a complete clustered index scan on your PK, practically loading every page, fetching your postDate field, sorting by it, determining the rows to extract for the result set, and then fetching those rows. It's kind of like creating a temp index over and over again (you might see a table/index spool in the plan).
No wonder you get timeouts.
My suggestion: set an index on postDate DESC, since this is what ROW_NUMBER() will go over (ORDER BY postDate DESC, ...).
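For example (a sketch; the index name is made up):
CREATE INDEX IX_MyTable_PostDate
ON MyTable (postDate DESC, postID DESC);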
As for the article you are referring to: I did quite a lot of paging with SQL Server 2000 in the past without ROW_NUMBER(), and the approach used in the article is the most efficient one. It does not work in all circumstances (you need unique or almost-unique values). An overview of some other methods is here.