Should SQL ranking functionality be considered "use with caution"?

This question originates from a discussion on whether to use SQL ranking functionality in a particular case.
Any common RDBMS includes some ranking functionality, i.e. its query language has elements like TOP n ... ORDER BY key, ROW_NUMBER() OVER (ORDER BY key), or ORDER BY key LIMIT n.
They do a great job of increasing performance if you want to present only a small chunk out of a huge number of records. But they also introduce a major pitfall: if key is not unique, the results are non-deterministic. Consider the following example:
users
user_id     name
----------- -------
1           John
2           Paul
3           George
4           Ringo

logins
login_id    user_id     login_date
----------- ----------- -----------
1           4           2009-08-17
2           1           2009-08-18
3           2           2009-08-19
4           3           2009-08-20
A query is supposed to return the person who logged in last:
SELECT TOP 1 users.*
FROM logins
JOIN users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC
Just as expected, George is returned and everything looks fine. But then a new record is inserted into the logins table:
login_id    user_id     login_date
----------- ----------- -----------
1           4           2009-08-17
2           1           2009-08-18
3           2           2009-08-19
4           3           2009-08-20
5           4           2009-08-20
What does the query above return now? Ringo? George? You can't tell. As far as I remember, MySQL 4.1, for example, returns the first physically created record that matches the criteria, i.e. the result would be George. But this may vary from version to version and from DBMS to DBMS. What should have been returned? One might say Ringo, since he apparently logged in last, but this is pure interpretation. In my opinion both should have been returned, because you can't decide unambiguously from the data available.
So this query matches the requirements:
SELECT users.*
FROM logins
JOIN users ON
    logins.user_id = users.user_id AND
    logins.login_date = (
        SELECT MAX(logins.login_date)
        FROM logins
        JOIN users ON logins.user_id = users.user_id
    )
As an alternative, some DBMSs provide special functions for this very purpose: e.g. Microsoft SQL Server offers TOP n WITH TIES ... ORDER BY key (suggested by gbn), and SQL Server 2005 introduced RANK and DENSE_RANK.
If you search SO for e.g. ROW_NUMBER you'll find numerous solutions which suggest using ranking functionality but fail to point out the possible problems.
Question: What advice should be given if a solution that includes ranking functionality is proposed?

rank and row_number are fantastic functions that should be used more liberally, IMO. Folks just don't know about them.
That being said, you need to make sure that what you're ranking by is unique. Have a backup plan for duplicates (especially dates). The data you get back is only as good as the data you put in.
I think the pitfalls here are exactly the same as in this query:
select top 2 * from tblA order by date desc
You need to be aware of what you're ordering on and ensure that there is some way to always have a winner. If not, you get a (potentially) random two rows with the max date.
Also, for the record, SQL Server does not store rows in the physical order that they are inserted. It stores records on 8k pages and orders those pages in the most efficient way it can according to the clustered index on the table. Therefore, there is absolutely no guarantee of order in SQL Server.

- Use the WITH TIES clause in your example above:

  SELECT TOP 1 WITH TIES users.*
  FROM logins
  JOIN users ON logins.user_id = users.user_id
  ORDER BY logins.login_date DESC

- Use DENSE_RANK, as you mentioned.
- Not put myself in this position. Example: store the time too (datetime) and accept the very low risk of a very rare duplicate in the same 3.33 millisecond instant (SQL Server 2008's datetime2 offers finer precision).
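As a minimal sketch of the DENSE_RANK option against the users/logins example above; it returns both George and Ringo, who are tied on the latest login_date:

SELECT user_id, name
FROM (
    SELECT users.user_id, users.name,
           DENSE_RANK() OVER (ORDER BY logins.login_date DESC) AS rnk
    FROM logins
    JOIN users ON logins.user_id = users.user_id
) AS ranked
WHERE rnk = 1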

Every database engine uses some kind of a row identifier so that it can distinguish between two rows.
These identifiers are:
- Row pointer in MyISAM
- Primary key in InnoDB table with a PRIMARY KEY defined
- Uniquifier in InnoDB table without a PRIMARY KEY defined
- RID in SQL Server's heap table
- Primary key in SQL Server's table clustered on PRIMARY/UNIQUE KEY
- Index key + uniquifier in SQL Server's table clustered on a non-unique key
- ROWID / UROWID in Oracle
- CTID in PostgreSQL
You don't have immediate access to the following ones:
- Row pointer in MyISAM
- Uniquifier in InnoDB table without a PRIMARY KEY defined
- RID in SQL Server's heap table
- Index key + uniquifier in SQL Server's table clustered on a non-unique key
Besides, you don't have control over the following ones (they can change on updates or when restoring from backups):
- ROWID / UROWID in Oracle
- CTID in PostgreSQL
If two rows are identical in all their columns, they should be identical from the application's point of view: they return exactly the same results, and the row identifier can be treated as an ultimate uniquifier.
This just means you should always include some kind of uniquifier that you have full control over in the ordering clause to keep your ordering consistent.
If your table has a primary or unique key (even composite), include it in the ordering condition:

SELECT *
FROM mytable
ORDER BY ordering_column, pk
Otherwise, include all columns in the ordering condition:

SELECT *
FROM mytable
ORDER BY ordering_column, column1, ..., columnN
The latter condition may return any of the otherwise indistinguishable rows, but since they're indistinguishable anyway, the result will look consistent from your application's point of view.
That, by the way, is another good reason for always having a PRIMARY KEY in your tables.
But do not rely on ROWID / CTID to order rows: they can easily change on UPDATE, so your result order will not be stable anymore.
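A small PostgreSQL illustration of that last point (demo is a hypothetical table): every UPDATE writes a new row version, so the ctid changes and any ordering based on it is unstable:

CREATE TABLE demo (val int);
INSERT INTO demo VALUES (1);
SELECT ctid, val FROM demo;  -- e.g. (0,1)
UPDATE demo SET val = 2;
SELECT ctid, val FROM demo;  -- now e.g. (0,2): the row has moved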

ROW_NUMBER is a fantastic tool indeed. If misused it can produce non-deterministic results, but so can other SQL constructs; plain ORDER BY can return non-deterministic results as well.
Just know what you are doing.
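For illustration, using the logins table from the question: the first numbering below is non-deterministic for the two rows sharing 2009-08-20, while the second is deterministic because the unique login_id breaks the tie:

SELECT login_id,
       ROW_NUMBER() OVER (ORDER BY login_date DESC) AS rn_ambiguous,
       ROW_NUMBER() OVER (ORDER BY login_date DESC, login_id DESC) AS rn_deterministic
FROM logins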

This is the summary:
- Use your head first. Should be obvious, but it is always a good place to start. Do you expect exactly n rows, or do you expect a possibly varying number of rows that fulfill a constraint?
- Reconsider your design. If you're expecting exactly n rows, your model might be designed poorly if it's impossible to identify a row unambiguously. If you expect a possibly varying number of rows, you might need to adjust your UI in order to present your query results.
- Add columns to key that make it unique (e.g. the PK). You at least regain control over the returned result. There is almost always a way to do this, as Quassnoi pointed out.
- Consider using possibly more suitable functions like RANK, DENSE_RANK and TOP n WITH TIES. They are available in Microsoft SQL Server from version 2005 and in PostgreSQL from 8.4 onwards. If these functions are not available, consider using nested queries with aggregation instead of ranking functions.

Related

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. So the idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2  -- Gives correct RRNs
from (
    select row_number() over() as SEQ,
           rrn(e2) as RRN2, e2.*
    from TABLE1 as e2
    where e2.UPDATED between '2013-05-01' and '2013-05-31'
    order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300  -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
    select t.RRN2  -- Gives correct RRNs
    from (
        select row_number() over() as SEQ,
               rrn(e2) as RRN2, e2.*
        from TABLE1 as e2
        where e2.UPDATED between '2013-05-01' and '2013-05-31'
        order by e2.UPDATED, e2.ACCOUNT
    ) as t
    where t.SEQ > 270 and t.SEQ <= 300  -- Paging
    order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls, one to issue the simple subquery and return just the RRNs (row IDs), and the second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
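A sketch of the two-call approach; the placeholder comment marks where the application feeds the RRNs from the first call into the second, and the ORDER BY added inside OVER() makes the numbering itself deterministic:

-- Call 1: fetch just the RRNs for the requested page
select t.RRN2
from (
    select row_number() over(order by e2.UPDATED, e2.ACCOUNT) as SEQ,
           rrn(e2) as RRN2
    from TABLE1 as e2
    where e2.UPDATED between '2013-05-01' and '2013-05-31'
) as t
where t.SEQ > 270 and t.SEQ <= 300

-- Call 2: fetch the full rows (plus any joins) for exactly those RRNs
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (/* RRNs returned by call 1 */)
order by e.UPDATED, e.ACCOUNT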
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.
Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there are no keys defined in the table itself (a practice that is 20 years out of date), but in that case there's usually a logical file with a unique key defined that is used by the application as the de facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) Create a new LF using the name of the existing PF
3) Repoint the existing LF to the new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.
I find it hard to believe that querying a table of a mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.
Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time; but in this case it's as likely to be simple copying of intermediate result rows as anything.

Unique sort order for postgres pagination

While trying to implement server-side pagination in Postgres, I came across the point that when using the LIMIT and OFFSET keywords you have to provide an ORDER BY clause on a unique column, probably the primary key.
In my case I am using UUID generation for primary keys, so I can't rely on a sequential order of increasing keys: ORDER BY pkey DESC might not always put newer rows on top.
So I resorted to using the created-date column, a timestamp column which should be unique.
But what if the UI client wants to sort by some other column? In the event that it might not be a unique column, I resort to ORDER BY user_column, created_dt DESC so as to maintain predictable results for pagination.
Is this the right approach? I am not sure I am going the right way; please advise.
I talked about this exact problem on an old blog post (in the context of using an ORM):
One last note about using sorting and paging in conjunction. A query that implements paging can have odd results if the ORDER BY clause does not include a field that represents an empirical sequence in the data; sort order is not guaranteed beyond what is explicitly specified in the ORDER BY clause in most (maybe all) database engines. An example: if you have 100 orders that all occurred on the exact same date, and you ask for the first page of this data sorted by this date, then ask for the second page of data sorted the same way, it is entirely possible that you will get some of the data duplicated across both pages. So depending on the query and the distribution of data that is “sortable,” it can be a good practice to always include a unique field (like a primary key) as the final field in a sort clause if you are implementing paging.
http://psandler.wordpress.com/2009/11/20/dynamic-search-objects-part-5sorting/
The strategy of using a column that uniquely identifies a record, such as pkey or an insertion date, may not be possible in some cases.
I have an application where the user sets up his own grid query; it can include any column from multiple tables, and perhaps none is a unique identifier.
In such a case it can be useful to use row_number(). You simply select row_number() and use the same sort in its OVER clause. It would be something like:
select col1, col2, col3, row_number() over(order by col3) from tableX order by col3
It's important that the ORDER BY inside OVER() matches the outer ORDER BY. That way your paging will have consistent results every time.
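A minimal sketch of the tie-breaking approach described above (mytable is a placeholder; user_column, created_dt and pkey are the names used in the question). The unique pkey comes last, so pages never overlap even when the other sort columns contain duplicates:

SELECT *
FROM mytable
ORDER BY user_column, created_dt DESC, pkey
LIMIT 20 OFFSET 40;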

Very slow SQL query

I'm having a problem with a slow query. Consider the table tblVotes, which has two columns: VoterGuid and CandidateGuid. It holds votes cast by voters for any number of candidates.
There are over 3 million rows in this table, with about 13,000 distinct voters casting votes for about 2.7 million distinct candidates. The total number of rows in the table is currently 6.5 million.
What my query is trying to achieve is getting - in the quickest and most cache-efficient way possible (we are using SQL Express) - the top 1000 candidates based on the number of votes they have received.
The code is:
SELECT CandidateGuid, COUNT(*) CountOfVotes
FROM dbo.tblVotes
GROUP BY CandidateGuid
HAVING COUNT(*) > 1
ORDER BY CountOfVotes DESC
... but this takes a scarily long time to run on SQL express when there is a very full table.
Can anybody suggest a good way to speed this up and get it running in quick time? CandidateGuid is indexed individually - and there is a composite primary key on CandidateGuid+VoterGuid.
If you have only two columns in a table, a "normal" index on those two fields won't help you much, because it is in fact a copy of your entire table, only ordered. First check in the execution plan whether your index is being used at all.
Then consider changing your index to a clustered index.
Try using TOP n instead of a HAVING clause, like so:
SELECT TOP 1000 CandidateGuid, COUNT(*) CountOfVotes
FROM dbo.tblVotes
GROUP BY CandidateGuid
ORDER BY CountOfVotes DESC
I don't know if SQL Server is able to use the composite index to speed this query, but if it is able to do so you would need to express the query as SELECT CandidateGUID, COUNT(VoterGUID) FROM . . . in order to get the optimization. This is "safe" because you know VoterGUID is never NULL, since it's part of a PRIMARY KEY.
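A sketch of that rewrite, combined with the TOP 1000 from the previous answer:

SELECT TOP 1000 CandidateGuid, COUNT(VoterGuid) AS CountOfVotes
FROM dbo.tblVotes
GROUP BY CandidateGuid
ORDER BY CountOfVotes DESC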
If your composite primary key is specified as (CandidateGUID, VoterGUID) you will not get any added benefit of a separate index on just CandidateGUID -- the existing index can be used to optimize any query that the singleton index would assist in.
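A small sketch of that point (PK_tblVotes and IX_CandidateOnly are hypothetical names): with the composite primary key below, a separate single-column index would duplicate what the leading column of the PK already provides:

ALTER TABLE dbo.tblVotes
    ADD CONSTRAINT PK_tblVotes PRIMARY KEY (CandidateGuid, VoterGuid);

-- redundant: the leading column of the composite PK already serves these queries
-- CREATE INDEX IX_CandidateOnly ON dbo.tblVotes (CandidateGuid);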

In SQL Server, is TOP deterministic by default when used on a table with a clustered index?

So I was trying to explain to some people why this query is a bad idea:
SELECT z.ReportDate, z.Zipcode, SUM(z.Sales) AS Sales,
    COALESCE(
        (SELECT TOP (1) GroupName
         FROM dbo.zipGroups
         WHERE (Zipcode = z.Zipcode)), 'Unknown') AS GroupName,
    COALESCE(
        (SELECT TOP (1) GroupCode
         FROM dbo.zipGroups
         WHERE (Zipcode = z.Zipcode)), 0) AS GroupNumber
FROM dbo.Report_ByZipcode AS z
GROUP BY z.ReportDate, z.Zipcode
and suggesting a better way to write it, when my boss ended the discussion with, "Well, it's been returning the right data for the last year and we haven't had any problems with it, so it's fine."
At which point I thought to myself, how in the world is that even possible?
After some digging, I discovered these facts:
- This query is supposed to group sales by Zipcode and date, and link those to the largest Group (by population size) that a Zipcode is assigned to by way of the zipGroups table.
- Each Zipcode can be assigned to 0 to many Groups, and if a Zipcode is assigned to 0 Groups, it's simply not in the zipGroups table.
- A Group is a geographical area, and GroupNumbers are ranked largest to smallest by population (for example, the group covering the NY-NJ-CT tri-state area is GroupNumber 1, and North Platte, Nebraska is GroupNumber 209).
- The zipGroups table has not changed in at least 2 years.
- The zipGroups table has a clustered index with Zipcode, GroupNumber (ascending) as the keys.
- The combination of Zipcode, GroupNumber is unique in zipGroups.
So my question has 2 parts.
A) Even though there are no ORDER BY clauses in those SELECT TOP queries, are they actually deterministic because the clustered index is basically providing it a default ORDER BY?
B1) If that is true, is the query, however precariously, actually doing what it's supposed to do?
B2) If that is not true, can you help me prove it?
Note: I've already re-written this to use joins, so I don't need the SQL to fix it, I need to get it into production so I stop worrying about it breaking.
SQL Server makes no guarantees about the ordering of records in the absence of ORDER BY. It might yield the correct results 999,999 times and then fail on the millionth try. Don't do it.
Always use an ORDER BY with a TOP statement. The order is not guaranteed to be in the order of the clustered index, as demonstrated in this blog post (complete with a query that disproves it):
Without ORDER BY, there is no default sort order.
Even if it did go by the clustered index, I wouldn't write queries that depend on undocumented behavior of the DB engine and it is better to be explicit for readability.
If you're relying on a clustered index rather than the collation, then getting the right order is coincidental, not deterministic.
In the real world, indexes can be changed from one kind to another, for good reasons, bad reasons, or no reason at all. And, in the real world, you don't necessarily get to choose which index SQL Server will use in executing a query. (Or whether it will use an index at all.)
Technically, the collation can also be changed for good reasons, bad reasons, or no reason at all. But everybody knows changing the collation will change the sort order--that's its job, after all--so it's not a surprise. (Ever heard of "the principle of least surprise"?)
The link by JohnFx is good, although long and hard to follow. Here's a small snippet on its own that will show the data returning in non-clustered-index order.
CREATE TABLE t1 (x INT NOT NULL PRIMARY KEY CLUSTERED, z INT NOT NULL UNIQUE);
INSERT INTO t1 (x,z) VALUES (1,4);
INSERT INTO t1 (x,z) VALUES (3,3);
INSERT INTO t1 (x,z) VALUES (2,2);
INSERT INTO t1 (x,z) VALUES (4,1);
SELECT x, z FROM t1;
Output (you should get)
x           z
----------- -----------
4           1
2           2
3           3
1           4
The execution plan shows it using the Unique (or other) index instead of the clustered index.
Even if the clustered index is chosen, it may not sort correctly if the data is being merged from parallelism, if the TOP N count is high enough.
Having said that, since you are only using TOP(1) and if the table has only one index available, it can be considered deterministic since it will only use that index and pick the first entry in the index pages.
A) Even though there are no ORDER BY clauses in those SELECT TOP queries, are they actually deterministic because the clustered index is basically providing it a default ORDER BY?
B1) If that is true, is the query, however precariously, actually doing what it's supposed to do?
When top is specified without ordering, the ordering is a side effect of the method of access chosen by the query optimizer. Since the query optimizer would use the clustered index to resolve this query, you get a quite nice side effect.
I wouldn't use the word deterministic, as the query optimizer might not be deterministic. However in the case where the optimizer choses the clustered index, yes - the query does what it is supposed to do.
ORDER BY should still be specified, so as to lock the correctness into the query. One should separate correctness ("what do you want") from implementation ("how do you get it"); they belong in the query and the optimizer plan, respectively.
B2) If that is not true, can you help me prove it?
Assuming there are more columns in the ZipGroups table, a Nonclustered index containing the only two relevant columns could be added that would be preferred over the clustered index. If the nonclustered index had a different ordering (Zipcode asc, GroupNumber desc), then the query would break.
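For completeness, a sketch of how the subqueries could be made explicitly deterministic; ordering by GroupNumber ASC is an assumption based on the stated ranking (GroupNumber 1 = largest group):

SELECT z.ReportDate, z.Zipcode, SUM(z.Sales) AS Sales,
    COALESCE(
        (SELECT TOP (1) GroupName
         FROM dbo.zipGroups
         WHERE Zipcode = z.Zipcode
         ORDER BY GroupNumber ASC), 'Unknown') AS GroupName,
    COALESCE(
        (SELECT TOP (1) GroupCode
         FROM dbo.zipGroups
         WHERE Zipcode = z.Zipcode
         ORDER BY GroupNumber ASC), 0) AS GroupNumber
FROM dbo.Report_ByZipcode AS z
GROUP BY z.ReportDate, z.Zipcode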

Optimizing "ORDER BY" when the result set is very large and it can't be ordered by an index

How can I make an ORDER BY clause with a small LIMIT (i.e. 20 rows at a time) return quickly, when I can't use an index to satisfy the ordering of rows?
Let's say I would like to retrieve a certain number of titles from a table 'node' (simplified below). I'm using MySQL by the way.
node_ID INT(11) NOT NULL auto_increment,
node_title VARCHAR(127) NOT NULL,
node_lastupdated INT(11) NOT NULL,
node_created INT(11) NOT NULL
But I need to limit the rows returned to only those a particular user has access to. Many users have access to large numbers of nodes. I have this information pre-calculated in a big lookup table (an attempt to make things easier), where the primary key covers both columns and the presence of a row means that the usergroup has access to that node:
viewpermission_nodeID INT(11) NOT NULL,
viewpermission_usergroupID INT(11) NOT NULL
My query therefore contains something like
FROM
node
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
... and I also use a GROUP BY or a DISTINCT so that a node is only returned once even if two of the user's 'usergroups' both have access to that node.
My problem is that there seems to be no way for an ORDER BY clause which sorts results by created or last updated date to use an index, because the rows being returned depend on values in the other viewpermission table.
Therefore MySQL would need to find all rows which match the criteria, then sort them all itself. If there are one million rows for a particular user, and we want to view, say, the latest 100 or rows 100-200 when ordered by last update, the DB would need to figure out which one million rows the user can see, sort this whole result set itself, before it can return those 100 rows, right?
Is there any creative way to get around this? I've been thinking along the lines of:
Somehow add dates into the viewpermission lookup table so that I can build an index containing the dates as well as the permissions. It's a possibility I guess.
Edit: Simplified question
Perhaps I can simplify the question by rewriting it like this:
Is there any way to rewrite this query or create an index for the following such that an index can be used to do the ordering (not just to select the rows)?
SELECT nodeid
FROM lookup
WHERE usergroup IN (2, 3)
GROUP BY nodeid
An index on (usergroup) allows the WHERE part to be satisfied by an index, but the GROUP BY forces a temporary table and filesort on those rows. An index on (nodeid) does nothing for me, because the WHERE clause needs an index with usergroup as its first column. An index on (usergroup, nodeid) forces a temporary table and filesort because the GROUP BY is not the first column of the index that can vary.
Any solutions?
Can I answer my own question?
I believe I have found that the only way to do what I describe is for my lookup table to have rows for every possible combination of usergroups a person may want to be a member of.
To pick a simplified example, instead of doing this:
SELECT id FROM ids WHERE groups IN(1,2) ORDER BY id
If you need to use the index both to select rows and to order them, you have to abstract that IN(1,2) so that it is constant rather than a range, ie:
SELECT id FROM ids WHERE grouplist='1,2' ORDER BY id
Of course instead of using the string '1,2' you could have a foreign key there, etc. The point being that you'd have to have a row not just for each group but for each combination of multiple groups.
So, there is my answer.
Anyway, for my application, I feel that maintaining a lookup for all possible combinations of usergroups for each node is not worth it. For my purposes, I predict that most nodes are visible to most users, so I feel that it is acceptable to simply make the GROUP BY use the index, as the filtering doesn't need it so badly.
In other words, the approach I'll take for my original query may be something like:
SELECT
    <fields>
FROM node FORCE INDEX (node_created_and_node_ID)
INNER JOIN viewpermission ON
    viewpermission_nodeID = node_ID
    AND viewpermission_usergroupID IN (<...usergroups of current user...>)
GROUP BY
    node_created, node_ID
GROUP BY can use an index if it starts at the leftmost column of the index and the table is the first non-const, non-system table to be processed. (Note the index hint has to be attached to the node table reference, as above.) The join then deals with the entire list (which is already ordered), and only those rows not visible to the current user (which will be a small proportion) are removed by the INNER JOIN.
Copy the value you are going to order by into the viewpermission table and add it to your index.
You could use a trigger to maintain that value from the other table.
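A rough sketch of that trigger idea in MySQL, assuming a copied column node_created is added to viewpermission (all names not in the original schema are hypothetical):

ALTER TABLE viewpermission ADD COLUMN node_created INT NOT NULL DEFAULT 0;
CREATE INDEX ix_group_created ON viewpermission (viewpermission_usergroupID, node_created);

-- keep the copy in sync whenever a permission row is inserted
CREATE TRIGGER trg_vp_insert BEFORE INSERT ON viewpermission
FOR EACH ROW
    SET NEW.node_created = (SELECT node_created
                            FROM node
                            WHERE node_ID = NEW.viewpermission_nodeID);

An equivalent trigger on node would be needed if node_created can change after insertion.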
select *
from (
    select *
    from node
    inner join viewpermission
        on viewpermission_nodeID = node_ID
        and viewpermission_usergroupID in (<...usergroups of current user...>)
) a
order by a.node_lastupdated desc
The inner query gives you the filtered subset, which I understand is substantially smaller than the whole set. Only the smaller set has to be sorted.
MySQL has problems when you use GROUP BY and ORDER BY in the same query. That causes a filesort, and that's probably the biggest penalty for performance.
You can eliminate the need for a DISTINCT (or GROUP BY) by using a non-correlated subquery instead of a JOIN.
SELECT * FROM node
WHERE node_id IN (
    SELECT viewpermission_nodeID
    FROM viewpermission
    WHERE viewpermission_usergroupID IN ( <...usergroups...> )
)
ORDER BY node_lastupdated DESC
LIMIT 100;
There's no need to sort or do a DISTINCT on the subquery, since IN (1, 1, 2, 3) is the same as IN (1, 3, 2).
Note that MySQL can use only one index per table in a given query, so it'll try to make the best choice between an index on node_id and an index on node_lastupdated. It can't use both, and even if you made a compound index it wouldn't help in this case.
Remember to analyze different solutions with EXPLAIN.
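For instance, a quick check of the subquery version, with (2, 3) standing in for the current user's usergroups as in the simplified example earlier:

EXPLAIN
SELECT * FROM node
WHERE node_id IN (
    SELECT viewpermission_nodeID
    FROM viewpermission
    WHERE viewpermission_usergroupID IN (2, 3)
)
ORDER BY node_lastupdated DESC
LIMIT 100;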