How should I handle "ranked x out of y" data in PostgreSQL?

I have a table that I would like to be able to present "ranked X out of Y" data for. In particular, I'd like to be able to present that data for an individual row in a relatively efficient way (i.e. without selecting every row in the table). The ranking itself is quite simple, it's a straight ORDER BY on a single column in the table.
Postgres seems to present some unique challenges in this regard; AFAICT it doesn't have a RANK or ROW_NUMBER or equivalent function (at least in 8.3, which I'm stuck on for the moment). The canonical answer in the mailing list archives seems to be to create a temporary sequence and select from it:
test=> create temporary sequence tmp_seq;
CREATE SEQUENCE
test=*> select nextval('tmp_seq') as row_number, col1, col2 from foo;
It seems like this solution still won't help when I want to select just a single row from the table (and I want to select it by PK, not by rank).
I could denormalize and store the rank in a separate column, which makes presenting the data trivial, but just relocates my problem. UPDATE doesn't support ORDER BY, so I'm not sure how I'd construct an UPDATE query to set the ranks (short of selecting every row and running a separate UPDATE for each row, which seems like way too much DB activity to trigger every time the ranks need updating).
Am I missing something obvious? What's the Right Way to do this?
EDIT: Apparently I wasn't clear enough. I'm aware of OFFSET/LIMIT, but I don't see how it helps solve this problem. I'm not trying to select the Xth-ranked item, I'm trying to select an arbitrary item (by its PK, say), and then be able to display to the user something like "ranked 43rd out of 312."

If you want the rank, do something like
SELECT id,num,rank FROM (
SELECT id,num,rank() OVER (ORDER BY num) FROM foo
) AS bar WHERE id=4
Or if you actually want the row number, use
SELECT id,num,row_number FROM (
SELECT id,num,row_number() OVER (ORDER BY num) FROM foo
) AS bar WHERE id=4
They'll differ when you have equal values somewhere. There is also dense_rank() if you need that.
This requires PostgreSQL 8.4, of course.
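Since the question asks for "ranked X out of Y", on 8.4+ you can get both numbers from a single query by adding a count over the whole result set; a minimal sketch, reusing the foo / id / num names from above:
SELECT id, num, rnk, total
FROM (
    SELECT id, num,
           rank()   OVER (ORDER BY num) AS rnk,    -- X: position in the ordering
           count(*) OVER ()             AS total   -- Y: total number of rows
    FROM foo
) AS bar
WHERE id = 4;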

Isn't it just this:
SELECT *
FROM mytable
ORDER BY
col1
OFFSET X LIMIT 1
Or am I missing something?
Update:
If you want to show the rank, use this:
SELECT q.*, vals[1] + 1 AS rank, vals[2] AS total
FROM (
    SELECT mo.*,
           (
           -- count the rows that sort strictly before this one (ctid breaks ties)
           -- and the total row count, in a single pass over the table
           SELECT ARRAY[SUM(((mi.col1, mi.ctid) < (mo.col1, mo.ctid))::INTEGER),
                        COUNT(*)]
           FROM mytable mi
           ) AS vals
    FROM mytable mo
    WHERE mo.id = #myid
) q

ROW_NUMBER functionality in PostgreSQL is typically emulated with LIMIT n OFFSET skip.
Find an overview here.
On the pitfalls of ranking see this SO question.
EDIT: Since you are asking for ROW_NUMBER() rather than simple ranking: row_number() was introduced in PostgreSQL 8.4, so you might consider upgrading. Otherwise a workaround might be helpful, for example along the lines of the sketch below.
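For example, a hedged sketch of such a workaround that runs on 8.3, using a correlated count instead of a window function (it reuses foo and col1 from the question; the id column and the value 4 are just placeholders, and ties share a rank, so it behaves like rank() rather than row_number()):
SELECT f.id,
       (SELECT count(*) FROM foo WHERE col1 < f.col1) + 1 AS rank,
       (SELECT count(*) FROM foo)                         AS total
FROM foo f
WHERE f.id = 4;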

Previous replies tackle the question "select all rows and get their rank" which is not what you want...
you have a row
you want to know its rank
Just do :
SELECT count(*) FROM table WHERE score > $1
Where $1 is the score of the row you just selected (I suppose you'd like to display it so you might select it...).
Or do :
SELECT a.*, (SELECT count(*) FROM table b WHERE b.score > a.score) AS rank FROM table AS a WHERE pk = ...
However, if you select a row which is ranked last, yes you will need to count all the rows which are ranked before it, so you'll need to scan the whole table, and it will be very slow.
Solution :
SELECT count(*) FROM (SELECT 1 FROM table WHERE score > $1 LIMIT 30) t
You'll get precise ranking for the 30 best scores, and it will be fast.
Who cares about the losers?
OK, if you really do care about the losers, you'll need to make a histogram:
Suppose score can go from 0 to 100, and you have 1000000 losers with score < 80 and 10 winners with score > 80.
You make a histogram of how many rows have a score of X, it's a simple small table with 100 rows. Add a trigger to your main table to update the histogram.
Now if you want to rank a loser which has score X, its rank is sum(histo) where histo_score > X.
Since your score probably isn't between 0 and 100, but (say) between 0 and 1000000000, you'll need to fudge it a bit: enlarge your histogram bins, for instance, so you only need 100 bins at most, or use some log-histogram distribution function.
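A hedged sketch of that idea, leaving the trigger itself out and using made-up names (score_histo for the histogram table, one row per score value; buckets would work the same way, with score replaced by a bucket expression):
-- one row per score value; a trigger on the main table keeps cnt in sync
CREATE TABLE score_histo (
    score int PRIMARY KEY,
    cnt   bigint NOT NULL
);

-- rank of a row whose score is $1: 1 + number of rows with a better score
SELECT 1 + coalesce(sum(cnt), 0) AS rank
FROM score_histo
WHERE score > $1;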
By the way, Postgres does this itself when you ANALYZE the table, so if you set the statistics target to 100 or 1000 on the score column, run ANALYZE, and then run:
EXPLAIN SELECT * FROM table WHERE score > $1
you'll get a nice rowcount estimate.
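A hedged sketch of that, assuming the table is called scores (ALTER TABLE ... SET STATISTICS is the per-column setting behind the informal "statistics target" above):
ALTER TABLE scores ALTER COLUMN score SET STATISTICS 1000;
ANALYZE scores;
EXPLAIN SELECT * FROM scores WHERE score > 12345;
-- the "rows=" figure in the resulting plan is the estimated count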
Who needs exact answers?

Related

SQL/HANA query assigning incremental number per X unique values without using stored procedure

Currently I'm struggling trying to write a select query to achieve the following result:
I have documents, each document has an address.
Now I want a third column that will assign a unique number for every X unique addresses found in the result list. In the example I have used X = 3.
There are a total of 7 unique addresses, which means we need 3 unique numbers:
1 = (addresses A, B, C), 2 = (addresses D, E, F), 3 = (address G).
PS. I have already worked out this logic in a stored procedure, but because of technical limitations that I can't go into detail about, this has to be done using a SELECT query if possible. If this is not possible we will have to find another workaround.
I was hoping you could point me in the right direction for which HANA SQL syntax to use to achieve this. I've been looking into ROW_NUMBER and DENSE_RANK but without success.
In order for your question to make sense, you need an ordering column. This answer assumes that document orders the rows.
You want to assign the group based on the minimum document of each address (and then do arithmetic). You can get this with two levels of window functions:
select t.*,
       ceil(dense_rank() over (order by min_document) / 3.0) as incremental_number
from (select t.*,
             min(document) over (partition by address) as min_document
      from t
     ) t;

Sqlite select from two tables and order by limit

SELECT y.M_id,SUM(t.alarm_sum),y.yarn1,AVG(t.yarn1_in),SUM(t.yarn1_alarm)
FROM '111' t, yarn y
WHERE t.date=date('now') AND y.M_id='111'
ORDER BY t.row_id DESC
LIMIT 5
I want to get the data from 2 tables, '111' and yarn, but I only want the last 5 rows of data from '111'.
I ran it, but the output is not limited to 5 rows from the '111' table.
What do I need to change?
You can use a subquery e.g.
select ...
from (select * from '111' order by row_id desc limit 5) t,
yarn y
[...]
However it's not clear to me that this makes any sense, and it's not clear what you're trying to achieve either.
What you're doing here is a cartesian join: functionally the database is going to select all rows of t, all rows of y, create every combination of both (so it's going to create the combination of t's first row with all of y's, t's second row with all of y, ...), then it will filter based on your condition, and finally order and limit.
I'm surprised you're getting anything other than the very last row from t though.
(Also, calling a column row_id seems like a bad idea in SQLite, as it'd be easy to confuse with the database's own rowid.)
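Putting the subquery together with the columns from the question, a hedged sketch (it assumes the date filter is meant to apply inside the subquery too, and keeps the question's mix of aggregates and bare columns, which SQLite tolerates):
SELECT y.M_id, SUM(t.alarm_sum), y.yarn1, AVG(t.yarn1_in), SUM(t.yarn1_alarm)
FROM (SELECT *
      FROM '111'
      WHERE date = date('now')
      ORDER BY row_id DESC
      LIMIT 5) AS t,
     yarn AS y
WHERE y.M_id = '111';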

Joining a series in postgres with a select query

I'm looking for a way to join these two queries (or run these two together):
SELECT s
FROM generate_series(1, 50) s;
With this query:
SELECT id FROM foo ORDER BY RANDOM() LIMIT 50;
In a way where I get 50 rows like this:
series, ids_from_foo
1, 53
2, 34
3, 23
I've been at it for a couple days now and I can't figure it out. Any help would be great.
Use row_number()
select row_number() over () as rn, id
from (
    select id
    from foo
    order by random()
    limit 50
) s
order by rn;
Picking the top n rows from a randomly sorted table is a simple, but slow way to pick 50 rows randomly. All rows have to be sorted that way.
Doesn't matter much for small to medium tables and one-time, ad-hoc use. For repeated use on a big table, there are much more efficient ways.
If the ratio of gaps / islands in the primary key is low, use this:
SELECT row_number() OVER() AS rn, *
FROM (
    SELECT *
    FROM (
        SELECT trunc(random() * 999999)::int AS foo_id
        FROM generate_series(1, 55) g
        GROUP BY 1  -- fold duplicates
    ) sub1
    JOIN foo USING (foo_id)
    LIMIT 50
) sub2;
With an index on foo_id, this is blazingly fast, no matter how big the table. (A primary key serves just fine.) Compare performance with EXPLAIN ANALYZE.
How?
999999 is an estimated row count of the table, rounded up. You can get it cheaply from:
SELECT reltuples FROM pg_class WHERE oid = 'foo'::regclass;
Round up to easily include possible new entries since the last ANALYZE. You can also use the expression itself in a generic query dynamically; it's cheap. Details:
Fast way to discover the row count of a table in PostgreSQL
55 is your desired number of rows (50) in the result, multiplied by a low factor to easily make up for the gap ratio in your table and (unlikely but possible) duplicate random numbers.
If your primary key does not start near 1 (does not have to be 1 exactly, gaps are covered), add the minimum pk value to the calculation:
min_pkey + trunc(random() * 999999)::int
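A hedged sketch of how that could be folded into the innermost subquery (sub1) above; the scalar subquery for the minimum is evaluated only once:
SELECT (SELECT min(foo_id) FROM foo) + trunc(random() * 999999)::int AS foo_id
FROM generate_series(1, 55) g
GROUP BY 1;  -- fold duplicates, as before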
Detailed explanation here:
Best way to select random rows PostgreSQL

Remove duplicate rows - Impossible to find a decisive answer

You'd immediately think I came straight here to ask my question, but I googled an awful lot without finding a decisive answer.
Facts: I have a table with 3.3 million rows, 20 columns.
The first column is the primary key and thus unique.
I have to remove all the rows where columns 2 through 11 are duplicated. In fact a basic question, but there are so many different approaches, even though everyone seeks the same solution in the end: removing the duplicates.
I was personally thinking about GROUP BY HAVING COUNT(*) > 1
Is that the way to go or what do you suggest?
Thanks a lot in advance!
L
As a generic answer:
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY <groupbyfield>
               ORDER BY <tiebreaker>) AS rn
    FROM Table
)
DELETE FROM cte
WHERE rn > 1;
I find this more powerful and flexible than GROUP BY ... HAVING. In fact, GROUP BY ... HAVING only gives you the duplicates; you're still left with the 'trivial' task of choosing a 'keeper' amongst the duplicates.
ROW_NUMBER OVER (...) gives more control over how to distinguish among duplicates (the tiebreaker) and allows for behavior like 'keep the first 3 of the duplicates', not only 'keep just 1', which is really hard to do with GROUP BY ... HAVING.
The other part of your question is how to approach this for 3.3M rows. Well, 3.3M is not really that big, but I would still recommend doing this in batches. Delete TOP 10000 at a time, otherwise you'll push a huge transaction into the log and might overwhelm your log drives.
And the final question is whether this will perform acceptably. It depends on your schema. If the ROW_NUMBER() has to scan the entire table and spool to count, and you have to repeat this in batches N times, then it won't perform. An appropriate index will help. But I can't say anything more without knowing the exact schema involved (structure of clustered index/heap, all non-clustered indexes, etc.).
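A hedged sketch of that batched variant, wrapping a TOP-limited delete around the CTE above (SQL Server syntax; the <groupbyfield> / <tiebreaker> placeholders are kept as-is):
DECLARE @deleted int;
SET @deleted = 1;
WHILE @deleted > 0
BEGIN
    WITH cte AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY <groupbyfield>
                   ORDER BY <tiebreaker>) AS rn
        FROM Table
    )
    DELETE TOP (10000) FROM cte
    WHERE rn > 1;              -- delete at most 10000 duplicates per pass

    SET @deleted = @@ROWCOUNT; -- stop once a pass deletes nothing
END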
Group by the fields you want to be unique, and get an aggregate value (like min) for your pk field. Then insert those results into a new table.
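A hedged sketch of that approach with placeholder table and column names (pk is the primary key, col2..col11 the columns that define a duplicate; the kept row per group is the one with the lowest pk, copied in full):
SELECT t.*
INTO dbo.MyTable_dedup
FROM dbo.MyTable AS t
JOIN (SELECT MIN(pk) AS keep_pk
      FROM dbo.MyTable
      GROUP BY col2, col3, col4, col5, col6, col7, col8, col9, col10, col11
     ) AS k
  ON k.keep_pk = t.pk;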
If you have SQL Server 2005 or newer, then the easiest way would be to use a CTE (Common Table Expression).
You need to know what criteria you want to "partition" your data by - e.g. create partitions of data that is considered identical/duplicate - and then you need to order those partitions by something - e.g. a sequence ID, a date/time or something.
You didn't provide much details about your tables - so let me just give you a sample:
;WITH Duplicates AS
(
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
    FROM
        dbo.Orders
)
DELETE FROM Duplicates
WHERE RowN > 1
The CTE (WITH ... AS (...)) gives you an "inline view" for the next SQL statement - it's not persisted or anything - it just lives for that next statement and then it's gone.
Basically, I'm "grouping" (partitioning) my Orders by CustomerID, and ordering by OrderDate. So for each CustomerID, I get a new "group" of data, which gets a row number starting with 1. The ORDER BY OrderDate DESC gives the newest order for each customer the RowN = 1 value - this is the one order I keep.
All other orders for each customer are deleted based on the CTE (the WITH ... AS (...) expression).
You'll need to adapt this for your own situation, obviously - but the CTE with the PARTITION BY and ROW_NUMBER() are a very reliable and easy technique to get rid of duplicates.
If you don't want to deal with creating a new table, then just use DELETE TOP(1). Use a subquery to find the duplicated values and then use DELETE TOP to delete one row wherever there are multiple rows. You might have to run it more than once if there is more than one duplicate, but you get the point.
DELETE TOP(1) FROM Table
WHERE Field IN (SELECT Field FROM Table GROUP BY Field HAVING COUNT(*) > 1)
You get the idea hopefully. This is just some pseudo code to help demonstrate.

Efficient SQL to count an occurrence in the latest X rows

For example I have:
create table a (i int);
Assume there are 10k rows.
I want to count 0's in the last 20 rows.
Something like:
select count(*) from (select i from a limit 20) where i = 0;
Is that possible to make it more efficient? Like a single SQL statement or something?
PS. DB is SQLite3 if that matters at all...
UPDATE
PPS. No need to group by anything in this instance; assume the table literally has 1 column (plus, presumably, the internal DB row_id or something). I'm just curious whether this is possible to do without the nested selects.
You'll need to order by something in order to determine the last 20 rows. When you say last, do you mean by date, by ID, ...?
Something like this should work:
select count(*)
from (
select i
from a
order by j desc
limit 20
) where i = 0;
If you do not remove rows from the table, you may try the following hacky query:
SELECT COUNT(*) as cnt
FROM A
WHERE
ROWID > (SELECT MAX(ROWID)-20 FROM A)
AND i=0;
It operates with ROWIDs only. As the documentation says: Rows are stored in rowid order.
You need to remember to order by when you use limit, otherwise the result is indeterminate. To get the latest rows added, you need to include a column with the insertion date, then you can use that. Without this column you cannot guarantee that you will get the latest rows.
To make it efficient you should ensure that there is an index on the column you order by, possibly even a clustered index.
I'm afraid that you need a nested select to be able to count and restrict to last X rows at a time, because something like this
SELECT count(*) FROM a GROUP BY i HAVING i = 0
will count 0's, but in ALL table records, because a LIMIT in this query will basically have no effect.
However, you can optimize by using COUNT(i), as it is faster to COUNT only one field than 2 or more (in this case your table will have 2 fields: i, and the rowid that SQLite creates automatically in PK-less tables).
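Putting the two suggestions together, a hedged sketch (still assuming rows are never deleted, so rowid order matches insertion order):
SELECT COUNT(i) AS zero_count
FROM a
WHERE rowid > (SELECT MAX(rowid) - 20 FROM a)
  AND i = 0;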