Select random row for each group in a postgres table - sql

I have a table that is roughly:
id | category | link | caption | image
My goal is to fetch a random row from each distinct category in the table, for all the categories in the table. The plan is to then assign each row to a variable for its respective category.
Right now I'm using multiple SELECT statements resembling:
SELECT link, caption, image FROM table
WHERE category='whatever'
ORDER BY RANDOM() LIMIT 1`
But this seems inelegant and creates more trips to the DB, which is expensive.
I'm pretty sure there's a way to do this with window functions in Postgres, but I have no experience with them and I'm not entirely sure how to use one to get what I want.

Try something like:
SELECT DISTINCT ON (category) *
FROM table
ORDER BY category, random();
Or with window functions:
SELECT *
FROM (
SELECT *, row_number() OVER (PARTITION BY category ORDER BY random()) as rn
FROM table ) sub
WHERE rn = 1;

Related

SQL query looping for each value in a list

New to SQL here - I am trying to get 1 row from a table matching to a particular criteria
Typically this would look like
SELECT TOP 1 *
FROM myTable
WHERE id = 'abc'
The output may look like
value id
--------------
1 abc
The table has many entries for an 'id', and I am trying to get one entry per 'id'. Now I have list of 'id's. How would I execute something like
SELECT TOP 1 *
FROM myTable
FOR EACH id
WHERE id IN ('abc', 'edf', 'fgh')
Expecting result like
value id
--------------
1 abc
10 edf
12 fgh
I do not know if it is some sort union or concat operation, but would like to learn. I am working on Azure SQL Server
The table has many entries for an 'id', and I am trying to get one entry per 'id'. Now I have list of 'id's.
A typical method is row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from mytable t
) t
where seqnum = 1;
Note: you can filter on particular ids, if you want. It is unclear if that is really required for your question.
If you happen to be using SQL Server (as select top suggests), you can use the more concise, but somewhat less performant:
select top (1) with ties t.*
from mytable t
order by row_number() over (order by id order by (select null));

Snowflake SQL code to show only second record for items with duplicate ID

I'm trying to get my head around SQL and am using Snowflake as a testbed to do this. I have a table with products which have multiple reviews against them. I am trying to structure a query to only show products with 2 or more reviews and then only show the second review. As I say, this is merely me trying to better understand SQL so selecting the second review is a random ask. The table is made up of 4 columns. 1 is Product ID, 2 is Product Name, 3 is Review and 4 is Date Review was posted.
Thanks in advance for any help.
You use row_number() for this type of query:
select t.*
from (select t.*,
row_number() over (partition by product_id order by date_review asc) as seqnum
from t
) t
where seqnum = 2;
You can use a windowing function like ROW_NUMBER() to make numbered groupings, eg:
WITH Review_Sequence (
SELECT r.*,
ROW_NUMBER() OVER (PARTITION BY Product_ID ORDER BY Review_Date) Review_No
FROM Reviews r
)
SELECT * FROM Review_Sequence WHERE Review_No = 2

Global row numbers in chunked query

I would like to include a column row_number in my result set with the row number sequence, where 1 is the newest item, without gaps. This works:
SELECT id, row_number() over (ORDER BY id desc) AS row_number, title
FROM mytable
WHERE group_id = 10;
Now I would like to query for the same data in chunks of 1000 each to be easier on memory:
SELECT id, row_number() over (ORDER BY id desc) AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 0 AND id < 1000
ORDER BY id ASC;
Here the row_number restarts from 1 for every chunk, but I would like it to be as if it were part of the global query, as in the first case. Is there an easy way to accomplish this?
Assuming:
id is defined as PRIMARY KEY - which means UNIQUE and NOT NULL. Else you may have to deal with NULL values and / or duplicates (ties).
You have no concurrent write access on the table - or you don't care what happens after you have taken your snapshot.
A MATERIALIZED VIEW, like you demonstrate in your answer, is a good choice.
CREATE MATERIALIZED VIEW mv_temp AS
SELECT row_number() OVER (ORDER BY id DESC) AS rn, id, title
FROM mytable
WHERE group_id = 10;
But index and subsequent queries must be on the row number rn to get
data in chunks of 1000
CREATE INDEX ON mv_temp (rn);
SELECT * FROM mv_temp WHERE rn BETWEEN 1000 AND 2000;
Your implementation would require a guaranteed gap-less id column - which would void the need for an added row number to begin with ...
When done:
DROP MATERIALIZED VIEW mv_temp;
The index dies with the table (materialized view in this case) automatically.
Related, with more details:
Optimize query with OFFSET on large table
You want to have a query for the first 1000 rows, then one for the next 1000, and so on?
Usually you just write one query (the one you already use), have your app fetch 1000 records, do something with them, then fetch the next 1000 and so on. No need for separate queries, hence.
However, it would be rather easy to write such partial queries:
select *
from
(
SELECT id, row_number() over (ORDER BY id desc) AS rn, title
FROM mytable
WHERE group_id = 10
) numbered
where rn between 1 and 1000; -- <- simply change the row number range here
-- e.g. where rn between 1001 and 2000 for the second chunk
You need a pagination. Try this
SELECT id, row_number() over (ORDER BY id desc)+0 AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 0 AND id < 1000
ORDER BY id ASC;
Next time, when you change the start value of id in the WHERE clause change it in row_number() as well like below
SELECT id, row_number() over (ORDER BY id desc)+1000 AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 1000 AND id < 2000
ORDER BY id ASC;
or Better you can use OFFSET and LIMIT approach for pagination
https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
In the end I ended up doing it this way:
First I create a temporary materialized view:
CREATE MATERIALIZED VIEW vw_temp AS SELECT id, row_number() over (ORDER BY id desc) AS rn, title
FROM mytable
WHERE group_id = 10;
Then I define the index:
CREATE INDEX idx_temp ON vw_temp USING btree(id);
Now I can perform all operations very quickly, and with numbered rows:
SELECT * FROM vw_temp WHERE id BETWEEN 1000 AND 2000;
After doing the operations, cleanup:
DROP INDEX idx_temp;
DROP MATERIALIZED VIEW vw_temp;
Even though Thorsten Kettner's answer seems the cleanest one, it was not practical for me due to being too slow. Thanks for contributing everyone. For those interesed in the practical use case, I use this for feeding data to the Sphinx indexer.

Ambiguous column name using row_number() without alias

I'm trying to implement pagination in a query that is built using information from a view, and I need to use the row_number() function over a column when I don't know which table it is from.
SELECT * FROM (
SELECT class.ID as ID, user.ID as USERID, row_number() over (ORDER BY
ID desc) as row_number FROM class, user
) out_q WHERE row_number > #startrow ORDER BY row_number
The problem is that I only have the result column name (ID or USERID) that came from a previous query. If I execute this query, it will raise the error 'Ambiguous column name "ID"'. Is there a way to specify that I'm referencing the column ID that is being selected and not from a different table?
Is it possible to specify an alias to the query result itself?
I have already tried the following,
SELECT TOP 30 * FROM (
SELECT *, row_number() over (ORDER BY ID desc) as row_number FROM(
SELECT class.ID as ID, user.ID as USERID FROM class, user
) in_q
) out_q WHERE row_number > #startrow ORDER BY row_number
It works, but the SGBD gets confused on which query plan it has to use, because of the small row goal present in the outer query and the big set of results returned by the inner query, when #startrow is a small number, the query executes in less than one second, when it is a big number the query takes minutes to execute.
Your problem is the id in the row_number itself. If you want a stable sort, then include both ids:
SELECT *
FROM (SELECT class.ID as ID, user.ID as USERID,
row_number() over (ORDER BY class.ID desc, user.id) as row_number
FROM class CROSS JOIN user
) out_q
WHERE row_number > #startrow
ORDER BY row_number;
I assume the cartesian product is intentional. Sometimes, this indicates an error in the query. In general, I would advise you to avoid using commas in the from clause. If you do want a cartesian product, then be explicit by using CROSS JOIN.
You could try using the option you already tried, then use the OPTIMIZE FOR hint.
OPTION ( OPTIMIZE FOR (#startrow = 100000) );
See a description of the hint in MSDN docs here: https://msdn.microsoft.com/en-us/library/ms181714.aspx.

How do I use ROW_NUMBER()?

I want to use the ROW_NUMBER() to get...
To get the max(ROW_NUMBER()) --> Or i guess this would also be the count of all rows
I tried doing:
SELECT max(ROW_NUMBER() OVER(ORDER BY UserId)) FROM Users
but it didn't seem to work...
To get ROW_NUMBER() using a given piece of information, ie. if I have a name and I want to know what row the name came from.
I assume it would be something similar to what I tried for #1
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
but this didn't work either...
Any Ideas?
For the first question, why not just use?
SELECT COUNT(*) FROM myTable
to get the count.
And for the second question, the primary key of the row is what should be used to identify a particular row. Don't try and use the row number for that.
If you returned Row_Number() in your main query,
SELECT ROW_NUMBER() OVER (Order by Id) AS RowNumber, Field1, Field2, Field3
FROM User
Then when you want to go 5 rows back then you can take the current row number and use the following query to determine the row with currentrow -5
SELECT us.Id
FROM (SELECT ROW_NUMBER() OVER (ORDER BY id) AS Row, Id
FROM User ) us
WHERE Row = CurrentRow - 5
Though I agree with others that you could use count() to get the total number of rows, here is how you can use the row_count():
To get the total no of rows:
with temp as (
select row_number() over (order by id) as rownum
from table_name
)
select max(rownum) from temp
To get the row numbers where name is Matt:
with temp as (
select name, row_number() over (order by id) as rownum
from table_name
)
select rownum from temp where name like 'Matt'
You can further use min(rownum) or max(rownum) to get the first or last row for Matt respectively.
These were very simple implementations of row_number(). You can use it for more complex grouping. Check out my response on Advanced grouping without using a sub query
If you need to return the table's total row count, you can use an alternative way to the SELECT COUNT(*) statement.
Because SELECT COUNT(*) makes a full table scan to return the row count, it can take very long time for a large table. You can use the sysindexes system table instead in this case. There is a ROWS column that contains the total row count for each table in your database. You can use the following select statement:
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
This will drastically reduce the time your query takes.
You can use this for get first record where has clause
SELECT TOP(1) * , ROW_NUMBER() OVER(ORDER BY UserId) AS rownum
FROM Users
WHERE UserName = 'Joe'
ORDER BY rownum ASC
ROW_NUMBER() returns a unique number for each row starting with 1. You can easily use this by simply writing:
ROW_NUMBER() OVER (ORDER BY 'Column_Name' DESC) as ROW_NUMBER
May not be related to the question here. But I found it could be useful when using ROW_NUMBER -
SELECT *,
ROW_NUMBER() OVER (ORDER BY (SELECT 100)) AS Any_ID
FROM #Any_Table
select
Ml.Hid,
ml.blockid,
row_number() over (partition by ml.blockid order by Ml.Hid desc) as rownumber,
H.HNAME
from MIT_LeadBechmarkHamletwise ML
join [MT.HAMLE] h on ML.Hid=h.HID
SELECT num, UserName FROM
(SELECT UserName, ROW_NUMBER() OVER(ORDER BY UserId) AS num
From Users) AS numbered
WHERE UserName='Joe'
You can use Row_Number for limit query result.
Example:
SELECT * FROM (
select row_number() OVER (order by createtime desc) AS ROWINDEX,*
from TABLENAME ) TB
WHERE TB.ROWINDEX between 0 and 10
--
With above query, I will get PAGE 1 of results from TABLENAME.
If you absolutely want to use ROW_NUMBER for this (instead of count(*)) you can always use:
SELECT TOP 1 ROW_NUMBER() OVER (ORDER BY Id)
FROM USERS
ORDER BY ROW_NUMBER() OVER (ORDER BY Id) DESC
Need to create virtual table by using WITH table AS, which is mention in given Query.
By using this virtual table, you can perform CRUD operation w.r.t row_number.
QUERY:
WITH table AS
-
(SELECT row_number() OVER(ORDER BY UserId) rn, * FROM Users)
-
SELECT * FROM table WHERE UserName='Joe'
-
You can use INSERT, UPDATE or DELETE in last sentence by in spite of SELECT.
SQL Row_Number() function is to sort and assign an order number to data rows in related record set. So it is used to number rows, for example to identify the top 10 rows which have the highest order amount or identify the order of each customer which is the highest amount, etc.
If you want to sort the dataset and number each row by seperating them into categories we use Row_Number() with Partition By clause. For example, sorting orders of each customer within itself where the dataset contains all orders, etc.
SELECT
SalesOrderNumber,
CustomerId,
SubTotal,
ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SubTotal DESC) rn
FROM Sales.SalesOrderHeader
But as I understand you want to calculate the number of rows of grouped by a column. To visualize the requirement, if you want to see the count of all orders of the related customer as a seperate column besides order info, you can use COUNT() aggregation function with Partition By clause
For example,
SELECT
SalesOrderNumber,
CustomerId,
COUNT(*) OVER (PARTITION BY CustomerId) CustomerOrderCount
FROM Sales.SalesOrderHeader
This query:
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
will return all rows where the UserName is 'Joe' UNLESS you have no UserName='Joe'
They will be listed in order of UserID and the row_number field will start with 1 and increment however many rows contain UserName='Joe'
If it does not work for you then your WHERE command has an issue OR there is no UserID in the table. Check spelling for both fields UserID and UserName.