How does one get the total rows for a partition in PostgreSQL

I'm using a window function to help me paginate through a list of records in the database.
For example: I have a list of dogs, and each has a breed associated with it.
I want to show 10 dogs from each breed to my users.
So that would be:
select *
from   dogs
join  (
    select id, row_number() over (partition by breed) as row_number
    from   dogs
) rn on dogs.id = rn.id
where  row_number between 1 and 10
That will give me ~ten dogs from each breed.
What I need, though, is a count. Is there a way to get the count of the partitions? I want to know how many Staffies I have waiting for adoption.
I do notice that there's a percentage, and all the docs I find seem to indicate there's something called total rows, but I don't see it.

Just run the window aggregate function count(*) over the same partition (without adding ORDER BY!) to get the total count for the partition:
SELECT *
FROM  (
    SELECT *
         , row_number() OVER (PARTITION BY breed ORDER BY id) AS rn
         , count(*)     OVER (PARTITION BY breed) AS breed_count  -- !
    FROM   dogs
    ) sub
WHERE  rn < 11;
I also removed the unnecessary join and simplified.
See:
Run a query with a LIMIT/OFFSET and also get the total number of rows
And I added ORDER BY to the window definition of row_number() to get a deterministic result. Without it, Postgres is free to return any 10 arbitrary rows per breed, and any write to the table (or VACUUM, etc.) can and will change the result.
As an aside, pagination with LIMIT / OFFSET does not scale well. Consider:
Optimize query with OFFSET on large table
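For completeness, a minimal keyset ("seek") pagination sketch; it assumes pages are ordered by the unique id column and that the last id of the previous page is passed in as $1 (both assumptions for illustration, not from the question):
SELECT *
FROM   dogs
WHERE  id > $1   -- last id seen on the previous page
ORDER  BY id
LIMIT  10;
Unlike OFFSET, this never has to scan and discard the skipped rows, so it stays fast no matter how deep you page.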

Related

How can I show the top rows that cover 50% of the total visitors in SQL

Problem: I want to fetch a set of data from a table related to search keyword metrics. I want to fetch only the keywords that cover 50% of the total unique visitors. The overall code is given below:
SELECT se_keyword
,COUNT(DISTINCT visitor_id) AS Distinct_Visitors
FROM search_table
WHERE DATE >= '20210207'
GROUP BY se_keyword
ORDER BY Distinct_Visitors DESC
This will show all the keywords with the number of unique visitors for each. But I want to show only the top keywords, based on unique visitors, that together cover 50% of the total unique visitors.
This is a tricky problem. One method is the following:
Reduce the data to one row per user and keyword (not necessary if there are no duplicates).
Calculate a running total of the number of users using count(distinct) as a window function.
Filter for the conditions you want.
Here is what the logic looks like:
select distinct ku.keyword, ku.running_num_users
from (select ku.*,
             count(distinct userid) over (order by num_users desc) as running_num_users,
             count(distinct userid) over () as num_users_overall
      from (select keyword, userid,
                   count(*) over (partition by keyword) as num_users
            from t
            group by keyword, userid
           ) ku
     ) ku
where running_num_users <= 0.5 * num_users_overall;
Note that not all databases support count(distinct) as a window function. There are simple workarounds, however.
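Postgres, for one, rejects count(distinct ...) as a window function. Here is a hedged rewrite for Postgres, reusing the question's table and column names but summing the per-keyword distinct-visitor counts instead (a slightly different metric: a visitor who searched several keywords is counted once per keyword):
WITH per_keyword AS (
    SELECT se_keyword
         , COUNT(DISTINCT visitor_id) AS distinct_visitors
    FROM   search_table
    WHERE  DATE >= '20210207'
    GROUP  BY se_keyword
)
SELECT se_keyword, distinct_visitors
FROM  (
    SELECT *
         , SUM(distinct_visitors) OVER (ORDER BY distinct_visitors DESC, se_keyword) AS running_total
         , SUM(distinct_visitors) OVER () AS grand_total
    FROM   per_keyword
    ) sub
WHERE  running_total <= 0.5 * grand_total;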

Rank order ST_DWithin results by the number of radii a result appears in

I have a table of existing customers and another table of potential customers. I want to return a list of potential customers rank-ordered by the number of existing-purchaser radii they appear in.
There are many rows in the potential customers table for each existing customer, and the radius around a given existing customer could encompass multiple potential customers.
SELECT pur.contact_id AS purchaser, count(pot.*) AS nearby_potential_customers
FROM purchasers_geocoded pur, potential_customers_geocoded pot
WHERE ST_DWithin(pur.geom,pot.geom,1000)
GROUP BY purchaser;
Does anyone have advice on how to proceed?
EDIT:
With some help, I wrote this query, which seems to do the job, but I'm verifying now.
WITH prequalified_leads_table AS (
    SELECT *
    FROM   nearby_potential_customers
    WHERE  market_val > 80000
    AND    market_val < 120000
    )
, proximate_to_existing AS (
    SELECT pot.prop_id AS prequalified_leads
    FROM   purchasers_geocoded pur, prequalified_leads_table pot
    WHERE  ST_DWithin(pot.geom, pur.geom, 100)
    )
SELECT prequalified_leads, count(prequalified_leads)
FROM   proximate_to_existing
GROUP  BY prequalified_leads
ORDER  BY count(*) DESC;
I want to return a list of potential customers ordered by the count of the existing customer radii that they fall within.
Your query does the opposite of your statement: it counts potential customers around existing ones.
Inverting that, and adding some tweaks:
SELECT pot.contact_id AS potential_customer
     , rank() OVER (ORDER BY pur.nearby_customers DESC
                           , pot.contact_id) AS rnk
     , pur.nearby_customers
FROM   potential_customers_geocoded pot
LEFT   JOIN LATERAL (
    SELECT count(*) AS nearby_customers
    FROM   purchasers_geocoded pur
    WHERE  ST_DWithin(pur.geom, pot.geom, 1000)
    ) pur ON true
ORDER  BY 2;
I suggest a subquery with LEFT JOIN LATERAL ... ON true to get the counts. It should make use of the spatial index that you undoubtedly have:
CREATE INDEX ON purchasers_geocoded USING gist (geom);
The LEFT JOIN retains rows with 0 nearby customers in the result; your original join style would exclude those. Related:
What is the difference between LATERAL and a subquery in PostgreSQL?
Then ORDER BY the resulting nearby_customers in the outer query (not: nearby_potential_customers).
It's not clear whether you want to add an actual rank. Use the window function rank() if so. I made the rank deterministic while I was at it, breaking ties with an additional ORDER BY expression: pot.contact_id. Else, peers are returned in arbitrary order, which can change with every execution.
ORDER BY 2 is short syntax for "order by the 2nd output column". See:
Select first row in each GROUP BY group?
Related:
How do I query all rows within a 5-mile radius of my coordinates?

How to get the most frequent value in SQL

I have a table Orders(id_trip, id_order), table Trip(id_hotel, id_bus, id_type_of_trip) and table Hotel(id_hotel, name).
I would like to get name of the most frequent hotel in table Orders.
SELECT hotel.name from Orders
JOIN Trip
on Orders.id_trip = Trip.id_hotel
JOIN hotel
on trip.id_hotel = hotel.id_hotel
FROM (SELECT hotel.name, rank() over (order by cnt desc) rnk
FROM (SELECT hotel.name, count(*) cnt
FROM Orders
GROUP BY hotel.name))
WHERE rnk = 1;
The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm
For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30 - the number of the department with the most employees. (30 is the department "name" or number, it is NOT the number of employees in that department!)
In your case:
select stats_mode(h.name) from (the rest of your query)
Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution - a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
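Spelled out against the question's schema, that might look like the following (the join conditions are assumptions, since the question doesn't show Trip's key):
SELECT STATS_MODE(h.name)
FROM   orders o
JOIN   trip   t ON o.id_trip  = t.id_trip
JOIN   hotel  h ON t.id_hotel = h.id_hotel;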
Use FIRST for a single result:
SELECT MAX(t.name) KEEP (DENSE_RANK FIRST ORDER BY t.cnt DESC)
FROM  (
    SELECT hotel.name, COUNT(*) AS cnt
    FROM   orders
    JOIN   trip  USING (id_trip)
    JOIN   hotel USING (id_hotel)
    GROUP  BY hotel.name
    ) t
Here is one method:
select name
from (select h.name,
             row_number() over (order by count(*) desc) as seqnum  -- use rank() if you want duplicates
      from orders o join
           trip t
           on o.id_trip = t.id_trip join  -- this seems like the right join condition
           hotel h
           on t.id_hotel = h.id_hotel
      group by h.name
     ) oth
where seqnum = 1;
Getting the most recent statistical mode out of a data sample
I know it's been more than a year, but here's my answer. I came across this question hoping to find a simpler solution than what I know, but alas, nope.
I had a similar situation where I needed to get the mode from a data sample, with the requirement to get the mode of the most recently inserted value if there were multiple modes.
In such a case neither the STATS_MODE nor the LAST aggregate functions would do (as they tend to return the first mode found, not necessarily the mode with the most recent entries).
In my case it was easy to use the ROWNUM pseudo-column, because the tables in question were performance metric tables that only experienced inserts (not updates).
In this oversimplified example, I'm using ROWNUM - it could easily be changed to a timestamp or sequence field if you have one.
SELECT VALUE
FROM  (
    SELECT VALUE,
           COUNT(*) CNT,
           MAX(R)   R
    FROM  ( SELECT VALUE, ROWNUM R FROM FOO )
    GROUP BY VALUE
    ORDER BY CNT DESC,
             R   DESC
    )
WHERE ROWNUM < 2;
That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)
Then sort so that the ones with largest counts come first, and for those with the same count, the one with the largest ROWNUM (indicating most recent insertion in my case).
Then skim off the top row.
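On Oracle 12c or later, the outer ROWNUM filter can also be written with the ANSI row-limiting clause; same logic, just newer syntax:
SELECT VALUE
FROM  (
    SELECT VALUE, COUNT(*) CNT, MAX(R) R
    FROM  ( SELECT VALUE, ROWNUM R FROM FOO )
    GROUP BY VALUE
    )
ORDER BY CNT DESC, R DESC
FETCH FIRST 1 ROW ONLY;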
Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.
If this doesn't work for your specific case, you'll have to create your own custom aggregator.
Now, if you don't care which mode Oracle is going to pick (your business case just requires a mode and that's it), then STATS_MODE will do fine.

How to efficiently get a range of ranked users (for a leaderboard) using PostgreSQL

I have read many posts on this topic, such as
mysql-get-rank-from-leaderboards.
However, none of the solutions are efficient at scale for getting a range of ranks from the database.
The problem is simple. Suppose we have a Postgres table with an "id" column and another INTEGER column whose values are not unique, but we have an index for this column.
e.g. table could be:
CREATE TABLE my_game_users (id serial PRIMARY KEY, rating INTEGER NOT NULL);
The goal
Define a rank for users by ordering on the "rating" column, descending
Be able to query for a list of ~50 users ordered by this new "rank", centered at any particular user
For example, we might return users with ranks { 15, 16, ..., 64, 65 } where the center user has rank #40
Performance must scale, e.g. be under 80 ms for 100,000 users.
Attempt #1: row_number() window function
WITH my_ranks AS (
    SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank
    FROM   my_game_users
)
SELECT *
FROM   my_ranks
WHERE  rank >= 4000 AND rank <= 4050
ORDER  BY rank ASC;
This "works", but the queries average 550ms with 100,000 users on a fast laptop without any other real work being done.
I tried adding indexes, and re-phrasing this query to not use the "WITH" syntax, and nothing worked to speed it up.
Attempt #2 - count the number of rows with a greater rating value
I tried a query like this:
SELECT t1.*,
       (SELECT COUNT(*)
        FROM   my_game_users t2
        WHERE  (t1.rating, -t1.id) <= (t2.rating, -t2.id)
       ) AS rank
FROM   my_game_users t1
WHERE  id = 2000;
This is decent, this query takes about 120ms with 100,000 users having random ratings. However, this only returns the rank for user with a particular id (2000).
I can't see any efficient way to extend this query to get a range of ranks. Any attempt at extending this makes a very slow query.
I only know the ID of the "center" user, since the users have to be ordered by rank before we know which ones are in the range!
Attempt #3: in-memory ordered Tree
I ended up using a Java TreeSet to store the ranks. I can update the TreeSet whenever a new user is inserted into the database, or a user's rating changes.
This is super fast, around 25 ms with 100,000 users.
However, it has a serious drawback that it's only updated on the Webapp node that serviced the request. I'm using Heroku and will deploy multiple nodes for my app. So, I needed to add a scheduled task for the server to re-build this ranking tree every hour, to make sure the nodes don't get too out-of-sync!
If anyone knows of an efficient way to do this in Postgres with full solution, then I am all ears!
You can get the same results by using order by rating desc with offset and limit to get the users between certain ranks.
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank FROM my_game_users)
SELECT * FROM my_ranks WHERE rank >= 4000 AND rank <= 4050 ORDER BY rank ASC;
The query above is roughly equivalent to
select *, rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 51 offset 3999
(OFFSET skips rows, so ranks 4000-4050 begin after the first 3999 rows; also note rank() numbers ties differently than row_number().)
If you want to select users around rank #40 you could select ranks #15-#65:
select *, rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 51 offset 14
Thanks, @FuzzyTree!
Your solution doesn't quite give me everything I need, but it nudged me in the right direction. Here's the full solution I'm going with for now.
The only limitation with your solution is that there's no way to get a unique rank for a particular user. All users with the same rating would have the same rank (or at least their relative order is undefined by the SQL standard). If I knew the OFFSET ahead of time, then your rank would be good enough, but I have to get the rank of a particular user first.
My solution is to do the following query to get a range of ranks:
SELECT * FROM my_game_users ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?
This is basically uniquely defining the ranks by rating, then by who joined the Game first (lower id).
To make this efficient I'm creating an index on (rating DESC, id):
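In Postgres that would be something like (index name assumed):
CREATE INDEX my_game_users_rating_id_idx ON my_game_users (rating DESC, id);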
Then, I'm getting a particular user's rank to plug in to this query with:
SELECT COUNT(*) FROM my_game_users WHERE rating > ? OR (rating = ? AND id < ?)
I actually made this more efficient with:
SELECT (SELECT COUNT(*) FROM my_game_users WHERE rating > ?) + (SELECT COUNT(*) FROM my_game_users WHERE rating = ? AND id < ?) + 1
Now, even with these queries it takes about 78 ms (average and median) to get the ranks around a user. If anyone has a good idea how to speed these up, I'm all ears!
For example, getting a range of ranks takes about 60ms, and explaining it yields:
EXPLAIN SELECT * FROM word_users ORDER BY rating DESC, id ASC LIMIT 50 OFFSET 50000;
"Limit (cost=6350.28..6356.63 rows=50 width=665)"
" -> Index Scan using idx_rating_desc_and_id on word_users (cost=0.29..12704.83 rows=100036 width=665)"
So it's using the (rating DESC, id) index, yet the cost estimate still ranges widely, from 0.29 to 12704.83. Any ideas how to improve?
If you order by rating descending, the rows are already in rank order; use the row_number() function. See:
Select Row number in postgres
Also, you could use an in-memory cache to store the leaderboard, something like Redis. It's a separate application that can serve multiple instances, even remotely.

Find row number in a sort based on row id, then find its neighbours

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve the following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number supported in Firebird 3.0)? I am by no way an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?
Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
To find the "row" number of a a row with a given id:
select id, row_number() over (partition by NULL order by name, id)
from t
where id = <id>
This assumes the ids are unique.
To solve the second problem:
select t.*
from (select id, name,
             row_number() over (partition by NULL order by name, id) as rownum
      from t
     ) t join
     (select id,
             row_number() over (partition by NULL order by name, id) as rownum
      from t
     ) tid
     on tid.id = <id>  -- filter here, after the row numbers have been computed
    and t.rownum between tid.rownum - 5 and tid.rownum + 5
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column when a row is inserted. If your records are never deleted (and are inserted in the order you want them ranked), this can serve as your counter, simplifying your queries.
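A hedged sketch of that idea in Firebird 3.0 syntax (column names are assumptions, and it only helps if insertion order matches the desired rank order and rows are never deleted):
create table people (
    seq  integer generated by default as identity,  -- insertion-order counter
    id   integer not null primary key,
    name varchar(100) not null
);
-- the "row" number of a given id is then simply:
select seq from people where id = <id>;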