Don't understand how queries to retrieve top n records from each group work - sql

I had an issue where I was trying to get the top 'n' records from each group (day) of records in my database. After a bunch of digging I found some great answers and they did in fact solve my problem.
How to select the first N rows of each group?
Get top n records for each group of grouped results
However, my noob-ness is preventing me from understanding exactly WHY these "counting" solutions work. If someone with better SQL knowledge can explain, that would be really great.
EDIT: here's more details
Let's say I had a table described below with this sample data. (To make things simpler, I have a column that kept track of the time of the next upcoming midnight, in order to group 'per day' better).
id | vote_time | time_of_midnight | name | votes_yay | votes_nay
------------------------------------------------------------------------
1 | a | b | Person p | 24 | 36
1 | a | b | Person q | 20 | 10
1 | a | b | Person r | 42 | 22
1 | c | d | Person p | 8 | 10
1 | c | d | Person s | 120 | 63
There can be tens or hundreds of "People" per day (b, d, ...)
id is some other column I needed in order to group by (you can think of it as an election id if that helps)
I'm trying to calculate the top 5 names that had the highest number of votes per day, in descending order. I was able to use the referenced articles to create a query that would give me the following results (on Oracle):
SELECT name, time_of_midnight, votes_yay, votes_nay, (votes_yay+votes_nay) AS total_votes
FROM results a
WHERE id=1 AND (
SELECT COUNT(*)
FROM results b
WHERE b.id=a.id AND b.time_of_midnight=a.time_of_midnight AND (a.votes_yay+a.votes_nay) >= (b.votes_yay+b.votes_nay)) <= 5
ORDER BY time_of_midnight DESC, total_votes DESC;
name | time_of_midnight | votes_yay | votes_nay | total_votes
------------------------------------------------------------------------
Person s | d | 120 | 63 | 183
Person p | d | 8 | 10 | 18
Person r | b | 42 | 22 | 64
Person p | b | 24 | 36 | 60
Person q | b | 20 | 10 | 30
So I'm not really sure
Why this counting method works?
[stupid]: Why don't I need to also include name in the inner query to make sure it doesn't join the data incorrectly?

Let's begin with the fact that your query is actually calculating top 5 names that had the lowest number of votes. To get the top 5 with the highest number, you'll need to change this condition:
(a.votes_yay+a.votes_nay) >= (b.votes_yay+b.votes_nay)
into this:
(a.votes_yay+a.votes_nay) <= (b.votes_yay+b.votes_nay)
or, perhaps, this (which is the same):
(b.votes_yay+b.votes_nay) >= (a.votes_yay+a.votes_nay)
(The latter form would seem to me preferable, but merely because it would be uniform with the other two comparisons which have a b column on the left-hand side and an a column on the right-hand side. That is perfectly irrelevant to the correctness of the logic.)
Logically, what's happening is this. For every row in results, the server will be looking for rows in the same table that match id and time_of_midnight of the given row and have the same or higher number of total votes than that in the given row. It will then count the found rows and check if the result is not greater than 5, i.e. if no more than 5 rows in the same (id, time_of_midnight) group have the same or higher number of votes as in the given row.
For example, if the given row happens to be one with the most votes in its group, the subquery will find only that same row (assuming there are no ties) and so the count will be 1. That is fewer than 5 – therefore, the given row will qualify for output.
If the given row is the second most voted item in a group, the subquery will find the same row and the top-voted item (again, assuming no ties), which will give the count of 2. Again, that matches the count <= 5 condition, and so the row will be returned in the output.
In general, if a row is ranked as # N in its group according to the total number of votes, it means there are N rows in that group where the vote number is the same or higher than the number in the given row (we are still assuming there are no ties). So, when you are counting votes in this way, you are effectively calculating the given row's ranking.
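To see the count acting as a rank, here is a minimal sketch that loads the sample rows into an in-memory SQLite database (an assumption made only so the example runs anywhere; the question used Oracle) and selects the subquery count as a column. The column alias rank_in_group is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INT, time_of_midnight TEXT, name TEXT, "
    "votes_yay INT, votes_nay INT)"
)
# Sample rows from the question (vote_time omitted, it plays no part in the ranking).
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    [
        (1, "b", "Person p", 24, 36),
        (1, "b", "Person q", 20, 10),
        (1, "b", "Person r", 42, 22),
        (1, "d", "Person p", 8, 10),
        (1, "d", "Person s", 120, 63),
    ],
)

# For each row a, count the rows b in the same group with the same or more votes.
rank_sql = """
SELECT a.time_of_midnight, a.name,
       (SELECT COUNT(*)
        FROM results b
        WHERE b.id = a.id
          AND b.time_of_midnight = a.time_of_midnight
          AND (b.votes_yay + b.votes_nay) >= (a.votes_yay + a.votes_nay)
       ) AS rank_in_group
FROM results a
ORDER BY a.time_of_midnight, rank_in_group
"""
ranks = conn.execute(rank_sql).fetchall()
for row in ranks:
    print(row)
```

Within each day the most voted name gets 1, the next gets 2, and so on, which is why comparing the count against 5 keeps the top five.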
Now, if there are ties, you may get fewer results per group using this method. In fact, if a group had 6 or more rows tied at the maximum number of votes, you would get no rows for that group in the output, because the subquery would never return a count value less than 6.
That is because effectively all the top-voted items would be ranked as 6 (or whatever their number would be) rather than as 1. To rank them as 1 instead, you could try the following modification of the same query:
SELECT name, time_of_midnight, votes_yay, votes_nay, (votes_yay+votes_nay) AS total_votes
FROM results a
WHERE id=1 AND (
SELECT COUNT(*) + 1
FROM results b
WHERE b.id=a.id AND b.time_of_midnight=a.time_of_midnight
AND (b.votes_yay+b.votes_nay) > (a.votes_yay+a.votes_nay)) <= 5
ORDER BY time_of_midnight DESC, total_votes DESC;
Now the subquery will be looking only for rows with the higher number of votes than in the given row. The resulting count will be increased by 1 and that will be the given row's ranking (and the value to compare against 5).
So, if the vote totals were e.g. 10, 10, 8, 7 etc., the rankings would be calculated as 1, 1, 3, 4 etc. rather than as 2, 2, 3, 4 etc., as with the original version.
That, of course, means that the output might now have more than 5 rows per group. For instance, if votes were distributed as 10, 9, 8, 8, 8, 8, 6 etc., you would get 10, 9 and all the 8s (because the rankings would be 1, 2, 3, 3, 3, 3, 7...). To return exactly 5 names per group (assuming there are at least 5 of them), you'd probably need to consider a different method altogether.
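One possible "different method" is a window function with an explicit tiebreaker. The sketch below is only an illustration under stated assumptions: it uses SQLite 3.25 or later (for ROW_NUMBER), collapses the two vote columns into a single hypothetical votes column, and breaks ties by name, which is an arbitrary choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INT, time_of_midnight TEXT, name TEXT, votes INT)"
)
# The tie-heavy distribution from the text: 10, 9, 8, 8, 8, 8, 6.
vote_totals = [10, 9, 8, 8, 8, 8, 6]
conn.executemany(
    "INSERT INTO results VALUES (1, 'b', ?, ?)",
    [("Person %d" % i, v) for i, v in enumerate(vote_totals, start=1)],
)

# ROW_NUMBER never repeats a number inside a partition, so rn <= 5
# keeps exactly five rows per group when at least five exist.
top5_sql = """
SELECT name, votes
FROM (SELECT name, votes,
             ROW_NUMBER() OVER (PARTITION BY id, time_of_midnight
                                ORDER BY votes DESC, name) AS rn
      FROM results) ranked
WHERE rn <= 5
ORDER BY votes DESC, name
"""
top5 = conn.execute(top5_sql).fetchall()
for row in top5:
    print(row)
```

With this data the fourth row tied at 8 votes is dropped purely because of the name tiebreaker, so which of the tied rows survives depends on the tiebreaker you pick.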

Related

Count unique entries for ts_stat count in full text search

I'm struggling with using ts_stat to get the number of unique occurrences of tags in a table and sort them by the highest count.
What I need though is to only count each entry one time so that only unique entries are counted. I tried group by and distinct but nothing is working for me.
e.g. table
user_id | tags | post_date
===================================
2 | dog cat | 1580049400
2 | dog | 1580039400
3 | dog | 1580038400
3 | dog dog cat | 1580058400
4 | dog horse | 1580028400
Here is the current query
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('simple', tags) FROM tags WHERE post_date > 1580018400$$)
ORDER BY ndoc DESC
LIMIT 10;
Right now this will produce
word | ndoc | nentry
====================
dog | 5 | 6
cat | 2 | 2
horse| 1 | 1
The result I would be looking for is unique counts so no 1 user can count more than once even if they have > 1 entries after a certain date as noted in the post_date condition (Which might be irrelevant). Like below.
word | total_count_per_user
===========================
dog | 3 (because there are 3 unique users with this term)
cat | 2 (because there are 2 unique users with this term)
horse| 1 (because there are 1 unique users with this term)
UPDATE: I changed the column name to reflect output. The point is no matter how many times a user enters a word. It only needs the unique count per user. e.g. if a user in that scenario creates 100 entries with dog in the text it will only count dog 1 time for that user not 100 counts of dog.
You can use COUNT on a DISTINCT value, if I understand you correctly. A sample query is below:
SELECT tags,COUNT(DISTINCT user_id)
FROM your_table
GROUP BY tags
I guess this one was tough. Just in case someone happens to have a similar requirement I was able to get this to work. Seems odd to have to get total with ts_stat then filter it again using distinct, cross join etc so that no matter how many times it finds a word each user only counts once per word. I'm not sure how efficient it will be on a large data set but it yields the expected results.
UPDATE: This works without using a CTE. Also, the cross join is the key to filtering on user id.
SELECT DISTINCT (t.word) as tag, count(DISTINCT h.user_id) as posts
FROM ts_stat($$SELECT hashtagsearch FROM tagstable WHERE post_date > 1580018400$$) t
CROSS JOIN tagstable h WHERE hashtagsearch @@ to_tsquery('simple',t.word)
GROUP BY t.word HAVING count(DISTINCT h.user_id) > 1 ORDER BY posts DESC LIMIT 10;
This answer helped quite a bit. https://stackoverflow.com/a/42704207/330987
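To make the intended result concrete, here is a small client-side sketch of the counting rule itself: each user counts at most once per word. It is plain Python over the sample rows, not a PostgreSQL query, and the function name unique_users_per_word is made up for illustration.

```python
from collections import defaultdict

# (user_id, tags, post_date) rows from the sample table in the question.
rows = [
    (2, "dog cat", 1580049400),
    (2, "dog", 1580039400),
    (3, "dog", 1580038400),
    (3, "dog dog cat", 1580058400),
    (4, "dog horse", 1580028400),
]


def unique_users_per_word(rows, min_post_date):
    """Count, for every word, how many distinct users used it after min_post_date."""
    users_by_word = defaultdict(set)
    for user_id, tags, post_date in rows:
        if post_date > min_post_date:
            for word in tags.split():
                # A set ignores repeats, so 100 "dog" posts still count once.
                users_by_word[word].add(user_id)
    counts = {word: len(users) for word, users in users_by_word.items()}
    # Highest count first, then alphabetical for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


result = unique_users_per_word(rows, 1580018400)
print(result)
```

This reproduces the dog 3, cat 2, horse 1 table from the question, which is a handy reference when checking whatever SQL you end up with.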

Find highest (max) date query, and then find highest value from results of previous query

Here is a table called packages:
id packages_sent date sent_order
1 | 10 | 2017-02-11 | 1
2 | 25 | 2017-03-15 | 1
3 | 5 | 2017-04-08 | 1
4 | 20 | 2017-05-21 | 1
5 | 25 | 2017-05-21 | 2
6 | 5 | 2017-06-19 | 1
This table shows the number of packages sent on a given date; if there were multiple packages sent on the same date (as is the case with rows 4 and 5), then the sent_order keeps track of the order in which they were sent.
I am trying to make a query that will return sum(packages_sent) given the following conditions: first, return the row with the max(date) (given some date provided), and second, if there are multiple rows with the same max(date), return the row with the max(sent_order) (the highest sent_order value).
Here is the query I have so far:
SELECT sum(packages_sent)
FROM packages
WHERE date IN
(SELECT max(date)
FROM packages
WHERE date <= '2017-05-29');
This query correctly finds the max date, which is 2017-05-21, but then for the sum it returns 45 because it is adding rows 4 and 5 together.
I want the query to return the max(date), and if there are multiple rows with the same max(date), then return the row with the max(sent_order). Using the example above with the date 2017-05-29, it should only return 25.
I don't see where a sum() comes into play. You seem to only want the last row:
select p.*
from packages p
where date <= '2017-05-29'
order by date desc, sent_order desc
fetch first 1 row only;
If your data is truly ordered ascending as you show it, then it's easier to use the surrogate key ID field.
SELECT packages_sent
FROM packages
WHERE ID =
(SELECT max(ID)
FROM packages
WHERE date <= '2017-05-29');
Since the ID is always increasing with date and sent order, finding the max of it also finds the max of the other two in one step. This only holds as long as rows are always inserted in that order.
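The "sort and take the first row" approach can be tried against the sample data with a short sketch. It assumes SQLite (where LIMIT 1 stands in for fetch first 1 row only) and a hypothetical helper named latest_packages_sent; the cutoff date is passed as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE packages (id INT, packages_sent INT, date TEXT, sent_order INT)"
)
conn.executemany(
    "INSERT INTO packages VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2017-02-11", 1),
        (2, 25, "2017-03-15", 1),
        (3, 5, "2017-04-08", 1),
        (4, 20, "2017-05-21", 1),
        (5, 25, "2017-05-21", 2),
        (6, 5, "2017-06-19", 1),
    ],
)


def latest_packages_sent(conn, cutoff):
    """Return packages_sent of the last row on or before cutoff, or None."""
    row = conn.execute(
        "SELECT packages_sent FROM packages WHERE date <= ? "
        "ORDER BY date DESC, sent_order DESC LIMIT 1",
        (cutoff,),
    ).fetchone()
    return row[0] if row else None


print(latest_packages_sent(conn, "2017-05-29"))
```

Sorting by date and then sent_order does not depend on the id column following insertion order, so it keeps working if rows are ever backfilled.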

SQL: the most effective way to get row number of one element

I have a table of persons:
id | Name | Age
1 | Alex | 18
2 | Peter| 30
3 | Zack | 25
4 | Bim | 30
5 | Ken | 20
And I have the following interval of rows: WHERE ID>1 AND ID<5. I know that in this interval there is a person whose id=3. What is the most efficient (the fastest) way to get its row number in this interval (in my example rownumber=2)? I mean I don't need any other data. I need only one thing - to know row position of person with id=3 in interval WHERE ID>1 AND ID<5.
If it's possible I would like to get not vendor specific solution but a general sql solution. If it's not possible then I need solution for postgresql and h2.
The row number would be the number of rows between the first row in the interval and the row you're looking for. For interval ID>1 AND ID<5 and target row ID=3, this is:
select count(*)
from YourTable
where id between 2 and 3
For interval ID>314 AND ID<1592 and target row ID=1000, you'd use:
where id between 315 and 1000
To be sure that there is an element with ID=3, use:
select count(*)
from YourTable
where id between 2 and
(
select id
from YourTable
where id = 3
)
This will return 0 if the row doesn't exist.
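A quick way to check the counting approach, sketched against an in-memory SQLite database with the sample persons. The helper name row_position and its parameters are assumptions for illustration; it uses id > lower instead of BETWEEN so the exclusive lower bound from the question can be passed directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?)",
    [(1, "Alex", 18), (2, "Peter", 30), (3, "Zack", 25), (4, "Bim", 30), (5, "Ken", 20)],
)


def row_position(conn, lower_exclusive, target_id):
    """1-based position of target_id among rows with id > lower_exclusive.

    Returns 0 when target_id does not exist: the subquery then yields NULL,
    the comparison is never true, and COUNT(*) sees no rows.
    """
    sql = """
    SELECT COUNT(*)
    FROM persons
    WHERE id > ?
      AND id <= (SELECT id FROM persons WHERE id = ?)
    """
    return conn.execute(sql, (lower_exclusive, target_id)).fetchone()[0]


position = row_position(conn, 1, 3)
missing = row_position(conn, 1, 42)
print(position, missing)
```

Because it counts rows rather than subtracting ids, the result stays correct when there are gaps in the id sequence.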

Sort by data from multiple columns

For customer reviews on my products, I have them stored in SQL something like the below:
durability | cost | appearance
----------------------------------
5 | 3 | 4
2 | 4 | 2
1 | 5 | 5
Each value is an out of five score in the three categories.
When I want to print this information on page, I'd like to order them in descending order by the average score of an individual review.
SELECT *
FROM reviews
ORDER BY (durability+cost+appearance)/3 DESC
Obviously this doesn't work, but is there a way to get my result? I don't want to include an average column in SQL because outside of this one small application, it serves zero purpose.
Use ORDER BY instead of SORT BY:
SELECT *
FROM reviews
ORDER BY (durability+cost+appearance)/3 DESC
EDIT:
To see the order by value, try adding one more column in the select clause:
SELECT *,(durability+cost+appearance)/3 as OrderValue
FROM reviews
ORDER BY (durability+cost+appearance)/3 DESC
Sample output:
DURABILITY COST APPEARANCE ORDERVALUE
5 3 4 4
1 5 5 3
2 4 2 2
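The whole numbers in the sample output come from integer division, which some engines apply when all operands are integers. The sketch below uses SQLite (an assumption; other engines may already return a decimal here) to compare dividing by 3 with dividing by 3.0, which is one way to keep the fractional part so that close averages do not collapse into ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (durability INT, cost INT, appearance INT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(5, 3, 4), (2, 4, 2), (1, 5, 5)],
)

# Integer division: 11 / 3 becomes 3 and 8 / 3 becomes 2.
int_order = [
    r[0]
    for r in conn.execute(
        "SELECT (durability + cost + appearance) / 3 AS avg_score "
        "FROM reviews ORDER BY avg_score DESC"
    )
]

# Dividing by 3.0 forces floating-point division and keeps the fraction.
real_order = [
    r[0]
    for r in conn.execute(
        "SELECT (durability + cost + appearance) / 3.0 AS avg_score "
        "FROM reviews ORDER BY avg_score DESC"
    )
]

print(int_order)
print([round(v, 2) for v in real_order])
```

The ordering is the same for these three rows, but with more data the truncated version would rank a 3.67 average and a 3.0 average as equal.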

Determine on what page a record is

How can I determine on what page a certain record is?
Let's say I display 5 records per page using a query like this:
SELECT * FROM posts ORDER BY date DESC LIMIT 0,5
SELECT * FROM posts ORDER BY date DESC LIMIT 5,5
SELECT * FROM posts ORDER BY date DESC LIMIT 10,5
Sample data:
id | name | date
-----------------------------------------------------
1 | a | 2013-11-07 08:19 page 1
2 | b | 2013-12-02 12:32
3 | c | 2013-12-14 14:11
4 | d | 2013-12-21 09:26
5 | e | 2013-12-22 18:52 _________
6 | f | 2014-01-04 11:20 page 2
7 | g | 2014-01-07 21:09
8 | h | 2014-01-08 13:39
9 | i | 2014-01-08 16:41
10 | j | 2014-01-09 07:45 _________
11 | k | 2014-01-14 22:05 page 3
12 | l | 2014-01-21 17:21
Someone may edit a record, let's say with id = 7, or insert a new record (id = 13). How can I determine which page that record is on? The reason is that I want to display the page that contains the record that has just been edited or added.
ok I guess I could just display the same page if the record is edited. But the problem is when a record gets added. The list can be ordered by name and the new record could be placed anywhere :(
Is there some way I could do a query like SELECT offset WHERE id = 13 ORDER BY date LIMIT 5 that returns 10 ?
For the sake of this example, let's assume that entry 7 has just been added (and that there could be duplicate names). The first thing you need to do is find how many entries come before that one (based on name, then id), thus:
SELECT COUNT(*)
FROM Posts
WHERE name < 'g'
OR (name = 'g' AND id < 7)
Here, id is being used as a "tiebreaker" column, to ensure a stable sort. It's also assuming that we know the value of id, too - given that non-key data can be duplicate, you need that sort of functionality.
In any case, this gives us the number of rows preceding this one (6). With some integer division arithmetic (based on the LIMIT), we can now get the relevant information:
(int) (6 / 5) = 1
... this is a 0-indexed page (i.e., entries 1 - 5 appear on page "0"), which works in our favor here. Because the count covers only the rows before the target, nothing needs to be subtracted first: entry 5 has 4 rows before it and lands on page 0, while entry 6 has 5 rows before it and lands on page 1.
We now have the page index, but we need to turn it into the number of rows to skip. Some simple multiplication does that for us:
1 * 5 = 5
This is the number of entries on the earlier pages, which is the value for OFFSET.
We can now write the query:
SELECT id, name, date
FROM Posts
ORDER BY name, id
LIMIT 5 OFFSET 5
(keep in mind that we require id to guarantee a stable sort for the data, if we assume that name could be a duplicate!).
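As a rough check of the count-then-divide idea, the sketch below runs it as two queries against an in-memory SQLite table. Everything named here (page_containing, PAGE_SIZE, the cut-down posts table without the date column) is an assumption for illustration. The count covers only rows strictly before the target in (name, id) order, so the 0-indexed page is the count divided by the page size, and the offset is that page index times the page size.

```python
import sqlite3

PAGE_SIZE = 5

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, name TEXT)")
# Ids 1..12 with names 'a'..'l', mirroring the sample data.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(i, chr(ord("a") + i - 1)) for i in range(1, 13)],
)


def page_containing(conn, post_id, page_size=PAGE_SIZE):
    """Return (page_index, offset, ids_on_page) for the page holding post_id."""
    name = conn.execute(
        "SELECT name FROM posts WHERE id = ?", (post_id,)
    ).fetchone()[0]
    # Rows sorted before the target: smaller name, or same name and smaller id.
    preceding = conn.execute(
        "SELECT COUNT(*) FROM posts WHERE name < ? OR (name = ? AND id < ?)",
        (name, name, post_id),
    ).fetchone()[0]
    page_index = preceding // page_size   # 0-indexed page
    offset = page_index * page_size       # rows to skip before that page
    ids = [
        r[0]
        for r in conn.execute(
            "SELECT id FROM posts ORDER BY name, id LIMIT ? OFFSET ?",
            (page_size, offset),
        )
    ]
    return page_index, offset, ids


print(page_containing(conn, 7))
```

Checking the boundary rows (the last entry of one page and the first entry of the next) is the quickest way to catch an off-by-one in this kind of arithmetic.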
This is two trips to the database. Surprisingly, SQLite allows LIMIT/OFFSET values to be the results of SQL subqueries (keep in mind, not all RDBMSs allow them to even be host variables, meaning they could only be changed with dynamic SQL, although in at least one case the db had ROW_NUMBER() to make up for that...). I wasn't able to get rid of the repetition of the subqueries, though.
SELECT Posts.id, Posts.name, Posts.date, Pages.pageCount
FROM Posts
CROSS JOIN (SELECT (COUNT(*) / 5) + 1 as pageCount
FROM Posts
WHERE name < 'g'
OR (name = 'g' AND id < 7)) Pages
ORDER BY name, id
LIMIT 5 OFFSET (SELECT (COUNT(*) / 5) * 5 as entryCount
FROM Posts
WHERE name < 'g'
OR (name = 'g' AND id < 7));
(and the working SQL Fiddle example).