I'm implementing a view to store leaderboard data of the top 10 users that is computed using an expensive COUNT(*). I'm planning on the view to look something like this:
id SERIAL PRIMARY KEY
user_id TEXT
type TEXT
rank INTEGER
count INTEGER
-- adding an index to user_id
-- adding a two-column unique index to user_id and type
I'm having trouble with seeing how this view should be created to properly account for the rank and type. Essentially, I have a big table (~30 million rows) like this:
+----+---------+---------+----------------------------+
| id | user_id | type | created_at |
+----+---------+---------+----------------------------+
| 1 | 1 | Diamond | 2021-05-11 17:35:18.399517 |
| 2 | 1 | Diamond | 2021-05-12 17:35:17.399517 |
| 3 | 1 | Diamond | 2021-05-12 17:35:18.399517 |
| 4 | 2 | Diamond | 2021-05-13 17:35:18.399517 |
| 5 | 1 | Clay | 2021-05-14 17:35:18.399517 |
| 6 | 1 | Clay | 2021-05-15 17:35:18.399517 |
+----+---------+---------+----------------------------+
With the table above, I'm trying to achieve something like this:
+----+---------+---------+------+-------+
| id | user_id | type | rank | count |
+----+---------+---------+------+-------+
| 1 | 1 | Diamond | 1 | 3 |
| 2 | 2 | Diamond | 2 | 1 |
| 3 | 1 | Clay | 1 | 2 |
| 4 | 1 | Weekly | 1 | 5 | -- 3 diamonds + 2 clay obtained between Mon-Sun
| 5 | 2 | Weekly | 2 | 1 |
+----+---------+---------+------+-------+
By Weekly I am counting the time from the last Sunday to the upcoming Sunday.
Is this doable using only SQL, or is some kind of script needed? If doable, how would this be done? It's worth mentioning that there are thousands of different types, so not having to manually specify type would be preferred.
If there's anything unclear, please let me know and I'll do my best to clarify. Thanks!
The "weekly" rows are produced in a different way compared to the "user" rows (I called them two different "categories"). To get the result you want you can combine two queries using UNION ALL.
For example:
select 'u' as category, user_id, type,
rank() over(partition by type order by count(*) desc) as rk,
count(*) as cnt
from scores
group by user_id, type
union all
select 'w', user_id, 'Weekly',
rank() over(order by count(*) desc),
count(*) as cnt
from scores
group by user_id
order by category, type desc, rk
Result:
category user_id type rk cnt
--------- -------- -------- --- ---
u 1 Diamond 1 3
u 2 Diamond 2 1
u 1 Clay 1 2
w 1 Weekly 1 5
w 2 Weekly 2 1
See running example at DB Fiddle.
Note: For the sake of simplicity I left the filtering by timestamp out of the query. If you really needed to include only the rows of the last 7 days (or other period of time), it would be a matter of adding a WHERE clause in both subqueries.
I think this is what you were talking about, right?
WITH scores_plus_weekly AS ((
SELECT id, user_id, 'Weekly' AS type, created_at
FROM scores
WHERE created_at BETWEEN '2021-05-10' AND '2021-05-17'
)
UNION (
SELECT * FROM scores
))
SELECT
row_number() OVER (ORDER BY CASE "type" WHEN 'Diamond' THEN 0 WHEN 'Clay' THEN 1 ELSE 2 END, count(*) DESC) as "id",
user_id,
"type",
row_number() OVER (PARTITION BY count(*) DESC) as "rank",
count(*)
FROM scores_plus_weekly
GROUP BY user_id, "type"
ORDER BY "id";
I'm sure this is not the only way, but I thought the result wasn't too complex. This query first combines the original database with all scores from this week. For the sake of consistency I picked a date range that matches your entire example set. It then groups by user_id and type to get the counts for each combination. The row_numbers will give you the overall rank and the rank per type. A big part of this query consists of sorting by type, so if you're joining another table that contains the order or priority of the types, the CASE can probably be simplified.
Then, lastly, this entire query can be caught in a view using the CREATE VIEW score_ranks AS , followed by your query.
Related
I have a table that has a number column and an attribute column like this:
1.
+-----+-----+
| num | att |
-------------
| 1 | a |
| 1 | b |
| 1 | a |
| 2 | a |
| 2 | b |
| 2 | b |
+------------
I want to make the number unique, and the attribute to be whichever attribute occured most often for that number, like this (This is the end-product im interrested in) :
2.
+-----+-----+
| num | att |
-------------
| 1 | a |
| 2 | b |
+------------
I have been working on this for a while and managed to write myself a query that looks up how many times an attribute occurs for a given number like this:
3.
+-----+-----+-----+
| num | att |count|
------------------+
| 1 | a | 1 |
| 1 | b | 2 |
| 2 | a | 1 |
| 2 | b | 2 |
+-----------------+
But I can't think of a way to only select those rows from the above table where the count is the highest (for each number of course).
So basically what I am asking is given table 3, how do I select only the rows with the highest count for each number (Of course an answer describing providing a way to get from table 1 to table 2 directly also works as an answer :) )
You can use aggregation and window functions:
select num, att
from (
select num, att, row_number() over(partition by num order by count(*) desc, att) rn
from mytable
group by num, att
) t
where rn = 1
For each num, this brings the most frequent att; if there are ties, the smaller att is retained.
Oracle has an aggregation function that does this, stats_mode().:
select num, stats_mode(att)
from t
group by num;
In statistics, the most common value is called the mode -- hence the name of the function.
Here is a db<>fiddle.
You can use group by and count as below
select id, col, count(col) as count
from
df_b_sql
group by id, col
I do NOT want to get top 1 from each group! Pay attention to the explanation which I have provided at the last portion of my question!
I have the following rows:
| Code | Type | SubType | Date |
|:----:|:----:|:-------:|:----------:|
| 100 | 10 | 1 | 17.12.2019 |
| 100 | 10 | 2 | 18.12.2019 |
| 100 | 10 | 2 | 19.12.2019 |
| 100 | 10 | 1 | 20.12.2019 |
What I need is to make groups of rows based on Code, Type and SubType columns. But not only should I keep the Date column, but I have to remove duplicate rows (based on Code, Type and SubType columns) from those groups which are in the middle as follows:
| Code | Type | SubType | Date |
|:----:|:----:|:-------:|:----------:|
| 100 | 10 | 1 | 17.12.2019 |
| 100 | 10 | 2 | 18.12.2019 |
| 100 | 10 | 1 | 20.12.2019 |
Let me to explain more about the scenario which leads to this situation, and thus I need to clean my data before displaying to the end user. I have a historical table which has 4 columns (Code, Type, SubType and Date). Each row of this table shows a change which have been occurred on the values of the fields of that row at a specific date. For instance, in the above example, there have been 4 changes on the row at 4 different dates. At first, the row has been generated with Code = 100, Type = 10 and SubType = 1 at 17.12.2019. Then SubType has been changed to 2 at 18.12.2019. Next day, at 19.12.2019, SubType has been changed again to 2 (which is a duplicate in my case). Finally, SubType has been changed again to 1 at 20.12.2019. In fact, I don't need to show the 3rd change as it is a duplicate in my case.
I tried using Row_Number()Over(Partition by Code, Type and SubType Order By Date), but I was not successful.
You want to keep the dates where something changes. My recommendation is lag on the date:
select t.*
from (select t.*,
lag(date) over (partition by code, type, subtype order by date) as prev_cts_date,
lag(date) over (order by date) as prev_date
from t
) t
where prev_cts_date is null or prev_cts_date <> prev_date;
One alternative is a lag() on each of the columns and then check each value for a change. Not only is that cumbersome, but the logic gets much worse if NULL values are involved.
Here is logic is just asking: "Is the previous date for the CTS combination the same as the previous date?" If so, discard the record.
This looks to me like a gaps-and-island problem. Here is one approach using row_number():
select code, type, SubType, Date
from (
select
t.*,
row_number() over(partition by code, type, rn1 - rn2 order by date) rn
from (
select
t.*,
row_number() over(partition by code, type order by date) rn1,
row_number() over(partition by code, type, SubType order by date) rn2
from mytable t
) t
) t
where rn = 1
This defines group by taking the difference of row numbers over partitions of code, type against partitions of code, type, subtype. Then, we select the first record per group, using row_number() again.
Demo on DB Fiddle:
code | type | SubType | Date
---: | ---: | ------: | :---------
100 | 10 | 1 | 17.12.2019
100 | 10 | 2 | 18.12.2019
100 | 10 | 1 | 20.12.2019
Example Table user:
ID | USER_ID | SCORE |
1 | 555 | 50 |
2 | 555 | 10 |
3 | 555 | 20 |
4 | 123 | 5 |
5 | 123 | 5 |
6 | 999 | 30 |
The result set should be like
ID | USER_ID | SCORE | COUNT |
1 | 555 | 80 | 3 |
2 | 123 | 10 | 2 |
3 | 999 | 30 | 1 |
Is it possible to generate a sql that can return the table above, so far I can only count the rows where certain user_id appear, but don't know how to sum and show for every user ?
You've included a column called "ID" in both the source data and desired results, but I'm going to assume that these ID values are not related and simply represent the row or line number - otherwise the question doesn't make sense.
In which case, you can simply use:
SELECT
USER_ID,
SUM(SCORE) AS SCORE,
COUNT(USER_ID) AS COUNT
FROM
<Table>
GROUP BY
USER_ID
If you really want to generate the ID column as well, then how you do this depends on the database platform being used. For example on Oracle you could use the ROWNUM pseudocolumn, on SQL Server you will need to use ROW_NUMBER() function (which also works for Oracle).
SELECT ID
,sum(SCORE)
,count(USER_ID)
FROM Table
GROUP BY
ID
I think COUNT is the number of scores per user_id, if so, then your sql request should be :
SELECT
ID,
USER_ID,
SUM(SCORE)AS SCORE,
COUNT(SCORE)AS COUNT
FROM
TABLE
GROUP BY
USER_ID
I'm not looking for the answer as much as what to search for as I think this is possible. I have a query where the result can be as such:
| ID | CODE | RANK |
I want to base rank off of the code so my I get these results
| 1 | A | 1 |
| 1 | B | 1 |
| 2 | A | 1 |
| 2 | C | 1 |
| 3 | B | 2 |
| 3 | C | 2 |
| 4 | C | 3 |
Basically, based on the group of IDs, if any of the CODEs = a certain value I want to adjust the rank so then I can order by rank first and then other columns. Never sure how to phrase things in SQL.
I tried
CASE WHEN CODE = 'A' THEN 1 WHEN CODE = 'B' THEN 2 ELSE 3 END rank
ORDER BY rank DESC
But I want to keep the ids together, I don't want them broken apart, I was thinking of doing all ranks the same based on the highest if I can't solve it another way?
Thoughts of a SQL function to look at?
You could use the MIN() OVER() analytic function to get the minimum rank value per group, and just order by that;
WITH cte AS (
SELECT id, code,
MIN(CASE WHEN code='A' THEN 1 WHEN code='B' THEN 2 ELSE 3 END)
OVER (PARTITION BY id) rank
FROM mytable
)
SELECT * FROM cte
ORDER BY rank, id, code
An SQLfiddle to test with.
I want to create a window function that will count how many times the value of the field in the current row appears in the part of the ordered partition coming before the current row. To make this more concrete, suppose we have a table like so:
| id| fruit | date |
+---+--------+------+
| 1 | apple | 1 |
| 1 | cherry | 2 |
| 1 | apple | 3 |
| 1 | cherry | 4 |
| 2 | orange | 1 |
| 2 | grape | 2 |
| 2 | grape | 3 |
And we want to create a table like so (omitting the date column for clarity):
| id| fruit | prior |
+---+--------+-------+
| 1 | apple | 0 |
| 1 | cherry | 0 |
| 1 | apple | 1 |
| 1 | cherry | 1 |
| 2 | orange | 0 |
| 2 | grape | 0 |
| 2 | grape | 1 |
Note that for id = 1, moving along the ordered partition, the first entry 'apple' doesn't match anything (since the implied set is empty), the next fruit, 'cherry' also doesn't match. Then we get to 'apple' again, which is a match and so on. I'm imagining the SQL looks something like this:
SELECT
id, fruit,
<some kind of INTERSECT?> OVER (PARTITION BY id ORDER by date) AS prior
FROM fruit_table;
But I cannot find anything that looks right. FWIW, I'm using PostgreSQL 8.4.
You could solve that without a window function rather elegantly with a self-left join and a count():
SELECT t.id, t.fruit, t.day, count(t0.*) AS prior
FROM tbl t
LEFT JOIN tbl t0 ON (t0.id, t0.fruit) = (t.id, t.fruit) AND t0.day < t.day
GROUP BY t.id, t.day, t.fruit
ORDER BY t.id, t.day
I renamed the date column day because date is a reserved word in every SQL standard and in PostgreSQL.
I corrected a mistake in your sample data. They way you had it, it did not check out. Might confuse people.
If your point is to do it with a window function, this one should work:
SELECT id, fruit, day
,count(*) OVER (PARTITION BY id, fruit ORDER BY day) - 1 AS prior
FROM tbl
ORDER BY id, day
This works, because, I quote the manual:
If frame_end is omitted it defaults to CURRENT ROW.
You effectively count how many rows had the same (id, fruit) on prior days - including the current row. That's what the - 1 is for.