Get Percentile for a user

Get Percentile for a user - sql

I have a table such as this:
Id, ReportId, UserId
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 1 3
9 2 3
10 1 4
My table has thousands of records, above is just an example of the table structure simplified for purpose of understanding the problem.
I'm trying to figure out what at what percentile a user sits based on how many reports he has read.
I've been looking into PERCENTILE_CONT and PERCENTILE_DISC functions, but I fail to understand them properly. https://learn.microsoft.com/en-us/sql/t-sql/functions/percentile-cont-transact-sql
What confuses me most is that what it appears to me is that these functions are trying to find the 50th percentile, not percentile for a specific record.
Maybe I'm just not understanding this correctly. Is there a better way?
EDIT:
To clarify. I want to know at what percentile a specific user (in this case user with id 1) sits based on how many reports they have read. If they read the most reports they would be at a higher percentile, what is that percentile? Lets say there are 100 users exactly, then the person with most reports read would be 1st percentile.

Update #2
One of these should do it:
select
a.UserId,
a.reports_read,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY a.reports_read) OVER (partition by UserId) AS percentile_d,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY a.reports_read) OVER (partition by UserId) AS percentile_c,
PERCENT_RANK() OVER(ORDER BY a.reports_read ) percent_rank,
CUME_DIST() OVER(ORDER BY a.reports_read ) AS cumulative_distance
from
(select UserId, count(distinct(ReportId)) as reports_read
from #tmp
group by UserId
) a
It gives the following results:
UserId reports_read percentile_d percentile_c percent_rank cumulative_distance
4 1 1 1 0 0.25
3 2 2 2 0.33333 0.5
2 3 3 3 0.66667 0.75
1 6 6 6 1 1
I hope this helps.

Related

How to get the values for every group of the top 3 types

I've got this table ratings:
id
user_id
type
value
0
0
Rest
4
1
0
Bar
3
2
0
Cine
2
3
0
Cafe
1
4
1
Rest
4
5
1
Bar
3
6
1
Cine
2
7
1
Cafe
5
8
2
Rest
4
9
2
Bar
3
10
3
Cine
2
11
3
Cafe
5
I want to have a table with a row for every pair (user_id, type) for the top 3 rated types through all users (ranked by sum(value) across the whole table).
Desired result:
user_id
type
value
0
Rest
4
0
Cafe
1
0
Bar
3
1
Rest
4
1
Cafe
5
1
Bar
3
2
Rest
4
3
Cafe
5
2
Bar
3
I was able to do this with two queries, one to get the top 3 and then another to get the rows where the type matches the top 3 types.
Does someone know how to fit this into a single query?

Get rows per user for the 3 highest ranking types, where types are ranked by the total sum of their value across the whole table.
So it's not exactly about the top 3 types per user, but about the top 3 types overall. Not all users will have rows for the top 3 types, even if there would be 3 or more types for the user.
Strategy:
Aggregate to get summed values per type (type_rnk).
Take only the top 3. (Break ties ...)
Join back to main table, eliminating any other types.
Order result by user_id, type_rnk DESC
SELECT r.user_id, r.type, r.value
FROM ratings r
JOIN (
SELECT type, sum(value) AS type_rnk
FROM ratings
GROUP BY 1
ORDER BY type_rnk DESC, type -- tiebreaker
LIMIT 3 -- strictly the top 3
) v USING (type)
ORDER BY user_id, type_rnk DESC;
db<>fiddle here
Since multiple types can have the same ranking, I added type to the sort order to break ties alphabetically by their name (as you did not specify otherwise).
Turns out, we don't need window functions - the ones with OVER and, optionally, PARTITION for this. (Since you asked in a comment).

I think you just want row_number(). Based on your results, you seem to want three rows per type, with the highest value:
select t.*
from (select t.*,
row_number() over (partition by type order by value desc) as seqnum
from t
) t
where seqnum <= 3;
Your description suggests that you might just want this per user, which is a slight tweak:
select t.*
from (select t.*,
row_number() over (partition by user order by value desc) as seqnum
from t
) t
where seqnum <= 3;

Resetting a Count in SQL

I have data that looks like this:
ID num_of_days
1 0
2 0
2 8
2 9
2 10
2 15
3 10
3 20
I want to add another column that increments in value only if the num_of_days column is divisible by 5 or the ID number increases so my end result would look like this:
ID num_of_days row_num
1 0 1
2 0 2
2 8 2
2 9 2
2 10 3
2 15 4
3 10 5
3 20 6
Any suggestions?
Edit #1:
num_of_days represents the number of days since the customer last saw a doctor between 1 visit and the next.
A customer can see a doctor 1 time or they can see a doctor multiple times.
If it's the first time visiting, the num_of_days = 0.

SQL tables represent unordered sets. Based on your question, I'll assume that the combination of id/num_of_days provides the ordering.
You can use a cumulative sum . . . with lag():
select t.*,
sum(case when prev_id = id and num_of_days % 5 <> 0
then 0 else 1
end) over (order by id, num_of_days)
from (select t.*,
lag(id) over (order by id, num_of_days) as prev_id
from t
) t;
Here is a db<>fiddle.
If you have a different ordering column, then just use that in the order by clauses.

Derby DB last x row average

I have the following table structure.
ITEM TOTAL
----------- -----------------
ID | TITLE ID |ITEMID|VALUE
1 A 1 2 6
2 B 2 1 4
3 C 3 3 3
4 D 4 3 8
5 E 5 1 2
6 F 6 5 4
7 4 5
8 2 8
9 2 7
10 1 3
11 2 2
12 3 6
I am using Apache Derby DB. I need to perform the average calculation in SQL. I need to show the list of item IDs and their average total of the last 3 records.
That is, for ITEM.ID 1, I will go to TOTAL table and select the last 3 records of the rows which are associated with the ITEMID 1. And take average of them. In Derby database, I am able to do this for a given item ID but I cannot make it without giving a specific ID. Let me show you what I've done it.
SELECT ITEM.ID, AVG(VALUE) FROM ITEM, TOTAL WHERE TOTAL.ITEMID = ITEM.ID GROUP BY ITEM.ID
This SQL gives the average of all items in a list. But this calculates for all values of the total tables. I need last 3 records only. So I changed the SQL to this:
SELECT AVG(VALUE) FROM (SELECT ROW_NUMBER() OVER() AS ROWNUM, TOTAL.* FROM TOTAL WHERE ITEMID = 1) AS TR WHERE ROWNUM > (SELECT COUNT(ID) FROM TOTAL WHERE ITEMID = 1) - 3
This works if I supply the item ID 1 or 2 etc. But I cannot do this for all items without giving an item ID.
I tried to do the same thing in ORACLE using partition and it worked. But derby does not support partitioning. There is WINDOW but I could not make use of it.
Oracle one
SELECT ITEMID, AVG(VALUE) FROM(SELECT ITEMID, VALUE, COUNT(*) OVER (PARTITION BY ITEMID) QTY, ROW_NUMBER() OVER (PARTITION BY ITEMID ORDER BY ID) IDX FROM TOTAL ORDER BY ITEMID, ID) WHERE IDX > QTY -3 GROUP BY ITEMID ORDER BY ITEMID
I need to use derby DB for its portability.
The desired output is this
RESULT
-----------------
ITEMID | AVERAGE
1 (9/3)
2 (17/3)
3 (17/3)
4 (5/1)
5 (4/1)
6 NULL

As you have noticed, Derby's support for the SQL 2003 "OLAP Operations" support is incomplete.
There was some initial work (see https://wiki.apache.org/db-derby/OLAPOperations), but that work was only partially completed.
I don't believe anyone is currently working on adding more functionality to Derby in this area.
So yes, Derby has a row_number function, but no, Derby does not (currently) have partition by.

SQL query to take top elements of ordered list on Apache Hive

I have the table below in an SQL database.
user rating
1 10
1 7
1 6
1 2
2 8
2 3
2 2
2 2
I would like to keep only the best two ratings by user to get:
user rating
1 10
1 7
2 8
2 3
What would be the SQL query to do that? I am not sure how to do it.

It will work
;with cte as
(select user,rating, row_number() over (partition by user order by rating desc) maxval
from yourtable)
select user,rating
from cte
where maxval in (1,2)

How to find the SQL medians for a grouping

I am working with SQL Server 2008
If I have a Table as such:
Code Value
-----------------------
4 240
4 299
4 210
2 NULL
2 3
6 30
6 80
6 10
4 240
2 30
How can I find the median AND group by the Code column please?
To get a resultset like this:
Code Median
-----------------------
4 240
2 16.5
6 30
I really like this solution for median, but unfortunately it doesn't include Group By:
https://stackoverflow.com/a/2026609/106227

The solution using rank works nicely when you have an odd number of members in each group, i.e. the median exists within the sample, where you have an even number of members the rank method will fall down, e.g.
1
2
3
4
The median here is 2.5 (i.e. half the group is smaller, and half the group is larger) but the rank method will return 3. To get around this you essentially need to take the top value from the bottom half of the group, and the bottom value of the top half of the group, and take an average of the two values.
WITH CTE AS
( SELECT Code,
Value,
[half1] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value),
[half2] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value DESC)
FROM T
WHERE Value IS NOT NULL
)
SELECT Code,
(MAX(CASE WHEN Half1 = 1 THEN Value END) +
MIN(CASE WHEN Half2 = 1 THEN Value END)) / 2.0
FROM CTE
GROUP BY Code;
Example on SQL Fiddle
In SQL Server 2012 you can use PERCENTILE_CONT
SELECT DISTINCT
Code,
Median = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Value) OVER(PARTITION BY Code)
FROM T;
Example on SQL Fiddle

SQL Server does not have a function to calculate medians, but you could use the ROW_NUMBER function like this:
WITH RankedTable AS (
SELECT Code, Value,
ROW_NUMBER() OVER (PARTITION BY Code ORDER BY VALUE) AS Rnk,
COUNT(*) OVER (PARTITION BY Code) AS Cnt
FROM MyTable
)
SELECT Code, Value
FROM RankedTable
WHERE Rnk = Cnt / 2 + 1
To elaborate a bit on this solution, consider the output of the RankedTable CTE:
Code Value Rnk Cnt
---------------------------
4 240 2 3 -- Median
4 299 3 3
4 210 1 3
2 NULL 1 2
2 3 2 2 -- Median
6 30 2 3 -- Median
6 80 3 3
6 10 1 3
Now from this result set, if you only return those rows where Rnk equals Cnt / 2 + 1 (integer division), you get only the rows with the median value for each group.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Get Percentile for a user - sql

Related

How to get the values for every group of the top 3 types

Resetting a Count in SQL

Derby DB last x row average

SQL query to take top elements of ordered list on Apache Hive

How to find the SQL medians for a grouping

Categories

Resources