Number of rows per "percentile" - sql

I would like a Postgres query returning the number of rows per percentile.
Input:
id
name
price
1
apple
12
2
banana
6
3
orange
18
4
pineapple
26
4
lemon
30
Desired output:
percentile_3_1
percentile_3_2
percentile_3_3
1
2
2
percentile_3_1 = number of fruits in the 1st 3-precentile (i.e. with a price < 10)

Postgres has the window function ntile() and a number of very useful ordered-set aggregate functions for percentiles. But you seem to have the wrong term.
number of fruits in the 1st 3-precentile (i.e. with a price < 10)
That's not a "percentile". That's the count of rows with a price below a third of the maximum.
Assuming price is defined numeric NOT NULL CHECK (price > 0), here is a generalized query to get row counts for any given number of partitions:
WITH bounds AS (
SELECT *
FROM (
SELECT bound AS lo, lead(bound) OVER (ORDER BY bound) AS hi
FROM (
SELECT generate_series(0, x, x/3) AS bound -- number of partitions here!
FROM (SELECT max(price) AS x FROM tbl) x
) sub1
) sub2
WHERE hi IS NOT NULL
)
SELECT b.hi, count(t.price)
FROM bounds b
LEFT JOIN tbl t ON t.price > b.lo AND t.price <= b.hi
GROUP BY 1
ORDER BY 1;
Result:
hi | count
--------------------+------
10.0000000000000000 | 1
20.0000000000000000 | 2
30.0000000000000000 | 2
Notably, each partition includes the upper bound, as this makes more sense while deriving partitions from the maximum value. So your quote would read:
i.e. with a price <= 10
db<>fiddle here

Related

Aggregate Sliding Window for hive

I have a hive table which is in sorted order based on a numeric value say count.
fruit count
------ -------
apple 10
orange 8
banana 5
melon 3
pears 1
The total count is 27. I need it divided into three segments. So first 1/3 of count i.e. 1 to 9 is one, 10 to 18 is second and 19 to 27 is third.
I guess I need to do some sought of sliding window.
fruit count zone
------ ------- --------
apple 10 one
orange 8 two
banana 5 three
melon 3 three
pears 1 three
Any idea how to approach this
In SQL way:
select *,
(
sum(count) over (partition by 1 order by count desc) /*<---this line for return running totals*/
/(sum(count) over (partition by 1) /3) /*<-- divided total count into 3 group. In your case this is 9 for each zone value.*/
) /*<--using running totals divided by zone value*/
+ /*<-- 11 / 9 = 1 ... 2 You must plus 1 with quotient to let 11 in the right zone.Thus,I use this + operator */
(
case when
(
sum(count) over (partition by 1 order by count desc)
%(sum(count) over (partition by 1) /3) /*<--calculate remainder */
) >1 then 1 else 0 end /*<--if remainder>1 then the zone must +1*/
) as zone
from yourtable

Running "distinct on" across all unique thresholds in a postgres table

I have a Postgres 11 table called sample_a that looks like this:
time | cat | val
------+-----+-----
1 | 1 | 5
1 | 2 | 4
2 | 1 | 6
3 | 1 | 9
4 | 3 | 2
I would like to create a query that for each unique timestep, gets the most recent values across each category at or before that timestep, and aggregates these values by taking the sum of these values and dividing by the count of these values.
I believe I have the query to do this for a given timestep. For example, for time 3 I can run the following query:
select sum(val)::numeric / count(val) as result from (
select distinct on (cat) * from sample_a where time <= 3 order by cat, time desc
) x;
and get 6.5. (This is because at time 3, the latest from category 1 is 9 and the latest from category 2 is 4. The count of the values are 2, and they sum up to 13, and 13 / 2 is 6.5.)
However, I would ideally like to run a query that will give me all the results for each unique time in the table. The output of this new query would look as follows:
time | result
------+----------
1 | 4.5
2 | 5
3 | 6.5
4 | 5
This new query ideally would avoid adding another subselect clause if possible; an efficient query would be preferred. I could get these prior results by running the prior query inside my application for each timestep, but this doesn't seem efficient for a large sample_a.
What would this new query look like?
See if performance is acceptable this way. Syntax might need minor tweaks:
select t.time, avg(mr.val) as result
from (select distinct time from sample_a) t,
lateral (
select distinct on (cat) val
from sample_a a
where a.time <= t.time
order by a.cat, a.time desc
) mr
group by t.time
I think you just want cumulative functions:
select time,
sum(sum(val)) over (order by time) / sum(sum(num_val)) over (order by time) as result
from (select time, sum(val) as sum_val, count(*) as num_val
from sample_a a
group by time
) a;
Note if val is an integer, you might need to convert to a numeric to get fractional values.
This can be expressed without a subquery as well:
select time,
sum(sum(val)) over (order by time) / sum(count(*)) over (order by time) as result
from sample_a
group by time

SQL - Overall average Points

I have a table like this:
[challenge_log]
User_id | challenge | Try | Points
==============================================
1 1 1 5
1 1 2 8
1 1 3 10
1 2 1 5
1 2 2 8
2 1 1 5
2 2 1 8
2 2 2 10
I want the overall average points. To do so, i believe i need 3 steps:
Step 1 - Get the MAX value (of points) of each user in each challenge:
User_id | challenge | Points
===================================
1 1 10
1 2 8
2 1 5
2 2 10
Step 2 - SUM all the MAX values of one user
User_id | Points
===================
1 18
2 15
Step 3 - The average
AVG = SUM (Points from step 2) / number of users = 16.5
Can you help me find a query for this?
You can get the overall average by dividing the total number of points by the number of distinct users. However, you need the maximum per challenge, so the sum is a bit more complicated. One way is with a subquery:
select sum(Points) / count(distinct userid)
from (select userid, challenge, max(Points) as Points
from challenge_log
group by userid, challenge
) cl;
You can also do this with one level of aggregation, by finding the maximum in the where clause:
select sum(Points) / count(distinct userid)
from challenge_log cl
where not exists (select 1
from challenge_log cl2
where cl2.userid = cl.userid and
cl2.challenge = cl.challenge and
cl2.points > cl.points
);
Try these on for size.
Overall Mean
select avg( Points ) as mean_score
from challenge_log
Per-Challenge Mean
select challenge ,
avg( Points ) as mean_score
from challenge_log
group by challenge
If you want to compute the mean of each users highest score per challenge, you're not exactly raising the level of complexity very much:
Overall Mean
select avg( high_score )
from ( select user_id ,
challenge ,
max( Points ) as high_score
from challenge_log
) t
Per-Challenge Mean
select challenge ,
avg( high_score )
from ( select user_id ,
challenge ,
max( Points ) as high_score
from challenge_log
) t
group by challenge
After step 1 do
SELECT USER_ID, AVG(POINTS)
FROM STEP1
GROUP BY USER_ID
You can combine step 1 and 2 into a single query/subquery as follows:
Select BestShot.[User_ID], AVG(cast (BestShot.MostPoints as money))
from (select tLog.Challenge, tLog.[User_ID], MostPoints = max(tLog.points)
from dbo.tmp_Challenge_Log tLog
Group by tLog.User_ID, tLog.Challenge
) BestShot
Group by BestShot.User_ID
The subquery determines the most points for each user/challenge combo, and the outer query takes these max values and uses the AVG function to return the average value of them. The last Group By tells SQL to average all the values across each User_ID.

How to find the SQL medians for a grouping

I am working with SQL Server 2008
If I have a Table as such:
Code Value
-----------------------
4 240
4 299
4 210
2 NULL
2 3
6 30
6 80
6 10
4 240
2 30
How can I find the median AND group by the Code column please?
To get a resultset like this:
Code Median
-----------------------
4 240
2 16.5
6 30
I really like this solution for median, but unfortunately it doesn't include Group By:
https://stackoverflow.com/a/2026609/106227
The solution using rank works nicely when you have an odd number of members in each group, i.e. the median exists within the sample, where you have an even number of members the rank method will fall down, e.g.
1
2
3
4
The median here is 2.5 (i.e. half the group is smaller, and half the group is larger) but the rank method will return 3. To get around this you essentially need to take the top value from the bottom half of the group, and the bottom value of the top half of the group, and take an average of the two values.
WITH CTE AS
( SELECT Code,
Value,
[half1] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value),
[half2] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value DESC)
FROM T
WHERE Value IS NOT NULL
)
SELECT Code,
(MAX(CASE WHEN Half1 = 1 THEN Value END) +
MIN(CASE WHEN Half2 = 1 THEN Value END)) / 2.0
FROM CTE
GROUP BY Code;
Example on SQL Fiddle
In SQL Server 2012 you can use PERCENTILE_CONT
SELECT DISTINCT
Code,
Median = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Value) OVER(PARTITION BY Code)
FROM T;
Example on SQL Fiddle
SQL Server does not have a function to calculate medians, but you could use the ROW_NUMBER function like this:
WITH RankedTable AS (
SELECT Code, Value,
ROW_NUMBER() OVER (PARTITION BY Code ORDER BY VALUE) AS Rnk,
COUNT(*) OVER (PARTITION BY Code) AS Cnt
FROM MyTable
)
SELECT Code, Value
FROM RankedTable
WHERE Rnk = Cnt / 2 + 1
To elaborate a bit on this solution, consider the output of the RankedTable CTE:
Code Value Rnk Cnt
---------------------------
4 240 2 3 -- Median
4 299 3 3
4 210 1 3
2 NULL 1 2
2 3 2 2 -- Median
6 30 2 3 -- Median
6 80 3 3
6 10 1 3
Now from this result set, if you only return those rows where Rnk equals Cnt / 2 + 1 (integer division), you get only the rows with the median value for each group.

SQL query to return rows in multiple groups

I have a SQL table with data in the following format:
REF FIRSTMONTH NoMONTHS VALUE
--------------------------------
1 2 1 100
2 4 2 240
3 5 4 200
This shows a quoted value which should be delivered starting on the FIRSTMONTH and split over NoMONTHS
I want to calculate the SUM for each month of the potential deliveries from the quoted values.
As such I need to return the following result from a SQL server query:
MONTH TOTAL
------------
2 100 <- should be all of REF=1
4 120 <- should be half of REF=2
5 170 <- should be half of REF=2 and quarter of REF=3
6 50 <- should be quarter of REF=3
7 50 <- should be quarter of REF=3
8 50 <- should be quarter of REF=3
How can I do this?
You are trying extract data from what should be a many to many relationship.
You need 3 tables. You should be able to write a JOIN or GROUP BY select statement from there. The tables below don't use the same data values as yours, and are merely intended for a structural example.
**Month**
REF Month Value
---------------------
1 2 100
2 3 120
etc.
**MonthGroup**
REF
---
1
2
**MonthsToMonthGroups**
MonthREF MonthGroupREF
------------------
1 1
2 2
2 3
The first part of this query gets a set of numbers between the start and the end of the valid values
The second part takes each month value, and divides it into the monthly amount
Then it is simply a case of grouping each month, and adding up all of the monthly amounts.
select
number as month, sum(amount)
from
(
select number
from master..spt_values
where type='p'
and number between (select min(firstmonth) from yourtable)
and (select max(firstmonth+nomonths-1) from yourtable)
) numbers
inner join
(select
firstmonth,
firstmonth+nomonths-1 as lastmonth,
value / nomonths as amount
from yourtable) monthly
on numbers.number between firstmonth and lastmonth
group by number