Calculate percentiles using SQL for a group/partition - sql

I want to calculate the percentiles for a given partition/group in SQL. For example, the input data looks like this:
CustID Product ID quantity_purchased
1 111 2
2 111 3
3 111 2
4 111 5
1 222 2
2 222 6
4 222 7
6 222 2
I want to get percentiles for each product ID group. The output should be:
Product ID min 25% 50% 75% max
111 2 2 2.5 3.5 5
222 2 2 4 6.25 7
How can I achieve this using SQL?

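You can use percentile_cont() as an ordered-set aggregate (this form works in PostgreSQL and Oracle, for example; SQL Server only offers percentile_cont() as a window function with an OVER clause):
select product_id,
       min(quantity_purchased) as min_qty,
       percentile_cont(0.25) within group (order by quantity_purchased) as p25,
       percentile_cont(0.50) within group (order by quantity_purchased) as p50,
       percentile_cont(0.75) within group (order by quantity_purchased) as p75,
       max(quantity_purchased) as max_qty
from t
group by product_id;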
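To see where the expected numbers (2.5, 3.5, 6.25) come from, here is a minimal Python sketch of the linear interpolation that percentile_cont() performs, applied to the sample data from the question. The helper name `percentile_cont` and the `summary` dictionary are illustrative only, not part of any library.

```python
import math
from collections import defaultdict

def percentile_cont(values, p):
    """Linear-interpolation percentile, in the style of SQL percentile_cont."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)          # fractional index into the sorted list
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

# (CustID, product_id, quantity_purchased) rows from the question
rows = [(1, 111, 2), (2, 111, 3), (3, 111, 2), (4, 111, 5),
        (1, 222, 2), (2, 222, 6), (4, 222, 7), (6, 222, 2)]

by_product = defaultdict(list)
for _cust, product_id, qty in rows:
    by_product[product_id].append(qty)

# product_id -> (min, 25%, 50%, 75%, max)
summary = {
    pid: (min(q), percentile_cont(q, 0.25), percentile_cont(q, 0.50),
          percentile_cont(q, 0.75), max(q))
    for pid, q in by_product.items()
}
print(summary)
```

For product 111 the sorted quantities are 2, 2, 3, 5, so the median falls halfway between the second and third values, which gives 2.5.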

Related

Get average of rows group by value intervals

I have a table as follows:
ID | Value
1 5
1 1000
1 1500
2 1000
2 1800
3 40
3 1000
3 1200
3 2000
3 2500
I want to obtain the average for each ID, grouped by a given range r of value. For instance, if r=1000 in this case, the expected result would be:
ID | Value
1 5
1 1250
2 1400
3 40
3 1100
3 2250
I have seen that this can be done with time intervals as seen here. My question is, how can I perform this type of group by operation for integer/float types?
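You could try it this way. Use FLOOR rather than ROUND so that 1000 and 1500 fall into the same bucket (ROUND would push 1500 into bucket 2 on engines where / is not integer division), and divide by 1000.0 so the result does not depend on integer division:
SELECT id, avg(value) as AvgValue
FROM (SELECT id, value, FLOOR(value / 1000.0) AS bucket FROM yourtable) t
GROUP BY id, bucket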
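The bucketing idea can be checked outside the database. The following is a minimal Python sketch, assuming each value belongs to bucket floor(value / r); the names `buckets` and `averages` are illustrative only.

```python
from collections import defaultdict

# (ID, Value) rows from the question
rows = [(1, 5), (1, 1000), (1, 1500), (2, 1000), (2, 1800),
        (3, 40), (3, 1000), (3, 1200), (3, 2000), (3, 2500)]
r = 1000  # width of each value interval

buckets = defaultdict(list)
for id_, value in rows:
    # Floor division assigns 0..999 to bucket 0, 1000..1999 to bucket 1, etc.
    buckets[(id_, value // r)].append(value)

averages = [(id_, sum(vals) / len(vals))
            for (id_, _bucket), vals in sorted(buckets.items())]
print(averages)
```

Floor division also works for float values, so the same grouping applies to integer and float columns alike.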

Group by and provide groups only if unique in group

I have the following dataset:
Amount Document Number
0 200 12345
1 90 2222
2 200 456789
3 90 4444
4 300 4789
5 300 4789
Basically, I want to get group numbers for the above data (using ngroup, maybe).
The data should be grouped by Amount, and a group number should be assigned to a group only if all Document Numbers in that group are unique.
This is what I would like the outcome to be:
Amount Document Number Group
0 200 12345 1
1 90 2222 2
2 200 456789 1
3 90 4444 2
4 300 4789
5 300 4789
In other words, rows are grouped by Amount and get a group number only if the Document Number is unique within that group.
I think you want rank():
select t.*, rank() over (order by amount, document_number) as grouping
from t;
In pandas, you could first build a mask that is False for every row whose Amount group contains a duplicated row, using duplicated and groupby.transform. Then use this mask together with groupby.ngroup:
mask_dup = ~(df.duplicated().groupby(df['Amount']).transform(any))
df.loc[mask_dup, 'Group'] = df[mask_dup].groupby('Amount').ngroup()+1
print (df)
Amount Document Number Group
0 200 12345 2.0
1 90 2222 1.0
2 200 456789 2.0
3 90 4444 1.0
4 300 4789 NaN
5 300 4789 NaN
If your real data has more than these two columns, specify them via the subset parameter of duplicated.
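For readers who want to check the logic without pandas, here is a minimal pure-Python sketch of the same rule: an Amount gets a group number only when its document numbers are all distinct. Numbering eligible amounts in ascending order is an assumption made here to match the pandas output above; the names `docs_by_amount`, `group_of` and `groups` are illustrative.

```python
from collections import defaultdict

# (Amount, Document Number) rows from the question, in original order
rows = [(200, 12345), (90, 2222), (200, 456789),
        (90, 4444), (300, 4789), (300, 4789)]

docs_by_amount = defaultdict(list)
for amount, doc in rows:
    docs_by_amount[amount].append(doc)

# An amount is eligible only if none of its document numbers repeats.
eligible = sorted(a for a, docs in docs_by_amount.items()
                  if len(set(docs)) == len(docs))

# Number the eligible amounts 1, 2, ... in ascending order of amount.
group_of = {amount: n for n, amount in enumerate(eligible, start=1)}

# None marks rows that belong to no group (NaN in the pandas output).
groups = [group_of.get(amount) for amount, _doc in rows]
print(groups)
```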

pandas pivoting start and end

I need help with pivoting my df to get the start and end day.
Id Day Value
111 6 a
111 5 a
111 4 a
111 2 a
111 1 a
222 3 a
222 2 a
222 1 a
333 1 a
The desired result would be:
Id StartDay EndDay
111 4 6
111 1 2 (since 111 skips day 3)
222 1 3
333 1 1
Thanks a bunch!
My first thought was simply:
df.groupby('Id').Day.agg(['min','max'])
But then I noticed your stipulation "(since 111 skips day 3)", which means we have to make an identifier which tells us if the current row is in the same 'block' as the previous (same Id, contiguous Day). So, we sort:
df.sort_values(['Id','Day'], inplace=True)
Then define the block, which starts a new number whenever Day is not exactly one more than the previous row's Day:
df['block'] = ((df.Day!=(df.shift(1).Day+1).fillna(0).astype(int))).astype(int).cumsum()
(adapted from the top answer to this question: Finding consecutive segments in a pandas data frame)
Then group by Id and block:
df.groupby(['Id','block']).Day.agg(['min','max'])
Giving:
Id block min max
111 1 1 2
111 2 4 6
222 3 1 3
333 4 1 1
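The block idea above can also be sketched in plain Python: sort by Id and Day, then extend the current span while the Id matches and the Day is exactly one more than the span's end. This is only an illustration and assumes there are no duplicate (Id, Day) rows; the name `spans` is hypothetical.

```python
# (Id, Day) rows from the question; the Value column is not needed here
rows = [(111, 6), (111, 5), (111, 4), (111, 2), (111, 1),
        (222, 3), (222, 2), (222, 1), (333, 1)]

spans = []  # each entry is [Id, StartDay, EndDay]
for id_, day in sorted(rows):
    same_id = bool(spans) and spans[-1][0] == id_
    if same_id and spans[-1][2] + 1 == day:
        spans[-1][2] = day               # contiguous: extend the current span
    else:
        spans.append([id_, day, day])    # gap or new Id: start a new span

for id_, start, end in spans:
    print(id_, start, end)
```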

How to rewrite query which gives amount of specific value in row to avoid some values and count further with others?

I have a query which gives me the number of consecutive grade 5s for every student (as long as the student doesn't get any other grade in between):
select distinct on (student, class) scg.*
from (select student, class, grade, count(*) as cnt,
             min(gradeDate) as min_gradeDate, max(gradeDate) as max_gradeDate
      from (select t.*,
                   row_number() over (partition by student, class, grade order by gradeDate) as seqnum_scg,
                   row_number() over (partition by student, class order by gradeDate) as seqnum_sc
            from t
           ) t
      where grade = 5
      group by student, class, grade, (seqnum_sc - seqnum_scg)
     ) scg
order by student, class, cnt desc;
The original problem is explained here:
How to count data with specific values and for specific user/person (in row)?
But now I want to extend this query with one more feature. Currently the counter gives me the longest streak of 5s that is not interrupted by any other grade (4/3/2/1). Now I want it to behave as follows.
First, stop counting if the student gets a grade 4 or 3, and resume (keeping the previous count) when the student gets another 5. What I mean:
Actual query: 5, 5, 5, 4, 3, 5, 5, 2 --> gives me max = 3
New query: 5, 5, 5, 4, 3, 5, 5, 2 --> gives me max = 5, because 4 and 3 pause the counter and it resumes when the student gets another 5
Second, stop counting if the student gets grade 2 or 1 (and give me the max value reached before the 2/1 grade). This is what the query does now for every grade except 5, but I want it only for grade 2 and lower (a threshold I can specify in the query).
Can someone help me rewrite the second query given by @Gordon Linoff to work like that and tell me what changed?
Edit: examples as requested:
id student grade class gradeDate
1 1 5 1 2017-03-03
2 1 5 1 2017-03-04
3 1 1 1 2017-03-05
4 1 5 1 2017-03-06
5 1 5 1 2017-03-07
6 1 5 1 2017-03-08
7 1 1 1 2017-03-09
8 2 5 2 2017-03-03
9 3 5 3 2017-03-03
10 4 5 4 2017-03-03
11 4 5 4 2017-03-04
12 4 4 4 2017-03-05
13 4 3 4 2017-03-06
14 4 5 4 2017-03-07
15 4 5 4 2017-03-08
16 5 5 5 2017-03-01
17 5 5 5 2017-03-03
18 5 5 5 2017-03-04
19 5 5 5 2017-03-05
20 5 5 5 2017-03-06
21 5 2 5 2017-03-07
22 5 5 5 2017-03-08
23 5 5 5 2017-03-09
Student one : max = 3
Student two : max = 1
Student three : max = 1
Student four : max = 4 (grades 4 and 3 stop the counter, but don't reset it)
Student five : max = 5 (because grade 2 resets the counter; the missing grade on date 2017-03-02 is not a problem for the counter)
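One approach uses two subqueries and one analytic (window) function: a running count of grades 1 and 2 per student splits the rows into groups, and the 5s are then counted within each group.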
Demo: http://sqlfiddle.com/#!15/74b71/10
SELECT student, max( xxx )
FROM (
    SELECT student, grp_nbr, count(CASE WHEN grade = 5 THEN 1 END) As xxx
    FROM (
        SELECT *,
               SUM ( CASE WHEN grade in (1,2)
                          THEN 1 ELSE 0
                     END
                   ) OVER (Partition by student Order By gradeDate ) As grp_nbr
        FROM table1
    ) x
    GROUP BY student, grp_nbr
) y
GROUP BY student
ORDER BY student
| student | max |
|---------|-----|
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
| 4 | 4 |
| 5 | 5 |
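The rule itself (grades 1 and 2 reset the counter, grades 3 and 4 are skipped, grade 5 increments) can be sanity-checked with a short Python sketch. The `grades` lists below are the sample rows from the question, ordered by gradeDate; the function name `best_run` and its parameters are illustrative only.

```python
# student -> grades in gradeDate order, taken from the sample data
grades = {
    1: [5, 5, 1, 5, 5, 5, 1],
    2: [5],
    3: [5],
    4: [5, 5, 4, 3, 5, 5],
    5: [5, 5, 5, 5, 5, 2, 5, 5],
}

def best_run(seq, reset_grades=(1, 2), target=5):
    """Largest count of `target` grades between two resetting grades."""
    best = current = 0
    for grade in seq:
        if grade in reset_grades:
            current = 0                  # grades 1 and 2 restart the counter
        elif grade == target:
            current += 1
            best = max(best, current)
        # any other grade (3 or 4) neither counts nor resets
    return best

result = {student: best_run(seq) for student, seq in grades.items()}
print(result)
```

Changing `reset_grades` corresponds to changing the `grade in (1,2)` condition in the SQL.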

sql server 2008 - calculated and ordered list needs to return only 2 entries per supplier

I have a dataset like the one below, but longer. I want to pick the 'fleet_id' rows with the highest overall 'StarDriver' value, but return only two results for each 'supplier_id' and a maximum of 20 in total.
(I'm sorry, I didn't work out how to paste the data below with proper formatting. I couldn't find it in the toolbar above, and the Google results were about copying data. I would also be grateful if someone could point out how.)
fleet_id supplier_id Ratings Driver Punctuality Car StarDriver
19442 151 10 5 5 5 5
19634 151 11 5 5 5 5
19437 151 12 5 5 5 5
12832 10 14 5 4.92857142857143 5 4.97619047619048
12217 111 10 5 5 4.9 4.96666666666667
21135 158 19 5 4.89473684210526 5 4.96491228070175
19436 151 14 4.85714285714286 5 5 4.95238095238095
12239 111 12 4.91666666666667 5 4.91666666666667 4.94444444444445
10520 92 12 4.91666666666667 5 4.91666666666667 4.94444444444445
19997 151 12 5 5 4.83333333333333 4.94444444444444
To limit to the top 2 for each supplier, use row_number(). This enumerates the rows within each supplier, and you can keep just two with where seqnum <= 2.
The rest of the query just selects the top 20 rows by StarDriver, highest first:
select top 20 t.*
from (select t.*,
             row_number() over (partition by supplier_id order by StarDriver desc) as seqnum
      from yourtable t
     ) t
where seqnum <= 2
order by StarDriver desc;
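As a cross-check of the "top 2 per supplier, 20 overall" logic, here is a minimal Python sketch over the sample rows (only fleet_id, supplier_id and StarDriver are kept). Ties on StarDriver are broken by input order here, which is an assumption; row_number() in SQL is free to pick any order among ties. The names `top` and `picked_ids` are illustrative.

```python
from collections import Counter

# (fleet_id, supplier_id, StarDriver) from the sample data
rows = [
    (19442, 151, 5.0), (19634, 151, 5.0), (19437, 151, 5.0),
    (12832, 10, 4.97619047619048), (12217, 111, 4.96666666666667),
    (21135, 158, 4.96491228070175), (19436, 151, 4.95238095238095),
    (12239, 111, 4.94444444444445), (10520, 92, 4.94444444444445),
    (19997, 151, 4.94444444444444),
]
PER_SUPPLIER, TOTAL = 2, 20

taken = Counter()  # how many rows have been kept for each supplier so far
top = []
# Walk the rows from the highest StarDriver down (stable for ties).
for row in sorted(rows, key=lambda r: r[2], reverse=True):
    if len(top) == TOTAL:
        break
    if taken[row[1]] < PER_SUPPLIER:
        taken[row[1]] += 1
        top.append(row)

picked_ids = [fleet_id for fleet_id, _supplier, _score in top]
print(picked_ids)
```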