How to get the median for every record? - SQL

There's no median function in SQL Server, so I'm using this wonderful suggestion:
https://stackoverflow.com/a/2026609/117700
This computes the median over an entire dataset, but I need the median per record.
My dataset is:
+-----------+-------------+
| client_id | TimesTested |
+-----------+-------------+
| 214220    | 1           |
| 215425    | 1           |
| 212839    | 4           |
| 215249    | 1           |
| 210498    | 3           |
| 110655    | 1           |
| 110655    | 1           |
| 110655    | 12          |
| 215425    | 4           |
| 100196    | 1           |
| 110032    | 1           |
| 110032    | 1           |
| 101944    | 3           |
| 101232    | 2           |
| 101232    | 1           |
+-----------+-------------+
Here's the query I'm using:
select client_id,
       (
         SELECT
         (
           (SELECT MAX(TimesTested) FROM
             (SELECT TOP 50 PERCENT t.TimesTested
              FROM counted3 t
              where t.timestested > 1
                and CLIENT_ID = t.CLIENT_ID
              ORDER BY t.TimesTested) AS BottomHalf)
           +
           (SELECT MIN(TimesTested) FROM
             (SELECT TOP 50 PERCENT t.TimesTested
              FROM counted3 t
              where t.timestested > 1
                and CLIENT_ID = t.CLIENT_ID
              ORDER BY t.TimesTested DESC) AS TopHalf)
         ) / 2 AS Median
       ) TotalAvgTestFreq
from counted3
group by client_id
but it is giving me funny data:
+-----------+------------------+
| client_id | TotalAvgTestFreq |
+-----------+------------------+
| 100007    | 84               |
| 100008    | 84               |
| 100011    | 84               |
| 100014    | 84               |
| 100026    | 84               |
| 100027    | 84               |
| 100028    | 84               |
| 100029    | 84               |
| 100042    | 84               |
| 100043    | 84               |
| 100071    | 84               |
| 100072    | 84               |
| 100074    | 84               |
+-----------+------------------+
How can I get the median for every client_id?
I am currently trying to use this awesome query from Aaron's site:
select c3.client_id,
       (
         SELECT AVG(1.0 * TimesTested) median
         FROM
         (
           SELECT o.TimesTested,
                  rn = ROW_NUMBER() OVER (ORDER BY o.TimesTested), c.c
           FROM counted3 AS o
           CROSS JOIN (SELECT c = COUNT(*) FROM counted3) AS c
           where o.timestested > 1
         ) AS x
         WHERE rn IN ((c + 1)/2, (c + 2)/2)
       ) a
from counted3 c3
group by c3.client_id
Unfortunately, as RichardTheKiwi points out:
it's for a single median whereas this question is about a median per-partition
I would like to know how I can join it on counted3 to get the median per partition.

Note: if testfreq is an int or bigint type, you need to CAST it before taking an average; otherwise you'll get integer division, e.g. (2+5)/2 => 3 if 2 and 5 are the median records. Use e.g. AVG(CAST(testfreq AS float)).
select client_id, avg(testfreq) median_testfreq
from
(
  select client_id,
         testfreq,
         rn = row_number() over (partition by CLIENT_ID
                                 order by testfreq),
         c = count(testfreq) over (partition by CLIENT_ID)
  from counted3
  where timestested > 1
) g
where rn in ((c + 1)/2, (c + 2)/2)
group by client_id;
The median is either the central record of an ODD number of rows, or the average of the two central records of an EVEN number of rows. This is handled by the condition rn in ((c + 1)/2, (c + 2)/2): with integer division, both expressions point at the same middle row when c is odd (c = 5 gives 3 and 3), and at the two central rows when c is even (c = 4 gives 2 and 3).
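To see the rn condition pick the right rows, here is a minimal self-contained check (my sketch; the temp table and data are made up, with testfreq as the measured column):
-- Minimal sketch with made-up data: one odd group and one even group.
create table #counted3 (client_id int, testfreq int);
insert into #counted3 values
  (1, 2), (1, 4), (1, 10),          -- odd count: median = 4
  (2, 3), (2, 5), (2, 7), (2, 9);   -- even count: median = (5 + 7) / 2 = 6
select client_id, avg(1.0 * testfreq) as median_testfreq
from
(
  select client_id,
         testfreq,
         rn = row_number() over (partition by client_id order by testfreq),
         c  = count(*) over (partition by client_id)
  from #counted3
) g
where rn in ((c + 1)/2, (c + 2)/2)
group by client_id;
drop table #counted3;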

Try this:
select client_id,
       (
         SELECT
         (
           (SELECT MAX(testfreq) FROM
             (SELECT TOP 50 PERCENT t.testfreq
              FROM counted3 t
              where t.timestested > 1
                and c3.CLIENT_ID = t.CLIENT_ID
              ORDER BY t.testfreq) AS BottomHalf)
           +
           (SELECT MIN(testfreq) FROM
             (SELECT TOP 50 PERCENT t.testfreq
              FROM counted3 t
              where t.timestested > 1
                and c3.CLIENT_ID = t.CLIENT_ID
              ORDER BY t.testfreq DESC) AS TopHalf)
         ) / 2 AS Median
       ) TotalAvgTestFreq
from counted3 c3
group by client_id
I added the c3 alias to the outer CLIENT_ID references and the outer table. Without it, CLIENT_ID = t.CLIENT_ID compares t's own column to itself, so every row matches, the subqueries scan the whole table, and you get the same global figure (84) for every client.
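Separately from the answers above: if you're on SQL Server 2012 or later, PERCENTILE_CONT gives the per-partition median directly. A minimal sketch, assuming the same counted3 columns:
-- Sketch: PERCENTILE_CONT(0.5) is the continuous median per partition.
-- DISTINCT collapses the per-row window output to one row per client.
select distinct client_id,
       percentile_cont(0.5) within group (order by testfreq)
           over (partition by client_id) as median_testfreq
from counted3
where timestested > 1;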

Related

Get some values from the table by selecting

I have a table:
| id | Number | Address |
|----|--------|---------|
| 1  | 0      | NULL    |
| 1  | 1      | NULL    |
| 1  | 2      | 50      |
| 1  | 3      | NULL    |
| 2  | 0      | 10      |
| 3  | 1      | 30      |
| 3  | 2      | 20      |
| 3  | 3      | 20      |
| 4  | 0      | 75      |
| 4  | 1      | 22      |
| 4  | 2      | 30      |
| 5  | 0      | NULL    |
I need to get: the NUMBER of the last ADDRESS change for each ID.
I wrote this select:
select dh.id, dh.number from table dh where dh.number =
    (select max(min(t.number)) from table t where t.id = dh.id group by t.address)
But this select does not correctly handle the case when the address first changed and then changed back to a previous value. For example, for id = 1 the group by returns:
| Address |
|---------|
| NULL    |
| 50      |
I have been thinking about this select for several days, and I will be happy to receive any help.
You can do this using row_number() -- twice:
select t.id, min(number)
from (select t.*,
             row_number() over (partition by id order by number desc) as seqnum1,
             row_number() over (partition by id, address order by number desc) as seqnum2
      from t
     ) t
where seqnum1 = seqnum2
group by id;
What this does is enumerate the rows by number in descending order:
once per id;
once per id and address.
The two sequence numbers match only for the most recent run of the same address within each id. Aggregation then pulls back the earliest row of that run, which is the number of the last address change.
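A self-contained way to trace the trick on the question's sample (my sketch, Postgres-flavored VALUES syntax since the question doesn't name a DBMS; t is an assumed table name):
-- Sketch: the sample rows in a CTE, then the double row_number() trick.
with t (id, number, address) as (
  values (1, 0, null), (1, 1, null), (1, 2, 50), (1, 3, null),
         (2, 0, 10),
         (3, 1, 30), (3, 2, 20), (3, 3, 20),
         (4, 0, 75), (4, 1, 22), (4, 2, 30),
         (5, 0, null)
)
select id, min(number) as last_change_number
from (select id, number,
             row_number() over (partition by id order by number desc) as seqnum1,
             row_number() over (partition by id, address order by number desc) as seqnum2
      from t
     ) x
where seqnum1 = seqnum2
group by id
order by id;
-- id 1 -> 3 (changed back to NULL at number 3), id 3 -> 2, id 4 -> 2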
I answered my question myself; in case anyone needs it, here is my solution:
select * from table dh1 where dh1.number = (
  select max(x.number)
  from (
    select dh2.id, dh2.number, dh2.address,
           lag(dh2.address) over (order by dh2.number asc) as prev
    from table dh2 where dh1.id = dh2.id
  ) x
  where NVL(x.address, 0) <> NVL(x.prev, 0)
);

Compare dates and data column

I have tables like this:
TABLE 1 - PERSON:
| m_id | name |
|------|------|
| 22   | jo   |
| 77   | john |
TABLE 2 - AMT_DATA:
| m_id | amt | activity |
|------|-----|----------|
| 22   | 100 | -        |
| 77   | 300 | n        |
TABLE 3 - STATUS_DATA:
| m_id | status | s_date     |
|------|--------|------------|
| 22   | -      | 01.01.2000 |
| 22   | n      | 01.01.2001 |
| 22   | -      | 01.01.2002 |
| 77   | -      | 01.01.2001 |
| 77   | n      | 01.01.2002 |
How can I write a query or procedure that returns all m_ids for which the greatest status_data.s_date for that m_id also has status_data.status = '-'?
I need to get result like this:
| person.m_id | person.name | amt_data.amt | status | s_date     |
|-------------|-------------|--------------|--------|------------|
| 22          | jo          | 100          | -      | 01.01.2002 |
I don't see what amt really has to do with the question. You can just join that in.
One method is:
select p.*, status_date, status
from person p join
     (select m_id, max(s_date) as status_date,
             max(status) keep (dense_rank first order by s_date desc) as status
      from status_data
      group by m_id
     ) s
     using (m_id)
where status = '-';
The keep syntax is Oracle's (rather verbose) way of implementing a "first" aggregation function.
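To make the keep clause concrete, here is a self-contained Oracle illustration on the question's sample rows (my sketch; the CTE just stands in for the real status_data):
-- Sketch: keep (dense_rank first order by s_date desc) restricts the aggregate
-- to each group's latest-date rows before max(status) is taken.
with status_data (m_id, status, s_date) as (
  select 22, '-', date '2000-01-01' from dual union all
  select 22, 'n', date '2001-01-01' from dual union all
  select 22, '-', date '2002-01-01' from dual union all
  select 77, '-', date '2001-01-01' from dual union all
  select 77, 'n', date '2002-01-01' from dual
)
select m_id,
       max(s_date) as status_date,
       max(status) keep (dense_rank first order by s_date desc) as status
from status_data
group by m_id;
-- m_id 22 -> '-' on 01.01.2002 (survives where status = '-')
-- m_id 77 -> 'n' on 01.01.2002 (filtered out)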
You can use an analytic function as follows:
select *
from (select p.m_id,
             p.name,
             a.amt,
             s.status,
             s.s_date,
             row_number() over (partition by p.m_id order by s.s_date desc) as rn
      from person p
      join amt_data a on p.m_id = a.m_id
      join status_data s on p.m_id = s.m_id
     ) t
where rn = 1
  and status = '-';
Note that the status filter belongs outside the subquery: filtering s.status = '-' before numbering the rows would return each m_id's latest '-' row even when its overall latest row has status 'n'.

How to roll up based on a few criteria in SQL

I have a data table like this:
QuestionID  UserName  UserWeightingForQuestion  AnswerGivenForQuestion  Metric
1           A         1.50                      1                       ToBeCalculated
1           B         1.00                      2                       ToBeCalculated
1           C         1.80                      3                       ToBeCalculated
1           D         1.20                      1                       ToBeCalculated
1           E         1.40                      2                       ToBeCalculated
2           A         1.20                      2                       ToBeCalculated
2           B         1.20                      2                       ToBeCalculated
2           C         1.10                      4                       ToBeCalculated
2           D         1.20                      5                       ToBeCalculated
...
For each question group, I'd like to fill each cell under Metric column with a calculated value defined as shown below:
Metric_For_User_A_For_QuestionID_X = SUM(weights of rows in QuestionID group X whose answer matches user A's answer) / SUM(DISTINCT weights in QuestionID group X)
Specifically speaking,
Metric_For_User_A_For_QuestionID_1 = SUM(1.50+1.20)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_B_For_QuestionID_1 = SUM(1.00+1.40)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_C_For_QuestionID_1 = SUM(1.80)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_D_For_QuestionID_1 = SUM(1.50+1.20)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_E_For_QuestionID_1 = SUM(1.00+1.40)/(1.50+1.00+1.80+1.20+1.40)
For QuestionID group = 2, I'd like to repeat the process as above. For example,
Metric_For_User_A_For_QuestionID_2 = SUM(1.20+1.20)/(1.20+1.10)
I'm fairly new to SQL, and I believe OVER or some sort of aggregate function can be used to achieve this(?). If this kind of calculation is possible in SQL, could someone with SQL expertise suggest a way to achieve what I'm trying to calculate?
The raw table has ~70m rows, and I am using SQL Server. Thank you very much in advance for your suggestions and answers!
You can use the SUM window function to do this.
select t.*,
       sum(UserWeightingForQuestion) over (partition by questionID, AnswerGivenForQuestion)
       / sum(UserWeightingForQuestion) over (partition by questionID) as metric
from tablename t
sum(UserWeightingForQuestion) over (partition by questionID) gets the sum of all weights per questionID.
sum(UserWeightingForQuestion) over (partition by questionID, AnswerGivenForQuestion) sums the weights of the rows that gave the same answer within that questionID.
Edit: To sum up the distinct weights for each questionID in the denominator, use
select t.*,
       sum(UserWeightingForQuestion) over (partition by questionID, AnswerGivenForQuestion)
       / (select sum(distinct UserWeightingForQuestion) from tablename where t.questionID = questionID) as metric
from tablename t
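SUM(DISTINCT ...) OVER (...) isn't supported in SQL Server, hence the correlated subquery. On ~70m rows it may be cheaper to pre-aggregate the distinct sums once and join them back; a sketch under the same assumed table name:
-- Sketch: compute each question's distinct-weight sum once, then join it
-- back, instead of running a correlated subquery per row.
select t.*,
       sum(t.UserWeightingForQuestion) over (partition by t.QuestionID, t.AnswerGivenForQuestion)
       / d.DistinctWeightSum as metric
from tablename t
join (
    select QuestionID, sum(distinct UserWeightingForQuestion) as DistinctWeightSum
    from tablename
    group by QuestionID
) d on d.QuestionID = t.QuestionID;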
create table #quest(QuestionID int
    , UserName varchar(20)
    , UserWeightingForQuestion decimal(10,2)
    , AnswerGivenForQuestion int);
insert into #quest values
(1,'A',1.50,1),(1,'B',1.00,2),(1,'C',1.80,3),(1,'D',1.20,1),
(1,'E',1.40,2),(2,'A',1.20,2),(2,'B',1.20,2),(2,'C',1.10,4),(2,'D',1.20,5);
Basically you make two groupings: one by QuestionID and AnswerGivenForQuestion (the window sum), and another by QuestionID (the distinct-weight subquery).
WITH CALC AS
(
SELECT Q2.QuestionID, Q2.UserName,
SUM(UserWeightingForQuestion) OVER (PARTITION BY QuestionID, AnswerGivenForQuestion) AS Weight,
(SELECT SUM(DISTINCT Q1.UserWeightingForQuestion)
FROM #quest Q1
WHERE Q1.QuestionID = Q2.QuestionID) AS AllWeights
FROM #quest Q2
)
SELECT QuestionID, UserName, Weight, AllWeights,
CAST(Weight / AllWeights AS DECIMAL(18,2)) as Metric
FROM CALC
ORDER BY QuestionID, UserName;
+------------+----------+--------+------------+--------+
| QuestionID | UserName | Weight | AllWeights | Metric |
+------------+----------+--------+------------+--------+
| 1          | A        | 2.70   | 6.90       | 0.39   |
| 1          | B        | 2.40   | 6.90       | 0.35   |
| 1          | C        | 1.80   | 6.90       | 0.26   |
| 1          | D        | 2.70   | 6.90       | 0.39   |
| 1          | E        | 2.40   | 6.90       | 0.35   |
+------------+----------+--------+------------+--------+
| 2          | A        | 2.40   | 2.30       | 1.04   |
| 2          | B        | 2.40   | 2.30       | 1.04   |
| 2          | C        | 1.10   | 2.30       | 0.48   |
| 2          | D        | 1.20   | 2.30       | 0.52   |
+------------+----------+--------+------------+--------+

Update table with ordered values

I need to update a table, ordering by price and reassigning the ordered prices and values. Prices and values are grouped by idcategory. Here is an example:
| ID | idcategory | price | value |
|----|------------|-------|-------|
| 1  | 1          | 10    | 3     |
| 2  | 1          | 12    | 30    |
| 3  | 1          | 43    | 9     |
| 4  | 1          | 32    | 23    |
| 5  | 2          | 38    | 13    |
| 6  | 2          | 8     | 26    |
| 7  | 2          | 3     | 34    |
| 8  | 2          | 10    | 12    |
| .. | ..         | ..    | ..    |
I need to reorder the table, grouping by idcategory and reassigning the ordered values to the ordered prices, like this:
| ID | idcategory | price | value |
|----|------------|-------|-------|
| 1  | 1          | 10    | 3     |
| 2  | 1          | 12    | 9     |
| 3  | 1          | 32    | 23    |
| 4  | 1          | 43    | 30    |
| 5  | 2          | 3     | 12    |
| 6  | 2          | 8     | 13    |
| 7  | 2          | 10    | 26    |
| 8  | 2          | 38    | 34    |
| .. | ..         | ..    | ..    |
The database is Postgres 9.2.
Any idea will be appreciated.
Thank you, and Happy New Year!
This is the updated working solution, based on GarethD's suggestion:
WITH OrderedValues AS
( SELECT Value,
Price,
idcategory,
ROW_NUMBER() OVER(PARTITION BY idcategory ORDER BY Value) AS ValueNum,
ROW_NUMBER() OVER(PARTITION BY idcategory ORDER BY Price) AS PriceNum
FROM T
), OrderedIDs AS
( SELECT ID,
idcategory,
ROW_NUMBER() OVER(PARTITION BY idcategory ORDER BY ID) AS RowNum
FROM T
), NewValues AS
( SELECT i.ID,
v.Value,
p.Price
FROM OrderedIDs i
INNER JOIN OrderedValues v
ON i.RowNum = v.ValueNum
AND i.idcategory = v.idcategory
INNER JOIN OrderedValues p
ON i.RowNum = p.PriceNum
AND i.idcategory = p.idcategory
)
UPDATE T
SET Price = v.Price,
Value = v.Value
FROM NewValues v
WHERE v.ID = T.ID;
SELECT *
FROM T;
You first need to rank both your IDs (OrderedIDs) and your Price/Value combinations (OrderedValues). Then you can match the corresponding ranks (NewValues) and update your table accordingly:
WITH OrderedValues AS
( SELECT Value,
Price,
idcategory,
ROW_NUMBER() OVER(PARTITION BY idcategory ORDER BY Value, Price) AS RowNum
FROM T
), OrderedIDs AS
( SELECT ID,
idcategory,
ROW_NUMBER() OVER(PARTITION BY idcategory ORDER BY ID) AS RowNum
FROM T
), NewValues AS
( SELECT i.ID,
v.Value,
v.Price
FROM OrderedIDs i
INNER JOIN OrderedValues v
ON i.RowNum = v.RowNum
AND i.idcategory = v.idcategory
)
UPDATE T
SET Price = v.Price,
Value = v.Value
FROM NewValues v
WHERE v.ID = T.ID;
Example on SQL Fiddle
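As a quick sanity check after running the update (my addition; T is the table name from the answers), both columns should come back ascending within each category when rows are read in ID order:
-- Sketch: price_ok / value_ok are NULL on each category's first row,
-- and should be true everywhere else after the update.
select id, idcategory, price, value,
       price >= lag(price) over (partition by idcategory order by id) as price_ok,
       value >= lag(value) over (partition by idcategory order by id) as value_ok
from T
order by idcategory, id;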

logic for handling tie with aggregate

I have this data set:
| ID  | TYPE | PERCENT |
|-----|------|---------|
| 123 | A    | 0.5     |
| 123 | B    | 0.5     |
| 456 | A    | 0.7     |
| 456 | B    | 0.3     |
| 789 | A    | 1       |
I would like the following result:
| ID  | TYPE | PERCENT |
|-----|------|---------|
| 123 | A    | 0.5     |
| 456 | A    | 0.7     |
| 789 | A    | 1       |
That is, getting the MAX(percent) for each id and the corresponding type.
I'm currently using
SELECT ...
FROM
(SELECT [id], MAX([percent]) AS [p]
FROM [highest]
GROUP BY [id]) a
LEFT JOIN [highest] b
ON b.[id] = a.[id]
AND b.[percent] = a.[p]
And getting
| ID  | P   | TYPE | PERCENT |
|-----|-----|------|---------|
| 123 | 0.5 | A    | 0.5     |
| 123 | 0.5 | B    | 0.5     |
| 456 | 0.7 | A    | 0.7     |
| 789 | 1   | A    | 1       |
Try this query:
SELECT src.[id], src.[type], src.[percent]
FROM (
SELECT [id], [type], [percent],
ROW_NUMBER() OVER(PARTITION BY h.[id] ORDER BY [percent] DESC, h.[type] ASC) AS RowNum
FROM [highest] h
) src
WHERE src.RowNum = 1
Another way to skin the cat:
SELECT d.ID, m.type, m.[percent]
FROM highest AS d
CROSS APPLY (
SELECT TOP 1 type, [percent]
FROM highest
WHERE ID = d.ID
ORDER BY [percent] DESC, type ASC
) AS m
GROUP BY d.ID, m.type, m.[percent]
;
That is, for every distinct ID, a row with the maximum percent (TOP 1 ... ORDER BY [percent] DESC) is fetched. When several types share the maximum for the same ID, the one that sorts before the others (type ASC) is selected.
A slightly less verbose equivalent (using DISTINCT instead of GROUP BY):
SELECT DISTINCT d.ID, m.type, m.[percent]
FROM highest AS d
CROSS APPLY (
SELECT TOP 1 type, [percent]
FROM highest
WHERE ID = d.ID
ORDER BY [percent] DESC, type ASC
) AS m
;
With proper indexing, this shouldn't be much worse than @Bogdan Sahlean's suggestion.
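If performance matters, a covering index along these lines (my sketch; names assumed) lets both the ROW_NUMBER and the CROSS APPLY plans seek per ID and read rows already ordered by [percent] DESC, type:
-- Sketch: supports WHERE ID = ... ORDER BY [percent] DESC, type ASC.
create index IX_highest_ID_Percent_Type
    on highest (ID, [percent] desc, type);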