logic for handling tie with aggregate - sql

I have this data set:
| ID | TYPE | PERCENT |
------|------|---------|
| 123 | A | 0.5 |
| 123 | B | 0.5 |
| 456 | A | 0.7 |
| 456 | B | 0.3 |
| 789 | A | 1 |
I would like the following result:
| ID | TYPE | PERCENT |
------|------|---------|
| 123 | A | 0.5 |
| 456 | A | 0.7 |
| 789 | A | 1 |
That is, getting the MAX(percent) for each id and the corresponding type.
I'm currently using
SELECT ...
FROM
(SELECT [id], MAX([percent]) AS [p]
FROM [highest]
GROUP BY [id]) a
LEFT JOIN [highest] b
ON b.[id] = a.[id]
AND b.[percent] = a.[p]
And getting both rows for the tied ID 123:
| ID | P | TYPE | PERCENT |
------|-----|------|---------|
| 123 | 0.5 | A | 0.5 |
| 123 | 0.5 | B | 0.5 |
| 456 | 0.7 | A | 0.7 |
| 789 | 1 | A | 1 |

Try this query:
SELECT src.[id], src.[type], src.[percent]
FROM (
SELECT [id], [type], [percent],
ROW_NUMBER() OVER(PARTITION BY h.[id] ORDER BY [percent] DESC, h.[type] ASC) AS RowNum
FROM [highest] h
) src
WHERE src.RowNum = 1
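The tie-breaking behaviour of the query above can be checked on the sample data from the question. This is only a sketch: it runs the same ROW_NUMBER() pattern in SQLite through Python's sqlite3 module (an assumption, since the question targets SQL Server; window functions need SQLite 3.25 or later), and the function name top_type_per_id is made up for the illustration.

```python
import sqlite3

def top_type_per_id(rows):
    """Return one (id, type, percent) row per id: highest percent, ties broken by type."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE highest (id INTEGER, type TEXT, percent REAL)")
    conn.executemany("INSERT INTO highest VALUES (?, ?, ?)", rows)
    query = """
        SELECT id, type, percent
        FROM (
            SELECT id, type, percent,
                   ROW_NUMBER() OVER (
                       PARTITION BY id
                       ORDER BY percent DESC, type ASC  -- type ASC is the tie-breaker
                   ) AS row_num
            FROM highest
        ) AS src
        WHERE row_num = 1
        ORDER BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

# Sample data from the question; ID 123 has a tie between types A and B.
sample = [(123, "A", 0.5), (123, "B", 0.5), (456, "A", 0.7), (456, "B", 0.3), (789, "A", 1.0)]
print(top_type_per_id(sample))
```

For the tied ID 123 only type A survives, because A sorts before B in the secondary ORDER BY key.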

Another way to skin the cat:
SELECT d.ID, m.type, m.[percent]
FROM highest AS d
CROSS APPLY (
SELECT TOP 1 type, [percent]
FROM highest
WHERE ID = d.ID
ORDER BY [percent] DESC, type ASC
) AS m
GROUP BY d.ID, m.type, m.[percent]
;
That is, for every distinct ID, a row with the maximum (TOP 1 ... ORDER BY [percent] DESC) percent is fetched. When several types have the maximum for the same ID, the one that sorts before the others (type ASC) is selected.
A slightly less verbose equivalent (using DISTINCT instead of GROUP BY):
SELECT DISTINCT d.ID, m.type, m.[percent]
FROM highest AS d
CROSS APPLY (
SELECT TOP 1 type, [percent]
FROM highest
WHERE ID = d.ID
ORDER BY [percent] DESC, type ASC
) AS m
;
With proper indexing, this shouldn't be much worse than @Bogdan Sahlean's suggestion.

Related

Compare dates and data column

I have tables like this:
TABLE 1 - PERSON:
m_id | name |
-------------
22 | jo |
-------------
77 | john |
--------------
TABLE 2 - AMT_DATA
m_id | amt | activity |
-------------------------
22 | 100 | - |
-------------------------
77 | 300 | n |
-------------------------
TABLE 3 - STATUS_DATA:
m_id | status | s_date |
22 | - | 01.01.2000 |
22 | n | 01.01.2001 |
22 | - | 01.01.2002 |
77 | - | 01.01.2001 |
77 | n | 01.01.2002 |
How can I write a query or procedure that returns every m_id whose most recent status_data row (the one with the latest s_date for that m_id) also has status_data.status = '-'?
I need to get result like this:
person.m_id | person.name | amt_data.amt | status | s_date
------------------------------------------------------------------
22 | jo | 100 | - | 01.01.2002
I don't see what amt really has to do with the question. You can just join that in.
One method is:
select p.*, status_date, status
from person p join
(select m_id, max(s_date) as status_date,
max(status) keep (dense_rank first order by s_date desc) as status
from status_data
group by m_id
) s
using (m_id)
where status = '-';
The keep syntax is Oracle's (rather verbose) way of implementing a "first" aggregation function.
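You can also use an analytic function. Note that the status filter has to be applied after the latest row per m_id has been picked; filtering inside the subquery would instead return the latest row that happens to have status '-':
Select * from
(Select p.m_id,
P.name,
A.amt,
S.status,
S.s_date,
Row_number() over (partition by p.m_id order by s.s_date desc) as rn
From person p
join amt_data a on p.m_id = a.m_id
Join status_data s on p.m_id = s.m_id)
Where rn = 1
and status = '-';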
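Where the status filter sits relative to the ranking changes the result, which is easy to see on the STATUS_DATA rows from the question. The sketch below is an illustration only: it uses SQLite via Python's sqlite3 instead of Oracle, and it stores the dates as ISO text (an assumption made so that they sort chronologically).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE status_data (m_id INTEGER, status TEXT, s_date TEXT);
    INSERT INTO status_data VALUES
        (22, '-', '2000-01-01'), (22, 'n', '2001-01-01'), (22, '-', '2002-01-01'),
        (77, '-', '2001-01-01'), (77, 'n', '2002-01-01');
""")

# Rank every status row per m_id, newest first (no status filter yet).
ranked = """
    SELECT m_id, status, s_date,
           ROW_NUMBER() OVER (PARTITION BY m_id ORDER BY s_date DESC) AS rn
    FROM status_data
"""

# Filter after ranking: keep an m_id only if its newest row has status '-'.
latest_is_dash = conn.execute(
    f"SELECT m_id, s_date FROM ({ranked}) AS r WHERE rn = 1 AND status = '-' ORDER BY m_id"
).fetchall()

# Filter before ranking: returns the newest '-' row of every m_id that ever had one.
newest_dash_row = conn.execute(
    f"SELECT m_id, s_date FROM ({ranked} WHERE status = '-') AS r WHERE rn = 1 ORDER BY m_id"
).fetchall()

print(latest_is_dash)   # [(22, '2002-01-01')]
print(newest_dash_row)  # [(22, '2002-01-01'), (77, '2001-01-01')]
```

Only the first variant matches the requested output, since m_id 77 has a later row with status 'n'.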

Each rows to column values

I'm trying to create a view that shows first table's columns plus second table's first 3 records sorted by date in 1 row.
I tried selecting specific rows from the sub table using OFFSET and joining them to the main table, but because the joined subquery is ordered by date, it returns the wrong record unless it contains a
WHERE tblMain_id = ..
clause.
Here is sqlfiddle example: sqlfiddle demo
tblMain
| id | fname | lname | salary |
+----+-------+-------+--------+
| 1 | John | Doe | 1000 |
| 2 | Bob | Ross | 5000 |
| 3 | Carl | Sagan | 2000 |
| 4 | Daryl | Dixon | 3000 |
tblSub
| id | email | emaildate | tblmain_id |
+----+-----------------+------------+------------+
| 1 | John@Doe1.com | 2019-01-01 | 1 |
| 2 | John@Doe2.com | 2019-01-02 | 1 |
| 3 | John@Doe3.com | 2019-01-03 | 1 |
| 4 | Bob@Ross1.com | 2019-02-01 | 2 |
| 5 | Bob@Ross2.com | 2018-12-01 | 2 |
| 6 | Carl@Sagan.com | 2019-10-01 | 3 |
| 7 | Daryl@Dixon.com | 2019-11-01 | 4 |
View I am trying to achieve:
| id | fname | lname | salary | email_1 | emaildate_1 | email_2 | emaildate_2 | email_3 | emaildate_3 |
+----+-------+-------+--------+---------------+-------------+---------------+-------------+---------------+-------------+
| 1 | John | Doe | 1000 | John@Doe1.com | 2019-01-01 | John@Doe2.com | 2019-01-02 | John@Doe3.com | 2019-01-03 |
View I have created
| id | fname | lname | salary | email_1 | emaildate_1 | email_2 | emaildate_2 | email_3 | emaildate_3 |
+----+-------+-------+--------+---------+-------------+---------------+-------------+---------------+-------------+
| 1 | John | Doe | 1000 | (null) | (null) | John@Doe1.com | 2019-01-01 | John@Doe2.com | 2019-01-02 |
You can use conditional aggregation:
select m.id, m.fname, m.lname, m.salary,
max(s.email) filter (where seqnum = 1) as email_1,
max(s.emailDate) filter (where seqnum = 1) as emailDate_1,
max(s.email) filter (where seqnum = 2) as email_2,
max(s.emailDate) filter (where seqnum = 2) as emailDate_2,
max(s.email) filter (where seqnum = 3) as email_3,
max(s.emailDate) filter (where seqnum = 3) as emailDate_3
from tblMain m left join
(select s.*,
row_number() over (partition by tblMain_id order by emailDate) as seqnum
from tblsub s
) s
on s.tblMain_id = m.id
where m.id = 1
group by m.id, m.fname, m.lname, m.salary;
Here is a SQL Fiddle.
Here is a solution that should get you what you expect.
This works by first ranking records within each table and joining them together. Then, the outer query uses aggregation to generate the expected output.
This solution will work even if the first record in the main table does not have id 1. Also, filtering occurs within the JOINs, so this should be quite efficient.
SELECT
m.id,
m.fname,
m.lname,
m.salary,
MAX(CASE WHEN s.rn = 1 THEN s.email END) email_1,
MAX(CASE WHEN s.rn = 1 THEN s.emaildate END) emaildate_1,
MAX(CASE WHEN s.rn = 2 THEN s.email END) email_2,
MAX(CASE WHEN s.rn = 2 THEN s.emaildate END) emaildate_2,
MAX(CASE WHEN s.rn = 3 THEN s.email END) email_3,
MAX(CASE WHEN s.rn = 3 THEN s.emaildate END) emaildate_3
FROM
(
SELECT t.*, ROW_NUMBER() OVER(ORDER BY t.id) rn
FROM tblMain t
) m
INNER JOIN (
SELECT
tblmain_id,
email,
emaildate,
ROW_NUMBER() OVER(PARTITION BY tblmain_id ORDER BY emaildate) rn
FROM tblSub
) s
ON m.id = s.tblmain_id
AND m.rn = 1
AND s.rn <= 3
GROUP BY
m.id,
m.fname,
m.lname,
m.salary
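The rank-then-pivot idea used by both answers can be tried out on a cut-down version of the tables. The following is a sketch under assumptions: it uses SQLite via Python's sqlite3 rather than PostgreSQL, keeps only the id and fname columns of tblMain, and writes the sample e-mail addresses with an "@".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblMain (id INTEGER, fname TEXT);
    CREATE TABLE tblSub (id INTEGER, email TEXT, emaildate TEXT, tblmain_id INTEGER);
    INSERT INTO tblMain VALUES (1, 'John'), (2, 'Bob');
    INSERT INTO tblSub VALUES
        (1, 'John@Doe1.com', '2019-01-01', 1),
        (2, 'John@Doe2.com', '2019-01-02', 1),
        (3, 'John@Doe3.com', '2019-01-03', 1),
        (4, 'Bob@Ross1.com', '2019-02-01', 2),
        (5, 'Bob@Ross2.com', '2018-12-01', 2);
""")

query = """
    SELECT m.id, m.fname,
           -- each CASE keeps the e-mail of exactly one rank; MAX collapses the group
           MAX(CASE WHEN s.rn = 1 THEN s.email END) AS email_1,
           MAX(CASE WHEN s.rn = 2 THEN s.email END) AS email_2,
           MAX(CASE WHEN s.rn = 3 THEN s.email END) AS email_3
    FROM tblMain AS m
    LEFT JOIN (
        SELECT tblmain_id, email,
               -- number the e-mails of each main row from oldest to newest
               ROW_NUMBER() OVER (PARTITION BY tblmain_id ORDER BY emaildate) AS rn
        FROM tblSub
    ) AS s ON s.tblmain_id = m.id
    GROUP BY m.id, m.fname
    ORDER BY m.id
"""
pivoted = conn.execute(query).fetchall()
for row in pivoted:
    print(row)
```

Bob has only two e-mails, so his email_3 comes back as NULL (None in Python), and his older address Bob@Ross2.com lands in email_1 because the ranking is by date, not by id.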

Values Disappear when Filtering Correlated Subquery

This question is related to the recent answer I provided here.
Setup
Using MS Access 2007.
Assume I have a table called mytable consisting of three fields:
id Long Integer AutoNumber (PK)
type Text
num Long Integer
With the following sample data:
+----+------+-----+
| id | type | num |
+----+------+-----+
| 1 | A | 10 |
| 2 | A | 20 |
| 3 | A | 30 |
| 4 | B | 40 |
| 5 | B | 50 |
| 6 | B | 60 |
| 7 | C | 70 |
| 8 | C | 80 |
| 9 | C | 90 |
| 10 | D | 100 |
+----+------+-----+
Similar to the linked answer, say I wish to output the three fields together with a running total for each type value, keeping only the rows whose running total is below 100. I might use a correlated subquery such as the following:
select q.* from
(
select t.id, t.type, t.num,
(
select sum(u.num)
from mytable u where u.type = t.type and u.id <= t.id
) as rt
from mytable t
) q
where q.rt < 100
This produces the expected result:
+----+------+-----+----+
| id | type | num | rt |
+----+------+-----+----+
| 1 | A | 10 | 10 |
| 2 | A | 20 | 30 |
| 3 | A | 30 | 60 |
| 4 | B | 40 | 40 |
| 5 | B | 50 | 90 |
| 7 | C | 70 | 70 |
+----+------+-----+----+
Observation
Now assume that I wish to filter the result to show only those values for type like "[AB]".
If I use either of the following queries:
select q.* from
(
select t.id, t.type, t.num,
(
select sum(u.num)
from mytable u where u.type = t.type and u.id <= t.id
) as rt
from mytable t
where t.type like "[AB]"
) q
where q.rt < 100;
select q.* from
(
select t.id, t.type, t.num,
(
select sum(u.num)
from mytable u where u.type = t.type and u.id <= t.id
) as rt
from mytable t
) q
where q.rt < 100 and q.type like "[AB]"
The results are filtered as expected, but the values in the rt (running total) column disappear:
+----+------+-----+----+
| id | type | num | rt |
+----+------+-----+----+
| 1 | A | 10 | |
| 2 | A | 20 | |
| 3 | A | 30 | |
| 4 | B | 40 | |
| 5 | B | 50 | |
+----+------+-----+----+
Question
Why would the filter cause the values returned by the correlated subquery to disappear?
Thank you for your time reading my question and in advance for any advice you can offer.
Moving the type criteria into the aggregate subquery works.
Using one less level of nesting also works, but the aggregate subquery then has to be repeated in the WHERE clause:
SELECT mytable.*, (select sum(u.num)
from mytable u where u.type = MyTable.type and u.id <= MyTable.id
) AS rt
FROM mytable
WHERE ((((select sum(u.num)
from mytable u where u.type = MyTable.type and u.id <= MyTable.id
))<100) AND ((mytable.[type]) Like "[AB]"));
An INNER JOIN version:
select MyTable.*, q.* from MyTable INNER JOIN
(
select t.id, t.type, t.num,
(
select sum(u.num)
from mytable u where u.type = t.type and u.id <= t.id
) as rt
from mytable t
) q
ON q.id=MyTable.ID
where q.rt < 100 AND MyTable.Type LIKE "[AB]";
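For reference, the result the filtered query is meant to produce can be reproduced outside Access. This sketch runs the same correlated subquery in SQLite via Python's sqlite3; it does not reproduce the Access behaviour described in the question, and it replaces the Access pattern like "[AB]" with IN ('A', 'B') as an assumed equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, type TEXT, num INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (1, "A", 10), (2, "A", 20), (3, "A", 30),
    (4, "B", 40), (5, "B", 50), (6, "B", 60),
    (7, "C", 70), (8, "C", 80), (9, "C", 90),
    (10, "D", 100),
])

query = """
    SELECT q.id, q.type, q.num, q.rt
    FROM (
        SELECT t.id, t.type, t.num,
               -- running total: all rows of the same type up to and including this id
               (SELECT SUM(u.num)
                FROM mytable AS u
                WHERE u.type = t.type AND u.id <= t.id) AS rt
        FROM mytable AS t
    ) AS q
    WHERE q.rt < 100 AND q.type IN ('A', 'B')
    ORDER BY q.id
"""
running = conn.execute(query).fetchall()
for row in running:
    print(row)
```

The rt column is populated for every row here, which is the output the question expected from Access as well.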

How to roll up based on a few criteria in SQL

I have a data table like this:
QuestionID UserName UserWeightingForQuestion AnswerGivenForQuestion Metric
1 A 1.50 1 ToBeCalculated
1 B 1.00 2 ToBeCalculated
1 C 1.80 3 ToBeCalculated
1 D 1.20 1 ToBeCalculated
1 E 1.40 2 ToBeCalculated
2 A 1.20 2 ToBeCalculated
2 B 1.20 2 ToBeCalculated
2 C 1.10 4 ToBeCalculated
2 D 1.20 5 ToBeCalculated
...
For each question group, I'd like to fill each cell under Metric column with a calculated value defined as shown below:
Metric_For_User_A_For_QuestionID_X = SUM(Weights_With_The_Answer_Similar_To_What_Is_Given_By_User_A_In_QuestionID_Group = X) / SUM(DISTINCT(All_Weights_In_One_QuestionID_Group = X))
Specifically speaking,
Metric_For_User_A_For_QuestionID_1 = SUM(1.50+1.20)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_B_For_QuestionID_1 = SUM(1.00+1.40)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_C_For_QuestionID_1 = SUM(1.80)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_D_For_QuestionID_1 = SUM(1.50+1.20)/(1.50+1.00+1.80+1.20+1.40)
Metric_For_User_E_For_QuestionID_1 = SUM(1.00+1.40)/(1.50+1.00+1.80+1.20+1.40)
For QuestionID group = 2, I'd like to repeat the process as above. For example,
Metric_For_User_A_For_QuestionID_2 = SUM(1.20+1.20)/(1.20+1.10)
I'm fairly new to SQL, and I believe OVER or some sort of aggregate function could be used to achieve this. If this kind of calculation is possible in SQL, could someone suggest a way to do it?
The raw table has ~70m rows, and I am using SQL Server. Thank you very much in advance for your suggestions and answers!
You can use the SUM window function to do this.
select t.*,
sum(UserWeightingForQuestion) over(partition by questionID,AnswerGivenForQuestion)
/sum(UserWeightingForQuestion) over(partition by questionID) as metric
from tablename t
sum(UserWeightingForQuestion) over(partition by questionID) gets the sum of all UserWeightingForQuestion values per questionID.
sum(UserWeightingForQuestion) over(partition by questionID,AnswerGivenForQuestion) sums the UserWeightingForQuestion values of the rows that share the same answer within each questionID.
Edit: To sum up the distinct weights for each questionID in the denominator, use
select t.*,
sum(UserWeightingForQuestion) over(partition by questionID,AnswerGivenForQuestion)
/(select sum(distinct UserWeightingForQuestion) from tablename where t.questionID=questionID) as metric
from tablename t
declare @quest table(QuestionID int
, UserName varchar(20)
, UserWeightingForQuestion decimal(10,2)
, AnswerGivenForQuestion int);
insert into @quest values
(1,'A',1.50,1),(1,'B',1.00,2),(1,'C',1.80,3),(1,'D',1.20,1),
(1,'E',1.40,2),(2,'A',1.20,2),(2,'B',1.20,2),(2,'C',1.10,4),(2,'D',1.20,5);
Basically, you need two aggregates: a window sum partitioned by QuestionID and AnswerGivenForQuestion, and a sum of the distinct weights per QuestionID.
WITH CALC AS
(
SELECT Q2.QuestionID, Q2.UserName,
SUM(UserWeightingForQuestion) OVER (PARTITION BY QuestionID, AnswerGivenForQuestion) AS Weight,
(SELECT SUM(DISTINCT Q1.UserWeightingForQuestion)
FROM @quest Q1
WHERE Q1.QuestionID = Q2.QuestionID) AS AllWeights
FROM @quest Q2
)
SELECT QuestionID, UserName, Weight, AllWeights,
CAST(Weight / AllWeights AS DECIMAL(18,2)) as Metric
FROM CALC
ORDER BY QuestionID, UserName;
+------------+----------+--------+------------+--------+
| QuestionID | UserName | Weight | AllWeights | Metric |
+------------+----------+--------+------------+--------+
| 1 | A | 2,70 | 6,90 | 0,39 |
| 1 | B | 2,40 | 6,90 | 0,35 |
| 1 | C | 1,80 | 6,90 | 0,26 |
| 1 | D | 2,70 | 6,90 | 0,39 |
| 1 | E | 2,40 | 6,90 | 0,35 |
+------------+----------+--------+------------+--------+
| 2 | A | 2,40 | 2,30 | 1,04 |
| 2 | B | 2,40 | 2,30 | 1,04 |
| 2 | C | 1,10 | 2,30 | 0,48 |
| 2 | D | 1,20 | 2,30 | 0,52 |
+------------+----------+--------+------------+--------+
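The figures in the table above can be re-derived with a short script. This is a sketch only: it uses SQLite via Python's sqlite3 instead of SQL Server, stores the weights as REAL rather than decimal(10,2), and names the table quest; the rounding to two decimals is done in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE quest (
        QuestionID INTEGER, UserName TEXT,
        UserWeightingForQuestion REAL, AnswerGivenForQuestion INTEGER)
""")
conn.executemany("INSERT INTO quest VALUES (?, ?, ?, ?)", [
    (1, "A", 1.50, 1), (1, "B", 1.00, 2), (1, "C", 1.80, 3), (1, "D", 1.20, 1),
    (1, "E", 1.40, 2), (2, "A", 1.20, 2), (2, "B", 1.20, 2), (2, "C", 1.10, 4),
    (2, "D", 1.20, 5),
])

query = """
    SELECT q.QuestionID, q.UserName,
           -- numerator: weights of everyone who gave the same answer to this question
           SUM(q.UserWeightingForQuestion)
               OVER (PARTITION BY q.QuestionID, q.AnswerGivenForQuestion)
           -- denominator: sum of the DISTINCT weights for this question
           / (SELECT SUM(DISTINCT d.UserWeightingForQuestion)
              FROM quest AS d
              WHERE d.QuestionID = q.QuestionID) AS Metric
    FROM quest AS q
    ORDER BY q.QuestionID, q.UserName
"""
metrics = {(qid, user): round(m, 2) for qid, user, m in conn.execute(query)}
print(metrics[(1, "A")], metrics[(2, "A")])
```

Because the denominator sums only distinct weights, the metric can exceed 1 when several users share a weight, as it does for users A and B in question 2.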

how to get median for every record?

There's no median function in sql server, so I'm using this wonderful suggestion:
https://stackoverflow.com/a/2026609/117700
this computes the median over an entire dataset, but I need the median per record.
My dataset is:
+-----------+-------------+
| client_id | TimesTested |
+-----------+-------------+
| 214220 | 1 |
| 215425 | 1 |
| 212839 | 4 |
| 215249 | 1 |
| 210498 | 3 |
| 110655 | 1 |
| 110655 | 1 |
| 110655 | 12 |
| 215425 | 4 |
| 100196 | 1 |
| 110032 | 1 |
| 110032 | 1 |
| 101944 | 3 |
| 101232 | 2 |
| 101232 | 1 |
+-----------+-------------+
here's the query I am using:
select client_id,
(
SELECT
(
(SELECT MAX(TimesTested ) FROM
(SELECT TOP 50 PERCENT t.TimesTested
FROM counted3 t
where t.timestested>1
and CLIENT_ID=t.CLIENT_ID
ORDER BY t.TimesTested ) AS BottomHalf)
+
(SELECT MIN(TimesTested ) FROM
(SELECT TOP 50 PERCENT t.TimesTested
FROM counted3 t
where t.timestested>1
and CLIENT_ID=t.CLIENT_ID
ORDER BY t.TimesTested DESC) AS TopHalf)
) / 2 AS Median
) TotalAvgTestFreq
from counted3
group by client_id
but it is giving me funny data:
+-----------+------------------+
| client_id | median???????????|
+-----------+------------------+
| 100007 | 84 |
| 100008 | 84 |
| 100011 | 84 |
| 100014 | 84 |
| 100026 | 84 |
| 100027 | 84 |
| 100028 | 84 |
| 100029 | 84 |
| 100042 | 84 |
| 100043 | 84 |
| 100071 | 84 |
| 100072 | 84 |
| 100074 | 84 |
+-----------+------------------+
How can I get the median for every client_id?
I am currently trying to use this awesome query from Aaron's site:
select c3.client_id,(
SELECT AVG(1.0 * TimesTested ) median
FROM
(
SELECT o.TimesTested ,
rn = ROW_NUMBER() OVER (ORDER BY o.TimesTested ), c.c
FROM counted3 AS o
CROSS JOIN (SELECT c = COUNT(*) FROM counted3) AS c
where count>1
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2)
) a
from counted3 c3
group by c3.client_id
unfortunately, as Richardthekiwi points out:
it's for a single median whereas this question is about a median
per-partition
I would like to know how I can join it on counted3 to get the median per partition.
Note: If testFreq is an int or bigint type, you need to CAST it before taking an average, otherwise you'll get integer division, e.g. (2+5)/2 => 3 if 2 and 5 are the median records - e.g. AVG(Cast(testfreq as float)).
select client_id, avg(testfreq) median_testfreq
from
(
select client_id,
testfreq,
rn=row_number() over (partition by CLIENT_ID
order by testfreq),
c=count(testfreq) over (partition by CLIENT_ID)
from tbk
where timestested>1
) g
where rn in ((c + 1)/2, (c + 2)/2)
group by client_id;
The median is either the central record when a partition has an ODD number of rows, or the average of the two central records when it has an EVEN number of rows. Because c is an integer, the divisions are integer divisions, so the condition rn in ((c + 1)/2, (c + 2)/2) picks exactly the one or two records required: for c = 5 both expressions give 3, and for c = 4 they give 2 and 3.
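A per-partition median of this kind can be checked against the sample data from the question. The sketch below makes a few assumptions: it runs in SQLite via Python's sqlite3 rather than SQL Server, it takes the median of TimesTested over all rows (without the timestested>1 filter), and it selects the middle rows with the integer-division test rn IN ((c + 1) / 2, (c + 2) / 2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counted3 (client_id INTEGER, TimesTested INTEGER)")
conn.executemany("INSERT INTO counted3 VALUES (?, ?)", [
    (214220, 1), (215425, 1), (212839, 4), (215249, 1), (210498, 3),
    (110655, 1), (110655, 1), (110655, 12), (215425, 4), (100196, 1),
    (110032, 1), (110032, 1), (101944, 3), (101232, 2), (101232, 1),
])

query = """
    SELECT client_id, AVG(1.0 * TimesTested) AS median  -- 1.0 * avoids integer averaging
    FROM (
        SELECT client_id, TimesTested,
               ROW_NUMBER() OVER (PARTITION BY client_id ORDER BY TimesTested) AS rn,
               COUNT(*) OVER (PARTITION BY client_id) AS c
        FROM counted3
    ) AS g
    -- integer division: an odd c selects one middle row, an even c selects two
    WHERE rn IN ((c + 1) / 2, (c + 2) / 2)
    GROUP BY client_id
"""
medians = dict(conn.execute(query).fetchall())
print(medians[110655], medians[215425])
```

Client 110655 has the values 1, 1 and 12, so its median is the single middle value 1; client 215425 has 1 and 4, so its median is their average, 2.5.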
try this:
select client_id,
(
SELECT
(
(SELECT MAX(testfreq) FROM
(SELECT TOP 50 PERCENT t.testfreq
FROM counted3 t
where t.timestested>1
and c3.CLIENT_ID=t.CLIENT_ID
ORDER BY t.testfreq) AS BottomHalf)
+
(SELECT MIN(testfreq) FROM
(SELECT TOP 50 PERCENT t.testfreq
FROM counted3 t
where t.timestested>1
and c3.CLIENT_ID=t.CLIENT_ID
ORDER BY t.testfreq DESC) AS TopHalf)
) / 2 AS Median
) TotalAvgTestFreq
from counted3 c3
group by client_id
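I added the c3 alias to the outer table and to the outer CLIENT_ID references. In the original query the unqualified CLIENT_ID inside the subqueries resolved to the inner table t, so the condition CLIENT_ID=t.CLIENT_ID was effectively always true and every client received the median of the whole table.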
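The effect of the missing alias can be shown with a much smaller query. This sketch is an illustration under assumptions: it uses SQLite via Python's sqlite3 (which follows the same innermost-scope-first name resolution), a three-row table, and COUNT(*) in place of the median so the difference is easy to read; the helper rows_per_client is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counted3 (client_id INTEGER, TimesTested INTEGER);
    INSERT INTO counted3 VALUES (1, 10), (1, 20), (2, 30);
""")

def rows_per_client(outer_ref):
    """Count the rows matched by the subquery for each client, given how the outer column is written."""
    query = f"""
        SELECT c3.client_id,
               (SELECT COUNT(*)
                FROM counted3 AS t
                WHERE {outer_ref} = t.client_id) AS n
        FROM counted3 AS c3
        GROUP BY c3.client_id
        ORDER BY c3.client_id
    """
    return conn.execute(query).fetchall()

# Unqualified: client_id binds to the inner table t, so the test compares t with itself.
unqualified = rows_per_client("client_id")
# Qualified: c3.client_id refers to the outer row, so the subquery is truly correlated.
qualified = rows_per_client("c3.client_id")

print(unqualified)  # [(1, 3), (2, 3)]
print(qualified)    # [(1, 2), (2, 1)]
```

With the unqualified name every client sees all three rows, which mirrors the identical value 84 reported for every client_id in the question.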