Another approach to percentiles? - sql

I have a dataset which essentially consists of a list of job batches, the number of jobs contained in each batch, and the duration of each job batch. Here is a sample dataset:
CREATE TABLE test_data
(
batch_id NUMBER,
job_count NUMBER,
duration NUMBER
);
INSERT INTO test_data VALUES (1, 37, 9);
INSERT INTO test_data VALUES (2, 47, 4);
INSERT INTO test_data VALUES (3, 66, 6);
INSERT INTO test_data VALUES (4, 46, 6);
INSERT INTO test_data VALUES (5, 54, 1);
INSERT INTO test_data VALUES (6, 35, 1);
INSERT INTO test_data VALUES (7, 55, 9);
INSERT INTO test_data VALUES (8, 82, 7);
INSERT INTO test_data VALUES (9, 12, 9);
INSERT INTO test_data VALUES (10, 52, 4);
INSERT INTO test_data VALUES (11, 3, 9);
INSERT INTO test_data VALUES (12, 90, 2);
Now, I want to calculate some percentiles for the duration field. Typically, this is done with something like the following:
SELECT
PERCENTILE_DISC( 0.75 )
WITHIN GROUP (ORDER BY duration ASC)
AS third_quartile
FROM
test_data;
(Which gives the result of 9)
My problem is that I don't want the percentiles based on batches; I want them based on individual jobs. I can work this out by hand quite easily by generating a running total of job_count:
SELECT
batch_id,
job_count,
SUM(
job_count
)
OVER (
ORDER BY duration
ROWS UNBOUNDED PRECEDING
)
AS total_jobs,
duration
FROM
test_data
ORDER BY
duration ASC;
BATCH_ID JOB_COUNT TOTAL_JOBS DURATION
6 35 35 1
5 54 89 1
12 90 179 2
2 47 226 4
10 52 278 4
3 66 344 6
4 46 390 6
8 82 472 7
9 12 484 9
1 37 521 9
11 3 524 9
7 55 579 9
Since I have 579 jobs, the 75th percentile falls at about job 434 (0.75 × 579 = 434.25). Looking at the above result set, that corresponds to a duration of 7, which differs from what the standard function returns.
Essentially, I want to treat each job in a batch as a separate observation and determine percentiles based on those, instead of on the batches.
Is there a relatively simple way to accomplish this?

I would think of this as "weighted" percentiles. I don't know if there is a built-in analytic function for this in Oracle, but it is easy enough to calculate. And you are on the way there.
The additional idea is to calculate the total number of jobs, and then use arithmetic to select the value you want. For the 75th percentile, the value is the smallest duration such that the cumulative number of jobs is at least 0.75 times the total number of jobs.
Here is the example in SQL:
select pcs.percentile, min(case when cumjobs >= totjobs * percentile then duration end)
from (SELECT batch_id, job_count,
SUM(job_count) OVER (ORDER BY duration) as cumjobs,
sum(job_count) over () as totjobs,
duration
FROM test_data
) t cross join
(select 0.25 as percentile from dual union all
select 0.5 from dual union all
select 0.75 from dual
) pcs
group by pcs.percentile;
This example gives you the percentile values (as an added bonus, for three different percentiles), with each percentile on its own row. If you want the values on each row of the original data, you need to join back to your original table.
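To check the arithmetic outside the database, here is a minimal Python sketch of the same cumulative-sum rule. The helper name `weighted_percentile` is hypothetical and only mirrors the logic described above; it is not part of the SQL answer.

```python
# Hypothetical helper that mirrors the cumulative-sum rule from the answer.
def weighted_percentile(rows, fraction):
    """Return the smallest duration whose cumulative job count reaches
    fraction * total jobs. rows is a list of (job_count, duration) pairs."""
    total_jobs = sum(job_count for job_count, _ in rows)
    threshold = total_jobs * fraction
    cumulative = 0
    # Walk the batches in duration order, accumulating jobs as we go.
    for job_count, duration in sorted(rows, key=lambda r: r[1]):
        cumulative += job_count
        if cumulative >= threshold:
            return duration
    return None  # only reached for empty input


# (job_count, duration) pairs copied from test_data
test_data = [
    (37, 9), (47, 4), (66, 6), (46, 6), (54, 1), (35, 1),
    (55, 9), (82, 7), (12, 9), (52, 4), (3, 9), (90, 2),
]

quartiles = {p: weighted_percentile(test_data, p) for p in (0.25, 0.5, 0.75)}
print(quartiles)
```

For the sample data this yields durations 2, 6 and 7 for the 25th, 50th and 75th percentiles, which agrees with the hand calculation in the question.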

OK, I think I have your answer. The idea is mine; the implementation is borrowed from this Ask Tom article.
SELECT PERCENTILE_DISC( 0.75 )
WITHIN GROUP (ORDER BY duration ASC)
AS third_quartile
FROM(
with data as
(select level l
from dual, (select max(job_count) max_jobs from test_data)
connect by level <= max_jobs
)
select *
from test_data, data
where l <= job_count
--ORDER BY duration, batch_id
) inner
;
Here is SQL Fiddle.
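The row-expansion idea can also be sketched in Python: repeat each batch's duration once per job, then take an ordinary discrete percentile. The function `percentile_disc` below is a hypothetical stand-in written for this illustration (it picks the first sorted value whose cumulative share reaches the fraction), not Oracle's implementation.

```python
import math


def percentile_disc(values, fraction):
    """Discrete percentile: the first value in sorted order whose
    1-based position divided by the count reaches the fraction."""
    ordered = sorted(values)
    position = max(1, math.ceil(fraction * len(ordered)))
    return ordered[position - 1]


# (job_count, duration) pairs copied from test_data
test_data = [
    (37, 9), (47, 4), (66, 6), (46, 6), (54, 1), (35, 1),
    (55, 9), (82, 7), (12, 9), (52, 4), (3, 9), (90, 2),
]

# One observation per batch (what the plain query sees).
per_batch = [duration for _, duration in test_data]

# One observation per job: repeat each duration job_count times.
per_job = [duration for job_count, duration in test_data
           for _ in range(job_count)]

print(percentile_disc(per_batch, 0.75), percentile_disc(per_job, 0.75))
```

On the sample data the per-batch figure is 9 and the per-job figure is 7, matching the two results discussed in the question. The expansion produces one row per job, so it costs memory in proportion to the total job count.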


SQL check if group is continuous when ordered and return broken groups

I tried to find a way to list rows, that are breaking continuous groups of records. I say groups, because we could use GROUP BY to list values of groups (but that is not applied, we need particular rows).
Sample data:
CREATE TABLE Test (ID INT, NNO INT, DIDX INT, SIDX INT);
-- Valid sample rows
INSERT INTO Test (ID, NNO, DIDX, SIDX) VALUES
( 1, 107 , 7898, 0 ),
( 2, 102 , 7883, 0 ),
( 3, 53 , 7877, 0 ),
( 4, 62 , 7877, 42 ),
( 5, 101 , 7870, 81 ),
( 6, 103 , 7918, 42 ),
( 7, 110 , 7920, 42 ),
( 8, 100 , 7919, 0 ),
( 9, 24 , 7921, 0 ),
(10, 85 , 7904, 0 ),
(11, 85 , 7905, 0 ),
(12, 85 , 7906, 0 ),
(13, 85 , 7907, 0 ),
(14, 85 , 7908, 0 ),
(15, 85 , 7911, 0 ),
(16, 112 , 7876, 0 ),
(17, 5 , 7891, 42 ),
(18, 80 , 7912, 42 ),
(19, 66 , 7912, 91 ),
(20, 22 , 7912, 81 ),
(21, 60 , 7911, 42 ),
(22, 60 , 7912, 0 ),
(23, 78 , 7891, 81 );
-- Dissecting row
INSERT INTO Test (ID, NNO, DIDX, SIDX) VALUES
(24, 666 , 7906, 120);
EDIT: I probably misled some answerers by providing an oversimplified example. It appeared that groups could be broken only by a single row. So please add another row to the example data set:
-- Dissecting row -2-
INSERT INTO Test (ID, NNO, DIDX, SIDX) VALUES
(25, 444 , 7906, 160);
Now, if we order the rows like this:
SELECT ID, NNO, DIDX, SIDX
FROM Test
ORDER BY DIDX, SIDX;
...the last inserted rows will break (dissect) the group of records that have NNO=85:
ID NNO DIDX SIDX
----------- ----------- ----------- -----------
...
10 85 7904 0
11 85 7905 0
12 85 7906 0
24 666 7906 120 <<<<<<<<<<<<<<<<<<<
25 444 7906 160 <<<<<<<<<<<< after EDIT <<<<<<<
13 85 7907 0
14 85 7908 0
15 85 7911 0
...
The result should say 85, which is the broken group, or NULL if we used healthy data without the dissecting rows (ID=24 and ID=25).
Another way to look at the problem is that, for each group (even if it contains only one row), there must be no records of another group whose start or end lies between the start and end of the queried group. In the provided example, the queried group (85) starts with DIDX=7904 and SIDX=0 and ends with DIDX=7911 and SIDX=0, and nothing may fall into that range, which is exactly what record ID=24 does.
So far I have tried approaches such as ROW_NUMBER() OVER (ORDER BY ...); WITH combined with MIN and MAX to go through each group and check whether any rows fall within it (I failed to construct this); and GROUP BY with MIN and MAX, with the aim of cross-checking against the table rows. No attempt is really worth publishing so far.
Could anyone advise how I could check the continuity of groups defined this way?
WITH CTE AS (
SELECT
ID
,NNO
,DIDX
,SIDX
,LAG(NNO) OVER (ORDER BY DIDX, SIDX) as previousNNO
,LEAD(NNO) OVER (ORDER BY DIDX, SIDX) as nextNNO
FROM Test
)
SELECT
previousNNO as BrokenGroup
FROM CTE
WHERE previousNNO=nextNNO
AND NNO<>previousNNO
I used a CTE and window functions to keep track of the previous and next group (NNO) for each row. A broken group is one whose previous and next rows share the same NNO while the current row has a different one, as in your example with ID 24:
ID NNO DIDX SIDX
----------- ----------- ----------- -----------
...
12 85 < previous group 7906 0
24 666 < current group 7906 120 <<<<<<<<<<<<<<<<<<<
13 85 < next group 7907 0
...
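The previous/next comparison can be sketched in Python on a subset of the sample rows. The function name `broken_groups` and the choice of subset are assumptions made for this illustration.

```python
def broken_groups(rows):
    """rows: list of (id, nno, didx, sidx) tuples.
    Returns the NNO of every group interrupted by exactly one foreign row,
    i.e. where the previous and next rows share an NNO that the current
    row does not have."""
    ordered = sorted(rows, key=lambda r: (r[2], r[3]))
    found = []
    # Slide a window of three consecutive rows over the ordered data.
    for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
        if prev[1] == nxt[1] and cur[1] != prev[1]:
            found.append(prev[1])
    return found


# Subset of the question's data: the NNO=85 group and the two extra rows.
group_85 = [
    (10, 85, 7904, 0), (11, 85, 7905, 0), (12, 85, 7906, 0),
    (13, 85, 7907, 0), (14, 85, 7908, 0), (15, 85, 7911, 0),
]
row_24 = (24, 666, 7906, 120)
row_25 = (25, 444, 7906, 160)

print(broken_groups(group_85 + [row_24]))
print(broken_groups(group_85 + [row_24, row_25]))
```

With only row 24 present the sketch reports group 85. With both rows 24 and 25 present it reports nothing, because neither foreign row has the same NNO on both sides. That is the case the question's EDIT describes, so this comparison only covers single-row interruptions.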
I'm assuming that any consecutive range of DIDX should only have one NNO. As such there will be no two valid groups that abut each other.
This should help identify the offenders:
with data as (
select NNO, DIDX, dense_rank() over (order by DIDX) as rn
from Test
)
select min(DIDX) as range_start, max(DIDX) as range_end
from data
group by DIDX - rn
having count(distinct NNO) > 1;
Getting the actual rows:
with data as (
select ID, NNO, DIDX, dense_rank() over (order by DIDX) as rn
from Test
), groups as (
select DIDX - rn as grp, min(DIDX) as range_start, max(DIDX) as range_end
from data
group by DIDX - rn
having count(distinct NNO) > 1
), data2 as (
select *, lead(NNO) over (partition by grp order by DIDX) as next_NNO
from Test t inner join groups g
on t.DIDX between g.range_start and g.range_end
)
select * from data2 where NNO <> next_NNO;
If you're looking for a test to run prior to inserting a row:
with data as (
select NNO, DIDX, row_number() over (order by DIDX) as rn
from Test
)
select case when min(DIDX) is not null then 'Fail' else 'Pass' end as InsertTest
from data
group by DIDX - rn
having @proposed_DIDX between min(DIDX) and max(DIDX)
and @proposed_NNO <> min(NNO);
OK, so inspired by the given answers (which I upvoted for helping), I came up with this code, which seems to provide the desired result. I'm not sure, though, whether it's the cleanest and shortest possible way.
;WITH CTE AS (
SELECT NNO,
DMIN,
DMAX,
SMIN,
SMAX,
LEAD(DMIN) OVER (ORDER BY DMIN, SMIN) as nextDMIN,
LAG(DMAX) OVER (ORDER BY DMIN, SMIN) as prevDMAX,
LAG(SMAX) OVER (ORDER BY DMIN, SMIN) as prevSMAX,
LEAD(SMIN) OVER (ORDER BY DMIN, SMIN) as nextSMIN,
CNT
FROM (
SELECT NNO,
MIN(DIDX) as DMIN,
MAX(DIDX) as DMAX,
MIN(SIDX) as SMIN,
MAX(SIDX) as SMAX,
COUNT(NNO) as CNT
FROM Test
GROUP BY NNO
) as SRC
)
SELECT *
FROM CTE
WHERE ((prevDMAX > DMIN OR (prevDMAX = DMIN AND prevSMAX > SMIN)) OR
(nextDMIN < DMAX OR (nextDMIN = DMAX AND nextSMIN < SMAX)))
AND CNT > 1
Perhaps I should give a little explanation. The code finds the MIN and MAX border values of DIDX and SIDX for each group, and then fetches those values for the previous and next groups. We also COUNT the number of rows in each group.
The conditions in brackets basically say that no record of another group can be within the DMIN to DMAX range, and if it is on the border of the range, it has to be outside SMIN or SMAX. Finally, a broken group must have more than one row; otherwise the query would return not only the offended group but also the first offender.
I should say that this code has a small flaw: it fails when an offender with more than one row sits intact between the rows of the offended group. I should be able to tackle this in post-processing, where I have to shuffle rows to achieve intact groups.
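As a cross-check for results like these, a different technique (not the min/max comparison used above) is to count how many separate runs each NNO forms in the ordered data; a group with more than one run is broken, however many foreign rows interrupt it. A minimal Python sketch, with the hypothetical name `groups_with_multiple_runs` and a subset of the sample rows:

```python
from itertools import groupby


def groups_with_multiple_runs(rows):
    """rows: list of (id, nno, didx, sidx) tuples.
    Returns the sorted NNO values that appear in more than one
    consecutive run when rows are ordered by (DIDX, SIDX)."""
    ordered = sorted(rows, key=lambda r: (r[2], r[3]))
    run_counts = {}
    # groupby yields one entry per consecutive run of equal NNO values.
    for nno, _ in groupby(ordered, key=lambda r: r[1]):
        run_counts[nno] = run_counts.get(nno, 0) + 1
    return sorted(nno for nno, count in run_counts.items() if count > 1)


# Subset of the question's data: groups 85 and 60 plus both extra rows.
healthy = [
    (10, 85, 7904, 0), (11, 85, 7905, 0), (12, 85, 7906, 0),
    (13, 85, 7907, 0), (14, 85, 7908, 0), (15, 85, 7911, 0),
    (21, 60, 7911, 42), (22, 60, 7912, 0),
]
dissecting = [(24, 666, 7906, 120), (25, 444, 7906, 160)]

print(groups_with_multiple_runs(healthy))
print(groups_with_multiple_runs(healthy + dissecting))
```

On this subset the healthy rows give an empty list, and adding rows 24 and 25 gives `[85]`. Only the interrupted group is reported, since each offender forms a single run of its own.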

T-SQL Select to compute a result row on preceding group/condition

How can I achieve this result using a T-SQL select query?
Given this sample table:
create table sample (a int, b int)
insert into sample values (999, 10)
insert into sample values (16, 11)
insert into sample values (10, 12)
insert into sample values (25, 13)
insert into sample values (999, 20)
insert into sample values (14, 12)
insert into sample values (90, 45)
insert into sample values (18, 34)
I'm trying to achieve this output:
a b result
----------- ----------- -----------
999 10 10
16 11 10
10 12 10
25 13 10
999 20 20
14 12 20
90 45 20
18 34 20
The rule is fairly simple: if column 'a' has the special value of 999 the result for that row and following rows (unless the value of 'a' is again 999) will be the value of column 'b'. Assume the first record will have 999 on column 'a'.
Any hint how to implement, if possible, the select query without using a stored procedure or function?
Thank you.
António
You can do what you want if you add a column to specify the ordering:
create table sample (
id int identity(1, 1),
a int,
b int
);
Then you can do what you want by finding the "999" version that is most recent and copying that value. Here is a method using window functions:
select a, b, max(case when a = 999 then b end) over (partition by id_999) as result
from (select s.*,
max(case when a = 999 then id end) over (order by id) as id_999
from sample s
) s;
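The "most recent 999" rule can be sketched procedurally in Python. The function name `carry_forward` is hypothetical, and the list order stands in for the id column that the SQL version needs.

```python
def carry_forward(rows, marker=999):
    """rows: list of (a, b) pairs in insertion (id) order.
    Each output row carries the b value of the latest row whose a
    equals the marker."""
    result = []
    current = None  # stays None until the first marker row is seen
    for a, b in rows:
        if a == marker:
            current = b
        result.append((a, b, current))
    return result


# Rows from the question, in the order they were inserted.
sample = [
    (999, 10), (16, 11), (10, 12), (25, 13),
    (999, 20), (14, 12), (90, 45), (18, 34),
]

for row in carry_forward(sample):
    print(row)
```

The third element of each tuple is 10 for the first four rows and 20 for the last four, which matches the desired output in the question.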
You need to have an id column:
select cn.id, cn.a
, (select top (1) b from sample where sample.id <= cn.id and a = 999 order by id desc) as result
from sample as cn
order by id

Select rows matching the pattern: greater_than, less_than, greater_than

I have a database with entries indicating units earned by staff. I am trying to find a query that selects entries where a staff member's units_earned follow this pattern: >30, then <30, then >30.
In this SQL Fiddle, I would expect the query to return:
For John, Rows:
2, 4, 6
9, 10, 11
For Jane, Rows:
3, 5, 8
12, 13, 14
Here is the relevant SQL:
CREATE TABLE staff_units(
id integer,
staff_number integer,
first_name varchar(50),
month_name varchar(3),
units_earned integer,
PRIMARY KEY(id)
);
INSERT INTO staff_units VALUES (1, 101, 'john', 'jan', 32);
INSERT INTO staff_units VALUES (2, 101, 'john', 'jan', 33);
INSERT INTO staff_units VALUES (3, 102, 'jane', 'jan', 39);
INSERT INTO staff_units VALUES (4, 101, 'john', 'feb', 28);
INSERT INTO staff_units VALUES (5, 102, 'jane', 'feb', 28);
INSERT INTO staff_units VALUES (6, 101, 'john', 'mar', 39);
INSERT INTO staff_units VALUES (7, 101, 'john', 'mar', 34);
INSERT INTO staff_units VALUES (8, 102, 'jane', 'mar', 40);
INSERT INTO staff_units VALUES (9, 101, 'john', 'mar', 36);
INSERT INTO staff_units VALUES (10, 101, 'john', 'apr', 18);
INSERT INTO staff_units VALUES (11, 101, 'john', 'may', 32);
INSERT INTO staff_units VALUES (12, 102, 'jane', 'jun', 31);
INSERT INTO staff_units VALUES (13, 102, 'jane', 'jun', 28);
INSERT INTO staff_units VALUES (14, 102, 'jane', 'jun', 32);
Using the window function lead, you can refer to the next two records after the current one and then compare the three against your desired pattern.
with staff_units_with_leading as (
select id, staff_number, first_name, units_earned,
lead(units_earned) over w units_earned_off1, -- units_earned from record with offset 1
lead(units_earned, 2) over w units_earned_off2, -- units_earned from record with offset 2
lead(id) over w id_off1, -- id from record with offset 1
lead(id, 2) over w id_off2 -- id from record with offset 2
from staff_units
window w as (partition by first_name order by id)
)
, ids_wanted as (
select unnest(array[id, id_off1, id_off2]) id --
from staff_units_with_leading
where
id_off1 is not null -- Discard records with no two leading records
and id_off2 is not null -- Discard records with no two leading records
and units_earned > 30 -- Match desired pattern
and units_earned_off1 < 30 -- Match desired pattern
and units_earned_off2 > 30 -- Match desired pattern
)
select * from staff_units
where id in (select id from ids_wanted)
order by staff_number, id;
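The same sliding comparison can be sketched in Python to confirm which rows should come back. The function `matching_triples` is a hypothetical name, and the rows are reduced to (id, staff_number, units_earned).

```python
from collections import defaultdict


def matching_triples(rows, limit=30):
    """rows: list of (id, staff_number, units_earned) tuples.
    Returns (staff_number, [id1, id2, id3]) for every three consecutive
    entries of one staff member that go above, below, above the limit."""
    by_staff = defaultdict(list)
    for row_id, staff_number, units in sorted(rows):
        by_staff[staff_number].append((row_id, units))
    matches = []
    for staff_number in sorted(by_staff):
        entries = by_staff[staff_number]
        # Equivalent of lead(..., 1) and lead(..., 2) within the partition.
        for first, second, third in zip(entries, entries[1:], entries[2:]):
            if first[1] > limit and second[1] < limit and third[1] > limit:
                matches.append((staff_number, [first[0], second[0], third[0]]))
    return matches


# (id, staff_number, units_earned) copied from staff_units
staff_units = [
    (1, 101, 32), (2, 101, 33), (3, 102, 39), (4, 101, 28), (5, 102, 28),
    (6, 101, 39), (7, 101, 34), (8, 102, 40), (9, 101, 36), (10, 101, 18),
    (11, 101, 32), (12, 102, 31), (13, 102, 28), (14, 102, 32),
]

for match in matching_triples(staff_units):
    print(match)
```

For the sample data this gives ids 2, 4, 6 and 9, 10, 11 for staff 101, and 3, 5, 8 and 12, 13, 14 for staff 102, as the question expects.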
To generate trigrams, just get rid of the unnest:
with staff_units_with_leading as (
select id, staff_number, first_name, units_earned,
lead(units_earned) over w units_earned_off1, -- units_earned from record with offset 1
lead(units_earned, 2) over w units_earned_off2, -- units_earned from record with offset 2
lead(id) over w id_off1, -- id from record with offset 1
lead(id, 2) over w id_off2 -- id from record with offset 2
from staff_units
window w as (partition by first_name order by id)
)
select staff_number, array[id, id_off1, id_off2] id, array[units_earned , units_earned_off1 , units_earned_off2 ] units_earned --
from staff_units_with_leading
where
id_off1 is not null -- Discard records with no two leading records
and id_off2 is not null -- Discard records with no two leading records
and units_earned > 30 -- Match desired pattern
and units_earned_off1 < 30 -- Match desired pattern
and units_earned_off2 > 30 -- Match desired pattern
I took cachique's answer (with the excellent idea to use lead()) and reformatted and extended it to generate 3-grams as you originally wanted:
with staff_units_with_leading as (
select
id, staff_number, first_name, units_earned,
lead(units_earned) over w units_earned_off1, -- units_earned from record with offset 1
lead(units_earned, 2) over w units_earned_off2, -- units_earned from record with offset 2
lead(id) over w id_off1, -- id from record with offset 1
lead(id, 2) over w id_off2 -- id from record with offset 2
from staff_units
window w as (partition by staff_number order by id)
), ids_wanted as (
select
id_off1, -- keep this to group 3-grams later
unnest(array[id, id_off1, id_off2]) id
from staff_units_with_leading
where
id_off1 is not null -- Discard records with no two leading records
and id_off2 is not null -- Discard records with no two leading records
and units_earned > 30 -- Match desired pattern
and units_earned_off1 < 30 -- Match desired pattern
and units_earned_off2 > 30 -- Match desired pattern
), res as (
select su.*, iw.id_off1
from staff_units su
join ids_wanted iw on su.id = iw.id
order by su.staff_number, su.id
)
select
staff_number,
array_agg(units_earned order by id) as values,
array_agg(id order by id) as ids
from res
group by staff_number, id_off1
order by 1
;
The result will be:
staff_number | values | ids
--------------+------------+------------
101 | {33,28,39} | {2,4,6}
101 | {36,18,32} | {9,10,11}
102 | {39,28,40} | {3,5,8}
102 | {31,28,32} | {12,13,14}
(4 rows)
The problem you're trying to solve is a bit complicated. It is probably easier to solve if you use PL/pgSQL and work with integer arrays inside a PL/pgSQL function, or perhaps with JSON/JSONB.
It can also be solved in plain SQL, although such SQL is fairly advanced.
with rows_numbered as (
select
*, row_number() over (partition by staff_number order by id) as row_num
from staff_units
order by staff_number
), sequences (staff_number, seq) as (
select
staff_number,
json_agg(json_build_object('row_num', row_num, 'id', id, 'units_earned', units_earned) order by id)
from rows_numbered
group by 1
)
select
s1.staff_number,
(s1.chunk->>'id')::int as id1,
(s2.chunk->>'id')::int as id2,
(s3.chunk->>'id')::int as id3
from (select staff_number, json_array_elements(seq) as chunk from sequences) as s1
, lateral (
select *
from (select staff_number, json_array_elements(seq) as chunk from sequences) _
where
(s1.chunk->>'row_num')::int + 1 = (_.chunk->>'row_num')::int
and (_.chunk->>'units_earned')::int < 30
and s1.staff_number = _.staff_number
) as s2
, lateral (
select *
from (select staff_number, json_array_elements(seq) as chunk from sequences) _
where
(s2.chunk->>'row_num')::int + 1 = (_.chunk->>'row_num')::int
and (_.chunk->>'units_earned')::int > 30
and s2.staff_number = _.staff_number
) as s3
where (s1.chunk->>'units_earned')::int > 30
order by 1, 2;
I used several advanced SQL features:
CTE
JSON
LATERAL
window functions.

How to compare values with lookup table

I have a table "Marks" which contains marks for different subjects. If a mark fits into a particular range, I should pick up the respective rank and store it in the Marks table itself, in column 'rank_sub_1'. Could you please help me with how to look it up in the lookup table and fill in the column? Below is my table structure.
**Marks**
Subject1_Marks Subject2_Marks
------------------------------
71 22
10 40
**LookupTable**
Rank range1 range2
----------------------
9 10 20
8 21 30
7 31 40
6 41 50
5 51 60
4 61 70
3 71 80
2 81 90
1 91 100
Now I want to check marks of each subject with lookup table which contains the ranges and ranks for different marks obtained.
**Marks**
Subject1_Marks Subject2_Marks Rank_Sub_1 Rank_Sub_2
------------------------------------------------------
71 22
10 40
(Considering there is no overlapping in range values)
Take two instances of the lookup table and join the first with subject1_marks and the second with subject2_marks. I haven't used LEFT JOINs here because I am assuming each subject mark will certainly fall into exactly one range. If you are not sure about that, use left joins and handle NULL values in RANK_SUB_1 and RANK_SUB_2 as your requirements dictate.
WITH LOOKUPTABLE_TMP AS (SELECT * FROM LOOKUPTABLE)
SELECT M.*, L1.RANK AS RANK_SUB_1, L2.RANK AS RANK_SUB_2
FROM MARKS M , LOOKUPTABLE_TMP L1, LOOKUPTABLE_TMP L2
WHERE M.SUBJECT1_MARKS BETWEEN L1.RANGE1 AND L1.RANGE2
AND M.SUBJECT2_MARKS BETWEEN L2.RANGE1 AND L2.RANGE2
Then MERGE the data into table MARKS.
Solution:
MERGE INTO MARKS MS
USING
(
SELECT M.SUBJECT1_MARKS, M.SUBJECT2_MARKS, L1.RNK AS RANK_SUB_1, L2.RNK AS RANK_SUB_2
FROM MARKS M , LOOKUPTABLE L1, LOOKUPTABLE L2
WHERE M.SUBJECT1_MARKS BETWEEN L1.RANGE1 AND L1.RANGE2
AND M.SUBJECT2_MARKS BETWEEN L2.RANGE1 AND L2.RANGE2
GROUP BY M.SUBJECT1_MARKS, M.SUBJECT2_MARKS, L1.RNK, L2.RNK
) SUB
ON (MS.SUBJECT1_MARKS=SUB.SUBJECT1_MARKS AND MS.SUBJECT2_MARKS =SUB.SUBJECT2_MARKS)
WHEN MATCHED THEN UPDATE
SET MS.RANK_SUB_1=SUB.RANK_SUB_1, MS.RANK_SUB_2=SUB.RANK_SUB_2;
Tested on below schema and data as per your question's details.
CREATE TABLE MARKS (SUBJECT1_MARKS NUMBER, SUBJECT2_MARKS NUMBER , RANK_SUB_1 NUMBER, RANK_SUB_2 NUMBER);
INSERT INTO MARKS (SUBJECT1_MARKS , SUBJECT2_MARKS ) VALUES (71, 22);
INSERT INTO MARKS (SUBJECT1_MARKS , SUBJECT2_MARKS ) VALUES (10, 40);
CREATE TABLE LOOKUPTABLE (RNK NUMBER, RANGE1 NUMBER , RANGE2 NUMBER);
INSERT INTO LOOKUPTABLE VALUES (9, 10, 20);
INSERT INTO LOOKUPTABLE VALUES (8, 21, 30);
INSERT INTO LOOKUPTABLE VALUES (7, 31, 40);
INSERT INTO LOOKUPTABLE VALUES (6, 41, 50);
INSERT INTO LOOKUPTABLE VALUES (5, 51, 60);
INSERT INTO LOOKUPTABLE VALUES (4, 61, 70);
INSERT INTO LOOKUPTABLE VALUES (3, 71, 80);
INSERT INTO LOOKUPTABLE VALUES (2, 81, 90);
INSERT INTO LOOKUPTABLE VALUES (1, 91, 100);
Thanks!!
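To see what the range join should produce, here is a small Python sketch of the same lookup. The function `rank_by_range` is a hypothetical name; returning None for a mark outside every range corresponds to the NULL a LEFT JOIN would give.

```python
# (rank, range1, range2) rows copied from the lookup table
lookup_table = [
    (9, 10, 20), (8, 21, 30), (7, 31, 40), (6, 41, 50), (5, 51, 60),
    (4, 61, 70), (3, 71, 80), (2, 81, 90), (1, 91, 100),
]


def rank_by_range(mark, lookup):
    """Return the rank whose inclusive [range1, range2] contains mark,
    or None when no range matches."""
    for rank, low, high in lookup:
        if low <= mark <= high:
            return rank
    return None


# (subject1_marks, subject2_marks) rows copied from the Marks table
marks = [(71, 22), (10, 40)]

ranked = [
    (m1, m2, rank_by_range(m1, lookup_table), rank_by_range(m2, lookup_table))
    for m1, m2 in marks
]
print(ranked)
```

For the two sample rows the ranks come out as (3, 8) and (9, 7).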
I think this update statement should do what you want:
UPDATE Marks m
SET Rank_Sub_1 = (SELECT l.Rank
FROM LookupTable l
WHERE m.Subject1_Marks BETWEEN l.range1 AND l.range2)
WHERE EXISTS (
SELECT 1
FROM LookupTable l
WHERE m.Subject1_Marks BETWEEN l.range1 AND l.range2
);
Sample SQL Fiddle
If you want to update the value for Rank_Sub_2 at the same time, you can do this:
UPDATE Marks m
SET Rank_Sub_1 = (SELECT l.Rank
FROM LookupTable l
WHERE m.Subject1_Marks BETWEEN l.range1 AND l.range2)
,Rank_Sub_2 = (SELECT l.Rank
FROM LookupTable l
WHERE m.Subject2_Marks BETWEEN l.range1 AND l.range2)
Sample SQL Fiddle
Consider the design below which eliminates the possibility of overlaps or gaps. Although I usually use this technique with dates, any data that defines an unbroken sequence will work the same way. The idea is that you only define where the range starts. It is understood that the range stops at the last value possible less than the next higher range. However, notice I added a tenth rank, in case values less than 10 are possible. Any values greater than 100 will, of course, show as rank 1.
with
Lookup( Rank, Cutoff )as(
select 1, 91 union all
select 2, 81 union all
select 3, 71 union all
select 4, 61 union all
select 5, 51 union all
select 6, 41 union all
select 7, 31 union all
select 8, 21 union all
select 9, 10 union all
select 10, 0
),
Marks( Mark1, Mark2 )as(
select 71, 22 union all
select 10, 40 union all
select 21, 101
)
select Mark1, l1.Rank as Rank1, Mark2, l2.Rank as Rank2
from Marks m
join Lookup l1
on l1.Cutoff =(
select Max( Cutoff )
from Lookup
where Cutoff <= m.Mark1 )
join Lookup l2
on l2.Cutoff =(
select Max( Cutoff )
from Lookup
where Cutoff <= m.Mark2 );
The output:
Mark1 Rank1 Mark2 Rank2
----------- ----------- ----------- -----------
71 3 22 8
10 9 40 7
21 8 101 1
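The cutoff design maps naturally onto a sorted list and a binary search. The following Python sketch assumes the same ten cutoffs as the query above; the name `rank_by_cutoff` is hypothetical.

```python
from bisect import bisect_right

# Cutoffs in ascending order, with the rank that starts at each cutoff.
cutoffs = [0, 10, 21, 31, 41, 51, 61, 71, 81, 91]
ranks = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]


def rank_by_cutoff(mark):
    """Return the rank of the largest cutoff that is <= mark,
    or None when mark is below the lowest cutoff."""
    index = bisect_right(cutoffs, mark) - 1
    return ranks[index] if index >= 0 else None


marks = [(71, 22), (10, 40), (21, 101)]
for mark1, mark2 in marks:
    print(mark1, rank_by_cutoff(mark1), mark2, rank_by_cutoff(mark2))
```

This reproduces the output table above, including rank 1 for the out-of-range mark 101. Because only the start of each range is stored, gaps and overlaps cannot occur.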

SQL Query .. a little help with AVG and MEDIAN using DISTINCT and SUM

I have a query to get the total duration of phone usage for various users.
But I need to be able to work out distinct averages for their usage. The problem is that certain users share phones and I can only grab phone info, so the call duration is repeated, which would skew the data.
So I need an average with a distinct (on the pin.Number field). It would also be useful to get a median, if that is possible.
This is the current query:
SELECT TOP 40 SUM(Duration) AS TotalDuration, c.Caller, oin.Name, oin.Email, pin.Number, oin.PRN
FROM Calls as c
INNER JOIN Phones as pin On c.caller = pin.id
INNER JOIN officers as oin On pin.id = oin.fk_phones
WHERE Duration <> 0 AND Placed BETWEEN '01/07/2011 00:00:00' AND '20/08/2011 23:59:59'
GROUP BY c.Caller, oin.Name, pin.Number, oin.Email, oin.PRN
ORDER BY TotalDuration DESC
Many thanks for any pointers
Here's an example of the current data I am after (but I have added the averages below which is what I am after), as you can see some users share the same phone but the number of seconds is shared between them so don't want that to influence the average (I don't want 11113 seconds repeated), so there needs to be a distinct on each phone number..
Here's a solution that implements the following idea:
Get totals per phone (SUM(Duration)).
Rank the resulting set by the total duration values (ROW_NUMBER() OVER (ORDER BY SUM(Duration))).
Include one more column for the total number of rows (COUNT(*) OVER ()).
From the resulting set, get the average (AVG(TotalDuration)).
Get the median as the average between two values whose rankings are
1) N div 2 + 1,
2) N div 2 + N mod 2,
where N is the number of items, div is the integer division operator, and mod is the modulo operator.
My testing table:
DECLARE @Calls TABLE (Caller int, Duration int);
INSERT INTO @Calls (Caller, Duration)
SELECT 3, 123 UNION ALL
SELECT 1, 23 UNION ALL
SELECT 2, 15 UNION ALL
SELECT 1, 943 UNION ALL
SELECT 3, 326 UNION ALL
SELECT 3, 74 UNION ALL
SELECT 9, 49 UNION ALL
SELECT 5, 66 UNION ALL
SELECT 4, 56 UNION ALL
SELECT 4, 208 UNION ALL
SELECT 4, 112 UNION ALL
SELECT 5, 521 UNION ALL
SELECT 6, 197 UNION ALL
SELECT 8, 23 UNION ALL
SELECT 7, 22 UNION ALL
SELECT 1, 24 UNION ALL
SELECT 0, 45;
The query:
WITH totals AS (
SELECT
Caller,
TotalDuration = SUM(Duration),
rn = ROW_NUMBER() OVER (ORDER BY SUM(Duration)),
N = COUNT(*) OVER ()
FROM @Calls
GROUP BY Caller
)
GROUP BY Caller
)
SELECT
Average = AVG(TotalDuration),
Median = AVG(CASE WHEN rn IN (N / 2 + 1, N / 2 + N % 2) THEN TotalDuration END)
FROM totals
The output:
Average Median
----------- -----------
282 123
Note: In Transact-SQL, / stands for integer division if both operands are integer. The modulo operator in T-SQL is %.
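The ranking arithmetic can be verified with a short Python sketch that uses the same test rows. The function name `average_and_median` is hypothetical, and `//` stands in for T-SQL integer division (equivalent here because the durations are non-negative).

```python
from collections import defaultdict

# (Caller, Duration) rows copied from the test table
calls = [
    (3, 123), (1, 23), (2, 15), (1, 943), (3, 326), (3, 74), (9, 49),
    (5, 66), (4, 56), (4, 208), (4, 112), (5, 521), (6, 197), (8, 23),
    (7, 22), (1, 24), (0, 45),
]


def average_and_median(rows):
    """Sum durations per caller, then return (average, median) of the
    per-caller totals using integer arithmetic."""
    totals = defaultdict(int)
    for caller, duration in rows:
        totals[caller] += duration
    ordered = sorted(totals.values())
    n = len(ordered)
    average = sum(ordered) // n
    # 1-based ranks of the middle value(s); both are equal when n is odd.
    upper = n // 2 + 1
    lower = n // 2 + n % 2
    median = (ordered[lower - 1] + ordered[upper - 1]) // 2
    return average, median


print(average_and_median(calls))
```

With ten callers the two middle ranks are 5 and 6 (totals 49 and 197), so the sketch returns an average of 282 and a median of 123, the same figures as the query output.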
I hope you can use this; I did it with table variables:
declare @calls table (number char(4), duration int)
declare @officers table(number char(4), name varchar(10))
insert @calls values (3321,1)
insert @calls values (3321,1)
insert @calls values (3321,1)
insert @calls values (3321,42309)
insert @calls values (1235,34555)
insert @calls values (2979,31133)
insert @calls values (2324,24442)
insert @calls values (2345,11113)
insert @calls values (3422,9922)
insert @calls values (3214,8333)
insert @officers values(3321, 'Peter')
insert @officers values(1235, 'Stewie')
insert @officers values(2979, 'Lois')
insert @officers values(2324, 'Brian')
insert @officers values(2345, 'Chris')
insert @officers values(2345, 'Peter')
insert @officers values(3422, 'Frank')
insert @officers values(3214, 'John')
insert @officers values(3214, 'Mark')
SQL to get the median and average:
;with a as
(
select sum(duration) total_duration, number from @calls group by number
)
select avg(a.total_duration) avg_duration, c.total_duration median_duration from a
cross join (
select top 1 total_duration from (
select top 50 percent total_duration from a order by total_duration desc) b order by
total_duration) c
group by c.total_duration
Try here: https://data.stackexchange.com/stackoverflow/q/108612/
SQL to get the total durations:
select o.name, c.total_duration, c.number from @officers o join
(select sum(duration) total_duration, number from @calls group by number) c
on o.number = c.number
order by total_duration desc
Try here: https://data.stackexchange.com/stackoverflow/q/108611/