SQL count distinct per group divided by count distinct of total - sql

I have:
id
value
1
123
1
124
1
125
2
126
2
127
2
127
3
128
3
128
3
128
I want an aggregation like:
id
distinct_count
total_distinct
percentage
1
3
6
0.5
2
2
6
0.33
3
1
6
0.167
I tried applying a window over clause like this:
SELECT id,
COUNT(DISTINCT value) AS distinct_count,
COUNT(DISTINCT value) OVER () AS total_distinct,
COUNT(DISTINCT value) / COUNT(DISTINCT value) OVER () AS percentage
FROM have
GROUP BY id
but it seems it is not implemented yet.
is there a way to achieve this without a join?

You can do this:
SELECT id,
COUNT(DISTINCT value) AS distinct_count,
(SELECT COUNT(DISTINCT value) FROM have) AS total_distinct,
(0.0+COUNT(DISTINCT value)) / (SELECT COUNT(DISTINCT value) FROM have) AS percentage
FROM have
GROUP BY id
or do:
WITH cte AS (SELECT COUNT(DISTINCT value) AS value FROM have)
SELECT
id,
COUNT(DISTINCT value) AS distinct_count,
cte.value AS total_distinct,
(0.0+COUNT(DISTINCT value)) / cte.value AS percentage
FROM have
CROSS APPLY cte
GROUP By cte.value,id;

An alternative method is to enumerate the values and use conditional aggregation:
SELECT id,
SUM(CASE WHEN seqnum_iv = 1 THEN 1 ELSE 0 END) as distinct_count,
SUM(CASE WHEN seqnum_v = 1 THEN 1 ELSE 0 END) as total_distinct_count,
(SUM(CASE WHEN seqnum_iv = 1 THEN 1.0 ELSE 0 END) /
SUM(CASE WHEN seqnum_v = 1 THEN 1.0 ELSE 0 END)
) as ratio
FROM (SELECT h.*,
ROW_NUMBER() OVER (PARTITION BY id, value ORDER BY value) as seqnum_iv,
ROW_NUMBER() OVER (PARTITION BY value ORDER BY value) as seqnum_v
FROM have h
) h
GROUP BY id;
This may be faster than an approach using subqueries.

Related

Find subsequent occurrence of a value in a table

I have a table which looks like shown below
ID SubmittedValue ApprovedValue
1 25.9 0
1 29 29
1 25.9 25.9
1 50 0
1 45 0
1 10 0
1 10 10
Expected result
ID SubsequentlyApproved(CNT) Total_Amt_sub_aprvd
1 2 35.9
We get the above result because 25.9+10 since it is repeated in the subsequent rows.
How to perform VLOOKUP like functionality for this scenario. I tried the subquery but it didn't work.
SELECT a.id,
SUM(CASE WHEN a.ApprovedValue=0 THEN 1 ELSE 0 END) AS SUB_COUNT
FROM myTable a
join (select id, sum( case when SubmittedValue=ApprovedValue then 1 end) as check_value from myTable) b
on b.id=a.id and SUB_COUNT=check_value
but this is not giving me the expected result.
You seem to want to count rows where the values are the same and the first value appears more than once. If so, you can use window functions and aggregation:
select id, count(*), sum(ApprovedValue)
from (select t.*, count(*) over (partition by id, SubmittedValue) as cnt
from t
) t
where cnt > 1 and SubmittedValue = ApprovedValue
group by id
Without window functions using a semi-join
select id, count(*), sum(submittedvalue)
from test t1
where submittedvalue=approvedvalue
and exists (select 1
from test t2
where t1.id=t2.id and t1.submittedvalue=t2.submittedvalue
group by id, submittedvalue
having count(*)>1)
group by id;

count rows which have max value less than specified parameter

I want to find in my table, max value which is less than specified in parameter and get count of rows that have the same value as max value. For example in my table I have values: (4,1,3,1,4,4,10), and it is list of parameters in string "2,9,10,4". I have to split string to separate parameters. Base on this sample values I want to get something like that:
param | max value | count
2 | 1 | 2
9 | 4 | 3
10 | 4 | 3
4 | 3 | 1
And it is my sample query:
select
[param]
, max([val]) [max_value_by_param]
, max(count) [count]
from(
select
n.value as [param]
,a.val
, count(*) as [count]
from (--mock of table
select 1 as val union all
select 3 as val union all
select 4 as val union all
select 1 as val union all
select 3 as val union all
select 4 as val union all
select 4 as val union all
select 10 as val
) a
join (select [value] from string_split('2,9,10,4', ',')) n--list of params
on a.val < n.[value]
group by n.value, a.val
) tmp
group by [param]
Is it possible to do it better/easier ?
Here is a way to express this using apply:
select s.value as param, a.val, a.cnt
from string_split('2,9,10,4', ',') s outer apply
(select top (1) a.val, count(*) as cnt
from a
group by a.val
having a.val < s.value
order by a.val desc
) a;
Here is a db<>fiddle.
But the fastest method is probably going to be:
with av as (
select a.val, count(*) as cnt
from a
group by a.val
union all
select s.value, null as cnt
from string_split('2,9,10,4', ',') s
)
select val, a_val, a_cnt
from (select av.*,
max(case when cnt is not null then val end) over (order by val, (case when cnt is null then 1 else 2 end)) as a_val,
max(case when cnt is not null then cnt end) over (order by val, (case when cnt is null then 1 else 2 end)) as a_cnt
from av
) av
where cnt is null;
This only aggregates the data once and should return all parameters, even those with no preceding values in a.

Assign column value based on the percentage of rows

In DB2 is there a way to assign a column value based on the first x%, then y% and remaining z% of rows?
I've tried using row_number() function but no luck!
Example below
Assuming that the below example count(id) is already arranged in descending order
Input:
ID count(id)
5 10
3 8
1 5
4 3
2 1
Output:
First 30% rows of the above input should be assigned code H, last 30% of the rows will have code L and remaining will have code M. If 30% of rows evaluates to decimal then round up-to 0 decimal place.
ID code
5 H
3 H
1 M
4 L
2 L
You can use window functions:
select t.id,
(case ntile(3) over (order by count(id) desc)
when 1 then 'H'
when 2 then 'M'
when 3 then 'L'
end) as grp
from t
group by t.id;
This puts them into equal sized groups.
For 30-40-30% split with your conditions, you have to be more careful:
select t.id,
(case when (seqnum - 1.0) < 0.3 * cnt then 'H'
when (seqnum + 1.0) > 0.7 * cnt then 'L'
else 'M'
end) as grp
from (select t.id,
count(*) as cnt,
count(*) over () as num_ids,
row_number() over (order by count(*) desc) as seqnum
from t
group by t.id
) t
Try this:
with t(ID, count_id) as (values
(5, 10)
, (3, 8)
, (1, 5)
, (4, 3)
, (2, 1)
)
select t.*
, case
when pst <=30 then 'H'
when pst <=70 then 'M'
else 'L'
end as code
from
(
select t.*
, rownumber() over (order by count_id desc) as rn
, 100*rownumber() over (order by count_id desc)/nullif(count(1) over(), 0) as pst
from t
) t;
The result is:
ID COUNT_ID RN PST CODE
-- -------- -- --- ----
5 10 1 20 H
3 8 2 40 M
1 5 3 60 M
4 3 4 80 L
2 1 5 100 L

Aggregate within a group of unchanged values

I have sample data:
RowId TypeId Value
1 1 34
2 1 53
3 1 34
4 2 43
5 2 65
6 16 54
7 16 34
8 1 45
9 6 43
10 6 34
11 16 64
12 16 63
I want to count row for each type (The Value does not matter to me), but only for... neighbor TypeId
TypeId Count
1 3
2 2
16 2
1 1
6 2
16 2
How to achieve this result?
This should give you COUNT of rows within a group of unchanged values:
SELECT TypeId, grp, COUNT(*) FROM (
SELECT RowId, TypeId , Value, gap, SUM(gap) over (ORDER BY RowId ) grp
FROM (SELECT RowId, TypeId , Value,
CASE WHEN TypeId = lag(TypeId) over (ORDER BY RowId )
THEN 0
ELSE 1
END gap
FROM dummy
) t
) tt
GROUP BY TypeId, grp;
If you prefer WITH over endless sub-query inclusions:
WITH dummy_with_groups AS (
SELECT RowId, TypeId , Value, SUM(gap) OVER (ORDER BY RowId) grp
FROM (SELECT RowId, TypeId , Value,
CASE WHEN TypeId = lag(TypeId) OVER (ORDER BY RowId)
THEN 0 ELSE 1 END gap
FROM dummy) t
)
SELECT TypeId, COUNT(*) as Result
FROM dummy_with_groups
GROUP BY TypeId, grp;
http://www.sqlfiddle.com/#!6/f16e9/34
Check this fiddle demo. I have renamed your columns a little.
WITH myCTE AS
(SELECT row_id,
type_id,
ROW_NUMBER () OVER (PARTITION BY type_id ORDER BY row_id)
AS cnt,
CASE LEAD (type_id) OVER (ORDER BY row_id)
WHEN type_id THEN 0
ELSE 1
END
AS show
FROM dummy),
innerQuery AS
(SELECT row_id, type_id, cnt
FROM myCTE
WHERE show = 1)
SELECT iq1.type_id, iq1.cnt - ISNULL (iq2.cnt, 0) CNT
FROM innerQuery iq1
LEFT OUTER JOIN innerQuery iq2
ON iq1.type_id = iq2.type_id
AND EXISTS
(SELECT 1
FROM innerQuery iq3
WHERE iq3.type_id = iq1.type_id
AND iq3.row_id < iq1.row_id
HAVING MAX (iq3.row_id) = iq2.row_id)
The output is exactly as expected.

How to get average of the 'middle' values in a group?

I have a table that has values and group ids (simplified example). I need to get the average for each group of the middle 3 values. So, if there are 1, 2, or 3 values it's just the average. But if there are 4 values, it would exclude the highest, 5 values the highest and lowest, etc. I was thinking some sort of window function, but I'm not sure if it's possible.
http://www.sqlfiddle.com/#!11/af5e0/1
For this data:
TEST_ID TEST_VALUE GROUP_ID
1 5 1
2 10 1
3 15 1
4 25 2
5 35 2
6 5 2
7 15 2
8 25 3
9 45 3
10 55 3
11 15 3
12 5 3
13 25 3
14 45 4
I'd like
GROUP_ID AVG
1 10
2 15
3 21.6
4 45
Another option using analytic functions;
SELECT group_id,
avg( test_value )
FROM (
select t.*,
row_number() over (partition by group_id order by test_value ) rn,
count(*) over (partition by group_id ) cnt
from test t
) alias
where
cnt <= 3
or
rn between floor( cnt / 2 )-1 and ceil( cnt/ 2 ) +1
group by group_id
;
Demo --> http://www.sqlfiddle.com/#!11/af5e0/59
I'm not familiar with the Postgres syntax on windowed functions, but I was able to solve your problem in SQL Server with this SQL Fiddle. Maybe you'll be able to easily migrate this into Postgres-compatible code. Hope it helps!
A quick primer on how I worked it.
Order the test scores for each group
Get a count of items in each group
Use that as a subquery and select only the middle 3 items (that's the where clause in the outer query)
Get the average for each group
--
select
group_id,
avg(test_value)
from (
select
t.group_id,
convert(decimal,t.test_value) as test_value,
row_number() over (
partition by t.group_id
order by t.test_value
) as ord,
g.gc
from
test t
inner join (
select group_id, count(*) as gc
from test
group by group_id
) g
on t.group_id = g.group_id
) a
where
ord >= case when gc <= 3 then 1 when gc % 2 = 1 then gc / 2 else (gc - 1) / 2 end
and ord <= case when gc <= 3 then 3 when gc % 2 = 1 then (gc / 2) + 2 else ((gc - 1) / 2) + 2 end
group by
group_id
with cte as (
select
*,
row_number() over(partition by group_id order by test_value) as rn,
count(*) over(partition by group_id) as cnt
from test
)
select
group_id, avg(test_value)
from cte
where
cnt <= 3 or
(rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id
order by group_id
sql fiddle demo
in the cte, we need to get count of elements over each group_id by window function + calculate row_number inside each group_id. Then, if this count > 3 then we need to get middle of the group by dividing count by 2 and then get +1 and -1 element. If count <= 3, then we should just take all elements.
This works:
SELECT A.group_id, avg(A.test_value) AS avg_mid3 FROM
(SELECT group_id,
test_value,
row_number() OVER (PARTITION BY group_id ORDER BY test_value) AS position
FROM test) A
JOIN
(SELECT group_id,
CASE
WHEN count(*) < 4 THEN 1
WHEN count(*) % 2 = 0 THEN (count(*)/2 - 1)
ELSE (count(*) / 2)
END AS position_start,
CASE
WHEN count(*) < 4 THEN count(*)
WHEN count(*) % 2 = 0 THEN (count(*)/2 + 1)
ELSE (count(*) / 2 + 2)
END AS position_end
FROM test GROUP BY group_id) B
ON A.group_id=B.group_id
AND A.position >= B.position_start
AND A.position <= B.position_end
GROUP BY A.group_id
Fiddle link: http://www.sqlfiddle.com/#!11/af5e0/56
If you need to calculate the average values ​​for groups then you can do this:
SELECT CASE WHEN NUMBER_FIRST_GROUP <> 0
THEN SUM_FIRST_GROUP / NUMBER_FIRST_GROUP
ELSE NULL
END AS AVG_FIRST_GROUP,
CASE WHEN NUMBER_SECOND_GROUP <> 0
THEN SUM_SECOND_GROUP / NUMBER_SECOND_GROUP
ELSE NULL
END AS AVG_SECOND_GROUP,
CASE WHEN NUMBER_THIRD_GROUP <> 0
THEN SUM_THIRD_GROUP / NUMBER_THIRD_GROUP
ELSE NULL
END AS AVG_THIRD_GROUP,
CASE WHEN NUMBER_FOURTH_GROUP <> 0
THEN SUM_FOURTH_GROUP / NUMBER_FOURTH_GROUP
ELSE NULL
END AS AVG_FOURTH_GROUP
FROM (
SELECT
SUM(CASE WHEN GROUP_ID = 1 THEN 1 ELSE 0 END) AS NUMBER_FIRST_GROUP,
SUM(CASE WHEN GROUP_ID = 1 THEN TEST_VALUE ELSE 0 END) AS SUM_FIRST_GROUP,
SUM(CASE WHEN GROUP_ID = 2 THEN 1 ELSE 0 END) AS NUMBER_SECOND_GROUP,
SUM(CASE WHEN GROUP_ID = 2 THEN TEST_VALUE ELSE 0 END) AS SUM_SECOND_GROUP,
SUM(CASE WHEN GROUP_ID = 3 THEN 1 ELSE 0 END) AS NUMBER_THIRD_GROUP,
SUM(CASE WHEN GROUP_ID = 3 THEN TEST_VALUE ELSE 0 END) AS SUM_THIRD_GROUP,
SUM(CASE WHEN GROUP_ID = 4 THEN 1 ELSE 0 END) AS NUMBER_FOURTH_GROUP,
SUM(CASE WHEN GROUP_ID = 4 THEN TEST_VALUE ELSE 0 END) AS SUM_FOURTH_GROUP
FROM TEST
) AS FOO