Count cumulative distinct - sql

I have the table below. Is it possible to do a cumulative distinct count? For example, if A1 has 3 distinct values, then its count is 3. Next, check A1 and A2 together: if they have 5 distinct values between them, the count is 5. Repeat until A1 + A2 + ... + An, counting the distinct values at each step.
A | V
:- | :-
A1 | V1
A1 | V2
A1 | V2
A2 | V1
A2 | V2
A2 | V3
My expected output would be:
A | C
:- | -:
A1 | 2
A2 | 3

This answers the original version of the question.
You can aggregate twice . . . once to keep the first occurrence of v and the second to aggregate again:
select a, count(*) as new_cs
from (select v, min(a) as a
from t
group by v
) v
group by a;
Note: The above only shows the a values that introduce new values. If you want all a values, then window functions are a better approach:
select a, sum(case when seqnum = 1 then 1 else 0 end) as c
from (select t.*, row_number() over (partition by v order by a) as seqnum
from t
) t
group by a
order by a;
Here is a db<>fiddle.

You can use ROW_NUMBER() window function to find the 1st occurrence of each V and then COUNT() window function to count only these 1st occurrences:
SELECT DISTINCT A,
COUNT(CASE WHEN rn = 1 THEN 1 END) OVER (ORDER BY A) C
FROM (
SELECT A, ROW_NUMBER() OVER (PARTITION BY V ORDER BY A) rn
FROM tablename
) t
ORDER BY A
See the demo.
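As a quick cross-check, the query above can be run against the question's sample data. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module instead of Oracle (SQLite 3.25 or later is assumed, for window function support), and the helper name cumulative_distinct is made up for illustration.

```python
import sqlite3

# Sample rows (A, V) from the question.
ROWS = [("A1", "V1"), ("A1", "V2"), ("A1", "V2"),
        ("A2", "V1"), ("A2", "V2"), ("A2", "V3")]

# Same query as in the answer; only the first occurrence of each V (rn = 1)
# is counted, and the window COUNT accumulates those over A.
QUERY = """
SELECT DISTINCT A,
       COUNT(CASE WHEN rn = 1 THEN 1 END) OVER (ORDER BY A) AS C
FROM (
    SELECT A, ROW_NUMBER() OVER (PARTITION BY V ORDER BY A) AS rn
    FROM tablename
) t
ORDER BY A
"""

def cumulative_distinct(rows):
    """Load rows into an in-memory table (hypothetical helper) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (A TEXT, V TEXT)")
    conn.executemany("INSERT INTO tablename VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

print(cumulative_distinct(ROWS))
```

For the sample data this should give A1 with 2 and A2 with 3, matching the expected output in the question.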

You can use a partitioned outer join to ensure that all V values are counted for all A values and then use the FIRST_VALUE analytic function to find whether a value exists in the current or preceding A values for the V:
SELECT a,
COUNT( DISTINCT fv ) AS c
FROM (
SELECT t.a,
FIRST_VALUE(t.v) IGNORE NULLS OVER (
PARTITION BY v.v
ORDER BY t.a
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS fv
FROM ( SELECT DISTINCT v FROM table_name ) v
LEFT OUTER JOIN table_name t
PARTITION BY ( t.a )
ON ( t.v = v.v )
)
GROUP BY a
ORDER BY a
Which, for the sample data:
CREATE TABLE table_name ( A, V ) AS
SELECT 'A1', 'V1' FROM DUAL UNION ALL
SELECT 'A1', 'V2' FROM DUAL UNION ALL
SELECT 'A1', 'V3' FROM DUAL UNION ALL
SELECT 'A2', 'V1' FROM DUAL UNION ALL
SELECT 'A2', 'V3' FROM DUAL UNION ALL
SELECT 'A2', 'V4' FROM DUAL UNION ALL
SELECT 'A3', 'V2' FROM DUAL UNION ALL
SELECT 'A3', 'V3' FROM DUAL UNION ALL
SELECT 'A4', 'V1' FROM DUAL UNION ALL
SELECT 'A4', 'V5' FROM DUAL;
Outputs:
A | C
:- | -:
A1 | 3
A2 | 4
A3 | 4
A4 | 5
db<>fiddle here
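To sanity-check those numbers outside the database, the same cumulative logic can be written as a running set. This is a plain Python sketch, not a translation of the partitioned outer join; the function name cumulative_distinct_counts is an assumption made for illustration.

```python
def cumulative_distinct_counts(rows):
    """Return [(a, number of distinct v seen in all groups up to and including a)]."""
    seen = set()
    result = []
    # Walk the groups in sorted order, as ORDER BY a would.
    for a in sorted({key for key, _ in rows}):
        seen.update(v for key, v in rows if key == a)
        result.append((a, len(seen)))
    return result

# Sample data from the answer above.
SAMPLE = [("A1", "V1"), ("A1", "V2"), ("A1", "V3"),
          ("A2", "V1"), ("A2", "V3"), ("A2", "V4"),
          ("A3", "V2"), ("A3", "V3"),
          ("A4", "V1"), ("A4", "V5")]

print(cumulative_distinct_counts(SAMPLE))
# [('A1', 3), ('A2', 4), ('A3', 4), ('A4', 5)]
```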

Related

postgresql - count distinct combination of three columns- order doesn't matter

I'm trying to count distinct combinations of three columns, order of the columns doesn't matter
sample :
a a a
a a b
a b a
b b a
b a b
the result I'm getting :
a a a 1
a a b 1
a b a 1
b b a 1
b a b 1
desired result
aaa 1
aab 2
bba 2
You can use an ordered array
select v[1], v[2], v[3], count(*) n
from tbl t
cross join lateral (
select array_agg(col order by col) v
from (
values (c1),(c2),(c3)
) t(col)
) s
group by v[1], v[2], v[3];
db<>fiddle
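The idea behind the ordered array is that sorting the three values gives every permutation the same key. A small Python sketch of that idea, assuming the rows are available as tuples (count_unordered is a made-up name):

```python
from collections import Counter

def count_unordered(rows):
    """Count rows by their values, ignoring column order (sorted tuple as key)."""
    return Counter(tuple(sorted(row)) for row in rows)

# Sample rows from the question.
SAMPLE = [("a", "a", "a"), ("a", "a", "b"), ("a", "b", "a"),
          ("b", "b", "a"), ("b", "a", "b")]

for key, n in sorted(count_unordered(SAMPLE).items()):
    print("".join(key), n)
# aaa 1
# aab 2
# abb 2
```

The last key prints as abb rather than bba because the key is the sorted form; it stands for the same combination.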
Maybe you can use checksums to get the required result. E.g., if you are really just dealing with combinations of 'a' and 'b', you could convert the letters to integers (by calling the ASCII() function) and add these up so that you get a checksum.
TABLE
create table t (c1, c2, c3 ) as
select 'a', 'a', 'a' union all
select 'a', 'a', 'b' union all
select 'a', 'b', 'a' union all
select 'b', 'b', 'a' union all
select 'b', 'a', 'b' ;
Checksums
select c1, c2, c3, ascii( c1 ) + ascii( c2 ) + ascii( c3 ) as checksum
from t ;
-- output
c1 c2 c3 checksum
a a a 291
a a b 292
a b a 292
b b a 293
b a b 293
If this works for you, then you can use window functions, e.g.:
select c1, c2, c3, rc_ as rowcount
from (
select c1, c2, c3
, count(*) over ( partition by ascii( c1 ) + ascii( c2 ) + ascii( c3 ) order by 1 ) rc_
, row_number() over ( partition by ascii( c1 ) + ascii( c2 ) + ascii( c3 ) order by 1 ) rn_
from t
) sq
where rc_ = rn_ ;
-- output
c1 c2 c3 rowcount
a a a 1
a b a 2
b a b 2
See dbfiddle.
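The "if it is really just 'a' and 'b'" condition matters, because an additive checksum can give the same number for different combinations once more letters are involved. A short Python sketch of this; the rows with 'c', 'd' and 'e' are hypothetical and not part of the question's data:

```python
def ascii_checksum(row):
    """Sum of character codes, mirroring ascii(c1) + ascii(c2) + ascii(c3)."""
    return sum(ord(ch) for ch in row)

# With only 'a' and 'b', the checksum is 291 plus the number of b's,
# so it identifies the combination uniquely.
print(ascii_checksum(("a", "a", "b")), ascii_checksum(("a", "b", "a")))  # 292 292

# Hypothetical rows with more letters: different combinations, same checksum.
print(ascii_checksum(("a", "c", "e")), ascii_checksum(("b", "c", "d")))  # 297 297
```

The same kind of collision can occur with any additive weights, so the checksum should only be used as a grouping key when the set of possible values rules this out.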
If you are dealing with strings that cannot easily be converted to integers, you could create a mapping between the strings and integers, and implement the map_ as a view (so that it is easy to use in subsequent queries), e.g.:
MAP
-- {1} find all distinct elements
-- {2} map each element to an integer
create view map_
as
select val_, rank() over ( order by val_ ) weight_
from (
select distinct val_
from (
select distinct c1 val_ from t union all
select distinct c2 from t union all
select distinct c3 from t
) all_elements
) unique_elements ;
Once you have this map, you can use its values for creating checksums (maybe also in a view) ...
Checksums
create view t_checksums_
as
select c1, c2, c3, c1weight + c2weight + c3weight as checksum
from (
select
c1, ( select weight_ from map_ where c1 = map_.val_ ) c1weight
, c2, ( select weight_ from map_ where c2 = map_.val_ ) c2weight
, c3, ( select weight_ from map_ where c3 = map_.val_ ) c3weight
from t
) valandweight ;
... and then, you can use the same query as before, for obtaining the final result - see dbfiddle.

Group by with having and merge back to original table

I have a table like this
A_Count | B_Count | A | B | C
------: | ------: | :- | :- | :-
1 | 0 | A | NULL | C1
0 | 1 | NULL | B | C1
1 | 1 | A | B | C2
1 | 1 | A | B | C2
and I want to have a result table (only need to show column A and B) like:
A_Count | B_Count | A | B | C
------: | ------: | :- | :- | :-
1 | 1 | A | B | C1
1 | 1 | A | B | C2
1 | 1 | A | B | C2
So my goal is to merge two rows under the following condition:
both rows belong to the same group C, and they are merged only when one row has A null and the other has B null.
So it's like:
group by C
having sum(A_COUNT) =1 AND sum(B_COUNT) =1
But the problem is that I want to keep the rows that are not merged (rows 3 and 4). Can someone tell me how to do that? Many thanks!
You can use conditional analytical function and group by as follows:
Select max(a) as a, max(b) as b, c from
(Select a, b, c,
case when nulla = 1 and nullb = 1 and (a is null or b is null)
then 0
else row_number() over (partition by c order by 1)
end as rn
from (Select a, b, c,
count(case when a is null then 1 end) over(partition by c) as nulla,
count(case when b is null then 1 end) over (partition by c) as nullb
From your_table t
)
)
Group by c, rn
DB<>Fiddle Thanks to MT0. Used the sample data from MT0's fiddle.
If you were using Oracle 12 then you could use MATCH_RECOGNIZE:
SELECT a_count, b_count, a, b, c
FROM (
SELECT t.*,
NVL2(
A,
ROW_NUMBER() OVER ( PARTITION BY C ORDER BY NVL2( B, 1, 0 ) DESC, ROWNUM ),
ROW_NUMBER() OVER ( PARTITION BY C ORDER BY NVL2( A, 1, 0 ) DESC, ROWNUM )
) AS rn
FROM table_name t
)
MATCH_RECOGNIZE (
PARTITION BY C
ORDER BY rn, A NULLS LAST
MEASURES
FIRST( a_count ) AS a_count,
LAST( b_count ) AS b_count,
FIRST( a ) AS a,
LAST( b ) AS b
PATTERN ( a b? )
DEFINE
a AS a.a IS NOT NULL,
b AS a.b IS NULL AND b.a IS NULL AND b.b IS NOT NULL
)
Before that Oracle version, you can get a similar effect using analytic functions to determine which rows to aggregate:
SELECT SUM( a_count ) AS a_count,
SUM( b_count ) AS b_count,
MAX( a ) AS a,
MAX( b ) AS b,
c
FROM (
SELECT t.*,
NVL2(
A,
ROW_NUMBER() OVER ( PARTITION BY C ORDER BY NVL2( B, 1, 0 ) DESC, ROWNUM ),
ROW_NUMBER() OVER ( PARTITION BY C ORDER BY NVL2( A, 1, 0 ) DESC, ROWNUM )
) AS rn
FROM table_name t
)
GROUP BY c, rn
Which, for the sample data (in an unordered state, with additional rows to demonstrate grouping additional pairs of rows):
CREATE TABLE table_name ( A_Count, B_Count, A, B, C ) AS
SELECT 1, 0, 'A', NULL, 'C1' FROM DUAL UNION ALL
SELECT 0, 1, NULL, 'B', 'C1' FROM DUAL UNION ALL
SELECT 1, 1, 'A', 'B', 'C2' FROM DUAL UNION ALL
SELECT 0, 1, NULL, 'B', 'C2' FROM DUAL UNION ALL -- Added row
SELECT 1, 0, 'A', NULL, 'C2' FROM DUAL UNION ALL -- Added row
SELECT 1, 0, 'A', NULL, 'C2' FROM DUAL UNION ALL -- Added row
SELECT 1, 1, 'A', 'B', 'C2' FROM DUAL UNION ALL
SELECT 0, 1, NULL, 'B', 'C2' FROM DUAL -- Added row
Both output:
A_COUNT | B_COUNT | A | B | C
------: | ------: | :- | :- | :-
1 | 1 | A | B | C1
1 | 1 | A | B | C2
1 | 1 | A | B | C2
1 | 1 | A | B | C2
1 | 1 | A | B | C2
db<>fiddle here
You can do this with join:
select (t1.a_count + coalesce(t2.a_count, 0)) as a_count,
(t1.b_count + coalesce(t2.b_count, 0)) as b_count,
coalesce(t1.a, t2.a) as a,
coalesce(t1.b, t2.b) as b,
t1.c
from t t1 left join
t t2
on t1.c = t2.c and
t1.a is not null and t2.a is null and
t1.b is null and t2.b is not null
where t1.a is not null;
As you've described the problem, aggregation doesn't seem necessary.
Here is a db<>fiddle with your original data.
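For readers without access to the fiddle, the join can be tried locally. The sketch below is an illustration under assumptions: it uses Python's sqlite3 module rather than Oracle, the table is named t as in the query, and merge_rows is a made-up helper name. An ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# The four rows from the question: (A_Count, B_Count, A, B, C).
ROWS = [(1, 0, "A", None, "C1"),
        (0, 1, None, "B", "C1"),
        (1, 1, "A", "B", "C2"),
        (1, 1, "A", "B", "C2")]

# Self-join: a row with A set and B null picks up the matching row
# (same C) that has A null and B set; complete rows find no partner.
QUERY = """
SELECT t1.a_count + COALESCE(t2.a_count, 0) AS a_count,
       t1.b_count + COALESCE(t2.b_count, 0) AS b_count,
       COALESCE(t1.a, t2.a) AS a,
       COALESCE(t1.b, t2.b) AS b,
       t1.c
FROM t t1
LEFT JOIN t t2
  ON t1.c = t2.c
 AND t1.a IS NOT NULL AND t2.a IS NULL
 AND t1.b IS NULL AND t2.b IS NOT NULL
WHERE t1.a IS NOT NULL
ORDER BY t1.c
"""

def merge_rows(rows):
    """Load rows into an in-memory table (hypothetical helper) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a_count INT, b_count INT, a TEXT, b TEXT, c TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

for row in merge_rows(ROWS):
    print(row)
```

The two C1 rows collapse into one, and both complete C2 rows are kept as they are.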

Get Distinct values without null

I have a table like this;
--Table_Name--
A | B | C
-----------------
A1 NULL NULL
A1 NULL NULL
A2 NULL NULL
NULL B1 NULL
NULL B2 NULL
NULL B3 NULL
NULL NULL C1
I want to get like this ;
--Table_Name--
A | B | C
-----------------
A1 B1 C1
A2 B2 NULL
NULL B3 NULL
How should I do that ?
Here's one option:
sample data is from line #1 - 9
the following CTEs (lines #11 - 13) fetch ranked distinct not null values from each column
the final query (line #15 onward) returns desired result by outer joining previous CTEs on ranked value
SQL> with test (a, b, c) as
2 (select 'A1', null, null from dual union all
3 select 'A1', null, null from dual union all
4 select 'A2', null, null from dual union all
5 select null, 'B1', null from dual union all
6 select null, 'B2', null from dual union all
7 select null, 'B3', null from dual union all
8 select null, null, 'C1' from dual
9 ),
10 --
11 ta as (select distinct a, dense_rank() over (order by a) rn from test where a is not null),
12 tb as (select distinct b, dense_rank() over (order by b) rn from test where b is not null),
13 tc as (select distinct c, dense_rank() over (order by c) rn from test where c is not null)
14 --
15 select ta.a, tb.b, tc.c
16 from ta full outer join tb on ta.rn = tb.rn
17 full outer join tc on ta.rn = tc.rn
18 order by a, b, c
19 /
A B C
-- -- --
A1 B1 C1
A2 B2
B3
SQL>
If you have only one non-null value per row, then I think a simpler solution is to enumerate the values and aggregate:
select max(a) as a, max(b) as b, max(c) as c
from (select t.*,
dense_rank() over (partition by (case when a is null then 1 else 2 end),
(case when b is null then 1 else 2 end),
(case when c is null then 1 else 2 end)
order by a, b, c
) as seqnum
from t
) t
group by seqnum;
This only "aggregates" once and only uses one window function, so I think it should have better performance than handling each column individually.
Another approach is to use lateral joins which are available in Oracle 12C -- but this assumes that the types are compatible:
select max(case when which = 'a' then val end) as a,
max(case when which = 'b' then val end) as b,
max(case when which = 'c' then val end) as c
from (select which, val,
dense_rank() over (partition by which order by val) as seqnum
from t cross join lateral
(select 'a' as which, a as val from dual union all
select 'b', b from dual union all
select 'c', c from dual
) x
where val is not null
) t
group by seqnum;
The performance may be comparable, because the subquery removes so many rows.
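The unpivot-and-rank idea can also be tried without a lateral join. The following is a sketch only, with several assumptions: it runs on SQLite through Python's sqlite3 module (3.25 or later for DENSE_RANK), it replaces the lateral join with a plain UNION ALL over the table, and the names t and distinct_columns are chosen for illustration.

```python
import sqlite3

# Sample rows (A, B, C) from the question.
ROWS = [("A1", None, None), ("A1", None, None), ("A2", None, None),
        (None, "B1", None), (None, "B2", None), (None, "B3", None),
        (None, None, "C1")]

# Unpivot each column into (which, val), rank the distinct values per column,
# then pivot back with one output row per rank.
QUERY = """
SELECT MAX(CASE WHEN which = 'a' THEN val END) AS a,
       MAX(CASE WHEN which = 'b' THEN val END) AS b,
       MAX(CASE WHEN which = 'c' THEN val END) AS c
FROM (
    SELECT which, val,
           DENSE_RANK() OVER (PARTITION BY which ORDER BY val) AS seqnum
    FROM (
        SELECT 'a' AS which, a AS val FROM t UNION ALL
        SELECT 'b', b FROM t UNION ALL
        SELECT 'c', c FROM t
    ) x
    WHERE val IS NOT NULL
) ranked
GROUP BY seqnum
ORDER BY seqnum
"""

def distinct_columns(rows):
    """Load rows into an in-memory table (hypothetical helper) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a TEXT, b TEXT, c TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

for row in distinct_columns(ROWS):
    print(row)
```

For the sample data this should print (A1, B1, C1), (A2, B2, None) and (None, B3, None), which matches the desired result.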

Generate random percentage using oracle sql

I have the below table account_ownership:
account holder
A1 H1
A1 H2
A2 H3
A2 H4
A2 H5
A3 H6
A3 H7
A3 H8
A3 H9
Below is what i need.
Output:
account holder ownership
A1 H1 50
A1 H2 50
A2 H3 50
A2 H4 25
A2 H5 25
A3 H6 60
A3 H7 30
A3 H8 5
A3 H9 5
Note: The ownership can have any value between 1 and 99 (no fractional part) but it should add up to 100 within the account.
WITH tbl1 (acct, hldr)
AS (SELECT 'A1', 'H1' FROM DUAL
UNION ALL
SELECT 'A1', 'H2' FROM DUAL
UNION ALL
SELECT 'A2', 'H3' FROM DUAL
UNION ALL
SELECT 'A2', 'H4' FROM DUAL
UNION ALL
SELECT 'A2', 'H5' FROM DUAL
)
select acct, hldr,
case when rn=1 then 100-sum(percent) over(partition by acct)+percent
else percent end AS percent
from (
select acct, hldr,
ceil(dbms_random.value(1,100/count(1) over(partition by acct))) percent,
row_number() over(partition by acct order by NULL) rn
from tbl1
)
Use SQL as below.
WITH account_ownership (account, holder)
AS (SELECT 'A1', 'H1' FROM DUAL
UNION ALL
SELECT 'A1', 'H2' FROM DUAL
UNION ALL
SELECT 'A2', 'H3' FROM DUAL
UNION ALL
SELECT 'A2', 'H4' FROM DUAL
UNION ALL
SELECT 'A2', 'H5' FROM DUAL
UNION ALL
SELECT 'A3', 'H6' FROM DUAL),
table1 as(
select a.account,holder,
sum(1)over(partition by a.account order by holder) as myorder,
b.mycount,
dbms_random.value(1,100) myvalue
from account_ownership a
left join (select account,count(*) mycount
from account_ownership
group by account) b
on a.account = b.account),
table2 as(
select a.account,holder,myorder,mycount,trunc(myvalue/sumvalue*100) true_value
from table1 a
left join ( select account,sum(myvalue) sumvalue from table1
group by account) b
on a.account = b.account)
select account,holder,case when myorder != mycount then true_value
else 100 - sum_value + true_value end as ownership
from (select account,holder,myorder,mycount,true_value,
sum(true_value) over (partition by account order by myorder) as sum_value
from table2)
Try this. The logic is explained inline.
--Dataset
WITH tbl1 (acct, hldr)
AS (SELECT 'A1', 'H1' FROM DUAL
UNION ALL
SELECT 'A1', 'H2' FROM DUAL
UNION ALL
SELECT 'A2', 'H3' FROM DUAL
UNION ALL
SELECT 'A2', 'H4' FROM DUAL
UNION ALL
SELECT 'A2', 'H5' FROM DUAL),
--dataset end
-- Counting records for each account
tbl2 (acct, cnt)
AS ( SELECT acct, COUNT (1)
FROM tbl1
GROUP BY acct),
-- Counting end
tbl3 (acct,
hldr,
prcnt,
rnk)
AS (SELECT tbl1.acct,
tbl1.hldr,
ROUND (100 / tbl2.cnt) prcnt,
DENSE_RANK ()
OVER (PARTITION BY tbl1.acct ORDER BY tbl1.acct, tbl1.hldr)
rnk
FROM tbl1 INNER JOIN tbl2 ON tbl1.acct = tbl2.acct)
SELECT tbl3.acct,
tbl3.hldr,
CASE
WHEN (rnk = 1) AND MOD (tbl2.cnt, 2) = 1
THEN (prcnt + 1)
ELSE prcnt
END
FROM tbl3
INNER JOIN tbl2
ON tbl3.acct = tbl2.acct
ORDER BY ACCT;
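One way to reason about the requirement (whole numbers, each at least 1, summing to 100 per account) is to pick random cut points between 1 and 99 and use the gaps between them as the shares. The Python sketch below shows that idea; it is a different technique from the SQL answers above, not a translation of them, and the function name random_ownership and its parameters are assumptions.

```python
import random

def random_ownership(n, total=100, rng=random):
    """Split `total` into n random positive integers that sum to `total`.

    Chooses n - 1 distinct cut points in 1..total-1; the gaps between
    consecutive cut points are the shares, so each share is at least 1.
    """
    if not 1 <= n <= total:
        raise ValueError("n must be between 1 and total")
    cuts = sorted(rng.sample(range(1, total), n - 1))
    bounds = [0] + cuts + [total]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]

# Holders per account from the question: A1 has 2, A2 has 3, A3 has 4.
for account, holders in [("A1", 2), ("A2", 3), ("A3", 4)]:
    print(account, random_ownership(holders))
```

With two or more holders every share falls between 1 and 99. An account with a single holder necessarily gets 100.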

Average sum ignoring max or min

I want to use AVG to get the average of some values, but ignoring the max and min values only if they are 1.5 above the second max or 1.5 below the second min, respectively. Here are some examples:
Example 1:
SELECT *
FROM (
SELECT 100.5 v FROM DUAL UNION
SELECT 101.5 v FROM DUAL UNION
SELECT 103.1 v FROM DUAL ) D
I need this result, ignoring the 103.1 value:
100.5
101.5
Example 2:
SELECT *
FROM (
SELECT 100.5 v FROM DUAL UNION
SELECT 101.5 v FROM DUAL UNION
SELECT 103.1 v FROM DUAL UNION
SELECT 106.2 v FROM DUAL) D
I need this result, ignoring only the 106.2 value:
100.5
101.5
103.1
Example 3:
SELECT *
FROM (
SELECT 100.0 v FROM DUAL UNION
SELECT 102.0 v FROM DUAL UNION
SELECT 103.0 v FROM DUAL UNION
SELECT 105.0 v FROM DUAL UNION
SELECT 107.0 v FROM DUAL) D
I need this result, ignoring 100.0 and 107.0 values:
102.0
103.0
105.0
When there are only two values, it doesn't matter.
With the right result, I can AVG(value) correctly.
You need a combination of analytic functions (lead/lag) and conditional aggregation. Here's what I came up with. Note that I allow for multiple groups, the "adjusted" average must be computed for each group separately (a common task in statistics when you must throw out the outliers in each group, when they exist):
with
inputs ( id, val ) as (
select 101, 33 from dual union all
select 102, 23 from dual union all
select 102, 22.8 from dual union all
select 103, 30 from dual union all
select 103, 40 from dual union all
select 104, 90 from dual union all
select 104, 92 from dual union all
select 104, 92 from dual union all
select 104, 91.5 from dual union all
select 104, 91.7 from dual
)
-- End of simulated inputs (for testing only, not part of the solution).
-- SQL query begins BELOW THIS LINE. Use your actual table and column names.
select id,
avg ( case when cnt >= 3
and ( lag_val is null and lead_val - val >= 1.5
or
lead_val is null and val - lag_val >= 1.5
)
then null
else val
end
) as adjusted_avg_val
from (
select id, val, count(val) over (partition by id) as cnt,
lag ( val ) over ( partition by id order by val ) as lag_val,
lead ( val ) over ( partition by id order by val ) as lead_val
from inputs
)
group by id
;
Output:
ID ADJUSTED_AVG_VAL
--- ----------------
101 33
102 22.9
103 35
104 91.8
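The rule itself is easy to check outside SQL. The Python sketch below applies it to the three examples from the question. It assumes the same reading as the query above: the extremes are only dropped when there are at least three values, and "1.5" means a gap of 1.5 or more. The function name trim_outliers is made up for illustration.

```python
def trim_outliers(values, gap=1.5):
    """Drop the min and/or max when it is at least `gap` away from its neighbour."""
    ordered = sorted(values)
    if len(ordered) < 3:
        return ordered  # with one or two values nothing is dropped
    drop_min = ordered[1] - ordered[0] >= gap
    drop_max = ordered[-1] - ordered[-2] >= gap
    start = 1 if drop_min else 0
    end = len(ordered) - 1 if drop_max else len(ordered)
    return ordered[start:end]

print(trim_outliers([100.5, 101.5, 103.1]))                # [100.5, 101.5]
print(trim_outliers([100.5, 101.5, 103.1, 106.2]))         # [100.5, 101.5, 103.1]
print(trim_outliers([100.0, 102.0, 103.0, 105.0, 107.0]))  # [102.0, 103.0, 105.0]
```

The average is then simply the sum of the kept values divided by their count.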
Try using the following combination of row_number, lead and lag:
with cte as (
SELECT 100.5 v FROM DUAL UNION ALL
SELECT 101.5 v FROM DUAL UNION ALL
SELECT 103.1 v FROM DUAL UNION ALL
SELECT 106.2 v FROM DUAL)
-- end of sample data
select avg(v)
from
(
select row_number() over (order by v desc) arn,
row_number() over (order by v) drn,
lag(v) over (order by v) av,
lead(v) over (order by v) dv,
v
from cte
) t
where (arn != 1 and drn != 1) or -- neither the maximum nor the minimum
(drn = 1 and v + 1.5 > dv) or -- the minimum, within 1.5 of the next value
(arn = 1 and v - 1.5 < av) or -- the maximum, within 1.5 of the previous value
(av is null and arn < 3) or -- there are just one or two values
(dv is null and drn < 3) -- there are just one or two values
In SQL you simply just need to express the result, so ...
WITH D as(
SELECT 100.0 v FROM DUAL UNION
SELECT 102.0 FROM DUAL UNION
SELECT 103.0 FROM DUAL UNION
SELECT 105.0 FROM DUAL UNION
SELECT 107.0 FROM DUAL)
SELECT avg(v)
FROM D
where (v < (select max(v) from D )
and ((select max(v) from D )
-(select max(v) from D where v !=
(select max(v) from D ) ) > 1.5))
or
(v > (select min(v) from D )
and ((select min(v) from D )
+(select min(v) from D where v !=
(select min(v) from D ) ) > 1.5))
... should do the trick!
But thinking ahead... the below version may also be useful ;)
WITH D as(
SELECT 1 PK, 100.0 v FROM DUAL UNION
SELECT 1,102.0 FROM DUAL UNION
SELECT 1,103.0 FROM DUAL UNION
SELECT 1,105.0 FROM DUAL UNION
SELECT 1,107.0 FROM DUAL)
SELECT PK,avg(v)
FROM D
where (v < (select max(v) from D group by PK)
and ((select max(v) from D group by PK)
-(select max(v) from D where v !=
(select max(v) from D group by PK) group by PK) > 1.5))
or
(v > (select min(v) from D group by PK)
and ((select min(v) from D group by PK)
+(select min(v) from D where v !=
(select min(v) from D group by PK) group by PK) > 1.5))
GROUP BY PK
In real life though you would consider the execution plans of the above in a large dataset too (homework).
For any further clarifications, I am at your disposal via comments.
Sincerely,
Ted