selecting the highest count for a categorical variable when grouping - sql

I have the following table:
custID Cat
1 A
1 B
1 B
1 B
1 C
2 A
2 A
2 C
3 B
3 C
4 A
4 C
4 C
4 C
What I need is the most efficient way to aggregate by CustID in such a manner that I obtain the most frequent category (cat), the second most frequent and the third. The output of the above should be
custID most_freq 2nd_most_freq 3rd_most_freq
1 B A C
2 A C Null
3 B C Null
4 C A Null
When there is a tie in the count, I do not really care which is first and which is second. For example, for customer 1 the 2nd and 3rd most frequent could be swapped because each of them occurs only once.
Any SQL would be fine, preferably Hive SQL.
Thank you

Try using GROUP BY twice and dense_rank() to sort according to the cat count. I'm not 100% sure, but I think it should work in Hive as well.
select custId,
max(case when t.rn = 1 then cat end) as [most freq],
max(case when t.rn = 2 then cat end) as [2nd most freq],
max(case when t.rn = 3 then cat end) as [3rd most freq]
from
(
select custId, cat, dense_rank() over (partition by custId order by count(*) desc) rn
from your_table
group by custId, cat
) t
group by custId
Following the comments, here is a slightly modified solution that conforms to Hive SQL:
select custId,
max(case when t.rn = 1 then cat else null end) as most_freq,
max(case when t.rn = 2 then cat else null end) as 2nd_most_freq,
max(case when t.rn = 3 then cat else null end) as 3rd_most_freq
from
(
select custId, cat, dense_rank() over (partition by custId order by ct desc) rn
from (
select custId, cat, count(*) ct
from your_table
group by custId, cat
) your_table_with_counts
) t
group by custId
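One thing worth checking is how ties behave. The sketch below runs the nested query on the question's sample data in SQLite (through Python's sqlite3 module, purely so it can be executed; SQLite 3.25+ is assumed for window functions). The table name purchases and the helper top3 are hypothetical. With dense_rank(), tied categories share a rank, so one of them is dropped by max() and the next column stays NULL. Swapping in row_number() with a tie-breaker on cat is one possible way to get the output shown in the question; it is an alternative, not what the answer above uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the rows are the sample data from the question.
conn.execute("create table purchases (custId integer, cat text)")
conn.executemany("insert into purchases values (?, ?)", [
    (1, "A"), (1, "B"), (1, "B"), (1, "B"), (1, "C"),
    (2, "A"), (2, "A"), (2, "C"),
    (3, "B"), (3, "C"),
    (4, "A"), (4, "C"), (4, "C"), (4, "C"),
])

QUERY = """
select custId,
       max(case when rn = 1 then cat end) as most_freq,
       max(case when rn = 2 then cat end) as second_most_freq,
       max(case when rn = 3 then cat end) as third_most_freq
from (
    select custId, cat,
           {window_fn}() over (partition by custId order by {order_by}) as rn
    from (
        select custId, cat, count(*) as ct
        from purchases
        group by custId, cat
    ) counts
) ranked
group by custId
order by custId
"""

def top3(window_fn, order_by):
    """Run the pivot query with the given ranking function and ordering."""
    return conn.execute(
        QUERY.format(window_fn=window_fn, order_by=order_by)
    ).fetchall()

# dense_rank: tied categories get the same rank, so one of them is lost.
with_dense_rank = top3("dense_rank", "ct desc")
# row_number: every category gets its own rank; cat breaks ties.
with_row_number = top3("row_number", "ct desc, cat")

for row in with_dense_rank:
    print("dense_rank:", row)
for row in with_row_number:
    print("row_number:", row)
```

For customer 1, dense_rank() gives A and C the same rank 2, so only one of them appears and the third column is NULL, whereas row_number() fills all three columns.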

SELECT journal, count(*) as frequency
FROM ${hiveconf:TNHIVE}
WHERE journal IS NOT NULL
GROUP BY journal
ORDER BY frequency DESC
LIMIT 5;

Related

Compare the same id with 2 values in string in one table

I have a table like this:
id status grade
123 Overall A
123 Current B
234 Overall B
234 Current D
345 Overall C
345 Current A
How can I display how many ids satisfy the following condition?
The grades are ordered A > B > C > D > F,
and the Overall grade must be greater than or equal to the Current grade.
Do I need to use CASE to convert the grade to a number first?
e.g. A = 4, B = 3, C = 2, D = 1, F = 0
In the table above, id 345 does not meet the condition. How can I produce the tables below?
qty_pass_the_condition qty_fail_the_condition total_ids
2 1 3
and
fail_id
345
Thanks.
Because the grades are sequential, you can rank them with ORDER BY grade DESC to turn each grade into a number. For the first result you can do something like the following:
select
sum(case when GradeRankO >= GradeRankC then 1 else 0 end) AS
qty_pass_the_condition,
sum(case when GradeRankO < GradeRankC then 1 else 0 end) AS
qty_fail_the_condition,
count(*) AS total_ids
from
(
select * from (
select Id,Status,
Rank() over (partition by Id order by grade desc) GradeRankO
from YourTable
) as a where Status='Overall'
) as b
inner join
(
select * from (
select Id,Status,
Rank() over (partition by Id order by grade desc) GradeRankC
from YourTable
) as a where Status='Current'
) as c on b.Id=c.Id
For the second one you can do the following:
select
b.Id fail_id
from
(
select * from (
select Id,Status,
Rank() over (partition by Id order by grade desc) GradeRankO
from Grade
) as a where Status='Overall'
) as b
inner join
(
select * from (
select Id,Status,
Rank() over (partition by Id order by grade desc) GradeRankC
from Grade
) as a where Status='Current'
) as c on b.Id=c.Id
where GradeRankO < GradeRankC
You can use pretty simple conditional aggregation for this, there is no need for window functions.
An id passes when its Overall grade is less than or equal to its Current grade, with "less than" being in A-Z order (A sorts before B, so A is the better grade).
Then aggregate again over the whole table: qty_pass_the_condition is simply the number of non-nulls in Pass, and qty_fail_the_condition is the inverse of that.
SELECT
qty_pass_the_condition = COUNT(t.Pass),
qty_fail_the_condition = COUNT(*) - COUNT(t.Pass),
total_ids = COUNT(*)
FROM (
SELECT
t.id,
Pass = CASE WHEN MIN(CASE WHEN t.status = 'Overall' THEN t.grade END) <=
MIN(CASE WHEN t.status = 'Current' THEN t.grade END)
THEN 1 END
FROM YourTable t
GROUP BY
t.id
) t;
To query the actual failed IDs, simply use a HAVING clause:
SELECT
t.id
FROM YourTable t
GROUP BY
t.id
HAVING MIN(CASE WHEN t.status = 'Overall' THEN t.grade END) >
MIN(CASE WHEN t.status = 'Current' THEN t.grade END);
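The comparisons above rely on the grade letters sorting alphabetically in grade order. If that ever stops being true, the explicit mapping suggested in the question (A = 4 ... F = 0) can be combined with the same conditional aggregation. Below is a minimal sketch of that idea, run in SQLite through Python's sqlite3 module so that it can be executed; the table name grades and the column names overall_score and current_score are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the rows are the sample data from the question.
conn.execute("create table grades (id integer, status text, grade text)")
conn.executemany("insert into grades values (?, ?, ?)", [
    (123, "Overall", "A"), (123, "Current", "B"),
    (234, "Overall", "B"), (234, "Current", "D"),
    (345, "Overall", "C"), (345, "Current", "A"),
])

# One row per id, with both grades converted to numbers (A = 4 ... F = 0).
SCORED = """
select id,
       max(case when status = 'Overall' then score end) as overall_score,
       max(case when status = 'Current' then score end) as current_score
from (
    select id, status,
           case grade when 'A' then 4 when 'B' then 3 when 'C' then 2
                      when 'D' then 1 when 'F' then 0 end as score
    from grades
) g
group by id
"""

# Pass / fail / total counts.
summary = conn.execute(f"""
select sum(case when overall_score >= current_score then 1 else 0 end),
       sum(case when overall_score <  current_score then 1 else 0 end),
       count(*)
from ({SCORED}) s
""").fetchone()

# The ids that fail the condition.
failed_ids = [row[0] for row in conn.execute(f"""
select id from ({SCORED}) s
where overall_score < current_score
order by id
""")]

print(summary)     # (pass, fail, total)
print(failed_ids)
```

With the sample rows this reports two passing ids, one failing id, and 345 as the failing id.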

How to find the highest and second highest entry in SQL in a single query using `GROUP BY`?

Let this be the table that is provided.
PID TID Type Freq
1 1 A 3
1 1 A 2
1 1 A 1
1 1 B 3
1 2 A 4
1 2 B 5
I want to write a query to get an output like this.
PID TID Type Max_Freq_1 Max_Freq_2
1 1 A 3 2
1 1 B 3 NULL
1 2 A 4 NULL
1 2 B 5 NULL
That is, given a combination of PID, TID, Type, what is the highest and second-highest frequency? If there aren't a sufficient number of entries in the table, then put second highest as NULL.
If your database can use the window functions, then the top 2 Freq can be calculated via the DENSE_RANK function.
SELECT PID, TID, Type
, MAX(CASE WHEN Rnk = 1 THEN Freq END) AS Max_Freq_1
, MAX(CASE WHEN Rnk = 2 THEN Freq END) AS Max_Freq_2
FROM
(
SELECT PID, TID, Type, Freq
, DENSE_RANK() OVER (PARTITION BY PID, TID, Type ORDER BY Freq DESC) AS Rnk
FROM YourTable t
) q
GROUP BY PID, TID, Type
ORDER BY PID, TID, Type
pid tid type max_freq_1 max_freq_2
1 1 A 3 2
1 1 B 3 null
1 2 A 4 null
1 2 B 5 null
If window functions such as DENSE_RANK aren't available, then try this.
SELECT PID, TID, Type
, MAX(CASE WHEN Rnk = 1 THEN Freq END) AS Max_Freq_1
, MAX(CASE WHEN Rnk = 2 THEN Freq END) AS Max_Freq_2
FROM
(
SELECT t1.PID, t1.TID, t1.Type, t1.Freq
, COUNT(DISTINCT t2.Freq) AS Rnk
FROM YourTable t1
LEFT JOIN YourTable t2
ON t2.PID = t1.PID
AND t2.TID = t1.TID
AND t2.Type = t1.Type
AND t2.Freq >= t1.Freq
GROUP BY t1.PID, t1.TID, t1.Type, t1.Freq
) q
GROUP BY PID, TID, Type
ORDER BY PID, TID, Type
This is what I came up with on PostgreSQL. Using a window function like row_number is the easiest way to get the result you want.
with t as (
select *, row_number() over (partition by pid, tid, "type" order by freq desc) as r
from test_so
) select pid, tid, "type", max(case when r = 1 then freq end) as "highest", max(case when r = 2 then freq end) as "second_highest"
from t
group by pid, tid, "type"
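The three variants above (DENSE_RANK, the self-join with COUNT(DISTINCT), and ROW_NUMBER) agree on the question's data, but they may differ once a group contains a repeated Freq. The sketch below checks this in SQLite through Python's sqlite3 module (SQLite 3.25+ assumed). The table name freqs, the helper top2, and in particular the extra duplicate row added at the end are hypothetical and not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the rows are the sample data from the question.
conn.execute(
    "create table freqs (pid integer, tid integer, type text, freq integer)"
)
conn.executemany("insert into freqs values (?, ?, ?, ?)", [
    (1, 1, "A", 3), (1, 1, "A", 2), (1, 1, "A", 1),
    (1, 1, "B", 3), (1, 2, "A", 4), (1, 2, "B", 5),
])

# Outer pivot shared by all variants; {ranked} is the ranking subquery.
PIVOT = """
select pid, tid, type,
       max(case when rnk = 1 then freq end),
       max(case when rnk = 2 then freq end)
from ({ranked}) q
group by pid, tid, type
order by pid, tid, type
"""

WINDOW = """
select pid, tid, type, freq,
       {fn}() over (partition by pid, tid, type order by freq desc) as rnk
from freqs
"""

# Rank = number of distinct freq values that are >= this row's freq.
SELF_JOIN = """
select t1.pid, t1.tid, t1.type, t1.freq, count(distinct t2.freq) as rnk
from freqs t1
left join freqs t2
  on t2.pid = t1.pid and t2.tid = t1.tid and t2.type = t1.type
 and t2.freq >= t1.freq
group by t1.pid, t1.tid, t1.type, t1.freq
"""

def top2(ranked_sql):
    """Return (pid, tid, type, highest, second highest) per group."""
    return conn.execute(PIVOT.format(ranked=ranked_sql)).fetchall()

dense = top2(WINDOW.format(fn="dense_rank"))
joined = top2(SELF_JOIN)
numbered = top2(WINDOW.format(fn="row_number"))

# Hypothetical extra row: a second Freq = 3 for (1, 1, 'A').
conn.execute("insert into freqs values (1, 1, 'A', 3)")
dense_tie = top2(WINDOW.format(fn="dense_rank"))[0]
joined_tie = top2(SELF_JOIN)[0]
numbered_tie = top2(WINDOW.format(fn="row_number"))[0]
print(dense_tie, joined_tie, numbered_tie)
```

With the duplicate in place, DENSE_RANK and the self-join still report 3 and 2 for that group, while ROW_NUMBER reports 3 and 3, so the choice depends on whether "second highest" should mean the second distinct value or the second row.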

How to sample from different values in a column but only return records that are unique from another column?

I am struggling with a sampling issue using Teradata.
Below is the format of the data:
ID Group Rank
1 dog 1
1 cat 1
1 lion 1
1 elephant 2
2 dog 1
2 cat 1
2 lion 1
2 elephant 1
3 dog 1
3 cat 2
3 lion 1
3 elephant 1
4 dog 2
4 cat 1
4 lion 1
4 elephant 1
...
Ideally, I would like to return a given number of sampled rows for each value of Group, but with each ID appearing only once.
Below is the query I currently have, but it returns duplicate IDs:
SELECT ID, Group FROM Table
WHERE rank = 1
SAMPLE
WHEN group = 'dog' then 10
WHEN group = 'cat' then 10
WHEN group = 'elephant' then 5
WHEN group = 'lion' then 5
END
with cte as
(
SELECT ID, Group,
random(1,10000) as rnd -- RANDOM can't be directly used in OLAP-functions
FROM Table
WHERE rank = 1
)
SELECT ID, Group
FROM cte
QUALIFY
ROW_NUMBER() -- get one random row per ID
OVER (PARTITION BY ID
ORDER BY rnd) = 1
SAMPLE
WHEN group = 'dog' then 10
WHEN group = 'cat' then 10
WHEN group = 'elephant' then 5
WHEN group = 'lion' then 5
END
Assuming you have enough records, choose a random row for each id and then choose the appropriate numbers from that:
select t.*
from (select t.*,
row_number() over (partition by group order by seqnum) as seqnum_g
from (select t.*,
row_number() over (partition by id order by random(1, 1000000)) as seqnum
from t
) t
where seqnum = 1
) t
where (group in ('dog', 'cat') and seqnum_g <= 10) or
(group in ('elephant', 'lion') and seqnum_g <= 5) ;
This doesn't guarantee that the groups will be big enough in the result set. But if you have enough data relative to the size of the groups, then it should work.
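The two-step idea (first one random row per id, then a cap per group) can be tried outside Teradata as well. The sketch below is an emulation in SQLite through Python's sqlite3 module, not Teradata syntax: it uses SQLite's random() instead of random(1, 1000000), and there is no SAMPLE or QUALIFY. The table animals, the column names grp and rnk, and the generated data (200 ids, each with all four groups at rank 1) are all hypothetical.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
# Hypothetical table and data: 200 ids, each with all four groups at rank 1.
conn.execute("create table animals (id integer, grp text, rnk integer)")
conn.executemany(
    "insert into animals values (?, ?, 1)",
    [(i, g) for i in range(1, 201)
            for g in ("dog", "cat", "lion", "elephant")],
)

SAMPLE = """
select id, grp
from (
    -- number the surviving rows within each group, in random order
    select id, grp,
           row_number() over (partition by grp order by random()) as seqnum_g
    from (
        -- pick one random row per id
        select id, grp,
               row_number() over (partition by id order by random()) as seqnum
        from animals
        where rnk = 1
    ) per_id
    where seqnum = 1
) per_group
where (grp in ('dog', 'cat') and seqnum_g <= 10)
   or (grp in ('elephant', 'lion') and seqnum_g <= 5)
"""

sample = conn.execute(SAMPLE).fetchall()
per_group = Counter(grp for _, grp in sample)

print(len(sample), dict(per_group))
```

Each id appears at most once because the per-group cap is applied only after the per-id pick. As noted above, a group can come back smaller than its target if too few ids happened to land in it.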

Pivot data in SQL (repeated levels)

I have a question regarding pivoting data in SQL.
Input data:
TABLE NAME temp
id cat value
1 A 22
1 B 33
1 C 44
1 C 55
My ideal output would be:
id A B C
1 22 33 44
1 22 33 55
Can someone provide some hints on this?
Thanks!
select * from
(
select
id,cat,value
from tablename
)
as tablo
pivot
(
sum(value)
for cat in ([A],[B],[C])
) as p
order by id
Use CASE WHEN, assuming you made a mistake in the second row of the output format:
select id, max( case when cat='A' then value end) as A,
max(case when cat='B' then value end) as B,
max(case when cat='C' then value end)as C from table
group by id
You need the row_number() function with conditional aggregation:
select id, max(case when cat = 'A' then value end) a,
max(case when cat = 'B' then value end) b,
max(case when cat = 'C' then value end) c
from (select t.*, row_number() over (partition by id, cat order by value) as seq
from table t
) t
group by id, seq;
However, it doesn't produce your exact output (it leaves NULLs where a cat has fewer values than the other cats), but it gives the idea of how to do that.
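To make the difference concrete, the sketch below runs the row_number() query on the question's data in SQLite (through Python's sqlite3 module, SQLite 3.25+ assumed) and then tries a different technique, a self-join per category, which happens to reproduce the output asked for. The self-join is my own assumption about what the asker wants (every combination of A, B and C values per id) and is not taken from any of the answers. The table name temp_data stands in for temp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name standing in for "temp"; rows are from the question.
conn.execute("create table temp_data (id integer, cat text, value integer)")
conn.executemany("insert into temp_data values (?, ?, ?)", [
    (1, "A", 22), (1, "B", 33), (1, "C", 44), (1, "C", 55),
])

# row_number() + conditional aggregation, as in the answer above.
by_seq = conn.execute("""
select id,
       max(case when cat = 'A' then value end) as a,
       max(case when cat = 'B' then value end) as b,
       max(case when cat = 'C' then value end) as c
from (
    select t.*,
           row_number() over (partition by id, cat order by value) as seq
    from temp_data t
) t
group by id, seq
order by id, seq
""").fetchall()

# Alternative (assumption): join one copy of the table per category, which
# yields every A/B/C combination for each id.
by_join = conn.execute("""
select a.id, a.value, b.value, c.value
from temp_data a
join temp_data b on b.id = a.id and b.cat = 'B'
join temp_data c on c.id = a.id and c.cat = 'C'
where a.cat = 'A'
order by a.id, a.value, b.value, c.value
""").fetchall()

print(by_seq)
print(by_join)
```

The first query leaves NULLs in the second row for A and B, while the join repeats 22 and 33. Note that the join multiplies rows if more than one category has several values, which may or may not be what is wanted.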
Use CASE WHEN and MAX aggregation:
select id, max(case when cat='A' then value end) as A,max(case when cat='B' then value end) as B,
max(case when cat='C' then value end) as C from temp
group by id

SQL Select Distinct row values as Column headers maintain individual row values

I have a table that looks like this:
QuestionNum AnswerChoice
1 a
1 a
2 b
2 b
2 a
3 c
3 d
3 c
4 a
4 b
I would like to select the distinct values from the QuestionNum column as column headers and still list each answer choice underneath, so it should look like this:
1 2 3 4
a b c a
a b d b
  a c
I started looking at pivot tables, but the QuestionNum values are going to be unknown. Also, I couldn't figure out a way to select multiple rows from the original.
You can do this with conditional aggregation. The challenge is that you need a key, and row_number() provides the key:
select max(case when QuestionNum = 1 then AnswerChoice end) as q_1,
max(case when QuestionNum = 2 then AnswerChoice end) as q_2,
max(case when QuestionNum = 3 then AnswerChoice end) as q_3,
max(case when QuestionNum = 4 then AnswerChoice end) as q_4
from (select t.*,
row_number() over (partition by QuestionNum order by examInstanceID) as seqnum
from table t
) t
group by seqnum;
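Since the question says the QuestionNum values are not known in advance, the column list has to be built at run time. The sketch below shows one possible way to do that from client code, using SQLite through Python's sqlite3 module (SQLite 3.25+ assumed). The table name answers is hypothetical, and because the sample table has no examInstanceID column, SQLite's implicit rowid (insertion order) is used as the ordering key, which is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the rows are the sample data from the question.
conn.execute("create table answers (QuestionNum integer, AnswerChoice text)")
conn.executemany("insert into answers values (?, ?)", [
    (1, "a"), (1, "a"),
    (2, "b"), (2, "b"), (2, "a"),
    (3, "c"), (3, "d"), (3, "c"),
    (4, "a"), (4, "b"),
])

# Discover the question numbers first, then build one column per number.
question_nums = [row[0] for row in conn.execute(
    "select distinct QuestionNum from answers order by QuestionNum"
)]
columns = ",\n       ".join(
    # int() guards the value that is interpolated into the SQL text.
    f"max(case when QuestionNum = {int(n)} then AnswerChoice end) as q_{int(n)}"
    for n in question_nums
)

query = f"""
select {columns}
from (
    select QuestionNum, AnswerChoice,
           row_number() over (partition by QuestionNum order by rowid) as seqnum
    from answers
) t
group by seqnum
order by seqnum
"""

pivoted = conn.execute(query).fetchall()
for row in pivoted:
    print(row)
```

Questions with fewer answers than the longest one come back as NULL (None in Python) in the trailing rows, which corresponds to the blank cells in the desired output.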