Automating Repeated Unions - sql

I'm running a query like this:
SELECT id FROM table
WHERE table.type IN (1, 2, 3)
LIMIT 15
This returns a random sampling. I might have 7 items from class_1 and 3 items from class_2. I would like to return exactly 5 items from each class, and the following code works:
SELECT id FROM (
SELECT id, type FROM table WHERE type = 1 LIMIT 5
UNION
SELECT id, type FROM table WHERE type = 2 LIMIT 5
UNION ...
ORDER BY type ASC)
This gets unwieldy if I want a random sampling from ten classes, instead of only three. What is the best way to do this?
(I'm using Presto/Hive, so any tips for those engines would be appreciated).

Use a function like row_number to do this. This makes the selection independent of the number of types.
SELECT id,type
FROM (SELECT id, type, row_number() over(partition by type order by id) as rnum --adjust the partition by and order by columns as needed
FROM table
) T
WHERE rnum <= 5
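To try the idea without a Presto/Hive cluster, here is a minimal sketch that runs the same row_number() pattern against an in-memory SQLite database from Python. The table name `items`, the helper `top_n_per_type` and the sample data are made up for illustration, and SQLite (3.25 or newer, for window functions) is only a stand-in for your engine.

```python
import sqlite3

def top_n_per_type(rows, n):
    """Return up to n (id, type) rows per type, lowest ids first."""
    conn = sqlite3.connect(":memory:")
    # "items" is a hypothetical table name standing in for the asker's table
    conn.execute("CREATE TABLE items (id INTEGER, type INTEGER)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT id, type
        FROM (SELECT id, type,
                     row_number() OVER (PARTITION BY type ORDER BY id) AS rnum
              FROM items) t
        WHERE rnum <= ?
        ORDER BY type, id
        """,
        (n,),
    ).fetchall()
    conn.close()
    return result

# ten types with eight rows each; ids are unique per row
sample_rows = [(t * 100 + i, t) for t in range(1, 11) for i in range(8)]
picked = top_n_per_type(sample_rows, 5)
print(len(picked))  # 5 rows for each of the 10 types
```

The query text does not change when the number of types grows, which is the point of using a window function instead of one UNION branch per type.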

I would strongly suggest adding ORDER BY. Anyway, you can do something like:
with
x as (
select
id,
type,
row_number() over(partition by type order by id) as rn
from table
)
select * from x where rn <= 5
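Since the question asks for a random sample per class, one option is to order the window by a random value instead of by id. The sketch below shows that variation, again using SQLite from Python as a stand-in engine; the table name `items` and the helper name are assumptions, and on Presto or Hive you would swap in the engine's own random function (I believe `rand()` works on both, but check your version's docs).

```python
import sqlite3
from collections import Counter

def random_sample_per_type(rows, n):
    """Return up to n randomly chosen (id, type) rows for each type."""
    conn = sqlite3.connect(":memory:")
    # "items" is a hypothetical table name used only for this sketch
    conn.execute("CREATE TABLE items (id INTEGER, type INTEGER)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    result = conn.execute(
        """
        WITH x AS (
            SELECT id, type,
                   row_number() OVER (PARTITION BY type ORDER BY random()) AS rn
            FROM items
        )
        SELECT id, type FROM x WHERE rn <= ? ORDER BY type
        """,
        (n,),
    ).fetchall()
    conn.close()
    return result

# type 1 has 8 rows, type 2 has 6 rows, type 3 has only 3 rows
rows = [(i, 1) for i in range(8)] + [(100 + i, 2) for i in range(6)] + [(200 + i, 3) for i in range(3)]
sample = random_sample_per_type(rows, 5)
per_type = Counter(t for _, t in sample)
print(dict(per_type))
```

Which ids come back changes from run to run, but each type contributes at most 5 rows, and a type with fewer than 5 rows simply returns all of them.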

Related

Select 20 results per every column value

I prepared a query that selects data from a table. The table has the columns rank, name and citycode. When I run something like this:
select name, citycode
from tab20
where rank <= 20
I get the rows with rank <= 20 across the whole table. That would be fine, but I have to show the first 20 rows for every citycode. Is it possible to do this in one query? I tried UNION and similar approaches, but they don't work well.
Thanks
You would use the row_number() function. Based on the rank that would be:
select t.*
from (select t.*,
row_number() over (partition by citycode order by rank) as seqnum
from tab20 t
) t
where seqnum <= 20;

Take recommendation from 3 different tables

I have 3 different recommendation models that give me their output in three different tables.
Recommendation 1: In an ideal situation, I want to take the top 2 recommendations per user from this table, ordered by ProductRecommendation ascending.
Recommendation 2: In an ideal situation, I want to take the top 3 recommendations per user from this table, based on top score.
Recommendation 3: In an ideal situation, take the remaining recommendations from this table to add up to 5 recommendations per user.
In the end, I want to see a final output that merges all the recommendations into one, which would look like this.
I want the top 5 recommendations per user across the 3 tables. Note that not every user ID appears in every table. Ideally, I want the top 2 from Recommendation 1 and the top 3 from Recommendation 2. Recommendation 3 is only there as a fallback: if the first two tables do not supply enough recommendations, Recommendation 3 fills the gap so that I end up with 5 results per UserID. I don't need Recommendation 3 at all if I can get 5 recommendations from the first two tables (2 from Recommendation 1 and 3 from Recommendation 2). When Recommendation 1 has fewer than 2 recommendations for a user, I want to take the remainder from Recommendation 2. For example, if there is 1 recommendation in Recommendation1, take 4 from Recommendation2; if there are 0 in Recommendation1, take 5 from Recommendation2. Only if Recommendation1 and Recommendation2 together don't add up to 5 do I need Recommendation3. I need to do this in BigQuery SQL. Can you please help?
Thanks for your help.
Consider the approach below:
with output1 as (
select *, null as Score, row_number() over win pos
from Recommendation1
where true
qualify row_number() over win <= 2
window win as (partition by UserID order by ProductRecommendation)
), output2 as (
select *, 2 + row_number() over win pos
from Recommendation2
where not (UserID, ProductRecommendation) in (select as struct UserID, ProductRecommendation from output1)
qualify row_number() over win <= 5
window win as (partition by UserID order by Score desc)
), output3 as (
select *, 7 + row_number() over win pos
from Recommendation3
where not (UserID, ProductRecommendation) in (select as struct UserID, ProductRecommendation from output1)
and not (UserID, ProductRecommendation) in (select as struct UserID, ProductRecommendation from output2)
qualify row_number() over win <= 5
window win as (partition by UserID order by Score desc)
)
select * except(pos) from (
select * from output1 union all
select * from output2 union all
select * from output3
)
where true
qualify row_number() over win <=5
window win as (partition by UserID order by pos)
# order by UserID, pos
If applied to the sample data in your question, the output is
Your description is a bit unclear. The following takes 2 rows from the first table for each user, 3 from the second, and additional rows from the third. The outer query then ensures that there are 5 rows (if available) for each user:
select r.*
from ((select userid, recommendation, 1 as which
from recommendation1
where 1=1
qualify row_number() over (partition by userid order by recommendation) <= 2
) union all
(select userid, recommendation, 2 as which
from recommendation2
where 1=1
qualify row_number() over (partition by userid order by score desc) <= 3
) union all
(select userid, recommendation, 3 as which
from recommendation3
)
) r
where 1=1
qualify row_number() over (partition by userid order by which) <= 5;
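For anyone who wants to experiment with the backfill logic outside BigQuery, here is a rough sketch using SQLite from Python. SQLite has no QUALIFY, so the filters are written as plain subqueries. The table names `rec1`, `rec2`, `rec3`, the column name `product` and the sample rows are all invented for this sketch. It caps the first table at 2 rows per user, lets the second table fill up to 5, and only then falls back to the third; a product already taken from an earlier table is skipped.

```python
import sqlite3

MERGE_SQL = """
WITH tagged AS (
    -- tag every row with its source table and its rank within that source
    SELECT userid, product, 1 AS which,
           row_number() OVER (PARTITION BY userid ORDER BY product) AS pos
    FROM rec1
    UNION ALL
    SELECT userid, product, 2,
           row_number() OVER (PARTITION BY userid ORDER BY score DESC)
    FROM rec2
    UNION ALL
    SELECT userid, product, 3,
           row_number() OVER (PARTITION BY userid ORDER BY score DESC)
    FROM rec3
),
capped AS (
    -- the first source may contribute at most 2 rows per user
    SELECT * FROM tagged WHERE NOT (which = 1 AND pos > 2)
),
deduped AS (
    -- keep each product once per user, preferring the earlier source
    SELECT *, row_number() OVER (PARTITION BY userid, product
                                 ORDER BY which, pos) AS dup
    FROM capped
),
ranked AS (
    SELECT userid, product, which,
           row_number() OVER (PARTITION BY userid ORDER BY which, pos) AS final_pos
    FROM deduped
    WHERE dup = 1
)
SELECT userid, product, which
FROM ranked
WHERE final_pos <= 5
ORDER BY userid, final_pos
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rec1 (userid TEXT, product TEXT)")
conn.execute("CREATE TABLE rec2 (userid TEXT, product TEXT, score REAL)")
conn.execute("CREATE TABLE rec3 (userid TEXT, product TEXT, score REAL)")
conn.executemany("INSERT INTO rec1 VALUES (?, ?)",
                 [("u1", "A"), ("u1", "B"), ("u1", "C")])
conn.executemany("INSERT INTO rec2 VALUES (?, ?, ?)",
                 [("u1", "B", 0.9), ("u1", "D", 0.8), ("u1", "E", 0.7),
                  ("u1", "F", 0.6), ("u2", "X", 0.5)])
conn.executemany("INSERT INTO rec3 VALUES (?, ?, ?)",
                 [("u1", "G", 0.4), ("u2", "Y", 0.9), ("u2", "Z", 0.1)])
merged = conn.execute(MERGE_SQL).fetchall()
conn.close()
for row in merged:
    print(row)
```

In this made-up data, u1 gets A and B from the first table and D, E, F from the second, so the third table is never consulted. u2 has nothing in the first table and only one row in the second, so the remaining rows come from the third table and the user ends up with 3 rows in total.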

Finding top count of a value in a table using SQL

I'm looking for a way to find the top count value of a column by SQL.
If for example this is my data
id type
----------
1 A
1 B
1 A
2 C
2 D
2 D
I would like the result to be:
1 A
2 D
I'm looking for a way to do it without grouping by the column I count (type in the example).
Thanks
Statistically, this is called the "mode". You can calculate it using window functions:
select id, type, cnt
from (select id, type, count(*) as cnt,
row_number() over (partition by id order by count(*) desc) as seqnum
from t
group by id, type
) t
where seqnum = 1;
If there are ties, then an arbitrary value is chosen from among the ties.
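Here is a small sketch that runs this approach on the sample data from the question, using SQLite from Python as a stand-in engine. The table name `t` follows the answer. I wrote the aggregation as its own subquery to keep the sketch simple, and the extra `type` in the window ordering is my own addition so that ties are broken the same way on every run.

```python
import sqlite3

def mode_per_id(rows):
    """Return (id, type, cnt) for the most frequent type of each id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, type TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT id, type, cnt
        FROM (SELECT id, type, cnt,
                     -- ties on cnt are broken alphabetically by type (an assumption)
                     row_number() OVER (PARTITION BY id
                                        ORDER BY cnt DESC, type) AS seqnum
              FROM (SELECT id, type, count(*) AS cnt
                    FROM t
                    GROUP BY id, type) g
             ) s
        WHERE seqnum = 1
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return result

question_rows = [(1, "A"), (1, "B"), (1, "A"), (2, "C"), (2, "D"), (2, "D")]
modes = mode_per_id(question_rows)
print(modes)  # [(1, 'A', 2), (2, 'D', 2)]
```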
You are looking for the statistical mode (the most frequently occurring value):
select id, stats_mode(type)
from mytable
group by id
order by id;
Not every DBMS supports this, however. Check your docs to see whether this function or a similar one is available in your DBMS.
Just GROUP BY id, type and keep the rows with the maximum counter:
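select id, type
from tablename t1
group by id, type
having count(*) = (
-- correlate on id so the maximum is taken per id, not across the whole table
select count(*) from tablename t2 where t2.id = t1.id group by t2.type order by count(*) desc limit 1
)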
See the demo
Or
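select id, type
from tablename t1
group by id, type
having count(*) = (
-- restrict to the current id so the maximum is taken per id, not across the whole table
select max(t.counter) from (select id, count(*) counter from tablename group by id, type) t where t.id = t1.id
)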
See the demo

SQL Ranking N records by one criteria and N records by another and repeat

In my table I have 4 columns: Id, Type, InitialRanking and FinalRanking. Based on certain criteria I've managed to apply InitialRanking to the records (1-20). I now need to apply FinalRanking by identifying the top 7 of Type 1 followed by the top 3 of Type 2. Then I need to repeat this until all records have a FinalRanking. My goal is to achieve the output in the final column of the attached image.
The 7 & 3 will vary over time but for the purposes of this example let’s say they are fixed.
You can try something like this:
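SELECT ID, TYPE, INITIALRANK, FINALRANK FROM (
-- first 7 rows of type 1; ordering by INITIALRANK is an assumption
SELECT TOP 7 ID, TYPE, INITIALRANK, FINALRANK
FROM table WHERE TYPE = 1 ORDER BY INITIALRANK
) t1
UNION ALL
SELECT ID, TYPE, INITIALRANK, FINALRANK FROM (
-- first 3 rows of type 2
SELECT TOP 3 ID, TYPE, INITIALRANK, FINALRANK
FROM table WHERE TYPE = 2 ORDER BY INITIALRANK
) t2
UNION ALL
-- rows of any other type, unchanged
SELECT ID, TYPE, INITIALRANK, FINALRANK
FROM table WHERE TYPE NOT IN (1, 2)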
A simple (or simplistic) approach to your Final Rank would be the following:
row_number() over (partition by type order by initrank) +
case type
when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(10-7)
when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(10-3)+7
end FinalRank
This can be generalized to more than two groups. For example, with three groups of size 7, 3 and 2, the pattern size is 7+3+2=12. The general form is PartitionedRowNum + (Ceil(PartitionedRowNum/GroupSize) - 1) * (PatternSize - GroupSize) + Offset, where Offset is the sum of the preceding group sizes:
row_number() over (partition by type order by initrank) +
case type
when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(12-7)
when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(12-3)+7
when 3 then (ceil((row_number() over (partition by type order by initrank))/2)-1)*(12-2)+7+3
end FinalRank
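To check the arithmetic, here is the same formula as a small Python sketch. The function names `final_rank` and `interleave` are made up for illustration. Python's `/` is true division here; if your SQL engine does integer division on integers, the expression inside ceil would need a decimal divisor (for example 7.0) to behave the same way.

```python
import math

def final_rank(row_num, group_size, pattern_size, offset):
    """Apply the general form: row_num + (ceil(row_num / group_size) - 1)
    * (pattern_size - group_size) + offset."""
    completed_blocks = math.ceil(row_num / group_size) - 1
    return row_num + completed_blocks * (pattern_size - group_size) + offset

def interleave(group_sizes, rows_per_type):
    """Return {type: [final ranks]} where type k (1-based) has rows_per_type[k-1] rows."""
    pattern_size = sum(group_sizes)
    ranks = {}
    offset = 0  # sum of the preceding group sizes
    for type_id, size in enumerate(group_sizes, start=1):
        count = rows_per_type[type_id - 1]
        ranks[type_id] = [final_rank(rn, size, pattern_size, offset)
                          for rn in range(1, count + 1)]
        offset += size
    return ranks

# 7 of type 1, then 3 of type 2, repeated twice: 14 rows of type 1, 6 of type 2
two_groups = interleave([7, 3], [14, 6])
print(two_groups[1])  # [1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 14, 15, 16, 17]
print(two_groups[2])  # [8, 9, 10, 18, 19, 20]
```

Together the two lists cover 1 through 20 exactly once, which matches the "7 of type 1, then 3 of type 2, repeat" pattern from the question.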

Finding row with max values in two groups

I use SQL Server 2012.
I have the following table:
id, name, surname, timestamp, type
type has two possible values: 1 and 2.
Now I would like to find two rows: for each type (1 and 2), the row with the maximum value within that type.
The problem is that I also need both name and surname from that row.
I can do it with a SELECT TOP 1 ... WHERE ... ORDER BY ... UNION approach, but I would like to find another, better way.
Can you help me?
This sounds like you want the most recent for each row, for each type. If that's the case, here is a way with row_number()
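with cte as(
select
id
,name
,surname
,timestamp
,type
-- partition by type alone if you need exactly one row per type
,RN = row_number() over (partition by id,type order by timestamp desc)
from YourTable -- hypothetical name; the question does not give the table name
)
select *
from cte
where RN = 1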