Finding the top N occurrences per group in Hive

I have a table where each record has two columns: title and category.
I want to find the 2 titles with the most occurrences in each category. Some titles are listed in both categories. How can this be achieved in Hive?
Here is the table creation query:
create table book(category String, title String) row format delimited fields terminated by '\t' stored as textfile;
And example data:
fiction book1
fiction book2
fiction book3
fiction book4
fiction book5
fiction book6
fiction book7
fiction book8
fiction book8
fiction book8
psychology book1
psychology book2
psychology book2
psychology book2
psychology book2
psychology book7
psychology book7
psychology book7
Expected result:
fiction book8
fiction any other
psychology book2
psychology book7
Currently I've managed to write this query:
SELECT *
FROM (SELECT category, title,
             count(*) AS sale_count
      FROM book
      GROUP BY category, title) a
ORDER BY category, sale_count DESC;
That gives the count for each title in each category, but I can't find a way to return only the top 2 records from each category.

To return only the top two records per category, use row_number():
select category, title, sale_count
from
(
  SELECT a.*,
         row_number() over(partition by category order by sale_count desc) rn
  FROM
  (SELECT category, title,
          count(*) as sale_count
   from book
   Group BY category, title) a
) s where rn <= 2
order by category, sale_count DESC;
If several titles share the same sale_count and you need to return every title for the two highest counts, use DENSE_RANK instead of row_number; it assigns the same rank to titles with the same sale_count.
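
The difference between the two ranking functions can be checked on the question's sample data. The following is a minimal sketch, assuming Python's bundled sqlite3 module (SQLite 3.25 or later, for window functions) as a stand-in for Hive; the helper name top_titles and the extra title tie-breaker in the final ORDER BY are illustrative additions, not part of the original answer.

```python
import sqlite3

# Sample rows from the question: (category, title), one row per occurrence.
ROWS = (
    [("fiction", f"book{i}") for i in range(1, 8)]
    + [("fiction", "book8")] * 3
    + [("psychology", "book1")]
    + [("psychology", "book2")] * 4
    + [("psychology", "book7")] * 3
)

# Same shape as the answer's query; only the ranking function varies.
QUERY = """
SELECT category, title, sale_count
FROM (
    SELECT a.*,
           {rank_fn}() OVER (PARTITION BY category
                             ORDER BY sale_count DESC) AS rnk
    FROM (SELECT category, title, COUNT(*) AS sale_count
          FROM book
          GROUP BY category, title) a
) s
WHERE rnk <= 2
ORDER BY category, sale_count DESC, title
"""


def top_titles(conn, rank_fn):
    """Run the top-2 query with ROW_NUMBER or DENSE_RANK (hypothetical helper)."""
    if rank_fn not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError(f"unsupported ranking function: {rank_fn}")
    return conn.execute(QUERY.format(rank_fn=rank_fn)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (category TEXT, title TEXT)")
conn.executemany("INSERT INTO book VALUES (?, ?)", ROWS)

by_row_number = top_titles(conn, "ROW_NUMBER")  # exactly 2 rows per category
by_dense_rank = top_titles(conn, "DENSE_RANK")  # all titles with the 2 highest counts
print(len(by_row_number), len(by_dense_rank))
```

With ROW_NUMBER the second fiction row is an arbitrary pick among the seven titles tied at one occurrence, which matches the "any other" in the expected result; DENSE_RANK returns all seven of them.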

Related

How to apply group by here?

I have a table Movie with columns Movie and User, where each movie can be viewed by any user any number of times, so the table can contain duplicate rows. I want to find the Top N most viewed movies and then the Top K viewers for each of the Top N movies. How can I apply group by or partition by effectively in such a scenario? If there is a better approach, please share. Thanks!
Movie     User
================
Avengers  John
Batman    Chris
Batman    Ron
X-Men     Chris
X-Men     Ron
Matrix    John
Batman    Martin
Matrix    Chris
Batman    Chris
X-Men     Ron
So, in this table Batman is the most watched movie, followed by X-Men, so I want the result table to look like:
Movie     User    View count
============================
Batman    Chris   2
Batman    Ron     1
Batman    Martin  1
X-Men     Ron     2
X-Men     Chris   1
Matrix    John    1
Matrix    Chris   1
Avengers  John    1
I understand that I can group by movie and then order by count(*) desc, but this doesn't give me the second column, which is grouped by viewer, or the count for each viewer.
Consider the approach below (assuming Top 3 movies with Top 2 users):
select movie, user, view_count
from (
  select distinct *,
    count(*) over(partition by movie) movie_views,
    count(*) over(partition by movie, user) view_count
  from your_table
)
qualify dense_rank() over(order by movie_views desc) <= 3
and row_number() over(partition by movie order by view_count desc) <= 2
-- order by movie_views desc, view_count desc
Applied to the sample data in your question, this keeps the top 3 movies (Batman, X-Men and Matrix) with up to 2 users each.
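
QUALIFY is not available in every database (SQLite, for instance, lacks it), so the same filter can be written by computing the ranks in a subquery and filtering in an outer WHERE. The following is a rough sketch of that idea using Python's sqlite3 module; the table name watch_log, the column name viewer, and the tie-breaker on viewer name are assumptions made for the example, not part of the answer above.

```python
import sqlite3

# Sample rows from the question: one (movie, viewer) row per view.
VIEWS = [
    ("Avengers", "John"), ("Batman", "Chris"), ("Batman", "Ron"),
    ("X-Men", "Chris"), ("X-Men", "Ron"), ("Matrix", "John"),
    ("Batman", "Martin"), ("Matrix", "Chris"), ("Batman", "Chris"),
    ("X-Men", "Ron"),
]

QUERY = """
WITH per_viewer AS (          -- views per (movie, viewer) pair
    SELECT movie, viewer, COUNT(*) AS view_count
    FROM watch_log
    GROUP BY movie, viewer
),
ranked AS (                   -- total views per movie, viewer rank within movie
    SELECT movie, viewer, view_count,
           SUM(view_count) OVER (PARTITION BY movie) AS movie_views,
           ROW_NUMBER() OVER (PARTITION BY movie
                              ORDER BY view_count DESC, viewer) AS viewer_rn
    FROM per_viewer
),
with_movie_rank AS (          -- rank movies by their total views
    SELECT ranked.*,
           DENSE_RANK() OVER (ORDER BY movie_views DESC) AS movie_rank
    FROM ranked
)
SELECT movie, viewer, view_count
FROM with_movie_rank
WHERE movie_rank <= :n AND viewer_rn <= :k
ORDER BY movie_views DESC, movie, view_count DESC, viewer
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watch_log (movie TEXT, viewer TEXT)")
conn.executemany("INSERT INTO watch_log VALUES (?, ?)", VIEWS)

# Top 3 movies, top 2 viewers for each.
top = conn.execute(QUERY, {"n": 3, "k": 2}).fetchall()
for row in top:
    print(row)
```

Ordering viewers by name inside ROW_NUMBER makes the choice between tied viewers (Ron and Martin for Batman) deterministic; without a tie-breaker either one may be returned.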

I'd like some help to write sql code to return a list of customer data items ranked by frequency (high to low)

The table I am querying has several thousand rows and numerous fields - I'd like the code to return the top 10 values for a handful of the fields, namely: Forename, Surname and City - I'd also like to see a count of the values returned.
For example:
Ranking  Forename   FName Frequency  Surname    SName Frequency  City          City Frequency
=============================================================================================
1        Liam       830,091          Smith      2,353,709        New York      2,679,785
2        Mary       708,390          Johnson    1,562,990        Los Angeles   413,359
3        Noah       639,592          Williams   792,306          Chicago       393,511
4        Patricia   568,410          Brown      743,346          Houston       367,496
5        William    557,049          Jones      633,933          Phoenix       336,929
6        Linda      497,138          Miller     503,523          Philadelphia  304,638
7        James      490,665          Davis      503,115          San Antonio   255,142
8        Barbara    418,312          Garcia     468,683          San Diego     238,521
9        Logan      399,947          Rodriguez  461,816          Dallas        232,718
10       Elizabeth  399,737          Wilson     436,843          San Jose      213,483
The returned list should be interpreted thus:
The most frequently occurring forename in the table is Liam - with 830,091 instances,
The 5th most frequently occurring forename is William - with 557,049 instances,
The 8th most frequently occurring city is San Diego - with 238,521 instances
...and so on
(N.b. the table does not show there are 2.7m Liams in New York - just that there are 830,091 Liams in the entire table - and that there are 2,679,785 New York addresses in the entire table)
The following produces what I need, but only for the first field (Forename). I'd like to be able to do the same for all three fields:
SELECT Forename, COUNT(Forename) AS FName_Frequency
FROM Customer_Table
GROUP BY Forename
ORDER BY FName_Frequency DESC
limit 10
Thanks in anticipation
I would just put this in separate rows:
(select 'forename' as which, forename as col, count(*) as freq
 from customer_table
 group by forename
 order by freq desc
 fetch first 10 rows only
) union all
(select 'surname', surname, count(*) as freq
 from customer_table
 group by surname
 order by freq desc
 fetch first 10 rows only
) union all
(select 'city', city, count(*) as freq
 from customer_table
 group by city
 order by freq desc
 fetch first 10 rows only
);
Note that this uses Standard SQL syntax, because you have not tagged the question with the database you are using. Each subquery is parenthesized so that its own order by and fetch first apply before the union. You can also put this in separate columns, using:
select max(case when which = 'forename' then col end),
max(case when which = 'forename' then freq end),
max(case when which = 'surname' then col end),
max(case when which = 'surname' then freq end),
max(case when which = 'city' then col end),
max(case when which = 'city' then freq end)
from ((select 'forename' as which, forename as col, count(*) as freq,
row_number() over (order by count(*) desc) as seqnum
from customer_table
group by forename
) union all
(select 'surname' as which, surname, count(*) as freq,
row_number() over (order by count(*) desc) as seqnum
from customer_table
group by surname
) union all
(select 'city', city, count(*) as freq,
row_number() over (order by count(*) desc) as seqnum
from customer_table
group by city
)
) x
group by seqnum;
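
The pivot-by-seqnum idea can be tried on a small scale. The following is a minimal sketch using Python's sqlite3 module; the six sample customers, the helper names ranked_counts and top_values, and the tie-breaker on the value itself are made up for illustration. It counts in an inner subquery and ranks in an outer one, a slight variation on the query above that avoids mixing an aggregate and a window function in one SELECT.

```python
import sqlite3

# Hypothetical sample data: (forename, surname, city).
CUSTOMERS = [
    ("Liam", "Smith", "New York"),
    ("Liam", "Smith", "Chicago"),
    ("Liam", "Jones", "New York"),
    ("Mary", "Smith", "New York"),
    ("Mary", "Brown", "Chicago"),
    ("Noah", "Jones", "Houston"),
]

# Fixed list of columns to rank; never built from user input.
COLUMNS = ("forename", "surname", "city")


def ranked_counts(column):
    """SQL that counts one column's values and numbers them by frequency."""
    return f"""
        SELECT '{column}' AS which, val, freq,
               ROW_NUMBER() OVER (ORDER BY freq DESC, val) AS seqnum
        FROM (SELECT {column} AS val, COUNT(*) AS freq
              FROM customer_table
              GROUP BY {column})
    """


def top_values(conn, limit):
    """Return one row per rank: (rank, forename, freq, surname, freq, city, freq)."""
    union = " UNION ALL ".join(ranked_counts(c) for c in COLUMNS)
    pivots = ", ".join(
        f"MAX(CASE WHEN which = '{c}' THEN val END), "
        f"MAX(CASE WHEN which = '{c}' THEN freq END)"
        for c in COLUMNS
    )
    sql = (
        f"SELECT seqnum, {pivots} FROM ({union}) x "
        "WHERE seqnum <= ? GROUP BY seqnum ORDER BY seqnum"
    )
    return conn.execute(sql, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_table (forename TEXT, surname TEXT, city TEXT)")
conn.executemany("INSERT INTO customer_table VALUES (?, ?, ?)", CUSTOMERS)

top2 = top_values(conn, 2)
for row in top2:
    print(row)
```

Each output row lines up the n-th most frequent forename, surname and city; as in the question, values on the same row are unrelated apart from sharing a rank.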

How to produce detail, not summary, report sorted by count(*)?

Oracle 11g:
I want the results listed by highest count, then ch_id. When I use group by to get the count, I lose the granularity of the detail. Is there an analytic function I could use?
SALES
ch_id desc customer
=========================
ANAR Anari BOB
SWIS Swiss JOE
SWIS Swiss AMY
BRUN Brunost SAM
BRUN Brunost ANN
BRUN Brunost ROB
Desired Results
count ch_id customer
===========================================
3 BRUN ANN
3 BRUN ROB
3 BRUN SAM
2 SWIS AMY
2 SWIS JOE
1 ANAR BOB
Use the analytic count(*):
select * from
(
select count(*) over (partition by ch_id) cnt,
ch_id, customer
from sales
)
order by cnt desc, ch_id
select b.total, s.ch_id, s.customer
from sales s
inner join (select count(*) total, ch_id from sales group by ch_id) b
on b.ch_id = s.ch_id
order by b.total desc, s.ch_id
OK, the other answer posted at the same time, using partition, is the better solution for Oracle. But this one works regardless of the database.
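
Both approaches can be compared on the sample SALES rows. The following is a small sketch using Python's sqlite3 module in place of Oracle; the column is named descr here because desc is a reserved word, and customer is added as a final sort key (an assumption) so the output order is fully deterministic.

```python
import sqlite3

# Sample rows from the question: (ch_id, description, customer).
SALES = [
    ("ANAR", "Anari", "BOB"),
    ("SWIS", "Swiss", "JOE"),
    ("SWIS", "Swiss", "AMY"),
    ("BRUN", "Brunost", "SAM"),
    ("BRUN", "Brunost", "ANN"),
    ("BRUN", "Brunost", "ROB"),
]

# Analytic count: every detail row keeps its identity and gains the group size.
ANALYTIC = """
SELECT COUNT(*) OVER (PARTITION BY ch_id) AS cnt, ch_id, customer
FROM sales
ORDER BY cnt DESC, ch_id, customer
"""

# Join against a grouped subquery: same result without window functions.
JOINED = """
SELECT b.total, s.ch_id, s.customer
FROM sales s
JOIN (SELECT ch_id, COUNT(*) AS total FROM sales GROUP BY ch_id) b
  ON b.ch_id = s.ch_id
ORDER BY b.total DESC, s.ch_id, s.customer
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ch_id TEXT, descr TEXT, customer TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SALES)

analytic_rows = conn.execute(ANALYTIC).fetchall()
joined_rows = conn.execute(JOINED).fetchall()
for row in analytic_rows:
    print(row)
```

The analytic version reads the table once, while the join version reads it twice, which is one reason the window function is usually preferred where it is available.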

What query can I use to get a best seller?

This is my database sample:
id  cart_id  title  quantity
1   1        book1  2
2   2        book2  3
3   3        book1  2
So in this case I have book1 with a total of 4 sales and book2 with 3 sales.
However, if I order by quantity, it shows book2 and then book1.
What query or PHP code can I use to combine the rows for the same item, so that I get book1 with 4 sales and book2 with 3 sales?
SELECT
title,
SUM(quantity) AS total_sales
FROM
books
GROUP BY
title
ORDER BY
total_sales DESC
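
As a quick check, the query can be run against the three sample rows. This is a minimal sketch using Python's sqlite3 module rather than PHP; the table name books comes from the answer, while the column types are assumed.

```python
import sqlite3

# Sample rows from the question: (id, cart_id, title, quantity).
CART_ITEMS = [
    (1, 1, "book1", 2),
    (2, 2, "book2", 3),
    (3, 3, "book1", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER, cart_id INTEGER, title TEXT, quantity INTEGER)"
)
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", CART_ITEMS)

# Sum quantities per title, then sort by the summed total.
best_sellers = conn.execute(
    """
    SELECT title, SUM(quantity) AS total_sales
    FROM books
    GROUP BY title
    ORDER BY total_sales DESC
    """
).fetchall()
print(best_sellers)
```

Adding a LIMIT 1 to the query would return only the single best seller.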

Need a postgresql query to group and order

I have a customers table as following:
customername, ordername, amount
=============================
bob, book, 20
bob, computer, 40
steve, hat, 15
bill, book, 12
bill, computer, 3
steve, pencil, 10
bill, pen, 2
I want to run a query to get the following result:
customername, ordername, amount
=============================
bob, computer, 40
bob, book, 20
bob, ~total~, 60
steve, hat, 15
steve, pencil, 10
steve, ~total~, 25
bill, book, 12
bill, computer, 3
bill, pen, 2
bill, ~total~, 17
I want the amount for each customer to be ordered from max to min and a new ordername as "~total~" (must always be the last row for each customer) with a result as sum of all amount for the same customer.
So, in above example, bob should be the first since the total=60, steve the second (total=25) and bill the third (total=17).
Use:
SELECT x.customername,
x.ordername,
x.amount
FROM (SELECT a.customername,
a.ordername,
a.amount,
y.rk,
1 AS sort
FROM CUSTOMERS a
JOIN (SELECT c.customername,
ROW_NUMBER() OVER (ORDER BY SUM(c.amount) DESC) AS rk
FROM CUSTOMERS c
GROUP BY c.customername) y ON y.customername = a.customername
UNION ALL
SELECT b.customername,
'~total~',
SUM(b.amount),
ROW_NUMBER() OVER (ORDER BY SUM(b.amount) DESC) AS rk,
2 AS sort
FROM CUSTOMERS b
GROUP BY b.customername) x
ORDER BY x.rk, x.customername, x.sort, x.amount DESC
You could look at using GROUP BY ROLLUP, but the ordername value would be NULL so you'd have to post-process it to get that replaced with "~total~"...
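
The UNION ALL approach can be checked against the sample data. The following is a rough sketch using Python's sqlite3 module instead of PostgreSQL (SQLite has no GROUP BY ROLLUP, so that alternative is not shown). As a variation on the query above, and purely as an assumption for the example, it sorts customers by a windowed SUM per customer rather than by a joined ROW_NUMBER.

```python
import sqlite3

# Sample rows from the question: (customername, ordername, amount).
ORDERS = [
    ("bob", "book", 20),
    ("bob", "computer", 40),
    ("steve", "hat", 15),
    ("bill", "book", 12),
    ("bill", "computer", 3),
    ("steve", "pencil", 10),
    ("bill", "pen", 2),
]

QUERY = """
SELECT customername, ordername, amount
FROM (
    -- detail rows, each tagged with its customer's total
    SELECT customername, ordername, amount,
           SUM(amount) OVER (PARTITION BY customername) AS total,
           1 AS sort
    FROM customers
    UNION ALL
    -- one '~total~' row per customer, sorted after the detail rows
    SELECT customername, '~total~', SUM(amount), SUM(amount), 2
    FROM customers
    GROUP BY customername
) x
ORDER BY total DESC, customername, sort, amount DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customername TEXT, ordername TEXT, amount INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", ORDERS)

report = conn.execute(QUERY).fetchall()
for row in report:
    print(row)
```

Sorting on the per-customer total first keeps each customer's rows together in descending order of total, and the sort flag pushes the ~total~ row to the end of each group.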