How can this query be optimized? - sql

I need to write this query in PostgreSQL 9.3:
List the most popular movie in each country. The most popular movie (or movies) is the one with the highest average rating
across all the users of that country. In case of a tie, return all
tied movies, ordered alphabetically. (2 columns)
Tables needed:
CREATE TABLE movie (
id integer,
name varchar(200),
year date
);
CREATE TABLE userProfile (
userid varchar(200),
gender char(1),
age integer,
country varchar(200),
registered date
);
CREATE TABLE ratings (
mid integer,
userid varchar(200),
rating integer
);
CREATE INDEX movie_id_idx ON movie (id);
CREATE INDEX userProfile_userid_idx ON userProfile (userid);
CREATE INDEX ratings_userid_idx ON ratings (userid);
CREATE INDEX ratings_mid_idx ON ratings (mid);
CREATE INDEX ratings_userid_mid_idx ON ratings (userid, mid);
Here is my query:
CREATE TEMP TABLE tops AS SELECT country, name
FROM ratings AS r INNER JOIN userProfile AS u
ON r.userid=u.userid
INNER JOIN movie AS m ON m.id = r.mid LIMIT 0;
~10 min
CREATE TEMP TABLE avg_country AS
SELECT country, r.mid, AVG(rating) AS rate
FROM ratings AS r INNER JOIN userProfile AS u
ON r.userid=u.userid
GROUP BY country, r.mid;
~8 min
DO $$
DECLARE arrow record;
BEGIN
CREATE TABLE movie_names AS SELECT id, name FROM movie;
FOR arrow IN SELECT DISTINCT country FROM userProfile ORDER BY country
LOOP
CREATE TABLE movies AS SELECT mid FROM (SELECT MAX(rate) AS m_rate FROM avg_country
WHERE country=arrow.country) AS max_val CROSS JOIN LATERAL
(SELECT mid FROM avg_country
WHERE country=arrow.country AND rate=max_val.m_rate) AS a;
WITH names AS (DELETE FROM movie_names AS m
WHERE m.id IN (SELECT mid FROM movies) RETURNING name)
INSERT INTO tops
SELECT arrow.country, name FROM names ORDER BY name;
DROP TABLE movies;
END LOOP;
DROP TABLE movie_names;
END$$;
SELECT * FROM tops;
DROP TABLE tops, avg_country;
Thanks a lot in advance)

This is similar to kordirko's answer, but with one fewer subquery:
select country, movie_name, avg_rating
from (select u.country, m.name as movie_name, avg(r.rating) as avg_rating,
rank() over (partition by u.country order by avg(r.rating) desc) as seqnum
from userProfile u join
ratings r
on u.userid = r.userid join
movie m
on r.mid = m.id
group by u.country, m.id, m.name -- `name` must be listed too: `id` is indexed but not declared as a primary key
) uc
where seqnum = 1
order by country, movie_name;
Alternatively, if you want to get the list on one row per country:
select country, string_agg(movie_name, '; ' order by movie_name) as most_popular_movies
from (select u.country, m.name as movie_name, avg(r.rating) as avg_rating,
rank() over (partition by u.country order by avg(r.rating) desc) as seqnum
from userProfile u join
ratings r
on u.userid = r.userid join
movie m
on r.mid = m.id
group by u.country, m.id, m.name -- `name` must be listed too: `id` is indexed but not declared as a primary key
) uc
where seqnum = 1
group by country;
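To see how rank() keeps ties, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. The sample rows are made up, the tables are trimmed to the columns the query needs, and it assumes an SQLite build with window functions (3.25 or later). The aggregation is done in a CTE first and ranked afterwards, which is a slightly more conservative layout than the query above, not a different technique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (id INTEGER, name TEXT);
CREATE TABLE userProfile (userid TEXT, country TEXT);
CREATE TABLE ratings (mid INTEGER, userid TEXT, rating INTEGER);
INSERT INTO movie VALUES (1, 'Alien'), (2, 'Brazil'), (3, 'Casablanca');
INSERT INTO userProfile VALUES ('u1', 'FR'), ('u2', 'FR'), ('u3', 'US');
INSERT INTO ratings VALUES
    (1, 'u1', 5), (1, 'u2', 3),   -- Alien in FR: average 4.0
    (2, 'u1', 4), (2, 'u2', 4),   -- Brazil in FR: average 4.0 (tie with Alien)
    (3, 'u1', 2),                 -- Casablanca in FR: average 2.0
    (3, 'u3', 5), (1, 'u3', 1);   -- US: Casablanca 5.0, Alien 1.0
""")

query = """
WITH avg_by_country AS (
    SELECT u.country, m.name AS movie_name, AVG(r.rating) AS avg_rating
    FROM userProfile u
    JOIN ratings r ON r.userid = u.userid
    JOIN movie m ON m.id = r.mid
    GROUP BY u.country, m.id, m.name
),
ranked AS (
    SELECT country, movie_name,
           RANK() OVER (PARTITION BY country ORDER BY avg_rating DESC) AS seqnum
    FROM avg_by_country
)
SELECT country, movie_name
FROM ranked
WHERE seqnum = 1
ORDER BY country, movie_name
"""

# Both tied French movies come back, in alphabetical order.
top_movies = conn.execute(query).fetchall()
print(top_movies)
```

Because rank() gives every tied row the same number, filtering on seqnum = 1 returns all the winners for a country rather than an arbitrary one.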

Use plain, old-fashioned SQL - it is old but gold.
WITH q AS (
SELECT *,
dense_rank() over (partition by country order by avg_rating desc ) rank
FROM (
select u.country, m.name movie_name, avg( r.rating ) avg_rating
from userProfile u
join ratings r on u.userid = r.userid
join movie m on r.mid = m.id
group by u.country, m.name
) xx )
SELECT country, movie_name
FROM q WHERE rank <= 1

Related

HiveQL query for data marked as table column names

I work on the HDP 2.6.5 platform, using Hive (1.2.1000.2.6.5.0-292) on a simple database based on data from:
https://grouplens.org/datasets/movielens/100k/
I have 4 tables named: genre, movies, ratings, users as below:
CREATE TABLE genre(genre string, genre_id int);
CREATE TABLE movies (movie_id INT, title STRING, rel_date DATE, video_rel_date STRING,
imdb_url STRING, unknown INT, action INT, adventure INT, animation INT, childrens INT,
comedy INT, crime INT, documentary INT, drama INT, fantasy INT, noir INT, horror INT,
musical INT, mystery INT, romance INT, sci_fi INT, thriller INT, war INT, western INT)
CLUSTERED BY (movie_id) INTO 12 BUCKETS STORED AS ORC;
CREATE TABLE ratings(user_id int, movie_id int, rating int, rating_time int);
CREATE TABLE users(user_id int, age int, gender char(1), occupation string, zip int);
I would like to write a query that returns which movie genre was watched most often by women and which by men. The problem for me is the structure of the movies table, where the movie genre is stored:
1|Toy Story (1995)|1995-01-01||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
The last 19 fields are the genres, a '1' indicates the movie is of that genre, a '0' indicates it is not. Additionally movies can be in several genres at once. The gender is represented in 'users' table as 'M' or 'F' char.
The required tables can be easily joined, but how to return and group the genres which are the columns names?
SELECT m.title, r.rating, u.gender
FROM movies m INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id);
Make an array of the genre columns, placed in the order corresponding to genre_id, explode the array, and join by position in the array with the genre table. Like this (not tested):
select s.title, s.genre, s.gender, s.rating, s.cnt
from
(select s.title, s.gender, s.rating, s.cnt, s.genre,
rank() over (partition by s.gender order by s.cnt desc) as rnk
from
(
select m.title, u.gender, r.rating, g.genre, count(*) over(partition by u.gender, g.genre) cnt -- count per gender and genre, so the rank below can tell genres apart
from
(select m.movie_id, m.title, e.id+1 as genre_id
from movies m
lateral view
posexplode (array(--place columns in a positions corresponding their genre_id
unknown, action, adventure, animation, childrens,
comedy, crime, documentary, drama, fantasy,
noir, horror, musical, mystery, romance,
sci_fi, thriller, war, western
)
)e as id, val
where e.val=1
) m
INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id)
INNER JOIN genre g ON (g.genre_id = m.genre_id)
) s
) s where rnk = 1
Awful data model. You should have a table with one row per movie and genre.
To solve this problem, I would suggest unpivoting to aggregate:
select mg.*
from (select m.genre, u.gender, count(*) as cnt,
rank() over (partition by u.gender order by count(*) desc) as seqnum
from ((select movie_id, 'action' as genre from movies where action = 1) union all
(select movie_id, 'adventure' as genre from movies where adventure = 1) union all
. . .
) m join
ratings r
on r.movie_id = m.movie_id join
users u
on r.user_id = u.user_id
group by m.genre, u.gender
) mg
where seqnum = 1;
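A small runnable sketch of the unpivot-then-aggregate idea, using SQLite from Python. Everything here is illustrative: the movies table is cut down to three genre flags, the rows are invented, and it assumes SQLite 3.25 or later for window functions. Hive syntax differs in places, so treat it as a demonstration of the shape of the query rather than something to paste into Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movie_id INTEGER, title TEXT,
                     action INTEGER, comedy INTEGER, drama INTEGER);
CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER);
CREATE TABLE users (user_id INTEGER, gender TEXT);
INSERT INTO movies VALUES (1, 'A', 1, 1, 0), (2, 'B', 0, 1, 0), (3, 'C', 0, 0, 1);
INSERT INTO users VALUES (10, 'F'), (11, 'M'), (12, 'M');
INSERT INTO ratings VALUES
    (10, 1, 4), (10, 2, 5), (10, 3, 3),   -- F: action 1, comedy 2, drama 1
    (11, 1, 5), (11, 3, 2), (12, 3, 4);   -- M: action 1, comedy 1, drama 2
""")

query = """
WITH movie_genre AS (
    -- Unpivot: one row per (movie, genre) instead of one flag column per genre.
    SELECT movie_id, 'action' AS genre FROM movies WHERE action = 1
    UNION ALL
    SELECT movie_id, 'comedy' FROM movies WHERE comedy = 1
    UNION ALL
    SELECT movie_id, 'drama' FROM movies WHERE drama = 1
),
counts AS (
    SELECT u.gender, mg.genre, COUNT(*) AS cnt
    FROM movie_genre mg
    JOIN ratings r ON r.movie_id = mg.movie_id
    JOIN users u ON u.user_id = r.user_id
    GROUP BY u.gender, mg.genre
)
SELECT gender, genre, cnt
FROM (SELECT gender, genre, cnt,
             RANK() OVER (PARTITION BY gender ORDER BY cnt DESC) AS seqnum
      FROM counts)
WHERE seqnum = 1
ORDER BY gender
"""

top_genres = conn.execute(query).fetchall()
print(top_genres)
```

The key point is that the grouping is by genre and gender, not by movie, so each rating contributes once to every genre the movie belongs to.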

Issue with getting the rank of a user based on combined columns in a join table

I have a users table and each user has flights in a flights table. Each flight has a departure and an arrival airport relationship within an airports table. What I need to do is count up the unique airports across both departure and arrival columns (flights.departure_airport_id and flights.arrival_airport_id) for each user, and then assign them a rank via dense_rank and then retrieve the rank for a given user id.
Basically, I need to order all users according to how many unique airports they have flown to or from and then get the rank for a certain user.
Here's what I have so far:
SELECT u.rank FROM (
SELECT
users.id,
dense_rank () OVER (ORDER BY count(DISTINCT (flights.departure_airport_id, flights.arrival_airport_id)) DESC) AS rank
FROM users
LEFT JOIN flights ON users.id = flights.user_id
GROUP BY users.id
) AS u WHERE u.id = 'uuid';
This works, but does not actually return the desired result as count(DISTINCT (flights.departure_airport_id, flights.arrival_airport_id)) counts the combined airport ids and not each unique airport id separately. That's how I understand it works, anyway... I'm guessing that I somehow need to use a UNION join on the airport id columns but can't figure out how to do that.
I'm on Postgres 13.0.
I would recommend a lateral join to unpivot, then aggregation and ranking:
select *
from (
select f.user_id,
dense_rank() over(order by count(distinct a.airport_id) desc) rn
from flights f
cross join lateral (values
(f.departure_airport_id), (f.arrival_airport_id)
) a(airport_id)
group by f.user_id
) t
where user_id = 'uuid'
You don't really need the users table for what you want, unless you want to include users without any flights (they would all share the same, last rank, because their airport count is 0). If so:
select *
from (
select u.id,
dense_rank() over(order by count(distinct a.airport_id) desc) rn
from users u
left join flights f on f.user_id = u.id
left join lateral (values
(f.departure_airport_id), (f.arrival_airport_id)
) a(airport_id) on true
group by u.id
) t
where id = 'uuid'
You're counting the distinct pairs of (departure_airport_id, arrival_airport_id). As you suggested, you could use union to get a single column of airport IDs (regardless of whether they are departure or arrival airports), and then apply a count on them:
SELECT user_id, DENSE_RANK() OVER (ORDER BY cnt DESC) AS user_rank
FROM (SELECT u.id AS user_id, COALESCE(cnt, 0) AS cnt
FROM users u
LEFT JOIN (SELECT user_id, COUNT(DISTINCT airport_id) AS cnt
FROM (SELECT user_id, departure_airport_id AS airport_id
FROM flights
UNION
SELECT user_id, arrival_airport_id AS airport_id
FROM flights) x
GROUP BY user_id) f ON u.id = f.user_id) t
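Here is a hedged sketch of the union approach, run on SQLite from Python so it can be tried without a Postgres instance. The user ids and airport ids are invented, and window functions need SQLite 3.25 or later. Since UNION already removes duplicate (user_id, airport_id) rows, a plain COUNT is enough in this variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id TEXT);
CREATE TABLE flights (user_id TEXT, departure_airport_id INTEGER,
                      arrival_airport_id INTEGER);
INSERT INTO users VALUES ('a'), ('b'), ('c');
INSERT INTO flights VALUES
    ('a', 1, 2), ('a', 2, 1), ('a', 1, 3),   -- 3 distinct pairs, 3 distinct airports
    ('b', 1, 2), ('b', 3, 4);                -- 2 distinct pairs, 4 distinct airports
-- user 'c' has no flights
""")

query = """
WITH airports AS (
    -- UNION (not UNION ALL) collapses repeated airports per user.
    SELECT user_id, departure_airport_id AS airport_id FROM flights
    UNION
    SELECT user_id, arrival_airport_id FROM flights
),
counts AS (
    SELECT u.id AS user_id, COUNT(a.airport_id) AS cnt
    FROM users u
    LEFT JOIN airports a ON a.user_id = u.id
    GROUP BY u.id
)
SELECT user_id, cnt, DENSE_RANK() OVER (ORDER BY cnt DESC) AS user_rank
FROM counts
ORDER BY user_rank, user_id
"""

ranking = conn.execute(query).fetchall()
print(ranking)
```

With this data, counting pairs would have put user 'a' first, while counting individual airports puts user 'b' first, which is the difference the question describes.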

sql: get a single-column table, order by column from another table

I have interconnected tables.
movies (main, parent) : id | title | year
people (child) : people_id | name | birthyear
ratings (child) : movie_id | rating | votes
stars (child) : movie_id | person_id
I need to write a query that returns a single-column output from the tables "movies", "people" and "stars", ordered by a column from the table "ratings", without adding the "rating" column to my output.
My code:
SELECT title from movies
where id in (select movie_id from stars
where person_id in(select id from people where name = "Chadwick Boseman"))LIMIT 5;
It returns all titles of movies where Chadwick Boseman plays. I need to order them by rating. How to do it?
Although this would never be done without a join, since it is homework, you can use a correlated subquery for the table ratings in the ORDER BY clause:
select m.title
from movies m
inner join stars s on s.movie_id = m.id
inner join people p on p.people_id = s.person_id
where p.name = 'Chadwick Boseman'
order by (select r.rating from ratings r where r.movie_id = m.id) desc
limit 5
You could also use your query and add the ORDER BY clause:
select m.title
from movies m
where m.id in (
select movie_id
from stars
where person_id in(
select id
from people
where name = 'Chadwick Boseman'
)
)
order by (select r.rating from ratings r where r.movie_id = m.id) desc
limit 5;
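As a quick check that ordering by a column you do not select really works, here is a small sketch using SQLite from Python. The rows are invented, and it assumes the key column of people is called id, as in the question's own query (the table listing calls it people_id, so adjust to your schema).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER, title TEXT, year INTEGER);
CREATE TABLE people (id INTEGER, name TEXT, birthyear INTEGER);
CREATE TABLE ratings (movie_id INTEGER, rating REAL, votes INTEGER);
CREATE TABLE stars (movie_id INTEGER, person_id INTEGER);
INSERT INTO movies VALUES (1, 'X', 2016), (2, 'Y', 2018), (3, 'Z', 2019);
INSERT INTO people VALUES (100, 'Chadwick Boseman', 1976), (200, 'Someone Else', 1980);
INSERT INTO stars VALUES (1, 100), (2, 100), (3, 200);
INSERT INTO ratings VALUES (1, 7.0, 10), (2, 9.0, 20), (3, 8.0, 30);
""")

query = """
SELECT m.title
FROM movies m
WHERE m.id IN (
    SELECT movie_id
    FROM stars
    WHERE person_id IN (SELECT id FROM people WHERE name = 'Chadwick Boseman')
)
-- The correlated subquery supplies the sort key; rating never appears in the output.
ORDER BY (SELECT r.rating FROM ratings r WHERE r.movie_id = m.id) DESC
LIMIT 5
"""

titles = [row[0] for row in conn.execute(query)]
print(titles)
```

The output has a single column, yet the rows come back highest rating first.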
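The simplest way to order by a column is to join the table that holds it and refer to the column directly. You can also include it in the select list, as below, although SQL does not require a sort column to be selected. ORDER BY sorts your output by the column you specify. Also, why can't you use JOINs for your query, like below?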
SELECT m.title, d.rating
FROM movies m
JOIN stars s ON s.movie_id = m.id
JOIN people p ON p.id = s.person_id
JOIN ratings d ON d.movie_id = m.id -- ratings supplies the column to sort by
WHERE p.name = 'Chadwick Boseman'
ORDER BY d.rating
LIMIT 5
Updated: this might work, but I can't test it because I don't have access to the actual data and tables.
SELECT m.title
FROM movies m
JOIN stars s ON s.movie_id = m.id
JOIN people p ON p.id = s.person_id
WHERE p.name = 'Chadwick Boseman'
AND m.id in (SELECT top 5 movie_id
FROM ratings r
WHERE r.movie_id = m.id
ORDER BY r.rating desc)

Shorten a query

I have to write a query that counts the tickets whose movies all belong to a single genre. The result should list each movie genre and the number of tickets bought for that genre. I have written a query, but I was wondering whether it can be made shorter and more compact.
Following is the database schema:
movies(movieId, movieGenre, moviePrice)
tickets(ticketId, ticketDate, customerId)
details(ticketId, movieId, numOfTickets)
Here is my query:
select m.genre, count(*)
from(select t.ticketId, m.genre
from(select d.ticketId
from(select m.genre, t.ticketId
from tickets t join details d on t.ticketId =
d.ticketId join movies m on d.movieId = m.movieId
group by m.genre, t.ticketId) d
group by d.ticketId
having count(*) = 1) as t join details d on t.ticketId =
d.ticketId join movies m on d.movieId = m.movieId
group by t.ticketId, m.genre) m
group by m.genre;
This runs on a database so I am only able to post sample output:
comedy 29821
action 27857
rom-com 19663
I see no reason to use the table tickets, because the results do not filter or aggregate by ticketDate or customerID. Thus, a shorter sql is
SELECT m.moviegenre,
Sum(d.numoftickets) as SumNum
FROM details d
LEFT JOIN movies m
ON d.movieid = m.movieid
GROUP BY m.moviegenre
HAVING Sum(d.numoftickets) > 0
ORDER BY m.moviegenre
added 3/28 am
I am not sure what is meant by duplicates. In the table details(ticketId, movieId, numOfTickets)?
I would expect ticketId to be unique, so what would explain duplicates?
Is the same ticketId being printed twice, repeatedly?
Determine what number of ticketId are duplicates--
SELECT ticketId, count(*) as cnt
FROM details d
GROUP By ticketId
HAVING count(*) > 1
Determine what number of "details" rows are duplicates--
SELECT ticketId, movieId, numOfTickets, count(*) as cnt
FROM details d
GROUP By ticketId, movieId, numOfTickets
HAVING count(*) > 1
Then again, it may be that table = movies(movieId, movieGenre, moviePrice) is the one with duplicates??
Determine what number of movieId are duplicates--
SELECT movieId, count(*) as cnt
FROM movies m
GROUP BY movieId
HAVING count(*) > 1
Remove duplicates from details--
SELECT m.moviegenre,
Sum(d.numoftickets) as SumNum
FROM
(Select Distinct * From details) d
LEFT JOIN movies m
ON d.movieid = m.movieid
GROUP BY m.moviegenre
ORDER BY m.moviegenre
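If the goal is what the question describes, counting tickets whose movies all share one genre, a shorter formulation is possible with HAVING COUNT(DISTINCT ...) = 1. This is only a sketch under that reading of the requirement, with invented rows, run on SQLite from Python. It is not the query from the answer above, which sums numOfTickets instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movieId INTEGER, movieGenre TEXT, moviePrice REAL);
CREATE TABLE details (ticketId INTEGER, movieId INTEGER, numOfTickets INTEGER);
INSERT INTO movies VALUES (1, 'comedy', 8.0), (2, 'comedy', 9.0), (3, 'action', 10.0);
INSERT INTO details VALUES
    (1, 1, 2), (1, 2, 1),   -- ticket 1: comedy only
    (2, 1, 1), (2, 3, 1),   -- ticket 2: comedy and action, so it is excluded
    (3, 3, 4),              -- ticket 3: action only
    (4, 2, 1);              -- ticket 4: comedy only
""")

query = """
SELECT genre, COUNT(*) AS tickets
FROM (
    -- Keep tickets with exactly one distinct genre; MIN() just picks that genre.
    SELECT d.ticketId, MIN(m.movieGenre) AS genre
    FROM details d
    JOIN movies m ON m.movieId = d.movieId
    GROUP BY d.ticketId
    HAVING COUNT(DISTINCT m.movieGenre) = 1
) single_genre
GROUP BY genre
ORDER BY genre
"""

tickets_per_genre = conn.execute(query).fetchall()
print(tickets_per_genre)
```

This needs one subquery instead of three and, like the answer above, does not touch the tickets table.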

hive select max count by grouping on two fields

I am trying to write a SQL query to find the most popular artist in each country. The most popular artist is the one with the maximum number of ratings >= 8.
Below is table structure,
describe album;
albumid string
album_title string
album_artist string
describe album_ratings;
userid int
albumid string
rating int
describe cusers;
userid int
state string
country string
Below is one query that I wrote but it is not working.
select album_artist, country, count(rating)
from album, album_ratings, cusers
where album.albumid=album_ratings.albumid
and album_ratings.userid=cusers.userid
and rating>=6
group by country, album_artist
having count(rating) = (
select max(t.cnt)
from (
select count(rating) as cnt
from album, album_ratings, cusers
where album.albumid=album_ratings.albumid
and album_ratings.userid=cusers.userid
and rating>=6
group by country, album_artist
) as t
group by t.country
);
Learn to use proper, explicit JOIN syntax. Never use commas in the FROM clause.
You can do this with window functions:
select *
from (select album_artist, country, count(*) as cnt,
row_number() over (partition by country order by count(*) desc) as seqnum
from album a join
album_ratings ar
on a.albumid = ar.albumid join
cusers u
on ar.userid = u.userid
where rating >= 6
group by country, album_artist
) aru
where seqnum = 1;
If you want ties, use rank() instead of row_number().
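To make the difference concrete, here is a small sketch on SQLite from Python with made-up rows in which two artists tie in one country. The helper name top_artists is hypothetical, the threshold of 6 follows the query above, and window functions need SQLite 3.25 or later. Hive would run the SQL directly; Python is only the harness here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE album (albumid TEXT, album_title TEXT, album_artist TEXT);
CREATE TABLE album_ratings (userid INTEGER, albumid TEXT, rating INTEGER);
CREATE TABLE cusers (userid INTEGER, state TEXT, country TEXT);
INSERT INTO album VALUES ('a1', 'First', 'X'), ('a2', 'Second', 'Y');
INSERT INTO cusers VALUES (1, 'NY', 'US'), (2, 'CA', 'US'), (3, 'BE', 'DE');
INSERT INTO album_ratings VALUES
    (1, 'a1', 9), (2, 'a2', 8),   -- US: X and Y each get one qualifying rating
    (1, 'a2', 3),                 -- below the threshold, ignored
    (3, 'a1', 7);                 -- DE: X gets one qualifying rating
""")

def top_artists(ranking_fn):
    """Return (country, artist) rows ranked first by the given window function."""
    if ranking_fn not in ("rank", "row_number"):
        raise ValueError("ranking_fn must be 'rank' or 'row_number'")
    sql = f"""
    WITH counts AS (
        SELECT u.country, a.album_artist, COUNT(*) AS cnt
        FROM album a
        JOIN album_ratings ar ON a.albumid = ar.albumid
        JOIN cusers u ON ar.userid = u.userid
        WHERE ar.rating >= 6
        GROUP BY u.country, a.album_artist
    )
    SELECT country, album_artist
    FROM (SELECT country, album_artist,
                 {ranking_fn}() OVER (PARTITION BY country ORDER BY cnt DESC) AS seqnum
          FROM counts)
    WHERE seqnum = 1
    ORDER BY country, album_artist
    """
    return conn.execute(sql).fetchall()

with_ties = top_artists("rank")           # both tied US artists are returned
one_each = top_artists("row_number")      # exactly one artist per country
print(with_ties, one_each)
```

With row_number(), which of the tied artists is kept is arbitrary unless you add a tie-breaker to the ORDER BY inside the window.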
You can use the window function row_number to find the most popular artist in each country (higher rating means more popular):
select *
from (
select c.country,
a.album_artist,
sum(rating) as total_rating,
row_number() over (partition by c.country order by sum(rating) desc) as rn
from cusers c
join album_ratings r on c.userid = r.userid
join album a on r.albumid = a.albumid
where r.rating >= 8
group by c.country,
a.album_artist
) t
where rn = 1;
I assumed sum(rating) instead, because I think rating should be additive.
Also, always use explicit join syntax instead of old comma based join.