How to write sql query using group by - sql

I have these tables:
actors(id: int, first_name: string, last_name: string, gender: string)
directors(id: int, first_name: string, last_name: string)
directors_genres(director_id: int, genre: string, prob: float)
movies(id: int, name: string, year: int, rank: float)
movies_directors(director_id: int, movie_id: int)
movies_genres(movie_id: int, genre: string)
roles(actor_id: int, movie_id: int, role: string)
I want to find, for each genre, the year in which the most movies of that genre were made.
I am doing the following but I'm stuck, please help!
select m.YEAR, count(m.year) as c, genre
from movies_genres,
movies m
where m.id = movies_genres.movie_id
group by genre, m.year;

You are already getting the count of movies for each genre and year, which is the hard part. Now keep, for each genre, only the year whose count equals that genre's maximum, by reusing your query as a common table expression:
with counts as
(select mg.genre, m.year, count(*) as c
from movies_genres mg
inner join movies m
on m.id = mg.movie_id
group by mg.genre, m.year)
select c1.genre, c1.year, c1.c as mc
from counts c1
where c1.c = (select max(c2.c) from counts c2 where c2.genre = c1.genre)
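To check the "maximum per genre" idea on a small dataset, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the question; the rows are made up for illustration, and the query keeps every genre/year pair whose count equals that genre's maximum.

```python
import sqlite3

# Toy data: table and column names follow the question, the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
    CREATE TABLE movies_genres (movie_id INTEGER, genre TEXT);
    INSERT INTO movies VALUES
        (1, 'A', 2000), (2, 'B', 2000), (3, 'C', 2001), (4, 'D', 2001), (5, 'E', 2001);
    INSERT INTO movies_genres VALUES
        (1, 'Drama'), (2, 'Drama'), (3, 'Drama'),
        (1, 'Comedy'), (3, 'Comedy'), (4, 'Comedy'), (5, 'Comedy');
""")

query = """
    WITH counts AS (
        -- step 1: number of movies per genre and year
        SELECT mg.genre, m.year, COUNT(*) AS c
        FROM movies_genres mg
        JOIN movies m ON m.id = mg.movie_id
        GROUP BY mg.genre, m.year
    )
    -- step 2: keep the rows whose count equals the maximum for their genre
    SELECT c1.genre, c1.year, c1.c
    FROM counts c1
    WHERE c1.c = (SELECT MAX(c2.c) FROM counts c2 WHERE c2.genre = c1.genre)
    ORDER BY c1.genre, c1.year
"""
best_years = conn.execute(query).fetchall()
print(best_years)  # [('Comedy', 2001, 3), ('Drama', 2000, 2)]
```

If two years tie for a genre, this form returns both of them.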

If you are using SQL Server or Oracle or DB2 this should work.
SELECT genre, year, cnt
FROM
(
SELECT genre, year, cnt,
row_number() over (PARTITION BY genre ORDER BY cnt DESC) as rn
FROM
(
SELECT mg.genre, m.year, count(*) as cnt
FROM movies_genres mg
JOIN movies m on m.id = mg.movie_id
GROUP BY mg.genre, m.year
) A
) B
WHERE rn = 1
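How it works, from the inside out: first you count the movies for every genre and year, then you number each genre's years by descending count, and finally you keep the top row for each genre. Note that row_number() returns exactly one year per genre even when several years tie; use rank() instead if you want all tied years.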
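The ranking step behaves differently on ties depending on the window function used. The sketch below, using Python's sqlite3 (window functions need SQLite 3.25 or newer), starts from an already aggregated table; the name genre_year_counts and its rows are hypothetical stand-ins for the inner GROUP BY result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical pre-aggregated counts (stand-in for the inner GROUP BY query).
    CREATE TABLE genre_year_counts (genre TEXT, year INTEGER, cnt INTEGER);
    INSERT INTO genre_year_counts VALUES
        ('Drama', 1999, 4), ('Drama', 2000, 4), ('Drama', 2001, 2),
        ('Comedy', 2000, 1), ('Comedy', 2001, 3);
""")

def top_years(window_fn):
    """Return the top-ranked (genre, year, cnt) rows using the given window function."""
    sql = f"""
        SELECT genre, year, cnt
        FROM (SELECT genre, year, cnt,
                     {window_fn}() OVER (PARTITION BY genre ORDER BY cnt DESC) AS rn
              FROM genre_year_counts) ranked
        WHERE rn = 1
        ORDER BY genre, year
    """
    return conn.execute(sql).fetchall()

with_row_number = top_years("row_number")  # one row per genre; the Drama tie is broken arbitrarily
with_rank = top_years("rank")              # both tied Drama years are kept

print(len(with_row_number))  # 2
print(with_rank)  # [('Comedy', 2001, 3), ('Drama', 1999, 4), ('Drama', 2000, 4)]
```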

Related

HiveQL query for data marked as table column names

I work on the HDP 2.6.5 platform, using Hive (1.2.1000.2.6.5.0-292) on a simple database built from:
https://grouplens.org/datasets/movielens/100k/
I have 4 tables named: genre, movies, ratings, users as below:
CREATE TABLE genre(genre string, genre_id int);
CREATE TABLE movies (movie_id INT, title STRING, rel_date DATE, video_rel_date STRING,
imdb_url STRING, unknown INT, action INT, adventure INT, animation INT, childrens INT,
comedy INT, crime INT, documentary INT, drama INT, fantasy INT, noir INT, horror INT,
musical INT, mystery INT, romance INT, sci_fi INT, thriller INT, war INT, western INT)
CLUSTERED BY (movie_id) INTO 12 BUCKETS STORED AS ORC;
CREATE TABLE ratings(user_id int, movie_id int, rating int, rating_time int);
CREATE TABLE users(user_id int, age int, gender char(1), occupation string, zip int);
I would like to write a query that returns which movie genre was watched most often by women and which by men. The problem for me is the structure of the movies table, where the movie genre is stored:
1|Toy Story (1995)|1995-01-01||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
The last 19 fields are the genres: a '1' indicates the movie is of that genre, a '0' indicates it is not. Additionally, a movie can be in several genres at once. The gender is stored in the 'users' table as an 'M' or 'F' char.
The required tables can be easily joined, but how do I return and group by the genres, which are column names?
SELECT m.title, r.rating, u.gender
FROM movies m INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id);
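Make an array of the genre columns, placed in the order corresponding to genre_id, explode the array with posexplode, and join each position to the genre table. The e.id+1 below assumes genre_id starts at 1; drop the +1 if it starts at 0. Like this (not tested):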
select s.title, s.genre, s.gender, s.rating, s.cnt
from
(select s.title, s.gender, s.rating, s.cnt, s.genre,
rank() over (partition by s.gender order by s.cnt desc) as rnk
from
(
select m.title, u.gender, r.rating, g.genre, count(*) over(partition by u.gender, g.genre) cnt
from
(select m.movie_id, m.title, e.id+1 as genre_id
from movies m
lateral view
posexplode (array(--place columns in a positions corresponding their genre_id
unknown, action, adventure, animation, childrens,
comedy, crime, documentary, drama, fantasy,
noir, horror, musical, mystery, romance,
sci_fi, thriller, war, western
)
)e as id, val
where e.val=1
) m
INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id)
INNER JOIN genre g ON (g.genre_id = m.genre_id)
) s
) s where rnk = 1
Awful data model. You should have a table with one row per movie and genre.
To solve this problem, I would suggest unpivoting to aggregate:
select mg.*
from (select m.genre, u.gender, count(*) as cnt,
rank() over (partition by u.gender order by count(*) desc) as seqnum
from ((select movie_id, 'action' as genre from movies where action = 1) union all
(select movie_id, 'adventure' as genre from movies where adventure = 1) union all
. . .
) m join
ratings r
on r.movie_id = m.movie_id join
users u
on r.user_id = u.user_id
group by m.genre, u.gender
) mg
where seqnum = 1;
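The unpivot-then-aggregate idea can be tried on a reduced schema. The sketch below uses Python's sqlite3 rather than Hive, keeps only two genre flag columns (action and comedy), and uses invented rows; it turns the flag columns into (movie_id, genre) rows with UNION ALL, counts ratings per gender and genre, and then picks the top genre per gender in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced, made-up version of the question's tables (two genre flags only).
    CREATE TABLE movies (movie_id INTEGER, title TEXT, action INTEGER, comedy INTEGER);
    CREATE TABLE users (user_id INTEGER, gender TEXT);
    CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER);
    INSERT INTO movies VALUES (1, 'M1', 1, 0), (2, 'M2', 0, 1), (3, 'M3', 1, 1);
    INSERT INTO users VALUES (10, 'F'), (11, 'M'), (12, 'M');
    INSERT INTO ratings VALUES (10, 2, 5), (10, 3, 4), (11, 1, 3),
                               (12, 1, 4), (12, 3, 2), (11, 2, 1);
""")

query = """
    SELECT gender, genre, COUNT(*) AS cnt
    FROM (SELECT movie_id, 'action' AS genre FROM movies WHERE action = 1
          UNION ALL
          SELECT movie_id, 'comedy' AS genre FROM movies WHERE comedy = 1) mg
    JOIN ratings r ON r.movie_id = mg.movie_id
    JOIN users u ON u.user_id = r.user_id
    GROUP BY gender, genre
    ORDER BY gender, genre
"""
counts = conn.execute(query).fetchall()

# Keep the most-rated genre for each gender (first one wins on a tie).
top = {}
for gender, genre, cnt in counts:
    if gender not in top or cnt > top[gender][1]:
        top[gender] = (genre, cnt)

print(counts)  # [('F', 'action', 1), ('F', 'comedy', 2), ('M', 'action', 3), ('M', 'comedy', 2)]
print(top)     # {'F': ('comedy', 2), 'M': ('action', 3)}
```

A movie flagged with several genres contributes one row per genre, so each of its ratings is counted once for every genre it belongs to.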

How can I use SQL to create a new table using WHERE/AND conditions?

I need to create a new table with the titles of the movies starring both Johnny Depp AND Helena Bonham Caster.
I have these tables:
Movies: (ID, title and year)
Stars: (Movie_ID, Person_ID)
People: (ID, Name, and Birth year)
CREATE TABLE johnny_helena (title TEXT);
INSERT INTO johnny_helena (title)
SELECT title
FROM movies
WHERE id IN (SELECT movie_id FROM stars WHERE person_id IN (SELECT id FROM people WHERE name = 'Johnny Depp') AND person_id IN (SELECT id FROM people WHERE name = 'Helena Bonham Caster'))
When I execute this code I get a table with 0 rows, but there should be 6.
I'm new to SQL and I believe I'm making a foolish mistake. How can I fix this?
Thanks.
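Your query returns no rows because each row of stars holds a single person_id, and no single row can match both actors at once. One option uses joins and aggregation, keeping the movies in which both names appear: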
INSERT INTO johnny_helena (title)
SELECT m.title
FROM movies m
INNER JOIN stars s ON s.movie_id = m.id
INNER JOIN people p ON p.id = s.person_id
WHERE p.name in ('Johnny Depp', 'Helena Bonham Caster')
GROUP BY m.id, m.title
HAVING COUNT(*) = 2
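The difference between the two queries can be reproduced on a tiny dataset. This is a sketch with Python's sqlite3; the actor names 'Actor A' and 'Actor B' and all rows are placeholders, and COUNT(DISTINCT p.name) is used instead of COUNT(*) as a small safeguard against duplicate rows in stars.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Placeholder data: one movie with both actors, two with only one of them.
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stars (movie_id INTEGER, person_id INTEGER);
    INSERT INTO movies VALUES (1, 'Both'), (2, 'Only A'), (3, 'Only B');
    INSERT INTO people VALUES (1, 'Actor A'), (2, 'Actor B');
    INSERT INTO stars VALUES (1, 1), (1, 2), (2, 1), (3, 2);
""")

# The original shape: both conditions are tested against the SAME stars row,
# so they can never be true together.
row_by_row = conn.execute("""
    SELECT title FROM movies WHERE id IN (
        SELECT movie_id FROM stars
        WHERE person_id IN (SELECT id FROM people WHERE name = 'Actor A')
          AND person_id IN (SELECT id FROM people WHERE name = 'Actor B'))
""").fetchall()

# The aggregated shape: collect each movie's matching rows, then require both names.
grouped = conn.execute("""
    SELECT m.title
    FROM movies m
    JOIN stars s ON s.movie_id = m.id
    JOIN people p ON p.id = s.person_id
    WHERE p.name IN ('Actor A', 'Actor B')
    GROUP BY m.id, m.title
    HAVING COUNT(DISTINCT p.name) = 2
""").fetchall()

print(row_by_row)  # []
print(grouped)     # [('Both',)]
```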

SQL How to use info from created table

SELECT yr, COUNT(title) AS g FROM movie
JOIN casting ON id = movieid
JOIN actor ON actorid = actor.id
WHERE name = 'John Travolta'
AND g = 1
GROUP BY yr
--> I want g to work, but I don't know how to use info from the same query.
You can't use g there because WHERE filters individual rows before GROUP BY computes the aggregate, so COUNT(title) does not exist yet at that point. Conditions on aggregates belong in the HAVING clause.
SELECT yr, COUNT(title) AS g FROM movie
JOIN casting ON id = movieid
JOIN actor ON actorid = actor.id
WHERE name = 'John Travolta'
GROUP BY yr
HAVING COUNT(title) = 1
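The order of evaluation (WHERE on rows, then GROUP BY, then HAVING on groups) can be observed with a short sketch in Python's sqlite3. The table layout follows the query above; the rows are invented, and the exact error text for the failing variant is engine-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented rows: one film in 1977, two films in 1978.
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
    CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER);
    INSERT INTO movie VALUES (1, 'Film A', 1977), (2, 'Film B', 1978), (3, 'Film C', 1978);
    INSERT INTO actor VALUES (1, 'John Travolta'), (2, 'Someone Else');
    INSERT INTO casting VALUES (1, 1), (2, 1), (3, 1), (3, 2);
""")

base = """
    SELECT yr, COUNT(title) AS g
    FROM movie
    JOIN casting ON movie.id = movieid
    JOIN actor ON actorid = actor.id
    WHERE name = 'John Travolta' {extra_where}
    GROUP BY yr
    {having}
"""

# Filtering on the aggregate in WHERE is rejected: the count is not computed yet.
try:
    conn.execute(base.format(extra_where="AND g = 1", having="")).fetchall()
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

# HAVING runs after grouping, so the aggregate is available.
single_film_years = conn.execute(
    base.format(extra_where="", having="HAVING COUNT(title) = 1")
).fetchall()

print(where_error is not None)  # True
print(single_film_years)        # [(1977, 1)]
```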

How can this query be optimized?

I need to write this query in postgresql 9.3:
List the most popular movie in each country. The most popular movie/movies is the one that has got the highest average rating
across all the users of that country. In case of a tie, return all
movies order alphabetically. (2 columns)
Tables needed:
CREATE TABLE movie (
id integer,
name varchar(200),
year date
);
CREATE TABLE userProfile (
userid varchar(200),
gender char(1),
age integer,
country varchar(200),
registered date
);
CREATE TABLE ratings (
mid integer,
userid varchar(200),
rating integer
);
CREATE INDEX movie_id_idx ON movie (id);
CREATE INDEX userProfile_userid_idx ON userProfile (userid);
CREATE INDEX ratings_userid_idx ON ratings (userid);
CREATE INDEX ratings_mid_idx ON ratings (mid);
CREATE INDEX ratings_userid_mid_idx ON ratings (userid, mid);
Here is my query:
CREATE TEMP TABLE tops AS SELECT country, name
FROM ratings AS r INNER JOIN userProfile AS u
ON r.userid=u.userid
INNER JOIN movie AS m ON m.id = r.mid LIMIT 0;
~10 min
CREATE TEMP TABLE avg_country AS
SELECT country, r.mid, AVG(rating) AS rate
FROM ratings AS r INNER JOIN userProfile AS u
ON r.userid=u.userid
GROUP BY country, r.mid;
~8 min
DO $$
DECLARE arrow record;
BEGIN
CREATE TABLE movie_names AS SELECT id, name FROM movie;
FOR arrow IN SELECT DISTINCT country FROM userProfile ORDER BY country
LOOP
CREATE TABLE movies AS SELECT mid FROM (SELECT MAX(rate) AS m_rate FROM avg_country
WHERE country=arrow.country) AS max_val CROSS JOIN LATERAL
(SELECT mid FROM avg_country
WHERE country=arrow.country AND rate=max_val.m_rate) AS a;
WITH names AS (DELETE FROM movie_names AS m
WHERE m.id IN (SELECT mid FROM movies) RETURNING name)
INSERT INTO tops
SELECT arrow.country, name FROM names ORDER BY name;
DROP TABLE movies;
END LOOP;
DROP TABLE movie_names;
END$$;
SELECT * FROM tops;
DROP TABLE tops, avg_country;
Thanks a lot in advance!
This is similar to kordirko's answer, but with one fewer subquery:
select country, movie_name, avg_rating
from (select u.country, m.name as movie_name, avg(r.rating) as avg_rating,
rank() over (partition by u.country order by avg(r.rating) desc) as seqnum
from userProfile u join
ratings r
on u.userid = r.userid join
movie m
on r.mid = m.id
group by u.country, m.id, m.name -- `name` must be listed too, because movie.id is not declared as a primary key
) uc
where seqnum = 1
order by country, movie_name;
Alternatively, if you want to get the list on one row per country:
select country, string_agg(movie_name, '; ' order by movie_name) as most_popular_movies
from (select u.country, m.name as movie_name, avg(r.rating) as avg_rating,
rank() over (partition by u.country order by avg(r.rating) desc) as seqnum
from userProfile u join
ratings r
on u.userid = r.userid join
movie m
on r.mid = m.id
group by u.country, m.id, m.name -- `name` must be listed too, because movie.id is not declared as a primary key
) uc
where seqnum = 1
group by country;
Use plain, old-fashioned SQL; it is old but gold.
WITH q AS (
SELECT *,
dense_rank() over (partition by country order by avg_rating desc ) rank
FROM (
select u.country, m.name movie_name, avg( r.rating ) avg_rating
from userProfile u
join ratings r on u.userid = r.userid
join movie m on r.mid = m.id
group by u.country, m.name
) xx )
SELECT country, movie_name
FROM q WHERE rank <= 1
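To see how the ranking approach handles ties, here is a small sketch with Python's sqlite3 (window functions need SQLite 3.25 or newer) instead of PostgreSQL. The table and column names follow the question, the rows are invented, and the alias rnk is chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented rows: in 'US', Alpha and Beta tie with an average of 4.
    CREATE TABLE movie (id INTEGER, name TEXT);
    CREATE TABLE userProfile (userid TEXT, country TEXT);
    CREATE TABLE ratings (mid INTEGER, userid TEXT, rating INTEGER);
    INSERT INTO movie VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO userProfile VALUES ('u1', 'US'), ('u2', 'US'), ('u3', 'FR');
    INSERT INTO ratings VALUES (1, 'u1', 5), (1, 'u2', 3), (2, 'u1', 4), (2, 'u2', 4),
                               (3, 'u1', 2), (3, 'u3', 5), (1, 'u3', 1);
""")

query = """
    SELECT country, movie_name
    FROM (SELECT xx.*,
                 DENSE_RANK() OVER (PARTITION BY country ORDER BY avg_rating DESC) AS rnk
          FROM (SELECT u.country, m.name AS movie_name, AVG(r.rating) AS avg_rating
                FROM userProfile u
                JOIN ratings r ON u.userid = r.userid
                JOIN movie m ON r.mid = m.id
                GROUP BY u.country, m.id, m.name) xx) q
    WHERE rnk = 1
    ORDER BY country, movie_name
"""
most_popular = conn.execute(query).fetchall()
print(most_popular)  # [('FR', 'Gamma'), ('US', 'Alpha'), ('US', 'Beta')]
```

Both tied movies are returned for 'US', in alphabetical order, which is what the assignment asks for.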

How should I join these 3 SQL queries in Oracle?

I have these 3 queries:
SELECT
title, year, MovieGenres(m.mid) genres,
MovieDirectors(m.mid) directors, MovieWriters(m.mid) writers,
synopsis, poster_url
FROM movies m
WHERE m.mid = 1;
SELECT AVG(rating) FROM movie_ratings WHERE mid = 1;
SELECT COUNT(rating) FROM movie_ratings WHERE mid = 1;
And I need to join them into a single query. I was able to do it like this:
SELECT
title, year, MovieGenres(m.mid) genres,
MovieDirectors(m.mid) directors, MovieWriters(m.mid) writers,
synopsis, poster_url, AVG(rating) average, COUNT(rating) count
FROM movies m INNER JOIN movie_ratings mr
ON m.mid = mr.mid
WHERE m.mid = 1
GROUP BY
title, year, MovieGenres(m.mid), MovieDirectors(m.mid),
MovieWriters(m.mid), synopsis, poster_url;
But I don't really like that "huge" GROUP BY, is there a simpler way to do it?
You could do something like this:
SELECT title
,year
,MovieGenres(m.mid) genres
,MovieDirectors(m.mid) directors
,MovieWriters(m.mid) writers
,synopsis
,poster_url
,(select avg(mr.rating)
from movie_ratings mr
where mr.mid = m.mid) as avg_rating
,(select count(rating)
from movie_ratings mr
where mr.mid = m.mid) as num_ratings
FROM movies m
WHERE m.mid = 1;
or even
with grouped as(
select avg(rating) as avg_rating
,count(rating) as num_ratings
from movie_ratings
where mid = 1
)
select title
,year
,MovieGenres(m.mid) genres
,MovieDirectors(m.mid) directors
,MovieWriters(m.mid) writers
,synopsis
,poster_url
,avg_rating
,num_ratings
from movies m cross join grouped
where m.mid = 1;
I don't see a problem with having several GROUP BY columns; that's a very common pattern in SQL. Of course, code clarity is often in the eye of the beholder.
Check the explain plans for the two approaches. My guess is that you'll get better performance with your original version, since it only needs to process the movie_ratings table once, but I haven't checked, and it will be somewhat data- and installation-dependent.
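Besides performance, the INNER JOIN + GROUP BY form and the scalar-subquery form can differ in results for a movie that has no ratings yet. The sketch below uses Python's sqlite3 rather than Oracle, with a reduced, made-up version of the tables (only mid, title and rating), so treat it as an illustration of the general behaviour, not as a test of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced, invented tables: movie 1 has two ratings, movie 2 has none.
    CREATE TABLE movies (mid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE movie_ratings (mid INTEGER, rating INTEGER);
    INSERT INTO movies VALUES (1, 'Rated'), (2, 'Unrated');
    INSERT INTO movie_ratings VALUES (1, 4), (1, 2);
""")

joined_sql = """
    SELECT m.title, AVG(mr.rating), COUNT(mr.rating)
    FROM movies m
    JOIN movie_ratings mr ON m.mid = mr.mid
    WHERE m.mid = ?
    GROUP BY m.title
"""
subquery_sql = """
    SELECT m.title,
           (SELECT AVG(rating) FROM movie_ratings mr WHERE mr.mid = m.mid),
           (SELECT COUNT(rating) FROM movie_ratings mr WHERE mr.mid = m.mid)
    FROM movies m
    WHERE m.mid = ?
"""

rated_joined = conn.execute(joined_sql, (1,)).fetchall()
rated_subquery = conn.execute(subquery_sql, (1,)).fetchall()
unrated_joined = conn.execute(joined_sql, (2,)).fetchall()
unrated_subquery = conn.execute(subquery_sql, (2,)).fetchall()

print(rated_joined, rated_subquery)  # [('Rated', 3.0, 2)] [('Rated', 3.0, 2)]
print(unrated_joined)                # []  (the inner join drops the movie)
print(unrated_subquery)              # [('Unrated', None, 0)]
```

If the joined form should also return unrated movies, a LEFT JOIN in place of the inner join is one way to get there.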
How about:
SELECT
title, year, MovieGenres(m.mid) genres,
MovieDirectors(m.mid) directors, MovieWriters(m.mid) writers,
synopsis, poster_url,
(SELECT AVG(rating) FROM movie_ratings WHERE mid = 1) av,
(SELECT COUNT(rating) FROM movie_ratings WHERE mid = 1) cnt
FROM movies m
WHERE m.mid = 1;
or
SELECT
title, year, MovieGenres(m.mid) genres,
MovieDirectors(m.mid) directors, MovieWriters(m.mid) writers,
synopsis, poster_url,
av.av,
cnt.cnt
FROM movies m,
(SELECT AVG(rating) av FROM movie_ratings WHERE mid = 1) av,
(SELECT COUNT(rating) cnt FROM movie_ratings WHERE mid = 1) cnt
WHERE m.mid = 1;