HiveQL query for data marked as table column names

I work on the HDP 2.6.5 platform, using Hive (1.2.1000.2.6.5.0-292) on a simple database based on data from:
https://grouplens.org/datasets/movielens/100k/
I have 4 tables named: genre, movies, ratings, users as below:
CREATE TABLE genre(genre string, genre_id int);
CREATE TABLE movies (movie_id INT, title STRING, rel_date DATE, video_rel_date STRING,
imdb_url STRING, unknown INT, action INT, adventure INT, animation INT, childrens INT,
comedy INT, crime INT, documentary INT, drama INT, fantasy INT, noir INT, horror INT,
musical INT, mystery INT, romance INT, sci_fi INT, thriller INT, war INT, western INT)
CLUSTERED BY (movie_id) INTO 12 BUCKETS STORED AS ORC;
CREATE TABLE ratings(user_id int, movie_id int, rating int, rating_time int);
CREATE TABLE users(user_id int, age int, gender char(1), occupation string, zip int);
I would like to write a query that returns which movie genre was watched most often by women and which by men. The problem is the structure of the movies table, where the genre is stored:
1|Toy Story (1995)|1995-01-01||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
The last 19 fields are the genres: a '1' indicates the movie is of that genre, and a '0' indicates it is not. A movie can belong to several genres at once. Gender is stored in the 'users' table as an 'M' or 'F' char.
The required tables can be joined easily, but how do I return and group by the genres, which are column names?
SELECT m.title, r.rating, u.gender
FROM movies m INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id);

Build an array of the genre columns, ordered to match genre_id, explode the array, and join to the genre table by array position. Like this (not tested):
select s.title, s.genre, s.gender, s.rating, s.cnt
from
(select s.title, s.gender, s.rating, s.cnt, s.genre,
rank() over (partition by s.gender order by s.cnt desc) as rnk
from
(
select m.title, u.gender, r.rating, g.genre, count(*) over(partition by u.gender, g.genre) cnt -- count per gender AND genre, otherwise every row of a gender ties
from
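(select m.movie_id, m.title, e.id as genre_id -- posexplode positions start at 0, like the ids in MovieLens u.genre (unknown|0); use e.id+1 only if your genre_id starts at 1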
from movies m
lateral view
posexplode (array(--place columns in a positions corresponding their genre_id
unknown, action, adventure, animation, childrens,
comedy, crime, documentary, drama, fantasy,
noir, horror, musical, mystery, romance,
sci_fi, thriller, war, western
)
)e as id, val
where e.val=1
) m
INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id)
INNER JOIN genre g ON (g.genre_id = m.genre_id)
) s
) s where rnk = 1
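The positional unpivot in the query above can be sketched outside Hive. This is a minimal Python illustration, not the answer's code: the genre list is shortened to four entries, and all rows are made up. `explode` is a hypothetical helper that mimics `posexplode` followed by `where val = 1`.

```python
from collections import Counter

# Genre names in genre_id order (shortened, made-up sample of the 19 genres).
GENRES = ["unknown", "action", "adventure", "animation"]

# movie_id -> one-hot genre flags, in the same order as GENRES.
movies = {
    1: [0, 0, 0, 1],  # animation
    2: [0, 1, 1, 0],  # action + adventure
    3: [0, 1, 0, 0],  # action
}
users = {10: "F", 11: "M", 12: "M", 13: "F"}  # user_id -> gender
ratings = [(10, 1), (10, 2), (13, 1), (11, 2), (11, 3), (12, 3), (12, 1)]  # (user_id, movie_id)


def explode(flags):
    """Return the genre name for every position whose flag is 1."""
    return [GENRES[pos] for pos, val in enumerate(flags) if val == 1]


# Count one "view" per (gender, genre) for every rating row.
counts = Counter()
for user_id, movie_id in ratings:
    for genre in explode(movies[movie_id]):
        counts[(users[user_id], genre)] += 1

# Keep the genre(s) with the highest count per gender; ties are all kept.
top = {}
for gender in set(users.values()):
    best = max(c for (g, _), c in counts.items() if g == gender)
    top[gender] = sorted(
        genre for (g, genre), c in counts.items() if g == gender and c == best
    )

print(top)
```

The key point is that the count is kept per (gender, genre) pair, which is what the window in the Hive query needs to partition by.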

Awful data model. You should have a table with one row per movie and genre.
To solve this problem, I would suggest unpivoting to aggregate:
select mg.*
from (select m.genre, u.gender, count(*) as cnt,
rank() over (partition by u.gender order by count(*) desc) as seqnum
from ((select movie_id, 'action' as genre from movies where action = 1) union all
(select movie_id, 'adventure' as genre from movies where adventure = 1) union all
. . .
) m join
ratings r
on r.movie_id = m.movie_id join
users u
on r.user_id = u.user_id
group by m.genre, u.gender -- group by genre, not movie, to rank genres
) mg
where seqnum = 1;
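To try the UNION ALL unpivot without a Hive cluster, here is a small sketch that runs the same idea in SQLite through Python's `sqlite3` module. The rows are made up, only three genre columns are included, and the aggregate is nested one level below the `RANK()` call to stay conservative about SQLite's handling of window functions (SQLite 3.25 or newer is assumed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movie_id INT, title TEXT, action INT, adventure INT, animation INT);
CREATE TABLE ratings (user_id INT, movie_id INT, rating INT);
CREATE TABLE users (user_id INT, gender TEXT);
INSERT INTO movies VALUES (1, 'A', 0, 0, 1), (2, 'B', 1, 1, 0), (3, 'C', 1, 0, 0);
INSERT INTO users VALUES (10, 'F'), (11, 'M'), (12, 'M'), (13, 'F');
INSERT INTO ratings VALUES (10, 1, 5), (10, 2, 3), (13, 1, 4),
                           (11, 2, 4), (11, 3, 5), (12, 3, 2), (12, 1, 3);
""")

query = """
SELECT gender, genre, cnt
FROM (SELECT gender, genre, cnt,
             RANK() OVER (PARTITION BY gender ORDER BY cnt DESC) AS seqnum
      FROM (SELECT u.gender, m.genre, COUNT(*) AS cnt
            -- one row per (movie, genre) instead of one column per genre
            FROM (SELECT movie_id, 'action' AS genre FROM movies WHERE action = 1
                  UNION ALL
                  SELECT movie_id, 'adventure' FROM movies WHERE adventure = 1
                  UNION ALL
                  SELECT movie_id, 'animation' FROM movies WHERE animation = 1) m
            JOIN ratings r ON r.movie_id = m.movie_id
            JOIN users u ON u.user_id = r.user_id
            GROUP BY u.gender, m.genre) agg) mg
WHERE seqnum = 1
ORDER BY gender
"""
top_genres = conn.execute(query).fetchall()
print(top_genres)
```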

Related

How I can use SQL to create a new table using WHERE/AND conditions?

I need to create a new table with the titles of the movies starring Johnny Depp AND Helena Bonham Caster.
I have these tables:
Movies: (ID, title and year)
Stars: (Movie_ID, Person_ID)
People: (ID, Name, and Birth year)
CREATE TABLE johnny_helena (title TEXT);
INSERT INTO johnny_helena (title)
SELECT title
FROM movies
WHERE id IN (SELECT movie_id FROM stars WHERE person_id IN (SELECT id FROM people WHERE name = 'Johnny Depp') AND person_id IN (SELECT id FROM people WHERE name = 'Helena Bonham Caster'))
When I execute this code I get a table with 0 rows, but there should be 6.
I'm new to SQL and I believe I'm making a foolish mistake. How can I fix this?
Thanks.
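Your subquery tests both conditions against the same row of stars, and a single row has only one person_id, so it can never match both names. One option uses joins and aggregation: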
INSERT INTO johnny_helena (title)
SELECT m.title
FROM movies m
INNER JOIN stars s ON s.movie_id = m.id
INNER JOIN people p ON p.id = s.person_id
WHERE p.name in ('Johnny Depp', 'Helena Bonham Caster')
GROUP BY m.id, m.title
HAVING COUNT(*) = 2
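A small, self-contained sketch of why the original query returns nothing and how the aggregate version behaves. It uses SQLite via Python's `sqlite3`, with made-up movies and placeholder names ('Actor A', 'Actor B'). `COUNT(DISTINCT p.name)` is used here as a slightly more defensive variant of `COUNT(*)`, in case stars could hold duplicate rows; that is an assumption, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stars (movie_id INT, person_id INT);
INSERT INTO movies VALUES (1, 'Both'), (2, 'Only A'), (3, 'Only B');
INSERT INTO people VALUES (1, 'Actor A'), (2, 'Actor B'), (3, 'Actor C');
INSERT INTO stars VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 2);
""")

# Original shape: both IN tests apply to the same stars row, which has a
# single person_id, so the AND can never be true.
broken = conn.execute("""
SELECT title FROM movies WHERE id IN (
  SELECT movie_id FROM stars
  WHERE person_id IN (SELECT id FROM people WHERE name = 'Actor A')
    AND person_id IN (SELECT id FROM people WHERE name = 'Actor B'))
""").fetchall()

# Join + aggregate: keep movies where both names appear among the cast rows.
fixed = conn.execute("""
SELECT m.title
FROM movies m
JOIN stars s ON s.movie_id = m.id
JOIN people p ON p.id = s.person_id
WHERE p.name IN ('Actor A', 'Actor B')
GROUP BY m.id, m.title
HAVING COUNT(DISTINCT p.name) = 2
""").fetchall()

print(broken, fixed)
```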

How can this query be optimized?

I need to write this query in postgresql 9.3:
List the most popular movie in each country. The most popular movie/movies is the one that has got the highest average rating
across all the users of that country. In case of a tie, return all
movies order alphabetically. (2 columns)
Tables needed:
CREATE TABLE movie (
id integer,
name varchar(200),
year date
);
CREATE TABLE userProfile (
userid varchar(200),
gender char(1),
age integer,
country varchar(200),
registered date
);
CREATE TABLE ratings (
mid integer,
userid varchar(200),
rating integer
);
CREATE INDEX movie_id_idx ON movie (id);
CREATE INDEX userProfile_userid_idx ON userProfile (userid);
CREATE INDEX ratings_userid_idx ON ratings (userid);
CREATE INDEX ratings_mid_idx ON ratings (mid);
CREATE INDEX ratings_userid_mid_idx ON ratings (userid, mid);
Here is my query:
CREATE TEMP TABLE tops AS SELECT country, name
FROM ratings AS r INNER JOIN userProfile AS u
ON r.userid=u.userid
INNER JOIN movie AS m ON m.id = r.mid LIMIT 0;
~10 min
CREATE TEMP TABLE avg_country AS
SELECT country, r.mid, AVG(rating) AS rate
FROM ratings AS r INNER JOIN userProfile AS u
ON r.userid=u.userid
GROUP BY country, r.mid;
~8 min
DO $$
DECLARE arrow record;
BEGIN
CREATE TABLE movie_names AS SELECT id, name FROM movie;
FOR arrow IN SELECT DISTINCT country FROM userProfile ORDER BY country
LOOP
CREATE TABLE movies AS SELECT mid FROM (SELECT MAX(rate) AS m_rate FROM avg_country
WHERE country=arrow.country) AS max_val CROSS JOIN LATERAL
(SELECT mid FROM avg_country
WHERE country=arrow.country AND rate=max_val.m_rate) AS a;
WITH names AS (DELETE FROM movie_names AS m
WHERE m.id IN (SELECT mid FROM movies) RETURNING name)
INSERT INTO tops
SELECT arrow.country, name FROM names ORDER BY name;
DROP TABLE movies;
END LOOP;
DROP TABLE movie_names;
END$$;
SELECT * FROM tops;
DROP TABLE tops, avg_country;
Thanks a lot in advance)
This is similar to kordirko's answer, but with one fewer subquery:
select country, movie_name, avg_rating
from (select u.country, m.name as movie_name, avg(r.rating) as avg_rating,
rank() over (partition by u.country order by avg(r.rating) desc) as seqnum
from userProfile u join
ratings r
on u.userid = r.userid join
movie m
on r.mid = m.id
group by u.country, m.id, m.name -- `name` could be omitted only if movie.id were declared a primary key
) uc
where seqnum = 1;
Alternatively, if you want to get the list on one row per country:
select country, string_agg(movie_name, '; ') as most_popular_movies
from (select u.country, m.name as movie_name, avg(r.rating) as avg_rating,
rank() over (partition by u.country order by avg(r.rating) desc) as seqnum
from userProfile u join
ratings r
on u.userid = r.userid join
movie m
on r.mid = m.id
group by u.country, m.id, m.name -- `name` could be omitted only if movie.id were declared a primary key
) uc
where seqnum = 1
group by country;
Use plain, old-fashioned SQL. It is old but gold.
WITH q AS (
SELECT *,
dense_rank() over (partition by country order by avg_rating desc ) rank
FROM (
select u.country, m.name movie_name, avg( r.rating ) avg_rating
from userProfile u
join ratings r on u.userid = r.userid
join movie m on r.mid = m.id
group by u.country, m.name
) xx )
SELECT country, movie_name
FROM q WHERE rank <= 1
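The rank-per-country pattern used in these answers can be checked on a tiny data set. The sketch below runs it in SQLite through Python's `sqlite3` rather than PostgreSQL, with made-up users, movies, and ratings; SQLite 3.25 or newer is assumed for window functions. Two movies tie in one country, to show that a rank-based filter keeps both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (id INT, name TEXT);
CREATE TABLE userProfile (userid TEXT, country TEXT);
CREATE TABLE ratings (mid INT, userid TEXT, rating INT);
INSERT INTO movie VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
INSERT INTO userProfile VALUES ('u1', 'US'), ('u2', 'US'), ('u3', 'FR');
INSERT INTO ratings VALUES (1, 'u1', 5), (1, 'u2', 3),   -- Alpha in US: avg 4
                           (2, 'u1', 4), (2, 'u2', 4),   -- Beta in US: avg 4
                           (3, 'u1', 1),                 -- Gamma in US: avg 1
                           (1, 'u3', 2), (3, 'u3', 5);   -- FR: Gamma wins
""")

query = """
SELECT country, movie_name
FROM (SELECT country, movie_name,
             RANK() OVER (PARTITION BY country ORDER BY avg_rating DESC) AS seqnum
      FROM (SELECT u.country, m.name AS movie_name, AVG(r.rating) AS avg_rating
            FROM userProfile u
            JOIN ratings r ON u.userid = r.userid
            JOIN movie m ON r.mid = m.id
            GROUP BY u.country, m.id, m.name) agg) ranked
WHERE seqnum = 1
ORDER BY country, movie_name
"""
tops = conn.execute(query).fetchall()
print(tops)
```

A single set-based statement like this replaces the per-country loop and temporary tables from the question.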

How to write sql query using group by

I have these tables:
actors(id: int, first_name: string, last_name: string, gender: string)
directors(id: int, first_name: string, last_name: string)
directors_genres(director_id: int, genre: string, prob: float)
movies(id: int, name: string, year: int, rank: float)
movies_directors(director_id: int, movie_id: int)
movies_genres(movie_id: int, genre: string)
roles(actor_id: int, movie_id: int, role: string)
For each genre, I want to find the year in which the most movies of that genre were made.
I am doing the following but I'm stuck, please help!
select m.YEAR, count(m.year) as c, genre
from movies_genres,
movies m
where m.id = movies_genres.movie_id
group by genre, m.year;
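You are already getting the count of movies for each genre and year, which is the hard part. Now keep, for each genre, the row whose count equals that genre's maximum, by using your query as a derived table:
select t.genre, t.year, t.c as mc
from
(select m.year, count(*) as c, mg.genre
from movies_genres mg
inner join movies m
on m.id = mg.movie_id
group by mg.genre, m.year) t
where t.c =
(select max(t2.c)
from
(select mg.genre, m.year, count(*) as c
from movies_genres mg
inner join movies m
on m.id = mg.movie_id
group by mg.genre, m.year) t2
where t2.genre = t.genre)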
If you are using SQL Server or Oracle or DB2 this should work.
SELECT genre, year, cnt
FROM
(
SELECT genre, year, cnt,
row_number() over (PARTITION BY genre ORDER BY cnt DESC) as rn
FROM
(
SELECT mg.genre, m.year, count(*) as cnt -- `number` is a reserved word in Oracle, so use another alias
FROM movies_genres mg
JOIN movies m on m.id = mg.movie_id
GROUP BY mg.genre, m.year
) A
) B
WHERE rn = 1
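How it works: working from the innermost query outward, first count the movies for every genre and year, then rank each genre's years by that count, and finally keep the top-ranked year. Because row_number() is used, a genre whose top years tie gets only one of them; rank() would return all of them.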
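The difference between `row_number()` and `rank()` on ties is easy to see on a few rows. This is a sketch in SQLite via Python's `sqlite3` (3.25 or newer assumed), with made-up data in which one genre has two years tied for first place. The extra `year` sort key on `ROW_NUMBER()` is an addition here so the chosen row is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INT, year INT);
CREATE TABLE movies_genres (movie_id INT, genre TEXT);
INSERT INTO movies VALUES (1, 1999), (2, 2000), (3, 2000), (4, 2001), (5, 2001);
INSERT INTO movies_genres VALUES
  (1, 'Comedy'), (2, 'Comedy'), (3, 'Comedy'),              -- 1999: 1, 2000: 2
  (2, 'Drama'), (3, 'Drama'), (4, 'Drama'), (5, 'Drama');   -- 2000: 2, 2001: 2 (tie)
""")

ranked = conn.execute("""
SELECT genre, year, cnt,
       ROW_NUMBER() OVER (PARTITION BY genre ORDER BY cnt DESC, year) AS rn,
       RANK()       OVER (PARTITION BY genre ORDER BY cnt DESC)       AS rnk
FROM (SELECT mg.genre, m.year, COUNT(*) AS cnt
      FROM movies_genres mg
      JOIN movies m ON m.id = mg.movie_id
      GROUP BY mg.genre, m.year) a
ORDER BY genre, year
""").fetchall()

# row_number keeps exactly one year per genre; rank keeps every tied year.
by_row_number = [(g, y) for g, y, cnt, rn, rnk in ranked if rn == 1]
by_rank = [(g, y) for g, y, cnt, rn, rnk in ranked if rnk == 1]
print(by_row_number, by_rank)
```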

How to use an aggregate function in SQL WITH JOINS?

The following is a query that I built to find the cheapest price for a particular movie, along with the distributor who sells it. How do I change it so that all of the movies are listed with their respective highest price, along with the distributor who sells it?
/*finds the cheapest price for the movie "American Beauty" and the distributor who sells it*/
SELECT movies.title, movie_distributors.distributor_name, MIN(distributed_movie_list.unit_price) AS lowest_price
FROM distributed_movie_list
INNER JOIN movies ON distributed_movie_list.movie_id = movies.movie_id
INNER JOIN movie_distributors ON movie_distributors.distributor_id = distributed_movie_list.distributor_id
WHERE movies.title = 'American Beauty'
AND distributed_movie_list.unit_price =
(SELECT MIN(unit_price)
FROM distributed_movie_list
INNER JOIN movies ON distributed_movie_list.movie_id = movies.movie_id
WHERE movies.title = 'American Beauty')
GROUP BY movies.title, distributed_movie_list.distributor_id, movie_distributors.distributor_name;
The following tables are used for this query:
create table movies (
movie_id number(5),
title varchar2(30) not null,
category_code char(3) not null,
description varchar2(500),
released_by number(3) not null,
released_on date not null
);
create table movie_distributors(
distributor_id number(3),
distributor_name varchar2(30) not null,
location varchar2(40),
contact varchar2(40)
);
create table distributed_movie_list(
distribution_id number(8),
movie_id number(5) not null,
distributor_id number(3) not null,
distribute_type varchar2(10),
inventory_quantity number(3) default 0,
unit_price number(8,2)
);
The end result I'm looking for is a query that list distributors and highest prices for each movie. Any help would be greatly appreciated ; )
The end result I'm looking for is a query that list distributors and
highest prices for each movie
Then why not just GROUP BY title, distributor_name with MAX like so:
SELECT
m.movie_id,
m.title,
md.distributor_name,
MAX(d.unit_price) AS Highest_price
FROM distributed_movie_list d
INNER JOIN movies m ON d.movie_id = m.movie_id
INNER JOIN movie_distributors md ON md.distributor_id = d.distributor_id
GROUP BY m.movie_id,m.title,
md.distributor_name
HAVING MAX(d.unit_price) = (SELECT MAX(d2.unit_price)
FROM distributed_movie_list d2
WHERE m.movie_id = d2.movie_id)
The following should work on most RDBMSs:
SELECT DISTINCT m.title, md.distributor_name, dml.unit_price AS highest_price
FROM distributed_movie_list dml
INNER JOIN movies m ON dml.movie_id = m.movie_id
INNER JOIN movie_distributors md ON md.distributor_id = dml.distributor_id
INNER JOIN (SELECT movie_id, MAX(unit_price) AS unit_price
FROM distributed_movie_list
GROUP BY movie_id) dmlh
ON dml.unit_price = dmlh.unit_price AND dml.movie_id = dmlh.movie_id
A more efficient solution should be possible for RDBMSs (such as Oracle or SQLServer) that support ranking functions.
If you are trying to get the highest price for each movie -- and associated information -- then use the row_number() function. The following query returns all the information about the highest price for each movie:
select dml.*
from (select dml.*,
ROW_NUMBER() over (partition by movie_id order by unit_price desc) as seqnum
from distributed_movie_list dml
) dml
where seqnum = 1
You can join in other information that you want about movies and distributors.
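As a sketch of that last step, the ranked rows can be joined back to the movie and distributor tables to get names alongside the highest price. The example runs in SQLite via Python's `sqlite3` (3.25 or newer assumed) with made-up rows and a trimmed-down version of the schema, so it illustrates the shape of the query rather than Oracle-specific behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movie_id INT, title TEXT);
CREATE TABLE movie_distributors (distributor_id INT, distributor_name TEXT);
CREATE TABLE distributed_movie_list (
  distribution_id INT, movie_id INT, distributor_id INT, unit_price REAL);
INSERT INTO movies VALUES (1, 'American Beauty'), (2, 'Gladiator');
INSERT INTO movie_distributors VALUES (1, 'Acme'), (2, 'Bolt');
INSERT INTO distributed_movie_list VALUES
  (1, 1, 1, 10.0), (2, 1, 2, 12.5), (3, 2, 1, 8.0), (4, 2, 2, 7.0);
""")

highest = conn.execute("""
SELECT m.title, md.distributor_name, t.unit_price
FROM (SELECT dml.*,
             ROW_NUMBER() OVER (PARTITION BY movie_id
                                ORDER BY unit_price DESC) AS seqnum
      FROM distributed_movie_list dml) t
JOIN movies m ON m.movie_id = t.movie_id
JOIN movie_distributors md ON md.distributor_id = t.distributor_id
WHERE t.seqnum = 1          -- the most expensive listing of each movie
ORDER BY m.title
""").fetchall()
print(highest)
```

If two distributors share the highest price for a movie, `ROW_NUMBER()` returns only one of them; swapping in `RANK()` would return both.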

MIN and COUNT Oracle SQL Query

I have tried to write this query: for each country, which hospitals have the lowest number of doctors? (1st column: name of the country; 2nd column: name of the hospital. If more than one hospital has the lowest number of doctors, all of them must appear in the result.) But the result isn't what I expected, and the query has a syntax error.
I have these tables:
CREATE TABLE Hospital (
hid INT PRIMARY KEY,
name VARCHAR(127) UNIQUE,
country VARCHAR(127),
area INT
);
CREATE TABLE Doctor (
ic INT PRIMARY KEY,
name VARCHAR(127),
date_of_birth INT
);
CREATE TABLE Work (
hid INT,
ic INT,
since INT,
FOREIGN KEY (hid) REFERENCES Hospital (hid),
FOREIGN KEY (ic) REFERENCES Doctor (ic),
PRIMARY KEY (hid,ic)
);
I tried with this:
SELECT DISTINCT H.country, H.name, MIN(*)
FROM Hospital H
WHERE H.hid IN (
SELECT COUNT(*)
FROM Work W, Doctor D
WHERE W.hid = H.hid AND W.ic = D.ic
GROUP BY H.country
)
GROUP BY H.country
;
Thanks.
SELECT country, name
FROM
(
SELECT hid, country, name, doctorCount,
MIN(doctorCount) OVER (PARTITION BY country) minCount
FROM
(
SELECT a.hid, a.country, a.name, COUNT(b.hid) doctorCount
FROM Hospital a
LEFT JOIN Work b
ON a.hid = b.hid
GROUP BY a.hid, a.country, a.name
) x
) y
WHERE doctorCount = minCount
Try this:
WITH doctorCount AS
(SELECT H.country country, H.hid hid, COUNT(*) dCount
FROM Work W, Doctor D, Hospital H
WHERE W.ic = D.ic
AND H.hid = W.hid
GROUP BY H.country, H.hid),
minCount AS
(SELECT D.country, MIN (D.dCount) lowCount
FROM doctorCount D
GROUP BY D.country)
SELECT D.country, H.name
FROM doctorCount D, Hospital H, minCount M
WHERE D.hid = H.hid
AND M.country = D.country
AND D.dCount = M.lowCount;
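One detail worth checking is hospitals with no doctors at all: the inner joins above drop them, so they can never be reported as the minimum. The sketch below is a variant of the same two-step idea (count per hospital, then minimum per country) using a LEFT JOIN, run in SQLite via Python's `sqlite3` on made-up rows. Whether zero-doctor hospitals should count is an assumption about the assignment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Hospital (hid INT PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE Work (hid INT, ic INT, PRIMARY KEY (hid, ic));
INSERT INTO Hospital VALUES (1, 'H1', 'PT'), (2, 'H2', 'PT'),
                            (3, 'H3', 'ES'), (4, 'H4', 'ES');
-- H1: 2 doctors, H2: 1, H3: 2, H4: none
INSERT INTO Work VALUES (1, 100), (1, 101), (2, 102), (3, 103), (3, 104);
""")

lowest = conn.execute("""
WITH doctorCount AS (
  SELECT h.hid, h.country, h.name, COUNT(w.ic) AS dCount  -- 0 when no Work rows
  FROM Hospital h
  LEFT JOIN Work w ON w.hid = h.hid
  GROUP BY h.hid, h.country, h.name),
minCount AS (
  SELECT country, MIN(dCount) AS lowCount
  FROM doctorCount
  GROUP BY country)
SELECT d.country, d.name
FROM doctorCount d
JOIN minCount m ON m.country = d.country AND d.dCount = m.lowCount
ORDER BY d.country, d.name
""").fetchall()
print(lowest)
```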