How to use an aggregate function in SQL with JOINs?

The following is a query that I built to find the cheapest price for a particular movie, along with the distributor who sells it. How do I change it so that all of the movies are listed with their respective highest price, along with the distributor who sells it?
/*finds the cheapest price for the movie "American Beauty" and the distributor who sells it*/
SELECT movies.title, movie_distributors.distributor_name, MIN(distributed_movie_list.unit_price) AS lowest_price
FROM distributed_movie_list
INNER JOIN movies ON distributed_movie_list.movie_id = movies.movie_id
INNER JOIN movie_distributors ON movie_distributors.distributor_id = distributed_movie_list.distributor_id
WHERE movies.title = 'American Beauty'
AND distributed_movie_list.unit_price =
(SELECT MIN(unit_price)
FROM distributed_movie_list
INNER JOIN movies ON distributed_movie_list.movie_id = movies.movie_id
WHERE movies.title = 'American Beauty')
GROUP BY movies.title, distributed_movie_list.distributor_id, movie_distributors.distributor_name;
The following tables are used for this query:
create table movies (
movie_id number(5),
title varchar2(30) not null,
category_code char(3) not null,
description varchar2(500),
released_by number(3) not null,
released_on date not null
);
create table movie_distributors(
distributor_id number(3),
distributor_name varchar2(30) not null,
location varchar2(40),
contact varchar2(40)
);
create table distributed_movie_list(
distribution_id number(8),
movie_id number(5) not null,
distributor_id number(3) not null,
distribute_type varchar2(10),
inventory_quantity number(3) default 0,
unit_price number(8,2)
);
The end result I'm looking for is a query that lists distributors and highest prices for each movie. Any help would be greatly appreciated ; )

The end result I'm looking for is a query that lists distributors and
highest prices for each movie
Then why not just GROUP BY title, distributor_name with MAX like so:
SELECT
m.movie_id,
m.title,
md.distributor_name,
MAX(d.unit_price) AS Highest_price
FROM distributed_movie_list d
INNER JOIN movies m ON d.movie_id = m.movie_id
INNER JOIN movie_distributors md ON md.distributor_id = d.distributor_id
GROUP BY m.movie_id,m.title,
md.distributor_name
HAVING MAX(d.unit_price) = (SELECT MAX(d2.unit_price)
FROM distributed_movie_list d2
WHERE m.movie_id = d2.movie_id)

The following should work on most RDBMSs:
SELECT DISTINCT m.title, md.distributor_name, dml.unit_price AS highest_price
FROM distributed_movie_list dml
INNER JOIN movies m ON dml.movie_id = m.movie_id
INNER JOIN movie_distributors md ON md.distributor_id = dml.distributor_id
INNER JOIN (SELECT movie_id, MAX(unit_price) AS unit_price
FROM distributed_movie_list
GROUP BY movie_id) dmlh
ON dml.unit_price = dmlh.unit_price AND dml.movie_id = dmlh.movie_id
A more efficient solution should be possible for RDBMSs (such as Oracle or SQL Server) that support ranking functions.
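To try the derived-table join without an Oracle instance, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows and distributor names are made up for illustration, and only the columns the query needs are created; the subquery's MAX(unit_price) is aliased so the join condition can refer to it.

```python
import sqlite3

# Hypothetical sample data; only the columns used by the query are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE movie_distributors (
    distributor_id INTEGER PRIMARY KEY,
    distributor_name TEXT NOT NULL
);
CREATE TABLE distributed_movie_list (
    distribution_id INTEGER PRIMARY KEY,
    movie_id INTEGER NOT NULL,
    distributor_id INTEGER NOT NULL,
    unit_price REAL
);
INSERT INTO movies VALUES (1, 'American Beauty'), (2, 'Gladiator');
INSERT INTO movie_distributors VALUES (10, 'Acme Films'), (20, 'Bravo Video');
INSERT INTO distributed_movie_list VALUES
    (100, 1, 10, 12.5), (101, 1, 20, 15.0),
    (102, 2, 10, 20.0), (103, 2, 20, 18.0);
""")

# Join each price row to the per-movie maximum, keeping only the rows
# whose price equals that maximum.
HIGHEST_PRICE_SQL = """
SELECT DISTINCT m.title, md.distributor_name, dml.unit_price AS highest_price
FROM distributed_movie_list dml
INNER JOIN movies m ON dml.movie_id = m.movie_id
INNER JOIN movie_distributors md ON md.distributor_id = dml.distributor_id
INNER JOIN (SELECT movie_id, MAX(unit_price) AS unit_price
            FROM distributed_movie_list
            GROUP BY movie_id) dmlh
        ON dml.movie_id = dmlh.movie_id AND dml.unit_price = dmlh.unit_price
ORDER BY m.title
"""
highest_prices = conn.execute(HIGHEST_PRICE_SQL).fetchall()
for row in highest_prices:
    print(row)
```

If two distributors share the highest price for a movie, this form returns both of them.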

If you are trying to get the highest price for each movie -- and associated information -- then use the row_number() function. The following query returns all the information about the highest price for each movie:
select dml.*
from (select dml.*,
ROW_NUMBER() over (partition by movie_id order by unit_price desc) as seqnum
from distributed_movie_list dml
) dml
where seqnum = 1
You can join in other information that you want about movies and distributors.
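As a rough sketch of joining in the movie and distributor details, the snippet below runs the ranked subquery in SQLite (3.25 or later, for window functions) through Python's sqlite3 module. The sample rows, the helper name top_priced, and the alias ranked are assumptions made for illustration. It also compares ROW_NUMBER() with RANK(): when two distributors tie on the highest price, ROW_NUMBER() keeps one of them arbitrarily while RANK() keeps both.

```python
import sqlite3

# Hypothetical sample data; 'Gladiator' has two distributors tied at 20.0.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE movie_distributors (
    distributor_id INTEGER PRIMARY KEY,
    distributor_name TEXT NOT NULL
);
CREATE TABLE distributed_movie_list (
    distribution_id INTEGER PRIMARY KEY,
    movie_id INTEGER NOT NULL,
    distributor_id INTEGER NOT NULL,
    unit_price REAL
);
INSERT INTO movies VALUES (1, 'American Beauty'), (2, 'Gladiator');
INSERT INTO movie_distributors VALUES (10, 'Acme Films'), (20, 'Bravo Video');
INSERT INTO distributed_movie_list VALUES
    (100, 1, 10, 12.5), (101, 1, 20, 15.0),
    (102, 2, 10, 20.0), (103, 2, 20, 20.0);
""")

def top_priced(rank_fn):
    """Return (title, distributor, price) rows ranked first per movie."""
    if rank_fn not in ("ROW_NUMBER", "RANK"):
        raise ValueError("rank_fn must be ROW_NUMBER or RANK")
    sql = f"""
    SELECT m.title, md.distributor_name, ranked.unit_price AS highest_price
    FROM (SELECT dml.*,
                 {rank_fn}() OVER (PARTITION BY movie_id
                                   ORDER BY unit_price DESC) AS seqnum
          FROM distributed_movie_list dml) ranked
    INNER JOIN movies m ON m.movie_id = ranked.movie_id
    INNER JOIN movie_distributors md
            ON md.distributor_id = ranked.distributor_id
    WHERE ranked.seqnum = 1
    ORDER BY m.title, md.distributor_name
    """
    return conn.execute(sql).fetchall()

print(top_priced("ROW_NUMBER"))  # one row per movie
print(top_priced("RANK"))        # tied distributors are all kept
```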

Related

HiveQL query for data marked as table column names

I work on the HDP 2.6.5 platform, using Hive (1.2.1000.2.6.5.0-292) on a simple database based on data from:
https://grouplens.org/datasets/movielens/100k/
I have 4 tables named genre, movies, ratings, and users, as below:
CREATE TABLE genre(genre string, genre_id int);
CREATE TABLE movies (movie_id INT, title STRING, rel_date DATE, video_rel_date STRING,
imdb_url STRING, unknown INT, action INT, adventure INT, animation INT, childrens INT,
comedy INT, crime INT, documentary INT, drama INT, fantasy INT, noir INT, horror INT,
musical INT, mystery INT, romance INT, sci_fi INT, thriller INT, war INT, western INT)
CLUSTERED BY (movie_id) INTO 12 BUCKETS STORED AS ORC;
CREATE TABLE ratings(user_id int, movie_id int, rating int, rating_time int);
CREATE TABLE users(user_id int, age int, gender char(1), occupation string, zip int);
I would like to write a query that returns which movie genre was watched most often by women and which by men. The problem for me is the structure of the movies table, where the movie genre is stored:
1|Toy Story (1995)|1995-01-01||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
The last 19 fields are the genres: a '1' indicates the movie is of that genre, a '0' indicates it is not. Additionally, a movie can be in several genres at once. The gender is represented in the 'users' table as an 'M' or 'F' char.
The required tables can easily be joined, but how do I return and group by the genres, which are column names?
SELECT m.title, r.rating, u.gender
FROM movies m INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id);
Make an array of the genre columns, placed in the order corresponding to genre_id, explode the array, and join by position in the array with the genre table. Like this (not tested):
select s.title, s.genre, s.gender, s.rating, s.cnt
from
(select s.title, s.gender, s.rating, s.cnt, s.genre,
rank() over (partition by s.gender order by s.cnt desc) as rnk
from
(
select m.title, u.gender, r.rating, g.genre, count(*) over(partition by u.gender, g.genre) cnt
from
(select m.movie_id, m.title, e.id+1 as genre_id
from movies m
lateral view
posexplode (array(-- place the columns in positions corresponding to their genre_id
unknown, action, adventure, animation, childrens,
comedy, crime, documentary, drama, fantasy,
noir, horror, musical, mystery, romance,
sci_fi, thriller, war, western
)
)e as id, val
where e.val=1
) m
INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id)
INNER JOIN genre g ON (g.genre_id = m.genre_id)
) s
) s where rnk = 1
Awful data model. You should have a table with one row per movie and genre.
To solve this problem, I would suggest unpivoting to aggregate:
select mg.*
from (select m.genre, u.gender, count(*) as cnt,
rank() over (partition by u.gender order by count(*) desc) as seqnum
from ((select movie_id, 'action' as genre from movies where action = 1) union all
(select movie_id, 'adventure' as genre from movies where adventure = 1) union all
. . .
) m join
ratings r
on r.movie_id = m.movie_id join
users u
on r.user_id = u.user_id
group by m.genre, u.gender
) mg
where seqnum = 1;
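The unpivot idea can be tried outside Hive. The following is only a sketch in SQLite (3.25 or later) through Python's sqlite3 module, not HiveQL: the movies table is cut down to two hypothetical genre columns, the sample rows are invented, and the counting is moved into a CTE (genre_counts, a name chosen here) before ranking.

```python
import sqlite3

# Hypothetical, cut-down version of the MovieLens tables (two genre flags).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movie_id INTEGER, title TEXT, action INTEGER, comedy INTEGER);
CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER);
CREATE TABLE users (user_id INTEGER, gender TEXT);
INSERT INTO movies VALUES (1, 'Movie A', 0, 1), (2, 'Movie B', 1, 0),
                          (3, 'Movie C', 1, 1);
INSERT INTO users VALUES (1, 'F'), (2, 'M'), (3, 'F');
INSERT INTO ratings VALUES (1, 1), (1, 3), (3, 1), (3, 2), (2, 2), (2, 3);
""")

TOP_GENRE_SQL = """
WITH genre_counts AS (
    -- Unpivot: one row per (movie, genre) for every flag set to 1,
    -- then count ratings per gender and genre.
    SELECT u.gender, mg.genre, COUNT(*) AS cnt
    FROM (SELECT movie_id, 'action' AS genre FROM movies WHERE action = 1
          UNION ALL
          SELECT movie_id, 'comedy' AS genre FROM movies WHERE comedy = 1) mg
    JOIN ratings r ON r.movie_id = mg.movie_id
    JOIN users u ON u.user_id = r.user_id
    GROUP BY u.gender, mg.genre
)
SELECT gender, genre, cnt
FROM (SELECT gender, genre, cnt,
             RANK() OVER (PARTITION BY gender ORDER BY cnt DESC) AS seqnum
      FROM genre_counts) ranked
WHERE seqnum = 1
ORDER BY gender
"""
top_genres = conn.execute(TOP_GENRE_SQL).fetchall()
print(top_genres)
```

Grouping by genre rather than movie_id is what makes the result a genre per gender; a movie flagged for several genres contributes one count to each of them.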

Write a query to find the film which grossed the highest revenue for the video renting organisation

select title from film where film_id in ("highest revenue")
select film_id from inventory where inventory_id in (
select inventory_id from rental group by inventory_id
order by count(inventory_id) desc ))
limit (highest revenue);
Where am I wrong?
SELECT title
FROM film
WHERE film_id in (SELECT film_id
FROM inventory
WHERE inventory_id in (
SELECT inventory_id
FROM rental
GROUP BY inventory_id
ORDER BY count(inventory_id) DESC
)) limit 1;
Even this code with joins will work:
select Title
from film
inner join inventory
using (film_id)
inner join rental
using (inventory_id)
inner join payment
using (rental_id)
group by title
order by sum(amount) desc
limit 1;
SELECT title
FROM film
WHERE film_id IN ("highest revenue")
SELECT film_id
FROM inventory
WHERE inventory_id IN (
SELECT inventory_id
FROM rental
GROUP BY inventory_id
ORDER BY count(inventory_id) DESC
) limit(highest revenue);
Remove the ) after DESC
Write a query to find the film which grossed the highest revenue for the video renting organization.
select TITLE from FILM f
inner join INVENTORY i using (FILM_ID)
inner join RENTAL r using (INVENTORY_ID)
inner join PAYMENT p using (RENTAL_ID)
group by TITLE
order by sum(AMOUNT) desc
limit 1;
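For anyone comparing the two approaches in this thread: ordering by the number of rentals and ordering by the summed payment amount can pick different films. Here is a small sketch with Python's sqlite3 module; the table layout is a reduced, assumed version of the film/inventory/rental/payment tables and the rows are made up.

```python
import sqlite3

# Hypothetical reduced schema and data: 'Alpha' is rented more often,
# but 'Beta' brings in more money.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
CREATE TABLE payment (payment_id INTEGER PRIMARY KEY, rental_id INTEGER, amount REAL);
INSERT INTO film VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO inventory VALUES (1, 1), (2, 2), (3, 2);
INSERT INTO rental VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 3);
INSERT INTO payment VALUES (1, 1, 0.99), (2, 2, 0.99), (3, 3, 0.99),
                           (4, 4, 4.99), (5, 5, 4.99);
""")

# Highest revenue: sum the payments per film.
top_by_revenue = conn.execute("""
    SELECT title
    FROM film
    INNER JOIN inventory USING (film_id)
    INNER JOIN rental USING (inventory_id)
    INNER JOIN payment USING (rental_id)
    GROUP BY film.film_id, title
    ORDER BY SUM(amount) DESC
    LIMIT 1
""").fetchone()[0]

# Most rented: count the rentals per film (not the same question).
most_rented = conn.execute("""
    SELECT title
    FROM film
    INNER JOIN inventory USING (film_id)
    INNER JOIN rental USING (inventory_id)
    GROUP BY film.film_id, title
    ORDER BY COUNT(*) DESC
    LIMIT 1
""").fetchone()[0]

print(top_by_revenue, most_rented)
```

Grouping by film_id as well as title is a choice made in this sketch so that two films sharing a title would not be merged.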

What is the most efficient way of selecting data from relational database?

I just started working with databases, and I have this sample database from the PostgreSQL tutorial:
https://www.postgresqltutorial.com/postgresql-sample-database/
Its diagram looks like this:
[schema diagram]
I want to find all film categories rented in, for example, Canada. Is there a way of doing it without using a SELECT within SELECT statement like this:
SELECT * FROM category WHERE category_id IN (
SELECT category_id FROM film_category WHERE film_id IN (
SELECT film_id FROM film WHERE film_id IN (
SELECT film_id FROM inventory WHERE inventory_id IN (
SELECT inventory_id FROM rental WHERE staff_id IN (
SELECT staff_id FROM staff WHERE store_id IN (
SELECT store_id FROM store WHERE address_id IN (
SELECT address_id FROM address WHERE city_id IN (
SELECT city_id FROM city WHERE country_id IN (
SELECT country_id FROM country WHERE country IN ('Canada')
)
)
)
)
)
)
)
)
)
I'm sure there must be something that I'm missing.
The proper way is to use joins instead of all these nested subqueries:
select distinct c.category_id, c.name
from category c
inner join film_category fc on fc.category_id = c.category_id
inner join inventory i on i.film_id = fc.film_id
inner join rental r on r.inventory_id = i.inventory_id
inner join staff s on s.staff_id = r.staff_id
inner join store sr on sr.store_id = s.store_id
inner join address a on a.address_id = sr.address_id
inner join city ct on ct.city_id = a.city_id
inner join country cr on cr.country_id = ct.country_id
where cr.country = 'Canada'
For your requirement you must join 9 tables (1 fewer than in your code, because the film table is not really needed: the column film_id can link the tables film_category and inventory directly).
Notice the aliases for each table, which shorten the code and make it more readable, and the ON clauses, which link each pair of tables.
Also, the keyword DISTINCT is used so that you don't get duplicates in the results, because all these joins return many rows for each category.
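To see why DISTINCT matters here, the sketch below uses Python's sqlite3 module on a deliberately shortened chain (category, film_category, inventory only) with invented rows. It is an illustration of the row multiplication, not the full nine-table query.

```python
import sqlite3

# Hypothetical three-table subset of the sample database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
INSERT INTO category VALUES (1, 'Action');
INSERT INTO film_category VALUES (1, 1), (2, 1);
INSERT INTO inventory VALUES (1, 1), (2, 1), (3, 2);
""")

JOIN_SQL = """
SELECT {distinct} c.category_id, c.name
FROM category c
INNER JOIN film_category fc ON fc.category_id = c.category_id
INNER JOIN inventory i ON i.film_id = fc.film_id  -- film table not needed
"""

# Without DISTINCT the category repeats once per matching inventory row.
without_distinct = conn.execute(JOIN_SQL.format(distinct="")).fetchall()
with_distinct = conn.execute(JOIN_SQL.format(distinct="DISTINCT")).fetchall()

print(len(without_distinct), with_distinct)
```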

SQL query optimization on cross product?

I have the following schema
CUSTOMER (CID INTEGER NOT NULL, NAME VARCHAR(30), ADDRESS VARCHAR(50))
PRODUCT (PID INTEGER NOT NULL, NAME VARCHAR(50), PRICE DECIMAL(10,2))
SALE (SID BIGINT NOT NULL, STATUS VARCHAR(10), CID INTEGER, TOTALPRICE DECIMAL(30,2))
PRODUCTSALE (SID BIGINT NOT NULL, PID INTEGER NOT NULL, UNITS INTEGER, SUBTOTAL DECIMAL(30,2))
I currently have a statement like this:
SELECT
P.NAME, COUNT(DISTINCT C.CID) AS NUM_CUSTOMERS
FROM
CUSTOMER AS C, PRODUCT AS P, PRODUCTSALE AS PS, SALE AS S
WHERE
C.CID = S.CID
AND S.SID = PS.SID
AND PS.PID = P.PID
GROUP BY
P.NAME
ORDER BY
NUM_CUSTOMERS DESC
I think it's creating a four-table (P, S, PS, C) cross product? Can I optimize it by using a natural join on the four of them? What are the other ways of optimizing this statement?
Start with the biggest table and filter down.
SELECT p.NAME, COUNT(DISTINCT c.CID) AS NUM_CUSTOMERS
FROM ProductSale ps
INNER JOIN Product p ON ps.PID=p.PID
INNER JOIN Sale s ON ps.SID=s.SID
INNER JOIN Customer c ON s.CID=c.CID
GROUP BY p.Name
ORDER BY NUM_CUSTOMERS DESC
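A quick way to check that the comma-separated form and the explicit INNER JOIN form return the same rows is to run both on sample data. The sketch below does that with Python's sqlite3 module and invented rows. It also tries a third variant that drops CUSTOMER and counts S.CID directly; that variant is only equivalent under the assumption that every SALE.CID refers to an existing customer, which the posted schema does not state.

```python
import sqlite3

# Schema from the question; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (CID INTEGER NOT NULL, NAME VARCHAR(30), ADDRESS VARCHAR(50));
CREATE TABLE PRODUCT (PID INTEGER NOT NULL, NAME VARCHAR(50), PRICE DECIMAL(10,2));
CREATE TABLE SALE (SID BIGINT NOT NULL, STATUS VARCHAR(10), CID INTEGER,
                   TOTALPRICE DECIMAL(30,2));
CREATE TABLE PRODUCTSALE (SID BIGINT NOT NULL, PID INTEGER NOT NULL,
                          UNITS INTEGER, SUBTOTAL DECIMAL(30,2));
INSERT INTO CUSTOMER VALUES (1, 'Ann', 'A St'), (2, 'Bob', 'B St');
INSERT INTO PRODUCT VALUES (1, 'Widget', 10), (2, 'Gadget', 15);
INSERT INTO SALE VALUES (1, 'PAID', 1, 10), (2, 'PAID', 2, 10), (3, 'PAID', 1, 25);
INSERT INTO PRODUCTSALE VALUES (1, 1, 1, 10), (2, 1, 1, 10),
                               (3, 1, 1, 10), (3, 2, 1, 15);
""")

IMPLICIT_SQL = """
SELECT P.NAME, COUNT(DISTINCT C.CID) AS NUM_CUSTOMERS
FROM CUSTOMER AS C, PRODUCT AS P, PRODUCTSALE AS PS, SALE AS S
WHERE C.CID = S.CID AND S.SID = PS.SID AND PS.PID = P.PID
GROUP BY P.NAME
ORDER BY NUM_CUSTOMERS DESC
"""

EXPLICIT_SQL = """
SELECT p.NAME, COUNT(DISTINCT c.CID) AS NUM_CUSTOMERS
FROM PRODUCTSALE ps
INNER JOIN PRODUCT p ON ps.PID = p.PID
INNER JOIN SALE s ON ps.SID = s.SID
INNER JOIN CUSTOMER c ON s.CID = c.CID
GROUP BY p.NAME
ORDER BY NUM_CUSTOMERS DESC
"""

# Assumption: SALE.CID always matches a CUSTOMER row, so CUSTOMER can be skipped.
NO_CUSTOMER_SQL = """
SELECT p.NAME, COUNT(DISTINCT s.CID) AS NUM_CUSTOMERS
FROM PRODUCTSALE ps
INNER JOIN PRODUCT p ON ps.PID = p.PID
INNER JOIN SALE s ON ps.SID = s.SID
GROUP BY p.NAME
ORDER BY NUM_CUSTOMERS DESC
"""

implicit_rows = conn.execute(IMPLICIT_SQL).fetchall()
explicit_rows = conn.execute(EXPLICIT_SQL).fetchall()
no_customer_rows = conn.execute(NO_CUSTOMER_SQL).fetchall()
print(implicit_rows, explicit_rows, no_customer_rows)
```

The WHERE conditions in the comma-separated form already restrict the result to matching rows, so the result is that of an inner join rather than a full cross product.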

Oracle - Find Max and Least records from table

I have the following patients and appointments tables.
Patient
CREATE TABLE Patient
(
patientID number(10),
firstName varchar2(50) NOT NULL,
middleName varchar2(50),
surName varchar2(50) NOT NULL,
p_age number(10) NOT NULL,
p_gender char(1),
p_address varchar2(200),
p_contact_no number(10),
medicalHistory varchar2(500),
allergies varchar2(200),
CONSTRAINT PK_Patient PRIMARY KEY (patientID)
);
Appointment
CREATE TABLE Appointment
(
appID number(10),
patientId number(10),
staffId number(10),
appDateTime TIMESTAMP(3),
CONSTRAINT PK_Appointment PRIMARY KEY (appID),
CONSTRAINT FK_Appointment_Patient FOREIGN KEY (patientId) REFERENCES Patient(patientID) ON DELETE CASCADE,
CONSTRAINT FK_Appointment_Staff FOREIGN KEY (staffId) REFERENCES Staff(staffID) ON DELETE CASCADE
);
I want to get the patient details of the patients with the most and the fewest appointments.
I have written the query in SQL Server before and now I want to convert it to Oracle. Can anyone help me?
This is what I have so far.
SELECT p.patientId, p.firstName,
Count(a.appId) AS Count,
MAX(Count(a.appId)) OVER () AS MaxMyGroup,
MIN(Count(a.appId)) OVER () AS MinMyGroup
FROM Patient p INNER JOIN Appointment a ON p.patientID = a.patientId
GROUP BY p.patientId, p.firstName
SQL Server query
WITH s
AS (SELECT p.patientId, p.firstName,
Count(a.appId) AS [Count],
MAX(Count(a.appId)) OVER () AS [MaxMyGroup],
MIN(Count(a.appId)) OVER () AS [MinMyGroup]
FROM Patient p INNER JOIN Appointment a ON p.patientID = a.patientId
GROUP BY p.patientId, p.firstName)
SELECT patientId AS ID,
firstName AS 'First Name',
V.[Count] AS 'Appointment Count',
Agg AS 'MAX/MIN'
FROM s
CROSS APPLY (VALUES ( 'Most', CASE WHEN [Count] = [MaxMyGroup] THEN [Count] END),
('Least', CASE WHEN [Count] = [MinMyGroup] THEN [Count] END))
V(Agg, [Count])
WHERE V.[Count] IS NOT NULL
You are almost there with your query - you just need to then filter on whether the number of appointments is equal to either the minimum or maximum number of appointments. (You also probably want to use LEFT OUTER JOIN rather than INNER JOIN.)
SELECT patientId,
firstName,
NumAppt,
CASE NumAppt
WHEN MinAppt
THEN 'Least'
ELSE 'Most'
END AS category
FROM (
SELECT p.patientId,
p.firstName,
Count(a.appId) AS NumAppt,
MAX(Count(a.appId)) OVER () AS MaxAppt,
MIN(Count(a.appId)) OVER () AS MinAppt
FROM Patient p
LEFT OUTER JOIN Appointment a
ON ( p.patientID = a.patientId )
GROUP BY p.patientId, p.firstName
)
WHERE NumAppt IN ( MinAppt, MaxAppt );
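The difference between LEFT OUTER JOIN and INNER JOIN shows up as soon as one patient has no appointments. Below is a rough sketch in SQLite (3.25 or later) via Python's sqlite3 module rather than Oracle; the tables are trimmed to the columns used, the rows are invented, and the counts are computed in a CTE named counts (a name chosen here) before the window functions are applied, which is a slight restructuring of the query above.

```python
import sqlite3

# Hypothetical data: Ann has 2 appointments, Bob has 1, Cy has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient (patientID INTEGER PRIMARY KEY, firstName TEXT NOT NULL);
CREATE TABLE Appointment (appID INTEGER PRIMARY KEY, patientId INTEGER);
INSERT INTO Patient VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO Appointment VALUES (1, 1), (2, 1), (3, 2);
""")

JOINS = {"left": "LEFT OUTER JOIN", "inner": "INNER JOIN"}

def extremes(join_kind):
    """Patients with the fewest or most appointments, for the given join."""
    sql = f"""
    WITH counts AS (
        SELECT p.patientID, p.firstName, COUNT(a.appID) AS NumAppt
        FROM Patient p
        {JOINS[join_kind]} Appointment a ON p.patientID = a.patientId
        GROUP BY p.patientID, p.firstName
    )
    SELECT patientID, firstName, NumAppt,
           CASE NumAppt WHEN MinAppt THEN 'Least' ELSE 'Most' END AS category
    FROM (SELECT counts.*,
                 MAX(NumAppt) OVER () AS MaxAppt,
                 MIN(NumAppt) OVER () AS MinAppt
          FROM counts)
    WHERE NumAppt IN (MinAppt, MaxAppt)
    ORDER BY patientID
    """
    return conn.execute(sql).fetchall()

print(extremes("left"))   # Cy, with zero appointments, is the 'Least'
print(extremes("inner"))  # Cy is dropped, so Bob becomes the 'Least'
```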
Check if this helps.
SELECT *
FROM PATIENT P
WHERE EXISTS
(SELECT PATIENTID,
COUNT(*)
FROM APPOINTMENT A
GROUP BY PATIENTID
HAVING P.PATIENTID = A.PATIENTID
AND (COUNT(*) >=
(SELECT MAX(COUNT(*)) FROM APPOINTMENT A2 GROUP BY A2.PATIENTID
)
OR COUNT(*) <=
(SELECT MIN(COUNT(*)) FROM APPOINTMENT A3 GROUP BY A3.PATIENTID
) )
)