SQL query, need help understanding - sql

I am new to SQL and struggle when I come across harder queries. I cannot seem to understand how to write these queries from start to finish: I get lost on where to start, how to build them up one piece at a time, and how to combine the pieces. Below is a question I was asked, followed by the solution. Could anyone walk me through a way of looking at this that would help me arrive at the same result?
Write a query to return the shortest movie from each category.
The order of your results doesn't matter.
If there are ties, return just one of them.
Return the following columns: film_id, title, length, category, row_num
WITH movie_ranking AS (
SELECT
F.film_id,
F.title,
F.length,
C.name category,
ROW_NUMBER() OVER(PARTITION BY C.name ORDER BY F.length) row_num
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
)
SELECT
film_id,
title,
length,
category,
row_num
FROM movie_ranking
WHERE row_num = 1
;
SELECT
film_id,
title,
length,
category,
row_num
FROM (
SELECT
F.film_id,
F.title,
F.length,
C.name category,
ROW_NUMBER() OVER(PARTITION BY C.name ORDER BY F.length) row_num
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
) X
WHERE row_num = 1
;

Let's see if this helps. First, can you follow this very SIMPLE query: "get a list of all movies and sort them by category; within each category, sort from the shortest movie to the longest"?
select
c.name category,
f.length,
f.title
from
film f
JOIN film_category fc
on f.film_id = fc.film_id
join category c
on fc.category_id = c.category_id
order by
c.name,
f.length
This should be pretty simple. If not, I will go deeper. So how does this help you with partitioning and ordering, and what is this ROW_NUMBER() thing?
First, partitioning. Think of this as taking the result records that come through and pre-breaking them down into different buckets (your film categories). Simple enough, as the above is just ordering the output based on the c.name (category name descriptive text).
Now, the ORDER BY clause in the partitioning. That corresponds to your secondary sort on the film length. So the first part took all the films and put them into different "buckets": Sci-Fi, RomCom, Horror, Fiction, Action/Adventure, etc. Now the ORDER BY is saying, within each bucket, sort the data (hence the order by WITHIN each c.name). At this point, you have 5+ buckets (one per category), each with a different number of movies. Some could have 4 or 5, for example, others could have 30, others 18, whatever. But at least now, each category bucket has been sorted based on the length.
NOW comes the ROW_NUMBER() OVER (partition context) row_num. ROW_NUMBER() is a window function whose counter restarts at 1 each time a new bucket (film category) is encountered, so it takes the records, now sorted by length, and assigns each a sequential number 1, 2, 3, etc. for however many are within each category. The final column name from the ROW_NUMBER() is "row_num", but it could be given any column name.
So we now have 4, 5 or however many buckets, all sorted by the length of the movie IN EACH RESPECTIVE BUCKET. And each bucket has its rows numbered 1, 2, 3, etc. based on the available movies.
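To see the buckets and their numbering without the final filter, here is a small sketch you can run. It uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer) and a made-up miniature of the film tables; the rows and titles are assumptions for illustration only.

```python
import sqlite3

# Hypothetical miniature of the film/category tables; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, length INTEGER);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);
INSERT INTO film VALUES
  (1, 'Alpha', 90), (2, 'Bravo', 60), (3, 'Charlie', 120),
  (4, 'Delta', 75), (5, 'Echo', 45);
INSERT INTO category VALUES (1, 'Action'), (2, 'Horror');
INSERT INTO film_category VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
""")

# PARTITION BY makes the buckets, ORDER BY sorts inside each bucket,
# and ROW_NUMBER() restarts at 1 for every bucket.
numbered = conn.execute("""
SELECT c.name AS category, f.title, f.length,
       ROW_NUMBER() OVER (PARTITION BY c.name ORDER BY f.length) AS row_num
FROM film f
JOIN film_category fc ON fc.film_id = f.film_id
JOIN category c ON c.category_id = fc.category_id
ORDER BY c.name, row_num
""").fetchall()

for row in numbered:
    print(row)
```

Each category starts again at row_num 1, and the shortest film in the bucket is the one that gets it.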
Now the query is using a "CTE" (common table expression). That is basically a query written under an alias name that the immediately following query can use. So this "WITH MOVIE_RANKING" query is similar to my query, but without a generic outer ORDER BY clause; instead it has the partition and the order within each partition.
WITH movie_ranking AS (
SELECT
F.film_id,
F.title,
F.length,
C.name category,
ROW_NUMBER() OVER(PARTITION BY C.name ORDER BY F.length) row_num
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
)
SELECT
film_id,
title,
length,
category,
row_num
FROM movie_ranking
WHERE row_num = 1
;
The final select reads from the result of the WITH movie_ranking alias, so it can explicitly reference the final column name "row_num". Since the request is to get the movie with the shortest length in each category, all you care about is any record with "row_num" = 1. Remember, each bucket was sorted and then numbered sequentially, so #1 in each bucket is the shortest movie.
The second version of the query does the same thing with an in-line subquery (a derived table): the body of the CTE is pasted into the FROM clause and given the alias X. The WITH version just gives that subquery a name up front, which keeps the final query easier to read.
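If you want to convince yourself that the two versions are interchangeable, a quick check along these lines may help. It is only a sketch: it runs on SQLite through Python's sqlite3 module (3.25+ for window functions), and the sample rows are invented.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, length INTEGER);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);
INSERT INTO film VALUES
  (1, 'Alpha', 90), (2, 'Bravo', 60), (3, 'Charlie', 120),
  (4, 'Delta', 75), (5, 'Echo', 45);
INSERT INTO category VALUES (1, 'Action'), (2, 'Horror');
INSERT INTO film_category VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
""")

# The ranking query, written once and reused in both forms.
ranking_sql = """
SELECT f.film_id, f.title, f.length, c.name AS category,
       ROW_NUMBER() OVER (PARTITION BY c.name ORDER BY f.length) AS row_num
FROM film f
JOIN film_category fc ON fc.film_id = f.film_id
JOIN category c ON c.category_id = fc.category_id
"""

# Form 1: the ranking query named up front as a CTE.
cte_rows = conn.execute(
    f"WITH movie_ranking AS ({ranking_sql}) "
    "SELECT * FROM movie_ranking WHERE row_num = 1"
).fetchall()

# Form 2: the same ranking query pasted into FROM as a derived table.
derived_rows = conn.execute(
    f"SELECT * FROM ({ranking_sql}) x WHERE row_num = 1"
).fetchall()

print(sorted(cte_rows))
print(sorted(derived_rows))
```

Both forms filter the same numbered rows, so which one you write is mostly a readability choice.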

Related

SQL Collect duplicates to one place? PostgreSQL

Sorry I'm new here and I'm also new with SQL and can't really explain my problem in the title...
So I have a TV show database, and there I have a Genre column, but for a TV show there are multiple Genres stored, so when I'm selecting all my TV Shows how can I combine them?
It needs to look like this:
https://i.stack.imgur.com/3EhBj.png
So I have to combine the strings together. Here is the code I have written so far:
SELECT title,
year,
runtime,
MIN(name) as name,
ROUND(rating, 1) as rating,
trailer,
homepage
FROM shows
JOIN show_genres
on shows.id = show_genres.show_id
JOIN genres
on show_genres.genre_id = genres.id
GROUP BY title,
year,
runtime,
rating,
trailer,
homepage
ORDER BY rating DESC
LIMIT 15;
There is some other stuff in there too; it is part of my exercise tasks. Thanks!
Also here is the relationship model:
https://i.stack.imgur.com/M89ho.png
Basically you need string aggregation - in Postgres, you can use string_agg() for this.
For efficiency, I would recommend moving the aggregation to a correlated subquery or a lateral join rather than aggregating in the outer query, so:
SELECT
s.title,
s.year,
s.runtime,
g.genre_names,
ROUND(s.rating, 1) as rating,
s.trailer,
s.homepage
FROM shows s
LEFT JOIN LATERAL (
SELECT string_agg(g.name, ', ') genre_names
FROM show_genres sg
INNER JOIN genres g ON g.id = sg.genre_id
WHERE sg.show_id = s.id
) g ON 1 = 1
ORDER BY s.rating DESC
LIMIT 15
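For a quick way to try the correlated-subquery variant without a Postgres server, here is a rough sketch on SQLite via Python's sqlite3 module. Note the swap: SQLite has no string_agg() in older versions and no LATERAL, so this uses group_concat() as a stand-in and a scalar subquery in the select list. The table contents are made up.

```python
import sqlite3

# Hypothetical rows; the table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, rating REAL);
CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE show_genres (show_id INTEGER, genre_id INTEGER);
INSERT INTO shows VALUES (1, 'Show A', 9.26), (2, 'Show B', 8.04), (3, 'Show C', 7.5);
INSERT INTO genres VALUES (1, 'Drama'), (2, 'Crime'), (3, 'Comedy');
INSERT INTO show_genres VALUES (1, 1), (1, 2), (2, 3);
""")

# group_concat() is SQLite's counterpart of Postgres string_agg().
# The subquery runs once per show, so shows are never multiplied by their genres.
rows = conn.execute("""
SELECT s.title,
       ROUND(s.rating, 1) AS rating,
       (SELECT group_concat(g.name, ', ')
        FROM show_genres sg
        JOIN genres g ON g.id = sg.genre_id
        WHERE sg.show_id = s.id) AS genre_names
FROM shows s
ORDER BY s.rating DESC
LIMIT 15
""").fetchall()

for row in rows:
    print(row)
```

A show without any genre still appears, with NULL (None in Python) in genre_names, which matches what the LEFT JOIN LATERAL does. The order of names inside the concatenated string is not guaranteed here.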

SQL Query with row_number() not returning expected output

My goal is to write a query that returns the city which produced the highest avg. sales for each item-category.
This is the expected output:
item_category|city
books |los_angeles
toys |austin
electronics |san_fransisco
My 3 table schemas look like this:
users
user_id|city
sales
user_id|item_id|sales_amt
items
item_id|item_category
These are further notes to consider:
1. sales_amt is the only column that may have Null values. If no users have placed a sale for a particular item-category (no rows in sales with a non-Null sales_amt), then the city name should be Null.
2. Only 1 row per distinct item-category. If more than 1 city qualifies, then pick the first one alphabetically.
The attempt I took looks like this but it does not produce the right output:
select a.item_category,a.city from (
select
i.item_category,
u.city,
row_number() over (partition by i.item_category,u.city order by avg(s.sales_amt) desc)rk
from sales s
join users u on s.user_id=u.user_id
join items i on i.item_id=s.item_id
group by i.item_category,u.city)a
where a.rk=1
My output does not return the Null cases for sales_amt. Also, I get non-unique rows. Therefore, I am worried that I am not properly incorporating the 2 notes.
I hope someone can help.
my goal is to write a query that should return the cities which produced the highest avg. sales for each item-category.
This can be calculated using aggregation and window functions:
select ic.*
from (select i.item_category, u.city,
row_number() over(partition by i.item_category order by avg(s.sales_amt) desc, u.city) as seqnum
from users u join
sales s
on s.user_id = u.user_id join
items i
on i.item_id = s.item_id
group by i.item_category, u.city
) ic
where seqnum = 1;
Your question explicitly says "average" which is why this uses avg(). However, I suspect that you really want the sum in each city, which would be sum().
Notes:
You want one row so row_number() instead of rank().
You need sales to calculate the average, so join, instead of left join.
You want one row per item_category, so that is used for partitioning.
Aaaand my take on it is a mix of GMB's and Gordon's advice. GMB points out that left joins are needed, but I think his starting table, partition and choice of rank() are wrong (his query cannot generate null city names as requested, and could generate duplicates tied on the same avg). Gordon picked up on things like ordering by city on a tied avg, which GMB did not, but missed the "if no sales of any items in category X put null for the city" requirement. Both left cancelled orders (rows with a null sales_amt) floating round the system, which introduces errors:
select *
from (
select
i.item_category,
u.city,
row_number() over(partition by i.item_category order by avg(s.sales_amt) desc, u.city asc) rn
from items i
left join (select * from sales where sales_amt is not null) s on i.item_id = s.item_id
left join users u on s.user_id = u.user_id
group by i.item_category, u.city
) t
where rn = 1
We start from items so that categories having no sales get nulls for their sales amount and city.
We also need to consider that any sales that didn't fulfil will have null in their amount. We exclude these with a subquery; otherwise they would link through to users and give a false positive (even though the avg will calculate as null for a category that only has cancelled orders, the city would still show when it should not). I could also have done this with an "and sales_amt is not null" predicate in the join, but I think this way is clearer. This should not be done with a predicate in the where clause, because that would eliminate the sale-less categories we are trying to preserve.
Row number is ordered by avg, with city name to break any ties. It's a simpler function than rank and cannot generate duplicate values.
Finally we pull the rn = 1 rows to get the top averaging cities.
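Here is a small way to exercise those points on sample data, sketched on SQLite through Python's sqlite3 module (3.25+). The rows are invented to cover a clear winner, a tie, a category with no sales and a category with only a null-amount sale. One deliberate difference from the query above: the averages are computed in a CTE first, so the window function only has to sort a plain column.

```python
import sqlite3

# Hypothetical data covering each case from the notes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE items (item_id INTEGER PRIMARY KEY, item_category TEXT);
CREATE TABLE sales (user_id INTEGER, item_id INTEGER, sales_amt REAL);
INSERT INTO users VALUES (1, 'austin'), (2, 'boston'), (3, 'chicago');
INSERT INTO items VALUES (1, 'books'), (2, 'toys'), (3, 'electronics'), (4, 'games');
INSERT INTO sales VALUES
  (1, 1, 10), (2, 1, 30),   -- books: boston has the higher average
  (1, 2, 20), (2, 2, 20),   -- toys: tie, austin wins alphabetically
  (3, 4, NULL);             -- games: only a null-amount sale
""")

rows = conn.execute("""
WITH city_avg AS (
  SELECT i.item_category, u.city, AVG(s.sales_amt) AS avg_amt
  FROM items i
  LEFT JOIN (SELECT * FROM sales WHERE sales_amt IS NOT NULL) s
         ON s.item_id = i.item_id
  LEFT JOIN users u ON u.user_id = s.user_id
  GROUP BY i.item_category, u.city
)
SELECT item_category, city
FROM (
  SELECT item_category, city,
         ROW_NUMBER() OVER (PARTITION BY item_category
                            ORDER BY avg_amt DESC, city) AS rn
  FROM city_avg
) t
WHERE rn = 1
ORDER BY item_category
""").fetchall()

for row in rows:
    print(row)
```

The categories without a usable sale (electronics and games in this sample) come back with a NULL city instead of disappearing.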
I think you want left joins starting from users in the inner query to preserve cities without sales.
As for the ranking: if you want one record per city, then do not put columns other than city in the partition (your current partition gives you one record per city and per category, which is not what you want).
Consider:
select *
from (
select
i.item_category,
u.city,
rank() over(partition by u.city order by avg(s.sales_amt) desc) rk
from users u
left join sales s on s.user_id = u.user_id
left join items i on i.item_id = s.item_id
group by i.item_category, u.city
) t
where rk = 1

Sqlite - get numbered rows

I am retrieving list of persons from a database and each person has some points. What I want to achieve is to get all person information along with person's points and rank. Points are calculated on the go, because they are not stored within the entity and the query looks something like that:
SELECT p.<some person attributes>, s.points, [here I need rank] as rank
FROM Persons p LEFT JOIN <subquery calculating points> s
ON p.id = s.personId
ORDER BY s.points DESC
In my select part I need to get a position in ranking of a person (what is basically order of returned rows, since I order it by points, right?)
Is there any sql/sqlite column or function to return that?
This is exactly what window functions are for. Specifically, dense_rank will also take care of pesky edge-cases where several users have the same number of points:
SELECT p.<some person attributes>, s.points,
DENSE_RANK() OVER (ORDER BY points DESC) as "rank"
FROM Persons p
LEFT JOIN <subquery calculating points> s ON p.id = s.personId
ORDER BY s.points DESC
Unfortunately, SQLite versions before 3.25 do not support window functions. On those versions you pretty much need to resort to a correlated subquery:
with s as (
<subquery calculating points>
)
select p.<some person attributes>, s.points,
(select 1 + count(*)
from s s2
where s2.points > s.points
) as rank
from Persons p left join
s
on p.id = s.personId
order by s.points desc;
This specifically implements rank() over (order by points desc). Similar logic can be used for dense_rank() or row_number() if that is what you really need.
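To check that claim, the sketch below runs the correlated subquery next to RANK() on the same rows, using Python's sqlite3 module (the window-function half needs SQLite 3.25+). The scores table is a hypothetical stand-in for the points subquery, and the values are made up.

```python
import sqlite3

# "scores" stands in for <subquery calculating points>; rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (personId INTEGER PRIMARY KEY, points INTEGER);
INSERT INTO scores VALUES (1, 50), (2, 80), (3, 80), (4, 20);
""")

# Rank = 1 + number of rows with strictly more points.
subquery_rank = conn.execute("""
SELECT s.personId, s.points,
       (SELECT 1 + COUNT(*) FROM scores s2 WHERE s2.points > s.points) AS rnk
FROM scores s
ORDER BY s.points DESC, s.personId
""").fetchall()

# The same ranking expressed with a window function.
window_rank = conn.execute("""
SELECT personId, points, RANK() OVER (ORDER BY points DESC) AS rnk
FROM scores
ORDER BY points DESC, personId
""").fetchall()

print(subquery_rank)
print(window_rank)
```

The two tied persons share rank 1 and the next person gets rank 3, which is the rank() behaviour; changing the subquery to count distinct higher point values would give dense_rank() instead.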

Fetching Two Joined Tables in SQL With Group By and Multiple Columns

I've read through every relevant question in here but couldn't figure out or modify the accepted answers to accomplish what I want.
I have two tables:
News: Id, Title, CategoryId
NewsCategory: Id, Title
I want to list all NewsCategories and include 10 rows of News belonging to each category in the same SQL query.
I got this query working at the moment:
Select C.Id As CategoryId, C.Title As CategoryTitle, N.Id, N.Title
From NewsCategories C, News N
Where N.CategoryId In (C.Id)
Order By C.Id Desc
But couldn't figure out how to limit the amount of "News" returned without limiting NewsCategories.
Just use OUTER APPLY:
SELECT C.Id AS CategoryId, C.Title AS CategoryTitle, N.Id, N.Title
From NewsCategories C
OUTER APPLY (
SELECT TOP 10 *
FROM News
WHERE CategoryId = C.Id
) AS N
ORDER BY C.Id Desc
Depending on your requirements and the News table schema, you can add an ORDER BY to the OUTER APPLY sub-query to get, e.g., the 10 latest news items for each category, or the first 10 in alphabetical order, etc.
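OUTER APPLY is SQL Server syntax. If you are on an engine without it, one possible substitute (not the approach above, just a sketch of the same idea) is to number the news rows per category with ROW_NUMBER() and LEFT JOIN only the first N. The example runs on SQLite via Python's sqlite3 module (3.25+), uses N = 2 instead of 10 to keep the made-up data small, and treats the highest Id as the latest item, which is an assumption.

```python
import sqlite3

# Hypothetical rows; table names follow the query in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NewsCategories (Id INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE News (Id INTEGER PRIMARY KEY, Title TEXT, CategoryId INTEGER);
INSERT INTO NewsCategories VALUES (1, 'Sports'), (2, 'Tech'), (3, 'Empty');
INSERT INTO News VALUES (1, 's1', 1), (2, 's2', 1), (3, 's3', 1), (4, 't1', 2);
""")

TOP_N = 2  # the question asks for 10; 2 keeps the sample output short

# Number the news inside each category, then keep rn <= TOP_N in the join
# condition so that categories without news are still returned.
rows = conn.execute("""
SELECT C.Id AS CategoryId, C.Title AS CategoryTitle, N.Id, N.Title
FROM NewsCategories C
LEFT JOIN (
    SELECT Id, Title, CategoryId,
           ROW_NUMBER() OVER (PARTITION BY CategoryId ORDER BY Id DESC) AS rn
    FROM News
) N ON N.CategoryId = C.Id AND N.rn <= ?
ORDER BY C.Id DESC, N.Id DESC
""", (TOP_N,)).fetchall()

for row in rows:
    print(row)
```

Putting the rn filter in the ON clause rather than in WHERE is what keeps the empty category in the result, the same way OUTER APPLY does.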

Getting SQL tuples that contain the largest value of an aggregate attribute after grouping

Here is the schema:
ACTOR (id, name)
PLAY (id, name, year)
CASTS (pid, aid, character)
The question is:
Find the plays with the largest cast (actors distinct) and return the titles and cast size of those plays.
This is SQL query that I have so far:
select mm.id, mm.name, count(distinct a.id) as numOfActors
from actor a
join casts c on c.aid = a.id
join play mm on mm.id = c.pid
group by mm.id, mm.name;
Every tuple returned from that query contains a different play, displaying its id, name, and the number of casts it has. But from there I'm having difficulty trying to fit it as a subquery within an outer query that would allow me to extract only the tuples that have the largest numofActors value (so like if the largest value was 100, then the only tuples that would be returned all have 100 actors).
Yeah this is one of those "homework"-type of problems, but I'm also looking for a conceptual understanding too (essentially, extracting the tuples that contain the largest value of a certain aggregated attribute after grouping has been done). Ordering by descending and selecting the top tuple doesn't work since there may be more than one tuple with the largest value.
Here is the approach in SQL Server:
select acp.*
from (select p.id, p.name, count(distinct a.id) as numOfActors,
max(count(distinct a.id)) over () as maxcnt
from actor a join
casts c
on c.aid = a.id join
play p
on p.id = c.pid
group by p.id, p.name
) acp
where numOfActors = maxcnt;
The expression max(count(distinct a.id)) over () is an example of a window function. It calculates the maximum value of an expression over a group of rows. Because the () are empty (there is no partition by clause), the group is the whole result set, so the same maximum value is assigned to a new column in all rows.
What value is that? It is the maximum of the calculated value count(distinct a.id), i.e. the largest cast size across all plays. You want to find all plays that have this number of actors, so the outer query just selects these.
A subquery is needed because you cannot use window functions in the where clause.
EDIT:
with acp as (
select p.id, p.name, count(distinct a.id) as numOfActors
from actor a join
casts c
on c.aid = a.id join
play p
on p.id = c.pid
group by p.id, p.name
)
select acp.*
from acp join
(select max(numOfActors) as maxnoa
from acp
) acpm
on acp.numOfActors = acpm.maxnoa;
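Both approaches can be tried on a few rows with the sketch below, which runs on SQLite through Python's sqlite3 module (3.25+ for the window function). The data is invented, and the joins assume that casts.pid is the play id and casts.aid is the actor id. To stay on the safe side across engines, the window version takes max() over the CTE column instead of nesting the aggregate inside the window function.

```python
import sqlite3

# Hypothetical data: two plays tie for the largest distinct cast.
# Assumption: casts.pid references play.id and casts.aid references actor.id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE play (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
CREATE TABLE casts (pid INTEGER, aid INTEGER, "character" TEXT);
INSERT INTO actor VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid'), (4, 'Dee');
INSERT INTO play VALUES (1, 'Hamlet', 1601), (2, 'Macbeth', 1606), (3, 'Othello', 1604);
INSERT INTO casts VALUES
  (1, 1, 'Hamlet'), (1, 1, 'Ghost'), (1, 2, 'Claudius'),
  (2, 2, 'Macbeth'), (2, 3, 'Banquo'),
  (3, 4, 'Othello');
""")

# Cast size per play; DISTINCT counts an actor with two roles only once.
acp_cte = """
WITH acp AS (
  SELECT p.id, p.name, COUNT(DISTINCT a.id) AS numOfActors
  FROM actor a
  JOIN casts c ON c.aid = a.id
  JOIN play p ON p.id = c.pid
  GROUP BY p.id, p.name
)
"""

# Variant 1: attach the overall maximum to every row with an empty OVER ().
window_rows = conn.execute(acp_cte + """
SELECT id, name, numOfActors
FROM (SELECT acp.*, MAX(numOfActors) OVER () AS maxcnt FROM acp) t
WHERE numOfActors = maxcnt
ORDER BY id
""").fetchall()

# Variant 2: join against a one-row subquery holding the maximum.
join_rows = conn.execute(acp_cte + """
SELECT acp.id, acp.name, acp.numOfActors
FROM acp
JOIN (SELECT MAX(numOfActors) AS maxnoa FROM acp) acpm
  ON acp.numOfActors = acpm.maxnoa
ORDER BY acp.id
""").fetchall()

print(window_rows)
print(join_rows)
```

Both variants return every play that ties for the largest cast, which is the case that ordering descending and taking the top row would miss.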