I am very new to the SQL universe, and I came across this prompt that I was able to fulfill, but I have to imagine I'm missing a more direct and intuitive solution. My solution returns the correct response in SQLite within rounding error to over 10 decimal places but technically does not match the reported solution. I appreciate any insight.
Prompt:
Find the difference between the average rating ["stars"] of movies released before 1980 and the average rating of movies released after 1980. (The difference between the average of averages before and after.)
The database includes 3 tables with the following columns (simplified for relevance):
movie| mID*, year
reviewer| rID*, name
rating| rID*, mID*, stars
"mavg" is my own aliased aggregation
select distinct(
(select avg(mavg)
from(
(select *, avg(stars) as mavg
from rating
group by mID) join movie using(mID) )
where year < 1980) -
(select avg(mavg)
from(
(select *, avg(stars) as mavg
from rating
group by mID) join movie using(mID) )
where year >= 1980)
)
from rating
;
Let's look at your subquery:
select *, avg(stars) as mavg
from rating
group by mID
This is an invalid query. With GROUP BY mid you say you want to aggregate your rows to get one result row per mID. But then you don't only select the average rating, but all columns from the table (SELECT *). One of these columns is stars. How can you select the stars column into one row, when there are many rows for an mID? Most DBMS report a syntax error here. SQLite picks one of the stars from any of the mID's rows arbitrarily instead. So, while this is considered valid in SQLite, it isn't in standard SQL, and you should not write such queries.
To the result (the average per movie) you join the movie table. Then you select the average of the movie ratings for the movies in the desired years. This is well done, but you could have put that restriction (join, IN clause, or EXISTS clause) right into the subquery in order to calculate the averages only for the movies you want, rather than calculating all averages and then keeping some of the movies and dismissing the others. But that's a minor detail.
Then you subtract the new average from the old one. This means you subtract one value from another and end up with exactly the one value you want to show. But instead of merely selecting this value (SELECT (...) - (...)) you are linking the value with the rating table (SELECT (...) - (...) FROM rating) for no apparent reason, thus selecting the desired value as often as there are rows in the rating table. You then notice this and apply DISTINCT to get rid of the rows you just created unnecessarily yourself. DISTINCT is very often an indicator of a badly written query. When you think you need DISTINCT, ask yourself what makes this necessary. Where do the duplicate rows come from? Have you created them yourself? Then amend this.
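To see where the duplicate rows come from, here is a minimal sketch using Python's built-in sqlite3 module. The three-row rating table and the constants 5 and 3 are made up for illustration. A scalar expression selected FROM rating comes back once per rating row, while the same expression without a FROM clause comes back exactly once, so no DISTINCT is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rating (rID INTEGER, mID INTEGER, stars INTEGER);
    -- Hypothetical sample rows, only here to give the table three rows.
    INSERT INTO rating VALUES (1, 101, 4), (2, 101, 2), (1, 102, 5);
""")

# The scalar value is repeated once for every row of rating.
with_from = conn.execute(
    "SELECT (SELECT 5) - (SELECT 3) FROM rating"
).fetchall()

# Without FROM, the value is returned exactly once.
without_from = conn.execute(
    "SELECT (SELECT 5) - (SELECT 3)"
).fetchall()

print(len(with_from), len(without_from))
```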
The query can be written thus:
select
avg(case when m.year < 1980 then r.movie_rating end) -
avg(case when m.year >= 1980 then r.movie_rating end) as diff
from
(
select mid, avg(stars) as movie_rating
from rating
group by mid
) r
join movie m using (mid);
Using a case expression inside an aggregation function is called conditional aggregation and is often the preferred solution when working with different aggregates.
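Conditional aggregation works because a CASE expression without an ELSE branch yields NULL, and AVG ignores NULLs. The following is a small sketch, run through Python's sqlite3 module against made-up sample data (the rows are assumptions, not the asker's database), that shows the two per-period averages next to their difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (mID INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE rating (rID INTEGER, mID INTEGER, stars INTEGER);
    -- Hypothetical data: two movies before 1980, one after.
    INSERT INTO movie VALUES (1, 1970), (2, 1975), (3, 1990);
    INSERT INTO rating VALUES
        (10, 1, 4), (11, 1, 2),   -- movie 1 averages 3.0
        (10, 2, 5),               -- movie 2 averages 5.0
        (10, 3, 2), (11, 3, 4);   -- movie 3 averages 3.0
""")

old_avg, new_avg, diff = conn.execute("""
    SELECT AVG(CASE WHEN m.year <  1980 THEN r.movie_rating END),
           AVG(CASE WHEN m.year >= 1980 THEN r.movie_rating END),
           AVG(CASE WHEN m.year <  1980 THEN r.movie_rating END)
         - AVG(CASE WHEN m.year >= 1980 THEN r.movie_rating END)
    FROM (SELECT mID, AVG(stars) AS movie_rating
          FROM rating
          GROUP BY mID) r
    JOIN movie m USING (mID)
""").fetchone()

print(old_avg, new_avg, diff)
```

Each AVG only sees the rows whose CASE produced a value, so one pass over the joined rows yields both averages.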
You may use the following single query here:
SELECT AVG(CASE WHEN m.year < 1980 THEN r.stars END) -
AVG(CASE WHEN m.year >= 1980 THEN r.stars END) AS mavg
FROM rating r
INNER JOIN movie m ON m.mID = r.mID;
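One caveat worth checking against the prompt: averaging r.stars directly weights every individual rating equally, whereas the prompt asks for an average of per-movie averages. The two agree only when the movies in each period have the same number of ratings. A small sketch with made-up data (all rows are assumptions) shows the two results diverging:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (mID INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE rating (rID INTEGER, mID INTEGER, stars INTEGER);
    -- Hypothetical data: movie 1 has two ratings, movie 2 only one.
    INSERT INTO movie VALUES (1, 1970), (2, 1975), (3, 1990);
    INSERT INTO rating VALUES
        (10, 1, 4), (11, 1, 2), (10, 2, 5), (10, 3, 2), (11, 3, 4);
""")

# Averages individual ratings: (4 + 2 + 5) / 3 - (2 + 4) / 2
flat_diff = conn.execute("""
    SELECT AVG(CASE WHEN m.year <  1980 THEN r.stars END)
         - AVG(CASE WHEN m.year >= 1980 THEN r.stars END)
    FROM rating r
    JOIN movie m ON m.mID = r.mID
""").fetchone()[0]

# Averages per-movie averages: (3.0 + 5.0) / 2 - 3.0
avg_of_avgs_diff = conn.execute("""
    SELECT AVG(CASE WHEN m.year <  1980 THEN r.movie_rating END)
         - AVG(CASE WHEN m.year >= 1980 THEN r.movie_rating END)
    FROM (SELECT mID, AVG(stars) AS movie_rating
          FROM rating GROUP BY mID) r
    JOIN movie m USING (mID)
""").fetchone()[0]

print(flat_diff, avg_of_avgs_diff)
```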
I got some tables and now I want to determine the current rank of each customer.
I have a log table that holds all the information about when a customer got a "point", and I created a view that counts the "points" for every customer. Now I'm trying to create another view that matches each customer's points with their current rank. I also have a "rank" table that holds the name of each rank and the minimum points needed to reach it. My problem is that when I run
SELECT r.neededVisits, r.name, av.customerId
FROM Rank r, amountOfVisits av
WHERE av.amount >= r.neededVisits
I get something like this:
[Table Output]
The left column "besuche" holds the value that is needed for that rank i.e. for the rank "gast" you need 0 visits. For the rank "Stammgast" you need 25 visits.
So I get every rank that a customer ever passed. But I just want to get the last rank for each customer
Is there any way I can do this?
Desired Result would be something like this:
[Desired Result]
The Table that holds the ranks
[Rank Table]
[Rank Table Values]
The table that holds the counted visits for each user
[Amount of visits for each user]
[Amount of visits Table values]
I assume you are looking for something like
SELECT r.neededVisits, r.name, av.customerId
FROM Rank r, amountOfVisits av
WHERE av.amount >= r.neededVisits
AND NOT EXISTS (SELECT * FROM rank r2 WHERE r2.neededVisits > r.neededVisits AND av.amount >= r2.neededVisits)
This uses your current logic, but the final condition removes every rank for which the customer also qualifies for a higher one, leaving only the highest rank reached.
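Here is a minimal sketch of the NOT EXISTS filter, run against made-up data with Python's sqlite3 module (the 'VIP' rank and all row values are assumptions for illustration). To keep only the highest rank a customer has reached, the subquery looks for a higher rank that the customer also qualifies for, and the row is dropped if one exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rank (neededVisits INTEGER, name TEXT);
    CREATE TABLE amountOfVisits (customerId INTEGER, amount INTEGER);
    -- Hypothetical ranks and visit counts.
    INSERT INTO Rank VALUES (0, 'Gast'), (25, 'Stammgast'), (50, 'VIP');
    INSERT INTO amountOfVisits VALUES (1, 30), (2, 5), (3, 60);
""")

highest_ranks = conn.execute("""
    SELECT av.customerId, r.name, r.neededVisits
    FROM Rank r
    JOIN amountOfVisits av ON av.amount >= r.neededVisits
    WHERE NOT EXISTS (
        SELECT 1
        FROM Rank r2
        WHERE r2.neededVisits > r.neededVisits   -- a higher rank ...
          AND av.amount >= r2.neededVisits       -- ... the customer also reached
    )
    ORDER BY av.customerId
""").fetchall()

print(highest_ranks)
```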
As pointed out in the comments, you should probably try to rewrite with an inner join which would be more like
SELECT r.neededVisits, r.name, av.customerId
FROM Rank r INNER JOIN amountOfVisits av
ON av.amount >= r.neededVisits
WHERE NOT EXISTS (SELECT * FROM rank r2 WHERE r2.neededVisits > r.neededVisits AND av.amount >= r2.neededVisits)
I think the above is pretty readable, but a more modern method (depending on your DBMS) would be to use window functions. Something like
SELECT neededVisits, name, customerId
FROM
(
SELECT r.neededVisits, r.name, av.customerId, RANK() OVER (PARTITION BY av.customerId ORDER BY r.neededVisits DESC) tr
FROM Rank r INNER JOIN amountOfVisits av
ON av.amount >= r.neededVisits
) iq
WHERE tr=1
The inner query here calculates a column "tr" that numbers the matching ranks for each customer in descending order of neededVisits. The outer query keeps the row with tr = 1, which is the highest rank that customer has reached.
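The window-function variant can be tried out with a small sketch like the one below, using Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or later (the first release with window functions), and the 'VIP' rank and all row values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rank (neededVisits INTEGER, name TEXT);
    CREATE TABLE amountOfVisits (customerId INTEGER, amount INTEGER);
    -- Hypothetical ranks and visit counts.
    INSERT INTO Rank VALUES (0, 'Gast'), (25, 'Stammgast'), (50, 'VIP');
    INSERT INTO amountOfVisits VALUES (1, 30), (2, 5), (3, 60);
""")

current_ranks = conn.execute("""
    SELECT customerId, name, neededVisits
    FROM (
        SELECT r.neededVisits, r.name, av.customerId,
               RANK() OVER (PARTITION BY av.customerId
                            ORDER BY r.neededVisits DESC) AS tr
        FROM Rank r
        INNER JOIN amountOfVisits av ON av.amount >= r.neededVisits
    ) iq
    WHERE tr = 1            -- keep only the highest matching rank
    ORDER BY customerId
""").fetchall()

print(current_ranks)
```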
SELECT
a.Movie_Cost, a.Movie_Num, b.Movie_Genre, b.Average
FROM
movie AS a
INNER JOIN(
SELECT
Movie_Genre,
AVG(Movie_Cost) AS Average
FROM
movie
GROUP BY
Movie_Genre
) AS b
ON
a.Movie_Genre = b.Movie_Genre
This query gives out the result shown in the picture:
What I am trying to do here is simple, find out what each genre's average cost is and then find the percentage difference between the columns Average and Movie_Cost. Each genre's average cost is calculated using the Group BY clause and AVG function, as you can see in the query.
I think the main problem I face is that I am not able to reuse the average column which I generated, in order to calculate the percentage difference.
By the way, I am using phpMyAdmin.
Also when I tried implementing it myself it looked something like this:
SELECT
a.Movie_Cost, a.Movie_Num,b.Movie_Genre,b.Average, (SELECT 100*(a.Movie_Cost-b.Average)/b.Average FROM a INNER JOIN b WHERE a.Movie_Genre=b.Movie_Genre)
FROM
movie AS a
iNNER JOIN(
SELECT
Movie_Genre,
AVG(Movie_Cost) AS Average
FROM
movie
GROUP BY
Movie_Genre
) AS b
ON
a.Movie_Genre = b.Movie_Genre
But this doesn't work because of issues with the aliases. It gives me the error that the table 'a' does not exist.
Use window functions:
SELECT m.*,
(m.movie_cost - AVG(m.movie_cost) OVER (PARTITION BY movie_genre)) as diff
FROM movie m;
EDIT:
You can do any arithmetic you want, for instance:
SELECT m.*,
(m.movie_cost * 100.0 /
AVG(m.movie_cost) OVER (PARTITION BY movie_genre)
) - 100 as percentage
FROM movie m;
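As a quick check of the arithmetic, here is a sketch that runs the percentage query through Python's sqlite3 module. It assumes SQLite 3.25 or later for window-function support, and the three movie rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (Movie_Num INTEGER, Movie_Genre TEXT, Movie_Cost REAL);
    -- Hypothetical rows: the ACTION average is 20.0, the DRAMA average is 20.0.
    INSERT INTO movie VALUES
        (1, 'ACTION', 10.0), (2, 'ACTION', 30.0), (3, 'DRAMA', 20.0);
""")

percentages = conn.execute("""
    SELECT m.Movie_Num,
           (m.Movie_Cost * 100.0 /
            AVG(m.Movie_Cost) OVER (PARTITION BY m.Movie_Genre)
           ) - 100 AS percentage
    FROM movie m
    ORDER BY m.Movie_Num
""").fetchall()

print(percentages)
```

The window AVG is computed per genre but attached to every row, so the per-row cost and the genre average are available in the same SELECT without a join.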
SELECT
a.Movie_Cost, a.Movie_Num,
b.Movie_Genre,b.Average,
100*(a.Movie_Cost-b.Average)/b.Average
FROM
movie AS a
iNNER JOIN(
SELECT
Movie_Genre,
AVG(Movie_Cost) AS Average
FROM
movie
GROUP BY
Movie_Genre
) AS b
ON
a.Movie_Genre = b.Movie_Genre
I need to identify which month has the most entries. I've used the TO_CHAR function to format the date column to just the month. Using SELECT COUNT(*) in combination with a GROUP BY clause, I am able to return the month and count for all records.
However, I need to return only the one row that has the MAX of the COUNT. I've attempted to do so by adding a HAVING clause, but it returns an error. I suspect I need a subquery in here somewhere but am unsure how to go about it.
SELECT TO_CHAR(P.DATEREGISTERED,'MONTH') MONTH, COUNT(*) COUNT
FROM PET P
GROUP BY TO_CHAR(P.DATEREGISTERED,'MONTH')
HAVING COUNT = MAX(COUNT);
Another Attempt:
SELECT TO_CHAR(P.DATEREGISTERED,'MONTH') MONTH, COUNT(*) COUNT
FROM PET P
GROUP BY TO_CHAR(P.DATEREGISTERED,'MONTH')
HAVING COUNT(*) = (SELECT MAX(TO_CHAR(P.DATEREGISTERED,'MONTH')) FROM PET P);
In your second attempt, you group by month and count the records in each group, but then you compare that count with the maximum of the date value converted to a month string. The two sides of the comparison are not even of the same type.
The query that you provided in your answer correctly compares counts on both sides.
Another way to rewrite the query would be
select * from
(SELECT TO_CHAR(P.DATEREGISTERED,'MONTH') MONTH, COUNT(*) COUNT
FROM PET P
GROUP BY TO_CHAR(P.DATEREGISTERED,'MONTH') order by count(*) desc )
where rownum=1
Here we order the records in the subquery by descending count and then take the first row.
The code below works and returns the correct response. It is unclear to me why it works but the above attempts (with aliases) do not.
SELECT TO_CHAR(P.DATEREGISTERED,'MONTH') MONTH, COUNT(*) COUNT
FROM PET P
GROUP BY TO_CHAR(P.DATEREGISTERED,'MONTH')
HAVING COUNT(*) = (SELECT MAX(COUNT(*)) FROM PET P GROUP BY TO_CHAR(P.DATEREGISTERED,'MONTH'));
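The key point is that the maximum has to be taken over the per-month counts, which needs its own level of grouping; a select-list alias such as COUNT is also not visible in HAVING in Oracle. The sketch below ports the idea to SQLite via Python's sqlite3 module, so it can be run without an Oracle instance. This is an adaptation, not the original Oracle syntax: strftime stands in for TO_CHAR, a derived table stands in for the nested MAX(COUNT(*)), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pet (petid INTEGER, dateregistered TEXT);
    -- Hypothetical rows: two registrations in January, one in March.
    INSERT INTO pet VALUES
        (1, '2020-01-05'), (2, '2020-01-20'), (3, '2020-03-02');
""")

busiest_months = conn.execute("""
    SELECT strftime('%m', dateregistered) AS month, COUNT(*) AS cnt
    FROM pet
    GROUP BY strftime('%m', dateregistered)
    HAVING COUNT(*) = (
        -- Inner level: one count per month. Outer level: the largest of them.
        SELECT MAX(cnt)
        FROM (SELECT COUNT(*) AS cnt
              FROM pet
              GROUP BY strftime('%m', dateregistered))
    )
""").fetchall()

print(busiest_months)
```

Unlike the ROWNUM approach, this form returns every month that ties for the highest count.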
I just started learning SQL the other day and hit a stumbling block. I've got some code that looks like this:
SELECT player, SUM(wins) from (
SELECT win_counts.player, win_counts.wins
from win_counts
UNION
SELECT lose_counts.player, lose_counts.loses
from lose_counts
group by win_counts.player
) as temp_alias
Here is the error I get:
ERROR: missing FROM-clause entry for table "win_counts"
LINE 7: group by win_counts.player
The win_counts table contains a list of player ids and the number of matches they have won. The lose_counts table contains a list of player ids and the number of matches they have lost. Ultimately I want a table of player ids and the total number of matches each player has played.
Thank you for the help. Sorry I don't have more information... my understanding of sql is pretty rudimentary.
Group by appears to be in the wrong place.
SELECT player, SUM(wins) as SumWinsLoses
FROM(
SELECT win_counts.player, win_counts.wins
FROM win_counts
UNION ALL -- as Gordon points out 'ALL' is likely needed, otherwise your sum will be
-- off as the UNION alone will distinct the values before the sum
-- and if you have a player with the same wins and losses (2),
-- the sum will return only 2 instead of (4).
SELECT lose_counts.player, lose_counts.loses
FROM lose_counts) as temp_alias
GROUP BY player
Just to be clear, SUM(wins) sums both wins and losses under the name "wins", because the column names of a UNION result come from its first SELECT. So each player's wins and losses are aggregated together.
Here's a working SQL Fiddle. Notice that without the UNION ALL, player #2 has an improper count.
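The difference between UNION and UNION ALL can be reproduced with a small sketch using Python's sqlite3 module. The rows are made up so that player 2 has the same number of wins and losses, which is exactly the case where plain UNION collapses two rows into one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE win_counts (player INTEGER, wins INTEGER);
    CREATE TABLE lose_counts (player INTEGER, loses INTEGER);
    -- Hypothetical data: player 2 has 2 wins and 2 losses.
    INSERT INTO win_counts VALUES (1, 3), (2, 2);
    INSERT INTO lose_counts VALUES (1, 1), (2, 2);
""")

query = """
    SELECT player, SUM(wins) AS total_matches
    FROM (
        SELECT player, wins FROM win_counts
        {union}
        SELECT player, loses FROM lose_counts
    ) AS temp_alias
    GROUP BY player
    ORDER BY player
"""

# UNION ALL keeps both (2, 2) rows, so player 2 sums to 4.
with_all = conn.execute(query.format(union="UNION ALL")).fetchall()

# UNION removes the duplicate (2, 2) row, so player 2 sums to only 2.
without_all = conn.execute(query.format(union="UNION")).fetchall()

print(with_all, without_all)
```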
You already have a good answer and comments from others. For your edification:
In some scenarios it might be more efficient to aggregate before the union.
select player, sum(wins) from (
select player, count(*) as wins
from win_counts
group by player
UNION ALL /* without ALL you'll eliminate duplicate rows */
select player, count(*) as losses
from lose_counts
group by player
) as t
group by player
This should also give equivalent results if each player has both wins and losses:
select wins.player, wins + losses as total_matches
from
(
select player, count(*) as wins
from win_counts
group by player
) as wins
inner join
(
select player, count(*) as losses
from lose_counts
group by player
) as losses
on losses.player = wins.player
The fix for missing wins/losses is a full outer join:
select
coalesce(wins.player, losses.player) as player,
coalesce(wins.wins, 0) + coalesce(losses.losses, 0) as total_matches
from
(
select player, count(*) as wins
from win_counts
group by player
) as wins
full outer join
(
select player, count(*) as losses
from lose_counts
group by player
) as losses
on losses.player = wins.player
These complicated queries should give you a taste of why it's a bad idea to use separate tables for data that belongs together. In this case you should probably prefer a single table that records all matches as wins, losses (or ties).
I have a table named Grades with columns named Student, Practical, and Written. I am trying to figure out the top 5 students by total score on the test. Here are the queries that I have; I'm not sure how to join them correctly. I am using Oracle 11g.
This gets me the total sum for each student:
SELECT Student, Practical, Written, (Practical+Written) AS SumColumn
FROM Grades;
This gets the top 5 students:
SELECT Student
FROM ( SELECT Student
, DENSE_RANK() OVER (ORDER BY Score DESC) as Score_dr
FROM Grades )
WHERE Score_dr <= 5
order by Score_dr;
The approach I prefer is data-centric, rather than row-position centric:
SELECT g.Student, g.Practical, g.Written, (g.Practical+g.Written) AS SumColumn
FROM Grades g
LEFT JOIN Grades g2 on g2.Practical+g2.Written > g.Practical+g.Written
GROUP BY g.Student, g.Practical, g.Written, (g.Practical+g.Written)
HAVING COUNT(*) < 5
ORDER BY g.Practical+g.Written DESC
This works by joining each student with all students that have greater scores, then using a HAVING clause to keep only those students with fewer than 5 others scoring higher, giving you the top 5.
The left join is needed to return the top scorer(s), which have no other students with greater scores to join to.
Ties are all returned, leading to more than 5 rows in the case of a tie for 5th.
By not using row position logic, which varies from database to database, this query is also completely portable.
Note that the ORDER BY is optional.
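The self-join approach can be exercised with a small sketch in Python's sqlite3 module. The seven students and their scores are made up, with E and F tied for 5th place to show that ties are all returned. An ORDER BY is included here only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Grades (Student TEXT, Practical INTEGER, Written INTEGER);
    -- Hypothetical totals: A=90, B=85, C=80, D=75, E=70, F=70, G=40.
    INSERT INTO Grades VALUES
        ('A', 50, 40), ('B', 45, 40), ('C', 40, 40), ('D', 40, 35),
        ('E', 30, 40), ('F', 35, 35), ('G', 20, 20);
""")

top_students = conn.execute("""
    SELECT g.Student, g.Practical + g.Written AS SumColumn
    FROM Grades g
    LEFT JOIN Grades g2
           ON g2.Practical + g2.Written > g.Practical + g.Written
    GROUP BY g.Student, g.Practical, g.Written
    HAVING COUNT(*) < 5        -- fewer than 5 students with a higher total
    ORDER BY SumColumn DESC, g.Student
""").fetchall()

print(top_students)
```

G has six students above it and is filtered out; E and F each have four, so both are kept and six rows come back.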
With Oracle SQL's analytic functions you can do:
SELECT score.Student, Practical, Written, (Practical+Written) as SumColumn
FROM ( SELECT Student, DENSE_RANK() OVER (ORDER BY Score DESC) as Score_dr
FROM VOTES ) score, students
WHERE score.score_dr <= 5
and score.Student = students.Student
order by score.Score_dr;
You can easily include the projection of the first query in the sub-query of the second.
SELECT Student
, Practical
, Written
, tot_score
FROM (
SELECT Student
, Practical
, Written
, (Practical+Written) AS tot_score
, DENSE_RANK() OVER (ORDER BY (Practical+Written) DESC) as Score_dr
FROM Grades
)
WHERE Score_dr <= 5
order by Score_dr;
One virtue of analytic functions is that we can just use them in any query. This distinguishes them from aggregate functions, where we need to include all non-aggregate columns in the GROUP BY clause (at least with Oracle).
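A short sketch of the DENSE_RANK approach, run through Python's sqlite3 module rather than Oracle (this assumes SQLite 3.25 or later for window functions, and the sample rows are made up). Because DENSE_RANK gives tied totals the same rank without skipping numbers, "rank <= 5" here means the top five distinct totals, which can be more than five students.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Grades (Student TEXT, Practical INTEGER, Written INTEGER);
    -- Hypothetical totals: A=90, B=90, C=85, D=80, E=75, F=70, G=40.
    INSERT INTO Grades VALUES
        ('A', 50, 40), ('B', 45, 45), ('C', 45, 40), ('D', 40, 40),
        ('E', 40, 35), ('F', 35, 35), ('G', 20, 20);
""")

ranked = conn.execute("""
    SELECT Student, tot_score, Score_dr
    FROM (
        SELECT Student,
               (Practical + Written) AS tot_score,
               DENSE_RANK() OVER (ORDER BY (Practical + Written) DESC) AS Score_dr
        FROM Grades
    )
    WHERE Score_dr <= 5
    ORDER BY Score_dr, Student
""").fetchall()

print(ranked)
```

A and B share rank 1, so six students fall within the first five ranks and only G is excluded.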