I am working with two tables. One is movie_score, which contains an id, name, and score. The other is movie_cast, which contains mid, cid, and name, where mid is the movie id and cid is the cast id. The problem I must solve is as follows:
Find top 10 (distinct) cast members who have the highest average movie scores. The output list must be sorted by score (from high to low), and then, by cast name in alphabetical order, if/when they have the same average score. The search must NOT include: (a) movies with scores lower than 50 AND (b) cast members who have appeared in less than 3 movies (again, only counting the number of appearances in movies with scores of at least 50). (Expected Output: cid, cname, average score)
I have tried to put the command together but so far this is all I was able to get:
SELECT DISTINCT movie_cast.cid, movie_cast.cname FROM movie_score INNER JOIN movie_cast ON movie_score.id=movie_cast.mid ORDER BY cname LIMIT 10;
movie-name-score.txt goes with movie_score:
Example of .txt file
9,"Star Wars: Episode III - Revenge of the Sith 3D",80
24214,"The Chronicles of Narnia: The Lion, The Witch and The Wardrobe",76
1789,"War of the Worlds",74
10009,"Star Wars: Episode II - Attack of the Clones 3D",67
771238285,"Warm Bodies",-1
770785616,"World War Z",-1
771303871,"War Witch",89
771323601,"War of the Worlds the True Story",-1
movie-cast.txt goes with movie_cast:
Example:
9,162652153,"Hayden Christensen"
9,162652152,"Ewan McGregor"
9,418638213,"Kenny Baker"
9,548155708,"Graeme Blundell"
9,358317901,"Jeremy Bulloch"
9,178810494,"Anthony Daniels"
9,770726713,"Oliver Ford Davies"
9,162652156,"Samuel L. Jackson"
9,162655731,"James Earl Jones"
I expect to have an output something like:
162655731,"James Earl Jones",average score of the movies they have been in
Does anyone know the best way to create this command?
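One possible shape for this query, as a minimal sketch rather than a definitive answer: filter out scores below 50 in WHERE, group by cast member, keep only groups with at least 3 movies in HAVING, then sort and limit. The sketch runs the SQL through Python's built-in sqlite3 module as a stand-in for whatever database is actually used. The column name cname follows the attempted query above, and the sample rows are made up purely to exercise the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie_score (id INTEGER PRIMARY KEY, name TEXT, score INTEGER);
CREATE TABLE movie_cast (mid INTEGER, cid INTEGER, cname TEXT);
""")

# Made-up sample rows (not from the real .txt files).
conn.executemany("INSERT INTO movie_score VALUES (?, ?, ?)", [
    (1, "A", 80), (2, "B", 70), (3, "C", 60), (4, "D", 40), (5, "E", -1),
])
conn.executemany("INSERT INTO movie_cast VALUES (?, ?, ?)", [
    (1, 10, "Ann"), (2, 10, "Ann"), (3, 10, "Ann"),               # 3 qualifying movies
    (1, 20, "Bob"), (2, 20, "Bob"), (4, 20, "Bob"),               # only 2 qualifying movies
    (1, 30, "Cy"), (2, 30, "Cy"), (3, 30, "Cy"), (5, 30, "Cy"),   # 3 qualifying movies
])

TOP_CAST_SQL = """
SELECT c.cid, c.cname, AVG(s.score) AS avg_score
FROM movie_cast AS c
JOIN movie_score AS s ON s.id = c.mid
WHERE s.score >= 50                      -- (a) ignore movies scored below 50
GROUP BY c.cid, c.cname
HAVING COUNT(DISTINCT c.mid) >= 3        -- (b) at least 3 qualifying movies
ORDER BY avg_score DESC, c.cname ASC     -- score first, then name as tie-breaker
LIMIT 10
"""

top_cast = conn.execute(TOP_CAST_SQL).fetchall()
for row in top_cast:
    print(row)
```

Because the score filter sits in WHERE, it is applied before grouping, so both the average and the appearance count only see movies with a score of at least 50.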
I am working on a project and ran into a problem writing an efficient query.
I will start by presenting the problem and the solution I found.
We have the following ERD structure:
A Player has many Scores, and a Score has many Handicap Results.
We have a many to many relationship between Handicap and League.
At some point my app runs a calculation formula that takes all players from a Customer or League and, for each score of each player, creates a handicap result for every Handicap that the Club / League has.
HandicapResults: value, score_id, handicap_id
Handicaps: game_type(string), league_ids (association between the handicap and leagues)
Score: league_id, player_id, game_type(string), play_at (date), round_id (integer)
On the customers#show action I want to display for all handicaps and players the result.
The result is the LAST SCORE PLAYED SORTED BY ROUND_ID DESC AND PLAY_AT DESC. From this score we take the handicap_result corresponding to the customer's handicaps.
The solution I found for one player would be:
The only problem is that this runs one SELECT for each player displayed in the view. I would like to write a single SELECT that returns the values for a collection of players (player_ids).
At the moment the sql returns the handicap_result corresponding to the last score played (sorted_by round_id desc and play_at desc) where score.league_id included in handicap.league_ids and score.game_type = handicap.game_type.
I would like to create a method that takes player_ids, handicaps, and any other information the query requires as parameters.
That returns the following:
Ex:
player_ids: [1, 2, 3]
handicaps: [handicap_1, handicap_2]
# and return something like:
{
#player_id 1: { handicap_1.id: value, handicap_2.id: value }
.........
#player_id 5: { handicap_1.id: value, handicap_2.id: value }
}
# the value is the handicap_result where handicap_result.handicap_id == handicap_1.id / handicap_2.id and for the corresponding score
Hopefully I have described the problem clearly. I would really appreciate help writing a query that runs once and returns the values for a given collection of players.
Thank you and have a nice day!
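One way to get everything in a single round trip, sketched under several assumptions: rank each player's scores per handicap with ROW_NUMBER() and keep only the top-ranked row. The table layout below is a guess based on the columns listed above. In particular, handicaps_leagues is a hypothetical join table for the Handicap/League many-to-many, and the sample rows are invented. The sketch uses Python's sqlite3 (window functions need SQLite 3.25 or newer) instead of the app's real database and ORM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (id INTEGER PRIMARY KEY, player_id INTEGER, league_id INTEGER,
                     game_type TEXT, play_at TEXT, round_id INTEGER);
CREATE TABLE handicaps (id INTEGER PRIMARY KEY, game_type TEXT);
-- Hypothetical join table for the Handicap <-> League many-to-many.
CREATE TABLE handicaps_leagues (handicap_id INTEGER, league_id INTEGER);
CREATE TABLE handicap_results (score_id INTEGER, handicap_id INTEGER, value REAL);

INSERT INTO handicaps VALUES (1, 'stroke'), (2, 'stroke');
INSERT INTO handicaps_leagues VALUES (1, 7), (2, 7);
INSERT INTO scores VALUES
  (1, 1, 7, 'stroke', '2014-01-01', 1),
  (2, 1, 7, 'stroke', '2014-02-01', 2),   -- latest score for player 1
  (3, 2, 7, 'stroke', '2014-01-15', 1);   -- latest score for player 2
INSERT INTO handicap_results VALUES
  (1, 1, 10.0), (1, 2, 11.0),
  (2, 1, 12.0), (2, 2, 13.0),
  (3, 1, 20.0), (3, 2, 21.0);
""")


def latest_handicap_values(conn, player_ids, handicap_ids):
    """Return {player_id: {handicap_id: value}} using one query for all players."""
    p_marks = ",".join("?" * len(player_ids))
    h_marks = ",".join("?" * len(handicap_ids))
    sql = f"""
    SELECT player_id, handicap_id, value
    FROM (
        SELECT s.player_id, h.id AS handicap_id, hr.value,
               ROW_NUMBER() OVER (
                   PARTITION BY s.player_id, h.id
                   ORDER BY s.round_id DESC, s.play_at DESC
               ) AS rn
        FROM scores AS s
        JOIN handicaps AS h ON h.game_type = s.game_type
        JOIN handicaps_leagues AS hl
          ON hl.handicap_id = h.id AND hl.league_id = s.league_id
        LEFT JOIN handicap_results AS hr
          ON hr.score_id = s.id AND hr.handicap_id = h.id
        WHERE s.player_id IN ({p_marks}) AND h.id IN ({h_marks})
    ) AS ranked
    WHERE rn = 1
    """
    result = {}
    for player_id, handicap_id, value in conn.execute(sql, [*player_ids, *handicap_ids]):
        result.setdefault(player_id, {})[handicap_id] = value
    return result


values = latest_handicap_values(conn, [1, 2], [1, 2])
print(values)
```

The LEFT JOIN on handicap_results is a design choice: the latest matching score is always the one selected, and the value is simply None if no result was calculated for it yet.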
I am currently working on a program that is supposed to predict the outcomes of a 1v1 contest. I have given each player their own elo score and am collecting all sorts of data in order to predict who the winner would be.
For each fighter, I want to collect the current average elo of people that they are defeating as well as the current average elo of people that are defeating them. Below is some sample data and explanations in order to help you better understand the data structure.
The picture above shows the basic stats view, V_FIGHT_REVIEW that simplifies my fights table for stats collection. FID is the unique fight id and identifies the fight. PID is the player id and identifies each unique player. The WINNER column represents the winner of the fight. So if PID is not equal to WINNER, that player did not win the fight.
This picture represents the PLAYERS table. To the left you will recognize the PID for each player. To the right you will see the column named ELO.
To rephrase the question, I am having trouble figuring out how I can produce the current average elo of each player they have defeated and the current average elo of each player that has defeated them. These average elos should change as their opponents win/lose fights. The output should be similar to below:
PID | AVG_ELO_DEF | AVG_ELO_DEF_BY
I am 99% sure there is a better way to do this but here is the answer I came up with.
I created a new view, v_avg_elo, with this query:
select w.fid, w.pid winner, l.pid loser, w.elo winner_elo, l.elo loser_elo
from (select * from v_fight_review where pid = winner) w,
(select * from v_fight_review where pid <> winner) l
where w.fid = l.fid;
This query gets the elos of every winner and loser for each fight.
I then created two other views. One holds the average elo of the opponents each player has defeated. The other holds the average elo of the opponents who have defeated them. Their code is below.
--gets the average elo of opponents that they can beat
create view v_avg_elo_winning as
select p.pid, round(avg(vae.loser_elo), 0) elo
from players p, v_avg_elo vae
where p.pid = vae.winner
group by pid
order by pid;
--gets the average of all of the people that can beat them
create view v_avg_elo_losing as
select p.pid, round(avg(vae.winner_elo), 0) elo
from players p, v_avg_elo vae
where p.pid = vae.loser
group by pid
order by pid;
Suggestions are always welcome.
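As one suggestion, both averages can probably be computed in a single query with conditional aggregation instead of three views. The sketch below is an illustration under assumptions, not a drop-in replacement: it uses Python's sqlite3 as a stand-in database, models v_fight_review as a plain table with only fid, pid, and winner, reads the current elo from players, and uses invented sample fights.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (pid INTEGER PRIMARY KEY, elo INTEGER);
-- Stand-in for the V_FIGHT_REVIEW view: one row per player per fight.
CREATE TABLE v_fight_review (fid INTEGER, pid INTEGER, winner INTEGER);

INSERT INTO players VALUES (1, 1600), (2, 1500), (3, 1400);
INSERT INTO v_fight_review VALUES
  (100, 1, 1), (100, 2, 1),   -- player 1 beats player 2
  (101, 1, 1), (101, 3, 1),   -- player 1 beats player 3
  (102, 2, 3), (102, 3, 3);   -- player 3 beats player 2
""")

AVG_ELO_SQL = """
SELECT me.pid,
       -- opponents this player defeated
       ROUND(AVG(CASE WHEN me.pid = me.winner THEN opp_p.elo END)) AS avg_elo_def,
       -- opponents who defeated this player
       ROUND(AVG(CASE WHEN me.pid <> me.winner THEN opp_p.elo END)) AS avg_elo_def_by
FROM v_fight_review AS me
JOIN v_fight_review AS opp
  ON opp.fid = me.fid AND opp.pid <> me.pid   -- the other player in the same fight
JOIN players AS opp_p ON opp_p.pid = opp.pid  -- opponent's current elo
GROUP BY me.pid
ORDER BY me.pid
"""

avg_elos = conn.execute(AVG_ELO_SQL).fetchall()
for row in avg_elos:
    print(row)
```

AVG ignores NULLs, so a CASE without an ELSE branch leaves each fight out of the average it does not belong to. A player with no wins (or no losses) gets NULL in the corresponding column. Since the elo comes from players at query time, the averages follow the opponents' current ratings.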
The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::point datatype with an optimally tessellated index.
User wants the five closest guitar teachers to his location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers, if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of her location.
Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.* from
(
select STUDENTSANDTEACHERSINRADIUS.*,
rowpos = row_number()
over(partition by
STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress
order by DistanceInMiles)
from
(
select
SINGER.name,
SINGER.streetaddress,
SINGER.city,
SINGER.state,
SINGER.zipcode,
TEACHERS.name as TEACHERname,
TEACHERS.streetaddress as TEACHERaddress,
TEACHERS.city as TEACHERcity,
TEACHERS.state as TEACHERstate,
TEACHERS.zipcode as TEACHERzip,
TEACHERS.teacherid,
geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
/ (1.6 * 1000) as DistanceInMiles
from
SINGER left join TEACHERS
on
( TEACHERS.geolocation).STDistance( geography::Point(SINGER.lat, SINGER.lon, 4326))
< (SINGER.radius * (1.6 * 1000 ))
and TEACHERS.primaryInstrument='voice'
) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us
If you just need the 5 closest teachers regardless of radius, you could write something like this. Each student will appear up to 5 times in the result, once per teacher. I am not sure whether that is the shape of output you want.
select
S.name,
S.streetaddress,
S.city,
S.state,
S.zipcode,
T.name as TEACHERname,
T.streetaddress as TEACHERaddress,
T.city as TEACHERcity,
T.state as TEACHERstate,
T.zipcode as TEACHERzip,
T.teacherid,
T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
/ (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T
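For comparison, the expanding-radius idea from the question can be sketched procedurally. This is only an in-memory illustration in Python, not SQL Server code: teachers_within is a hypothetical stand-in for the indexed radius query, the haversine formula replaces STDistance, and all names and coordinates are made up. It shows the loop structure in which only students who are still unmatched get re-queried with a larger radius.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def teachers_within(teachers, lat, lon, radius):
    """Stand-in for the indexed radius query: (distance, name) pairs, closest first."""
    hits = []
    for name, (t_lat, t_lon) in teachers.items():
        dist = haversine_miles(lat, lon, t_lat, t_lon)
        if dist <= radius:
            hits.append((dist, name))
    return sorted(hits)


def closest_teachers(students, teachers, wanted=5, step=10, max_radius=100):
    """Grow the radius by `step` miles, re-querying only unmatched students."""
    matches = {}
    pending = dict(students)
    radius = step
    while pending and radius <= max_radius:
        final_round = radius + step > max_radius
        for name, (lat, lon) in list(pending.items()):
            hits = teachers_within(teachers, lat, lon, radius)
            if len(hits) >= wanted or final_round:
                matches[name] = [t for _, t in hits[:wanted]]
                del pending[name]
        radius += step
    return matches


# Made-up coordinates: one degree of latitude is roughly 69 miles.
teachers = {
    "near_a": (40.05, -75.0),
    "near_b": (40.10, -75.0),
    "mid": (40.50, -75.0),
    "far": (42.00, -75.0),
}
students = {"s1": (40.0, -75.0), "s2": (44.0, -75.0)}

result = closest_teachers(students, teachers, wanted=3)
print(result)
```

Here s1 is matched once the radius reaches 40 miles, while s2 is carried through every round and ends up with an empty list at the 100-mile cap. Whether this beats the single top-5 query above depends on how well the spatial index serves each kind of lookup, which would need to be measured.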
So, I have a problem with a SQL Query.
It's about getting weather data for German cities. I have 4 tables: staedte (the cities with primary key loc_id), gehoert_zu (contains the city-key and the key of the weather station that is closest to this city (stations_id)), wettermessung (contains all the weather information and the station's key value) and wetterstation (contains the stations key and location). And I'm using PostgreSQL
Here is what the tables look like:
wetterstation
s_id[PK] standort lon lat hoehe
----------------------------------------
10224 Bremen 53.05 8.8 4
wettermessung
stations_id[PK] datum[PK] max_temp_2m ......
----------------------------------------------------
10224 2013-3-24 -0.4
staedte
loc_id[PK] name lat lon
-------------------------------
15 Asch 48.4 9.8
gehoert_zu
loc_id[PK] stations_id[PK]
-----------------------------
15 10224
What I'm trying to do is get the name of the city with (for example) the highest temperature in a specified period (could be a whole month, or a day). Since the weather data is bound to a station, I actually need to get the station's ID and then just choose one of the cities that belong to this station. A possible question would be: "In which city was it hottest in June?" and, say, the highest measured temperature was at station number 10224. As a result I want to get the city Asch. What I have so far is this:
SELECT name, MAX (max_temp_2m)
FROM wettermessung, staedte, gehoert_zu
WHERE wettermessung.stations_id = gehoert_zu.stations_id
AND gehoert_zu.loc_id = staedte.loc_id
AND wettermessung.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX (max_temp_2m) DESC
LIMIT 1
There are two problems with the results:
1) it's taking waaaay too long. The tables are not that big (cities has about 70k entries), but it needs between 1 and 7 minutes to get things done (depending on the time span)
2) it ALWAYS produces the same city and I'm pretty sure it's not the right one either.
I hope I managed to explain my problem clearly enough and I'd be happy for any kind of help. Thanks in advance ! :D
If you want to get the max temperature per city use this statement:
SELECT * FROM (
SELECT gz.loc_id, MAX(max_temp_2m) as temperature
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY gz.loc_id) as subselect
INNER JOIN staedte as std
ON std.loc_id = subselect.loc_id
ORDER BY subselect.temperature DESC
Use this statement to get the city with the highest temperature (only 1 city):
SELECT * FROM(
SELECT name, MAX(max_temp_2m) as temp
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
INNER JOIN staedte as std
ON gz.loc_id = std.loc_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX(max_temp_2m) DESC
LIMIT 1) as subselect
ORDER BY temp desc
LIMIT 1
Prefer explicit joins (INNER, LEFT, RIGHT JOIN) over a comma-separated list of tables in the FROM clause. The planner usually produces the same plan for both forms, but explicit join conditions are easier to read and make a missing condition much harder to overlook.
This is a general example of how to get the item with the highest, lowest, biggest, smallest, whatever value. You can adjust it to your particular situation.
select fred, barney, wilma
from bedrock join
(select fred, max(dino) maxdino
from bedrock
where whatever
group by fred ) flinstone on bedrock.fred = flinstone.fred
where dino = maxdino
and other conditions
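To make the pattern concrete, here is a small sketch applied to a cut-down wettermessung table: for each station, find the day on which its highest temperature was measured. It runs in Python's sqlite3 as a stand-in for PostgreSQL, and the rows and the second station id are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wettermessung (stations_id INTEGER, datum TEXT, max_temp_2m REAL);
INSERT INTO wettermessung VALUES
  (10224, '2012-08-18', 30.5),
  (10224, '2012-08-19', 35.1),
  (10338, '2012-08-18', 33.0),
  (10338, '2012-08-19', 31.2);
""")

HOTTEST_DAY_SQL = """
SELECT m.stations_id, m.datum, m.max_temp_2m
FROM wettermessung AS m
JOIN (SELECT stations_id, MAX(max_temp_2m) AS top_temp   -- one max per group
      FROM wettermessung
      GROUP BY stations_id) AS best
  ON best.stations_id = m.stations_id
WHERE m.max_temp_2m = best.top_temp                      -- keep only the row(s) at the max
ORDER BY m.stations_id
"""

hottest_days = conn.execute(HOTTEST_DAY_SQL).fetchall()
for row in hottest_days:
    print(row)
```

Note that this pattern returns every row that ties for the maximum, so a station can appear more than once if the same peak was measured on several days.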
I suggest using a consistent naming convention. Singular names for tables that hold a single item per row are a good convention. The only table of yours that breaks this is staedte, which should be stadt.
I also suggest using station_id consistently instead of mixing s_id and stations_id.
Building on these premises, for your question:
... get the name of the city with the ... highest temperature at a specified date
SELECT s.name, w.max_temp_2m
FROM (
SELECT station_id, max_temp_2m
FROM wettermessung
WHERE datum >= '2012-8-1'::date
AND datum < '2012-12-1'::date -- exclude upper border
ORDER BY max_temp_2m DESC, station_id -- id as tie breaker
LIMIT 1
) w
JOIN gehoert_zu g USING (station_id) -- assuming normalized names
JOIN stadt s USING (loc_id)
Use explicit JOIN conditions for better readability and maintenance.
Use table aliases to simplify your query.
Use x >= a AND x < b to include the lower border and exclude the upper border, which is the common use case.
Aggregate first and pick your station with the highest temperature, before you join to the other tables to retrieve the city name. Much simpler and faster.
You did not specify what to do when multiple "wettermessungen" tie on max_temp_2m in the given time frame. I added station_id as tiebreaker, meaning the station with the lowest id will be picked consistently if there are multiple qualifying stations.
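The "pick the station first, then join" idea can be tried out with a few rows. The sketch below is an approximation under assumptions: it runs in Python's sqlite3 instead of PostgreSQL, keeps the original column name stations_id (so it joins with ON instead of USING), stores dates as ISO text, and uses invented measurements and city names apart from Asch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wettermessung (stations_id INTEGER, datum TEXT, max_temp_2m REAL);
CREATE TABLE gehoert_zu (loc_id INTEGER, stations_id INTEGER);
CREATE TABLE staedte (loc_id INTEGER PRIMARY KEY, name TEXT);

INSERT INTO wettermessung VALUES
  (10224, '2012-08-19', 35.1),
  (10338, '2012-08-19', 33.0),
  (10224, '2013-03-24', -0.4);
INSERT INTO gehoert_zu VALUES (15, 10224), (16, 10224), (17, 10338);
INSERT INTO staedte VALUES (15, 'Asch'), (16, 'Beispielstadt'), (17, 'Musterdorf');
""")

HOTTEST_CITY_SQL = """
SELECT s.name, w.max_temp_2m
FROM (
    SELECT stations_id, max_temp_2m
    FROM wettermessung
    WHERE datum >= '2012-08-01'
      AND datum <  '2012-12-01'             -- exclude upper border
    ORDER BY max_temp_2m DESC, stations_id  -- id as tie-breaker
    LIMIT 1                                 -- only one measurement reaches the joins
) AS w
JOIN gehoert_zu AS g ON g.stations_id = w.stations_id
JOIN staedte AS s ON s.loc_id = g.loc_id
ORDER BY s.name
"""

hottest = conn.execute(HOTTEST_CITY_SQL).fetchall()
for row in hottest:
    print(row)
```

The result lists every city attached to the winning station. If a single city is enough, an extra LIMIT 1 on the outer query would pick one of them.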
I'm having trouble solving the following SQL requests:
Give the names of the actors that have acted in more films than 'sara allgood' and who have acted in films that won the 'cannes film festival'. Also, give the filmname.
Get the percentage of movies who won awards out of all movies produced between the years 1970 and 1990.
There are several tables but I'm assuming that only 4 are needed:
'films','remakes','casts', 'awtypes'
'films' attributes: filmid, filmname, year, director, studio, award
'remakes' attributes: filmid, title, year, priorfilm, prioryear
'casts' attributes: filmid, filmname, actor, award(10)
'awtypes' attributes: award(10), org(100), country, colloquial(50), year
It's a bit unclear to me how to match the award to the 'Cannes film festival' in the first query. The award field is only 10 characters, so it must be a reference to the awtypes table, but I don't know which field in awtypes contains the name of the award. I don't have access to the database at the moment, so it's either org or colloquial.
As for the second I don't know how I could compute the percentage but it seems that it should be solved using a union operator for the movies produced between 1970 and 1990 and the films that have won an award (I don't know how to place a condition for having at least one award).
A few hints, I hope they help you!
Give the names of the actors that have acted in more films than 'sara
allgood' and who have acted in films that won the 'cannes film
festival'. Also, give the filmname.
Based on the attributes you're stating, I would say that you can get to the right awtypes attribute via the casts table. They both contain the award(10) column. Given your data, I would expect the org(100) column to contain something on the organization that provides the prizes, so that would be my guess in this case for the cannes film festival content. But you would have to try it out and see what results you get. Unfortunately, as in this case, it is often quite hard to guess the contents of a column based only on column names.
Get the percentage of movies who won awards out of all movies produced
between the years 1970 and 1990.
Based on the info stated in your question, I would go with a guess that the award column in the films table contains a boolean or something that states if the movie won an award or not. You'd have to try this out. If that's the case, you can use a COUNT(*) on all movies between 1970 and 1990 and a COUNT(*) on all movies WHERE award = 1 (or something) to get the total numbers.
You could indeed combine these in a computation query with a UNION. Example that might help you:
SELECT SUM(cnt1) / SUM(cnt2) ... do the right computation here ...
FROM ( SELECT COUNT(*) AS cnt1
,0 AS cnt2
FROM table1
UNION ALL
SELECT 0 AS cnt1
,COUNT(*) AS cnt2
FROM table2) AS sub
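Filled in for the percentage question, the template could look like the sketch below. It is hedged on one big assumption: that films.award is NULL when a film has no award, which the question does not confirm. It runs in Python's sqlite3 with invented rows, and it also shows a single-pass alternative using conditional aggregation for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumption: award is NULL when the film has not won anything.
CREATE TABLE films (filmid INTEGER PRIMARY KEY, filmname TEXT, year INTEGER, award TEXT);
INSERT INTO films VALUES
  (1, 'Film A', 1972, 'AW1'),
  (2, 'Film B', 1975, NULL),
  (3, 'Film C', 1980, NULL),
  (4, 'Film D', 1990, NULL),
  (5, 'Film E', 1995, 'AW2');   -- outside 1970-1990, must not be counted
""")

# The UNION ALL template with both counts filled in.
UNION_SQL = """
SELECT 100.0 * SUM(cnt1) / SUM(cnt2) AS pct
FROM (SELECT COUNT(*) AS cnt1, 0 AS cnt2
      FROM films
      WHERE year BETWEEN 1970 AND 1990 AND award IS NOT NULL
      UNION ALL
      SELECT 0 AS cnt1, COUNT(*) AS cnt2
      FROM films
      WHERE year BETWEEN 1970 AND 1990) AS sub
"""

# Alternative: one scan of films with conditional aggregation.
SINGLE_PASS_SQL = """
SELECT 100.0 * SUM(CASE WHEN award IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct
FROM films
WHERE year BETWEEN 1970 AND 1990
"""

pct_union = conn.execute(UNION_SQL).fetchone()[0]
pct_single = conn.execute(SINGLE_PASS_SQL).fetchone()[0]
print(pct_union, pct_single)
```

Multiplying by 100.0 instead of 100 forces floating-point division, which matters in engines where dividing two integers truncates.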