So, I've a Uni assignment and the lecturer has picked this week to be ill and unable to answer questions.
We've been given a baseball database made up of 4 tables to work with. Table structures are as follows:
TABLENAME:(column1, column2...etc) PK = Primary Key, FK = Foreign Key
PLAYER:(num PK, name, dob, team FK, position)
GAME:(num PK, gamedate, hometeam FK, awayteam FK, homescore, awayscore)
GAMESTAT:(gamenum PK, playernum FK, homeruns, strikeout)
TEAM:(code PK, name, town, ground)
The aim of this particular question is to obtain the name of each stadium (ground in the team table), the sum of the home runs scored on that ground, the sum of the strikeouts, and then the sum of these two values, all within a specified date range.
My query and issue are below:
SELECT
t.ground AS GROUNDPLAYED,
SUM(gs.homeruns) as TOTALHOMERUNS,
SUM(gs.strikeouts) AS TOTALSTRIKEOUTS,
SUM(gs.homeruns + gs.strikeouts) AS COMBINEDTOTAL
FROM team t
LEFT OUTER JOIN game g ON g.hometeam = t.code
LEFT OUTER JOIN gamestat gs ON g.num = gs.gamenum
WHERE g.gamedate BETWEEN '7-AUG-2014' AND '13-AUG-2014'
GROUP BY t.ground;
My problem is that I get the correct values for games played but, despite using LEFT OUTER JOIN, I'm not getting all the stadiums to list. I'm convinced it has to do with the fact that I have had to join to the hometeam from the GAME table and it can only pick the home stadiums based on that.
Any help you may be able to offer would be much appreciated.
Move your WHERE clause to the ON clause for the join to gamestat.
A filter in the WHERE clause is applied after the joins have been performed, which removes the stadiums with no activity. Once this predicate is moved to the appropriate ON clause, it filters the gamestat rows before the join instead of after.
You have experienced the good fortune to encounter this important quirk of SQL, that the positioning of predicates affects the result-set, early in your education.
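To see the difference concretely, here is a minimal sketch using Python's built-in sqlite3 module. The two teams, the sample rows and the ISO-style date strings are made up for illustration, and your own database will want its own date literal format. It runs the query once with the date filter in WHERE and once with it in the ON clause of the join to gamestat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (code TEXT PRIMARY KEY, ground TEXT);
    CREATE TABLE game (num INTEGER PRIMARY KEY, gamedate TEXT, hometeam TEXT);
    CREATE TABLE gamestat (gamenum INTEGER, playernum INTEGER,
                           homeruns INTEGER, strikeouts INTEGER);
    INSERT INTO team VALUES ('A', 'Alpha Park'), ('B', 'Beta Field');
    -- Only team A hosts a game inside the date range (hypothetical data).
    INSERT INTO game VALUES (1, '2014-08-10', 'A'), (2, '2014-09-01', 'B');
    INSERT INTO gamestat VALUES (1, 10, 2, 3), (2, 20, 1, 1);
""")

# Filter in WHERE: applied after the outer joins, so Beta Field disappears.
where_version = conn.execute("""
    SELECT t.ground, SUM(gs.homeruns)
    FROM team t
    LEFT OUTER JOIN game g ON g.hometeam = t.code
    LEFT OUTER JOIN gamestat gs ON g.num = gs.gamenum
    WHERE g.gamedate BETWEEN '2014-08-07' AND '2014-08-13'
    GROUP BY t.ground
    ORDER BY t.ground
""").fetchall()

# Filter in ON: only decides which gamestat rows match, every ground is kept.
on_version = conn.execute("""
    SELECT t.ground, SUM(gs.homeruns)
    FROM team t
    LEFT OUTER JOIN game g ON g.hometeam = t.code
    LEFT OUTER JOIN gamestat gs ON g.num = gs.gamenum
         AND g.gamedate BETWEEN '2014-08-07' AND '2014-08-13'
    GROUP BY t.ground
    ORDER BY t.ground
""").fetchall()

print(where_version)  # [('Alpha Park', 2)]
print(on_version)     # [('Alpha Park', 2), ('Beta Field', None)]
```

The ground with no games in range comes back with a NULL total (None in Python); wrapping the sums in COALESCE(..., 0) would turn that into a zero if the assignment expects one.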
I have a query with joins that is not using the index that would be the best match and I am looking for help to correct this.
I have the following query:
select
equipment.name,purchaselines.description,contacts.name,vendors.accountNumber
from purchaselines
left join vendors on vendors.id = purchaselines.vendorId
left join contacts on contacts.id = vendors.contactId
left join equipment on equipment.id = purchaselines.equipmentId
where contacts.id = 12345
The table purchaselines has an index on the column vendorId, which is the proper index to use. When the query is run, I know the value of contacts.id which is joined to vendors.contactId which is joined to purchaselines.vendorId.
What is the proper way to run this query? Currently, no index is used on the table purchaselines.
If you intend to query a specific contact, I would put THAT table first, since it is the primary basis of the query. Additionally, you had left joins to the other tables (vendors, contacts, equipment). Having a WHERE clause on the CONTACTS table forces those joins to behave as INNER JOINs, thus REQUIRING a matching row.
That said, I would try to rewrite the query as (also using aliases for simplified readability of longer table names)
select
e.name,
pl.description,
c.name,
v.accountNumber
from
contacts c
join vendors v
on c.id = v.contactid
join purchaselines pl
on v.id = pl.vendorid
join equipment e
on pl.equipmentid = e.id
where
c.id = 12345
Also notice that the indentation of the JOINs helps readability (IMO) by showing how each table gets to the next in a more hierarchical manner. They are all regular INNER JOINs.
So, the lookup by contact ID will be the first and fastest step, then the join to vendors by that contact ID, which should optimize that join. Then, I would expect the purchase lines to have an index on vendorid optimizing that. And finally, the equipment table is joined on its PK.
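If you want to check what the optimizer actually does with the rewritten query, ask the engine for its plan rather than guessing. Below is a rough sketch using Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN. The sample rows, the index names and the extra index on vendors.contactId are my assumptions, not something from your schema. Your own engine has its own EXPLAIN facility, and its plan may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE vendors (id INTEGER PRIMARY KEY, contactId INTEGER,
                          accountNumber TEXT);
    CREATE TABLE equipment (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchaselines (id INTEGER PRIMARY KEY, vendorId INTEGER,
                                equipmentId INTEGER, description TEXT);
    -- Hypothetical index names; the vendors index is an assumption.
    CREATE INDEX idx_vendors_contact ON vendors (contactId);
    CREATE INDEX idx_purchaselines_vendor ON purchaselines (vendorId);
    INSERT INTO contacts VALUES (12345, 'Pat'), (2, 'Sam');
    INSERT INTO vendors VALUES (1, 12345, 'ACC-1'), (2, 2, 'ACC-2');
    INSERT INTO equipment VALUES (7, 'Drill');
    INSERT INTO purchaselines VALUES (100, 1, 7, 'Drill bits'),
                                     (101, 2, 7, 'Spare chuck');
""")

QUERY = """
    SELECT e.name, pl.description, c.name, v.accountNumber
    FROM contacts c
    JOIN vendors v ON c.id = v.contactId
    JOIN purchaselines pl ON v.id = pl.vendorId
    JOIN equipment e ON pl.equipmentId = e.id
    WHERE c.id = 12345
"""

rows = conn.execute(QUERY).fetchall()
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()

print(rows)
# The last column of each plan row is a human-readable description of one step.
for step in plan:
    print(step[-1])
```

Look for the index names in the printed steps. If a table shows up as a full scan instead, that is the join to investigate.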
FEEDBACK: Basic JOIN clarification.
JOIN is just the explicit statement of how two tables are related. You list the left-side and right-side tables, and the join condition states the relationship between them; that is all.
Now, in your data example, each table is subsequently nested under the one prior. It is quite common though that one table may link to multiple other tables. For example, an employee could have an ethnicity ID linking to an ethnicity lookup table, but also a job position ID linking to a job position lookup table. That might look something like
select
e.name,
eth.ethnicity,
jp.jobPosition
from
employee e
join ethnicity eth
on e.ethnicityid = eth.id
join jobPosition jp
on e.jobPositionID = jp.id
Notice here that both ethnicity and jobPosition are at the same hierarchical level under the employee table. If, for example, you wanted only certain types of employees, you can add the extra conditions directly at the location of the join, such as
join jobPosition jp
on e.jobPositionID = jp.id
AND jp.jobPosition = 'Manager'
This would get you a list of only those employees who are managers. You do not need to explicitly add a WHERE condition if you already include it directly in the JOIN/ON criteria. This helps keep the table-specific criteria at the join if you ever find yourself needing LEFT JOINs.
I am trying to create a SQL SELECT that displays the player's username and the name of the team associated with the player. This is my code, but it doesn't work as expected:
SELECT Player.userName AS Player,
Teams.TeamName AS [Team Name]
FROM Players, Teams
INNER JOIN Players
ON Team.ID = Player.userName
This is the Team's table
This is the player's table I just included the names only.
This is the full Player's Table with the contents of what is inside the table.
Your syntax is a bit off in your FROM clause. The best rule of thumb here is to NEVER use a comma in your FROM clause (the only exception is if you want to join EVERY row from one table to EVERY row of another creating a cartesian product of the two tables, but we rarely do that).
When you specify the relationship of the two tables in the ON clause of your JOIN you need to put the column(s) from each table that they have in common. A Team.ID will NEVER match to a Players.userName, so that is not the proper join condition.
Assuming you have a TeamID column in your players table so you know which Team each Player is on, you will have SQL that will look like:
SELECT Players.userName AS Player,
Teams.TeamName AS [Team Name]
FROM Teams
INNER JOIN Players
ON Teams.ID = Players.[TeamName_]
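Here is a small sketch of why the comma matters, run through Python's sqlite3 module. The sample rows and the TeamID column in Players are assumptions for illustration, since I can only guess at your real column names. A comma between the tables pairs every player with every team, while the INNER JOIN pairs each player only with their own team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Teams (ID INTEGER PRIMARY KEY, TeamName TEXT);
    -- TeamID is an assumed column linking each player to a team.
    CREATE TABLE Players (userName TEXT PRIMARY KEY, TeamID INTEGER);
    INSERT INTO Teams VALUES (1, 'Red'), (2, 'Blue');
    INSERT INTO Players VALUES ('ann', 1), ('bob', 2), ('cy', 1);
""")

# Comma in FROM with no join condition: 3 players x 2 teams = 6 rows.
cartesian = conn.execute(
    "SELECT Players.userName, Teams.TeamName FROM Players, Teams"
).fetchall()

# Explicit join on the columns the two tables have in common.
joined = conn.execute("""
    SELECT Players.userName AS Player, Teams.TeamName AS "Team Name"
    FROM Teams
    INNER JOIN Players ON Teams.ID = Players.TeamID
    ORDER BY Players.userName
""").fetchall()

print(len(cartesian))  # 6
print(joined)          # [('ann', 'Red'), ('bob', 'Blue'), ('cy', 'Red')]
```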
I am attempting to answer question #12 on sqlzoo.net
(http://sqlzoo.net/wiki/More_JOIN_operations). I couldn't figure out the answer on my own but I did manage to find the answer online.
12: Which were the busiest years for 'John Travolta', show the year and the number of movies he made each year for any year in which he made more than 2 movies.
Answer:
SELECT yr,COUNT(title) FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr
HAVING COUNT(title)=(SELECT MAX(c) FROM
(SELECT yr,COUNT(title) AS c FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr) AS t)
One of the parts that I do not fully understand is the multiple joins:
FROM movie
JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
Is Actor being joined only with Movie, or is actor being joined with Movie JOIN Casting?
I am trying to find a website that explains complex join statements, as my attempted answer was far from correct (missing many sections). I find subselect statements with multiple complex join statements a bit confusing at the moment, but I could not find a good website that breaks the information up to help me form my own queries.
The other part I don't fully understand is this:
(SELECT yr,COUNT(title) AS c FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr) AS t)
3. What is the above code trying to find?
Ok, glad you are not afraid to ask, and I'll do my best to help clarify what is going on... Please excuse my re-formatting of the query to my mindset of writing queries. It better shows the relationships of where things are coming from (my perspective), and may help you too.
A few other things about my rewrite. I also like to use alias references to the tables so every column is qualified with the table (or alias) it originates from. It prevents ambiguity, especially for someone who does not know your table structures and the relationships between tables. (m = alias for movie, c = alias for casting, a = alias for actor.) For the subquery, to avoid alias confusion, I suffixed them with 2, such as m2, c2, a2.
SELECT
m.yr,
COUNT(m.title)
FROM
movie m
JOIN casting c
ON m.id = c.movieid
JOIN actor a
ON c.actorid = a.id
WHERE
a.name = 'John Travolta'
GROUP BY
m.yr
HAVING
COUNT(m.title) = ( SELECT MAX(t.movieCount)
FROM
( SELECT m2.yr,
COUNT(m2.title) AS movieCount
FROM
movie m2
JOIN casting c2
ON m2.id = c2.movieid
JOIN actor a2
ON c2.actorid = a2.id
WHERE
a2.name='John Travolta'
GROUP BY
m2.yr ) AS t
)
First, notice that the outermost query (aliases m, c, a) and the innermost query (aliases m2, c2, a2) are virtually identical.
The query has to run from the deepest query first, in this case the m2, c2, a2 query. Look at it and see what IT is going to deliver. If you ran that on its own, you would get every year he had a movie and the number of movies; the result from their sample data goes from 1976 all the way to 2010 (about 20 rows). So far, nothing complex in itself. Now, just as each table may have an alias, each subquery in a FROM clause (such as this one) MUST have an alias, which is why the "AS t" is there. There is no true table named t; the alias wraps the entire query's result set.
So now, go one level up in the query also wrapped in parens...
SELECT MAX(t.movieCount)
FROM (EntireSubquery as t)
Although abbreviated, this is what the engine is doing: looking at the subquery result given the alias "t" and finding the maximum "movieCount" value, which is the highest count of movies done in a single year. In this case, the actual number is 3, and we are almost done.
Now, to the outermost query. Again, this is virtually identical to the innermost query; the only difference is the HAVING clause, which is applied after all the grouping per year is performed. It compares each year's count to the value 3 returned by SELECT MAX( t.movieCount ).
So, all the years that had only 1 or 2 movies are excluded from the result, and only the one year that had 3 movies is included.
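If it helps, you can run the three layers one at a time and look at each result. Here is a minimal sketch with Python's sqlite3 module on a tiny made-up data set (six hypothetical movies, not the sqlzoo data), so the numbers are easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
    CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER);
    INSERT INTO actor VALUES (1, 'John Travolta'), (2, 'Someone Else');
    INSERT INTO movie VALUES (1, 'A', 1976), (2, 'B', 1998), (3, 'C', 1998),
                             (4, 'D', 1998), (5, 'E', 2010), (6, 'F', 2010);
    INSERT INTO casting VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 1),
                               (6, 1), (6, 2);
""")

# Layer 1 (innermost): one row per year with the number of movies.
PER_YEAR = """
    SELECT m2.yr, COUNT(m2.title) AS movieCount
    FROM movie m2
    JOIN casting c2 ON m2.id = c2.movieid
    JOIN actor a2 ON c2.actorid = a2.id
    WHERE a2.name = 'John Travolta'
    GROUP BY m2.yr
"""
step1 = conn.execute(PER_YEAR + " ORDER BY m2.yr").fetchall()

# Layer 2: the largest per-year count, read from the aliased subquery "t".
step2 = conn.execute(
    f"SELECT MAX(t.movieCount) FROM ({PER_YEAR}) AS t"
).fetchone()[0]

# Layer 3 (outermost): keep only the years whose count equals that maximum.
step3 = conn.execute(f"""
    SELECT m.yr, COUNT(m.title)
    FROM movie m
    JOIN casting c ON m.id = c.movieid
    JOIN actor a ON c.actorid = a.id
    WHERE a.name = 'John Travolta'
    GROUP BY m.yr
    HAVING COUNT(m.title) = (SELECT MAX(t.movieCount) FROM ({PER_YEAR}) AS t)
""").fetchall()

print(step1)  # [(1976, 1), (1998, 3), (2010, 2)]
print(step2)  # 3
print(step3)  # [(1998, 3)]
```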
Now, to clarify the JOINs. Each table should have a relationship with one or more other tables (some are linking tables, such as the casting table, which holds both a movie ID and an actor ID). So, think of the join as: how do I put the tables in order so that each one touches the next until I have them all chained together? In this case
Movie -> Casting linked by the movie ID, then Casting -> Actor by the actor ID, and that is how I lay it out visually and hierarchically. I am starting FROM the Movie table, JOINing to the Casting table based ON Movie ID = Casting Movie ID. Then the Casting table is joined to the Actor table based on the common Actor ID field.
FROM
movie m
JOIN casting c
ON m.id = c.movieid
JOIN actor a
ON c.actorid = a.id
Now, this is a simple relationship, but you COULD have one primary table with multiple child-level tables. You could join multiple tables based on the respective data. Very simple sample to clarify the point. You have a student table going to a school. A student has a degree major, an ethnicity, an address state (assuming an online school and students can be from any state). If you had lookup tables for degrees, ethnicity and states, you might come up with something like...
select
s.firstname,
s.lastname,
d.DegreeDescription,
e.ethnicityDescription,
st.stateName
from
students s
join degrees d
on s.degreemajor = d.degreeID
join ethnicity e
on s.ethnicityID = e.id
join states st
on s.homeState = st.stateID
Notice the hierarchical representation that each table is directly associated under that of the student. Not all tables need to be one deeper than the last.
So, there are many sites out there, such as w3schools as suggested by Mark, but learn to dissect small pieces at a time: what are the bare minimum tables to get from point A to point Z, and draw the relationships. THEN, pare down based on the criteria you are looking for.
The correct answer would be:
SELECT yr, COUNT(title)
FROM movie m
JOIN casting c ON m.id=c.movieid JOIN actor a ON c.actorid=a.id
WHERE name='John Travolta'
GROUP BY yr
HAVING COUNT(title) > 2;
The answer you found (which seems to be a mistake on the sqlzoo site) is looking for any year that has a count equal to the year with the highest count.
I used table aliases in the query above to clear up how the tables are joined. Movie is joined to casting and casting is joined to actor.
The subquery that confuses you is listing each year and a count of movies for that year that star John Travolta. It's not needed if you're answering the question as written.
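The two queries only give the same result when no other year has more than 2 movies. Here is a small sketch with Python's sqlite3 module on an invented filmography (3 movies in 1994, 4 in 1998, 1 in 2000; not the sqlzoo data) where they disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
    CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER);
    INSERT INTO actor VALUES (1, 'John Travolta');
""")

# Hypothetical filmography: 3 movies in 1994, 4 in 1998, 1 in 2000.
years = [1994] * 3 + [1998] * 4 + [2000]
for movie_id, yr in enumerate(years, start=1):
    conn.execute("INSERT INTO movie VALUES (?, ?, ?)",
                 (movie_id, f"Movie {movie_id}", yr))
    conn.execute("INSERT INTO casting VALUES (?, 1)", (movie_id,))

BASE = """
    SELECT m.yr, COUNT(m.title)
    FROM movie m
    JOIN casting c ON m.id = c.movieid
    JOIN actor a ON c.actorid = a.id
    WHERE a.name = 'John Travolta'
    GROUP BY m.yr
"""

# The question as written: every year with more than 2 movies.
more_than_two = conn.execute(
    BASE + " HAVING COUNT(m.title) > 2 ORDER BY m.yr"
).fetchall()

# The answer found online: only the year(s) matching the highest count.
busiest_only = conn.execute(
    BASE + """ HAVING COUNT(m.title) = (
        SELECT MAX(t.n) FROM (
            SELECT COUNT(m2.title) AS n
            FROM movie m2
            JOIN casting c2 ON m2.id = c2.movieid
            JOIN actor a2 ON c2.actorid = a2.id
            WHERE a2.name = 'John Travolta'
            GROUP BY m2.yr) AS t)"""
).fetchall()

print(more_than_two)  # [(1994, 3), (1998, 4)]
print(busiest_only)   # [(1998, 4)]
```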
As for learning resources, make sure you have the basics down. Understand everything at http://w3schools.com/sql. Try searching for "sql joining multiple tables" in your favorite search engine when you're ready for more.
Above is my schema. What you can't see in tblPatientVisits is the foreign key from tblPatient, which is patientid.
tblPatient contains a distinct copy of each patient in the dataset as well as their gender. tblPatientVisits contains their demographic information: where they lived at time of admission and which hospital they went to. I chose to put that information into a separate table because it changes throughout the data (a person can move from one visit to the next and go to a different hospital).
I don't get any strange numbers with my queries until I add tblPatientVisits. There are just under one million claims in tblClaims, but when I add tblPatientVisits so I can check out where that person was from, it returns over a million rows. I think this is due to the fact that in tblPatientVisits the same patientID shows up more than once (because they had different admission/discharge dates).
For the life of me I can't see where this is incorrect design, nor do I know how to rectify it beyond doing one query with count(tblPatientVisits.patientID) = 1 and then a union with count(tblPatientVisits.patientID) > 1.
Any insight into this type of design, or into how I might more elegantly get the claimType from tblClaims with the correct number of rows when I associate a claim ID with a patientID?
EDIT: The biggest problem I'm having is the fact that if I include the admissionDate, dischargeDate or the patientState in the tblPatient table I can't use the patientID as a primary key.
It should be noted that tblClaims are NOT necessarily related to tblPatientVisits.admissionDate, tblPatientVisits.dischargeDate.
EDIT: sample queries to show that when tblPatientVisits is added, more rows are returned than claims
SELECT tblclaims.id, tblClaims.claimType
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID INNER JOIN
tblPatientVisits ON tblPatient.patientID = tblPatientVisits.patientID
more than one million query rows returned
SELECT tblClaims.id, tblPatient.patientID
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID
less than one million query rows returned
I think this is crying for a better design. I really think that a visit should be associated with a claim, and that a claim can only be associated with a single patient, so I think the design should be (and eliminating the needless tbl prefix, which is just clutter):
CREATE TABLE dbo.Patients
(
PatientID INT PRIMARY KEY
-- , ... other columns ...
);
CREATE TABLE dbo.Claims
(
ClaimID INT PRIMARY KEY,
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID)
-- , ... other columns ...
);
CREATE TABLE dbo.PatientVisits
(
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID),
ClaimID INT NULL FOREIGN KEY
REFERENCES dbo.Claims(ClaimID),
VisitDate DATE
-- , ... other columns ...
, PRIMARY KEY (PatientID, ClaimID, VisitDate) -- not convinced on this one:
-- a PRIMARY KEY column cannot be nullable, so this key conflicts with ClaimID INT NULL
);
There is some redundant information here, but it's not clear from your model whether a patient can have a visit that is not associated with a specific claim, or even whether you know that a visit belongs to a specific claim (this seems like crucial information given the type of query you're after).
In any case, given your current model, one query you might try is:
SELECT c.id, c.claimType
FROM dbo.tblClaims AS c
INNER JOIN dbo.tblPatientClaims AS pc
ON c.id = pc.id
INNER JOIN dbo.tblPatient AS p
ON pc.patientid = p.patientID
-- where exists tells SQL server you don't care how many
-- visits took place, as long as there was at least one:
WHERE EXISTS (SELECT 1 FROM dbo.tblPatientVisits AS pv
WHERE pv.patientID = p.patientID);
This will still return one row for every patient / claim combination, but it will no longer repeat that row for every visit the patient had. Again, it really feels like the design isn't right here. You should also get in the habit of using table aliases - they make your query much easier to read, especially if you insist on the messy tbl prefix. You should also always use the dbo (or whatever schema you use) prefix when creating and referencing objects.
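To show the row multiplication on something small, here is a sketch using Python's sqlite3 module. The three claims, two patients and four visits are invented, the table and column names follow your queries, and SQLite has no dbo schema so the prefix is dropped. The plain join repeats each claim once per visit, while the EXISTS version returns each claim once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblPatient (patientID INTEGER PRIMARY KEY);
    CREATE TABLE tblClaims (id INTEGER PRIMARY KEY, claimType TEXT);
    CREATE TABLE tblPatientClaims (id INTEGER, patientid INTEGER);
    CREATE TABLE tblPatientVisits (patientID INTEGER, admissionDate TEXT);
    INSERT INTO tblPatient VALUES (1), (2);
    INSERT INTO tblClaims VALUES (10, 'dental'), (11, 'vision'), (12, 'dental');
    INSERT INTO tblPatientClaims VALUES (10, 1), (11, 1), (12, 2);
    -- Patient 1 has three visits, patient 2 has one (hypothetical data).
    INSERT INTO tblPatientVisits VALUES (1, '2012-01-01'), (1, '2012-03-01'),
                                        (1, '2012-06-01'), (2, '2012-02-01');
""")

# Joining the one-to-many visits table repeats each claim once per visit.
join_rows = conn.execute("""
    SELECT c.id, c.claimType
    FROM tblClaims AS c
    INNER JOIN tblPatientClaims AS pc ON c.id = pc.id
    INNER JOIN tblPatient AS p ON pc.patientid = p.patientID
    INNER JOIN tblPatientVisits AS pv ON p.patientID = pv.patientID
""").fetchall()

# EXISTS only asks whether at least one visit exists, so nothing is repeated.
exists_rows = conn.execute("""
    SELECT c.id, c.claimType
    FROM tblClaims AS c
    INNER JOIN tblPatientClaims AS pc ON c.id = pc.id
    INNER JOIN tblPatient AS p ON pc.patientid = p.patientID
    WHERE EXISTS (SELECT 1 FROM tblPatientVisits AS pv
                  WHERE pv.patientID = p.patientID)
    ORDER BY c.id
""").fetchall()

print(len(join_rows))  # 7  (2 claims x 3 visits + 1 claim x 1 visit)
print(exists_rows)     # [(10, 'dental'), (11, 'vision'), (12, 'dental')]
```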
I'm not sure I understand the concept of a claim but I suspect you want to remove the link table between claims and patient and instead make the association between patient visit and a claim.
Would that work out better for you?
The database is quite simple. Below there is a part of a schema relevant to this question
ROUND (round_id, round_number)
TEAM (team_id, team_name)
MATCH (match_id, match_date, round_id)
OUTCOME (team_id, match_id, score)
I have a problem with query to retrieve data for all matches played. The simple query below gives of course two rows for every match played.
select *
from round r
inner join match m on m.round_id = r.round_id
inner join outcome o on o.match_id = m.match_id
inner join team t on t.team_id = o.team_id
How should I write a query to have the match data in one row?
Or maybe should I redesign the database - drop the OUTCOME table and modify the MATCH table to look like this:
MATCH (match_id, match_date, team_away, team_home, score_away, score_home)?
You can almost generate the suggested change from the original tables using a self join on the outcome table:
select o1.team_id team_id_1,
o2.team_id team_id_2,
o1.score score_1,
o2.score score_2,
o1.match_id match_id
from outcome o1
inner join outcome o2 on o1.match_id = o2.match_id and o1.team_id < o2.team_id
Of course, the home and away information is not possible to generate, so your suggested alternative approach might be better after all. Also, take note of the condition o1.team_id < o2.team_id, which gets rid of the redundant symmetric match data (actually it gets rid of the same outcome row being joined with itself as well, which can be seen as the more important aspect).
In any case, using this select as part of your join, you can generate one row per match.
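As a quick check of what the extra condition does, here is a minimal sketch with Python's sqlite3 module on two made-up matches. Without o1.team_id < o2.team_id the self join returns four rows per match (each outcome paired with itself and with the other, in both orders); with it, exactly one row per match remains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outcome (team_id INTEGER, match_id INTEGER, score INTEGER);
    -- Two hypothetical matches: teams 1 v 2 (2-0) and teams 3 v 1 (1-1).
    INSERT INTO outcome VALUES (1, 100, 2), (2, 100, 0),
                               (3, 101, 1), (1, 101, 1);
""")

# Self join on match_id only: every pairing, including each row with itself.
unfiltered = conn.execute("""
    SELECT o1.team_id, o2.team_id, o1.match_id
    FROM outcome o1
    INNER JOIN outcome o2 ON o1.match_id = o2.match_id
""").fetchall()

# Adding the ordering condition keeps a single pairing per match.
one_per_match = conn.execute("""
    SELECT o1.team_id AS team_id_1, o2.team_id AS team_id_2,
           o1.score AS score_1, o2.score AS score_2, o1.match_id
    FROM outcome o1
    INNER JOIN outcome o2
        ON o1.match_id = o2.match_id AND o1.team_id < o2.team_id
    ORDER BY o1.match_id
""").fetchall()

print(len(unfiltered))  # 8
print(one_per_match)    # [(1, 2, 2, 0, 100), (1, 3, 1, 1, 101)]
```

Note that in match 101 the lower team_id ends up in the first column regardless of who played at home, which is the home/away limitation mentioned above.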
You fetch 2 rows for every match played, but team_id and team_name are different in each:
- one for the home team
- one for the away team
So your query is correct.
Using the match table as you describe captures the logic of a game simply and naturally and additionally shows home and away teams which your initial model does not.
You might want to add the round id as a foreign key to round table and perhaps a flag to indicate a match abandoned situation.
Drop outcome. It shouldn't be a separate table, because you have exactly one outcome per match.
You may want to consider how to handle matches that are cancelled; perhaps the scores are NULL?