3 Table Join SQL Big Query - google-bigquery

I'm new to SQL joins so bear with me, what I'm trying to do is list out student test scores by week added up per student. The query I have is,
SELECT
agg.Week,
meta.Category,
agg.Student,
meta.Funding,
agg.Test_Results,
agg.Test_Points,
phy.Test_Results,
phy.Test_Points
FROM Agriculture agg
INNER JOIN metadata meta
ON meta.Student_ID = agg.Student_ID
INNER JOIN Physics phy
ON phy.Student_ID = agg.Student_ID
WHERE agg.Week BETWEEN '1' AND '5'
To me this looks correct, but for some reason what is happening is the Test_Results and Test_Points are duplicating themselves over and over and I'm not sure how to fix it. What I mean by duplicating is, since the points are the same 120 each week, a student(9) will score a 95, that 95 will repeat itself for that same student over and over again. I'm not sure if I need to have a subquery added to the INNER JOINS or not.
Tables are:
metadata --> (Category, Funding, Student_ID) (One to many comparison table)
Physics --> (Test_Results, Test_Points, Student_ID) (Many to many)
Agriculture --> (Week, Student, Test_Results, Test_Points, Student_ID) (Many to Many)
Again, I'm new to joins so if you need more information please let me know.

You join the table by using Student_ID. If there is more than one corresponding row in metadata for the same Student_ID from Agriculture the resultset will show duplicates.
The same goes for table Physics.
for example : You have 10 rows in table Agriculture for a Student_ID and 2 rows in metadata with the same Student_ID, you will get 20 rows if you join the tables.
You can indeed use subqueries to reduce the number of rows per Studen_ID in metadata and Physics and join on those subselects.

You can simply use DISTINCT keyword for agg.Student.

Related

SQL Server count() returns different results when using joins

I am new to SQL Server and I am not looking for a solution (but it may help others), rather, I would like to understand the behaviour / why I get two different results from the two pseudo queries below.
The reason I have joined two other tables is because I will need to count all items in Vehicle against the 'Date_Recorded' in 'Garage'. All recorded since 2010. So before I did this I wanted to be sure I was getting the same total count from the table 'Car' and also get the same result with the joins, before I added the 'isDate' on 'Garage' condition, that is when I noticed the difference in the results.
I would have thought the joins would have been ignored?
Hope someone can advise? Thanks in advance!
SELECT count(Car.CAR_ID) AS Car_ID
FROM Vehicle Car
INNER JOIN Road Rd
ON Car.CAR_ID = Rd.CAR_ID
JOIN Garage g
ON Rd.GARAGE_ID = g.GARAGE_ID
----------------------------------------------
Car_ID
----------------------------------------------
226923
SELECT count(Car.CAR_ID) AS Car_ID
FROM Vehicle Car
----------------------------------------------
Car_ID
----------------------------------------------
203417
INNER JOIN: Returns all rows when there is at least one match in BOTH tables.
LEFT JOIN: Return all rows from the left table, and the matched rows from the right table.
RIGHT JOIN: Return all rows from the right table, and the matched rows from the left table.
FULL JOIN: Return all rows when there is a match in ONE of the tables.
you are using inner join thats why you are getting the wrong result from the actual count() result.
if you want to get all the record from the left table the you have to use left join in this query then you'll get the result of count() same as the main table[left table].
You either have
multiple records in the Road table with the same Car_ID or
multiple records in the Road table with the same Garage_ID or
Both of the above
You may be able to run the following to get what you want (assuming there is always a match in the road and garage tables):
SELECT count(Distinct Car.CAR_ID) AS Car_ID
FROM Vehicle Car
INNER JOIN Road Rd
ON Car.CAR_ID = Rd.CAR_ID
JOIN Garage g
ON Rd.GARAGE_ID = g.GARAGE_ID

Ms-Access: counting from 2 tables

I have two tables in a Database
and
I need to retrieve the number of staff per manager in the following format
I've been trying to adapt an answer to another question
SELECT bankNo AS "Bank Number",
COUNT (*) AS "Total Branches"
FROM BankBranch
GROUP BY bankNo
As
SELECT COUNT (*) AS StaffCount ,
Employee.Name AS Name
FROM Employee, Stafflink
GROUP BY Name
As I look at the Group BY I'm thinking I should be grouping by The ManID in the Stafflink Table.
My output with this query looks like this
So it is counting correctly but as you can see it's far off the output I need to get.
Any advice would be appreciated.
You need to join the Employee and Stafflink tables. It appears that your FROM clause should look like this:
FROM Employee INNER JOIN StaffLink ON Employee.ID = StaffLink.ManID
You have to join the Eployee table twice to get the summary of employees under manager
select count(*) as StaffCount,Manager.Name
from Employee join Stafflink on employee.Id = StaffLink.EmpId
join Employee as Manager on StaffLink.ManId = Manager.Id
Group by Manager.Name
The answers that advise you on how to join are correct, assuming that you want to learn how to use SQL in MS Access. But there is a way to accomplish the same thing using the ACCESS GUI for designing queries, and this involves a shorter learning curve than learning SQL.
The key to using the GUI when more than one table is involved is to realize that you have to define the relationships between tables in the relationship manager. Once you do that, designing the query you are after is a piece of cake, just point and click.
The tricky thing in your case is that there are two relationships between the two tables. One relationship links EmpId to ID and the other links ManId to ID.
If, however, you want to learn SQL, then this shortcut will be a digression.
If you don't specify a join between the tables, a so called Cartesian product will be built, i.e., each record from one table will be paired with every record from the other table. If you have 7 records in one table and 10 in the other you will get 70 pairs (i.e. rows) before grouping. This explains why you are getting a count of 7 per manager name.
Besides joining the tables, I would suggest you to group on the manager id instead of the manager name. The manager id is known to be unique per manager, but not the name. This then requires you to either group on the name in addition, because the name is in the select list or to apply an aggregate function on the name. Each additional grouping slows down the query; therefore I prefer the aggregate function.
SELECT
COUNT(*) AS StaffCount,
FIRST(Manager.Name) AS ManagerName
FROM
Stafflink
INNER JOIN Employee AS Manager
ON StaffLink.ManId = Manager.Id
GROUP BY
StaffLink.ManId
I don't know if it makes a performance difference, but I prefer to group on StaffLink.ManId than on Employee.Id, since StaffLink is the main table here and Employee is just used as lookup table in this query.

SQL Counting and Joining

I'm taking a database course this semester, and we're learning SQL. I understand most simple queries, but I'm having some difficulty using the count aggregate function.
I'm supposed to relate an advertisement number to a property number to a branch number so that I can tally up the amount of advertisements by branch number and compute their cost. I set up what I think are two appropriate new views, but I'm clueless as to what to write for the select statement. Am I approaching this the correct way? I have a feeling I'm over complicating this bigtime...
with ad_prop(ad_no, property_no, overseen_by) as
(select a.ad_no, a.property_no, p.overseen_by
from advertisement as a, property as p
where a.property_no = p.property_no)
with prop_branch(property_no, overseen_by, allocated_to) as
(select p.property_no, p.overseen_by, s.allocated_to
from property as p, staff as s
where p.overseen_by = s.staff_no)
select distinct pb.allocated_to as branch_no, count( ??? ) * 100 as ad_cost
from prop_branch as pb, ad_prop as ap
where ap.property_no = pb.property_no
group by branch_no;
Any insight would be greatly appreciated!
You could simplify it like this:
advertisement
- ad_no
- property_no
property
- property_no
- overseen_by
staff
- staff_no
- allocated_to
SELECT s.allocated_to AS branch, COUNT(*) as num_ads, COUNT(*)*100 as ad_cost
FROM advertisement AS a
INNER JOIN property AS p ON a.property_no = p.property_no
INNER JOIN staff AS s ON p.overseen_by = s.staff_no
GROUP BY s.allocated_to;
Update: changed above to match your schema needs
You can condense your WITH clauses into a single statement. Then, the piece I think you are missing is that columns referenced in the column definition have to be aggregated if they aren't included in the GROUP BY clause. So you GROUP BY your distinct column then apply your aggregation and math in your column definitions.
SELECT
s.allocated_to AS branch_no
,COUNT(a.ad_no) AS ad_count
,(ad_count * 100) AS ad_cost
...
GROUP BY s.allocated_to
i can tell you that you are making it way too complicated. It should be a select statement with a couple of joins. You should re-read the chapter on joins or take a look at the following link
http://www.sql-tutorial.net/SQL-JOIN.asp
A join allows you to "combine" the data from two tables based on a common key between the two tables (you can chain more tables together with more joins). Once you have this "joined" table, you can pretend that it is really one table (aliases are used to indicate where that column came from). You understand how aggregates work on a single table right?
I'd prefer not to give you the answer so that you can actually learn :)

Need help in understanding JOINS in SQL

I was asked the below SQL question in an interview. Kindly explain how it works and what join it is.
Q: There are two tables: table emp contains 10 rows and table department contains 12 rows.
Select * from emp,department;
What is the result and what join it is?
It would return the Cartesian Product of the two tables, meaning that every combination of emp and department would be included in the result.
I believe that the next question would be:
Blockquote
How do you show the correct department for each employee?
That is, show only the combination of emp and department where the employee belongs to the department.
This can be done by:
SELECT * FROM emp LEFT JOIN department ON emp.department_id=department.id;
Assuming that emp has a field called department_id, and department has a matching id field (This is quite standard in these type of questions).
The LEFT JOIN means that all items from the left side (emp) will be included, and each employee will be matched with the corresponding department. If no matching department is found, the resulting fields from departments will remain empty. Note that exactly 10 rows will be returned.
To show only the employees with valid department IDs, use JOIN instead of LEFT JOIN. This query will return 0 to 10 rows, depending on the number of matching department ids.
The join you specified is a cross join. It will produce one row for each combination of records in the tables being joined.
I'll let you do the math from there.
This will do a cross join I believe, returning 120 rows. One row for each pair-wise combination of rows from each of the two tables.
All-in-all a fairly useless join most of the time.
You will get all rows from both tables with each row joined together.
This is known as a Cartesian join and is very bad.
You will get a total of 120 rows.
This is also the old implied syntax (18 yeasr out of date) and accidental cross joins are a common problem with this syntax. One should never use it. Explict joins are a better choice. I would have also mentioned this in an interview and explained why. I also would not have taken the job if they actually used crappy syntax like this because it's very use shows me the database is very likely to be poorly designed.

Fetch data with single and fast SQL query

I have the following data:
ExamEntry Student_ID Grade
11 1 80
12 2 70
13 3 20
14 3 68
15 4 75
I want to find all the students that passed an exam. In this case, if there are few exams
that one student attended to, I need to find the last result.
So, in this case I'd get that all students passed.
Can I find it with one fast query? I do it this way:
Find the list of entries by
select max(ExamEntry) from data group by Student_ID
Find the results:
select ExamEntry from data where ExamEntry in ( ).
But this is VERY slow - I get around 1000 entries, and this 2 step process takes 10 seconds.
Is there a better way?
Thanks.
If your query is very slow at with 1000 records in your table, there is something wrong.
For a modern Database system a table containing, 1000 entries is considered very very small.
Most likely, you did not provid a (primary) key for your table?
Assuming that a student would pass if at least on of the grades is above the minimum needed, the appropriate query would be:
SELECT
Student_ID
, MAX(Grade) AS maxGrade
FROM table_name
GROUP BY Student_ID
HAVING maxGrade > MINIMUM_GRADE_NEEDED
If you really need the latest grade to be above the minimum:
SELECT
Student_ID
, Grade
FROM table_name
WHERE ExamEntry IN (
SELECT
MAX(ExamEntry)
FROM table_name
GROUP BY Student_ID
)
HAVING Grade > MINIMUM_GRADE_NEEDED
SELECT student_id, MAX(ExamEntry)
FROM data
WHERE Grade > :threshold
GROUP BY student_id
Like this?
I'll make some assumptions that you have a student table and test table and the table you are showing us is the test_result table... (if you don't have a similar structure, you should revisit your schema)
select s.id, s.name, t.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
left outer join test t on r.test_id = t.id
group by s.id, s.name, t.name
All the fields with id in it should be indexed.
If you really only have a single test (type) in your domain... then the query would be
select s.id, s.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
group by s.id, s.name
I've used the hints given here, and here the query I found that runs almost 3 orders faster than my first one (.03 sec instead of 10 sec):
SELECT ExamEntry, Student_ID, Grade from data,
( SELECT max(ExamEntry) as ExId GROUP BY Student_ID) as newdata
WHERE `data`.`ExamEntry`=`newdata`.`ExId` AND Grade > 60;
Thanks All!
As mentioned, indexing is a powerful tool for speeding up queries. The order of the index, however, is fundamentally important.
An index in order of (ExamEntry) then (Student_ID) then (Grade) would be next to useless for finding exams where the student passed.
An index in the opposite order would fit perfectly, if all you wanted was to find what exams had been passed. This would enable the query engine to quickly identify rows for exams that have been passed, and just process those.
In MS SQL Server this can be done with...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Grade],
[Student_ID],
[ExamEntry]
)
ON [PRIMARY]
(I recommend reading more about indexs to see what other options there are, such as ClusterdIndexes, etc, etc)
With that index, the following query would be able to ignore the 'failed' exams very quickly, and just display the students who ever passed the exam...
(This assumes that if you ever get over 60, you're counted as a pass, even if you subsequently take the exam again and get 27.)
SELECT
Student_ID
FROM
[results]
WHERE
Grade >= 60
GROUP BY
Student_ID
Should you definitely need the most recent value, then you need to change the order of the index back to something like...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Student_ID],
[ExamEntry],
[Grade]
)
ON [PRIMARY]
This is because the first thing we are interested in is the most recent ExamEntry for any given student. Which can be achieved using the following query...
SELECT
*
FROM
[results]
WHERE
[results].ExamEntry = (
SELECT
MAX([student_results].ExamEntry)
FROM
[results] AS [student_results]
WHERE
[student_results].Student_ID = [results].student_id
)
AND [results].Grade > 60
Having a sub query like this can appear slow, especially since it appears to be executed for every row in [results].
This, however, is not the case...
- Both main and sub query reference the same table
- The query engine scans through the Index for every unique Student_ID
- The sub query is executed, for that Student_ID
- The query engine is already in that part of the index
- So a new Index Lookup is not needed
EDIT:
A comment was made that at 1000 records indexs are not relevant. It should be noted that the question states that there are 1000 records Returned, not that the table contains 1000 records. For a basic query to take as long as stated, I'd wager there are many more than 1000 records in the table. Maybe this can be clarified?
EDIT:
I have just investigated 3 queries, with 999 records in each (3 exam results for each of 333 students)
Method 1: WHERE a.ExamEntry = (SELECT MAX(b.ExamEntry) FROM results [a] WHERE a.Student_ID = b.student_id)
Method 2: WHERE a.ExamEntry IN (SELECT MAX(ExamEntry) FROM resuls GROUP BY Student_ID)
Method 3: USING an INNER JOIN instead of the IN clause
The following times were found:
Method QueryCost(No Index) QueryCost(WithIndex)
1 23% 9%
2 38% 46%
3 38% 46%
So, Query 1 is faster regardless of indexes, but indexes also definitely make method 1 substantially faster.
The reason for this is that indexes allow lookups, where otherwise you need a scan. The difference between a linear law and a square law.
Thanks for the answers!!
I think that Dems is probably closest to what I need, but I will elaborate a bit on the issue.
Only the latest grade counts. If the student had passed first time, attended again and failed, he failed in total. He/She could've attended 3 or 4 exams, but still only the last one counts.
I use MySQL server. The problem I experience in both Linux and Windows installations.
My data set is around 2K entries now and grows with the speed of ~ 1K per new exam.
The query for specific exam also returns ~ 1K entries, when ~ 1K would be the number of students attended (received by SELECT DISTINCT STUDENT_ID from results;), then almost all have passed and some have failed.
I perform the following query in my code:
SELECT ExamEntry, Student_ID from exams WHERE ExamEntry in ( SELECT MAX(ExamEntry) from exams GROUP BY Student_ID). As subquery returns about ~1K entries, it appears that main query scans them in loop, making all the query run for a very long time and with 50% server load (100% on Windows).
I feel that there is a better way :-), just can't find it yet.
select examentry,student_id,grade
from data
where examentry in
(select max(examentry)
from data
where grade > 60
group by student_id)
don't use
where grade > 60
but
where grade between 60 and 100
that should go faster