Inner Join versus Exists() while avoiding duplicate rows - sql

This is a complicated question so bear with me as I set up the scenario:
Say we have a simplified table setup like so:
table 1(employee): {
employee_id, --primary key
first_name,
last_name,
days_of_employment
}
with data:
employee_id first_name last_name days_of_employment
111 Jack Stevens 543
222 Clarice Bobber 323
333 Roy Cook 736
444 Fred Roberts 1000
...
table 2(teams): {
team_code, --primary key
description
}
with data:
team_code description
ERA Enrollment Records Assoc.
RR Rolling Runners
FR French Revolution
...
table 3(employees_teams):{
employee_id, --primary key
team_code --primary key
}
with data:
employee_id team_code
111 RR
111 FR
222 FR
222 ERA
333 FR
...
I hope the tables and their purpose are clear. Here is my scenario from the requirements: "I want the average days of employment of employees on the Rolling Runners and Enrollment Records Assoc. teams." I know two ways to write this query and both seem to work well enough, but what I really want to know is which one is faster for the Oracle database to process. Keep in mind that both queries are written the way they are to avoid producing duplicate rows, which would screw up the average calculation:
Query 1:
SELECT AVG(e.days_of_employment) avg_days_of_employment
FROM employee e,
(
SELECT DISTINCT employee_id
FROM employees_teams
WHERE team_code IN ('ERA','RR')) available_employees
WHERE e.employee_id = available_employees.employee_id
Query 2:
SELECT AVG(e.days_of_employment) avg_days_of_employment
FROM employee e
WHERE EXISTS(
SELECT 1
FROM employees_teams et
WHERE et.team_code IN ('ERA','RR')
AND et.employee_id = e.employee_id)
It is possible that, with the sample data I provided, this situation does not make sense to begin with, but I would still like to know which query is 'better' to use.

I would say go with the EXISTS approach, since you do not really need anything from available_employees other than a check for existence.
Having said that, it also depends on your data and on how your database's query optimizer handles each form. I would suggest looking at the query plan for each approach to see which one is less expensive.
Check these links as well http://dotnetvj.blogspot.com/2009/07/why-we-should-use-exists-instead-of.html Can an INNER JOIN offer better performance than EXISTS
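To make the duplicate-row problem concrete, here is a minimal sketch that runs both queries from the question next to a naive join. It uses Python's built-in sqlite3 module instead of Oracle, which is an assumption made only so the example is self-contained. The extra row (111, 'ERA') is hypothetical: with the sample data exactly as posted, no employee matches both teams, so the naive join would not show the problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    days_of_employment INTEGER
);
CREATE TABLE employees_teams (
    employee_id INTEGER,
    team_code TEXT,
    PRIMARY KEY (employee_id, team_code)
);
INSERT INTO employee VALUES
    (111, 'Jack', 'Stevens', 543),
    (222, 'Clarice', 'Bobber', 323),
    (333, 'Roy', 'Cook', 736),
    (444, 'Fred', 'Roberts', 1000);
INSERT INTO employees_teams VALUES
    (111, 'RR'), (111, 'FR'), (222, 'FR'), (222, 'ERA'), (333, 'FR'),
    (111, 'ERA');  -- hypothetical row: employee 111 now matches both teams
""")

# Naive join: employee 111 is counted once per matching team.
NAIVE_JOIN = """
SELECT AVG(e.days_of_employment)
FROM employee e
JOIN employees_teams et ON et.employee_id = e.employee_id
WHERE et.team_code IN ('ERA', 'RR')
"""

# Query 1 from the question: join against a DISTINCT derived table.
QUERY_1 = """
SELECT AVG(e.days_of_employment)
FROM employee e,
     (SELECT DISTINCT employee_id
      FROM employees_teams
      WHERE team_code IN ('ERA', 'RR')) available_employees
WHERE e.employee_id = available_employees.employee_id
"""

# Query 2 from the question: EXISTS.
QUERY_2 = """
SELECT AVG(e.days_of_employment)
FROM employee e
WHERE EXISTS (SELECT 1
              FROM employees_teams et
              WHERE et.team_code IN ('ERA', 'RR')
                AND et.employee_id = e.employee_id)
"""

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

print("naive join:", scalar(NAIVE_JOIN))  # skewed by the duplicate row
print("query 1   :", scalar(QUERY_1))
print("query 2   :", scalar(QUERY_2))

# SQLite's way of showing a plan; the output format is engine-specific.
for label, sql in (("query 1", QUERY_1), ("query 2", QUERY_2)):
    print(label, "plan:")
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("  ", row)
```

Both queries from the question agree on the average, while the naive join counts employee 111 twice. Which form is cheaper on Oracle can only be judged from Oracle's own plan output for the real tables; the SQLite plan printed here says nothing about that.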

Related

Subquery in FROM clause

Looking around in the (now-discontinued) documentation, I found this example:
Subquery in FROM clause
A subquery in a FROM clause acts similarly to a temporary table that is generated during the execution of a query and lost afterwards.
SELECT Managers.Id, Employees.Salary
FROM (
SELECT Id
FROM Employees
WHERE ManagerId IS NULL
) AS Managers
JOIN Employees ON Managers.Id = Employees.Id
(Excerpted from Subqueries - Subquery in FROM clause. The original author was Phrancis. Attribution details can be found on the contributor page. The source is licenced under CC BY-SA 3.0 and may be found in the Documentation archive. Reference topic ID: 1030 and example ID: 3327.)
My question is:
why using an extra ManagerId. An Id column is already in
Employees table,
why have the extra ManagerId to be null for a manager (ok it wants to be a joke).
My opinion:
despite the upvotes, something is wrong with this example,
Tables with example data would be nice, to see at a glance how it works: one table with the starting data, one table for the temporary SELECT and one table for the result set.
Edit: Thanks to all contributors for their answers!
@Alex K.: That is my point of view too: "it is not something one would actually use". But people who want to learn SQL might think that it is good practice, because it is in the documentation here.
@Nebi: Thanks for pointing out that one would write it more simply to get the same result.
@Unnikrishnan R: "showcase how the sub query works" means, in my eyes, not only that it is fully functional but also that it makes sense. If I can get the same thing more simply, why do it the error-prone hard way?
@me: should have titled it "let's discuss sql documentation" or something like that ;)
Let us consider a situation where the Employees table holds all employees, including their managers: each employee has an Id, and there is also a column for the manager's Id (which can be null). This may have been the point of view of whoever wrote that SQL query.
For Example,
+----+-------+--------+-----------+
| Id | Name | Salary | ManagerId |
+----+-------+--------+-----------+
| 1 | Joe | 70000 | 3 |
| 2 | Henry | 80000 | 4 |
| 3 | Sam | 60000 | NULL |
| 4 | Max | 90000 | NULL |
+----+-------+--------+-----------+
why have the extra ManagerId to be null for a manager --
to select the employees who have no manager of their own, i.e. the managers.
It is just an example of how to use subqueries.
To your questions:
why using an extra ManagerId. An Id column is already in Employees
table
First of all, ManagerId and Id are different columns of the table Employees, so there is a difference between them. But you might be referring to the Id of the subquery Managers and the Id of the joined table Employees.
In that case you need to state which Id you are using, otherwise you get an error for an ambiguous column. In this example you have to specify either the subquery's Id (Managers.Id) or the Id of the joined table (Employees.Id). Which one you choose does not matter, because the INNER JOIN is on the Id.
why have the extra ManagerId to be null for a manager (ok it wants
to be a joke).
This is how the subquery selects the employees that have no manager of their own, i.e. the managers. You are right that this could be done more easily or in another form. For instance:
SELECT Id, Salary
FROM Employees
WHERE ManagerId IS NULL
This probably gets the same result as in the original. But the example is not about that, it is about the structure of a subquery.
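A minimal sketch to check that claim, using the four-row sample table shown earlier in this thread. It runs on Python's built-in sqlite3 module, which is an assumption for the sake of a runnable example; the documentation excerpt does not name a database engine. It prints the three stages the question asked to see: the rows the subquery selects, the result of the documented query, and the result of the simpler query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (
    Id INTEGER PRIMARY KEY,
    Name TEXT,
    Salary INTEGER,
    ManagerId INTEGER
);
INSERT INTO Employees VALUES
    (1, 'Joe',   70000, 3),
    (2, 'Henry', 80000, 4),
    (3, 'Sam',   60000, NULL),
    (4, 'Max',   90000, NULL);
""")

# Stage 1: what the subquery in the FROM clause produces on its own.
managers = conn.execute(
    "SELECT Id FROM Employees WHERE ManagerId IS NULL ORDER BY Id"
).fetchall()

# Stage 2: the documented query, joining that derived table back to Employees.
with_subquery = conn.execute("""
    SELECT Managers.Id, Employees.Salary
    FROM (
        SELECT Id
        FROM Employees
        WHERE ManagerId IS NULL
    ) AS Managers
    JOIN Employees ON Managers.Id = Employees.Id
    ORDER BY Managers.Id
""").fetchall()

# Stage 3: the simpler query without a subquery.
simple = conn.execute("""
    SELECT Id, Salary
    FROM Employees
    WHERE ManagerId IS NULL
    ORDER BY Id
""").fetchall()

print("subquery rows :", managers)
print("with subquery :", with_subquery)
print("simple query  :", simple)
```

On this data both forms return the rows for Sam and Max. They are equivalent here because Id is the primary key, so joining back on Id can neither add nor drop rows.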
why using an extra ManagerId. An Id column is already in Employees table
Suppose you have an employee table and you also want to keep the manager information in the same table. Then, apart from the Id column, you need another column to hold the manager's Id.
why have the extra ManagerId to be null for a manager (ok it wants to be a joke).
The query is just there to showcase how a subquery works. In this case the subquery retrieves the managers from the Employees table (ManagerId is null) and then joins those Ids with the Employees table in the outer query to get the salary of each manager.

Excessive Case Statement Help - SQL Server

I'm supposed to answer this for class, and it's tricky (for me)
Write a SELECT query to output the name of all employees with the name of their supervisor. If the employee has no supervisor, the supervisor name column should contain the text 'No Supervisor'.
The primary key field in my table is employeeid; each employee has a first and last name, and each employee also has a supervisorid.
The table for this is shown below (sorry for the layout):
employeeid lastname firstname salary supervisorid
1 Stolz Ted 25000 NULL
2 Boswell Nancy 23000 1
3 Hargett Vincent 22000 1
4 Weekley Kevin 22000 3
5 Metts Geraldine 22000 2
6 McBride Jeffrey 21000 2
7 Xiong Jay 20000 3
I was wondering how I could go about this statement without using the case statement to apply each of the 7 students with:
when concat(firstname,' ',lastname)='Nancy Boswell' then 'Ted Stolz'
In larger tables this would simply be a HUGE statement, is there a better way to do it?
Thanks!
EDIT:
I've now tried this:
SELECT
EMP1.employeeid as 'employee',
EMP2.supervisorid as 'manager'
FROM
employee EMP1
LEFT OUTER JOIN
employee EMP2
ON
emp1.employeeid = emp2.supervisorid;
However, I am seeing duplicate rows: for some reason employees 2 and 3 appear twice, meaning there are 9 rows showing instead of 7.
Also, I need to display their names, not their IDs. Does that mean I need to join the join that I've already done to the employee name? How would I do this?
Thanks for the feedback guys!
You need to join the table with itself based on the supervisorid. This might seem strange if you are new to SQL, but it is very common. The join tells SQL to attach the supervisor's row to the employee's row via the supervisor's primary key.
SELECT
*
FROM
EMPLOYEES EMP1
LEFT OUTER JOIN
EMPLOYEES EMP2
ON
-- make link between tables here
Note that the above query is deliberately not complete; it's an indication. The LEFT OUTER JOIN makes employees without a supervisor get null values for the supervisor columns; with an inner join the whole record would be left out.
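For readers who want to see the whole thing run, here is one possible completion of the self-join, sketched with Python's built-in sqlite3 module rather than SQL Server (an assumption, made only so the example is self-contained). The aliases emp and sup and the use of COALESCE for the 'No Supervisor' text are choices made for this sketch, not something the answer above prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employeeid INTEGER PRIMARY KEY,
    lastname TEXT,
    firstname TEXT,
    salary INTEGER,
    supervisorid INTEGER REFERENCES employee (employeeid)
);
INSERT INTO employee VALUES
    (1, 'Stolz',   'Ted',       25000, NULL),
    (2, 'Boswell', 'Nancy',     23000, 1),
    (3, 'Hargett', 'Vincent',   22000, 1),
    (4, 'Weekley', 'Kevin',     22000, 3),
    (5, 'Metts',   'Geraldine', 22000, 2),
    (6, 'McBride', 'Jeffrey',   21000, 2),
    (7, 'Xiong',   'Jay',       20000, 3);
""")

# Each employee row (emp) is matched to the row of its supervisor (sup).
# The join runs from emp.supervisorid to sup.employeeid, so every employee
# appears exactly once. When there is no supervisor, the sup columns are
# NULL, the concatenation is NULL, and COALESCE supplies the fallback text.
rows = conn.execute("""
    SELECT emp.firstname || ' ' || emp.lastname AS employee,
           COALESCE(sup.firstname || ' ' || sup.lastname,
                    'No Supervisor') AS supervisor
    FROM employee AS emp
    LEFT OUTER JOIN employee AS sup
        ON emp.supervisorid = sup.employeeid
    ORDER BY emp.employeeid
""").fetchall()

for employee_name, supervisor_name in rows:
    print(f"{employee_name:18} {supervisor_name}")
```

The direction of the ON condition matters. Joining on emp.employeeid = sup.supervisorid instead pairs each employee with their subordinates, which is why some employees appear more than once in the attempt shown in the question. In SQL Server the string concatenation would be written differently, and because CONCAT there treats NULL as an empty string, the fallback would probably have to test sup.employeeid for NULL instead.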

Searching for parent records whose children meet predicate

Let's say I have a parent and child database, and the child keeps a sort of running transcript of things that happen to the parent:
create table patient (
fullname text not null,
admission_number integer primary key
);
create table history (
note text not null,
doctor text not null,
admission_number integer references patient (admission_number)
);
(Just an example, I'm not doing a medical application).
history is going to have many records for the same admission_number:
admission_number doctor note
------------------------------------
3456 Johnson Took blood pressure
7828 Johnson EKG 120, temp 99.2
3456 Nichols Drew blood
9001 Damien Discharged patient
7828 Damien Discharged patient with Rx
So, my question is, how would I build a query that let me do and/or/not searches of the note field for patient records, like, for example, if I wanted to find every patient whose history contained "blood pressure" and "discharged".
Right now I've been doing a select on history that groups by admission_number, combining all the notes with a group_concat(note) and doing my search in the having clause, thus:
select * from history
group by admission_number
having group_concat(note) like '%blood pressure%'
and group_concat(note) like '%discharged%';
This works, but it makes certain elaborations very complicated. For example, I'd like to be able to ask things like "every patient whose history contains 'blood pressure' and whose history with Dr. Damien says 'discharged'", and building qualifications like this on top of my basic query is very messy.
Is there any better way of phrasing my basic query?
This is similar to your EXISTS method, but computes the subqueries differently.
This might or might not be faster, depending on how your tables and indexes are organized, and on the queries' selectivity.
SELECT *
FROM patient
WHERE admission_number IN (SELECT admission_number
FROM history
WHERE note LIKE '%blood pressure%')
AND admission_number IN (SELECT admission_number
FROM history
WHERE note LIKE '%discharged%'
AND doctor = 'Damien')
Alternatively, you could use a compound subquery (computing the intersection once is likely to be faster than executing IN twice for every record):
SELECT *
FROM patient
WHERE admission_number IN (SELECT admission_number
FROM history
WHERE note LIKE '%blood pressure%'
INTERSECT
SELECT admission_number
FROM history
WHERE note LIKE '%discharged%'
AND doctor = 'Damien')
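A small sketch that runs both forms side by side with Python's built-in sqlite3 module (the question's group_concat suggests SQLite, but that is an assumption). The patient names are taken from another answer in this thread. The last history row is hypothetical: with the history exactly as posted, no patient has both a "blood pressure" note and a discharge note by Damien, so both queries would return nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    fullname TEXT NOT NULL,
    admission_number INTEGER PRIMARY KEY
);
CREATE TABLE history (
    note TEXT NOT NULL,
    doctor TEXT NOT NULL,
    admission_number INTEGER REFERENCES patient (admission_number)
);
INSERT INTO patient VALUES ('Bob', 3456), ('Mary', 7828), ('Lucy', 9001);
INSERT INTO history (admission_number, doctor, note) VALUES
    (3456, 'Johnson', 'Took blood pressure'),
    (7828, 'Johnson', 'EKG 120, temp 99.2'),
    (3456, 'Nichols', 'Drew blood'),
    (9001, 'Damien',  'Discharged patient'),
    (7828, 'Damien',  'Discharged patient with Rx'),
    (3456, 'Damien',  'Discharged patient');  -- hypothetical extra row
""")

# Form 1: two separate IN subqueries combined with AND.
two_in = conn.execute("""
    SELECT fullname, admission_number
    FROM patient
    WHERE admission_number IN (SELECT admission_number
                               FROM history
                               WHERE note LIKE '%blood pressure%')
      AND admission_number IN (SELECT admission_number
                               FROM history
                               WHERE note LIKE '%discharged%'
                                 AND doctor = 'Damien')
    ORDER BY admission_number
""").fetchall()

# Form 2: one IN over the INTERSECT of the two subqueries.
intersect = conn.execute("""
    SELECT fullname, admission_number
    FROM patient
    WHERE admission_number IN (SELECT admission_number
                               FROM history
                               WHERE note LIKE '%blood pressure%'
                               INTERSECT
                               SELECT admission_number
                               FROM history
                               WHERE note LIKE '%discharged%'
                                 AND doctor = 'Damien')
    ORDER BY admission_number
""").fetchall()

print(two_in)
print(intersect)
```

Both forms return the same patient here. Whether one is faster than the other depends on the data and indexes, as noted above, and this tiny table cannot show that.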
Why don't you use a JOIN operation?
e.g.
considering, the patient table contains the following data:
INSERT INTO patient VALUES('Bob', 3456);
INSERT INTO patient VALUES('Mary', 7828);
INSERT INTO patient VALUES('Lucy', 9001);
Running the query:
SELECT DISTINCT p.fullname, p.admission_number FROM patient p
INNER JOIN history h ON p.admission_number = h.admission_number
WHERE note LIKE '%blood pressure%' OR note LIKE '%Discharged%';
gets you:
fullname = Bob
admission_number = 3456
fullname = Lucy
admission_number = 9001
fullname = Mary
admission_number = 7828
And running the following query:
SELECT DISTINCT p.fullname, p.admission_number FROM patient p
INNER JOIN history h ON p.admission_number = h.admission_number
WHERE note LIKE '%blood pressure%';
gets you:
fullname = Bob
admission_number = 3456
I have something -- using EXISTS to construct these is a bit cleaner:
select * from patient where
exists (
select 1 from history where
history.admission_number = patient.admission_number
AND
history.note LIKE '%blood pressure%'
)
AND
exists (
select 1 from history where
history.admission_number = patient.admission_number
AND
history.note LIKE '%discharged%'
AND
history.doctor = 'Damien'
);
That's much better, now I can construct really fine-grained predicates.
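Since the goal is to build and/or/not searches, here is a rough sketch of how such EXISTS clauses could be composed programmatically. It uses Python's built-in sqlite3 module; the helper names has_note, all_of, any_of, not_ and search are invented for this illustration and are not part of any library. The last history row is hypothetical, added so that the AND example has a match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    fullname TEXT NOT NULL,
    admission_number INTEGER PRIMARY KEY
);
CREATE TABLE history (
    note TEXT NOT NULL,
    doctor TEXT NOT NULL,
    admission_number INTEGER REFERENCES patient (admission_number)
);
INSERT INTO patient VALUES ('Bob', 3456), ('Mary', 7828), ('Lucy', 9001);
INSERT INTO history (admission_number, doctor, note) VALUES
    (3456, 'Johnson', 'Took blood pressure'),
    (7828, 'Johnson', 'EKG 120, temp 99.2'),
    (3456, 'Nichols', 'Drew blood'),
    (9001, 'Damien',  'Discharged patient'),
    (7828, 'Damien',  'Discharged patient with Rx'),
    (3456, 'Damien',  'Discharged patient');  -- hypothetical extra row
""")

# A predicate is a pair: (SQL fragment, list of bound parameters).

def has_note(text, doctor=None):
    """Patient has a history row containing text, optionally by one doctor."""
    sql = ("EXISTS (SELECT 1 FROM history h"
           " WHERE h.admission_number = p.admission_number"
           " AND h.note LIKE ?")
    params = ["%" + text + "%"]
    if doctor is not None:
        sql += " AND h.doctor = ?"
        params.append(doctor)
    return sql + ")", params

def _combine(keyword, preds):
    """Join predicate fragments with AND or OR and concatenate parameters."""
    sql = "(" + (" " + keyword + " ").join(s for s, _ in preds) + ")"
    params = [value for _, ps in preds for value in ps]
    return sql, params

def all_of(*preds):
    return _combine("AND", preds)

def any_of(*preds):
    return _combine("OR", preds)

def not_(pred):
    return "NOT " + pred[0], pred[1]

def search(pred):
    """Return the patients matching the predicate, ordered by admission number."""
    sql, params = pred
    query = ("SELECT fullname, admission_number FROM patient p WHERE "
             + sql + " ORDER BY admission_number")
    return conn.execute(query, params).fetchall()

both = search(all_of(has_note("blood pressure"),
                     has_note("discharged", doctor="Damien")))
either = search(any_of(has_note("blood pressure"), has_note("EKG")))
no_blood = search(not_(has_note("blood")))
print(both, either, no_blood)
```

Because every EXISTS is an independent boolean test on the patient row, the fragments nest freely, and the search text is always passed as a bound parameter rather than pasted into the SQL string.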

UPDATE query that fixes orphaned records

I have an Access database with two tables that are related by PK/FK. Unfortunately, the tables have allowed duplicate/redundant records, which has made the database a bit screwy. I am trying to figure out a SQL statement that will fix the problem.
To better explain the problem and goal, I have created example tables to use as reference:
alt text http://img38.imageshack.us/img38/9243/514201074110am.png
You'll notice there are two tables, a Student table and a TestScore table where StudentID is the PK/FK.
The Student table contains duplicate records for students John, Sally, Tommy, and Suzy. In other words, the Johns with StudentIDs 1 and 5 are the same person, the Sallys with StudentIDs 2 and 6 are the same person, and so on.
The TestScore table relates test scores with a student.
Ignoring how/why the Student table allowed duplicates, etc - The goal I'm trying to accomplish is to update the TestScore table so that it replaces the StudentID's that have been disabled with the corresponding enabled StudentID. So, all StudentID's = 1 (John) will be updated to 5; all StudentID's = 2 (Sally) will be updated to 6, and so on. Here's the resultant TestScore table that I'm shooting for (Notice there is no longer any reference to the disabled StudentID's 1-4):
alt text http://img163.imageshack.us/img163/1954/514201091121am.png
Can you think of a query (compatible with MS Access's JET Engine) that can accomplish this goal? Or, maybe, you can offer some tips/perspectives that will point me in the right direction.
Thanks.
The only way to do this is through a series of queries and temporary tables.
First, I would create the following Make Table query that you would use to create a mapping of the bad StudentID to correct StudentID.
Select S1.StudentId As NewStudentId, S2.StudentId As OldStudentId
Into zzStudentMap
From Student As S1
Inner Join Student As S2
On S2.Name = S1.Name
Where S1.Disabled = False
And S2.StudentId <> S1.StudentId
And S2.Disabled = True
Next, you would use that temporary table to update the TestScore table with the correct StudentID.
Update TestScore
Inner Join zzStudentMap
On zzStudentMap.OldStudentId = TestScore.StudentId
Set StudentId = zzStudentMap.NewStudentId
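The two steps can be tried out end to end with the sketch below. It uses Python's built-in sqlite3 module instead of Access/JET, so the syntax differs: the Make Table query becomes CREATE TABLE ... AS SELECT, and the UPDATE with a join becomes an UPDATE with a correlated subquery. The table contents are assumptions reconstructed from the question's description (IDs 1 to 4 disabled, 5 to 8 enabled), since the original screenshots are not available; the scores and the TestScoreID column are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (
    StudentID INTEGER PRIMARY KEY,
    Name TEXT,
    Disabled INTEGER          -- 1 = disabled duplicate, 0 = record to keep
);
CREATE TABLE TestScore (
    TestScoreID INTEGER PRIMARY KEY,
    StudentID INTEGER,
    Score INTEGER
);
INSERT INTO Student VALUES
    (1, 'John', 1), (2, 'Sally', 1), (3, 'Tommy', 1), (4, 'Suzy', 1),
    (5, 'John', 0), (6, 'Sally', 0), (7, 'Tommy', 0), (8, 'Suzy', 0);
INSERT INTO TestScore VALUES
    (1, 1, 90), (2, 5, 85), (3, 2, 70), (4, 6, 75), (5, 3, 60), (6, 8, 95);
""")

# Step 1: build the old-id to new-id mapping by matching on Name.
conn.execute("""
    CREATE TABLE zzStudentMap AS
    SELECT S1.StudentID AS NewStudentId, S2.StudentID AS OldStudentId
    FROM Student AS S1
    INNER JOIN Student AS S2 ON S2.Name = S1.Name
    WHERE S1.Disabled = 0
      AND S2.StudentID <> S1.StudentID
      AND S2.Disabled = 1
""")

# Step 2: repoint every test score that still references a disabled id.
conn.execute("""
    UPDATE TestScore
    SET StudentID = (SELECT NewStudentId
                     FROM zzStudentMap
                     WHERE OldStudentId = TestScore.StudentID)
    WHERE StudentID IN (SELECT OldStudentId FROM zzStudentMap)
""")
conn.commit()

student_ids = [row[0] for row in conn.execute(
    "SELECT StudentID FROM TestScore ORDER BY TestScoreID")]
print(student_ids)
```

Matching on Name alone only works if names are unique among the enabled students; two different enabled students with the same name would produce more than one mapping row per old id.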
The most common technique to identify duplicates in a table is to group by the fields that represent duplicate records:
ID FIRST_NAME LAST_NAME
1 Brian Smith
3 George Smith
25 Brian Smith
In this case we want to remove one of the Brian Smith Records, or in your case, update the ID field so they both have the value of 25 or 1 (completely arbitrary which one to use).
SELECT min(id) AS id, first_name, last_name
FROM example
GROUP BY first_name, last_name
Using min on ID will return:
ID FIRST_NAME LAST_NAME
1 Brian Smith
3 George Smith
If you use max you would get
ID FIRST_NAME LAST_NAME
25 Brian Smith
3 George Smith
I usually use this technique to delete the duplicates, not update them:
DELETE FROM example
WHERE ID NOT IN (SELECT MAX (ID)
FROM example
GROUP BY first_name, last_name)
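A quick way to try this technique on the three sample rows, sketched with Python's built-in sqlite3 module (an assumption; the original question is about Access, where the same statements should behave alike but are not tested here).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE example (
    id INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name TEXT
);
INSERT INTO example VALUES
    (1, 'Brian', 'Smith'),
    (3, 'George', 'Smith'),
    (25, 'Brian', 'Smith');
""")

# One surviving id per (first_name, last_name) group, lowest id wins.
keep_min = conn.execute("""
    SELECT MIN(id) AS id, first_name, last_name
    FROM example
    GROUP BY first_name, last_name
    ORDER BY id
""").fetchall()

# Delete every row that is not the highest id of its group.
conn.execute("""
    DELETE FROM example
    WHERE id NOT IN (SELECT MAX(id)
                     FROM example
                     GROUP BY first_name, last_name)
""")
conn.commit()

remaining = conn.execute(
    "SELECT id, first_name, last_name FROM example ORDER BY id"
).fetchall()

print(keep_min)
print(remaining)
```

Note that deleting the duplicates is only safe once no other table still references the ids being removed, which is exactly the situation the question is trying to reach by updating TestScore first.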

How to manage "groups" in the database?

I've asked this question here, but I don't think I got my point across.
Let's say I have the following tables (all PK are IDENTITY fields):
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, etc.)
Borrowers (BorrowerId(PK), PersonId, LoanId)
Let's say Mr. Smith got 2 loans in his own name, 3 joint loans with his wife, and 1 joint loan with his mistress. For the purposes of the application I want to GROUP people, so that I can easily single out the loans that Mr. Smith took out jointly with his wife.
To accomplish that I added BorrowerGroup table, now I have the following (all PK are IDENTITY fields):
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, BorrowerGroupId, etc.)
BorrowerGroup(GroupId (PK))
Borrowers (BorrowerId(PK), GroupId, PersonId)
Now Mr. Smith is in 3 groups (himself, him and his wife, him and his mistress) and I can easily lookup his activity in any of those groups.
The problems with new design:
The only way to generate a new BorrowerGroup is by inserting MAX(GroupId)+1 with IDENTITY_INSERT ON; this just doesn't feel right. Also, the notion of a table with 1 column is kind of weird.
I'm a firm believer in surrogate keys, and would like to stick to that design if possible.
This application does not care about individuals, the GROUP is treated as an individual
Is there a better way to group people for the purpose of this application?
You could just remove the BorrowerGroup table, since it carries no information. This information is already present via the loans that people share. I am assuming you have a PeopleLoans table.
People          Loans           PeopleLoans
-----------     ------------    -----------
1 Smith          6 S1   60      1  6
2 Wife           7 S2   60      1  7
3 Mistress       8 S+W1 74      1  8
                 9 S+W2 74      1  9
                10 S+W3 74      1 10
                11 S+M1 89      1 11
                                2  8
                                2  9
                                2 10
                                3 11
So your BorrowerGroups are actually almost the Loans - 6 and 7 with Smith only, 8 to 10 with Smith and Wife, and 11 with Smith and Mistress. So there is no need for BorrowerGroups in the first place, because they are identical to Loans grouped by the involved People.
But it might be quite hard to retrieve this information efficiently, so you could think about adding a GroupId directly to Loans. Ignoring the second column of Loans (it is there just for readability), the third column should represent your groups. The values are redundant, so you have to be careful if you change them.
If you find a good way to derive a unique GroupId from the ids of the involved people, you could make it a computed column. If a string would be okay as a group id, you could just order the ids of the people and concatenate them with a separator.
Group 60 with Smith only would get id '1', group 74 would become '1.2', and group 89 would become '1.3'. Not that smart, but unique and easy to compute.
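The derived string id can be sketched in a few lines. The example below loads the PeopleLoans rows from the table above into an in-memory SQLite database (an assumption; the question targets SQL Server) and builds the key in Python. The column names PersonId and LoanId and the variable names are chosen for this illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PeopleLoans (
    PersonId INTEGER,
    LoanId INTEGER,
    PRIMARY KEY (PersonId, LoanId)
);
INSERT INTO PeopleLoans VALUES
    (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11),
    (2, 8), (2, 9), (2, 10),
    (3, 11);
""")

# Collect the people on each loan, already sorted by person id.
members = defaultdict(list)
for loan_id, person_id in conn.execute(
        "SELECT LoanId, PersonId FROM PeopleLoans ORDER BY LoanId, PersonId"):
    members[loan_id].append(person_id)

# The group key is the sorted person ids joined with a separator.
group_key = {loan_id: ".".join(str(p) for p in person_ids)
             for loan_id, person_ids in members.items()}

for loan_id, key in sorted(group_key.items()):
    print(loan_id, key)
```

Sorting before joining is what makes the key independent of insertion order, and the separator keeps ids such as 1 and 12 from running together.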
Use the original schema:
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, etc.)
Borrowers (BorrowerId(PK), PersonId, LoanId)
Then just query for the data you need (your example, finding the loans that husband and wife are both on):
SELECT
l.*
FROM Borrowers b1
INNER JOIN Borrowers b2 ON b1.LoanId=b2.LoanId
INNER JOIN Loans l ON b1.LoanId=l.LoanId
WHERE b1.PersonId=@HusbandID
AND b2.PersonId=@WifeID
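Here is a minimal runnable version of that self-join on the original three-table schema, using Python's built-in sqlite3 module instead of SQL Server (an assumption). The people and loans follow the Smith example from the question; the loan amounts are made up, and the two person ids are passed as bound parameters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (PersonId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Loans (LoanId INTEGER PRIMARY KEY, Amount INTEGER);
CREATE TABLE Borrowers (
    BorrowerId INTEGER PRIMARY KEY,
    PersonId INTEGER REFERENCES People (PersonId),
    LoanId INTEGER REFERENCES Loans (LoanId)
);
INSERT INTO People VALUES (1, 'Smith'), (2, 'Wife'), (3, 'Mistress');
-- Amounts are placeholders; the question does not give any.
INSERT INTO Loans VALUES
    (6, 1000), (7, 2000), (8, 3000), (9, 4000), (10, 5000), (11, 6000);
INSERT INTO Borrowers (PersonId, LoanId) VALUES
    (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11),
    (2, 8), (2, 9), (2, 10),
    (3, 11);
""")

def joint_loans(first_person_id, second_person_id):
    """Return the ids of loans that list both people as borrowers."""
    rows = conn.execute("""
        SELECT l.LoanId
        FROM Borrowers b1
        INNER JOIN Borrowers b2 ON b1.LoanId = b2.LoanId
        INNER JOIN Loans l ON b1.LoanId = l.LoanId
        WHERE b1.PersonId = :first
          AND b2.PersonId = :second
        ORDER BY l.LoanId
    """, {"first": first_person_id, "second": second_person_id})
    return [row[0] for row in rows]

with_wife = joint_loans(1, 2)
with_mistress = joint_loans(1, 3)
print(with_wife, with_mistress)
```

One thing to keep in mind: this finds loans on which both people appear, so a loan with a third co-borrower would also be returned. If "exactly these two people" is required, an additional check on the number of borrowers per loan would be needed.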
The design of the database seems OK. Why do you have to use MAX(GroupId)+1 when you create a new group? Can't you just create the row and then use SCOPE_IDENTITY() to return the new ID?
e.g.
INSERT INTO BorrowerGroup DEFAULT VALUES
SELECT SCOPE_IDENTITY()
(See this other question)
(edit to SQL courtesy of this question)
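The same idea, letting the database hand out the key and then reading it back, can be sketched outside SQL Server as well. The example below uses Python's built-in sqlite3 module, where the cursor's lastrowid attribute plays the role that SCOPE_IDENTITY() plays in the statement above; that substitution is an assumption of this sketch, and new_group is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A single-column table whose only job is to generate group ids.
conn.execute(
    "CREATE TABLE BorrowerGroup (GroupId INTEGER PRIMARY KEY AUTOINCREMENT)"
)

def new_group():
    """Insert an empty row and return the id the database generated for it."""
    cursor = conn.execute("INSERT INTO BorrowerGroup DEFAULT VALUES")
    return cursor.lastrowid

first_group = new_group()
second_group = new_group()
conn.commit()
print(first_group, second_group)
```

The generated id comes from the connection that performed the insert, so there is no need to compute MAX(GroupId)+1 and no window in which two sessions could pick the same number.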
I would do something more like this:
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, BorrowerGroupId, etc.)
BorrowerGroup(BorrowerGroupId (PK))
PersonBelongsToBorrowerGroup(BorrowerGroupId (PK), PersonId (PK))
I got rid of the Borrowers table. Just store the info in the BorrowerGroup table. That's my preference.
The consensus seems to be to omit the BorrowerGroup table, and I have to agree. Using MAX(GroupId)+1 has all sorts of ACID/transaction issues, which is the main reason why IDENTITY fields exist.
That said, the SQL that KM provided looks good. There are any number of ways to get the same results: joins, sub-selects and so on. The real issue there is knowing the dataset. Given the explanation you provided, the datasets are going to be very small, which also supports removing the BorrowerGroup table.
I would have a group table and then a groupmembers(borrowers) table to accomplish the many-to-many relationship between loans and people. This allows the tracking of data on the group other than just a list of members (I believe someone else made this suggestion?).
CREATE TABLE LoanGroup
(
ID int NOT NULL
, Group_Name char(50) NULL
, Date_Started datetime NULL
, Primary_ContactID int NULL
, Group_Type varchar(25)
)