Joining tables without keys causes incorrect results - sql

I have two tables, one master and one lookup. Neither has any keys. The structure of the tables is below.
Table 1:
first name  last name  role        location  Compensation Level  state
john        smith      Manager     LA        A                   CA
john        smith      Manager     BOS       B                   MA
super       smither    developer   LA        B                   CA
tina        taylor     supervisor  SFO       A                   CA
tina        taylor     supervisor  BOS       B                   MA
Table 2:
first name  last name  role        dept
john        smith      manager     finance
john        smith      manager     hr
super       smither    developer   PA
tina        taylor     supervisor  HR
tina        taylor     supervisor  hr
Understandably, joining the two tables on first name, last name and role to get the dept gives incorrect results, because other fields are needed to identify a truly unique record.
But given a structure like this, is there any way I can join the two tables to get the dept?
Using an inline subquery is not an option because of the way the final procedure is designed, among other factors.
Any thoughts on this?
Expected output:
first name  last name  role        location  Compensation  state  dept
john        smith      Manager     LA        A             CA     finance
john        smith      Manager     BOS       B             MA     hr
super       smither    developer   LA        B             CA     PA
tina        taylor     supervisor  SFO       A             CA     HR
tina        taylor     supervisor  BOS       B             MA     HR

Here's an example that gives deterministic, but arbitrary, results. It is simply based on determining an "ordered position" within each table, so that a choice can be made and that choice is the same every time the query is executed. There is no way to know that the choice is the correct one.
WITH
  sorted_t1 AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY first_name, last_name, role
                           ORDER BY compensation_level, location, state) AS discriminator
  FROM
    t1
)
,
  sorted_t2 AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY first_name, last_name, role
                           ORDER BY dept) AS discriminator
  FROM
    t2
)
SELECT
  *
FROM
  sorted_t1 t1
FULL OUTER JOIN
  sorted_t2 t2
    ON  t1.first_name    = t2.first_name
    AND t1.last_name     = t2.last_name
    AND t1.role          = t2.role
    AND t1.discriminator = t2.discriminator
NOTES:
This assumes a case-insensitive collation sequence. Otherwise the john smith rows will never join (as 'Manager' wouldn't match 'manager').
Similarly, the two tina taylor rows in table 2 are different ('hr' vs 'HR'), but if the collation sequence is case-insensitive it doesn't matter which gets joined to which, as there is no material difference between the rows.
It's also worth noting that in the example above there is no real reason to assume that the 'John Smith' from LA is in finance. The query simply forces that association because of the ORDER BY chosen in the ROW_NUMBER(). This means that when using this technique you really should order by other fields, ones that mean something in relation to each other.
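The discriminator technique above can be tried end to end on the sample data from the question. This is a minimal sketch using Python's built-in sqlite3 module rather than the asker's database, so several details are assumptions: lower() stands in for a case-insensitive collation, LEFT JOIN replaces FULL OUTER JOIN (older SQLite versions lack it), and the column names use underscores. It needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (first_name TEXT, last_name TEXT, role TEXT,
                 location TEXT, compensation_level TEXT, state TEXT);
CREATE TABLE t2 (first_name TEXT, last_name TEXT, role TEXT, dept TEXT);
INSERT INTO t1 VALUES
  ('john',  'smith',   'Manager',    'LA',  'A', 'CA'),
  ('john',  'smith',   'Manager',    'BOS', 'B', 'MA'),
  ('super', 'smither', 'developer',  'LA',  'B', 'CA'),
  ('tina',  'taylor',  'supervisor', 'SFO', 'A', 'CA'),
  ('tina',  'taylor',  'supervisor', 'BOS', 'B', 'MA');
INSERT INTO t2 VALUES
  ('john',  'smith',   'manager',    'finance'),
  ('john',  'smith',   'manager',    'hr'),
  ('super', 'smither', 'developer',  'PA'),
  ('tina',  'taylor',  'supervisor', 'HR'),
  ('tina',  'taylor',  'supervisor', 'hr');
""")

# Number the rows inside each (first_name, last_name, role) group in both
# tables, then pair row 1 with row 1, row 2 with row 2, and so on.
query = """
WITH sorted_t1 AS (
    SELECT lower(first_name) AS first_name, lower(last_name) AS last_name,
           lower(role) AS role, location,
           ROW_NUMBER() OVER (
               PARTITION BY lower(first_name), lower(last_name), lower(role)
               ORDER BY compensation_level, location, state) AS discriminator
    FROM t1
),
sorted_t2 AS (
    SELECT lower(first_name) AS first_name, lower(last_name) AS last_name,
           lower(role) AS role, lower(dept) AS dept,
           ROW_NUMBER() OVER (
               PARTITION BY lower(first_name), lower(last_name), lower(role)
               ORDER BY lower(dept)) AS discriminator
    FROM t2
)
SELECT a.first_name, a.last_name, a.location, b.dept
FROM sorted_t1 AS a
LEFT JOIN sorted_t2 AS b
    ON  a.first_name    = b.first_name
    AND a.last_name     = b.last_name
    AND a.role          = b.role
    AND a.discriminator = b.discriminator
ORDER BY a.first_name, a.last_name, a.discriminator
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The pairing happens to match the expected output here only because of the chosen ORDER BY columns; a different ordering would pair the rows differently.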

Related

Query rows and include rows with columns reversed

I'm trying to query a table. I want the results to include the FROM and TO columns, but then also include rows with these two values reversed. And then I want to eliminate all duplicates. (A duplicate is the same two cities in the same order.)
For example, given this data.
Trips
FROM TO
-------------------- --------------------
West Jordan Taylorsville
Salt Lake City Ogden
West Jordan Taylorsville
Sandy South Jordan
Taylorsville West Jordan
I would want the following results.
West Jordan Taylorsville
Taylorsville West Jordan
Salt Lake City Ogden
Ogden Salt Lake City
Sandy South Jordan
South Jordan Sandy
I want to do this using C# and Entity Framework, but I could use raw SQL if I need to.
Is it possible to do this in a query, or do I need to manually perform some of this logic?
Not sure if I'm following, but doesn't just a simple union work for your sample?
select [From], [To]
from some_table
union
select [To], [From]
from some_table
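To check the UNION idea against the sample data, here is a small sketch using Python's built-in sqlite3 module (an assumption; the question targets C# and Entity Framework). The table and column names follow the question, and the column names are quoted because FROM and TO are reserved words. UNION, unlike UNION ALL, removes duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Trips ("From" TEXT, "To" TEXT)')
conn.executemany(
    "INSERT INTO Trips VALUES (?, ?)",
    [
        ("West Jordan", "Taylorsville"),
        ("Salt Lake City", "Ogden"),
        ("West Jordan", "Taylorsville"),
        ("Sandy", "South Jordan"),
        ("Taylorsville", "West Jordan"),
    ],
)

# Each trip plus its reverse; UNION drops repeated (From, To) pairs.
pairs = conn.execute(
    'SELECT "From", "To" FROM Trips '
    "UNION "
    'SELECT "To", "From" FROM Trips'
).fetchall()

for origin, destination in pairs:
    print(origin, "->", destination)
```

The row order of a UNION is not guaranteed, so add an ORDER BY if the order matters.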
I believe the first subquery should handle the first part of your question. The WHERE ID NOT IN will handle the second part of your question.
SELECT *
FROM
(
    SELECT *
    FROM Trips
    WHERE ID IN (
        SELECT t1.ID
        FROM Trips AS t1
        INNER JOIN Trips AS t2
            ON t2.[To] = t1.[From] AND t2.[From] = t1.[To]
    )
) AS reversed_trips
WHERE ID NOT IN
(
    SELECT MIN(ID)
    FROM Trips
    GROUP BY [From], [To]
)
I am assuming there is more to the table than just those fields. Usually you have a field (primary key) to uniquely identify the row. I am using ID for that field, replace with whatever your table is using.

Combining two mostly identical rows in SQL

I have a table that contains data like below:
Name  ID    Dept
Joe   1001  Accounting
Joe   1001  Marketing
Mary  1003  Administration
Mary  1009  Accounting
Each person is uniquely identified by the combination of Name and ID. I want the resulting table to combine rows that have the same Name and ID and put their depts together, separated by commas, in alphabetical order. So the result would be:
Name  ID    Dept
Joe   1001  Accounting, Marketing
Mary  1003  Administration
Mary  1009  Accounting
I am not sure how to approach this. So far I have this, which doesn't really do what I need:
SELECT Name, ID, COUNT(*)
FROM employees
GROUP BY Name, ID
I know COUNT(*) is irrelevant here, but I am not sure what to do. Any help is appreciated! By the way, I am using PostgreSQL and I am new to the language.
PostgreSQL has an aggregate function for string concatenation, string_agg; see the PostgreSQL documentation on aggregate functions. Try the following:
SELECT Name, ID, string_agg(Dept, ', ' ORDER BY Dept ASC) AS Departments
FROM employees
GROUP BY Name, ID
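For readers new to aggregates, the grouping that the query performs can be sketched in plain Python. This only illustrates the result the query is expected to produce for the sample rows; it is not how PostgreSQL implements string_agg, and the variable names are made up for the example.

```python
from collections import defaultdict

# Sample rows from the question: (Name, ID, Dept)
employees = [
    ("Joe", 1001, "Marketing"),
    ("Joe", 1001, "Accounting"),
    ("Mary", 1003, "Administration"),
    ("Mary", 1009, "Accounting"),
]

# GROUP BY Name, ID: collect the departments of each (Name, ID) pair.
depts_by_person = defaultdict(list)
for name, emp_id, dept in employees:
    depts_by_person[(name, emp_id)].append(dept)

# string_agg(Dept, ', ' ORDER BY Dept): sort, then join with ", ".
combined = [
    (name, emp_id, ", ".join(sorted(depts)))
    for (name, emp_id), depts in sorted(depts_by_person.items())
]

for row in combined:
    print(row)
```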

What happens if one query contains duplicate joins?

I have an application filter that can add the same join to the generated SQL more than once, like this:
select * from articles
inner join users on articles.users_id = users.id
inner join users on articles.users_id = users.id
where users.name like '%xxx%'
My question is whether a database can handle these duplicates. What happens in the database when it receives this query? Should I remove the duplicate from the generated SQL, or can I leave it as is?
This is a self join.
A self join is a regular join, but the table is joined with itself.
Example
SELECT
    A.Id,
    A.FullName,
    A.ManagerId,
    B.FullName as ManagerName
FROM Employees A
JOIN Employees B
    ON A.ManagerId = B.Id
A and B are different table aliases for the same table.
The self join, as its name implies, joins a table to itself. To use a self join, the table must contain a column (call it X) that acts as the primary key and a different column (call it Y) that stores values that can be matched up with the values in Column X. The values of Columns X and Y do not have to be the same for any given row, and the value in Column Y may even be null.
Let’s take a look at an example. Consider the table Employees:
Id  FullName       Salary  ManagerId
1   John Smith     10000   3
2   Jane Anderson  12000   3
3   Tom Lanon      15000   4
4   Anne Connor    20000   NULL
5   Jeremy York    9000    1
Each employee has his/her own Id, which is our “Column X.” For a given employee (i.e., row), the column ManagerId contains the Id of his or her manager; this is our “Column Y.” If we trace the employee-manager pairs in this table using these columns:
The manager of the employee John Smith is the employee with Id 3, i.e., Tom Lanon.
The manager of the employee Jane Anderson is the employee with Id 3, i.e., Tom Lanon.
The manager of the employee Tom Lanon is the employee with Id 4, i.e., Anne Connor.
The employee Anne Connor does not have a manager; her ManagerId is null.
The manager of the employee Jeremy York is the employee with Id 1, i.e., John Smith.
This type of table structure is very common in hierarchies. Now, to show the name of the manager for each employee in the same row, we can run the following query:
SELECT
    employee.Id,
    employee.FullName,
    employee.ManagerId,
    manager.FullName as ManagerName
FROM Employees employee
JOIN Employees manager
    ON employee.ManagerId = manager.Id
which returns the following result:
Id  FullName       ManagerId  ManagerName
1   John Smith     3          Tom Lanon
2   Jane Anderson  3          Tom Lanon
3   Tom Lanon      4          Anne Connor
5   Jeremy York    1          John Smith
The query selects the columns Id, FullName, and ManagerId from the table aliased employee. It also selects the FullName column of the table aliased manager and designates this column as ManagerName. As a result, every employee who has a manager is output along with his/her manager’s ID and name.
In this query, the Employees table is joined with itself and has two different roles:
Role 1: It stores the employee data (alias employee).
Role 2: It stores the manager data (alias manager).
By doing so, we are essentially considering the two copies of the Employees table as if they are two distinct tables, one for the employees and another for the managers.
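The self join above can be run as a quick check. This is a sketch using Python's built-in sqlite3 module (an assumption; the answer does not name a database), loaded with the Employees rows from the example. The ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "Id INTEGER PRIMARY KEY, FullName TEXT, Salary INTEGER, ManagerId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Smith", 10000, 3),
        (2, "Jane Anderson", 12000, 3),
        (3, "Tom Lanon", 15000, 4),
        (4, "Anne Connor", 20000, None),  # no manager
        (5, "Jeremy York", 9000, 1),
    ],
)

# The same table appears twice under two aliases: once as the employee,
# once as that employee's manager.
rows = conn.execute(
    """
    SELECT employee.Id, employee.FullName, employee.ManagerId,
           manager.FullName AS ManagerName
    FROM Employees AS employee
    JOIN Employees AS manager
        ON employee.ManagerId = manager.Id
    ORDER BY employee.Id
    """
).fetchall()

for row in rows:
    print(row)
```

Anne Connor is missing from the result because her ManagerId is null and an inner join keeps only rows that find a match.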
You can find more about the concept of the self join in our article "An illustrated guide to the SQL self join".

Excessive Case Statement Help - SQL Server

I'm supposed to answer this for class, and it's tricky (for me)
Write a SELECT query to output the name of all employees with the name of their supervisor. If the employee has no supervisor, the supervisor name column should contain the text 'No Supervisor'.
The primary key field in my table is employeeid. Each employee has a name, and each employee also has a supervisorid.
The table for this is shown below (sorry for the layout):
employeeid lastname firstname salary supervisorid
1 Stolz Ted 25000 NULL
2 Boswell Nancy 23000 1
3 Hargett Vincent 22000 1
4 Weekley Kevin 22000 3
5 Metts Geraldine 22000 2
6 McBride Jeffrey 21000 2
7 Xiong Jay 20000 3
I was wondering how I could write this without using a CASE statement with a branch for each of the 7 employees, like this:
when concat(firstname,' ',lastname)='Nancy Boswell' then 'Ted Stolz'
In larger tables this would be a HUGE statement. Is there a better way to do it?
Thanks!
EDIT:
I've now tried this:
SELECT
    EMP1.employeeid as 'employee',
    EMP2.supervisorid as 'manager'
FROM
    employee EMP1
LEFT OUTER JOIN
    employee EMP2
ON
    emp1.employeeid = emp2.supervisorid;
However, I am seeing duplicate rows: for some reason employees 2 and 3 appear twice, so there are 9 rows showing instead of 7.
Also, I need to display their names, not their IDs. Does that mean I need to join the result of this join back to the employee table to get the names? How would I do this?
Thanks for the feedback guys!
You need to join the table to itself based on the supervisorid. This might seem strange if you are new to SQL, but it is very common. In effect, you tell SQL to attach the supervisor's row to the employee's row by matching on the supervisor's primary key.
SELECT
    *
FROM
    EMPLOYEES EMP1
LEFT OUTER JOIN
    EMPLOYEES EMP2
ON
    -- make link between tables here
Note that the above query is deliberately incomplete; it is only an outline. The LEFT OUTER JOIN gives employees without a supervisor NULL values in the supervisor columns; with an inner join, those employees would be left out entirely.
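One way the outline could be completed is sketched below with Python's built-in sqlite3 module and the sample rows from the question. This is an assumption-laden illustration, not the answerer's own query: SQLite's || operator concatenates the names (SQL Server would use CONCAT or +), COALESCE supplies the 'No Supervisor' text, and the aliases emp and sup are made up for the example. Note the direction of the join condition: the employee's supervisorid is matched to the supervisor's employeeid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employeeid INTEGER PRIMARY KEY, lastname TEXT, "
    "firstname TEXT, salary INTEGER, supervisorid INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Stolz", "Ted", 25000, None),
        (2, "Boswell", "Nancy", 23000, 1),
        (3, "Hargett", "Vincent", 22000, 1),
        (4, "Weekley", "Kevin", 22000, 3),
        (5, "Metts", "Geraldine", 22000, 2),
        (6, "McBride", "Jeffrey", 21000, 2),
        (7, "Xiong", "Jay", 20000, 3),
    ],
)

# LEFT OUTER JOIN keeps employees without a supervisor; for them the
# concatenated supervisor name is NULL, which COALESCE replaces.
rows = conn.execute(
    """
    SELECT emp.firstname || ' ' || emp.lastname AS employee,
           COALESCE(sup.firstname || ' ' || sup.lastname,
                    'No Supervisor') AS supervisor
    FROM employee AS emp
    LEFT OUTER JOIN employee AS sup
        ON emp.supervisorid = sup.employeeid
    ORDER BY emp.employeeid
    """
).fetchall()

for row in rows:
    print(row)
```

Because each employee has at most one supervisor, this direction of the join returns exactly one row per employee.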

UPDATE query that fixes orphaned records

I have an Access database with two tables that are related by PK/FK. Unfortunately, the tables have allowed duplicate/redundant records, which has made the database a bit screwy. I am trying to figure out a SQL statement that will fix the problem.
To better explain the problem and goal, I have created example tables to use as reference:
[Image: http://img38.imageshack.us/img38/9243/514201074110am.png]
You'll notice there are two tables, a Student table and a TestScore table where StudentID is the PK/FK.
The Student table contains duplicate records for students John, Sally, Tommy, and Suzy. In other words, the Johns with StudentIDs 1 and 5 are the same person, the Sallys with StudentIDs 2 and 6 are the same person, and so on.
The TestScore table relates test scores with a student.
Ignoring how/why the Student table allowed duplicates, the goal I'm trying to accomplish is to update the TestScore table so that it replaces the StudentIDs that have been disabled with the corresponding enabled StudentID. So, all StudentIDs = 1 (John) will be updated to 5; all StudentIDs = 2 (Sally) will be updated to 6, and so on. Here's the resultant TestScore table that I'm shooting for (notice there is no longer any reference to the disabled StudentIDs 1-4):
[Image: http://img163.imageshack.us/img163/1954/514201091121am.png]
Can you think of a query (compatible with MS Access's JET Engine) that can accomplish this goal? Or, maybe, you can offer some tips/perspectives that will point me in the right direction.
Thanks.
One way to do this is through a series of queries and a temporary table.
First, I would create the following Make Table query to build a mapping from each bad StudentID to the correct StudentID.
Select S1.StudentId As NewStudentId, S2.StudentId As OldStudentId
Into zzStudentMap
From Student As S1
    Inner Join Student As S2
        On S2.Name = S1.Name
Where S1.Disabled = False
    And S2.StudentId <> S1.StudentId
    And S2.Disabled = True
Next, you would use that temporary table to update the TestScore table with the correct StudentID.
Update TestScore
    Inner Join zzStudentMap
        On zzStudentMap.OldStudentId = TestScore.StudentId
Set StudentId = zzStudentMap.NewStudentId
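The two steps can be tried outside Access with a small sketch using Python's built-in sqlite3 module. Several things here are assumptions: the original images are not available, so the Student rows for Tommy (3 and 7) and Suzy (4 and 8) and all TestScore rows, including the Score column, are hypothetical; and SQLite has no UPDATE ... INNER JOIN, so the update uses a correlated subquery instead of the Access syntax shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StudentId INTEGER PRIMARY KEY, Name TEXT, Disabled INTEGER);
CREATE TABLE TestScore (StudentId INTEGER, Score INTEGER);
-- Hypothetical data: IDs 1-4 are the disabled duplicates of IDs 5-8.
INSERT INTO Student VALUES
  (1, 'John', 1), (2, 'Sally', 1), (3, 'Tommy', 1), (4, 'Suzy', 1),
  (5, 'John', 0), (6, 'Sally', 0), (7, 'Tommy', 0), (8, 'Suzy', 0);
INSERT INTO TestScore VALUES (1, 90), (2, 85), (5, 70), (3, 60), (8, 95);

-- Step 1: map each disabled StudentId to the enabled one with the same name.
CREATE TABLE zzStudentMap AS
SELECT S1.StudentId AS NewStudentId, S2.StudentId AS OldStudentId
FROM Student AS S1
JOIN Student AS S2 ON S2.Name = S1.Name
WHERE S1.Disabled = 0
  AND S2.Disabled = 1
  AND S2.StudentId <> S1.StudentId;

-- Step 2: repoint the test scores (correlated subquery instead of a join).
UPDATE TestScore
SET StudentId = (SELECT NewStudentId
                 FROM zzStudentMap
                 WHERE OldStudentId = TestScore.StudentId)
WHERE StudentId IN (SELECT OldStudentId FROM zzStudentMap);
""")

scores = conn.execute(
    "SELECT StudentId, Score FROM TestScore ORDER BY rowid"
).fetchall()
print(scores)
```

The mapping assumes that a name identifies exactly one enabled student; if two different students shared a name, the map would contain ambiguous rows.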
The most common technique to identify duplicates in a table is to group by the fields that represent duplicate records:
ID  FIRST_NAME  LAST_NAME
1   Brian       Smith
3   George      Smith
25  Brian       Smith
In this case we want to remove one of the Brian Smith records, or in your case, update the ID field so they both have the value of 25 or 1 (completely arbitrary which one to use).
SELECT min(id) AS id, first_name, last_name
FROM example
GROUP BY first_name, last_name
Using min on ID will return:
ID  FIRST_NAME  LAST_NAME
1   Brian       Smith
3   George      Smith
If you use max you would get
ID  FIRST_NAME  LAST_NAME
25  Brian       Smith
3   George      Smith
I usually use this technique to delete the duplicates, not update them:
DELETE FROM example
WHERE ID NOT IN (SELECT MAX(ID)
                 FROM example
                 GROUP BY first_name, last_name)
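Both steps, finding one ID per duplicate group and then deleting the rest, can be checked with a short sketch. It uses Python's built-in sqlite3 module (an assumption, since the question is about Access) and the three example rows shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO example VALUES (?, ?, ?)",
    [(1, "Brian", "Smith"), (3, "George", "Smith"), (25, "Brian", "Smith")],
)

# One row per (first_name, last_name) group, keeping the smallest id.
min_ids = conn.execute(
    """
    SELECT MIN(id) AS id, first_name, last_name
    FROM example
    GROUP BY first_name, last_name
    ORDER BY id
    """
).fetchall()

# Delete every row that is not the largest id of its group.
conn.execute(
    """
    DELETE FROM example
    WHERE id NOT IN (SELECT MAX(id)
                     FROM example
                     GROUP BY first_name, last_name)
    """
)
remaining = conn.execute(
    "SELECT id, first_name, last_name FROM example ORDER BY id"
).fetchall()

print(min_ids)
print(remaining)
```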