SQL: how to compare a table to itself

Let's say I have a table with this information in it:
- Tom BLDG200
- Kevin BLDG200
- Mary BLDG340
I want to find everyone who shares a building with someone else, so the query should print Tom and Kevin. Because Mary is by herself, she shouldn't be printed. I have been using an INNER JOIN to join the table to itself on the building, but because I am comparing the table to itself, every row also matches itself, so the join succeeds even when only one person is in the building. In my case it prints Mary even though I don't want it to. How can I make it print only when two or more people share the same building?

Here is an efficient way to solve this query:
select t.*
from table t
where exists (select 1
from table t2
where t2.name <> t.name and t2.building = t.building
) ;
This can take advantage of an index on (building, name). Note that table is a placeholder here; it is a reserved word, so substitute your actual table name.
Most databases offer window/analytic functions, which are another efficient approach:
select name, building
from (select t.*, count(*) over (partition by building) as cnt
from table t
) t
where cnt > 1;
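To try both queries on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The table name people is a made-up stand-in for table, and the window-function query assumes an SQLite build of 3.25 or later; other databases may differ in details.

```python
import sqlite3

# Hypothetical table name "people" (the answer's "table" is a reserved word).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, building TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Tom", "BLDG200"), ("Kevin", "BLDG200"), ("Mary", "BLDG340")],
)

# EXISTS: keep a row only if a different person is in the same building.
exists_sql = """
    SELECT t.name, t.building
    FROM people t
    WHERE EXISTS (SELECT 1
                  FROM people t2
                  WHERE t2.name <> t.name AND t2.building = t.building)
    ORDER BY t.name
"""

# Window function: count rows per building, keep buildings with more than one.
window_sql = """
    SELECT name, building
    FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY building) AS cnt
          FROM people t) AS counted
    WHERE cnt > 1
    ORDER BY name
"""

exists_rows = conn.execute(exists_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()
print(exists_rows)   # Kevin and Tom, no Mary
print(window_rows)   # same rows
```

Both queries return Kevin and Tom and leave Mary out, because her building has only one row.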

Assuming your column names are person and building:
SELECT t1.person, t2.person
FROM `table` t1
JOIN `table` t2
ON ( t1.building = t2.building
AND t1.person > t2.person
);
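The condition AND t1.person > t2.person solves your problem with Mary: a row can no longer join to itself, and each pair is listed only once.
There is a problem with more than two people in a building, because they would be split into pairs (three people produce three rows). But if this doesn't bother you, that would work.
Alternatively, the following would work. Results appear per person, so you'll get a list of every non-lonely person and the building they live in. It uses <> instead of > so that nobody in a shared building is dropped, and DISTINCT so that each person appears once:
SELECT DISTINCT t1.person, t1.building
FROM `table` t1
JOIN `table` t2
ON ( t1.building = t2.building
AND t1.person <> t2.person
);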

Relational algebra 101. I added some more names so you can see that distinct is needed: without it, a person with three housemates would be returned three times. In my sample data only Jane is alone and should not be in the result.
with cte (name, building) as (
values
('Tom', 'BLDG200'),
('Kevin','BLDG200'),
('John', 'BLDG200'),
('Jack', 'BLDG200'),
('Mary', 'BLDG340'),
('Terry','BLDG340'),
('Jane', 'BLDG341')
)
select
distinct
a.name, a.building
from
cte a
join cte b on (a.name <> b.name and a.building = b.building)

Related

How To Query IDs that have different Patient Names

First, I understand that I should have a primary key on the Patient ID value. A project was performed for ID conversions that did not go very well, so now I need to find all Patient IDs that have different Patient Names. There are 4 different DBs/tables that contain the info. For now I have selected them into a temp DB, because I actually need all PIDs to be distinct across those DBs. Our application has tools to keep that synchronized, but due to some bad SQL work I need to synchronize all the data again.
PID NAME
1234 Johnson
1234 Johnson
4567 Jones
4567 Alexander
I am trying to write a query that will return PID 4567 together with the NAME values Jones and Alexander.
You need to use a self-join. Please try this:
create table #temp
(id int, name varchar(30))
insert into #temp values (1,'johnson')
insert into #temp values (1,'johnson')
insert into #temp values (2,'james')
insert into #temp values (2,'Alex')
SELECT * FROM #temp WHERE id IN (
SELECT a.id FROM #temp a
JOIN #temp b on b.id = a.id AND b.name <> a.name
)
SELECT Min(PID), Name FROM [Table]
GROUP BY Name
HAVING Count(PID) = 1
SELECT PID,NAME
FROM TABLE
GROUP BY PID,NAME
HAVING COUNT(*) =1
I think this will do it
select p.pid, max(name), min(name), count(*) as cnt
from p
group by pid
having max(name) <> min(name)
or
select p1.pid, p1.name, p2.name
from p p1
join p p2
on p1.pid = p2.pid
and p1.name < p2.name
order by p1.pid, p1.name, p2.name
There are a lot of ways, some more optimized than others depending on which RDBMS you are using. But typically this is a two-step operation:
1) Find all of the PIDs that have more than one Name associated with them.
2) Relate back to get the rest of the data you are seeking.
CREATE TABLE #T (
PID INT
,Name VARCHAR(25)
)
INSERT INTO #T (PID,Name) VALUES (1234,'Johnson'),(1234,'Johnson'),(4567,'Jones'),(4567,'Alexander')
SELECT
t2.*
FROM
(
SELECT
PID
FROM
#T t1
GROUP BY
PID
HAVING COUNT(DISTINCT Name) > 1
) dupes
INNER JOIN #T t2
ON dupes.PID = t2.PID
When using a method such as the join or IN above, it is important to count DISTINCT Name, because simply counting * or Name also counts repeated occurrences of the same PID/Name combination (PID 1234 above has two identical rows), not only PIDs with differing names.
If you only want each differing PID/Name pair once rather than every row, ROW_NUMBER() or something similar can help you get to the answer a little more efficiently. You can also look for the existence of a non-identical record, like so:
SELECT DISTINCT t1.PID, t1.Name
FROM
#T t1
WHERE
EXISTS (SELECT 1 FROM #T t2 WHERE t1.PID = t2.PID AND t1.Name <> t2.Name)
This way could perform faster for you depending on data sets etc. I would tend to stay away from solutions that use IN for cases like these.
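To see why the DISTINCT matters, here is a small sketch in Python with the built-in sqlite3 module. It uses a plain table t instead of the temp table #T, which is an assumption made only so the example runs outside SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1234, "Johnson"), (1234, "Johnson"), (4567, "Jones"), (4567, "Alexander")],
)

def pids_having(condition):
    """Return the PIDs whose group satisfies the given HAVING condition."""
    sql = f"SELECT pid FROM t GROUP BY pid HAVING {condition} ORDER BY pid"
    return [row[0] for row in conn.execute(sql)]

# Counting distinct names finds only the PID with conflicting names.
with_distinct = pids_having("COUNT(DISTINCT name) > 1")
# Counting rows also flags 1234, which merely has a repeated identical row.
without_distinct = pids_having("COUNT(*) > 1")

# EXISTS variant: keep rows that have a sibling with the same PID, other name.
exists_rows = conn.execute("""
    SELECT DISTINCT t1.pid, t1.name
    FROM t t1
    WHERE EXISTS (SELECT 1 FROM t t2
                  WHERE t1.pid = t2.pid AND t1.name <> t2.name)
    ORDER BY t1.name
""").fetchall()

print(with_distinct, without_distinct, exists_rows)
```

Only PID 4567 survives the COUNT(DISTINCT name) filter, while COUNT(*) also returns 1234.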

Select rows having distinct values for two fields

Pardon me for the title. I have a table like this:
There will be thousands of rows, and I want to select the rows that have the same group_id but whose vr_debit and vr_credit values are not equal; in the image shown, none of the rows satisfy this criterion. If there are two rows, say (6, 500.000, 0) and (6, 0, 600.000), I want them in the result. I hope you get the idea.
Thank you.
Calculate the totals of each group using SUM(), which is an aggregate function, and filter the groups using a HAVING clause.
SELECT GROUP_ID, SUM(vr_debit) totalDebit, SUM(vr_credit) totalCredit
FROM TableName
GROUP BY GROUP_ID
HAVING SUM(vr_debit) <> SUM(vr_credit)
If you want to get the original (unaggregated) rows, you can join the table to the subquery.
SELECT a.*
FROM TableName a
INNER JOIN
(
SELECT GROUP_ID
FROM TableName
GROUP BY GROUP_ID
HAVING SUM(vr_debit) <> SUM(vr_credit)
) b ON a.GROUP_ID = b.GROUP_ID
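Here is a runnable sketch of both queries using Python's built-in sqlite3 module. The table name ledger and the balanced group 5 are made up for illustration; group 6 uses the two rows mentioned in the question, with whole numbers to keep the comparison exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger (group_id INTEGER, vr_debit INTEGER, vr_credit INTEGER)"
)
conn.executemany(
    "INSERT INTO ledger VALUES (?, ?, ?)",
    [
        (5, 100, 0), (5, 0, 100),   # balanced group (hypothetical data)
        (6, 500, 0), (6, 0, 600),   # unbalanced group from the question
    ],
)

# Totals per group, keeping only groups whose debits and credits differ.
totals = conn.execute("""
    SELECT group_id, SUM(vr_debit), SUM(vr_credit)
    FROM ledger
    GROUP BY group_id
    HAVING SUM(vr_debit) <> SUM(vr_credit)
""").fetchall()

# Original rows of those groups, found by joining back on group_id.
detail_rows = conn.execute("""
    SELECT a.group_id, a.vr_debit, a.vr_credit
    FROM ledger a
    INNER JOIN (SELECT group_id
                FROM ledger
                GROUP BY group_id
                HAVING SUM(vr_debit) <> SUM(vr_credit)) b
        ON a.group_id = b.group_id
    ORDER BY a.vr_debit
""").fetchall()

print(totals)
print(detail_rows)
```

Group 5 balances and is filtered out; group 6 is returned with both of its rows.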
Perhaps:
SELECT group_ID,
vr_debit,
vr_credit
FROM
dbo.TableName T1
WHERE
EXISTS(
SELECT 1 FROM dbo.TableName T2
WHERE T1.group_ID = T2.group_ID
AND T1.vr_debit <> T2.vr_debit
AND T1.vr_credit<> T2.vr_credit
AND T1.vr_debit <> T2.vr_credit
)
Also you can use this option
SELECT *
FROM dbo.test64 t
WHERE EXISTS (
SELECT 1
FROM dbo.test64 t2
WHERE t.group_id = t2.group_id
HAVING SUM(t2.vr_debit) - SUM(t2.vr_credit) != 0
)

SQL - remove duplicates from left join

I'm creating a joined view of two tables, but am getting unwanted duplicates from table2.
For example: table1 has 9000 records and I need the resulting view to contain exactly the same; table2 may have multiple records with the same FKID but I only want to return one record (random chosen is ok with my customer). I have the following code that works correctly, but performance is slower than desired (over 14 seconds).
SELECT
OBJECTID
, PKID
,(SELECT TOP (1) SUBDIVISIO
FROM dbo.table2 AS t2
WHERE (t1.PKID = t2.FKID)) AS ProjectName
,(SELECT TOP (1) ASBUILT1
FROM dbo.table2 AS t2
WHERE (t1.PKID = t2.FKID)) AS Asbuilt
FROM dbo.table1 AS t1
Is there a way to do something similar with joins to speed up performance?
I'm using SQL Server 2008 R2.
I got close with the following code (~.5 seconds), but 'Distinct' only filters out records when all columns are duplicate (rather than just the FKID).
SELECT
t1.OBJECTID
,t1.PKID
,t2.ProjectName
,t2.Asbuilt
FROM dbo.table1 AS t1
LEFT JOIN (SELECT
DISTINCT FKID
,ProjectName
,Asbuilt
FROM dbo.table2) t2
ON t1.PKID = t2.FKID
table examples
table1:
OID, PKID
1, id1
2, id2
3, id4
5, id5
table2:
FKID, ProjectName, Asbuilt
id1, P1, AB1
id1, P5, AB5
id2, P10, AB2
id5, P4, AB4
In the above example returned records should be id5/P4/AB4, id2/P10/AB2, and (id1/P1/AB1 OR id1/P5/AB5)
My search came up with similar questions, but none that resolved my problem.
Thanks in advance for your help. This is my first post so let me know if I've broken any rules.
This will give the results you requested and should perform well, because table2 is probed once per table1 row instead of once per output column.
SELECT
OBJECTID
, PKID
, t2.SUBDIVISIO
, t2.ASBUILT1
FROM dbo.table1 AS t1
OUTER APPLY (
SELECT TOP 1 *
FROM dbo.table2 AS t2
WHERE t1.PKID = t2.FKID
) AS t2
Your original query is producing arbitrary values for the two columns (the use of top with no order by). You can get the same effect with this:
SELECT t1.OBJECTID, t1.PKID, t2.ProjectName, t2.Asbuilt
FROM dbo.table1 t1 LEFT JOIN
(SELECT FKID, min(ProjectName) as ProjectName, MIN(asBuilt) as AsBuilt
FROM dbo.table2
group by fkid
) t2
ON t1.PKID = t2.FKID
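This version replaces the distinct with a group by. Note that MIN(ProjectName) and MIN(AsBuilt) are computed independently, so the two values may come from different table2 rows.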
To get a truly random row in SQL Server (which your syntax suggests you are using), try this:
SELECT t1.OBJECTID, t1.PKID, t2.ProjectName, t2.Asbuilt
FROM dbo.table1 t1 LEFT JOIN
(SELECT FKID, ProjectName, AsBuilt,
ROW_NUMBER() over (PARTITION by fkid order by newid()) as seqnum
FROM dbo.table2
) t2
ON t1.PKID = t2.FKID and t2.seqnum = 1
This assumes SQL Server 2005 or later.
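The ROW_NUMBER() approach can be tried on the question's sample tables with this sketch in Python's built-in sqlite3 module. Two things here are assumptions, not part of the answer: SQLite (3.25 or later for window functions) stands in for SQL Server, and the rows are ordered by ProjectName instead of newid() so that the output is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (OID INTEGER, PKID TEXT);
    CREATE TABLE table2 (FKID TEXT, ProjectName TEXT, Asbuilt TEXT);
""")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "id1"), (2, "id2"), (3, "id4"), (5, "id5")],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [("id1", "P1", "AB1"), ("id1", "P5", "AB5"),
     ("id2", "P10", "AB2"), ("id5", "P4", "AB4")],
)

# Number the table2 rows within each FKID and join only row number 1.
# ORDER BY ProjectName is a deterministic stand-in for ORDER BY newid().
rows = conn.execute("""
    SELECT t1.OID, t1.PKID, t2.ProjectName, t2.Asbuilt
    FROM table1 t1
    LEFT JOIN (SELECT FKID, ProjectName, Asbuilt,
                      ROW_NUMBER() OVER (PARTITION BY FKID
                                         ORDER BY ProjectName) AS seqnum
               FROM table2) t2
        ON t1.PKID = t2.FKID AND t2.seqnum = 1
    ORDER BY t1.OID
""").fetchall()

for row in rows:
    print(row)
```

Every table1 row appears exactly once: id1 gets a single table2 match, and id4, which has no match, is kept with NULLs because of the LEFT JOIN.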
If you want the result you described, you need to use INNER JOIN, and the following query will satisfy your need:
SELECT
t1.OID,
t1.PKID,
MAX(t2.ProjectName) AS ProjectName,
MAX(t2.Asbuilt) AS Asbuilt
FROM table1 t1
JOIN table2 t2 ON t1.PKID = t2.FKID
GROUP BY
t1.OID,
t1.PKID
If you want to see all rows from the left table (table1) whether or not they have a match in the right table, use LEFT JOIN instead and the same query will give you the desired result.
This construction performs well, and you don't need to use subqueries.

SQL query to find record with ID not in another table

I have two tables in a database that share a primary key, and I want to find the rows of one that have no counterpart in the other. For example,
Table1 has columns (ID, Name) and sample data: (1 ,John), (2, Peter), (3, Mary)
Table2 has columns (ID, Address) and sample data: (1, address2), (2, address2)
How do I write a SQL query that fetches the rows from Table1 whose ID is not in Table2? In this case, (3, Mary) should be returned.
PS: The ID is the primary key for those two tables.
Try this
SELECT ID, Name
FROM Table1
WHERE ID NOT IN (SELECT ID FROM Table2)
Use LEFT JOIN
SELECT a.*
FROM table1 a
LEFT JOIN table2 b
on a.ID = b.ID
WHERE b.id IS NULL
There are basically 3 approaches to that: not exists, not in and left join / is null.
LEFT JOIN with IS NULL
SELECT l.*
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL
NOT IN
SELECT l.*
FROM t_left l
WHERE l.value NOT IN
(
SELECT value
FROM t_right r
)
NOT EXISTS
SELECT l.*
FROM t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM t_right r
WHERE r.value = l.value
)
Which one is better? The answer is best broken down by RDBMS vendor. Generally speaking, one should be cautious with select ... where ... in (select ...) when the number of records in the sub-query is unknown. Some vendors limit the size of an IN list; Oracle, for example, has a limit of 1,000 expressions in a literal list. The best thing to do is to try all three and compare the execution plans.
Specifically for PostgreSQL, the execution plans of NOT EXISTS and LEFT JOIN / IS NULL are the same. I personally prefer the NOT EXISTS option because it shows the intent better. After all, the semantics are that you want to find records in A whose pk does not exist in B.
Old but still gold, specific to PostgreSQL though: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/
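To check that the three approaches agree on the question's sample data, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is used only as a convenient stand-in; it says nothing about the relative performance on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, Address TEXT);
    INSERT INTO Table1 VALUES (1, 'John'), (2, 'Peter'), (3, 'Mary');
    INSERT INTO Table2 VALUES (1, 'address2'), (2, 'address2');
""")

queries = {
    # Keep left rows for which the join found no partner.
    "left_join_is_null": """
        SELECT a.ID, a.Name
        FROM Table1 a
        LEFT JOIN Table2 b ON a.ID = b.ID
        WHERE b.ID IS NULL
    """,
    # Keep rows whose ID is not among the IDs of Table2.
    "not_in": """
        SELECT ID, Name
        FROM Table1
        WHERE ID NOT IN (SELECT ID FROM Table2)
    """,
    # Keep rows for which no matching Table2 row exists.
    "not_exists": """
        SELECT a.ID, a.Name
        FROM Table1 a
        WHERE NOT EXISTS (SELECT NULL FROM Table2 b WHERE b.ID = a.ID)
    """,
}

results = {label: conn.execute(sql).fetchall() for label, sql in queries.items()}
for label, rows in results.items():
    print(label, rows)
```

All three return only (3, 'Mary') here, so on this data the choice comes down to readability and the execution plan your database produces.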
Fast Alternative
I ran some tests (on Postgres 9.5) using two tables with ~2M rows each. The query below performed at least 5x better than the other queries proposed:
-- Count
SELECT count(*) FROM (
(SELECT id FROM table1) EXCEPT (SELECT id FROM table2)
) t1_not_in_t2;
-- Get full row
SELECT table1.* FROM (
(SELECT id FROM table1) EXCEPT (SELECT id FROM table2)
) t1_not_in_t2 JOIN table1 ON t1_not_in_t2.id=table1.id;
Keeping in mind the points made in @John Woo's comment/link above, this is how I typically would handle it:
SELECT t1.ID, t1.Name
FROM Table1 t1
WHERE NOT EXISTS (
SELECT TOP 1 NULL
FROM Table2 t2
WHERE t1.ID = t2.ID
)
SELECT COUNT(ID) FROM tblA a
WHERE a.ID NOT IN (SELECT b.ID FROM tblB b) --For count
SELECT ID FROM tblA a
WHERE a.ID NOT IN (SELECT b.ID FROM tblB b) --For results

Can the result of a subquery be joined with itself?

Let's say I need to find the oldest animal in each zoo. It's a typical maximum-of-a-group sort of query. Only here's a complication: the zebras and giraffes are stored in separate tables. To get a listing of all animals, be they giraffes or zebras, I can do this:
(SELECT id,zoo,age FROM zebras
UNION ALL
SELECT id,zoo,age FROM giraffes) t1
Then given t1, I could build a typical maximum-of-a-group query:
SELECT t1.*
FROM t1
JOIN
(SELECT zoo,max(age) as max_age
FROM t1
GROUP BY zoo) t2
ON (t1.zoo = t2.zoo AND t1.age = t2.max_age)
Clearly I could store t1 as a temporary table, but is there a way I could do this all within one query, and without having to repeat the definition of t1? (Please let's not discuss modifications to the table design; I want to focus on the issue of working with the subquery result.)
You can do this with the WITH clause (a common table expression), which lets you define t1 once and reference it more than once:
with t1 as
(select id, zoo, age from zebras
union all
select id, zoo, age from giraffes)
select t1.*
from t1
join
(SELECT zoo,max(age) as max_age
FROM t1
GROUP BY zoo) t2
on (t1.zoo = t2.zoo and t1.age = t2.max_age);
Note: You could move t2 up to your with clause as well.
Note 2: An alternative solution is to simply create t1 as a view and use it in your query instead.
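As a sketch of the first note, here is the query with both t1 and t2 in the WITH clause, run through Python's built-in sqlite3 module. The animal rows are made up, and the join also matches on age (t1.age = t2.max_age) so that only the oldest animal per zoo comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zebras (id INTEGER, zoo TEXT, age INTEGER);
    CREATE TABLE giraffes (id INTEGER, zoo TEXT, age INTEGER);
    -- Hypothetical sample data: two zoos, one zebra and one giraffe each.
    INSERT INTO zebras VALUES (1, 'A', 5), (2, 'B', 7);
    INSERT INTO giraffes VALUES (10, 'A', 9), (11, 'B', 3);
""")

oldest_per_zoo = conn.execute("""
    WITH t1 AS (
        SELECT id, zoo, age FROM zebras
        UNION ALL
        SELECT id, zoo, age FROM giraffes
    ),
    t2 AS (
        -- t1 is defined once above and reused here.
        SELECT zoo, MAX(age) AS max_age
        FROM t1
        GROUP BY zoo
    )
    SELECT t1.id, t1.zoo, t1.age
    FROM t1
    JOIN t2 ON t1.zoo = t2.zoo AND t1.age = t2.max_age
    ORDER BY t1.zoo
""").fetchall()

print(oldest_per_zoo)
```

Zoo A returns the 9-year-old giraffe and zoo B the 7-year-old zebra. If two animals in a zoo tie for the maximum age, this form returns both.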
To find the oldest animal, you should use window functions:
select z.*
from (select a.*,
row_number() over (partition by zoo order by age desc) as seqnum
from ( <subquery to union all animals>) a
) z
where seqnum = 1
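A runnable version of the window-function approach might look like the sketch below, again using Python's built-in sqlite3 module (SQLite 3.25 or later is assumed for ROW_NUMBER). The animal rows are made up, and the UNION ALL from the question fills in the subquery placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zebras (id INTEGER, zoo TEXT, age INTEGER);
    CREATE TABLE giraffes (id INTEGER, zoo TEXT, age INTEGER);
    -- Hypothetical sample data.
    INSERT INTO zebras VALUES (1, 'A', 5), (2, 'B', 7);
    INSERT INTO giraffes VALUES (10, 'A', 9), (11, 'B', 3);
""")

# Number the animals within each zoo from oldest to youngest, keep number 1.
oldest = conn.execute("""
    SELECT z.id, z.zoo, z.age
    FROM (SELECT a.*,
                 ROW_NUMBER() OVER (PARTITION BY zoo ORDER BY age DESC) AS seqnum
          FROM (SELECT id, zoo, age FROM zebras
                UNION ALL
                SELECT id, zoo, age FROM giraffes) a
         ) z
    WHERE z.seqnum = 1
    ORDER BY z.zoo
""").fetchall()

print(oldest)
```

Unlike the join on the maximum age, ROW_NUMBER() returns exactly one row per zoo even when ages tie, and the union is written only once.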