Compare two tables, find missing rows and mismatched data

I'd like to compare two tables and get a set of results where the lookup values are mismatched as well as where the key values are missing from the other table. The first part works fine with the following query:
SELECT * FROM (
SELECT mID, mLookup
FROM m) t1
FULL OUTER JOIN (
SELECT aID, aLookup
FROM a) t2
ON t1.mID = t2.aID
WHERE
t1.mID = t2.aID AND
t1.mLookup <> t2.aLookup
However, it doesn't return rows from t1 and t2 where there is no corresponding ID in the other table (because of the ON t1.mID = t2.aID).
How can I achieve both in the same query?

Remove the ID part of the WHERE clause. The FULL OUTER JOIN ON t1.mID = t2.aID is enough to link the tables together. The FULL OUTER JOIN will return both tables in the join even if one does not have a match.
However, the WHERE t1.mID = t2.aID clause limits the results to IDs that exist in both tables, which effectively makes the FULL OUTER JOIN act like an INNER JOIN.
In other words:
SELECT * FROM (
SELECT mID, mLookup
FROM m) t1
FULL OUTER JOIN (
SELECT aID, aLookup
FROM a) t2
ON t1.mID = t2.aID
WHERE
--t1.mID = t2.aID AND -- remove this line
t1.mLookup <> t2.aLookup
-- EDIT --
Re-reading your question, you wanted only the mismatches. In that case, you need to search on where either side's ID is NULL:
SELECT * FROM (
SELECT mID, mLookup
FROM m) t1
FULL OUTER JOIN (
SELECT aID, aLookup
FROM a) t2
ON t1.mID = t2.aID
WHERE
t1.mID IS NULL OR
t2.aID IS NULL OR
t1.mLookup <> t2.aLookup

The where clause of your query filters out those rows that don't have matching "Ids". Try this:
SELECT m.mId, m.mLookup, a.aId, a.aLookup
from m
full outer join a
on a.aId = m.mId
where m.mId is null
or a.aID is null
or m.mLookup <> a.aLookup
The full outer join gets all possible rows. The where clause keeps every row where one side or the other is null and, where both sides match (neither is null), keeps only those rows where the "lookup" values differ.
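To try this pattern without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are invented for illustration. SQLite versions before 3.39 have no FULL OUTER JOIN, so the sketch emulates it with a LEFT JOIN plus a NOT EXISTS branch; treat it as a stand-in for the query above, not a copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE m (mID INTEGER PRIMARY KEY, mLookup TEXT);
CREATE TABLE a (aID INTEGER PRIMARY KEY, aLookup TEXT);
-- Made-up data: ID 1 exists only in m, ID 4 only in a, ID 3 has different lookups.
INSERT INTO m VALUES (1, 'x'), (2, 'y'), (3, 'z');
INSERT INTO a VALUES (2, 'y'), (3, 'q'), (4, 'w');
""")

query = """
-- Rows of m that are missing from a, or whose lookup differs.
SELECT m.mID, m.mLookup, a.aID, a.aLookup
FROM m
LEFT JOIN a ON a.aID = m.mID
WHERE a.aID IS NULL OR m.mLookup <> a.aLookup
UNION ALL
-- Rows of a that have no partner in m.
SELECT NULL, NULL, a.aID, a.aLookup
FROM a
WHERE NOT EXISTS (SELECT 1 FROM m WHERE m.mID = a.aID)
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that `<>` never evaluates to true when either lookup value is NULL, so a matched pair with one NULL lookup would not be reported by this sketch or by the query above.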

This works in SQL Server 2008 and later, and is also valid for Azure SQL Database, Azure SQL Data Warehouse, and Parallel Data Warehouse.
The SQL queries are as follows:
USE [test]
GO
CREATE TABLE [dbo].[Student1](
[Id] [int] NOT NULL,
[Name] [nvarchar](256) NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Student2](
[Id] [int] NOT NULL,
[Name] [nvarchar](256) NOT NULL
) ON [PRIMARY]
GO
---- You can re-run from here with your data
truncate table [Student1]
truncate table [Student2]
insert into [Student1] values (1, N'سید حیدر')
insert into [Student1] values (2, N'Syed Ali')
insert into [Student1] values (3, N'Misbah Arfin')
insert into [Student2] values (2, N'Syed Ali')
insert into [Student2] values (3, N'Misbah Arfin');
with StudentsAll (Id, [Name]) as
(
select s1.Id, s1.[Name] from Student1 s1
left outer join Student2 s2
on
s1.Id = s2.Id
),
StudentsMatched (Id, [Name]) as
(
select s1.Id, s1.[Name] from Student1 s1
inner join Student2 s2
on
s1.Id = s2.Id
)
select * from StudentsAll
except
select * from StudentsMatched
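The query above returns the Student1 rows whose Id has no match in Student2. As a variation, EXCEPT can also be applied to the two tables directly, in both directions, to catch rows that differ in any column. The following is a rough sketch using Python's built-in sqlite3 module; the sample names are invented, and the table definitions are simplified for SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student1 (Id INTEGER NOT NULL, Name TEXT NOT NULL);
CREATE TABLE Student2 (Id INTEGER NOT NULL, Name TEXT NOT NULL);
-- Made-up data: Id 1 is only in Student1, Id 4 only in Student2, Id 3 differs by name.
INSERT INTO Student1 VALUES (1, 'Haider'), (2, 'Syed Ali'), (3, 'Misbah Arfin');
INSERT INTO Student2 VALUES (2, 'Syed Ali'), (3, 'Misbah A.'), (4, 'New Student');
""")

# Rows present in Student1 but not (identically) in Student2.
only_in_1 = conn.execute(
    "SELECT Id, Name FROM Student1 EXCEPT SELECT Id, Name FROM Student2"
).fetchall()

# Rows present in Student2 but not (identically) in Student1.
only_in_2 = conn.execute(
    "SELECT Id, Name FROM Student2 EXCEPT SELECT Id, Name FROM Student1"
).fetchall()

print(only_in_1)
print(only_in_2)
```

EXCEPT compares whole rows, so a row with the same Id but a different Name shows up in both result sets.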

Related

Combining two tables without losing column or rows

I have two tables:
The first one has the columns "SomeValue" and "Timestamp". The other one has the columns "SomeOtherValue" and also "Timestamp".
What I need as an output is the following:
A table with the three columns "SomeValue", "SomeOtherValue" and "Timestamp".
When a row in table 1 is like this: [2; 04/07/2017-20:05] and a row in table 2 is like that: [5; 04/07/2017-20:05], I want the combined output row to be [2; 5; 04/07/2017-20:05].
Up to that point it would be easily done with a simple join, but I also need all other rows. So for example if we have a row in table 1 like [2; 04/07/2017-20:05] and no matching timestamp in table 2, the output should be like [2; ?; 04/07/2017-20:05]. The '?' stands for undefined or null. It would also be acceptable not to join rows with the same timestamp but rather to concatenate both tables, so that every row would have one empty cell with '?'.
I do realize that I didn't use correct Date/Time Format here in that example, but assume that it is used in the database.
I already tried using UNION ALL but it always removes one column.
For my use case it is not possible to query both tables independently. I really need both values in one row/object.
I hope someone can help me with this. Thank you!

What you are describing is a full outer join:
select t1.somevalue, t2.someothervalue, timestamp
from t1
full outer join t2 using (timestamp);
I don't know, however, whether SAP HANA supports the USING clause. Here is the same query with ON instead:
select
t1.somevalue,
t2.someothervalue,
coalesce(t1.timestamp, t2.timestamp) as timestamp
from t1
full outer join t2 on t2.timestamp = t1.timestamp;

Joining on a datetime stamp is not always going to be reliable unless you are setting a datetime variable and writing the value of that to both tables. It's probably not very efficient either.
That said, assuming you want all the results from table 1 and the matching table 2 result if it exists, you need a left outer join:
Select T1.[SomeValue]
, ISNULL(T2.[SomeOtherValue], '?')
, T1.[TimeStamp]
FROM Table1 T1
LEFT OUTER JOIN Table2 T2
ON T2.[TimeStamp] = T1.[TimeStamp]
Update based on comment from OP
If you need all rows from both tables, you could run the query above twice, swapping the positions of T1 and T2 in the second copy, and union the two results.
-- ISNULL(..., '?') assumes character columns; '?' cannot be converted to int
SELECT T1.[TimeStamp]
, T1.[SomeValue]
, ISNULL(T2.[SomeOtherValue], '?')
FROM Table1 T1
LEFT OUTER JOIN Table2 T2
ON T2.[TimeStamp] = T1.[TimeStamp]
UNION
SELECT T2.[TimeStamp]
, ISNULL(T1.[SomeValue], '?')
, T2.[SomeOtherValue]
FROM Table2 T2
LEFT OUTER JOIN Table1 T1
ON T1.[TimeStamp] = T2.[TimeStamp]
;
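The union of two left joins can be tried out with the following sketch, which uses Python's built-in sqlite3 module. The table names t1 and t2, the column name ts, and the sample rows are assumptions made for the example. NULL is left as NULL instead of being replaced by '?'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (somevalue INTEGER, ts TEXT);
CREATE TABLE t2 (someothervalue INTEGER, ts TEXT);
-- Made-up data: 20:05 is in both tables, 20:10 only in t1, 20:15 only in t2.
INSERT INTO t1 VALUES (2, '2017-07-04 20:05'), (3, '2017-07-04 20:10');
INSERT INTO t2 VALUES (5, '2017-07-04 20:05'), (7, '2017-07-04 20:15');
""")

query = """
-- All rows of t1, with the t2 value where the timestamp matches.
SELECT t1.somevalue, t2.someothervalue, t1.ts
FROM t1
LEFT JOIN t2 ON t2.ts = t1.ts
UNION
-- All rows of t2, with the t1 value where the timestamp matches.
SELECT t1.somevalue, t2.someothervalue, t2.ts
FROM t2
LEFT JOIN t1 ON t1.ts = t2.ts
"""
combined = conn.execute(query).fetchall()
for row in combined:
    print(row)
```

UNION (rather than UNION ALL) removes the matched row that both branches produce. It would also collapse genuinely duplicated rows, which may or may not be what you want.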
Or you could insert the first query's results into a table variable, add any rows from T2 that are still missing using WHERE NOT EXISTS, and then select the output.
DECLARE @TempTab TABLE
( [TimeStamp] [datetime] NOT NULL
, [SomeValue] [nvarchar] (MAX) -- or int if this is always an integer
, [SomeOtherValue] [nvarchar] (MAX) -- or int if this is always an integer
)
;
-- Step 1: every row from Table1, with the matching Table2 value if there is one
INSERT INTO @TempTab
( [TimeStamp]
, [SomeValue]
, [SomeOtherValue]
)
SELECT T1.[TimeStamp]
, T1.[SomeValue]
, ISNULL(T2.[SomeOtherValue], '?')
FROM Table1 T1
LEFT OUTER JOIN Table2 T2
ON T2.[TimeStamp] = T1.[TimeStamp]
;
-- Step 2: rows from Table2 whose timestamp is not in the table variable yet
INSERT INTO @TempTab
( [TimeStamp]
, [SomeValue]
, [SomeOtherValue]
)
SELECT T2.[TimeStamp]
, ISNULL(T1.[SomeValue], '?')
, T2.[SomeOtherValue]
FROM Table2 T2
LEFT OUTER JOIN Table1 T1
ON T1.[TimeStamp] = T2.[TimeStamp]
WHERE NOT EXISTS
( SELECT 1
FROM @TempTab T
WHERE T.[TimeStamp] = T2.[TimeStamp]
)
;
SELECT T.[TimeStamp]
, T.[SomeValue]
, T.[SomeOtherValue]
FROM @TempTab T
;

Multiple table join

I have a scenario whereby I have 3 tables (Table1, Table2, Table3)
Table1 contains data whereby each MEMBNO is unique
I would like to JOIN to Table2 and Table3 to display results but only have one row for each result
I tried
SELECT A.MEMBNO,A.FIELD1,B.FIELD1,B.FIELD2,C.FIELD1
FROM Table1 A
INNER join Table2 B ON A.MEMBNO = B.MEMBNO
INNER join Table3 C ON A.MEMBNO = C.MEMBNO
but I get multiple results. If the MEMBNO is in Table2 twice and Table3 four times, I get 8 rows returned.
Is my JOIN correct, or is the only way to control this through the WHERE statement after the JOIN to control what is returned from Table2 and Table3 (i.e. does SQL "dumb" join all the data and expect the WHERE statement to be the filter?)
Many thanks

What you are fighting with is the different relationships between the data. Table1 is the primary key table, with one row per MEMBNO. Table2 and Table3 have more than one row for each MEMBNO. You therefore need to decide what data you actually want to see before you attempt the joins, because the difference in cardinality is what causes the row duplication. If you want the data in Table2 and Table3 to be squashed into a single row, think about how that might look: do you want to sum the numbers from the different rows into a total? Take the maximum date?
Best thing to do is give some data examples from each table and give an example result. More than happy to have a go if you add that info.
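As one possible illustration of collapsing the many-side tables before joining, here is a sketch using Python's built-in sqlite3 module. The columns AMOUNT and VISIT_DATE and the choice of SUM and MAX are assumptions made for the example, since the question does not say what Table2 and Table3 contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (MEMBNO INTEGER PRIMARY KEY, FIELD1 TEXT);
CREATE TABLE Table2 (MEMBNO INTEGER, AMOUNT INTEGER);    -- hypothetical column
CREATE TABLE Table3 (MEMBNO INTEGER, VISIT_DATE TEXT);   -- hypothetical column
INSERT INTO Table1 VALUES (1, 'Alice');
INSERT INTO Table2 VALUES (1, 10), (1, 20);
INSERT INTO Table3 VALUES (1, '2020-01-01'), (1, '2020-02-01'),
                          (1, '2020-03-01'), (1, '2020-04-01');
""")

# Joining the raw tables multiplies the rows: 1 x 2 x 4.
naive = conn.execute("""
SELECT A.MEMBNO
FROM Table1 A
JOIN Table2 B ON A.MEMBNO = B.MEMBNO
JOIN Table3 C ON A.MEMBNO = C.MEMBNO
""").fetchall()

# Aggregating each many-side table to one row per MEMBNO first
# keeps the result at one row per member.
aggregated = conn.execute("""
SELECT A.MEMBNO, A.FIELD1, B.TOTAL, C.LAST_VISIT
FROM Table1 A
JOIN (SELECT MEMBNO, SUM(AMOUNT) AS TOTAL
      FROM Table2 GROUP BY MEMBNO) B ON A.MEMBNO = B.MEMBNO
JOIN (SELECT MEMBNO, MAX(VISIT_DATE) AS LAST_VISIT
      FROM Table3 GROUP BY MEMBNO) C ON A.MEMBNO = C.MEMBNO
""").fetchall()

print(len(naive))
print(aggregated)
```

Which aggregate to use (sum, maximum, count, or something else) depends entirely on what the single output row is supposed to mean.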

If only MEMBNO matters, you could join to the distinct MEMBNO values from Table2 and Table3.
Check the example below:
create table #t1
(
F1 int,
F2 int
)
Insert into #t1 values(1, 111)
Create table #t2
(
F1 int,
F2 int
)
Insert into #t2 values(1, 111)
Insert into #t2 values(1, 222)
Create table #t3
(
F1 int,
F2 int
)
Insert into #t3 values(1, 333)
Insert into #t3 values(1, 444)
SELECT a.*
FROM #t1 a left join (Select distinct f1 from #t2) b on a.F1 = b.f1
left join (Select distinct f1 from #t3) c on a.F1 = c.f1
Here #t1, #t2 and #t3 stand for Table1, Table2 and Table3 respectively,
and F1 is your MEMBNO in all the tables.

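You get multiple results because each MEMBNO matches several rows in Table2 and Table3.
You could use a left or right join if you also need members with no match, but the join type alone does not collapse the duplicate rows.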

Insert records dropped from a join condition into another holding table

I have a logic like this.
INSERT INTO TGT_TABLE
SELECT
*
FROM SRC_TABLE
INNER JOIN REF_TABLE
WHERE SRC.ID = REF.ID
WHEN NOT MATCHED
INSERT INTO HOLDING_TABLE
;
I have an insert select statement, and I want the records which do not satisfy the condition be logged into another table.
How do I write this in SQL?

First, your join syntax is wrong. The correct form is:
INSERT INTO TGT_TABLE
SELECT
*
FROM SRC_TABLE
INNER JOIN REF_TABLE
ON SRC_TABLE.ID = REF_TABLE.ID
To fetch the rows that have no match in REF_TABLE, use a left join, which yields NULL in the REF_TABLE columns for non-matching rows, and pick those rows out in the WHERE clause:
INSERT INTO HOLDING_TABLE
SELECT
*
FROM SRC_TABLE
LEFT JOIN REF_TABLE
ON SRC_TABLE.ID = REF_TABLE.ID
WHERE REF_TABLE.ID is null
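The two inserts can be tried end to end with the sketch below, which uses Python's built-in sqlite3 module. The PAYLOAD column, the aliases s and r, and the sample rows are assumptions for illustration. The sketch selects only the source columns instead of `*`, so the column count matches the target tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SRC_TABLE (ID INTEGER, PAYLOAD TEXT);      -- PAYLOAD is hypothetical
CREATE TABLE REF_TABLE (ID INTEGER);
CREATE TABLE TGT_TABLE (ID INTEGER, PAYLOAD TEXT);
CREATE TABLE HOLDING_TABLE (ID INTEGER, PAYLOAD TEXT);
INSERT INTO SRC_TABLE VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO REF_TABLE VALUES (1), (3);

-- Matched rows go to the target table.
INSERT INTO TGT_TABLE
SELECT s.ID, s.PAYLOAD
FROM SRC_TABLE s
INNER JOIN REF_TABLE r ON s.ID = r.ID;

-- Unmatched rows (no partner in REF_TABLE) go to the holding table.
INSERT INTO HOLDING_TABLE
SELECT s.ID, s.PAYLOAD
FROM SRC_TABLE s
LEFT JOIN REF_TABLE r ON s.ID = r.ID
WHERE r.ID IS NULL;
""")

target = conn.execute("SELECT ID, PAYLOAD FROM TGT_TABLE ORDER BY ID").fetchall()
holding = conn.execute("SELECT ID, PAYLOAD FROM HOLDING_TABLE ORDER BY ID").fetchall()
print(target)
print(holding)
```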

SQL to resequence items by groups

Let's say I have a database that looks like this:
tblA:
ID, Name, Sequence, tblBID
1 a 5 14
2 b 3 15
3 c 3 16
4 d 3 17
tblB:
ID, Group
14 1
15 1
16 2
17 3
I would like to sequence A so that the sequences go 1...n for each group of B.
So in this case, the sequences going down should be 1,2,1,1.
The ordering needs to be consistent with the current ordering, but there are no guarantees as to the current ordering.
I am not exactly a sql master and I am sure there is a fairly easy way to do this, but I really don't know the right route to take. Any hints?

If you are using SQL Server 2005 or higher, you can use a ranking function:
Select tblA.Id, tblA.Name
, Row_Number() Over ( Partition By tblB.[Group] Order By tblA.Id ) As Sequence
, tblA.tblBID
From tblA
Join tblB
On tblB.ID = tblA.tblBID
See the documentation for the Row_Number ranking function.
Here's another solution that would work in SQL Server 2000 and prior.
Select A.Id, A.Name
, (Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID) + 1 As Sequence
, A.tblBID
From tblA As A
Join tblB As B
On B.Id = A.tblBID
EDIT
In response to the comment "Also want to make it clear that I want to actually update tblA to reflect the proper sequences":
In SQL Server, you can use their proprietary From clause in an Update statement like so:
Update tblA
Set Sequence = (
Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID
) + 1
From tblA As A
Join tblB As B
On B.Id = A.tblBID
The Hoyle ANSI solution might be something like:
Update tblA
Set Sequence = (
Select (Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID) + 1
From tblA As A
Join tblB As B
On B.Id = A.tblBID
Where A.Id = tblA.Id
)
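For anyone who wants to check the counting approach against the question's data, here is a sketch using Python's built-in sqlite3 module. It is a simplified variant of the correlated-subquery update, not a copy of it: the column is named GroupID to avoid the reserved word, and like the queries above it assumes each tblB row is referenced by exactly one tblA row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblA (ID INTEGER PRIMARY KEY, Name TEXT, Sequence INTEGER, tblBID INTEGER);
CREATE TABLE tblB (ID INTEGER PRIMARY KEY, GroupID INTEGER);
-- Data from the question.
INSERT INTO tblA VALUES (1, 'a', 5, 14), (2, 'b', 3, 15), (3, 'c', 3, 16), (4, 'd', 3, 17);
INSERT INTO tblB VALUES (14, 1), (15, 1), (16, 2), (17, 3);
""")

# New sequence = 1 + number of tblB rows in the same group with a smaller ID.
conn.execute("""
UPDATE tblA
SET Sequence = 1 + (
    SELECT COUNT(*)
    FROM tblB AS B1
    WHERE B1.GroupID = (SELECT B.GroupID FROM tblB AS B WHERE B.ID = tblA.tblBID)
      AND B1.ID < tblA.tblBID
)
""")

resequenced = conn.execute("SELECT ID, Name, Sequence FROM tblA ORDER BY ID").fetchall()
print(resequenced)
```

With the question's data this yields the sequences 1, 2, 1, 1, because the ordering within each group follows tblB.ID.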
EDIT
In response to the comment "Can we do that [the inner group] comparison based on A.Sequence instead of B.ID?":
Select A1.*
, (Select Count(*)
From tblB As B2
Join tblA As A2
On A2.tblBID = B2.Id
Where B2.[Group] = B1.[Group]
And A2.Sequence < A1.Sequence) + 1
From tblA As A1
Join tblB As B1
On B1.Id = A1.tblBID

Because it's SQL 2000, we can't use a windowing function. That's okay.
Thomas's queries are good and will work. However, they will get slower and slower as the number of rows increases, with different characteristics depending on how wide the data is (the number of groups) and how deep (the number of items per group). This is because those queries use a partial cross join, which we could call a "pyramidal cross join": each left-side row is joined only to the right-side rows with smaller values, rather than to every right-side row.
What to do?
I think you will be surprised to find that the following long and painful-looking script will outperform the pyramidal join at a certain size of data (which may not be all that big) and eventually, with really large data sets must be considered a screaming performer:
CREATE TABLE #tblA (
ID int identity(1,1) NOT NULL,
Name varchar(1) NOT NULL,
Sequence int NOT NULL,
tblBID int NOT NULL,
PRIMARY KEY CLUSTERED (ID)
)
INSERT #tblA VALUES ('a', 5, 14)
INSERT #tblA VALUES ('b', 3, 15)
INSERT #tblA VALUES ('c', 3, 16)
INSERT #tblA VALUES ('d', 3, 17)
CREATE TABLE #tblB (
ID int NOT NULL PRIMARY KEY CLUSTERED,
GroupID int NOT NULL
)
INSERT #tblB VALUES (14, 1)
INSERT #tblB VALUES (15, 1)
INSERT #tblB VALUES (16, 2)
INSERT #tblB VALUES (17, 3)
CREATE TABLE #seq (
seq int identity(1,1) NOT NULL,
ID int NOT NULL,
GroupID int NOT NULL,
PRIMARY KEY CLUSTERED (ID)
)
INSERT #seq
SELECT
A.ID,
B.GroupID
FROM
#tblA A
INNER JOIN #tblB B ON A.tblBID = b.ID
ORDER BY B.GroupID, A.Sequence
UPDATE A
SET A.Sequence = S.seq - X.MinSeq + 1
FROM
#tblA A
INNER JOIN #seq S ON A.ID = S.ID
INNER JOIN (
SELECT GroupID, MinSeq = Min(seq)
FROM #seq
GROUP BY GroupID
) X ON S.GroupID = X.GroupID
SELECT * FROM #tblA
DROP TABLE #seq
DROP TABLE #tblB
DROP TABLE #tblA
If I understood you correctly, then ORDER BY B.GroupID, A.Sequence is correct. If not, you can switch A.Sequence to B.ID.
Also, my index on the temp table should be experimented with. For a certain quantity of rows, and also the width and depth characteristics of those rows, clustering on one of the other two columns in the #seq table could be helpful.
Last, a different data organization is possible: leaving GroupID out of the #seq table and joining again. I suspect it would be worse, but am not 100% sure.

Something like:
SELECT a.id, a.name, row_number() over (partition by b.[group] order by a.id)
FROM tblA a
JOIN tblB b on a.tblBID = b.ID;

inner join on null value

I'm not sure if I made a mistake in logic.
If I have a query and do an inner join on a column that is null, will I always get no results, or will it ignore the join and succeed? Example:
user { id PK, name NVARCHAR NOT NULL, banStatus nullable reference }
If I join on u.banStatus, will I receive no rows?
select * from user as u
join banstatus as b on u.banStatus=b.id
where u.id=1

You don't get the row if the join column is NULL, because NULL cannot be equal to anything, even NULL.
If you change it to a LEFT JOIN, then you will get the row.
With an inner join:
select * from user as u
join banstatus as b on u.banStatus=b.id
1, '1', 1, 'Banned'
With a left join:
select * from user as u
left join banstatus as b on u.banStatus=b.id
1, '1', 1, 'Banned'
2, NULL, NULL, NULL
Using this test data:
CREATE TABLE user (id int, banstatus nvarchar(100));
INSERT INTO user (id, banstatus) VALUES
(1, '1'),
(2, NULL);
CREATE TABLE banstatus (id int, text nvarchar(100));
INSERT INTO banstatus (id, text) VALUES
(1, 'Banned');

When you do an INNER JOIN, NULL values do not match with anything. Not even with each other. That is why your query is not returning any rows.

This is an inner join that also matches nulls (Oracle syntax):
select *
from user uu
join banstatus bb
on uu.banstatus = bb.id
or (uu.banstatus is null and bb.id is null)

Nulls are not equal to any other value, so the join condition is not true for nulls. You can achieve the desired result by choosing a different join condition. Instead of
u.banStatus = b.id
use
u.banStatus = b.id OR (u.banStatus IS NULL AND b.id IS NULL)
Some SQL dialects have a more concise syntax for this kind of comparison:
-- PostgreSQL
u.banStatus IS NOT DISTINCT FROM b.id
-- SQLite
u.banStatus IS b.id
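The difference between the three join conditions can be seen in a small sketch using Python's built-in sqlite3 module. The table is named users instead of user to sidestep reserved-word issues, and the label column and sample rows (including a banstatus row with a NULL id) are assumptions made for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, banStatus INTEGER);
CREATE TABLE banstatus (id INTEGER, label TEXT);
INSERT INTO users VALUES (1, 'ann', 1), (2, 'bob', NULL);
INSERT INTO banstatus VALUES (1, 'Banned'), (NULL, 'Unknown');
""")

# Plain equality: NULL = NULL is not true, so bob is dropped.
inner = conn.execute("""
SELECT u.id, b.label
FROM users u
JOIN banstatus b ON u.banStatus = b.id
ORDER BY u.id
""").fetchall()

# Left join: bob is kept, with NULL for the banstatus columns.
left = conn.execute("""
SELECT u.id, b.label
FROM users u
LEFT JOIN banstatus b ON u.banStatus = b.id
ORDER BY u.id
""").fetchall()

# SQLite's null-safe IS: NULL matches NULL, so bob pairs with the NULL-id row.
null_safe = conn.execute("""
SELECT u.id, b.label
FROM users u
JOIN banstatus b ON u.banStatus IS b.id
ORDER BY u.id
""").fetchall()

print(inner)
print(left)
print(null_safe)
```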
