Diff between two tables (using sql) -> incremental changes - sql

I have a need to identify differences between two tables. I have looked at sql query to return differences between two tables but it was a bit too different for me to extrapolate with my current SQL skills.
Table A is a snapshot of a certain group of people where the snapshot was taken yesterday, where each row is a unique person and certain characteristics about the person. Table B is the same snapshot taken 24 hours later. Within the 24 hour period:
New people may have been added.
People from yesterday may have been removed.
People from yesterday may have changed (i.e., original row is there, but one or more column values have changed).
My output should have the following:
a row for each new person added
a row for each person removed
a row for each person who has changed
I would be grateful for any ideas. Thanks!

This type of problem has a very simple and efficient solution that does not use joins (it doesn't even use a union of the results of two MINUS operations): it uses just one UNION ALL and a GROUP BY operation. The solution was developed in a thread on AskTom many years ago; it is surprising that it is not more widely known and used. For example (but not only): https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:24371552251735
In your case, assuming there is a primary key constraint on PERSON_ID (which makes the solution simpler):
select max(flag) as flag, PERSON_ID, first_name, last_name, (etc. - all the columns)
from ( select 'old' as flag, t1.*
from old_table t1
union all
select 'new' as flag, t2.*
from new_table t2
)
group by PERSON_ID, first_name, last_name, (etc.)
having count(*) = 1
order by PERSON_ID -- optional
;
If for a PERSON_ID all the data is the same in both tables, that will result in a count of 2 for that group, so it won't pass the HAVING condition. The only groups that will have a count of 1 (and therefore will be just one row each!) are rows that are in one table but not the other. If a person was added, that will show only one row, with the flag = 'new'. If a person was deleted, you will get only one row, with the flag 'old'. If there were updates, the same PERSON_ID will appear twice, but since at least one field is different, the two rows (one with flag 'new' and the other with 'old') will be in different groups, they will pass the HAVING filter, and they will BOTH be in the output.
Which is slightly different from what you requested; you will get both the old AND the new information for updates, labeled as 'old' and 'new'. You said you wanted only one of those but didn't state which one. This will give you both (which makes more sense anyway), but if you really only want one, it can be done easily in the query above.
Note - the outer select must have max(flag) rather than flag because flag is not a GROUP BY column; but it's the max() over exactly one row, so it WILL be the flag for that row anyway.
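As a rough, runnable illustration of the UNION ALL / GROUP BY / HAVING idea, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite stands in for Oracle here (an assumption made only so the example is self-contained), and the rows are the test data used further down in this answer.

```python
import sqlite3

# In-memory SQLite database standing in for the real Oracle tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_table (person_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE new_table (person_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    INSERT INTO old_table VALUES (101, 'John', 'Smith'), (102, 'Mary', 'Green'),
                                 (103, 'July', 'Dobbs'), (104, 'Will', 'Scott');
    INSERT INTO new_table VALUES (101, 'Joe',  'Smith'), (102, 'Mary', 'Green'),
                                 (104, 'Will', 'Scott'), (105, 'Andy', 'Brown');
""")

# Stack both snapshots, group on every data column, keep groups seen only once.
diff_rows = conn.execute("""
    SELECT MAX(flag) AS flag, person_id, first_name, last_name
    FROM (SELECT 'old' AS flag, t1.* FROM old_table t1
          UNION ALL
          SELECT 'new' AS flag, t2.* FROM new_table t2)
    GROUP BY person_id, first_name, last_name
    HAVING COUNT(*) = 1
    ORDER BY person_id, flag
""").fetchall()

for row in diff_rows:
    print(row)
```

Person 101 shows up twice (once as 'new', once as 'old') because the first name changed; 103 appears only as 'old' and 105 only as 'new'. Unchanged rows (102, 104) form groups of two and are filtered out.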
Added - OP indicated he would like to get only the "new" row for a person with updated (changed, modified) data. The approach shown below will change the flag to "changed" in this case.
with old_table ( person_id, first_name, last_name ) as (
select 101, 'John', 'Smith' from dual union all
select 102, 'Mary', 'Green' from dual union all
select 103, 'July', 'Dobbs' from dual union all
select 104, 'Will', 'Scott' from dual
),
new_table ( person_id, first_name, last_name ) as (
select 101, 'Joe' , 'Smith' from dual union all
select 102, 'Mary', 'Green' from dual union all
select 104, 'Will', 'Scott' from dual union all
select 105, 'Andy', 'Brown' from dual
)
-- end of test data; solution (SQL query) begins below this line
select case ct when 1 then flag else 'changed' end as flag,
person_id, first_name, last_name
from (
select max(flag) as flag, person_id, first_name, last_name,
count(*) over (partition by person_id) as ct,
row_number() over (partition by person_id order by max(flag)) as rn
from ( select 'old' as flag, t1.*
from old_table t1
union all
select 'new' as flag, t2.*
from new_table t2
)
group by person_id, first_name, last_name
having count(*) = 1
)
where rn = 1
order by person_id -- ORDER BY clause is optional
;
Output:
FLAG     PERSON_ID FIRST_NAME LAST_NAME
-------  --------- ---------- ---------
changed        101 Joe        Smith
old            103 July       Dobbs
new            105 Andy       Brown

The first 2 parts are easy:
select 'New', name from B where not exists (select name from A where A.name=B.name)
union select 'Removed', name from A where not exists (select name from B where B.name = A.name)
The last one is where you need to compare characteristics. How many of them are there? Do you want to list what has changed or only that they have changed?
For argument's sake, let us only say that the characteristics are address and telephone #:
union select 'Phone', A.name from A,B where A.name = B.name and A.telephone != B.telephone
union select 'Address', A.name from A,B where A.name = B.name and A.address != B.address
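To see the pieces above working together, here is a small sketch that runs the combined query against an in-memory SQLite database from Python. The table contents and the `kind` column alias are made up for illustration, and SQLite is only a stand-in for whatever database is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A is yesterday's snapshot, B is today's (hypothetical sample rows).
    CREATE TABLE A (name TEXT PRIMARY KEY, telephone TEXT, address TEXT);
    CREATE TABLE B (name TEXT PRIMARY KEY, telephone TEXT, address TEXT);
    INSERT INTO A VALUES ('Ann', '555-0100', '1 Elm St'),
                         ('Bob', '555-0101', '2 Oak St'),
                         ('Cy',  '555-0102', '3 Fir St');
    INSERT INTO B VALUES ('Ann', '555-0199', '1 Elm St'),
                         ('Cy',  '555-0102', '3 Fir St'),
                         ('Dee', '555-0103', '4 Ash St');
""")

name_changes = conn.execute("""
    SELECT 'New' AS kind, name FROM B
    WHERE NOT EXISTS (SELECT 1 FROM A WHERE A.name = B.name)
    UNION
    SELECT 'Removed', name FROM A
    WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.name = A.name)
    UNION
    SELECT 'Phone', A.name FROM A JOIN B ON A.name = B.name
    WHERE A.telephone <> B.telephone
    UNION
    SELECT 'Address', A.name FROM A JOIN B ON A.name = B.name
    WHERE A.address <> B.address
    ORDER BY name
""").fetchall()

for row in name_changes:
    print(row)
```

Note that a person whose phone and address both changed would appear twice, once per changed characteristic.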

Note: The question isn't currently tagged with the dbms. I use sql-server, so that's what I used to write the below. There may be slight differences in another dbms.
You can do something along these lines:
select *
from TableA a
right join TableB b on b.ID = a.ID
where a.ID is null -- added since yesterday
union
select *
from TableA a
left join TableB b on b.ID = a.ID
where b.ID is null -- removed since yesterday
union
select *
from TableA a
inner join TableB b on b.ID = a.ID -- restrict to records in both tables
where a.SomeValue <> b.SomeValue
or a.SomeOtherValue <> b.SomeOtherValue
--etc
Each select handles one portion of your expected output. In this manner, they'd all be joined into 1 result set. If you drop the union, you'll end up with a separate set for each.

I suggest using EXCEPT to get the changed records. The query below should work if the database is SQL Server.
-- added since yesterday
SELECT B.*
FROM TableA A
RIGHT OUTER JOIN TableB B on B.ID = A.ID
WHERE A.ID IS NULL
UNION
-- removed since yesterday
SELECT A.*
FROM TableA A
LEFT OUTER JOIN TableB B on B.ID = A.ID
WHERE B.ID IS NULL
UNION
-- those changed, with today's values
SELECT B.* FROM TableB B WHERE EXISTS(SELECT A.ID FROM TableA A WHERE A.ID = B.ID)
EXCEPT
SELECT A.* FROM TableA A WHERE EXISTS(SELECT B.ID FROM TableB B WHERE B.ID = A.ID)

Assuming you have a unique id for each person in the table, you can use full outer join:
select coalesce(ty.customerid, tt.customerid) as customerid,
(case when ty.customerid is null then 'New'
when tt.customerid is null then 'Removed'
else 'Modified'
end) as status
from tyesterday ty full outer join
ttoday tt
on ty.customerid= tt.customerid
where ty.customerid is null or
tt.customerid is null or
(tt.col1 <> ty.col1 or tt.col2 <> ty.col2 or . . . ); -- may need to take `NULL`s into account
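The remark about NULLs matters because `<>` evaluates to unknown when either side is NULL, so a change to or from NULL is silently skipped. The sketch below shows the difference on an in-memory SQLite database from Python; the single `col1` column and its values are hypothetical, and only the "Modified" comparison is exercised (with an inner join, since full outer join support varies across SQLite versions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tyesterday (customerid INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE ttoday     (customerid INTEGER PRIMARY KEY, col1 TEXT);
    -- customer 2 goes NULL -> 'b', customer 3 goes 'c' -> NULL, customer 1 is unchanged
    INSERT INTO tyesterday VALUES (1, 'a'), (2, NULL), (3, 'c');
    INSERT INTO ttoday     VALUES (1, 'a'), (2, 'b'),  (3, NULL);
""")

# Plain <> comparison: rows where either side is NULL are not reported.
naive = conn.execute("""
    SELECT ty.customerid
    FROM tyesterday ty JOIN ttoday tt ON ty.customerid = tt.customerid
    WHERE tt.col1 <> ty.col1
    ORDER BY ty.customerid
""").fetchall()

# Portable NULL-aware comparison: also treat NULL vs non-NULL as a change.
null_aware = conn.execute("""
    SELECT ty.customerid
    FROM tyesterday ty JOIN ttoday tt ON ty.customerid = tt.customerid
    WHERE tt.col1 <> ty.col1
       OR (tt.col1 IS NULL AND ty.col1 IS NOT NULL)
       OR (tt.col1 IS NOT NULL AND ty.col1 IS NULL)
    ORDER BY ty.customerid
""").fetchall()

print(naive)       # changes involving NULL are missed
print(null_aware)  # both NULL-related changes are found
```

The extra OR branches have to be repeated for every nullable column that is compared.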

mathguy provided a successful answer to my initial problem. I asked him for a revision (to make it even better). He provided a revision, but I am getting a "missing keyword" error when executing against my code. Here is my code:
select case when ct = 1 then flag else 'changed' as flag, PERSON_ID, FIRSTNAME, LASTNAME
from (
select max(flag), PERSON_ID, FIRSTNAME, LASTNAME
count() over (partition by PERSON_ID) as ct,
row_number() over (partition by PERSON_ID
order by case when flag = 'new' then 0 end) as rn
from ( select 'old' as flag, t1.*
from YESTERDAY_TABLE t1
union all
select 'new' as flag, t2.*
from TODAY_TABLE t2
)
group by PERSON_ID, FIRSTNAME, LASTNAME
having count(*) = 1
)
where rn = 1
order by PERSON_ID;

Related

How to return two values from PostgreSQL subquery?

I have a problem where I need to get the last item across various tables in PostgreSQL.
The following code works and returns me the type of the latest update and when it was last updated.
The problem is, this query needs to be used as a subquery, so I want to select both the type and the last updated value from this query and PostgreSQL does not seem to like this... (Subquery must return only one column)
Any suggestions?
SELECT last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
Query is used like this inside of a CTE;
WITH sql_query as (
SELECT id, name, address, (...other columns),
last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
FROM table_c
WHERE table_c.fk_id = 1
)
The inherent problem is that SQL (all SQL, not just Postgres) requires that a subquery used within a select clause return only a single value. If you think about that restriction for a while, it does make sense. The select clause returns rows with a certain number of columns, and each row/column location is a single position within a grid. You can bend that rule a bit by putting concatenations into a single position (or a single "complex type" like a JSON value), but it remains a single position in that grid regardless.
Here, however, you want 2 separate columns AND you need both columns to come from the same row, so instead of LIMIT 1 I suggest using ROW_NUMBER() to facilitate this:
WITH LastVals as (
SELECT type
, max_date
, row_number() over(order by max_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS max_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS max_date, 'b' AS type FROM table_b b WHERE b.ref = 5
) AS u
)
, sql_query as (
SELECT id
, name, address, (...other columns)
, (select type from lastVals where rn = 1) as last_type
, (select max_date from lastVals where rn = 1) as last_date
FROM table_c
WHERE table_c.fk_id = 1
)
----
By the way in your subquery you should use UNION ALL with type being a constant like 'a' or 'b' then even if MAX(a.updated_at) was identical for 2 or more tables, the rows would still be unique because of the difference in type. UNION will attempt to remove duplicate rows but here it just isn't going to help, so avoid that wasted effort by using UNION ALL.
----
For another way to skin this cat, consider using a LEFT JOIN instead:
SELECT id
, name, address, (...other columns)
, LastVals.type
, LastVals.last_date
FROM table_c
LEFT JOIN (
SELECT type
, last_date
, row_number() over(order by last_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS last_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS last_date, 'b' AS type FROM table_b b WHERE b.ref = 5
) u
) LastVals ON LastVals.rn = 1
WHERE table_c.fk_id = 1
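The join variant can be tried out with the sketch below, which uses an in-memory SQLite database from Python rather than PostgreSQL. To keep it runnable on older SQLite builds it swaps ROW_NUMBER() for ORDER BY ... DESC LIMIT 1 inside the derived table, which is a substitution and not what the answer above uses; the sample rows and the alias `last_vals` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (ref INTEGER, updated_at TEXT);
    CREATE TABLE table_b (ref INTEGER, updated_at TEXT);
    CREATE TABLE table_c (id INTEGER PRIMARY KEY, name TEXT, fk_id INTEGER);
    INSERT INTO table_a VALUES (5, '2020-01-01'), (5, '2020-03-01');
    INSERT INTO table_b VALUES (5, '2020-02-01'), (5, '2020-04-01');
    INSERT INTO table_c VALUES (10, 'first', 1), (11, 'second', 1), (12, 'other', 2);
""")

# The derived table yields exactly one row with two columns (type, max_date);
# joining it on an always-true condition attaches both columns to every table_c row.
latest_rows = conn.execute("""
    SELECT c.id, c.name, last_vals.type, last_vals.max_date
    FROM table_c c
    LEFT JOIN (
        SELECT type, max_date
        FROM (SELECT MAX(a.updated_at) AS max_date, 'a' AS type FROM table_a a WHERE a.ref = 5
              UNION ALL
              SELECT MAX(b.updated_at) AS max_date, 'b' AS type FROM table_b b WHERE b.ref = 5) u
        ORDER BY max_date DESC
        LIMIT 1
    ) last_vals ON 1 = 1
    WHERE c.fk_id = 1
    ORDER BY c.id
""").fetchall()

for row in latest_rows:
    print(row)
```

Because the derived table sits in the FROM clause rather than the select list, the one-column restriction on scalar subqueries does not apply.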

Comparing between rows in same table in Oracle SQL

I'm trying to find the best way to compare between rows by CustomerID and Status. In other words, only show the CustomerID when the status are equal between multiple rows and CustomerID. If not, don't show the CustomerID.
Example data
CUSTOMERID STATUS
1000 ACTIVE
1000 ACTIVE
1000 NOT ACTIVE
2000 ACTIVE
2000 ACTIVE
RESULT I'm hoping for
CUSTOMERID STATUS
2000 ACTIVE
You can do this with a WHERE NOT EXISTS:
Select Distinct CustomerId, Status
From YourTable A
Where Not Exists
(
Select *
From YourTable B
Where A.CustomerId = B.CustomerId
And A.Status <> B.Status
)
SELECT DISTINCT o.*
FROM
(
SELECT
CustomerId
FROM
TableName
GROUP BY
CustomerId
HAVING
COUNT(DISTINCT Status) = 1
) t
INNER JOIN TableName o
ON t.CustomerId = o.CustomerId
The only "Code" here is the last 4 lines in the code block. The other is establishing sample data.
with T1 as (
Select 1000 as CUSTOMERID, 'ACTIVE' as STATUS from dual union all
select 1000, 'ACTIVE' from dual union all
select 1000, 'NOT ACTIVE' from dual union all
select 2000, 'ACTIVE' from dual union all
select 2000, 'ACTIVE' from dual )
SELECT customerID, max(status) as status
FROM T1
GROUP BY customerID
HAVING count(distinct Status) = 1
I used a CTE to setup sample data and called this Common table Expression T1.
Order of operations matters here:
First, the table T1 is identified.
Second, the engine groups by customer ID.
Third, the engine limits the results to those groups whose rows share exactly one distinct status value.
Fourth, the engine picks the max status, which will always be that one value. MIN or MAX, it doesn't matter, as there is only one possible value. Note that we have to use an aggregate here, since grouping by status would not give the desired results.
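The steps above can be checked with a short sketch that loads the question's example data into an in-memory SQLite database from Python (SQLite standing in for Oracle, which is an assumption made for runnability) and runs the same GROUP BY / HAVING query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (customerid INTEGER, status TEXT);
    INSERT INTO t1 VALUES (1000, 'ACTIVE'), (1000, 'ACTIVE'), (1000, 'NOT ACTIVE'),
                          (2000, 'ACTIVE'), (2000, 'ACTIVE');
""")

# Keep only customers whose rows all carry the same status.
uniform = conn.execute("""
    SELECT customerid, MAX(status) AS status
    FROM t1
    GROUP BY customerid
    HAVING COUNT(DISTINCT status) = 1
    ORDER BY customerid
""").fetchall()

print(uniform)
```

Customer 1000 has two distinct statuses and is dropped; customer 2000 has one and is kept.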
Here's a pretty simple one using IN:
SELECT DISTINCT CustomerID, Status
FROM My_Table
WHERE CustomerID IN
(SELECT CustomerID
FROM My_Table
GROUP BY CustomerID
HAVING COUNT(Distinct Status) = 1)
Addition: based on your comment, it seems what you really want is all the IDs that do not have a 'NOT ACTIVE' row, which is actually easier (string comparison is case-sensitive, so the literal must match the stored value):
SELECT Distinct CustomerID, Status
FROM My_Table
WHERE CustomerID NOT IN
(SELECT CustomerID
FROM My_Table
WHERE Status = 'NOT ACTIVE')
This is a SQL Server answer, I believe it should work in Oracle.
SELECT
a.AGMTNUM
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.Status = 'NOT ACTIVE' AND a.AGMTNUM = b.AGMTNUM)
AND EXISTS (SELECT 1 FROM TableB c WHERE c.Status = 'ACTIVE' AND a.AGMTNUM = c.AGMTNUM)
This will only return values that have at least one 'ACTIVE' value and no 'NOT ACTIVE' values.

Trying to determine if there is any Advantage

If I were in SQL Server I would just look at the execution plan, but I don't have rights to do that in my Oracle systems and I cannot see any speed difference when running. Any thoughts?
SELECT c_ID, c_Date
FROM Table1
WHERE CUR_IND = 'Y'
UNION
SELECT c_ID, c_Date
FROM Table1
WHERE LAST_UPDATE_DATE BETWEEN TO_DATE('2015-07-01','YYYY-MM-DD') AND TO_DATE('2015-07-20','YYYY-MM-DD')
AND c_ID NOT IN (SELECT c_ID FROM Table1 WHERE CUR_IND = 'Y')
I guess my main question, which seems sort of obvious, is: since UNION is going to run a distinct on this query, is the subquery in the second select helpful? This returns about 400K records on the first select and about 10K on the second; with the subquery, the second only returns 2. With or without it I get the same result in what appears to be the same amount of time. (Wishing I could see the execution plan.)
You don't need a UNION at all. Just have a single query with an OR between your conditions (unless the c_ID-c_Date combination is not unique, in which case UNION might have been filtering out duplicate rows as well. But I suspect this is not your case.):
SELECT c_ID, c_Date
FROM Table1
WHERE CUR_IND = 'Y'
OR LAST_UPDATE_DATE BETWEEN TO_DATE('2015-07-01','YYYY-MM-DD') AND TO_DATE('2015-07-20','YYYY-MM-DD')
Or... if the UNION from your original query was also performing DISTINCT duties for you, then simply add a DISTINCT:
SELECT DISTINCT c_ID, c_Date
FROM Table1
WHERE CUR_IND = 'Y'
OR LAST_UPDATE_DATE BETWEEN TO_DATE('2015-07-01','YYYY-MM-DD') AND TO_DATE('2015-07-20','YYYY-MM-DD')
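As a quick sanity check of the claim that the UNION can be replaced by a single DISTINCT query with OR, here is a sketch on an in-memory SQLite database from Python. The four sample rows are invented, dates are stored as ISO text because SQLite has no TO_DATE, and the UNION is written without the NOT IN subquery, since the point is that UNION already removes the overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (c_ID INTEGER, c_Date TEXT, CUR_IND TEXT, LAST_UPDATE_DATE TEXT);
    INSERT INTO Table1 VALUES
        (1, '2015-01-01', 'Y', '2015-01-01'),  -- current only
        (2, '2015-07-05', 'N', '2015-07-05'),  -- in date range only
        (3, '2015-07-10', 'Y', '2015-07-10'),  -- matches both conditions
        (4, '2015-08-01', 'N', '2015-08-01');  -- matches neither
""")

union_rows = conn.execute("""
    SELECT c_ID, c_Date FROM Table1 WHERE CUR_IND = 'Y'
    UNION
    SELECT c_ID, c_Date FROM Table1
    WHERE LAST_UPDATE_DATE BETWEEN '2015-07-01' AND '2015-07-20'
    ORDER BY c_ID
""").fetchall()

or_rows = conn.execute("""
    SELECT DISTINCT c_ID, c_Date FROM Table1
    WHERE CUR_IND = 'Y'
       OR LAST_UPDATE_DATE BETWEEN '2015-07-01' AND '2015-07-20'
    ORDER BY c_ID
""").fetchall()

print(union_rows)
print(or_rows)
```

Row 3 satisfies both conditions but appears once in either result, which is why the extra subquery does not change the output of the UNION form.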
EDIT
Based on your comments, it sounds like you'll be writing a query like this (adding in for completeness):
SELECT c_ID, c_Date
FROM (
SELECT c_ID, c_Date,
row_number() over (partition by c_ID
order by case when CUR_IND = 'Y' then 1 else 2 end) as rn
FROM Table1
WHERE CUR_IND = 'Y'
OR LAST_UPDATE_DATE BETWEEN TO_DATE('2015-07-01','YYYY-MM-DD') AND TO_DATE('2015-07-20','YYYY-MM-DD')
) WHERE rn = 1

Case on union of multiple unions and issue with alias

I have 2 series of unions which I wish to join by another union. In the first one, I have 3 Selects and in the second one I have 2 different Selects.
Select id, min(value)
from table1 t1
join (Select id, value
Union
Select id, value
Union
Select id, value) as foo
on foo.id=t1.id
Group by id
Select id, max(value)
from table1 t1
join (Select id, value
Union
Select id, value) as bar
on bar.id=t1.id
Group by id
I tried to do a union between these two, but it made things pretty complicated. My biggest issue is with my alias. My second is with the case linked to my value columns, which I wish to name value.
Select (alias).id,
Case
When foo.value= 0 or bar.value=1 THEN 1
Else 0
End as value
from table1 t1
Join (Select id, min(value)
from table1 t1
join (Select id, value
Union
Select id, value
Union
Select id, value) as foo
on foo.id=t1.id
Group by id
UNION
Select id, max(value)
from table1 t1
join (Select id, value
Union
Select id, value) as bar
on bar.id=t1.id
Group by id) as (alias)
on ??.id=??.id
I wrote my case the way I think it should be written, but normally, when there are more than one column with the same name, SQL states it as ambiguous. I am still unsure if I should use UNION or INTERSECT, but I assume either of them would be done the same way. How should I deal with this?
If I'm reading this right, you probably want something like this:
SELECT ...
FROM ( ... union #1 ) AS u1
JOIN (... union #2 ) AS u2 ON u1.id = u2.id

TSQL: Return row(s) with earliest dates

Given 2 tables called "table1" and "table1_hist" that structurally resemble this:
TABLE1
id status date_this_status
1 open 2008-12-12
2 closed 2009-01-01
3 pending 2009-05-05
4 pending 2009-05-06
5 open 2009-06-01
TABLE1_hist
id status date_this_status
2 open 2008-12-24
2 pending 2008-12-26
3 open 2009-04-24
4 open 2009-05-04
With table1 being the current status and table1_hist being a history table of table1, how can I return the row for each id that has the earliest date? In other words, for each id, I need to know its earliest status and date.
EXAMPLE:
For id 1 earliest status and date is open and 2008-12-12.
For id 2 earliest status and date is open and 2008-12-24.
I've tried using MIN(datetime), unions, dynamic SQL, etc. I've just reached tsql writers block today and I'm stuck.
Edited to add: Ugh. This is for a SQL2000 database, so Alex Martelli's answer won't work. ROW_NUMBER wasn't introduced until SQL2005.
SQL Server 2005 and later support an interesting (relatively recent) aspect of SQL Standards, "ranking/windowing functions", allowing:
WITH AllRows AS (
SELECT id, status, date_this_status,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY date_this_status ASC) AS row
FROM (SELECT * FROM Table1 UNION SELECT * FROM Table1_hist) Both_tables
)
SELECT id, status, date_this_status
FROM AllRows
WHERE row = 1
ORDER BY id;
where I'm also using the nice (and equally "new") WITH syntax to avoid nesting the sub-query in the main SELECT.
This article shows how one could hack the equivalent of ROW_NUMBER (and also RANK and DENSE_RANK, the other two "new" ranking/windowing functions) in SQL Server 2000 -- but that's not necessarily pretty nor especially well-performing, alas.
The following code sample is completely self-sufficient, just copy and paste it into a management studio query and hit F5 =)
-- table variables holding the sample data from the question
DECLARE @TABLE1 TABLE
(
id INT,
status VARCHAR(50),
date_this_status DATETIME
)
DECLARE @TABLE1_hist TABLE
(
id INT,
status VARCHAR(50),
date_this_status DATETIME
)
--TABLE1
INSERT @TABLE1
SELECT 1, 'open', '2008-12-12' UNION ALL
SELECT 2, 'closed', '2009-01-01' UNION ALL
SELECT 3, 'pending', '2009-05-05' UNION ALL
SELECT 4, 'pending', '2009-05-06' UNION ALL
SELECT 5, 'open', '2009-06-01'
--TABLE1_hist
INSERT @TABLE1_hist
SELECT 2, 'open', '2008-12-24' UNION ALL
SELECT 2, 'pending', '2008-12-26' UNION ALL
SELECT 3, 'open', '2009-04-24' UNION ALL
SELECT 4, 'open', '2009-05-04'
-- earliest history row per id, falling back to the current row when there is no history
SELECT x.id,
ISNULL(y.[status], x.[status]) AS [status],
ISNULL(y.date_this_status, x.date_this_status) AS date_this_status
FROM @TABLE1 x
LEFT JOIN (
SELECT a.*
FROM @TABLE1_hist a
INNER JOIN (
SELECT id,
MIN(date_this_status) AS date_this_status
FROM @TABLE1_hist
GROUP BY id
) b
ON a.id = b.id
AND a.date_this_status = b.date_this_status
) y
ON x.id = y.id
SELECT id,
status,
date_this_status
FROM ( SELECT *
FROM Table1
UNION
SELECT *
from TABLE1_hist
) a
WHERE date_this_status = ( SELECT MIN(date_this_status)
FROM ( SELECT *
FROM Table1
UNION
SELECT *
from TABLE1_hist
) t
WHERE id = a.id
)
This is a bit ugly, but seems to work in MS SQL Server 2005.
You can do this with an exclusive self join. Join on the history table, and then another time on all earlier history entries. In the where statement, you specify that there are not allowed to be any earlier entries.
select t1.id,
isnull(hist.status, t1.status),
isnull(hist.date_this_status, t1.date_this_status)
from table1 t1
left join (
select h1.id, h1.status, h1.date_this_status
from table1_hist h1
left join table1_hist h2
on h2.id = h1.id
and h2.date_this_status < h1.date_this_status
where h2.date_this_status is null
) hist on hist.id = t1.id
A bit of a mind-bender, but fairly flexible and efficient!
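This assumes there are no two history entries with the exact same date. If there are, the tie has to be broken on a column that is unique per history row; comparing h2.id with h1.id cannot do it, because the join condition already forces them to be equal. Assuming a hypothetical unique column hist_id on the history table, write the self join like:
left join table1_hist h2
on h2.id = h1.id
and (
h2.date_this_status < h1.date_this_status
or (h2.date_this_status = h1.date_this_status and h2.hist_id < h1.hist_id) -- hist_id is hypothetical
)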
If I understand the OP correctly, a given ID may appear in TABLE1 or TABLE1_HISTORY or both.
In your result set, you want back each distinct ID and the oldest status/date associated with that ID, regardless which table the oldest one happens to be in.
So, look in BOTH tables and return any record where there is no record in either table for its ID that has a smaller date_this_status.
Try this:
SELECT ID, status, date_this_status FROM table1 ta WHERE
NOT EXISTS(SELECT null FROM table1 tb WHERE
tb.id = ta.id
AND tb.date_this_status < ta.date_this_status)
AND NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = ta.id
AND tbh.date_this_status < ta.date_this_status)
UNION ALL
SELECT ID, status, date_this_status FROM table1_history tah WHERE
NOT EXISTS(SELECT null FROM table1 tb WHERE
tb.id = tah.id
AND tb.date_this_status < tah.date_this_status)
AND NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = tah.id
AND tbh.date_this_status < tah.date_this_status)
Three underlying assumptions here:
Every ID you want back will have at least one record in at least one of the tables.
There won't be multiple records for the same ID in the same table with the same date_this_status value (can be mitigated by using DISTINCT)
There won't be records for the same ID in the other table with the same date_this_status value (can be mitigated by using UNION instead of UNION ALL)
There are two slight optimizations we can make:
If an ID has a record in TABLE1_HISTORY, it will always be older than the record in TABLE1 for that ID.
TABLE1 will never contain multiple records for the same ID (but the history table may).
So:
SELECT ID, status, date_this_status FROM table1 ta WHERE
NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = ta.id
)
UNION ALL
SELECT ID, status, date_this_status FROM table1_history tah WHERE
NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = tah.id
AND tbh.date_this_status < tah.date_this_status)
If that is the actual structure of your tables, you can't get a 100% accurate answer. The issue is that you can have 2 different statuses for the same (earliest) date for any given record, and you would not know which one was entered first, because you don't have a primary key on the history table.
Ignoring the "two tables" issues for a moment, I'd use the following logic...
SELECT
id, status, date
FROM
Table1_hist AS [data]
WHERE
[data].date = (SELECT MIN(date) FROM Table1_hist WHERE id = [data].id)
(EDIT: As per BlackTigerX's comment, this assumes no id can have more than one status with the same datetime.)
The simple way to extrapolate this to two tables is to use breitak67's answer. Replace all instances of "my_table" with subqueries that UNION the two tables together. A potential issue here is that of performance, as you may find that indexes become unusable.
One method of speeding this up could be to use implied knowledge:
1. The main table always has a record for each id.
2. The history table doesn't always have a record.
3. Any record in the history table is always 'older' than the one in main table.
SELECT
[main].id,
ISNULL([hist].status, [main].status),
ISNULL([hist].date, [main].date)
FROM
Table1 AS [main]
LEFT JOIN
(
SELECT
id, status, date
FROM
Table1_hist AS [data]
WHERE
[data].date = (SELECT MIN(date) FROM Table1_hist WHERE id = [data].id)
)
AS [hist]
ON [hist].id = [main].id
Find the oldest status for each id in the history table. (Can use its indexes)
LEFT JOIN that to the main table (which always has exactly one record for each id)
If [hist] contains a value, it's the older by definition
If the [hist] doesn't have a value, use the [main] value
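The four steps above can be tried out with the sketch below, which loads the question's sample rows into an in-memory SQLite database from Python. SQLite is only a stand-in for SQL Server here, so COALESCE replaces ISNULL, the column keeps the question's name date_this_status, and the aliases `cur` and `hist` are chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1      (id INTEGER PRIMARY KEY, status TEXT, date_this_status TEXT);
    CREATE TABLE table1_hist (id INTEGER, status TEXT, date_this_status TEXT);
    INSERT INTO table1 VALUES
        (1, 'open',    '2008-12-12'), (2, 'closed',  '2009-01-01'),
        (3, 'pending', '2009-05-05'), (4, 'pending', '2009-05-06'),
        (5, 'open',    '2009-06-01');
    INSERT INTO table1_hist VALUES
        (2, 'open', '2008-12-24'), (2, 'pending', '2008-12-26'),
        (3, 'open', '2009-04-24'), (4, 'open',    '2009-05-04');
""")

# Oldest history row per id, left-joined to the current table;
# ids without history fall back to their current row.
earliest = conn.execute("""
    SELECT cur.id,
           COALESCE(hist.status, cur.status) AS status,
           COALESCE(hist.date_this_status, cur.date_this_status) AS date_this_status
    FROM table1 AS cur
    LEFT JOIN (
        SELECT h.id, h.status, h.date_this_status
        FROM table1_hist AS h
        WHERE h.date_this_status =
              (SELECT MIN(date_this_status) FROM table1_hist WHERE id = h.id)
    ) AS hist ON hist.id = cur.id
    ORDER BY cur.id
""").fetchall()

for row in earliest:
    print(row)
```

This relies on the same implied knowledge as the answer: every id exists in the main table, and any history row is older than the current one.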