Find differences in both tables without multiple joins - sql

After comparing two data sets I'd like to extract information such as:
Rows that are only present in table A
Rows that are only present in table B
Non-key value differences after join
What's the preferred way to go about this? Is there a way to do this without having to do LEFT and RIGHT joins separately?

It sounds like you want a FULL OUTER JOIN. That gives you all rows from both tables, joining the ones whose keys match. Then you can see which rows are present in only one table, and compare values for rows that are in both.

I would typically use group by for this, so I'm not sure what the reference to multiple joins is.
select col1, col2, col3, sum(in_a) as a_cnt, sum(in_b) as b_cnt
from ((select col1, col2, col3, 1 as in_a, 0 as in_b
from a
) union all
(select col1, col2, col3, 0 as in_a, 1 as in_b
from b
)
) ab
group by col1, col2, col3;
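As a rough sketch of how the union all / group by approach surfaces differences, here is a small runnable version using Python's built-in sqlite3 module. The sample rows and the extra having filter (which hides rows that are identical in both tables) are assumptions added for illustration; they are not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table a (col1 integer, col2 text, col3 text);
    create table b (col1 integer, col2 text, col3 text);
    insert into a values (1, 'x', 'p'), (2, 'y', 'q'), (3, 'z', 'r');
    insert into b values (2, 'y', 'q'), (3, 'z', 'CHANGED'), (4, 'w', 's');
""")

# Tag every row with its source table, then count per distinct row.
# The HAVING filter is an added assumption: it keeps only rows whose
# counts differ between a and b.
rows = conn.execute("""
    select col1, col2, col3, sum(in_a) as a_cnt, sum(in_b) as b_cnt
    from (select col1, col2, col3, 1 as in_a, 0 as in_b from a
          union all
          select col1, col2, col3, 0 as in_a, 1 as in_b from b) ab
    group by col1, col2, col3
    having sum(in_a) <> sum(in_b)
    order by col1, col3
""").fetchall()

for row in rows:
    print(row)
```

A row with a_cnt = 1 and b_cnt = 0 exists only in a, and the reverse means it exists only in b. A changed non-key value (col1 = 3 here) shows up as two rows, one per table, which you can pair up by key.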

Related

SQL - Transposing rows from some columns in a table to each record in the same table

I am using a platform that accepts only a minimal set of SQL functions. The UNPIVOT function cannot be used on the platform, so I have to do this manually. I am thinking along the lines of UNION ALL and then CROSS JOIN (which I attempted, but ended up with the wrong record counts). Please see image attached.
Any help / pointer will be highly appreciated!
I don't know how you used UNION ALL but it can be done like this:
select col1, col2, col3 as NewCol from Table1
union all
select col1, col2, col4 from Table1
You could also use an ORDER BY clause, so that rows with the same col1 and col2 appear in consecutive rows:
select col1, col2, NewCol
from (
select col1, col2, col3 as NewCol, 1 as ord from Table1
union all
select col1, col2, col4, 2 from Table1
) t
order by col1, col2, ord
A portable approach uses union all:
select col1, col2, col3 as newcol from mytable
union all
select col1, col2, col4 from mytable
If your database supports lateral joins (also called cross apply in some databases) and values(), this can be simplified:
select t.col1, t.col2, x.newcol
from mytable t
cross join lateral (values(col3), (col4)) x(newcol)
You can use a cross join, but it requires some case logic. The exact syntax depends on the database, but something like this:
select t.col1, t.col2,
(case when n.n = 1 then t.col3 else t.col4 end) as newcol
from t cross join
(select 1 as n union all select 2) n;
To load another table, you would do one of the following:
insert these results into a table that has already been created.
Use select into or create table as depending on the database.
If you care about the ordering, then you can add order by t.col1, t.col2, n.n.
In most cases, a simple union all approach is fine (such as GMB suggests). That approach requires scanning the table twice, which incurs some additional overhead. However, if the "table" is really a complex query or view, then only processing it once is a bigger advantage.
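To make the comparison concrete, here is a minimal sketch that runs both the union all version and the cross join version against the same data and checks that they agree. It uses Python's sqlite3 module; the table contents and the nums alias are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table mytable (col1 text, col2 text, col3 integer, col4 integer);
    insert into mytable values ('a', 'x', 10, 20), ('b', 'y', 30, 40);
""")

# Approach 1: read the table twice, once per source column.
union_rows = conn.execute("""
    select col1, col2, newcol
    from (select col1, col2, col3 as newcol, 1 as ord from mytable
          union all
          select col1, col2, col4, 2 from mytable) u
    order by col1, col2, ord
""").fetchall()

# Approach 2: read the table once and duplicate each row with a
# two-row numbers table, picking the column with CASE.
cross_rows = conn.execute("""
    select t.col1, t.col2,
           case when nums.n = 1 then t.col3 else t.col4 end as newcol
    from mytable t
    cross join (select 1 as n union all select 2) nums
    order by t.col1, t.col2, nums.n
""").fetchall()

print(union_rows)
print(cross_rows)
```

Both queries return two output rows per input row, so a different record count usually means the cross join was made against something other than a fixed two-row set.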

Counting matching rows of two same tables and counting rows of the table

I have the same table structure called "table1" under two different schemas, "schema1" and "schema2". "table1" contains the columns "col1, col2, col3". Initially I wanted to see whether there are records with the same values of col1 and col2 in schema1.table1 and schema2.table1. But I mistyped schema2.table1 as schema1.table1, and now I am confused by the query result.
SELECT COUNT(*) FROM schema1.table1 AS s1t, schema1.table1 AS s2t
WHERE s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2;
I got
count
-------
530
(1 row)
However, SELECT COUNT(*) FROM schema1.table1; shows that there are 17815 rows.
Why would the first query show there are only 530 satisfied records? Shouldn't it be 17815 as well?
You can use a FULL OUTER JOIN to also see the mismatched rows, with NULLs in the columns (col1 and col2) of the side that has no match. That way at least 17815 rows are returned:
SELECT COUNT(*)
FROM schema1.table1 AS s1t
FULL OUTER JOIN schema1.table1 AS s2t
ON s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2
In your query, only rows with matching values in both columns (col1 and col2) are returned.
You are joining the table to itself. That is really strange.
In any case, your join is going to filter out any rows where col1 or col2 are NULL.
In addition, the self-join might multiply the number of rows if there are duplicates (with respect to the two columns) in the table.
It is really unclear why you would be doing this, but the above explains the results you are seeing.
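A small sketch of both effects, using Python's sqlite3 module with made-up rows (the data is an assumption for illustration): rows with a NULL in either join column never match, even against themselves, while duplicate (col1, col2) pairs match each other and multiply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (col1 integer, col2 integer, col3 text);
    insert into table1 values
        (1, 1, 'a'),
        (1, 1, 'b'),         -- duplicate (col1, col2) pair
        (2, null, 'c'),      -- NULL in a join column
        (null, 3, 'd'),      -- NULL in a join column
        (null, null, 'e');   -- NULL in both join columns
""")

total = conn.execute("select count(*) from table1").fetchone()[0]

# Same self-join as in the question, written with explicit JOIN syntax.
joined = conn.execute("""
    select count(*)
    from table1 as s1t
    join table1 as s2t
      on s1t.col1 = s2t.col1 and s1t.col2 = s2t.col2
""").fetchone()[0]

print(total, joined)
```

Here the two duplicate rows produce 2 x 2 = 4 joined rows and the three rows containing NULLs produce none, so the joined count can end up below or above the table's row count depending on the data.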
If you want to compare the results in the two schemas allowing for duplicates and missing values, I recommend union all/group by:
select col1, col2, sum(cnt1) as cnt1, sum(cnt2) as cnt2
from ((select col1, col2, count(*) as cnt1, 0 as cnt2
from schema1.table1
group by col1, col2
) union all
(select col1, col2, 0 as cnt1, count(*) as cnt2
from schema2.table1
group by col1, col2
)
) t12
group by col1, col2
having sum(cnt1) <> sum(cnt2);
This returns pairs where the counts are not the same in the two tables. It even works for NULL values. If you ran this on the same table, no rows would be returned.
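Here is a hedged, runnable sketch of that comparison using Python's sqlite3 module. The two schemas are simulated by attaching two in-memory databases, the sample rows are invented, and the output aliases are renamed to total1/total2 so they do not clash with the inner column names; all of these are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simulate the two schemas with attached in-memory databases.
conn.execute("attach database ':memory:' as schema1")
conn.execute("attach database ':memory:' as schema2")
conn.executescript("""
    create table schema1.table1 (col1 integer, col2 integer, col3 text);
    create table schema2.table1 (col1 integer, col2 integer, col3 text);
    insert into schema1.table1 values
        (1, 1, 'a'), (1, 1, 'b'), (2, null, 'c'), (3, 3, 'd');
    insert into schema2.table1 values
        (1, 1, 'a'), (2, null, 'c'), (2, null, 'e'), (3, 3, 'd');
""")

# Count each (col1, col2) pair per schema, then keep the pairs whose
# counts differ. GROUP BY puts NULLs in one group, so they are compared too.
diffs = conn.execute("""
    select col1, col2, sum(cnt1) as total1, sum(cnt2) as total2
    from (select col1, col2, count(*) as cnt1, 0 as cnt2
          from schema1.table1 group by col1, col2
          union all
          select col1, col2, 0 as cnt1, count(*) as cnt2
          from schema2.table1 group by col1, col2) t12
    group by col1, col2
    having sum(cnt1) <> sum(cnt2)
    order by col1
""").fetchall()

print(diffs)
```

The pair (3, 3) has the same count on both sides and is filtered out, while the pair with a NULL col2 is still reported because its counts differ.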

Perform Union/OR Operation between where clause and having Clause

I am working on a SQL query that should return the union of the rows matched by a WHERE clause and the rows matched by a HAVING clause.
For example,
Select * from table where col1= 'get' group by col2 (OR/UNION) having avg(col3) >30 . This is not valid SQL, but it illustrates the use case.
The purpose of the SQL statement is to return a result set that covers both the WHERE and the HAVING conditions.
Let's say I have a table, table1, with columns col1, col2, col3, col4 and a large amount of data. There is a use case in which the user selects filters with specific criteria, col1 ='Y', avg(col2) >10, avg(col3*col4) =30, from a filter list. I have to build a condition that returns all results satisfying col1 ='Y' OR avg(col2) >10 OR avg(col3*col4) =30, as we would with the OR operator in a WHERE clause, except that here both a WHERE clause and a HAVING clause are involved.
Like, the below query
resultset1 <= select * from table1 where col1= 'get';
resultset2 <= select * from table1 group by col2 having avg(col3) >30
final results = resultset1+ resultset2
Does anyone have a better approach or ideas for implementing such a scenario?
Let's say I have filter combinations as below:
col1 =23
OR
avg(col2) >30
AND
avg(col3) =10
OR
avg(col1) <10
AND
col2 =10
I need to display results satisfying these criteria in SQL.
It's not clear what you want from this quasi-SQL. I guess you need to select records using two conditions, col1= 'get' and avg(col3) >30, combined with AND or OR. So here is the solution:
Select * from table
where (col1= 'get')
OR
col2 in (SELECT col2 FROM table GROUP BY col2 HAVING avg(col3) >30)
If you need both conditions to be true, replace OR with AND.
If you need to compute AVG only over rows with col1 = 'get', add this condition to the subquery:
Select * from table
where (col1= 'get')
OR
col2 in (SELECT col2 FROM table WHERE (col1= 'get')
GROUP BY col2
HAVING avg(col3) >30)
SELECT <resultset1> --resultset based on a WHERE clause
UNION
SELECT <resultset2> --resultset based on HAVING
In general, if you want a union of resultsets, use ... UNION.
Using OR in a condition is equivalent to UNION (because the UNION operator is the relational algebra equivalent of logical disjunction), but it requires the scope of the involved conditions to be identical.
In this case, this is impossible because a HAVING condition applies not to the table mentioned in the SELECT, but instead to an intermediate table that is "silently" created by the GROUP BY clause. This is inevitably so because things like AVG, SUM, ... only make sense if it is also determined which set of rows must be used to compute the AVG, SUM, ... over, and that is what the GROUP BY specification does.
EDIT
In SQL, UNION comes in distinct flavours, UNION DISTINCT and UNION ALL. One eliminates duplicates, the other won't. If you want the exact same behaviour as OR, you'll obviously need the one that eliminates duplicates from its result set.
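As a sketch of how the two suggestions relate, the following Python/sqlite3 example runs the OR-with-subquery form and the UNION form on the same data and checks that they return the same rows. The id column and the sample rows are assumptions added so the results can be compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (id integer primary key, col1 text, col2 text, col3 integer);
    insert into table1 (col1, col2, col3) values
        ('get', 'g1', 10),
        ('put', 'g1', 20),
        ('get', 'g2', 40),
        ('put', 'g2', 50),
        ('put', 'g3', 5);
""")

# OR form: the HAVING condition is pushed into a subquery over groups.
or_rows = conn.execute("""
    select id, col1, col2, col3 from table1
    where col1 = 'get'
       or col2 in (select col2 from table1
                   group by col2 having avg(col3) > 30)
    order by id
""").fetchall()

# UNION form: two result sets merged, with duplicates removed.
union_rows = conn.execute("""
    select id, col1, col2, col3 from table1 where col1 = 'get'
    union
    select id, col1, col2, col3 from table1
    where col2 in (select col2 from table1
                   group by col2 having avg(col3) > 30)
    order by id
""").fetchall()

print(or_rows)
print(union_rows)
```

The row with id 3 satisfies both conditions but appears once in each result, which is why the duplicate-eliminating UNION (not UNION ALL) matches the OR behaviour here.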

selecting entire table when selecting columns - How

Currently I have joined several tables together. Now I want to select some columns from this joined result.
My question is: is it possible to select all columns of a specific table?
For example:
select col1,col4,col7,all_columns.Table1,col9 from (joined tables including Table1)
If it is possible, how do I implement it?
Is this what you want?
select col1,
col4,
col7,
Table1.*,
col9
from (joined tables including Table1)
How about this?
SELECT ... Table1.*, ... FROM ...
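For a quick check of what Table1.* expands to, here is a minimal sketch with Python's sqlite3 module. The table definitions and the join key are hypothetical, since the original join is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Table1 (id integer, a text, b text);
    create table Table2 (id integer, col9 text);
    insert into Table1 values (1, 'a1', 'b1');
    insert into Table2 values (1, 'nine');
""")

# Table1.* expands to every column of Table1, in table order,
# and can be mixed freely with individually named columns.
cur = conn.execute("""
    select Table2.col9, Table1.*
    from Table1
    join Table2 on Table2.id = Table1.id
""")
columns = [d[0] for d in cur.description]
row = cur.fetchone()

print(columns)
print(row)
```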

Efficient query for finding duplicate records

I need to query a table for duplicate deposit records, where two deposits made at one cash terminal, for the same amount, within a certain time window, are considered duplicate records. I've started working on a query now, but I would appreciate any advice or suggestions on doing this 'properly'.
Generally, you'd do a self join to the same table, and put your "duplicate" criteria in the join conditions.
E.g.
SELECT
*
FROM
Transactions t1
inner join
Transactions t2
on
t1.Terminal = t2.Terminal and
t1.Amount = t2.Amount and
DATEDIFF(minute,t2.TransactionDate,t1.TransactionDate) between 0 and 10 and
t1.TransactionID > t2.TransactionID /* prevent matching the same row */
Simple aggregate
SELECT
col1, col2, col3, ...
FROM
MyTable
GROUP BY
col1, col2, col3, ...
HAVING
COUNT(*) >= 2
Don't include your identity/key/PK column: this will be unique per row and mess up the aggregate.
To pick a row to remove or keep, take the MAX or MIN of that column:
SELECT
col1, col2, col3, ...,
MAX(IDCol) AS RowToDelete,
MIN(IDCol) AS RowToKeep
FROM
MyTable
GROUP BY
col1, col2, col3, ...
HAVING
COUNT(*) >= 2
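Of course, with three or more duplicates, MAX(IDCol) covers only one of the extra rows, so work from the "keep" side instead: delete every row in the group except RowToKeep.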
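A minimal sketch of the aggregate approach, run through Python's sqlite3 module. The table contents are invented, and the delete statement is one possible way to act on the "keep" value rather than something taken from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table MyTable (IDCol integer primary key, col1 text, col2 integer);
    insert into MyTable (col1, col2) values
        ('a', 1), ('a', 1), ('a', 1), ('b', 2), ('c', 3), ('c', 3);
""")

# Groups with more than one row are duplicates; MIN(IDCol) is the survivor.
dupes = conn.execute("""
    select col1, col2, count(*) as cnt, min(IDCol) as RowToKeep
    from MyTable
    group by col1, col2
    having count(*) >= 2
    order by col1
""").fetchall()

# Assumed cleanup step: delete everything that is not the first row of
# its group. This handles groups with any number of duplicates.
conn.execute("""
    delete from MyTable
    where IDCol not in (select min(IDCol) from MyTable group by col1, col2)
""")
remaining = conn.execute(
    "select IDCol, col1, col2 from MyTable order by IDCol"
).fetchall()

print(dupes)
print(remaining)
```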
Edit:
For rows within a time window, use a self join or a window/ranking function.
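The self-join variant for a time window might look like the sketch below, written for SQLite via Python's sqlite3 module. SQLite has no DATEDIFF, so the window is expressed as a difference in epoch seconds from strftime('%s', ...), which measures elapsed time rather than minute boundaries crossed. The 10-minute window and the sample deposits are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Transactions (
        TransactionID integer primary key,
        Terminal text,
        Amount integer,
        TransactionDate text
    );
    insert into Transactions values
        (1, 'T1', 100, '2024-01-01 10:00:00'),
        (2, 'T1', 100, '2024-01-01 10:05:00'),  -- duplicate of 1
        (3, 'T1', 100, '2024-01-01 11:00:00'),  -- outside the window
        (4, 'T2', 100, '2024-01-01 10:02:00'),  -- different terminal
        (5, 'T1', 250, '2024-01-01 10:06:00');  -- different amount
""")

# Pair each deposit with an earlier one at the same terminal for the
# same amount, made at most 600 seconds (10 minutes) before it.
pairs = conn.execute("""
    select t2.TransactionID as earlier_id, t1.TransactionID as later_id
    from Transactions t1
    join Transactions t2
      on t1.Terminal = t2.Terminal
     and t1.Amount = t2.Amount
     and cast(strftime('%s', t1.TransactionDate) as integer)
       - cast(strftime('%s', t2.TransactionDate) as integer)
         between 0 and 600
     and t1.TransactionID > t2.TransactionID  -- prevent matching the same row
    order by earlier_id, later_id
""").fetchall()

print(pairs)
```

Like the query above, this assumes that a higher TransactionID never has an earlier timestamp; if that does not hold, compare on the absolute time difference instead.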