How to compare two tables in Postgresql? - sql

I have two identical tables:
A : id1, id2, qty, unit
B: id1, id2, qty, unit
The set of (id1,id2) is identifying each row and it can appear only once in each table.
I have 140 rows in table A and 141 rows in table B.
I would like to find all the keys (id1,id2) that are not appearing in both tables. There is 1 for sure but there can't be more (for example if each table has whole different data).
I wrote this query:
(TABLE a EXCEPT TABLE b)
UNION ALL
(TABLE b EXCEPT TABLE a) ;
But it's not working. It compares the whole table where I don't care if qty or unit are different, I only care about id1,id2.

use a full outer join:
select a.*,b.*
from a full outer join b
on a.id1=b.id1 and a.id2=b.id2
this show both tables side by side. with gaps where there is an unmatched row.
select a.*,b.*
from a full outer join b
on a.id1=b.id1 and a.id2=b.id2
where a.id1 is null or b.id1 is null;
that will only show unmatched rows.
or you can use not in
select * from a
where (id1,id2) not in
( select id1,id2 from b )
that will show rows from a not matched by b.
or the same result using a join
select a.*
from a left outer join b
on a.id1=b.id1 and a.id2=b.id2
where b.id1 is null
sometimes the join is faster than the "not in"

Here is an example of using EXCEPT to see what records are different. Reverse the select statements to see what is different. a except s / then s except a
SELECT
a.address_entrytype,
a.address_street,
a.address_city,
a.address_state,
a.address_postal_code,
a.company_id
FROM
prospects.address a
except
SELECT
s.address_entrytype,
s.address_street,
s.address_city,
s.address_state,
s.address_postal_code,
s.company_id
FROM
prospects.address_short s

Related

What's the purpose of a JOIN where no column from 2nd table is being used?

I am looking through some hive queries we are running as part of analytics on our hadoop cluster, but I am having trouble understanding one. This is the Hive QL query
SELECT
c_id, v_id, COUNT(DISTINCT(m_id)) AS participants,
cast(date_sub(current_date, ${window}) as string) as event_date
from (
select
a.c_id, a.v_id, a.user_id,
case
when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id
end as m_id
from (
select * from first
where event_date <= cast(date_sub(current_date, ${window}) as string)
) a
join (
select * from second
) b on a.c_id = b.c_id
left join third c
on a.user_id = c.id1
) dx
group by c_id, v_id;
I have changed the names but otherwise this is the select statement being used to insert overwrite to another table.
Regarding the join
join (
select * from second
) b on a.c_id = b.c_id
b is not used anywhere except for join condition, so is this join serving any purpose at all?
Is it for making sure that this join only has entries where c_id is present in second table? Would a where IN condition be better if thats all this is doing.
Or I can just remove this join and it won't make any difference at all.
Thanks.
Join (any inner, left or right) can duplicate rows if join key in joined dataset is not unique. For example if a contains single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
Join (inner) can filter rows if join key is absent in joined dataset. I believe this is what it meant to do.
If the goal is to get only rows with keys present in both datasets(filter) and you do not want duplication, and you do not use columns from joined dataset, then better use LEFT SEMI JOIN instead of JOIN, it will work as filter only even if there are duplicated keys in joined dataset:
left semi join (
select c_id from second
) b on a.c_id = b.c_id
This is much safer way to filter rows only which exist in both a and b and avoid unintended duplication.
You can replace join with WHERE IN/EXISTS, but it makes no difference, it is implemented as the same JOIN, check the EXPLAIN output and you will see the same query plan. Better use LEFT SEMI JOIN, it implements uncorrelated IN/EXISTS in efficient way.
If you prefer to move it to the WHERE:
WHERE a.c_id IN (select c_id from second)
or correlated EXISTS:
WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)
But as I said, all of them are implemented internally using JOIN operator.

Distinct IDs from one table for inner join SQL

I'm trying to take the distinct IDs that appear in table a, filter table b for only these distinct IDs from table a, and present the remaining columns from b. I've tried:
SELECT * FROM
(
SELECT DISTINCT
a.ID,
a.test_group,
b.ch_name,
b.donation_amt
FROM table_a a
INNER JOIN table_b b
ON a.ID=b.ID
ORDER by a.ID;
) t
This doesn't seem to work. This query worked:
SELECT DISTINCT a.ID, a.test_group, b.ch_name, b.donation_amt
FROM table_a a
inner join table_b b
on a.ID = b.ID
order by a.ID
But I'm not entirely sure this is the correct way to go about it. Is this second query only going to take unique combinations of a.ID and a.test_group or does it know to only take distinct values of a.ID which is what I want.
Your first and second query are similar.(just that you can not use ; inside your query) Both will produce the same result.
Even your second query which you think is giving you desired output, can not produce the output what you actually want.
Distinct works on the entire column list of the select clause.
In your case, if for the same a.id there is different a.test_group available then it will have multiple records with same a.id and different a.test_group.

Does wrapping my Coalesce in a subquery make my query more efficient or does it do nothing?

Lets say I have a query where one field can appear in either Table A or Table B but not both. So to retrieve it I use Coalesce.
Something like
Select
...
Coalesce(A.Number,B.Number) Number
...
From Table A
Left Join Table B on A.C= B.C
Now lets say I want to join another table to that Number field
should I just do
Join Table Z on Z.Z = Coalesce(A.Number,B.Number)
Or is it better to wrap my original table in a query and join on the definite result. So something like
Select * from (
Select
...
Coalesce(A.Number,B.Number) Number
...
From Table A
Left Join Table B on A.C= B.C
) T
left join Table Z on Z.Number= T.Number
Does this make a difference?
if i were joining another table to the result of the first query instead of a sub query i would place the first part in a CTE whenever possible, i believe the performance would be the same as a subquery but CTEs are more readable in my opinion.
with cte1 as
(
Select
...
Coalesce(A.Number,B.Number) Number
...
From Table A
Left Join Table B
on A.C= B.C
)
select *
from cte1 a
Join Table Z
on Z.Z = a.number

join or merge two table based on id merge

I have two tables:
I am looking for the results like mentioned in the last.
I tried union (only similar col can be merged), left join, right join i am getting repeated fields in Null areas what can be other options where i can get null without column repeating
A full join would get all results from both tables.
select
A.ID,
A.ColA,
A.ColB,
B.ColC,
B.ColD
from TableA A
full join Table B on A.ID = B.ID
Here is a good post to understand joins
You can try distinct:
select distinct * from
tableA a,
tableB b
where a.id = b.id;
It will not give any duplicate tuples.

In SQL, a Join is actually an Intersection? And it is also a linkage or a "Sideway Union"?

I always thought of a Join in SQL as some kind of linkage between two tables.
For example,
select e.name, d.name from employees e, departments d
where employees.deptID = departments.deptID
In this case, it is linking two tables, to show each employee with a department name instead of a department ID. And kind of like a "linkage" or "Union" sideway".
But, after learning about inner join vs outer join, it shows that a Join (Inner join) is actually an intersection.
For example, when one table has the ID 1, 2, 7, 8, while another table has the ID 7 and 8 only, the way we get the intersection is:
select * from t1, t2 where t1.ID = t2.ID
to get the two records of "7 and 8". So it is actually an intersection.
So we have the "Intersection" of 2 tables. Compare this with the "Union" operation on 2 tables. Can a Join be thought of as an "Intersection"? But what about the "linking" or "sideway union" aspect of it?
You're on the right track; the rows returned by an INNER JOIN are those that satisfy the join conditions. But this is like an intersection only because you're using equality in your join condition, applied to columns from each table.
Also be aware that INTERSECTION is already an SQL operation and it has another meaning -- and it's not the same as JOIN.
An SQL JOIN can produce a new type of row, which has all the columns from both joined tables. For example: col4, col5, and col6 don't exist in table A, but they do exist in the result of a join with table B:
SELECT a.col1, a.col2, a.col3, b.col4, b.col5, b.col6
FROM A INNER JOIN B ON a.col2=b.col5;
An SQL INTERSECTION returns rows that are common to two separate tables, which must already have the same columns.
SELECT col1, col2, col3 FROM A
INTERSECT
SELECT col1, col2, col3 FROM B;
This happens to produce the same result as the following join:
SELECT a.col1, a.col2, a.col3
FROM A INNER JOIN B ON a.col1=b.col1 AND a.col2=b.col2 AND a.col3=b.col3;
Not every brand of database supports the INTERSECTION operator.
A join 'links' or erm... joins the rows from two tables. I think that's what you mean by 'sideways union' although I personally think that is a terrible way to phrase it. But there are different types of joins that do slightly different things:
An inner join is indeed an intersection.
A full outer join is a union.
This page on Jeff Atwood's blog describes other possibilities.
An Outer Join - is not related to - Union or Union All.
For example, a 'null' would not occur as a result of Union or Union All operation, but it results from an Outer Join.
INNER JOIN treats two NULLs as two different values. So, if you join based on a nullable column, and if both tables have NULL values in that column, then INNER JOIN will ignore those rows.
Therefore, to correctly retrieve all common rows between two tables, INTERSECT should be used. INTERSECT treats two NULLs as the same value.
Example(SQLite):
Create two tables with nullable columns:
CREATE TABLE Table1 (id INT, firstName TEXT);
CREATE TABLE Table2 (id INT, firstName TEXT);
Insert NULL values:
INSERT INTO Table1 VALUES (1, NULL);
INSERT INTO Table2 VALUES (1, NULL);
Retrieve common rows using INNER JOIN (This shows no output):
SELECT * FROM Table1 INNER JOIN Table2 ON
Table1.id=Table2.id AND Table1.firstName=Table2.firstName;
Retrieve common rows using INTERSECT (This correctly shows the common row):
SELECT * FROM Table1 INTERSECT SELECT * FROM Table2;
Conclusion:
Even though, many times both INTERSECT and INNER JOIN can be used to get the same results, they are not the same and should be picked depending on the situation.