Basic difference between two tables - sql

I am attempting a very basic difference query in PostgreSQL. Table 1 and Table 2 have identical columns. The only difference is that Table 1 has some surplus rows. I would like to select the surplus rows only:
SELECT *
FROM table1
WHERE NOT EXISTS (SELECT * from table2);
The query above returns nothing when I know there are surplus rows.

I think you are looking for except:
select t1.*
from table1 t1
except
select t2.*
from table2 t2;
Note that the two queries must return the same number of columns, and the corresponding columns must have compatible types. You can review the PostgreSQL documentation on combining queries (UNION, INTERSECT, EXCEPT) for details.

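If you wish to use NOT EXISTS, you are missing the join on your tables' keys in the inner WHERE clause. Without it, the subquery returns rows whenever table2 is non-empty, so NOT EXISTS is false for every row of table1. Try: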
SELECT *
FROM table1 t1
WHERE NOT EXISTS (SELECT * from table2 t2 WHERE t2.id = t1.id);
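To see the difference between the two forms, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example is self-contained), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (id INTEGER, name TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');  -- row 3 is the surplus row
    INSERT INTO table2 VALUES (1, 'a'), (2, 'b');
""")

# Uncorrelated subquery: it returns rows as long as table2 is non-empty,
# so NOT EXISTS is false for every row of table1.
uncorrelated = conn.execute(
    "SELECT * FROM table1 WHERE NOT EXISTS (SELECT * FROM table2)"
).fetchall()

# Correlated subquery: checked once per table1 row, matching on the key.
correlated = conn.execute(
    "SELECT * FROM table1 t1 "
    "WHERE NOT EXISTS (SELECT * FROM table2 t2 WHERE t2.id = t1.id)"
).fetchall()

# EXCEPT compares whole rows instead of a single key column.
except_rows = conn.execute(
    "SELECT * FROM table1 EXCEPT SELECT * FROM table2"
).fetchall()

print(uncorrelated, correlated, except_rows)
```

With this data the uncorrelated version comes back empty, while the correlated NOT EXISTS and EXCEPT both return the surplus row.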

Related

SQL IN operator value of subquery

I want to get a value from an IN subquery with two columns, without needing to do two queries.
Sample:
SELECT * FROM table1 WHERE id IN(SELECT id, flags FROM table2);
Now I want to get flags directly. Is it possible, and if yes, how?
Any help is appreciated :)
It sounds like you are trying to achieve one of two things:
1) Select every field of records in table1 (and the associated table 2 flag) where the record's id is also found in the id column of table2. If that is the case, then yes, a join will accomplish what you want:
SELECT t1.*,
t2.flags
FROM table1 t1
JOIN table2 t2
ON t1.id = t2.id;
Note that JOIN is used here (rather than other types of joins such as LEFT JOIN) because JOIN will return only table1 records with a match in table2.id. LEFT JOIN, on the other hand, would return every table1 record, and table1 ids without a match in table2 would simply have null in the flags column of your returned table.
2) Select every field of records in table1 where the record's id is also found in either the id column of table2 or the flags column of table2. If that is the case, there are a few ways you could get the desired result; one option is a subquery similar to the one in the question:
SELECT *
FROM table1
WHERE id IN (SELECT id FROM table2 UNION DISTINCT SELECT flags FROM table2)
You can do this using a join:
SELECT t1.*, t2.flags
FROM table1 t1 JOIN
table2 t2
ON t1.id = t2.id;
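As a rough illustration of the JOIN versus LEFT JOIN behaviour described above, here is a small sketch using Python's sqlite3 module. The table contents and the name column are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (id INTEGER, flags INTEGER);
    INSERT INTO table1 VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO table2 VALUES (1, 10), (3, 30);  -- no row for id 2
""")

# Inner join: only table1 rows with a match in table2, plus their flags.
inner_rows = conn.execute(
    "SELECT t1.*, t2.flags FROM table1 t1 "
    "JOIN table2 t2 ON t1.id = t2.id ORDER BY t1.id"
).fetchall()

# Left join: every table1 row; flags is NULL (None in Python) without a match.
left_rows = conn.execute(
    "SELECT t1.*, t2.flags FROM table1 t1 "
    "LEFT JOIN table2 t2 ON t1.id = t2.id ORDER BY t1.id"
).fetchall()

print(inner_rows)
print(left_rows)
```

Note that if table2 holds several rows for the same id, either join returns one output row per match, which the IN version would not do.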

Comparing two tables for equality in HIVE

I have two tables, table1 and table2. Each with the same columns:
key, c1, c2, c3
I want to check whether these tables are equal to each other (they have the same rows). So far I have these two queries (<> = not equal in HIVE):
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
And
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null
So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.
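The first one misses rows where the keys match but t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null: a comparison such as t1.c1<>t2.c1 evaluates to NULL rather than true when either side is null, so that row is not counted even though the tables differ.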
The second one will find rows that exist in t1 but not in t2.
To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:
select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
If you want to check for duplicates and the tables have exactly the same structure and the tables do not have duplicates within them, then you can do:
select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
(select t2.*, 2 as which from table2 t2)
) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;
There are various ways that you can relax the conditions in the first paragraph, if necessary.
Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
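Here is a small sketch of why NULL values matter. It runs the first query from the question and the UNION ALL / GROUP BY version side by side, using Python's sqlite3 module as a stand-in for Hive (an assumption for the demo). The two rows of sample data are invented, and the inner parentheses and the which column are dropped because SQLite does not need them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO table1 VALUES (1, 'a', NULL, 'x'), (2, 'b', 'q', 'y');
    INSERT INTO table2 VALUES (1, 'a', 'z',  'x'), (2, 'b', 'q', 'y');
""")

# First query from the question: NULL <> 'z' is NULL, not true,
# so the differing row with key 1 is not counted.
naive_count = conn.execute("""
    SELECT count(*) FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.key = t2.key
    WHERE t2.key IS NULL OR t1.c1 <> t2.c1 OR t1.c2 <> t2.c2 OR t1.c3 <> t2.c3
""").fetchone()[0]

# GROUP BY puts NULLs in the same group, so rows that appear in only
# one table end up with a count of 1 instead of 2.
mismatches = conn.execute("""
    SELECT t.key, t.c1, t.c2, t.c3, count(*) AS cnt
    FROM (SELECT * FROM table1 UNION ALL SELECT * FROM table2) t
    GROUP BY t.key, t.c1, t.c2, t.c3
    HAVING count(*) <> 2
""").fetchall()

print(naive_count, len(mismatches))
```

With this data the join-based count is 0 while the grouped query reports two mismatching rows, one from each table.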
Well, one effective way is to calculate a hash sum for each table and compare the two sums.
No matter how many columns there are or what their data types are, as long as the two tables have the same schema, you can use the following queries to do the comparison:
select sum(hash(*)) from t1;
select sum(hash(*)) from t2;
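Then you just need to compare the returned values. Keep in mind that equal sums make equality very likely but do not strictly prove it, since hash collisions are possible.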
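The same idea can be sketched outside of Hive. The snippet below is only an illustration: it does not use Hive's hash() function, but sums a SHA-256 digest of each row in Python, and the helper name table_checksum and the three sample tables are hypothetical.

```python
import hashlib
import sqlite3

def table_checksum(conn, table):
    """Order-independent checksum: the sum of per-row SHA-256 digests."""
    total = 0
    # The table name is interpolated directly, so it must come from trusted code.
    for row in conn.execute(f"SELECT * FROM {table}"):
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % (1 << 256)
    return total

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (key INTEGER, c1 TEXT);
    CREATE TABLE t2 (key INTEGER, c1 TEXT);
    CREATE TABLE t3 (key INTEGER, c1 TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES (2, 'b'), (1, 'a');  -- same rows, different order
    INSERT INTO t3 VALUES (1, 'a'), (2, 'x');  -- one value differs
""")

same = table_checksum(conn, "t1") == table_checksum(conn, "t2")
different = table_checksum(conn, "t1") != table_checksum(conn, "t3")
print(same, different)
```

Because addition is commutative, row order does not affect the result, which is the property that makes a summed hash convenient for comparing tables.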
I used the EXCEPT statement and it worked.
select * from Original_table
EXCEPT
select * from Revised_table
This shows all the rows of the original table that are not in the revised table.
If your table is partitioned, you will have to provide a partition predicate.
FYI, partition values don't need to be provided if you use Presto and query via SQL Lab.
I would recommend not using JOINs to compare tables:
it is quite an expensive operation when tables are big (which is often the case in Hive)
it can cause problems when some rows/"join keys" are repeated
(and it can also be impractical when data are in different clusters/datacenters/clouds).
Instead, I think using a checksum approach and comparing the checksums of both tables is best.
I have developed a Python script that lets you do such a comparison easily and see the differences in a web browser:
https://github.com/bolcom/hive_compared_bq
I hope that can help you!
1: First get the row count for both tables, C1 and C2. C1 and C2 should be equal. Each count can be obtained with the following query (run it once per table):
select count(*) from table1
If C1 and C2 are not equal, then the tables are not identical.
2: Find the distinct row count for both tables, DC1 and DC2. DC1 and DC2 should be equal. The number of distinct records can be found using the following query:
select count(*) from (select distinct * from table1) t
If DC1 and DC2 are not equal, the tables are not identical.
3: Now get the number of records obtained by performing a union of the 2 tables. Let it be U. Use the following query to get the number of records in a union of the 2 tables:
SELECT count(*)
FROM
(SELECT *
FROM table1
UNION
SELECT *
FROM table2) u
You can say that the data in the 2 tables is identical if the distinct count for each of the 2 tables is equal to the number of records obtained by performing a union of the 2 tables, i.e. DC1 = U and DC2 = U.
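The three steps above can be strung together as a rough sketch. It uses Python's sqlite3 module in place of Hive (an assumption for the demo), and the helper names scalar and looks_identical as well as the sample tables are made up.

```python
import sqlite3

def scalar(conn, sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

def looks_identical(conn, a, b):
    """Apply the three count checks; table names must come from trusted code."""
    # Step 1: plain row counts.
    if scalar(conn, f"SELECT count(*) FROM {a}") != scalar(conn, f"SELECT count(*) FROM {b}"):
        return False
    # Step 2: distinct row counts.
    dc1 = scalar(conn, f"SELECT count(*) FROM (SELECT DISTINCT * FROM {a}) d")
    dc2 = scalar(conn, f"SELECT count(*) FROM (SELECT DISTINCT * FROM {b}) d")
    if dc1 != dc2:
        return False
    # Step 3: UNION removes duplicates, so U grows if either table has a row the other lacks.
    u = scalar(conn, f"SELECT count(*) FROM (SELECT * FROM {a} UNION SELECT * FROM {b}) u")
    return dc1 == u and dc2 == u

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT);
    CREATE TABLE table3 (key INTEGER, c1 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table2 VALUES (2, 'b'), (1, 'a');
    INSERT INTO table3 VALUES (1, 'a'), (2, 'c');
""")

print(looks_identical(conn, "table1", "table2"))
print(looks_identical(conn, "table1", "table3"))
```

One caveat: these checks compare the sets of distinct rows plus the total counts, so two tables that repeat different rows (for example a, a, b versus a, b, b) would still pass.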
Another variant:
select c1-c2 "different row counts"
, c1-c3 "mismatched rows"
from
( select count(*) c1 from table1)
,( select count(*) c2 from table2 )
,(select count(*) c3 from table1 t1, table2 t2
where t1.key= t2.key
and T1.c1=T2.c1 )
Try a WITH clause:
With cnt as(
select count(*) cn1 from table1
)
select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2);
One easy solution is to do an inner join. Suppose we have two Hive tables, table1 and table2. Both tables have the same columns, col1, col2 and col3, and they should also have the same number of rows. The query would then be as follows:
select count(*) from table1
inner join table2
on table1.col1 = table2.col1
and table1.col2 = table2.col2
and table1.col3 = table2.col3 ;
If the output value is the same as the number of rows in table1 and table2, then all the columns have the same values. If, however, the output count is lower, then some of the data differ.
Use a MINUS operator:
SELECT count(*) FROM
(SELECT t1.c1, t1.c2, t1.c3 from table1 t1
MINUS
SELECT t2.c1, t2.c2, t2.c3 from table2 t2)

Multiple Indexes or Single One in a comparison between two large tables

I'm going to compare two tables on Oracle with about 10 million records in each one.
t1 (anumber, bnumber, cdate, ctime, duration)
t2 (fcode, anumber, bnumber, mdate, mtime, odate, otime, duration)
Rows in these tables record calls from one number to another for a specific month (August 2012).
For example, (12345,9876,120821,120000,68) indicates a call from anumber=12345 to bnumber=9876 on date=2012/08/21 at time=12:00:00 which lasted for 68 seconds.
I want to find records that do not exist in one of these tables but exist in the other. My comparison query is like this:
select t1.*
from table1 t1
where not exists(select t1.* from table2 t2
where t1.anumber = t2.anumber
and t1.cdate = t2.mdate
and t1.duration = t2.duration);
and my questions are:
Which kind of index is better to use? A multi-column index on (anumber, cdate, duration) or a single index on each of them?
Considering that the third column is the duration of a call, which could have a wide range, is it worth creating an index on it? Won't it slow down my query?
What is the fastest way to find the differences between these tables?
Is it better to loop through dates and execute my query with (cdate='A DATE MONTH') added to the where clause?
Compared to the above query, how much slower is this one:
select t1.*
from table1 t1
where not exists (select t1.*
from table2 t2
where t1.anumber = t2.anumber
and t1.bnumber like '%t2.bnumber%'
and t1.cdate = t2.mdate
and t1.duration = t2.duration);
select * from t1
minus
select * from t2
Don't use indexes: you want to scan all 10 million rows in both tables, so a TABLE_ACCESS_FULL is preferable in this case.
Try it this way:
select t1.*
from table1 t1
where (t1.anumber, t1.cdate, t1.duration) not in (select t2.anumber, t2.mdate, t2.duration
from table2 t2);
Look at the execution plan; if it is good, don't create indexes, otherwise create one like this:
create index idx_anum_dat_dur on table2(anumber, mdate, duration)
Query performance depends on the data and on the results that will be returned, so you must look at the execution plan and try different variants.
Your query is equivalent to:
select t1.*
from table1 t1
WHERE t1.bnumber like '%t2.bnumber%' -- << like 'Literal' !!!
AND NOT EXISTS (
select *
from table2 t2
where t2.anumber = t1.anumber
and t2.mdate = t1.cdate
and t2.duration = t1.duration
);
(Column references inside quotes are not expanded! Is this the OP's intention?)
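If the intention was to match on the value of t2.bnumber, the pattern has to be built by concatenation. Here is a minimal sketch of the difference, using Python's sqlite3 module rather than Oracle (an assumption for the demo; both use || for string concatenation). The two sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (anumber INTEGER, bnumber TEXT);
    CREATE TABLE table2 (anumber INTEGER, bnumber TEXT);
    INSERT INTO table1 VALUES (12345, '0219876');
    INSERT INTO table2 VALUES (12345, '9876');
""")

# Quoted pattern: compares against the literal text "t2.bnumber".
literal_matches = conn.execute(
    "SELECT count(*) FROM table1 t1 JOIN table2 t2 ON t1.anumber = t2.anumber "
    "WHERE t1.bnumber LIKE '%t2.bnumber%'"
).fetchone()[0]

# Concatenated pattern: uses the value stored in the t2.bnumber column.
column_matches = conn.execute(
    "SELECT count(*) FROM table1 t1 JOIN table2 t2 ON t1.anumber = t2.anumber "
    "WHERE t1.bnumber LIKE '%' || t2.bnumber || '%'"
).fetchone()[0]

print(literal_matches, column_matches)
```

A pattern with a leading wildcard generally cannot use an ordinary index on bnumber, so the corrected version is likely to be slower than the plain equality comparisons.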

How to extract non-duplicate values in two tables

I have two tables, each containing columns with numbers. I need to compare columns in both tables and extract the numbers that exist in the first table and do not exist in the second one. I don't need unique values.
I wrote this query:
SELECT Table1.Numbers, Table1.Name
FROM Table1, Table2
WHERE Table1.Numbers != Table2.numbers
Since I am working with several million records, can someone recommend a more efficient query which would provide me with identical results?
I would use NOT EXISTS:
SELECT Table1.Numbers, Table1.Name
FROM Table1
WHERE NOT EXISTS(
SELECT 1 FROM Table2
WHERE Table1.Numbers=Table2.Numbers
)
Other approaches:
Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?
You can do this easily by checking for the existence of the number in Table2.
SELECT T1.Numbers
,T1.Name
FROM Table1 T1
WHERE NOT EXISTS (SELECT 1 FROM Table2 T2 WHERE T2.Numbers = T1.Numbers)
Try this (assuming that your Numbers column is not nullable):
SELECT T1.Numbers, T1.Name
FROM Table1 AS T1
LEFT JOIN Table2 AS T2 ON T1.Numbers = T2.Numbers
WHERE T2.Numbers IS NULL;
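To compare the approaches mentioned so far, here is a small sketch using Python's sqlite3 module (the database engine in the question is not stated, so this is an assumption). The sample data is invented and deliberately includes a NULL in Table2 to show where NOT IN behaves differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Numbers INTEGER, Name TEXT);
    CREATE TABLE Table2 (Numbers INTEGER);
    INSERT INTO Table1 VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO Table2 VALUES (2), (NULL);
""")

not_exists = conn.execute(
    "SELECT T1.Numbers, T1.Name FROM Table1 T1 "
    "WHERE NOT EXISTS (SELECT 1 FROM Table2 T2 WHERE T2.Numbers = T1.Numbers) "
    "ORDER BY T1.Numbers"
).fetchall()

left_join = conn.execute(
    "SELECT T1.Numbers, T1.Name FROM Table1 AS T1 "
    "LEFT JOIN Table2 AS T2 ON T1.Numbers = T2.Numbers "
    "WHERE T2.Numbers IS NULL ORDER BY T1.Numbers"
).fetchall()

# NOT IN against a list containing NULL never evaluates to true,
# so this version returns no rows at all.
not_in = conn.execute(
    "SELECT Numbers, Name FROM Table1 "
    "WHERE Numbers NOT IN (SELECT Numbers FROM Table2)"
).fetchall()

print(not_exists, left_join, not_in)
```

NOT EXISTS and the LEFT JOIN agree here, while NOT IN returns nothing because of the NULL, which is one reason NOT EXISTS is often the safer default.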
SELECT
Table1.Numbers, Table1.Name
FROM
Table1, Table2
GROUP BY
Table1, Table2
HAVING
COUNT(*) > 1

SQL SELECT across two tables

I am a little confused as to how to approach this SQL query.
I have two tables (with an equal number of records), and I would like to return a column which is the division between the two.
In other words, here is my not-working-correctly query:
SELECT( (SELECT v FROM Table1) / (SELECT DotProduct FROM Table2) );
How would I do this? All I want is a column where each row equals the same row in Table1 divided by the same row in Table2. The resulting table should have the same number of rows, but I am getting something with a lot more rows than the original two tables.
I am at a complete loss. Any advice?
It sounds like you have some kind of key between the two tables. You need an Inner Join:
select t1.v / t2.DotProduct
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
Should work. Just make sure you watch out for division by zero errors.
You didn't specify the full table structure so I will assume a common ID column to link rows in the tables.
SELECT table1.v/table2.DotProduct
FROM Table1 INNER JOIN Table2
ON (Table1.ID=Table2.ID)
You need to do a JOIN on the tables and divide the columns you want.
SELECT (Table1.v / Table2.DotProduct) FROM Table1 JOIN Table2 ON something
You need to substitute something to tell SQL how to match up the rows:
Something like: Table1.id = Table2.id
In case your fields are both integers, you need to do this to avoid integer math:
select t1.v / (t2.DotProduct*1.00)
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
If you have multiple values in table2 relating to values in table1, you need to specify which to use. Here I chose the largest one.
select t1.v / (max(t2.DotProduct)*1.00)
from Table1 as t1
inner join Table2 as t2
on t1.ForeignKey = t2.PrimaryKey
Group By t1.v
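Here is a short sketch of the integer-math and division-by-zero points, using Python's sqlite3 module (the engine in the question is not stated, so this is an assumption). The ID column and the sample values are made up, and NULLIF is one common way to guard the divisor, not something taken from the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER, v INTEGER);
    CREATE TABLE Table2 (ID INTEGER, DotProduct INTEGER);
    INSERT INTO Table1 VALUES (1, 7), (2, 9), (3, 5);
    INSERT INTO Table2 VALUES (1, 2), (2, 3), (3, 0);
""")

# Integer / integer truncates in SQLite (and in several other engines).
integer_division = conn.execute(
    "SELECT t1.ID, t1.v / t2.DotProduct FROM Table1 t1 "
    "INNER JOIN Table2 t2 ON t1.ID = t2.ID ORDER BY t1.ID"
).fetchall()

# Multiplying by 1.0 forces floating-point math; NULLIF turns a zero
# divisor into NULL so engines that raise on division by zero do not fail.
safe_division = conn.execute(
    "SELECT t1.ID, t1.v / (NULLIF(t2.DotProduct, 0) * 1.0) FROM Table1 t1 "
    "INNER JOIN Table2 t2 ON t1.ID = t2.ID ORDER BY t1.ID"
).fetchall()

print(integer_division)
print(safe_division)
```

SQLite itself returns NULL for division by zero, so the NULLIF guard matters mostly on engines that raise an error instead.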