Find difference between two big tables in PostgreSQL


I have two similar tables in Postgres, each with just one 32-byte Latin field (a simple md5 hash).
Both tables have ~30,000,000 rows. The tables differ only slightly (10-1000 rows are different).
Is it possible with Postgres to find the difference between these tables? The result should be the 10-1000 rows I described above.
This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.

EXISTS seems like the best option.
tbl1 is the table with surplus rows in this example:
SELECT *
FROM tbl1
WHERE NOT EXISTS (SELECT FROM tbl2 WHERE tbl2.col = tbl1.col);
If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:
SELECT *
FROM tbl1
FULL OUTER JOIN tbl2 USING (col)
WHERE tbl2.col IS NULL OR
tbl1.col IS NULL;
Overview over basic techniques in a later post:
Select rows which are not present in other table
Aside: The data type uuid is efficient for md5 hashes:
Convert hex in text representation to decimal number
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
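
As a small, hedged illustration of the NOT EXISTS approach above, here is a sketch that runs the anti-join in both directions. It uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL, and the three sample values per table are invented for the demo; only the names tbl1, tbl2 and col come from the answer.

```python
import sqlite3

# Assumption: SQLite (in memory) stands in for PostgreSQL; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (col TEXT PRIMARY KEY);
    CREATE TABLE tbl2 (col TEXT PRIMARY KEY);
    INSERT INTO tbl1 VALUES ('aaa'), ('bbb'), ('ccc');
    INSERT INTO tbl2 VALUES ('bbb'), ('ccc'), ('ddd');
""")

# One anti-join per direction; the literal 'src' column records
# which table each surplus row came from.
diff = conn.execute("""
    SELECT 'tbl1' AS src, col FROM tbl1
    WHERE NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col)
    UNION ALL
    SELECT 'tbl2' AS src, col FROM tbl2
    WHERE NOT EXISTS (SELECT 1 FROM tbl1 WHERE tbl1.col = tbl2.col)
    ORDER BY src
""").fetchall()

print(diff)  # [('tbl1', 'aaa'), ('tbl2', 'ddd')]
```

UNION ALL is enough here because the two halves cannot produce the same (src, col) pair.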

To augment the existing answers: I use the row() function for the join condition, which lets you compare entire rows. My typical query to see the symmetric difference looks like this:
select *
from tbl1
full outer join tbl2
on row(tbl1) = row(tbl2)
where tbl1.col is null
or tbl2.col is null

If you want to find the difference without knowing which table has more rows than the other, you can try this option, which gets all rows present in only one of the tables (col here stands for the hash column):
SELECT * FROM A
WHERE NOT EXISTS (SELECT * FROM B WHERE B.col = A.col)
UNION
SELECT * FROM B
WHERE NOT EXISTS (SELECT * FROM A WHERE A.col = B.col)

In my experience, NOT IN with a subquery takes a very long time. I'd do it with an outer join:
DELETE FROM table1 WHERE id IN (
SELECT table1.id FROM table1
LEFT OUTER JOIN table2 ON table1.hashfield = table2.hashfield
WHERE table2.hashfield IS NULL)
And then do the same the other way around for the other table.

Related

Concatenate ALL values from 2 tables using SQL?

I am trying to use SQL to create a table that concatenates all dates from a specific range to all items in another table. See image for an example.
I have a solution where I can create a column of "null" values in both tables and join on that column but wondering if there is a more sophisticated approach to doing this.
Example image
I've tried the following:
Added a constant value to each table
Then I joined the 2 tables on that constant value so that each row matched each row of both tables.
This got the intended result but I'm wondering if there's a better way to do this where I don't have to add the constant values:
SELECT c.Date_,k.user_email
FROM `operations-div-qa.7_dbtCloud.calendar_table_hours_st` c
JOIN `operations-div-qa.7_dbtCloud.table_key` k
ON c.match = k.match
ORDER BY Date_,user_email asc

What the image shows is not a concatenation, it's a join:
select t1.dates Date ,t2.name Person
from table t1,table t2;

Cross join should work for you:
It joins every row from both tables with each other. Use this when there is no relationship between the tables.
Did not test so syntax may be slightly off.
SELECT c.Date_,k.user_email
FROM `operations-div-qa.7_dbtCloud.calendar_table_hours_st` c
CROSS JOIN `operations-div-qa.7_dbtCloud.table_key` k
ORDER BY Date_,user_email asc
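
To make the CROSS JOIN behaviour concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for BigQuery. The table names calendar and users and their contents are hypothetical; the point is only that every date is paired with every user without any helper "match" column.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; tables and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar (date_ TEXT);
    CREATE TABLE users (user_email TEXT);
    INSERT INTO calendar VALUES ('2023-01-01'), ('2023-01-02');
    INSERT INTO users VALUES ('a@example.com'), ('b@example.com'), ('c@example.com');
""")

# CROSS JOIN needs no ON clause: it returns the Cartesian product.
pairs = conn.execute("""
    SELECT c.date_, u.user_email
    FROM calendar AS c
    CROSS JOIN users AS u
    ORDER BY c.date_, u.user_email
""").fetchall()

print(len(pairs))  # 6 (2 dates x 3 users)
print(pairs[0])    # ('2023-01-01', 'a@example.com')
```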

Which is best to use between the IN and JOIN operators in SQL server for the list of values as table two?

I heard that the IN operator is costlier than the JOIN operator.
Is that true?
Example case for IN operator:
SELECT *
FROM table_one
WHERE column_one IN (SELECT column_one FROM table_two)
Example case for JOIN operator:
SELECT *
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?

tl;dr; - once the queries are fixed so that they will yield the same results, the performance is the same.
Both queries are not the same, and will yield different results.
The IN query will return all the columns from table_one,
while the JOIN query will return all the columns from both tables.
That can be solved easily by replacing the * in the second query with table_one.*, or better yet, by specifying only the columns you want to get back from the query (which is best practice).
However, even if that issue is fixed, the queries might still yield different results if the values in table_two.column_one are not unique.
The IN query will yield a single record from table_one even if it matches multiple records in table_two, while the JOIN query will duplicate the record once for every row that satisfies the ON clause.
Having said all that - if the values in table_two.column_one are guaranteed to be unique, and the join query is changed to select table_one.*... - then, and only then, will both queries yield the same results - and that would be a valid question to compare their performance.
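
The duplication effect described above is easy to reproduce. The following sketch uses Python's built-in sqlite3 module as a stand-in for SQL Server, with invented sample rows in which the value 1 appears twice in table_two; the table and column names are the ones from the question.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_one (column_one INTEGER);
    CREATE TABLE table_two (column_one INTEGER);
    INSERT INTO table_one VALUES (1), (2);
    INSERT INTO table_two VALUES (1), (1), (3);  -- 1 is not unique
""")

# IN is a semi-join: each table_one row appears at most once.
in_rows = conn.execute("""
    SELECT column_one FROM table_one
    WHERE column_one IN (SELECT column_one FROM table_two)
""").fetchall()

# JOIN repeats the table_one row once per matching table_two row.
join_rows = conn.execute("""
    SELECT t1.column_one
    FROM table_one AS t1
    JOIN table_two AS t2 ON t1.column_one = t2.column_one
""").fetchall()

print(in_rows)    # [(1,)]
print(join_rows)  # [(1,), (1,)]
```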
So, in the performance front:
The IN operator has a history of poor performance with a large values list: in earlier versions of SQL Server, if you used the IN operator with, say, 10,000 or more values, it would have suffered from a performance issue.
With a small values list (say, up to 5,000, probably even more) there's absolutely no difference in performance.
However, in currently supported versions of SQL Server (that is, 2012 or higher), the query optimizer is smart enough to understand that in the conditions specified above these queries are equivalent and might generate exactly the same execution plan for both queries - so performance will be the same for both queries.
UPDATE: I've done some performance research on the only version of SQL Server I have available, which is 2016.
First, I've made sure that Column_One in Table_Two is unique by setting it as the primary key of the table.
CREATE TABLE Table_One
(
id int,
CONSTRAINT PK_Table_One PRIMARY KEY(Id)
);
CREATE TABLE Table_Two
(
column_one int,
CONSTRAINT PK_Table_Two PRIMARY KEY(column_one)
);
Then, I've populated both tables with 1,000,000 (one million) rows.
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY @@SPID) As N INTO Tally
FROM sys.objects A
CROSS JOIN sys.objects B
CROSS JOIN sys.objects C;
INSERT INTO Table_One (id)
SELECT N
FROM Tally;
INSERT INTO Table_Two (column_one)
SELECT N
FROM Tally;
Next, I ran four different ways of getting all the values of table_one that match values of table_two. The first two are from the original question (with minor changes), the third is a simplified version of the join query, and the fourth uses the EXISTS operator with a correlated subquery instead of the IN operator:
SELECT *
FROM table_one
WHERE Id IN (SELECT column_one FROM table_two);
SELECT TOne.*
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.id = TTwo.column_one;
SELECT TOne.*
FROM table_one TOne
JOIN table_two AS TTwo
ON TOne.id = TTwo.column_one;
SELECT *
FROM table_one
WHERE EXISTS
(
SELECT 1
FROM table_two
WHERE column_one = id
);
All four queries yielded the exact same result with the exact same execution plan, so it's safe to say that performance, under these circumstances, is exactly the same.
You can copy the full script (with comments) from Rextester (result is the same with any number of rows in the tally table).

From a performance point of view, EXISTS is in most cases a better option than the IN operator or a JOIN between the tables:
SELECT TOne.*
FROM table_one TOne
WHERE EXISTS ( SELECT 1 FROM table_two TTwo WHERE TOne.column_one = TTwo.column_one )
If you need the columns from both tables, and provided those have indexes on the column column_one used in the join condition, using a JOIN would be better than using an IN operator, since you will be able to benefit from the indexes :
SELECT TOne.*, TTwo.*
FROM table_one TOne
JOIN table_two TTwo
ON TOne.column_one = TTwo.column_one

In the above query, which is recommended to use and why?
The second (JOIN) query cannot be optimal compared to the first query unless you put a WHERE clause within the subquery, as follows:
Select * from table_one TOne
JOIN (select column_one from table_two where column_two = 'Some Value') AS TTwo
ON TOne.column_one = TTwo.column_one
However, a better decision can be made from the execution plan, taking the following points into consideration:
How many tasks the query has to perform to get the result
What the type and execution time of each task is
The variance between the estimated number of rows and the actual number of rows in each task; if the variance is too high, this can be fixed by running UPDATE STATISTICS on the table.
In general, the logical processing order of the SELECT statement is as follows. If your query reads fewer rows/pages at an earlier step in this order, it incurs less logical I/O and ends up better optimized; i.e., it's better to filter rows in the FROM or WHERE clause than in the GROUP BY or HAVING clause.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP

Deleting records from a table without primary key

I need to delete some specific records from a database table, but the table itself does not have a primary key, so the condition depends on other tables. What is the correct way to do that?
delete from table_1
where exists
(select distinct tb_1.*
from table_1 tb_1, table_2 tb_2, table_3 tb_3
where tb_1.col = tb_2.col
and tb_3.col = tb_2.col
and tb_3.col_2 = 10)
Is that the correct way to do that? Let's say table_1 has 4 columns and the first two columns should be the criteria for removal.

If the SELECT version of your query returns the rows you want to delete, then you're good. A couple of things, though:
Use the ANSI-compliant explicit join syntax instead of the comma-delimited, implicit syntax (which has long since been deprecated). The explicit syntax looks better and is easier to read, anyway.
Correlate your EXISTS back to the main table. And you don't need a DISTINCT; EXISTS returns true whether there is 1 matching row or 10 billion.
SELECT *
FROM table_1 tb_1
WHERE EXISTS (SELECT *
FROM table_2 tb_2
JOIN table_3 tb_3 ON tb_2.col = tb_3.col
WHERE tb_1.col = tb_2.col
AND tb_3.col_2 = 10)
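
Once the SELECT returns the right rows, the same correlated EXISTS can drive the DELETE. Below is a minimal sketch of that step, using Python's built-in sqlite3 module as a stand-in for the asker's (unspecified) database. The sample rows and the note column are invented; the table and column names otherwise follow the query above.

```python
import sqlite3

# Assumption: SQLite stands in for the real DBMS; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (col INTEGER, note TEXT);   -- no primary key
    CREATE TABLE table_2 (col INTEGER);
    CREATE TABLE table_3 (col INTEGER, col_2 INTEGER);
    INSERT INTO table_1 VALUES (1, 'delete me'), (2, 'keep'), (3, 'keep');
    INSERT INTO table_2 VALUES (1), (2);
    INSERT INTO table_3 VALUES (1, 10), (2, 99);
""")

# The subquery refers to table_1.col, so it is evaluated per candidate row.
# Without that correlation, EXISTS would be true for every row (or none).
deleted = conn.execute("""
    DELETE FROM table_1
    WHERE EXISTS (SELECT 1
                  FROM table_2 AS tb_2
                  JOIN table_3 AS tb_3 ON tb_2.col = tb_3.col
                  WHERE tb_2.col = table_1.col
                    AND tb_3.col_2 = 10)
""").rowcount

remaining = conn.execute("SELECT col FROM table_1 ORDER BY col").fetchall()
print(deleted)    # 1
print(remaining)  # [(2,), (3,)]
```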

Number of Records don't match when Joining three tables

Despite going through every material I could possibly find on the internet, I haven't been able to solve this issue myself. I am new to MS Access and would really appreciate any pointers.
Here's my problem - I have three tables
Source1084 with columns - Department, Sub-Dept, Entity, Account, +few more
R12CAOmappingTable with columns - Account, R12_Account
Table4 with columns - R12_Account, Department, Sub-Dept, Entity, New Dept, LOB +few more
I have a total of 1084 records in Source and the result table must also contain 1084 records. I need to draw a table with all the columns from Source + R12_account from R12CAOmappingTable + all columns from Table4.
Here is the query I wrote. It yields the right columns, but returns more or fewer records than expected depending on which join options I use.
SELECT rmt.r12_account,
srb.version,
srb.fy,
srb.joblevel,
srb.scenario,
srb.department,
srb.[sub-department],
srb.[job function],
srb.entity,
srb.employee,
table4.lob,
table4.product,
table4.newacct,
table4.newdept,
srb.[beg balance],
srb.jan,
srb.feb,
srb.mar,
srb.apr,
srb.may,
srb.jun,
srb.jul,
srb.aug,
srb.sep,
srb.oct,
srb.nov,
srb.dec,
rmt.r12_account
FROM (source1084 AS srb
LEFT JOIN r12caomappingtable AS rmt
ON srb.account = rmt.account)
LEFT JOIN table4
ON ( srb.department = table4.dept )
AND ( srb.[sub-department] = table4.subdept )
AND ( srb.entity = table4.entity )
WHERE ( ( ( srb.[sub-department] ) = table4.subdept )
AND ( ( srb.entity ) = table4.entity )
AND ( ( rmt.r12_account ) = table4.r12_account ) );

In this simple example, Table1 contains 3 rows with unique fld1 values. Table2 contains one row, and the fld1 value in that row matches one of those in Table1. Therefore this query returns 3 rows.
SELECT *
FROM
Table1 AS t1
LEFT JOIN Table2 AS t2
ON t1.fld1 = t2.fld1;
However if I add the WHERE clause as below, that version of the query returns only one row --- the row where the fld1 values match.
SELECT *
FROM
Table1 AS t1
LEFT JOIN Table2 AS t2
ON t1.fld1 = t2.fld1
WHERE t1.fld1 = t2.fld1;
In other words, that WHERE clause counteracts the LEFT JOIN because it excludes rows where t2.fld1 is Null. If that makes sense, notice that second query is functionally equivalent to this ...
SELECT *
FROM
Table1 AS t1
INNER JOIN Table2 AS t2
ON t1.fld1 = t2.fld1;
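
The three queries above can be checked with a quick sketch. It uses Python's built-in sqlite3 module as a stand-in for Access, and the concrete fld1 values (1, 2, 3 in Table1 and 2 in Table2) are assumptions chosen to match the description.

```python
import sqlite3

# Assumption: SQLite stands in for MS Access; the values are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (fld1 INTEGER);
    CREATE TABLE Table2 (fld1 INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (2);
""")

def count_rows(sql):
    """Run a SELECT COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

left_join = count_rows("""
    SELECT COUNT(*) FROM Table1 AS t1
    LEFT JOIN Table2 AS t2 ON t1.fld1 = t2.fld1
""")
left_join_where = count_rows("""
    SELECT COUNT(*) FROM Table1 AS t1
    LEFT JOIN Table2 AS t2 ON t1.fld1 = t2.fld1
    WHERE t1.fld1 = t2.fld1
""")
inner_join = count_rows("""
    SELECT COUNT(*) FROM Table1 AS t1
    INNER JOIN Table2 AS t2 ON t1.fld1 = t2.fld1
""")

print(left_join, left_join_where, inner_join)  # 3 1 1
```

The WHERE comparison evaluates to NULL for the unmatched rows, so they are filtered out and the result matches the INNER JOIN.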
Your situation is similar. I suggest you first eliminate the WHERE clause and confirm this query returns at least your expected 1084 rows.
SELECT Count(*) AS CountOfRows
FROM (source1084 AS srb
LEFT JOIN r12caomappingtable AS rmt
ON srb.account = rmt.account)
LEFT JOIN table4
ON ( srb.department = table4.dept )
AND ( srb.[sub-department] = table4.subdept )
AND ( srb.entity = table4.entity );
After you get the query returning the correct number of rows, you can alter the SELECT list to return the columns you want. But the columns aren't really the issue until you can get the correct rows.

Without knowing the values in your tables it is hard to give a complete answer to your question. Based on how you described it, the issue is more than likely caused by the type of joins you are using.
The best way I have found to understand which type of join to use is to reference a Venn diagram explaining the different types of joins.
Jeff Atwood also has a really good explanation of SQL joins on his site using the above method.

Best to just use the query builder. Drop in your main table. Choose the columns you want. Now for any of the other lookup values then simply drop in the other tables, draw the join line(s), double click and use a left join. You can do this for 2 or 30 columns that need to "grab" or lookup other values from other tables. The number of ORIGINAL rows in the base table returned should ALWAYS remain the same.
So just use the query builder and follow the above.
The problem with your posted SQL is you NESTED the joins inside (). Don't do that. (or let the query builder do this for you – they tend to be quite messy but will also work).
Just use this:
FROM source1084 AS srb
LEFT JOIN r12caomappingtable AS rmt
ON srb.account = rmt.account
LEFT JOIN table4
ON ( srb.department = table4.dept )
AND ( srb.[sub-department] = table4.subdept )
AND ( srb.entity = table4.entity )
As noted, I don't see why you are "repeating" the conditions again in the where clause.

Joining two tables on columns with dissimilar (but connected) values

How can I join two tables on columns whose values are linked but not identical?
For instance, I need to join tbl1 to tbl2 where tbl1.col=100 and tbl2.col=200. The only connection they have is to me/my company.
Is there a way to link the rows without an explicit shared value? I need all tbl1 rows with col value 100 to be on the same row as all tbl2 rows with col value 200.

You can put some logic in your join predicate, as in:
select *
from tbl1 as a
join tbl2 as b on a.col + 100 = b.col
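
As a hedged illustration of a computed join predicate like the one above, the sketch below runs it with Python's built-in sqlite3 module standing in for the unspecified DBMS. The label column and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for the real DBMS; rows and 'label' are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (col INTEGER, label TEXT);
    CREATE TABLE tbl2 (col INTEGER, label TEXT);
    INSERT INTO tbl1 VALUES (100, 'left-100'), (150, 'left-150');
    INSERT INTO tbl2 VALUES (200, 'right-200'), (300, 'right-300');
""")

# The ON clause may be any boolean expression, not just column = column.
rows = conn.execute("""
    SELECT a.label, b.label
    FROM tbl1 AS a
    JOIN tbl2 AS b ON a.col + 100 = b.col
""").fetchall()

print(rows)  # [('left-100', 'right-200')]
```

This only works when the relationship can be calculated; an arbitrary business mapping would need a lookup table instead.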

Is there a way to link the rows without an explicit shared value?
Yes. You can write a custom JOIN to relate data yourself.
You didn't specify your specific DBMS, so the following examples contain generic SQL.
SELECT * FROM tbl1, tbl2 WHERE tbl1.col = 100 AND tbl2.col = 200
Or, more dynamically:
SELECT * FROM tbl1, tbl2 WHERE tbl1.col + 100 = tbl2.col;
-- with JOIN
SELECT * FROM tbl1 JOIN tbl2 ON (tbl1.col + 100) = tbl2.col;

select
*
from
tbl1
inner join
tbl2
on tbl1.col = 100 and tbl2.col = 200
weird, but it will work

If I understand your problem correctly, you have two tables that logically relate to each other, but the current keys in the tables don't (you have business rules that put them together). I think you need to create a cross-reference table that maps that relationship. The cross-reference table would map the primary keys of the two tables to each other to show the logical relationship between the data.
I think all of the other posters have made the assumption that the relationship is one you can calculate, but I don't think that is what you are asking. Correct me if I'm wrong.
