INTERSECT two tables of 500 million rows each in Vertica - SQL

I am very new to Vertica and am looking for efficient ways to compare two tables averaging 500 to 800 million rows each. I have a process that reads data from a Vertica view and dumps it into SQL Server, where it is later merged into a final table. For a few large tables combined, it dumps about 3 billion rows daily. Instead of dumping all the data, I want to take a daily snapshot, compare it with the previous day's snapshot on the Vertica side only, and then push only the changed rows to SQL Server.
Let's say the previous snapshot is stored in tableA and today's snapshot in tableB. The PK on both tables is a column named OrderId.
The simplest way I can think of is:
Select * from tableB
Where OrderId NOT IN (
SELECT * from tableA
INTERSECT
SELECT * from tableB
)
So my questions are:
Is there any other/better option in Vertica to get only the changed rows between two tables? Or should I even consider doing this comparison on the Vertica side?
How long should such a comparison take?
What should I consider to improve the performance of such a query?

If your columns have no NULL values, then a massive LEFT JOIN would seem to do what you want:
select b.*
from tableB b left join
tableA a
on b.OrderId = a.OrderId and
b.col1 = a.col1 and
. . . -- for all the columns you care about
where a.OrderId is null
However, I think you want except:
select b.*
from tableB b
except
select a.*
from tableA a;
I imagine this would have reasonable performance.
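To make the EXCEPT approach concrete, here is a minimal sketch on toy data. It runs against SQLite through Python's sqlite3 module purely as a stand-in for Vertica, and the `status` column is made up for illustration; only the table names and OrderId come from the question.

```python
import sqlite3

# SQLite is only a stand-in for Vertica; the `status` column is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (OrderId INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE tableB (OrderId INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO tableA VALUES (1, 'open'), (2, 'open'),    (3, 'closed');
    INSERT INTO tableB VALUES (1, 'open'), (2, 'shipped'), (4, 'open');
""")

# Rows of today's snapshot that have no identical row in yesterday's snapshot.
changed_or_new = conn.execute("""
    SELECT * FROM tableB
    EXCEPT
    SELECT * FROM tableA
    ORDER BY 1
""").fetchall()

print(changed_or_new)
```

On this data the result contains the modified order 2 and the new order 4. Order 3, which was deleted, is not reported; that would need the reverse EXCEPT. Set operators compare NULLs as equal in standard SQL, which is why this form does not need the "no NULL values" caveat of the join version.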

Do you have a primary key in the two tables?
Then my technique, for a complete Change Data Capture, is:
SELECT
'I' AS to_do
, newrows.*
FROM tb_today newrows
LEFT
JOIN tb_yesterday oldrows USING(id)
WHERE oldrows.id IS NULL
UNION ALL
SELECT
'U' AS to_do
, newrows.*
FROM tb_today newrows
JOIN tb_yesterday oldrows USING(id)
WHERE oldrows.fname <> newrows.fname
OR oldrows.lname <> newrows.lname
OR oldrows.bdate <> newrows.bdate
OR oldrows.sal <> newrows.sal
[...]
OR oldrows.lastcol <> newrows.lastcol
UNION ALL
SELECT
'D' AS to_do
, oldrows.*
FROM tb_yesterday oldrows
LEFT
JOIN tb_today newrows USING(id)
WHERE newrows.id IS NULL
;
Just leave out the last leg of the UNION SELECT if you don't want to cater for DELETEs ('D')
Good luck
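As a rough illustration of the three-legged query above, here is a small sketch on toy data. It uses SQLite via Python's sqlite3 as a stand-in for Vertica, and the two payload columns and sample rows are assumptions chosen to trigger one insert, one update and one delete.

```python
import sqlite3

# SQLite stands in for Vertica; columns and rows are made up for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_yesterday (id INTEGER PRIMARY KEY, fname TEXT, sal INTEGER);
    CREATE TABLE tb_today     (id INTEGER PRIMARY KEY, fname TEXT, sal INTEGER);
    INSERT INTO tb_yesterday VALUES (1, 'Ann', 100), (2, 'Bob', 200), (3, 'Cy', 300);
    INSERT INTO tb_today     VALUES (1, 'Ann', 100), (2, 'Bob', 250), (4, 'Di', 400);
""")

cdc_rows = conn.execute("""
    -- inserts: ids present today but not yesterday
    SELECT 'I' AS to_do, newrows.*
    FROM tb_today newrows
    LEFT JOIN tb_yesterday oldrows USING (id)
    WHERE oldrows.id IS NULL
    UNION ALL
    -- updates: same id, at least one column differs
    SELECT 'U' AS to_do, newrows.*
    FROM tb_today newrows
    JOIN tb_yesterday oldrows USING (id)
    WHERE oldrows.fname <> newrows.fname
       OR oldrows.sal   <> newrows.sal
    UNION ALL
    -- deletes: ids present yesterday but not today
    SELECT 'D' AS to_do, oldrows.*
    FROM tb_yesterday oldrows
    LEFT JOIN tb_today newrows USING (id)
    WHERE newrows.id IS NULL
    ORDER BY 2  -- second output column, i.e. id
""").fetchall()

for row in cdc_rows:
    print(row)
```

One thing to watch: `<>` evaluates to NULL when either side is NULL, so a change from NULL to a value (or back) is not flagged as 'U' by this form. If the columns are nullable, the comparison needs a NULL-safe variant.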

You can also do it nicely using joins:
SELECT b.*
FROM tableB AS b
LEFT JOIN tableA AS a ON a.id = b.id
WHERE a.id IS NULL
The query above returns only the rows of tableB whose id has no match in tableA; rows whose id is present in both tables are skipped.

Related

Duplicate query results when joining tables

I am facing an issue with duplicate data when joining tables. Here is the sample data I have:
-- Table A
which I want to join with
-- Table B
This is my query for joining both tables:
select a.trans_id, name
from tableA a
inner join tableB b
on a.ID_Trans = b.trans_id
And this is the result. Why do I get duplicated data, when it should show only two rows? Please help me solve this.
Firstly, as you have been told multiple times in the comments, this is working exactly as you have written it, and (more importantly) as intended. You have 2 rows in tableA and those 2 rows match 2 rows in tableB according to the ON clause. This means that the join, for each of the rows in tableA, results in 2 rows as well; thus 4 rows (2 * 2 = 4).
Considering that your table TableA only has one column, it seems that you should be cleaning up that data and deleting the duplicates. There are plenty of examples of how to do that already (example).
Perhaps the column you show us in TableA is one of many, and you instead have a denormalisation issue; in that case there should be another table with the details of Id_trans and a PRIMARY KEY or UNIQUE CONSTRAINT/INDEX on it. Then you would join from that table to TableB.
Finally, what you might be after is an EXISTS, which would look like this:
SELECT B.trans_id, B.[name]
FROM dbo.TableB B
WHERE EXISTS(SELECT 1
FROM dbo.TableA A
WHERE A.ID_Trans = B.trans_id); --Odd that it's called ID_Trans in one table, and Trans_ID in another
As the comments mentioned, your query does exactly what you asked it to do, but I think you wanted something like:
select a.trans_id, a.name, b.name
from tableA a
inner join tableB b on a.trans_id = b.trans_id
group by a.trans_id, a.name, b.name
Since there are two rows in both tables with the same ID, the join will turn them into four. You can use DISTINCT to remove the duplicates:
select distinct a.trans_id, name
from tableA a
inner join tableB b
on a.id_trans = b.trans_id
But I would suggest using EXISTS:
select trans_id, name
from tableB b
where exists (select 1 from tableA a where a.id_trans = b.trans_id)
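The row multiplication is easy to reproduce. The sketch below uses SQLite via Python's sqlite3; since the sample tables from the question are not shown, the data here (two duplicate keys in tableA, two matching rows in tableB) is an assumption that mirrors the 2 * 2 = 4 explanation.

```python
import sqlite3

# Hypothetical sample data: the original tables are not shown in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id_trans INTEGER);
    CREATE TABLE tableB (trans_id INTEGER, name TEXT);
    INSERT INTO tableA VALUES (1), (1);
    INSERT INTO tableB VALUES (1, 'x'), (1, 'y');
""")

# Inner join: every tableA row pairs with every matching tableB row.
joined = conn.execute("""
    SELECT b.trans_id, b.name
    FROM tableA a
    INNER JOIN tableB b ON a.id_trans = b.trans_id
""").fetchall()

# EXISTS: each tableB row is returned at most once, however many matches exist.
filtered = conn.execute("""
    SELECT b.trans_id, b.name
    FROM tableB b
    WHERE EXISTS (SELECT 1 FROM tableA a WHERE a.id_trans = b.trans_id)
    ORDER BY b.name
""").fetchall()

print(len(joined), filtered)
```

The join yields four rows while the EXISTS form yields the two tableB rows, because EXISTS only asks whether a match exists and never multiplies rows.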

How to join tables based on certain condition? SQL

There are three tables A,B,C
Table A has columns [ID], [flag], [many other columns]
Table B has columns [ID], [column subset of Table A]
Table C has columns [ID], [same column subset as Table B (thus also a subset of Table A), however with different values]
I want to join Table A & Table B if Flag = '1', and want to join Table A & Table C if Flag ='2'
Could you help me with how I might achieve this?
Many thanks!
You're looking for a UNION.
SELECT
<interesting columns>
FROM
A
JOIN
B
ON A.ID = B.ID
AND A.Flag = 1
UNION ALL
SELECT
<exactly the same interesting columns>
FROM
A
JOIN
C
ON A.ID = C.ID
AND A.Flag = 2
If the flag is really a string column, put the single quotes back. If it's numeric, leave them out.
Since the flag field in A should effectively eliminate duplicates between the result sets, I opted for UNION ALL, which is more efficient than UNION because UNION will run a DISTINCT under the covers, which in this case is likely unnecessary.
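A small sketch of the UNION ALL pattern on toy data, run through SQLite with Python's sqlite3. The single shared column `val` and the sample rows are assumptions, and the flag is treated as numeric here.

```python
import sqlite3

# Toy schema: `val` stands in for the shared column subset (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER PRIMARY KEY, Flag INTEGER);
    CREATE TABLE B (ID INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE C (ID INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO A VALUES (1, 1), (2, 2), (3, 1);
    INSERT INTO B VALUES (1, 'b1'), (2, 'b2'), (3, 'b3');
    INSERT INTO C VALUES (1, 'c1'), (2, 'c2'), (3, 'c3');
""")

picked = conn.execute("""
    SELECT A.ID, B.val
    FROM A JOIN B ON A.ID = B.ID AND A.Flag = 1
    UNION ALL
    SELECT A.ID, C.val
    FROM A JOIN C ON A.ID = C.ID AND A.Flag = 2
    ORDER BY 1
""").fetchall()

print(picked)
```

Each row of A lands in exactly one leg, so IDs 1 and 3 take their value from B and ID 2 takes it from C, with no duplicates for UNION ALL to worry about.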

How to loop through rows in two tables and create a new set based on the merged results in SQL

Here is my obstacle.
I have two tables. Table A contains more rows than Table B. I have to merge the results and if Table A does not contain a row from Table B then I insert it into the new set. If however, a row from Table A contains a row with the same primary key as Table B, the new set will take the row from Table B.
Would this best be done with a cursor, or is there an easier way to do it? I ask because there are 20 million rows and, while I am new to SQL, I've heard cursors are expensive.
Your phrasing is a little vague. It seems that you want everything from TableB and then rows from TableA that have no matching primary key in B. The following query solves this problem:
select *
from tableB union all
select *
from tableA
where tableA.pk not in (select pk from tableB)
Yep, cursors are expensive.
There's a MERGE command in later versions of SQL that will do this in one shot, but it's sooo cumbersome. Better to do it in two pieces - first:
UPDATE A SET
field1 = B.field1
,field2 = B.field2
, etc
FROM A JOIN B on B.id = A.id
Then:
INSERT A SELECT * FROM B --enumerate fields if different
WHERE B.id not in (select id FROM A)
An OUTER JOIN should do what you need and be more efficient than a cursor.
Try this query
--first get the rows that match between TableA and TableB
INSERT INTO [new set]
SELECT TableB.* --or columns of your choice
FROM TableA LEFT JOIN TableB ON [matching key criteria]
WHERE TableB.[joining column/PK] IS NOT NULL
--then get the rows from TableA that don't have a match
INSERT INTO [new set]
SELECT TableA.* --you didn't say what was inserted if there was no matching row
FROM TableA LEFT JOIN TableB ON [matching key criteria]
WHERE TableB.[joining column/PK] IS NULL
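For a quick sanity check of the set-based approach, here is a sketch of the "take everything from B, then whatever A has that B lacks" query on toy data. It uses SQLite through Python's sqlite3; the `pk` and `val` column names and the rows are assumptions.

```python
import sqlite3

# Hypothetical two-column tables; only the pk-based logic follows the answers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (pk INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE tableB (pk INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO tableA VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO tableB VALUES (2, 'b2'), (4, 'b4');
""")

merged = conn.execute("""
    SELECT pk, val FROM tableB
    UNION ALL
    SELECT pk, val FROM tableA
    WHERE pk NOT IN (SELECT pk FROM tableB)
    ORDER BY 1
""").fetchall()

print(merged)
```

Key 2 exists in both tables and comes out with B's value, keys 1 and 3 come from A only, and key 4 comes from B only. No cursor is involved; the database resolves the whole set in one statement.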

Join SQL query to get data from two tables

I'm a newbie, just learning SQL, and have this question: I have two tables with the same columns. Some registers are in both tables, but others are in only one of them. To illustrate, suppose table A = (1,2,3,4) and table B = (3,4,5,6), where the numbers are registers. I need to select all registers in table B that are not in table A, that is, result = (5,6). What query should I use? Maybe a join. Thanks.
You can either use a NOT IN query like this:
SELECT col from B where col not in (select col from A)
or use an outer join:
select B.col
from B LEFT OUTER JOIN A on B.col=A.col
where A.col is NULL
The first is easier to understand, but the second is easier to use with more tables in the query.
Select register from TABLE_B b
Where not exists (Select register from TABLE_A a where a.register = b.register)
I assumed you have a column named register in TABLE_A and TABLE_B
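One difference between NOT IN and NOT EXISTS is worth seeing on the question's own numbers. The sketch below uses SQLite via Python's sqlite3 and assumes a nullable `register` column, as in the answer above; the NULL row is added only to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_A (register INTEGER);
    CREATE TABLE TABLE_B (register INTEGER);
    INSERT INTO TABLE_A VALUES (1), (2), (3), (4);
    INSERT INTO TABLE_B VALUES (3), (4), (5), (6);
""")

NOT_IN = """
    SELECT register FROM TABLE_B
    WHERE register NOT IN (SELECT register FROM TABLE_A)
    ORDER BY register
"""
NOT_EXISTS = """
    SELECT b.register FROM TABLE_B b
    WHERE NOT EXISTS (SELECT 1 FROM TABLE_A a WHERE a.register = b.register)
    ORDER BY b.register
"""

def run(sql):
    """Return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]

# With no NULLs, both forms agree.
before = (run(NOT_IN), run(NOT_EXISTS))

# A single NULL in the subquery makes NOT IN evaluate to unknown for every row.
conn.execute("INSERT INTO TABLE_A VALUES (NULL)")
after = (run(NOT_IN), run(NOT_EXISTS))

print(before, after)
```

Before the NULL is inserted both queries return 5 and 6. Afterwards NOT IN returns nothing while NOT EXISTS still returns 5 and 6, so NOT EXISTS (or the outer join) is the safer choice when the compared column can be NULL.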

Fastest way to find non-matching ids from two tables

I have a base table with 1000 values and a second, temporary table with 100 values. I need to compare them by GUID and return only those rows from the second table that do not exist in the first table. I need the fastest-performing solution for that. Thanks!
The classic left join / IS NULL test:
select A.*
from secondTbl A
left join firstTbl B on A.guid = B.guid
WHERE B.guid is null
SELECT * FROM table2
WHERE NOT EXISTS (SELECT 'x' FROM table1
WHERE table1.field = table2.field)
http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx