Multiple Indexes or Single One in a comparison between two large tables - sql

I'm going to compare two tables on Oracle with about 10 million records in each one.
t1 (anumber, bnumber, cdate, ctime, duration)
t2 (fcode, anumber, bnumber, mdate, mtime, odate, otime, duration)
Rows in these tables hold information about calls from one number to another for a specific month (August 2012).
For example, (12345,9876,120821,120000,68) indicates a call from anumber=12345 to bnumber=9876 on date 2012/08/21 at time 12:00:00, which lasted 68 seconds.
I want to find records that don't exist in one of these tables but exist in the other. My comparison query looks like this:
select t1.*
from table1 t1
where not exists(select t1.* from table2 t2
where t1.anumber = t2.anumber
and t1.cdate = t2.mdate
and t1.duration = t2.duration);
My questions are:
Which kind of index is better here: one composite index on (anumber, cdate, duration), or a separate single-column index on each of them?
Considering that the third column is the duration of a call, which can take a wide range of values, is it worth creating an index on it? Won't it slow down my query?
What is the fastest way to find the differences between these tables?
Is it better to loop through the dates and execute my query with (cdate='A DATE MONTH') added to the where clause?
Compared to the above query, how much slower is this one:
select t1.*
from table1 t1
where not exists (select t1.*
from table2 t2
where t1.anumber = t2.anumber
and t1.bnumber like '%t2.bnumber%'
and t1.cdate = t2.mdate
and t1.duration = t2.duration);

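select anumber, cdate, duration from t1
minus
select anumber, mdate, duration from t2
Don't use indexes: you want to scan all 10 million rows in both tables, so a full scan (TABLE ACCESS FULL) is preferable in this case. Both sides of MINUS must select the same number of columns, so list the compared columns instead of using select *.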

Try it this way:
select t1.*
from table1 t1
where (t1.anumber, t1.cdate, t1.duration) not in (select t2.anumber, t2.mdate, t2.duration
from table2 t2);
Look at the execution plan. If it is good, don't create any indexes; otherwise create one like this:
create index idx_anum_dat_dur on table2(anumber, mdate, duration)
Query performance depends on how many rows the query returns, so check the execution plan and try different variants.
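To make the "check the plan, then decide on the index" advice concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Oracle, so the plan text and optimizer behaviour are assumptions that will not match Oracle's EXPLAIN PLAN. The sample rows and the helper names (build_demo, missing_rows, plan_details) are made up for illustration.

```python
import sqlite3

# Anti-join from the question, written with SELECT 1 in the subquery.
ANTI_JOIN = """
    SELECT t1.*
    FROM table1 t1
    WHERE NOT EXISTS (SELECT 1
                      FROM table2 t2
                      WHERE t2.anumber = t1.anumber
                        AND t2.mdate = t1.cdate
                        AND t2.duration = t1.duration)
"""


def build_demo():
    """Create tiny stand-ins for table1/table2 with invented sample rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE table1 (anumber, bnumber, cdate, ctime, duration);
        CREATE TABLE table2 (fcode, anumber, bnumber, mdate, mtime,
                             odate, otime, duration);
        INSERT INTO table1 VALUES (12345, 9876, 120821, 120000, 68);
        INSERT INTO table1 VALUES (22222, 3333, 120822, 93000, 15);
        INSERT INTO table2 VALUES (1, 12345, 9876, 120821, 120000,
                                   120821, 120108, 68);
        -- One composite index over the three compared columns.
        CREATE INDEX idx_anum_dat_dur ON table2 (anumber, mdate, duration);
    """)
    return con


def missing_rows(con):
    """Rows of table1 that have no counterpart in table2."""
    return con.execute(ANTI_JOIN).fetchall()


def plan_details(con):
    """Human-readable plan lines as reported by SQLite (not Oracle)."""
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + ANTI_JOIN)]


con = build_demo()
print(missing_rows(con))
for line in plan_details(con):
    print(line)
```

With the composite index in place, the subquery can be answered by a single index lookup on all three columns. On the real 10-million-row tables, only the actual Oracle plan can tell you whether that beats two full scans.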

Your query is equivalent (for non-NULL bnumber) to:
select t1.*
from table1 t1
WHERE t1.bnumber not like '%t2.bnumber%' -- << like 'Literal' !!!
OR NOT EXISTS (
select *
from table2 t2
where t2.anumber = t1.anumber
and t2.mdate = t1.cdate
and t2.duration = t1.duration
);
(Table references inside quotes are not expanded! Is this the OP's intention?)
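A small sketch of the quoting point, using Python's sqlite3 module as a stand-in database (the sample values are invented). Inside quotes, t2.bnumber is just text; to match against the column value, the pattern has to be built with the || concatenation operator, which Oracle also supports.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (bnumber TEXT);
    CREATE TABLE t2 (bnumber TEXT);
    INSERT INTO t1 VALUES ('009876');
    INSERT INTO t2 VALUES ('9876');
""")

# The pattern is the literal text "t2.bnumber" wrapped in wildcards.
literal = con.execute(
    "SELECT COUNT(*) FROM t1, t2 WHERE t1.bnumber LIKE '%t2.bnumber%'"
).fetchone()[0]

# The pattern is built from the column value of the current t2 row.
concatenated = con.execute(
    "SELECT COUNT(*) FROM t1, t2 WHERE t1.bnumber LIKE '%' || t2.bnumber || '%'"
).fetchone()[0]

print(literal, concatenated)
```

Note that a pattern starting with % generally cannot use an ordinary index on bnumber, so the concatenated form is likely to be noticeably slower than the equality-only query.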

Related

faster way of process return

I want to return the most recently issued passport from a database.
Table1 holds only one passport record; Table2 holds all the passport records belonging to every person.
My comparison query runs extremely slowly; it takes too much time.
If there is a faster alternative to my code, please share it. Thanks in advance.
from table1 t
where t.pass_date <
(select max(tb_datebeg) from table2 where tb_inn = t.tin)
We could phrase this as a join along with analytic functions:
WITH cte AS (
SELECT t1.*, MAX(t2.tb_datebeg) OVER (PARTITION BY t2.tb_inn) max_tb_datebeg
FROM table1 t1
INNER JOIN table2 t2 ON t2.tb_inn = t1.tin
)
SELECT *
FROM cte
WHERE pass_date < max_tb_datebeg;
The above query would benefit from the following index on table2:
CREATE INDEX idx2 ON table2 (tb_inn, tb_datebeg);
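One thing to watch with the join form: an inner join to table2 yields one output row per matching table2 row, so the same table1 row can appear more than once. A possible variant, sketched here with Python's sqlite3 module and invented sample data, aggregates table2 first and then joins. This is an alternative formulation, not the query from the answer above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (tin INTEGER, pass_date TEXT);
    CREATE TABLE table2 (tb_inn INTEGER, tb_datebeg TEXT);
    INSERT INTO table1 VALUES (1, '2010-01-01'), (2, '2020-01-01');
    INSERT INTO table2 VALUES (1, '2005-01-01'), (1, '2015-06-01'),
                              (2, '2018-03-01');
""")

# Original shape: correlated MAX() subquery evaluated per table1 row.
correlated = con.execute("""
    SELECT t.tin, t.pass_date
    FROM table1 t
    WHERE t.pass_date < (SELECT MAX(tb_datebeg)
                         FROM table2
                         WHERE tb_inn = t.tin)
    ORDER BY t.tin
""").fetchall()

# Variant: compute one MAX per person up front, then join once.
pre_aggregated = con.execute("""
    SELECT t1.tin, t1.pass_date
    FROM table1 t1
    JOIN (SELECT tb_inn, MAX(tb_datebeg) AS max_tb_datebeg
          FROM table2
          GROUP BY tb_inn) m ON m.tb_inn = t1.tin
    WHERE t1.pass_date < m.max_tb_datebeg
    ORDER BY t1.tin
""").fetchall()

print(correlated, pre_aggregated)
```

Both forms return each outdated table1 row exactly once, and both can use an index on table2 (tb_inn, tb_datebeg).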

Tuning SQL query : subquery with aggregate function on the same table

The following query takes approximately 30 seconds to give results.
table1 contains ~20 million rows
table2 contains ~10,000 rows
I'm trying to find a way to improve performance. Any ideas?
declare @PreviousMonthDate datetime
select @PreviousMonthDate = (SELECT DATEADD(MONTH, DATEDIFF(MONTH, '19000101', GETDATE()) - 1, '19000101') as [PreviousMonthDate])
select
distinct(t1.code), t1.ent, t3.lib, t3.typ from table1 t1, table2 t3
where (select min(t2.dat) from table1 t2 where t2.code=t1.code) >@PreviousMonthDate
and t1.ent in ('XXX')
and t1.dat>@PreviousMonthDate
Thanks
This is your query, more sensibly written:
select distinct t1.code, t1.ent, t2.lib, t2.typ
from table1 t1 join
table2 t2
on t1.code = t2.cod
where not exists (select 1
from table1 tt1
where tt1.code = t1.code and
tt1.dat <= @PreviousMonthDate
) and
t1.ent = 'XXX' and
t1.dat > @PreviousMonthDate;
For this query, you want the following indexes:
table1(ent, dat, code) -- for the where
table1(code, dat) -- for the subquery
table2(cod, lib, typ) -- for the join
Notes:
Table aliases should make sense. t3 for table2 is cognitively dissonant, even though I know these are made up names.
not exists (especially with the right indexes) should be faster than the aggregation subquery.
The indexes will satisfy the where clause, reducing the data needed for filtering.
select distinct is a statement. distinct is not a function, so the parentheses do nothing.
Never use comma in the FROM clause. Always use proper, explicit, standard JOIN syntax.
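To check that the NOT EXISTS rewrite returns the same rows as the MIN() subquery, here is a minimal sketch using Python's sqlite3 module rather than SQL Server. The sample rows and the cutoff date are invented, and a bound parameter stands in for the T-SQL variable. Both queries are written with DISTINCT so that repeated rows in table1 do not affect the comparison.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (code TEXT, ent TEXT, dat TEXT);
    CREATE TABLE table2 (cod TEXT, lib TEXT, typ TEXT);
    INSERT INTO table1 VALUES
        ('A', 'XXX', '2023-12-15'), ('A', 'XXX', '2024-02-01'),
        ('B', 'XXX', '2024-01-10'), ('B', 'XXX', '2024-01-20'),
        ('C', 'YYY', '2024-01-05');
    INSERT INTO table2 VALUES
        ('A', 'libA', 't'), ('B', 'libB', 't'), ('C', 'libC', 't');
""")
params = {"cutoff": "2024-01-01"}  # stands in for the previous-month date

# Shape of the original query: aggregate subquery per outer row.
with_min = con.execute("""
    SELECT DISTINCT t1.code, t1.ent, t2.lib, t2.typ
    FROM table1 t1 JOIN table2 t2 ON t1.code = t2.cod
    WHERE (SELECT MIN(x.dat) FROM table1 x WHERE x.code = t1.code) > :cutoff
      AND t1.ent = 'XXX'
      AND t1.dat > :cutoff
""", params).fetchall()

# Shape of the rewrite: no row at or before the cutoff may exist.
with_not_exists = con.execute("""
    SELECT DISTINCT t1.code, t1.ent, t2.lib, t2.typ
    FROM table1 t1 JOIN table2 t2 ON t1.code = t2.cod
    WHERE NOT EXISTS (SELECT 1 FROM table1 tt1
                      WHERE tt1.code = t1.code AND tt1.dat <= :cutoff)
      AND t1.ent = 'XXX'
      AND t1.dat > :cutoff
""", params).fetchall()

print(with_min, with_not_exists)
```

Code A is dropped because it has a row before the cutoff, and code C is dropped by the ent filter, so only B survives in both forms.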

Basic difference between two tables

I am attempting a very basic difference function in PostgreSQL. Table 1 and Table 2 have identical columns. The only difference is that Table 1 has some surplus rows. I would like to select only the surplus rows:
SELECT *
FROM table1
WHERE NOT EXISTS (SELECT * from table2);
The query above returns nothing when I know there are surplus rows.
I think you are looking for except:
select t1.*
from table1 t1
except
select t2.*
from table2 t2;
Note that the two tables must have the same number of columns, and the corresponding columns must have compatible types.
If you wish to use NOT EXISTS you're missing the joining of your table's keys in the inner where clause. Try:
SELECT *
FROM table1 t1
WHERE NOT EXISTS (SELECT * from table2 t2 WHERE t2.id = t1.id);
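Here is a minimal sketch of why the original query returned nothing, run through Python's sqlite3 module instead of PostgreSQL. The id and val columns and the sample rows are assumptions for illustration. An uncorrelated NOT EXISTS is false for every row as soon as table2 contains any row at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (id INTEGER, val TEXT);
    CREATE TABLE table2 (id INTEGER, val TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'a'), (2, 'b');
""")

# Subquery ignores the outer row: "table2 is empty?" is asked every time.
uncorrelated = con.execute(
    "SELECT * FROM table1 WHERE NOT EXISTS (SELECT * FROM table2)"
).fetchall()

# Set difference over whole rows.
with_except = con.execute(
    "SELECT * FROM table1 EXCEPT SELECT * FROM table2"
).fetchall()

# Subquery tied to the outer row through the key.
correlated = con.execute("""
    SELECT * FROM table1 t1
    WHERE NOT EXISTS (SELECT * FROM table2 t2 WHERE t2.id = t1.id)
""").fetchall()

print(uncorrelated, with_except, correlated)
```

EXCEPT compares entire rows, while the correlated form compares only the key, so the two can differ when a key exists in both tables with different values in the other columns.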

Comparing two tables for equality in HIVE

I have two tables, table1 and table2. Each with the same columns:
key, c1, c2, c3
I want to check whether these tables are equal to each other (they have the same rows). So far I have these two queries (<> = not equal in HIVE):
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
And
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null
So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.
The first one misses rows where the keys match but t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is NULL, because a <> comparison with NULL is never true. That means you are effectively doing an inner join.
The second one will find rows that exist in t1 but not in t2.
To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:
select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
If you want to find rows that do not appear in both tables, and the tables have exactly the same structure and contain no duplicates within themselves, then you can do:
select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
(select t2.*, 2 as which from table2 t2)
) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;
There are various ways that you can relax the conditions in the first paragraph, if necessary.
Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
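A minimal sketch of the UNION ALL plus GROUP BY check, run through Python's sqlite3 module rather than Hive. The sample rows are invented, and the key column is renamed to k here to stay clear of keyword clashes. The first row has a NULL in c2 on both sides and is still recognised as matching, because GROUP BY puts NULLs in the same group.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (k INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (k INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO table1 VALUES (1, 'a', NULL, 'x'), (2, 'b', 'c', 'y');
    INSERT INTO table2 VALUES (1, 'a', NULL, 'x'), (2, 'b', 'c', 'z');
""")

# Rows present in both tables are counted twice; anything else is a mismatch.
mismatches = con.execute("""
    SELECT t.k, t.c1, t.c2, t.c3, COUNT(*) AS cnt
    FROM (SELECT * FROM table1
          UNION ALL
          SELECT * FROM table2) t
    GROUP BY t.k, t.c1, t.c2, t.c3
    HAVING COUNT(*) <> 2
    ORDER BY t.k, t.c1, t.c2, t.c3
""").fetchall()

print(mismatches)
```

An empty result means the two tables hold the same rows, under the stated assumption that neither table contains duplicates.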
Well, the best way is to calculate a hash sum for each table and compare the two sums.
No matter how many columns there are or what data types they have, as long as the two tables have the same schema, you can use the following queries to do the comparison:
select sum(hash(*)) from t1;
select sum(hash(*)) from t2;
Then you just need to compare the returned values.
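The same idea can be sketched outside Hive. The example below uses Python's sqlite3 module and zlib.crc32 as a stand-in for Hive's hash(), so the function name table_checksum, the table names, and the sample data are all assumptions. Summing per-row hashes makes the result independent of row order. Equal sums are strong evidence of equal content but not a proof, since different rows can collide.

```python
import sqlite3
import zlib


def table_checksum(con, table):
    """Order-independent checksum: sum of a CRC32 over each row's repr."""
    total = 0
    for row in con.execute(f"SELECT * FROM {table}"):
        total += zlib.crc32(repr(row).encode("utf-8"))
    return total


con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (k INTEGER, v TEXT);
    CREATE TABLE t2 (k INTEGER, v TEXT);
    CREATE TABLE t3 (k INTEGER, v TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES (2, 'b'), (1, 'a');   -- same rows, other order
    INSERT INTO t3 VALUES (1, 'a'), (2, 'c');   -- one value differs
""")

print(table_checksum(con, "t1") == table_checksum(con, "t2"))
print(table_checksum(con, "t1") == table_checksum(con, "t3"))
```

A checksum only tells you whether the tables differ; to see which rows differ you still need one of the row-level comparisons.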
I used the EXCEPT operator and it worked.
select * from Original_table
EXCEPT
select * from Revised_table
This shows all the rows of the original table that are not in the revised table.
If your table is partitioned, you will have to provide a partition predicate.
FYI, partition values don't need to be provided if you use Presto and query via SQL Lab.
I would recommend not using JOINs to compare tables:
it is quite an expensive operation when the tables are big (which is often the case in Hive)
it can cause problems when some rows/"join keys" are repeated
(and it can also be impractical when the data are in different clusters/datacenters/clouds).
Instead, I think a checksum approach, comparing the checksums of both tables, is best.
I have developed a Python script that lets you do such a comparison easily and see the differences in a web browser:
https://github.com/bolcom/hive_compared_bq
I hope that helps!
1: First get the row count for both tables, C1 and C2. C1 and C2 should be equal. They can be obtained with the following query (run once per table):
select count(*) from table1
If C1 and C2 are not equal, then the tables are not identical.
2: Find the distinct count for both tables, DC1 and DC2. DC1 and DC2 should be equal. The number of distinct records can be found with the following query:
select count(*) from (select distinct * from table1) t
If DC1 and DC2 are not equal, the tables are not identical.
3: Now get the number of records obtained by performing a union of the two tables. Let it be U. Use the following query to get the number of records in the union of the two tables:
SELECT count(*)
FROM
(SELECT *
FROM table1
UNION
SELECT *
FROM table2) t
You can say that the data in the two tables is identical if the distinct count of each table equals the number of records obtained from the union of the two tables, i.e. DC1 = U and DC2 = U.
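The three counting steps above can be wrapped into one helper. This is a minimal sketch using Python's sqlite3 module rather than Hive; the function name tables_identical, the table names a, b, c, and the sample rows are invented for illustration.

```python
import sqlite3


def tables_identical(con, a, b):
    """Apply the count, distinct-count and union-count checks to tables a, b."""
    def scalar(sql):
        return con.execute(sql).fetchone()[0]

    c1 = scalar(f"SELECT COUNT(*) FROM {a}")
    c2 = scalar(f"SELECT COUNT(*) FROM {b}")
    dc1 = scalar(f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {a})")
    dc2 = scalar(f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {b})")
    u = scalar(f"SELECT COUNT(*) FROM (SELECT * FROM {a} UNION SELECT * FROM {b})")
    return c1 == c2 and dc1 == u and dc2 == u


con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (k INTEGER, v TEXT);
    CREATE TABLE b (k INTEGER, v TEXT);
    CREATE TABLE c (k INTEGER, v TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (2, 'y'), (1, 'x');
    INSERT INTO c VALUES (1, 'x'), (3, 'z');
""")

print(tables_identical(con, "a", "b"), tables_identical(con, "a", "c"))
```

One limitation of this check: if the same rows are duplicated a different number of times in each table while the totals still agree, all three counts match even though the tables are not strictly identical.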
Another variant:
select c1-c2 "different row counts"
, c1-c3 "mismatched rows"
from
( select count(*) c1 from table1)
,( select count(*) c2 from table2 )
,(select count(*) c3 from table1 t1, table2 t2
where t1.key= t2.key
and T1.c1=T2.c1 )
Try it with a WITH clause:
With cnt as(
select count(*) cn1 from table1
)
select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2);
One easy solution is to do an inner join. Suppose we have two Hive tables, table1 and table2. Both tables have the same columns, col1, col2 and col3, and should also have the same number of rows. The query would then be as follows:
select count(*) from table1
inner join table2
on table1.col1 = table2.col1
and table1.col2 = table2.col2
and table1.col3 = table2.col3 ;
If the output value is the same as the number of rows in table1 and table2, then all the columns have the same values. If, however, the output count is smaller, then some of the data differs.
Use a MINUS operator:
SELECT count(*) FROM
(SELECT t1.c1, t1.c2, t1.c3 from table1 t1
MINUS
SELECT t2.c1, t2.c2, t2.c3 from table2 t2)

Should I avoid IN() because slower than EXISTS() [duplicate]

Should I avoid IN() because slower than EXISTS()?
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
VS
SELECT * FROM TABLE1 t1 WHERE t1.ID IN(SELECT t2.ID FROM TABLE2 t2)
From my investigation with SHOWPLAN_ALL set, I get the same execution plan and estimated cost. The primary-key index is used, with a seek in both queries. No difference.
Are there other scenarios or cases where the two queries differ significantly? Is the optimizer doing enough optimization for me that both get the same execution plan?
Do neither. Do this:
SELECT DISTINCT T1.*
FROM TABLE1 t1
JOIN TABLE2 t2 ON t1.ID = t2.ID;
This will outperform anything else by orders of magnitude.
Both queries will produce the same execution plan (assuming no indexes were created): two table scans and one nested loop (join).
The join suggested by Bohemian will do a Hash Match instead of the loop, which I've always heard (and here is a proof: Link) is the worst kind of join.
Between IN and EXISTS (your actual question), EXISTS gives better performance (take a look at: Link).
If your table T2 has a lot of records, EXISTS is the better approach hands down, because as soon as the database finds a record that matches your requirement, the condition evaluates to true and the scan of T2 stops. With the IN clause, however, you're scanning Table2 for every row in Table1.
IN is better than EXISTS when you have a fixed list of values, or only a few values in the subquery.
I expanded my answer a little, based on an Ask Tom answer:
In a Select with in, for example:
Select * from T1 where x in ( select y from T2 )
is usually processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table (typically).
In an exist query like:
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x ) )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
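The two processing strategies described above can be modelled in plain Python. This is only a conceptual sketch with invented function names and data, not what any particular optimizer does. As the question's SHOWPLAN_ALL result suggests, a real engine may well turn both queries into the same plan.

```python
def in_style(t1_values, t2_values):
    """Model of IN: evaluate the subquery once, de-duplicate, then probe."""
    distinct_y = set(t2_values)           # select distinct y from t2
    return [x for x in t1_values if x in distinct_y]


def exists_style(t1_values, t2_values):
    """Model of EXISTS: full pass over t1, probing t2 for each row."""
    output = []
    for x in t1_values:
        # any() stops at the first match, like EXISTS stopping its scan.
        if any(y == x for y in t2_values):
            output.append(x)
    return output


t1 = [1, 2, 3, 4]
t2 = [2, 2, 4, 5]
print(in_style(t1, t2), exists_style(t1, t2))
```

The results are the same; what differs is where the work goes. The IN model pays once to build the distinct set, while the EXISTS model pays a probe per outer row, which is only cheap when that probe is fast (for example, through an index on t2(y)).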
When is WHERE EXISTS appropriate, and when is IN appropriate?
Use EXISTS when the result of the subquery on T2 is huge and takes a long time to produce, T1 is relatively small, and executing (select null from t2 where y = x.x) is very fast.
Use IN when the result of the subquery is small; then IN is typically more appropriate.
If both the subquery and the outer table are huge, either might work as well as the other; it depends on the indexes and other factors.