How to select efficiently in a very large table - sql

The table contains about 300 million rows. I need to select those rows based on two columns.
SELECT *
FROM table_1
WHERE column_1
IN (SELECT column FROM table_2)
AND column_2
IN (SELECT column FROM table_2)
table_1 has 300 million rows. table_2 has 1 million distinct rows.
I also used the exists method:
SELECT *
FROM table_1
WHERE EXISTS (
SELECT 1
FROM table_2
WHERE column=table_1.column_1)
AND EXISTS (
SELECT 1
FROM table_2
WHERE column=table_1.column_2)
But it is too slow. I created indexes on both columns in table_1 and on the column in table_2. The query takes more than two hours on a Dell server with 12 GB of RAM.
Is there a better way to deal with such a big table? Can Hadoop solve this problem?

Use this:
SELECT table_1.*
FROM table_1
INNER JOIN
(SELECT DISTINCT column FROM table_2) tab2_1
ON table_1.column_1 = tab2_1.column
INNER JOIN
(SELECT DISTINCT column FROM table_2) tab2_2
ON table_1.column_2 = tab2_2.column
Hope this helps.
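As a small sanity check of the rewrite above, the sketch below runs both forms against an in-memory SQLite database and compares the results. The table layout, the column name col (used instead of the placeholder column, which is a reserved word in many dialects), and the sample rows are assumptions made for illustration; it says nothing about performance at 300 million rows, which depends on the DBMS and its query plan.

```python
import sqlite3

# Hypothetical miniature of table_1 / table_2; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER PRIMARY KEY, column_1 INTEGER, column_2 INTEGER);
    CREATE TABLE table_2 (col INTEGER);
    CREATE INDEX idx_table_2_col ON table_2 (col);
    INSERT INTO table_1 (column_1, column_2) VALUES (1, 2), (1, 9), (9, 2), (3, 3);
    INSERT INTO table_2 (col) VALUES (1), (2), (2), (3);  -- the value 2 is duplicated
""")

# Original form: two IN subqueries.
in_query = """
    SELECT t1.id FROM table_1 t1
    WHERE t1.column_1 IN (SELECT col FROM table_2)
      AND t1.column_2 IN (SELECT col FROM table_2)
    ORDER BY t1.id
"""

# Rewritten form: two joins against deduplicated key lists.
join_query = """
    SELECT t1.id FROM table_1 t1
    INNER JOIN (SELECT DISTINCT col FROM table_2) k1 ON t1.column_1 = k1.col
    INNER JOIN (SELECT DISTINCT col FROM table_2) k2 ON t1.column_2 = k2.col
    ORDER BY t1.id
"""

in_rows = conn.execute(in_query).fetchall()
join_rows = conn.execute(join_query).fetchall()
print(in_rows, join_rows)
```

The DISTINCT in each derived table is what keeps the join from multiplying table_1 rows when table_2 contains repeated values.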

With such a big database, I would create a materialized view for this query (if your DBMS supports them), and then you can run a simple SELECT * FROM table_view:
CREATE MATERIALIZED VIEW table_view AS
SELECT * FROM table_1
WHERE column_1 IN (SELECT column FROM table_2)
AND column_2 IN (SELECT column FROM table_2);
You then just have to create a TRIGGER to refresh this view whenever a row is added to or removed from table_1 or table_2.

Related

How to retrieve only those rows of a table (db1) which are not in another table (db2)

I have a table t1 in db1, and another table t2 in db2. I have the same columns in both tables.
How do I retrieve only those rows which are not in the other table?
select id_num
from [db1].[dbo].[Tbl1]
except
select id_num
from [db2].[dbo].[Tb01]
You can use a LEFT JOIN or a WHERE NOT IN clause.
Using WHERE NOT IN:
select
dbase1.id_num from [db1].[dbo].[Tbl1] as dbase1
where dbase1.id_num not in
(select dbase2.id_num from [db2].[dbo].[Tb01] as dbase2)
Using LEFT JOIN (often faster, and unlike NOT IN it still returns rows when the subquery column contains NULLs):
SELECT dbase1.id_num
FROM [db1].[dbo].[Tbl1] as dbase1
LEFT JOIN [db2].[dbo].[Tb01] as dbase2 ON dbase2.id_num COLLATE Latin1_General_CI_A = dbase1.id_num COLLATE Latin1_General_CI_A
WHERE dbase2.id_num IS NULL
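The difference between the approaches above is easiest to see when the second table contains a NULL. The following sketch uses an in-memory SQLite database with hypothetical tables tbl1 and tbl2 (stand-ins for the two databases' tables; the data is invented) and runs NOT IN, the LEFT JOIN anti-join, and EXCEPT side by side.

```python
import sqlite3

# Hypothetical stand-ins for [db1].[dbo].[Tbl1] and [db2].[dbo].[Tb01].
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (id_num TEXT);
    CREATE TABLE tbl2 (id_num TEXT);
    INSERT INTO tbl1 VALUES ('a'), ('b'), ('c');
    INSERT INTO tbl2 VALUES ('b'), (NULL);  -- the NULL is what trips up NOT IN
""")

# NOT IN: a NULL in the subquery makes the predicate unknown for every non-matching row.
not_in_rows = conn.execute("""
    SELECT id_num FROM tbl1
    WHERE id_num NOT IN (SELECT id_num FROM tbl2)
    ORDER BY id_num
""").fetchall()

# LEFT JOIN anti-join: keep only the rows that found no partner.
left_join_rows = conn.execute("""
    SELECT t1.id_num FROM tbl1 t1
    LEFT JOIN tbl2 t2 ON t2.id_num = t1.id_num
    WHERE t2.id_num IS NULL
    ORDER BY t1.id_num
""").fetchall()

# EXCEPT: set difference, which also removes duplicates from the result.
except_rows = conn.execute("""
    SELECT id_num FROM tbl1
    EXCEPT
    SELECT id_num FROM tbl2
    ORDER BY id_num
""").fetchall()

print(not_in_rows)     # []
print(left_join_rows)  # [('a',), ('c',)]
print(except_rows)     # [('a',), ('c',)]
```

If the compared column is declared NOT NULL in the second table, NOT IN gives the same rows as the other two forms.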
To compare tables with DB2: other databases may have a SELECT a - b statement or similar. Because my database didn't have one at the time, I use the following. Wrap the statement in a CREATE TABLE statement to dig into the results. If no rows come back, the tables are identical. I've added a BEFORE|AFTER column, which makes the results easy to read.
SELECT 'AFTER', A.* FROM
(SELECT * FROM &AFTER
EXCEPT
SELECT * FROM &BEFORE) AS A
UNION
SELECT 'BEFORE', B.* FROM
(SELECT * FROM &BEFORE
EXCEPT
SELECT * FROM &AFTER) AS B

How to take distinct values in hive join

I need to take only the distinct values from Table 2 when joining it with Table 1 in Hive, because Table 2 has duplicate records.
Given the join condition below, is it possible to take only the distinct key_col values from Table 2? I don't want to use select distinct * from ...
select * from Table_1 a left join Table_2 b on a.key_col = b.key_col
Note: This is in Hive
Use a LEFT SEMI JOIN. It returns every record in Table_1 that has a match in Table_2, once each, no matter how many duplicate matches Table_2 contains.
select a.* from Table_1 a left semi join Table_2 b on a.key_col = b.key_col
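To illustrate what the semi join changes, here is a sketch on an in-memory SQLite database. SQLite has no LEFT SEMI JOIN keyword, so the example uses the equivalent EXISTS form; the val column and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_1 (key_col INTEGER, val TEXT);
    CREATE TABLE Table_2 (key_col INTEGER);
    INSERT INTO Table_1 VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO Table_2 VALUES (1), (1), (2);  -- key 1 appears twice
""")

# Plain join: the duplicate key in Table_2 repeats the Table_1 row.
plain_join = conn.execute("""
    SELECT a.key_col, a.val FROM Table_1 a
    JOIN Table_2 b ON a.key_col = b.key_col
    ORDER BY a.key_col
""").fetchall()

# Semi join written as EXISTS: each matching Table_1 row appears once.
semi_join = conn.execute("""
    SELECT a.key_col, a.val FROM Table_1 a
    WHERE EXISTS (SELECT 1 FROM Table_2 b WHERE b.key_col = a.key_col)
    ORDER BY a.key_col
""").fetchall()

# Alternative that keeps unmatched rows: left join against deduplicated keys.
dedup_left_join = conn.execute("""
    SELECT a.key_col, a.val, b.key_col FROM Table_1 a
    LEFT JOIN (SELECT DISTINCT key_col FROM Table_2) b ON a.key_col = b.key_col
    ORDER BY a.key_col
""").fetchall()

print(plain_join)       # [(1, 'x'), (1, 'x'), (2, 'y')]
print(semi_join)        # [(1, 'x'), (2, 'y')]
print(dedup_left_join)  # [(1, 'x', 1), (2, 'y', 2), (3, 'z', None)]
```

Note that the semi join drops key 3, which has no match. If the unmatched rows of the original left join must be kept, the last query, which joins against the deduplicated keys, is the closer fit.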

Getting the count from two tables in Apache HIVE or SQL

So I have two tables:
table_1 and table_2
They both have various columns with the same name.
We only need to work with 2 columns:
ID and REGION
table_1 has ID values that appear only in table_1.
table_2 has ID values that appear only in table_2.
However, some ID values are shared by both table_1 and table_2.
I need to write a query that returns the number of distinct ID values across both tables where REGION = '1'.
A FULL OUTER JOIN should do the trick.
SELECT COUNT(*)
FROM (SELECT id FROM table_1 WHERE region = '1') t1
FULL OUTER JOIN (SELECT id FROM table_2 WHERE region = '1') t2 ON (t1.id = t2.id)
Assuming id is unique within each table, this produces a single row for every id that is in either table_1 or table_2. If the id is in both tables, it still produces a single row.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
Using SQL, take advantage of a UNION to eliminate duplicate values between the two tables, so you're left with a distinct list of ID values to count.
SELECT COUNT(*)
FROM (SELECT ID
FROM table_1
WHERE REGION = '1'
UNION
SELECT ID
FROM table_2
WHERE REGION = '1') t
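The UNION approach above can be checked on a few rows. This sketch uses an in-memory SQLite database with invented data; only the ID and REGION columns from the question are modelled. It also shows an equivalent COUNT(DISTINCT ...) over UNION ALL for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (ID INTEGER, REGION TEXT);
    CREATE TABLE table_2 (ID INTEGER, REGION TEXT);
    INSERT INTO table_1 VALUES (1, '1'), (2, '1'), (3, '2');
    INSERT INTO table_2 VALUES (2, '1'), (4, '1'), (5, '2');  -- ID 2 is shared
""")

# UNION removes the duplicate ID 2 before counting.
union_count = conn.execute("""
    SELECT COUNT(*)
    FROM (SELECT ID FROM table_1 WHERE REGION = '1'
          UNION
          SELECT ID FROM table_2 WHERE REGION = '1') t
""").fetchone()[0]

# Same result: keep duplicates with UNION ALL, then count distinct IDs.
distinct_count = conn.execute("""
    SELECT COUNT(DISTINCT ID)
    FROM (SELECT ID FROM table_1 WHERE REGION = '1'
          UNION ALL
          SELECT ID FROM table_2 WHERE REGION = '1') t
""").fetchone()[0]

print(union_count, distinct_count)  # 3 3
```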

Selecting few columns as table

I have a problem in designing a query:
I have to select a few records based on a criterion:
(SELECT COL_1, COL_2,COL_3 FROM TABLE_1 WHERE COL_3 = 'CND')
Now I need to select records from two databases based on these results:
(SELECT XX_1, XX_2
FROM TABLE_2 WHERE TABLE_2.XX_1 = TABLE_1.COL_1)
(from the filtered results in step 1)
(SELECT YY_1, YY_2, YY_3
FROM TABLE_3 WHERE TABLE_3.YY_2 = TABLE_1.COL_2)
(from the filtered results in step 1)
I need the results in a single table view:
XX_1, XX_2, YY_1, YY_2, YY_3
The columns mentioned above must be equal for a row to be in the result; only records that satisfy those equalities should be fetched.
I need to run this on millions of records, so performance matters.
It is going to be used in Java classes, so please don't suggest any DB-specific commands that I can't execute: I hold no database permissions other than read.
I hope this is clear. If not, I will clarify.
I tried something like this
SELECT *
FROM TABLE_2
JOIN
(SELECT COL_1,
COL_2,
COL_3
FROM TABLE_1
WHERE COL_3 = 'CND'
GROUP BY COL_1) TMP_TABLE
ON (TMP_TABLE.COL_1 = TABLE2.XX_1)
But I got a "table or view does not exist" Oracle error.
I think you need to use a subquery, like this:
select tbl3.col_1, tbl3.col_2
from (
select tbl1.col_1, tbl1.col_2
from (
select col_1, col_2 from table_1
) tbl1
left join table_2 tbl2 on tbl2.col_1 = tbl1.col_1
) tbl3
left join table_3 tbl4 on tbl4.col_1 = tbl3.col_1
with usedrows as
( select a.Col_1, a.Col_2 FROM table_1 a left JOIN table_2 b ON a.Col_1 = b.Col_2)
select Col_1, Col_2 from usedrows
This is just an example in which usedrows is a virtual table built from the join. You can select columns from it just as you would from any other table.
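Applied to the column names in the question, the subquery idea from the answers above might look like the sketch below. It is one possible reading of the requirement (inner joins, so only rows matching in both TABLE_2 and TABLE_3 are returned), it uses plain SELECT statements that need nothing beyond read permission, and the in-memory SQLite tables and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_1 (COL_1 INTEGER, COL_2 INTEGER, COL_3 TEXT);
    CREATE TABLE TABLE_2 (XX_1 INTEGER, XX_2 TEXT);
    CREATE TABLE TABLE_3 (YY_1 TEXT, YY_2 INTEGER, YY_3 TEXT);
    INSERT INTO TABLE_1 VALUES (1, 10, 'CND'), (2, 20, 'OTHER'), (3, 30, 'CND');
    INSERT INTO TABLE_2 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO TABLE_3 VALUES ('p', 10, 'q'), ('r', 20, 's');
""")

# Step 1 is the derived table f; steps 2 and 3 are inner joins against it.
query = """
    SELECT t2.XX_1, t2.XX_2, t3.YY_1, t3.YY_2, t3.YY_3
    FROM (SELECT COL_1, COL_2 FROM TABLE_1 WHERE COL_3 = 'CND') f
    JOIN TABLE_2 t2 ON t2.XX_1 = f.COL_1
    JOIN TABLE_3 t3 ON t3.YY_2 = f.COL_2
    ORDER BY t2.XX_1
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 'a', 'p', 10, 'q')]
```

The row with COL_1 = 3 passes the filter but has no partner in TABLE_3, so the inner join leaves it out; a LEFT JOIN would keep it with NULLs in the YY columns.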

How do I merge data from two tables in a single database call into the same columns?

If I run the two statements in a batch, will they return one table or two to my SqlCommand object, with the data merged? What I am trying to do is optimize a search by searching twice: first on one set of data and then on another. They have the same fields, and I'd like all the records from both tables to be shown and added to each other. I need this so that I can sort across both sets of data, but short of writing a stored procedure I can't think of a way of doing this.
E.g. Table 1 has columns A and B; Table 2 has the same columns but a different data source. I then want to merge them so that if A exists in only one table it is added to the result set, and if it exists in both tables, column B is summed between the two.
Please note that this is not the same as a full outer join operation, as that does not merge the data.
[EDIT]
Here's what the code looks like:
Select * From
(Select ID,COUNT(*) AS Count From [Table1] Group By ID) as T1
full outer join
(Select ID,COUNT(*) AS Count From [Table2] Group By ID) as T2
on T1.ID = T2.ID
Perhaps you're looking for UNION?
For example:
SELECT A, B FROM Table1
UNION
SELECT A, B FROM Table2
Possibly:
select table1.a, table1.b
from table1
where table1.a not in (select a from table2)
union all
select table1.a, table1.b+table2.b as b
from table1
inner join table2 on table1.a = table2.a
Edit: perhaps you would benefit from unioning the tables before counting, e.g.:
select id, count(*) as count from
(select id from table1
union all
select id from table2) t
group by id
I'm not sure if I understand completely but you seem to be asking about a UNION
SELECT A,B
FROM tableX
UNION ALL
SELECT A,B
FROM tableY
To do it, you would go:
SELECT * INTO TABLE3 FROM TABLE1
UNION
SELECT * FROM TABLE2
Provided both tables have the same columns
I think what you are looking for is this, but I am not sure I am understanding your language correctly.
select id, sum(count) as count
from (
select id, count(*) as count
from table1
group by id
union all
select id, count(*) as count
from table2
group by id
) a
group by id
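For the A/B example in the question, the same "UNION ALL, then aggregate" pattern can be sketched as follows. The in-memory SQLite tables and their rows are assumptions for illustration; a key that exists in only one table passes through unchanged, and a key that exists in both has its B values summed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (A TEXT, B INTEGER);
    CREATE TABLE Table2 (A TEXT, B INTEGER);
    INSERT INTO Table1 VALUES ('x', 1), ('y', 2);
    INSERT INTO Table2 VALUES ('y', 5), ('z', 7);  -- 'y' exists in both tables
""")

# UNION ALL keeps every row from both tables; GROUP BY then merges matching keys.
merged = conn.execute("""
    SELECT A, SUM(B) AS B
    FROM (SELECT A, B FROM Table1
          UNION ALL
          SELECT A, B FROM Table2) both_tables
    GROUP BY A
    ORDER BY A
""").fetchall()
print(merged)  # [('x', 1), ('y', 7), ('z', 7)]
```

Because everything happens in a single statement, the result comes back as one result set and can be sorted with an ordinary ORDER BY.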