What is the efficient way to query a subset of a joined table - SQL

My query is somewhat like this
SELECT TableA.Column1
FROM TableA
LEFT JOIN TableB ON TableA.ForeignKey = TableB.PrimaryKey
LEFT JOIN TableC ON TableC.PrimaryKey = TableB.ForeignKey
WHERE TableC.SomeColumn = 'XXX'
In the above case, TableA and TableB are large tables (each may contain more than 1 million rows), but TableC is small, with just 25 rows.
I have applied indexes on primary keys of all the tables.
In our application scenario, I need to search in TableC for just two conditions, TableC.SomeColumn = 'XXX' or TableC.SomeColumn = 'YYY'.
My question is: what is the most efficient way to do this? A straight join does work, but I am concerned about joining against all the rows in TableB just to pick the small subset that survives the join to TableC.
Is it a good approach to have an indexed view?
For example (pseudocode):
CREATE INDEXED VIEW FOR TableB
JOIN TableC ON TableC.PrimaryKey = TableB.ForeignKey
WHERE TableC.SomeColumn IN ('XXX', 'YYY')

Your WHERE clause undoes the outer join, so you might as well write the query as:
SELECT a.Column1
FROM TableA a
JOIN TableB b ON a.ForeignKey = b.PrimaryKey
JOIN TableC c ON c.PrimaryKey = b.ForeignKey
WHERE c.SomeColumn = 'XXX';
For this query, you want these indexes:
TableC(SomeColumn, PrimaryKey)
TableB(ForeignKey, PrimaryKey)
TableA(ForeignKey, Column1)
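As DDL, that would be something like (index names are illustrative):
CREATE INDEX ix_c_somecol ON TableC(SomeColumn, PrimaryKey);
CREATE INDEX ix_b_fk ON TableB(ForeignKey, PrimaryKey);
CREATE INDEX ix_a_fk ON TableA(ForeignKey, Column1);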
You can create an indexed view. That would generally be the fastest for querying. However, it can incur a lot more overhead for updates and inserts into any of the base tables.
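Were I to sketch the indexed view (assuming SQL Server and a dbo schema; indexed views require SCHEMABINDING, two-part table names, and a unique clustered index):
CREATE VIEW dbo.vw_TableB_Small
WITH SCHEMABINDING
AS
SELECT b.PrimaryKey, b.ForeignKey
FROM dbo.TableB b
JOIN dbo.TableC c ON c.PrimaryKey = b.ForeignKey
WHERE c.SomeColumn IN ('XXX', 'YYY');
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_TableB_Small
ON dbo.vw_TableB_Small (PrimaryKey);
Queries can then read from dbo.vw_TableB_Small (with NOEXPAND on editions that need it) instead of joining TableB to TableC directly.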

I typically only use a JOIN when I need to SELECT or GROUP on the data, not when using it as a predicate. That said, I would be very curious to see if Gordon's answer or this one performs better.
I would also suggest getting into the habit of using aliases when referencing your tables; it's less typing and makes your code easier to read.
I would test and compare execution times:
SELECT A.Column1
FROM TableA A
WHERE EXISTS (SELECT 1
              FROM TableB B
              WHERE A.ForeignKey = B.PrimaryKey
                AND EXISTS (SELECT 1
                            FROM TableC C
                            WHERE C.PrimaryKey = B.ForeignKey
                              AND C.SomeColumn = 'XXX'))

Related

Oracle SQL - Select not using index as expected

So I haven't used Oracle in more than 5 years and I'm out of practice. I've been on SQL Server all that time.
I'm looking at some of the existing queries and trying to improve them, but they're behaving really weirdly. According to the explain plan, instead of going faster they're doing full table scans and not using the indexes.
In the original query, there is an equijoin between two tables done in the WHERE clause. We'll call them table A and B. I used an explain plan followed by SELECT * FROM table(DBMS_XPLAN.DISPLAY (FORMAT=>'ALL +OUTLINE')); and it tells me that Table A is queried by local index.
TABLE ACCESS BY LOCAL INDEX ROWID
SELECT A.*
FROM TableA A, TableB B
WHERE A.SecondaryID = B.ID;
I tried to change the query and join TableA with a new table (Table C). Table C is a subset of Table B with 700 records instead of 100K. However the explain plan tells me that Table A is now queried with a full lookup.
CREATE TABLE TableC
AS
SELECT * FROM TableB WHERE Active = 'Y';
SELECT A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
Next step, I kept the join between tables A & C, but used a hint to tell it to use the index on Table A. However it still does a full lookup.
SELECT /*+ INDEX (A_NDX01) */ A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
So I tried to change from a join to a simple Select of table A and use an IN statement to compare to table C. Still a full table scan.
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC);
Lastly, I took the previous statement and changed the subselect to pull the top 1000 records, and it used the index. The odd thing is that there are only 700 records in Table C.
SELECT A.*
FROM TableA A
WHERE A.SecondaryID IN (SELECT ID FROM TableC WHERE rownum < 1000)
I was wondering if someone could help me figure out what's happening?
My best guess is that since TableC is a new table, maybe the optimizer doesn't know how many records are in it, and that's why it will only use the index if it knows there are fewer than 1000 records?
I tried to run dbms_stats.gather_schema_stats on my schema though and it did not help.
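For reference, the call was roughly this (reconstructing from memory):
EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => USER);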
Thank you for your help.
As a general rule, using an index will not always make your query go faster.
Hints are directives suggesting an access path to the optimizer; that doesn't mean the optimizer will choose to obey them. In this case, the optimizer will have considered an index lookup on TableA more expensive than a full scan for the following statements:
SELECT A.*
FROM TableA A, TableB B
WHERE A.SecondaryID = B.ID;
SELECT /*+ INDEX (A_NDX01) */ A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC);
Internally, it might have converted all of these statements (including the IN) into a join and, considering the data in TableA and TableC, decided to use a full table scan.
When you added the rownum condition, this plan conversion was not done, because view merging will not happen when rownum is present in the query block.
I believe this is what is happening when you did
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC WHERE rownum <1000)
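If blocking that transformation is indeed what helps here, a less hacky alternative to the artificial rownum filter (a suggestion to test, not something the plan above guarantees) is the NO_UNNEST hint, which tells the optimizer not to unnest the subquery into a join:
SELECT A.*
FROM TableA A
WHERE A.SecondaryID IN (SELECT /*+ NO_UNNEST */ ID FROM TableC);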
Have a look at the following link
Oracle. Preventing merge subquery and main query conditions

INTERSECT two tables of size 500M rows in Vertica

I am very new to Vertica and hence looking for efficient ways of comparing two tables of average size 500M-800M rows in Vertica. I have a process that gets the data from a Vertica view and dumps it into SQL Server for a later merge into a final table in SQL Server. For a few large tables combined, it is dumping about 3 billion rows daily. Instead of dumping all the data, I want to take a daily snapshot, compare it with the previous day's snapshot on the Vertica side only, and then push only the changed rows into SQL Server.
Let's say the previous snapshot is stored in tableA and today's snapshot is stored in tableB. The PK on both tables is a column named OrderId.
The simplest way I can think of is:
SELECT * FROM tableB
WHERE OrderId NOT IN (
    SELECT OrderId FROM (
        SELECT * FROM tableA
        INTERSECT
        SELECT * FROM tableB
    ) unchanged
)
So my questions are:
Is there any other/better option in Vertica to get only the changed rows between two tables? Should I even consider doing this comparison on the Vertica side?
How long should such a comparison take?
What should I consider to improve the performance of such a query?
If your columns have no NULL values, then a massive LEFT JOIN, keeping only the rows with no full match, would seem to do what you want:
select b.*
from tableB b
left join tableA a
  on b.OrderId = a.OrderId
 and b.col1 = a.col1
 and . . . -- for all the columns you care about
where a.OrderId is null;
However, I think you want except:
select b.*
from tableB b
except
select a.*
from tableA a;
I imagine this would have reasonable performance.
Do you have a primary key in the two tables?
Then my technique, for a complete Change Data Capture, is:
SELECT
  'I' AS to_do
, newrows.*
FROM tb_today newrows
LEFT
JOIN tb_yesterday oldrows USING(id)
WHERE oldrows.id IS NULL
UNION ALL
SELECT
  'U' AS to_do
, newrows.*
FROM tb_today newrows
JOIN tb_yesterday oldrows USING(id)
WHERE oldrows.fname <> newrows.fname
   OR oldrows.lname <> newrows.lname
   OR oldrows.bdate <> newrows.bdate
   OR oldrows.sal   <> newrows.sal
[...]
   OR oldrows.lastcol <> newrows.lastcol
UNION ALL
SELECT
  'D' AS to_do
, oldrows.*
FROM tb_yesterday oldrows
LEFT
JOIN tb_today newrows USING(id)
WHERE newrows.id IS NULL
;
Just leave out the last leg of the UNION ALL if you don't want to cater for DELETEs ('D').
Good luck
You can also do it nicely using joins:
SELECT b.*
FROM tableB AS b
LEFT JOIN tableA AS a ON a.id = b.id
WHERE a.id IS NULL
The above query returns only the diff from tableB to tableA, i.e. data present in both tables will be skipped...

SQL filter LEFT TABLE before left join

I have read a number of posts from SO and I understand the differences between filtering in the where clause and on clause. But most of those examples are filtering on the RIGHT table (when using left join). If I have a query such as below:
select * from tableA A left join tableB B on A.ID = B.ID and A.ID = 20
The return values are not what I expected. I would have thought it would first filter the left table, fetching only rows with ID = 20, and then do a left join with tableB.
Of course, this should be technically the same as doing:
select * from tableA A left join tableB B on A.ID = B.ID where A.ID = 20
But I thought the performance would be better if you could filter the table before doing the join. Can someone enlighten me on how this SQL is processed and help me understand it thoroughly?
A left join follows a simple rule. It keeps all the rows in the first table. The values of the second table's columns depend on the on clause: if there is no match, those columns are NULL -- regardless of whether the on conditions reference the first or the second table.
So, for this query:
select *
from tableA A left join
tableB B
on A.ID = B.ID and A.ID = 20;
All the rows in A are in the result set, regardless of whether or not there is a match. When the id is not 20, then the rows and columns are still taken from A. However, the condition is false so the columns in B are NULL. This is a simple rule. It does not depend on whether the conditions are on the first table or the second table.
For this query:
select *
from tableA A left join
tableB B
on A.ID = B.ID
where A.ID = 20;
The from clause keeps all the rows in A. But then the where clause has its effect, and it filters the rows so that only rows with id 20 are in the result set.
When using a left join:
Filter conditions on the first table go in the where clause.
Filter conditions on subsequent tables go in the on clause.
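To make the rule concrete, a minimal sketch using the question's tables:
-- Condition on the first table in WHERE: only id-20 rows survive
select * from tableA A left join tableB B on A.ID = B.ID where A.ID = 20;
-- Condition on the second table in ON: all of A's rows survive,
-- but B's columns are NULL wherever B.ID is not 20
select * from tableA A left join tableB B on A.ID = B.ID and B.ID = 20;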
Where you have FROM tableA, you could put a subquery like FROM (SELECT x.* FROM tableA x WHERE x.ID = 20) TA, then refer to TA as you did tableA previously.
Likely the query optimizer would do this for you.
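Spelled out against the question's query, that would look like:
select *
from (select x.* from tableA x where x.ID = 20) TA
left join tableB B on TA.ID = B.ID;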
Oracle has a way to show the query plan: put EXPLAIN PLAN FOR before the SQL statement. Look at the plan both ways and see what it does.
In your first SQL statement, A.ID = 20 is technically not joining anything. Joins are used to connect two separate tables, with the ON clause matching columns by associating them as keys.
WHERE clauses filter the data, reducing the rows returned to those where the value is found in that particular column.

Optimizing SQL Query - Joining 4 tables

I am trying to join 4 tables. Currently I've achieved it by doing this.
SELECT columns
FROM tableA
LEFT OUTER JOIN tableB ON tableB.address_id = tableA.address_id
INNER JOIN tableC ON tableC.company_id = tableA.company_id AND tableC.client_id = ?
UNION
SELECT columns
FROM tableA
LEFT OUTER JOIN tableB ON tableB.address_id = tableA.gaddress_id
INNER JOIN tableD ON tableD.company_id = tableA.company_id AND tableD.branch_id = ?
The structure of tableC and tableD is very similar. Let's say that tableC contains data for clients and tableD contains data for clients' branches. tableA holds companies and tableB holds addresses. My goal is to get the data from tableA that joins to tableB (all companies that have addresses), together with all the data from tableC and tableD.
This works, but I am afraid it would be very slow.
I think you can trick it like this: first UNION C and D, and only then join to the rest of the query; it should improve the query significantly:
SELECT columns
FROM tableA
LEFT OUTER JOIN tableB ON tableB.address_id = tableA.address_id
INNER JOIN (SELECT columns, client_id AS New_Col_ID, '1' AS ind_where FROM tableC
            UNION ALL
            SELECT columns, branch_id, '2' FROM tableD) joined_Table
    ON (joined_Table.company_id = tableA.company_id AND joined_Table.New_Col_ID = ?)
New_Col_ID: just select both branch_id and client_id in the same column position and alias it as New_Col_ID, or whatever (as done in the derived table above).
In addition, you can index the tables (if the indexes don't exist yet):
TableA(address_id,company_id)
TableB(address_id)
TableC(company_id,client_id)
TableD(company_id,branch_id)
Why should that be slow? You select client addresses and branch addresses and show the complete result. That seems straightforward.
You join on IDs and this should be fast (as there should be indexes available accordingly). You may want to introduce composite indexes on
create index idx_c on tableC(client_id, company_id)
and
create index idx_d on tableD(branch_id, company_id)
However: UNION is a lot of work for the DBMS, because it has to look for and eliminate duplicates. Can there even be any? Otherwise use UNION ALL.
Try a CTE so that you don't have to go through tableA and tableB twice for the union.
; WITH TempTable (Column1, Column2, ..., company_id)
AS ( SELECT columns, tableA.company_id  -- the CTE must carry company_id so the joins below can use it
     FROM tableA
     LEFT OUTER JOIN tableB
        ON tableB.address_id = tableA.gaddress_id
   )
SELECT Columns
FROM TempTable
INNER JOIN tableC
    ON tableC.company_id = TempTable.company_id AND tableC.client_id = ?
UNION
SELECT Columns
FROM TempTable
INNER JOIN tableD ON tableD.company_id = TempTable.company_id AND tableD.branch_id = ?

Should I use a temp table?

I have a report query that is taking 4 minutes, and it has to run under the maximum 30-second limit applied to us.
I notice that it has a LOT of INNER JOINs. One of them, I see, joins to a Person table, which has millions of rows. I'm wondering if it would be more efficient to break up the query. Would it be more efficient to do something like:
Assume all keys are indexed.
Table C has 8 million records, Table B has 6 Million records, Table A has 400,000 records.
SELECT Fields
FROM TableA A
INNER JOIN TableB B
ON b.key = a.key
INNER JOIN TableC C
ON C.key = b.CKey
WHERE A.id = AnInput
Or
SELECT *
INTO TempTableC
FROM TableC
WHERE id = AnInput
-- TempTableC now has 1000 records
Then
SELECT Fields
FROM TableA A
INNER JOIN TableB B --Maybe put this into a filtered temp table?
ON b.key = a.key
INNER JOIN TempTableC c
ON c.AField = b.aField
WHERE a.id = AnInput
Basically, bring the result sets into temp tables, then join.
If your Person table is indexed correctly, then the INNER JOIN should not be causing such a problem. Check that you have an index created on column(s) that are joined to in all your tables. Using temp tables for what appears to be a relatively simple query seems to be papering over the cracks of an inadequate database design.
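For the query as written, the supporting indexes would be something along these lines (index names are illustrative; the columns come from the query above):
CREATE INDEX IX_TableA_id ON TableA(id, key);
CREATE INDEX IX_TableB_key ON TableB(key, CKey);
CREATE INDEX IX_TableC_key ON TableC(key);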
As others have said, the only way to be sure is to post your query plan.