SQL query on large tables fast at first then slow

The query below returns the initial results quickly but then becomes extremely slow.
SELECT A.Id
, B.Date1
FROM A
LEFT OUTER JOIN B
ON A.Id = B.Id AND A.Flag = 'Y'
AND (B.Date1 IS NOT NULL AND A.Date >= B.Date2 AND A.Date < B.Date1)
Table A has 24 million records and Table B has 500 thousand records.
Index for Table A is on columns: Id and Date
Index for Table B is on columns: Id, Date2, Date1 (Date1 is nullable; the index is unique)
The first 11 million records are returned quite fast, but then the query suddenly becomes extremely slow. The execution plan shows the indexes are being used.
However, when I remove the condition A.Date < B.Date1, the query becomes fast again.
Do you know what should be done to improve the performance? Thanks
UPDATE:
I updated the query to show that I need fields of Table B in the result. You might wonder why I used a LEFT JOIN when I have the condition "B.Date1 IS NOT NULL"; that's because I posted a simplified query. My performance issue occurs even with this simplified version.

You could try using EXISTS. It should be faster, as it stops looking for further rows once a match is found, unlike a JOIN, where all the matching rows have to be fetched and joined.
select id
from a
where flag = 'Y'
and exists (
    select 1
    from b
    where a.id = b.id
    and a.date >= b.date2
    and a.date < b.date1
    and b.date1 is not null
);
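Note that EXISTS alone cannot return B.Date1, which the update says is needed. The question does not name the DBMS, but assuming SQL Server, and assuming at most one B row can match each A row (plausible given the unique index on Id, Date2, Date1), a sketch that keeps the left-join semantics while still probing at most one B row per A row is:
SELECT a.Id, b1.Date1
FROM A a
OUTER APPLY (
    SELECT TOP (1) b.Date1
    FROM B b
    WHERE a.Flag = 'Y'        -- mirrors the original ON-clause condition
      AND b.Id = a.Id
      AND b.Date1 IS NOT NULL
      AND a.Date >= b.Date2
      AND a.Date < b.Date1
) AS b1;
This is only a sketch; TOP (1) changes the result if several B rows can match one A row.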

Generally, what I've noticed about SQL performance is that it depends on the DATA you are joining; for instance, ONE-to-ONE relationships are much faster than ONE-to-MANY relationships.
I've noticed a ONE-to-MANY join from a table with 3,000 rows to a table with 30,000 rows can easily take 11-15 seconds, even with LIMIT. The same query, redesigned with all ONE-to-ONE relationships, would take less than 1 second.
So here are my suggestions to speed up your query.
LEFT JOIN and LEFT OUTER JOIN are the same, so it doesn't matter which one you use.
But ideally you should use an INNER join, because in your question you stated B.Date1 IS NOT NULL, which discards the unmatched rows anyway.
Note that a plain derived table in a JOIN cannot reference columns of the outer query (such as a.Date below); the subquery has to be a LATERAL derived table (MySQL 8.0.14+) to correlate with table A.
SELECT a.Id
FROM A a
JOIN LATERAL (
    SELECT COUNT(*) AS TotalLinks
    FROM B b
    WHERE b.Id = a.Id
      AND b.Date1 IS NOT NULL
      AND a.Date >= b.Date2
      AND a.Date < b.Date1
) AS ab ON ab.TotalLinks > 0
WHERE a.Flag = 'Y'
LIMIT 0, 500
Also try to LIMIT the data you want; this reduces the work the database has to do.

Related

SQL: placement of inner joins and impact of performance and correctness

Is there any difference (in performance or correctness) between these two SQL statements? (The SQL content itself is not important.)
SQL no. 1
SELECT a.test1,
a.test2,
a.test3,
b.test1,
b.test2,
b.test3,
c.test1,
c.test2,
c.test3
FROM table_1 a
JOIN table_2 b ON a.id = b.id
AND (a.test1 = "test")
AND (b.test2 = "test")
JOIN table_3 c ON c.id2 = b.id2
SQL no. 2
SELECT a.test1,
a.test2,
a.test3,
b.test1,
b.test2,
b.test3,
c.test1,
c.test2,
c.test3
FROM table_3 c
JOIN table_2 b ON c.id2 = b.id2
JOIN table_1 a ON a.id = b.id
AND (a.test1 = "test")
AND (b.test2 = "test")
Also, table a has 500,000 records, table b has 1,000,000, and table c has 1,500,000 records.
Honestly, I don't think reversing the join statements will make much of a difference. Either way the engine has to combine the data of the table with the data of the two other tables. The calculations that have to be performed don't seem to be affected by the join order. If the first join changed the amount of data that needed to be scanned to perform the second join, I could see why the query execution time would be affected, as would the results. But aren't the statements independent of each other?
Measuring the time of queries can be done as follows in SQL Server Management Studio:
set statistics time on
$query_to_be_measured
set statistics time off
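For example, to time SQL no. 1 (column list shortened, and using single quotes since SSMS implies T-SQL), the timings then appear on the Messages tab as "CPU time = ... ms, elapsed time = ... ms":
SET STATISTICS TIME ON;

-- SQL no. 1 with a shortened column list
SELECT a.test1, b.test2, c.test3
FROM table_1 a
JOIN table_2 b ON a.id = b.id
    AND a.test1 = 'test'
    AND b.test2 = 'test'
JOIN table_3 c ON c.id2 = b.id2;

SET STATISTICS TIME OFF;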

Optimize Join in Greenplum

I have table A with 20 million records. There is table B with 200,000 records.
I want to do a join like:
select *
from tableA a
left join tableB b
on ((a.name1 = b.name1 OR a.name1 = b.name2) OR a.id = b.id)
and a.time > b.time
;
This is very time-consuming.
I am using Greenplum, so I cannot make use of indexes.
How can I optimize this?
The number of rows in table B grows incrementally and will keep increasing.
Greenplum does support indexes. However, this query is tricky: regardless of what your distribution column is, there is no way to co-locate the join, for the following reasons.
a.time or b.time is a bad candidate for a distribution column, since it is compared with the ">" operator.
You could distribute tableA by (name1, id) and tableB by (name1, name2, id). But to check whether a.time > b.time is satisfied, you still need to compare all tuples.
I am afraid this query is not a very MPP-friendly one.
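For reference, this is roughly what the suggested distributions would look like (Greenplum's ALTER TABLE ... SET DISTRIBUTED BY syntax; the table names are the question's). Co-location would only ever help the equality branches of the ON clause; the a.time > b.time predicate still forces cross-segment comparison.
ALTER TABLE tableA SET DISTRIBUTED BY (name1, id);
ALTER TABLE tableB SET DISTRIBUTED BY (name1, name2, id);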

SQL filter LEFT TABLE before left join

I have read a number of posts from SO and I understand the differences between filtering in the where clause and on clause. But most of those examples are filtering on the RIGHT table (when using left join). If I have a query such as below:
select * from tableA A left join tableB B on A.ID = B.ID and A.ID = 20
The return values are not what I expected. I would have thought it first filters the left table, fetching only rows with ID = 20, and then does a left join with tableB.
Of course, this should be technically the same as doing:
select * from tableA A left join tableB B on A.ID = B.ID where A.ID = 20
But I thought the performance would be better if you could filter the table before doing the join. Can someone enlighten me on how this SQL is processed and help me understand it thoroughly?
A left join follows a simple rule: it keeps all the rows in the first table. The values of the second table's columns depend on the on clause; if there is no match, the second table's columns are NULL. This holds regardless of whether the failing condition refers to the first or the second table.
So, for this query:
select *
from tableA A left join
tableB B
on A.ID = B.ID and A.ID = 20;
All the rows in A are in the result set, regardless of whether there is a match. When the id is not 20, the row is still taken from A, but the condition is false, so the columns from B are NULL. This simple rule does not depend on whether the conditions are on the first table or the second table.
For this query:
select *
from tableA A left join
tableB B
on A.ID = B.ID
where A.ID = 20;
The from clause keeps all the rows in A. But then the where clause has its effect, filtering the rows so that only id 20 is in the result set.
When using a left join:
Filter conditions on the first table go in the where clause.
Filter conditions on subsequent tables go in the on clause.
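A minimal self-contained example makes the difference concrete (PostgreSQL-style VALUES lists; the two-row data is hypothetical):
-- Hypothetical data: tableA has ids 10 and 20; tableB has id 20 only.
WITH tableA(ID) AS (VALUES (10), (20)),
     tableB(ID) AS (VALUES (20))
-- Condition in the ON clause: both A rows survive;
-- the row with ID = 10 gets NULL for B's column.
SELECT * FROM tableA A LEFT JOIN tableB B ON A.ID = B.ID AND A.ID = 20;

WITH tableA(ID) AS (VALUES (10), (20)),
     tableB(ID) AS (VALUES (20))
-- Condition in the WHERE clause: only the row with ID = 20 survives.
SELECT * FROM tableA A LEFT JOIN tableB B ON A.ID = B.ID WHERE A.ID = 20;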
Where you have from tablea, you could put a subquery instead, like from (select x.* from tablea X where x.value = 20) TA.
Then refer to TA as you previously referred to tablea.
Likely the query optimizer would do this for you anyway.
Oracle has a way to show the query plan: put EXPLAIN PLAN FOR before the SQL statement. Look at the plan both ways and see what it does.
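For example, with the query from the question (standard Oracle usage; DBMS_XPLAN.DISPLAY shows the captured plan):
EXPLAIN PLAN FOR
SELECT * FROM tableA A LEFT JOIN tableB B ON A.ID = B.ID AND A.ID = 20;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);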
In your first SQL statement, A.ID = 20 is technically not being joined to anything. Joins are used to connect two separate tables, with the ON clause joining columns by associating them as keys.
WHERE clauses filter the data, reducing the rows returned to only those where that value is found in that particular column.

Oracle semi-join with multiple tables in SQL subquery

This question is about how to work around the apparent Oracle limitation on semi-joins with multiple tables in the subquery. I have the following two UPDATE statements.
Update 1:
UPDATE
  (SELECT a.flag update_column
     FROM a, b
    WHERE a.id = b.id AND
          EXISTS (SELECT NULL
                    FROM c
                   WHERE c.id2 = b.id2 AND
                         c.time BETWEEN start_in AND end_in) AND
          EXISTS (SELECT NULL
                    FROM TABLE(update_in) d
                   WHERE b.time BETWEEN d.start_time AND d.end_time))
  SET update_column = 'F'
The execution plan indicates that this correctly performs two semi-joins, and the update executes in seconds. These need to be semi-joins because c.id2 is not a unique foreign key on b.id2, unlike b.id and a.id. And update_in doesn't have any constraints at all, since it's an array.
Update 2:
UPDATE
  (SELECT a.flag update_column
     FROM a, b
    WHERE a.id = b.id AND
          EXISTS (SELECT NULL
                    FROM c, TABLE(update_in) d
                   WHERE c.id2 = b.id2 AND
                         c.time > d.time AND
                         b.time BETWEEN d.start_time AND d.end_time))
  SET update_column = 'F'
This does not do a semi-join; based on the Oracle documentation, I believe that's because the EXISTS subquery has two tables in it. Due to the sizes of the tables, and partitioning, this update takes hours. However, there is no way to relate d.time to the associated d.start_time and d.end_time other than their being on the same row. And the reason we pass in the update_in array and join it here is that running this query in a loop for each time/start_time/end_time combination also gave poor performance.
Is there a reason other than the two tables in the subquery that the semi-join is not working? If not, is there a way around this limitation? Is there some simple solution I am missing that would make these criteria work without putting two tables in the subquery?
As Bob suggests, you can use a Global Temporary Table (GTT) with the same structure as your update_in array. The key difference is that you can create indexes on the GTT, and if you populate it with representative sample data, you can also collect statistics on it, so the query optimizer is better able to predict an optimal query plan.
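A sketch of that approach, with hypothetical names, mirroring the fields the queries read from the update_in array:
-- Hypothetical GTT mirroring the update_in array; the index supports the
-- b.time BETWEEN d.start_time AND d.end_time probe.
CREATE GLOBAL TEMPORARY TABLE update_in_gtt (
    time       TIMESTAMP,
    start_time TIMESTAMP,
    end_time   TIMESTAMP
) ON COMMIT PRESERVE ROWS;

CREATE INDEX update_in_gtt_ix ON update_in_gtt (start_time, end_time);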
That said, there are also some other notable differences between your two queries:
In the first EXISTS clause of your first query you refer to two columns, start_in and end_in, that don't have table references. My guess is that they are either columns in table a or b, or variables within the current scope of your SQL statement; it's not clear which.
In your second query you refer to column d.time, which you don't use in the first query.
Does updating your second query to the following improve its performance?
UPDATE
  (SELECT a.flag update_column
     FROM a, b
    WHERE a.id = b.id AND
          EXISTS (SELECT NULL
                    FROM c, TABLE(update_in) d
                   WHERE c.id2 = b.id2 AND
                         c.time BETWEEN start_in AND end_in AND
                         c.time > d.time AND
                         b.time BETWEEN d.start_time AND d.end_time))
  SET update_column = 'F'
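If you do go the GTT route, TABLE(update_in) above would be replaced by the indexed GTT, and, as noted, statistics should be gathered on it once populated, for example (hypothetical table name from the sketch above):
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'UPDATE_IN_GTT');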

Should I use a temp table?

I have a report query that is taking 4 minutes, far over the maximum 30 seconds allowed by the limit imposed on us.
I notice that it has a LOT of INNER JOINs. One, I see, joins to a Person table, which has millions of rows. I'm wondering whether it would be more efficient to break up the query. Would it be more efficient to do something like the following?
Assume all keys are indexed.
Table C has 8 million records, Table B has 6 million records, and Table A has 400,000 records.
SELECT Fields
FROM TableA A
INNER JOIN TableB B
ON b.key = a.key
INNER JOIN TableC C
ON C.key = b.CKey
WHERE A.id = AnInput
Or
SELECT *
INTO TempTableC
FROM TableC
WHERE id = AnInput
-- TempTableC now has 1000 records
Then
SELECT Fields
FROM TableA A
INNER JOIN TableB B --Maybe put this into a filtered temp table?
ON b.key = a.key
INNER JOIN TempTableC c
ON c.AField = b.aField
WHERE a.id = AnInput
Basically, bring the result sets into temp tables, then join.
If your Person table is indexed correctly, the INNER JOIN should not be causing such a problem. Check that you have an index on the column(s) that are joined in all your tables. Using temp tables for what appears to be a relatively simple query seems to be papering over the cracks of an inadequate database design.
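As a sketch of that advice (T-SQL, hypothetical index names; [key] is bracketed because KEY is a reserved word), the join columns in the query above would suggest:
-- Hypothetical indexes supporting b.key = a.key and C.key = b.CKey.
CREATE INDEX IX_TableA_id_key ON TableA (id, [key]);
CREATE INDEX IX_TableB_key    ON TableB ([key], CKey);
CREATE INDEX IX_TableC_key    ON TableC ([key]);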
As others have said, the only way to be sure is to post your query plan.