I have table A with 20 million records. There is table B with 200,000 records.
I want to do a join like:
select *
from tableA a
left join tableB b
on ((a.name1 = b.name1 OR a.name1 = b.name2) OR a.id = b.id)
and a.time > b.time
;
This is very time consuming.
I am using GreenPlum so I cannot make use of indexes.
How can I optimize this?
The number of rows in table B will keep increasing.
Greenplum does support indexes. However, this query is tricky: regardless of which distribution column you choose, there is no way to co-locate the join, for the following reasons.
a.time or b.time is a bad candidate for distribution, since the two are compared with a ">" operator rather than equality.
You could distribute tableA by (name1, id) and tableB by (name1, name2, id). But to check whether a.time > b.time is satisfied, you still need to see all tuples.
I am afraid this query is not a very MPP-friendly one.
Using Snowflake, I have two tables, one with many columns and the other with a few. When I try to select * on their join, I get the following error:
SQL compilation error:duplicate column name
This makes sense because my joining columns are in both tables. I could probably use select with column names instead of *, but is there a way to avoid that? Or at least have the query infer the column names dynamically from whichever table it reads?
I am quite sure Snowflake will let you select everything from both sides of a join of two or more tables via
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
What you will not be able to do is refer to a column by its bare name in ORDER BY, so this will not work:
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY x
Even though some databases know that, because you have JOIN ON a.x = b.x, there is only one x, Snowflake will not allow it (well, it didn't last time I tried this).
But you can use the aliased name or the output column position, so both of the following will work.
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY a.x
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY 1 -- assuming x is the first column
In general the * and a.* forms are super convenient, but they are actually bad for performance.
When selecting this way you are at risk of getting the columns back in a different order if the table has been recreated, which makes the code that reads the results unstable. This also impacts VIEWs.
It also means all the metadata for the table needs to be loaded to know the complete shape of the data. Whereas if you want only x, y, z and a w is later added to the table, the whole query plan can be compiled faster.
Lastly, if you use SELECT * FROM table in a sub-select and only a subset of those columns is needed, the compiler has to prune the rest, which it doesn't need to do when you name the columns. And if every column is attached to a correctly aliased table, then when a second table later gains a column of the same name, the columns are not suddenly ambiguous. With naked columns that error only occurs when the SQL is run, which might be an "annual report" that doesn't happen that often. Wow, what a long use-aliases rant.
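The last point (a naked column turning ambiguous once another table gains a column of the same name) can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in, so it only illustrates the general SQL behaviour and says nothing about Snowflake's exact error text; the tables table_a and table_b and their columns are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, name TEXT);
    CREATE TABLE table_b (id INTEGER);
    INSERT INTO table_a VALUES (1, 'alpha');
    INSERT INTO table_b VALUES (1);
""")

naked = "SELECT name FROM table_a a JOIN table_b b ON a.id = b.id"
aliased = "SELECT a.name FROM table_a a JOIN table_b b ON a.id = b.id"

# Works today: only table_a has a column called "name".
before = conn.execute(naked).fetchall()

# Later, someone adds a column with the same name to the other table.
conn.execute("ALTER TABLE table_b ADD COLUMN name TEXT")

try:
    conn.execute(naked).fetchall()
    naked_error = None
except sqlite3.OperationalError as exc:
    # The unqualified column can no longer be resolved to one table.
    naked_error = str(exc)

# The aliased form is unaffected by the schema change.
after = conn.execute(aliased).fetchall()

print(before, naked_error, after)
```

The aliased query keeps returning the same row after the schema change, while the naked one only fails the next time it is actually executed.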
You can prefix the name of the column with the name of the table:
select table_a.id, table_b.name from table_a join table_b using (id)
The same works in combination with *:
select table_a.id, table_b.* from table_a join table_b using (id)
It works in "join" and "where" parts of the statement as well
select table_a.id, table_b.* from table_a join table_b
on table_a.id = table_b.id where table_b.name LIKE 'b%'
You can use table aliases to make the statement shorter:
select a.id, b.* from table_a a join table_b b
on a.id = b.id
Aliases can also be applied to fields for use in subqueries, client software and (depending on the SQL server) other parts of the statement, for example 'order by':
select a.id as a_id, b.* from table_a a join table_b b
on a.id = b.id order by a_id
If you're after a result that includes all the distinct non-join columns from each table in the join with the join columns included in the output only once (given they will be identical for an inner-join) you can use NATURAL JOIN.
e.g.
select * from d1 natural inner join d2 order by id;
See examples: https://docs.snowflake.com/en/sql-reference/constructs/join.html#examples
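As a rough illustration of the difference between a plain JOIN ... ON and a NATURAL JOIN under select *, here is a sketch using Python's built-in sqlite3 module. SQLite is an assumption made for the sake of a runnable example: unlike Snowflake it tolerates duplicate output column names instead of raising an error. The contents of d1 and d2 are made up.

```python
import sqlite3

# SQLite stand-in; d1 and d2 share only the join column "id".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE d1 (id INTEGER, a TEXT);
    CREATE TABLE d2 (id INTEGER, b TEXT);
    INSERT INTO d1 VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO d2 VALUES (1, 'b1'), (2, 'b2');
""")

def column_names(sql):
    """Return the output column names of a query."""
    return [col[0] for col in conn.execute(sql).description]

natural_sql = "SELECT * FROM d1 NATURAL INNER JOIN d2 ORDER BY id"

# JOIN ... ON keeps the join column from both tables.
on_join_cols = column_names("SELECT * FROM d1 JOIN d2 ON d1.id = d2.id")

# NATURAL JOIN emits the shared column once, so a bare "id" is unambiguous.
natural_cols = column_names(natural_sql)
natural_rows = conn.execute(natural_sql).fetchall()

print(on_join_cols, natural_cols, natural_rows)
```

Keep in mind that NATURAL JOIN matches on every column name the two tables share, so a column added later with a coincidentally identical name silently changes the join condition.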
So I haven't used Oracle in more than 5 years and I'm out of practice. I've been on SQL Server all that time.
I'm looking at some of the existing queries and trying to improve them, but they're reacting really weirdly. According to the explain plan, instead of going faster they're doing full table scans and not using the indexes.
In the original query, there is an equijoin between two tables in the WHERE clause. We'll call them Table A and Table B. I ran an explain plan followed by SELECT * FROM table(DBMS_XPLAN.DISPLAY (FORMAT=>'ALL +OUTLINE')); and it tells me that Table A is accessed by local index.
TABLE ACCESS BY LOCAL INDEX ROWID
SELECT A.*
FROM TableA A, TableB B
WHERE A.SecondaryID = B.ID;
I tried to change the query and join TableA with a new table (Table C). Table C is a subset of Table B with 700 records instead of 100K. However the explain plan tells me that Table A is now queried with a full lookup.
CREATE TABLE TableC
AS
SELECT * FROM TableB WHERE Active='Y';
SELECT A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
Next step, I kept the join between tables A & C, but used a hint to tell it to use the index on Table A. However it still does a full lookup.
SELECT /*+ INDEX (A_NDX01) */ A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
So I tried to change from a join to a simple Select of table A and use an IN statement to compare to table C. Still a full table scan.
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC);
Lastly, I took the previous statement and changed the subselect to pull the top 1000 records, and it used the index. The odd thing is that there are only 700 records in Table C.
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC WHERE rownum <1000);
I was wondering if someone could help me figure out what's happening?
My best guess is that since TableC is a new table, maybe the optimizer doesn't know how many records are in it, and that's why it will only use the index if it knows that there are fewer than 1000 records?
I tried to run dbms_stats.gather_schema_stats on my schema though and it did not help.
Thank you for your help.
As a general rule, using an index will not always make your query go faster.
Hints are directives to the optimizer to use a given access path, but that doesn't mean the optimizer will choose to obey the hint. In this case, the optimizer would have considered an index lookup on TableA to be more expensive in the following statements:
SELECT A.*
FROM TableA A, TableB B
WHERE A.SecondaryID = B.ID;
SELECT /*+ INDEX (A_NDX01) */ A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC);
Internally it might have converted all of these statements (including the IN) into a join and, given the data in TableA and TableC, decided to use a full table scan.
When you added the rownum condition, this plan conversion was not done, because view merging does not work when the query block contains rownum.
I believe this is what is happening when you did
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC WHERE rownum <1000)
Have a look at the following link
Oracle. Preventing merge subquery and main query conditions
My query is somewhat like this
SELECT TableA.Column1
FROM TableA
LEFT JOIN TableB ON TableA.ForeignKey = TableB.PrimaryKey
LEFT JOIN TableC ON TableC.PrimaryKey = TableB.ForeignKey
WHERE TableC.SomeColumn = 'XXX'
In the above case Table A and Table B are large tables (may contain more than 1 million rows), but Table C is small, with just 25 rows.
I have applied indexes on primary keys of all the tables.
In our application scenario, I need to search in TableC for just two conditions, TableC.SomeColumn = 'XXX' or TableC.SomeColumn = 'YYY'.
My question is what is the most efficient way to do this. A straight join does work, but I am concerned about joining with all the rows in TableB, just to pick a small subset of it, when joined in Table C.
Is it a good approach to have an indexed view?
For example,
CREATE INDEXED VIEW FOR TableB
JOIN TableC ON TableC.PrimaryKey = TableB.ForeignKey
WHERE TableC.SomeColumn IN ('XXX', 'YYY'))?
Your WHERE clause undoes the outer join, so you might as well write the query as:
SELECT a.Column1
FROM TableA a JOIN
TableB b
ON a.ForeignKey = b.PrimaryKey JOIN
TableC c
ON c.PrimaryKey = b.ForeignKey
WHERE c.SomeColumn = 'XXX';
For this query, you want these indexes:
TableC(SomeColumn, PrimaryKey)
TableB(ForeignKey, PrimaryKey)
TableA(ForeignKey, Column1)
You can create an indexed view. That would generally be the fastest for querying. However, it can incur a lot more overhead for updates and inserts into any of the base tables.
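To see why the WHERE clause undoes the outer join, here is a small sketch with Python's built-in sqlite3 module. SQLite and the sample rows are assumptions for illustration only; the table and column names follow the question. Rows of TableA without a match get NULL for TableC.SomeColumn, and NULL = 'XXX' is never true, so the filter discards exactly the rows the LEFT JOIN preserved.

```python
import sqlite3

# SQLite stand-in with made-up rows; names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableC (PrimaryKey INTEGER PRIMARY KEY, SomeColumn TEXT);
    CREATE TABLE TableB (PrimaryKey INTEGER PRIMARY KEY, ForeignKey INTEGER);
    CREATE TABLE TableA (Column1 TEXT, ForeignKey INTEGER);
    INSERT INTO TableC VALUES (1, 'XXX'), (2, 'YYY');
    INSERT INTO TableB VALUES (10, 1), (20, 2), (30, NULL);
    INSERT INTO TableA VALUES ('a1', 10), ('a2', 20), ('a3', 30), ('a4', NULL);
""")

left_joins = """
    SELECT TableA.Column1
    FROM TableA
    LEFT JOIN TableB ON TableA.ForeignKey = TableB.PrimaryKey
    LEFT JOIN TableC ON TableC.PrimaryKey = TableB.ForeignKey
"""
inner_joins = """
    SELECT a.Column1
    FROM TableA a
    JOIN TableB b ON a.ForeignKey = b.PrimaryKey
    JOIN TableC c ON c.PrimaryKey = b.ForeignKey
    WHERE c.SomeColumn = 'XXX'
"""

# Without the filter, the outer joins keep every row of TableA.
left_only = conn.execute(left_joins).fetchall()

# With the filter on TableC, unmatched rows (NULL SomeColumn) drop out.
left_where = conn.execute(left_joins + " WHERE TableC.SomeColumn = 'XXX'").fetchall()
inner = conn.execute(inner_joins).fetchall()

print(len(left_only), left_where, inner)
```

On this sample data both filtered queries return the same single row, while the unfiltered LEFT JOIN version returns all four rows of TableA.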
I typically only use a JOIN when I need to SELECT or GROUP on the data, not when using it as a predicate. That said, I would be very curious to see if Gordon's answer or this one performs better.
I would also suggest getting in the habit of using aliases when referencing your tables; it's less typing and makes your code easier to read.
I would test and compare execution times:
SELECT A.Column1
FROM TableA A
WHERE EXISTS (SELECT 1
FROM TableB B
WHERE A.ForeignKey = B.PrimaryKey
AND EXISTS (SELECT 1
FROM TableC C
WHERE C.PrimaryKey = B.ForeignKey
AND C.SomeColumn = 'XXX'))
The query below returns the initial results fast and then becomes extremely slow.
SELECT A.Id
, B.Date1
FROM A
LEFT OUTER JOIN B
ON A.Id = B.Id AND A.Flag = 'Y'
AND (B.Date1 IS NOT NULL AND A.Date >= B.Date2 AND A.Date < B.Date1)
Table A has 24 million records and Table B has 500 thousand records.
Index for Table A is on columns: Id and Date
Index for Table B is on columns: Id, Date2, Date1 - Date1 is nullable - index is unique
The first 11 million records are returned quite fast and then it suddenly becomes extremely slow. The execution plan shows the indexes are used.
However, when I remove condition A.Date < B.Date1, query becomes fast again.
Do you know what should be done to improve the performance? Thanks
UPDATE:
I updated the query to show that I need fields of Table B in the result. You might think why I used left join when I have condition "B.Date1 is not null". That's because I have posted the simplified query. My performance issue is even with this simplified version.
You can maybe try using EXISTS. It should be faster, as it stops looking for further rows once a match is found, unlike a JOIN, where all the matching rows have to be fetched and joined.
select id
from a
where flag = 'Y'
and exists (
select 1
from b
where a.id = b.id
and a.date >= b.date2
and a.date < b.date1
and date1 is not null
);
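The difference in row counts between the join and the EXISTS form can be sketched with Python's built-in sqlite3 module. SQLite and the sample rows are assumptions for illustration; the table and column names follow the question. It also shows that EXISTS is not a drop-in replacement for the original LEFT JOIN: it returns no columns from b and drops the rows of a that have no match.

```python
import sqlite3

# SQLite stand-in with made-up rows; names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, flag TEXT, date TEXT);
    CREATE TABLE b (id INTEGER, date2 TEXT, date1 TEXT);
    INSERT INTO a VALUES (1, 'Y', '2020-06-01'),
                         (2, 'Y', '2020-06-01'),
                         (3, 'N', '2020-06-01');
    -- id 1 falls inside two ranges of b; id 2 falls inside none.
    INSERT INTO b VALUES (1, '2020-01-01', '2020-12-31'),
                         (1, '2020-05-01', '2020-07-01'),
                         (2, '2021-01-01', '2021-12-31'),
                         (3, '2020-01-01', '2020-12-31');
""")

# The LEFT JOIN from the question: one row per match, plus unmatched rows of a.
left_rows = conn.execute("""
    SELECT a.id, b.date1
    FROM a
    LEFT OUTER JOIN b
      ON a.id = b.id AND a.flag = 'Y'
     AND (b.date1 IS NOT NULL AND a.date >= b.date2 AND a.date < b.date1)
    ORDER BY a.id, b.date1
""").fetchall()

# The EXISTS form from the answer: each qualifying row of a at most once.
exists_rows = conn.execute("""
    SELECT id
    FROM a
    WHERE flag = 'Y'
      AND EXISTS (
        SELECT 1
        FROM b
        WHERE a.id = b.id
          AND a.date >= b.date2
          AND a.date < b.date1
          AND date1 IS NOT NULL
      )
    ORDER BY id
""").fetchall()

print(left_rows)
print(exists_rows)
```

If the fields of table B are needed in the result, as the update to the question says, the EXISTS form only helps for the part of the logic that asks whether a match exists.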
In my experience with SQL performance, what matters most is the data you are joining. For instance, one-to-one relationships are much faster than one-to-many relationships.
I've seen a one-to-many join from a table of 3,000 items to a table of 30,000 items easily take 11-15 seconds, even with LIMIT, while the same query redesigned around one-to-one relationships took less than 1 second.
So here are my suggestions to speed up your query.
LEFT JOIN and LEFT OUTER JOIN are the same, so it doesn't matter which one you use.
Ideally, though, you should use INNER JOIN, because in your question you stated B.Date1 IS NOT NULL.
Where the database supports it, you can reference a parent column inside the SELECT within the JOIN:
SELECT a.Id FROM A a
INNER JOIN (SELECT b.Id AS `Id`, COUNT(1) AS `TotalLinks` FROM B b WHERE ((b.Date1 IS NOT NULL) AND ((a.Date >= b.Date2) AND (a.Date < b.Date1))) GROUP BY b.Id) AS `ab` ON (a.Id = ab.Id) AND (a.Flag = 'Y')
WHERE a.Flag = 'Y' AND ab.TotalLinks > 0
LIMIT 0, 500
Also try to LIMIT the data you want; this reduces the amount of filtering SQL has to do.
This question is how to work around the apparent oracle limitation on semi-joins with multiple tables in the subquery. I have the following 2 UPDATE statements.
Update 1:
UPDATE
(SELECT a.flag update_column
FROM a, b
WHERE a.id = b.id AND
EXISTS (SELECT NULL
FROM c
WHERE c.id2 = b.id2 AND
c.time BETWEEN start_in AND end_in) AND
EXISTS (SELECT NULL
FROM TABLE(update_in) d
WHERE b.time BETWEEN d.start_time AND d.end_time))
SET update_column = 'F'
The execution plan indicates that this correctly performs 2 semi-joins, and the update executes in seconds. These need to be semi-joins because c.id2 is not a unique foreign key on b.id2, unlike b.id and a.id. And update_in doesn't have any constraints at all since it's an array.
Update 2:
UPDATE
(SELECT a.flag update_column
FROM a, b
WHERE a.id = b.id AND
EXISTS (SELECT NULL
FROM c, TABLE(update_in) d
WHERE c.id2 = b.id2 AND
c.time > d.time AND
b.time BETWEEN d.start_time AND d.end_time))
SET update_column = 'F'
This does not do a semi-join; I believe based on the Oracle documentation that's because the EXISTS subquery has 2 tables in it. Due to the sizes of the tables, and partitioning, this update takes hours. However, there is no way to relate d.time to the associated d.start_time and d.end_time other than being on the same row. And the reason we pass in the update_in array and join it here is because running this query in a loop for each time/start_time/end_time combination also proved to give poor performance.
Is there a reason other than the 2 tables that the semi-join could be not working? If not, is there a way around this limitation? Some simple solution I am missing that could make these criteria work without putting 2 tables in the subquery?
As Bob suggests you can use a Global Temporary Table (GTT) with the same structure as your update_in array, but the key difference is that you can create indexes on the GTT, and if you populate the GTT with representative sample data, you can also collect statistics on the table so the SQL query analyzer is better able to predict an optimal query plan.
That said there are also some other notable differences in your two queries:
In the first exists clause of your first query you refer to two columns start_in and end_in that don't have table references. My guess is that they are either columns in table a or b, or they are variables within the current scope of your sql statement. It's not clear which.
In your second query you refer to column d.time, however, you don't use that column in the first query.
Does updating your second query to the following improve its performance?
UPDATE
(SELECT a.flag update_column
FROM a, b
WHERE a.id = b.id AND
EXISTS (SELECT NULL
FROM c, TABLE(update_in) d
WHERE c.id2 = b.id2 AND
c.time BETWEEN start_in AND end_in AND
c.time > d.time AND
b.time BETWEEN d.start_time AND d.end_time))
SET update_column = 'F'