I have a table A with columns ABK and ACK. Each row can have a value in either ABK or ACK but not in both at the same time
ABK and ACK are keys to be used to fetch more detailed information from tables B and C, respectively
B has columns named BK (key) and B1 and C has columns named CK (key) and C1
When fetching information from B and C, I want to select between B1 and C1 depending on which column in A (ABK or ACK) is NOT null
What would be better considering readability and performance:
1
select COALESCE(B.B1, C.C1) as X from A
left join B on A.ABK = B.BK
left join C on A.ACK = C.CK
OR
2
select B.B1 as X from A join B on A.ABK = B.BK
UNION
select C.C1 as X from A join C on A.ACK = C.CK
In other words should I do a left join with all the tables I want to use or do union?
I am guessing that readability-wise the UNION is better, but I am not sure about performance
Also B and C do not overlap, i.e. there are no duplicates between B and C
I don't think the answer in the question pointed to as a duplicate of mine is correct for my case, since it focuses on the fact that there could be duplicates among tables B and C, but as stated B and C are mutually exclusive
Your two queries aren't necessarily accomplishing the same thing.
Using LEFT JOIN will create duplicate rows in the result if there are duplicate values in either table, whereas UNION (as opposed to UNION ALL) automatically removes duplicate values. Therefore, before thinking about performance, I would decide which method to use based on whether you are interested in preserving duplicates in your results.
See here for more info: Performance of two left joins versus union.
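If you do want to keep duplicates, the UNION ALL form of your second query (a sketch using the same table and column names from the question) skips the implicit de-duplication step:
select B.B1 as X from A join B on A.ABK = B.BK
UNION ALL
select C.C1 as X from A join C on A.ACK = C.CK
Since you state that B and C never overlap, this should return the same rows as the plain UNION while avoiding the duplicate-removal work.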
First, you can test the performance on your database and your data. You can also look at the execution plans.
Second, the queries are not equivalent, because the union removes duplicates.
That said, I would go for the first version using left join for both readability and performance. Readability is obviously a matter of taste. I think having all the logic in the select makes it more apparent what the query is doing.
More importantly, the first will start with table A and be able to use indexes for the additional lookups. There is no additional phase for removing duplicates. My guess is that two joins are faster than two joins plus duplicate removal.
Related
What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) has just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
Any table lacking a join condition will result in a Cartesian product - every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and it joins without any join predicate to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
If you don't want to use a table, exclude it entirely from your SQL. If you can't, for some reason due to a constraint we don't know about, then include the proper join predicates for every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join, you don't need anything from it, and again for some very strange reason you can't leave the table out of your SQL completely (this does occasionally happen in reusable code), then you can disable the joins by making the predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
     data b,
     data c,
     data d
WHERE a.col = b.col
  AND DECODE('Y', 'Y', NULL, a.col) = c.col(+)
  AND DECODE('Y', 'Y', NULL, a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = c.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first Y that you set to Y or N (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate, this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables is the right approach 95% of the time.
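As a rough sketch of that idea, using the same test tables and a bind variable named :var_disable_join (how the variable is bound depends on your environment), the ANSI version might look like:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT a.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE(:var_disable_join, 'Y', NULL, a.col) = c.col
LEFT OUTER JOIN data d ON DECODE(:var_disable_join, 'Y', NULL, a.col) = d.col
When :var_disable_join is 'Y', DECODE returns NULL, so the outer-join predicates can never be satisfied and c and d contribute only NULLs; when it is 'N', the real predicate on a.col is used.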
I have a requirement to pull records, that do not have history in an archive table. 2 Fields of 1 record need to be checked for in the archive.
In a technical sense my requirement is a left join where the right side is 'null' (a.k.a. an excluding join), which in ABAP OpenSQL is commonly implemented like this (for my scenario anyway):
Select * from xxxx //xxxx is the result of a multiple-table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some methods to deal with such problems (such as excluding join or outer apply).
For this particular case I will be trying to use ABAP logic with 'for all entries', but I would still like to know if it is possible to use the results of a sub-query to check more than 1 field, or to use another form of excluding join logic on multiple fields using SQL (without involving the application server).
I have tested quite a few variations of sub-queries in the life-cycle of the program I was making. NOT EXISTS with multiple field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance is acceptable (processing time is about 5 seconds), although it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx //xxxx is the result of multiple inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
select key from archive_table
where key = xxxx~key OR key = xxxx~foreign_key
)
EDIT:
With changing requirements (for more filtering) a lot has changed, so I figured I would update this. The construct I marked as XXXX in my example contained a single left join ( where main to secondary table relation is 1-* ) and it appeared relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors without financial records in 3 tables.
Additional requirements: also exclude based on alternative payers (1-* relationship). This is what the example above is based on.
More requirements: also exclude based on alternative payees (*-* relationship between payer and payee).
The many-to-many join exponentially increased the record count within the construct I labeled XXXX, which in turn produced a lot of unnecessary work. For instance: a single customer with 3 payers and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
At this point, moving left-joined tables from the main query into sub-queries and splitting them gave significantly better performance than any smarter-looking alternatives.
select * from lfa1 inner join lfb1
where
( lfa1~lifnr not in ( select lifnr from bsik where bsik~lifnr = lfa1~lifnr )
and lfa1~lifnr not in ( select wyt3~lifnr from wyt3 inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
inner join bsik on bsik~lifnr = wyt3~lifn2 where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
and lfa1~lifnr not in ( select lfza~lifnr from lfza inner join bsik on bsik~lifnr = lfza~empfk where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both not in / not exists work. One might be better than the other, depending on the filters you use.
When exclusion is based on 2 or more fields and you don't have a many-to-many join in the main query, not exists ( select .. from table where id = a.id or id = b.id or... ) appears to be the best.
The moment your exclusion criteria involve a many-to-many relationship within your main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even having a sub-query for each key-table combination will perform better than a many-to-many join with one good-looking sub-query).
Anyways, any additional insight into this is welcome.
EDIT2: Although it's slightly off topic, given that my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution I worked on to expand it. I learned that a proper excluding join works. I just failed horribly at implementing it the first time.
select headers~key
from headers left join items on headers~key = items~key
where items~key is null
if it is possible to use results of a sub-query to check more than 1 field or use another form of excluding join logic on multiple fields
No, it is not possible to check two columns in a subquery, as SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar
subquery.
Scalar is the keyword here, i.e. it should return exactly one column.
Your subquery can have a multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
  FROM saplane AS plane
  WHERE seatsmax < @wa-seatsmax AND
        seatsmax >= ALL ( SELECT seatsocc
                          FROM sflight
                          WHERE carrid = @wa-carrid AND
                                connid = @wa-connid )
However, you say that these two fields should be checked against different tables:
Those 2 fields are also checked against two more tables
so it's not the case for you. Your only choice seems to be multi-join.
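For illustration only, that multi-join (excluding join) shape could look roughly like this, written in the same abbreviated style as the examples above; the aliases are made up and the bracketed conditions stand for whatever filters you already apply:
select xxxx~key
from xxxx
left outer join archive_table as arc1
  on arc1~key = xxxx~key and [conditions]
left outer join archive_table as arc2
  on arc2~key = xxxx~foreign_key and [conditions]
where arc1~key is null
  and arc2~key is null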
P.S. FOR ALL ENTRIES does not support negation logic, you cannot just use some sort of NOT IN FOR ALL ENTRIES, it won't be that easy.
I've been dealing with a slow running query, similar to the following
select
count(*)
from
a
join b
on a.akey = b.akey
join c
on b.bkey = c.bkey
left join d
on c.ykey = d.ykey
and b.xkey = d.xkey
where
a.idkey = 'someid'
This query takes 130s to run for 'someid'
If I remove either condition of the left join, the query runs in <1s.
I've determined the issue for this particular record (someid). There are a huge number of matching d.xkey values (~5 000 000). I've done some tests and modifying the relevant d.xkey values for this record to more unique values improves run time to <1s.
This is the fix I'm currently using.
select
count(*)
from
a
join b
on a.akey = b.akey
join c
on b.bkey = c.bkey
left join d
on c.ykey = d.ykey
where
a.idkey = 'someid'
and (
b.xkey = d.xkey
OR b.xkey is null
OR not exists (
select
dd.xkey
from
d dd
where
dd.xkey = b.xkey
and dd.ykey = c.ykey
)
)
This query runs in less than 1s.
My question is, why is this so much faster than the left join?
Is my new query equivalent to the old one in terms of results?
If the join onto d is efficient for either b.xkey or c.ykey alone (these names are appallingly subtle), but not when both are combined, it's probably because it is able to use an index on d for each one individually, but there is no combined index available.
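If that is what is happening, a composite index on d covering both join columns is the usual remedy; a hypothetical sketch (the index name and column order are guesses and should be chosen based on your data):
create index idx_d_ykey_xkey on d (ykey, xkey);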
The second example you've posted with the NOT EXISTS clause is almost unfathomable, but crucially it includes extra logic and is not directly equivalent to the LEFT JOIN in the first example.
In the WHERE clause of your second example, you permit rows to be included that have been left-joined between c and d where b.xkey is null, whereas in your first example, the joining of these rows would never have occurred (because b.xkey being null would have precluded the left-join). This means d has already possibly multiplied the rows in the results set improperly, which cannot be filtered by the where-clause (because without a ROW_NUMBER function, the where-clause cannot differentiate between each improper match - and can only filter either all or none of them, rather than reducing them back down to a single row), so the two queries can be deemed not logically identical on this ground alone.
It's otherwise difficult to reason precisely about what the combined effect of the whole where-clause is, and how it might be interacting with the other constraints and the underlying data to allow the query to perform better (despite ostensibly having to perform a similar lookup as the left-join did in the first example). If you are getting identical results from both queries, I would say it is only because of a dangerous coincidence in the data, while the logical constraints imposed by the two queries are fundamentally different.
The two queries don't seem logically equal. You can understand this simply from the condition
OR b.xkey is null .
a=b and b=c and c=d(+)
If you are adding OR b.xkey is null, then a=b is filtering out some data.
Actually, the two are very different.
I'm using PostgreSQL. Everything I read here suggests that in a query using nothing but full joins on a single column, the order of tables joined basically doesn't matter.
My intuition says this should also go for multiple columns, so long as every common column is listed in the query where possible (that is, wherever both joined tables have the column in common). But this is not the case, and I'm trying to figure out why.
Simplified to three tables a, b, and c.
Columns in table a: id, name_a
Columns in table b: id, id_x
Columns in table c: id, id_x
This query:
SELECT *
FROM a
FULL JOIN b USING(id)
FULL JOIN c USING(id, id_x);
returns a different number of rows than this one:
SELECT *
FROM a
FULL JOIN c USING(id)
FULL JOIN b USING(id, id_x);
What I want/expect is hard to articulate, but basically I'd like a "complete" full merger. I want no null fields anywhere unless that is unavoidable.
For example, whenever there is a not-null id, I want the corresponding name column to always have the name_a and not be null. Instead, one of those example queries returns semi-redundant results, with one row having a name_a but no id, and another having an id but no name_a, rather than a single merged row.
When the joins are listed in the other order, I do get that desired result (but I'm not sure what other problems might occur, because future data is unknown).
Your queries are different.
In the first, you are doing a full join to b using a single column, id.
In the second, you are doing a full join to b using two columns.
Although the two queries could return the same results under some circumstances, there is no reason to think that the results would be comparable.
Argument order matters in OUTER JOINs, except that FULL NATURAL JOIN is symmetric. They return what an INNER JOIN (ON, USING or NATURAL) does but also the unmatched rows from the left (LEFT JOIN), right (RIGHT JOIN) or both (FULL JOIN) tables extended by NULLs.
USING returns the single shared value for each specified column in INNER JOIN rows; in NULL-extended rows another common column can have NULL in one table's version and a value in the other's.
Join order matters too. Even FULL NATURAL JOIN is not associative, since with multiple tables each pair of tables (either operand being an original or a join result) can have a unique set of common columns, i.e. in general (A ⟗ B) ⟗ C ≠ A ⟗ (B ⟗ C).
There are a lot of special cases where certain additional identities hold. Eg FULL JOIN USING all common column names and OUTER JOIN ON equality of same-named columns are symmetric. Some cases involve CKs (candidate keys), FKs (foreign keys) and other constraints on arguments.
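To make the order-dependence concrete, here is a tiny PostgreSQL sketch with made-up data (purely illustrative, not your real tables), using the same column names as your question:
WITH
  a(id, name_a) AS (VALUES (1, 'alice')),
  b(id, id_x)   AS (VALUES (1, 100)),
  c(id, id_x)   AS (VALUES (2, 100))
SELECT * FROM a FULL JOIN b USING (id) FULL JOIN c USING (id, id_x);
-- a and b merge on id = 1 first; the c row then finds no (id, id_x) match: 2 rows

WITH
  a(id, name_a) AS (VALUES (1, 'alice')),
  b(id, id_x)   AS (VALUES (1, 100)),
  c(id, id_x)   AS (VALUES (2, 100))
SELECT * FROM a FULL JOIN c USING (id) FULL JOIN b USING (id, id_x);
-- a and c do not merge on id, so a's row carries a NULL id_x and b can no longer match it: 3 rows
The same three input rows should produce different output purely because of which common columns are available at each join step.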
Your question doesn't make clear exactly what input conditions you are assuming or what output conditions you are seeking.
Is there any performance difference between the two SQL queries below? The first one is without a left join, matching in the WHERE clause; the other is with a left join, matching in the ON clause.
I get exactly the same result/output from these queries, but I will be working with bigger tables soon (a couple of billion rows), so I don't want to have any performance issues. Thanks in advance ...
select a.customer_id
from table a, table b
where a.customer_id = b.customer_id
select a.customer_id
from table a
left join table b
on a.customer_id = b.customer_id
The two do different things and yes, there is a performance impact.
Your first example is a cross join with a filter which reduces it to an inner join (virtually all planners are smart enough to reduce this to an inner join but it is semantically a cross join and filter).
Your second is a left join which means that where the filter is not met, you will still get all records from table a.
This means that the planner has to assume all records from table a are relevant, and that correlating records from table b are relevant in your second example, but in your first example it knows that only correlated records are relevant (and therefore has more freedom in planning).
In a very small data set you will see no performance difference, but you may get different results. In a large data set, your left join can never perform better than your inner join and may perform worse.
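A minimal sketch of that result difference, using hypothetical data and PostgreSQL-style VALUES CTEs (your own dialect may need different test-data syntax): customer 3 exists in a but not in b, so the inner-join form drops it while the left join keeps it.
with a(customer_id) as (values (1), (2), (3)),
     b(customer_id) as (values (1), (2))
select a.customer_id
from a, b
where a.customer_id = b.customer_id;   -- returns 1 and 2

with a(customer_id) as (values (1), (2), (3)),
     b(customer_id) as (values (1), (2))
select a.customer_id
from a
left join b on a.customer_id = b.customer_id;  -- returns 1, 2 and 3 (3 with no match in b)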