Difference between Two Queries - Join vs IN - sql

I have the following two queries. Query1 returns a row count of 1000, whereas Query2 returns 4000. Can someone please explain the difference between the two queries? I was hoping both would return the same count.
Query1:
SELECT COUNT(*)
FROM TableA A
WHERE A.VIN IN (
SELECT VIN
FROM TableB B, TableC C
WHERE B.MODEL_YEAR = '2014' AND B.VIN_NBR = C.VIN
)
Query2:
SELECT COUNT(*)
FROM TABLEA A, TableB B, TableC C
WHERE B.MODEL_YEAR = '2014' AND B.VIN_NBR = C.VIN AND A.VIN = C.VIN

In many cases, they will return the same answer, but not necessarily. The first counts the number of rows in A that match the conditions -- each row is counted only once, regardless of the number of matches. The second does a join, which can multiply the number of rows.
The second query would be equivalent in results if it used count(distinct A.id), where id is unique or a primary key.
That said, although they are similar in functionality, how they are executed can be quite different. Different SQL engines might do a better job of optimizing one version or the other.
By the way, you should avoid the archaic join syntax that you are using. Since 1992, explicit joins have been part of SQL syntax.
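As a rough illustration of the row multiplication described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the question, but the id column on TableA and every row of data are assumptions made up for the demo. The join version is written with explicit JOIN syntax.

```python
import sqlite3

# In-memory database with made-up rows; TableA.id is an assumed primary key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER PRIMARY KEY, VIN TEXT);
    CREATE TABLE TableB (VIN_NBR TEXT, MODEL_YEAR TEXT);
    CREATE TABLE TableC (VIN TEXT);
    INSERT INTO TableA (id, VIN) VALUES (1, 'V1'), (2, 'V2'), (3, 'V3');
    INSERT INTO TableB VALUES ('V1', '2014'), ('V1', '2014'), ('V2', '2014'), ('V3', '2013');
    INSERT INTO TableC VALUES ('V1'), ('V1'), ('V2'), ('V3');
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Query1 style: each row of TableA is counted at most once.
in_count = scalar("""
    SELECT COUNT(*)
    FROM TableA A
    WHERE A.VIN IN (SELECT C.VIN
                    FROM TableB B JOIN TableC C ON B.VIN_NBR = C.VIN
                    WHERE B.MODEL_YEAR = '2014')
""")

# Query2 style: every matching (A, B, C) combination becomes a row.
join_sql = """
    SELECT {expr}
    FROM TableA A
    JOIN TableC C ON A.VIN = C.VIN
    JOIN TableB B ON B.VIN_NBR = C.VIN
    WHERE B.MODEL_YEAR = '2014'
"""
join_count = scalar(join_sql.format(expr="COUNT(*)"))
distinct_count = scalar(join_sql.format(expr="COUNT(DISTINCT A.id)"))

print(in_count, join_count, distinct_count)
```

With this data, VIN 'V1' appears twice in both TableB and TableC, so the single TableA row for 'V1' contributes four rows to the join but only one to the IN count.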

Related

Inner Join Producing cartesian product

Looking at the two queries below, I assumed they would return the same result set, but they're way off. Why is the second query, with the inner join, producing so many records? What am I doing wrong? I've been staring at this a little too long and need a fresh pair of eyes to look at it.
SELECT COUNT(*)
FROM ZCQ Z
WHERE Z.QUOTE_CUSTOMER_ID IN (SELECT CUSTOMER_ID FROM CUST_ORDER)
-- returned 6,646 RECS
SELECT COUNT(*)
FROM ZCQ Z
INNER JOIN CUST_ORDER CO ON z.quote_customer_id = co.customer_id
-- returned 4,232,473 RECS
Please note these are Oracle 10g tables, but the DBA has not set up any FK or PK on them.
No, these will not generally return the same result.
The first counts the number of rows in ZCQ that match a customer in CUST_ORDER.
The second counts the total number of rows that match. If there are duplicate customers in CUST_ORDER, then all duplicates will be counted.
You could get the same result, provided quote_customer_id is unique in ZCQ, using:
SELECT COUNT(DISTINCT z.quote_customer_id)
FROM ZCQ Z JOIN
CUST_ORDER CO
ON z.quote_customer_id = co.customer_id;
But IN or EXISTS is probably more efficient than removing the duplicates after doing the match.
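To make the difference between these counts concrete, here is a small sketch with Python's sqlite3 module and invented rows (the data is an assumption, not the asker's). It deliberately repeats a customer in ZCQ to show that COUNT(DISTINCT ...) on the customer id stops matching the IN count once that id is not unique in ZCQ, while EXISTS still counts ZCQ rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No keys declared, mirroring the question; all rows are made up.
    CREATE TABLE ZCQ (quote_customer_id INTEGER);
    CREATE TABLE CUST_ORDER (customer_id INTEGER);
    INSERT INTO ZCQ VALUES (10), (10), (20), (30);
    INSERT INTO CUST_ORDER VALUES (10), (10), (10), (20);
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Rows of ZCQ whose customer has at least one order.
in_count = scalar("""
    SELECT COUNT(*) FROM ZCQ z
    WHERE z.quote_customer_id IN (SELECT customer_id FROM CUST_ORDER)
""")
exists_count = scalar("""
    SELECT COUNT(*) FROM ZCQ z
    WHERE EXISTS (SELECT 1 FROM CUST_ORDER co
                  WHERE co.customer_id = z.quote_customer_id)
""")

# Every matching (ZCQ row, CUST_ORDER row) pair.
join_count = scalar("""
    SELECT COUNT(*) FROM ZCQ z
    JOIN CUST_ORDER co ON z.quote_customer_id = co.customer_id
""")

# Distinct customers that matched, which is not the same as matching ZCQ rows.
distinct_key_count = scalar("""
    SELECT COUNT(DISTINCT z.quote_customer_id) FROM ZCQ z
    JOIN CUST_ORDER co ON z.quote_customer_id = co.customer_id
""")

print(in_count, exists_count, join_count, distinct_key_count)
```

Customer 10 appears twice in ZCQ and three times in CUST_ORDER, so it alone accounts for six joined rows.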

Any tips and tricks to avoid or reduce cost of one-to-many joins and non-equi joins when dataset is large?

I am wondering how people grapple with large one-to-many joins, and in particular non-equi joins, when they have large data. If the keys of the two tables A and B are sufficiently repetitive, the output of the join between the two can be nearly the size of |A| * |B|. This must come up frequently in analytics at large companies, so I am wondering what ways there are to reduce the computation time of these joins.
However, many times A and B are different tables, and in those cases I do not think LAG() can be used.
Example of a non-equi, one-to-many join
As a simplified example of a situation where a non-equi and one-to-many join might be warranted, I have tables A and B, each with a numeric id column, a date field date_created, and a field group. For each row in table A, I want the id column of A and all data of the corresponding row in table B where B.date_created is the largest possible value such that A.date_created > B.date_created and A.group = B.group. In other words, I want the most recent row of table B with respect to the date_created and group fields of each row in table A.
Code when using a window function
In most use cases where these non-equi-joins come up, A and B are the same table and the date_created fields in fact correspond to the same column. In this situation, I could use the LAG() window function:
WITH id_tuples AS
(
SELECT A.id,
LAG(A.id, 1) OVER (PARTITION BY A.group ORDER BY A.date_created) AS lagged_id
FROM A
)
SELECT id_t.id,
A.*
FROM id_tuples id_t
INNER JOIN A ON A.id = id_t.lagged_id
which I believe is more efficient than a self-join. However, this approach is not possible when the columns being compared are different, or belong to different tables.
Code when window function is not feasible
I use the following code to compute the most recent row of table B for each row in table A.
SELECT *
FROM
(
SELECT A.id,
B.*,
DENSE_RANK() OVER (PARTITION BY A.id ORDER BY B.date_created DESC) AS date_rank
FROM A
INNER JOIN B ON B.group = A.group
AND B.date_created < A.date_created
)
WHERE date_rank = 1
The problem here is that the grouping variables A.group and B.group can have only a few distinct values. Then the join becomes nearly a Cartesian join and the number of outputted results in the subquery can be many orders of magnitude greater than the sum of the rows of A and B. This is wasteful since the outer query proceeds to throw out the majority of the results by filtering for date_rank = 1.
Is there a better way of structuring the query to reduce the cost of these joins, or avoid them entirely in these situations? I am asking in the abstract but I've found that neither my relational database, nor my Spark cluster (once I move the data there) has enough memory to handle such a join. Even on smaller datasets, this operation takes a large amount of time to run. And I don't believe my dataset is particularly large relative to what others are doing.
Your first query can simply be written as:
SELECT A.id,
LAG(A.id, 1) OVER (PARTITION BY A.group ORDER BY A.date_created) AS lagged_id
FROM A;
There is no need for the JOIN.
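A quick way to see what the LAG() query returns is to run it on a toy table. This sketch uses Python's sqlite3 module, which needs an underlying SQLite of version 3.25 or later for window functions. The column is named grp here because group is a reserved word, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- grp stands in for the question's group column; all rows are made up.
    CREATE TABLE A (id INTEGER PRIMARY KEY, grp TEXT, date_created TEXT);
    INSERT INTO A VALUES
        (1, 'x', '2020-01-01'),
        (2, 'x', '2020-02-01'),
        (3, 'y', '2020-01-15'),
        (4, 'x', '2020-03-01');
""")

# For each row, the id of the previous row (by date) within the same group.
lag_rows = conn.execute("""
    SELECT A.id,
           LAG(A.id, 1) OVER (PARTITION BY A.grp ORDER BY A.date_created) AS lagged_id
    FROM A
    ORDER BY A.id
""").fetchall()

for row in lag_rows:
    print(row)
```

The first row of each group has no predecessor, so its lagged_id comes back as NULL (None in Python).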
For the second query, one method is a lateral join:
SELECT A.id, B.*
FROM A LEFT JOIN LATERAL
(SELECT B.*
FROM B
WHERE B.group = A.group AND
B.date_created < A.date_created
ORDER BY B.date_created DESC
FETCH FIRST 1 ROW ONLY
) B ON 1 = 1;
This should use an index on B(GROUP, date_created).
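The same "latest earlier row per A row" lookup can be tried out locally. SQLite has no LATERAL, so this sketch swaps in a correlated subquery in the ON clause, which expresses the same per-row top-1 lookup without materializing every (A, B) pair. The grp column name, the payload column, the index name and all rows are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- grp stands in for the reserved word group; payload is a made-up column.
    CREATE TABLE A (id INTEGER PRIMARY KEY, grp TEXT, date_created TEXT);
    CREATE TABLE B (id INTEGER PRIMARY KEY, grp TEXT, date_created TEXT, payload TEXT);
    CREATE INDEX b_grp_date ON B (grp, date_created);

    INSERT INTO A VALUES (1, 'x', '2020-03-01'), (2, 'x', '2020-01-01'), (3, 'y', '2020-02-01');
    INSERT INTO B VALUES
        (10, 'x', '2020-01-15', 'b10'),
        (11, 'x', '2020-02-15', 'b11'),
        (12, 'y', '2020-03-01', 'b12'),
        (13, 'x', '2020-04-01', 'b13');
""")

# For each A row, pick the single most recent B row in the same group
# that is strictly older than the A row; NULLs when there is none.
latest_rows = conn.execute("""
    SELECT A.id, B.id, B.date_created, B.payload
    FROM A
    LEFT JOIN B ON B.id = (
        SELECT B2.id
        FROM B AS B2
        WHERE B2.grp = A.grp AND B2.date_created < A.date_created
        ORDER BY B2.date_created DESC
        LIMIT 1
    )
    ORDER BY A.id
""").fetchall()

for row in latest_rows:
    print(row)
```

ISO-formatted date strings compare correctly as text, which is why plain TEXT columns are enough here.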

SQL grouping. How to select row with the highest column value when joined. No CTEs please

I've been banging my head against the wall over something that I think should be simple but just can't get to work.
I'm trying to retrieve the row with the highest multi_flag value when I join table A and table B but I can't seem to get the SQL right because it returns all the rows rather than the one with the highest multi_flag value.
Here are my tables...
Table A
Table B
This is almost my desired result, but only if I leave out the value_id column:
SELECT CATALOG, VENDOR_CODE, INVLINK, NAME_ID, MAX(multi_flag) AS multiflag
FROM TBLINVENT_ATTRIBUTE AS A
INNER JOIN TBLATTRIBUTE_VALUE AS B
ON A.VALUE_ID = B.VALUE_ID
GROUP BY CATALOG, VENDOR_CODE, INVLINK, NAME_ID
ORDER BY CATALOG DESC
This is close to what I want to retrieve, but not quite. Notice how it returns each unique name_id with the highest multi_flag, but I also need the value_id that belongs to that multi_flag / name_id grouping...
If I include the value_id in my SQL statement, then it returns all rows and the result is no longer grouped.
Notice in the results below how it no longer returns only the row with the highest multi_flag, and how all the different values for each name_id (e.g. name_id 1) are also returned.
You can use a subquery, a derived table, or a CTE to solve this problem. Performance will depend on the amount of data you are querying. To get the max multi_flag, you must first compute the max value for the grouping you want; you can do this with a CTE or a subquery. The CTE below gives the max multi_flag per value_id, which you can then join back to your other tables. I have three joins in this example, but this can be reduced. As far as performance goes, a subquery may be better, but you won't know until you see the actual execution plans side by side.
;with highest_multi_flag as
(
select value_id, max(multi_flag) AS multiflag
FROM TBLINVENT_ATTRIBUTE
group by value_id
)
select A.CATALOG, A.VENDOR_CODE, A.INVLINK, B.NAME_ID, m.multiflag
from highest_multi_flag m
inner join TBLINVENT_ATTRIBUTE AS A on A.VALUE_ID = m.VALUE_ID
INNER JOIN TBLATTRIBUTE_VALUE AS B ON m.VALUE_ID = B.VALUE_ID
You can also use a lateral join; it's another solution:
SELECT
A.CATALOG, A.VENDOR_CODE, A.INVLINK, B.NAME_ID, M.maxmultiflag
FROM TBLINVENT_ATTRIBUTE AS A
inner join lateral
(
select max(C.multi_flag) as maxmultiflag from TBLINVENT_ATTRIBUTE C
where A.VALUE_ID = C.VALUE_ID
) M on 1=1
INNER JOIN TBLATTRIBUTE_VALUE AS B ON A.VALUE_ID = B.VALUE_ID
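Since the question's table listings are not shown, the following is only a sketch under an assumed schema: it supposes that NAME_ID and MULTI_FLAG live on TBLATTRIBUTE_VALUE and that the remaining columns live on TBLINVENT_ATTRIBUTE, with invented rows. It shows one CTE-free way to keep the value_id of the row holding the highest multi_flag per group, by joining back to a derived table of per-group maxima. It runs on Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed column placement and made-up data; the real tables may differ.
    CREATE TABLE TBLATTRIBUTE_VALUE (VALUE_ID INTEGER PRIMARY KEY, NAME_ID INTEGER,
                                     MULTI_FLAG INTEGER);
    CREATE TABLE TBLINVENT_ATTRIBUTE (CATALOG TEXT, VENDOR_CODE TEXT, INVLINK INTEGER,
                                      VALUE_ID INTEGER);
    INSERT INTO TBLATTRIBUTE_VALUE VALUES (100, 1, 0), (101, 1, 2), (102, 1, 1), (200, 2, 5);
    INSERT INTO TBLINVENT_ATTRIBUTE VALUES
        ('C1', 'V1', 7, 100), ('C1', 'V1', 7, 101), ('C1', 'V1', 7, 102), ('C1', 'V1', 7, 200);
""")

# The derived table M holds the highest MULTI_FLAG per group; joining back on
# the group columns plus the flag keeps only the row that carries that maximum.
top_rows = conn.execute("""
    SELECT A.CATALOG, A.VENDOR_CODE, A.INVLINK, B.NAME_ID, B.VALUE_ID, B.MULTI_FLAG
    FROM TBLINVENT_ATTRIBUTE AS A
    INNER JOIN TBLATTRIBUTE_VALUE AS B ON A.VALUE_ID = B.VALUE_ID
    INNER JOIN (
        SELECT A2.CATALOG, A2.VENDOR_CODE, A2.INVLINK, B2.NAME_ID,
               MAX(B2.MULTI_FLAG) AS MAX_FLAG
        FROM TBLINVENT_ATTRIBUTE AS A2
        INNER JOIN TBLATTRIBUTE_VALUE AS B2 ON A2.VALUE_ID = B2.VALUE_ID
        GROUP BY A2.CATALOG, A2.VENDOR_CODE, A2.INVLINK, B2.NAME_ID
    ) AS M
      ON  M.CATALOG = A.CATALOG
      AND M.VENDOR_CODE = A.VENDOR_CODE
      AND M.INVLINK = A.INVLINK
      AND M.NAME_ID = B.NAME_ID
      AND M.MAX_FLAG = B.MULTI_FLAG
    ORDER BY B.NAME_ID
""").fetchall()

for row in top_rows:
    print(row)
```

If two rows in a group tie for the highest multi_flag, this form returns both of them; a tie-breaker would be needed to force a single row.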

Oracle SQL: Is it more efficient to use a WHERE clause in a subquery or after the join?

I wanted to know which would be more efficient and why:
example 1:
SELECT a.CUSTOMER_KEY, a.LAST_NAME, b.TRASACTION_AMT
FROM CUSTOMER_TABLE a
LEFT JOIN TRANSACTION_TABLE b
ON a.CUSTOMER_KEY = b.CUSTOMER_KEY
WHERE b.DATE_TRANSACTION > 20150101 AND a.CUSTOMER_ACTIVE_FLAG = 'Y';
or example 2:
SELECT a.CUSTOMER_KEY, a.LAST_NAME, b.TRASACTION_AMT
FROM
(SELECT *
FROM CUSTOMER_TABLE
WHERE CUSTOMER_ACTIVE_FLAG = 'Y') a
LEFT JOIN
(SELECT *
FROM TRANSACTION_TABLE
WHERE DATE_TRANSACTION > 20150101) b
ON a.CUSTOMER_KEY = b.CUSTOMER_KEY
For instance would option 2 be better optimized because it would filter out the records not satisfying the where clause first?
(NOTE: the query joins customer information with transaction information based on customer key. The customer key is unique in the customer table. Both queries produce equivalent output.)
The correct equivalent query without subqueries is:
SELECT a.CUSTOMER_KEY, a.LAST_NAME, b.TRASACTION_AMT
FROM CUSTOMER_TABLE a LEFT JOIN
TRANSACTION_TABLE b
ON a.CUSTOMER_KEY = b.CUSTOMER_KEY AND b.DATE_TRANSACTION > 20150101
WHERE a.CUSTOMER_ACTIVE_FLAG = 'Y';
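The condition on the second table goes in the ON clause. In example 1, the WHERE condition on b rejects the NULL-extended rows that the LEFT JOIN produces for customers with no matching transaction, so that query behaves like an inner join.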
The best way to know is to look at the execution plans and run-times for the two queries. I would expect the equivalent versions to have the same execution plan. Oracle has a smart optimizer and should optimize away the subqueries. However, it might miss a particular case or two, which is why you should check on your own queries.
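To see how the placement of the date condition changes the result of a LEFT JOIN, here is a small sketch using Python's sqlite3 module. The table and column names come from the question; the rows are made up, and SQLite stands in for Oracle, so this only illustrates the semantics, not Oracle's execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up rows: Baker is active but has no transaction after 20150101.
    CREATE TABLE CUSTOMER_TABLE (CUSTOMER_KEY INTEGER PRIMARY KEY, LAST_NAME TEXT,
                                 CUSTOMER_ACTIVE_FLAG TEXT);
    CREATE TABLE TRANSACTION_TABLE (CUSTOMER_KEY INTEGER, DATE_TRANSACTION INTEGER,
                                    TRASACTION_AMT REAL);
    INSERT INTO CUSTOMER_TABLE VALUES (1, 'Adams', 'Y'), (2, 'Baker', 'Y'), (3, 'Clark', 'N');
    INSERT INTO TRANSACTION_TABLE VALUES
        (1, 20150301, 50.0), (1, 20141201, 20.0), (2, 20140101, 10.0), (3, 20150601, 99.0);
""")

select_list = "SELECT a.CUSTOMER_KEY, a.LAST_NAME, b.TRASACTION_AMT"

# Date filter in WHERE: customers without a recent transaction disappear.
where_rows = conn.execute(select_list + """
    FROM CUSTOMER_TABLE a
    LEFT JOIN TRANSACTION_TABLE b ON a.CUSTOMER_KEY = b.CUSTOMER_KEY
    WHERE b.DATE_TRANSACTION > 20150101 AND a.CUSTOMER_ACTIVE_FLAG = 'Y'
    ORDER BY a.CUSTOMER_KEY
""").fetchall()

# Date filter in ON: such customers are kept with a NULL amount.
on_rows = conn.execute(select_list + """
    FROM CUSTOMER_TABLE a
    LEFT JOIN TRANSACTION_TABLE b
      ON a.CUSTOMER_KEY = b.CUSTOMER_KEY AND b.DATE_TRANSACTION > 20150101
    WHERE a.CUSTOMER_ACTIVE_FLAG = 'Y'
    ORDER BY a.CUSTOMER_KEY
""").fetchall()

# Pre-filtered subqueries: same rows as the ON version.
subquery_rows = conn.execute(select_list + """
    FROM (SELECT * FROM CUSTOMER_TABLE WHERE CUSTOMER_ACTIVE_FLAG = 'Y') a
    LEFT JOIN (SELECT * FROM TRANSACTION_TABLE WHERE DATE_TRANSACTION > 20150101) b
      ON a.CUSTOMER_KEY = b.CUSTOMER_KEY
    ORDER BY a.CUSTOMER_KEY
""").fetchall()

print(where_rows)
print(on_rows)
print(subquery_rows)
```

With these rows, the WHERE version drops Baker entirely, while the ON version and the subquery version both keep Baker with a NULL amount.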

Count the overlapping values between two tables?

I have two tables that are structured the same with a sequence column and I am trying to count the number of sequences that show up in two different tables.
I am using this right now:
SELECT A.sequence FROM p2.pool A WHERE EXISTS (SELECT * from
p1.pool B WHERE B.sequence = A.sequence)
And then I was going to count the number of results.
Is there an easier way to do this using COUNT so I don't have to get all of the results first?
Yes, there is an easier way using COUNT:
SELECT COUNT(*)
FROM p2.pool A
WHERE EXISTS (SELECT *
FROM p1.pool B
WHERE B.sequence = A.sequence)
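You could also use a join instead of a subquery, but the speed is unlikely to change. Note that the join counts every matching pair of rows, so it gives the same number only if sequence is unique in p1.pool: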
SELECT COUNT(*)
FROM p2.pool A
JOIN p1.pool B ON A.sequence = B.sequence
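The three possible meanings of "overlap count" can be compared on a toy dataset. This sketch uses Python's sqlite3 module and attaches two in-memory databases so the p1.pool and p2.pool names resolve; the rows are invented. The INTERSECT variant is an extra option not taken from the answer above: it counts distinct shared sequence values rather than rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach two extra in-memory databases to play the role of the p1 and p2 schemas.
conn.execute("ATTACH DATABASE ':memory:' AS p1")
conn.execute("ATTACH DATABASE ':memory:' AS p2")
conn.executescript("""
    CREATE TABLE p1.pool (sequence TEXT);
    CREATE TABLE p2.pool (sequence TEXT);
    INSERT INTO p1.pool VALUES ('AAA'), ('AAA'), ('CCC'), ('GGG');
    INSERT INTO p2.pool VALUES ('AAA'), ('CCC'), ('CCC'), ('TTT');
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Rows of p2.pool that have at least one match in p1.pool.
exists_count = scalar("""
    SELECT COUNT(*) FROM p2.pool A
    WHERE EXISTS (SELECT * FROM p1.pool B WHERE B.sequence = A.sequence)
""")

# Matching (p2 row, p1 row) pairs; duplicates on either side inflate this.
join_count = scalar("""
    SELECT COUNT(*) FROM p2.pool A
    JOIN p1.pool B ON A.sequence = B.sequence
""")

# Distinct sequence values present in both tables.
shared_values = scalar("""
    SELECT COUNT(*) FROM (
        SELECT sequence FROM p2.pool
        INTERSECT
        SELECT sequence FROM p1.pool
    )
""")

print(exists_count, join_count, shared_values)
```

Which of the three numbers is the right one depends on whether duplicates within a pool should count once or several times.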