Join within an "exists" subquery - sql

I am wondering why when you join two tables on a key within an exists subquery the join has to happen within the WHERE clause instead of the FROM clause.
This is my example:
Join within FROM clause:
SELECT payer_id
FROM Population1
WHERE NOT EXISTS
(Select *
From Population2 join Population1
On Population2.payer_id = Population1.payer_id)
Join within WHERE clause:
SELECT payer_id
FROM Population1
WHERE NOT EXISTS
(Select *
From Population2
WHERE Population2.payer_id = Population1.payer_id)
The first query gives me 0 results, which I know is incorrect, while the second query gives me the thousands of results I am expecting to see.
Could someone just explain to me why where the join happens in an EXISTS subquery matters? If you take the subqueries without the parent query and run them they literally give you the same result.
It would help me a lot to remember to not continue to make this mistake when using exists.
Thanks in advance.

You need to understand the distinction between a regular subquery and a correlated subquery.
Using your examples, this should be easy. The first where clause is:
where not exists (Select 1
from Population2 join
Population1
on Population2.payer_id = Population1.payer_id
)
This condition does exactly what it says it is doing. The subquery has no connection to the outer query. So, the not exists will either filter out all rows or keep all rows.
In this case, the engine runs the subquery and determines that at least one row is returned. Hence, not exists returns false in all cases, and nothing is returned.
In the second case, the subquery is a correlated subquery. So, for each row in population1 the subquery is run using the value of Population1.payer_id. In some cases, matching rows exist in Population2; these are filtered out. In other cases, matching rows do not exist; these are in the result set.
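To see the difference on a small scale, here is a minimal sketch using Python's built-in sqlite3 module. The sample payer_id values are made up for illustration, and SQLite stands in for whatever engine the question used; the name resolution inside the subquery works the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Population1 (payer_id INTEGER);
    CREATE TABLE Population2 (payer_id INTEGER);
    -- Hypothetical sample data: payers 2 and 4 appear in both tables.
    INSERT INTO Population1 VALUES (1), (2), (3), (4);
    INSERT INTO Population2 VALUES (2), (4), (5);
""")

# Uncorrelated: the Population1 named inside the subquery is a second,
# independent reference, so the subquery result is the same for every outer row.
uncorrelated = conn.execute("""
    SELECT payer_id
    FROM Population1
    WHERE NOT EXISTS (SELECT 1
                      FROM Population2
                      JOIN Population1
                        ON Population2.payer_id = Population1.payer_id)
    ORDER BY payer_id
""").fetchall()

# Correlated: Population1 is not in the subquery's FROM clause, so
# Population1.payer_id refers to the current row of the outer query.
correlated = conn.execute("""
    SELECT payer_id
    FROM Population1
    WHERE NOT EXISTS (SELECT 1
                      FROM Population2
                      WHERE Population2.payer_id = Population1.payer_id)
    ORDER BY payer_id
""").fetchall()

print(uncorrelated)  # no rows: the join inside the subquery always finds something
print(correlated)    # payers 1 and 3, which have no match in Population2
```

With this data the first query returns nothing and the second returns payers 1 and 3, which mirrors the 0 rows versus thousands of rows in the question.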

The subquery in the first example does not actually refer to the outer query's table, so its result is the same for every outer row and the filter becomes all-or-nothing.
Another way to do the same logic would be:
SELECT P1.payer_id
FROM Population1 P1
LEFT JOIN Population2 P2 ON
P2.Payer_Id = P1.Payer_Id
WHERE
P2.Payer_Id IS NULL
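As a rough check that the LEFT JOIN ... IS NULL form and the correlated NOT EXISTS form select the same rows, here is a small sketch with Python's sqlite3 module. The data is invented, and payer_id is qualified with the P1 alias because the column exists in both joined tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Population1 (payer_id INTEGER);
    CREATE TABLE Population2 (payer_id INTEGER);
    -- Hypothetical sample data; payer 2 appears twice in Population2.
    INSERT INTO Population1 VALUES (1), (2), (3);
    INSERT INTO Population2 VALUES (2), (2);
""")

# Anti-join written as LEFT JOIN plus IS NULL on the right-hand key.
left_join_rows = conn.execute("""
    SELECT P1.payer_id
    FROM Population1 P1
    LEFT JOIN Population2 P2 ON P2.payer_id = P1.payer_id
    WHERE P2.payer_id IS NULL
    ORDER BY P1.payer_id
""").fetchall()

# The same anti-join written as a correlated NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT P1.payer_id
    FROM Population1 P1
    WHERE NOT EXISTS (SELECT 1
                      FROM Population2 P2
                      WHERE P2.payer_id = P1.payer_id)
    ORDER BY P1.payer_id
""").fetchall()

print(left_join_rows)
print(not_exists_rows)
```

Both queries return payers 1 and 3 for this data.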

Your subquery returns a row whenever the join below produces at least one row, no matter which outer row is being checked.
Select *
from Population2
join Population1 on Population2.payer_id = Population1.payer_id
If at least one row comes out of this join (and here one certainly does), you can think of your subquery as:
select 'ROW EXISTS'
And result of:
select *
from Population1
where not exists (select 'ROW EXISTS')
So your anti-semijoin evaluates as:
payer_id 1 --> some ROW EXISTS -> don't return this row
payer_id 2 --> some ROW EXISTS -> don't return this row

Related

How did this old SQL query work without a join in the subquery

Here is the T-SQL. The code has been around for years and it was handed to me to migrate to another SQL server. It apparently works, but I don't know why. The execution plan doesn't show any predicates being used, so how does it know which rows to exclude? If I run the subquery I get 1146 rows with the value 1.
SELECT EM.PERSON_ID
FROM EMP_BEN_ELECTS EBE, EMPLOYEE_MAP EM
WHERE EBE.BW_ID = EM.BW_ID
AND CHANGE_BENEFIT_EVENT_DATE IS NULL
AND OPTION_ID <> 'WAIVE'
AND NOT EXISTS (SELECT 1 FROM EMPLOYEE_BILLING WHERE BILLING_GROUPING_ID
IN('HWMONTHLY','HWINDIVIDUALBILLED') AND END_DATE IS NULL)
I plan to rewrite it without the subquery and use a left join instead, but it just boggled me that it works. The only time I have seen code written like this, without the join being qualified, was when it came from an Oracle developer.
The subquery of NOT EXISTS is not used to return any (of the 1146) rows.
It is used to check if at least 1 row exists in the table EMPLOYEE_BILLING with the specified conditions:
BILLING_GROUPING_ID IN('HWMONTHLY','HWINDIVIDUALBILLED') AND END_DATE IS NULL
If there is such a row, then NOT EXISTS returns FALSE, and since all the conditions in the WHERE clause of the main query are linked with the operator AND, the final result is WHERE FALSE, so the query does not return any rows.
Don't rewrite the query with a LEFT join.
EXISTS and NOT EXISTS usually perform as well as or better than joins.
What you must change, though, is the archaic comma join syntax.
Change it to a proper INNER join with an ON clause:
SELECT EM.PERSON_ID
FROM EMP_BEN_ELECTS EBE INNER JOIN EMPLOYEE_MAP EM
ON EBE.BW_ID = EM.BW_ID
WHERE CHANGE_BENEFIT_EVENT_DATE IS NULL
AND OPTION_ID <> 'WAIVE'
AND NOT EXISTS (
SELECT 1
FROM EMPLOYEE_BILLING
WHERE BILLING_GROUPING_ID IN('HWMONTHLY','HWINDIVIDUALBILLED')
AND END_DATE IS NULL
)
Also, you should qualify all the column names with the table's name/alias they belong to (CHANGE_BENEFIT_EVENT_DATE and OPTION_ID which I left unqualified because I don't know which alias to use).
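The on/off behavior described above can be sketched with Python's sqlite3 module. The schema below is a cut-down assumption (only the columns the subquery touches, plus PERSON_ID), and the rows are invented; the real tables will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the tables in the question.
    CREATE TABLE EMPLOYEE_MAP (PERSON_ID INTEGER);
    CREATE TABLE EMPLOYEE_BILLING (BILLING_GROUPING_ID TEXT, END_DATE TEXT);
    INSERT INTO EMPLOYEE_MAP VALUES (10), (20), (30);
    INSERT INTO EMPLOYEE_BILLING VALUES ('HWMONTHLY', NULL);
""")

QUERY = """
    SELECT EM.PERSON_ID
    FROM EMPLOYEE_MAP EM
    WHERE NOT EXISTS (SELECT 1
                      FROM EMPLOYEE_BILLING
                      WHERE BILLING_GROUPING_ID IN ('HWMONTHLY', 'HWINDIVIDUALBILLED')
                        AND END_DATE IS NULL)
    ORDER BY EM.PERSON_ID
"""

# One open billing row exists, so NOT EXISTS is false for every outer row.
with_open_row = conn.execute(QUERY).fetchall()

# Close the billing row; now the subquery is empty and every outer row passes.
conn.execute("UPDATE EMPLOYEE_BILLING SET END_DATE = '2020-01-01'")
without_open_row = conn.execute(QUERY).fetchall()

print(with_open_row)
print(without_open_row)
```

Because the subquery never mentions the outer tables, it acts as a switch for the whole query: no rows while an open billing row exists, all rows once it is closed.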

Select statement with columns that are select statement but not subqueries

SQL Masters,
I don't understand part of this query. In the select statement there are what look like independent 'select statements', almost like a function. This code is vendor-written Blackbaud CRM code. As independent code there is no join for the info they bring into the data set, as you can see in the from clause. One last odd item is that in the column aliased SPOUSE_ID, the column SPOUSE.RECIPROCALCONSTITUENTID does not even exist in the table referred to. Any BBCRM people out there who can explain this?
Thanks
select
CONSTITUENT.ID,
CONSTITUENT.ISORGANIZATION,
CONSTITUENT.KEYNAME,
CONSTITUENT.FIRSTNAME,
CONSTITUENT.MIDDLENAME,
CONSTITUENT.MAIDENNAME,
CONSTITUENT.NICKNAME,
(select SPOUSE.RECIPROCALCONSTITUENTID
from dbo.RELATIONSHIP as SPOUSE
where SPOUSE.RELATIONSHIPCONSTITUENTID = CONSTITUENT.ID
and SPOUSE.ISSPOUSE = 1) as [SPOUSE_ID],
(select MARITALSTATUSCODE.DESCRIPTION
from dbo.MARITALSTATUSCODE
where MARITALSTATUSCODE.ID = CONSTITUENT.MARITALSTATUSCODEID) as [MARITALSTATUSCODEID_TRANSLATION]
From
dbo.constituent
left join
dbo.ORGANIZATIONDATA on ORGANIZATIONDATA.ID = CONSTITUENT.ID
where
(CONSTITUENT.ISCONSTITUENT = 1)
These are correlated subqueries. Although there is no explicit JOIN, there is a link to the outer table which behaves like a join (although more constrained than explicit JOINs):
(select SPOUSE.RECIPROCALCONSTITUENTID
from dbo.RELATIONSHIP as SPOUSE
where SPOUSE.RELATIONSHIPCONSTITUENTID = CONSTITUENT.ID AND
-------^ correlation clause connecting to outer table
SPOUSE.ISSPOUSE = 1
) as [SPOUSE_ID],
This behaves like a LEFT JOIN. If no rows match, then the result is NULL.
Note that in this context, the correlated subquery is also a scalar subquery. That means that it returns exactly one column and at most one row.
If the query returned more than one column, you would get a compile-time error on the query. If the query returns more than one row, you will get a run-time error on the query.
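Here is a small sketch of the NULL-on-no-match behavior, using Python's sqlite3 module. The two tables are hypothetical reductions of the vendor schema (the question itself says the real RELATIONSHIP table looks different), and the rows are invented. Note that SQLite, unlike SQL Server, does not raise a run-time error when a scalar subquery yields several rows; it takes the first one, so that part of the answer is not demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema for illustration only.
    CREATE TABLE CONSTITUENT (ID INTEGER, KEYNAME TEXT);
    CREATE TABLE RELATIONSHIP (
        RELATIONSHIPCONSTITUENTID INTEGER,
        RECIPROCALCONSTITUENTID   INTEGER,
        ISSPOUSE                  INTEGER
    );
    INSERT INTO CONSTITUENT VALUES (1, 'Adams'), (2, 'Baker'), (3, 'Clark');
    -- 1 and 2 are spouses; 3 has a relationship that is not a spouse.
    INSERT INTO RELATIONSHIP VALUES (1, 2, 1), (2, 1, 1), (3, 1, 0);
""")

rows = conn.execute("""
    SELECT CONSTITUENT.ID,
           (SELECT SPOUSE.RECIPROCALCONSTITUENTID
            FROM RELATIONSHIP AS SPOUSE
            WHERE SPOUSE.RELATIONSHIPCONSTITUENTID = CONSTITUENT.ID  -- correlation
              AND SPOUSE.ISSPOUSE = 1) AS SPOUSE_ID
    FROM CONSTITUENT
    ORDER BY CONSTITUENT.ID
""").fetchall()

print(rows)  # constituent 3 still appears, with None (NULL) as SPOUSE_ID
```

Every constituent appears exactly once, and the one without a spouse row gets NULL, which is the LEFT JOIN-like behavior described above.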

How do I run a joined subquery once together for all rows?

I need to do something like this:
SELECT T.*, X.Val
FROM SomeTable T
LEFT OUTER JOIN (SELECT TOP 1 A.[Value] AS Val FROM AnotherTable A ORDER BY A.Id DESC) X ON X.Val = X.Val
The goal is to get a single value from a subquery and join/apply it to all rows of my record set. The value from the subquery is the same for all rows and is independent of them. So it would be efficient to run the subquery only once, and then use the retrieved value for all rows. But based on the query running time, it seems that the subquery is run again for each row, which is slow. The same happened with other approaches I tried, such as OUTER APPLY.
In fact, my subquery looks like this:
(SELECT dbo.MyScalarFunction() AS Val) X
But I hope it shouldn't matter. The scalar function itself takes about 1 second to run, so it must run only once in the whole query; I can't wait a thousand seconds for a thousand rows in the record set.
Is there a way to enforce running the subquery only once before joining it?
I am inside of a view, so I can't use a declared variable to store the subquery value.
Why are you phrasing this as coming from another table? Based on your later question, why not just use CROSS JOIN:
SELECT T.*, X.Val
FROM SomeTable T CROSS JOIN
(SELECT dbo.MyScalarFunction() AS Val) X
I cannot see how the function would be called more than once, unless it is a recursive function.
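For the shape of the result only, here is a sketch of the CROSS JOIN pattern with Python's sqlite3 module. It uses the asker's first form (latest value from AnotherTable) with invented rows, and SQLite's LIMIT 1 in place of T-SQL's TOP 1. It says nothing about how many times SQL Server would evaluate a scalar function; that depends on the engine and the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE SomeTable (Id INTEGER);
    CREATE TABLE AnotherTable (Id INTEGER, [Value] TEXT);
    INSERT INTO SomeTable VALUES (1), (2), (3);
    INSERT INTO AnotherTable VALUES (1, 'old'), (2, 'new');
""")

# The derived table X has exactly one row, so CROSS JOIN attaches
# that single value to every row of SomeTable without any ON condition.
rows = conn.execute("""
    SELECT T.Id, X.Val
    FROM SomeTable T
    CROSS JOIN (SELECT A.[Value] AS Val
                FROM AnotherTable A
                ORDER BY A.Id DESC
                LIMIT 1) X
    ORDER BY T.Id
""").fetchall()

print(rows)
```

Each row of SomeTable is paired with the same single value, and no dummy join condition such as X.Val = X.Val is needed.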

How does EXISTS return things other than all rows or no rows?

I am a beginning SQL programmer - I am getting most things, but not EXISTS.
It looks to me, and looks by the documentation, that an entire EXISTS statement returns a boolean value.
However, I see specific examples where it can be used and returns part of a table as opposed to all or none of it.
SELECT DISTINCT PNAME
FROM P
WHERE EXISTS
(
SELECT *
FROM SP Join S ON SP.SNO = S.SNO
WHERE SP.PNO = P.PNO
AND S.STATUS > 25
)
This query returns to me one value, the one that meets the criteria (S.Status > 25).
However, with other queries, it seems to return the whole table I am selecting from if even one of the rows in the EXISTS subquery is true.
How does one control this?
Subqueries such as with EXISTS can either be correlated or non-correlated.
In your example you use a correlated subquery, which is usually the case with EXISTS. You look up records in SP for a given P.PNO, i.e. you do the lookup for each P record.
Without SP.PNO = P.PNO you would have a non-correlated subquery, i.e. the subquery would no longer depend on the P record. It would return the same result for any P record (either a Status > 25 exists at all or it does not). Most often this happens by mistake (one forgot to relate the subquery to the record in question), but sometimes it is intentional.
You have actually created a Correlated subquery. Exists predicate accepts a subquery as input and returns TRUE if the subquery returns any rows and FALSE otherwise.
The outer query against table P has no other filters, so every row of P is considered, and a row is kept only if the EXISTS predicate returns TRUE for it.
SELECT DISTINCT PNAME -- Outer Query
FROM P
Now, the EXISTS predicate returns TRUE if the current row in table P has related rows in SP Join S ON SP.SNO = S.SNO where S.STATUS > 25
SELECT *
FROM SP Join S ON SP.SNO = S.SNO
WHERE SP.PNO = P.PNO -- Inner query
AND S.STATUS > 25
One of the benefits of using the EXISTS predicate is that it allows you to intuitively phrase English like queries. For example, this query can be read just as you would say it in ordinary English: select all unique PNAME from table P where at least one row exists in which PNO equals PNO in table SP and Status in table S > 25, provided table SP and S are joined based on SNO.
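A small sketch with Python's sqlite3 module shows both behaviors the asker observed. The part, supplier, and shipment rows are invented; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data for the P / S / SP tables.
    CREATE TABLE P  (PNO TEXT, PNAME TEXT);
    CREATE TABLE S  (SNO TEXT, STATUS INTEGER);
    CREATE TABLE SP (SNO TEXT, PNO TEXT);
    INSERT INTO P  VALUES ('P1', 'Nut'), ('P2', 'Bolt'), ('P3', 'Screw');
    INSERT INTO S  VALUES ('S1', 20), ('S2', 30);
    INSERT INTO SP VALUES ('S1', 'P1'), ('S2', 'P2'), ('S1', 'P3');
""")

# Correlated: SP.PNO = P.PNO ties the subquery to the current row of P.
correlated = conn.execute("""
    SELECT DISTINCT PNAME
    FROM P
    WHERE EXISTS (SELECT *
                  FROM SP JOIN S ON SP.SNO = S.SNO
                  WHERE SP.PNO = P.PNO
                    AND S.STATUS > 25)
    ORDER BY PNAME
""").fetchall()

# Non-correlated: without SP.PNO = P.PNO the subquery is the same for every
# row of P, so the result is either all of P or nothing.
non_correlated = conn.execute("""
    SELECT DISTINCT PNAME
    FROM P
    WHERE EXISTS (SELECT *
                  FROM SP JOIN S ON SP.SNO = S.SNO
                  WHERE S.STATUS > 25)
    ORDER BY PNAME
""").fetchall()

print(correlated)      # only the part shipped by a supplier with STATUS > 25
print(non_correlated)  # every part, because some qualifying shipment exists
```

The only difference between the two queries is the correlation condition, which is what controls whether EXISTS filters row by row or acts on the whole table at once.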
Which SQL language are you using?
Depending on the dialect, EXISTS may be described as returning true/false or as returning rows, but in WHERE EXISTS (...) it comes down to checking whether the subquery returned more than 0 rows (=> true).
Oracle, MySQL, PostgreSQL:
The EXISTS condition is used in combination with a subquery and is considered "to be met" if the subquery returns at least one row.
(http://www.techonthenet.com)
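The WHERE clause of your main query,
SELECT DISTINCT PNAME FROM P
depends on EXISTS. If the subquery returns any rows, EXISTS returns true; otherwise it returns false. Because your subquery is correlated, this check is made for each row of P: the row is returned if EXISTS is true for it and skipped if it is false. Only a non-correlated subquery gives the all-or-nothing behavior, returning every row of P when EXISTS is true and none when it is false.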

I would like a simple example of a sub-query using T-SQL 2008

Can anyone give me a good example of a subquery using TSQL 2008?
Maximilian Mayer believes that, due to referencing MS documentation, my assertion that there is a difference between a subquery and a subSelect is incorrect. Frankly, I'd consider MSDN's "Subquery Fundamentals" a better choice. Quote:
You are making distinctions between terms that actually mean the same.
O RLY?
A subQUERY...
IE:
WHERE id IN (SELECT n.id FROM TABLE n)
OR id = (SELECT MAX(m.id) FROM TABLE m)
OR EXISTS(SELECT 1/0 FROM TABLE) --won't return a math error for division by zero
...affects the WHERE or HAVING clauses -- the filtering of data -- for a SELECT, INSERT, UPDATE or DELETE statement. The value from a subquery is never directly visible in the SELECT clause.
A subSELECT...
IE:
SELECT t.column,
(SELECT x.col FROM TABLE x) AS col2
FROM TABLE t
...does not affect the filtering of data in the main query, and the value is exposed directly in the SELECT clause. But it's only one value - you can't return two or more columns into a single column in the outer query.
A subselect is a consistent means of performing a LEFT JOIN in ANSI-89 join syntax - if there is no supporting row, the column will be null. Additionally, a non-correlated subselect will return the same value for every row of the main query.
Correlation
If a subquery or subselect is correlated, that query conceptually runs once for every record the main query returns -- which may not scale well as the number of rows in the result set increases, although the optimizer can often rewrite it as a join.
Derived Table/Inline View
IE:
SELECT x.*,
y.max_date,
y.num
FROM TABLE x
JOIN (SELECT t.id,
t.num,
MAX(t.date) AS max_date
FROM TABLE t
GROUP BY t.id, t.num) y ON y.id = x.id
...is a JOIN to a derived table (AKA inline view).
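The join to a derived table can be tried out with Python's sqlite3 module. The original uses TABLE as a placeholder name, so the table names x_table and t_table and all rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables standing in for the TABLE placeholders.
    CREATE TABLE x_table (id INTEGER, name TEXT);
    CREATE TABLE t_table (id INTEGER, num INTEGER, date TEXT);
    INSERT INTO x_table VALUES (1, 'a'), (2, 'b');
    INSERT INTO t_table VALUES
        (1, 7, '2020-01-01'),
        (1, 7, '2020-03-01'),
        (2, 9, '2020-02-01');
""")

# The parenthesised SELECT is the derived table y: it is aggregated on its
# own and then joined, so it can expose several columns (num and max_date).
rows = conn.execute("""
    SELECT x.id,
           y.max_date,
           y.num
    FROM x_table x
    JOIN (SELECT t.id,
                 t.num,
                 MAX(t.date) AS max_date
          FROM t_table t
          GROUP BY t.id, t.num) y ON y.id = x.id
    ORDER BY x.id
""").fetchall()

print(rows)
```

Each x row picks up both the latest date and the num from the derived table, which a single scalar subselect could not return in one go.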
"Inline view" is a better term, because that is all that happens when you reference a non-materialized view -- a view is just a prepared SQL statement. There's no performance or efficiency difference if you create a view with a query like the one in the example, and reference the view name in place of the SELECT statement within the brackets of the JOIN. The example has the same information as a correlated subquery, but the performance benefit of using a join and none of the subquery detriments. And you can return more than one column, because it is a view/derived table.
Conclusion
It should be obvious why I and others make distinctions. The concept of relying on the word "subquery" to categorize any SELECT statement that isn't the main clause is fatally flawed, because it's also a specific case under a categorization of the same word (IE: subquery-subselect, subquery-subquery, subquery-join...). Now think of helping someone who says "I've got a problem with a subquery..."
Maximilian Mayer's idea of "official" documentation was written by technical writers, who often have no experience in the subject and are only summarizing what they've been told to from knowledgeable people who have simplified things. Ultimately, it's just text on a page or screen -- like what you're reading now -- and the decision is up to you if the details I've laid out make sense to you.
For variety's sake, here's one in the where clause:
select
a.firstname,
a.lastname
from
employee a
where
a.companyid in (
select top 10
c.companyid
from
company c
where
c.num_employees > 1000
)
...returns all employees in ten companies with over 1000 employees (with no ORDER BY in the subquery, which ten companies are picked is arbitrary).
SELECT
*,
(SELECT TOP 1 SomeColumn FROM dbo.SomeOtherTable)
FROM
dbo.MyTable
SELECT a.*, b.*
FROM TableA AS a
INNER JOIN
(
SELECT *
FROM TableB
) as b
ON a.id = b.id
That's a normal (non-correlated) subquery, running once for the whole result set.
On the other hand
SELECT a.*, (SELECT b.somecolumn FROM TableB AS b WHERE b.id = a.id)
FROM TableA AS a
is a correlated subquery, running once for every row in the result set.