Why is my SQL 'NOT IN' clause producing different results from 'NOT EXISTS' - sql

I have two SQL queries producing different results when I would expect them to produce the same result. I am trying to find the number of events that do not have a corresponding location. All locations have an event but events can also link to non-location records.
The following query produces a count of 16244, the correct value.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE NOT EXISTS
(SELECT * FROM locations AS l WHERE l.event_id = e.event_id)
The following query produces a count of 0.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The following SQL does some summaries of the data set
SELECT 'Event Count',
COUNT(DISTINCT event_id)
FROM events
UNION ALL
SELECT 'Locations Count',
COUNT(DISTINCT event_id)
FROM locations
UNION ALL
SELECT 'Event+Location Count',
COUNT(DISTINCT l.event_id)
FROM locations AS l JOIN events AS e ON l.event_Id = e.event_id
And returns the following results
Event Count 139599
Locations Count 123355
Event+Location Count 123355
Can anyone shed any light on why the 2 initial queries do not produce the same figure.

You have a NULL in the subquery SELECT l.event_id FROM locations AS l so NOT IN will always evaluate to unknown and return 0 results
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The reason for this behaviour can be seen from the below example.
'x' NOT IN (NULL,'a','b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x'
<> 'b'
≡ Unknown and True and True
≡ Unknown

The NOT IN form works differently for NULLs. The presence of a single NULL will cause the entire statement to fail, thus returning no results.
So you have at least one event_id in locations that is NULL.
Also, your query might be better written as a join:
SELECT
COUNT(DISTINCT e.event_id)
FROM
events AS e
LEFT JOIN locations AS l ON e.event_id = l.event_id
WHERE
l.event_id IS NULL
[UPDATE: apparently, the NOT EXISTS version is faster.]

In and Exists are processed very very differently.
Select * from T1 where x in ( select y from T2 )
is typically processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to
the original table -- typically.
As opposed to
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index
on T1(x).

Related

JOIN AND CASE MORE AN TABLE

I have 2 tables; the first one ORG contains the following columns:
ORG_REF, ARB_REF, NAME, LEVEL, START_DATE
and the second one WORK contains these columns:
ARB_REF, WORK_STREET - WORK_NUM, WORK_ZIP
I want to do the following: write a select query that search in work and see if the WORK_STREET, WORK_ZIP are duplicate together, then you should look at WORK_NUM. If it is the same then output value ' ok ', but if WORK_NUM is not the same, output 'not ok'
I wrote this SQL query:
select
A.ARB_REF, A.WORK_STREET, A.WORK_NUM, A.WORK_ZIP
case when B.B = 1 then 'OK' else 'not ok' end
from
work A
join
(select
WORK_STREET, WORK_ZIP count(distinct , A.WORK_NUM) B
from
WORK
group by
WORK_STREET, WORK_ZIP) B on B.WORK_STREET = A.WORK_STREET
and B.WORK_ZIP = A.WORK_ZIP
Now I want to join the table ORG with this result I want to check if every address belong to org if it belong I should create a new column result and set it to yes in it (RESULT) AND show the "name" column otherwise set no in 'RESULT'.
Can anyone help me please?
While you can accomplish your result by adding a left outer join to the query you've already started, it might be easiest to just use count() over....
with org_data as (
-- do the inner join before the left join later
select * from org1 o1 inner join org2 o2 on o2.orgid = o1.orgid
)
select
*,
count(*) over (partition by WORK_STREET, WORKZIP) as cnt,
case when o.ARB_REF is not null then 'Yes' else 'No' end as result
from
WORK w left outer join org_data o on o.ARB_REF = w.ARB_REF

Optimizing SQL Cross Join that checks if any array value in other column

Let's say I have a table events with structure:
id
value_array
XXXX
[a,b,c,d]
...
...
I have a second table values_of_interest with structure:
value
x
y
z
a
I want to find id's that have any of the values found in values_of_interest. All else equal, what would be the most performant SQL to make this happen? (I am using BigQuery, but feel free to answer more generally)
My current thought is:
SELECT
DISTINCT e.id
FROM
events e, values_of_interest vi
WHERE
EXISTS(
SELECT
value
FROM
UNNEST(e.value_array) value
JOIN
vi ON vi.value = e.value
)
Few quick options for BigQuery Standard SQL
Option 1
select id
from `project.dataset.events`
where exists (
select 1
from `project.dataset.values_of_interest`
where value in unnest(value_array)
)
Option 2
select id
from `project.dataset.events` t
where (
select count(1)
from t.value_array as value
join `project.dataset.values_of_interest`
using(value)
) > 0
I would write this using exists and a join:
select e.id
from `project.dataset.events` e
where exists (select 1
from unnest(e.value_array) val join
`project.dataset.values_of_interest` voi
on val = voi.value
);

Eliminating "not in" for SQL command

I would like to take a sample of an Oracle table, but not include entries from another table. I have a query that currently works, but I'm pretty sure it will blow-up when the sub-select gets more than 1000 records.
select user_key from users sample(5)
where active_flag = 'Y'
and user_key not in (
select user_key from user_validation where validation_state <> 'expired'
);
How could this be re-written without the not in. I thought of using minus, but then my sample size would keep going down as new entries were added to the user_validation table.
You can do this with a left outer join:
select *
from (select u.user_key,
count(*) over () as numrecs
from users u left outer join
user_validation uv
on u.user_key = uv.user_key and
uv.validation_state <> 'expired'
where u.active_flag = 'Y' and uv.user_key is null
) t
where rownum <= numrecs * 0.05
You are using the sample clause. It is not clear if you just want the non-matches in the 5% you choose or if you want 5% of the data that is non-matches. This is the latter.
EDIT: Added example based on author's comment:
select user_key from (
select u.user_key, row_number() over (order by dbms_random.value) as randval
from users u
left outer join user_validation uv
on u.user_key = uv.user_key
and uv.validation_state <> 'expired'
where u.active_flag = 'Y'
and uv.user_key is null
) myrandomjoin where randval <=100;
select us.user_key
from users us -- sample(5)
where us.active_flag = 'Y'
and NOT EXISTS (
SELECT *
from user_validation nx
where nx.user_key = us.user_key
AND nx.validation_state <> 'expired'
);
BTW: I commented-out the sample(5) because I don't know what it means. (I strongly believe that it is not relevant, though)
select u.user_key from users u, user_validation uv
where u.active_flag = 'Y'
and u.user_key=uv.user_key
uv.validation_state= 'expired';
This was a double negation query, x not in list of non expired ids, which is equivalent to x is in the list of expired IDs, which is what I did, in addition to changing the subquery to a join.

Calculating AVG between two tables (JOIN)

The table PROBE looks like this:
'ProbeID'----- 'TranscriptID' ---- 'Start'---- 'End'
'1056'-----------'7981326'----------'1013'---'1010'
'1057'-----------'7878826'----------'1011'---'1015'
etc..
The table EXPRESSION2 looks like this:
'ProbeID'----- 'SampleID' ---- 'Value'
'10425'---------'7981326'-----'16.55''
'11123'---------'7878826'----- '3.55'
etc..
I need to find the top 100 largest differences in transcripts (i.e. take average of probes).
Essentially, I need to link the ProbeID of the EXPRESSION2 table with the TranscriptID in the PROBE table and calculate the average for the top 100.
I tried the code below, but keep getting "null" return. Any alternative scripts will be much appreciated. I think I am missing something.
The EXPRESSION2 table has no null values, fyi
`select avg(value)
from expression2
where probeID in
(
select P.ProbeId
from Probe P
join Transcript T on P.TranscriptID = T.TranscriptID
)`
limit 100;`
If you whant to eliminate the null values
select e.probeID, avg(value) as avarage
from expression2 e
inner join
(
select P.ProbeId
from Probe P
join Transcript T on P.TranscriptID = T.TranscriptID
) p on e.probeID = p.probeID
where e.value is not null
group by e.probeID
if you whant to treat them as 0
select e.probeID, avg(COALESCE(value,0)) as avarage
from expression2 e
inner join
(
select P.ProbeId
from Probe P
join Transcript T on P.TranscriptID = T.TranscriptID
) p on e.probeID = p.probeID
group by e.probeID
Which DBMS are you using? Is it significant that there are no matching ProbeIDs in the two tables in your example, but there are matching TranscriptID/SampleID pairs?
Select Top 100
p.TranscriptID,
Avg(e.Value)
From
Probe p
Inner Join
Expression2 e
on p.ProbeID = e.ProbeID
Group By
p.TranscriptID
Order By
2 Desc

Limit join to one row

I have the following query:
SELECT sum((select count(*) as itemCount) * "SalesOrderItems"."price") as amount, 'rma' as
"creditType", "Clients"."company" as "client", "Clients".id as "ClientId", "Rmas".*
FROM "Rmas" JOIN "EsnsRmas" on("EsnsRmas"."RmaId" = "Rmas"."id")
JOIN "Esns" on ("Esns".id = "EsnsRmas"."EsnId")
JOIN "EsnsSalesOrderItems" on("EsnsSalesOrderItems"."EsnId" = "Esns"."id" )
JOIN "SalesOrderItems" on("SalesOrderItems"."id" = "EsnsSalesOrderItems"."SalesOrderItemId")
JOIN "Clients" on("Clients"."id" = "Rmas"."ClientId" )
WHERE "Rmas"."credited"=false AND "Rmas"."verifyStatus" IS NOT null
GROUP BY "Clients".id, "Rmas".id;
The problem is that the table "EsnsSalesOrderItems" can have the same EsnId in different entries. I want to restrict the query to only pull the last entry in "EsnsSalesOrderItems" that has the same "EsnId".
By "last" entry I mean the following:
The one that appears last in the table "EsnsSalesOrderItems". So for example if "EsnsSalesOrderItems" has two entries with "EsnId" = 6 and "createdAt" = '2012-06-19' and '2012-07-19' respectively it should only give me the entry from '2012-07-19'.
SELECT (count(*) * sum(s."price")) AS amount
, 'rma' AS "creditType"
, c."company" AS "client"
, c.id AS "ClientId"
, r.*
FROM "Rmas" r
JOIN "EsnsRmas" er ON er."RmaId" = r."id"
JOIN "Esns" e ON e.id = er."EsnId"
JOIN (
SELECT DISTINCT ON ("EsnId") *
FROM "EsnsSalesOrderItems"
ORDER BY "EsnId", "createdAt" DESC
) es ON es."EsnId" = e."id"
JOIN "SalesOrderItems" s ON s."id" = es."SalesOrderItemId"
JOIN "Clients" c ON c."id" = r."ClientId"
WHERE r."credited" = FALSE
AND r."verifyStatus" IS NOT NULL
GROUP BY c.id, r.id;
Your query in the question has an illegal aggregate over another aggregate:
sum((select count(*) as itemCount) * "SalesOrderItems"."price") as amount
Simplified and converted to legal syntax:
(count(*) * sum(s."price")) AS amount
But do you really want to multiply with the count per group?
I retrieve the the single row per group in "EsnsSalesOrderItems" with DISTINCT ON. Detailed explanation:
Select first row in each GROUP BY group?
I also added table aliases and formatting to make the query easier to parse for human eyes. If you could avoid camel case you could get rid of all the double quotes clouding the view.
Something like:
join (
select "EsnId",
row_number() over (partition by "EsnId" order by "createdAt" desc) as rn
from "EsnsSalesOrderItems"
) t ON t."EsnId" = "Esns"."id" and rn = 1
this will select the latest "EsnId" from "EsnsSalesOrderItems" based on the column creation_date. As you didn't post the structure of your tables, I had to "invent" a column name. You can use any column that allows you to define an order on the rows that suits you.
But remember the concept of the "last row" is only valid if you specifiy an order or the rows. A table as such is not ordered, nor is the result of a query unless you specify an order by
Necromancing because the answers are outdated.
Take advantage of the LATERAL keyword introduced in PG 9.3
left | right | inner JOIN LATERAL
I'll explain with an example:
Assuming you have a table "Contacts".
Now contacts have organisational units.
They can have one OU at a point in time, but N OUs at N points in time.
Now, if you have to query contacts and OU in a time period (not a reporting date, but a date range), you could N-fold increase the record count if you just did a left join.
So, to display the OU, you need to just join the first OU for each contact (where what shall be first is an arbitrary criterion - when taking the last value, for example, that is just another way of saying the first value when sorted by descending date order).
In SQL-server, you would use cross-apply (or rather OUTER APPLY since we need a left join), which will invoke a table-valued function on each row it has to join.
SELECT * FROM T_Contacts
--LEFT JOIN T_MAP_Contacts_Ref_OrganisationalUnit ON MAP_CTCOU_CT_UID = T_Contacts.CT_UID AND MAP_CTCOU_SoftDeleteStatus = 1
--WHERE T_MAP_Contacts_Ref_OrganisationalUnit.MAP_CTCOU_UID IS NULL -- 989
-- CROSS APPLY -- = INNER JOIN
OUTER APPLY -- = LEFT JOIN
(
SELECT TOP 1
--MAP_CTCOU_UID
MAP_CTCOU_CT_UID
,MAP_CTCOU_COU_UID
,MAP_CTCOU_DateFrom
,MAP_CTCOU_DateTo
FROM T_MAP_Contacts_Ref_OrganisationalUnit
WHERE MAP_CTCOU_SoftDeleteStatus = 1
AND MAP_CTCOU_CT_UID = T_Contacts.CT_UID
/*
AND
(
(#in_DateFrom <= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateTo)
AND
(#in_DateTo >= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateFrom)
)
*/
ORDER BY MAP_CTCOU_DateFrom
) AS FirstOE
In PostgreSQL, starting from version 9.3, you can do that, too - just use the LATERAL keyword to achieve the same:
SELECT * FROM T_Contacts
--LEFT JOIN T_MAP_Contacts_Ref_OrganisationalUnit ON MAP_CTCOU_CT_UID = T_Contacts.CT_UID AND MAP_CTCOU_SoftDeleteStatus = 1
--WHERE T_MAP_Contacts_Ref_OrganisationalUnit.MAP_CTCOU_UID IS NULL -- 989
LEFT JOIN LATERAL
(
SELECT
--MAP_CTCOU_UID
MAP_CTCOU_CT_UID
,MAP_CTCOU_COU_UID
,MAP_CTCOU_DateFrom
,MAP_CTCOU_DateTo
FROM T_MAP_Contacts_Ref_OrganisationalUnit
WHERE MAP_CTCOU_SoftDeleteStatus = 1
AND MAP_CTCOU_CT_UID = T_Contacts.CT_UID
/*
AND
(
(__in_DateFrom <= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateTo)
AND
(__in_DateTo >= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateFrom)
)
*/
ORDER BY MAP_CTCOU_DateFrom
LIMIT 1
) AS FirstOE
Try using a subquery in your ON clause. An abstract example:
SELECT
*
FROM table1
JOIN table2 ON table2.id = (
SELECT id FROM table2 WHERE table2.table1_id = table1.id LIMIT 1
)
WHERE
...