How to do "case when exists..." in spark sql


What I am trying to do is
case when exists (select 1 from table B where A.id = B.id and B.value in (1,2,3)) then 'Y' else 'N' end as Col_1
It seems like "left semi join" can take care of the multiple-match issue, but my understanding is that "left semi join" does not allow using columns from the right (B) table, so how can I add the condition "B.value in (1,2,3)"?

The normal way to do this is to left outer join to a summary of table b:
Select a.id, Case When IsNull(b.id) Then 'N' else 'Y' end as Col_1
From A Left Outer Join
(Select distinct id from tableb where value in (1,2,3)) b On A.id=b.id
That way you are not repeatedly executing a lookup query for every id in A.
Addition
Your comment indicated that you are trying to create multiple Y/N columns based on b values. Your example had a Y/N for col1 when there was a 1,2,3 and a Y/N for col2 when there was a 4,5,6.
You can get there easily with one summarization of table b :
Select a.id, Case When IsNull(b.val123) Then 'N' else 'Y' end as Col_1,
Case When IsNull(b.val456) Then 'N' Else 'Y' end as Col_2
From A Left Outer Join
(Select id, max(Case When value in (1,2,3) Then 'Y' End) as val123,
max(Case When value in (4,5,6) Then 'Y' End) as val456
From tableb
Group By id) b On A.id=b.id
This still accomplishes that lookup with only one summarization of table b.
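As a quick check of the summarized join above, here is a minimal sketch that runs the same query shape against made-up rows. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL, which is an assumption made only to keep the example self-contained. The sample rows are hypothetical, and `b.val123 IS NULL` replaces `IsNull(b.val123)` because SQLite has no one-argument IsNull.

```python
import sqlite3

# In-memory SQLite stands in for Spark SQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER);
    CREATE TABLE tableb (id INTEGER, value INTEGER);
    INSERT INTO A VALUES (1), (2), (3), (4);
    INSERT INTO tableb VALUES (1, 1), (1, 2), (1, 5), (2, 4), (3, 9);
""")

query = """
    SELECT a.id,
           CASE WHEN b.val123 IS NULL THEN 'N' ELSE 'Y' END AS Col_1,
           CASE WHEN b.val456 IS NULL THEN 'N' ELSE 'Y' END AS Col_2
    FROM A AS a
    LEFT OUTER JOIN (
        -- One pass over tableb: each flag is 'Y' if any row for the id
        -- has a qualifying value, otherwise NULL.
        SELECT id,
               MAX(CASE WHEN value IN (1, 2, 3) THEN 'Y' END) AS val123,
               MAX(CASE WHEN value IN (4, 5, 6) THEN 'Y' END) AS val456
        FROM tableb
        GROUP BY id
    ) AS b ON a.id = b.id
    ORDER BY a.id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With these rows, id 1 gets Y/Y, id 2 gets N/Y, and ids 3 and 4 get N/N. Id 3 exists in tableb but has no qualifying value, while id 4 has no row in tableb at all. Each id from A appears exactly once because the summary has one row per id.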

Related

IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few commands

I am trying to write a query in azure databricks and I am getting the following error
"IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few commands"
This is the code I am using.
SELECT id,
(CASE WHEN id in (SELECT id from aTable) THEN 1 ELSE 0 END) as a,
(CASE WHEN id in (SELECT id from bTable) THEN 1 ELSE 0 END) as b,
(CASE WHEN id in (SELECT id from cTable) THEN 1 ELSE 0 END) as c
FROM table
I read that sql doesn't let you do this because the case statements are evaluated row by row, and it wants to prevent you from doing a SELECT statement for each row evaluation. If that is the case, is there an alternative or workaround to accomplish this? Thanks

As the error message says, Databricks does not accept IN or EXISTS subqueries inside a CASE expression in the select list. As an alternative, consider outer joining each lookup table to the main table.
The query could follow the structure below:
select .....
case when a.id is not null then a
when b.id is not null then b
end as id
from Table_t t LEFT JOIN (select id from aTable ) a ON t.id=a.id LEFT JOIN(
select id from bTable) b ON t.id=b.id
..................

I tried to reproduce a similar scenario and got the same error.
Whether or not it is wrapped in a CASE WHEN, an IN predicate with a subquery only works in filters, not in projections. If you list the values explicitly in the IN clause instead of using a subquery, it works fine.
To work around this, I left joined to the tables and then checked for NULL in the CASE expression.
This query might work:
%sql
SELECT t.Id,
(CASE WHEN at.Id is not null THEN 1 ELSE 0 END) as a,
(CASE WHEN bt.Id is not null THEN 1 ELSE 0 END) as b,
(CASE WHEN ct.Id is not null THEN 1 ELSE 0 END) as c
FROM table t
LEFT JOIN aTable at ON t.Id = at.Id
LEFT JOIN bTable bt ON t.Id = bt.Id
LEFT JOIN cTable ct ON t.Id = ct.Id
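One caveat with the join form above: if a lookup table holds the same id more than once, a plain LEFT JOIN repeats the row from the main table. The sketch below is a minimal illustration under stated assumptions: it uses Python's built-in sqlite3 module instead of Databricks (SQLite happens to accept IN subqueries in the select list, so both forms can be compared), the main table is renamed to the hypothetical `main_table`, and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite stands in for Databricks here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER);
    CREATE TABLE aTable (id INTEGER);
    CREATE TABLE bTable (id INTEGER);
    INSERT INTO main_table VALUES (1), (2), (3);
    INSERT INTO aTable VALUES (1), (1), (3);   -- id 1 appears twice
    INSERT INTO bTable VALUES (2);
""")

# Original form: IN subqueries in the projection.
in_rows = conn.execute("""
    SELECT id,
           CASE WHEN id IN (SELECT id FROM aTable) THEN 1 ELSE 0 END AS a,
           CASE WHEN id IN (SELECT id FROM bTable) THEN 1 ELSE 0 END AS b
    FROM main_table
    ORDER BY id
""").fetchall()

# Join form: LEFT JOIN to de-duplicated ids, then test for NULL.
join_rows = conn.execute("""
    SELECT t.id,
           CASE WHEN a_ids.id IS NOT NULL THEN 1 ELSE 0 END AS a,
           CASE WHEN b_ids.id IS NOT NULL THEN 1 ELSE 0 END AS b
    FROM main_table t
    LEFT JOIN (SELECT DISTINCT id FROM aTable) a_ids ON t.id = a_ids.id
    LEFT JOIN (SELECT DISTINCT id FROM bTable) b_ids ON t.id = b_ids.id
    ORDER BY t.id
""").fetchall()

# Without DISTINCT, the duplicate id in aTable produces an extra output row.
plain_rows = conn.execute("""
    SELECT t.id
    FROM main_table t
    LEFT JOIN aTable a ON t.id = a.id
""").fetchall()

print(in_rows)
print(join_rows)
print(len(plain_rows))
```

Both forms return one row per id with the same flags, while the plain join returns four rows for three ids. If the ids in the lookup tables are known to be unique, the DISTINCT subqueries are unnecessary.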

How to pull all values (even those that don't match) during a left outer join in Snowflake?

I have two tables (table a with 1.3 million rows and table b with 300k rows) that I want to join on email address. However, only about a third of the addresses in table b match table a. Ideally, col1 would contain every row of table a, col2 would contain all 300k rows from table b, and col3 would say 'mapped' or 'not mapped' depending on whether col1 and col2 are both populated.
I want to show all the addresses from table b (even those that don't match table a) alongside table a, with a third column that shows whether they matched. Right now the unmatched ones only show as NULL.
How do I do this in SQL? I am currently using a left outer join; do I need a full join instead?
SELECT DISTINCT TABLEA.EMAIL, TABLEB.EMAIL_ADDRESS FROM "db_tablea" TABLEA
left outer JOIN
"db_tableb" TABLE B
ON TABLEA.EMAIL = TABLEB.EMAIL_ADDRESS
CASE WHEN EMAIL IS NULL AND EMAIL_ADDRESS IS NOT NULL THEN 'NOT_MAPPED'
WHEN EMAIL IS NOT NULL AND EMAIL_ADDRESS IS NOT NULL THEN 'MAPPED'
ELSE 'REVIEW'
END AS MAPPED_FLAG
ORDER BY EMAIL
;

I would use exists and a correlated subquery:
select a.*,
case when exists (select 1 from tableb b where b.email_address = a.email)
then 'mapped'
else 'not mapped'
end as review
from tablea a
This generates one row for each row in the first table, with a flag that indicates whether the email exists in the second table.
One feature is that rows in the first table that have multiple matches in the second table are not "multiplied" in the resultset.

SELECT
a.email
,b.email
,CASE WHEN a.email = b.email THEN 'MAPPED' ELSE 'NOT MAPPED' END status
FROM
table_a a
FULL OUTER JOIN table_b b ON (a.email = b.email)
;
https://www.db-fiddle.com/f/sWn6RS8GsfRhj5oXwS3Dso/0

A slight variation that gives more insight into the records that don't match:
select t1.email,
t2.email_address,
case when t1.email is null then 'a not mapped to b'
when t2.email_address is null then 'b not mapped to a'
else 'mutually mapped' end as mapping_flag
from table_a t1
full join table_b t2 on t2.email_address = t1.email;
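To see which label each kind of row receives, here is a minimal sketch using Python's built-in sqlite3 module with made-up addresses. SQLite is an assumption standing in for Snowflake, and because older SQLite builds have no FULL JOIN, the sketch swaps in an emulation: a LEFT JOIN plus a UNION ALL of the table_b rows that have no match. On Snowflake the full join above can be used directly. The labels are kept exactly as in the query above, and the sample emails are assumed to be non-NULL.

```python
import sqlite3

# In-memory SQLite stands in for Snowflake here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (email TEXT);
    CREATE TABLE table_b (email_address TEXT);
    INSERT INTO table_a VALUES ('ann@example.com'), ('bob@example.com');
    INSERT INTO table_b VALUES ('bob@example.com'), ('cat@example.com');
""")

query = """
    SELECT t1.email, t2.email_address,
           CASE WHEN t2.email_address IS NULL THEN 'b not mapped to a'
                ELSE 'mutually mapped' END AS mapping_flag
    FROM table_a t1
    LEFT JOIN table_b t2 ON t2.email_address = t1.email
    UNION ALL
    -- Rows only in table_b: the part a FULL JOIN would add to the LEFT JOIN.
    SELECT NULL, t2.email_address, 'a not mapped to b'
    FROM table_b t2
    WHERE NOT EXISTS (SELECT 1 FROM table_a t1
                      WHERE t1.email = t2.email_address)
    ORDER BY mapping_flag
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The result has three rows: cat@example.com appears only in the table_b column, ann@example.com only in the table_a column, and bob@example.com in both with the flag 'mutually mapped'.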

Selecting Max Value, However Prioritising Certain Values

I have three tables that are joined. TableA has unique values in Column1 (ID) and joins to TableC on Column1, where the values are not unique. I'm currently picking the row with the max value of Column2 in TableC, then returning a value from TableB, which is simply joined to TableC.
However, I want to adjust this so that if TableB.Column2 has any value greater than 0 in TableC.Column2, that row is chosen as the max; if it is 0, the max is chosen normally by numeric value.
The current query I have is this:
Select [TableA].Column2,
FIRST_VALUE([TableB].Column2) OVER (PARTITION BY [TableA].Column2 ORDER BY MAX([TableC].Column2) Desc)
From [TableC] Left Join
[TableA]
On [TableA].Column1 = [TableC].Column1 Left Join
[TableB]
On [TableB].Column3 = [TableC].Column3
What I am expecting to happen is that if:
TableC.Column2 > '0' where TableB.Column2 = 'KEYVALUE' then show Table.Column2 based off TableC.Column3, however if TableC.Column2 = '0' where TableB.Column2 = 'KEYVALUE' then show result of [TableB].Column2 based off MAX [TableC].Column2
Example Output:
S7000,KEYVALUE
S6500,OTHERVALUE1
Hope that all makes sense, thank you.

I find your conditions hard to follow, but you seem to want apply:
Select a.*, bc.column2
From a outer apply
(select top (1) b.column2
from c join
b
on c.column3 = b.column3
where c.column1 = a.column1
order by (case when c.column2 > 0 and b.column2 = 'KEYVALUE'
then 1
else 2
end),
c.column2 desc
) bc;

Boolean - Does ID Exist in Table?

I have two tables... A master ID table and a results ID table with only a few IDs from the master table. I'm looking to create the following SQL Query:
Select
A.ID
(Case when B.ID is in A.ID 1 Else 0 End) as is_found
From
master_table as A
LEFT JOIN results_table as B
ON A.ID = B.ID
The resulting table should have all IDs from master table with a boolean column saying if the ID was found in the results table. Thank you for your help!!

I would use case . . . exists:
Select mt.id,
(case when exists (select 1 from results_table rt where rt.id = mt.id) then 1 else 0 end) as is_found
From master_table mt;

First, consider the case where results_table will have either zero or one matching row; in this case, the LEFT JOIN will always give one row for each ID, and B.ID will be NULL if there is no corresponding row in results_table.
We can therefore use a simple CASE to test this:
Select
A.ID,
CASE WHEN B.ID IS NOT NULL THEN 1 ELSE 0 END as is_found
From
master_table as A
LEFT JOIN results_table as B
ON A.ID = B.ID
If there may be more than one row in results_table for the same ID, the LEFT JOIN may in turn create several rows, one for each match.
The result of the CASE statement will be the same for all values of A.ID - if there are zero matches, it will occur once with value 0, and if there are one or more, it will always have the value 1. So we can simply take distinct values of the entire query:
Select Distinct
A.ID,
CASE WHEN B.ID IS NOT NULL THEN 1 ELSE 0 END as is_found
From
master_table as A
LEFT JOIN results_table as B
ON A.ID = B.ID
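The effect of duplicates can be checked with a minimal sketch, assuming Python's built-in sqlite3 module and made-up IDs (neither comes from the question). It runs the plain LEFT JOIN, the DISTINCT version, and a CASE WHEN EXISTS version against the same data.

```python
import sqlite3

# In-memory SQLite with hypothetical rows (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_table (ID INTEGER);
    CREATE TABLE results_table (ID INTEGER);
    INSERT INTO master_table VALUES (1), (2), (3);
    INSERT INTO results_table VALUES (2), (2), (3);   -- ID 2 matches twice
""")

flag = "CASE WHEN B.ID IS NOT NULL THEN 1 ELSE 0 END AS is_found"
join = "FROM master_table AS A LEFT JOIN results_table AS B ON A.ID = B.ID"

# Plain LEFT JOIN: one output row per match, so ID 2 is repeated.
plain = conn.execute(f"SELECT A.ID, {flag} {join} ORDER BY A.ID").fetchall()

# DISTINCT collapses the repeated rows.
distinct = conn.execute(
    f"SELECT DISTINCT A.ID, {flag} {join} ORDER BY A.ID"
).fetchall()

# EXISTS never multiplies rows in the first place.
exists = conn.execute("""
    SELECT mt.ID,
           CASE WHEN EXISTS (SELECT 1 FROM results_table rt
                             WHERE rt.ID = mt.ID)
                THEN 1 ELSE 0 END AS is_found
    FROM master_table mt
    ORDER BY mt.ID
""").fetchall()

print(plain)
print(distinct)
print(exists)
```

The plain join returns four rows because ID 2 matches twice; the DISTINCT and EXISTS versions both return exactly one row per master ID with the same flags.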

Update a column in a table with values from two other tables

I have to update a column in Table 'A' with values from Table 'B'. If any value in Table 'B' is null or empty then I have to get the value from table 'C'.
Manu

Use:
UPDATE A
SET column = (SELECT COALESCE(b.val, c.value)
FROM B b
JOIN C c ON c.col = b.col)
COALESCE will return the first non-null value from the list of columns, processing from left to right.
What's odd is that you haven't said how tables B and C relate to one another. If they don't relate in any way, you're looking at a cartesian product of the two tables (not ideal). My answer uses a JOIN in the hope that one is possible, depending on the data.

Basically:
UPDATE a SET a.FIELD = (CASE WHEN b.FIELD IS NULL or b.FIELD = '' THEN c.FIELD ELSE b.FIELD END)
FROM TABLEA a
LEFT JOIN TABLEB b on a.id = b.someid
LEFT JOIN TABLEC c on a.id = c.someid
Joins may or may not be LEFT, depending on your data, and you may want to handle the case where both b.field and c.field are null.
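The fallback logic can be tried out with a minimal sketch, assuming Python's built-in sqlite3 module and hypothetical rows. Because UPDATE ... FROM is not available in older SQLite builds, the sketch swaps in correlated subqueries, and it assumes `someid` is unique in both lookup tables. `NULLIF(b.field, '')` turns an empty string into NULL so that COALESCE treats it like a missing value. The third COALESCE argument, which keeps the existing value when neither table supplies one, is one possible way to handle the both-null case and is an added choice, not part of the answer above.

```python
import sqlite3

# In-memory SQLite with hypothetical rows (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablea (id INTEGER, field TEXT);
    CREATE TABLE tableb (someid INTEGER, field TEXT);
    CREATE TABLE tablec (someid INTEGER, field TEXT);
    INSERT INTO tablea VALUES (1, 'old'), (2, 'old'), (3, 'old'), (4, 'old');
    INSERT INTO tableb VALUES (1, 'from b'), (2, ''), (3, NULL);
    INSERT INTO tablec VALUES (1, 'from c'), (2, 'from c'), (3, 'from c');
""")

conn.execute("""
    UPDATE tablea
    SET field = COALESCE(
        -- Value from b, with '' treated as NULL.
        (SELECT NULLIF(b.field, '') FROM tableb b WHERE b.someid = tablea.id),
        -- Otherwise the value from c.
        (SELECT c.field FROM tablec c WHERE c.someid = tablea.id),
        -- Otherwise keep the current value (added choice, see lead-in).
        tablea.field
    )
""")

rows = conn.execute("SELECT id, field FROM tablea ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Here id 1 takes the value from b, ids 2 and 3 fall back to c because b holds an empty string and a NULL respectively, and id 4 keeps 'old' because it has no row in either table.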
