SPARK SQL - case when then - sql

I'm new to Spark SQL. Is there an equivalent to "CASE WHEN 'CONDITION' THEN 0 ELSE 1 END" in Spark SQL?
select case when 1=1 then 1 else 0 end from table
Thanks
Sridhar

Before Spark 1.2.0
The supported syntax (which I just tried out on Spark 1.0.2) seems to be
SELECT IF(1=1, 1, 0) FROM table
This recent thread http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-td9538.html links to the SQL parser source, which may or may not help depending on your comfort with Scala. At the very least the list of keywords starting (at time of writing) on line 70 should help.
Here's the direct link to the source for convenience: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala.
Update for Spark 1.2.0 and beyond
As of Spark 1.2.0, the more traditional syntax is supported, in response to SPARK-3813: search for "CASE WHEN" in the test source. For example:
SELECT CASE WHEN key = 1 THEN 1 ELSE 2 END FROM testData
Update for most recent place to figure out syntax from the SQL Parser
The parser source can now be found here.
Update for more complex examples
In response to a question below, the modern syntax supports complex Boolean conditions.
SELECT
CASE WHEN id = 1 OR id = 2 THEN "OneOrTwo" ELSE "NotOneOrTwo" END AS IdRedux
FROM customer
You can involve multiple columns in the condition.
SELECT
CASE WHEN id = 1 OR state = 'MA'
THEN "OneOrMA"
ELSE "NotOneOrMA" END AS IdRedux
FROM customer
You can also nest CASE WHEN THEN expressions.
SELECT
CASE WHEN id = 1
THEN "OneOrMA"
ELSE
CASE WHEN state = 'MA' THEN "OneOrMA" ELSE "NotOneOrMA" END
END AS IdRedux
FROM customer
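Because WHEN branches are tried in order, the nested form above can usually be flattened into a single CASE with several WHEN branches. Here is a minimal sketch of that equivalence. It uses Python's built-in sqlite3 module as a stand-in SQL engine (an assumption, so that it runs without a Spark cluster), and the customer rows are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the customer table; rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "NY"), (2, "MA"), (3, "CA")],
)

# Nested form, as in the answer above.
nested = """
SELECT CASE WHEN id = 1 THEN 'OneOrMA'
            ELSE CASE WHEN state = 'MA' THEN 'OneOrMA' ELSE 'NotOneOrMA' END
       END AS IdRedux
FROM customer ORDER BY id
"""

# Flattened form: one CASE, branches evaluated top to bottom.
flat = """
SELECT CASE WHEN id = 1 THEN 'OneOrMA'
            WHEN state = 'MA' THEN 'OneOrMA'
            ELSE 'NotOneOrMA'
       END AS IdRedux
FROM customer ORDER BY id
"""

nested_rows = [row[0] for row in conn.execute(nested)]
flat_rows = [row[0] for row in conn.execute(flat)]
print(nested_rows)
print(flat_rows)
```

Both queries should label the three sample rows identically; the flat form is just easier to read once there are more than two levels.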

For Spark 2.+
Spark when function
From documentation:
Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.
// Example: encoding gender string column into integer.
// Scala:
people.select(when(col("gender") === "male", 0)
.when(col("gender") === "female", 1)
.otherwise(2))
// Java:
people.select(when(col("gender").equalTo("male"), 0)
.when(col("gender").equalTo("female"), 1)
.otherwise(2))
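The "null is returned for unmatched conditions" rule mirrors what a SQL CASE expression does when it has no ELSE branch. The sketch below shows that SQL-side behavior only. It does not call Spark's when function; it uses Python's built-in sqlite3 module as a stand-in engine, and the people rows are invented.

```python
import sqlite3

# In-memory stand-in for the people table; rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (gender TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("male",), ("female",), ("other",)],
)

# No ELSE branch: a row matching no WHEN yields NULL (None in Python).
query = """
SELECT CASE WHEN gender = 'male' THEN 0
            WHEN gender = 'female' THEN 1
       END
FROM people ORDER BY rowid
"""
codes = [row[0] for row in conn.execute(query)]
print(codes)
```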

This syntax worked for me in Databricks:
select
org,
patient_id,
case
when (age is null) then 'Not Available'
when (age < 15) then 'Less than 15'
when (age >= 15 and age < 25) then '15 to 25'
when (age >= 25 and age < 35) then '25 to 35'
when (age >= 35 and age < 45) then '35 to 45'
when (age >= 45) then '45 and Older'
end as age_range
from demo
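Since the first matching WHEN wins, the lower bounds in the query above are redundant once the earlier branches have been ruled out. A minimal sketch of the shortened form follows. It uses Python's built-in sqlite3 module as a stand-in engine rather than Databricks (an assumption), with a made-up demo table containing only the columns needed here.

```python
import sqlite3

# In-memory stand-in for the demo table; rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (patient_id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(1, None), (2, 10), (3, 15), (4, 34), (5, 44), (6, 45)],
)

# Each branch only needs an upper bound: earlier branches already
# excluded the smaller ages, and the NULL check comes first.
query = """
SELECT patient_id,
       CASE
         WHEN age IS NULL THEN 'Not Available'
         WHEN age < 15 THEN 'Less than 15'
         WHEN age < 25 THEN '15 to 25'
         WHEN age < 35 THEN '25 to 35'
         WHEN age < 45 THEN '35 to 45'
         ELSE '45 and Older'
       END AS age_range
FROM demo ORDER BY patient_id
"""
age_ranges = dict(conn.execute(query))
print(age_ranges)
```

Keeping the NULL check as the first branch matters here: without it, a NULL age would fall through every comparison and land in the ELSE bucket.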

An analog of Oracle SQL's decode() function can be implemented in Spark SQL as follows:
case
  when exp1 in ('a','b','c')
    then element_at(map('a','A','b','B','c','C'), exp1)
  else exp1
end
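Another way to get a decode()-like lookup is the "simple" CASE form, which compares one expression against a list of values. This is a sketch of that alternative, not the map-based approach above. It runs on Python's built-in sqlite3 module as a stand-in engine (an assumption), and the table t and its rows are hypothetical.

```python
import sqlite3

# Hypothetical one-column table standing in for wherever exp1 comes from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (exp1 TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("c",), ("z",)])

# Simple CASE: exp1 is compared for equality with each WHEN value;
# unmatched values fall through to ELSE and are returned unchanged.
query = """
SELECT CASE exp1
         WHEN 'a' THEN 'A'
         WHEN 'b' THEN 'B'
         WHEN 'c' THEN 'C'
         ELSE exp1
       END
FROM t ORDER BY rowid
"""
decoded = [row[0] for row in conn.execute(query)]
print(decoded)
```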

Based on my current production code, this works:
val identifierDF =
tempIdentifierDF.select(tempIdentifierDF("t_item_account_id"),
when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_cusip")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_ticker")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_isin")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_sedol")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_valoren")),100)
.otherwise(0)
.alias("identifier_in_description_score")
)

The Spark DataFrame API (Python version) also lets you run a query like this:
df.selectExpr('time',
'CASE WHEN (time > 1) THEN time * 1.1 ELSE time END AS updated_time')

Related

Case expression with Boolean from PostgreSQL to SQL Server

I am translating a query from PostgreSQL to SQL Server. I didn't write the PostgreSQL query, and it's quite complicated for my level of knowledge, so I don't understand every piece of it.
From my understanding: we are trying to find the max version from p_policy, and when insurancestatus is 7 or 14, or transactiontype is CAN, we compare two dates (which are stored as BIGINT values).
This is the PG Query:
SELECT *
FROM BLABLABLA
WHERE
pol.vnumber = (
SELECT MAX(pol1.vnumber)
FROM p_policy pol1
AND ( CASE WHEN pol1.insurancestatus IN (7,14)
or pol1.transactiontype IN ('CAN')
-- ('CAN','RCA')
THEN pol1.veffectivedate = pol1.vexpirydate
ELSE pol1.veffectivedate <> pol1.vexpirydate
END
)
AND pol1.vrecordstatus NOT IN (30,254)
etc.
I am used to a WHERE clause in which a column is compared to a value. I understand that the CASE expression here yields a boolean, but doesn't that still have to be compared to something?
Anyway, the main purpose is to make it work in SQL Server, but I believe SQL Server can't read a CASE expression where THEN is a comparison.
This is what I tried:
SELECT *
FROM BLABLABLA
WHERE pol.vnumber =
(
SELECT MAX(pol1.vnumber)
FROM p_policy pol1
WHERE sbuid = 4019
AND ( CASE WHEN pol1.insurancestatus IN (7,14)
or pol1.transactiontype IN ('CAN')
THEN CASE
WHEN pol1.veffectivedate = pol1.vexpirydate THEN 1
WHEN pol1.veffectivedate <> pol1.vexpirydate THEN 0
END
END
)
AND pol1.vrecordstatus NOT IN (30,254)
etc.
Then I get this error from SQL Server (it points to the last line of the current code, right after the nested CASE expression):
Msg 4145, Level 15, State 1, Line 55
An expression of non-boolean type specified in a context where a condition is expected, near 'AND'.
Thank you! Let me know if it is not clear.
I think you want boolean logic. The CASE expression would translate as:
(
(
(pol1.insurancestatus IN (7,14) OR pol1.transactiontype = 'CAN')
AND pol1.veffectivedate = pol1.vexpirydate
) OR (
NOT (pol1.insurancestatus IN (7,14) OR pol1.transactiontype = 'CAN')
AND pol1.veffectivedate <> pol1.vexpirydate
)
)
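To check that the boolean rewrite selects the same rows as the original CASE filter, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption: SQLite, like PostgreSQL, accepts a CASE that yields a truth value directly in WHERE, which SQL Server does not). The p_policy rows are invented and deliberately contain no NULLs, since rows where the condition itself is NULL may be treated differently by the two forms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE p_policy (vnumber INTEGER, insurancestatus INTEGER, "
    "transactiontype TEXT, veffectivedate INTEGER, vexpirydate INTEGER)"
)
# Invented sample rows covering each combination of interest.
rows = [
    (1, 7, "NEW", 20200101, 20200101),   # status matches, dates equal  -> kept
    (2, 14, "NEW", 20200101, 20210101),  # status matches, dates differ -> dropped
    (3, 1, "CAN", 20200101, 20200101),   # type matches, dates equal    -> kept
    (4, 1, "NEW", 20200101, 20210101),   # no match, dates differ       -> kept
    (5, 1, "NEW", 20200101, 20200101),   # no match, dates equal        -> dropped
]
conn.executemany("INSERT INTO p_policy VALUES (?, ?, ?, ?, ?)", rows)

# PostgreSQL-style filter: the CASE itself produces the truth value.
case_form = """
SELECT vnumber FROM p_policy pol1
WHERE CASE WHEN pol1.insurancestatus IN (7,14) OR pol1.transactiontype IN ('CAN')
           THEN pol1.veffectivedate = pol1.vexpirydate
           ELSE pol1.veffectivedate <> pol1.vexpirydate
      END
ORDER BY vnumber
"""

# Portable rewrite using only AND / OR / NOT.
boolean_form = """
SELECT vnumber FROM p_policy pol1
WHERE (
        (pol1.insurancestatus IN (7,14) OR pol1.transactiontype = 'CAN')
        AND pol1.veffectivedate = pol1.vexpirydate
      ) OR (
        NOT (pol1.insurancestatus IN (7,14) OR pol1.transactiontype = 'CAN')
        AND pol1.veffectivedate <> pol1.vexpirydate
      )
ORDER BY vnumber
"""

case_ids = [row[0] for row in conn.execute(case_form)]
boolean_ids = [row[0] for row in conn.execute(boolean_form)]
print(case_ids, boolean_ids)
```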
There are 2 main issues with your snippet, SQL Server-syntax-wise.
SELECT * FROM BLABLABLA WHERE
pol.vnumber = /* PROBLEM 1: we haven't defined pol yet; SQL Server has no idea what pol.vnumber is here, so you're going to get an error when you resolve your boolean issue */
(
SELECT MAX(pol1.vnumber)
FROM p_policy pol1
WHERE sbuid = 4019
AND ( CASE WHEN pol1.insurancestatus IN (7,14)
or pol1.transactiontype IN ('CAN')
THEN CASE
WHEN pol1.veffectivedate = pol1.vexpirydate THEN 1
WHEN pol1.veffectivedate <> pol1.vexpirydate THEN 0
END
END
) /* PROBLEM 2: Your case statement returns a 1 or a 0..
which means your WHERE is saying
WHERE sbuid = 4019
AND (1)
AND pol1.vrecordstatus NOT IN (30,254)
SQL Doesn't like that. I think you meant to add a boolean operation using your 1 or 0 after the parenthesis.
like this: */
= 1
AND pol1.vrecordstatus NOT IN (30,254)

Difference between querying from Impala and querying from Hive?

I have a Hive source table which contains:
select count(*) from dev_lkr_send.pz_send_param_ano;
--25283 lines
I am trying to get all of the table lines and put them into a dataframe using Spark2-Scala. I did the following:
val dfMet = spark.sql(s"""SELECT
CD_ANOMALIE,
CD_FAMILLE,
libelle AS LIB_ANOMALIE,
to_date(substr(MAJ_DATE, 1, 19), 'YYYY-MM-DD HH24:MI:SS') AS DT_MAJ,
CLASSIFICATION,
NB_REJEUX,
case when indic_cd_erreur = 'O' then 1 else 0 end AS TOP_INDIC_CD_ERREUR,
case when invalidation_coordonnee = 'O' then 1 else 0 end AS TOP_COORDONNEE_INVALIDE,
case when typ_mvt = 'S' then 1 else 0 end AS TOP_SUPP,
case when typ_mvt = 'S' then to_date(substr(dt_capt, 1, 19), 'YYYY-MM-DD HH24:MI:SS') else null end AS DT_SUPP
FROM ${use_database}.pz_send_param_ano""")
When I execute dfMet.count() it returns: 46314
Any ideas about the source of the difference?
EDIT1:
Trying the same query from Hive returns the same value as in the dataframe (I was querying from Impala UI before).
Can someone explain the difference, please? I am working on Hue4.
One potential source of the difference is that your Hive query is returning the result from the metastore, which may be out of date, rather than running a fresh count against the table.
If hive.compute.query.using.stats is set to true and the table has stats computed, the count is answered from the metastore. If that is the case, your stats may be out of date and you need to recompute them.

Denodo - Unable to case DATE>= addday(cast(now() as date),-365)

I've encountered an issue when trying to obtain the following output:
"If x_date >= now-365 then 1 else 0"
My select statement reads:
SELECT
id,
x_date,
CASE x_date
WHEN x_date >= addday(cast(now() as date),-365) then 1
else 0
end as output
I'm receiving an error message that reads:
"SQL Error [30100] [HY000]: CASE argument case((xdate,ge,[addday(trunc(cast('date', now(), 'DATE')) '-365')], utc_il8n), 'true', 'false') is not compatible with the rest of the values.
Has anyone else performed a similar operation with dates in a CASE statement? The Addday works fine and returns 2017-01-05.
The issue was with the CASE syntax. I think reading this and other sources may have caused the confusion: https://www.techonthenet.com/sql_server/functions/case.php
It should read:
SELECT
id,
x_date,
CASE WHEN x_date >= addday(cast(now() as date),-365) then 1
else 0
end as output
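The difference between the two forms: "CASE x_date WHEN ..." is a simple CASE, which compares x_date for equality with whatever follows WHEN (here, the result of a comparison), while "CASE WHEN ..." is a searched CASE, which evaluates the condition itself. A minimal sketch of the effect follows. It uses Python's built-in sqlite3 module as a stand-in engine and a plain integer column x with a threshold of 5 in place of the date logic (both assumptions). SQLite is loosely typed, so it returns a wrong answer silently where Denodo raises a type error.

```python
import sqlite3

# Hypothetical table: x stands in for x_date, 5 for the date threshold.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 3)])

# Simple CASE: compares x with the result of (x >= 5), i.e. with 1 or 0.
simple = "SELECT CASE x WHEN x >= 5 THEN 1 ELSE 0 END FROM t ORDER BY id"
# Searched CASE: tests the condition (x >= 5) directly.
searched = "SELECT CASE WHEN x >= 5 THEN 1 ELSE 0 END FROM t ORDER BY id"

simple_out = [row[0] for row in conn.execute(simple)]
searched_out = [row[0] for row in conn.execute(searched)]
print(simple_out)    # x is never equal to 0 or 1 here, so no branch matches
print(searched_out)
```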

SQL Query Error with CASE Statement?

I'm attempting to run this query using Simba's ODBC SFDC driver, but the log shows an error near the CASE statement. I'm not totally convinced it's an error with the CASE statement, but I don't see where my error is. Someone please help!
SELECT
Account_Group__c,
Hospital_Sales_Teammate__c,
Name,
StageName,
CloseDate,
Yr_Credited__c,
Probability,
Census__c,
Credit__c,
Related_VSA__c,
AB_Hospital_Relationship_Type__c,
CASE
WHEN Age_In_Stage__c >0 and Age_In_Stage__c <= 30 THEN '<30'
WHEN Age_In_Stage__c >30 and Age_In_Stage__c <= 60 THEN '31-60'
WHEN Age_In_Stage__c >60 and Age_In_Stage__c <= 90 THEN '61-90'
ELSE '>90' END AS Age_Bucket,
CASE
WHEN (Type = "Existing Business - Renewal" OR Type = 'Existing Business - Amendment')
AND (Account_HHV_Segment__c='A' OR Account_HHV_Segment__c='B')
AND AB_Hospital_Relationship_Type__c<>'N/A'
AND (RecordType='012300000000PWuAAM'
OR RecordType='01250000000DcJkAAK'
OR RecordType='01250000000DpV4AAK'
OR RecordType='01250000000Dxd7AAC'
OR RecordType='01250000000DoFPAA0'
OR RecordType='01250000000DuuEAAS') THEN 'Hosp'
WHEN Name LIKE '%AB Hospital Loss%' THEN 'Hosp'
ELSE '' END AS Hospital_Eligible,
CASE
WHEN RecordType='01250000000DpV4AAK'
AND Type LIKE '%Acquisition%'
THEN 'Acq'
ELSE '' END AS Acquisition_Eligible,
CASE
WHEN RecordType='01250000000Dxd7AAC'
AND (Business_Unit__c="Full Conversion" OR Business_Unit__c="Partial Conversion")
THEN 'BGC'
ELSE '' END AS Conversion_Eligible,
CASE
WHEN RecordType='01250000000DuuEAAS'
AND Type_of_Agreement__c ="MDA" OR Type_of_Agreement__c ="Joinder" OR Type_of_Agreement__c ="JV"
THEN 'Incr Doc'
ELSE '' END AS Incr_Doc_Eligible
FROM
Opportunity
WHERE
Eligible__c<>'No'
AND NOT Name LIKE '%test%'
AND NOT Name LIKE '%Test%'
AND NOT Name LIKE '%TEST%'
ORDER BY
Account_Group__c ASC
Business_Unit__c="Full Conversion" (and other places as well): You are using double quotes instead of single quotes (as you do in the rest of the query). I bet that's the problem...
Also, this is a case expression, not a statement.
Why are you using double quotes?
(Type = "Existing Business - Renewal" OR Type = 'Existing Business - Amendment')
You should change it to
(Type = 'Existing Business - Renewal' OR Type = 'Existing Business - Amendment')