I have a Hive source table which contains:
select count(*) from dev_lkr_send.pz_send_param_ano;
--25283 lines
I am trying to load every row of the table into a DataFrame using Spark 2 with Scala. I did the following:
val dfMet = spark.sql(s"""SELECT
CD_ANOMALIE,
CD_FAMILLE,
libelle AS LIB_ANOMALIE,
to_date(substr(MAJ_DATE, 1, 19), 'YYYY-MM-DD HH24:MI:SS') AS DT_MAJ,
CLASSIFICATION,
NB_REJEUX,
case when indic_cd_erreur = 'O' then 1 else 0 end AS TOP_INDIC_CD_ERREUR,
case when invalidation_coordonnee = 'O' then 1 else 0 end AS TOP_COORDONNEE_INVALIDE,
case when typ_mvt = 'S' then 1 else 0 end AS TOP_SUPP,
case when typ_mvt = 'S' then to_date(substr(dt_capt, 1, 19), 'YYYY-MM-DD HH24:MI:SS') else null end AS DT_SUPP
FROM ${use_database}.pz_send_param_ano""")
When I execute dfMet.count() it returns: 46314
Any ideas about the source of the difference?
EDIT1:
Running the same query from Hive returns the same value as the DataFrame (I had been querying from the Impala UI before).
Can someone explain the difference, please? I am working on Hue 4.
One potential source of the difference is that your Hive query is answering from statistics stored in the metastore, which may be out of date, rather than running a fresh count against the table.
If hive.compute.query.using.stats is set to true and the table has statistics computed, count(*) is answered from the metastore. In that case your statistics may be stale and you need to recompute them.
I'm new to SPARK-SQL. Is there an equivalent to "CASE WHEN 'CONDITION' THEN 0 ELSE 1 END" in SPARK SQL ?
select case when 1=1 then 1 else 0 end from table
Thanks
Sridhar
Before Spark 1.2.0
The supported syntax (which I just tried out on Spark 1.0.2) seems to be
SELECT IF(1=1, 1, 0) FROM table
This recent thread http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-td9538.html links to the SQL parser source, which may or may not help depending on your comfort with Scala. At the very least the list of keywords starting (at time of writing) on line 70 should help.
Here's the direct link to the source for convenience: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala.
Update for Spark 1.2.0 and beyond
As of Spark 1.2.0, the more traditional syntax is supported, in response to SPARK-3813: search for "CASE WHEN" in the test source. For example:
SELECT CASE WHEN key = 1 THEN 1 ELSE 2 END FROM testData
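To try the CASE WHEN form without a cluster, here is a minimal sketch that runs the same expression through Python's built-in sqlite3 module. SQLite is only a stand-in for Spark SQL here (an assumption for illustration), and the rows in testData are made up.

```python
import sqlite3

# SQLite stands in for Spark SQL; the table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testData (key INTEGER)")
conn.executemany("INSERT INTO testData VALUES (?)", [(1,), (2,), (3,)])

# Searched CASE: a row yields 1 when key = 1, otherwise 2.
rows = conn.execute(
    "SELECT key, CASE WHEN key = 1 THEN 1 ELSE 2 END AS flag "
    "FROM testData ORDER BY key"
).fetchall()
print(rows)
```

Only the row with key = 1 gets the THEN value; every other row falls through to ELSE.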
Update for most recent place to figure out syntax from the SQL Parser
The parser source can now be found here.
Update for more complex examples
In response to a question below, the modern syntax supports complex Boolean conditions.
SELECT
CASE WHEN id = 1 OR id = 2 THEN "OneOrTwo" ELSE "NotOneOrTwo" END AS IdRedux
FROM customer
You can involve multiple columns in the condition.
SELECT
CASE WHEN id = 1 OR state = 'MA'
THEN "OneOrMA"
ELSE "NotOneOrMA" END AS IdRedux
FROM customer
You can also nest CASE WHEN THEN expression.
SELECT
CASE WHEN id = 1
THEN "OneOrMA"
ELSE
CASE WHEN state = 'MA' THEN "OneOrMA" ELSE "NotOneOrMA" END
END AS IdRedux
FROM customer
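As a quick check that the OR form and the nested form agree, the sketch below runs both against a tiny table. It assumes Python's sqlite3 as a stand-in engine rather than Spark, uses invented customer rows, and writes string literals with single quotes because that is the portable SQL spelling.

```python
import sqlite3

# SQLite is a stand-in engine; the customer rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "NY"), (2, "MA"), (3, "CA")],
)

# One CASE with a compound Boolean condition over two columns.
flat = conn.execute(
    "SELECT id, CASE WHEN id = 1 OR state = 'MA' "
    "THEN 'OneOrMA' ELSE 'NotOneOrMA' END AS IdRedux "
    "FROM customer ORDER BY id"
).fetchall()

# The same logic expressed as a CASE nested inside the ELSE branch.
nested = conn.execute(
    "SELECT id, CASE WHEN id = 1 THEN 'OneOrMA' ELSE "
    "CASE WHEN state = 'MA' THEN 'OneOrMA' ELSE 'NotOneOrMA' END "
    "END AS IdRedux FROM customer ORDER BY id"
).fetchall()

print(flat)
print(nested)
```

Both queries label the same rows, so the nested form is mainly useful when the inner branches need different results.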
For Spark 2.+
Spark when function
From documentation:
Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.
// Example: encoding gender string column into integer.
// Scala:
people.select(when(col("gender") === "male", 0)
.when(col("gender") === "female", 1)
.otherwise(2))
// Java:
people.select(when(col("gender").equalTo("male"), 0)
.when(col("gender").equalTo("female"), 1)
.otherwise(2))
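The "null for unmatched conditions" rule quoted above has the same shape as a SQL CASE without an ELSE branch. The sketch below illustrates that with Python's sqlite3 as a stand-in (it does not call Spark's when function), and the people rows are hypothetical.

```python
import sqlite3

# SQLite is a stand-in engine; the people rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "male"), (2, "female"), (3, "other"), (4, None)],
)

# With ELSE: unmatched rows (including NULL gender) get the default 2.
with_else = [r[0] for r in conn.execute(
    "SELECT CASE WHEN gender = 'male' THEN 0 "
    "WHEN gender = 'female' THEN 1 ELSE 2 END "
    "FROM people ORDER BY id"
)]

# Without ELSE: unmatched rows come back as NULL (None in Python).
without_else = [r[0] for r in conn.execute(
    "SELECT CASE WHEN gender = 'male' THEN 0 "
    "WHEN gender = 'female' THEN 1 END "
    "FROM people ORDER BY id"
)]

print(with_else)
print(without_else)
```

Leaving out the default is therefore a deliberate choice: downstream code has to be ready for nulls.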
This syntax worked for me in Databricks:
select
org,
patient_id,
case
when (age is null) then 'Not Available'
when (age < 15) then 'Less than 15'
when (age >= 15 and age < 25) then '15 to 25'
when (age >= 25 and age < 35) then '25 to 35'
when (age >= 35 and age < 45) then '35 to 45'
when (age >= 45) then '45 and Older'
end as age_range
from demo
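Because a CASE returns the first branch whose condition is true, the lower bounds in the query above are optional. The sketch below compares the explicit form with a shortened one; it assumes Python's sqlite3 as a stand-in for Databricks, and the ages are invented boundary values.

```python
import sqlite3

# SQLite is a stand-in engine; the ages are hypothetical boundary values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (patient_id INTEGER, age INTEGER)")
ages = [None, 10, 15, 24, 25, 34, 35, 44, 45, 80]
conn.executemany("INSERT INTO demo VALUES (?, ?)", list(enumerate(ages)))

# Explicit lower and upper bounds on every branch.
explicit = conn.execute("""
    SELECT age, CASE
        WHEN age IS NULL THEN 'Not Available'
        WHEN age < 15 THEN 'Less than 15'
        WHEN age >= 15 AND age < 25 THEN '15 to 25'
        WHEN age >= 25 AND age < 35 THEN '25 to 35'
        WHEN age >= 35 AND age < 45 THEN '35 to 45'
        WHEN age >= 45 THEN '45 and Older'
    END AS age_range
    FROM demo ORDER BY patient_id
""").fetchall()

# Branches are tested in order, so earlier ones already exclude lower ages.
shortened = conn.execute("""
    SELECT age, CASE
        WHEN age IS NULL THEN 'Not Available'
        WHEN age < 15 THEN 'Less than 15'
        WHEN age < 25 THEN '15 to 25'
        WHEN age < 35 THEN '25 to 35'
        WHEN age < 45 THEN '35 to 45'
        ELSE '45 and Older'
    END AS age_range
    FROM demo ORDER BY patient_id
""").fetchall()

print(explicit == shortened)
```

The explicit version is longer but keeps each branch readable on its own, which is a reasonable trade-off for reporting queries.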
An analog of Oracle SQL's decode() function can be implemented in Spark SQL as follows:
case
when exp1 in ('a','b','c')
then element_at(map('a','A','b','B','c','C'), exp1)
else exp1
end
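The same value-to-value mapping can also be written as a simple CASE (CASE expr WHEN value THEN result), which avoids building a map. The sketch below shows that variant using Python's sqlite3 as a stand-in engine; the table t and its exp1 values are hypothetical.

```python
import sqlite3

# SQLite is a stand-in engine; table t and its values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, exp1 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "z")],
)

# Simple CASE: compare exp1 to each listed value, fall back to exp1 itself.
decoded = [r[0] for r in conn.execute("""
    SELECT CASE exp1
        WHEN 'a' THEN 'A'
        WHEN 'b' THEN 'B'
        WHEN 'c' THEN 'C'
        ELSE exp1
    END
    FROM t ORDER BY id
""")]

print(decoded)
```

Values outside the listed keys pass through unchanged, which mirrors decode() with a default argument.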
Based on my current production code, this works:
val identifierDF =
tempIdentifierDF.select(tempIdentifierDF("t_item_account_id"),
when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_cusip")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_ticker")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_isin")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_sedol")),100)
.when(tempIdentifierDF("h_description").contains(tempIdentifierDF("t_valoren")),100)
.otherwise(0)
.alias("identifier_in_description_score")
)
The Spark DataFrame API (Python version) also lets you run a query like this:
df.selectExpr('time',
'CASE WHEN (time > 1) THEN time * 1.1 ELSE time END AS updated_time')
I need to update qualifications in an Oracle DB, and I am running into a problem where my script errors out.
I would usually write a few smaller update statements to get the job done.
However, I thought it would be better to do it in one query. This should be simple, but my background is mostly in T-SQL and MySQL, not Oracle.
Any help would be appreciated.
My Statement.
--ALTER SESSION TO CHANGE DT--
alter session set nls_date_format = 'DD/MM/YYYY HH24:MI:SS';
--Update
Update Qualifications_t
Set (COMMENTS = 'Task'),
(Expiry_DTS = CASE Expiry_DTS
When cd = '1'
Then Expiry_DTS = '31/12/2016 23:59:00'
When cd = '2'
Then Expiry_DTS = '01/07/2019 23:59:00'
When cd = '3'
Then Expiry_DTS = '31/12/1999 23:59:00'
When cd = '4'
Then Expiry_DTS = '31/08/2021 23:59:00'
When cd = '5'
Then Expiry_DTS = '17/06/2021 23:59:00')
END
Where EXPIRY_DTS IS NULL;
--SELECT
Select *
from QUALIFICATIONS_T
where COMMENTS = 'Task';
Error at line 5
ORA-00905: missing keyword
Yes, I googled it but couldn't figure it out.
Remove the parentheses around the update assignments.
Then: it's not clear what you mean by the case expression. Perhaps this:
update qualifications_t
set comments = 'task',
expiry_dts = case when cd = '1' then to_date('31/12/2016 23:59:00',
'dd/mm/yyyy hh24:mi:ss')
when cd = '2' then to_date(....)
(etc.)
end
where expiry_dts is null
;
Notice the structure of a case expression. The name of the column you are updating doesn't belong after the keyword case, and the case expression "returns" values directly, not through assignments. There should be only one assignment ("equal sign"); the case expression is evaluated and returns a single value, used for update.
Note also the proper way to represent date values (assuming the column data type is date, as it should be; if it isn't, you should fix that first).
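To see the single-assignment shape in action, here is a minimal sketch using Python's sqlite3 as a stand-in for Oracle. SQLite has no DATE type or to_date, so the dates are stored as ISO text (an assumption for illustration), and the rows are hypothetical.

```python
import sqlite3

# SQLite stands in for Oracle; dates are ISO text because SQLite has no DATE type.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qualifications_t (cd TEXT, comments TEXT, expiry_dts TEXT)"
)
conn.executemany(
    "INSERT INTO qualifications_t (cd, expiry_dts) VALUES (?, ?)",
    [("1", None), ("2", None), ("3", "2030-01-01 00:00:00"), ("9", None)],
)

# One assignment per column; the CASE expression itself returns the value.
conn.execute("""
    UPDATE qualifications_t
    SET comments = 'Task',
        expiry_dts = CASE
            WHEN cd = '1' THEN '2016-12-31 23:59:00'
            WHEN cd = '2' THEN '2019-07-01 23:59:00'
        END
    WHERE expiry_dts IS NULL
""")

rows = conn.execute(
    "SELECT cd, comments, expiry_dts FROM qualifications_t ORDER BY cd"
).fetchall()
print(rows)
```

The row with cd = '9' matches the WHERE clause but none of the WHEN branches, so its expiry_dts stays NULL while comments is still set; add an ELSE branch or tighten the WHERE clause if that is not wanted.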
Is it possible to refer to the alias of one CASE expression from another CASE expression within the same SQL query?
Example: I have 3 CASE expressions. The first 2 return values based on coded fields. In the 3rd, I would like to refer to the aliases of the first two to return a sum of quantity.
However, I cannot figure out how to get the third CASE expression to refer to the aliases I created earlier. I hope I am explaining this correctly.
Any assistance would be greatly appreciated. Please see attached image for more detail.
SELECT CI_ITEM.ITEMCODE
, CI_ITEM.ITEMCODEDESC
, CASE WHEN DATEDIFF("M",CI_ITEM.DATECREATED,GETDATE()) <60 THEN DATEDIFF("M",CI_ITEM.DATECREATED,GETDATE())
ELSE 60 END AS NO_OF_MONTHS
, CASE WHEN DATEDIFF("M",IM_ITEMTRANSACTIONHISTORY.TRANSACTIONDATE,GETDATE()) <=60
AND IM_ITEMTRANSACTIONHISTORY.TRANSACTIONCODE IN ('BI','II','SO','WI')
THEN IM_ITEMTRANSACTIONHISTORY.TRANSACTIONQTY *-1 ELSE '0' END AS QTY_CONSUMED_60_MONTHS
, CASE WHEN NO_OF_MONTHS = 0 THEN 0 ELSE SUM([QTY_CONSUMED_60_MONTHS])/ [NO_OF_MONTHS] END AS MONTHLY_AVE_ON_60MONTHS_DATA
FROM CI_ITEM
INNER JOIN IM_ITEMTRANSACTIONHISTORY ON CI_ITEM.ITEMCODE = IM_ITEMTRANSACTIONHISTORY.ITEMCODE
Simply wrap your dependent CASE expressions in a subquery and reference their aliases as columns of the subquery result. Because the outer query uses SUM, list the columns explicitly and group by the non-aggregated ones.
SELECT X.ITEMCODE
, X.ITEMCODEDESC
, X.NO_OF_MONTHS
, CASE WHEN X.NO_OF_MONTHS = 0 THEN 0 ELSE SUM(X.QTY_CONSUMED_60_MONTHS) / X.NO_OF_MONTHS END AS MONTHLY_AVE_ON_60MONTHS_DATA
FROM
(
SELECT CI_ITEM.ITEMCODE
, CI_ITEM.ITEMCODEDESC
, CASE WHEN DATEDIFF("M",CI_ITEM.DATECREATED,GETDATE()) <60 THEN DATEDIFF("M",CI_ITEM.DATECREATED,GETDATE())
ELSE 60 END AS NO_OF_MONTHS
, CASE WHEN DATEDIFF("M",IM_ITEMTRANSACTIONHISTORY.TRANSACTIONDATE,GETDATE()) <=60
AND IM_ITEMTRANSACTIONHISTORY.TRANSACTIONCODE IN ('BI','II','SO','WI')
THEN IM_ITEMTRANSACTIONHISTORY.TRANSACTIONQTY *-1 ELSE 0 END AS QTY_CONSUMED_60_MONTHS
FROM CI_ITEM
INNER JOIN IM_ITEMTRANSACTIONHISTORY ON CI_ITEM.ITEMCODE = IM_ITEMTRANSACTIONHISTORY.ITEMCODE
) AS X
GROUP BY X.ITEMCODE, X.ITEMCODEDESC, X.NO_OF_MONTHS
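The subquery pattern can be tried on a small scale with the sketch below. It assumes Python's sqlite3 as a stand-in engine and, since SQLite has no DATEDIFF, a hypothetical txn table whose month differences are already computed; it also groups by the non-aggregated columns so that SUM is valid.

```python
import sqlite3

# SQLite is a stand-in; txn and its precomputed month columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn (itemcode TEXT, months_since_created INTEGER, "
    "months_since_txn INTEGER, txn_code TEXT, qty REAL)"
)
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 12, 3, "SO", -6.0),
        ("A", 12, 5, "SO", -18.0),
        ("A", 12, 4, "PO", 50.0),     # not a consumption code
        ("B", 0, 0, "SO", -5.0),      # created this month: guard the division
        ("C", 90, 70, "SO", -10.0),   # older than 60 months: ignored
        ("C", 90, 10, "WI", -120.0),
    ],
)

# Inner query defines the aliases; outer query can then reference them.
rows = conn.execute("""
    SELECT itemcode,
           no_of_months,
           CASE WHEN no_of_months = 0 THEN 0
                ELSE SUM(qty_consumed) / no_of_months END AS monthly_avg
    FROM (
        SELECT itemcode,
               CASE WHEN months_since_created < 60
                    THEN months_since_created ELSE 60 END AS no_of_months,
               CASE WHEN months_since_txn <= 60
                     AND txn_code IN ('BI', 'II', 'SO', 'WI')
                    THEN qty * -1 ELSE 0 END AS qty_consumed
        FROM txn
    ) AS x
    GROUP BY itemcode, no_of_months
    ORDER BY itemcode
""").fetchall()
print(rows)
```

An alias defined in a SELECT list is not visible to other expressions in that same list, which is why the extra query level is needed.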
I have what I think is a pretty simple question. I've been looking on the internet but haven't been able to find anything. I am basically trying to add an IF statement to my Oracle SQL.
UPDATE PS_Z_TREND_NOW_TBL a
SET STATUS = (
SELECT COUNT(SEC.IS_AW_AUTH_NAME)
FROM PS_IS_AW_SECURITY sec
WHERE sec.IS_AW_AUTH_NAME LIKE '%Manager%'
I want to update STATUS so that if COUNT(SEC.IS_AW_AUTH_NAME) is greater than 0, it is set to 'M'. How would I write this?
With a CASE expression. Oracle's UPDATE has no FROM clause, so the CASE goes inside the scalar subquery:
UPDATE PS_Z_TREND_NOW_TBL a
SET STATUS = (
SELECT CASE WHEN COUNT(sec.IS_AW_AUTH_NAME) > 0 THEN 'M'
ELSE NULL END
FROM PS_IS_AW_SECURITY sec
WHERE sec.IS_AW_AUTH_NAME LIKE '%Manager%'
)
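A runnable sketch of a CASE over COUNT used as a scalar subquery in an UPDATE is shown below. It uses Python's sqlite3 as a stand-in for Oracle, lowercase table names, and invented rows; the helper apply_status is hypothetical and exists only for this example.

```python
import sqlite3

# SQLite stands in for Oracle; both tables hold hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ps_z_trend_now_tbl (id INTEGER, status TEXT)")
conn.execute("CREATE TABLE ps_is_aw_security (is_aw_auth_name TEXT)")
conn.executemany(
    "INSERT INTO ps_z_trend_now_tbl VALUES (?, ?)", [(1, None), (2, None)]
)
conn.executemany(
    "INSERT INTO ps_is_aw_security VALUES (?)",
    [("Sales Manager",), ("Clerk",)],
)


def apply_status(connection):
    """Set status to 'M' when any matching security row exists, else NULL."""
    connection.execute("""
        UPDATE ps_z_trend_now_tbl
        SET status = (
            SELECT CASE WHEN COUNT(sec.is_aw_auth_name) > 0 THEN 'M' END
            FROM ps_is_aw_security sec
            WHERE sec.is_aw_auth_name LIKE '%Manager%'
        )
    """)
    return [r[0] for r in connection.execute(
        "SELECT status FROM ps_z_trend_now_tbl ORDER BY id"
    )]


with_manager = apply_status(conn)

# Remove the matching row and rerun: COUNT is 0, so the CASE yields NULL.
conn.execute(
    "DELETE FROM ps_is_aw_security WHERE is_aw_auth_name LIKE '%Manager%'"
)
without_manager = apply_status(conn)

print(with_manager, without_manager)
```

Note that the subquery is not correlated with the row being updated, so every row receives the same status; add a join condition inside the subquery if the flag should differ per row.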