IF and ELSE statement in spark sql expression - apache-spark-sql

I am looking to run a SQL expression that checks for the next event that is either 'DELIVERED' or 'ORDER-CANCELED' and returns a different result depending on which comes first.
df = spark.createDataFrame([["ORDER", "2009-11-23", "1"], ["DELIVERED", "2009-12-17", "1"], ["ORDER-CANCELED", "2009-11-25", "1"], ["ORDER", "2009-12-03", "1"]]).toDF("EVENT", "DATE", "ID")
+--------------+----------+---+
| EVENT| DATE| ID|
+--------------+----------+---+
| ORDER|2009-11-23| 1|
|ORDER-CANCELED|2009-11-25| 1|
| ORDER|2009-12-03| 1|
| DELIVERED|2009-12-17| 1|
+--------------+----------+---+
I have written a statement that works for just a DELIVERED event using this code:
df = df.withColumn("NEXT", f.expr("""
case when EVENT = 'ORDER' then
first(if(EVENT in ('DELIVERED'), 'SUCCESS', null), True)
over (Partition By ID ORDER BY ID, DATE ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
else null end
"""))
This works, but I don't know how to add a second condition that handles the 'ORDER-CANCELED' event.
df = df.withColumn("NEXT", f.expr("""
case when EVENT = 'ORDER' then
first(if(EVENT in ('DELIVERED'), 'SUCCESS', null)
**elseif(EVENT in ('ORDER-CANCELED'), 'CANCELED'), True)**
over (Partition By ID ORDER BY ID, DATE ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
else null end
"""))

Something like this, maybe?
df = df.withColumn(
    "NEXT",
    f.expr("""
        case when EVENT = 'ORDER' then
            first(
                case when EVENT in ('DELIVERED') then 'SUCCESS'
                     when EVENT in ('ORDER-CANCELED') then 'CANCELED'
                     else NULL
                end
            ) over (Partition By ID ORDER BY ID, DATE ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
        else NULL
        end
    """))

Related

Getting values from multiple rows into a single row

I want to get the values from multiple rows of a single column into different columns of a single row, based on the condition of another column.
I want to get the values of field_value_name into the same row but different columns, based on the values present in the field_name column, since they belong to the same id.
How can I get this through SQL or PySpark?
I tried using CASE WHEN, but it scans every row and returns the output for every row.
I'd rather have those values in a single row for every id.
You need pivoting logic here, e.g.
SELECT
parent_id,
MAX(CASE WHEN properties_field_name = 'Status'
THEN properties_field_value_name END) AS REQ_STATUS,
MAX(CASE WHEN properties_field_name = 'Type'
THEN properties_field_value_name END) AS REQ_TYPE,
MAX(CASE WHEN properties_field_name = 'Description'
THEN properties_field_value_name END) AS REQ_DESC
FROM yourTable
GROUP BY parent_id
ORDER BY parent_id;
You can do it with MAX like Tim shows or you can do it with joins like this:
SELECT
parent_id,
status.properties_field_value_name as status,
type.properties_field_value_name as type,
desc.properties_field_value_name as desc
FROM (
SELECT distinct parent_id
FROM thetableyoudidnotname
) as base
LEFT JOIN thetableyoudidnotname as status on base.parent_id = status.parent_id and status.properties_field_name = 'Status'
LEFT JOIN thetableyoudidnotname as type on base.parent_id = type.parent_id and type.properties_field_name = 'Type'
LEFT JOIN thetableyoudidnotname as desc on base.parent_id = desc.parent_id and desc.properties_field_name = 'Description'
Simple pivot problem I think.
data = [
[7024549, 'Status', 'Approved'],
[7024549, 'Type', 'Jama Design'],
[7024549, 'Description', 'null']
]
df = spark.createDataFrame(data, ['parent_id', 'properties_field_name', 'properties_field_value_name'])
df.withColumn('id', f.expr('uuid()')) \
.withColumn('properties_field_name', f.concat(f.lit('REQ_'), f.upper(f.col('properties_field_name')))) \
.groupBy('id', 'parent_id') \
.pivot('properties_field_name') \
.agg(f.first('properties_field_value_name')) \
.drop('id') \
.show()
+---------+---------------+----------+-----------+
|parent_id|REQ_DESCRIPTION|REQ_STATUS| REQ_TYPE|
+---------+---------------+----------+-----------+
| 7024549| null| Approved| null|
| 7024549| null| null|Jama Design|
| 7024549| null| null| null|
+---------+---------------+----------+-----------+
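If the goal is one row per parent_id rather than one per source row, dropping the uuid grouping key collapses the output. Since Spark 2.4 the same reshaping can also be written with the SQL PIVOT clause; a sketch, assuming the DataFrame has been registered as a temp view named your_table:
SELECT *
FROM (
    SELECT parent_id, properties_field_name, properties_field_value_name
    FROM your_table
)
PIVOT (
    first(properties_field_value_name)
    FOR properties_field_name IN ('Status' AS REQ_STATUS, 'Type' AS REQ_TYPE, 'Description' AS REQ_DESC)
)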

How to use CASE WHEN in FIRST_VALUE Oracle

If I compute the CASE WHEN and then select it with FIRST_VALUE separately, it works; but if I combine them as FIRST_VALUE(CASE WHEN ...), it does not work, although the code still runs without a syntax error.
My try is below. I want to find the latest overdue date, but it only shows NULL:
FIRST_VALUE(
CASE
WHEN inv.aging_period = 0
AND is_tad_paid = 0
AND is_mad_paid = 0
AND inv.min_amount_due > 0 THEN
inv.due_date
ELSE
NULL
END
)
OVER(PARTITION BY inv.account_id
ORDER BY inv.DUE_DATE DESC NULLS LAST
) AS latest_overdue_date,
If I try this separately, it works:
select sub.*, first_value(ALL_OVER_DUE_DAY) over (partition by account_id order by ALL_OVER_DUE_DAY desc nulls last) as latest_over_due2
from (select inv.account_id,
             CASE
               WHEN inv.aging_period = 0
                AND is_tad_paid = 0
                AND is_mad_paid = 0
                AND inv.min_amount_due > 0 THEN
                 inv.due_date
               ELSE
                 NULL
             END AS ALL_OVER_DUE_DAY
      from t1 inv) SUB
Sample data: my result column is the last column.
Replace FIRST_VALUE with MAX() and add UNBOUNDED PRECEDING, like below:
max(
CASE
WHEN inv.aging_period = 0
AND is_tad_paid = 0
AND is_mad_paid = 0
AND inv.min_amount_due > 0 THEN
inv.due_date
ELSE
NULL
END
)
OVER(PARTITION BY inv.account_id
ORDER BY inv.DUE_DATE DESC NULLS LAST
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
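The NULL in the original is expected: the CASE yields NULL for every row that is not overdue, and FIRST_VALUE over the default frame picks whatever the first row of the partition (latest DUE_DATE) happens to hold, NULL included. Besides switching to MAX as above, Oracle's FIRST_VALUE also accepts IGNORE NULLS, so a sketch that keeps FIRST_VALUE (untested against the asker's schema):
FIRST_VALUE(
  CASE
    WHEN inv.aging_period = 0
     AND is_tad_paid = 0
     AND is_mad_paid = 0
     AND inv.min_amount_due > 0 THEN
      inv.due_date
  END
  IGNORE NULLS
)
OVER(PARTITION BY inv.account_id
     ORDER BY inv.due_date DESC NULLS LAST
     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_overdue_date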

Rolling Up Customer Data into One Row

I've created a query in Apache Spark in hopes of taking multiple rows of customer data and rolling them up into one row, showing what types of products they have open. So data that looks like this:
Customer  Product
1         Savings
1         Checking
1         Auto
Ends up looking like this:
Customer  Product
1         Savings/Checking/Auto
The query currently still returns multiple rows. I tried GROUP BY, but that doesn't show the multiple products that a customer has; instead, it'll just show one product.
Is there a way to do this in Apache Spark or SQL (which Spark SQL is really similar to)? Unfortunately, I don't have MySQL, nor do I think IT will install it for me.
SELECT
"ACCOUNT"."account_customerkey" AS "account_customerkey",
max(
concat(case when Savings=1 then ' Savings'end,
case when Checking=1 then ' Checking 'end,
case when CD=1 then ' CD /'end,
case when IRA=1 then ' IRA /'end,
case when StandardLoan=1 then ' SL /'end,
case when Auto=1 then ' Auto /'end,
case when Mortgage=1 then ' Mortgage /'end,
case when CreditCard=1 then ' CreditCard 'end)) AS Description
FROM "ACCOUNT" "ACCOUNT"
inner join (
SELECT
"ACCOUNT"."account_customerkey" AS "customerkey",
CASE WHEN "ACCOUNT"."account_producttype" = 'Savings' THEN 1 ELSE NULL END AS Savings,
CASE WHEN "ACCOUNT"."account_producttype" = 'Checking' THEN 1 ELSE NULL END AS Checking,
CASE WHEN "ACCOUNT"."account_producttype" = 'CD' THEN 1 ELSE NULL END AS CD,
CASE WHEN "ACCOUNT"."account_producttype" = 'IRA' THEN 1 ELSE NULL END AS IRA,
CASE WHEN "ACCOUNT"."account_producttype" = 'Standard Loan' THEN 1 ELSE NULL END AS StandardLoan,
CASE WHEN "ACCOUNT"."account_producttype" = 'Auto' THEN 1 ELSE NULL END AS Auto,
CASE WHEN "ACCOUNT"."account_producttype" = 'Mortgage' THEN 1 ELSE NULL END AS Mortgage,
CASE WHEN "ACCOUNT"."account_producttype" = 'Credit Card' THEN 1 ELSE NULL END AS CreditCard
FROM "ACCOUNT" "ACCOUNT"
)a on "account_customerkey" =a."customerkey"
GROUP BY
"ACCOUNT"."account_customerkey"
Please try this.
scala> df.show()
+--------+--------+
|Customer| Product|
+--------+--------+
| 1| Savings|
| 1|Checking|
| 1| Auto|
| 2| Savings|
| 2| Auto|
| 3|Checking|
+--------+--------+
scala> df.groupBy($"Customer").agg(collect_list($"Product").as("Product")).select($"Customer",concat_ws(",",$"Product").as("Product")).show(false)
+--------+---------------------+
|Customer|Product |
+--------+---------------------+
|1 |Savings,Checking,Auto|
|3 |Checking |
|2 |Savings,Auto |
+--------+---------------------+
See https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/collect_list and related functions
You need to use collect_list, which is also available in SQL via %sql.
%sql
select id, collect_list(num)
from t1
group by id
I used my own data, so you will need to tailor it; this just demonstrates the more native SQL form.
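To get the slash-separated format from the question instead of commas, concat_ws can wrap the collect_list. A minimal sketch, assuming a view t1 with the Customer and Product columns from the question (note that collect_list does not guarantee element order):
%sql
select Customer, concat_ws('/', collect_list(Product)) as Product
from t1
group by Customer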

Majority voting of columns SQL

I need to do something like 'majority voting' of columns in an SQL database. That means that, having columns c0, c1, ..., cn, I would like to have in some other column, for each row, the most frequent value among the mentioned columns (and null or a random one otherwise - it doesn't really matter). For example, if we have the following table:
+--+--+--+------+
|c0|c1|c2|result|
+--+--+--+------+
| 0| 1| 0|     0|
| 0| 1| 1|     1|
| 2| 2| 0|     2|
| 0| 3| 1|  null|
+--+--+--+------+
That is what I mean by majority voting of columns c0, c1, c2: in the first row we have two columns with value 0 and one with value 1, so result = 0. In the second we have one 0 vs. two 1s, so result = 1, and so on. We assume that all the columns have the same type.
It would be great if the query were concise (it can be built dynamically). Native SQL is preferred, but PL/SQL or psql will also do.
Thank you in advance.
This can easily be done by creating a table out of the three columns and using an aggregate function on that:
The following works in Postgres:
select c0,c1,c2,
(select c
from unnest(array[c0,c1,c2]) as t(c)
group by c
having count(*) > 1
order by count(*) desc
limit 1)
from the_table;
If you don't want to hard-code the column names, you can use Postgres' JSON function as well:
select t.*,
(select t.v
from jsonb_each_text(to_jsonb(t)) as t(c,v)
group by t.v
having count(*) > 1
order by count(*) desc
limit 1) as result
from the_table t;
Note that the above takes all columns into account. If you want to remove specific columns (e.g. an id column) you need to use to_jsonb(t) - 'id' to remove that key from the JSON value.
Neither of those solutions deals with ties (two different values appearing the same number of times).
Online example: https://rextester.com/PJR58760
The first solution can be "adapted" somewhat to Oracle, especially if you can build the SQL on the fly:
select t.*,
(select c
from (
-- this part would need to be done dynamically
-- if you don't know the columns
select t.c0 as c from dual union all
select t.c1 from dual union all
select t.c2 from dual
) x
group by c
having count(*) > 1
order by count(*) desc
fetch first 1 rows only) as result
from the_table t;
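Postgres also has the ordered-set aggregate mode(), which shortens the per-row subquery considerably. Note the difference in semantics: unlike the having count(*) > 1 versions above, it returns a value even when no value appears more than once, and it resolves ties arbitrarily, so treat this as a sketch rather than a drop-in replacement:
select t.*,
       (select mode() within group (order by c)
        from unnest(array[t.c0, t.c1, t.c2]) as u(c)) as result
from the_table t;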
In Postgres, use jsonb functions. You need a primary key or unique column(s); id is unique in this example:
with my_table(id, c0, c1, c2) as (
values
(1, 0, 1, 0),
(2, 0, 1, 1),
(3, 2, 2, 0),
(4, 0, 3, 1)
)
select distinct on (id) id, value
from (
select id, value, count(*)
from my_table t
cross join jsonb_each_text(to_jsonb(t)- 'id')
group by id, value
) s
order by id, count desc
id | value
----+-------
1 | 0
2 | 1
3 | 2
4 | 1
(4 rows)
The query works well regardless of the number of columns.
Here's a solution for Postgres.
SELECT t1.c0,
t1.c1,
t1.c2,
(SELECT y.c
FROM (SELECT x.c,
count(*) OVER (PARTITION BY x.rn) ct
FROM (SELECT v.c,
rank() OVER (ORDER BY count(v.c) DESC) rn
FROM (VALUES (t1.c0),
(t1.c1),
(t1.c2)) v(c)
GROUP BY v.c) x
WHERE x.rn = 1) y
WHERE y.ct = 1) result
FROM elbat t1;
db<>fiddle
In the subquery, all the values with the maximum count are taken first, using rank(). The windowed version of count() is then used to filter if there is only one value with the maximum count.
If you need to do this over more columns, just add them to the SELECT and the VALUES.
THIS ANSWERS THE ORIGINAL VERSION OF THE QUESTION.
You can just compare the values. For your example with two values neither of which is NULL:
select t.*,
(case when ((case when c0 = 0 then 1 else -1 end) +
(case when c1 = 0 then 1 else -1 end) +
(case when c2 = 0 then 1 else -1 end)
) > 0
then 0 else 1
end)
from t;

Retrieve the First True Condition in a SQL Query

I'm running into a problem wherein I need to get the changes of an Employee based on specific conditions.
I should retrieve rows only based on the changes below:
Assignment Category (only based on specific Categories shown in the table xxtest)
Pay Basis (only from Hourly to Salaried and Vice Versa, also shown in the table xxtest)
Both Pay Basis and Assignment Category (based on the same constraints above)
I have created a Table (XXTEST) to restrict the SQL to only reference this mapping (See DML and DDL at the end of the Question).
Basically, the data I'm expecting would be:
| Sample | FROM_PAY_BASIS_ID | TO_PAY_BASIS_ID | FROM_CATEGORY | TO_CATEGORY | CHANGE_TYPE |
|--------|-------------------|-----------------|---------------|-------------|-------------|
| 1      | 1                 | 2               | FR            | FR          | PAY_BASIS   |
| 2      | 1                 | 1               | FT            | FR          | ASSIGN_CAT  |
| 3      | 1                 | 2               | FT            | FR          | BOTH        |
The table above is based on the conditions below:
1. if BOTH, it should get "Assignment Category and Pay Basis" as its Change_Type
2. if Assignment Category, it should get "Assignment Category" as its Change_Type
3. if Pay Basis, it should get "Pay Basis" as its Change_Type
I want it to evaluate condition 1 first and, if it is false, evaluate the next, until the last.
The problem occurs when both of these columns change, resulting in 3+ rows with CHANGE_TYPE having "PAY_BASIS", "ASSIGN_CAT", and "BOTH" as values.
I've tried a lot of methods, such as the query below, but nothing seemed to fully address my need:
select *
from (select (select coalesce((select CHANGE_TYPE
                               from XXTEST
                               where FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
                                 AND TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
                                 AND FROM_ID = pax.PAY_BASIS_ID
                                 and TO_ID = pax2.PAY_BASIS_ID),
                              (select CHANGE_TYPE
                               from XXTEST
                               WHERE FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
                                 AND TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
                                 AND FROM_ID = 0
                                 and TO_ID = 0),
                              (select CHANGE_TYPE
                               from XXTEST
                               WHERE FROM_CATEGORY = 'N/A'
                                 AND TO_CATEGORY = 'N/A'
                                 AND FROM_ID = pax.PAY_BASIS_ID
                                 and TO_ID = pax2.PAY_BASIS_ID),
                              NULL) CHANGE_TYPE
              FROM DUAL) CHANGE_TYPE
           , PPX.FULL_NAME
           , PPX.EMPLOYEE_NUMBER
      from per_people_X ppx
         , per_assignments_X pax
         , per_assignments_X pax2
      WHERE pax.assignment_id = pax2.assignment_id
        AND PPX.PERSON_id = pax2.PERSON_id
        AND PPX.PERSON_id = PAX.PERSON_id)
where CHANGE_TYPE is not null;
This kinda works, but it retrieves all records regardless of whether they have a match (due to the NULL fallback), resulting in CHANGE_TYPE = NULL.
Is it possible to filter out only those records that have CHANGE_TYPE = NULL without wrapping it in another SELECT statement?
I've also tried using CASE.
It works fine if only the Employee's Pay Basis ID changed (e.g. from 1 to 2)
and if only the Employee's Category changed (e.g. from FT to FR),
but it evaluates all the cases (returning 3 rows) whenever both changes occur.
Any advice?
I didn't follow all the details of your post and the answers and ensuing comments, but it seems you may be after something like this. To get exactly one answer in all situations, you probably need a CASE expression. In the code below I use nested CASE expressions, but that is only to save typing (and a little bit of execution time); you could rewrite this using a single CASE expression.
The reason this works is that evaluation ends immediately as soon as the first TRUE condition in the when... then... pairs is found. You will have to figure out how to attach this to your existing query - I just put its outputs in a CTE for testing purposes below.
with
query_output ( empl_id, from_pay_basis_id, to_pay_basis_id, from_category, to_category ) as (
select 101, 1, 2, 'FR', 'FR' from dual union all
select 102, 1, 1, 'FT', 'FR' from dual union all
select 103, 1, 2, 'FT', 'FR' from dual union all
select 104, 1, 1, 'FR', 'FR' from dual
)
select empl_id, from_pay_basis_id, to_pay_basis_id, from_category, to_category,
case when from_category != to_category then
case when from_pay_basis_id != to_pay_basis_id then 'Assignment Category and Pay Basis'
else 'Assignment Category'
end
when from_pay_basis_id != to_pay_basis_id then 'Pay Basis'
end as change_type
from query_output;
EMPL_ID FROM_PAY_BASIS_ID TO_PAY_BASIS_ID FROM_CATEGORY TO_CATEGORY CHANGE_TYPE
------- ----------------- --------------- ------------- ----------- ---------------------------------
101 1 2 FR FR Pay Basis
102 1 1 FT FR Assignment Category
103 1 2 FT FR Assignment Category and Pay Basis
104 1 1 FR FR
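For reference, the single-CASE rewrite mentioned above would look like the following; it relies on the same top-to-bottom evaluation order, with the combined condition tested first:
select empl_id, from_pay_basis_id, to_pay_basis_id, from_category, to_category,
       case when from_category != to_category
             and from_pay_basis_id != to_pay_basis_id then 'Assignment Category and Pay Basis'
            when from_category != to_category then 'Assignment Category'
            when from_pay_basis_id != to_pay_basis_id then 'Pay Basis'
       end as change_type
from query_output;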
You are probably after CASE expressions in the WHERE clause, something like:
SELECT PPX.FULL_NAME
, PPX.EMPLOYEE_NUMBER
, XT.CHANGE_TYPE
FROM per_people_X ppx
, XXTEST XT
, per_assignments_X pax
, per_assignments_X pax2
WHERE pax.assignment_id = pax2.assignment_id
AND PPX.PERSON_id = pax2.PERSON_id
AND PPX.PERSON_id = PAX.PERSON_id
AND (-- Both Employment Category and Pay Basis records that were changed
case when XT.FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
AND XT.TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID
then 1
else 0
end = 1
OR
-- all Pay Basis records that were "Updated"
case when XT.FROM_CATEGORY = 'N/A'
AND XT.TO_CATEGORY = 'N/A'
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID
then 1
else 0
end = 1
OR
-- all Assignment Category records that were "Updated"
case when XT.FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
AND XT.TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
AND XT.FROM_ID = 0
AND XT.TO_ID = 0
then 1
else 0
end = 1);
N.B. untested, since you didn't provide sample data for the per_people_x or per_assignments tables.
You may have to add extra conditions into the "both categories" case expression to exclude the other two cases; it's not immediately apparent from the data that you have supplied so far.
First of all, if you use OR you should use brackets, i.e. and ( ... or ... ), otherwise you lose some predicates. Your first query should be rewritten to:
SELECT PPX.FULL_NAME
, PPX.EMPLOYEE_NUMBER
, XT.CHANGE_TYPE
FROM per_people_X ppx
, XXTEST XT
, per_assignments_X pax
, per_assignments_X pax2
WHERE pax.assignment_id = pax2.assignment_id
AND PPX.PERSON_id = pax2.PERSON_id
AND PPX.PERSON_id = PAX.PERSON_id
-- THIS IS THE SECTION OF THE QUERY WHERE I'M HAVING PROBLEMS --
-- Both Employment Category and Pay Basis records that were changed
AND (
(XT.FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
AND XT.TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID)
-- all Pay Basis records that were "Updated"
or (XT.FROM_CATEGORY = 'N/A'
AND XT.TO_CATEGORY = 'N/A'
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID)
-- all Assignment Category records that were "Updated"
or (XT.FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
AND XT.TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
AND XT.FROM_ID = 0
AND XT.TO_ID = 0)
);
EDIT: Added a solution that ranks the conditions:
SELECT FULL_NAME
, EMPLOYEE_NUMBER
, CHANGE_TYPE
FROM (
SELECT PPX.FULL_NAME
, PPX.EMPLOYEE_NUMBER
, XT.CHANGE_TYPE
, DENSE_RANK() OVER (PARTITION BY PPX.PERSON_id, pax.assignment_id ORDER BY
CASE WHEN XT.FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
AND XT.TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID
THEN 1
WHEN XT.FROM_CATEGORY = 'N/A'
AND XT.TO_CATEGORY = 'N/A'
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID
THEN 2
ELSE 3
END
) as RNK
FROM per_people_X ppx
, XXTEST XT
, per_assignments_X pax
, per_assignments_X pax2
WHERE pax.assignment_id = pax2.assignment_id
AND PPX.PERSON_id = pax2.PERSON_id
AND PPX.PERSON_id = PAX.PERSON_id
-- THIS IS THE SECTION OF THE QUERY WHERE I'M HAVING PROBLEMS --
-- Both Employment Category and Pay Basis records that were changed
AND (
(XT.FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
AND XT.TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID)
-- all Pay Basis records that were "Updated"
or (XT.FROM_CATEGORY = 'N/A'
AND XT.TO_CATEGORY = 'N/A'
AND XT.FROM_ID = pax.PAY_BASIS_ID
AND XT.TO_ID = pax2.PAY_BASIS_ID)
-- all Assignment Category records that were "Updated"
or (XT.FROM_CATEGORY = pax.EMPLOYMENT_CATEGORY
AND XT.TO_CATEGORY = pax2.EMPLOYMENT_CATEGORY
AND XT.FROM_ID = 0
AND XT.TO_ID = 0)
)
) WHERE RNK = 1;