Is there a way I can simplify my case when statement - apache-spark-sql

I'm trying to write code (Spark SQL) that classifies partitioned data as ST.
Basically the goal is to end up with another column Open ST that will only show open STs (open STs are determined by NOT having Task RW anywhere after ST).
End goal:
So in this case item12346 will end up with ST because there is no Task RW anywhere after Task ST. Item 12345 won't be an open ST because RW is present somewhere after ST.
As you can see, other tasks can be present after ST but that's not necessarily relevant as I care about the Tasks ST and RW.
Any thoughts on how I can code this? To be fully transparent, I have created other columns based on partitions, like NextTask, Lag and Lastvalue, and I'm using a CASE WHEN clause with them, but I think they might be complicating things for me:
CASE WHEN code_task = 'ST' AND lastvalue = 'CR' AND Lag_ NOT LIKE '%RW' AND Next_Task NOT LIKE '%RW%' THEN 'ST' END AS `Open ST`
Hoping there's a simpler solution from just looking at the tables I'm providing. Thank you!

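The question is unclear about the data, so here is an approach based on some assumptions. Not with CASE alone, but in another way: find the items that have an RW, then join back. Note that this only checks whether an RW exists for the item; it assumes that any RW comes after the ST.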
Code
import spark.implicits._
import org.apache.spark.sql.functions._
// Unclear whether c is ascending or unique per item; otherwise a zipWithIndex would be needed.
// Assumption: c can be used as-is, and any RW for an item comes after its ST.
val df = spark.sparkContext.parallelize(Seq(
  (1,7,"ST"), (1,8,"XX"), (1,9,"RW"),
  (3,10,"ST"), (3,11,"AA"), (3,12,"RW"),
  (2,3,"ST"), (2,4,"TT")
)).toDF("i", "c", "t")
df.createOrReplaceTempView("data")
// Items that have at least one RW; DISTINCT avoids duplicating rows in the join below.
val res = spark.sql("""SELECT DISTINCT i AS iN FROM data WHERE t = 'RW'""")
// The left join keeps every row of df; iN is null for items without an RW.
val temp = df.join(res, df("i") === res("iN"), "left_outer")
val results = temp.withColumn("openST", when(col("iN").isNull && col("t") === lit("ST"), lit("ST")).otherwise("")).select($"i", $"c", $"t", $"openST")
results.show(false)
Results
+---+---+---+------+
|i |c |t |openST|
+---+---+---+------+
|1 |7 |ST | |
|1 |8 |XX | |
|1 |9 |RW | |
|2 |3 |ST |ST |
|2 |4 |TT | |
|3 |10 |ST | |
|3 |11 |AA | |
|3 |12 |RW | |
+---+---+---+------+
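The join above treats an item as closed whenever an RW exists anywhere for it, regardless of position. If the order matters (the RW must come after the ST), one possible variation is a correlated NOT EXISTS on the ordering column. The sketch below is not Spark code: it runs the SQL against an in-memory SQLite database from Python purely to illustrate the logic. It assumes c orders the tasks within each item i, and it adds a hypothetical item 4 (RW before ST) that is not in the original data. Whether a given Spark version accepts this exact subquery form would need checking.

```python
import sqlite3

# Sample rows from the answer above, plus a hypothetical item 4 where RW precedes ST.
rows = [
    (1, 7, "ST"), (1, 8, "XX"), (1, 9, "RW"),
    (3, 10, "ST"), (3, 11, "AA"), (3, 12, "RW"),
    (2, 3, "ST"), (2, 4, "TT"),
    (4, 1, "RW"), (4, 2, "ST"),  # hypothetical: RW comes before ST
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (i INTEGER, c INTEGER, t TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# An ST is "open" when no RW with a larger c exists for the same item.
query = """
SELECT d.i, d.c, d.t,
       CASE WHEN d.t = 'ST'
             AND NOT EXISTS (SELECT 1 FROM data r
                             WHERE r.i = d.i AND r.t = 'RW' AND r.c > d.c)
            THEN 'ST' ELSE '' END AS openST
FROM data d
ORDER BY d.i, d.c
"""
result = conn.execute(query).fetchall()

# Keep only the (item, position) pairs flagged as open STs.
open_sts = [(i, c) for (i, c, t, open_st) in result if open_st == "ST"]
print(open_sts)
```

Item 2 is flagged because it has no RW at all, and the hypothetical item 4 is flagged because its only RW comes before the ST.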

Related

Spark Scala: How to Pass column value from one table as column condition another Dataframe Creation

I have a use case like this: a look-up table contains formulas, the original table contains the column values, and the final table needs to be created by applying the formulas to the column values from the original table. The formulas differ for each client.
lkp1:
|clnt_id |total_amount |total_avg
==============================================
|1 |col1+col2 |col2-col1
|2 |col1+col2+5 |1
|3 |2 |14/col3
orig_1
clnt_id |name |col1 |col2 |col3
1 |name1 |1 |2 |4
2 |name2 |1 |4 |5
3 |name4 |3 |5 |7
final_1
clnt_id |name |Amount |avg
1 |name1 |3 |-2
2 |name2 |10 |1
3 |name4 |2 |2
I have achieved this using:
import org.apache.spark.sql.DataFrame
var final_1: DataFrame = null
val lookup_1_df = spark.sql("select * from lookup_1")
// Collect the per-client formulas to the driver.
val lookup_rows = lookup_1_df.select("clnt_id", "total_amount", "total_avg").collect()
for (row <- lookup_rows)
{
val clnt_id = row(0).toString
val total_amount = row(1).toString
val total_avg = row(2).toString
// One query per client, with that client's formulas spliced in as column expressions.
val query = s"select a.clnt_id, a.name, $total_amount, $total_avg from orig_table_1 a left join lookup_1 b on a.clnt_id = b.clnt_id where a.clnt_id = '$clnt_id'"
val final_intermediate = spark.sql(query)
// Accumulate the per-client results into one DataFrame.
final_1 = if (final_1 == null) final_intermediate else final_intermediate.union(final_1)
}
final_1.createOrReplaceTempView("final_1_table")
I have shown only a sample data set here; my original table contains millions of records and I have 1000+ clients. I am aware that looping is not optimal, since the above snippet has to run once per client. Can this be done more efficiently? Any suggestions?
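One possible direction, offered as a sketch rather than a tested Spark solution: collect the lookup table once and fold all formulas into a single CASE expression per output column, so the large table is read in one query instead of once per client. The example below builds that SQL string in Python and runs it against an in-memory SQLite database purely to illustrate the idea. It assumes that "col" in the lookup table means col1; the helper name case_expr and the aliases amount and average are hypothetical. Formulas are spliced into the query as text, so this only makes sense when the lookup table is trusted.

```python
import sqlite3

# Lookup rows: (clnt_id, total_amount formula, total_avg formula).
# Assumption: "col" in the question's lookup table stands for "col1".
lookup = [
    (1, "col1+col2", "col2-col1"),
    (2, "col1+col2+5", "1"),
    (3, "2", "14/col3"),
]
orig = [(1, "name1", 1, 2, 4), (2, "name2", 1, 4, 5), (3, "name4", 3, 5, 7)]

def case_expr(position, alias):
    """Fold one formula column of the lookup into a single CASE expression."""
    branches = " ".join(f"WHEN {row[0]} THEN {row[position]}" for row in lookup)
    return f"CASE clnt_id {branches} END AS {alias}"

# A single query covering every client.
query = (
    f"SELECT clnt_id, name, {case_expr(1, 'amount')}, {case_expr(2, 'average')} "
    "FROM orig_1 ORDER BY clnt_id"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orig_1 (clnt_id INTEGER, name TEXT, col1 INTEGER, col2 INTEGER, col3 INTEGER)"
)
conn.executemany("INSERT INTO orig_1 VALUES (?, ?, ?, ?, ?)", orig)
result = conn.execute(query).fetchall()
print(result)
```

With 1000+ clients the CASE expressions become long, but the table is still read in one pass; whether Spark plans this efficiently at that size is something to measure.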

Conditional count of rows where at least one peer qualifies

Background
I'm a novice SQL user. Using PostgreSQL 13 on Windows 10 locally, I have a table t:
+--+---------+-------+
|id|treatment|outcome|
+--+---------+-------+
|a |1 |0 |
|a |1 |1 |
|b |0 |1 |
|c |1 |0 |
|c |0 |1 |
|c |1 |1 |
+--+---------+-------+
The Problem
I didn't explain myself well initially, so I've rewritten the goal.
Desired result:
+-----------------------+-----+
|ever treated |count|
+-----------------------+-----+
|0 |1 |
|1 |3 |
+-----------------------+-----+
First, identify the ids that have ever been treated. Being "ever treated" means having any row with treatment = 1.
Second, count rows with outcome = 1 for each of those two groups. From my original table, the ids who are "ever treated" have a total of 3 rows with outcome = 1, and the "never treated", so to speak, have 1 row with outcome = 1.
What I've tried
I can get much of the way there, I think, with something like this:
select treatment, count(outcome)
from t
group by treatment;
But that only gets me this result:
+---------+-----+
|treatment|count|
+---------+-----+
|0 |2 |
|1 |4 |
+---------+-----+
For the updated question:
SELECT ever_treated, sum(outcome_ct) AS count
FROM (
SELECT id
, max(treatment) AS ever_treated
, count(*) FILTER (WHERE outcome = 1) AS outcome_ct
FROM t
GROUP BY 1
) sub
GROUP BY 1;
ever_treated | count
--------------+-------
0 | 1
1 | 3
db<>fiddle here
Read:
For those who got no treatment at all (all treatment = 0), we see 1 x outcome = 1.
For those who got any treatment (at least one treatment = 1), we see 3 x outcome = 1.
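This would be simpler and faster with proper boolean values instead of integers.
(Answer to updated question)
Here is an easy-to-follow subquery logic that works with integers. Note that the WHERE clause filters on outcome = 1 before max(treatment) is computed, so an id whose treated rows all have outcome = 0 would be counted as never treated; on the sample data the result is the same: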
select subq.ever_treated, sum(subq.count) as count
from (select id, max(treatment) as ever_treated, count(*) as count
from t where outcome = 1
group by id) as subq
group by subq.ever_treated;
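The two answers can be compared side by side. The sketch below runs both against an in-memory SQLite database from Python; the FILTER clause of the first answer is rewritten as SUM(CASE ...), which is an equivalent form chosen here for portability, and the count column is renamed outcome_count. Both queries agree on the sample table. The hypothetical id d (treated only in a row with outcome = 0) is not in the original data and is added to show where they diverge.

```python
import sqlite3

# Aggregate per id first, then count outcomes (FILTER rewritten as SUM(CASE ...)).
AGGREGATE_FIRST = """
SELECT ever_treated, SUM(outcome_ct) AS outcome_count
FROM (
    SELECT id,
           MAX(treatment) AS ever_treated,
           SUM(CASE WHEN outcome = 1 THEN 1 ELSE 0 END) AS outcome_ct
    FROM t
    GROUP BY id
) sub
GROUP BY ever_treated
ORDER BY ever_treated
"""

# Filter on outcome = 1 first, then aggregate per id.
FILTER_FIRST = """
SELECT ever_treated, SUM(cnt) AS outcome_count
FROM (
    SELECT id, MAX(treatment) AS ever_treated, COUNT(*) AS cnt
    FROM t
    WHERE outcome = 1
    GROUP BY id
) subq
GROUP BY ever_treated
ORDER BY ever_treated
"""

def run(rows, query):
    """Load rows into a fresh in-memory table t and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id TEXT, treatment INTEGER, outcome INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result

sample = [("a", 1, 0), ("a", 1, 1), ("b", 0, 1),
          ("c", 1, 0), ("c", 0, 1), ("c", 1, 1)]
# Hypothetical extra id: treated once, but its only outcome = 1 row is untreated.
extra = sample + [("d", 1, 0), ("d", 0, 1)]

for label, rows in (("sample", sample), ("with d", extra)):
    print(label, run(rows, AGGREGATE_FIRST), run(rows, FILTER_FIRST))
```

On the extended data the aggregate-first query counts d's outcome under "ever treated", while the filter-first query counts it under "never treated", because the treated row is discarded before max(treatment) is evaluated.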

In SQL, query a table by transposing column results

Background
Forgive the title of this question, as I'm not really sure how to describe what I'm trying to do.
I have a SQL table, d, that looks like this:
+--+---+------------+------------+
|id|sex|event_type_1|event_type_2|
+--+---+------------+------------+
|a |m |1 |1 |
|b |f |0 |1 |
|c |f |1 |0 |
|d |m |0 |1 |
+--+---+------------+------------+
The Problem
I'm trying to write a query that yields the following summary of counts of event_type_1 and event_type_2 cut (grouped?) by sex:
+-------------+-----+-----+
| | m | f |
+-------------+-----+-----+
|event_type_1 | 1 | 1 |
+-------------+-----+-----+
|event_type_2 | 2 | 1 |
+-------------+-----+-----+
The thing is, this seems to involve some kind of transposition of the 2 event_type columns into rows of the query result that I'm not familiar with as a novice SQL user.
What I've tried
I've so far come up with the following query:
SELECT event_type_1, event_type_2, count(sex)
FROM d
group by event_type_1, event_type_2
But that only gives me this:
+------------+------------+-----+
|event_type_1|event_type_2|count|
+------------+------------+-----+
|1 |1 |1 |
|1 |0 |1 |
|0 |1 |2 |
+------------+------------+-----+
You can use a lateral join to unpivot the data, then use conditional aggregation to calculate m and f:
select v.which,
count(*) filter (where d.sex = 'm') as m,
count(*) filter (where d.sex = 'f') as f
from d cross join lateral
(values (d.event_type_1, 'event_type_1'),
(d.event_type_2, 'event_type_2')
) v(val, which)
where v.val = 1
group by v.which;
Here is a db<>fiddle.
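For engines without LATERAL or FILTER, the same unpivot-then-aggregate idea can be written with UNION ALL and CASE. This is a sketch run against an in-memory SQLite database from Python, using the table d from the question; it illustrates the shape of the query rather than the PostgreSQL syntax above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE d (id TEXT, sex TEXT, event_type_1 INTEGER, event_type_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO d VALUES (?, ?, ?, ?)",
    [("a", "m", 1, 1), ("b", "f", 0, 1), ("c", "f", 1, 0), ("d", "m", 0, 1)],
)

# Step 1 (inner query): unpivot the two event columns into (sex, which, val) rows.
# Step 2 (outer query): keep val = 1 and count per sex with conditional sums.
query = """
SELECT which,
       SUM(CASE WHEN sex = 'm' THEN 1 ELSE 0 END) AS m,
       SUM(CASE WHEN sex = 'f' THEN 1 ELSE 0 END) AS f
FROM (
    SELECT sex, 'event_type_1' AS which, event_type_1 AS val FROM d
    UNION ALL
    SELECT sex, 'event_type_2' AS which, event_type_2 AS val FROM d
) u
WHERE val = 1
GROUP BY which
ORDER BY which
"""
result = conn.execute(query).fetchall()
print(result)
```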

How to compare two identically structured dataframes to calculate the row differences

I have the following two identically structured dataframes with id in common.
val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000))
.toDF("id","name","city","credit_score","credit_limit")
scala> originalDF.show(false)
+---+------+---------+------------+------------+
|id |name |city |credit_score|credit_limit|
+---+------+---------+------------+------------+
|1 |gaurav|jaipur |550 |70000 |
|2 |sunil |noida |600 |80000 |
|3 |rishi |ahmedabad|510 |65000 |
+---+------+---------+------------+------------+
val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000))
.toDF("id","name","city","credit_score","credit_limit")
scala> changedDF.show(false)
+---+------+------+------------+------------+
|id |name |city |credit_score|credit_limit|
+---+------+------+------------+------------+
|1 |gaurav|jaipur|550 |70000 |
|2 |sunil |noida |650 |90000 |
|4 |Joshua|cochin|612 |85000 |
+---+------+------+------------+------------+
So I wrote a UDF to calculate the change in column values.
val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
val somedf=changedDF.alias("a").join(originalDF.alias("b"), col("a.id") === col("b.id")).withColumn("diffcolumn", split(concat_ws(",",changedDF.columns.map(x => diff(lit(x), changedDF(x), originalDF(x))):_*),","))
scala> somedf.show(false)
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
|id |name |city |credit_score|credit_limit|id |name |city |credit_score|credit_limit|diffcolumn |
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
|1 |gaurav|jaipur|550 |70000 |1 |gaurav|jaipur|550 |70000 |[, , , , ] |
|2 |sunil |noida |650 |90000 |2 |sunil |noida |600 |80000 |[, , , credit_score, credit_limit]|
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
But I'm not able to get id and diffcolumn separately. If I do
somedf.select('id), it gives me an ambiguity error because there are two id columns in the joined table.
I want to get, for each id whose values have changed, the names of the changed columns in an array. For example, in changedDF the credit score and credit limit of id=2, name=sunil have changed.
So I want the resulting dataframe to look like this:
+---+----------------------------------+
|id |diffcolumn |
+---+----------------------------------+
|2 |[, , , credit_score, credit_limit]|
+---+----------------------------------+
Can anyone suggest an approach to get the id and the changed columns separately in a dataframe?
For your reference, these kinds of diffs can easily be done with the spark-extension package.
It provides the diff transformation that builds that complex query for you:
import uk.co.gresearch.spark.diff._
val options = DiffOptions.default.withChangeColumn("changes") // needed to get the optional 'changes' column
val diff = originalDF.diff(changedDF, options, "id")
diff.show(false)
+----+----------------------------+---+---------+----------+---------+----------+-----------------+------------------+-----------------+------------------+
|diff|changes |id |left_name|right_name|left_city|right_city|left_credit_score|right_credit_score|left_credit_limit|right_credit_limit|
+----+----------------------------+---+---------+----------+---------+----------+-----------------+------------------+-----------------+------------------+
|N |[] |1 |gaurav |gaurav |jaipur |jaipur |550 |550 |70000 |70000 |
|I |null |4 |null |Joshua |null |cochin |null |612 |null |85000 |
|C |[credit_score, credit_limit]|2 |sunil |sunil |noida |noida |600 |650 |80000 |90000 |
|D |null |3 |rishi |null |ahmedabad|null |510 |null |65000 |null |
+----+----------------------------+---+---------+----------+---------+----------+-----------------+------------------+-----------------+------------------+
diff.select($"id", $"diff", $"changes").show(false)
+---+----+----------------------------+
|id |diff|changes |
+---+----+----------------------------+
|1 |N |[] |
|4 |I |null |
|2 |C |[credit_score, credit_limit]|
|3 |D |null |
+---+----+----------------------------+
While this is a simple example, diffing DataFrames can become complicated when wide schemas and null values are involved.
That package is well-tested, so you don't have to worry about getting that query right yourself.
Try this:
val aliasedChangedDF = changedDF.as("a")
val aliasedOriginalDF = originalDF.as("b")
val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
val somedf=aliasedChangedDF.join(aliasedOriginalDF, col("a.id") === col("b.id")).withColumn("diffcolumn", split(concat_ws(",",changedDF.columns.map(x => diff(lit(x), changedDF(x), originalDF(x))):_*),","))
somedf.select(col("a.id").as("id"),col("diffcolumn"))
Just change your join condition from col("a.id") === col("b.id") to "id".
Then there will be only a single id column.
Furthermore, you don't need alias("a") and alias("b"), so your join simplifies from
changedDF.alias("a").join(originalDF.alias("b"), col("a.id") === col("b.id"))
to
changedDF.join(originalDF, "id")
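To make the intended result concrete outside Spark, here is a plain-Python sketch of the same comparison: for every id present in both datasets, collect the names of the columns whose values differ, and keep only the ids with at least one change. The function name changed_columns is hypothetical, and the rows are the sample data from the question. Unlike the UDF output above, unchanged columns are omitted rather than left as empty strings.

```python
COLUMNS = ["id", "name", "city", "credit_score", "credit_limit"]

original = [
    (1, "gaurav", "jaipur", 550, 70000),
    (2, "sunil", "noida", 600, 80000),
    (3, "rishi", "ahmedabad", 510, 65000),
]
changed = [
    (1, "gaurav", "jaipur", 550, 70000),
    (2, "sunil", "noida", 650, 90000),
    (4, "Joshua", "cochin", 612, 85000),
]

def changed_columns(original_rows, changed_rows, key="id"):
    """Map each id found in both inputs to the list of columns that differ."""
    key_idx = COLUMNS.index(key)
    original_by_id = {row[key_idx]: row for row in original_rows}
    result = {}
    for new_row in changed_rows:
        old_row = original_by_id.get(new_row[key_idx])
        if old_row is None:
            continue  # id exists only in the changed data; nothing to compare
        diffs = [c for c, old, new in zip(COLUMNS, old_row, new_row) if old != new]
        if diffs:
            result[new_row[key_idx]] = diffs
    return result

diff = changed_columns(original, changed)
print(diff)
```

Like the inner join in the question, this skips ids that appear on only one side (3 and 4 here).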

Select rows with same id but different result in another column

SQL: I have a table like this:
+------+------+
|ID |Result|
+------+------+
|1 |A |
+------+------+
|2 |A |
+------+------+
|3 |A |
+------+------+
|1 |B |
+------+------+
|2 |B |
+------+------+
The output should be something like:
Output:
+------+-------+-------+
|ID |Result1|Result2|
+------+-------+-------+
|1 |A |B |
+------+-------+-------+
|2 |A |B |
+------+-------+-------+
|3 |A | |
+------+-------+-------+
How can I do this?
SELECT
Id,
MAX(CASE result WHEN 'A' THEN 'A' ELSE NULL END) AS result1,
MAX(CASE result WHEN 'B' THEN 'B' ELSE NULL END) AS result2
FROM
table1
GROUP BY Id
Results:
+------+-------+-------+
|Id |Result1|Result2|
+------+-------+-------+
|1 |A |B |
|2 |A |B |
|3 |A |NULL |
+------+-------+-------+
Run the live demo on SQL Fiddle: http://sqlfiddle.com/#!9/e1081/2
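The conditional-aggregation query can also be checked locally without a database server. The sketch below runs it against an in-memory SQLite database from Python, with the table name table1 and the sample rows taken from the question; the missing second result comes back as None (SQL NULL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (Id INTEGER, Result TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B")],
)

# One output row per Id; each MAX(CASE ...) picks out one result value if present.
query = """
SELECT Id,
       MAX(CASE Result WHEN 'A' THEN 'A' ELSE NULL END) AS result1,
       MAX(CASE Result WHEN 'B' THEN 'B' ELSE NULL END) AS result2
FROM table1
GROUP BY Id
ORDER BY Id
"""
result = conn.execute(query).fetchall()
print(result)
```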
There are a few ways to do it.
None of them is straightforward.
In theory, a simple way would be to create two temporary tables that separate the data: all the "A" results in one table and all the "B" results in another.
Then get the results with a simple query using JOIN.
If you are allowed to use some scripting in the process, it is simpler; otherwise you need more complex logic in your query. And for your query to always work, you need some rules, for example that the A table always contains more ids than the B table.
If you post your real example, it is easier to get better answers.
for this reason:
ID Name filename
1001 swapan 4566.jpg
1002 swapan 678.jpg
1003 karim 7688.jpg
1004 tarek 7889.jpg
1005 karim fdhak.jpg
output:
ID Name filename
1001 swapan 4566.jpg 678.jpg
1003 karim 7688.jpg fdhak.jpg
1004 tarek 7889.jpg ...
.. ... ... ...