How to add missing rows per group in Spark - sql

The input dataset looks like this:
org| id |step| value
1 | 1 | 1 | 12
1 | 1 | 2 | 13
1 | 1 | 3 | 14
1 | 1 | 4 | 15
1 | 2 | 1 | 16
1 | 2 | 2 | 17
2 | 1 | 1 | 1
2 | 1 | 2 | 2
For the output, I want to add the missing steps per (org, id) group, for example for id == 2 of org == 1:
org| id |step| value
1 | 1 | 1 | 12
1 | 1 | 2 | 13
1 | 1 | 3 | 14
1 | 1 | 4 | 15
1 | 2 | 1 | 16
1 | 2 | 2 | 17
1 | 2 | 3 | null
1 | 2 | 4 | null
2 | 1 | 1 | 1
2 | 1 | 2 | 2
I tried this, but it doesn't work:
r = df.select("org", "step").distinct()
df.join(r, ["org", "step"], 'right_outer')

The following Scala version works:
val l = df.select("org", "step")
val r = df.select("org", "id")
val right = l.join(r, "org")
val result = df.join(right, Seq("org", "id", "step"), "right_outer").distinct().orderBy("org", "id", "step")
result.show
Gives:
+---+---+----+-----+
|org| id|step|value|
+---+---+----+-----+
| 1| 1| 1| 12|
| 1| 1| 2| 13|
| 1| 1| 3| 14|
| 1| 1| 4| 15|
| 1| 2| 1| 16|
| 1| 2| 2| 17|
| 1| 2| 3| null|
| 1| 2| 4| null|
| 2| 1| 1| 1|
| 2| 1| 2| 2|
+---+---+----+-----+
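For reference, roughly the same approach in PySpark (a sketch, assuming the input is a DataFrame named df) could look like this:

# Build every (org, id, step) combination, then right-join the original
# DataFrame onto it so the missing steps appear with a null value.
steps = df.select("org", "step").distinct()
ids = df.select("org", "id").distinct()
combos = ids.join(steps, "org")

result = (df.join(combos, ["org", "id", "step"], "right_outer")
            .orderBy("org", "id", "step"))
result.show()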
Bonus: a SQL query against a table (orgs) reflecting the df contents:
select distinct o_right."org", o_right."id", o_right."step", o_left."value"
from orgs as o_left
right outer join (
select o_in_left."org", o_in_right."id", o_in_left."step"
from orgs as o_in_right
join (select "org", "step" from orgs) as o_in_left
on o_in_right."org" = o_in_left."org"
order by "org", "id", "step"
) as o_right
on o_left."org" = o_right."org"
and o_left."step" = o_right."step"
and o_left."id" = o_right."id"
order by "org", "id", "step"
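To run a SQL version from PySpark itself, here is a rough equivalent (a sketch, assuming df holds the data; note that Spark SQL quotes identifiers with backticks rather than double quotes, so the quoting is simply dropped):

# Register the DataFrame as a temp view, build the full (org, id, step)
# grid, then left join the original rows back onto it.
df.createOrReplaceTempView("orgs")
result = spark.sql("""
    SELECT g.org, g.id, g.step, o.value
    FROM (
        SELECT i.org, i.id, s.step
        FROM (SELECT DISTINCT org, id FROM orgs) i
        JOIN (SELECT DISTINCT org, step FROM orgs) s ON i.org = s.org
    ) g
    LEFT JOIN orgs o
      ON o.org = g.org AND o.id = g.id AND o.step = g.step
    ORDER BY g.org, g.id, g.step
""")
result.show()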

Related

pyspark add 0 with empty index

I have a dataframe like below:
+--------+---------+---------+
| name | index | score |
+--------+---------+---------+
| name0 | 0 | 50 |
| name0 | 2 | 90 |
| name0 | 3 | 100 |
| name0 | 5 | 85 |
| name1 | 1 | 65 |
| name1 | 2 | 50 |
| name1 | 3 | 70 |
+--------+---------+---------+
The index should run from 0 to 5, so what I want to get is:
+--------+---------+---------+
| name | index | score |
+--------+---------+---------+
| name0 | 0 | 50 |
| name0 | 1 | 0 |
| name0 | 2 | 90 |
| name0 | 3 | 100 |
| name0 | 4 | 0 |
| name0 | 5 | 85 |
| name1 | 0 | 0 |
| name1 | 1 | 65 |
| name1 | 2 | 50 |
| name1 | 3 | 70 |
| name1 | 4 | 0 |
| name1 | 5 | 0 |
+--------+---------+---------+
I want to fill in 0 for the missing indices, but I have no idea how.
Is there any solution? Please note that I am not using pandas.
Cross join the names with a range of indices, then left join to the original dataframe using name and index, and replace nulls with 0.
spark.conf.set("spark.sql.crossJoin.enabled", True)
df2 = (df.select('name')
         .distinct()
         .join(spark.range(6).toDF('index'))
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0}))
df2.show()
+-----+-----+-----+
| name|index|score|
+-----+-----+-----+
|name0| 0| 50|
|name0| 1| 0|
|name0| 2| 90|
|name0| 3| 100|
|name0| 4| 0|
|name0| 5| 85|
|name1| 0| 0|
|name1| 1| 65|
|name1| 2| 50|
|name1| 3| 70|
|name1| 4| 0|
|name1| 5| 0|
+-----+-----+-----+
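An alternative sketch that avoids the global cross-join config flag is to call crossJoin explicitly (available since Spark 2.1):

# Same logic, but with an explicit crossJoin instead of the config setting.
df2 = (df.select('name').distinct()
         .crossJoin(spark.range(6).toDF('index'))
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0}))
df2.show()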

Group query in subquery to get column value as column name

The data I have in my database:
| id| some_id| status|
| 1| 1 | SUCCESS|
| 2| 2 | SUCCESS|
| 3| 1 | SUCCESS|
| 4| 3 | SUCCESS|
| 5| 1 | SUCCESS|
| 6| 4 | FAILED |
| 7| 1 | SUCCESS|
| 8| 1 | FAILED |
| 9| 4 | FAILED |
| 10| 1 | FAILED |
.......
I ran a query grouping by some_id and status to get the result below:
| some_id| count| status|
| 1 | 20| SUCCESS|
| 2 | 5 | SUCCESS|
| 3 | 10| SUCCESS|
| 2 | 15| FAILED |
| 3 | 12| FAILED |
| 4 | 25 | FAILED |
I want to use the above query as a subquery to get the result below, where the distinct status values become column names.
| some_id| SUCCESS| FAILED|
| 1 | 20 | null/0|
| 2 | 5 | 15 |
| 3 | 10 | 12 |
| 4 | null/0| 25 |
Any other approach to get the final data is also appreciated. Let me know if you need more info.
Thanks
You may use a pivot query here with the help of FILTER:
SELECT
    some_id,
    COUNT(*) FILTER (WHERE status = 'SUCCESS') AS SUCCESS,
    COUNT(*) FILTER (WHERE status = 'FAILED') AS FAILED
FROM yourTable
GROUP BY
    some_id;
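If you need the same result in Spark, a DataFrame sketch using pivot (assuming the data is in a DataFrame named df) might look like:

# Each distinct status becomes its own count column; combinations that
# never occur come out as null.
result = (df.groupBy("some_id")
            .pivot("status", ["SUCCESS", "FAILED"])
            .count())
result.show()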

How to group by with a condition in PySpark

How to group by with a condition in PySpark?
This is some example data:
+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x | Adult | 123 |
| 124 | x | Children | 231 |
| 123 | x | Children | 456 |
| 156 | x | Adult | 453 |
| 124 | y | Adult | 34 |
| 432 | y | Adult | 23 |
| 234 | y | Children | 13 |
| 432 | z | Children | 22 |
| 234 | z | Adult | 44 |
+-----+-------+-------------+------------+
Then I want to see the data as:
+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x | 1 | 1 | 2 |
| 124 | x | 1 | 1 | 2 |
| 156 | x | 1 | 0 | 1 |
| 432 | y | 1 | 1 | 2 |
| 234 | z | 1 | 1 | 2 |
+-----+-------+-------+----------+------------+
How can I do this?
Here is the Spark SQL version:
df.createOrReplaceTempView('table')
spark.sql('''
    select zip, state,
           count(if(Agegrouping = 'Adult', 1, null)) as adult,
           count(if(Agegrouping = 'Children', 1, null)) as children,
           count(1) as patient_id
    from table
    group by zip, state
''').show()
+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123| x| 1| 1| 2|
|156| x| 1| 0| 1|
|234| z| 1| 0| 1|
|432| z| 0| 1| 1|
|234| y| 0| 1| 1|
|124| y| 1| 0| 1|
|124| x| 0| 1| 1|
|432| y| 1| 0| 1|
+---+-----+-----+--------+----------+
You can use conditional aggregation:
select zip, state,
sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
count(*) as num_patients
from t
group by zip, state;
Use conditional aggregation:
select
    zip,
    state,
    sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
    sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
    count(*) as patient_id
from mytable
group by zip, state
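The same conditional aggregation can also be written with the DataFrame API; here is a sketch, assuming the data is in a DataFrame named df:

from pyspark.sql import functions as F

# Sum a 1/0 flag per age group and count all rows per (zip, state).
result = (df.groupBy("zip", "state")
            .agg(F.sum(F.when(F.col("Agegrouping") == "Adult", 1).otherwise(0)).alias("adult"),
                 F.sum(F.when(F.col("Agegrouping") == "Children", 1).otherwise(0)).alias("children"),
                 F.count(F.lit(1)).alias("patient_id")))
result.show()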

compute column without using window function

I have two tables as below:
Table 1:
+-------+----------+-------+----------+----------+----------+------+
|pid |ite |zid | date |usales |csales |p |
+-------+----------+-------+----------+----------+----------+------+
| p1| it1| z1|2016-11-21| 0.1| 1| 0|
| p1| it1| z1|2016-12-05| 0.1| 1| 0|
| p1| it1| z1|2017-01-05| 0.1| 1| 0|
| p2| it2| z2|2016-11-28| 0.1| 5| 4|
| p2| it2| z2|2016-12-12| 0.1| 3| 2|
| p1| it2| z1|2016-11-14| 0.1| 2| 1|
| p1| it3| z1|2016-11-21| 0.1| 10| 9|
| p1| it3| z1|2016-12-05| 0.1| 10| 9|
+-------+----------+-------+----------+----------+----------+------+
Table 2:
+------------+----------+--------+----------+----------+
|z_id |p_id |rate |start_date| End_Date |
+------------+----------+--------+----------+----------+
| z1| it1| 25|2016-01-01|2016-06-01|
| z1| it1| 25.75|2016-01-01| null|
| z1| it2| 25|2016-01-01|2017-03-01|
| z1| it2| 32.75|2017-01-01| null|
+------------+----------+--------+----------+----------+
I need the result table based on the join condition Table1.zid = Table2.z_id and Table1.ite = Table2.p_id, with the following conditions:
* if the "date" of the first table falls between "start_date" and "End_Date" of the second table, then column "r" in the result table is computed as Table1.usales * Table2.rate
* if the "date" of the first table does NOT fall between "start_date" and "End_Date" of the second table, then the value of column "r" in the result table is 0
* if the "date" of the first table is on or after "start_date" and "End_Date" is null, then column "r" is likewise computed as Table1.usales * Table2.rate
* if the "ite" and "zid" of Table 1 are not present in the second table, then the value of column "r" in the result table is 0
Output Table:
+-------+----------+-------+----------+----------+----------+------+------+
|pid |ite |zid | date |usales |csales |p |r |
+-------+----------+-------+----------+----------+----------+------+------+
| p1| it1| z1|2016-11-21| 0.1| 1| 0| |
| p1| it1| z1|2016-12-05| 0.1| 1| 0| |
| p1| it1| z1|2017-01-05| 0.1| 1| 0| |
| p2| it2| z2|2016-11-28| 0.1| 5| 4| |
| p2| it2| z2|2016-12-12| 0.1| 3| 2| |
| p1| it2| z1|2016-11-14| 0.1| 2| 1| |
| p1| it3| z1|2016-11-21| 0.1| 10| 9| |
| p1| it3| z1|2016-12-05| 0.1| 10| 9| |
+-------+----------+-------+----------+----------+----------+------+------+
I have tried making a left join between Table 1 and Table 2, dumping the result into "data", and then partitioning as below:
select * from (select *, Row_Number() over(partition by z_id, p_id, date order by r desc) as ID from data) a where ID = 1
I am not sure how to do it using only a join and a subquery, since I need to avoid the window function. Can anyone please help?
You need a LEFT JOIN of Table1 to Table2:
select t1.*,
case
when t1.date >= t2.start_date and t1.date <= coalesce(t2.end_date, t1.date)
then t1.usales * t2.rate
else 0
end r
from table1 t1 left join table2 t2
on t2.z_id = t1.zid and t2.p_id = t1.ite
and (t1.date >= t2.start_date and t1.date <= coalesce(t2.end_date, t1.date))
Results:
+-----+-----+-----+------------+--------+--------+---+--------+
| pid | ite | zid | date       | usales | csales | p | r      |
+-----+-----+-----+------------+--------+--------+---+--------+
| p1  | it1 | z1  | 2016-11-21 |   0.10 |      1 | 0 | 2.5750 |
| p1  | it1 | z1  | 2016-12-05 |   0.10 |      1 | 0 | 2.5750 |
| p1  | it1 | z1  | 2017-01-05 |   0.10 |      1 | 0 | 2.5750 |
| p2  | it2 | z2  | 2016-11-28 |   0.10 |      5 | 4 | 0.0000 |
| p2  | it2 | z2  | 2016-12-12 |   0.10 |      3 | 2 | 0.0000 |
| p1  | it2 | z1  | 2016-11-14 |   0.10 |      2 | 1 | 2.5000 |
| p1  | it3 | z1  | 2016-11-21 |   0.10 |     10 | 9 | 0.0000 |
| p1  | it3 | z1  | 2016-12-05 |   0.10 |     10 | 9 | 0.0000 |
+-----+-----+-----+------------+--------+--------+---+--------+
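If you are doing this in Spark rather than plain SQL, a DataFrame sketch of the same left join (assuming the tables are loaded as DataFrames t1 and t2 with the column names shown above):

from pyspark.sql import functions as F

# Left join on the keys plus the date-range condition, then compute r,
# defaulting to 0 when no rate row matches.
cond = ((t1["zid"] == t2["z_id"]) & (t1["ite"] == t2["p_id"])
        & (t1["date"] >= t2["start_date"])
        & (t1["date"] <= F.coalesce(t2["End_Date"], t1["date"])))

result = (t1.join(t2, cond, "left")
            .select(t1["*"],
                    F.coalesce(t1["usales"] * t2["rate"], F.lit(0)).alias("r")))
result.show()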

How to flatten a pyspark dataframe that contains multiple rows per id?

I have a pyspark dataframe with two id columns id and id2. Each id is repeated exactly n times. All id's have the same set of id2's. I'm trying to "flatten" the matrix resulting from each unique id into one row according to id2.
Here's an example to explain what I'm trying to achieve. My dataframe looks like this:
+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1 | 1 | 54 | 2 |
+----+-----+--------+--------+
| 1 | 2 | 0 | 6 |
+----+-----+--------+--------+
| 1 | 3 | 578 | 14 |
+----+-----+--------+--------+
| 2 | 1 | 10 | 1 |
+----+-----+--------+--------+
| 2 | 2 | 6 | 32 |
+----+-----+--------+--------+
| 2 | 3 | 0 | 0 |
+----+-----+--------+--------+
| 3 | 1 | 12 | 2 |
+----+-----+--------+--------+
| 3 | 2 | 20 | 5 |
+----+-----+--------+--------+
| 3 | 3 | 63 | 22 |
+----+-----+--------+--------+
The desired output is the following table:
+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1 | 54 | 0 | 578 | 2 | 6 | 14 |
+----+----------+----------+----------+----------+----------+----------+
| 2 | 10 | 6 | 0 | 1 | 32 | 0 |
+----+----------+----------+----------+----------+----------+----------+
| 3 | 12 | 20 | 63 | 2 | 5 | 22 |
+----+----------+----------+----------+----------+----------+----------+
So, basically, for each unique id and for each column col, I will have n new columns col_1,... for each of the n id2 values.
Any help would be appreciated!
In Spark 2.4 you can do it this way:
var df3 = Seq((1,1,54,2), (1,2,0,6), (1,3,578,14), (2,1,10,1), (2,2,6,32), (2,3,0,0), (3,1,12,2), (3,2,20,5), (3,3,63,22)).toDF("id","id2","value1","value2")
scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
| 1| 1| 54| 2|
| 1| 2| 0| 6|
| 1| 3| 578| 14|
| 2| 1| 10| 1|
| 2| 2| 6| 32|
| 2| 3| 0| 0|
| 3| 1| 12| 2|
| 3| 2| 20| 5|
| 3| 3| 63| 22|
+---+---+------+------+
Group by id and pivot on id2, using first (wrapped in coalesce) to pick the value1 and value2 for each id2:
scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))
scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")
Rename the columns:
scala> df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
| 1| 54| 2| 0| 6| 578| 14|
| 2| 10| 1| 6| 32| 0| 0|
| 3| 12| 2| 20| 5| 63| 22|
+---+--------+--------+--------+--------+--------+--------+
Rearrange the columns if needed. Let me know if you have any questions. Happy Hadoop!
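For completeness, a PySpark sketch of the same pivot (assuming the data is in a DataFrame named df):

from pyspark.sql import functions as F

# Pivot on id2 and take the first value1/value2 per (id, id2) pair.
df4 = (df.groupBy("id")
         .pivot("id2", [1, 2, 3])
         .agg(F.first("value1").alias("value1"),
              F.first("value2").alias("value2"))
         .orderBy("id"))
# Columns come out as 1_value1, 1_value2, 2_value1, ...; rename them
# (e.g. with toDF) if you prefer the value1_1 style shown above.
df4.show()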