How to group by with a condition in PySpark

How to group by with a condition in PySpark?
Here is some example data:
+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x | Adult | 123 |
| 124 | x | Children | 231 |
| 123 | x | Children | 456 |
| 156 | x | Adult | 453 |
| 124 | y | Adult | 34 |
| 432 | y | Adult | 23 |
| 234 | y | Children | 13 |
| 432 | z | Children | 22 |
| 234 | z | Adult | 44 |
+-----+-------+-------------+------------+
I then want to see the data as:
+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x | 1 | 1 | 2 |
| 124 | x | 1 | 1 | 2 |
| 156 | x | 1 | 0 | 1 |
| 432 | y | 1 | 1 | 2 |
| 234 | z | 1 | 1 | 2 |
+-----+-------+-------+----------+------------+
How can I do this?

Here is the Spark SQL version.
df.createOrReplaceTempView('table')
spark.sql('''
    select zip, state,
           count(if(Agegrouping = 'Adult', 1, null)) as adult,
           count(if(Agegrouping = 'Children', 1, null)) as children,
           count(1) as patient_id
    from table
    group by zip, state
''').show()
+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123| x| 1| 1| 2|
|156| x| 1| 0| 1|
|234| z| 1| 0| 1|
|432| z| 0| 1| 1|
|234| y| 0| 1| 1|
|124| y| 1| 0| 1|
|124| x| 0| 1| 1|
|432| y| 1| 0| 1|
+---+-----+-----+--------+----------+
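For reference, the same conditional counts can also be written with the DataFrame API; here is a minimal sketch, assuming the df and column names from the question:

from pyspark.sql import functions as F

# Conditional counts with the DataFrame API: F.when() returns null when the
# condition is false, and F.count() only counts non-null values.
result = (df.groupBy('zip', 'state')
            .agg(F.count(F.when(F.col('Agegrouping') == 'Adult', 1)).alias('adult'),
                 F.count(F.when(F.col('Agegrouping') == 'Children', 1)).alias('children'),
                 F.count(F.lit(1)).alias('patient_id')))
result.show()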

You can use conditional aggregation:
select zip, state,
sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
count(*) as num_patients
from t
group by zip, state;

Use conditional aggregation:
select
    zip,
    state,
    sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
    sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
    count(*) as patient_id
from mytable
group by zip, state

Related

pyspark add 0 with empty index

I have a dataframe like below:
+--------+---------+---------+
| name | index | score |
+--------+---------+---------+
| name0 | 0 | 50 |
| name0 | 2 | 90 |
| name0 | 3 | 100 |
| name0 | 5 | 85 |
| name1 | 1 | 65 |
| name1 | 2 | 50 |
| name1 | 3 | 70 |
+--------+---------+---------+
The index should run from 0 to 5, so what I want to get is:
+--------+---------+---------+
| name | index | score |
+--------+---------+---------+
| name0 | 0 | 50 |
| name0 | 1 | 0 |
| name0 | 2 | 90 |
| name0 | 3 | 100 |
| name0 | 4 | 0 |
| name0 | 5 | 85 |
| name1 | 0 | 0 |
| name1 | 1 | 65 |
| name1 | 2 | 50 |
| name1 | 3 | 70 |
| name1 | 4 | 0 |
| name1 | 5 | 0 |
+--------+---------+---------+
I want to fill the missing indices with 0, but I have no idea how.
Is there any solution? Please note that I can't use pandas.
Cross join the distinct names with a range of indices, then left join the result to the original dataframe on name and index, and replace nulls with 0.
spark.conf.set("spark.sql.crossJoin.enabled", True)
df2 = (df.select('name')
         .distinct()
         .join(spark.range(6).toDF('index'))
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0})
       )
df2.show()
+-----+-----+-----+
| name|index|score|
+-----+-----+-----+
|name0| 0| 50|
|name0| 1| 0|
|name0| 2| 90|
|name0| 3| 100|
|name0| 4| 0|
|name0| 5| 85|
|name1| 0| 0|
|name1| 1| 65|
|name1| 2| 50|
|name1| 3| 70|
|name1| 4| 0|
|name1| 5| 0|
+-----+-----+-----+
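If you would rather not flip the crossJoin config, newer Spark versions also provide an explicit crossJoin method; a minimal sketch of the same idea:

# Same approach with an explicit crossJoin, so the
# spark.sql.crossJoin.enabled flag is not needed.
df2 = (df.select('name')
         .distinct()
         .crossJoin(spark.range(6).toDF('index'))
         .join(df, ['name', 'index'], 'left')
         .fillna({'score': 0})
       )
df2.show()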

Group query in subquery to get column value as column name

The data I have in my database:
| id| some_id| status|
| 1| 1 | SUCCESS|
| 2| 2 | SUCCESS|
| 3| 1 | SUCCESS|
| 4| 3 | SUCCESS|
| 5| 1 | SUCCESS|
| 6| 4 | FAILED |
| 7| 1 | SUCCESS|
| 8| 1 | FAILED |
| 9| 4 | FAILED |
| 10| 1 | FAILED |
.......
I ran a query grouping by some_id and status to get the result below:
| some_id| count| status|
| 1 | 20| SUCCESS|
| 2 | 5 | SUCCESS|
| 3 | 10| SUCCESS|
| 2 | 15| FAILED |
| 3 | 12| FAILED |
| 4 | 25 | FAILED |
I want to use the above query as a subquery to get the result below, where the distinct status values become column names.
| some_id| SUCCESS| FAILED|
| 1 | 20 | null/0|
| 2 | 5 | 15 |
| 3 | 10 | 12 |
| 4 | null/0| 25 |
Any other approach to get the final data is also appreciated. Let me know if you need more info. Thanks.
You may use a pivot query here with the help of FILTER:
SELECT
some_id,
COUNT(*) FILTER (WHERE status = 'SUCCESS') AS SUCCESS,
COUNT(*) FILTER (WHERE status = 'FAILED') AS FAILED
FROM yourTable
GROUP BY
some_id;
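If you need the same pivot in PySpark, a minimal sketch using groupBy/pivot, assuming a dataframe df with columns some_id and status:

# Count rows per some_id, turning the distinct status values into columns.
# Missing combinations come out as null, matching the null/0 in the question.
pivoted = (df.groupBy('some_id')
             .pivot('status', ['SUCCESS', 'FAILED'])
             .count())
pivoted.show()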

Cross join remaining combinations

I am trying to build a table that would bring me a combination of all products that I could sell, based on the current ones.
Product Status Table
+-------------+--------------+----------------+
| customer_id | product_name | product_status |
+-------------+--------------+----------------+
| 1 | A | Active |
| 2 | B | Active |
| 2 | C | Active |
| 3 | A | Cancelled |
+-------------+--------------+----------------+
Now I am trying to cross join with a hard-coded table that would give me 4 rows per customer_id, based on all 4 products we have in our portfolio and the statuses that I would like to apply.
Portfolio Table
+--------------+------------+----------+
| product_name | status_1 | status_2 |
+--------------+------------+----------+
| A | Inelegible | Inactive |
| B | Inelegible | Inactive |
| C | Ineligible | Inactive |
| D | Inelegible | Inactive |
+--------------+------------+----------+
On my code I tried to use a CROSS JOIN in order to achieve 4 rows per customer_id. Unfortunately, for customers that have more than one product, I have double/triple rows.
This is my code:
SELECT
p.customer_id,
CASE WHEN p.product_name = pt.product_name THEN p.product_name ELSE pt.product_name END AS product_name,
CASE
WHEN p.product_name = pt.product_name THEN p.product_status
ELSE pt.status_1
END AS product_status
FROM
products AS p
CROSS JOIN
portfolio as pt
This is my current output:
+----+-------------+--------------+----------------+
| # | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
| 1 | 1 | A | Active |
| 2 | 1 | B | Inelegible |
| 3 | 1 | C | Inelegible |
| 4 | 1 | D | Inelegible |
| 5 | 2 | A | Ineligible |
| 6 | 2 | A | Ineligible |
| 7 | 2 | B | Active |
| 8 | 2 | B | Ineligible |
| 9 | 2 | C | Active |
| 10 | 2 | C | Ineligible |
| 11 | 2 | D | Ineligible |
| 12 | 2 | D | Ineligible |
| 13 | 3 | A | Cancelled |
| 14 | 3 | B | Ineligible |
| 15 | 3 | C | Ineligible |
| 16 | 3 | D | Ineligible |
+----+-------------+--------------+----------------+
As you may see, for customer_id 2 I have two rows for each product, with products B and C having statuses different from what I have in the product_status table.
What I would like to achieve, in this case, is a table with 12 rows, in which the current product/status from the product_status table is shown, and the remaining product/statuses from the portfolio table are added.
Expected output
+----+-------------+--------------+----------------+
| # | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
| 1 | 1 | A | Active |
| 2 | 1 | B | Inelegible |
| 3 | 1 | C | Inelegible |
| 4 | 1 | D | Inelegible |
| 5 | 2 | A | Ineligible |
| 6 | 2 | B | Active |
| 7 | 2 | C | Active |
| 8 | 2 | D | Ineligible |
| 9 | 3 | A | Cancelled |
| 10 | 3 | B | Ineligible |
| 11 | 3 | C | Ineligible |
| 12 | 3 | D | Ineligible |
+----+-------------+--------------+----------------+
Not sure if the CROSS JOIN is the best alternative, but now I am running out of ideas.
EDIT:
I thought of another cleaner solution. Do a cross join first, then a right join on the customer_id and product_name, and coalesce the product statuses.
SELECT customer_id, product_name, coalesce(product_status, status_1)
FROM products p
RIGHT JOIN (
SELECT *
FROM (SELECT DISTINCT customer_id FROM products) pro
CROSS JOIN portfolio
) pt
USING (customer_id, product_name)
ORDER BY customer_id, product_name
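The same idea can be sketched with the DataFrame API, assuming dataframes named products and portfolio as in the question:

from pyspark.sql import functions as F

# Cross join every customer with the full portfolio, then left join the
# products the customer actually has and fall back to status_1.
all_combos = (products.select('customer_id').distinct()
                      .crossJoin(portfolio.select('product_name', 'status_1')))
result = (all_combos.join(products, ['customer_id', 'product_name'], 'left')
                    .select('customer_id',
                            'product_name',
                            F.coalesce(F.col('product_status'), F.col('status_1')).alias('product_status'))
                    .orderBy('customer_id', 'product_name'))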
Old answer:
The idea is to collect all product names for a customer_id into a list, then check whether each product in the portfolio is in that list.
(SELECT customer_id, pt_product_name as product_name, first(status_1) as product_status
FROM (
SELECT
customer_id,
p.product_name as p_product_name,
pt.product_name as pt_product_name,
product_status,
status_1,
status_2,
collect_list(p.product_name) over (partition by customer_id) AS product_list
FROM products p
CROSS JOIN portfolio pt
)
WHERE NOT array_contains(product_list, pt_product_name)
GROUP BY customer_id, product_name)
UNION ALL
(SELECT customer_id, p_product_name as product_name, first(product_status) as product_status
FROM (
SELECT
customer_id,
p.product_name as p_product_name,
pt.product_name as pt_product_name,
product_status,
status_1,
status_2,
collect_list(p.product_name) over (partition by customer_id) AS product_list
FROM products p
CROSS JOIN portfolio pt)
WHERE array_contains(product_list, pt_product_name)
GROUP BY customer_id, product_name)
ORDER BY customer_id, product_name;
which gives
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| A| Active|
| 1| B| Inelegible|
| 1| C| Ineligible|
| 1| D| Inelegible|
| 2| A| Inelegible|
| 2| B| Active|
| 2| C| Active|
| 2| D| Inelegible|
| 3| A| Cancelled|
| 3| B| Inelegible|
| 3| C| Ineligible|
| 3| D| Inelegible|
+-----------+------------+--------------+
FYI the chunk before UNION ALL gives:
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| B| Inelegible|
| 1| C| Ineligible|
| 1| D| Inelegible|
| 2| A| Inelegible|
| 2| D| Inelegible|
| 3| B| Inelegible|
| 3| C| Ineligible|
| 3| D| Inelegible|
+-----------+------------+--------------+
And the chunk after UNION ALL gives:
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| A| Active|
| 2| B| Active|
| 2| C| Active|
| 3| A| Cancelled|
+-----------+------------+--------------+
Hope that helps!

How to add missing rows per group in Spark

The input dataset looks like this:
org| id |step| value
1 | 1 | 1 | 12
1 | 1 | 2 | 13
1 | 1 | 3 | 14
1 | 1 | 4 | 15
1 | 2 | 1 | 16
1 | 2 | 2 | 17
2 | 1 | 1 | 1
2 | 1 | 2 | 2
For the output I want to add the missing steps per (org, id) group, for example for id == 2 of org == 1:
org| id |step| value
1 | 1 | 1 | 12
1 | 1 | 2 | 13
1 | 1 | 3 | 14
1 | 1 | 4 | 15
1 | 2 | 1 | 16
1 | 2 | 2 | 17
1 | 2 | 3 | null
1 | 2 | 4 | null
2 | 1 | 1 | 1
2 | 1 | 2 | 2
I tried this but it doesn't work:
r = df.select("org", "step").distinct()
df.join(r, ["org", "step"], 'right_outer')
The following does work: join the (org, step) and (org, id) projections on org to get every (org, id, step) combination, then right-join the original dataframe against them and deduplicate:
val l = df.select("org", "step");
val r = df.select("org", "id");
val right = l.join(r, "org");
val result = df.join(right, Seq("org", "id", "step"), "right_outer").distinct().orderBy("org", "id", "step");
result.show
Gives:
+---+---+----+-----+
|org| id|step|value|
+---+---+----+-----+
| 1| 1| 1| 12|
| 1| 1| 2| 13|
| 1| 1| 3| 14|
| 1| 1| 4| 15|
| 1| 2| 1| 16|
| 1| 2| 2| 17|
| 1| 2| 3| null|
| 1| 2| 4| null|
| 2| 1| 1| 1|
| 2| 1| 2| 2|
+---+---+----+-----+
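For completeness, the same fix in PySpark (a sketch of the approach above, not the only way to write it):

# Build every (org, id, step) combination, then right-join the original df
# so that missing steps show up with a null value.
combos = (df.select('org', 'id').distinct()
            .join(df.select('org', 'step').distinct(), 'org'))
result = (df.join(combos, ['org', 'id', 'step'], 'right')
            .orderBy('org', 'id', 'step'))
result.show()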
Bonus: the SQL query for a table (orgs) reflecting the df contents:
select distinct o_right."org", o_right."id", o_right."step", o_left."value"
from orgs as o_left
right outer join (
select o_in_left."org", o_in_right."id", o_in_left."step"
from orgs as o_in_right
join (select "org", "step" from orgs) as o_in_left
on o_in_right."org" = o_in_left."org"
order by "org", "id", "step"
) as o_right
on o_left."org" = o_right."org"
and o_left."step" = o_right."step"
and o_left."id" = o_right."id"
order by "org", "id", "step"

How to flatten a pyspark dataframe that contains multiple rows per id?

I have a pyspark dataframe with two id columns id and id2. Each id is repeated exactly n times. All id's have the same set of id2's. I'm trying to "flatten" the matrix resulting from each unique id into one row according to id2.
Here's an example to explain what I'm trying to achieve, my dataframe looks like this:
+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1 | 1 | 54 | 2 |
+----+-----+--------+--------+
| 1 | 2 | 0 | 6 |
+----+-----+--------+--------+
| 1 | 3 | 578 | 14 |
+----+-----+--------+--------+
| 2 | 1 | 10 | 1 |
+----+-----+--------+--------+
| 2 | 2 | 6 | 32 |
+----+-----+--------+--------+
| 2 | 3 | 0 | 0 |
+----+-----+--------+--------+
| 3 | 1 | 12 | 2 |
+----+-----+--------+--------+
| 3 | 2 | 20 | 5 |
+----+-----+--------+--------+
| 3 | 3 | 63 | 22 |
+----+-----+--------+--------+
The desired output is the following table:
+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1 | 54 | 0 | 578 | 2 | 6 | 14 |
+----+----------+----------+----------+----------+----------+----------+
| 2 | 10 | 6 | 0 | 1 | 32 | 0 |
+----+----------+----------+----------+----------+----------+----------+
| 3 | 12 | 20 | 63 | 2 | 5 | 22 |
+----+----------+----------+----------+----------+----------+----------+
So, basically, for each unique id and for each column col, I will have n new columns col_1,... for each of the n id2 values.
Any help would be appreciated!
In Spark 2.4 you can do it this way:
var df3 =Seq((1,1,54 , 2 ),(1,2,0 , 6 ),(1,3,578, 14),(2,1,10 , 1 ),(2,2,6 , 32),(2,3,0 , 0 ),(3,1,12 , 2 ),(3,2,20 , 5 ),(3,3,63 , 22)).toDF("id","id2","value1","value2")
scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
| 1| 1| 54| 2|
| 1| 2| 0| 6|
| 1| 3| 578| 14|
| 2| 1| 10| 1|
| 2| 2| 6| 32|
| 2| 3| 0| 0|
| 3| 1| 12| 2|
| 3| 2| 20| 5|
| 3| 3| 63| 22|
+---+---+------+------+
Use coalesce with first to retrieve the single value per id and pivoted id2:
scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))
scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")
Rename the columns:
scala> df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
| 1| 54| 2| 0| 6| 578| 14|
| 2| 10| 1| 6| 32| 0| 0|
| 3| 12| 2| 20| 5| 63| 22|
+---+--------+--------+--------+--------+--------+--------+
Rearrange the columns if needed. Let me know if you have any questions related to the same. Happy Hadoop!
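Since the question asks about PySpark, the same pivot can also be sketched in Python, assuming a dataframe df with the columns shown in the question:

from pyspark.sql import functions as F

# Pivot on id2, keeping the first value1/value2 per (id, id2) pair.
# With multiple aggregations the columns come out as 1_value1, 1_value2, ...
flat = (df.groupBy('id')
          .pivot('id2', [1, 2, 3])
          .agg(F.first('value1').alias('value1'), F.first('value2').alias('value2'))
          .orderBy('id'))
# Rename to the value1_1, value2_1, ... scheme used above.
flat = flat.toDF('id', 'value1_1', 'value2_1', 'value1_2', 'value2_2', 'value1_3', 'value2_3')
flat.show()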