spark sql: how to create sessionId for user-item

Let's say I've got a dataset like this:
| item | event | timestamp | user |
|:-----------|------------:|:---------:|:---------:|
| titanic | view | 1 | 1 |
| titanic | add to bag | 2 | 1 |
| titanic | close | 3 | 1 |
| avatar | view | 6 | 1 |
| avatar | close | 10 | 1 |
| titanic | view | 20 | 1 |
| titanic | purchase | 30 | 1 |
and so on. I need to calculate a sessionId for each user over each run of consecutive events that relate to the same item.
For this particular data, the output should be the following:
| item | event | timestamp | user | sessionId |
|:-----------|------------:|:---------:|:---------:|:--------------:|
| titanic | view | 1 | 1 | session1 |
| titanic | add to bag | 2 | 1 | session1 |
| titanic | close | 3 | 1 | session1 |
| avatar | view | 6 | 1 | session2 |
| avatar | close | 10 | 1 | session2 |
| titanic | view | 20 | 1 | session3 |
| titanic | purchase | 30 | 1 | session3 |
I tried an approach similar to the one described in "Spark: How to create a sessionId based on userId and timestamp", with this window:
Window.partitionBy("user", "item").orderBy("timestamp")
But that doesn't work, because the same user-item combination might occur in different sessions (see session1 and session3, for example).
With that window they become the same session.
I need help with another approach to implement this.

Here's one approach: first generate a column that holds the timestamp only on rows that start a new session (null elsewhere), then use last(ts, ignoreNulls) along with rowsBetween to fill each row forward with the last non-null timestamp value, and finally construct sessionId using dense_rank:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("titanic", "view", 1, 1),
("titanic", "add to bag", 2, 1),
("titanic", "close", 3, 1),
("avatar", "view", 6, 1),
("avatar", "close", 10, 1),
("titanic", "view", 20, 1),
("titanic", "purchase", 30, 1)
).toDF("item", "event", "timestamp", "user")
val win1 = Window.partitionBy($"user").orderBy($"timestamp")
val win2 = Window.partitionBy($"user").orderBy($"sessTS")
df.
withColumn( "firstTS",
when( row_number.over(win1) === 1 || $"item" =!= lag($"item", 1).over(win1),
$"timestamp" )
).
withColumn( "sessTS",
last($"firstTS", ignoreNulls = true).
over(win1.rowsBetween(Window.unboundedPreceding, 0))
).
withColumn("sessionId", concat(lit("session"), dense_rank.over(win2))).
show
// +-------+----------+---------+----+-------+------+---------+
// | item| event|timestamp|user|firstTS|sessTS|sessionId|
// +-------+----------+---------+----+-------+------+---------+
// |titanic| view| 1| 1| 1| 1| session1|
// |titanic|add to bag| 2| 1| null| 1| session1|
// |titanic| close| 3| 1| null| 1| session1|
// | avatar| view| 6| 1| 6| 6| session2|
// | avatar| close| 10| 1| null| 6| session2|
// |titanic| view| 20| 1| 20| 20| session3|
// |titanic| purchase| 30| 1| null| 20| session3|
// +-------+----------+---------+----+-------+------+---------+
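To see what the window expressions above compute, here is a minimal plain-Python sketch of the same rule, without Spark: order each user's events by timestamp and start a new session whenever the item differs from the previous event. The function name assign_sessions and the tuple layout are assumptions made for illustration only.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (item, event, timestamp, user)
events = [
    ("titanic", "view", 1, 1),
    ("titanic", "add to bag", 2, 1),
    ("titanic", "close", 3, 1),
    ("avatar", "view", 6, 1),
    ("avatar", "close", 10, 1),
    ("titanic", "view", 20, 1),
    ("titanic", "purchase", 30, 1),
]

def assign_sessions(rows):
    """Append a sessionId to every row (hypothetical helper, not a Spark API).

    A new session starts on a user's first event and whenever the item
    differs from that user's previous event in timestamp order.
    """
    out = []
    ordered = sorted(rows, key=lambda r: (r[3], r[2]))  # by user, then timestamp
    for _user, user_rows in groupby(ordered, key=itemgetter(3)):
        session_no = 0
        prev_item = None
        for row in user_rows:
            item = row[0]
            if session_no == 0 or item != prev_item:
                session_no += 1
            prev_item = item
            out.append(row + (f"session{session_no}",))
    return out

sessions = assign_sessions(events)
for row in sessions:
    print(row)
```

The session counter restarts for every user, which mirrors partitioning both windows by user.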

You seem to need to count the number of "view" records cumulatively. If so:
select t.*,
sum(case when event = 'view' then 1 else 0 end) over (partition by user order by timestamp) as session
from t;
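As a rough check of this idea, the running count of "view" events can be sketched in plain Python. The sketch assumes a single user (as in the sample data) and adds one hypothetical extra event, not present in the question, to show where counting views and detecting item changes would diverge.

```python
from itertools import accumulate

# Sample rows from the question: (item, event, timestamp, user)
events = [
    ("titanic", "view", 1, 1),
    ("titanic", "add to bag", 2, 1),
    ("titanic", "close", 3, 1),
    ("avatar", "view", 6, 1),
    ("avatar", "close", 10, 1),
    ("titanic", "view", 20, 1),
    ("titanic", "purchase", 30, 1),
]

def running_view_count(rows):
    """Cumulative number of 'view' events in timestamp order (single user assumed)."""
    ordered = sorted(rows, key=lambda r: r[2])
    flags = [1 if r[1] == "view" else 0 for r in ordered]
    return list(accumulate(flags))

view_sessions = running_view_count(events)
print(view_sessions)

# Hypothetical extra event: the item changes without a preceding "view".
extended = events + [("avatar", "add to bag", 40, 1)]
extended_sessions = running_view_count(extended)
print(extended_sessions)
```

On the sample data this gives the same grouping as the item-change approach. With the extra event, the last row keeps the previous session number even though the item changed, so this approach only fits if every session is guaranteed to begin with a "view".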

Related

Group query in subquery to get column value as column name

The data I have in my database:
| id| some_id| status|
| 1| 1 | SUCCESS|
| 2| 2 | SUCCESS|
| 3| 1 | SUCCESS|
| 4| 3 | SUCCESS|
| 5| 1 | SUCCESS|
| 6| 4 | FAILED |
| 7| 1 | SUCCESS|
| 8| 1 | FAILED |
| 9| 4 | FAILED |
| 10| 1 | FAILED |
.......
I ran a query grouping by some_id and status to get the result below:
| some_id| count| status|
| 1 | 20| SUCCESS|
| 2 | 5 | SUCCESS|
| 3 | 10| SUCCESS|
| 2 | 15| FAILED |
| 3 | 12| FAILED |
| 4 | 25 | FAILED |
I want to use the above query as a subquery to get the result below, where the distinct status values become the column names.
| some_id| SUCCESS| FAILED|
| 1 | 20 | null/0|
| 2 | 5 | 15 |
| 3 | 10 | 12 |
| 4 | null/0| 25 |
Any other approach to get the final data is also appreciated. Let me know if you need more info.
Thanks

You may use a pivot query here with the help of FILTER:
SELECT
some_id,
COUNT(*) FILTER (WHERE status = 'SUCCESS') AS SUCCESS,
COUNT(*) FILTER (WHERE status = 'FAILED') AS FAILED
FROM yourTable
GROUP BY
some_id;
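If FILTER is not available in your SQL dialect, the same pivot can usually be written with SUM(CASE ...). The following sketch runs that variant against an in-memory SQLite database through Python's sqlite3 module, using only the ten sample rows listed in the question; the column aliases success_count and failed_count are illustrative choices.

```python
import sqlite3

# The ten sample rows shown in the question: (id, some_id, status)
rows = [
    (1, 1, "SUCCESS"), (2, 2, "SUCCESS"), (3, 1, "SUCCESS"),
    (4, 3, "SUCCESS"), (5, 1, "SUCCESS"), (6, 4, "FAILED"),
    (7, 1, "SUCCESS"), (8, 1, "FAILED"), (9, 4, "FAILED"),
    (10, 1, "FAILED"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, some_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?)", rows)

# Conditional aggregation: one SUM(CASE ...) per status value to pivot.
pivot = conn.execute("""
    SELECT some_id,
           SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS success_count,
           SUM(CASE WHEN status = 'FAILED'  THEN 1 ELSE 0 END) AS failed_count
    FROM yourTable
    GROUP BY some_id
    ORDER BY some_id
""").fetchall()
conn.close()

for row in pivot:
    print(row)
```

Missing combinations come out as 0 rather than null here, which matches the "null/0" cells in the desired result.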

How to group by with a condition in PySpark

How to group by with a condition in PySpark?
This is example data:
+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x | Adult | 123 |
| 124 | x | Children | 231 |
| 123 | x | Children | 456 |
| 156 | x | Adult | 453 |
| 124 | y | Adult | 34 |
| 432 | y | Adult | 23 |
| 234 | y | Children | 13 |
| 432 | z | Children | 22 |
| 234 | z | Adult | 44 |
+-----+-------+-------------+------------+
I want to see the data as:
+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x | 1 | 1 | 2 |
| 124 | x | 1 | 1 | 2 |
| 156 | x | 1 | 0 | 1 |
| 432 | y | 1 | 1 | 2 |
| 234 | z | 1 | 1 | 2 |
+-----+-------+-------+----------+------------+
How can I do this?

Here is the Spark SQL version.
df.createOrReplaceTempView('table')
spark.sql('''
select zip, state,
count(if(Agegrouping = 'Adult', 1, null)) as adult,
count(if(Agegrouping = 'Children', 1, null)) as children,
count(1) as patient_id
from table
group by zip, state;
''').show()
+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123| x| 1| 1| 2|
|156| x| 1| 0| 1|
|234| z| 1| 0| 1|
|432| z| 0| 1| 1|
|234| y| 0| 1| 1|
|124| y| 0| 0| 1|
|124| x| 0| 1| 1|
|432| y| 1| 0| 1|
+---+-----+-----+--------+----------+

You can use conditional aggregation:
select zip, state,
sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
count(*) as num_patients
from t
group by zip, state;
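For readers who want to see what the conditional aggregation counts, here is a small plain-Python sketch over the question's sample rows. It is only an illustration of the grouping logic, not PySpark code; the dictionary layout is an assumption.

```python
from collections import defaultdict

# Sample rows from the question: (zip, state, Agegrouping, patient_id)
patients = [
    (123, "x", "Adult", 123),
    (124, "x", "Children", 231),
    (123, "x", "Children", 456),
    (156, "x", "Adult", 453),
    (124, "y", "Adult", 34),
    (432, "y", "Adult", 23),
    (234, "y", "Children", 13),
    (432, "z", "Children", 22),
    (234, "z", "Adult", 44),
]

# One bucket per (zip, state) group, mirroring "group by zip, state".
counts = defaultdict(lambda: {"adult": 0, "children": 0, "num_patients": 0})
for zip_code, state, age_group, _patient_id in patients:
    bucket = counts[(zip_code, state)]
    bucket["adult"] += age_group == "Adult"        # True adds 1, False adds 0
    bucket["children"] += age_group == "Children"
    bucket["num_patients"] += 1

for key in sorted(counts):
    print(key, counts[key])
```

Because the grouping key is the pair (zip, state), the nine sample rows fall into eight groups; only zip 123 in state x contains two patients.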

Use conditional aggregation:
select
zip,
state,
sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
count(*) as patient_id
from mytable
group by zip, state

How to add missing rows per group in Spark

The input dataset looks like this:
org| id |step| value
1 | 1 | 1 | 12
1 | 1 | 2 | 13
1 | 1 | 3 | 14
1 | 1 | 4 | 15
1 | 2 | 1 | 16
1 | 2 | 2 | 17
2 | 1 | 1 | 1
2 | 1 | 2 | 2
For the output, I want to add the missing steps per org group, for example for id == 2 of org == 1:
org| id |step| value
1 | 1 | 1 | 12
1 | 1 | 2 | 13
1 | 1 | 3 | 14
1 | 1 | 4 | 15
1 | 2 | 1 | 16
1 | 2 | 2 | 17
1 | 2 | 3 | null
1 | 2 | 4 | null
2 | 1 | 1 | 1
2 | 1 | 2 | 2
I tried this, but it doesn't work:
r = df.select("org", "step").distinct()
df.join(r, ["org", "step"], 'right_outer')

val l = df.select("org", "step");
val r = df.select("org", "id");
val right = l.join(r, "org");
val result = df.join(right, Seq("org", "id", "step"), "right_outer").distinct().orderBy("org", "id", "step");
result.show
Gives:
+---+---+----+-----+
|org| id|step|value|
+---+---+----+-----+
| 1| 1| 1| 12|
| 1| 1| 2| 13|
| 1| 1| 3| 14|
| 1| 1| 4| 15|
| 1| 2| 1| 16|
| 1| 2| 2| 17|
| 1| 2| 3| null|
| 1| 2| 4| null|
| 2| 1| 1| 1|
| 2| 1| 2| 2|
+---+---+----+-----+
Bonus: SQL query for a table (orgs) reflecting the df contents:
select distinct o_right."org", o_right."id", o_right."step", o_left."value"
from orgs as o_left
right outer join (
select o_in_left."org", o_in_right."id", o_in_left."step"
from orgs as o_in_right
join (select "org", "step" from orgs) as o_in_left
on o_in_right."org" = o_in_left."org"
order by "org", "id", "step"
) as o_right
on o_left."org" = o_right."org"
and o_left."step" = o_right."step"
and o_left."id" = o_right."id"
order by "org", "id", "step"
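The idea behind the join above is: within each org, build every (id, step) combination, then look up the value where one exists. A minimal plain-Python sketch of that idea follows; fill_missing_steps is a hypothetical helper name and the tuple layout is an assumption.

```python
from itertools import product

# Sample rows from the question: (org, id, step, value)
data = [
    (1, 1, 1, 12), (1, 1, 2, 13), (1, 1, 3, 14), (1, 1, 4, 15),
    (1, 2, 1, 16), (1, 2, 2, 17),
    (2, 1, 1, 1), (2, 1, 2, 2),
]

def fill_missing_steps(rows):
    """Return one row per (org, id, step) combination seen within each org.

    Combinations that are absent from the input get None as their value.
    """
    values = {(org, id_, step): value for org, id_, step, value in rows}
    ids_by_org, steps_by_org = {}, {}
    for org, id_, step, _value in rows:
        ids_by_org.setdefault(org, set()).add(id_)
        steps_by_org.setdefault(org, set()).add(step)

    filled = []
    for org in sorted(ids_by_org):
        # Cross product of ids and steps, restricted to this org.
        for id_, step in product(sorted(ids_by_org[org]), sorted(steps_by_org[org])):
            filled.append((org, id_, step, values.get((org, id_, step))))
    return filled

filled = fill_missing_steps(data)
for row in filled:
    print(row)
```

Keeping the cross product inside each org is what prevents org 2 from receiving steps 3 and 4.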

How to flatten a pyspark dataframes that contains multiple rows per id?

I have a pyspark dataframe with two id columns id and id2. Each id is repeated exactly n times. All id's have the same set of id2's. I'm trying to "flatten" the matrix resulting from each unique id into one row according to id2.
Here's an example to explain what I'm trying to achieve, my dataframe looks like this:
+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1 | 1 | 54 | 2 |
+----+-----+--------+--------+
| 1 | 2 | 0 | 6 |
+----+-----+--------+--------+
| 1 | 3 | 578 | 14 |
+----+-----+--------+--------+
| 2 | 1 | 10 | 1 |
+----+-----+--------+--------+
| 2 | 2 | 6 | 32 |
+----+-----+--------+--------+
| 2 | 3 | 0 | 0 |
+----+-----+--------+--------+
| 3 | 1 | 12 | 2 |
+----+-----+--------+--------+
| 3 | 2 | 20 | 5 |
+----+-----+--------+--------+
| 3 | 3 | 63 | 22 |
+----+-----+--------+--------+
The desired output is the following table:
+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1 | 54 | 0 | 578 | 2 | 6 | 14 |
+----+----------+----------+----------+----------+----------+----------+
| 2 | 10 | 6 | 0 | 1 | 32 | 0 |
+----+----------+----------+----------+----------+----------+----------+
| 3 | 12 | 20 | 63 | 2 | 5 | 22 |
+----+----------+----------+----------+----------+----------+----------+
So, basically, for each unique id and for each column col, I will have n new columns col_1,... for each of the n id2 values.
Any help would be appreciated!

In Spark 2.4 you can do it this way:
var df3 =Seq((1,1,54 , 2 ),(1,2,0 , 6 ),(1,3,578, 14),(2,1,10 , 1 ),(2,2,6 , 32),(2,3,0 , 0 ),(3,1,12 , 2 ),(3,2,20 , 5 ),(3,3,63 , 22)).toDF("id","id2","value1","value2")
scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
| 1| 1| 54| 2|
| 1| 2| 0| 6|
| 1| 3| 578| 14|
| 2| 1| 10| 1|
| 2| 2| 6| 32|
| 2| 3| 0| 0|
| 3| 1| 12| 2|
| 3| 2| 20| 5|
| 3| 3| 63| 22|
+---+---+------+------+
Pivot on id2, using first (wrapped in coalesce) to take the first value for each id:
scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))
scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")
Rename the columns:
scala> df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
| 1| 54| 2| 0| 6| 578| 14|
| 2| 10| 1| 6| 32| 0| 0|
| 3| 12| 2| 20| 5| 63| 22|
+---+--------+--------+--------+--------+--------+--------+
Rearrange the columns if needed. Let me know if you have any questions about this.
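To make the naming scheme of the pivoted columns explicit, here is a plain-Python sketch that builds one record per id with keys of the form column_id2. It only illustrates the reshaping; the helper name flatten and the list value_columns are assumptions.

```python
# Sample rows from the question: (id, id2, value1, value2)
rows = [
    (1, 1, 54, 2), (1, 2, 0, 6), (1, 3, 578, 14),
    (2, 1, 10, 1), (2, 2, 6, 32), (2, 3, 0, 0),
    (3, 1, 12, 2), (3, 2, 20, 5), (3, 3, 63, 22),
]
value_columns = ["value1", "value2"]

def flatten(rows, value_columns):
    """Collapse all rows sharing an id into one dict keyed by '<column>_<id2>'."""
    flat = {}
    for id_, id2, *values in rows:
        record = flat.setdefault(id_, {"id": id_})
        for name, value in zip(value_columns, values):
            record[f"{name}_{id2}"] = value
    return [flat[key] for key in sorted(flat)]

flattened = flatten(rows, value_columns)
for record in flattened:
    print(record)
```

Deriving the names from the column name and the id2 value avoids having to maintain a hand-written list of new column names when n changes.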

Scala spark find median in a window partition

I have a dataframe like this:
df =
--------------
|col1 | col2 |
--------------
| A | 1 |
| A | 5 |
| B | 0 |
| A | 2 |
| B | 6 |
| B | 8 |
--------------
I want to partition by col1, find the median of col2 in each partition, and append the result to form a new column. The result should look like this:
result =
---------------------
|col1 | col2 | col3 |
---------------------
| A | 1 | 2 |
| A | 5 | 2 |
| B | 0 | 6 |
| A | 2 | 2 |
| B | 6 | 6 |
| B | 8 | 6 |
---------------------
For now, I'm using this code:
val df2 = df
.withColumn("tmp", percent_rank().over(Window.partitionBy('col1).orderBy('col2)))
.where("tmp <= 0.5")
.groupBy("col1").agg(max("col2") as "col3")
val result = df.join(df2, df("col1") === df2("col1")).drop(df2("col1"))
But this takes too much time and space resources to run when the dataframe is big. Please help me find a way to do the above more efficiently!
Any help is much appreciated!

With the data you have, you can do a Spark DataFrame groupBy statement with percentile_approx to perform the calculation.
// Creating the `df` dataset
val df = Seq(("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)).toDF("col1", "col2")
df.createOrReplaceTempView("df")
Use percentile_approx with groupBy to perform median calculation:
val df2 = spark.sql("select col1, percentile_approx(col2, 0.5) as median from df group by col1 order by col1")
df2.show()
with the output of df2 being:
+----+------+
|col1|median|
+----+------+
| A| 2.0|
| B| 6.0|
+----+------+
And now running the join to recreate the final result:
val result = df.join(df2, df("col1") === df2("col1"))
result.show()
//// output
+----+----+----+------+
|col1|col2|col1|median|
+----+----+----+------+
| A| 1| A| 2.0|
| A| 5| A| 2.0|
| B| 0| B| 6.0|
| A| 2| A| 2.0|
| B| 6| B| 6.0|
| B| 8| B| 6.0|
+----+----+----+------+
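The two steps above (aggregate a median per group, then attach it to every row) can be sketched in plain Python for small data. Note the assumptions: this sketch computes an exact median with statistics.median_low, which returns an actual element of the group, whereas percentile_approx is an approximation on large datasets.

```python
from collections import defaultdict
from statistics import median_low

# Sample rows from the question: (col1, col2)
data = [("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)]

# Step 1: collect col2 values per col1 and take the median of each group.
groups = defaultdict(list)
for key, value in data:
    groups[key].append(value)
medians = {key: median_low(values) for key, values in groups.items()}

# Step 2: attach the group's median to every original row (the "join").
result = [(key, value, medians[key]) for key, value in data]

for row in result:
    print(row)
```

For groups with an even number of rows, median_low picks the lower of the two middle values; whether that matches what you want from Spark is worth checking against your own data.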
