Spark-SQL Window functions on Dataframe - Finding first timestamp in a group


I have below dataframe (say UserData).
uid region timestamp
a 1 1
a 1 2
a 1 3
a 1 4
a 2 5
a 2 6
a 2 7
a 3 8
a 4 9
a 4 10
a 4 11
a 4 12
a 1 13
a 1 14
a 3 15
a 3 16
a 5 17
a 5 18
a 5 19
a 5 20
This data shows a user (uid) travelling across different regions (region) at different times (timestamp). For simplicity, timestamp is shown as an 'int'. Note that the dataframe will not necessarily be in increasing order of timestamp, and rows from other users may be interleaved. I have shown the dataframe for a single user only, in monotonically increasing order of timestamp, for simplicity.
My goal is to find how much time user 'a' spent in each region, and in what order. So my final expected output looks like
uid region regionTimeStart regionTimeEnd
a 1 1 5
a 2 5 8
a 3 8 9
a 4 9 13
a 1 13 15
a 3 15 17
a 5 17 20
Based on my findings, Spark SQL Window functions can be used for this purpose.
I have tried below things,
val w = Window
.partitionBy("region")
.partitionBy("uid")
.orderBy("timestamp")
val resultDF = UserData.select(
UserData("uid"), UserData("timestamp"),
UserData("region"), rank().over(w).as("Rank"))
But from here onwards, I am not sure how to get the regionTimeStart and regionTimeEnd columns. The regionTimeEnd column is simply the 'lead' of regionTimeStart, except for the last entry in the group.
I see that aggregate operations have 'first' and 'last' functions, but for that I would need to group the data by ('uid','region'), which spoils the monotonically increasing order of the path traversed: at times 13 and 14 the user has come back to region '1', and I want that visit to be retained instead of being merged with the initial visit to region '1' at time 1.
It would be very helpful if anyone could guide me. I am new to Spark and I have a better understanding of the Scala Spark APIs than of the Python/Java Spark APIs.

Window functions are indeed useful, although your approach can work only if you assume that a user visits a given region only once. Also, the window definition you use is incorrect: multiple calls to partitionBy simply return new objects with different window definitions, so only the last call takes effect. If you want to partition by multiple columns, you should pass them in a single call (.partitionBy("region", "uid")).
Let's start by marking continuous visits in each region:
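import org.apache.spark.sql.functions.{lag, sum, not}
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for the $"..." column syntax; already in scope in spark-shell
// df is the UserData dataframe from the question
val w = Window.partitionBy($"uid").orderBy($"timestamp")
// 1 when the region differs from the previous row's region (or there is no previous row), else 0
val change = (not(lag($"region", 1).over(w) <=> $"region")).cast("int")
// running sum of the changes: every continuous visit gets its own id
val ind = sum(change).over(w)
val dfWithInd = df.withColumn("ind", ind)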
Next we simply aggregate over the groups and find the leads:
import org.apache.spark.sql.functions.{lead, coalesce, min, max}
// the end of a visit is the start of the next one; the last visit falls back to its own max timestamp
val regionTimeEnd = coalesce(lead($"timestamp", 1).over(w), $"max_")
val result = dfWithInd
  .groupBy($"uid", $"region", $"ind")
  .agg(min($"timestamp").alias("timestamp"), max($"timestamp").alias("max_"))
  .drop("ind")
  .withColumn("regionTimeEnd", regionTimeEnd)
  .withColumnRenamed("timestamp", "regionTimeStart")
  .drop("max_")
result.show
// +---+------+---------------+-------------+
// |uid|region|regionTimeStart|regionTimeEnd|
// +---+------+---------------+-------------+
// | a| 1| 1| 5|
// | a| 2| 5| 8|
// | a| 3| 8| 9|
// | a| 4| 9| 13|
// | a| 1| 13| 15|
// | a| 3| 15| 17|
// | a| 5| 17| 20|
// +---+------+---------------+-------------+
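To make the two steps easier to follow, here is a minimal plain-Python sketch of the same logic without Spark. It is only an illustration: the function name region_visits and the tuple layout (uid, region, timestamp) are assumptions made for this example, not part of the answer.

```python
from itertools import groupby

# Sample rows from the question as (uid, region, timestamp).
REGIONS = [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 1, 1, 3, 3, 5, 5, 5, 5]
rows = [("a", region, ts) for ts, region in enumerate(REGIONS, start=1)]


def region_visits(rows):
    """Return (uid, region, start, end) for every continuous visit, in time order."""
    result = []
    # Equivalent of partitionBy(uid).orderBy(timestamp).
    by_user = sorted(rows, key=lambda r: (r[0], r[2]))
    for uid, user_rows in groupby(by_user, key=lambda r: r[0]):
        # Step 1: collapse consecutive rows with the same region into one visit
        # (the role played by the running "ind" counter).
        visits = []
        for region, run in groupby(user_rows, key=lambda r: r[1]):
            times = [r[2] for r in run]
            visits.append((region, min(times), max(times)))
        # Step 2: a visit ends where the next one starts (lead); the last
        # visit falls back to its own maximum timestamp (coalesce).
        for i, (region, start, last_seen) in enumerate(visits):
            end = visits[i + 1][1] if i + 1 < len(visits) else last_seen
            result.append((uid, region, start, end))
    return result


for visit in region_visits(rows):
    print(visit)
```

Because the rows are sorted per user first, the input order does not matter, and the second visit to region 1 stays separate from the first.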

Related

how to create new column 'count' in Spark DataFrame under some condition

I have a DataFrame about connection log with columns Id, targetIP, Time. Every record in this DataFrame is a connection event to one system. Id means this connection, targetIP means the target IP address this time, Time is the connection time. With Values:
ID Time targetIP
1 1 192.163.0.1
2 2 192.163.0.2
3 3 192.163.0.1
4 5 192.163.0.1
5 6 192.163.0.2
6 7 192.163.0.2
7 8 192.163.0.2
I want to create a new column under some condition: count of connections to this time's target IP address in the past 2 time units. So the result DataFrame should be:
ID Time targetIP count
1 1 192.163.0.1 0
2 2 192.163.0.2 0
3 3 192.163.0.1 1
4 5 192.163.0.1 1
5 6 192.163.0.2 0
6 7 192.163.0.2 1
7 8 192.163.0.2 2
For example, for ID=7 the targetIP is 192.163.0.2. The connections to the system in the past 2 time units are ID=5 and ID=6, and their targetIP is also 192.163.0.2. So the count for ID=7 is 2.
Looking forward to your help.

So, what you basically need is a window function.
Let's start with your initial data
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._  // collect_list, size, expr, count, lit
import spark.implicits._
case class Event(ID: Int, Time: Int, targetIP: String)
val events = Seq(
Event(1, 1, "192.163.0.1"),
Event(2, 2, "192.163.0.2"),
Event(3, 3, "192.163.0.1"),
Event(4, 5, "192.163.0.1"),
Event(5, 6, "192.163.0.2"),
Event(6, 7, "192.163.0.2"),
Event(7, 8, "192.163.0.2")
).toDS()
Now we need to define a window function itself
val timeWindow = Window.orderBy($"Time").rowsBetween(-2, -1)
And now the most interesting part: how do we count something over the window? There is no simple way, so we'll do the following:
Aggregate all the targetIPs in the window into a list
Filter the list to keep only the IPs equal to the current row's targetIP
Count the size of the filtered list
// note: rowsBetween(-2, -1) looks at the two preceding rows, not at the two preceding time units
val df = events
  .withColumn("tmp", collect_list($"targetIP").over(timeWindow))
  .withColumn("count", size(expr("filter(tmp, x -> x == targetIP)")))
  .drop($"tmp")
And the result will contain a new column "count" which we need!
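As a rough illustration of the collect, filter and size steps, here is a plain-Python sketch without Spark. The function name count_in_previous_rows and the tuple layout (ID, Time, targetIP) are assumptions for this example only.

```python
# (ID, Time, targetIP), as in the question
events = [
    (1, 1, "192.163.0.1"),
    (2, 2, "192.163.0.2"),
    (3, 3, "192.163.0.1"),
    (4, 5, "192.163.0.1"),
    (5, 6, "192.163.0.2"),
    (6, 7, "192.163.0.2"),
    (7, 8, "192.163.0.2"),
]


def count_in_previous_rows(events, n_rows=2):
    """For each event, count matching IPs among the n_rows preceding rows."""
    ordered = sorted(events, key=lambda e: e[1])  # orderBy(Time)
    counts = {}
    for i, (event_id, _time, ip) in enumerate(ordered):
        # collect_list over the frame of the n_rows preceding rows
        window = [e[2] for e in ordered[max(0, i - n_rows):i]]
        # filter: keep only the IPs equal to the current row's IP
        matching = [x for x in window if x == ip]
        # size
        counts[event_id] = len(matching)
    return counts


print(count_in_previous_rows(events))
```

Note that this frame is row-based: it always looks at the two preceding rows, however far back in time they are. It happens to agree with the expected counts on the sample data, but it would not if two consecutive rows were more than 2 time units apart.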
UPD:
There is a much shorter version without aggregation, written by @blackbishop:
val timeWindow = Window.partitionBy($"targetIP").orderBy($"Time").rangeBetween(-2, Window.currentRow)
// subtract 1 because the frame includes the current row itself
val df = events
  .withColumn("count", count("*").over(timeWindow) - lit(1))
df.explain(true)

You can use count over Window bounded with range between - 2 and current row, to get the count of IP in the last 2 time units.
Using Spark SQL you can do something like this:
df.createOrReplaceTempView("connection_logs")
df1 = spark.sql("""
SELECT *,
COUNT(*) OVER(PARTITION BY targetIP ORDER BY Time
RANGE BETWEEN 2 PRECEDING AND CURRENT ROW
) -1 AS count
FROM connection_logs
ORDER BY ID
""")
df1.show()
#+---+----+-----------+-----+
#| ID|Time| targetIP|count|
#+---+----+-----------+-----+
#| 1| 1|192.163.0.1| 0|
#| 2| 2|192.163.0.2| 0|
#| 3| 3|192.163.0.1| 1|
#| 4| 5|192.163.0.1| 1|
#| 5| 6|192.163.0.2| 0|
#| 6| 7|192.163.0.2| 1|
#| 7| 8|192.163.0.2| 2|
#+---+----+-----------+-----+
Or using DataFrame API:
from pyspark.sql import Window
from pyspark.sql import functions as F
time_unit = lambda x: x
w = Window.partitionBy("targetIP").orderBy(F.col("Time").cast("int")).rangeBetween(-time_unit(2), 0)
df1 = df.withColumn("count", F.count("*").over(w) - 1).orderBy("ID")
df1.show()
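To show what the range frame computes, here is a small plain-Python sketch without Spark. It is an assumption-laden illustration: the function name count_in_time_range and the tuple layout (ID, Time, targetIP) are made up for this example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# (ID, Time, targetIP), as in the question
events = [
    (1, 1, "192.163.0.1"),
    (2, 2, "192.163.0.2"),
    (3, 3, "192.163.0.1"),
    (4, 5, "192.163.0.1"),
    (5, 6, "192.163.0.2"),
    (6, 7, "192.163.0.2"),
    (7, 8, "192.163.0.2"),
]


def count_in_time_range(events, span=2):
    """Count earlier connections to the same IP within `span` time units."""
    # PARTITION BY targetIP ORDER BY Time
    times_by_ip = defaultdict(list)
    for _event_id, time, ip in events:
        times_by_ip[ip].append(time)
    for times in times_by_ip.values():
        times.sort()

    counts = {}
    for event_id, time, ip in events:
        times = times_by_ip[ip]
        # RANGE BETWEEN span PRECEDING AND CURRENT ROW: Time in [time - span, time]
        lo = bisect_left(times, time - span)
        hi = bisect_right(times, time)
        # minus 1 to exclude the current row itself
        counts[event_id] = (hi - lo) - 1
    return counts


print(count_in_time_range(events))
```

Unlike a row-based frame, the bounds here are values of Time, so a long gap between two connections to the same IP yields a count of 0.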

Can Spark SQL refer to the first row of the previous window / group?

I have a kind of event stream that looks like this:
Time UserId SessionId EventType EventData
1 2 A Load /a ...
2 1 B Impressn X ...
3 2 A Impressn Y ...
4 1 B Load /b ...
5 2 A Load /info ...
6 1 B Load /about ...
7 2 A Impressn Z ...
In practice users can have many sessions over larger time windows, and there is also a click event type, but I am keeping this simple here. I'm trying to see which (page view) loads lead to the next load, and also which impressions happened, in aggregate.
So, without SQL, I've loaded this, grouped by user, sequenced by time, and for each session marked each row with the previous load info (if any), with a
val outDS = logDataset.groupByKey(_.UserId)
.flatMapGroups((_, iter) => gather(iter))
where gather sorts the iter by time (might be redundant as the input is sorted by time), then iterates over the sequence, sets lastLoadData to null at each new session, adds lastLoadData to each row and updates lastLoadData to the data of this row if the row is a Load type. Producing something like:
Time UserId SessionId EventType EventData LastLoadData
1 2 A Load /a ... null
2 1 B Impressn X ... null
3 2 A Impressn Y ... /a ...
4 1 B Load /b ... null
5 2 A Load /info ... /a ...
6 1 B Load /about ... /b ...
7 2 A Impressn Z ... /info ...
Allowing me to then aggregate what (page views) loads lead to what other loads, or on each (page) load what are the top 5 Impressn events.
outDS.createOrReplaceTempView(tempTable)
val journeyPageViews = sparkSession.sql(
s"""SELECT lastLoadData, EventData,
| count(distinct UserId) as users,
| count(distinct SessionId) as sessions
|FROM ${tempTable}
|WHERE EventType='Load'
|GROUP BY lastLoadData, EventData""".stripMargin)
But I get the feeling that adding the lastLoadData column could be done using Spark SQL windows too. However, I'm hung up on two parts of that:
If I make a window over UserId+SessionId ordered by time, how do I have it apply to all events but look at the previous load event? (E.g. an Impressn row would get a new column lastLoadData set to this window's previous Load EventData.)
If I somehow make a new window per session's Load event (also not sure how), the Load event at the start of the window (presumably "first") should get the lastLoadData of the previous window's "first", so that's probably not the right way to do it either.

You can mask the rows that are not Load events with null using case when, and then get LastLoadData using last with its ignoreNulls argument set to true:
logDataset.createOrReplaceTempView("table")
val logDataset2 = spark.sql("""
select
*,
last(case when EventType = 'Load' then EventData end, true)
over (partition by UserId, SessionId
order by Time
rows between unbounded preceding and 1 preceding) LastLoadData
from table
order by time
""")
logDataset2.show
+----+------+---------+---------+----------+------------+
|Time|UserId|SessionId|EventType| EventData|LastLoadData|
+----+------+---------+---------+----------+------------+
| 1| 2| A| Load| /a ...| null|
| 2| 1| B| Impressn| X ...| null|
| 3| 2| A| Impressn| Y ...| /a ...|
| 4| 1| B| Load| /b ...| null|
| 5| 2| A| Load| /info ...| /a ...|
| 6| 1| B| Load|/about ...| /b ...|
| 7| 2| A| Impressn| Z ...| /info ...|
+----+------+---------+---------+----------+------------+
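The effect of last(..., true) over a frame that ends one row before the current row can be sketched in plain Python as a single pass that remembers the most recent Load per session. This is only an illustration: the function name with_last_load is an assumption, and the EventData values are abbreviated (the trailing "..." of the sample is left out).

```python
# (Time, UserId, SessionId, EventType, EventData)
log = [
    (1, 2, "A", "Load", "/a"),
    (2, 1, "B", "Impressn", "X"),
    (3, 2, "A", "Impressn", "Y"),
    (4, 1, "B", "Load", "/b"),
    (5, 2, "A", "Load", "/info"),
    (6, 1, "B", "Load", "/about"),
    (7, 2, "A", "Impressn", "Z"),
]


def with_last_load(log):
    """Append to each row the EventData of the latest earlier Load in its session."""
    last_load = {}  # (UserId, SessionId) -> EventData of the most recent Load so far
    out = []
    for time, user, session, event_type, data in sorted(log):  # order by Time
        key = (user, session)  # partition by UserId, SessionId
        # The frame stops at "1 preceding", so the current row is not included:
        # read the remembered value first, update it afterwards.
        out.append((time, user, session, event_type, data, last_load.get(key)))
        if event_type == "Load":  # non-Load rows are masked, i.e. ignored
            last_load[key] = data
    return out


for row in with_last_load(log):
    print(row)
```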

Finding largest number of location IDs per hour from each zone

I am using scala with spark and having a hard time understanding how to calculate the maximum count of pickups from a location corresponding to each hour. Currently I have a df with three columns (Location,hour,Zone) where Location is an integer, hour is an integer 0-23 signifying the hour of the day and Zone is a string. Something like this below:
Location hour Zone
97 0 A
49 5 B
97 0 A
10 6 D
25 5 B
97 0 A
97 3 A
What I need to do is find out, for each hour of the day 0-23, which zone has the largest number of pickups from a particular location.
So the answer should look something like this:
hour Zone max_count
0 A 3
1 B 4
2 A 6
3 D 1
. . .
. . .
23 D 8
What I first tried was to use an intermediate step to figure out the counts per zone and hour
val df_temp = df.select("Location","hour","Zone")
.groupBy("hour","Zone").agg(count($"Location").alias("count"))
This gives me a dataframe that looks like this:
hour Zone count
3 A 5
8 B 9
3 B 2
23 F 8
23 A 1
23 C 4
3 D 12
. . .
. . .
I then tried doing the following:
val df_final = df_temp.select("hour","Zone","count")
  .groupBy("hour","Zone").agg(max($"count").alias("max_count")).orderBy($"hour")
This doesn't do anything except group by hour and zone, and I still have 1000s of rows. I also tried:
val df_final = df_temp.select("hour","Zone","count")
  .groupBy("hour").agg(max($"count").alias("max_count")).orderBy($"hour")
The above gives me the max count and 24 rows from 0-23 but there is no Zone column there. So the answer looks like this:
hour max_count
0 12
1 15
. .
. .
23 8
I would like the Zone column included so I know which zone had the max count for each of those hours. I was also looking into the window function to do rank but I wasn't sure how to use it.

After generating the dataframe with per-hour/zone "count", you could generate another dataframe with per-hour "max_count" and join the two dataframes on "hour" and "max_count":
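import org.apache.spark.sql.functions.{count, max}
import spark.implicits._  // for toDF and the $"..." syntax; already in scope in spark-shell
val df = Seq(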
(97, 0, "A"),
(49, 5, "B"),
(97, 0, "A"),
(10, 6, "D"),
(25, 5, "B"),
(97, 0, "A"),
(97, 3, "A"),
(10, 0, "C"),
(20, 5, "C")
).toDF("location", "hour", "zone")
val dfC = df.groupBy($"hour", $"zone").agg(count($"location").as("count"))
val dfM = dfC.groupBy($"hour".as("m_hour")).agg(max($"count").as("max_count"))
dfC.
join(dfM, dfC("hour") === dfM("m_hour") && dfC("count") === dfM("max_count")).
drop("m_hour", "count").
orderBy("hour").
show
// +----+----+---------+
// |hour|zone|max_count|
// +----+----+---------+
// | 0| A| 3|
// | 3| A| 1|
// | 5| B| 2|
// | 6| D| 1|
// +----+----+---------+
Alternatively, you could perform the per-hour/zone groupBy followed by a Window partitioning by "hour" to compute "max_count" for the where condition, as shown below:
import org.apache.spark.sql.expressions.Window
df.
groupBy($"hour", $"zone").agg(count($"location").as("count")).
withColumn("max_count", max($"count").over(Window.partitionBy("hour"))).
where($"count" === $"max_count").
drop("count").
orderBy("hour")
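For readers who want to see the logic without Spark, here is a minimal plain-Python sketch of "count per hour and zone, then keep the rows whose count equals the hourly maximum". The function name busiest_zones is an assumption for this example.

```python
from collections import Counter

# (location, hour, zone), the sample data used in this answer
pickups = [
    (97, 0, "A"), (49, 5, "B"), (97, 0, "A"), (10, 6, "D"), (25, 5, "B"),
    (97, 0, "A"), (97, 3, "A"), (10, 0, "C"), (20, 5, "C"),
]


def busiest_zones(pickups):
    """Return sorted (hour, zone, max_count) rows; ties yield several rows per hour."""
    # groupBy(hour, zone).agg(count(location))
    counts = Counter((hour, zone) for _location, hour, zone in pickups)
    # max(count) over (partition by hour)
    max_per_hour = {}
    for (hour, _zone), n in counts.items():
        max_per_hour[hour] = max(n, max_per_hour.get(hour, 0))
    # where(count === max_count)
    return sorted(
        (hour, zone, n)
        for (hour, zone), n in counts.items()
        if n == max_per_hour[hour]
    )


print(busiest_zones(pickups))
```

As with the join and the window versions, an hour in which two zones share the maximum count produces one row per tied zone.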

You can use Spark window functions for this task.
First, group the data by hour and zone to get the number of pickups for each combination.
val df = read_df.groupBy("hour", "zone").agg(count("*").as("count_order"))
Then create a window to partition the data by hour and order it by total count. Then you have to calculate the rank over this partition of data.
val byZoneName = Window.partitionBy($"hour").orderBy($"count_order".desc)
val rankZone = rank().over(byZoneName)
This will perform the operation and list out the rank of all the zones grouped by hour.
val result_df = df.select($"*", rankZone as "rank")
The output will be something like this:
+----+----+-----------+----+
|hour|zone|count_order|rank|
+----+----+-----------+----+
| 0| A| 3| 1|
| 0| C| 2| 2|
| 0| B| 1| 3|
| 3| A| 1| 1|
| 5| B| 2| 1|
| 6| D| 1| 1|
+----+----+-----------+----+
You can then keep only the rows with rank 1.
result_df.filter($"rank" === 1).orderBy("hour").show()
You can check my code here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5114666914683617/1792645088721850/4927717998130263/latest.html
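One detail worth knowing about rank is how it treats ties: rows with equal counts share a rank, and the following rank is skipped. The plain-Python sketch below illustrates this; the function name rank_within_hour is an assumption for this example, and the input is the per-hour/zone count table shown above.

```python
# (hour, zone, count_order), as in the output above
counts = [
    (0, "A", 3), (0, "C", 2), (0, "B", 1),
    (3, "A", 1), (5, "B", 2), (6, "D", 1),
]


def rank_within_hour(counts):
    """Return (hour, zone, count, rank), ranking by descending count within each hour."""
    ranked = []
    for hour, zone, n in counts:
        # rank = 1 + number of rows in the same hour with a strictly larger count
        larger = sum(1 for h, _z, m in counts if h == hour and m > n)
        ranked.append((hour, zone, n, larger + 1))
    return sorted(ranked, key=lambda r: (r[0], r[3], r[1]))


top = [row for row in rank_within_hour(counts) if row[3] == 1]
print(top)
```

So filtering on rank 1 keeps every zone that is tied for the maximum in an hour, not just one of them.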

aggregate a column based on the another columns' values count in pyspark dataframe

I would like to do some aggregations for a pyspark hive table.
my table:
id value_tier ($)
105 5
117 5
108 10
110 12
105 10
112 10
I need to get the number of ids that only appear in one "value_tier".
value_tier num
5 1 -- for 117
10 2 -- for 108 and 112
12 1 -- for 110
Here, 105 is not counted because it appears in two value_tiers, 5 and 10.
My SQL works but is long and ugly.
I would like a more elegant one.
Thanks

In the DataFrame API, use groupBy and agg with the collect_list function.
df1.show()
#+---+----------+
#| id|value_tier|
#+---+----------+
#|105| 5|
#|117| 5|
#|108| 10|
#|110| 12|
#|105| 10|
#|112| 10|
#+---+----------+
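from pyspark.sql.functions import col, collect_list, collect_set, concat_ws, count, size, split
# keep only the ids whose collected list holds a single value_tier, then count the ids per tier
df1.groupBy("id").\
  agg(concat_ws(',',collect_list(col("value_tier"))).alias("value_tier")).\
  filter(size(split(col("value_tier"),",")) <=1).\
  groupBy("value_tier").\
  agg(count(col("id")).alias("num"),concat_ws(",",collect_list(col("id"))).alias("ids")).\
  show()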
#+----------+---+-------+
#|value_tier|num| ids|
#+----------+---+-------+
#| 5| 1| 117|
#| 10| 2|112,108|
#| 12| 1| 110|
#+----------+---+-------+
#use collect_set to eliminate duplicates
df1.groupBy("id").\
  agg(concat_ws(',',collect_set(col("value_tier"))).alias("value_tier")).\
  filter(size(split(col("value_tier"),",")) <=1).\
  groupBy("value_tier").\
  agg(count(col("id")).alias("num"),concat_ws(",",collect_list(col("id"))).alias("ids")).\
  show()

In SQL, you can use not exists and aggregation:
select value_tier, count(*) cnt
from mytable t
where not exists(
select 1
from mytable t1
where t1.id = t.id and t1.value_tier <> t.value_tier
)
group by value_tier
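The intent of the query (count, per value_tier, the ids that never appear in any other value_tier) can be checked with a small plain-Python sketch. The function name single_tier_counts is an assumption for this example, and unlike count(*) it counts each id once even if the same (id, value_tier) row is repeated.

```python
from collections import Counter, defaultdict

# (id, value_tier), the sample table from the question
rows = [(105, 5), (117, 5), (108, 10), (110, 12), (105, 10), (112, 10)]


def single_tier_counts(rows):
    """Return {value_tier: number of ids that appear in that tier only}."""
    # Collect the distinct tiers seen for every id.
    tiers_by_id = defaultdict(set)
    for id_, tier in rows:
        tiers_by_id[id_].add(tier)
    # Keep the ids with exactly one tier and count them per tier.
    counter = Counter(
        next(iter(tiers)) for tiers in tiers_by_id.values() if len(tiers) == 1
    )
    return dict(sorted(counter.items()))


print(single_tier_counts(rows))
```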

SQL/PySpark: Create a new column consisting of a number of rows in the past n days

Currently, I have a table consisting of encounter_id and date field like so:
+---------------------------+--------------------------+
|encounter_id |date |
+---------------------------+--------------------------+
|random_id34234 |2018-09-17 21:53:08.999999|
|this_can_be_anything2432432|2018-09-18 18:37:57.000000|
|423432 |2018-09-11 21:00:36.000000|
+---------------------------+--------------------------+
encounter_id is a random string.
I'm aiming to create a column which consists of the total number of encounters in the past 30 days.
+---------------------------+--------------------------+---------------------------+
|encounter_id |date | encounters_in_past_30_days|
+---------------------------+--------------------------+---------------------------+
|random_id34234 |2018-09-17 21:53:08.999999| 2 |
|this_can_be_anything2432432|2018-09-18 18:37:57.000000| 3 |
|423432 |2018-09-11 21:00:36.000000| 1 |
+---------------------------+--------------------------+---------------------------+
Currently, I'm thinking of somehow using window functions and specifying an aggregate function.
Thanks for the time.

Here is one possible solution, I added some sample data. It indeed uses a window function, as you suggested yourself. Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame(
[
('A','2018-10-01 00:15:00'),
('B','2018-10-11 00:30:00'),
('C','2018-10-21 00:45:00'),
('D','2018-11-10 00:00:00'),
('E','2018-12-20 00:15:00'),
('F','2018-12-30 00:30:00')
],
("encounter_id","date")
)
df = df.withColumn('timestamp',F.col('date').astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*30,0)
df = df.withColumn('encounters_past_30_days',F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+-----------------------+
|encounter_id| date| timestamp|encounters_past_30_days|
+------------+-------------------+----------+-----------------------+
| A|2018-10-01 00:15:00|1538345700| 1|
| B|2018-10-11 00:30:00|1539210600| 2|
| C|2018-10-21 00:45:00|1540075500| 3|
| D|2018-11-10 00:00:00|1541804400| 2|
| E|2018-12-20 00:15:00|1545261300| 1|
| F|2018-12-30 00:30:00|1546126200| 2|
+------------+-------------------+----------+-----------------------+
EDIT: If you want to have days as the granularity, you could first convert your date column to the Date type. Example below, assuming that a window of five days means today and the four days before. If it should be today and the past five days just remove the -1.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
n_days = 5
df = sqlContext.createDataFrame(
[
('A','2018-10-01 23:15:00'),
('B','2018-10-02 00:30:00'),
('C','2018-10-05 05:45:00'),
('D','2018-10-06 00:15:00'),
('E','2018-10-07 00:15:00'),
('F','2018-10-10 21:30:00')
],
("encounter_id","date")
)
df = df.withColumn('timestamp',F.to_date(F.col('date')).astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*(n_days-1),0)
df = df.withColumn('encounters_past_n_days',F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+----------------------+
|encounter_id| date| timestamp|encounters_past_n_days|
+------------+-------------------+----------+----------------------+
| A|2018-10-01 23:15:00|1538344800| 1|
| B|2018-10-02 00:30:00|1538431200| 2|
| C|2018-10-05 05:45:00|1538690400| 3|
| D|2018-10-06 00:15:00|1538776800| 3|
| E|2018-10-07 00:15:00|1538863200| 3|
| F|2018-10-10 21:30:00|1539122400| 3|
+------------+-------------------+----------+----------------------+
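The day-granularity variant can also be sketched in plain Python by comparing calendar dates directly. This is an illustration under assumptions: the function name encounters_past_n_days is made up for this example, and it works on dates rather than epoch seconds, so it does not depend on the session time zone.

```python
from datetime import datetime, timedelta

# (encounter_id, date), the sample data from the EDIT above
encounters = [
    ("A", "2018-10-01 23:15:00"),
    ("B", "2018-10-02 00:30:00"),
    ("C", "2018-10-05 05:45:00"),
    ("D", "2018-10-06 00:15:00"),
    ("E", "2018-10-07 00:15:00"),
    ("F", "2018-10-10 21:30:00"),
]


def encounters_past_n_days(encounters, n_days=5):
    """Count encounters on the same day or in the n_days - 1 days before it."""
    # to_date: truncate every timestamp to its calendar day
    days = [
        (eid, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date())
        for eid, ts in encounters
    ]
    result = {}
    for eid, day in days:
        start = day - timedelta(days=n_days - 1)
        # range frame: all rows whose day lies in [start, day]
        result[eid] = sum(1 for _eid, other in days if start <= other <= day)
    return result


print(encounters_past_n_days(encounters))
```

With n_days = 5 the window covers today and the four days before it, which matches the counts in the output above.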
