Use a list to replace a pyspark dataframe column

Suppose I have a list new_id_acc = [6,8,1,2,4] and a PySpark DataFrame like:
id_acc | name |
10 | ABC |
20 | XYZ |
21 | KBC |
34 | RAH |
19 | SPD |
I want to replace the PySpark column id_acc with the new_id_acc values. How can I achieve this?
I found that lit() can be used, but only for a constant value; I didn't find anything on how to do this for a list.
After the replacement I want my PySpark DataFrame to look like this:
id_acc | name |
6 | ABC |
8 | XYZ |
1 | KBC |
2 | RAH |
4 | SPD |

Probably a long answer, but it works.
df = spark.sparkContext.parallelize([(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'ABC'),(19,'SPD')]).toDF(('id_acc', 'name'))
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| ABC|
| 19| SPD|
+------+----+
new_id_acc = [6,8,1,2,4]
indx = ['ABC','XYZ','KBC','ABC','SPD']
from pyspark.sql.types import *
myschema = StructType([StructField("indx", StringType(), True),
                       StructField("new_id_ac", IntegerType(), True)])
df1 = spark.createDataFrame(zip(indx, new_id_acc), schema=myschema)
df1.show()
+----+---------+
|indx|new_id_ac|
+----+---------+
| ABC| 6|
| XYZ| 8|
| KBC| 1|
| ABC| 2|
| SPD| 4|
+----+---------+
dfnew = (df.join(df1, df.name == df1.indx, how='left')
           .drop(df1.indx)
           .select('new_id_ac', 'name')
           .sort('name')
           .dropDuplicates(['new_id_ac']))
dfnew.show()
+---------+----+
|new_id_ac|name|
+---------+----+
| 1| KBC|
| 6| ABC|
| 4| SPD|
| 8| XYZ|
| 2| ABC|
+---------+----+

The idea is to create a column of consecutive serial/row numbers and then use them to get the corresponding values from the list.
# Creating the requisite DataFrame
from pyspark.sql.functions import row_number,lit, udf
from pyspark.sql.window import Window
valuesCol = [(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'RAH'),(19,'SPD')]
df = spark.createDataFrame(valuesCol,['id_acc','name'])
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| RAH|
| 19| SPD|
+------+----+
You can create row/serial numbers as done here.
Note that 'A' below is just a dummy value, as we don't need to order the values; we just want the row numbers.
w = Window().orderBy(lit('A'))
df = df.withColumn('serial_number', row_number().over(w))
df.show()
+------+----+-------------+
|id_acc|name|serial_number|
+------+----+-------------+
| 10| ABC| 1|
| 20| XYZ| 2|
| 21| KBC| 3|
| 34| RAH| 4|
| 19| SPD| 5|
+------+----+-------------+
As a final step, we will access the elements from the list provided by the OP using the row number. For this we use udf.
new_id_acc = [6,8,1,2,4]
# Note: udf() without an explicit returnType returns strings; pass IntegerType() (or 'int') as the second argument to keep id_acc numeric
mapping = udf(lambda x: new_id_acc[x-1])
df = df.withColumn('id_acc', mapping(df.serial_number)).drop('serial_number')
df.show()
+------+----+
|id_acc|name|
+------+----+
| 6| ABC|
| 8| XYZ|
| 1| KBC|
| 2| RAH|
| 4| SPD|
+------+----+
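If you'd rather avoid a Python udf, the same lookup can be done by turning the list into a literal array column and indexing it with the serial number (a sketch only; it replaces the mapping/udf step above and keeps id_acc as an integer):
from pyspark.sql import functions as F

# build a literal array column from the Python list; array indexing is 0-based,
# so subtract 1 from the 1-based serial number
id_array = F.array([F.lit(v) for v in new_id_acc])
df = df.withColumn('id_acc', id_array[F.col('serial_number') - 1]).drop('serial_number')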

Related

Spark Dataframe - Create 12 rows for each cell of a master table

I have a table containing Employee IDs and I'd like to add an additional column for Month containing 12 values (one for each month). I'd like to create a new table where there are 12 rows for each ID in my list.
Take the following example:
+-----+
|GFCID|
+-----+
| 1|
| 2|
| 3|
+-----+
+---------+
|Yearmonth|
+---------+
| 202101|
| 202102|
| 202203|
| 202204|
| 202205|
+---------+
My desired output is something along the lines of:
ID Month
1 Jan
1 Feb
1 March
2 Jan
2 March
and so on. I am using PySpark and my current syntax is as follows:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.join(df2, df.GFCID == df2.Yearmonth, "outer")
df3.show()
And the output is
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| null| 202101|
| 3| null|
| null| 202205|
| null| 202102|
| null| 202204|
| 1| null|
| null| 202203|
| 2| null|
+-----+---------+
I understand this is wrong because there is no common key for the dataframes to join on. I would appreciate your help with this.
Here is your code modified with the proper join, crossJoin:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.crossJoin(df2)
df3.show()
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| 1| 202101|
| 1| 202102|
| 1| 202203|
| 1| 202204|
| 1| 202205|
| 2| 202101|
| 2| 202102|
| 2| 202203|
| 2| 202204|
| 2| 202205|
| 3| 202101|
| 3| 202102|
| 3| 202203|
| 3| 202204|
| 3| 202205|
+-----+---------+
Another way of doing it, without using a join:
from pyspark.sql import functions as F
df2.withColumn("GFCID", F.explode(F.array([F.lit(i) for i in range(1, 13)]))).show()
+---------+-----+
|Yearmonth|GFCID|
+---------+-----+
| 202101| 1|
| 202101| 2|
| 202101| 3|
| 202101| 4|
| 202101| 5|
| 202101| 6|
| 202101| 7|
| 202101| 8|
| 202101| 9|
| 202101| 10|
| 202101| 11|
| 202101| 12|
| 202102| 1|
| 202102| 2|
| 202102| 3|
| 202102| 4|
...
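If what is actually needed is 12 month-name rows per ID (as in the desired output) rather than Yearmonth rows, the same crossJoin idea works against a small month lookup built from a list. A sketch, with the month names assumed and df being the GFCID DataFrame from above:
months = [(i + 1, m) for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])]
df_months = spark.createDataFrame(months, ["month_num", "month_name"])
df.crossJoin(df_months).orderBy("GFCID", "month_num").show()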

When to use `pandas_udf` of iterator of Series to iterator of Series in pyspark?

From this page: "This pandas UDF is useful when the UDF execution requires initializing some state, for example, loading a machine learning model file to apply inference to every input batch."
So, I've tested it:
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf
pdf = pd.DataFrame(range(50), columns=["x"])
df = spark.createDataFrame(pdf)
y_bc = spark.sparkContext.broadcast(99999)
#
# Method 1: Iterator[pd.Series] -> Iterator[pd.Series]
#
@pandas_udf("long")
def plus_one(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    y = y_bc.value
    print(y)  # Check how many times it prints this out
    for x in batch_iter:
        print(x)
        yield x + 1

#
# Method 2: pd.Series -> pd.Series
#
@pandas_udf("long")
def plus_one2(v: pd.Series) -> pd.Series:
    y = y_bc.value
    print(y)  # Check how many times it prints this out
    return v + 1
df.select(plus_one(col("x"))).show()
df.select(plus_one2(col("x"))).show()
Result
df.select(plus_one(col("x"))).show()
99999
99999
99999
99999
99999
99999
99999
99999
+-----------+
|plus_one(x)|
+-----------+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
| 20|
+-----------+
df.select(plus_one2(col("x"))).show()
99999
99999
99999
99999
99999
99999
99999
99999
+------------+
|plus_one2(x)|
+------------+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
| 20|
+------------+
only showing top 20 rows
Result analysis:
In plus_one(), y is printed as many times as there are batches in batch_iter, which seems to contradict "useful when the UDF execution requires initializing some state". I don't see what benefit it gives in this case.
The Series -> Series pandas_udf and the Iterator[Series] -> Iterator[Series] pandas_udf appear to work the SAME. Is there really no difference between the way they work?
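For what it's worth, the pattern the quoted documentation has in mind looks roughly like the sketch below (Spark 3.0+; load_model is a made-up stand-in for an expensive initialization such as loading a model file). The point is that everything before the for loop runs once per stream of batches handled by a task, whereas in the plain Series -> Series variant the same setup would run once per batch:
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import pandas_udf

def load_model():
    # stand-in for an expensive one-time initialization step
    return 1

@pandas_udf("long")
def predict(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model_offset = load_model()   # runs once for the whole iterator of batches
    for x in batch_iter:          # each batch then reuses the initialized state
        yield x + model_offset
With a tiny 50-row DataFrame the difference is easy to miss: each task probably receives only a single batch, so both variants end up doing their setup (and printing y) once per task anyway.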

How to flatten a pyspark dataframe that contains multiple rows per id?

I have a pyspark dataframe with two id columns id and id2. Each id is repeated exactly n times. All id's have the same set of id2's. I'm trying to "flatten" the matrix resulting from each unique id into one row according to id2.
Here's an example to explain what I'm trying to achieve, my dataframe looks like this:
+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1 | 1 | 54 | 2 |
+----+-----+--------+--------+
| 1 | 2 | 0 | 6 |
+----+-----+--------+--------+
| 1 | 3 | 578 | 14 |
+----+-----+--------+--------+
| 2 | 1 | 10 | 1 |
+----+-----+--------+--------+
| 2 | 2 | 6 | 32 |
+----+-----+--------+--------+
| 2 | 3 | 0 | 0 |
+----+-----+--------+--------+
| 3 | 1 | 12 | 2 |
+----+-----+--------+--------+
| 3 | 2 | 20 | 5 |
+----+-----+--------+--------+
| 3 | 3 | 63 | 22 |
+----+-----+--------+--------+
The desired output is the following table:
+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1 | 54 | 0 | 578 | 2 | 6 | 14 |
+----+----------+----------+----------+----------+----------+----------+
| 2 | 10 | 6 | 0 | 1 | 32 | 0 |
+----+----------+----------+----------+----------+----------+----------+
| 3 | 12 | 20 | 63 | 2 | 5 | 22 |
+----+----------+----------+----------+----------+----------+----------+
So, basically, for each unique id and for each column col, I will have n new columns col_1,... for each of the n id2 values.
Any help would be appreciated!
In Spark 2.4 you can do it this way.
var df3 =Seq((1,1,54 , 2 ),(1,2,0 , 6 ),(1,3,578, 14),(2,1,10 , 1 ),(2,2,6 , 32),(2,3,0 , 0 ),(3,1,12 , 2 ),(3,2,20 , 5 ),(3,3,63 , 22)).toDF("id","id2","value1","value2")
scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
| 1| 1| 54| 2|
| 1| 2| 0| 6|
| 1| 3| 578| 14|
| 2| 1| 10| 1|
| 2| 2| 6| 32|
| 2| 3| 0| 0|
| 3| 1| 12| 2|
| 3| 2| 20| 5|
| 3| 3| 63| 22|
+---+---+------+------+
Using coalesce with first to retrieve the first value for each id:
scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))
scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")
Renaming columns
scala> df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
| 1| 54| 2| 0| 6| 578| 14|
| 2| 10| 1| 6| 32| 0| 0|
| 3| 12| 2| 20| 5| 63| 22|
+---+--------+--------+--------+--------+--------+--------+
Rearrange the columns if needed. Let me know if you have any questions related to this. Happy Hadoop!
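Since the question is about PySpark, here is a rough PySpark equivalent of the same pivot-then-rename idea (a sketch only; it assumes the df from the question and Spark 2.4+):
from pyspark.sql import functions as F

df4 = (df.groupBy("id")
         .pivot("id2", [1, 2, 3])                 # list the pivot values explicitly
         .agg(F.first("value1"), F.first("value2"))
         .orderBy("id"))
# rename the generated columns to the value1_1 / value2_1 style from the question
new_names = ["id", "value1_1", "value2_1", "value1_2", "value2_2", "value1_3", "value2_3"]
df4.toDF(*new_names).show()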

Spark: How to aggregate/reduce records based on time difference?

I have time series data in CSV from a vehicle with the following information:
trip-id
timestamp
speed
The data looks like this:
trip-id | timestamp | speed
001 | 1538204192 | 44.55
001 | 1538204193 | 47.20 <-- start of brake
001 | 1538204194 | 42.14
001 | 1538204195 | 39.20
001 | 1538204196 | 35.30
001 | 1538204197 | 32.22 <-- end of brake
001 | 1538204198 | 34.80
001 | 1538204199 | 37.10
...
001 | 1538204221 | 55.30
001 | 1538204222 | 57.20 <-- start of brake
001 | 1538204223 | 54.60
001 | 1538204224 | 52.15
001 | 1538204225 | 49.27
001 | 1538204226 | 47.89 <-- end of brake
001 | 1538204227 | 50.57
001 | 1538204228 | 53.72
...
A braking event occurs when there's a decrease in speed in 2 consecutive records based on timestamp.
I want to extract the braking events from the data in terms of event start timestamp, end timestamp, start speed & end speed.
+-------------+---------------+-------------+-----------+---------+
| breakID|start timestamp|end timestamp|start speed|end speed|
+-------------+---------------+-------------+-----------+---------+
|0011538204193| 1538204193| 1538204196| 47.2| 35.3|
|0011538204222| 1538204222| 1538204225| 57.2| 49.27|
+-------------+---------------+-------------+-----------+---------+
Here's my take:
Defined a window spec with partition according to trip-id, ordered by timestamp.
Applied window lag to move over consecutive rows and calculate speed difference.
Filtered out records which have a positive speed difference, as I am interested in braking events only.
Now that I only have records belonging to braking events, I want to group records belonging to the same event. I guess I can do this based on the timestamp difference: if the difference between 2 records is 1 second, those 2 records belong to the same braking event.
I am stuck here as I do not have a key identifying the group, so I can't apply key-based aggregation.
My question is:
How can I map to add a key column based on the difference in timestamps? So if 2 records have a difference of 1 second, they should have a common key. That way, I can reduce a group based on the newly added key.
Is there any better and more optimized way to achieve this? My approach could be very inefficient as it relies on row-by-row comparisons. What are the other possible ways to detect these kinds of "sub-events" (e.g. braking events) in a data stream belonging to a specific event (data from a single vehicle trip)?
Thanks in advance!
Appendix:
Example data file for a trip: https://www.dropbox.com/s/44a0ilogxp60w...
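One way to get such a key in PySpark (a sketch only, applied to the rows that remain after filtering to negative speed differences; the answers below solve the problem with a closely related trick keyed on the speed drop instead): flag every row whose gap to the previous timestamp is more than 1 second and take a running sum of the flag, so rows belonging to the same braking burst share a key.
from pyspark.sql import functions as F, Window

# df_braking is a hypothetical name for the filtered braking records described above
w = Window.partitionBy("trip-id").orderBy("timestamp")
gap = F.col("timestamp") - F.lag("timestamp").over(w)
# a gap of more than 1 second starts a new event; the running sum of that flag is the key
df_keyed = df_braking.withColumn("event_key", F.sum(F.when(gap > 1, 1).otherwise(0)).over(w))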
For pandas users, there is a pretty common programming pattern using shift() + cumsum() to set up a group label identifying consecutive rows that match some specific pattern/condition. With PySpark, we can use the Window functions lag() + sum() to do the same and find this group label (d2 in the following code):
Data Setup:
from pyspark.sql import functions as F, Window
>>> df.orderBy('timestamp').show()
+-------+----------+-----+
|trip-id| timestamp|speed|
+-------+----------+-----+
| 001|1538204192|44.55|
| 001|1538204193|47.20|
| 001|1538204194|42.14|
| 001|1538204195|39.20|
| 001|1538204196|35.30|
| 001|1538204197|32.22|
| 001|1538204198|34.80|
| 001|1538204199|37.10|
| 001|1538204221|55.30|
| 001|1538204222|57.20|
| 001|1538204223|54.60|
| 001|1538204224|52.15|
| 001|1538204225|49.27|
| 001|1538204226|47.89|
| 001|1538204227|50.57|
| 001|1538204228|53.72|
+-------+----------+-----+
>>> df.printSchema()
root
|-- trip-id: string (nullable = true)
|-- timestamp: integer (nullable = true)
|-- speed: double (nullable = true)
Set up two Window Spec (w1, w2):
# Window spec used to find previous speed F.lag('speed').over(w1) and also do the cumsum() to find flag `d2`
w1 = Window.partitionBy('trip-id').orderBy('timestamp')
# Window spec used to find the minimal value of flag `d1` over the partition(`trip-id`,`d2`)
w2 = Window.partitionBy('trip-id', 'd2').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Three flags (d1, d2, d3):
d1 : flag to identify if the previous speed is greater than the current speed, if true d1 = 0, else d1 = 1
d2 : flag to mark the consecutive rows for speed-drop with the same unique number
d3 : flag holding the minimal value of d1 over the partition ('trip-id', 'd2'); only when d3 == 0 can the row belong to a speed-drop group. This will be used to filter out unrelated rows
df_1 = df.withColumn('d1', F.when(F.lag('speed').over(w1) > F.col('speed'), 0).otherwise(1))\
.withColumn('d2', F.sum('d1').over(w1)) \
.withColumn('d3', F.min('d1').over(w2))
>>> df_1.orderBy('timestamp').show()
+-------+----------+-----+---+---+---+
|trip-id| timestamp|speed| d1| d2| d3|
+-------+----------+-----+---+---+---+
| 001|1538204192|44.55| 1| 1| 1|
| 001|1538204193|47.20| 1| 2| 0|
| 001|1538204194|42.14| 0| 2| 0|
| 001|1538204195|39.20| 0| 2| 0|
| 001|1538204196|35.30| 0| 2| 0|
| 001|1538204197|32.22| 0| 2| 0|
| 001|1538204198|34.80| 1| 3| 1|
| 001|1538204199|37.10| 1| 4| 1|
| 001|1538204221|55.30| 1| 5| 1|
| 001|1538204222|57.20| 1| 6| 0|
| 001|1538204223|54.60| 0| 6| 0|
| 001|1538204224|52.15| 0| 6| 0|
| 001|1538204225|49.27| 0| 6| 0|
| 001|1538204226|47.89| 0| 6| 0|
| 001|1538204227|50.57| 1| 7| 1|
| 001|1538204228|53.72| 1| 8| 1|
+-------+----------+-----+---+---+---+
Remove rows which are not of concern:
df_1 = df_1.where('d3 == 0')
>>> df_1.orderBy('timestamp').show()
+-------+----------+-----+---+---+---+
|trip-id| timestamp|speed| d1| d2| d3|
+-------+----------+-----+---+---+---+
| 001|1538204193|47.20| 1| 2| 0|
| 001|1538204194|42.14| 0| 2| 0|
| 001|1538204195|39.20| 0| 2| 0|
| 001|1538204196|35.30| 0| 2| 0|
| 001|1538204197|32.22| 0| 2| 0|
| 001|1538204222|57.20| 1| 6| 0|
| 001|1538204223|54.60| 0| 6| 0|
| 001|1538204224|52.15| 0| 6| 0|
| 001|1538204225|49.27| 0| 6| 0|
| 001|1538204226|47.89| 0| 6| 0|
+-------+----------+-----+---+---+---+
Final Step:
Now, for df_1, group by trip-id and d2, find the min and max of F.struct('timestamp', 'speed') (which return the first and last records in each group), and select the corresponding fields from the struct to get the final result:
df_new = df_1.groupby('trip-id', 'd2').agg(
F.min(F.struct('timestamp', 'speed')).alias('start')
, F.max(F.struct('timestamp', 'speed')).alias('end')
).select(
'trip-id'
, F.col('start.timestamp').alias('start timestamp')
, F.col('end.timestamp').alias('end timestamp')
, F.col('start.speed').alias('start speed')
, F.col('end.speed').alias('end speed')
)
>>> df_new.show()
+-------+---------------+-------------+-----------+---------+
|trip-id|start timestamp|end timestamp|start speed|end speed|
+-------+---------------+-------------+-----------+---------+
| 001| 1538204193| 1538204197| 47.20| 32.22|
| 001| 1538204222| 1538204226| 57.20| 47.89|
+-------+---------------+-------------+-----------+---------+
Note: removing the intermediate dataframe df_1, we can write it all in one go:
df_new = df.withColumn('d1', F.when(F.lag('speed').over(w1) > F.col('speed'), 0).otherwise(1))\
.withColumn('d2', F.sum('d1').over(w1)) \
.withColumn('d3', F.min('d1').over(w2)) \
.where('d3 == 0') \
.groupby('trip-id', 'd2').agg(
F.min(F.struct('timestamp', 'speed')).alias('start')
, F.max(F.struct('timestamp', 'speed')).alias('end')
)\
.select(
'trip-id'
, F.col('start.timestamp').alias('start timestamp')
, F.col('end.timestamp').alias('end timestamp')
, F.col('start.speed').alias('start speed')
, F.col('end.speed').alias('end speed')
)
Hope this helps.
Here is another take, in Scala code.
Output
+-------------+---------------+-------------+-----------+---------+
| breakID|start timestamp|end timestamp|start speed|end speed|
+-------------+---------------+-------------+-----------+---------+
|0011538204193| 1538204193| 1538204196| 47.2| 35.3|
|0011538204222| 1538204222| 1538204225| 57.2| 49.27|
+-------------+---------------+-------------+-----------+---------+
CODE
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.WindowSpec
import org.apache.spark.sql.functions._
scala> df.show
+-------+----------+-----+
|trip-id| timestamp|speed|
+-------+----------+-----+
| 001|1538204192|44.55|
| 001|1538204193| 47.2|
| 001|1538204194|42.14|
| 001|1538204195| 39.2|
| 001|1538204196| 35.3|
| 001|1538204197|32.22|
| 001|1538204198| 34.8|
| 001|1538204199| 37.1|
| 001|1538204221| 55.3|
| 001|1538204222| 57.2|
| 001|1538204223| 54.6|
| 001|1538204224|52.15|
| 001|1538204225|49.27|
| 001|1538204226|47.89|
| 001|1538204227|50.57|
| 001|1538204228|53.72|
+-------+----------+-----+
val overColumns = Window.partitionBy("trip-id").orderBy("timestamp")
val breaksDF = df
.withColumn("speeddiff", lead("speed", 1).over(overColumns) - $"speed")
.withColumn("breaking", when($"speeddiff" < 0, 1).otherwise(0))
scala> breaksDF.show
+-------+----------+-----+-------------------+--------+
|trip-id| timestamp|speed| speeddiff|breaking|
+-------+----------+-----+-------------------+--------+
| 001|1538204192|44.55| 2.6500000000000057| 0|
| 001|1538204193| 47.2| -5.060000000000002| 1|
| 001|1538204194|42.14|-2.9399999999999977| 1|
| 001|1538204195| 39.2|-3.9000000000000057| 1|
| 001|1538204196| 35.3|-3.0799999999999983| 1|
| 001|1538204197|32.22| 2.5799999999999983| 0|
| 001|1538204198| 34.8| 2.3000000000000043| 0|
| 001|1538204199| 37.1| 18.199999999999996| 0|
| 001|1538204221| 55.3| 1.9000000000000057| 0|
| 001|1538204222| 57.2|-2.6000000000000014| 1|
| 001|1538204223| 54.6| -2.450000000000003| 1|
| 001|1538204224|52.15|-2.8799999999999955| 1|
| 001|1538204225|49.27|-1.3800000000000026| 1|
| 001|1538204226|47.89| 2.6799999999999997| 0|
| 001|1538204227|50.57| 3.1499999999999986| 0|
| 001|1538204228|53.72| null| 0|
+-------+----------+-----+-------------------+--------+
val outputDF = breaksDF
.withColumn("breakevent",
when(($"breaking" - lag($"breaking", 1).over(overColumns)) === 1, "start of break")
.when(($"breaking" - lead($"breaking", 1).over(overColumns)) === 1, "end of break"))
scala> outputDF.show
+-------+----------+-----+-------------------+--------+--------------+
|trip-id| timestamp|speed| speeddiff|breaking| breakevent|
+-------+----------+-----+-------------------+--------+--------------+
| 001|1538204192|44.55| 2.6500000000000057| 0| null|
| 001|1538204193| 47.2| -5.060000000000002| 1|start of break|
| 001|1538204194|42.14|-2.9399999999999977| 1| null|
| 001|1538204195| 39.2|-3.9000000000000057| 1| null|
| 001|1538204196| 35.3|-3.0799999999999983| 1| end of break|
| 001|1538204197|32.22| 2.5799999999999983| 0| null|
| 001|1538204198| 34.8| 2.3000000000000043| 0| null|
| 001|1538204199| 37.1| 18.199999999999996| 0| null|
| 001|1538204221| 55.3| 1.9000000000000057| 0| null|
| 001|1538204222| 57.2|-2.6000000000000014| 1|start of break|
| 001|1538204223| 54.6| -2.450000000000003| 1| null|
| 001|1538204224|52.15|-2.8799999999999955| 1| null|
| 001|1538204225|49.27|-1.3800000000000026| 1| end of break|
| 001|1538204226|47.89| 2.6799999999999997| 0| null|
| 001|1538204227|50.57| 3.1499999999999986| 0| null|
| 001|1538204228|53.72| null| 0| null|
+-------+----------+-----+-------------------+--------+--------------+
scala> outputDF.filter("breakevent is not null").select("trip-id", "timestamp", "speed", "breakevent").show
+-------+----------+-----+--------------+
|trip-id| timestamp|speed| breakevent|
+-------+----------+-----+--------------+
| 001|1538204193| 47.2|start of break|
| 001|1538204196| 35.3| end of break|
| 001|1538204222| 57.2|start of break|
| 001|1538204225|49.27| end of break|
+-------+----------+-----+--------------+
outputDF.filter("breakevent is not null").withColumn("breakID",
when($"breakevent" === "start of break", concat($"trip-id",$"timestamp"))
.when($"breakevent" === "end of break", concat($"trip-id", lag($"timestamp", 1).over(overColumns))))
.groupBy("breakID").agg(first($"timestamp") as "start timestamp", last($"timestamp") as "end timestamp", first($"speed") as "start speed", last($"speed") as "end speed").show
+-------------+---------------+-------------+-----------+---------+
| breakID|start timestamp|end timestamp|start speed|end speed|
+-------------+---------------+-------------+-----------+---------+
|0011538204193| 1538204193| 1538204196| 47.2| 35.3|
|0011538204222| 1538204222| 1538204225| 57.2| 49.27|
+-------------+---------------+-------------+-----------+---------+

How to filter rows for a specific aggregate with spark sql?

Normally all rows in a group are passed to an aggregate function. I would like to filter rows using a condition so that only some rows within a group are passed to an aggregate function. Such an operation is possible in PostgreSQL. I would like to do the same thing with a Spark SQL DataFrame (Spark 2.0.0).
The code could probably look like this:
val df = ... // some data frame
df.groupBy("A").agg(
max("B").where("B").less(10), // there is no such method as `where` :(
max("C").where("C").less(5)
)
So for a data frame like this:
| A | B | C |
| 1| 14| 4|
| 1| 9| 3|
| 2| 5| 6|
The result would be:
|A|max(B)|max(C)|
|1| 9| 4|
|2| 5| null|
Is it possible with Spark SQL?
Note that in general any other aggregate function than max could be used and there could be multiple aggregates over the same column with arbitrary filtering conditions.
val df = Seq(
(1,14,4),
(1,9,3),
(2,5,6)
).toDF("a","b","c")
val aggregatedDF = df.groupBy("a")
  .agg(
    max(when($"b" < 10, $"b")).as("MaxB"),
    max(when($"c" < 5, $"c")).as("MaxC")
  )
aggregatedDF.show
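For completeness, a PySpark version of the same conditional-aggregation trick (a sketch, assuming the lowercase column names used in the Scala snippet above):
from pyspark.sql import functions as F

aggregated = df.groupBy("a").agg(
    F.max(F.when(F.col("b") < 10, F.col("b"))).alias("MaxB"),  # rows failing the condition become null and are ignored by max
    F.max(F.when(F.col("c") < 5, F.col("c"))).alias("MaxC")
)
aggregated.show()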
>>> df = sc.parallelize([[1,14,1],[1,9,3],[2,5,6]]).map(lambda t: Row(a=int(t[0]),b=int(t[1]),c=int(t[2]))).toDF()
>>> df.registerTempTable('t')
>>> res = sqlContext.sql("select a,max(case when b<10 then b else null end) mb,max(case when c<5 then c else null end) mc from t group by a")
+---+---+----+
| a| mb| mc|
+---+---+----+
| 1| 9| 3|
| 2| 5|null|
+---+---+----+
You can also do it with the DataFrame API (I believe you do the same thing in Postgres?):
df.groupBy("name","age","id").agg(functions.max("age").$less(20),functions.max("id").$less("30")).show();
Sample Data:
name age id
abc 23 1001
cde 24 1002
efg 22 1003
ghi 21 1004
ijk 20 1005
klm 19 1006
mno 18 1007
pqr 18 1008
rst 26 1009
tuv 27 1010
pqr 18 1012
rst 28 1013
tuv 29 1011
abc 24 1015
Output:
+----+---+----+---------------+--------------+
|name|age| id|(max(age) < 20)|(max(id) < 30)|
+----+---+----+---------------+--------------+
| rst| 26|1009| false| true|
| abc| 23|1001| false| true|
| ijk| 20|1005| false| true|
| tuv| 29|1011| false| true|
| efg| 22|1003| false| true|
| mno| 18|1007| true| true|
| tuv| 27|1010| false| true|
| klm| 19|1006| true| true|
| cde| 24|1002| false| true|
| pqr| 18|1008| true| true|
| abc| 24|1015| false| true|
| ghi| 21|1004| false| true|
| rst| 28|1013| false| true|
| pqr| 18|1012| true| true|
+----+---+----+---------------+--------------+