Fill in row missing values with previous and next non missing values - apache-spark-sql

I know you can forward- or backward-fill missing values with the previous or next non-missing value by combining the last function with a window function.
But my data looks like this:
Area,Date,Population
A, 1/1/2000, 10000
A, 2/1/2000,
A, 3/1/2000,
A, 4/1/2000, 10030
A, 5/1/2000,
In this example, for the May population I would like to fill in 10030, which is easy. But for Feb and Mar, I would like to fill in the mean of 10000 and 10030, not 10000 or 10030.
Do you know how to implement this?
Thanks,

Get the previous and next non-null values and compute their mean, as below:
df2.show(false)
df2.printSchema()
/**
* +----+--------+----------+
* |Area|Date |Population|
* +----+--------+----------+
* |A |1/1/2000|10000 |
* |A |2/1/2000|null |
* |A |3/1/2000|null |
* |A |4/1/2000|10030 |
* |A |5/1/2000|null |
* +----+--------+----------+
*
* root
* |-- Area: string (nullable = true)
* |-- Date: string (nullable = true)
* |-- Population: integer (nullable = true)
*/
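import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Date is a string, so order by the parsed date (M/d/yyyy, assumed from the sample) rather than lexically
val byDate = Window.partitionBy("Area").orderBy(to_date($"Date", "M/d/yyyy"))
val w1 = byDate.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val w2 = byDate.rowsBetween(Window.currentRow, Window.unboundedFollowing)
// previous: nearest non-null at or before the row; next: nearest non-null at or after it
df2.withColumn("previous", last("Population", ignoreNulls = true).over(w1))
.withColumn("next", first("Population", ignoreNulls = true).over(w2))
.withColumn("new_Population", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
.drop("next", "previous")
.show(false)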
/**
* +----+--------+----------+--------------+
* |Area|Date |Population|new_Population|
* +----+--------+----------+--------------+
* |A |1/1/2000|10000 |10000.0 |
* |A |2/1/2000|null |10015.0 |
* |A |3/1/2000|null |10015.0 |
* |A |4/1/2000|10030 |10030.0 |
* |A |5/1/2000|null |10030.0 |
* +----+--------+----------+--------------+
*/
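To make the logic of the two windows easier to follow, here is a minimal plain-Python sketch of the same idea (no Spark involved). The function name fill_with_neighbor_mean is made up for illustration. It takes one area's values already sorted by date, finds the nearest non-null value on each side, and averages them, falling back to whichever side exists, just like the coalesce calls above.

```python
from typing import List, Optional


def fill_with_neighbor_mean(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each value with the mean of the nearest non-null values on either side."""
    n = len(values)

    # Forward pass: last non-null value at or before each position (the "previous" column).
    previous: List[Optional[float]] = [None] * n
    seen = None
    for i, v in enumerate(values):
        if v is not None:
            seen = v
        previous[i] = seen

    # Backward pass: first non-null value at or after each position (the "next" column).
    following: List[Optional[float]] = [None] * n
    seen = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            seen = values[i]
        following[i] = seen

    filled: List[Optional[float]] = []
    for p, f in zip(previous, following):
        if p is None and f is None:
            filled.append(None)  # nothing to fill from
        else:
            left = p if p is not None else f   # coalesce(previous, next)
            right = f if f is not None else p  # coalesce(next, previous)
            filled.append((left + right) / 2)
    return filled


population = [10000, None, None, 10030, None]
print(fill_with_neighbor_mean(population))
# [10000.0, 10015.0, 10015.0, 10030.0, 10030.0]
```

Rows that already have a value keep it, because both neighbours resolve to the row itself.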

Here is my attempt.
w1 and w2 compute running counts of the non-null values, which label each run of rows (partition1 and partition2); w3 and w4 then pick up the preceding and following non-null values within those runs. After that, a condition decides how to fill Population.
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w2 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.currentRow, Window.unboundedFollowing)
w3 = Window.partitionBy('Area', 'partition1').orderBy('Date')
w4 = Window.partitionBy('Area', 'partition2').orderBy(f.desc('Date'))
df.withColumn('check', f.col('Population').isNotNull().cast('int')) \
.withColumn('partition1', f.sum('check').over(w1)) \
.withColumn('partition2', f.sum('check').over(w2)) \
.withColumn('first', f.first('Population').over(w3)) \
.withColumn('last', f.first('Population').over(w4)) \
.withColumn('fill', f.when(f.col('first').isNotNull() & f.col('last').isNotNull(), (f.col('first') + f.col('last')) / 2).otherwise(f.coalesce('first', 'last'))) \
.withColumn('Population', f.coalesce('Population', 'fill')) \
.orderBy('Date') \
.select(*df.columns).show(10, False)
+----+--------+----------+
|Area|Date |Population|
+----+--------+----------+
|A |1/1/2000|10000.0 |
|A |2/1/2000|10015.0 |
|A |3/1/2000|10015.0 |
|A |4/1/2000|10030.0 |
|A |5/1/2000|10030.0 |
+----+--------+----------+

Related

Split a dataframe string column by two different delimiters

The following is my dataset:
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
The following is the code I have been trying to use
df.withColumn("item1", split(col("Itemcode"), "/").getItem(0)).withColumn("item2", split(col("Itemcode"), "/").getItem(1)).withColumn("item3", split(col("Itemcode"), "//").getItem(0))
But it fails when there is a double slash between the first and second items, and also when there is a double slash between the second and third items.
Desired output is:
item1 item2 item3
DB9450 DB9450 AD9066
DA0002 DE2396 DF2345
HWC72
GG7183 EB6693
TA444 B9X8X4
You can first replace the // with / and then split. Please try the code below and let us know if it works.
Input
df_b = spark.createDataFrame([('DB9450//DB9450/AD9066',"a"),('DA0002/DE2396//DF2345',"a"),('HWC72',"a"),('GG7183/EB6693',"a"),('TA444/B9X8X4:7-2-',"a")],[ "reg","postime"])
+--------------------+-------+
| reg|postime|
+--------------------+-------+
|DB9450//DB9450/AD...| a|
|DA0002/DE2396//DF...| a|
| HWC72| a|
| GG7183/EB6693| a|
| TA444/B9X8X4:7-2-| a|
+--------------------+-------+
Logic
from pyspark.sql import functions as F
df_b = df_b.withColumn('split_col', F.regexp_replace(F.col('reg'), "//", "/"))
df_b = df_b.withColumn('split_col', F.split(df_b['split_col'], '/'))
df_b = df_b.withColumn('col1' , F.col('split_col').getItem(0))
df_b = df_b.withColumn('col2' , F.col('split_col').getItem(1))
df_b = df_b.withColumn('col2', F.regexp_replace(F.col('col2'), ":7-2-", ""))
df_b = df_b.withColumn('col3' , F.col('split_col').getItem(2))
Output
+--------------------+-------+--------------------+------+------+------+
| reg|postime| split_col| col1| col2| col3|
+--------------------+-------+--------------------+------+------+------+
|DB9450//DB9450/AD...| a|[DB9450, DB9450, ...|DB9450|DB9450|AD9066|
|DA0002/DE2396//DF...| a|[DA0002, DE2396, ...|DA0002|DE2396|DF2345|
| HWC72| a| [HWC72]| HWC72| null| null|
| GG7183/EB6693| a| [GG7183, EB6693]|GG7183|EB6693| null|
| TA444/B9X8X4:7-2-| a|[TA444, B9X8X4:7-2-]| TA444|B9X8X4| null|
+--------------------+-------+--------------------+------+------+------+
Processing the text as csv works well for this.
First, let's read in the text, replacing double slashes along the way
Edit: Also removing everything after a colon
val items = """
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
""".replaceAll("//", "/").split(":")(0)
Get the max number of items in a row to create an appropriate header
val numItems = items.split("\n").map(_.split("/").size).reduce(_ max _)
val header = (1 to numItems).map("Itemcode" + _).mkString("/")
Then we're ready to create a Data Frame
val df = spark.read
.option("ignoreTrailingWhiteSpace", "true")
.option("delimiter", "/")
.option("header", "true")
.csv(spark.sparkContext.parallelize((header + items).split("\n")).toDS)
.filter("Itemcode1 <> 'Itemcode'")
df.show(false)
+---------+-----------+---------+
|Itemcode1|Itemcode2 |Itemcode3|
+---------+-----------+---------+
|DB9450 |DB9450 |AD9066 |
|DA0002 |DE2396 |DF2345 |
|HWC72 |null |null |
|GG7183 |EB6693 |null |
|TA444 |B9X8X4 |null |
+---------+-----------+---------+
Perhaps this is useful (Spark >= 2.4).
The split and TRANSFORM Spark SQL functions will do the job, as below:
Load the provided test data
val data =
"""
|Itemcode
|
|DB9450//DB9450/AD9066
|
|DA0002/DE2396//DF2345
|
|HWC72
|
|GG7183/EB6693
|
|TA444/B9X8X4:7-2-
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---------------------+
* |Itemcode |
* +---------------------+
* |DB9450//DB9450/AD9066|
* |DA0002/DE2396//DF2345|
* |HWC72 |
* |GG7183/EB6693 |
* |TA444/B9X8X4:7-2- |
* +---------------------+
*
* root
* |-- Itemcode: string (nullable = true)
*/
Use split and TRANSFORM (you can run this query directly in pyspark)
df.withColumn("item_code", expr("TRANSFORM(split(Itemcode, '/+'), x -> split(x, ':')[0])"))
.selectExpr("item_code[0] item1", "item_code[1] item2", "item_code[2] item3")
.show(false)
/**
* +------+------+------+
* |item1 |item2 |item3 |
* +------+------+------+
* |DB9450|DB9450|AD9066|
* |DA0002|DE2396|DF2345|
* |HWC72 |null |null |
* |GG7183|EB6693|null |
* |TA444 |B9X8X4|null |
* +------+------+------+
*/
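For anyone who wants to check the splitting rule outside Spark, here is a small plain-Python sketch of what the TRANSFORM expression above does: split on one or more slashes, then keep only the part before a colon in each piece. The function name split_item_codes and the width parameter are my own, not part of any Spark API.

```python
import re
from typing import List, Optional


def split_item_codes(item_code: str, width: int = 3) -> List[Optional[str]]:
    """Split on runs of '/', drop anything after ':' in each piece, pad to `width` items."""
    # '/+' treats '/' and '//' as the same delimiter, like split(Itemcode, '/+').
    parts = re.split(r"/+", item_code)
    # Keep only the text before the first colon, like split(x, ':')[0].
    cleaned: List[Optional[str]] = [part.split(":")[0] for part in parts]
    # Pad with None so every row yields the same number of columns.
    return (cleaned + [None] * width)[:width]


for code in ["DB9450//DB9450/AD9066", "DA0002/DE2396//DF2345", "HWC72",
             "GG7183/EB6693", "TA444/B9X8X4:7-2-"]:
    print(split_item_codes(code))
```

The padding with None mirrors the null you get in Spark when an array index is out of range.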

How to split a column by using length split and MaxSplit in Pyspark dataframe?

For example, suppose I have a column like the one below, obtained by reading a CSV in PySpark and calling show:
+--------+
| Names|
+--------+
|Rahul |
|Ravi |
|Raghu |
|Romeo |
+--------+
If I specify the following in my function:
Length = 2
Maxsplit = 3
then I should get these results:
+----------+-----------+----------+
|Col_1 |Col_2 |Col_3 |
+----------+-----------+----------+
| Ra | hu | l |
| Ra | vi | Null |
| Ra | gh | u |
| Ro | me | o |
+----------+-----------+----------+
Similarly, in PySpark, with
Length = 3
Max split = 2
it should give me output such as
+----------+-----------+
|Col_1 |Col_2 |
+----------+-----------+
| Rah | ul |
| Rav | i |
| Rag | hu |
| Rom | eo |
+----------+-----------+
This is how it should look. Thank you.
Another way to go about this. Should be faster than any looping or udf solution.
from pyspark.sql import functions as F
def split(df, length, maxsplit):
    return df.withColumn('Names', F.split("Names", "(?<=\\G{})".format('.' * length)))\
        .select(*((F.col("Names")[x]).alias("Col_" + str(x + 1)) for x in range(0, maxsplit)))
split(df,3,2).show()
#+-----+-----+
#|Col_1|Col_2|
#+-----+-----+
#| Rah| ul|
#| Rav| i|
#| Rag| hu|
#| Rom| eo|
#+-----+-----+
split(df,2,3).show()
#+-----+-----+-----+
#|Col_1|Col_2|Col_3|
#+-----+-----+-----+
#| Ra| hu| l|
#| Ra| vi| |
#| Ra| gh| u|
#| Ro| me| o|
#+-----+-----+-----+
Try this,
import pyspark.sql.functions as F
tst = sqlContext.createDataFrame([("Raghu",1),("Ravi",2),("Rahul",3)],schema=["Name","val"])
def fn(split, max_n, tst):
    for i in range(max_n):
        tst_loop = tst.withColumn("coln" + str(i), F.substring(F.col("Name"), (i * split) + 1, split))
        tst = tst_loop
    return tst
tst_res = fn(3,2,tst)
The for loop can also be replaced by a list comprehension or reduce, but I felt that in your case a for loop looked neater. They produce the same physical plan anyway.
The results
+-----+---+-----+-----+
| Name|val|coln0|coln1|
+-----+---+-----+-----+
|Raghu| 1| Rag| hu|
| Ravi| 2| Rav| i|
|Rahul| 3| Rah| ul|
+-----+---+-----+-----+
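The offset arithmetic in that loop is easy to check in plain Python. Below is a rough sketch, with a made-up helper name split_fixed_width, that cuts a string into max_split chunks of the given length. One assumption to flag: it returns None for a missing chunk to match the Null in the question's desired output, whereas Spark's substring returns an empty string there.

```python
from typing import List, Optional


def split_fixed_width(name: str, length: int, max_split: int) -> List[Optional[str]]:
    """Cut `name` into `max_split` chunks of `length` characters each."""
    chunks: List[Optional[str]] = []
    for i in range(max_split):
        # Python slices are 0-based, so chunk i starts at i * length
        # (the Spark version uses (i * length) + 1 because substring is 1-based).
        start = i * length
        chunk = name[start:start + length]
        chunks.append(chunk if chunk else None)
    return chunks


for name in ["Rahul", "Ravi", "Raghu", "Romeo"]:
    print(split_fixed_width(name, 2, 3), split_fixed_width(name, 3, 2))
```

Characters beyond length * max_split are simply dropped, which is the same behaviour as selecting only the first max_split columns.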
Try this
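import pyspark.sql.functions as f
def split(data, length, maxSplit):
    start = 1
    for i in range(0, maxSplit):
        # substring positions are 1-based; each new column covers the next `length` characters
        data = data.withColumn(f'col_{start}-{start+length-1}', f.substring('channel', start, length))
        start += length  # move to the start of the next chunk
    return data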
df = split(data,3,2)
df.show()
+--------+----+-------+-------+
| channel|type|col_1-3|col_4-6|
+--------+----+-------+-------+
| web| 0| web| |
| web| 1| web| |
| web| 2| web| |
| twitter| 0| twi| tte|
| twitter| 1| twi| tte|
|facebook| 0| fac| ebo|
|facebook| 1| fac| ebo|
|facebook| 2| fac| ebo|
+--------+----+-------+-------+
Perhaps this is useful-
Load the test data
Note: written in scala
val Length = 2
val Maxsplit = 3
val df = Seq("Rahul", "Ravi", "Raghu", "Romeo").toDF("Names")
df.show(false)
/**
* +-----+
* |Names|
* +-----+
* |Rahul|
* |Ravi |
* |Raghu|
* |Romeo|
* +-----+
*/
Split the string column according to the length and max split
val schema = StructType(Range(1, Maxsplit + 1).map(f => StructField(s"Col_$f", StringType)))
val split = udf((str:String, length: Int, maxSplit: Int) =>{
val splits = str.toCharArray.grouped(length).map(_.mkString).toArray
RowFactory.create(splits ++ Array.fill(maxSplit-splits.length)(null): _*)
}, schema)
val p = df
.withColumn("x", split($"Names", lit(Length), lit(Maxsplit)))
.selectExpr("x.*")
p.show(false)
p.printSchema()
/**
* +-----+-----+-----+
* |Col_1|Col_2|Col_3|
* +-----+-----+-----+
* |Ra |hu |l |
* |Ra |vi |null |
* |Ra |gh |u |
* |Ro |me |o |
* +-----+-----+-----+
*
* root
* |-- Col_1: string (nullable = true)
* |-- Col_2: string (nullable = true)
* |-- Col_3: string (nullable = true)
*/
Dataset[Row] -> Dataset[Array[String]]
val x = df.map(r => {
val splits = r.getString(0).toCharArray.grouped(Length).map(_.mkString).toArray
splits ++ Array.fill(Maxsplit-splits.length)(null)
})
x.show(false)
x.printSchema()
/**
* +-----------+
* |value |
* +-----------+
* |[Ra, hu, l]|
* |[Ra, vi,] |
* |[Ra, gh, u]|
* |[Ro, me, o]|
* +-----------+
*
* root
* |-- value: array (nullable = true)
* | |-- element: string (containsNull = true)
*/

Regex extraction in SQL

I have the following data format from which I'm trying to extract the id part,
{"memberurn"=urn:li:member:10000012}
This is my code,
CAST(regexp_extract(key.memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT) AS member_id
In the output member_id is NULL
What am I doing wrong here?
Try this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import LongType
spark = SparkSession.builder \
.appName('practice')\
.getOrCreate()
sc= spark.sparkContext
df= sc.parallelize([
[""" {"memberurn"=urn:li:member:10000012}"""]]).toDF(["a"])
df.show(truncate=False)
+-------------------------------------+
|a |
+-------------------------------------+
| {"memberurn"=urn:li:member:10000012}|
+-------------------------------------+
df1= df.withColumn("id", F.regexp_extract(F.col('a'),
r'(urn:li:member:)(\d+)', 2))
df2= df1.withColumn("id",df1["id"].cast(LongType()))
df2.show()
+-------------------------------------+--------+
|a |id |
+-------------------------------------+--------+
| {"memberurn"=urn:li:member:10000012}|10000012|
+-------------------------------------+--------+
df2.printSchema()
root
|-- a: string (nullable = true)
|-- id: long (nullable = true)
In scala-
Using regexp_extract
val df = spark.range(1).withColumn("memberurn", lit("urn:li:member:10000012"))
df.withColumn("member_id",
expr("""CAST(regexp_extract(memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT)"""))
.show(false)
/**
* +---+----------------------+---------+
* |id |memberurn |member_id|
* +---+----------------------+---------+
* |0 |urn:li:member:10000012|10000012 |
* +---+----------------------+---------+
*/
Using substring_index
df.withColumn("member_id",
substring_index($"memberurn", ":", -1).cast("bigint"))
.show(false)
/**
* +---+----------------------+---------+
* |id |memberurn |member_id|
* +---+----------------------+---------+
* |0 |urn:li:member:10000012|10000012 |
* +---+----------------------+---------+
*/
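If you want to sanity-check the pattern itself before running it in Spark, the same two approaches can be tried in plain Python. This is only a sketch: the helper names are invented, and Python's re module is not the engine Spark uses, although this particular pattern behaves the same in both.

```python
import re
from typing import Optional


def extract_member_id(text: str) -> Optional[int]:
    """Capture the digits after 'urn:li:member:' (the regexp_extract approach)."""
    match = re.search(r"urn:li:member:(\d+)", text)
    return int(match.group(1)) if match else None


def last_segment_id(urn: str) -> int:
    """Take everything after the last ':' (the substring_index approach)."""
    return int(urn.rsplit(":", 1)[-1])


print(extract_member_id('{"memberurn"=urn:li:member:10000012}'))  # 10000012
print(last_segment_id("urn:li:member:10000012"))                  # 10000012
```

The regex version tolerates surrounding text such as the braces and key name, while the last-segment version assumes the value ends with the numeric id.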

Spark dataframe transverse of columns

I have a ticket transaction system. A sample dataframe looks like the one below. Each day has 2 records showing how many tickets were booked, and their value, through each channel (only 2 channels are possible: passenger and agent).
date,channel,ticket_qty,ticket_amount
20011231,passenger,500,2500
20011231,agent,100,1100
20020101,passenger,450,2000
20020101,agent,120,1500
I want to turn this into a single record per date and remove the channel column, like below:
date,passenger_ticket_qty,passenger_ticket_amount,agent_ticket_qty,agent_ticket_amount
20011231,500,2500,100,1100
20020101,450,2000,120,1500
I have achieved it in the way below.
val pas_df = spark.read.option("header", "true").csv(filepath)
.filter($"channel" === "passenger")
val agent_df = spark.read.option("header", "true").csv(filepath)
.filter($"channel" === "agent")
val df = pas_df.as("pdf").join(agent_df.as("adf"), $"adf.date" === $"pdf.date")
.select($"pdf.date" as "date",
$"pdf.ticket_qty" as "passenger_ticket_qty",
$"pdf.ticket_amount" as "passenger_ticket_amount",
$"adf.ticket_qty" as "agent_ticket_qty",
$"adf.ticket_amount" as "agent_ticket_amount")
This works perfectly, but it takes around 3 hours since the file holds 40 years of records.
Is there a better way to get this done without a join?
Thanks in advance.
Perhaps this is useful-
Load the data provided
val data =
"""
|date,channel,ticket_qty,ticket_amount
|20011231,passenger,500,2500
|20011231,agent,100,1100
|20020101,passenger,450,2000
|20020101,agent,120,1500
""".stripMargin
val stringDS = data.split(System.lineSeparator())
// .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+---------+----------+-------------+
* |date |channel |ticket_qty|ticket_amount|
* +--------+---------+----------+-------------+
* |20011231|passenger|500 |2500 |
* |20011231|agent |100 |1100 |
* |20020101|passenger|450 |2000 |
* |20020101|agent |120 |1500 |
* +--------+---------+----------+-------------+
*
* root
* |-- date: integer (nullable = true)
* |-- channel: string (nullable = true)
* |-- ticket_qty: integer (nullable = true)
* |-- ticket_amount: integer (nullable = true)
*/
Use pivot and first
df.groupBy("date")
.pivot("channel")
.agg(
first("ticket_qty").as("ticket_qty"),
first("ticket_amount").as("ticket_amount")
).show(false)
/**
* +--------+----------------+-------------------+--------------------+-----------------------+
* |date |agent_ticket_qty|agent_ticket_amount|passenger_ticket_qty|passenger_ticket_amount|
* +--------+----------------+-------------------+--------------------+-----------------------+
* |20011231|100 |1100 |500 |2500 |
* |20020101|120 |1500 |450 |2000 |
* +--------+----------------+-------------------+--------------------+-----------------------+
*/
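In case the pivot step looks like magic, here is a plain-Python sketch of what it computes: one pass over the rows, one output record per date, with the channel name folded into the column names. The function name pivot_by_channel is hypothetical, and this only illustrates the reshaping, not how Spark executes it.

```python
from typing import Dict, List, Tuple

rows: List[Tuple[int, str, int, int]] = [
    (20011231, "passenger", 500, 2500),
    (20011231, "agent", 100, 1100),
    (20020101, "passenger", 450, 2000),
    (20020101, "agent", 120, 1500),
]


def pivot_by_channel(rows: List[Tuple[int, str, int, int]]) -> List[Dict[str, int]]:
    """Build one record per date with <channel>_ticket_qty and <channel>_ticket_amount keys."""
    result: Dict[int, Dict[str, int]] = {}
    for date, channel, qty, amount in rows:
        record = result.setdefault(date, {"date": date})
        # setdefault keeps the first value seen per (date, channel), like first(...)
        record.setdefault(f"{channel}_ticket_qty", qty)
        record.setdefault(f"{channel}_ticket_amount", amount)
    return [result[date] for date in sorted(result)]


for record in pivot_by_channel(rows):
    print(record)
```

The point is that the data is read and grouped once, instead of being read twice and joined.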

How do I group records that are within a specific time interval using Spark Scala or sql?

I would like to group records in Scala only if they have the same ID and their times are within 1 minute of each other.
Conceptually I am thinking of something like this, but I am not really sure:
HAVING a.ID = b.ID AND a.time + 30 sec > b.time AND a.time - 30 sec < b.time
| ID | volume | Time |
|:-----------|------------:|:--------------------------:|
| 1 | 10 | 2019-02-17T12:00:34Z |
| 2 | 20 | 2019-02-17T11:10:46Z |
| 3 | 30 | 2019-02-17T13:23:34Z |
| 1 | 40 | 2019-02-17T12:01:02Z |
| 2 | 50 | 2019-02-17T11:10:30Z |
| 1 | 60 | 2019-02-17T12:01:57Z |
to this:
| ID | volume |
|:-----------|------------:|
| 1 | 50 | // (10+40)
| 2 | 70 | // (20+50)
| 3 | 30 |
df.groupBy($"ID", window($"Time", "1 minutes")).sum("volume")
The code above is one solution, but it always rounds the window to whole minutes.
For example 2019-02-17T12:00:45Z will have a range of
2019-02-17T12:00:00Z TO 2019-02-17T12:01:00Z.
I am looking for this instead:
2019-02-17T11:45:00Z TO 2019-02-17T12:01:45Z.
Is there a way?
org.apache.spark.sql.functions provides overloaded window functions as below.
1. window(timeColumn: Column, windowDuration: String) : Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
The windows will look like:
{{{
09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...
}}}
2. window(timeColumn: Column, windowDuration: String, slideDuration: String):
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
slideDuration: parameter specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration.
The windows will look like:
{{{
09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...
}}}
3. window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
The windows will look like:
{{{
09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...
}}}
For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes. This is the overload that suits your requirement.
Please find working code below.
import org.apache.spark.sql.SparkSession
object SparkWindowTest extends App {
val spark = SparkSession
.builder()
.master("local")
.appName("File_Streaming")
.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._
//Prepare Test Data
val df = Seq((1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
(3, 30, "2019-02-17 13:23:34"),(2, 50, "2019-02-17 11:10:30"),
(1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"))
.toDF("ID", "Volume", "TimeString")
df.show()
df.printSchema()
+---+------+-------------------+
| ID|Volume| TimeString|
+---+------+-------------------+
| 1| 10|2019-02-17 12:00:49|
| 2| 20|2019-02-17 11:10:46|
| 3| 30|2019-02-17 13:23:34|
| 2| 50|2019-02-17 11:10:30|
| 1| 40|2019-02-17 12:01:02|
| 1| 60|2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
//Converted String Timestamp into Timestamp
val modifiedDF = df.withColumn("Time", to_timestamp($"TimeString"))
//Dropped String Timestamp from DF
val modifiedDF1 = modifiedDF.drop("TimeString")
modifiedDF.show(false)
modifiedDF.printSchema()
+---+------+-------------------+-------------------+
|ID |Volume|TimeString |Time |
+---+------+-------------------+-------------------+
|1 |10 |2019-02-17 12:00:49|2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|2019-02-17 12:01:57|
+---+------+-------------------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
|-- Time: timestamp (nullable = true)
modifiedDF1.show(false)
modifiedDF1.printSchema()
+---+------+-------------------+
|ID |Volume|Time |
+---+------+-------------------+
|1 |10 |2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- Time: timestamp (nullable = true)
//Main logic
val modifiedDF2 = modifiedDF1.groupBy($"ID", window($"Time", "1 minutes","1 minutes","45 seconds")).sum("Volume")
//Renamed all columns of DF.
val newNames = Seq("ID", "WINDOW", "VOLUME")
val finalDF = modifiedDF2.toDF(newNames: _*)
finalDF.show(false)
+---+---------------------------------------------+------+
|ID |WINDOW |VOLUME|
+---+---------------------------------------------+------+
|2 |[2019-02-17 11:09:45.0,2019-02-17 11:10:45.0]|50 |
|1 |[2019-02-17 12:01:45.0,2019-02-17 12:02:45.0]|60 |
|1 |[2019-02-17 12:00:45.0,2019-02-17 12:01:45.0]|50 |
|3 |[2019-02-17 13:22:45.0,2019-02-17 13:23:45.0]|30 |
|2 |[2019-02-17 11:10:45.0,2019-02-17 11:11:45.0]|20 |
+---+---------------------------------------------+------+
}
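To see where those window boundaries come from, here is a plain-Python sketch of the arithmetic behind a 1-minute tumbling window with a 45-second startTime. The helper names window_start and parse_utc are my own, and I am assuming the timestamps are UTC; the sketch just aligns each timestamp to the latest boundary of the form "whole minute + 45 seconds" and sums the volume per (ID, window).

```python
from collections import defaultdict
from datetime import datetime, timezone


def parse_utc(text: str) -> datetime:
    """Parse 'yyyy-MM-dd HH:mm:ss' as a UTC timestamp (UTC is an assumption here)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def window_start(ts: datetime, window_seconds: int = 60, start_offset_seconds: int = 45) -> datetime:
    """Start of the tumbling window containing ts, with boundaries shifted by the offset."""
    epoch = int(ts.timestamp())
    # Distance from ts back to the most recent shifted boundary.
    start = epoch - ((epoch - start_offset_seconds) % window_seconds)
    return datetime.fromtimestamp(start, tz=timezone.utc)


records = [
    (1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
    (3, 30, "2019-02-17 13:23:34"), (2, 50, "2019-02-17 11:10:30"),
    (1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"),
]

# Sum the volume per (ID, window start), like groupBy($"ID", window(...)).sum("Volume").
totals = defaultdict(int)
for record_id, volume, time_string in records:
    totals[(record_id, window_start(parse_utc(time_string)))] += volume

for (record_id, start), volume in sorted(totals.items()):
    print(record_id, start.strftime("%Y-%m-%d %H:%M:%S"), volume)
```

Note that these are still fixed windows, only shifted: the boundaries depend on startTime, not on the first record of each ID.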