How to split data into groups in pyspark - sql

I need to find groups in time series data.
Data sample
I need to output column group based on value and day.
I've tried using lag, lead and row_number, but it didn't get me anywhere.

It seems like you want to increment the group every time the value changes. If so, this is a kind of gaps-and-islands problem.
Here is one approach that uses lag() and a cumulative sum():
select
    value,
    day,
    sum(case when value = lag_value then 0 else 1 end) over(order by day) grp
from (
    select t.*, lag(value) over(order by day) lag_value
    from mytable t
) t

Here is the PySpark way to do this: flag the points where the value changes using lag, take an incremental sum over that flag to form the groups, and add 1 to get the desired group numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w1=Window().orderBy("day")
df.withColumn("lag", F.when(F.lag("value").over(w1)!=F.col("value"), F.lit(1)).otherwise(F.lit(0)))\
.withColumn("group", F.sum("lag").over(w1) + 1).drop("lag").show()
#+-----+---+-----+
#|value|day|group|
#+-----+---+-----+
#| 1| 1| 1|
#| 1| 2| 1|
#| 1| 3| 1|
#| 1| 4| 1|
#| 1| 5| 1|
#| 2| 6| 2|
#| 2| 7| 2|
#| 1| 8| 3|
#| 1| 9| 3|
#| 1| 10| 3|
#| 1| 11| 3|
#| 1| 12| 3|
#| 1| 13| 3|
#+-----+---+-----+
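Note that a window that is only ordered (no partitionBy) moves all rows into a single partition, which is fine for small data but can become a bottleneck at scale. If the data holds several independent series, the same pattern can be partitioned; a minimal sketch, assuming a hypothetical series_id column:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# hypothetical: one independent time series per series_id
w2 = Window.partitionBy("series_id").orderBy("day")

df.withColumn("lag", F.when(F.lag("value").over(w2) != F.col("value"), F.lit(1)).otherwise(F.lit(0)))\
  .withColumn("group", F.sum("lag").over(w2) + 1).drop("lag").show()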

Related

Split table data based on time gaps

Let's say we have a time-series dataset of entity metadata imported into a Postgres table Stats:
CREATE EXTENSION IF NOT EXISTS POSTGIS;
DROP TABLE IF EXISTS "Stats";
CREATE TABLE IF NOT EXISTS "Stats"
(
"time" BIGINT,
"id" BIGINT,
"position" GEOGRAPHY(PointZ, 4326)
);
And here is a sample of the table:
SELECT
"id",
"time"
FROM
"Stats"
ORDER BY
"id", "time" ASC
id|time|
--+----+
1| 3|
1| 4|
1| 6|
1| 7|
2| 2|
2| 6|
3| 14|
4| 2|
4| 9|
4| 10|
4| 11|
5| 32|
6| 15|
7| 16|
The business requirement is to assign a route_id to the entities in this table: whenever the time for an entity jumps by more than 1 second, it means a new flight or route for that entity. The final result would look like this for the previous sample:
id|time|route_id|
--+----+--------+
1| 3| 1|
1| 4| 1|
1| 6| 2|
1| 7| 2|
2| 2| 1|
2| 6| 2|
3| 14| 1|
4| 2| 1|
4| 9| 2|
4| 10| 2|
4| 11| 2|
5| 32| 1|
6| 15| 1|
7| 16| 1|
And this would be the new summary table of the routes:
id|start_time|end_time|route_id|
--+----------+--------+--------+
1| 3| 4| 1|
1| 6| 7| 2|
2| 2| 2| 1|
2| 6| 6| 2|
3| 14| 14| 1|
4| 2| 2| 1|
4| 9| 11| 2|
5| 32| 32| 1|
6| 15| 15| 1|
7| 16| 16| 1|
So how should this complex query be constructed?
with data as (
    select *, row_number() over (partition by id order by "time") rn
    from Stats
)
select
    id,
    min("time") as start_time,
    max("time") as end_time,
    row_number() over (partition by id order by "time" - rn) as route_id
from data
group by id, "time" - rn
order by id, "time" - rn
https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=c272bc57786487b0b664648139530ae4
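The trick here is that "time" - rn stays constant within a run of consecutive seconds and jumps at every gap: for id 1, the times 3, 4, 6, 7 get row numbers 1, 2, 3, 4, so "time" - rn is 2, 2, 3, 3, which separates the two routes. A rough PySpark sketch of the same idea, assuming a DataFrame named stats with id and time columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy("time")

routes = stats.withColumn("rn", F.row_number().over(w))\
    .withColumn("grp", F.col("time") - F.col("rn"))  # constant within one route

summary = routes.groupBy("id", "grp")\
    .agg(F.min("time").alias("start_time"), F.max("time").alias("end_time"))\
    .withColumn("route_id", F.row_number().over(Window.partitionBy("id").orderBy("grp")))\
    .drop("grp")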
Assuming you have table stats in hand, the following query will create a table by assigning route_id:
Query to assign route_id using recursive-cte:
CREATE TABLE tbl_route AS
WITH RECURSIVE cte AS
(
    SELECT id, prev_time, time, rn, rn AS ref_rn, rn AS route_id
    FROM
    (
        SELECT
            *,
            lag(time) OVER(PARTITION BY id ORDER BY time) AS prev_time,
            row_number() OVER(PARTITION BY id ORDER BY time) AS rn
        FROM stats
    ) AS rnt
    WHERE rn = 1

    UNION

    SELECT rnt2.id, rnt2.prev_time, rnt2.time, rnt2.rn, cte.rn AS ref_rn,
        CASE
            WHEN abs(rnt2.time - rnt2.prev_time) <= 1 THEN cte.route_id
            ELSE cte.route_id + 1
        END AS route_id
    FROM cte
    INNER JOIN
    (
        SELECT
            *,
            lag(time) OVER(PARTITION BY id ORDER BY time) AS prev_time,
            row_number() OVER(PARTITION BY id ORDER BY time) AS rn
        FROM stats
    ) AS rnt2
    ON cte.id = rnt2.id AND cte.rn + 1 = rnt2.rn
)
SELECT id, time, route_id FROM cte;
Query to check if route_id assigned was correct:
select id, time, route_id
from tbl_route
order by id, time
Query to create new summary table:
select id, min(time) as start_time, max(time) as end_time, route_id
from tbl_route
group by id, route_id
order by id, route_id, start_time, end_time
Recursive-CTE Query breakdown:
Since a recursive CTE has been used, the query may look messy. However, it breaks down as follows:
There are 2 main queries appended using UNION: the first assigns route_id to the starting row of each id, and the second does it for the rest of the rows of each id.
rnt and rnt2 have been created because we need the ROW_NUMBER and LAG values to achieve this.
We join cte and rnt2 recursively to assign route_id by checking the difference in time.
DEMO

pyspark: count number of consecutive ones/zeros and change them if the streak is too short / too long

I work with a large pyspark dataframe on a cluster and need to write a function that:
finds streaks of consecutive zeros in a specific column and, if the streak is shorter than 300 rows, changes them all to 1, and
then finds streaks of consecutive ones in that column and, if the streak of ones is shorter than 1800 rows, sets them all to 0.
Every row has a unique timestamp I can sort by.
Is there a way to make that happen?
Yes, you can follow this example, where I searched for streaks of fewer than 3 zeros and converted them to ones:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

column = 'data'
date_column = 'timestamp'
min_consecutive_rows = 3
search_num = 0
set_to = 1
w = Window.orderBy(date_column)

# flag the start of each streak, number the streaks, count their length,
# then overwrite only the streaks of search_num that are too short
df = df.withColumn('binary', F.when(col(column)==search_num, 1).otherwise(0))\
    .withColumn('start_streak', F.when(col('binary') != F.lag('binary', 1).over(w), 1).otherwise(0))\
    .withColumn('streak_id', F.sum('start_streak').over(w))\
    .withColumn("streak_counter", F.row_number().over(Window.partitionBy("streak_id").orderBy(date_column)))\
    .withColumn('max_streak_counter', F.max('streak_counter').over(Window.partitionBy("streak_id")))\
    .withColumn('data_output', F.when((col('binary')==1) & (col('max_streak_counter') < min_consecutive_rows), set_to).otherwise(col(column)))
Suppose your data column is called data and your date column is called timestamp.
The steps performed are the following:
The binary column is used to search only for streaks of the desired search_num number. It allows your data to contain numbers other than just zeros and ones, while still searching only for streaks of zeros in this case.
start_streak tells us which rows are the start of a new streak.
streak_id creates a unique ID for each streak.
streak_counter counts the elements in each streak.
max_streak_counter tells us the maximum counter of elements for each streak_id.
Finally, data_output converts the numbers only if the streak is shorter than the min_consecutive_rows parameter and is composed of the requested search_num number (zeros in this case).
Here is an example with all the intermediate columns:
| timestamp|data|binary|start_streak|streak_id|streak_counter|max_streak_counter|data_output|
+--------------------+----+------+------------+---------+--------------+------------------+-----------+
|2020-11-11 15:52:...| 1| 0| 0| 0| 1| 5| 1|
|2020-11-12 15:52:...| 2| 0| 0| 0| 2| 5| 2|
|2020-11-13 15:52:...| 3| 0| 0| 0| 3| 5| 3|
|2020-11-14 15:52:...| 4| 0| 0| 0| 4| 5| 4|
|2020-11-15 15:52:...| 1| 0| 0| 0| 5| 5| 1|
|2020-11-16 15:52:...| 0| 1| 1| 1| 1| 2| 1|
|2020-11-17 15:52:...| 0| 1| 0| 1| 2| 2| 1|
|2020-11-18 15:52:...| 1| 0| 1| 2| 1| 1| 1|
|2020-11-19 15:52:...| 0| 1| 1| 3| 1| 4| 0|
|2020-11-20 15:52:...| 0| 1| 0| 3| 2| 4| 0|
|2020-11-21 15:52:...| 0| 1| 0| 3| 3| 4| 0|
|2020-11-22 15:52:...| 0| 1| 0| 3| 4| 4| 0|
+--------------------+----+------+------------+---------+--------------+------------------+-----------+
For the second bullet point, just change column to 'data_output', min_consecutive_rows to 1800, search_num to 1 and set_to to 0, then repeat the code above, as sketched below.
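Concretely, the second pass could look like this sketch, reusing the names from the snippet above:
# second pass: flip streaks of ones shorter than 1800 rows back to zero
column = 'data_output'
date_column = 'timestamp'
min_consecutive_rows = 1800
search_num = 1
set_to = 0
# ...then repeat the same withColumn chain as above with these parameters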
For more details about the streak calculation, please visit this post, which applies similar logic in pandas.

How to join two dataframes together

I have two dataframes.
One is coming from groupBy and the other is the total summary:
a = data.groupBy("bucket").agg(sum(data.total))
b = data.agg(sum(data.total))
I want to put the total from b onto dataframe a so that I can calculate the % for each bucket.
Do you know what kind of join I should use?
Use .crossJoin: you will get the total from b added to every row of df a, and then you can calculate the percentage.
Example:
a.crossJoin(b).show()
#+------+----------+----------+
#|bucket|sum(total)|sum(total)|
#+------+----------+----------+
#| c| 4| 10|
#| b| 3| 10|
#| a| 3| 10|
#+------+----------+----------+
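Since both aggregates end up named sum(total), it helps to alias them before the crossJoin when deriving the percentage; a minimal sketch, assuming the source dataframe is called data with columns bucket and total:
from pyspark.sql import functions as F

a = data.groupBy("bucket").agg(F.sum("total").alias("bucket_total"))
b = data.agg(F.sum("total").alias("grand_total"))

a.crossJoin(b)\
 .withColumn("pct", F.col("bucket_total") * 100 / F.col("grand_total"))\
 .show()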
Instead of crossJoin, you can try using window functions as shown below.
df.show()
#+-----+------+
#|total|bucket|
#+-----+------+
#| 1| a|
#| 2| a|
#| 3| b|
#| 4| c|
#+-----+------+
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.window import *
import sys
w=Window.partitionBy(col("bucket"))
w1=Window.orderBy(lit("1")).rowsBetween(-sys.maxsize,sys.maxsize)
df.withColumn("sum_b",sum(col("total")).over(w)).withColumn("sum_c",sum(col("total")).over(w1)).show()
#+-----+------+-----+-----+
#|total|bucket|sum_b|sum_c|
#+-----+------+-----+-----+
#| 4| c| 4| 10|
#| 3| b| 3| 10|
#| 1| a| 3| 10|
#| 2| a| 3| 10|
#+-----+------+-----+-----+
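From there, the per-bucket percentage is just one more derived column; a minimal follow-up reusing w and w1 from above:
df.withColumn("sum_b", sum(col("total")).over(w))\
  .withColumn("sum_c", sum(col("total")).over(w1))\
  .withColumn("pct", col("sum_b") * 100 / col("sum_c")).show()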
You can also use collect(), since you are only returning a single simple value to the driver:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'A' as bucket, 5 as value union all select 'B' as bucket, 8 as value")
df_total = spark.sql("select 9 as total")
df=df.withColumn('total',lit(df_total.collect()[0]['total']))
+------+-----+-----+
|bucket|value|total|
+------+-----+-----+
| A| 5| 9|
| B| 8| 9|
+------+-----+-----+
df= df.withColumn('pourcentage', col('total') / col('value'))
+------+-----+-----+-----------+
|bucket|value|total|pourcentage|
+------+-----+-----+-----------+
| A| 5| 9| 1.8|
| B| 8| 9| 1.125|
+------+-----+-----+-----------+
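If you want each bucket's share of the total instead (the % per bucket the question asks about), divide the other way round; a small follow-up on the same dataframe:
df = df.withColumn('share_pct', col('value') * 100 / col('total'))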

How to get last row value when flag is 0 and get the current row value to new column when flag 1 in pyspark dataframe

Scenario 1, when Flag is 1:
For the row where Flag is 1, copy trx_date to destination.
Scenario 2, when Flag is 0:
For the row where Flag is 0, copy the previous destination value.
Input :
+-----------+----+----------+
|customer_id|Flag| trx_date|
+-----------+----+----------+
| 1| 1| 12/3/2020|
| 1| 0| 12/4/2020|
| 1| 1| 12/5/2020|
| 1| 1| 12/6/2020|
| 1| 0| 12/7/2020|
| 1| 1| 12/8/2020|
| 1| 0| 12/9/2020|
| 1| 0|12/10/2020|
| 1| 0|12/11/2020|
| 1| 1|12/12/2020|
| 2| 1| 12/1/2020|
| 2| 0| 12/2/2020|
| 2| 0| 12/3/2020|
| 2| 1| 12/4/2020|
+-----------+----+----------+
Output :
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Code to generate the Spark DataFrame:
df = spark.createDataFrame([(1,1,'12/3/2020'),(1,0,'12/4/2020'),(1,1,'12/5/2020'),
(1,1,'12/6/2020'),(1,0,'12/7/2020'),(1,1,'12/8/2020'),(1,0,'12/9/2020'),(1,0,'12/10/2020'),
(1,0,'12/11/2020'),(1,1,'12/12/2020'),(2,1,'12/1/2020'),(2,0,'12/2/2020'),(2,0,'12/3/2020'),
(2,1,'12/4/2020')], ["customer_id","Flag","trx_date"])
Here is the PySpark way to do this. After converting trx_date to a date type, first take an incremental sum of Flag to create the groupings we need, so that we can use the first() function over a window partitioned by those groupings. We can use date_format to get both columns back into the desired date format. I assumed your format was MM/dd/yyyy; if it is different, change it to dd/MM/yyyy in the code.
df.show() #sample data
#+-----------+----+----------+
#|customer_id|Flag| trx_date|
#+-----------+----+----------+
#| 1| 1| 12/3/2020|
#| 1| 0| 12/4/2020|
#| 1| 1| 12/5/2020|
#| 1| 1| 12/6/2020|
#| 1| 0| 12/7/2020|
#| 1| 1| 12/8/2020|
#| 1| 0| 12/9/2020|
#| 1| 0|12/10/2020|
#| 1| 0|12/11/2020|
#| 1| 1|12/12/2020|
#| 2| 1| 12/1/2020|
#| 2| 0| 12/2/2020|
#| 2| 0| 12/3/2020|
#| 2| 1| 12/4/2020|
#+-----------+----+----------+
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().orderBy("customer_id","trx_date")
w1=Window().partitionBy("Flag2").orderBy("trx_date").rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df.withColumn("trx_date", F.to_date("trx_date", "MM/dd/yyyy"))\
.withColumn("Flag2", F.sum("Flag").over(w))\
.withColumn("destination", F.when(F.col("Flag")==0, F.first("trx_date").over(w1)).otherwise(F.col("trx_date")))\
.withColumn("trx_date", F.date_format("trx_date","MM/dd/yyyy"))\
.withColumn("destination", F.date_format("destination", "MM/dd/yyyy"))\
.orderBy("customer_id","trx_date").drop("Flag2").show()
#+-----------+----+----------+-----------+
#|customer_id|Flag| trx_date|destination|
#+-----------+----+----------+-----------+
#| 1| 1|12/03/2020| 12/03/2020|
#| 1| 0|12/04/2020| 12/03/2020|
#| 1| 1|12/05/2020| 12/05/2020|
#| 1| 1|12/06/2020| 12/06/2020|
#| 1| 0|12/07/2020| 12/06/2020|
#| 1| 1|12/08/2020| 12/08/2020|
#| 1| 0|12/09/2020| 12/08/2020|
#| 1| 0|12/10/2020| 12/08/2020|
#| 1| 0|12/11/2020| 12/08/2020|
#| 1| 1|12/12/2020| 12/12/2020|
#| 2| 1|12/01/2020| 12/01/2020|
#| 2| 0|12/02/2020| 12/01/2020|
#| 2| 0|12/03/2020| 12/01/2020|
#| 2| 1|12/04/2020| 12/04/2020|
#+-----------+----+----------+-----------+
You can use window functions. I am unsure whether spark sql supports the standard ignore nulls option to lag().
If it does, you can just do:
select
    t.*,
    case when flag = 1
        then trx_date
        else lag(case when flag = 1 then trx_date end ignore nulls)
            over(partition by customer_id order by trx_date)
    end destination
from mytable t
Else, you can build groups with a window sum first:
select
    customer_id,
    flag,
    trx_date,
    case when flag = 1
        then trx_date
        else min(trx_date) over(partition by customer_id, grp order by trx_date)
    end destination
from (
    select t.*, sum(flag) over(partition by customer_id order by trx_date) grp
    from mytable t
) t
You can achieve this in the following way if you are using the DataFrame API:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

# Convert the date format while creating the window itself
window = Window().orderBy("customer_id", f.to_date('trx_date', 'MM/dd/yyyy'))
df1 = df.withColumn('destination', f.when(f.col('Flag')==1, f.col('trx_date')))\
    .withColumn('destination', f.last(f.col('destination'), ignorenulls=True).over(window))
df1.show()
df1.show()
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Hope it helps.
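One caveat worth noting: the window above is only ordered, not partitioned, so it spans customers. It works here because every customer's first row has Flag = 1, but partitioning by customer_id is safer and avoids moving all rows into a single partition. A sketch of the same approach with a partitioned window:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

window = Window.partitionBy('customer_id').orderBy(f.to_date('trx_date', 'MM/dd/yyyy'))
df1 = df.withColumn('destination', f.when(f.col('Flag') == 1, f.col('trx_date')))\
    .withColumn('destination', f.last('destination', ignorenulls=True).over(window))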

SQL or Pyspark - Get the last time a column had a different value for each ID

I am using pyspark so I have tried both pyspark code and SQL.
I am trying to get the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:
+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
| 1| 1| A| 10|
| 2| 1| B| 15|
| 3| 1| A| 20|
| 4| 1| A| 40|
| 5| 1| A| 45|
+---+-------+-------+----+
The correct new column I would like is as below:
+---+-------+-------+----+---------+
| ID|USER_ID|ADDRESS|TIME|LAST_DIFF|
+---+-------+-------+----+---------+
| 1| 1| A| 10| null|
| 2| 1| B| 15| 10|
| 3| 1| A| 20| 15|
| 4| 1| A| 40| 15|
| 5| 1| A| 45| 15|
+---+-------+-------+----+---------+
I have tried using different windows but none ever seem to get exactly what I want. Any ideas?
A simplified version of #jxc's answer.
from pyspark.sql.functions import *
from pyspark.sql import Window
#Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))
#Getting the previous time and classifying rows into groups
grp_df = df.withColumn('grp',sum(when(lag(col('address')).over(w) == col('address'),0).otherwise(1)).over(w)) \
.withColumn('prev_time',lag(col('time')).over(w))
#Window definition with groups
w_grp = Window.partitionBy(col('user_id'),col('grp')).orderBy(col('id'))
grp_df.withColumn('last_addr_change_time',min(col('prev_time')).over(w_grp)).show()
Use lag with a running sum to assign groups whenever there is a change in the column value (based on the defined window), and also grab the time from the previous row, which will be used in the next step.
Once you have the groups, use the running minimum to get the timestamp of the last column-value change. (I suggest looking at the intermediate results to understand the transformations better.)
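To end up with exactly the LAST_DIFF column from the question, the helper columns can be dropped afterwards; a short follow-up sketch:
grp_df.withColumn('last_diff', min(col('prev_time')).over(w_grp))\
    .select('id', 'user_id', 'address', 'time', 'last_diff')\
    .show()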
One way using two Window specs:
from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window
w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')
# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))
# group by USER_ID and the above sub-group label and calculate the sum of time in the group as diff
# calculate the last_diff and then join the data back to the df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))
df1.join(df2, on=['USER_ID', 'g']).show()
+-------+---+---+-------+----+----+---------+
|USER_ID| g| ID|ADDRESS|TIME|diff|last_diff|
+-------+---+---+-------+----+----+---------+
| 1| 1| 1| A| 10| 10| null|
| 1| 2| 2| B| 15| 15| 10|
| 1| 3| 3| A| 20| 105| 15|
| 1| 3| 4| A| 40| 105| 15|
| 1| 3| 5| A| 45| 105| 15|
+-------+---+---+-------+----+----+---------+
df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')