Split table data based on time gaps - sql

Let's say we have a time-series dataset of entity metadata imported into a Postgres table "Stats":
CREATE EXTENSION IF NOT EXISTS POSTGIS;
DROP TABLE IF EXISTS "Stats";
CREATE TABLE IF NOT EXISTS "Stats"
(
    "time"     BIGINT,
    "id"       BIGINT,
    "position" GEOGRAPHY(PointZ, 4326)
);
And here is a sample of the table:
SELECT
    "id",
    "time"
FROM
    "Stats"
ORDER BY
    "id", "time" ASC
id|time|
--+----+
1| 3|
1| 4|
1| 6|
1| 7|
2| 2|
2| 6|
3| 14|
4| 2|
4| 9|
4| 10|
4| 11|
5| 32|
6| 15|
7| 16|
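If you want to reproduce the sample locally, something like the following would populate the table (a sketch; the position values are just NULL placeholders and are not part of the question):
-- Hypothetical seed data matching the id/time pairs above; positions are left NULL.
INSERT INTO "Stats" ("time", "id", "position")
VALUES
    (3, 1, NULL), (4, 1, NULL), (6, 1, NULL), (7, 1, NULL),
    (2, 2, NULL), (6, 2, NULL),
    (14, 3, NULL),
    (2, 4, NULL), (9, 4, NULL), (10, 4, NULL), (11, 4, NULL),
    (32, 5, NULL),
    (15, 6, NULL), (16, 7, NULL);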
The business requirement is to assign a route_id to the entities in this table: whenever the time for an entity jumps by more than 1 second, a new flight (route) starts for that entity. The final result would look like this for the previous sample:
id|time|route_id|
--+----+--------+
1| 3| 1|
1| 4| 1|
1| 6| 2|
1| 7| 2|
2| 2| 1|
2| 6| 2|
3| 14| 1|
4| 2| 1|
4| 9| 2|
4| 10| 2|
4| 11| 2|
5| 32| 1|
6| 15| 1|
7| 16| 1|
And this would be the new summary table of the routes:
id|start_time|end_time|route_id|
--+----------+--------+--------+
1| 3| 4| 1|
1| 6| 7| 2|
2| 2| 2| 1|
2| 6| 6| 2|
3| 14| 14| 1|
4| 2| 2| 1|
4| 9| 11| 2|
5| 32| 32| 1|
6| 15| 15| 1|
7| 16| 16| 1|
So how should this query be constructed?

with data as (
    select *, row_number() over (partition by id order by "time") as rn
    from "Stats"
)
select id,
       min("time") as start_time, max("time") as end_time,
       row_number() over (partition by id order by "time" - rn) as route_id
from data
group by id, "time" - rn
order by id, "time" - rn
https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=c272bc57786487b0b664648139530ae4
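The "time" - rn expression is the classic gaps-and-islands key: within a run of consecutive timestamps both the time and the row number step by 1, so their difference stays constant and only changes where a gap appears. The same key can also yield the per-row route_id from the first expected output, for example with dense_rank() (a hedged sketch, not part of the answer above):
-- Sketch: per-row route_id from the same "time" - rn island key.
WITH data AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY "time") AS rn
    FROM "Stats"
)
SELECT id,
       "time",
       DENSE_RANK() OVER (PARTITION BY id ORDER BY "time" - rn) AS route_id
FROM data
ORDER BY id, "time";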

Assuming you have the stats table in hand, the following query creates a new table with route_id assigned.
Query to assign route_id using a recursive CTE:
CREATE TABLE tbl_route AS
WITH RECURSIVE cte AS
(
    SELECT id, prev_time, time, rn, rn AS ref_rn, rn AS route_id
    FROM
    (
        SELECT
            *,
            lag(time)    OVER (PARTITION BY id ORDER BY time) AS prev_time,
            row_number() OVER (PARTITION BY id ORDER BY time) AS rn
        FROM stats
    ) AS rnt
    WHERE rn = 1

    UNION

    SELECT rnt2.id, rnt2.prev_time, rnt2.time, rnt2.rn, cte.rn AS ref_rn,
           CASE
               WHEN abs(rnt2.time - rnt2.prev_time) <= 1 THEN cte.route_id
               ELSE cte.route_id + 1
           END AS route_id
    FROM cte
    INNER JOIN
    (
        SELECT
            *,
            lag(time)    OVER (PARTITION BY id ORDER BY time) AS prev_time,
            row_number() OVER (PARTITION BY id ORDER BY time) AS rn
        FROM stats
    ) AS rnt2
    ON cte.id = rnt2.id AND cte.rn + 1 = rnt2.rn
)
SELECT id, time, route_id FROM cte;
Query to check whether the assigned route_id is correct:
select id, time, route_id
from tbl_route
order by id, time
Query to create new summary table:
select id, min(time) as start_time, max(time) as end_time, route_id
from tbl_route
group by id, route_id
order by id, route_id, start_time, end_time
Recursive-CTE query breakdown:
Since a recursive CTE is used, the query may look messy, but it breaks down as follows:
There are two main queries appended with UNION: the first assigns route_id to the starting row of each id, and the second handles the remaining rows of each id.
rnt and rnt2 are created because we need the ROW_NUMBER and LAG values to achieve this.
cte and rnt2 are joined recursively to assign route_id by checking the difference in time.
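For completeness, the same per-row route_id can also be produced without recursion, using the lag() plus cumulative-sum pattern that appears again in the gaps-and-islands answer further down; a hedged sketch against the "Stats" table from the question:
-- Non-recursive sketch: flag rows whose gap to the previous time exceeds 1 second,
-- then a running sum of those flags per id gives the route number.
SELECT "id", "time",
       SUM(CASE WHEN prev_time IS NULL OR "time" - prev_time <= 1 THEN 0 ELSE 1 END)
           OVER (PARTITION BY "id" ORDER BY "time") + 1 AS route_id
FROM (
    SELECT "id", "time",
           LAG("time") OVER (PARTITION BY "id" ORDER BY "time") AS prev_time
    FROM "Stats"
) t
ORDER BY "id", "time";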

Related

SQL grouped running sum

I have some data like this
data = [("1","1"), ("1","1"), ("1","1"), ("2","1"), ("2","1"), ("3","1"), ("3","1"), ("4","1"),]
df = spark.createDataFrame(data=data, schema=["id","imp"])
df.createOrReplaceTempView("df")
+---+---+
| id|imp|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 1|
| 2| 1|
| 3| 1|
| 3| 1|
| 4| 1|
+---+---+
I want the count of IDs grouped by ID, its running sum, and the total sum. This is the code I'm using:
query = """
select id,
count(id) as count,
sum(count(id)) over (order by count(id) desc) as running_sum,
sum(count(id)) over () as total_sum
from df
group by id
order by count desc
"""
spark.sql(query).show()
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 7| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+
The problem is with the running_sum column. For some reason it lumps the tied counts of 2 together while summing and shows 7 for both ID 2 and 3.
This is the result I'm expecting
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 5| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+
You should do the running sum in an outer query.
spark.sql('''
select *,
sum(cnt) over (order by id rows between unbounded preceding and current row) as run_sum,
sum(cnt) over (partition by '1') as tot_sum
from (
select id, count(id) as cnt
from data_tbl
group by id)
'''). \
show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+
Using the DataFrame API:
import sys
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the question's DataFrame (df)
data_sdf. \
    groupBy('id'). \
    agg(func.count('id').alias('cnt')). \
    withColumn('run_sum',
               func.sum('cnt').over(wd.partitionBy().orderBy('id').rowsBetween(-sys.maxsize, 0))
               ). \
    withColumn('tot_sum', func.sum('cnt').over(wd.partitionBy())). \
    show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+
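As a side note, the 7/7 result in the original query comes from the default window frame: with only an ORDER BY, the frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows tied on count are peers and receive the same running total. Keeping the question's single-query shape but forcing a ROWS frame with a deterministic tiebreaker should also work (a hedged variation, not part of the answer above):
-- Variation on the question's query: ROWS framing plus an id tiebreaker,
-- so tied counts are accumulated row by row instead of as a peer group.
select id,
       count(id) as count,
       sum(count(id)) over (order by count(id) desc, id
                            rows between unbounded preceding and current row) as running_sum,
       sum(count(id)) over () as total_sum
from df
group by id
order by count desc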

How to split data into groups in pyspark

I need to find groups in time series data.
Data sample
I need to output a group column based on value and day.
I've tried using lag, lead and row_number, but it came to nothing.
It seems like you want to increment the group every time the value changes. If so, this is a kind of gaps-and-islands problem.
Here is one approach that uses lag() and a cumulative sum():
select
    value,
    day,
    sum(case when value = lag_value then 0 else 1 end) over (order by day) as grp
from (
    select t.*, lag(value) over (order by day) as lag_value
    from mytable t
) t
The PySpark way to do this: find the start of each group using lag, take an incremental sum over this flag to get the groups, and add 1 to get your desired group numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w1 = Window().orderBy("day")
df.withColumn("lag", F.when(F.lag("value").over(w1) != F.col("value"), F.lit(1)).otherwise(F.lit(0)))\
  .withColumn("group", F.sum("lag").over(w1) + 1).drop("lag").show()
#+-----+---+-----+
#|value|day|group|
#+-----+---+-----+
#| 1| 1| 1|
#| 1| 2| 1|
#| 1| 3| 1|
#| 1| 4| 1|
#| 1| 5| 1|
#| 2| 6| 2|
#| 2| 7| 2|
#| 1| 8| 3|
#| 1| 9| 3|
#| 1| 10| 3|
#| 1| 11| 3|
#| 1| 12| 3|
#| 1| 13| 3|
#+-----+---+-----+

Apache Spark window, choose the previous last item based on some condition

I have input data with columns id, pid, pname, ppid: id (you can think of it as time), pid (process id), pname (process name), and ppid (the id of the parent process that created pid).
+---+---+-----+----+
| id|pid|pname|ppid|
+---+---+-----+----+
| 1| 1| 5| -1|
| 2| 1| 7| -1|
| 3| 2| 9| 1|
| 4| 2| 11| 1|
| 5| 3| 5| 1|
| 6| 4| 7| 2|
| 7| 1| 9| 3|
+---+---+-----+----+
Now I need to find ppname (parent process name), which is the pname of the last previous row satisfying previous.pid == current.ppid.
Expected result for the previous example:
+---+---+-----+----+------+
| id|pid|pname|ppid|ppname|
+---+---+-----+----+------+
| 1| 1| 5| -1| -1|
| 2| 1| 7| -1| -1| no item found above with pid=-1
| 3| 2| 9| 1| 7| last pid = 1(ppid) above, pname=7
| 4| 2| 11| 1| 7|
| 5| 3| 5| 1| 7|
| 6| 4| 7| 2| 11| last pid = 2(ppid) above, pname=11
| 7| 1| 9| 3| 5| last pid = 3(ppid) above, pname=5
+---+---+-----+----+------+
I could self-join on pid == ppid, take the difference between ids, pick the row with the minimum positive difference, and maybe join back again for the cases where no positive diff was found (the -1 case). But I am thinking that is almost a cross join, which I may not be able to afford since I have 100M rows.
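For reference, a hedged Spark SQL sketch of the self-join idea described above (the table name proc_tbl is hypothetical, and pname is assumed numeric as in the sample); it reproduces the expected ppname column, though it does not address the cost concern:
-- Sketch of the self-join described above: for each row, pick the latest
-- earlier row whose pid equals the current ppid; fall back to -1 when none exists.
SELECT id, pid, pname, ppid,
       COALESCE(ppname, -1) AS ppname
FROM (
    SELECT a.id, a.pid, a.pname, a.ppid,
           b.pname AS ppname,
           ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY b.id DESC) AS rn
    FROM proc_tbl a
    LEFT JOIN proc_tbl b
           ON b.pid = a.ppid
          AND b.id < a.id
) t
WHERE rn = 1
ORDER BY id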

How to filter rows in SQL statement for aggregate function by window function?

I have a table and provide tools for the user to generate new columns based on existing ones.
Table:
+---+
| a|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
+---+
New column name: b
New column rule must be like: max(a) over(WHERE a < 3)
How do I write this correctly?
The result must be like the SQL statement SELECT *, (SELECT max(a) FROM table WHERE a < 3) AS b FROM table, which returns:
+---+---+
| a| b|
+---+---+
| 0| 2|
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 5| 2|
+---+---+
But I can't write a WHERE clause inside over(), and I can't let the user know the name of the table. How do I solve this problem?
Just use a window function with case:
select a, max(case when a < 3 then a end) over () as b
from t;
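Why this works: the CASE nulls out every a >= 3, MAX ignores NULLs, and the empty OVER () makes the window span the whole table, so every row gets the same value. A quick check against the sample values (a sketch using an inline VALUES list; adapt the FROM clause to your engine):
-- Builds the sample column inline and applies the answer's window expression.
SELECT a,
       MAX(CASE WHEN a < 3 THEN a END) OVER () AS b
FROM (VALUES (0), (1), (2), (3), (4), (5)) AS t(a);
-- Every row returns b = 2, matching the expected output.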

Converting rows into columns tsql

I have a table like
CId| RId| No
---+----+----
1| 10| 100
1| 20| 20
1| 30| 10
2| 10| 200
2| 30| 20
3| 40| 25
Here, RId represents "NoToAttend" (10), "NoNotToAttend" (20), "NoWait" (30), "Backup" (40), etc.
I need to have a result table that will look like
Cid| NoToAttend| NoNotToAttend| NoWait| Backup
---+-----------+--------------+-------+-------
  1|        100|            20|     10|   null
  2|        200|          null|     20|   null
  3|       null|          null|   null|     25
I am not sure how to use PIVOT. I need help with this.
You can use the PIVOT function and just alias your columns:
SELECT pvt.CID,
       [NoToAttend]    = pvt.[10],
       [NoNotToAttend] = pvt.[20],
       [NoWait]        = pvt.[30],
       [Backup]        = pvt.[40]
FROM T
PIVOT
(   SUM([No])
    FOR RID IN ([10], [20], [30], [40])
) pvt;
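The same result can also be written with conditional aggregation, which avoids PIVOT entirely; a hedged sketch against the same table T:
-- Conditional aggregation: one SUM per RId bucket, grouped by CId.
SELECT CId,
       SUM(CASE WHEN RId = 10 THEN [No] END) AS NoToAttend,
       SUM(CASE WHEN RId = 20 THEN [No] END) AS NoNotToAttend,
       SUM(CASE WHEN RId = 30 THEN [No] END) AS NoWait,
       SUM(CASE WHEN RId = 40 THEN [No] END) AS [Backup]
FROM T
GROUP BY CId;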