Spark SQL join on closest date by ID - sql

I have the two tables below that I'm trying to join on ID and on the closest available weekly_dt date to each ingest_date.
In standard ANSI SQL I would usually use a correlated subquery limited to one result per row, so there is no aggregation error. However, doing this in Spark SQL gives me the following error:
AnalysisException: Correlated scalar subqueries must be aggregated: GlobalLimit 1
Setup
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
np.random.seed(25)
A1 = [('A1', i.date(), np.random.randint(0,50)) for i in pd.date_range('01 Jan 2021', '21 Jan 2021',freq='D')]
A2 = [('A2', i.date(), np.random.randint(0,50)) for i in pd.date_range('01 Jan 2021', '21 Jan 2021',freq='D')]
df_a = spark.createDataFrame(A1 + A2, ['id','ingest_date','amt'])
weekly_scores = [
    ('A1', pd.Timestamp('01 Jan 2021').date(), '0.5'),
    ('A1', pd.Timestamp('08 Jan 2021').date(), '0.3'),
    ('A1', pd.Timestamp('15 Jan 2021').date(), '0.8'),
    ('A1', pd.Timestamp('22 Jan 2021').date(), '0.6'),
    ('A2', pd.Timestamp('01 Jan 2021').date(), '0.6'),
    ('A2', pd.Timestamp('08 Jan 2021').date(), '0.1'),
    ('A2', pd.Timestamp('15 Jan 2021').date(), '0.9'),
    ('A2', pd.Timestamp('22 Jan 2021').date(), '0.3'),
]
df_b = spark.createDataFrame(weekly_scores, ['id','weekly_dt','score'])
Tables
df_a.show()
+---+-----------+---+
| id|ingest_date|amt|
+---+-----------+---+
| A1| 2021-01-01| 26|
| A1| 2021-01-02| 1|
| A1| 2021-01-03| 0|
| A1| 2021-01-04| 31|
| A1| 2021-01-05| 41|
| A1| 2021-01-06| 46|
| A1| 2021-01-07| 11|
| A1| 2021-01-08| 0|
| A1| 2021-01-09| 14|
| A1| 2021-01-10| 5|
| A1| 2021-01-11| 0|
| A1| 2021-01-12| 35|
| A1| 2021-01-13| 5|
| A1| 2021-01-14| 43|
| A1| 2021-01-15| 18|
| A1| 2021-01-16| 31|
| A1| 2021-01-17| 44|
| A1| 2021-01-18| 25|
| A1| 2021-01-19| 47|
| A1| 2021-01-20| 36|
+---+-----------+---+
df_b.show()
+---+----------+-----+
| id| weekly_dt|score|
+---+----------+-----+
| A1|2021-01-01| 0.5|
| A1|2021-01-08| 0.3|
| A1|2021-01-15| 0.8|
| A1|2021-01-22| 0.6|
| A2|2021-01-01| 0.6|
| A2|2021-01-08| 0.1|
| A2|2021-01-15| 0.9|
| A2|2021-01-22| 0.3|
+---+----------+-----+
Expected Output.
id ingest_date amt weekly_dt score
0 A1 2021-01-01 26 2021-01-01 0.5
4 A1 2021-01-02 1 2021-01-01 0.5
8 A1 2021-01-03 0 2021-01-01 0.5
12 A1 2021-01-04 31 2021-01-01 0.5
17 A1 2021-01-05 41 2021-01-08 0.3
21 A1 2021-01-06 46 2021-01-08 0.3
25 A1 2021-01-07 11 2021-01-08 0.3
29 A1 2021-01-08 0 2021-01-08 0.3
33 A1 2021-01-09 14 2021-01-08 0.3
37 A1 2021-01-10 5 2021-01-08 0.3
41 A1 2021-01-11 0 2021-01-08 0.3
46 A1 2021-01-12 35 2021-01-15 0.8
50 A1 2021-01-13 5 2021-01-15 0.8
54 A1 2021-01-14 43 2021-01-15 0.8
58 A1 2021-01-15 18 2021-01-15 0.8
62 A1 2021-01-16 31 2021-01-15 0.8
66 A1 2021-01-17 44 2021-01-15 0.8
70 A1 2021-01-18 25 2021-01-15 0.8
75 A1 2021-01-19 47 2021-01-22 0.6
79 A1 2021-01-20 36 2021-01-22 0.6
83 A1 2021-01-21 43 2021-01-22 0.6
84 A2 2021-01-01 32 2021-01-01 0.6
88 A2 2021-01-02 37 2021-01-01 0.6
92 A2 2021-01-03 11 2021-01-01 0.6
96 A2 2021-01-04 21 2021-01-01 0.6
101 A2 2021-01-05 29 2021-01-08 0.1
105 A2 2021-01-06 48 2021-01-08 0.1
109 A2 2021-01-07 12 2021-01-08 0.1
113 A2 2021-01-08 40 2021-01-08 0.1
117 A2 2021-01-09 30 2021-01-08 0.1
121 A2 2021-01-10 28 2021-01-08 0.1
125 A2 2021-01-11 41 2021-01-08 0.1
130 A2 2021-01-12 12 2021-01-15 0.9
134 A2 2021-01-13 10 2021-01-15 0.9
138 A2 2021-01-14 10 2021-01-15 0.9
142 A2 2021-01-15 31 2021-01-15 0.9
146 A2 2021-01-16 13 2021-01-15 0.9
150 A2 2021-01-17 31 2021-01-15 0.9
154 A2 2021-01-18 11 2021-01-15 0.9
159 A2 2021-01-19 15 2021-01-22 0.3
163 A2 2021-01-20 18 2021-01-22 0.3
167 A2 2021-01-21 49 2021-01-22 0.3
Spark Query
SELECT
    a.id,
    a.ingest_date,
    a.amt,
    b.weekly_dt,
    b.score
FROM a
LEFT JOIN b
    ON a.id = b.id
    AND b.weekly_dt =
    (
        SELECT b2.weekly_dt
        FROM b b2
        WHERE b2.id = a.id
        ORDER BY ABS(DATEDIFF(a.ingest_date, b2.weekly_dt)) ASC
        LIMIT 1
    )
Edit:
I know I can create a window and use dense_rank() to order the results, but I wonder if this is the best method?
from pyspark.sql import Window
from pyspark.sql import functions as F

# register the DataFrames as temp views so they can be queried with spark.sql
df_a.createOrReplaceTempView('a')
df_b.createOrReplaceTempView('b')

s = spark.sql("""
    SELECT
        a.id,
        a.ingest_date,
        a.amt,
        b.weekly_dt,
        b.score
    FROM a
    LEFT JOIN b
        ON b.id = a.id
""").withColumn(
    'delta',
    F.abs(F.datediff(F.col('ingest_date'), F.col('weekly_dt')))
)

s.withColumn(
    't',
    F.dense_rank().over(
        Window.partitionBy('id', 'ingest_date').orderBy(F.asc('delta')))
).filter('t == 1').drop('t', 'delta').show()
id ingest_date amt weekly_dt score
0 A2 2021-01-01 32 2021-01-01 0.6
1 A2 2021-01-02 37 2021-01-01 0.6
2 A2 2021-01-03 11 2021-01-01 0.6
3 A2 2021-01-04 21 2021-01-01 0.6
4 A2 2021-01-05 29 2021-01-08 0.1
5 A2 2021-01-06 48 2021-01-08 0.1
6 A2 2021-01-07 12 2021-01-08 0.1
7 A2 2021-01-08 40 2021-01-08 0.1
8 A2 2021-01-09 30 2021-01-08 0.1
9 A2 2021-01-10 28 2021-01-08 0.1
10 A2 2021-01-11 41 2021-01-08 0.1
11 A2 2021-01-12 12 2021-01-15 0.9
12 A2 2021-01-13 10 2021-01-15 0.9
13 A2 2021-01-14 10 2021-01-15 0.9
14 A2 2021-01-15 31 2021-01-15 0.9

I would replace the subquery with LIMIT by a window function:
from pyspark.sql import functions as F, Window as W

df = df_a.join(df_b, on="id")
df = (
    df.withColumn(
        "rnk",
        F.row_number().over(
            W.partitionBy("id", "ingest_date").orderBy(
                F.abs(F.datediff("ingest_date", "weekly_dt"))
            )
        ),
    )
    .where("rnk = 1")
    .drop("rnk")
)
df.show()
+---+-----------+---+----------+-----+
| id|ingest_date|amt| weekly_dt|score|
+---+-----------+---+----------+-----+
| A2| 2021-01-01| 31|2021-01-01| 0.6|
| A2| 2021-01-02| 48|2021-01-01| 0.6|
| A2| 2021-01-03| 47|2021-01-01| 0.6|
| A2| 2021-01-04| 9|2021-01-01| 0.6|
| A2| 2021-01-05| 16|2021-01-08| 0.1|
| A2| 2021-01-06| 44|2021-01-08| 0.1|
| A2| 2021-01-07| 45|2021-01-08| 0.1|
| A2| 2021-01-08| 21|2021-01-08| 0.1|
| A2| 2021-01-09| 36|2021-01-08| 0.1|
| A2| 2021-01-10| 9|2021-01-08| 0.1|
| A2| 2021-01-11| 32|2021-01-08| 0.1|
| A2| 2021-01-12| 10|2021-01-15| 0.9|
| A2| 2021-01-13| 47|2021-01-15| 0.9|
| A2| 2021-01-14| 42|2021-01-15| 0.9|
| A2| 2021-01-15| 1|2021-01-15| 0.9|
| A2| 2021-01-16| 22|2021-01-15| 0.9|
| A2| 2021-01-17| 27|2021-01-15| 0.9|
| A2| 2021-01-18| 49|2021-01-15| 0.9|
| A2| 2021-01-19| 18|2021-01-22| 0.3|
| A2| 2021-01-20| 28|2021-01-22| 0.3|
+---+-----------+---+----------+-----+
only showing top 20 rows
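
If you prefer to stay in Spark SQL rather than the DataFrame API, the same ranking logic can be written directly in the query. Below is a minimal sketch of that idea, assuming df_a and df_b are registered as the temp views a and b used earlier:

SELECT id, ingest_date, amt, weekly_dt, score
FROM (
    SELECT
        a.id,
        a.ingest_date,
        a.amt,
        b.weekly_dt,
        b.score,
        ROW_NUMBER() OVER (
            PARTITION BY a.id, a.ingest_date
            ORDER BY ABS(DATEDIFF(a.ingest_date, b.weekly_dt))
        ) AS rnk
    FROM a
    LEFT JOIN b
        ON b.id = a.id
) ranked
WHERE rnk = 1

Either way, the window function replaces the LIMIT 1 that the analyzer rejects inside a correlated scalar subquery.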

Related

How do I split a duration of time into hourly intervals in BigQuery?

This is my table:
| Location | Date       | Employee | start | end |
|----------|------------|----------|-------|-----|
| A        | 2021-01-01 | A1       | 10    | 15  |
| A        | 2021-01-01 | A1       | 15    | 16  |
| B        | 2021-01-01 | B1       | 16    | 21  |
| C        | 2021-01-01 | C1       | 11    | 15  |
Here is the expected output:
| Location | Date       | Employee | start | end |
|----------|------------|----------|-------|-----|
| A        | 2021-01-01 | A1       | 10    | 15  |
| A        | 2021-01-01 | A1       | 11    | 15  |
| A        | 2021-01-01 | A1       | 12    | 15  |
| A        | 2021-01-01 | A1       | 13    | 15  |
| A        | 2021-01-01 | A1       | 14    | 15  |
| A        | 2021-01-01 | A1       | 15    | 15  |
| A        | 2021-01-01 | A1       | 15    | 16  |
| A        | 2021-01-01 | A1       | 16    | 16  |
| B        | 2021-01-01 | B1       | 16    | 21  |
| B        | 2021-01-01 | B1       | 17    | 21  |
| B        | 2021-01-01 | B1       | 18    | 21  |
| B        | 2021-01-01 | B1       | 19    | 21  |
| B        | 2021-01-01 | B1       | 20    | 21  |
| B        | 2021-01-01 | B1       | 21    | 21  |
| C        | 2021-01-01 | C1       | 11    | 15  |
| C        | 2021-01-01 | C1       | 12    | 15  |
| C        | 2021-01-01 | C1       | 13    | 15  |
| C        | 2021-01-01 | C1       | 14    | 15  |
| C        | 2021-01-01 | C1       | 15    | 15  |
Please help me split the time ranges into hourly rows like this in BigQuery.
SELECT * EXCEPT(t) REPLACE(t AS start)
FROM my_table, UNNEST(GENERATE_ARRAY(start, `end`)) t
ORDER BY Location, start;
Query results:
+----------+------------+----------+-------+-----+
| Location | Date | Employee | start | end |
+----------+------------+----------+-------+-----+
| A | 2021-01-01 | A1 | 10 | 15 |
| A | 2021-01-01 | A1 | 11 | 15 |
| A | 2021-01-01 | A1 | 12 | 15 |
| A | 2021-01-01 | A1 | 13 | 15 |
| A | 2021-01-01 | A1 | 14 | 15 |
| A | 2021-01-01 | A1 | 15 | 15 |
| A | 2021-01-01 | A1 | 15 | 16 |
| A | 2021-01-01 | A1 | 16 | 16 |
| B | 2021-01-01 | B1 | 16 | 21 |
| B | 2021-01-01 | B1 | 17 | 21 |
| B | 2021-01-01 | B1 | 18 | 21 |
| B | 2021-01-01 | B1 | 19 | 21 |
| B | 2021-01-01 | B1 | 20 | 21 |
| B | 2021-01-01 | B1 | 21 | 21 |
| C | 2021-01-01 | C1 | 11 | 15 |
| C | 2021-01-01 | C1 | 12 | 15 |
| C | 2021-01-01 | C1 | 13 | 15 |
| C | 2021-01-01 | C1 | 14 | 15 |
| C | 2021-01-01 | C1 | 15 | 15 |
+----------+------------+----------+-------+-----+
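
A short note on how this works, in case the EXCEPT/REPLACE modifiers are unfamiliar: UNNEST(GENERATE_ARRAY(start, `end`)) produces one row per hour in the inclusive range, and SELECT * EXCEPT(t) REPLACE(t AS start) drops the generated column t while substituting its value for the original start. An equivalent, more explicit spelling (still assuming the table is called my_table) would be:

SELECT Location, Date, Employee, t AS start, `end`
FROM my_table, UNNEST(GENERATE_ARRAY(start, `end`)) AS t
ORDER BY Location, start;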

How to resample pandas to hydrologic year (Sep 1 - Aug 31)

I'd like to analyze some daily data by hydrologic year: From 1 September to 31 August. I've created a synthetic data set with:
import pandas as pd
t = pd.date_range(start='2015-01-01', freq='D', end='2021-09-03')
df = pd.DataFrame(index = t)
df['hydro_year'] = df.index.year
df.loc[df.index.month >= 9, 'hydro_year'] += 1  # Sep-Dec belong to the next hydrologic year
df['id'] = df['hydro_year'] - df.index.year[0]
df['count'] = 1
Note that in reality I do not have a hydro_year column, so I cannot simply use groupby. I would expect the following to resample by hydrologic year:
print(df['2015-09-01':].resample('12M').agg({'hydro_year':'mean','id':'mean','count':'sum'}))
But the output does not align:
| | hydro_year | id | count |
|---------------------+------------+---------+-------|
| 2015-09-30 00:00:00 | 2016 | 1 | 30 |
| 2016-09-30 00:00:00 | 2016.08 | 1.08197 | 366 |
| 2017-09-30 00:00:00 | 2017.08 | 2.08219 | 365 |
| 2018-09-30 00:00:00 | 2018.08 | 3.08219 | 365 |
| 2019-09-30 00:00:00 | 2019.08 | 4.08219 | 365 |
| 2020-09-30 00:00:00 | 2020.08 | 5.08197 | 366 |
| 2021-09-30 00:00:00 | 2021.01 | 6.00888 | 338 |
However, if I start a day earlier, then things do align, except the first day is 'early' and dangling alone...
| | hydro_year | id | count |
|---------------------+------------+----+-------|
| 2015-08-31 00:00:00 | 2015 | 0 | 1 |
| 2016-08-31 00:00:00 | 2016 | 1 | 366 |
| 2017-08-31 00:00:00 | 2017 | 2 | 365 |
| 2018-08-31 00:00:00 | 2018 | 3 | 365 |
| 2019-08-31 00:00:00 | 2019 | 4 | 365 |
| 2020-08-31 00:00:00 | 2020 | 5 | 366 |
| 2021-08-31 00:00:00 | 2021 | 6 | 365 |
| 2022-08-31 00:00:00 | 2022 | 7 | 3 |
IIUC, you can use 12MS (Start) instead of 12M:
>>> df['2015-09-01':].resample('12MS') \
.agg({'hydro_year':'mean','id':'mean','count':'sum'})
hydro_year id count
2015-09-01 2016.0 1.0 366
2016-09-01 2017.0 2.0 365
2017-09-01 2018.0 3.0 365
2018-09-01 2019.0 4.0 365
2019-09-01 2020.0 5.0 366
2020-09-01 2021.0 6.0 365
2021-09-01 2022.0 7.0 3
We can try with anchored offsets, annually anchored to the start of September (AS-SEP):
resampled_df = df['2015-09-01':].resample('AS-SEP').agg({
    'hydro_year': 'mean', 'id': 'mean', 'count': 'sum'
})
print(resampled_df)
hydro_year id count
2015-09-01 2016.0 1.0 366
2016-09-01 2017.0 2.0 365
2017-09-01 2018.0 3.0 365
2018-09-01 2019.0 4.0 365
2019-09-01 2020.0 5.0 366
2020-09-01 2021.0 6.0 365
2021-09-01 2022.0 7.0 3
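
If you ever need the same anchored grouping outside of resample, for example to keep a groupby object around, pd.Grouper accepts the same anchored offset. A small sketch, assuming df is the synthetic frame built above (newer pandas versions may prefer the 'YS-SEP' spelling of the alias):

# group the daily records into hydrologic years (Sep 1 - Aug 31)
grouped = df['2015-09-01':].groupby(pd.Grouper(freq='AS-SEP'))
print(grouped.agg({'hydro_year': 'mean', 'id': 'mean', 'count': 'sum'}))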

How to detect missing values if the dataframe has removed the missing rows already?

I have a dataframe that contains time series data for 30 consecutive days. Each day is supposed to contain data for 24 hours, from 0 to 23, so there should be 24*30 = 720 rows in the dataframe. However, some rows whose "Fooo" value was missing have already been removed from the dataframe.
Index | DATE(YYYY/MM/DD) | Hour | Fooo
0 | 2015/01/01 | 0 | x
1 | 2015/01/01 | 1 | xy
2 | ... | ... | z
23 | 2015/01/01 | 23 | z
24 | 2015/01/02 | 0 | z
25 | 2015/01/02 | 2 | bz
... | ... | ... | z
46 | 2015/01/02 | 23 | zz
...
...
680 | 2015/01/30 | 1 | z
681 | 2015/01/30 | 3 | bz
... | ... | ... | z
701 | 2015/01/30 | 23 | zz
I would like to rebuild the dataframe so that it contains the full 720 rows, with the missing values in the column "Fooo" filled with "NA".
Index | DATE(YYYY/MM/DD) | Hour | Fooo
0 | 2015/01/01 | 0 | x
1 | 2015/01/01 | 1 | xy
2 | ... | ... | z
23 | 2015/01/01 | 23 | z
24 | 2015/01/02 | 0 | z
25 | 2015/01/02 | 1 | NA
26 | 2015/01/02 | 2 | bz
... | ... | ... | z
47 | 2015/01/02 | 23 | zz
...
...
690 | 2015/01/30 | 0 | NA
691 | 2015/01/30 | 1 | z
692 | 2015/01/30 | 2 | NA
693 | 2015/01/30 | 3 | bz
... | ... | ... | z
719 | 2015/01/30 | 23 | zz
How can I do that in pandas? I tried to create another dataframe with one column "Hour" like this:
Index | Hour |
0 | 0 |
1 | 1 |
2 | ... |
23 | 23 |
24 | 0 |
25 | 1 |
26 | 2 |
... | ...
47 | 23 |
...
...
690 | 0 |
691 | 1 |
692 | 2
693 | 3 |
... | |
719 | 23 |
then outer join it with the original one, but it did not work.
Create a helper DataFrame with itertools.product and use DataFrame.merge with a left join:
from itertools import product

df['DATE(YYYY/MM/DD)'] = pd.to_datetime(df['DATE(YYYY/MM/DD)'])
# one row per (date, hour) combination for hours 0-23
df1 = pd.DataFrame(list(product(df['DATE(YYYY/MM/DD)'].unique(), range(24))),
                   columns=['DATE(YYYY/MM/DD)', 'Hour'])
df = df1.merge(df, how='left')
print(df.head(10))
DATE(YYYY/MM/DD) Hour Fooo
0 2015-01-01 0 x
1 2015-01-01 1 xy
2 2015-01-01 2 NaN
3 2015-01-01 3 NaN
4 2015-01-01 4 NaN
5 2015-01-01 5 NaN
6 2015-01-01 6 NaN
7 2015-01-01 7 NaN
8 2015-01-01 8 NaN
9 2015-01-01 9 NaN
Or create a MultiIndex with MultiIndex.from_product and use DataFrame.reindex to append the missing rows:
df['DATE(YYYY/MM/DD)'] = pd.to_datetime(df['DATE(YYYY/MM/DD)'])
mux = pd.MultiIndex.from_product([df['DATE(YYYY/MM/DD)'].unique(), range(24)],
                                 names=['DATE(YYYY/MM/DD)', 'Hour'])
df = df.set_index(['DATE(YYYY/MM/DD)', 'Hour']).reindex(mux).reset_index()
print(df.head(10))
DATE(YYYY/MM/DD) Hour Fooo
0 2015-01-01 0 x
1 2015-01-01 1 xy
2 2015-01-01 2 NaN
3 2015-01-01 3 NaN
4 2015-01-01 4 NaN
5 2015-01-01 5 NaN
6 2015-01-01 6 NaN
7 2015-01-01 7 NaN
8 2015-01-01 8 NaN
9 2015-01-01 9 NaN
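
Either way, a quick sanity check (a sketch, assuming the rebuilt frame is still called df and covers exactly 30 days) confirms the frame now has the full 720 rows and shows which hours were originally missing:

# every (date, hour) combination should now be present
assert len(df) == 24 * 30

# rows that were absent from the original data now have NaN in "Fooo"
print(df[df['Fooo'].isna()].head())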

How can one assign a rank that increases--rather than the same value as rank() and dense_rank() does--to later group of values previously encountered?

date id b bc x
2017-06-01 a35b3y26f 3 0.19 1
2017-06-02 a35b3y26f 3 0.19 1
2017-06-03 a35b3y26f 3 0.23 2
2017-06-04 a35b3y26f 3 0.12 3
2017-06-05 a35b3y26f 3 0.21 4
2017-06-06 a35b3y26f 3 0.19 5
2017-06-07 a35b3y26f 3 0.28 6
2017-06-08 a35b3y26f 3 0 7
2017-06-09 a35b3y26f 3 0 7
2017-06-10 a35b3y26f 3 0.15 8
2017-06-11 a35b3y26f 3 0.3 9
2017-06-12 a35b3y26f 3 0.17 10
2017-06-13 a35b3y26f 3 0.27 11
2017-06-14 a35b3y26f 3 0.28 12
2017-06-15 a35b3y26f 3 0.18 13
2017-06-16 a35b3y26f 3 0 14
2017-06-17 a35b3y26f 3 0.2 15
2017-06-18 a35b3y26f 3 0 16
2017-06-19 a35b3y26f 3 0.28 17
2017-06-20 a35b3y26f 3 0.25 18
2017-06-21 a35b3y26f 3 0.19 19
2017-06-22 a35b3y26f 3 0.23 20
2017-06-23 a35b3y26f 3 0 21
2017-06-24 a35b3y26f 3 0 21
2017-06-25 a35b3y26f 3 0.13 22
Above, column x represents the values that I wish to have output in the result set.
Is there a way using the existing windowing functions provided by PostgreSQL that I can obtain this outcome?
One way is to use the sum and lag window functions:
SELECT "date", "id", "b", "bc", "x",
       SUM( xxxxx ) OVER (ORDER BY "date") AS x
FROM (
    SELECT *,
           CASE "bc"
               WHEN lag( "bc" ) OVER (ORDER BY "date")
               THEN 0 ELSE 1
           END AS xxxxx
    FROM table1
) x
Demo: http://sqlfiddle.com/#!17/8dab6/4
| date | id | b | bc | x | x |
|----------------------|-----------|---|------|----|----|
| 2017-06-01T00:00:00Z | a35b3y26f | 3 | 0.19 | 1 | 1 |
| 2017-06-02T00:00:00Z | a35b3y26f | 3 | 0.19 | 1 | 1 |
| 2017-06-03T00:00:00Z | a35b3y26f | 3 | 0.23 | 2 | 2 |
| 2017-06-04T00:00:00Z | a35b3y26f | 3 | 0.12 | 3 | 3 |
| 2017-06-05T00:00:00Z | a35b3y26f | 3 | 0.21 | 4 | 4 |
| 2017-06-06T00:00:00Z | a35b3y26f | 3 | 0.19 | 5 | 5 |
| 2017-06-07T00:00:00Z | a35b3y26f | 3 | 0.28 | 6 | 6 |
| 2017-06-08T00:00:00Z | a35b3y26f | 3 | 0 | 7 | 7 |
| 2017-06-09T00:00:00Z | a35b3y26f | 3 | 0 | 7 | 7 |
| 2017-06-10T00:00:00Z | a35b3y26f | 3 | 0.15 | 8 | 8 |
| 2017-06-11T00:00:00Z | a35b3y26f | 3 | 0.3 | 9 | 9 |
| 2017-06-12T00:00:00Z | a35b3y26f | 3 | 0.17 | 10 | 10 |
| 2017-06-13T00:00:00Z | a35b3y26f | 3 | 0.27 | 11 | 11 |
| 2017-06-14T00:00:00Z | a35b3y26f | 3 | 0.28 | 12 | 12 |
| 2017-06-15T00:00:00Z | a35b3y26f | 3 | 0.18 | 13 | 13 |
| 2017-06-16T00:00:00Z | a35b3y26f | 3 | 0 | 14 | 14 |
| 2017-06-17T00:00:00Z | a35b3y26f | 3 | 0.2 | 15 | 15 |
| 2017-06-18T00:00:00Z | a35b3y26f | 3 | 0 | 16 | 16 |
| 2017-06-19T00:00:00Z | a35b3y26f | 3 | 0.28 | 17 | 17 |
| 2017-06-20T00:00:00Z | a35b3y26f | 3 | 0.25 | 18 | 18 |
| 2017-06-21T00:00:00Z | a35b3y26f | 3 | 0.19 | 19 | 19 |
| 2017-06-22T00:00:00Z | a35b3y26f | 3 | 0.23 | 20 | 20 |
| 2017-06-23T00:00:00Z | a35b3y26f | 3 | 0 | 21 | 21 |
| 2017-06-24T00:00:00Z | a35b3y26f | 3 | 0 | 21 | 21 |
| 2017-06-25T00:00:00Z | a35b3y26f | 3 | 0.13 | 22 | 22 |
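
If the table can hold more than one id, the same gaps-and-islands idea presumably needs a per-id partition; a sketch under that assumption, with the helper column renamed for readability:

SELECT "date", "id", "b", "bc",
       SUM(is_new_group) OVER (PARTITION BY "id" ORDER BY "date") AS x
FROM (
    SELECT *,
           CASE "bc"
               WHEN lag("bc") OVER (PARTITION BY "id" ORDER BY "date")
               THEN 0 ELSE 1
           END AS is_new_group
    FROM table1
) t;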

Oracle: Select parallel entries

I am searching for the most efficient way to run a relatively complicated query on a relatively large table.
The concept is that:
I have a table that holds records of phases that can run in parallel with each other.
The number of records exceeds 5 million (and is growing).
The time period starts about 5 years ago.
For performance reasons, this select could be restricted to the last 3 months (about 300,000 records), but only if it is not physically possible to do it for the whole table.
Oracle version: 11g
The data sample seems as following
Table Phases (ID, START_TS, END_TS, PRIO)
1 10:00:00 10:20:10 10
2 10:05:00 10:10:00 11
3 10:05:20 10:15:00 9
4 10:16:00 10:25:00 8
5 10:24:00 10:45:15 1
6 10:26:00 10:30:00 10
7 10:27:00 10:35:00 15
8 10:34:00 10:50:00 5
9 10:50:00 10:55:00 20
10 10:55:00 11:00:00 15
Above you can see how the information is currently stored (of course there are several other columns with irrelevant information).
There are two requirements (or problems to be solved):
If we sum the durations of all the phases, the result is MUCH more than the one hour that the above data represent. (There can be gaps between the phases, so taking the first start_ts and the last end_ts would not be sufficient.)
The data should be displayed in a form that makes it visible which phases run in parallel with which, and which phase had the highest priority at each point in time, as shown in the expected view below.
Here it is easy to distinguish the highest-priority phase at each time (HIGHEST_PRIO), and adding up their durations would give the actual total duration.
View V_Parallel_Phases (ID, START_TS, END_TS, PRIO, HIGHEST_PRIO)
-> Optional Columns: Part_of_ID / Runs_Parallel
1 10:00:00 10:05:20 10 True (--> Part_1 / False)
1 10:05:20 10:15:00 10 False (--> Part_2 / True)
2 10:05:00 10:10:00 11 False (--> Part_1 / True)
3 10:05:20 10:15:00 9 True (--> Part_1 / True)
1 10:15:00 10:16:00 10 True (--> Part_3 / True)
1 10:16:00 10:20:10 10 False (--> Part_4 / True)
4 10:16:00 10:24:00 8 True (--> Part_1 / True)
4 10:24:00 10:25:00 8 False (--> Part_2 / True)
5 10:24:00 10:45:15 1 True (--> Part_1 / True)
6 10:26:00 10:30:00 10 False (--> Part_1 / True)
7 10:27:00 10:35:00 15 False (--> Part_1 / True)
8 10:34:00 10:45:15 5 False (--> Part_1 / True)
8 10:45:15 10:50:00 5 True (--> Part_2 / True)
9 10:50:00 10:55:00 20 True (--> Part_2 / False)
10 10:55:00 11:00:00 15 True (--> Part_2 / False)
Unfortunately I am not aware of an efficient way to write this query. The current solution was to do the above calculations programmatically in the tool that generates a large report, but that was a total failure: from the 30 seconds needed before these calculations, it now takes over 10 minutes, without even taking the priorities of the phases into consideration.
Then I thought of translating this logic into SQL as either: a) a view, b) a materialized view, or c) a table that I would fill with a procedure once in a while (depending on the required duration).
PS: I am aware that Oracle has analytic functions that can handle complicated queries, but I do not know which of them could actually help with the current problem.
Thank you in advance!
This is an incomplete answer, but I need to know if this approach is viable before going on. I believe it is possible to do this completely in SQL, but I am not sure how the performance will be.
First find out all points in time where there is a transition:
CREATE VIEW Events AS
SELECT START_TS AS TS
FROM Phases
UNION
SELECT END_TS AS TS
FROM Phases
;
Then create (start, end) tuples from those points in time:
CREATE VIEW Segments AS
SELECT START.TS AS START_TS,
       MIN(END.TS) AS END_TS
FROM Events AS START
JOIN Events AS END
WHERE START.TS < END.TS
GROUP BY START.TS
;
From here on, doing the rest should be fairly straight forward. Here is a query that lists the segments and all the phases that are active in the given segment:
SELECT *
FROM Segments
JOIN Phases
WHERE Segments.START_TS BETWEEN Phases.START_TS AND Phases.END_TS
AND Segments.END_TS BETWEEN Phases.START_TS AND Phases.END_TS
ORDER BY Segments.START_TS
;
The rest can be done with subselects and some aggregates.
| START_TS | END_TS | ID | START_TS | END_TS | PRIO |
|----------|----------|----|----------|----------|------|
| 10:00:00 | 10:05:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:00 | 10:05:20 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:00 | 10:05:20 | 2 | 10:05:00 | 10:10:00 | 11 |
| 10:05:20 | 10:10:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:05:20 | 10:10:00 | 2 | 10:05:00 | 10:10:00 | 11 |
| 10:05:20 | 10:10:00 | 3 | 10:05:20 | 10:15:00 | 9 |
| 10:10:00 | 10:15:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:10:00 | 10:15:00 | 3 | 10:05:20 | 10:15:00 | 9 |
| 10:15:00 | 10:16:00 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:16:00 | 10:20:10 | 1 | 10:00:00 | 10:20:10 | 10 |
| 10:16:00 | 10:20:10 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:20:10 | 10:24:00 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:24:00 | 10:25:00 | 4 | 10:16:00 | 10:25:00 | 8 |
| 10:24:00 | 10:25:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:25:00 | 10:26:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:26:00 | 10:27:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:26:00 | 10:27:00 | 6 | 10:26:00 | 10:30:00 | 10 |
| 10:27:00 | 10:30:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:27:00 | 10:30:00 | 6 | 10:26:00 | 10:30:00 | 10 |
| 10:27:00 | 10:30:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:30:00 | 10:34:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:30:00 | 10:34:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:34:00 | 10:35:00 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:34:00 | 10:35:00 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:34:00 | 10:35:00 | 7 | 10:27:00 | 10:35:00 | 15 |
| 10:35:00 | 10:45:15 | 5 | 10:24:00 | 10:45:15 | 1 |
| 10:35:00 | 10:45:15 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:45:15 | 10:50:00 | 8 | 10:34:00 | 10:50:00 | 5 |
| 10:50:00 | 10:55:00 | 9 | 10:50:00 | 10:55:00 | 20 |
| 10:55:00 | 11:00:00 | 10 | 10:55:00 | 11:00:00 | 15 |
There is a SQL fiddle demonstrating the whole thing here:
http://sqlfiddle.com/#!9/d801b/2
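
To push the remaining step into SQL as well, one possible continuation (a sketch only, assuming a lower PRIO value means higher priority, as the expected view suggests) is to rank the phases inside each segment by priority and flag the top one. Oracle 11g supports ROW_NUMBER; the MySQL fiddle above would need version 8+ for window functions:

SELECT s.START_TS,
       s.END_TS,
       p.ID,
       p.PRIO,
       CASE
           WHEN ROW_NUMBER() OVER (PARTITION BY s.START_TS
                                   ORDER BY p.PRIO) = 1
           THEN 'True' ELSE 'False'
       END AS HIGHEST_PRIO
FROM Segments s
JOIN Phases p
  ON s.START_TS >= p.START_TS
 AND s.END_TS <= p.END_TS
ORDER BY p.ID, s.START_TS;

This splits each phase into the segments it overlaps and marks the highest-priority phase per segment, which is essentially the V_Parallel_Phases layout the question asks for.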