SQL aggregation query, non-linear cost

I'm doing a complex aggregation of some timeseries GPS data in a Postgres 13 + PostGIS 3 + TimescaleDB 2 database. The table I'm looking at has several million entries per day, and I want to do an aggregation (one row per day, per gps_id, per gap group ID) for several months.
Let's say that I've created a function to perform the aggregation:
-- pseudo code, won't actually work as-is...
CREATE FUNCTION my_agg_func(starttime timestamptz, endtime timestamptz)
AS
WITH gps_window AS
(SELECT gps.id,
        gps.geom,
        gps.time,
        -- find where there are 1 hour gaps in data
        lag(gps.time) OVER (PARTITION BY gps.id ORDER BY gps.time) <= (gps.time - '01:00:00'::interval) AS time_step,
        -- find where there are 0.1 deg gaps in position
        st_distance(gps.geom, lag(gps.geom) OVER (PARTITION BY gps.id ORDER BY gps.time)) >= 0.1 AS dist_step,
        -- distance to the previous point, used to drop duplicate points
        st_distance(gps.geom, lag(gps.geom) OVER (PARTITION BY gps.id ORDER BY gps.time)) AS dist
 FROM gps
 WHERE gps.time BETWEEN starttime AND endtime
), groups AS (
SELECT gps_window.id,
       gps_window.geom,
       gps_window.time,
       count(*) FILTER (WHERE gps_window.time_step) OVER (PARTITION BY gps_window.id ORDER BY gps_window.time) AS time_grp,
       count(*) FILTER (WHERE gps_window.dist_step) OVER (PARTITION BY gps_window.id ORDER BY gps_window.time) AS dist_grp
FROM gps_window
-- get rid of duplicate points
WHERE gps_window.dist > 0
)
SELECT
    id AS gps_id,
    date(time),
    time_grp,
    dist_grp,
    st_setsrid(st_makeline(geom ORDER BY time), 4326) AS geom
FROM groups
WHERE time BETWEEN starttime AND endtime
GROUP BY id, date(time), time_grp, dist_grp
where the gap-ID checks look for sequential GPS points from the same gps_id that are too far from each other, travelled unreasonably fast, or had too much time between messages. The aggregates basically create a line from the GPS points. The end result is a bunch of lines where all the points in the line are "reasonable".
To run the aggregation function for 1 day (starttime = '2020-01-01', endtime = '2020-01-02') it takes about 12 seconds to complete. If I choose a week of data, it takes 10 minutes. If I choose a month of data, it takes 15+ hours to complete.
I would expect roughly linear performance since the data is going to be grouped per day anyway, but this isn't the case. The obvious way to get around this performance bottleneck would be to run this in a for loop:
for date in date_range(starttime, endtime):
    my_agg_func(date, date + 1)
I can do this in Python, but does anyone have ideas on how to either get a for loop running in Postgres or alter the aggregation query so its cost is linear?
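For the loop-in-Postgres route, one option is a PL/pgSQL DO block that walks the date range one day at a time. A minimal sketch, assuming my_agg_func exists as a set-returning function and that the output is collected into a hypothetical daily_tracks table:
-- Sketch: loop over one-day windows inside Postgres (PL/pgSQL).
-- Assumes my_agg_func(timestamptz, timestamptz) returns the aggregated rows
-- and daily_tracks is a hypothetical table to collect the output.
DO $$
DECLARE
    d date;
BEGIN
    FOR d IN
        SELECT generate_series('2020-01-01'::date, '2020-01-31'::date, interval '1 day')::date
    LOOP
        INSERT INTO daily_tracks
        SELECT * FROM my_agg_func(d, d + 1);
    END LOOP;
END $$;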

The aggregation of time intervals (known as the COLLAPSE operator in the SQL literature) leads to complex queries whose execution cost can be exponential or polynomial depending on the method used. The old classical SQL formulations by Snodgrass or Chris Date are exponential. More recently Itzik Ben-Gan, a Microsoft SQL Server MVP, wrote a polynomial form that gives excellent response times, but it uses the CROSS APPLY operator, invented by Microsoft and since adopted by Oracle... The query is as follows:
WITH
C1 AS (SELECT ITV_ITEM, ITV_DEBUT AS ts, +1 AS genre, NULL AS e,
ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ITV_DEBUT) AS s
FROM T_INTERVAL_ITV
UNION ALL
SELECT ITV_ITEM, ITV_FIN AS ts, -1 AS genre,
ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ITV_FIN) AS e,
NULL AS s
FROM T_INTERVAL_ITV),
C2 AS (SELECT C1.*, ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ts, genre DESC)
AS se
FROM C1),
C3 AS (SELECT ITV_ITEM, ts,
FLOOR((ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ts) - 1) / 2 + 1)
AS grpnum
FROM C2
WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0),
C4 AS (SELECT ITV_ITEM, MIN(ts) AS ITV_DEBUT, max(ts) AS ITV_FIN
FROM C3
GROUP BY ITV_ITEM, grpnum)
SELECT A.ITV_ITEM, A.ITV_DEBUT, A.ITV_FIN
FROM (SELECT DISTINCT ITV_ITEM
FROM T_INTERVAL_ITV) AS U
CROSS APPLY (SELECT *
FROM C4
WHERE ITV_ITEM = U.ITV_ITEM) AS A
ORDER BY ITV_ITEM, ITV_DEBUT, ITV_FIN;
Can you transform this SQL Server-specific query to use a LATERAL join instead? That should give you a better execution time.
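For reference, a minimal sketch of how the final SELECT would look in PostgreSQL once CROSS APPLY is swapped for CROSS JOIN LATERAL (the C1 to C4 CTEs above stay as they are, pointed at your own table instead of T_INTERVAL_ITV):
-- Sketch only: CROSS APPLY becomes CROSS JOIN LATERAL in PostgreSQL;
-- the C1..C4 CTEs are assumed unchanged.
SELECT A.ITV_ITEM, A.ITV_DEBUT, A.ITV_FIN
FROM (SELECT DISTINCT ITV_ITEM
      FROM T_INTERVAL_ITV) AS U
CROSS JOIN LATERAL (SELECT *
                    FROM C4
                    WHERE C4.ITV_ITEM = U.ITV_ITEM) AS A
ORDER BY A.ITV_ITEM, A.ITV_DEBUT, A.ITV_FIN;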

Related

Query (or algorithm?) to find cheapest overlapping flights from two distinct locations to one shared location

Say we have a dataset of 500,000 flights from Los Angeles to 80 cities in Europe and back, and from Saint Petersburg to the same 80 cities in Europe and back. We want to find 4 flights such that:
from LA to city X, from city X back to LA, from St P to city X, and from city X back to St P
all 4 flights have to be in a time window of 4 days
we are looking for the cheapest combined price of the 4 flights
city X can be any of the 80 cities; we want to find the cheapest such combination for each of them and get the list of these 80 combinations
The data is stored in BigQuery and I've created an SQL query, but it has 3 joins and I assume that under the hood it can have complexity of O(n^4), because the query didn't finish in 30 minutes and I had to abort it.
Here's the schema for the table (schema image omitted) and the query below:
select * from (
select in_led.`from` as city,
in_led.price + out_led.price + in_lax.price + out_lax.price as total_price,
out_led.carrier as out_led_carrier,
out_led.departure as out_led_departure,
in_led.departure as in_led_date,
in_led.carrier as in_led_carrier,
out_lax.carrier as out_lax_carrier,
out_lax.departure as out_lax_departure,
in_lax.departure as in_lax_date,
in_lax.carrier as in_lax_carrier,
row_number() over(partition by in_led.`from` order by in_led.price + out_led.price + in_lax.price + out_lax.price) as rn
from skyscanner.quotes as in_led
join skyscanner.quotes as out_led on out_led.`to` = in_led.`from`
join skyscanner.quotes as out_lax on out_lax.`to` = in_led.`from`
join skyscanner.quotes as in_lax on in_lax.`from` = in_led.`from`
where in_led.`to` = "LED"
and out_led.`from` = "LED"
and in_lax.`to` in ("LAX", "LAXA")
and out_lax.`from` in ("LAX", "LAXA")
and DATE_DIFF(DATE(in_led.departure), DATE(out_led.departure), DAY) < 4
and DATE_DIFF(DATE(in_led.departure), DATE(out_led.departure), DAY) > 0
and DATE_DIFF(DATE(in_lax.departure), DATE(out_lax.departure), DAY) < 4
and DATE_DIFF(DATE(in_lax.departure), DATE(out_lax.departure), DAY) > 0
order by total_price
)
where rn=1
Additional details:
all flights' departure dates fall within a 120-day window
Questions:
Is there a way to optimize this query for better performance?
How do I properly classify this problem? The brute-force solution is way too slow, but I'm failing to see what type of problem this is. It certainly doesn't look like something for graphs; it kind of feels like sorting the table a couple of times by different fields with a stable sort might help, but that still seems sub-optimal.
Below is for BigQuery Standard SQL
The brute force solution is way too slow, but I'm failing to see what type of problem this is.
so I would like to see solutions other than brute force if anyone here has ideas
#standardSQL
WITH temp AS (
SELECT DISTINCT *, UNIX_DATE(DATE(departure)) AS dep FROM `skyscanner.quotes`
), round_trips AS (
SELECT t1.from, t1.to, t2.to AS back, t1.price, t1.departure, t1.dep first_day, t1.carrier, t2.departure AS departure2, t2.dep AS last_day, t2.price AS price2, t2.carrier AS carrier2,
FROM temp t1
JOIN temp t2
ON t1.to = t2.from
AND t1.from = t2.to
AND t2.dep BETWEEN t1.dep + 1 AND t1.dep + 3
WHERE t1.from IN ('LAX', 'LED')
)
SELECT cityX, total_price,
( SELECT COUNT(1)
FROM UNNEST(GENERATE_ARRAY(t1.first_day, t1.last_day)) day
JOIN UNNEST(GENERATE_ARRAY(t2.first_day, t2.last_day)) day
USING(day)
) overlap_days_in_cityX,
(SELECT AS STRUCT departure, price, carrier, departure2, price2, carrier2
FROM UNNEST([t1])) AS LAX_CityX_LAX,
(SELECT AS STRUCT departure, price, carrier, departure2, price2, carrier2
FROM UNNEST([t2])) AS LED_CityX_LED
FROM (
SELECT AS VALUE ARRAY_AGG(t ORDER BY total_price LIMIT 1)[OFFSET(0)]
FROM (
SELECT t1.to cityX, t1.price + t1.price2 + t2.price + t2.price2 AS total_price, t1, t2
FROM round_trips t1
JOIN round_trips t2
ON t1.to = t2.to
AND t1.from < t2.from
AND t1.departure2 > t2.departure
AND t1.departure < t2.departure2
) t
GROUP BY cityX
)
ORDER BY overlap_days_in_cityX DESC, total_price
with output showing just the top 10 out of 60 total rows (output omitted here)
Brief explanation:
temp CTE: dedup the data and introduce the dep field - the number of days since the epoch - to eliminate costly TIMESTAMP functions
round_trips CTE: identify all round-trip candidates at most 4 days apart
identify those LAX and LED round trips which have overlaps
for each cityX take the cheapest combination
the final output does an extra calculation of the overlapping days in cityX and adds a little extra output with info about all the flights involved
Note: in your data the duration field is all zeros, so it is not involved - but if you had it, it would be easy to add it to the logic
the query didn't finish in 30 minutes and I had to abort it.
Is there a way to optimize this query for better performance?
My "generic recommendation" is to always learn the data, profile it, clean it - before actual coding! In your example - the data you shared has 469352 rows full of duplicates. After you remove duplicates - you got ONLY 14867 rows. So then I run your original query against that cleaned data and it took ONLY 97 sec to get result. Obviously, it does not mean we cannot optimize code itself - but at least this addresses your issue with "query didn't finish in 30 minutes and I had to abort it"

SQL Azure query aggregate performance issue

I'm trying to improve our SQL Azure database performance by replacing the use of a CURSOR, since that is (as everybody told me) something to avoid.
Our table holds GPS information: rows with a clustered index on id, secondary indexes on device and timestamp, and a geography index on location.
I'm trying to compute some statistics such as minimum speed (Doppler and computed), total distance, average speed, etc. over a period for a specific device.
I have NO choice about the stats and CAN'T change the table or output because they are in production.
I have a clear performance issue when running this inline table-valued function on my SQL Azure DB.
ALTER FUNCTION [dbo].[fn_logMetrics_3]
(
@p_device smallint,
@p_from dateTime,
@p_to dateTime,
@p_moveThresold int = 1
)
RETURNS TABLE
AS
RETURN
(
WITH CTE AS
(
SELECT
ROW_NUMBER() OVER(ORDER BY timestamp) AS RowNum,
Timestamp,
Location,
Alt,
Speed
FROM
LogEvents
WHERE
Device = @p_device
AND Timestamp >= @p_from
AND Timestamp <= @p_to),
CTE1 AS
(
SELECT
t1.Speed as Speed,
t1.Alt as Alt,
t2.Alt - t1.Alt as DeltaElevation,
t1.Timestamp as Time0,
t2.Timestamp as Time1,
DATEDIFF(second, t2.Timestamp, t1.Timestamp) as Duration,
t1.Location.STDistance(t2.Location) as Distance
FROM
CTE t1
INNER JOIN
CTE t2 ON t1.RowNum = t2.RowNum + 1),
CTE2 AS
(
SELECT
Speed, Alt,
DeltaElevation,
Time0, Time1,
Duration,
Distance,
CASE
WHEN Duration <> 0
THEN (Distance / Duration) * 3.6
ELSE NULL
END AS CSpeed,
CASE
WHEN DeltaElevation > 0
THEN DeltaElevation
ELSE NULL
END As PositiveAscent,
CASE
WHEN DeltaElevation < 0
THEN DeltaElevation
ELSE NULL
END As NegativeAscent,
CASE
WHEN Distance < @p_moveThresold
THEN Duration
ELSE NULL
END As StopTime,
CASE
WHEN Distance > @p_moveThresold
THEN Duration
ELSE NULL
END As MoveTime
FROM
CTE1 t1
)
SELECT
COUNT(*) as Count,
MIN(Speed) as HSpeedMin, MAX(Speed) as HSpeedMax,
AVG(Speed) as HSpeedAverage,
MIN(CSpeed) as CHSpeedMin, MAX(CSpeed) as CHSpeedMax,
AVG(CSpeed) as CHSpeedAverage,
SUM(Distance) as CumulativeDistance,
MAX(Alt) as AltMin, MIN(Alt) as AltMax,
SUM(PositiveAscent) as PositiveAscent,
SUM(NegativeAscent) as NegativeAscent,
SUM(StopTime) as StopTime,
SUM(MoveTime) as MoveTime
FROM
CTE2 t1
)
The broad idea is:
CTE selects the corresponding rows, following the parameters
CTE1 joins each row with the next one, in order to get Duration and Distance
then CTE2 performs operations on these Distance and Duration values
Finally, the last SELECT does aggregation such as sums and averages over each column
Everything works pretty well until the last SELECT, where the aggregate functions (which are only a few sums and averages) kill the performance.
This query, selecting 1500 rows against a table with 4M rows, takes 1500 ms.
When replacing the last SELECT with
SELECT COUNT(*) as count FROM CTE2 t1
it takes only a few ms (down to 2 ms according to SQL Studio statistics).
with
SELECT
COUNT(*) as Count,
SUM(MoveTime) as MoveTime
it's about 125ms
with
SELECT
COUNT(*) as Count,
SUM(StopTime) as StopTime,
SUM(MoveTime) as MoveTime
it's about 250ms
It is as if each aggregate runs as a separate loop operation over all the rows, within the same thread and without being parallelized.
For information, the CURSOR version of this function (which I wrote a couple of years ago) actually runs at least twice as fast...
What is wrong with this aggregate? How to optimize it?
UPDATE:
The query plans for SELECT COUNT(*) as Count and for the full SELECT with aggregates were captured (plan screenshots omitted here).
Following the answer from Joe C, I introduced a #tmp table into the plan and performed the aggregates on it. The result is about twice as fast, which is an interesting fact.
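A minimal sketch of that #tmp variant, assuming the body is run as a stored procedure or ad-hoc batch rather than an inline table-valued function (an inline TVF cannot create temp tables); the CTE / CTE1 / CTE2 definitions are the same as in the function above and are abbreviated here:
-- Sketch: materialize CTE2's rows into a temp table, then aggregate over it.
WITH CTE AS ( /* ... unchanged from the function above ... */ ),
     CTE1 AS ( /* ... unchanged ... */ ),
     CTE2 AS ( /* ... unchanged ... */ )
SELECT *
INTO #tmp
FROM CTE2;

SELECT COUNT(*) AS Count,
       MIN(Speed) AS HSpeedMin, MAX(Speed) AS HSpeedMax, AVG(Speed) AS HSpeedAverage,
       SUM(Distance) AS CumulativeDistance,
       SUM(StopTime) AS StopTime, SUM(MoveTime) AS MoveTime
       -- (remaining MIN/MAX/AVG/SUM columns as in the original final SELECT)
FROM #tmp;

DROP TABLE #tmp;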

SQL Server get customer with 7 consecutive transactions

I am trying to write a query that would get the customers with 7 consecutive transactions given a list of CustomerKeys.
I am currently doing a self join on a customer fact table that has 700 million records in SQL Server 2008.
This is what I came up with, but it's taking a long time to run. I have a clustered index on (CustomerKey, TranDateKey).
SELECT
ct1.CustomerKey,ct1.TranDateKey
FROM
CustomerTransactionFact ct1
INNER JOIN
#CRTCustomerList dl ON ct1.CustomerKey = dl.CustomerKey --temp table with customer list
INNER JOIN
dbo.CustomerTransactionFact ct2 ON ct1.CustomerKey = ct2.CustomerKey -- Same Customer
AND ct2.TranDateKey >= ct1.TranDateKey
AND ct2.TranDateKey <= CONVERT(VARCHAR(8), dateadd(d, 6, ct1.TranDateTime), 112) -- Consecutive Transactions in the last 7 days
WHERE
ct1.LogID >= 82800000
AND ct2.LogID >= 82800000
AND ct1.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
AND ct2.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
GROUP BY
ct1.CustomerKey,ct1.TranDateKey
HAVING
COUNT(*) = 7
Please help make it more efficient. Is there a better way to write this query in 2008?
You can do this using window functions, which should be much faster. Assuming that TranDateKey is a number and you can subtract a sequential number from it, the difference is constant for consecutive days.
You can put this in a query like this:
SELECT CustomerKey, MIN(TranDateKey), MAX(TranDateKey)
FROM (SELECT ct.CustomerKey, ct.TranDateKey,
(ct.TranDateKey -
DENSE_RANK() OVER (PARTITION BY ct.CustomerKey ORDER BY ct.TranDateKey)
) as grp
FROM CustomerTransactionFact ct INNER JOIN
#CRTCustomerList dl
ON ct.CustomerKey = dl.CustomerKey
) t
GROUP BY CustomerKey, grp
HAVING COUNT(*) = 7;
If your date key is something else, there is probably a way to modify the query to handle that, but you might have to join to the dimension table.
This would be a perfect task for a COUNT(*) OVER (RANGE ...), but SQL Server 2008 supports only a limited syntax for Windowed Aggregate Functions.
SELECT CustomerKey, MIN(TranDateKey), COUNT(*)
FROM
(
SELECT CustomerKey, TranDateKey,
dateadd(d,-ROW_NUMBER()
OVER (PARTITION BY CustomerKey
ORDER BY TranDateKey),TranDateTime) AS dummyDate
FROM CustomerTransactionFact
) AS dt
GROUP BY CustomerKey, dummyDate
HAVING COUNT(*) >= 7
The dateadd calculates the difference between the current TranDateTime and a ROW_NUMBER over all dates per customer. The resulting dummyDate has no actual meaning, but it is the same meaningless date for consecutive dates.

Can I repeat a Union Join in Oracle SQL for different values defined by another query?

I have the following query that returns the decile for a group of users based on their mark per programme:
SELECT prog_code,
user_code,
user_mark,
NTILE(10) over (order by user_mark DESC) DECILE
FROM grade_result
where user_mark IS NOT NULL
and prog_year = '2011'
AND prog_code = 'ALPHA'
I need to run this for a total of 40 different prog_code values at the same time, which could be joined together via 39 UNIONs, but this seems massively inefficient (I can't run it as a single SELECT statement, as the decile would then be over all programmes rather than by programme). Is there a way I can get the query to repeat (loop?) for each of these 40 values as a UNION, without having to enter each one myself?
I can return the programme codes and a rownum in a separate query or subquery if this is any use:
ROWNUM PROG_CODE
1 ALPHA
2 BETA
3 GAMMA
4 DELTA
5 ECHO
Can you simply use the partition clause in your NTILE function?
SELECT prog_code,
user_code,
user_mark,
NTILE(10) over (PARTITION BY prog_code ORDER BY user_mark DESC) DECILE
FROM grade_result
where user_mark IS NOT NULL
and prog_year = '2011';

SQL Calculate Difference Between Current and Last Value By Timestamp Column

I am looking to calculate the difference between the current and previous Value, ordered by the timestamp column.
My table is organised as follows:
MeterID (PK, FK, int, not null), ActualTimeStamp (smalldatetime, not null), Value (float, null)
Meter ID ActualTimeStamp Value
312514 2013-01-01 08:08:00 72
312514 2013-01-01 08:07:00 12
So my answer should be 72 - 12 = 60
The only solutions I can seem to find use ROW_NUMBER, which isn't an option for me. If anyone can assist I'd really appreciate it, as it's busting my brain.
Here's a query that can help you. Just modify this to fit your need/table names/etc.
with sub as (
select meterid,
actualtimestamp,
value,
row_number() over (partition by meterid order by actualtimestamp desc) as rn
from test
)
select meterid,
actualtimestamp,
value,
value - isnull((select value
from sub
where s.meterid = meterid
and rn = s.rn + 1), value) as answer
from sub s
order by meterid, actualtimestamp desc;
Basically, it adds a row number using the ROW_NUMBER() window function. Using the row number, the query looks up the value from the previous entry and computes the difference.
In SQL Server 2008, I would recommend an OUTER APPLY. Here is a short version that finds the difference per your requirement:
select t.*, isnull((t.value - tprev.value),0) as diff
from test t outer apply
(select top 1 tprev.*
from test tprev
where tprev.meterid = t.meterid and
tprev.actualtimestamp < t.actualtimestamp
order by tprev.actualtimestamp desc
) tprev
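For what it's worth, on SQL Server 2012 or later the same result can be obtained with LAG(), avoiding both the self-join and the explicit row-number lookup. A minimal sketch, assuming the same test table as above:
-- Sketch for SQL Server 2012+: LAG() fetches the previous value per meter.
SELECT meterid,
       actualtimestamp,
       value,
       value - ISNULL(LAG(value) OVER (PARTITION BY meterid
                                       ORDER BY actualtimestamp), value) AS diff
FROM test
ORDER BY meterid, actualtimestamp DESC;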