SQL Azure query aggregate performance issue

I'm trying to improve our SQL Azure database performance by replacing the use of a CURSOR, since (as everybody tells me) cursors are something to avoid.
Our table holds GPS information: rows with a clustered index on id, secondary indexes on device and timestamp, and a geography index on location.
I'm trying to compute statistics such as minimum speed (Doppler and computed), total distance, average speed, etc. over a period for a specific device.
I have NO choice about the statistics and CAN'T change the table or the output, because this is in production.
I have a clear performance issue when running this inline table-valued function on my SQL Azure DB.
ALTER FUNCTION [dbo].[fn_logMetrics_3]
(
@p_device smallint,
@p_from dateTime,
@p_to dateTime,
@p_moveThresold int = 1
)
RETURNS TABLE
AS
RETURN
(
WITH CTE AS
(
SELECT
ROW_NUMBER() OVER(ORDER BY timestamp) AS RowNum,
Timestamp,
Location,
Alt,
Speed
FROM
LogEvents
WHERE
Device = @p_device
AND Timestamp >= @p_from
AND Timestamp <= @p_to),
CTE1 AS
(
SELECT
t1.Speed as Speed,
t1.Alt as Alt,
t2.Alt - t1.Alt as DeltaElevation,
t1.Timestamp as Time0,
t2.Timestamp as Time1,
DATEDIFF(second, t2.Timestamp, t1.Timestamp) as Duration,
t1.Location.STDistance(t2.Location) as Distance
FROM
CTE t1
INNER JOIN
CTE t2 ON t1.RowNum = t2.RowNum + 1),
CTE2 AS
(
SELECT
Speed, Alt,
DeltaElevation,
Time0, Time1,
Duration,
Distance,
CASE
WHEN Duration <> 0
THEN (Distance / Duration) * 3.6
ELSE NULL
END AS CSpeed,
CASE
WHEN DeltaElevation > 0
THEN DeltaElevation
ELSE NULL
END As PositiveAscent,
CASE
WHEN DeltaElevation < 0
THEN DeltaElevation
ELSE NULL
END As NegativeAscent,
CASE
WHEN Distance < @p_moveThresold
THEN Duration
ELSE NULL
END As StopTime,
CASE
WHEN Distance > @p_moveThresold
THEN Duration
ELSE NULL
END As MoveTime
FROM
CTE1 t1
)
SELECT
COUNT(*) as Count,
MIN(Speed) as HSpeedMin, MAX(Speed) as HSpeedMax,
AVG(Speed) as HSpeedAverage,
MIN(CSpeed) as CHSpeedMin, MAX(CSpeed) as CHSpeedMax,
AVG(CSpeed) as CHSpeedAverage,
SUM(Distance) as CumulativeDistance,
MIN(Alt) as AltMin, MAX(Alt) as AltMax,
SUM(PositiveAscent) as PositiveAscent,
SUM(NegativeAscent) as NegativeAscent,
SUM(StopTime) as StopTime,
SUM(MoveTime) as MoveTime
FROM
CTE2 t1
)
The broad idea is:
CTE selects the corresponding rows, according to the parameters
CTE1 pairs each row with the next one, in order to get Duration and Distance
then CTE2 performs operations on these Distance and Duration values
Finally, the last SELECT does aggregations such as sum and average over each column
Everything works pretty well until the last SELECT, where the aggregate functions (which are only a few sums and averages) kill the performance.
This query, selecting 1500 rows against a table with 4M rows, takes about 1500 ms.
When replacing the last select with
SELECT COUNT(*) as count FROM CTE2 t1
it takes only a few ms (down to 2 ms according to SQL Studio statistics).
with
SELECT
COUNT(*) as Count,
SUM(MoveTime) as MoveTime
it's about 125ms
with
SELECT
COUNT(*) as Count,
SUM(StopTime) as StopTime,
SUM(MoveTime) as MoveTime
it's about 250ms
It is as if each aggregate runs as its own sequential loop over all the rows, within the same thread and without being parallelized.
For information, the CURSOR version of this function (which I wrote a couple of years ago) actually runs at least twice as fast...
What is wrong with these aggregates? How can I optimize them?
UPDATE:
The query plan for
SELECT COUNT(*) as Count
The query plan for the full SELECT with aggregates
According to Joe C's answer, I introduced a #tmp table and performed the aggregation on it. The result is about twice as fast, which is an interesting finding.
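For reference, a minimal sketch of that temp-table variant, assuming it runs as an ad-hoc batch or stored procedure rather than inside the inline TVF (a temp table is not allowed there); the parameter values below are placeholders:
DECLARE @p_device smallint = 42,
        @p_from datetime = '2016-01-01',
        @p_to datetime = '2016-01-02',
        @p_moveThresold int = 1;

WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY Timestamp) AS RowNum,
           Timestamp, Location, Alt, Speed
    FROM LogEvents
    WHERE Device = @p_device
      AND Timestamp >= @p_from
      AND Timestamp <= @p_to
),
CTE1 AS
(
    SELECT t1.Speed,
           t1.Alt,
           t2.Alt - t1.Alt AS DeltaElevation,
           DATEDIFF(second, t2.Timestamp, t1.Timestamp) AS Duration,
           t1.Location.STDistance(t2.Location) AS Distance
    FROM CTE t1
    INNER JOIN CTE t2 ON t1.RowNum = t2.RowNum + 1
)
SELECT *,
       CASE WHEN Duration <> 0 THEN (Distance / Duration) * 3.6 END AS CSpeed,
       CASE WHEN Distance < @p_moveThresold THEN Duration END AS StopTime,
       CASE WHEN Distance > @p_moveThresold THEN Duration END AS MoveTime
INTO #tmp
FROM CTE1;

-- The aggregation now scans the small materialized set instead of re-evaluating the CTE chain.
SELECT COUNT(*) AS Count,
       MIN(Speed) AS HSpeedMin, MAX(Speed) AS HSpeedMax, AVG(Speed) AS HSpeedAverage,
       MIN(CSpeed) AS CHSpeedMin, MAX(CSpeed) AS CHSpeedMax, AVG(CSpeed) AS CHSpeedAverage,
       SUM(Distance) AS CumulativeDistance,
       SUM(StopTime) AS StopTime,
       SUM(MoveTime) AS MoveTime
FROM #tmp;

DROP TABLE #tmp;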

Related

SQL aggregation query, non linear cost

I'm doing a complex aggregation of some timeseries GPS data in a Postgres 13 + PostGIS 3 + TimescaleDB 2 database. The table I'm looking at has several million entries per day, and I want to do an aggregation (one row per day, per gps_id, per group gap ID) for several months.
Let's say that I've created a function to perform the aggregation:
--pseudo code, won't actually work...
CREATE FUNCTION my_agg_func(starttime, endtime)
AS
WITH gps_window AS
(SELECT gps.id,
gps.geom,
gps.time,
-- find where there are 1 hour gaps in data
lag(gps.time) OVER (PARTITION BY gps.id ORDER BY gps.time) <= (gps.time - '01:00:00'::interval) AS time_step,
-- find where there are 0.1 deg gaps in position
st_distance(gps.geom, lag(gps.geom) OVER (PARTITION BY gps.id ORDER BY gps.time)) >= 0.1 AS dist_step
FROM gps
WHERE gps.time BETWEEN starttime AND endtime
), groups AS (
SELECT gps_window.id,
gps_window.geom,
gps_window.time,
count(*) FILTER (WHERE gps_window.time_step) OVER (PARTITION BY gps_window.id ORDER BY gps_window.time) AS time_grp,
count(*) FILTER (WHERE gps_window.dist_step) OVER (PARTITION BY gps_window.id ORDER BY gps_window.time) AS dist_grp
FROM gps_window
--get rid of duplicate points
WHERE gps_window.dist > 0
)
SELECT
gps_window.id,
date(gps_window.time),
time_grp,
dist_grp,
st_setsrid(st_makeline(gps_window.geom ORDER BY gps_window.time), 4326) AS geom
FROM groups AS gps_window
WHERE gps_window.time BETWEEN starttime AND endtime
GROUP BY gps_window.id, date(gps_window.time), time_grp, dist_grp
where the gap_id functions check for sequential GPS points from the same gps_id that are too distant from each other, traveled unreasonably fast, or had too much time between messages. The aggregates basically create a line from the GPS points. The end result is a bunch of lines where all the points in the line are "reasonable".
To run the aggregation function for 1 day (starttime = '2020-01-01', endtime = '2020-01-02') it takes about 12 secs to complete. If I choose a week of data, it takes 10 minutes. If I choose a month of data it takes 15h+ to complete.
I would expect linear performance since the data is going to be grouped per day anyway but this isn't the case. The obvious way to get around this performance bottleneck would be to run this in a for loop:
for date in date_range(starttime, endtime):
my_agg_func(date, date+1)
I can do this in Python, but any ideas on how to either get a for loop running in Postgres or alter the aggregation query so that it scales linearly?
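For reference, a day-by-day loop can be expressed directly in Postgres with a DO block; this is only a sketch, assuming my_agg_func exists as a real function and persists its output somewhere (a DO block cannot return rows):
-- Call the aggregation one day at a time from PL/pgSQL.
DO $$
DECLARE
    d date;
BEGIN
    FOR d IN SELECT generate_series('2020-01-01'::date, '2020-01-31'::date, '1 day')::date
    LOOP
        PERFORM my_agg_func(d, d + 1);
    END LOOP;
END;
$$;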
The aggregation of time intervals (known as the COLLAPSE operator in the SQL literature) leads to complex queries whose execution cost can be exponential or polynomial depending on the method used. The old classical SQL formulations by Snodgrass or Chris Date are exponential. More recently Itzik Ben-Gan, a Microsoft SQL Server MVP, wrote a polynomial form which gives excellent response times, but it uses CROSS APPLY, an operator invented by Microsoft and since adopted by Oracle... The query is as follows:
WITH
C1 AS (SELECT ITV_ITEM, ITV_DEBUT AS ts, +1 AS genre, NULL AS e,
ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ITV_DEBUT) AS s
FROM T_INTERVAL_ITV
UNION ALL
SELECT ITV_ITEM, ITV_FIN AS ts, -1 AS genre,
ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ITV_FIN) AS e,
NULL AS s
FROM T_INTERVAL_ITV),
C2 AS (SELECT C1.*, ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ts, genre DESC)
AS se
FROM C1),
C3 AS (SELECT ITV_ITEM, ts,
FLOOR((ROW_NUMBER() OVER(PARTITION BY ITV_ITEM ORDER BY ts) - 1) / 2 + 1)
AS grpnum
FROM C2
WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0),
C4 AS (SELECT ITV_ITEM, MIN(ts) AS ITV_DEBUT, max(ts) AS ITV_FIN
FROM C3
GROUP BY ITV_ITEM, grpnum)
SELECT A.ITV_ITEM, A.ITV_DEBUT, A.ITV_FIN
FROM (SELECT DISTINCT ITV_ITEM
FROM T_INTERVAL_ITV) AS U
CROSS APPLY (SELECT *
FROM C4
WHERE ITV_ITEM = U.ITV_ITEM) AS A
ORDER BY ITV_ITEM, ITV_DEBUT, ITV_FIN;
Can you transform this SQL Server-specific query by using a LATERAL join? This should help you get a better execution time.
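For comparison, the final CROSS APPLY maps directly onto PostgreSQL's CROSS JOIN LATERAL; a sketch of just that last step, keeping the C1..C4 CTEs unchanged:
-- PostgreSQL version of the final step: CROSS APPLY becomes CROSS JOIN LATERAL.
SELECT A.ITV_ITEM, A.ITV_DEBUT, A.ITV_FIN
FROM (SELECT DISTINCT ITV_ITEM
      FROM T_INTERVAL_ITV) AS U
CROSS JOIN LATERAL (SELECT *
                    FROM C4
                    WHERE C4.ITV_ITEM = U.ITV_ITEM) AS A
ORDER BY ITV_ITEM, ITV_DEBUT, ITV_FIN;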

Speed up SQL simple Query

We have a table called PROTOKOLL, with the following definition:
PROTOKOLL TableDefinition
The table has 10 million records.
SELECT *
FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM rnum
FROM (SELECT t0.*, t1.*
FROM PROTOKOLL t0
, PROTOKOLL t1
WHERE (
(
(t0.BENUTZER_ID = 'A07BU0006')
AND (t0.TYP = 'E')
) AND
(
(t1.UUID = t0.ANDERES_PROTOKOLL_UUID)
AND
(t1.TYP = 'A')
)
)
ORDER BY t0.ZEITPUNKT DESC
) a
WHERE ROWNUM <= 4999) WHERE rnum > 0;
So practically we join the table with itself through the ANDERES_PROTOKOLL_UUID field and apply simple filters. The results are sorted by creation time, and the result set is limited to 5,000 records.
The elapsed time of the query is about 10 minutes, which is not acceptable.
I already have the execution plan and statistics information in place and am trying to figure out how to speed up the query; please find them attached.
My first observation is that the optimizer adds the condition "P"."ANDERES_PROTOKOLL_UUID" IS NOT NULL to the WHERE clause, but I do not know why. Is that a problem?
Or where is the bottleneck of the query?
How can I avoid it? Any suggestion is welcome.
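For illustration, the kind of supporting indexes I imagine such a filter-plus-self-join could benefit from might look like the following (column choices are only guesses based on the predicates above, not from the actual schema):
-- Hypothetical supporting indexes (Oracle syntax).
-- Serves the t0 side: the BENUTZER_ID/TYP filter plus the ORDER BY on ZEITPUNKT.
CREATE INDEX IX_PROTOKOLL_BEN_TYP_ZEIT ON PROTOKOLL (BENUTZER_ID, TYP, ZEITPUNKT DESC);
-- Serves the t1 side: the lookup by UUID together with the TYP filter.
CREATE INDEX IX_PROTOKOLL_UUID_TYP ON PROTOKOLL (UUID, TYP);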

SQL Server get customer with 7 consecutive transactions

I am trying to write a query that would get the customers with 7 consecutive transactions given a list of CustomerKeys.
I am currently doing a self join on Customer fact table that has 700 Million records in SQL Server 2008.
This is what I came up with, but it's taking a long time to run. I have a clustered index on (CustomerKey, TranDateKey).
SELECT
ct1.CustomerKey,ct1.TranDateKey
FROM
CustomerTransactionFact ct1
INNER JOIN
#CRTCustomerList dl ON ct1.CustomerKey = dl.CustomerKey --temp table with customer list
INNER JOIN
dbo.CustomerTransactionFact ct2 ON ct1.CustomerKey = ct2.CustomerKey -- Same Customer
AND ct2.TranDateKey >= ct1.TranDateKey
AND ct2.TranDateKey <= CONVERT(VARCHAR(8), dateadd(d, 6, ct1.TranDateTime), 112) -- Consecutive Transactions in the last 7 days
WHERE
ct1.LogID >= 82800000
AND ct2.LogID >= 82800000
AND ct1.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
AND ct2.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
GROUP BY
ct1.CustomerKey,ct1.TranDateKey
HAVING
COUNT(*) = 7
Please help make it more efficient. Is there a better way to write this query in 2008?
You can do this using window functions, which should be much faster. Assuming that TranDateKey is a number and you can subtract a sequential number from it, the difference is constant for consecutive days.
You can put this in a query like this:
SELECT CustomerKey, MIN(TranDateKey), MAX(TranDateKey)
FROM (SELECT ct.CustomerKey, ct.TranDateKey,
(ct.TranDateKey -
DENSE_RANK() OVER (PARTITION BY ct.CustomerKey ORDER BY ct.TranDateKey)
) as grp
FROM CustomerTransactionFact ct INNER JOIN
#CRTCustomerList dl
ON ct.CustomerKey = dl.CustomerKey
) t
GROUP BY CustomerKey, grp
HAVING COUNT(*) = 7;
If your date key is something else, there is probably a way to modify the query to handle that, but you might have to join to the dimension table.
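For instance, a sketch of that variation, assuming a hypothetical date dimension dbo.DimDate with columns DateKey and DateValue (those names are placeholders, not from the actual schema):
-- Convert the key to a real date via the dimension, then apply the same trick.
SELECT CustomerKey, MIN(TranDateKey), MAX(TranDateKey)
FROM (SELECT ct.CustomerKey, ct.TranDateKey, d.DateValue,
             DATEADD(day,
                     -CAST(DENSE_RANK() OVER (PARTITION BY ct.CustomerKey
                                              ORDER BY d.DateValue) AS int),
                     d.DateValue) AS grp
      FROM CustomerTransactionFact ct
      INNER JOIN dbo.DimDate d ON d.DateKey = ct.TranDateKey
      INNER JOIN #CRTCustomerList dl ON ct.CustomerKey = dl.CustomerKey
     ) t
GROUP BY CustomerKey, grp
HAVING COUNT(DISTINCT DateValue) >= 7; -- at least 7 distinct consecutive days, even with several transactions per day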
This would be a perfect task for a COUNT(*) OVER (RANGE ...), but SQL Server 2008 supports only a limited syntax for Windowed Aggregate Functions.
SELECT CustomerKey, MIN(TranDateKey), COUNT(*)
FROM
(
SELECT CustomerKey, TranDateKey,
dateadd(d,-ROW_NUMBER()
OVER (PARTITION BY CustomerKey
ORDER BY TranDateKey),TranDateTime) AS dummyDate
FROM CustomerTransactionFact
) AS dt
GROUP BY CustomerKey, dummyDate
HAVING COUNT(*) >= 7
The DATEADD subtracts a ROW_NUMBER (ordered by date per customer) from the current TranDateTime. The resulting dummyDate has no actual meaning, but it is the same meaningless date for consecutive dates.
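A tiny worked example of that collapse, with made-up dates:
-- Rows 1-3 are consecutive days, so date minus row number gives the same
-- dummy date (2014-05-31) for all three; the gap before 20140605 shifts the
-- result to 2014-06-01, i.e. a new group.
SELECT dt, n, DATEADD(d, -n, CAST(dt AS datetime)) AS dummyDate
FROM (VALUES ('20140601', 1), ('20140602', 2),
             ('20140603', 3), ('20140605', 4)) AS v(dt, n);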

TSQL Last Record Efficiency Cursor, SubQuery, or CTE

Consider the following query...
SELECT
*
,CAST(
(CurrentSampleDateTime - PreviousSampleDateTime) AS FLOAT
) * 24.0 * 60.0 AS DeltaMinutes
FROM
(
SELECT
C.SampleDateTime AS CurrentSampleDateTime
,C.Location
,C.CurrentValue
,(
SELECT TOP 1
Previous.SampleDateTime
FROM Samples AS Previous
WHERE
Previous.Location = C.Location
AND Previous.SampleDateTime < C.SampleDateTime
ORDER BY Previous.SampleDateTime DESC
) AS PreviousSampleDateTime
FROM Samples AS C
) AS TempResults
Assuming all other things are equal, such as indexing, is this the most efficient way of achieving the above results, that is, using a subquery to retrieve the last record?
Would I be better off creating a cursor that orders by Location, SampleDateTime, sets up variables for CurrentSampleDateTime and PreviousSampleDateTime, and sets the Previous to the Current at the bottom of the while loop?
I'm not very good with CTEs; is this something that could be accomplished more efficiently with a CTE? If so, what would that look like?
I'm likely going to have to retrieve PreviousValue along with PreviousSampleDateTime in order to get an average of the two. Does that change the answer at all?
Long story short, what is the best/most efficient way of holding onto the values of a previous record if you need to use those values in calculations on the current record?
UPDATE:
I should note that I have a clustered index on Location, SampleDateTime, CurrentValue, so maybe that is what is affecting the results more than anything.
With 5,591,571 records, my query (the one above) takes on average 3 minutes and 20 seconds.
The CTE that Joachim Isaksson posted below takes on average 5 minutes and 15 seconds.
Maybe it's taking longer because it's not using the clustered index but is using the row number for the joins?
I started testing the cursor method, but it's already at 10 minutes... so that one is a no-go.
I'll give it a day or so, but I think I will accept the CTE answer provided by Joachim Isaksson, just because I found a new method of getting the last row.
Can anyone confirm that it's the index on Location, SampleDateTime, CurrentValue that is making the subquery method faster?
I don't have SQL Server 2012 so can't test the LEAD/LAG method. I'd bet that would be quicker than anything I've tried assuming Microsoft implemented that efficiently. Probably just have to swap a pointer to a memory reference at the end of each row.
If you are using SQL Server 2012, you can use the LAG window function that retrieves the value of the specified column from the previous row. It returns null if there is no previous row.
SELECT
a.*,
CAST((a.SampleDateTime - LAG(a.SampleDateTime) OVER(PARTITION BY a.location ORDER BY a.SampleDateTime ASC)) AS FLOAT)
* 24.0 * 60.0 AS DeltaMinutes
FROM samples a
ORDER BY
a.location,
a.SampleDateTime
You'd have to run some tests to see if it's faster. If you're not using SQL Server 2012 then at least this may give others an idea of how it can be done with 2012. I like @Joachim Isaksson's answer using a CTE with a Row_Number()/Partition By for 2008 and 2005.
SQL Fiddle
Have you considered creating a temp table to use instead of a CTE or subquery? You can create indexes on the temp table that are more suited for the join on RowNumber.
CREATE TABLE #tmp (
RowNumber INT,
Location INT,
SampleDateTime DATETIME,
CurrentValue INT)
;
INSERT INTO #tmp
SELECT
ROW_NUMBER() OVER (PARTITION BY Location
ORDER BY SampleDateTime DESC) rn,
Location,
SampleDateTime,
CurrentValue
FROM Samples
;
CREATE INDEX idx_location_row ON #tmp(Location,RowNumber) INCLUDE (SampleDateTime,CurrentValue);
SELECT
a.Location,
a.SampleDateTime,
a.CurrentValue,
CAST((a.SampleDateTime - b.SampleDateTime) AS FLOAT) * 24.0 * 60.0 AS DeltaMinutes
FROM #tmp a
LEFT JOIN #tmp b ON
a.Location = b.Location
AND b.RowNumber = a.RowNumber +1
ORDER BY
a.Location,
a.SampleDateTime
SQL Fiddle #2
As always, testing with your real data is king.
Here's a CTE version that shows the samples for each location with time deltas from the previous sample. It uses OVER ranking, which usually does well in comparison to subqueries for solving the same problem.
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Location
ORDER BY SampleDateTime DESC) rn
FROM Samples
)
SELECT a.*,CAST((a.SampleDateTime - b.SampleDateTime) AS FLOAT)
* 24.0 * 60.0 AS DeltaMinutes
FROM cte a
LEFT JOIN cte b ON a.Location = b.Location AND b.rn = a.rn +1
An SQLfiddle to test with.

Sorting twice on same column

I have a bit of a weird question, given to me by a client.
He has a list of data, with a date between parentheses like so:
Foo (14/08/2012)
Bar (15/08/2012)
Bar (16/09/2012)
Xyz (20/10/2012)
However, he wants the list to be displayed as follows:
Foo (14/08/2012)
Bar (16/09/2012)
Bar (15/08/2012)
Xyz (20/10/2012)
(notice that the second Bar has moved up one position)
So, the logic behind it is, that the list has to be sorted by date ascending, EXCEPT when two rows have the same name ('Bar'). If they have the same name, it must be sorted with the LATEST date at the top, while staying in the other sorting order.
Is this even remotely possible? I've experimented with a lot of ORDER BY clauses, but couldn't find the right one. Does anyone have an idea?
I should have specified that this data comes from a table in a SQL Server database (the name and the date are in two different columns). So I'm looking for a SQL query that can do the sorting I want.
(I've dumbed this example down quite a bit, so if you need more context, don't hesitate to ask)
This works, I think
declare @t table (data varchar(50), date datetime)
insert @t
values
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
select t.*
from @t t
inner join (select data, COUNT(*) cg, MAX(date) as mg from @t group by data) tc
on t.data = tc.data
order by case when cg>1 then mg else date end, date desc
produces
data date
---------- -----------------------
Foo 2012-08-14 00:00:00.000
Bar 2012-09-16 00:00:00.000
Bar 2012-08-15 00:00:00.000
Xyz 2012-10-20 00:00:00.000
A way with better performance than any of the other posted answers is to do it entirely with an ORDER BY, with no JOIN or CTE:
DECLARE @t TABLE (myData varchar(50), myDate datetime)
INSERT INTO @t VALUES
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
SELECT *
FROM @t t1
ORDER BY (SELECT MIN(t2.myDate) FROM @t t2 WHERE t2.myData = t1.myData), T1.myDate DESC
This does exactly what you request, works with any indexes, and performs much better with larger amounts of data than any of the other answers.
Additionally, it's much clearer what you're actually trying to do here, rather than masking the real logic with the complexity of a join and checking the count of joined items.
This one uses analytic functions to perform the sort; it only requires one SELECT from your table.
The inner query finds gaps, where the name changes. These gaps are used to identify groups in the next query, and the outer query does the final sorting by these groups.
I have tried it here (SQL Fiddle) with extended test-data.
SELECT name, dat
FROM (
SELECT name, dat, SUM(gap) over(ORDER BY dat, name) AS grp
FROM (
SELECT name, dat,
CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name THEN 0 ELSE 1 END AS gap
FROM t
) x
) y
ORDER BY grp, dat DESC
Extended test-data
('Bar','2012-08-12'),
('Bar','2012-08-11'),
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-08-16'),
('Bar','2012-09-17'),
('Xyz','2012-10-20')
Result
Bar 2012-08-12
Bar 2012-08-11
Foo 2012-08-14
Bar 2012-09-17
Bar 2012-08-16
Bar 2012-08-15
Xyz 2012-10-20
I think that this works, including the case I asked about in the comments:
declare @t table (data varchar(50), [date] datetime)
insert @t
values
('Foo','20120814'),
('Bar','20120815'),
('Bar','20120916'),
('Xyz','20121020')
; With OuterSort as (
select *,ROW_NUMBER() OVER (ORDER BY [date] asc) as rn from @t
)
--Now we need to find contiguous ranges of the same data value, and the min and max row number for such a range
, Islands as (
select data,rn as rnMin,rn as rnMax from OuterSort os where not exists (select * from OuterSort os2 where os2.data = os.data and os2.rn = os.rn - 1)
union all
select i.data,rnMin,os.rn
from
Islands i
inner join
OuterSort os
on
i.data = os.data and
i.rnMax = os.rn-1
), FullIslands as (
select
data,rnMin,MAX(rnMax) as rnMax
from Islands
group by data,rnMin
)
select
*
from
OuterSort os
inner join
FullIslands fi
on
os.rn between fi.rnMin and fi.rnMax
order by
fi.rnMin asc,os.rn desc
It works by first computing the initial ordering in the OuterSort CTE. Then, using two CTEs (Islands and FullIslands), we compute the parts of that ordering in which the same data value appears in adjacent rows. Having done that, we can compute the final ordering by any value that all adjacent values will have (such as the lowest row number of the "island" that they belong to), and then within an "island", we use the reverse of the originally computed sort order.
Note, though, that this may not be too efficient for large data sets. On the sample data, it shows up as requiring 4 table scans of the base table, as well as a spool.
Try something like...
ORDER BY CASE date
WHEN '14/08/2012' THEN 1
WHEN '16/09/2012' THEN 2
WHEN '15/08/2012' THEN 3
WHEN '20/10/2012' THEN 4
END
In MySQL, you can do:
ORDER BY FIELD(date, '14/08/2012', '16/09/2012', '15/08/2012', '20/10/2012')
In Postgres, you can create a function FIELD and do:
CREATE OR REPLACE FUNCTION field(anyelement, anyarray) RETURNS numeric AS $$
SELECT
COALESCE((SELECT i
FROM generate_series(1, array_upper($2, 1)) gs(i)
WHERE $2[i] = $1),
0);
$$ LANGUAGE SQL STABLE
If you do not want to use the CASE, you can try to find an implementation of the FIELD function for SQL Server.
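For illustration, one possible way to emulate FIELD in SQL Server is an inline lookup against a VALUES list; this is only a sketch against the @t sample table from the earlier answer, not a built-in function:
-- Order rows by their position in an explicit list, like MySQL's FIELD().
SELECT t.*
FROM @t t
ORDER BY (SELECT f.pos
          FROM (VALUES (CAST('2012-08-14' AS datetime), 1),
                       (CAST('2012-09-16' AS datetime), 2),
                       (CAST('2012-08-15' AS datetime), 3),
                       (CAST('2012-10-20' AS datetime), 4)) AS f(d, pos)
          WHERE f.d = t.myDate);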