I have a data set consisting of time-stamped values, and absolute (meter) values. Sometimes the meter values reset to zero, which means I have to iterate through and calculate a delta one-by-one, and then add it up to get the total for a given period.
For example:
Timestamp Value
2009-01-01 100
2009-01-02 105
2009-01-03 120
2009-01-04 0
2009-01-05 9
the total here is 29, calculated as:
(105 - 100) + (120 - 105) + (0) + (9 - 0) = 29
I'm using MS-SQL server for this, and open to any suggestions.
Right now, I'm using a cursor to do this, which checks that the delta isn't negative, and then totals it up:
DECLARE curTest CURSOR FAST_FORWARD FOR
SELECT value FROM table ORDER BY timestamp

OPEN curTest

DECLARE @delta bigint, @current bigint, @last bigint
SET @delta = 0

FETCH curTest INTO @current
WHILE @@FETCH_STATUS = 0
BEGIN
    IF (@current IS NOT NULL)
    BEGIN
        IF (@last IS NOT NULL) AND (@current > @last)
            SET @delta = @delta + (@current - @last)
        SET @last = @current
    END
    FETCH curTest INTO @current
END

CLOSE curTest
DEALLOCATE curTest
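For reference, the logic the cursor implements can be restated outside SQL. This is just an illustrative sketch in plain Python, not part of the actual solution:

```python
def total_usage(readings):
    """Sum the positive deltas between consecutive meter readings;
    a drop (meter reset) contributes nothing to the total."""
    total = 0
    for prev, cur in zip(readings, readings[1:]):
        if cur > prev:  # only count increases; skip resets
            total += cur - prev
    return total

print(total_usage([100, 105, 120, 0, 9]))  # 29
```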
It would be nice to get a data set like:
Timestamp Value LastValue
2009-01-01 100 NULL
2009-01-02 105 100
2009-01-03 120 105
2009-01-04 0 120
2009-01-05 9 0
as then it would be easy to grab the deltas, filter for (Value > LastValue), and do a SUM().
I tried:
SELECT m1.timestamp, m1.value,
    ( SELECT TOP 1 m2.value FROM table m2 WHERE m2.timestamp < m1.timestamp ORDER BY m2.timestamp DESC ) as LastValue
FROM table m1
but this actually turns out to be slower than the cursor: When I run these together in SQL studio with 'show execution plan' on, the relative cost of this is 100% (with 7 or 8 operations - the majority in a clustered index scan on timestamp), and the cursor is 0% (with 3 operations).
(What I'm not showing here for simplicity is that I have several different sets of numbers, with a foreign key in this table as well - so there is also always a WHERE clause limiting to a specific set. I have several places where I calculate these totals for a given time period for several sets at once, and thus it becomes quite the performance bottleneck. The non-cursor method can also be easily modified to GROUP BY the key and return all the sets at once - but this actually is even slower in my testing than running the cursor multiple times, because there is the additional overhead of the GROUP BY and SUM() operation, aside from it being slower overall anyways.)
Much the same...
create table #temp ([timestamp] date,value int);
insert into #temp (timestamp,value) values ('2009-01-01',100)
insert into #temp (timestamp,value) values ('2009-01-02',105)
insert into #temp (timestamp,value) values ('2009-01-03',120)
insert into #temp (timestamp,value) values ('2009-01-04',0)
insert into #temp (timestamp,value) values ('2009-01-05',9);
with numbered as
(
select ROW_NUMBER() over (order by timestamp) id,value from #temp
)
select sum(n1.value-n2.value) from numbered n1 join numbered n2 on n1.id=n2.id+1 where n1.value!=0
drop table #temp;
Result is 29, as specified.
Start with row_number, then join back to yourself.
with numbered as
(
SELECT value, row_number() over (order by timestamp) as Rownum
FROM table
)
select sum(n1.value - n2.value)
from numbered n1
join
numbered n2 on n1.Rownum = n2.Rownum +1
Actually... you only want to pick up increases... so put a WHERE clause in, saying "WHERE n1.value > n2.value".
And... make sure I've put them the right way around... I've just changed it from -1 to +1, because I think I had it flipped.
Easy!
Rob
There are too many unnecessary joins in your algorithm.
Calculating the difference between each meter reading and its subsequent meter reading is a waste of resources. As a real-world example, imagine if my electric company read my meter each day to determine how much electricity I used, and summed those daily values to determine my monthly total - it just doesn't make sense. They simply determine the total from the start value and the end value!
Simply calculate the difference between the first and last readings and adjust to account for the 'resets'. Your formula simply becomes:
total value = (final value) - (initial value)
+ (miscellaneous reductions in value, i.e. resets)
total value = (9) - (100) + (120)
= 29
It's trivial to find the final value and initial value. Just find the total amount by which 'meter' was reduced during 'resets', and add this to the total. Unless there are more reset records than measurement records, this will always be more efficient.
To borrow from spender's solution, the 'reset' value could be calculated by
create table...
select sum(n1.value-n2.value) from numbered n1 join numbered n2
on n1.id=n2.id+1 where n1.value=0 --note value=0 rather than value!=0
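The endpoints-plus-resets formula can be sketched in plain Python as well (illustrative only, with a hypothetical helper name):

```python
def total_from_endpoints(readings):
    """final - initial, plus the amount lost at each reset
    (any point where the meter value goes down)."""
    resets = sum(prev - cur
                 for prev, cur in zip(readings, readings[1:])
                 if cur < prev)
    return readings[-1] - readings[0] + resets

print(total_from_endpoints([100, 105, 120, 0, 9]))  # 9 - 100 + 120 = 29
```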
I have a SQL database that collects temperature and sensor data from the barn.
The table definition is:
CREATE TABLE [dbo].[DataPoints]
(
[timestamp] [datetime] NOT NULL,
[pointname] [nvarchar](50) NOT NULL,
[pointvalue] [float] NOT NULL
)
The sensors report outside temperature (degrees), inside temperature (degrees), and heating (as on/off).
Sensors create a record when the previous reading has changed, so temperatures are generated every few minutes, one record for heat coming ON, one for heat going OFF, and so on.
I'm interested in how many minutes of heat has been used overnight, so a 24-hour period from 6 AM yesterday to 6 AM today would work fine.
This query:
SELECT *
FROM [home_network].[dbo].[DataPoints]
WHERE (pointname = 'Heaters')
AND (timestamp BETWEEN '2022-12-18 06:00:00' AND '2022-12-19 06:00:00')
ORDER BY timestamp
returns this data:
2022-12-19 02:00:20 | Heaters | 1
2022-12-19 02:22:22 | Heaters | 0
2022-12-19 03:43:28 | Heaters | 1
2022-12-19 04:25:31 | Heaters | 0
The end result should be 22 minutes + 42 minutes = 64 minutes of heat, but I can't see how to get this result from a single query. It also just happens that this result set has two complete heat on/off cycles, but that will not always be the case. So, if the first heat record was = 0, that means that at 6 AM, the heat was already on, but the start time won't show in the query. The same idea applies if the last heat record is =1 at, say 05:15, which means 45 minutes have to be added to the total.
Is it possible to get this minutes-of-heat-time result with a single query? Actually, I don't know the right approach, and it doesn't matter if I have to run several queries. If needed, I could use a small app that reads the raw data, and applies logic outside of SQL to arrive at the total. But I'd prefer to be able to do this within SQL.
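Since the question allows for logic outside of SQL, here is a sketch of the event-pairing logic in plain Python, including the two boundary cases described (first event is OFF, or last event is ON). Names and structure are illustrative, not a definitive implementation:

```python
from datetime import datetime

def heat_minutes(events, window_start, window_end):
    """events: (timestamp, state) rows ordered by time, state 1=on, 0=off.
    If the first row is an OFF, the heat was already on at window_start;
    if the heat is still on at the end, count time up to window_end."""
    total_seconds = 0.0
    on_since = None
    if events and events[0][1] == 0:
        on_since = window_start  # heat was already on at the window start
    for ts, state in events:
        if state == 1:
            on_since = ts
        elif on_since is not None:
            total_seconds += (ts - on_since).total_seconds()
            on_since = None
    if on_since is not None:  # still on at the end of the window
        total_seconds += (window_end - on_since).total_seconds()
    return total_seconds / 60

rows = [
    (datetime(2022, 12, 19, 2, 0, 20), 1),
    (datetime(2022, 12, 19, 2, 22, 22), 0),
    (datetime(2022, 12, 19, 3, 43, 28), 1),
    (datetime(2022, 12, 19, 4, 25, 31), 0),
]
print(round(heat_minutes(rows,
                         datetime(2022, 12, 18, 6, 0),
                         datetime(2022, 12, 19, 6, 0))))  # 64
```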
This isn't a complete answer, but it should help you get started. From the SQL in the post, I'm assuming you're using SQL Server. I've formatted the code to match. Replace @input with your query above if you want to test on your own data. (SELECT * FROM [home_network].[dbo]...)
--generate dummy table with sample output from question
declare @input as table(
    [timestamp] [datetime] NOT NULL,
    [pointname] [nvarchar](50) NOT NULL,
    [pointvalue] [float] NOT NULL
)
insert into @input values
('2022-12-19 02:00:20','Heaters',1),
('2022-12-19 02:22:22','Heaters',0),
('2022-12-19 03:43:28','Heaters',1),
('2022-12-19 04:25:31','Heaters',0);
--Append a row number to the result
WITH A as (
    SELECT *,
        ROW_NUMBER() OVER(ORDER BY timestamp) as row_count
    from @input)
--Self join the table using the row number as a guide
SELECT sum(datediff(MINUTE,startTimes.timestamp,endTimes.timestamp))
from A as startTimes
LEFT JOIN A as endTimes on startTimes.row_count=endTimes.row_count-1
--Only show periods of time where the heater is turned on at the start
WHERE startTimes.row_count%2=1
Your problem can be divided into two main steps (plus an optional third):
Filter by sensor type and date range, while also getting the time span of each record by calculating the date difference between the timestamp of the current record and the next one in chronological order.
Filter records with ON status and sum up the durations.
(Optional) Convert to HH:MM:SS format for display.
Here's my take on the problem, with comments on what I do in each step, all combined into one single query.
-- Step 3: Convert output to HH:MM:SS, this is just for show and can be reduced
SELECT STUFF(CONVERT(VARCHAR(8), DATEADD(SECOND, total_duration, 0), 108),
1, 2, CAST(FLOOR(total_duration / 3600) AS VARCHAR(5)))
FROM (
-- Step 2: select records with status ON (1) and aggregate total duration in seconds
SELECT sum(duration) as total_duration
FROM (
-- Step 1: Use LEAD to get next adjacent timestamp and calculate date difference (time span) between the current record and the next one in time order
SELECT
    DATEDIFF(SECOND, timestamp, LEAD(timestamp, 1, '2022-12-19 06:00:00') OVER (ORDER BY timestamp)) as duration,
    pointvalue
FROM [dbo].[DataPoints]
-- filtered by sensor name and time range
WHERE pointname = 'Heaters'
    AND (timestamp BETWEEN '2022-12-18 06:00:00' AND '2022-12-19 06:00:00')
) AS tmp
WHERE tmp.pointvalue = 1
) as tmp2
Note: as the last record does not have a next adjacent timestamp, it is filled with the end time of the inspection window (in this case, 6 AM of the next day).
I do not really think it would be possible to achieve within a single query.
Option 1:
implement a stored procedure where you can implement the logic for calculating these periods.
Option 2:
add a new column (duration); when inserting a new record, calculate the difference between NOW and the previous timestamp, and update the duration on the previous record
Background: I am running TeslaMate/Grafana for monitoring my car status; one of the gauges plots the battery level fetched from the database. My server is located remotely and runs in Docker on an old NAS, so both query performance and network overhead matter.
I found the kiosk page frequently hangs, and by investigation, it might be caused by the query -- two of the plots return 10~100k rows of results from the database. I want to limit the number of rows returned by SQL queries, as the plots certainly don't have that much precision for drawing such detailed intervals.
I tried to follow this answer and use row_number() to keep only every 100th row of results, but a more complicated issue turned up: the time intervals among rows are not consistent.
The car has four statuses: driving / online / asleep / offline.
If the car is in driving status, the time interval could be less than 200ms, as the car pushes the status whenever it has new data.
If the car is in online status, the time interval could be several minutes, as the system actively fetches the status from the car.
Even worse, if the system thinks the car is going to sleep and needs to stop fetching status (to avoid preventing the car from sleeping), the interval could be up to 40 minutes, depending on settings.
If the car is in asleep/offline status, no data is recorded at all.
This obviously makes skipping every n-th row a bad idea, as for cases 2-4 above, lots of data points might be missing, so Grafana cannot plot a correct graph representing the battery level at satisfactory precision.
I wonder if it's possible to skip rows by the time interval from a datetime field, rather than by row_number(), without much overhead in the query? I.e., fetch every row that is at least 1000ms after the previously fetched row.
E.g., I have following data in the table, I want the rows returned are row 1, 4 and 5.
row date
[1] 1610000001000
[2] 1610000001100
[3] 1610000001200
[4] 1610000002000
[5] 1610000005000
The current (problematic) method I am using is as follows:
SELECT $__time(t.date), t.battery_level AS "SOC [%]"
FROM (
SELECT date, battery_level, row_number() OVER(ORDER BY date ASC) AS row
FROM (
SELECT battery_level, date
FROM positions
WHERE car_id = $car_id AND $__timeFilter(date)
UNION ALL
SELECT battery_level, date
FROM charges c
JOIN charging_processes p ON p.id = c.charging_process_id
WHERE $__timeFilter(date) AND p.car_id = $car_id) AS data
ORDER BY date ASC) as t
WHERE t.row % 100 = 0;
This method clearly has the problem that it returns every n-th row regardless of time spacing, instead of what I wanted (e.g. with t.row % 2 = 0 in the last line, it simply returns alternate rows).
PS: please ignore the table structures and UNION in the sample code; I haven't dug deep enough into the tables, which could allow other tweaks, but that's irrelevant to this question anyway.
Thanks in advance!
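The desired "keep a row only if it is at least 1000 ms after the previously kept row" filter can be sketched in plain Python (illustrative only; the real answers below do it inside the database):

```python
def downsample(rows, min_gap_ms=1000):
    """rows: (row_id, epoch_ms) pairs ordered by time.
    Keep a row only if it is at least min_gap_ms after the
    previously kept row."""
    kept, last = [], None
    for row_id, ts in rows:
        if last is None or ts - last >= min_gap_ms:
            kept.append(row_id)
            last = ts
    return kept

data = [(1, 1610000001000), (2, 1610000001100), (3, 1610000001200),
        (4, 1610000002000), (5, 1610000005000)]
print(downsample(data))  # [1, 4, 5]
```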
You can use a recursive CTE:
WITH RECURSIVE rec(cur_row, cur_date) AS (
(
SELECT row, date
FROM t
ORDER BY date
LIMIT 1
)
UNION ALL
(
SELECT row, date
FROM t
JOIN rec
ON t.date >= cur_date + 1000
ORDER BY t.date
LIMIT 1
)
)
SELECT *
FROM rec;
cur_row | cur_date
--------|--------------
      1 | 1610000001000
      4 | 1610000002000
      5 | 1610000005000
Using a function instead would probably be faster:
CREATE OR REPLACE FUNCTION f() RETURNS SETOF t AS
$$
DECLARE
row t%ROWTYPE;
cur_date BIGINT;
BEGIN
FOR row IN
SELECT *
FROM t
ORDER BY date
LOOP
IF row.date >= cur_date + 1000 OR cur_date IS NULL
THEN
cur_date := row.date;
RETURN NEXT row;
END IF;
END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT *
FROM f();
row | date
----|--------------
  1 | 1610000001000
  4 | 1610000002000
  5 | 1610000005000
I have the following code that gets the months between two date ranges using a CTE
declare
    @date_start DateTime,
    @date_end DateTime
;WITH totalMonths AS
(
SELECT
DATEDIFF(MONTH, @date_start, @date_end) totalM
),
numbers AS
(
SELECT 1 num
UNION ALL
SELECT n.num + 1 num
FROM numbers n, totalMonths c
WHERE n.num <= c.totalM
)
SELECT
CONVERT(varchar(6), DATEADD(MONTH, numbers.num - 1, @date_start), 112)
FROM
numbers
OPTION (MAXRECURSION 0);
This works, but I do not understand how it works
Especially this part
numbers AS
(
SELECT 1 num
UNION ALL
SELECT n.num + 1 num
FROM numbers n, totalMonths c
WHERE n.num <= c.totalM
)
Thanks in advance, sorry for my English
This query is using two CTEs, one recursive, to generate a list of values from nothing (SQL isn't really good at doing this).
totalMonths AS (SELECT DATEDIFF(MONTH, @date_start, @date_end) totalM),
This part is basically a convoluted way of binding the result of the DATEDIFF to the name totalM. It could have been implemented as just a variable if you can declare things:
DECLARE @totalM int = DATEDIFF(MONTH, @date_start, @date_end);
Then you would of course use @totalM to refer to the value.
numbers AS (
SELECT 1 num
UNION ALL
SELECT n.num+1 num FROM numbers n, totalMonths c
WHERE n.num<= c.totalM
)
This part is essentially a simple loop implemented using recursion to generate the numbers from 1 to totalMonths. The first SELECT specifies the first value (1) and the one after that specifies the next value, which is one greater than the previous one. Evaluating recursive CTEs has somewhat special semantics, so it's a good idea to read up on them. Finally, the WHERE specifies the stopping condition so that the recursion doesn't go on forever.
What all this does is generate an equivalent to a physical "numbers" table that just has one column the numbers from 1 onwards.
The SELECT at the very end uses the result of the numbers CTE to generate a bunch of dates.
Note that the OPTION (MAXRECURSION 0) at the end is also relevant to the recursive CTE. This disables the server-wide recursion depth limit so that the number generating query doesn't stop short if the range is very long, or a bothersome DBA set a very low default limit.
totalMonths query evaluates to a scalar result (single value) indicating the number of months that need to be generated. It probably makes more sense to just do this inline instead of using a named CTE.
numbers generates a sequence of rows with a column called num starting at 1 and ending at totalM + 1 which was computed in the previous step. It is able to reference this value by means of a cross join. Since there's only one row it essentially just appends that one column to the table horizontally. The query is recursive so each pass adds a new row to the result by adding 1 to the last added row (really just the one column) until the value of the previously added row exceeds totalM. The first half of the union is the starting value; the second half refers to itself via from numbers and incrementally builds the result in a sort of loop.
The output is derived from the numbers input. One is subtracted from each num giving a range from 0 to totalM and that value is treated as the number of months to add to the starting date. The date value is converted to a varchar of length six which means the final two characters containing the day are truncated.
Suppose that @date_start is January 31, 2016 and @date_end is March 1, 2016. There is never any comparison of the actual date values, so it doesn't matter that March 31 is generated in the sequence but also falls later than the passed @date_end value. Any dates in the respective start and end months can be chosen to generate identical sequences.
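The whole CTE pipeline (count the months, generate the numbers, add them to the start date, format as yyyymm) can be mirrored in a short Python sketch; the helper name is illustrative:

```python
from datetime import date

def month_sequence(date_start, date_end):
    """Mirror of the query: DATEDIFF(MONTH, ...) months spanned,
    then one yyyymm string per number from 1 to totalM + 1."""
    total_m = (date_end.year - date_start.year) * 12 \
              + (date_end.month - date_start.month)
    months = []
    for num in range(1, total_m + 2):          # nums 1 .. totalM + 1
        m = date_start.month - 1 + (num - 1)   # months to add to the start
        months.append(f"{date_start.year + m // 12}{m % 12 + 1:02d}")
    return months

print(month_sequence(date(2016, 1, 31), date(2016, 3, 1)))
# ['201601', '201602', '201603']
```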
SELECT 1 num
is the starting point of your recursive CTE, and that is what (numbers n) holds in the first iteration. In the second iteration, the output of the first iteration,
SELECT n.num + 1 num FROM numbers n, totalMonths c
WHERE n.num <= c.totalM
becomes (numbers n), and so on.
Following are the table definition and a script to populate it.
DECLARE @temp TABLE (PPId INT, SVPId INT, Minimum INT, Maximum INT)
INSERT INTO @temp VALUES(1,1,8,20)
INSERT INTO @temp VALUES(2,1,21,100)
Minimum & Maximum are passed in as parameters. I want to find all rows that fall in the given range.
E.g.:
If @minimum = 9 and @maximum = 15, then it falls in the range of the first row.
If @minimum = 21 and @maximum = 22, then it falls in the range of the 2nd row.
If @minimum = 7 and @maximum = 25, then it falls in the range of both rows, so both rows should be returned.
Thanks.
When comparing ranges like this, it's easier to look for the case where the ranges don't overlap. There are many different ways that two ranges can overlap, but there are only two ways that they don't overlap:
select *
from @temp
where not (@maximum < Minimum or @minimum > Maximum)
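The same "negate the non-overlap cases" predicate, restated in a quick Python check against the question's three examples (illustrative only):

```python
def overlaps(row, lo, hi):
    """Two ranges overlap unless one ends before the other begins."""
    row_min, row_max = row
    return not (hi < row_min or lo > row_max)

rows = [(8, 20), (21, 100)]
print([r for r in rows if overlaps(r, 9, 15)])   # [(8, 20)]
print([r for r in rows if overlaps(r, 21, 22)])  # [(21, 100)]
print([r for r in rows if overlaps(r, 7, 25)])   # [(8, 20), (21, 100)]
```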
SELECT *
FROM @temp
WHERE minimum <= @max
  AND maximum >= @min
My suggested answer is so simple I suspect either I'm missing something or the question is not complete?
SELECT *
FROM @temp
WHERE Minimum < @Minimum
  AND Maximum > @Maximum
I can see what you're trying to do. You want to know how many min/max ranges overlap with the provided min/max range. Try this:
SELECT * FROM @temp T
WHERE @minimum BETWEEN T.minimum AND T.maximum
   OR @maximum BETWEEN T.minimum AND T.maximum
   OR T.minimum BETWEEN @minimum AND @maximum
   OR T.maximum BETWEEN @minimum AND @maximum
This should return all rows that intersect with the interval [#minimum, #maximum].
I'm leaving out all the cursor setup and the SELECT from the temp table for brevity. Basically, this code computes a running balance for all transactions, per transaction.
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @balance = @balance + @amount
    INSERT INTO #tblArTran VALUES ( --from artran table
        @artranid, @trandate, @type,
        @checkNumber, @refNumber, @custid,
        @amount, @taxAmount, @balance, @postedflag, @modifieddate )
    FETCH NEXT FROM artranCursor INTO
        @artranid, @trandate, @type, @checkNumber, @refNumber,
        @amount, @taxAmount, @postedFlag, @custid, @modifieddate
END
Inspired by this code from an answer to another question,
SELECT @nvcConcatenated = @nvcConcatenated + C.CompanyName + ', '
FROM tblCompany C
WHERE C.CompanyID IN (1,2,3)
I was wondering if SQL had the ability to sum numbers in the same way it's concatenating strings, if you get my meaning. That is, to create a "running balance" per row, without using a cursor.
Is it possible?
You might want to take a look at the update to local variable solution here: http://geekswithblogs.net/Rhames/archive/2008/10/28/calculating-running-totals-in-sql-server-2005---the-optimal.aspx
DECLARE @SalesTbl TABLE (DayCount smallint, Sales money, RunningTotal money)
DECLARE @RunningTotal money
SET @RunningTotal = 0
INSERT INTO @SalesTbl
SELECT DayCount, Sales, null
FROM Sales
ORDER BY DayCount
UPDATE @SalesTbl
SET @RunningTotal = RunningTotal = @RunningTotal + Sales
FROM @SalesTbl
SELECT * FROM @SalesTbl
It outperforms all other methods, but there are some doubts about guaranteed row order. It seems to work fine when the temp table is indexed, though.
Nested sub-query 9300 ms
Self join 6100 ms
Cursor 400 ms
Update to local variable 140 ms
SQL can create running totals without using cursors, but it's one of the few cases where a cursor is actually more performant than a set-based solution (given the operators currently available in SQL Server). Alternatively, a CLR function can sometimes shine well. Itzik Ben-Gan did an excellent series in SQL Server Magazine on running aggregates. The series concluded last month, but you can get access to all of the articles if you have an online subscription.
Edit: here's his latest article in the series (SQL CLR).
Given that you can access the whole series by purchasing an online monthly pass for one month - less than 6 bucks - it's worth your while if you're interested in looking at the problem from all angles. Itzik is a Microsoft MVP and a very bright TSQL coder.
In Oracle and PostgreSQL 8.4 you can use window functions:
SELECT SUM(value) OVER (ORDER BY id)
FROM mytable
In MySQL, you can use a session variable for the same purpose:
SELECT @sum := @sum + value
FROM (
    SELECT @sum := 0
) vars, mytable
ORDER BY
    id
In SQL Server, it's a rare example of a task for which a cursor is a preferred solution.
An example of calculating a running total for each record, but only if the OrderDate for the records are on the same date. Once the OrderDate is for a different day, then a new running total will be started and accumulated for the new day: (assume the table structure and data)
select O.OrderId,
convert(char(10),O.OrderDate,101) as 'Order Date',
O.OrderAmt,
(select sum(OrderAmt) from Orders
where OrderID <= O.OrderID and
convert(char(10),OrderDate,101)
= convert(char(10),O.OrderDate,101))
'Running Total'
from Orders O
order by OrderID
Here are the results returned from the query using sample Orders Table:
OrderId Order Date OrderAmt Running Total
----------- ---------- ---------- ---------------
1 10/11/2003 10.50 10.50
2 10/11/2003 11.50 22.00
3 10/11/2003 1.25 23.25
4 10/12/2003 100.57 100.57
5 10/12/2003 19.99 120.56
6 10/13/2003 47.14 47.14
7 10/13/2003 10.08 57.22
8 10/13/2003 7.50 64.72
9 10/13/2003 9.50 74.22
Note that the "Running Total" starts out with a value of 10.50, then becomes 22.00, and finally becomes 23.25 for OrderID 3, since all these records have the same OrderDate (10/11/2003). But when OrderID 4 is displayed, the running total is reset and starts over again. This is because OrderID 4 has a different OrderDate than OrderID 1, 2, and 3. Calculating this running total for each unique date is once again accomplished by using a correlated subquery, although an extra WHERE condition is required, which requires that the OrderDates on different records be the same day. This WHERE condition is accomplished by using the CONVERT function to truncate the OrderDate into a MM/DD/YYYY format.
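The "restart the running sum whenever the date changes" behavior can be restated in a short Python sketch over the same sample data (names illustrative; `round` is used only to tidy floating-point noise):

```python
def running_totals_by_day(orders):
    """orders: (order_id, day, amount) tuples ordered by order_id.
    Restart the running sum whenever the date changes."""
    out, total, cur_day = [], 0.0, None
    for oid, day, amt in orders:
        if day != cur_day:
            total, cur_day = 0.0, day
        total += amt
        out.append((oid, day, amt, round(total, 2)))
    return out

orders = [(1, '10/11/2003', 10.50), (2, '10/11/2003', 11.50),
          (3, '10/11/2003', 1.25), (4, '10/12/2003', 100.57),
          (5, '10/12/2003', 19.99), (6, '10/13/2003', 47.14),
          (7, '10/13/2003', 10.08), (8, '10/13/2003', 7.50),
          (9, '10/13/2003', 9.50)]
for row in running_totals_by_day(orders):
    print(row)
```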
In SQL Server 2012 and up you can just use the Sum windowing function directly against the original table:
SELECT
artranid,
trandate,
type,
checkNumber,
refNumber,
custid,
amount,
taxAmount,
Balance = Sum(amount) OVER (ORDER BY trandate ROWS UNBOUNDED PRECEDING),
postedflag,
modifieddate
FROM
dbo.Sales
;
This will perform very well compared to all solutions and will not have the potential for errors as found in the "quirky update".
Note that you should use the ROWS version when possible; the RANGE version may perform less well.
You can just include a correlated subquery in the select clause. (This will perform poorly for very large result sets.)
Select <other stuff>,
(Select Sum(ColumnVal) From Table
Where OrderColumn <= T.OrderColumn) As RunningTotal
From Table T
Order By OrderColumn
You can do a running count. Here is an example; keep in mind that this is actually not that fast, since it has to scan the table for every row. If your table is large, this can be quite time-consuming and costly.
create table #Test (id int, Value decimal(16,4))
insert #Test values(1,100)
insert #Test values(2,100)
insert #Test values(3,100)
insert #Test values(4,200)
insert #Test values(5,200)
insert #Test values(6,200)
insert #Test values(7,200)
select *,(select sum(Value) from #Test t2 where t2.id <=t1.id) as SumValues
from #test t1
id Value SumValues
1 100.0000 100.0000
2 100.0000 200.0000
3 100.0000 300.0000
4 200.0000 500.0000
5 200.0000 700.0000
6 200.0000 900.0000
7 200.0000 1100.0000
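The correlated subquery above is an O(n^2) way of computing what is, outside SQL, an ordinary prefix sum computed in a single linear pass (Python shown purely for comparison):

```python
from itertools import accumulate

# Same Value column as the #Test example above
values = [100, 100, 100, 200, 200, 200, 200]
print(list(accumulate(values)))  # [100, 200, 300, 500, 700, 900, 1100]
```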
On SQLTeam there's also an article about calculating running totals. There is a comparison of 3 ways to do it, along with some performance measuring:
using cursors
using a subselect (as per SQLMenace's post)
using a CROSS JOIN
Cursors outperform by far the other solutions, but if you must not use cursors, there's at least an alternative.
That SELECT @nvcConcatenated bit is only returning a single concatenated value. (Although it's computing the intermediate values on a per-row basis, you're only able to retrieve the final value.)
So, I think the answer is no. If you wanted a single final sum value you would of course just use SUM.
I'm not saying you can't do it, I'm just saying you can't do it using this 'trick'.
Note that using a variable to accomplish this, such as in the following, may fail on a multiprocessor system, because separate rows could get calculated on different processors and may end up using the same starting value. My understanding is that a query hint could be used to force it to use a single thread, but I do not have that information handy.
UPDATE @SalesTbl
SET @RunningTotal = RunningTotal = @RunningTotal + Sales
FROM @SalesTbl
Using one of the other options (a cursor, a window function, or nested queries) is typically going to be your safest bet for reliable results.
select TransactionDate, amount,
       amount + (select sum(x.amount)
                 from Transactions x
                 where x.TransactionDate < t.TransactionDate) as RunningTotal
from Transactions t
The condition x.TransactionDate < t.TransactionDate could be any condition that identifies all the previous records aside from the current one.