Get average and standard deviation on difference between row values - sql

Given the following table:
CREATE TABLE datapoints
(
id serial NOT NULL,
datasource text,
"timestamp" integer,
value text,
CONSTRAINT datapoints_pkey PRIMARY KEY (id)
)
How can I calculate the average and standard deviation of the difference in timestamp from one row to the next?
What I mean is, if the data looks like this:
timestamp
---------
1385565639
1385565641
1385565643
I would like to calculate the average and standard deviation on the following data:
timestamp difference
--------------------
0
2
2
Is this even possible in a single query?

The first query returns the differences and the second returns the stddev and avg. The row numbers are assigned in timestamp order inside the window function, so that consecutive row numbers really are consecutive timestamps:
--difference
WITH rn AS (
SELECT timestamp, row_number() OVER (ORDER BY timestamp) AS rown
FROM datapoints
)
SELECT ta.rown, tb.rown, tb.timestamp - ta.timestamp
FROM rn AS ta, rn AS tb
WHERE tb.rown = ta.rown + 1;
--avg, stddev
WITH rn AS (
SELECT timestamp, row_number() OVER (ORDER BY timestamp) AS rown
FROM datapoints
)
SELECT stddev(tb.timestamp - ta.timestamp), avg(tb.timestamp - ta.timestamp)
FROM rn AS ta, rn AS tb
WHERE tb.rown = ta.rown + 1;
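For comparison, here is a minimal sketch of the same calculation using LAG() instead of a self-join. It runs against an in-memory SQLite database from Python purely for convenience (an assumption on my part; the question is about PostgreSQL), and because SQLite has no built-in stddev, the aggregation is done with Python's statistics module. Window functions need SQLite 3.25 or newer.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE datapoints (id INTEGER PRIMARY KEY, "timestamp" INTEGER)')
conn.executemany(
    'INSERT INTO datapoints ("timestamp") VALUES (?)',
    [(1385565639,), (1385565641,), (1385565643,)],
)

# LAG() yields the previous row's timestamp in timestamp order.
# The first row has no predecessor, so its difference is NULL and is filtered out.
rows = conn.execute(
    """
    SELECT diff FROM (
        SELECT "timestamp" - LAG("timestamp") OVER (ORDER BY "timestamp") AS diff
        FROM datapoints
    )
    WHERE diff IS NOT NULL
    """
).fetchall()
diffs = [r[0] for r in rows]

avg_diff = statistics.mean(diffs)
# statistics.stdev is the sample standard deviation (n - 1 denominator).
stddev_diff = statistics.stdev(diffs)
print(diffs, avg_diff, stddev_diff)
```

Note that this yields n-1 differences for n rows. If you want the leading 0 from the question included, wrap the difference in COALESCE(..., 0) and drop the WHERE clause.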

Unless I misunderstood or oversimplified your question
something like this might be helpful.
select t2.timestamp - t1.timestamp
from
TableName t1
join TableName t2 on
(
t1.timestamp < t2.timestamp
and
not exists (
select null from TableName tMid
where
tMid.timestamp > t1.timestamp and tMid.timestamp < t2.timestamp
)
)
I doubt this is the most efficient thing to do but you mentioned you want it done with one single query.
Just giving you an idea.
If your IDs are consecutive, you could do the join much simpler
(on t1.ID = t2.ID-1 or something similar).
Then also you need to see how to also include the last/first difference
(maybe you try an outer join). I think my query misses that one.
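A small sketch of the NOT EXISTS idea, run on SQLite from Python for convenience. The table layout (a single ts column) and the sample values are made up for illustration, and the approach assumes timestamps are distinct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (ts INTEGER)")
conn.executemany("INSERT INTO datapoints VALUES (?)", [(10,), (12,), (17,)])

# Pair each row with the next later row: t2 is later than t1
# and no other row falls strictly between them.
pairs = conn.execute(
    """
    SELECT t1.ts, t2.ts, t2.ts - t1.ts
    FROM datapoints t1
    JOIN datapoints t2
      ON t1.ts < t2.ts
     AND NOT EXISTS (
         SELECT 1 FROM datapoints mid
         WHERE mid.ts > t1.ts AND mid.ts < t2.ts
     )
    ORDER BY t1.ts
    """
).fetchall()
print(pairs)
```

Three rows give two pairs, so the first row has no difference of its own unless you switch to an outer join.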
Never mind, seems I probably misunderstood your question.
This seems useful for your case.
SQL: Show average and min/max within standard deviations

Related

How to return column changes in a column [duplicate]

I need to calculate the difference of a column between two lines of a table. Is there any way I can do this directly in SQL? I'm using Microsoft SQL Server 2008.
I'm looking for something like this:
SELECT value - (previous.value) FROM table
Imagine that the "previous" variable references the most recently selected row. Of course, with a select like that I will end up with n-1 rows selected from a table with n rows; that's not a problem, it's actually exactly what I need.
Is that possible in some way?
Use the lag function:
SELECT value - lag(value) OVER (ORDER BY Id) FROM table
Sequences used for Ids can skip values, so Id-1 does not always work.
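To make the point about skipped ids concrete, here is a short sketch comparing LAG() with an Id-1 join. It uses SQLite through Python only so it can be run anywhere; the readings table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value INTEGER)")
# Ids 3 and 4 are missing, as happens after rolled-back inserts or deletes.
conn.executemany("INSERT INTO readings VALUES (?, ?)", [(1, 10), (2, 15), (5, 30)])

# LAG() follows the ORDER BY, so gaps in id do not matter.
lag_diffs = conn.execute(
    """
    SELECT id, value - LAG(value) OVER (ORDER BY id) AS diff
    FROM readings
    ORDER BY id
    """
).fetchall()

# Joining on id - 1 silently drops the row that follows the gap.
join_diffs = conn.execute(
    """
    SELECT cur.id, cur.value - prev.value
    FROM readings cur
    JOIN readings prev ON prev.id = cur.id - 1
    ORDER BY cur.id
    """
).fetchall()
print(lag_diffs, join_diffs)
```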
SQL has no built in notion of order, so you need to order by some column for this to be meaningful. Something like this:
select t1.value - t2.value from table t1, table t2
where t1.primaryKey = t2.primaryKey - 1
If you know how to order things but not how to get the previous value given the current one (EG, you want to order alphabetically) then I don't know of a way to do that in standard SQL, but most SQL implementations will have extensions to do it.
Here is a way for SQL server that works if you can order rows such that each one is distinct:
select rank() OVER (ORDER BY id) as 'Rank', value into temp1 from t
select t1.value - t2.value from temp1 t1, temp1 t2
where t1.Rank = t2.Rank - 1
drop table temp1
If you need to break ties, you can add as many columns as necessary to the ORDER BY.
WITH CTE AS (
SELECT
rownum = ROW_NUMBER() OVER (ORDER BY columns_to_order_by),
value
FROM table
)
SELECT
curr.value - prev.value
FROM CTE curr
INNER JOIN CTE prev on prev.rownum = curr.rownum - 1
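Here is a runnable sketch of the ROW_NUMBER() self-join, again on SQLite via Python as a stand-in for SQL Server. The readings table, the day column, and the use of id as a tie-breaker are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, day TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10),
        (4, "2024-01-02", 14),
        (7, "2024-01-02", 20),  # same day as id 4, so id breaks the tie
        (9, "2024-01-05", 21),
    ],
)

# Number the rows in the desired order, then join each row to the one before it.
diffs = conn.execute(
    """
    WITH numbered AS (
        SELECT ROW_NUMBER() OVER (ORDER BY day, id) AS rownum, value
        FROM readings
    )
    SELECT cur.value - prev.value
    FROM numbered cur
    JOIN numbered prev ON prev.rownum = cur.rownum - 1
    ORDER BY cur.rownum
    """
).fetchall()
print(diffs)
```

Because the row numbers are dense, gaps in id or in day do not affect the pairing.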
Oracle, PostgreSQL, SQL Server and many more RDBMS engines have analytic functions called LAG and LEAD that do this very thing.
In SQL Server prior to 2012 you'd need to do the following:
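SELECT value - (
SELECT TOP 1 value
FROM mytable m2
WHERE m2.col1 < m1.col1 OR (m2.col1 = m1.col1 AND m2.pk < m1.pk)
ORDER BY
col1 DESC, pk DESC
)
FROM mytable m1
ORDER BY
col1, pk
where COL1 is the column you are ordering by. The subquery has to sort in descending order so that TOP 1 picks the closest preceding row rather than the very first row of the table.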
Having an index on (COL1, PK) will greatly improve this query.
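A sketch of the correlated-subquery technique that can be run without SQL Server: it uses SQLite from Python, where LIMIT 1 plays the role of TOP 1. The sample rows and the index name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (pk INTEGER PRIMARY KEY, col1 INTEGER, value INTEGER)")
# Index on the ordering columns, as suggested above.
conn.execute("CREATE INDEX ix_mytable_col1_pk ON mytable (col1, pk)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 100, 5), (2, 100, 8), (3, 200, 20), (4, 300, 26)],
)

# For each row, fetch the value of the closest preceding row in (col1, pk) order.
# The subquery sorts descending so LIMIT 1 returns the nearest predecessor.
rows = conn.execute(
    """
    SELECT m1.pk,
           m1.value - (
               SELECT m2.value
               FROM mytable m2
               WHERE m2.col1 < m1.col1
                  OR (m2.col1 = m1.col1 AND m2.pk < m1.pk)
               ORDER BY m2.col1 DESC, m2.pk DESC
               LIMIT 1
           ) AS diff
    FROM mytable m1
    ORDER BY m1.col1, m1.pk
    """
).fetchall()
print(rows)
```

The first row has no predecessor, so its difference is NULL; filter it out if you only want the n-1 real differences.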
LEFT JOIN the table to itself, with the join condition worked out so the row matched in the joined version of the table is one row previous, for your particular definition of "previous".
Update: At first I was thinking you would want to keep all rows, with NULLs where there was no previous row. Reading it again, you just want those rows culled, so you should use an inner join rather than a left join.
Update:
Newer versions of Sql Server also have the LAG and LEAD Windowing functions that can be used for this, too.
select t2.col from (
select col,MAX(ID) id from
(
select ROW_NUMBER() over(PARTITION by col order by col) id ,col from testtab t1) as t1
group by col) as t2
The selected answer will only work if there are no gaps in the sequence. However if you are using an autogenerated id, there are likely to be gaps in the sequence due to inserts that were rolled back.
This method should work if you have gaps:
create table #temp (value int, primaryKey int, tempid int identity)
insert into #temp (value, primaryKey)
select value, primaryKey from mytable order by primaryKey
select t1.value - t2.value from #temp t1
join #temp t2
on t1.tempid = t2.tempid - 1
Another way to refer to the previous row in an SQL query is to use a recursive common table expression (CTE):
CREATE TABLE t (counter INTEGER);
INSERT INTO t VALUES (1),(2),(3),(4),(5);
WITH cte(counter, previous, difference) AS (
-- Anchor query
SELECT MIN(counter), 0, MIN(counter)
FROM t
UNION ALL
-- Recursive query
SELECT t.counter, cte.counter, t.counter - cte.counter
FROM t JOIN cte ON cte.counter = t.counter - 1
)
SELECT counter, previous, difference
FROM cte
ORDER BY counter;
Result:
counter  previous  difference
-------  --------  ----------
1        0         1
2        1         1
3        2         1
4        3         1
5        4         1
The anchor query generates the first row of the common table expression cte where it sets cte.counter to column t.counter in the first row of table t, cte.previous to 0, and cte.difference to the first row of t.counter.
The recursive query joins each row of table t to the row of common table expression cte whose counter is one less. In the recursive query, t.counter becomes the new cte.counter, cte.counter from the previous iteration becomes cte.previous, and t.counter - cte.counter is the difference between these two columns.
Note that a recursive CTE is more flexible than the LAG and LEAD functions because a row can refer to any arbitrary result of a previous row. (A recursive function or process is one where the input of the process is the output of the previous iteration of that process, except the first input which is a constant.)
I tested this query at SQLite Online.
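One caveat worth sketching: the join condition cte.counter = t.counter - 1 assumes the counters are consecutive. The snippet below runs the same recursive CTE on SQLite from Python with a gap in the data (the value 3 is deliberately missing) to show that the recursion stops at the gap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (counter INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (4,), (5,)])

rows = conn.execute(
    """
    WITH RECURSIVE cte(counter, previous, difference) AS (
        -- Anchor query: start from the smallest counter
        SELECT MIN(counter), 0, MIN(counter) FROM t
        UNION ALL
        -- Recursive query: find the row whose counter is exactly one more
        SELECT t.counter, cte.counter, t.counter - cte.counter
        FROM t JOIN cte ON cte.counter = t.counter - 1
    )
    SELECT counter, previous, difference FROM cte ORDER BY counter
    """
).fetchall()
print(rows)
```

Rows 4 and 5 never appear because no row has counter = 3, so for data with gaps a window function or a row-numbered join is the safer choice.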
You can use the following window function to get the current row's value and the previous row's value:
SELECT value,
min(value) over (order by id rows between 1 preceding and 1 preceding) as value_prev
FROM table
Then you can just select value - value_prev from that select and get your answer.
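A minimal sketch of that frame-based approach, wrapped in an outer select that computes the difference. It runs on SQLite (3.25 or newer) from Python; the readings table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [(1, 10), (2, 15), (3, 30)])

# The frame "1 PRECEDING AND 1 PRECEDING" contains only the previous row,
# so MIN() over it simply returns that row's value (NULL for the first row).
rows = conn.execute(
    """
    SELECT value, value - value_prev AS diff
    FROM (
        SELECT id, value,
               MIN(value) OVER (
                   ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
               ) AS value_prev
        FROM readings
    )
    ORDER BY id
    """
).fetchall()
print(rows)
```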

How do you call previous row in a where clause?

I am trying to figure out how to get rid of results that occur close together. For example the rows have a create timestamp (source_time). I want to remove results that occur within 10 seconds of each other.
I thought lag() might do it, but I can't use that in the where clause.
select *
from table
where source_time - previous(source_time) >= 10 second
Very rough code, but I am not sure how to call the previous source time. I have translated them to timestamps and used timestamp_diff(source_time, x, second) >= 10 but not sure how to make x the previous value.
Hopefully this is clear.
You can do this with subqueries.
delete from table t1
where t1.id in (
select t2.id
from (
select
id,
source_time - lag(source_time) over (order by source_time) as time_diff
from table
) t2
where t2.time_diff < interval '10 seconds'
)
Keep in mind this can potentially leave large gaps in your records. For example, if you get a row every 9 seconds for an hour, you'll delete all but the first record in that hour, because every later row is within 10 seconds of its predecessor.
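The following sketch shows both the "compute lag() in a subquery, filter outside" pattern and the gap problem. It is a SELECT rather than a DELETE, it runs on SQLite from Python, and source_time is stored as plain integer seconds; all of that is simplification on my part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, source_time INTEGER)")
# Four rows 9 seconds apart, then one row 13 seconds later.
conn.executemany(
    "INSERT INTO events (source_time) VALUES (?)",
    [(0,), (9,), (18,), (27,), (40,)],
)

# Window functions are not allowed in WHERE, so LAG() is computed in a
# subquery and the filter is applied one level up.
kept = conn.execute(
    """
    SELECT source_time
    FROM (
        SELECT source_time,
               source_time - LAG(source_time) OVER (ORDER BY source_time) AS time_diff
        FROM events
    )
    WHERE time_diff IS NULL OR time_diff >= 10
    ORDER BY source_time
    """
).fetchall()
print(kept)
```

Only the rows at 0 and 40 survive, even though 0, 18 and 40 would also be a valid set of rows at least 10 seconds apart.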
You might instead partition the source_time every 10 seconds and delete anything with a row_number > 1.
delete from table t1
where t1.id in (
select t2.id
from (
select
id,
source_time,
row_number() over(
partition by source_time - make_interval(secs => extract(second from source_time) % 10)
order by source_time asc
) rownum
from table
) t2
where rownum > 1
)
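A sketch of the bucketing idea with the same kind of simplified data (integer seconds, SQLite via Python, SELECT instead of DELETE; these are my assumptions, not part of the answer). Integer division by 10 stands in for the interval arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, source_time INTEGER)")
conn.executemany(
    "INSERT INTO events (source_time) VALUES (?)",
    [(0,), (9,), (18,), (27,), (40,)],
)

# source_time / 10 is integer division here, so each row falls into a
# fixed 10-second bucket; only the earliest row per bucket is kept.
kept = conn.execute(
    """
    SELECT source_time
    FROM (
        SELECT source_time,
               ROW_NUMBER() OVER (
                   PARTITION BY source_time / 10
                   ORDER BY source_time
               ) AS rownum
        FROM events
    )
    WHERE rownum = 1
    ORDER BY source_time
    """
).fetchall()
print(kept)
```

This keeps far more rows than the lag() filter, but note the trade-off: rows in neighbouring buckets, such as 18 and 27 here, can still be less than 10 seconds apart.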

Trying to get the greatest value from a customer on a given day

What I need to do: if a customer makes more than one transaction in a day, I need to display the greatest value (and ignore any other values).
The query is pretty big, but the code I inserted below is the focus of the issue. I’m not getting the results I need. The subselect ideally should be reducing the number of rows the query generates since I don’t need all the transactions, just the greatest one, however my code isn’t cutting it. I’m getting the exact same number of rows with or without the subselect.
Note: I don't actually have a t.* in the actual query; there's just a dozen or so other fields being pulled in. I added the t.* just to simplify the code example.
SELECT
t.*,
(SELECT TOP (1)
t1.CustomerGUID
t1.Value
t1.Date
FROM #temp t1
WHERE t1.CustomerGUID = t.CustomerGUID
AND t1.Date = t.Date
ORDER BY t1.Value DESC) AS “Value”
FROM #temp t
Is there an obvious flaw in my code or is there a better way to achieve the result of getting the greatest value transaction per day per customer?
Thanks
you may want to do as follows:
SELECT
t1.CustomerGUID,
t1.Date,
MAX(t1.Value) AS Value
FROM #temp t1
GROUP BY
t1.CustomerGUID,
t1.Date
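A quick runnable sketch of the GROUP BY approach, using SQLite from Python with a made-up transactions table in place of #temp. It assumes Date holds a plain date with no time part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (CustomerGUID TEXT, Date TEXT, Value INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("A", "2024-01-01", 10),
        ("A", "2024-01-01", 25),
        ("A", "2024-01-02", 7),
        ("B", "2024-01-01", 40),
    ],
)

# One output row per customer per day, carrying the largest value.
rows = conn.execute(
    """
    SELECT CustomerGUID, Date, MAX(Value) AS Value
    FROM transactions
    GROUP BY CustomerGUID, Date
    ORDER BY CustomerGUID, Date
    """
).fetchall()
print(rows)
```

Four input rows collapse to three, which is the row reduction the scalar subquery in the question could never produce, since a subquery in the SELECT list adds a column but does not remove rows.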
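You can use row_number() as shown below. Partition by both the customer and the date and order by Value descending, so that row number 1 is the greatest transaction for that customer on that day.
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerGUID, Date ORDER BY Value DESC) AS SrNo FROM <YourTable>
)
AS ranked
WHERE
SrNo = 1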
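Here is a sketch of the ROW_NUMBER() variant that partitions by both customer and date, which is useful when you need the other columns of the winning transaction and not just the maximum. It runs on SQLite from Python; the table and the TxnId column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (TxnId INTEGER, CustomerGUID TEXT, Date TEXT, Value INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2024-01-01", 10),
        (2, "A", "2024-01-01", 25),
        (3, "A", "2024-01-02", 7),
        (4, "B", "2024-01-01", 40),
    ],
)

# Rank transactions within each (customer, day) by value, highest first,
# and keep only the top-ranked row of each group.
rows = conn.execute(
    """
    SELECT TxnId, CustomerGUID, Date, Value
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY CustomerGUID, Date
                   ORDER BY Value DESC
               ) AS SrNo
        FROM transactions
    )
    WHERE SrNo = 1
    ORDER BY CustomerGUID, Date
    """
).fetchall()
print(rows)
```

If two transactions tie for the highest value, ROW_NUMBER() picks one of them arbitrarily; add a tie-breaker to the ORDER BY if that matters.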
Sample data will be more helpful.
Try this window function:
MAX(value) OVER(PARTITION BY date,customer ORDER BY value DESC)
It's faster and more efficient.
Probably many other ways to do it, but this one is simple and works
select t.*
from (
select
convert(varchar(8), r.date,112) one_day
,max(r.Value) max_sale
from #temp r
group by convert(varchar(8), r.date,112)
) e
inner join #temp t on t.value = e.max_sale and convert(varchar(8), t.date,112) = e.one_day
If you have 2 people who spend the exact same amount and that amount is also the max, you'll get 2 records for that day.
The convert(varchar(8), r.date,112) will perform as desired on date, datetime and datetime2 data types. If your date is a varchar, char, nchar or nvarchar, you'll want to examine the data to find out whether you should use left(t.date,10) or left(t.date,8) on it.
If I've understood your requirement correctly, you have stated "greatest value transaction per day per customer". That suggests to me you don't want 1 row per customer in the output but a row per day per customer.
To achieve this you can group on the calendar day like this (casting to date rather than using datepart(day, ...), which returns only the day of the month and would merge the 15th of January with the 15th of February):
Select t.customerid, cast(t.date as date) as Daydate,
max(t.value) as value from #temp t group by
t.customerid, cast(t.date as date);
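To illustrate why the choice of grouping expression matters, here is a small sketch contrasting a calendar-date grouping with a day-of-month grouping. It uses SQLite's date() and strftime('%d', ...) from Python as rough stand-ins for the SQL Server functions; the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customerid TEXT, txn_date TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("A", "2024-01-15 09:00:00", 10),
        ("A", "2024-01-15 17:30:00", 25),
        ("A", "2024-02-15 12:00:00", 7),
    ],
)

# Grouping on the full calendar date keeps January 15 and February 15 apart.
by_calendar_day = conn.execute(
    """
    SELECT customerid, date(txn_date) AS daydate, MAX(value)
    FROM transactions
    GROUP BY customerid, date(txn_date)
    ORDER BY daydate
    """
).fetchall()

# Grouping on the day of the month alone merges them into one group.
by_day_of_month = conn.execute(
    """
    SELECT customerid, strftime('%d', txn_date) AS dayofmonth, MAX(value)
    FROM transactions
    GROUP BY customerid, strftime('%d', txn_date)
    """
).fetchall()
print(by_calendar_day, by_day_of_month)
```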

How to set updating row's field with value of closest to it by date another field?

I have a huge table with 2m+ rows.
The structure is like that:
ThingName (STRING),
Date (DATE),
Value (INT64)
Sometimes Value is null, and I need to fix it by setting it to the non-null Value of the row closest to it by Date for the same ThingName...
And I am totally not SQL guy.
I tried to describe my task with this query (and simplified it a lot by using only previous dates (but actually I need to check future dates too)):
update my_tbl as SDP
set SDP.Value = (select SDPI.Value
from my_tbl as SDPI
where SDPI.Date < SDP.Date
and SDP.ThingName = SDPI.ThingName
and SDPI.Value is not null
order by SDPI.Date desc limit 1)
where SDP.Value is null;
There I try to set updating row Value with one that I select from same table for same ThingName and with limit 1 I leave only single result.
But query editor tell me this:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Actually, I am not sure at all that my task can be solved just with query.
So, can anyone help me? If this is impossible, please tell me; if it is possible, tell me which SQL constructs may help.
Below is for BigQuery Standard SQL
In many (if not most) cases you don't want to update your table (as it incurs extra cost and the limitations associated with DML statements) but rather can adjust 'missing' values in-query, like in the example below:
#standardSQL
SELECT
ThingName,
date,
IFNULL(value,
LAST_VALUE(value IGNORE NULLS)
OVER(PARTITION BY thingname ORDER BY date)
) AS value
FROM `project.dataset.my_tbl`
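To show what this forward fill does to concrete rows, here is a sketch that can be run locally. It uses SQLite from Python, and since I am not relying on IGNORE NULLS being available there, the "last non-null value so far" is emulated with a correlated subquery. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_tbl (ThingName TEXT, Date TEXT, Value INTEGER)")
conn.executemany(
    "INSERT INTO my_tbl VALUES (?, ?, ?)",
    [
        ("a", "2024-01-01", 1),
        ("a", "2024-01-02", None),
        ("a", "2024-01-03", None),
        ("a", "2024-01-04", 4),
        ("b", "2024-01-01", None),
        ("b", "2024-01-02", 9),
    ],
)

# For each NULL, take the most recent earlier non-null Value of the same thing.
filled = conn.execute(
    """
    SELECT t.ThingName, t.Date,
           COALESCE(t.Value, (
               SELECT p.Value
               FROM my_tbl p
               WHERE p.ThingName = t.ThingName
                 AND p.Date < t.Date
                 AND p.Value IS NOT NULL
               ORDER BY p.Date DESC
               LIMIT 1
           )) AS Value
    FROM my_tbl t
    ORDER BY t.ThingName, t.Date
    """
).fetchall()
print(filled)
```

The first row of thing "b" stays NULL because there is no earlier value to carry forward, which is the limitation of looking only at previous dates.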
If for some reason you actually need to update the table, the above statement will not help, as DML's UPDATE does not allow use of analytic functions, so you need to use another approach, for example the one below:
#standardSQL
SELECT
t1.ThingName, t1.date,
ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] AS value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
ON t2.ThingName = t1.ThingName
AND t2.date <= t1.date
GROUP BY t1.ThingName, t1.date, t1.value
and now you can use it to update your table as in example below
#standardSQL
UPDATE `project.dataset.my_tbl` t
SET value = new_value
FROM (
SELECT TO_JSON_STRING(t1) AS id,
ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] new_value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
ON t2.ThingName = t1.ThingName
AND t2.date <= t1.date
GROUP BY id
)
WHERE TO_JSON_STRING(t) = id
In BigQuery, updates are rather rare. The logic you seem to want is:
select t.* except (value),
coalesce(value,
last_value(value ignore nulls) over (partition by thingname order by date)
) as value
from my_tbl t;
I don't really see a reason to save this back in the table.
