Back fill timeseries data in SQL

I have data in a SQL (Vertica) database table that looks like this...
ts src val
---------------------------------
10:25:10 C 72
10:25:09 A 13
10:25:08 A 99
10:25:05 B 22
10:25:02 C 71
I need to "rotate" it into columns and backfill the last known value by the src column like so.
ts a_val b_val c_val
----------------------------
10:25:10 13 22 72
10:25:09 13 22 71
10:25:08 99 22 71
10:25:05 null 22 71
10:25:02 null null 71
I know all the possible values of the src ahead of time.

Probably the easiest way is with correlated subqueries, one per known src value. Each subquery picks the most recent value at or before the current row's ts. This won't necessarily have the best performance:
select t.ts,
(select t2.val from table t2 where t2.ts <= t.ts and t2.src = 'A' order by t2.ts desc limit 1) as a_val,
(select t2.val from table t2 where t2.ts <= t.ts and t2.src = 'B' order by t2.ts desc limit 1) as b_val,
(select t2.val from table t2 where t2.ts <= t.ts and t2.src = 'C' order by t2.ts desc limit 1) as c_val
from table t
order by t.ts desc;
An index on table(ts, src, val) might help the subqueries in a database other than Vertica.
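For anyone who wants to try the correlated-subquery idea locally, here is a minimal sketch that uses Python's sqlite3 module as a stand-in for Vertica. The table name readings and the variable names are made up for the example, and SQLite's behaviour (for instance, LIMIT inside a correlated subquery) is an assumption that may not carry over to Vertica unchanged.

```python
import sqlite3

# In-memory SQLite stands in for Vertica; "readings" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, src TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("10:25:10", "C", 72), ("10:25:09", "A", 13), ("10:25:08", "A", 99),
     ("10:25:05", "B", 22), ("10:25:02", "C", 71)],
)

# One correlated subquery per known src value: the latest val at or before t.ts.
BACKFILL_SQL = """
SELECT t.ts,
       (SELECT t2.val FROM readings t2
         WHERE t2.ts <= t.ts AND t2.src = 'A'
         ORDER BY t2.ts DESC LIMIT 1) AS a_val,
       (SELECT t2.val FROM readings t2
         WHERE t2.ts <= t.ts AND t2.src = 'B'
         ORDER BY t2.ts DESC LIMIT 1) AS b_val,
       (SELECT t2.val FROM readings t2
         WHERE t2.ts <= t.ts AND t2.src = 'C'
         ORDER BY t2.ts DESC LIMIT 1) AS c_val
FROM readings t
ORDER BY t.ts DESC
"""

backfilled = conn.execute(BACKFILL_SQL).fetchall()
for row in backfilled:
    print(row)
```

Sources that have not reported yet come back as None (SQL NULL), which matches the null cells in the desired output.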

Use analytic functions. Something like:
SELECT ts
, src
, MIN(val) val
FROM (
SELECT ts
, src
, first_value(val) OVER (
PARTITION BY src
ORDER BY ts
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) val
FROM table
) tab
GROUP BY 1, 2
ORDER BY 1, 2
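Another window-function formulation that both pivots and carries the last known value forward can be sketched as follows. It is run here through Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions); the table name readings and the helper columns ga, gb and gc are invented for the example. The idea: pivot first, use a running COUNT of non-null values to label each "fill group", then take the single non-null value in each group.

```python
import sqlite3

# In-memory SQLite stands in for Vertica; "readings" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, src TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("10:25:10", "C", 72), ("10:25:09", "A", 13), ("10:25:08", "A", 99),
     ("10:25:05", "B", 22), ("10:25:02", "C", 71)],
)

FILL_SQL = """
WITH pivoted AS (
    -- one row per ts, one column per known src value
    SELECT ts,
           MAX(CASE WHEN src = 'A' THEN val END) AS a,
           MAX(CASE WHEN src = 'B' THEN val END) AS b,
           MAX(CASE WHEN src = 'C' THEN val END) AS c
    FROM readings
    GROUP BY ts
),
grouped AS (
    -- COUNT ignores NULLs, so the running count only steps up on a new reading
    SELECT ts, a, b, c,
           COUNT(a) OVER (ORDER BY ts) AS ga,
           COUNT(b) OVER (ORDER BY ts) AS gb,
           COUNT(c) OVER (ORDER BY ts) AS gc
    FROM pivoted
)
SELECT ts,
       MAX(a) OVER (PARTITION BY ga) AS a_val,
       MAX(b) OVER (PARTITION BY gb) AS b_val,
       MAX(c) OVER (PARTITION BY gc) AS c_val
FROM grouped
ORDER BY ts DESC
"""

filled = conn.execute(FILL_SQL).fetchall()
for row in filled:
    print(row)
```

Each fill group contains at most one non-null reading, so MAX simply returns it; group 0 (before the first reading of a source) has none and stays NULL.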

Related

Rolling Average in SQL with Partition [duplicate]

create table #t
(
id int,
SomeNumt int
)
insert into #t
select 1,10
union
select 2,12
union
select 3,3
union
select 4,15
union
select 5,23
select * from #t
The SELECT above returns the following:
id SomeNumt
1 10
2 12
3 3
4 15
5 23
How do I get the following, where CumSrome is the running total of SomeNumt?
id SomeNumt CumSrome
1 10 10
2 12 22
3 3 25
4 15 40
5 23 63

select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as sum
from #t t1
inner join #t t2 on t1.id >= t2.id
group by t1.id, t1.SomeNumt
order by t1.id
SQL Fiddle example
Output
| ID | SOMENUMT | SUM |
-----------------------
| 1 | 10 | 10 |
| 2 | 12 | 22 |
| 3 | 3 | 25 |
| 4 | 15 | 40 |
| 5 | 23 | 63 |
Edit: this is a generalized solution that will work across most db platforms. When there is a better solution available for your specific platform (e.g., gareth's), use it!

The latest version of SQL Server (2012) permits the following.
SELECT
RowID,
Col1,
SUM(Col1) OVER(ORDER BY RowId ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Col2
FROM tablehh
ORDER BY RowId
or
SELECT
GroupID,
RowID,
Col1,
SUM(Col1) OVER(PARTITION BY GroupID ORDER BY RowId ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Col2
FROM tablehh
ORDER BY RowId
This is even faster. Partitioned version completes in 34 seconds over 5 million rows for me.
Thanks to Peso, who commented on the SQL Team thread referred to in another answer.

For SQL Server 2012 onwards it could be easy:
SELECT id, SomeNumt, sum(SomeNumt) OVER (ORDER BY id) as CumSrome FROM #t
This works because, when ORDER BY is given without an explicit frame, the default window frame for SUM is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (see "General Remarks" at https://msdn.microsoft.com/en-us/library/ms189461.aspx).
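The difference between the default RANGE frame and an explicit ROWS frame only shows up when the ORDER BY column has duplicates. A small sketch, run on SQLite through Python's sqlite3 module rather than SQL Server (SQLite uses the same default frame; the tables t and ties are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for SQL Server

# Sample data from the question; "t" replaces the temp table #t.
conn.execute("CREATE TABLE t (id INTEGER, SomeNumt INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (2, 12), (3, 3), (4, 15), (5, 23)])

# Default frame (RANGE ... CURRENT ROW) versus an explicit ROWS frame.
default_frame = [r[0] for r in conn.execute(
    "SELECT SUM(SomeNumt) OVER (ORDER BY id) FROM t ORDER BY id")]
rows_frame = [r[0] for r in conn.execute(
    "SELECT SUM(SomeNumt) OVER (ORDER BY id "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM t ORDER BY id")]

# Hypothetical table with a duplicated ordering key to show RANGE semantics.
conn.execute("CREATE TABLE ties (k INTEGER, val INTEGER)")
conn.executemany("INSERT INTO ties VALUES (?, ?)", [(1, 10), (1, 12), (2, 3)])
range_on_ties = [r[0] for r in conn.execute(
    "SELECT SUM(val) OVER (ORDER BY k) FROM ties ORDER BY k, val")]

print(default_frame)   # unique ids: same running total either way
print(rows_frame)
print(range_on_ties)   # rows sharing k = 1 both see the sum of all their peers
```

With unique ids both frames agree, so the short form is safe for the table in the question. With tied keys, RANGE gives every peer row the same total, which is usually not what a running total should do.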

Let's first create a table with dummy data:
Create Table CUMULATIVESUM (id tinyint , SomeValue tinyint)
Now let's insert some data into the table;
Insert Into CUMULATIVESUM
Select 1, 10 union
Select 2, 2 union
Select 3, 6 union
Select 4, 10
Here I am joining same table (self joining)
Select c1.ID, c1.SomeValue, c2.SomeValue
From CumulativeSum c1, CumulativeSum c2
Where c1.id >= c2.ID
Order By c1.id Asc
Result:
ID SomeValue SomeValue
-------------------------
1 10 10
2 2 10
2 2 2
3 6 10
3 6 2
3 6 6
4 10 10
4 10 2
4 10 6
4 10 10
Now just sum the SomeValue of c2 and we'll get the answer:
Select c1.ID, c1.SomeValue, Sum(c2.SomeValue) CumulativeSumValue
From CumulativeSum c1, CumulativeSum c2
Where c1.id >= c2.ID
Group By c1.ID, c1.SomeValue
Order By c1.id Asc
For SQL Server 2012 and above (much better performance):
Select
c1.ID, c1.SomeValue,
Sum (c1.SomeValue) Over (Order By c1.ID ) As CumulativeSumValue
From CumulativeSum c1
Order By c1.id Asc
Desired result:
ID SomeValue CumulativeSumValue
---------------------------------
1 10 10
2 2 12
3 6 18
4 10 28
Drop Table CumulativeSum

A CTE version, just for fun:
;
WITH abcd
AS ( SELECT id
,SomeNumt
,SomeNumt AS MySum
FROM #t
WHERE id = 1
UNION ALL
SELECT t.id
,t.SomeNumt
,t.SomeNumt + a.MySum AS MySum
FROM #t AS t
JOIN abcd AS a ON a.id = t.id - 1
)
SELECT * FROM abcd
OPTION ( MAXRECURSION 1000 ) -- limit recursion here, or 0 for no limit.
Returns:
id SomeNumt MySum
----------- ----------- -----------
1 10 10
2 12 22
3 3 25
4 15 40
5 23 63
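The same recursive idea can be tried outside SQL Server. Below is a rough sketch on SQLite through Python's sqlite3 module; the table name t replaces #t, and SQLite spells the CTE WITH RECURSIVE and has no MAXRECURSION option. Like the original, it assumes the ids are consecutive with no gaps, because each step looks for id - 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for SQL Server
conn.execute("CREATE TABLE t (id INTEGER, SomeNumt INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (2, 12), (3, 3), (4, 15), (5, 23)])

RECURSIVE_SQL = """
WITH RECURSIVE abcd (id, SomeNumt, MySum) AS (
    -- anchor: the first row starts the running total
    SELECT id, SomeNumt, SomeNumt
    FROM t
    WHERE id = 1
    UNION ALL
    -- step: add the next id's value to the total carried so far
    SELECT t.id, t.SomeNumt, t.SomeNumt + a.MySum
    FROM t
    JOIN abcd AS a ON a.id = t.id - 1
)
SELECT id, SomeNumt, MySum FROM abcd ORDER BY id
"""

running = conn.execute(RECURSIVE_SQL).fetchall()
for row in running:
    print(row)
```

If the ids had gaps, the recursion would stop at the first missing id, so numbering the rows first (for example with ROW_NUMBER) would be needed.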

Late answer but showing one more possibility...
Cumulative sum generation can be optimized further with CROSS APPLY.
Judging by the actual query plans, it had a lower relative cost than the INNER JOIN and OVER clause versions below ...
/* Create table & populate data */
IF OBJECT_ID('tempdb..#TMP') IS NOT NULL
DROP TABLE #TMP
SELECT * INTO #TMP
FROM (
SELECT 1 AS id
UNION
SELECT 2 AS id
UNION
SELECT 3 AS id
UNION
SELECT 4 AS id
UNION
SELECT 5 AS id
) Tab
/* Using CROSS APPLY
Query cost relative to the batch 17%
*/
SELECT T1.id,
T2.CumSum
FROM #TMP T1
CROSS APPLY (
SELECT SUM(T2.id) AS CumSum
FROM #TMP T2
WHERE T1.id >= T2.id
) T2
/* Using INNER JOIN
Query cost relative to the batch 46%
*/
SELECT T1.id,
SUM(T2.id) CumSum
FROM #TMP T1
INNER JOIN #TMP T2
ON T1.id >= T2.id
GROUP BY T1.id
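/* Using OVER clause (ORDER BY, not PARTITION BY, is what makes the sum cumulative)
Query cost relative to the batch 37% (reported for the original PARTITION BY id form)
*/
SELECT T1.id,
SUM(T1.id) OVER( ORDER BY T1.id) AS CumSum
FROM #TMP T1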
Output:-
id CumSum
------- -------
1 1
2 3
3 6
4 10
5 15

Select
*,
(Select Sum(S.SomeNumt)
From #t S
Where S.id <= M.id) As CumSrome
From #t M

You can use this simple query for progressive calculation :
select
id
,SomeNumt
,sum(SomeNumt) over(order by id ROWS between UNBOUNDED PRECEDING and CURRENT ROW) as CumSrome
from #t

There is a much faster CTE implementation available in this excellent post:
http://weblogs.sqlteam.com/mladenp/archive/2009/07/28/SQL-Server-2005-Fast-Running-Totals.aspx
The problem in this thread can be expressed like this:
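DECLARE @RT INT
SELECT @RT = 0
;
WITH abcd
AS ( SELECT TOP 100 percent
id
,SomeNumt
,MySum
FROM #t -- assumes #t has an extra MySum int column to receive the running total
order by id
)
-- relies on the update visiting rows in id order, which SQL Server does not guarantee
update abcd
set @RT = MySum = @RT + SomeNumt
output inserted.*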

For example, if you have a table with two columns, ID and Number, and want the cumulative sum:
SELECT ID,Number,SUM(Number)OVER(ORDER BY ID) FROM T

Once the table is created -
select
A.id, A.SomeNumt, SUM(B.SomeNumt) as sum
from #t A, #t B where A.id >= B.id
group by A.id, A.SomeNumt
order by A.id

The SQL solution which combines "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" and "SUM" did exactly what I wanted to achieve.
Thank you so much!
In case it helps anyone, here was my case. I wanted to add 1 to a running count whenever the maker is "Some Maker" (example); otherwise the count does not increase and the previous value is shown.
So this piece of SQL:
SUM( CASE [rmaker] WHEN 'Some Maker' THEN 1 ELSE 0 END)
OVER
(PARTITION BY UserID ORDER BY UserID,[rrank] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Cumul_CNT
Allowed me to get something like this:
User 1 Rank1 MakerA 0
User 1 Rank2 MakerB 0
User 1 Rank3 Some Maker 1
User 1 Rank4 Some Maker 2
User 1 Rank5 MakerC 2
User 1 Rank6 Some Maker 3
User 2 Rank1 MakerA 0
User 2 Rank2 Some Maker 1
Explanation of the above: the count of "Some Maker" starts at 0 and goes up by 1 each time Some Maker is found. For User 1, when MakerC is found we don't add 1, so the running count stays at 2 until the next Some Maker row.
Partitioning is by user, so when the user changes, the cumulative count goes back to zero.
I am at work and don't want any credit for this answer; I just want to say thank you and show my example in case someone is in the same situation. I was trying to combine SUM and PARTITION, but the amazing syntax "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" completed the task.
Thanks!
Groaker
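The conditional running count described above can be reproduced with a small sketch. It runs on SQLite through Python's sqlite3 module instead of SQL Server, and the table name ranks plus the lower-case column names are made up for the example; the window expression itself follows the one in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for SQL Server

# Hypothetical table holding the rows from the example output above.
conn.execute("CREATE TABLE ranks (user_id INTEGER, rrank INTEGER, rmaker TEXT)")
conn.executemany(
    "INSERT INTO ranks VALUES (?, ?, ?)",
    [(1, 1, "MakerA"), (1, 2, "MakerB"), (1, 3, "Some Maker"),
     (1, 4, "Some Maker"), (1, 5, "MakerC"), (1, 6, "Some Maker"),
     (2, 1, "MakerA"), (2, 2, "Some Maker")],
)

# Running count of 'Some Maker' rows, restarted for every user.
COUNT_SQL = """
SELECT user_id, rrank, rmaker,
       SUM(CASE rmaker WHEN 'Some Maker' THEN 1 ELSE 0 END)
           OVER (PARTITION BY user_id
                 ORDER BY rrank
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumul_cnt
FROM ranks
ORDER BY user_id, rrank
"""

counted = conn.execute(COUNT_SQL).fetchall()
cumul = [row[3] for row in counted]
print(cumul)
```

Because the window is already partitioned by user, ordering by the rank alone inside OVER is enough; repeating the user id in ORDER BY, as in the answer, does no harm.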

The pre-SQL Server 2012 answers above use queries like this:
SELECT
T1.id, SUM(T2.id) AS CumSum
FROM
#TMP T1
JOIN #TMP T2 ON T2.id <= T1.id
GROUP BY
T1.id
More efficient, since fewer rows are joined (LEFT JOIN and COALESCE keep the first row, which has no smaller id):
SELECT
T1.id, COALESCE(SUM(T2.id), 0) + T1.id AS CumSum
FROM
#TMP T1
LEFT JOIN #TMP T2 ON T2.id < T1.id
GROUP BY
T1.id

Try this
select
t.id,
t.SomeNumt,
sum(t.SomeNumt) Over (Order by t.id asc Rows Between Unbounded Preceding and Current Row) as cum
from
#t t
group by
t.id,
t.SomeNumt
order by
t.id asc;

Try this:
CREATE TABLE #t(
[name] varchar NULL,
[val] [int] NULL,
[ID] [int] NULL
) ON [PRIMARY]
insert into #t (id,name,val) values
(1,'A',10), (2,'B',20), (3,'C',30)
select t1.id, t1.val, SUM(t2.val) as cumSum
from #t t1 inner join #t t2 on t1.id >= t2.id
group by t1.id, t1.val order by t1.id

Without using any type of JOIN, the cumulative salary per person can be fetched with the following query:
SELECT * , (
SELECT SUM( salary )
FROM `abc` AS table1
WHERE table1.ID <= `abc`.ID
AND table1.name = `abc`.Name
) AS cum
FROM `abc`
ORDER BY Name

What would be the best way to write a query to produce a table given the following data?

I have a table that contains the following data:
ADD_Col Data OrderId Output NEW_ADD Col1 Col2
----- ------ ------- -----> ------- -------- -------
AD*A*1 A 96 A 1 2
AD*A*1 B 95 B 1 1
AD*A*1 C 94 C 0.8 1
AD*A*1 D 93 D 5 2
AD*A*2 1 92
AD*A*2 1 91
AD*A*2 0.8 90
AD*A*2 5 89
AD*A*3 2 88
AD*A*3 1 87
AD*A*3 1 86
AD*A*3 2 85
This data is all in the same table, and I need to link each letter to its factors. I was thinking of using ROW_NUMBER() (or DENSE_RANK()) to number the rows within each group and then joining on that number, so that each letter lines up with the factors at the same position. What would be the best way to achieve this? Query examples would be appreciated, thanks.

It seems that what you need to do here is normalise your data. Here I use PARSENAME to get the "column number" and ROW_NUMBER to number the rows within each group. Finally, I use a cross tab to pivot the data:
WITH CTE AS(
SELECT V.[Key],
V.data,
V.[Order],
PARSENAME(REPLACE(V.[Key],'*','.'),1) AS ColNo,
ROW_NUMBER() OVER (PARTITION BY V.[Key] ORDER BY V.[Order] DESC) AS RN
FROM (VALUES('AD*A*1','A',96),
('AD*A*1','B',95),
('AD*A*1','C',94),
('AD*A*1','D',93),
('AD*A*2','1',92),
('AD*A*2','1',91),
('AD*A*2','0.8',90),
('AD*A*2','5',89),
('AD*A*3','2',88),
('AD*A*3','1',87),
('AD*A*3','1',86),
('AD*A*3','2',85))V([Key],[data],[Order]))
SELECT MAX(CASE C.ColNo WHEN '1' THEN C.[data] END) AS New_ADD,
MAX(CASE C.ColNo WHEN '2' THEN C.[data] END) AS Col1,
MAX(CASE C.ColNo WHEN '3' THEN C.[data] END) AS Col2
FROM CTE C
GROUP BY C.RN;

For your sample data this will work:
with cte as (
select *,
row_number() over (partition by [key] order by [OrderId] desc) rn,
dense_rank() over (order by [key]) rk
from tablename
)
select t1.data,
max(case when t2.rk = 2 then t2.data end) col1,
max(case when t2.rk = 3 then t2.data end) col2
from (select * from cte where rk = 1) t1
inner join (select * from cte where rk in (2, 3)) t2
on t2.rn = t1.rn
group by t1.data
See the demo.
Results:
> data | col1 | col2
> :--- | :--- | :---
> A | 1 | 2
> B | 1 | 1
> C | 0.8 | 1
> D | 5 | 2
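The "number the rows in each group, then pivot on that number" approach used in the answers above can be tried with the sketch below. It runs on SQLite through Python's sqlite3 module rather than SQL Server; the table name src and the lower-case column names are invented for the example, and the group keys are compared directly instead of being parsed with PARSENAME.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for SQL Server

# Sample rows from the question; "src" is a hypothetical table name.
conn.execute("CREATE TABLE src (add_col TEXT, data TEXT, order_id INTEGER)")
conn.executemany(
    "INSERT INTO src VALUES (?, ?, ?)",
    [("AD*A*1", "A", 96), ("AD*A*1", "B", 95), ("AD*A*1", "C", 94),
     ("AD*A*1", "D", 93), ("AD*A*2", "1", 92), ("AD*A*2", "1", 91),
     ("AD*A*2", "0.8", 90), ("AD*A*2", "5", 89), ("AD*A*3", "2", 88),
     ("AD*A*3", "1", 87), ("AD*A*3", "1", 86), ("AD*A*3", "2", 85)],
)

PIVOT_SQL = """
WITH numbered AS (
    -- position of each row inside its add_col group, highest order_id first
    SELECT add_col, data,
           ROW_NUMBER() OVER (PARTITION BY add_col ORDER BY order_id DESC) AS rn
    FROM src
)
SELECT MAX(CASE WHEN add_col = 'AD*A*1' THEN data END) AS new_add,
       MAX(CASE WHEN add_col = 'AD*A*2' THEN data END) AS col1,
       MAX(CASE WHEN add_col = 'AD*A*3' THEN data END) AS col2
FROM numbered
GROUP BY rn
ORDER BY rn
"""

pivoted = conn.execute(PIVOT_SQL).fetchall()
for row in pivoted:
    print(row)
```

Each rn value groups exactly one row from every add_col group, so MAX just picks that single value per column.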

select t1.Data "Key"
, t2.Data "Col1"
, t3.Data "Col2"
from ((SELECT Data,
row_number() over (order by OrderId desc) rn
from my_table
where Key_C = 'AD*A*1') t1
left join
(SELECT Data,
row_number() over (order by OrderId desc) rn
from my_table
where Key_C = 'AD*A*2') t2
on t1.rn = t2.rn
left join
(SELECT Data,
row_number() over (order by OrderId desc) rn
from my_table
where Key_C = 'AD*A*3') t3
on t1.rn = t3.rn);
Here is the DEMO

DROP TABLE IF EXISTS #RawData
SELECT
[ADD_Col]
,[Data]
,[OrderId]
,REPLACE([ADD_Col], 'AD*A*', '') AS [Level]
,DENSE_RANK() OVER (PARTITION BY [ADD_Col] ORDER BY [OrderId] DESC) AS [Grouping]
INTO
#RawData
FROM
[SourceTable]
SELECT
rd.[Data]
,rdc1.[Data] AS [Col1]
,rdc2.[Data] AS [Col2]
FROM
#RawData AS rd
LEFT OUTER JOIN #RawData AS rdc1
ON rdc1.[Level] = 2
AND rd.[Grouping] = rdc1.[Grouping]
LEFT OUTER JOIN #RawData AS rdc2
ON rdc2.[Level] = 3
AND rd.[Grouping] = rdc2.[Grouping]
WHERE
rd.[Level] = 1

Select full row for the stats mode value

I have a table like below -
id cola colb colc
1 45 ab cd
1 45 ef cd
1 50 ab av
2 20 cd sc
2 13 cd cd
2 20 as sd
I want to first get the statistical mode of cola partitioned by id (in this case 45 for id 1 and 20 for id 2) and then select a full row that has that mode value. Is there any way to do it in one sql instead of creating inline queries?
Expected result:-
id cola colb colc
1 45 ab cd
2 20 as sd

You could try nested subqueries. Note that having count(*) > 1 only approximates the mode: it keeps every cola value that occurs more than once for an id.
select m2.* from my_table m2
inner join (
select min(m1.colb) min_colb, t1.cola, t1.id
from my_table m1
inner join (
select cola, id
from my_table
group by cola,id
having count(*)>1
) t1 on t1.cola = m1.cola and t1.id = m1.id
group by t1.cola, t1.id
) t2 on t2.cola = m2.cola and t2.id = m2.id and t2.min_colb = m2.colb

is there any way to do it in one sql instead of creating inline queries?
No, you need subqueries to perform this kind of data operations.

The statistical mode is the most commonly occurring value. You can do this with window functions:
select t.*
from (select t.*,
row_number() over (partition by id order by cnt desc) as seqnum_mode
from (select t.*,
count(*) over (partition by id, cola) as cnt
from t
) t
) t
where seqnum_mode = 1;
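A runnable sketch of the window-function approach, using Python's sqlite3 module (SQLite 3.25 or later assumed). The table name samples is made up, and the extra colb tiebreaker in ORDER BY is an assumption added so the result is deterministic: the question does not say which of several rows sharing the mode should be returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Sample rows from the question; "samples" is a hypothetical table name.
conn.execute("CREATE TABLE samples (id INTEGER, cola INTEGER, colb TEXT, colc TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [(1, 45, "ab", "cd"), (1, 45, "ef", "cd"), (1, 50, "ab", "av"),
     (2, 20, "cd", "sc"), (2, 13, "cd", "cd"), (2, 20, "as", "sd")],
)

MODE_SQL = """
SELECT id, cola, colb, colc
FROM (SELECT c.*,
             -- most frequent cola first; colb breaks ties (an assumption)
             ROW_NUMBER() OVER (PARTITION BY id
                                ORDER BY cnt DESC, colb) AS seqnum_mode
      FROM (SELECT s.*,
                   COUNT(*) OVER (PARTITION BY id, cola) AS cnt
            FROM samples s) c)
WHERE seqnum_mode = 1
ORDER BY id
"""

mode_rows = conn.execute(MODE_SQL).fetchall()
for row in mode_rows:
    print(row)
```

Without a tiebreaker, any of the rows that carry the mode value could be returned for an id.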

select a column value and the closest/nearest value from it in the same column

I have two columns
Key,Val
1 31
2 43
3 41
4 100
and my expected output is
Key,Val,closestVal
1 31 41
2 43 41
3 41 43
4 100 43
what can be the simplest sql query to have the expected output?

We can use ABS along with ROW_NUMBER here:
WITH cte AS (
SELECT t1.Key, t1.Val, t2.Val AS closestVal,
ROW_NUMBER() OVER (PARTITION BY t1.Key ORDER BY ABS(t1.Val - t2.Val)) rn
FROM yourTable t1
INNER JOIN yourTable t2
ON t1.Key <> t2.Key
)
SELECT Key, Val, closestVal
FROM cte
WHERE rn = 1;
Demo
Note: The above demo is for SQL Server, not Teradata. If KEY is a reserved keyword in Teradata, then you will have to escape it if you plan to use it as a column name.
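To check the ABS plus ROW_NUMBER approach against the sample data, here is a small sketch run on SQLite through Python's sqlite3 module (neither SQL Server nor Teradata). The table name pairs is made up, and the key column is called k to sidestep the reserved-word issue mentioned in the note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Sample rows from the question; "pairs" and "k" are hypothetical names.
conn.execute("CREATE TABLE pairs (k INTEGER, val INTEGER)")
conn.executemany("INSERT INTO pairs VALUES (?, ?)",
                 [(1, 31), (2, 43), (3, 41), (4, 100)])

CLOSEST_SQL = """
WITH cte AS (
    -- pair every row with every other row and rank the pairs by distance
    SELECT t1.k, t1.val, t2.val AS closestVal,
           ROW_NUMBER() OVER (PARTITION BY t1.k
                              ORDER BY ABS(t1.val - t2.val)) AS rn
    FROM pairs t1
    INNER JOIN pairs t2 ON t1.k <> t2.k
)
SELECT k, val, closestVal
FROM cte
WHERE rn = 1
ORDER BY k
"""

closest = conn.execute(CLOSEST_SQL).fetchall()
for row in closest:
    print(row)
```

The self-join compares every pair of rows, so the work grows quadratically with the table size; that is fine for small tables but worth keeping in mind for large ones.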

I think the most performant query would use lag() and lead(), which for some reason Teradata doesn't support directly. But a one-row window frame emulates them:
select t.*,
(case when abs(val - min(val) over (order by val rows between 1 preceding and 1 preceding)) <
abs(val - min(val) over (order by val rows between 1 following and 1 following))
then min(val) over (order by val rows between 1 preceding and 1 preceding)
else min(val) over (order by val rows between 1 following and 1 following)
end) as closest_val
from t;
In other words, no subqueries or joins are needed, only window functions. Note that the largest val has no following row, so the ELSE branch yields NULL for it; wrap that branch in COALESCE with the preceding value if the last row needs a match.

How to compute the diff between records?

My table records is like below
ym cnt
200901 57
200902 62
200903 67
...
201001 84
201002 75
201003 75
...
201101 79
201102 77
201103 80
...
I want to compute the difference between each month and the previous month.
The result should look like this:
ym cnt diff
200901 57 57
200902 62 5 (62 - 57)
200903 67 5 (67 - 62)
...
201001 84 ...
201002 75
201003 75
...
201101 79
201102 77
201103 80
...
Can anyone tell me how to write SQL to get this result with good performance?
UPDATE:
Sorry for the terse description.
My solution is:
step 1: load the current month's data into temp table 1
step 2: load the previous month's data into temp table 2
step 3: left join the two tables to compute the result
Temp_Table1
SELECT (ym - 1) as ym , COUNT( item_cnt ) as cnt
FROM _table
GROUP BY (ym - 1 )
order by ym
Temp_Table2
SELECT ym , COUNT( item_cnt ) as cnt
FROM _table
GROUP BY ym
order by ym
select ym , (b.cnt - a.cnt) as diff from Temp_Table2 a
left join Temp_Table1 b
on a.ym = b.ym
*If I want to compare a month with the same month last year, I only need to change ym - 1 to ym - 100.*
But the GROUP BY key is actually not only ym: there are up to 15 keys and up to 100 million records.
So I am looking for a solution that is easy to maintain and performs well.

For MSSQL, this has one reference to the table, so potentially it can be faster (maybe not) than left join which has two references to the table:
-- ================
-- sample data
-- ================
create table #t
(
ym varchar(6),
cnt int
)
insert into #t values ('200901', 57)
insert into #t values ('200902', 62)
insert into #t values ('200903', 67)
insert into #t values ('201001', 84)
insert into #t values ('201002', 75)
insert into #t values ('201003', 75)
-- ===========================
-- solution
-- ===========================
select
ym2,
diff = case when cnt1 is null then cnt2
when cnt2 is null then cnt1
else cnt2 - cnt1
end
from
(
select
ym1 = max(case when k = 2 then ym end),
cnt1 = max(case when k = 2 then cnt end),
ym2 = max(case when k = 1 then ym end),
cnt2 = max(case when k = 1 then cnt end)
from
(
select
*,
rn = row_number() over(order by ym)
from #t
) t1
cross join
(
select k = 1 union all select k = 2
) t2
group by rn + k
) t
where ym2 is not null
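On engines that have LAG (SQL Server 2012 and later, among others), the difference from the previous row can also be written with a single window function. A minimal sketch on SQLite through Python's sqlite3 module follows; the table name monthly is made up for the example. Note that LAG looks at the previous row, not the previous calendar month, so if months are missing the difference spans the gap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for SQL Server

# Sample rows from the answer above; "monthly" is a hypothetical table name.
conn.execute("CREATE TABLE monthly (ym TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [("200901", 57), ("200902", 62), ("200903", 67),
     ("201001", 84), ("201002", 75), ("201003", 75)],
)

# LAG(cnt, 1, 0): previous row's cnt, or 0 for the first row,
# so the first diff equals cnt as in the question's expected output.
DIFF_SQL = """
SELECT ym, cnt,
       cnt - LAG(cnt, 1, 0) OVER (ORDER BY ym) AS diff
FROM monthly
ORDER BY ym
"""

diffs = conn.execute(DIFF_SQL).fetchall()
for row in diffs:
    print(row)
```

With more grouping keys, as mentioned in the question, those keys would go into a PARTITION BY clause inside OVER so that each key combination gets its own sequence of months.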

Can anyone tell me how to write SQL to get this result
Absolutely. Simply get the row with the next highest date, and subtract.
and with good performance?
No. Relational databases are not really meant to be traversed linearly, and even using indexes appropriately would require a virtual linear traversal.
