I am trying to get a lag of one of my columns on an irregular time series. The data looks as follows:
timestamp (seconds), temperature
1, 20
4, 12
6, 13
7, 18
The new dataset should look as follows:
timestamp (seconds), temperature, lagged_1_temperature
1, 20, 0
4, 12, 0
6, 13, 0
7, 18, 13
As you can see, only the lag for the last row is non-zero: it is the only row whose previous reading is exactly one second earlier.
For a typical lag I use the Hive query below inside my Spark application:
"select timestamp, value, lag(value, 1) OVER (ORDER BY timestamp) as lagged_1_value"
Can I change the above Hive query to give me the result I want?
You can do this with a case expression:
select t.*,
       case when timestmp - coalesce(lag(timestmp, 1) over (order by timestmp), 0) = 1
            then coalesce(lag(temperature, 1) over (order by timestmp), 0)
            else 0
       end as lagged_1_temperature
from t
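To sanity-check the logic, here is a hypothetical self-contained version (Hive/Spark SQL syntax) that inlines the sample rows; only the row with timestmp = 7 has a reading exactly one second earlier, so only it gets a non-zero lag:
with t as (
  select 1 as timestmp, 20 as temperature union all
  select 4, 12 union all
  select 6, 13 union all
  select 7, 18
)
select t.*,
       case when timestmp - coalesce(lag(timestmp, 1) over (order by timestmp), 0) = 1
            then coalesce(lag(temperature, 1) over (order by timestmp), 0)
            else 0
       end as lagged_1_temperature
from t;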
A simple left join might be more efficient:
select t.*,
       coalesce(tprev.temperature, 0) as lagged_1_temperature
from t left join
     t tprev
     on tprev.timestmp = t.timestmp - 1;
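Note that the join can stand in for lag() here only because the condition is an exact one-second offset. Applied to the same inline sample t as above, a sketch of the join version:
with t as (
  select 1 as timestmp, 20 as temperature union all
  select 4, 12 union all
  select 6, 13 union all
  select 7, 18
)
select t.*,
       coalesce(tprev.temperature, 0) as lagged_1_temperature
from t
left join t tprev
  on tprev.timestmp = t.timestmp - 1;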
I have a table as shown below.
time                   Event
2021-03-19T17:15:05    A
2021-03-19T17:15:11    B
2021-03-19T17:15:11    C
2021-03-19T17:15:12    A
2021-03-19T17:15:14    C
I want to find the average time between event A and the event following it.
How do I find it using an SQL query?
The desired output here is 4 seconds (A at 17:15:05 is followed 6 seconds later, A at 17:15:12 is followed 2 seconds later, and the average of 6 and 2 is 4).
I really appreciate any help you can provide.
The basic idea is to use lead() to get the time from the next row, and then calculate the difference. So, for all rows:
select t.*,
       (to_unix_timestamp(lead(time) over (order by time)) -
        to_unix_timestamp(time)
       ) as diff_seconds
from t;
Then use a subquery, filtering for just the 'A' rows and taking the average:
select avg(diff_seconds)
from (select t.*,
             (to_unix_timestamp(lead(time) over (order by time)) -
              to_unix_timestamp(time)
             ) as diff_seconds
      from t
     ) t
where event = 'A';
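For reference, a hypothetical end-to-end check against the sample rows (to_unix_timestamp and this cast syntax are Hive/Spark SQL; adjust for other engines). The expected result is 4.0:
with t as (
  select cast('2021-03-19 17:15:05' as timestamp) as time, 'A' as event union all
  select cast('2021-03-19 17:15:11' as timestamp), 'B' union all
  select cast('2021-03-19 17:15:11' as timestamp), 'C' union all
  select cast('2021-03-19 17:15:12' as timestamp), 'A' union all
  select cast('2021-03-19 17:15:14' as timestamp), 'C'
)
select avg(diff_seconds) as avg_seconds_after_a
from (select t.*,
             to_unix_timestamp(lead(time) over (order by time)) -
             to_unix_timestamp(time) as diff_seconds
      from t
     ) t
where event = 'A';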
I am working with a table that contains timeseries data, with a row for each minute for each user.
I want to compute some aggregate functions on a rolling window of N calendar days.
This is achieved via
SELECT
    SOME_AGGREGATE_FUN(col) OVER (
        PARTITION BY user_id
        ORDER BY timestamp_col
        ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
    ) AS my_col
FROM my_table
However, I am only interested in the result of this at a daily scale.
i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.
Right now I am doing this:
WITH agg_results AS (
    SELECT
        user_id,
        timestamp_col,
        SOME_AGGREGATE_FUN(col) OVER (
            PARTITION BY user_id
            ORDER BY timestamp_col
            ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
        ) AS my_col
    FROM my_table
)
SELECT * FROM agg_results
WHERE timestamp_col = DATE_TRUNC('day', timestamp_col)
This works in theory, but it does 60 * 24 times more computations than necessary, making the query very slow.
Essentially, I am trying to find a way to make the right window bound skip rows based on a condition, or, if that is simpler to implement, to evaluate the window only at every nth row (since I have a constant number of rows for each day).
I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):
select
    timestamp_col,
    (
        select some_aggregate_fun(t1.col)
        from my_table t1
        where
            t1.user_id = t.user_id
            and t1.timestamp_col >= t.timestamp_col - interval '1' day
            and t1.timestamp_col <= t.timestamp_col
    ) as my_col
from my_table t
where timestamp_col = date_trunc('day', timestamp_col)
I am unsure that this would perform better than your original query though; you might need to assess that against your actual dataset.
You can change interval '1' day to the actual window length you need (for example, interval '7' day for N = 7).
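If the correlated subquery is still slow, an equivalent join-then-aggregate formulation may be worth trying. This is only a sketch under the same assumed table and column names, and whether it is actually faster depends on the optimizer:
select d.user_id,
       d.timestamp_col,
       some_aggregate_fun(m.col) as my_col
from (select user_id, timestamp_col
      from my_table
      where timestamp_col = date_trunc('day', timestamp_col)
     ) d
join my_table m
  on m.user_id = d.user_id
 and m.timestamp_col >= d.timestamp_col - interval '1' day
 and m.timestamp_col <= d.timestamp_col
group by d.user_id, d.timestamp_col;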
I am trying to create the following logic in Alteryx; the data comes from an Exasol database.
Column "Sum_Qty_28_days" should sum up the values of the "Qty" column for the same article over the last 28 days.
My sample data looks like this, and I want the following output (screenshots not reproduced here):
E.g. the "Sum_Qty_28_days" value for "article" = 'A' and date = '2019-10-08' is 8, because it sums the "Qty" values associated with the dates falling within the previous 28 days, which are:
2019-09-15
2019-10-05
2019-10-08
for “article” = ‘A’.
Is this possible using an SQL window function?
I tried the following code myself:
SUM("Qty") OVER (PARTITION BY "article", date_trunc('month', "Date")
                 ORDER BY "Date")
But it is far from what I need: it sums Qty over dates falling in the same month, whereas I need the sum of Qty over the last 28 days.
Thanks in advance.
Yes, this is possible using a standard SQL range frame, which many databases support. However, it will not work in all databases:
select t.*,
       sum(qty) over (partition by article
                      order by date
                      range between interval '27' day preceding and current row
                     ) as sum_qty_28_days
from t;
If your RDBMS does not support the range frame, an alternative solution is an inline subquery (27 days preceding plus the current day gives the 28-day window):
select
    t.*,
    (
        select sum(t1.qty)
        from mytable t1
        where
            t1.article = t.article
            and t1.date between t.date - interval '27' day and t.date
    ) sum_qty_28_days
from mytable t
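As a hypothetical check (Postgres-style syntax; the Qty values are invented so that article 'A' sums to 8 on 2019-10-08, as described in the question, and the date column is renamed dt to avoid the reserved word):
with mytable (article, dt, qty) as (
  values ('A', date '2019-09-15', 2),
         ('A', date '2019-10-05', 3),
         ('A', date '2019-10-08', 3)
)
select t.*,
       (select sum(t1.qty)
        from mytable t1
        where t1.article = t.article
          and t1.dt between t.dt - interval '27' day and t.dt
       ) as sum_qty_28_days
from mytable t;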
I have a table with signal name, value, and timestamp. The signals were recorded at a sampling rate of 1 sample/sec. Now I want to plot a graph of months' worth of values, and it is becoming very heavy for the system to render quickly. So my question is: is there any way to view one value per minute, in other words, every 60th row?
You can use the row_number() function to enumerate the rows, and then use modulo arithmetic to keep every 60th row:
select signalname, value, timestamp
from (select t.*,
             row_number() over (order by timestamp) as seqnum
      from yourtable t
     ) t
where seqnum % 60 = 0;
If your data really is regular, you can also extract the seconds value and keep the rows where it is 0:
select signalname, value, timestamp
from yourtable t
where datepart(second, timestamp) = 0
This assumes that timestamp is stored in an appropriate date/time format.
Instead of sampling, you could use the one minute average for your plot:
select name
, min(timestamp)
, avg(value)
from Yourtable
group by
name
, datediff(minute, '2013-01-01', timestamp)
If you are charting months, even the hourly average might be detailed enough.
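For example, the hourly version only changes the datediff unit (same assumed table and column names as above):
select name
     , min(timestamp) as bucket_start
     , avg(value) as avg_value
from Yourtable
group by
       name
     , datediff(hour, '2013-01-01', timestamp)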
I have a table with sequential timestamps:
2011-03-17 10:31:19
2011-03-17 10:45:49
2011-03-17 10:47:49
...
I need to find the average time difference between each of these (there could be dozens), in seconds or whatever is easiest; I can work with it from there. So, for example, the inter-arrival time for only the first two times above would be 870 (14m 30s). For all three times it would be (870 + 120)/2 = 495 (8m 15s).
A note: I am using PostgreSQL 8.1.22.
EDIT: The table I mention above is from a different query that is literally just a one-column list of timestamps
Not sure I understood your question completely, but this might be what you are looking for:
SELECT avg(difference)
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t
The inner query calculates the distance between each row and the preceding row; the result is an interval for each row in the table. The outer query simply averages all the differences. Note that lag() is a window function, which PostgreSQL only gained in 8.4, so this will not run on 8.1 (see the last answer below for a workaround).
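A hypothetical check with the three sample timestamps (both CTEs and lag() require PostgreSQL 8.4 or later); the expected average is 00:08:15, i.e. 495 seconds:
WITH your_table (timestamp_col) AS (
  VALUES (timestamp '2011-03-17 10:31:19'),
         (timestamp '2011-03-17 10:45:49'),
         (timestamp '2011-03-17 10:47:49')
)
SELECT avg(difference)
FROM (
  SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
  FROM your_table
) t;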
I think you want to find the average timestamp. My solution is avg(current - min value); since the result is an interval, add it back to the min value again:
SELECT avg(target_col - (select min(target_col) from your_table))
+ (select min(target_col) from your_table)
FROM your_table
If you cannot upgrade to a version of PG that supports window functions, you may compute your table's sequential steps "the slow way."
Assuming your table is "tbl" and your timestamp column is "ts":
SELECT AVG(t1 - t0)
FROM (
-- All this silliness would be moot if we could use
-- `` lead(ts) over (order by ts) ''
SELECT tbl.ts AS t0,
next.ts AS t1
FROM tbl
CROSS JOIN
tbl next
WHERE next.ts = (
SELECT MIN(ts)
FROM tbl subquery
WHERE subquery.ts > tbl.ts
)
) derived;
But don't do that. Its performance will be terrible. Please do what a_horse_with_no_name suggests, and use window functions.
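Incidentally, since consecutive differences telescope (their sum is simply the last timestamp minus the first), the average gap can also be computed with plain aggregates, which works even on 8.1. A minimal sketch, assuming the same tbl/ts names as the previous answer:
-- (max - min) equals the sum of all consecutive gaps; dividing by the
-- number of gaps (count(*) - 1) yields the average difference as an
-- interval. Guard against single-row tables, where count(*) - 1 is zero.
SELECT (MAX(ts) - MIN(ts)) / (COUNT(*) - 1) AS avg_difference
FROM tbl;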