How to do forward fill in time series data in SQL

I am trying to do a forward fill in SQL. Usually I would use pandas for such cases, but I have to use SQL because I eventually want to visualize this data in Grafana.
WITH CTE0 AS (
    SELECT [dlvrystartutc], import_h,
           COUNT(CASE WHEN import_h IS NOT NULL THEN 1 END)
               OVER (ORDER BY [dlvrystartutc]
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM T1
)
SELECT [dlvrystartutc] AS datetime,
       import_h,
       MIN(import_h) OVER (PARTITION BY grp) AS ff_import_h
FROM CTE0
ORDER BY [dlvrystartutc]
My original data is hourly, but I want to convert it to quarter-hourly (15 min) data.
Below is a snapshot of my output. It works well when two consecutive hours have data: it forward fills correctly between 16:00 and 17:00, and between 17:00 and 18:00. However, when two consecutive hours have no data it keeps forward filling. For example, there is no data at 20:00. I only want to forward fill between 19:00 and 20:00, but the current logic fills from 19:00 to 21:00 (or until it finds the next hour with data).
I can easily solve this in pandas with the limit=3 argument of ffill. Any suggestions would be helpful.
PS: I am converting hourly data to 15 min data because I need to add it to another time series that has 15 min resolution.
Thank you.
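One way to mimic pandas' ffill(limit=3) on top of the grouping CTE above is to number the rows inside each group and only fill the three 15-minute slots that follow each non-null hourly value. A minimal T-SQL sketch, assuming T1 already holds the 15-minute timestamps with NULL import_h between the hourly values:

WITH CTE0 AS (
    SELECT [dlvrystartutc], import_h,
           COUNT(CASE WHEN import_h IS NOT NULL THEN 1 END)
               OVER (ORDER BY [dlvrystartutc]
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM T1
),
CTE1 AS (
    SELECT [dlvrystartutc], import_h, grp,
           -- position within the group: 1 = the hourly value itself, 2..4 = the three 15-min slots after it
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY [dlvrystartutc]) AS pos_in_grp
    FROM CTE0
)
SELECT [dlvrystartutc] AS datetime,
       import_h,
       CASE WHEN pos_in_grp <= 4
            THEN MIN(import_h) OVER (PARTITION BY grp)
       END AS ff_import_h
FROM CTE1
ORDER BY [dlvrystartutc];

Rows more than three 15-minute steps away from the last non-null value stay NULL, which is what ffill(limit=3) would do.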

Related

Finding the initial sampled time window after using SAMPLE BY again

I can't seem to find a (perhaps easy) solution to what I'm trying to accomplish here, using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words, so bear with me.
Input
My real input is different of course but a similar dataset or case is the gas_prices table on the demo page of QuestDB. On https://demo.questdb.io, you can directly write and run queries against some sample database, so it should be easy enough to follow.
The main task I want to accomplish is to find out which month was responsible for the year's highest gallon price.
Output
Using the following query, I can get the average gallon price per month just fine.
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
timestamp                       avg_per_month
---------------------------------------------
2000-06-05T00:00:00.000000Z     1.6724
2000-07-05T00:00:00.000000Z     1.69275
2000-08-05T00:00:00.000000Z     1.635
...
Then, I take all these monthly averages, group them by year and return the maximum gallon price per year by wrapping the above query in a subquery, like so:
SELECT timestamp, max(avg_per_month) as max_per_year FROM (
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M
timestamp                       max_per_year
--------------------------------------------
2000-01-05T00:00:00.000000Z     1.69275
2001-01-05T00:00:00.000000Z     1.767399999999
2002-01-05T00:00:00.000000Z     1.52075
...
Wanted output
I want to know which month was responsible for the maximum price of a year.
Looking at the output of the above query, we see that the maximum gallon price for the year 2000 was 1.69275. Which month of 2000 had this amount as its average price? I'd like to display that month in an additional column.
For the first row, July 2000 is shown in the additional column for year 2000 because it is responsible for the highest average price in 2000. For the second row, it was May 2001 as that month had the highest average price of 2001.
timestamp                       max_per_year      which_month_is_responsible
-----------------------------------------------------------------------------
2000-01-05T00:00:00.000000Z     1.69275           2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z     1.767399999999    2001-05-05T00:00:00.000000Z
...
What did I try?
I tried adding a subquery to the SELECT to get a "duplicate" of sorts of the timestamp column, but that's apparently not valid in QuestDB (?), so perhaps the solution is adding even more subqueries in the FROM? Or a UNION?
Who can help me out with this? The data is there in the database and it can be calculated. It's just a matter of getting it out.
I think 'wanted output' can be achieved with window functions.
Please have a look at:
CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);
INSERT INTO electricity
SELECT (x*1000000)::timestamp, rnd_double()
FROM long_sequence(10000000);
SELECT day, ts, max_per_day
FROM
(
SELECT timestamp_floor('d', ts) as day,
ts,
avg_in_15_min as max_per_day,
row_number() OVER (PARTITION BY timestamp_floor('d', ts) ORDER BY avg_in_15_min desc) as rn_per_day
FROM
(
SELECT ts, avg(consumption) as avg_in_15_min
FROM electricity
SAMPLE BY 15m
)
) WHERE rn_per_day = 1
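Transplanted onto the gas_prices case from the question, the same pattern might look like the sketch below. This is a hedged, untested adaptation: it assumes timestamp_floor() accepts the 'y' (year) unit and that the demo table and column names are as shown in the question.

SELECT year_ts, timestamp AS which_month_is_responsible, max_per_year
FROM
(
    SELECT timestamp_floor('y', timestamp) AS year_ts,
           timestamp,
           avg_per_month AS max_per_year,
           -- rank the months within each year by their average price, highest first
           row_number() OVER (PARTITION BY timestamp_floor('y', timestamp)
                              ORDER BY avg_per_month DESC) AS rn_per_year
    FROM
    (
        SELECT timestamp, avg(galon_price) AS avg_per_month
        FROM 'gas_prices'
        SAMPLE BY 1M
    )
) WHERE rn_per_year = 1

Keeping only rn_per_year = 1 leaves exactly one row per year: the month whose average price was the year's maximum.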

Use SQL to ensure I have data for each day of a certain time period

I'm looking to select only one data point from each date in my report. I want to ensure each day is accounted for and has at least one row of information, as we had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data), and I want to make sure no date was left out; this data runs from last summer through now. I could use COUNT DISTINCT to check that the number of distinct days matches the number of days between the first data point and yesterday (the latest data point), but I want to verify that each day is accounted for. I should mention I am in BigQuery. Also, an example of the created_at style is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
I know FIRST is probably not the best clause for this case, and it's currently just grabbing the very first data point in created_at, but it's a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
The BigQuery equivalent of the query in your question:
SELECT created_at
FROM 'large_table'
ORDER BY created_at
LIMIT 1
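If the real goal is to verify that no calendar day is missing, another option is to generate the full date range and anti-join it against the table. A hedged BigQuery sketch, reusing the large_table / created_at names from the question (the table reference would normally be a full project.dataset.table path):

WITH bounds AS (
  SELECT DATE(MIN(created_at)) AS first_day,
         DATE(MAX(created_at)) AS last_day
  FROM `large_table`
),
calendar AS (
  -- one row per calendar day between the first and last data point
  SELECT day
  FROM bounds, UNNEST(GENERATE_DATE_ARRAY(first_day, last_day)) AS day
)
SELECT c.day AS missing_day
FROM calendar c
LEFT JOIN `large_table` t
  ON DATE(t.created_at) = c.day
WHERE t.created_at IS NULL
ORDER BY c.day;

An empty result means every day between the first and last created_at has at least one row.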

SQL performance issues with window functions on daily basis

Given ~23 million users, what is the most efficient way to compute the cumulative number of logins within the last X months for any given day (even when no login was performed)? The start date for a customer is their first-ever login; the end date is today.
Desired output
c_id  day         nb_logins_past_6_months
------------------------------------------
1     2019-01-01  10
1     2019-01-02  10
1     2019-01-03  9
...
1     today       5
➔ One line per user per day with the number of logins between current day and 179 days in the past
Approach 1
1. Cross join each customer ID with calendar table
2. Left join on login table on day
3. Compute window function (i.e. `sum(nb_logins) over (partition by c_id order by day rows between 179 preceding and current row)`)
+ Easy to understand and maintain
- Really heavy, practically impossible to run on a daily basis
- Incremental runs do not bring much benefit: we still have to go 179 days into the past
Approach 2
1. Cross join each customer ID with calendar table
2. Left join on login table on day between today and 179 days in the past
3. Group by customer ID and day to get nb logins within 179 days
+ Easier to run incrementally
- The table at step 2 exceeds 300 billion rows
What is the common way to deal with this, knowing that this is not the only use case? We have to compute other columns like this as well (number of logins in the past 12 months, etc.).
In standard SQL, you would use:
select l.*,
count(*) over (partition by customerid
order by login_date
range between interval '6 month' preceding and current row
) as num_logins_180day
from logins l;
This assumes that the logins table has a date of the login with no time component.
I see no reason to multiply 23 million users by 180 days to generate a result set in excess of 4 billion rows to answer this question.
For performance, don't do the entire task all at once. Instead, gather subtotals at the end of each month (or day or whatever makes sense for your data). Then SUM up the subtotals to provide the 'report'.
More discussion (with a focus on MySQL): http://mysql.rjweb.org/doc.php/summarytables
(You should tag questions with the specific product; different products have different syntax/capability/performance/etc.)
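As a rough illustration of the subtotal idea, the rolling count can be computed over a per-user-per-day summary instead of the raw login table. A hedged sketch in generic SQL: the table and column names (logins, c_id, login_ts) are assumptions, and the interval-based RANGE frame needs an engine that supports it (e.g. PostgreSQL 11+).

-- Pre-aggregate once: one row per user per day that had at least one login.
CREATE TABLE daily_logins AS
SELECT c_id,
       CAST(login_ts AS DATE) AS day,
       COUNT(*)               AS nb_logins
FROM logins
GROUP BY c_id, CAST(login_ts AS DATE);

-- Rolling 180-day total over the much smaller daily table.
SELECT c_id,
       day,
       SUM(nb_logins) OVER (
           PARTITION BY c_id
           ORDER BY day
           RANGE BETWEEN INTERVAL '179' DAY PRECEDING AND CURRENT ROW
       ) AS nb_logins_past_6_months
FROM daily_logins;

Days with no logins at all still need the calendar cross join from the question if one output row per day is required, but that join now runs against the daily summary rather than against every raw login.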

How to select by 1 xbar date/second in kdb+

I am trying to run a select on a table where the data ranges across multiple days, so it does not conform to the daily data that the documentation alludes to.
Applying the xbar selection across multiple days obviously results in data that is not ordered, i.e. select last size, last price by 1 xbar time.second on data that includes 2 days would result in:
second | size price
====================
00:00:01 | 400 555.5
00:00:01 | 600 606.0
00:00:02 | 400 555.5
00:00:02 | 600 606.0
How can one add the current date to the selection so that the result, as in pandas, remains ordered across multiple days, e.g. 2019-09-26 16:34:40?
Furthermore, how does one achieve this whilst maintaining a date format that is compatible with pandas once stored in CSV?
NB: It is easiest for us to assist you if you provide code that can replicate a sample of the kind of table that you are working with. Otherwise we need to make assumptions about your data.
Assuming that your time column is of timestamp type (e.g. 2019.09.03D23:11:54.711811000), a simple solution is to xbar by one as a timespan, rather than using the time.second syntax:
select last size, last price by 0D00:00:01 xbar time from data
Using xbar keeps the time column as a timestamp rather than casting it to second type.
If your time column is of some other temporal type then you can still use this method if you have a date column in your table that you can use to cast time to a timestamp. This would look something like:
select last size, last price by 0D00:00:01 xbar date+time from data
I would suggest grouping by both date and second, and then summing them:
update time: date+time from
select last size, last price
by date: `date$time, time: 1 xbar `second$time from data
Alternatively, a shorter and more efficient option is to sum the date and second directly in the by clause:
select last size, last price by time: (`date$time) + 1 xbar `second$time from data

How do I find records that have the same value in adjacent records in Sql Server? (I believe the correct term for this is a region??)

Finding the start and end time for adjacent records that have the same value?
I have a table that contains heart rate readings (in beats per minute) and a datetime field. (Actually the fields are heartrate_id, heartrate, and datetime.) The data are generated by a device that records the heart rate and time every 6 seconds. Sometimes the heart rate monitor will give false readings and the recorded beats per minute will "stick" for a period of time. By stick, I mean the beats-per-minute value will be identical in adjacent readings.
Basically I need to find all the records where the heart rate is the same (e.g. 5 beats per minute, 100 beats per minute, etc.), but only on adjacent records. If the device records 25 beats per minute for 3 consecutive readings (or 100 consecutive readings), I need to locate these events. The results need to have the heart rate, the time the heart rate started, and the time the heart rate ended, and ideally would look more or less like this:
heartrate  starttime  endtime
---------  ---------  --------
1.00       21:12:00   21:12:24
35.00      07:00:12   07:00:36
I've tried several different approaches but so far I'm striking out. Any help would be greatly appreciated!
EDIT:
Upon review, none of my original work on this answer was very good. This actually belongs to the class of problems known as gaps-and-islands, and this revised answer will use information I've gleaned from similar questions/learned since first answering this question.
It turns out this query can be done a lot more simply than I originally thought:
WITH Grouped_Run AS (SELECT heartRate, dateTime,
ROW_NUMBER() OVER(ORDER BY dateTime) -
ROW_NUMBER() OVER(PARTITION BY heartRate ORDER BY dateTime) AS groupingId
FROM HeartRate)
SELECT heartRate, MIN(dateTime), MAX(dateTime)
FROM Grouped_Run
GROUP BY heartRate, groupingId
HAVING COUNT(*) > 2
SQL Fiddle Demo
So what's happening here? One of the hallmarks of gaps-and-islands problems is the need to identify "groups" of consecutive values (or the lack thereof). Sequences are often generated to solve this, exploiting an often-overlooked fact: subtracting two such sequences yields a value that is constant within each run.
For example, imagine the following sequences, and the subtraction (the values in the rows are unimportant):
position  positionInGroup  subtraction
======================================
1         1                0
2         2                0
3         3                0
4         1                3
5         2                3
6         1                5
7         4                3
8         5                3
position is a simple sequence generated over all records.
positionInGroup is a simple sequence generated separately for each set of records that share a value. In this case there are actually 3 different sets of records (starting at position = 1, 4, 6); the first set reappears at positions 7-8, which is why its counter continues at 4 and 5.
subtraction is the difference between the other two columns. Note that the same subtraction value may repeat for different groups (here, 3 appears for two unrelated runs)!
One of the key properties the sequences must share is they must be generated over the rows of data in the same order, or this breaks.
So how does the SQL do this? Through ROW_NUMBER(), which generates a sequence of numbers over a "window" of records:
ROW_NUMBER() OVER(ORDER BY dateTime)
will generate the position sequence.
ROW_NUMBER() OVER(PARTITION BY heartRate ORDER BY dateTime)
will generate the positionInGroup sequence, with each heartRate being a different group.
In most queries of this type the values of the two sequences are unimportant; it's the subtraction (which identifies the sequence group) that matters, so we only need its result.
We'll also need the heartRate and the times in which they occurred to provide the answer.
The original question asked for the start and end times of each of the "runs" of stuck heartbeats. That's a standard MIN(...)/MAX(...), which means a GROUP BY. We need to group by both the original heartRate column (because that's a non-aggregated column) and our generated groupingId (which identifies the current "run" per stuck value).
Part of the question asked for only runs that repeated three or more times. The HAVING COUNT(*) > 2 is an instruction to ignore runs of length 2 or less; it counts rows per-group.
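As a small extension (a hedged sketch, not part of the original answer), returning the run length alongside the start and end times makes it easier to tune the HAVING threshold:

WITH Grouped_Run AS (
    SELECT heartRate, dateTime,
           ROW_NUMBER() OVER (ORDER BY dateTime) -
           ROW_NUMBER() OVER (PARTITION BY heartRate ORDER BY dateTime) AS groupingId
    FROM HeartRate
)
SELECT heartRate,
       MIN(dateTime) AS starttime,
       MAX(dateTime) AS endtime,
       COUNT(*)      AS stuck_readings   -- number of consecutive identical readings in the run
FROM Grouped_Run
GROUP BY heartRate, groupingId
HAVING COUNT(*) > 2
ORDER BY starttime;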
I recommend Ben-Gan's article on interval packing, which applies to your adjacency problem.
tsql-challenge-packing-date-and-time-intervals
solutions-to-packing-date-and-time-intervals-puzzle