Time-interval-based buckets in a Redis sorted set

Is there any way to generate time-interval-based buckets using Redis sorted sets? I want to create a different sorted set for each time interval (let's say 15 minutes).
t1, t2, ... are scores:
Key           SortedSet
bucket#V1     (t1,1),(t2,2)..... (committed bucket)
bucket#V1+15  (t3,1),(t4,2)..... (committed bucket)
bucket#V1+30  (t5,1),(t6,2)..... (currently running bucket)
i.e. every 15 minutes it should automatically create a new key and start ingesting data into a new sorted set; V1+15 should start 15 minutes after V1, and so on.
The second challenge is how to query committed buckets (not running buckets into which data is still being ingested).
The end goal is to find the committed buckets first, then query the data in each bucket with time-range queries on the score (i.e. ZRANGEBYSCORE).

Your key is going to be something like DateTime.Now.SecondsSinceEpoch / TimeSpan.FromMinutes(15).TotalSeconds, formatted as a fixed-length string. Here is a PowerShell script showing how to get the interval key value; you can use a similar routine for storing data or for requesting any past (or future) interval. Here the interval is 3 seconds and the epoch value is Jan 1, 1970, but you can use anything you want.
1..7 | % { $interval = [int] (((get-date) - [datetime]'1970-01-01').TotalSeconds / 3) ; $interval ; start-sleep -Seconds 1 }
514663592
514663593
514663593
514663593
514663594
514663594
514663594
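Translating the same key arithmetic into application code, here is a minimal Python sketch using redis-py; the bucket#<index> key format, the helper names, and the four-bucket lookback are illustrative assumptions rather than anything prescribed above. A bucket counts as committed simply because its interval index is lower than the current interval's index, so no explicit "commit" step is needed.
import time

import redis  # redis-py client (assumed available)

r = redis.Redis()

BUCKET_SECONDS = 15 * 60  # 15-minute interval from the question


def bucket_key(ts=None):
    # Interval index = seconds-since-epoch // interval, fixed width so keys sort lexically.
    ts = int(time.time()) if ts is None else int(ts)
    return "bucket#%010d" % (ts // BUCKET_SECONDS)


def ingest(member, ts=None):
    # New keys appear automatically: the first ZADD after an interval boundary
    # simply lands in the next bucket key.
    ts = int(time.time()) if ts is None else int(ts)
    r.zadd(bucket_key(ts), {member: ts})


def committed_bucket_keys(lookback=4):
    # Every bucket whose index is strictly lower than the current one is committed.
    current = int(time.time()) // BUCKET_SECONDS
    return ["bucket#%010d" % i for i in range(current - lookback, current)]


def query_range(start_ts, end_ts):
    # Time-range query on the score, per committed bucket, via ZRANGEBYSCORE.
    results = []
    for key in committed_bucket_keys():
        results.extend(r.zrangebyscore(key, start_ts, end_ts, withscores=True))
    return results
If old buckets should age out, each key could additionally be given a TTL once its interval has passed.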

Related

SQL: a time-series variant of the "every nth row" problem

I have a table of time-series data, with the columns:
sensor_number (integer primary key)
signal_strength (integer)
signal_time (timestamp)
Each sensor creates 20-30 rows per minute. I need a query that returns, for a sensor, 1 row per minute (or every 2 minutes, 3 minutes, etc.). A pure SQL approach is to use a window function with a partition on an expression that rounds the timestamp appropriately (date_trunc() works for the 1-minute case; otherwise I have to do some messy casting). The problem is that the expression blocks the ability to use the index. With 5B rows, that's a killer.
The best alternative I can come up with is a user-defined function that uses a cursor to step through the table in index key order (sensor_number, signal_time) and outputs a row every time the timestamp crosses a minute boundary. That's still slow, though. Is there a pure SQL approach that will accomplish this AND utilize the index?
I think if you're returning enough rows, scanning the whole range of rows that match the sensor_number will just be the best plan. The signal_time portion of the index may simply not be helpful at that point, because the database needs to read so many rows anyway.
However, if your time interval is big enough / the number of rows you're returning is small enough, it might be more efficient to hit the index separately for each row you're returning. Something like this (using an interval of 3 minutes and a sensor number of 5 as an example):
WITH range AS (
    SELECT
        max(signal_time) AS max_time,
        min(signal_time) AS min_time
    FROM timeseries
    WHERE sensor_number = 5
)
SELECT sample.*
FROM range
JOIN generate_series(min_time, max_time, interval '3 minutes') timestamp ON true
JOIN LATERAL (
    SELECT *
    FROM timeseries
    WHERE sensor_number = 5
      AND signal_time >= timestamp
      AND signal_time < timestamp + interval '3 minutes'
    LIMIT 1
) sample ON true;

Firebird time-sequence data with gaps

I'm using Firebird 2.5 through Delphi 10.2/FireDAC. I have a table of historical data from industrial devices, one row per device at one-minute intervals. I need to be able to retrieve an arbitrary time window of records on an arbitrary interval span (every minute, every 3 minutes, every 15 minutes, etc.), which then gets used to generate a trending grid display or a report. This example query, pulling records at 3-minute intervals over a two-day window for 2 devices, does most of what I need:
select *
from avc_history
where (avcid in (11,12))
and (tstamp >= '6/26/2022') and (tstamp < '6/27/2022')
and (mod(extract(minute from tstamp), 3) = 0)
order by tstamp, avcid;
The problem with this is that, if device #11 has no rows for part of the time span, there is a mismatch in the number of rows returned for each device. I need to return a row for each device for each interval period regardless of whether a row with that timestamp actually exists in the table; the other fields of the row can be null as long as the tstamp field has the timestamp and the avcid field has the device's ID.
I can fix this up in code, but that's a memory and time hog, so I want to avoid that if possible. Is there a way to structure a query (or if necessary, a stored procedure) to do what I need? Would some capability in the FireDAC library that I'm not aware of make this easier?

SUM of last 24 hour scores within a specific range in a sorted set (Redis)

Is there a way to calculate the SUM of the scores saved over the last 24 hours while respecting the performance of the Redis server (for around 1 million new rows added per day)?
What is the right format to use in order to store users' timestamps and scores using sorted sets?
Actually I am using this command:
ZADD allscores 1570658561 20
The sorted-set score is the actual time in seconds, and the other field (the member) is the real user score.
But there is a problem here! When another user gets the same score (20), it is not added, since that member is already present. Any solution for this problem?
I am thinking of using a Lua script, but there are two headaches:
First, the Lua script will block other commands from working until it finishes its job, which is not good practice in my case since the script would have to run 24/7 while many users need to fetch data from the Redis cache server at the same time (user scores, history info, etc.). Plus, the Lua script would have to deal each time with the many records saved each day under a specific key, so while the Lua script is working, users can't fetch data, and the script would run in a loop all the time.
Second, and related to the first problem: using the timestamp as the score (so I can return 24-hour data) does not let me store the same user score twice.
If you were in my case, how would you deal with this? Thanks.
Considering that the data is needed for the last 24 hours (a sliding window) and that around 1 million new rows per day are possible, we cannot use the sorted set data structure to compute the sum with high performance.
Here is a high-performance design that also solves your duplicate-score issue:
Instead, by giving up a little accuracy, you can have a highly performant system by crunching the data within a window.
Sample Input data:
input 1: user 1 wants to add time: 11:10:01 score: 20
input 2: user 2 wants to add time: 11:11:02 score: 20
input 3: user 1 wants to add time: 11:17:04 score: 50
You can have 1-minute, 5-minute or 1-hour accuracy and decide the window size based on that.
If you accept an approximation at 1-hour granularity, you can do this on insertion:
for input 1:
INCRBY SCORES_11_hour 20
for input 2:
INCRBY SCORES_11_hour 20
for input 3:
INCRBY SCORES_11_hour 50
To get the data for the last 24 hours, you only need to sum up 24 hourly keys:
MGET SCORES_previous_day_12_hour SCORES_previous_day_13_hour SCORES_previous_day_14_hour .... SCORES_current_day_10_hour SCORES_current_day_11_hour
If you accept an approximation at 5-minute granularity, then on insertion, along with incrementing the hourly keys, you also need to store the 5-minute-window data (each 5-minute key below is named after the starting minute of the window the input falls into):
for input 1:
INCRBY SCORES_11_hour 20
INCRBY SCORES_11_hour_10_minutes 20
for input 2:
INCRBY SCORES_11_hour 20
INCRBY SCORES_11_hour_10_minutes 20
for input 3:
INCRBY SCORES_11_hour 50
INCRBY SCORES_11_hour_15_minutes 50
To get the data for the last 24 hours, you then only need to sum up 23 hourly keys (the whole hours' data) plus up to 12 five-minute-window keys.
If the time added is always the current time, you can optimize further (assuming that once it is the 11th hour, the data for the 10th, 9th and earlier hours won't change at all).
Since you said it is going to run 24/7, we can also reuse computed values from previous iterations.
Say the sum is computed at the 11th hour; you would then have the values for the past 24 hours.
If it is computed again at the 12th hour, you can reuse the sum for the 22 intermediate hours whose data is unchanged and fetch only the missing 2 hours' data from Redis.
Similar further optimisations can be applied based on your needs.
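Here is a minimal Python (redis-py) sketch of the hourly-bucket scheme described above; the SCORES_<date>_<hour>_hour key format and the helper names are assumptions for illustration, not something the answer prescribes.
from datetime import datetime, timedelta, timezone

import redis  # redis-py client (assumed available)

r = redis.Redis()


def hour_key(dt):
    # One plain counter key per hour, e.g. SCORES_2019_10_09_11_hour (format assumed).
    return dt.strftime("SCORES_%Y_%m_%d_%H_hour")


def add_score(score, dt=None):
    # Crunch the score into its hourly window instead of one huge sorted set,
    # so duplicate user scores are simply summed into the counter.
    dt = dt or datetime.now(timezone.utc)
    r.incrby(hour_key(dt), score)


def last_24h_sum(now=None):
    # Sum the 24 most recent hourly buckets with a single MGET.
    now = now or datetime.now(timezone.utc)
    keys = [hour_key(now - timedelta(hours=h)) for h in range(24)]
    return sum(int(v) for v in r.mget(keys) if v is not None)
Giving each hourly key a TTL of a little more than 24 hours (not shown) would keep stale buckets from piling up.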

What should my Hive partitioning strategy and view strategy be so that queries run efficiently and return results within 10 seconds?

My use case is that I have two data sources:
1. Source1 (as the speed layer)
2. A Hive external table on top of S3 (as the batch layer)
I am using Presto to query data from both data sources through a view.
I want to create a view that unions data from both sources, like: "create view test as select * from Source1.table union all select * from hive.table"
We keep 24 hours of data in Source1, and after 24 hours that data is migrated to S3 via Hive.
The columns of the Source1 tables are: timestamp, logtype, company, category.
Users will query data using a timestamp range (they can query data of the last 15/30 minutes, last x hours, last x days, last x months, etc.),
for example: "select * from test where timestamp > (now() - interval '15' minute)", "select * from test where timestamp > (now() - interval '12' hour)", "select * from test where timestamp > (now() - interval '1' day)"
To satisfy these queries I need to partition the Hive table, and the user should not be aware of the underlying strategy, i.e. if a user is querying the last x minutes of data, he/she should not have to care whether Presto is reading it from Source1 or from Hive.
What should my Hive partitioning strategy and view strategy be so that queries run efficiently and return results within 10 seconds?
For Hive, the partition column should be one that will be queried on in the filter.
In your case this is timestamp. However, if you partitioned on timestamp directly it would create a partition for every second (or millisecond), depending on the data in the column.
A better solution would be to create columns like year, month, day, hour (derived from timestamp) and to use these as partition columns.
The same strategy will work for Kudu; however, be advised it could create hot-spotting, since all newly arriving records go to the same (most recent) partition, which will limit insert (and maybe query) performance.
To overcome this, use one additional column as a hash partition along with the timestamp-derived columns,
e.g. year, month, day, hour, logtype.
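Purely as an illustration of the derived-column idea (the column list comes from the answer above; the partition path shown in the comment and the helper name are assumptions), the partition values could be computed at ingest time along these lines:
from datetime import datetime, timezone


def partition_values(event_ts: datetime, logtype: str) -> dict:
    # year/month/day/hour derived from the event timestamp, plus logtype as the
    # extra column that spreads writes across the most recent time partition.
    return {
        "year": event_ts.year,
        "month": event_ts.month,
        "day": event_ts.day,
        "hour": event_ts.hour,
        "logtype": logtype,
    }


# A record arriving now would land under a partition path such as
# .../year=2022/month=6/day=26/hour=11/logtype=app/ (layout assumed).
print(partition_values(datetime.now(timezone.utc), "app"))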

Delta job on BigQuery misses records

I'm facing an odd issue with a small delta job that I implemented on top of a streaming BigQuery table with Apache Beam.
I'm streaming data to a BigQuery table, and every hour I run a job to copy any new records from that streaming table to a reconciled table. The delta is built on top of a CreatedDatetime column I introduced on the streaming table. Once a record gets loaded to the streaming table, it gets the current UTC timestamp. So the delta naturally takes all records that have a newer CreatedDatetime than the last run, up to the current time the batch is running:
CreatedDatetime >= LastDeltaDate AND
CreatedDatetime < NowUTC
The logic for LastDeltaDate is as follows:
1. Start: LastDeltaDate = 2017-01-01 00:00:00
2. 1st Delta Run:
- NowUTC = 2017-10-01 06:00:00
- LastDeltaDate = 2017-01-01 00:00:00
- at the end of the successful run LastDeltaDate = NowUTC
3. 2nd Delta Run:
- NowUTC = 2017-10-01 07:00:00
- LastDeltaDate = 2017-10-01 06:00:00
- at the end of the successful run LastDeltaDate = NowUTC
...
Now, every other day I find records that are in my streaming table but never arrived in my reconciled table. When I check the timestamps, I see they're far away from the batch run, and when I check the Google Dataflow log I can see that no records were returned for the query at that time; but when I run the same query now, I get the records. Is there any way a streamed record could arrive super late in a query, or is it possible that Apache Beam is processing a record but not writing it for a long time? I'm not applying any windowing strategy.
Any ideas?
When performing streaming inserts, there is a delay in how quickly those rows are available for batch exports, as described in the data availability section of the BigQuery documentation.
So, at time T2 you may have streamed a bunch of rows into BigQuery that are still stored in the streaming buffer. You then run a batch job from time T1 to T2, but you only see the rows up to T2 minus the buffer delay. As a result, any rows that are still in the buffer at the time of a delta run will be dropped.
You may need to make your selection of NowUTC aware of the streaming buffer, so that the next run processes the rows that were still in the buffer.
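As a rough sketch of what a buffer-aware upper bound could look like, in Python; the 90-minute margin is an arbitrary placeholder (check the data availability documentation for a value appropriate to your tables) and the table name is assumed:
from datetime import datetime, timedelta, timezone

# Assumed safety margin for rows that may still sit in the streaming buffer.
STREAMING_BUFFER_LAG = timedelta(minutes=90)


def delta_window(last_delta_date: datetime):
    # Cap NowUTC below the buffer horizon so buffered rows are picked up by a
    # later run instead of being skipped forever.
    now_utc = datetime.now(timezone.utc) - STREAMING_BUFFER_LAG
    return last_delta_date, now_utc


lower, upper = delta_window(datetime(2017, 10, 1, 6, 0, tzinfo=timezone.utc))
query = (
    "SELECT * FROM streaming_table "  # table name assumed
    "WHERE CreatedDatetime >= '{}' AND CreatedDatetime < '{}'".format(
        lower.isoformat(), upper.isoformat()
    )
)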