Data discretization in a DB in an intelligent way - SQL

For my future project I have a ClickHouse DB. This DB is fed by several micro-services, which are themselves fed by RabbitMQ.
The data looks like:
| Datetime | nodekey | value |
| ------------------- | ------- | ----- |
| 2018-01-01 00:10:00 | 15 | 156 |
| 2018-01-01 00:10:00 | 18 | 856 |
| 2018-01-01 00:10:00 | 86 | 8 |
| 2018-01-01 00:20:00 | 15 | 156 |
| 2018-01-01 00:20:00 | 18 | 84 |
| 2018-01-01 00:20:00 | 86 | 50 |
...
So for hundreds of different nodekeys, I have a value every 10 minutes.
I need another table with the sum or the mean (depending on the nodekey type) of the values for every hour...
My first idea was just to use a crontab...
But the data doesn't come in a steady flow: sometimes a micro-service adds 2-3 new values, sometimes a whole week of data arrives, and occasionally I have to bulk insert a year of new data...
And for the moment I only have hundreds of nodekeys, but the project is going to grow.
So I think using a crontab or looping through the DB to update the data isn't a good idea...
What are my other options?

How about just creating a view?
create view myview as
select
    toStartOfHour(datetime) date_hour,
    nodekey,
    sum(value) sum_value
from mytable
group by
    toStartOfHour(datetime),
    nodekey
The advantage of this approach is that you don't need to worry about refreshing the data. When querying the view, you actually access the underlying live data. The downside is that it might not scale well when your dataset becomes really big (queries addressing the view will tend to slow down).
An intermediate option would be to use a materialized view, which will persist the data. If I correctly understand the ClickHouse documentation, materialized views are automatically updated when new data is inserted into the source table, which seems to be close to what you are looking for (however, you need to use the proper engine, and this might impact the performance of your inserts).
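For illustration, a minimal sketch of that materialized-view approach, assuming the source table is called mytable and using a SummingMergeTree engine to keep hourly sums (the view name and the POPULATE backfill are assumptions, not part of the question):

create materialized view myview_hourly
engine = SummingMergeTree()
order by (nodekey, date_hour)
populate
as select
    toStartOfHour(datetime) as date_hour,
    nodekey,
    sum(value) as sum_value
from mytable
group by date_hour, nodekey

Because SummingMergeTree folds rows with the same (nodekey, date_hour) together only during background merges, queries against the view should still aggregate, e.g. select nodekey, date_hour, sum(sum_value) ... group by nodekey, date_hour.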

Related

dbt snapshots with non-unique records in the source

I’m interested to know if someone here has ever come across a situation where the source is not always unique when dealing with snapshots in dbt.
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the respective table in the data lake.
By the time the dbt solution is run, my source could have more than one row per unique id, as the data has changed more than once since the last run.
Ideally, I’d like to update the respective dbt_valid_to columns in the snapshot table with the earliest updated_at record from the source, and subsequently add the new records to the snapshot table, making the latest updated_at record the current one.
I know how to achieve this using window functions, but I’m not sure how to handle such a situation with dbt.
I wonder if anybody has faced this same issue before.
Snapshot Table
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| --- | --- | --- | --- |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | null |
Source Table
| **id** | **some_attribute** | **updated_at** | |
| --- | --- | --- | --- |
| 123 | ABCD | 2021-01-01 00:00:00 | already loaded to the snapshot |
| 123 | ZABC | 2021-06-30 00:00:00 | already loaded to the snapshot |
| 123 | ZZAB | 2021-11-21 00:10:00 | new since the last run |
| 123 | FXAB | 2021-11-21 15:11:00 | new since the last run |
Snapshot Desired Result
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| --- | --- | --- | --- |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123 | ZZAB | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123 | FXAB | 2021-11-21 15:11:00 | null |
Standard snapshots operate under the assumption that the source table being snapshotted is changed in place, without storing history. That is the opposite of the behaviour we have here (the source table we are snapshotting is basically nothing more than an append-only log of events), which means we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.
I have some sample code where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.
The easiest workaround would be a staging view downstream of the source that contains the window function you describe. Then you snapshot that view.
However, I do see potential for a new snapshot strategy that handles append-only sources. Perhaps you’d like to peruse the dbt snapshot docs and the source code of the existing strategies to see if you’d like to contribute a new one!
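As a rough sketch of what such a staging view could look like (the source and column names are assumptions taken from the example above), the SCD2 validity range can be derived directly from the append-only log with a window function:

-- models/staging/stg_source_scd2.sql (hypothetical model name)
select
    id,
    some_attribute,
    updated_at as valid_from,
    lead(updated_at) over (partition by id order by updated_at) as valid_to
from {{ source('my_source', 'my_table') }}

The latest row per id gets a null valid_to, matching the desired result above; materializing this as an incremental model (as in the linked gist) sidesteps the snapshot machinery altogether.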

Best way to pre-aggregate time-series data in postgres

I have a table of sent alerts as below:
| id | user_id | sent_at |
| --- | --- | --- |
| 1 | 123 | 01/01/2020 12:09:39 |
| 2 | 452 | 04/01/2020 02:39:50 |
| 3 | 264 | 11/01/2020 05:09:39 |
| 4 | 123 | 16/01/2020 11:09:39 |
| 5 | 452 | 22/01/2020 16:09:39 |
Alerts are sparse and I have around 100 million user_ids. This table has ~500 million entries in total (the last 2 months).
I want to query alerts per user over the last X hours/days/weeks/months for 10 million user_ids (saved in another table). I cannot use any external time-series database; it has to be done in Postgres only.
I tried keeping hourly buckets for each user, but the data is so sparse that I end up with too many rows (user_ids × hours). For example, getting alert counts for 10 million users over the last 10 hours takes a long time with this table:
| user_id | hour | count |
| --- | --- | --- |
| 123 | 01/01/2020 12:00:00 | 2 |
| 123 | 01/01/2020 10:00:00 | 1 |
| 234 | 11/01/2020 12:00:00 | 1 |
There are not many alerts per user, so an index on (user_id) should be sufficient.
However, you might as well put the time into it as well, so I would recommend (user_id, sent_at). This covers the WHERE clause of your query. Postgres will still need to look up the original data pages to check row visibility.
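A minimal sketch of that approach (the table and column names are assumptions based on the example above):

CREATE INDEX idx_alerts_user_sent ON alerts (user_id, sent_at);

-- alert counts over the last 10 hours for the users listed in target_users
SELECT t.user_id, count(a.id) AS alert_count
FROM target_users t
LEFT JOIN alerts a
       ON a.user_id = t.user_id
      AND a.sent_at >= now() - interval '10 hours'
GROUP BY t.user_id;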

Can you delete old entries from a table index?

I made a reminder application that heavily writes and reads records with future datetimes, but far less so records with past datetimes. These reminders are indexed by remind_at, so a million records means a million index entries, but it speeds up checking which records must be reminded within the next hour.
| uuid | user_id | text | remind_at | ... | ... | ... |
| ------- | ------- | ------------ | ------------------- | --- | --- | --- |
| 45c1... | 23 | Buy paint | 2019-01-01 20:00:00 | ... | ... | ... |
| 23f1... | 924 | Pick up car | 2019-02-01 20:00:00 | ... | ... | ... |
| 2d84... | 650 | Call mom | 2020-03-01 20:00:00 | ... | ... | ... |
| 3f1a... | 81 | Get shoes | 2020-04-01 20:00:00 | ... | ... | ... |
The problem is performance. Once the database grows big, retrieving any record becomes relatively slow.
I'm trying to find out which RDBMSs offer a fully or semi-automated way to get better retrieval performance for future datetimes, since past datetimes are rarely retrieved or checked.
A neat solution, which I don't know if it exists, would be to instruct the RDBMS to prune old entries from the index. I don't know if any RDBMS allows that, but in PostgreSQL, SQL Server, and SQLite there is a way to use a "partial index". But what would happen if I recreate an index on a table with millions of records?
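For reference, a partial index in PostgreSQL looks roughly like the sketch below (the table name and cutoff date are assumptions). The catch is that the predicate is fixed at creation time, so the index would have to be rebuilt periodically, e.g. with CREATE INDEX CONCURRENTLY followed by dropping the old one:

-- index only "recent and future" reminders; the cutoff is baked into the index
CREATE INDEX CONCURRENTLY reminders_future_remind_at_idx
    ON reminders (remind_at)
    WHERE remind_at >= DATE '2020-01-01';

-- the planner uses it only when the query predicate provably implies the
-- index predicate, so literal timestamps work best
SELECT * FROM reminders
WHERE remind_at >= TIMESTAMP '2020-06-01 00:00:00'
  AND remind_at <  TIMESTAMP '2020-06-01 01:00:00';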
Some solutions that didn't fit the bill:
Horizontal scaling: it would replicate the same problem (n) times.
Vertical scaling: still doesn't fix the problem.
Sharding: could be, since every instance would hold a part of the database, but the app would have to handle the "sharding key".
Two databases: okay, one fast and one slow. Moving old entries to the "slow instance" (toaster) would be done manually. Also, the app would have to be heavily modified to check both databases, since it doesn't know up front where a record lives. The logic grows considerably.
Anyway, the whole point is to make future (or the closest) records to remind snappier on retrieval while disregarding the performance to retrieve older entries.

Database design for partially changing data points, with history and snapshot functionality?

I'm looking for a best practice or solution, on a conceptual level, to a problem I'm working on.
I have a collection of data points (around 500) which are partially changed, by a user, over time. It is important to be able to tell which values were changed at what point in time. The data might look like this:
Data changed over time:
+--------------------------------------------------------------------------------------+
| Date | Value no. 1 | Value no. 2 | Value no. 3 | ... | Value no. 500 |
|------------+---------------+---------------+---------------+-------+-----------------|
| 1/1/2018 | | | 2 | | 1 |
| 1/3/2018 | 2 | 1 | | | |
| 1/7/2018 | | | 4 | | 8 |
| 1/12/2018 | 5 | 3 | | | |
....
It must be possible to take a snapshot at a certain point in time and get the complete set of data points that were valid at that particular point in time, like this:
Snapshot taken 1/3/2018 will yield:
+---------------------------------------------------------+
| Value 1 | Value 2 | Value 3 | ... | Value 500 |
|-----------+-----------+-----------+-------+-------------|
| 2 | 1 | 2 | 0 | 1 |
Snapshot taken 1/9/2018 will yield:
+---------------------------------------------------------+
| Value 1 | Value 2 | Value 3 | ... | Value 500 |
|-----------+-----------+-----------+-------+-------------|
| 2 | 1 | 4 | 0 | 8 |
Snapshot taken 1/13/2018 will yield:
+---------------------------------------------------------+
| Value 1 | Value 2 | Value 3 | ... | Value 500 |
|-----------+-----------+-----------+-------+-------------|
| 5 | 3 | 4 | 0 | 8 |
and so on...
I'm not bound to a particular database technology, so either SQL or NoSQL will do. It is probably not possible to satisfy all the requirements in the database itself; some will probably have to be addressed in code. But my main question is: which database technology is best suited for this task?
I'm not quite sure this fits a time-series database (TSDB), since only a portion of the values change at a given time, and it is important to know which values changed. Maybe I'm wrong?
/Chris
My suggestion would be to model this in a sparse format, something like:
CREATE TABLE DataPoint (
    DataID int,           /* 1 to 500 in your example, or whatever you need to identify it */
    ValidFrom timestamp,  /* default value 01/01/1970-00:00:00 or a suitable "epoch" */
    ValidUntil timestamp, /* default value 31/12/3999-00:00:00 or something far enough in the future for your case */
    Value number(7,5)     /* again, this may be any data type, or even more than one field if needed, like Price & Currency */
);
What we have just defined is a set of data points and the "interval" during which each one has a specific value, so if you measured DataPoint 1 yesterday and got a value of 89.768, you would insert:
DataId=1
ValidFrom=26/11/2018-14:52:41
ValidUntil=31/12/3999-00:00:00
Value=89.768
Then you measure it again tomorrow and get:
DataId=1
ValidFrom=28/11/2018-14:51:23
ValidUntil=31/12/3999-00:00:00
Value=89.443
(Let's assume you also have logic so that when you record a new value, you update the currently valid record and set its ValidUntil to 28/11/2018-14:51:23. This is not strictly needed, but it will make the example query simpler.)
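That close-the-old-row-then-insert logic could look roughly like this (a sketch; the far-future sentinel is the one from the table definition above):

-- close the currently valid row for data point 1
UPDATE DataPoint
   SET ValidUntil = '2018-11-28 14:51:23'
 WHERE DataID = 1
   AND ValidUntil = '3999-12-31 00:00:00';

-- insert the new measurement as the currently valid row
INSERT INTO DataPoint (DataID, ValidFrom, ValidUntil, Value)
VALUES (1, '2018-11-28 14:51:23', '3999-12-31 00:00:00', 89.443);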
One month from now you will have accumulated more measurements for data point #1, and likewise, at different moments, for data points #2 to #500.
Now suppose you want to find out what the values were at noon on 27/11/2018, i.e. one month "ago":
SELECT DataID, Value FROM DataPoint WHERE ValidFrom <= '2018-11-27 12:00:00' AND ValidUntil > '2018-11-27 12:00:00'
This will return:
001,89.768
002,45.678
...,...
500,112.809
Regarding logging who did this, or for what reason, you can either log it separately (saving, for example, the DataPoint ID, a timestamp, a user ID, ...) or make it part of the original table, so that whenever you register a new data point you also log who measured it.
Have a look at the SQL Server temporal tables engine, which may be a solution in your case. This approach allows you to run the queries mentioned in the question, for example:
SELECT *
FROM my_data
FOR SYSTEM_TIME AS OF '2018-01-01'
However, the table in the example seems to be very wide (maybe denormalized). I would suggest grouping columns by some technical or functional characteristic (vertical partitioning) to avoid maintenance drawbacks later on.
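For completeness, a minimal sketch of such a system-versioned table (the names and column types are illustrative, not taken from the question):

CREATE TABLE my_data (
    id INT NOT NULL PRIMARY KEY,
    value_1 DECIMAL(9,5),
    value_2 DECIMAL(9,5),
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.my_data_history));

SQL Server then maintains the history table on every UPDATE, and the FOR SYSTEM_TIME AS OF query above reconstructs the rows as they were at the given instant.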

How can I best extract transitions in a transactional table?

Hypothetical example:
I have an SQL table that contains a billion or so transactions:
| Cost | DateTime |
| ---- | ---------- |
| 1.00 | 2009-01-02 |
| 2.00 | 2009-01-03 |
| 2.00 | 2009-01-04 |
| 3.00 | 2009-01-05 |
| 1.00 | 2009-01-06 |
...
What I want is to pare the data down so that I only see the cost transitions:
| Cost | DateTime |
| ---- | ---------- |
| 1.00 | 2009-01-02 |
| 2.00 | 2009-01-03 |
| 3.00 | 2009-01-05 |
| 1.00 | 2009-01-06 |
...
The simplest (and slowest) way to do this is to iterate over the entire table, tracking the changes. Is there a faster/better way to do this in SQL?
No. There is no faster way. You could write a query that does the same job, but it would be much slower. You (as a developer) know that you need to compare a value only with its direct predecessor, and there is no way to specify this in SQL, so you can make optimizations that SQL cannot.
So I imagine the fastest approach is to write a program that streams the results from disk, holding in RAM only the last valid value and the current one (filtering out every value that is equal to the last valid one).
This is a classic example of trying to use a sledgehammer when a hammer is needed. You want to extract some crazy reporting data out of a table, but doing so is going to kill your SQL Server. What you need to do to track changes is to create a tracking table specifically for this purpose, then use a trigger that records a change in a product's value into that table. So on my products table, when I change the price, a row goes into the price-tracking table.
If you are using this to track stock prices or something similar, then again you use the same approach, except you compare against the price table and save a row only when a change occurs. The comparison then happens only on new data; all the old comparisons are already housed in one location, so you don't need to rerun a query that is going to kill your SQL Server's performance.
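A rough sketch of that trigger-based tracking table, in SQL Server syntax (the table and column names are assumptions made for the sake of the example):

CREATE TABLE PriceHistory (
    ProductId INT           NOT NULL,
    Price     DECIMAL(10,2) NOT NULL,
    ChangedAt DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER trg_Products_PriceChange
ON Products
AFTER UPDATE
AS
BEGIN
    -- record a row only when the price actually changed
    INSERT INTO PriceHistory (ProductId, Price)
    SELECT i.ProductId, i.Price
    FROM inserted i
    JOIN deleted d ON d.ProductId = i.ProductId
    WHERE i.Price <> d.Price;
END;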