DBT Snapshots with non-unique records in the source - google-bigquery

I’m interested to know if anyone here has ever come across a situation where the source is not always unique when dealing with snapshots in DBT.
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the respective table in the data lake.
By the time the DBT solution is run, my source could have more than one row with the same unique id, as the data has changed more than once since the last run.
Ideally, I’d like to update the respective dbt_valid_to columns in the snapshot table with the earliest updated_at record from the source, and subsequently add the new records to the snapshot table, making the latest updated_at record the current one.
I know how to achieve this using window functions, but I’m not sure how to handle such a situation with dbt.
I wonder if anybody has faced this same issue before.
Snapshot Table

| **id** | **some_attribute** | **valid_from**      | **valid_to**        |
|--------|--------------------|---------------------|---------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123    | ZABC               | 2021-06-30 00:00:00 | null                |
Source Table

| **id** | **some_attribute** | **updated_at**      |                            |
|--------|--------------------|---------------------|----------------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | already loaded to snapshot |
| 123    | ZABC               | 2021-06-30 00:00:00 | already loaded to snapshot |
| 123    | ZZAB               | 2021-11-21 00:10:00 | not yet loaded             |
| 123    | FXAB               | 2021-11-21 15:11:00 | not yet loaded             |
Snapshot Desired Result

| **id** | **some_attribute** | **valid_from**      | **valid_to**        |
|--------|--------------------|---------------------|---------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123    | ZABC               | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123    | ZZAB               | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123    | FXAB               | 2021-11-21 15:11:00 | null                |

Standard snapshots operate under the assumption that the source table we are snapshotting is being changed without storing history. That is the opposite of the behaviour we have here (basically the source table we are snapshotting is nothing more than an append-only log of events) - which means we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.
I have some sample code where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
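Not the gist itself, but a rough sketch of what such an incremental model could look like on BigQuery (the source name, column names, and surrogate key below are assumptions based on the example tables above):

```sql
-- models/my_table_scd2.sql  (hypothetical model name)
{{
    config(
        materialized='incremental',
        unique_key='scd_id'
    )
}}

with source as (

    select id, some_attribute, updated_at
    from {{ source('my_source', 'my_table') }}   -- assumed source name

    {% if is_incremental() %}
    -- re-process every id that has received new rows since the last run,
    -- so the previously "current" row gets its valid_to closed off
    where id in (
        select id
        from {{ source('my_source', 'my_table') }}
        where updated_at > (select max(valid_from) from {{ this }})
    )
    {% endif %}

)

select
    concat(cast(id as string), '|', cast(updated_at as string)) as scd_id,
    id,
    some_attribute,
    updated_at as valid_from,
    -- the next version's updated_at closes this one; the latest row stays open (null)
    lead(updated_at) over (partition by id order by updated_at) as valid_to
from source
```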

I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.
The easiest workaround would be a staging view downstream of the source that applies the window function you describe; then you snapshot that view.
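For illustration, a minimal sketch of that workaround (model, source, and column names are assumed from the example above):

```sql
-- models/staging/stg_my_table_latest.sql  (hypothetical model name)
-- Keeps only the most recent row per id from the append-only source.
{{ config(materialized='view') }}

with ranked as (
    select
        id,
        some_attribute,
        updated_at,
        row_number() over (partition by id order by updated_at desc) as rn
    from {{ source('my_source', 'my_table') }}   -- assumed source name
)

select id, some_attribute, updated_at
from ranked
where rn = 1
```

```sql
-- snapshots/my_table_snapshot.sql  (hypothetical file)
{% snapshot my_table_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ ref('stg_my_table_latest') }}

{% endsnapshot %}
```

Note that because the snapshot only sees the latest row per id at each run, intermediate versions that arrive between runs (like ZZAB in the example) would still be collapsed into the next run's current row.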
However, I do see potential for a new snapshot strategy that handles append-only sources. Perhaps you’d like to peruse the dbt snapshot docs and the source code for the existing strategies, to see if you’d like to contribute a new one!

Related

Druid generate missing records

I have a data table in Druid which has missing rows, and I want to fill them in by generating the missing timestamps and using the preceding row's value.
This is the table in Druid:
| __time | distance |
|--------------------------|----------|
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
And I want to add the missing minutes, either by forcing it on the Druid storage side or by querying it directly in Druid without going through another module.
The final result I want would look like this:
| __time | distance |
|--------------------------|----------|
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:43:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:45:00.000Z | 1360 |
| 2022-05-05T08:46:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
Thank you in advance!
A Druid timeseries query will produce a densely populated timeline at a given time granularity, like the one you want for every minute. But its current functionality either skips empty time buckets or assigns them a value of zero.
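For reference, a rough sketch of that per-minute query in Druid SQL (the datasource name is assumed; minutes with no rows would simply be skipped or zero-filled rather than carrying the previous value forward):

```sql
-- Hypothetical datasource name "distance_ds"; buckets rows into 1-minute intervals.
SELECT
  TIME_FLOOR(__time, 'PT1M') AS minute_bucket,
  MAX(distance) AS distance
FROM distance_ds
WHERE __time >= TIMESTAMP '2022-05-05 08:41:00'
  AND __time <  TIMESTAMP '2022-05-05 08:49:00'
GROUP BY 1
ORDER BY 1
```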
Adding other gap-filling functions like LVCF (last value carried forward), as you describe, seems like a great enhancement. You can join the Apache Druid community and create an issue that describes this request. That's a great way to start a conversation about requirements and how it might be achieved.
And/Or you could also add the functionality and submit a PR. We're always looking for more members in the Apache Druid community.

What is the best way to calculate how long an object has been in a specific state?

I am looking for a recommended solution to solve problems like the one below.
Preferably with PostgreSQL, but any other SQL-based solution would help.
So there is a table like this:
| day        | object_id | property_01 | ... |
|------------|-----------|-------------|-----|
| 2022-01-24 | object_01 | A           | ... |
| 2022-01-23 | object_01 | A           | ... |
| 2022-01-22 | object_01 | A           | ... |
| 2022-01-21 | object_01 | B           | ... |
| 2022-01-20 | object_01 | B           | ... |
| 2022-01-19 | object_01 | A           | ... |
| 2022-01-18 | object_01 | A           | ... |
| 2022-01-17 | object_01 | B           | ... |
| 2022-01-16 | object_01 | A           | ... |
The base table is a daily "backup", so normally there are no gaps in the sequence of days.
There will always be a row from the birth of an object until it is deleted from the main DB.
What I need to tell is: if an object is in the "A" state on the day of the query, how long has it been in that state?
In the example the right answer is since 2022-01-22, so a simple MIN() does not give the right answer.
Also, I do not need to list all of the intervals in the "A" state, just how long the current one has lasted.
What I tried:
I. Select all the object_ids that are in the "A" state on a given date.
II. Select all the rows based on those object_ids.
Now I am trying to write a recursive query, but if there is a better solution, it would be a great help.
Thanks a lot!
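For what it's worth, a minimal sketch of the two-step idea described in the question (PostgreSQL; the table and column names are assumed from the example):

```sql
-- Step I: objects that are in state 'A' on the query date.
-- Step II: the current run of 'A' days starts right after the most recent
--          day on which the object was NOT in 'A'.
with current_a as (
    select object_id
    from object_state                      -- hypothetical table name
    where day = date '2022-01-24'          -- the day of the query
      and property_01 = 'A'
),
last_non_a as (
    select s.object_id, max(s.day) as last_other_day
    from object_state s
    join current_a c using (object_id)
    where s.day <= date '2022-01-24'
      and s.property_01 <> 'A'
    group by s.object_id
)
select
    c.object_id,
    coalesce(l.last_other_day + 1,
             (select min(day) from object_state s2
              where s2.object_id = c.object_id)) as in_state_a_since
from current_a c
left join last_non_a l using (object_id);
```

The idea is simply that the current "A" streak starts the day after the most recent non-"A" day, or at the object's first recorded day if it has always been in "A".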

Data discretization in db in an intelligent way

For my future project I have a ClickHouse DB. This DB is fed by several micro-services, which are themselves fed by RabbitMQ.
The data look like:
| Datetime            | nodekey | value |
|---------------------|---------|-------|
| 2018-01-01 00:10:00 | 15      | 156   |
| 2018-01-01 00:10:00 | 18      | 856   |
| 2018-01-01 00:10:00 | 86      | 8     |
| 2018-01-01 00:20:00 | 15      | 156   |
| 2018-01-01 00:20:00 | 18      | 84    |
| 2018-01-01 00:20:00 | 86      | 50    |
| ...                 | ...     | ...   |
So for hundreds of different nodekeys, I have a value every 10 minutes.
I need to have another table with the sum or the mean (depending on the nodekey type) of the values for every hour ...
My first idea was just to use a crontab ...
But the data doesn't come in a fluid flow: sometimes a micro-service adds 2-3 new values, sometimes a week of data comes in, and rarely I have to bulk insert a year of new data...
And for the moment I only have hundreds of nodekeys, but the project is going to grow.
So I think using a crontab or looping through the DB to update the data isn't a good idea...
What are my other options?
How about just creating a view?
create view myview as
select
    toStartOfHour(datetime) as date_hour,
    nodekey,
    sum(value) as sum_value
from mytable
group by
    toStartOfHour(datetime),
    nodekey
The advantage of this approach is that you don't need to worry about refreshing the data. When querying the view, you actually access the underlying live data. The downside is that it might not scale well when your dataset becomes really big (queries against the view will tend to slow down).
An intermediate option would be to use a materialized view, which will persist the data. If I correctly understand the ClickHouse documentation, materialized views are automatically updated when new data is inserted into the source table, which seems to be close to what you are looking for (however, you need to use the proper engine, and this might impact the performance of your inserts).
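For what it's worth, a rough sketch of that materialized-view option (names reused from the view above; the SummingMergeTree engine choice is an assumption and should be checked against the ClickHouse docs for your version):

```sql
-- Persists hourly partial sums as rows are inserted into mytable.
-- Existing rows are not backfilled unless POPULATE is added.
CREATE MATERIALIZED VIEW myview_mv
ENGINE = SummingMergeTree()
ORDER BY (date_hour, nodekey)
AS
SELECT
    toStartOfHour(datetime) AS date_hour,
    nodekey,
    sum(value) AS sum_value
FROM mytable
GROUP BY date_hour, nodekey;

-- Merges happen in the background, so re-aggregate when reading:
SELECT date_hour, nodekey, sum(sum_value) AS sum_value
FROM myview_mv
GROUP BY date_hour, nodekey;
```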

Can you delete old entries from a table index?

I made a reminder application that heavily writes and reads records with future datetimes, but less so records with past datetimes. These reminders are indexed by remind_at, so a million records means a million entries in the index, but it speeds up checking for records that must be reminded in the next hour.
| uuid | user_id | text | remind_at | ... | ... | ... |
| ------- | ------- | ------------ | ------------------- | --- | --- | --- |
| 45c1... | 23 | Buy paint | 2019-01-01 20:00:00 | ... | ... | ... |
| 23f1... | 924 | Pick up car | 2019-02-01 20:00:00 | ... | ... | ... |
| 2d84... | 650 | Call mom | 2020-03-01 20:00:00 | ... | ... | ... |
| 3f1a... | 81 | Get shoes | 2020-04-01 20:00:00 | ... | ... | ... |
The problem is performance. Once the database grows big, retrieving any record becomes relatively slow.
I'm trying to find out which RDBMSs offer a fully or semi-automated way to get better performance when retrieving future datetimes, since past datetimes are rarely retrieved or checked.
A neat solution, which I don't know exists, would be to instruct the RDBMS to prune old entries from the index. I don't know if any RDBMS allows that, but in PostgreSQL, SQL Server, and SQLite there is a way to use a "partial index" - but what would happen if I recreate an index on a table with millions of records?
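To illustrate the partial-index idea in PostgreSQL (hypothetical table and index names; the cutoff must be a constant, since the predicate cannot call now()), recreating the index periodically with CONCURRENTLY avoids blocking writes even on a table with millions of rows:

```sql
-- Index only the "future" part of the table as of a fixed cutoff.
CREATE INDEX CONCURRENTLY reminders_future_idx
    ON reminders (remind_at)
    WHERE remind_at >= TIMESTAMP '2020-01-01 00:00:00';

-- Later, roll the cutoff forward: build a replacement, then drop the old index.
CREATE INDEX CONCURRENTLY reminders_future_idx_new
    ON reminders (remind_at)
    WHERE remind_at >= TIMESTAMP '2021-01-01 00:00:00';
DROP INDEX CONCURRENTLY reminders_future_idx;
ALTER INDEX reminders_future_idx_new RENAME TO reminders_future_idx;
```

The planner only uses a partial index when the query's WHERE clause implies the index predicate (e.g. remind_at >= the same cutoff), so the cutoff has to be managed by the application or a scheduled job.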
Some solutions that didn't fit the bill:
Horizontal scaling: It would replicate the same problem (n) times.
Vertical scaling: Still doesn't fix the problem.
Sharding: Could be, since every instance would hold a part of the database, but the app would have to handle the "sharding key".
Two databases: Okay, one fast and the other slow. Moving old entries to the "slow instance" (toaster) would be done manually. Also, the app would have to be heavily modified to check both databases, since it doesn't know where a record is initially. Logic increases heavily.
Anyway, the whole point is to make retrieval of future (or upcoming) reminders snappier, while disregarding the performance of retrieving older entries.

Which tables/fields reveal user activity in Trac?

(This may be a webapp question.) I would like to use Trac 1.0.1 activity for time tracking. For example: closing a ticket, editing a wiki page, or leaving a comment.
I was imagining output something like this:
| Time             | Ticket | Custom field | Summary      | Activity                  |
|------------------|--------|--------------|--------------|---------------------------|
| 2013-05-08 10:00 | 4123   | Acme         | Ticket title | Ticket closed             |
| 2013-05-08 10:00 | 4200   | Sierra       | Title here   | Comment left on ticket    |
| 2013-05-08 10:00 | -      | -            | -            | Edited /wiki/Acme/Coyote  |
| 2013-05-08 10:00 | -      | -            | -            | Committed /git/Apogee.txt |
I would like to include basically everything that appears in the timeline, including comment activity. For ticket-related activity, I would like to include ticket number and a custom field.
Which tables should I be looking at? (A pointer to relevant docs or code would suffice.)
I believe you are just asking for the Trac database schema, which can be viewed here; you can also view the source for the timeline here.
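As a rough starting point, a sketch of the kind of query involved, assuming the ticket, ticket_change and ticket_custom tables from the linked schema (Trac 1.0 stores times as microseconds since the Unix epoch; the column names, the custom field name, and the SQLite date function should be checked against your installation):

```sql
-- Ticket comments and status changes, joined to one custom field (SQLite syntax).
SELECT
    datetime(tc.time / 1000000, 'unixepoch') AS activity_time,
    tc.ticket                                AS ticket_id,
    cf.value                                 AS custom_field,
    t.summary,
    tc.field,                                -- e.g. 'comment', 'status'
    tc.newvalue
FROM ticket_change tc
JOIN ticket t ON t.id = tc.ticket
LEFT JOIN ticket_custom cf
       ON cf.ticket = tc.ticket
      AND cf.name   = 'acme_field'           -- hypothetical custom field name
WHERE tc.field IN ('comment', 'status')
ORDER BY tc.time;
```

Wiki edits and repository activity live in separate tables in the same schema (wiki and revision), so a full timeline like the one sketched in the question would union those in as well.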