Can you delete old entries from a table index? - sql

I made a reminder application that heavily writes and reads records with future datetimes, but rarely touches records with past datetimes. These reminders are indexed by remind_at, so a million records means a million entries in the index, but that index speeds up checking which reminders are due in the next hour.
| uuid | user_id | text | remind_at | ... | ... | ... |
| ------- | ------- | ------------ | ------------------- | --- | --- | --- |
| 45c1... | 23 | Buy paint | 2019-01-01 20:00:00 | ... | ... | ... |
| 23f1... | 924 | Pick up car | 2019-02-01 20:00:00 | ... | ... | ... |
| 2d84... | 650 | Call mom | 2020-03-01 20:00:00 | ... | ... | ... |
| 3f1a... | 81 | Get shoes | 2020-04-01 20:00:00 | ... | ... | ... |
The problem is performance. Once the database grows big, retrieving any record becomes relatively slow.
I'm trying to find out whether any RDBMS offers a fully or semi-automated way to get better retrieval performance for future datetimes, since past datetimes are rarely retrieved or checked.
A neat solution, if it exists, would be to instruct the RDBMS to prune old entries from the index. I don't know of any RDBMS that allows exactly that, but PostgreSQL, SQL Server, and SQLite offer a "partial index". But what would happen if I recreate such an index on a table with millions of records?
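For reference, a partial index in PostgreSQL might look roughly like the sketch below. The table and column names follow the example above, and the fixed cutoff date is just an illustration: PostgreSQL does not allow a moving now() in the index predicate, so the cutoff has to be a constant that you refresh by periodically recreating the index.

```sql
-- Hypothetical sketch: only index reminders that are still in the (near) future.
-- CREATE INDEX CONCURRENTLY builds the index without blocking writes,
-- which matters when the table already has millions of rows.
CREATE INDEX CONCURRENTLY reminders_upcoming_idx
    ON reminders (remind_at)
    WHERE remind_at >= DATE '2020-01-01';
```

The trade-off is that the index only helps queries whose remind_at filter provably falls inside the indexed range, and it has to be rebuilt as the cutoff ages.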
Some solutions that didn't fit the bill:
Horizontal scaling: it would just replicate the same problem n times.
Vertical scaling: still doesn't fix the problem.
Sharding: could work, since every instance would hold a part of the database, but the app would have to handle the "sharding key".
Two databases: okay, one fast and one slow. Moving old entries to the "slow" instance (a toaster) would have to be done manually. Also, the app would have to be heavily modified to check both databases, since it doesn't know up front where a record lives. The logic grows considerably.
Anyway, the whole point is to make retrieval of future (or upcoming) reminders snappy while disregarding retrieval performance for older entries.

Piecewise join rows in a DataFrame. Comparison of many dataframes to check data drift and data quality

I am aware of many questions related to the comparison of Pandas DataFrames.
I hope my question is not a duplicate, but I could not find the answer.
As a preface, I am not sure if my way of tackling the problem is actually advisable, so I am open to suggestions as well.
The problem: I get data on a regular basis and it gets loaded into a data warehouse. I would like to be able to quickly check the content of the data in terms of its consistency compared to older data deliveries. For now I want to just check the original flat/CSV file. The data looks like this:
Data from: 2022-08-01
|Country|Vendor | ID | Desc |Price| Delivery_Status |Product_group|
|-------|-------|-----|----------|-----|------------------|-------------|
| GB | Nike |1234 |White/Red | 65 | 4-5 Days | Sneaker |
| ... | ... |... |... | ... | ... | ... |
In this dataset, for a given product ID (ID in the table above) there can be more than one vendor. Think of Adidas also having a product with number 1234. So only the combination ID+Vendor+Country is unique.
Each month/week I get new data, but the price and the delivery status can change. The product group cannot change.
I want to detect unusual price movements, calculate aggregates of product groups and analyse them over time, and track changes in the delivery status, etc. So it is basically time-series analysis.
Furthermore, it is not certain that I get data for each ID every month; IDs can fade out, new IDs can come up, and I want to track this as well.
I am not sure how to combine the data from individual deliveries into a meaningful data structure for Pandas. I was considering creating a DataFrame like this, where I create a new price column for each delivery:
Data from: 2022-08-01 + Data from 2022-09-01
|Country|Vendor | ID | Desc |Price_09| Delivery_Status_09 |Price_08| Delivery_Status_08 |Product_group|
|-------|-------|-----|----------|--------|--------------------|--------|--------------------|-------------|
| GB | Nike |1234 |White/Red | 65 | 4-5 Days | 69 | 10 Days | Sneaker |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
In this case I do not really understand how to join/merge the data on multiple columns, as I would have to piecewise join the data based on identical Vendor, ID, and Country columns.
Or would it be a better idea to create my DataFrame like this, where each observation gets its own row:
Data from: 2022-08-01 + Data from: 2022-09-01
|PriceData |Country|Vendor | ID | Desc |Price| Delivery_Status |Product_group|
|----------|-------|-------|-----|----------|-----|------------------|--------------|
|2022-08-01| GB | Nike |1234 |White/Red | 65 | 4-5 Days | Sneaker |
|2022-08-01| ... | ... |... |... | ... | ... | ... |
|2022-09-01| GB | Nike |1234 |White/Red | 69 | 10 Days | Sneaker |
In this case there is a lot of redundant data. Or is there a concept of normalization (3NF) for data in Pandas?
My goal in the end is to calculate various aggregate statistics and detect unusual price/delivery-status movements within individual groupings. The "time series" will not have more than 4 successive months... for now.
Thank you in advance

DBT snapshots with non-unique records in the source

I’m interested to know if someone here has ever come across a situation where the source is not always unique when dealing with snapshots in DBT.
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the respective table in the data lake.
By the time the dbt solution is run, my source could have more than one row for the same unique ID, as the data has changed more than once since the last run.
Ideally, I’d like to update the respective dbt_valid_to columns from the snapshot table with the earliest updated_at record from the source and subsequently add the new records to the snapshot table making the latest updated_at record the current one.
I know how to achieve this using window functions but not sure how to handle such situation with dbt.
I wonder if anybody has faced this same issue before.
Snapshot Table
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| ------ | ------------------ | ------------------- | ------------------- |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | null |
Source Table
| **id** | **some_attribute** | **updated_at** |
| ------ | ------------------ | ------------------- |
| 123 | ABCD | 2021-01-01 00:00:00 | -> already loaded to the snapshot
| 123 | ZABC | 2021-06-30 00:00:00 | -> already loaded to the snapshot
| 123 | ZZAB | 2021-11-21 00:10:00 |
| 123 | FXAB | 2021-11-21 15:11:00 |
Snapshot Desired Result
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| ------ | ------------------ | ------------------- | ------------------- |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123 | ZZAB | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123 | FXAB | 2021-11-21 15:11:00 | null |
Standard snapshots operate under the assumption that the source table being snapshotted is modified in place without storing history. That is the opposite of the behaviour we have here (the source table we are snapshotting is essentially an append-only log of events), which means we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.
I have some sample code where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
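The core of that idea can be sketched with a window function. This is just an illustration of the shape such a model could take (the source relation name is assumed), not the exact code in the gist:

```sql
-- Derive SCD2 validity ranges directly from the append-only source:
-- each row is valid from its own updated_at until the next change for the same id.
select
    id,
    some_attribute,
    updated_at as valid_from,
    lead(updated_at) over (
        partition by id
        order by updated_at
    ) as valid_to            -- null for the currently valid row
from source_table
```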
I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.
The easiest workaround would be a stage view downstream of the source that has the window function you describe. Then you snapshot that view.
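One reading of that workaround is sketched below: a view that keeps only the latest row per id, so the snapshot's unique key holds again. The view and relation names are invented, and note that this captures only the latest state per run rather than every intermediate row:

```sql
-- stg_source_latest: one row per id, keeping only the most recent change.
create view stg_source_latest as
select id, some_attribute, updated_at
from (
    select
        id,
        some_attribute,
        updated_at,
        row_number() over (
            partition by id
            order by updated_at desc
        ) as rn
    from source_table
) ranked
where rn = 1;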
However, I do see potential for a new snapshot strategy that handles append only sources. Perhaps you’d like to peruse the dbt Snapshot docs and strategies source code on existing strategies to see if you’d like to make a new one!

Efficiently making rows based on pairs of columns that don't appear in another table

So I'm trying to model a basic recommended-friend system based on user activity. In this model, people can join activities, and if two people aren't already friends and happen to join the same activity, their recommendation score for each other increases.
Most of my app uses Firebase, but for this system I'm trying to use BigQuery.
The current system I have in mind:
I would have this table to represent friendships. Since it's an undirected graph, A->B being in the table implies that B->A will also be in the table.
+-------+-------+--------------+
| User1 | User2 | TimeFriended |
+-------+-------+--------------+
| abc | def | 12345 |
| def | abc | 12345 |
| abc | rft | 3456 |
| ... | ... | ... |
+-------+-------+--------------+
I also plan for activity participation to be stored like so:
+------------+-----------+---------------+------------+
| ActivityId | CreatorID | ParticipantID | TimeJoined |
+------------+-----------+---------------+------------+
| abc | def | eft | 21234 |
| ... | ... | ... | ... |
+------------+-----------+---------------+------------+
Lastly, assume maybe there's a table that stores mutual activities for these recommended friends (not super important, but assume it looks like:)
+-------+-------+------------+
| User1 | User2 | ActivityID |
+-------+-------+------------+
| abc | def | eft |
| ... | ... | ... |
+-------+-------+------------+
So here's the query I want to run:
Get all the participants for a particular activity.
For each of these participants, get all the other participants that aren't their friend
Add that tuple of {participant, other non-friend participant} to the "mutual activities" table
So there are obviously a couple of ways to do this. I could make a simple BigQuery script with looping, but I'm not a fan of that because it'll result in a lot of scans, and since BigQuery doesn't use indexes it won't scale well (in terms of cost).
I could also maybe use a subquery with NOT EXISTS, something like SELECT ParticipantID FROM activities WHERE activityID = x AND NOT EXISTS {something to show that there doesn't exist a friend relation}, but then it's unclear how to make this work for every participant in one go. I'd be fine with a solution whose table scans scale linearly with the number of participants, but I have the premonition that even if I somehow get this to work, every NOT EXISTS will result in a full scan per participant pair, resulting in quadratic scaling.
There might be something I can do with joining, but I'm not sure.
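For what it's worth, a join-based version could produce all non-friend pairs for one activity in a single pass. This is only a sketch: activities and friendships stand in for whatever the real tables are called, and @activity_id is a query parameter:

```sql
-- Pair every participant of the activity with every other participant,
-- then anti-join against friendships to keep only non-friends.
-- Because friendships stores both directions (A->B and B->A), one LEFT JOIN is enough.
SELECT
  a.ParticipantID AS User1,
  b.ParticipantID AS User2,
  a.ActivityId    AS ActivityID
FROM activities AS a
JOIN activities AS b
  ON  a.ActivityId = b.ActivityId
  AND a.ParticipantID != b.ParticipantID     -- every ordered pair of distinct participants
LEFT JOIN friendships AS f
  ON  f.User1 = a.ParticipantID
  AND f.User2 = b.ParticipantID
WHERE a.ActivityId = @activity_id
  AND f.User1 IS NULL                        -- no existing friendship
```

In BigQuery this should read each referenced table once rather than once per participant.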
Would love some thoughts and guidance on this. I'm not very used to SQL, especially complex queries like this.
PS: If any of y'all would like to suggest another serverless solution rather than BigQuery, go ahead please :)

Data discretization in db in an intelligent way

For my future project I have a ClickHouse DB. This DB is fed by several microservices, which are themselves fed by RabbitMQ.
The data looks like:
| Datetime | nodekey | value |
| ------------------- | ------- | ----- |
| 2018-01-01 00:10:00 | 15 | 156 |
| 2018-01-01 00:10:00 | 18 | 856 |
| 2018-01-01 00:10:00 | 86 | 8 |
| 2018-01-01 00:20:00 | 15 | 156 |
| 2018-01-01 00:20:00 | 18 | 84 |
| 2018-01-01 00:20:00 | 86 | 50 |
| ... | ... | ... |
So for hundreds of different nodekeys, I have a value every 10 minutes.
I need another table with the sum or the mean (depending on the nodekey type) of the values for every hour...
My first idea was just to use a crontab...
But the data doesn't come in a steady flow: sometimes a microservice adds 2-3 new values, sometimes a week of data comes in at once, and occasionally I have to bulk insert a year of new data...
And for the moment I only have hundreds of nodekeys, but the project is going to grow.
So I think using a crontab or looping through the DB to update the data isn't a good idea...
What are my other options?
How about just creating a view?
```sql
create view myview as
select
    toStartOfHour(datetime) as date_hour,
    nodekey,
    sum(value) as sum_value
from mytable
group by
    toStartOfHour(datetime),
    nodekey
```
The advantage of this approach is that you don't need to worry about refreshing the data: when querying the view, you actually access the underlying live data. The downside is that it might not scale well once your dataset becomes really big (queries addressing the view will tend to slow down).
An intermediate option would be to use a materialized view, which will persist the data. If I understand the ClickHouse documentation correctly, materialized views are automatically updated as data is inserted into the source table, which seems to be close to what you are looking for (however, you need to use the proper engine, and this might impact the performance of your inserts).
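A rough sketch of what that could look like, reusing the table above; the view name and the choice of SummingMergeTree as the engine are just illustrative assumptions:

```sql
-- Hypothetical materialized view: hourly sums, kept up to date as rows
-- are inserted into mytable. POPULATE backfills the existing data once.
CREATE MATERIALIZED VIEW myview_hourly
ENGINE = SummingMergeTree()
ORDER BY (date_hour, nodekey)
POPULATE
AS SELECT
    toStartOfHour(datetime) AS date_hour,
    nodekey,
    sum(value) AS sum_value
FROM mytable
GROUP BY date_hour, nodekey;
```

Since SummingMergeTree combines rows in the background, reads typically still apply sum() with GROUP BY over the materialized data to get exact totals.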

How can I best extract transitions in a transactional table?

Hypothetical example:
I have an SQL table that contains a billion or so transactions:
| Cost | DateTime |
| ---- | ---------- |
| 1.00 | 2009-01-02 |
| 2.00 | 2009-01-03 |
| 2.00 | 2009-01-04 |
| 3.00 | 2009-01-05 |
| 1.00 | 2009-01-06 |
| ... | ... |
What I want is to pare down the data so that I only see the cost transitions:
| Cost | DateTime |
| ---- | ---------- |
| 1.00 | 2009-01-02 |
| 2.00 | 2009-01-03 |
| 3.00 | 2009-01-05 |
| 1.00 | 2009-01-06 |
| ... | ... |
The simplest (and slowest) way to do this is to iterate over the entire table, tracking the changes. Is there a faster/better way to do this in SQL?
No. There is no faster way. You could write a query that does the same job but it will be much slower. You (as a developer) know that you need to compare a value only with its direct previous value, and there is no way to specify this with SQL. So you can do optimizations that SQL cannot.
So I imagine the fastest is to write a program that streams the results from the disk, holding in RAM only the last valid value and the current one (filtering out every value that is equal to the last valid).
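For reference, on engines that support window functions, the "query that does the same job" can be written compactly; a sketch, assuming the table is named transactions:

```sql
-- Keep only rows whose cost differs from the immediately preceding row (by DateTime).
SELECT Cost, DateTime
FROM (
    SELECT
        Cost,
        DateTime,
        LAG(Cost) OVER (ORDER BY DateTime) AS prev_cost
    FROM transactions
) AS t
WHERE prev_cost IS NULL       -- the very first row
   OR prev_cost <> Cost;      -- the cost changed
```

Whether this beats streaming the rows through application code depends on the engine and on how the billion rows are stored and sorted.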
This is a classic example of trying to use a sledgehammer when a hammer is needed. You want to extract some crazy reporting data out of a table, but doing so is going to KILL your SQL Server. What you need to do to track changes is create a tracking table specifically for this purpose, then use a trigger that records a change in a product's value into this table. So on my products table, when I change the price, it goes into the price-tracking table.
If you are using this to track stock prices or something similar, then again you use the same approach, except you compare against the price table and, if a change occurs, you save it. That way the comparison only happens on new data; all the old comparisons are already housed in one location, so you don't need to rerun a query that is going to kill your SQL Server's performance.
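A minimal T-SQL sketch of that trigger-based tracking idea; the products and price_history tables and their columns are invented for illustration:

```sql
-- Hypothetical tracking table plus trigger: whenever a product's price changes,
-- the new value is recorded in price_history with a timestamp.
CREATE TABLE price_history (
    product_id INT           NOT NULL,
    price      DECIMAL(10,2) NOT NULL,
    changed_at DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER trg_products_price_change
ON products
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO price_history (product_id, price)
    SELECT i.product_id, i.price
    FROM inserted AS i
    JOIN deleted  AS d ON d.product_id = i.product_id
    WHERE i.price <> d.price;   -- only record actual transitions
END;
GO
```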