How to effectively save & restore data from the last three months and delete the old data?


I am using PostgreSQL. I need to delete all transaction data from the database except the last three months, then restore the data into a new database with the created/updated timestamps set to the current time. Data older than three months must also be recapped into a single record (for example, all invoices from party A must be grouped into one invoice for party A). One more rule: if old data is still referenced by foreign keys from data in the last three months, it must not be deleted; only its created/updated timestamps should be set to the current time.
I am not good at SQL, so for now I am using this strategy:
First, create the recap data from all the data (saved in a temporary table) before deleting anything.
Then delete all data except the last three months.
Next, create the recap data again after the delete.
Subtract the second recap from the first (all data minus the data left after the delete), so I get recap data whose amounts exactly match the data older than three months.
Then insert the recap data into the table, so the old data is cleaned up and the database contains the recap data.
This strategy works within the same database and does not create a new one, because importing data through the program is very slow (there are 900+ tables).
But the client doesn't want this strategy: he wants the data to be created in a new database and told me to find another way. So the question is: what is the correct procedure to clean a database by date (filtering on a date) and recap the old data?

First of all, there is no way to find out when a row was added to a table unless you track it with a timestamp column.
That's the first change you'll have to make: add a timestamp column to all relevant tables that tracks when the row was created (or updated, depending on the requirement).
Then you have two choices:
Partition the tables by the timestamp column so that you have (for example) one partition per month.
Advantage: it is easy to get rid of old data: just drop the partition.
Disadvantage: Partitioning is tricky in PostgreSQL. It will become somewhat easier to handle in PostgreSQL v10, but the underlying problems remain.
Use mass DELETEs to get rid of old rows. That's easy to implement, but mass deletes really hurt (table and index bloat which might necessitate VACUUM (FULL) or REINDEX which impair availability).
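To make the mass-DELETE option and the "recap" requirement from the question concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere; the invoice table, its columns, and the function names are hypothetical, and the foreign-key rule from the question is not handled.

```python
import sqlite3
from datetime import datetime, timedelta


def recap_and_purge(conn, cutoff, now):
    """Replace rows older than `cutoff` with one recap row per party."""
    with conn:  # the DELETE and the INSERTs commit together or not at all
        # Sum the old rows per party before anything is deleted.
        recap = conn.execute(
            "SELECT party, SUM(amount) FROM invoice "
            "WHERE created_at < ? GROUP BY party",
            (cutoff,),
        ).fetchall()
        conn.execute("DELETE FROM invoice WHERE created_at < ?", (cutoff,))
        # Recap rows get the current timestamp, so they count as fresh data.
        conn.executemany(
            "INSERT INTO invoice (party, amount, created_at) VALUES (?, ?, ?)",
            [(party, total, now) for party, total in recap],
        )
    return len(recap)


def build_demo():
    """Create an in-memory table with three old rows and one recent row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoice (id INTEGER PRIMARY KEY, party TEXT, "
        "amount INTEGER, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO invoice (party, amount, created_at) VALUES (?, ?, ?)",
        [
            ("A", 100, "2023-11-01T00:00:00"),
            ("A", 50, "2024-01-15T00:00:00"),
            ("B", 70, "2024-02-01T00:00:00"),
            ("A", 30, "2024-06-01T00:00:00"),
        ],
    )
    conn.commit()
    return conn
```

Because the recap rows carry the current timestamp, the per-party totals are unchanged after the purge. On a real PostgreSQL table of any size, the bloat caveat above still applies.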

Related

SCD (slowly changing dimension): how can I detect changes?

Can I detect changes in my ODS tables before inserting them into a dimension table in the DWH? I use SQL and Pentaho for data loading. For information, I use 4 tables to populate my dimension table. So how can I detect changes in the 4 tables before using them?

There are two transformation steps that can help you compare the content of two tables: Merge rows (diff) or Table compare.
You could keep a copy of the tables and, each time you run your process, compare the current content with the content of the last copy, although that approach does not perform well if the tables are too big.
Or, if your database allows auditing of changes, you could activate that audit and just retrieve the rows your audit says have changed since the last load.
There's also the option of using a database trigger that ensures the modification date is updated each time a row is changed; using the column where you store the modification date, you can then retrieve the changed rows.
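To illustrate the snapshot-comparison idea, here is a small plain-Python sketch. It only mimics the kind of per-row flag a diff step produces and is not Pentaho's implementation; the function name and the dict-per-snapshot layout are assumptions made for the example.

```python
def diff_rows(previous, current):
    """Flag every key found in either snapshot.

    Both arguments map a primary key to a tuple of column values.
    """
    flags = {}
    for key in previous.keys() | current.keys():
        if key not in previous:
            flags[key] = "new"
        elif key not in current:
            flags[key] = "deleted"
        elif previous[key] != current[key]:
            flags[key] = "changed"
        else:
            flags[key] = "identical"
    return flags


last_copy = {1: ("Paris", "FR"), 2: ("Rome", "IT"), 3: ("Oslo", "NO")}
this_run = {1: ("Paris", "FR"), 2: ("Roma", "IT"), 4: ("Bern", "CH")}
flags = diff_rows(last_copy, this_run)
```

Only rows flagged "new", "changed" or "deleted" would then need to be pushed to the dimension table.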

Keeping BigQuery table data up-to-date

This is probably an incorrect use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients reading the data should see either only the old version of the data or the completely new version. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up-to-date data have to know about partitions and get data only from certain partitions. Every time I want to make a query, I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like a solution that is easy and transparent for clients who read the data.

You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including a single DML statement (INSERT/UPDATE/DELETE/MERGE) and a single load job, are atomic. Your reader reads either the old data or the new data.
Since multi-statement transactions are lacking right now, if you do have updates which don't fit into a single load job, the solution is:
Load the update into a staging table.
After all loads have finished, use a single INSERT or MERGE to merge the updates from the staging table into the primary data table.
The drawback: scanning the staging table is not free.
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming that each table you need to update has an ActivePartition column as its partition key, you can keep a table with only one row.
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date, and your users then use a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
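The pointer trick can be tried out locally. The sketch below uses Python's sqlite3 module rather than BigQuery, so it only demonstrates the pattern (load first, then flip a one-row pointer), not BigQuery's actual atomicity guarantees. The table and column names (data_table, partition_date, active_partition) are hypothetical stand-ins for the ones in the answer.

```python
import sqlite3


def make_db():
    """Create the data table and the one-row pointer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data_table (partition_date TEXT, value INTEGER)")
    conn.execute("CREATE TABLE active_partition (active TEXT)")
    conn.execute("INSERT INTO active_partition VALUES (NULL)")
    conn.commit()
    return conn


def load(conn, partition_date, values):
    """Load rows into a new partition; readers do not see them yet."""
    conn.executemany(
        "INSERT INTO data_table VALUES (?, ?)",
        [(partition_date, value) for value in values],
    )
    conn.commit()


def activate(conn, partition_date):
    """Flip the pointer; this single UPDATE is the switch readers observe."""
    conn.execute("UPDATE active_partition SET active = ?", (partition_date,))
    conn.commit()


def read_current(conn):
    """Read only the rows of the partition the pointer names."""
    rows = conn.execute(
        "SELECT value FROM data_table "
        "WHERE partition_date = (SELECT active FROM active_partition) "
        "ORDER BY value"
    )
    return [value for (value,) in rows]
```

Readers never need to know which partition is current; they always go through the pointer, so a half-finished load stays invisible until activate is called.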

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!

One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';

An approach would be to have 3 tables:
a base table in "append only" mode: only inserts are added, and updates are written as full rows, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if a special column is kept in the first table).
a live table where you hold the current data (in this table you would run your MERGE statements, most probably from the first base table).
If you choose partitioning and clustering, you could benefit a lot from the discounted long-term storage price and scan less data.

If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases: for example, the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable, because the BQ engine will scan one daily partition only.
P.S.
Detailed explanation
"I could add a last_updated timestamp column to keep track that way"
I inferred from that that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones), but that you can modify the table UPDATE statements so that this column is set on the newly modified rows.
Therefore I assumed you can modify the updates further to set an additional last_updated_date column, which will contain the date portion of the timestamp stored in the last_updated column.
"but then repeatedly querying the entire table all day"
From here I inferred there are multiple checks throughout the day.
"but the data being updated can be for any time frame"
Sure, but as soon as a row is updated, no matter how old the row is, it will acquire two new values, last_updated and last_updated_date, unless both have already been set by a previous update, in which case they will be overwritten. If there are several updates to the same row between the update checks, the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check is stored in last_checked; where this piece of data is held (table, durable config) is implementation dependent.
discover whether the current check is the first check of the day. If so, additionally search for last_updated_date=yesterday AND last_updated>last_checked.
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a full table scan. And subject to the 'modest' assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that were updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition together with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made, it is impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, or clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
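The check-for-update logic above can be sketched in plain Python. This is only a model of the filter, not BigQuery code: rows are dicts, and the function names are made up for the example. It scans every date from the previous check through today, which for checks that run several times a day reduces to "today, plus yesterday on the first check of the day".

```python
from datetime import datetime, timedelta


def partitions_to_scan(now, last_checked):
    """Dates of the last_updated_date partitions a check at `now` must read."""
    first_day = last_checked.date()
    days = (now.date() - first_day).days
    return [first_day + timedelta(days=offset) for offset in range(days + 1)]


def updated_since(rows, now, last_checked):
    """Rows updated after `last_checked`, looking only at the needed partitions."""
    wanted = set(partitions_to_scan(now, last_checked))
    return [
        row for row in rows
        # Rows that were never updated have None in both columns and are skipped.
        if row["last_updated_date"] in wanted and row["last_updated"] > last_checked
    ]
```

In the real table the date list would become the partition filter (last_updated_date IN (...)), which is what keeps the scan limited to one or two daily partitions.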
Note 4
"clustering only helps performance"
From the point of view of avoiding table scans and reducing data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I'd suggest examining whether the existing partitioning is indispensable or whether it can be replaced with clustering.

Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have a structure identical to MyData, except that MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have been changed). Then the process upserts/merges from MyDataUpdate into MyData, and the same rows are exported and downloaded to be loaded into PostgreSQL. Then the staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
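The staging step can be modelled in a few lines of plain Python. This is a rough sketch of the idea only: the real process runs as MERGE statements in BigQuery, and the function names and the "id" key column are assumptions for the example.

```python
def latest_versions(staging_rows):
    """Keep only the newest version of each id, judged by batch_timestamp."""
    latest = {}
    for row in staging_rows:
        seen = latest.get(row["id"])
        if seen is None or row["batch_timestamp"] > seen["batch_timestamp"]:
            latest[row["id"]] = row
    return latest


def merge_into(main, updates):
    """Upsert the deduplicated rows into the main table (a dict keyed by id)."""
    for key, row in updates.items():
        # batch_timestamp exists only in staging, so it is dropped here.
        main[key] = {k: v for k, v in row.items() if k != "batch_timestamp"}
    return main


staging = [
    {"id": 1, "value": "a", "batch_timestamp": 10},
    {"id": 1, "value": "b", "batch_timestamp": 20},
    {"id": 2, "value": "c", "batch_timestamp": 10},
]
my_data = {3: {"id": 3, "value": "z"}}
merge_into(my_data, latest_versions(staging))
```

The deduplicated dictionary plays the role of MyDataUpdate: it is both what gets merged into the big table and what gets exported to PostgreSQL.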

Comparing yesterday's data with today's data

I have 2 parquet tables, one for today and one for yesterday. What I want to do is compare what has changed in today's table, e.g.:
which new rows have been added
which rows have been deleted and when they have been deleted
which rows have been changed
The tables themselves have columns "createdAt" and "updatedAt" which I can use for this purpose.
I'm working with Databricks/Apache Spark so I can either use their built-in functions or an SQL query. I'm not sure how to go about this, any general ideas are appreciated!

Maintain an audit table behind your main table. A row must be inserted into the audit table whenever you perform an insert, update, or delete on your main table. The audit table should include the createdAt of the main table and the current date stamp.
If you encode the transaction type (insert, update, or delete) as 1, 2, 3, it will be good for query performance.

As I don't know the load type (full or delta) of your table, I will try to cover both scenarios:
Full Load -
For this, you only need today's table, as it will contain all the previous days' records as well.
Hence you only need a condition that selects all the records modified after yesterday's load, using the updatedAt column, i.e.
updatedAt > yesterday's load date
Delta Load -
For delta, each day you will get only the modified records (new, updated, or deleted), so just querying today's table without any condition will serve the purpose.
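For the full-load case, the timestamp comparison can be sketched in plain Python as below. This is an illustration rather than Spark code: rows are dicts, the "id" key column and the function name are assumptions, and deleted rows are found by comparing keys with yesterday's table, since a row that is gone carries no timestamp to say when it was removed.

```python
from datetime import datetime


def classify(today_rows, yesterday_ids, last_load):
    """Split today's full load into new, changed and deleted ids."""
    new = sorted(r["id"] for r in today_rows if r["createdAt"] > last_load)
    changed = sorted(
        r["id"] for r in today_rows
        if r["createdAt"] <= last_load < r["updatedAt"]
    )
    deleted = sorted(set(yesterday_ids) - {r["id"] for r in today_rows})
    return {"new": new, "changed": changed, "deleted": deleted}


last_load = datetime(2024, 5, 1)
today = [
    {"id": 1, "createdAt": datetime(2024, 4, 1), "updatedAt": datetime(2024, 4, 1)},
    {"id": 2, "createdAt": datetime(2024, 4, 2), "updatedAt": datetime(2024, 5, 1, 9)},
    {"id": 4, "createdAt": datetime(2024, 5, 1, 8), "updatedAt": datetime(2024, 5, 1, 8)},
]
result = classify(today, yesterday_ids=[1, 2, 3], last_load=last_load)
```

With daily snapshots, the deletion time can only be narrowed down to "between the two loads".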
Now, on the Spark side, as you have a large number of records, you can adjust the number of DataFrame shuffle partitions at runtime using something like the following:
spark.sql("set spark.sql.shuffle.partitions = 1500");
You can find other optimization techniques here:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/

Database history for client usage

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point, my best idea is to create a few history tables that reflect the output I want to show to the users. Whenever a change is made to specific tables, I would update these history tables with the data as well.
I'm trying to figure out what the best way to go about would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET

I have used very successfully a model where every table has an audit copy - the same table with a few additional fields (time stamp, user id, operation type), and 3 triggers on the first table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
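As a rough illustration of the audit-copy model, the sketch below builds one table, its audit copy, and the three triggers. It uses SQLite through Python's sqlite3 module instead of Oracle, so the trigger syntax differs from PL/SQL; the customer table is hypothetical, and the user id column is left out because SQLite has no session user to record.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Audit copy: same columns plus a time stamp and the operation type.
CREATE TABLE customer_audit (
    id INTEGER,
    name TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    operation TEXT
);

CREATE TRIGGER customer_ai AFTER INSERT ON customer BEGIN
    INSERT INTO customer_audit (id, name, operation)
    VALUES (NEW.id, NEW.name, 'I');
END;

CREATE TRIGGER customer_au AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_audit (id, name, operation)
    VALUES (NEW.id, NEW.name, 'U');
END;

CREATE TRIGGER customer_ad AFTER DELETE ON customer BEGIN
    INSERT INTO customer_audit (id, name, operation)
    VALUES (OLD.id, OLD.name, 'D');
END;
"""


def make_db():
    """Create an in-memory database with the table, audit copy and triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def audit_trail(conn):
    """Return (operation, name) pairs in the order the changes happened."""
    return conn.execute(
        "SELECT operation, name FROM customer_audit ORDER BY rowid"
    ).fetchall()
```

Since every audited table follows the same shape, the schema text is easy to generate from a model, which is the low-overhead point made above.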

We've got that requirement in our systems. We added two tables, one header, one detail called AuditRow and AuditField. The AuditRow contains one row per row changed in any other table, and the AuditField contains one row per column changed with old value and new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed column) on each insert/update/delete. This system does rely on the fact that we have a GUID on every table that can uniquely represent the row. It doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.

If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive

Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using a PL/SQL API to do the INSERT/UPDATE/DELETEs, you could manage this with a simple shift in design, without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, the DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by making DATE_THRU the current date/time (SYSDATE). Showing the history is as simple as selecting from the table, the one record where DATE_THRU is NULL will be your current or active record.
Now if you expect a high volume of changes, writing off the old record to a history table would be preferable, but I still wouldn't manage it with triggers, I'd do it with the API.
Hope that helps.
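A minimal sketch of the DATE_FROM / DATE_THRU idea, with the "API" written as Python functions over SQLite instead of PL/SQL over Oracle. The item table, its columns, and the function names are hypothetical, and the timestamp is passed in by the caller, where an Oracle implementation would use SYSDATE.

```python
import sqlite3


def make_db():
    """Create the versioned table; date_thru stays NULL for the active row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item (id INTEGER PRIMARY KEY, item_key TEXT, "
        "value TEXT, date_from TEXT, date_thru TEXT)"
    )
    return conn


def save(conn, key, value, now):
    """Insert a new version and end-date the currently active one, if any."""
    with conn:  # both statements commit together
        conn.execute(
            "UPDATE item SET date_thru = ? "
            "WHERE item_key = ? AND date_thru IS NULL",
            (now, key),
        )
        conn.execute(
            "INSERT INTO item (item_key, value, date_from) VALUES (?, ?, ?)",
            (key, value, now),
        )


def remove(conn, key, now):
    """A delete only end-dates the active row; nothing is physically removed."""
    with conn:
        conn.execute(
            "UPDATE item SET date_thru = ? "
            "WHERE item_key = ? AND date_thru IS NULL",
            (now, key),
        )


def current(conn, key):
    """The active record is the one whose date_thru is still NULL."""
    row = conn.execute(
        "SELECT value FROM item WHERE item_key = ? AND date_thru IS NULL",
        (key,),
    ).fetchone()
    return row[0] if row else None


def history(conn, key):
    """All versions of a record, oldest first."""
    return conn.execute(
        "SELECT value, date_from, date_thru FROM item "
        "WHERE item_key = ? ORDER BY date_from, id",
        (key,),
    ).fetchall()
```

The front end can show history(conn, key) directly to users, and all writes go through save and remove, so no trigger is involved.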
