Can I detect changes in my ODS tables before inserting them into a dimension table in the DWH? I use SQL and Pentaho for data loading. For information, I use 4 tables to load my dimension table, so how can I detect changes in those 4 tables before using them?
There are two transformation steps that can help you compare the content of two tables: Merge rows (diff) and Table compare.
You could keep a copy of the tables and, each time you run your process, compare the current content with the content of the last copy, although that approach does not perform well if the tables are big.
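If your database supports set operations, a minimal sketch of that comparison could look like this (EXCEPT in most databases, MINUS in Oracle); the table names are illustrative only:

-- Rows in the current ODS table that are new or changed since the last copy was taken
SELECT * FROM ods_table
EXCEPT
SELECT * FROM ods_table_last_copy;

The result is the set of rows you would feed into the dimension load; the reverse of the query finds rows that were deleted.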
Or, if your database allows auditing of changes, you could enable that audit and retrieve only the rows the audit says have changed since the last load.
There's also the option of using a database trigger that makes sure a modification date is updated each time a row changes; using the column where you store that modification date, you can retrieve the changed rows.
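The exact trigger syntax depends on your database; as a rough Oracle-style sketch, assuming a hypothetical LAST_MODIFIED column on the ODS table:

CREATE OR REPLACE TRIGGER trg_ods_table_moddate
BEFORE INSERT OR UPDATE ON ods_table
FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSDATE;
END;
/

-- At load time, pick up only the rows changed since the last run:
SELECT * FROM ods_table WHERE last_modified > :last_load_time;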
This is probably an incorrect use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients which read the data should see either only the old version of the data or the complete new version. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up-to-date data have to know about the partitions and get data only from certain partitions. Every time I want to make a query, I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally, I would like the solution to be easy and transparent for the clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including single DML statements (INSERT/UPDATE/DELETE/MERGE) and single load jobs, are atomic. Your readers see either the old data or the new data.
Since multi-statement transactions are not available right now, if you do have updates that don't fit into a single load job, the solution is:
Load the updates into a staging table; then, after all loads have finished,
use a single INSERT or MERGE to move the updates from the staging table into the primary data table (a sketch follows below).
The drawback: scanning the staging table is not free.
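A rough sketch of that two-step flow; the dataset, table, and column names are illustrative only:

-- 1. Load each batch of updates into mydataset.staging_table via load jobs.
-- 2. Once all loads have finished, publish everything in one atomic statement:
INSERT INTO mydataset.data_table (id, value, updated_at)
SELECT id, value, updated_at
FROM mydataset.staging_table;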
Update: since you have multiple tables to update atomically, there is a small trick that may be helpful.
Assuming each table you need to update has an ActivePartition column as its partition key, you can keep a table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time, after loading, you set ActivePartition.active to the new active date; your users then query with a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
I have a Spark job that gets data from multiple sources and aggregates it into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table and compare it with the new data that comes in. The comparison happens in the Spark layer.
I was wondering if there is any better way to compare that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in
IMHO, comparing the entire data set in order to load new data is not performant.
Option 1:
Instead, you can create a BigQuery partitioned table with a partition column used when loading the data, and while loading new data you can check whether the new data falls into the same partition.
Hitting partition-level data in Hive or BigQuery is more efficient than selecting the entire data set and comparing it in Spark.
The same approach applies to Hive as well.
See Creating partitioned tables or Creating and using integer range partitioned tables.
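As a minimal BigQuery sketch (the dataset, table, and column names are assumptions, not from the question), a date-partitioned table lets each load and each check touch only one partition:

-- Illustrative date-partitioned table
CREATE TABLE mydataset.aggregated_data (
  id        INT64,
  value     FLOAT64,
  load_date DATE
)
PARTITION BY load_date;

-- Reading or overwriting a single day touches only that partition:
SELECT *
FROM mydataset.aggregated_data
WHERE load_date = '2021-01-01';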
Option 2:
Another alternative: Google BigQuery has a MERGE statement. If your requirement is to merge the data without comparison, you can go ahead with the MERGE statement; see the doc link below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass; we do not need to write individual statements to apply changes to the target table.
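A minimal MERGE sketch, with illustrative table and column names (adjust to your schema):

MERGE mydataset.target T
USING mydataset.incoming S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET value = S.value, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (S.id, S.value, S.updated_at);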
There are many ways this problem can be solved; one of the less expensive, performant, and scalable ways is to use a datastore on the file system to determine which data is truly new.
As data comes in for the first time, write it to two places: the database and a file (say, in S3). If data is already in the database, then you need to initialize the local/S3 file with the table data.
As data comes in from the second time onwards, check whether it is new based on its presence in the local/S3 file.
Mark the delta data as new or updated, and export it to the database as an insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate the file to keep its data within that time range.
You can also bucket and partition this data, and you can use Delta Lake to maintain it.
One downside is that whenever the database is updated, this file may need to be updated as well, depending on whether the relevant data has changed or not. You can maintain a marker column on the database table to signify the sync date; index that column too. Read changed records based on this column and update the file/Delta Lake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
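A rough Spark SQL sketch of the delta-detection step, assuming the snapshot (the local/S3 file or Delta table) has been registered as a view named snapshot, the incoming batch as incoming, and a hypothetical key column id:

-- Keys present in the incoming batch but not in the snapshot: new rows
SELECT i.*
FROM incoming i
LEFT ANTI JOIN snapshot s ON i.id = s.id;

-- Keys that exist in both but whose payload changed: updated rows
SELECT i.*
FROM incoming i
JOIN snapshot s ON i.id = s.id
WHERE i.payload <> s.payload;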
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.
I am using PostgreSQL. I need to delete all transaction data from the database (except the last three months of transaction data), then restore the data to a new database with the created/updated timestamps set to the current timestamp. Also, the data older than the last three months must be recapped into one record (for example, all invoices from party A must be grouped into one invoice for party A). Another rule: if data is still referenced by foreign keys from the last three months of data, it must not be deleted; only its created/updated timestamps are changed to the current timestamp.
I am not good at SQL queries, so for now I am using this strategy:
First, create the recap data from all data (saved in a separate temporary table) before the delete.
Then delete all data except the last three months.
Next, create the recap data after the delete.
Create the recap data from (all data - after-delete data), so I get recap data whose amounts exactly match the data older than the last three months.
Then insert the recap data into the table, so the old data is cleaned up and the recap data is in the database.
So my strategy only uses the same database and does not create a new database, because importing the data with the program is very slow (there are 900+ tables).
But the client doesn't want to use this strategy because he wants the data created in a new database and told me to use another way. So the question is: what is the correct procedure to clean a database of data before a given date (filtered by date) and recap the old data?
First of all, there is no way to find out when a row was added to a table unless you track it with a timestamp column.
That's the first change you'll have to make: add a timestamp column to all relevant tables that tracks when the row was created (or updated, depending on the requirement).
Then you have two choices:
Partition the tables by the timestamp column so that you have (for example) one partition per month.
Advantage: it is easy to get rid of old data: just drop the partition.
Disadvantage: Partitioning is tricky in PostgreSQL. It will become somewhat easier to handle in PostgreSQL v10, but the underlying problems remain.
Use mass DELETEs to get rid of old rows. That's easy to implement, but mass deletes really hurt (table and index bloat, which might necessitate VACUUM (FULL) or REINDEX, both of which impair availability).
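As a minimal PostgreSQL sketch of that second choice (the table and column names here are illustrative, not from the question):

-- Track row creation going forward
ALTER TABLE invoice ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

-- Later, remove everything older than three months in one mass delete
DELETE FROM invoice
WHERE created_at < now() - interval '3 months';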
I am working on some datasets which get updated daily. By updating, I mean that three things happen:
1. New rows get added.
2. Some rows get deleted.
3. Some existing rows get replaced with new values.
Now I have prepared dashboards in Tableau to analyze the daily data, but I would also like to compare how things are changing day to day (i.e., are we progressing or losing ground compared to the previous day?).
I am aware that we can take extracts from the data set. But if I go this way, I am not sure how to use all the extracts in one worksheet and compare the info given by all of them.
Tableau is simply a mechanism that builds a SQL query in the background and then builds tables, charts, and so on from the fetched result. This means that if you delete a row from the table, it no longer exists, so how can Tableau read it? If anything, your DB architecture should be creating new records and giving each one a create timestamp. You would NOT delete a record and put in a new one; otherwise you'll only ever have one version of that record in the table. This sounds like a design issue.
I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point, my best idea is to create a few history tables that reflect the output I want to show to the users. Whenever a change is made to the relevant tables, I would update these history tables with the data as well.
I'm trying to figure out what the best way to go about this would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET
I have very successfully used a model where every table has an audit copy: the same table with a few additional fields (timestamp, user id, operation type), plus three triggers on the original table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
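A rough sketch of that model for one hypothetical table CUSTOMER(ID, NAME); the audit copy and trigger here are illustrative only:

-- Audit copy: same columns plus timestamp, user, and operation type
CREATE TABLE customer_aud AS
SELECT c.*,
       CAST(NULL AS DATE)         AS aud_ts,
       CAST(NULL AS VARCHAR2(30)) AS aud_user,
       CAST(NULL AS VARCHAR2(1))  AS aud_op
FROM   customer c
WHERE  1 = 0;

CREATE OR REPLACE TRIGGER trg_customer_aud
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
  IF DELETING THEN
    INSERT INTO customer_aud VALUES (:OLD.id, :OLD.name, SYSDATE, USER, 'D');
  ELSE
    INSERT INTO customer_aud
    VALUES (:NEW.id, :NEW.name, SYSDATE, USER,
            CASE WHEN INSERTING THEN 'I' ELSE 'U' END);
  END IF;
END;
/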
We've got that requirement in our systems. We added two tables, one header and one detail, called AuditRow and AuditField. AuditRow contains one row per row changed in any other table, and AuditField contains one row per changed column, with the old value and the new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed column) on each insert/update/delete. This system does rely on the fact that we have a GUID on every table that can uniquely represent the row. It doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.
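For illustration only (the real column layout in that system isn't shown in the answer), the two tables could look roughly like this in Oracle:

CREATE TABLE auditrow (
  auditrow_id RAW(16) DEFAULT SYS_GUID() PRIMARY KEY,
  table_name  VARCHAR2(128) NOT NULL,
  row_guid    RAW(16)       NOT NULL,  -- the GUID that identifies the changed row
  operation   VARCHAR2(1)   NOT NULL,  -- I / U / D
  changed_by  VARCHAR2(30),
  changed_at  DATE
);

CREATE TABLE auditfield (
  auditrow_id RAW(16) NOT NULL REFERENCES auditrow (auditrow_id),
  column_name VARCHAR2(128) NOT NULL,
  old_value   VARCHAR2(4000),
  new_value   VARCHAR2(4000)
);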
If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive.
Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and are not necessarily the best choice if you are looking to scale out.
If you are using a PL/SQL API to do the INSERTs/UPDATEs/DELETEs, you could manage this with a simple shift in design, without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by setting DATE_THRU to the current date/time (SYSDATE). Showing the history is as simple as selecting from the table; the one record where DATE_THRU is NULL will be your current or active record.
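A minimal sketch of what the API would do, assuming a hypothetical table T keyed on ID:

-- "End date" the current version, then insert the new one
UPDATE t SET date_thru = SYSDATE WHERE id = :p_id AND date_thru IS NULL;
INSERT INTO t (id, some_value, date_from, date_thru)
VALUES (:p_id, :p_value, SYSDATE, NULL);

-- The active record is the one whose DATE_THRU is still NULL
SELECT * FROM t WHERE id = :p_id AND date_thru IS NULL;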
Now, if you expect a high volume of changes, writing the old record off to a history table would be preferable, but I still wouldn't manage it with triggers; I'd do it with the API.
Hope that helps.