I have 2 parquet tables, one for today and one for yesterday. What I want to do is compare what has changed in today's table, e.g.:
which new rows have been added
which rows have been deleted and when they have been deleted
which rows have been changed
The tables themselves have "createdAt" and "updatedAt" columns, which I can use for this purpose.
I'm working with Databricks/Apache Spark so I can either use their built-in functions or an SQL query. I'm not sure how to go about this, any general ideas are appreciated!
Maintain an audit table behind your main table. A row must be inserted into the audit table whenever you perform an insert, update, or delete on the main table. The audit table should include the main table's createdAt and the current timestamp.
If you encode the transaction type (insert, update, or delete) as 1, 2, 3, it will be good for query performance.
As I don't know the load type (full or delta) for your table, I will try to cover both scenarios:
Full Load -
For this, you only need today's table, as it will contain all the records from previous days as well.
Hence you only need a condition that selects the records modified after yesterday's load, using the updatedAt column, i.e.
updatedAt > yesterday's load date
Delta Load -
For delta, each day you will get only the modified records (new, updated or deleted), so simply querying today's table without any condition will serve the purpose.
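If you do have both full snapshots, as in the question, the three categories can be derived by joining the snapshots on the row key. The following is only a minimal sketch: it uses Python's built-in sqlite3 in place of Spark so that it runs anywhere, and the table names (yesterday, today), the key column id and the sample rows are assumptions made for illustration. The same joins should carry over to Spark SQL with little change.

```python
import sqlite3

def make_demo():
    """Build two tiny snapshot tables (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE yesterday (id INTEGER PRIMARY KEY, val TEXT, createdAt TEXT, updatedAt TEXT);
        CREATE TABLE today     (id INTEGER PRIMARY KEY, val TEXT, createdAt TEXT, updatedAt TEXT);
        INSERT INTO yesterday VALUES
            (1, 'a', '2024-01-01', '2024-01-01'),
            (2, 'b', '2024-01-01', '2024-01-01'),
            (3, 'c', '2024-01-01', '2024-01-01');
        INSERT INTO today VALUES
            (1, 'a', '2024-01-01', '2024-01-01'),
            (2, 'B', '2024-01-01', '2024-01-02'),
            (4, 'd', '2024-01-02', '2024-01-02');
    """)
    return conn

def diff_snapshots(conn):
    """Classify row keys as added, deleted or changed between the snapshots."""
    def ids(sql):
        return [row[0] for row in conn.execute(sql)]
    return {
        # Present today, absent yesterday.
        "added": ids("SELECT t.id FROM today t LEFT JOIN yesterday y ON y.id = t.id "
                     "WHERE y.id IS NULL ORDER BY t.id"),
        # Present yesterday, absent today.
        "deleted": ids("SELECT y.id FROM yesterday y LEFT JOIN today t ON t.id = y.id "
                       "WHERE t.id IS NULL ORDER BY y.id"),
        # Present in both, but updatedAt moved forward.
        "changed": ids("SELECT t.id FROM today t JOIN yesterday y ON y.id = t.id "
                       "WHERE t.updatedAt > y.updatedAt ORDER BY t.id"),
    }

print(diff_snapshots(make_demo()))
```

Note that a snapshot comparison can only say that a row was deleted some time between the two loads; the exact deletion time is not recoverable from the snapshots alone.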
Now, on the Spark side, as you have a large number of records, you can adjust the number of shuffle partitions at runtime using something like the following:
spark.sql("set spark.sql.shuffle.partitions = 1500");
You can find other optimization techniques here:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within the last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
An approach would be to have 3 tables:
one base table in "append only" mode: inserts are appended, and each update is appended as a full row, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be handled as a soft delete if a special column is kept in the first table)
a live table where you hold the current data (in this table you would run your MERGE statements, most probably from the first base table).
If you choose partitioning and clustering, you can take advantage of the discounted long-term storage price and scan less data.
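As a rough illustration of the append-only idea, the sketch below keeps every row version in one history table (with a soft-delete flag) and derives both the live rows and the "changed since" list from it. It uses Python's built-in sqlite3 rather than BigQuery, and the names base_history, version_ts and is_deleted are hypothetical.

```python
import sqlite3

def make_history():
    """Append-only history table; every insert/update/delete adds a row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE base_history (
            id INTEGER, value TEXT, version_ts TEXT, is_deleted INTEGER DEFAULT 0
        );
        INSERT INTO base_history VALUES
            (1, 'a',  '2024-01-01T00:00:00', 0),
            (2, 'b',  '2024-01-01T00:00:00', 0),
            (1, 'a2', '2024-01-02T00:00:00', 0),  -- update of row 1
            (2, 'b',  '2024-01-03T00:00:00', 1);  -- soft delete of row 2
    """)
    return conn

def current_rows(conn):
    """Latest version per id, excluding rows whose latest version is a delete."""
    return conn.execute("""
        SELECT h.id, h.value
        FROM base_history h
        JOIN (SELECT id, MAX(version_ts) AS ts FROM base_history GROUP BY id) latest
          ON latest.id = h.id AND latest.ts = h.version_ts
        WHERE h.is_deleted = 0
        ORDER BY h.id
    """).fetchall()

def changed_since(conn, ts):
    """Ids with at least one new version after the given timestamp."""
    return [r[0] for r in conn.execute(
        "SELECT DISTINCT id FROM base_history WHERE version_ts > ? ORDER BY id", (ts,))]

conn = make_history()
print(current_rows(conn))
print(changed_since(conn, "2024-01-02T12:00:00"))
```

Because the history table is only ever appended to, the "what changed since the last check" query touches recent versions only, which is what makes partitioning it by version date attractive.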
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases; for example, the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable, because the BQ engine will scan only one daily partition.
P.S.
Detailed explanation
"I could add a last_updated timestamp column to keep track that way"
From that I inferred that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones), but that you can modify the table's UPDATE statements so that this column is set on newly modified rows.
Therefore I assumed you can modify the updates further to set an additional last_updated_date column, which will contain the date portion of the timestamp stored in the last_updated column.
"but then repeatedly querying the entire table all day"
From this I inferred there are multiple checks throughout the day.
"but the data being updated can be for any time frame"
Sure, but as soon as a row is updated, no matter how old the row is, it will acquire two new columns, last_updated and last_updated_date, unless both columns have already been added by a previous update, in which case the two columns will be updated rather than added. If there are several updates to the same row between update checks, the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check is stored in last_checked; where this piece of data is held (table, durable config) is implementation dependent.
determine whether the current check is the first check of the day. If so, additionally search for last_updated_date=yesterday AND last_updated>last_checked.
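The two steps above can be sketched as a small helper that decides which daily partitions a check has to touch and builds the matching filter. This is a conceptual illustration only: the function names are hypothetical, the filter is plain string building rather than a real BigQuery client call, and it assumes checks run at least once a day so that looking back one day is enough.

```python
from datetime import datetime, timedelta

def partitions_to_scan(last_checked: datetime, now: datetime) -> list:
    """Return the last_updated_date values the current check must cover."""
    today = now.date()
    if last_checked.date() < today:
        # First check of the day: updates made late yesterday, after the
        # previous check, live in yesterday's partition.
        return [today - timedelta(days=1), today]
    return [today]

def build_check_filter(last_checked: datetime, now: datetime) -> str:
    """Build a WHERE clause that prunes partitions before comparing timestamps."""
    dates = ", ".join(f"DATE '{d.isoformat()}'"
                      for d in partitions_to_scan(last_checked, now))
    return (f"last_updated_date IN ({dates}) "
            f"AND last_updated > TIMESTAMP '{last_checked.isoformat(sep=' ')}'")

# First check after midnight covers yesterday and today.
print(build_check_filter(datetime(2024, 5, 1, 23, 50), datetime(2024, 5, 2, 0, 10)))
# Later checks on the same day cover today only.
print(build_check_filter(datetime(2024, 5, 2, 0, 10), datetime(2024, 5, 2, 9, 0)))
```

After each successful check, last_checked would be overwritten with the time the check started, wherever that value is stored.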
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a full table scan. And subject to the 'modest' assumption made at the very beginning of my answer, the checks will address your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that had been updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition, together with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made, it will be impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In a real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, or clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I'd suggest examining whether the existing partitioning is indispensable or whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have the same structure as MyData, except that MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (so MyDataUpdate always contains only a unique list of the rows/values that have changed). Then the process upserts/merges from MyDataUpdate into MyData, and the same rows are exported and downloaded to be loaded into PostgreSQL. Finally, the staging/update tables are emptied as appropriate.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
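For readers who want to see the shape of that flow, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for BigQuery. The table names and batch_timestamp come from the description above; the id and value columns, the sample rows and the process_staging function are assumptions made for illustration, and INSERT OR REPLACE stands in for a real MERGE statement.

```python
import sqlite3

def make_tables():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE MyData        (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE MyDataUpdate  (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE MyDataStaging (id INTEGER, value TEXT, batch_timestamp TEXT);
        INSERT INTO MyData VALUES (1, 'old'), (2, 'keep');
        INSERT INTO MyDataStaging VALUES
            (1, 'v1',  '2024-01-01T10:00:00'),
            (1, 'v2',  '2024-01-01T11:00:00'),  -- newer version of row 1
            (3, 'new', '2024-01-01T10:30:00');
    """)
    return conn

def process_staging(conn):
    """Staging -> update (latest version per id) -> main table; returns changed rows."""
    conn.execute("""
        INSERT OR REPLACE INTO MyDataUpdate (id, value)
        SELECT s.id, s.value
        FROM MyDataStaging s
        JOIN (SELECT id, MAX(batch_timestamp) AS ts
              FROM MyDataStaging GROUP BY id) m
          ON m.id = s.id AND m.ts = s.batch_timestamp
    """)
    # These are the rows that would be exported to PostgreSQL.
    changed = conn.execute("SELECT id, value FROM MyDataUpdate ORDER BY id").fetchall()
    conn.execute("INSERT OR REPLACE INTO MyData (id, value) "
                 "SELECT id, value FROM MyDataUpdate")
    # Empty the intermediate tables once the merge has been applied.
    conn.execute("DELETE FROM MyDataStaging")
    conn.execute("DELETE FROM MyDataUpdate")
    conn.commit()
    return changed

conn = make_tables()
print(process_staging(conn))
print(conn.execute("SELECT id, value FROM MyData ORDER BY id").fetchall())
```

The point of the design is that change detection only ever reads the small staging and update tables; the large table is touched once per scheduled run, by the final merge.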
I am using PostgreSQL. I need to delete all transaction data from the database (except the last three months of transaction data), then restore the data to a new database with the created/updated timestamps set to the current timestamp. Also, the data older than three months must be recapped into a single record (for example, all invoices from party A must be grouped into one invoice for party A). Another rule: if old data is still referenced by foreign keys from the last three months of data, it must not be deleted; only its created/updated timestamp is changed to the current timestamp.
I am not good at SQL, so for now I am using this strategy:
First, create the recap data from all data (saved in a temporary table) before deleting anything.
Then delete all data except the last three months.
Next, create the recap data again after the delete.
Compute the final recap as (recap of all data - recap of the data remaining after the delete), so that I get recap amounts that exactly match the data older than three months.
Then insert the recap data into the table. The old data is now cleaned up and the database contains the recap data.
So my strategy uses only the same database and does not create a new one, because importing the data through the program is very slow (there are 900+ tables).
But the client doesn't want to use this strategy, because he wants the data to be created in a new database, and he told me to find another way. So the question is: what is the correct procedure to clean a database by date (filtering on a date) and recap the old data?
First of all, there is no way to find out when a row was added to a table unless you track it with a timestamp column.
That's the first change you'll have to make: add a timestamp column to all relevant tables that tracks when the row was created (or updated, depending on the requirement).
Then you have two choices:
Partition the tables by the timestamp column so that you have (for example) one partition per month.
Advantage: it is easy to get rid of old data: just drop the partition.
Disadvantage: Partitioning is tricky in PostgreSQL. It will become somewhat easier to handle in PostgreSQL v10, but the underlying problems remain.
Use mass DELETEs to get rid of old rows. That's easy to implement, but mass deletes really hurt (table and index bloat which might necessitate VACUUM (FULL) or REINDEX which impair availability).
Yesterday I made a data correction in my Oracle table in the production environment. But later I found out that although my SELECT statement fetched 62 rows, 64 rows were updated with apparently the same conditions. Since I do not have the list of rows affected by that update, I am unable to compare the list of rows selected with the list of rows updated. So is there a way to find the list of rows that were updated at that particular time on that table, say from 16:20 to 16:21? Does Oracle keep track of when the rows of a given table were updated?
If you have flashback enabled on your database you can check the data at a particular time in the database.
SELECT * FROM table
AS OF TIMESTAMP
TO_TIMESTAMP('2015-01-14 13:33:00', 'YYYY-MM-DD HH24:MI:SS')
WHERE column = 'your value';
Using this you can check the data before and after for the records you suspect.
If you can't find the records, then you can use
SELECT SCN_TO_TIMESTAMP(ORA_ROWSCN) FROM table where <<your condition>>
However SCN_TO_TIMESTAMP(ORA_ROWSCN) can be obtained only for few records which were updated recently. You can only convert to and from SCNs that are in the redo/flashback window maintained by the system. Once changes age out then the mapping is lost.
Try Oracle's Flashback functionality (provided your database is configured for it).
If enabled, it will show you your table's data as of a given point in time in the past.
For more details, see http://docs.oracle.com/cd/E11882_01/appdev.112/e41502/adfns_flashback.htm#ADFNS01001
I have a program that retrieves data and stores it in a table each day, and then another program that queries that data to produce reports. The reports need to say when the data was last updated, so we know how old the information is.
It seems wasteful to add a column with the last update date to the table, since all the rows will have the same value. It also seems wasteful to create a table just to store one value.
What is the best solution for keeping track of the last time a table was updated?
My preferred way is to create a new "report" table that stores the last time the target table was updated, and to create a trigger that updates the "report" table whenever there is a change to the target table.
See this for more information on creating such a trigger:
http://www.techonthenet.com/oracle/triggers/after_update.php
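To make the idea concrete, here is a small runnable sketch of the trigger approach. It uses Python's built-in sqlite3 so it can be tried without an Oracle instance, which means the trigger syntax shown is SQLite's and would need adapting for Oracle. The table names daily_data and table_last_update and the helper last_update are hypothetical.

```python
import sqlite3

def make_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE daily_data (id INTEGER PRIMARY KEY, value TEXT);
        -- One row per tracked table, holding the time of its last change.
        CREATE TABLE table_last_update (table_name TEXT PRIMARY KEY, updated_at TEXT);

        CREATE TRIGGER daily_data_after_insert AFTER INSERT ON daily_data
        BEGIN
            INSERT OR REPLACE INTO table_last_update (table_name, updated_at)
            VALUES ('daily_data', datetime('now'));
        END;

        CREATE TRIGGER daily_data_after_update AFTER UPDATE ON daily_data
        BEGIN
            INSERT OR REPLACE INTO table_last_update (table_name, updated_at)
            VALUES ('daily_data', datetime('now'));
        END;

        CREATE TRIGGER daily_data_after_delete AFTER DELETE ON daily_data
        BEGIN
            INSERT OR REPLACE INTO table_last_update (table_name, updated_at)
            VALUES ('daily_data', datetime('now'));
        END;
    """)
    return conn

def last_update(conn, table_name):
    """Return the recorded last-change time, or None if never changed."""
    row = conn.execute(
        "SELECT updated_at FROM table_last_update WHERE table_name = ?",
        (table_name,)).fetchone()
    return row[0] if row else None

conn = make_demo()
print(last_update(conn, "daily_data"))   # nothing recorded yet
conn.execute("INSERT INTO daily_data (value) VALUES ('x')")
print(last_update(conn, "daily_data"))   # time of the insert, in UTC
```

The report program then reads a single row from the small tracking table instead of scanning the data table.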
You should probably add a "LastModified" column and save the datetime whenever a row is updated. This should be the best way to identify when your table was last updated.
This should help:
http://docs.oracle.com/cd/B19306_01/server.102/b14237/statviews_2097.htm
ALL_TAB_MODIFICATIONS describes tables accessible to the current user that have been modified since the last time statistics were gathered on the tables.
Column TIMESTAMP (datatype DATE): indicates the last time the table was modified.
So:
select TIMESTAMP from ALL_TAB_MODIFICATIONS
where table_name = 'MY_TABLE'
What would be the most efficient way to select only rows from DB2 table that are inserted/updated since the last select (or some specified time)? There is no field in the table that would allow us to do this easily.
We are extracting data from the table for purposes of reporting, and now we have to extract the whole table every time, which is causing big performance issues.
I found example on how to select only rows changed in last day:
SELECT * FROM ORDERS
WHERE ROW CHANGE TIMESTAMP FOR ORDERS >
CURRENT TIMESTAMP - 24 HOURS;
But, I am not sure how efficient this would be, since the table is enormous.
Is there some other way to select only the rows that have changed that might be more efficient than this?
I also found a solution called ParStream. It seems to be something that can speed up demanding queries on the data, but I was unable to find any useful documentation about it.
I propose these options:
You can use Change Data Capture, which will automatically replay the modifications to another data source.
Normally, a select statement does not guarantee the order of the rows. That means that you cannot use a select without a time reference to retrieve the most recent rows; you need a time column for that. You can keep track of the most recent row's time in a global variable, and the next time retrieve the rows with a time greater than that variable. If you want to increase performance, you can put the table in append mode, so that the new rows will be physically stored together. An index on this time column could be expensive to maintain, but it will speed up extraction (no table scan) when you need to retrieve the rows.
If your server is DB2 for i, use database journaling. You can extract the after images of inserted records by time period or journal entry number from the journal receiver(s). The data entries can then be copied to your target file.
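The second option (a time column plus a remembered "most recent" value) can be sketched as follows. This is a generic illustration using Python's built-in sqlite3, not DB2-specific code; the orders schema, the changed_at column and the fetch_changed function are assumptions, and in DB2 the time column could be a row change timestamp column instead.

```python
import sqlite3

def make_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, changed_at TEXT);
        -- The index lets the extract seek to the watermark instead of scanning.
        CREATE INDEX orders_changed_at ON orders (changed_at);
        INSERT INTO orders VALUES
            (1, 'a', '2024-01-01 08:00:00'),
            (2, 'b', '2024-01-01 09:00:00'),
            (3, 'c', '2024-01-01 10:00:00');
    """)
    return conn

def fetch_changed(conn, watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, changed_at FROM orders "
        "WHERE changed_at > ? ORDER BY changed_at",
        (watermark,)).fetchall()
    # Advance the watermark to the newest change seen; keep it if nothing changed.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

conn = make_demo()
rows, mark = fetch_changed(conn, "2024-01-01 08:30:00")
print(rows, mark)
```

The returned watermark has to be persisted between runs (the "global variable" above). One caveat of this pattern is that a row committed later with a timestamp equal to or older than the stored watermark would be missed, so some overlap between extracts may be needed in practice.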