Keeping BigQuery table data up-to-date - google-bigquery

This is probably incorrect use case for BigQuery but I have following problem: I need to periodically update Big Query table. Update should be "atomic" in a sense that clients which read data should either use only old version of data or completely new version of data. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up to date data should know about partitions and get data only from certain partitions. Every time I want to make a query I would have first to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like solution to be easy and transparent for clients who read data.

You didn't mention the size of your update, I can only give some general guideline.
Most BigQuery updates, including single DML (INSERT/UPDATE/DELETE/MERGE) and single load job, are atomic. Your reader reads either old data or new data.
Lacking multi-statement transaction right now, if you do have updates which doesn't fit into single load job, the solution is:
Load update into a staging table, after all loads finished
Use single INSERT or MERGE to merge updates from staging table to primary data table
The drawback: scanning staging table is not for free
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming for each table that you need an update, there is a ActivePartition column as partition key, you may have a table with only one row.
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to a new active date, then your user use a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active

Related

Advice on changing the partition field for dynamic BigQuery tables

I am dealing with the following issue: I have a number of tables imported into BigQuery from an external source via AirByte with _airbyte_emitted_at as the default setting for partition field.
As this default choice for a partition field is not very lucrative, the need to change the partition field naturally presents itself. I am aware of the method available for changing partitions of existing tables, by means of a CREATE TABLE FROM SELECT * statement, however the new tables thus created - essentially copies of the original ones, with modified partition fields - will be mere static snapshots and no longer dynamically update, as the originals do each time new data is recorded in the external source.
Given such a context, what would the experienced members of this forum suggest as a solution to the problem?
Being that I am a relative beginner in such matters, I apologise in advance for any potential lack of clarity. I look forward to improving the clarity, should there be any suggestions to do so from interested readers & users of this forum.
I can think of 2 approaches to overcome this.
Approach 1 :
You can use Scheduled queries to copy the newly inserted rows to your 2nd table. You have to write the query in such a way that it will always select the latest rows from your main table and once you have that you can use Insert Into statement to append the rows in your 2nd table.
Since Schedule queries run at specific times the only drawback will be the the 2nd table will not get updated immediately whenever there is a new row in the main table, it will get the latest data whenever the Scheduled Query runs.
If you do not wish to have the latest data always in your 2nd table then this approach is the easier one to achieve.
Approach 2 :
You can trigger Cloud Actions for BigQuery events such as Insert, delete, update etc. Whenever a new row gets inserted in your main table ,using Cloud Run Actions you can insert that new data in your 2nd table.
You can follow this article , here a detailed solution has been given.
If you wish to have the latest data always in your 2nd table then this would be a good way to do so.

The best way to Update the database table through a pyspark job

I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing
table, and compare with the new data that comes in
IMHO entire data compare to load new data is not performant.
Option 1:
Instead you can create google-bigquery partition table and create a partition column to load the data and also while loading new data you can check whether the new data has same partition column.
Hitting partition level data in hive or bigquery is more useful/efficient than selecting entire data and comparing in spark.
Same is applicable for hive as well.
see this Creating partitioned tables
or
Creating and using integer range partitioned tables
Option 2:
Another alternative is with GOOGLE bigquery we have merge statement, if your requirement is to merge the data with out comparision, then you can go ahead with MERGE statement .. see doc link below
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, We can get performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement to update changes in the target table.
There are many ways this problem can be solved, one of the less expensive, performant and scalable way is to use a datastore on the file system to determine true new data.
As data comes in for the 1st time write it to 2 places - database and to a file (say in s3). If data is already on the database then you need to initialize the local/s3 file with table data.
As data comes in 2nd time onwards, check if it is new based its presence on local/s3 file.
Mark delta data as new or updated. Export this to database as insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won’t be coming. Regularly truncate this file to keep data within that time range.
You can also bucket and partition this data. You can use deltalake to maintain it too.
One downside is that whenever database is updated this file may need to be updated based on relevant data is being Changed or not. You can maintain a marker on the database table to signify sync date. Index that column too. Read changed records based on this column and update the file/deltalake.
This way your sparl app will be less dependent on a database. The database operations are not very scalable so keeping them away from critical path is better
Shouldnt you have a last update time in you DB? The approach you are using doesnt sound scalable so if you had a way to set update time to each row in the table it will solve the problem.

System Versioning with ETL

I have just started to investigate System Versioning (temporal tables) as it is pretty cool feature with SQL. I was able to successfully set it up on one of my existing tables with the below query. However whenever my daily ETL runs, it is adding data to the History table for all items in Table1, whether there are changes or not. I am using an Insert into, Update, and Delete SQL Task in SSIS for my ETL. The ETL is updating all existing rows, usually with the same data, but I was hopeful System Versioning would only add a new row if the existing row actually had data change.
Is this a limitation of using an Update statement in a SQL Task in my ETL? Would using Slowly Changing Dimension in Data Flow make a difference or is there a better way to make this work?
Or is this a limitation of System Versioning with ETLs and I should use something else for tracking table changes?
CREATE SCHEMA History
GO
ALTER TABLE Table1
ADD
SysStartTime datetime2(0) GENERATED ALWAYS AS ROW START HIDDEN
CONSTRAINT DF_SysStart DEFAULT SYSUTCDATETIME()
, SysEndTime datetime2(0) GENERATED ALWAYS AS ROW END HIDDEN
CONSTRAINT DF_SysEnd DEFAULT CONVERT(datetime2 (0), '9999-12-31 23:59:59'),
PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime);
GO
ALTER TABLE Table1
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = History.Table1))
Comments to answer
If you issue an update statement that is the equivalent of setting every field to itself, SQL Server doesn't have any shortcut logic built into it that flags it as a noop and does nothing. It sounds like your current ETL pattern is updating every record per run and thus, your historical tables are growing
That makes sense. Would redesigning my ETL to only update changed rows would be the best way to address my historical table growing? Does a Slowly Changing Dimension Table accomplish this or is there a better way to make this happen? – Hslew 2 days ago
It's been ages since I used the Slowly Changing Dimension Wizard in SSIS. It was a bit crap back then and I can't see them having improved it at all. I find i have the most success with a source query to deduplicate within my data, a lookup to determine does the inbound row match an existing row. If it doesn't match, I load that in my table as it's new data. If it matches, then I need to do a second check to see if it's changed. Only if it changes, do I send it to a second table. SSIS doesn't scale updates well. After the data flow completes, would I update the destination (Execute SQL Task)

Truncate and insert new content into table with the least amount of interruption

Twice a day, I run a heavy query and save the results (40MBs worth of rows) to a table.
I truncate this results table before inserting the new results such that it only ever has the latest query's results in it.
The problem, is that while the update to the table is written, there is technically no data and/or a lock. When that is the case, anyone interacting with the site could experience an interruption. I haven't experienced this yet, but I am looking to mitigate this in the future.
What is the best way to remedy this? Is it proper to write the new results to a table named results_pending, then drop the results table and rename results_pending to results?
Two methods come to mind. One is to swap partitions for the table. To be honest, I haven't done this in SQL Server, but it should work at a low level.
I would normally have all access go through a view. Then, I would create the new day's data in a separate table -- and change the view to point to the new table. The view change is close to "atomic". Well, actually, there is a small period of time when the view might not be available.
Then, at your leisure you can drop the old version of the table.
TRUNCATE is a DDL operation which causes problems like this. If you are using snapshot isolation with row versioning and want users to either see the old or new data then use a single transaction to DELETE the old records and INSERT the new data.
Another option if a lot of the data doesn't actually change is to UPDATE / INSERT / DELETE only those records that need it and leave unchanged records alone.

No waiting while Truncate Table

I have an SSIS package that runs repeatedly after 1 hour. This package first truncates a table and then populate that table with new data. And this process takes 15-20 minutes. When this package runs, data is not available to the users. So they have to wait until package runs completely. Is there any way to handle this situation so users don't have to wait?
Do not truncate the table. Instead, add a audit column with date data type, partition the table with hourly partitions on this audit column, drop the old partition once the new partition is loaded with new data.
Make sure the users query are directed to the proper partition with the help of the audit column
You can do an 'A-B flip'.
Instead of truncating the client-facing table and reloading it, you could use two tables to do the job.
For example, if the table in question is called ACCOUNT:
Load the data to a table called STG_ACCOUNT
Rename ACCOUNT to ACCOUNT_OLD
Rename STG_ACCOUNT to ACCOUNT
Rename ACCOUNT_OLD to STG_ACCOUNT
By doing this, you minimize the amount of time the users have an empty table.
It's very dangerous practice but you can change isolation levels of your transactions (I mean users queries) from ReadCommitted/Serializable to ReadUncommitted. But the behavior of this queries is very hard to predict. If your table is under using of SSIS package (insert/delete/update) and end users do some uncommitted reads (like SELECT * FROM Table1 WITH (NOLOCK) ), some rows can be counted several times or missed.
If users want to read only 'new-hour-data' you can try to change isolation levels to 'dirty read', but be careful!
If they can work with data from previous hour, the best solution is described by Arnab, but partitions are available only in Enterprise edition. Use rename in another SQL Server editions as Zak said.
[Updated] If the main lag (tens of minutes, as you said) is caused by complex calculations (and NOT because of amount of loaded rows!), you can use another table like a buffer. Store there several rows (hundreds, thousands etc.) and then reload them to the main table. So new data will be available in portions without 'dirty read' tricks.