To preserve the uniqueness of a column, several of my clients stream data into a staging table in BigQuery (retrying twice at 10-minute intervals if the row is not yet present in the main table), and a separate cron job MERGEs the staging table into a column-partitioned main table every few minutes.
I need to truncate the staging table once it has been merged into the main table, but my clients seem to be streaming data into it all the time. Is there any recommendation here?
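For reference, a minimal sketch of the kind of deduplicating MERGE such a cron job might run (dataset, table, and column names here are hypothetical):

    -- `mydataset.main` is the column-partitioned main table, `mydataset.staging`
    -- the streaming staging table, and `id` the column that must stay unique.
    MERGE `mydataset.main` AS m
    USING (
      -- Deduplicate the staging rows first, in case a client retried an insert.
      SELECT * EXCEPT (rn)
      FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) AS rn
        FROM `mydataset.staging`
      )
      WHERE rn = 1
    ) AS s
    ON m.id = s.id
    WHEN NOT MATCHED THEN
      INSERT (id, event_ts, payload)
      VALUES (s.id, s.event_ts, s.payload);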
ALERT: Do not truncate a table that is receiving streaming results.
See https://cloud.google.com/bigquery/troubleshooting-errors#streaming:
Deleting and/or recreating a table may create a period of time where streaming inserts are effectively delivered to the old table and will not be present in the newly created table.
Truncating a table's data (e.g. via a query job that uses writeDisposition of WRITE_TRUNCATE) may similarly cause subsequent inserts during the consistency period to be dropped.
For alternative best practices when streaming into BigQuery, see:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
In this case, why not have your clients write to Pub/Sub instead? Then you can use Dataflow to move the data into the permanent tables.
Bonus: Pub/Sub + Dataflow + BigQuery can guarantee exactly-once delivery.
Related
I am creating a Hive table by joining multiple source tables. The join takes approximately 3 hours because of the huge data volume. The table is loaded truncate-and-load and is consumed by downstream systems.
We plan to refresh this Hive table 4 times a day because the data in the source tables keeps getting updated. Since the load is truncate-and-load, the table is empty for roughly 3 hours on each refresh while the join query runs, and during that window the data is not available to the downstream.
Can someone suggest how we can keep the truncate-and-load approach while the old data remains available to the downstream during the fresh data loads?
One option to ensure the downstream gets data during the ~3-hour downtime is to create a read copy of the table for the downstream systems. For example, create a tableB that is populated with a select * from tableA_with_joins. Downstream then reads from tableB and still receives data even while a truncate-and-load is happening on tableA.
The downside of this approach is the additional time spent syncing the data from tableA to tableB, but it ensures your downstream receives data even during the downtime.
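As a rough sketch of that read-copy step in HiveQL (tableB is created once up front; the refresh runs only after the 3-hour join into tableA_with_joins has finished):

    -- One-time setup: a read copy with the same schema as the joined table.
    CREATE TABLE IF NOT EXISTS tableB LIKE tableA_with_joins;

    -- After each successful reload of tableA_with_joins, refresh the copy.
    -- Downstream keeps querying tableB, so it still sees the previous data
    -- while tableA_with_joins itself is being truncated and rebuilt.
    INSERT OVERWRITE TABLE tableB
    SELECT * FROM tableA_with_joins;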
I need to delete rows that match a given WHERE condition from a partitioned BigQuery table. The table always has a streaming buffer adding more data to it. I am not concerned with deleting from what is being streamed in, just historical data in yesterday's partition.
What is the correct strategy to remove data while a streaming buffer exists on a BigQuery table, ideally without downtime?
From the Data Manipulation Language documentation page:
"Rows that were written to a table recently via streaming (using the tabledata.insertall method) cannot be modified using UPDATE, DELETE, or MERGE statements. Recent writes are typically those that occur within the last 30 minutes. Note that all other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements."
This means you should restrict your DML statement with a time filter. Ideally the table has a creation-timestamp column you can use for this; I am not aware of a built-in metadata column that exposes the insertion time.
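For example, assuming the table is ingestion-time partitioned and has a created_at timestamp column (a hypothetical name), the DELETE might be restricted like this:

    -- Touch only yesterday's partition, and skip anything recent enough to
    -- still be in the streaming buffer (written within roughly the last 30 min).
    DELETE FROM `mydataset.mytable`
    WHERE _PARTITIONTIME = TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY), DAY)
      AND created_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 MINUTE)
      AND status = 'obsolete';  -- stand-in for your actual WHERE condition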
You can use a Data Manipulation Language DELETE statement. However, keep in mind the following (from the DML docs):
"Support for using Data Manipulation Language statements to modify partitioned table data is currently in Beta."
But you can always run a SELECT that filters out the records you want to delete and write the results back to the same partition.
There will be no downtime. The cost will be the same as a full scan of that single partition.
Here's the scenario I need help with:
I have a large Oracle table that is queried by mobile app users 1-10 times per second, which leaves very little room for downtime.
A backend process refreshes all the data in the table, approximately 1 million rows. The process deletes everything from the table and then inserts the values from the source table. That's it.
The problem: this leaves the table unavailable for too long (about 15 minutes).
I read about partition exchange, but all the examples I found deal with a specific partition range, which doesn't apply to my problem.
My question: can I somehow refresh the data in a temporary offline table and then just make that table my online/live table? That would just be a synonym/name swap, wouldn't it? Are there better methods?
Thanks!!!
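For illustration, the synonym-swap idea described in the question could look roughly like this (the table and synonym names are made up):

    -- The application only ever queries the synonym APP_ACCOUNTS; the data
    -- lives alternately in ACCOUNTS_A and ACCOUNTS_B.
    TRUNCATE TABLE accounts_b;
    INSERT /*+ APPEND */ INTO accounts_b
    SELECT * FROM source_table;
    COMMIT;
    -- Repointing the synonym is a quick DDL, so readers see the old data in
    -- ACCOUNTS_A right up until the flip, then the freshly loaded ACCOUNTS_B.
    CREATE OR REPLACE SYNONYM app_accounts FOR accounts_b;
    -- The next refresh reloads ACCOUNTS_A and points the synonym back at it.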
I'm holding huge volumes of transaction data in daily tables, one per business date:
transaction_20140101
transaction_20140102
transaction_20140103
...
The process flow is like this:
1. I load the batch of newly arrived files into a temp table.
2. I group by the transaction_date field to see which date each row belongs to; for each date, I query the temp table for that date and insert the rows into the proper transaction_YYYYMMDD table.
3. I run step 2 in parallel to save time, because the temp table might contain data belonging to 20 different days.
My challenge is what to do if some of these processes fail and others don't.
I can't simply run everything again, since that would create duplicates in the tables that were already updated successfully.
I currently solve this by managing the update state myself, but it seems too complex.
Is this the best practice for dealing with multiple tables?
I would be glad to hear how others handle loading data into multiple tables according to business date, and not just insert date (that part is easy).
You could add an extra step in the middle, where instead of moving directly from today's temp table into the permanent business-date tables, you extract into temporary daily tables and then copy the data over to the permanent tables.
1. Query from today's temp table, sharded by day into tmp_transaction_YYMMDD. Use WRITE_EMPTY or WRITE_TRUNCATE write disposition so that this step is idempotent.
2. Verify that all expected tmp_transaction_YYMMDD tables exist. If not, debug failures and go back to step 1.
3. Run parallel copy jobs from each tmp_transaction_YYMMDD table to append to the corresponding permanent transaction_YYMMDD table.
4. Verify copy jobs succeeded. If not, retry the individual failures from step 3.
5. Delete the tmp_transaction_YYMMDD tables.
The advantage of this is that you can catch query errors before affecting any of the end destination tables, then copy over all the added data at once. You may still have the same issue if the copy jobs fail, but they should be easier to debug and retry individually.
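Expressed in today's standard SQL, the idempotent extraction in step 1 could look like this for a single business date (the dataset and temp-table names are illustrative); the appends in step 3 would still be done with copy jobs:

    -- Idempotent, like WRITE_TRUNCATE: re-running this after a failure simply
    -- rebuilds the intermediate table, so no duplicates can reach the
    -- permanent transaction tables.
    CREATE OR REPLACE TABLE `mydataset.tmp_transaction_20140101` AS
    SELECT *
    FROM `mydataset.temp_load`
    WHERE transaction_date = '2014-01-01';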
Our incentive for incremental loads is cost, and therefore we are interested in "touching each record only once".
We use table decorators to identify the increment. We manage the increment timestamps independently and add them to the query at run time. It requires some logic to maintain, but nothing too complicated.
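For reference, range decorators are a legacy SQL feature; an increment query using the stored checkpoint timestamps (milliseconds since the epoch, values here invented) looks roughly like this:

    -- Scan (and pay for) only the rows added between the two checkpoint
    -- timestamps, instead of rescanning the whole table.
    SELECT *
    FROM [mydataset.transactions@1414234607000-1414238207000]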
I have an SSIS package that runs every hour. The package first truncates a table and then populates it with new data, and this process takes 15-20 minutes. While the package runs, the data is not available to the users, so they have to wait until the package has finished. Is there any way to handle this situation so users don't have to wait?
Do not truncate the table. Instead, add an audit column with a date data type, partition the table into hourly partitions on this audit column, and drop the old partition once the new partition has been loaded with the new data.
Make sure the users' queries are directed to the proper partition with the help of the audit column.
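A hedged sketch of that layout in T-SQL (object names and boundary values are illustrative; partitioning requires an edition that supports it):

    -- Hourly boundaries on the audit column.
    CREATE PARTITION FUNCTION pf_load_hour (datetime2)
        AS RANGE RIGHT FOR VALUES ('2024-01-01T00:00:00', '2024-01-01T01:00:00');

    CREATE PARTITION SCHEME ps_load_hour
        AS PARTITION pf_load_hour ALL TO ([PRIMARY]);

    -- The target table is created on the scheme, keyed by the audit column, e.g.
    --   CREATE TABLE dbo.MyTable (..., load_hour datetime2 NOT NULL)
    --       ON ps_load_hour (load_hour);

    -- Once the new hour has loaded, "drop" the stale partition by switching it
    -- out to an identically structured staging table and merging the empty range.
    ALTER TABLE dbo.MyTable SWITCH PARTITION 1 TO dbo.MyTable_stale;
    TRUNCATE TABLE dbo.MyTable_stale;
    ALTER PARTITION FUNCTION pf_load_hour() MERGE RANGE ('2024-01-01T00:00:00');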
You can do an 'A-B flip'.
Instead of truncating the client-facing table and reloading it, you could use two tables to do the job.
For example, if the table in question is called ACCOUNT:
Load the data to a table called STG_ACCOUNT
Rename ACCOUNT to ACCOUNT_OLD
Rename STG_ACCOUNT to ACCOUNT
Rename ACCOUNT_OLD to STG_ACCOUNT
By doing this, you minimize the amount of time the users have an empty table.
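In T-SQL the flip above might be scripted roughly as follows (assuming STG_ACCOUNT has already been loaded and matches ACCOUNT's structure):

    -- Do all three renames in one short transaction so readers never catch
    -- the moment when no table named ACCOUNT exists.
    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.ACCOUNT', 'ACCOUNT_OLD';
        EXEC sp_rename 'dbo.STG_ACCOUNT', 'ACCOUNT';
        EXEC sp_rename 'dbo.ACCOUNT_OLD', 'STG_ACCOUNT';
    COMMIT TRANSACTION;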
It's a very dangerous practice, but you can change the isolation level of your transactions (I mean the users' queries) from READ COMMITTED/SERIALIZABLE to READ UNCOMMITTED. The behavior of such queries is very hard to predict, though: while the table is being modified by the SSIS package (insert/delete/update) and end users do uncommitted reads (e.g. SELECT * FROM Table1 WITH (NOLOCK)), some rows can be counted several times or missed entirely.
If users only want to read the new hour's data, you can try switching to these 'dirty reads', but be careful!
If they can work with the previous hour's data, the best solution is the one described by Arnab, but partitioning is available only in Enterprise edition. In other SQL Server editions, use the rename approach, as Zak said.
[Updated] If the main lag (tens of minutes, as you said) is caused by complex calculations (and NOT by the amount of loaded rows!), you can use another table as a buffer: store batches of rows there (hundreds, thousands, etc.) and then reload them into the main table. That way new data becomes available in portions, without any 'dirty read' tricks.
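A minimal sketch of that buffer idea in T-SQL (table and column names are invented): the package fills dbo.Table1_Buffer in small batches, and a short follow-up statement moves each finished batch into the live table.

    -- The live table is locked only for this small, fast insert, not for the
    -- whole 15-20 minute load, so readers see new data arrive in portions.
    BEGIN TRANSACTION;
        INSERT INTO dbo.Table1 (col1, col2)
        SELECT col1, col2 FROM dbo.Table1_Buffer;
        TRUNCATE TABLE dbo.Table1_Buffer;
    COMMIT TRANSACTION;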