How to reliably reach convergence point in BQ? - google-bigquery

As far as I know, Google BigQuery follows an "eventual consistency" architecture, meaning that table creation, schema changes and data import are asynchronous.
I'm building a system that synchronizes a couple of tables with a high update ratio to BQ (on a schedule). The latter is not suited for updates at all, so I'm following the idea of dropping all previous data and re-importing everything (the data volume for these tables is low, so this seems feasible for now).
I'm facing an issue with table re-creation: I delete the existing table, create a new one (providing the schema) and insert the data (via insert_rows, which does a streaming insert, AFAIK).
And I'm experiencing inconsistency, most of the time (not always) ending up with an empty table but the correct schema. From what I understand, this may happen when the streaming insert still knows about the old table, inserts the data into it, and the deletion/re-creation is only synced up later.
So my question is: how do I reliably detect the point when table/schema updates have converged to all nodes/regions/etc.? Or is there any other way to reliably re-import the data with a potential schema change?
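For reference, a minimal sketch of the delete / re-create / stream flow described above, assuming the google-cloud-bigquery Python client; the project, dataset, table and schema names are hypothetical:

```python
# Minimal sketch of the delete / re-create / stream flow described above,
# assuming the google-cloud-bigquery Python client. Project, dataset, table
# and schema names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical

schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("name", "STRING"),
]

# Drop the old table and re-create it with the (possibly changed) schema.
client.delete_table(table_id, not_found_ok=True)
table = client.create_table(bigquery.Table(table_id, schema=schema))

# Streaming insert -- this is the step that sometimes ends up writing into
# the not-yet-deleted old table, leaving the new table empty.
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
errors = client.insert_rows(table, rows)
if errors:
    print("Insert errors:", errors)
```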

Related

How to move records from one DB2 database to another DB2 database?

At regular times we want to clean up (delete) records from our production DB (DB2) and move them to an archive DB (also a DB2 database with the same schema).
To complete the story, there are plenty of foreign key constraints in our DB.
So if record b in table B has a foreign key to record a in table A, and we are deleting record a in the production DB, then record b must also be deleted from the production DB, and both records must be created in the archive DB.
Of course it is very important that no data gets lost: it must not be possible for records to be deleted from the production DB while never being inserted into the archive DB.
What is the best approach to do this?
FYI, I have checked https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dm.doc/doc/r0024482.html and the proposed solutions have the following shortcomings.
Load utility, Ingest utility, Import utility: these only address the part about inserting records into the archive DB. They don't cover the full move.
Export utility: this only covers a means of exporting data (which might then be imported with the Import utility).
db2move, restore command, db2relocatedb, ADMIN_COPY_SCHEMA, ADMIN_MOVE_TABLE and split mirror: these are not an option if you only want to move specific records meeting a certain condition to the archive DB.
So based on my research, the current best solution seems to be a kind of in-house developed script that is:
Exporting the records to move in IXF format,
Importing those exported records into the archive DB,
Deleting those records from the production DB.
To avoid transaction-log-full errors, this script should work in batches (e.g. of 50,000 records).
To avoid foreign key constraint errors in step 3, we must also ensure that in step 1 we export all records having a foreign key constraint to the exported records, all records having a foreign key constraint to those records, and so on.
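A minimal sketch of such a script, assuming the db2 command line processor is available in the environment; the database names (PRODDB, ARCHDB), the table (MYSCHEMA.ORDERS) and the archive predicate are hypothetical, and both the foreign key closure and the batching of the DELETE are left out for brevity:

```python
# Minimal sketch of the export / import / delete script, assuming the db2
# command line processor is available. Database names (PRODDB, ARCHDB), the
# table (MYSCHEMA.ORDERS) and the archive predicate are hypothetical.
import subprocess
import tempfile

PREDICATE = "ORDER_DATE < CURRENT DATE - 1 YEAR"  # hypothetical condition

clp_script = f"""
CONNECT TO PRODDB;
EXPORT TO orders_to_archive.ixf OF IXF
  SELECT * FROM MYSCHEMA.ORDERS WHERE {PREDICATE};
CONNECT RESET;

CONNECT TO ARCHDB;
IMPORT FROM orders_to_archive.ixf OF IXF COMMITCOUNT 50000
  INSERT INTO MYSCHEMA.ORDERS;
CONNECT RESET;

CONNECT TO PRODDB;
-- In a real script this DELETE should also run in batches to avoid filling
-- the transaction log.
DELETE FROM MYSCHEMA.ORDERS WHERE {PREDICATE};
COMMIT;
CONNECT RESET;
"""

with tempfile.NamedTemporaryFile("w", suffix=".clp", delete=False) as f:
    f.write(clp_script)
    script_path = f.name

# -t: statements end with ';', -v: echo statements, -f: read from file
subprocess.run(["db2", "-tvf", script_path], check=True)
```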
Questions that ask the "best" approach have limited use because the assessment criteria are omitted.
Sometimes the assessment criteria differ between technicians and business people.
Sometimes multiple policies of the client company can determine such criteria, so awareness of local policies, procedures or patterns is crucial.
Often the operational, security and licensing requirements will influence the approach, apart from the skill level and experience of the implementation team.
Occasionally corporations have specific standardised tools for archival and deletion, or specific patterns, sometimes influenced by the industry sector or even industry-specific regulatory requirements.
As Stack Overflow is a programming-oriented website, questions like yours can be considered off-topic because you are asking for advice about which design approaches are possible while omitting lots of context specific to your company/industry sector that may well influence the solution pattern.
Some typical requirements or questions that influence the approach are below:
do local security requirements allow the data to leave the Db2 environment (i.e. data stored on disk outside of Db2 tables)? Sometimes this constrains the use of export, or load-from-file/pipe. The data can be at risk of modification, inspection or deletion (whether accidental or deliberate) whilst outside of the RDBMS.
the restartability of the solution in the event of runtime errors. This is often a crucial requirement. When copying data between different physical databases (even if the same RDBMS) there are many possibilities of error (network errors, resource issues, concurrency issues, operational issues etc). Must the solution guarantee that any restarts after failures resume from the point of failure, or must cleanup happen and the entire job be restarted? The answer can determine the design.
if federation exists between the two databases (or if it can be added within the Db2-licence terms), then this is often the easiest practical approach to push or pull content. Local and remote tables appear to be in the same logical database which simplifies the approach. The data never needs to leave the RDBMS. This also simplifies restartability of failed jobs. It also allows the data to remain encrypted if that is a requirement.
if SQL-replication or Q-based-replication is licensed then it can be configured to intelligently sync the source and target tables and respect RI if suitably configured. This approach requires significant configuration skills.
if the production database is highly-available, and/or if the archival database is highly-available then the solution must respect the HA approach. Sometimes this prevents use of LOAD, depending on the operating-system platform of the Db2-server.
timing windows for scheduling are often crucial. If the archival+removal job must be guaranteed to fully complete within specific time windows, this can influence the design pattern.
if fastest rollout is a key requirement then range-partitioning is usually the best option.
There are tools out there for this (such as Optim Archive) which may better satisfy requirements you didn't realize you had.
In the interim - look into federation and the tool asntdiff.
On the archive database you can define a connection to the live database (CREATE SERVER). Using this definition you can define nicknames for the live tables (CREATE NICKNAME). Using these nicknames you can load the appropriate data into your archive table with your favorite data movement utility: load, import, insert, etc.
Once loaded you can verify the tables by using the asntdiff tool with appropriate selection criteria. The -f option is great.
Once you are satisfied the data exists in both locations you can delete the rows in the live database.
For your foreign key relationships - use the view SYSCAT.TABDEP to find such dependencies. You can define your foreign keys as "not enforced" (or don't define them) in the archive database to avoid errors during the previous process.
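A rough sketch of that flow, assuming federation is already enabled and the CREATE SERVER / user mapping setup for the live database (called PRODSRV here) has already been done; all object names and the predicate are hypothetical:

```python
# Rough sketch of the federation-based move, run against the archive database.
# Assumes federation is enabled and the CREATE SERVER / user mapping setup for
# the live database (server name PRODSRV) already exists; all object names and
# the predicate are hypothetical.
import subprocess
import tempfile

sql = """
CONNECT TO ARCHDB;

-- One-time setup: a nickname pointing at the live table.
CREATE NICKNAME ARCH.ORDERS_LIVE FOR PRODSRV.MYSCHEMA.ORDERS;

-- Pull the rows to archive through the nickname into the local archive table.
INSERT INTO ARCH.ORDERS
  SELECT * FROM ARCH.ORDERS_LIVE WHERE ORDER_DATE < CURRENT DATE - 1 YEAR;
COMMIT;

-- After verifying both sides (e.g. with asntdiff), remove the archived rows
-- from the live table through the same nickname.
DELETE FROM ARCH.ORDERS_LIVE WHERE ORDER_DATE < CURRENT DATE - 1 YEAR;
COMMIT;

CONNECT RESET;
"""

with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write(sql)
    script_path = f.name

subprocess.run(["db2", "-tvf", script_path], check=True)
```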
Data archiving is a big and common topic regardless of the database. You may also want to look at range partitioned tables for better performance and control.

Stream data into rotating log tables in BigQuery

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields can have bugs, etc. Things I can think of to be careful about, off the top of my head:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
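For illustration, a hedged sketch of what using it could look like with the google-cloud-bigquery Python client, which exposes the field as template_suffix; the project, dataset, base table and row contents are hypothetical:

```python
# Hedged sketch: stream into per-day tables via the templateSuffix feature,
# assuming the google-cloud-bigquery Python client (template_suffix parameter).
# Project, dataset, base table and row contents are hypothetical.
import datetime
from google.cloud import bigquery

client = bigquery.Client()
base_table = "my-project.logs.events"  # base/template table holding the schema

# Rows sent with this suffix land in e.g. "events_20240115", which BigQuery
# creates on demand from the base table's schema.
suffix = "_" + datetime.date.today().strftime("%Y%m%d")
rows = [{"ts": "2024-01-15T12:00:00Z", "value": 42}]

errors = client.insert_rows_json(base_table, rows, template_suffix=suffix)
if errors:
    print("Streaming errors:", errors)
```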
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables in advance for 1 year, or on some projects 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, the developers. We do a deploy and run a script that takes care of the schema changes for old/existing tables; the script deletes all those empty tables from the future and simply recreates them. We didn't complicate our lives with a cron, as we know the exact moment the schema changes: it's the deploy, and there is no disadvantage to creating tables in advance for such a long period. We do this per tenant, too, on SaaS-based systems, when a user is created or closes their account.
This way we don't need a cron; we just need to know that the deploy has to do this additional step when the schema changes.
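A minimal sketch of that deploy-time step, assuming the google-cloud-bigquery Python client; the dataset name and schema are hypothetical:

```python
# Minimal sketch of the deploy-time step: (re)create dated tables a year ahead
# with the current schema, assuming the google-cloud-bigquery Python client.
# Dataset name and schema are hypothetical.
import datetime
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = "my-project.logs"
schema = [
    bigquery.SchemaField("ts", "TIMESTAMP"),
    bigquery.SchemaField("value", "INTEGER"),
]

start = datetime.date.today()
for offset in range(365):
    day = start + datetime.timedelta(days=offset)
    table_id = f"{dataset_id}.events_{day.strftime('%Y%m%d')}"
    # As described above: drop the (empty) future table and recreate it so it
    # picks up the new schema. Real code should check the table is empty first.
    client.delete_table(table_id, not_found_ok=True)
    client.create_table(bigquery.Table(table_id, schema=schema))
```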
As for not losing streaming inserts while you do maintenance on your tables, you need to address this in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd, to queue all the rows into a tube, and later a worker pushes them to BigQuery. You may already have this to cover the case when the BigQuery API responds with an error and you need to retry. It's easy to do this with a simple message queue. So you would rely on this retry phase when you stop or rename some table for a while. The streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.
If you don't have this retry phase, you should consider adding it, as it not only helps retry failed BigQuery calls, but also gives you a maintenance window.
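A minimal sketch of such a worker-side retry, assuming the google-cloud-bigquery Python client; the queue integration is left out and all names are hypothetical:

```python
# Minimal sketch of the worker-side retry: keep trying the streaming insert so
# that a temporarily missing/renamed table only delays delivery instead of
# losing rows. Assumes the google-cloud-bigquery Python client; the queue
# integration is left out and all names are hypothetical.
import time
from google.cloud import bigquery

client = bigquery.Client()

def push_with_retry(table_id, row, max_attempts=10, delay_seconds=30):
    for attempt in range(max_attempts):
        try:
            errors = client.insert_rows_json(table_id, [row])
            if not errors:
                return True  # delivered; the row can be removed from the queue
        except Exception:
            pass  # table missing (e.g. renamed for ETL) or transient API error
        time.sleep(delay_seconds)
    return False  # leave the row in the queue for a later attempt
```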
You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables are always created.
Very likely App Engine won't go over the free quotas, and it has a 99.95% uptime SLO; the cron will never go down.

Standard protocol on inserting data and querying data at the same time

Every day we have an SSIS package running to import data into a database.
Sometimes people are querying the database at the same time.
The loading (data import) times out because there's a table lock on the specific table.
What is the standard protocol for inserting data and querying data at the same time?
First you need to figure out where those locks are coming from. Use the link below to see if there are any locks.
How to: Determine Which Queries Are Holding Locks
If another process holds a table lock, then there's not much you can do.
Are you sure the error is "not able to OBTAIN a table lock"? If so, look at changing your SSIS package to not use table locks.
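As a hedged sketch, assuming SQL Server and the pyodbc driver (connection details are hypothetical), you could list the current lock requests like this:

```python
# Hedged sketch: list the current lock requests in the database, assuming
# SQL Server and the pyodbc driver. Connection details are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("""
    SELECT request_session_id,
           resource_type,
           request_mode,
           request_status
    FROM sys.dm_tran_locks
    WHERE resource_database_id = DB_ID()
""")
for row in cursor.fetchall():
    print(row)
```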
There are several strategies.
One approach is to design your ETL pipeline so as to minimize lock time. All the data is prepared in staging tables and then, when complete, is switched in using fast partition switch operations, see Transferring Data Efficiently by Using Partition Switching. This way the ETL blocks reads only for a very short duration. It also has the advantage that the reads see all the ETL data at once, not intermediate stages. The drawback is a difficult implementation.
Another approach is to enable snapshot isolation and/or read committed snapshot in the database, see Row Versioning-based Isolation Levels in the Database Engine. This way reads no longer block behind the locks held by the ETL. The drawback is resource consumption: the hardware must be able to drive the additional load of row versioning.
Yet another approach is to move the data querying to a read-only standby server, e.g. using log shipping.
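A hedged sketch of the row-versioning option, assuming SQL Server and pyodbc; the database name is hypothetical and the statements need sufficient privileges:

```python
# Hedged sketch of the row-versioning approach: enable snapshot isolation and
# read committed snapshot so readers no longer block behind the ETL's locks.
# Assumes SQL Server and pyodbc; the database name is hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,  # ALTER DATABASE cannot run inside a user transaction
)
cursor = conn.cursor()
cursor.execute("ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON")
cursor.execute(
    "ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE"
)
```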

Best approach to follow in SSIS package

I am working on a simple transform SSIS package to import data from one server and load it to another server. Only one table is used on each.
I wanted to know: as it's just a refresh of the data, does the old data in the table need to be deleted before loading? I need expert advice about what I should do. Should I truncate the old table or use delete? What other concerns should I keep in mind?
Please give the justification for your answers; it will help me make the technical case to my lead.
It depends on what the requirements are.
Do you need to keep track of any changes to the data? If so, truncating the data each time will not allow you to track the history of your data. A good option in this case is to stage the source data in a separate table/database and load your required data into another structure (with possible history tracking, e.g. a fact table with slowly changing dimensions).
Truncating is the best option to remove data, as it's a minimally logged operation.
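A hedged sketch of such a truncate-and-reload refresh through a staging table, assuming SQL Server and pyodbc; the table and connection names are hypothetical:

```python
# Hedged sketch of a truncate-and-reload refresh through a staging table,
# assuming SQL Server and pyodbc; table and connection names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=target;"
    "DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# The SSIS data flow would first land the fresh extract in staging.Customers.
# Truncate (minimally logged) and reload the target in one transaction, so a
# failed load rolls back to the previous data instead of leaving it half-empty.
cursor.execute("TRUNCATE TABLE dbo.Customers")
cursor.execute("INSERT INTO dbo.Customers SELECT * FROM staging.Customers")
conn.commit()
```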