I have a few analytics tables that get refreshed every few days. By refresh I mean there could be some new records, some records that need to be deleted, and some records that need to be updated, and there is no specific identifier.
So I have the following options in mind:
For every refresh, truncate the whole table and reload the data. But if any failure occurs during the fresh data load, the table data will be corrupted and all analytics will show wrong data.
Another option is to keep a refresh id in every analytics table and, while reading from an analytics table, use the latest refresh id. But with this approach the main issue is joining and filtering. We join across analytics tables, so each and every join would also have to join on the refresh id, otherwise the fetched data will be wrong, and this approach is error-prone.
Can we create a view on these tables that has a dynamic filter? While querying these views I would use the latest refresh id as a filter.
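Something like the sketch below is what I have in mind (table and column names are made up); the view always resolves to the rows from the latest refresh, so readers never have to remember the filter:

-- Hypothetical names: sales_analytics with a refresh_id column.
CREATE OR REPLACE VIEW v_sales_analytics AS
SELECT s.*
FROM   sales_analytics s
WHERE  s.refresh_id = (SELECT MAX(refresh_id) FROM sales_analytics);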
Is there a better approach to refreshing data in analytics tables, keeping in mind that it should handle any error scenario and not be error-prone?
Or, the option that I often use:
Create a new version of the table in an alternative location.
Validate the results.
Swap the live table for the new version.
The "swap" might involve renaming tables or truncating and loading the original table. Often, the original contents are saved somewhere else.
This approach is particularly handy when the logic for creating the entire table is complicated to express as incremental changes. It also minimizes the downtime during which the table is not available.
You want incremental changes when you need more up-to-date data and batches don't work, whether because of timing, size, or cost. Many databases support materialized views or replication, which simplify this process.
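For the rename-based swap described above, a minimal sketch might look like this (hypothetical table names and rebuild query; the rename syntax varies by database, e.g. SQL Server uses sp_rename):

-- Build the new version off to the side.
CREATE TABLE analytics_sales_new AS
SELECT order_id, customer_id, SUM(amount) AS total_amount
FROM   staging_orders
GROUP  BY order_id, customer_id;

-- Validate here: row counts, checksums, a few spot queries against the new table.

-- Swap, keeping the old contents around in case something turns out to be wrong.
ALTER TABLE analytics_sales     RENAME TO analytics_sales_old;
ALTER TABLE analytics_sales_new RENAME TO analytics_sales;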
Our Data Warehouse team is evaluating BigQuery as a columnar data warehouse solution and has some questions regarding its features and best use. Our existing ETL pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us, on occasion, to replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real-time streaming insert API with a unique key as the insertId. This gives us upsert functionality over a short window, but re-streaming the data at later times results in duplication. As a result, we need an elegant option for removing dupes in/near real time to avoid data discrepancies.
We have a couple of questions and would appreciate answers to any of them. Any additional advice on using BigQuery in an ETL architecture is also appreciated.
1. Is there a common implementation for de-duplication of real-time streaming beyond the use of the tableId?
2. If we attempt a delsert (a delete followed by an insert using the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
3. Is it possible to implement real-time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column-store ETL technologies, but we have seen no documentation suggesting its use in BigQuery.
We let duplication happen, and write our logic and queries in such a way that every entity is streamed data. E.g. a user profile is streamed data, so there are many rows placed over time, and when we need the latest data, we pick the most recent row.
Delsert is not suitable in my opinion, as you are limited to 96 DML statements per day per table. This means you would need to temporarily store batches in a table and later issue a single DML statement that deals with a whole batch of rows and updates the live table from the temp table.
If you are considering delsert, maybe it's easier to write a query that only reads the most recent row.
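A minimal sketch of that kind of query in BigQuery standard SQL (project, dataset, table, and column names are made up), keeping only the newest streamed row per entity:

-- Latest row per user wins; older duplicates are simply never selected.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ingested_at DESC) AS rn
  FROM `my_project.analytics.user_profiles`
)
WHERE rn = 1;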
Streaming followed by a scheduled merge is possible. You can actually rewrite data in the same table, e.g. to remove dups, or run a scheduled query that batches content from a temp table and writes it to the live table. This is essentially the same as letting duplication happen and dealing with it later within a query; it is also called re-materialization if you write back to the same table.
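A rough sketch of the scheduled staging-to-live step (again with made-up names), folding de-duplicated staging rows into the live table in a single statement:

MERGE `my_project.analytics.user_profiles` AS live
USING (
  -- De-duplicate the staging table first: newest row per user wins.
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ingested_at DESC) AS rn
    FROM `my_project.analytics.user_profiles_staging`
  )
  WHERE rn = 1
) AS staged
ON live.user_id = staged.user_id
WHEN MATCHED THEN
  UPDATE SET email = staged.email, ingested_at = staged.ingested_at
WHEN NOT MATCHED THEN
  INSERT (user_id, email, ingested_at)
  VALUES (staged.user_id, staged.email, staged.ingested_at);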
I want to create a log table to keep track of users and their actions on a website. For example, when a user logs in, a record is created in the log table; when a user creates information, a record is created in the log table. Similarly, a record is created in the log table for every action. In this way, the log table will grow very fast. What is a better way to maintain such big tables, apart from creating triggers and scheduling scripts to clean the data frequently?
From my experience, excessive logging typically doesn't gain you much. A lot of people lose the usefulness of logging in the sheer volume of it... just a little warning beforehand.
As for maintaining a table that size, I recommend partitioning the table and writing a specific set of stored procedures that make effective use of a few indexes you place on the table. Any ad-hoc work on the table should be kept to a minimum, and when it is done, make sure the ad-hoc queries hit the indexes you set up on the table. Also, WITH (NOLOCK) will be your friend for SELECT statements if there is a large amount of inserting going on.
This is the basic idea I follow for the transaction tables I handle, and they typically get around 1-2 million rows a day.
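A rough T-SQL sketch of that setup (table, column, and boundary values are made up; the partition boundaries need to be extended ahead of time as part of maintenance):

-- Monthly partitioning on the log timestamp.
CREATE PARTITION FUNCTION pf_log_month (datetime2(0))
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME ps_log_month
    AS PARTITION pf_log_month ALL TO ([PRIMARY]);

CREATE TABLE dbo.UserActionLog (
    LogId      bigint IDENTITY(1,1) NOT NULL,
    UserId     int           NOT NULL,
    ActionType varchar(50)   NOT NULL,
    CreatedAt  datetime2(0)  NOT NULL
) ON ps_log_month (CreatedAt);

-- A narrow, partition-aligned index for the common lookup pattern.
CREATE INDEX IX_UserActionLog_UserId_CreatedAt
    ON dbo.UserActionLog (UserId, CreatedAt)
    ON ps_log_month (CreatedAt);

-- Example read on a hot insert table; NOLOCK means dirty reads are possible.
SELECT TOP (100) LogId, UserId, ActionType, CreatedAt
FROM dbo.UserActionLog WITH (NOLOCK)
WHERE UserId = 42
ORDER BY CreatedAt DESC;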
I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
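For what it's worth, the kind of decorator query I have in mind looks like this in legacy SQL (table name made up); the relative range decorator limits the scan to roughly the last 7 days of streamed data:

-- Range decorator with relative millisecond offsets: from 7 days ago until now.
SELECT metric, ts, value
FROM [my_project:timeseries.events@-604800000-]
WHERE metric = 'cpu_load';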
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field, especially in a production setting. Experimental fields could well have bugs, etc. Things I can think of to be careful about, off the top of my head:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure there are other things. Anyway, I just thought I'd point that out. I'm sure the official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables 1 year in advance, or on some projects 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, the developers. We do a deploy and run a script that takes care of the schema changes for old/existing tables; the script deletes all the still-empty future tables and simply recreates them. We didn't complicate our lives with a cron, as we know the exact moment the schema changes: the deploy. And there is no disadvantage to creating tables in advance for such a long period. On SaaS-based systems we also do this per tenant, when a user is created or closes their account.
This way we don't need a cron; we just need to know that the deploy has to perform this additional step whenever the schema changes.
As for not losing streaming inserts while you do maintenance on your tables, you need to address that in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd, to queue all the rows into a tube, with a worker that later pushes them to BigQuery. You may already have this to cover the case where the BigQuery API responds with an error and you need to retry; it's easy to do with a simple message queue. You would then rely on this retry phase when you stop or rename a table for a while. The streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.
If you don't have this retry phase, you should consider adding it, as it not only handles retrying failed BigQuery calls but also gives you a maintenance window.
You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables are always created.
Very likely App Engine won't go over the free quotas, and it has a 99.95% uptime SLO; the cron will never go down.
I am using MVIEWs with fast refresh to replicate some tables across a network. Everything works great; however, I ran into an issue when considering my delete/purge process.
The source tables for the MVIEWs, which feed the MVIEW log tables, have a data retention of 7 days, i.e. I will be running a nightly purge process to delete data older than 7 days from the current date.
The target MVIEWs, however, are on an ODS and have a data retention policy of 30 days. Also, these MVIEWs are NOT currently populating another schema or set of tables.
The problem is, when I delete from the source tables, those delete statements will propagate through to the target MVIEWs, and then I no longer have 30 days' worth of data - only 7.
Is there a way to exclude DELETEs from being logged for the MVIEW log tables? I noticed that in MLOG$_<table_name> there is a column DMLTYPE$$. Could I somehow delete from the log table all records where DMLTYPE$$ = 'D'?
Thanks everyone, and yes, I did try researching this online first.
Regards,
Steve
I suppose that you could manually delete data from the materialized view logs before running the refresh. That would probably work, but it would not be a solution I'd be really comfortable with. It would be a very bespoke solution that would probably not be officially supported. And if there might ever be another materialized view that depends on the materialized view log, you'd have to ensure that you're only deleting the rows that relate to your materialized view's subscription. Plus, the materialized view on the destination would need to be updatable in order for you to manually remove the rows older than 30 days via a separate process.
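To make that concrete, the unsupported approach from the question would look roughly like this (table, owner, and materialized view names are made up; this assumes the log has only one subscribing materialized view):

-- Bespoke, unsupported: throw away the DELETE change records before the fast refresh.
DELETE FROM mlog$_orders
 WHERE dmltype$$ = 'D';
COMMIT;

-- Then fast-refresh the dependent materialized view as usual.
BEGIN
  DBMS_MVIEW.REFRESH(list => 'ODS.ORDERS_MV', method => 'F');
END;
/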
If these are the business requirements, something like Oracle Streams (or GoldenGate) would be a much more appropriate architectural solution. Those products are designed to give you more flexibility about which logical change records (LCRs) you apply. In Streams, for example, it is easy enough to create a custom apply handler that discards delete LCRs. And since you're applying LCRs to a table on the destination rather than a materialized view, your 30 day purge process is much easier to manage. This would be a relatively common Streams setup rather than a very unique materialized view setup.
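For illustration, a minimal sketch of a Streams apply-side DML handler that discards delete LCRs (schema, table, and procedure names are made up, and the surrounding capture/propagation/apply configuration is omitted):

-- A do-nothing handler: the DELETE LCR is consumed but never executed,
-- so deletes on the source never reach the ODS table.
CREATE OR REPLACE PROCEDURE strm_admin.discard_delete_lcr (
  lcr_anydata IN SYS.ANYDATA
) AS
BEGIN
  NULL;  -- swallow the delete LCR instead of applying it
END;
/

-- Register the handler for DELETE operations on the replicated table.
BEGIN
  DBMS_APPLY_ADM.SET_DML_HANDLER(
    object_name    => 'ODS.ORDERS',
    object_type    => 'TABLE',
    operation_name => 'DELETE',
    error_handler  => FALSE,
    user_procedure => 'STRM_ADMIN.DISCARD_DELETE_LCR');
END;
/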