De-duplicating BigQuery in an Asynchronous Real Time ETL Pipeline

Our Data Warehouse team is evaluating BigQuery as a column-store data warehouse solution and has some questions about its features and best use. Our existing ETL pipeline consumes events asynchronously through a queue and persists them idempotently into our existing database technology. The idempotent architecture allows us to occasionally replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real-time streaming insert API with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in or near real time to avoid data discrepancies.
We have a couple of questions and would appreciate answers to any of them. Any additional advice on using BigQuery in an ETL architecture is also appreciated.
1. Is there a common implementation for de-duplication of real time streaming beyond the use of the insertId?
2. If we attempt a delsert (a delete followed by an insert using the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
3. Is it possible to implement real time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column store ETL technologies, but we have seen no documentation suggesting its use in BigQuery.

We let duplication happen, and write our logic and queries in such a way that every entity is treated as streamed data. For example, a user profile is streamed data, so there are many rows for it over time, and when we need the latest state we use the most recent row.
Delsert is not suitable in my opinion, as you are limited to 96 DML statements per day per table (the quota at the time of writing). This means you would need to stage batches of rows in a temporary table, and later issue a single DML statement that handles a whole batch and updates the live table from the temporary table.
If you are considering delsert, it may be easier to write a query that reads only the most recent row.
Streaming followed by a scheduled merge is possible. You can rewrite data in the same table, e.g. to remove dups, or have a scheduled query batch content from the temporary table and write it to the live table. This is much the same as letting duplicates happen and dealing with them later in a query; when you write back to the same table it is also called re-materialization.
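A minimal sketch of the "read only the most recent row" idea, in Python. The column names `id` and `updated_at` and the table name `my_dataset.events` are assumptions made up for illustration; they do not come from the original setup. The function shows the logic on in-memory rows, and the query string expresses the same thing with a standard `ROW_NUMBER()` window function.

```python
# Hypothetical table and column names; adjust them to your own schema.
LATEST_ROW_SQL = """
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM `my_dataset.events`
)
WHERE rn = 1
"""


def latest_rows(rows, key="id", ts="updated_at"):
    """Keep one row per key: the row with the greatest timestamp."""
    latest = {}
    for row in rows:
        seen = latest.get(row[key])
        # A replayed duplicate has the same timestamp, so it never replaces
        # the row that is already stored.
        if seen is None or row[ts] > seen[ts]:
            latest[row[key]] = row
    return list(latest.values())
```

Because duplicates are resolved at read time, replaying hours or days of events only adds rows that the query ignores.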

Related

Best practice for moving data from GCP Datastore to BigQuery incrementally

We are architecting our data warehousing solution from Datastore data sources. We would like to load the entities that were inserted, updated, or deleted within a pre-defined time interval into BigQuery for analytics.
There seem to be several options:
1. Back up the whole kind, load the data into BigQuery, and dedup in BigQuery. This is very simple conceptually, but loading and processing all the data every time seems very inefficient to me.
2. Publish all new/updated/deleted entities when the operations are performed, and have Dataflow subscribe and transform/load (TL) them into BigQuery.
3. Have a last-modified timestamp on the entity, and pull out only those entities that were modified in the specified timeframe. We would like to take this option, but deleted records seem to be a problem. Do we have to implement a soft delete?
Any recommendations on the best practice?

There is another option that we have implemented :)
You do a BQ streaming insert of all operations (preferably into insert-time-based partitions), and after that, if needed, you regularly produce consolidated tables that hold a single instance of each record, so that updates and deletes are taken into account properly.
What I found interesting is that the table with all the raw streamed, non-consolidated data can give quite a few interesting insights, such as update/delete patterns, which disappear when you consolidate.
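The consolidation step described above can be sketched roughly as follows, in Python. The record shape `(timestamp, op, key, payload)` and the operation names `insert`, `update`, and `delete` are assumptions for illustration; a real pipeline would more likely do this in a scheduled query over the streamed table.

```python
def consolidate(operations):
    """Replay a log of (timestamp, op, key, payload) records into current state."""
    state = {}
    # Apply operations in timestamp order, regardless of arrival order.
    for ts, op, key, payload in sorted(operations, key=lambda o: o[0]):
        if op == "delete":
            state.pop(key, None)
        else:
            # "insert" and "update" both overwrite the stored record.
            state[key] = payload
    return state
```

The raw operation log is left untouched, so the update/delete patterns remain available for analysis.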

Your #1 is quite wasteful and inefficient. You have to export all the data, not just the changed delta you care about. The backup + load process creates intermediate files in GCS and is somewhat slow, and the loading also comes with limitations.
Option #2 is doable, but it needs more infrastructure and has more points of failure.
Option #3 is best, I think. As you already mentioned, a soft delete would help: you don't need to actually remove the data, just add an active/inactive flag or a deleted_at timestamp. An updated_at or modified_at timestamp is also necessary to make the ETL incremental.
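A small sketch of how a soft delete keeps option #3 incremental, in Python. Entities are modelled as plain dicts, and the helper names `soft_delete` and `modified_between` are hypothetical; only the `deleted_at` and `updated_at` fields come from the suggestion above.

```python
from datetime import datetime


def soft_delete(entity, now):
    """Mark an entity as deleted instead of removing it."""
    entity["deleted_at"] = now
    entity["updated_at"] = now  # so the incremental pull also sees the delete
    return entity


def modified_between(entities, start, end):
    """Entities whose updated_at falls in the half-open window [start, end)."""
    return [e for e in entities if start <= e["updated_at"] < end]


# Example: a delete on 2 May shows up in the pull for 2 May.
user = {"id": 1, "updated_at": datetime(2016, 5, 1, 9, 0), "deleted_at": None}
soft_delete(user, datetime(2016, 5, 2, 14, 30))
changed = modified_between([user], datetime(2016, 5, 2), datetime(2016, 5, 3))
```

Downstream queries can then filter on `deleted_at IS NULL` when they need only live records.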

Google CloudSQL or BigQuery for Big Data Actively Update Every Second

I'm currently using Google CloudSQL for my needs.
I'm collecting data from user activities. Every day my table grows by around 9-15 million rows, and it is updated every second. The data includes several main parameters such as user location (latitude and longitude), timestamp, user activities, conversations, and more.
I need to constantly extract a lot of insight from these user activities, such as "how many users between latitude-longitude A and latitude-longitude B used my app per hour over the last 30 days?".
Because my table grows every day, it's hard to maintain the performance of select queries on it. (I have already indexed the table, especially on the most commonly used parameters.)
All my inserts, selects, updates, and so on are executed from an API that I wrote in PHP.
So my question is: would I benefit much more from using Google BigQuery for my needs?
If yes, how can I do this? Isn't Google BigQuery (forgive me if I'm wrong) designed to be used for static data, not constantly updated data? How can I connect my CloudSQL data to BigQuery in real time?
Which is better: optimizing my table in CloudSQL to speed up selects, or using BigQuery (if possible)?
I'm also open to other alternatives or suggestions for optimizing my CloudSQL performance :)
Thank you

Sounds like BigQuery would be far better suited to your use case. I can think of a good solution:
Migrate existing data from CloudSQL to BigQuery.
Stream events directly to BigQuery (using an async queue).
Use a time-partitioned table in BigQuery.
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.

Reporting tables in SQL

Our organization has a reporting application that queries a real-time transaction table to pull data for reports. As the query runs against a transaction table that is continuously updated, report performance is dismal. We are trying to come up with a new DB design to improve performance.
My idea is to have a separate table for each of three years (e.g. reports_2014, reports_2015, reports_2016), as we need to report only on the last three years of data. Each table will be created at the end of the year from the real-time DB. The current-year table (reports_2016) on the reporting DB will be updated at midnight with the previous day's new records. My reporting query will use a view that is a UNION ALL of these three tables plus the data from the real-time table for records from midnight up to the current point in time.
Initially, I felt this was a good design, provided I have good indexes on these history tables.
However, there is a catch arising from the inherent application design that updates these real-time tables.
When I cancel a transaction, the status column of the transaction record changes to cancelled, and a new transaction cancellation record is inserted.
I could capture this correctly with an AFTER INSERT trigger that records the updates made.
The issue is that when a cancel record is posted while my ETL that copies the previous day's data to the history table is running, I miss the update.
How do I capture this? Is there a way to delay the trigger until my ETL is complete? Or is there a better approach to this problem?
My apologies if this is not the right place to post this question.
Thanks,
Roopesh

Multiple parallel tables with the same structure is almost never a good idea for a database design. Databases offer two important methods for handling performance:
Indexes
Partitioning
as well as other methods, such as rewriting queries, spatial indexes, full text indexes, and so on.
In your case, instead of multiple tables, consider table partitions.
As for your process, you should be using the creation/modification date of records. I would envision a job running early in the morning, say at 1:00 a.m., and this job would gather the previous day's records. Any changes after midnight simply do not apply. They will be included the following day.
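The daily job described above could be sketched like this in Python. The field name `modified_at` and both helper names are assumptions for illustration. The point is that the job selects on a fixed half-open window, so a cancellation posted after midnight is not lost; it simply lands in the next day's run.

```python
from datetime import datetime, time, timedelta


def previous_day_window(run_at):
    """Return [start, end) covering the calendar day before run_at."""
    end = datetime.combine(run_at.date(), time.min)  # midnight today
    start = end - timedelta(days=1)                  # midnight yesterday
    return start, end


def records_for_run(records, run_at, ts_field="modified_at"):
    """Records modified during the day before the job's run time."""
    start, end = previous_day_window(run_at)
    return [r for r in records if start <= r[ts_field] < end]
```

A job started at 1:00 a.m. on 2 May picks up everything modified on 1 May, and a record modified at 00:30 on 2 May waits for the run on 3 May.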
If the reporting needs to be real-time as well, then you should consider building the reporting into the application itself. Some methods are:
Following the same approach as above, but doing the reporting runs more frequently (say once per hour rather than once per day).
Modifying the existing triggers to handle updates to reporting tables as well as the base tables.
Wrapping all DML transactions in stored procedures that handle both the transactional tables and the reporting tables.
Re-architecting the system to use queues with multiple readers to handle the disparate processing needs.

Thank you, Gordon, for your input. At this point ours is a real-time reporting system. The reporting database is a mirrored instance of the production transactional database. Whenever a new transaction is entered into the production database, the same record instantly flows to the reporting database, which has exactly the same schema. We do have indexes on the columns that are queried frequently; however, as there are many inserts every hour, index performance degrades quite fast. We rebuild the indexes once every two weeks, and it takes around 8 hours. That is why I thought having indexes on this huge transaction table with many inserts every hour may not be a good idea. Please correct me if I am wrong.
I am actually reading up on partitioning to see if it is a viable option for me. I discussed it with our DBA and got the following comment from him: 'The reporting database is a mirrored instance of the real-time production database. You have to implement partitioning on the production transactional database. Partitioning only on the mirrored instance would not work, as your actual source DB is not partitioned.' I am not sure how far this is true. Do you know if there is such a dependency between partitioning and mirroring?

Stream data into rotating log tables in BigQuery

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.

If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.

Most of us are doing the same thing you described.
But we don't use a cron, as we create tables in advance for 1 year, or on some projects for 5 years. You may wonder why we do so, and when.
We do this when we, the developers, change the schema. We do a deploy and run a script that takes care of the schema changes for old/existing tables; the script also deletes all the empty future tables and simply recreates them. We didn't complicate our lives with a cron, as we know the exact moment the schema changes (the deploy), and there is no disadvantage to creating tables in advance for such a long period. On SaaS-based systems we also do this per tenant, when a user is created or closes their account.
This way we don't need a cron; we just need to know that the deploy has to do this additional step when the schema has changed.
As for not losing streaming inserts while you do maintenance on your tables, you need to address that in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd, to queue all the rows into a tube, with a worker that later pushes them to BigQuery. You may already have this to cover the case where the BigQuery API responds with an error and you need to retry. It's easy to do with a simple message queue. You would then rely on this retry phase when you stop or rename a table for a while: the streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not only helps with retrying failed BigQuery calls but also gives you a maintenance window.
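A rough sketch of such a retry phase, in Python, with an in-memory `deque` standing in for a real message queue such as Beanstalkd. The `send` callable is a placeholder assumption; in practice it would wrap the streaming insert call and raise when the call fails.

```python
from collections import deque


def process_once(queue, send):
    """Make one pass over the queue; rows whose send fails are re-queued."""
    sent = 0
    # Visit each row that is queued right now exactly once.
    for _ in range(len(queue)):
        row = queue.popleft()
        try:
            send(row)
            sent += 1
        except Exception:
            queue.append(row)  # keep the row for the next pass
    return sent
```

While a table is renamed or otherwise unavailable, every pass sends nothing and the rows stay queued; once the table is back, the next pass delivers them. A production worker would also want a delay between passes and a cap on attempts.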

You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables always exist.
Very likely App Engine won't go over the free quotas, and it has a 99.95% SLO for uptime, so the cron should practically never go down.
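The check such a cron would run can be sketched as below, in Python. The `events` prefix, the `_YYYYMMDD` suffix convention, and the 90-day retention are assumptions for illustration; the functions only decide which table names to create or drop, and the actual BigQuery calls are left out.

```python
from datetime import date, timedelta


def table_name(prefix, day):
    """Daily table name such as events_20160501."""
    return f"{prefix}_{day:%Y%m%d}"


def tables_to_create(existing, today, prefix="events", days_ahead=1):
    """Names for today through today + days_ahead that do not exist yet."""
    wanted = [table_name(prefix, today + timedelta(days=n))
              for n in range(days_ahead + 1)]
    return [name for name in wanted if name not in existing]


def tables_to_drop(existing, today, prefix="events", keep_days=90):
    """Daily tables older than the retention cutoff."""
    cutoff = table_name(prefix, today - timedelta(days=keep_days))
    # YYYYMMDD suffixes sort lexicographically in date order.
    return sorted(n for n in existing
                  if n.startswith(prefix + "_") and n < cutoff)
```

Running the check hourly rather than daily means a single misfire is corrected by the next run, well before tomorrow's table is needed.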
