How to manage Schema Drift while streaming to BigQuery sharded table - google-bigquery

We are new to BigQuery and are trying to figure out the best way to use it for real-time analytics. We are sending a stream of logs from our back-end services to Kafka, and we want to stream those into BigQuery using streaming inserts. For queryability we are both partitioning by time and sharding tables by event type (for use with wildcard queries). We put a view on top of the family of tables so that they look like one table, and use the _TABLE_SUFFIX (well, once they roll out the feature; for now we use UNION ALL) and _PARTITIONTIME columns to reduce the set of rows scanned by queries. So far so good.
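To make the setup concrete, the view body looks roughly like the sketch below; the table and column names are simplified and hypothetical, and the real view has one branch per event type:

    -- Union the per-event-type shards into one logical table, exposing
    -- _PARTITIONTIME so queries can still prune partitions.
    SELECT 'login' AS event_type, _PARTITIONTIME AS pt, payload
    FROM mydataset.events_login
    UNION ALL
    SELECT 'click' AS event_type, _PARTITIONTIME AS pt, payload
    FROM mydataset.events_click;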
What we are unsure how to handle properly is schema changes. The schema of our log messages changes frequently, and a manual process to keep BigQuery in sync is not tenable. Ideally our streaming pipeline would detect the change and apply the schema update (for adding columns) or table creation (for adding an event type) as necessary. We have tooling upstream, so we know all schema updates will be backwards compatible.
My understanding is that all of the shards must have the same schema. How do we apply the schema update in such a fashion that:
We don't break queries that are run during the update.
We don't break streaming inserts.
Is #1 possible? I don't believe we can atomically change the schema of all the sharded tables.
For #2 I presume we have to stop our streaming pipelines while the schema update process is occurring.
Thanks,
--Ben

Wildcard tables with _TABLE_SUFFIX are available (https://cloud.google.com/bigquery/docs/querying-wildcard-tables), and you can use them even if the schemas of the tables are different; they just need to have compatible schemas. With UNION ALL, all of the tables need to have the same schema, so it will not work if you're updating schemas at the same time.
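As a sketch, a standard SQL query over the sharded, partitioned tables might look like this (table, suffix, and column names are hypothetical):

    SELECT payload
    FROM `mydataset.events_*`
    WHERE _TABLE_SUFFIX IN ('login', 'click')        -- prune by event-type shard
      AND _PARTITIONTIME >= TIMESTAMP('2017-01-01'); -- prune by partition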
Streaming inserts will also work if you only specify a subset of fields. However, you cannot add new fields as part of a streaming insert; you'll have to update the table first and then insert the data with the new schema.
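If you prefer DDL, adding the column before the new data starts flowing can look roughly like this (BigQuery standard SQL DDL; the tables.patch API or the bq update command achieves the same thing). The table and column names are hypothetical:

    ALTER TABLE mydataset.events_login
      ADD COLUMN IF NOT EXISTS new_field STRING;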

Related

Auto delete temp tables in BigQuery

In BigQuery, a lot of temp_table_* tables are being created, and they are cluttering the datasets. Is there any way to delete these temp tables automatically?
On any dataset you can set a default expiration time for any new table you create in it. Then everything will get deleted on schedule.
The same goes for tables: even at creation time you can set an automatic deletion date.
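If you prefer DDL over the console, both settings can be expressed roughly like this in current BigQuery standard SQL (dataset and table names are hypothetical):

    -- Default expiration for new tables in the dataset (here: 1 day).
    ALTER SCHEMA mydataset SET OPTIONS (default_table_expiration_days = 1);

    -- Expiration for a specific table, set at creation time.
    CREATE TABLE mydataset.temp_results (id INT64, payload STRING)
    OPTIONS (expiration_timestamp = TIMESTAMP '2030-01-01 00:00:00 UTC');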
Now, who is creating these temp tables you see? I have no idea, I've never seen them.
I'm 99.9% sure these tables are not created by any process ran by Google. No one at Google would format dates in that way.
The temp tables are created by one of the third-party consumers; they are using the Simba JDBC connector, which internally uses the temp_table_ prefix.
Currently there is no option to set an expiration date for these tables; once they introduce the option we will leverage it.
Reference:
https://www.simba.com/products/BigQuery/doc/JDBC_InstallGuide/content/jdbc/bq/options/largeresulttable.htm

Hive Insert-Only Transactional Tables

What are the specific benefits of using a Hive Insert-Only Transactional Table? Most of the documentation just indicates that if you don't need Delete or Alter functionality, then create this table. Does this speed up processing? Reduce Overhead?
Currently, full ACID tables are only supported in the ORC file format. Micromanaged, a.k.a. INSERT-only, transactional tables support any other storage format.
So, if you have all your tables stored in ORC format, you can go ahead with full ACID. If you have other storage types and you need to be able to do INSERT statements, micromanaged tables can help you there.
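As a rough illustration (table names are hypothetical), the two flavours are declared via table properties:

    -- Full ACID: ORC only, supports UPDATE/DELETE/MERGE.
    CREATE TABLE events_full_acid (id BIGINT, payload STRING)
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Insert-only (micromanaged): any storage format, INSERT only.
    CREATE TABLE events_insert_only (id BIGINT, payload STRING)
    STORED AS PARQUET
    TBLPROPERTIES ('transactional' = 'true',
                   'transactional_properties' = 'insert_only');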
Also: for full ACID tables, compaction is done by a MapReduce job. You can configure Hive to use the query based compactor for major compactions (as in creating a new base), but minor compactions (as in merging delta files) are still done with MR, and MR only.
For micromanaged tables, the compaction is query based. So, if you are using Hive on Tez, or Hive on Spark, and you do not want to have MR at all, that is fine. But for full ACID tables, if you want minor compactions, you'll need MapReduce.
Insider note: query based minor compaction for full ACID tables will be supported really soon, and I am pretty sure Parquet is going to support ACID tables very soon.
What you read everywhere is that unlike full transactional tables, "insert-only transactional tables" support data insert operations only.
But that doesn't say much. What one wants to know is:
What is a transactional insert operation?
To say that an operation is a transaction basically means that it follows the ACID principles, and especially their most important property: atomicity (the A in ACID).
In his great book Designing Data-Intensive Applications, Martin Kleppmann explains the atomicity property well:
"if a transaction was aborted, the application can be sure that it didn't change anything, so it can safely be retried. The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity."
Designing Data-Intensive Applications, March 2017, 1st edition, chapter 7, p. 234
In Hive, this is done by creating a delta directory for each insert transaction, which keeps the new data isolated until the transaction is completed. If there is an error, the directory is deleted; otherwise it is kept and its contents become part of the table.
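As a minimal illustration, reusing the hypothetical insert-only table from the sketch above (the directory names are illustrative, not exact):

    -- Each committed INSERT writes its rows into its own delta directory under
    -- the table location, e.g. something like:
    --   .../events_insert_only/delta_0000001_0000001_0000/000000_0
    -- An aborted transaction's delta directory is simply removed.
    INSERT INTO events_insert_only VALUES (1, 'first batch');
    INSERT INTO events_insert_only VALUES (2, 'second batch');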

SSIS : Huge Data Transfer from Source (SQL Server) to Destination (SQL Server)

Requirements:
Transfer millions of records from source (SQL Server) to destination (SQL Server).
Structure of source tables is different from destination tables.
Refresh data once per week in destination server.
Minimum amount of time for the processing.
I am looking for optimized approach using SSIS.
I was thinking of these options:
Create a SQL dump from the source server and import that dump into the destination server.
Directly copy the tables from the source server to the destination server.
Lots of issues to consider here, such as whether the servers are in the same domain, on the same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records, but in smaller amounts. An SSIS package can handle that chunking logic for you, but you can always build it yourself as well, and iterating over the changes is easier in smaller batches. Sometimes this is a reason to push changes more often rather than wait an entire week, as smaller syncs are easier to manage with less downtime.
Another consideration is to be sure you understand your deltas and to ensure that you have ALL of the changes. For this reason I would generally suggest using a staging table on the destination server. By moving changes to staging and then loading them into the final table, you can more easily ensure that changes are applied correctly. Think of the scenario of an increment being out of order (identity insert), datetimes ordered incorrectly, or one chunk failing. When using a staging table you don't have to rely solely on the id/date, and you can actually do joins on primary keys to look for changes.
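A minimal T-SQL sketch of that staging pattern, with hypothetical table and column names:

    -- Load the chunk into staging first (typically via an SSIS data flow),
    -- then apply it to the final table by joining on the primary key.
    MERGE dbo.Orders AS tgt
    USING staging.Orders AS src
        ON tgt.OrderId = src.OrderId
    WHEN MATCHED THEN
        UPDATE SET tgt.Status = src.Status,
                   tgt.ModifiedDate = src.ModifiedDate
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (OrderId, Status, ModifiedDate)
        VALUES (src.OrderId, src.Status, src.ModifiedDate);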
Linked servers, as proposed by Alex K., can be a great fit, but you will need to pay close attention to a couple of things. Always do it from the destination server so that it is a PULL, not a push. Linked servers are fast at querying data but horrible at updating/inserting in bulk. Not even one XML column can be in the table. You may also need to set some specific properties for distributed transactions.
I have done this task both ways, and I would say that SSIS does give a bit of an advantage over linked servers, just because of its robust error handling, threading logic, and ability to use different adapters (OLE DB, ODBC, etc.; they have different performance characteristics, so do a search and you will find some results). But the key to your #4 is to do it in smaller chunks from a staging table, and if you can do it more often, it is less likely to have an impact. E.g. daily would already be ~1/7th of the size of weekly, assuming an even daily distribution of changes.
Take 10,000,000 records changed per week:
Once weekly = 10 million
Once daily = ~1.4 million
Once hourly = ~59K records
Once every 5 minutes = less than 5K records
And if it has to be once a week, just think about still doing it in small chunks so that each insert has a minimal effect on your transaction logs, the actual lock time on the production table, etc. Be sure that you never allow loading of partially staged/transferred data, otherwise identifying deltas could get messed up and you could end up missing changes.
One other thought: if this is a scenario like a reporting instance and you have enough server resources, you could bring your entire table over from production into a staging table, or update a copy of the table at the destination, and then simply drop the current table and rename the staging table. This is an extreme approach and not one I generally like, but it is possible, and the actual impact on the user would be very nominal.
I think SSIS is good at transferring data; my approach here:
1. Create a package with one Data Flow Task to transfer the data. If the structure of the two tables is different, that's okay; just map them.
2. Create a SQL Server Agent job to run your package every weekend.
Also, the Track Data Changes (SQL Server) feature is worth a look. You can configure when you want to sync data, and its performance is good too.
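Track Data Changes covers both Change Tracking and Change Data Capture; enabling CDC on one table, for example, looks roughly like this (schema and table names are hypothetical):

    -- Enable Change Data Capture on the database, then on the table to sync.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL;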
With SQL Server versions later than 2005, it has been my experience that dumping to a file and importing it is equal to or slower than transferring data directly from table to table with SSIS.
That said, and in addition to the excellent points @Matt makes, this is the usual pattern I follow for this sort of transfer.
Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
I typically put these into their own database schema so their purpose is clear.
I also typically use the SSIS OLE DB Destination package's "New" button to create the tables.
Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
This can be one package or many, depending on how many tables you're pulling over.
Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables (a sketch appears at the end of this answer).
Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
Use SSIS Execute SQL tasks to call the stored procedures.
Use parallel processing via Sequence Containers where possible to save time.
This can be one package or many, depending on how many tables you're transforming.
(Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want to use the stored procedures to land the data in your ultimate destination tables, or if you want to have the procedures write to intermediate tables, and then move the transformed data directly into the final tables. Using intermediate tables minimizes down time on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).
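As a sketch of the stored-procedure transformation step mentioned above, with every object name hypothetical and the real logic depending entirely on your schemas:

    CREATE PROCEDURE etl.LoadDimCustomer
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Reshape the raw replica rows into the structure of the final table.
        INSERT INTO dbo.DimCustomer (CustomerId, FullName, Country)
        SELECT c.CustomerId,
               CONCAT(c.FirstName, ' ', c.LastName),
               a.CountryCode
        FROM  replica.Customer c
        JOIN  replica.Address  a ON a.CustomerId = c.CustomerId
        WHERE a.IsPrimary = 1;
    END;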

How to handle hive locking across hive and presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts in Presto. Naturally, we're running into windows of time where users hit an incomplete data set, because Presto ignores the locks.
The options I can think of:
Fork the presto-hive connector to support hive S and X locks appropriately. This isn't too bad, but time consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; then, once a new partition is written, alter the Hive table to see it. We can then just have views on top of the data that properly reconcile the latest version of each row (a rough sketch of this option appears at the end of this question).
Stop doing insert-overwrite on S3, which has a long window while copying from Hive staging to the target table. If we move to HDFS for all insert-overwrites we still have the issue, but it lasts only as long as an hdfs mv, which is significantly faster. (Probably bad: there's still a window in which we can get incomplete data.)
My question is: how do people generally handle this? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third-party tool that can query the Hive metastore and interact with HDFS/S3 directly while not respecting Hive locks.
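A rough sketch of option 3, with hypothetical table, partition, and column names:

    -- Write each batch to a new partition location, then expose it atomically.
    ALTER TABLE events ADD PARTITION (batch_id = '2019-05-01')
    LOCATION 's3://my-bucket/events/batch_id=2019-05-01/';

    -- A view that reconciles the latest version of each row across batches.
    CREATE VIEW events_latest AS
    SELECT id, payload, batch_id
    FROM (
        SELECT id, payload, batch_id,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY batch_id DESC) AS rn
        FROM events
    ) t
    WHERE rn = 1;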

Backing up portion of data in SQL

I have a huge schema containing billions of records. I want to purge data older than 13 months from it and maintain that data as a backup in such a way that it can be recovered whenever required.
What is the best way to do this in SQL? Can we create a separate copy of this schema and add a delete trigger on all tables, so that when the trigger fires, the purged data gets inserted into the new schema?
Will there be only one record per delete statement if we use triggers, or will all deleted records be inserted?
Can we somehow use bulk copy?
I would suggest this is a perfect use case for the Stretch Database feature in SQL Server 2016.
More info: https://msdn.microsoft.com/en-gb/library/dn935011.aspx
The cold data can be moved to the cloud based on your date criteria without any applications or users being aware of it when querying the database. No backups are required, and it is very easy to set up.
There is no need for triggers; you can use a job running every day that moves outdated data into archive tables.
The best way, I guess, is to create a copy of the current schema. In the main part, delete everything that is older than 13 months; in the archive part, delete everything from the last 13 months.
Then create a stored procedure (or several) that collects the data, puts it into the archive, and deletes it from the main table. Put this into a daily job.
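A minimal sketch of such a procedure, with hypothetical database, table, and column names (for billions of rows you would likely add batching):

    CREATE PROCEDURE dbo.ArchiveOldOrders
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @cutoff DATETIME = DATEADD(MONTH, -13, GETDATE());

        BEGIN TRANSACTION;

        -- Copy rows older than 13 months into the archive database ...
        INSERT INTO Archive.dbo.Orders
        SELECT * FROM Prod.dbo.Orders
        WHERE OrderDate < @cutoff;

        -- ... then remove them from the main table.
        DELETE FROM Prod.dbo.Orders
        WHERE OrderDate < @cutoff;

        COMMIT TRANSACTION;
    END;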
The cleanest and fastest way to do this (with billions of rows) is to create a partitioned table, probably partitioned on a date column by month. Moving the data in a given partition is a metadata operation and is extremely fast (if the partition function and scheme are set up properly). I have managed 300 GB tables using partitioning and it has been very effective. Be careful with the partition function so that dates at each edge are handled correctly.
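For example, switching a monthly partition out of the main table into an archive table with a matching structure is a metadata-only operation (partition numbers and object names are hypothetical):

    -- Both tables must have identical structure and be aligned on the same
    -- partition scheme/filegroup for the switch to succeed.
    ALTER TABLE dbo.Orders
    SWITCH PARTITION 2 TO dbo.Orders_Archive PARTITION 2;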
Some of the other proposed solutions involve deleting millions of rows which could take a long, long time to execute. Model the different solutions using profiler and/or extended events to see which is the most efficient.
I agree with the above about not creating a trigger. Triggers fire on every insert/update/delete, making them very slow.
You may be best served with a data archive stored procedure.
Consider using multiple databases: the current database that holds your current data, and an archive database (or several) that you move records into from the current database, via some sort of nightly or monthly stored procedure process.
You can use the exact same schema as your production system.
If the data is already in the database, there is no need for a bulk copy. From there you can back up your archive database so that it is off the SQL Server, and restore the database if needed to make the data available again. This is much faster and more manageable than bulk copy.
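The backup and restore side is then plain T-SQL (database name and file path are hypothetical):

    -- Back up the archive database to a file so it can be taken off the server ...
    BACKUP DATABASE ArchiveDB
    TO DISK = N'D:\Backups\ArchiveDB.bak';

    -- ... and restore it whenever the old data is needed again.
    RESTORE DATABASE ArchiveDB
    FROM DISK = N'D:\Backups\ArchiveDB.bak'
    WITH RECOVERY;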
According to Microsoft's documentation on Stretch DB (found here - https://learn.microsoft.com/en-us/azure/sql-server-stretch-database/), you can't update or delete rows that have been migrated to cold storage or rows that are eligible for migration.
So while Stretch DB does look like a capable technology for archive, the implementation in SQL 2016 does not appear to support archive and purge.