Stream table data from one BigQuery table to another with existing schema - google-bigquery

I have two BigQuery datasets: dataset_a and dataset_b
Each of these datasets contains a table, e.g. dataset_a_table and dataset_b_table.
dataset_a_table contains streaming data and I want to stream data from dataset_a_table to dataset_b_table.
I have the schema of dataset_a_table as a TableSchema. How can I stream rows from one table to another and keep the existing schema?
I have so far looked at the insertAll method of BigQuery, but I am a bit unsure about which data structure to fetch rows into and how to specify the TableSchema when inserting into a new table.
I would appreciate some guidance on how to do that. Thanks.

Approach 1: If dataset_b_table needs to simply mirror dataset_a_table, for instance because you have different user permissions on the two datasets, you could consider setting up dataset_b_table as a view instead of a table. Views in BigQuery work across datasets:
CREATE VIEW dataset_b.dataset_b_view AS SELECT * FROM dataset_a.dataset_a_table
Approach 2: If you do want dataset_b_table with the same schema as dataset_a_table, you can use the BigQuery native "transfers" functionality. ("Transfers" > "Create Transfer" > select "Dataset Copy")
Approach 3: If dataset_b_table has a different schema from dataset_a_table, or if dataset_b_table already contains data and you want to merge in data from dataset_a_table, you will need some sort of incremental logic. Assuming dataset_a_table has some kind of timestamp field such as "updated_at" (and assuming no updates to existing records), you could go with an incremental load like this:
INSERT INTO dataset_b.dataset_b_table
SELECT
  column_a, column_b, column_c, updated_at
FROM dataset_a.dataset_a_table
WHERE updated_at > (SELECT MAX(updated_at) FROM dataset_b.dataset_b_table)
You can then schedule this to run depending on your timing requirements: once a day, once an hour, or every couple of minutes. You can use the BigQuery native scheduling functionality or your own logic.
If you need actual streaming within (milli)seconds and the view approach doesn't work for you, you will need to work with the source that fills dataset_a_table in the first place, as BigQuery doesn't support triggers.

Related

In Google Bigquery, how to denormalize tables when the data is from different 3rd party source?

I have data about contacts in Salesforce. I also have data about the contacts in Intercom/Zendesk. I want to create a denormalized table where the data from Salesforce and Intercom is merged into a single table so I can query everything about a contact. Imagine I dumped the Salesforce data into a BigQuery table. The problem is that we might not dump Intercom/Zendesk until later. So we may only add Salesforce data into a BigQuery table now, and later we may add Intercom data. My question is how to merge these (existing data in the Salesforce BQ table and new data from Intercom)? Assume that Email is the primary key in both 3rd party sources and we can join on it.
Do we need to take the Salesforce data out of the BQ table and run it through some tool to merge both tables and create a new table in BQ?
What will happen if we keep getting new data in both Salesforce and Intercom?
Your case seems to be a good use case for Views.
A view is basically a virtual table that points to a query. You can define a view based on a query (let's call it query_1) and then you will be able to see that view as a table. However, every time you run a query (let's call it query_2) using that view as a source, internally BigQuery will execute query_1 and then execute your query_2 against the results of query_1.
In your case, you could create a query that uses a join to merge your tables and save this query as a view. You can create a view in the BigQuery console by clicking Save view and then filling in a few required fields before saving.
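As a rough sketch, the view could look like the following. The dataset, table and column names here (my_dataset, salesforce_contacts, intercom_contacts, account_name, last_seen_at) are placeholders for whatever your actual tables contain; only the Email join key comes from your description. A LEFT JOIN keeps the Salesforce rows even before any Intercom data has been loaded.
CREATE VIEW my_dataset.contacts_denormalized AS
SELECT
  s.Email,
  s.account_name,
  i.last_seen_at
FROM my_dataset.salesforce_contacts AS s
LEFT JOIN my_dataset.intercom_contacts AS i
  ON s.Email = i.Email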
In BigQuery there are also Materialized Views, that implements some cache technologies in order to make the view more similar to a table.
Some benefits of materialized views are:
Reduction in the execution time and cost for queries with aggregate functions. The largest benefit is gained when a query's computation cost is high and the resulting data set is small.
Automatic and transparent BigQuery optimization, because the optimizer uses a materialized view, if available, to improve the query execution plan. This optimization does not require any changes to the queries.
The same resilience and high availability as BigQuery tables.
To create a materialized view you have to run the below command:
CREATE MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
AS <my-query>
Finally, I suggest that you take a look at the reference documentation for both views and materialized views in BigQuery and decide which one fits your use case.
You can read more about querying Google Cloud Storage here: https://cloud.google.com/bigquery/external-data-cloud-storage.
You can take the extracts and place them into Google Cloud Storage under separate buckets, i.e. a Salesforce bucket and a Zendesk bucket.
Once the files are available, you can create external tables on those buckets (one table per bucket) so that you can query them independently.
Once you can query them, you can perform joins just like with normal tables.
You can replace the files in the buckets when new data comes in.
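As a minimal sketch of one such external-table definition: the dataset, table, bucket and column names below are assumptions, and the format should match whatever your export files actually use. A second external table over the Intercom/Zendesk bucket would be defined the same way, and the two can then be joined on Email.
CREATE EXTERNAL TABLE my_dataset.salesforce_contacts_ext (
  Email STRING,
  Name STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://salesforce-bucket/contacts/*.csv']
)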

Keeping BigQuery table data up-to-date

This is probably an incorrect use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients which read the data should see either only the old version of the data or the completely new version.
The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up-to-date data have to know about the partitions and read only from certain ones. Every time I want to make a query I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like the solution to be easy and transparent for clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including single DML statements (INSERT/UPDATE/DELETE/MERGE) and single load jobs, are atomic. Your readers see either the old data or the new data.
Lacking multi-statement transactions right now, if you do have updates which don't fit into a single load job, the solution is:
Load the updates into a staging table; after all loads have finished,
use a single INSERT or MERGE to merge the updates from the staging table into the primary data table (see the sketch below).
The drawback: scanning the staging table is not free.
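A minimal sketch of that merge step, with hypothetical table and column names (primary_table, staging_table, id, col_a, updated_at):
MERGE my_dataset.primary_table AS T
USING my_dataset.staging_table AS S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET col_a = S.col_a, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, col_a, updated_at) VALUES (S.id, S.col_a, S.updated_at)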
Update: since you have multiple tables to update atomically, there is a small trick which may be helpful.
Assuming each table that you need to update has an ActivePartition column as its partition key, you can keep a control table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date, and your users then query through a script like this:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
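For completeness, the switch after each load can be a single statement; a sketch reusing the control table above (the date literal is just an example):
UPDATE ActivePartition SET active = DATE '2021-01-01' WHERE TRUE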

The best way to Update the database table through a pyspark job

I have a Spark job that gets data from multiple sources and aggregates it into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in
IMHO, comparing the entire existing data set just to load the new data is not performant.
Option 1:
Instead, you can create a partitioned BigQuery table with a partition column used for loading, and while loading new data you can check whether the new data falls into an existing partition (see the sketch after the links below).
Hitting partition-level data in Hive or BigQuery is much more efficient than selecting the entire data set and comparing it in Spark.
The same is applicable for Hive as well.
see this Creating partitioned tables
or
Creating and using integer range partitioned tables
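A minimal sketch of a date-partitioned table and a partition-pruned check; the table and column names (events, load_date) are assumptions:
CREATE TABLE my_dataset.events (
  id INT64,
  payload STRING,
  load_date DATE
)
PARTITION BY load_date;

-- check only the partition the new batch belongs to, instead of scanning everything
SELECT COUNT(*) AS existing_rows
FROM my_dataset.events
WHERE load_date = DATE '2021-01-01'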
Option 2:
Another alternative: Google BigQuery has a MERGE statement. If your requirement is to merge the data without a full comparison, you can go ahead with the MERGE statement; see the MERGE documentation.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement for each change to the target table.
There are many ways this problem can be solved; one of the less expensive, performant and scalable ways is to use a datastore on the file system to determine what is truly new data.
As data comes in for the first time, write it to two places: the database and a file (say in S3). If data is already in the database then you need to initialize the local/S3 file with the table data.
As data comes in from the second time onwards, check whether it is new based on its presence in the local/S3 file.
Mark delta data as new or updated. Export this to the database as an insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate this file to keep data within that time range.
You can also bucket and partition this data. You can use Delta Lake to maintain it too.
One downside is that whenever the database is updated, this file may need to be updated depending on whether the relevant data has changed or not. You can maintain a marker on the database table to signify the sync date, and index that column. Read changed records based on this column and update the file/Delta Lake.
This way your Spark app will be less dependent on the database. The database operations are not very scalable, so keeping them off the critical path is better.
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table it would solve the problem.

Create a date-limited view on a hive table containing complex types in a way that is queryable with Impala?

I have a very large parquet table containing nested complex types such as structs and arrays. I have partitioned it by date and would like to restrict certain users to, say, the latest week of data.
The usual way of doing this would be to create a time-limited view on top of the table, e.g.:
CREATE VIEW time_limited_view
AS SELECT * FROM my_table
WHERE partition_date >= '2020-01-01'
This will work fine when querying the view in Hive. However, if I try to query this view from Impala, I get an error:
AnalysisException: Expr 'my_table.struct_column' in select list returns a complex type
The reason for this is that Impala does not allow complex types in the select list. Any view I build which selects the complex columns will cause errors like this. If I flattened/unnested the complex types, this would of course get around the issue, but due to the layers of nesting involved I would like to keep the table structure as is.
I have seen another suggested workaround of using Ranger row-level filtering, but I do not have Ranger and will not be able to install it on the cluster. Any suggestions on Hive/Impala SQL workarounds would be appreciated.
While working on a different problem I came across a kind of solution that fits my needs (but is by no means a general solution). I figured I'd post it in case anyone has similar needs.
Rather than using a view, I can simply use an external table. So firstly I would create a table in database_1 using Hive, which has a corresponding location, location_1, in hdfs. This is my "production" database/table which I use for ETL and contains a very large amount of data. Only certain users have access to this database.
CREATE TABLE database_1.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Next, I create a second, external table in the same location in hdfs. However this table is stored in a database with a much broader user group (database_2).
CREATE EXTERNAL TABLE database_2.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Since this is an external table, I can add/drop date partitions at will without affecting the underlying data. I can add one week's worth of date partitions to the metastore and, as far as end users can tell, that's all that is available in the table. I can even make this part of my ETL job: each time new data is added, I add that partition to the external table and then drop the partition from a week ago, resulting in a rolling window of one week's data being made available to this user group without having to duplicate a load of data to a separate location.
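A minimal sketch of that rolling-window step (the dates are just examples, and this assumes the new partition's files already sit under the table's standard partition path; on an external table, dropping a partition only removes it from the metastore and leaves the files in place):
ALTER TABLE database_2.tablename ADD IF NOT EXISTS PARTITION (date_col='2020-01-08');
ALTER TABLE database_2.tablename DROP IF EXISTS PARTITION (date_col='2020-01-01');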
This is by no means a row-filtering solution, but is a handy way to use partitions to expose a subset of data to a broader user group without having to duplicate that data in a separate location.

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're eventually going to end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then batch jobs that we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our bigquery schema based off the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document storage instead, but we use Bigquery as both a data lake and also as a data warehouse for BI and building Tableau reports off of. So we have jobs that aggregates raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table and, to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add a partitioning column and a few clustering columns as well (BigQuery allows up to four), and possibly even use yearly suffixed tables. This way you have several dimensions with which to scan only a limited number of rows during rematerialization (see the sketch below).
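A minimal sketch of such a table; the column names (event_ts, event_type, user_id, payload) are assumptions, with the raw JSON stored as a string:
CREATE TABLE my_dataset.raw_events (
  event_ts TIMESTAMP,
  event_type STRING,
  user_id STRING,
  payload STRING  -- raw JSON kept as a string
)
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id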
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to Dataflow or Pub/Sub, process them there, and write to BigQuery with the new schema. This pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, where you rewrite the same table with a query (see the sketch below). You can rematerialize to remove duplicate rows as well.
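A rough sketch of rematerialization, reusing the hypothetical raw_events table from above (deprecated_col stands in for whatever column you want to drop; the PARTITION BY and CLUSTER BY clauses are repeated so the replacement keeps the same layout). The same pattern with SELECT DISTINCT * removes duplicate rows.
CREATE OR REPLACE TABLE my_dataset.raw_events
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id
AS
SELECT * EXCEPT (deprecated_col)
FROM my_dataset.raw_events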
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
read the event/JSON from Pub/Sub
flatten the events and filter to the columns which you want to insert into the BQ table
with Dynamic Destinations you will be able to insert the data into the respective tables (if you have various events of various types); in Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON
get the failed insert records from the Dynamic Destinations output and write them to a file per event type, following some windowing based on your use case (how frequently you observe such issues)
read the file, update the schema once, and load the file to that BQ table
I have implemented this logic in my use case and it is working perfectly fine.