Move partitioned and unpartitioned BigQuery tables from US to EU - google-bigquery

I need to move BigQuery datasets with many tables (both partitioned and unpartitioned) from the US to the EU.
If the source table is unpartitioned, the documented way of bq extracting the data to GCS and bq loading it in another region works fine, so far so good.
If however the source table is partitioned, during the load step the mapping between data and partition is lost and I'll end up having all data within one partition.
Is there a good (automated) way of exporting and importing partitioned tables in BQ? Any pointers would be greatly appreciated!

There's a few ways to do this, but I would personally use Cloud Dataflow to solve it. You'll have to pay a little bit more for Dataflow, but you'll save a lot of time and scripting in the long run.
High level:
Spin up a Dataflow pipeline
Read partitioned table in US (possibly aliasing the _PARTITIONTIME to make it easier later)
Write results back to BigQuery using same partition.
It's basically the same as what was talked about here.

Another solution is to use DML to load the data, instead of load, https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables. Since you have a timestamp column in the table to infer the partition, you can use
INSERT INTO PROJECT_ID.DATASET.mytable (_PARTITIONTIME, field1, field2) AS SELECT timestamp_column, 1, “one” FROM PROJECT_ID.DATASET.federated_table
You can define a permanent federated table, or a temporary one, https://cloud.google.com/bigquery/external-data-cloud-storage#permanent-tables. You'll need to pay for DML though, while load is free.

Related

What is the most efficient way to load a Parquet file for SQL queries in Azure Databricks?

Our team drops parquet files on blob, and one of their main usages is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They will do this in Azure Databricks.
We've mapped the blob storage and can access the parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:
Cell1:
%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument("JobId")}/results_data.parquet"
df_results = spark.read.load(results_path)
df_results.createOrReplaceTempView("RESULTS")
The cell following this can now start doing SQL queries. e.g.:
SELECT * FROM RESULTS LIMIT 5
This takes a bit of time to get up, but not too much. I'm concerned about two things:
Am I loading this in the most efficient way possible, or is there a way to skip the creation of the df_results dataframe, which is only used to create the RESULTS temp table.
Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?
For your first question:
Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.
As a quick example, you can create a table using SQL or Python:
# SQL
CREATE TABLE <example-table>(id STRING, value STRING)
# Python
dataframe.write.saveAsTable("<example-table>")
Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.
# SQL
SELECT * FROM <example-table>
# Python
spark.sql("SELECT * FROM <example-table>")
For your second question:
Performance depends on multiple factors but in general, here are some tips.
If your tables are large (tens, hundreds of GB at least), you can partition by a predicate commonly used by your analysts to filter data. For example, if you typically include a WHERE clause that includes a date range or state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE helps right-size files for Spark and ZORDER improves data skipping.
Choose Delta Cache Accelerated instace types for the cluster that your analysts will be working on.
I know you said there's no need to persist beyond the notebook but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on.

The cost of saving the query result to a table in BigQuery?

As title stated.
I'm curious about there's a python AIP (QueryJobConfig) that I can set the destination table where I can save the query result, so for this kind of saving how much will it cost in GCP?
There are two types of way that I can load or insert external data to BigQuery. (Streaming and Batch uploading)
However, using streaming insert might be costly.
To make it clear, saving the query result may consider as an insertion, so if I use this method to update a table frequently, will it be costly as doing the streaming insert.
BigQuery saves query results to a table, which can be either permanent or temporary.
Temporary table has a lifetime of 24 hours and you are not charged for storing temporary tables.
When you write query results to a permanent table, you can create/append/overwrite the table. When you specify a destination table for large query results, you are charged for storing the data. Storage pricing is based on the amount of data stored in your tables. Please, refer to official documentation describing BigQuery pricing.
Streaming insert can be done either using DataFlow jobs or by calling the BigQuery streaming insert API directly. You need to pay for streaming insert process ($0.01per 200 MB) and storage of the new data.
If you wish to estimate storage and query cost, please refer to official documentation.
I hope you find the above pieces of information useful.

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field to the website would correspond to new columns for in BigQuery. Also if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in Bigquery.
So we're going to eventually result in tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as json (for example where each Bigquery table will just have two columns, one for timestamp and another for the json data). Then batch jobs that we have running every 10minutes will perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our bigquery schema based off the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document storage instead, but we use Bigquery as both a data lake and also as a data warehouse for BI and building Tableau reports off of. So we have jobs that aggregates raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared, you layout several options in your question.
You could go with the JSON table and to maintain low costs
you can use a partition table
you can cluster your table
so instead of having just two timestamp+json column I would add 1 partitioned column and 5 cluster colums as well. Eventually even use yearly suffixed tables. This way you have at least 6 dimensions to scan only limited number of rows for rematerialization.
The other would be to change your model, and do an event processing middle-layer. You could first wire all your events either to Dataflow or Pub/Sub then process it there and write to bigquery as a new schema. This script would be able to create tables on the fly with the schema you code in your engine.
Btw you can remove columns, that's rematerialization, you can rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
I think this use case can be implemeted using Dataflow (or Apache Beam) with Dynamic Destination feature in it. The steps of dataflow would be like:
read the event/json from pubsub
flattened the events and put filter on the columns which you want to insert into BQ table.
With Dynamic Destination you will be able to insert the data into the respective tables
(if you have various event of various types). In Dynamic destination
you can specify the schema on the fly based on the fields in your
json
Get the failed insert records from the Dynamic
Destination and write it to a file of specific event type following some windowing based on your use case (How frequently you observe such issues).
read the file and update the schema once and load the file to that BQ table
I have implemented this logic in my use case and it is working perfectly fine.

Use case of using Big Query or Big table for querying aggregate values?

I have usecase for designing storage for 30 TB of text files as part of deploying data pipeline on Google cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which would be a better option in below for this use case?
Using Cloud Storage for storage and link permanent tables in Big Query for query or Using Cloud Big table for storage and installing HBaseShell on compute engine to query Big table data.
Based on my analysis in below for this specific usecase, I see below where cloudstorage can be queried in through BigQuery. Also, Bigtable supports CSV imports and querying. BigQuery limits also mention a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro based on the documentation, which means i could load mutiple load jobs if loading more than 15 TB, i assume.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above usecase?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading cluster your tables, for massive improvements in costs for the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and costs.
Instead of CREATE TABLE you can do imports via API, those are free (instead of cost of query for CREATE TABLE.
15 TB can be handled easily by BigQuery.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).