BigQuery INSERT INTO table SELECT command cost - google-bigquery

I'm trying to understand Google BigQuery pricing. I read in the batch load documentation that load, export, and copy-table jobs are free because they use a shared pool of slots. However, I'm a bit confused about the pricing of the sub-queries below in ELT jobs or UDFs. I'm thinking this will incur cost since we are reading from a table.
INSERT dataset.targetTable (col1, col2, col3)
SELECT col1, col2, col3
FROM dataset.sourceTable
Will reading from an external table whose data lives in a Google Cloud Storage bucket in the same region also incur cost?
INSERT dataset.targetTable (col1, col2, col3)
SELECT col1, col2, col3
FROM dataset.external_table
If the external table query above incurs cost, then would the best option be to use a load job to load the data into a persistent table in BigQuery where possible, instead of reading from an external table?
Thanks.

As mentioned by @Samuel, querying external tables incurs cost if the external table's data is stored in Cloud Storage. For external tables in Google Cloud Storage, the cost is $1.10 per TB, and the first 300 TB are free. Batch loading is free. But if you are looking at streaming inserts, then they will incur cost. UDFs also incur cost. For your requirement, the ELT query will incur cost, and if you opt for batch loading then it will be free of cost. For more information, you can check this document.
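If you want to check what such an INSERT ... SELECT would be billed before running it, a dry run reports the bytes that would be processed. Below is a minimal sketch using the google-cloud-bigquery Python client; the table names are just the placeholders from the question.

# Sketch: estimate what an INSERT ... SELECT would be billed before running it,
# using a dry run with the google-cloud-bigquery Python client.
# Table names are placeholders; substitute your own project/dataset/table.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
INSERT dataset.targetTable (col1, col2, col3)
SELECT col1, col2, col3
FROM dataset.sourceTable
"""

# A dry run validates the statement and returns the bytes that would be
# processed (the basis of on-demand query pricing) without executing it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"Bytes that would be processed: {job.total_bytes_processed}")

Multiplying total_bytes_processed by your region's on-demand rate gives a rough cost estimate for the statement.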
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.

Related

The cost of saving the query result to a table in BigQuery?

As the title states.
I'm curious: there's a Python API (QueryJobConfig) where I can set a destination table to save the query result to, so how much will this kind of saving cost in GCP?
There are two ways I can load or insert external data into BigQuery: streaming and batch loading.
However, using streaming inserts might be costly.
To make it clear, saving the query result may count as an insertion, so if I use this method to update a table frequently, will it be as costly as doing streaming inserts?
BigQuery saves query results to a table, which can be either permanent or temporary.
A temporary table has a lifetime of 24 hours, and you are not charged for storing temporary tables.
When you write query results to a permanent table, you can create/append/overwrite the table. When you specify a destination table for large query results, you are charged for storing the data. Storage pricing is based on the amount of data stored in your tables. Please, refer to official documentation describing BigQuery pricing.
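As a rough illustration of the destination-table route the question asks about, here is a minimal sketch with the google-cloud-bigquery Python client; project, dataset, and table names are placeholders.

# Sketch: save query results to a permanent destination table with
# QueryJobConfig. You pay for the query and for storing the result;
# no streaming-insert charge applies. Names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.results_table",
    # Append to (or create) the table instead of failing if it exists.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

query_job = client.query(
    "SELECT col1, COUNT(*) AS n FROM `my_dataset.source_table` GROUP BY col1",
    job_config=job_config,
)
query_job.result()  # wait for the job to finish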
Streaming inserts can be done either using Dataflow jobs or by calling the BigQuery streaming insert API directly. You need to pay for the streaming insert process ($0.01 per 200 MB) and for storage of the new data.
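For comparison, a streaming insert goes through the tabledata.insertAll API, which is what carries the per-200-MB charge. A minimal sketch with the same Python client, again with placeholder table and column names:

# Sketch: a streaming insert via the insertAll API, which is what carries
# the $0.01 per 200 MB charge mentioned above. Table name and rows are
# placeholders.
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"col1": "a", "col2": 1},
    {"col1": "b", "col2": 2},
]

errors = client.insert_rows_json("my-project.my_dataset.events", rows)
if errors:
    print(f"Streaming insert failed for some rows: {errors}")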
If you wish to estimate storage and query cost, please refer to official documentation.
I hope you find the above pieces of information useful.

Use case of using BigQuery or Bigtable for querying aggregate values?

I have a use case for designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the options below would be better for this use case?
Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or using Cloud Bigtable for storage and installing HBase Shell on Compute Engine to query the Bigtable data.
Based on my analysis below for this specific use case, I see that Cloud Storage can be queried through BigQuery. Also, Bigtable supports CSV imports and querying. The BigQuery limits also mention a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro, based on the documentation, which I assume means I could run multiple load jobs if loading more than 15 TB.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above usecase?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading, cluster your tables for massive cost improvements on the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and costs.
Instead of CREATE TABLE you can do imports via the API; those are free (instead of paying the query cost of CREATE TABLE).
15 TB can be handled easily by BigQuery.
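The free "imports via the API" mentioned above correspond to load jobs. Here is a minimal sketch of materializing GCS CSVs into a native table with the google-cloud-bigquery Python client; the bucket, dataset, and table names are placeholders, and it assumes a header row.

# Sketch: materialize CSVs from GCS into a native BigQuery table with a
# load job (batch loading is free; you only pay for the resulting storage).
# Bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.native_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,      # assumes a header row in each file
    autodetect=True,          # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data/*.csv",
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for completion
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")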

BigQueryIO Read vs fromQuery

Say that in a Dataflow/Apache Beam program, I am trying to read a table whose data is growing exponentially. I want to improve the performance of the read.
BigQueryIO.Read.from("projectid:dataset.tablename")
or
BigQueryIO.Read.fromQuery("SELECT A, B FROM [projectid:dataset.tablename]")
Will the performance of my read improve if I only select the required columns in the table, rather than the entire table, in the above?
I am aware that selecting a few columns results in reduced cost. But I would also like to know about the read performance of the two approaches above.
You're right that selecting only the required columns reduces cost compared to referencing all the columns in the SQL/query. Also, when you use from() instead of fromQuery(), you don't pay for any table scans in BigQuery. I'm not sure if you were aware of that or not.
Under the hood, whenever Dataflow reads from BigQuery, it actually calls BigQuery's export API and instructs BigQuery to dump the table(s) to GCS as sharded files. Dataflow then reads these files in parallel into your pipeline. It does not read "directly" from BigQuery.
As such, yes, this should improve performance, because the amount of data that needs to be exported to GCS under the hood and read into your pipeline will be smaller, i.e. fewer columns = less data.
However, I'd also consider using partitioned tables, and then even think about clustering them too. Also, use WHERE clauses to even further reduce the amount of data to be exported and read.
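The snippets in the question use the Beam Java SDK; for illustration only, here is a rough equivalent of the two read styles in the Beam Python SDK. The project, dataset, table, and column names are placeholders.

# Sketch (Beam Python SDK): the two read styles discussed above. With a
# query, only the selected columns are exported under the hood, so less
# data is written to GCS and read into the pipeline. Names are placeholders.
import apache_beam as beam

with beam.Pipeline() as p:
    # Full-table read: exports the whole table; no query scan is billed.
    full_table = p | "ReadTable" >> beam.io.ReadFromBigQuery(
        table="projectid:dataset.tablename"
    )

    # Query read: you pay for the query scan, but only columns A and B
    # are exported and read into the pipeline.
    selected_cols = p | "ReadQuery" >> beam.io.ReadFromBigQuery(
        query="SELECT A, B FROM `projectid.dataset.tablename`",
        use_standard_sql=True,
    )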

Saving results to destination table costs

I'm wondering what kind of insert it counts as when saving the results of a large query (with multiple joins and unions) to a destination (day-partitioned) table.
Currently, on a Google Cloud VM, I execute these queries, save the results to local temporary CSVs, and upload those CSVs to their respective tables.
This is fairly inefficient (not as fast as it could be, and it uses quite a lot of VM resources). However, it is cheap, since CSV load jobs are free. If I were to save the query results into a destination table (appending to old data which already consists of 100M+ rows), would that incur streaming insert costs? That is what I'd like to avoid, since $0.02/MB can rack up quite a bit given we're adding a lot of data on a daily basis.
Thanks for your help.
Inside BigQuery, running a query and saving the results to a destination table costs you:
the query price (you pay this anyway)
the storage price (new data gets accumulated in the table - choose a partitioned one)
no streaming costs
If you have data outside of BigQuery and you end up adding that data to BigQuery, you pay for:
the load job, if you use load (that's free)
the streaming insert, if you stream (this has a cost component)
storage of the new data in the table you added it to
I'm wondering what kind of insert it counts as when saving the results of a large query (with multiple joins and unions) to a destination (day-partitioned) table.
... if I were to save the query results into a destination table (appending to old data which already consists of 100M+ rows), would that incur streaming insert costs?
Setting a destination table for the query job is the most effective way of getting the result of that query added to the existing table. It DOES NOT incur any extra cost related to streaming, as there is no streaming happening here at all.
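As a sketch of that, the google-cloud-bigquery Python client lets you point the query job at a specific day partition of the destination table via the $YYYYMMDD decorator and append to it. The names and date below are placeholders, and this assumes the client accepts a decorated table ID as the destination.

# Sketch: append query results directly into one day partition of a
# partitioned destination table (no streaming inserts involved).
# Project, dataset, table, and the partition date are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    # "$20240101" targets the 2024-01-01 partition of the destination table.
    destination="my-project.my_dataset.daily_table$20240101",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

client.query(
    "SELECT * FROM `my_dataset.staging_table` WHERE dt = '2024-01-01'",
    job_config=job_config,
).result()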

Move partitioned and unpartitioned BigQuery tables from US to EU

I need to move BigQuery datasets with many tables (both partitioned and unpartitioned) from the US to the EU.
If the source table is unpartitioned, the documented way of bq extracting the data to GCS and bq loading it in another region works fine, so far so good.
If, however, the source table is partitioned, the mapping between data and partitions is lost during the load step, and I end up with all the data in one partition.
Is there a good (automated) way of exporting and importing partitioned tables in BQ? Any pointers would be greatly appreciated!
There are a few ways to do this, but I would personally use Cloud Dataflow to solve it. You'll have to pay a little bit more for Dataflow, but you'll save a lot of time and scripting in the long run.
High level:
Spin up a Dataflow pipeline
Read partitioned table in US (possibly aliasing the _PARTITIONTIME to make it easier later)
Write the results back to BigQuery using the same partition.
It's basically the same as what was talked about here.
Another solution is to use DML to load the data instead of a load job: https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables. Since you have a timestamp column in the table to infer the partition from, you can use
INSERT INTO PROJECT_ID.DATASET.mytable (_PARTITIONTIME, field1, field2) SELECT timestamp_column, 1, "one" FROM PROJECT_ID.DATASET.federated_table
You can define a permanent federated table or a temporary one: https://cloud.google.com/bigquery/external-data-cloud-storage#permanent-tables. You'll need to pay for the DML though, while a load job is free.
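For illustration, here is a minimal Python sketch of that DML approach, assuming an Avro export already sitting in GCS; the project, dataset, table, and URI names are placeholders.

# Sketch of the DML approach above: define a federated (external) table over
# the exported files in GCS, then INSERT into the partitioned table, mapping
# the timestamp column to _PARTITIONTIME. All names and URIs are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Permanent external table pointing at the exported Avro files.
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://my-bucket/export/mytable-*.avro"]

table = bigquery.Table("my-project.my_dataset.federated_table")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# DML insert into the partitioned table (DML is billed as a query).
client.query("""
INSERT INTO `my-project.my_dataset.mytable` (_PARTITIONTIME, field1, field2)
SELECT timestamp_column, 1, "one"
FROM `my-project.my_dataset.federated_table`
""").result()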