Pricing of bq query with destination table - google-bigquery

I'm running a BigQuery command-line query with a destination table,
like bq query --destination_table with a SELECT statement from a source table.
Will this be considered loading data or querying data?
I ask because loading data is free, while querying data costs money.
My intention is to move some data from source to destination with some manipulation of the source fields, so bq query with a destination table looks like a perfect fit for this.

If you are running a query, then you are billed for the cost of the query. It doesn't matter whether you have specified a destination table. If you want to avoid the cost of querying, you need to extract the data, perform whatever transformation you want, and then load it again.
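To make that concrete, the free path could look like this with the bq CLI (a minimal sketch; the bucket, table names, and schema file are placeholders, and the transformation itself happens outside BigQuery with whatever tooling you choose):
# Exporting to GCS is free (subject to quota):
bq extract --destination_format=CSV mydataset.src_table gs://my-bucket/extract-*.csv
# ...download, manipulate the source fields, and re-upload the files...
# Loading from GCS is also free:
bq load --source_format=CSV mydataset.dest_table gs://my-bucket/transformed-*.csv ./dest_schema.json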

Related

Adding a CSV with fewer columns than the schema to BigQuery

I have a table in BigQuery with 100 columns. Now I want to append more rows to it via Transfer, but the new CSV has only 99 columns. How should I proceed?
I tried creating a schema and adding that column as NULLABLE, but it didn't work.
I presume your CSV file is stored in a GCS bucket and you are trying to use the BQ Data Transfer service to load data periodically on a schedule.
You cannot directly load/append the data into the BQ table because of the schema mismatch.
As an alternative, create a staging table named staging_table_csv with 99 columns and schedule a Data Transfer service run to load the CSV into this table in overwrite mode.
Then write a query to append the contents of staging_table_csv to the target BQ table.
The query might look like this:
#standardSQL
INSERT INTO `project.dataset.target_table`
SELECT
*,
<DEFAULT_VALUE> AS COL100
FROM
`project.dataset.staging_table_csv`
Now schedule this query to run after the staging table is loaded.
Make sure to keep a buffer between the staging table load and the target table load; you can run a few trials to find a suitable buffer.
For example, if the transfer is scheduled at 12:00, schedule the target table load query at 12:05 or 12:10.
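For illustration, such a scheduled query could be created with the bq CLI (a sketch only; the display name and schedule string are examples, and the exact flags should be checked against your CLI version):
bq query \
--use_legacy_sql=false \
--display_name="Append staging to target" \
--schedule="every day 12:05" \
'INSERT INTO `project.dataset.target_table`
SELECT *, <DEFAULT_VALUE> AS COL100
FROM `project.dataset.staging_table_csv`'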
Note: Creating an extra staging table will incur storage costs, but since it is overwritten on each load, no historical data cost is incurred.

Get table names in dataset after dataset truncate

It seems that the BigQuery CLI supports restoring tables in a dataset after they have been deleted by using BigQuery Time Travel functionality -- as in:
bq cp dataset.table@TIME_AGO_UNIX dataset.table
However, this assumes we know the names of the tables. I want to write a script to iterate over all the tables that were in the dataset at TIME_AGO_UNIX time.
How would I go about finding those tables at that time?
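One partial workaround while that remains open: if the table names are available from another source (for example, a bq ls listing saved before the deletion, or Cloud Audit Logs), the restore loop itself is easy to script. A minimal sketch, where TIME_AGO_UNIX and tables.txt are placeholders:
TIME_AGO_UNIX=1613000000000  # snapshot time, milliseconds since the epoch (example value)
while read -r t; do
  # Restore each table from its time-travel snapshot via the @ decorator.
  bq cp "dataset.${t}@${TIME_AGO_UNIX}" "dataset.${t}"
done < tables.txt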

Why Spark-BigQuery creates extra tables in the dataset

I am running a Spark (Scala) serverless Dataproc job that reads and writes data from/to BigQuery.
Here is the code that writes the data :
df.write.format("bigquery").mode(SaveMode.Overwrite).option("table", "table_name").save()
Everything works fine, but these extra tables get created in my dataset in addition to the final table. Do you know why, and what I can do so I won't have them?
Those tables are created as the result of view materialization or of loading the result of a query. They have an expiration time of 24 hours, configurable via the materializationExpirationTimeInMinutes option.
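If you need the temporary tables to expire sooner, one possible approach (an assumption, not verified against your connector version) is to set the option globally through Spark properties, since connector options can reportedly be prefixed with spark.datasource.bigquery.; for a serverless Dataproc batch that might look like:
# --class and --jars values are placeholders for your own job:
gcloud dataproc batches submit spark \
--class=com.example.MyJob \
--jars=gs://my-bucket/my-job.jar \
--properties=spark.datasource.bigquery.materializationExpirationTimeInMinutes=60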

Duplicate several tables in bigquery project at once

In our BQ export schema, we have one table for each day, as per the screenshot below.
I want to copy the tables before a certain date (2021-Feb-07). I know how to copy one day at a time via the UI, but is there a way to use the Cloud Console to script copying the selected date range all at once? Or maybe a SQL command directly from a query window?
I think you should transform your sharded tables into a partitioned table, so you can handle them with just a single query. As mentioned in the official documentation, partitioned tables perform better.
To make the conversion, you can just execute the following commands in the console.
bq partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration 259200 \
mydataset.sourcetable_ \
mydataset.mytable_partitioned
This will turn your sharded tables sourcetable_(xxx) into a single partitioned table mytable_partitioned, which can be queried across your entire set of data entries with a single query.
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2022-01-01') AND TIMESTAMP('2022-01-03')
For more details about the conversion command, you can check this link. I also recommend checking the links about querying partitioned tables and about partitioned tables.
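If you would rather keep the shards and only copy the ones before 2021-02-07, as originally asked, a bash loop over bq cp can also do it (a sketch; it assumes the shards are named sourcetable_YYYYMMDD and copies them into a hypothetical mydataset_backup):
for t in $(bq ls --max_results=10000 mydataset | awk '{print $1}' | grep -E '^sourcetable_[0-9]{8}$'); do
  # Lexicographic comparison works because the suffix is fixed-width YYYYMMDD.
  if [[ "${t#sourcetable_}" < "20210207" ]]; then
    bq cp "mydataset.${t}" "mydataset_backup.${t}"
  fi
done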

PXF Hive Plugin, to select only the columns selected in the query

Is there a way for PXF to select only the columns used in the query, apart from Hive partition filtering?
I have data stored in Hive ORC format and use PXF external tables to execute queries in HAWQ. The biggest tables are stored in Hive, and we cannot make another copy of the data in HAWQ.
Thanks--
P.S - Does the query optimizer collect stats on external tables in HAWQ 2.0?
You can always run a select foo from bar type query on external tables in HAWQ. However, if your question is whether PXF actually does column projection to avoid reading all the columns, then the answer is no. Currently PXF reads all columns from an ORC file and returns the records to HAWQ, which then does the projection filtering on its end. However, https://issues.apache.org/jira/browse/HAWQ-583 is actively being worked on and should be released in an upcoming version of HAWQ; it will push column projection down to ORC to improve read performance of ORC files.
Yes, the query optimizer does collect statistics on external tables; this is also handled by PXF. However, it only works for some data sources: https://issues.apache.org/jira/browse/HAWQ-44
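To check whether statistics were actually collected for a given external table, one hedged option (assuming psql access to HAWQ and that the table's PXF profile supports statistics collection; mydb and ext_hive_orc are placeholders):
psql -d mydb -c "ANALYZE ext_hive_orc;"
# HAWQ is PostgreSQL-derived, so collected stats show up in pg_class:
psql -d mydb -c "SELECT relname, reltuples, relpages FROM pg_class WHERE relname = 'ext_hive_orc';"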