BigQuery bq command - load only if table is empty or doesn't exist - google-bigquery

I'm executing a load command with bq, e.g:
bq load ds.table gs://mybucket/data.csv dt:TIMESTAMP,f1:INTEGER
I would like to load the data only if the table is empty or doesn't exist.
Is it possible?
EDIT:
Basically I would like the WRITE_EMPTY API option via the bq command line tool:
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.writeDisposition
If the table already exists and contains data, a 'duplicate' error is returned in the job result.

If you check bq.py, which contains the source code for the BigQuery CLI, you'll find that the _Load() method doesn't expose the WRITE_EMPTY API option. It's either the default WRITE_APPEND or the optional WRITE_TRUNCATE.
As you indicate, the API does support WRITE_EMPTY - if you want to see this as an option on the CLI, you can submit a feature request at https://code.google.com/p/google-bigquery/issues/list?q=label:Feature-Request
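If you need WRITE_EMPTY today, one workaround is to skip the CLI and call the load job API directly. Below is a rough sketch using curl against the jobs.insert endpoint; my-project is a placeholder project id, and it assumes you are authenticated with gcloud:
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/jobs" \
  -d '{
    "configuration": {
      "load": {
        "sourceUris": ["gs://mybucket/data.csv"],
        "destinationTable": {"projectId": "my-project", "datasetId": "ds", "tableId": "table"},
        "schema": {"fields": [
          {"name": "dt", "type": "TIMESTAMP"},
          {"name": "f1", "type": "INTEGER"}
        ]},
        "writeDisposition": "WRITE_EMPTY"
      }
    }
  }'
With writeDisposition set to WRITE_EMPTY, the job fails with a 'duplicate' error if the table already contains data, which is exactly the behaviour described in the docs linked above.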

You can use the BQ command-line tool.
Get Table Information
bq show <project_id>:<dataset_id>.<table_id>
List tables
bq ls [project_id:][dataset_id]
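Putting those together, a rough shell sketch that only runs the load when the table is missing or reports zero rows; it assumes bq show exits non-zero for a missing table, that numRows appears in the prettyjson output, and that python3 is available for the JSON parsing:
if ! bq show ds.table >/dev/null 2>&1; then
  # table does not exist yet, safe to load
  bq load ds.table gs://mybucket/data.csv dt:TIMESTAMP,f1:INTEGER
else
  rows=$(bq show --format=prettyjson ds.table | python3 -c "import sys, json; print(json.load(sys.stdin).get('numRows', '0'))")
  if [ "$rows" = "0" ]; then
    # table exists but is empty
    bq load ds.table gs://mybucket/data.csv dt:TIMESTAMP,f1:INTEGER
  fi
fi
Note this check-then-load is not atomic the way WRITE_EMPTY is; another job could write to the table between the check and the load.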

Related

materialized view manual refresh

I was experimenting with BigQuery materialized view manual refresh, but when I try to call the procedure as per the documentation
bq query --use_legacy_sql=false "CALL BQ.REFRESH_MATERIALIZED_VIEW('my-project-id.my-dataset-name.mv_name')"
I get back :
Error in query string: Error processing job 'project-id:bqjob_r508c7d13330d5fcd_0000017801fd2694_1': Not found: Dataset my-project-id.my-dataset-name was not found in location US at [1:6]
my-project-id is my GCP project, my-dataset-name is the dataset which is not in the US.
Note I get the same error when using the Query window from the web console.
I haven't been able to find any documentation on the stored procedure itself, though it seems to take only one argument. I had hoped it would be able to figure out the location to run in from the project id and dataset supplied.
We don't support routing for materialized view refresh right now. If this would be valuable to you, please submit a feature request in our public issue tracker. In the meantime you can refresh your views by explicitly picking a location in the UI or by using the --location flag on the command line.
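For example, if the dataset lives in the EU (the location value here is just an example; use whatever region your dataset is actually in):
bq query --use_legacy_sql=false --location=EU "CALL BQ.REFRESH_MATERIALIZED_VIEW('my-project-id.my-dataset-name.mv_name')"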
I think you might want to use backticks (`) instead of single quotes (') around the dataset.table name.
Try the following on the command line (note the outer single quotes, so the shell doesn't treat the backticks as command substitution):
bq query --use_legacy_sql=false 'CALL BQ.REFRESH_MATERIALIZED_VIEW(`my-project-id.my-dataset-name.mv_name`)'
or, in the BQ console:
CALL BQ.REFRESH_MATERIALIZED_VIEW(`my-project-id.my-dataset-name.mv_name`)

How to load data into BigQuery from command line with WRITE_EMPTY flag?

I'm loading CSV data into BigQuery from the command line. I would like to prevent the operation from occurring if the table exists already. I do not want to truncate the table if it exists, and I do not want to append to it.
It seems that there is no command line option for this.
However, I feel like I might be missing something. Is it truly impossible to do this from the command-line interface?
A possible workaround for this is to use bq cp, as follows:
Upload your data to a staging (side) table, truncating it on each upload:
bq --location=US load --replace --autodetect --source_format=CSV dataset.dataRaw ./dataRaw.csv
Copy the data to your target table using bq cp, which supports a no-clobber flag (-n):
bq --location=US cp -n dataset.dataRaw dataset.tableNotToOverWrite
If the target table already exists, you get the following message:
Table 'project:dataset.table' already exists, skipping
I think you are right that the CLI doesn't support WRITE_EMPTY mode right now.
You may file a feature request to get it prioritized.

Command to get the sql query of a BigQuery view

We have created a large number of views in BigQuery using standard SQL. Now we need to check these views for correctness.
Is there a bq command to get the SQL query with which these views were created?
Such a command would save the manual effort of checking each view.
Use the show command with the view flag.
e.g. bq show --view <project>:<dataset>.<view>
You can also use the --format=prettyjson flag (instead of --view) so you can easily get the query content when running a script, for example:
bq show --format=prettyjson <project>:<dataset>.<view>
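Since you have a large number of views, you could also script this. Here is a rough sketch that dumps the query of every view in a dataset; mydataset is a placeholder, and it assumes jq is installed:
for v in $(bq ls --format=json mydataset | jq -r '.[] | select(.type=="VIEW") | .tableReference.tableId'); do
  echo "-- $v"
  bq show --format=prettyjson mydataset.$v | jq -r '.view.query'
done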

Google BigQuery: How to use gsutil to either remove or overwrite a table?

I have a program which downloads some data from the web and saves it as a CSV, and then uploads that data to a Google Cloud Storage bucket. Next, that program uses gsutil to create a new Google BigQuery table by concatenating all the files in the bucket. To do the concatenation I run this command in a command prompt:
bq load --project_id=ib-17 da.hi gs://ib/hi/* da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT
The issue is that for some reason this command appends to the existing table, so I get a lot of duplicate data. The question is: how can I either delete the table first or overwrite it on each load?
If I understood your question correctly, you can delete and recreate the table with:
bq rm -f -t da.hi
bq mk --schema da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT -t da.hi
Another possibility is to use the --replace flag, such as:
bq load --replace --project_id=ib-17 da.hi gs://ib/hi/*
I believe this flag corresponds to the writeDisposition setting in the API; --replace gives you WRITE_TRUNCATE behaviour on the CLI.
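If you also want to keep the explicit schema on the load (for example, so the command still works when the table does not exist yet), --replace should combine with it; a sketch based on the schema from the question:
bq load --replace --project_id=ib-17 da.hi gs://ib/hi/* da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT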

loading avro files with different schemas into one bigquery table

I have a set of Avro files with slightly varying schemas which I'd like to load into one BQ table.
Is there a way to do that with one command? Any automatic way of handling the schema differences would be fine for me.
Here is what I tried so far.
0) If I try to do it in a straightforward way, bq fails with an error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the following error: EOF reached
1) Quick googling shows that there is a --schema_update_option=ALLOW_FIELD_ADDITION option which, when added to the bq load job, changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) Actually, the schema id is mentioned in the file name, so the files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more than one asterisk (as the bq manual also states):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list happens to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use the files as an external table and list them explicitly in the external table definition, I also get a "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy the files to different folders and then process them folder by folder, and this is what I'm doing now as a last resort, but this is only a small part of the data processing pipeline, and copying is not acceptable as a production solution.
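For reference, a rough sketch of that folder-by-folder last resort, assuming the schema ids really are embedded in the file names as shown and that ALLOW_FIELD_ADDITION covers the actual differences between the schemas:
for schema in schemaA schemaB; do
  # stage one schema's files under their own prefix so a single trailing wildcard is enough
  gsutil -m cp "gs://mybucket/logs/*_${schema}_*" "gs://mybucket/staging/${schema}/"
  bq load --source_format=AVRO --schema_update_option=ALLOW_FIELD_ADDITION \
    myproject:mydataset.logs "gs://mybucket/staging/${schema}/*"
done
The gsutil -m flag just parallelises the copy; the staging prefix exists only so that bq load can use a single wildcard per schema.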