bq command or Dataflow for as-is load - google-bigquery

I have a simple delimited file on GCS. I need to load that file as is (without transformation) into a BigQuery table. We could use either Dataflow or the bq command line utility to load the file into the BigQuery table. I need to understand which is the better option: Dataflow or the bq command line utility. Please consider factors like cost, performance, etc. before providing your valuable inputs.

Running a BigQuery load using Dataflow or running it using the bq command line is the same in terms of cost on the BigQuery side, since load jobs are free either way. Using bq load directly should be easier if you don't need to process the data.
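For reference, a minimal bq load sketch for this kind of as-is load; the bucket, dataset and table names below are placeholders, and the file is assumed to be comma-delimited:
# Load the file straight from GCS with an autodetected schema (all names are placeholders)
bq load --source_format=CSV --autodetect --field_delimiter=',' \
    mydataset.mytable gs://my-bucket/path/file.csv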

Related

How to load data into BigQuery from command line with WRITE_EMPTY flag?

I'm loading CSV data into BigQuery from the command line. I would like to prevent the operation from occurring if the table exists already. I do not want to truncate the table if it exists, and I do not want to append to it.
It seems that there is no command-line option for this.
However, I feel like I might be missing something. Is this truly impossible to do from the command-line interface?
A possible workaround is to use bq cp, as follows:
Upload your data to a side table, truncating it on each upload:
bq --location=US load --replace --autodetect --source_format=CSV dataset.dataRaw ./dataRaw.csv
Copy the data to your target table using bq cp, which supports a no-clobber flag (-n):
bq --location=US cp -n dataset.dataRaw dataset.tableNotToOverWrite
If the table exists you get the following error:
Table 'project:dataset.table' already exists, skipping
I think you are right that the CLI doesn't support WRITE_EMPTY mode right now.
You may file a feature request to get it prioritized.
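Another rough workaround from the shell is to only run the load when the table does not already exist; a sketch with placeholder dataset, table and file names (note this is not atomic, unlike a real WRITE_EMPTY disposition):
# Emulate WRITE_EMPTY: bq show exits non-zero when the table is missing
if ! bq show mydataset.mytable >/dev/null 2>&1; then
  bq load --autodetect --source_format=CSV mydataset.mytable ./dataRaw.csv
fi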

Most reliable format for large BigQuery load jobs

I have a 100 GB table that I'm trying to load into Google BigQuery. It is stored as a single 100 GB Avro file on GCS.
Currently my bq load job is failing with an unhelpful error message:
UDF worker timed out during execution.; Unexpected abort triggered for
worker avro-worker-156907: request_timeout
I'm thinking of trying a different format. I understand that BigQuery supports several formats (Avro, JSON, CSV, Parquet, etc.) and that in principle one can load large datasets in any of them.
However, I was wondering whether anyone here might have experience with which of these formats is most reliable / least prone to quirks in practice when loading into BigQuery?
I would probably solve this by following these steps:
Create a ton of small files in CSV format.
Send the files to GCS.
Command to copy the files to GCS:
gsutil -m cp <local folder>/* gs://<bucket name>
The gsutil -m option performs the copy in parallel (multi-threaded/multi-processing).
After that, move the data from GCS to BigQuery using the default Cloud Dataflow template (GCS Text to BigQuery). Remember that with a default template you don't need to write any code.
Here is an example of invoking the Dataflow template:
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
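If you'd rather stay with plain bq load instead of Dataflow, a single load job can also pick up all the shards at once through a wildcard URI; a rough sketch, with bucket, table and schema-file names as placeholders:
# Load every CSV shard in one job; assumes the files have a header row and that
# ./schema.json describes the columns (all names are placeholders)
bq load --source_format=CSV --skip_leading_rows=1 \
    mydataset.mytable "gs://<bucket name>/shards/*.csv" ./schema.json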

Loading Avro files with different schemas into one BigQuery table

I have a set of avro files with slightly varying schemas which I'd like to load into one bq table.
Is there a way to do that with one line? Every automatic way to handle schema difference would be fine for me.
Here is what I tried so far.
0) If I try to do it in a straightforward way, bq fails with error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached
1) Quick googling shows that there is a --schema_update_option=ALLOW_FIELD_ADDITION option which, when added to the bq load job, changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) Actually, schema id is mentioned in the file name, so files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more than one asterisk (as the bq manual states too):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list happens to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use the files as an external table and list them explicitly in the external table definition, I also get a "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy the files to different folders and then process them folder by folder, and this is what I'm doing now as a last resort, but this is only a small part of the data processing pipeline, and copying is not acceptable as a production solution.
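One possible way around the "too many files" limit in step 3 is to batch the per-schema loads so each job gets a shorter URI list; a rough, untested sketch (batch size and temp paths are assumptions):
# Batch the schema-A files so each load job gets a short, comma-separated URI list
gsutil ls 'gs://mybucket/logs/*_schemaA_*' | split -l 100 - /tmp/schemaA_batch_
for batch in /tmp/schemaA_batch_*; do
  uris=$(paste -sd, "$batch")   # comma-separated, no spaces
  bq load --source_format=AVRO --schema_update_option=ALLOW_FIELD_ADDITION \
      myproject:mydataset.logs "$uris"
done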

Google BigQuery - Bulk Load

We have a CSV file with 300 columns; its size is approx 250 MB. We are trying to upload it to BQ through the Web UI, but specifying the schema is hard work. I was anticipating that BQ would identify the file headers, but it doesn't seem to recognise them unless I am missing something. Is there a way forward?
Yes, you have to write the schema on your own; the Web UI is not able to auto-infer it. If you have 300 columns, I suggest writing a script to automatically create the schema.
With the command-line tool, if some lines have a wrong/different schema, you can use the following option to continue loading the other records:
--max_bad_records : the maximum number of bad rows to skip before the load job fails
In your case, if you also want to skip the first line of headers, that can look like the following:
bq load --skip_leading_rows=1 --max_bad_records=10000 <destination_table> <data_source_uri> [<table_schema>]
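For the schema-generating script, a minimal shell sketch could look like this (file and table names are placeholders, every column is typed as STRING for simplicity, and the header names are assumed to already be valid BigQuery column names):
# Build an all-STRING schema string from the CSV header row (placeholders throughout)
header=$(head -n 1 ./data.csv)
schema=$(echo "$header" | tr ',' '\n' | sed 's/$/:STRING/' | paste -sd, -)
bq load --skip_leading_rows=1 --max_bad_records=10000 mydataset.mytable ./data.csv "$schema"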

Loading multiple files

The following is working as expected.
./bq --nosync load -F '^' --max_bad_records=30000 myvserv.xa one.txt ip:string,cb:string,country:string,telco_name:string, ...
1) But how do I send two csv files, one.txt and two.txt, in the same command?
2) Can I not cat a file and then pipe | it to the bq command?
3) What does --nosync mean?
Unfortunately, you can't (yet) upload two files with the same command; you'll have to run bq twice. (If you're loading data from Google Cloud Storage, though, you can specify multiple gs:// URLs separated by commas.)
Nope, bq doesn't (yet) support reading upload data from stdin, though that's a great idea for a future version.
If you just run "bq load", bq will create a load job on the server and then poll for completion. If you specify the --nosync flag, it will just create the load job and then exit without polling. (If desired, you can poll for completion separately using "bq wait".)
For 1), as Jeremy mentioned, you can't import two local files at once in the same command. However, you can start two parallel loads to the same table -- loads are atomic, and append by default, so this should do what you want and may be faster than importing both in a single job since the uploads will happen in parallel.
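To make the parallel-load suggestion concrete, something like the following rough sketch should work from the shell (the schema is cut down to the first few columns from the question):
# Start one append load job per local file, in parallel
bq load -F '^' --max_bad_records=30000 myvserv.xa one.txt ip:string,cb:string,country:string,telco_name:string &
bq load -F '^' --max_bad_records=30000 myvserv.xa two.txt ip:string,cb:string,country:string,telco_name:string &
wait   # block until both loads finish (or use --nosync and bq wait <job_id> instead)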