I have a 100 GB table that I'm trying to load into Google BigQuery. It is stored as a single 100 GB Avro file on GCS.
Currently my bq load job is failing with an unhelpful error message:
UDF worker timed out during execution.; Unexpected abort triggered for
worker avro-worker-156907: request_timeout
I'm thinking of trying a different format. I understand that BigQuery supports several formats (Avro, JSON, CSV, Parquet, etc.) and that in principle one can load large datasets in any of these formats.
However, I was wondering whether anyone here has experience with which of these formats is most reliable / least prone to quirks in practice when loading into BigQuery?
You can probably solve this by following these steps:
Create a ton of small files in CSV format (sketched below, after the gsutil command)
Send the files to GCS.
Command to copy the files to GCS:
gsutil -m cp <local folder>/* gs://<bucket name>
The -m option makes gsutil perform the copy in parallel (multi-threaded/multi-processing).
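A minimal sketch of those first two steps, assuming the table has already been exported to one large local CSV with no header row and no embedded newlines; big_table.csv, the chunk size, and the bucket name are placeholders:
# GNU split: cut the export into ~1,000,000-line chunks named chunk_00.csv, chunk_01.csv, ...
split -l 1000000 -d --additional-suffix=.csv big_table.csv chunk_
# Copy all chunks to GCS in parallel.
gsutil -m cp chunk_*.csv gs://<bucket name>/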
After that, move the data from GCS to BigQuery using the default Cloud Dataflow template (link). (Remember that with a default template you don't need to write any code.)
Here is an example of invoking the Dataflow template (link):
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
Related
I am using Google Cloud Dataprep for processing data stored in BigQuery. I am having an issue where Dataprep/Dataflow creates a new dataset with a name starting with "temp_dataset_beam_job_".
It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep creates. This is an issue, as BigQuery becomes messy very quickly with all these flows.
This has not been an issue in the past.
A similar issue has been described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609
Is there any way of not creating temporary datasets, or instead creating them in a Cloud Storage folder?
I wrote a cleanup script that I am running in Cloud Run (see this article) using Cloud Scheduler.
Below is the script:
#!/bin/bash
PROJECT={PROJECT_NAME}
# get list of datasets with temp_dataset_beam
# optional: write list of files to cloud storage
obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
bq ls --max_results=100 | grep "temp_dataset_beam" | gsutil -q cp -J - "${obj}"
datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")
for dataset in $datasets
do
echo $PROJECT:$dataset
# WARNING: Uncomment the line below to remove datasets
# bq rm --dataset=true --force=true $PROJECT:$dataset
done
I solved this in Dataprep directly by running a SQL script post data publish, which runs after each job. You can set this in Dataprep in the output's Manual Settings.
FOR drop_statement IN
(SELECT CONCAT("drop table `<project_id>.", table_schema, ".", table_name, "`;") AS value
FROM <dataset>.INFORMATION_SCHEMA.TABLES -- or region.INFORMATION_SCHEMA.TABLES
WHERE table_name LIKE "Dataprep_%"
ORDER BY table_name DESC)
DO
EXECUTE IMMEDIATE(drop_statement.value); -- here the table is dropped
END FOR;
I'm having trouble loading huge data into BigQuery.
In GCS, I have a huge number of files like this:
gs://bucket/many_folders/yyyy/mm/dd/many_files.gz
I want to load them into BigQuery, so first I tried:
bq load --source_format=NEWLINE_DELIMITED_JSON \
    --ignore_unknown_values \
    --max_bad_records=2100000000 \
    --nosync \
    project:dataset.table \
    gs://bucket/* \
    schema.txt
which failed because it exceeded the "max_bad_records" limit (the files are an aggregation of many types of logs, so they cause many errors).
Then I calculated and found that I need to run a separate load per folder using "*", like:
bq load --source_format=NEWLINE_DELIMITED_JSON \
    --ignore_unknown_values \
    --max_bad_records=2100000000 \
    --nosync \
    project:dataset.table \
    gs://bucket/many_folders/yyyy/mm/dd/* \
    schema.txt
because of the max_bad_records limitation.
But I found it is very slow (because of the parallel-run limitation in BigQuery). It also exceeds the daily load job limit. I would prefer not to go with this option.
Any idea for solving this situation? I want to load this data as fast as I can.
Thank you for reading.
I solved it by loading GCS data as one column.
Then as a next step I parsed the data.
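A minimal sketch of that approach (not from the original answer): load the files as CSV with a field delimiter that never appears in the data, so each line lands in a single STRING column, then parse it with SQL. The table name raw_lines and the JSON field names are placeholders, and the lines are assumed to be JSON.
# Load every line into one STRING column; the \x01 delimiter and empty quote
# character stop bq from splitting or interpreting the line.
bq load --source_format=CSV --field_delimiter=$'\x01' --quote="" \
    project:dataset.raw_lines \
    "gs://bucket/many_folders/yyyy/mm/dd/*" \
    line:STRING

# Parse the raw lines afterwards (field names are made up for illustration):
bq query --use_legacy_sql=false '
SELECT JSON_EXTRACT_SCALAR(line, "$.user_id") AS user_id
FROM dataset.raw_lines
WHERE JSON_EXTRACT_SCALAR(line, "$.type") = "event"'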
A very basic question, but I'm not able to figure it out. Please help me out.
Q1: When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
Q2: Let's say my data directory for the customer files is gs://sp2040/raw/cards/cust/. The table structure defined is:
bq mk --time_partitioning_type=DAY market.cust \
custid:string,grp:integer,odate:string
Every day I create a new directory in the bucket, such as 20170101, 20170102, ... to load a new dataset. So after the data is loaded into this bucket, do I need to fire the queries below?
D1:
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
D2:
bq load --source_format=CSV 'market.cust$20170102' \
gs://sp2040/raw/cards/cust/20170102/20170102_cust.csv
When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
Nope! BigQuery does not use Cloud Storage for storing its data (unless it is a federated table linked to Cloud Storage).
Check BigQuery Under the Hood with Tino Tereshko and Jordan Tigani - you will like it
Do I need to fire the queries below?
Yes, you need to load those files into BigQuery so you can query the data.
Yes, you would need to load the data into BigQuery using those commands.
However, there are a couple of alternatives
PubSub and Dataflow: You could configure PubSub to watch your Cloud Storage bucket and create a notification when files are added, as described here. You could then have a Dataflow job that imports the files into BigQuery. Dataflow documentation
BigQuery external tables: BigQuery can query CSV files that are stored in Cloud Storage without importing the data, as described here. There is wildcard support for filenames, so it could be configured once. Performance might not be as good as storing the data directly in BigQuery. A sketch follows below.
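A minimal sketch of the external-table option, using the paths from the question; the definition file and table name (cust_def.json, market.cust_external) are made up, and the schema is autodetected here as an assumption:
# Build a table definition over a wildcard URI (only one "*" is allowed),
# then create an external table that reads the CSV files in place.
bq mkdef --source_format=CSV --autodetect \
    "gs://sp2040/raw/cards/cust/*" > cust_def.json
bq mk --external_table_definition=cust_def.json market.cust_external

# Query it like a normal table; the files stay in Cloud Storage.
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM market.cust_external'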
I have a set of avro files with slightly varying schemas which I'd like to load into one bq table.
Is there a way to do that with one line? Every automatic way to handle schema difference would be fine for me.
Here is what I tried so far.
0) If I try to do it in a straightforward way, bq fails with error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached
1) Quick googling shows that there is a --schema_update_option=ALLOW_FIELD_ADDITION option which, when added to the bq load job, changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) The schema id is actually mentioned in the file name, so the files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more than one asterisk (as is also stated in the bq manual):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list happens to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use the files as an external table and list the files explicitly in the external table definition, I also get a "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy the files to different folders and then process them folder by folder, and this is what I'm doing now as a last resort, but it is only a small part of the data processing pipeline, and copying is not acceptable as a production solution.
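For reference, a minimal sketch (not from the original post) of how the explicit-list idea from point 3 could be batched so each load job stays under the argument and URI limits; the chunk size and temp paths are arbitrary, and bq load appends by default, so all chunks end up in one table:
# List the schema-A files, split the URI list into chunks of 500,
# and run one append load per chunk.
gsutil ls 'gs://mybucket/logs/*_schemaA_*' | split -l 500 - /tmp/uris_schemaA_
for chunk in /tmp/uris_schemaA_*; do
    bq load --source_format=AVRO myproject:mydataset.logs "$(paste -sd, "$chunk")"
done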
The following is working as expected.
./bq --nosync load -F '^' --max_bad_record=30000 myvserv.xa one.txt ip:string,cb:string,country:string,telco_name:string, ...
1) But how do I send two CSV files, one.txt and two.txt, in the same command?
2) Can I not cat a file and then pipe it to the bq command?
3) What does --nosync mean?
Unfortunately, you can't (yet) upload two files with the same command; you'll have to run bq twice. (If you're loading data from Google Cloud Storage, though, you can specify multiple gs:// URLs separated by commas.)
Nope, bq doesn't (yet) support reading upload data from stdin, though that's a great idea for a future version.
If you just run "bq load", bq will create a load job on the server and then poll for completion. If you specify the --nosync flag, it will just create the load job and then exit without polling. (If desired, you can poll for completion separately using "bq wait".)
For 1), as Jeremy mentioned, you can't import two local files at once in the same command. However, you can start two parallel loads to the same table -- loads are atomic, and append by default, so this should do what you want and may be faster than importing both in a single job since the uploads will happen in parallel.
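A minimal sketch of that suggestion; the column list is truncated in the original command, so only the fields that were shown are included here:
# Run the two imports as separate background bq invocations so the uploads
# overlap, then wait for both. Loads append to the table by default.
bq load -F '^' --max_bad_records=30000 myvserv.xa one.txt \
    ip:string,cb:string,country:string,telco_name:string &
bq load -F '^' --max_bad_records=30000 myvserv.xa two.txt \
    ip:string,cb:string,country:string,telco_name:string &
wait   # shell builtin: block until both background loads finish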