BigQuery load compressed data from Cloud Storage - gzip

I have a lot of *.gz files in my Google Cloud Storage.
I want to load those data to BigQuery.
I've tried to execute
bq load --source_format=AVRO projectId:dataset.table gs://bucket/*.gz
But received error
The Apache Avro library failed to parse file gs://bucket/f92d8ae3-6eba-4e35-9fc0-b8f31b4b9881-part-r-00004.gz.
Is it possible to upload compressed data to BigQuery? What is the best practice for these problems?

Compressed Avro files are not supported ...
See more in the Avro format documentation.
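For example, if those .gz objects are gzip-wrapped Avro containers, one workaround is to decompress them back into plain .avro objects and load those instead. A rough Python sketch using the google-cloud-storage client; the bucket and prefix names are placeholders, and it assumes each object fits in memory:

import gzip
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket")

# Rewrite every gzip-wrapped object as a plain Avro container.
for blob in client.list_blobs("bucket", prefix="exports/"):
    if not blob.name.endswith(".gz"):
        continue
    raw = gzip.decompress(blob.download_as_bytes())
    out = bucket.blob(blob.name[:-3] + ".avro")  # strip .gz, add .avro
    out.upload_from_string(raw)

After that, loading gs://bucket/*.avro with --source_format=AVRO should work, since data blocks compressed with DEFLATE or Snappy inside the container are accepted.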

Related

unable to read csv file from synapse spark scala notebook

I am trying to read a CSV file from a Synapse notebook, but it's giving an error that the path does not exist. I have read many documents saying I have to configure something on the blob storage or in the Synapse workspace to read Azure Blob Storage.
Please help here. I am trying to read a CSV file from Blob Storage, not Data Lake Gen2.
If you are receiving a "path does not exist" error, please check whether the path exists on the storage account.
OR
You can try the method below to read a CSV file from a Synapse Spark Python notebook.
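For example, a minimal PySpark sketch for a Synapse notebook, assuming access-key authentication; the storage account, container, key, and file path below are placeholders:

# Configure the access key for the storage account (WASB driver).
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    "<storage-account-access-key>",
)

# Read the CSV directly from Blob Storage.
df = spark.read.csv(
    "wasbs://mycontainer@mystorageacct.blob.core.windows.net/data/sample.csv",
    header=True,       # first row holds column names
    inferSchema=True,  # let Spark infer column types
)
df.show(5)

Here spark is the session that Synapse creates automatically for the notebook.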

is it possible to write/run BigQuery on parquet files on AWS S3?

We want to check BigQuery performance on externally stored Parquet files. These Parquet files are stored on AWS S3. Without transferring the files to GCP, is it possible to write BigQuery queries that can run on the Parquet files stored on AWS S3?
No, this is not possible. BigQuery supports "external tables" where the data exists as files in Google Cloud Storage but no other cloud file store is supported, including AWS S3.
You will need to either copy/move the files from S3 to Cloud Storage and then use BigQuery on them, or use a similar AWS service such as Athena to query the files in situ on S3.
You can use the BigQuery Data Transfer Service for Amazon S3, which allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery, and it supports loading data in Parquet format. In this link you will find the documentation on how to set up an Amazon S3 data transfer.
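A rough sketch of setting up such a transfer with the google-cloud-bigquery-datatransfer Python client; the project, dataset, table, S3 path, credentials, and parameter keys are placeholders and assumptions, so check them against the linked documentation:

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="S3 parquet transfer",
    data_source_id="amazon_s3",
    params={
        "data_path": "s3://my-bucket/path/*.parquet",
        "destination_table_name_template": "my_table",
        "file_format": "PARQUET",
        "access_key_id": "<AWS_ACCESS_KEY_ID>",
        "secret_access_key": "<AWS_SECRET_ACCESS_KEY>",
    },
    schedule="every 24 hours",
)

created = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created transfer config:", created.name)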

Is it possible to load Avro files with Snappy compression into BigQuery?

I know that BigQuery supports Avro file upload and I'm successful in loading Avro file into BigQuery.
Using the command below,
java -jar avro-tools-1.7.7.jar fromjson --codec snappy --schema-file SourceSchema.avsc Source.json > Output.snappy.avro
I have generated an Avro file with Snappy compression and am trying to load it into BigQuery, but the load job fails with the errors below.
Errors:
file-00000000: The Apache Avro library failed to parse file file-00000000. (error code: invalid)
Is it possible to load Avro files with Snappy compression into BigQuery?
BigQuery only supports the DEFLATE and Snappy algorithms for Avro data block compression. From the docs (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro):
Compressed Avro files are not supported, but compressed data blocks
are. BigQuery supports the DEFLATE and Snappy codecs.
Now BigQuery supports Snappy. See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
Compressed Avro files are not supported, but compressed data blocks
are. BigQuery supports the DEFLATE and Snappy codecs.
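For instance, with the google-cloud-bigquery Python client, a Snappy- or DEFLATE-block-compressed Avro file loads like any other Avro file; a minimal sketch with placeholder URI and table names:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

# The container itself is not gzipped; only its data blocks are compressed.
load_job = client.load_table_from_uri(
    "gs://my-bucket/Output.snappy.avro",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # waits for completion and raises on error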

How to export gzipped data into google cloud storage from bigquery

I need to export data from BigQuery to Google Cloud Storage on a daily basis. The data volume is rather big (1 TB). After I export the data to Google Cloud Storage, I need to download it, and this step is very slow. So I am wondering if I can export gzipped data to Google Cloud Storage? That would reduce the data volume and let me download it much more quickly.
Could you give me some advice on this? I didn't find a compression option in the BigQuery API when extracting from BigQuery to Google Cloud Storage.
Thanks in advance!
Now you can export with gzip compression to GCS.
Plus, if the file is greater than 1 GB, you can specify '*' in the destination URI, which splits the output into smaller files.
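A minimal sketch of such a compressed, sharded export with the google-cloud-bigquery Python client; the table, bucket, and path are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig(
    compression=bigquery.Compression.GZIP,
    destination_format=bigquery.DestinationFormat.CSV,
)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    # The '*' shards the export into multiple files when the output is large.
    "gs://my-bucket/export/shard-*.csv.gz",
    job_config=job_config,
)
extract_job.result()  # waits for the export to finish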
Unfortunately, there is no gzip option.
That said, you can use automatic HTTP compression to do the gzip for you when you download the files from Google Cloud Storage. Just add the HTTP headers:
accept-encoding: gzip
user-agent: anything
It may seem strange that you need to define a user-agent header. It is strange to us too. It is a feature common across a number of Google products, designed to avoid bugs in browsers that don't handle compression correctly (see https://developers.google.com/appengine/kb/general?csw=1#compression).
If you're using gsutil to download the files, it will add the compression headers automatically.
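A small sketch of the same idea using the requests library in Python; the bucket and object names, and the assumption of a public object (no auth header), are placeholders for illustration:

import requests

# JSON API media endpoint for a GCS object.
url = "https://storage.googleapis.com/storage/v1/b/my-bucket/o/export.csv?alt=media"
headers = {
    "accept-encoding": "gzip",             # ask GCS to compress the response
    "user-agent": "my-downloader (gzip)",  # required alongside accept-encoding
}

resp = requests.get(url, headers=headers)
resp.raise_for_status()
data = resp.content  # requests transparently decompresses the gzipped response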

how can I upload a gzipped json file to bigquery via the HTTP API?

When I try to upload an uncompressed JSON file, it works fine; but when I try a gzipped version of the same JSON file, the job fails with a lexical error resulting from a failure to parse the JSON content.
I gzipped the JSON file with the gzip command on Mac OS X 10.8, and I have set the sourceFormat to "NEWLINE_DELIMITED_JSON".
Did I do something incorrectly, or should gzipped JSON files be processed differently?
I believe that with a multipart/related request it is not possible to submit binary data (such as a compressed file). However, if you don't want to use uncompressed data, you may be able to use resumable upload.
What language are you coding in? The Python jobs.insert() API takes a media upload parameter, which you should be able to give a filename in order to do a resumable upload (which sends your job metadata and new table data as separate streams). I was able to use this to upload a compressed file.
This is what bq.py uses, so you could look at the source code here.
If you aren't using Python, the Google APIs client libraries for other languages should have similar functionality.
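For reference, a rough sketch of that resumable upload pattern with the google-api-python-client library; the project, dataset, table, and filename are placeholders, and the exact job body is an assumption, so compare it with what bq.py builds:

import google.auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

credentials, project = google.auth.default()
service = build("bigquery", "v2", credentials=credentials)

job_body = {
    "configuration": {
        "load": {
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "autodetect": True,
            "destinationTable": {
                "projectId": project,
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
        }
    }
}

# resumable=True sends the job metadata and the file contents as separate streams.
media = MediaFileUpload("data.json.gz", mimetype="application/octet-stream", resumable=True)
job = service.jobs().insert(projectId=project, body=job_body, media_body=media).execute()
print(job["jobReference"]["jobId"])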
You can upload gzipped files to Google Cloud Storage, and BigQuery will be able to ingest them with a load job:
https://developers.google.com/bigquery/loading-data-into-bigquery#loaddatagcs
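With today's google-cloud-bigquery client, that GCS-based load looks roughly like the sketch below; the bucket, table, and autodetect setting are placeholders, and gzip-compressed newline-delimited JSON is accepted from GCS directly:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the data
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.json.gz",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()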