Is it possible to load Avro files with Snappy compression into BigQuery? - google-bigquery

I know that BigQuery supports Avro file uploads, and I have successfully loaded an Avro file into BigQuery.
Using the command below,
java -jar avro-tools-1.7.7.jar fromjson --codec snappy --schema-file SourceSchema.avsc Source.json > Output.snappy.avro
I generated an Avro file with Snappy compression and tried to load it into BigQuery, but the load job fails with the following error:
Errors:
file-00000000: The Apache Avro library failed to parse file file-00000000. (error code: invalid)
Is it possible to load Avro files with Snappy compression into BigQuery?

BigQuery supports only the DEFLATE and Snappy codecs for compressing Avro data blocks. From the docs (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro):
Compressed Avro files are not supported, but compressed data blocks
are. BigQuery supports the DEFLATE and Snappy codecs.

BigQuery now supports Snappy. See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
Compressed Avro files are not supported, but compressed data blocks
are. BigQuery supports the DEFLATE and Snappy codecs.
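The distinction between a compressed file and compressed data blocks can be checked locally before starting a load job. The following is a minimal sketch, not an official tool: it reads the header of an Avro object container file and reports the block codec stored under the `avro.codec` metadata key, or flags the file as gzip-wrapped. The function name `describe_avro_compression` is made up for illustration, and it assumes the whole header fits in the bytes passed in.

```python
import io

AVRO_MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip member


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("unexpected end of Avro header")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


def _read_bytes(stream):
    """Decode one length-prefixed Avro bytes or string value."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of Avro header")
    return data


def describe_avro_compression(data):
    """Describe how the given file contents are compressed (hypothetical helper)."""
    if data[:2] == GZIP_MAGIC:
        # The whole file is wrapped in gzip: this is a "compressed Avro file".
        return "gzip-compressed file"
    stream = io.BytesIO(data)
    if stream.read(4) != AVRO_MAGIC:
        return "not an Avro container file"
    # The header metadata is an Avro map of string keys to bytes values,
    # written as one or more blocks and terminated by a zero count.
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:
            break
        if count < 0:
            _read_long(stream)  # byte size of the block, not needed here
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    # A missing avro.codec entry means the blocks are uncompressed ("null").
    codec = metadata.get("avro.codec", b"null").decode("utf-8")
    return "Avro container, block codec: " + codec
```

Passing the first few kilobytes of a file such as Output.snappy.avro should be enough in most cases, since the schema and codec live at the start of the file.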

Related

consume gzip files with databricks autoloader

I am currently unable to find a direct way to load .gz files via Autoloader. I can load the files as binary content, but I cannot extract the compressed XML files and process them further in a streaming way.
Therefore, I would like to know if there is a way to consume the content of a gzip file via Databricks Autoloader.

Most straightforward way of inflating gzip memory stream

I have a gzipped file that I need to read and decompress in my application. I just read through the zlib manual, and it appears that the zlib functions are able to operate on memory buffers, but the gzip interface is all file-based.
What is the most common method of dealing with gzipped files like this? Do I need to handle the gzip file format myself, pull out the deflated data, and pass it to the zlib functions?
Note: The reason file-based will not work is because the file is in an archive on a read-only medium, so I can't extract the file first and use the gzip functions from zlib. This is an embedded Linux system.
You need to "read through" the zlib manual again, this time reading through it. inflateInit2() has an option to decompress gzip streams.
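As an illustration of that option, here is a small sketch using Python's `zlib` module, which wraps the same library. It is not the C code the question asks about: the `wbits` argument plays the role of the `windowBits` parameter of inflateInit2(), and a value of 16 + MAX_WBITS tells zlib to expect a gzip header and trailer. The function name `inflate_gzip_chunks` and the sample payload are made up for the example.

```python
import gzip
import zlib


def inflate_gzip_chunks(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # 16 + MAX_WBITS selects gzip framing instead of the raw zlib format.
    inflater = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = inflater.decompress(chunk)
        if out:
            yield out
    # Flush whatever is still buffered once the input is exhausted.
    tail = inflater.flush()
    if tail:
        yield tail


# Simulate a gzipped buffer that arrives from memory in 64-byte pieces.
payload = b"sensor reading\n" * 1000
compressed = gzip.compress(payload)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(inflate_gzip_chunks(pieces))
```

Nothing here touches the file system, so the same pattern applies when the compressed bytes come from an archive on a read-only medium.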

BigQuery load compressed data from Cloud Storage

I have a lot of *.gz files in Google Cloud Storage.
I want to load that data into BigQuery.
I tried running
bq load --source_format=AVRO projectId:dataset.table gs://bucket/*.gz
but received the error
The Apache Avro library failed to parse file gs://bucket/f92d8ae3-6eba-4e35-9fc0-b8f31b4b9881-part-r-00004.gz.
Is it possible to upload compressed data to BigQuery? What is the best practice for this kind of problem?
Compressed Avro files are not supported ...
See more in Avro format

How to export gzipped data into google cloud storage from bigquery

I need to export data from BigQuery to Google Cloud Storage on a daily basis. The data volume is rather large (1TB). After I export the data to Cloud Storage, I need to download it, and this step is very slow. So I am wondering whether I can export gzipped data to Cloud Storage. This would reduce the data volume and let me download the data much faster.
Could you give me some advice on this? I didn't find a compression option in the BigQuery API for extracting from BigQuery to Google Cloud Storage.
Thanks in advance!
You can now export with gzip compression to GCS.
Also, if the exported data is larger than 1GB, you can put a '*' wildcard in the destination URI, which splits the export into smaller files.
Unfortunately, there is no gzip option.
That said, you can use automatic HTTP compression to do the gzip for you when you download the files from Google Cloud Storage. Just add the HTTP headers:
accept-encoding: gzip
user-agent: anything
It may seem strange that you need to define a user-agent header. It is strange to us too. It is a feature common across a number of Google products, designed to avoid bugs in browsers that don't handle compression correctly (see https://developers.google.com/appengine/kb/general?csw=1#compression).
If you're using gsutil to download the files, it will add the compression headers automatically.
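For a client other than gsutil, the two headers can be set by hand. The sketch below uses only Python's standard library and makes several assumptions: the object URL is a placeholder, the object is readable without authentication, and the function names are invented for the example. It builds the request with the headers from the answer and decompresses the body only if the server reports that it actually applied gzip.

```python
import gzip
import urllib.request


def build_gzip_download_request(url):
    """Build a request that asks the server for a gzip-compressed response."""
    return urllib.request.Request(
        url,
        headers={
            "Accept-Encoding": "gzip",
            "User-Agent": "anything",  # value taken from the answer above
        },
    )


def decode_body(body, content_encoding):
    """Return the plain bytes, inflating them if the server used gzip."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


def download(url):
    """Fetch url with compression enabled and return the decoded bytes."""
    request = build_gzip_download_request(url)
    with urllib.request.urlopen(request) as response:
        encoding = response.headers.get("Content-Encoding", "")
        return decode_body(response.read(), encoding)


# Hypothetical object URL, used only to show how the request is built.
example = build_gzip_download_request(
    "https://storage.googleapis.com/example-bucket/export.csv"
)
```

Checking the Content-Encoding response header matters because the server is free to ignore the request and send the data uncompressed.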

how can I upload a gzipped json file to bigquery via the HTTP API?

When I upload an uncompressed JSON file, it works fine, but when I try a gzipped version of the same JSON file, the job fails with a lexical error caused by a failure to parse the JSON content.
I gzipped the JSON file with the gzip command on Mac OS X 10.8 and set the sourceFormat to "NEWLINE_DELIMITED_JSON".
Did I do something incorrectly, or should a gzipped JSON file be processed differently?
I believe that it is not possible to submit binary data (such as the compressed file) using the multipart/related request. However, if you don't want to use uncompressed data, you may be able to use resumable upload.
What language are you coding in? The Python jobs.insert() API takes a media upload parameter, which you should be able to give a filename to in order to do a resumable upload (which sends your job metadata and new table data as separate streams). I was able to use this to upload a compressed file.
This is what bq.py uses, so you could look at the source code here.
If you aren't using python, the googleapis client libraries for other languages should have similar functionality.
You can upload gzipped files to Google Cloud Storage, and BigQuery will be able to ingest them with a load job:
https://developers.google.com/bigquery/loading-data-into-bigquery#loaddatagcs