Querying compressed files using BigQuery federated source - google-bigquery

According to the BigQuery federated source documentation:
[...]or are compressed must be less than 1 GB each.
This would imply that compressed files are supported types for federated sources in BigQuery.
However, I get the following error when trying to query a gz file in GCS:
I tested with an uncompressed file and it works fine. Are compressed files supported as federated sources in BigQuery, or have I misinterpreted the documentation?

Compression mode defaults to NONE and needs to be explicitly specified in the external table definition.
At the time of the question, this couldn't be done through the UI. This is now fixed and compressed data should be automatically detected.
For more background information, see:
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query
The interesting parameter is "configuration.query.tableDefinitions.[key].compression".
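As a rough illustration of where that parameter sits, here is a minimal sketch of a query job body with an external table definition that sets the compression explicitly. The bucket path, the table key "logs" and the helper name are made up for the example; the field names follow the jobs reference linked above, but check them against the current documentation before relying on this.

```python
import json


def gzip_csv_table_definition(gcs_uri):
    """Build an external table definition for a gzip-compressed CSV file.

    Hypothetical helper: the name and the chosen defaults are assumptions
    made for this example only.
    """
    return {
        "sourceUris": [gcs_uri],
        "sourceFormat": "CSV",
        # Compression defaults to NONE, so it is set explicitly here.
        "compression": "GZIP",
        # Let BigQuery infer the schema instead of listing fields by hand.
        "autodetect": True,
    }


# "logs" is the [key] in configuration.query.tableDefinitions.[key];
# the query refers to the external table by that name.
query_job_body = {
    "configuration": {
        "query": {
            "query": "SELECT COUNT(*) FROM logs",
            "tableDefinitions": {
                # Placeholder URI, not a real bucket.
                "logs": gzip_csv_table_definition("gs://example-bucket/data.csv.gz"),
            },
        }
    }
}

print(json.dumps(query_job_body, indent=2))
```

The dictionary is only the request body; it would still need to be sent through the jobs API or a client library of your choice.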

Related

BigQuery Unloading Large Data to a Single GZIP File

I'm using the BigQuery console and was planning to extract a table and put the results into Google Cloud Storage as a GZIP file, but I encountered an error asking me to use a wildcard in the filename. Based on the Google docs, this is a limitation for large volumes of data: the extract needs to be split across multiple files.
https://cloud.google.com/bigquery/docs/exporting-data#console
By any chance is there a workaround so I could have a single compressed file loaded to Google Cloud Storage instead of multiple files? I was using Redshift previously and this wasn't an issue.

Mosaic Decisions Azure BLOB writer node creating multiple files

I’m using the Mosaic Decisions data flow feature to read a file from Azure Blob, do a few transformations and write that data back to Azure. It worked fine, except that in the output file path I gave, it created a folder containing many files with strange names like “part-000”. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic Decisions uses Apache Spark as its backend execution engine. In Spark, the dataframe that is read is split into multiple partitions, and these partitions are written to the output location in parallel. That's the reason it creates multiple files at the target location with "part-0000", "part-0001" etc. ("part" here stands for partition).
The workaround for this is to check "combine-output-files-into-one" in the writer node. This will combine all of the part files into one big file. But use this with caution and only if you really need a single file - as this will come with a performance tradeoff.

Does Google BigQuery support Parquet file format?

I was wondering if Google BigQuery currently supports Parquet file format or if there are plans to support it?
I know that it currently supports CSV and JSON formats.
As of 1st March 2018, support for loading Parquet 1.0 files is available.
In the BigQuery CLI, there is --source_format PARQUET option which is described in output of bq --help.
I never got to use it, because when I was experimenting with this feature, it was still invite-only, and I did not request the invite.
My use case was that the Parquet file is half the size of the Avro file. I wanted to try something new and upload data efficiently (in this order).
% bq load --source_format PARQUET test.test3 data.avro.parquet schema.json
Upload complete.
Waiting on bqjob_r5b8a2b16d964eef7_0000015b0690a06a_1 ... (0s) Current
status: DONE
[...]
At this time BigQuery does not support the Parquet file format. However, we are interested to hear more about your use case: are you interested in import, export or both? How do you intend to use it? Understanding the scenarios better will help the BigQuery team plan accordingly.
If you want to share a file format between BigQuery and Hadoop, you can use newline separated JSON records.
BigQuery supports these for import and export.
Hadoop supports this as well. Searching the internets finds many hits showing recipes for making it work. Here's one: Processing JSON using java Mapreduce
When you are dealing with hundreds of millions of rows and need to move data to an on-premise Hadoop cluster, that is, exporting from BigQuery, JSON is just not a feasible option, and Avro is not much better. The only efficient option today for moving that much data is gz, which unfortunately cannot be read natively in Hadoop. Parquet is the only efficient way for this use case; we do not have any other efficient option.
Example (part-* is the secret sauce here):
bq load --source_format=PARQUET --replace=true abc.def gs://abc/def/part-*

Transfer large file from Google BigQuery to Google Cloud Storage

I need to transfer a large table in BigQuery, 2B records, to Cloud Storage with csv format. I am doing the transfer using the console.
I need to specify a uri including a * to shard the export due to the size of the file. I end up with 400 csv files in Cloud Storage. Each has a header row.
This makes combining the files time consuming, since I need to download the csv files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the size of the combined csv file is about 48GB.
Is there a better approach for this?
Using the API, you will be able to tell BigQuery not to print the header row during the table extraction. This is done by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility should also be able to do that.
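To make the option concrete, here is a minimal sketch of an extract job body with the header row switched off. The project, dataset, table and bucket names are placeholders, and the helper function is invented for this example; only the configuration.extract field names are taken from the jobs API.

```python
import json


def headerless_extract_job_body(project_id, dataset_id, table_id, destination_uri):
    """Build an extract job body that omits the CSV header row.

    Hypothetical helper for illustration; all argument values used below
    are placeholders.
    """
    return {
        "configuration": {
            "extract": {
                "sourceTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # The * lets BigQuery shard a large export into many files.
                "destinationUris": [destination_uri],
                "destinationFormat": "CSV",
                # Without header rows the shards can be concatenated as-is.
                "printHeader": False,
            }
        }
    }


job_body = headerless_extract_job_body(
    "example-project", "example_dataset", "big_table",
    "gs://example-bucket/export/part-*.csv",
)
print(json.dumps(job_body, indent=2))
```

This only builds the request body; submitting it is left to whichever client you already use.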
Once you've done this, concatenating the files is much easier. On a Linux/Mac computer it would be a single cat command. However, you could also try to concatenate directly in Cloud Storage by using the compose operation. See more details here. Composition can be performed either from the API or the command-line utility.
Since composition is limited to 32 components per operation, you will have to compose the files in batches of 32. That should make around 13 composition operations for 400 files. Note that I have never tried the composition operation, so I'm just guessing on this part.
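The batching arithmetic can be checked with a small planning sketch. It assumes one particular strategy: compose the first 32 shards into a target object, then keep composing that target with the next 31 shards until none are left. The shard names and the target name "combined.csv" are invented for the example, and the code only plans the batches; it does not call Cloud Storage.

```python
def plan_compose_batches(shard_names, target_name="combined.csv", limit=32):
    """Plan compose operations, each with at most `limit` source objects.

    Returns a list of source lists, one per compose operation. After the
    first operation, the running target object takes one of the `limit`
    slots, so each later operation adds `limit - 1` new shards.
    """
    if len(shard_names) <= 1:
        return []  # nothing to compose

    batches = [list(shard_names[:limit])]
    position = limit
    while position < len(shard_names):
        next_shards = shard_names[position:position + limit - 1]
        batches.append([target_name] + list(next_shards))
        position += limit - 1
    return batches


# Hypothetical shard names for a 400-file export.
shards = [f"export-{i:012d}.csv" for i in range(400)]
plan = plan_compose_batches(shards)

print(len(plan))                # number of compose operations
print(max(len(b) for b in plan))  # largest batch size
```

With 400 shards this plan needs 13 operations, which agrees with the estimate above. Composing in independent groups of 32 and then merging the intermediate objects would take one operation more, but the groups could run in parallel.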
From the console, use the bq utility to strip the headers:
bq --skip_leading_rows 1

BigQuery Backend Errors during upload operation

I want to know what errors can arise on the BigQuery server side during an upload, even though the .CSV file that I'm uploading contains perfect data. Can you list those errors?
Thanks.
Some of the common errors are:
Files must be encoded in UTF-8 format.
Source data must be properly escaped within standard guidelines for CSV and JSON.
The structure of records and the data within them must match the schema provided.
Individual files must be under the size limits listed on our quota/limits page.
More information about BigQuery source data formats.
Check out our Data Loading cookbook for additional tips.