Does Google BigQuery support the Parquet file format? - google-bigquery

I was wondering if Google BigQuery currently supports Parquet file format or if there are plans to support it?
I know that it currently supports CSV and JSON formats.

Update: As of 1 March 2018, support for loading Parquet 1.0 files is available.
In the BigQuery CLI, there is a --source_format PARQUET option, which is described in the output of bq --help.
I never got to use it, because when I was experimenting with this feature it was still invite-only, and I did not request the invite.
My use case was that the Parquet file is half the size of the Avro file. I wanted to try something new and upload data efficiently (in that order).
% bq load --source_format PARQUET test.test3 data.avro.parquet schema.json
Upload complete.
Waiting on bqjob_r5b8a2b16d964eef7_0000015b0690a06a_1 ... (0s) Current status: DONE
[...]

At this time BigQuery does not support the Parquet file format. However, we are interested to hear more about your use case: are you interested in import, export, or both? How do you intend to use it? Understanding the scenarios better will help the BigQuery team plan accordingly.

If you want to share a file format between BigQuery and Hadoop, you can use newline separated JSON records.
BigQuery supports these for import and export.
Hadoop supports this as well. Searching the internet finds many hits showing recipes for making it work. Here's one: Processing JSON using Java MapReduce
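For reference, a minimal sketch of the round trip with the bq tool (the dataset, table, bucket, and schema file names here are hypothetical):

# Load newline-delimited JSON from Cloud Storage into BigQuery.
bq load --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.events gs://mybucket/events.json ./schema.json

# Export it back out as newline-delimited JSON for the Hadoop side.
bq extract --destination_format=NEWLINE_DELIMITED_JSON \
  mydataset.events 'gs://mybucket/export/events-*.json'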

When you are dealing with hundreds of millions of rows and need to move data to an on-premise Hadoop cluster (that is, exporting from BigQuery), JSON is just not a feasible option, and Avro is not much better. The only efficient option today for such movement of data is gz, which unfortunately cannot be read natively in Hadoop. Parquet is the only efficient format for this use case; we do not have any other efficient option.

Example (part-* is the secret sauce here):
bq load --source_format=PARQUET --replace=true abc.def gs://abc/def/part-*

Related

Bigquery Unloading Large Data to a Single GZIP File

I'm using the BigQuery console and was planning to extract a table and put the results into Google Cloud Storage as a GZIP file, but I encountered an error asking me to wildcard the filename. Based on the Google docs, this appears to be a limitation for large volumes of data, and the extract needs to be split.
https://cloud.google.com/bigquery/docs/exporting-data#console
By any chance is there a workaround so I could have a single compressed file loaded to Google Cloud Storage instead of multiple files? I was using Redshift previously and this wasn't an issue.
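As a point of reference, the sharded export itself looks something like this (hypothetical table and bucket names; the wildcard is what the error message is asking for, since large tables must be exported in multiple pieces):

# Export a large table as multiple GZIP-compressed CSV shards.
bq extract --compression=GZIP --destination_format=CSV \
  mydataset.big_table 'gs://mybucket/export/shard-*.csv.gz'

The shards can afterwards be stitched back together (for example with gsutil compose, or by downloading and concatenating them, since gzip tolerates concatenated members), but there is no single-file export for tables above the size limit.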

Handling dynamic keys while loading JSON data in BigQuery

I have a very big gzip-compressed file of JSON data. Due to some limitations, I am not able to extract and transform the data. The JSON data itself is very dynamic in nature.
For example:
{"name": "yourname", "age": "your age", "schooling": {"high-school-name1": "span of years studied"}}
{"name": "yourname", "age": "your age", "schooling": {"high-school-name2": "span of years studied"}}
The problem is the high-school-name field is a dynamic one, which will be different for different sets of users.
Now when I am uploading to BigQuery, I am not able to determine which type I should specify for the schooling field, or how to handle this upload to BigQuery.
I am using a Cloud Function to automate the flow, so as soon as the file is uploaded to Cloud Storage it triggers the function. As the Cloud Function has very limited memory, there is no way to transform the data there. I have looked into Dataprep, but I am trying to understand if I am missing something that could make what I am trying to do possible without using any other services.
According to the documentation on Loading JSON data from Cloud Storage and Specifying nested and repeated columns, I think you do indeed need a processing step, which could be well covered either with Dataproc or Dataflow.
You can implement a pipeline to transform your dynamic data as needed and write it to BigQuery. This doc might be of interest to you. There is template source code that you can adapt to put JSON into a BigQuery table. Here is the documentation about loading JSON data from Cloud Storage.
Please note that one of the limitations is:
If you use gzip compression BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
This is one of the reasons why I think you have to implement your solution with an additional product, as you mentioned.
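Whichever processing product you pick, the reshaping it has to perform is roughly this: turn the dynamic schooling keys into a repeated record with a fixed schema. A minimal sketch of that transform using jq on a machine with enough resources (file, dataset, table, and field names here are all hypothetical):

# Reshape {"schooling": {"<dynamic key>": "<value>"}} into a repeated
# record [{"school_name": ..., "years": ...}] so the schema is fixed.
gunzip -c data.json.gz \
  | jq -c '.schooling |= [to_entries[] | {school_name: .key, years: .value}]' \
  | gzip > transformed.json.gz

# A matching schema with schooling as a REPEATED RECORD.
cat > schema.json <<'EOF'
[
  {"name": "name", "type": "STRING"},
  {"name": "age", "type": "STRING"},
  {"name": "schooling", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "school_name", "type": "STRING"},
    {"name": "years", "type": "STRING"}
  ]}
]
EOF

bq load --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.users ./transformed.json.gz ./schema.json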

Extract data from MarkLogic 8.0.6 to AWS S3

I'm using MarkLogic 8.0.6 and we also have JSON documents in it. I need to extract a lot of data from MarkLogic and store it in AWS S3. We tried to run mlcp locally and then upload the data to AWS S3, but it's very slow because it generates a lot of files.
Our MarkLogic platform is already connected to S3 to perform backups. Is there a way to extract a specific database to AWS S3?
It can be OK for me if I have one big file with one JSON document per line
Thanks,
Romain.
I don't know about getting it to S3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
s3:// is a natively supported URI scheme in MarkLogic, so you can also iterate through all your docs and export them with xdmp:save("s3://...").
If you want to make aggregates, then you may want to marry this idea with Sam's suggestion of CORB2 to control the process and assist in grouping your whole database into multiple manageable aggregate documents. Then use a post-back task to run xdmp:save.
Thanks guys for your answers. I did not know about CORB2; this is a great solution! But unfortunately, due to bad I/O, I would prefer a solution that writes directly to S3.
I can use a basic MarkLogic query and dump to s3:// with the native connector, but I always face memory errors, even when launching with the "spawn" function to generate a background process.
Do you have any XQuery example to extract each document to S3 one by one without memory errors?
Thanks

Inserting realtime data into BigQuery with a file on Compute Engine?

I'm downloading realtime data into a CSV file on a Google Compute Engine instance and want to load this file into BigQuery for realtime analysis.
Is there a way for me to do this without first uploading the file to Cloud Storage?
I tried this: https://cloud.google.com/bigquery/streaming-data-into-bigquery but since my file isn't in JSON, this fails.
Have you tried the command line tool? You can upload CSVs from it.
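If a small delay is acceptable, a plain load job from the local file works without staging it in Cloud Storage first; a minimal sketch (the table name and inline schema here are hypothetical):

# bq load accepts a local file path, so the CSV can be loaded straight
# from the Compute Engine instance.
bq load --source_format=CSV --skip_leading_rows=1 \
  mydataset.realtime_data ./data.csv \
  timestamp:TIMESTAMP,value:FLOAT

Load jobs are batch operations and subject to daily quotas, so for truly continuous ingestion the streaming insert API is the intended path; rows are sent as JSON in the API request regardless of how the data is stored locally.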

Transfer large file from Google BigQuery to Google Cloud Storage

I need to transfer a large table in BigQuery, 2B records, to Cloud Storage in CSV format. I am doing the transfer using the console.
I need to specify a URI including a * to shard the export due to the size of the file. I end up with 400 CSV files in Cloud Storage. Each has a header row.
This makes combining the files time-consuming, since I need to download the CSV files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the size of the combined CSV file is about 48GB.
Is there a better approach for this?
Using the API, you can tell BigQuery not to print the header row during the table extraction. This is done by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility should also be able to do that.
Once you've done this, concatenating the files is much easier. On a Linux/Mac computer it would be a single cat command. However, you could also try to concatenate directly in Cloud Storage by using the compose operation. See more details here. Composition can be performed either from the API or the command-line utility.
Since the compose operation is limited to 32 components, you will have to compose the files in batches of 32. That should make around 13 composition operations for 400 files. Note that I have never tried the composition operation, so I'm just guessing on this part.
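Put together, the two steps might look like this with the command-line tools (hypothetical table, bucket, and object names; treat this as a sketch rather than a tested recipe):

# Export without the header row in each shard.
bq extract --print_header=false \
  mydataset.big_table 'gs://mybucket/export/part-*.csv'

# Compose shards server-side, at most 32 source objects per call;
# repeat (composing the intermediate objects) until one file remains.
gsutil compose \
  gs://mybucket/export/part-000000000000.csv \
  gs://mybucket/export/part-000000000001.csv \
  gs://mybucket/export/combined-000.csv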
From the command line, you can use the bq utility to skip the header rows when loading the files:
bq load --skip_leading_rows=1