I have a very large gzip-compressed file of JSON data. Due to some limitations, I am not able to extract and transform the data. The JSON data itself is also very dynamic in nature.
For example:
{"name": "yourname", "age": "your age", "schooling": {"high-school-name1": "span of years studied"}}
{"name": "yourname", "age": "your age", "schooling": {"high-school-name2": "span of years studied"}}
The problem is that the high-school-name key is dynamic, so it will be different for different sets of users.
When uploading to BigQuery, I am not able to determine which type I should specify for the schooling field, or how to handle this upload to BigQuery at all.
I am using a Cloud Function to automate the flow, so as soon as the file is uploaded to Cloud Storage it triggers the function. As the Cloud Function has very limited memory, there is no way to transform the data there. I have looked into Dataprep for this, but I am trying to understand whether I am missing something that would make what I am trying to do possible without using any other services.
According to the documentation on Loading JSON data from Cloud Storage and Specifying nested and repeated columns, I think you do indeed need a processing step, which could be well covered either with Dataproc or Dataflow.
You can implement a pipeline that transforms your dynamic data as needed and writes it to BigQuery. This doc might be of interest to you. There is a template with source code that you can adapt to load JSON into a BigQuery table. Here is the documentation about loading JSON data from Cloud Storage.
Please note that one of the limitations is:
If you use gzip compression BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
This is one of the reasons why I think you have to implement your solution with an additional product, as you mentioned.
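For illustration only, a minimal Apache Beam (Python) sketch of such a pipeline could look roughly like the following. It assumes the input is newline-delimited JSON like the example above and flattens the dynamic schooling map into a repeated key/value record; the bucket, project, dataset, table and field names are all placeholders.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Flatten the dynamic {"school name": "years"} map into a repeated record
# so the table schema stays fixed no matter which school names appear.
TABLE_SCHEMA = {
    "fields": [
        {"name": "name", "type": "STRING"},
        {"name": "age", "type": "STRING"},
        {"name": "schooling", "type": "RECORD", "mode": "REPEATED",
         "fields": [
             {"name": "school", "type": "STRING"},
             {"name": "years", "type": "STRING"},
         ]},
    ]
}

def flatten_schooling(line):
    record = json.loads(line)
    record["schooling"] = [{"school": k, "years": v}
                           for k, v in (record.get("schooling") or {}).items()]
    return record

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # ReadFromText decompresses .gz input automatically based on the extension.
     | "Read" >> beam.io.ReadFromText("gs://your-bucket/data.json.gz")
     | "Flatten dynamic keys" >> beam.Map(flatten_schooling)
     | "Write" >> beam.io.WriteToBigQuery(
         "your-project:your_dataset.users",
         schema=TABLE_SCHEMA,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))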
Related
I used the Export Collections to BigQuery extension to export collections from Firestore, but no data is shown; only newly added data is displayed. Is there any way to load the existing data that is already in the Firestore database? Data added after choosing the collection path works, but the old data is not retrieved.
According to the Firestore Export to BigQuery extension documentation:
This extension only sends the content of documents that have been
changed -- it does not export your full dataset of existing documents
into BigQuery.
That said, and with regard to the BigQuery extension guidelines, you can consider two options to get all of a collection's documents exported to BigQuery:
Run the fs-bq-import-collection script, which reads all the documents
within a collection and forces inserts into the BigQuery sink, as explained
here (a hand-rolled alternative is sketched after this list);
Use the Firestore managed export service to export the collection
documents and store them in GCS; these data files can then
be loaded into BigQuery as well.
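If you would rather stay in code than run the import script, a hand-rolled backfill along the lines of option 1 could look roughly like this. It is a sketch only: the collection name, table id and row layout are assumptions that must match your own schema, and very large collections would need chunked inserts.

from google.cloud import bigquery, firestore

fs = firestore.Client()
bq = bigquery.Client()
table_id = "my-project.my_dataset.users_raw"  # placeholder table

rows = []
for doc in fs.collection("users").stream():   # placeholder collection
    # Keep the document id plus a dump of its content; adapt to your schema.
    rows.append({"document_id": doc.id, "data": str(doc.to_dict())})

# Streaming insert; returns a list of per-row errors (empty on success).
errors = bq.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)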
Let me know whether you have any further doubts.
I just cleaned up my Firestore collection data using Dataprep and verified the data via BigQuery. I now want to move the data back to Firestore. Is there a way to do this?
I have used a manual method of exporting to JSON and then uploading it using code provided by AngularFirebase, but it is not automated, and this data needs to be cleaned up periodically.
I am looking for a process within the Google Cloud console. Any help will be appreciated.
This is not an answer, more like a partial answer. I could not add a comment as I don't have 50 reputation yet. I am in a similar boat but not entirely the same situation: I want to use a subset of BigQuery data and add it to Firestore. My thinking is to do the following:
Use the BigQuery API to query the data periodically using BigQuery Jobs' Load (in your case) or Query (in my case)
Convert it to JSON in code
Use batch commits in the Firestore API to update the Firestore database
This is my idea and I am not sure whether it will work, but I will let you know more once I am done with it, unless someone else has better insights to help me and the person asking this question.
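A rough sketch of those three steps with the Python client libraries might look like the following. It assumes the query result carries a doc_id column to use as the Firestore document id; the project, dataset, collection and column names are all placeholders.

from google.cloud import bigquery, firestore

bq = bigquery.Client()
fs = firestore.Client()

rows = bq.query(
    "SELECT * FROM `my-project.my_dataset.cleaned_users`").result()

batch = fs.batch()
count = 0
for row in rows:
    data = dict(row)  # a BigQuery Row behaves like a mapping
    doc_ref = fs.collection("users").document(str(data.pop("doc_id")))
    batch.set(doc_ref, data)
    count += 1
    if count % 500 == 0:  # Firestore batched writes max out at 500 operations
        batch.commit()
        batch = fs.batch()
batch.commit()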
I was wondering if Google BigQuery currently supports Parquet file format or if there are plans to support it?
I know that it currently supports CSV and JSON formats.
Update: As of 1 March 2018, support for loading Parquet 1.0 files is available.
In the BigQuery CLI there is a --source_format PARQUET option, which is described in the output of bq --help.
I never got to use it, because when I was experimenting with this feature, it was still invite-only, and I did not request the invite.
My use case was that the Parquet file is half the size of the Avro file. I wanted to try something new and upload data efficiently (in that order).
% bq load --source_format PARQUET test.test3 data.avro.parquet schema.json
Upload complete.
Waiting on bqjob_r5b8a2b16d964eef7_0000015b0690a06a_1 ... (0s) Current
status: DONE
[...]
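For completeness, the same kind of load can be done with the Python client library. This is only a sketch, assuming the file lives in Cloud Storage rather than locally and that the dataset test already exists; the bucket name is a placeholder.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # schema is read from the file
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.avro.parquet", "test.test3", job_config=job_config)
load_job.result()  # wait for the job to finish
print(client.get_table("test.test3").num_rows, "rows loaded")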
At this time BigQuery does not support the Parquet file format. However, we are interested to hear more about your use case: are you interested in import, export, or both? How do you intend to use it? Understanding the scenarios better will help the BigQuery team plan accordingly.
If you want to share a file format between BigQuery and Hadoop, you can use newline separated JSON records.
BigQuery supports these for import and export.
Hadoop supports this as well. Searching the internet finds many recipes for making it work. Here's one: Processing JSON using Java MapReduce.
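Producing such a file is straightforward; for example, in Python it is just one JSON object per line. The file name and records below are made up for illustration.

import json

records = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 25},
]

with open("records.json", "w") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")  # one JSON object per line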
When you are dealing with hundreds of millions of rows and need to move data to an on-premises Hadoop cluster, that is, exporting from BigQuery, JSON is just not a feasible option, and Avro is not much better. The only efficient option today for such data movement is gzip, which unfortunately cannot be read natively in Hadoop. Parquet is the only efficient way for this use case; we do not have any other efficient option.
Example (part-* is the secret sauce here):
bq load --source_format=PARQUET --replace=true abc.def gs://abc/def/part-*
I've been looking around for a lightweight, scalable solution to enrich a CSV file with additional metadata from a database. Each line in the CSV represents a data item, and the columns contain the metadata belonging to that item.
Basically I have a CSV extract and I need to add additional metadata from a database. The metadata can be accessed via ODBC or REST API call.
I have a number of options in my head but I'm looking for other ideas. My options are as follows:
Import the CSV into a database table, apply the additional metadata with SQL UPDATE statements (finding the necessary metadata with SELECT statements), and then export the data back into CSV format. For this solution I was thinking of using an ETL tool, which may be a bit heavyweight for this problem.
I also thought about a Node.js-based solution where I read the CSV in, call a web service to get the metadata, and write the data back into the CSV file. However, the CSV can be quite large, with potentially tens of thousands of rows, so this could be heavy on memory or, with line-by-line processing, not very performant.
If you have a better solution in mind, please post. Many thanks.
I think you've come up with a couple of pretty good ideas here already.
Running with your first suggestion of using an ETL tool to enrich your CSV files, you should check out https://github.com/streamsets/datacollector
It's a continuous ingestion approach, so you could even monitor a directory of CSV files and load them as you get them. While there's no specific functionality yet for doing lookups in a database, it's certainly possible in a number of ways (including writing your own custom logic in Java, or a script in Python or JavaScript).
*Full disclosure: I work on this project.
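If you end up writing the lookup logic yourself, a streaming approach keeps memory flat even for large files. Here is a rough Python sketch, where the REST endpoint, the item_id key column and the added category/owner columns are all assumptions for illustration.

import csv
import requests

cache = {}  # avoid repeated lookups for the same key

def lookup(item_id):
    if item_id not in cache:
        resp = requests.get(f"https://example.com/metadata/{item_id}", timeout=10)
        resp.raise_for_status()
        cache[item_id] = resp.json()
    return cache[item_id]

with open("items.csv", newline="") as src, \
     open("items_enriched.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["category", "owner"])
    writer.writeheader()
    for row in reader:          # one row at a time, so memory use stays flat
        meta = lookup(row["item_id"])
        row["category"] = meta.get("category", "")
        row["owner"] = meta.get("owner", "")
        writer.writerow(row)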
I need to transfer a large table in BigQuery, 2B records, to Cloud Storage in CSV format. I am doing the transfer using the console.
I need to specify a URI including a * to shard the export due to the size of the file. I end up with 400 CSV files in Cloud Storage, each with a header row.
This makes combining the files time-consuming, since I need to download the CSV files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the size of the combined CSV file is about 48 GB.
Is there a better approach for this?
Using the API, you will be able to tell BigQuery not to print the header row during the table extraction. This is done by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility should also be able to do that.
Once you've done this, concatenating the files is much easier. On a Linux/Mac computer it would be a single cat command. However, you could also try to concatenate directly in Cloud Storage by using the compose operation. See more details here. Composition can be performed either from the API or the command-line utility.
Since a compose operation is limited to 32 components, you will have to compose the files 32 at a time. That makes around 13 compose operations for 400 files. Note that I have never tried the compose operation, so I'm just guessing on this part.
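A rough sketch of both steps with the Python clients is below: extract without the header row, then compose the shards 32 at a time. The bucket, dataset and table names are placeholders, and I have not benchmarked this myself.

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

# 1. Extract the table to sharded CSVs without a header row.
extract_config = bigquery.ExtractJobConfig(print_header=False)
bq.extract_table(
    "my-project.my_dataset.big_table",
    "gs://my-bucket/export/part-*.csv",
    job_config=extract_config,
).result()

# 2. Compose the shards, at most 32 source objects per compose call.
bucket = gcs.bucket("my-bucket")
shards = sorted(bucket.list_blobs(prefix="export/part-"), key=lambda b: b.name)
combined = bucket.blob("export/combined.csv")

combined.compose(shards[:32])
for i in range(32, len(shards), 31):
    # each pass appends up to 31 more shards to the running result
    combined.compose([combined] + shards[i:i + 31])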
From the command line, the bq utility can skip the header rows when loading the CSVs back into BigQuery:
bq load --skip_leading_rows 1
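The equivalent knob exists in the Python client library as well; skip_leading_rows applies when loading the CSVs back into BigQuery, so each shard's header is ignored at load time. The URI and table id below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # ignore the header row of every matched file
)
client.load_table_from_uri(
    "gs://my-bucket/export/part-*.csv", "my_dataset.big_table_copy",
    job_config=job_config,
).result()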