I have a BigQuery table whose data source is a bucket in GCS (Google Cloud Storage).
The bucket changes constantly as new files are added. Is there any mechanism for BigQuery to automatically detect the changes in GCS and sync with the latest data?
Thanks!
There is a very cool beta feature you can use to do that. Check out BigQuery Cloud Storage Transfer. You can schedule transfers, run backfills, and much more.
Read "limitations" to see if it can work for you.
I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped into the bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage into BigQuery, with a scheduling option that loads the most recent files from Cloud Storage and appends the data to the same BigQuery table.
Please help me set this up.
There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery. That means you leave the data in Cloud Storage and have BigQuery read it directly from there. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be read from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning and clustering.
Perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For the new files, forget the old-school scheduled ingestion process. In the cloud, you can be event driven: catch the event that notifies you of a new file on Cloud Storage and load that file directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and the most recommended pattern. You can find a code sample here.
Just a warning about the second solution: you can perform "only" 1,500 load jobs per day and per table (about 1 per minute).
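The event-driven option could be sketched roughly as below. This is a minimal sketch, not the sample the answer linked to: it assumes a background Cloud Function triggered when an object is finalized, the google-cloud-bigquery client library, CSV files with a header row, and a hypothetical table name (TABLE_ID). The helper names gcs_uri and should_load are made up for illustration.

```python
"""Sketch: load each newly finalized CSV object into a BigQuery table."""

# Hypothetical destination table; replace with your own.
TABLE_ID = "my-project.my_dataset.my_table"


def gcs_uri(event):
    """Build the gs:// URI from a Cloud Storage event payload."""
    return f"gs://{event['bucket']}/{event['name']}"


def should_load(event):
    """Only CSV objects are loaded; anything else is ignored."""
    return event["name"].lower().endswith(".csv")


def load_new_file(event, context):
    """Entry point for a function triggered on object finalize."""
    if not should_load(event):
        return None

    # Imported here so the helpers above work without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: each file has a header row
        autodetect=True,      # assumption: schema inference is acceptable
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        gcs_uri(event), TABLE_ID, job_config=job_config
    )
    load_job.result()  # wait, so that a failed load surfaces as an error
    return load_job.job_id
```

Each invocation starts one load job, so with hourly files this stays far below the per-table daily limit mentioned above; a much higher file rate would need batching of several URIs per job.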
We create files ("blobs") in Google Cloud Storage and instruct BigQuery load jobs to load them into a table. The blobs are kept in a shared bucket and there are concurrent jobs loading into target tables. We would like to make sure that one job is not loading blobs that another job is loading.
Our idea is to use the metadata support of Google Cloud Storage to manage which blobs are meant to be loaded by which job. Metadata is easy to modify (easier than, for example, renaming the blob), so it is good for state management.
In the Cloud Storage API there is support for metadata versioning, e.g. you can make storage operations conditional on a specific version of the blob. It is well described here: https://cloud.google.com/storage/docs/generations-preconditions (see the if-generation-match precondition).
I tried to find corresponding support in the BigQuery load job https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad but I can't find it. Do you know if there is this kind of conditional load support, based on metadata versioning, in the BigQuery load API?
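For illustration, the Cloud Storage side of this idea might look like the sketch below. It covers only the metadata claim and says nothing about whether the BigQuery load API offers an equivalent precondition. It assumes the google-cloud-storage client library, whose Blob.patch accepts if_metageneration_match; the metadata key "claimed_by" and the function name try_claim are hypothetical.

```python
try:
    from google.api_core.exceptions import PreconditionFailed
except ImportError:
    # Stand-in so the sketch can run without the client library installed.
    class PreconditionFailed(Exception):
        pass

CLAIM_KEY = "claimed_by"  # hypothetical metadata key


def try_claim(blob, job_id):
    """Return True if job_id now owns the blob, False if another job does."""
    blob.reload()  # fetch the current metadata and metageneration
    metadata = dict(blob.metadata or {})
    if metadata.get(CLAIM_KEY) not in (None, job_id):
        return False  # already claimed by another job
    metadata[CLAIM_KEY] = job_id
    blob.metadata = metadata
    try:
        # Rejected with HTTP 412 if the metadata changed since reload().
        blob.patch(if_metageneration_match=blob.metageneration)
    except PreconditionFailed:
        return False  # another job modified the metadata first
    return True
```

A job would call try_claim on each candidate blob (obtained with bucket.blob(name) from a storage.Client) and pass only the claimed blobs' URIs to its load job.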
I want to send data from BigQuery (about 500K rows) to a custom endpoint via the POST method. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and the max execution time pops up).
I was looking for Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send data via POST.
Do you know another option?
As mentioned in the comments, 500k rows for a POST method is far too much data to be considered as an option.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations within its jobs. You can use BigQueryIO (there are Python code samples), but if you just need to move the data to a certain machine or endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the data you are trying to retrieve is smaller than 1 GB, you can export it to a GCS bucket from the command-line interface like this: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data to multiple files by using a wildcard URI for your bucket destination, as indicated ('gs://my-bucket/file-name-*.json').
Finally, download the data from your bucket with the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION].
Note: you have more available ways to do that in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
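The same export can be driven from the Python client library instead of the bq CLI. The following is a minimal sketch under a few assumptions: the google-cloud-bigquery library is installed, the bucket and object names are placeholders, and 1 GB is taken to mean 1024**3 bytes when deciding whether a wildcard URI is needed. The helper names destination_uri and export_table are made up for illustration.

```python
ONE_GB = 1024 ** 3  # assumption: the single-file limit read as 1024**3 bytes


def destination_uri(bucket, prefix, table_bytes):
    """Single file below the limit, wildcard (sharded) URI otherwise."""
    if table_bytes < ONE_GB:
        return f"gs://{bucket}/{prefix}.csv.gz"
    return f"gs://{bucket}/{prefix}-*.csv.gz"


def export_table(table_id, bucket, prefix):
    """Export a table to GCS as gzip-compressed CSV and return the URI used."""
    # Imported here so destination_uri works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table(table_id)
    uri = destination_uri(bucket, prefix, table.num_bytes)
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.CSV,
        compression=bigquery.Compression.GZIP,
    )
    client.extract_table(table_id, uri, job_config=job_config).result()
    return uri
```

For example, export_table("my-project.mydataset.mytable", "example-bucket", "myfile") would write either myfile.csv.gz or a set of myfile-*.csv.gz shards, which can then be downloaded and sent to the endpoint in batches.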
What's the recommended way of loading data into BigQuery that is currently located in a Google Persistent Disk? Are there any special tools or best practises for this particular use case?
Copy to GCS (Google Cloud Storage), point BigQuery to load from GCS.
There's currently no direct connection between a persistent disk and BigQuery. You could send the data straight to BigQuery with the bq CLI, but that makes everything slower if you ever need to retry.
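The copy-then-load approach could be sketched as follows, assuming the code runs on the VM that has the persistent disk mounted, the google-cloud-storage and google-cloud-bigquery libraries are installed, and the file is CSV. The "staging" prefix and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def object_name_for(local_path, prefix="staging"):
    """Object name in the bucket: a prefix plus the local file name."""
    return f"{prefix}/{PurePosixPath(local_path).name}"


def stage_and_load(local_path, bucket_name, table_id):
    """Upload a file from the mounted disk to GCS, then load it from there."""
    # Imported here so object_name_for works without the client libraries.
    from google.cloud import bigquery, storage

    object_name = object_name_for(local_path)
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)

    uri = f"gs://{bucket_name}/{object_name}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # assumption: CSV input
        autodetect=True,
    )
    bigquery.Client().load_table_from_uri(
        uri, table_id, job_config=job_config
    ).result()
    return uri
```

Because the file stays in the bucket, a failed load can be retried from GCS without reading the disk again.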
I have many TBs in about 1 million tables in a single BigQuery project hosted in multiple datasets that are located in the US. I need to move all of this data to datasets hosted in the EU. What is my best option for doing so?
I'd export the tables to Google Cloud Storage and reimport using load jobs, but there's a 10K limit on load jobs per project per day
I'd do it as queries w/"allow large results" and save to a destination table, but that doesn't work cross-region
The only option I see right now is to reinsert all of the data using the BQ streaming API, which would be cost prohibitive.
What's the best way to move a large volume of data in many tables cross-region in BigQuery?
You have a couple of options:
Use load jobs, and contact Google Cloud Support to ask for a quota exception. They're likely to grant 100k or so on a temporary basis (if not, contact me, tigani#google, and I can do so).
Use federated query jobs. That is, move the data into a GCS bucket in the EU, then re-import the data via BigQuery queries with GCS data sources. More info here.
I'll also look into whether we can increase this quota limit across the board.
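To see why the quota matters here, a quick calculation helps. It assumes one load job per table, which is a simplification; the figures are the ones quoted in this thread (about 1 million tables, 10K load jobs per day, and a possible temporary increase to 100k).

```python
import math


def days_needed(table_count, load_jobs_per_day):
    """Days required if every table needs exactly one load job."""
    return math.ceil(table_count / load_jobs_per_day)


print(days_needed(1_000_000, 10_000))   # 100 days at the default quota
print(days_needed(1_000_000, 100_000))  # 10 days with the raised quota
```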
You can copy a dataset using BigQuery Copy Dataset (in-region or cross-region). The copy dataset UI is similar to the one for copying a table. Just click the "copy dataset" button on the source dataset and specify the destination dataset in the pop-up form. Check out the public documentation for more use cases.
A few other options have become available since Jordan answered a few years ago. They might be useful for some folks:
Use Cloud Composer to orchestrate the export and load via GCS buckets. See here.
Use Cloud Dataflow to orchestrate the export and load via GCS buckets. See here.
Disclaimer: I wrote the article for the 2nd option (using Cloud Dataflow).