What's the recommended way of loading data into BigQuery that is currently located on a Google Persistent Disk? Are there any special tools or best practices for this particular use case?
Copy to GCS (Google Cloud Storage), point BigQuery to load from GCS.
There's currently no direct connection between a persistent disk and BigQuery. You could send the data straight to BigQuery with the bq CLI, but that makes everything slower if you ever need to retry.
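A minimal sketch of the GCS route with the Python clients, assuming the persistent disk is mounted locally and that the bucket, file path, and table names below are placeholders for your own:

```python
from google.cloud import bigquery, storage

BUCKET = "my-staging-bucket"                # assumed bucket name
LOCAL_PATH = "/mnt/disks/data/export.csv"   # file on the mounted persistent disk
BLOB_NAME = "exports/export.csv"
TABLE_ID = "my-project.my_dataset.my_table"  # assumed destination table

# 1. Copy the file from the persistent disk to Cloud Storage.
storage_client = storage.Client()
bucket = storage_client.bucket(BUCKET)
bucket.blob(BLOB_NAME).upload_from_filename(LOCAL_PATH)

# 2. Point BigQuery at the GCS object and run a load job.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes a header row
    autodetect=True,
)
load_job = bq_client.load_table_from_uri(
    f"gs://{BUCKET}/{BLOB_NAME}", TABLE_ID, job_config=job_config
)
load_job.result()  # if the load fails, it can be retried without re-uploading
```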
Related
I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped into the GCS bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage into BigQuery, and to have a scheduled process that loads the most recent files from Cloud Storage and appends the data to the same table in BigQuery.
Please help me set this up.
There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery: you leave the data in Cloud Storage and ask BigQuery to read it directly from there. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be read from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning, clustering and others...
Perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For the new files, forget the old-school scheduled ingestion process. With the cloud, you can be event driven: catch the event that notifies you of a new file on Cloud Storage and load it directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and the most recommended pattern. You can find a code sample here
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about one per minute)
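This is not the sample linked above, but a minimal sketch of such a Cloud Function, assuming a CSV-only bucket, an autodetected schema, and a placeholder table name; it would be deployed with a google.storage.object.finalize trigger on the bucket:

```python
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.raw_events"  # assumed (partitioned) target table

def load_new_file(event, context):
    """Background function: triggered by a new object in the bucket, appends it to the table."""
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    uri = f"gs://{event['bucket']}/{name}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes a header row
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    # One load job per new file; mind the 1,500 loads/day/table quota.
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()
```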
We create files ("blobs") in Google Cloud Storage and instruct BigQuery load jobs to load them into a table. The blobs are kept in a shared bucket and there are concurrent jobs loading into target tables. We would like to make sure that one job is not loading blobs that another job is already loading.
Our idea is to use the metadata support of Google Cloud Storage to track which blobs are meant to be loaded by which job. Metadata is easy to modify (easier than, for example, renaming the blob), so it is a good fit for state management.
In the Cloud Storage API there is support for metadata versioning, i.e. you can make storage operations conditional on a specific version of the blob. It is well described at https://cloud.google.com/storage/docs/generations-preconditions; see the if-generation-match precondition.
I tried to find corresponding support in the BigQuery load job (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad) but couldn't find it. Do you know if there is this kind of metadata-versioning conditional load support in the BigQuery load API?
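For reference, this is roughly the GCS-side "claim" we have in mind (Python client; the metadata key and job name are just our own convention):

```python
from google.api_core.exceptions import PreconditionFailed
from google.cloud import storage

client = storage.Client()
# Fetch the blob with its current properties, including metageneration.
blob = client.bucket("shared-bucket").get_blob("exports/file-0001.csv")

if blob is not None and (blob.metadata is None or "claimed_by" not in blob.metadata):
    blob.metadata = {"claimed_by": "load-job-42"}  # hypothetical job identifier
    try:
        # Only succeeds if nobody changed the metadata since we read it.
        blob.patch(if_metageneration_match=blob.metageneration)
        # ...safe to include this blob in our load job...
    except PreconditionFailed:
        pass  # another job claimed it first; skip this blob
```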
I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest to first upload this data to Google Cloud Storage before loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having the data in Cloud Storage is a big advantage during development. In my case I often create a BigQuery table from data in Cloud Storage multiple times until I have tuned things like the schema, the model, partitioning, error handling, etc. It would be really time consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
possible to delete the BQ table when not in use and re-import it when needed (a BQ table is much bigger than the plain, possibly compressed, data in Cloud Storage)
you save local storage space
less likely to fail during table creation (loading from local storage could hit networking issues, computer issues, etc.)
Cons
you pay some additional cost for storage (if you do not plan to touch your data often, e.g. once per month, you can reduce the price by using Nearline storage)
So I would go for storing the data in Cloud Storage first, but of course it depends on your use case.
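To illustrate the difference, here is a minimal sketch of both paths with the Python client, using placeholder names; the Cloud Storage variant can be re-run as often as needed while iterating on the schema, without re-uploading ~100 GB:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

# Option A: load directly from the local file (what `bq load` on a local path does).
with open("dataset.json", "rb") as f:
    client.load_table_from_file(
        f, "my-project.my_dataset.my_table", job_config=job_config
    ).result()

# Option B: load from Cloud Storage; rerunning this repeats only the load, not the upload.
client.load_table_from_uri(
    "gs://my-staging-bucket/dataset.json",
    "my-project.my_dataset.my_table",
    job_config=job_config,
).result()
```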
I have a BigQuery table whose data source is a bucket in GCS (Google Cloud Storage).
The GCS bucket changes constantly, with new files being added. Is there any mechanism for BigQuery to automatically detect the changes in GCS and sync with the latest data?
Thanks!
There is a very cool beta feature you can use to do that. Check out the BigQuery Cloud Storage Transfer Service. You can schedule transfers, run backfills, and much more.
Read "limitations" to see if it can work for you.
I'm uploading a delimited file from my PC into BigQuery. Two ways I could do this:
Upload to Cloud Storage, then load to BQ
Upload to BQ directly
Is there a difference in upload time for each of these methods?
Nope, goes through the same infrastructure either way.
The advantage of uploading directly to BQ is that it's simpler: only one service to interact with.
The advantage of uploading to GCS is that it gives you more flexibility. You can repeat your load job without re-uploading if the load job happens to fail (bad schema, etc), and you may have other reasons to want a copy of the data in GCS (loading into other systems, etc).