Loading Data into BigQuery: Direct Insert from Process vs Process and then loading through Google Drive? - google-bigquery

I have a google cloud function that generates files stored on Google Drive.
I want to load those files in Big Query.
What are the pros and cons of loading data directly from the function (skipping the file generation, just doing some kind of insert in BigQuery) vs loading from Google Drive?
I am interested in framing the question not only in terms of the technical details and costs, but also in terms of data processing methodology.
I think the question comes down to the dilemma of loading online (streaming) versus loading in more of a batch process.
PS: This may sound like a duplicate of this post, but it is not exactly the same.

Files Available Locally (in Cloud Function)
If the file is generated within the cloud function (within its local environment), loading it is pretty similar to loading from your local filesystem. Here is what it comes down to:
Cons:
The total file size should be <= 10 MB. If it is a CSV, it should have fewer than 16,000 rows.
You cannot load multiple files at once into BQ; you have to iterate over each file and load it individually.
Pros:
If the file satisfies the above constraints, you save the intermediate local -> GCS upload and can load into BQ directly.
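For illustration, here is a minimal sketch of that direct path with the google-cloud-bigquery Python client, assuming the function writes a CSV to its /tmp directory (file, dataset and table names are placeholders):

    # Minimal sketch: load a file generated inside the Cloud Function
    # straight into BigQuery, without staging it in GCS first.
    from google.cloud import bigquery

    def load_local_file_to_bq(local_path="/tmp/export.csv",
                              table_id="my_project.my_dataset.my_table"):
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,   # assumes a header row
            autodetect=True,       # let BigQuery infer the schema
        )
        # The file goes through the API's media upload, so the size
        # limits listed above apply to this path.
        with open(local_path, "rb") as f:
            job = client.load_table_from_file(f, table_id, job_config=job_config)
        job.result()  # wait for the load job to finish
        return client.get_table(table_id).num_rows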
Files Available in Cloud Storage Bucket (GCS)
On the other hand, if you decide to send the locally generated file in the cloud function to GCS and then export it to BQ:
Pros:
You can use wildcard exports to BQ (i.e. export multiple files simultaneously), significantly increasing the overall export speed (see the sketch after this list).
Per-file size limits are much more relaxed (4 GB for compressed files and 5 TB for uncompressed).
Overall export is much faster compared to local/cloud function exports.
Cons:
Probably the only downside is that if you want to stream data into a BQ table, you cannot do it directly from a file in a GCS bucket; you can achieve that with a locally available file.
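For comparison, a minimal sketch of the GCS path with the same Python client, using a wildcard so one load job picks up all the exported files (bucket and table names are placeholders):

    # Minimal sketch: one load job over every object matching the wildcard.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    uri = "gs://my-bucket/exports/part-*.csv"
    job = client.load_table_from_uri(uri, "my_project.my_dataset.my_table",
                                     job_config=job_config)
    job.result()  # wait for completion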

Related

Is there (still) an advantage to staging data on Google Cloud Storage before loading into BigQuery?

I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest first uploading this data to Google Cloud Storage before loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having data in Cloud Storage is a big advantage during development. In my case I often create a BigQuery table from the data in Cloud Storage multiple times until I have tuned everything: schema, model, partitioning, error handling, etc. It would be really time-consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
possible to remove the BQ table when it is not used and import it again when needed (a BQ table is much bigger than the plain, possibly compressed, data in Cloud Storage)
you save your local storage
less likely to fail during table creation (loading from local storage can fail due to networking issues, machine issues, etc.)
Cons
you pay some additional cost for storage (if you do not plan to touch your data often, e.g. once per month, you can reduce the cost by using Nearline storage)
So I would go for storing the data in Cloud Storage first, but of course it depends on your use case.
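To make the staged approach concrete, a minimal sketch that uploads a local newline-delimited JSON file to Cloud Storage and then loads it into BigQuery; bucket, file and table names are placeholders, and the upload only has to happen once no matter how many times the table is recreated:

    # Minimal sketch: stage the file in GCS, then load (and reload) from there.
    from google.cloud import bigquery, storage

    LOCAL_FILE = "data.json"             # newline-delimited JSON
    BUCKET = "my-staging-bucket"
    TABLE_ID = "my_project.my_dataset.my_table"

    # 1. Upload to GCS (the client uses resumable uploads for large files).
    blob = storage.Client().bucket(BUCKET).blob(LOCAL_FILE)
    blob.upload_from_filename(LOCAL_FILE)

    # 2. Load from GCS; this step can be repeated while iterating on the
    #    schema or partitioning without re-uploading the data.
    bq_client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    job = bq_client.load_table_from_uri(f"gs://{BUCKET}/{LOCAL_FILE}", TABLE_ID,
                                        job_config=job_config)
    job.result()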

File structure of Apache Beam DynamicDestinations write to BigQuery

I am using DynamicDestinations (from BigQueryIO) to export data from one Cassandra table to multiple Google BigQuery tables. The process consists of several steps including writing prepared data to Google Cloud Storage (as files in JSON format) and then loading the files to BQ via load jobs.
The problem is that the export process ended with an out-of-memory error at the last step (loading files from Google Storage to BQ), but the prepared files with all of the data remain in GCS. There are 3 directories in the BigQueryWriteTemp location:
And there are a lot of files with non-obvious names:
The question is: what is the storage structure of these files? How can I match the files with the tables (table names) they were prepared for? How can I use the files to continue the export process from the load-jobs step? Can I use some piece of Beam code for that?
These files, if you're using Beam 2.3.0 or earlier, contain JSON data to be imported into BigQuery using its load job API. However:
This is an implementation detail that you cannot rely on, in general. It is very likely to change in future versions of Beam (JSON is horribly inefficient).
It is not possible to match these files with the tables they are intended for - that information was stored in the internal state of the pipeline that failed.
There is also no way to know how much data was written to these files and how much wasn't. The files may contain only partial data: maybe your pipeline failed before creating some of the files, or after some of them were already loaded into BigQuery and deleted.
Basically, you'll need to rerun the pipeline and fix the OOM issue so that it succeeds.
For debugging OOM issues, I suggest using a heap dump. Dataflow can write heap dumps to GCS using --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/. You can examine these dumps using any Java memory profiler, such as Eclipse MAT or YourKit. You can also post your code as a separate SO question and ask for advice on reducing its memory usage.

Loading files from GCS to BigQuery - what's the best approach?

I need to load around 1 million rows into a BigQuery table. My approach is to write the data into Cloud Storage, and then use the load API to load multiple files at once.
What's the most efficient way to do this? I can parallelize the writing into gcs part. When I call load api, I pass in all the uris so I only need to call it once. I'm not sure how this loading is conducted in the backend. If I pass in multiple file names, will this loading run in multiple processes? How do I decide the size of each file to get the best performance?
Thanks
Put all the million rows in one file. If the file is not compressed, BigQuery can read it in parallel with many workers.
From https://cloud.google.com/bigquery/quota-policy
BigQuery can read compressed files (.gz) of up to 4GB.
BigQuery can read uncompressed files (.csv, .json, ...) of up to 5000GB. BigQuery figures out how to read it in parallel - you don't need to worry.
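Tying that back to the question: the load API does accept several URIs in a single job, so whether you use one big uncompressed file or many shards, a single call is enough. A minimal sketch with the Python client (bucket and table names are placeholders):

    # Minimal sketch: one load job with an explicit list of source URIs.
    from google.cloud import bigquery

    client = bigquery.Client()
    uris = [f"gs://my-bucket/rows/part-{i:05d}.csv" for i in range(10)]
    job = client.load_table_from_uri(
        uris,                             # one job, many source files
        "my_project.my_dataset.my_table",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
        ),
    )
    job.result()
    print(f"Loaded {job.output_rows} rows")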

Is it possible to load & export to BQ from google compute disk?

How do I load into BigQuery from a Compute Engine disk, and would this be faster than loading from Google Cloud Storage?
And the same for export. Thanks a lot.
Sounds like an interesting idea, but there are only 2 ways to load data into BigQuery:
A .csv or .json file in GCS (as you mention). Note that files can be gzip-compressed for faster transfer times, while uncompressed files get faster processing times.
Or send data to the API via a POST request.
As you mention, leaving the files in GCS is the faster and recommended way for large data imports and exports.
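For completeness, one common form of the "send data to the API" path is streaming inserts (tabledata.insertAll). A minimal sketch with the google-cloud-bigquery Python client, assuming the destination table already exists (the table name and row fields are placeholders):

    # Minimal sketch: push rows straight to the BigQuery API via streaming inserts.
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = [
        {"id": 1, "name": "alice"},
        {"id": 2, "name": "bob"},
    ]
    errors = client.insert_rows_json("my_project.my_dataset.my_table", rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")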

Optimize data upload on GoogleBigQuery

I'm currently using the Google BigQuery platform to upload a lot of data (> 6 GB) and work with it as a data source in Tableau Desktop.
At present it takes me an average of one hour to upload 12 tables in CSV format (6 GB total), uncompressed, with a Python script using the Google API.
The google docs specify that "If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.".
How can I optimize this process? Would compressing my CSV files be a solution to improve the upload speed?
I have also thought about using Google Cloud Storage, but I expect the problem would be the same.
I need to reduce the time it takes me to upload my data files, but I haven't found a great solution.
Thanks in advance.
Compressing your input data will reduce the time to upload the data, but will increase the time for the load job to execute once your data has been uploaded (compression restricts our ability to process your data in parallel). Since it sounds like you'd prefer to optimize for upload speed, I'd recommend compressing your data.
Note that if you're willing to split your data into several chunks and compress them each individually, you can get the best of both worlds--fast uploads and parallel load jobs.
Uploading to Google Cloud Storage should have the same trade-offs, except for one advantage: you can specify multiple source files in a single load job. This is handy if you pre-shard your data as suggested above, because then you can run a single load job that specifies several compressed input files as source files.
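To illustrate the pre-sharding idea, a minimal sketch that splits a large CSV into gzipped chunks locally; the file names and chunk size are placeholders, and the resulting shards can then be uploaded to GCS and passed together as the source URIs of a single load job:

    # Minimal sketch: split a big CSV into gzipped shards so uploads stay small
    # while BigQuery can still load the shards in parallel within one load job.
    import csv
    import gzip
    import itertools

    CHUNK_ROWS = 1_000_000  # placeholder shard size

    def shard_and_compress(src="big_table.csv", prefix="shard"):
        """Write shard-00000.csv.gz, shard-00001.csv.gz, ... from src."""
        with open(src, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            for i in itertools.count():
                rows = list(itertools.islice(reader, CHUNK_ROWS))
                if not rows:
                    break
                with gzip.open(f"{prefix}-{i:05d}.csv.gz", "wt", newline="") as out:
                    writer = csv.writer(out)
                    writer.writerow(header)   # repeat the header in every shard
                    writer.writerows(rows)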