Is uploading directly to BigQuery faster than uploading to Cloud Storage? - google-bigquery

I'm uploading a delimited file from my PC into BigQuery. Two ways I could do this:
Upload to Cloud Storage, then load to BQ
Upload to BQ directly
Is there a difference in upload time for each of these methods?

Nope, goes through the same infrastructure either way.
The advantage of uploading directly to BQ is that it's simpler: only one service to interact with.
The advantage of uploading to GCS is that it gives you more flexibility. You can repeat your load job without re-uploading if the load job happens to fail (bad schema, etc), and you may have other reasons to want a copy of the data in GCS (loading into other systems, etc).

Related

Send Bigquery Data to rest endpoint

I want to send data from BigQuery (about 500K rows) to a custom endpoint via post method, how can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and the max execution time pops up).
I was looking for Google Cloud Dataflow, but I don't know Java.
Running it into Google Cloud Function, but I don't know how to send data via post.
Do you know another option?
As mentioned in the comments, 500k rows for a POST method is far too much data to be considered as an option.
Dataflow is a product oriented for pipelines development, intended to run several data transformations during its jobs. You can use BigQueryIO (with python sample codes) but, If you just need to migrate the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the size of Data that you are trying to retrieve is less than 1GB, you can export to a GCS bucket from the Command Line Interface like: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data in more files using wildcard URI defining your bucket destination as indicated ('gs://my-bucket/file-name-*.json').
And finally, using gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION] you will download the data from your bucket.
Note: you have more available ways to do that in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.

Loading Data into BigQuery: Direct Insert from Process vs Process and then loading through Google Drive?

I have a google cloud function that generates files stored on Google Drive.
I want to load those files in Big Query.
What are the pros and cons of loading data directly from the function (skipping the file generation, just doing some kind of insert in BigQuery) vs loading from Google Drive?
I am interested in focusing the question not only in terms of technical stuff and costs, but also in terms of data processing methodology.
I think the question could lead to the dilema loading online or more in a batch process.
PS: This may sound a duplicate from this post but is not exactly the same.
Files Available Locally (in Cloud Function)
If the file is generated within the cloud function (within its local environment0, loading it is pretty similar to loading from your local filesystem. Here is what it comes down to:
Cons:
The total file size should be <= 10Mbs. If its a CSV, it should have less than 16k rows.
You cannot export multiple files at once to BQ, and have to iterate over each file to load it individually into BQ.
Pros:
If the file fulfills the above constraints, you will be saving the intermediate local -> GCS upload and can load to BQ directly.
Files Available in Cloud Storage Bucket (GCS)
On the other hand, if you decide to send the locally generated file in the cloud function to GCS and then export it to BQ:
Pros:
You can use wildcard exports to BQ (i.e. export multiple files simultaneously), significantly increasing the overall export speed.
Size limitations for per file are much more relaxed (4GB in case of uncompressed and 5TB in case of compressed).
Overall export is much faster compared to local/cloud function exports.
Cons:
Probably the only downside is that if you want to stream data into BQ table, you cannot directly do it if your file is in a GCS bucket. You can achieve that from a locally available file.

Loading data from a Google Persistent Disk into BigQuery?

What's the recommended way of loading data into BigQuery that is currently located in a Google Persistent Disk? Are there any special tools or best practises for this particular use case?
Copy to GCS (Google Cloud Storage), point BigQuery to load from GCS.
There's no current direct connection between a persistent disk and BigQuery. You could send the data straight to BigQuery with the bq CLI, but makes everything slower if you ever need to retry.

Is it possible to load & export to BQ from google compute disk?

How do I load to bigquery from compute disk? and should this be faster then loading from Google CLoud storage?
And the same for export. thanks a alot
Sounds like an interesting idea, but there are only 2 ways to load data into BigQuery:
A .csv or .json file in GCS (as you mention). Note that they can be gzip compressed, for faster transfer times, while uncompressed files will get faster processing times.
Or send data to the API via a POST request.
As you mention, leaving file in GCS is the faster and recommended way for large data imports and exports.

Optimize data upload on GoogleBigQuery

I'm currently using the Google BigQuery platform for uploading many datas (~ > 6 Go) and work with them as datasource with Tableau Desktop Software.
Presently it takes me an average of one hour to upload 12 tables in CSV format (total of 6 Go), uncompressed, with a python script using the Google API.
The google docs specify that "If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.".
How can I optimize this process ? Should be a solution to compressed my csv files to improve the upload speed ?
I also think about using Google Cloud Storage, but I expect my problem will be the same?
I need to reduce the time it's take me to upload my data files, but I don't find great solutions.
Thanks in advance.
Compressing your input data will reduce the time to upload the data, but will increase the time for the load job to execute once your data has been uploaded (compression restricts our ability to process your data in parallel). Since it sounds like you'd prefer to optimize for upload speed, I'd recommend compressing your data.
Note that if you're willing to split your data into several chunks and compress them each individually, you can get the best of both worlds--fast uploads and parallel load jobs.
Uploading to Google Cloud Storage should have the same trade-offs, except for one advantage: you can specify multiple source files in a single load job. This is handy if you pre-shard your data as suggested above, because then you can run a single load job that specifies several compressed input files as source files.