I'm currently using Google BigQuery to upload fairly large amounts of data (> 6 GB) and work with it as a data source in Tableau Desktop.
At the moment it takes me about an hour on average to upload 12 tables in CSV format (6 GB total), uncompressed, with a Python script using the Google API.
The Google docs specify: "If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed."
How can I optimize this process? Would compressing my CSV files be a good way to improve the upload speed?
I have also thought about using Google Cloud Storage, but I expect the problem would be the same there.
I need to reduce the time it takes me to upload my data files, but I haven't found a good solution.
Thanks in advance.
Compressing your input data will reduce the time to upload the data, but will increase the time for the load job to execute once your data has been uploaded (compression restricts our ability to process your data in parallel). Since it sounds like you'd prefer to optimize for upload speed, I'd recommend compressing your data.
Note that if you're willing to split your data into several chunks and compress them each individually, you can get the best of both worlds--fast uploads and parallel load jobs.
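For example, a minimal Python sketch of that pre-sharding step could look like this (the source file name and chunk size are assumptions; adjust them to your data):

```python
import csv
import gzip

def split_and_compress(source="big_table.csv", rows_per_chunk=1_000_000):
    """Split a large CSV into gzip-compressed chunks, repeating the header in each chunk."""
    with open(source, newline="") as infile:
        reader = csv.reader(infile)
        header = next(reader)
        out = writer = None
        chunk_idx = 0
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out is not None:
                    out.close()
                out = gzip.open(f"chunk_{chunk_idx:04d}.csv.gz", "wt", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunk_idx += 1
            writer.writerow(row)
        if out is not None:
            out.close()

if __name__ == "__main__":
    split_and_compress()
```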
Uploading to Google Cloud Storage should have the same trade-offs, except for one advantage: you can specify multiple source files in a single load job. This is handy if you pre-shard your data as suggested above, because then you can run a single load job that specifies several compressed input files as source files.
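A sketch of such a single load job with the Python client library follows; the project, dataset, table, and bucket names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# All names below are placeholders; substitute your own project/dataset/bucket.
table_id = "my_project.my_dataset.my_table"
uris = [
    "gs://my-bucket/chunk_0000.csv.gz",
    "gs://my-bucket/chunk_0001.csv.gz",
    "gs://my-bucket/chunk_0002.csv.gz",
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # each chunk repeats the header row
    autodetect=True,
)

# One load job, several compressed input files as sources.
load_job = client.load_table_from_uri(uris, table_id, job_config=job_config)
load_job.result()  # wait for completion
print(f"Loaded {client.get_table(table_id).num_rows} rows.")
```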
Related
I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest first uploading this data to Google Cloud Storage and then loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having the data in Cloud Storage is a big advantage during development. In my case I often create a BigQuery table from data in Cloud Storage multiple times while I tune things like the schema, the data model, partitioning, error handling, and so on. It would be really time consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
you can delete the BQ table when it is not in use and re-import it when needed (the BQ table takes up much more space than the plain, possibly compressed, data in Cloud Storage)
you free up your local storage
table creation is less likely to fail (loading from local storage can run into network issues, machine issues, etc.)
Cons
you pay some additional cost for storage (if you do not plan to touch your data often, e.g. once per month, you can lower the price by using Nearline storage)
So I would go for storing the data in Cloud Storage first, but of course it depends on your use case.
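As a rough illustration of the "tune things without re-uploading" workflow, here is a sketch with the Python client library; the table, bucket, and field names are invented:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.events"   # placeholder names
uri = "gs://my-bucket/events/*.json.gz"     # the data stays put in Cloud Storage

# Adjust the schema or partitioning and simply re-run this job;
# nothing has to be re-uploaded from the local machine.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "STRING"),
    ],
    time_partitioning=bigquery.TimePartitioning(field="event_ts"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```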
I was going through the Google BigQuery documentation and I see there is a limit of 5 TB per file for unencrypted loads and 4 TB per file for encrypted loads in BigQuery, with 15 TB per load job.
I have a hypothetical question: how can I load a text file larger than 16 TB (assuming encryption will bring it into the 4 TB range)? I also see that the GCS Cloud Storage limit is 5 TB per file.
I have never done this, but here is a possible approach as I imagine it; I'm not sure about it and am looking for confirmation. First, split the file. Next, encrypt the chunks and transfer them to GCS. Finally, load them into the BigQuery table.
You are on the right track, I'd say. Split the file into smaller chunks, then distribute them across 2 or 3 different GCS buckets.
Once the chunks are in the buckets, you can load them into BQ.
Hope it helps.
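For what it's worth, the "distribute the chunks across buckets" step might look roughly like this with the Python storage client (the bucket and chunk names are invented, and the encrypted chunks are assumed to already exist locally):

```python
from google.cloud import storage

client = storage.Client()

# Placeholder bucket names; the buckets must already exist.
buckets = [client.bucket(name) for name in ("my-chunks-1", "my-chunks-2", "my-chunks-3")]

# Placeholder chunk names produced by an earlier split/encrypt step.
chunks = [f"chunk_{i:04d}.csv.gz" for i in range(12)]

for i, chunk in enumerate(chunks):
    blob = buckets[i % len(buckets)].blob(chunk)  # round-robin across the buckets
    blob.upload_from_filename(chunk)
    print(f"uploaded {chunk} to gs://{blob.bucket.name}/{blob.name}")
```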
I have a google cloud function that generates files stored on Google Drive.
I want to load those files in Big Query.
What are the pros and cons of loading data directly from the function (skipping the file generation, just doing some kind of insert in BigQuery) vs loading from Google Drive?
I am interested in framing the question not only in terms of the technical details and cost, but also in terms of data processing methodology.
I think the question comes down to the dilemma of loading data online versus in a batch process.
PS: This may sound like a duplicate of this post, but it is not exactly the same.
Files Available Locally (in Cloud Function)
If the file is generated within the Cloud Function (in its local environment), loading it is pretty similar to loading from your local filesystem. Here is what it comes down to:
Cons:
The total file size should be <= 10 MB. If it's a CSV, it should have fewer than 16k rows.
You cannot load multiple files into BQ at once; you have to iterate over the files and load each one individually.
Pros:
If the file fulfills the above constraints, you save the intermediate local -> GCS upload and can load into BQ directly (see the sketch below).
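As a rough illustration of that direct route, a minimal sketch of what the Cloud Function body might look like with the Python client (the table name and file path are assumptions, not something from the question):

```python
from google.cloud import bigquery

def load_generated_file(event=None, context=None):
    """Hypothetical Cloud Function body: load a small file written to /tmp straight into BQ."""
    client = bigquery.Client()
    table_id = "my_project.my_dataset.my_table"  # placeholder
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    with open("/tmp/generated.csv", "rb") as source:
        job = client.load_table_from_file(source, table_id, job_config=job_config)
    job.result()  # raises if the load job fails
```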
Files Available in Cloud Storage Bucket (GCS)
On the other hand, if you decide to send the locally generated file in the cloud function to GCS and then export it to BQ:
Pros:
You can use wildcard loads into BQ (i.e. load multiple files simultaneously), which significantly increases the overall loading speed (see the sketch after this list).
Per-file size limits are much more relaxed (4 GB for compressed files and 5 TB for uncompressed files).
The overall load is much faster than loading from the local/Cloud Function filesystem.
Cons:
Probably the only downside is that if you want to stream data into a BQ table, you cannot do it directly from a file sitting in a GCS bucket; you can do that from a locally available file.
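To illustrate the wildcard point from the pros above, here is a rough sketch with the Python client library; the table name, bucket, and file pattern are all assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"     # placeholder
uri = "gs://my-bucket/exports/part-*.csv"       # wildcard picks up every matching file

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# One job loads all the files matched by the wildcard.
client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```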
I need to load around 1 million rows into a BigQuery table. My approach will be to write the data to Cloud Storage, and then use the load API to load multiple files at once.
What's the most efficient way to do this? I can parallelize the write-to-GCS part. When I call the load API, I pass in all the URIs, so I only need to call it once. I'm not sure how the loading is handled on the backend. If I pass in multiple file names, will the load run in multiple processes? How do I decide the size of each file to get the best performance?
Thanks
Put all the million rows in one file. If the file is not compressed, BigQuery can read it in parallel with many workers.
From https://cloud.google.com/bigquery/quota-policy
BigQuery can read compressed files (.gz) of up to 4GB.
BigQuery can read uncompressed files (.csv, .json, ...) of up to 5000GB. BigQuery figures out how to read it in parallel - you don't need to worry.
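If you do end up writing the rows in parallel as several smaller objects, one option (not something the answer above requires) is Cloud Storage's compose operation, which stitches them into a single uncompressed object that BigQuery can then read in parallel. A rough sketch, assuming headerless CSV parts and made-up bucket/object names:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket

# Parts written in parallel by separate workers (names assumed, no header rows).
parts = [bucket.blob(f"rows/part-{i:04d}.csv") for i in range(20)]

# Compose up to 32 source objects into one uncompressed object,
# which BigQuery can then read in parallel as a single large file.
combined = bucket.blob("rows/all_rows.csv")
combined.compose(parts)
```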
How do I load into BigQuery from a Compute Engine disk, and should this be faster than loading from Google Cloud Storage?
And the same question for export. Thanks a lot.
Sounds like an interesting idea, but there are only 2 ways to load data into BigQuery:
A .csv or .json file in GCS (as you mention). Note that the files can be gzip compressed for faster transfer times, while uncompressed files get faster processing times.
Or send data to the API via a POST request.
As you mention, leaving the files in GCS is the faster and recommended way for large data imports and exports.
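If by "send data to the API via a POST request" you mean streaming individual rows rather than uploading a file, a minimal sketch with the Python client could look like this (the table name and rows are invented):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # placeholder

rows = [
    {"name": "alice", "value": 1},
    {"name": "bob", "value": 2},
]

# Streaming insert over the REST API (tabledata.insertAll under the hood).
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Encountered errors:", errors)
```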