I need to load around 1 million rows into a BigQuery table. My approach is to write the data into Cloud Storage and then use the load API to load multiple files at once.
What's the most efficient way to do this? I can parallelize the part that writes to GCS. When I call the load API, I pass in all the URIs, so I only need to call it once. I'm not sure how this loading is conducted on the backend: if I pass in multiple file names, will the load run in multiple processes? And how do I decide the size of each file to get the best performance?
Thanks
Put all the million rows in one file. If the file is not compressed, BigQuery can read it in parallel with many workers.
From https://cloud.google.com/bigquery/quota-policy
BigQuery can read compressed files (.gz) of up to 4GB.
BigQuery can read uncompressed files (.csv, .json, ...) of up to 5000GB. BigQuery figures out how to read it in parallel - you don't need to worry.
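For reference, a minimal sketch of a single load job with the google-cloud-bigquery Python client (bucket, dataset, and table names are placeholders); whether you pass one URI or many, it is still one load job:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    # A single uncompressed CSV; BigQuery splits the read across workers internally.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/rows.csv",              # hypothetical bucket/object
        "my_project.my_dataset.my_table",       # hypothetical destination table
        job_config=job_config,
    )
    load_job.result()  # waits for the job to finish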
Related
I have a 10 GB CSV file. I can put the file in S3 in two ways.
1) Upload the entire file as a single CSV object.
2) Divide the file into multiple chunks (say 200 MB each) and upload them.
Now I need to get all of the data into a pandas DataFrame running on an EC2 instance.
1) One way is to make a single request for the one big file and put the data into the DataFrame.
2) The other way is to make a request for each chunk and keep appending the data to the DataFrame.
Which is the better way of doing it?
With multiple files, you have the option of downloading them simultaneously in parallel threads. But this has two drawbacks:
These operations are IO heavy (mostly network), so depending on your instance type you might end up with worse performance overall.
Multithreaded apps add some overhead for handling errors, aggregating results, and such.
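If you do go the chunked route, a rough sketch of the threaded approach might look like the following (the bucket name, key layout, and chunk count are made-up placeholders):

    import io
    from concurrent.futures import ThreadPoolExecutor

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"                                    # hypothetical bucket name
    KEYS = [f"data/chunk-{i:03d}.csv" for i in range(50)]   # hypothetical 200 MB chunks

    def fetch_chunk(key):
        # Each thread downloads one object and parses it into its own DataFrame.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return pd.read_csv(io.BytesIO(body))

    with ThreadPoolExecutor(max_workers=8) as pool:
        frames = list(pool.map(fetch_chunk, KEYS))

    df = pd.concat(frames, ignore_index=True)

Whether eight workers actually helps depends on the instance's network bandwidth, which is the first drawback above.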
Depending on what you do, you might also want to look at AWS Athena, which can query data in S3 for you and produce results in seconds, so you don't have to download it at all.
I was going through the Google BigQuery documentation and I see there is a limit of 5 TB per file for unencrypted file loads and 4 TB for encrypted file loads in BigQuery, with 15 TB per load job.
I have a hypothetical question: how can I load a text file larger than 16 TB (assuming encryption will bring it into the 4 TB range)? I also see the GCS (Cloud Storage) limit is 5 TB per file.
I have never done this, but here is the approach I have in mind; I'm not sure about it and am looking for confirmation. First, split the file. Next, encrypt the chunks and transfer them to GCS. Finally, load them into the BigQuery table.
You are on the right track, I guess. Split the file into smaller chunks, then go ahead and distribute them across 2 or 3 different GCS buckets.
Once the chunks are in the buckets, you can go ahead and load them into BQ.
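For the final load step, a single load job can list chunks from several buckets at once. A minimal sketch with the google-cloud-bigquery Python client, assuming the chunks fit within the per-job limits (bucket, dataset, and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical chunk locations spread across two buckets.
    uris = [
        "gs://bucket-a/bigfile-part-000.csv",
        "gs://bucket-a/bigfile-part-001.csv",
        "gs://bucket-b/bigfile-part-002.csv",
    ]

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )

    # One load job reads all of the listed chunks.
    client.load_table_from_uri(
        uris, "my_project.my_dataset.my_table", job_config=job_config
    ).result()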
Hope it helps.
I have a Google Cloud Function that generates files stored on Google Drive.
I want to load those files into BigQuery.
What are the pros and cons of loading data directly from the function (skipping the file generation, just doing some kind of insert into BigQuery) vs loading from Google Drive?
I am interested in framing the question not only in terms of technical details and costs, but also in terms of data processing methodology.
I think the question comes down to the dilemma of loading online vs in a batch process.
PS: This may sound like a duplicate of this post, but it is not exactly the same.
Files Available Locally (in Cloud Function)
If the file is generated within the Cloud Function (in its local environment), loading it is pretty similar to loading from your local filesystem. Here is what it comes down to:
Cons:
The total file size should be <= 10 MB. If it's a CSV, it should have fewer than roughly 16k rows.
You cannot export multiple files at once to BQ; you have to iterate over each file and load it individually.
Pros:
If the file fulfills the above constraints, you save the intermediate local -> GCS upload and can load into BQ directly.
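For a file that fits those constraints, a rough sketch of the direct load from the function's local filesystem might look like this (the file, dataset, and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )

    # /tmp is the writable local filesystem inside a Cloud Function.
    with open("/tmp/generated.csv", "rb") as source_file:
        job = client.load_table_from_file(
            source_file,
            "my_project.my_dataset.my_table",
            job_config=job_config,
        )
    job.result()  # waits for the load to finish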
Files Available in Cloud Storage Bucket (GCS)
On the other hand, if you decide to send the locally generated file from the Cloud Function to GCS and then export it to BQ:
Pros:
You can use wildcard exports to BQ (i.e. export multiple files simultaneously), significantly increasing the overall export speed.
Size limitations per file are much more relaxed (4 GB for compressed files and 5 TB for uncompressed).
Overall export is much faster compared to local/cloud function exports.
Cons:
Probably the only downside is that if you want to stream data into a BQ table, you cannot do it directly from a file in a GCS bucket; you can achieve that with a locally available file.
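A rough sketch of the GCS route, combining the upload from the function's /tmp directory with a wildcard load so that every generated file is picked up in one job (bucket, prefix, dataset, and table names are all made up):

    from google.cloud import bigquery, storage

    storage_client = storage.Client()
    bq_client = bigquery.Client()

    # 1. Push the locally generated file from the function's /tmp to GCS.
    bucket = storage_client.bucket("my-staging-bucket")
    bucket.blob("exports/part-001.csv").upload_from_filename("/tmp/generated.csv")

    # 2. A single wildcard URI matches every object under the prefix,
    #    so all generated files are loaded in one job.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )
    bq_client.load_table_from_uri(
        "gs://my-staging-bucket/exports/part-*.csv",
        "my_project.my_dataset.my_table",
        job_config=job_config,
    ).result()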
How do I load into BigQuery from a Compute Engine disk? And would this be faster than loading from Google Cloud Storage?
And the same question for export. Thanks a lot.
Sounds like an interesting idea, but there are only 2 ways to load data into BigQuery:
A .csv or .json file in GCS (as you mention). Note that they can be gzip compressed, for faster transfer times, while uncompressed files will get faster processing times.
Or send data to the API via a POST request.
As you mention, leaving the file in GCS is the faster and recommended way for large data imports and exports.
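For the second option, one way to make that POST request from Python is the streaming insert helper in the google-cloud-bigquery client. A minimal sketch (table name and rows are made up):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical rows; insert_rows_json wraps the tabledata.insertAll POST call.
    rows = [
        {"name": "alice", "value": 1},
        {"name": "bob", "value": 2},
    ]

    errors = client.insert_rows_json("my_project.my_dataset.my_table", rows)
    if errors:
        print("Insert errors:", errors)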
I'm currently using Google BigQuery to upload a lot of data (> ~6 GB) and work with it as a data source in Tableau Desktop.
At the moment it takes me an average of one hour to upload 12 tables in CSV format (6 GB in total), uncompressed, with a Python script using the Google API.
The google docs specify that "If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.".
How can I optimize this process? Would compressing my CSV files improve the upload speed?
I am also thinking about using Google Cloud Storage, but I expect my problem would be the same?
I need to reduce the time it takes to upload my data files, but I haven't found a good solution.
Thanks in advance.
Compressing your input data will reduce the time to upload the data, but will increase the time for the load job to execute once your data has been uploaded (compression restricts our ability to process your data in parallel). Since it sounds like you'd prefer to optimize for upload speed, I'd recommend compressing your data.
Note that if you're willing to split your data into several chunks and compress each chunk individually, you can get the best of both worlds: fast uploads and parallel load jobs.
Uploading to Google Cloud Storage should have the same trade-offs, except for one advantage: you can specify multiple source files in a single load job. This is handy if you pre-shard your data as suggested above, because then you can run a single load job that specifies several compressed input files as source files.
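A rough sketch of that pre-sharding approach in Python, gzipping each chunk and then pointing one load job at all of them (the chunk size, bucket, file, and table names are placeholders):

    import gzip

    from google.cloud import bigquery, storage

    CHUNK_ROWS = 100_000                      # hypothetical shard size
    BUCKET = "my-bucket"                      # hypothetical bucket name
    bucket = storage.Client().bucket(BUCKET)

    def upload_shard(index, header, rows):
        # Gzip one shard (header included) and upload it; return its gs:// URI.
        name = f"shards/big_table-{index:04d}.csv.gz"
        payload = gzip.compress((header + "".join(rows)).encode())
        bucket.blob(name).upload_from_string(payload)
        return f"gs://{BUCKET}/{name}"

    uris, rows, shard = [], [], 0
    with open("big_table.csv") as src:
        header = src.readline()
        for line in src:
            rows.append(line)
            if len(rows) == CHUNK_ROWS:
                uris.append(upload_shard(shard, header, rows))
                rows, shard = [], shard + 1
    if rows:
        uris.append(upload_shard(shard, header, rows))

    # One load job that lists every compressed shard as a source file;
    # the shards are processed in parallel even though each one is gzipped.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )
    client.load_table_from_uri(
        uris, "my_project.my_dataset.my_table", job_config=job_config
    ).result()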