Loading from Google cloud storage to Big Query seems slow - google-bigquery

I'm running a test using Big Query. Basically I have 50,000 files, each of which are 27MB in size, on average. Some larger, some smaller.
Timing each file upload reveals:
real 0m49.868s
user 0m0.297s
sys 0m0.173s
Using something similar to:
time bq load --encoding="UTF-8" --field_delimiter="~" dataset gs://project/b_20130630_0003_1/20130630_0003_4565900000.tsv schema.json
Running command: "bq ls -j" and subsequently running "bq show -j " reveals that I have the following errors:
Job Type State Start Time Duration Bytes Processed
load FAILURE 01 Jul 22:21:18 0:00:00
Errors encountered during job execution. Exceeded quota: too many imports per table for this table
After checking the database, the rows seems to of loaded fine which is puzzling since, given the error, I would of expected nothing to of gotten loaded. The problem is that I really don't understand how I reached my quota limit since I've only just started
uploading files recently and thought the limit was 200,000 requests.
All the data is currently on Google Cloud Storage so I would expect the data loading to happen fairly quickly since the interaction is between cloud storage and Big Query both of which are in the cloud.
By my calculations the entire load is going to take: (50,000 * 49 seconds) 28 days.
Kinda hoping these numbers are wrong.
Thanks.

The quota limit per table is 1000 loads per day. This is to encourage people to batch their loads, since we can generate a more efficient representation of the table if we can see more of the data at once.
BigQuery can perform load jobs in parallel. Depending on the size of your load, a number of workers will be assigned to your job. If your files are large, those files will be split among workers; alternately if you pass multiple files, each worker may process a different file. So the time that it takes for one file is not indicative of the time that it takes to run a load job with multiple files.

Related

Optimizing Neptune Bulk Load Jobs?

Currently we have an automation engine running to queue up billions of nodes/edges for our Neptune historical load.
The data pulls off Kafka and writes bulk CSVs into S3 to initiate the load. Currently I'm uploading files after each batch pulls a couple million records off the queue.
I'm using oversubscribe param and looked at the high-level docs for bulk optimizations. I'm seeing I can get about 36M records an hour, but looking to go faster. Do I want the output files to be larger? I can only run one job at a time and my queue is constantly filled up to the 65 cap limit.
In general, larger files should give better performance than smaller ones as the worker threads running the load will divide the file up amongst themselves. Larger instances also help the loads go faster. If possible, a db.r5.12xlarge is a good choice when you have a lot of data to load. You can scale it back down again once the volume of writes you need to achieve slows down and a smaller instance will suffice.

Allowing many users to view stale BigQuery data query results concurrently

If I have a BigQuery dataset with data that I would like to make available to 1000 people (where each of these people would only be allowed to view their subset of the data, and is OK to view a 24hr stale version of their data), how can I do this without exceeding the 50 concurrent queries limit?
In the BigQuery documentation there's mention of 50 concurrent queries being permitted which give on-the-spot accurate data, which I would surpass if I needed them to all be able to view on-the-spot accurate data - which I don't.
In the documentation there is mention of Batch jobs being permitted and saving of results into destination tables which I'm hoping would somehow allow a reliable solution for my scenario, but am having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and whether or not someone querying results that exist in those destination tables is in itself counting towards the 50 concurrent users limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboading/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per-user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break it up into multiple flat files (or just read it into memory on your frontend). Each user hits your frontend, you determine which part of the output to serve up to them and respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.

BigQuery load failing with backend error on nearly every load attempt

I have been trying in vain for nearly two days to load in two large datasets, each of which are ~30GB/s and split into 50 uncompressed ~600MB files each, all coming from a bucket. Almost always the jobs fail with "internal" or "backend" error.
I have tried submitting with a wild card (as in *csv) and I have also tried the individual files.
On the rare occasion the load job does not fail within a few minutes, it will eventually die after 6 or 7 hours.
I have split the files and made them uncompressed to help with load times, would this be causing an issue? I did have a compressed version load successfully after about 7 hours yesterday, but so far I have only been able to load a single 350 MB CSV uncompressed from the bucket.
Here is an example:
Errors:
Error encountered during execution. Retrying may solve the problem. (error code: backendError)
Job ID bvedemo:bquijob_64ebebf1_1532f1b3c4f
Backend error would imply something is happening at Google, but I must be doing something wrong to have it fail this often!
Lesson of the day: do not try to load data from a nearline bucket into BigQuery.
I moved the data into a standard bucket, reloaded from there and 65GB of data loaded in less than 1 minute.

Processing data while it is loading

We have a tool which loads data from some optical media, and once it's all copied to the hard drive runs it through a third-party tool for processing. I would like to optimise this process so each file is processed as it is read in. Trouble is, the third-party tool (which naturally I cannot change) has a 12 second startup overhead. What is the best way I can deal with this, in terms of finishing the entire process as soon as possible? I can pass any number of files to the processing tool in each run, so I need to be able to determine exactly when to run the tool to get the fastest result overall. The data being copied could be anything from one large file (which can't be processed until it's fully copied) to hundreds of small files.
The simplest would be to create and run 2 threads, one that runs the tool and one that loads data. Start 12 seconds timer and trigger both threads. Upon each file load completion check the passed time. If 12 seconds passed, fetch the data into the thread running the tool. Restart loading the data in parallel to processing of previous bulk. Once previous bulk processing completes restart the 12 sec timer and continue checking it upon every file load completion. Repeat till no more data remains.
For better results a more complex solution might be required. You can do some benchmarking to get an evaluation of average data loading time. Since it might be different for small and large files, several evaluations may be needed for different categories of files (according to size). Optimal resources utilization would be the one that processes the data in the same rate the new data arrives. Processing time includes the 12 seconds startup. The benchmarking should give you a ratio of processing threads number vs. reading threads number (you can also decrease/increase the number of active reading threads according to the incoming file sizes). Actually, it's a variation of producer-consumer problem with multiple producers and consumers.

Will doing fork multiple times affect performance?

I need to read log files (.CSV) using fastercsv and save the contents of it in a db (each cell value is a record). The thing is there are around 20-25 log files which has to be read daily and those log files are really large (each CSV file is more then 7Mb). I had forked the reading process so that user need not have to wait a long time but still reading 20-25 files of that size is taking time (more then 2hrs). Now I want to fork reading of each file i.e there will be around 20-25 child process getting created, my question is can I do that? If yes will it affect the performance and is fastercsv able to handle this?
ex:
for report in #reports
pid = fork {
.
.
.
}
Process.dispatch(pid)
end
PS:I'm using rails 3.0.7 and Its going to happen in server which is running in amazon's large instance(7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform)
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc via several processes isn't going to speed that up at once, though I suppose if the disc had a big cache it might help a bit.
Also, 7MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function, where you can load in formatted data directly, or you could turn each row into an INSERT and file that straight into the database. I don't know how you're doing it at the moment so these are just guesses.
Of course, having said all that, the only way to be sure is to try it!