I have been trying in vain for nearly two days to load two large datasets, each of which is ~30 GB and split into 50 uncompressed ~600 MB files, all coming from a bucket. Almost every job fails with an "internal" or "backend" error.
I have tried submitting with a wildcard (as in *csv) and I have also tried the individual files.
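For illustration only, a wildcard load of this sort might be submitted with the BigQuery Python client roughly as follows; the bucket, dataset, and table names are placeholders, and the schema handling is an assumption:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes the shards have a header row
    autodetect=True,       # or supply an explicit schema instead
)

# One wildcard URI covers all 50 uncompressed ~600 MB shards at once.
load_job = client.load_table_from_uri(
    "gs://my-bucket/dataset_a/*csv",       # placeholder bucket and prefix
    "my_project.my_dataset.my_table",      # placeholder destination table
    job_config=job_config,
)
load_job.result()   # blocks until the job finishes, raises on failure
print(load_job.output_rows, "rows loaded")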
On the rare occasion the load job does not fail within a few minutes, it will eventually die after 6 or 7 hours.
I have split the files and left them uncompressed to help with load times; could that be causing an issue? A compressed version did load successfully after about 7 hours yesterday, but so far the only uncompressed file I have managed to load from the bucket is a single 350 MB CSV.
Here is an example:
Errors:
Error encountered during execution. Retrying may solve the problem. (error code: backendError)
Job ID bvedemo:bquijob_64ebebf1_1532f1b3c4f
Backend error would imply something is happening at Google, but I must be doing something wrong to have it fail this often!
Lesson of the day: do not try to load data from a nearline bucket into BigQuery.
I moved the data into a standard bucket, reloaded from there and 65GB of data loaded in less than 1 minute.
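For anyone hitting the same thing, the move itself can be done with gsutil or, as a rough sketch, with the Cloud Storage Python client; the bucket names and prefix below are placeholders:

from google.cloud import storage

client = storage.Client()
src = client.bucket("my-nearline-bucket")   # placeholder source (Nearline class)
dst = client.bucket("my-standard-bucket")   # placeholder destination (Standard class)

# Copy every object under the prefix into the Standard-class bucket,
# then point the BigQuery load job at the new gs:// location.
for blob in client.list_blobs(src, prefix="dataset_a/"):
    src.copy_blob(blob, dst, blob.name)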
Sorry if this is a dumb question; I'm very new to NiFi.
I have set up a process group to dump SQL query results to CSV and then upload them to S3. It worked fine with small queries, but appears to be stuck with larger files.
The input queue to the PutS3Object processor has a limit of 1 GB, but the file it is trying to put is almost 2 GB. I have set the multipart parameters in the S3 processor to 100 MB, but it is still stuck.
So my theory is that PutS3Object needs a complete file before it starts uploading. Is this correct? Is there no way to get it uploading in a "streaming" manner? Or do I just have to increase the input queue size?
Or am I on the wrong track and something else is holding this all up?
The screenshot suggests that the large file is in PutS3Object's input queue, and PutS3Object is actively working on it (from the 1 thread indicator in the top-right of the processor box).
As it turns out, there were no errors, just a delay from processing a large file.
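This is not NiFi-specific, but as an illustration of how multipart settings usually behave, here is a sketch using boto3's managed transfer, which likewise uploads an already-complete local file in parts rather than streaming it as it is produced; the file name, bucket, and sizes are made up:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Roughly comparable to a 100 MB multipart setting: the file is written out
# in full first, then uploaded to S3 in 100 MB parts.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
)

s3.upload_file("query_dump.csv", "my-bucket", "exports/query_dump.csv",
               Config=config)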
We have a strange issue that happens quite often.
We have a process that fetches files from various sources and loads them into GCS. Then, and only if the file was uploaded successfully, we try to load it into the BigQuery table, and we get the error:
"Not found: Uris List of uris (possibly truncated): json: file_name: ...".
After a deep investigation, everything appears to be fine and we don't know what changed. Within the relevant time frame, the file referenced by the job exists in Cloud Storage and was uploaded to GCS 2 minutes before BigQuery tried to read it.
I should mention that we load every file as a whole batch directory in Cloud Storage, like gs://<bucket>/path_to_dir/*. Is that still supported?
Also, the file sizes are fairly small - from a few bytes to a few KB. Does that matter?
Job IDs for checking:
load_file_8e4e16f737084ba59ce0ba89075241b7 load_file_6c13c25e1fc54a088af40199eb86200d
Known issue with Cloud Storage consistency
As noted by Felipe, this was indeed related to a known issue with Cloud Storage. Google Cloud Storage Incident #16036 is shown to have been resolved since December 20, 2016. This was also being tracked in Issue 738. Though Cloud Storage list operations are eventually consistent, this incident displayed excessive delays in operations returning consistent results.
Handling Cloud Storage inconsistency
Though this was an isolated incident, it is nevertheless a good practice to have some means of handling such inconsistencies. Two such suggestions can be found in comment #10 of the related public issue.
Retry the load job if it failed.
Verify that the Cloud Storage results are consistent with your expectations.
Verify that the expected number of files (and their total size) was processed by BigQuery. You can get this information from the job metadata (see the sketch below).
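A minimal sketch of that verification with the Python clients, assuming a wildcard load and placeholder names throughout; a mismatch between the job metadata and the current listing suggests the listing BigQuery saw was not yet consistent and the load should be resubmitted:

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

uri = "gs://my-bucket/path_to_dir/*"           # placeholder
table_id = "my_project.my_dataset.my_table"    # placeholder

# What is actually in Cloud Storage right now?
blobs = list(gcs.list_blobs("my-bucket", prefix="path_to_dir/"))
expected_files = len(blobs)
expected_bytes = sum(b.size for b in blobs)

job = bq.load_table_from_uri(
    uri, table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON),
)
job.result()   # raises if the load failed; resubmitting is the retry

# Compare what BigQuery processed with what the listing shows now.
if job.input_files != expected_files or job.input_file_bytes != expected_bytes:
    print("Listing was inconsistent at load time; rerun the load job")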
Still getting unexpected results
Should you encounter such an issue again and have the appropriate error handling measures in place, I recommend first consulting the Google Cloud Status Dashboard and BigQuery public issue tracker for existing reports showing similar symptoms. If none exist, file a new issue on the issue tracker.
The solution was to move from a Multi-Region bucket (which had been set up before the Region type was available) to a Region bucket.
Since we moved, we have not faced this issue again.
I'm running a test using BigQuery. Basically I have 50,000 files, each of which is 27 MB in size, on average. Some are larger, some smaller.
Timing each file upload reveals:
real 0m49.868s
user 0m0.297s
sys 0m0.173s
Using something similar to:
time bq load --encoding="UTF-8" --field_delimiter="~" dataset gs://project/b_20130630_0003_1/20130630_0003_4565900000.tsv schema.json
Running the command "bq ls -j" and subsequently running "bq show -j " reveals the following errors:
Job Type State Start Time Duration Bytes Processed
load FAILURE 01 Jul 22:21:18 0:00:00
Errors encountered during job execution. Exceeded quota: too many imports per table for this table
After checking the database, the rows seem to have loaded fine, which is puzzling since, given the error, I would have expected nothing to have been loaded. The problem is that I really don't understand how I reached my quota limit, since I've only just started uploading files recently and thought the limit was 200,000 requests.
All the data is currently in Google Cloud Storage, so I would expect the loading to happen fairly quickly, since the interaction is between Cloud Storage and BigQuery, both of which are in the cloud.
By my calculations the entire load is going to take (50,000 * 49 seconds) around 28 days.
Kinda hoping these numbers are wrong.
Thanks.
The quota limit per table is 1000 loads per day. This is to encourage people to batch their loads, since we can generate a more efficient representation of the table if we can see more of the data at once.
BigQuery can perform load jobs in parallel. Depending on the size of your load, a number of workers will be assigned to your job. If your files are large, those files will be split among workers; alternately if you pass multiple files, each worker may process a different file. So the time that it takes for one file is not indicative of the time that it takes to run a load job with multiple files.
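To make the batching concrete, a single load job can reference many files at once, either with an explicit list of URIs or a wildcard. A rough sketch with the Python client, using placeholder names and the delimiter from the command above:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="~",
    encoding="UTF-8",
)

# One job for a whole prefix of .tsv shards instead of one job per file,
# which stays well under the per-table daily load quota.
job = client.load_table_from_uri(
    "gs://project/b_20130630_0003_1/*.tsv",   # placeholder wildcard
    "my_project.my_dataset.my_table",         # placeholder destination
    job_config=job_config,
)
job.result()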
We are having a number of backend errors on BigQuery's side when loading data files. Backend errors seem to be normal, occurring once or twice daily in our load jobs, which run every hour. In the last three days, we've had around 200 backend errors. This is causing cascading problems in our system.
Until the past three days, the system has been stable. The error is a simple "Backend error, try again." Usually the load job works when it's run again, but in the last three days the problem has become much worse. Please let me know if you need any other information from me.
We did have an issue on Friday where all load jobs were returning backend errors for a period of a few hours, but that should have been resolved. If you have seen backend errors since then, please let us know and send a job ID of a failing job.
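For transient errors like this, wrapping the load submission in a retry with backoff is the usual workaround. A rough sketch with the Python client; the names and retry limit are placeholders:

import time
from google.cloud import bigquery
from google.api_core.exceptions import InternalServerError, ServiceUnavailable

client = bigquery.Client()

def load_with_retries(uri, table_id, retries=5):
    # Resubmit a load job when it fails with a transient backend error.
    for attempt in range(retries):
        job = client.load_table_from_uri(uri, table_id)
        try:
            return job.result()
        except (InternalServerError, ServiceUnavailable):
            time.sleep(2 ** attempt)   # exponential backoff before retrying
    raise RuntimeError("load of %s failed after %d attempts" % (uri, retries))

load_with_retries("gs://my-bucket/hourly/*.csv",       # placeholder source
                  "my_project.my_dataset.my_table")    # placeholder table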
I need to read log files (.CSV) using FasterCSV and save their contents in a database (each cell value is a record). The thing is, there are around 20-25 log files that have to be read daily, and those files are really large (each CSV is more than 7 MB). I had forked the reading process so that the user does not have to wait a long time, but reading 20-25 files of that size still takes more than 2 hours. Now I want to fork the reading of each file, i.e. around 20-25 child processes would be created. My question is: can I do that? If yes, will it affect performance, and can FasterCSV handle this?
For example:
@reports.each do |report|
  pid = fork do
    # read this report's CSV with FasterCSV and save its rows to the DB
  end
  Process.detach(pid)
end
PS: I'm using Rails 3.0.7, and this will run on a server on an Amazon Large instance (7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform).
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup, because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc from several processes at once isn't going to speed that up, though I suppose if the disc has a big cache it might help a bit.
Also, 7 MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function where you can load formatted data in directly, or you could turn each row into an INSERT and feed that straight into the database. I don't know how you're doing it at the moment, so these are just guesses.
Of course, having said all that, the only way to be sure is to try it!