Bigquery : Unexpected. Please try again when loading a 53GB CSV/ 1.4GB gZIP - google-bigquery

I was trying to load 1.4Gb gZIP data in to my BigQuery table and i am getting the error Unexpected. Please try again consistently
job_7f1aa8d29ae641459c82243530eb1c65
I was trying to load a structure Row ID,Order Priority,Discount,Unit Price,Shipping Cost,Customer ID,Customer Name,Ship Mode,Product Category,Product Sub-Category,Product Base Margin,Region,State or Province,City,Postal Code,Order Date,Ship Date,Profit,Quantity ordered new,Sales,Order ID
the error is not clear on whats going wrong.
anyone else encountered this error?
Thanks.

It looks like your job ran out of time-- a 53 GB CSV file is a lot to process in one thread. Best practice is to either split your data in multiple chunks, or upload uncompressed data which can be processed in parallel.
I'm in the process of raising the allowed time somewhat, and we'll work on improving the error message when this happens.

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application, that reads something like 70K records from the DB to a data frame, each record has 2 fields.
After reading the data from the DB, we make minor mapping and load this as a broadcast for later usage.
Now, in local environment, there is an exception, timeout from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR
org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at
org.apache.spark.network.client.
TransportClientFactory.createClient(Transpor .tClientFactory.java:253)
at
org.apache.spark.network.client.
TransportClientFactory.createClient(TransportClientFactory.java:195)
at
org.apache.spark.network.netty.
NettyBlockTransferService$$anon$2.
createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the spark session with local "spark.master"
When I limit the max of records to 20K, it works well.
Can you please help? maybe I need to configure something in my local environment in order that the original code will work properly?
Update:
I tried to change a lot of Spark-related configurations in my local environment, both memory, a number of executors, timeout-related settings, and more, but nothing helped! I just got the timeout after more time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records, while trying to repartition with 2 or more partitions the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a configuration in the spark that can solve this instead of repartition?
Thanks!

Snowflake COPY INTO from JSON - ON_ERROR = CONTINUE - Weird Issue

I am trying to load JSON file from Staging area (S3) into Stage table using COPY INTO command.
Table:
create or replace TABLE stage_tableA (
RAW_JSON VARIANT NOT NULL
);
Copy Command:
copy into stage_tableA from #stgS3/filename_45.gz file_format = (format_name = 'file_json')
Got the below error when executing the above (sample provided)
SQL Error [100069] [22P02]: Error parsing JSON: document is too large, max size 16777216 bytes If you would like to continue loading
when an error is encountered, use other values such as 'SKIP_FILE' or
'CONTINUE' for the ON_ERROR option. For more information on loading
options, please run 'info loading_data' in a SQL client.
When I had put "ON_ERROR=CONTINUE" , records got partially loaded, i.e until the record with more than max size. But no records after the Error record was loaded.
Was "ON_ERROR=CONTINUE" supposed to skip only the record that has max size and load records before and after it ?
Yes, the ON_ERROR=CONTINUE skips the offending line and continues to load the rest of the file.
To help us provide more insight, can you answer the following:
How many records are in your file?
How many got loaded?
At what line was the error first encountered?
You can find this information using the COPY_HISTORY() table function
Try setting the option strip_outer_array = true for file format and attempt the loading again.
The considerations for loading large size semi-structured data are documented in the below article:
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
I partially agree with Chris. The ON_ERROR=CONTINUE option only helps if the there are in fact more than 1 JSON objects in the file. If it's 1 massive object then you would simply not get an error or the record loaded when using ON_ERROR=CONTINUE.
If you know your JSON payload is smaller than 16mb then definitely try the strip_outer_array = true. Also, if your JSON has a lot of nulls ("NULL") as values use the STRIP_NULL_VALUES = TRUE as this will slim your payload as well. Hope that helps.

Flink s3 read error: Data read has a different length than the expected

Using flink 1.7.0, but also seen on flink 1.8.0. We are getting frequent but somewhat random errors when reading gzipped objects from S3 through the flink .readFile source:
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Data read has a different length than the expected: dataLength=9713156; expectedLength=9770429; includeSkipped=true; in.getClass()=class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client$2; markedSupported=false; marked=0; resetSinceLastMarked=false; markCount=0; resetCount=0
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.checkLength(LengthCheckInputStream.java:151)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:93)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:76)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.closeStream(S3AInputStream.java:529)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:490)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.flink.fs.s3.common.hadoop.HadoopDataInputStream.close(HadoopDataInputStream.java:89)
at java.util.zip.InflaterInputStream.close(InflaterInputStream.java:227)
at java.util.zip.GZIPInputStream.close(GZIPInputStream.java:136)
at org.apache.flink.api.common.io.InputStreamFSInputWrapper.close(InputStreamFSInputWrapper.java:46)
at org.apache.flink.api.common.io.FileInputFormat.close(FileInputFormat.java:861)
at org.apache.flink.api.common.io.DelimitedInputFormat.close(DelimitedInputFormat.java:536)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$SplitReader.run(ContinuousFileReaderOperator.java:336)
ys
Within a given job, we generally see many / most of the jobs read successfully, but there's pretty much always at least one failure (say out of 50 files).
It seems this error is actually originating from the AWS client, so perhaps flink has nothing to do with it, but I'm hopeful someone might have an insight as to how to make this work reliably.
When the error occurs, it ends up killing the source and canceling all the connected operators. I'm still new to flink, but I would think that this is something that could be recoverable from a previous snapshot? Should I expect that flink will retry reading the file when this kind of exception occurs?
Maybe you can try to add more connection for s3a like
flink:
...
config: |
fs.s3a.connection.maximum: 320

File: 0: Unexpected from Google BigQuery load job

I've a compressed json file (900MB, newline delimited) and load into a new table via bq command and get the load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with the --max_bad_records, still not useful error message
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
And also cannot find any useful message in the console.
To BigQuery team, can you have a look using the job ID?
As far I know there are two error sections on a job. There is one error result, and that's what you see now. And there is a second, which should be a stream of errors. This second is important as you could have errors in it, but the actual job might succeed.
Also you can set the --max_bad_records=3 on the BQ tool. Check here for more params https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that is for each line, so you should try a sample set from this big file first.
Also there is an open feature request to improve the error message, you can star (vote) this ticket https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing that: We need an endpoint where we can query based on a jobid, the state, or the stream of errors. It would help a lot to get a full list of errors, it would help debugging the BQ jobs. This could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.

Determine actual errors from a load job

Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When monitoring the status of the load job, it takes a surprisingly long time (but perhaps this is due to working out the schema), but then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
?}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I pro-grammatically find out what is going wrong? All I can find is:
if (err != null) {
log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
try {
log.error(err.toPrettyString());
}
...
In general I am having a difficult time finding good documentation for some of these things and am working it out by trial and error and short snippets of code found on here and older groups. If there is a better source of information than the getting started guides, then I would appreciate any pointers to that information. The Javadoc does not really help and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted via a NEWLINE_DELIMITIED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seems to work correctly as separate Java code to list the projects, and the datasets in the project all works correctly. So I just need help in working what the actual error is - does it not like the schema (I have records nested within records for instance), or does it think that there is an error in the data I am submitting.
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(jobId), which returns a job object that has an errorResult object that has the error that caused the job to fail (e.g. "too many errors"). The errorStream list is a lost of all of the errors on the job, which should tell you which lines hit errors.
Note if you have the job id, it may be easier to use bq to lookup the job -- you can run bq show <job_id> to get the job error information. If you add the --format=prettyjson it will print out all of the information in the job.
A hint you also might want to consider is to supply your own job id when you create the job -- then even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error) you can look up the job to see what actually happened.
To tell BigQuery that some errors are allowed during import, you can use the maxBadResults setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().