BigQuery: Not able to export to CSV - google-bigquery

I am able to import data from my google storage. However, having troubling exporting data to Google Cloud Storage CSV files through the web console. Data set is small, and I am not getting any specific reasons that cause the issue.
Extract9:30am
gl-analytics:glcqa.Device togs://glccsv/device.csv
Errors:
Unexpected. Please try again.
Job ID: job_f8b50cc4b4144e14a22f3526a2b76b75
Start Time: 9:30am, 24 Jan 2013
End Time: 9:30am, 24 Jan 2013
Source Table: gl-analytics:glcqa.Device
Destination URI: gs://glccsv/device.csv

It looks like you have a nested schema, which cannot be output to csv. Try setting the output format to JSON.
Note this bug has now been fixed internally, so after our next release you'll get a better error when this happens.

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application, that reads something like 70K records from the DB to a data frame, each record has 2 fields.
After reading the data from the DB, we make minor mapping and load this as a broadcast for later usage.
Now, in local environment, there is an exception, timeout from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR
org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at
org.apache.spark.network.client.
TransportClientFactory.createClient(Transpor .tClientFactory.java:253)
at
org.apache.spark.network.client.
TransportClientFactory.createClient(TransportClientFactory.java:195)
at
org.apache.spark.network.netty.
NettyBlockTransferService$$anon$2.
createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the spark session with local "spark.master"
When I limit the max of records to 20K, it works well.
Can you please help? maybe I need to configure something in my local environment in order that the original code will work properly?
Update:
I tried to change a lot of Spark-related configurations in my local environment, both memory, a number of executors, timeout-related settings, and more, but nothing helped! I just got the timeout after more time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records, while trying to repartition with 2 or more partitions the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a configuration in the spark that can solve this instead of repartition?
Thanks!

BigQuery export to Avro format failing with "Internal Error"

From BigQuery, I've tried running (since Apr 5) several export jobs (Avro format) to GCS, but I keep getting the error: "Internal Error".
What am I doing wrong?
Screenshot of the problem
It looks like you are hitting the same error as in this issue tracker report; the internal error is due to the format of a particular column in your table. Please start or cc yourself on the bug to receive updates.

Unexpected. Please try again, error message

Trying to import a 2.8GB of file via GS to Big query, and the job failed with:
Unexpected. Please try again.
here are some other outputs
Job ID: aerobic-forge-504:job_a6H1vqkuNFf-cJfAn544yy0MfxA
Start Time: 5:12pm, 3 Jul 2014
End Time: 7:12pm, 3 Jul 2014
Destination Table: aerobic-forge-504:wr_dev.phone_numbers
Source URI: gs://fls_csv_files/2014-6-11_Global_0A43E3B1-2E4A-4CA9-BD2A-012B4D0E4C69.txt
Source Format: CSV
Allow Quoted Newlines: true
Allow Jagged Rows: true
Ignore Unknown Values: true
Schema:
area: INTEGER
number: INTEGER
The job failed due to timeout; there is a maximum of 2 hours allowed for processing; after that the import job is killed. I'm not sure why the import was so slow; from what I can tell we only processed at about 100KB/sec, which is far slower than expected. It is quite possible that the error is transient.
In the future, you can speed up the import by setting allow_quoted_newlines to false, which will allow bigquery to process the import in parallel. Alternately, you can partition the file yourself and send multiple file paths in the job.
Can you try again and let us know whether it works?

File: 0: Unexpected from Google BigQuery load job

I've a compressed json file (900MB, newline delimited) and load into a new table via bq command and get the load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with the --max_bad_records, still not useful error message
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
And also cannot find any useful message in the console.
To BigQuery team, can you have a look using the job ID?
As far I know there are two error sections on a job. There is one error result, and that's what you see now. And there is a second, which should be a stream of errors. This second is important as you could have errors in it, but the actual job might succeed.
Also you can set the --max_bad_records=3 on the BQ tool. Check here for more params https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that is for each line, so you should try a sample set from this big file first.
Also there is an open feature request to improve the error message, you can star (vote) this ticket https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing that: We need an endpoint where we can query based on a jobid, the state, or the stream of errors. It would help a lot to get a full list of errors, it would help debugging the BQ jobs. This could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.

Bigquery : Unexpected. Please try again when loading a 53GB CSV/ 1.4GB gZIP

I was trying to load 1.4Gb gZIP data in to my BigQuery table and i am getting the error Unexpected. Please try again consistently
job_7f1aa8d29ae641459c82243530eb1c65
I was trying to load a structure Row ID,Order Priority,Discount,Unit Price,Shipping Cost,Customer ID,Customer Name,Ship Mode,Product Category,Product Sub-Category,Product Base Margin,Region,State or Province,City,Postal Code,Order Date,Ship Date,Profit,Quantity ordered new,Sales,Order ID
the error is not clear on whats going wrong.
anyone else encountered this error?
Thanks.
It looks like your job ran out of time-- a 53 GB CSV file is a lot to process in one thread. Best practice is to either split your data in multiple chunks, or upload uncompressed data which can be processed in parallel.
I'm in the process of raising the allowed time somewhat, and we'll work on improving the error message when this happens.