Why does my Dataflow output "timeout value is negative" on insertion to BigQuery?

I have a Dataflow job consisting of ReadSource, ParDo, Windowing, and Insert (into a date-partitioned table in BigQuery).
It basically does the following:
1. Reads text files from a Google Cloud Storage bucket using a glob.
2. Processes each line by splitting on a delimiter, changing some values, and giving each column a name and data type, then outputs the result as a BigQuery table row together with a timestamp based on the data.
3. Windows the rows into daily windows using the timestamp from step 2.
4. Writes to BigQuery, using a per-window table and the "dataset$datepartition" syntax to specify the table and partition. The create disposition is set to CREATE_IF_NEEDED and the write disposition to WRITE_APPEND.
The first three steps seem to run fine, but in most cases the job runs into a problem on the last insert step, which logs this exception:
java.lang.IllegalArgumentException: timeout value is negative
at java.lang.Thread.sleep(Native Method)
at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:287)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2446)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2404)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:287)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:223)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This exception is repeated ten times.
Finally, I get "Workflow failed", as shown below:
Workflow failed. Causes: S04:Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/Reshuffle/
GroupByKey/Read+Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey/
GroupByWindow+Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/Reshuffle/
ExpandIterable+Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/ParDo(StreamingWrite)
failed.
Sometimes the same job with the same input works without any problem, which makes this quite hard to debug. Where should I start?

This is a known issue with the BigQueryIO streaming write operation in Dataflow SDK for Java 1.7.0. It is fixed in the GitHub HEAD and the fix will be included in the 1.8.0 release of the Dataflow Java SDK.
For more details, see Issue #451 on the DataflowJavaSDK GitHub repository.

Related

Unable to select count of rows of an ORC table through Hive Beeline command

I am using the following components: Hadoop 3.1.4, Hive 3.1.3 and Tez 0.9.2.
I have an ORC table and I am trying to get its row count with select count(*) from ORC_TABLE, which throws the exceptions below:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1670915386694_0182_1_00, diagnostics=[Vertex vertex_1670915386694_0182_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: jio_ar_consumer_events initializer failed, vertex=vertex_1670915386694_0182_1_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)
... 17 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.AcidUtils.lambda$getAcidState$0(AcidUtils.java:1117)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:220)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1464)
at java.util.Collections.sort(Collections.java:177)
at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1115)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:1207)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$1500(OrcInputFormat.java:1142)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1179)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1176)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1176)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1142)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
There is another question where the same problem is described (ORC Split Generation issue with Hive Table),
but it has no solution yet.
I also tried running CONCATENATE on the ORC table, but that didn't help either.
What does work is select * from ORC_TABLE, with or without LIMIT; that returns the records. I reckon the issue is limited to aggregate functions, or maybe I don't understand the issue yet.
I am also using Spark 3.3.1, and through Spark SQL I can get the same count and fetch the rows as well. There are no issues on the Spark side.
In addition, when I change the execution engine to MR, the query works. It fails only when I run it on the Tez engine.
Any leads on resolving this issue would be much appreciated.
The issue was resolved with the steps below, based on my earlier analysis:
The class org.apache.hadoop.fs.FileStatus ships as part of the hadoop-common jar.
We were using Hadoop 3.1.4 and Tez 0.9.2.
Tez 0.9.2 ships a tez.tar.gz that has to be placed in an HDFS location. This tez.tar.gz contained hadoop-common-2.7.2.jar, which does not have the compareTo(FileStatus) method named in the NoSuchMethodError above.
Solution 1:
1. We extracted tez.tar.gz and replaced all Hadoop 2.7.2 jars with the Hadoop 3.1.4 jars. Do this if you don't want to reconfigure everything for a new Tez version; otherwise, you could follow Solution 2 below.
2. We recreated the tarball and placed it in every location that depends on it, including HDFS. For us that was /user/tez/share/tez.tar.gz; your location may differ.
The error disappeared after I followed these steps, and I can now count records on any table.
Solution 2:
The other, easier option is to use a Tez 0.10.x release, which contains libraries for Hadoop 3.x, rather than Tez 0.9.2, which is compatible with Hadoop 2.7.x.
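Before repacking the tarball it can help to confirm which Hadoop jars it actually bundles. The following is a minimal sketch using only Python's standard tarfile module; the function names are hypothetical, and it assumes the jars follow the usual name-version.jar pattern (as hadoop-common-2.7.2.jar does) and that Tez's own hadoop-shim jars should be ignored because they carry the Tez version.

```python
import re
import tarfile

# Matches jar names such as "hadoop-common-2.7.2.jar" and captures name and version.
HADOOP_JAR = re.compile(r"(hadoop-[a-z0-9-]+?)-(\d+\.\d+\.\d+)\.jar$")


def hadoop_jars_in_tarball(path):
    """Return sorted (jar_name, version) pairs for Hadoop jars inside a .tar.gz."""
    found = []
    with tarfile.open(path, "r:gz") as tar:
        for name in tar.getnames():
            match = HADOOP_JAR.search(name)
            if not match:
                continue
            jar, version = match.group(1), match.group(2)
            # Assumption: hadoop-shim jars belong to Tez and use the Tez version.
            if jar.startswith("hadoop-shim"):
                continue
            found.append((jar, version))
    return sorted(found)


def mismatched_jars(path, expected_version):
    """List bundled Hadoop jars whose version differs from the cluster's Hadoop."""
    return [(jar, ver) for jar, ver in hadoop_jars_in_tarball(path)
            if ver != expected_version]
```

Calling mismatched_jars("tez.tar.gz", "3.1.4") on a local copy of the tarball should list any jars still built for another Hadoop release; an empty list suggests the repack covered everything matched by the pattern.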

BigQuery internal error during copy job to move tables between datasets

I'm currently migrating around 200 tables in BigQuery (BQ) from one dataset (FROM_DATASET) to another (TO_DATASET). Each of these tables has a _TABLE_SUFFIX corresponding to a date (I have three years of data for each table). Each suffix typically contains between 5 GB and 80 GB of data.
I'm doing this using a Python script that asks BQ, for each table, for each suffix, to run the following query:
-- example table=T_SOME_TABLE, suffix=20190915
CREATE OR REPLACE TABLE `my-project.TO_DATASET.T_SOME_TABLE_20190915`
COPY `my-project.FROM_DATASET.T_SOME_TABLE_20190915`
Everything works except for three tables (and all their suffixes) where the copy job fails at each _TABLE_SUFFIX with this error:
An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 4893854
Retrying the job after some time actually works, but of course it slows the process down. Does anyone have an idea what the problem might be?
Thanks.
It turned out that those three problematic tables were some legacy ones with lots of columns. In particular, the BQ GUI shows this warning for two of them:
"Schema and preview are not displayed because the table has too many columns and may cause the BigQuery console to become unresponsive"
This was probably the issue.
In the end, I managed to migrate everything by implementing a backoff mechanism to retry failed jobs.
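For anyone wanting something similar, here is a minimal sketch of such a backoff wrapper. It is not the exact script used for the migration; run_with_backoff and its parameters are hypothetical names, and it retries on any exception, which you would probably narrow to transient errors in a real script.

```python
import random
import time


def run_with_backoff(job, max_attempts=5, base_delay=1.0, max_delay=60.0,
                     sleep=time.sleep):
    """Call job() until it succeeds, backing off exponentially between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Double the delay on each attempt, cap it, and add random jitter
            # so that many retrying jobs do not all fire at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

With the google-cloud-bigquery client, job could be something like lambda: client.query(sql).result(), where sql is the CREATE OR REPLACE TABLE ... COPY statement for one table and suffix.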

BigQuery Scheduled Data Transfer throws "Incompatible table partitioning specification." Error - but error message is truncated

I'm using the new BQ Data Transfer UI and upon scheduling a Data Transfer, the transfer fails.
The error message in Run History isn't terribly helpful as the error message seems truncated.
Incompatible table partitioning specification. Expects partitioning specification interval(type:hour), but input partitioning specification is ; JobID: xxxxxxxxxxxx
Note the part of the error that says "but input partitioning specification is", with nothing between it and the semicolon. It seems the error message is truncated.
Some details about the run:
The run imports data from a CSV file located in a GCS bucket on a nightly basis. Once the file is successfully ingested, the process deletes it. The target table in BQ is a partitioned table using the default partition pseudo column (_PARTITIONTIME).
What I have done so far:
Reran the scheduled Data Transfer -- it failed and threw the same error.
Deleted the target table in BQ and recreated it with different partition specifications (day, hour, month), then reran the scheduled transfer -- it failed and threw the same error.
Imported the data manually (I downloaded the file from GCS and uploaded it from my machine) using the BQ UI (Create Table, appending to the specific table) -- this worked perfectly.
Checked whether this was a known issue here on Stack Overflow and only found this question (which is now closed) -- close, but not exactly the same issue: BigQuery Data Transfer Service with BigQuery partitioned table.
What I'm holding off on, since it would take a bit more work:
Changing the schema of the target BQ table to include a dedicated column for partitioning.
Including a system-generated timestamp in the original file in GCS and ensuring the process recognizes it as the partitioning field.
Am I doing something wrong? Or is this a known issue?
Alright, I believe I have solved this. It looks like you need to include runtime parameters in your destination table name if the destination table is partitioned.
https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters
Specifically this section called "Runtime Parameter Examples" here: https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters#loading_a_snapshot_of_all_data_into_an_ingestion-time_partitioned_table
They also advise that minutes cannot be specified in these parameters.
You will need to append the parameters to your destination table details.
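To make the idea concrete: as I read the linked documentation, the destination table is written as a template such as mytable${run_time|"%Y%m%d"}, and the service substitutes the run time to select a partition. The sketch below only mimics that expansion locally so you can see which partition decorator a given format would produce. The table name and the helper are hypothetical, and the template syntax should be checked against the documentation above.

```python
from datetime import datetime, timezone


def resolve_destination(table, fmt, run_time):
    """Mimic the expansion of a run_time template into a partition decorator."""
    return f"{table}${run_time.strftime(fmt)}"


# Hypothetical run at 02:00 UTC on 1 March 2020.
run_time = datetime(2020, 3, 1, 2, 0, tzinfo=timezone.utc)

print(resolve_destination("mytable", "%Y%m%d", run_time))    # daily partition
print(resolve_destination("mytable", "%Y%m%d%H", run_time))  # hourly partition
```

Since the error in the question expects interval(type:hour), an hour-level format is presumably the one that would match an hourly-partitioned table.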

BigQuery autodetect doesn't work with inconsistent json?

I'm trying to upload JSON to BigQuery, with --autodetect so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.
Unfortunately I get the following failure:
Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 1209; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 829980: No such field:
mc.marketDefinition.settledTime.
Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3
Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?
Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?
I found two tools that can help:
bigquery-schema-generator 0.5.1, which uses all of the data to derive the schema instead of only 100 sample rows as BigQuery does.
Spark SQL; you need to set up your dev environment, or at least install Spark and invoke the spark-shell tool.
However, I noticed that the file is intended to fail; see this text in the link you shared: "Sample for BigQuery autodetect failure". So I'm not sure that such tools can work for a JSON file that is intended to fail.
Last but not least, I got the JSON imported after manually removing the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
Hope this info helps.
Yes, see documentation here:
https://cloud.google.com/bigquery/docs/schema-detect
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So if the data in the rest of the rows does not match the sampled rows, you should not use autodetect and need to provide an explicit schema.
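If you would rather build that explicit schema yourself, a rough sketch of the idea is below: read every row of the newline-delimited JSON and take the union of all fields seen. This is not how BigQuery or bigquery-schema-generator work internally; the function names are hypothetical, the type mapping is deliberately coarse (timestamps stay STRING), and the first type seen for a field wins.

```python
import json


def _bq_type(value):
    """Map a JSON value to a coarse BigQuery type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, dict):
        return "RECORD"
    return "STRING"


def _merge(schema, row):
    """Fold one JSON object into the schema dict, recursing into nested objects."""
    for key, value in row.items():
        items = value if isinstance(value, list) else [value]
        mode = "REPEATED" if isinstance(value, list) else "NULLABLE"
        for item in items:
            if item is None:
                continue  # nulls carry no type information
            field = schema.setdefault(
                key, {"type": _bq_type(item), "mode": mode, "fields": {}})
            if isinstance(item, dict):
                _merge(field["fields"], item)


def infer_schema(lines):
    """Scan every non-empty NDJSON line and return the merged schema dict."""
    schema = {}
    for line in lines:
        if line.strip():
            _merge(schema, json.loads(line))
    return schema


def to_bq_fields(schema):
    """Convert the merged dict to the list layout used by BigQuery schema files."""
    fields = []
    for name, spec in schema.items():
        entry = {"name": name, "type": spec["type"], "mode": spec["mode"]}
        if spec["type"] == "RECORD":
            entry["fields"] = to_bq_fields(spec["fields"])
        fields.append(entry)
    return fields
```

The output of to_bq_fields(infer_schema(open(path))) can be dumped with json.dump to a schema file and supplied to the load job in place of --autodetect, so a field such as mc.marketDefinition.settledTime that first appears late in the file is still part of the schema.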
Autodetect may not work well, since it samples only up to 100 rows to detect the schema. Running schema detection over the whole JSON input could be a costly endeavor.
How about using BqTail with the AllowFieldAddition option, which lets you expand the schema cost-effectively?
You could simply use the following ingestion workflow, with the CLI or serverless:
bqtail -r=rule.yaml -s=sourceURL
#rule.yaml
When:
  Prefix: /data/somefolder
  Suffix: .json
Async: false
Dest:
  Table: mydataset.mytable
  AllowFieldAddition: true
  Transient:
    Template: mydataset.myTableTempl
    Dataset: temp
Batch:
  MultiPath: true
  Window:
    DurationInSec: 15
OnSuccess:
  - Action: delete
See JSON with allow field addition e2e test case

Getting error from bq tool when uploading and importing data on BigQuery - 'Backend Error'

I'm getting the error "BigQuery error in load operation: Backend Error" when I try to upload and import data into BQ. I have already reduced the file size and increased the time between imports, but nothing helps. The strange thing is that if I wait for a while and retry, it just works.
In the BigQuery Browser tool it shows up as an error in some line/field, but I checked and there is none. Obviously the message is misleading, because if I wait and retry the upload/import of the same file, it works.
Thanks
I looked up our failing jobs in the BigQuery backend, and I couldn't find any jobs that terminated with "backend error". I found several that failed because ASCII nulls were found in the data. (It can be helpful to look at the error stream errors, not just the error result.) It is possible that the data got garbled on the way to BigQuery. Are you certain the data did not change between the failing import and the successful one on the same data?
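One quick way to rule the ASCII-null case in or out is to scan the file locally before uploading it. This is a minimal sketch using only the standard library; find_nul_bytes is a hypothetical helper name, and it reports byte offsets rather than line numbers.

```python
def find_nul_bytes(path, chunk_size=1 << 20, limit=10):
    """Return up to `limit` byte offsets at which an ASCII NUL (0x00) occurs."""
    offsets = []
    position = 0  # absolute offset of the start of the current chunk
    with open(path, "rb") as handle:
        while len(offsets) < limit:
            chunk = handle.read(chunk_size)
            if not chunk:
                break  # end of file
            start = 0
            while len(offsets) < limit:
                index = chunk.find(b"\x00", start)
                if index == -1:
                    break  # no more NULs in this chunk
                offsets.append(position + index)
                start = index + 1
            position += len(chunk)
    return offsets
```

An empty list means no NUL bytes were found; running it on the file from the failing import and again on the file from the successful one would also show whether the data changed in between.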
I've found that exporting from a BigQuery table to CSV in Cloud Storage hits the same error when certain characters are present in one of the columns (in this case, a column storing the raw results from a prediction analysis). Removing that column from the export resolved the issue.