Apache Beam Java 2.26.0: BigQueryIO 'No rows present in the request' - google-bigquery

Since the Beam 2.26.0 update we have been running into errors in our Java SDK streaming data pipelines. We have been investigating the issue for quite some time now but are unable to track down the root cause. When downgrading to 2.25.0 the pipeline works as expected.
Our pipelines are responsible for ingestion, i.e., they consume from Pub/Sub and ingest into BigQuery. Specifically, we use the PubsubIO source and the BigQueryIO sink (streaming mode). When running the pipeline, we encounter the following error:
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "No rows present in the request.",
    "reason" : "invalid"
  } ],
  "message" : "No rows present in the request.",
  "status" : "INVALID_ARGUMENT"
}
Our initial guess was that the pipeline's logic was somehow bugged, causing the BigQueryIO sink to fail. After investigation, we concluded that the PCollection feeding the sink does indeed contain correct data.
Earlier today I was looking in the changelog and noticed that the BigQueryIO sink received numerous updates. I was specifically worried about the following changes:
BigQuery’s DATETIME type now maps to Beam logical type org.apache.beam.sdk.schemas.logicaltypes.SqlTypes.DATETIME
Java BigQuery streaming inserts now have timeouts enabled by default. Pass --HTTPWriteTimeout=0 to revert to the old behavior
With respect to the first change, I made sure to remove all DATETIME values from the resulting TableRow objects. Even in this scenario, the error persists.
For the second change, I'm unsure how to pass the --HTTPWriteTimeout=0 flag to the pipeline. How is this best achieved?
Any other suggestions as to the root cause of this issue?
Thanks in advance!

We have finally been able to fix this issue and rest assured it has been a hell of a ride. We basically debugged the entire BigQueryIO connector and came to the following conclusions:
The TableRow objects being forwarded to BigQuery contained enum values. Because these are not serializable, an empty payload was forwarded to BigQuery. In my opinion, this error should be made more explicit (and why was this suddenly changed anyway?).
The issue was solved by adding the @Value annotation (com.google.api.client.util.Value) to each enum constant.
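For reference, a minimal sketch of the fix with an illustrative enum (the name and constants are not our actual model):
import com.google.api.client.util.Value;

// Without @Value the enum constants were not serialized, which resulted in an
// empty insert payload. Enum name and constants are illustrative only.
public enum DeviceState {
  @Value("ACTIVE")
  ACTIVE,

  @Value("INACTIVE")
  INACTIVE
}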
The same TableRow objects also contained values of type byte[]. These values were written to a BigQuery column of type BYTES. While this used to work without explicitly computing a base64 encoding, it now yields errors.
The issue was solved by computing the base64 encoding ourselves (this setup is also discussed in the following post).
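A minimal sketch of that encoding step (the field name "payload" is just an example):
import com.google.api.services.bigquery.model.TableRow;
import java.util.Base64;

// Base64-encode the raw bytes ourselves before setting them on the TableRow so
// the BYTES column receives a valid value. The field name is illustrative.
static TableRow toRow(byte[] payload) {
  return new TableRow()
      .set("payload", Base64.getEncoder().encodeToString(payload));
}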

--HTTPWriteTimeout is a pipeline option. You can set it the same way you set the runner, etc. (typically on the command line).
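For example, assuming the pipeline options are built from the program arguments with PipelineOptionsFactory (a sketch, not the asker's actual launcher), adding --HTTPWriteTimeout=0 to those arguments is enough:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class Launcher {
  public static void main(String[] args) {
    // args include e.g. "--runner=DataflowRunner" and "--HTTPWriteTimeout=0"
    PipelineOptions options = PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the transforms as usual ...
    pipeline.run();
  }
}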

Related

TFTransform ValueError: "Split does not exist over all example artifacts"

I'm attempting to construct a TFX pipeline, but keep running into an error during the TFTransform component step. After diving into the error message and its code on GitHub, it appears to have something to do with the function get_split_uris(). From what I can glean, there is a mismatch between the number of Artifacts being consumed by this function at runtime and the number of URIs being retrieved and matched back to this list.
It's odd because my CSVExampleGen() function doesn't seem to have any problems ingesting my original data set that's already split into two CSV files: 'target' and 'candidate'. I cannot find any documentation on this error on the TFX website so my apologies for not having more information.
I can provide additional details if needed.

RavenDB: Document Refresh feature does not run at or after the time specified by the @refresh flag

I need to mark documents as expired after some time, and therefore I am trying to use the @refresh feature to re-run a subscription and compute my 'expired' flag. I know there is the 'Document expiration' feature, but that one removes the data, which I don't want.
I have turned on the Refresh feature in the settings and added a @refresh UTC datetime to the metadata of the required documents. For example, I manually added this document:
{
    "Name": "My data",
    "@metadata": {
        "@collection": "Testing",
        "@refresh": "2021-04-30T07:41:35.4845961Z"
    }
}
It looks like I am facing non-deterministic behavior: sometimes the refresh is processed, sometimes not. I tried different combinations of times, set either through code or via Raven Studio.
Refresh interval is set to refresh but still says "in less than a minute"
I am using
Community license (Document refresh is not mentioned here, but I don't see it mentioned for any other licenses either)
community license extensions
tried multiple versions of RavenDB with the same result (5.1.7 looked more promising, as it worked for some time but stopped after a while):
4.2.111 server/studio version in Docker on Windows 10
5.1.7 server/studio version
C# RavenDB.Client 5.1.6
I did not find a related issue in the bug tracker:
https://issues.hibernatingrhinos.com/issues/RavenDB?q=document%20refresh
Any ideas what to check or what might be the cause?
EDIT: After turning on logging to the console, I found an error log entry. It looks like this:
RavendbProject, Raven.Server.Documents.Expiration.ExpiredDocumentsCleaner, Failed to refresh documents on RavendbProject which are older than 05/17/2021 09:48:47, EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object.
RavendbProject | at Sparrow.Server.ByteStringContext`1.From(String value, ByteStringType type, ByteString& str) in C:\Builds\RavenDB-Stable-5.1\51024\src\Sparrow.Server\ByteString.cs:line 1297
RavendbProject | at Raven.Server.Documents.DocumentPutAction.PutDocument(DocumentsOperationContext context, String id, String expectedChangeVector, BlittableJsonReaderObject document, Nullable`1 lastModifiedTicks, String changeVector, DocumentFlags flags, NonPersistentDocumentFlags nonPersistentFlags) in C:\Builds\RavenDB-Stable-5.1\51024\src\Raven.Server\Documents\DocumentPutAction.cs:line 190
Also worth mentioning is that my document was stored in a cluster-wide transaction, and thus I can see the corresponding flag in one of my documents:
"@flags": "FromClusterTransaction",
My current suspicion is that one of these documents prevented the other documents from being refreshed. After deleting the cluster-transaction document, the other documents in the collection were refreshed.
The bug is related to documents that were added via a cluster transaction; the workaround for now is to not use cluster transactions.
I have opened an issue on the bug tracker:
https://issues.hibernatingrhinos.com/issue/RavenDB-16710

How to avoid duplicates in BigQuery by streaming with Apache Beam IO?

We are using a pretty simple flow where messages are retrieved from PubSub, their JSON content is flattened into two types (one for BigQuery and one for Postgres) and then inserted into both sinks.
But we are seeing duplicates in both sinks (Postgres was more or less fixed with a unique constraint and an "ON CONFLICT... DO NOTHING").
At first we trusted the "insertId" UUID that Apache Beam/BigQuery supposedly creates.
Then we added a "unique_label" attribute to each message before queueing them into PubSub, using data from the JSON itself that gives them uniqueness (a device_id + a reading's timestamp), and subscribed to the topic using that attribute with the "withIdAttribute" method.
Finally we paid for GCP Support, and their "solutions" do not work. They even told us to use the Reshuffle transform, which is deprecated by the way, and some windowing (which we do not want since we want near-real-time data).
This is the main flow, pretty basic:
[UPDATED WITH LAST CODE]
Pipeline
val options = PipelineOptionsFactory.fromArgs(*args).withValidation().`as`(OptionArgs::class.java)
val pipeline = Pipeline.create(options)

var mappings = ""

// Value only available at runtime
if (options.schemaFile.isAccessible) {
    mappings = readCloudFile(options.schemaFile.get())
}

val tableRowMapper = ReadingToTableRowMapper(mappings)
val postgresMapper = ReadingToPostgresMapper(mappings)

val pubsubMessages =
    pipeline
        .apply("ReadPubSubMessages",
            PubsubIO
                .readMessagesWithAttributes()
                .withIdAttribute("id_label")
                .fromTopic(options.pubSubInput))

pubsubMessages
    .apply("AckPubSubMessages", ParDo.of(object : DoFn<PubsubMessage, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info("Processing readings: " + context.element().attributeMap["id_label"])
            context.output("")
        }
    }))

val disarmedMessages =
    pubsubMessages
        .apply("DisarmedPubSubMessages",
            DisarmPubsubMessage(tableRowMapper, postgresMapper)
        )

disarmedMessages
    .get(TupleTags.readingErrorTag)
    .apply("LogDisarmedErrors", ParDo.of(object : DoFn<String, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info(context.element())
            context.output("")
        }
    }))

disarmedMessages
    .get(TupleTags.tableRowTag)
    .apply("WriteToBigQuery",
        BigQueryIO
            .writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
            .to(options.bigQueryOutput)
    )

pipeline.run()
DisarmPubsubMessage is a PTransform that uses the FlatMapElements transform to get TableRow and ReadingsInputFlatten (our own class for Postgres) outputs.
We expect zero duplicates, or at least a "best effort" (we can add some cleaning cron job); we paid for these products to run statistics and big-data analysis...
[UPDATE 1]
I even appended a new simple transform that logs our unique attribute through a ParDo, which supposedly should ack the PubsubMessage, but this is not the case:
new flow with AckPubSubMessages step
Thanks!!
Looks like you are using the global window. One technique would be to window this into an N-minute window, then process the keys in the window and drop any items with duplicate keys.
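A rough sketch of that idea, assuming the unique key can be read from an "id_label" field on the row (that field name is an assumption, not part of your schema):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// N-minute fixed windows plus per-window de-duplication on a representative key.
class Dedup {
  static PCollection<TableRow> dedupInWindows(PCollection<TableRow> rows) {
    return rows
        .apply("Window", Window.<TableRow>into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply("DropDupKeys", Distinct
            .<TableRow, String>withRepresentativeValueFn(row -> (String) row.get("id_label"))
            .withRepresentativeType(TypeDescriptors.strings()));
  }
}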
The officially supported programming languages are Python and Java; your code appears to be Kotlin, and as far as I know that is not officially supported. I strongly recommend using Java to avoid relying on an unsupported language.
In addition, I would recommend the following approaches for dealing with duplicates; option 2 could meet your need for near-real-time data:
message_id. You have probably already read the FAQ on duplicates, which points to a deprecated doc. However, if you check the PubsubMessage object you will notice that messageId is still available, and it will be populated if not set by the publisher:
"ID of this message, assigned by the server when the message is
published ... It must not be populated by the publisher in a
topics.publish call"
BigQuery Streaming. To deal with duplicates while loading the data, you can create a UUID right before inserting into BQ (see the sketch after this list). Please refer to the section Example sink: Google BigQuery.
Try the Dataflow template PubSubToBigQuery and validate that there are no duplicates in BQ.
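For option 2, a minimal sketch of attaching a UUID right before the BigQueryIO write could look like the DoFn below; the "insert_uuid" column name is an assumption, and duplicates would still have to be removed later (e.g. by a scheduled query) using that column:
import com.google.api.services.bigquery.model.TableRow;
import java.util.UUID;
import org.apache.beam.sdk.transforms.DoFn;

// Tag each row with a UUID just before the BigQueryIO write step.
class AttachUuidFn extends DoFn<TableRow, TableRow> {
  @ProcessElement
  public void processElement(@Element TableRow row, OutputReceiver<TableRow> out) {
    TableRow copy = row.clone();
    copy.set("insert_uuid", UUID.randomUUID().toString());
    out.output(copy);
  }
}
You would apply it with ParDo.of(new AttachUuidFn()) just before the WriteToBigQuery step.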

How to get more info than the generic "Failed to parse JSON: No active field found.; ParsedString returned false; Could not parse value" on BigQuery load?

We're trying BigQuery for the first time, with data extracted from mongo in json format. I kept getting this generic parse error upon loading the file. But then I tried a smaller subset of the file, 20 records, and it loaded fine. This tells me it's not the general structure of the file, which I had originally thought was the problem. Is there any way to get more info on the parse error, such as the string of the record that it's trying to parse when it has this error?
I also tried using the max errors field, but that didn't work either.
This was via the website. I also tried it via the Google Cloud SDK command line 'bq load...' and got the same error.
This error is most likely caused by some of the JSON records not complying with the table schema. It is not clear whether you used the schema autodetect feature or supplied a schema for the load, but here is one example where such an error could happen:
{ "a" : "1" }
{ "a" : { "b" : "2" } }
If you only have a few of these and they really are invalid records, you can automatically ignore them by using the max_bad_records option for the load job. More details at: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
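If it helps, here is a sketch of the same load via the BigQuery Java client (dataset, table and bucket names are placeholders; with the bq CLI the equivalent is the --max_bad_records flag):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Load newline-delimited JSON from GCS and tolerate up to 10 bad records
// instead of failing the whole job.
public class LoadWithMaxBadRecords {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(
                TableId.of("my_dataset", "my_table"),
                "gs://my-bucket/export.json",
                FormatOptions.json())
            .setMaxBadRecords(10)
            .build();
    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    System.out.println(job.getStatus().getError()); // null when the load succeeded
  }
}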

JSON Schema for FHIR false positives

I am new to JSON Schema, and am trying to validate JSON against the HL7-FHIR schemas. Data that I think should be invalid (and that the official Java-based validator says is invalid) shows up as valid.
For example, {"dog": "food"} should be invalid, because when I run the validator, I get:
> java -jar org.hl7.fhir.validator.jar bad.json -defn definitions.json.zip
.. load FHIR from definitions.json.zip
.. connect to tx server # http://tx.fhir.org/r3
(vnull-null)
.. validate
*FAILURE* validating bad.json: error:1 warn:0 info:0
Fatal # $ (line 1, col2) : Unable to find resourceType property
But if I paste the fhir.schema.json file from here into a JSON Schema validator like the one here, and evaluate {"dog": "food"}, it's valid.
It's valid even if I supply a resourceType, which I thought might cause the restrictions to kick in. It's also valid if I copy an example I expect to be valid—say, this Practitioner example—and change some of the types (set name to be a string rather than an array, for example).
I'm not sure if I'm running into a problem with the HL7-FHIR JSON Schema in particular or with JSON Schemas in general. I believe my question is different than this one because it appears that we're up to release 3.0, and so the schema I'm using is updated.