How to avoid duplicates in BigQuery by streaming with Apache Beam IO? - google-bigquery

We are using a pretty simple flow where messages are retrieved from PubSub, their JSON content is flattened into two types (for BigQuery and Postgres) and then inserted into both sinks.
However, we are seeing duplicates in both sinks (Postgres was more or less fixed with a unique constraint and an "ON CONFLICT... DO NOTHING").
At first we trusted the "insertId" UUID that Apache Beam/BigQuery supposedly creates.
Then we added a "unique_label" attribute to each message before queueing it into PubSub, built from data in the JSON itself that gives it uniqueness (a device_id + a reading's timestamp), and subscribed to the topic using that attribute with the "withIdAttribute" method.
Finally we paid for GCP Support, and their "solutions" do not work. They even told us to use the Reshuffle transform, which is deprecated by the way, and some windowing (which we do not want, since we want near-real-time data).
This is the main flow, pretty basic:
[UPDATED WITH LATEST CODE]
Pipeline
val options = PipelineOptionsFactory.fromArgs(*args).withValidation().`as`(OptionArgs::class.java)
val pipeline = Pipeline.create(options)
var mappings = ""
// Value only available at runtime
if (options.schemaFile.isAccessible){
mappings = readCloudFile(options.schemaFile.get())
}
val tableRowMapper = ReadingToTableRowMapper(mappings)
val postgresMapper = ReadingToPostgresMapper(mappings)
val pubsubMessages =
pipeline
.apply("ReadPubSubMessages",
PubsubIO
.readMessagesWithAttributes()
.withIdAttribute("id_label")
.fromTopic(options.pubSubInput))
pubsubMessages
.apply("AckPubSubMessages", ParDo.of(object: DoFn<PubsubMessage, String>() {
@ProcessElement
fun processElement(context: ProcessContext) {
LOG.info("Processing readings: " + context.element().attributeMap["id_label"])
context.output("")
}
}))
val disarmedMessages =
pubsubMessages
.apply("DisarmedPubSubMessages",
DisarmPubsubMessage(tableRowMapper, postgresMapper)
)
disarmedMessages
.get(TupleTags.readingErrorTag)
.apply("LogDisarmedErrors", ParDo.of(object: DoFn<String, String>() {
@ProcessElement
fun processElement(context: ProcessContext) {
LOG.info(context.element())
context.output("")
}
}))
disarmedMessages
.get(TupleTags.tableRowTag)
.apply("WriteToBigQuery",
BigQueryIO
.writeTableRows()
.withoutValidation()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
.to(options.bigQueryOutput)
)
pipeline.run()
DisarmPubsubMessage is a PTransform that uses the FlatMapElements transform to get TableRow and ReadingsInputFlatten (our own class for Postgres).
We expect zero duplicates, or at least a "best effort" (we could add a cleaning cron job on top); we are paying for these products to run statistics and big data analysis...
[UPDATE 1]
I even appended a new simple transform that logs our unique attribute through a ParDo, which supposedly should ack the PubsubMessage, but this is not the case:
new flow with AckPubSubMessages step
Thanks!!

Looks like you are using the global window. One technique would be to window this into an N-minute window, then process the keys in the window and drop any items with duplicate keys.
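For illustration only, a minimal Java sketch of that idea (assuming a PCollection<PubsubMessage> called messages and that the "id_label" attribute carries the unique key; duplicates that land in different windows would still get through):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Window into short fixed windows so the dedup state has a bounded scope,
// then keep only one element per "id_label" value within each window.
PCollection<PubsubMessage> deduped = messages
    .apply("FixedWindows", Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply("DropDuplicateIds",
        Distinct.<PubsubMessage, String>withRepresentativeValueFn(m -> m.getAttribute("id_label"))
            .withRepresentativeType(TypeDescriptors.strings()));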

The supported programming languages are Python and Java; your code seems to be Kotlin and, as far as I know, that is not officially supported. I strongly recommend using Java to avoid unsupported features of the programming language you use.
In addition, I would recommend the following approaches to work on duplicates; option 2 could meet your need for near-real-time:
1. message_id. You have probably already read the FAQ on duplicates, which points to a deprecated doc. However, if you check the PubsubMessage object you will notice that messageId is still available, and it will be populated by the server if not set by the publisher:
"ID of this message, assigned by the server when the message is
published ... It must not be populated by the publisher in a
topics.publish call"
2. BigQuery Streaming. To catch duplicates while loading the data, you can create a UUID right before inserting into BQ. Please refer to the section Example sink: Google BigQuery.
3. Try the Dataflow template PubSubToBigQuery and validate that there are no duplicates in BQ.
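To make option 2 more concrete, here is a rough Java sketch of the pattern from that "Example sink: Google BigQuery" section, not your exact pipeline: attach a random UUID to each record and then checkpoint with Reshuffle so retried bundles keep the same id ("insert_uuid" is just a made-up column name you would dedupe on later in BigQuery):

import java.util.UUID;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

PCollection<TableRow> withStableIds = tableRows
    .apply("AttachUuid", ParDo.of(new DoFn<TableRow, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Copy the element (Beam elements must not be mutated) and tag it with a random id.
            TableRow row = c.element().clone();
            row.set("insert_uuid", UUID.randomUUID().toString());
            c.output(row);
        }
    }))
    // Reshuffle (deprecated, but still the documented trick) checkpoints the collection,
    // so a retried write reuses the UUIDs that were already generated.
    .apply("StabilizeIds", Reshuffle.viaRandomKey());

Downstream you would keep only one row per insert_uuid (for example with a periodic dedup/MERGE query), since streaming insert deduplication is best-effort.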

Related

Neo4j 3.5's embedded database does not seem to persist data

I am trying to build a small command line tool that will store data in a Neo4j graph. To do this I have started experimenting with Neo4j 3.5's embedded databases. After putting together the following example, I have found that either the nodes I am creating are not being saved to the database, or the method of database creation is overwriting my previous run.
The Example:
fun main() {
//Spin up data base
val graphDBFactory = GraphDatabaseFactory()
val graphDB = graphDBFactory.newEmbeddedDatabase(File("src/main/resources/neo4j"))
registerShutdownHook(graphDB)
val tx = graphDB.beginTx()
graphDB.createNode(Label.label("firstNode"))
graphDB.createNode(Label.label("secondNode"))
val result = graphDB.execute("MATCH (a) RETURN COUNT(a)")
println(result.resultAsString())
tx.success()
}
private fun registerShutdownHook(graphDb: GraphDatabaseService) {
// Registers a shutdown hook for the Neo4j instance so that it
// shuts down nicely when the VM exits (even if you "Ctrl-C" the
// running application).
Runtime.getRuntime().addShutdownHook(object : Thread() {
override fun run() {
graphDb.shutdown()
}
})
}
I would expect that every time I run main the resulting query count will increase by 2.
That is currently not the case and I can find nothing in the docs that references a different method of opening an already created embedded database. Am I trying to use the embedded database incorrectly or am I missing something? Any help or info would be appreciated.
build Info:
Kotlin jvm 1.4.21
Neo4j-community-3.5.35
Transactions in Neo4j 3.x have a three-stage model:
1. create
2. success / failure
3. close
You missed the third stage, which is what actually commits or rolls back.
You can use Kotlin's use, since Transaction is an AutoCloseable.
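For reference, here is the same fix sketched in Java syntax (in Kotlin you would simply wrap it as graphDB.beginTx().use { tx -> ...; tx.success() }):

import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

GraphDatabaseService graphDb =
    new GraphDatabaseFactory().newEmbeddedDatabase(new File("src/main/resources/neo4j"));
// try-with-resources guarantees close(), and close() is what actually commits
// (because success() was called) or rolls back.
try (Transaction tx = graphDb.beginTx()) {
    graphDb.createNode(Label.label("firstNode"));
    graphDb.createNode(Label.label("secondNode"));
    Result result = graphDb.execute("MATCH (a) RETURN COUNT(a)");
    System.out.println(result.resultAsString());
    tx.success(); // mark for commit; the commit happens in close()
}
graphDb.shutdown();

With the transaction properly closed, the two nodes are committed and the count should grow by 2 on every run.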

A hot key <hot-key-name> was detected in our Dataflow pipeline

We have been facing a hot key issue in our Dataflow pipeline (a streaming pipeline that batch-loads into BigQuery -- we are using batch loads for cost reasons):
We are ingesting data into the corresponding tables based on their decoder value. For example, data with the http decoder goes to the http table, and data with the ssl decoder goes to the ssl table.
So the BigQuery ingestion uses dynamic destinations.
The key is the destination table spec for the data.
An example error log:
A hot key
'key: tableSpec: ace-prod-300923:ml_dataset_us_central1.ssl tableDescription: Table for ssl shard: 1'
was detected in step
'insertTableRowsToBigQuery/BatchLoads/SinglePartitionsReshuffle/GroupByKey/ReadStream' with age of '1116.266s'.
This is a symptom of key distribution being skewed.
To fix, please inspect your data and pipeline to ensure that elements are evenly distributed across your key space.
Error is detected in this step: 'insertTableRowsToBigQuery/BatchLoads/SinglePartitionsReshuffle/GroupByKey/ReadStream'
The hot key issue comes from the nature of the data: some decoders have disproportionately many values. And our pipeline is a streaming pipeline.
We have read the documentation provided by Google but are still not sure how to fix it.
Dataflow Shuffle. Our project is already using Streaming Engine.
Rekey. Doesn't seem to apply to our case, as the key is the destination table spec. To make the ingestion work, the key has to match the existing table spec in BigQuery.
Combine.PerKey.withHotKeyFanout(). I don't know how to apply this, because the key is generated inside the insertTableRowsToBigQuery step, where we use BigQueryIO to write to BigQuery. The key comes from dynamically generating BigQuery table names based on the current window or the current value (Sharding BigQuery output tables).
Attached is the code where the hot key is detected:
toBq.apply("insertTableRowsToBigQuery",
BigQueryIO
.<DataIntoBQ>write()
.to((ValueInSingleWindow<DataIntoBQ> dataIntoBQ) -> {
try {
String decoder = dataIntoBQ.getValue().getDecoder(); // getter function
String destination = String.format("%s:%s.%s",
PROJECT_ID, DATASET, decoder);
if (!listOfProtocols.contains(decoder)) {
LOG.error("wrong bigquery table decoder destination: " + decoder);
}
return new TableDestination(
destination, // Table spec
"Table for " + decoder // Table description
);
} catch (Exception e) {
LOG.error("insertTableRowsToBigQuery error", e);
return null;
}
}
)
.withFormatFunction(
(DataIntoBQ elem) ->
new DataIntoBQ().toTableRow(elem)
)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(Duration.standardMinutes(3))
.withAutoSharding()
.withCustomGcsTempLocation(ValueProvider.StaticValueProvider.of(options.getGcpTempLocation()))
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
You can still try the rekey strategy. For example, you can apply a transformation before the apply("insertTableRowsToBigQuery") such that:
elements with the key "http" become "randomVal_http", where randomVal is a value in a specific range (say from 0 to 9). The width of the range depends on how many splits you want your elements with key "http" to be divided into. For example, if you have 10 million elements with key "http" and you want them split into 10 groups of approximately 1 million elements each, you can generate uniform random numbers between 0 and 9.
You can apply the same mapping to elements that belong to other hot keys; elements with non-hot keys don't need to be rekeyed.
Now, in your "insertTableRowsToBigQuery", you know how to go from a key like "someVal_http" back to "http": just split the string.
That should help.
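This is not code from your pipeline, just a rough sketch of the rekeying idea above (DataIntoBQ and getDecoder() come from your snippet; the withDecoder() copy method is assumed to exist):

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

// Before "insertTableRowsToBigQuery": salt only the known-hot decoders with a random shard 0..9.
PCollection<DataIntoBQ> salted = toBq.apply("SaltHotDecoders",
    MapElements.via(new SimpleFunction<DataIntoBQ, DataIntoBQ>() {
        @Override
        public DataIntoBQ apply(DataIntoBQ elem) {
            if ("ssl".equals(elem.getDecoder())) { // the hot key from the error log
                int shard = ThreadLocalRandom.current().nextInt(10);
                return elem.withDecoder(shard + "_" + elem.getDecoder()); // assumed copy method
            }
            return elem;
        }
    }));

// Inside the .to(...) lambda, strip the salt again so the table name stays "ssl":
// String decoder = dataIntoBQ.getValue().getDecoder();
// String table = decoder.contains("_") ? decoder.substring(decoder.indexOf('_') + 1) : decoder;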
Regarding Combine.PerKey.withHotKeyFanout(), I am not sure how to do this for IO transforms. If it were some intermediate transform, I could have helped.

Incremental loading and BigQuery

I'm writing an incremental loading pipeline to load data from MySQL to BigQuery, using Google Cloud Datastore as a metadata repository.
My current pipeline is written this way:
PCollection<TableRow> tbRows =
pipeline.apply("Read from MySQL",
JdbcIO.<TableRow>read().withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
.create("com.mysql.cj.jdbc.Driver", connectionConfig)
.withUsername(username)
.withPassword(password)
.withQuery(query).withCoder(TableRowJsonCoder.of())
.withRowMapper(JdbcConverters.getResultSetToTableRow())))
.setCoder(NullableCoder.of(TableRowJsonCoder.of()));
tbRows.apply("Write to BigQuery",
BigQueryIO.writeTableRows().withoutValidation()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND).to(outputTable));
tbRows.apply("Getting timestamp column",
MapElements.into(TypeDescriptors.strings())
.via((final TableRow row) -> (String) row.get(fieldName)))
.setCoder(NullableCoder.of(StringUtf8Coder.of())).apply("Max", Max.globally())
.apply("Updating Datastore", ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(final ProcessContext c) {
DatastoreConnector.update(table, c.element());
}
}));
The problem I am facing is that when the BigQuery write step fails, Datastore is still updated. Is there any way to wait for the BigQuery write to finish before updating Datastore?
Thanks!
Currently this cannot be done in the same pipeline with BigQueryIO.writeTableRows() since it produces a terminal output (PDone). I have some suggestions though.
I suspect BigQuery writes failing is a rare occurrence. In that case, can you delete the corresponding Datastore data from a secondary job/process?
Have you considered a CDC solution that is better suited for writing incremental change data? For example, see the Dataflow template here.
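Purely as a sketch of the first suggestion (nothing here comes from your code; the table and timestamp column names are made up), a secondary reconcile process could reset the Datastore watermark to whatever actually landed in BigQuery:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class WatermarkReconciler {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        String table = "my_dataset.my_table";   // made-up table name
        String tsColumn = "updated_at";         // made-up timestamp column
        QueryJobConfiguration query = QueryJobConfiguration
            .newBuilder("SELECT CAST(MAX(" + tsColumn + ") AS STRING) FROM `" + table + "`")
            .build();
        TableResult result = bigquery.query(query);
        String actualMax = result.iterateAll().iterator().next().get(0).getStringValue();
        // Push the metadata back to what is really in BigQuery, e.g. via the
        // DatastoreConnector.update(table, actualMax) helper from the question.
        System.out.println("Actual max timestamp in BigQuery: " + actualMax);
    }
}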

Changing the GemFire query ResultSender batch size

I am experiencing a performance issue related to the default batch size of the query ResultSender using client/server config. I believe the default value is 100.
If I run a simple query to get keys (with some order-by columns due to the PARTITION Region type), this default batch size causes too many chunks to be sent back, even for 1000 records. In my tests, even though the total query time is less than 100 ms, the app takes more than 10 seconds to process those chunks.
Reading between the lines in your problem statement, it seems you are:
Executing an OQL query on a PARTITION Region (PR).
Running the query inside a Function as recommended when executing queries on a PR.
Sending batch results (as opposed to streaming the results).
I also assume since you posted exclusively in the #spring-data-gemfire channel, that you are using Spring Data GemFire (SDG) to:
Execute the query (e.g. by using the SDG GemfireTemplate; of course, I suppose you could also be using the GemFire Query API inside your Function directly, too)?
Implement the server-side Function using SDG's Function annotation support?
And possibly (indirectly) use SDG's BatchingResultSender, as described in the documentation?
NOTE: The default batch size in SDG is 0, NOT 100. Zero means stream the results individually.
Regarding #2 & #3, your implementation might look something like the following:
@Component
class MyApplicationFunctions {
@GemfireFunction(id = "MyFunction", batchSize = "1000")
public List<SomeApplicationType> myFunction(FunctionContext functionContext) {
RegionFunctionContext regionFunctionContext =
(RegionFunctionContext) functionContext;
Region<?, ?> region = regionFunctionContext.getDataSet();
if (PartitionRegionHelper.isPartitionRegion(region)) {
region = PartitionRegionHelper.getLocalDataForContext(regionFunctionContext);
}
GemfireTemplate template = new GemfireTemplate(region);
String OQL = "...";
SelectResults<?> results = template.query(OQL); // or `template.find(OQL, args);`
List<SomeApplicationType> list = ...;
// process results, convert to SomeApplicationType, add to list
return list;
}
}
NOTE: Since you are most likely executing this Function "on Region", the FunctionContext type will actually be a RegionFunctionContext in this case.
The batchSize attribute on the SDG @GemfireFunction annotation (used for Function "implementations") allows you to control the batch size.
Of course, instead of using SDG's GemfireTemplate to execute queries, you can use the GemFire Query API directly, as mentioned above.
If you need even more fine-grained control over "result sending", then you can simply "inject" the ResultSender provided by GemFire into the Function, even if the Function is implemented using SDG, as shown above. For example you can do:
@Component
class MyApplicationFunctions {
@GemfireFunction(id = "MyFunction")
public void myFunction(FunctionContext functionContext, ResultSender resultSender) {
...
SelectResults<?> results = ...;
// now process the results and use the `resultSender` directly
}
}
This allows you to "send" the results however you see fit, as required by your application.
You can batch/chunk results, stream, whatever.
Although, you should be mindful of the "receiving" side in this case!
The one thing that might not be apparent to the average GemFire user is that GemFire's default ResultCollector implementation collects "all" the results first before returning them to the application. This means the receiving side does not, out of the box, support streaming or batching/chunking of the results in a way that would let them be processed immediately as the server sends them (whether streamed, batched/chunked, or otherwise).
Once again, SDG helps you out here since you can provide a custom ResultCollector on the Function "execution" (client-side), for example:
#OnRegion("SomePartitionRegion", resultCollector="myResultCollector")
interface MyApplicationFunctionExecution {
void myFunction();
}
In your Spring configuration, you would then have:
@Configuration
class ApplicationGemFireConfiguration {
@Bean
ResultCollector myResultCollector() {
return ...;
}
}
Your "custom" ResultCollector could return results as a stream, a batch/chunk at a time, etc.
In fact, I have prototyped a "streaming" ResultCollector implementation that will eventually be added to SDG, here.
Anyway, this should give you some ideas on how to handle the performance problem you seem to be experiencing. 1000 results is not a lot of data so I suspect your problem is mostly self-inflicted.
Hope this helps!
John,
Just to clarify, I use client/server topology(actually wan, but that is not important in here). My client is a spring boot web app which has kendo grid as ui. Users can filter/sort on any combination of the columns, which will be passed to the spring boot app for generating dynamic OQL and create the pagination. Till now, except for being dynamic, my OQL queries are quite straight forward. I do not want to introduce server side functions due to the complexity of our global deployment process. But I can if you think that is something I have to do.
Again, thanks for your answers.

Writing different values to different BigQuery tables in Apache Beam

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo.
How can I do this using the Apache Beam BigQueryIO API?
This is possible using a feature recently added to BigQueryIO in Apache Beam.
PCollection<Foo> foos = ...;
foos.apply(BigQueryIO.write().to(new SerializableFunction<ValueInSingleWindow<Foo>, TableDestination>() {
@Override
public TableDestination apply(ValueInSingleWindow<Foo> value) {
Foo foo = value.getValue();
// Also available: value.getWindow(), getTimestamp(), getPane()
String tableSpec = ...;
String tableDescription = ...;
return new TableDestination(tableSpec, tableDescription);
}
}).withFormatFunction(new SerializableFunction<Foo, TableRow>() {
@Override
public TableRow apply(Foo foo) {
return ...;
}
}).withSchema(...));
Depending on whether the input PCollection<Foo> is bounded or unbounded, under the hood this will either create multiple BigQuery import jobs (one or more per table, depending on the amount of data), or it will use the BigQuery streaming inserts API.
The most flexible version of the API uses DynamicDestinations, which allows you to write different values to different tables with different schemas, and even allows you to use side inputs from the rest of the pipeline in all of these computations.
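For reference, a rough sketch of the DynamicDestinations flavour (the Foo getters, project/dataset names, and schemaFor() helper are placeholders):

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;

foos.apply(BigQueryIO.<Foo>write()
    .to(new DynamicDestinations<Foo, String>() {
      @Override
      public String getDestination(ValueInSingleWindow<Foo> element) {
        // Any serializable value that identifies the target table; here a per-element "kind".
        return element.getValue().getKind(); // placeholder getter
      }
      @Override
      public TableDestination getTable(String destination) {
        return new TableDestination(
            "my-project:my_dataset." + destination, "Table for " + destination);
      }
      @Override
      public TableSchema getSchema(String destination) {
        return schemaFor(destination); // placeholder: each destination can have its own schema
      }
    })
    .withFormatFunction((Foo foo) -> new TableRow().set("kind", foo.getKind()))); // placeholder conversion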
Additionally, BigQueryIO has been refactored into a number of reusable transforms that you can combine yourself to implement more complex use cases - see the files in the source directory.
This feature will be included in the first stable release of Apache Beam and in the next release of the Dataflow SDK (which will be based on the first stable release of Apache Beam). Right now you can use this by running your pipeline against a snapshot of Beam at HEAD from GitHub.
As of Beam 2.12.0, this feature is available in the Python SDK as well. It is marked as experimental, so you will have to pass --experiments use_beam_bq_sink to enable it. You'd do something like this:
def get_table_name(element):
    if meets_some_condition(element):
        return 'mytablename1'
    else:
        return 'mytablename2'

p = beam.Pipeline(...)
my_input_pcoll = p | ReadInMyPCollection()
my_input_pcoll | beam.io.gcp.bigquery.WriteToBigQuery(table=get_table_name)
The new sink supports a number of other options, which you can review in the pydoc