Neo4j Java API Concurrency v2.0M3: Exception when iterating over relationships while other threads are creating new relationships concurrently

What I am trying to achieve here is to get the number of relationships of a particular node while other threads are concurrently adding new relationships to it. I run my code in a unit test with the
TestGraphDatabaseFactory().newImpermanentDatabase() graph service.
My code is executed by ~50 threads, and it looks something like this:
int numOfRels = 0;
try {
Iterable<Relationship> rels = parentNode.getRelationships(RelTypes.RUNS, Direction.OUTGOING);
while (rels.iterator().hasNext()) {
numOfRels++;
rels.iterator().next();
}
}
catch(Exception e) {
throw e;
}
// Enforce relationship limit
if (numOfRels > 10) {
// do something
}
Transaction tx = graph.beginTx();
try {
Node node = createMyNodeAndConnectToParentNode(...);
tx.success();
return node;
}
catch (Exception e) {
tx.failure();
}
finally {
tx.finish();
}
The problem is that once in a while I get an "ArrayIndexOutOfBoundsException: 1" in the try-catch block above (the one surrounding getRelationships()). If I understand correctly, Iterable is not thread-safe and this is causing the problem.
My question is what is the best way to iterate over constantly changing relationships and nodes using Neo4j's Java API?
I am getting the following errors:
Exception in thread "Thread-14" org.neo4j.helpers.ThisShouldNotHappenError: Developer: Stefan/Jake claims that: A property key id disappeared under our feet
at org.neo4j.kernel.impl.core.NodeProxy.setProperty(NodeProxy.java:188)
at com.inbiza.connio.neo4j.server.extensions.graph.AppEntity.createMyNodeAndConnectToParentNode(AppEntity.java:546)
at com.inbiza.connio.neo4j.server.extensions.graph.AppEntity.create(AppEntity.java:305)
at com.inbiza.connio.neo4j.server.extensions.TestEmbeddedConnioGraph$appCreatorThread.run(TestEmbeddedConnioGraph.java:61)
at java.lang.Thread.run(Thread.java:722)
Exception in thread "Thread-92" java.lang.ArrayIndexOutOfBoundsException: 1
at org.neo4j.kernel.impl.core.RelationshipIterator.fetchNextOrNull(RelationshipIterator.java:72)
at org.neo4j.kernel.impl.core.RelationshipIterator.fetchNextOrNull(RelationshipIterator.java:36)
at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:55)
at com.inbiza.connio.neo4j.server.extensions.graph.AppEntity.create(AppEntity.java:243)
at com.inbiza.connio.neo4j.server.extensions.TestEmbeddedConnioGraph$appCreatorThread.run(TestEmbeddedConnioGraph.java:61)
at java.lang.Thread.run(Thread.java:722)
Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException: 1
at org.neo4j.kernel.impl.core.RelationshipIterator.fetchNextOrNull(RelationshipIterator.java:72)
at org.neo4j.kernel.impl.core.RelationshipIterator.fetchNextOrNull(RelationshipIterator.java:36)
at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:55)
at com.inbiza.connio.neo4j.server.extensions.graph.AppEntity.create(AppEntity.java:243)
at com.inbiza.connio.neo4j.server.extensions.TestEmbeddedConnioGraph$appCreatorThread.run(TestEmbeddedConnioGraph.java:61)
at java.lang.Thread.run(Thread.java:722)
Exception in thread "Thread-93" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-90" java.lang.ArrayIndexOutOfBoundsException
Below is the method responsible for node creation:
static Node createMyNodeAndConnectToParentNode(GraphDatabaseService graph, final Node ownerAccountNode, final String suggestedName, Map properties) {
final String accountId = checkNotNull((String)ownerAccountNode.getProperty("account_id"));
Node appNode = graph.createNode();
appNode.setProperty("urn_name", App.composeUrnName(accountId, suggestedName.toLowerCase().trim()));
int nextId = nodeId.addAndGet(1); // I normally use getOrCreate idiom but to simplify I replaced it with an atomic int - that would do for testing
String urn = App.composeUrnUid(accountId, nextId);
appNode.setProperty("urn_uid", urn);
appNode.setProperty("id", nextId);
appNode.setProperty("name", suggestedName);
Index<Node> indexUid = graph.index().forNodes("EntityUrnUid");
indexUid.add(appNode, "urn_uid", urn);
appNode.addLabel(LabelTypes.App);
appNode.setProperty("version", properties.get("version"));
appNode.setProperty("description", properties.get("description"));
Relationship rel = ownerAccountNode.createRelationshipTo(appNode, RelTypes.RUNS);
rel.setProperty("date_created", fmt.print(new DateTime()));
return appNode;
}
I am looking at org.neo4j.kernel.impl.core.RelationshipIterator.fetchNextOrNull()
It looks like my test generates a condition where else if ( (status = fromNode.getMoreRelationships( nodeManager )).loaded() || lastTimeILookedThereWasMoreToLoad ) is not executed, and where currentTypeIterator state is changed in between.
RelIdIterator currentTypeIterator = rels[currentTypeIndex]; //<-- this is where it crashes
do
{
if ( currentTypeIterator.hasNext() )
...
...
while ( !currentTypeIterator.hasNext() )
{
if ( ++currentTypeIndex < rels.length )
{
currentTypeIterator = rels[currentTypeIndex];
}
else if ( (status = fromNode.getMoreRelationships( nodeManager )).loaded()
// This is here to guard for that someone else might have loaded
// stuff in this relationship chain (and exhausted it) while I
// iterated over my batch of relationships. It will only happen
// for nodes which have more than <grab size> relationships and
// isn't fully loaded when starting iterating.
|| lastTimeILookedThereWasMoreToLoad )
{
....
}
}
} while ( currentTypeIterator.hasNext() );
I also tested a couple of locking scenarios. The one below solves the issue. Based on this, I am not sure whether I should use a lock every time I iterate over relationships.
Transaction txRead = graph.beginTx();
try {
txRead.acquireReadLock(parentNode);
long numOfRels = 0L;
Iterable<Relationship> rels = parentNode.getRelationships(RelTypes.RUNS, Direction.OUTGOING);
while (rels.iterator().hasNext()) {
numOfRels++;
rels.iterator().next();
}
txRead.success();
}
finally {
txRead.finish();
}
I am very new to Neo4j and its source base; I am just testing it as a potential data store for our product. I would appreciate it if someone who knows Neo4j inside and out could explain what is going on here.
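A side note on the counting loops in the question: they call rels.iterator() on every pass. For standard Java collections each call to iterator() returns a fresh iterator positioned at the start, so the usual idiom is to obtain the iterator once, or to use a for-each loop. Whether the Iterable returned by Neo4j behaves like a standard collection here is not stated in the question, so the following is only a minimal sketch in plain Java with a List as stand-in data; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class RelationshipCountSketch {

    // Counts elements by asking the Iterable for its iterator exactly once.
    static <T> int countWithSingleIterator(Iterable<T> items) {
        int count = 0;
        Iterator<T> it = items.iterator(); // one iterator for the whole loop
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    // Same count with the enhanced for loop, which also calls iterator() once.
    static <T> int countWithForEach(Iterable<T> items) {
        int count = 0;
        for (T ignored : items) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in data; in the question this would be the relationships of parentNode.
        List<String> rels = Arrays.asList("r1", "r2", "r3");
        System.out.println(countWithSingleIterator(rels));
        System.out.println(countWithForEach(rels));
    }
}
```

Neither variant makes the iteration thread-safe; it only removes one possible source of confusion from the loop itself.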

This is a bug. The fix is captured in this pull request: https://github.com/neo4j/neo4j/pull/1011

Well, I think this is a bug. The Iterable returned by getRelationships() is meant to be immutable. When this method is called, all the relationships available up to that moment will be available in the iterator. (You can verify this from org.neo4j.kernel.IntArrayIterator.)
I tried replicating it by having 250 threads insert a relationship from a node to some other node while a main thread looped over the iterator for the first node. On careful analysis, the iterator only contains the relationships that had been added when getRelationships() was last called. The issue never came up for me.
Can you please post your complete code? IMO there might be some silly error. The reason it should not happen is that write locks are in place when adding a relationship, and reads are hence synchronized.
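Independently of the iterator exception, the question's "count, then create" sequence is a check-then-act pattern: two threads can both count relationships below the limit and then both add one. A minimal sketch of doing the check and the insert under one lock is shown below. It uses plain Java (ReentrantReadWriteLock and a list) rather than the Neo4j API, and BoundedChildren, tryAdd, count and runDemo are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class BoundedChildren {
    private final int limit;
    private final List<String> children = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    BoundedChildren(int limit) {
        this.limit = limit;
    }

    // Check the limit and add under the same write lock, so no other
    // thread can add a child between the check and the insert.
    boolean tryAdd(String child) {
        lock.writeLock().lock();
        try {
            if (children.size() >= limit) {
                return false;
            }
            children.add(child);
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers share the read lock and never observe a half-finished add.
    int count() {
        lock.readLock().lock();
        try {
            return children.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Starts the given number of threads that all try to add one child.
    static int runDemo(int threads, int limit) {
        BoundedChildren parent = new BoundedChildren(limit);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final String name = "app-" + i;
            workers[i] = new Thread(() -> parent.tryAdd(name));
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return parent.count();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(50, 10));
    }
}
```

In a database the same idea would have to be expressed with the database's own transaction and locking facilities; the sketch only illustrates why the check and the write belong in one critical section.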

Related

Error in executing parallel entity framework queries

When I try to execute two different Entity Framework queries at the same time, I get the exception below:
A second operation started on this context before a previous operation completed. Any instance members are not guaranteed to be thread safe
I know that this is because _Context is being used in parallel from multiple threads, but is there an alternative way to obtain both results in parallel? Here is my code.
try
{
using (_Context)
{
List<AllotedQuotas> allotedQuotas = new List<AllotedQuotas>();
List<Quotas> quotas = new List<Quotas>();
Thread t1 = new Thread(() =>
{
AllotedQuotas ob = new AllotedQuotas(_Context);
allotedQuotas = ob.GetAllotedQuotas(pid);
});
Thread t2 = new Thread(() =>
{
quotas = _Context._quotas.ToList();
});
t1.Start();
t2.Start();
t1.Join();
t2.Join();
var QuotasList = quotas.Join(allotedQuotas, QID => QID.ID, AID => AID.Quota_ID,
(_QutasName, _Alloted) => new Quotas
{
ID = _QutasName.ID,
Quota_Name = _QutasName.Quota_Name,
Active = _QutasName.Active
}).ToList();
return QuotasList;
}
}
catch (Exception ex)
{
throw ex.InnerException ?? ex;
}
A second operation started on this context before a previous operation completed. Any instance members are not guaranteed to be thread safe
As the error message clearly describes, your main problem here is a conflict between different threads using the same context. To solve this, you can use a mutex or a lock statement to dedicate the resource to only one thread for the required time and make the other threads wait until the resource is free. For the lock statement, you could check here. To learn about threading (basics and much more) you could check this great page.

Custom command to go back in a process instance (execution)

I have a process where I have 3 sequential user tasks (something like Task 1 -> Task 2 -> Task 3). So, to validate the Task 3, I have to validate the Task 1, then the Task 2.
My goal is to implement a workaround to go back in the execution of a process instance using a Command, as suggested in this link. The problem is that I started to implement the command but it does not work as I want. The algorithm should be something like:
Retrieve the task with the passed id
Get the process instance of this task
Get the historic tasks of the process instance
From the list of the historic tasks, deduce the previous one
Create a new task from the previous historic task
Make the execution to point to this new task
Maybe clean the task pointed before the update
So, the code of my command is like that:
public class MoveTokenCmd implements Command<Void> {
protected String fromTaskId = "20918";
public MoveTokenCmd() {
}
public Void execute(CommandContext commandContext) {
HistoricTaskInstanceEntity currentUserTaskEntity = commandContext.getHistoricTaskInstanceEntityManager()
.findHistoricTaskInstanceById(fromTaskId);
ExecutionEntity currentExecution = commandContext.getExecutionEntityManager()
.findExecutionById(currentUserTaskEntity.getExecutionId());
// Get process Instance
HistoricProcessInstanceEntity historicProcessInstanceEntity = commandContext
.getHistoricProcessInstanceEntityManager()
.findHistoricProcessInstance(currentUserTaskEntity.getProcessInstanceId());
HistoricTaskInstanceQueryImpl historicTaskInstanceQuery = new HistoricTaskInstanceQueryImpl();
historicTaskInstanceQuery.processInstanceId(historicProcessInstanceEntity.getId()).orderByExecutionId().desc();
List<HistoricTaskInstance> historicTaskInstances = commandContext.getHistoricTaskInstanceEntityManager()
.findHistoricTaskInstancesByQueryCriteria(historicTaskInstanceQuery);
int index = 0;
for (HistoricTaskInstance historicTaskInstance : historicTaskInstances) {
if (historicTaskInstance.getId().equals(currentUserTaskEntity.getId())) {
break;
}
index++;
}
if (index > 0) {
HistoricTaskInstance previousTask = historicTaskInstances.get(index - 1);
TaskEntity newTaskEntity = createTaskFromHistoricTask(previousTask, commandContext);
currentExecution.addTask(newTaskEntity);
commandContext.getTaskEntityManager().insert(newTaskEntity);
AtomicOperation.TRANSITION_CREATE_SCOPE.execute(currentExecution);
} else {
// TODO: find the last task of the previous process instance
}
// To overcome the "Task cannot be deleted because is part of a running
// process"
TaskEntity currentUserTask = commandContext.getTaskEntityManager().findTaskById(fromTaskId);
if (currentUserTask != null) {
currentUserTask.setExecutionId(null);
commandContext.getTaskEntityManager().deleteTask(currentUserTask, "jumped to another task", true);
}
return null;
}
private TaskEntity createTaskFromHistoricTask(HistoricTaskInstance historicTaskInstance,
CommandContext commandContext) {
TaskEntity newTaskEntity = new TaskEntity();
newTaskEntity.setProcessDefinitionId(historicTaskInstance.getProcessDefinitionId());
newTaskEntity.setName(historicTaskInstance.getName());
newTaskEntity.setTaskDefinitionKey(historicTaskInstance.getTaskDefinitionKey());
newTaskEntity.setProcessInstanceId(historicTaskInstance.getProcessInstanceId());
newTaskEntity.setExecutionId(historicTaskInstance.getExecutionId());
return newTaskEntity;
}
}
But the problem is that, although I can see my task is created, the execution does not point to it but still to the current one.
I had the idea to use the activity (via the object ActivityImpl) to set it to the execution but I don't know how to retrieve the activity of my new task.
Can someone help me, please?
Unless something has changed significantly in the engine, the code in the link you reference should still work (I have used it on a number of projects).
That said, when scanning your code I don't see the most important command.
Once you have the current execution, you can move the token by setting the current activity.
Like I said, the code in the referenced article used to work and still should.
Greg
Referring to the same link in your question, I would personally recommend working on the design of your process. Use an exclusive gateway to decide whether the process should end or return to the previous task. If the generation of tasks is dynamic, you can point to the same task and delete the local variable. Activiti has constructs that save you the time of implementing this yourself :).

Get broken constraints in OptaPlanner with non-reversible accumulator

I am trying to obtain the list of broken constraints from a problem instance in OptaPlanner. I am using OptaPlanner version 7.0.0.Final and Drools as the rules engine (also 7.0.0.Final). The problem is solved correctly and without any error, but when I try to obtain the broken constraints I get a NullPointerException.
As far as I have researched, this only happens when I use a Drools accumulator without a reverse operation (like max or min). Furthermore, I made a custom accumulator that is an exact copy of org.drools.core.base.accumulators.LongSumAccumulateFunction, and everything works as expected; but as soon as I change the supportsReverse() function to return false, the NullPointerException is thrown.
I have managed to reproduce this problem in one of the provided examples, CloudBalancing. This is the change to CloudBalancingHelloWorld; its only purpose is to obtain the list of broken constraints, as mentioned in this post.
public class CloudBalancingHelloWorld {
public static void main(String[] args) {
// Build the Solver
SolverFactory<CloudBalance> solverFactory = SolverFactory.createFromXmlResource(
"org/optaplanner/examples/cloudbalancing/solver/cloudBalancingSolverConfig.xml");
Solver<CloudBalance> solver = solverFactory.buildSolver();
// Load a problem with 400 computers and 1200 processes
CloudBalance unsolvedCloudBalance = new CloudBalancingGenerator().createCloudBalance(400, 1200);
// Solve the problem
CloudBalance solvedCloudBalance = solver.solve(unsolvedCloudBalance);
// Display the result
System.out.println("\nSolved cloudBalance with 400 computers and 1200 processes:\n"
+ toDisplayString(solvedCloudBalance));
//
//A Piece of code added - start
//
ScoreDirector<CloudBalance> scoreDirector = solver.getScoreDirectorFactory().buildScoreDirector();
scoreDirector.setWorkingSolution(solvedCloudBalance);
Collection<ConstraintMatchTotal> constrains = scoreDirector.getConstraintMatchTotals();
System.out.println(constrains.size());
//
//A Piece of code added - end
//
}
public static String toDisplayString(CloudBalance cloudBalance) {
StringBuilder displayString = new StringBuilder();
for (CloudProcess process : cloudBalance.getProcessList()) {
CloudComputer computer = process.getComputer();
displayString.append(" ").append(process.getLabel()).append(" -> ")
.append(computer == null ? null : computer.getLabel()).append("\n");
}
return displayString.toString();
}
}
And this is the change to the requiredCpuPowerTotal rule. Please note that I have done this only to demonstrate the problem. Basically, I have changed sum to max.
rule "requiredCpuPowerTotal"
when
$computer : CloudComputer($cpuPower : cpuPower)
accumulate(
CloudProcess(
computer == $computer,
$requiredCpuPower : requiredCpuPower);
$requiredCpuPowerTotal : max($requiredCpuPower);
(Integer) $requiredCpuPowerTotal > $cpuPower
)
then
scoreHolder.addHardConstraintMatch(kcontext, $cpuPower - (Integer) $requiredCpuPowerTotal);
end
I am really confused, because the error does not happen during the planning phase, but it does when the scoreDirector recomputes the score to obtain the broken constraints. I mean, the same calculations must have happened during the planning phase, right?
Anyway here is the stack trace
Exception in thread "main" Exception executing consequence for rule "requiredCpuPowerTotal" in org.optaplanner.examples.cloudbalancing.solver: java.lang.NullPointerException
at org.drools.core.runtime.rule.impl.DefaultConsequenceExceptionHandler.handleException(DefaultConsequenceExceptionHandler.java:39)
at org.drools.core.common.DefaultAgenda.handleException(DefaultAgenda.java:1256)
at org.drools.core.phreak.RuleExecutor.innerFireActivation(RuleExecutor.java:438)
at org.drools.core.phreak.RuleExecutor.fireActivation(RuleExecutor.java:379)
at org.drools.core.phreak.RuleExecutor.fire(RuleExecutor.java:135)
at org.drools.core.phreak.RuleExecutor.evaluateNetworkAndFire(RuleExecutor.java:88)
at org.drools.core.concurrent.AbstractRuleEvaluator.internalEvaluateAndFire(AbstractRuleEvaluator.java:34)
at org.drools.core.concurrent.SequentialRuleEvaluator.evaluateAndFire(SequentialRuleEvaluator.java:43)
at org.drools.core.common.DefaultAgenda.fireLoop(DefaultAgenda.java:1072)
at org.drools.core.common.DefaultAgenda.internalFireAllRules(DefaultAgenda.java:1019)
at org.drools.core.common.DefaultAgenda.fireAllRules(DefaultAgenda.java:1011)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.internalFireAllRules(StatefulKnowledgeSessionImpl.java:1321)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireAllRules(StatefulKnowledgeSessionImpl.java:1312)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireAllRules(StatefulKnowledgeSessionImpl.java:1296)
at org.optaplanner.core.impl.score.director.drools.DroolsScoreDirector.getConstraintMatchTotals(DroolsScoreDirector.java:134)
at org.optaplanner.examples.cloudbalancing.app.CloudBalancingHelloWorld.main(CloudBalancingHelloWorld.java:52)
Caused by: java.lang.NullPointerException
at org.drools.core.base.accumulators.JavaAccumulatorFunctionExecutor$JavaAccumulatorFunctionContext.getAccumulatedObjects(JavaAccumulatorFunctionExecutor.java:208)
at org.drools.core.reteoo.FromNodeLeftTuple.getAccumulatedObjects(FromNodeLeftTuple.java:94)
at org.drools.core.common.AgendaItem.getObjectsDeep(AgendaItem.java:78)
at org.drools.core.reteoo.RuleTerminalNodeLeftTuple.getObjectsDeep(RuleTerminalNodeLeftTuple.java:359)
at org.optaplanner.core.api.score.holder.AbstractScoreHolder.extractJustificationList(AbstractScoreHolder.java:118)
at org.optaplanner.core.api.score.holder.AbstractScoreHolder.registerConstraintMatch(AbstractScoreHolder.java:88)
at org.optaplanner.core.api.score.buildin.hardsoft.HardSoftScoreHolder.addHardConstraintMatch(HardSoftScoreHolder.java:53)
at org.optaplanner.examples.cloudbalancing.solver.Rule_requiredCpuPowerTotal1284553313.defaultConsequence(Rule_requiredCpuPowerTotal1284553313.java:14)
at org.optaplanner.examples.cloudbalancing.solver.Rule_requiredCpuPowerTotal1284553313DefaultConsequenceInvokerGenerated.evaluate(Unknown Source)
at org.optaplanner.examples.cloudbalancing.solver.Rule_requiredCpuPowerTotal1284553313DefaultConsequenceInvoker.evaluate(Unknown Source)
at org.drools.core.phreak.RuleExecutor.innerFireActivation(RuleExecutor.java:431)
... 13 more
Thank you for any help in advance.
That NPE sounds like a bug in Drools. The ConstraintMatch API should always just work. Verify that you get it against the latest master version. If so, please create a JIRA issue for this with a minimal reproducer and we'll look into it.

Handling bad messages using Kafka's Streams API

I have a basic stream processing flow which looks like
master topic -> my processing in a mapper/filter -> output topics
and I am wondering about the best way to handle "bad messages". This could potentially be things like messages that I can't deserialize properly, or perhaps the processing/filtering logic fails in some unexpected way (I have no external dependencies so there should be no transient errors of that sort).
I was considering wrapping all my processing/filtering code in a try catch and if an exception was raised then routing to an "error topic". Then I can study the message and modify it or fix my code as appropriate and then replay it on to master. If I let any exceptions propagate, the stream seems to get jammed and no more messages are picked up.
Is this approach considered best practice?
Is there a convenient Kafka streams way to handle this? I don't think there is a concept of a DLQ...
What are the alternative ways to stop Kafka jamming on a "bad message"?
What alternative error handling approaches are there?
For completeness here is my code (pseudo-ish):
class Document {
// Fields
}
class AnalysedDocument {
Document document;
String rawValue;
Exception exception;
Analysis analysis;
// All being well
AnalysedDocument(Document document, Analysis analysis) {...}
// Analysis failed
AnalysedDocument(Document document, Exception exception) {...}
// Deserialisation failed
AnalysedDocument(String rawValue, Exception exception) {...}
}
KStreamBuilder builder = new KStreamBuilder();
KStream<String, AnalysedDocument> analysedDocumentStream = builder
.stream(Serdes.String(), Serdes.String(), "master")
.mapValues(new ValueMapper<String, AnalysedDocument>() {
@Override
public AnalysedDocument apply(String rawValue) {
Document document;
try {
// Deserialise
document = ...
} catch (Exception e) {
// Deserialisation failed: keep the raw value and the cause
return new AnalysedDocument(rawValue, e);
}
try {
// Perform analysis
Analysis analysis = ...
return new AnalysedDocument(document, analysis);
} catch (Exception e) {
// Analysis failed: keep the document and the cause
return new AnalysedDocument(document, e);
}
}
});
// Branch based on whether analysis mapping failed to produce errorStream and successStream
errorStream.to(Serdes.String(), customPojoSerde(), "error");
successStream.to(Serdes.String(), customPojoSerde(), "analysed");
KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
Any help greatly appreciated.
Right now, Kafka Streams offers only limited error handling capabilities. There is work in progress to simplify this. For now, your overall approach seems to be a good way to go.
One comment about handling de/serialization errors: handling those errors manually requires you to do the de/serialization "manually". This means you need to configure ByteArraySerdes for key and value for the input/output topics of your Streams app and add a map() that does the de/serialization (i.e., KStream<byte[],byte[]> -> map() -> KStream<keyType,valueType>, or the other way round if you also want to catch serialization exceptions). Otherwise, you cannot try-catch deserialization exceptions.
With your current approach, you "only" validate that the given string represents a valid document, but the message itself could be corrupted and impossible to convert into a String in the source operator in the first place. Thus, you don't actually cover deserialization exceptions with your code. However, if you are sure a deserialization exception can never happen, your approach is sufficient, too.
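The idea of reading raw bytes and deserializing inside your own try-catch can be tried out without a broker. The following is a minimal sketch in plain Java, not Kafka Streams code: decodeLong is a hypothetical stand-in for a real deserializer (it simply insists on exactly 8 bytes), and the two lists stand in for the downstream stream and a quarantine topic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ManualDeserializationSketch {
    final List<Long> valid = new ArrayList<>();          // stands in for the "good" stream
    final List<byte[]> quarantined = new ArrayList<>();  // stands in for an error topic

    // Hypothetical stand-in for a deserializer: a long must be exactly 8 bytes.
    static long decodeLong(byte[] raw) {
        if (raw == null || raw.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes");
        }
        return ByteBuffer.wrap(raw).getLong();
    }

    // Deserialize once; route the record depending on whether that worked.
    void accept(byte[] raw) {
        try {
            valid.add(decodeLong(raw));
        } catch (IllegalArgumentException e) {
            quarantined.add(raw); // keep the original bytes for later inspection
        }
    }

    public static void main(String[] args) {
        ManualDeserializationSketch sketch = new ManualDeserializationSketch();
        sketch.accept(ByteBuffer.allocate(Long.BYTES).putLong(21L).array());
        sketch.accept("corrupt".getBytes(StandardCharsets.UTF_8)); // 7 bytes, not a long
        System.out.println(sketch.valid + " valid, " + sketch.quarantined.size() + " quarantined");
    }
}
```

Because the source operator only hands over byte arrays, nothing can fail before your own code sees the record, which is the point of configuring byte-array serdes.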
Update
This issue is tackled via KIP-161 and will be included in the next release, 1.0.0. It allows you to register a callback via the parameter default.deserialization.exception.handler. The handler is invoked every time an exception occurs during deserialization and lets you return a DeserializationHandlerResponse (CONTINUE -> drop the record and move on, or FAIL, which is the default).
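As a minimal sketch of what registering such a handler could look like, the snippet below only builds the configuration as java.util.Properties, so it runs without Kafka on the classpath. The application id and bootstrap server are hypothetical placeholder values; the handler class name is the log-and-continue handler shipped with Kafka Streams 1.0 and later, given here as a string.

```java
import java.util.Properties;

final class SkipCorruptRecordsConfig {

    static Properties streamsProperties() {
        Properties props = new Properties();
        // Hypothetical values, for illustration only.
        props.put("application.id", "bad-message-demo");
        props.put("bootstrap.servers", "localhost:9092");
        // Config key as named in the text. The value is the class name of the
        // handler that logs the failing record and continues with the next one.
        props.put("default.deserialization.exception.handler",
                "org.apache.kafka.streams.errors.LogAndContinueExceptionHandler");
        return props;
    }

    public static void main(String[] args) {
        streamsProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would then be passed to the KafkaStreams constructor in place of the config object used in the question.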
Update 2
With KIP-210 (part of Kafka 1.1) it's also possible to handle errors on the producer side, similar to the consumer part, by registering a ProductionExceptionHandler via the config default.production.exception.handler that can return CONTINUE.
Update Mar 23, 2018: Kafka 1.0 provides much better and easier handling for bad error messages ("poison pills") via KIP-161 than what I described below. See default.deserialization.exception.handler in the Kafka 1.0 docs.
This could potentially be things like messages that I can't deserialize properly [...]
Ok, my answer here focuses on the (de)serialization issues as this might be the most tricky scenario to handle for most users.
[...] or perhaps the processing/filtering logic fails in some unexpected way (I have no external dependencies so there should be no transient errors of that sort).
The same thinking (for deserialization) can also be applied to failures in the processing logic. Here, most people tend to gravitate towards option 2 below (minus the deserialization part), but YMMV.
I was considering wrapping all my processing/filtering code in a try catch and if an exception was raised then routing to an "error topic". Then I can study the message and modify it or fix my code as appropriate and then replay it on to master. If I let any exceptions propagate, the stream seems to get jammed and no more messages are picked up.
Is this approach considered best practice?
Yes, at the moment this is the way to go. Essentially, the two most common patterns are (1) skipping corrupted messages or (2) sending corrupted records to a quarantine topic aka a dead letter queue.
Is there a convenient Kafka streams way to handle this? I don't think there is a concept of a DLQ...
Yes, there is a way to handle this, including the use of a dead letter queue. However, it's (at least IMHO) not that convenient yet. If you have any feedback on how the API should allow you to handle this -- e.g. via a new or updated method, a configuration setting ("if serialization/deserialization fails send the problematic record to THIS quarantine topic") -- please let us know. :-)
What are the alternative ways to stop Kafka jamming on a "bad message"?
What alternative error handling approaches are there?
See my examples below.
FWIW, the Kafka community is also discussing the addition of a new CLI tool that allows you to skip over corrupted messages. However, as a user of the Kafka Streams API, I think ideally you want to handle such scenarios directly in your code, and fallback to CLI utilities only as a last resort.
Here are some patterns for the Kafka Streams DSL to handle corrupted records/messages aka "poison pills". This is taken from http://docs.confluent.io/current/streams/faq.html#handling-corrupted-records-and-deserialization-errors-poison-pill-messages
Option 1: Skip corrupted records with flatMap
This is arguably what most users would like to do.
We use flatMap because it allows you to output zero, one, or more output records per input record. In the case of a corrupted record we output nothing (zero records), thereby ignoring/skipping the corrupted record.
Benefit of this approach compared to the other ones listed here: We need to manually deserialize a record only once!
Drawback of this approach: flatMap "marks" the input stream for potential data re-partitioning, i.e. if you perform a key-based operation such as groupings (groupBy/groupByKey) or joins afterwards, your data will be re-partitioned behind the scenes. Since this might be a costly step we don't want that to happen unnecessarily. If you KNOW that the record keys are always valid OR that you don't need to operate on the keys (thus keeping them as "raw" keys in byte[] format), you can change from flatMap to flatMapValues, which will not result in data re-partitioning even if you join/group/aggregate the stream later.
Code example:
Serde<byte[]> bytesSerde = Serdes.ByteArray();
Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();
// Input topic, which might contain corrupted messages
KStream<byte[], byte[]> input = builder.stream(bytesSerde, bytesSerde, inputTopic);
// Note how the returned stream is of type KStream<String, Long>,
// rather than KStream<byte[], byte[]>.
KStream<String, Long> doubled = input.flatMap(
(k, v) -> {
try {
// Attempt deserialization
String key = stringSerde.deserializer().deserialize(inputTopic, k);
long value = longSerde.deserializer().deserialize(inputTopic, v);
// Ok, the record is valid (not corrupted). Let's take the
// opportunity to also process the record in some way so that
// we haven't paid the deserialization cost just for "poison pill"
// checking.
return Collections.singletonList(KeyValue.pair(key, 2 * value));
}
catch (SerializationException e) {
// log + ignore/skip the corrupted message
System.err.println("Could not deserialize record: " + e.getMessage());
}
return Collections.emptyList();
}
);
Option 2: dead letter queue with branch
Compared to option 1 (which ignores corrupted records) option 2 retains corrupted messages by filtering them out of the "main" input stream and writing them to a quarantine topic (think: dead letter queue). The drawback is that, for valid records, we must pay the manual deserialization cost twice.
KStream<byte[], byte[]> input = ...;
KStream<byte[], byte[]>[] partitioned = input.branch(
(k, v) -> {
boolean isValidRecord = false;
try {
stringSerde.deserializer().deserialize(inputTopic, k);
longSerde.deserializer().deserialize(inputTopic, v);
isValidRecord = true;
}
catch (SerializationException ignored) {}
return isValidRecord;
},
(k, v) -> true
);
// partitioned[0] is the KStream<byte[], byte[]> that contains
// only valid records. partitioned[1] contains only corrupted
// records and thus acts as a "dead letter queue".
KStream<String, Long> doubled = partitioned[0].map(
(key, value) -> KeyValue.pair(
// Must deserialize a second time unfortunately.
stringSerde.deserializer().deserialize(inputTopic, key),
2 * longSerde.deserializer().deserialize(inputTopic, value)));
// Don't forget to actually write the dead letter queue back to Kafka!
partitioned[1].to(Serdes.ByteArray(), Serdes.ByteArray(), "quarantine-topic");
Option 3: Skip corrupted records with filter
I only mention this for completeness. This option looks like a mix of options 1 and 2, but is worse than either of them. Compared to option 1, you must pay the manual deserialization cost for valid records twice (bad!). Compared to option 2, you lose the ability to retain corrupted records in a dead letter queue.
KStream<byte[], byte[]> validRecordsOnly = input.filter(
(k, v) -> {
boolean isValidRecord = false;
try {
stringSerde.deserializer().deserialize(inputTopic, k);
longSerde.deserializer().deserialize(inputTopic, v);
isValidRecord = true;
}
catch (SerializationException e) {
// log + ignore/skip the corrupted message
System.err.println("Could not deserialize record: " + e.getMessage());
}
return isValidRecord;
}
);
KStream<String, Long> doubled = validRecordsOnly.map(
(key, value) -> KeyValue.pair(
// Must deserialize a second time unfortunately.
stringSerde.deserializer().deserialize(inputTopic, key),
2 * longSerde.deserializer().deserialize(inputTopic, value)));
Any help greatly appreciated.
I hope I could help. If yes, I'd appreciate your feedback on how we could improve the Kafka Streams API to handle failures/exceptions in a better/more convenient way than today. :-)
For the processing logic you could take this approach:
someKStream
    .mapValues(inputValue -> {
        // each execution of the "return" below could yield a different class than the previous run,
        // e.g. "return isFailedProcessing ? failValue : successValue;"
        // where failValue and successValue have unrelated classes
        return someObject; // the class of someObject varies at runtime depending on your business logic
    }) // here you'll have KStream<whateverKeyClass, Object> -> yes, Object for the value!
    // you could use different logic for choosing
    // the target topic; the below is just an example
    .to((k, v, recordContext) -> v instanceof failValueClass ?
            "dead-letter-topic" : "success-topic",
        // you could completely ignore the "Produced" part
        // and rely on spring-boot properties only, e.g.
        // spring.kafka.streams.properties.default.key.serde=yourKeySerde
        // spring.kafka.streams.properties.default.value.serde=org.springframework.kafka.support.serializer.JsonSerde
        Produced.with(yourKeySerde,
            // JsonSerde could be an instance configured as you need
            // (with type mappings or headers setting disabled, etc)
            new JsonSerde<>()));
Your classes, though different and landing into different topics, will serialize as expected.
If, instead of calling to(), you want to continue with further processing, you can use branch() and split the logic by the class of the Kafka value. The trick is to have branch() return KStream<keyClass, ?>[] so that each array element can then be cast to the appropriate class.
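The routing idea can be tried out without Kafka. The following is a stdlib-only sketch, not Kafka Streams code: SuccessValue, FailValue and process are hypothetical names, and topicFor only mirrors the shape of the topic-choosing lambda passed to to(...) above.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RouteByClassSketch {
    // Hypothetical, unrelated value types standing in for successValue/failValue.
    static final class SuccessValue {
        final String payload;
        SuccessValue(String payload) { this.payload = payload; }
    }

    static final class FailValue {
        final String reason;
        FailValue(String reason) { this.reason = reason; }
    }

    // Same decision as the lambda given to to(...): pick a topic by runtime class.
    static String topicFor(Object value) {
        return value instanceof FailValue ? "dead-letter-topic" : "success-topic";
    }

    // Processing step that may return either type, like the mapValues() step.
    static Object process(String input) {
        try {
            return new SuccessValue(String.valueOf(Integer.parseInt(input) * 2));
        } catch (NumberFormatException e) {
            return new FailValue(e.getMessage());
        }
    }

    // Groups the processed values by the topic they would be sent to.
    static Map<String, List<Object>> route(List<String> inputs) {
        return inputs.stream()
                .map(RouteByClassSketch::process)
                .collect(Collectors.groupingBy(RouteByClassSketch::topicFor));
    }

    public static void main(String[] args) {
        Map<String, List<Object>> routed = route(List.of("1", "x", "3"));
        routed.forEach((topic, values) -> System.out.println(topic + ": " + values.size()));
    }
}
```

Because the stream's value type is Object at this point, every consumer of a branch has to cast back to the concrete class, which is the same cast needed after branch().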
If you want to send an exception (custom exception) to another topic (ERROR_TOPIC_NAME):
@Bean
public KStream<String, ?> kafkaStreamInput(StreamsBuilder kStreamBuilder) {
    KStream<String, InputModel> input = kStreamBuilder.stream(INPUT_TOPIC_NAME);
    return service.messageHandler(input);
}
public KStream<String, ?> messageHandler(KStream<String, InputModel> inputTopic) {
    KStream<String, Object> output;
    output = inputTopic.mapValues(v -> {
        try {
            //return InputModel
            return normalMethod(v);
        } catch (Exception e) {
            //return ErrorModel
            return errorHandler(e);
        }
    });
    output.filter((k, v) -> (v instanceof ErrorModel)).to(KafkaStreamsConfig.ERROR_TOPIC_NAME);
    output.filter((k, v) -> (v instanceof InputModel)).to(KafkaStreamsConfig.OUTPUT_TOPIC_NAME);
    return output;
}
If you want to handle Kafka exceptions and skip the failed records:
@Autowired
public ConsumerErrorHandler(
        KafkaProducer<String, ErrorModel> dlqProducer) {
    this.dlqProducer = dlqProducer;
}
@Bean
ConcurrentKafkaListenerContainerFactory<?, ?> kafkaListenerContainerFactory(
        ConcurrentKafkaListenerContainerFactoryConfigurer configurer,
        ObjectProvider<ConsumerFactory<Object, Object>> kafkaConsumerFactory) {
    ConcurrentKafkaListenerContainerFactory<Object, Object> factory = new ConcurrentKafkaListenerContainerFactory<>();
    configurer.configure(factory, kafkaConsumerFactory.getIfAvailable());
    factory.setErrorHandler(((exception, data) -> {
        // Assumption: message(...) takes the error text; the original call had no argument.
        ErrorModel errorModel = ErrorModel.builder().message(exception.getMessage())
                .status("500").build();
        assert data != null;
        dlqProducer.send(new ProducerRecord<>(DLQ_TOPIC, data.key().toString(), errorModel));
    }));
    return factory;
}
All of the above answers, although valid and useful, assume that your streams topology is stateless. For example, going back to the original example:
master topic -> my processing in a mapper/filter -> output topics
"My processing in a mapper/filter" should be stateless, i.e. not re-partitioning (writing to a persistent re-partition topic) or doing a toTable() (writing to a changelog topic). If the processing fails further down the topology and you commit the transaction (by following any of the three options mentioned above: flatMap, branch or filter), then you have to cater for eventually deleting that inconsistent state, either manually or programmatically. That means writing extra custom code to automate this.
I would personally expect Streams to also give you a LogAndSkip option for any unhandled runtime exception, not only for deserialization and production ones.
Has anyone any ideas on this?
I don't believe these examples work at all when working with Avro.
When the schema can't be resolved (for example, when a bad or non-Avro message is corrupting the topic), there is no key or value to deserialize in the first place, because by the time the DSL .branch() code is called, the exception has already been thrown (or handled).
Can anyone confirm whether this is indeed the case? Is the very fluent approach you refer to here not possible when working with Avro?
KIP-161 does explain how to use a handler, however, it's much more fluent to see it as part of the topology.
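For reference, the handler route from KIP-161 is pure configuration, so it applies before any DSL code such as branch() sees the record. A minimal sketch, assuming placeholder values for the application id and bootstrap servers; the config key and the LogAndContinueExceptionHandler class name are the ones introduced by KIP-161, though newer Kafka releases may use a different key name, so check the documentation for your version.

```java
import java.util.Properties;

class SkipBadRecordsConfigSketch {
    // Builds the properties you would pass to new KafkaStreams(topology, props).
    static Properties streamsConfig() {
        Properties props = new Properties();
        // Placeholder values, assumptions for illustration only.
        props.put("application.id", "my-streams-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Log and skip records whose key or value cannot be deserialized,
        // instead of failing the stream thread.
        props.put("default.deserialization.exception.handler",
                "org.apache.kafka.streams.errors.LogAndContinueExceptionHandler");
        return props;
    }

    public static void main(String[] args) {
        streamsConfig().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The trade-off raised in the question remains: the skip decision lives in configuration rather than in the topology, so it is not visible when reading the DSL code.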

DADiskEject causing problems with error code 12 (kDAReturnUnsupported)

I am trying to eject external USB drives and disk images after they have been unmounted, in the following callback function:
void __unmountCallback(DADiskRef disk, DADissenterRef dissenter, void *context)
{
    ...
    if (!dissenter)
    {
        DADiskEject(disk,
                    kDADiskEjectOptionDefault,
                    __ejectCallback,
                    NULL);
    }
}
Unfortunately I get an error in __ejectCallback...
void __ejectCallback(DADiskRef disk, DADissenterRef dissenter, void *context)
{
    if (dissenter)
    {
        DAReturn status = DADissenterGetStatus(dissenter);
        if (unix_err(status))
        {
            int code = err_get_code(status);
            ...
        }
    }
}
The error code is 12 meaning kDAReturnUnsupported. I don't really know what is going wrong. Can anyone please comment on this? Does this mean disk images can not be ejected???
Many thanks in advance!!
The documentation is pretty unclear on this. Therefore, it's a good idea to look into the actual source code of the DARequest class to find out what causes the kDAReturnUnsupported response.
It reveals the following conditions that return a kDAReturnUnsupported response:
Does your DADisk instance represent the entire volume or not?
if ( DADiskGetDescription(disk, kDADiskDescriptionMediaWholeKey) == NULL )
{
    status = kDAReturnUnsupported;
}
if ( DADiskGetDescription(disk, kDADiskDescriptionMediaWholeKey) == kCFBooleanFalse )
{
    status = kDAReturnUnsupported;
}
Looking into the IO Kit documentation (for which DiskArbitration.framework is a wrapper), we find that kDADiskDescriptionMediaWholeKey describes whether the media is whole or not (that is, whether it represents the whole disk or a partition on it), so check that you're ejecting the entire disk and not a partition. Remember, you can unmount a partition, but you can't eject it (that wouldn't make sense).
Is the disk mountable?
Another condition in DARequest.c is whether the volume is mountable or not, so make sure it is:
if (DADiskGetDescription(disk, kDADiskDescriptionVolumeMountableKey) == kCFBooleanFalse )
{
    status = kDAReturnUnsupported;
}
Is the DADisk instance's name valid?
A third check validates the volume's name. Some system-provided (internal) volumes don't have a name and can't be ejected. The check only looks for the presence of any name, so this shouldn't be a big deal.
if (DARequestGetArgument2(request) == NULL)
{
    status = kDAReturnUnsupported;
}
Go through these three checks and see if they apply to you. This way you're bound to find out what's wrong.